R

What is the explicit functional form of the regression equation when using the cbind() function in R to do a logistic regression on a $2 \times 2$ table?

  • February 2, 2017

Suppose I have a $ 2 \times 2 $ table that looks like this:

           Disease       No Disease
Treatment         55                67
Control           42                34

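I want to run a logistic regression on this table in R. I understand that the standard approach is to use the glm function with the cbind function in the response. In other words, the code looks like this: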

glm(formula = cbind(c(55,67),c(42,34)) ~ as.factor(c(1, 0)), family = binomial())

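I would like to know why R requires us to use the cbind function, and why using the proportions alone is not sufficient. Is there a way to write this out explicitly as a formula? Would it look like this: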

$$ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X $$

where $ X = 1 $ if we have treatment and $ X = 0 $ for control?

As it stands, it looks as though I am regressing a matrix of dependent values on the predictor.

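First, I will show how to specify the formula using aggregated data with proportions and weights. Then I will show how to specify the formula after expanding the data into individual observations.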

The documentation for glm states:

“For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes”
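The role of the weights, and the reason a bare proportion is not enough, can be seen from the likelihood being maximised. As a sketch, using notation introduced here rather than taken from the documentation ($i$ indexes the two rows of the table, $n_i$ is the number of trials, $s_i$ the number of successes, $y_i = s_i / n_i$ the observed proportion, and $X_i$ the treatment indicator), the model is

$$ s_i \sim \text{Binomial}(n_i, p_i), \qquad \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_i $$

and, up to a constant that does not involve the coefficients, its log-likelihood can be written in two equivalent ways:

$$ \ell(\beta_0, \beta_1) = \sum_i \left[ s_i \log p_i + (n_i - s_i) \log(1 - p_i) \right] = \sum_i n_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] $$

The first form uses the successes and failures that a two-column cbind response supplies; the second uses a proportion $y_i$ together with the weight $n_i$. A proportion on its own does not say how many trials stand behind it, so the precision of the estimates could not be determined from it.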

I create new columns total and proportion_disease in df for the “number of trials” and the “proportion of successes”, respectively.

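library(dplyr)

# One row per group; counts come from the 2 x 2 table in the question
df <- tibble(treatment_status = c("treatment", "no_treatment"),
             disease = c(55, 42),
             no_disease = c(67, 34)) %>%
  mutate(total = no_disease + disease,           # number of trials
         proportion_disease = disease / total)   # proportion of successes

# The response is a proportion; weights give the number of trials behind it
model_weighted <- glm(proportion_disease ~ treatment_status, data = df, family = binomial("logit"), weights = total)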

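The weighted approach above takes aggregated data and gives the same solution as the cbind approach, while letting you specify a formula. (The following is equivalent to the original poster's approach, except that it uses cbind(c(55,42), c(67,34)) rather than cbind(c(55,67),c(42,34)), so that “Disease” rather than “Treatment” is the response variable.)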

model_cbinded <- glm(cbind(disease, no_disease) ~ treatment_status, data = df, family = binomial("logit"))  
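The agreement between the two specifications can be checked directly. The following is a minimal sketch that rebuilds the same counts in a base R data.frame (a stand-in for the tibble, so that it runs without dplyr) and compares the fitted coefficients and standard errors. The names df_base, fit_weighted and fit_cbinded are introduced here for illustration only.

```r
# Same counts as the 2 x 2 table, in a plain data.frame (no dplyr needed).
# df_base is a name introduced for this sketch only.
df_base <- data.frame(
  treatment_status = c("treatment", "no_treatment"),
  disease          = c(55, 42),
  no_disease       = c(67, 34)
)
df_base$total <- df_base$disease + df_base$no_disease
df_base$proportion_disease <- df_base$disease / df_base$total

# Proportion response plus prior weights (number of trials per row)
fit_weighted <- glm(proportion_disease ~ treatment_status, data = df_base,
                    family = binomial("logit"), weights = total)

# Two-column response: successes in column 1, failures in column 2
fit_cbinded <- glm(cbind(disease, no_disease) ~ treatment_status, data = df_base,
                   family = binomial("logit"))

# Compare point estimates and standard errors of the two fits
same_coef <- isTRUE(all.equal(coef(fit_weighted), coef(fit_cbinded)))
same_se   <- isTRUE(all.equal(sqrt(diag(vcov(fit_weighted))),
                              sqrt(diag(vcov(fit_cbinded)))))
print(c(same_coef = same_coef, same_se = same_se))
```

Both comparisons should come out TRUE, since the two calls describe the same binomial likelihood.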

You can also expand the data into individual observations and pass those to glm (which also lets you specify a formula).

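# Four template rows, one per (disease_status, treatment_status) cell,
# each repeated according to its cell count: 55, 42, 67, 34
df_expanded <- tibble(disease_status = c(1, 1, 0, 0),
                      treatment_status = rep(c("treatment", "control"), 2)) %>%
  .[c(rep(1, 55), rep(2, 42), rep(3, 67), rep(4, 34)), ]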

model_expanded <- glm(disease_status ~ treatment_status, data = df_expanded, family = binomial("logit"))
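The row-indexing step repeats each of the four template rows by its cell count. As a sketch of the same idea in base R (the names cells, expanded and xtab are introduced here for illustration, not taken from the answer), the expansion can be written with rep() and then cross-tabulated to confirm that it reproduces the original table.

```r
# One row per cell of the 2 x 2 table, with its count in column n.
# All object names here are hypothetical, chosen for this sketch.
cells <- data.frame(
  treatment_status = c("treatment", "control", "treatment", "control"),
  disease_status   = c(1, 1, 0, 0),
  n                = c(55, 42, 67, 34)
)

# Repeat row i exactly n[i] times, keeping only the two modelling columns
expanded <- cells[rep(seq_len(nrow(cells)), times = cells$n),
                  c("treatment_status", "disease_status")]
rownames(expanded) <- NULL

# Cross-tabulate to recover the cell counts from the expanded rows
xtab <- table(treatment = expanded$treatment_status,
              disease   = expanded$disease_status)
print(xtab)
```

The expanded data has one row per subject, which is why the degrees of freedom reported for the expanded model are based on the number of individuals rather than on the two aggregated rows.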

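Now let's compare the models by passing each one to summary. model_weighted and model_cbinded both produce exactly the same results. model_expanded produces the same coefficients and standard errors, but reports different degrees of freedom, deviance, AIC, etc. (corresponding to the number of rows/observations).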

   > lapply(list(model_weighted, model_cbinded, model_expanded), summary)
[[1]]

Call:
glm(formula = proportion_disease ~ treatment_status, family = binomial("logit"), 
   data = df, weights = total)

Deviance Residuals: 
[1]  0  0

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

   Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2


[[2]]

Call:
glm(formula = cbind(disease, no_disease) ~ treatment_status, 
   family = binomial("logit"), data = df)

Deviance Residuals: 
[1]  0  0

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

   Null deviance: 1.9451e+00  on 1  degrees of freedom
Residual deviance: 1.0658e-14  on 0  degrees of freedom
AIC: 14.028

Number of Fisher Scoring iterations: 2


[[3]]

Call:
glm(formula = disease_status ~ treatment_status, family = binomial("logit"), 
   data = df_expanded)

Deviance Residuals: 
  Min      1Q  Median      3Q     Max  
-1.268  -1.095  -1.095   1.262   1.262  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                 0.2113     0.2307   0.916    0.360
treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

(Dispersion parameter for binomial family taken to be 1)

   Null deviance: 274.41  on 197  degrees of freedom
Residual deviance: 272.46  on 196  degrees of freedom
AIC: 276.46

Number of Fisher Scoring iterations: 3
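Because the model has a single binary predictor, the estimates in all three summaries can also be reproduced by hand from the table. The intercept is the log odds of disease in the control group, the slope is the log odds ratio of treatment versus control, and the standard errors follow from the usual large-sample formula based on reciprocal cell counts. A small sketch in base R, with variable names introduced here for illustration:

```r
# Cell counts from the 2 x 2 table (variable names are hypothetical)
trt_dis <- 55; trt_no <- 67   # treatment: disease, no disease
ctl_dis <- 42; ctl_no <- 34   # control:   disease, no disease

# Intercept: log odds of disease in the control (reference) group
b0 <- log(ctl_dis / ctl_no)

# Slope: log odds ratio, treatment versus control
b1 <- log(trt_dis / trt_no) - b0

# Large-sample standard errors from reciprocal cell counts
se_b0 <- sqrt(1 / ctl_dis + 1 / ctl_no)
se_b1 <- sqrt(1 / trt_dis + 1 / trt_no + 1 / ctl_dis + 1 / ctl_no)

print(round(c(b0 = b0, b1 = b1, se_b0 = se_b0, se_b1 = se_b1), 4))
```

Rounded to four decimals, these values agree with the Estimate and Std. Error columns reported by summary for each of the three models.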

(For a discussion of the weights argument of glm in a regression context, see R-bloggers.)

Source: https://stats.stackexchange.com/questions/259502
