R
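When using the cbind() function in R for a logistic regression on a $2 \times 2$ table, what is the explicit functional form of the regression equation?
Suppose I have a $2 \times 2$ table that looks like: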

                 Disease   No Disease
    Treatment         55           67
    Control           42           34

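I want to run a logistic regression on this table in R. I know the standard approach is to use the `glm` function with the `cbind` function in the response. In other words, the code looks like this:

    glm(formula = cbind(c(55,67),c(42,34)) ~ as.factor(c(1, 0)), family = binomial())
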
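I would like to know why R requires us to use the `cbind` function, and why using only the proportions is not enough. Is there a way to write this out explicitly as a formula? Would it look like

$$ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X $$

where $X = 1$ for treatment and $X = 0$ for control?
As it stands, it looks as though I am regressing on a matrix of dependent values.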
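First, I will show how to specify the formula using aggregated data with proportions and weights. Then I will show how to specify the formula after expanding the data into individual observations.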
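The documentation for `glm` states: "For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes".
I create the new columns `total` and `proportion_disease` in `df` for the "number of trials" and the "proportion of successes", respectively.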
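
    library(dplyr)

    # One row per group, with the cell counts from the 2x2 table
    df <- tibble(treatment_status = c("treatment", "no_treatment"),
                 disease = c(55, 42),
                 no_disease = c(67, 34)) %>%
      mutate(total = no_disease + disease,             # number of trials
             proportion_disease = disease / total)     # proportion of successes

    # Response is a proportion; the weights supply the number of trials
    model_weighted <- glm(proportion_disease ~ treatment_status,
                          data = df,
                          family = binomial("logit"),
                          weights = total)
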
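To see why the proportions need the weights alongside them, it may help to write out the log-likelihood that this kind of fit maximizes. This is a sketch in notation introduced here (it is not in the original post): $y_i$ is `proportion_disease` for group $i$, $n_i$ is `total`, and $p_i$ is the modelled probability of disease.

$$ \ell(\beta_0, \beta_1) = \sum_{i} n_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] + \text{const}, \qquad \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_i $$

Each group's contribution is scaled by $n_i$, so a proportion of $55/122$ carries more information than the same proportion observed in a handful of trials. Without the $n_i$ the standard errors cannot be recovered, which is why `glm` needs either the weights or the two-column `cbind` response.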
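The weighted approach above takes the aggregated data and gives the same solution as the `cbind` method, while letting you specify a formula. (The following is equivalent to the original poster's method, but with `cbind(c(55,42), c(67,34))` rather than `cbind(c(55,67),c(42,34))`, so that "disease" rather than "treatment" is the response variable.)

    model_cbinded <- glm(cbind(disease, no_disease) ~ treatment_status,
                         data = df,
                         family = binomial("logit"))
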
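Written out explicitly, the `cbind` response corresponds to a binomial model for the counts, not a regression on a matrix. As a sketch, with $Y_i$ the number of diseased subjects and $n_i$ the row total in group $i$ (symbols assumed here for illustration):

$$ Y_i \sim \text{Binomial}(n_i, p_i), \qquad \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_i, \qquad X_i = \begin{cases} 1 & \text{treatment} \\ 0 & \text{control} \end{cases} $$

With two groups and two parameters the model is saturated, so the maximum-likelihood estimates are simply the observed log-odds:

$$ \hat\beta_0 = \log\frac{42}{34} \approx 0.2113, \qquad \hat\beta_1 = \log\frac{55}{67} - \log\frac{42}{34} \approx -0.4087 $$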
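You can also expand the data into individual observations and pass those to `glm` (which also lets you specify a formula).

    # Four distinct (disease, treatment) rows, each repeated by its cell count
    df_expanded <- tibble(disease_status = c(1, 1, 0, 0),
                          treatment_status = rep(c("treatment", "control"), 2)) %>%
      .[c(rep(1, 55), rep(2, 42), rep(3, 67), rep(4, 34)), ]

    model_expanded <- glm(disease_status ~ treatment_status,
                          data = df_expanded,
                          family = binomial("logit"))
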
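A quick sanity check on the expansion step is to cross-tabulate the expanded rows and confirm that the original cell counts come back. The sketch below uses base R only; the names `cells`, `expanded`, and `counts` are hypothetical and chosen for illustration.

```r
# Cell counts from the 2x2 table, one row per (disease, treatment) combination
cells <- data.frame(
  disease_status   = c(1, 1, 0, 0),
  treatment_status = c("treatment", "control", "treatment", "control"),
  count            = c(55, 42, 67, 34)
)

# Repeat each row `count` times to get one row per subject
expanded <- cells[rep(seq_len(nrow(cells)), times = cells$count),
                  c("disease_status", "treatment_status")]

# Cross-tabulating should recover the original table
counts <- table(expanded$treatment_status, expanded$disease_status)
print(counts)
```

The row total of 198 is what drives the 197 and 196 degrees of freedom reported for the expanded model.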
Now let's compare the models by passing each one to `summary`. `model_weighted` and `model_cbinded` produce exactly the same results. `model_expanded` produces the same coefficients and standard errors, but reports different degrees of freedom, deviance, AIC, etc. (corresponding to the number of rows/observations).

    > lapply(list(model_weighted, model_cbinded, model_expanded), summary)
    [[1]]

    Call:
    glm(formula = proportion_disease ~ treatment_status, family = binomial("logit"),
        data = df, weights = total)

    Deviance Residuals:
    [1]  0  0

    Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
    (Intercept)                 0.2113     0.2307   0.916    0.360
    treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1.9451e+00  on 1  degrees of freedom
    Residual deviance: 1.0658e-14  on 0  degrees of freedom
    AIC: 14.028

    Number of Fisher Scoring iterations: 2


    [[2]]

    Call:
    glm(formula = cbind(disease, no_disease) ~ treatment_status,
        family = binomial("logit"), data = df)

    Deviance Residuals:
    [1]  0  0

    Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
    (Intercept)                 0.2113     0.2307   0.916    0.360
    treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1.9451e+00  on 1  degrees of freedom
    Residual deviance: 1.0658e-14  on 0  degrees of freedom
    AIC: 14.028

    Number of Fisher Scoring iterations: 2


    [[3]]

    Call:
    glm(formula = disease_status ~ treatment_status, family = binomial("logit"),
        data = df_expanded)

    Deviance Residuals:
       Min      1Q  Median      3Q     Max
    -1.268  -1.095  -1.095   1.262   1.262

    Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
    (Intercept)                 0.2113     0.2307   0.916    0.360
    treatment_statustreatment  -0.4087     0.2938  -1.391    0.164

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 274.41  on 197  degrees of freedom
    Residual deviance: 272.46  on 196  degrees of freedom
    AIC: 276.46

    Number of Fisher Scoring iterations: 3

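The shared coefficients and standard errors can also be reproduced by hand from the four cell counts, which makes the functional form concrete. The following is a minimal sketch in base R, assuming the usual closed-form results for a saturated two-group logit model; the variable names (`beta0`, `beta1`, `se_beta0`, `se_beta1`, `x`, `fit`) are introduced here for illustration.

```r
# Cell counts from the 2x2 table in the question
disease    <- c(treatment = 55, control = 42)
no_disease <- c(treatment = 67, control = 34)

# Closed-form estimates: intercept is the control log-odds,
# slope is the log odds ratio of treatment versus control
beta0 <- log(disease["control"] / no_disease["control"])
beta1 <- log(disease["treatment"] / no_disease["treatment"]) - beta0

# Standard errors from sums of reciprocal cell counts
se_beta0 <- sqrt(1 / disease["control"] + 1 / no_disease["control"])
se_beta1 <- sqrt(sum(1 / c(disease, no_disease)))

# The same model fitted with glm(); x = 1 for treatment, 0 for control
x <- c(1, 0)
fit <- glm(cbind(disease, no_disease) ~ x, family = binomial("logit"))

# Hand-computed values next to the glm() estimates
print(round(unname(c(beta0, beta1)), 4))
print(round(unname(coef(fit)), 4))
print(round(unname(c(se_beta0, se_beta1)), 4))
print(round(unname(sqrt(diag(vcov(fit)))), 4))
```

The hand-computed values should agree with the `Estimate` and `Std. Error` columns in all three summaries, since each model maximizes the same binomial likelihood up to a constant.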
(For a discussion of the `weights` argument of `glm` in a regression context, see R-bloggers.)