在列聯表頻率計數上與對數線性泊松 glm 一起使用的推理類型

November 18, 2014

我正在做一些對數線性模型來測試多路列聯表中的交互/關聯（基於此處的教程，http://ww2.coastal.edu/kingw/statistics/R-tutorials/loglin.html）。我在觀察到的頻率上使用泊松glm以及使用MASS‘s來執行此操作loglm。我只是想知道哪種類型的假設檢驗在這裡最有意義，使用順序類型 I anova()（不好，因為那裡的 p 值取決於模型中因子的順序），使用Anova()in的類型 III 測試car（獨立於模型中的因素）還是drop1從最複雜的模型開始？

例如使用泰坦尼克號乘客生存數據

library(COUNT)
data(titanic)
titanic=droplevels(titanic)
head(titanic)
mytable=xtabs(~class+age+sex+survived, data=titanic)
ftable(mytable)
                      survived  no yes
class     age    sex                   
1st class child  women            0   1
                man              0   5
         adults women            4 140
                man            118  57
2nd class child  women            0  13
                man              0  11
         adults women           13  80
                man            154  14
3rd class child  women           17  14
                man             35  13
         adults women           89  76
                man            387  75
freqdata=data.frame(mytable)
fullmodel=glm(Freq~SITE*SEX*MORTALITY,family=poisson,data=freqdata)

那麼對於不同分類因素之間的相互作用的最明智的測試是否會由 I 型 SS 給出，如

anova(fullmodel, test="Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: Freq

Terms added sequentially (first to last)


                      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                      23    2173.33              
class                   2   231.18        21    1942.15 < 2.2e-16 ***
age                     1  1072.61        20     869.54 < 2.2e-16 ***
sex                     1   137.74        19     731.80 < 2.2e-16 ***
survived                1    77.61        18     654.19 < 2.2e-16 ***
class:age               2    32.41        16     621.78 9.178e-08 ***
class:sex               2    29.61        14     592.17 3.719e-07 ***
age:sex                 1     6.09        13     586.09   0.01363 *  
class:survived          2   132.69        11     453.40 < 2.2e-16 ***
age:survived            1    25.58        10     427.81 4.237e-07 ***
sex:survived            1   312.93         9     114.89 < 2.2e-16 ***
class:age:sex           2     4.04         7     110.84   0.13250    
class:age:survived      2    35.45         5      75.39 2.002e-08 ***
class:sex:survived      2    73.71         3       1.69 < 2.2e-16 ***
age:sex:survived        1     1.69         2       0.00   0.19421    
class:age:sex:survived  2     0.00         0       0.00   1.00000

或使用 III 型 SS 使用car’s Anova：

library(car)
library(afex)
set_sum_contrasts()
Anova(fullmodel, test="LR", type="III")
Analysis of Deviance Table (Type III tests)

Response: Freq
                      LR Chisq Df Pr(>Chisq)    
class                    37.353  2  7.744e-09 ***
age                       5.545  1  0.0185317 *  
sex                       0.000  1  0.9999999    
survived                  1.386  1  0.2390319    
class:age                 5.476  2  0.0646851 .  
class:sex                 0.000  2  1.0000000    
age:sex                   0.000  1  0.9999888    
class:survived           16.983  2  0.0002052 ***
age:survived              0.056  1  0.8126973    
sex:survived              0.000  1  0.9999953    
class:age:sex             0.000  2  1.0000000    
class:age:survived        3.461  2  0.1771673    
class:sex:survived        0.000  2  1.0000000    
age:sex:survived          0.000  1  0.9999905    
class:age:sex:survived    0.000  2  1.0000000

或使用單項刪除和 LRT drop1：

fullmodel=glm(Freq~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, family=poisson, data=freqdata)
drop1(fullmodel,test="Chisq")
Single term deletions

Model:
Freq ~ class + age + sex + survived + class:age + class:sex + 
   class:survived + age:sex + age:survived + sex:survived
              Df Deviance    AIC     LRT  Pr(>Chi)    
<none>              114.89 249.01                      
class:age       2   162.76 292.89  47.877 4.016e-11 ***
class:sex       2   115.74 245.86   0.850    0.6537    
class:survived  2   230.95 361.08 116.067 < 2.2e-16 ***
age:sex         1   114.89 247.02   0.008    0.9294    
age:survived    1   134.39 266.52  19.505 1.003e-05 ***
sex:survived    1   427.81 559.94 312.927 < 2.2e-16 ***

[最後一個結果似乎與MASS‘s相匹配，loglm情況應該如此：

fullmodel=loglm(~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, mytable)
stepAIC(fullmodel) 
drop1(fullmodel,test="Chisq")
Single term deletions

Model:
~class + age + sex + survived + class:age + class:sex + class:survived + 
   age:sex + age:survived + sex:survived
              Df    AIC     LRT  Pr(>Chi)    
<none>            144.89                      
class:age       2 188.76  47.877 4.016e-11 ***
class:sex       2 141.74   0.850    0.6537    
class:survived  2 256.95 116.067 < 2.2e-16 ***
age:sex         1 142.89   0.008    0.9294    
age:survived    1 162.39  19.505 1.003e-05 ***
sex:survived    1 455.81 312.927 < 2.2e-16 ***

]

（順便說一句，還有其他更優雅的方法來指定具有主效應+所有一階交互效應的模型嗎？）

有什麼想法是分析這種多路列聯表並充分測試不平衡數據集的關聯的最佳方法嗎？

編輯：根據下面的答案，我找到了drop1解決方案：

fullmodel=glm(Freq~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, family=poisson, data=freqdata)
drop1(fullmodel,test="Chisq")

這相當於中的對數線性模型MASS：

fullmodel=loglm(~class+age+sex+survived+class:age+class:sex+class:survived+age:sex+age:survived+sex:survived, mytable)
stepAIC(fullmodel) 
drop1(fullmodel,test="Chisq")

如果我必鬚根據您的設置方式進行選擇，我想我會選擇Anova()，但兩者都沒有多大意義。R 將變量輸入模型的順序是標準化的和任意的。我不會用它來定義我要運行的測試。

相反，在 MASS 庫中使用?loglin或?loglm，然後刪除您對測試感興趣的特定變量/組合。這裡loglm有一個使用泰坦尼克數據集的 R教程。

正如@DWin 所指出的，順序與非區分對應於有意義的不同假設。所以這個問題只能由研究人員來回答。R 世界中這一點的標準版本是 Venables 的論文，Exegesis on linear models ( pdf )。鑑於您說您只是想知道“哪些因素相互關聯”，這似乎不太像條件推斷，而更像是從完整模型中刪除指定的關聯並對其進行測試，或者可能是測試從剝離模型中刪除的關聯其他變量根本不包括在內。

引用自：https://stats.stackexchange.com/questions/124581

在列聯表頻率計數上與對數線性泊松 glm 一起使用的推理類型

相關問答

泊松分佈的原始推導是什麼？

總和nnn帶參數的泊松隨機變量1n1nfrac 1 n

將“標準泊松”轉換為任意泊松

哪些分佈具有未定義的均值但不對稱？

XGboost 和深度學習模型中 Tweedie 或泊松損失/目標函數的用途是什麼

泊松置信區間和 p 值之間的衝突