Generalized-Linear-Model

統計學家是否假設不能過度澆水植物,或者我只是使用錯誤的搜索詞進行曲線回歸?

  • July 11, 2013

我讀到的關於線性回歸和 GLM 的幾乎所有內容都歸結為:在哪裡是一個不增加或不減少的函數和是您估計和檢驗假設的參數。有幾十個鏈接功能和轉換和使的線性函數.

現在,如果您刪除不增加/不減少的要求,我只知道擬合參數線性化模型的兩種選擇:三角函數和多項式。兩者都在每個預測之間產生人為依賴和整套,使它們成為非常不穩健的擬合,除非事先有理由相信您的數據實際上是由循環或多項式過程生成的。

這不是某種深奧的邊緣案例。這是水與農作物產量之間的實際常識關係(一旦地塊在水下足夠深,農作物產量將開始減少),或早餐消耗的卡路里與數學測驗的表現或工廠工人數量之間的關係以及它們產生的小部件的數量……簡而言之,幾乎所有使用線性模型的現實生活案例,但數據涵蓋的範圍足夠廣泛,以至於您可以將收益遞減變成負收益。

我試著尋找“凹”、“凸”、“曲線”、“非單調”、“浴缸”等術語,但我忘記了還有多少其他術語。很少有相關的問題,甚至更少的可用答案。因此,實際上,如果您有以下數據(R 代碼,y 是連續變量 x 和離散變量組的函數):

updown<-data.frame(y=c(46.98,38.39,44.21,46.28,41.67,41.8,44.8,45.22,43.89,45.71,46.09,45.46,40.54,44.94,42.3,43.01,45.17,44.94,36.27,43.07,41.85,40.5,41.14,43.45,33.52,30.39,27.92,19.67,43.64,43.39,42.07,41.66,43.25,42.79,44.11,40.27,40.35,44.34,40.31,49.88,46.49,43.93,50.87,45.2,43.04,42.18,44.97,44.69,44.58,33.72,44.76,41.55,34.46,32.89,20.24,22,17.34,20.14,20.36,24.39,22.05,24.21,26.11,28.48,29.09,31.98,32.97,31.32,40.44,33.82,34.46,42.7,43.03,41.07,41.02,42.85,44.5,44.15,52.58,47.72,44.1,21.49,19.39,26.59,29.38,25.64,28.06,29.23,31.15,34.81,34.25,36,42.91,38.58,42.65,45.33,47.34,50.48,49.2,55.67,54.65,58.04,59.54,65.81,61.43,67.48,69.5,69.72,67.95,67.25,66.56,70.69,70.15,71.08,67.6,71.07,72.73,72.73,81.24,73.37,72.67,74.96,76.34,73.65,76.44,72.09,67.62,70.24,69.85,63.68,64.14,52.91,57.11,48.54,56.29,47.54,19.53,20.92,22.76,29.34,21.34,26.77,29.72,34.36,34.8,33.63,37.56,42.01,40.77,44.74,40.72,46.43,46.26,46.42,51.55,49.78,52.12,60.3,58.17,57,65.81,72.92,72.94,71.56,66.63,68.3,72.44,75.09,73.97,68.34,73.07,74.25,74.12,75.6,73.66,72.63,73.86,76.26,74.59,74.42,74.2,65,64.72,66.98,64.27,59.77,56.36,57.24,48.72,53.09,46.53),
                  x=c(216.37,226.13,237.03,255.17,270.86,287.45,300.52,314.44,325.61,341.12,354.88,365.68,379.77,393.5,410.02,420.88,436.31,450.84,466.95,477,491.89,509.27,521.86,531.53,548.11,563.43,575.43,590.34,213.33,228.99,240.07,250.4,269.75,283.33,294.67,310.44,325.36,340.48,355.66,370.43,377.58,394.32,413.22,428.23,436.41,455.58,465.63,475.51,493.44,505.4,521.42,536.82,550.57,563.17,575.2,592.27,86.15,91.09,97.83,103.39,107.37,114.78,119.9,124.39,131.63,134.49,142.83,147.26,152.2,160.9,163.75,172.29,173.62,179.3,184.82,191.46,197.53,201.89,204.71,214.12,215.06,88.34,109.18,122.12,133.19,148.02,158.72,172.93,189.23,204.04,219.36,229.58,247.49,258.23,273.3,292.69,300.47,314.36,325.65,345.21,356.19,367.29,389.87,397.74,411.46,423.04,444.23,452.41,465.43,484.51,497.33,507.98,522.96,537.37,553.79,566.08,581.91,595.84,610.7,624.04,637.53,649.98,663.43,681.67,698.1,709.79,718.33,734.81,751.93,761.37,775.12,790.15,803.39,818.64,833.71,847.81,88.09,105.72,123.35,132.19,151.87,161.5,177.34,186.92,201.35,216.09,230.12,245.47,255.85,273.45,285.91,303.99,315.98,325.48,343.01,360.05,373.17,381.7,398.41,412.66,423.66,443.67,450.39,468.86,483.93,499.91,511.59,529.34,541.35,550.28,568.31,584.7,592.33,615.74,622.45,639.1,651.41,668.08,679.75,692.94,708.83,720.98,734.42,747.83,762.27,778.74,790.97,806.99,820.03,831.55,844.23),
                  group=factor(rep(c('A','B'),c(81,110))));

plot(y~x,updown,subset=x<500,col=group);

散點圖

您可能首先嘗試 Box-Cox 變換,看看它是否具有機械意義,如果失敗,您可能會擬合具有邏輯或漸近鏈接函數的非線性最小二乘模型。

那麼,當你發現完整的數據集看起來像這樣時,為什麼要完全放棄參數模型並轉而使用像樣條線這樣的黑盒方法……

plot(y~x,updown,col=group);

我的問題是:

  • 為了找到代表此類函數關係的鏈接函數,我應該搜索哪些術語?

或者

  • 我應該閱讀和/或搜索什麼來自學如何設計鏈接函數到此類函數關係或擴展當前僅用於單調響應的現有函數?

或者

  • 哎呀,甚至 StackExchange 標記最適合此類問題!

關於鏈接函數和單調性的問題中的評論是一個紅鯡魚。 它們的基礎似乎是一個隱含的假設,即廣義線性模型 (GLM),通過表達對響應的期望作為單調函數線性組合的解釋變量, 不夠靈活,無法解釋非單調響應。事實並非如此。


也許一個可行的例子可以說明這一點。在 1948 年的一項研究(死後發表於 1977 年,從未經過同行評議)中,J. Tolkien 報告了植物澆水實驗的結果,其中 13 組 24 朵向日葵(Helianthus Gondorensis)從發芽開始到三個月內給予控制量的水的增長。應用的總量從一英寸到 25 英寸不等,以 2 英寸為增量。

圖1

對澆水有明顯的積極反應,對過度澆水有強烈的消極反應。早期的工作,基於離子傳輸的假設動力學模型,假設有兩種競爭機制可能解釋這種行為:一種導致對少量水的線性響應(以生存率的對數來衡量),而另一種 - -抑制因素-呈指數作用(這是一種強烈的非線性效應)。對於大量的水,抑制因素會壓倒水的積極作用並顯著增加死亡率。

讓是(未知)抑制率(每單位水量)。該模型斷言,數一組大小的倖存者接收英寸的水應該有

分佈,其中是將對數賠率轉換回概率的鏈接函數。這是一個二項式 GLM。因此,雖然在,給定任何值它的參數是線性的,, 和. GLM 設置中的“線性”必須從以下意義上理解:是這些參數的線性組合,其係數對於每個參數都是已知的. 它們是:它們相等(係數),本身(係數), 和(係數)。 這個模型——雖然它有點新穎,而且參數不是完全線性的——可以使用標準軟件通過最大化任意概率的可能性來擬合並選擇這個最大值是最大的。這是R執行此操作的代碼,從數據開始:

water <- seq(1, 25, length.out=13)
n.survived <- c(0, 3, 4, 12, 18, 21, 23, 24, 22, 23, 18, 3, 2)
pop <- 24
counts <- cbind(n.survived, n.died=pop-n.survived)
f <- function(k) {
 fit <- glm(counts ~ water + I(-exp(water * k)), family=binomial)
 list(AIC=AIC(fit), fit=fit)
}
k.est <- optim(0.1, function(k) f(k)$AIC, method="Brent", lower=0, upper=1)$par
fit <- f(k.est)$fit

沒有技術困難;計算僅需 1/30 秒。

圖 2

藍色曲線是響應的擬合期望, .

顯然(a)擬合良好,(b)它預測了兩者之間的非單調關係和(倒置的“浴缸”曲線)。R為了清楚地說明這一點,這裡是用於計算和繪製擬合的後續代碼:

x.0 <- seq(min(water), max(water), length.out=100)
p.0 <- cbind(rep(1, length(x.0)), x.0, -exp(k.est * x.0))
logistic <- function(x) 1 - 1/(1 + exp(x))
predicted <- pop * logistic(p.0 %*% coef(fit))

plot(water, n.survived / pop, main="Data and Fit",
    xlab="Total water (inches)", 
    ylab="Proportion surviving at 3 months")
lines(x.0, predicted / pop, col="#a0a0ff", lwd=2)


問題的答案是:

為了找到代表此類函數關係的鏈接函數,我應該搜索哪些術語?

:這不是鏈接功能的目的。

我應該…搜索什麼以…擴展當前僅用於單調響應的現有[鏈接功能]?

什麼都沒有:這是基於對如何建模響應的誤解。

顯然,在構建回歸模型時,首先應該關注*要使用或構建的解釋變量。*正如本例中所建議的,從過去的經驗和理論中尋找指導。

引用自:https://stats.stackexchange.com/questions/63978

comments powered by Disqus