How do I determine which distribution fits my data best?

January 8, 2015

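I have a dataset and would like to figure out which distribution fits my data best.

I used the fitdistr() function to estimate the parameters needed to describe the assumed distribution (i.e. Weibull, Cauchy, Normal). Using those parameters, I can conduct a Kolmogorov-Smirnov test to assess whether my sample data come from the same distribution as the one I assumed.

If the p-value is > 0.05, I can assume that the sample data are drawn from the same distribution. But the p-value doesn't provide any information about the goodness of fit, does it?

So if the p-value of my sample data is > 0.05 for a normal distribution as well as for a Weibull distribution, how can I know which distribution fits my data better?

This is basically what I have done: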
> mydata
[1] 37.50 46.79 48.30 46.04 43.40 39.25 38.49 49.51 40.38 36.98 40.00
[12] 38.49 37.74 47.92 44.53 44.91 44.91 40.00 41.51 47.92 36.98 43.40
[23] 42.26 41.89 38.87 43.02 39.25 40.38 42.64 36.98 44.15 44.91 43.40
[34] 49.81 38.87 40.00 52.45 53.13 47.92 52.45 44.91 29.54 27.13 35.60
[45] 45.34 43.37 54.15 42.77 42.88 44.26 27.14 39.31 24.80 16.62 30.30
[56] 36.39 28.60 28.53 35.84 31.10 34.55 52.65 48.81 43.42 52.49 38.00
[67] 38.65 34.54 37.70 38.11 43.05 29.95 32.48 24.63 35.33 41.34

# estimate shape and scale to perform KS-test for weibull distribution
> fitdistr(mydata, "weibull")
    shape        scale   
  6.4632971   43.2474500 
( 0.5800149) ( 0.8073102)

# KS-test for weibull distribution
> ks.test(mydata, "pweibull", scale=43.2474500, shape=6.4632971)

       One-sample Kolmogorov-Smirnov test

data:  mydata
D = 0.0686, p-value = 0.8669
alternative hypothesis: two-sided

# KS-test for normal distribution
> ks.test(mydata, "pnorm", mean=mean(mydata), sd=sd(mydata))

       One-sample Kolmogorov-Smirnov test

data:  mydata
D = 0.0912, p-value = 0.5522
alternative hypothesis: two-sided
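The p-values are 0.8669 for the Weibull distribution and 0.5522 for the normal distribution. Thus I can assume that my data follow a Weibull as well as a normal distribution. But which distribution function describes my data better?

Referring to elevendollar, I found the following code, but I don't know how to interpret the result: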
fits <- list(no = fitdistr(mydata, "normal"),
            we = fitdistr(mydata, "weibull"))
sapply(fits, function(i) i$loglik)
      no        we 
-259.6540 -257.9268 

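First, here are some quick comments:

The p-values of a Kolmogorov-Smirnov test (KS test) with estimated parameters can be quite wrong, because the p-value does not take the uncertainty of the estimation into account. So unfortunately, you can't just fit a distribution and then plug the estimated parameters into a Kolmogorov-Smirnov test of the same sample. There is a normality test called the Lilliefors test, a modified version of the KS test that allows for estimated parameters.

Your sample will never follow a specific distribution exactly. So even if the p-values from the KS test were valid and > 0.05, that would only mean that you can't rule out that your data follow this specific distribution. Another way to put it is that your sample is compatible with a certain distribution. But the answer to the question "Do my data follow distribution xy exactly?" is always no.

The goal here cannot be to determine with certainty which distribution your sample follows. The goal is what @whuber (in the comments) calls a parsimonious approximate description of the data. A specific parametric distribution can be useful as a model of the data (just as the model "the earth is a sphere" can be useful).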
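
The first comment can be illustrated with a small simulation sketch. It is only an illustration under stated assumptions: normal data, the same sample size as in the question (76), and an arbitrary seed and number of replications. Each replication draws a sample for which the null hypothesis is true, estimates the mean and standard deviation from that sample, and plugs them back into ks.test. If the resulting p-values were valid, roughly 5% of them would fall below 0.05.

```r
# Sketch: how often does the naive KS test reject at the 5% level when the
# null hypothesis is true but the parameters are estimated from the sample?
set.seed(1)      # arbitrary seed, for reproducibility only
n <- 76          # same sample size as the data in the question
n.rep <- 2000    # arbitrary number of replications

p.naive <- replicate(n.rep, {
  r <- rnorm(n)
  # plug the estimates from the same sample back into the test
  ks.test(r, "pnorm", mean = mean(r), sd = sd(r))$p.value
})

# share of replications with p < 0.05; a valid test would give about 0.05
reject.rate <- mean(p.naive < 0.05)
reject.rate
```

The rejection rate comes out far below the nominal 0.05, which means the naive p-values are too large. For the normal case specifically, lillie.test() in the nortest package implements the Lilliefors correction.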

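But let's do some exploration. I will use the excellent fitdistrplus package, which offers some nice functions for distribution fitting. We will use the function descdist to get some ideas about possible candidate distributions.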
library(fitdistrplus)
library(logspline)

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
Now let's use descdist:
descdist(x, discrete = FALSE)
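The kurtosis and squared skewness of your sample are plotted as a blue point named "Observation". Possible distributions seem to include the Weibull, the lognormal and possibly the gamma distribution.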
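
The two coordinates of that point can be approximated by hand. The sketch below uses the plain moment estimators in base R; this is an assumption for illustration, since descdist may apply a bias correction by default, so its printed values can differ slightly.

```r
# Sketch: squared skewness and kurtosis of the sample via plain moment
# estimators (no bias correction), i.e. the two axes of the descdist plot.
x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

m  <- mean(x)
m2 <- mean((x - m)^2)   # second central moment
m3 <- mean((x - m)^3)   # third central moment
m4 <- mean((x - m)^4)   # fourth central moment

skew <- m3 / m2^1.5     # sample skewness
kurt <- m4 / m2^2       # sample kurtosis (not excess kurtosis)

c(squared.skewness = skew^2, kurtosis = kurt)
```

For reference, a normal distribution sits at squared skewness 0 and kurtosis 3 on this plot.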

Let's fit a Weibull distribution and a normal distribution:
fit.weibull <- fitdist(x, "weibull")
fit.norm <- fitdist(x, "norm")
Now inspect the fit for the normal:
plot(fit.norm)
And for the Weibull fit:
plot(fit.weibull)
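Both look good, but judging by the QQ-plot, the Weibull fit looks perhaps a bit better, especially in the tails. Correspondingly, the AIC of the Weibull fit is lower than that of the normal fit: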
fit.weibull$aic
[1] 519.8537

fit.norm$aic
[1] 523.3079
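
These AIC values tie back to the log-likelihoods shown in the question. A minimal sketch, assuming the usual definition AIC = 2k - 2 log L with k = 2 estimated parameters for each model (mean and sd for the normal, shape and scale for the Weibull):

```r
# Sketch: recompute the AIC from the log-likelihoods reported in the question.
loglik <- c(no = -259.6540, we = -257.9268)  # values copied from the question
k <- 2                                       # parameters per model (assumed)

aic <- 2 * k - 2 * loglik
aic

# Both models have the same number of parameters, so the AIC difference is
# just twice the difference in log-likelihood.
delta <- aic[["no"]] - aic[["we"]]
delta
```

The results agree with the values above up to rounding. A higher log-likelihood, and therefore a lower AIC, favors the Weibull here.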
Kolmogorov-Smirnov test simulation

I will use @Aksakal's procedure explained here to simulate the KS statistic under the null.
n.sims <- 5e4

stats <- replicate(n.sims, {      
 r <- rweibull(n = length(x)
               , shape= fit.weibull$estimate["shape"]
               , scale = fit.weibull$estimate["scale"]
 )
 estfit.weibull <- fitdist(r, "weibull") # added to account for the estimated parameters
 as.numeric(ks.test(r
                    , "pweibull"
                    , shape= estfit.weibull$estimate["shape"]
                    , scale = estfit.weibull$estimate["scale"])$statistic
 )      
})
The ECDF of the simulated KS statistics looks as follows:
plot(ecdf(stats), las = 1, main = "KS-test statistic simulation (CDF)", col = "darkorange", lwd = 1.7)
grid()
Finally, our p-value using the simulated null distribution of the KS statistic is:
fit <- logspline(stats)

1 - plogspline(ks.test(x
                      , "pweibull"
                      , shape= fit.weibull$estimate["shape"]
                      , scale = fit.weibull$estimate["scale"])$statistic
              , fit
)

[1] 0.4889511
This confirms our graphical conclusion that the sample is compatible with a Weibull distribution.
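
The same idea can be sketched without the logspline smoothing step, by taking the share of simulated statistics that are at least as large as the observed one. This is a simplified variant, not the exact procedure above: it uses fitdistr() from MASS instead of fitdist(), far fewer simulations, and an arbitrary seed, so the p-value will differ somewhat from the one reported above. The helper ks.refit is a hypothetical name introduced for this sketch.

```r
library(MASS)  # fitdistr(); swapped in for fitdistrplus to keep the sketch short

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

# KS statistic of a sample against a Weibull refitted to that same sample
# (hypothetical helper; warnings about ties and optimizer steps are silenced)
ks.refit <- function(v) {
  est <- suppressWarnings(fitdistr(v, "weibull"))$estimate
  unname(suppressWarnings(
    ks.test(v, "pweibull", shape = est[["shape"]], scale = est[["scale"]])
  )$statistic)
}

set.seed(123)   # arbitrary seed
est.x <- suppressWarnings(fitdistr(x, "weibull"))$estimate
d.obs <- ks.refit(x)

n.sims <- 1000  # far fewer than 5e4, to keep the run time low
d.sim <- replicate(n.sims, ks.refit(
  rweibull(length(x), shape = est.x[["shape"]], scale = est.x[["scale"]])
))

# Monte Carlo p-value: share of simulated statistics >= the observed one
p.sim <- mean(d.sim >= d.obs)
p.sim
```

Refitting the parameters inside every replication is the step that accounts for the estimation uncertainty the naive KS test ignores.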

As explained here, we can use bootstrapping to add pointwise confidence intervals to the estimated Weibull PDF or CDF:
xs <- seq(10, 65, length.out = 500)  # grid on which the PDF and CDF are evaluated

true.weibull <- rweibull(1e6, shape= fit.weibull$estimate["shape"]
                        , scale = fit.weibull$estimate["scale"])

boot.pdf <- sapply(1:1000, function(i) {
 xi <- sample(x, size=length(x), replace=TRUE)
 MLE.est <- suppressWarnings(fitdist(xi, distr="weibull"))  
 dweibull(xs, shape=MLE.est$estimate["shape"],  scale = MLE.est$estimate["scale"])
}
)

boot.cdf <- sapply(1:1000, function(i) {
 xi <- sample(x, size=length(x), replace=TRUE)
 MLE.est <- suppressWarnings(fitdist(xi, distr="weibull"))  
 pweibull(xs, shape= MLE.est$estimate["shape"],  scale = MLE.est$estimate["scale"])
}
)   

#-----------------------------------------------------------------------------
# Plot PDF
#-----------------------------------------------------------------------------

par(bg="white", las=1, cex=1.2)
plot(xs, boot.pdf[, 1], type="l", col=rgb(.6, .6, .6, .1), ylim=range(boot.pdf),
    xlab="x", ylab="Probability density")
for(i in 2:ncol(boot.pdf)) lines(xs, boot.pdf[, i], col=rgb(.6, .6, .6, .1))

# Add pointwise confidence bands

quants <- apply(boot.pdf, 1, quantile, c(0.025, 0.5, 0.975))
min.point <- apply(boot.pdf, 1, min, na.rm=TRUE)
max.point <- apply(boot.pdf, 1, max, na.rm=TRUE)
lines(xs, quants[1, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[3, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[2, ], col="darkred", lwd=2)
#-----------------------------------------------------------------------------
# Plot CDF
#-----------------------------------------------------------------------------

par(bg="white", las=1, cex=1.2)
plot(xs, boot.cdf[, 1], type="l", col=rgb(.6, .6, .6, .1), ylim=range(boot.cdf),
    xlab="x", ylab="F(x)")
for(i in 2:ncol(boot.cdf)) lines(xs, boot.cdf[, i], col=rgb(.6, .6, .6, .1))

# Add pointwise confidence bands

quants <- apply(boot.cdf, 1, quantile, c(0.025, 0.5, 0.975))
min.point <- apply(boot.cdf, 1, min, na.rm=TRUE)
max.point <- apply(boot.cdf, 1, max, na.rm=TRUE)
lines(xs, quants[1, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[3, ], col="red", lwd=1.5, lty=2)
lines(xs, quants[2, ], col="darkred", lwd=2)
#lines(xs, min.point, col="purple")
#lines(xs, max.point, col="purple")
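Automatic distribution fitting with GAMLSS

The gamlss package for R offers the ability to try many different distributions and select the "best" one according to the GAIC (the generalized Akaike information criterion). The main function is fitDist. An important option in this function is the type of the distributions that are tried. For example, setting type = "realline" will try all implemented distributions defined on the whole real line, whereas type = "realplus" will only try distributions defined on the positive real line. Another important option is the parameter k, which is the penalty for the GAIC. In the fitDist call below, I set k = 2, which means that the "best" distribution is selected according to the classic AIC. You can set k to anything you like, such as log(n) for the BIC.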
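
To show what the penalty k does, here is a minimal sketch of the criterion itself, assuming the usual form GAIC = -2 log L + k * df. The function gaic below is a hypothetical helper written for this illustration, not the gamlss implementation; the log-likelihoods are the ones from the question.

```r
# Sketch: generalized AIC with penalty k (hypothetical helper, not gamlss code)
gaic <- function(loglik, df, k = 2) -2 * loglik + k * df

loglik <- c(normal = -259.6540, weibull = -257.9268)  # from the question
n <- 76                                               # sample size

aic <- gaic(loglik, df = 2, k = 2)        # k = 2 gives the classic AIC
bic <- gaic(loglik, df = 2, k = log(n))   # k = log(n) gives the BIC

rbind(AIC = aic, BIC = bic)
```

Because both candidates here have two parameters, the choice of k does not change their ranking. It starts to matter once families with more parameters enter the comparison, since a larger k penalizes them more heavily.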
library(gamlss)
library(gamlss.dist)
library(gamlss.add)

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
      38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
      42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
      49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
      45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
      36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
      38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

fit <- fitDist(x, k = 2, type = "realplus", trace = FALSE, try.gamlss = TRUE)

summary(fit)

*******************************************************************
Family:  c("WEI2", "Weibull type 2") 

Call:  gamlssML(formula = y, family = DIST[i], data = sys.parent()) 

Fitting method: "nlminb" 


Coefficient(s):
            Estimate  Std. Error  t value   Pr(>|t|)    
eta.mu    -24.3468041   2.2141197 -10.9962 < 2.22e-16 ***
eta.sigma   1.8661380   0.0892799  20.9021 < 2.22e-16 ***
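According to the AIC, the Weibull distribution (more specifically WEI2, a special parametrization of it) fits the data best. The exact parametrization of the distribution WEI2 is detailed in this document on page 279. Let's inspect the fit by looking at the residuals in a worm plot (basically a de-trended Q-Q plot):

We expect the residuals to lie close to the middle horizontal line, with 95% of them between the upper and lower dotted curves, which act as 95% pointwise confidence intervals. In this case, the worm plot looks fine to me, indicating that the Weibull distribution is an adequate fit.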
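
A worm plot can also be sketched by hand in base R, which makes clear what it shows. The sketch below is an approximation under stated assumptions: it fits the standard Weibull parametrization with fitdistr() from MASS rather than WEI2, and it draws an approximate pointwise 95% band from the asymptotic standard error of a normal order statistic. In gamlss itself, the wp() function produces the plot directly from a fitted object.

```r
library(MASS)  # fitdistr(); standard Weibull parametrization, not WEI2

x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

est <- suppressWarnings(fitdistr(x, "weibull"))$estimate

# Normalized quantile residuals: standard normal if the model is correct
res <- qnorm(pweibull(x, shape = est[["shape"]], scale = est[["scale"]]))

n    <- length(res)
theo <- qnorm(ppoints(n))   # theoretical standard normal quantiles
dev  <- sort(res) - theo    # de-trended Q-Q: deviation from the identity line

# Approximate pointwise 95% band (asymptotic SE of a normal order statistic)
band <- qnorm(0.975) * sqrt(pnorm(theo) * (1 - pnorm(theo)) / n) / dnorm(theo)

plot(theo, dev, las = 1, ylim = range(c(dev, band, -band)),
     xlab = "Unit normal quantile", ylab = "Deviation")
abline(h = 0, lty = 2)
lines(theo,  band, lty = 3)
lines(theo, -band, lty = 3)
```

Points that stay near the horizontal line and inside the dotted band indicate residuals that behave like a standard normal sample, which is what an adequate fit should produce.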