How do I determine which distribution fits my data best?
I have a dataset and would like to find out which distribution fits my data best.
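I used the fitdistr() function to estimate the parameters needed to describe the assumed distribution (i.e. Weibull, Cauchy, Normal). Using those parameters I can conduct a Kolmogorov-Smirnov test to assess whether my sample data come from the same distribution as the one I assumed. If the p-value is > 0.05, I can assume that the sample data are drawn from that distribution. But the p-value doesn't provide any information about the goodness of fit, does it?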
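So if the p-value of my sample data is > 0.05 for a normal distribution as well as for a Weibull distribution, how can I know which distribution fits my data better?
This is basically what I have done: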
    > mydata
     [1] 37.50 46.79 48.30 46.04 43.40 39.25 38.49 49.51 40.38 36.98 40.00
    [12] 38.49 37.74 47.92 44.53 44.91 44.91 40.00 41.51 47.92 36.98 43.40
    [23] 42.26 41.89 38.87 43.02 39.25 40.38 42.64 36.98 44.15 44.91 43.40
    [34] 49.81 38.87 40.00 52.45 53.13 47.92 52.45 44.91 29.54 27.13 35.60
    [45] 45.34 43.37 54.15 42.77 42.88 44.26 27.14 39.31 24.80 16.62 30.30
    [56] 36.39 28.60 28.53 35.84 31.10 34.55 52.65 48.81 43.42 52.49 38.00
    [67] 38.65 34.54 37.70 38.11 43.05 29.95 32.48 24.63 35.33 41.34

    # estimate shape and scale to perform KS-test for weibull distribution
    > fitdistr(mydata, "weibull")
         shape        scale
       6.4632971   43.2474500
     ( 0.5800149) ( 0.8073102)

    # KS-test for weibull distribution
    > ks.test(mydata, "pweibull", scale=43.2474500, shape=6.4632971)

            One-sample Kolmogorov-Smirnov test

    data:  mydata
    D = 0.0686, p-value = 0.8669
    alternative hypothesis: two-sided

    # KS-test for normal distribution
    > ks.test(mydata, "pnorm", mean=mean(mydata), sd=sd(mydata))

            One-sample Kolmogorov-Smirnov test

    data:  mydata
    D = 0.0912, p-value = 0.5522
    alternative hypothesis: two-sided
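The p-values are 0.8669 for the Weibull distribution and 0.5522 for the normal distribution. Thus I can assume that my data follow a Weibull as well as a normal distribution. But which distribution function describes my data better?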
Referring to elevendollar, I found the following code, but I don't know how to interpret the results:

    fits <- list(no = fitdistr(mydata, "normal"),
                 we = fitdistr(mydata, "weibull"))
    sapply(fits, function(i) i$loglik)
           no        we
    -259.6540 -257.9268
First, here are some quick comments:
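- The $p$-values of a Kolmogorov-Smirnov test (KS test) with estimated parameters can be quite wrong, because the $p$-value does not take the uncertainty of the estimation into account. So unfortunately, you can't just fit a distribution and then plug the estimated parameters into a Kolmogorov-Smirnov test to test your sample. There is a normality test called the Lilliefors test, a modified version of the KS test that allows for estimated parameters.
- Your sample will never follow a specific distribution exactly. So even if your $p$-values from the KS test were valid and $> 0.05$, it would only mean that you can't rule out that your data follow this specific distribution. Another way to put it is that your sample is compatible with a certain distribution. But the answer to the question "Do my data follow distribution xy exactly?" is always no.
- The goal here cannot be to determine with certainty which distribution your sample follows. The goal is what @whuber (in the comments) calls a parsimonious approximate description of the data. A specific parametric distribution can be useful as a model of the data (just as the model "the earth is a sphere" can be useful).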
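To make the first comment concrete, here is a minimal base-R sketch of the idea behind the Lilliefors correction for the normal case. It is an illustration under my own assumptions, not a replacement for a packaged implementation: the helper name ks_stat_fitted_normal, the seed and the number of simulations are arbitrary choices. The null distribution of the KS statistic is simulated with the mean and standard deviation re-estimated on every simulated sample, so the resulting $p$-value accounts for the estimation step.

```r
# Sample from the question (76 observations)
x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: KS statistic against a normal distribution whose
# parameters are estimated from the very sample being tested.
# suppressWarnings() hides the ties warning that ks.test() emits for x.
ks_stat_fitted_normal <- function(v) {
  suppressWarnings(
    as.numeric(ks.test(v, "pnorm", mean = mean(v), sd = sd(v))$statistic)
  )
}

d_obs <- ks_stat_fitted_normal(x)

# The statistic is location-scale invariant when mean and sd are estimated,
# so standard normal samples are enough to simulate its null distribution.
n_sims <- 2000
d_null <- replicate(n_sims, ks_stat_fitted_normal(rnorm(length(x))))

# Monte Carlo p-value versus the naive p-value that ignores the estimation
p_mc    <- mean(d_null >= d_obs)
p_naive <- suppressWarnings(
  ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$p.value
)
c(D = d_obs, p_monte_carlo = p_mc, p_naive = p_naive)
```

The naive $p$-value is typically the larger of the two, which is why plugging fitted parameters into a plain KS test makes it too easy to "accept" a distribution. A ready-made Lilliefors test is available as lillie.test() in the nortest package.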
But let's do some exploration. I will use the excellent fitdistrplus package, which offers some nice functions for distribution fitting. We will use the function descdist to get some ideas about possible candidate distributions.

    library(fitdistrplus)
    library(logspline)

    x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
           38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
           42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
           49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
           45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
           36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
           38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
Now let's use descdist:

    descdist(x, discrete = FALSE)
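The kurtosis and squared skewness of your sample are plotted as a blue point named "Observation". It seems that possible distributions include the Weibull, the lognormal and possibly the gamma distribution.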
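The two quantities on the axes of that plot can also be computed by hand. The sketch below uses the plain moment estimators in base R; this is my own simplification, and descdist may apply a bias correction by default, so its plotted point can differ slightly from these numbers.

```r
# Sample from the question (76 observations)
x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

# Central moments of order 2, 3 and 4
m  <- mean(x)
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

# Moment-based skewness and (non-excess) kurtosis
skew <- m3 / m2^1.5
kurt <- m4 / m2^2

c(skewness = skew, squared_skewness = skew^2, kurtosis = kurt)
```

For reference, a normal distribution sits at squared skewness 0 and kurtosis 3, so the distance of the observed point from (0, 3) gives a first impression of how non-normal the sample looks.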
Let's fit a Weibull distribution and a normal distribution:

    fit.weibull <- fitdist(x, "weibull")
    fit.norm <- fitdist(x, "norm")
Now inspect the fit for the normal distribution:

    plot(fit.norm)

And for the Weibull fit:

    plot(fit.weibull)
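Both look good, but judging by the QQ-plot, the Weibull fit looks a bit better, especially in the tails. Correspondingly, the AIC of the Weibull fit is lower than that of the normal fit:

    fit.weibull$aic
    [1] 519.8537

    fit.norm$aic
    [1] 523.3079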
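These AIC values follow directly from the log-likelihoods printed in the question, via $\mathrm{AIC} = 2k - 2\log L$ with $k = 2$ estimated parameters for each distribution. A small arithmetic sketch, with variable names of my own choosing and the log-likelihoods copied from the fitdistr output above:

```r
# Log-likelihoods reported by fitdistr() in the question
loglik <- c(normal = -259.6540, weibull = -257.9268)

# Both candidates have two estimated parameters
n_par <- c(normal = 2, weibull = 2)

# AIC = 2k - 2 log L; a lower value indicates the preferred model
aic <- 2 * n_par - 2 * loglik
round(aic, 4)

# Name of the candidate with the lowest AIC
names(which.min(aic))
```

Because both models have the same number of parameters, ranking by AIC here is equivalent to ranking by log-likelihood: the higher (less negative) log-likelihood wins.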
Kolmogorov-Smirnov test simulation
I will use the procedure explained by @Aksakal to simulate the KS statistic under the null hypothesis.
    n.sims <- 5e4

    stats <- replicate(n.sims, {
      r <- rweibull(n = length(x),
                    shape = fit.weibull$estimate["shape"],
                    scale = fit.weibull$estimate["scale"])
      # added to account for the estimated parameters
      estfit.weibull <- fitdist(r, "weibull")
      as.numeric(ks.test(r, "pweibull",
                         shape = estfit.weibull$estimate["shape"],
                         scale = estfit.weibull$estimate["scale"])$statistic)
    })
The ECDF of the simulated KS statistics looks as follows:

    plot(ecdf(stats), las = 1, main = "KS-test statistic simulation (CDF)",
         col = "darkorange", lwd = 1.7)
    grid()
Finally, our $p$-value using the simulated null distribution of the KS statistic is:

    fit <- logspline(stats)

    1 - plogspline(ks.test(x, "pweibull",
                           shape = fit.weibull$estimate["shape"],
                           scale = fit.weibull$estimate["scale"])$statistic,
                   fit)
    [1] 0.4889511
This confirms our graphical conclusion that the sample is compatible with a Weibull distribution.
We can also use bootstrapping to add pointwise confidence intervals to the estimated Weibull PDF or CDF:

    xs <- seq(10, 65, len = 500)

    true.weibull <- rweibull(1e6,
                             shape = fit.weibull$estimate["shape"],
                             scale = fit.weibull$estimate["scale"])

    boot.pdf <- sapply(1:1000, function(i) {
      xi <- sample(x, size = length(x), replace = TRUE)
      MLE.est <- suppressWarnings(fitdist(xi, distr = "weibull"))
      dweibull(xs, shape = MLE.est$estimate["shape"],
               scale = MLE.est$estimate["scale"])
    })

    boot.cdf <- sapply(1:1000, function(i) {
      xi <- sample(x, size = length(x), replace = TRUE)
      MLE.est <- suppressWarnings(fitdist(xi, distr = "weibull"))
      pweibull(xs, shape = MLE.est$estimate["shape"],
               scale = MLE.est$estimate["scale"])
    })

    #-----------------------------------------------------------------------------
    # Plot PDF
    #-----------------------------------------------------------------------------

    par(bg = "white", las = 1, cex = 1.2)
    plot(xs, boot.pdf[, 1], type = "l", col = rgb(.6, .6, .6, .1),
         ylim = range(boot.pdf), xlab = "x", ylab = "Probability density")
    for (i in 2:ncol(boot.pdf)) lines(xs, boot.pdf[, i], col = rgb(.6, .6, .6, .1))

    # Add pointwise confidence bands
    quants <- apply(boot.pdf, 1, quantile, c(0.025, 0.5, 0.975))
    min.point <- apply(boot.pdf, 1, min, na.rm = TRUE)
    max.point <- apply(boot.pdf, 1, max, na.rm = TRUE)
    lines(xs, quants[1, ], col = "red", lwd = 1.5, lty = 2)
    lines(xs, quants[3, ], col = "red", lwd = 1.5, lty = 2)
    lines(xs, quants[2, ], col = "darkred", lwd = 2)
    #-----------------------------------------------------------------------------
    # Plot CDF
    #-----------------------------------------------------------------------------

    par(bg = "white", las = 1, cex = 1.2)
    plot(xs, boot.cdf[, 1], type = "l", col = rgb(.6, .6, .6, .1),
         ylim = range(boot.cdf), xlab = "x", ylab = "F(x)")
    for (i in 2:ncol(boot.cdf)) lines(xs, boot.cdf[, i], col = rgb(.6, .6, .6, .1))

    # Add pointwise confidence bands
    quants <- apply(boot.cdf, 1, quantile, c(0.025, 0.5, 0.975))
    min.point <- apply(boot.cdf, 1, min, na.rm = TRUE)
    max.point <- apply(boot.cdf, 1, max, na.rm = TRUE)
    lines(xs, quants[1, ], col = "red", lwd = 1.5, lty = 2)
    lines(xs, quants[3, ], col = "red", lwd = 1.5, lty = 2)
    lines(xs, quants[2, ], col = "darkred", lwd = 2)
    #lines(xs, min.point, col = "purple")
    #lines(xs, max.point, col = "purple")
Automatic distribution fitting with GAMLSS
The gamlss package for R offers the ability to try many different distributions and select the "best" one according to the GAIC (the generalized Akaike information criterion). The main function is fitDist. An important option in this function is the type of distributions that are tried. For example, setting type = "realline" will try all implemented distributions defined on the whole real line, whereas type = "realplus" will only try distributions defined on the positive real line. Another important option is the parameter $k$, which is the penalty for the GAIC. In the example below, I set $k = 2$, which means that the "best" distribution is selected according to the classic AIC. You can set $k$ to anything you like, such as $\log(n)$ for the BIC.

    library(gamlss)
    library(gamlss.dist)
    library(gamlss.add)

    x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
           38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
           42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
           49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
           45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
           36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
           38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

    fit <- fitDist(x, k = 2, type = "realplus", trace = FALSE, try.gamlss = TRUE)

    summary(fit)

    *******************************************************************
    Family:  c("WEI2", "Weibull type 2")

    Call:  gamlssML(formula = y, family = DIST[i], data = sys.parent())

    Fitting method: "nlminb"

    Coefficient(s):
                 Estimate  Std. Error  t value   Pr(>|t|)
    eta.mu    -24.3468041   2.2141197 -10.9962 < 2.22e-16 ***
    eta.sigma   1.8661380   0.0892799  20.9021 < 2.22e-16 ***
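To illustrate what the penalty $k$ does, here is a small base-R sketch that assumes the usual definition $\mathrm{GAIC}(k) = -2\log L + k \cdot \mathrm{df}$. The function gaic below is a hypothetical helper written for this illustration, not part of gamlss, and the log-likelihoods are the ones reported by fitdistr in the question rather than values taken from the gamlss fit.

```r
# Hypothetical helper: generalized AIC with penalty k per estimated parameter
gaic <- function(loglik, df, k = 2) {
  -2 * loglik + k * df
}

# Log-likelihoods from the question's fitdistr() output; n = 76 observations
loglik <- c(normal = -259.6540, weibull = -257.9268)
n <- 76

# k = 2 gives the classic AIC
aic <- gaic(loglik, df = 2, k = 2)

# k = log(n) gives the BIC
bic <- gaic(loglik, df = 2, k = log(n))

rbind(AIC = aic, BIC = bic)
```

With the same number of parameters in both candidates, changing $k$ shifts every criterion value by the same amount and leaves the ranking unchanged. The penalty only affects the choice when the candidate distributions differ in their number of parameters, which is the case for many of the families that fitDist tries.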
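According to the AIC, the Weibull distribution (more specifically WEI2, a special parametrization of it) fits the data best. The exact parametrization of the WEI2 distribution is detailed on page 279 of the documentation. Let's inspect the fit by looking at the residuals in a worm plot (basically a de-trended Q-Q plot): we expect the residuals to be close to the middle horizontal line, with 95% of them lying between the upper and lower dotted curves, which act as 95% pointwise confidence intervals. In this case, the worm plot looks fine to me, indicating that the Weibull distribution is an adequate fit.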
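The gamlss package provides the function wp() for drawing worm plots of a fitted object. To show what such a plot contains, the following base-R sketch builds a de-trended Q-Q plot by hand. It rests on my own assumptions: it uses the standard Weibull parametrization with the shape and scale estimates from the question's fitdistr output rather than the WEI2 fit, and the dotted band is the usual large-sample approximation for the standard error of a normal order statistic.

```r
# Sample from the question (76 observations)
x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
       38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
       42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
       49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
       45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
       36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
       38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

# Weibull estimates copied from the fitdistr() output in the question
shape_hat <- 6.4632971
scale_hat <- 43.2474500

# Normalized quantile residuals: approximately N(0, 1) if the model fits
r <- qnorm(pweibull(x, shape = shape_hat, scale = scale_hat))

# De-trend: sorted residuals minus their theoretical normal quantiles
n    <- length(r)
theo <- qnorm(ppoints(n))
dev  <- sort(r) - theo

# Approximate pointwise 95% band from the asymptotic quantile standard error
p  <- pnorm(theo)
se <- sqrt(p * (1 - p) / n) / dnorm(theo)

plot(theo, dev, ylim = range(c(dev, 1.96 * se, -1.96 * se)),
     xlab = "Unit normal quantile", ylab = "Deviation")
abline(h = 0, lty = 2)             # perfect fit lies on this line
lines(theo,  1.96 * se, lty = 3)   # upper dotted curve
lines(theo, -1.96 * se, lty = 3)   # lower dotted curve
```

Points that stay close to the horizontal line and inside the dotted curves indicate that the fitted distribution describes the data adequately, which is the same reading as for the worm plot described above.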