Correct equation for weighted unbiased sample covariance

June 8, 2013

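I am looking for the correct equation to compute the weighted unbiased sample covariance. Internet sources on this topic are rare, and they all use different equations.

The most plausible equation I have found is this one: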

From: https://en.wikipedia.org/wiki/Sample_mean_and_sample_covariance#Weighted_samples

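Of course, you have to compute the weighted (unbiased) sample mean beforehand.

However, I have found several other formulas, such as:

I have even seen some source code and academic papers that simply use the standard covariance formula, but with the weighted sample mean in place of the sample mean...

Can someone help me and shed some light on this?

/EDIT: my weights are simply the number of observations of each sample in the dataset, thus weights.sum() = n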

I found the solution in a 1972 publication (George R. Price, Ann. Hum. Genet., Lond, pp485-490, Extension of covariance selection mathematics, 1972).

Biased weighted sample covariance:

$ \Sigma=\frac{1}{\sum_{i=1}^{N}w_i}\sum_{i=1}^N w_i \left(x_i - \mu^*\right)^T\left(x_i - \mu^*\right) $

And the unbiased weighted sample covariance, obtained by applying Bessel's correction:

$ \Sigma=\frac{1}{\sum_{i=1}^{N}w_i - 1}\sum_{i=1}^N w_i \left(x_i - \mu^*\right)^T\left(x_i - \mu^*\right) $

Where $ \mu^* $ is the (unbiased) weighted sample mean:

$ \mathbf{\mu^*}=\frac{\sum_{i=1}^N w_i \mathbf{x}_i}{\sum_{i=1}^N w_i} $

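Important note: this works only if the weights are "repeat"-type weights, meaning that each weight represents the number of occurrences of one observation, and that $ \sum_{i=1}^N w_i = N^* $, where $ N^* $ is the real sample size (the real total number of samples, accounting for the weights).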
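As a small worked example (the numbers are made up purely for illustration), take a scalar variable observed as 2 three times and as 4 once, so that x = (2, 4), w = (3, 1), and the weights sum to 4:

$ \mu^* = \frac{3 \cdot 2 + 1 \cdot 4}{3 + 1} = 2.5 $

$ \sum_{i} w_i \left(x_i - \mu^*\right)^2 = 3\,(2 - 2.5)^2 + 1\,(4 - 2.5)^2 = 0.75 + 2.25 = 3 $

$ \Sigma_{\text{biased}} = \frac{3}{4} = 0.75, \qquad \Sigma_{\text{unbiased}} = \frac{3}{4 - 1} = 1 $

The unbiased value agrees with the ordinary sample variance of the expanded dataset (2, 2, 2, 4), which is what one would expect from repeat-type weights.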

I have updated the Wikipedia article, where you can also find the equation for the unbiased weighted sample variance:

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_covariance

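Practical note: I advise you to first multiply $ w_i $ and $ \left(x_i - \mu^*\right) $ column by column, and then do a matrix multiplication with $ \left(x_i - \mu^*\right) $ to wrap things up and perform the summation automatically. For example, in Python Pandas/Numpy code: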
import numpy as np
import pandas as pd
# X is the dataset, as a pandas DataFrame (one row per distinct observation)
# w holds the repeat-type weights (number of occurrences of each row of X)
mean = np.ma.average(X, axis=0, weights=w)  # weighted sample mean (fast, efficient and precise)
mean = pd.Series(mean, index=X.columns)  # convert to a pandas Series (purely cosmetic, no difference in computed values)
xm = X - mean  # deviations from the weighted mean
xm = xm.fillna(0)  # fill NaN with 0 (a zero deviation contributes nothing, and the other covariance values are still computed correctly)
sigma2 = 1. / (w.sum() - 1) * xm.mul(w, axis=0).T.dot(xm)  # unbiased weighted sample covariance
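I did a few sanity checks with a non-weighted dataset and an equivalent weighted dataset, and it works correctly.

For more details about the theory of unbiased variance/covariance, see this post.

Quoted from: https://stats.stackexchange.com/questions/61225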
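A sanity check of this kind could look roughly like the sketch below. It is not the original check: the helper name `weighted_cov` and the toy data are assumptions made for illustration. It compares the weighted formula against the ordinary covariance of the expanded dataset (each row repeated as many times as its weight), and against NumPy's own `fweights` option of `np.cov`.

```python
import numpy as np

def weighted_cov(X, w, unbiased=True):
    """Covariance of the rows of X with repeat-type (frequency) weights w.

    Hypothetical helper written for this sketch, not a library function.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    mean = np.average(X, axis=0, weights=w)       # weighted sample mean
    xm = X - mean                                 # deviations from the mean
    denom = w.sum() - 1.0 if unbiased else w.sum()
    # Weight each row of deviations, then let the matrix product do the summation
    return (xm * w[:, None]).T @ xm / denom

# Made-up toy data: 3 distinct observations of 2 variables, with their counts
X = np.array([[1.0, 2.0], [2.0, 1.0], [4.0, 5.0]])
w = np.array([3, 1, 2])

# Equivalent non-weighted dataset: row i repeated w[i] times
X_expanded = np.repeat(X, w, axis=0)

cov_weighted = weighted_cov(X, w)
cov_expanded = np.cov(X_expanded, rowvar=False)      # ordinary unbiased covariance
cov_fweights = np.cov(X, rowvar=False, fweights=w)   # NumPy frequency weights

print(np.allclose(cov_weighted, cov_expanded))
print(np.allclose(cov_weighted, cov_fweights))
```

The agreement only holds for integer repeat-type weights; for other kinds of weights (for example reliability weights) the `sum(w) - 1` denominator should not be expected to give an unbiased estimate.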
