Random forest: probabilistic prediction vs. majority vote

December 8, 2014

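scikit-learn seems to aggregate the trees' predictions by averaging probabilities rather than by majority vote, without explaining why (user guide section 1.9.2.1, Random Forests).

Is there a clear explanation for this choice? Also, is there a good paper or review article covering the various model aggregation techniques that can be used for random forest bagging?

Thanks!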

Questions like this are best answered by reading the code, if you are fluent in Python.

RandomForestClassifier.predict, at least in the current version 0.16.1, predicts the class with the highest probability estimate, as given by predict_proba.

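The documentation for predict_proba says:

"The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf."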
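To make the quoted description concrete, here is a minimal pure-Python sketch of that computation for a single input sample. The leaf counts, the class names, and the helper names leaf_fractions and forest_proba are made up for illustration; this is not scikit-learn's code, only the arithmetic the documentation describes.

```python
# Hypothetical data: for ONE input sample, the class counts in the leaf
# that the sample falls into, for each of three trees.
# Each row is [count of class "a", count of class "b"].
classes = ["a", "b"]
leaf_counts = [
    [8, 2],  # tree 0
    [3, 1],  # tree 1
    [1, 4],  # tree 2
]


def leaf_fractions(counts):
    """Class probabilities of one tree: fraction of each class in the leaf."""
    total = sum(counts)
    return [c / total for c in counts]


def forest_proba(all_counts):
    """Forest probabilities: mean of the per-tree class probabilities."""
    per_tree = [leaf_fractions(c) for c in all_counts]
    n_trees = len(per_tree)
    n_classes = len(per_tree[0])
    return [sum(p[k] for p in per_tree) / n_trees for k in range(n_classes)]


proba = forest_proba(leaf_counts)
# The predicted class is the one with the highest averaged probability.
predicted = classes[max(range(len(proba)), key=proba.__getitem__)]
print(proba, predicted)
```

The per-tree fractions are (0.8, 0.2), (0.75, 0.25) and (0.2, 0.8), so the averaged probability of class "a" is 1.75 / 3 and the sample is assigned to "a".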

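The difference from the original method is probably just so that predict gives predictions consistent with predict_proba. The result is sometimes called "soft voting", as opposed to the "hard" majority vote used in the original Breiman paper. In a quick search I could not find a proper comparison of the performance of the two methods, but both seem fairly reasonable in this situation.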
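The two schemes can disagree when one tree is very confident and the others are only mildly so. The toy sketch below uses made-up per-tree probabilities, and soft_vote and hard_vote are hypothetical helper names rather than library functions.

```python
from collections import Counter

# Hypothetical per-tree class probabilities for one sample, two classes (0 and 1).
tree_probas = [
    [0.9, 0.1],  # tree 0 is very confident in class 0
    [0.4, 0.6],  # tree 1 leans slightly towards class 1
    [0.4, 0.6],  # tree 2 leans slightly towards class 1
]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def soft_vote(probas):
    """Average the probabilities over trees, then pick the most probable class."""
    n_trees = len(probas)
    mean = [sum(p[k] for p in probas) / n_trees for k in range(len(probas[0]))]
    return argmax(mean)


def hard_vote(probas):
    """Let each tree vote for its own most probable class, then take the majority."""
    votes = Counter(argmax(p) for p in probas)
    return votes.most_common(1)[0][0]


print("soft:", soft_vote(tree_probas))
print("hard:", hard_vote(tree_probas))
```

Here two of the three trees vote for class 1, so the hard vote picks class 1, while the averaged probability of class 0 is 1.7 / 3, so the soft vote picks class 0.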

The predict documentation is misleading at best; I've submitted a pull request to fix it.

If you want majority-vote prediction instead, here is a function that does it. Call it as predict_majvote(clf, X) rather than clf.predict(X). (Based on predict_proba; only lightly tested, but I think it should work.)
import numpy as np
from scipy.stats import mode
# Private helpers as of scikit-learn 0.16.1; these import paths may differ
# in other versions.
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted


def predict_majvote(forest, X):
    """Predict class for X.

    Uses majority voting, rather than the soft voting scheme
    used by RandomForestClassifier.predict.

    Parameters
    ----------
    forest : fitted RandomForestClassifier
        The forest whose trees cast the votes.
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes.
    """
    check_is_fitted(forest, 'n_outputs_')

    # Check data
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")

    # Assign chunk of trees to jobs
    n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
                                                    forest.n_jobs)

    # Parallel loop: collect the hard prediction of every tree
    all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
                         backend="threading")(
        delayed(_parallel_helper)(e, 'predict', X, check_input=False)
        for e in forest.estimators_)

    # Reduce: most common prediction across trees for each sample.
    # The trees predict encoded class indices as floats, so drop any
    # leading length-1 axis returned by mode and cast to int before
    # indexing into forest.classes_.
    modes, counts = mode(all_preds, axis=0)
    modes = np.asarray(modes).reshape(all_preds[0].shape).astype(int)

    if forest.n_outputs_ == 1:
        return forest.classes_.take(modes, axis=0)
    else:
        n_samples = all_preds[0].shape[0]
        # classes_ is a list with one array per output here; this assumes
        # all outputs share the dtype of the first one.
        preds = np.zeros((n_samples, forest.n_outputs_),
                         dtype=forest.classes_[0].dtype)
        for k in range(forest.n_outputs_):
            preds[:, k] = forest.classes_[k].take(modes[:, k], axis=0)
        return preds
On the dumb synthetic case I tried, its predictions agreed with the predict method every time.

Source: https://stats.stackexchange.com/questions/127077
