How does `predict.randomForest` estimate class probabilities?

July 28, 2016

How does the randomForest package estimate class probabilities when I use predict(model, data, type = "prob")?

I was using ranger to train random forests, with the argument probability = T to predict probabilities. The ranger documentation says:

Grow a probability forest as in Malley et al. (2012).

I simulated some data, tried both packages, and obtained very different results (see the code below).

So I know that randomForest uses a different technique (than ranger) to estimate the probabilities. But which one?

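library(magrittr)  # provides the %>% pipe used further down

# Simulate n rows: 10 uniform predictors X1..X10 and a binary factor Y whose
# true probability of being 1 is pnorm(sum of the predictors, mean = 5).
simulate_data <- function(n) {
  X <- data.frame(matrix(runif(n * 10), ncol = 10))
  prob <- pnorm(rowSums(X), mean = 5)
  Y <- data.frame(Y = as.factor(rbinom(n, size = 1, prob = prob)))
  dplyr::bind_cols(X, Y)
}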

treino <- simulate_data(10000)
teste <- simulate_data(10000)

library(ranger)
library(randomForest)

modelo_ranger <- ranger(Y ~ ., data = treino,
                        num.trees = 100,
                        mtry = floor(sqrt(10)),
                        write.forest = TRUE,
                        min.node.size = 100,
                        probability = TRUE)

modelo_randomForest <- randomForest(Y ~ ., data = treino,
                                    ntree = 100,
                                    mtry = floor(sqrt(10)),
                                    nodesize = 100)

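library(ggplot2)

# Probability of class "1" from each model. Columns are selected by name so
# that both vectors refer to the same class regardless of column order.
pred_ranger <- predict(modelo_ranger, teste)$predictions[, "1"]
pred_randomForest <- predict(modelo_randomForest, teste, type = "prob")[, "1"]

# True probability used to generate Y in simulate_data()
prob_real <- apply(teste[, 1:10], 1, sum) %>% pnorm(mean = 5)

# One panel per package: estimated probability (x) against true probability (y)
data.frame(prob_real, pred_ranger, pred_randomForest) %>%
  tidyr::gather(pacote, prob, -prob_real) %>%
  ggplot(aes(x = prob, y = prob_real)) +
  geom_point(size = 0.1) +
  facet_wrap(~pacote)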

It is simply the proportion of votes cast by the trees in the ensemble.

library(randomForest)

rf = randomForest(Species~., data = iris, norm.votes = TRUE, proximity = TRUE)
p1 = predict(rf, iris, type = "prob")
p2 = predict(rf, iris, type = "vote", norm.votes = TRUE)

identical(p1,p2)
#[1] TRUE
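
The vote fraction can also be recomputed by hand from the per-tree predictions, which `predict` returns when called with `predict.all = TRUE`. The following is a minimal sketch on iris; the object names (`rf_demo`, `per_tree`, `manual`, `prob`) and the seed are illustrative choices, not part of the original answer.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 500)

# predict.all = TRUE returns a list; $individual holds one column of
# predicted class labels per tree.
per_tree <- predict(rf_demo, iris, predict.all = TRUE)$individual

# Fraction of trees voting for each class, computed row by row.
manual <- sapply(levels(iris$Species), function(cl) rowMeans(per_tree == cl))

# Probabilities as reported by type = "prob"
prob <- predict(rf_demo, iris, type = "prob")

# Largest absolute difference between the two; it should be zero up to
# floating-point error if "prob" is just the vote fraction.
max(abs(manual - unclass(prob)))
```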

Alternatively, if you multiply the probabilities by ntree, you get the same result, but now as counts instead of proportions.

p1 = predict(rf, iris, type = "prob")
p2 = predict(rf, iris, type = "vote", norm.votes = FALSE)

identical(500*p1,p2)
#[1] TRUE
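
As for why ranger gives different numbers, the gap likely comes down to what each tree contributes. Below is a minimal base-R sketch, assuming (as I understand Malley et al., 2012) that a probability forest averages the class proportion found in each tree's terminal node, whereas the vote approach lets every tree cast one hard vote. The values in `node_prop` are made up for illustration and do not come from a fitted model.

```r
# Hypothetical class-1 proportions in the terminal node that a single
# observation falls into, one value per tree (made-up numbers).
node_prop <- c(0.55, 0.60, 0.52, 0.45, 0.58)

# Vote-based estimate: each tree votes for the majority class of its node,
# and the estimate is the fraction of trees voting for class 1.
vote_fraction <- mean(node_prop > 0.5)

# Probability-forest style estimate: average the node proportions directly.
avg_proportion <- mean(node_prop)

c(vote_fraction = vote_fraction, avg_proportion = avg_proportion)
# vote_fraction = 0.80, avg_proportion = 0.54
```

With large terminal nodes, such as nodesize = 100 in the question, the nodes are rarely pure, so hard votes can push the estimate toward 0 or 1 while the averaged proportions stay closer to the node-level frequencies. That would be consistent with the two packages producing visibly different plots.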

Source: https://stats.stackexchange.com/questions/226109
