R

How does predict.randomForest estimate class probabilities?

  • July 28, 2016

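How does the randomForest package estimate class probabilities when I use predict(model, data, type = "prob")?

I was using ranger to train random forests, with the argument probability = T to predict probabilities. The ranger documentation says:

Grow a probability forest as in Malley et al. (2012).

I simulated some data, tried both packages, and obtained very different results (see the code below).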

So I know that randomForest uses a different technique than ranger to estimate the probabilities. But which one?

library(magrittr)      # provides the %>% pipe
library(ggplot2)
library(ranger)
library(randomForest)

simulate_data <- function(n){
 # 10 independent uniform predictors
 X <- data.frame(matrix(runif(n*10), ncol = 10))
 # true P(Y = 1) is pnorm(row sum of X, mean = 5)
 Y <- data.frame(Y = rbinom(n, size = 1, prob = apply(X, 1, sum) %>%
                              pnorm(mean = 5)
                            ) %>%
                   as.factor()
 )
 dplyr::bind_cols(X, Y)
}

treino <- simulate_data(10000)   # training set
teste <- simulate_data(10000)    # test set

modelo_ranger <- ranger(Y ~., data = treino,
                               num.trees = 100,
                               mtry = floor(sqrt(10)),
                               write.forest = TRUE,
                               min.node.size = 100,
                               probability = TRUE
                               )

modelo_randomForest <- randomForest(Y ~., data = treino,
                                   ntree = 100,
                                   mtry = floor(sqrt(10)),
                                   nodesize = 100
                                   )

# select the class "1" column by name so both predictions refer to P(Y = 1)
pred_ranger <- predict(modelo_ranger, teste)$predictions[, "1"]
pred_randomForest <- predict(modelo_randomForest, teste, type = "prob")[, "1"]
prob_real <- apply(teste[,1:10], 1, sum) %>% pnorm(mean = 5)

# true probability against predicted probability, one panel per package
data.frame(prob_real, pred_ranger, pred_randomForest) %>%
 tidyr::gather(pacote, prob, -prob_real) %>%
 ggplot(aes(x = prob, y = prob_real)) + geom_point(size = 0.1) + facet_wrap(~pacote)

It is simply the proportion of trees in the ensemble that vote for each class.

library(randomForest)

rf = randomForest(Species~., data = iris, norm.votes = TRUE, proximity = TRUE)
p1 = predict(rf, iris, type = "prob")
p2 = predict(rf, iris, type = "vote", norm.votes = TRUE)

identical(p1,p2)
#[1] TRUE
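
The vote proportion can also be tallied by hand as a cross-check. The following is a minimal sketch, assuming the randomForest package is installed. It relies on the predict.all argument of predict.randomForest, which returns every tree's individual prediction. The names tree_pred, manual_prob and max_diff are illustrative only, and the seed is arbitrary.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only for reproducibility
rf <- randomForest(Species ~ ., data = iris)

# predict.all = TRUE also returns each tree's predicted class
# (one row per observation, one column per tree)
tree_pred <- predict(rf, iris, predict.all = TRUE)$individual

# share of trees voting for each class, per observation
manual_prob <- sapply(levels(iris$Species),
                      function(cl) rowMeans(tree_pred == cl))

# compare with the built-in probability output
p_prob <- predict(rf, iris, type = "prob")
max_diff <- max(abs(manual_prob - p_prob))
```

If the probabilities are indeed vote shares, max_diff should be zero up to floating-point error.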


Alternatively, multiplying the probabilities by ntree (500 by default) reproduces the raw vote counts instead of proportions.

p1 = predict(rf, iris, type = "prob")
p2 = predict(rf, iris, type = "vote", norm.votes = FALSE)

identical(500*p1,p2)
#[1] TRUE
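
One plausible reason the two packages disagree in the question can be sketched with made-up numbers. A vote proportion counts each tree's hard class decision, whereas a probability forest in the sense of Malley et al. (2012), which the ranger documentation cites, is understood to average the class proportions found in the terminal nodes. The leaf proportions below are hypothetical and are not output from either package.

```r
# Hypothetical share of class "1" training cases in the terminal node
# that a single observation reaches in each of three trees
leaf_prop <- c(0.60, 0.55, 0.40)

# Voting: each tree casts one vote for its majority class,
# and the estimate is the share of trees voting "1"
vote_fraction <- mean(leaf_prop > 0.5)

# Probability-forest style (assumed): average the leaf proportions directly
avg_leaf_prop <- mean(leaf_prop)

print(c(vote = vote_fraction, averaged = avg_leaf_prop))
```

With large terminal nodes, such as nodesize = 100 in the question, leaves are rarely pure, so the two estimates can differ noticeably.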

Source: https://stats.stackexchange.com/questions/226109
