隨機森林回歸預測不高於訓練數據

September 15, 2016

我注意到在構建隨機森林回歸模型時，至少在中R，預測值永遠不會超過訓練數據中看到的目標變量的最大值。例如，請參見下面的代碼。我正在建立一個回歸模型來mpg根據mtcars數據進行預測。我建立了 OLS 和隨機森林模型，並使用它們來預測mpg一輛應該具有非常好的燃油經濟性的假設汽車。正如預期的那樣， OLS 預測一個高mpg的，但隨機森林沒有。我在更複雜的模型中也注意到了這一點。為什麼是這樣？
> library(datasets)
> library(randomForest)
> 
> data(mtcars)
> max(mtcars$mpg)
[1] 33.9
> 
> set.seed(2)
> fit1 <- lm(mpg~., data=mtcars) #OLS fit
> fit2 <- randomForest(mpg~., data=mtcars) #random forest fit
> 
> #Hypothetical car that should have very high mpg
> hypCar <- data.frame(cyl=4, disp=50, hp=40, drat=5.5, wt=1, qsec=24, vs=1, am=1, gear=4, carb=1)
> 
> predict(fit1, hypCar) #OLS predicts higher mpg than max(mtcars$mpg)
     1 
37.2441 
> predict(fit2, hypCar) #RF does not predict higher mpg than max(mtcars$mpg)
      1 
30.78899 

正如在之前的答案中已經提到的那樣，回歸/回歸樹的隨機森林不會產生超出訓練數據范圍範圍的數據點的預期預測，因為它們無法推斷（很好）。回歸樹由節點的層次結構組成，其中每個節點指定要對屬性值執行的測試，每個葉（終端）節點指定計算預測輸出的規則。在您的情況下，測試觀察通過樹流向葉節點，例如“如果 x > 335，則 y = 15”，然後由隨機森林平均。

這是一個 R 腳本，使用隨機森林和線性回歸來可視化情況。在隨機森林的情況下，對於低於最低訓練數據 x 值或高於最高訓練數據 x 值的測試數據點的預測是恆定的。

library(datasets)
library(randomForest)
library(ggplot2)
library(ggthemes)

# Import mtcars (Motor Trend Car Road Tests) dataset
data(mtcars)

# Define training data
train_data = data.frame(
   x = mtcars$hp,  # Gross horsepower
   y = mtcars$qsec)  # 1/4 mile time

# Train random forest model for regression
random_forest <- randomForest(x = matrix(train_data$x),
                             y = matrix(train_data$y), ntree = 20)
# Train linear regression model using ordinary least squares (OLS) estimator
linear_regr <- lm(y ~ x, train_data)

# Create testing data
test_data = data.frame(x = seq(0, 400))

# Predict targets for testing data points
test_data$y_predicted_rf <- predict(random_forest, matrix(test_data$x)) 
test_data$y_predicted_linreg <- predict(linear_regr, test_data)

# Visualize
ggplot2::ggplot() + 
   # Training data points
   ggplot2::geom_point(data = train_data, size = 2,
                       ggplot2::aes(x = x, y = y, color = "Training data")) +
   # Random forest predictions
   ggplot2::geom_line(data = test_data, size = 2, alpha = 0.7,
                      ggplot2::aes(x = x, y = y_predicted_rf,
                                   color = "Predicted with random forest")) +
   # Linear regression predictions
   ggplot2::geom_line(data = test_data, size = 2, alpha = 0.7,
                      ggplot2::aes(x = x, y = y_predicted_linreg,
                                   color = "Predicted with linear regression")) +
   # Hide legend title, change legend location and add axis labels
   ggplot2::theme(legend.title = element_blank(),
                  legend.position = "bottom") + labs(y = "1/4 mile time",
                                                     x = "Gross horsepower") +
   ggthemes::scale_colour_colorblind()

引用自：https://stats.stackexchange.com/questions/235189

隨機森林回歸預測不高於訓練數據

相關問答

我們真的在線性回歸的第一步中取隨機線嗎？

為什麼指數族在統計學中如此重要？

年齡部分作為連續變量，部分作為分類變量

可以使用多項式邏輯回歸學習任何數據嗎

為什麼不鼓勵使用高階多項式進行回歸？

在訓練數據中添加噪聲如何等同於正則化？