Cross-Validation

Implementing nested cross-validation

  • February 4, 2015

I am trying to work out whether my understanding of nested cross-validation is correct, so I wrote this toy example to check:

import operator
import numpy as np
from sklearn import ensemble
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold

# set random state
state = 1

# load boston dataset
boston = load_boston()

X = boston.data
y = boston.target

outer_scores = []

# outer cross-validation
outer = KFold(n_splits=3, shuffle=True, random_state=state)
for fold, (train_index_outer, test_index_outer) in enumerate(outer.split(X)):
    X_train_outer, X_test_outer = X[train_index_outer], X[test_index_outer]
    y_train_outer, y_test_outer = y[train_index_outer], y[test_index_outer]

    inner_mean_scores = []

    # define explored parameter space.
    # procedure below should be equal to GridSearchCV
    tuned_parameter = [1000, 1100, 1200]
    for param in tuned_parameter:

        inner_scores = []

        # inner cross-validation
        inner = KFold(n_splits=3, shuffle=True, random_state=state)
        for train_index_inner, test_index_inner in inner.split(X_train_outer):
            # split the training data of outer CV
            X_train_inner, X_test_inner = X_train_outer[train_index_inner], X_train_outer[test_index_inner]
            y_train_inner, y_test_inner = y_train_outer[train_index_inner], y_train_outer[test_index_inner]

            # fit extremely randomized trees regressor to training data of inner CV
            clf = ensemble.ExtraTreesRegressor(n_estimators=param, n_jobs=-1, random_state=1)
            clf.fit(X_train_inner, y_train_inner)
            inner_scores.append(clf.score(X_test_inner, y_test_inner))

        # calculate mean score for inner folds
        inner_mean_scores.append(np.mean(inner_scores))

    # get maximum score index
    index, value = max(enumerate(inner_mean_scores), key=operator.itemgetter(1))

    print('Best parameter of %i fold: %i' % (fold + 1, tuned_parameter[index]))

    # fit the selected model to the training set of outer CV
    # for prediction error estimation
    clf2 = ensemble.ExtraTreesRegressor(n_estimators=tuned_parameter[index], n_jobs=-1, random_state=1)
    clf2.fit(X_train_outer, y_train_outer)
    outer_scores.append(clf2.score(X_test_outer, y_test_outer))

# show the prediction error estimate produced by nested CV
print('Unbiased prediction error: %.4f' % np.mean(outer_scores))

# finally, fit the selected model to the whole dataset
clf3 = ensemble.ExtraTreesRegressor(n_estimators=tuned_parameter[index], n_jobs=-1, random_state=1)
clf3.fit(X, y)

Any thoughts are appreciated.

Oops, the code is wrong, but in a very subtle way!

a) Splitting each training set into an inner training set and an inner test set is fine.

b) The problem is in the last two lines, which reflect a subtle misunderstanding of the purpose of nested cross-validation. The purpose of nested CV is not to select the parameters, but to obtain an unbiased estimate of the expected accuracy of the algorithm, in this case ensemble.ExtraTreesRegressor with the best hyperparameters, whatever those may happen to be.

And that is what your code computes correctly:

   print('Unbiased prediction error: %.4f' % np.mean(outer_scores))

It uses nested CV to compute an unbiased estimate of the classifier's prediction error. But note that each pass of the outer loop may produce a different best hyperparameter, as you were aware when you wrote this line:

  print('Best parameter of %i fold: %i' % (fold + 1, tuned_parameter[index]))
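Incidentally, the comment in the question's code, "procedure below should be equal to GridSearchCV", can be made literal: with scikit-learn's model_selection API, the whole hand-rolled nested procedure collapses into a GridSearchCV wrapped in cross_val_score. A minimal sketch, using a small synthetic dataset and an illustrative grid (not the 1000/1100/1200 values above) so that it runs quickly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# small synthetic regression problem standing in for the Boston data
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=1)

# illustrative grid; the post tunes n_estimators over [1000, 1100, 1200]
param_grid = {"n_estimators": [10, 20, 30]}

inner = KFold(n_splits=3, shuffle=True, random_state=1)
outer = KFold(n_splits=3, shuffle=True, random_state=1)

# GridSearchCV plays the role of the inner loop (parameter selection),
# cross_val_score the role of the outer loop (error estimation)
search = GridSearchCV(ExtraTreesRegressor(random_state=1), param_grid, cv=inner)
outer_scores = cross_val_score(search, X, y, cv=outer)

print("Unbiased prediction error: %.4f" % outer_scores.mean())
```

Each of the three outer scores comes from a model whose n_estimators was chosen on that fold's training data alone, which is exactly what the explicit double loop above does by hand.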

So now you need one standard CV loop to select the final best hyperparameter, using all the folds:

tuned_parameter = [1000, 1100, 1200]
mean_scores = []
for param in tuned_parameter:

    scores = []

    # normal cross-validation
    kfolds = KFold(n_splits=3, shuffle=True, random_state=state)
    for train_index, test_index in kfolds.split(X):
        # split the training data
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # fit extremely randomized trees regressor to training data
        clf2_5 = ensemble.ExtraTreesRegressor(n_estimators=param, n_jobs=-1, random_state=1)
        clf2_5.fit(X_train, y_train)
        scores.append(clf2_5.score(X_test, y_test))

    # calculate mean score for folds
    mean_scores.append(np.mean(scores))

# get maximum score index
index, value = max(enumerate(mean_scores), key=operator.itemgetter(1))

print('Best parameter : %i' % tuned_parameter[index])

This is your code with the references to the inner loop removed.

Now the best parameter is tuned_parameter[index], and you can train the final classifier clf3 on the whole dataset as in your code.
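This final selection-plus-refit step can also be done with a single GridSearchCV. A sketch under the same assumptions as before (synthetic data, toy grid); refit=True retrains the best model on the whole dataset, playing the role of clf3:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold

# small synthetic regression problem standing in for the Boston data
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=1)

kfolds = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(
    ExtraTreesRegressor(random_state=1),
    {"n_estimators": [10, 20, 30]},  # illustrative values, not the post's grid
    cv=kfolds,
    refit=True,  # refit the best model on all of X, y, like clf3
)
search.fit(X, y)

print("Best parameter : %i" % search.best_params_["n_estimators"])
```

After fitting, search.best_estimator_ is the model trained on the full dataset with the winning parameter, so no separate clf3 step is needed.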

Quoted from: https://stats.stackexchange.com/questions/136296
