Cross-Validation
Implementing nested cross-validation

I am trying to figure out whether my understanding of nested cross-validation is correct, so I wrote this toy example to check:
```python
import operator

import numpy as np
from sklearn import ensemble, model_selection
from sklearn.datasets import load_boston

# set random state
state = 1

# load the boston dataset
boston = load_boston()
X = boston.data
y = boston.target

outer_scores = []

# outer cross-validation
outer = model_selection.KFold(n_splits=3, shuffle=True, random_state=state)
for fold, (train_index_outer, test_index_outer) in enumerate(outer.split(X)):
    X_train_outer, X_test_outer = X[train_index_outer], X[test_index_outer]
    y_train_outer, y_test_outer = y[train_index_outer], y[test_index_outer]

    inner_mean_scores = []

    # define the explored parameter space;
    # the procedure below should be equal to GridSearchCV
    tuned_parameter = [1000, 1100, 1200]
    for param in tuned_parameter:

        inner_scores = []

        # inner cross-validation
        inner = model_selection.KFold(n_splits=3, shuffle=True, random_state=state)
        for train_index_inner, test_index_inner in inner.split(X_train_outer):
            # split the training data of the outer CV
            X_train_inner, X_test_inner = X_train_outer[train_index_inner], X_train_outer[test_index_inner]
            y_train_inner, y_test_inner = y_train_outer[train_index_inner], y_train_outer[test_index_inner]

            # fit an extremely randomized trees regressor to the training data of the inner CV
            clf = ensemble.ExtraTreesRegressor(n_estimators=param, n_jobs=-1, random_state=1)
            clf.fit(X_train_inner, y_train_inner)
            inner_scores.append(clf.score(X_test_inner, y_test_inner))

        # calculate the mean score over the inner folds
        inner_mean_scores.append(np.mean(inner_scores))

    # get the index of the maximum score
    index, value = max(enumerate(inner_mean_scores), key=operator.itemgetter(1))
    print('Best parameter of %i fold: %i' % (fold + 1, tuned_parameter[index]))

    # fit the selected model to the training set of the outer CV
    # for prediction error estimation
    clf2 = ensemble.ExtraTreesRegressor(n_estimators=tuned_parameter[index], n_jobs=-1, random_state=1)
    clf2.fit(X_train_outer, y_train_outer)
    outer_scores.append(clf2.score(X_test_outer, y_test_outer))

# show the prediction error estimate produced by nested CV
print('Unbiased prediction error: %.4f' % (np.mean(outer_scores)))

# finally, fit the selected model to the whole dataset
clf3 = ensemble.ExtraTreesRegressor(n_estimators=tuned_parameter[index], n_jobs=-1, random_state=1)
clf3.fit(X, y)
```
Any thoughts are appreciated.
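The fold bookkeeping in the code above can be checked without scikit-learn. Below is a minimal pure-Python sketch of the same nested structure; the `k_fold` helper is an illustrative stand-in (not part of the original code) for `KFold`. The key invariant it verifies is that every inner split is drawn only from the outer training indices, never from the outer test fold:

```python
import random

def k_fold(n, k, seed):
    """Yield (train_idx, test_idx) pairs for k shuffled folds over range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

n_samples = 30
outer_test_union = set()
for train_outer, test_outer in k_fold(n_samples, 3, seed=1):
    outer_test_union |= set(test_outer)
    # the inner CV re-splits ONLY the outer training portion;
    # inner indices are positions into train_outer, so map them back
    for train_inner, test_inner in k_fold(len(train_outer), 3, seed=1):
        inner_points = {train_outer[i] for i in train_inner + test_inner}
        # the inner folds never touch the outer test fold
        assert inner_points.isdisjoint(test_outer)

# every sample lands in exactly one outer test fold
assert outer_test_union == set(range(n_samples))
print('fold bookkeeping OK')
```

This is exactly why the inner `KFold` in the toy example is run on `X_train_outer` rather than on `X`: the outer test fold must stay unseen by the model-selection step.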
Oops, the code is wrong, but in a very subtle way!
a) Splitting the training set into an inner training set and an inner test set is fine.
b) The problem is the last two lines, which reflect a subtle misunderstanding of the purpose of nested cross-validation. The purpose of nested CV is not to select the hyper-parameters, but to produce an unbiased estimate of the expected accuracy of your algorithm, in this case `ensemble.ExtraTreesRegressor` on this data, with the best hyper-parameters, whatever those may be. And that is what your code correctly computes:
```python
print('Unbiased prediction error: %.4f' % (np.mean(outer_scores)))
```
It uses nested CV to compute an unbiased estimate of your classifier's prediction error. But notice that each pass of the outer loop may select a different best hyper-parameter, as you already knew when you wrote:
```python
print('Best parameter of %i fold: %i' % (fold + 1, tuned_parameter[index]))
```
So now you need an additional, standard CV loop to select the final best hyper-parameter, using all the folds:
```python
tuned_parameter = [1000, 1100, 1200]
mean_scores = []
for param in tuned_parameter:

    scores = []

    # normal cross-validation
    kfolds = model_selection.KFold(n_splits=3, shuffle=True, random_state=state)
    for train_index, test_index in kfolds.split(X):
        # split the training data
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # fit an extremely randomized trees regressor to the training data
        clf2_5 = ensemble.ExtraTreesRegressor(n_estimators=param, n_jobs=-1, random_state=1)
        clf2_5.fit(X_train, y_train)
        scores.append(clf2_5.score(X_test, y_test))

    # calculate the mean score over the folds
    mean_scores.append(np.mean(scores))

# get the index of the maximum score
index, value = max(enumerate(mean_scores), key=operator.itemgetter(1))
print('Best parameter: %i' % (tuned_parameter[index]))
```

(Note that `mean_scores` must be initialized to an empty list before the loop, which the code above does.)
It is your code, but with the references to the inner loop removed.
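The selection step boils down to "average the per-fold scores of each candidate, then take the argmax". Here is a tiny self-contained sketch of just that step; the function name `pick_best` and the score values are illustrative, not from the original code:

```python
def pick_best(candidates, fold_scores):
    """fold_scores[param] is the list of per-fold scores for that candidate.
    Returns (best_param, best_mean_score)."""
    means = {p: sum(s) / len(s) for p, s in fold_scores.items()}
    best = max(candidates, key=means.get)
    return best, means[best]

# hypothetical R^2 scores from a 3-fold CV for each candidate n_estimators
scores = {
    1000: [0.84, 0.86, 0.85],
    1100: [0.85, 0.87, 0.86],
    1200: [0.85, 0.86, 0.85],
}
best, mean = pick_best([1000, 1100, 1200], scores)
print(best)  # -> 1100
```

This is the same logic as the `max(enumerate(mean_scores), key=operator.itemgetter(1))` line, just written over a dict instead of parallel lists.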
The best parameter is now `tuned_parameter[index]`, and you can learn the final classifier `clf3` on the whole dataset exactly as in your code.