The Jupyter notebook for this article: https://nbviewer.jupyter.org/github/wangjs-jacky/Jupyter-notebook/blob/master/00_API_document/交叉验证.ipynb
Note: this article is titled "data splitting", while the Jupyter notebook is titled "cross-validation", which is not quite accurate. Cross-validation requires that the test folds do not overlap, so cross-validation always relies on K-Fold style splitting.
train_test_split
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42, shuffle=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, shuffle=True)
Parameter explanations:
- **train_data:** the sample features to be split
- **train_target:** the sample labels (targets) to be split
- **test_size:** the proportion of the test set; if given as an integer, it is the absolute number of test samples
- **random_state:** the random seed
- **shuffle:** whether to shuffle the data before splitting
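The snippets above assume that data, X, and Y already exist. A minimal self-contained sketch with made-up toy arrays (purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 80% / 20% split, shuffled, reproducible thanks to the fixed seed.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, shuffle=True)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```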
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
Splitting principles:
- by whether the test sets of different splits may overlap
- by explicitly specified labels or sizes
  - Group
  - train_size / test_size
- by keeping the class labels balanced
  - StratifiedKFold / StratifiedShuffleSplit
- others
KFold
Provides train/test indices to split the data into training and test sets. The dataset is split into k consecutive folds (without shuffling by default).
Each fold is used once as the validation set, while the remaining k-1 folds form the training set. By default the folds are consecutive (consecutive folds), e.g. [1,2], [3,4], [5,6]. This can be a problem: all the samples in the first test fold then share the same label, which StratifiedKFold solves.
In [13]: kf = KFold(n_splits=3)
    ...: print(kf.get_n_splits(X))
    ...: print(kf)
    ...: for train_index, test_index in kf.split(X):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
3
KFold(n_splits=3, random_state=None, shuffle=False)
TRAIN: [2 3 4 5] TEST: [0 1]
TRAIN: [0 1 4 5] TEST: [2 3]
TRAIN: [0 1 2 3] TEST: [4 5]
Looking at the TEST indices, note that every sample ends up in a test set exactly once.
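Besides StratifiedKFold, the consecutive-fold problem can also be mitigated by letting KFold shuffle the sample order before cutting it into folds (shuffle is a standard KFold parameter; the seed below is arbitrary). A minimal sketch, reusing the X defined above:

```python
from sklearn.model_selection import KFold

# Shuffle the indices before splitting; each sample is still used
# as a test sample exactly once, but the folds are no longer
# consecutive blocks of same-label samples.
kf_shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
for train_index, test_index in kf_shuffled.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

Note that shuffling only randomizes the folds; it does not guarantee class balance inside each fold the way StratifiedKFold does.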
GroupKFold
The groups argument defines how the samples are grouped.
For example, here the first four samples are treated as one group, and each of the last two samples is its own group.
So there are 3 folds by construction, and the test folds do not overlap.
In [14]: group_kfold = GroupKFold(n_splits=3)
    ...: groups = np.array([0, 0, 0, 0, 1, 2])
    ...: print(group_kfold.get_n_splits(X, y, groups))
    ...: print(group_kfold)
    ...: for train_index, test_index in group_kfold.split(X, y, groups):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
3
GroupKFold(n_splits=3)
TRAIN: [4 5] TEST: [0 1 2 3]
TRAIN: [0 1 2 3 4] TEST: [5]
TRAIN: [0 1 2 3 5] TEST: [4]
ShuffleSplit
Unlike KFold, the test sets of different splits may contain the same samples (the splits can overlap).
Typical use case: with ShuffleSplit you can repeat the evaluation many times and average the results, which smooths the resulting curve.
In [15]: rs = ShuffleSplit(n_splits=3, test_size=1/3, random_state=3)
    ...: print(rs.get_n_splits(X))
    ...: print(rs)
    ...: for train_index, test_index in rs.split(X):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
3
ShuffleSplit(n_splits=3, random_state=3, test_size=0.3333333333333333, train_size=None)
TRAIN: [4 1 0 2] TEST: [3 5]
TRAIN: [1 2 3 0] TEST: [5 4]
TRAIN: [3 4 0 2] TEST: [5 1]
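For the "average over many random splits" use case mentioned above, a ShuffleSplit object can be passed as the cv argument of cross_val_score. A minimal sketch, where the iris dataset and LogisticRegression are only illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)

# 20 random splits -> 20 scores; their mean is far more stable
# than the score of a single train/test split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_iris, y_iris, cv=cv)
print(scores.mean(), scores.std())
```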
StratifiedKFold
This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class; in other words, stratification means every fold contains roughly the same class proportions as the full dataset.
Given y = np.array([0, 0, 0, 1, 1, 1]),
the samples of each class are at indices [0,1,2] and [3,4,5].
You can see that the classes are balanced in both the training and test sets.
In [16]: skf = StratifiedKFold(n_splits=3)
    ...: print(skf.get_n_splits(X, y))
    ...: print(skf)
    ...: for train_index, test_index in skf.split(X, y):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
3
StratifiedKFold(n_splits=3, random_state=None, shuffle=False)
TRAIN: [1 2 4 5] TEST: [0 3]
TRAIN: [0 2 3 5] TEST: [1 4]
TRAIN: [0 1 3 4] TEST: [2 5]
StratifiedShuffleSplit
Samples of each class are again at indices [0,1,2] and [3,4,5].
The difference from StratifiedKFold is that test sets from different splits may overlap, whereas K-Fold does not allow any overlap.
In [17]: sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
    ...: sss.get_n_splits(X, y)
    ...: print(sss)
    ...: for train_index, test_index in sss.split(X, y):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
StratifiedShuffleSplit(n_splits=5, random_state=0, test_size=0.5, train_size=None)
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]
GroupShuffleSplit
Splits according to the specified groups: all samples of a group go entirely into either the training set or the test set, and unlike GroupKFold the same test group may appear in several splits.
In [18]: import numpy as np
    ...: from sklearn.model_selection import GroupShuffleSplit
    ...: X = np.ones(shape=(8, 2))
    ...: y = np.ones(shape=(8, 1))
    ...: groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])
    ...: print(groups.shape)
    ...: gss = GroupShuffleSplit(n_splits=15, train_size=.7, random_state=42)
    ...: gss.get_n_splits()
    ...: for train_idx, test_idx in gss.split(X, y, groups):
    ...:     print("TRAIN:", train_idx, "TEST:", test_idx)
(8,)
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [0 1 2 3 4] TEST: [5 6 7]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 2 3 4] TEST: [5 6 7]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 2 3 4] TEST: [5 6 7]
Available methods:
- get_n_splits(self[, X, y, groups])
- split(self, X[, y, groups])

The split method partitions the dataset and returns index arrays (not the data itself).
In [19]: import numpy as np
    ...: from sklearn.model_selection import KFold
    ...: X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    ...: y = np.array([1, 2, 3, 4])
    ...: kf = KFold(n_splits=2)
    ...: kf.get_n_splits(X)
    ...: print(kf)
    ...: for train_index, test_index in kf.split(X):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
    ...:     X_train, X_test = X[train_index], X[test_index]
    ...:     y_train, y_test = y[train_index], y[test_index]
    ...:     print(X_train)
KFold(n_splits=2, random_state=None, shuffle=False)
TRAIN: [2 3] TEST: [0 1]
[[5 6]
 [7 8]]
TRAIN: [0 1] TEST: [2 3]
[[1 2]
 [3 4]]
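Because split only returns indices, the same pattern scales up to a full manual cross-validation loop. A minimal sketch, where the iris dataset and LogisticRegression are again just illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X_iris, y_iris = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_index, test_index in skf.split(X_iris, y_iris):
    # Index into the arrays with the returned index arrays.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_iris[train_index], y_iris[train_index])
    scores.append(model.score(X_iris[test_index], y_iris[test_index]))

print(sum(scores) / len(scores))
```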