Contents
  1. train_test_split
    1.1. Parameters
  • Splitting principles
    1. KFold
    2. GroupKFold
    3. ShuffleSplit
    4. StratifiedKFold
    5. StratifiedShuffleSplit
    6. GroupShuffleSplit
  • Available methods
  • Data splitting

    The Jupyter notebook for this post: https://nbviewer.jupyter.org/github/wangjs-jacky/Jupyter-notebook/blob/master/00_API_document/交叉验证.ipynb

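    Note: this post is titled "Data Splitting", while the Jupyter notebook is titled "Cross-Validation". The notebook title is not quite right: cross-validation requires that the test sets do not overlap, so cross-validation proper always uses the K-Fold family.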

    train_test_split

    from sklearn.model_selection import train_test_split

    train_set, test_set = train_test_split(data, test_size=0.2, random_state=42, shuffle=True)

    # Or, to split features and labels together:
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, shuffle=True)

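    Parameters:

    • train_data: the feature array to split (X in the call above, passed positionally)
    • train_target: the labels to split (Y in the call above, passed positionally)
    • test_size: a float between 0 and 1 is the fraction of samples placed in the test set; an integer is the absolute number of test samples
    • random_state: seed for the random number generator, so that the split is reproducible
    • shuffle: whether to shuffle the data before splitting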
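
The parameters above can be checked on a small array. This is only a sketch: the toy data (10 samples with 2 features) is an assumption made for illustration and does not come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# Float test_size: fraction of the samples that go to the test set.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, shuffle=True
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Integer test_size: absolute number of test samples.
_, X_test_int, _, _ = train_test_split(X, Y, test_size=3, random_state=42)
print(X_test_int.shape)  # (3, 2)

# The same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(X, Y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True

# shuffle=False keeps the original order: the test set is the tail of the data.
_, _, _, Y_test_tail = train_test_split(X, Y, test_size=0.2, shuffle=False)
print(Y_test_tail)  # [8 9]
```
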

    The splitters below are demonstrated with the following imports and toy data:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.model_selection import ShuffleSplit
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GroupKFold
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])

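    Splitting principles:

    • By whether test sets may overlap across splits
      • K-Fold: test sets do not overlap
      • ShuffleSplit: test sets may overlap
    • By user-specified groups or sizes
      • Group
      • train_size / test_size
    • By class balance of the labels
      • StratifiedKFold / StratifiedShuffleSplit
      • Other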
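
The first principle, whether test sets overlap, can be made concrete by counting how often each sample lands in a test set. The sketch below uses the six-sample X from this post; the helper name count_test_uses is hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])


def count_test_uses(splitter, X):
    """Count how many times each sample index lands in a test set."""
    counts = np.zeros(len(X), dtype=int)
    for _, test_index in splitter.split(X):
        counts[test_index] += 1
    return counts


kfold_counts = count_test_uses(KFold(n_splits=3), X)
shuffle_counts = count_test_uses(
    ShuffleSplit(n_splits=3, test_size=1 / 3, random_state=3), X
)

# KFold: every sample is tested exactly once.
print("KFold:       ", kfold_counts)
# ShuffleSplit: each split is drawn independently, so a sample may be tested
# several times or never; only the total (3 splits x 2 samples) is fixed.
print("ShuffleSplit:", shuffle_counts, "total:", shuffle_counts.sum())
```
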

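    KFold

    Provides train/test indices to split data into training and test sets. The dataset is split into k consecutive folds (without shuffling by default).

    Each fold is used once for validation while the remaining k-1 folds form the training set. By default the folds are consecutive: with the six samples above, the test folds are samples (1, 2), (3, 4) and (5, 6), counting from 1. This can be a problem, because the first test fold contains only samples of the same class. StratifiedKFold addresses this.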

    In [13]: kf = KFold(n_splits=3) 
    ...:
    ...: print(kf.get_n_splits(X))
    ...: print(kf)
    ...:
    ...: for train_index, test_index in kf.split(X):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
    ...:
    3
    KFold(n_splits=3, random_state=None, shuffle=False)
    TRAIN: [2 3 4 5] TEST: [0 1]
    TRAIN: [0 1 4 5] TEST: [2 3]
    TRAIN: [0 1 2 3] TEST: [4 5]

    Looking at the TEST indices, every sample is used for testing exactly once.
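
To see why consecutive folds can be a problem, the labels inside each test fold can be printed. The sketch below reuses the X and y from this post. The shuffle=True variant at the end is an option of KFold that the post does not demonstrate; it is shown here only to confirm that shuffled folds still cover every sample once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Default KFold: consecutive folds, so the labels in each test fold simply
# follow the order of y. The first and last folds contain a single class.
consecutive = [y[test].tolist() for _, test in KFold(n_splits=3).split(X)]
print(consecutive)  # [[0, 0], [0, 1], [1, 1]]

# shuffle=True permutes the indices before cutting them into folds.
# The folds still do not overlap and still cover every sample exactly once.
kf = KFold(n_splits=3, shuffle=True, random_state=0)
test_indices = np.concatenate([test for _, test in kf.split(X)])
print(sorted(test_indices.tolist()))  # [0, 1, 2, 3, 4, 5]
```
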

    GroupKFold

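    The groups argument assigns each sample to a group, and a group is never divided between the training and the test set.

    For example, treat the first four samples as one group and each of the last two samples as its own group.

    With three groups and n_splits=3, there are 3 folds whose test sets do not overlap, each holding exactly one group.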

    In [14]: group_kfold = GroupKFold(n_splits=3) 
    ...: groups = np.array([0, 0, 0, 0, 1 ,2])
    ...: print(group_kfold.get_n_splits(X, y, groups))
    ...:
    ...: print(group_kfold)
    ...:
    ...: for train_index, test_index in group_kfold.split(X, y, groups):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
    ...:
    3
    GroupKFold(n_splits=3)
    TRAIN: [4 5] TEST: [0 1 2 3]
    TRAIN: [0 1 2 3 4] TEST: [5]
    TRAIN: [0 1 2 3 5] TEST: [4]
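
A quick way to confirm the grouping behaviour is to compare the group ids on each side of every split. This sketch reuses the data and groups from the session above. The last part asks for more folds than there are groups, which GroupKFold rejects with a ValueError.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
groups = np.array([0, 0, 0, 0, 1, 2])

shared_groups = []
for train_index, test_index in GroupKFold(n_splits=3).split(X, y, groups):
    train_groups = set(groups[train_index].tolist())
    test_groups = set(groups[test_index].tolist())
    # Groups present on both sides of the split; expected to be empty.
    shared_groups.append(train_groups & test_groups)
    print("train groups:", sorted(train_groups), "test groups:", sorted(test_groups))

print(all(len(shared) == 0 for shared in shared_groups))  # True

# n_splits cannot exceed the number of distinct groups (3 here).
too_many_splits_failed = False
try:
    list(GroupKFold(n_splits=4).split(X, y, groups))
except ValueError as err:
    too_many_splits_failed = True
    print("ValueError:", err)
```
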

    ShuffleSplit

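    Unlike KFold, the test sets of different splits can overlap: in the output below, sample 5 appears in all three test sets.

    Use case: evaluate over many random splits with ShuffleSplit and average the results, which gives a smoother curve.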

    In [15]: rs = ShuffleSplit(n_splits=3, test_size=1/3, random_state=3) 
    ...: print(rs.get_n_splits(X))
    ...: print(rs)
    ...:
    ...: for train_index, test_index in rs.split(X):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
    ...:
    3
    ShuffleSplit(n_splits=3, random_state=3, test_size=0.3333333333333333,
    train_size=None)
    TRAIN: [4 1 0 2] TEST: [3 5]
    TRAIN: [1 2 3 0] TEST: [5 4]
    TRAIN: [3 4 0 2] TEST: [5 1]
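
The averaging use case can be sketched with cross_val_score, which accepts a splitter through its cv argument. The iris dataset and the LogisticRegression model are assumptions chosen for illustration; they are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Dataset and model are placeholders; any estimator and data would do.
X_iris, y_iris = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 20 independent random splits, each holding out 30% of the samples.
cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
scores = cross_val_score(model, X_iris, y_iris, cv=cv)

print(scores.shape)  # (20,)
# Averaging over many random splits reduces the noise of any single split.
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

Because the number of splits is independent of the test size, n_splits can be raised to smooth the estimate further without shrinking the test sets.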

    StratifiedKFold

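    This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples of each class. Stratified here means that every fold has roughly the same class proportions as the full label set.

    Given y = np.array([0, 0, 0, 1, 1, 1]), the samples of the same class sit at indices [0, 1, 2] and [3, 4, 5].

    In the output below, every training set and every test set is balanced across the two classes.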

    In [16]: skf = StratifiedKFold(n_splits=3) 
    ...:
    ...: print(skf.get_n_splits(X, y))
    ...: print(skf)
    ...:
    ...: for train_index, test_index in skf.split(X, y):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
    ...:
    3
    StratifiedKFold(n_splits=3, random_state=None, shuffle=False)
    TRAIN: [1 2 4 5] TEST: [0 3]
    TRAIN: [0 2 3 5] TEST: [1 4]
    TRAIN: [0 1 3 4] TEST: [2 5]
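
The balance can be verified by counting the classes on each side of every split. This is a small sketch over the same X and y, using np.bincount to tally the labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

skf = StratifiedKFold(n_splits=3)
fold_counts = []
for train_index, test_index in skf.split(X, y):
    # np.bincount returns [number of class-0 samples, number of class-1 samples].
    train_counts = np.bincount(y[train_index]).tolist()
    test_counts = np.bincount(y[test_index]).tolist()
    fold_counts.append((train_counts, test_counts))
    print("train class counts:", train_counts, "test class counts:", test_counts)
# Every fold prints: train class counts: [2, 2] test class counts: [1, 1]
```
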

    StratifiedShuffleSplit

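    As before, the samples of the same class sit at indices [0, 1, 2] and [3, 4, 5].

    Unlike StratifiedKFold, the test sets of different splits can overlap, which K-Fold does not allow.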

    In [17]: sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
    ...:
    ...: sss.get_n_splits(X, y)
    ...:
    ...: print(sss)
    ...:
    ...: for train_index, test_index in sss.split(X, y):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
    ...:
    StratifiedShuffleSplit(n_splits=5, random_state=0, test_size=0.5,
    train_size=None)
    TRAIN: [5 2 3] TEST: [4 1 0]
    TRAIN: [5 1 4] TEST: [0 2 3]
    TRAIN: [5 0 2] TEST: [4 3 1]
    TRAIN: [4 1 0] TEST: [2 3 5]
    TRAIN: [0 5 1] TEST: [3 4 2]
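
Stratification is easier to see on imbalanced labels than on the six-sample example. The sketch below assumes a hypothetical label vector with 90 samples of class 0 and 10 of class 1; with test_size=0.2 the class ratio divides evenly, so each test set holds 18 and 2 samples respectively.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 2))  # feature values do not influence the split

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
test_class_counts = [
    np.bincount(y_imb[test_index], minlength=2).tolist()
    for _, test_index in sss.split(X_imb, y_imb)
]
# Each random test set keeps the 9:1 class ratio of the full data.
print(test_class_counts)  # [[18, 2], [18, 2], [18, 2], [18, 2], [18, 2]]
```
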

    GroupShuffleSplit

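    Splits according to the given groups: each group goes entirely into either the training set or the test set. The splits are drawn at random, so the same test set can appear more than once, as the output below shows.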

    In [18]: import numpy as np 
    ...: from sklearn.model_selection import GroupShuffleSplit
    ...: X = np.ones(shape=(8, 2))
    ...: y = np.ones(shape=(8, 1))
    ...: groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])
    ...: print(groups.shape)
    ...:
    ...: gss = GroupShuffleSplit(n_splits=15, train_size=.7, random_state=42)
    ...: gss.get_n_splits()
    ...:
    ...: for train_idx, test_idx in gss.split(X, y, groups):
    ...:     print("TRAIN:", train_idx, "TEST:", test_idx)
    ...:
    (8,)
    TRAIN: [2 3 4 5 6 7] TEST: [0 1]
    TRAIN: [0 1 5 6 7] TEST: [2 3 4]
    TRAIN: [2 3 4 5 6 7] TEST: [0 1]
    TRAIN: [0 1 5 6 7] TEST: [2 3 4]
    TRAIN: [2 3 4 5 6 7] TEST: [0 1]
    TRAIN: [0 1 5 6 7] TEST: [2 3 4]
    TRAIN: [0 1 5 6 7] TEST: [2 3 4]
    TRAIN: [0 1 2 3 4] TEST: [5 6 7]
    TRAIN: [2 3 4 5 6 7] TEST: [0 1]
    TRAIN: [0 1 2 3 4] TEST: [5 6 7]
    TRAIN: [2 3 4 5 6 7] TEST: [0 1]
    TRAIN: [2 3 4 5 6 7] TEST: [0 1]
    TRAIN: [0 1 5 6 7] TEST: [2 3 4]
    TRAIN: [2 3 4 5 6 7] TEST: [0 1]
    TRAIN: [0 1 2 3 4] TEST: [5 6 7]
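
In the output above, train_size=.7 yields training sets of 5 or 6 samples out of 8, because the proportion is applied to the number of groups rather than the number of samples. The sketch below reuses the same data and lists the group ids on each side; n_splits is reduced to 5 only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])

gss = GroupShuffleSplit(n_splits=5, train_size=0.7, random_state=42)
group_split = []
for train_idx, test_idx in gss.split(X, y, groups):
    train_groups = np.unique(groups[train_idx]).tolist()
    test_groups = np.unique(groups[test_idx]).tolist()
    group_split.append((train_groups, test_groups))
    # With 3 groups and train_size=0.7: 2 groups for training, 1 for testing.
    print("train groups:", train_groups, "test groups:", test_groups)
```
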

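    Available methods:

    • get_n_splits(self[, X, y, groups]): returns the number of splitting iterations

    • split(self, X[, y, groups]): generates the indices that split the data into training and test sets

    The split method performs the actual split; it yields index arrays rather than the data itself.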

    In [19]: import numpy as np 
    ...: from sklearn.model_selection import KFold
    ...: X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    ...: y = np.array([1, 2, 3, 4])
    ...: kf = KFold(n_splits=2)
    ...: kf.get_n_splits(X)
    ...:
    ...: print(kf)
    ...:
    ...: for train_index, test_index in kf.split(X):
    ...:     print("TRAIN:", train_index, "TEST:", test_index)
    ...:     X_train, X_test = X[train_index], X[test_index]
    ...:     y_train, y_test = y[train_index], y[test_index]
    ...:     print(X_train)
    ...:
    KFold(n_splits=2, random_state=None, shuffle=False)
    TRAIN: [2 3] TEST: [0 1]
    [[5 6]
     [7 8]]
    TRAIN: [0 1] TEST: [2 3]
    [[1 2]
     [3 4]]
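
In practice the index arrays returned by split are used to fit and score a model once per fold. The following sketch shows that loop; the iris dataset and LogisticRegression are assumptions made for illustration, and shuffle=True is used because the iris samples are stored sorted by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Dataset and model are placeholders chosen for the example.
X_iris, y_iris = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_index, test_index in kf.split(X_iris):
    # Index the arrays with the positions returned by split().
    X_train, X_test = X_iris[train_index], X_iris[test_index]
    y_train, y_test = y_iris[train_index], y_iris[test_index]

    # Fit a fresh model on each training fold and score it on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    fold_scores.append(model.score(X_test, y_test))

print(len(fold_scores))  # 5
print("mean accuracy:", np.mean(fold_scores))
```
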