Feature Vectorization

Reference blog post: https://www.cnblogs.com/hellcat/p/7886765.html

Steps involved:
1. Convert the DataFrame to dictionaries: to_dict(orient='records')
2. Vectorize the dictionaries: DictVectorizer
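
The two steps can be chained as in the minimal sketch below. The column names `city` and `temp` and their values are hypothetical toy data, not taken from the reference post; the only library calls used are `DataFrame.to_dict` and `DictVectorizer`.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: one string-valued column and one numeric column.
df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris"], "temp": [18.0, 25.0, 20.0]})

# Step 1: one dict per row.
records = df.to_dict(orient="records")

# Step 2: vectorize the dicts; string values are one-hot encoded,
# numeric values are copied through unchanged.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

print(vec.feature_names_)  # ['city=Paris', 'city=Tokyo', 'temp']
print(X)
```

Each row of `X` lines up with one row of the original frame, and each column lines up with one entry of `vec.feature_names_`.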

Converting to dictionaries: `to_dict(orient='')`

A simple example:

import numpy as np
import pandas as pd
 
index = ['x', 'y']
columns = ['a','b','c']
 
dtype = [('a','int32'), ('b','float32'), ('c','float32')]
values = np.zeros(2, dtype=dtype)
df = pd.DataFrame(values, index=index)
df.to_dict(orient='records')

# Output
In [7]: df
Out[7]:
   a    b    c
x  0  0.0  0.0
y  0  0.0  0.0

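In [8]: df.to_dict(orient='records')
Out[8]: [{'a': 0.0, 'b': 0.0, 'c': 0.0}, {'a': 0.0, 'b': 0.0, 'c': 0.0}]

Note: pandas offers several ways to convert a DataFrame to a dict, and the orient keyword controls the layout of the result. With 'records', each row becomes one dict. Recent pandas versions require the full name 'records'; older versions also accepted the abbreviation 'record'. The outputs in this post were recorded on an older pandas, so newer versions may show the integer column a as 0 rather than 0.0.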

Details:

   a    b    c
x  0  0.0  0.0
y  0  0.0  0.0

#1. Row-wise
## orient='records': each dict holds one row; the row index labels are not included
In [8]: df.to_dict(orient='records')
Out[8]: [{'a': 0.0, 'b': 0.0, 'c': 0.0}, {'a': 0.0, 'b': 0.0, 'c': 0.0}]

## orient='index': keyed by row index label, so the labels are kept
In [16]: df.to_dict(orient='index')
Out[16]: {'x': {'a': 0, 'b': 0.0, 'c': 0.0}, 'y': {'a': 0, 'b': 0.0, 'c': 0.0}}

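#2. Column-wise
## orient='list': each column maps to a list of its values; a value's position in the list matches the row order
In [13]: df.to_dict(orient='list')
Out[13]: {'a': [0, 0], 'b': [0.0, 0.0], 'c': [0.0, 0.0]}

## orient='series': each column maps to a pd.Series
In [12]: df.to_dict(orient='series')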
Out[12]:
{'a': x    0
 y    0
 Name: a, dtype: int32, 'b': x    0.0
 y    0.0
 Name: b, dtype: float32, 'c': x    0.0
 y    0.0
 Name: c, dtype: float32}
 
## orient='dict': each column maps to a {row label: value} dict
In [14]: df.to_dict(orient='dict')
Out[14]: {'a': {'x': 0, 'y': 0}, 'b': {'x': 0.0, 'y': 0.0}, 'c': {'x': 0.0, 'y': 0.0}}
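
To make the difference between these layouts concrete, the sketch below reads the same cell through three orientations. The frame and its values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]}, index=["x", "y"])

by_row = df.to_dict(orient="index")   # {row label: {column: value}}
by_col = df.to_dict(orient="dict")    # {column: {row label: value}}
as_list = df.to_dict(orient="list")   # {column: [values in row order]}

# The same cell (row 'y', column 'b') reached through each layout.
print(by_row["y"]["b"])
print(by_col["b"]["y"])
# A list drops the row labels, so look the row up by position instead.
print(as_list["b"][df.index.get_loc("y")])
```

'index' and 'dict' carry the same information with the nesting order swapped, while 'list' keeps only positions, which matters if the rows are later reordered.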

#3. Split into the pieces used to construct a pd.DataFrame
In [15]: df.to_dict(orient='split')
Out[15]:
{'index': ['x', 'y'],
 'columns': ['a', 'b', 'c'],
 'data': [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]}
 
# With these three pieces, the DataFrame can be rebuilt
parts = df.to_dict(orient='split')
pd.DataFrame(index=parts['index'], columns=parts['columns'], data=parts['data'])

Vectorizing dictionaries: `DictVectorizer`

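DictVectorizer converts a list of dicts into a NumPy array (with sparse=False) or a SciPy sparse matrix (the default). After fitting, the attribute feature_names_ (v.feature_names_ in the session below) lists the extracted feature names.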
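
The return type depends on the `sparse` argument. The sketch below fits the same sample dicts twice, once with the default and once with `sparse=False`; the variable names `X_sparse` and `X_dense` are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer

D = [{"foo": 1, "bar": 2}, {"foo": 3, "baz": 1}]

# Default (sparse=True): the result is a SciPy sparse matrix.
X_sparse = DictVectorizer().fit_transform(D)

# sparse=False: the result is a dense NumPy array.
X_dense = DictVectorizer(sparse=False).fit_transform(D)

print(type(X_sparse).__name__, type(X_dense).__name__)
# Both hold the same values once the sparse matrix is densified.
print((X_sparse.toarray() == X_dense).all())
```

The sparse form is usually preferable when one-hot encoding produces many columns that are mostly zero.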

Numeric (continuous) values are stored as they are; string-valued (categorical) features are one-hot encoded.
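
A short sketch of that rule, using hypothetical features `color` (string-valued) and `size` (numeric) that do not appear in the original post:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical samples mixing a string-valued and a numeric feature.
samples = [
    {"color": "red", "size": 1.5},
    {"color": "blue", "size": 3.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)

# Each (key, string value) pair becomes its own 0/1 column named "key=value";
# the numeric feature keeps its key as the column name.
print(vec.feature_names_)  # ['color=blue', 'color=red', 'size']
print(X)
```

Only string values trigger one-hot encoding. An integer category code such as `{"color": 2}` would be treated as a plain number, so it should be cast to a string first if it is meant to be categorical.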

In [18]: from sklearn.feature_extraction import DictVectorizer

In [19]: v = DictVectorizer(sparse=False)

In [20]: D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]

In [21]: X = v.fit_transform(D)

In [22]: X
Out[22]:
array([[2., 0., 1.],
       [0., 1., 3.]])

The array alone does not show which column is which; check v.feature_names_ to see the feature behind each column.

In [23]: v.feature_names_
Out[23]: ['bar', 'baz', 'foo']

With this mapping fitted, a new dict can be transformed, for example:

In [24]: v.transform({'foo': 4, 'unseen_feature': 3})
Out[24]: array([[0., 0., 4.]])

In [25]: v.transform({'foo': 4, 'baz': 3})
Out[25]: array([[0., 3., 4.]])

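foo is the third entry in feature_names_, so its value lands in the third column. 'unseen_feature': 3 has no matching key in the fitted vocabulary, so it is dropped from the output.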
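
Because unseen keys are dropped silently, it can help to check for them before transforming. The sketch below does so with the fitted `vocabulary_` attribute; the `dropped` list is an illustrative helper, not part of scikit-learn, and the check as written only suits numeric features (string-valued features are stored under "key=value" names).

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
v.fit([{"foo": 1, "bar": 2}, {"foo": 3, "baz": 1}])

new_sample = {"foo": 4, "unseen_feature": 3}

# vocabulary_ maps each learned feature name to its column index.
# Illustrative check: keys missing from it will be ignored by transform().
dropped = [key for key in new_sample if key not in v.vocabulary_]
print(dropped)  # ['unseen_feature']

print(v.transform([new_sample]))
```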
