Files
notes_estom/Tensorflow/TensorFlow2.0/016.md
yinkanglong_lab 68c2dbc3ac apacheCNml&dl
2021-03-20 16:02:39 +08:00

367 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 用 tf.data 加载 CSV 数据
> 原文:[https://tensorflow.google.cn/tutorials/load_data/csv](https://tensorflow.google.cn/tutorials/load_data/csv)
**Note:** 我们的 TensorFlow 社区翻译了这些文档。因为社区翻译是尽力而为, 所以无法保证它们是最准确的,并且反映了最新的 [官方英文文档](https://tensorflow.google.cn/?hl=en)。如果您有改进此翻译的建议, 请提交 pull request 到 [tensorflow/docs](https://github.com/tensorflow/docs) GitHub 仓库。要志愿地撰写或者审核译文,请加入 [docs-zh-cn@tensorflow.org Google Group](https://groups.google.com/a/tensorflow.org/forum/#!forum/docs-zh-cn)。
这篇教程通过一个示例展示了怎样将 CSV 格式的数据加载进 [`tf.data.Dataset`](https://tensorflow.google.cn/api_docs/python/tf/data/Dataset)。
这篇教程使用的是泰坦尼克号乘客的数据。模型会根据乘客的年龄、性别、票务舱和是否独自旅行等特征来预测乘客生还的可能性。
## 设置
```py
import functools
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
```
```py
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
```
```py
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 0us/step
```
```py
# 让 numpy 数据更易读。
np.set_printoptions(precision=3, suppress=True)
```
## 加载数据
开始的时候,我们通过打印 CSV 文件的前几行来了解文件的格式。
```py
head {train_file_path}
```
```py
survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n
```
正如你看到的那样CSV 文件的每列都会有一个列名。dataset 的构造函数会自动识别这些列名。如果你使用的文件的第一行不包含列名,那么需要将列名通过字符串列表传给 `make_csv_dataset` 函数的 `column_names` 参数。
```py
CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']
dataset = tf.data.experimental.make_csv_dataset(
...,
column_names=CSV_COLUMNS,
...)
```
这个示例使用了所有的列。如果你需要忽略数据集中的某些列,创建一个包含你需要使用的列的列表,然后传给构造器的(可选)参数 `select_columns`
```py
dataset = tf.data.experimental.make_csv_dataset(
...,
select_columns = columns_to_use,
...)
```
对于包含模型需要预测的值的列是你需要显式指定的。
```py
LABEL_COLUMN = 'survived'
LABELS = [0, 1]
```
现在从文件中读取 CSV 数据并且创建 dataset。
(完整的文档,参考 [`tf.data.experimental.make_csv_dataset`](https://tensorflow.google.cn/api_docs/python/tf/data/experimental/make_csv_dataset))
```py
def get_dataset(file_path):
dataset = tf.data.experimental.make_csv_dataset(
file_path,
batch_size=12, # 为了示例更容易展示,手动设置较小的值
label_name=LABEL_COLUMN,
na_value="?",
num_epochs=1,
ignore_errors=True)
return dataset
raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)
```
dataset 中的每个条目都是一个批次,用一个元组(*多个样本**多个标签*)表示。样本中的数据组织形式是以列为主的张量(而不是以行为主的张量),每条数据中包含的元素个数就是批次大小(这个示例中是 12
阅读下面的示例有助于你的理解。
```py
examples, labels = next(iter(raw_train_data)) # 第一个批次
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)
```
```py
EXAMPLES:
OrderedDict([('sex', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'male', b'male', b'male', b'male', b'male', b'female', b'male',
b'female', b'male', b'male', b'male', b'female'], dtype=object)>), ('age', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([35., 30., 28., 40., 17., 19., 21., 7., 58., 26., 19., 29.],
dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(12,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1], dtype=int32)>), ('parch', <tf.Tensor: shape=(12,), dtype=int32, numpy=array([0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0], dtype=int32)>), ('fare', <tf.Tensor: shape=(12,), dtype=float32, numpy=
array([ 8.05 , 13\. , 7.225, 7.896, 8.663, 26.283, 7.925, 26.25 ,
29.7 , 8.663, 0\. , 26\. ], dtype=float32)>), ('class', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Third', b'Second', b'Third', b'Third', b'Third', b'First',
b'Third', b'Second', b'First', b'Third', b'Third', b'Second'],
dtype=object)>), ('deck', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'unknown', b'unknown', b'unknown', b'unknown', b'unknown', b'D',
b'unknown', b'unknown', b'B', b'unknown', b'unknown', b'unknown'],
dtype=object)>), ('embark_town', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'Southampton', b'Southampton', b'Cherbourg', b'Southampton',
b'Southampton', b'Southampton', b'Southampton', b'Southampton',
b'Cherbourg', b'Southampton', b'Southampton', b'Southampton'],
dtype=object)>), ('alone', <tf.Tensor: shape=(12,), dtype=string, numpy=
array([b'y', b'y', b'y', b'y', b'y', b'n', b'y', b'n', b'y', b'n', b'y',
b'n'], dtype=object)>)])
LABELS:
tf.Tensor([0 0 0 0 0 1 0 1 0 0 0 1], shape=(12,), dtype=int32)
```
## 数据预处理
### 分类数据
CSV 数据中的有些列是分类的列。也就是说,这些列只能在有限的集合中取值。
使用 [`tf.feature_column`](https://tensorflow.google.cn/api_docs/python/tf/feature_column) API 创建一个 [`tf.feature_column.indicator_column`](https://tensorflow.google.cn/api_docs/python/tf/feature_column/indicator_column) 集合,每个 [`tf.feature_column.indicator_column`](https://tensorflow.google.cn/api_docs/python/tf/feature_column/indicator_column) 对应一个分类的列。
```py
CATEGORIES = {
'sex': ['male', 'female'],
'class' : ['First', 'Second', 'Third'],
'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'embark_town' : ['Cherbourg', 'Southhampton', 'Queenstown'],
'alone' : ['y', 'n']
}
```
```py
categorical_columns = []
for feature, vocab in CATEGORIES.items():
cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
key=feature, vocabulary_list=vocab)
categorical_columns.append(tf.feature_column.indicator_column(cat_col))
```
```py
# 你刚才创建的内容
categorical_columns
```
```py
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southhampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
```
这将是后续构建模型时处理输入数据的一部分。
### 连续数据
连续数据需要标准化。
写一个函数标准化这些值,然后将这些值改造成 2 维的张量。
```py
def process_continuous_data(mean, data):
# 标准化数据
data = tf.cast(data, tf.float32) * 1/(2*mean)
return tf.reshape(data, [-1, 1])
```
现在创建一个数值列的集合。`tf.feature_columns.numeric_column` API 会使用 `normalizer_fn` 参数。在传参的时候使用 [`functools.partial`](https://docs.python.org/3/library/functools.html#functools.partial)`functools.partial` 由使用每个列的均值进行标准化的函数构成。
```py
MEANS = {
'age' : 29.631308,
'n_siblings_spouses' : 0.545455,
'parch' : 0.379585,
'fare' : 34.385399
}
numerical_columns = []
for feature in MEANS.keys():
num_col = tf.feature_column.numeric_column(feature, normalizer_fn=functools.partial(process_continuous_data, MEANS[feature]))
numerical_columns.append(num_col)
```
```py
# 你刚才创建的内容。
numerical_columns
```
```py
[NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f3f083021e0>, 29.631308)),
NumericColumn(key='n_siblings_spouses', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f3f083021e0>, 0.545455)),
NumericColumn(key='parch', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f3f083021e0>, 0.379585)),
NumericColumn(key='fare', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function process_continuous_data at 0x7f3f083021e0>, 34.385399))]
```
这里使用标准化的方法需要提前知道每列的均值。如果需要计算连续的数据流的标准化的值可以使用 [TensorFlow Transform](https://tensorflow.google.cn/tfx/transform/get_started)。
### 创建预处理层
将这两个特征列的集合相加,并且传给 [`tf.keras.layers.DenseFeatures`](https://tensorflow.google.cn/api_docs/python/tf/keras/layers/DenseFeatures) 从而创建一个进行预处理的输入层。
```py
preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numerical_columns)
```
## 构建模型
`preprocessing_layer` 开始构建 [`tf.keras.Sequential`](https://tensorflow.google.cn/api_docs/python/tf/keras/Sequential)。
```py
model = tf.keras.Sequential([
preprocessing_layer,
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
```
## 训练、评估和预测
现在可以实例化和训练模型。
```py
train_data = raw_train_data.shuffle(500)
test_data = raw_test_data
```
```py
model.fit(train_data, epochs=20)
```
```py
Epoch 1/20
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'collections.OrderedDict'> input: OrderedDict([('sex', <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=string>), ('age', <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=float32>), ('n_siblings_spouses', <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=int32>), ('parch', <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int32>), ('fare', <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=float32>), ('class', <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>), ('deck', <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=string>), ('embark_town', <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>), ('alone', <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=string>)])
Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'collections.OrderedDict'> input: OrderedDict([('sex', <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=string>), ('age', <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=float32>), ('n_siblings_spouses', <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=int32>), ('parch', <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int32>), ('fare', <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=float32>), ('class', <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>), ('deck', <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=string>), ('embark_town', <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>), ('alone', <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=string>)])
Consider rewriting this model with the Functional API.
53/53 [==============================] - 0s 4ms/step - loss: 0.5501 - accuracy: 0.7225
Epoch 2/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4399 - accuracy: 0.8102
Epoch 3/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4158 - accuracy: 0.8150
Epoch 4/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4137 - accuracy: 0.8118
Epoch 5/20
53/53 [==============================] - 0s 3ms/step - loss: 0.4011 - accuracy: 0.8278
Epoch 6/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3953 - accuracy: 0.8198
Epoch 7/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3834 - accuracy: 0.8325
Epoch 8/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3831 - accuracy: 0.8309
Epoch 9/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3768 - accuracy: 0.8453
Epoch 10/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3710 - accuracy: 0.8437
Epoch 11/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3704 - accuracy: 0.8389
Epoch 12/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3670 - accuracy: 0.8325
Epoch 13/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3603 - accuracy: 0.8517
Epoch 14/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3548 - accuracy: 0.8501
Epoch 15/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3554 - accuracy: 0.8469
Epoch 16/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3519 - accuracy: 0.8453
Epoch 17/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3472 - accuracy: 0.8596
Epoch 18/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3513 - accuracy: 0.8581
Epoch 19/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3448 - accuracy: 0.8469
Epoch 20/20
53/53 [==============================] - 0s 3ms/step - loss: 0.3390 - accuracy: 0.8581
<tensorflow.python.keras.callbacks.History at 0x7f3f082606a0>
```
当模型训练完成的时候,你可以在测试集 `test_data` 上检查准确性。
```py
test_loss, test_accuracy = model.evaluate(test_data)
print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))
```
```py
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'collections.OrderedDict'> input: OrderedDict([('sex', <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=string>), ('age', <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=float32>), ('n_siblings_spouses', <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=int32>), ('parch', <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int32>), ('fare', <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=float32>), ('class', <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>), ('deck', <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=string>), ('embark_town', <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>), ('alone', <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=string>)])
Consider rewriting this model with the Functional API.
22/22 [==============================] - 0s 3ms/step - loss: 0.4596 - accuracy: 0.7992
Test Loss 0.45956382155418396, Test Accuracy 0.7992424368858337
```
使用 [`tf.keras.Model.predict`](https://tensorflow.google.cn/api_docs/python/tf/keras/Model#predict) 推断一个批次或多个批次的标签。
```py
predictions = model.predict(test_data)
# 显示部分结果
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
print("Predicted survival: {:.2%}".format(prediction[0]),
" | Actual outcome: ",
("SURVIVED" if bool(survived) else "DIED"))
```
```py
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'collections.OrderedDict'> input: OrderedDict([('sex', <tf.Tensor 'ExpandDims_8:0' shape=(None, 1) dtype=string>), ('age', <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=float32>), ('n_siblings_spouses', <tf.Tensor 'ExpandDims_6:0' shape=(None, 1) dtype=int32>), ('parch', <tf.Tensor 'ExpandDims_7:0' shape=(None, 1) dtype=int32>), ('fare', <tf.Tensor 'ExpandDims_5:0' shape=(None, 1) dtype=float32>), ('class', <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>), ('deck', <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=string>), ('embark_town', <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>), ('alone', <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=string>)])
Consider rewriting this model with the Functional API.
Predicted survival: 99.81% | Actual outcome: DIED
Predicted survival: 14.77% | Actual outcome: SURVIVED
Predicted survival: 11.87% | Actual outcome: DIED
Predicted survival: 6.05% | Actual outcome: DIED
Predicted survival: 10.83% | Actual outcome: DIED
Predicted survival: 29.45% | Actual outcome: SURVIVED
Predicted survival: 92.37% | Actual outcome: SURVIVED
Predicted survival: 4.18% | Actual outcome: SURVIVED
Predicted survival: 14.32% | Actual outcome: DIED
Predicted survival: 4.36% | Actual outcome: SURVIVED
```