mirror of
https://github.com/Estom/notes.git
synced 2026-04-05 11:57:37 +08:00
297 lines
15 KiB
Markdown
297 lines
15 KiB
Markdown
# 使用 tf.data 加载 pandas dataframes
|
||
|
||
> 原文:[https://tensorflow.google.cn/tutorials/load_data/pandas_dataframe](https://tensorflow.google.cn/tutorials/load_data/pandas_dataframe)
|
||
|
||
**Note:** 我们的 TensorFlow 社区翻译了这些文档。因为社区翻译是尽力而为, 所以无法保证它们是最准确的,并且反映了最新的 [官方英文文档](https://tensorflow.google.cn/?hl=en)。如果您有改进此翻译的建议, 请提交 pull request 到 [tensorflow/docs](https://github.com/tensorflow/docs) GitHub 仓库。要志愿地撰写或者审核译文,请加入 [docs-zh-cn@tensorflow.org Google Group](https://groups.google.com/a/tensorflow.org/forum/#!forum/docs-zh-cn)。
|
||
|
||
本教程提供了如何将 pandas dataframes 加载到 [`tf.data.Dataset`](https://tensorflow.google.cn/api_docs/python/tf/data/Dataset)。
|
||
|
||
本教程使用了一个小型[数据集](https://archive.ics.uci.edu/ml/datasets/heart+Disease),由克利夫兰诊所心脏病基金会(Cleveland Clinic Foundation for Heart Disease)提供. 此数据集中有几百行 CSV。每行表示一个患者,每列表示一个属性(describe)。我们将使用这些信息来预测患者是否患有心脏病,这是一个二分类问题。
|
||
|
||
## 使用 pandas 读取数据
|
||
|
||
```py
|
||
!pip install -q tensorflow-gpu==2.0.0-rc1
|
||
import pandas as pd
|
||
import tensorflow as tf
|
||
```
|
||
|
||
```py
|
||
WARNING: You are using pip version 20.2.2; however, version 20.2.3 is available.
|
||
You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.
|
||
|
||
```
|
||
|
||
下载包含心脏数据集的 csv 文件。
|
||
|
||
```py
|
||
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
|
||
```
|
||
|
||
使用 pandas 读取 csv 文件。
|
||
|
||
```py
|
||
df = pd.read_csv(csv_file)
|
||
```
|
||
|
||
```py
|
||
df.head()
|
||
```
|
||
|
||
<devsite-iframe><iframe src="/tutorials/load_data/pandas_dataframe_420ecafb3d5d72c62762d056cc160cddfd15a9fd8290044191c203a794d6d136.frame" class="framebox inherit-locale " allowfullscreen="" is-upgraded=""></iframe></devsite-iframe>
|
||
|
||
```py
|
||
df.dtypes
|
||
```
|
||
|
||
```py
|
||
age int64
|
||
sex int64
|
||
cp int64
|
||
trestbps int64
|
||
chol int64
|
||
fbs int64
|
||
restecg int64
|
||
thalach int64
|
||
exang int64
|
||
oldpeak float64
|
||
slope int64
|
||
ca int64
|
||
thal object
|
||
target int64
|
||
dtype: object
|
||
|
||
```
|
||
|
||
将 `thal` 列(数据帧(dataframe)中的 `object` )转换为离散数值。
|
||
|
||
```py
|
||
df['thal'] = pd.Categorical(df['thal'])
|
||
df['thal'] = df.thal.cat.codes
|
||
```
|
||
|
||
```py
|
||
df.head()
|
||
```
|
||
|
||
<devsite-iframe><iframe src="/tutorials/load_data/pandas_dataframe_39d2bcddc17dbd9e94883df635bb9acdb6b07d463cf1ca2ea90daeb2b4275ca7.frame" class="framebox inherit-locale " allowfullscreen="" is-upgraded=""></iframe></devsite-iframe>
|
||
|
||
## 使用 [`tf.data.Dataset`](https://tensorflow.google.cn/api_docs/python/tf/data/Dataset) 读取数据
|
||
|
||
使用 [`tf.data.Dataset.from_tensor_slices`](https://tensorflow.google.cn/api_docs/python/tf/data/Dataset#from_tensor_slices) 从 pandas dataframe 中读取数值。
|
||
|
||
使用 [`tf.data.Dataset`](https://tensorflow.google.cn/api_docs/python/tf/data/Dataset) 的其中一个优势是可以允许您写一些简单而又高效的数据管道(data pipelines)。从 [loading data guide](https://tensorflow.google.cn/guide/data) 可以了解更多。
|
||
|
||
```py
|
||
target = df.pop('target')
|
||
```
|
||
|
||
```py
|
||
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
|
||
```
|
||
|
||
```py
|
||
for feat, targ in dataset.take(5):
|
||
print ('Features: {}, Target: {}'.format(feat, targ))
|
||
```
|
||
|
||
```py
|
||
Features: [ 63\. 1\. 1\. 145\. 233\. 1\. 2\. 150\. 0\. 2.3 3\. 0.
|
||
|
||
2\. ], Target: 0
|
||
Features: [ 67\. 1\. 4\. 160\. 286\. 0\. 2\. 108\. 1\. 1.5 2\. 3.
|
||
3\. ], Target: 1
|
||
Features: [ 67\. 1\. 4\. 120\. 229\. 0\. 2\. 129\. 1\. 2.6 2\. 2.
|
||
4\. ], Target: 0
|
||
Features: [ 37\. 1\. 3\. 130\. 250\. 0\. 0\. 187\. 0\. 3.5 3\. 0.
|
||
3\. ], Target: 0
|
||
Features: [ 41\. 0\. 2\. 130\. 204\. 0\. 2\. 172\. 0\. 1.4 1\. 0.
|
||
3\. ], Target: 0
|
||
|
||
```
|
||
|
||
由于 `pd.Series` 实现了 `__array__` 协议,因此几乎可以在任何使用 `np.array` 或 [`tf.Tensor`](https://tensorflow.google.cn/api_docs/python/tf/Tensor) 的地方透明地使用它。
|
||
|
||
```py
|
||
tf.constant(df['thal'])
|
||
```
|
||
|
||
```py
|
||
<tf.Tensor: id=21, shape=(303,), dtype=int32, numpy=
|
||
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
|
||
3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
|
||
2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
|
||
4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
|
||
3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
|
||
3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
|
||
3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
|
||
4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
|
||
3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
|
||
4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
|
||
3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
|
||
4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
|
||
3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
|
||
3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int32)>
|
||
|
||
```
|
||
|
||
随机读取(shuffle)并批量处理数据集。
|
||
|
||
```py
|
||
train_dataset = dataset.shuffle(len(df)).batch(1)
|
||
```
|
||
|
||
## 创建并训练模型
|
||
|
||
```py
|
||
def get_compiled_model():
|
||
model = tf.keras.Sequential([
|
||
tf.keras.layers.Dense(10, activation='relu'),
|
||
tf.keras.layers.Dense(10, activation='relu'),
|
||
tf.keras.layers.Dense(1, activation='sigmoid')
|
||
])
|
||
|
||
model.compile(optimizer='adam',
|
||
loss='binary_crossentropy',
|
||
metrics=['accuracy'])
|
||
return model
|
||
```
|
||
|
||
```py
|
||
model = get_compiled_model()
|
||
model.fit(train_dataset, epochs=15)
|
||
```
|
||
|
||
```py
|
||
WARNING:tensorflow:Layer sequential is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2\. The layer has dtype float32 because it's dtype defaults to floatx.
|
||
|
||
If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.
|
||
|
||
To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
|
||
|
||
Epoch 1/15
|
||
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
|
||
Instructions for updating:
|
||
Use tf.where in 2.0, which has the same broadcast rule as np.where
|
||
WARNING:tensorflow:Entity <function Function._initialize_uninitialized_variables.<locals>.initialize_variables at 0x7f3d7029f620> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: module 'gast' has no attribute 'Num'
|
||
WARNING: Entity <function Function._initialize_uninitialized_variables.<locals>.initialize_variables at 0x7f3d7029f620> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: module 'gast' has no attribute 'Num'
|
||
303/303 [==============================] - 1s 4ms/step - loss: 3.8214 - accuracy: 0.5149
|
||
Epoch 2/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.9302 - accuracy: 0.6766
|
||
Epoch 3/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.8203 - accuracy: 0.6964
|
||
Epoch 4/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.7565 - accuracy: 0.7162
|
||
Epoch 5/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.6607 - accuracy: 0.7162
|
||
Epoch 6/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.6804 - accuracy: 0.6931
|
||
Epoch 7/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.5967 - accuracy: 0.7525
|
||
Epoch 8/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.6198 - accuracy: 0.7228
|
||
Epoch 9/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.5584 - accuracy: 0.7624
|
||
Epoch 10/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.5611 - accuracy: 0.7756
|
||
Epoch 11/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.5364 - accuracy: 0.7492
|
||
Epoch 12/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.5042 - accuracy: 0.7822
|
||
Epoch 13/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.5168 - accuracy: 0.7624
|
||
Epoch 14/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.4560 - accuracy: 0.8053
|
||
Epoch 15/15
|
||
303/303 [==============================] - 0s 1ms/step - loss: 0.4350 - accuracy: 0.7987
|
||
|
||
<tensorflow.python.keras.callbacks.History at 0x7f3d7f250048>
|
||
|
||
```
|
||
|
||
## 代替特征列
|
||
|
||
将字典作为输入传输给模型就像创建 [`tf.keras.layers.Input`](https://tensorflow.google.cn/api_docs/python/tf/keras/Input) 层的匹配字典一样简单,应用任何预处理并使用 [functional api](https://tensorflow.google.cn/guide/keras/functional)。 您可以使用它作为 [feature columns](https://tensorflow.google.cn/tutorials/keras/feature_columns) 的替代方法。
|
||
|
||
```py
|
||
inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
|
||
x = tf.stack(list(inputs.values()), axis=-1)
|
||
|
||
x = tf.keras.layers.Dense(10, activation='relu')(x)
|
||
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
|
||
|
||
model_func = tf.keras.Model(inputs=inputs, outputs=output)
|
||
|
||
model_func.compile(optimizer='adam',
|
||
loss='binary_crossentropy',
|
||
metrics=['accuracy'])
|
||
```
|
||
|
||
与 [`tf.data`](https://tensorflow.google.cn/api_docs/python/tf/data) 一起使用时,保存 `pd.DataFrame` 列结构的最简单方法是将 `pd.DataFrame` 转换为 `dict` ,并对该字典进行切片。
|
||
|
||
```py
|
||
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
|
||
```
|
||
|
||
```py
|
||
for dict_slice in dict_slices.take(1):
|
||
print (dict_slice)
|
||
```
|
||
|
||
```py
|
||
({'age': <tf.Tensor: id=14781, shape=(16,), dtype=int32, numpy=
|
||
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
|
||
dtype=int32)>, 'sex': <tf.Tensor: id=14789, shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: id=14784, shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: id=14793, shape=(16,), dtype=int32, numpy=
|
||
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
|
||
120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: id=14783, shape=(16,), dtype=int32, numpy=
|
||
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
|
||
263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: id=14786, shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: id=14788, shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: id=14792, shape=(16,), dtype=int32, numpy=
|
||
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
|
||
173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: id=14785, shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: id=14787, shape=(16,), dtype=float32, numpy=
|
||
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
|
||
|
||
0\. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: id=14790, shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: id=14782, shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: id=14791, shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: id=14794, shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
|
||
|
||
```
|
||
|
||
```py
|
||
model_func.fit(dict_slices, epochs=15)
|
||
```
|
||
|
||
```py
|
||
Epoch 1/15
|
||
WARNING:tensorflow:Entity <function Function._initialize_uninitialized_variables.<locals>.initialize_variables at 0x7f3d2c33a510> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: module 'gast' has no attribute 'Num'
|
||
WARNING: Entity <function Function._initialize_uninitialized_variables.<locals>.initialize_variables at 0x7f3d2c33a510> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: module 'gast' has no attribute 'Num'
|
||
19/19 [==============================] - 1s 30ms/step - loss: 17.3744 - accuracy: 0.7261
|
||
Epoch 2/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 9.7210 - accuracy: 0.7261
|
||
Epoch 3/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 5.0425 - accuracy: 0.6106
|
||
Epoch 4/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 4.8356 - accuracy: 0.5182
|
||
Epoch 5/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 4.4312 - accuracy: 0.5743
|
||
Epoch 6/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 4.2668 - accuracy: 0.5644
|
||
Epoch 7/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 4.1296 - accuracy: 0.5776
|
||
Epoch 8/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 4.0027 - accuracy: 0.5776
|
||
Epoch 9/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 3.8945 - accuracy: 0.5776
|
||
Epoch 10/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 3.7877 - accuracy: 0.5776
|
||
Epoch 11/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 3.6851 - accuracy: 0.5776
|
||
Epoch 12/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 3.5828 - accuracy: 0.5743
|
||
Epoch 13/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 3.4813 - accuracy: 0.5776
|
||
Epoch 14/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 3.3808 - accuracy: 0.5842
|
||
Epoch 15/15
|
||
19/19 [==============================] - 0s 3ms/step - loss: 3.2814 - accuracy: 0.5842
|
||
|
||
<tensorflow.python.keras.callbacks.History at 0x7f3d2c3a0828>
|
||
|
||
``` |