Data loading

2026-04-09 22:09:28 +08:00 · 2021-04-22 21:09:22 +08:00
parent e94e247e10
commit 71a4306ee3
16 changed files with 265 additions and 1001 deletions
--- a/Tensorflow/TensorFlow2.0/0
+++ b/Tensorflow/TensorFlow2.0/0
@@ -0,0 +1,27 @@
+# TensorFlow
+
+> You only need to know how to load data and how to train with Keras.
+
+## Overview
+
+* tf.data
+  * experimental
+  * Dataset
+  * Iterator
+  * FixedLengthRecordDataset
+  * TFRecordDataset
+  * TextLineDataset
+* tf.keras
+  * layers
+  * activations
+  * datasets
+  * preprocessing
+  * experimental
+  * models
+  * losses
+  * optimizers
+  * metrics
+  * utils
+  * class Model: Model groups layers into an object with training and inference features.
+  * class Sequential: Sequential groups a linear stack of layers into a tf.keras.Model.
+## Other
--- a/Tensorflow/TensorFlow2.0/5
+++ b/Tensorflow/TensorFlow2.0/5
@@ -0,0 +1,236 @@
+{
+ "metadata": {
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8-final"
+  },
+  "orig_nbformat": 2,
+  "kernelspec": {
+   "name": "python388jvsc74a57bd05ef0042cb263260037aa2928643ae94e240dd3afaec7872ebebe4f07619ddd0c",
+   "display_name": "Python 3.8.8 64-bit ('ml': conda)"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2,
+ "cells": [
+  {
+   "source": [
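+    "# 1 Pipeline structure\n",
+    "\n",
+    "A typical TensorFlow training input pipeline can be viewed as an ETL process:\n",
+    "\n",
+    "1. Extract: read data from persistent storage, either local (such as HDD or SSD) or remote (such as GCS or HDFS).\n",
+    "2. Transform: use CPU cores to parse the data and preprocess it, for example image decompression, data augmentation (such as random cropping, flipping, and color distortion), shuffling, and batching.\n",
+    "3. Load: load the transformed data onto the accelerator device (for example, a GPU or TPU) that executes the machine learning model."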
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "source": [
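+    "# 2 Notes on the tf.data.Dataset API\n",
+    "\n",
+    "The tf.data API is designed around composable transformations to give users flexibility. Although many of these transformations are commutative, the order of certain transformations affects performance.\n",
+    "\n",
+    "## 1 map and batch\n",
+    "Invoking the user-defined function passed to the map transformation has overhead related to scheduling and executing that function. Normally this overhead is small compared with the computation the function performs. However, if map does very little work, this overhead can account for a large share of the total cost. In that case, vectorize the user-defined function (that is, have it operate on a batch of inputs at once) and apply the batch transformation before the map transformation.\n",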
+    "\n",
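+    "The ordering advice above can be sketched as follows. This is a minimal illustration rather than a benchmark; `increment` is a hypothetical, deliberately cheap function introduced only for this example.\n",
+    "\n",
+    "```python\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "def increment(x):\n",
+    "    # Hypothetical cheap function; works on a scalar or on a whole batch.\n",
+    "    return x + 1\n",
+    "\n",
+    "ds = tf.data.Dataset.range(8)\n",
+    "\n",
+    "# map before batch: increment is applied to each of the 8 elements separately.\n",
+    "per_element = ds.map(increment).batch(4)\n",
+    "\n",
+    "# batch before map: increment is applied to each of the 2 batches as a whole.\n",
+    "vectorized = ds.batch(4).map(increment)\n",
+    "\n",
+    "# Both orders yield the same batches.\n",
+    "batches_a = [batch.numpy().tolist() for batch in per_element]\n",
+    "batches_b = [batch.numpy().tolist() for batch in vectorized]\n",
+    "```\n",
+    "\n",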
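+    "## 2 map and cache\n",
+    "\n",
+    "The tf.data.Dataset.cache transformation can cache a dataset in memory or on local storage. If the user-defined function passed to map is expensive, apply cache after map, as long as the resulting dataset still fits in memory or local storage. If the user-defined function increases the space needed to store the dataset beyond the cache capacity, consider preprocessing the data before the training job to reduce resource usage.\n",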
+    "\n",
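+    "A minimal sketch of caching after an expensive map, assuming the mapped dataset fits in memory. `expensive_fn` is a hypothetical stand-in for a costly preprocessing step, not something from the original notes.\n",
+    "\n",
+    "```python\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "def expensive_fn(x):\n",
+    "    # Hypothetical stand-in for a costly preprocessing step.\n",
+    "    return tf.cast(x, tf.float32) / 255.0\n",
+    "\n",
+    "ds = tf.data.Dataset.range(4).map(expensive_fn)\n",
+    "\n",
+    "# cache() with no argument keeps the mapped elements in memory;\n",
+    "# passing a file path instead would cache them on local storage.\n",
+    "ds = ds.cache()\n",
+    "\n",
+    "first_pass = [float(x) for x in ds]   # runs expensive_fn and fills the cache\n",
+    "second_pass = [float(x) for x in ds]  # served from the cache\n",
+    "```\n",
+    "\n",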
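+    "## 3 map and interleave / prefetch / shuffle\n",
+    "Many transformations (including interleave, prefetch, and shuffle) maintain an internal buffer of elements. If the user-defined function passed to map changes the size of the elements, then the relative order of the map transformation and the transformations that buffer elements affects memory usage. In general, choose the order that results in the lower memory footprint, unless a different order is needed for performance (for example, to fuse the map and batch transformations).\n",
+    "\n",
+    "## 4 repeat and shuffle\n",
+    "The tf.data.Dataset.repeat transformation repeats the input data a finite (or infinite) number of times; each repetition of the data is usually called an epoch. The tf.data.Dataset.shuffle transformation randomizes the order of the dataset's examples.\n",
+    "\n",
+    "If repeat is applied before shuffle, the epoch boundaries are blurred: some elements can be repeated before other elements have appeared even once. On the other hand, if shuffle is applied before repeat, performance may drop at the start of each epoch, because the internal state of the shuffle transformation has to be initialized. In other words, the former (repeat before shuffle) gives better performance, while the latter (shuffle before repeat) gives stronger ordering guarantees.\n",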
+    "\n",
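+    "The difference between the two orders can be checked on a toy dataset. This is a sketch under assumed values: the dataset size, buffer sizes, and seed are arbitrary choices for illustration.\n",
+    "\n",
+    "```python\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "ds = tf.data.Dataset.range(5)\n",
+    "\n",
+    "# shuffle before repeat: every epoch contains each element exactly once.\n",
+    "shuffle_then_repeat = ds.shuffle(buffer_size=5, seed=0).repeat(2)\n",
+    "\n",
+    "# repeat before shuffle: elements from different epochs can mix,\n",
+    "# so the first five elements may contain duplicates.\n",
+    "repeat_then_shuffle = ds.repeat(2).shuffle(buffer_size=10, seed=0)\n",
+    "\n",
+    "values = [int(x.numpy()) for x in shuffle_then_repeat]\n",
+    "first_epoch, second_epoch = values[:5], values[5:]\n",
+    "mixed = [int(x.numpy()) for x in repeat_then_shuffle]\n",
+    "```\n",
+    "\n",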
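+    "The fused shuffle_and_repeat transformation (tf.contrib.data.shuffle_and_repeat in TensorFlow 1.x) offered the best of both worlds: good performance and strong ordering guarantees. tf.contrib was removed in TensorFlow 2.x and tf.data.experimental.shuffle_and_repeat is deprecated, so in TensorFlow 2.x apply shuffle before repeat."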
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "source": [
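+    "# 3 tf.data.Dataset API examples\n",
+    "\n",
+    "### .map\n",
+    "Use map to preprocess the data; it works on the same principle as Python's built-in map.\n",
+    "```\n",
+    "def prepare_mnist_fea(x, y):\n",
+    "    # Scale pixel values to [0, 1] and cast labels to float32.\n",
+    "    x = tf.cast(x, tf.float32) / 255.0\n",
+    "    y = tf.cast(y, tf.float32)\n",
+    "    return x, y\n",
+    "\n",
+    "# map returns a new dataset, so keep the result.\n",
+    "ds = ds.map(prepare_mnist_fea)\n",
+    "```\n",
+    "\n",
+    "### .shuffle\n",
+    "Shuffle the order of the elements.\n",
+    "```\n",
+    "ds = ds.shuffle(10000)  # shuffle buffer of 10000 elements\n",
+    "```\n",
+    "\n",
+    "### .batch\n",
+    "Iterate over the data in batches of a given size.\n",
+    "\n",
+    "```\n",
+    "ds = ds.batch(32)  # 32 elements per batch\n",
+    "```\n",
+    "\n",
+    "### .repeat\n",
+    "Repeat the whole dataset a given number of times, that is, the number of epochs.\n",
+    "```\n",
+    "ds = ds.repeat(10)  # 10 epochs\n",
+    "```"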
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "source": [
+    "# 4 Loading CSV with tf.data\n",
+    "\n",
+    "## Ways to load data (CSV)\n",
+    "### Loading data from memory\n",
+    "* For example, use numpy.load() or pandas.read_csv() to load the data into memory, then use a tf.data.Dataset method to load it into TensorFlow.\n",
+    "```\n",
+    "tf.data.Dataset.from_tensors() \n",
+    "tf.data.Dataset.from_tensor_slices()\n",
+    "```\n",
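+    "\n",
+    "The two constructors differ in how they split the input. A small sketch, where the `features` and `labels` arrays are made-up data standing in for the result of numpy.load() or pandas.read_csv():\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "# Made-up in-memory data for illustration.\n",
+    "features = np.arange(6, dtype=np.float32).reshape(3, 2)\n",
+    "labels = np.array([0, 1, 0], dtype=np.int32)\n",
+    "\n",
+    "# from_tensors: the whole (features, labels) pair becomes a single element.\n",
+    "whole = tf.data.Dataset.from_tensors((features, labels))\n",
+    "\n",
+    "# from_tensor_slices: slices along the first axis, one element per row.\n",
+    "rows = tf.data.Dataset.from_tensor_slices((features, labels))\n",
+    "\n",
+    "n_whole = len(list(whole))\n",
+    "n_rows = len(list(rows))\n",
+    "```\n",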
+    "### Reading data from a generator\n",
+    "```\n",
+    "ds_counter = tf.data.Dataset.from_generator(python_generator, args=[25], output_types=tf.int32, output_shapes=())\n",
+    "```\n",
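+    "\n",
+    "A self-contained version of the call above, as a sketch. The generator `count` is an assumption made for this example, since the notes do not define the generator; newer TensorFlow releases also accept an output_signature argument in place of output_types and output_shapes.\n",
+    "\n",
+    "```python\n",
+    "import tensorflow as tf\n",
+    "\n",
+    "def count(stop):\n",
+    "    # Assumed example generator: yields 0, 1, ..., stop - 1.\n",
+    "    i = 0\n",
+    "    while i < stop:\n",
+    "        yield i\n",
+    "        i += 1\n",
+    "\n",
+    "# args are passed to the generator function when the dataset is iterated.\n",
+    "ds_counter = tf.data.Dataset.from_generator(\n",
+    "    count, args=[5], output_types=tf.int32, output_shapes=())\n",
+    "\n",
+    "counted = [int(x.numpy()) for x in ds_counter]\n",
+    "```\n",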
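+    "### Reading CSV data directly\n",
+    "```\n",
+    "tf.data.experimental.make_csv_dataset()\n",
+    "```\n",
+    "### Loading data from files\n",
+    "```\n",
+    "tf.data.TFRecordDataset()\n",
+    "tf.data.TextLineDataset()\n",
+    "tf.data.FixedLengthRecordDataset()\n",
+    "```\n",
+    "### Loading multiple files through a generator\n",
+    "When there are multiple files, you can use pandas to build a generator that reads them, then load the data incrementally with from_generator.\n",
+    "```\n",
+    "ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes=())\n",
+    "```\n",
+    "## The two meanings of a data pipeline\n",
+    "\n",
+    "1. The pipeline formed by the steps that load and process the data (a pipeline of processing steps).\n",
+    "2. The pipeline formed by loading multiple files in order (a pipeline over multiple files)."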
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "source": [
+    "# 5 Common tf.data.Dataset methods"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import tensorflow as tf\n",
+    "\n",
+    "import pathlib\n",
+    "import os\n",
+    "import matplotlib.pyplot as plt\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "\n",
+    "np.set_printoptions(precision=4)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = tf.data.Dataset.from_tensor_slices(([8, 3, 0, 1],[1,2,1,2]))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "8\n3\n0\n1\n"
+     ]
+    }
+   ],
+   "source": [
+    "\n",
+    "# Treat the dataset as an iterable\n",
+    "for elem, label in dataset:\n",
+    "  print(elem.numpy())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "[12  6]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Use reduce to combine the elements of the dataset\n",
+    "print(dataset.reduce(0, lambda state, value: state + value).numpy())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "(TensorSpec(shape=(), dtype=tf.int32, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None))\n"
+     ]
+    }
+   ],
+   "source": [
+    "# A Dataset can hold elements of various structures, including TensorFlow's tf.Tensor, tf.sparse.SparseTensor, tf.RaggedTensor, tf.TensorArray, or tf.data.Dataset, as well as native Python structures: tuple, dict, and NamedTuple.\n",
+    "\n",
+    "print(dataset.element_spec)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ]
+}
--- a/加载numpy&pandas数据.ipynb
+++ b/加载numpy&pandas数据.ipynb
--- a/Tensorflow/TensorFlow2.0/5.2
+++ b/Tensorflow/TensorFlow2.0/5.2
--- a/加载make_csv_dataset数据.ipynb
+++ b/加载make_csv_dataset数据.ipynb
--- a/Tensorflow/TensorFlow2.0/5.4
+++ b/Tensorflow/TensorFlow2.0/5.4
--- a/加载tf.TextLineReader数据.ipynb
+++ b/加载tf.TextLineReader数据.ipynb
--- a/Tensorflow/TensorFlow2.0/7
+++ b/Tensorflow/TensorFlow2.0/7
@@ -1,85 +0,0 @@
-{
- "metadata": {
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": 3
-  },
-  "orig_nbformat": 2
- },
- "nbformat": 4,
- "nbformat_minor": 2,
- "cells": [
-  {
-   "source": [
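-    "# Commonly used tf.data.Dataset APIs\n",
-    "\n",
-    "The tf.data API is designed around composable transformations to give users flexibility. Although many of these transformations are commutative, the order of certain transformations affects performance.\n",
-    "\n",
-    "## 1 map and batch\n",
-    "Invoking the user-defined function passed to the map transformation has overhead related to scheduling and executing that function. Normally this overhead is small compared with the computation the function performs. However, if map does very little work, this overhead can account for a large share of the total cost. In that case, vectorize the user-defined function (that is, have it operate on a batch of inputs at once) and apply the batch transformation before the map transformation.\n",
-    "\n",
-    "## 2 map and cache\n",
-    "\n",
-    "The tf.data.Dataset.cache transformation can cache a dataset in memory or on local storage. If the user-defined function passed to map is expensive, apply cache after map, as long as the resulting dataset still fits in memory or local storage. If the user-defined function increases the space needed to store the dataset beyond the cache capacity, consider preprocessing the data before the training job to reduce resource usage.\n",
-    "\n",
-    "## 3 map and interleave / prefetch / shuffle\n",
-    "Many transformations (including interleave, prefetch, and shuffle) maintain an internal buffer of elements. If the user-defined function passed to map changes the size of the elements, then the relative order of the map transformation and the transformations that buffer elements affects memory usage. In general, choose the order that results in the lower memory footprint, unless a different order is needed for performance (for example, to fuse the map and batch transformations).\n",
-    "\n",
-    "## 4 repeat and shuffle\n",
-    "The tf.data.Dataset.repeat transformation repeats the input data a finite (or infinite) number of times; each repetition of the data is usually called an epoch. The tf.data.Dataset.shuffle transformation randomizes the order of the dataset's examples.\n",
-    "\n",
-    "If repeat is applied before shuffle, the epoch boundaries are blurred: some elements can be repeated before other elements have appeared even once. On the other hand, if shuffle is applied before repeat, performance may drop at the start of each epoch, because the internal state of the shuffle transformation has to be initialized. In other words, the former (repeat before shuffle) gives better performance, while the latter (shuffle before repeat) gives stronger ordering guarantees.\n",
-    "\n",
-    "The fused shuffle_and_repeat transformation (tf.contrib.data.shuffle_and_repeat in TensorFlow 1.x) offered the best of both worlds: good performance and strong ordering guarantees. tf.contrib was removed in TensorFlow 2.x and tf.data.experimental.shuffle_and_repeat is deprecated, so in TensorFlow 2.x apply shuffle before repeat."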
-   ],
-   "cell_type": "markdown",
-   "metadata": {}
-  },
-  {
-   "source": [
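-    "# Pipeline structure\n",
-    "\n",
-    "A typical TensorFlow training input pipeline can be viewed as an ETL process:\n",
-    "\n",
-    "1. Extract: read data from persistent storage, either local (such as HDD or SSD) or remote (such as GCS or HDFS).\n",
-    "2. Transform: use CPU cores to parse the data and preprocess it, for example image decompression, data augmentation (such as random cropping, flipping, and color distortion), shuffling, and batching.\n",
-    "3. Load: load the transformed data onto the accelerator device (for example, a GPU or TPU) that executes the machine learning model."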
-   ],
-   "cell_type": "markdown",
-   "metadata": {}
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Notes on tf.data\n",
-    "\n",
-    "## Ways to load data\n",
-    "### Loading data from memory\n",
-    "For example, use numpy.load() or pandas.read_csv() to load the data into memory, then use a tf.data.Dataset method to load it into TensorFlow.\n",
-    "tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices()\n",
-    "### Loading data from files\n",
-    "tf.data.TFRecordDataset()\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Example of a data pipeline that reads multiple files (not a data pipeline in the broad sense)\n"
-   ]
-  }
- ]
-}