2020-10-19 21:01:31

2026-05-12 03:27:11 +08:00 · 2020-10-19 21:01:32 +08:00
parent c457399d84
commit 2fe0addb6f
157 changed files with 149223 additions and 0 deletions
--- a/docs/da/153.md
+++ b/docs/da/153.md
@@ -0,0 +1,474 @@
+# 二维数据结构：DataFrame
+
+In [1]:
+
+```
+import numpy as np
+import pandas as pd
+
+```
+
+`DataFrame` 是 `pandas` 中的二维数据结构，可以看成一个 `Excel` 中的工作表，或者一个 `SQL` 表，或者一个存储 `Series` 对象的字典。
+
+`DataFrame(data, index, columns)` 中的 `data` 可以接受很多数据类型：
+
+*   一个存储一维数组，字典，列表或者 `Series` 的字典
+*   2-D 数组
+*   结构或者记录数组
+*   一个 `Series`
+*   另一个 `DataFrame`
+
+`index` 用于指定行的 `label`，`columns` 用于指定列的 `label`，如果参数不传入，那么会按照传入的内容进行设定。
+
+## 从 Series 字典中构造
+
+可以使用值为 `Series` 的字典进行构造：
+
+In [2]:
+
+```
+d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
+     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
+
+```
+
+如果没有传入 `columns` 的值，那么 `columns` 的值默认为字典 `key`，`index` 默认为所有 `value` 中 `index` 的并集。
+
+In [3]:
+
+```
+df = pd.DataFrame(d)
+
+df
+
+```
+
+Out[3]:
+
+|  | one | two |
+| --- | --- | --- |
+| a | 1 | 1 |
+| b | 2 | 2 |
+| c | 3 | 3 |
+| d | NaN | 4 |
+
+如果指定了 `index` 值，`index` 为指定的 `index` 值：
+
+In [4]:
+
+```
+pd.DataFrame(d, index=['d', 'b', 'a'])
+
+```
+
+Out[4]:
+
+|  | one | two |
+| --- | --- | --- |
+| d | NaN | 4 |
+| b | 2 | 2 |
+| a | 1 | 1 |
+
+如果指定了 `columns` 值，会去字典中寻找，找不到的值为 `NaN`：
+
+In [5]:
+
+```
+pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
+
+```
+
+Out[5]:
+
+|  | two | three |
+| --- | --- | --- |
+| d | 4 | NaN |
+| b | 2 | NaN |
+| a | 1 | NaN |
+
+查看 `index` 和 `columns`：
+
+In [6]:
+
+```
+df.index
+
+```
+
+Out[6]:
+
+```
+Index([u'a', u'b', u'c', u'd'], dtype='object')
+```
+
+In [7]:
+
+```
+df.columns
+
+```
+
+Out[7]:
+
+```
+Index([u'one', u'two'], dtype='object')
+```
+
+## 从 ndarray 或者 list 字典中构造
+
+如果字典是 `ndarray` 或者 `list`，那么它们的长度要严格保持一致：
+
+In [8]:
+
+```
+d = {'one' : [1., 2., 3., 4.],
+     'two' : [4., 3., 2., 1.]}
+
+```
+
+`index` 默认为 `range(n)`，其中 `n` 为数组长度：
+
+In [9]:
+
+```
+pd.DataFrame(d)
+
+```
+
+Out[9]:
+
+|  | one | two |
+| --- | --- | --- |
+| 0 | 1 | 4 |
+| 1 | 2 | 3 |
+| 2 | 3 | 2 |
+| 3 | 4 | 1 |
+
+如果传入 `index` 参数，那么它必须与数组等长：
+
+In [10]:
+
+```
+pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
+
+```
+
+Out[10]:
+
+|  | one | two |
+| --- | --- | --- |
+| a | 1 | 4 |
+| b | 2 | 3 |
+| c | 3 | 2 |
+| d | 4 | 1 |
+
+## 从结构数组中构造
+
+`numpy` 支持结构数组的构造：
+
+In [11]:
+
+```
+data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
+data[:] = [(1,2.,'Hello'), (2,3.,"World")]
+
+data
+
+```
+
+Out[11]:
+
+```
+array([(1, 2.0, 'Hello'), (2, 3.0, 'World')], 
+      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
+```
+
+参数处理的方式与数组字典类似：
+
+In [12]:
+
+```
+pd.DataFrame(data)
+
+```
+
+Out[12]:
+
+|  | A | B | C |
+| --- | --- | --- | --- |
+| 0 | 1 | 2 | Hello |
+| 1 | 2 | 3 | World |
+
+In [13]:
+
+```
+pd.DataFrame(data, index=['first', 'second'])
+
+```
+
+Out[13]:
+
+|  | A | B | C |
+| --- | --- | --- | --- |
+| first | 1 | 2 | Hello |
+| second | 2 | 3 | World |
+
+In [14]:
+
+```
+pd.DataFrame(data, columns=['C', 'A', 'B'])
+
+```
+
+Out[14]:
+
+|  | C | A | B |
+| --- | --- | --- | --- |
+| 0 | Hello | 1 | 2 |
+| 1 | World | 2 | 3 |
+
+## 从字典列表中构造
+
+字典中同一个键的值会被合并到同一列：
+
+In [15]:
+
+```
+data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
+
+pd.DataFrame(data2)
+
+```
+
+Out[15]:
+
+|  | a | b | c |
+| --- | --- | --- | --- |
+| 0 | 1 | 2 | NaN |
+| 1 | 5 | 10 | 20 |
+
+In [16]:
+
+```
+pd.DataFrame(data2, index=['first', 'second'])
+
+```
+
+Out[16]:
+
+|  | a | b | c |
+| --- | --- | --- | --- |
+| first | 1 | 2 | NaN |
+| second | 5 | 10 | 20 |
+
+In [17]:
+
+```
+pd.DataFrame(data2, columns=['a', 'b'])
+
+```
+
+Out[17]:
+
+|  | a | b |
+| --- | --- | --- |
+| 0 | 1 | 2 |
+| 1 | 5 | 10 |
+
+## 从 Series 中构造
+
+相当于将 Series 二维化。
+
+## 其他构造方法
+
+`DataFrame.from_dict` 从现有的一个字典中构造，`DataFrame.from_records` 从现有的一个记录数组中构造：
+
+In [18]:
+
+```
+pd.DataFrame.from_records(data, index='C')
+
+```
+
+Out[18]:
+
+|  | A | B |
+| --- | --- | --- |
+| C |  |  |
+| --- | --- | --- |
+| Hello | 1 | 2 |
+| World | 2 | 3 |
+
+`DataFrame.from_items` 从字典的 `item` 对构造：
+
+In [19]:
+
+```
+pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
+
+```
+
+Out[19]:
+
+|  | A | B |
+| --- | --- | --- |
+| 0 | 1 | 4 |
+| 1 | 2 | 5 |
+| 2 | 3 | 6 |
+
+## 列操作
+
+`DataFrame` 可以类似于字典一样对列进行操作：
+
+In [20]:
+
+```
+df["one"]
+
+```
+
+Out[20]:
+
+```
+a     1
+b     2
+c     3
+d   NaN
+Name: one, dtype: float64
+```
+
+添加新列：
+
+In [21]:
+
+```
+df['three'] = df['one'] * df['two']
+
+df['flag'] = df['one'] > 2
+
+df
+
+```
+
+Out[21]:
+
+|  | one | two | three | flag |
+| --- | --- | --- | --- | --- |
+| a | 1 | 1 | 1 | False |
+| b | 2 | 2 | 4 | False |
+| c | 3 | 3 | 9 | True |
+| d | NaN | 4 | NaN | False |
+
+可以像字典一样删除：
+
+In [22]:
+
+```
+del df["two"]
+
+three = df.pop("three")
+
+df
+
+```
+
+Out[22]:
+
+|  | one | flag |
+| --- | --- | --- |
+| a | 1 | False |
+| b | 2 | False |
+| c | 3 | True |
+| d | NaN | False |
+
+给一行赋单一值：
+
+In [23]:
+
+```
+df['foo'] = 'bar'
+
+df
+
+```
+
+Out[23]:
+
+|  | one | flag | foo |
+| --- | --- | --- | --- |
+| a | 1 | False | bar |
+| b | 2 | False | bar |
+| c | 3 | True | bar |
+| d | NaN | False | bar |
+
+如果 `index` 不一致，那么会只保留公共的部分：
+
+In [24]:
+
+```
+df['one_trunc'] = df['one'][:2]
+
+df
+
+```
+
+Out[24]:
+
+|  | one | flag | foo | one_trunc |
+| --- | --- | --- | --- | --- |
+| a | 1 | False | bar | 1 |
+| b | 2 | False | bar | 2 |
+| c | 3 | True | bar | NaN |
+| d | NaN | False | bar | NaN |
+
+也可以直接插入一维数组，但是数组的长度必须与 `index` 一致。
+
+默认新列插入位置在最后，也可以指定位置插入：
+
+In [25]:
+
+```
+df.insert(1, 'bar', df['one'])
+
+df
+
+```
+
+Out[25]:
+
+|  | one | bar | flag | foo | one_trunc |
+| --- | --- | --- | --- | --- | --- |
+| a | 1 | 1 | False | bar | 1 |
+| b | 2 | 2 | False | bar | 2 |
+| c | 3 | 3 | True | bar | NaN |
+| d | NaN | NaN | False | bar | NaN |
+
+添加一个 `test` 新列：
+
+In [28]:
+
+```
+df.assign(test=df["one"] + df["bar"])
+
+```
+
+Out[28]:
+
+|  | one | bar | flag | foo | one_trunc | test |
+| --- | --- | --- | --- | --- | --- | --- |
+| a | 1 | 1 | False | bar | 1 | 2 |
+| b | 2 | 2 | False | bar | 2 | 4 |
+| c | 3 | 3 | True | bar | NaN | 6 |
+| d | NaN | NaN | False | bar | NaN | NaN |
+
+## 索引和选择
+
+基本操作：
+
+| Operation | Syntax | Result |
+| --- | --- | --- |
+| Select column | df[col] | Series |
+| Select row by label | df.loc[label] | Series |
+| Select row by integer location | df.iloc[loc] | Series |
+| Slice rows | df[5:10] | DataFrame |
+| Select rows by boolean vector | df[bool_vec] | DataFrame |