ailearning/docs/da/153.md

# 二维数据结构：DataFrame

In [1]:

```py
import numpy as np
import pandas as pd

```

`DataFrame` 是 `pandas` 中的二维数据结构，可以看成一个 `Excel` 中的工作表，或者一个 `SQL` 表，或者一个存储 `Series` 对象的字典。

`DataFrame(data, index, columns)` 中的 `data` 可以接受很多数据类型：

*   一个存储一维数组，字典，列表或者 `Series` 的字典
*   2-D 数组
*   结构或者记录数组
*   一个 `Series`
*   另一个 `DataFrame`

`index` 用于指定行的 `label`，`columns` 用于指定列的 `label`，如果参数不传入，那么会按照传入的内容进行设定。

## 从 Series 字典中构造

可以使用值为 `Series` 的字典进行构造：

In [2]:

```py
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

```

如果没有传入 `columns` 的值，那么 `columns` 的值默认为字典 `key`，`index` 默认为所有 `value` 中 `index` 的并集。

In [3]:

```py
df = pd.DataFrame(d)

df

```

Out[3]:

|  | one | two |
| --- | --- | --- |
| a | 1 | 1 |
| b | 2 | 2 |
| c | 3 | 3 |
| d | NaN | 4 |

如果指定了 `index` 值，`index` 为指定的 `index` 值：

In [4]:

```py
pd.DataFrame(d, index=['d', 'b', 'a'])

```

Out[4]:

|  | one | two |
| --- | --- | --- |
| d | NaN | 4 |
| b | 2 | 2 |
| a | 1 | 1 |

如果指定了 `columns` 值，会去字典中寻找，找不到的值为 `NaN`：

In [5]:

```py
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

```

Out[5]:

|  | two | three |
| --- | --- | --- |
| d | 4 | NaN |
| b | 2 | NaN |
| a | 1 | NaN |

查看 `index` 和 `columns`：

In [6]:

```py
df.index

```

Out[6]:

```py
Index([u'a', u'b', u'c', u'd'], dtype='object')
```

In [7]:

```py
df.columns

```

Out[7]:

```py
Index([u'one', u'two'], dtype='object')
```

## 从 ndarray 或者 list 字典中构造

如果字典是 `ndarray` 或者 `list`，那么它们的长度要严格保持一致：

In [8]:

```py
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}

```

`index` 默认为 `range(n)`，其中 `n` 为数组长度：

In [9]:

```py
pd.DataFrame(d)

```

Out[9]:

|  | one | two |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 3 |
| 2 | 3 | 2 |
| 3 | 4 | 1 |

如果传入 `index` 参数，那么它必须与数组等长：

In [10]:

```py
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

```

Out[10]:

|  | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 3 |
| c | 3 | 2 |
| d | 4 | 1 |

## 从结构数组中构造

`numpy` 支持结构数组的构造：

In [11]:

```py
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
data[:] = [(1,2.,'Hello'), (2,3.,"World")]

data

```

Out[11]:

```py
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
```

参数处理的方式与数组字典类似：

In [12]:

```py
pd.DataFrame(data)

```

Out[12]:

|  | A | B | C |
| --- | --- | --- | --- |
| 0 | 1 | 2 | Hello |
| 1 | 2 | 3 | World |

In [13]:

```py
pd.DataFrame(data, index=['first', 'second'])

```

Out[13]:

|  | A | B | C |
| --- | --- | --- | --- |
| first | 1 | 2 | Hello |
| second | 2 | 3 | World |

In [14]:

```py
pd.DataFrame(data, columns=['C', 'A', 'B'])

```

Out[14]:

|  | C | A | B |
| --- | --- | --- | --- |
| 0 | Hello | 1 | 2 |
| 1 | World | 2 | 3 |

## 从字典列表中构造

字典中同一个键的值会被合并到同一列：

In [15]:

```py
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(data2)

```

Out[15]:

|  | a | b | c |
| --- | --- | --- | --- |
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20 |

In [16]:

```py
pd.DataFrame(data2, index=['first', 'second'])

```

Out[16]:

|  | a | b | c |
| --- | --- | --- | --- |
| first | 1 | 2 | NaN |
| second | 5 | 10 | 20 |

In [17]:

```py
pd.DataFrame(data2, columns=['a', 'b'])

```

Out[17]:

|  | a | b |
| --- | --- | --- |
| 0 | 1 | 2 |
| 1 | 5 | 10 |

## 从 Series 中构造

相当于将 Series 二维化。

## 其他构造方法

`DataFrame.from_dict` 从现有的一个字典中构造，`DataFrame.from_records` 从现有的一个记录数组中构造：

In [18]:

```py
pd.DataFrame.from_records(data, index='C')

```

Out[18]:

|  | A | B |
| --- | --- | --- |
| C |  |  |
| --- | --- | --- |
| Hello | 1 | 2 |
| World | 2 | 3 |

`DataFrame.from_items` 从字典的 `item` 对构造：

In [19]:

```py
pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])

```

Out[19]:

|  | A | B |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |

## 列操作

`DataFrame` 可以类似于字典一样对列进行操作：

In [20]:

```py
df["one"]

```

Out[20]:

```py
a     1
b     2
c     3
d   NaN
Name: one, dtype: float64
```

添加新列：

In [21]:

```py
df['three'] = df['one'] * df['two']

df['flag'] = df['one'] > 2

df

```

Out[21]:

|  | one | two | three | flag |
| --- | --- | --- | --- | --- |
| a | 1 | 1 | 1 | False |
| b | 2 | 2 | 4 | False |
| c | 3 | 3 | 9 | True |
| d | NaN | 4 | NaN | False |

可以像字典一样删除：

In [22]:

```py
del df["two"]

three = df.pop("three")

df

```

Out[22]:

|  | one | flag |
| --- | --- | --- |
| a | 1 | False |
| b | 2 | False |
| c | 3 | True |
| d | NaN | False |

给一行赋单一值：

In [23]:

```py
df['foo'] = 'bar'

df

```

Out[23]:

|  | one | flag | foo |
| --- | --- | --- | --- |
| a | 1 | False | bar |
| b | 2 | False | bar |
| c | 3 | True | bar |
| d | NaN | False | bar |

如果 `index` 不一致，那么会只保留公共的部分：

In [24]:

```py
df['one_trunc'] = df['one'][:2]

df

```

Out[24]:

|  | one | flag | foo | one_trunc |
| --- | --- | --- | --- | --- |
| a | 1 | False | bar | 1 |
| b | 2 | False | bar | 2 |
| c | 3 | True | bar | NaN |
| d | NaN | False | bar | NaN |

也可以直接插入一维数组，但是数组的长度必须与 `index` 一致。

默认新列插入位置在最后，也可以指定位置插入：

In [25]:

```py
df.insert(1, 'bar', df['one'])

df

```

Out[25]:

|  | one | bar | flag | foo | one_trunc |
| --- | --- | --- | --- | --- | --- |
| a | 1 | 1 | False | bar | 1 |
| b | 2 | 2 | False | bar | 2 |
| c | 3 | 3 | True | bar | NaN |
| d | NaN | NaN | False | bar | NaN |

添加一个 `test` 新列：

In [28]:

```py
df.assign(test=df["one"] + df["bar"])

```

Out[28]:

|  | one | bar | flag | foo | one_trunc | test |
| --- | --- | --- | --- | --- | --- | --- |
| a | 1 | 1 | False | bar | 1 | 2 |
| b | 2 | 2 | False | bar | 2 | 4 |
| c | 3 | 3 | True | bar | NaN | 6 |
| d | NaN | NaN | False | bar | NaN | NaN |

## 索引和选择

基本操作：

| Operation | Syntax | Result |
| --- | --- | --- |
| Select column | df[col] | Series |
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[5:10] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |