Files
ailearning/docs/da/153.md
2020-10-19 21:08:55 +08:00

474 lines
6.8 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# 二维数据结构DataFrame
In [1]:
```py
import numpy as np
import pandas as pd
```
`DataFrame``pandas` 中的二维数据结构,可以看成一个 `Excel` 中的工作表,或者一个 `SQL` 表,或者一个存储 `Series` 对象的字典。
`DataFrame(data, index, columns)` 中的 `data` 可以接受很多数据类型:
* 一个存储一维数组,字典,列表或者 `Series` 的字典
* 2-D 数组
* 结构或者记录数组
* 一个 `Series`
* 另一个 `DataFrame`
`index` 用于指定行的 `label``columns` 用于指定列的 `label`,如果参数不传入,那么会按照传入的内容进行设定。
## 从 Series 字典中构造
可以使用值为 `Series` 的字典进行构造:
In [2]:
```py
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
```
如果没有传入 `columns` 的值,那么 `columns` 的值默认为字典 `key``index` 默认为所有 `value``index` 的并集。
In [3]:
```py
df = pd.DataFrame(d)
df
```
Out[3]:
| | one | two |
| --- | --- | --- |
| a | 1 | 1 |
| b | 2 | 2 |
| c | 3 | 3 |
| d | NaN | 4 |
如果指定了 `index` 值,`index` 为指定的 `index` 值:
In [4]:
```py
pd.DataFrame(d, index=['d', 'b', 'a'])
```
Out[4]:
| | one | two |
| --- | --- | --- |
| d | NaN | 4 |
| b | 2 | 2 |
| a | 1 | 1 |
如果指定了 `columns` 值,会去字典中寻找,找不到的值为 `NaN`
In [5]:
```py
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
```
Out[5]:
| | two | three |
| --- | --- | --- |
| d | 4 | NaN |
| b | 2 | NaN |
| a | 1 | NaN |
查看 `index``columns`
In [6]:
```py
df.index
```
Out[6]:
```py
Index([u'a', u'b', u'c', u'd'], dtype='object')
```
In [7]:
```py
df.columns
```
Out[7]:
```py
Index([u'one', u'two'], dtype='object')
```
## 从 ndarray 或者 list 字典中构造
如果字典是 `ndarray` 或者 `list`,那么它们的长度要严格保持一致:
In [8]:
```py
d = {'one' : [1., 2., 3., 4.],
'two' : [4., 3., 2., 1.]}
```
`index` 默认为 `range(n)`,其中 `n` 为数组长度:
In [9]:
```py
pd.DataFrame(d)
```
Out[9]:
| | one | two |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 3 |
| 2 | 3 | 2 |
| 3 | 4 | 1 |
如果传入 `index` 参数,那么它必须与数组等长:
In [10]:
```py
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
```
Out[10]:
| | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 3 |
| c | 3 | 2 |
| d | 4 | 1 |
## 从结构数组中构造
`numpy` 支持结构数组的构造:
In [11]:
```py
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
data[:] = [(1,2.,'Hello'), (2,3.,"World")]
data
```
Out[11]:
```py
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
```
参数处理的方式与数组字典类似:
In [12]:
```py
pd.DataFrame(data)
```
Out[12]:
| | A | B | C |
| --- | --- | --- | --- |
| 0 | 1 | 2 | Hello |
| 1 | 2 | 3 | World |
In [13]:
```py
pd.DataFrame(data, index=['first', 'second'])
```
Out[13]:
| | A | B | C |
| --- | --- | --- | --- |
| first | 1 | 2 | Hello |
| second | 2 | 3 | World |
In [14]:
```py
pd.DataFrame(data, columns=['C', 'A', 'B'])
```
Out[14]:
| | C | A | B |
| --- | --- | --- | --- |
| 0 | Hello | 1 | 2 |
| 1 | World | 2 | 3 |
## 从字典列表中构造
字典中同一个键的值会被合并到同一列:
In [15]:
```py
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2)
```
Out[15]:
| | a | b | c |
| --- | --- | --- | --- |
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20 |
In [16]:
```py
pd.DataFrame(data2, index=['first', 'second'])
```
Out[16]:
| | a | b | c |
| --- | --- | --- | --- |
| first | 1 | 2 | NaN |
| second | 5 | 10 | 20 |
In [17]:
```py
pd.DataFrame(data2, columns=['a', 'b'])
```
Out[17]:
| | a | b |
| --- | --- | --- |
| 0 | 1 | 2 |
| 1 | 5 | 10 |
## 从 Series 中构造
相当于将 Series 二维化。
## 其他构造方法
`DataFrame.from_dict` 从现有的一个字典中构造,`DataFrame.from_records` 从现有的一个记录数组中构造:
In [18]:
```py
pd.DataFrame.from_records(data, index='C')
```
Out[18]:
| | A | B |
| --- | --- | --- |
| C | | |
| --- | --- | --- |
| Hello | 1 | 2 |
| World | 2 | 3 |
`DataFrame.from_items` 从字典的 `item` 对构造:
In [19]:
```py
pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
```
Out[19]:
| | A | B |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
## 列操作
`DataFrame` 可以类似于字典一样对列进行操作:
In [20]:
```py
df["one"]
```
Out[20]:
```py
a 1
b 2
c 3
d NaN
Name: one, dtype: float64
```
添加新列:
In [21]:
```py
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df
```
Out[21]:
| | one | two | three | flag |
| --- | --- | --- | --- | --- |
| a | 1 | 1 | 1 | False |
| b | 2 | 2 | 4 | False |
| c | 3 | 3 | 9 | True |
| d | NaN | 4 | NaN | False |
可以像字典一样删除:
In [22]:
```py
del df["two"]
three = df.pop("three")
df
```
Out[22]:
| | one | flag |
| --- | --- | --- |
| a | 1 | False |
| b | 2 | False |
| c | 3 | True |
| d | NaN | False |
给一行赋单一值:
In [23]:
```py
df['foo'] = 'bar'
df
```
Out[23]:
| | one | flag | foo |
| --- | --- | --- | --- |
| a | 1 | False | bar |
| b | 2 | False | bar |
| c | 3 | True | bar |
| d | NaN | False | bar |
如果 `index` 不一致,那么会只保留公共的部分:
In [24]:
```py
df['one_trunc'] = df['one'][:2]
df
```
Out[24]:
| | one | flag | foo | one_trunc |
| --- | --- | --- | --- | --- |
| a | 1 | False | bar | 1 |
| b | 2 | False | bar | 2 |
| c | 3 | True | bar | NaN |
| d | NaN | False | bar | NaN |
也可以直接插入一维数组,但是数组的长度必须与 `index` 一致。
默认新列插入位置在最后,也可以指定位置插入:
In [25]:
```py
df.insert(1, 'bar', df['one'])
df
```
Out[25]:
| | one | bar | flag | foo | one_trunc |
| --- | --- | --- | --- | --- | --- |
| a | 1 | 1 | False | bar | 1 |
| b | 2 | 2 | False | bar | 2 |
| c | 3 | 3 | True | bar | NaN |
| d | NaN | NaN | False | bar | NaN |
添加一个 `test` 新列:
In [28]:
```py
df.assign(test=df["one"] + df["bar"])
```
Out[28]:
| | one | bar | flag | foo | one_trunc | test |
| --- | --- | --- | --- | --- | --- | --- |
| a | 1 | 1 | False | bar | 1 | 2 |
| b | 2 | 2 | False | bar | 2 | 4 |
| c | 3 | 3 | True | bar | NaN | 6 |
| d | NaN | NaN | False | bar | NaN | NaN |
## 索引和选择
基本操作:
| Operation | Syntax | Result |
| --- | --- | --- |
| Select column | df[col] | Series |
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[5:10] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |