mirror of
https://github.com/apachecn/ailearning.git
synced 2026-02-03 10:24:39 +08:00
474 lines
6.8 KiB
Markdown
# Two-Dimensional Data Structure: DataFrame

In [1]:

```py
import numpy as np
import pandas as pd
```

`DataFrame` is the two-dimensional data structure in `pandas`. It can be thought of as an `Excel` worksheet, a `SQL` table, or a dict of `Series` objects.

The `data` argument of `DataFrame(data, index, columns)` accepts many kinds of input:

* a dict of 1-D arrays, dicts, lists, or `Series`
* a 2-D array
* a structured or record array
* a `Series`
* another `DataFrame`

`index` sets the row labels and `columns` sets the column labels. If either argument is omitted, it is inferred from the data passed in.

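
The list above mentions 2-D arrays, which the sections below do not cover. As a minimal sketch (the array values and the labels `r1`, `r2`, `x`, `y`, `z` are made up for illustration), this is how the default and explicit labels behave for that input:

```python
import numpy as np
import pandas as pd

# A 2-D array with 2 rows and 3 columns (illustrative values).
arr = np.arange(6).reshape(2, 3)

# Without labels, rows and columns are both numbered from 0.
df_default = pd.DataFrame(arr)

# With explicit labels; their lengths must match the array's shape.
df_labeled = pd.DataFrame(arr, index=['r1', 'r2'], columns=['x', 'y', 'z'])

print(df_default)
print(df_labeled)
```
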
## Constructing from a dict of Series

A `DataFrame` can be built from a dict whose values are `Series`:

In [2]:

```py
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
```

If `columns` is not passed, the columns default to the dict keys, and `index` defaults to the union of the indexes of all the values.

In [3]:

```py
df = pd.DataFrame(d)

df
```

Out[3]:

| | one | two |
| --- | --- | --- |
| a | 1 | 1 |
| b | 2 | 2 |
| c | 3 | 3 |
| d | NaN | 4 |

If `index` is given, the rows are exactly the given labels, in the given order:

In [4]:

```py
pd.DataFrame(d, index=['d', 'b', 'a'])
```

Out[4]:

| | one | two |
| --- | --- | --- |
| d | NaN | 4 |
| b | 2 | 2 |
| a | 1 | 1 |

If `columns` is given, each name is looked up in the dict; a name that is not found produces a column of `NaN`:

In [5]:

```py
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
```

Out[5]:

| | two | three |
| --- | --- | --- |
| d | 4 | NaN |
| b | 2 | NaN |
| a | 1 | NaN |

To inspect `index` and `columns`:

In [6]:

```py
df.index
```

Out[6]:

```py
Index([u'a', u'b', u'c', u'd'], dtype='object')
```

In [7]:

```py
df.columns
```

Out[7]:

```py
Index([u'one', u'two'], dtype='object')
```

## Constructing from a dict of ndarrays or lists

If the dict values are `ndarray` or `list` objects, they must all have exactly the same length:

In [8]:

```py
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}
```

`index` defaults to `range(n)`, where `n` is the array length:

In [9]:

```py
pd.DataFrame(d)
```

Out[9]:

| | one | two |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 3 |
| 2 | 3 | 2 |
| 3 | 4 | 1 |

If `index` is passed, it must have the same length as the arrays:

In [10]:

```py
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
```

Out[10]:

| | one | two |
| --- | --- | --- |
| a | 1 | 4 |
| b | 2 | 3 |
| c | 3 | 2 |
| d | 4 | 1 |

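
The equal-length requirement described in this section can be checked directly. The sketch below uses a hypothetical dict `bad` with lists of different lengths. It then shows one possible workaround, wrapping the lists in `Series` so that pandas aligns on the index instead:

```python
import pandas as pd

# Lists of unequal length cannot form the columns of one table.
bad = {'one': [1., 2., 3.], 'two': [4., 3., 2., 1.]}
try:
    pd.DataFrame(bad)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)

# Wrapping each list in a Series lets pandas align on the (default) index,
# filling the missing position with NaN.
ok = pd.DataFrame({k: pd.Series(v) for k, v in bad.items()})
print(ok)
```
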
## Constructing from a structured array

`numpy` supports structured arrays:

In [11]:

```py
# 'S10' is a fixed-width byte-string field (the 'a10' spelling is deprecated in NumPy 2.0)
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'S10')])
data[:] = [(1,2.,'Hello'), (2,3.,"World")]

data
```

Out[11]:

```py
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
```

The `index` and `columns` arguments are handled much as they are for a dict of arrays:

In [12]:

```py
pd.DataFrame(data)
```

Out[12]:

| | A | B | C |
| --- | --- | --- | --- |
| 0 | 1 | 2 | Hello |
| 1 | 2 | 3 | World |

In [13]:

```py
pd.DataFrame(data, index=['first', 'second'])
```

Out[13]:

| | A | B | C |
| --- | --- | --- | --- |
| first | 1 | 2 | Hello |
| second | 2 | 3 | World |

In [14]:

```py
pd.DataFrame(data, columns=['C', 'A', 'B'])
```

Out[14]:

| | C | A | B |
| --- | --- | --- | --- |
| 0 | Hello | 1 | 2 |
| 1 | World | 2 | 3 |

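
The outputs in this section were recorded on Python 2, where byte strings and text look the same. On Python 3 a byte-string field is likely to show up as `bytes` values. A variant of the same array with a Unicode field (`'U10'`, an assumption not in the original example) keeps the column as ordinary strings:

```python
import numpy as np
import pandas as pd

# 'U10' is a fixed-width Unicode field, so values come back as str.
data_u = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'U10')])
data_u[:] = [(1, 2., 'Hello'), (2, 3., 'World')]

# Field names become column labels, as with the byte-string version.
df_u = pd.DataFrame(data_u)
print(df_u)
```
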
## Constructing from a list of dicts

Values that share a key across the dicts are gathered into the same column; a key missing from one dict becomes `NaN` in that row:

In [15]:

```py
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

pd.DataFrame(data2)
```

Out[15]:

| | a | b | c |
| --- | --- | --- | --- |
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20 |

In [16]:

```py
pd.DataFrame(data2, index=['first', 'second'])
```

Out[16]:

| | a | b | c |
| --- | --- | --- | --- |
| first | 1 | 2 | NaN |
| second | 5 | 10 | 20 |

In [17]:

```py
pd.DataFrame(data2, columns=['a', 'b'])
```

Out[17]:

| | a | b |
| --- | --- | --- |
| 0 | 1 | 2 |
| 1 | 5 | 10 |

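## Constructing from a Series

This turns the `Series` into a two-dimensional object: the result is a single-column `DataFrame` that keeps the index of the `Series`.
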
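
The original gives no example for this case. A minimal sketch, assuming a small named `Series` invented for illustration:

```python
import pandas as pd

s = pd.Series([1., 2., 3.], index=['a', 'b', 'c'], name='one')

# Passing a Series to the constructor yields a single-column DataFrame;
# the Series name becomes the column label.
df_from_series = pd.DataFrame(s)

# Series.to_frame() does the same thing and can rename the column.
df_renamed = s.to_frame(name='uno')

print(df_from_series)
print(df_renamed)
```
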
## Other constructors

`DataFrame.from_dict` builds a `DataFrame` from an existing dict, and `DataFrame.from_records` builds one from an existing record array:

In [18]:

```py
pd.DataFrame.from_records(data, index='C')
```

Out[18]:

| C | A | B |
| --- | --- | --- |
| Hello | 1 | 2 |
| World | 2 | 3 |

`DataFrame.from_items` builds one from a sequence of dict-style `(key, value)` item pairs. It was deprecated in pandas 0.23 and removed in pandas 1.0, so this call only runs on older versions:

In [19]:

```py
pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
```

Out[19]:

| | A | B |
| --- | --- | --- |
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |

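
`DataFrame.from_dict` is mentioned above without an example. The sketch below reuses the same item pairs through `dict(...)`, which gives the same table on current pandas versions. The row-oriented variant and its column names `x`, `y`, `z` are illustrative assumptions:

```python
import pandas as pd

items = [('A', [1, 2, 3]), ('B', [4, 5, 6])]

# dict(items) preserves insertion order on Python 3.7+, so the column
# order matches the item order.
by_columns = pd.DataFrame.from_dict(dict(items))

# orient='index' treats each key as a row label instead;
# `columns` then names the resulting columns.
by_rows = pd.DataFrame.from_dict(dict(items), orient='index',
                                 columns=['x', 'y', 'z'])

print(by_columns)
print(by_rows)
```
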
## Column operations

The columns of a `DataFrame` can be manipulated much like the entries of a dict:

In [20]:

```py
df["one"]
```

Out[20]:

```py
a     1
b     2
c     3
d   NaN
Name: one, dtype: float64
```

Adding new columns:

In [21]:

```py
df['three'] = df['one'] * df['two']

df['flag'] = df['one'] > 2

df
```

Out[21]:

| | one | two | three | flag |
| --- | --- | --- | --- | --- |
| a | 1 | 1 | 1 | False |
| b | 2 | 2 | 4 | False |
| c | 3 | 3 | 9 | True |
| d | NaN | 4 | NaN | False |

Columns can be deleted just as in a dict:

In [22]:

```py
del df["two"]

three = df.pop("three")

df
```

Out[22]:

| | one | flag |
| --- | --- | --- |
| a | 1 | False |
| b | 2 | False |
| c | 3 | True |
| d | NaN | False |

Assigning a single scalar to a column fills every row of that column with it:

In [23]:

```py
df['foo'] = 'bar'

df
```

Out[23]:

| | one | flag | foo |
| --- | --- | --- | --- |
| a | 1 | False | bar |
| b | 2 | False | bar |
| c | 3 | True | bar |
| d | NaN | False | bar |

If the assigned `Series` has a different `index`, it is aligned to the index of the `DataFrame`: only the labels they share are kept, and the remaining rows become `NaN`:

In [24]:

```py
df['one_trunc'] = df['one'][:2]

df
```

Out[24]:

| | one | flag | foo | one_trunc |
| --- | --- | --- | --- | --- |
| a | 1 | False | bar | 1 |
| b | 2 | False | bar | 2 |
| c | 3 | True | bar | NaN |
| d | NaN | False | bar | NaN |

A one-dimensional array can also be inserted directly, but its length must match the length of `index`.

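
A short sketch of the array case just described. It uses a stand-alone frame named `demo` (a hypothetical name, so the running `df` example is not disturbed):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'one': [1., 2., 3., np.nan]}, index=['a', 'b', 'c', 'd'])

# A plain array carries no labels, so it is assigned by position.
demo['arr'] = np.array([10, 20, 30, 40])

# An array of the wrong length is rejected rather than padded with NaN.
try:
    demo['short'] = np.array([1, 2])
    length_error = None
except ValueError as exc:
    length_error = exc

print(demo)
print(type(length_error).__name__)
```
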
By default a new column is placed last; `insert` puts it at a given position instead:

In [25]:

```py
df.insert(1, 'bar', df['one'])

df
```

Out[25]:

| | one | bar | flag | foo | one_trunc |
| --- | --- | --- | --- | --- | --- |
| a | 1 | 1 | False | bar | 1 |
| b | 2 | 2 | False | bar | 2 |
| c | 3 | 3 | True | bar | NaN |
| d | NaN | NaN | False | bar | NaN |

Adding a new `test` column with `assign`:

In [28]:

```py
df.assign(test=df["one"] + df["bar"])
```

Out[28]:

| | one | bar | flag | foo | one_trunc | test |
| --- | --- | --- | --- | --- | --- | --- |
| a | 1 | 1 | False | bar | 1 | 2 |
| b | 2 | 2 | False | bar | 2 | 4 |
| c | 3 | 3 | True | bar | NaN | 6 |
| d | NaN | NaN | False | bar | NaN | NaN |

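
One detail that the `assign` call above does not make obvious is that it returns a new `DataFrame` and leaves the original untouched. A small sketch on a stand-alone frame (`demo` and the `ratio` column are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({'one': [1., 2., 3.], 'bar': [1., 2., 3.]},
                    index=['a', 'b', 'c'])

# assign returns a new DataFrame; `demo` itself is left unchanged.
with_test = demo.assign(test=demo['one'] + demo['bar'])

# A callable receives the DataFrame being assigned to, which helps in chains.
with_ratio = demo.assign(ratio=lambda d: d['one'] / d['bar'])

print(list(demo.columns))
print(with_test)
```

To keep the new column, bind the result to a name, for example `demo = demo.assign(...)`.
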
## Indexing and selection

Basic operations:

| Operation | Syntax | Result |
| --- | --- | --- |
| Select column | `df[col]` | Series |
| Select row by label | `df.loc[label]` | Series |
| Select row by integer location | `df.iloc[loc]` | Series |
| Slice rows | `df[5:10]` | DataFrame |
| Select rows by boolean vector | `df[bool_vec]` | DataFrame |

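
Each row of the table can be tried on a small frame. The sketch below uses an illustrative `demo` frame with four labeled rows, so the slice is `1:3` rather than `5:10`:

```python
import pandas as pd

demo = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]},
                    index=['a', 'b', 'c', 'd'])

col = demo['one']              # select column -> Series
row_by_label = demo.loc['b']   # select row by label -> Series
row_by_pos = demo.iloc[1]      # select row by integer location -> Series
rows = demo[1:3]               # slice rows by position -> DataFrame
big = demo[demo['one'] > 2]    # boolean vector -> DataFrame

print(row_by_label)
print(rows)
print(big)
```
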