mirror of
https://github.com/apachecn/ailearning.git
synced 2026-04-14 10:20:44 +08:00
2796 lines
106 KiB
Markdown
2796 lines
106 KiB
Markdown
# 十分钟上手 Pandas
|
||
|
||
`pandas` 是一个 `Python Data Analysis Library`。
|
||
|
||
安装请参考官网的教程,如果安装了 `Anaconda`,则不需要安装 `pandas` 库。
|
||
|
||
In [1]:
|
||
|
||
```py
|
||
%matplotlib inline
|
||
|
||
import pandas as pd
|
||
import numpy as np
|
||
import matplotlib.pyplot as plt
|
||
|
||
```
|
||
|
||
## 产生 Pandas 对象
|
||
|
||
`pandas` 中有三种基本结构:
|
||
|
||
* `Series`
|
||
* 1D labeled homogeneously-typed array
|
||
* `DataFrame`
|
||
* General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
|
||
* `Panel`
|
||
* General 3D labeled, also size-mutable array
|
||
|
||
### Series
|
||
|
||
一维 `Series` 可以用一维列表初始化:
|
||
|
||
In [2]:
|
||
|
||
```py
|
||
s = pd.Series([1,3,5,np.nan,6,8])
|
||
|
||
print s
|
||
|
||
```
|
||
|
||
```py
|
||
0 1
|
||
1 3
|
||
2 5
|
||
3 NaN
|
||
4 6
|
||
5 8
|
||
dtype: float64
|
||
|
||
```
|
||
|
||
默认情况下,`Series` 的下标都是数字(可以使用额外参数指定),类型是统一的。
|
||
|
||
### DataFrame
|
||
|
||
`DataFrame` 则是个二维结构,这里首先构造一组时间序列,作为我们第一维的下标:
|
||
|
||
In [3]:
|
||
|
||
```py
|
||
dates = pd.date_range('20130101', periods=6)
|
||
|
||
print dates
|
||
|
||
```
|
||
|
||
```py
|
||
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
|
||
'2013-01-05', '2013-01-06'],
|
||
dtype='datetime64[ns]', freq='D')
|
||
|
||
```
|
||
|
||
然后创建一个 `DataFrame` 结构:
|
||
|
||
In [4]:
|
||
|
||
```py
|
||
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[4]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | -0.605936 | -0.861658 | -1.001924 | 1.528584 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 1.291653 |
|
||
|
||
默认情况下,如果不指定 `index` 参数和 `columns`,那么他们的值将用从 `0` 开始的数字替代。
|
||
|
||
除了向 `DataFrame` 中传入二维数组,我们也可以使用字典传入数据:
|
||
|
||
In [5]:
|
||
|
||
```py
|
||
df2 = pd.DataFrame({'A' : 1.,
|
||
'B' : pd.Timestamp('20130102'),
|
||
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
|
||
'D' : np.array([3] * 4,dtype='int32'),
|
||
'E' : pd.Categorical(["test","train","test","train"]),
|
||
'F' : 'foo' })
|
||
|
||
df2
|
||
|
||
```
|
||
|
||
Out[5]:
|
||
|
||
| | A | B | C | D | E | F |
|
||
| --- | --- | --- | --- | --- | --- | --- |
|
||
| 0 | 1 | 2013-01-02 | 1 | 3 | test | foo |
|
||
| 1 | 1 | 2013-01-02 | 1 | 3 | train | foo |
|
||
| 2 | 1 | 2013-01-02 | 1 | 3 | test | foo |
|
||
| 3 | 1 | 2013-01-02 | 1 | 3 | train | foo |
|
||
|
||
字典的每个 `key` 代表一列,其 `value` 可以是各种能够转化为 `Series` 的对象。
|
||
|
||
与 `Series` 要求所有的类型都一致不同,`DataFrame` 值要求每一列数据的格式相同:
|
||
|
||
In [6]:
|
||
|
||
```py
|
||
df2.dtypes
|
||
|
||
```
|
||
|
||
Out[6]:
|
||
|
||
```py
|
||
A float64
|
||
B datetime64[ns]
|
||
C float32
|
||
D int32
|
||
E category
|
||
F object
|
||
dtype: object
|
||
```
|
||
|
||
## 查看数据
|
||
|
||
### 头尾数据
|
||
|
||
`head` 和 `tail` 方法可以分别查看最前面几行和最后面几行的数据(默认为 5):
|
||
|
||
In [7]:
|
||
|
||
```py
|
||
df.head()
|
||
|
||
```
|
||
|
||
Out[7]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | -0.605936 | -0.861658 | -1.001924 | 1.528584 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 |
|
||
|
||
最后 3 行:
|
||
|
||
In [8]:
|
||
|
||
```py
|
||
df.tail(3)
|
||
|
||
```
|
||
|
||
Out[8]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 1.291653 |
|
||
|
||
### 下标,列标,数据
|
||
|
||
下标使用 `index` 属性查看:
|
||
|
||
In [9]:
|
||
|
||
```py
|
||
df.index
|
||
|
||
```
|
||
|
||
Out[9]:
|
||
|
||
```py
|
||
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
|
||
'2013-01-05', '2013-01-06'],
|
||
dtype='datetime64[ns]', freq='D')
|
||
```
|
||
|
||
列标使用 `columns` 属性查看:
|
||
|
||
In [10]:
|
||
|
||
```py
|
||
df.columns
|
||
|
||
```
|
||
|
||
Out[10]:
|
||
|
||
```py
|
||
Index([u'A', u'B', u'C', u'D'], dtype='object')
|
||
```
|
||
|
||
数据值使用 `values` 查看:
|
||
|
||
In [11]:
|
||
|
||
```py
|
||
df.values
|
||
|
||
```
|
||
|
||
Out[11]:
|
||
|
||
```py
|
||
array([[-0.60593585, -0.86165752, -1.00192387, 1.52858443],
|
||
[-0.16540784, 0.38833783, 1.18718697, 1.81981793],
|
||
[ 0.06525454, -1.60807414, -1.2823306 , -0.28606716],
|
||
[ 1.28930486, 0.49711531, -0.22535143, 0.04023897],
|
||
[ 0.03823179, 0.87505664, -0.0925258 , 0.93443212],
|
||
[-2.16345271, -0.01027865, 1.69988608, 1.29165337]])
|
||
```
|
||
|
||
### 统计数据
|
||
|
||
查看简单的统计数据:
|
||
|
||
In [12]:
|
||
|
||
```py
|
||
df.describe()
|
||
|
||
```
|
||
|
||
Out[12]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
|
||
| mean | -0.257001 | -0.119917 | 0.047490 | 0.888110 |
|
||
| std | 1.126657 | 0.938705 | 1.182629 | 0.841529 |
|
||
| min | -2.163453 | -1.608074 | -1.282331 | -0.286067 |
|
||
| 25% | -0.495804 | -0.648813 | -0.807781 | 0.263787 |
|
||
| 50% | -0.063588 | 0.189030 | -0.158939 | 1.113043 |
|
||
| 75% | 0.058499 | 0.469921 | 0.867259 | 1.469352 |
|
||
| max | 1.289305 | 0.875057 | 1.699886 | 1.819818 |
|
||
|
||
### 转置
|
||
|
||
In [13]:
|
||
|
||
```py
|
||
df.T
|
||
|
||
```
|
||
|
||
Out[13]:
|
||
|
||
| | 2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 |
|
||
| --- | --- | --- | --- | --- | --- | --- |
|
||
| A | -0.605936 | -0.165408 | 0.065255 | 1.289305 | 0.038232 | -2.163453 |
|
||
| B | -0.861658 | 0.388338 | -1.608074 | 0.497115 | 0.875057 | -0.010279 |
|
||
| C | -1.001924 | 1.187187 | -1.282331 | -0.225351 | -0.092526 | 1.699886 |
|
||
| D | 1.528584 | 1.819818 | -0.286067 | 0.040239 | 0.934432 | 1.291653 |
|
||
|
||
## 排序
|
||
|
||
`sort_index(axis=0, ascending=True)` 方法按照下标大小进行排序,`axis=0` 表示按第 0 维进行排序。
|
||
|
||
In [14]:
|
||
|
||
```py
|
||
df.sort_index(ascending=False)
|
||
|
||
```
|
||
|
||
Out[14]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 1.291653 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 |
|
||
| 2013-01-01 | -0.605936 | -0.861658 | -1.001924 | 1.528584 |
|
||
|
||
In [15]:
|
||
|
||
```py
|
||
df.sort_index(axis=1, ascending=False)
|
||
|
||
```
|
||
|
||
Out[15]:
|
||
|
||
| | D | C | B | A |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | 1.528584 | -1.001924 | -0.861658 | -0.605936 |
|
||
| 2013-01-02 | 1.819818 | 1.187187 | 0.388338 | -0.165408 |
|
||
| 2013-01-03 | -0.286067 | -1.282331 | -1.608074 | 0.065255 |
|
||
| 2013-01-04 | 0.040239 | -0.225351 | 0.497115 | 1.289305 |
|
||
| 2013-01-05 | 0.934432 | -0.092526 | 0.875057 | 0.038232 |
|
||
| 2013-01-06 | 1.291653 | 1.699886 | -0.010279 | -2.163453 |
|
||
|
||
`sort_values(by, axis=0, ascending=True)` 方法按照 `by` 的值的大小进行排序,例如按照 `B` 列的大小:
|
||
|
||
In [16]:
|
||
|
||
```py
|
||
df.sort_values(by="B")
|
||
|
||
```
|
||
|
||
Out[16]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 |
|
||
| 2013-01-01 | -0.605936 | -0.861658 | -1.001924 | 1.528584 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 1.291653 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 |
|
||
|
||
## 索引
|
||
|
||
虽然 `DataFrame` 支持 `Python/Numpy` 的索引语法,但是推荐使用 `.at, .iat, .loc, .iloc 和 .ix` 方法进行索引。
|
||
|
||
### 读取数据
|
||
|
||
选择单列数据:
|
||
|
||
In [17]:
|
||
|
||
```py
|
||
df["A"]
|
||
|
||
```
|
||
|
||
Out[17]:
|
||
|
||
```py
|
||
2013-01-01 -0.605936
|
||
2013-01-02 -0.165408
|
||
2013-01-03 0.065255
|
||
2013-01-04 1.289305
|
||
2013-01-05 0.038232
|
||
2013-01-06 -2.163453
|
||
Freq: D, Name: A, dtype: float64
|
||
```
|
||
|
||
也可以用 `df.A`:
|
||
|
||
In [18]:
|
||
|
||
```py
|
||
df.A
|
||
|
||
```
|
||
|
||
Out[18]:
|
||
|
||
```py
|
||
2013-01-01 -0.605936
|
||
2013-01-02 -0.165408
|
||
2013-01-03 0.065255
|
||
2013-01-04 1.289305
|
||
2013-01-05 0.038232
|
||
2013-01-06 -2.163453
|
||
Freq: D, Name: A, dtype: float64
|
||
```
|
||
|
||
使用切片读取多行:
|
||
|
||
In [19]:
|
||
|
||
```py
|
||
df[0:3]
|
||
|
||
```
|
||
|
||
Out[19]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | -0.605936 | -0.861658 | -1.001924 | 1.528584 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 |
|
||
|
||
`index` 名字也可以进行切片:
|
||
|
||
In [20]:
|
||
|
||
```py
|
||
df["20130101":"20130103"]
|
||
|
||
```
|
||
|
||
Out[20]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | -0.605936 | -0.861658 | -1.001924 | 1.528584 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 |
|
||
|
||
### 使用 `label` 索引
|
||
|
||
`loc` 可以方便的使用 `label` 进行索引:
|
||
|
||
In [21]:
|
||
|
||
```py
|
||
df.loc[dates[0]]
|
||
|
||
```
|
||
|
||
Out[21]:
|
||
|
||
```py
|
||
A -0.605936
|
||
B -0.861658
|
||
C -1.001924
|
||
D 1.528584
|
||
Name: 2013-01-01 00:00:00, dtype: float64
|
||
```
|
||
|
||
多列数据:
|
||
|
||
In [22]:
|
||
|
||
```py
|
||
df.loc[:,['A','B']]
|
||
|
||
```
|
||
|
||
Out[22]:
|
||
|
||
| | A | B |
|
||
| --- | --- | --- |
|
||
| 2013-01-01 | -0.605936 | -0.861658 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 |
|
||
|
||
选择多行多列:
|
||
|
||
In [23]:
|
||
|
||
```py
|
||
df.loc['20130102':'20130104',['A','B']]
|
||
|
||
```
|
||
|
||
Out[23]:
|
||
|
||
| | A | B |
|
||
| --- | --- | --- |
|
||
| 2013-01-02 | -0.165408 | 0.388338 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 |
|
||
|
||
数据降维:
|
||
|
||
In [24]:
|
||
|
||
```py
|
||
df.loc['20130102',['A','B']]
|
||
|
||
```
|
||
|
||
Out[24]:
|
||
|
||
```py
|
||
A -0.165408
|
||
B 0.388338
|
||
Name: 2013-01-02 00:00:00, dtype: float64
|
||
```
|
||
|
||
得到标量值:
|
||
|
||
In [25]:
|
||
|
||
```py
|
||
df.loc[dates[0],'B']
|
||
|
||
```
|
||
|
||
Out[25]:
|
||
|
||
```py
|
||
-0.86165751902832299
|
||
```
|
||
|
||
不过得到标量值可以用 `at`,速度更快:
|
||
|
||
In [26]:
|
||
|
||
```py
|
||
%timeit -n100 df.loc[dates[0],'B']
|
||
%timeit -n100 df.at[dates[0],'B']
|
||
|
||
print df.at[dates[0],'B']
|
||
|
||
```
|
||
|
||
```py
|
||
100 loops, best of 3: 329 µs per loop
|
||
100 loops, best of 3: 31.1 µs per loop
|
||
-0.861657519028
|
||
|
||
```
|
||
|
||
### 使用位置索引
|
||
|
||
`iloc` 使用位置进行索引:
|
||
|
||
In [27]:
|
||
|
||
```py
|
||
df.iloc[3]
|
||
|
||
```
|
||
|
||
Out[27]:
|
||
|
||
```py
|
||
A 1.289305
|
||
B 0.497115
|
||
C -0.225351
|
||
D 0.040239
|
||
Name: 2013-01-04 00:00:00, dtype: float64
|
||
```
|
||
|
||
连续切片:
|
||
|
||
In [28]:
|
||
|
||
```py
|
||
df.iloc[3:5,0:2]
|
||
|
||
```
|
||
|
||
Out[28]:
|
||
|
||
| | A | B |
|
||
| --- | --- | --- |
|
||
| 2013-01-04 | 1.289305 | 0.497115 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 |
|
||
|
||
索引不连续的部分:
|
||
|
||
In [29]:
|
||
|
||
```py
|
||
df.iloc[[1,2,4],[0,2]]
|
||
|
||
```
|
||
|
||
Out[29]:
|
||
|
||
| | A | C |
|
||
| --- | --- | --- |
|
||
| 2013-01-02 | -0.165408 | 1.187187 |
|
||
| 2013-01-03 | 0.065255 | -1.282331 |
|
||
| 2013-01-05 | 0.038232 | -0.092526 |
|
||
|
||
索引整行:
|
||
|
||
In [30]:
|
||
|
||
```py
|
||
df.iloc[1:3,:]
|
||
|
||
```
|
||
|
||
Out[30]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 |
|
||
|
||
整列:
|
||
|
||
In [31]:
|
||
|
||
```py
|
||
df.iloc[:, 1:3]
|
||
|
||
```
|
||
|
||
Out[31]:
|
||
|
||
| | B | C |
|
||
| --- | --- | --- |
|
||
| 2013-01-01 | -0.861658 | -1.001924 |
|
||
| 2013-01-02 | 0.388338 | 1.187187 |
|
||
| 2013-01-03 | -1.608074 | -1.282331 |
|
||
| 2013-01-04 | 0.497115 | -0.225351 |
|
||
| 2013-01-05 | 0.875057 | -0.092526 |
|
||
| 2013-01-06 | -0.010279 | 1.699886 |
|
||
|
||
标量值:
|
||
|
||
In [32]:
|
||
|
||
```py
|
||
df.iloc[1,1]
|
||
|
||
```
|
||
|
||
Out[32]:
|
||
|
||
```py
|
||
0.3883378290420279
|
||
```
|
||
|
||
当然,使用 `iat` 索引标量值更快:
|
||
|
||
In [33]:
|
||
|
||
```py
|
||
%timeit -n100 df.iloc[1,1]
|
||
%timeit -n100 df.iat[1,1]
|
||
|
||
df.iat[1,1]
|
||
|
||
```
|
||
|
||
```py
|
||
100 loops, best of 3: 236 µs per loop
|
||
100 loops, best of 3: 14.5 µs per loop
|
||
|
||
```
|
||
|
||
Out[33]:
|
||
|
||
```py
|
||
0.3883378290420279
|
||
```
|
||
|
||
### 布尔型索引
|
||
|
||
所有 `A` 列大于 0 的行:
|
||
|
||
In [34]:
|
||
|
||
```py
|
||
df[df.A > 0]
|
||
|
||
```
|
||
|
||
Out[34]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 |
|
||
|
||
只留下所有大于 0 的数值:
|
||
|
||
In [35]:
|
||
|
||
```py
|
||
df[df > 0]
|
||
|
||
```
|
||
|
||
Out[35]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | NaN | NaN | NaN | 1.528584 |
|
||
| 2013-01-02 | NaN | 0.388338 | 1.187187 | 1.819818 |
|
||
| 2013-01-03 | 0.065255 | NaN | NaN | NaN |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | NaN | 0.040239 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | NaN | 0.934432 |
|
||
| 2013-01-06 | NaN | NaN | 1.699886 | 1.291653 |
|
||
|
||
使用 `isin` 方法做 `filter` 过滤:
|
||
|
||
In [36]:
|
||
|
||
```py
|
||
df2 = df.copy()
|
||
df2['E'] = ['one', 'one','two','three','four','three']
|
||
|
||
df2
|
||
|
||
```
|
||
|
||
Out[36]:
|
||
|
||
| | A | B | C | D | E |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | -0.605936 | -0.861658 | -1.001924 | 1.528584 | one |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 | one |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 | two |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 | three |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 | four |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 1.291653 | three |
|
||
|
||
In [37]:
|
||
|
||
```py
|
||
df2[df2['E'].isin(['two','four'])]
|
||
|
||
```
|
||
|
||
Out[37]:
|
||
|
||
| | A | B | C | D | E |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 | two |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 | four |
|
||
|
||
### 设定数据的值
|
||
|
||
In [38]:
|
||
|
||
```py
|
||
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
|
||
|
||
s1
|
||
|
||
```
|
||
|
||
Out[38]:
|
||
|
||
```py
|
||
2013-01-02 1
|
||
2013-01-03 2
|
||
2013-01-04 3
|
||
2013-01-05 4
|
||
2013-01-06 5
|
||
2013-01-07 6
|
||
Freq: D, dtype: int64
|
||
```
|
||
|
||
像字典一样,直接指定 `F` 列的值为 `s1`,此时以 `df` 已有的 `index` 为标准将二者进行合并,`s1` 中没有的 `index` 项设为 `NaN`,多余的项舍去:
|
||
|
||
In [39]:
|
||
|
||
```py
|
||
df['F'] = s1
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[39]:
|
||
|
||
| | A | B | C | D | F |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | -0.605936 | -0.861658 | -1.001924 | 1.528584 | NaN |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 | 1 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 | 2 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 | 3 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 | 4 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 1.291653 | 5 |
|
||
|
||
或者使用 `at` 或 `iat` 修改单个值:
|
||
|
||
In [40]:
|
||
|
||
```py
|
||
df.at[dates[0],'A'] = 0
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[40]:
|
||
|
||
| | A | B | C | D | F |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | 0.000000 | -0.861658 | -1.001924 | 1.528584 | NaN |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 | 1 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 | 2 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 | 3 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 | 4 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 1.291653 | 5 |
|
||
|
||
In [41]:
|
||
|
||
```py
|
||
df.iat[0, 1] = 0
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[41]:
|
||
|
||
| | A | B | C | D | F |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | 0.000000 | 0.000000 | -1.001924 | 1.528584 | NaN |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 1.819818 | 1 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | -0.286067 | 2 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 0.040239 | 3 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 0.934432 | 4 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 1.291653 | 5 |
|
||
|
||
设定一整列:
|
||
|
||
In [42]:
|
||
|
||
```py
|
||
df.loc[:,'D'] = np.array([5] * len(df))
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[42]:
|
||
|
||
| | A | B | C | D | F |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | 0.000000 | 0.000000 | -1.001924 | 5 | NaN |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 5 | 1 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | 5 | 2 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 5 | 3 |
|
||
| 2013-01-05 | 0.038232 | 0.875057 | -0.092526 | 5 | 4 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | 1.699886 | 5 | 5 |
|
||
|
||
设定满足条件的数值:
|
||
|
||
In [43]:
|
||
|
||
```py
|
||
df2 = df.copy()
|
||
|
||
df2[df2 > 0] = -df2
|
||
|
||
df2
|
||
|
||
```
|
||
|
||
Out[43]:
|
||
|
||
| | A | B | C | D | F |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | 0.000000 | 0.000000 | -1.001924 | -5 | NaN |
|
||
| 2013-01-02 | -0.165408 | -0.388338 | -1.187187 | -5 | -1 |
|
||
| 2013-01-03 | -0.065255 | -1.608074 | -1.282331 | -5 | -2 |
|
||
| 2013-01-04 | -1.289305 | -0.497115 | -0.225351 | -5 | -3 |
|
||
| 2013-01-05 | -0.038232 | -0.875057 | -0.092526 | -5 | -4 |
|
||
| 2013-01-06 | -2.163453 | -0.010279 | -1.699886 | -5 | -5 |
|
||
|
||
## 缺失数据
|
||
|
||
In [44]:
|
||
|
||
```py
|
||
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
|
||
df1.loc[dates[0]:dates[1],'E'] = 1
|
||
|
||
df1
|
||
|
||
```
|
||
|
||
Out[44]:
|
||
|
||
| | A | B | C | D | F | E |
|
||
| --- | --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | 0.000000 | 0.000000 | -1.001924 | 5 | NaN | 1 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 5 | 1 | 1 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | 5 | 2 | NaN |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 5 | 3 | NaN |
|
||
|
||
丢弃所有缺失数据的行得到的新数据:
|
||
|
||
In [45]:
|
||
|
||
```py
|
||
df1.dropna(how='any')
|
||
|
||
```
|
||
|
||
Out[45]:
|
||
|
||
| | A | B | C | D | F | E |
|
||
| --- | --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 5 | 1 | 1 |
|
||
|
||
填充缺失数据:
|
||
|
||
In [46]:
|
||
|
||
```py
|
||
df1.fillna(value=5)
|
||
|
||
```
|
||
|
||
Out[46]:
|
||
|
||
| | A | B | C | D | F | E |
|
||
| --- | --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | 0.000000 | 0.000000 | -1.001924 | 5 | 5 | 1 |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 1.187187 | 5 | 1 | 1 |
|
||
| 2013-01-03 | 0.065255 | -1.608074 | -1.282331 | 5 | 2 | 5 |
|
||
| 2013-01-04 | 1.289305 | 0.497115 | -0.225351 | 5 | 3 | 5 |
|
||
|
||
检查缺失数据的位置:
|
||
|
||
In [47]:
|
||
|
||
```py
|
||
pd.isnull(df1)
|
||
|
||
```
|
||
|
||
Out[47]:
|
||
|
||
| | A | B | C | D | F | E |
|
||
| --- | --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | False | False | False | False | True | False |
|
||
| 2013-01-02 | False | False | False | False | False | False |
|
||
| 2013-01-03 | False | False | False | False | False | True |
|
||
| 2013-01-04 | False | False | False | False | False | True |
|
||
|
||
## 计算操作
|
||
|
||
### 统计信息
|
||
|
||
每一列的均值:
|
||
|
||
In [48]:
|
||
|
||
```py
|
||
df.mean()
|
||
|
||
```
|
||
|
||
Out[48]:
|
||
|
||
```py
|
||
A -0.156012
|
||
B 0.023693
|
||
C 0.047490
|
||
D 5.000000
|
||
F 3.000000
|
||
dtype: float64
|
||
```
|
||
|
||
每一行的均值:
|
||
|
||
In [49]:
|
||
|
||
```py
|
||
df.mean(1)
|
||
|
||
```
|
||
|
||
Out[49]:
|
||
|
||
```py
|
||
2013-01-01 0.999519
|
||
2013-01-02 1.482023
|
||
2013-01-03 0.834970
|
||
2013-01-04 1.912214
|
||
2013-01-05 1.964153
|
||
2013-01-06 1.905231
|
||
Freq: D, dtype: float64
|
||
```
|
||
|
||
多个对象之间的操作,如果维度不对,`pandas` 会自动调用 `broadcasting` 机制:
|
||
|
||
In [50]:
|
||
|
||
```py
|
||
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
|
||
|
||
print s
|
||
|
||
```
|
||
|
||
```py
|
||
2013-01-01 NaN
|
||
2013-01-02 NaN
|
||
2013-01-03 1
|
||
2013-01-04 3
|
||
2013-01-05 5
|
||
2013-01-06 NaN
|
||
Freq: D, dtype: float64
|
||
|
||
```
|
||
|
||
相减 `df - s`:
|
||
|
||
In [51]:
|
||
|
||
```py
|
||
df.sub(s, axis='index')
|
||
|
||
```
|
||
|
||
Out[51]:
|
||
|
||
| | A | B | C | D | F |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | NaN | NaN | NaN | NaN | NaN |
|
||
| 2013-01-02 | NaN | NaN | NaN | NaN | NaN |
|
||
| 2013-01-03 | -0.934745 | -2.608074 | -2.282331 | 4 | 1 |
|
||
| 2013-01-04 | -1.710695 | -2.502885 | -3.225351 | 2 | 0 |
|
||
| 2013-01-05 | -4.961768 | -4.124943 | -5.092526 | 0 | -1 |
|
||
| 2013-01-06 | NaN | NaN | NaN | NaN | NaN |
|
||
|
||
### apply 操作
|
||
|
||
与 `R` 中的 `apply` 操作类似,接收一个函数,默认是对将函数作用到每一列上:
|
||
|
||
In [52]:
|
||
|
||
```py
|
||
df.apply(np.cumsum)
|
||
|
||
```
|
||
|
||
Out[52]:
|
||
|
||
| | A | B | C | D | F |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 2013-01-01 | 0.000000 | 0.000000 | -1.001924 | 5 | NaN |
|
||
| 2013-01-02 | -0.165408 | 0.388338 | 0.185263 | 10 | 1 |
|
||
| 2013-01-03 | -0.100153 | -1.219736 | -1.097067 | 15 | 3 |
|
||
| 2013-01-04 | 1.189152 | -0.722621 | -1.322419 | 20 | 6 |
|
||
| 2013-01-05 | 1.227383 | 0.152436 | -1.414945 | 25 | 10 |
|
||
| 2013-01-06 | -0.936069 | 0.142157 | 0.284941 | 30 | 15 |
|
||
|
||
求每列最大最小值之差:
|
||
|
||
In [53]:
|
||
|
||
```py
|
||
df.apply(lambda x: x.max() - x.min())
|
||
|
||
```
|
||
|
||
Out[53]:
|
||
|
||
```py
|
||
A 3.452758
|
||
B 2.483131
|
||
C 2.982217
|
||
D 0.000000
|
||
F 4.000000
|
||
dtype: float64
|
||
```
|
||
|
||
### 直方图
|
||
|
||
In [54]:
|
||
|
||
```py
|
||
s = pd.Series(np.random.randint(0, 7, size=10))
|
||
print s
|
||
|
||
```
|
||
|
||
```py
|
||
0 2
|
||
1 5
|
||
2 6
|
||
3 6
|
||
4 6
|
||
5 3
|
||
6 5
|
||
7 0
|
||
8 4
|
||
9 4
|
||
dtype: int64
|
||
|
||
```
|
||
|
||
直方图信息:
|
||
|
||
In [55]:
|
||
|
||
```py
|
||
print s.value_counts()
|
||
|
||
```
|
||
|
||
```py
|
||
6 3
|
||
5 2
|
||
4 2
|
||
3 1
|
||
2 1
|
||
0 1
|
||
dtype: int64
|
||
|
||
```
|
||
|
||
绘制直方图信息:
|
||
|
||
In [56]:
|
||
|
||
```py
|
||
h = s.hist()
|
||
|
||
```
|
||
|
||

|
||
|
||
### 字符串方法
|
||
|
||
当 `Series` 或者 `DataFrame` 的某一列是字符串时,我们可以用 `.str` 对这个字符串数组进行字符串的基本操作:
|
||
|
||
In [57]:
|
||
|
||
```py
|
||
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
|
||
|
||
print s.str.lower()
|
||
|
||
```
|
||
|
||
```py
|
||
0 a
|
||
1 b
|
||
2 c
|
||
3 aaba
|
||
4 baca
|
||
5 NaN
|
||
6 caba
|
||
7 dog
|
||
8 cat
|
||
dtype: object
|
||
|
||
```
|
||
|
||
## 合并
|
||
|
||
### 连接
|
||
|
||
In [58]:
|
||
|
||
```py
|
||
df = pd.DataFrame(np.random.randn(10, 4))
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[58]:
|
||
|
||
| | 0 | 1 | 2 | 3 |
|
||
| --- | --- | --- | --- | --- |
|
||
| 0 | -2.346373 | 0.105651 | -0.048027 | 0.010637 |
|
||
| 1 | -0.682198 | 0.943043 | 0.147312 | -0.657871 |
|
||
| 2 | 0.515766 | -0.768286 | 0.361570 | 1.146278 |
|
||
| 3 | -0.607277 | -0.003086 | -1.499001 | 1.165728 |
|
||
| 4 | -1.226279 | -0.177246 | -1.379631 | -0.639261 |
|
||
| 5 | 0.807364 | -1.855060 | 0.325968 | 1.898831 |
|
||
| 6 | 0.438539 | -0.728131 | -0.009924 | 0.398360 |
|
||
| 7 | 1.497457 | -1.506314 | -1.557624 | 0.869043 |
|
||
| 8 | 0.945985 | -0.519435 | -0.510359 | -1.077751 |
|
||
| 9 | 1.597679 | -0.285955 | -1.060736 | 0.608629 |
|
||
|
||
可以使用 `pd.concat` 函数将多个 `pandas` 对象进行连接:
|
||
|
||
In [59]:
|
||
|
||
```py
|
||
pieces = [df[:2], df[4:5], df[7:]]
|
||
|
||
pd.concat(pieces)
|
||
|
||
```
|
||
|
||
Out[59]:
|
||
|
||
| | 0 | 1 | 2 | 3 |
|
||
| --- | --- | --- | --- | --- |
|
||
| 0 | -2.346373 | 0.105651 | -0.048027 | 0.010637 |
|
||
| 1 | -0.682198 | 0.943043 | 0.147312 | -0.657871 |
|
||
| 4 | -1.226279 | -0.177246 | -1.379631 | -0.639261 |
|
||
| 7 | 1.497457 | -1.506314 | -1.557624 | 0.869043 |
|
||
| 8 | 0.945985 | -0.519435 | -0.510359 | -1.077751 |
|
||
| 9 | 1.597679 | -0.285955 | -1.060736 | 0.608629 |
|
||
|
||
### 数据库中的 Join
|
||
|
||
`merge` 可以实现数据库中的 `join` 操作:
|
||
|
||
In [60]:
|
||
|
||
```py
|
||
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
|
||
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
|
||
|
||
print left
|
||
print right
|
||
|
||
```
|
||
|
||
```py
|
||
key lval
|
||
0 foo 1
|
||
1 foo 2
|
||
key rval
|
||
0 foo 4
|
||
1 foo 5
|
||
|
||
```
|
||
|
||
In [61]:
|
||
|
||
```py
|
||
pd.merge(left, right, on='key')
|
||
|
||
```
|
||
|
||
Out[61]:
|
||
|
||
| | key | lval | rval |
|
||
| --- | --- | --- | --- |
|
||
| 0 | foo | 1 | 4 |
|
||
| 1 | foo | 1 | 5 |
|
||
| 2 | foo | 2 | 4 |
|
||
| 3 | foo | 2 | 5 |
|
||
|
||
### append
|
||
|
||
向 `DataFrame` 中添加行:
|
||
|
||
In [62]:
|
||
|
||
```py
|
||
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[62]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 0 | 1.587778 | -0.110297 | 0.602245 | 1.212597 |
|
||
| 1 | -0.551109 | 0.337387 | -0.220919 | 0.363332 |
|
||
| 2 | 1.207373 | -0.128394 | 0.619937 | -0.612694 |
|
||
| 3 | -0.978282 | -1.038170 | 0.048995 | -0.788973 |
|
||
| 4 | 0.843893 | -1.079021 | 0.092212 | 0.485422 |
|
||
| 5 | -0.056594 | 1.831206 | 1.910864 | -1.331739 |
|
||
| 6 | -0.487106 | -1.495367 | 0.853440 | 0.410854 |
|
||
| 7 | 1.830852 | -0.014893 | 0.254025 | 0.197422 |
|
||
|
||
将第三行的值添加到最后:
|
||
|
||
In [63]:
|
||
|
||
```py
|
||
s = df.iloc[3]
|
||
|
||
df.append(s, ignore_index=True)
|
||
|
||
```
|
||
|
||
Out[63]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 0 | 1.587778 | -0.110297 | 0.602245 | 1.212597 |
|
||
| 1 | -0.551109 | 0.337387 | -0.220919 | 0.363332 |
|
||
| 2 | 1.207373 | -0.128394 | 0.619937 | -0.612694 |
|
||
| 3 | -0.978282 | -1.038170 | 0.048995 | -0.788973 |
|
||
| 4 | 0.843893 | -1.079021 | 0.092212 | 0.485422 |
|
||
| 5 | -0.056594 | 1.831206 | 1.910864 | -1.331739 |
|
||
| 6 | -0.487106 | -1.495367 | 0.853440 | 0.410854 |
|
||
| 7 | 1.830852 | -0.014893 | 0.254025 | 0.197422 |
|
||
| 8 | -0.978282 | -1.038170 | 0.048995 | -0.788973 |
|
||
|
||
### Grouping
|
||
|
||
In [64]:
|
||
|
||
```py
|
||
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
|
||
'foo', 'bar', 'foo', 'foo'],
|
||
'B' : ['one', 'one', 'two', 'three',
|
||
'two', 'two', 'one', 'three'],
|
||
'C' : np.random.randn(8),
|
||
'D' : np.random.randn(8)})
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[64]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 0 | foo | one | 0.773062 | 0.206503 |
|
||
| 1 | bar | one | 1.414609 | -0.346719 |
|
||
| 2 | foo | two | 0.964174 | 0.706623 |
|
||
| 3 | bar | three | 0.182239 | -1.516509 |
|
||
| 4 | foo | two | -0.096255 | 0.494177 |
|
||
| 5 | bar | two | -0.759471 | -0.389213 |
|
||
| 6 | foo | one | -0.257519 | -1.411693 |
|
||
| 7 | foo | three | -0.109368 | 0.241862 |
|
||
|
||
按照 `A` 的值进行分类:
|
||
|
||
In [65]:
|
||
|
||
```py
|
||
df.groupby('A').sum()
|
||
|
||
```
|
||
|
||
Out[65]:
|
||
|
||
| | C | D |
|
||
| --- | --- | --- |
|
||
| A | | |
|
||
| --- | --- | --- |
|
||
| bar | 0.837377 | -2.252441 |
|
||
| foo | 1.274094 | 0.237472 |
|
||
|
||
按照 `A, B` 的值进行分类:
|
||
|
||
In [66]:
|
||
|
||
```py
|
||
df.groupby(['A', 'B']).sum()
|
||
|
||
```
|
||
|
||
Out[66]:
|
||
|
||
| | | C | D |
|
||
| --- | --- | --- | --- |
|
||
| A | B | | |
|
||
| --- | --- | --- | --- |
|
||
| bar | one | 1.414609 | -0.346719 |
|
||
| three | 0.182239 | -1.516509 |
|
||
| two | -0.759471 | -0.389213 |
|
||
| foo | one | 0.515543 | -1.205191 |
|
||
| three | -0.109368 | 0.241862 |
|
||
| two | 0.867919 | 1.200800 |
|
||
|
||
## 改变形状
|
||
|
||
### Stack
|
||
|
||
产生一个多 `index` 的 `DataFrame`:
|
||
|
||
In [67]:
|
||
|
||
```py
|
||
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
|
||
'foo', 'foo', 'qux', 'qux'],
|
||
['one', 'two', 'one', 'two',
|
||
'one', 'two', 'one', 'two']]))
|
||
|
||
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
|
||
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[67]:
|
||
|
||
| | | A | B |
|
||
| --- | --- | --- | --- |
|
||
| first | second | | |
|
||
| --- | --- | --- | --- |
|
||
| bar | one | -0.109174 | 0.958551 |
|
||
| two | -0.254743 | -0.975924 |
|
||
| baz | one | -0.132039 | -0.119009 |
|
||
| two | 0.587063 | -0.819037 |
|
||
| foo | one | -0.754123 | 0.430747 |
|
||
| two | -0.426544 | 0.389822 |
|
||
| qux | one | -0.382501 | -0.562910 |
|
||
| two | -0.529287 | 0.826337 |
|
||
|
||
`stack` 方法将 `columns` 变成一个新的 `index` 部分:
|
||
|
||
In [68]:
|
||
|
||
```py
|
||
df2 = df[:4]
|
||
|
||
stacked = df2.stack()
|
||
|
||
stacked
|
||
|
||
```
|
||
|
||
Out[68]:
|
||
|
||
```py
|
||
first second
|
||
bar one A -0.109174
|
||
B 0.958551
|
||
two A -0.254743
|
||
B -0.975924
|
||
baz one A -0.132039
|
||
B -0.119009
|
||
two A 0.587063
|
||
B -0.819037
|
||
dtype: float64
|
||
```
|
||
|
||
可以使用 `unstack()` 将最后一级 `index` 放回 `column`:
|
||
|
||
In [69]:
|
||
|
||
```py
|
||
stacked.unstack()
|
||
|
||
```
|
||
|
||
Out[69]:
|
||
|
||
| | | A | B |
|
||
| --- | --- | --- | --- |
|
||
| first | second | | |
|
||
| --- | --- | --- | --- |
|
||
| bar | one | -0.109174 | 0.958551 |
|
||
| two | -0.254743 | -0.975924 |
|
||
| baz | one | -0.132039 | -0.119009 |
|
||
| two | 0.587063 | -0.819037 |
|
||
|
||
也可以指定其他的级别:
|
||
|
||
In [70]:
|
||
|
||
```py
|
||
stacked.unstack(1)
|
||
|
||
```
|
||
|
||
Out[70]:
|
||
|
||
| | second | one | two |
|
||
| --- | --- | --- | --- |
|
||
| first | | | |
|
||
| --- | --- | --- | --- |
|
||
| bar | A | -0.109174 | -0.254743 |
|
||
| B | 0.958551 | -0.975924 |
|
||
| baz | A | -0.132039 | 0.587063 |
|
||
| B | -0.119009 | -0.819037 |
|
||
|
||
## 时间序列
|
||
|
||
金融分析中常用到时间序列数据:
|
||
|
||
In [71]:
|
||
|
||
```py
|
||
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
|
||
ts = pd.Series(np.random.randn(len(rng)), rng)
|
||
|
||
ts
|
||
|
||
```
|
||
|
||
Out[71]:
|
||
|
||
```py
|
||
2012-03-06 1.096788
|
||
2012-03-07 0.029678
|
||
2012-03-08 0.511461
|
||
2012-03-09 -0.332369
|
||
2012-03-10 1.720321
|
||
Freq: D, dtype: float64
|
||
```
|
||
|
||
标准时间表示:
|
||
|
||
In [72]:
|
||
|
||
```py
|
||
ts_utc = ts.tz_localize('UTC')
|
||
|
||
ts_utc
|
||
|
||
```
|
||
|
||
Out[72]:
|
||
|
||
```py
|
||
2012-03-06 00:00:00+00:00 1.096788
|
||
2012-03-07 00:00:00+00:00 0.029678
|
||
2012-03-08 00:00:00+00:00 0.511461
|
||
2012-03-09 00:00:00+00:00 -0.332369
|
||
2012-03-10 00:00:00+00:00 1.720321
|
||
Freq: D, dtype: float64
|
||
```
|
||
|
||
In [ ]:
|
||
|
||
改变时区表示:
|
||
|
||
In [73]:
|
||
|
||
```py
|
||
ts_utc.tz_convert('US/Eastern')
|
||
|
||
```
|
||
|
||
Out[73]:
|
||
|
||
```py
|
||
2012-03-05 19:00:00-05:00 1.096788
|
||
2012-03-06 19:00:00-05:00 0.029678
|
||
2012-03-07 19:00:00-05:00 0.511461
|
||
2012-03-08 19:00:00-05:00 -0.332369
|
||
2012-03-09 19:00:00-05:00 1.720321
|
||
Freq: D, dtype: float64
|
||
```
|
||
|
||
## Categoricals
|
||
|
||
In [74]:
|
||
|
||
```py
|
||
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
|
||
|
||
df
|
||
|
||
```
|
||
|
||
Out[74]:
|
||
|
||
| | id | raw_grade |
|
||
| --- | --- | --- |
|
||
| 0 | 1 | a |
|
||
| 1 | 2 | b |
|
||
| 2 | 3 | b |
|
||
| 3 | 4 | a |
|
||
| 4 | 5 | a |
|
||
| 5 | 6 | e |
|
||
|
||
可以将 `grade` 变成类别:
|
||
|
||
In [75]:
|
||
|
||
```py
|
||
df["grade"] = df["raw_grade"].astype("category")
|
||
|
||
df["grade"]
|
||
|
||
```
|
||
|
||
Out[75]:
|
||
|
||
```py
|
||
0 a
|
||
1 b
|
||
2 b
|
||
3 a
|
||
4 a
|
||
5 e
|
||
Name: grade, dtype: category
|
||
Categories (3, object): [a, b, e]
|
||
```
|
||
|
||
将类别的表示转化为有意义的字符:
|
||
|
||
In [76]:
|
||
|
||
```py
|
||
df["grade"].cat.categories = ["very good", "good", "very bad"]
|
||
|
||
df["grade"]
|
||
|
||
```
|
||
|
||
Out[76]:
|
||
|
||
```py
|
||
0 very good
|
||
1 good
|
||
2 good
|
||
3 very good
|
||
4 very good
|
||
5 very bad
|
||
Name: grade, dtype: category
|
||
Categories (3, object): [very good, good, very bad]
|
||
```
|
||
|
||
添加缺失的类别:
|
||
|
||
In [77]:
|
||
|
||
```py
|
||
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
|
||
df["grade"]
|
||
|
||
```
|
||
|
||
Out[77]:
|
||
|
||
```py
|
||
0 very good
|
||
1 good
|
||
2 good
|
||
3 very good
|
||
4 very good
|
||
5 very bad
|
||
Name: grade, dtype: category
|
||
Categories (5, object): [very bad, bad, medium, good, very good]
|
||
```
|
||
|
||
使用 `grade` 分组:
|
||
|
||
In [78]:
|
||
|
||
```py
|
||
df.groupby("grade").size()
|
||
|
||
```
|
||
|
||
Out[78]:
|
||
|
||
```py
|
||
grade
|
||
very bad 1
|
||
bad 0
|
||
medium 0
|
||
good 2
|
||
very good 3
|
||
dtype: int64
|
||
```
|
||
|
||
## 绘图
|
||
|
||
使用 `ggplot` 风格:
|
||
|
||
In [79]:
|
||
|
||
```py
|
||
plt.style.use('ggplot')
|
||
|
||
```
|
||
|
||
`Series` 绘图:
|
||
|
||
In [80]:
|
||
|
||
```py
|
||
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
|
||
|
||
p = ts.cumsum().plot()
|
||
|
||
```
|
||
|
||

|
||
|
||
`DataFrame` 按照 `columns` 绘图:
|
||
|
||
In [81]:
|
||
|
||
```py
|
||
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
|
||
columns=['A', 'B', 'C', 'D'])
|
||
|
||
df.cumsum().plot()
|
||
p = plt.legend(loc="best")
|
||
|
||
```
|
||
|
||

|
||
|
||
## 文件读写
|
||
|
||
### csv
|
||
|
||
写入文件:
|
||
|
||
In [82]:
|
||
|
||
```py
|
||
df.to_csv('foo.csv')
|
||
|
||
```
|
||
|
||
从文件中读取:
|
||
|
||
In [83]:
|
||
|
||
```py
|
||
pd.read_csv('foo.csv').head()
|
||
|
||
```
|
||
|
||
Out[83]:
|
||
|
||
| | Unnamed: 0 | A | B | C | D |
|
||
| --- | --- | --- | --- | --- | --- |
|
||
| 0 | 2000-01-01 | -1.011554 | 1.200283 | -0.310949 | -1.060734 |
|
||
| 1 | 2000-01-02 | -1.030894 | 0.660518 | -0.214002 | -0.422014 |
|
||
| 2 | 2000-01-03 | -0.488692 | 1.709209 | -0.602208 | 1.115456 |
|
||
| 3 | 2000-01-04 | -0.440243 | 0.826692 | 0.321648 | -0.351698 |
|
||
| 4 | 2000-01-05 | -0.165684 | 1.297303 | 0.817233 | 0.174767 |
|
||
|
||
### hdf5
|
||
|
||
写入文件:
|
||
|
||
In [84]:
|
||
|
||
```py
|
||
df.to_hdf("foo.h5", "df")
|
||
|
||
```
|
||
|
||
读取文件:
|
||
|
||
In [85]:
|
||
|
||
```py
|
||
pd.read_hdf('foo.h5','df').head()
|
||
|
||
```
|
||
|
||
Out[85]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2000-01-01 | -1.011554 | 1.200283 | -0.310949 | -1.060734 |
|
||
| 2000-01-02 | -1.030894 | 0.660518 | -0.214002 | -0.422014 |
|
||
| 2000-01-03 | -0.488692 | 1.709209 | -0.602208 | 1.115456 |
|
||
| 2000-01-04 | -0.440243 | 0.826692 | 0.321648 | -0.351698 |
|
||
| 2000-01-05 | -0.165684 | 1.297303 | 0.817233 | 0.174767 |
|
||
|
||
### excel
|
||
|
||
写入文件:
|
||
|
||
In [86]:
|
||
|
||
```py
|
||
df.to_excel('foo.xlsx', sheet_name='Sheet1')
|
||
|
||
```
|
||
|
||
读取文件:
|
||
|
||
In [87]:
|
||
|
||
```py
|
||
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']).head()
|
||
|
||
```
|
||
|
||
Out[87]:
|
||
|
||
| | A | B | C | D |
|
||
| --- | --- | --- | --- | --- |
|
||
| 2000-01-01 | -1.011554 | 1.200283 | -0.310949 | -1.060734 |
|
||
| 2000-01-02 | -1.030894 | 0.660518 | -0.214002 | -0.422014 |
|
||
| 2000-01-03 | -0.488692 | 1.709209 | -0.602208 | 1.115456 |
|
||
| 2000-01-04 | -0.440243 | 0.826692 | 0.321648 | -0.351698 |
|
||
| 2000-01-05 | -0.165684 | 1.297303 | 0.817233 | 0.174767 |
|
||
|
||
清理生成的临时文件:
|
||
|
||
In [88]:
|
||
|
||
```py
|
||
import glob
|
||
import os
|
||
|
||
for f in glob.glob("foo*"):
|
||
os.remove(f)
|
||
|
||
``` |