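# Time series / date functionality
Building on NumPy's `datetime64` and `timedelta64` dtypes, pandas consolidates a large number of features from other Python libraries such as `scikits.timeseries` and adds a great deal of new functionality for manipulating time series data.
pandas supports the following operations:
Parsing time series information from various sources and formats, such as date strings, `np.datetime64`, and `datetime.datetime`.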
```python
In [1]: import datetime
In [2]: dti = pd.to_datetime(['1/1/2018', np.datetime64('2018-01-01'),
...: datetime.datetime(2018, 1, 1)])
...:
In [3]: dti
Out[3]: DatetimeIndex(['2018-01-01', '2018-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)
```
Generating fixed-frequency sequences of dates and time spans, such as `DatetimeIndex`, `TimedeltaIndex`, and `PeriodIndex`.
```python
In [4]: dti = pd.date_range('2018-01-01', periods=3, freq='H')
In [5]: dti
Out[5]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
'2018-01-01 02:00:00'],
dtype='datetime64[ns]', freq='H')
```
Manipulating and converting date times that carry time zone information.
```python
In [6]: dti = dti.tz_localize('UTC')
In [7]: dti
Out[7]:
DatetimeIndex(['2018-01-01 00:00:00+00:00', '2018-01-01 01:00:00+00:00',
'2018-01-01 02:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='H')
In [8]: dti.tz_convert('US/Pacific')
Out[8]:
DatetimeIndex(['2017-12-31 16:00:00-08:00', '2017-12-31 17:00:00-08:00',
'2017-12-31 18:00:00-08:00'],
dtype='datetime64[ns, US/Pacific]', freq='H')
```
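A minimal sketch of the difference between the two calls: `tz_localize` attaches a zone to a naive timestamp, while `tz_convert` re-expresses the same instant in another zone. The zone name `America/Los_Angeles` is an assumption used here in place of the `US/Pacific` alias.

```python
import pandas as pd

# A naive timestamp has no zone attached
naive = pd.Timestamp('2018-01-01 00:00')

# Localizing declares which zone the wall time belongs to
utc = naive.tz_localize('UTC')

# Converting keeps the instant and changes only the wall-clock view
pacific = utc.tz_convert('America/Los_Angeles')

same_instant = (utc == pacific)
print(pacific, same_instant)
```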
Resampling or converting a time series to a particular frequency.
```python
In [9]: idx = pd.date_range('2018-01-01', periods=5, freq='H')
In [10]: ts = pd.Series(range(len(idx)), index=idx)
In [11]: ts
Out[11]:
2018-01-01 00:00:00 0
2018-01-01 01:00:00 1
2018-01-01 02:00:00 2
2018-01-01 03:00:00 3
2018-01-01 04:00:00 4
Freq: H, dtype: int64
In [12]: ts.resample('2H').mean()
Out[12]:
2018-01-01 00:00:00 0.5
2018-01-01 02:00:00 2.5
2018-01-01 04:00:00 4.0
Freq: 2H, dtype: float64
```
Performing date and time arithmetic with absolute or relative time increments.
```python
In [13]: friday = pd.Timestamp('2018-01-05')
In [14]: friday.day_name()
Out[14]: 'Friday'
# Add 1 calendar day
In [15]: saturday = friday + pd.Timedelta('1 day')
In [16]: saturday.day_name()
Out[16]: 'Saturday'
# Add 1 business day (Friday --> Monday)
In [17]: monday = friday + pd.offsets.BDay()
In [18]: monday.day_name()
Out[18]: 'Monday'
```
pandas provides a compact, practical set of tools for performing the tasks above.
## Overview
pandas captures 4 general time-related concepts:
1. Date times: a specific date and time with time zone support. Similar to `datetime.datetime` from the standard library.
2. Time deltas: an absolute time duration. Similar to `datetime.timedelta` from the standard library.
3. Time spans: a span of time defined by a point in time and its associated frequency.
4. Date offsets: a relative time duration that respects calendar arithmetic. Similar to `dateutil.relativedelta.relativedelta` from the `dateutil` package.

| Concept | Scalar class | Array class | pandas data type | Primary creation method |
| :-----------: | :-----------: | :---------------: | :--------------------------------------: | :----------------------------------: |
| Date times | `Timestamp` | `DatetimeIndex` | `datetime64[ns]` or `datetime64[ns, tz]` | `to_datetime` or `date_range` |
| Time deltas | `Timedelta` | `TimedeltaIndex` | `timedelta64[ns]` | `to_timedelta` or `timedelta_range` |
| Time spans | `Period` | `PeriodIndex` | `period[freq]` | `Period` or `period_range` |
| Date offsets | `DateOffset` | `None` | `None` | `DateOffset` |
For time series data, it is conventional to represent the time component in the index of a [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") or [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame") so that manipulations can be performed with respect to the time element.
```python
In [19]: pd.Series(range(3), index=pd.date_range('2000', freq='D', periods=3))
Out[19]:
2000-01-01 0
2000-01-02 1
2000-01-03 2
Freq: D, dtype: int64
```
However, [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") and [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame") can also hold the time component directly as data.
```python
In [20]: pd.Series(pd.date_range('2000', freq='D', periods=3))
Out[20]:
0 2000-01-01
1 2000-01-02
2 2000-01-03
dtype: datetime64[ns]
```
[`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") and [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame") have extended data type support and dedicated functionality for `datetime`, `timedelta`, and `Period` data passed to their constructors. `DateOffset` data, however, is stored as `object` data.
```python
In [21]: pd.Series(pd.period_range('1/1/2011', freq='M', periods=3))
Out[21]:
0 2011-01
1 2011-02
2 2011-03
dtype: period[M]
In [22]: pd.Series([pd.DateOffset(1), pd.DateOffset(2)])
Out[22]:
0 <DateOffset>
1 <2 * DateOffsets>
dtype: object
In [23]: pd.Series(pd.date_range('1/1/2011', freq='M', periods=3))
Out[23]:
0 2011-01-31
1 2011-02-28
2 2011-03-31
dtype: datetime64[ns]
```
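Lastly, pandas represents null date times, time deltas, and time spans as `NaT`, which stands for missing or null date-like values and behaves much as `np.nan` does for float data.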
```python
In [24]: pd.Timestamp(pd.NaT)
Out[24]: NaT
In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT
In [26]: pd.Period(pd.NaT)
Out[26]: NaT
# Like np.nan, pd.NaT does not compare equal to itself
In [27]: pd.NaT == pd.NaT
Out[27]: False
```
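Because `NaT` never compares equal to itself, equality tests cannot be used to find missing dates. A short sketch, assuming a small illustrative series, that detects and fills `NaT` with `isna` and `fillna` instead:

```python
import pandas as pd

# None is parsed as NaT
s = pd.Series(pd.to_datetime(['2020-01-01', None, '2020-01-03']))

# Detect missing values with isna(), not with ==
missing = s.isna()

# Fill the gap with an explicit timestamp
filled = s.fillna(pd.Timestamp('2020-01-02'))

print(missing.tolist())
print(filled.dt.day.tolist())
```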
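## Timestamps vs. time spans
Timestamped data is the most basic type of time series data; it associates values with points in time. For pandas objects, this means using the points in time themselves.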
```python
In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[28]: Timestamp('2012-05-01 00:00:00')
In [29]: pd.Timestamp('2012-05-01')
Out[29]: Timestamp('2012-05-01 00:00:00')
In [30]: pd.Timestamp(2012, 5, 1)
Out[30]: Timestamp('2012-05-01 00:00:00')
```
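However, in many cases it is more natural to associate things like change variables with a time span instead. The span represented by `Period` can be specified explicitly or inferred from the format of a datetime string.
For example: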
```python
In [31]: pd.Period('2011-01')
Out[31]: Period('2011-01', 'M')
In [32]: pd.Period('2012-05', freq='D')
Out[32]: Period('2012-05-01', 'D')
```
[`Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp "pandas.Timestamp") and [`Period`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Period.html#pandas.Period "pandas.Period") can serve as an index. Lists of `Timestamp` and `Period` objects used as an index are automatically coerced to [`DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html#pandas.DatetimeIndex "pandas.DatetimeIndex") and [`PeriodIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.PeriodIndex.html#pandas.PeriodIndex "pandas.PeriodIndex") respectively.
```python
In [33]: dates = [pd.Timestamp('2012-05-01'),
....: pd.Timestamp('2012-05-02'),
....: pd.Timestamp('2012-05-03')]
....:
In [34]: ts = pd.Series(np.random.randn(3), dates)
In [35]: type(ts.index)
Out[35]: pandas.core.indexes.datetimes.DatetimeIndex
In [36]: ts.index
Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
In [37]: ts
Out[37]:
2012-05-01 0.469112
2012-05-02 -0.282863
2012-05-03 -1.509059
dtype: float64
In [38]: periods = [pd.Period('2012-01'), pd.Period('2012-02'), pd.Period('2012-03')]
In [39]: ts = pd.Series(np.random.randn(3), periods)
In [40]: type(ts.index)
Out[40]: pandas.core.indexes.period.PeriodIndex
In [41]: ts.index
Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')
In [42]: ts
Out[42]:
2012-01 -1.135632
2012-02 1.212112
2012-03 -0.173215
Freq: M, dtype: float64
```
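pandas lets you capture both representations and convert between them. Under the hood, pandas represents timestamps with instances of `Timestamp` and sequences of timestamps with instances of `DatetimeIndex`. For regular time spans, pandas uses `Period` objects for scalar values and `PeriodIndex` for sequences of spans. Better support for irregular intervals with arbitrary start and end points is planned for future releases.
## Converting to timestamps
To convert a [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") or list-like object of date-like values (strings, epochs, or a mixture), use the `to_datetime` function. When passed a `Series`, it returns a `Series` with the same index; a list-like is converted to a `DatetimeIndex`.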
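As a small sketch of converting between the two representations, a `Timestamp` can be turned into the `Period` that contains it with `to_period`, and a `Period` back into a `Timestamp` with `to_timestamp`. The values below are illustrative.

```python
import pandas as pd

ts = pd.Timestamp('2012-05-01 13:45')

# The monthly span that contains the timestamp
month = ts.to_period('M')

# Back to a point in time; by default this is the start of the span
start = month.to_timestamp()

# A Period exposes both boundaries of its span
contains = month.start_time <= ts <= month.end_time

print(month, start, contains)
```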
```python
In [43]: pd.to_datetime(pd.Series(['Jul 31, 2009', '2010-01-10', None]))
Out[43]:
0 2009-07-31
1 2010-01-10
2 NaT
dtype: datetime64[ns]
In [44]: pd.to_datetime(['2005/11/23', '2010.12.31'])
Out[44]: DatetimeIndex(['2005-11-23', '2010-12-31'], dtype='datetime64[ns]', freq=None)
```
If you use dates that start with the day first (i.e. European style), pass the `dayfirst` keyword argument:
```python
In [45]: pd.to_datetime(['04-01-2012 10:00'], dayfirst=True)
Out[45]: DatetimeIndex(['2012-01-04 10:00:00'], dtype='datetime64[ns]', freq=None)
In [46]: pd.to_datetime(['14-01-2012', '01-14-2012'], dayfirst=True)
Out[46]: DatetimeIndex(['2012-01-14', '2012-01-14'], dtype='datetime64[ns]', freq=None)
```
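::: danger Warning
As the example above shows, `dayfirst` is not strict: if a date cannot be parsed with the day first, it is parsed as if `dayfirst` were `False`.
:::
Passing a single string to `to_datetime` returns a single `Timestamp`. `Timestamp` also accepts string input, but it does not accept string-parsing options such as `dayfirst` or `format`; use `to_datetime` if you need them.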
```python
In [47]: pd.to_datetime('2010/11/12')
Out[47]: Timestamp('2010-11-12 00:00:00')
In [48]: pd.Timestamp('2010/11/12')
Out[48]: Timestamp('2010-11-12 00:00:00')
```
You can also use the `DatetimeIndex` constructor directly:
```python
In [49]: pd.DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'])
Out[49]: DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], dtype='datetime64[ns]', freq=None)
```
Pass the string `'infer'` as `freq` to have the frequency of the index inferred on creation:
```python
In [50]: pd.DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], freq='infer')
Out[50]: DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05'], dtype='datetime64[ns]', freq='2D')
```
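### Providing a format argument
In addition to the required datetime string, a `format` argument can be passed to ensure specific parsing. This can also speed up the conversion considerably.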
```python
In [51]: pd.to_datetime('2010/11/12', format='%Y/%m/%d')
Out[51]: Timestamp('2010-11-12 00:00:00')
In [52]: pd.to_datetime('12-11-2010 00:00', format='%d-%m-%Y %H:%M')
Out[52]: Timestamp('2010-11-12 00:00:00')
```
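For more information on the available `format` options, see the Python [datetime documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).
### Assembling datetime from multiple DataFrame columns
*New in version 0.18.1.*
You can also pass a `DataFrame` of integer or string columns to assemble into a `Series` of `Timestamps`.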
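A brief sketch of why an explicit `format` makes parsing precise: a string that does not match the given format raises `ValueError` instead of being guessed at. The variable names are illustrative.

```python
import pandas as pd

# Matches the format exactly: day-month-year
ok = pd.to_datetime('12-11-2010 00:00', format='%d-%m-%Y %H:%M')

# A string with a different layout is rejected rather than reinterpreted
try:
    pd.to_datetime('2010/11/12', format='%d-%m-%Y')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(ok, mismatch_raised)
```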
```python
In [53]: df = pd.DataFrame({'year': [2015, 2016],
....: 'month': [2, 3],
....: 'day': [4, 5],
....: 'hour': [2, 3]})
....:
In [54]: pd.to_datetime(df)
Out[54]:
0 2015-02-04 02:00:00
1 2016-03-05 03:00:00
dtype: datetime64[ns]
```
You can pass only the columns that you need to assemble.
```python
In [55]: pd.to_datetime(df[['year', 'month', 'day']])
Out[55]:
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
```
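`pd.to_datetime` looks for standard designations of the datetime components in the column names, including:
* required: `year`, `month`, `day`
* optional: `hour`, `minute`, `second`, `millisecond`, `microsecond`, `nanosecond`
### Invalid data
The default behavior, `errors='raise'`, is to raise when the input cannot be parsed: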
```python
In [2]: pd.to_datetime(['2009/07/31', 'asd'], errors='raise')
ValueError: Unknown string format
```
Pass `errors='ignore'` to return the original input when it cannot be parsed:
```python
In [56]: pd.to_datetime(['2009/07/31', 'asd'], errors='ignore')
Out[56]: Index(['2009/07/31', 'asd'], dtype='object')
```
Pass `errors='coerce'` to convert unparseable data to `NaT` (not a time):
```python
In [57]: pd.to_datetime(['2009/07/31', 'asd'], errors='coerce')
Out[57]: DatetimeIndex(['2009-07-31', 'NaT'], dtype='datetime64[ns]', freq=None)
```
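A common follow-up to `errors='coerce'` is to inspect which inputs were rejected. The sketch below, with illustrative data, uses the resulting `NaT` mask to pull out the offending raw values. Note that recent pandas releases deprecate `errors='ignore'`, so `'coerce'` combined with an explicit check is the more portable pattern.

```python
import pandas as pd

raw = pd.Series(['2009/07/31', 'asd', '2010/01/10'])

# Unparseable entries become NaT instead of raising
parsed = pd.to_datetime(raw, errors='coerce')

# Use the NaT mask to recover the raw values that failed to parse
rejected = raw[parsed.isna()]

print(parsed)
print(rejected.tolist())
```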
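### Epoch timestamps
pandas supports converting integer or float epoch times to `Timestamp` and `DatetimeIndex`. The default unit is nanoseconds, since that is how `Timestamp` objects are stored internally. However, epochs are often stored in another unit, which can be specified with `unit`. They are computed from the starting point specified by the `origin` parameter.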
```python
In [58]: pd.to_datetime([1349720105, 1349806505, 1349892905,
....: 1349979305, 1350065705], unit='s')
....:
Out[58]:
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
'2012-10-10 18:15:05', '2012-10-11 18:15:05',
'2012-10-12 18:15:05'],
dtype='datetime64[ns]', freq=None)
In [59]: pd.to_datetime([1349720105100, 1349720105200, 1349720105300,
....: 1349720105400, 1349720105500], unit='ms')
....:
Out[59]:
DatetimeIndex(['2012-10-08 18:15:05.100000', '2012-10-08 18:15:05.200000',
'2012-10-08 18:15:05.300000', '2012-10-08 18:15:05.400000',
'2012-10-08 18:15:05.500000'],
dtype='datetime64[ns]', freq=None)
```
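Constructing a [`Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp "pandas.Timestamp") or [`DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html#pandas.DatetimeIndex "pandas.DatetimeIndex") from an epoch timestamp with the `tz` argument specified currently localizes the epoch timestamps to UTC first and then converts the result to the specified time zone. However, this behavior is [deprecated](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#whatsnew-0240-deprecations-integer-tz). If you have epochs in wall time in another time zone, it is recommended to read them as timezone-naive timestamps and then localize to the appropriate time zone.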
```python
In [60]: pd.Timestamp(1262347200000000000).tz_localize('US/Pacific')
Out[60]: Timestamp('2010-01-01 12:00:00-0800', tz='US/Pacific')
In [61]: pd.DatetimeIndex([1262347200000000000]).tz_localize('US/Pacific')
Out[61]: DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)
```
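::: tip Note
Epoch times are rounded to the nearest nanosecond.
:::
::: danger Warning
Converting float epoch times can lead to inaccurate and unexpected results. [Python floats](https://docs.python.org/3/tutorial/floatingpoint.html#tut-fp-issues "(in Python v3.7)") have about 15 significant decimal digits of precision, so rounding during conversion from a float to a high-precision `Timestamp` is unavoidable. The only way to achieve exact precision is to use a fixed-width type such as `int64`.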
```python
In [62]: pd.to_datetime([1490195805.433, 1490195805.433502912], unit='s')
Out[62]: DatetimeIndex(['2017-03-22 15:16:45.433000088', '2017-03-22 15:16:45.433502913'], dtype='datetime64[ns]', freq=None)
In [63]: pd.to_datetime(1490195805433502912, unit='ns')
Out[63]: Timestamp('2017-03-22 15:16:45.433502912')
```
:::
::: tip See also
[Using the `origin` parameter](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-origin)
:::
### From timestamps to epoch
To invert the operation above, that is, to convert a `Timestamp` to a Unix epoch:
```python
In [64]: stamps = pd.date_range('2012-10-08 18:15:05', periods=4, freq='D')
In [65]: stamps
Out[65]:
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
'2012-10-10 18:15:05', '2012-10-11 18:15:05'],
dtype='datetime64[ns]', freq='D')
```
Subtract the epoch (midnight on January 1, 1970 UTC) and then floor-divide by the unit (1 second, `pd.Timedelta('1s')`).
```python
In [66]: (stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
Out[66]: Int64Index([1349720105, 1349806505, 1349892905, 1349979305], dtype='int64')
```
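The two directions can be checked against each other. A minimal round-trip sketch: convert timestamps to integer seconds as above, then feed those integers back through `to_datetime` with `unit='s'`.

```python
import pandas as pd

stamps = pd.date_range('2012-10-08 18:15:05', periods=4, freq='D')

# Timestamps -> whole seconds since the Unix epoch
epoch_seconds = (stamps - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')

# Seconds since the epoch -> timestamps again
round_trip = pd.to_datetime(epoch_seconds, unit='s')

print(list(epoch_seconds))
print(round_trip[0])
```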
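### Using the `origin` parameter
*New in version 0.20.0.*
The `origin` parameter specifies an alternative starting point for creating a `DatetimeIndex`. For example, to use `1960-01-01` as the starting date: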
```python
In [67]: pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
Out[67]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
```
The default is `origin='unix'`, which corresponds to `1970-01-01 00:00:00`. This point is commonly called the Unix epoch or POSIX time.
```python
In [68]: pd.to_datetime([1, 2, 3], unit='D')
Out[68]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
```
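One way to read `origin` together with `unit` is as plain addition: each number is a count of `unit` steps added to the origin. The sketch below checks this equivalence for the example values; the variable names are illustrative.

```python
import pandas as pd

origin = pd.Timestamp('1960-01-01')

# Numbers interpreted as days counted from the origin
via_origin = pd.to_datetime([1, 2, 3], unit='D', origin=origin)

# The same result expressed as origin + timedelta
via_addition = origin + pd.to_timedelta([1, 2, 3], unit='D')

print(via_origin)
print(via_addition)
```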
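## Generating ranges of timestamps
To generate an index of timestamps, use either the `DatetimeIndex` or the `Index` constructor and pass in a list of `datetime` objects: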
```python
In [69]: dates = [datetime.datetime(2012, 5, 1),
....: datetime.datetime(2012, 5, 2),
....: datetime.datetime(2012, 5, 3)]
....:
# Note the frequency information
In [70]: index = pd.DatetimeIndex(dates)
In [71]: index
Out[71]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
# Automatically converted to DatetimeIndex
In [72]: index = pd.Index(dates)
In [73]: index
Out[73]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
```
In practice this becomes cumbersome, because we often need a very long index with a large number of timestamps. If the timestamps fall on a regular frequency, use the [`date_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html#pandas.date_range "pandas.date_range") and [`bdate_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.bdate_range.html#pandas.bdate_range "pandas.bdate_range") functions to create a `DatetimeIndex`. The default frequency for `date_range` is a **calendar day**, while the default for `bdate_range` is a **business day**:
```python
In [74]: start = datetime.datetime(2011, 1, 1)
In [75]: end = datetime.datetime(2012, 1, 1)
In [76]: index = pd.date_range(start, end)
In [77]: index
Out[77]:
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
'2011-01-09', '2011-01-10',
...
'2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
'2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
'2011-12-31', '2012-01-01'],
dtype='datetime64[ns]', length=366, freq='D')
In [78]: index = pd.bdate_range(start, end)
In [79]: index
Out[79]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
'2011-01-13', '2011-01-14',
...
'2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
'2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
'2011-12-29', '2011-12-30'],
dtype='datetime64[ns]', length=260, freq='B')
```
Convenience functions such as `date_range` and `bdate_range` can use a variety of [frequency aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases):
```python
In [80]: pd.date_range(start, periods=1000, freq='M')
Out[80]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30',
'2011-05-31', '2011-06-30', '2011-07-31', '2011-08-31',
'2011-09-30', '2011-10-31',
...
'2093-07-31', '2093-08-31', '2093-09-30', '2093-10-31',
'2093-11-30', '2093-12-31', '2094-01-31', '2094-02-28',
'2094-03-31', '2094-04-30'],
dtype='datetime64[ns]', length=1000, freq='M')
In [81]: pd.bdate_range(start, periods=250, freq='BQS')
Out[81]:
DatetimeIndex(['2011-01-03', '2011-04-01', '2011-07-01', '2011-10-03',
'2012-01-02', '2012-04-02', '2012-07-02', '2012-10-01',
'2013-01-01', '2013-04-01',
...
'2071-01-01', '2071-04-01', '2071-07-01', '2071-10-01',
'2072-01-01', '2072-04-01', '2072-07-01', '2072-10-03',
'2073-01-02', '2073-04-03'],
dtype='datetime64[ns]', length=250, freq='BQS-JAN')
```
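`date_range` and `bdate_range` make it easy to generate a range of dates using various combinations of parameters such as `start`, `end`, `periods`, and `freq`. The start and end dates are strictly inclusive, so dates outside those specified are never generated: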
```python
In [82]: pd.date_range(start, end, freq='BM')
Out[82]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BM')
In [83]: pd.date_range(start, end, freq='W')
Out[83]:
DatetimeIndex(['2011-01-02', '2011-01-09', '2011-01-16', '2011-01-23',
'2011-01-30', '2011-02-06', '2011-02-13', '2011-02-20',
'2011-02-27', '2011-03-06', '2011-03-13', '2011-03-20',
'2011-03-27', '2011-04-03', '2011-04-10', '2011-04-17',
'2011-04-24', '2011-05-01', '2011-05-08', '2011-05-15',
'2011-05-22', '2011-05-29', '2011-06-05', '2011-06-12',
'2011-06-19', '2011-06-26', '2011-07-03', '2011-07-10',
'2011-07-17', '2011-07-24', '2011-07-31', '2011-08-07',
'2011-08-14', '2011-08-21', '2011-08-28', '2011-09-04',
'2011-09-11', '2011-09-18', '2011-09-25', '2011-10-02',
'2011-10-09', '2011-10-16', '2011-10-23', '2011-10-30',
'2011-11-06', '2011-11-13', '2011-11-20', '2011-11-27',
'2011-12-04', '2011-12-11', '2011-12-18', '2011-12-25',
'2012-01-01'],
dtype='datetime64[ns]', freq='W-SUN')
In [84]: pd.bdate_range(end=end, periods=20)
Out[84]:
DatetimeIndex(['2011-12-05', '2011-12-06', '2011-12-07', '2011-12-08',
'2011-12-09', '2011-12-12', '2011-12-13', '2011-12-14',
'2011-12-15', '2011-12-16', '2011-12-19', '2011-12-20',
'2011-12-21', '2011-12-22', '2011-12-23', '2011-12-26',
'2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30'],
dtype='datetime64[ns]', freq='B')
In [85]: pd.bdate_range(start=start, periods=20)
Out[85]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
'2011-01-13', '2011-01-14', '2011-01-17', '2011-01-18',
'2011-01-19', '2011-01-20', '2011-01-21', '2011-01-24',
'2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28'],
dtype='datetime64[ns]', freq='B')
```
*New in version 0.23.0.*
Specifying `start`, `end`, and `periods` generates a range of evenly spaced dates from `start` to `end`, both inclusive, with `periods` elements in the resulting `DatetimeIndex`:
```python
In [86]: pd.date_range('2018-01-01', '2018-01-05', periods=5)
Out[86]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05'],
dtype='datetime64[ns]', freq=None)
In [87]: pd.date_range('2018-01-01', '2018-01-05', periods=10)
Out[87]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 10:40:00',
'2018-01-01 21:20:00', '2018-01-02 08:00:00',
'2018-01-02 18:40:00', '2018-01-03 05:20:00',
'2018-01-03 16:00:00', '2018-01-04 02:40:00',
'2018-01-04 13:20:00', '2018-01-05 00:00:00'],
dtype='datetime64[ns]', freq=None)
```
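With `start`, `end`, and `periods` all given, the spacing follows from dividing the total span into `periods - 1` equal steps. A short sketch that recovers the step for the second example above:

```python
import pandas as pd

rng = pd.date_range('2018-01-01', '2018-01-05', periods=10)

# Total span divided into (periods - 1) equal intervals
step = (rng[-1] - rng[0]) / (len(rng) - 1)

print(step)
print(rng[1] - rng[0])
```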
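### Custom frequency ranges
`bdate_range` can also generate a range of custom-frequency dates through the `weekmask` and `holidays` parameters. These parameters are only used when a custom frequency string is passed.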
```python
In [88]: weekmask = 'Mon Wed Fri'
In [89]: holidays = [datetime.datetime(2011, 1, 5), datetime.datetime(2011, 3, 14)]
In [90]: pd.bdate_range(start, end, freq='C', weekmask=weekmask, holidays=holidays)
Out[90]:
DatetimeIndex(['2011-01-03', '2011-01-07', '2011-01-10', '2011-01-12',
'2011-01-14', '2011-01-17', '2011-01-19', '2011-01-21',
'2011-01-24', '2011-01-26',
...
'2011-12-09', '2011-12-12', '2011-12-14', '2011-12-16',
'2011-12-19', '2011-12-21', '2011-12-23', '2011-12-26',
'2011-12-28', '2011-12-30'],
dtype='datetime64[ns]', length=154, freq='C')
In [91]: pd.bdate_range(start, end, freq='CBMS', weekmask=weekmask)
Out[91]:
DatetimeIndex(['2011-01-03', '2011-02-02', '2011-03-02', '2011-04-01',
'2011-05-02', '2011-06-01', '2011-07-01', '2011-08-01',
'2011-09-02', '2011-10-03', '2011-11-02', '2011-12-02'],
dtype='datetime64[ns]', freq='CBMS')
```
::: tip See also
[Custom business days](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-custombusinessdays)
:::
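## Timestamp limitations
Since pandas represents timestamps at nanosecond resolution, the time span that a 64-bit integer can represent is limited to approximately 584 years: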
```python
In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')
In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
```
::: tip See also
[Representing out-of-bounds spans](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob)
:::
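For dates outside these bounds, one option is to represent them as `Period` values rather than timestamps. A minimal sketch, using an arbitrary date in the year 1500 as an illustration:

```python
import pandas as pd

# The nanosecond-resolution bounds span roughly 584 years
span_years = pd.Timestamp.max.year - pd.Timestamp.min.year

# A daily Period can hold a date well before Timestamp.min
early = pd.Period('1500-01-01', freq='D')

print(span_years)
print(early, early.year)
```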
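## Indexing
One of the main uses of `DatetimeIndex` is as an index for pandas objects. The `DatetimeIndex` class contains many time-series-related optimizations:
* A large range of dates for various offsets is pre-computed and cached under the hood, which makes generating subsequent date ranges very fast (it only has to grab a slice).
* Fast shifting using the `shift` and `tshift` methods on pandas objects.
* Unioning of overlapping `DatetimeIndex` objects with the same frequency is very fast (important for fast data alignment).
* Quick access to date fields via properties such as `year`, `month`, etc.
* Regularization functions like `snap` and very fast `asof` logic.
`DatetimeIndex` objects have all the basic functionality of regular `Index` objects, plus a range of advanced time-series-specific methods that simplify frequency processing.
::: tip See also
[Reindexing methods](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-reindexing)
:::
::: tip Note
pandas does not force you to have a sorted date index, but some of these methods may behave unexpectedly or incorrectly if the dates are unsorted.
:::
`DatetimeIndex` can be used like a regular index and offers all of its intelligent functionality, such as selection and slicing.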
```python
In [94]: rng = pd.date_range(start, end, freq='BM')
In [95]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [96]: ts.index
Out[96]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BM')
In [97]: ts[:5].index
Out[97]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31'],
dtype='datetime64[ns]', freq='BM')
In [98]: ts[::2].index
Out[98]:
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
'2011-09-30', '2011-11-30'],
dtype='datetime64[ns]', freq='2BM')
```
### Partial string indexing
Dates and strings that parse to timestamps can be passed as indexing parameters:
```python
In [99]: ts['1/31/2011']
Out[99]: 0.11920871129693428
In [100]: ts[datetime.datetime(2011, 12, 25):]
Out[100]:
2011-12-30 0.56702
Freq: BM, dtype: float64
In [101]: ts['10/31/2011':'12/31/2011']
Out[101]:
2011-10-31 0.271860
2011-11-30 -0.424972
2011-12-30 0.567020
Freq: BM, dtype: float64
```
For convenient access to longer time series, you can also pass the **year** or the **year and month** as strings:
```python
In [102]: ts['2011']
Out[102]:
2011-01-31 0.119209
2011-02-28 -1.044236
2011-03-31 -0.861849
2011-04-29 -2.104569
2011-05-31 -0.494929
2011-06-30 1.071804
2011-07-29 0.721555
2011-08-31 -0.706771
2011-09-30 -1.039575
2011-10-31 0.271860
2011-11-30 -0.424972
2011-12-30 0.567020
Freq: BM, dtype: float64
In [103]: ts['2011-6']
Out[103]:
2011-06-30 1.071804
Freq: BM, dtype: float64
```
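This type of slicing also works on a `DataFrame` with a `DatetimeIndex`. Since partial string selection is a form of label slicing, the endpoints **are** included, which means all times that match an included date are included as well: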
```python
In [104]: dft = pd.DataFrame(np.random.randn(100000, 1), columns=['A'],
.....: index=pd.date_range('20130101', periods=100000, freq='T'))
.....:
In [105]: dft
Out[105]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043
[100000 rows x 1 columns]
In [106]: dft['2013']
Out[106]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043
[100000 rows x 1 columns]
```
The following starts at the very first time in January and includes the last date and time of February:
```python
In [107]: dft['2013-1':'2013-2']
Out[107]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-28 23:55:00 0.850929
2013-02-28 23:56:00 0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517
[84960 rows x 1 columns]
```
The following specifies a stop date that **includes all of the times on that last day**:
```python
In [108]: dft['2013-1':'2013-2-28']
Out[108]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-28 23:55:00 0.850929
2013-02-28 23:56:00 0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517
[84960 rows x 1 columns]
```
The following specifies an exact stop time; note that the result differs from the one above:
```python
In [109]: dft['2013-1':'2013-2-28 00:00:00']
Out[109]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-27 23:56:00 1.197749
2013-02-27 23:57:00 0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501
[83521 rows x 1 columns]
```
The slice stops on the included endpoint, because it is part of the index:
```python
In [110]: dft['2013-1-15':'2013-1-15 12:30:00']
Out[110]:
A
2013-01-15 00:00:00 -0.984810
2013-01-15 00:01:00 0.941451
2013-01-15 00:02:00 1.559365
2013-01-15 00:03:00 1.034374
2013-01-15 00:04:00 -1.480656
... ...
2013-01-15 12:26:00 0.371454
2013-01-15 12:27:00 -0.930806
2013-01-15 12:28:00 -0.069177
2013-01-15 12:29:00 0.066510
2013-01-15 12:30:00 -0.003945
[751 rows x 1 columns]
```
*New in version 0.18.0.*
`DatetimeIndex` partial string indexing also works on a `DataFrame` with a `MultiIndex`:
```python
In [111]: dft2 = pd.DataFrame(np.random.randn(20, 1),
.....: columns=['A'],
.....: index=pd.MultiIndex.from_product(
.....: [pd.date_range('20130101', periods=10, freq='12H'),
.....: ['a', 'b']]))
.....:
In [112]: dft2
Out[112]:
A
2013-01-01 00:00:00 a -0.298694
b 0.823553
2013-01-01 12:00:00 a 0.943285
b -1.479399
2013-01-02 00:00:00 a -1.643342
... ...
2013-01-04 12:00:00 b 0.069036
2013-01-05 00:00:00 a 0.122297
b 1.422060
2013-01-05 12:00:00 a 0.370079
b 1.016331
[20 rows x 1 columns]
In [113]: dft2.loc['2013-01-05']
Out[113]:
A
2013-01-05 00:00:00 a 0.122297
b 1.422060
2013-01-05 12:00:00 a 0.370079
b 1.016331
In [114]: idx = pd.IndexSlice
In [115]: dft2 = dft2.swaplevel(0, 1).sort_index()
In [116]: dft2.loc[idx[:, '2013-01-05'], :]
Out[116]:
A
a 2013-01-05 00:00:00 0.122297
2013-01-05 12:00:00 0.370079
b 2013-01-05 00:00:00 1.422060
2013-01-05 12:00:00 1.016331
```
*New in version 0.25.0.*
Slicing with string indexing also honors UTC offsets.
```python
In [117]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))
In [118]: df
Out[118]:
0
2019-01-01 00:00:00-08:00 0
In [119]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[119]:
0
2019-01-01 00:00:00-08:00 0
```
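### Slice vs. exact match
*Changed in version 0.20.0.*
The same string used as an indexing parameter can be treated either as a slice or as an exact match, depending on the resolution of the index. If the string is less precise than the index, it is treated as a slice; otherwise it is treated as an exact match.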
```python
In [120]: series_minute = pd.Series([1, 2, 3],
.....: pd.DatetimeIndex(['2011-12-31 23:59:00',
.....: '2012-01-01 00:00:00',
.....: '2012-01-01 00:02:00']))
.....:
In [121]: series_minute.index.resolution
Out[121]: 'minute'
```
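A timestamp string that is less precise than the index gives a `Series` object. Here `series_minute` has minute resolution, while the timestamp string is only precise to the hour: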
```python
In [122]: series_minute['2011-12-31 23']
Out[122]:
2011-12-31 23:59:00 1
dtype: int64
```
A timestamp string with minute resolution (or more precise) gives a scalar instead; it is not cast to a slice.
```python
In [123]: series_minute['2011-12-31 23:59']
Out[123]: 1
In [124]: series_minute['2011-12-31 23:59:00']
Out[124]: 1
```
If the index resolution is second, a minute-precision timestamp string gives a `Series`.
```python
In [125]: series_second = pd.Series([1, 2, 3],
.....: pd.DatetimeIndex(['2011-12-31 23:59:59',
.....: '2012-01-01 00:00:00',
.....: '2012-01-01 00:00:01']))
.....:
In [126]: series_second.index.resolution
Out[126]: 'second'
In [127]: series_second['2011-12-31 23:59']
Out[127]:
2011-12-31 23:59:59 1
dtype: int64
```
If the timestamp string is treated as a slice, it can also be used to index a `DataFrame` with `[]`.
```python
In [128]: dft_minute = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
.....: index=series_minute.index)
.....:
In [129]: dft_minute['2011-12-31 23']
Out[129]:
a b
2011-12-31 23:59:00 1 4
```
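::: danger Warning
If the string is treated as an exact match, however, selection with `[]` on a `DataFrame` is column-wise rather than row-wise; see [Indexing basics](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-basics). For example, `dft_minute['2011-12-31 23:59']` raises `KeyError`, because `'2011-12-31 23:59'` has the same resolution as the index and there is no column with that name.
To always get unambiguous selection, whether the row is treated as a slice or as a single selection, use `.loc`.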
```python
In [130]: dft_minute.loc['2011-12-31 23:59']
Out[130]:
a 1
b 4
Name: 2011-12-31 23:59:00, dtype: int64
```
:::
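The sketch below contrasts the two cases through `.loc`, which always selects rows: a string coarser than the index resolution yields a `DataFrame` slice, while a string at the index resolution yields a single row as a `Series`. Newer pandas releases no longer accept row selection by a single string with plain `[]` on a `DataFrame`, so `.loc` is also the more portable spelling.

```python
import pandas as pd

idx = pd.DatetimeIndex(['2011-12-31 23:59:00',
                        '2012-01-01 00:00:00',
                        '2012-01-01 00:02:00'])
dft_minute = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=idx)

# Hour precision is coarser than the minute-resolution index: a slice
rows = dft_minute.loc['2011-12-31 23']

# Minute precision matches the index resolution: an exact match
row = dft_minute.loc['2011-12-31 23:59']

print(type(rows).__name__, len(rows))
print(type(row).__name__, row['a'], row['b'])
```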
Note also that the resolution of a `DatetimeIndex` cannot be less precise than day.
```python
In [131]: series_monthly = pd.Series([1, 2, 3],
.....: pd.DatetimeIndex(['2011-12', '2012-01', '2012-02']))
.....:
In [132]: series_monthly.index.resolution
Out[132]: 'day'
In [133]: series_monthly['2011-12'] # returns a Series
Out[133]:
2011-12-01 1
dtype: int64
```
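### Exact indexing
As discussed in the previous section, indexing a `DatetimeIndex` with a partial string depends on the **precision** of the period, in other words on how specific the interval is relative to the resolution of the index. In contrast, indexing with `Timestamp` or `datetime` objects is exact, because these objects have an exact meaning. They also follow the semantics of including both endpoints.
These `Timestamp` and `datetime` objects have exact `hours`, `minutes`, and `seconds`, even when those were not explicitly specified (they default to 0).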
```python
In [134]: dft[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 2, 28)]
Out[134]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-27 23:56:00 1.197749
2013-02-27 23:57:00 0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501
[83521 rows x 1 columns]
```
With no defaults:
```python
In [135]: dft[datetime.datetime(2013, 1, 1, 10, 12, 0):
.....: datetime.datetime(2013, 2, 28, 10, 12, 0)]
.....:
Out[135]:
A
2013-01-01 10:12:00 0.565375
2013-01-01 10:13:00 0.068184
2013-01-01 10:14:00 0.788871
2013-01-01 10:15:00 -0.280343
2013-01-01 10:16:00 0.931536
... ...
2013-02-28 10:08:00 0.148098
2013-02-28 10:09:00 -0.388138
2013-02-28 10:10:00 0.139348
2013-02-28 10:11:00 0.085288
2013-02-28 10:12:00 0.950146
[83521 rows x 1 columns]
```
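The difference between partial string slicing and exact `datetime` slicing is easiest to see on a tiny series. In this sketch, with illustrative data, the string end bound covers the whole final day, whereas the `datetime` end bound means exactly midnight:

```python
import datetime

import pandas as pd

idx = pd.to_datetime(['2013-01-01 00:00', '2013-01-01 12:00',
                      '2013-01-02 00:00', '2013-01-02 12:00'])
s = pd.Series(range(4), index=idx)

# Partial string: the end bound extends to the end of 2013-01-02
by_string = s.loc['2013-01-01':'2013-01-02']

# datetime objects: the end bound is exactly 2013-01-02 00:00:00
by_datetime = s.loc[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 1, 2)]

print(len(by_string), len(by_datetime))
```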
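### Truncating and fancy indexing
The [`truncate()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.truncate.html#pandas.DataFrame.truncate "pandas.DataFrame.truncate") convenience function is similar to slicing. Note that `truncate` assumes a value of 0 for any unspecified date component in a `DatetimeIndex`, in contrast to slicing, which returns any partially matching dates: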
```python
In [136]: rng2 = pd.date_range('2011-01-01', '2012-01-01', freq='W')
In [137]: ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)
In [138]: ts2.truncate(before='2011-11', after='2011-12')
Out[138]:
2011-11-06 0.437823
2011-11-13 -0.293083
2011-11-20 -0.059881
2011-11-27 1.252450
Freq: W-SUN, dtype: float64
In [139]: ts2['2011-11':'2011-12']
Out[139]:
2011-11-06 0.437823
2011-11-13 -0.293083
2011-11-20 -0.059881
2011-11-27 1.252450
2011-12-04 0.046611
2011-12-11 0.059478
2011-12-18 -0.286539
2011-12-25 0.841669
Freq: W-SUN, dtype: float64
```
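Even complicated fancy indexing that breaks the frequency regularity of the `DatetimeIndex` still results in a `DatetimeIndex`, although the frequency information is lost (note `freq=None`):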
```python
In [140]: ts2[[0, 2, 6]].index
Out[140]: DatetimeIndex(['2011-01-02', '2011-01-16', '2011-02-13'], dtype='datetime64[ns]', freq=None)
```
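The contrast between `truncate` and partial string slicing can be checked on a daily series. In this sketch, with an illustrative date window, `after='2011-12'` is read as `2011-12-01 00:00:00` by `truncate`, while the slice bound `'2011-12'` covers all of December:

```python
import pandas as pd

rng = pd.date_range('2011-11-01', '2011-12-31', freq='D')
s = pd.Series(range(len(rng)), index=rng)

# truncate fills unspecified components with their first value: stops at Dec 1
truncated = s.truncate(before='2011-11', after='2011-12')

# partial string slicing matches every date in the final month
sliced = s.loc['2011-11':'2011-12']

print(truncated.index[-1], len(truncated))
print(sliced.index[-1], len(sliced))
```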
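## Time/date components
The following time/date properties can be accessed from a `Timestamp` or from a collection of timestamps such as a `DatetimeIndex`.
| Property | Description |
| :---------------: | :----------------------------------------------------: |
| year | The year of the datetime |
| month | The month of the datetime |
| day | The day of the datetime |
| hour | The hour of the datetime |
| minute | The minutes of the datetime |
| second | The seconds of the datetime |
| microsecond | The microseconds of the datetime |
| nanosecond | The nanoseconds of the datetime |
| date | Returns `datetime.date` (does not contain time zone information) |
| time | Returns `datetime.time` (does not contain time zone information) |
| timetz | Returns `datetime.time` as local time with time zone information |
| dayofyear | The ordinal day of the year |
| weekofyear | The week ordinal of the year |
| week | The week ordinal of the year |
| dayofweek | The number of the day of the week, with Monday=0, Sunday=6 |
| weekday | The number of the day of the week, with Monday=0, Sunday=6 |
| weekday_name | The name of the day of the week (e.g. Friday) |
| quarter | Quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc. |
| days_in_month | The number of days in the month of the datetime |
| is_month_start | Logical indicating if first day of month (defined by frequency) |
| is_month_end | Logical indicating if last day of month (defined by frequency) |
| is_quarter_start | Logical indicating if first day of quarter (defined by frequency) |
| is_quarter_end | Logical indicating if last day of quarter (defined by frequency) |
| is_year_start | Logical indicating if first day of year (defined by frequency) |
| is_year_end | Logical indicating if last day of year (defined by frequency) |
| is_leap_year | Logical indicating if the date belongs to a leap year |
Furthermore, if you have a `Series` with datetime-like values, you can access these properties through the `.dt` accessor, as detailed in the section on [.dt accessors](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dt-accessors).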
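A short sketch of the `.dt` accessor on a `Series` of datetimes, using the Friday from the earlier examples as the starting date. It uses the `day_name()` method shown earlier in this guide to get the name of the day.

```python
import pandas as pd

s = pd.Series(pd.date_range('2018-01-05', periods=3, freq='D'))

# Component properties are exposed element-wise through .dt
years = s.dt.year.tolist()
weekdays = s.dt.dayofweek.tolist()   # Monday=0, Sunday=6
names = s.dt.day_name().tolist()

print(years, weekdays, names)
```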
## DateOffset objects
In the preceding examples, frequency strings (e.g. `'D'`) were used to specify a frequency that defined:
* how the date times in a [`DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html#pandas.DatetimeIndex "pandas.DatetimeIndex") are spaced when using [`date_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html#pandas.date_range "pandas.date_range")
* the frequency of a [`Period`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Period.html#pandas.Period "pandas.Period") or [`PeriodIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.PeriodIndex.html#pandas.PeriodIndex "pandas.PeriodIndex")
These frequency strings map to a `DateOffset` object and its subclasses. A `DateOffset` is similar to a [`Timedelta`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html#pandas.Timedelta "pandas.Timedelta"), which represents a duration of time, but it follows specific calendar duration rules. For example, a [`Timedelta`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html#pandas.Timedelta "pandas.Timedelta") day always increments a datetime by 24 hours, while a `DateOffset` day increments a datetime to the same time on the next day, whether that day lasts 23, 24, or 25 hours because of daylight saving time. However, all `DateOffset` subclasses that are an hour or smaller (`Hour`, `Minute`, `Second`, `Milli`, `Micro`, `Nano`) behave like [`Timedelta`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html#pandas.Timedelta "pandas.Timedelta") and respect absolute time.
The basic `DateOffset` acts similarly to `dateutil.relativedelta` ([relativedelta documentation](https://dateutil.readthedocs.io/en/stable/relativedelta.html)): it shifts a date time by the specified calendar duration. The shift can be performed with the arithmetic operator (`+`) or with the `apply` method.
```python
# This particular day contains a daylight saving time transition
In [141]: ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')
# Respects absolute time
In [142]: ts + pd.Timedelta(days=1)
Out[142]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')
# Respects calendar time
In [143]: ts + pd.DateOffset(days=1)
Out[143]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')
In [144]: friday = pd.Timestamp('2018-01-05')
In [145]: friday.day_name()
Out[145]: 'Friday'
# Add 2 business days (Friday --> Tuesday)
In [146]: two_business_days = 2 * pd.offsets.BDay()
In [147]: two_business_days.apply(friday)
Out[147]: Timestamp('2018-01-09 00:00:00')
In [148]: friday + two_business_days
Out[148]: Timestamp('2018-01-09 00:00:00')
In [149]: (friday + two_business_days).day_name()
Out[149]: 'Tuesday'
```
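The contrast between a fixed duration and a calendar rule also shows up without time zones. The sketch below is an illustration under assumed sample dates; it uses only the `+` operator, and the variable names are hypothetical.

```python
import pandas as pd

ts = pd.Timestamp('2018-01-31')

# Fixed duration: always exactly 30 * 24 hours later.
by_timedelta = ts + pd.Timedelta(days=30)

# Calendar rule: "same day next month", clipped to the last day of February.
by_offset = ts + pd.DateOffset(months=1)

# Business-day rule: Friday plus two business days skips the weekend.
friday = pd.Timestamp('2018-01-05')
two_bdays_later = friday + 2 * pd.offsets.BDay()

print(by_timedelta, by_offset, two_bdays_later)
```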
Most `DateOffset` classes have an associated frequency string, or offset alias, that can be passed as the `freq` keyword argument. The available date offsets and their frequency strings are listed below:
| Date offset | Frequency string | Description |
| :---: | :---: | :---: |
| [`DateOffset`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.DateOffset.html#pandas.tseries.offsets.DateOffset) | None | Generic offset class, defaults to 1 calendar day |
| [`BDay`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BDay.html#pandas.tseries.offsets.BDay) or [`BusinessDay`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BusinessDay.html#pandas.tseries.offsets.BusinessDay) | `'B'` | business day (weekday) |
| [`CDay`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.CDay.html#pandas.tseries.offsets.CDay) or [`CustomBusinessDay`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.CustomBusinessDay.html#pandas.tseries.offsets.CustomBusinessDay) | `'C'` | custom business day |
| [`Week`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Week.html#pandas.tseries.offsets.Week) | `'W'` | one week, optionally anchored on a day of the week |
| [`WeekOfMonth`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.WeekOfMonth.html#pandas.tseries.offsets.WeekOfMonth) | `'WOM'` | the x-th day of the y-th week of each month |
| [`LastWeekOfMonth`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.LastWeekOfMonth.html#pandas.tseries.offsets.LastWeekOfMonth) | `'LWOM'` | the x-th day of the last week of each month |
| [`MonthEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.MonthEnd.html#pandas.tseries.offsets.MonthEnd) | `'M'` | calendar month end |
| [`MonthBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.MonthBegin.html#pandas.tseries.offsets.MonthBegin) | `'MS'` | calendar month begin |
| [`BMonthEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BMonthEnd.html#pandas.tseries.offsets.BMonthEnd) or [`BusinessMonthEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BusinessMonthEnd.html#pandas.tseries.offsets.BusinessMonthEnd) | `'BM'` | business month end |
| [`BMonthBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BMonthBegin.html#pandas.tseries.offsets.BMonthBegin) or [`BusinessMonthBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BusinessMonthBegin.html#pandas.tseries.offsets.BusinessMonthBegin) | `'BMS'` | business month begin |
| [`CBMonthEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.CBMonthEnd.html#pandas.tseries.offsets.CBMonthEnd) or [`CustomBusinessMonthEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.CustomBusinessMonthEnd.html#pandas.tseries.offsets.CustomBusinessMonthEnd) | `'CBM'` | custom business month end |
| [`CBMonthBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.CBMonthBegin.html#pandas.tseries.offsets.CBMonthBegin) or [`CustomBusinessMonthBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.CustomBusinessMonthBegin.html#pandas.tseries.offsets.CustomBusinessMonthBegin) | `'CBMS'` | custom business month begin |
| [`SemiMonthEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.SemiMonthEnd.html#pandas.tseries.offsets.SemiMonthEnd) | `'SM'` | 15th (or other day of month) and calendar month end |
| [`SemiMonthBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.SemiMonthBegin.html#pandas.tseries.offsets.SemiMonthBegin) | `'SMS'` | calendar month begin and 15th (or other day of month) |
| [`QuarterEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.QuarterEnd.html#pandas.tseries.offsets.QuarterEnd) | `'Q'` | calendar quarter end |
| [`QuarterBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.QuarterBegin.html#pandas.tseries.offsets.QuarterBegin) | `'QS'` | calendar quarter begin |
| [`BQuarterEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BQuarterEnd.html#pandas.tseries.offsets.BQuarterEnd) | `'BQ'` | business quarter end |
| [`BQuarterBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BQuarterBegin.html#pandas.tseries.offsets.BQuarterBegin) | `'BQS'` | business quarter begin |
| [`FY5253Quarter`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.FY5253Quarter.html#pandas.tseries.offsets.FY5253Quarter) | `'REQ'` | retail (aka 52-53 week) quarter |
| [`YearEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.YearEnd.html#pandas.tseries.offsets.YearEnd) | `'A'` | calendar year end |
| [`YearBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.YearBegin.html#pandas.tseries.offsets.YearBegin) | `'AS'` or `'YS'` | calendar year begin |
| [`BYearEnd`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BYearEnd.html#pandas.tseries.offsets.BYearEnd) | `'BA'` | business year end |
| [`BYearBegin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BYearBegin.html#pandas.tseries.offsets.BYearBegin) | `'BAS'` | business year begin |
| [`FY5253`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.FY5253.html#pandas.tseries.offsets.FY5253) | `'RE'` | retail (aka 52-53 week) year |
| [`Easter`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Easter.html#pandas.tseries.offsets.Easter) | None | Easter holiday |
| [`BusinessHour`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BusinessHour.html#pandas.tseries.offsets.BusinessHour) | `'BH'` | business hour |
| [`CustomBusinessHour`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.CustomBusinessHour.html#pandas.tseries.offsets.CustomBusinessHour) | `'CBH'` | custom business hour |
| [`Day`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Day.html#pandas.tseries.offsets.Day) | `'D'` | one day |
| [`Hour`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Hour.html#pandas.tseries.offsets.Hour) | `'H'` | one hour |
| [`Minute`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Minute.html#pandas.tseries.offsets.Minute) | `'T'` or `'min'` | one minute |
| [`Second`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Second.html#pandas.tseries.offsets.Second) | `'S'` | one second |
| [`Milli`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Milli.html#pandas.tseries.offsets.Milli) | `'L'` or `'ms'` | one millisecond |
| [`Micro`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Micro.html#pandas.tseries.offsets.Micro) | `'U'` or `'us'` | one microsecond |
| [`Nano`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.Nano.html#pandas.tseries.offsets.Nano) | `'N'` | one nanosecond |
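`DateOffset` objects additionally have `rollforward()` and `rollback()` methods, which move a date forward or backward, respectively, to a valid offset date relative to the offset. For example, business offsets roll dates that land on a weekend (Saturday or Sunday) forward to Monday, because business offsets operate on weekdays.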
```python
In [150]: ts = pd.Timestamp('2018-01-06 00:00:00')
In [151]: ts.day_name()
Out[151]: 'Saturday'
# BusinessHour's valid offset dates are Monday through Friday
In [152]: offset = pd.offsets.BusinessHour(start='09:00')
# Roll forward to the closest valid offset date, i.e. Monday at opening time
In [153]: offset.rollforward(ts)
Out[153]: Timestamp('2018-01-08 09:00:00')
# The date is first brought to the closest offset date, then the hour is added
In [154]: ts + offset
Out[154]: Timestamp('2018-01-08 10:00:00')
```
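The same rolling behavior can be sketched with the plain business-day offset. This is an illustrative example with made-up variable names, not part of the original transcript.

```python
import pandas as pd

bday = pd.offsets.BDay()
saturday = pd.Timestamp('2018-01-06')
friday = pd.Timestamp('2018-01-05')

rolled_forward = bday.rollforward(saturday)  # next valid business day
rolled_back = bday.rollback(saturday)        # previous valid business day
unchanged = bday.rollforward(friday)         # already valid, so not moved

print(rolled_forward, rolled_back, unchanged)
```

Rolling only moves dates that are not already valid for the offset, which is what distinguishes it from adding the offset.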
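These operations preserve time (hour, minute, etc.) information by default. To reset the time to midnight, call `normalize()` before or after applying the operation, depending on whether the time information should take part in the operation.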
```python
In [155]: ts = pd.Timestamp('2014-01-01 09:00')
In [156]: day = pd.offsets.Day()
In [157]: day.apply(ts)
Out[157]: Timestamp('2014-01-02 09:00:00')
In [158]: day.apply(ts).normalize()
Out[158]: Timestamp('2014-01-02 00:00:00')
In [159]: ts = pd.Timestamp('2014-01-01 22:00')
In [160]: hour = pd.offsets.Hour()
In [161]: hour.apply(ts)
Out[161]: Timestamp('2014-01-01 23:00:00')
In [162]: hour.apply(ts).normalize()
Out[162]: Timestamp('2014-01-01 00:00:00')
In [163]: hour.apply(pd.Timestamp("2014-01-01 23:30")).normalize()
Out[163]: Timestamp('2014-01-02 00:00:00')
```
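### Parametric offsets
Some offsets accept parameters at creation time that change their behavior. For example, the `Week` offset used to generate weekly data accepts a `weekday` parameter, so that the generated dates always fall on a particular day of the week.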
```python
In [164]: d = datetime.datetime(2008, 8, 18, 9, 0)
In [165]: d
Out[165]: datetime.datetime(2008, 8, 18, 9, 0)
In [166]: d + pd.offsets.Week()
Out[166]: Timestamp('2008-08-25 09:00:00')
In [167]: d + pd.offsets.Week(weekday=4)
Out[167]: Timestamp('2008-08-22 09:00:00')
In [168]: (d + pd.offsets.Week(weekday=4)).weekday()
Out[168]: 4
In [169]: d - pd.offsets.Week()
Out[169]: Timestamp('2008-08-11 09:00:00')
```
The `normalize` option is effective for both addition and subtraction.
```python
In [170]: d + pd.offsets.Week(normalize=True)
Out[170]: Timestamp('2008-08-25 00:00:00')
In [171]: d - pd.offsets.Week(normalize=True)
Out[171]: Timestamp('2008-08-11 00:00:00')
```
`YearEnd` is another parameterized offset: its `month` parameter specifies the ending month.
```python
In [172]: d + pd.offsets.YearEnd()
Out[172]: Timestamp('2008-12-31 09:00:00')
In [173]: d + pd.offsets.YearEnd(month=6)
Out[173]: Timestamp('2009-06-30 09:00:00')
```
### Using offsets with `Series` / `DatetimeIndex`
Offsets can be used with either a `Series` or a `DatetimeIndex` to apply the offset to each element.
```python
In [174]: rng = pd.date_range('2012-01-01', '2012-01-03')
In [175]: s = pd.Series(rng)
In [176]: rng
Out[176]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='D')
In [177]: rng + pd.DateOffset(months=2)
Out[177]: DatetimeIndex(['2012-03-01', '2012-03-02', '2012-03-03'], dtype='datetime64[ns]', freq='D')
In [178]: s + pd.DateOffset(months=2)
Out[178]:
0 2012-03-01
1 2012-03-02
2 2012-03-03
dtype: datetime64[ns]
In [179]: s - pd.DateOffset(months=2)
Out[179]:
0 2011-11-01
1 2011-11-02
2 2011-11-03
dtype: datetime64[ns]
```
If the offset class maps directly to a `Timedelta` (`Day`, `Hour`, `Minute`, `Second`, `Micro`, `Milli`, `Nano`), it can be used exactly like a `Timedelta`. See the [Timedelta section](https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html#timedeltas-operations) for more examples.
```python
In [180]: s - pd.offsets.Day(2)
Out[180]:
0 2011-12-30
1 2011-12-31
2 2012-01-01
dtype: datetime64[ns]
In [181]: td = s - pd.Series(pd.date_range('2011-12-29', '2011-12-31'))
In [182]: td
Out[182]:
0 3 days
1 3 days
2 3 days
dtype: timedelta64[ns]
In [183]: td + pd.offsets.Minute(15)
Out[183]:
0 3 days 00:15:00
1 3 days 00:15:00
2 3 days 00:15:00
dtype: timedelta64[ns]
```
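Note that some offsets (such as `BQuarterEnd`) do not have a vectorized implementation. They can still be used, but the calculation may be significantly slower and will show a `PerformanceWarning`.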
```python
In [184]: rng + pd.offsets.BQuarterEnd()
Out[184]: DatetimeIndex(['2012-03-30', '2012-03-30', '2012-03-30'], dtype='datetime64[ns]', freq='D')
```
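### Custom business days
The `CDay` or `CustomBusinessDay` class provides a parametric `BusinessDay` class, which can be used to create customized business day calendars that account for local holidays and local weekend conventions.
As an interesting example, consider Egypt, where a Friday-Saturday weekend is observed.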
```python
In [185]: weekmask_egypt = 'Sun Mon Tue Wed Thu'
# International Workers' Day (May 1) for 2012 through 2014
In [186]: holidays = ['2012-05-01',
.....: datetime.datetime(2013, 5, 1),
.....: np.datetime64('2014-05-01')]
.....:
In [187]: bday_egypt = pd.offsets.CustomBusinessDay(holidays=holidays,
.....: weekmask=weekmask_egypt)
.....:
In [188]: dt = datetime.datetime(2013, 4, 30)
In [189]: dt + 2 * bday_egypt
Out[189]: Timestamp('2013-05-05 00:00:00')
```
The following maps the generated dates to weekday names.
```python
In [190]: dts = pd.date_range(dt, periods=5, freq=bday_egypt)
In [191]: pd.Series(dts.weekday, dts).map(
.....: pd.Series('Mon Tue Wed Thu Fri Sat Sun'.split()))
.....:
Out[191]:
2013-04-30 Tue
2013-05-02 Thu
2013-05-05 Sun
2013-05-06 Mon
2013-05-07 Tue
Freq: C, dtype: object
```
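As a further sketch, the same kind of offset can list the working days in a single week. The single-holiday list and the names `egypt_bday` and `working_days` are assumptions made for this illustration.

```python
import pandas as pd

# Sunday-to-Thursday working week with one holiday on 2013-05-01.
egypt_bday = pd.offsets.CustomBusinessDay(
    weekmask='Sun Mon Tue Wed Thu',
    holidays=['2013-05-01'],
)

# Sunday 2013-04-28 through Saturday 2013-05-04.
week = pd.date_range('2013-04-28', '2013-05-04', freq=egypt_bday)
working_days = [d.strftime('%Y-%m-%d') for d in week]
print(working_days)
```

The holiday (Wednesday) and the Friday-Saturday weekend are skipped, leaving four working days.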
Holiday calendars can be used to provide the list of holidays. See the [holiday calendar](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-holiday) section for more information.
```python
In [192]: from pandas.tseries.holiday import USFederalHolidayCalendar
In [193]: bday_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
# Friday before Martin Luther King Jr. Day
In [194]: dt = datetime.datetime(2014, 1, 17)
# Tuesday after Martin Luther King Jr. Day (Monday is skipped because it's a holiday)
In [195]: dt + bday_us
Out[195]: Timestamp('2014-01-21 00:00:00')
```
Monthly offsets that respect a given holiday calendar can be defined in the usual way.
```python
In [196]: bmth_us = pd.offsets.CustomBusinessMonthBegin(
.....: calendar=USFederalHolidayCalendar())
.....:
# Skip New Year's Day
In [197]: dt = datetime.datetime(2013, 12, 17)
In [198]: dt + bmth_us
Out[198]: Timestamp('2014-01-02 00:00:00')
# Define a date index with the custom offset
In [199]: pd.date_range(start='20100101', end='20120101', freq=bmth_us)
Out[199]:
DatetimeIndex(['2010-01-04', '2010-02-01', '2010-03-01', '2010-04-01',
'2010-05-03', '2010-06-01', '2010-07-01', '2010-08-02',
'2010-09-01', '2010-10-01', '2010-11-01', '2010-12-01',
'2011-01-03', '2011-02-01', '2011-03-01', '2011-04-01',
'2011-05-02', '2011-06-01', '2011-07-01', '2011-08-01',
'2011-09-01', '2011-10-03', '2011-11-01', '2011-12-01'],
dtype='datetime64[ns]', freq='CBMS')
```
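::: tip Note
The frequency string `'C'` indicates that a `CustomBusinessDay` `DateOffset` is used. Because `CustomBusinessDay` is a parameterized type, instances of `CustomBusinessDay` may differ, and this is not detectable from the `'C'` frequency string. The user therefore needs to ensure that the `'C'` frequency string is used consistently within the application.
:::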
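### Business hour
The `BusinessHour` class provides a business hour representation on top of `BusinessDay`, allowing specific start and end times to be used.
By default, `BusinessHour` uses 9:00 - 17:00 as business hours. Adding a `BusinessHour` increments a `Timestamp` by hourly frequency. If the target `Timestamp` is outside business hours, it is first moved to the next business hour and then incremented. If the result exceeds the end of the business hours, the remaining hours are added to the next business day.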
```python
In [200]: bh = pd.offsets.BusinessHour()
In [201]: bh
Out[201]: <BusinessHour: BH=09:00-17:00>
# 2014-08-01 is a Friday
In [202]: pd.Timestamp('2014-08-01 10:00').weekday()
Out[202]: 4
In [203]: pd.Timestamp('2014-08-01 10:00') + bh
Out[203]: Timestamp('2014-08-01 11:00:00')
# The example below is the same as: pd.Timestamp('2014-08-01 09:00') + bh
In [204]: pd.Timestamp('2014-08-01 08:00') + bh
Out[204]: Timestamp('2014-08-01 10:00:00')
# If the result lands on the end time, move to the next business day's start
In [205]: pd.Timestamp('2014-08-01 16:00') + bh
Out[205]: Timestamp('2014-08-04 09:00:00')
# The remainder is added to the next business day
In [206]: pd.Timestamp('2014-08-01 16:30') + bh
Out[206]: Timestamp('2014-08-04 09:30:00')
# Add 2 business hours
In [207]: pd.Timestamp('2014-08-01 10:00') + pd.offsets.BusinessHour(2)
Out[207]: Timestamp('2014-08-01 12:00:00')
# Subtract 3 business hours
In [208]: pd.Timestamp('2014-08-01 10:00') + pd.offsets.BusinessHour(-3)
Out[208]: Timestamp('2014-07-31 15:00:00')
```
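You can also specify the `start` and `end` times by keyword. Each argument must be a string in `hour:minute` format or a `datetime.time` instance. Specifying seconds, microseconds, or nanoseconds as part of the business hours raises a `ValueError`.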
```python
In [209]: bh = pd.offsets.BusinessHour(start='11:00', end=datetime.time(20, 0))
In [210]: bh
Out[210]: <BusinessHour: BH=11:00-20:00>
In [211]: pd.Timestamp('2014-08-01 13:00') + bh
Out[211]: Timestamp('2014-08-01 14:00:00')
In [212]: pd.Timestamp('2014-08-01 09:00') + bh
Out[212]: Timestamp('2014-08-01 12:00:00')
In [213]: pd.Timestamp('2014-08-01 18:00') + bh
Out[213]: Timestamp('2014-08-01 19:00:00')
```
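Passing a `start` time later than `end` represents overnight business hours. In this case, the business hours run past midnight and into the next day. Whether a business hour is valid depends on whether it started on a valid `BusinessDay`.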
```python
In [214]: bh = pd.offsets.BusinessHour(start='17:00', end='09:00')
In [215]: bh
Out[215]: <BusinessHour: BH=17:00-09:00>
In [216]: pd.Timestamp('2014-08-01 17:00') + bh
Out[216]: Timestamp('2014-08-01 18:00:00')
In [217]: pd.Timestamp('2014-08-01 23:00') + bh
Out[217]: Timestamp('2014-08-02 00:00:00')
# Although 2014-08-02 is a Saturday,
# it is valid because the business hours started on Friday 08-01
In [218]: pd.Timestamp('2014-08-02 04:00') + bh
Out[218]: Timestamp('2014-08-02 05:00:00')
# Although 2014-08-04 is a Monday,
# it is outside business hours because they would have started on Sunday 08-03
In [219]: pd.Timestamp('2014-08-04 04:00') + bh
Out[219]: Timestamp('2014-08-04 18:00:00')
```
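Applying `BusinessHour.rollforward` or `rollback` to a time outside business hours yields the start of the next business hours or the end of the previous day's business hours. Unlike other offsets, `BusinessHour.rollforward` may, by definition, output a different result from `apply`.
This is because one day's business hour end is equal to the next day's business hour start. For example, under the default business hours (9:00 - 17:00), pandas treats the gap between `2014-08-01 17:00` and `2014-08-04 09:00` as 0 minutes.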
```python
# Roll the timestamp back to the previous business day's end
In [220]: pd.offsets.BusinessHour().rollback(pd.Timestamp('2014-08-02 15:00'))
Out[220]: Timestamp('2014-08-01 17:00:00')
# Roll the timestamp forward to the next business day's start
In [221]: pd.offsets.BusinessHour().rollforward(pd.Timestamp('2014-08-02 15:00'))
Out[221]: Timestamp('2014-08-04 09:00:00')
# This is the same as BusinessHour().apply(pd.Timestamp('2014-08-01 17:00')),
# and the same as BusinessHour().apply(pd.Timestamp('2014-08-04 09:00'))
In [222]: pd.offsets.BusinessHour().apply(pd.Timestamp('2014-08-02 15:00'))
Out[222]: Timestamp('2014-08-04 10:00:00')
# BusinessDay results (for reference)
In [223]: pd.offsets.BusinessHour().rollforward(pd.Timestamp('2014-08-02'))
Out[223]: Timestamp('2014-08-04 09:00:00')
# This is the same as BusinessDay().apply(pd.Timestamp('2014-08-01'))
# The result is the same as rollforward because business days never overlap
In [224]: pd.offsets.BusinessHour().apply(pd.Timestamp('2014-08-02'))
Out[224]: Timestamp('2014-08-04 10:00:00')
```
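To illustrate how hours carry over a weekend, here is a small sketch using the default 9:00 - 17:00 hours. The timestamps and variable names are chosen for illustration only, and the example uses `+` rather than `apply`.

```python
import pandas as pd

bh = pd.offsets.BusinessHour()  # default business hours: 09:00-17:00

# Friday 15:00 plus 4 business hours: 2 hours fit on Friday,
# the remaining 2 hours continue on Monday from 09:00.
friday_3pm = pd.Timestamp('2014-08-01 15:00')
carried_over = friday_3pm + 4 * bh

# A Saturday afternoon rolls forward to Monday's opening time.
saturday = pd.Timestamp('2014-08-02 15:00')
next_open = bh.rollforward(saturday)

print(carried_over, next_open)
```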
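`BusinessHour` regards Saturday and Sunday as holidays. To use arbitrary holidays, use the `CustomBusinessHour` offset, as explained in the following subsection.
### Custom business hour
*New in version 0.18.1.*
`CustomBusinessHour` is a mixture of `BusinessHour` and `CustomBusinessDay` that allows arbitrary holidays to be specified. `CustomBusinessHour` works the same way as `BusinessHour`, except that it skips the specified custom holidays.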
```python
In [225]: from pandas.tseries.holiday import USFederalHolidayCalendar
In [226]: bhour_us = pd.offsets.CustomBusinessHour(calendar=USFederalHolidayCalendar())
# Friday before Martin Luther King Jr. Day
In [227]: dt = datetime.datetime(2014, 1, 17, 15)
In [228]: dt + bhour_us
Out[228]: Timestamp('2014-01-17 16:00:00')
# Tuesday after Martin Luther King Jr. Day (Monday is skipped because it's a holiday)
In [229]: dt + bhour_us * 2
Out[229]: Timestamp('2014-01-21 09:00:00')
```
`CustomBusinessHour` accepts the keyword arguments supported by both `BusinessHour` and `CustomBusinessDay`.
```python
In [230]: bhour_mon = pd.offsets.CustomBusinessHour(start='10:00',
.....: weekmask='Tue Wed Thu Fri')
.....:
# Monday is skipped because it's a holiday; business hours start at 10:00
In [231]: dt + bhour_mon * 2
Out[231]: Timestamp('2014-01-21 10:00:00')
```
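### Offset aliases
A number of string aliases are given to common time series frequencies. These aliases are referred to here as **offset aliases**.
| Alias | Description |
| :-------: | :------------------------- |
| B | business day frequency |
| C | custom business day frequency |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| SM | semi-month end frequency (15th and end of month) |
| BM | business month end frequency |
| CBM | custom business month end frequency |
| MS | month start frequency |
| SMS | semi-month start frequency (1st and 15th) |
| BMS | business month start frequency |
| CBMS | custom business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A, Y | year end frequency |
| BA, BY | business year end frequency |
| AS, YS | year start frequency |
| BAS, BYS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T, min | minutely frequency |
| S | secondly frequency |
| L, ms | milliseconds |
| U, us | microseconds |
| N | nanoseconds |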
### Combining aliases
As seen previously, the alias and the offset instance are interchangeable in most functions:
```python
In [232]: pd.date_range(start, periods=5, freq='B')
Out[232]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07'],
dtype='datetime64[ns]', freq='B')
In [233]: pd.date_range(start, periods=5, freq=pd.offsets.BDay())
Out[233]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07'],
dtype='datetime64[ns]', freq='B')
```
Day and intraday offsets can be combined.
```python
In [234]: pd.date_range(start, periods=10, freq='2h20min')
Out[234]:
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 02:20:00',
'2011-01-01 04:40:00', '2011-01-01 07:00:00',
'2011-01-01 09:20:00', '2011-01-01 11:40:00',
'2011-01-01 14:00:00', '2011-01-01 16:20:00',
'2011-01-01 18:40:00', '2011-01-01 21:00:00'],
dtype='datetime64[ns]', freq='140T')
In [235]: pd.date_range(start, periods=10, freq='1D10U')
Out[235]:
DatetimeIndex([ '2011-01-01 00:00:00', '2011-01-02 00:00:00.000010',
'2011-01-03 00:00:00.000020', '2011-01-04 00:00:00.000030',
'2011-01-05 00:00:00.000040', '2011-01-06 00:00:00.000050',
'2011-01-07 00:00:00.000060', '2011-01-08 00:00:00.000070',
'2011-01-09 00:00:00.000080', '2011-01-10 00:00:00.000090'],
dtype='datetime64[ns]', freq='86400000010U')
```
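A combined alias is simply the sum of its parts. The short sketch below checks the step size of the resulting index; the variable names are hypothetical.

```python
import pandas as pd

# '2h20min' combines two intraday aliases into a single 140-minute step.
combined = pd.date_range('2011-01-01', periods=3, freq='2h20min')
step = combined[1] - combined[0]

print(combined)
print(step)
```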
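### Anchored offsets
For some frequencies you can specify an anchoring suffix:
| Alias | Description |
| :----------: | :------------------------------------ |
| W-SUN | weekly frequency (Sundays), same as 'W' |
| W-MON | weekly frequency (Mondays) |
| W-TUE | weekly frequency (Tuesdays) |
| W-WED | weekly frequency (Wednesdays) |
| W-THU | weekly frequency (Thursdays) |
| W-FRI | weekly frequency (Fridays) |
| W-SAT | weekly frequency (Saturdays) |
| (B)Q(S)-DEC | quarterly frequency, year ends in December, same as 'Q' |
| (B)Q(S)-JAN | quarterly frequency, year ends in January |
| (B)Q(S)-FEB | quarterly frequency, year ends in February |
| (B)Q(S)-MAR | quarterly frequency, year ends in March |
| (B)Q(S)-APR | quarterly frequency, year ends in April |
| (B)Q(S)-MAY | quarterly frequency, year ends in May |
| (B)Q(S)-JUN | quarterly frequency, year ends in June |
| (B)Q(S)-JUL | quarterly frequency, year ends in July |
| (B)Q(S)-AUG | quarterly frequency, year ends in August |
| (B)Q(S)-SEP | quarterly frequency, year ends in September |
| (B)Q(S)-OCT | quarterly frequency, year ends in October |
| (B)Q(S)-NOV | quarterly frequency, year ends in November |
| (B)A(S)-DEC | annual frequency, anchored at the end of December, same as 'A' |
| (B)A(S)-JAN | annual frequency, anchored at the end of January |
| (B)A(S)-FEB | annual frequency, anchored at the end of February |
| (B)A(S)-MAR | annual frequency, anchored at the end of March |
| (B)A(S)-APR | annual frequency, anchored at the end of April |
| (B)A(S)-MAY | annual frequency, anchored at the end of May |
| (B)A(S)-JUN | annual frequency, anchored at the end of June |
| (B)A(S)-JUL | annual frequency, anchored at the end of July |
| (B)A(S)-AUG | annual frequency, anchored at the end of August |
| (B)A(S)-SEP | annual frequency, anchored at the end of September |
| (B)A(S)-OCT | annual frequency, anchored at the end of October |
| (B)A(S)-NOV | annual frequency, anchored at the end of November |
These aliases can be used as arguments to `date_range`, `bdate_range`, the `DatetimeIndex` constructor, and various other time series functions.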
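### Anchored offset semantics
For offsets that are anchored to the start or end of a specific frequency (`MonthEnd`, `MonthBegin`, `WeekEnd`, etc.), the following rules apply to rolling forward and backward.
When `n` is not 0 and the given date is not on an anchor point, the date is snapped to the next (or previous) anchor point and then moved `|n|-1` additional steps forward or backward.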
```python
In [236]: pd.Timestamp('2014-01-02') + pd.offsets.MonthBegin(n=1)
Out[236]: Timestamp('2014-02-01 00:00:00')
In [237]: pd.Timestamp('2014-01-02') + pd.offsets.MonthEnd(n=1)
Out[237]: Timestamp('2014-01-31 00:00:00')
In [238]: pd.Timestamp('2014-01-02') - pd.offsets.MonthBegin(n=1)
Out[238]: Timestamp('2014-01-01 00:00:00')
In [239]: pd.Timestamp('2014-01-02') - pd.offsets.MonthEnd(n=1)
Out[239]: Timestamp('2013-12-31 00:00:00')
In [240]: pd.Timestamp('2014-01-02') + pd.offsets.MonthBegin(n=4)
Out[240]: Timestamp('2014-05-01 00:00:00')
In [241]: pd.Timestamp('2014-01-02') - pd.offsets.MonthBegin(n=4)
Out[241]: Timestamp('2013-10-01 00:00:00')
```
If the given date is on an anchor point, it is moved `|n|` points forward or backward.
```python
In [242]: pd.Timestamp('2014-01-01') + pd.offsets.MonthBegin(n=1)
Out[242]: Timestamp('2014-02-01 00:00:00')
In [243]: pd.Timestamp('2014-01-31') + pd.offsets.MonthEnd(n=1)
Out[243]: Timestamp('2014-02-28 00:00:00')
In [244]: pd.Timestamp('2014-01-01') - pd.offsets.MonthBegin(n=1)
Out[244]: Timestamp('2013-12-01 00:00:00')
In [245]: pd.Timestamp('2014-01-31') - pd.offsets.MonthEnd(n=1)
Out[245]: Timestamp('2013-12-31 00:00:00')
In [246]: pd.Timestamp('2014-01-01') + pd.offsets.MonthBegin(n=4)
Out[246]: Timestamp('2014-05-01 00:00:00')
In [247]: pd.Timestamp('2014-01-31') - pd.offsets.MonthBegin(n=4)
Out[247]: Timestamp('2013-10-01 00:00:00')
```
When `n=0`, the date is not moved if it is on an anchor point; otherwise it is rolled forward to the next anchor point.
```python
In [248]: pd.Timestamp('2014-01-02') + pd.offsets.MonthBegin(n=0)
Out[248]: Timestamp('2014-02-01 00:00:00')
In [249]: pd.Timestamp('2014-01-02') + pd.offsets.MonthEnd(n=0)
Out[249]: Timestamp('2014-01-31 00:00:00')
In [250]: pd.Timestamp('2014-01-01') + pd.offsets.MonthBegin(n=0)
Out[250]: Timestamp('2014-01-01 00:00:00')
In [251]: pd.Timestamp('2014-01-31') + pd.offsets.MonthEnd(n=0)
Out[251]: Timestamp('2014-01-31 00:00:00')
```
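The three rules can be condensed into one small sketch built around `MonthEnd`. The dates are sample values chosen for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

mid_month = pd.Timestamp('2014-02-10')  # not on an anchor point
month_end = pd.Timestamp('2014-02-28')  # on an anchor point

# n != 0, off-anchor: snap to the next anchor, then |n| - 1 = 0 extra steps.
off_anchor_plus_one = mid_month + MonthEnd(1)   # 2014-02-28

# n != 0, on-anchor: move |n| = 1 full step.
on_anchor_plus_one = month_end + MonthEnd(1)    # 2014-03-31

# n == 0: roll forward only if not already on an anchor.
off_anchor_zero = mid_month + MonthEnd(0)       # 2014-02-28
on_anchor_zero = month_end + MonthEnd(0)        # 2014-02-28

# Subtraction snaps to the previous anchor.
off_anchor_minus_one = mid_month - MonthEnd(1)  # 2014-01-31
```

Adding `MonthEnd(0)` is therefore a convenient way to snap a date to the end of its own month without overshooting.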
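### Holidays / holiday calendars
Holidays and calendars provide a simple way to define holiday rules for use with `CustomBusinessDay`, or in other analysis that requires a predefined set of holidays. The `AbstractHolidayCalendar` class provides all the methods needed to return a list of holidays, so only `rules` need to be defined in a specific holiday calendar class. The `start_date` and `end_date` class attributes determine the date range over which holidays are generated. Overwrite them on the `AbstractHolidayCalendar` class to have the range apply to all calendar subclasses. `USFederalHolidayCalendar` is the only calendar that exists, and it primarily serves as an example for developing other calendars.
For holidays that occur on fixed dates (e.g., US Memorial Day or July 4th), an observance rule determines when the holiday is observed if it falls on a weekend or some other non-observed day. The defined observance rules are:
| Rule | Description |
| :---------------------: | :------------------------------------: |
| nearest_workday | move Saturday to Friday and Sunday to Monday |
| sunday_to_monday | move Sunday to the following Monday |
| next_monday_or_tuesday | move Saturday to Monday and Sunday/Monday to Tuesday |
| previous_friday | move Saturday and Sunday to the previous Friday |
| next_monday | move Saturday and Sunday to the following Monday |
The following example shows how holidays and holiday calendars are defined: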
```python
In [252]: from pandas.tseries.holiday import Holiday, USMemorialDay,\
.....: AbstractHolidayCalendar, nearest_workday, MO
.....:
In [253]: class ExampleCalendar(AbstractHolidayCalendar):
.....: rules = [
.....: USMemorialDay,
.....: Holiday('July 4th', month=7, day=4, observance=nearest_workday),
.....: Holiday('Columbus Day', month=10, day=1,
.....: offset=pd.DateOffset(weekday=MO(2)))]
.....:
In [254]: cal = ExampleCalendar()
In [255]: cal.holidays(datetime.datetime(2012, 1, 1), datetime.datetime(2012, 12, 31))
Out[255]: DatetimeIndex(['2012-05-28', '2012-07-04', '2012-10-08'], dtype='datetime64[ns]', freq=None)
```
::: tip Hint
`weekday=MO(2)` is the same as `2 * Week(weekday=2)`.
:::
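To see an observance rule at work, the sketch below defines a hypothetical one-rule calendar (`SketchCalendar` is an invented name, not a pandas class) and looks at two years in which July 4th falls on different weekdays.

```python
import pandas as pd
from pandas.tseries.holiday import (
    AbstractHolidayCalendar,
    Holiday,
    nearest_workday,
)


class SketchCalendar(AbstractHolidayCalendar):
    # Hypothetical calendar with a single fixed-date holiday.
    rules = [
        Holiday('July 4th', month=7, day=4, observance=nearest_workday),
    ]


# 2015-07-04 is a Saturday, so it is observed on Friday 2015-07-03.
# 2016-07-04 is a Monday, so it is observed on the date itself.
observed = SketchCalendar().holidays(start='2015-01-01', end='2016-12-31')
print(observed)
```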
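Using this calendar, creating an index or doing offset arithmetic skips weekends and holidays (i.e., Memorial Day and July 4th). For example, the code below defines a custom business day offset using `ExampleCalendar`. Like any other offset, it can be used to create a `DatetimeIndex` or be added to `datetime` or `Timestamp` objects.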
```python
In [256]: pd.date_range(start='7/1/2012', end='7/10/2012',
.....: freq=pd.offsets.CDay(calendar=cal)).to_pydatetime()
.....:
Out[256]:
array([datetime.datetime(2012, 7, 2, 0, 0),
datetime.datetime(2012, 7, 3, 0, 0),
datetime.datetime(2012, 7, 5, 0, 0),
datetime.datetime(2012, 7, 6, 0, 0),
datetime.datetime(2012, 7, 9, 0, 0),
datetime.datetime(2012, 7, 10, 0, 0)], dtype=object)
In [257]: offset = pd.offsets.CustomBusinessDay(calendar=cal)
In [258]: datetime.datetime(2012, 5, 25) + offset
Out[258]: Timestamp('2012-05-29 00:00:00')
In [259]: datetime.datetime(2012, 7, 3) + offset
Out[259]: Timestamp('2012-07-05 00:00:00')
In [260]: datetime.datetime(2012, 7, 3) + 2 * offset
Out[260]: Timestamp('2012-07-06 00:00:00')
In [261]: datetime.datetime(2012, 7, 6) + offset
Out[261]: Timestamp('2012-07-09 00:00:00')
```
The date range is defined by the `start_date` and `end_date` class attributes of `AbstractHolidayCalendar`. The defaults are shown below.
```python
In [262]: AbstractHolidayCalendar.start_date
Out[262]: Timestamp('1970-01-01 00:00:00')
In [263]: AbstractHolidayCalendar.end_date
Out[263]: Timestamp('2030-12-31 00:00:00')
```
These dates can be overwritten by setting the attributes to a `datetime`, `Timestamp`, or string.
```python
In [264]: AbstractHolidayCalendar.start_date = datetime.datetime(2012, 1, 1)
In [265]: AbstractHolidayCalendar.end_date = datetime.datetime(2012, 12, 31)
In [266]: cal.holidays()
Out[266]: DatetimeIndex(['2012-05-28', '2012-07-04', '2012-10-08'], dtype='datetime64[ns]', freq=None)
```
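Every calendar class is accessible by name through the `get_calendar` function, which returns a holiday calendar instance. Any imported calendar class is automatically available through this function. In addition, `HolidayCalendarFactory` provides an easy interface for creating calendars that combine other calendars or add extra rules.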
```python
In [267]: from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory,\
.....: USLaborDay
.....:
In [268]: cal = get_calendar('ExampleCalendar')
In [269]: cal.rules
Out[269]:
[Holiday: Memorial Day (month=5, day=31, offset=<DateOffset: weekday=MO(-1)>),
Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at 0x7f2460862c20>),
Holiday: Columbus Day (month=10, day=1, offset=<DateOffset: weekday=MO(+2)>)]
In [270]: new_cal = HolidayCalendarFactory('NewExampleCalendar', cal, USLaborDay)
In [271]: new_cal.rules
Out[271]:
[Holiday: Labor Day (month=9, day=1, offset=<DateOffset: weekday=MO(+1)>),
Holiday: Memorial Day (month=5, day=31, offset=<DateOffset: weekday=MO(-1)>),
Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at 0x7f2460862c20>),
Holiday: Columbus Day (month=10, day=1, offset=<DateOffset: weekday=MO(+2)>)]
```
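## Time series-related instance methods
### Shifting / lagging
You may want to shift or lag the values in a time series backward and forward in time. The method for this is [`shift()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.shift.html#pandas.Series.shift "pandas.Series.shift"), which is available on all pandas objects.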
```python
In [272]: ts = pd.Series(range(len(rng)), index=rng)
In [273]: ts = ts[:5]
In [274]: ts.shift(1)
Out[274]:
2012-01-01 NaN
2012-01-02 0.0
2012-01-03 1.0
Freq: D, dtype: float64
```
The `shift` method accepts a `freq` argument, which can be a `DateOffset` class, another `timedelta`-like object, or an [offset alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases):
```python
In [275]: ts.shift(5, freq=pd.offsets.BDay())
Out[275]:
2012-01-06 0
2012-01-09 1
2012-01-10 2
Freq: B, dtype: int64
In [276]: ts.shift(5, freq='BM')
Out[276]:
2012-05-31 0
2012-05-31 1
2012-05-31 2
Freq: D, dtype: int64
```
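Rather than changing the alignment of the data and the index, `DataFrame` and `Series` objects also have a [`tshift()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tshift.html#pandas.Series.tshift "pandas.Series.tshift") convenience method that changes all the dates in the index by a specified number of offsets: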
```python
In [277]: ts.tshift(5, freq='D')
Out[277]:
2012-01-06 0
2012-01-07 1
2012-01-08 2
Freq: D, dtype: int64
```
Note that with `tshift()`, the leading entry is no longer `NaN`, because the data is not realigned.
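The two kinds of shifting can be compared side by side. The sketch below uses `shift(freq=...)` to move the index, which gives the index-moving behavior described for `tshift()`; recent pandas releases have deprecated and then removed `tshift`, so this form is assumed here to be the portable one. The variable names are illustrative.

```python
import pandas as pd

idx = pd.date_range('2012-01-01', periods=3, freq='D')
ts = pd.Series([0, 1, 2], index=idx)

# Without freq: values move, the index stays, and a NaN appears at the front.
lagged = ts.shift(1)

# With freq: the index moves by one day and the values stay attached to it.
moved = ts.shift(1, freq='D')

print(lagged)
print(moved)
```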
### Frequency conversion
The primary function for changing frequencies is the [`asfreq()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.asfreq.html#pandas.Series.asfreq "pandas.Series.asfreq") method. For a `DatetimeIndex`, this is basically a thin but convenient wrapper around [`reindex()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html#pandas.Series.reindex "pandas.Series.reindex") that generates a `date_range` and calls `reindex`.
```python
In [278]: dr = pd.date_range('1/1/2010', periods=3, freq=3 * pd.offsets.BDay())
In [279]: ts = pd.Series(np.random.randn(3), index=dr)
In [280]: ts
Out[280]:
2010-01-01 1.494522
2010-01-06 -0.778425
2010-01-11 -0.253355
Freq: 3B, dtype: float64
In [281]: ts.asfreq(pd.offsets.BDay())
Out[281]:
2010-01-01 1.494522
2010-01-04 NaN
2010-01-05 NaN
2010-01-06 -0.778425
2010-01-07 NaN
2010-01-08 NaN
2010-01-11 -0.253355
Freq: B, dtype: float64
```
`asfreq` provides a further convenience: you can specify an interpolation method for any gaps that appear after the frequency conversion.
```python
In [282]: ts.asfreq(pd.offsets.BDay(), method='pad')
Out[282]:
2010-01-01 1.494522
2010-01-04 1.494522
2010-01-05 1.494522
2010-01-06 -0.778425
2010-01-07 -0.778425
2010-01-08 -0.778425
2010-01-11 -0.253355
Freq: B, dtype: float64
```
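The relationship between `asfreq` and `reindex` can be sketched with fixed values instead of random ones. The names `sparse`, `filled`, and `manual` are made up for this illustration.

```python
import pandas as pd

bday = pd.offsets.BDay()
dr = pd.date_range('2010-01-01', periods=3, freq=3 * bday)
ts = pd.Series([1.0, 2.0, 3.0], index=dr)

# Convert to every business day; new dates are filled with NaN.
sparse = ts.asfreq(bday)

# Same conversion, but forward-fill the gaps.
filled = ts.asfreq(bday, method='ffill')

# Roughly what asfreq does: build the target range, then reindex onto it.
manual = ts.reindex(pd.date_range(dr[0], dr[-1], freq=bday))

print(sparse)
print(filled)
```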
### Filling forward / backward
Related to `asfreq` and `reindex` is [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html#pandas.Series.fillna "pandas.Series.fillna"), which is documented in the [missing data section](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-fillna).
### Converting to Python datetimes
A `DatetimeIndex` can be converted to an array of Python native [`datetime.datetime`](https://docs.python.org/3/library/datetime.html#datetime.datetime "(in Python v3.7)") objects using the `to_pydatetime` method.
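A minimal sketch of the conversion, using an assumed two-day index:

```python
import datetime

import pandas as pd

idx = pd.date_range('2012-01-01', periods=2, freq='D')

# Returns a NumPy object array of plain datetime.datetime instances.
native = idx.to_pydatetime()

print(native)
```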
## Resampling
::: danger Warning
The interface to `.resample` changed in 0.18.0 to be more groupby-like and hence more flexible. See the [whatsnew docs](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.18.0.html#whatsnew-0180-breaking-resample) for a comparison with prior versions.
:::
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
[`resample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html#pandas.Series.resample "pandas.Series.resample") is a time-based groupby, followed by a reduction method on each of its groups. See the [cookbook examples](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-resample) for some advanced strategies.
Starting in version 0.18.0, `resample()` can be used directly on `DataFrameGroupBy` objects; see the [groupby docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby-transform-window-resample).
::: tip Note
`.resample()` is similar to using a [`rolling()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html#pandas.Series.rolling "pandas.Series.rolling") operation with a time-based offset; see the discussion [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#stats-moments-ts-versus-resampling).
:::
### Basics
```python
In [283]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [284]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [285]: ts.resample('5Min').sum()
Out[285]:
2012-01-01 25103
Freq: 5T, dtype: int64
```
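The `resample` function is very flexible and lets you specify many different parameters to control the frequency conversion and resampling operation.
Any function available via [dispatching](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby-dispatch) is available as a method of the returned object, including `sum`, `mean`, `std`, `sem`, `max`, `min`, `median`, `first`, `last`, and `ohlc`: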
```python
In [286]: ts.resample('5Min').mean()
Out[286]:
2012-01-01 251.03
Freq: 5T, dtype: float64
In [287]: ts.resample('5Min').ohlc()
Out[287]:
open high low close
2012-01-01 308 460 9 205
In [288]: ts.resample('5Min').max()
Out[288]:
2012-01-01 460
Freq: 5T, dtype: int64
```
For downsampling, `closed` can be set to `left` or `right` to specify which end of the interval is closed:
```python
In [289]: ts.resample('5Min', closed='right').mean()
Out[289]:
2011-12-31 23:55:00 308.000000
2012-01-01 00:00:00 250.454545
Freq: 5T, dtype: float64
In [290]: ts.resample('5Min', closed='left').mean()
Out[290]:
2012-01-01 251.03
Freq: 5T, dtype: float64
```
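Parameters like `label` and `loffset` are used to manipulate the resulting labels. `label` specifies whether the result is labeled with the beginning or the end of the interval. `loffset` performs a time adjustment on the output labels.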
```python
In [291]: ts.resample('5Min').mean()  # by default label='left'
Out[291]:
2012-01-01 251.03
Freq: 5T, dtype: float64
In [292]: ts.resample('5Min', label='left').mean()
Out[292]:
2012-01-01 251.03
Freq: 5T, dtype: float64
In [293]: ts.resample('5Min', label='left', loffset='1s').mean()
Out[293]:
2012-01-01 00:00:01 251.03
dtype: float64
```
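Later pandas releases no longer accept `loffset` in `resample`. Assuming such a release, a rough equivalent is to shift the index of the result after aggregating. The sketch below uses deterministic data (`np.arange`) instead of the random series above so the numbers are reproducible:

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2012-01-01", periods=100, freq="s")
ts = pd.Series(np.arange(100), index=rng)

# Aggregate first, then move every label forward by one second.
res = ts.resample("5min").mean()
res.index = res.index + pd.Timedelta("1s")

# label='right' stamps each bin with the end of its interval instead.
right_labelled = ts.resample("5min", label="right").mean()

print(res)
print(right_labelled)
```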
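::: danger Warning
The default value of `label` and `closed` is `left` for all frequency offsets except `M`, `A`, `Q`, `BM`, `BA`, `BQ`, and `W`, which all default to `right`.
This can unintentionally pull a value for a later time back to an earlier time, as in the following example with the [`BusinessDay`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BusinessDay.html#pandas.tseries.offsets.BusinessDay "pandas.tseries.offsets.BusinessDay") frequency: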
```python
In [294]: s = pd.date_range('2000-01-01', '2000-01-05').to_series()
In [295]: s.iloc[2] = pd.NaT
In [296]: s.dt.weekday_name
Out[296]:
2000-01-01 Saturday
2000-01-02 Sunday
2000-01-03 NaN
2000-01-04 Tuesday
2000-01-05 Wednesday
Freq: D, dtype: object
# default: label='left', closed='left'
In [297]: s.resample('B').last().dt.weekday_name
Out[297]:
1999-12-31 Sunday
2000-01-03 NaN
2000-01-04 Tuesday
2000-01-05 Wednesday
Freq: B, dtype: object
```
Notice how the value for Sunday got pulled back to the previous Friday. To push the value for Sunday to Monday instead, use:
```python
In [298]: s.resample('B', label='right', closed='right').last().dt.weekday_name
Out[298]:
2000-01-03 Sunday
2000-01-04 Tuesday
2000-01-05 Wednesday
Freq: B, dtype: object
```
:::
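`Series.dt.weekday_name` is not available in later pandas releases, where `Series.dt.day_name()` plays the same role. Assuming such a release, the business-day example in the warning can be sketched as:

```python
import pandas as pd

s = pd.date_range("2000-01-01", "2000-01-05").to_series()
s.iloc[2] = pd.NaT  # drop the Monday observation

# Default label='left', closed='left': the weekend lands on the prior Friday.
default_names = s.resample("B").last().dt.day_name()

# label='right', closed='right': the weekend is pushed to the next Monday.
right_names = s.resample("B", label="right", closed="right").last().dt.day_name()

print(default_names)
print(right_names)
```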
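The `axis` parameter can be set to `0` or `1` and selects the axis of a `DataFrame` to resample.
`kind` can be set to `timestamp` or `period` to convert the resulting index to a timestamp or time span representation. By default, `resample` retains the input representation.
`convention` can be set to `start` or `end` when resampling period data (details below). It specifies how low-frequency periods are converted to higher-frequency periods.
### Upsampling
For upsampling, you can specify how to upsample and use the `limit` parameter to bound how far the created gaps are filled: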
```python
# from secondly to every 250 milliseconds
In [299]: ts[:2].resample('250L').asfreq()
Out[299]:
2012-01-01 00:00:00.000 308.0
2012-01-01 00:00:00.250 NaN
2012-01-01 00:00:00.500 NaN
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 204.0
Freq: 250L, dtype: float64
In [300]: ts[:2].resample('250L').ffill()
Out[300]:
2012-01-01 00:00:00.000 308
2012-01-01 00:00:00.250 308
2012-01-01 00:00:00.500 308
2012-01-01 00:00:00.750 308
2012-01-01 00:00:01.000 204
Freq: 250L, dtype: int64
In [301]: ts[:2].resample('250L').ffill(limit=2)
Out[301]:
2012-01-01 00:00:00.000 308.0
2012-01-01 00:00:00.250 308.0
2012-01-01 00:00:00.500 308.0
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 204.0
Freq: 250L, dtype: float64
```
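### Sparse resampling
Sparse time series are those with far fewer points than the span of time you want to resample. Naively upsampling a sparse series can generate a large number of intermediate values. When you do not use a method to fill these values (that is, `fill_method` is `None`), the intermediate values are filled with `NaN`.
Since `resample` is a time-based groupby, the following method efficiently resamples only the groups that are not all `NaN`.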
```python
In [302]: rng = pd.date_range('2014-1-1', periods=100, freq='D') + pd.Timedelta('1s')
In [303]: ts = pd.Series(range(100), index=rng)
```
If we resample over the full range of the series:
```python
In [304]: ts.resample('3T').sum()
Out[304]:
2014-01-01 00:00:00 0
2014-01-01 00:03:00 0
2014-01-01 00:06:00 0
2014-01-01 00:09:00 0
2014-01-01 00:12:00 0
..
2014-04-09 23:48:00 0
2014-04-09 23:51:00 0
2014-04-09 23:54:00 0
2014-04-09 23:57:00 0
2014-04-10 00:00:00 99
Freq: 3T, Length: 47521, dtype: int64
```
We can instead resample only those groups that contain points:
```python
In [305]: from functools import partial
In [306]: from pandas.tseries.frequencies import to_offset
In [307]: def round(t, freq):
.....: freq = to_offset(freq)
.....: return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)
.....:
In [308]: ts.groupby(partial(round, freq='3T')).sum()
Out[308]:
2014-01-01 0
2014-01-02 1
2014-01-03 2
2014-01-04 3
2014-01-05 4
..
2014-04-06 95
2014-04-07 96
2014-04-08 97
2014-04-09 98
2014-04-10 99
Length: 100, dtype: int64
```
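The same grouping can also be sketched without a custom rounding helper by flooring the index to the target frequency with `DatetimeIndex.floor` and grouping on the result. This is an alternative formulation rather than the approach shown above, but it produces the same 100 non-empty groups:

```python
import pandas as pd

rng = pd.date_range("2014-01-01", periods=100, freq="D") + pd.Timedelta("1s")
ts = pd.Series(range(100), index=rng)

# Floor each timestamp to its 3-minute bin and group on the bin label,
# so only bins that actually contain data are materialized.
sparse = ts.groupby(ts.index.floor("3min")).sum()

print(len(sparse))
print(sparse.head())
```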
### Aggregation
Similar to the [aggregating API](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-aggregate), the [groupby API](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby-aggregate), and the [window functions API](https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#stats-aggregate), a `Resampler` can be selectively resampled.
When resampling a `DataFrame`, the default is to act on all columns with the same function.
```python
In [309]: df = pd.DataFrame(np.random.randn(1000, 3),
.....: index=pd.date_range('1/1/2012', freq='S', periods=1000),
.....: columns=['A', 'B', 'C'])
.....:
In [310]: r = df.resample('3T')
In [311]: r.mean()
Out[311]:
A B C
2012-01-01 00:00:00 -0.033823 -0.121514 -0.081447
2012-01-01 00:03:00 0.056909 0.146731 -0.024320
2012-01-01 00:06:00 -0.058837 0.047046 -0.052021
2012-01-01 00:09:00 0.063123 -0.026158 -0.066533
2012-01-01 00:12:00 0.186340 -0.003144 0.074752
2012-01-01 00:15:00 -0.085954 -0.016287 -0.050046
```
We can select a specific column or columns using standard `getitem`.
```python
In [312]: r['A'].mean()
Out[312]:
2012-01-01 00:00:00 -0.033823
2012-01-01 00:03:00 0.056909
2012-01-01 00:06:00 -0.058837
2012-01-01 00:09:00 0.063123
2012-01-01 00:12:00 0.186340
2012-01-01 00:15:00 -0.085954
Freq: 3T, Name: A, dtype: float64
In [313]: r[['A', 'B']].mean()
Out[313]:
A B
2012-01-01 00:00:00 -0.033823 -0.121514
2012-01-01 00:03:00 0.056909 0.146731
2012-01-01 00:06:00 -0.058837 0.047046
2012-01-01 00:09:00 0.063123 -0.026158
2012-01-01 00:12:00 0.186340 -0.003144
2012-01-01 00:15:00 -0.085954 -0.016287
```
You can pass a list or dict of functions to aggregate with, which outputs a `DataFrame`:
```python
In [314]: r['A'].agg([np.sum, np.mean, np.std])
Out[314]:
sum mean std
2012-01-01 00:00:00 -6.088060 -0.033823 1.043263
2012-01-01 00:03:00 10.243678 0.056909 1.058534
2012-01-01 00:06:00 -10.590584 -0.058837 0.949264
2012-01-01 00:09:00 11.362228 0.063123 1.028096
2012-01-01 00:12:00 33.541257 0.186340 0.884586
2012-01-01 00:15:00 -8.595393 -0.085954 1.035476
```
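On a resampled `DataFrame`, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index: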
```python
In [315]: r.agg([np.sum, np.mean])
Out[315]:
A B C
sum mean sum mean sum mean
2012-01-01 00:00:00 -6.088060 -0.033823 -21.872530 -0.121514 -14.660515 -0.081447
2012-01-01 00:03:00 10.243678 0.056909 26.411633 0.146731 -4.377642 -0.024320
2012-01-01 00:06:00 -10.590584 -0.058837 8.468289 0.047046 -9.363825 -0.052021
2012-01-01 00:09:00 11.362228 0.063123 -4.708526 -0.026158 -11.975895 -0.066533
2012-01-01 00:12:00 33.541257 0.186340 -0.565895 -0.003144 13.455299 0.074752
2012-01-01 00:15:00 -8.595393 -0.085954 -1.628689 -0.016287 -5.004580 -0.050046
```
By passing a dict to `aggregate`, you can apply a different aggregation to each column of a `DataFrame`:
```python
In [316]: r.agg({'A': np.sum,
.....: 'B': lambda x: np.std(x, ddof=1)})
.....:
Out[316]:
A B
2012-01-01 00:00:00 -6.088060 1.001294
2012-01-01 00:03:00 10.243678 1.074597
2012-01-01 00:06:00 -10.590584 0.987309
2012-01-01 00:09:00 11.362228 0.944953
2012-01-01 00:12:00 33.541257 1.095025
2012-01-01 00:15:00 -8.595393 1.035312
```
The function names can also be strings. For a string to be valid, it must be implemented on the resampled object:
```python
In [317]: r.agg({'A': 'sum', 'B': 'std'})
Out[317]:
A B
2012-01-01 00:00:00 -6.088060 1.001294
2012-01-01 00:03:00 10.243678 1.074597
2012-01-01 00:06:00 -10.590584 0.987309
2012-01-01 00:09:00 11.362228 0.944953
2012-01-01 00:12:00 33.541257 1.095025
2012-01-01 00:15:00 -8.595393 1.035312
```
Furthermore, you can specify multiple aggregation functions for each column separately.
```python
In [318]: r.agg({'A': ['sum', 'std'], 'B': ['mean', 'std']})
Out[318]:
A B
sum std mean std
2012-01-01 00:00:00 -6.088060 1.043263 -0.121514 1.001294
2012-01-01 00:03:00 10.243678 1.058534 0.146731 1.074597
2012-01-01 00:06:00 -10.590584 0.949264 0.047046 0.987309
2012-01-01 00:09:00 11.362228 1.028096 -0.026158 0.944953
2012-01-01 00:12:00 33.541257 0.884586 -0.003144 1.095025
2012-01-01 00:15:00 -8.595393 1.035476 -0.016287 1.035312
```
If a `DataFrame` does not have a datetime-like index and you want to resample based on a datetime-like column instead, pass that column to the `on` keyword.
```python
In [319]: df = pd.DataFrame({'date': pd.date_range('2015-01-01', freq='W', periods=5),
.....: 'a': np.arange(5)},
.....: index=pd.MultiIndex.from_arrays([
.....: [1, 2, 3, 4, 5],
.....: pd.date_range('2015-01-01', freq='W', periods=5)],
.....: names=['v', 'd']))
.....:
In [320]: df
Out[320]:
date a
v d
1 2015-01-04 2015-01-04 0
2 2015-01-11 2015-01-11 1
3 2015-01-18 2015-01-18 2
4 2015-01-25 2015-01-25 3
5 2015-02-01 2015-02-01 4
In [321]: df.resample('M', on='date').sum()
Out[321]:
a
date
2015-01-31 6
2015-02-28 4
```
Similarly, to resample by a datetime-like level of a `MultiIndex`, pass its name or position to the `level` keyword.
```python
In [322]: df.resample('M', level='d').sum()
Out[322]:
a
d
2015-01-31 6
2015-02-28 4
```
### Iterating through groups
With the `Resampler` object in hand, iterating through the grouped data is very natural and works much like [`itertools.groupby()`](https://docs.python.org/3/library/itertools.html#itertools.groupby "(in Python v3.7)"):
```python
In [323]: small = pd.Series(
.....: range(6),
.....: index=pd.to_datetime(['2017-01-01T00:00:00',
.....: '2017-01-01T00:30:00',
.....: '2017-01-01T00:31:00',
.....: '2017-01-01T01:00:00',
.....: '2017-01-01T03:00:00',
.....: '2017-01-01T03:05:00'])
.....: )
.....:
In [324]: resampled = small.resample('H')
In [325]: for name, group in resampled:
.....: print("Group: ", name)
.....: print("-" * 27)
.....: print(group, end="\n\n")
.....:
Group: 2017-01-01 00:00:00
---------------------------
2017-01-01 00:00:00 0
2017-01-01 00:30:00 1
2017-01-01 00:31:00 2
dtype: int64
Group: 2017-01-01 01:00:00
---------------------------
2017-01-01 01:00:00 3
dtype: int64
Group: 2017-01-01 02:00:00
---------------------------
Series([], dtype: int64)
Group: 2017-01-01 03:00:00
---------------------------
2017-01-01 03:00:00 4
2017-01-01 03:05:00 5
dtype: int64
```
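See [Iterating through groups](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby-iterating-label) or [`itertools.groupby()`](https://docs.python.org/3/library/itertools.html#itertools.groupby "(in Python v3.7)") for more.
## Time span representation
Regular intervals of time are represented by `Period` objects in pandas, while sequences of `Period` objects are collected in a `PeriodIndex`, which can be created with the convenience function `period_range`.
### Period
A `Period` represents a span of time (e.g., a year, a quarter, a month, a day). You can specify the span via the `freq` keyword using a frequency alias, as shown below. Because `freq` represents the span of the `Period`, it cannot be negative, such as `-3D`.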
```python
In [326]: pd.Period('2012', freq='A-DEC')
Out[326]: Period('2012', 'A-DEC')
In [327]: pd.Period('2012-1-1', freq='D')
Out[327]: Period('2012-01-01', 'D')
In [328]: pd.Period('2012-1-1 19:00', freq='H')
Out[328]: Period('2012-01-01 19:00', 'H')
In [329]: pd.Period('2012-1-1 19:00', freq='5H')
Out[329]: Period('2012-01-01 19:00', '5H')
```
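Adding and subtracting integers from periods shifts the period by its own frequency. Arithmetic is not allowed between `Period` objects with different `freq` (span).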
```python
In [330]: p = pd.Period('2012', freq='A-DEC')
In [331]: p + 1
Out[331]: Period('2013', 'A-DEC')
In [332]: p - 3
Out[332]: Period('2009', 'A-DEC')
In [333]: p = pd.Period('2012-01', freq='2M')
In [334]: p + 2
Out[334]: Period('2012-05', '2M')
In [335]: p - 1
Out[335]: Period('2011-11', '2M')
In [336]: p == pd.Period('2012-01', freq='3M')
---------------------------------------------------------------------------
IncompatibleFrequency Traceback (most recent call last)
<ipython-input-336-4b67dc0b596c> in <module>
----> 1 p == pd.Period('2012-01', freq='3M')
/pandas/pandas/_libs/tslibs/period.pyx in pandas._libs.tslibs.period._Period.__richcmp__()
IncompatibleFrequency: Input has different freq=3M from Period(freq=2M)
```
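If the `Period` frequency is daily or higher (`D`, `H`, `T`, `S`, `L`, `U`, `N`), `offsets` and `timedelta`-like objects can be added as long as the result can have the same frequency. Otherwise, `ValueError` is raised.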
```python
In [337]: p = pd.Period('2014-07-01 09:00', freq='H')
In [338]: p + pd.offsets.Hour(2)
Out[338]: Period('2014-07-01 11:00', 'H')
In [339]: p + datetime.timedelta(minutes=120)
Out[339]: Period('2014-07-01 11:00', 'H')
In [340]: p + np.timedelta64(7200, 's')
Out[340]: Period('2014-07-01 11:00', 'H')
In [1]: p + pd.offsets.Minute(5)
Traceback
...
ValueError: Input has different freq from Period(freq=H)
```
If the `Period` has any other frequency, only `offsets` of the same frequency can be added. Otherwise, `ValueError` is raised.
```python
In [341]: p = pd.Period('2014-07', freq='M')
In [342]: p + pd.offsets.MonthEnd(3)
Out[342]: Period('2014-10', 'M')
In [1]: p + pd.offsets.MonthBegin(3)
Traceback
...
ValueError: Input has different freq from Period(freq=M)
```
Taking the difference of `Period` instances with the same frequency returns the number of frequency units between them:
```python
In [343]: pd.Period('2012', freq='A-DEC') - pd.Period('2002', freq='A-DEC')
Out[343]: <10 * YearEnds: month=12>
```
### PeriodIndex and period_range
Regular sequences of `Period` objects can be collected in a `PeriodIndex`, which can be constructed with the `period_range` convenience function:
```python
In [344]: prng = pd.period_range('1/1/2011', '1/1/2012', freq='M')
In [345]: prng
Out[345]:
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
'2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
'2012-01'],
dtype='period[M]', freq='M')
```
The `PeriodIndex` constructor can also be used directly:
```python
In [346]: pd.PeriodIndex(['2011-1', '2011-2', '2011-3'], freq='M')
Out[346]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')
```
Passing a multiplied frequency (such as `3M`) outputs a sequence of `Period` objects with the multiplied span.
```python
In [347]: pd.period_range(start='2014-01', freq='3M', periods=4)
Out[347]: PeriodIndex(['2014-01', '2014-04', '2014-07', '2014-10'], dtype='period[3M]', freq='3M')
```
If `start` or `end` are `Period` objects, they are used as anchor endpoints for a `PeriodIndex` whose frequency matches the `freq` passed to the constructor.
```python
In [348]: pd.period_range(start=pd.Period('2017Q1', freq='Q'),
.....: end=pd.Period('2017Q2', freq='Q'), freq='M')
.....:
Out[348]: PeriodIndex(['2017-03', '2017-04', '2017-05', '2017-06'], dtype='period[M]', freq='M')
```
Just like `DatetimeIndex`, a `PeriodIndex` can be used to index pandas objects:
```python
In [349]: ps = pd.Series(np.random.randn(len(prng)), prng)
In [350]: ps
Out[350]:
2011-01 -2.916901
2011-02 0.514474
2011-03 1.346470
2011-04 0.816397
2011-05 2.258648
2011-06 0.494789
2011-07 0.301239
2011-08 0.464776
2011-09 -1.393581
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
2012-01 -0.329583
Freq: M, dtype: float64
```
`PeriodIndex` supports addition and subtraction with the same rules as `Period`.
```python
In [351]: idx = pd.period_range('2014-07-01 09:00', periods=5, freq='H')
In [352]: idx
Out[352]:
PeriodIndex(['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00',
'2014-07-01 12:00', '2014-07-01 13:00'],
dtype='period[H]', freq='H')
In [353]: idx + pd.offsets.Hour(2)
Out[353]:
PeriodIndex(['2014-07-01 11:00', '2014-07-01 12:00', '2014-07-01 13:00',
'2014-07-01 14:00', '2014-07-01 15:00'],
dtype='period[H]', freq='H')
In [354]: idx = pd.period_range('2014-07', periods=5, freq='M')
In [355]: idx
Out[355]: PeriodIndex(['2014-07', '2014-08', '2014-09', '2014-10', '2014-11'], dtype='period[M]', freq='M')
In [356]: idx + pd.offsets.MonthEnd(3)
Out[356]: PeriodIndex(['2014-10', '2014-11', '2014-12', '2015-01', '2015-02'], dtype='period[M]', freq='M')
```
`PeriodIndex` has its own dtype named `period`; refer to [Period dtypes](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-period-dtype).
### Period dtypes
*New in version 0.19.0.*
`PeriodIndex` has a custom `period` dtype. This is a pandas extension dtype similar to the [time zone aware dtype](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-timezone-series) (`datetime64[ns, tz]`).
The `period` dtype holds the `freq` attribute and is written as `period[freq]`, such as `period[D]` or `period[M]`, using [frequency strings](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases).
```python
In [357]: pi = pd.period_range('2016-01-01', periods=3, freq='M')
In [358]: pi
Out[358]: PeriodIndex(['2016-01', '2016-02', '2016-03'], dtype='period[M]', freq='M')
In [359]: pi.dtype
Out[359]: period[M]
```
The `period` dtype can be used in `.astype(...)`. It lets you change the `freq` of a `PeriodIndex`, like `.asfreq()`, and convert a `DatetimeIndex` to a `PeriodIndex`, like `to_period()`:
```python
# change monthly freq to daily freq
In [360]: pi.astype('period[D]')
Out[360]: PeriodIndex(['2016-01-31', '2016-02-29', '2016-03-31'], dtype='period[D]', freq='D')
# convert to DatetimeIndex
In [361]: pi.astype('datetime64[ns]')
Out[361]: DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01'], dtype='datetime64[ns]', freq='MS')
# convert to PeriodIndex
In [362]: dti = pd.date_range('2011-01-01', freq='M', periods=3)
In [363]: dti
Out[363]: DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31'], dtype='datetime64[ns]', freq='M')
In [364]: dti.astype('period[M]')
Out[364]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')
```
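### PeriodIndex partial string indexing
You can pass dates and strings to a `Series` or `DataFrame` with a `PeriodIndex`, in the same manner as with a `DatetimeIndex`. For details, refer to [DatetimeIndex partial string indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-partialindexing).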
```python
In [365]: ps['2011-01']
Out[365]: -2.9169013294054507
In [366]: ps[datetime.datetime(2011, 12, 25):]
Out[366]:
2011-12 2.261385
2012-01 -0.329583
Freq: M, dtype: float64
In [367]: ps['10/31/2011':'12/31/2011']
Out[367]:
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
Freq: M, dtype: float64
```
Passing a string that represents a lower frequency than the `PeriodIndex` returns partially sliced data.
```python
In [368]: ps['2011']
Out[368]:
2011-01 -2.916901
2011-02 0.514474
2011-03 1.346470
2011-04 0.816397
2011-05 2.258648
2011-06 0.494789
2011-07 0.301239
2011-08 0.464776
2011-09 -1.393581
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
Freq: M, dtype: float64
In [369]: dfp = pd.DataFrame(np.random.randn(600, 1),
.....: columns=['A'],
.....: index=pd.period_range('2013-01-01 9:00',
.....: periods=600,
.....: freq='T'))
.....:
In [370]: dfp
Out[370]:
A
2013-01-01 09:00 -0.538468
2013-01-01 09:01 -1.365819
2013-01-01 09:02 -0.969051
2013-01-01 09:03 -0.331152
2013-01-01 09:04 -0.245334
... ...
2013-01-01 18:55 0.522460
2013-01-01 18:56 0.118710
2013-01-01 18:57 0.167517
2013-01-01 18:58 0.922883
2013-01-01 18:59 1.721104
[600 rows x 1 columns]
In [371]: dfp['2013-01-01 10H']
Out[371]:
A
2013-01-01 10:00 -0.308975
2013-01-01 10:01 0.542520
2013-01-01 10:02 1.061068
2013-01-01 10:03 0.754005
2013-01-01 10:04 0.352933
... ...
2013-01-01 10:55 -0.865621
2013-01-01 10:56 -1.167818
2013-01-01 10:57 -2.081748
2013-01-01 10:58 -0.527146
2013-01-01 10:59 0.802298
[60 rows x 1 columns]
```
As with `DatetimeIndex`, the endpoints are included in the result. The example below slices data from 10:00 to 11:59.
```python
In [372]: dfp['2013-01-01 10H':'2013-01-01 11H']
Out[372]:
A
2013-01-01 10:00 -0.308975
2013-01-01 10:01 0.542520
2013-01-01 10:02 1.061068
2013-01-01 10:03 0.754005
2013-01-01 10:04 0.352933
... ...
2013-01-01 11:55 -0.590204
2013-01-01 11:56 1.539990
2013-01-01 11:57 -1.224826
2013-01-01 11:58 0.578798
2013-01-01 11:59 -0.685496
[120 rows x 1 columns]
```
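### Frequency conversion and resampling with `PeriodIndex`
The frequency of `Period` and `PeriodIndex` can be converted with the `asfreq` method. Let's start with fiscal year 2011, ending in December: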
```python
In [373]: p = pd.Period('2011', freq='A-DEC')
In [374]: p
Out[374]: Period('2011', 'A-DEC')
```
We can convert it to a monthly frequency. The `how` parameter specifies whether to return the starting or the ending month:
```python
In [375]: p.asfreq('M', how='start')
Out[375]: Period('2011-01', 'M')
In [376]: p.asfreq('M', how='end')
Out[376]: Period('2011-12', 'M')
```
The shorthands `s` and `e` are provided for convenience:
```python
In [377]: p.asfreq('M', 's')
Out[377]: Period('2011-01', 'M')
In [378]: p.asfreq('M', 'e')
Out[378]: Period('2011-12', 'M')
```
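Converting to a "super-period" (e.g., annual frequency is a super-period of quarterly frequency) automatically returns the super-period that includes the input period: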
```python
In [379]: p = pd.Period('2011-12', freq='M')
In [380]: p.asfreq('A-NOV')
Out[380]: Period('2012', 'A-NOV')
```
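Note that since we converted to an annual frequency that ends the year in November, the monthly period of December 2011 actually falls in the `2012 A-NOV` period.
Period conversions with anchored frequencies are particularly useful for working with the quarterly data common in economics, business, and other fields. Many organizations define quarters relative to the month in which their fiscal year starts and ends. Thus, the first quarter of 2011 could start in 2010 or a few months into 2011. Via anchored frequencies, pandas supports all quarterly frequencies from `Q-JAN` through `Q-DEC`.
`Q-DEC` defines regular calendar quarters: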
```python
In [381]: p = pd.Period('2012Q1', freq='Q-DEC')
In [382]: p.asfreq('D', 's')
Out[382]: Period('2012-01-01', 'D')
In [383]: p.asfreq('D', 'e')
Out[383]: Period('2012-03-31', 'D')
```
`Q-MAR` defines a fiscal year that ends in March:
```python
In [384]: p = pd.Period('2011Q4', freq='Q-MAR')
In [385]: p.asfreq('D', 's')
Out[385]: Period('2011-01-01', 'D')
In [386]: p.asfreq('D', 'e')
Out[386]: Period('2011-03-31', 'D')
```
### Converting between representations
Timestamped data can be converted to `PeriodIndex`-ed data using `to_period`, and vice versa using `to_timestamp`:
```python
In [387]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [388]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [389]: ts
Out[389]:
2012-01-31 1.931253
2012-02-29 -0.184594
2012-03-31 0.249656
2012-04-30 -0.978151
2012-05-31 -0.873389
Freq: M, dtype: float64
In [390]: ps = ts.to_period()
In [391]: ps
Out[391]:
2012-01 1.931253
2012-02 -0.184594
2012-03 0.249656
2012-04 -0.978151
2012-05 -0.873389
Freq: M, dtype: float64
In [392]: ps.to_timestamp()
Out[392]:
2012-01-01 1.931253
2012-02-01 -0.184594
2012-03-01 0.249656
2012-04-01 -0.978151
2012-05-01 -0.873389
Freq: MS, dtype: float64
```
Remember that `s` and `e` can be used to return the timestamps at the start or end of the period:
```python
In [393]: ps.to_timestamp('D', how='s')
Out[393]:
2012-01-01 1.931253
2012-02-01 -0.184594
2012-03-01 0.249656
2012-04-01 -0.978151
2012-05-01 -0.873389
Freq: MS, dtype: float64
```
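Converting between period and timestamp makes some convenient arithmetic possible. In the following example, we convert a quarterly frequency with the year ending in November to 9am on the first day of the month following the quarter end: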
```python
In [394]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [395]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [396]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [397]: ts.head()
Out[397]:
1990-03-01 09:00 -0.109291
1990-06-01 09:00 -0.637235
1990-09-01 09:00 -1.735925
1990-12-01 09:00 2.096946
1991-03-01 09:00 -1.039926
Freq: H, dtype: float64
```
## Representing out-of-bounds spans
If you have data outside the `Timestamp` bounds (see [Timestamp limitations](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-timestamp-limits)), you can use a `PeriodIndex` and/or a `Series` of `Periods` to do computations.
```python
In [398]: span = pd.period_range('1215-01-01', '1381-01-01', freq='D')
In [399]: span
Out[399]:
PeriodIndex(['1215-01-01', '1215-01-02', '1215-01-03', '1215-01-04',
'1215-01-05', '1215-01-06', '1215-01-07', '1215-01-08',
'1215-01-09', '1215-01-10',
...
'1380-12-23', '1380-12-24', '1380-12-25', '1380-12-26',
'1380-12-27', '1380-12-28', '1380-12-29', '1380-12-30',
'1380-12-31', '1381-01-01'],
dtype='period[D]', length=60632, freq='D')
```
To convert from an `int64`-based `YYYYMMDD` representation:
```python
In [400]: s = pd.Series([20121231, 20141130, 99991231])
In [401]: s
Out[401]:
0 20121231
1 20141130
2 99991231
dtype: int64
In [402]: def conv(x):
.....: return pd.Period(year=x // 10000, month=x // 100 % 100,
.....: day=x % 100, freq='D')
.....:
In [403]: s.apply(conv)
Out[403]:
0 2012-12-31
1 2014-11-30
2 9999-12-31
dtype: period[D]
In [404]: s.apply(conv)[2]
Out[404]: Period('9999-12-31', 'D')
```
These can easily be converted to a `PeriodIndex`:
```python
In [405]: span = pd.PeriodIndex(s.apply(conv))
In [406]: span
Out[406]: PeriodIndex(['2012-12-31', '2014-11-30', '9999-12-31'], dtype='period[D]', freq='D')
```
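## Time zone handling
pandas provides rich support for working with timestamps in different time zones using the `pytz` and `dateutil` libraries or `datetime.timezone` objects from the standard library.
### Working with time zones
By default, pandas objects are time zone unaware: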
```python
In [407]: rng = pd.date_range('3/6/2012 00:00', periods=15, freq='D')
In [408]: rng.tz is None
Out[408]: True
```
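To localize these dates to a time zone (that is, assign a particular time zone to a naive date), use the `tz_localize` method or the `tz` keyword argument of [`date_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html#pandas.date_range "pandas.date_range"), [`Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp "pandas.Timestamp"), or [`DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html#pandas.DatetimeIndex "pandas.DatetimeIndex"). You can pass either `pytz` or `dateutil` time zone objects or Olson time zone database strings. Olson time zone strings return `pytz` time zone objects by default. To get `dateutil` time zone objects, prepend `dateutil/` to the string.
* In `pytz`, you can find a list of common (and less common) time zones using `from pytz import common_timezones, all_timezones`.
* `dateutil` uses the OS time zones, so there is no fixed list available. For common zones, the names are the same as in `pytz`.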
```python
In [409]: import dateutil
# pytz
In [410]: rng_pytz = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
.....: tz='Europe/London')
.....:
In [411]: rng_pytz.tz
Out[411]: <DstTzInfo 'Europe/London' LMT-1 day, 23:59:00 STD>
# dateutil
In [412]: rng_dateutil = pd.date_range('3/6/2012 00:00', periods=3, freq='D')
In [413]: rng_dateutil = rng_dateutil.tz_localize('dateutil/Europe/London')
In [414]: rng_dateutil.tz
Out[414]: tzfile('/usr/share/zoneinfo/Europe/London')
# dateutil - utc special case
In [415]: rng_utc = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
.....: tz=dateutil.tz.tzutc())
.....:
In [416]: rng_utc.tz
Out[416]: tzutc()
```
*New in version 0.25.0.*
```python
# datetime.timezone
In [417]: rng_utc = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
.....: tz=datetime.timezone.utc)
.....:
In [418]: rng_utc.tz
Out[418]: datetime.timezone.utc
```
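Note that the `UTC` time zone is a special case in `dateutil` and should be constructed explicitly as an instance of `dateutil.tz.tzutc`. You can also construct other time zone objects explicitly first.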
```python
In [419]: import pytz
# pytz
In [420]: tz_pytz = pytz.timezone('Europe/London')
In [421]: rng_pytz = pd.date_range('3/6/2012 00:00', periods=3, freq='D')
In [422]: rng_pytz = rng_pytz.tz_localize(tz_pytz)
In [423]: rng_pytz.tz == tz_pytz
Out[423]: True
# dateutil
In [424]: tz_dateutil = dateutil.tz.gettz('Europe/London')
In [425]: rng_dateutil = pd.date_range('3/6/2012 00:00', periods=3, freq='D',
.....: tz=tz_dateutil)
.....:
In [426]: rng_dateutil.tz == tz_dateutil
Out[426]: True
```
To convert a time zone aware pandas object from one time zone to another, use the `tz_convert` method.
```python
In [427]: rng_pytz.tz_convert('US/Eastern')
Out[427]:
DatetimeIndex(['2012-03-05 19:00:00-05:00', '2012-03-06 19:00:00-05:00',
'2012-03-07 19:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq='D')
```
::: tip Note
When using `pytz` time zones, [`DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html#pandas.DatetimeIndex "pandas.DatetimeIndex") constructs a different time zone object than a [`Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp "pandas.Timestamp") for the same time zone input. A `DatetimeIndex` can hold a collection of `Timestamp` objects that may have different UTC offsets, so it cannot be succinctly represented by one `pytz` time zone instance, while one `Timestamp` represents one point in time with a specific UTC offset.
```python
In [428]: dti = pd.date_range('2019-01-01', periods=3, freq='D', tz='US/Pacific')
In [429]: dti.tz
Out[429]: <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>
In [430]: ts = pd.Timestamp('2019-01-01', tz='US/Pacific')
In [431]: ts.tz
Out[431]: <DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>
```
:::
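::: danger Warning
Be aware of conversions between libraries. For some time zones, `pytz` and `dateutil` define the zone differently. This is more of a problem for unusual time zones than for "standard" zones like `US/Eastern`.
:::
::: danger Warning
Be aware that time zone definitions from different versions of a time zone library may not be considered equal. This can cause problems when working with stored data that was localized using one version and is operated on with another. See [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-hdf5-notes) for how to handle such a situation.
:::
::: danger Warning
For `pytz` time zones, it is incorrect to pass a time zone object directly into the `datetime.datetime` constructor (e.g., `datetime.datetime(2011, 1, 1, tzinfo=pytz.timezone('US/Eastern'))`). Instead, the datetime needs to be localized using the `localize` method on the `pytz` time zone object.
:::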
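The difference described in the last warning can be sketched as follows, assuming `pytz` is installed. Passing the zone straight to the constructor attaches the zone's earliest recorded offset (local mean time) rather than the offset valid on that date, whereas `localize` picks the correct one:

```python
import datetime

import pandas as pd
import pytz

eastern = pytz.timezone("US/Eastern")
naive = datetime.datetime(2011, 1, 1)

# Incorrect with pytz: the constructor does not resolve the offset for the date.
wrong = datetime.datetime(2011, 1, 1, tzinfo=eastern)

# Correct: let the pytz zone localize the naive datetime.
right = eastern.localize(naive)

# pandas performs the equivalent localization through tz_localize.
stamp = pd.Timestamp(naive).tz_localize("US/Eastern")

print(wrong.utcoffset(), right.utcoffset(), stamp.utcoffset())
```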
Under the hood, all timestamps are stored in UTC. Values from a time zone aware [`DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html#pandas.DatetimeIndex "pandas.DatetimeIndex") or [`Timestamp`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp "pandas.Timestamp") have their fields (day, hour, minute, etc.) localized to the time zone. However, timestamps with the same UTC value are still considered equal even if they are in different time zones:
```python
In [432]: rng_eastern = rng_utc.tz_convert('US/Eastern')
In [433]: rng_berlin = rng_utc.tz_convert('Europe/Berlin')
In [434]: rng_eastern[2]
Out[434]: Timestamp('2012-03-07 19:00:00-0500', tz='US/Eastern', freq='D')
In [435]: rng_berlin[2]
Out[435]: Timestamp('2012-03-08 01:00:00+0100', tz='Europe/Berlin', freq='D')
In [436]: rng_eastern[2] == rng_berlin[2]
Out[436]: True
```
Operations between [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") in different time zones yield a UTC `Series`, aligning the data on the UTC timestamps:
```python
In [437]: ts_utc = pd.Series(range(3), pd.date_range('20130101', periods=3, tz='UTC'))
In [438]: eastern = ts_utc.tz_convert('US/Eastern')
In [439]: berlin = ts_utc.tz_convert('Europe/Berlin')
In [440]: result = eastern + berlin
In [441]: result
Out[441]:
2013-01-01 00:00:00+00:00 0
2013-01-02 00:00:00+00:00 2
2013-01-03 00:00:00+00:00 4
Freq: D, dtype: int64
In [442]: result.index
Out[442]:
DatetimeIndex(['2013-01-01 00:00:00+00:00', '2013-01-02 00:00:00+00:00',
'2013-01-03 00:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
```
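To remove time zone information, use `tz_localize(None)` or `tz_convert(None)`. `tz_localize(None)` removes the time zone and keeps the local time representation. `tz_convert(None)` converts to UTC first and then removes the time zone.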
```python
In [443]: didx = pd.date_range(start='2014-08-01 09:00', freq='H',
.....: periods=3, tz='US/Eastern')
.....:
In [444]: didx
Out[444]:
DatetimeIndex(['2014-08-01 09:00:00-04:00', '2014-08-01 10:00:00-04:00',
'2014-08-01 11:00:00-04:00'],
dtype='datetime64[ns, US/Eastern]', freq='H')
In [445]: didx.tz_localize(None)
Out[445]:
DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00',
'2014-08-01 11:00:00'],
dtype='datetime64[ns]', freq='H')
In [446]: didx.tz_convert(None)
Out[446]:
DatetimeIndex(['2014-08-01 13:00:00', '2014-08-01 14:00:00',
'2014-08-01 15:00:00'],
dtype='datetime64[ns]', freq='H')
# tz_convert(None) is identical to tz_convert('UTC').tz_localize(None)
In [447]: didx.tz_convert('UTC').tz_localize(None)
Out[447]:
DatetimeIndex(['2014-08-01 13:00:00', '2014-08-01 14:00:00',
'2014-08-01 15:00:00'],
dtype='datetime64[ns]', freq='H')
```
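### Ambiguous times when localizing
`tz_localize` may not be able to determine the UTC offset of a timestamp because daylight saving time (DST) in a local time zone causes some times to occur twice within one day ("clocks fall back"). The following options are available:
* `raise`: raises a `pytz.AmbiguousTimeError` (the default behavior)
* `infer`: attempts to determine the correct offset based on the monotonicity of the timestamps
* `NaT`: replaces ambiguous times with `NaT`
* `bool`: `True` represents a DST time, `False` represents a non-DST time. An array-like of `bool` values is supported for a sequence of times.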
```python
In [448]: rng_hourly = pd.DatetimeIndex(['11/06/2011 00:00', '11/06/2011 01:00',
.....: '11/06/2011 01:00', '11/06/2011 02:00'])
.....:
```
This fails because there are ambiguous times (`'11/06/2011 01:00'`):
```python
In [2]: rng_hourly.tz_localize('US/Eastern')
AmbiguousTimeError: Cannot infer dst time from Timestamp('2011-11-06 01:00:00'), try using the 'ambiguous' argument
```
Handle these ambiguous times by specifying the `ambiguous` keyword as follows.
```python
In [449]: rng_hourly.tz_localize('US/Eastern', ambiguous='infer')
Out[449]:
DatetimeIndex(['2011-11-06 00:00:00-04:00', '2011-11-06 01:00:00-04:00',
'2011-11-06 01:00:00-05:00', '2011-11-06 02:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
In [450]: rng_hourly.tz_localize('US/Eastern', ambiguous='NaT')
Out[450]:
DatetimeIndex(['2011-11-06 00:00:00-04:00', 'NaT', 'NaT',
'2011-11-06 02:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
In [451]: rng_hourly.tz_localize('US/Eastern', ambiguous=[True, True, False, False])
Out[451]:
DatetimeIndex(['2011-11-06 00:00:00-04:00', '2011-11-06 01:00:00-04:00',
'2011-11-06 01:00:00-05:00', '2011-11-06 02:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq=None)
```
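### Nonexistent times when localizing
A DST transition may also shift the local time ahead by one hour, creating nonexistent local times ("clocks spring forward"). The behavior of localizing a time series that contains nonexistent times can be controlled with the `nonexistent` argument. The following options are available:
* `raise`: raises a `pytz.NonExistentTimeError` (the default behavior)
* `NaT`: replaces nonexistent times with `NaT`
* `shift_forward`: shifts nonexistent times forward to the closest real time
* `shift_backward`: shifts nonexistent times backward to the closest real time
* `timedelta` object: shifts nonexistent times by the `timedelta` duration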
```python
In [452]: dti = pd.date_range(start='2015-03-29 02:30:00', periods=3, freq='H')
# 2:30 is a nonexistent time
```
Localizing nonexistent times raises an error by default.
```python
In [2]: dti.tz_localize('Europe/Warsaw')
NonExistentTimeError: 2015-03-29 02:30:00
```
Transform nonexistent times to `NaT` or shift the times:
```python
In [453]: dti
Out[453]:
DatetimeIndex(['2015-03-29 02:30:00', '2015-03-29 03:30:00',
'2015-03-29 04:30:00'],
dtype='datetime64[ns]', freq='H')
In [454]: dti.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
Out[454]:
DatetimeIndex(['2015-03-29 03:00:00+02:00', '2015-03-29 03:30:00+02:00',
'2015-03-29 04:30:00+02:00'],
dtype='datetime64[ns, Europe/Warsaw]', freq='H')
In [455]: dti.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
Out[455]:
DatetimeIndex(['2015-03-29 01:59:59.999999999+01:00',
'2015-03-29 03:30:00+02:00',
'2015-03-29 04:30:00+02:00'],
dtype='datetime64[ns, Europe/Warsaw]', freq='H')
In [456]: dti.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta(1, unit='H'))
Out[456]:
DatetimeIndex(['2015-03-29 03:30:00+02:00', '2015-03-29 03:30:00+02:00',
'2015-03-29 04:30:00+02:00'],
dtype='datetime64[ns, Europe/Warsaw]', freq='H')
In [457]: dti.tz_localize('Europe/Warsaw', nonexistent='NaT')
Out[457]:
DatetimeIndex(['NaT', '2015-03-29 03:30:00+02:00',
'2015-03-29 04:30:00+02:00'],
dtype='datetime64[ns, Europe/Warsaw]', freq='H')
```
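The `NaT` option can also be used to find out which naive timestamps fall into the DST gap before deciding how to handle them. The following is a minimal sketch under that assumption; the variable names (`naive`, `gap_mask`, `skipped`) are illustrative and not part of pandas.

```python
import pandas as pd

# Same three naive timestamps as in the session above; 02:30 does not
# exist in Europe/Warsaw on 2015-03-29 because clocks jump from 02:00 to 03:00.
naive = pd.to_datetime(['2015-03-29 02:30:00',
                        '2015-03-29 03:30:00',
                        '2015-03-29 04:30:00'])

# Localize with nonexistent='NaT' so that nonexistent times become NaT
# instead of raising an error.
localized = naive.tz_localize('Europe/Warsaw', nonexistent='NaT')

# Every NaT in the result marks an input that had no valid local time
# (the input itself contained no missing values).
gap_mask = localized.isna()
skipped = naive[gap_mask]

print(skipped)
```

Once the affected rows are known, they can be dropped, shifted, or reported, depending on what the data represents.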
### Time zone series operations
A [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") with time zone **naive** values has the dtype `datetime64[ns]`.
```python
In [458]: s_naive = pd.Series(pd.date_range('20130101', periods=3))
In [459]: s_naive
Out[459]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
dtype: datetime64[ns]
```
A [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") with time zone **aware** values has the dtype `datetime64[ns, tz]`, where `tz` is the time zone.
```python
In [460]: s_aware = pd.Series(pd.date_range('20130101', periods=3, tz='US/Eastern'))
In [461]: s_aware
Out[461]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
```
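The dtype difference described above can be checked programmatically. This is a small sketch; `is_tz_aware` is a hypothetical helper written for this illustration, not a pandas function, while `pd.DatetimeTZDtype` and the `.dt.tz` attribute are part of pandas.

```python
import pandas as pd

s_naive = pd.Series(pd.date_range('20130101', periods=3))
s_aware = pd.Series(pd.date_range('20130101', periods=3, tz='US/Eastern'))


def is_tz_aware(series):
    """Hypothetical helper: True if the Series holds time zone aware datetimes."""
    return isinstance(series.dtype, pd.DatetimeTZDtype)


naive_flag = is_tz_aware(s_naive)   # naive dtype is a plain datetime64 dtype
aware_flag = is_tz_aware(s_aware)   # aware dtype is a DatetimeTZDtype

# The time zone itself is exposed through the .dt accessor;
# it is None for a naive Series.
aware_tz = s_aware.dt.tz
naive_tz = s_naive.dt.tz

print(naive_flag, aware_flag, aware_tz, naive_tz)
```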
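The time zone information of both kinds of [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") can be manipulated through the `.dt` accessor; see the [dt accessor section](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dt-accessors).
For example, to localize naive timestamps as UTC and then convert them to another time zone: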
```python
In [462]: s_naive.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[462]:
0 2012-12-31 19:00:00-05:00
1 2013-01-01 19:00:00-05:00
2 2013-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
```
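It is easy to confuse the two steps in that chain: `tz_localize` attaches a time zone and keeps the wall-clock reading, while `tz_convert` keeps the instant and changes the wall-clock reading. The sketch below contrasts localizing directly to `US/Eastern` with localizing as UTC and then converting; the variable names are illustrative.

```python
import pandas as pd

s_naive = pd.Series(pd.date_range('20130101', periods=3))

# Interpret the naive values as US/Eastern wall-clock times:
# the clock reading stays at midnight.
direct = s_naive.dt.tz_localize('US/Eastern')

# Interpret the naive values as UTC, then express the same instants
# in US/Eastern: the clock reading moves back by five hours.
via_utc = s_naive.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

print(direct.iloc[0])
print(via_utc.iloc[0])
```

Which of the two is appropriate depends on what the naive values were recorded as in the first place.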
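Time zone information can also be manipulated with `astype`. This method can localize and convert time zone naive timestamps, or convert time zone aware timestamps. The session below reflects the pandas version this guide was written for; in pandas 2.0 and later, `astype` between naive and aware dtypes raises a `TypeError`, and `tz_localize` / `tz_convert` should be used for those two cases instead.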
```python
# Localize and convert a naive time zone to an aware one
In [463]: s_naive.astype('datetime64[ns, US/Eastern]')
Out[463]:
0 2012-12-31 19:00:00-05:00
1 2013-01-01 19:00:00-05:00
2 2013-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
# Make a time zone aware Series naive
In [464]: s_aware.astype('datetime64[ns]')
Out[464]:
0 2013-01-01 05:00:00
1 2013-01-02 05:00:00
2 2013-01-03 05:00:00
dtype: datetime64[ns]
# Convert to a different time zone
In [465]: s_aware.astype('datetime64[ns, CET]')
Out[465]:
0 2013-01-01 06:00:00+01:00
1 2013-01-02 06:00:00+01:00
2 2013-01-03 06:00:00+01:00
dtype: datetime64[ns, CET]
```
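The same three conversions can be written with the `.dt` accessor, which does not depend on how a given pandas version treats `astype` between naive and aware dtypes. This is a sketch that assumes, as the `astype` output above does, that the naive values are to be read as UTC; the variable names are illustrative.

```python
import pandas as pd

s_naive = pd.Series(pd.date_range('20130101', periods=3))
s_aware = pd.Series(pd.date_range('20130101', periods=3, tz='US/Eastern'))

# Naive -> aware: treat the naive values as UTC, then convert.
to_aware = s_naive.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

# Aware -> naive: convert to UTC first, then drop the time zone
# by localizing to None.
to_naive_utc = s_aware.dt.tz_convert('UTC').dt.tz_localize(None)

# Aware -> another time zone.
to_cet = s_aware.dt.tz_convert('CET')

print(to_aware.iloc[0], to_naive_utc.iloc[0], to_cet.iloc[0])
```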
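::: tip Note
Calling [`Series.to_numpy()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy "pandas.Series.to_numpy") on a `Series` returns a NumPy array of the data. NumPy does not currently support time zones (even though it **prints** in the local time zone!), so for time zone aware data an object array of `Timestamp` values is returned: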
```python
In [466]: s_naive.to_numpy()
Out[466]:
array(['2013-01-01T00:00:00.000000000', '2013-01-02T00:00:00.000000000',
'2013-01-03T00:00:00.000000000'], dtype='datetime64[ns]')
In [467]: s_aware.to_numpy()
Out[467]:
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern', freq='D'),
Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern', freq='D'),
Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern', freq='D')],
dtype=object)
```
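Converting to an object array of `Timestamp` values preserves the time zone information. For example, when converting back to a `Series`: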
```python
In [468]: pd.Series(s_aware.to_numpy())
Out[468]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
```
If you need an actual NumPy `datetime64[ns]` array (with the values converted to UTC) rather than an array of objects, specify the `dtype` argument:
```python
In [469]: s_aware.to_numpy(dtype='datetime64[ns]')
Out[469]:
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
```
:::
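The two `to_numpy()` forms described in the note can be compared side by side. The sketch below also shows one way to restore the original time zone from the UTC array, assuming the zone name is known separately because the `datetime64[ns]` array no longer carries it; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s_aware = pd.Series(pd.date_range('20130101', periods=3, tz='US/Eastern'))

# Default: an object array of Timestamp values that keep their time zone.
arr_obj = s_aware.to_numpy()

# With an explicit dtype: a native datetime64 array holding UTC values.
arr_utc = s_aware.to_numpy(dtype='datetime64[ns]')

# The UTC array has lost the zone, so it has to be re-attached by hand:
# localize as UTC, then convert back to the original zone.
restored = pd.Series(arr_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

print(arr_obj.dtype, arr_utc.dtype)
print(restored)
```

The object array is convenient for round trips through pandas, while the `datetime64[ns]` array is the form that most NumPy routines expect.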