matplotlib & pandas

This commit is contained in:
estomm
2020-09-26 22:03:11 +08:00
parent 73cc328c81
commit d31be4f219
599 changed files with 99925 additions and 0 deletions

View File

@@ -0,0 +1,213 @@
---
meta:
- name: keywords
content: Pandas指南
- name: description
content: “用户指南” 按主题划分区域涵盖了几乎所有Pandas的功能。每个小节都介绍了一个主题例如“处理缺失的数据”并讨论了Pandas如何解决问题其中包含许多示例。
---
# Pandas 用户指南目录
“用户指南” 按主题划分区域涵盖了几乎所有Pandas的功能。每个小节都介绍了一个主题例如“处理缺失的数据”并讨论了Pandas如何解决问题其中包含许多示例。
刚开始接触Pandas的同学应该从[十分钟入门Pandas](/docs/getting_started/10min.html)开始看起。
有关任何特定方法的更多信息,请[参阅API参考](/docs/reference.html)。
- [IO工具文本CSVHDF5](io.html)
- [CSV & text files](io.html#csv-text-files)
- [JSON](io.html#json)
- [HTML](io.html#html)
- [Excel files](io.html#excel-files)
- [OpenDocument Spreadsheets](io.html#opendocument-spreadsheets)
- [Clipboard](io.html#clipboard)
- [Pickling](io.html#pickling)
- [msgpack](io.html#msgpack)
- [HDF5 (PyTables)](io.html#hdf5-pytables)
- [Feather](io.html#feather)
- [Parquet](io.html#parquet)
- [SQL queries](io.html#sql-queries)
- [Google BigQuery](io.html#google-bigquery)
- [Stata format](io.html#stata-format)
- [SAS formats](io.html#sas-formats)
- [Other file formats](io.html#other-file-formats)
- [Performance considerations](io.html#performance-considerations)
- [索引和数据选择器](indexing.html)
- [Different choices for indexing](indexing.html#different-choices-for-indexing)
- [Basics](indexing.html#basics)
- [Attribute access](indexing.html#attribute-access)
- [Slicing ranges](indexing.html#slicing-ranges)
- [Selection by label](indexing.html#selection-by-label)
- [Selection by position](indexing.html#selection-by-position)
- [Selection by callable](indexing.html#selection-by-callable)
- [IX indexer is deprecated](indexing.html#ix-indexer-is-deprecated)
- [Indexing with list with missing labels is deprecated](indexing.html#indexing-with-list-with-missing-labels-is-deprecated)
- [Selecting random samples](indexing.html#selecting-random-samples)
- [Setting with enlargement](indexing.html#setting-with-enlargement)
- [Fast scalar value getting and setting](indexing.html#fast-scalar-value-getting-and-setting)
- [Boolean indexing](indexing.html#boolean-indexing)
- [Indexing with isin](indexing.html#indexing-with-isin)
- [The ``where()`` Method and Masking](indexing.html#the-where-method-and-masking)
- [The ``query()`` Method](indexing.html#the-query-method)
- [Duplicate data](indexing.html#duplicate-data)
- [Dictionary-like ``get()`` method](indexing.html#dictionary-like-get-method)
- [The ``lookup()`` method](indexing.html#the-lookup-method)
- [Index objects](indexing.html#index-objects)
- [Set / reset index](indexing.html#set-reset-index)
- [Returning a view versus a copy](indexing.html#returning-a-view-versus-a-copy)
- [多索引/高级索引](advanced.html)
- [Hierarchical indexing (MultiIndex)](advanced.html#hierarchical-indexing-multiindex)
- [Advanced indexing with hierarchical index](advanced.html#advanced-indexing-with-hierarchical-index)
- [Sorting a ``MultiIndex``](advanced.html#sorting-a-multiindex)
- [Take methods](advanced.html#take-methods)
- [Index types](advanced.html#index-types)
- [Miscellaneous indexing FAQ](advanced.html#miscellaneous-indexing-faq)
- [合并、联接和连接](merging.html)
- [Concatenating objects](merging.html#concatenating-objects)
- [Database-style DataFrame or named Series joining/merging](merging.html#database-style-dataframe-or-named-series-joining-merging)
- [Timeseries friendly merging](merging.html#timeseries-friendly-merging)
- [重塑和数据透视表](reshaping.html)
- [Reshaping by pivoting DataFrame objects](reshaping.html#reshaping-by-pivoting-dataframe-objects)
- [Reshaping by stacking and unstacking](reshaping.html#reshaping-by-stacking-and-unstacking)
- [Reshaping by Melt](reshaping.html#reshaping-by-melt)
- [Combining with stats and GroupBy](reshaping.html#combining-with-stats-and-groupby)
- [Pivot tables](reshaping.html#pivot-tables)
- [Cross tabulations](reshaping.html#cross-tabulations)
- [Tiling](reshaping.html#tiling)
- [Computing indicator / dummy variables](reshaping.html#computing-indicator-dummy-variables)
- [Factorizing values](reshaping.html#factorizing-values)
- [Examples](reshaping.html#examples)
- [Exploding a list-like column](reshaping.html#exploding-a-list-like-column)
- [处理文本字符串](text.html)
- [Splitting and replacing strings](text.html#splitting-and-replacing-strings)
- [Concatenation](text.html#concatenation)
- [Indexing with ``.str``](text.html#indexing-with-str)
- [Extracting substrings](text.html#extracting-substrings)
- [Testing for Strings that match or contain a pattern](text.html#testing-for-strings-that-match-or-contain-a-pattern)
- [Creating indicator variables](text.html#creating-indicator-variables)
- [Method summary](text.html#method-summary)
- [处理丢失的数据](missing_data.html)
- [Values considered “missing”](missing_data.html#values-considered-missing)
- [Sum/prod of empties/nans](missing_data.html#sum-prod-of-empties-nans)
- [NA values in GroupBy](missing_data.html#na-values-in-groupby)
- [Filling missing values: fillna](missing_data.html#filling-missing-values-fillna)
- [Filling with a PandasObject](missing_data.html#filling-with-a-pandasobject)
- [Dropping axis labels with missing data: dropna](missing_data.html#dropping-axis-labels-with-missing-data-dropna)
- [Interpolation](missing_data.html#interpolation)
- [Replacing generic values](missing_data.html#replacing-generic-values)
- [String/regular expression replacement](missing_data.html#string-regular-expression-replacement)
- [Numeric replacement](missing_data.html#numeric-replacement)
- [分类数据](categorical.html)
- [Object creation](categorical.html#object-creation)
- [CategoricalDtype](categorical.html#categoricaldtype)
- [Description](categorical.html#description)
- [Working with categories](categorical.html#working-with-categories)
- [Sorting and order](categorical.html#sorting-and-order)
- [Comparisons](categorical.html#comparisons)
- [Operations](categorical.html#operations)
- [Data munging](categorical.html#data-munging)
- [Getting data in/out](categorical.html#getting-data-in-out)
- [Missing data](categorical.html#missing-data)
- [Differences to Rs <cite>factor</cite>](categorical.html#differences-to-r-s-factor)
- [Gotchas](categorical.html#gotchas)
- [Nullable整型数据类型](integer_na.html)
- [可视化](visualization.html)
- [Basic plotting: ``plot``](visualization.html#basic-plotting-plot)
- [Other plots](visualization.html#other-plots)
- [Plotting with missing data](visualization.html#plotting-with-missing-data)
- [Plotting Tools](visualization.html#plotting-tools)
- [Plot Formatting](visualization.html#plot-formatting)
- [Plotting directly with matplotlib](visualization.html#plotting-directly-with-matplotlib)
- [Trellis plotting interface](visualization.html#trellis-plotting-interface)
- [计算工具](computation.html)
- [Statistical functions](computation.html#statistical-functions)
- [Window Functions](computation.html#window-functions)
- [Aggregation](computation.html#aggregation)
- [Expanding windows](computation.html#expanding-windows)
- [Exponentially weighted windows](computation.html#exponentially-weighted-windows)
- [组操作: 拆分-应用-组合](groupby.html)
- [Splitting an object into groups](groupby.html#splitting-an-object-into-groups)
- [Iterating through groups](groupby.html#iterating-through-groups)
- [Selecting a group](groupby.html#selecting-a-group)
- [Aggregation](groupby.html#aggregation)
- [Transformation](groupby.html#transformation)
- [Filtration](groupby.html#filtration)
- [Dispatching to instance methods](groupby.html#dispatching-to-instance-methods)
- [Flexible ``apply``](groupby.html#flexible-apply)
- [Other useful features](groupby.html#other-useful-features)
- [Examples](groupby.html#examples)
- [时间序列/日期方法](timeseries.html)
- [Overview](timeseries.html#overview)
- [Timestamps vs. Time Spans](timeseries.html#timestamps-vs-time-spans)
- [Converting to timestamps](timeseries.html#converting-to-timestamps)
- [Generating ranges of timestamps](timeseries.html#generating-ranges-of-timestamps)
- [Timestamp limitations](timeseries.html#timestamp-limitations)
- [Indexing](timeseries.html#indexing)
- [Time/date components](timeseries.html#time-date-components)
- [DateOffset objects](timeseries.html#dateoffset-objects)
- [Time Series-Related Instance Methods](timeseries.html#time-series-related-instance-methods)
- [Resampling](timeseries.html#resampling)
- [Time span representation](timeseries.html#time-span-representation)
- [Converting between representations](timeseries.html#converting-between-representations)
- [Representing out-of-bounds spans](timeseries.html#representing-out-of-bounds-spans)
- [Time zone handling](timeseries.html#time-zone-handling)
- [时间增量](timedeltas.html)
- [Parsing](timedeltas.html#parsing)
- [Operations](timedeltas.html#operations)
- [Reductions](timedeltas.html#reductions)
- [Frequency conversion](timedeltas.html#frequency-conversion)
- [Attributes](timedeltas.html#attributes)
- [TimedeltaIndex](timedeltas.html#timedeltaindex)
- [Resampling](timedeltas.html#resampling)
- [样式](style.html)
- [Building styles](style.html#Building-styles)
- [Finer control: slicing](style.html#Finer-control:-slicing)
- [Finer Control: Display Values](style.html#Finer-Control:-Display-Values)
- [Builtin styles](style.html#Builtin-styles)
- [Sharing styles](style.html#Sharing-styles)
- [Other Options](style.html#Other-Options)
- [Fun stuff](style.html#Fun-stuff)
- [Export to Excel](style.html#Export-to-Excel)
- [Extensibility](style.html#Extensibility)
- [选项和设置](options.html)
- [Overview](options.html#overview)
- [Getting and setting options](options.html#getting-and-setting-options)
- [Setting startup options in Python/IPython environment](options.html#setting-startup-options-in-python-ipython-environment)
- [Frequently Used Options](options.html#frequently-used-options)
- [Available options](options.html#available-options)
- [Number formatting](options.html#number-formatting)
- [Unicode formatting](options.html#unicode-formatting)
- [Table schema display](options.html#table-schema-display)
- [提高性能](enhancingperf.html)
- [Cython (writing C extensions for pandas)](enhancingperf.html#cython-writing-c-extensions-for-pandas)
- [Using Numba](enhancingperf.html#using-numba)
- [Expression evaluation via ``>eval()``](enhancingperf.html#expression-evaluation-via-eval)
- [稀疏数据结构](sparse.html)
- [SparseArray](sparse.html#sparsearray)
- [SparseDtype](sparse.html#sparsedtype)
- [Sparse accessor](sparse.html#sparse-accessor)
- [Sparse calculation](sparse.html#sparse-calculation)
- [Migrating](sparse.html#migrating)
- [Interaction with scipy.sparse](sparse.html#interaction-with-scipy-sparse)
- [Sparse subclasses](sparse.html#sparse-subclasses)
- [常见问题(FAQ)](gotchas.html)
- [DataFrame memory usage](gotchas.html#dataframe-memory-usage)
- [Using if/truth statements with pandas](gotchas.html#using-if-truth-statements-with-pandas)
- [``NaN``, Integer ``NA`` values and ``NA`` type promotions](gotchas.html#nan-integer-na-values-and-na-type-promotions)
- [Differences with NumPy](gotchas.html#differences-with-numpy)
- [Thread-safety](gotchas.html#thread-safety)
- [Byte-Ordering issues](gotchas.html#byte-ordering-issues)
- [烹饪指南](cookbook.html)
- [Idioms](cookbook.html#idioms)
- [Selection](cookbook.html#selection)
- [MultiIndexing](cookbook.html#multiindexing)
- [Missing data](cookbook.html#missing-data)
- [Grouping](cookbook.html#grouping)
- [Timeseries](cookbook.html#timeseries)
- [Merge](cookbook.html#merge)
- [Plotting](cookbook.html#plotting)
- [Data In/Out](cookbook.html#data-in-out)
- [Computation](cookbook.html#computation)
- [Timedeltas](cookbook.html#timedeltas)
- [Aliasing axis names](cookbook.html#aliasing-axis-names)
- [Creating example data](cookbook.html#creating-example-data)

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,984 @@
# Enhancing performance
In this part of the tutorial, we will investigate how to speed up certain
functions operating on pandas ``DataFrames`` using three different techniques:
Cython, Numba and [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval). We will see a speed improvement of ~200
when we use Cython and Numba on a test function operating row-wise on the
``DataFrame``. Using [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) we will speed up a sum by an order of
~2.
## Cython (writing C extensions for pandas)
For many use cases writing pandas in pure Python and NumPy is sufficient. In some
computationally heavy applications however, it can be possible to achieve sizable
speed-ups by offloading work to [cython](http://cython.org/).
This tutorial assumes you have refactored as much as possible in Python, for example
by trying to remove for-loops and making use of NumPy vectorization. Its always worth
optimising in Python first.
This tutorial walks through a “typical” process of cythonizing a slow computation.
We use an [example from the Cython documentation](http://docs.cython.org/src/quickstart/cythonize.html)
but in the context of pandas. Our final cythonized solution is around 100 times
faster than the pure Python solution.
### Pure Python
We have a ``DataFrame`` to which we want to apply a function row-wise.
``` python
In [1]: df = pd.DataFrame({'a': np.random.randn(1000),
...: 'b': np.random.randn(1000),
...: 'N': np.random.randint(100, 1000, (1000)),
...: 'x': 'x'})
...:
In [2]: df
Out[2]:
a b N x
0 0.469112 -0.218470 585 x
1 -0.282863 -0.061645 841 x
2 -1.509059 -0.723780 251 x
3 -1.135632 0.551225 972 x
4 1.212112 -0.497767 181 x
.. ... ... ... ..
995 -1.512743 0.874737 374 x
996 0.933753 1.120790 246 x
997 -0.308013 0.198768 157 x
998 -0.079915 1.757555 977 x
999 -1.010589 -1.115680 770 x
[1000 rows x 4 columns]
```
Heres the function in pure Python:
``` python
In [3]: def f(x):
...: return x * (x - 1)
...:
In [4]: def integrate_f(a, b, N):
...: s = 0
...: dx = (b - a) / N
...: for i in range(N):
...: s += f(a + i * dx)
...: return s * dx
...:
```
We achieve our result by using ``apply`` (row-wise):
``` python
In [7]: %timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 174 ms per loop
```
But clearly this isnt fast enough for us. Lets take a look and see where the
time is spent during this operation (limited to the most time consuming
four calls) using the [prun ipython magic function](http://ipython.org/ipython-doc/stable/api/generated/IPython.core.magics.execution.html#IPython.core.magics.execution.ExecutionMagics.prun):
``` python
In [5]: %prun -l 4 df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1) # noqa E999
672332 function calls (667306 primitive calls) in 0.285 seconds
Ordered by: internal time
List reduced from 221 to 4 due to restriction <4>
ncalls tottime percall cumtime percall filename:lineno(function)
1000 0.144 0.000 0.217 0.000 <ipython-input-4-c2a74e076cf0>:1(integrate_f)
552423 0.074 0.000 0.074 0.000 <ipython-input-3-c138bdd570e3>:1(f)
3000 0.008 0.000 0.045 0.000 base.py:4695(get_value)
6001 0.005 0.000 0.012 0.000 {pandas._libs.lib.values_from_object}
```
By far the majority of time is spend inside either ``integrate_f`` or ``f``,
hence well concentrate our efforts cythonizing these two functions.
::: tip Note
In Python 2 replacing the ``range`` with its generator counterpart (``xrange``)
would mean the ``range`` line would vanish. In Python 3 ``range`` is already a generator.
:::
### Plain Cython
First were going to need to import the Cython magic function to ipython:
``` python
In [6]: %load_ext Cython
```
Now, lets simply copy our functions over to Cython as is (the suffix
is here to distinguish between function versions):
``` python
In [7]: %%cython
...: def f_plain(x):
...: return x * (x - 1)
...: def integrate_f_plain(a, b, N):
...: s = 0
...: dx = (b - a) / N
...: for i in range(N):
...: s += f_plain(a + i * dx)
...: return s * dx
...:
```
::: tip Note
If youre having trouble pasting the above into your ipython, you may need
to be using bleeding edge ipython for paste to play well with cell magics.
:::
``` python
In [4]: %timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 85.5 ms per loop
```
Already this has shaved a third off, not too bad for a simple copy and paste.
### Adding type
We get another huge improvement simply by providing type information:
``` python
In [8]: %%cython
...: cdef double f_typed(double x) except? -2:
...: return x * (x - 1)
...: cpdef double integrate_f_typed(double a, double b, int N):
...: cdef int i
...: cdef double s, dx
...: s = 0
...: dx = (b - a) / N
...: for i in range(N):
...: s += f_typed(a + i * dx)
...: return s * dx
...:
```
``` python
In [4]: %timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 20.3 ms per loop
```
Now, were talking! Its now over ten times faster than the original python
implementation, and we havent *really* modified the code. Lets have another
look at whats eating up time:
``` python
In [9]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
119905 function calls (114879 primitive calls) in 0.096 seconds
Ordered by: internal time
List reduced from 216 to 4 due to restriction <4>
ncalls tottime percall cumtime percall filename:lineno(function)
3000 0.012 0.000 0.064 0.000 base.py:4695(get_value)
6001 0.007 0.000 0.017 0.000 {pandas._libs.lib.values_from_object}
3000 0.007 0.000 0.073 0.000 series.py:1061(__getitem__)
3000 0.006 0.000 0.006 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
```
### Using ndarray
Its calling series… a lot! Its creating a Series from each row, and get-ting from both
the index and the series (three times for each row). Function calls are expensive
in Python, so maybe we could minimize these by cythonizing the apply part.
::: tip Note
We are now passing ndarrays into the Cython function, fortunately Cython plays
very nicely with NumPy.
:::
``` python
In [10]: %%cython
....: cimport numpy as np
....: import numpy as np
....: cdef double f_typed(double x) except? -2:
....: return x * (x - 1)
....: cpdef double integrate_f_typed(double a, double b, int N):
....: cdef int i
....: cdef double s, dx
....: s = 0
....: dx = (b - a) / N
....: for i in range(N):
....: s += f_typed(a + i * dx)
....: return s * dx
....: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
....: np.ndarray col_N):
....: assert (col_a.dtype == np.float
....: and col_b.dtype == np.float and col_N.dtype == np.int)
....: cdef Py_ssize_t i, n = len(col_N)
....: assert (len(col_a) == len(col_b) == n)
....: cdef np.ndarray[double] res = np.empty(n)
....: for i in range(len(col_a)):
....: res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
....: return res
....:
```
The implementation is simple, it creates an array of zeros and loops over
the rows, applying our ``integrate_f_typed``, and putting this in the zeros array.
::: danger Warning
You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
to a Cython function. Instead pass the actual ``ndarray`` using the
[``Series.to_numpy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy). The reason is that the Cython
definition is specific to an ndarray and not the passed ``Series``.
So, do not do this:
``` python
apply_integrate_f(df['a'], df['b'], df['N'])
```
But rather, use [``Series.to_numpy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy) to get the underlying ``ndarray``:
``` python
apply_integrate_f(df['a'].to_numpy(),
df['b'].to_numpy(),
df['N'].to_numpy())
```
:::
::: tip Note
Loops like this would be *extremely* slow in Python, but in Cython looping
over NumPy arrays is *fast*.
:::
``` python
In [4]: %timeit apply_integrate_f(df['a'].to_numpy(),
df['b'].to_numpy(),
df['N'].to_numpy())
1000 loops, best of 3: 1.25 ms per loop
```
Weve gotten another big improvement. Lets check again where the time is spent:
``` python
In [11]: %prun -l 4 apply_integrate_f(df['a'].to_numpy(),
....: df['b'].to_numpy(),
....: df['N'].to_numpy())
....:
File "<ipython-input-11-613f5c6ec02d>", line 2
df['b'].to_numpy(),
^
IndentationError: unexpected indent
```
As one might expect, the majority of the time is now spent in ``apply_integrate_f``,
so if we wanted to make anymore efficiencies we must continue to concentrate our
efforts here.
### More advanced techniques
There is still hope for improvement. Heres an example of using some more
advanced Cython techniques:
``` python
In [12]: %%cython
....: cimport cython
....: cimport numpy as np
....: import numpy as np
....: cdef double f_typed(double x) except? -2:
....: return x * (x - 1)
....: cpdef double integrate_f_typed(double a, double b, int N):
....: cdef int i
....: cdef double s, dx
....: s = 0
....: dx = (b - a) / N
....: for i in range(N):
....: s += f_typed(a + i * dx)
....: return s * dx
....: @cython.boundscheck(False)
....: @cython.wraparound(False)
....: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a,
....: np.ndarray[double] col_b,
....: np.ndarray[int] col_N):
....: cdef int i, n = len(col_N)
....: assert len(col_a) == len(col_b) == n
....: cdef np.ndarray[double] res = np.empty(n)
....: for i in range(n):
....: res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
....: return res
....:
```
``` python
In [4]: %timeit apply_integrate_f_wrap(df['a'].to_numpy(),
df['b'].to_numpy(),
df['N'].to_numpy())
1000 loops, best of 3: 987 us per loop
```
Even faster, with the caveat that a bug in our Cython code (an off-by-one error,
for example) might cause a segfault because memory access isnt checked.
For more about ``boundscheck`` and ``wraparound``, see the Cython docs on
[compiler directives](http://cython.readthedocs.io/en/latest/src/reference/compilation.html?highlight=wraparound#compiler-directives).
## Using Numba
A recent alternative to statically compiling Cython code, is to use a *dynamic jit-compiler*, Numba.
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
::: tip Note
You will need to install Numba. This is easy with ``conda``, by using: ``conda install numba``, see [installing using miniconda](https://pandas.pydata.org/pandas-docs/stable/install.html#install-miniconda).
:::
::: tip Note
As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.
:::
### Jit
We demonstrate how to use Numba to just-in-time compile our code. We simply
take the plain Python code from above and annotate with the ``@jit`` decorator.
``` python
import numba
@numba.jit
def f_plain(x):
return x * (x - 1)
@numba.jit
def integrate_f_numba(a, b, N):
s = 0
dx = (b - a) / N
for i in range(N):
s += f_plain(a + i * dx)
return s * dx
@numba.jit
def apply_integrate_f_numba(col_a, col_b, col_N):
n = len(col_N)
result = np.empty(n, dtype='float64')
assert len(col_a) == len(col_b) == n
for i in range(n):
result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
return result
def compute_numba(df):
result = apply_integrate_f_numba(df['a'].to_numpy(),
df['b'].to_numpy(),
df['N'].to_numpy())
return pd.Series(result, index=df.index, name='result')
```
Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a
nicer interface by passing/returning pandas objects.
``` python
In [4]: %timeit compute_numba(df)
1000 loops, best of 3: 798 us per loop
```
In this example, using Numba was faster than Cython.
### Vectorize
Numba can also be used to write vectorized functions that do not require the user to explicitly
loop over the observations of a vector; a vectorized function will be applied to each row automatically.
Consider the following toy example of doubling each observation:
``` python
import numba
def double_every_value_nonumba(x):
return x * 2
@numba.vectorize
def double_every_value_withnumba(x): # noqa E501
return x * 2
```
``` python
# Custom function without numba
In [5]: %timeit df['col1_doubled'] = df.a.apply(double_every_value_nonumba) # noqa E501
1000 loops, best of 3: 797 us per loop
# Standard implementation (faster than a custom function)
In [6]: %timeit df['col1_doubled'] = df.a * 2
1000 loops, best of 3: 233 us per loop
# Custom function with numba
In [7]: %timeit (df['col1_doubled'] = double_every_value_withnumba(df.a.to_numpy())
1000 loops, best of 3: 145 us per loop
```
### Caveats
::: tip Note
Numba will execute on any function, but can only accelerate certain classes of functions.
:::
Numba is best at accelerating functions that apply numerical functions to NumPy
arrays. When passed a function that only uses operations it knows how to
accelerate, it will execute in ``nopython`` mode.
If Numba is passed a function that includes something it doesnt know how to
work with a category that currently includes sets, lists, dictionaries, or
string functions it will revert to ``object mode``. In ``object mode``,
Numba will execute but your code will not speed up significantly. If you would
prefer that Numba throw an error if it cannot compile a function in a way that
speeds up your code, pass Numba the argument
``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on
troubleshooting Numba modes, see the [Numba troubleshooting page](http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#the-compiled-code-is-too-slow).
Read more in the [Numba docs](http://numba.pydata.org/).
## Expression evaluation via ``eval()``
The top-level function [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) implements expression evaluation of
[``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) and [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) objects.
::: tip Note
To benefit from using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) you need to
install ``numexpr``. See the [recommended dependencies section](https://pandas.pydata.org/pandas-docs/stable/install.html#install-recommended-dependencies) for more details.
:::
The point of using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) for expression evaluation rather than
plain Python is two-fold: 1) large [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) objects are
evaluated more efficiently and 2) large arithmetic and boolean expressions are
evaluated all at once by the underlying engine (by default ``numexpr`` is used
for evaluation).
::: tip Note
You should not use [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) for simple
expressions or for expressions involving small DataFrames. In fact,
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) is many orders of magnitude slower for
smaller expressions/objects than plain ol Python. A good rule of thumb is
to only use [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) when you have a
``DataFrame`` with more than 10,000 rows.
:::
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) supports all arithmetic expressions supported by the
engine in addition to some extensions available only in pandas.
::: tip Note
The larger the frame and the larger the expression the more speedup you will
see from using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval).
:::
### Supported syntax
These operations are supported by [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval):
- Arithmetic operations except for the left shift (``<<``) and right shift
(``>>``) operators, e.g., ``df + 2 * pi / s ** 4 % 42 - the_golden_ratio``
- Comparison operations, including chained comparisons, e.g., ``2 < df < df2``
- Boolean operations, e.g., ``df < df2 and df3 < df4 or not df_bool``
- ``list`` and ``tuple`` literals, e.g., ``[1, 2]`` or ``(1, 2)``
- Attribute access, e.g., ``df.a``
- Subscript expressions, e.g., ``df[0]``
- Simple variable evaluation, e.g., ``pd.eval('df')`` (this is not very useful)
- Math functions: *sin*, *cos*, *exp*, *log*, *expm1*, *log1p*,
*sqrt*, *sinh*, *cosh*, *tanh*, *arcsin*, *arccos*, *arctan*, *arccosh*,
*arcsinh*, *arctanh*, *abs*, *arctan2* and *log10*.
This Python syntax is **not** allowed:
- Expressions
- Function calls other than math functions.
- ``is``/``is not`` operations
- ``if`` expressions
- ``lambda`` expressions
- ``list``/``set``/``dict`` comprehensions
- Literal ``dict`` and ``set`` expressions
- ``yield`` expressions
- Generator expressions
- Boolean expressions consisting of only scalar values
- Statements
- Neither [simple](https://docs.python.org/3/reference/simple_stmts.html)
nor [compound](https://docs.python.org/3/reference/compound_stmts.html)
statements are allowed. This includes things like ``for``, ``while``, and
``if``.
### ``eval()`` examples
[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) works well with expressions containing large arrays.
First lets create a few decent-sized arrays to play with:
``` python
In [13]: nrows, ncols = 20000, 100
In [14]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]
```
Now lets compare adding them together using plain ol Python versus
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval):
``` python
In [15]: %timeit df1 + df2 + df3 + df4
21 ms +- 787 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
```
``` python
In [16]: %timeit pd.eval('df1 + df2 + df3 + df4')
8.12 ms +- 249 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
```
Now lets do the same thing but with comparisons:
``` python
In [17]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
272 ms +- 6.92 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
```
``` python
In [18]: %timeit pd.eval('(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)')
19.2 ms +- 1.87 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
```
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) also works with unaligned pandas objects:
``` python
In [19]: s = pd.Series(np.random.randn(50))
In [20]: %timeit df1 + df2 + df3 + df4 + s
103 ms +- 12.7 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
```
``` python
In [21]: %timeit pd.eval('df1 + df2 + df3 + df4 + s')
10.2 ms +- 215 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
```
::: tip Note
Operations such as
``` python
1 and 2 # would parse to 1 & 2, but should evaluate to 2
3 or 4 # would parse to 3 | 4, but should evaluate to 3
~1 # this is okay, but slower when using eval
```
should be performed in Python. An exception will be raised if you try to
perform any boolean/bitwise operations with scalar operands that are not
of type ``bool`` or ``np.bool_``. Again, you should perform these kinds of
operations in plain Python.
:::
### The ``DataFrame.eval`` method
In addition to the top level [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) function you can also
evaluate an expression in the “context” of a [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame).
``` python
In [22]: df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
In [23]: df.eval('a + b')
Out[23]:
0 -0.246747
1 0.867786
2 -1.626063
3 -1.134978
4 -1.027798
dtype: float64
```
Any expression that is a valid [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) expression is also a valid
[``DataFrame.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eval.html#pandas.DataFrame.eval) expression, with the added benefit that you dont have to
prefix the name of the [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) to the column(s) youre
interested in evaluating.
In addition, you can perform assignment of columns within an expression.
This allows for *formulaic evaluation*. The assignment target can be a
new column name or an existing column name, and it must be a valid Python
identifier.
*New in version 0.18.0.*
The ``inplace`` keyword determines whether this assignment will performed
on the original ``DataFrame`` or return a copy with the new column.
::: danger Warning
For backwards compatibility, ``inplace`` defaults to ``True`` if not
specified. This will change in a future version of pandas - if your
code depends on an inplace assignment you should update to explicitly
set ``inplace=True``.
:::
``` python
In [24]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
In [25]: df.eval('c = a + b', inplace=True)
In [26]: df.eval('d = a + b + c', inplace=True)
In [27]: df.eval('a = 1', inplace=True)
In [28]: df
Out[28]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
```
When ``inplace`` is set to ``False``, a copy of the ``DataFrame`` with the
new or modified columns is returned and the original frame is unchanged.
``` python
In [29]: df
Out[29]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
In [30]: df.eval('e = a - c', inplace=False)
Out[30]:
a b c d e
0 1 5 5 10 -4
1 1 6 7 14 -6
2 1 7 9 18 -8
3 1 8 11 22 -10
4 1 9 13 26 -12
In [31]: df
Out[31]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
```
*New in version 0.18.0.*
As a convenience, multiple assignments can be performed by using a
multi-line string.
``` python
In [32]: df.eval("""
....: c = a + b
....: d = a + b + c
....: a = 1""", inplace=False)
....:
Out[32]:
a b c d
0 1 5 6 12
1 1 6 7 14
2 1 7 8 16
3 1 8 9 18
4 1 9 10 20
```
The equivalent in standard Python would be
``` python
In [33]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
In [34]: df['c'] = df.a + df.b
In [35]: df['d'] = df.a + df.b + df.c
In [36]: df['a'] = 1
In [37]: df
Out[37]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
```
*New in version 0.18.0.*
The ``query`` method gained the ``inplace`` keyword which determines
whether the query modifies the original frame.
``` python
In [38]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
In [39]: df.query('a > 2')
Out[39]:
a b
3 3 8
4 4 9
In [40]: df.query('a > 2', inplace=True)
In [41]: df
Out[41]:
a b
3 3 8
4 4 9
```
::: danger Warning
Unlike with ``eval``, the default value for ``inplace`` for ``query``
is ``False``. This is consistent with prior versions of pandas.
:::
### Local variables
You must *explicitly reference* any local variable that you want to use in an
expression by placing the ``@`` character in front of the name. For example,
``` python
In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=list('ab'))
In [43]: newcol = np.random.randn(len(df))
In [44]: df.eval('b + @newcol')
Out[44]:
0 -0.173926
1 2.493083
2 -0.881831
3 -0.691045
4 1.334703
dtype: float64
In [45]: df.query('b < @newcol')
Out[45]:
a b
0 0.863987 -0.115998
2 -2.621419 -1.297879
```
If you dont prefix the local variable with ``@``, pandas will raise an
exception telling you the variable is undefined.
When using [``DataFrame.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eval.html#pandas.DataFrame.eval) and [``DataFrame.query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query), this allows you
to have a local variable and a [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) column with the same
name in an expression.
``` python
In [46]: a = np.random.randn()
In [47]: df.query('@a < a')
Out[47]:
a b
0 0.863987 -0.115998
In [48]: df.loc[a < df.a] # same as the previous expression
Out[48]:
a b
0 0.863987 -0.115998
```
With [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) you cannot use the ``@`` prefix *at all*, because it
isnt defined in that context. ``pandas`` will let you know this if you try to
use ``@`` in a top-level call to [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval). For example,
``` python
In [49]: a, b = 1, 2
In [50]: pd.eval('@a + b')
Traceback (most recent call last):
File "/opt/conda/envs/pandas/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-50-af17947a194f>", line 1, in <module>
pd.eval('@a + b')
File "/pandas/pandas/core/computation/eval.py", line 311, in eval
_check_for_locals(expr, level, parser)
File "/pandas/pandas/core/computation/eval.py", line 166, in _check_for_locals
raise SyntaxError(msg)
File "<string>", line unknown
SyntaxError: The '@' prefix is not allowed in top-level eval calls,
please refer to your variables by name without the '@' prefix
```
In this case, you should simply refer to the variables like you would in
standard Python.
``` python
In [51]: pd.eval('a + b')
Out[51]: 3
```
### ``pandas.eval()`` parsers
There are two different parsers and two different engines you can use as
the backend.
The default ``'pandas'`` parser allows a more intuitive syntax for expressing
query-like operations (comparisons, conjunctions and disjunctions). In
particular, the precedence of the ``&`` and ``|`` operators is made equal to
the precedence of the corresponding boolean operations ``and`` and ``or``.
For example, the above conjunction can be written without parentheses.
Alternatively, you can use the ``'python'`` parser to enforce strict Python
semantics.
``` python
In [52]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'
In [53]: x = pd.eval(expr, parser='python')
In [54]: expr_no_parens = 'df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0'
In [55]: y = pd.eval(expr_no_parens, parser='pandas')
In [56]: np.all(x == y)
Out[56]: True
```
The same expression can be “anded” together with the word [``and``](https://docs.python.org/3/reference/expressions.html#and) as
well:
``` python
In [57]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'
In [58]: x = pd.eval(expr, parser='python')
In [59]: expr_with_ands = 'df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0'
In [60]: y = pd.eval(expr_with_ands, parser='pandas')
In [61]: np.all(x == y)
Out[61]: True
```
The ``and`` and ``or`` operators here have the same precedence that they would
in vanilla Python.
### ``pandas.eval()`` backends
Theres also the option to make [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) operate identical to plain
ol Python.
::: tip Note
Using the ``'python'`` engine is generally *not* useful, except for testing
other evaluation engines against it. You will achieve **no** performance
benefits using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) with ``engine='python'`` and in fact may
incur a performance hit.
:::
You can see this by using [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) with the ``'python'`` engine. It
is a bit slower (not by much) than evaluating the same expression in Python
``` python
In [62]: %timeit df1 + df2 + df3 + df4
9.5 ms +- 241 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
```
``` python
In [63]: %timeit pd.eval('df1 + df2 + df3 + df4', engine='python')
10.8 ms +- 898 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
```
### ``pandas.eval()`` performance
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) is intended to speed up certain kinds of operations. In
particular, those operations involving complex expressions with large
[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame)/[``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) objects should see a
significant performance benefit. Here is a plot showing the running time of
[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) as function of the size of the frame involved in the
computation. The two lines are two different engines.
![eval-perf](https://static.pypandas.cn/public/static/images/eval-perf.png)
::: tip Note
Operations with smallish objects (around 15k-20k rows) are faster using
plain Python:
![eval-perf-small](https://static.pypandas.cn/public/static/images/eval-perf-small.png)
:::
This plot was created using a ``DataFrame`` with 3 columns each containing
floating point values generated using ``numpy.random.randn()``.
### Technical minutia regarding expression evaluation
Expressions that would result in an object dtype or involve datetime operations
(because of ``NaT``) must be evaluated in Python space. The main reason for
this behavior is to maintain backwards compatibility with versions of NumPy <
1.7. In those versions of NumPy a call to ``ndarray.astype(str)`` will
truncate any strings that are more than 60 characters in length. Second, we
cant pass ``object`` arrays to ``numexpr`` thus string comparisons must be
evaluated in Python space.
The upshot is that this *only* applies to object-dtype expressions. So, if
you have an expressionfor example
``` python
In [64]: df = pd.DataFrame({'strings': np.repeat(list('cba'), 3),
....: 'nums': np.repeat(range(3), 3)})
....:
In [65]: df
Out[65]:
strings nums
0 c 0
1 c 0
2 c 0
3 b 1
4 b 1
5 b 1
6 a 2
7 a 2
8 a 2
In [66]: df.query('strings == "a" and nums == 1')
Out[66]:
Empty DataFrame
Columns: [strings, nums]
Index: []
```
the numeric part of the comparison (``nums == 1``) will be evaluated by
``numexpr``.
In general, [``DataFrame.query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query)/[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) will
evaluate the subexpressions that *can* be evaluated by ``numexpr`` and those
that must be evaluated in Python space transparently to the user. This is done
by inferring the result type of an expression from its arguments and operators.

View File

@@ -0,0 +1,429 @@
# Frequently Asked Questions (FAQ)
## DataFrame memory usage
The memory usage of a ``DataFrame`` (including the index) is shown when calling
the [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info). A configuration option, ``display.memory_usage``
(see [the list of options](options.html#options-available)), specifies if the
``DataFrame``s memory usage will be displayed when invoking the ``df.info()``
method.
For example, the memory usage of the ``DataFrame`` below is shown
when calling [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info):
``` python
In [1]: dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
...: 'complex128', 'object', 'bool']
...:
In [2]: n = 5000
In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}
In [4]: df = pd.DataFrame(data)
In [5]: df['categorical'] = df['object'].astype('category')
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
int64 5000 non-null int64
float64 5000 non-null float64
datetime64[ns] 5000 non-null datetime64[ns]
timedelta64[ns] 5000 non-null timedelta64[ns]
complex128 5000 non-null complex128
object 5000 non-null object
bool 5000 non-null bool
categorical 5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 289.1+ KB
```
The ``+`` symbol indicates that the true memory usage could be higher, because
pandas does not count the memory used by values in columns with
``dtype=object``.
Passing ``memory_usage='deep'`` will enable a more accurate memory usage report,
accounting for the full usage of the contained objects. This is optional
as it can be expensive to do this deeper introspection.
``` python
In [7]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
int64 5000 non-null int64
float64 5000 non-null float64
datetime64[ns] 5000 non-null datetime64[ns]
timedelta64[ns] 5000 non-null timedelta64[ns]
complex128 5000 non-null complex128
object 5000 non-null object
bool 5000 non-null bool
categorical 5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 425.6 KB
```
By default the display option is set to ``True`` but can be explicitly
overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.
The memory usage of each column can be found by calling the
[``memory_usage()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html#pandas.DataFrame.memory_usage) method. This returns a ``Series`` with an index
represented by column names and memory usage of each column shown in bytes. For
the ``DataFrame`` above, the memory usage of each column and the total memory
usage can be found with the ``memory_usage`` method:
``` python
In [8]: df.memory_usage()
Out[8]:
Index 128
int64 40000
float64 40000
datetime64[ns] 40000
timedelta64[ns] 40000
complex128 80000
object 40000
bool 5000
categorical 10920
dtype: int64
# total memory usage of dataframe
In [9]: df.memory_usage().sum()
Out[9]: 296048
```
By default the memory usage of the ``DataFrame``s index is shown in the
returned ``Series``, the memory usage of the index can be suppressed by passing
the ``index=False`` argument:
``` python
In [10]: df.memory_usage(index=False)
Out[10]:
int64 40000
float64 40000
datetime64[ns] 40000
timedelta64[ns] 40000
complex128 80000
object 40000
bool 5000
categorical 10920
dtype: int64
```
The memory usage displayed by the [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) method utilizes the
[``memory_usage()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html#pandas.DataFrame.memory_usage) method to determine the memory usage of a
``DataFrame`` while also formatting the output in human-readable units (base-2
representation; i.e. 1KB = 1024 bytes).
See also [Categorical Memory Usage](categorical.html#categorical-memory).
## Using if/truth statements with pandas
pandas follows the NumPy convention of raising an error when you try to convert
something to a ``bool``. This happens in an ``if``-statement or when using the
boolean operations: ``and``, ``or``, and ``not``. It is not clear what the result
of the following code should be:
``` python
>>> if pd.Series([False, True, False]):
... pass
```
Should it be ``True`` because its not zero-length, or ``False`` because there
are ``False`` values? It is unclear, so instead, pandas raises a ``ValueError``:
``` python
>>> if pd.Series([False, True, False]):
... print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
```
You need to explicitly choose what you want to do with the ``DataFrame``, e.g.
use [``any()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html#pandas.DataFrame.any), [``all()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.all.html#pandas.DataFrame.all) or [``empty()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.empty.html#pandas.DataFrame.empty).
Alternatively, you might want to compare if the pandas object is ``None``:
``` python
>>> if pd.Series([False, True, False]) is not None:
... print("I was not None")
I was not None
```
Below is how to check if any of the values are ``True``:
``` python
>>> if pd.Series([False, True, False]).any():
... print("I am any")
I am any
```
To evaluate single-element pandas objects in a boolean context, use the method
[``bool()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bool.html#pandas.DataFrame.bool):
``` python
In [11]: pd.Series([True]).bool()
Out[11]: True
In [12]: pd.Series([False]).bool()
Out[12]: False
In [13]: pd.DataFrame([[True]]).bool()
Out[13]: True
In [14]: pd.DataFrame([[False]]).bool()
Out[14]: False
```
### Bitwise boolean
Bitwise boolean operators like ``==`` and ``!=`` return a boolean ``Series``,
which is almost always what you want anyways.
``` python
>>> s = pd.Series(range(5))
>>> s == 4
0 False
1 False
2 False
3 False
4 True
dtype: bool
```
See [boolean comparisons](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-compare) for more examples.
### Using the ``in`` operator
Using the Python ``in`` operator on a ``Series`` tests for membership in the
index, not membership among the values.
``` python
In [15]: s = pd.Series(range(5), index=list('abcde'))
In [16]: 2 in s
Out[16]: False
In [17]: 'b' in s
Out[17]: True
```
If this behavior is surprising, keep in mind that using ``in`` on a Python
dictionary tests keys, not values, and ``Series`` are dict-like.
To test for membership in the values, use the method [``isin()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html#pandas.Series.isin):
``` python
In [18]: s.isin([2])
Out[18]:
a False
b False
c True
d False
e False
dtype: bool
In [19]: s.isin([2]).any()
Out[19]: True
```
For ``DataFrames``, likewise, ``in`` applies to the column axis,
testing for membership in the list of column names.
## ``NaN``, Integer ``NA`` values and ``NA`` type promotions
### Choice of ``NA`` representation
For lack of ``NA`` (missing) support from the ground up in NumPy and Python in
general, we were given the difficult choice between either:
- A *masked array* solution: an array of data and an array of boolean values
indicating whether a value is there or is missing.
- Using a special sentinel value, bit pattern, or set of sentinel values to
denote ``NA`` across the dtypes.
For many reasons we chose the latter. After years of production use it has
proven, at least in my opinion, to be the best decision given the state of
affairs in NumPy and Python in general. The special value ``NaN``
(Not-A-Number) is used everywhere as the ``NA`` value, and there are API
functions ``isna`` and ``notna`` which can be used across the dtypes to
detect NA values.
However, it comes with it a couple of trade-offs which I most certainly have
not ignored.
### Support for integer ``NA``
In the absence of high performance ``NA`` support being built into NumPy from
the ground up, the primary casualty is the ability to represent NAs in integer
arrays. For example:
``` python
In [20]: s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
In [21]: s
Out[21]:
a 1
b 2
c 3
d 4
e 5
dtype: int64
In [22]: s.dtype
Out[22]: dtype('int64')
In [23]: s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])
In [24]: s2
Out[24]:
a 1.0
b 2.0
c 3.0
f NaN
u NaN
dtype: float64
In [25]: s2.dtype
Out[25]: dtype('float64')
```
This trade-off is made largely for memory and performance reasons, and also so
that the resulting ``Series`` continues to be “numeric”.
If you need to represent integers with possibly missing values, use one of
the nullable-integer extension dtypes provided by pandas
- [``Int8Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int8Dtype.html#pandas.Int8Dtype)
- [``Int16Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int16Dtype.html#pandas.Int16Dtype)
- [``Int32Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int32Dtype.html#pandas.Int32Dtype)
- [``Int64Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int64Dtype.html#pandas.Int64Dtype)
``` python
In [26]: s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
....: dtype=pd.Int64Dtype())
....:
In [27]: s_int
Out[27]:
a 1
b 2
c 3
d 4
e 5
dtype: Int64
In [28]: s_int.dtype
Out[28]: Int64Dtype()
In [29]: s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
In [30]: s2_int
Out[30]:
a 1
b 2
c 3
f NaN
u NaN
dtype: Int64
In [31]: s2_int.dtype
Out[31]: Int64Dtype()
```
See [Nullable integer data type](integer_na.html#integer-na) for more.
### ``NA`` type promotions
When introducing NAs into an existing ``Series`` or ``DataFrame`` via
[``reindex()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html#pandas.Series.reindex) or some other means, boolean and integer types will be
promoted to a different dtype in order to store the NAs. The promotions are
summarized in this table:
Typeclass | Promotion dtype for storing NAs
---|---
floating | no change
object | no change
integer | cast to float64
boolean | cast to object
While this may seem like a heavy trade-off, I have found very few cases where
this is an issue in practice i.e. storing values greater than 2**53. Some
explanation for the motivation is in the next section.
### Why not make NumPy like R?
Many people have suggested that NumPy should simply emulate the ``NA`` support
present in the more domain-specific statistical programming language [R](https://r-project.org). Part of the reason is the NumPy type hierarchy:
Typeclass | Dtypes
---|---
numpy.floating | float16, float32, float64, float128
numpy.integer | int8, int16, int32, int64
numpy.unsignedinteger | uint8, uint16, uint32, uint64
numpy.object_ | object_
numpy.bool_ | bool_
numpy.character | string_, unicode_
The R language, by contrast, only has a handful of built-in data types:
``integer``, ``numeric`` (floating-point), ``character``, and
``boolean``. ``NA`` types are implemented by reserving special bit patterns for
each type to be used as the missing value. While doing this with the full NumPy
type hierarchy would be possible, it would be a more substantial trade-off
(especially for the 8- and 16-bit data types) and implementation undertaking.
An alternate approach is that of using masked arrays. A masked array is an
array of data with an associated boolean *mask* denoting whether each value
should be considered ``NA`` or not. I am personally not in love with this
approach as I feel that overall it places a fairly heavy burden on the user and
the library implementer. Additionally, it exacts a fairly high performance cost
when working with numerical data compared with the simple approach of using
``NaN``. Thus, I have chosen the Pythonic “practicality beats purity” approach
and traded integer ``NA`` capability for a much simpler approach of using a
special value in float and object arrays to denote ``NA``, and promoting
integer arrays to floating when NAs must be introduced.
## Differences with NumPy
For ``Series`` and ``DataFrame`` objects, [``var()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html#pandas.DataFrame.var) normalizes by
``N-1`` to produce unbiased estimates of the sample variance, while NumPys
``var`` normalizes by N, which measures the variance of the sample. Note that
[``cov()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cov.html#pandas.DataFrame.cov) normalizes by ``N-1`` in both pandas and NumPy.
## Thread-safety
As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to
the [``copy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html#pandas.DataFrame.copy) method. If you are doing a lot of copying of
``DataFrame`` objects shared among threads, we recommend holding locks inside
the threads where the data copying occurs.
See [this link](https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe)
for more information.
## Byte-Ordering issues
Occasionally you may have to deal with data that were created on a machine with
a different byte order than the one on which you are running Python. A common
symptom of this issue is an error like::
``` python
Traceback
...
ValueError: Big-endian buffer not supported on little-endian compiler
```
To deal
with this issue you should convert the underlying NumPy array to the native
system byte order *before* passing it to ``Series`` or ``DataFrame``
constructors using something similar to the following:
``` python
In [32]: x = np.array(list(range(10)), '>i4') # big endian
In [33]: newx = x.byteswap().newbyteorder() # force native byteorder
In [34]: s = pd.Series(newx)
```
See [the NumPy documentation on byte order](https://docs.scipy.org/doc/numpy/user/basics.byteswapping.html) for more
details.

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,175 @@
---
meta:
- name: keywords
content: Nullable,整型数据类型
- name: description
content: 在处理丢失的数据部分, 我们知道pandas主要使用 NaN 来代表丢失数据。因为 NaN 属于浮点型数据这强制有缺失值的整型array强制转换成浮点型。
---
# Nullable整型数据类型
*在0.24.0版本中新引入*
::: tip 小贴士
IntegerArray目前属于实验性阶段因此他的API或者使用方式可能会在没有提示的情况下更改。
:::
在 [处理丢失的数据](missing_data.html#missing-data)部分, 我们知道pandas主要使用 ``NaN`` 来代表丢失数据。因为 ``NaN`` 属于浮点型数据这强制有缺失值的整型array强制转换成浮点型。在某些情况下这可能不会有太大影响但是如果你的整型数据恰好是标识符数据类型的转换可能会存在隐患。同时某些整数无法使用浮点型来表示。
Pandas能够将可能存在缺失值的整型数据使用[``arrays.IntegerArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.arrays.IntegerArray.html#pandas.arrays.IntegerArray)来表示。这是pandas中内置的 [扩展方式](https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-extension-types)。 它并不是整型数据组成array对象的默认方式并且并不会被pandas直接使用。因此如果你希望生成这种数据类型你需要在生成[``array()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.array.html#pandas.array) 或者 [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)时,在``dtype``变量中直接指定。
``` python
In [1]: arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
In [2]: arr
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
```
或者使用字符串``"Int64"``(注意此处的 ``"I"``需要大写以此和NumPy中的``'int64'``数据类型作出区别):
``` python
In [3]: pd.array([1, 2, np.nan], dtype="Int64")
Out[3]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
```
这样的array对象与NumPy的array对象类似可以被存放在[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) 或 [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)中。
``` python
In [4]: pd.Series(arr)
Out[4]:
0 1
1 2
2 NaN
dtype: Int64
```
你也可以直接将列表形式的数据直接传入[``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)中,并指明``dtype``。
``` python
In [5]: s = pd.Series([1, 2, np.nan], dtype="Int64")
In [6]: s
Out[6]:
0 1
1 2
2 NaN
dtype: Int64
```
默认情况下(如果你不指明``dtype``则会使用NumPy来构建这个数据最终你会得到``float64``类型的Series
``` python
In [7]: pd.Series([1, 2, np.nan])
Out[7]:
0 1.0
1 2.0
2 NaN
dtype: float64
```
对使用了整型array的操作与对NumPy中array的操作类似缺失值会被继承并保留原本的数据类型但在必要的情况下数据类型也会发生转变。
``` python
# 运算
In [8]: s + 1
Out[8]:
0 2
1 3
2 NaN
dtype: Int64
# 比较
In [9]: s == 1
Out[9]:
0 True
1 False
2 False
dtype: bool
# 索引
In [10]: s.iloc[1:3]
Out[10]:
1 2
2 NaN
dtype: Int64
# 和其他数据类型联合使用
In [11]: s + s.iloc[1:3].astype('Int8')
Out[11]:
0 NaN
1 4
2 NaN
dtype: Int64
# 在必要情况下,数据类型发生转变
In [12]: s + 0.01
Out[12]:
0 1.01
1 2.01
2 NaN
dtype: float64
```
这种数据类型可以作为 ``DataFrame``的一部分进行使用。
``` python
In [13]: df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
In [14]: df
Out[14]:
A B C
0 1 1 a
1 2 1 a
2 NaN 3 b
In [15]: df.dtypes
Out[15]:
A Int64
B int64
C object
dtype: object
```
这种数据类型也可以在合并merge、重构reshape和类型转换cast
``` python
In [16]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
Out[16]:
A Int64
B int64
C object
dtype: object
In [17]: df['A'].astype(float)
Out[17]:
0 1.0
1 2.0
2 NaN
Name: A, dtype: float64
```
类似于求和的降维和分组操作也能正常使用。
``` python
In [18]: df.sum()
Out[18]:
A 3
B 5
C aab
dtype: object
In [19]: df.groupby('B').A.sum()
Out[19]:
B
1 3
3 0
Name: A, dtype: Int64
```

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,711 @@
# Options and settings
## Overview
pandas has an options system that lets you customize some aspects of its behaviour,
display-related options being those the user is most likely to adjust.
Options have a full “dotted-style”, case-insensitive name (e.g. ``display.max_rows``).
You can get/set options directly as attributes of the top-level ``options`` attribute:
``` python
In [1]: import pandas as pd
In [2]: pd.options.display.max_rows
Out[2]: 15
In [3]: pd.options.display.max_rows = 999
In [4]: pd.options.display.max_rows
Out[4]: 999
```
The API is composed of 5 relevant functions, available directly from the ``pandas``
namespace:
- [``get_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_option.html#pandas.get_option) / [``set_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html#pandas.set_option) - get/set the value of a single option.
- [``reset_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.reset_option.html#pandas.reset_option) - reset one or more options to their default value.
- [``describe_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.describe_option.html#pandas.describe_option) - print the descriptions of one or more options.
- [``option_context()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.option_context.html#pandas.option_context) - execute a codeblock with a set of options
that revert to prior settings after execution.
**Note:** Developers can check out [pandas/core/config.py](https://github.com/pandas-dev/pandas/blob/master/pandas/core/config.py) for more information.
All of the functions above accept a regexp pattern (``re.search`` style) as an argument,
and so passing in a substring will work - as long as it is unambiguous:
``` python
In [5]: pd.get_option("display.max_rows")
Out[5]: 999
In [6]: pd.set_option("display.max_rows", 101)
In [7]: pd.get_option("display.max_rows")
Out[7]: 101
In [8]: pd.set_option("max_r", 102)
In [9]: pd.get_option("display.max_rows")
Out[9]: 102
```
The following will **not work** because it matches multiple option names, e.g.
``display.max_colwidth``, ``display.max_rows``, ``display.max_columns``:
``` python
In [10]: try:
....: pd.get_option("column")
....: except KeyError as e:
....: print(e)
....:
'Pattern matched multiple keys'
```
**Note:** Using this form of shorthand may cause your code to break if new options with similar names are added in future versions.
You can get a list of available options and their descriptions with ``describe_option``. When called
with no argument ``describe_option`` will print out the descriptions for all available options.
## Getting and setting options
As described above, [``get_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_option.html#pandas.get_option) and [``set_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html#pandas.set_option)
are available from the pandas namespace. To change an option, call
``set_option('option regex', new_value)``.
``` python
In [11]: pd.get_option('mode.sim_interactive')
Out[11]: False
In [12]: pd.set_option('mode.sim_interactive', True)
In [13]: pd.get_option('mode.sim_interactive')
Out[13]: True
```
**Note:** The option mode.sim_interactive is mostly used for debugging purposes.
All options also have a default value, and you can use ``reset_option`` to do just that:
``` python
In [14]: pd.get_option("display.max_rows")
Out[14]: 60
In [15]: pd.set_option("display.max_rows", 999)
In [16]: pd.get_option("display.max_rows")
Out[16]: 999
In [17]: pd.reset_option("display.max_rows")
In [18]: pd.get_option("display.max_rows")
Out[18]: 60
```
Its also possible to reset multiple options at once (using a regex):
``` python
In [19]: pd.reset_option("^display")
```
``option_context`` context manager has been exposed through
the top-level API, allowing you to execute code with given option values. Option values
are restored automatically when you exit the *with* block:
``` python
In [20]: with pd.option_context("display.max_rows", 10, "display.max_columns", 5):
....: print(pd.get_option("display.max_rows"))
....: print(pd.get_option("display.max_columns"))
....:
10
5
In [21]: print(pd.get_option("display.max_rows"))
60
In [22]: print(pd.get_option("display.max_columns"))
0
```
## Setting startup options in Python/IPython environment
Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas more efficient. To do this, create a .py or .ipy script in the startup directory of the desired profile. An example where the startup folder is in a default ipython profile can be found at:
```
$IPYTHONDIR/profile_default/startup
```
More information can be found in the [ipython documentation](https://ipython.org/ipython-doc/stable/interactive/tutorial.html#startup-files). An example startup script for pandas is displayed below:
``` python
import pandas as pd
pd.set_option('display.max_rows', 999)
pd.set_option('precision', 5)
```
## Frequently Used Options
The following is a walk-through of the more frequently used display options.
``display.max_rows`` and ``display.max_columns`` sets the maximum number
of rows and columns displayed when a frame is pretty-printed. Truncated
lines are replaced by an ellipsis.
``` python
In [23]: df = pd.DataFrame(np.random.randn(7, 2))
In [24]: pd.set_option('max_rows', 7)
In [25]: df
Out[25]:
0 1
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
3 0.119209 -1.044236
4 -0.861849 -2.104569
5 -0.494929 1.071804
6 0.721555 -0.706771
In [26]: pd.set_option('max_rows', 5)
In [27]: df
Out[27]:
0 1
0 0.469112 -0.282863
1 -1.509059 -1.135632
.. ... ...
5 -0.494929 1.071804
6 0.721555 -0.706771
[7 rows x 2 columns]
In [28]: pd.reset_option('max_rows')
```
Once the ``display.max_rows`` is exceeded, the ``display.min_rows`` options
determines how many rows are shown in the truncated repr.
``` python
In [29]: pd.set_option('max_rows', 8)
In [30]: pd.set_option('max_rows', 4)
# below max_rows -> all rows shown
In [31]: df = pd.DataFrame(np.random.randn(7, 2))
In [32]: df
Out[32]:
0 1
0 -1.039575 0.271860
1 -0.424972 0.567020
.. ... ...
5 0.404705 0.577046
6 -1.715002 -1.039268
[7 rows x 2 columns]
# above max_rows -> only min_rows (4) rows shown
In [33]: df = pd.DataFrame(np.random.randn(9, 2))
In [34]: df
Out[34]:
0 1
0 -0.370647 -1.157892
1 -1.344312 0.844885
.. ... ...
7 0.276662 -0.472035
8 -0.013960 -0.362543
[9 rows x 2 columns]
In [35]: pd.reset_option('max_rows')
In [36]: pd.reset_option('min_rows')
```
``display.expand_frame_repr`` allows for the representation of
dataframes to stretch across pages, wrapped over the full column vs row-wise.
``` python
In [37]: df = pd.DataFrame(np.random.randn(5, 10))
In [38]: pd.set_option('expand_frame_repr', True)
In [39]: df
Out[39]:
0 1 2 3 4 5 6 7 8 9
0 -0.006154 -0.923061 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
1 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 1.607920
2 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747 -0.410001 -0.078638 0.545952 -1.219217
3 -1.226825 0.769804 -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734 0.959726 -1.110336
4 -0.619976 0.149748 -0.732339 0.687738 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
In [40]: pd.set_option('expand_frame_repr', False)
In [41]: df
Out[41]:
0 1 2 3 4 5 6 7 8 9
0 -0.006154 -0.923061 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
1 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 1.607920
2 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747 -0.410001 -0.078638 0.545952 -1.219217
3 -1.226825 0.769804 -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734 0.959726 -1.110336
4 -0.619976 0.149748 -0.732339 0.687738 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
In [42]: pd.reset_option('expand_frame_repr')
```
``display.large_repr`` lets you select whether to display dataframes that exceed
``max_columns`` or ``max_rows`` as a truncated frame, or as a summary.
``` python
In [43]: df = pd.DataFrame(np.random.randn(10, 10))
In [44]: pd.set_option('max_rows', 5)
In [45]: pd.set_option('large_repr', 'truncate')
In [46]: df
Out[46]:
0 1 2 3 4 5 6 7 8 9
0 -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232 0.690579 0.995761 2.396780 0.014871
1 3.357427 -0.317441 -1.236269 0.896171 -0.487602 -0.082240 -2.182937 0.380396 0.084844 0.432390
.. ... ... ... ... ... ... ... ... ... ...
8 -0.303421 -0.858447 0.306996 -0.028665 0.384316 1.574159 1.588931 0.476720 0.473424 -0.242861
9 -0.014805 -0.284319 0.650776 -1.461665 -1.137707 -0.891060 -0.693921 1.613616 0.464000 0.227371
[10 rows x 10 columns]
In [47]: pd.set_option('large_repr', 'info')
In [48]: df
Out[48]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
0 10 non-null float64
1 10 non-null float64
2 10 non-null float64
3 10 non-null float64
4 10 non-null float64
5 10 non-null float64
6 10 non-null float64
7 10 non-null float64
8 10 non-null float64
9 10 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [49]: pd.reset_option('large_repr')
In [50]: pd.reset_option('max_rows')
```
``display.max_colwidth`` sets the maximum width of columns. Cells
of this length or longer will be truncated with an ellipsis.
``` python
In [51]: df = pd.DataFrame(np.array([['foo', 'bar', 'bim', 'uncomfortably long string'],
....: ['horse', 'cow', 'banana', 'apple']]))
....:
In [52]: pd.set_option('max_colwidth', 40)
In [53]: df
Out[53]:
0 1 2 3
0 foo bar bim uncomfortably long string
1 horse cow banana apple
In [54]: pd.set_option('max_colwidth', 6)
In [55]: df
Out[55]:
0 1 2 3
0 foo bar bim un...
1 horse cow ba... apple
In [56]: pd.reset_option('max_colwidth')
```
``display.max_info_columns`` sets a threshold for when by-column info
will be given.
``` python
In [57]: df = pd.DataFrame(np.random.randn(10, 10))
In [58]: pd.set_option('max_info_columns', 11)
In [59]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
0 10 non-null float64
1 10 non-null float64
2 10 non-null float64
3 10 non-null float64
4 10 non-null float64
5 10 non-null float64
6 10 non-null float64
7 10 non-null float64
8 10 non-null float64
9 10 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [60]: pd.set_option('max_info_columns', 5)
In [61]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Columns: 10 entries, 0 to 9
dtypes: float64(10)
memory usage: 928.0 bytes
In [62]: pd.reset_option('max_info_columns')
```
``display.max_info_rows``: ``df.info()`` will usually show null-counts for each column.
For large frames this can be quite slow. ``max_info_rows`` and ``max_info_cols``
limit this null check only to frames with smaller dimensions then specified. Note that you
can specify the option ``df.info(null_counts=True)`` to override on showing a particular frame.
``` python
In [63]: df = pd.DataFrame(np.random.choice([0, 1, np.nan], size=(10, 10)))
In [64]: df
Out[64]:
0 1 2 3 4 5 6 7 8 9
0 0.0 NaN 1.0 NaN NaN 0.0 NaN 0.0 NaN 1.0
1 1.0 NaN 1.0 1.0 1.0 1.0 NaN 0.0 0.0 NaN
2 0.0 NaN 1.0 0.0 0.0 NaN NaN NaN NaN 0.0
3 NaN NaN NaN 0.0 1.0 1.0 NaN 1.0 NaN 1.0
4 0.0 NaN NaN NaN 0.0 NaN NaN NaN 1.0 0.0
5 0.0 1.0 1.0 1.0 1.0 0.0 NaN NaN 1.0 0.0
6 1.0 1.0 1.0 NaN 1.0 NaN 1.0 0.0 NaN NaN
7 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN
8 NaN NaN NaN 0.0 NaN NaN NaN NaN 1.0 NaN
9 0.0 NaN 0.0 NaN NaN 0.0 NaN 1.0 1.0 0.0
In [65]: pd.set_option('max_info_rows', 11)
In [66]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
0 8 non-null float64
1 3 non-null float64
2 7 non-null float64
3 6 non-null float64
4 7 non-null float64
5 6 non-null float64
6 2 non-null float64
7 6 non-null float64
8 6 non-null float64
9 6 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [67]: pd.set_option('max_info_rows', 5)
In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
0 float64
1 float64
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [69]: pd.reset_option('max_info_rows')
```
``display.precision`` sets the output display precision in terms of decimal places.
This is only a suggestion.
``` python
In [70]: df = pd.DataFrame(np.random.randn(5, 5))
In [71]: pd.set_option('precision', 7)
In [72]: df
Out[72]:
0 1 2 3 4
0 -1.1506406 -0.7983341 -0.5576966 0.3813531 1.3371217
1 -1.5310949 1.3314582 -0.5713290 -0.0266708 -1.0856630
2 -1.1147378 -0.0582158 -0.4867681 1.6851483 0.1125723
3 -1.4953086 0.8984347 -0.1482168 -1.5960698 0.1596530
4 0.2621358 0.0362196 0.1847350 -0.2550694 -0.2710197
In [73]: pd.set_option('precision', 4)
In [74]: df
Out[74]:
0 1 2 3 4
0 -1.1506 -0.7983 -0.5577 0.3814 1.3371
1 -1.5311 1.3315 -0.5713 -0.0267 -1.0857
2 -1.1147 -0.0582 -0.4868 1.6851 0.1126
3 -1.4953 0.8984 -0.1482 -1.5961 0.1597
4 0.2621 0.0362 0.1847 -0.2551 -0.2710
```
``display.chop_threshold`` sets at what level pandas rounds to zero when
it displays a Series of DataFrame. This setting does not change the
precision at which the number is stored.
``` python
In [75]: df = pd.DataFrame(np.random.randn(6, 6))
In [76]: pd.set_option('chop_threshold', 0)
In [77]: df
Out[77]:
0 1 2 3 4 5
0 1.2884 0.2946 -1.1658 0.8470 -0.6856 0.6091
1 -0.3040 0.6256 -0.0593 0.2497 1.1039 -1.0875
2 1.9980 -0.2445 0.1362 0.8863 -1.3507 -0.8863
3 -1.0133 1.9209 -0.3882 -2.3144 0.6655 0.4026
4 0.3996 -1.7660 0.8504 0.3881 0.9923 0.7441
5 -0.7398 -1.0549 -0.1796 0.6396 1.5850 1.9067
In [78]: pd.set_option('chop_threshold', .5)
In [79]: df
Out[79]:
0 1 2 3 4 5
0 1.2884 0.0000 -1.1658 0.8470 -0.6856 0.6091
1 0.0000 0.6256 0.0000 0.0000 1.1039 -1.0875
2 1.9980 0.0000 0.0000 0.8863 -1.3507 -0.8863
3 -1.0133 1.9209 0.0000 -2.3144 0.6655 0.0000
4 0.0000 -1.7660 0.8504 0.0000 0.9923 0.7441
5 -0.7398 -1.0549 0.0000 0.6396 1.5850 1.9067
In [80]: pd.reset_option('chop_threshold')
```
``display.colheader_justify`` controls the justification of the headers.
The options are right, and left.
``` python
In [81]: df = pd.DataFrame(np.array([np.random.randn(6),
....: np.random.randint(1, 9, 6) * .1,
....: np.zeros(6)]).T,
....: columns=['A', 'B', 'C'], dtype='float')
....:
In [82]: pd.set_option('colheader_justify', 'right')
In [83]: df
Out[83]:
A B C
0 0.1040 0.1 0.0
1 0.1741 0.5 0.0
2 -0.4395 0.4 0.0
3 -0.7413 0.8 0.0
4 -0.0797 0.4 0.0
5 -0.9229 0.3 0.0
In [84]: pd.set_option('colheader_justify', 'left')
In [85]: df
Out[85]:
A B C
0 0.1040 0.1 0.0
1 0.1741 0.5 0.0
2 -0.4395 0.4 0.0
3 -0.7413 0.8 0.0
4 -0.0797 0.4 0.0
5 -0.9229 0.3 0.0
In [86]: pd.reset_option('colheader_justify')
```
## Available options
Option | Default | Function
---|---|---
display.chop_threshold | None | If set to a float value, all float values smaller then the given threshold will be displayed as exactly 0 by repr and friends.
display.colheader_justify | right | Controls the justification of column headers. used by DataFrameFormatter.
display.column_space | 12 | No description available.
display.date_dayfirst | False | When True, prints and parses dates with the day first, eg 20/01/2005
display.date_yearfirst | False | When True, prints and parses dates with the year first, eg 2005/01/20
display.encoding | UTF-8 | Defaults to the detected encoding of the console. Specifies the encoding to be used for strings returned by to_string, these are generally strings meant to be displayed on the console.
display.expand_frame_repr | True | Whether to print out the full DataFrame repr for wide DataFrames across multiple lines, max_columns is still respected, but the output will wrap-around across multiple “pages” if its width exceeds display.width.
display.float_format | None | The callable should accept a floating point number and return a string with the desired format of the number. This is used in some places like SeriesFormatter. See core.format.EngFormatter for an example.
display.large_repr | truncate | For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can show a truncated table (the default), or switch to the view from df.info() (the behaviour in earlier versions of pandas). allowable settings, [truncate, info]
display.latex.repr | False | Whether to produce a latex DataFrame representation for jupyter frontends that support it.
display.latex.escape | True | Escapes special characters in DataFrames, when using the to_latex method.
display.latex.longtable | False | Specifies if the to_latex method of a DataFrame uses the longtable format.
display.latex.multicolumn | True | Combines columns when using a MultiIndex
display.latex.multicolumn_format | l | Alignment of multicolumn labels
display.latex.multirow | False | Combines rows when using a MultiIndex. Centered instead of top-aligned, separated by clines.
display.max_columns | 0 or 20 | max_rows and max_columns are used in __repr__() methods to decide if to_string() or info() is used to render an object to a string. In case Python/IPython is running in a terminal this is set to 0 by default and pandas will correctly auto-detect the width of the terminal and switch to a smaller format in case all columns would not fit vertically. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection, in which case the default is set to 20. None value means unlimited.
display.max_colwidth | 50 | The maximum width in characters of a column in the repr of a pandas data structure. When the column overflows, a “…” placeholder is embedded in the output.
display.max_info_columns | 100 | max_info_columns is used in DataFrame.info method to decide if per column information will be printed.
display.max_info_rows | 1690785 | df.info() will usually show null-counts for each column. For large frames this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller dimensions then specified.
display.max_rows | 60 | This sets the maximum number of rows pandas should output when printing out various output. For example, this value determines whether the repr() for a dataframe prints out fully or just a truncated or summary repr. None value means unlimited.
display.min_rows | 10 | The numbers of rows to show in a truncated repr (when max_rows is exceeded). Ignored when max_rows is set to None or 0. When set to None, follows the value of max_rows.
display.max_seq_items | 100 | when pretty-printing a long sequence, no more then max_seq_items will be printed. If items are omitted, they will be denoted by the addition of “…” to the resulting string. If set to None, the number of items to be printed is unlimited.
display.memory_usage | True | This specifies if the memory usage of a DataFrame should be displayed when the df.info() method is invoked.
display.multi_sparse | True | “Sparsify” MultiIndex display (dont display repeated elements in outer levels within groups)
display.notebook_repr_html | True | When True, IPython notebook will use html representation for pandas objects (if it is available).
display.pprint_nest_depth | 3 | Controls the number of nested levels to process when pretty-printing
display.precision | 6 | Floating point output precision in terms of number of places after the decimal, for regular formatting as well as scientific notation. Similar to numpys precision print option
display.show_dimensions | truncate | Whether to print out dimensions at the end of DataFrame repr. If truncate is specified, only print out the dimensions if the frame is truncated (e.g. not display all rows and/or columns)
display.width | 80 | Width of the display in characters. In case python/IPython is running in a terminal this can be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width.
display.html.table_schema | False | Whether to publish a Table Schema representation for frontends that support it.
display.html.border | 1 | A border=value attribute is inserted in the ``<table>`` tag for the DataFrame HTML repr.
display.html.use_mathjax | True | When True, Jupyter notebook will process table contents using MathJax, rendering mathematical expressions enclosed by the dollar symbol.
io.excel.xls.writer | xlwt | The default Excel writer engine for xls files.
io.excel.xlsm.writer | openpyxl | The default Excel writer engine for xlsm files. Available options: openpyxl (the default).
io.excel.xlsx.writer | openpyxl | The default Excel writer engine for xlsx files.
io.hdf.default_format | None | default format writing format, if None, then put will default to fixed and append will default to table
io.hdf.dropna_table | True | drop ALL nan rows when appending to a table
io.parquet.engine | None | The engine to use as a default for parquet reading and writing. If None then try pyarrow and fastparquet
mode.chained_assignment | warn | Controls SettingWithCopyWarning: raise, warn, or None. Raise an exception, warn, or no action if trying to use [chained assignment](indexing.html#indexing-evaluation-order).
mode.sim_interactive | False | Whether to simulate interactive mode for purposes of testing.
mode.use_inf_as_na | False | True means treat None, NaN, -INF, INF as NA (old way), False means None and NaN are null, but INF, -INF are not NA (new way).
compute.use_bottleneck | True | Use the bottleneck library to accelerate computation if it is installed.
compute.use_numexpr | True | Use the numexpr library to accelerate computation if it is installed.
plotting.backend | matplotlib | Change the plotting backend to a different backend than the current matplotlib one. Backends can be implemented as third-party libraries implementing the pandas plotting API. They can use other plotting libraries like Bokeh, Altair, etc.
plotting.matplotlib.register_converters | True | Register custom converters with matplotlib. Set to False to de-register.
## Number formatting
pandas also allows you to set how numbers are displayed in the console.
This option is not set through the ``set_options`` API.
Use the ``set_eng_float_format`` function
to alter the floating-point formatting of pandas objects to produce a particular
format.
For instance:
``` python
In [87]: import numpy as np
In [88]: pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)
In [89]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [90]: s / 1.e3
Out[90]:
a 303.638u
b -721.084u
c -622.696u
d 648.250u
e -1.945m
dtype: float64
In [91]: s / 1.e6
Out[91]:
a 303.638n
b -721.084n
c -622.696n
d 648.250n
e -1.945u
dtype: float64
```
To round floats on a case-by-case basis, you can also use [``round()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.round.html#pandas.Series.round) and [``round()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.round.html#pandas.DataFrame.round).
## Unicode formatting
::: danger Warning
Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times slower).
Use only when it is actually required.
:::
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters.
If a DataFrame or Series contains these characters, the default output mode may not align them properly.
::: tip Note
Screen captures are attached for each output to show the actual results.
:::
``` python
In [92]: df = pd.DataFrame({'国籍': ['UK', '日本'], '名前': ['Alice', 'しのぶ']})
In [93]: df
Out[93]:
国籍 名前
0 UK Alice
1 日本 しのぶ
```
![option_unicode01](https://static.pypandas.cn/public/static/images/option_unicode01.png)
Enabling ``display.unicode.east_asian_width`` allows pandas to check each characters “East Asian Width” property.
These characters can be aligned properly by setting this option to ``True``. However, this will result in longer render
times than the standard ``len`` function.
``` python
In [94]: pd.set_option('display.unicode.east_asian_width', True)
In [95]: df
Out[95]:
国籍 名前
0 UK Alice
1 日本 しのぶ
```
![option_unicode02](https://static.pypandas.cn/public/static/images/option_unicode02.png)
In addition, Unicode characters whose width is “Ambiguous” can either be 1 or 2 characters wide depending on the
terminal setting or encoding. The option ``display.unicode.ambiguous_as_wide`` can be used to handle the ambiguity.
By default, an “Ambiguous” characters width, such as “¡” (inverted exclamation) in the example below, is taken to be 1.
``` python
In [96]: df = pd.DataFrame({'a': ['xxx', '¡¡'], 'b': ['yyy', '¡¡']})
In [97]: df
Out[97]:
a b
0 xxx yyy
1 ¡¡ ¡¡
```
![option_unicode03](https://static.pypandas.cn/public/static/images/option_unicode03.png)
Enabling ``display.unicode.ambiguous_as_wide`` makes pandas interpret these characters widths to be 2.
(Note that this option will only be effective when ``display.unicode.east_asian_width`` is enabled.)
However, setting this option incorrectly for your terminal will cause these characters to be aligned incorrectly:
``` python
In [98]: pd.set_option('display.unicode.ambiguous_as_wide', True)
In [99]: df
Out[99]:
a b
0 xxx yyy
1 ¡¡ ¡¡
```
![option_unicode04](https://static.pypandas.cn/public/static/images/option_unicode04.png)
## Table schema display
*New in version 0.20.0.*
``DataFrame`` and ``Series`` will publish a Table Schema representation
by default. False by default, this can be enabled globally with the
``display.html.table_schema`` option:
``` python
In [100]: pd.set_option('display.html.table_schema', True)
```
Only ``'display.max_rows'`` are serialized and published.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,565 @@
# Sparse data structures
::: tip Note
``SparseSeries`` and ``SparseDataFrame`` have been deprecated. Their purpose
is served equally well by a [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) or [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) with
sparse values. See [Migrating](#sparse-migration) for tips on migrating.
:::
Pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical “mostly 0”. Rather, you can view these
objects as being “compressed” where any data matching a specific value (``NaN`` / missing value, though any value
can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
``` python
In [1]: arr = np.random.randn(10)
In [2]: arr[2:-2] = np.nan
In [3]: ts = pd.Series(pd.SparseArray(arr))
In [4]: ts
Out[4]:
0 0.469112
1 -0.282863
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.861849
9 -2.104569
dtype: Sparse[float64, nan]
```
Notice the dtype, ``Sparse[float64, nan]``. The ``nan`` means that elements in the
array that are ``nan`` arent actually stored, only the non-``nan`` elements are.
Those non-``nan`` elements have a ``float64`` dtype.
The sparse objects exist for memory efficiency reasons. Suppose you had a
large, mostly NA ``DataFrame``:
``` python
In [5]: df = pd.DataFrame(np.random.randn(10000, 4))
In [6]: df.iloc[:9998] = np.nan
In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))
In [8]: sdf.head()
Out[8]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
In [9]: sdf.dtypes
Out[9]:
0 Sparse[float64, nan]
1 Sparse[float64, nan]
2 Sparse[float64, nan]
3 Sparse[float64, nan]
dtype: object
In [10]: sdf.sparse.density
Out[10]: 0.0002
```
As you can see, the density (% of values that have not been “compressed”) is
extremely low. This sparse object takes up much less memory on disk (pickled)
and in the Python interpreter.
``` python
In [11]: 'dense : {:0.2f} bytes'.format(df.memory_usage().sum() / 1e3)
Out[11]: 'dense : 320.13 bytes'
In [12]: 'sparse: {:0.2f} bytes'.format(sdf.memory_usage().sum() / 1e3)
Out[12]: 'sparse: 0.22 bytes'
```
Functionally, their behavior should be nearly
identical to their dense counterparts.
## SparseArray
[``SparseArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html#pandas.SparseArray) is a [``ExtensionArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray)
for storing an array of sparse values (see [dtypes](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes) for more
on extension arrays). It is a 1-dimensional ndarray-like object storing
only values distinct from the ``fill_value``:
``` python
In [13]: arr = np.random.randn(10)
In [14]: arr[2:5] = np.nan
In [15]: arr[7:8] = np.nan
In [16]: sparr = pd.SparseArray(arr)
In [17]: sparr
Out[17]:
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
```
A sparse array can be converted to a regular (dense) ndarray with ``numpy.asarray()``
``` python
In [18]: np.asarray(sparr)
Out[18]:
array([-1.9557, -1.6589, nan, nan, nan, 1.1589, 0.1453,
nan, 0.606 , 1.3342])
```
## SparseDtype
The ``SparseArray.dtype`` property stores two pieces of information
1. The dtype of the non-sparse values
1. The scalar fill value
``` python
In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]
```
A [``SparseDtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseDtype.html#pandas.SparseDtype) may be constructed by passing each of these
``` python
In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]
```
The default fill value for a given NumPy dtype is the “missing” value for that dtype,
though it may be overridden.
``` python
In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
....: fill_value=pd.Timestamp('2017-01-01'))
....:
Out[21]: Sparse[datetime64[ns], 2017-01-01 00:00:00]
```
Finally, the string alias ``'Sparse[dtype]'`` may be used to specify a sparse dtype
in many places
``` python
In [22]: pd.array([1, 0, 0, 2], dtype='Sparse[int]')
Out[22]:
[1, 0, 0, 2]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)
```
## Sparse accessor
*New in version 0.24.0.*
Pandas provides a ``.sparse`` accessor, similar to ``.str`` for string data, ``.cat``
for categorical data, and ``.dt`` for datetime-like data. This namespace provides
attributes and methods that are specific to sparse data.
``` python
In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
In [24]: s.sparse.density
Out[24]: 0.5
In [25]: s.sparse.fill_value
Out[25]: 0
```
This accessor is available only on data with ``SparseDtype``, and on the [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)
class itself for creating a Series with sparse data from a scipy COO matrix with.
*New in version 0.25.0.*
A ``.sparse`` accessor has been added for [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) as well.
See [Sparse accessor](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#api-frame-sparse) for more.
## Sparse calculation
You can apply NumPy [ufuncs](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
to ``SparseArray`` and get a ``SparseArray`` as a result.
``` python
In [26]: arr = pd.SparseArray([1., np.nan, np.nan, -2., np.nan])
In [27]: np.abs(arr)
Out[27]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
```
The *ufunc* is also applied to ``fill_value``. This is needed to get
the correct dense result.
``` python
In [28]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)
In [29]: np.abs(arr)
Out[29]:
[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3], dtype=int32)
In [30]: np.abs(arr).to_dense()
Out[30]: array([1., 1., 1., 2., 1.])
```
## Migrating
In older versions of pandas, the ``SparseSeries`` and ``SparseDataFrame`` classes (documented below)
were the preferred way to work with sparse data. With the advent of extension arrays, these subclasses
are no longer needed. Their purpose is better served by using a regular Series or DataFrame with
sparse values instead.
::: tip Note
Theres no performance or memory penalty to using a Series or DataFrame with sparse values,
rather than a SparseSeries or SparseDataFrame.
:::
This section provides some guidance on migrating your code to the new style. As a reminder,
you can use the python warnings module to control warnings. But we recommend modifying
your code, rather than ignoring the warning.
**Construction**
From an array-like, use the regular [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) or
[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) constructors with [``SparseArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html#pandas.SparseArray) values.
``` python
# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})
```
``` python
# New way
In [31]: pd.DataFrame({"A": pd.SparseArray([0, 1])})
Out[31]:
A
0 0
1 1
```
From a SciPy sparse matrix, use [``DataFrame.sparse.from_spmatrix()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.from_spmatrix.html#pandas.DataFrame.sparse.from_spmatrix),
``` python
# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
```
``` python
# New way
In [32]: from scipy import sparse
In [33]: mat = sparse.eye(3)
In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])
In [35]: df.dtypes
Out[35]:
A Sparse[float64, 0.0]
B Sparse[float64, 0.0]
C Sparse[float64, 0.0]
dtype: object
```
**Conversion**
From sparse to dense, use the ``.sparse`` accessors
``` python
In [36]: df.sparse.to_dense()
Out[36]:
A B C
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
In [37]: df.sparse.to_coo()
Out[37]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
```
From dense to sparse, use [``DataFrame.astype()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype) with a [``SparseDtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseDtype.html#pandas.SparseDtype).
``` python
In [38]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})
In [39]: dtype = pd.SparseDtype(int, fill_value=0)
In [40]: dense.astype(dtype)
Out[40]:
A
0 1
1 0
2 0
3 1
```
**Sparse Properties**
Sparse-specific properties, like ``density``, are available on the ``.sparse`` accessor.
``` python
In [41]: df.sparse.density
Out[41]: 0.3333333333333333
```
**General differences**
In a ``SparseDataFrame``, *all* columns were sparse. A [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) can have a mixture of
sparse and dense columns. As a consequence, assigning new columns to a ``DataFrame`` with sparse
values will not automatically convert the input to be sparse.
``` python
# Previous Way
>>> df = pd.SparseDataFrame({"A": [0, 1]})
>>> df['B'] = [0, 0] # implicitly becomes Sparse
>>> df['B'].dtype
Sparse[int64, nan]
```
Instead, youll need to ensure that the values being assigned are sparse
``` python
In [42]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
In [43]: df['B'] = [0, 0] # remains dense
In [44]: df['B'].dtype
Out[44]: dtype('int64')
In [45]: df['B'] = pd.SparseArray([0, 0])
In [46]: df['B'].dtype
Out[46]: Sparse[int64, 0]
```
The ``SparseDataFrame.default_kind`` and ``SparseDataFrame.default_fill_value`` attributes
have no replacement.
## Interaction with scipy.sparse
Use [``DataFrame.sparse.from_spmatrix()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.from_spmatrix.html#pandas.DataFrame.sparse.from_spmatrix) to create a ``DataFrame`` with sparse values from a sparse matrix.
*New in version 0.25.0.*
``` python
In [47]: from scipy.sparse import csr_matrix
In [48]: arr = np.random.random(size=(1000, 5))
In [49]: arr[arr < .9] = 0
In [50]: sp_arr = csr_matrix(arr)
In [51]: sp_arr
Out[51]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 517 stored elements in Compressed Sparse Row format>
In [52]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
In [53]: sdf.head()
Out[53]:
0 1 2 3 4
0 0.956380 0.0 0.0 0.000000 0.0
1 0.000000 0.0 0.0 0.000000 0.0
2 0.000000 0.0 0.0 0.000000 0.0
3 0.000000 0.0 0.0 0.000000 0.0
4 0.999552 0.0 0.0 0.956153 0.0
In [54]: sdf.dtypes
Out[54]:
0 Sparse[float64, 0.0]
1 Sparse[float64, 0.0]
2 Sparse[float64, 0.0]
3 Sparse[float64, 0.0]
4 Sparse[float64, 0.0]
dtype: object
```
All sparse formats are supported, but matrices that are not in [``COOrdinate``](https://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse) format will be converted, copying data as needed.
To convert back to sparse SciPy matrix in COO format, you can use the [``DataFrame.sparse.to_coo()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.to_coo.html#pandas.DataFrame.sparse.to_coo) method:
``` python
In [55]: sdf.sparse.to_coo()
Out[55]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 517 stored elements in COOrdinate format>
```
meth:*Series.sparse.to_coo* is implemented for transforming a ``Series`` with sparse values indexed by a [``MultiIndex``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex) to a [``scipy.sparse.coo_matrix``](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix).
The method requires a ``MultiIndex`` with two or more levels.
``` python
In [56]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
In [57]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
....: (1, 2, 'a', 1),
....: (1, 1, 'b', 0),
....: (1, 1, 'b', 1),
....: (2, 1, 'b', 0),
....: (2, 1, 'b', 1)],
....: names=['A', 'B', 'C', 'D'])
....:
In [58]: s
Out[58]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: float64
In [59]: ss = s.astype('Sparse')
In [60]: ss
Out[60]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: Sparse[float64, nan]
```
In the example below, we transform the ``Series`` to a sparse representation of a 2-d array by specifying that the first and second ``MultiIndex`` levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
``` python
In [61]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B'],
....: column_levels=['C', 'D'],
....: sort_labels=True)
....:
In [62]: A
Out[62]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [63]: A.todense()
Out[63]:
matrix([[0., 0., 1., 3.],
[3., 0., 0., 0.],
[0., 0., 0., 0.]])
In [64]: rows
Out[64]: [(1, 1), (1, 2), (2, 1)]
In [65]: columns
Out[65]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
```
Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
``` python
In [66]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B', 'C'],
....: column_levels=['D'],
....: sort_labels=False)
....:
In [67]: A
Out[67]:
<3x2 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [68]: A.todense()
Out[68]:
matrix([[3., 0.],
[1., 3.],
[0., 0.]])
In [69]: rows
Out[69]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
In [70]: columns
Out[70]: [0, 1]
```
A convenience method [``Series.sparse.from_coo()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sparse.from_coo.html#pandas.Series.sparse.from_coo) is implemented for creating a ``Series`` with sparse values from a ``scipy.sparse.coo_matrix``.
``` python
In [71]: from scipy import sparse
In [72]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
....: shape=(3, 4))
....:
In [73]: A
Out[73]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [74]: A.todense()
Out[74]:
matrix([[0., 0., 1., 2.],
[3., 0., 0., 0.],
[0., 0., 0., 0.]])
```
The default behaviour (with ``dense_index=False``) simply returns a ``Series`` containing
only the non-null entries.
``` python
In [75]: ss = pd.Series.sparse.from_coo(A)
In [76]: ss
Out[76]:
0 2 1.0
3 2.0
1 0 3.0
dtype: Sparse[float64, nan]
```
Specifying ``dense_index=True`` will result in an index that is the Cartesian product of the
row and columns coordinates of the matrix. Note that this will consume a significant amount of memory
(relative to ``dense_index=False``) if the sparse matrix is large (and sparse) enough.
``` python
In [77]: ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)
In [78]: ss_dense
Out[78]:
0 0 NaN
1 NaN
2 1.0
3 2.0
1 0 3.0
1 NaN
2 NaN
3 NaN
2 0 NaN
1 NaN
2 NaN
3 NaN
dtype: Sparse[float64, nan]
```
## Sparse subclasses
The ``SparseSeries`` and ``SparseDataFrame`` classes are deprecated. Visit their
API pages for usage.

View File

@@ -0,0 +1,439 @@
# Styling
*New in version 0.17.1*
Provisional: This is a new feature and still under development. Well be adding features and possibly making breaking changes in future releases. Wed love to hear your feedback.
This document is written as a Jupyter Notebook, and can be viewed or downloaded [here](http://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/style.ipynb).
You can apply **conditional formatting**, the visual styling of a DataFrame depending on the data within, by using the ``DataFrame.style`` property. This is a property that returns a ``Styler`` object, which has useful methods for formatting and displaying DataFrames.
The styling is accomplished using CSS. You write “style functions” that take scalars, ``DataFrame``s or ``Series``, and return *like-indexed* DataFrames or Series with CSS ``"attribute: value"`` pairs for the values. These functions can be incrementally passed to the ``Styler`` which collects the styles before rendering.
## Building styles
Pass your style functions into one of the following methods:
- ``Styler.applymap``: elementwise
- ``Styler.apply``: column-/row-/table-wise
Both of those methods take a function (and some other keyword arguments) and applies your function to the DataFrame in a certain way. ``Styler.applymap`` works through the DataFrame elementwise. ``Styler.apply`` passes each column or row into your DataFrame one-at-a-time or the entire table at once, depending on the ``axis`` keyword argument. For columnwise use ``axis=0``, rowwise use ``axis=1``, and for the entire table at once use ``axis=None``.
For ``Styler.applymap`` your function should take a scalar and return a single string with the CSS attribute-value pair.
For ``Styler.apply`` your function should take a Series or DataFrame (depending on the axis parameter), and return a Series or DataFrame with an identical shape where each value is a string with a CSS attribute-value pair.
Lets see some examples.
![style02](https://static.pypandas.cn/public/static/images/style/user_guide_style_02.png)
Heres a boring example of rendering a DataFrame, without any (visible) styles:
![style03](https://static.pypandas.cn/public/static/images/style/user_guide_style_03.png)
*Note*: The ``DataFrame.style`` attribute is a property that returns a ``Styler`` object. ``Styler`` has a ``_repr_html_`` method defined on it so they are rendered automatically. If you want the actual HTML back for further processing or for writing to file call the ``.render()`` method which returns a string.
The above output looks very similar to the standard DataFrame HTML representation. But weve done some work behind the scenes to attach CSS classes to each cell. We can view these by calling the ``.render`` method.
``` javascript
df.style.highlight_null().render().split('\n')[:10]
```
``` javascript
['<style type="text/css" >',
' #T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col2 {',
' background-color: red;',
' }</style><table id="T_acfc12d6_a988_11e9_a75e_31802e421a9b" ><thead> <tr> <th class="blank level0" ></th> <th class="col_heading level0 col0" >A</th> <th class="col_heading level0 col1" >B</th> <th class="col_heading level0 col2" >C</th> <th class="col_heading level0 col3" >D</th> <th class="col_heading level0 col4" >E</th> </tr></thead><tbody>',
' <tr>',
' <th id="T_acfc12d6_a988_11e9_a75e_31802e421a9blevel0_row0" class="row_heading level0 row0" >0</th>',
' <td id="T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col0" class="data row0 col0" >1</td>',
' <td id="T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col1" class="data row0 col1" >1.32921</td>',
' <td id="T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col2" class="data row0 col2" >nan</td>',
' <td id="T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col3" class="data row0 col3" >-0.31628</td>']
```
The ``row0_col2`` is the identifier for that particular cell. Weve also prepended each row/column identifier with a UUID unique to each DataFrame so that the style from one doesnt collide with the styling from another within the same notebook or page (you can set the ``uuid`` if youd like to tie together the styling of two DataFrames).
When writing style functions, you take care of producing the CSS attribute / value pairs you want. Pandas matches those up with the CSS classes that identify each cell.
Lets write a simple style function that will color negative numbers red and positive numbers black.
![style04](https://static.pypandas.cn/public/static/images/style/user_guide_style_04.png)
In this case, the cells style depends only on its own value. That means we should use the ``Styler.applymap`` method which works elementwise.
![style05](https://static.pypandas.cn/public/static/images/style/user_guide_style_05.png)
Notice the similarity with the standard ``df.applymap``, which operates on DataFrames elementwise. We want you to be able to reuse your existing knowledge of how to interact with DataFrames.
Notice also that our function returned a string containing the CSS attribute and value, separated by a colon just like in a ```` tag. This will be a common theme.</p>
Finally, the input shapes matched. ``Styler.applymap`` calls the function on each scalar input, and the function returns a scalar output.
Now suppose you wanted to highlight the maximum value in each column. We cant use ``.applymap`` anymore since that operated elementwise. Instead, well turn to ``.apply`` which operates columnwise (or rowwise using the ``axis`` keyword). Later on well see that something like ``highlight_max`` is already defined on ``Styler`` so you wouldnt need to write this yourself.
![style06](https://static.pypandas.cn/public/static/images/style/user_guide_style_06.png)
In this case the input is a ``Series``, one column at a time. Notice that the output shape of ``highlight_max`` matches the input shape, an array with ``len(s)`` items.
We encourage you to use method chains to build up a style piecewise, before finally rending at the end of the chain.
![style07](https://static.pypandas.cn/public/static/images/style/user_guide_style_07.png)
Above we used ``Styler.apply`` to pass in each column one at a time.
Debugging Tip: If youre having trouble writing your style function, try just passing it into DataFrame.apply. Internally, Styler.apply uses DataFrame.apply so the result should be the same.
What if you wanted to highlight just the maximum value in the entire table? Use ``.apply(function, axis=None)`` to indicate that your function wants the entire table, not one column or row at a time. Lets try that next.
Well rewrite our ``highlight-max`` to handle either Series (from ``.apply(axis=0 or 1)``) or DataFrames (from ``.apply(axis=None)``). Well also allow the color to be adjustable, to demonstrate that ``.apply``, and ``.applymap`` pass along keyword arguments.
![style08](https://static.pypandas.cn/public/static/images/style/user_guide_style_08.png)
When using ``Styler.apply(func, axis=None)``, the function must return a DataFrame with the same index and column labels.
![style09](https://static.pypandas.cn/public/static/images/style/user_guide_style_09.png)
### Building Styles Summary
Style functions should return strings with one or more CSS ``attribute: value`` delimited by semicolons. Use
- ``Styler.applymap(func)`` for elementwise styles
- ``Styler.apply(func, axis=0)`` for columnwise styles
- ``Styler.apply(func, axis=1)`` for rowwise styles
- ``Styler.apply(func, axis=None)`` for tablewise styles
And crucially the input and output shapes of ``func`` must match. If ``x`` is the input then ``func(x).shape == x.shape``.
## Finer control: slicing
Both ``Styler.apply``, and ``Styler.applymap`` accept a ``subset`` keyword. This allows you to apply styles to specific rows or columns, without having to code that logic into your ``style`` function.
The value passed to ``subset`` behaves similar to slicing a DataFrame.
- A scalar is treated as a column label
- A list (or series or numpy array)
- A tuple is treated as ``(row_indexer, column_indexer)``
Consider using ``pd.IndexSlice`` to construct the tuple for the last one.
![style10](https://static.pypandas.cn/public/static/images/style/user_guide_style_10.png)
For row and column slicing, any valid indexer to ``.loc`` will work.
![style11](https://static.pypandas.cn/public/static/images/style/user_guide_style_11.png)
Only label-based slicing is supported right now, not positional.
If your style function uses a ``subset`` or ``axis`` keyword argument, consider wrapping your function in a ``functools.partial``, partialing out that keyword.
``` python
my_func2 = functools.partial(my_func, subset=42)
```
## Finer Control: Display Values
We distinguish the *display* value from the *actual* value in ``Styler``. To control the display value, the text is printed in each cell, use ``Styler.format``. Cells can be formatted according to a [format spec string](https://docs.python.org/3/library/string.html#format-specification-mini-language) or a callable that takes a single value and returns a string.
![style12](https://static.pypandas.cn/public/static/images/style/user_guide_style_12.png)
Use a dictionary to format specific columns.
![style13](https://static.pypandas.cn/public/static/images/style/user_guide_style_13.png)
Or pass in a callable (or dictionary of callables) for more flexible handling.
![style14](https://static.pypandas.cn/public/static/images/style/user_guide_style_14.png)
## Builtin styles
Finally, we expect certain styling functions to be common enough that weve included a few “built-in” to the ``Styler``, so you dont have to write them yourself.
![style15](https://static.pypandas.cn/public/static/images/style/user_guide_style_15.png)
You can create “heatmaps” with the ``background_gradient`` method. These require matplotlib, and well use [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) to get a nice colormap.
``` python
import seaborn as sns
cm = sns.light_palette("green", as_cmap=True)
s = df.style.background_gradient(cmap=cm)
s
/opt/conda/envs/pandas/lib/python3.7/site-packages/matplotlib/colors.py:479: RuntimeWarning: invalid value encountered in less
xa[xa < 0] = -1
```
![style16](https://static.pypandas.cn/public/static/images/style/user_guide_style_16.png)
``Styler.background_gradient`` takes the keyword arguments ``low`` and ``high``. Roughly speaking these extend the range of your data by ``low`` and ``high`` percent so that when we convert the colors, the colormaps entire range isnt used. This is useful so that you can actually read the text still.
![style17](https://static.pypandas.cn/public/static/images/style/user_guide_style_17.png)
Theres also ``.highlight_min`` and ``.highlight_max``.
![style18](https://static.pypandas.cn/public/static/images/style/user_guide_style_18.png)
Use ``Styler.set_properties`` when the style doesnt actually depend on the values.
![style19](https://static.pypandas.cn/public/static/images/style/user_guide_style_19.png)
### Bar charts
You can include “bar charts” in your DataFrame.
![style20](https://static.pypandas.cn/public/static/images/style/user_guide_style_20.png)
New in version 0.20.0 is the ability to customize further the bar chart: You can now have the ``df.style.bar`` be centered on zero or midpoint value (in addition to the already existing way of having the min value at the left side of the cell), and you can pass a list of ``[color_negative, color_positive]``.
Heres how you can change the above with the new ``align='mid'`` option:
![style21](https://static.pypandas.cn/public/static/images/style/user_guide_style_21.png)
The following example aims to give a highlight of the behavior of the new align options:
``` python
import pandas as pd
from IPython.display import HTML
# Test series
test1 = pd.Series([-100,-60,-30,-20], name='All Negative')
test2 = pd.Series([10,20,50,100], name='All Positive')
test3 = pd.Series([-10,-5,0,90], name='Both Pos and Neg')
head = """
<table>
<thead>
<th>Align</th>
<th>All Negative</th>
<th>All Positive</th>
<th>Both Neg and Pos</th>
</thead>
</tbody>
"""
aligns = ['left','zero','mid']
for align in aligns:
row = "<tr><th>{}</th>".format(align)
for serie in [test1,test2,test3]:
s = serie.copy()
s.name=''
row += "<td>{}</td>".format(s.to_frame().style.bar(align=align,
color=['#d65f5f', '#5fba7d'],
width=100).render()) #testn['width']
row += '</tr>'
head += row
head+= """
</tbody>
</table>"""
HTML(head)
```
![style22](https://static.pypandas.cn/public/static/images/style/user_guide_style_22.png)
## Sharing styles
Say you have a lovely style built up for a DataFrame, and now you want to apply the same style to a second DataFrame. Export the style with ``df1.style.export``, and import it on the second DataFrame with ``df1.style.set``
![style23](https://static.pypandas.cn/public/static/images/style/user_guide_style_23.png)
Notice that youre able share the styles even though theyre data aware. The styles are re-evaluated on the new DataFrame theyve been ``use``d upon.
## Other Options
Youve seen a few methods for data-driven styling. ``Styler`` also provides a few other options for styles that dont depend on the data.
- precision
- captions
- table-wide styles
- hiding the index or columns
Each of these can be specified in two ways:
- A keyword argument to ``Styler.__init__``
- A call to one of the ``.set_`` or ``.hide_`` methods, e.g. ``.set_caption`` or ``.hide_columns``
The best method to use depends on the context. Use the ``Styler`` constructor when building many styled DataFrames that should all share the same properties. For interactive use, the``.set_`` and ``.hide_`` methods are more convenient.
### Precision
You can control the precision of floats using pandas regular ``display.precision`` option.
![style24](https://static.pypandas.cn/public/static/images/style/user_guide_style_24.png)
Or through a ``set_precision`` method.
![style25](https://static.pypandas.cn/public/static/images/style/user_guide_style_25.png)
Setting the precision only affects the printed number; the full-precision values are always passed to your style functions. You can always use ``df.round(2).style`` if youd prefer to round from the start.
### Captions
Regular table captions can be added in a few ways.
![style26](https://static.pypandas.cn/public/static/images/style/user_guide_style_26.png)
### Table styles
The next option you have are “table styles”. These are styles that apply to the table as a whole, but dont look at the data. Certain sytlings, including pseudo-selectors like ``:hover`` can only be used this way.
![style27](https://static.pypandas.cn/public/static/images/style/user_guide_style_27.png)
``table_styles`` should be a list of dictionaries. Each dictionary should have the ``selector`` and ``props`` keys. The value for ``selector`` should be a valid CSS selector. Recall that all the styles are already attached to an ``id``, unique to each ``Styler``. This selector is in addition to that ``id``. The value for ``props`` should be a list of tuples of ``('attribute', 'value')``.
``table_styles`` are extremely flexible, but not as fun to type out by hand. We hope to collect some useful ones either in pandas, or preferable in a new package that [builds on top](#Extensibility) the tools here.
### Hiding the Index or Columns
The index can be hidden from rendering by calling ``Styler.hide_index``. Columns can be hidden from rendering by calling ``Styler.hide_columns`` and passing in the name of a column, or a slice of columns.
![style28](https://static.pypandas.cn/public/static/images/style/user_guide_style_28.png)
### CSS classes
Certain CSS classes are attached to cells.
- Index and Column names include ``index_name`` and ``level`` where ``k`` is its level in a MultiIndex
- Index label cells include
``row_heading``
``row`` where ``n`` is the numeric position of the row
``level`` where ``k`` is the level in a MultiIndex
- ``row_heading``
- ``row`` where ``n`` is the numeric position of the row
- ``level`` where ``k`` is the level in a MultiIndex
- Column label cells include
``col_heading``
``col`` where ``n`` is the numeric position of the column
``level`` where ``k`` is the level in a MultiIndex
- ``col_heading``
- ``col`` where ``n`` is the numeric position of the column
- ``level`` where ``k`` is the level in a MultiIndex
- Blank cells include ``blank``
- Data cells include ``data``
### Limitations
- DataFrame only ``(use Series.to_frame().style)``
- The index and columns must be unique
- No large repr, and performance isnt great; this is intended for summary DataFrames
- You can only style the *values*, not the index or columns
- You can only apply styles, you cant insert new HTML entities
Some of these will be addressed in the future.
### Terms
- Style function: a function thats passed into ``Styler.apply`` or ``Styler.applymap`` and returns values like ``'css attribute: value'``
- Builtin style functions: style functions that are methods on ``Styler``
- table style: a dictionary with the two keys ``selector`` and ``props``. ``selector`` is the CSS selector that ``props`` will apply to. ``props`` is a list of ``(attribute, value)`` tuples. A list of table styles passed into ``Styler``.
## Fun stuff
Here are a few interesting examples.
``Styler`` interacts pretty well with widgets. If youre viewing this online instead of running the notebook yourself, youre missing out on interactively adjusting the color palette.
![style29](https://static.pypandas.cn/public/static/images/style/user_guide_style_29.png)
![style30](https://static.pypandas.cn/public/static/images/style/user_guide_style_30.png)
## Export to Excel
*New in version 0.20.0*
Experimental: This is a new feature and still under development. Well be adding features and possibly making breaking changes in future releases. Wed love to hear your feedback.
Some support is available for exporting styled ``DataFrames`` to Excel worksheets using the ``OpenPyXL`` or ``XlsxWriter`` engines. CSS2.2 properties handled include:
- ``background-color``
- ``border-style``, ``border-width``, ``border-color`` and their {``top``, ``right``, ``bottom``, ``left`` variants}
- ``color``
- ``font-family``
- ``font-style``
- ``font-weight``
- ``text-align``
- ``text-decoration``
- ``vertical-align``
- ``white-space: nowrap``
- Only CSS2 named colors and hex colors of the form ``#rgb`` or ``#rrggbb`` are currently supported.
- The following pseudo CSS properties are also available to set excel specific style properties:
``number-format``
- ``number-format``
``` python
df.style.\
applymap(color_negative_red).\
apply(highlight_max).\
to_excel('styled.xlsx', engine='openpyxl')
```
A screenshot of the output:
![excel](https://static.pypandas.cn/public/static/images/style-excel.png)
## Extensibility
The core of pandas is, and will remain, its “high-performance, easy-to-use data structures”. With that in mind, we hope that ``DataFrame.style`` accomplishes two goals
- Provide an API that is pleasing to use interactively and is “good enough” for many tasks
- Provide the foundations for dedicated libraries to build on
If you build a great library on top of this, let us know and well [link](http://pandas.pydata.org/pandas-docs/stable/ecosystem.html) to it.
### Subclassing
If the default template doesnt quite suit your needs, you can subclass Styler and extend or override the template. Well show an example of extending the default template to insert a custom header before each table.
``` python
from jinja2 import Environment, ChoiceLoader, FileSystemLoader
from IPython.display import HTML
from pandas.io.formats.style import Styler
```
Well use the following template:
``` python
with open("templates/myhtml.tpl") as f:
print(f.read())
```
Now that weve created a template, we need to set up a subclass of ``Styler`` that knows about it.
``` python
class MyStyler(Styler):
env = Environment(
loader=ChoiceLoader([
FileSystemLoader("templates"), # contains ours
Styler.loader, # the default
])
)
template = env.get_template("myhtml.tpl")
```
Notice that we include the original loader in our environments loader. Thats because we extend the original template, so the Jinja environment needs to be able to find it.
Now we can use that custom styler. Its ``__init__`` takes a DataFrame.
![style31](https://static.pypandas.cn/public/static/images/style/user_guide_style_31.png)
Our custom template accepts a ``table_title`` keyword. We can provide the value in the ``.render`` method.
![style32](https://static.pypandas.cn/public/static/images/style/user_guide_style_32.png)
For convenience, we provide the ``Styler.from_custom_template`` method that does the same as the custom subclass.
![style33](https://static.pypandas.cn/public/static/images/style/user_guide_style_33.png)
Heres the template structure:
![style34](https://static.pypandas.cn/public/static/images/style/user_guide_style_34.png)
See the template in the [GitHub repo](https://github.com/pandas-dev/pandas) for more details.

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,832 @@
# 时间差
`Timedelta`,时间差,即时间之间的差异,用 `日、时、分、秒` 等时间单位表示,时间单位可为正,也可为负。
`Timedelta``datetime.timedelta` 的子类,两者的操作方式相似,但 `Timedelta` 兼容 `np.timedelta64` 等数据类型,还支持自定义表示形式、能解析多种类型的数据,并支持自有属性。
## 解析数据,生成时间差
`Timedelta()` 支持用多种参数生成时间差:
``` python
In [1]: import datetime
# 字符串
In [2]: pd.Timedelta('1 days')
Out[2]: Timedelta('1 days 00:00:00')
In [3]: pd.Timedelta('1 days 00:00:00')
Out[3]: Timedelta('1 days 00:00:00')
In [4]: pd.Timedelta('1 days 2 hours')
Out[4]: Timedelta('1 days 02:00:00')
In [5]: pd.Timedelta('-1 days 2 min 3us')
Out[5]: Timedelta('-2 days +23:57:59.999997')
# datetime.timedelta
# 注意:必须指定关键字参数
In [6]: pd.Timedelta(days=1, seconds=1)
Out[6]: Timedelta('1 days 00:00:01')
# 用整数与时间单位生成时间差
In [7]: pd.Timedelta(1, unit='d')
Out[7]: Timedelta('1 days 00:00:00')
# datetime.timedelta 与 np.timedelta64
In [8]: pd.Timedelta(datetime.timedelta(days=1, seconds=1))
Out[8]: Timedelta('1 days 00:00:01')
In [9]: pd.Timedelta(np.timedelta64(1, 'ms'))
Out[9]: Timedelta('0 days 00:00:00.001000')
# 用字符串表示负数时间差
# 更接近 datetime.timedelta
In [10]: pd.Timedelta('-1us')
Out[10]: Timedelta('-1 days +23:59:59.999999')
# 时间差缺失值
In [11]: pd.Timedelta('nan')
Out[11]: NaT
In [12]: pd.Timedelta('nat')
Out[12]: NaT
# ISO8601 时间格式字符串
In [13]: pd.Timedelta('P0DT0H1M0S')
Out[13]: Timedelta('0 days 00:01:00')
In [14]: pd.Timedelta('P0DT0H0M0.000000123S')
Out[14]: Timedelta('0 days 00:00:00.000000')
```
*0.23.0 版新增*:增加了用 [ISO8601 时间格式](https://en.wikipedia.org/wiki/ISO_8601#Durations)生成时间差。
[DateOffsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offsets)`Day`、`Hour`、`Minute`、`Second`、`Milli`、`Micro`、`Nano`)也可以用来生成时间差。
``` python
In [15]: pd.Timedelta(pd.offsets.Second(2))
Out[15]: Timedelta('0 days 00:00:02')
```
标量运算生成的也是 `Timedelta` 标量。
``` python
In [16]: pd.Timedelta(pd.offsets.Day(2)) + pd.Timedelta(pd.offsets.Second(2)) +\
....: pd.Timedelta('00:00:00.000123')
....:
Out[16]: Timedelta('2 days 00:00:02.000123')
```
### to_timedelta
`pd.to_timedelta()` 可以把符合时间差格式的标量、数组、列表、序列等数据转换为`Timedelta`。输入数据是序列,输出的就是序列,输入数据是标量,输出的就是标量,其它形式的输入数据则输出 `TimedeltaIndex`。
`to_timedelta()` 可以解析单个字符串:
``` python
In [17]: pd.to_timedelta('1 days 06:05:01.00003')
Out[17]: Timedelta('1 days 06:05:01.000030')
In [18]: pd.to_timedelta('15.5us')
Out[18]: Timedelta('0 days 00:00:00.000015')
```
还能解析字符串列表或数组:
``` python
In [19]: pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])
Out[19]: TimedeltaIndex(['1 days 06:05:01.000030', '0 days 00:00:00.000015', NaT], dtype='timedelta64[ns]', freq=None)
```
`unit` 关键字参数指定时间差的单位:
``` python
In [20]: pd.to_timedelta(np.arange(5), unit='s')
Out[20]: TimedeltaIndex(['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04'], dtype='timedelta64[ns]', freq=None)
In [21]: pd.to_timedelta(np.arange(5), unit='d')
Out[21]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)
```
### 时间差界限
Pandas 时间差的纳秒解析度是 64 位整数,这就决定了 `Timedelta` 的上下限。
``` python
In [22]: pd.Timedelta.min
Out[22]: Timedelta('-106752 days +00:12:43.145224')
In [23]: pd.Timedelta.max
Out[23]: Timedelta('106751 days 23:47:16.854775')
```
## 运算
以时间差为数据的 `Series` 与 `DataFrame` 支持各种运算,`datetime64 [ns]` 序列或 `Timestamps` 减法运算生成的是`timedelta64 [ns]` 序列。
``` python
In [24]: s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
In [25]: td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
In [26]: df = pd.DataFrame({'A': s, 'B': td})
In [27]: df
Out[27]:
A B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
In [28]: df['C'] = df['A'] + df['B']
In [29]: df
Out[29]:
A B C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05
In [30]: df.dtypes
Out[30]:
A datetime64[ns]
B timedelta64[ns]
C datetime64[ns]
dtype: object
In [31]: s - s.max()
Out[31]:
0 -2 days
1 -1 days
2 0 days
dtype: timedelta64[ns]
In [32]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[32]:
0 364 days 20:55:00
1 365 days 20:55:00
2 366 days 20:55:00
dtype: timedelta64[ns]
In [33]: s + datetime.timedelta(minutes=5)
Out[33]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [34]: s + pd.offsets.Minute(5)
Out[34]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [35]: s + pd.offsets.Minute(5) + pd.offsets.Milli(5)
Out[35]:
0 2012-01-01 00:05:00.005
1 2012-01-02 00:05:00.005
2 2012-01-03 00:05:00.005
dtype: datetime64[ns]
```
`timedelta64 [ns]` 序列的标量运算:
``` python
In [36]: y = s - s[0]
In [37]: y
Out[37]:
0 0 days
1 1 days
2 2 days
dtype: timedelta64[ns]
```
时间差序列支持 `NaT` 值:
``` python
In [38]: y = s - s.shift()
In [39]: y
Out[39]:
0 NaT
1 1 days
2 1 days
dtype: timedelta64[ns]
```
与 `datetime` 类似,`np.nan` 把时间差设置为 `NaT`
``` python
In [40]: y[1] = np.nan
In [41]: y
Out[41]:
0 NaT
1 NaT
2 1 days
dtype: timedelta64[ns]
```
运算符也可以显示为逆序(序列与单个对象的运算):
``` python
In [42]: s.max() - s
Out[42]:
0 2 days
1 1 days
2 0 days
dtype: timedelta64[ns]
In [43]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[43]:
0 -365 days +03:05:00
1 -366 days +03:05:00
2 -367 days +03:05:00
dtype: timedelta64[ns]
In [44]: datetime.timedelta(minutes=5) + s
Out[44]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
```
`DataFrame` 支持 `min`、`max` 及 `idxmin`、`idxmax` 运算:
``` python
In [45]: A = s - pd.Timestamp('20120101') - pd.Timedelta('00:05:05')
In [46]: B = s - pd.Series(pd.date_range('2012-1-2', periods=3, freq='D'))
In [47]: df = pd.DataFrame({'A': A, 'B': B})
In [48]: df
Out[48]:
A B
0 -1 days +23:54:55 -1 days
1 0 days 23:54:55 -1 days
2 1 days 23:54:55 -1 days
In [49]: df.min()
Out[49]:
A -1 days +23:54:55
B -1 days +00:00:00
dtype: timedelta64[ns]
In [50]: df.min(axis=1)
Out[50]:
0 -1 days
1 -1 days
2 -1 days
dtype: timedelta64[ns]
In [51]: df.idxmin()
Out[51]:
A 0
B 0
dtype: int64
In [52]: df.idxmax()
Out[52]:
A 2
B 0
dtype: int64
```
`Series` 也支持`min`、`max` 及 `idxmin`、`idxmax` 运算。标量计算结果为 `Timedelta`。
``` python
In [53]: df.min().max()
Out[53]: Timedelta('-1 days +23:54:55')
In [54]: df.min(axis=1).min()
Out[54]: Timedelta('-1 days +00:00:00')
In [55]: df.min().idxmax()
Out[55]: 'A'
In [56]: df.min(axis=1).idxmin()
Out[56]: 0
```
时间差支持 `fillna` 函数,参数是 `Timedelta`,用于指定填充值。
``` python
In [57]: y.fillna(pd.Timedelta(0))
Out[57]:
0 0 days
1 0 days
2 1 days
dtype: timedelta64[ns]
In [58]: y.fillna(pd.Timedelta(10, unit='s'))
Out[58]:
0 0 days 00:00:10
1 0 days 00:00:10
2 1 days 00:00:00
dtype: timedelta64[ns]
In [59]: y.fillna(pd.Timedelta('-1 days, 00:00:05'))
Out[59]:
0 -1 days +00:00:05
1 -1 days +00:00:05
2 1 days 00:00:00
dtype: timedelta64[ns]
```
`Timedelta` 还支持取反、乘法及绝对值(`Abs`)运算:
``` python
In [60]: td1 = pd.Timedelta('-1 days 2 hours 3 seconds')
In [61]: td1
Out[61]: Timedelta('-2 days +21:59:57')
In [62]: -1 * td1
Out[62]: Timedelta('1 days 02:00:03')
In [63]: - td1
Out[63]: Timedelta('1 days 02:00:03')
In [64]: abs(td1)
Out[64]: Timedelta('1 days 02:00:03')
```
## 归约
`timedelta64 [ns]` 数值归约运算返回的是 `Timedelta` 对象。 一般情况下,`NaT` 不计数。
``` python
In [65]: y2 = pd.Series(pd.to_timedelta(['-1 days +00:00:05', 'nat',
....: '-1 days +00:00:05', '1 days']))
....:
In [66]: y2
Out[66]:
0 -1 days +00:00:05
1 NaT
2 -1 days +00:00:05
3 1 days 00:00:00
dtype: timedelta64[ns]
In [67]: y2.mean()
Out[67]: Timedelta('-1 days +16:00:03.333333')
In [68]: y2.median()
Out[68]: Timedelta('-1 days +00:00:05')
In [69]: y2.quantile(.1)
Out[69]: Timedelta('-1 days +00:00:05')
In [70]: y2.sum()
Out[70]: Timedelta('-1 days +00:00:10')
```
## 频率转换
时间差除法把 `Timedelta` 序列、`TimedeltaIndex`、`Timedelta` 标量转换为其它“频率”,`astype` 也可以将之转换为指定的时间差。这些运算生成的是序列,并把 `NaT` 转换为 `nan`。 注意NumPy 标量除法是真除法,`astype` 则等同于取底整除Floor Division
::: tip 说明
Floor Division 即两数的商为向下取整9 / 2 = 4。又译作地板除或向下取整除本文译作**取底整除**
扩展知识:
Ceiling Division即两数的商为向上取整9 / 2 = 5。又译作屋顶除或向上取整除本文译作**取顶整除**。
:::
``` python
In [71]: december = pd.Series(pd.date_range('20121201', periods=4))
In [72]: january = pd.Series(pd.date_range('20130101', periods=4))
In [73]: td = january - december
In [74]: td[2] += datetime.timedelta(minutes=5, seconds=3)
In [75]: td[3] = np.nan
In [76]: td
Out[76]:
0 31 days 00:00:00
1 31 days 00:00:00
2 31 days 00:05:03
3 NaT
dtype: timedelta64[ns]
# 转为日
In [77]: td / np.timedelta64(1, 'D')
Out[77]:
0 31.000000
1 31.000000
2 31.003507
3 NaN
dtype: float64
In [78]: td.astype('timedelta64[D]')
Out[78]:
0 31.0
1 31.0
2 31.0
3 NaN
dtype: float64
# 转为秒
In [79]: td / np.timedelta64(1, 's')
Out[79]:
0 2678400.0
1 2678400.0
2 2678703.0
3 NaN
dtype: float64
In [80]: td.astype('timedelta64[s]')
Out[80]:
0 2678400.0
1 2678400.0
2 2678703.0
3 NaN
dtype: float64
# 转为月 (此处用常量表示)
In [81]: td / np.timedelta64(1, 'M')
Out[81]:
0 1.018501
1 1.018501
2 1.018617
3 NaN
dtype: float64
```
`timedelta64 [ns]` 序列与整数或整数序列相乘或相除,生成的也是 `timedelta64 [ns]` 序列。
``` python
In [82]: td * -1
Out[82]:
0 -31 days +00:00:00
1 -31 days +00:00:00
2 -32 days +23:54:57
3 NaT
dtype: timedelta64[ns]
In [83]: td * pd.Series([1, 2, 3, 4])
Out[83]:
0 31 days 00:00:00
1 62 days 00:00:00
2 93 days 00:15:09
3 NaT
dtype: timedelta64[ns]
```
`timedelta64 [ns]` 序列与 `Timedelta` 标量相除的结果为取底整除的整数序列。
``` python
In [84]: td // pd.Timedelta(days=3, hours=4)
Out[84]:
0 9.0
1 9.0
2 9.0
3 NaN
dtype: float64
In [85]: pd.Timedelta(days=3, hours=4) // td
Out[85]:
0 0.0
1 0.0
2 0.0
3 NaN
dtype: float64
```
`Timedelta` 的求余(`mod(%)`)与除余(`divmod`)运算,支持时间差与数值参数。
``` python
In [86]: pd.Timedelta(hours=37) % datetime.timedelta(hours=2)
Out[86]: Timedelta('0 days 01:00:00')
# 除余运算的参数为时间差时返回一对值int, Timedelta
In [87]: divmod(datetime.timedelta(hours=2), pd.Timedelta(minutes=11))
Out[87]: (10, Timedelta('0 days 00:10:00'))
# 除余运算的参数为数值时也返回一对值Timedelta, Timedelta
In [88]: divmod(pd.Timedelta(hours=25), 86400000000000)
Out[88]: (Timedelta('0 days 00:00:00.000000'), Timedelta('0 days 01:00:00'))
```
## 属性
`Timedelta` 或 `TimedeltaIndex` 的组件可以直接访问 `days`、`seconds`、`microseconds`、`nanoseconds` 等属性。这些属性与`datetime.timedelta` 的返回值相同,例如,`.seconds` 属性表示大于等于 0 天且小于 1 天的秒数。带符号的 `Timedelta` 返回的值也带符号。
`Series` 的 `.dt` 属性也可以直接访问这些数据。
::: tip 注意
这些属性**不是** `Timedelta` 显示的值。`.components` 可以提取显示的值。
:::
对于 `Series`
``` python
In [89]: td.dt.days
Out[89]:
0 31.0
1 31.0
2 31.0
3 NaN
dtype: float64
In [90]: td.dt.seconds
Out[90]:
0 0.0
1 0.0
2 303.0
3 NaN
dtype: float64
```
直接访问 `Timedelta` 标量字段值。
``` python
In [91]: tds = pd.Timedelta('31 days 5 min 3 sec')
In [92]: tds.days
Out[92]: 31
In [93]: tds.seconds
Out[93]: 303
In [94]: (-tds).seconds
Out[94]: 86097
```
`.components` 属性可以快速访问时间差的组件,返回结果是 `DataFrame`。 下列代码输出 `Timedelta` 的显示值。
``` python
In [95]: td.dt.components
Out[95]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 31.0 0.0 0.0 0.0 0.0 0.0 0.0
1 31.0 0.0 0.0 0.0 0.0 0.0 0.0
2 31.0 0.0 5.0 3.0 0.0 0.0 0.0
3 NaN NaN NaN NaN NaN NaN NaN
In [96]: td.dt.components.seconds
Out[96]:
0 0.0
1 0.0
2 3.0
3 NaN
Name: seconds, dtype: float64
```
`.isoformat` 方法可以把 `Timedelta` 转换为 [ISO8601 时间格式](https://en.wikipedia.org/wiki/ISO_8601#Durations)字符串。
*0.20.0 版新增。*
``` python
In [97]: pd.Timedelta(days=6, minutes=50, seconds=3,
....: milliseconds=10, microseconds=10,
....: nanoseconds=12).isoformat()
....:
Out[97]: 'P6DT0H50M3.010010012S'
```
## TimedeltaIndex
[`TimedeltaIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.TimedeltaIndex.html#pandas.TimedeltaIndex) 或 [`timedelta_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.timedelta_range.html#pandas.timedelta_range) 可以生成时间差索引。
`TimedeltaIndex` 支持字符串型的 `Timedelta`、`timedelta` 或 `np.timedelta64`对象。
`np.nan`、`pd.NaT`、`nat` 代表缺失值。
``` python
In [98]: pd.TimedeltaIndex(['1 days', '1 days, 00:00:05', np.timedelta64(2, 'D'),
....: datetime.timedelta(days=2, seconds=2)])
....:
Out[98]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00',
'2 days 00:00:02'],
dtype='timedelta64[ns]', freq=None)
```
`freq` 关键字参数为 `infer` 时,`TimedeltaIndex` 可以自行推断时间频率:
``` python
In [99]: pd.TimedeltaIndex(['0 days', '10 days', '20 days'], freq='infer')
Out[99]: TimedeltaIndex(['0 days', '10 days', '20 days'], dtype='timedelta64[ns]', freq='10D')
```
### 生成时间差范围
与 [`date_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html#pandas.date_range) 相似,[`timedelta_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.timedelta_range.html#pandas.timedelta_range) 可以生成定频 `TimedeltaIndex``timedelta_range` 的默认频率是日历日:
``` python
In [100]: pd.timedelta_range(start='1 days', periods=5)
Out[100]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')
```
`timedelta_range` 支持 `start`、`end`、`periods` 三个参数:
``` python
In [101]: pd.timedelta_range(start='1 days', end='5 days')
Out[101]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')
In [102]: pd.timedelta_range(end='10 days', periods=4)
Out[102]: TimedeltaIndex(['7 days', '8 days', '9 days', '10 days'], dtype='timedelta64[ns]', freq='D')
```
`freq` 参数支持各种[频率别名](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases)
``` python
In [103]: pd.timedelta_range(start='1 days', end='2 days', freq='30T')
Out[103]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:30:00', '1 days 01:00:00',
'1 days 01:30:00', '1 days 02:00:00', '1 days 02:30:00',
'1 days 03:00:00', '1 days 03:30:00', '1 days 04:00:00',
'1 days 04:30:00', '1 days 05:00:00', '1 days 05:30:00',
'1 days 06:00:00', '1 days 06:30:00', '1 days 07:00:00',
'1 days 07:30:00', '1 days 08:00:00', '1 days 08:30:00',
'1 days 09:00:00', '1 days 09:30:00', '1 days 10:00:00',
'1 days 10:30:00', '1 days 11:00:00', '1 days 11:30:00',
'1 days 12:00:00', '1 days 12:30:00', '1 days 13:00:00',
'1 days 13:30:00', '1 days 14:00:00', '1 days 14:30:00',
'1 days 15:00:00', '1 days 15:30:00', '1 days 16:00:00',
'1 days 16:30:00', '1 days 17:00:00', '1 days 17:30:00',
'1 days 18:00:00', '1 days 18:30:00', '1 days 19:00:00',
'1 days 19:30:00', '1 days 20:00:00', '1 days 20:30:00',
'1 days 21:00:00', '1 days 21:30:00', '1 days 22:00:00',
'1 days 22:30:00', '1 days 23:00:00', '1 days 23:30:00',
'2 days 00:00:00'],
dtype='timedelta64[ns]', freq='30T')
In [104]: pd.timedelta_range(start='1 days', periods=5, freq='2D5H')
Out[104]:
TimedeltaIndex(['1 days 00:00:00', '3 days 05:00:00', '5 days 10:00:00',
'7 days 15:00:00', '9 days 20:00:00'],
dtype='timedelta64[ns]', freq='53H')
```
*0.23.0 版新增*。
用 `start`、`end`、`period` 可以生成等宽时间差范围,其中,`start` 与 `end`(含)是起止两端的时间,`periods` 为 `TimedeltaIndex` 里的元素数量:
``` python
In [105]: pd.timedelta_range('0 days', '4 days', periods=5)
Out[105]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)
In [106]: pd.timedelta_range('0 days', '4 days', periods=10)
Out[106]:
TimedeltaIndex(['0 days 00:00:00', '0 days 10:40:00', '0 days 21:20:00',
'1 days 08:00:00', '1 days 18:40:00', '2 days 05:20:00',
'2 days 16:00:00', '3 days 02:40:00', '3 days 13:20:00',
'4 days 00:00:00'],
dtype='timedelta64[ns]', freq=None)
```
### TimedeltaIndex 应用
与 `DatetimeIndex`、`PeriodIndex` 等 `datetime` 型索引类似,`TimedeltaIndex` 也可当作 pandas 对象的索引。
``` python
In [107]: s = pd.Series(np.arange(100),
.....: index=pd.timedelta_range('1 days', periods=100, freq='h'))
.....:
In [108]: s
Out[108]:
1 days 00:00:00 0
1 days 01:00:00 1
1 days 02:00:00 2
1 days 03:00:00 3
1 days 04:00:00 4
..
4 days 23:00:00 95
5 days 00:00:00 96
5 days 01:00:00 97
5 days 02:00:00 98
5 days 03:00:00 99
Freq: H, Length: 100, dtype: int64
```
选择操作也差不多,可以强制转换字符串与切片:
``` python
In [109]: s['1 day':'2 day']
Out[109]:
1 days 00:00:00 0
1 days 01:00:00 1
1 days 02:00:00 2
1 days 03:00:00 3
1 days 04:00:00 4
..
2 days 19:00:00 43
2 days 20:00:00 44
2 days 21:00:00 45
2 days 22:00:00 46
2 days 23:00:00 47
Freq: H, Length: 48, dtype: int64
In [110]: s['1 day 01:00:00']
Out[110]: 1
In [111]: s[pd.Timedelta('1 day 1h')]
Out[111]: 1
```
`TimedeltaIndex` 还支持局部字符串选择,并且可以推断选择范围:
``` python
In [112]: s['1 day':'1 day 5 hours']
Out[112]:
1 days 00:00:00 0
1 days 01:00:00 1
1 days 02:00:00 2
1 days 03:00:00 3
1 days 04:00:00 4
1 days 05:00:00 5
Freq: H, dtype: int64
```
### TimedeltaIndex 运算
`TimedeltaIndex` 与 `DatetimeIndex` 运算可以保留 `NaT` 值:
``` python
In [113]: tdi = pd.TimedeltaIndex(['1 days', pd.NaT, '2 days'])
In [114]: tdi.to_list()
Out[114]: [Timedelta('1 days 00:00:00'), NaT, Timedelta('2 days 00:00:00')]
In [115]: dti = pd.date_range('20130101', periods=3)
In [116]: dti.to_list()
Out[116]:
[Timestamp('2013-01-01 00:00:00', freq='D'),
Timestamp('2013-01-02 00:00:00', freq='D'),
Timestamp('2013-01-03 00:00:00', freq='D')]
In [117]: (dti + tdi).to_list()
Out[117]: [Timestamp('2013-01-02 00:00:00'), NaT, Timestamp('2013-01-05 00:00:00')]
In [118]: (dti - tdi).to_list()
Out[118]: [Timestamp('2012-12-31 00:00:00'), NaT, Timestamp('2013-01-01 00:00:00')]
```
### 转换
与 `Series` 频率转换类似,可以把 `TimedeltaIndex` 转换为其它索引。
``` python
In [119]: tdi / np.timedelta64(1, 's')
Out[119]: Float64Index([86400.0, nan, 172800.0], dtype='float64')
In [120]: tdi.astype('timedelta64[s]')
Out[120]: Float64Index([86400.0, nan, 172800.0], dtype='float64')
```
与标量操作类似,会返回**不同**类型的索引。
``` python
# 时间差与日期相加结果为日期型索引DatetimeIndex
In [121]: tdi + pd.Timestamp('20130101')
Out[121]: DatetimeIndex(['2013-01-02', 'NaT', '2013-01-03'], dtype='datetime64[ns]', freq=None)
# 日期与时间戳相减结果为日期型数据Timestamp
# note that trying to subtract a date from a Timedelta will raise an exception
In [122]: (pd.Timestamp('20130101') - tdi).to_list()
Out[122]: [Timestamp('2012-12-31 00:00:00'), NaT, Timestamp('2012-12-30 00:00:00')]
# 时间差与时间差相加,结果还是时间差索引
In [123]: tdi + pd.Timedelta('10 days')
Out[123]: TimedeltaIndex(['11 days', NaT, '12 days'], dtype='timedelta64[ns]', freq=None)
# 除数是整数,则结果为时间差索引
In [124]: tdi / 2
Out[124]: TimedeltaIndex(['0 days 12:00:00', NaT, '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
# 除数是时间差,则结果为 Float64Index
In [125]: tdi / tdi[0]
Out[125]: Float64Index([1.0, nan, 2.0], dtype='float64')
```
## 重采样
与[时间序列重采样](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-resampling)一样,`TimedeltaIndex` 也支持重采样。
``` python
In [126]: s.resample('D').mean()
Out[126]:
1 days 11.5
2 days 35.5
3 days 59.5
4 days 83.5
5 days 97.5
Freq: D, dtype: float64
```

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff