mirror of
https://github.com/Estom/notes.git
synced 2026-04-13 18:00:27 +08:00
matplotlib & pandas
This commit is contained in:
213
Python/pandas/user_guide/README.md
Normal file
213
Python/pandas/user_guide/README.md
Normal file
@@ -0,0 +1,213 @@
|
||||
---
|
||||
meta:
|
||||
- name: keywords
|
||||
content: Pandas指南
|
||||
- name: description
|
||||
content: “用户指南” 按主题划分区域涵盖了几乎所有Pandas的功能。每个小节都介绍了一个主题(例如“处理缺失的数据”),并讨论了Pandas如何解决问题,其中包含许多示例。
|
||||
---
|
||||
|
||||
# Pandas 用户指南目录
|
||||
|
||||
“用户指南” 按主题划分区域涵盖了几乎所有Pandas的功能。每个小节都介绍了一个主题(例如“处理缺失的数据”),并讨论了Pandas如何解决问题,其中包含许多示例。
|
||||
|
||||
刚开始接触Pandas的同学应该从[十分钟入门Pandas](/docs/getting_started/10min.html)开始看起。
|
||||
|
||||
有关任何特定方法的更多信息,请[参阅API参考](/docs/reference.html)。
|
||||
|
||||
- [IO工具(文本,CSV,HDF5,…)](io.html)
|
||||
- [CSV & text files](io.html#csv-text-files)
|
||||
- [JSON](io.html#json)
|
||||
- [HTML](io.html#html)
|
||||
- [Excel files](io.html#excel-files)
|
||||
- [OpenDocument Spreadsheets](io.html#opendocument-spreadsheets)
|
||||
- [Clipboard](io.html#clipboard)
|
||||
- [Pickling](io.html#pickling)
|
||||
- [msgpack](io.html#msgpack)
|
||||
- [HDF5 (PyTables)](io.html#hdf5-pytables)
|
||||
- [Feather](io.html#feather)
|
||||
- [Parquet](io.html#parquet)
|
||||
- [SQL queries](io.html#sql-queries)
|
||||
- [Google BigQuery](io.html#google-bigquery)
|
||||
- [Stata format](io.html#stata-format)
|
||||
- [SAS formats](io.html#sas-formats)
|
||||
- [Other file formats](io.html#other-file-formats)
|
||||
- [Performance considerations](io.html#performance-considerations)
|
||||
- [索引和数据选择器](indexing.html)
|
||||
- [Different choices for indexing](indexing.html#different-choices-for-indexing)
|
||||
- [Basics](indexing.html#basics)
|
||||
- [Attribute access](indexing.html#attribute-access)
|
||||
- [Slicing ranges](indexing.html#slicing-ranges)
|
||||
- [Selection by label](indexing.html#selection-by-label)
|
||||
- [Selection by position](indexing.html#selection-by-position)
|
||||
- [Selection by callable](indexing.html#selection-by-callable)
|
||||
- [IX indexer is deprecated](indexing.html#ix-indexer-is-deprecated)
|
||||
- [Indexing with list with missing labels is deprecated](indexing.html#indexing-with-list-with-missing-labels-is-deprecated)
|
||||
- [Selecting random samples](indexing.html#selecting-random-samples)
|
||||
- [Setting with enlargement](indexing.html#setting-with-enlargement)
|
||||
- [Fast scalar value getting and setting](indexing.html#fast-scalar-value-getting-and-setting)
|
||||
- [Boolean indexing](indexing.html#boolean-indexing)
|
||||
- [Indexing with isin](indexing.html#indexing-with-isin)
|
||||
- [The ``where()`` Method and Masking](indexing.html#the-where-method-and-masking)
|
||||
- [The ``query()`` Method](indexing.html#the-query-method)
|
||||
- [Duplicate data](indexing.html#duplicate-data)
|
||||
- [Dictionary-like ``get()`` method](indexing.html#dictionary-like-get-method)
|
||||
- [The ``lookup()`` method](indexing.html#the-lookup-method)
|
||||
- [Index objects](indexing.html#index-objects)
|
||||
- [Set / reset index](indexing.html#set-reset-index)
|
||||
- [Returning a view versus a copy](indexing.html#returning-a-view-versus-a-copy)
|
||||
- [多索引/高级索引](advanced.html)
|
||||
- [Hierarchical indexing (MultiIndex)](advanced.html#hierarchical-indexing-multiindex)
|
||||
- [Advanced indexing with hierarchical index](advanced.html#advanced-indexing-with-hierarchical-index)
|
||||
- [Sorting a ``MultiIndex``](advanced.html#sorting-a-multiindex)
|
||||
- [Take methods](advanced.html#take-methods)
|
||||
- [Index types](advanced.html#index-types)
|
||||
- [Miscellaneous indexing FAQ](advanced.html#miscellaneous-indexing-faq)
|
||||
- [合并、联接和连接](merging.html)
|
||||
- [Concatenating objects](merging.html#concatenating-objects)
|
||||
- [Database-style DataFrame or named Series joining/merging](merging.html#database-style-dataframe-or-named-series-joining-merging)
|
||||
- [Timeseries friendly merging](merging.html#timeseries-friendly-merging)
|
||||
- [重塑和数据透视表](reshaping.html)
|
||||
- [Reshaping by pivoting DataFrame objects](reshaping.html#reshaping-by-pivoting-dataframe-objects)
|
||||
- [Reshaping by stacking and unstacking](reshaping.html#reshaping-by-stacking-and-unstacking)
|
||||
- [Reshaping by Melt](reshaping.html#reshaping-by-melt)
|
||||
- [Combining with stats and GroupBy](reshaping.html#combining-with-stats-and-groupby)
|
||||
- [Pivot tables](reshaping.html#pivot-tables)
|
||||
- [Cross tabulations](reshaping.html#cross-tabulations)
|
||||
- [Tiling](reshaping.html#tiling)
|
||||
- [Computing indicator / dummy variables](reshaping.html#computing-indicator-dummy-variables)
|
||||
- [Factorizing values](reshaping.html#factorizing-values)
|
||||
- [Examples](reshaping.html#examples)
|
||||
- [Exploding a list-like column](reshaping.html#exploding-a-list-like-column)
|
||||
- [处理文本字符串](text.html)
|
||||
- [Splitting and replacing strings](text.html#splitting-and-replacing-strings)
|
||||
- [Concatenation](text.html#concatenation)
|
||||
- [Indexing with ``.str``](text.html#indexing-with-str)
|
||||
- [Extracting substrings](text.html#extracting-substrings)
|
||||
- [Testing for Strings that match or contain a pattern](text.html#testing-for-strings-that-match-or-contain-a-pattern)
|
||||
- [Creating indicator variables](text.html#creating-indicator-variables)
|
||||
- [Method summary](text.html#method-summary)
|
||||
- [处理丢失的数据](missing_data.html)
|
||||
- [Values considered “missing”](missing_data.html#values-considered-missing)
|
||||
- [Sum/prod of empties/nans](missing_data.html#sum-prod-of-empties-nans)
|
||||
- [NA values in GroupBy](missing_data.html#na-values-in-groupby)
|
||||
- [Filling missing values: fillna](missing_data.html#filling-missing-values-fillna)
|
||||
- [Filling with a PandasObject](missing_data.html#filling-with-a-pandasobject)
|
||||
- [Dropping axis labels with missing data: dropna](missing_data.html#dropping-axis-labels-with-missing-data-dropna)
|
||||
- [Interpolation](missing_data.html#interpolation)
|
||||
- [Replacing generic values](missing_data.html#replacing-generic-values)
|
||||
- [String/regular expression replacement](missing_data.html#string-regular-expression-replacement)
|
||||
- [Numeric replacement](missing_data.html#numeric-replacement)
|
||||
- [分类数据](categorical.html)
|
||||
- [Object creation](categorical.html#object-creation)
|
||||
- [CategoricalDtype](categorical.html#categoricaldtype)
|
||||
- [Description](categorical.html#description)
|
||||
- [Working with categories](categorical.html#working-with-categories)
|
||||
- [Sorting and order](categorical.html#sorting-and-order)
|
||||
- [Comparisons](categorical.html#comparisons)
|
||||
- [Operations](categorical.html#operations)
|
||||
- [Data munging](categorical.html#data-munging)
|
||||
- [Getting data in/out](categorical.html#getting-data-in-out)
|
||||
- [Missing data](categorical.html#missing-data)
|
||||
- [Differences to R’s <cite>factor</cite>](categorical.html#differences-to-r-s-factor)
|
||||
- [Gotchas](categorical.html#gotchas)
|
||||
- [Nullable整型数据类型](integer_na.html)
|
||||
- [可视化](visualization.html)
|
||||
- [Basic plotting: ``plot``](visualization.html#basic-plotting-plot)
|
||||
- [Other plots](visualization.html#other-plots)
|
||||
- [Plotting with missing data](visualization.html#plotting-with-missing-data)
|
||||
- [Plotting Tools](visualization.html#plotting-tools)
|
||||
- [Plot Formatting](visualization.html#plot-formatting)
|
||||
- [Plotting directly with matplotlib](visualization.html#plotting-directly-with-matplotlib)
|
||||
- [Trellis plotting interface](visualization.html#trellis-plotting-interface)
|
||||
- [计算工具](computation.html)
|
||||
- [Statistical functions](computation.html#statistical-functions)
|
||||
- [Window Functions](computation.html#window-functions)
|
||||
- [Aggregation](computation.html#aggregation)
|
||||
- [Expanding windows](computation.html#expanding-windows)
|
||||
- [Exponentially weighted windows](computation.html#exponentially-weighted-windows)
|
||||
- [组操作: 拆分-应用-组合](groupby.html)
|
||||
- [Splitting an object into groups](groupby.html#splitting-an-object-into-groups)
|
||||
- [Iterating through groups](groupby.html#iterating-through-groups)
|
||||
- [Selecting a group](groupby.html#selecting-a-group)
|
||||
- [Aggregation](groupby.html#aggregation)
|
||||
- [Transformation](groupby.html#transformation)
|
||||
- [Filtration](groupby.html#filtration)
|
||||
- [Dispatching to instance methods](groupby.html#dispatching-to-instance-methods)
|
||||
- [Flexible ``apply``](groupby.html#flexible-apply)
|
||||
- [Other useful features](groupby.html#other-useful-features)
|
||||
- [Examples](groupby.html#examples)
|
||||
- [时间序列/日期方法](timeseries.html)
|
||||
- [Overview](timeseries.html#overview)
|
||||
- [Timestamps vs. Time Spans](timeseries.html#timestamps-vs-time-spans)
|
||||
- [Converting to timestamps](timeseries.html#converting-to-timestamps)
|
||||
- [Generating ranges of timestamps](timeseries.html#generating-ranges-of-timestamps)
|
||||
- [Timestamp limitations](timeseries.html#timestamp-limitations)
|
||||
- [Indexing](timeseries.html#indexing)
|
||||
- [Time/date components](timeseries.html#time-date-components)
|
||||
- [DateOffset objects](timeseries.html#dateoffset-objects)
|
||||
- [Time Series-Related Instance Methods](timeseries.html#time-series-related-instance-methods)
|
||||
- [Resampling](timeseries.html#resampling)
|
||||
- [Time span representation](timeseries.html#time-span-representation)
|
||||
- [Converting between representations](timeseries.html#converting-between-representations)
|
||||
- [Representing out-of-bounds spans](timeseries.html#representing-out-of-bounds-spans)
|
||||
- [Time zone handling](timeseries.html#time-zone-handling)
|
||||
- [时间增量](timedeltas.html)
|
||||
- [Parsing](timedeltas.html#parsing)
|
||||
- [Operations](timedeltas.html#operations)
|
||||
- [Reductions](timedeltas.html#reductions)
|
||||
- [Frequency conversion](timedeltas.html#frequency-conversion)
|
||||
- [Attributes](timedeltas.html#attributes)
|
||||
- [TimedeltaIndex](timedeltas.html#timedeltaindex)
|
||||
- [Resampling](timedeltas.html#resampling)
|
||||
- [样式](style.html)
|
||||
- [Building styles](style.html#Building-styles)
|
||||
- [Finer control: slicing](style.html#Finer-control:-slicing)
|
||||
- [Finer Control: Display Values](style.html#Finer-Control:-Display-Values)
|
||||
- [Builtin styles](style.html#Builtin-styles)
|
||||
- [Sharing styles](style.html#Sharing-styles)
|
||||
- [Other Options](style.html#Other-Options)
|
||||
- [Fun stuff](style.html#Fun-stuff)
|
||||
- [Export to Excel](style.html#Export-to-Excel)
|
||||
- [Extensibility](style.html#Extensibility)
|
||||
- [选项和设置](options.html)
|
||||
- [Overview](options.html#overview)
|
||||
- [Getting and setting options](options.html#getting-and-setting-options)
|
||||
- [Setting startup options in Python/IPython environment](options.html#setting-startup-options-in-python-ipython-environment)
|
||||
- [Frequently Used Options](options.html#frequently-used-options)
|
||||
- [Available options](options.html#available-options)
|
||||
- [Number formatting](options.html#number-formatting)
|
||||
- [Unicode formatting](options.html#unicode-formatting)
|
||||
- [Table schema display](options.html#table-schema-display)
|
||||
- [提高性能](enhancingperf.html)
|
||||
- [Cython (writing C extensions for pandas)](enhancingperf.html#cython-writing-c-extensions-for-pandas)
|
||||
- [Using Numba](enhancingperf.html#using-numba)
|
||||
- [Expression evaluation via ``>eval()``](enhancingperf.html#expression-evaluation-via-eval)
|
||||
- [稀疏数据结构](sparse.html)
|
||||
- [SparseArray](sparse.html#sparsearray)
|
||||
- [SparseDtype](sparse.html#sparsedtype)
|
||||
- [Sparse accessor](sparse.html#sparse-accessor)
|
||||
- [Sparse calculation](sparse.html#sparse-calculation)
|
||||
- [Migrating](sparse.html#migrating)
|
||||
- [Interaction with scipy.sparse](sparse.html#interaction-with-scipy-sparse)
|
||||
- [Sparse subclasses](sparse.html#sparse-subclasses)
|
||||
- [常见问题(FAQ)](gotchas.html)
|
||||
- [DataFrame memory usage](gotchas.html#dataframe-memory-usage)
|
||||
- [Using if/truth statements with pandas](gotchas.html#using-if-truth-statements-with-pandas)
|
||||
- [``NaN``, Integer ``NA`` values and ``NA`` type promotions](gotchas.html#nan-integer-na-values-and-na-type-promotions)
|
||||
- [Differences with NumPy](gotchas.html#differences-with-numpy)
|
||||
- [Thread-safety](gotchas.html#thread-safety)
|
||||
- [Byte-Ordering issues](gotchas.html#byte-ordering-issues)
|
||||
- [烹饪指南](cookbook.html)
|
||||
- [Idioms](cookbook.html#idioms)
|
||||
- [Selection](cookbook.html#selection)
|
||||
- [MultiIndexing](cookbook.html#multiindexing)
|
||||
- [Missing data](cookbook.html#missing-data)
|
||||
- [Grouping](cookbook.html#grouping)
|
||||
- [Timeseries](cookbook.html#timeseries)
|
||||
- [Merge](cookbook.html#merge)
|
||||
- [Plotting](cookbook.html#plotting)
|
||||
- [Data In/Out](cookbook.html#data-in-out)
|
||||
- [Computation](cookbook.html#computation)
|
||||
- [Timedeltas](cookbook.html#timedeltas)
|
||||
- [Aliasing axis names](cookbook.html#aliasing-axis-names)
|
||||
- [Creating example data](cookbook.html#creating-example-data)
|
||||
2024
Python/pandas/user_guide/advanced.md
Normal file
2024
Python/pandas/user_guide/advanced.md
Normal file
File diff suppressed because it is too large
Load Diff
2012
Python/pandas/user_guide/categorical.md
Normal file
2012
Python/pandas/user_guide/categorical.md
Normal file
File diff suppressed because it is too large
Load Diff
1452
Python/pandas/user_guide/computation.md
Normal file
1452
Python/pandas/user_guide/computation.md
Normal file
File diff suppressed because it is too large
Load Diff
2050
Python/pandas/user_guide/cookbook.md
Normal file
2050
Python/pandas/user_guide/cookbook.md
Normal file
File diff suppressed because it is too large
Load Diff
984
Python/pandas/user_guide/enhancingperf.md
Normal file
984
Python/pandas/user_guide/enhancingperf.md
Normal file
@@ -0,0 +1,984 @@
|
||||
# Enhancing performance
|
||||
|
||||
In this part of the tutorial, we will investigate how to speed up certain
|
||||
functions operating on pandas ``DataFrames`` using three different techniques:
|
||||
Cython, Numba and [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval). We will see a speed improvement of ~200
|
||||
when we use Cython and Numba on a test function operating row-wise on the
|
||||
``DataFrame``. Using [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) we will speed up a sum by an order of
|
||||
~2.
|
||||
|
||||
## Cython (writing C extensions for pandas)
|
||||
|
||||
For many use cases writing pandas in pure Python and NumPy is sufficient. In some
|
||||
computationally heavy applications however, it can be possible to achieve sizable
|
||||
speed-ups by offloading work to [cython](http://cython.org/).
|
||||
|
||||
This tutorial assumes you have refactored as much as possible in Python, for example
|
||||
by trying to remove for-loops and making use of NumPy vectorization. It’s always worth
|
||||
optimising in Python first.
|
||||
|
||||
This tutorial walks through a “typical” process of cythonizing a slow computation.
|
||||
We use an [example from the Cython documentation](http://docs.cython.org/src/quickstart/cythonize.html)
|
||||
but in the context of pandas. Our final cythonized solution is around 100 times
|
||||
faster than the pure Python solution.
|
||||
|
||||
### Pure Python
|
||||
|
||||
We have a ``DataFrame`` to which we want to apply a function row-wise.
|
||||
|
||||
``` python
|
||||
In [1]: df = pd.DataFrame({'a': np.random.randn(1000),
|
||||
...: 'b': np.random.randn(1000),
|
||||
...: 'N': np.random.randint(100, 1000, (1000)),
|
||||
...: 'x': 'x'})
|
||||
...:
|
||||
|
||||
In [2]: df
|
||||
Out[2]:
|
||||
a b N x
|
||||
0 0.469112 -0.218470 585 x
|
||||
1 -0.282863 -0.061645 841 x
|
||||
2 -1.509059 -0.723780 251 x
|
||||
3 -1.135632 0.551225 972 x
|
||||
4 1.212112 -0.497767 181 x
|
||||
.. ... ... ... ..
|
||||
995 -1.512743 0.874737 374 x
|
||||
996 0.933753 1.120790 246 x
|
||||
997 -0.308013 0.198768 157 x
|
||||
998 -0.079915 1.757555 977 x
|
||||
999 -1.010589 -1.115680 770 x
|
||||
|
||||
[1000 rows x 4 columns]
|
||||
```
|
||||
|
||||
Here’s the function in pure Python:
|
||||
|
||||
``` python
|
||||
In [3]: def f(x):
|
||||
...: return x * (x - 1)
|
||||
...:
|
||||
|
||||
In [4]: def integrate_f(a, b, N):
|
||||
...: s = 0
|
||||
...: dx = (b - a) / N
|
||||
...: for i in range(N):
|
||||
...: s += f(a + i * dx)
|
||||
...: return s * dx
|
||||
...:
|
||||
```
|
||||
|
||||
We achieve our result by using ``apply`` (row-wise):
|
||||
|
||||
``` python
|
||||
In [7]: %timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
|
||||
10 loops, best of 3: 174 ms per loop
|
||||
```
|
||||
|
||||
But clearly this isn’t fast enough for us. Let’s take a look and see where the
|
||||
time is spent during this operation (limited to the most time consuming
|
||||
four calls) using the [prun ipython magic function](http://ipython.org/ipython-doc/stable/api/generated/IPython.core.magics.execution.html#IPython.core.magics.execution.ExecutionMagics.prun):
|
||||
|
||||
``` python
|
||||
In [5]: %prun -l 4 df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1) # noqa E999
|
||||
672332 function calls (667306 primitive calls) in 0.285 seconds
|
||||
|
||||
Ordered by: internal time
|
||||
List reduced from 221 to 4 due to restriction <4>
|
||||
|
||||
ncalls tottime percall cumtime percall filename:lineno(function)
|
||||
1000 0.144 0.000 0.217 0.000 <ipython-input-4-c2a74e076cf0>:1(integrate_f)
|
||||
552423 0.074 0.000 0.074 0.000 <ipython-input-3-c138bdd570e3>:1(f)
|
||||
3000 0.008 0.000 0.045 0.000 base.py:4695(get_value)
|
||||
6001 0.005 0.000 0.012 0.000 {pandas._libs.lib.values_from_object}
|
||||
```
|
||||
|
||||
By far the majority of time is spend inside either ``integrate_f`` or ``f``,
|
||||
hence we’ll concentrate our efforts cythonizing these two functions.
|
||||
|
||||
::: tip Note
|
||||
|
||||
In Python 2 replacing the ``range`` with its generator counterpart (``xrange``)
|
||||
would mean the ``range`` line would vanish. In Python 3 ``range`` is already a generator.
|
||||
|
||||
:::
|
||||
|
||||
### Plain Cython
|
||||
|
||||
First we’re going to need to import the Cython magic function to ipython:
|
||||
|
||||
``` python
|
||||
In [6]: %load_ext Cython
|
||||
```
|
||||
|
||||
Now, let’s simply copy our functions over to Cython as is (the suffix
|
||||
is here to distinguish between function versions):
|
||||
|
||||
``` python
|
||||
In [7]: %%cython
|
||||
...: def f_plain(x):
|
||||
...: return x * (x - 1)
|
||||
...: def integrate_f_plain(a, b, N):
|
||||
...: s = 0
|
||||
...: dx = (b - a) / N
|
||||
...: for i in range(N):
|
||||
...: s += f_plain(a + i * dx)
|
||||
...: return s * dx
|
||||
...:
|
||||
```
|
||||
|
||||
::: tip Note
|
||||
|
||||
If you’re having trouble pasting the above into your ipython, you may need
|
||||
to be using bleeding edge ipython for paste to play well with cell magics.
|
||||
|
||||
:::
|
||||
|
||||
``` python
|
||||
In [4]: %timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)
|
||||
10 loops, best of 3: 85.5 ms per loop
|
||||
```
|
||||
|
||||
Already this has shaved a third off, not too bad for a simple copy and paste.
|
||||
|
||||
### Adding type
|
||||
|
||||
We get another huge improvement simply by providing type information:
|
||||
|
||||
``` python
|
||||
In [8]: %%cython
|
||||
...: cdef double f_typed(double x) except? -2:
|
||||
...: return x * (x - 1)
|
||||
...: cpdef double integrate_f_typed(double a, double b, int N):
|
||||
...: cdef int i
|
||||
...: cdef double s, dx
|
||||
...: s = 0
|
||||
...: dx = (b - a) / N
|
||||
...: for i in range(N):
|
||||
...: s += f_typed(a + i * dx)
|
||||
...: return s * dx
|
||||
...:
|
||||
```
|
||||
|
||||
``` python
|
||||
In [4]: %timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
|
||||
10 loops, best of 3: 20.3 ms per loop
|
||||
```
|
||||
|
||||
Now, we’re talking! It’s now over ten times faster than the original python
|
||||
implementation, and we haven’t *really* modified the code. Let’s have another
|
||||
look at what’s eating up time:
|
||||
|
||||
``` python
|
||||
In [9]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
|
||||
119905 function calls (114879 primitive calls) in 0.096 seconds
|
||||
|
||||
Ordered by: internal time
|
||||
List reduced from 216 to 4 due to restriction <4>
|
||||
|
||||
ncalls tottime percall cumtime percall filename:lineno(function)
|
||||
3000 0.012 0.000 0.064 0.000 base.py:4695(get_value)
|
||||
6001 0.007 0.000 0.017 0.000 {pandas._libs.lib.values_from_object}
|
||||
3000 0.007 0.000 0.073 0.000 series.py:1061(__getitem__)
|
||||
3000 0.006 0.000 0.006 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
|
||||
```
|
||||
|
||||
### Using ndarray
|
||||
|
||||
It’s calling series… a lot! It’s creating a Series from each row, and get-ting from both
|
||||
the index and the series (three times for each row). Function calls are expensive
|
||||
in Python, so maybe we could minimize these by cythonizing the apply part.
|
||||
|
||||
::: tip Note
|
||||
|
||||
We are now passing ndarrays into the Cython function, fortunately Cython plays
|
||||
very nicely with NumPy.
|
||||
|
||||
:::
|
||||
|
||||
``` python
|
||||
In [10]: %%cython
|
||||
....: cimport numpy as np
|
||||
....: import numpy as np
|
||||
....: cdef double f_typed(double x) except? -2:
|
||||
....: return x * (x - 1)
|
||||
....: cpdef double integrate_f_typed(double a, double b, int N):
|
||||
....: cdef int i
|
||||
....: cdef double s, dx
|
||||
....: s = 0
|
||||
....: dx = (b - a) / N
|
||||
....: for i in range(N):
|
||||
....: s += f_typed(a + i * dx)
|
||||
....: return s * dx
|
||||
....: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
|
||||
....: np.ndarray col_N):
|
||||
....: assert (col_a.dtype == np.float
|
||||
....: and col_b.dtype == np.float and col_N.dtype == np.int)
|
||||
....: cdef Py_ssize_t i, n = len(col_N)
|
||||
....: assert (len(col_a) == len(col_b) == n)
|
||||
....: cdef np.ndarray[double] res = np.empty(n)
|
||||
....: for i in range(len(col_a)):
|
||||
....: res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
|
||||
....: return res
|
||||
....:
|
||||
```
|
||||
|
||||
The implementation is simple, it creates an array of zeros and loops over
|
||||
the rows, applying our ``integrate_f_typed``, and putting this in the zeros array.
|
||||
|
||||
::: danger Warning
|
||||
|
||||
You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
|
||||
to a Cython function. Instead pass the actual ``ndarray`` using the
|
||||
[``Series.to_numpy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy). The reason is that the Cython
|
||||
definition is specific to an ndarray and not the passed ``Series``.
|
||||
|
||||
So, do not do this:
|
||||
|
||||
``` python
|
||||
apply_integrate_f(df['a'], df['b'], df['N'])
|
||||
```
|
||||
|
||||
But rather, use [``Series.to_numpy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy) to get the underlying ``ndarray``:
|
||||
|
||||
``` python
|
||||
apply_integrate_f(df['a'].to_numpy(),
|
||||
df['b'].to_numpy(),
|
||||
df['N'].to_numpy())
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::: tip Note
|
||||
|
||||
Loops like this would be *extremely* slow in Python, but in Cython looping
|
||||
over NumPy arrays is *fast*.
|
||||
|
||||
:::
|
||||
|
||||
``` python
|
||||
In [4]: %timeit apply_integrate_f(df['a'].to_numpy(),
|
||||
df['b'].to_numpy(),
|
||||
df['N'].to_numpy())
|
||||
1000 loops, best of 3: 1.25 ms per loop
|
||||
```
|
||||
|
||||
We’ve gotten another big improvement. Let’s check again where the time is spent:
|
||||
|
||||
``` python
|
||||
In [11]: %prun -l 4 apply_integrate_f(df['a'].to_numpy(),
|
||||
....: df['b'].to_numpy(),
|
||||
....: df['N'].to_numpy())
|
||||
....:
|
||||
File "<ipython-input-11-613f5c6ec02d>", line 2
|
||||
df['b'].to_numpy(),
|
||||
^
|
||||
IndentationError: unexpected indent
|
||||
```
|
||||
|
||||
As one might expect, the majority of the time is now spent in ``apply_integrate_f``,
|
||||
so if we wanted to make anymore efficiencies we must continue to concentrate our
|
||||
efforts here.
|
||||
|
||||
### More advanced techniques
|
||||
|
||||
There is still hope for improvement. Here’s an example of using some more
|
||||
advanced Cython techniques:
|
||||
|
||||
``` python
|
||||
In [12]: %%cython
|
||||
....: cimport cython
|
||||
....: cimport numpy as np
|
||||
....: import numpy as np
|
||||
....: cdef double f_typed(double x) except? -2:
|
||||
....: return x * (x - 1)
|
||||
....: cpdef double integrate_f_typed(double a, double b, int N):
|
||||
....: cdef int i
|
||||
....: cdef double s, dx
|
||||
....: s = 0
|
||||
....: dx = (b - a) / N
|
||||
....: for i in range(N):
|
||||
....: s += f_typed(a + i * dx)
|
||||
....: return s * dx
|
||||
....: @cython.boundscheck(False)
|
||||
....: @cython.wraparound(False)
|
||||
....: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a,
|
||||
....: np.ndarray[double] col_b,
|
||||
....: np.ndarray[int] col_N):
|
||||
....: cdef int i, n = len(col_N)
|
||||
....: assert len(col_a) == len(col_b) == n
|
||||
....: cdef np.ndarray[double] res = np.empty(n)
|
||||
....: for i in range(n):
|
||||
....: res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
|
||||
....: return res
|
||||
....:
|
||||
```
|
||||
|
||||
``` python
|
||||
In [4]: %timeit apply_integrate_f_wrap(df['a'].to_numpy(),
|
||||
df['b'].to_numpy(),
|
||||
df['N'].to_numpy())
|
||||
1000 loops, best of 3: 987 us per loop
|
||||
```
|
||||
|
||||
Even faster, with the caveat that a bug in our Cython code (an off-by-one error,
|
||||
for example) might cause a segfault because memory access isn’t checked.
|
||||
For more about ``boundscheck`` and ``wraparound``, see the Cython docs on
|
||||
[compiler directives](http://cython.readthedocs.io/en/latest/src/reference/compilation.html?highlight=wraparound#compiler-directives).
|
||||
|
||||
## Using Numba
|
||||
|
||||
A recent alternative to statically compiling Cython code, is to use a *dynamic jit-compiler*, Numba.
|
||||
|
||||
Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
|
||||
|
||||
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
|
||||
|
||||
::: tip Note
|
||||
|
||||
You will need to install Numba. This is easy with ``conda``, by using: ``conda install numba``, see [installing using miniconda](https://pandas.pydata.org/pandas-docs/stable/install.html#install-miniconda).
|
||||
|
||||
:::
|
||||
|
||||
::: tip Note
|
||||
|
||||
As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.
|
||||
|
||||
:::
|
||||
|
||||
### Jit
|
||||
|
||||
We demonstrate how to use Numba to just-in-time compile our code. We simply
|
||||
take the plain Python code from above and annotate with the ``@jit`` decorator.
|
||||
|
||||
``` python
|
||||
import numba
|
||||
|
||||
|
||||
@numba.jit
|
||||
def f_plain(x):
|
||||
return x * (x - 1)
|
||||
|
||||
|
||||
@numba.jit
|
||||
def integrate_f_numba(a, b, N):
|
||||
s = 0
|
||||
dx = (b - a) / N
|
||||
for i in range(N):
|
||||
s += f_plain(a + i * dx)
|
||||
return s * dx
|
||||
|
||||
|
||||
@numba.jit
|
||||
def apply_integrate_f_numba(col_a, col_b, col_N):
|
||||
n = len(col_N)
|
||||
result = np.empty(n, dtype='float64')
|
||||
assert len(col_a) == len(col_b) == n
|
||||
for i in range(n):
|
||||
result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
|
||||
return result
|
||||
|
||||
|
||||
def compute_numba(df):
|
||||
result = apply_integrate_f_numba(df['a'].to_numpy(),
|
||||
df['b'].to_numpy(),
|
||||
df['N'].to_numpy())
|
||||
return pd.Series(result, index=df.index, name='result')
|
||||
```
|
||||
|
||||
Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a
|
||||
nicer interface by passing/returning pandas objects.
|
||||
|
||||
``` python
|
||||
In [4]: %timeit compute_numba(df)
|
||||
1000 loops, best of 3: 798 us per loop
|
||||
```
|
||||
|
||||
In this example, using Numba was faster than Cython.
|
||||
|
||||
### Vectorize
|
||||
|
||||
Numba can also be used to write vectorized functions that do not require the user to explicitly
|
||||
loop over the observations of a vector; a vectorized function will be applied to each row automatically.
|
||||
Consider the following toy example of doubling each observation:
|
||||
|
||||
``` python
|
||||
import numba
|
||||
|
||||
|
||||
def double_every_value_nonumba(x):
|
||||
return x * 2
|
||||
|
||||
|
||||
@numba.vectorize
|
||||
def double_every_value_withnumba(x): # noqa E501
|
||||
return x * 2
|
||||
```
|
||||
|
||||
``` python
|
||||
# Custom function without numba
|
||||
In [5]: %timeit df['col1_doubled'] = df.a.apply(double_every_value_nonumba) # noqa E501
|
||||
1000 loops, best of 3: 797 us per loop
|
||||
|
||||
# Standard implementation (faster than a custom function)
|
||||
In [6]: %timeit df['col1_doubled'] = df.a * 2
|
||||
1000 loops, best of 3: 233 us per loop
|
||||
|
||||
# Custom function with numba
|
||||
In [7]: %timeit (df['col1_doubled'] = double_every_value_withnumba(df.a.to_numpy())
|
||||
1000 loops, best of 3: 145 us per loop
|
||||
```
|
||||
|
||||
### Caveats
|
||||
|
||||
::: tip Note
|
||||
|
||||
Numba will execute on any function, but can only accelerate certain classes of functions.
|
||||
|
||||
:::
|
||||
|
||||
Numba is best at accelerating functions that apply numerical functions to NumPy
|
||||
arrays. When passed a function that only uses operations it knows how to
|
||||
accelerate, it will execute in ``nopython`` mode.
|
||||
|
||||
If Numba is passed a function that includes something it doesn’t know how to
|
||||
work with – a category that currently includes sets, lists, dictionaries, or
|
||||
string functions – it will revert to ``object mode``. In ``object mode``,
|
||||
Numba will execute but your code will not speed up significantly. If you would
|
||||
prefer that Numba throw an error if it cannot compile a function in a way that
|
||||
speeds up your code, pass Numba the argument
|
||||
``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on
|
||||
troubleshooting Numba modes, see the [Numba troubleshooting page](http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#the-compiled-code-is-too-slow).
|
||||
|
||||
Read more in the [Numba docs](http://numba.pydata.org/).
|
||||
|
||||
## Expression evaluation via ``eval()``
|
||||
|
||||
The top-level function [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) implements expression evaluation of
|
||||
[``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) and [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) objects.
|
||||
|
||||
::: tip Note
|
||||
|
||||
To benefit from using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) you need to
|
||||
install ``numexpr``. See the [recommended dependencies section](https://pandas.pydata.org/pandas-docs/stable/install.html#install-recommended-dependencies) for more details.
|
||||
|
||||
:::
|
||||
|
||||
The point of using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) for expression evaluation rather than
|
||||
plain Python is two-fold: 1) large [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) objects are
|
||||
evaluated more efficiently and 2) large arithmetic and boolean expressions are
|
||||
evaluated all at once by the underlying engine (by default ``numexpr`` is used
|
||||
for evaluation).
|
||||
|
||||
::: tip Note
|
||||
|
||||
You should not use [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) for simple
|
||||
expressions or for expressions involving small DataFrames. In fact,
|
||||
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) is many orders of magnitude slower for
|
||||
smaller expressions/objects than plain ol’ Python. A good rule of thumb is
|
||||
to only use [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) when you have a
|
||||
``DataFrame`` with more than 10,000 rows.
|
||||
|
||||
:::
|
||||
|
||||
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) supports all arithmetic expressions supported by the
|
||||
engine in addition to some extensions available only in pandas.
|
||||
|
||||
::: tip Note
|
||||
|
||||
The larger the frame and the larger the expression the more speedup you will
|
||||
see from using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval).
|
||||
|
||||
:::
|
||||
|
||||
### Supported syntax
|
||||
|
||||
These operations are supported by [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval):
|
||||
|
||||
- Arithmetic operations except for the left shift (``<<``) and right shift
|
||||
(``>>``) operators, e.g., ``df + 2 * pi / s ** 4 % 42 - the_golden_ratio``
|
||||
- Comparison operations, including chained comparisons, e.g., ``2 < df < df2``
|
||||
- Boolean operations, e.g., ``df < df2 and df3 < df4 or not df_bool``
|
||||
- ``list`` and ``tuple`` literals, e.g., ``[1, 2]`` or ``(1, 2)``
|
||||
- Attribute access, e.g., ``df.a``
|
||||
- Subscript expressions, e.g., ``df[0]``
|
||||
- Simple variable evaluation, e.g., ``pd.eval('df')`` (this is not very useful)
|
||||
- Math functions: *sin*, *cos*, *exp*, *log*, *expm1*, *log1p*,
|
||||
*sqrt*, *sinh*, *cosh*, *tanh*, *arcsin*, *arccos*, *arctan*, *arccosh*,
|
||||
*arcsinh*, *arctanh*, *abs*, *arctan2* and *log10*.
|
||||
|
||||
This Python syntax is **not** allowed:
|
||||
|
||||
- Expressions
|
||||
- Function calls other than math functions.
|
||||
- ``is``/``is not`` operations
|
||||
- ``if`` expressions
|
||||
- ``lambda`` expressions
|
||||
- ``list``/``set``/``dict`` comprehensions
|
||||
- Literal ``dict`` and ``set`` expressions
|
||||
- ``yield`` expressions
|
||||
- Generator expressions
|
||||
- Boolean expressions consisting of only scalar values
|
||||
|
||||
- Statements
|
||||
|
||||
- Neither [simple](https://docs.python.org/3/reference/simple_stmts.html)
|
||||
nor [compound](https://docs.python.org/3/reference/compound_stmts.html)
|
||||
statements are allowed. This includes things like ``for``, ``while``, and
|
||||
``if``.
|
||||
|
||||
### ``eval()`` examples
|
||||
|
||||
[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) works well with expressions containing large arrays.
|
||||
|
||||
First let’s create a few decent-sized arrays to play with:
|
||||
|
||||
``` python
|
||||
In [13]: nrows, ncols = 20000, 100
|
||||
|
||||
In [14]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]
|
||||
```
|
||||
|
||||
Now let’s compare adding them together using plain ol’ Python versus
|
||||
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval):
|
||||
|
||||
``` python
|
||||
In [15]: %timeit df1 + df2 + df3 + df4
|
||||
21 ms +- 787 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
|
||||
```
|
||||
|
||||
``` python
|
||||
In [16]: %timeit pd.eval('df1 + df2 + df3 + df4')
|
||||
8.12 ms +- 249 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
|
||||
```
|
||||
|
||||
Now let’s do the same thing but with comparisons:
|
||||
|
||||
``` python
|
||||
In [17]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
|
||||
272 ms +- 6.92 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
|
||||
```
|
||||
|
||||
``` python
|
||||
In [18]: %timeit pd.eval('(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)')
|
||||
19.2 ms +- 1.87 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
|
||||
```
|
||||
|
||||
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) also works with unaligned pandas objects:
|
||||
|
||||
``` python
|
||||
In [19]: s = pd.Series(np.random.randn(50))
|
||||
|
||||
In [20]: %timeit df1 + df2 + df3 + df4 + s
|
||||
103 ms +- 12.7 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
|
||||
```
|
||||
|
||||
``` python
|
||||
In [21]: %timeit pd.eval('df1 + df2 + df3 + df4 + s')
|
||||
10.2 ms +- 215 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
|
||||
```
|
||||
|
||||
::: tip Note
|
||||
|
||||
Operations such as
|
||||
|
||||
``` python
|
||||
1 and 2 # would parse to 1 & 2, but should evaluate to 2
|
||||
3 or 4 # would parse to 3 | 4, but should evaluate to 3
|
||||
~1 # this is okay, but slower when using eval
|
||||
```
|
||||
|
||||
should be performed in Python. An exception will be raised if you try to
|
||||
perform any boolean/bitwise operations with scalar operands that are not
|
||||
of type ``bool`` or ``np.bool_``. Again, you should perform these kinds of
|
||||
operations in plain Python.
|
||||
|
||||
:::
|
||||
|
||||
### The ``DataFrame.eval`` method
|
||||
|
||||
In addition to the top level [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) function you can also
|
||||
evaluate an expression in the “context” of a [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame).
|
||||
|
||||
``` python
|
||||
In [22]: df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
|
||||
|
||||
In [23]: df.eval('a + b')
|
||||
Out[23]:
|
||||
0 -0.246747
|
||||
1 0.867786
|
||||
2 -1.626063
|
||||
3 -1.134978
|
||||
4 -1.027798
|
||||
dtype: float64
|
||||
```
|
||||
|
||||
Any expression that is a valid [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) expression is also a valid
|
||||
[``DataFrame.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eval.html#pandas.DataFrame.eval) expression, with the added benefit that you don’t have to
|
||||
prefix the name of the [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) to the column(s) you’re
|
||||
interested in evaluating.
|
||||
|
||||
In addition, you can perform assignment of columns within an expression.
|
||||
This allows for *formulaic evaluation*. The assignment target can be a
|
||||
new column name or an existing column name, and it must be a valid Python
|
||||
identifier.
|
||||
|
||||
*New in version 0.18.0.*
|
||||
|
||||
The ``inplace`` keyword determines whether this assignment will performed
|
||||
on the original ``DataFrame`` or return a copy with the new column.
|
||||
|
||||
::: danger Warning
|
||||
|
||||
For backwards compatibility, ``inplace`` defaults to ``True`` if not
|
||||
specified. This will change in a future version of pandas - if your
|
||||
code depends on an inplace assignment you should update to explicitly
|
||||
set ``inplace=True``.
|
||||
|
||||
:::
|
||||
|
||||
``` python
|
||||
In [24]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
|
||||
|
||||
In [25]: df.eval('c = a + b', inplace=True)
|
||||
|
||||
In [26]: df.eval('d = a + b + c', inplace=True)
|
||||
|
||||
In [27]: df.eval('a = 1', inplace=True)
|
||||
|
||||
In [28]: df
|
||||
Out[28]:
|
||||
a b c d
|
||||
0 1 5 5 10
|
||||
1 1 6 7 14
|
||||
2 1 7 9 18
|
||||
3 1 8 11 22
|
||||
4 1 9 13 26
|
||||
```
|
||||
|
||||
When ``inplace`` is set to ``False``, a copy of the ``DataFrame`` with the
|
||||
new or modified columns is returned and the original frame is unchanged.
|
||||
|
||||
``` python
|
||||
In [29]: df
|
||||
Out[29]:
|
||||
a b c d
|
||||
0 1 5 5 10
|
||||
1 1 6 7 14
|
||||
2 1 7 9 18
|
||||
3 1 8 11 22
|
||||
4 1 9 13 26
|
||||
|
||||
In [30]: df.eval('e = a - c', inplace=False)
|
||||
Out[30]:
|
||||
a b c d e
|
||||
0 1 5 5 10 -4
|
||||
1 1 6 7 14 -6
|
||||
2 1 7 9 18 -8
|
||||
3 1 8 11 22 -10
|
||||
4 1 9 13 26 -12
|
||||
|
||||
In [31]: df
|
||||
Out[31]:
|
||||
a b c d
|
||||
0 1 5 5 10
|
||||
1 1 6 7 14
|
||||
2 1 7 9 18
|
||||
3 1 8 11 22
|
||||
4 1 9 13 26
|
||||
```
|
||||
|
||||
*New in version 0.18.0.*
|
||||
|
||||
As a convenience, multiple assignments can be performed by using a
|
||||
multi-line string.
|
||||
|
||||
``` python
|
||||
In [32]: df.eval("""
|
||||
....: c = a + b
|
||||
....: d = a + b + c
|
||||
....: a = 1""", inplace=False)
|
||||
....:
|
||||
Out[32]:
|
||||
a b c d
|
||||
0 1 5 6 12
|
||||
1 1 6 7 14
|
||||
2 1 7 8 16
|
||||
3 1 8 9 18
|
||||
4 1 9 10 20
|
||||
```
|
||||
|
||||
The equivalent in standard Python would be
|
||||
|
||||
``` python
|
||||
In [33]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
|
||||
|
||||
In [34]: df['c'] = df.a + df.b
|
||||
|
||||
In [35]: df['d'] = df.a + df.b + df.c
|
||||
|
||||
In [36]: df['a'] = 1
|
||||
|
||||
In [37]: df
|
||||
Out[37]:
|
||||
a b c d
|
||||
0 1 5 5 10
|
||||
1 1 6 7 14
|
||||
2 1 7 9 18
|
||||
3 1 8 11 22
|
||||
4 1 9 13 26
|
||||
```
|
||||
|
||||
*New in version 0.18.0.*
|
||||
|
||||
The ``query`` method gained the ``inplace`` keyword which determines
|
||||
whether the query modifies the original frame.
|
||||
|
||||
``` python
|
||||
In [38]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
|
||||
|
||||
In [39]: df.query('a > 2')
|
||||
Out[39]:
|
||||
a b
|
||||
3 3 8
|
||||
4 4 9
|
||||
|
||||
In [40]: df.query('a > 2', inplace=True)
|
||||
|
||||
In [41]: df
|
||||
Out[41]:
|
||||
a b
|
||||
3 3 8
|
||||
4 4 9
|
||||
```
|
||||
|
||||
::: danger Warning
|
||||
|
||||
Unlike with ``eval``, the default value for ``inplace`` for ``query``
|
||||
is ``False``. This is consistent with prior versions of pandas.
|
||||
|
||||
:::
|
||||
|
||||
### Local variables
|
||||
|
||||
You must *explicitly reference* any local variable that you want to use in an
|
||||
expression by placing the ``@`` character in front of the name. For example,
|
||||
|
||||
``` python
|
||||
In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=list('ab'))
|
||||
|
||||
In [43]: newcol = np.random.randn(len(df))
|
||||
|
||||
In [44]: df.eval('b + @newcol')
|
||||
Out[44]:
|
||||
0 -0.173926
|
||||
1 2.493083
|
||||
2 -0.881831
|
||||
3 -0.691045
|
||||
4 1.334703
|
||||
dtype: float64
|
||||
|
||||
In [45]: df.query('b < @newcol')
|
||||
Out[45]:
|
||||
a b
|
||||
0 0.863987 -0.115998
|
||||
2 -2.621419 -1.297879
|
||||
```
|
||||
|
||||
If you don’t prefix the local variable with ``@``, pandas will raise an
|
||||
exception telling you the variable is undefined.
|
||||
|
||||
When using [``DataFrame.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eval.html#pandas.DataFrame.eval) and [``DataFrame.query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query), this allows you
|
||||
to have a local variable and a [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) column with the same
|
||||
name in an expression.
|
||||
|
||||
``` python
|
||||
In [46]: a = np.random.randn()
|
||||
|
||||
In [47]: df.query('@a < a')
|
||||
Out[47]:
|
||||
a b
|
||||
0 0.863987 -0.115998
|
||||
|
||||
In [48]: df.loc[a < df.a] # same as the previous expression
|
||||
Out[48]:
|
||||
a b
|
||||
0 0.863987 -0.115998
|
||||
```
|
||||
|
||||
With [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) you cannot use the ``@`` prefix *at all*, because it
|
||||
isn’t defined in that context. ``pandas`` will let you know this if you try to
|
||||
use ``@`` in a top-level call to [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval). For example,
|
||||
|
||||
``` python
|
||||
In [49]: a, b = 1, 2
|
||||
|
||||
In [50]: pd.eval('@a + b')
|
||||
Traceback (most recent call last):
|
||||
|
||||
File "/opt/conda/envs/pandas/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
|
||||
exec(code_obj, self.user_global_ns, self.user_ns)
|
||||
|
||||
File "<ipython-input-50-af17947a194f>", line 1, in <module>
|
||||
pd.eval('@a + b')
|
||||
|
||||
File "/pandas/pandas/core/computation/eval.py", line 311, in eval
|
||||
_check_for_locals(expr, level, parser)
|
||||
|
||||
File "/pandas/pandas/core/computation/eval.py", line 166, in _check_for_locals
|
||||
raise SyntaxError(msg)
|
||||
|
||||
File "<string>", line unknown
|
||||
SyntaxError: The '@' prefix is not allowed in top-level eval calls,
|
||||
please refer to your variables by name without the '@' prefix
|
||||
```
|
||||
|
||||
In this case, you should simply refer to the variables like you would in
|
||||
standard Python.
|
||||
|
||||
``` python
|
||||
In [51]: pd.eval('a + b')
|
||||
Out[51]: 3
|
||||
```
|
||||
|
||||
### ``pandas.eval()`` parsers
|
||||
|
||||
There are two different parsers and two different engines you can use as
|
||||
the backend.
|
||||
|
||||
The default ``'pandas'`` parser allows a more intuitive syntax for expressing
|
||||
query-like operations (comparisons, conjunctions and disjunctions). In
|
||||
particular, the precedence of the ``&`` and ``|`` operators is made equal to
|
||||
the precedence of the corresponding boolean operations ``and`` and ``or``.
|
||||
|
||||
For example, the above conjunction can be written without parentheses.
|
||||
Alternatively, you can use the ``'python'`` parser to enforce strict Python
|
||||
semantics.
|
||||
|
||||
``` python
|
||||
In [52]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'
|
||||
|
||||
In [53]: x = pd.eval(expr, parser='python')
|
||||
|
||||
In [54]: expr_no_parens = 'df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0'
|
||||
|
||||
In [55]: y = pd.eval(expr_no_parens, parser='pandas')
|
||||
|
||||
In [56]: np.all(x == y)
|
||||
Out[56]: True
|
||||
```
|
||||
|
||||
The same expression can be “anded” together with the word [``and``](https://docs.python.org/3/reference/expressions.html#and) as
|
||||
well:
|
||||
|
||||
``` python
|
||||
In [57]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'
|
||||
|
||||
In [58]: x = pd.eval(expr, parser='python')
|
||||
|
||||
In [59]: expr_with_ands = 'df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0'
|
||||
|
||||
In [60]: y = pd.eval(expr_with_ands, parser='pandas')
|
||||
|
||||
In [61]: np.all(x == y)
|
||||
Out[61]: True
|
||||
```
|
||||
|
||||
The ``and`` and ``or`` operators here have the same precedence that they would
|
||||
in vanilla Python.
|
||||
|
||||
### ``pandas.eval()`` backends
|
||||
|
||||
There’s also the option to make [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) operate identical to plain
|
||||
ol’ Python.
|
||||
|
||||
::: tip Note
|
||||
|
||||
Using the ``'python'`` engine is generally *not* useful, except for testing
|
||||
other evaluation engines against it. You will achieve **no** performance
|
||||
benefits using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) with ``engine='python'`` and in fact may
|
||||
incur a performance hit.
|
||||
|
||||
:::
|
||||
|
||||
You can see this by using [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) with the ``'python'`` engine. It
|
||||
is a bit slower (not by much) than evaluating the same expression in Python
|
||||
|
||||
``` python
|
||||
In [62]: %timeit df1 + df2 + df3 + df4
|
||||
9.5 ms +- 241 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
|
||||
```
|
||||
|
||||
``` python
|
||||
In [63]: %timeit pd.eval('df1 + df2 + df3 + df4', engine='python')
|
||||
10.8 ms +- 898 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
|
||||
```
|
||||
|
||||
### ``pandas.eval()`` performance
|
||||
|
||||
[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) is intended to speed up certain kinds of operations. In
|
||||
particular, those operations involving complex expressions with large
|
||||
[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame)/[``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) objects should see a
|
||||
significant performance benefit. Here is a plot showing the running time of
|
||||
[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) as function of the size of the frame involved in the
|
||||
computation. The two lines are two different engines.
|
||||
|
||||

|
||||
|
||||
::: tip Note
|
||||
|
||||
Operations with smallish objects (around 15k-20k rows) are faster using
|
||||
plain Python:
|
||||
|
||||

|
||||
|
||||
:::
|
||||
|
||||
This plot was created using a ``DataFrame`` with 3 columns each containing
|
||||
floating point values generated using ``numpy.random.randn()``.
|
||||
|
||||
### Technical minutia regarding expression evaluation
|
||||
|
||||
Expressions that would result in an object dtype or involve datetime operations
|
||||
(because of ``NaT``) must be evaluated in Python space. The main reason for
|
||||
this behavior is to maintain backwards compatibility with versions of NumPy <
|
||||
1.7. In those versions of NumPy a call to ``ndarray.astype(str)`` will
|
||||
truncate any strings that are more than 60 characters in length. Second, we
|
||||
can’t pass ``object`` arrays to ``numexpr`` thus string comparisons must be
|
||||
evaluated in Python space.
|
||||
|
||||
The upshot is that this *only* applies to object-dtype expressions. So, if
|
||||
you have an expression–for example
|
||||
|
||||
``` python
|
||||
In [64]: df = pd.DataFrame({'strings': np.repeat(list('cba'), 3),
|
||||
....: 'nums': np.repeat(range(3), 3)})
|
||||
....:
|
||||
|
||||
In [65]: df
|
||||
Out[65]:
|
||||
strings nums
|
||||
0 c 0
|
||||
1 c 0
|
||||
2 c 0
|
||||
3 b 1
|
||||
4 b 1
|
||||
5 b 1
|
||||
6 a 2
|
||||
7 a 2
|
||||
8 a 2
|
||||
|
||||
In [66]: df.query('strings == "a" and nums == 1')
|
||||
Out[66]:
|
||||
Empty DataFrame
|
||||
Columns: [strings, nums]
|
||||
Index: []
|
||||
```
|
||||
|
||||
the numeric part of the comparison (``nums == 1``) will be evaluated by
|
||||
``numexpr``.
|
||||
|
||||
In general, [``DataFrame.query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query)/[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) will
|
||||
evaluate the subexpressions that *can* be evaluated by ``numexpr`` and those
|
||||
that must be evaluated in Python space transparently to the user. This is done
|
||||
by inferring the result type of an expression from its arguments and operators.
|
||||
|
||||
429
Python/pandas/user_guide/gotchas.md
Normal file
429
Python/pandas/user_guide/gotchas.md
Normal file
@@ -0,0 +1,429 @@
|
||||
# Frequently Asked Questions (FAQ)
|
||||
|
||||
## DataFrame memory usage
|
||||
|
||||
The memory usage of a ``DataFrame`` (including the index) is shown when calling
|
||||
the [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info). A configuration option, ``display.memory_usage``
|
||||
(see [the list of options](options.html#options-available)), specifies if the
|
||||
``DataFrame``’s memory usage will be displayed when invoking the ``df.info()``
|
||||
method.
|
||||
|
||||
For example, the memory usage of the ``DataFrame`` below is shown
|
||||
when calling [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info):
|
||||
|
||||
``` python
|
||||
In [1]: dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
|
||||
...: 'complex128', 'object', 'bool']
|
||||
...:
|
||||
|
||||
In [2]: n = 5000
|
||||
|
||||
In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}
|
||||
|
||||
In [4]: df = pd.DataFrame(data)
|
||||
|
||||
In [5]: df['categorical'] = df['object'].astype('category')
|
||||
|
||||
In [6]: df.info()
|
||||
<class 'pandas.core.frame.DataFrame'>
|
||||
RangeIndex: 5000 entries, 0 to 4999
|
||||
Data columns (total 8 columns):
|
||||
int64 5000 non-null int64
|
||||
float64 5000 non-null float64
|
||||
datetime64[ns] 5000 non-null datetime64[ns]
|
||||
timedelta64[ns] 5000 non-null timedelta64[ns]
|
||||
complex128 5000 non-null complex128
|
||||
object 5000 non-null object
|
||||
bool 5000 non-null bool
|
||||
categorical 5000 non-null category
|
||||
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
|
||||
memory usage: 289.1+ KB
|
||||
```
|
||||
|
||||
The ``+`` symbol indicates that the true memory usage could be higher, because
|
||||
pandas does not count the memory used by values in columns with
|
||||
``dtype=object``.
|
||||
|
||||
Passing ``memory_usage='deep'`` will enable a more accurate memory usage report,
|
||||
accounting for the full usage of the contained objects. This is optional
|
||||
as it can be expensive to do this deeper introspection.
|
||||
|
||||
``` python
|
||||
In [7]: df.info(memory_usage='deep')
|
||||
<class 'pandas.core.frame.DataFrame'>
|
||||
RangeIndex: 5000 entries, 0 to 4999
|
||||
Data columns (total 8 columns):
|
||||
int64 5000 non-null int64
|
||||
float64 5000 non-null float64
|
||||
datetime64[ns] 5000 non-null datetime64[ns]
|
||||
timedelta64[ns] 5000 non-null timedelta64[ns]
|
||||
complex128 5000 non-null complex128
|
||||
object 5000 non-null object
|
||||
bool 5000 non-null bool
|
||||
categorical 5000 non-null category
|
||||
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
|
||||
memory usage: 425.6 KB
|
||||
```
|
||||
|
||||
By default the display option is set to ``True`` but can be explicitly
|
||||
overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.
|
||||
|
||||
The memory usage of each column can be found by calling the
|
||||
[``memory_usage()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html#pandas.DataFrame.memory_usage) method. This returns a ``Series`` with an index
|
||||
represented by column names and memory usage of each column shown in bytes. For
|
||||
the ``DataFrame`` above, the memory usage of each column and the total memory
|
||||
usage can be found with the ``memory_usage`` method:
|
||||
|
||||
``` python
|
||||
In [8]: df.memory_usage()
|
||||
Out[8]:
|
||||
Index 128
|
||||
int64 40000
|
||||
float64 40000
|
||||
datetime64[ns] 40000
|
||||
timedelta64[ns] 40000
|
||||
complex128 80000
|
||||
object 40000
|
||||
bool 5000
|
||||
categorical 10920
|
||||
dtype: int64
|
||||
|
||||
# total memory usage of dataframe
|
||||
In [9]: df.memory_usage().sum()
|
||||
Out[9]: 296048
|
||||
```
|
||||
|
||||
By default the memory usage of the ``DataFrame``’s index is shown in the
|
||||
returned ``Series``, the memory usage of the index can be suppressed by passing
|
||||
the ``index=False`` argument:
|
||||
|
||||
``` python
|
||||
In [10]: df.memory_usage(index=False)
|
||||
Out[10]:
|
||||
int64 40000
|
||||
float64 40000
|
||||
datetime64[ns] 40000
|
||||
timedelta64[ns] 40000
|
||||
complex128 80000
|
||||
object 40000
|
||||
bool 5000
|
||||
categorical 10920
|
||||
dtype: int64
|
||||
```
|
||||
|
||||
The memory usage displayed by the [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) method utilizes the
|
||||
[``memory_usage()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html#pandas.DataFrame.memory_usage) method to determine the memory usage of a
|
||||
``DataFrame`` while also formatting the output in human-readable units (base-2
|
||||
representation; i.e. 1KB = 1024 bytes).
|
||||
|
||||
See also [Categorical Memory Usage](categorical.html#categorical-memory).
|
||||
|
||||
## Using if/truth statements with pandas
|
||||
|
||||
pandas follows the NumPy convention of raising an error when you try to convert
|
||||
something to a ``bool``. This happens in an ``if``-statement or when using the
|
||||
boolean operations: ``and``, ``or``, and ``not``. It is not clear what the result
|
||||
of the following code should be:
|
||||
|
||||
``` python
|
||||
>>> if pd.Series([False, True, False]):
|
||||
... pass
|
||||
```
|
||||
|
||||
Should it be ``True`` because it’s not zero-length, or ``False`` because there
|
||||
are ``False`` values? It is unclear, so instead, pandas raises a ``ValueError``:
|
||||
|
||||
``` python
|
||||
>>> if pd.Series([False, True, False]):
|
||||
... print("I was true")
|
||||
Traceback
|
||||
...
|
||||
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
|
||||
```
|
||||
|
||||
You need to explicitly choose what you want to do with the ``DataFrame``, e.g.
|
||||
use [``any()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html#pandas.DataFrame.any), [``all()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.all.html#pandas.DataFrame.all) or [``empty()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.empty.html#pandas.DataFrame.empty).
|
||||
Alternatively, you might want to compare if the pandas object is ``None``:
|
||||
|
||||
``` python
|
||||
>>> if pd.Series([False, True, False]) is not None:
|
||||
... print("I was not None")
|
||||
I was not None
|
||||
```
|
||||
|
||||
Below is how to check if any of the values are ``True``:
|
||||
|
||||
``` python
|
||||
>>> if pd.Series([False, True, False]).any():
|
||||
... print("I am any")
|
||||
I am any
|
||||
```
|
||||
|
||||
To evaluate single-element pandas objects in a boolean context, use the method
|
||||
[``bool()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bool.html#pandas.DataFrame.bool):
|
||||
|
||||
``` python
|
||||
In [11]: pd.Series([True]).bool()
|
||||
Out[11]: True
|
||||
|
||||
In [12]: pd.Series([False]).bool()
|
||||
Out[12]: False
|
||||
|
||||
In [13]: pd.DataFrame([[True]]).bool()
|
||||
Out[13]: True
|
||||
|
||||
In [14]: pd.DataFrame([[False]]).bool()
|
||||
Out[14]: False
|
||||
```
|
||||
|
||||
### Bitwise boolean
|
||||
|
||||
Bitwise boolean operators like ``==`` and ``!=`` return a boolean ``Series``,
|
||||
which is almost always what you want anyways.
|
||||
|
||||
``` python
|
||||
>>> s = pd.Series(range(5))
|
||||
>>> s == 4
|
||||
0 False
|
||||
1 False
|
||||
2 False
|
||||
3 False
|
||||
4 True
|
||||
dtype: bool
|
||||
```
|
||||
|
||||
See [boolean comparisons](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-compare) for more examples.
|
||||
|
||||
### Using the ``in`` operator
|
||||
|
||||
Using the Python ``in`` operator on a ``Series`` tests for membership in the
|
||||
index, not membership among the values.
|
||||
|
||||
``` python
|
||||
In [15]: s = pd.Series(range(5), index=list('abcde'))
|
||||
|
||||
In [16]: 2 in s
|
||||
Out[16]: False
|
||||
|
||||
In [17]: 'b' in s
|
||||
Out[17]: True
|
||||
```
|
||||
|
||||
If this behavior is surprising, keep in mind that using ``in`` on a Python
|
||||
dictionary tests keys, not values, and ``Series`` are dict-like.
|
||||
To test for membership in the values, use the method [``isin()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html#pandas.Series.isin):
|
||||
|
||||
``` python
|
||||
In [18]: s.isin([2])
|
||||
Out[18]:
|
||||
a False
|
||||
b False
|
||||
c True
|
||||
d False
|
||||
e False
|
||||
dtype: bool
|
||||
|
||||
In [19]: s.isin([2]).any()
|
||||
Out[19]: True
|
||||
```
|
||||
|
||||
For ``DataFrames``, likewise, ``in`` applies to the column axis,
|
||||
testing for membership in the list of column names.
|
||||
|
||||
## ``NaN``, Integer ``NA`` values and ``NA`` type promotions
|
||||
|
||||
### Choice of ``NA`` representation
|
||||
|
||||
For lack of ``NA`` (missing) support from the ground up in NumPy and Python in
|
||||
general, we were given the difficult choice between either:
|
||||
|
||||
- A *masked array* solution: an array of data and an array of boolean values
|
||||
indicating whether a value is there or is missing.
|
||||
- Using a special sentinel value, bit pattern, or set of sentinel values to
|
||||
denote ``NA`` across the dtypes.
|
||||
|
||||
For many reasons we chose the latter. After years of production use it has
|
||||
proven, at least in my opinion, to be the best decision given the state of
|
||||
affairs in NumPy and Python in general. The special value ``NaN``
|
||||
(Not-A-Number) is used everywhere as the ``NA`` value, and there are API
|
||||
functions ``isna`` and ``notna`` which can be used across the dtypes to
|
||||
detect NA values.
|
||||
|
||||
However, it comes with it a couple of trade-offs which I most certainly have
|
||||
not ignored.
|
||||
|
||||
### Support for integer ``NA``
|
||||
|
||||
In the absence of high performance ``NA`` support being built into NumPy from
|
||||
the ground up, the primary casualty is the ability to represent NAs in integer
|
||||
arrays. For example:
|
||||
|
||||
``` python
|
||||
In [20]: s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
|
||||
|
||||
In [21]: s
|
||||
Out[21]:
|
||||
a 1
|
||||
b 2
|
||||
c 3
|
||||
d 4
|
||||
e 5
|
||||
dtype: int64
|
||||
|
||||
In [22]: s.dtype
|
||||
Out[22]: dtype('int64')
|
||||
|
||||
In [23]: s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])
|
||||
|
||||
In [24]: s2
|
||||
Out[24]:
|
||||
a 1.0
|
||||
b 2.0
|
||||
c 3.0
|
||||
f NaN
|
||||
u NaN
|
||||
dtype: float64
|
||||
|
||||
In [25]: s2.dtype
|
||||
Out[25]: dtype('float64')
|
||||
```
|
||||
|
||||
This trade-off is made largely for memory and performance reasons, and also so
|
||||
that the resulting ``Series`` continues to be “numeric”.
|
||||
|
||||
If you need to represent integers with possibly missing values, use one of
|
||||
the nullable-integer extension dtypes provided by pandas
|
||||
|
||||
- [``Int8Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int8Dtype.html#pandas.Int8Dtype)
|
||||
- [``Int16Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int16Dtype.html#pandas.Int16Dtype)
|
||||
- [``Int32Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int32Dtype.html#pandas.Int32Dtype)
|
||||
- [``Int64Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int64Dtype.html#pandas.Int64Dtype)
|
||||
|
||||
``` python
|
||||
In [26]: s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
|
||||
....: dtype=pd.Int64Dtype())
|
||||
....:
|
||||
|
||||
In [27]: s_int
|
||||
Out[27]:
|
||||
a 1
|
||||
b 2
|
||||
c 3
|
||||
d 4
|
||||
e 5
|
||||
dtype: Int64
|
||||
|
||||
In [28]: s_int.dtype
|
||||
Out[28]: Int64Dtype()
|
||||
|
||||
In [29]: s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
|
||||
|
||||
In [30]: s2_int
|
||||
Out[30]:
|
||||
a 1
|
||||
b 2
|
||||
c 3
|
||||
f NaN
|
||||
u NaN
|
||||
dtype: Int64
|
||||
|
||||
In [31]: s2_int.dtype
|
||||
Out[31]: Int64Dtype()
|
||||
```
|
||||
|
||||
See [Nullable integer data type](integer_na.html#integer-na) for more.
|
||||
|
||||
### ``NA`` type promotions
|
||||
|
||||
When introducing NAs into an existing ``Series`` or ``DataFrame`` via
|
||||
[``reindex()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html#pandas.Series.reindex) or some other means, boolean and integer types will be
|
||||
promoted to a different dtype in order to store the NAs. The promotions are
|
||||
summarized in this table:
|
||||
|
||||
Typeclass | Promotion dtype for storing NAs
|
||||
---|---
|
||||
floating | no change
|
||||
object | no change
|
||||
integer | cast to float64
|
||||
boolean | cast to object
|
||||
|
||||
While this may seem like a heavy trade-off, I have found very few cases where
|
||||
this is an issue in practice i.e. storing values greater than 2**53. Some
|
||||
explanation for the motivation is in the next section.
|
||||
|
||||
### Why not make NumPy like R?
|
||||
|
||||
Many people have suggested that NumPy should simply emulate the ``NA`` support
|
||||
present in the more domain-specific statistical programming language [R](https://r-project.org). Part of the reason is the NumPy type hierarchy:
|
||||
|
||||
Typeclass | Dtypes
|
||||
---|---
|
||||
numpy.floating | float16, float32, float64, float128
|
||||
numpy.integer | int8, int16, int32, int64
|
||||
numpy.unsignedinteger | uint8, uint16, uint32, uint64
|
||||
numpy.object_ | object_
|
||||
numpy.bool_ | bool_
|
||||
numpy.character | string_, unicode_
|
||||
|
||||
The R language, by contrast, only has a handful of built-in data types:
|
||||
``integer``, ``numeric`` (floating-point), ``character``, and
|
||||
``boolean``. ``NA`` types are implemented by reserving special bit patterns for
|
||||
each type to be used as the missing value. While doing this with the full NumPy
|
||||
type hierarchy would be possible, it would be a more substantial trade-off
|
||||
(especially for the 8- and 16-bit data types) and implementation undertaking.
|
||||
|
||||
An alternate approach is that of using masked arrays. A masked array is an
|
||||
array of data with an associated boolean *mask* denoting whether each value
|
||||
should be considered ``NA`` or not. I am personally not in love with this
|
||||
approach as I feel that overall it places a fairly heavy burden on the user and
|
||||
the library implementer. Additionally, it exacts a fairly high performance cost
|
||||
when working with numerical data compared with the simple approach of using
|
||||
``NaN``. Thus, I have chosen the Pythonic “practicality beats purity” approach
|
||||
and traded integer ``NA`` capability for a much simpler approach of using a
|
||||
special value in float and object arrays to denote ``NA``, and promoting
|
||||
integer arrays to floating when NAs must be introduced.
|
||||
|
||||
## Differences with NumPy
|
||||
|
||||
For ``Series`` and ``DataFrame`` objects, [``var()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html#pandas.DataFrame.var) normalizes by
|
||||
``N-1`` to produce unbiased estimates of the sample variance, while NumPy’s
|
||||
``var`` normalizes by N, which measures the variance of the sample. Note that
|
||||
[``cov()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cov.html#pandas.DataFrame.cov) normalizes by ``N-1`` in both pandas and NumPy.
|
||||
|
||||
## Thread-safety
|
||||
|
||||
As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to
|
||||
the [``copy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html#pandas.DataFrame.copy) method. If you are doing a lot of copying of
|
||||
``DataFrame`` objects shared among threads, we recommend holding locks inside
|
||||
the threads where the data copying occurs.
|
||||
|
||||
See [this link](https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe)
|
||||
for more information.
|
||||
|
||||
## Byte-Ordering issues
|
||||
|
||||
Occasionally you may have to deal with data that were created on a machine with
|
||||
a different byte order than the one on which you are running Python. A common
|
||||
symptom of this issue is an error like::
|
||||
|
||||
``` python
|
||||
Traceback
|
||||
...
|
||||
ValueError: Big-endian buffer not supported on little-endian compiler
|
||||
```
|
||||
|
||||
To deal
|
||||
with this issue you should convert the underlying NumPy array to the native
|
||||
system byte order *before* passing it to ``Series`` or ``DataFrame``
|
||||
constructors using something similar to the following:
|
||||
|
||||
``` python
|
||||
In [32]: x = np.array(list(range(10)), '>i4') # big endian
|
||||
|
||||
In [33]: newx = x.byteswap().newbyteorder() # force native byteorder
|
||||
|
||||
In [34]: s = pd.Series(newx)
|
||||
```
|
||||
|
||||
See [the NumPy documentation on byte order](https://docs.scipy.org/doc/numpy/user/basics.byteswapping.html) for more
|
||||
details.
|
||||
2417
Python/pandas/user_guide/groupby.md
Normal file
2417
Python/pandas/user_guide/groupby.md
Normal file
File diff suppressed because it is too large
Load Diff
3114
Python/pandas/user_guide/indexing.md
Normal file
3114
Python/pandas/user_guide/indexing.md
Normal file
File diff suppressed because it is too large
Load Diff
175
Python/pandas/user_guide/integer_na.md
Normal file
175
Python/pandas/user_guide/integer_na.md
Normal file
@@ -0,0 +1,175 @@
|
||||
---
|
||||
meta:
|
||||
- name: keywords
|
||||
content: Nullable,整型数据类型
|
||||
- name: description
|
||||
content: 在处理丢失的数据部分, 我们知道pandas主要使用 NaN 来代表丢失数据。因为 NaN 属于浮点型数据,这强制有缺失值的整型array强制转换成浮点型。
|
||||
---
|
||||
|
||||
# Nullable整型数据类型
|
||||
|
||||
*在0.24.0版本中新引入*
|
||||
|
||||
::: tip 小贴士
|
||||
|
||||
IntegerArray目前属于实验性阶段,因此他的API或者使用方式可能会在没有提示的情况下更改。
|
||||
|
||||
:::
|
||||
|
||||
在 [处理丢失的数据](missing_data.html#missing-data)部分, 我们知道pandas主要使用 ``NaN`` 来代表丢失数据。因为 ``NaN`` 属于浮点型数据,这强制有缺失值的整型array强制转换成浮点型。在某些情况下,这可能不会有太大影响,但是如果你的整型数据恰好是标识符,数据类型的转换可能会存在隐患。同时,某些整数无法使用浮点型来表示。
|
||||
|
||||
Pandas能够将可能存在缺失值的整型数据使用[``arrays.IntegerArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.arrays.IntegerArray.html#pandas.arrays.IntegerArray)来表示。这是pandas中内置的 [扩展方式](https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-extension-types)。 它并不是整型数据组成array对象的默认方式,并且并不会被pandas直接使用。因此,如果你希望生成这种数据类型,你需要在生成[``array()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.array.html#pandas.array) 或者 [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)时,在``dtype``变量中直接指定。
|
||||
|
||||
``` python
|
||||
In [1]: arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
|
||||
|
||||
In [2]: arr
|
||||
Out[2]:
|
||||
<IntegerArray>
|
||||
[1, 2, NaN]
|
||||
Length: 3, dtype: Int64
|
||||
```
|
||||
|
||||
或者使用字符串``"Int64"``(注意此处的 ``"I"``需要大写,以此和NumPy中的``'int64'``数据类型作出区别):
|
||||
|
||||
``` python
|
||||
In [3]: pd.array([1, 2, np.nan], dtype="Int64")
|
||||
Out[3]:
|
||||
<IntegerArray>
|
||||
[1, 2, NaN]
|
||||
Length: 3, dtype: Int64
|
||||
```
|
||||
|
||||
这样的array对象与NumPy的array对象类似,可以被存放在[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) 或 [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)中。
|
||||
|
||||
``` python
|
||||
In [4]: pd.Series(arr)
|
||||
Out[4]:
|
||||
0 1
|
||||
1 2
|
||||
2 NaN
|
||||
dtype: Int64
|
||||
```
|
||||
|
||||
你也可以直接将列表形式的数据直接传入[``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)中,并指明``dtype``。
|
||||
|
||||
``` python
|
||||
In [5]: s = pd.Series([1, 2, np.nan], dtype="Int64")
|
||||
|
||||
In [6]: s
|
||||
Out[6]:
|
||||
0 1
|
||||
1 2
|
||||
2 NaN
|
||||
dtype: Int64
|
||||
```
|
||||
|
||||
默认情况下(如果你不指明``dtype``),则会使用NumPy来构建这个数据,最终你会得到``float64``类型的Series:
|
||||
|
||||
``` python
|
||||
In [7]: pd.Series([1, 2, np.nan])
|
||||
Out[7]:
|
||||
0 1.0
|
||||
1 2.0
|
||||
2 NaN
|
||||
dtype: float64
|
||||
```
|
||||
|
||||
对使用了整型array的操作与对NumPy中array的操作类似,缺失值会被继承并保留原本的数据类型,但在必要的情况下,数据类型也会发生转变。
|
||||
|
||||
``` python
|
||||
# 运算
|
||||
In [8]: s + 1
|
||||
Out[8]:
|
||||
0 2
|
||||
1 3
|
||||
2 NaN
|
||||
dtype: Int64
|
||||
|
||||
# 比较
|
||||
In [9]: s == 1
|
||||
Out[9]:
|
||||
0 True
|
||||
1 False
|
||||
2 False
|
||||
dtype: bool
|
||||
|
||||
# 索引
|
||||
In [10]: s.iloc[1:3]
|
||||
Out[10]:
|
||||
1 2
|
||||
2 NaN
|
||||
dtype: Int64
|
||||
|
||||
# 和其他数据类型联合使用
|
||||
In [11]: s + s.iloc[1:3].astype('Int8')
|
||||
Out[11]:
|
||||
0 NaN
|
||||
1 4
|
||||
2 NaN
|
||||
dtype: Int64
|
||||
|
||||
# 在必要情况下,数据类型发生转变
|
||||
In [12]: s + 0.01
|
||||
Out[12]:
|
||||
0 1.01
|
||||
1 2.01
|
||||
2 NaN
|
||||
dtype: float64
|
||||
```
|
||||
|
||||
这种数据类型可以作为 ``DataFrame``的一部分进行使用。
|
||||
|
||||
``` python
|
||||
In [13]: df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
|
||||
|
||||
In [14]: df
|
||||
Out[14]:
|
||||
A B C
|
||||
0 1 1 a
|
||||
1 2 1 a
|
||||
2 NaN 3 b
|
||||
|
||||
In [15]: df.dtypes
|
||||
Out[15]:
|
||||
A Int64
|
||||
B int64
|
||||
C object
|
||||
dtype: object
|
||||
```
|
||||
|
||||
这种数据类型也可以在合并(merge)、重构(reshape)和类型转换(cast)。
|
||||
|
||||
``` python
|
||||
In [16]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
|
||||
Out[16]:
|
||||
A Int64
|
||||
B int64
|
||||
C object
|
||||
dtype: object
|
||||
|
||||
In [17]: df['A'].astype(float)
|
||||
Out[17]:
|
||||
0 1.0
|
||||
1 2.0
|
||||
2 NaN
|
||||
Name: A, dtype: float64
|
||||
```
|
||||
|
||||
类似于求和的降维和分组操作也能正常使用。
|
||||
|
||||
``` python
|
||||
In [18]: df.sum()
|
||||
Out[18]:
|
||||
A 3
|
||||
B 5
|
||||
C aab
|
||||
dtype: object
|
||||
|
||||
In [19]: df.groupby('B').A.sum()
|
||||
Out[19]:
|
||||
B
|
||||
1 3
|
||||
3 0
|
||||
Name: A, dtype: Int64
|
||||
```
|
||||
7123
Python/pandas/user_guide/io.md
Normal file
7123
Python/pandas/user_guide/io.md
Normal file
File diff suppressed because it is too large
Load Diff
1415
Python/pandas/user_guide/merging.md
Normal file
1415
Python/pandas/user_guide/merging.md
Normal file
File diff suppressed because it is too large
Load Diff
1477
Python/pandas/user_guide/missing_data.md
Normal file
1477
Python/pandas/user_guide/missing_data.md
Normal file
File diff suppressed because it is too large
Load Diff
711
Python/pandas/user_guide/options.md
Normal file
711
Python/pandas/user_guide/options.md
Normal file
@@ -0,0 +1,711 @@
|
||||
# Options and settings
|
||||
|
||||
## Overview
|
||||
|
||||
pandas has an options system that lets you customize some aspects of its behaviour,
|
||||
display-related options being those the user is most likely to adjust.
|
||||
|
||||
Options have a full “dotted-style”, case-insensitive name (e.g. ``display.max_rows``).
|
||||
You can get/set options directly as attributes of the top-level ``options`` attribute:
|
||||
|
||||
``` python
|
||||
In [1]: import pandas as pd
|
||||
|
||||
In [2]: pd.options.display.max_rows
|
||||
Out[2]: 15
|
||||
|
||||
In [3]: pd.options.display.max_rows = 999
|
||||
|
||||
In [4]: pd.options.display.max_rows
|
||||
Out[4]: 999
|
||||
```
|
||||
|
||||
The API is composed of 5 relevant functions, available directly from the ``pandas``
|
||||
namespace:
|
||||
|
||||
- [``get_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_option.html#pandas.get_option) / [``set_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html#pandas.set_option) - get/set the value of a single option.
|
||||
- [``reset_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.reset_option.html#pandas.reset_option) - reset one or more options to their default value.
|
||||
- [``describe_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.describe_option.html#pandas.describe_option) - print the descriptions of one or more options.
|
||||
- [``option_context()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.option_context.html#pandas.option_context) - execute a codeblock with a set of options
|
||||
that revert to prior settings after execution.
|
||||
|
||||
**Note:** Developers can check out [pandas/core/config.py](https://github.com/pandas-dev/pandas/blob/master/pandas/core/config.py) for more information.
|
||||
|
||||
All of the functions above accept a regexp pattern (``re.search`` style) as an argument,
|
||||
and so passing in a substring will work - as long as it is unambiguous:
|
||||
|
||||
``` python
|
||||
In [5]: pd.get_option("display.max_rows")
|
||||
Out[5]: 999
|
||||
|
||||
In [6]: pd.set_option("display.max_rows", 101)
|
||||
|
||||
In [7]: pd.get_option("display.max_rows")
|
||||
Out[7]: 101
|
||||
|
||||
In [8]: pd.set_option("max_r", 102)
|
||||
|
||||
In [9]: pd.get_option("display.max_rows")
|
||||
Out[9]: 102
|
||||
```
|
||||
|
||||
The following will **not work** because it matches multiple option names, e.g.
|
||||
``display.max_colwidth``, ``display.max_rows``, ``display.max_columns``:
|
||||
|
||||
``` python
|
||||
In [10]: try:
|
||||
....: pd.get_option("column")
|
||||
....: except KeyError as e:
|
||||
....: print(e)
|
||||
....:
|
||||
'Pattern matched multiple keys'
|
||||
```
|
||||
|
||||
**Note:** Using this form of shorthand may cause your code to break if new options with similar names are added in future versions.
|
||||
|
||||
You can get a list of available options and their descriptions with ``describe_option``. When called
|
||||
with no argument ``describe_option`` will print out the descriptions for all available options.
|
||||
|
||||
## Getting and setting options
|
||||
|
||||
As described above, [``get_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_option.html#pandas.get_option) and [``set_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html#pandas.set_option)
|
||||
are available from the pandas namespace. To change an option, call
|
||||
``set_option('option regex', new_value)``.
|
||||
|
||||
``` python
|
||||
In [11]: pd.get_option('mode.sim_interactive')
|
||||
Out[11]: False
|
||||
|
||||
In [12]: pd.set_option('mode.sim_interactive', True)
|
||||
|
||||
In [13]: pd.get_option('mode.sim_interactive')
|
||||
Out[13]: True
|
||||
```
|
||||
|
||||
**Note:** The option ‘mode.sim_interactive’ is mostly used for debugging purposes.
|
||||
|
||||
All options also have a default value, and you can use ``reset_option`` to do just that:
|
||||
|
||||
``` python
|
||||
In [14]: pd.get_option("display.max_rows")
|
||||
Out[14]: 60
|
||||
|
||||
In [15]: pd.set_option("display.max_rows", 999)
|
||||
|
||||
In [16]: pd.get_option("display.max_rows")
|
||||
Out[16]: 999
|
||||
|
||||
In [17]: pd.reset_option("display.max_rows")
|
||||
|
||||
In [18]: pd.get_option("display.max_rows")
|
||||
Out[18]: 60
|
||||
```
|
||||
|
||||
It’s also possible to reset multiple options at once (using a regex):
|
||||
|
||||
``` python
|
||||
In [19]: pd.reset_option("^display")
|
||||
```
|
||||
|
||||
``option_context`` context manager has been exposed through
|
||||
the top-level API, allowing you to execute code with given option values. Option values
|
||||
are restored automatically when you exit the *with* block:
|
||||
|
||||
``` python
|
||||
In [20]: with pd.option_context("display.max_rows", 10, "display.max_columns", 5):
|
||||
....: print(pd.get_option("display.max_rows"))
|
||||
....: print(pd.get_option("display.max_columns"))
|
||||
....:
|
||||
10
|
||||
5
|
||||
|
||||
In [21]: print(pd.get_option("display.max_rows"))
|
||||
60
|
||||
|
||||
In [22]: print(pd.get_option("display.max_columns"))
|
||||
0
|
||||
```
|
||||
|
||||
## Setting startup options in Python/IPython environment
|
||||
|
||||
Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas more efficient. To do this, create a .py or .ipy script in the startup directory of the desired profile. An example where the startup folder is in a default ipython profile can be found at:
|
||||
|
||||
```
|
||||
$IPYTHONDIR/profile_default/startup
|
||||
```
|
||||
|
||||
More information can be found in the [ipython documentation](https://ipython.org/ipython-doc/stable/interactive/tutorial.html#startup-files). An example startup script for pandas is displayed below:
|
||||
|
||||
``` python
|
||||
import pandas as pd
|
||||
pd.set_option('display.max_rows', 999)
|
||||
pd.set_option('precision', 5)
|
||||
```
|
||||
|
||||
## Frequently Used Options
|
||||
|
||||
The following is a walk-through of the more frequently used display options.
|
||||
|
||||
``display.max_rows`` and ``display.max_columns`` sets the maximum number
|
||||
of rows and columns displayed when a frame is pretty-printed. Truncated
|
||||
lines are replaced by an ellipsis.
|
||||
|
||||
``` python
|
||||
In [23]: df = pd.DataFrame(np.random.randn(7, 2))
|
||||
|
||||
In [24]: pd.set_option('max_rows', 7)
|
||||
|
||||
In [25]: df
|
||||
Out[25]:
|
||||
0 1
|
||||
0 0.469112 -0.282863
|
||||
1 -1.509059 -1.135632
|
||||
2 1.212112 -0.173215
|
||||
3 0.119209 -1.044236
|
||||
4 -0.861849 -2.104569
|
||||
5 -0.494929 1.071804
|
||||
6 0.721555 -0.706771
|
||||
|
||||
In [26]: pd.set_option('max_rows', 5)
|
||||
|
||||
In [27]: df
|
||||
Out[27]:
|
||||
0 1
|
||||
0 0.469112 -0.282863
|
||||
1 -1.509059 -1.135632
|
||||
.. ... ...
|
||||
5 -0.494929 1.071804
|
||||
6 0.721555 -0.706771
|
||||
|
||||
[7 rows x 2 columns]
|
||||
|
||||
In [28]: pd.reset_option('max_rows')
|
||||
```
|
||||
|
||||
Once the ``display.max_rows`` is exceeded, the ``display.min_rows`` options
|
||||
determines how many rows are shown in the truncated repr.
|
||||
|
||||
``` python
|
||||
In [29]: pd.set_option('max_rows', 8)
|
||||
|
||||
In [30]: pd.set_option('max_rows', 4)
|
||||
|
||||
# below max_rows -> all rows shown
|
||||
In [31]: df = pd.DataFrame(np.random.randn(7, 2))
|
||||
|
||||
In [32]: df
|
||||
Out[32]:
|
||||
0 1
|
||||
0 -1.039575 0.271860
|
||||
1 -0.424972 0.567020
|
||||
.. ... ...
|
||||
5 0.404705 0.577046
|
||||
6 -1.715002 -1.039268
|
||||
|
||||
[7 rows x 2 columns]
|
||||
|
||||
# above max_rows -> only min_rows (4) rows shown
|
||||
In [33]: df = pd.DataFrame(np.random.randn(9, 2))
|
||||
|
||||
In [34]: df
|
||||
Out[34]:
|
||||
0 1
|
||||
0 -0.370647 -1.157892
|
||||
1 -1.344312 0.844885
|
||||
.. ... ...
|
||||
7 0.276662 -0.472035
|
||||
8 -0.013960 -0.362543
|
||||
|
||||
[9 rows x 2 columns]
|
||||
|
||||
In [35]: pd.reset_option('max_rows')
|
||||
|
||||
In [36]: pd.reset_option('min_rows')
|
||||
```
|
||||
|
||||
``display.expand_frame_repr`` allows for the representation of
|
||||
dataframes to stretch across pages, wrapped over the full column vs row-wise.
|
||||
|
||||
``` python
|
||||
In [37]: df = pd.DataFrame(np.random.randn(5, 10))
|
||||
|
||||
In [38]: pd.set_option('expand_frame_repr', True)
|
||||
|
||||
In [39]: df
|
||||
Out[39]:
|
||||
0 1 2 3 4 5 6 7 8 9
|
||||
0 -0.006154 -0.923061 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
|
||||
1 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 1.607920
|
||||
2 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747 -0.410001 -0.078638 0.545952 -1.219217
|
||||
3 -1.226825 0.769804 -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734 0.959726 -1.110336
|
||||
4 -0.619976 0.149748 -0.732339 0.687738 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
|
||||
|
||||
In [40]: pd.set_option('expand_frame_repr', False)
|
||||
|
||||
In [41]: df
|
||||
Out[41]:
|
||||
0 1 2 3 4 5 6 7 8 9
|
||||
0 -0.006154 -0.923061 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
|
||||
1 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 1.607920
|
||||
2 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747 -0.410001 -0.078638 0.545952 -1.219217
|
||||
3 -1.226825 0.769804 -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734 0.959726 -1.110336
|
||||
4 -0.619976 0.149748 -0.732339 0.687738 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
|
||||
|
||||
In [42]: pd.reset_option('expand_frame_repr')
|
||||
```
|
||||
|
||||
``display.large_repr`` lets you select whether to display dataframes that exceed
|
||||
``max_columns`` or ``max_rows`` as a truncated frame, or as a summary.
|
||||
|
||||
``` python
|
||||
In [43]: df = pd.DataFrame(np.random.randn(10, 10))
|
||||
|
||||
In [44]: pd.set_option('max_rows', 5)
|
||||
|
||||
In [45]: pd.set_option('large_repr', 'truncate')
|
||||
|
||||
In [46]: df
|
||||
Out[46]:
|
||||
0 1 2 3 4 5 6 7 8 9
|
||||
0 -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232 0.690579 0.995761 2.396780 0.014871
|
||||
1 3.357427 -0.317441 -1.236269 0.896171 -0.487602 -0.082240 -2.182937 0.380396 0.084844 0.432390
|
||||
.. ... ... ... ... ... ... ... ... ... ...
|
||||
8 -0.303421 -0.858447 0.306996 -0.028665 0.384316 1.574159 1.588931 0.476720 0.473424 -0.242861
|
||||
9 -0.014805 -0.284319 0.650776 -1.461665 -1.137707 -0.891060 -0.693921 1.613616 0.464000 0.227371
|
||||
|
||||
[10 rows x 10 columns]
|
||||
|
||||
In [47]: pd.set_option('large_repr', 'info')
|
||||
|
||||
In [48]: df
|
||||
Out[48]:
|
||||
<class 'pandas.core.frame.DataFrame'>
|
||||
RangeIndex: 10 entries, 0 to 9
|
||||
Data columns (total 10 columns):
|
||||
0 10 non-null float64
|
||||
1 10 non-null float64
|
||||
2 10 non-null float64
|
||||
3 10 non-null float64
|
||||
4 10 non-null float64
|
||||
5 10 non-null float64
|
||||
6 10 non-null float64
|
||||
7 10 non-null float64
|
||||
8 10 non-null float64
|
||||
9 10 non-null float64
|
||||
dtypes: float64(10)
|
||||
memory usage: 928.0 bytes
|
||||
|
||||
In [49]: pd.reset_option('large_repr')
|
||||
|
||||
In [50]: pd.reset_option('max_rows')
|
||||
```
|
||||
|
||||
``display.max_colwidth`` sets the maximum width of columns. Cells
|
||||
of this length or longer will be truncated with an ellipsis.
|
||||
|
||||
``` python
|
||||
In [51]: df = pd.DataFrame(np.array([['foo', 'bar', 'bim', 'uncomfortably long string'],
|
||||
....: ['horse', 'cow', 'banana', 'apple']]))
|
||||
....:
|
||||
|
||||
In [52]: pd.set_option('max_colwidth', 40)
|
||||
|
||||
In [53]: df
|
||||
Out[53]:
|
||||
0 1 2 3
|
||||
0 foo bar bim uncomfortably long string
|
||||
1 horse cow banana apple
|
||||
|
||||
In [54]: pd.set_option('max_colwidth', 6)
|
||||
|
||||
In [55]: df
|
||||
Out[55]:
|
||||
0 1 2 3
|
||||
0 foo bar bim un...
|
||||
1 horse cow ba... apple
|
||||
|
||||
In [56]: pd.reset_option('max_colwidth')
|
||||
```
|
||||
|
||||
``display.max_info_columns`` sets a threshold for when by-column info
|
||||
will be given.
|
||||
|
||||
``` python
|
||||
In [57]: df = pd.DataFrame(np.random.randn(10, 10))
|
||||
|
||||
In [58]: pd.set_option('max_info_columns', 11)
|
||||
|
||||
In [59]: df.info()
|
||||
<class 'pandas.core.frame.DataFrame'>
|
||||
RangeIndex: 10 entries, 0 to 9
|
||||
Data columns (total 10 columns):
|
||||
0 10 non-null float64
|
||||
1 10 non-null float64
|
||||
2 10 non-null float64
|
||||
3 10 non-null float64
|
||||
4 10 non-null float64
|
||||
5 10 non-null float64
|
||||
6 10 non-null float64
|
||||
7 10 non-null float64
|
||||
8 10 non-null float64
|
||||
9 10 non-null float64
|
||||
dtypes: float64(10)
|
||||
memory usage: 928.0 bytes
|
||||
|
||||
In [60]: pd.set_option('max_info_columns', 5)
|
||||
|
||||
In [61]: df.info()
|
||||
<class 'pandas.core.frame.DataFrame'>
|
||||
RangeIndex: 10 entries, 0 to 9
|
||||
Columns: 10 entries, 0 to 9
|
||||
dtypes: float64(10)
|
||||
memory usage: 928.0 bytes
|
||||
|
||||
In [62]: pd.reset_option('max_info_columns')
|
||||
```
|
||||
|
||||
``display.max_info_rows``: ``df.info()`` will usually show null-counts for each column.
|
||||
For large frames this can be quite slow. ``max_info_rows`` and ``max_info_cols``
|
||||
limit this null check only to frames with smaller dimensions then specified. Note that you
|
||||
can specify the option ``df.info(null_counts=True)`` to override on showing a particular frame.
|
||||
|
||||
``` python
|
||||
In [63]: df = pd.DataFrame(np.random.choice([0, 1, np.nan], size=(10, 10)))
|
||||
|
||||
In [64]: df
|
||||
Out[64]:
|
||||
0 1 2 3 4 5 6 7 8 9
|
||||
0 0.0 NaN 1.0 NaN NaN 0.0 NaN 0.0 NaN 1.0
|
||||
1 1.0 NaN 1.0 1.0 1.0 1.0 NaN 0.0 0.0 NaN
|
||||
2 0.0 NaN 1.0 0.0 0.0 NaN NaN NaN NaN 0.0
|
||||
3 NaN NaN NaN 0.0 1.0 1.0 NaN 1.0 NaN 1.0
|
||||
4 0.0 NaN NaN NaN 0.0 NaN NaN NaN 1.0 0.0
|
||||
5 0.0 1.0 1.0 1.0 1.0 0.0 NaN NaN 1.0 0.0
|
||||
6 1.0 1.0 1.0 NaN 1.0 NaN 1.0 0.0 NaN NaN
|
||||
7 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN
|
||||
8 NaN NaN NaN 0.0 NaN NaN NaN NaN 1.0 NaN
|
||||
9 0.0 NaN 0.0 NaN NaN 0.0 NaN 1.0 1.0 0.0
|
||||
|
||||
In [65]: pd.set_option('max_info_rows', 11)
|
||||
|
||||
In [66]: df.info()
|
||||
<class 'pandas.core.frame.DataFrame'>
|
||||
RangeIndex: 10 entries, 0 to 9
|
||||
Data columns (total 10 columns):
|
||||
0 8 non-null float64
|
||||
1 3 non-null float64
|
||||
2 7 non-null float64
|
||||
3 6 non-null float64
|
||||
4 7 non-null float64
|
||||
5 6 non-null float64
|
||||
6 2 non-null float64
|
||||
7 6 non-null float64
|
||||
8 6 non-null float64
|
||||
9 6 non-null float64
|
||||
dtypes: float64(10)
|
||||
memory usage: 928.0 bytes
|
||||
|
||||
In [67]: pd.set_option('max_info_rows', 5)
|
||||
|
||||
In [68]: df.info()
|
||||
<class 'pandas.core.frame.DataFrame'>
|
||||
RangeIndex: 10 entries, 0 to 9
|
||||
Data columns (total 10 columns):
|
||||
0 float64
|
||||
1 float64
|
||||
2 float64
|
||||
3 float64
|
||||
4 float64
|
||||
5 float64
|
||||
6 float64
|
||||
7 float64
|
||||
8 float64
|
||||
9 float64
|
||||
dtypes: float64(10)
|
||||
memory usage: 928.0 bytes
|
||||
|
||||
In [69]: pd.reset_option('max_info_rows')
|
||||
```
|
||||
|
||||
``display.precision`` sets the output display precision in terms of decimal places.
|
||||
This is only a suggestion.
|
||||
|
||||
``` python
|
||||
In [70]: df = pd.DataFrame(np.random.randn(5, 5))
|
||||
|
||||
In [71]: pd.set_option('precision', 7)
|
||||
|
||||
In [72]: df
|
||||
Out[72]:
|
||||
0 1 2 3 4
|
||||
0 -1.1506406 -0.7983341 -0.5576966 0.3813531 1.3371217
|
||||
1 -1.5310949 1.3314582 -0.5713290 -0.0266708 -1.0856630
|
||||
2 -1.1147378 -0.0582158 -0.4867681 1.6851483 0.1125723
|
||||
3 -1.4953086 0.8984347 -0.1482168 -1.5960698 0.1596530
|
||||
4 0.2621358 0.0362196 0.1847350 -0.2550694 -0.2710197
|
||||
|
||||
In [73]: pd.set_option('precision', 4)
|
||||
|
||||
In [74]: df
|
||||
Out[74]:
|
||||
0 1 2 3 4
|
||||
0 -1.1506 -0.7983 -0.5577 0.3814 1.3371
|
||||
1 -1.5311 1.3315 -0.5713 -0.0267 -1.0857
|
||||
2 -1.1147 -0.0582 -0.4868 1.6851 0.1126
|
||||
3 -1.4953 0.8984 -0.1482 -1.5961 0.1597
|
||||
4 0.2621 0.0362 0.1847 -0.2551 -0.2710
|
||||
```
|
||||
|
||||
``display.chop_threshold`` sets at what level pandas rounds to zero when
|
||||
it displays a Series of DataFrame. This setting does not change the
|
||||
precision at which the number is stored.
|
||||
|
||||
``` python
|
||||
In [75]: df = pd.DataFrame(np.random.randn(6, 6))
|
||||
|
||||
In [76]: pd.set_option('chop_threshold', 0)
|
||||
|
||||
In [77]: df
|
||||
Out[77]:
|
||||
0 1 2 3 4 5
|
||||
0 1.2884 0.2946 -1.1658 0.8470 -0.6856 0.6091
|
||||
1 -0.3040 0.6256 -0.0593 0.2497 1.1039 -1.0875
|
||||
2 1.9980 -0.2445 0.1362 0.8863 -1.3507 -0.8863
|
||||
3 -1.0133 1.9209 -0.3882 -2.3144 0.6655 0.4026
|
||||
4 0.3996 -1.7660 0.8504 0.3881 0.9923 0.7441
|
||||
5 -0.7398 -1.0549 -0.1796 0.6396 1.5850 1.9067
|
||||
|
||||
In [78]: pd.set_option('chop_threshold', .5)
|
||||
|
||||
In [79]: df
|
||||
Out[79]:
|
||||
0 1 2 3 4 5
|
||||
0 1.2884 0.0000 -1.1658 0.8470 -0.6856 0.6091
|
||||
1 0.0000 0.6256 0.0000 0.0000 1.1039 -1.0875
|
||||
2 1.9980 0.0000 0.0000 0.8863 -1.3507 -0.8863
|
||||
3 -1.0133 1.9209 0.0000 -2.3144 0.6655 0.0000
|
||||
4 0.0000 -1.7660 0.8504 0.0000 0.9923 0.7441
|
||||
5 -0.7398 -1.0549 0.0000 0.6396 1.5850 1.9067
|
||||
|
||||
In [80]: pd.reset_option('chop_threshold')
|
||||
```
|
||||
|
||||
``display.colheader_justify`` controls the justification of the headers.
|
||||
The options are ‘right’, and ‘left’.
|
||||
|
||||
``` python
|
||||
In [81]: df = pd.DataFrame(np.array([np.random.randn(6),
|
||||
....: np.random.randint(1, 9, 6) * .1,
|
||||
....: np.zeros(6)]).T,
|
||||
....: columns=['A', 'B', 'C'], dtype='float')
|
||||
....:
|
||||
|
||||
In [82]: pd.set_option('colheader_justify', 'right')
|
||||
|
||||
In [83]: df
|
||||
Out[83]:
|
||||
A B C
|
||||
0 0.1040 0.1 0.0
|
||||
1 0.1741 0.5 0.0
|
||||
2 -0.4395 0.4 0.0
|
||||
3 -0.7413 0.8 0.0
|
||||
4 -0.0797 0.4 0.0
|
||||
5 -0.9229 0.3 0.0
|
||||
|
||||
In [84]: pd.set_option('colheader_justify', 'left')
|
||||
|
||||
In [85]: df
|
||||
Out[85]:
|
||||
A B C
|
||||
0 0.1040 0.1 0.0
|
||||
1 0.1741 0.5 0.0
|
||||
2 -0.4395 0.4 0.0
|
||||
3 -0.7413 0.8 0.0
|
||||
4 -0.0797 0.4 0.0
|
||||
5 -0.9229 0.3 0.0
|
||||
|
||||
In [86]: pd.reset_option('colheader_justify')
|
||||
```
|
||||
|
||||
## Available options
|
||||
|
||||
Option | Default | Function
|
||||
---|---|---
|
||||
display.chop_threshold | None | If set to a float value, all float values smaller then the given threshold will be displayed as exactly 0 by repr and friends.
|
||||
display.colheader_justify | right | Controls the justification of column headers. used by DataFrameFormatter.
|
||||
display.column_space | 12 | No description available.
|
||||
display.date_dayfirst | False | When True, prints and parses dates with the day first, eg 20/01/2005
|
||||
display.date_yearfirst | False | When True, prints and parses dates with the year first, eg 2005/01/20
|
||||
display.encoding | UTF-8 | Defaults to the detected encoding of the console. Specifies the encoding to be used for strings returned by to_string, these are generally strings meant to be displayed on the console.
|
||||
display.expand_frame_repr | True | Whether to print out the full DataFrame repr for wide DataFrames across multiple lines, max_columns is still respected, but the output will wrap-around across multiple “pages” if its width exceeds display.width.
|
||||
display.float_format | None | The callable should accept a floating point number and return a string with the desired format of the number. This is used in some places like SeriesFormatter. See core.format.EngFormatter for an example.
|
||||
display.large_repr | truncate | For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can show a truncated table (the default), or switch to the view from df.info() (the behaviour in earlier versions of pandas). allowable settings, [‘truncate’, ‘info’]
|
||||
display.latex.repr | False | Whether to produce a latex DataFrame representation for jupyter frontends that support it.
|
||||
display.latex.escape | True | Escapes special characters in DataFrames, when using the to_latex method.
|
||||
display.latex.longtable | False | Specifies if the to_latex method of a DataFrame uses the longtable format.
|
||||
display.latex.multicolumn | True | Combines columns when using a MultiIndex
|
||||
display.latex.multicolumn_format | ‘l’ | Alignment of multicolumn labels
|
||||
display.latex.multirow | False | Combines rows when using a MultiIndex. Centered instead of top-aligned, separated by clines.
|
||||
display.max_columns | 0 or 20 | max_rows and max_columns are used in __repr__() methods to decide if to_string() or info() is used to render an object to a string. In case Python/IPython is running in a terminal this is set to 0 by default and pandas will correctly auto-detect the width of the terminal and switch to a smaller format in case all columns would not fit vertically. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection, in which case the default is set to 20. ‘None’ value means unlimited.
|
||||
display.max_colwidth | 50 | The maximum width in characters of a column in the repr of a pandas data structure. When the column overflows, a “…” placeholder is embedded in the output.
|
||||
display.max_info_columns | 100 | max_info_columns is used in DataFrame.info method to decide if per column information will be printed.
|
||||
display.max_info_rows | 1690785 | df.info() will usually show null-counts for each column. For large frames this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller dimensions then specified.
|
||||
display.max_rows | 60 | This sets the maximum number of rows pandas should output when printing out various output. For example, this value determines whether the repr() for a dataframe prints out fully or just a truncated or summary repr. ‘None’ value means unlimited.
|
||||
display.min_rows | 10 | The numbers of rows to show in a truncated repr (when max_rows is exceeded). Ignored when max_rows is set to None or 0. When set to None, follows the value of max_rows.
|
||||
display.max_seq_items | 100 | when pretty-printing a long sequence, no more then max_seq_items will be printed. If items are omitted, they will be denoted by the addition of “…” to the resulting string. If set to None, the number of items to be printed is unlimited.
|
||||
display.memory_usage | True | This specifies if the memory usage of a DataFrame should be displayed when the df.info() method is invoked.
|
||||
display.multi_sparse | True | “Sparsify” MultiIndex display (don’t display repeated elements in outer levels within groups)
|
||||
display.notebook_repr_html | True | When True, IPython notebook will use html representation for pandas objects (if it is available).
|
||||
display.pprint_nest_depth | 3 | Controls the number of nested levels to process when pretty-printing
|
||||
display.precision | 6 | Floating point output precision in terms of number of places after the decimal, for regular formatting as well as scientific notation. Similar to numpy’s precision print option
|
||||
display.show_dimensions | truncate | Whether to print out dimensions at the end of DataFrame repr. If ‘truncate’ is specified, only print out the dimensions if the frame is truncated (e.g. not display all rows and/or columns)
|
||||
display.width | 80 | Width of the display in characters. In case python/IPython is running in a terminal this can be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width.
|
||||
display.html.table_schema | False | Whether to publish a Table Schema representation for frontends that support it.
|
||||
display.html.border | 1 | A border=value attribute is inserted in the ``<table>`` tag for the DataFrame HTML repr.
|
||||
display.html.use_mathjax | True | When True, Jupyter notebook will process table contents using MathJax, rendering mathematical expressions enclosed by the dollar symbol.
|
||||
io.excel.xls.writer | xlwt | The default Excel writer engine for ‘xls’ files.
|
||||
io.excel.xlsm.writer | openpyxl | The default Excel writer engine for ‘xlsm’ files. Available options: ‘openpyxl’ (the default).
|
||||
io.excel.xlsx.writer | openpyxl | The default Excel writer engine for ‘xlsx’ files.
|
||||
io.hdf.default_format | None | default format writing format, if None, then put will default to ‘fixed’ and append will default to ‘table’
|
||||
io.hdf.dropna_table | True | drop ALL nan rows when appending to a table
|
||||
io.parquet.engine | None | The engine to use as a default for parquet reading and writing. If None then try ‘pyarrow’ and ‘fastparquet’
|
||||
mode.chained_assignment | warn | Controls SettingWithCopyWarning: ‘raise’, ‘warn’, or None. Raise an exception, warn, or no action if trying to use [chained assignment](indexing.html#indexing-evaluation-order).
|
||||
mode.sim_interactive | False | Whether to simulate interactive mode for purposes of testing.
|
||||
mode.use_inf_as_na | False | True means treat None, NaN, -INF, INF as NA (old way), False means None and NaN are null, but INF, -INF are not NA (new way).
|
||||
compute.use_bottleneck | True | Use the bottleneck library to accelerate computation if it is installed.
|
||||
compute.use_numexpr | True | Use the numexpr library to accelerate computation if it is installed.
|
||||
plotting.backend | matplotlib | Change the plotting backend to a different backend than the current matplotlib one. Backends can be implemented as third-party libraries implementing the pandas plotting API. They can use other plotting libraries like Bokeh, Altair, etc.
|
||||
plotting.matplotlib.register_converters | True | Register custom converters with matplotlib. Set to False to de-register.
|
||||
|
||||
## Number formatting
|
||||
|
||||
pandas also allows you to set how numbers are displayed in the console.
|
||||
This option is not set through the ``set_options`` API.
|
||||
|
||||
Use the ``set_eng_float_format`` function
|
||||
to alter the floating-point formatting of pandas objects to produce a particular
|
||||
format.
|
||||
|
||||
For instance:
|
||||
|
||||
``` python
|
||||
In [87]: import numpy as np
|
||||
|
||||
In [88]: pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)
|
||||
|
||||
In [89]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
|
||||
|
||||
In [90]: s / 1.e3
|
||||
Out[90]:
|
||||
a 303.638u
|
||||
b -721.084u
|
||||
c -622.696u
|
||||
d 648.250u
|
||||
e -1.945m
|
||||
dtype: float64
|
||||
|
||||
In [91]: s / 1.e6
|
||||
Out[91]:
|
||||
a 303.638n
|
||||
b -721.084n
|
||||
c -622.696n
|
||||
d 648.250n
|
||||
e -1.945u
|
||||
dtype: float64
|
||||
```
|
||||
|
||||
To round floats on a case-by-case basis, you can also use [``round()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.round.html#pandas.Series.round) and [``round()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.round.html#pandas.DataFrame.round).
|
||||
|
||||
## Unicode formatting
|
||||
|
||||
::: danger Warning
|
||||
|
||||
Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times slower).
|
||||
Use only when it is actually required.
|
||||
|
||||
:::
|
||||
|
||||
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters.
|
||||
If a DataFrame or Series contains these characters, the default output mode may not align them properly.
|
||||
|
||||
::: tip Note
|
||||
|
||||
Screen captures are attached for each output to show the actual results.
|
||||
|
||||
:::
|
||||
|
||||
``` python
|
||||
In [92]: df = pd.DataFrame({'国籍': ['UK', '日本'], '名前': ['Alice', 'しのぶ']})
|
||||
|
||||
In [93]: df
|
||||
Out[93]:
|
||||
国籍 名前
|
||||
0 UK Alice
|
||||
1 日本 しのぶ
|
||||
```
|
||||
|
||||

|
||||
|
||||
Enabling ``display.unicode.east_asian_width`` allows pandas to check each character’s “East Asian Width” property.
|
||||
These characters can be aligned properly by setting this option to ``True``. However, this will result in longer render
|
||||
times than the standard ``len`` function.
|
||||
|
||||
``` python
|
||||
In [94]: pd.set_option('display.unicode.east_asian_width', True)
|
||||
|
||||
In [95]: df
|
||||
Out[95]:
|
||||
国籍 名前
|
||||
0 UK Alice
|
||||
1 日本 しのぶ
|
||||
```
|
||||
|
||||

|
||||
|
||||
In addition, Unicode characters whose width is “Ambiguous” can either be 1 or 2 characters wide depending on the
|
||||
terminal setting or encoding. The option ``display.unicode.ambiguous_as_wide`` can be used to handle the ambiguity.
|
||||
|
||||
By default, an “Ambiguous” character’s width, such as “¡” (inverted exclamation) in the example below, is taken to be 1.
|
||||
|
||||
``` python
|
||||
In [96]: df = pd.DataFrame({'a': ['xxx', '¡¡'], 'b': ['yyy', '¡¡']})
|
||||
|
||||
In [97]: df
|
||||
Out[97]:
|
||||
a b
|
||||
0 xxx yyy
|
||||
1 ¡¡ ¡¡
|
||||
```
|
||||
|
||||

|
||||
|
||||
Enabling ``display.unicode.ambiguous_as_wide`` makes pandas interpret these characters’ widths to be 2.
|
||||
(Note that this option will only be effective when ``display.unicode.east_asian_width`` is enabled.)
|
||||
|
||||
However, setting this option incorrectly for your terminal will cause these characters to be aligned incorrectly:
|
||||
|
||||
``` python
|
||||
In [98]: pd.set_option('display.unicode.ambiguous_as_wide', True)
|
||||
|
||||
In [99]: df
|
||||
Out[99]:
|
||||
a b
|
||||
0 xxx yyy
|
||||
1 ¡¡ ¡¡
|
||||
```
|
||||
|
||||

|
||||
|
||||
## Table schema display
|
||||
|
||||
*New in version 0.20.0.*
|
||||
|
||||
``DataFrame`` and ``Series`` will publish a Table Schema representation
|
||||
by default. False by default, this can be enabled globally with the
|
||||
``display.html.table_schema`` option:
|
||||
|
||||
``` python
|
||||
In [100]: pd.set_option('display.html.table_schema', True)
|
||||
```
|
||||
|
||||
Only ``'display.max_rows'`` are serialized and published.
|
||||
1520
Python/pandas/user_guide/reshaping.md
Normal file
1520
Python/pandas/user_guide/reshaping.md
Normal file
File diff suppressed because it is too large
Load Diff
565
Python/pandas/user_guide/sparse.md
Normal file
565
Python/pandas/user_guide/sparse.md
Normal file
@@ -0,0 +1,565 @@
|
||||
# Sparse data structures
|
||||
|
||||
::: tip Note
|
||||
|
||||
``SparseSeries`` and ``SparseDataFrame`` have been deprecated. Their purpose
|
||||
is served equally well by a [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) or [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) with
|
||||
sparse values. See [Migrating](#sparse-migration) for tips on migrating.
|
||||
|
||||
:::
|
||||
|
||||
Pandas provides data structures for efficiently storing sparse data.
|
||||
These are not necessarily sparse in the typical “mostly 0”. Rather, you can view these
|
||||
objects as being “compressed” where any data matching a specific value (``NaN`` / missing value, though any value
|
||||
can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
|
||||
|
||||
``` python
|
||||
In [1]: arr = np.random.randn(10)
|
||||
|
||||
In [2]: arr[2:-2] = np.nan
|
||||
|
||||
In [3]: ts = pd.Series(pd.SparseArray(arr))
|
||||
|
||||
In [4]: ts
|
||||
Out[4]:
|
||||
0 0.469112
|
||||
1 -0.282863
|
||||
2 NaN
|
||||
3 NaN
|
||||
4 NaN
|
||||
5 NaN
|
||||
6 NaN
|
||||
7 NaN
|
||||
8 -0.861849
|
||||
9 -2.104569
|
||||
dtype: Sparse[float64, nan]
|
||||
```
|
||||
|
||||
Notice the dtype, ``Sparse[float64, nan]``. The ``nan`` means that elements in the
|
||||
array that are ``nan`` aren’t actually stored, only the non-``nan`` elements are.
|
||||
Those non-``nan`` elements have a ``float64`` dtype.
|
||||
|
||||
The sparse objects exist for memory efficiency reasons. Suppose you had a
|
||||
large, mostly NA ``DataFrame``:
|
||||
|
||||
``` python
|
||||
In [5]: df = pd.DataFrame(np.random.randn(10000, 4))
|
||||
|
||||
In [6]: df.iloc[:9998] = np.nan
|
||||
|
||||
In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))
|
||||
|
||||
In [8]: sdf.head()
|
||||
Out[8]:
|
||||
0 1 2 3
|
||||
0 NaN NaN NaN NaN
|
||||
1 NaN NaN NaN NaN
|
||||
2 NaN NaN NaN NaN
|
||||
3 NaN NaN NaN NaN
|
||||
4 NaN NaN NaN NaN
|
||||
|
||||
In [9]: sdf.dtypes
|
||||
Out[9]:
|
||||
0 Sparse[float64, nan]
|
||||
1 Sparse[float64, nan]
|
||||
2 Sparse[float64, nan]
|
||||
3 Sparse[float64, nan]
|
||||
dtype: object
|
||||
|
||||
In [10]: sdf.sparse.density
|
||||
Out[10]: 0.0002
|
||||
```
|
||||
|
||||
As you can see, the density (% of values that have not been “compressed”) is
|
||||
extremely low. This sparse object takes up much less memory on disk (pickled)
|
||||
and in the Python interpreter.
|
||||
|
||||
``` python
|
||||
In [11]: 'dense : {:0.2f} bytes'.format(df.memory_usage().sum() / 1e3)
|
||||
Out[11]: 'dense : 320.13 bytes'
|
||||
|
||||
In [12]: 'sparse: {:0.2f} bytes'.format(sdf.memory_usage().sum() / 1e3)
|
||||
Out[12]: 'sparse: 0.22 bytes'
|
||||
```
|
||||
|
||||
Functionally, their behavior should be nearly
|
||||
identical to their dense counterparts.
|
||||
|
||||
## SparseArray
|
||||
|
||||
[``SparseArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html#pandas.SparseArray) is a [``ExtensionArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray)
|
||||
for storing an array of sparse values (see [dtypes](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes) for more
|
||||
on extension arrays). It is a 1-dimensional ndarray-like object storing
|
||||
only values distinct from the ``fill_value``:
|
||||
|
||||
``` python
|
||||
In [13]: arr = np.random.randn(10)
|
||||
|
||||
In [14]: arr[2:5] = np.nan
|
||||
|
||||
In [15]: arr[7:8] = np.nan
|
||||
|
||||
In [16]: sparr = pd.SparseArray(arr)
|
||||
|
||||
In [17]: sparr
|
||||
Out[17]:
|
||||
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
|
||||
Fill: nan
|
||||
IntIndex
|
||||
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
|
||||
```
|
||||
|
||||
A sparse array can be converted to a regular (dense) ndarray with ``numpy.asarray()``
|
||||
|
||||
``` python
|
||||
In [18]: np.asarray(sparr)
|
||||
Out[18]:
|
||||
array([-1.9557, -1.6589, nan, nan, nan, 1.1589, 0.1453,
|
||||
nan, 0.606 , 1.3342])
|
||||
```
|
||||
|
||||
## SparseDtype
|
||||
|
||||
The ``SparseArray.dtype`` property stores two pieces of information
|
||||
|
||||
1. The dtype of the non-sparse values
|
||||
1. The scalar fill value
|
||||
|
||||
``` python
|
||||
In [19]: sparr.dtype
|
||||
Out[19]: Sparse[float64, nan]
|
||||
```
|
||||
|
||||
A [``SparseDtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseDtype.html#pandas.SparseDtype) may be constructed by passing each of these
|
||||
|
||||
``` python
|
||||
In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
|
||||
Out[20]: Sparse[datetime64[ns], NaT]
|
||||
```
|
||||
|
||||
The default fill value for a given NumPy dtype is the “missing” value for that dtype,
|
||||
though it may be overridden.
|
||||
|
||||
``` python
|
||||
In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
|
||||
....: fill_value=pd.Timestamp('2017-01-01'))
|
||||
....:
|
||||
Out[21]: Sparse[datetime64[ns], 2017-01-01 00:00:00]
|
||||
```
|
||||
|
||||
Finally, the string alias ``'Sparse[dtype]'`` may be used to specify a sparse dtype
|
||||
in many places
|
||||
|
||||
``` python
|
||||
In [22]: pd.array([1, 0, 0, 2], dtype='Sparse[int]')
|
||||
Out[22]:
|
||||
[1, 0, 0, 2]
|
||||
Fill: 0
|
||||
IntIndex
|
||||
Indices: array([0, 3], dtype=int32)
|
||||
```
|
||||
|
||||
## Sparse accessor
|
||||
|
||||
*New in version 0.24.0.*
|
||||
|
||||
Pandas provides a ``.sparse`` accessor, similar to ``.str`` for string data, ``.cat``
|
||||
for categorical data, and ``.dt`` for datetime-like data. This namespace provides
|
||||
attributes and methods that are specific to sparse data.
|
||||
|
||||
``` python
|
||||
In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
|
||||
|
||||
In [24]: s.sparse.density
|
||||
Out[24]: 0.5
|
||||
|
||||
In [25]: s.sparse.fill_value
|
||||
Out[25]: 0
|
||||
```
|
||||
|
||||
This accessor is available only on data with ``SparseDtype``, and on the [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)
|
||||
class itself for creating a Series with sparse data from a scipy COO matrix with.
|
||||
|
||||
*New in version 0.25.0.*
|
||||
|
||||
A ``.sparse`` accessor has been added for [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) as well.
|
||||
See [Sparse accessor](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#api-frame-sparse) for more.
|
||||
|
||||
## Sparse calculation
|
||||
|
||||
You can apply NumPy [ufuncs](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
|
||||
to ``SparseArray`` and get a ``SparseArray`` as a result.
|
||||
|
||||
``` python
|
||||
In [26]: arr = pd.SparseArray([1., np.nan, np.nan, -2., np.nan])
|
||||
|
||||
In [27]: np.abs(arr)
|
||||
Out[27]:
|
||||
[1.0, nan, nan, 2.0, nan]
|
||||
Fill: nan
|
||||
IntIndex
|
||||
Indices: array([0, 3], dtype=int32)
|
||||
```
|
||||
|
||||
The *ufunc* is also applied to ``fill_value``. This is needed to get
|
||||
the correct dense result.
|
||||
|
||||
``` python
|
||||
In [28]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)
|
||||
|
||||
In [29]: np.abs(arr)
|
||||
Out[29]:
|
||||
[1.0, 1, 1, 2.0, 1]
|
||||
Fill: 1
|
||||
IntIndex
|
||||
Indices: array([0, 3], dtype=int32)
|
||||
|
||||
In [30]: np.abs(arr).to_dense()
|
||||
Out[30]: array([1., 1., 1., 2., 1.])
|
||||
```
|
||||
|
||||
## Migrating
|
||||
|
||||
In older versions of pandas, the ``SparseSeries`` and ``SparseDataFrame`` classes (documented below)
|
||||
were the preferred way to work with sparse data. With the advent of extension arrays, these subclasses
|
||||
are no longer needed. Their purpose is better served by using a regular Series or DataFrame with
|
||||
sparse values instead.
|
||||
|
||||
::: tip Note
|
||||
|
||||
There’s no performance or memory penalty to using a Series or DataFrame with sparse values,
|
||||
rather than a SparseSeries or SparseDataFrame.
|
||||
|
||||
:::
|
||||
|
||||
This section provides some guidance on migrating your code to the new style. As a reminder,
|
||||
you can use the python warnings module to control warnings. But we recommend modifying
|
||||
your code, rather than ignoring the warning.
|
||||
|
||||
**Construction**
|
||||
|
||||
From an array-like, use the regular [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) or
|
||||
[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) constructors with [``SparseArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html#pandas.SparseArray) values.
|
||||
|
||||
``` python
|
||||
# Previous way
|
||||
>>> pd.SparseDataFrame({"A": [0, 1]})
|
||||
```
|
||||
|
||||
``` python
|
||||
# New way
|
||||
In [31]: pd.DataFrame({"A": pd.SparseArray([0, 1])})
|
||||
Out[31]:
|
||||
A
|
||||
0 0
|
||||
1 1
|
||||
```
|
||||
|
||||
From a SciPy sparse matrix, use [``DataFrame.sparse.from_spmatrix()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.from_spmatrix.html#pandas.DataFrame.sparse.from_spmatrix),
|
||||
|
||||
``` python
|
||||
# Previous way
|
||||
>>> from scipy import sparse
|
||||
>>> mat = sparse.eye(3)
|
||||
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
|
||||
```
|
||||
|
||||
``` python
|
||||
# New way
|
||||
In [32]: from scipy import sparse
|
||||
|
||||
In [33]: mat = sparse.eye(3)
|
||||
|
||||
In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])
|
||||
|
||||
In [35]: df.dtypes
|
||||
Out[35]:
|
||||
A Sparse[float64, 0.0]
|
||||
B Sparse[float64, 0.0]
|
||||
C Sparse[float64, 0.0]
|
||||
dtype: object
|
||||
```
|
||||
|
||||
**Conversion**
|
||||
|
||||
From sparse to dense, use the ``.sparse`` accessors
|
||||
|
||||
``` python
|
||||
In [36]: df.sparse.to_dense()
|
||||
Out[36]:
|
||||
A B C
|
||||
0 1.0 0.0 0.0
|
||||
1 0.0 1.0 0.0
|
||||
2 0.0 0.0 1.0
|
||||
|
||||
In [37]: df.sparse.to_coo()
|
||||
Out[37]:
|
||||
<3x3 sparse matrix of type '<class 'numpy.float64'>'
|
||||
with 3 stored elements in COOrdinate format>
|
||||
```
|
||||
|
||||
From dense to sparse, use [``DataFrame.astype()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype) with a [``SparseDtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseDtype.html#pandas.SparseDtype).
|
||||
|
||||
``` python
|
||||
In [38]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})
|
||||
|
||||
In [39]: dtype = pd.SparseDtype(int, fill_value=0)
|
||||
|
||||
In [40]: dense.astype(dtype)
|
||||
Out[40]:
|
||||
A
|
||||
0 1
|
||||
1 0
|
||||
2 0
|
||||
3 1
|
||||
```
|
||||
|
||||
**Sparse Properties**
|
||||
|
||||
Sparse-specific properties, like ``density``, are available on the ``.sparse`` accessor.
|
||||
|
||||
``` python
|
||||
In [41]: df.sparse.density
|
||||
Out[41]: 0.3333333333333333
|
||||
```
|
||||
|
||||
**General differences**
|
||||
|
||||
In a ``SparseDataFrame``, *all* columns were sparse. A [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) can have a mixture of
|
||||
sparse and dense columns. As a consequence, assigning new columns to a ``DataFrame`` with sparse
|
||||
values will not automatically convert the input to be sparse.
|
||||
|
||||
``` python
|
||||
# Previous Way
|
||||
>>> df = pd.SparseDataFrame({"A": [0, 1]})
|
||||
>>> df['B'] = [0, 0] # implicitly becomes Sparse
|
||||
>>> df['B'].dtype
|
||||
Sparse[int64, nan]
|
||||
```
|
||||
|
||||
Instead, you’ll need to ensure that the values being assigned are sparse
|
||||
|
||||
``` python
|
||||
In [42]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
|
||||
|
||||
In [43]: df['B'] = [0, 0] # remains dense
|
||||
|
||||
In [44]: df['B'].dtype
|
||||
Out[44]: dtype('int64')
|
||||
|
||||
In [45]: df['B'] = pd.SparseArray([0, 0])
|
||||
|
||||
In [46]: df['B'].dtype
|
||||
Out[46]: Sparse[int64, 0]
|
||||
```
|
||||
|
||||
The ``SparseDataFrame.default_kind`` and ``SparseDataFrame.default_fill_value`` attributes
|
||||
have no replacement.
|
||||
|
||||
## Interaction with scipy.sparse
|
||||
|
||||
Use [``DataFrame.sparse.from_spmatrix()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.from_spmatrix.html#pandas.DataFrame.sparse.from_spmatrix) to create a ``DataFrame`` with sparse values from a sparse matrix.
|
||||
|
||||
*New in version 0.25.0.*
|
||||
|
||||
``` python
|
||||
In [47]: from scipy.sparse import csr_matrix
|
||||
|
||||
In [48]: arr = np.random.random(size=(1000, 5))
|
||||
|
||||
In [49]: arr[arr < .9] = 0
|
||||
|
||||
In [50]: sp_arr = csr_matrix(arr)
|
||||
|
||||
In [51]: sp_arr
|
||||
Out[51]:
|
||||
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
|
||||
with 517 stored elements in Compressed Sparse Row format>
|
||||
|
||||
In [52]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
|
||||
|
||||
In [53]: sdf.head()
|
||||
Out[53]:
|
||||
0 1 2 3 4
|
||||
0 0.956380 0.0 0.0 0.000000 0.0
|
||||
1 0.000000 0.0 0.0 0.000000 0.0
|
||||
2 0.000000 0.0 0.0 0.000000 0.0
|
||||
3 0.000000 0.0 0.0 0.000000 0.0
|
||||
4 0.999552 0.0 0.0 0.956153 0.0
|
||||
|
||||
In [54]: sdf.dtypes
|
||||
Out[54]:
|
||||
0 Sparse[float64, 0.0]
|
||||
1 Sparse[float64, 0.0]
|
||||
2 Sparse[float64, 0.0]
|
||||
3 Sparse[float64, 0.0]
|
||||
4 Sparse[float64, 0.0]
|
||||
dtype: object
|
||||
```
|
||||
|
||||
All sparse formats are supported, but matrices that are not in [``COOrdinate``](https://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse) format will be converted, copying data as needed.
|
||||
To convert back to sparse SciPy matrix in COO format, you can use the [``DataFrame.sparse.to_coo()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.to_coo.html#pandas.DataFrame.sparse.to_coo) method:
|
||||
|
||||
``` python
|
||||
In [55]: sdf.sparse.to_coo()
|
||||
Out[55]:
|
||||
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
|
||||
with 517 stored elements in COOrdinate format>
|
||||
```
|
||||
|
||||
meth:*Series.sparse.to_coo* is implemented for transforming a ``Series`` with sparse values indexed by a [``MultiIndex``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex) to a [``scipy.sparse.coo_matrix``](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix).
|
||||
|
||||
The method requires a ``MultiIndex`` with two or more levels.
|
||||
|
||||
``` python
|
||||
In [56]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
|
||||
|
||||
In [57]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
|
||||
....: (1, 2, 'a', 1),
|
||||
....: (1, 1, 'b', 0),
|
||||
....: (1, 1, 'b', 1),
|
||||
....: (2, 1, 'b', 0),
|
||||
....: (2, 1, 'b', 1)],
|
||||
....: names=['A', 'B', 'C', 'D'])
|
||||
....:
|
||||
|
||||
In [58]: s
|
||||
Out[58]:
|
||||
A B C D
|
||||
1 2 a 0 3.0
|
||||
1 NaN
|
||||
1 b 0 1.0
|
||||
1 3.0
|
||||
2 1 b 0 NaN
|
||||
1 NaN
|
||||
dtype: float64
|
||||
|
||||
In [59]: ss = s.astype('Sparse')
|
||||
|
||||
In [60]: ss
|
||||
Out[60]:
|
||||
A B C D
|
||||
1 2 a 0 3.0
|
||||
1 NaN
|
||||
1 b 0 1.0
|
||||
1 3.0
|
||||
2 1 b 0 NaN
|
||||
1 NaN
|
||||
dtype: Sparse[float64, nan]
|
||||
```
|
||||
|
||||
In the example below, we transform the ``Series`` to a sparse representation of a 2-d array by specifying that the first and second ``MultiIndex`` levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
|
||||
|
||||
``` python
|
||||
In [61]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B'],
|
||||
....: column_levels=['C', 'D'],
|
||||
....: sort_labels=True)
|
||||
....:
|
||||
|
||||
In [62]: A
|
||||
Out[62]:
|
||||
<3x4 sparse matrix of type '<class 'numpy.float64'>'
|
||||
with 3 stored elements in COOrdinate format>
|
||||
|
||||
In [63]: A.todense()
|
||||
Out[63]:
|
||||
matrix([[0., 0., 1., 3.],
|
||||
[3., 0., 0., 0.],
|
||||
[0., 0., 0., 0.]])
|
||||
|
||||
In [64]: rows
|
||||
Out[64]: [(1, 1), (1, 2), (2, 1)]
|
||||
|
||||
In [65]: columns
|
||||
Out[65]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
|
||||
```
|
||||
|
||||
Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
|
||||
|
||||
``` python
|
||||
In [66]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B', 'C'],
|
||||
....: column_levels=['D'],
|
||||
....: sort_labels=False)
|
||||
....:
|
||||
|
||||
In [67]: A
|
||||
Out[67]:
|
||||
<3x2 sparse matrix of type '<class 'numpy.float64'>'
|
||||
with 3 stored elements in COOrdinate format>
|
||||
|
||||
In [68]: A.todense()
|
||||
Out[68]:
|
||||
matrix([[3., 0.],
|
||||
[1., 3.],
|
||||
[0., 0.]])
|
||||
|
||||
In [69]: rows
|
||||
Out[69]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
|
||||
|
||||
In [70]: columns
|
||||
Out[70]: [0, 1]
|
||||
```
|
||||
|
||||
A convenience method [``Series.sparse.from_coo()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sparse.from_coo.html#pandas.Series.sparse.from_coo) is implemented for creating a ``Series`` with sparse values from a ``scipy.sparse.coo_matrix``.
|
||||
|
||||
``` python
|
||||
In [71]: from scipy import sparse
|
||||
|
||||
In [72]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
|
||||
....: shape=(3, 4))
|
||||
....:
|
||||
|
||||
In [73]: A
|
||||
Out[73]:
|
||||
<3x4 sparse matrix of type '<class 'numpy.float64'>'
|
||||
with 3 stored elements in COOrdinate format>
|
||||
|
||||
In [74]: A.todense()
|
||||
Out[74]:
|
||||
matrix([[0., 0., 1., 2.],
|
||||
[3., 0., 0., 0.],
|
||||
[0., 0., 0., 0.]])
|
||||
```
|
||||
|
||||
The default behaviour (with ``dense_index=False``) simply returns a ``Series`` containing
|
||||
only the non-null entries.
|
||||
|
||||
``` python
|
||||
In [75]: ss = pd.Series.sparse.from_coo(A)
|
||||
|
||||
In [76]: ss
|
||||
Out[76]:
|
||||
0 2 1.0
|
||||
3 2.0
|
||||
1 0 3.0
|
||||
dtype: Sparse[float64, nan]
|
||||
```
|
||||
|
||||
Specifying ``dense_index=True`` will result in an index that is the Cartesian product of the
|
||||
row and columns coordinates of the matrix. Note that this will consume a significant amount of memory
|
||||
(relative to ``dense_index=False``) if the sparse matrix is large (and sparse) enough.
|
||||
|
||||
``` python
|
||||
In [77]: ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)
|
||||
|
||||
In [78]: ss_dense
|
||||
Out[78]:
|
||||
0 0 NaN
|
||||
1 NaN
|
||||
2 1.0
|
||||
3 2.0
|
||||
1 0 3.0
|
||||
1 NaN
|
||||
2 NaN
|
||||
3 NaN
|
||||
2 0 NaN
|
||||
1 NaN
|
||||
2 NaN
|
||||
3 NaN
|
||||
dtype: Sparse[float64, nan]
|
||||
```
|
||||
|
||||
## Sparse subclasses
|
||||
|
||||
The ``SparseSeries`` and ``SparseDataFrame`` classes are deprecated. Visit their
|
||||
API pages for usage.
|
||||
439
Python/pandas/user_guide/style.md
Normal file
439
Python/pandas/user_guide/style.md
Normal file
@@ -0,0 +1,439 @@
|
||||
# Styling
|
||||
|
||||
*New in version 0.17.1*
|
||||
|
||||
Provisional: This is a new feature and still under development. We’ll be adding features and possibly making breaking changes in future releases. We’d love to hear your feedback.
|
||||
|
||||
This document is written as a Jupyter Notebook, and can be viewed or downloaded [here](http://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/style.ipynb).
|
||||
|
||||
You can apply **conditional formatting**, the visual styling of a DataFrame depending on the data within, by using the ``DataFrame.style`` property. This is a property that returns a ``Styler`` object, which has useful methods for formatting and displaying DataFrames.
|
||||
|
||||
The styling is accomplished using CSS. You write “style functions” that take scalars, ``DataFrame``s or ``Series``, and return *like-indexed* DataFrames or Series with CSS ``"attribute: value"`` pairs for the values. These functions can be incrementally passed to the ``Styler`` which collects the styles before rendering.
|
||||
|
||||
## Building styles
|
||||
|
||||
Pass your style functions into one of the following methods:
|
||||
|
||||
- ``Styler.applymap``: elementwise
|
||||
- ``Styler.apply``: column-/row-/table-wise
|
||||
|
||||
Both of those methods take a function (and some other keyword arguments) and applies your function to the DataFrame in a certain way. ``Styler.applymap`` works through the DataFrame elementwise. ``Styler.apply`` passes each column or row into your DataFrame one-at-a-time or the entire table at once, depending on the ``axis`` keyword argument. For columnwise use ``axis=0``, rowwise use ``axis=1``, and for the entire table at once use ``axis=None``.
|
||||
|
||||
For ``Styler.applymap`` your function should take a scalar and return a single string with the CSS attribute-value pair.
|
||||
|
||||
For ``Styler.apply`` your function should take a Series or DataFrame (depending on the axis parameter), and return a Series or DataFrame with an identical shape where each value is a string with a CSS attribute-value pair.
|
||||
|
||||
Let’s see some examples.
|
||||
|
||||

|
||||
|
||||
Here’s a boring example of rendering a DataFrame, without any (visible) styles:
|
||||
|
||||

|
||||
|
||||
*Note*: The ``DataFrame.style`` attribute is a property that returns a ``Styler`` object. ``Styler`` has a ``_repr_html_`` method defined on it so they are rendered automatically. If you want the actual HTML back for further processing or for writing to file call the ``.render()`` method which returns a string.
|
||||
|
||||
The above output looks very similar to the standard DataFrame HTML representation. But we’ve done some work behind the scenes to attach CSS classes to each cell. We can view these by calling the ``.render`` method.
|
||||
|
||||
``` javascript
|
||||
df.style.highlight_null().render().split('\n')[:10]
|
||||
```
|
||||
|
||||
``` javascript
|
||||
['<style type="text/css" >',
|
||||
' #T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col2 {',
|
||||
' background-color: red;',
|
||||
' }</style><table id="T_acfc12d6_a988_11e9_a75e_31802e421a9b" ><thead> <tr> <th class="blank level0" ></th> <th class="col_heading level0 col0" >A</th> <th class="col_heading level0 col1" >B</th> <th class="col_heading level0 col2" >C</th> <th class="col_heading level0 col3" >D</th> <th class="col_heading level0 col4" >E</th> </tr></thead><tbody>',
|
||||
' <tr>',
|
||||
' <th id="T_acfc12d6_a988_11e9_a75e_31802e421a9blevel0_row0" class="row_heading level0 row0" >0</th>',
|
||||
' <td id="T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col0" class="data row0 col0" >1</td>',
|
||||
' <td id="T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col1" class="data row0 col1" >1.32921</td>',
|
||||
' <td id="T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col2" class="data row0 col2" >nan</td>',
|
||||
' <td id="T_acfc12d6_a988_11e9_a75e_31802e421a9brow0_col3" class="data row0 col3" >-0.31628</td>']
|
||||
```
|
||||
|
||||
The ``row0_col2`` is the identifier for that particular cell. We’ve also prepended each row/column identifier with a UUID unique to each DataFrame so that the style from one doesn’t collide with the styling from another within the same notebook or page (you can set the ``uuid`` if you’d like to tie together the styling of two DataFrames).
|
||||
|
||||
When writing style functions, you take care of producing the CSS attribute / value pairs you want. Pandas matches those up with the CSS classes that identify each cell.
|
||||
|
||||
Let’s write a simple style function that will color negative numbers red and positive numbers black.
|
||||
|
||||

|
||||
|
||||
In this case, the cell’s style depends only on it’s own value. That means we should use the ``Styler.applymap`` method which works elementwise.
|
||||
|
||||

|
||||
|
||||
Notice the similarity with the standard ``df.applymap``, which operates on DataFrames elementwise. We want you to be able to reuse your existing knowledge of how to interact with DataFrames.
|
||||
|
||||
Notice also that our function returned a string containing the CSS attribute and value, separated by a colon just like in a ```` tag. This will be a common theme.</p>
|
||||
|
||||
Finally, the input shapes matched. ``Styler.applymap`` calls the function on each scalar input, and the function returns a scalar output.
|
||||
|
||||
Now suppose you wanted to highlight the maximum value in each column. We can’t use ``.applymap`` anymore since that operated elementwise. Instead, we’ll turn to ``.apply`` which operates columnwise (or rowwise using the ``axis`` keyword). Later on we’ll see that something like ``highlight_max`` is already defined on ``Styler`` so you wouldn’t need to write this yourself.
|
||||
|
||||

|
||||
|
||||
In this case the input is a ``Series``, one column at a time. Notice that the output shape of ``highlight_max`` matches the input shape, an array with ``len(s)`` items.
|
||||
|
||||
We encourage you to use method chains to build up a style piecewise, before finally rending at the end of the chain.
|
||||
|
||||

|
||||
|
||||
Above we used ``Styler.apply`` to pass in each column one at a time.
|
||||
|
||||
Debugging Tip: If you’re having trouble writing your style function, try just passing it into DataFrame.apply. Internally, Styler.apply uses DataFrame.apply so the result should be the same.
|
||||
|
||||
What if you wanted to highlight just the maximum value in the entire table? Use ``.apply(function, axis=None)`` to indicate that your function wants the entire table, not one column or row at a time. Let’s try that next.
|
||||
|
||||
We’ll rewrite our ``highlight-max`` to handle either Series (from ``.apply(axis=0 or 1)``) or DataFrames (from ``.apply(axis=None)``). We’ll also allow the color to be adjustable, to demonstrate that ``.apply``, and ``.applymap`` pass along keyword arguments.
|
||||
|
||||

|
||||
|
||||
When using ``Styler.apply(func, axis=None)``, the function must return a DataFrame with the same index and column labels.
|
||||
|
||||

|
||||
|
||||
### Building Styles Summary
|
||||
|
||||
Style functions should return strings with one or more CSS ``attribute: value`` delimited by semicolons. Use
|
||||
|
||||
- ``Styler.applymap(func)`` for elementwise styles
|
||||
- ``Styler.apply(func, axis=0)`` for columnwise styles
|
||||
- ``Styler.apply(func, axis=1)`` for rowwise styles
|
||||
- ``Styler.apply(func, axis=None)`` for tablewise styles
|
||||
|
||||
And crucially the input and output shapes of ``func`` must match. If ``x`` is the input then ``func(x).shape == x.shape``.
|
||||
|
||||
## Finer control: slicing
|
||||
|
||||
Both ``Styler.apply``, and ``Styler.applymap`` accept a ``subset`` keyword. This allows you to apply styles to specific rows or columns, without having to code that logic into your ``style`` function.
|
||||
|
||||
The value passed to ``subset`` behaves similar to slicing a DataFrame.
|
||||
|
||||
- A scalar is treated as a column label
|
||||
- A list (or series or numpy array)
|
||||
- A tuple is treated as ``(row_indexer, column_indexer)``
|
||||
|
||||
Consider using ``pd.IndexSlice`` to construct the tuple for the last one.
|
||||
|
||||

|
||||
|
||||
For row and column slicing, any valid indexer to ``.loc`` will work.
|
||||
|
||||

|
||||
|
||||
Only label-based slicing is supported right now, not positional.
|
||||
|
||||
If your style function uses a ``subset`` or ``axis`` keyword argument, consider wrapping your function in a ``functools.partial``, partialing out that keyword.
|
||||
|
||||
``` python
|
||||
my_func2 = functools.partial(my_func, subset=42)
|
||||
```
|
||||
|
||||
## Finer Control: Display Values
|
||||
|
||||
We distinguish the *display* value from the *actual* value in ``Styler``. To control the display value, the text is printed in each cell, use ``Styler.format``. Cells can be formatted according to a [format spec string](https://docs.python.org/3/library/string.html#format-specification-mini-language) or a callable that takes a single value and returns a string.
|
||||
|
||||

|
||||
|
||||
Use a dictionary to format specific columns.
|
||||
|
||||

|
||||
|
||||
Or pass in a callable (or dictionary of callables) for more flexible handling.
|
||||
|
||||

|
||||
|
||||
## Builtin styles
|
||||
|
||||
Finally, we expect certain styling functions to be common enough that we’ve included a few “built-in” to the ``Styler``, so you don’t have to write them yourself.
|
||||
|
||||

|
||||
|
||||
You can create “heatmaps” with the ``background_gradient`` method. These require matplotlib, and we’ll use [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) to get a nice colormap.
|
||||
|
||||
``` python
|
||||
import seaborn as sns
|
||||
|
||||
cm = sns.light_palette("green", as_cmap=True)
|
||||
|
||||
s = df.style.background_gradient(cmap=cm)
|
||||
s
|
||||
|
||||
/opt/conda/envs/pandas/lib/python3.7/site-packages/matplotlib/colors.py:479: RuntimeWarning: invalid value encountered in less
|
||||
xa[xa < 0] = -1
|
||||
```
|
||||
|
||||

|
||||
|
||||
``Styler.background_gradient`` takes the keyword arguments ``low`` and ``high``. Roughly speaking these extend the range of your data by ``low`` and ``high`` percent so that when we convert the colors, the colormap’s entire range isn’t used. This is useful so that you can actually read the text still.
|
||||
|
||||

|
||||
|
||||
There’s also ``.highlight_min`` and ``.highlight_max``.
|
||||
|
||||

|
||||
|
||||
Use ``Styler.set_properties`` when the style doesn’t actually depend on the values.
|
||||
|
||||

|
||||
|
||||
### Bar charts
|
||||
|
||||
You can include “bar charts” in your DataFrame.
|
||||
|
||||

|
||||
|
||||
New in version 0.20.0 is the ability to customize further the bar chart: You can now have the ``df.style.bar`` be centered on zero or midpoint value (in addition to the already existing way of having the min value at the left side of the cell), and you can pass a list of ``[color_negative, color_positive]``.
|
||||
|
||||
Here’s how you can change the above with the new ``align='mid'`` option:
|
||||
|
||||

|
||||
|
||||
The following example aims to give a highlight of the behavior of the new align options:
|
||||
|
||||
``` python
|
||||
import pandas as pd
|
||||
from IPython.display import HTML
|
||||
|
||||
# Test series
|
||||
test1 = pd.Series([-100,-60,-30,-20], name='All Negative')
|
||||
test2 = pd.Series([10,20,50,100], name='All Positive')
|
||||
test3 = pd.Series([-10,-5,0,90], name='Both Pos and Neg')
|
||||
|
||||
head = """
|
||||
<table>
|
||||
<thead>
|
||||
<th>Align</th>
|
||||
<th>All Negative</th>
|
||||
<th>All Positive</th>
|
||||
<th>Both Neg and Pos</th>
|
||||
</thead>
|
||||
</tbody>
|
||||
|
||||
"""
|
||||
|
||||
aligns = ['left','zero','mid']
|
||||
for align in aligns:
|
||||
row = "<tr><th>{}</th>".format(align)
|
||||
for serie in [test1,test2,test3]:
|
||||
s = serie.copy()
|
||||
s.name=''
|
||||
row += "<td>{}</td>".format(s.to_frame().style.bar(align=align,
|
||||
color=['#d65f5f', '#5fba7d'],
|
||||
width=100).render()) #testn['width']
|
||||
row += '</tr>'
|
||||
head += row
|
||||
|
||||
head+= """
|
||||
</tbody>
|
||||
</table>"""
|
||||
|
||||
|
||||
HTML(head)
|
||||
```
|
||||
|
||||

|
||||
|
||||
## Sharing styles
|
||||
|
||||
Say you have a lovely style built up for a DataFrame, and now you want to apply the same style to a second DataFrame. Export the style with ``df1.style.export``, and import it on the second DataFrame with ``df1.style.set``
|
||||
|
||||

|
||||
|
||||
Notice that you’re able share the styles even though they’re data aware. The styles are re-evaluated on the new DataFrame they’ve been ``use``d upon.
|
||||
|
||||
## Other Options
|
||||
|
||||
You’ve seen a few methods for data-driven styling. ``Styler`` also provides a few other options for styles that don’t depend on the data.
|
||||
|
||||
- precision
|
||||
- captions
|
||||
- table-wide styles
|
||||
- hiding the index or columns
|
||||
|
||||
Each of these can be specified in two ways:
|
||||
|
||||
- A keyword argument to ``Styler.__init__``
|
||||
- A call to one of the ``.set_`` or ``.hide_`` methods, e.g. ``.set_caption`` or ``.hide_columns``
|
||||
|
||||
The best method to use depends on the context. Use the ``Styler`` constructor when building many styled DataFrames that should all share the same properties. For interactive use, the``.set_`` and ``.hide_`` methods are more convenient.
|
||||
|
||||
### Precision
|
||||
|
||||
You can control the precision of floats using pandas’ regular ``display.precision`` option.
|
||||
|
||||

|
||||
|
||||
Or through a ``set_precision`` method.
|
||||
|
||||

|
||||
|
||||
Setting the precision only affects the printed number; the full-precision values are always passed to your style functions. You can always use ``df.round(2).style`` if you’d prefer to round from the start.
|
||||
|
||||
### Captions
|
||||
|
||||
Regular table captions can be added in a few ways.
|
||||
|
||||

|
||||
|
||||
### Table styles
|
||||
|
||||
The next option you have are “table styles”. These are styles that apply to the table as a whole, but don’t look at the data. Certain sytlings, including pseudo-selectors like ``:hover`` can only be used this way.
|
||||
|
||||

|
||||
|
||||
``table_styles`` should be a list of dictionaries. Each dictionary should have the ``selector`` and ``props`` keys. The value for ``selector`` should be a valid CSS selector. Recall that all the styles are already attached to an ``id``, unique to each ``Styler``. This selector is in addition to that ``id``. The value for ``props`` should be a list of tuples of ``('attribute', 'value')``.
|
||||
|
||||
``table_styles`` are extremely flexible, but not as fun to type out by hand. We hope to collect some useful ones either in pandas, or preferable in a new package that [builds on top](#Extensibility) the tools here.
|
||||
|
||||
### Hiding the Index or Columns
|
||||
|
||||
The index can be hidden from rendering by calling ``Styler.hide_index``. Columns can be hidden from rendering by calling ``Styler.hide_columns`` and passing in the name of a column, or a slice of columns.
|
||||
|
||||

|
||||
|
||||
### CSS classes
|
||||
|
||||
Certain CSS classes are attached to cells.
|
||||
|
||||
- Index and Column names include ``index_name`` and ``level`` where ``k`` is its level in a MultiIndex
|
||||
- Index label cells include
|
||||
``row_heading``
|
||||
``row`` where ``n`` is the numeric position of the row
|
||||
``level`` where ``k`` is the level in a MultiIndex
|
||||
- ``row_heading``
|
||||
- ``row`` where ``n`` is the numeric position of the row
|
||||
- ``level`` where ``k`` is the level in a MultiIndex
|
||||
- Column label cells include
|
||||
``col_heading``
|
||||
``col`` where ``n`` is the numeric position of the column
|
||||
``level`` where ``k`` is the level in a MultiIndex
|
||||
- ``col_heading``
|
||||
- ``col`` where ``n`` is the numeric position of the column
|
||||
- ``level`` where ``k`` is the level in a MultiIndex
|
||||
- Blank cells include ``blank``
|
||||
- Data cells include ``data``
|
||||
|
||||
### Limitations
|
||||
|
||||
- DataFrame only ``(use Series.to_frame().style)``
|
||||
- The index and columns must be unique
|
||||
- No large repr, and performance isn’t great; this is intended for summary DataFrames
|
||||
- You can only style the *values*, not the index or columns
|
||||
- You can only apply styles, you can’t insert new HTML entities
|
||||
|
||||
Some of these will be addressed in the future.
|
||||
|
||||
### Terms
|
||||
|
||||
- Style function: a function that’s passed into ``Styler.apply`` or ``Styler.applymap`` and returns values like ``'css attribute: value'``
|
||||
- Builtin style functions: style functions that are methods on ``Styler``
|
||||
- table style: a dictionary with the two keys ``selector`` and ``props``. ``selector`` is the CSS selector that ``props`` will apply to. ``props`` is a list of ``(attribute, value)`` tuples. A list of table styles passed into ``Styler``.
|
||||
|
||||
## Fun stuff
|
||||
|
||||
Here are a few interesting examples.
|
||||
|
||||
``Styler`` interacts pretty well with widgets. If you’re viewing this online instead of running the notebook yourself, you’re missing out on interactively adjusting the color palette.
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
## Export to Excel
|
||||
|
||||
*New in version 0.20.0*
|
||||
|
||||
Experimental: This is a new feature and still under development. We’ll be adding features and possibly making breaking changes in future releases. We’d love to hear your feedback.
|
||||
|
||||
Some support is available for exporting styled ``DataFrames`` to Excel worksheets using the ``OpenPyXL`` or ``XlsxWriter`` engines. CSS2.2 properties handled include:
|
||||
|
||||
- ``background-color``
|
||||
- ``border-style``, ``border-width``, ``border-color`` and their {``top``, ``right``, ``bottom``, ``left`` variants}
|
||||
- ``color``
|
||||
- ``font-family``
|
||||
- ``font-style``
|
||||
- ``font-weight``
|
||||
- ``text-align``
|
||||
- ``text-decoration``
|
||||
- ``vertical-align``
|
||||
- ``white-space: nowrap``
|
||||
- Only CSS2 named colors and hex colors of the form ``#rgb`` or ``#rrggbb`` are currently supported.
|
||||
- The following pseudo CSS properties are also available to set excel specific style properties:
|
||||
``number-format``
|
||||
- ``number-format``
|
||||
|
||||
``` python
|
||||
df.style.\
|
||||
applymap(color_negative_red).\
|
||||
apply(highlight_max).\
|
||||
to_excel('styled.xlsx', engine='openpyxl')
|
||||
```
|
||||
|
||||
A screenshot of the output:
|
||||
|
||||

|
||||
|
||||
## Extensibility
|
||||
|
||||
The core of pandas is, and will remain, its “high-performance, easy-to-use data structures”. With that in mind, we hope that ``DataFrame.style`` accomplishes two goals
|
||||
|
||||
- Provide an API that is pleasing to use interactively and is “good enough” for many tasks
|
||||
- Provide the foundations for dedicated libraries to build on
|
||||
|
||||
If you build a great library on top of this, let us know and we’ll [link](http://pandas.pydata.org/pandas-docs/stable/ecosystem.html) to it.
|
||||
|
||||
### Subclassing
|
||||
|
||||
If the default template doesn’t quite suit your needs, you can subclass Styler and extend or override the template. We’ll show an example of extending the default template to insert a custom header before each table.
|
||||
|
||||
|
||||
``` python
|
||||
from jinja2 import Environment, ChoiceLoader, FileSystemLoader
|
||||
from IPython.display import HTML
|
||||
from pandas.io.formats.style import Styler
|
||||
```
|
||||
|
||||
We’ll use the following template:
|
||||
|
||||
|
||||
``` python
|
||||
with open("templates/myhtml.tpl") as f:
|
||||
print(f.read())
|
||||
```
|
||||
|
||||
Now that we’ve created a template, we need to set up a subclass of ``Styler`` that knows about it.
|
||||
|
||||
|
||||
``` python
|
||||
class MyStyler(Styler):
|
||||
env = Environment(
|
||||
loader=ChoiceLoader([
|
||||
FileSystemLoader("templates"), # contains ours
|
||||
Styler.loader, # the default
|
||||
])
|
||||
)
|
||||
template = env.get_template("myhtml.tpl")
|
||||
```
|
||||
|
||||
Notice that we include the original loader in our environment’s loader. That’s because we extend the original template, so the Jinja environment needs to be able to find it.
|
||||
|
||||
Now we can use that custom styler. It’s ``__init__`` takes a DataFrame.
|
||||
|
||||

|
||||
|
||||
Our custom template accepts a ``table_title`` keyword. We can provide the value in the ``.render`` method.
|
||||
|
||||

|
||||
|
||||
For convenience, we provide the ``Styler.from_custom_template`` method that does the same as the custom subclass.
|
||||
|
||||

|
||||
|
||||
Here’s the template structure:
|
||||
|
||||

|
||||
|
||||
See the template in the [GitHub repo](https://github.com/pandas-dev/pandas) for more details.
|
||||
1056
Python/pandas/user_guide/text.md
Normal file
1056
Python/pandas/user_guide/text.md
Normal file
File diff suppressed because it is too large
Load Diff
832
Python/pandas/user_guide/timedeltas.md
Normal file
832
Python/pandas/user_guide/timedeltas.md
Normal file
@@ -0,0 +1,832 @@
|
||||
# 时间差
|
||||
|
||||
`Timedelta`,时间差,即时间之间的差异,用 `日、时、分、秒` 等时间单位表示,时间单位可为正,也可为负。
|
||||
|
||||
`Timedelta` 是 `datetime.timedelta` 的子类,两者的操作方式相似,但 `Timedelta` 兼容 `np.timedelta64` 等数据类型,还支持自定义表示形式、能解析多种类型的数据,并支持自有属性。
|
||||
|
||||
## 解析数据,生成时间差
|
||||
|
||||
`Timedelta()` 支持用多种参数生成时间差:
|
||||
|
||||
``` python
|
||||
In [1]: import datetime
|
||||
|
||||
# 字符串
|
||||
In [2]: pd.Timedelta('1 days')
|
||||
Out[2]: Timedelta('1 days 00:00:00')
|
||||
|
||||
In [3]: pd.Timedelta('1 days 00:00:00')
|
||||
Out[3]: Timedelta('1 days 00:00:00')
|
||||
|
||||
In [4]: pd.Timedelta('1 days 2 hours')
|
||||
Out[4]: Timedelta('1 days 02:00:00')
|
||||
|
||||
In [5]: pd.Timedelta('-1 days 2 min 3us')
|
||||
Out[5]: Timedelta('-2 days +23:57:59.999997')
|
||||
|
||||
# datetime.timedelta
|
||||
# 注意:必须指定关键字参数
|
||||
In [6]: pd.Timedelta(days=1, seconds=1)
|
||||
Out[6]: Timedelta('1 days 00:00:01')
|
||||
|
||||
# 用整数与时间单位生成时间差
|
||||
In [7]: pd.Timedelta(1, unit='d')
|
||||
Out[7]: Timedelta('1 days 00:00:00')
|
||||
|
||||
# datetime.timedelta 与 np.timedelta64
|
||||
In [8]: pd.Timedelta(datetime.timedelta(days=1, seconds=1))
|
||||
Out[8]: Timedelta('1 days 00:00:01')
|
||||
|
||||
In [9]: pd.Timedelta(np.timedelta64(1, 'ms'))
|
||||
Out[9]: Timedelta('0 days 00:00:00.001000')
|
||||
|
||||
# 用字符串表示负数时间差
|
||||
# 更接近 datetime.timedelta
|
||||
In [10]: pd.Timedelta('-1us')
|
||||
Out[10]: Timedelta('-1 days +23:59:59.999999')
|
||||
|
||||
# 时间差缺失值
|
||||
In [11]: pd.Timedelta('nan')
|
||||
Out[11]: NaT
|
||||
|
||||
In [12]: pd.Timedelta('nat')
|
||||
Out[12]: NaT
|
||||
|
||||
# ISO8601 时间格式字符串
|
||||
In [13]: pd.Timedelta('P0DT0H1M0S')
|
||||
Out[13]: Timedelta('0 days 00:01:00')
|
||||
|
||||
In [14]: pd.Timedelta('P0DT0H0M0.000000123S')
|
||||
Out[14]: Timedelta('0 days 00:00:00.000000')
|
||||
```
|
||||
|
||||
*0.23.0 版新增*:增加了用 [ISO8601 时间格式](https://en.wikipedia.org/wiki/ISO_8601#Durations)生成时间差。
|
||||
|
||||
[DateOffsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offsets)(`Day`、`Hour`、`Minute`、`Second`、`Milli`、`Micro`、`Nano`)也可以用来生成时间差。
|
||||
|
||||
``` python
|
||||
In [15]: pd.Timedelta(pd.offsets.Second(2))
|
||||
Out[15]: Timedelta('0 days 00:00:02')
|
||||
```
|
||||
|
||||
标量运算生成的也是 `Timedelta` 标量。
|
||||
|
||||
``` python
|
||||
In [16]: pd.Timedelta(pd.offsets.Day(2)) + pd.Timedelta(pd.offsets.Second(2)) +\
|
||||
....: pd.Timedelta('00:00:00.000123')
|
||||
....:
|
||||
Out[16]: Timedelta('2 days 00:00:02.000123')
|
||||
```
|
||||
|
||||
### to_timedelta
|
||||
|
||||
`pd.to_timedelta()` 可以把符合时间差格式的标量、数组、列表、序列等数据转换为`Timedelta`。输入数据是序列,输出的就是序列,输入数据是标量,输出的就是标量,其它形式的输入数据则输出 `TimedeltaIndex`。
|
||||
|
||||
`to_timedelta()` 可以解析单个字符串:
|
||||
|
||||
``` python
|
||||
In [17]: pd.to_timedelta('1 days 06:05:01.00003')
|
||||
Out[17]: Timedelta('1 days 06:05:01.000030')
|
||||
|
||||
In [18]: pd.to_timedelta('15.5us')
|
||||
Out[18]: Timedelta('0 days 00:00:00.000015')
|
||||
```
|
||||
|
||||
还能解析字符串列表或数组:
|
||||
|
||||
``` python
|
||||
In [19]: pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])
|
||||
Out[19]: TimedeltaIndex(['1 days 06:05:01.000030', '0 days 00:00:00.000015', NaT], dtype='timedelta64[ns]', freq=None)
|
||||
```
|
||||
|
||||
`unit` 关键字参数指定时间差的单位:
|
||||
|
||||
``` python
|
||||
In [20]: pd.to_timedelta(np.arange(5), unit='s')
|
||||
Out[20]: TimedeltaIndex(['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04'], dtype='timedelta64[ns]', freq=None)
|
||||
|
||||
In [21]: pd.to_timedelta(np.arange(5), unit='d')
|
||||
Out[21]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)
|
||||
```
|
||||
|
||||
### 时间差界限
|
||||
|
||||
Pandas 时间差的纳秒解析度是 64 位整数,这就决定了 `Timedelta` 的上下限。
|
||||
|
||||
``` python
|
||||
In [22]: pd.Timedelta.min
|
||||
Out[22]: Timedelta('-106752 days +00:12:43.145224')
|
||||
|
||||
In [23]: pd.Timedelta.max
|
||||
Out[23]: Timedelta('106751 days 23:47:16.854775')
|
||||
```
|
||||
|
||||
## 运算
|
||||
|
||||
以时间差为数据的 `Series` 与 `DataFrame` 支持各种运算,`datetime64 [ns]` 序列或 `Timestamps` 减法运算生成的是`timedelta64 [ns]` 序列。
|
||||
|
||||
``` python
|
||||
In [24]: s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
|
||||
|
||||
In [25]: td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
|
||||
|
||||
In [26]: df = pd.DataFrame({'A': s, 'B': td})
|
||||
|
||||
In [27]: df
|
||||
Out[27]:
|
||||
A B
|
||||
0 2012-01-01 0 days
|
||||
1 2012-01-02 1 days
|
||||
2 2012-01-03 2 days
|
||||
|
||||
In [28]: df['C'] = df['A'] + df['B']
|
||||
|
||||
In [29]: df
|
||||
Out[29]:
|
||||
A B C
|
||||
0 2012-01-01 0 days 2012-01-01
|
||||
1 2012-01-02 1 days 2012-01-03
|
||||
2 2012-01-03 2 days 2012-01-05
|
||||
|
||||
In [30]: df.dtypes
|
||||
Out[30]:
|
||||
A datetime64[ns]
|
||||
B timedelta64[ns]
|
||||
C datetime64[ns]
|
||||
dtype: object
|
||||
|
||||
In [31]: s - s.max()
|
||||
Out[31]:
|
||||
0 -2 days
|
||||
1 -1 days
|
||||
2 0 days
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [32]: s - datetime.datetime(2011, 1, 1, 3, 5)
|
||||
Out[32]:
|
||||
0 364 days 20:55:00
|
||||
1 365 days 20:55:00
|
||||
2 366 days 20:55:00
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [33]: s + datetime.timedelta(minutes=5)
|
||||
Out[33]:
|
||||
0 2012-01-01 00:05:00
|
||||
1 2012-01-02 00:05:00
|
||||
2 2012-01-03 00:05:00
|
||||
dtype: datetime64[ns]
|
||||
|
||||
In [34]: s + pd.offsets.Minute(5)
|
||||
Out[34]:
|
||||
0 2012-01-01 00:05:00
|
||||
1 2012-01-02 00:05:00
|
||||
2 2012-01-03 00:05:00
|
||||
dtype: datetime64[ns]
|
||||
|
||||
In [35]: s + pd.offsets.Minute(5) + pd.offsets.Milli(5)
|
||||
Out[35]:
|
||||
0 2012-01-01 00:05:00.005
|
||||
1 2012-01-02 00:05:00.005
|
||||
2 2012-01-03 00:05:00.005
|
||||
dtype: datetime64[ns]
|
||||
```
|
||||
|
||||
`timedelta64 [ns]` 序列的标量运算:
|
||||
|
||||
``` python
|
||||
In [36]: y = s - s[0]
|
||||
|
||||
In [37]: y
|
||||
Out[37]:
|
||||
0 0 days
|
||||
1 1 days
|
||||
2 2 days
|
||||
dtype: timedelta64[ns]
|
||||
```
|
||||
|
||||
时间差序列支持 `NaT` 值:
|
||||
|
||||
``` python
|
||||
In [38]: y = s - s.shift()
|
||||
|
||||
In [39]: y
|
||||
Out[39]:
|
||||
0 NaT
|
||||
1 1 days
|
||||
2 1 days
|
||||
dtype: timedelta64[ns]
|
||||
```
|
||||
|
||||
与 `datetime` 类似,`np.nan` 把时间差设置为 `NaT`:
|
||||
|
||||
``` python
|
||||
In [40]: y[1] = np.nan
|
||||
|
||||
In [41]: y
|
||||
Out[41]:
|
||||
0 NaT
|
||||
1 NaT
|
||||
2 1 days
|
||||
dtype: timedelta64[ns]
|
||||
```
|
||||
|
||||
运算符也可以显示为逆序(序列与单个对象的运算):
|
||||
|
||||
``` python
|
||||
In [42]: s.max() - s
|
||||
Out[42]:
|
||||
0 2 days
|
||||
1 1 days
|
||||
2 0 days
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [43]: datetime.datetime(2011, 1, 1, 3, 5) - s
|
||||
Out[43]:
|
||||
0 -365 days +03:05:00
|
||||
1 -366 days +03:05:00
|
||||
2 -367 days +03:05:00
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [44]: datetime.timedelta(minutes=5) + s
|
||||
Out[44]:
|
||||
0 2012-01-01 00:05:00
|
||||
1 2012-01-02 00:05:00
|
||||
2 2012-01-03 00:05:00
|
||||
dtype: datetime64[ns]
|
||||
```
|
||||
|
||||
`DataFrame` 支持 `min`、`max` 及 `idxmin`、`idxmax` 运算:
|
||||
|
||||
``` python
|
||||
In [45]: A = s - pd.Timestamp('20120101') - pd.Timedelta('00:05:05')
|
||||
|
||||
In [46]: B = s - pd.Series(pd.date_range('2012-1-2', periods=3, freq='D'))
|
||||
|
||||
In [47]: df = pd.DataFrame({'A': A, 'B': B})
|
||||
|
||||
In [48]: df
|
||||
Out[48]:
|
||||
A B
|
||||
0 -1 days +23:54:55 -1 days
|
||||
1 0 days 23:54:55 -1 days
|
||||
2 1 days 23:54:55 -1 days
|
||||
|
||||
In [49]: df.min()
|
||||
Out[49]:
|
||||
A -1 days +23:54:55
|
||||
B -1 days +00:00:00
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [50]: df.min(axis=1)
|
||||
Out[50]:
|
||||
0 -1 days
|
||||
1 -1 days
|
||||
2 -1 days
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [51]: df.idxmin()
|
||||
Out[51]:
|
||||
A 0
|
||||
B 0
|
||||
dtype: int64
|
||||
|
||||
In [52]: df.idxmax()
|
||||
Out[52]:
|
||||
A 2
|
||||
B 0
|
||||
dtype: int64
|
||||
```
|
||||
|
||||
`Series` 也支持`min`、`max` 及 `idxmin`、`idxmax` 运算。标量计算结果为 `Timedelta`。
|
||||
|
||||
``` python
|
||||
In [53]: df.min().max()
|
||||
Out[53]: Timedelta('-1 days +23:54:55')
|
||||
|
||||
In [54]: df.min(axis=1).min()
|
||||
Out[54]: Timedelta('-1 days +00:00:00')
|
||||
|
||||
In [55]: df.min().idxmax()
|
||||
Out[55]: 'A'
|
||||
|
||||
In [56]: df.min(axis=1).idxmin()
|
||||
Out[56]: 0
|
||||
```
|
||||
|
||||
时间差支持 `fillna` 函数,参数是 `Timedelta`,用于指定填充值。
|
||||
|
||||
``` python
|
||||
In [57]: y.fillna(pd.Timedelta(0))
|
||||
Out[57]:
|
||||
0 0 days
|
||||
1 0 days
|
||||
2 1 days
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [58]: y.fillna(pd.Timedelta(10, unit='s'))
|
||||
Out[58]:
|
||||
0 0 days 00:00:10
|
||||
1 0 days 00:00:10
|
||||
2 1 days 00:00:00
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [59]: y.fillna(pd.Timedelta('-1 days, 00:00:05'))
|
||||
Out[59]:
|
||||
0 -1 days +00:00:05
|
||||
1 -1 days +00:00:05
|
||||
2 1 days 00:00:00
|
||||
dtype: timedelta64[ns]
|
||||
```
|
||||
|
||||
`Timedelta` 还支持取反、乘法及绝对值(`Abs`)运算:
|
||||
|
||||
``` python
|
||||
In [60]: td1 = pd.Timedelta('-1 days 2 hours 3 seconds')
|
||||
|
||||
In [61]: td1
|
||||
Out[61]: Timedelta('-2 days +21:59:57')
|
||||
|
||||
In [62]: -1 * td1
|
||||
Out[62]: Timedelta('1 days 02:00:03')
|
||||
|
||||
In [63]: - td1
|
||||
Out[63]: Timedelta('1 days 02:00:03')
|
||||
|
||||
In [64]: abs(td1)
|
||||
Out[64]: Timedelta('1 days 02:00:03')
|
||||
```
|
||||
|
||||
## 归约
|
||||
|
||||
`timedelta64 [ns]` 数值归约运算返回的是 `Timedelta` 对象。 一般情况下,`NaT` 不计数。
|
||||
|
||||
``` python
|
||||
In [65]: y2 = pd.Series(pd.to_timedelta(['-1 days +00:00:05', 'nat',
|
||||
....: '-1 days +00:00:05', '1 days']))
|
||||
....:
|
||||
|
||||
In [66]: y2
|
||||
Out[66]:
|
||||
0 -1 days +00:00:05
|
||||
1 NaT
|
||||
2 -1 days +00:00:05
|
||||
3 1 days 00:00:00
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [67]: y2.mean()
|
||||
Out[67]: Timedelta('-1 days +16:00:03.333333')
|
||||
|
||||
In [68]: y2.median()
|
||||
Out[68]: Timedelta('-1 days +00:00:05')
|
||||
|
||||
In [69]: y2.quantile(.1)
|
||||
Out[69]: Timedelta('-1 days +00:00:05')
|
||||
|
||||
In [70]: y2.sum()
|
||||
Out[70]: Timedelta('-1 days +00:00:10')
|
||||
```
|
||||
|
||||
## 频率转换
|
||||
|
||||
时间差除法把 `Timedelta` 序列、`TimedeltaIndex`、`Timedelta` 标量转换为其它“频率”,`astype` 也可以将之转换为指定的时间差。这些运算生成的是序列,并把 `NaT` 转换为 `nan`。 注意,NumPy 标量除法是真除法,`astype` 则等同于取底整除(Floor Division)。
|
||||
|
||||
::: tip 说明
|
||||
|
||||
Floor Division ,即两数的商为向下取整,如,9 / 2 = 4。又译作地板除或向下取整除,本文译作**取底整除**;
|
||||
|
||||
扩展知识:
|
||||
|
||||
Ceiling Division,即两数的商为向上取整,如,9 / 2 = 5。又译作屋顶除或向上取整除,本文译作**取顶整除**。
|
||||
|
||||
:::
|
||||
|
||||
``` python
|
||||
In [71]: december = pd.Series(pd.date_range('20121201', periods=4))
|
||||
|
||||
In [72]: january = pd.Series(pd.date_range('20130101', periods=4))
|
||||
|
||||
In [73]: td = january - december
|
||||
|
||||
In [74]: td[2] += datetime.timedelta(minutes=5, seconds=3)
|
||||
|
||||
In [75]: td[3] = np.nan
|
||||
|
||||
In [76]: td
|
||||
Out[76]:
|
||||
0 31 days 00:00:00
|
||||
1 31 days 00:00:00
|
||||
2 31 days 00:05:03
|
||||
3 NaT
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
# 转为日
|
||||
In [77]: td / np.timedelta64(1, 'D')
|
||||
Out[77]:
|
||||
0 31.000000
|
||||
1 31.000000
|
||||
2 31.003507
|
||||
3 NaN
|
||||
dtype: float64
|
||||
|
||||
In [78]: td.astype('timedelta64[D]')
|
||||
Out[78]:
|
||||
0 31.0
|
||||
1 31.0
|
||||
2 31.0
|
||||
3 NaN
|
||||
dtype: float64
|
||||
|
||||
# 转为秒
|
||||
In [79]: td / np.timedelta64(1, 's')
|
||||
Out[79]:
|
||||
0 2678400.0
|
||||
1 2678400.0
|
||||
2 2678703.0
|
||||
3 NaN
|
||||
dtype: float64
|
||||
|
||||
In [80]: td.astype('timedelta64[s]')
|
||||
Out[80]:
|
||||
0 2678400.0
|
||||
1 2678400.0
|
||||
2 2678703.0
|
||||
3 NaN
|
||||
dtype: float64
|
||||
|
||||
# 转为月 (此处用常量表示)
|
||||
In [81]: td / np.timedelta64(1, 'M')
|
||||
Out[81]:
|
||||
0 1.018501
|
||||
1 1.018501
|
||||
2 1.018617
|
||||
3 NaN
|
||||
dtype: float64
|
||||
```
|
||||
|
||||
`timedelta64 [ns]` 序列与整数或整数序列相乘或相除,生成的也是 `timedelta64 [ns]` 序列。
|
||||
|
||||
``` python
|
||||
In [82]: td * -1
|
||||
Out[82]:
|
||||
0 -31 days +00:00:00
|
||||
1 -31 days +00:00:00
|
||||
2 -32 days +23:54:57
|
||||
3 NaT
|
||||
dtype: timedelta64[ns]
|
||||
|
||||
In [83]: td * pd.Series([1, 2, 3, 4])
|
||||
Out[83]:
|
||||
0 31 days 00:00:00
|
||||
1 62 days 00:00:00
|
||||
2 93 days 00:15:09
|
||||
3 NaT
|
||||
dtype: timedelta64[ns]
|
||||
```
|
||||
|
||||
`timedelta64 [ns]` 序列与 `Timedelta` 标量相除的结果为取底整除的整数序列。
|
||||
|
||||
|
||||
``` python
|
||||
In [84]: td // pd.Timedelta(days=3, hours=4)
|
||||
Out[84]:
|
||||
0 9.0
|
||||
1 9.0
|
||||
2 9.0
|
||||
3 NaN
|
||||
dtype: float64
|
||||
|
||||
In [85]: pd.Timedelta(days=3, hours=4) // td
|
||||
Out[85]:
|
||||
0 0.0
|
||||
1 0.0
|
||||
2 0.0
|
||||
3 NaN
|
||||
dtype: float64
|
||||
```
|
||||
|
||||
`Timedelta` 的求余(`mod(%)`)与除余(`divmod`)运算,支持时间差与数值参数。
|
||||
|
||||
``` python
|
||||
In [86]: pd.Timedelta(hours=37) % datetime.timedelta(hours=2)
|
||||
Out[86]: Timedelta('0 days 01:00:00')
|
||||
|
||||
# 除余运算的参数为时间差时,返回一对值(int, Timedelta)
|
||||
In [87]: divmod(datetime.timedelta(hours=2), pd.Timedelta(minutes=11))
|
||||
Out[87]: (10, Timedelta('0 days 00:10:00'))
|
||||
|
||||
# 除余运算的参数为数值时,也返回一对值(Timedelta, Timedelta)
|
||||
In [88]: divmod(pd.Timedelta(hours=25), 86400000000000)
|
||||
Out[88]: (Timedelta('0 days 00:00:00.000000'), Timedelta('0 days 01:00:00'))
|
||||
```
|
||||
|
||||
## 属性
|
||||
|
||||
`Timedelta` 或 `TimedeltaIndex` 的组件可以直接访问 `days`、`seconds`、`microseconds`、`nanoseconds` 等属性。这些属性与`datetime.timedelta` 的返回值相同,例如,`.seconds` 属性表示大于等于 0 天且小于 1 天的秒数。带符号的 `Timedelta` 返回的值也带符号。
|
||||
|
||||
`Series` 的 `.dt` 属性也可以直接访问这些数据。
|
||||
|
||||
::: tip 注意
|
||||
|
||||
这些属性**不是** `Timedelta` 显示的值。`.components` 可以提取显示的值。
|
||||
|
||||
:::
|
||||
|
||||
对于 `Series`:
|
||||
|
||||
``` python
|
||||
In [89]: td.dt.days
|
||||
Out[89]:
|
||||
0 31.0
|
||||
1 31.0
|
||||
2 31.0
|
||||
3 NaN
|
||||
dtype: float64
|
||||
|
||||
In [90]: td.dt.seconds
|
||||
Out[90]:
|
||||
0 0.0
|
||||
1 0.0
|
||||
2 303.0
|
||||
3 NaN
|
||||
dtype: float64
|
||||
```
|
||||
|
||||
直接访问 `Timedelta` 标量字段值。
|
||||
|
||||
``` python
|
||||
In [91]: tds = pd.Timedelta('31 days 5 min 3 sec')
|
||||
|
||||
In [92]: tds.days
|
||||
Out[92]: 31
|
||||
|
||||
In [93]: tds.seconds
|
||||
Out[93]: 303
|
||||
|
||||
In [94]: (-tds).seconds
|
||||
Out[94]: 86097
|
||||
```
|
||||
|
||||
`.components` 属性可以快速访问时间差的组件,返回结果是 `DataFrame`。 下列代码输出 `Timedelta` 的显示值。
|
||||
|
||||
``` python
|
||||
In [95]: td.dt.components
|
||||
Out[95]:
|
||||
days hours minutes seconds milliseconds microseconds nanoseconds
|
||||
0 31.0 0.0 0.0 0.0 0.0 0.0 0.0
|
||||
1 31.0 0.0 0.0 0.0 0.0 0.0 0.0
|
||||
2 31.0 0.0 5.0 3.0 0.0 0.0 0.0
|
||||
3 NaN NaN NaN NaN NaN NaN NaN
|
||||
|
||||
In [96]: td.dt.components.seconds
|
||||
Out[96]:
|
||||
0 0.0
|
||||
1 0.0
|
||||
2 3.0
|
||||
3 NaN
|
||||
Name: seconds, dtype: float64
|
||||
```
|
||||
|
||||
`.isoformat` 方法可以把 `Timedelta` 转换为 [ISO8601 时间格式](https://en.wikipedia.org/wiki/ISO_8601#Durations)字符串。
|
||||
|
||||
*0.20.0 版新增。*
|
||||
|
||||
``` python
|
||||
In [97]: pd.Timedelta(days=6, minutes=50, seconds=3,
|
||||
....: milliseconds=10, microseconds=10,
|
||||
....: nanoseconds=12).isoformat()
|
||||
....:
|
||||
Out[97]: 'P6DT0H50M3.010010012S'
|
||||
```
|
||||
|
||||
## TimedeltaIndex
|
||||
|
||||
[`TimedeltaIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.TimedeltaIndex.html#pandas.TimedeltaIndex) 或 [`timedelta_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.timedelta_range.html#pandas.timedelta_range) 可以生成时间差索引。
|
||||
|
||||
`TimedeltaIndex` 支持字符串型的 `Timedelta`、`timedelta` 或 `np.timedelta64`对象。
|
||||
|
||||
`np.nan`、`pd.NaT`、`nat` 代表缺失值。
|
||||
|
||||
``` python
|
||||
In [98]: pd.TimedeltaIndex(['1 days', '1 days, 00:00:05', np.timedelta64(2, 'D'),
|
||||
....: datetime.timedelta(days=2, seconds=2)])
|
||||
....:
|
||||
Out[98]:
|
||||
TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00',
|
||||
'2 days 00:00:02'],
|
||||
dtype='timedelta64[ns]', freq=None)
|
||||
```
|
||||
|
||||
`freq` 关键字参数为 `infer` 时,`TimedeltaIndex` 可以自行推断时间频率:
|
||||
|
||||
``` python
|
||||
In [99]: pd.TimedeltaIndex(['0 days', '10 days', '20 days'], freq='infer')
|
||||
Out[99]: TimedeltaIndex(['0 days', '10 days', '20 days'], dtype='timedelta64[ns]', freq='10D')
|
||||
```
|
||||
|
||||
### 生成时间差范围
|
||||
|
||||
与 [`date_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html#pandas.date_range) 相似,[`timedelta_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.timedelta_range.html#pandas.timedelta_range) 可以生成定频 `TimedeltaIndex`,`timedelta_range` 的默认频率是日历日:
|
||||
|
||||
``` python
|
||||
In [100]: pd.timedelta_range(start='1 days', periods=5)
|
||||
Out[100]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')
|
||||
```
|
||||
|
||||
`timedelta_range` 支持 `start`、`end`、`periods` 三个参数:
|
||||
|
||||
``` python
|
||||
In [101]: pd.timedelta_range(start='1 days', end='5 days')
|
||||
Out[101]: TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')
|
||||
|
||||
In [102]: pd.timedelta_range(end='10 days', periods=4)
|
||||
Out[102]: TimedeltaIndex(['7 days', '8 days', '9 days', '10 days'], dtype='timedelta64[ns]', freq='D')
|
||||
```
|
||||
|
||||
`freq` 参数支持各种[频率别名](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases):
|
||||
|
||||
``` python
|
||||
In [103]: pd.timedelta_range(start='1 days', end='2 days', freq='30T')
|
||||
Out[103]:
|
||||
TimedeltaIndex(['1 days 00:00:00', '1 days 00:30:00', '1 days 01:00:00',
|
||||
'1 days 01:30:00', '1 days 02:00:00', '1 days 02:30:00',
|
||||
'1 days 03:00:00', '1 days 03:30:00', '1 days 04:00:00',
|
||||
'1 days 04:30:00', '1 days 05:00:00', '1 days 05:30:00',
|
||||
'1 days 06:00:00', '1 days 06:30:00', '1 days 07:00:00',
|
||||
'1 days 07:30:00', '1 days 08:00:00', '1 days 08:30:00',
|
||||
'1 days 09:00:00', '1 days 09:30:00', '1 days 10:00:00',
|
||||
'1 days 10:30:00', '1 days 11:00:00', '1 days 11:30:00',
|
||||
'1 days 12:00:00', '1 days 12:30:00', '1 days 13:00:00',
|
||||
'1 days 13:30:00', '1 days 14:00:00', '1 days 14:30:00',
|
||||
'1 days 15:00:00', '1 days 15:30:00', '1 days 16:00:00',
|
||||
'1 days 16:30:00', '1 days 17:00:00', '1 days 17:30:00',
|
||||
'1 days 18:00:00', '1 days 18:30:00', '1 days 19:00:00',
|
||||
'1 days 19:30:00', '1 days 20:00:00', '1 days 20:30:00',
|
||||
'1 days 21:00:00', '1 days 21:30:00', '1 days 22:00:00',
|
||||
'1 days 22:30:00', '1 days 23:00:00', '1 days 23:30:00',
|
||||
'2 days 00:00:00'],
|
||||
dtype='timedelta64[ns]', freq='30T')
|
||||
|
||||
In [104]: pd.timedelta_range(start='1 days', periods=5, freq='2D5H')
|
||||
Out[104]:
|
||||
TimedeltaIndex(['1 days 00:00:00', '3 days 05:00:00', '5 days 10:00:00',
|
||||
'7 days 15:00:00', '9 days 20:00:00'],
|
||||
dtype='timedelta64[ns]', freq='53H')
|
||||
```
|
||||
|
||||
*0.23.0 版新增*。
|
||||
|
||||
用 `start`、`end`、`period` 可以生成等宽时间差范围,其中,`start` 与 `end`(含)是起止两端的时间,`periods` 为 `TimedeltaIndex` 里的元素数量:
|
||||
|
||||
``` python
|
||||
In [105]: pd.timedelta_range('0 days', '4 days', periods=5)
|
||||
Out[105]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype='timedelta64[ns]', freq=None)
|
||||
|
||||
In [106]: pd.timedelta_range('0 days', '4 days', periods=10)
|
||||
Out[106]:
|
||||
TimedeltaIndex(['0 days 00:00:00', '0 days 10:40:00', '0 days 21:20:00',
|
||||
'1 days 08:00:00', '1 days 18:40:00', '2 days 05:20:00',
|
||||
'2 days 16:00:00', '3 days 02:40:00', '3 days 13:20:00',
|
||||
'4 days 00:00:00'],
|
||||
dtype='timedelta64[ns]', freq=None)
|
||||
```
|
||||
|
||||
### TimedeltaIndex 应用
|
||||
|
||||
与 `DatetimeIndex`、`PeriodIndex` 等 `datetime` 型索引类似,`TimedeltaIndex` 也可当作 pandas 对象的索引。
|
||||
|
||||
``` python
|
||||
In [107]: s = pd.Series(np.arange(100),
|
||||
.....: index=pd.timedelta_range('1 days', periods=100, freq='h'))
|
||||
.....:
|
||||
|
||||
In [108]: s
|
||||
Out[108]:
|
||||
1 days 00:00:00 0
|
||||
1 days 01:00:00 1
|
||||
1 days 02:00:00 2
|
||||
1 days 03:00:00 3
|
||||
1 days 04:00:00 4
|
||||
..
|
||||
4 days 23:00:00 95
|
||||
5 days 00:00:00 96
|
||||
5 days 01:00:00 97
|
||||
5 days 02:00:00 98
|
||||
5 days 03:00:00 99
|
||||
Freq: H, Length: 100, dtype: int64
|
||||
```
|
||||
|
||||
选择操作也差不多,可以强制转换字符串与切片:
|
||||
|
||||
``` python
|
||||
In [109]: s['1 day':'2 day']
|
||||
Out[109]:
|
||||
1 days 00:00:00 0
|
||||
1 days 01:00:00 1
|
||||
1 days 02:00:00 2
|
||||
1 days 03:00:00 3
|
||||
1 days 04:00:00 4
|
||||
..
|
||||
2 days 19:00:00 43
|
||||
2 days 20:00:00 44
|
||||
2 days 21:00:00 45
|
||||
2 days 22:00:00 46
|
||||
2 days 23:00:00 47
|
||||
Freq: H, Length: 48, dtype: int64
|
||||
|
||||
In [110]: s['1 day 01:00:00']
|
||||
Out[110]: 1
|
||||
|
||||
In [111]: s[pd.Timedelta('1 day 1h')]
|
||||
Out[111]: 1
|
||||
```
|
||||
|
||||
`TimedeltaIndex` 还支持局部字符串选择,并且可以推断选择范围:
|
||||
|
||||
``` python
|
||||
In [112]: s['1 day':'1 day 5 hours']
|
||||
Out[112]:
|
||||
1 days 00:00:00 0
|
||||
1 days 01:00:00 1
|
||||
1 days 02:00:00 2
|
||||
1 days 03:00:00 3
|
||||
1 days 04:00:00 4
|
||||
1 days 05:00:00 5
|
||||
Freq: H, dtype: int64
|
||||
```
|
||||
|
||||
### TimedeltaIndex 运算
|
||||
|
||||
`TimedeltaIndex` 与 `DatetimeIndex` 运算可以保留 `NaT` 值:
|
||||
|
||||
``` python
|
||||
In [113]: tdi = pd.TimedeltaIndex(['1 days', pd.NaT, '2 days'])
|
||||
|
||||
In [114]: tdi.to_list()
|
||||
Out[114]: [Timedelta('1 days 00:00:00'), NaT, Timedelta('2 days 00:00:00')]
|
||||
|
||||
In [115]: dti = pd.date_range('20130101', periods=3)
|
||||
|
||||
In [116]: dti.to_list()
|
||||
Out[116]:
|
||||
[Timestamp('2013-01-01 00:00:00', freq='D'),
|
||||
Timestamp('2013-01-02 00:00:00', freq='D'),
|
||||
Timestamp('2013-01-03 00:00:00', freq='D')]
|
||||
|
||||
In [117]: (dti + tdi).to_list()
|
||||
Out[117]: [Timestamp('2013-01-02 00:00:00'), NaT, Timestamp('2013-01-05 00:00:00')]
|
||||
|
||||
In [118]: (dti - tdi).to_list()
|
||||
Out[118]: [Timestamp('2012-12-31 00:00:00'), NaT, Timestamp('2013-01-01 00:00:00')]
|
||||
```
|
||||
|
||||
### 转换
|
||||
|
||||
与 `Series` 频率转换类似,可以把 `TimedeltaIndex` 转换为其它索引。
|
||||
|
||||
``` python
|
||||
In [119]: tdi / np.timedelta64(1, 's')
|
||||
Out[119]: Float64Index([86400.0, nan, 172800.0], dtype='float64')
|
||||
|
||||
In [120]: tdi.astype('timedelta64[s]')
|
||||
Out[120]: Float64Index([86400.0, nan, 172800.0], dtype='float64')
|
||||
```
|
||||
|
||||
与标量操作类似,会返回**不同**类型的索引。
|
||||
|
||||
``` python
|
||||
# 时间差与日期相加,结果为日期型索引(DatetimeIndex)
|
||||
In [121]: tdi + pd.Timestamp('20130101')
|
||||
Out[121]: DatetimeIndex(['2013-01-02', 'NaT', '2013-01-03'], dtype='datetime64[ns]', freq=None)
|
||||
|
||||
# 日期与时间戳相减,结果为日期型数据(Timestamp)
|
||||
# note that trying to subtract a date from a Timedelta will raise an exception
|
||||
In [122]: (pd.Timestamp('20130101') - tdi).to_list()
|
||||
Out[122]: [Timestamp('2012-12-31 00:00:00'), NaT, Timestamp('2012-12-30 00:00:00')]
|
||||
|
||||
# 时间差与时间差相加,结果还是时间差索引
|
||||
In [123]: tdi + pd.Timedelta('10 days')
|
||||
Out[123]: TimedeltaIndex(['11 days', NaT, '12 days'], dtype='timedelta64[ns]', freq=None)
|
||||
|
||||
# 除数是整数,则结果为时间差索引
|
||||
In [124]: tdi / 2
|
||||
Out[124]: TimedeltaIndex(['0 days 12:00:00', NaT, '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
|
||||
|
||||
# 除数是时间差,则结果为 Float64Index
|
||||
In [125]: tdi / tdi[0]
|
||||
Out[125]: Float64Index([1.0, nan, 2.0], dtype='float64')
|
||||
```
|
||||
|
||||
## 重采样
|
||||
|
||||
与[时间序列重采样](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-resampling)一样,`TimedeltaIndex` 也支持重采样。
|
||||
|
||||
``` python
|
||||
In [126]: s.resample('D').mean()
|
||||
Out[126]:
|
||||
1 days 11.5
|
||||
2 days 35.5
|
||||
3 days 59.5
|
||||
4 days 83.5
|
||||
5 days 97.5
|
||||
Freq: D, dtype: float64
|
||||
```
|
||||
3442
Python/pandas/user_guide/timeseries.md
Normal file
3442
Python/pandas/user_guide/timeseries.md
Normal file
File diff suppressed because it is too large
Load Diff
1344
Python/pandas/user_guide/visualization.md
Normal file
1344
Python/pandas/user_guide/visualization.md
Normal file
File diff suppressed because it is too large
Load Diff
Reference in New Issue
Block a user