\ No newline at end of file
diff --git a/Python/matplotlab/advanced/path_tutorial.md b/Python/matplotlab/advanced/path_tutorial.md
new file mode 100644
index 00000000..1d2c463c
--- /dev/null
+++ b/Python/matplotlab/advanced/path_tutorial.md
@@ -0,0 +1,265 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Path Tutorial
+
+Defining paths in your Matplotlib visualization.
+
+The object underlying all of the ``matplotlib.patches`` objects is
+the [``Path``](https://matplotlib.org/api/path_api.html#matplotlib.path.Path), which supports the standard set of
+moveto, lineto, curveto commands to draw simple and compound outlines
+consisting of line segments and splines. The ``Path`` is instantiated
+with an (N, 2) array of (x, y) vertices and an N-length array of path
+codes. For example, to draw the unit rectangle from (0, 0) to (1, 1), we
+could use this code:
+
+``` python
+import matplotlib.pyplot as plt
+from matplotlib.path import Path
+import matplotlib.patches as patches
+
+verts = [
+ (0., 0.), # left, bottom
+ (0., 1.), # left, top
+ (1., 1.), # right, top
+ (1., 0.), # right, bottom
+ (0., 0.), # ignored
+]
+
+codes = [
+ Path.MOVETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.LINETO,
+ Path.CLOSEPOLY,
+]
+
+path = Path(verts, codes)
+
+fig, ax = plt.subplots()
+patch = patches.PathPatch(path, facecolor='orange', lw=2)
+ax.add_patch(patch)
+ax.set_xlim(-2, 2)
+ax.set_ylim(-2, 2)
+plt.show()
+```
+
+
+
+The following path codes are recognized:
+
+| Code | Vertices | Description |
+| --- | --- | --- |
+| ``STOP`` | 1 (ignored) | A marker for the end of the entire path (currently not required and ignored). |
+| ``MOVETO`` | 1 | Pick up the pen and move to the given vertex. |
+| ``LINETO`` | 1 | Draw a line from the current position to the given vertex. |
+| ``CURVE3`` | 2 (1 control point, 1 end point) | Draw a quadratic Bézier curve from the current position, with the given control point, to the given end point. |
+| ``CURVE4`` | 3 (2 control points, 1 end point) | Draw a cubic Bézier curve from the current position, with the given control points, to the given end point. |
+| ``CLOSEPOLY`` | 1 (point itself is ignored) | Draw a line segment to the start point of the current polyline. |
+
+## Bézier example
+
+Some of the path components require multiple vertices to specify them:
+for example CURVE3 is a [Bézier](https://en.wikipedia.org/wiki/B%C3%A9zier_curve) curve with one
+control point and one end point, and CURVE4 has three vertices for the
+two control points and the end point. The example below shows a
+CURVE4 Bézier spline -- the Bézier curve will be contained in the
+convex hull of the start point, the two control points, and the end
+point:
+
+``` python
+verts = [
+ (0., 0.), # P0
+ (0.2, 1.), # P1
+ (1., 0.8), # P2
+ (0.8, 0.), # P3
+]
+
+codes = [
+ Path.MOVETO,
+ Path.CURVE4,
+ Path.CURVE4,
+ Path.CURVE4,
+]
+
+path = Path(verts, codes)
+
+fig, ax = plt.subplots()
+patch = patches.PathPatch(path, facecolor='none', lw=2)
+ax.add_patch(patch)
+
+xs, ys = zip(*verts)
+ax.plot(xs, ys, 'x--', lw=2, color='black', ms=10)
+
+ax.text(-0.05, -0.05, 'P0')
+ax.text(0.15, 1.05, 'P1')
+ax.text(1.05, 0.85, 'P2')
+ax.text(0.85, -0.05, 'P3')
+
+ax.set_xlim(-0.1, 1.1)
+ax.set_ylim(-0.1, 1.1)
+plt.show()
+```
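+
+The quadratic ``CURVE3`` code works the same way but needs only two
+vertices, the single control point and the end point. Below is a minimal
+sketch (reusing the imports from the first example) of the quadratic
+counterpart:
+
+``` python
+verts = [
+    (0., 0.),   # start point (MOVETO)
+    (0.5, 1.),  # control point
+    (1., 0.),   # end point
+]
+
+codes = [
+    Path.MOVETO,
+    Path.CURVE3,
+    Path.CURVE3,
+]
+
+path = Path(verts, codes)
+
+fig, ax = plt.subplots()
+patch = patches.PathPatch(path, facecolor='none', lw=2)
+ax.add_patch(patch)
+ax.set_xlim(-0.1, 1.1)
+ax.set_ylim(-0.1, 1.1)
+plt.show()
+```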
+
+
+
+## Compound paths
+
+All of the simple patch primitives in matplotlib, Rectangle, Circle,
+Polygon, etc, are implemented with simple paths. Plotting functions
+like [``hist()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.hist.html#matplotlib.axes.Axes.hist) and
+[``bar()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.bar.html#matplotlib.axes.Axes.bar), which create a number of
+primitives, e.g., a bunch of Rectangles, can usually be implemented more
+efficiently using a compound path. The reason ``bar`` creates a list
+of rectangles and not a compound path is largely historical: the
+[``Path``](https://matplotlib.org/api/path_api.html#matplotlib.path.Path) code is comparatively new and ``bar``
+predates it. While we could change it now, it would break old code,
+so here we will cover how to create compound paths, replacing the
+functionality in bar, in case you need to do so in your own code for
+efficiency reasons, e.g., you are creating an animated bar plot.
+
+We will make the histogram chart by creating a series of rectangles
+for each histogram bar: the rectangle width is the bin width and the
+rectangle height is the number of datapoints in that bin. First we'll
+create some random normally distributed data and compute the
+histogram. Because numpy returns the bin edges and not centers, the
+length of ``bins`` is 1 greater than the length of ``n`` in the
+example below:
+
+``` python
+# histogram our data with numpy
+data = np.random.randn(1000)
+n, bins = np.histogram(data, 100)
+```
+
+We'll now extract the corners of the rectangles. Each of the
+``left``, ``bottom``, etc, arrays below is ``len(n)``, where ``n`` is
+the array of counts for each histogram bar:
+
+``` python
+# get the corners of the rectangles for the histogram
+left = np.array(bins[:-1])
+right = np.array(bins[1:])
+bottom = np.zeros(len(left))
+top = bottom + n
+nrects = len(left)
+```
+
+Now we have to construct our compound path, which will consist of a
+series of ``MOVETO``, ``LINETO`` and ``CLOSEPOLY`` for each rectangle.
+For each rectangle, we need 5 vertices: 1 for the ``MOVETO``, 3 for
+the ``LINETO``, and 1 for the ``CLOSEPOLY``. As indicated in the
+table above, the vertex for the closepoly is ignored but we still need
+it to keep the codes aligned with the vertices:
+
+``` python
+nverts = nrects*(1+3+1)
+verts = np.zeros((nverts, 2))
+codes = np.ones(nverts, int) * path.Path.LINETO
+codes[0::5] = path.Path.MOVETO
+codes[4::5] = path.Path.CLOSEPOLY
+verts[0::5,0] = left
+verts[0::5,1] = bottom
+verts[1::5,0] = left
+verts[1::5,1] = top
+verts[2::5,0] = right
+verts[2::5,1] = top
+verts[3::5,0] = right
+verts[3::5,1] = bottom
+```
+
+All that remains is to create the path, attach it to a
+``PathPatch``, and add it to our axes:
+
+``` python
+barpath = path.Path(verts, codes)
+patch = patches.PathPatch(barpath, facecolor='green',
+ edgecolor='yellow', alpha=0.5)
+ax.add_patch(patch)
+```
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+import matplotlib.patches as patches
+import matplotlib.path as path
+
+fig, ax = plt.subplots()
+# Fixing random state for reproducibility
+np.random.seed(19680801)
+
+# histogram our data with numpy
+data = np.random.randn(1000)
+n, bins = np.histogram(data, 100)
+
+# get the corners of the rectangles for the histogram
+left = np.array(bins[:-1])
+right = np.array(bins[1:])
+bottom = np.zeros(len(left))
+top = bottom + n
+nrects = len(left)
+
+nverts = nrects*(1+3+1)
+verts = np.zeros((nverts, 2))
+codes = np.ones(nverts, int) * path.Path.LINETO
+codes[0::5] = path.Path.MOVETO
+codes[4::5] = path.Path.CLOSEPOLY
+verts[0::5, 0] = left
+verts[0::5, 1] = bottom
+verts[1::5, 0] = left
+verts[1::5, 1] = top
+verts[2::5, 0] = right
+verts[2::5, 1] = top
+verts[3::5, 0] = right
+verts[3::5, 1] = bottom
+
+barpath = path.Path(verts, codes)
+patch = patches.PathPatch(barpath, facecolor='green',
+ edgecolor='yellow', alpha=0.5)
+ax.add_patch(patch)
+
+ax.set_xlim(left[0], right[-1])
+ax.set_ylim(bottom.min(), top.max())
+
+plt.show()
+```
+
+
+
+## Download
+
+- [Download Python source code: path_tutorial.py](https://matplotlib.org/_downloads/ec90dd07bc241d860eb972db796c96bc/path_tutorial.py)
+- [Download Jupyter notebook: path_tutorial.ipynb](https://matplotlib.org/_downloads/da8cacf827800cc7398495a527da865d/path_tutorial.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/advanced/patheffects_guide.md b/Python/matplotlab/advanced/patheffects_guide.md
new file mode 100644
index 00000000..82068387
--- /dev/null
+++ b/Python/matplotlab/advanced/patheffects_guide.md
@@ -0,0 +1,122 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Path effects guide
+
+Applying extra draw stages to the paths of Matplotlib artists.
+
+Matplotlib's [``patheffects``](#module-matplotlib.patheffects) module provides functionality to
+apply multiple draw stages to any Artist which can be rendered via a
+[``Path``](https://matplotlib.org/api/path_api.html#matplotlib.path.Path).
+
+Artists which can have a path effect applied to them include [``Patch``](https://matplotlib.org/api/_as_gen/matplotlib.patches.Patch.html#matplotlib.patches.Patch),
+[``Line2D``](https://matplotlib.org/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D), [``Collection``](https://matplotlib.org/api/collections_api.html#matplotlib.collections.Collection) and even
+[``Text``](https://matplotlib.org/api/text_api.html#matplotlib.text.Text). Each artist's path effects can be controlled via its
+[``set_path_effects``](https://matplotlib.org/api/_as_gen/matplotlib.artist.Artist.set_path_effects.html#matplotlib.artist.Artist.set_path_effects) method, which takes
+an iterable of [``AbstractPathEffect``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.AbstractPathEffect) instances.
+
+The simplest path effect is the [``Normal``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.Normal) effect, which simply
+draws the artist without any special effect:
+
+``` python
+import matplotlib.pyplot as plt
+import matplotlib.patheffects as path_effects
+
+fig = plt.figure(figsize=(5, 1.5))
+text = fig.text(0.5, 0.5, 'Hello path effects world!\nThis is the normal '
+ 'path effect.\nPretty dull, huh?',
+ ha='center', va='center', size=20)
+text.set_path_effects([path_effects.Normal()])
+plt.show()
+```
+
+
+
+Whilst the plot doesn't look any different from what you would expect without any path
+effects, the drawing of the text has now been changed to use the path effects
+framework, opening up the possibility of more interesting examples.
+
+## Adding a shadow
+
+A far more interesting path effect than [``Normal``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.Normal) is the
+drop-shadow, which we can apply to any of our path-based artists. The classes
+[``SimplePatchShadow``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.SimplePatchShadow) and
+[``SimpleLineShadow``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.SimpleLineShadow) do precisely this by drawing either a filled
+patch or a line patch below the original artist:
+
+``` python
+import matplotlib.patheffects as path_effects
+
+text = plt.text(0.5, 0.5, 'Hello path effects world!',
+ path_effects=[path_effects.withSimplePatchShadow()])
+
+plt.plot([0, 3, 2, 5], linewidth=5, color='blue',
+ path_effects=[path_effects.SimpleLineShadow(),
+ path_effects.Normal()])
+plt.show()
+```
+
+
+
+Notice the two approaches to setting the path effects in this example. The
+first uses the ``with*`` classes to include the desired functionality automatically
+followed by the "normal" effect, whereas the second explicitly defines the two path
+effects to draw.
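+
+As a rough sketch of that equivalence (the text strings and layout here are
+just for illustration), the two labels below should render with the same
+shadow; the first relies on the ``with*`` convenience class and the second
+spells out both draw stages explicitly:
+
+``` python
+import matplotlib.pyplot as plt
+import matplotlib.patheffects as path_effects
+
+fig = plt.figure(figsize=(6, 1.5))
+# convenience form: withSimplePatchShadow() bundles the shadow and the
+# normal draw stage into a single path effect
+fig.text(0.5, 0.7, 'convenience form', ha='center', size=18,
+         path_effects=[path_effects.withSimplePatchShadow()])
+# explicit form: list the shadow first, then the normal draw stage
+fig.text(0.5, 0.3, 'explicit form', ha='center', size=18,
+         path_effects=[path_effects.SimplePatchShadow(),
+                       path_effects.Normal()])
+plt.show()
+```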
+
+## Making an artist stand out
+
+One nice way of making artists visually stand out is to draw an outline in a bold
+color below the actual artist. The [``Stroke``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.Stroke) path effect
+makes this a relatively simple task:
+
+``` python
+fig = plt.figure(figsize=(7, 1))
+text = fig.text(0.5, 0.5, 'This text stands out because of\n'
+ 'its black border.', color='white',
+ ha='center', va='center', size=30)
+text.set_path_effects([path_effects.Stroke(linewidth=3, foreground='black'),
+ path_effects.Normal()])
+plt.show()
+```
+
+
+
+It is important to note that this effect only works because we have drawn the text
+path twice; once with a thick black line, and then once with the original text
+path on top.
+
+You may have noticed that the keywords to [``Stroke``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.Stroke),
+[``SimplePatchShadow``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.SimplePatchShadow) and [``SimpleLineShadow``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.SimpleLineShadow) are not the usual Artist
+keywords (such as ``facecolor`` and ``edgecolor``). This is because with these
+path effects we are operating at a lower level of Matplotlib. In fact, the keywords
+which are accepted are those of a [``matplotlib.backend_bases.GraphicsContextBase``](https://matplotlib.org/api/backend_bases_api.html#matplotlib.backend_bases.GraphicsContextBase)
+instance, which was designed to make it easy to create new backends, not to
+provide a user-facing interface.
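+
+For instance, here is a sketch of the same kind of outline using the
+``withStroke`` convenience class; ``foreground`` and ``linewidth`` are
+GraphicsContextBase-level keywords, whereas an Artist keyword such as
+``facecolor`` would not be accepted:
+
+``` python
+import matplotlib.pyplot as plt
+import matplotlib.patheffects as path_effects
+
+fig = plt.figure(figsize=(7, 1))
+text = fig.text(0.5, 0.5, 'Outlined with GC-level keywords', color='white',
+                ha='center', va='center', size=28)
+# 'foreground' and 'linewidth' are GraphicsContextBase properties;
+# passing facecolor/edgecolor here would not work.
+text.set_path_effects([path_effects.withStroke(linewidth=3,
+                                               foreground='black')])
+plt.show()
+```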
+
+## Greater control of the path effect artist
+
+As already mentioned, some of the path effects operate at a lower level than most users
+will be used to, meaning that setting keywords such as ``facecolor`` and ``edgecolor``
+raises an AttributeError. Luckily there is a generic [``PathPatchEffect``](https://matplotlib.org/api/patheffects_api.html#matplotlib.patheffects.PathPatchEffect) path effect
+which creates a [``PathPatch``](https://matplotlib.org/api/_as_gen/matplotlib.patches.PathPatch.html#matplotlib.patches.PathPatch) from the original path.
+The keywords to this effect are identical to those of [``PathPatch``](https://matplotlib.org/api/_as_gen/matplotlib.patches.PathPatch.html#matplotlib.patches.PathPatch):
+
+``` python
+fig = plt.figure(figsize=(8, 1))
+t = fig.text(0.02, 0.5, 'Hatch shadow', fontsize=75, weight=1000, va='center')
+t.set_path_effects([path_effects.PathPatchEffect(offset=(4, -4), hatch='xxxx',
+ facecolor='gray'),
+ path_effects.PathPatchEffect(edgecolor='white', linewidth=1.1,
+ facecolor='black')])
+plt.show()
+```
+
+
+
+## Download
+
+- [Download Python source code: patheffects_guide.py](https://matplotlib.org/_downloads/b0857128f7eceadab81240baf9185710/patheffects_guide.py)
+- [Download Jupyter notebook: patheffects_guide.ipynb](https://matplotlib.org/_downloads/d678b58ce777643e611577a5aafc6f8d/patheffects_guide.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/advanced/transforms_tutorial.md b/Python/matplotlab/advanced/transforms_tutorial.md
new file mode 100644
index 00000000..2eaeff51
--- /dev/null
+++ b/Python/matplotlab/advanced/transforms_tutorial.md
@@ -0,0 +1,615 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Transformations Tutorial
+
+Like any graphics package, Matplotlib is built on top of a
+transformation framework to easily move between coordinate systems:
+the userland ``data`` coordinate system, the ``axes`` coordinate system,
+the ``figure`` coordinate system, and the ``display`` coordinate system.
+In 95% of your plotting, you won't need to think about this, as it
+happens under the hood, but as you push the limits of custom figure
+generation, it helps to have an understanding of these objects so you
+can reuse the existing transformations Matplotlib makes available to
+you, or create your own (see [``matplotlib.transforms``](https://matplotlib.org/api/transformations.html#module-matplotlib.transforms)). The table
+below summarizes some useful coordinate systems, the transformation
+object you should use to work in that coordinate system, and the
+description of that system. In the ``Transformation Object`` column,
+``ax`` is an [``Axes``](https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes) instance, and ``fig`` is a
+[``Figure``](https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure) instance.
+
+
+| Coordinates | Transformation object | Description |
+| --- | --- | --- |
+| "data" | ``ax.transData`` | The coordinate system for the data, controlled by xlim and ylim. |
+| "axes" | ``ax.transAxes`` | The coordinate system of the [Axes](https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes); (0, 0) is bottom left of the axes, and (1, 1) is top right of the axes. |
+| "figure" | ``fig.transFigure`` | The coordinate system of the [Figure](https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure); (0, 0) is bottom left of the figure, and (1, 1) is top right of the figure. |
+| "figure-inches" | ``fig.dpi_scale_trans`` | The coordinate system of the Figure in inches; (0, 0) is bottom left of the figure, and (width, height) is the top right of the figure in inches. |
+| "display" | ``None``, or ``IdentityTransform()`` | The pixel coordinate system of the display window; (0, 0) is bottom left of the window, and (width, height) is top right of the display window in pixels. |
+| "xaxis", "yaxis" | ``ax.get_xaxis_transform()``, ``ax.get_yaxis_transform()`` | Blended coordinate systems; use data coordinates on one axis and axes coordinates on the other. |
+
+
+All of the transformation objects in the table above take inputs in
+their coordinate system, and transform the input to the ``display``
+coordinate system. That is why the ``display`` coordinate system has
+``None`` for the ``Transformation Object`` column -- it already is in
+display coordinates. The transformations also know how to invert
+themselves, to go from ``display`` back to the native coordinate system.
+This is particularly useful when processing events from the user
+interface, which typically occur in display space, and you want to
+know where the mouse click or key-press occurred in your data
+coordinate system.
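+
+As a minimal sketch of that use case (the callback name and printed format
+are only illustrative), a mouse click reported in display space can be mapped
+back to data space with the inverted ``ax.transData``; in practice the event
+object also carries ``event.xdata`` and ``event.ydata``, which are computed
+in the same way:
+
+``` python
+import matplotlib.pyplot as plt
+
+fig, ax = plt.subplots()
+ax.plot([0, 1, 2], [0, 1, 0])
+
+
+def on_click(event):
+    # event.x, event.y are display (pixel) coordinates
+    xdata, ydata = ax.transData.inverted().transform((event.x, event.y))
+    print('display:', (event.x, event.y), '-> data:', (xdata, ydata))
+
+
+fig.canvas.mpl_connect('button_press_event', on_click)
+plt.show()
+```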
+
+Note that specifying objects in ``display`` coordinates will change their
+location if the ``dpi`` of the figure changes. This can cause confusion when
+printing or changing screen resolution, because the object can change location
+and size. Therefore it is most common
+for artists placed in an axes or figure to have their transform set to
+something *other* than the [``IdentityTransform()``](https://matplotlib.org/api/transformations.html#matplotlib.transforms.IdentityTransform); the default when
+an artist is placed on an axes using ``add_artist`` is for the
+transform to be ``ax.transData``.
+
+## Data coordinates
+
+Let's start with the most commonly used coordinate system, the ``data``
+coordinate system. Whenever you add data to the axes, Matplotlib
+updates the data limits, most commonly via the
+[``set_xlim()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xlim.html#matplotlib.axes.Axes.set_xlim) and
+[``set_ylim()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_ylim.html#matplotlib.axes.Axes.set_ylim) methods. For example, in the
+figure below, the data limits stretch from 0 to 10 on the x-axis, and
+-1 to 1 on the y-axis.
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+import matplotlib.patches as mpatches
+
+x = np.arange(0, 10, 0.005)
+y = np.exp(-x/2.) * np.sin(2*np.pi*x)
+
+fig, ax = plt.subplots()
+ax.plot(x, y)
+ax.set_xlim(0, 10)
+ax.set_ylim(-1, 1)
+
+plt.show()
+```
+
+
+
+You can use the ``ax.transData`` instance to transform from your
+``data`` to your ``display`` coordinate system, either a single point or a
+sequence of points as shown below:
+
+``` python
+In [14]: type(ax.transData)
+Out[14]: <class 'matplotlib.transforms.CompositeGenericTransform'>
+
+In [15]: ax.transData.transform((5, 0))
+Out[15]: array([ 335.175, 247. ])
+
+In [16]: ax.transData.transform([(5, 0), (1, 2)])
+Out[16]:
+array([[ 335.175, 247. ],
+ [ 132.435, 642.2 ]])
+```
+
+You can use the [``inverted()``](https://matplotlib.org/api/transformations.html#matplotlib.transforms.Transform.inverted)
+method to create a transform which will take you from display to data
+coordinates:
+
+``` python
+In [41]: inv = ax.transData.inverted()
+
+In [42]: type(inv)
+Out[42]: <class 'matplotlib.transforms.CompositeGenericTransform'>
+
+In [43]: inv.transform((335.175, 247.))
+Out[43]: array([ 5., 0.])
+```
+
+If you are typing along with this tutorial, the exact values of the
+display coordinates may differ if you have a different window size or
+dpi setting. Likewise, in the figure below, the display-labeled
+points are probably not the same as in the ipython session because the
+documentation figure size defaults are different.
+
+``` python
+x = np.arange(0, 10, 0.005)
+y = np.exp(-x/2.) * np.sin(2*np.pi*x)
+
+fig, ax = plt.subplots()
+ax.plot(x, y)
+ax.set_xlim(0, 10)
+ax.set_ylim(-1, 1)
+
+xdata, ydata = 5, 0
+xdisplay, ydisplay = ax.transData.transform_point((xdata, ydata))
+
+bbox = dict(boxstyle="round", fc="0.8")
+arrowprops = dict(
+ arrowstyle="->",
+ connectionstyle="angle,angleA=0,angleB=90,rad=10")
+
+offset = 72
+ax.annotate('data = (%.1f, %.1f)' % (xdata, ydata),
+ (xdata, ydata), xytext=(-2*offset, offset), textcoords='offset points',
+ bbox=bbox, arrowprops=arrowprops)
+
+disp = ax.annotate('display = (%.1f, %.1f)' % (xdisplay, ydisplay),
+ (xdisplay, ydisplay), xytext=(0.5*offset, -offset),
+ xycoords='figure pixels',
+ textcoords='offset points',
+ bbox=bbox, arrowprops=arrowprops)
+
+plt.show()
+```
+
+
+
+::: tip Note
+
+If you run the source code in the example above in a GUI backend,
+you may also find that the two arrows for the ``data`` and ``display``
+annotations do not point to exactly the same point. This is because
+the display point was computed before the figure was displayed, and
+the GUI backend may slightly resize the figure when it is created.
+The effect is more pronounced if you resize the figure yourself.
+This is one good reason why you rarely want to work in display
+space, but you can connect to the ``'draw_event'``
+[``Event``](https://matplotlib.org/api/backend_bases_api.html#matplotlib.backend_bases.Event) to update figure
+coordinates on figure draws; see [Event handling and picking](https://matplotlib.org/users/event_handling.html#event-handling-tutorial).
+
+:::
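+
+A minimal sketch of that approach, reusing ``fig``, ``ax``, ``disp``,
+``xdata`` and ``ydata`` from the example above (the callback name is just
+for illustration): connect to the draw event and recompute the display
+coordinates each time the figure is drawn:
+
+``` python
+def on_draw(event):
+    # recompute where the data point now falls in display space and move
+    # the 'display' annotation (and its label) to match
+    xdisplay, ydisplay = ax.transData.transform((xdata, ydata))
+    disp.xy = (xdisplay, ydisplay)
+    disp.set_text('display = (%.1f, %.1f)' % (xdisplay, ydisplay))
+
+
+fig.canvas.mpl_connect('draw_event', on_draw)
+```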
+
+When you change the x or y limits of your axes, the data limits are
+updated so the transformation yields a new display point. Note that
+when we just change the ylim, only the y-display coordinate is
+altered, and when we change the xlim too, both are altered. More on
+this later when we talk about the
+[``Bbox``](https://matplotlib.org/api/transformations.html#matplotlib.transforms.Bbox).
+
+``` python
+In [54]: ax.transData.transform((5, 0))
+Out[54]: array([ 335.175, 247. ])
+
+In [55]: ax.set_ylim(-1, 2)
+Out[55]: (-1, 2)
+
+In [56]: ax.transData.transform((5, 0))
+Out[56]: array([ 335.175 , 181.13333333])
+
+In [57]: ax.set_xlim(10, 20)
+Out[57]: (10, 20)
+
+In [58]: ax.transData.transform((5, 0))
+Out[58]: array([-171.675 , 181.13333333])
+```
+
+## Axes coordinates
+
+After the ``data`` coordinate system, ``axes`` is probably the second most
+useful coordinate system. Here the point (0, 0) is the bottom left of
+your axes or subplot, (0.5, 0.5) is the center, and (1.0, 1.0) is the
+top right. You can also refer to points outside the range, so (-0.1,
+1.1) is to the left and above your axes. This coordinate system is
+extremely useful when placing text in your axes, because you often
+want a text bubble in a fixed location, e.g., the upper left of the axes
+pane, and have that location remain fixed when you pan or zoom. Here
+is a simple example that creates four panels and labels them 'A', 'B',
+'C', and 'D' as you often see in journals.
+
+``` python
+fig = plt.figure()
+for i, label in enumerate(('A', 'B', 'C', 'D')):
+ ax = fig.add_subplot(2, 2, i+1)
+ ax.text(0.05, 0.95, label, transform=ax.transAxes,
+ fontsize=16, fontweight='bold', va='top')
+
+plt.show()
+```
+
+
+
+You can also make lines or patches in the axes coordinate system, but
+this is less useful in my experience than using ``ax.transAxes`` for
+placing text. Nonetheless, here is a silly example which plots some
+random dots in ``data`` space, and overlays a semi-transparent
+[``Circle``](https://matplotlib.org/api/_as_gen/matplotlib.patches.Circle.html#matplotlib.patches.Circle) centered in the middle of the axes
+with a radius one quarter of the axes -- if your axes does not
+preserve aspect ratio (see [``set_aspect()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_aspect.html#matplotlib.axes.Axes.set_aspect)),
+this will look like an ellipse. Use the pan/zoom tool to move around,
+or manually change the data xlim and ylim, and you will see the data
+move, but the circle will remain fixed because it is not in ``data``
+coordinates and will always remain at the center of the axes.
+
+``` python
+fig, ax = plt.subplots()
+x, y = 10*np.random.rand(2, 1000)
+ax.plot(x, y, 'go', alpha=0.2) # plot some data in data coordinates
+
+circ = mpatches.Circle((0.5, 0.5), 0.25, transform=ax.transAxes,
+ facecolor='blue', alpha=0.75)
+ax.add_patch(circ)
+plt.show()
+```
+
+
+
+## Blended transformations
+
+Drawing in ``blended`` coordinate spaces which mix ``axes`` with ``data``
+coordinates is extremely useful, for example to create a horizontal
+span which highlights some region of the y-data but spans across the
+x-axis regardless of the data limits, pan or zoom level, etc. In fact,
+these blended lines and spans are so useful that we have built-in
+functions to make them easy to plot (see
+[``axhline()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.axhline.html#matplotlib.axes.Axes.axhline),
+[``axvline()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.axvline.html#matplotlib.axes.Axes.axvline),
+[``axhspan()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.axhspan.html#matplotlib.axes.Axes.axhspan),
+[``axvspan()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.axvspan.html#matplotlib.axes.Axes.axvspan)) but for didactic purposes we
+will implement the horizontal span here using a blended
+transformation. This trick only works for separable transformations,
+like you see in normal Cartesian coordinate systems, but not on
+inseparable transformations like the
+[``PolarTransform``](https://matplotlib.org/api/projections_api.html#matplotlib.projections.polar.PolarAxes.PolarTransform).
+
+``` python
+import matplotlib.transforms as transforms
+
+fig, ax = plt.subplots()
+x = np.random.randn(1000)
+
+ax.hist(x, 30)
+ax.set_title(r'$\sigma=1 \/ \dots \/ \sigma=2$', fontsize=16)
+
+# the x coords of this transformation are data, and the
+# y coord are axes
+trans = transforms.blended_transform_factory(
+ ax.transData, ax.transAxes)
+
+# highlight the 1..2 stddev region with a span.
+# We want x to be in data coordinates and y to
+# span from 0..1 in axes coords
+rect = mpatches.Rectangle((1, 0), width=1, height=1,
+ transform=trans, color='yellow',
+ alpha=0.5)
+
+ax.add_patch(rect)
+
+plt.show()
+```
+
+
+
+::: tip Note
+
+The blended transformation where x is in data coordinates and y in axes
+coordinates is so useful that we have helper methods to return the
+versions Matplotlib uses internally for drawing ticks, ticklabels, etc.
+The methods are [``matplotlib.axes.Axes.get_xaxis_transform()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.get_xaxis_transform.html#matplotlib.axes.Axes.get_xaxis_transform) and
+[``matplotlib.axes.Axes.get_yaxis_transform()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.get_yaxis_transform.html#matplotlib.axes.Axes.get_yaxis_transform). So in the example
+above, the call to
+[``blended_transform_factory()``](https://matplotlib.org/api/transformations.html#matplotlib.transforms.blended_transform_factory) can be
+replaced by ``get_xaxis_transform``:
+
+``` python
+trans = ax.get_xaxis_transform()
+```
+
+:::
+
+## Plotting in physical units
+
+Sometimes we want an object to be a certain physical size on the plot.
+Here we draw the same circle as above, but in physical units. If done
+interactively, you can see that changing the size of the figure does
+not change the offset of the circle from the lower-left corner,
+does not change its size, and the circle remains a circle regardless of
+the aspect ratio of the axes.
+
+``` python
+fig, ax = plt.subplots(figsize=(5, 4))
+x, y = 10*np.random.rand(2, 1000)
+ax.plot(x, y*10., 'go', alpha=0.2) # plot some data in data coordinates
+# add a circle in fixed-units
+circ = mpatches.Circle((2.5, 2), 1.0, transform=fig.dpi_scale_trans,
+ facecolor='blue', alpha=0.75)
+ax.add_patch(circ)
+plt.show()
+```
+
+
+
+If we change the figure size, the circle does not change its absolute
+position and is cropped.
+
+``` python
+fig, ax = plt.subplots(figsize=(7, 2))
+x, y = 10*np.random.rand(2, 1000)
+ax.plot(x, y*10., 'go', alpha=0.2) # plot some data in data coordinates
+# add a circle in fixed-units
+circ = mpatches.Circle((2.5, 2), 1.0, transform=fig.dpi_scale_trans,
+ facecolor='blue', alpha=0.75)
+ax.add_patch(circ)
+plt.show()
+```
+
+
+
+Another use is putting a patch with a set physical dimension around a
+data point on the axes. Here we add together two transforms. The
+first sets the scaling of how large the ellipse should be and the second
+sets its position. The ellipse is then placed at the origin, and then
+we use the helper transform [``ScaledTranslation``](https://matplotlib.org/api/transformations.html#matplotlib.transforms.ScaledTranslation)
+to move it
+to the right place in the ``ax.transData`` coordinate system.
+This helper is instantiated with:
+
+``` python
+trans = ScaledTranslation(xt, yt, scale_trans)
+```
+
+where ``xt`` and ``yt`` are the translation offsets, and ``scale_trans`` is
+a transformation which scales ``xt`` and ``yt`` at transformation time
+before applying the offsets.
+
+Note the use of the plus operator on the transforms below.
+This code says: first apply the scale transformation ``fig.dpi_scale_trans``
+to make the ellipse the proper size, but still centered at (0, 0),
+and then translate the data to ``xdata[0]`` and ``ydata[0]`` in data space.
+
+In interactive use, the ellipse stays the same size even if the
+axes limits are changed via zoom.
+
+``` python
+fig, ax = plt.subplots()
+xdata, ydata = (0.2, 0.7), (0.5, 0.5)
+ax.plot(xdata, ydata, "o")
+ax.set_xlim((0, 1))
+
+trans = (fig.dpi_scale_trans +
+ transforms.ScaledTranslation(xdata[0], ydata[0], ax.transData))
+
+# plot an ellipse around the point that is 150 x 130 points in diameter...
+circle = mpatches.Ellipse((0, 0), 150/72, 130/72, angle=40,
+ fill=None, transform=trans)
+ax.add_patch(circle)
+plt.show()
+```
+
+
+
+::: tip Note
+
+The order of transformation matters. Here the ellipse
+is given the right dimensions in display space *first* and then moved
+in data space to the correct spot.
+If we had done the ``ScaledTranslation`` first, then
+``xdata[0]`` and ``ydata[0]`` would
+first be transformed to ``display`` coordinates (``[ 358.4 475.2]`` on
+a 200-dpi monitor) and then those coordinates
+would be scaled by ``fig.dpi_scale_trans`` pushing the center of
+the ellipse well off the screen (i.e. ``[ 71680. 95040.]``).
+
+:::
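+
+A quick sketch of the wrong ordering makes the point; it reuses ``fig``,
+``ax``, ``xdata``, ``ydata`` and the ``transforms`` import from the example
+above, and the exact numbers printed depend on your figure size and dpi:
+
+``` python
+# the ScaledTranslation is applied first here and fig.dpi_scale_trans second,
+# so the already-large display offsets get multiplied by the dpi again
+bad_trans = (transforms.ScaledTranslation(xdata[0], ydata[0], ax.transData) +
+             fig.dpi_scale_trans)
+print(bad_trans.transform((0, 0)))   # huge values, far off the screen
+```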
+
+## Using offset transforms to create a shadow effect
+
+Another use of [``ScaledTranslation``](https://matplotlib.org/api/transformations.html#matplotlib.transforms.ScaledTranslation) is to create
+a new transformation that is
+offset from another transformation, e.g., to place one object shifted a
+bit relative to another object. Typically you want the shift to be in
+some physical dimension, like points or inches rather than in data
+coordinates, so that the shift effect is constant at different zoom
+levels and dpi settings.
+
+One use for an offset is to create a shadow effect, where you draw one
+object identical to the first just to the right of it, and just below
+it, adjusting the zorder to make sure the shadow is drawn first and
+then the object it is shadowing above it.
+
+Here we apply the transforms in the *opposite* order to the use of
+[``ScaledTranslation``](https://matplotlib.org/api/transformations.html#matplotlib.transforms.ScaledTranslation) above. The plot is
+first made in data units (``ax.transData``) and then shifted by
+``dx`` and ``dy`` points using ``fig.dpi_scale_trans``. (In typography,
+a [point](https://en.wikipedia.org/wiki/Point_%28typography%29) is
+1/72 of an inch, and by specifying your offsets in points, your figure
+will look the same regardless of the dpi resolution it is saved at.)
+
+``` python
+fig, ax = plt.subplots()
+
+# make a simple sine wave
+x = np.arange(0., 2., 0.01)
+y = np.sin(2*np.pi*x)
+line, = ax.plot(x, y, lw=3, color='blue')
+
+# shift the object over 2 points, and down 2 points
+dx, dy = 2/72., -2/72.
+offset = transforms.ScaledTranslation(dx, dy, fig.dpi_scale_trans)
+shadow_transform = ax.transData + offset
+
+# now plot the same data with our offset transform;
+# use the zorder to make sure we are below the line
+ax.plot(x, y, lw=3, color='gray',
+ transform=shadow_transform,
+ zorder=0.5*line.get_zorder())
+
+ax.set_title('creating a shadow effect with an offset transform')
+plt.show()
+```
+
+
+
+::: tip Note
+
+The dpi and inches offset is a
+common-enough use case that we have a special helper function to
+create it in [``matplotlib.transforms.offset_copy()``](https://matplotlib.org/api/transformations.html#matplotlib.transforms.offset_copy), which returns
+a new transform with an added offset. So above we could have done:
+
+``` python
+shadow_transform = transforms.offset_copy(ax.transData, fig=fig,
+                                           x=dx, y=dy, units='inches')
+```
+
+:::
+
+## The transformation pipeline
+
+The ``ax.transData`` transform we have been working with in this
+tutorial is a composite of three different transformations that
+comprise the transformation pipeline from ``data`` -> ``display``
+coordinates. Michael Droettboom implemented the transformations
+framework, taking care to provide a clean API that segregated the
+nonlinear projections and scales that happen in polar and logarithmic
+plots, from the linear affine transformations that happen when you pan
+and zoom. There is an efficiency here, because you can pan and zoom
+in your axes which affects the affine transformation, but you may not
+need to compute the potentially expensive nonlinear scales or
+projections on simple navigation events. It is also possible to
+multiply affine transformation matrices together, and then apply them
+to coordinates in one step. This is not true of all possible
+transformations.
+
+Here is how the ``ax.transData`` instance is defined in the basic
+separable axis [``Axes``](https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes) class:
+
+``` python
+self.transData = self.transScale + (self.transLimits + self.transAxes)
+```
+
+We've been introduced to the ``transAxes`` instance above in
+[Axes coordinates](#axes-coordinates), which maps the (0, 0), (1, 1) corners of the
+axes or subplot bounding box to ``display`` space, so let's look at
+these other two pieces.
+
+``self.transLimits`` is the transformation that takes you from
+``data`` to ``axes`` coordinates; i.e., it maps your view xlim and ylim
+to the unit space of the axes (and ``transAxes`` then takes that unit
+space to display space). We can see this in action here:
+
+``` python
+In [80]: ax = plt.subplot(111)
+
+In [81]: ax.set_xlim(0, 10)
+Out[81]: (0, 10)
+
+In [82]: ax.set_ylim(-1, 1)
+Out[82]: (-1, 1)
+
+In [84]: ax.transLimits.transform((0, -1))
+Out[84]: array([ 0., 0.])
+
+In [85]: ax.transLimits.transform((10, -1))
+Out[85]: array([ 1., 0.])
+
+In [86]: ax.transLimits.transform((10, 1))
+Out[86]: array([ 1., 1.])
+
+In [87]: ax.transLimits.transform((5, 0))
+Out[87]: array([ 0.5, 0.5])
+```
+
+and we can use the inverted version of this same transformation to go from
+the unit ``axes`` coordinates back to ``data`` coordinates:
+
+``` python
+In [89]: inv = ax.transLimits.inverted()
+
+In [90]: inv.transform((0.25, 0.25))
+Out[90]: array([ 2.5, -0.5])
+```
+
+The final piece is the ``self.transScale`` attribute, which is
+responsible for the optional non-linear scaling of the data, e.g., for
+logarithmic axes. When an Axes is initially set up, this is just set to
+the identity transform, since the basic Matplotlib axes has a linear
+scale, but when you call a logarithmic scaling function like
+[``semilogx()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.semilogx.html#matplotlib.axes.Axes.semilogx) or explicitly set the scale to
+logarithmic with [``set_xscale()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xscale.html#matplotlib.axes.Axes.set_xscale), then the
+``ax.transScale`` attribute is set to handle the nonlinear projection.
+The scale transforms are properties of the respective ``xaxis`` and
+``yaxis`` [``Axis``](https://matplotlib.org/api/axis_api.html#matplotlib.axis.Axis) instances. For example, when
+you call ``ax.set_xscale('log')``, the xaxis updates its scale to a
+[``matplotlib.scale.LogScale``](https://matplotlib.org/api/scale_api.html#matplotlib.scale.LogScale) instance.
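+
+A minimal sketch of that switch (assuming a default linear y-axis): after
+``set_xscale('log')`` the ``ax.transScale`` stage maps x through log10(x)
+while leaving y unchanged:
+
+``` python
+import matplotlib.pyplot as plt
+
+fig, ax = plt.subplots()
+ax.plot([1, 10, 100, 1000], [0, 1, 2, 3])
+ax.set_xscale('log')
+
+# with a log x-scale, transScale maps x -> log10(x) and leaves y unchanged,
+# so (100, 0.5) should come out as approximately [2.  0.5]
+print(ax.transScale.transform((100, 0.5)))
+```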
+
+For non-separable axes like the PolarAxes, there is one more piece to
+consider: the projection transformation. The ``transData`` of
+[``matplotlib.projections.polar.PolarAxes``](https://matplotlib.org/api/projections_api.html#matplotlib.projections.polar.PolarAxes) is similar to that of
+the typical separable Matplotlib Axes, with one additional piece,
+``transProjection``:
+
+``` python
+self.transData = self.transScale + self.transProjection + \
+ (self.transProjectionAffine + self.transAxes)
+```
+
+``transProjection`` handles the projection from its own space,
+e.g., latitude and longitude for map data, or radius and theta for polar
+data, to a separable Cartesian coordinate system. There are several
+projection examples in the ``matplotlib.projections`` package, and the
+best way to learn more is to open the source for those packages and
+see how to make your own, since Matplotlib supports extensible axes
+and projections. Michael Droettboom has provided a nice tutorial
+example of creating a Hammer projection axes; see
+[Custom projection](https://matplotlib.org/gallery/misc/custom_projection.html).
+
+
+## Download
+
+- [Download Python source code: transforms_tutorial.py](https://matplotlib.org/_downloads/1d1cf62db33a4554c487470c01670fe5/transforms_tutorial.py)
+- [Download Jupyter notebook: transforms_tutorial.ipynb](https://matplotlib.org/_downloads/b6ea9be45c260fbed02d8e2d9b2e4549/transforms_tutorial.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/colors/colorbar_only.md b/Python/matplotlab/colors/colorbar_only.md
new file mode 100644
index 00000000..0c6d54dc
--- /dev/null
+++ b/Python/matplotlab/colors/colorbar_only.md
@@ -0,0 +1,123 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Customized Colorbars Tutorial
+
+This tutorial shows how to build colorbars without an attached plot.
+
+## Customized Colorbars
+
+[``ColorbarBase``](https://matplotlib.org/api/colorbar_api.html#matplotlib.colorbar.ColorbarBase) puts a colorbar in a specified axes,
+and can make a colorbar for a given colormap; it does not need a mappable
+object like an image. In this tutorial we will explore what can be done with
+a standalone colorbar.
+
+### Basic continuous colorbar
+
+Set the colormap and norm to correspond to the data for which the colorbar
+will be used. Then create the colorbar by calling
+[``ColorbarBase``](https://matplotlib.org/api/colorbar_api.html#matplotlib.colorbar.ColorbarBase), specifying the axes, colormap, norm
+and orientation as parameters. Here we create a basic continuous colorbar
+with ticks and labels. For more information see the
+[``colorbar``](https://matplotlib.org/api/colorbar_api.html#module-matplotlib.colorbar) API.
+
+``` python
+import matplotlib.pyplot as plt
+import matplotlib as mpl
+
+fig, ax = plt.subplots(figsize=(6, 1))
+fig.subplots_adjust(bottom=0.5)
+
+cmap = mpl.cm.cool
+norm = mpl.colors.Normalize(vmin=5, vmax=10)
+
+cb1 = mpl.colorbar.ColorbarBase(ax, cmap=cmap,
+ norm=norm,
+ orientation='horizontal')
+cb1.set_label('Some Units')
+fig.show()
+```
+
+
+
+### Discrete intervals colorbar
+
+The second example illustrates the use of a
+[``ListedColormap``](https://matplotlib.org/api/_as_gen/matplotlib.colors.ListedColormap.html#matplotlib.colors.ListedColormap), which generates a colormap from a
+set of listed colors, and ``colors.BoundaryNorm()``, which generates a colormap
+index based on discrete intervals; the colorbar has extended ends to show the
+"over" and "under" value colors. Over and under are used to display data
+outside of the normalized [0, 1] range. Here we pass the over and under
+colors as gray shades, each given as a string encoding a float in the 0-1 range.
+
+If a [``ListedColormap``](https://matplotlib.org/api/_as_gen/matplotlib.colors.ListedColormap.html#matplotlib.colors.ListedColormap) is used, the length of the
+bounds array must be one greater than the length of the color list. The
+bounds must be monotonically increasing.
+
+This time we pass some additional arguments to
+[``ColorbarBase``](https://matplotlib.org/api/colorbar_api.html#matplotlib.colorbar.ColorbarBase). For the out-of-range values to
+display on the colorbar, we have to use the *extend* keyword argument. To use
+*extend*, you must specify two extra boundaries. Finally, the *spacing* argument
+ensures that the intervals are shown on the colorbar proportionally.
+
+``` python
+fig, ax = plt.subplots(figsize=(6, 1))
+fig.subplots_adjust(bottom=0.5)
+
+cmap = mpl.colors.ListedColormap(['red', 'green', 'blue', 'cyan'])
+cmap.set_over('0.25')
+cmap.set_under('0.75')
+
+bounds = [1, 2, 4, 7, 8]
+norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
+cb2 = mpl.colorbar.ColorbarBase(ax, cmap=cmap,
+ norm=norm,
+ boundaries=[0] + bounds + [13],
+ extend='both',
+ ticks=bounds,
+ spacing='proportional',
+ orientation='horizontal')
+cb2.set_label('Discrete intervals, some other units')
+fig.show()
+```
+
+
+
+### Colorbar with custom extension lengths
+
+Here we illustrate the use of custom length colorbar extensions, used on a
+colorbar with discrete intervals. To make the length of each extension the
+same as the length of the interior colors, use ``extendfrac='auto'``.
+
+``` python
+fig, ax = plt.subplots(figsize=(6, 1))
+fig.subplots_adjust(bottom=0.5)
+
+cmap = mpl.colors.ListedColormap(['royalblue', 'cyan',
+ 'yellow', 'orange'])
+cmap.set_over('red')
+cmap.set_under('blue')
+
+bounds = [-1.0, -0.5, 0.0, 0.5, 1.0]
+norm = mpl.colors.BoundaryNorm(bounds, cmap.N)
+cb3 = mpl.colorbar.ColorbarBase(ax, cmap=cmap,
+ norm=norm,
+ boundaries=[-10] + bounds + [10],
+ extend='both',
+ extendfrac='auto',
+ ticks=bounds,
+ spacing='uniform',
+ orientation='horizontal')
+cb3.set_label('Custom extension lengths, some other units')
+fig.show()
+```
+
+
+
+## Download
+
+- [Download Python source code: colorbar_only.py](https://matplotlib.org/_downloads/23690f47313380b801750e3adc4c317e/colorbar_only.py)
+- [Download Jupyter notebook: colorbar_only.ipynb](https://matplotlib.org/_downloads/4d3eb6ad2b03a5eb988f576ea050f104/colorbar_only.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/colors/colormap-manipulation.md b/Python/matplotlab/colors/colormap-manipulation.md
new file mode 100644
index 00000000..4f89f884
--- /dev/null
+++ b/Python/matplotlab/colors/colormap-manipulation.md
@@ -0,0 +1,311 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Creating Colormaps in Matplotlib
+
+Matplotlib has a number of built-in colormaps accessible via
+[``matplotlib.cm.get_cmap``](https://matplotlib.org/api/cm_api.html#matplotlib.cm.get_cmap). There are also external libraries like
+[palettable](https://jiffyclub.github.io/palettable/) that have many extra colormaps.
+
+However, we often want to create or manipulate colormaps in Matplotlib.
+This can be done using the [``ListedColormap``](https://matplotlib.org/api/_as_gen/matplotlib.colors.ListedColormap.html#matplotlib.colors.ListedColormap) class and an Nx4 numpy array of
+values between 0 and 1 representing the RGBA values of the colormap. There
+is also a [``LinearSegmentedColormap``](https://matplotlib.org/api/_as_gen/matplotlib.colors.LinearSegmentedColormap.html#matplotlib.colors.LinearSegmentedColormap) class that allows colormaps to be
+specified with a few anchor points defining segments, linearly
+interpolating between the anchor points.
+
+## Getting colormaps and accessing their values
+
+First, getting a named colormap, most of which are listed in
+[Choosing Colormaps in Matplotlib](colormaps.html), requires the use of
+[``matplotlib.cm.get_cmap``](https://matplotlib.org/api/cm_api.html#matplotlib.cm.get_cmap), which returns a
+[``matplotlib.colors.ListedColormap``](https://matplotlib.org/api/_as_gen/matplotlib.colors.ListedColormap.html#matplotlib.colors.ListedColormap) object. The second argument gives
+the size of the list of colors used to define the colormap, and below we
+use a modest value of 12 so there are not a lot of values to look at.
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+from matplotlib import cm
+from matplotlib.colors import ListedColormap, LinearSegmentedColormap
+
+viridis = cm.get_cmap('viridis', 12)
+print(viridis)
+```
+
+Out:
+
+```
+<matplotlib.colors.ListedColormap object at 0x...>
+```
+
+The object ``viridis`` is a callable that, when passed a float between
+0 and 1, returns an RGBA value from the colormap:
+
+``` python
+print(viridis(0.56))
+```
+
+Out:
+
+```
+(0.119512, 0.607464, 0.540218, 1.0)
+```
+
+The list of colors that comprise the colormap can be directly accessed using
+the ``colors`` property,
+or it can be accessed indirectly by calling ``viridis`` with an array
+of values matching the length of the colormap. Note that the returned list
+is in the form of an RGBA Nx4 array, where N is the length of the colormap.
+
+``` python
+print('viridis.colors', viridis.colors)
+print('viridis(range(12))', viridis(range(12)))
+print('viridis(np.linspace(0, 1, 12))', viridis(np.linspace(0, 1, 12)))
+```
+
+Out:
+
+```
+viridis.colors [[0.267004 0.004874 0.329415 1. ]
+ [0.283072 0.130895 0.449241 1. ]
+ [0.262138 0.242286 0.520837 1. ]
+ [0.220057 0.343307 0.549413 1. ]
+ [0.177423 0.437527 0.557565 1. ]
+ [0.143343 0.522773 0.556295 1. ]
+ [0.119512 0.607464 0.540218 1. ]
+ [0.166383 0.690856 0.496502 1. ]
+ [0.319809 0.770914 0.411152 1. ]
+ [0.525776 0.833491 0.288127 1. ]
+ [0.762373 0.876424 0.137064 1. ]
+ [0.993248 0.906157 0.143936 1. ]]
+viridis(range(12)) [[0.267004 0.004874 0.329415 1. ]
+ [0.283072 0.130895 0.449241 1. ]
+ [0.262138 0.242286 0.520837 1. ]
+ [0.220057 0.343307 0.549413 1. ]
+ [0.177423 0.437527 0.557565 1. ]
+ [0.143343 0.522773 0.556295 1. ]
+ [0.119512 0.607464 0.540218 1. ]
+ [0.166383 0.690856 0.496502 1. ]
+ [0.319809 0.770914 0.411152 1. ]
+ [0.525776 0.833491 0.288127 1. ]
+ [0.762373 0.876424 0.137064 1. ]
+ [0.993248 0.906157 0.143936 1. ]]
+viridis(np.linspace(0, 1, 12)) [[0.267004 0.004874 0.329415 1. ]
+ [0.283072 0.130895 0.449241 1. ]
+ [0.262138 0.242286 0.520837 1. ]
+ [0.220057 0.343307 0.549413 1. ]
+ [0.177423 0.437527 0.557565 1. ]
+ [0.143343 0.522773 0.556295 1. ]
+ [0.119512 0.607464 0.540218 1. ]
+ [0.166383 0.690856 0.496502 1. ]
+ [0.319809 0.770914 0.411152 1. ]
+ [0.525776 0.833491 0.288127 1. ]
+ [0.762373 0.876424 0.137064 1. ]
+ [0.993248 0.906157 0.143936 1. ]]
+```
+
+The colormap is a lookup table, so "oversampling" the colormap returns
+nearest-neighbor interpolation (note the repeated colors in the list below)
+
+``` python
+print('viridis(np.linspace(0, 1, 15))', viridis(np.linspace(0, 1, 15)))
+```
+
+Out:
+
+```
+viridis(np.linspace(0, 1, 15)) [[0.267004 0.004874 0.329415 1. ]
+ [0.267004 0.004874 0.329415 1. ]
+ [0.283072 0.130895 0.449241 1. ]
+ [0.262138 0.242286 0.520837 1. ]
+ [0.220057 0.343307 0.549413 1. ]
+ [0.177423 0.437527 0.557565 1. ]
+ [0.143343 0.522773 0.556295 1. ]
+ [0.119512 0.607464 0.540218 1. ]
+ [0.119512 0.607464 0.540218 1. ]
+ [0.166383 0.690856 0.496502 1. ]
+ [0.319809 0.770914 0.411152 1. ]
+ [0.525776 0.833491 0.288127 1. ]
+ [0.762373 0.876424 0.137064 1. ]
+ [0.993248 0.906157 0.143936 1. ]
+ [0.993248 0.906157 0.143936 1. ]]
+```
+
+## Creating listed colormaps
+
+This is essentially the inverse operation of the above, where we supply an
+Nx4 numpy array with all values between 0 and 1
+to [``ListedColormap``](https://matplotlib.org/api/_as_gen/matplotlib.colors.ListedColormap.html#matplotlib.colors.ListedColormap) to make a new colormap. This means that
+any numpy operations that we can do on an Nx4 array make building
+new colormaps from existing colormaps quite straightforward.
+
+Suppose we want to make the first 25 entries of a 256-length "viridis"
+colormap pink for some reason:
+
+``` python
+viridis = cm.get_cmap('viridis', 256)
+newcolors = viridis(np.linspace(0, 1, 256))
+pink = np.array([248/256, 24/256, 148/256, 1])
+newcolors[:25, :] = pink
+newcmp = ListedColormap(newcolors)
+
+
+def plot_examples(cms):
+ """
+ helper function to plot two colormaps
+ """
+ np.random.seed(19680801)
+ data = np.random.randn(30, 30)
+
+ fig, axs = plt.subplots(1, 2, figsize=(6, 3), constrained_layout=True)
+ for [ax, cmap] in zip(axs, cms):
+ psm = ax.pcolormesh(data, cmap=cmap, rasterized=True, vmin=-4, vmax=4)
+ fig.colorbar(psm, ax=ax)
+ plt.show()
+
+plot_examples([viridis, newcmp])
+```
+
+
+
+We can easily reduce the dynamic range of a colormap; here we choose the
+middle 0.5 of the colormap. However, we need to interpolate from a larger
+colormap, otherwise the new colormap will have repeated values.
+
+``` python
+viridisBig = cm.get_cmap('viridis', 512)
+newcmp = ListedColormap(viridisBig(np.linspace(0.25, 0.75, 256)))
+plot_examples([viridis, newcmp])
+```
+
+
+
+and we can easily concatenate two colormaps:
+
+``` python
+top = cm.get_cmap('Oranges_r', 128)
+bottom = cm.get_cmap('Blues', 128)
+
+newcolors = np.vstack((top(np.linspace(0, 1, 128)),
+ bottom(np.linspace(0, 1, 128))))
+newcmp = ListedColormap(newcolors, name='OrangeBlue')
+plot_examples([viridis, newcmp])
+```
+
+
+
+Of course we need not start from a named colormap; we just need to create
+the Nx4 array to pass to [``ListedColormap``](https://matplotlib.org/api/_as_gen/matplotlib.colors.ListedColormap.html#matplotlib.colors.ListedColormap). Here we create a
+brown colormap that goes to white...
+
+``` python
+N = 256
+vals = np.ones((N, 4))
+vals[:, 0] = np.linspace(90/256, 1, N)
+vals[:, 1] = np.linspace(39/256, 1, N)
+vals[:, 2] = np.linspace(41/256, 1, N)
+newcmp = ListedColormap(vals)
+plot_examples([viridis, newcmp])
+```
+
+
+
+## Creating linear segmented colormaps
+
+The [``LinearSegmentedColormap``](https://matplotlib.org/api/_as_gen/matplotlib.colors.LinearSegmentedColormap.html#matplotlib.colors.LinearSegmentedColormap) class specifies colormaps using anchor points
+between which RGB(A) values are interpolated.
+
+The format to specify these colormaps allows discontinuities at the anchor
+points. Each anchor point is specified as a row in a matrix of the
+form ``[x[i] yleft[i] yright[i]]``, where ``x[i]`` is the anchor, and
+``yleft[i]`` and ``yright[i]`` are the values of the color on either
+side of the anchor point.
+
+If there are no discontinuities, then ``yleft[i]=yright[i]``:
+
+``` python
+cdict = {'red': [[0.0, 0.0, 0.0],
+ [0.5, 1.0, 1.0],
+ [1.0, 1.0, 1.0]],
+ 'green': [[0.0, 0.0, 0.0],
+ [0.25, 0.0, 0.0],
+ [0.75, 1.0, 1.0],
+ [1.0, 1.0, 1.0]],
+ 'blue': [[0.0, 0.0, 0.0],
+ [0.5, 0.0, 0.0],
+ [1.0, 1.0, 1.0]]}
+
+
+def plot_linearmap(cdict):
+ newcmp = LinearSegmentedColormap('testCmap', segmentdata=cdict, N=256)
+ rgba = newcmp(np.linspace(0, 1, 256))
+ fig, ax = plt.subplots(figsize=(4, 3), constrained_layout=True)
+ col = ['r', 'g', 'b']
+ for xx in [0.25, 0.5, 0.75]:
+ ax.axvline(xx, color='0.7', linestyle='--')
+ for i in range(3):
+ ax.plot(np.arange(256)/256, rgba[:, i], color=col[i])
+ ax.set_xlabel('index')
+ ax.set_ylabel('RGB')
+ plt.show()
+
+plot_linearmap(cdict)
+```
+
+
+
+In order to make a discontinuity at an anchor point, the third column is
+different from the second. The matrix for each of "red", "green", "blue",
+and optionally "alpha" is set up as:
+
+``` python
+cdict['red'] = [...
+ [x[i] yleft[i] yright[i]],
+ [x[i+1] yleft[i+1] yright[i+1]],
+ ...]
+```
+
+and for values passed to the colormap between ``x[i]`` and ``x[i+1]``,
+the interpolation is between ``yright[i]`` and ``yleft[i+1]``.
+
+In the example below there is a discontinuity in red at 0.5. The
+interpolation between 0 and 0.5 goes from 0.3 to 1, and between 0.5 and 1
+it goes from 0.9 to 1. Note that ``red[0, 1]`` and ``red[2, 2]`` are both
+superfluous to the interpolation because ``red[0, 1]`` is the value to the
+left of 0, and ``red[2, 2]`` is the value to the right of 1.0.
+
+``` python
+cdict['red'] = [[0.0, 0.0, 0.3],
+ [0.5, 1.0, 0.9],
+ [1.0, 1.0, 1.0]]
+plot_linearmap(cdict)
+```
+
+
+
+### References
+
+The use of the following functions, methods, classes and modules is shown
+in this example:
+
+``` python
+import matplotlib
+matplotlib.axes.Axes.pcolormesh
+matplotlib.figure.Figure.colorbar
+matplotlib.colors
+matplotlib.colors.LinearSegmentedColormap
+matplotlib.colors.ListedColormap
+matplotlib.cm
+matplotlib.cm.get_cmap
+```
+
+
+## Download
+
+- [Download Python source code: colormap-manipulation.py](https://matplotlib.org/_downloads/f55e73a6ac8441fd68270d3c6f2a7c7c/colormap-manipulation.py)
+- [Download Jupyter notebook: colormap-manipulation.ipynb](https://matplotlib.org/_downloads/fd9acfdbb45f341d3bb04199f0868a38/colormap-manipulation.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/colors/colormapnorms.md b/Python/matplotlab/colors/colormapnorms.md
new file mode 100644
index 00000000..eb913782
--- /dev/null
+++ b/Python/matplotlab/colors/colormapnorms.md
@@ -0,0 +1,281 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Colormap Normalization
+
+Objects that use colormaps by default linearly map the colors in the
+colormap from data values *vmin* to *vmax*. For example:
+
+``` python
+pcm = ax.pcolormesh(x, y, Z, vmin=-1., vmax=1., cmap='RdBu_r')
+```
+
+will map the data in *Z* linearly from -1 to +1, so *Z=0* will
+give a color at the center of the colormap *RdBu_r* (white in this
+case).
+
+Matplotlib does this mapping in two steps, with a normalization from
+the input data to [0, 1] occurring first, and then mapping onto the
+indices in the colormap. Normalizations are classes defined in the
+[``matplotlib.colors``](https://matplotlib.org/api/colors_api.html#module-matplotlib.colors) module. The default, linear normalization
+is [``matplotlib.colors.Normalize()``](https://matplotlib.org/api/_as_gen/matplotlib.colors.Normalize.html#matplotlib.colors.Normalize).
+
+Artists that map data to color pass the arguments *vmin* and *vmax* to
+construct a [``matplotlib.colors.Normalize()``](https://matplotlib.org/api/_as_gen/matplotlib.colors.Normalize.html#matplotlib.colors.Normalize) instance, then call it:
+
+``` python
+In [1]: import matplotlib as mpl
+
+In [2]: norm = mpl.colors.Normalize(vmin=-1.,vmax=1.)
+
+In [3]: norm(0.)
+Out[3]: 0.5
+```
+
+However, there are sometimes cases where it is useful to map data to
+colormaps in a non-linear fashion.
+
+## Logarithmic
+
+One of the most common transformations is to plot data by taking its logarithm
+(to base 10). This transformation is useful to display changes across
+disparate scales. Using [``colors.LogNorm``](https://matplotlib.org/api/_as_gen/matplotlib.colors.LogNorm.html#matplotlib.colors.LogNorm) normalizes the data via
+\(\log_{10}\). In the example below, there are two bumps, one much smaller
+than the other. Using [``colors.LogNorm``](https://matplotlib.org/api/_as_gen/matplotlib.colors.LogNorm.html#matplotlib.colors.LogNorm), the shape and location of each bump
+can clearly be seen:
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+import matplotlib.colors as colors
+import matplotlib.cbook as cbook
+
+N = 100
+X, Y = np.mgrid[-3:3:complex(0, N), -2:2:complex(0, N)]
+
+# A low hump with a spike coming out of the top right. Needs to have
+# z/colour axis on a log scale so we see both hump and spike. linear
+# scale only shows the spike.
+Z1 = np.exp(-(X)**2 - (Y)**2)
+Z2 = np.exp(-(X * 10)**2 - (Y * 10)**2)
+Z = Z1 + 50 * Z2
+
+fig, ax = plt.subplots(2, 1)
+
+pcm = ax[0].pcolor(X, Y, Z,
+ norm=colors.LogNorm(vmin=Z.min(), vmax=Z.max()),
+ cmap='PuBu_r')
+fig.colorbar(pcm, ax=ax[0], extend='max')
+
+pcm = ax[1].pcolor(X, Y, Z, cmap='PuBu_r')
+fig.colorbar(pcm, ax=ax[1], extend='max')
+plt.show()
+```
+
+
+
+## Symmetric logarithmic
+
+Similarly, it sometimes happens that there is data that is positive
+and negative, but we would still like a logarithmic scaling applied to
+both. In this case, the negative numbers are also scaled
+logarithmically, and mapped to smaller numbers; e.g., if ``vmin=-vmax``,
+then the negative numbers are mapped from 0 to 0.5 and the
+positive from 0.5 to 1.
+
+Since the logarithm of values close to zero tends toward infinity, a
+small range around zero needs to be mapped linearly. The parameter
+*linthresh* allows the user to specify the size of this range
+(-*linthresh*, *linthresh*). The size of this range in the colormap is
+set by *linscale*. When *linscale* == 1.0 (the default), the space used
+for the positive and negative halves of the linear range will be equal
+to one decade in the logarithmic range.
+
+``` python
+N = 100
+X, Y = np.mgrid[-3:3:complex(0, N), -2:2:complex(0, N)]
+Z1 = np.exp(-X**2 - Y**2)
+Z2 = np.exp(-(X - 1)**2 - (Y - 1)**2)
+Z = (Z1 - Z2) * 2
+
+fig, ax = plt.subplots(2, 1)
+
+pcm = ax[0].pcolormesh(X, Y, Z,
+ norm=colors.SymLogNorm(linthresh=0.03, linscale=0.03,
+ vmin=-1.0, vmax=1.0),
+ cmap='RdBu_r')
+fig.colorbar(pcm, ax=ax[0], extend='both')
+
+pcm = ax[1].pcolormesh(X, Y, Z, cmap='RdBu_r', vmin=-np.max(Z))
+fig.colorbar(pcm, ax=ax[1], extend='both')
+plt.show()
+```
+
+
+
+## Power-law
+
+Sometimes it is useful to remap the colors onto a power-law
+relationship (i.e. \(y=x^{\gamma}\), where \(\gamma\) is the
+power). For this we use ``colors.PowerNorm``. It takes as an
+argument *gamma* (*gamma* == 1.0 will just yield the default linear
+normalization):
+
+::: tip Note
+
+There should probably be a good reason for plotting the data using
+this type of transformation. Technical viewers are used to linear
+and logarithmic axes and data transformations. Power laws are less
+common, and viewers should explicitly be made aware that they have
+been used.
+
+:::
+
+``` python
+N = 100
+X, Y = np.mgrid[0:3:complex(0, N), 0:2:complex(0, N)]
+Z1 = (1 + np.sin(Y * 10.)) * X**(2.)
+
+fig, ax = plt.subplots(2, 1)
+
+pcm = ax[0].pcolormesh(X, Y, Z1, norm=colors.PowerNorm(gamma=0.5),
+ cmap='PuBu_r')
+fig.colorbar(pcm, ax=ax[0], extend='max')
+
+pcm = ax[1].pcolormesh(X, Y, Z1, cmap='PuBu_r')
+fig.colorbar(pcm, ax=ax[1], extend='max')
+plt.show()
+```
+
+
+
+## Discrete bounds
+
+Another normalization that comes with Matplotlib is
+``colors.BoundaryNorm()``. In addition to *vmin* and *vmax*, this
+takes as arguments the boundaries between which data is to be mapped. The
+colors are then linearly distributed between these "bounds". For
+instance:
+
+``` python
+In [4]: import matplotlib.colors as colors
+
+In [5]: bounds = np.array([-0.25, -0.125, 0, 0.5, 1])
+
+In [6]: norm = colors.BoundaryNorm(boundaries=bounds, ncolors=4)
+
+In [7]: print(norm([-0.2,-0.15,-0.02, 0.3, 0.8, 0.99]))
+[0 0 1 2 3 3]
+```
+
+Note that unlike the other norms, this norm returns values from 0 to *ncolors*-1.
+
+``` python
+N = 100
+X, Y = np.mgrid[-3:3:complex(0, N), -2:2:complex(0, N)]
+Z1 = np.exp(-X**2 - Y**2)
+Z2 = np.exp(-(X - 1)**2 - (Y - 1)**2)
+Z = (Z1 - Z2) * 2
+
+fig, ax = plt.subplots(3, 1, figsize=(8, 8))
+ax = ax.flatten()
+# even bounds gives a contour-like effect
+bounds = np.linspace(-1, 1, 10)
+norm = colors.BoundaryNorm(boundaries=bounds, ncolors=256)
+pcm = ax[0].pcolormesh(X, Y, Z,
+ norm=norm,
+ cmap='RdBu_r')
+fig.colorbar(pcm, ax=ax[0], extend='both', orientation='vertical')
+
+# uneven bounds changes the colormapping:
+bounds = np.array([-0.25, -0.125, 0, 0.5, 1])
+norm = colors.BoundaryNorm(boundaries=bounds, ncolors=256)
+pcm = ax[1].pcolormesh(X, Y, Z, norm=norm, cmap='RdBu_r')
+fig.colorbar(pcm, ax=ax[1], extend='both', orientation='vertical')
+
+pcm = ax[2].pcolormesh(X, Y, Z, cmap='RdBu_r', vmin=-np.max(Z))
+fig.colorbar(pcm, ax=ax[2], extend='both', orientation='vertical')
+plt.show()
+```
+
+
+
+## DivergingNorm: Different mapping on either side of a center
+
+Sometimes we want to have a different colormap on either side of a
+conceptual center point, and we want those two colormaps to have
+different linear scales. An example is a topographic map where the land
+and ocean have a center at zero, but land typically has a greater
+elevation range than the water has depth range, and they are often
+represented by a different colormap.
+
+``` python
+filename = cbook.get_sample_data('topobathy.npz', asfileobj=False)
+with np.load(filename) as dem:
+ topo = dem['topo']
+ longitude = dem['longitude']
+ latitude = dem['latitude']
+
+fig, ax = plt.subplots()
+# make a colormap that has land and ocean clearly delineated and of the
+# same length (256 + 256)
+colors_undersea = plt.cm.terrain(np.linspace(0, 0.17, 256))
+colors_land = plt.cm.terrain(np.linspace(0.25, 1, 256))
+all_colors = np.vstack((colors_undersea, colors_land))
+terrain_map = colors.LinearSegmentedColormap.from_list('terrain_map',
+ all_colors)
+
+# make the norm: Note the center is offset so that the land has more
+# dynamic range:
+divnorm = colors.DivergingNorm(vmin=-500., vcenter=0, vmax=4000)
+
+pcm = ax.pcolormesh(longitude, latitude, topo, rasterized=True, norm=divnorm,
+ cmap=terrain_map,)
+# Simple geographic plot, set aspect ratio because distance between lines of
+# longitude depends on latitude.
+ax.set_aspect(1 / np.cos(np.deg2rad(49)))
+fig.colorbar(pcm, shrink=0.6)
+plt.show()
+```
+
+
+
+## Custom normalization: Manually implement two linear ranges
+
+The [``DivergingNorm``](https://matplotlib.orgapi/_as_gen/matplotlib.colors.DivergingNorm.html#matplotlib.colors.DivergingNorm) described above makes a useful example for
+defining your own norm.
+
+``` python
+class MidpointNormalize(colors.Normalize):
+ def __init__(self, vmin=None, vmax=None, vcenter=None, clip=False):
+ self.vcenter = vcenter
+ colors.Normalize.__init__(self, vmin, vmax, clip)
+
+ def __call__(self, value, clip=None):
+ # I'm ignoring masked values and all kinds of edge cases to make a
+ # simple example...
+ x, y = [self.vmin, self.vcenter, self.vmax], [0, 0.5, 1]
+ return np.ma.masked_array(np.interp(value, x, y))
+
+
+fig, ax = plt.subplots()
+midnorm = MidpointNormalize(vmin=-500., vcenter=0, vmax=4000)
+
+pcm = ax.pcolormesh(longitude, latitude, topo, rasterized=True, norm=midnorm,
+ cmap=terrain_map)
+ax.set_aspect(1 / np.cos(np.deg2rad(49)))
+fig.colorbar(pcm, shrink=0.6, extend='both')
+plt.show()
+```
+
+
+
+**Total running time of the script:** ( 0 minutes 1.895 seconds)
+
+## Download
+
+- [Download Python source code: colormapnorms.py](https://matplotlib.org/_downloads/56fa91958fd427757e621c21de870bda/colormapnorms.py)
+- [Download Jupyter notebook: colormapnorms.ipynb](https://matplotlib.org/_downloads/59a7c8f3db252ae16cd43fd50d6a004c/colormapnorms.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/colors/colormaps.md b/Python/matplotlab/colors/colormaps.md
new file mode 100644
index 00000000..fdbe5e36
--- /dev/null
+++ b/Python/matplotlab/colors/colormaps.md
@@ -0,0 +1,521 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Choosing Colormaps in Matplotlib
+
+Matplotlib has a number of built-in colormaps accessible via
+[``matplotlib.cm.get_cmap``](https://matplotlib.orgapi/cm_api.html#matplotlib.cm.get_cmap). There are also external libraries like
+[[palettable]](#palettable) and [[colorcet]](#colorcet) that have many extra colormaps.
+Here we briefly discuss how to choose between the many options. For
+help on creating your own colormaps, see
+[Creating Colormaps in Matplotlib](colormap-manipulation.html).
+
+## Overview
+
+The idea behind choosing a good colormap is to find a good representation in 3D
+colorspace for your data set. The best colormap for any given data set depends
+on many things including:
+
+- Whether representing form or metric data ([[Ware]](#ware))
+- Your knowledge of the data set (*e.g.*, is there a critical value
+from which the other values deviate?)
+- If there is an intuitive color scheme for the parameter you are plotting
+- If there is a standard in the field the audience may be expecting
+
+For many applications, a perceptually uniform colormap is the best
+choice --- one in which equal steps in data are perceived as equal
+steps in the color space. Researchers have found that the human brain
+perceives changes in the lightness parameter as changes in the data
+much better than, for example, changes in hue. Therefore, colormaps
+which have monotonically increasing lightness through the colormap
+will be better interpreted by the viewer. A wonderful example of
+perceptually uniform colormaps is [[colorcet]](#colorcet).
+
+Color can be represented in 3D space in various ways. One way to represent color
+is using CIELAB. In CIELAB, color space is represented by lightness,
+\(L^*\); red-green, \(a^*\); and yellow-blue, \(b^*\). The lightness
+parameter \(L^*\) can then be used to learn more about how the matplotlib
+colormaps will be perceived by viewers.
+
+An excellent starting resource for learning about human perception of colormaps
+is from [[IBM]](#ibm).
+
+## Classes of colormaps
+
+Colormaps are often split into several categories based on their function (see,
+*e.g.*, [[Moreland]](#moreland)):
+
+1. Sequential: change in lightness and often saturation of color
+incrementally, often using a single hue; should be used for
+representing information that has ordering.
+1. Diverging: change in lightness and possibly saturation of two
+different colors that meet in the middle at an unsaturated color;
+should be used when the information being plotted has a critical
+middle value, such as topography or when the data deviates around
+zero.
+1. Cyclic: change in lightness of two different colors that meet in
+the middle and beginning/end at an unsaturated color; should be
+used for values that wrap around at the endpoints, such as phase
+angle, wind direction, or time of day.
+1. Qualitative: often are miscellaneous colors; should be used to
+represent information which does not have ordering or
+relationships.
+
+``` python
+# sphinx_gallery_thumbnail_number = 2
+
+import numpy as np
+import matplotlib as mpl
+import matplotlib.pyplot as plt
+from matplotlib import cm
+from colorspacious import cspace_converter
+from collections import OrderedDict
+
+cmaps = OrderedDict()
+```
+
+### Sequential
+
+For the Sequential plots, the lightness value increases monotonically through
+the colormaps. This is good. Some of the \(L^*\) values in the colormaps
+span from 0 to 100 (binary and the other grayscale), and others start around
+\(L^*=20\). Those that have a smaller range of \(L^*\) will accordingly
+have a smaller perceptual range. Note also that the \(L^*\) function varies
+amongst the colormaps: some are approximately linear in \(L^*\) and others
+are more curved.
+
+``` python
+cmaps['Perceptually Uniform Sequential'] = [
+ 'viridis', 'plasma', 'inferno', 'magma', 'cividis']
+
+cmaps['Sequential'] = [
+ 'Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds',
+ 'YlOrBr', 'YlOrRd', 'OrRd', 'PuRd', 'RdPu', 'BuPu',
+ 'GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn']
+```
+
+### Sequential2
+
+Many of the \(L^*\) values from the Sequential2 plots are monotonically
+increasing, but some (autumn, cool, spring, and winter) plateau or even go both
+up and down in \(L^*\) space. Others (afmhot, copper, gist_heat, and hot)
+have kinks in the \(L^*\) functions. Data that is being represented in a
+region of the colormap that is at a plateau or kink will lead to a perception of
+banding of the data in those values in the colormap (see [[mycarta-banding]](#mycarta-banding) for
+an excellent example of this).
+
+``` python
+cmaps['Sequential (2)'] = [
+ 'binary', 'gist_yarg', 'gist_gray', 'gray', 'bone', 'pink',
+ 'spring', 'summer', 'autumn', 'winter', 'cool', 'Wistia',
+ 'hot', 'afmhot', 'gist_heat', 'copper']
+```
+
+### Diverging
+
+For the Diverging maps, we want to have monotonically increasing \(L^*\)
+values up to a maximum, which should be close to \(L^*=100\), followed by
+monotonically decreasing \(L^*\) values. We are looking for approximately
+equal minimum \(L^*\) values at opposite ends of the colormap. By these
+measures, BrBG and RdBu are good options. coolwarm is a good option, but it
+doesn't span a wide range of \(L^*\) values (see grayscale section below).
+
+``` python
+cmaps['Diverging'] = [
+ 'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu',
+ 'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic']
+```
+
+### Cyclic
+
+For Cyclic maps, we want to start and end on the same color, and meet a
+symmetric center point in the middle. \(L^*\) should change monotonically
+from start to middle, and inversely from middle to end. It should be symmetric
+on the increasing and decreasing side, and only differ in hue. At the ends and
+middle, \(L^*\) will reverse direction, which should be smoothed in
+\(L^*\) space to reduce artifacts. See [[kovesi-colormaps]](#kovesi-colormaps) for more
+information on the design of cyclic maps.
+
+The often-used HSV colormap is included in this set of colormaps, although it
+is not symmetric to a center point. Additionally, the \(L^*\) values vary
+widely throughout the colormap, making it a poor choice for representing data
+for viewers to see perceptually. See an extension on this idea at
+[[mycarta-jet]](#mycarta-jet).
+
+``` python
+cmaps['Cyclic'] = ['twilight', 'twilight_shifted', 'hsv']
+```
+
+### Qualitative
+
+Qualitative colormaps are not aimed at being perceptual maps, but looking at the
+lightness parameter can verify that for us. The \(L^*\) values move all over
+the place throughout the colormap, and are clearly not monotonically increasing.
+These would not be good options for use as perceptual colormaps.
+
+``` python
+cmaps['Qualitative'] = ['Pastel1', 'Pastel2', 'Paired', 'Accent',
+ 'Dark2', 'Set1', 'Set2', 'Set3',
+ 'tab10', 'tab20', 'tab20b', 'tab20c']
+```
+
+### Miscellaneous
+
+Some of the miscellaneous colormaps have particular uses for which
+they have been created. For example, gist_earth, ocean, and terrain
+all seem to be created for plotting topography (green/brown) and water
+depths (blue) together. We would expect to see a divergence in these
+colormaps, then, but multiple kinks may not be ideal, such as in
+gist_earth and terrain. CMRmap was created to convert well to
+grayscale, though it does appear to have some small kinks in
+\(L^*\). cubehelix was created to vary smoothly in both lightness
+and hue, but appears to have a small hump in the green hue area.
+
+The often-used jet colormap is included in this set of colormaps. We can see
+that the \(L^*\) values vary widely throughout the colormap, making it a
+poor choice for representing data for viewers to see perceptually. See an
+extension on this idea at [[mycarta-jet]](#mycarta-jet).
+
+``` python
+cmaps['Miscellaneous'] = [
+ 'flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern',
+ 'gnuplot', 'gnuplot2', 'CMRmap', 'cubehelix', 'brg',
+ 'gist_rainbow', 'rainbow', 'jet', 'nipy_spectral', 'gist_ncar']
+```
+
+First, we'll show the range of each colormap. Note that some seem
+to change more "quickly" than others.
+
+``` python
+nrows = max(len(cmap_list) for cmap_category, cmap_list in cmaps.items())
+gradient = np.linspace(0, 1, 256)
+gradient = np.vstack((gradient, gradient))
+
+
+def plot_color_gradients(cmap_category, cmap_list, nrows):
+ fig, axes = plt.subplots(nrows=nrows)
+ fig.subplots_adjust(top=0.95, bottom=0.01, left=0.2, right=0.99)
+ axes[0].set_title(cmap_category + ' colormaps', fontsize=14)
+
+ for ax, name in zip(axes, cmap_list):
+ ax.imshow(gradient, aspect='auto', cmap=plt.get_cmap(name))
+ pos = list(ax.get_position().bounds)
+ x_text = pos[0] - 0.01
+ y_text = pos[1] + pos[3]/2.
+ fig.text(x_text, y_text, name, va='center', ha='right', fontsize=10)
+
+ # Turn off *all* ticks & spines, not just the ones with colormaps.
+ for ax in axes:
+ ax.set_axis_off()
+
+
+for cmap_category, cmap_list in cmaps.items():
+ plot_color_gradients(cmap_category, cmap_list, nrows)
+
+plt.show()
+```
+
+- 
+- 
+- 
+- 
+- 
+- 
+- 
+
+## Lightness of matplotlib colormaps
+
+Here we examine the lightness values of the matplotlib colormaps.
+Note that some documentation on the colormaps is available
+([[list-colormaps]](#list-colormaps)).
+
+``` python
+mpl.rcParams.update({'font.size': 12})
+
+# Number of colormap per subplot for particular cmap categories
+_DSUBS = {'Perceptually Uniform Sequential': 5, 'Sequential': 6,
+ 'Sequential (2)': 6, 'Diverging': 6, 'Cyclic': 3,
+ 'Qualitative': 4, 'Miscellaneous': 6}
+
+# Spacing between the colormaps of a subplot
+_DC = {'Perceptually Uniform Sequential': 1.4, 'Sequential': 0.7,
+ 'Sequential (2)': 1.4, 'Diverging': 1.4, 'Cyclic': 1.4,
+ 'Qualitative': 1.4, 'Miscellaneous': 1.4}
+
+# Indices to step through colormap
+x = np.linspace(0.0, 1.0, 100)
+
+# Do plot
+for cmap_category, cmap_list in cmaps.items():
+
+ # Do subplots so that colormaps have enough space.
+ # Default is 6 colormaps per subplot.
+ dsub = _DSUBS.get(cmap_category, 6)
+ nsubplots = int(np.ceil(len(cmap_list) / dsub))
+
+ # squeeze=False to handle similarly the case of a single subplot
+ fig, axes = plt.subplots(nrows=nsubplots, squeeze=False,
+ figsize=(7, 2.6*nsubplots))
+
+ for i, ax in enumerate(axes.flat):
+
+ locs = [] # locations for text labels
+
+ for j, cmap in enumerate(cmap_list[i*dsub:(i+1)*dsub]):
+
+ # Get RGB values for colormap and convert the colormap in
+ # CAM02-UCS colorspace. lab[0, :, 0] is the lightness.
+ rgb = cm.get_cmap(cmap)(x)[np.newaxis, :, :3]
+ lab = cspace_converter("sRGB1", "CAM02-UCS")(rgb)
+
+ # Plot colormap L values. Do separately for each category
+ # so each plot can be pretty. To make scatter markers change
+ # color along plot:
+ # http://stackoverflow.com/questions/8202605/
+
+ if cmap_category == 'Sequential':
+ # These colormaps all start at high lightness but we want them
+ # reversed to look nice in the plot, so reverse the order.
+ y_ = lab[0, ::-1, 0]
+ c_ = x[::-1]
+ else:
+ y_ = lab[0, :, 0]
+ c_ = x
+
+ dc = _DC.get(cmap_category, 1.4) # cmaps horizontal spacing
+ ax.scatter(x + j*dc, y_, c=c_, cmap=cmap, s=300, linewidths=0.0)
+
+ # Store locations for colormap labels
+ if cmap_category in ('Perceptually Uniform Sequential',
+ 'Sequential'):
+ locs.append(x[-1] + j*dc)
+ elif cmap_category in ('Diverging', 'Qualitative', 'Cyclic',
+ 'Miscellaneous', 'Sequential (2)'):
+ locs.append(x[int(x.size/2.)] + j*dc)
+
+ # Set up the axis limits:
+ # * the 1st subplot is used as a reference for the x-axis limits
+ # * lightness values goes from 0 to 100 (y-axis limits)
+ ax.set_xlim(axes[0, 0].get_xlim())
+ ax.set_ylim(0.0, 100.0)
+
+ # Set up labels for colormaps
+ ax.xaxis.set_ticks_position('top')
+ ticker = mpl.ticker.FixedLocator(locs)
+ ax.xaxis.set_major_locator(ticker)
+ formatter = mpl.ticker.FixedFormatter(cmap_list[i*dsub:(i+1)*dsub])
+ ax.xaxis.set_major_formatter(formatter)
+ ax.xaxis.set_tick_params(rotation=50)
+
+ ax.set_xlabel(cmap_category + ' colormaps', fontsize=14)
+ fig.text(0.0, 0.55, 'Lightness $L^*$', fontsize=12,
+ transform=fig.transFigure, rotation=90)
+
+ fig.tight_layout(h_pad=0.0, pad=1.5)
+ plt.show()
+```
+
+- 
+- 
+- 
+- 
+- 
+- 
+- 
+
+## Grayscale conversion
+
+It is important to pay attention to conversion to grayscale for color
+plots, since they may be printed on black and white printers. If not
+carefully considered, your readers may end up with indecipherable
+plots because the grayscale changes unpredictably through the
+colormap.
+
+Conversion to grayscale is done in many different ways [[bw]](#bw). Some of the
+better ones use a linear combination of the rgb values of a pixel, but
+weighted according to how we perceive color intensity. A nonlinear method of
+conversion to grayscale is to use the \(L^*\) values of the pixels. In
+general, similar principles apply for this question as they do for presenting
+one's information perceptually; that is, if a colormap is chosen that is
+monotonically increasing in \(L^*\) values, it will print in a reasonable
+manner to grayscale.
+
+With this in mind, we see that the Sequential colormaps have reasonable
+representations in grayscale. Some of the Sequential2 colormaps have decent
+enough grayscale representations, though some (autumn, spring, summer,
+winter) have very little grayscale change. If a colormap like this was used
+in a plot and then the plot was printed to grayscale, a lot of the
+information may map to the same gray values. The Diverging colormaps mostly
+vary from darker gray on the outer edges to white in the middle. Some
+(PuOr and seismic) have noticeably darker gray on one side than the other
+and therefore are not very symmetric. coolwarm has little range of gray scale
+and would print to a more uniform plot, losing a lot of detail. Note that
+overlaid, labeled contours could help differentiate between one side of the
+colormap vs. the other since color cannot be used once a plot is printed to
+grayscale. Many of the Qualitative and Miscellaneous colormaps, such as
+Accent, hsv, and jet, change from darker to lighter and back to darker gray
+throughout the colormap. This would make it impossible for a viewer to
+interpret the information in a plot once it is printed in grayscale.
+
+``` python
+mpl.rcParams.update({'font.size': 14})
+
+# Indices to step through colormap.
+x = np.linspace(0.0, 1.0, 100)
+
+gradient = np.linspace(0, 1, 256)
+gradient = np.vstack((gradient, gradient))
+
+
+def plot_color_gradients(cmap_category, cmap_list):
+ fig, axes = plt.subplots(nrows=len(cmap_list), ncols=2)
+ fig.subplots_adjust(top=0.95, bottom=0.01, left=0.2, right=0.99,
+ wspace=0.05)
+ fig.suptitle(cmap_category + ' colormaps', fontsize=14, y=1.0, x=0.6)
+
+ for ax, name in zip(axes, cmap_list):
+
+ # Get RGB values for colormap.
+ rgb = cm.get_cmap(plt.get_cmap(name))(x)[np.newaxis, :, :3]
+
+ # Get colormap in CAM02-UCS colorspace. We want the lightness.
+ lab = cspace_converter("sRGB1", "CAM02-UCS")(rgb)
+ L = lab[0, :, 0]
+ L = np.float32(np.vstack((L, L, L)))
+
+ ax[0].imshow(gradient, aspect='auto', cmap=plt.get_cmap(name))
+ ax[1].imshow(L, aspect='auto', cmap='binary_r', vmin=0., vmax=100.)
+ pos = list(ax[0].get_position().bounds)
+ x_text = pos[0] - 0.01
+ y_text = pos[1] + pos[3]/2.
+ fig.text(x_text, y_text, name, va='center', ha='right', fontsize=10)
+
+ # Turn off *all* ticks & spines, not just the ones with colormaps.
+ for ax in axes.flat:
+ ax.set_axis_off()
+
+ plt.show()
+
+
+for cmap_category, cmap_list in cmaps.items():
+
+ plot_color_gradients(cmap_category, cmap_list)
+```
+
+- 
+- 
+- 
+- 
+- 
+- 
+- 
+
+## Color vision deficiencies
+
+There is a lot of information available about color blindness (*e.g.*,
+[[colorblindness]](#colorblindness)). Additionally, there are tools available to convert images
+to how they look for different types of color vision deficiencies.
+
+The most common form of color vision deficiency involves differentiating
+between red and green. Thus, avoiding colormaps with both red and green will
+avoid many problems in general.
+
+## References
+
+- [colorcet] https://colorcet.pyviz.org
+- [Ware] http://ccom.unh.edu/sites/default/files/publications/Ware_1988_CGA_Color_sequences_univariate_maps.pdf
+- [Moreland] http://www.kennethmoreland.com/color-maps/ColorMapsExpanded.pdf
+- [list-colormaps] https://gist.github.com/endolith/2719900
+- [mycarta-banding] https://mycarta.wordpress.com/2012/10/14/the-rainbow-is-deadlong-live-the-rainbow-part-4-cie-lab-heated-body/
+- [mycarta-jet] https://mycarta.wordpress.com/2012/10/06/the-rainbow-is-deadlong-live-the-rainbow-part-3/
+- [kovesi-colormaps] https://arxiv.org/abs/1509.03700
+- [bw] http://www.tannerhelland.com/3643/grayscale-image-algorithm-vb6/
+- [colorblindness] http://www.color-blindness.com/
+- [IBM] https://doi.org/10.1109/VISUAL.1995.480803
+- [palettable] https://jiffyclub.github.io/palettable/
+
+**Total running time of the script:** ( 0 minutes 9.320 seconds)
+
+## Download
+
+- [Download Python source code: colormaps.py](https://matplotlib.org/_downloads/9df0748eeda573fbccab51a7272f7d81/colormaps.py)
+- [Download Jupyter notebook: colormaps.ipynb](https://matplotlib.org/_downloads/6024d841c77bf197ffe5612254186669/colormaps.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/colors/colors.md b/Python/matplotlab/colors/colors.md
new file mode 100644
index 00000000..8ef8cff0
--- /dev/null
+++ b/Python/matplotlab/colors/colors.md
@@ -0,0 +1,135 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Specifying Colors
+
+Matplotlib recognizes the following formats to specify a color:
+
+- an RGB or RGBA (red, green, blue, alpha) tuple of float values in ``[0, 1]``
+(e.g., ``(0.1, 0.2, 0.5)`` or ``(0.1, 0.2, 0.5, 0.3)``);
+- a hex RGB or RGBA string (e.g., ``'#0f0f0f'`` or ``'#0f0f0f80'``;
+case-insensitive);
+- a string representation of a float value in ``[0, 1]`` inclusive for gray
+level (e.g., ``'0.5'``);
+- one of ``{'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'}``;
+- a X11/CSS4 color name (case-insensitive);
+- a name from the [xkcd color survey](https://xkcd.com/color/rgb/), prefixed with ``'xkcd:'`` (e.g.,
+``'xkcd:sky blue'``; case insensitive);
+- one of the Tableau Colors from the 'T10' categorical palette (the default
+color cycle): ``{'tab:blue', 'tab:orange', 'tab:green', 'tab:red',
+'tab:purple', 'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan'}``
+(case-insensitive);
+- a "CN" color spec, i.e. ``'C'`` followed by a number, which is an index into
+the default property cycle (``matplotlib.rcParams['axes.prop_cycle']``); the
+indexing is intended to occur at rendering time, and defaults to black if the
+cycle does not include color.
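+
+As a quick illustration (a minimal sketch; the data and the specific colors
+are arbitrary), the same ``plot`` call accepts any of these formats:
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+x = np.linspace(0, 1, 50)
+fig, ax = plt.subplots()
+ax.plot(x, x,     color=(0.1, 0.2, 0.5))   # RGB tuple
+ax.plot(x, x + 1, color='#0f0f0f80')       # hex RGBA string
+ax.plot(x, x + 2, color='0.5')             # gray level given as a string
+ax.plot(x, x + 3, color='g')               # single-letter shorthand
+ax.plot(x, x + 4, color='xkcd:sky blue')   # xkcd color survey name
+ax.plot(x, x + 5, color='tab:orange')      # Tableau 'T10' palette color
+ax.plot(x, x + 6, color='C2')              # third color of the property cycle
+plt.show()
+```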
+
+"Red", "Green", and "Blue" are the intensities of those colors, the combination
+of which span the colorspace.
+
+How "Alpha" behaves depends on the ``zorder`` of the Artist. Higher
+``zorder`` Artists are drawn on top of lower Artists, and "Alpha" determines
+whether the lower artist is covered by the higher.
+If the old RGB of a pixel is ``RGBold`` and the RGB of the
+pixel of the Artist being added is ``RGBnew`` with Alpha ``alpha``,
+then the RGB of the pixel is updated to:
+``RGB = RGBold * (1 - alpha) + RGBnew * alpha``. An alpha
+of 1 means the old color is completely covered by the new Artist; an alpha of 0
+means that pixel of the Artist is transparent.
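+
+As a minimal sketch of that blending rule (the particular RGB values and alpha
+are arbitrary, chosen only for illustration):
+
+``` python
+import numpy as np
+
+rgb_old = np.array([1.0, 1.0, 1.0])  # existing pixel: white background
+rgb_new = np.array([0.1, 0.2, 0.5])  # color of the Artist being drawn
+alpha = 0.3                          # the new Artist's alpha
+
+# RGB = RGBold * (1 - alpha) + RGBnew * alpha
+rgb = rgb_old * (1 - alpha) + rgb_new * alpha
+print(rgb)  # [0.73 0.76 0.85] -- a pale, partially transparent blue
+```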
+
+For more information on colors in matplotlib see
+
+- the [Color Demo](https://matplotlib.orggallery/color/color_demo.html) example;
+- the [``matplotlib.colors``](https://matplotlib.orgapi/colors_api.html#module-matplotlib.colors) API;
+- the [List of named colors](https://matplotlib.orggallery/color/named_colors.html) example.
+
+## "CN" color selection
+
+"CN" colors are converted to RGBA as soon as the artist is created. For
+example,
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+import matplotlib as mpl
+
+th = np.linspace(0, 2*np.pi, 128)
+
+
+def demo(sty):
+ mpl.style.use(sty)
+ fig, ax = plt.subplots(figsize=(3, 3))
+
+ ax.set_title('style: {!r}'.format(sty), color='C0')
+
+ ax.plot(th, np.cos(th), 'C1', label='C1')
+ ax.plot(th, np.sin(th), 'C2', label='C2')
+ ax.legend()
+
+demo('default')
+demo('seaborn')
+```
+
+- 
+- 
+
+will use the first color for the title and then plot using the second
+and third colors of each style's ``mpl.rcParams['axes.prop_cycle']``.
+
+## xkcd v X11/CSS4
+
+The xkcd colors are derived from a user survey conducted by the
+webcomic xkcd. [Details of the survey are available on the xkcd blog](https://blog.xkcd.com/2010/05/03/color-survey-results/).
+
+Out of 148 colors in the CSS color list, there are 95 name collisions
+between the X11/CSS4 names and the xkcd names, all but 3 of which have
+different hex values. For example ``'blue'`` maps to ``'#0000FF'``
+whereas ``'xkcd:blue'`` maps to ``'#0343DF'``. Due to these name
+collisions all of the xkcd colors have ``'xkcd:'`` prefixed. As noted in
+the blog post, while it might be interesting to re-define the X11/CSS4 names
+based on such a survey, we do not do so unilaterally.
+
+The name collisions are shown in the table below; the color names
+where the hex values agree are shown in bold.
+
+``` python
+import matplotlib._color_data as mcd
+import matplotlib.patches as mpatch
+
+overlap = {name for name in mcd.CSS4_COLORS
+ if "xkcd:" + name in mcd.XKCD_COLORS}
+
+fig = plt.figure(figsize=[4.8, 16])
+ax = fig.add_axes([0, 0, 1, 1])
+
+for j, n in enumerate(sorted(overlap, reverse=True)):
+ weight = None
+ cn = mcd.CSS4_COLORS[n]
+ xkcd = mcd.XKCD_COLORS["xkcd:" + n].upper()
+ if cn == xkcd:
+ weight = 'bold'
+
+ r1 = mpatch.Rectangle((0, j), 1, 1, color=cn)
+ r2 = mpatch.Rectangle((1, j), 1, 1, color=xkcd)
+ txt = ax.text(2, j+.5, ' ' + n, va='center', fontsize=10,
+ weight=weight)
+ ax.add_patch(r1)
+ ax.add_patch(r2)
+ ax.axhline(j, color='k')
+
+ax.text(.5, j + 1.5, 'X11', ha='center', va='center')
+ax.text(1.5, j + 1.5, 'xkcd', ha='center', va='center')
+ax.set_xlim(0, 3)
+ax.set_ylim(0, j + 2)
+ax.axis('off')
+```
+
+
+
+## Download
+
+- [Download Python source code: colors.py](https://matplotlib.org/_downloads/8fb6dfde0db5f6422a7627d0d4e328b2/colors.py)
+- [Download Jupyter notebook: colors.ipynb](https://matplotlib.org/_downloads/04907c28d4180c02e547778b9aaee05d/colors.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/gallery/README.md b/Python/matplotlab/gallery/README.md
new file mode 100644
index 00000000..73e0cb48
--- /dev/null
+++ b/Python/matplotlab/gallery/README.md
@@ -0,0 +1,3911 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Gallery
+
+This gallery contains examples of the many things you can do with
+Matplotlib. Click on any image to see the full image and source code.
+
+For longer tutorials, see our [tutorials page](https://matplotlib.org/tutorials/index.html).
+You can also find [external resources](https://matplotlib.org/resources/index.html) and
+a [FAQ](https://matplotlib.org/faq/index.html) in our [user guide](https://matplotlib.org/contents.html).
+
+
+## Lines, bars and markers
+
+
+
+## Color
+For more in-depth information about the colormaps available in matplotlib
+as well as a description of their properties,
+see the colormaps tutorial.
+
+
+## Event handling
+Matplotlib supports event handling with a GUI
+neutral event model, so you can connect to Matplotlib events without knowledge
+of what user interface Matplotlib will ultimately be plugged in to. This has
+two advantages: the code you write will be more portable, and Matplotlib events
+are aware of things like data coordinate space and which axes the event occurs
+in so you don't have to mess with low level transformation details to go from
+canvas space to data space. Object picking examples are also included.
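+
+A minimal sketch of this GUI-neutral model (assuming an interactive backend is
+available): connect a callback to ``'button_press_event'`` and read the click
+position directly in data coordinates.
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+fig, ax = plt.subplots()
+ax.plot(np.random.rand(10))
+
+
+def on_click(event):
+    # event.xdata / event.ydata are already in data coordinates
+    if event.inaxes is ax:
+        print('clicked at x=%.2f, y=%.2f' % (event.xdata, event.ydata))
+
+
+cid = fig.canvas.mpl_connect('button_press_event', on_click)
+plt.show()
+```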
+
+
+## Our Favorite Recipes
+Here is a collection of short tutorials, examples and code snippets
+that illustrate some of the useful idioms and tricks to make snazzier
+figures and overcome some matplotlib warts.
+
+
+## Embedding Matplotlib in graphical user interfaces
+You can embed Matplotlib directly into a user interface application by
+following the embedding_in_SOMEGUI.py examples here. Currently
+matplotlib supports wxpython, pygtk, tkinter and pyqt4/5.
+
+
+## Multiple subplots in one figure
+
+Multiple axes (i.e. subplots) are created with the
+[``subplot()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot) function:
+
+
+
+**Example of using [``imshow()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.imshow.html#matplotlib.pyplot.imshow) to display a CT scan**
+
+## Contouring and pseudocolor
+
+The [``pcolormesh()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pcolormesh.html#matplotlib.pyplot.pcolormesh) function can make a colored
+representation of a two-dimensional array, even if the horizontal dimensions
+are unevenly spaced. The
+[``contour()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.contour.html#matplotlib.pyplot.contour) function is another way to represent
+the same data:
+
+
+
+**Example comparing [``pcolormesh()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pcolormesh.html#matplotlib.pyplot.pcolormesh) and [``contour()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.contour.html#matplotlib.pyplot.contour) for plotting two-dimensional data**
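+
+A minimal sketch along those lines (the unevenly spaced grid and the Gaussian
+bump are arbitrary stand-ins for real data):
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+x = np.linspace(-3, 3, 40) ** 3 / 9   # unevenly spaced x coordinates
+y = np.linspace(-2, 2, 30)
+X, Y = np.meshgrid(x, y)
+Z = np.exp(-X**2 - Y**2)
+
+fig, (ax0, ax1) = plt.subplots(1, 2)
+ax0.pcolormesh(X, Y, Z)   # colored cells, even on the uneven grid
+ax0.set_title('pcolormesh')
+ax1.contour(X, Y, Z)      # contour lines of the same data
+ax1.set_title('contour')
+plt.show()
+```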
+
+## Histograms
+
+The [``hist()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib.pyplot.hist) function automatically generates
+histograms and returns the bin counts or probabilities:
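+
+A minimal sketch (assuming a normally distributed sample, purely for
+illustration):
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+rng = np.random.default_rng(19680801)
+data = rng.normal(size=1000)
+
+fig, ax = plt.subplots()
+counts, bins, patches = ax.hist(data, bins=30, density=True)
+print(counts[:5])  # per-bin densities, because density=True
+plt.show()
+```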
+
+
+
+## Paths
+
+You can add arbitrary paths in Matplotlib using the
+[``matplotlib.path``](https://matplotlib.org/api/path_api.html#module-matplotlib.path) module:
+
+
+
+## Three-dimensional plotting
+
+The mplot3d toolkit (see [Getting started](https://matplotlib.org//toolkits/mplot3d.html#toolkit-mplot3d-tutorial) and
+[3D plotting](https://matplotlib.org/gallery/index.html#mplot3d-examples-index)) has support for simple 3d graphs
+including surface, wireframe, scatter, and bar charts.
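+
+A minimal surface-plot sketch (the Gaussian surface is an arbitrary example):
+
+``` python
+from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection)
+import matplotlib.pyplot as plt
+import numpy as np
+
+X, Y = np.meshgrid(np.linspace(-2, 2, 50), np.linspace(-2, 2, 50))
+Z = np.exp(-X**2 - Y**2)
+
+fig = plt.figure()
+ax = fig.add_subplot(111, projection='3d')
+ax.plot_surface(X, Y, Z, cmap='viridis')
+plt.show()
+```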
+
+
+
+Thanks to John Porter, Jonathon Taylor, Reinier Heeres, and Ben Root for
+the ``mplot3d`` toolkit. This toolkit is included with all standard Matplotlib
+installs.
+
+## Streamplot
+
+The [``streamplot()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.streamplot.html#matplotlib.pyplot.streamplot) function plots the streamlines of
+a vector field. In addition to simply plotting the streamlines, it allows you
+to map the colors and/or line widths of streamlines to a separate parameter,
+such as the speed or local intensity of the vector field.
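+
+A minimal sketch (the vector field below is an arbitrary example; the
+streamlines are colored by the local speed):
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+Y, X = np.mgrid[-3:3:100j, -3:3:100j]
+U = -1 - X**2 + Y
+V = 1 + X - Y**2
+speed = np.sqrt(U**2 + V**2)
+
+fig, ax = plt.subplots()
+strm = ax.streamplot(X, Y, U, V, color=speed, linewidth=1, cmap='viridis')
+fig.colorbar(strm.lines)
+plt.show()
+```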
+
+
+
+This feature complements the [``quiver()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.quiver.html#matplotlib.pyplot.quiver) function for
+plotting vector fields. Thanks to Tom Flannaghan and Tony Yu for adding the
+streamplot function.
+
+## Ellipses
+
+In support of the [Phoenix](http://www.jpl.nasa.gov/news/phoenix/main.php)
+mission to Mars (which used Matplotlib to display ground tracking of
+spacecraft), Michael Droettboom built on work by Charlie Moad to provide
+an extremely accurate 8-spline approximation to elliptical arcs (see
+[``Arc``](https://matplotlib.org/api/_as_gen/matplotlib.patches.Arc.html#matplotlib.patches.Arc)), which are insensitive to zoom level.
+
+
+
+## Bar charts
+
+Use the [``bar()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html#matplotlib.pyplot.bar) function to make bar charts, which
+includes customizations such as error bars:
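+
+A minimal sketch with symmetric error bars (the categories and values are
+made up for illustration):
+
+``` python
+import matplotlib.pyplot as plt
+
+labels = ['A', 'B', 'C', 'D']
+values = [20, 35, 30, 27]
+errors = [2, 3, 4, 1]
+
+fig, ax = plt.subplots()
+ax.bar(labels, values, yerr=errors, capsize=4)
+ax.set_ylabel('Value')
+plt.show()
+```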
+
+
+
+You can also create stacked bars
+([bar_stacked.py](https://matplotlib.org/gallery/lines_bars_and_markers/bar_stacked.html)),
+or horizontal bar charts
+([barh.py](https://matplotlib.org/gallery/lines_bars_and_markers/barh.html)).
+
+## Pie charts
+
+The [``pie()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pie.html#matplotlib.pyplot.pie) function allows you to create pie
+charts. Optional features include auto-labeling the percentage of area,
+exploding one or more wedges from the center of the pie, and a shadow effect.
+Take a close look at the attached code, which generates this figure in just
+a few lines of code.
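+
+A minimal sketch showing auto-labeled percentages, an exploded wedge, and a
+shadow (the categories are arbitrary):
+
+``` python
+import matplotlib.pyplot as plt
+
+labels = ['Frogs', 'Hogs', 'Dogs', 'Logs']
+sizes = [15, 30, 45, 10]
+explode = (0, 0.1, 0, 0)   # pull the second wedge out of the pie
+
+fig, ax = plt.subplots()
+ax.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True)
+ax.axis('equal')           # equal aspect ratio keeps the pie circular
+plt.show()
+```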
+
+
+
+## Tables
+
+The [``table()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.table.html#matplotlib.pyplot.table) function adds a text table
+to an axes.
+
+
+
+## Scatter plots
+
+The [``scatter()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter) function makes a scatter plot
+with (optional) size and color arguments. This example plots changes
+in Google's stock price, with marker sizes reflecting the
+trading volume and colors varying with time. Here, the
+alpha attribute is used to make semitransparent circle markers.
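+
+A minimal sketch of the same idea with random data (marker sizes and colors
+here are arbitrary stand-ins for volume and time):
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+rng = np.random.default_rng(19680801)
+n = 100
+x, y = rng.random(n), rng.random(n)
+sizes = 1000 * rng.random(n)   # stand-in for trading volume
+colors = np.arange(n)          # stand-in for time
+
+fig, ax = plt.subplots()
+ax.scatter(x, y, s=sizes, c=colors, alpha=0.5, cmap='viridis')
+plt.show()
+```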
+
+
+
+## GUI widgets
+
+Matplotlib has basic GUI widgets that are independent of the graphical
+user interface you are using, allowing you to write cross GUI figures
+and widgets. See [``matplotlib.widgets``](https://matplotlib.org/api/widgets_api.html#module-matplotlib.widgets) and the
+[widget examples](https://matplotlib.org/gallery/index.html).
+
+
+
+Thanks to Andrew Straw for adding this function.
+
+## Date handling
+
+You can plot timeseries data with major and minor ticks and custom
+tick formatters for both.
+
+
+
+See [``matplotlib.ticker``](https://matplotlib.org/api/ticker_api.html#module-matplotlib.ticker) and [``matplotlib.dates``](https://matplotlib.org/api/dates_api.html#module-matplotlib.dates) for details and usage.
+
+## Log plots
+
+The [``semilogx()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.semilogx.html#matplotlib.pyplot.semilogx),
+[``semilogy()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.semilogy.html#matplotlib.pyplot.semilogy) and
+[``loglog()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.loglog.html#matplotlib.pyplot.loglog) functions simplify the creation of
+logarithmic plots.
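+
+A minimal sketch of the three variants (arbitrary example functions):
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+x = np.linspace(0.01, 10, 200)
+
+fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(9, 3))
+ax0.semilogx(x, np.sin(2 * np.pi * x))   # log scale on the x axis only
+ax1.semilogy(x, np.exp(x))               # log scale on the y axis only
+ax2.loglog(x, x**3)                      # log scale on both axes
+plt.show()
+```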
+
+
+
+Thanks to Andrew Straw, Darren Dale and Gregory Lielens for contributions
+to the log-scaling infrastructure.
+
+## Polar plots
+
+The [``polar()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.polar.html#matplotlib.pyplot.polar) function generates polar plots.
+
+
+
+Thanks to Charles Twardy for input on the legend function.
+
+## TeX-notation for text objects
+
+Below is a sampling of the many TeX expressions now supported by Matplotlib's
+internal mathtext engine. The mathtext module provides TeX style mathematical
+expressions using [FreeType](https://www.freetype.org/)
+and the DejaVu, BaKoMa computer modern, or [STIX](http://www.stixfonts.org)
+fonts. See the [``matplotlib.mathtext``](https://matplotlib.org/api/mathtext_api.html#module-matplotlib.mathtext) module for additional details.
+
+
+
+Matplotlib's mathtext infrastructure is an independent implementation and
+does not require TeX or any external packages installed on your computer. See
+the tutorial at [Writing mathematical expressions](https://matplotlib.org//text/mathtext.html).
+
+## Native TeX rendering
+
+Although Matplotlib's internal math rendering engine is quite
+powerful, sometimes you need TeX. Matplotlib supports external TeX
+rendering of strings with the *usetex* option.
+
+
+
+## EEG GUI
+
+You can embed Matplotlib into pygtk, wx, Tk, or Qt applications.
+Here is a screenshot of an EEG viewer called [pbrain](https://github.com/nipy/pbrain).
+
+
+
+The lower axes uses [``specgram()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.specgram.html#matplotlib.pyplot.specgram)
+to plot the spectrogram of one of the EEG channels.
+
+For examples of how to embed Matplotlib in different toolkits, see:
+
+- [Embedding in GTK3](https://matplotlib.org/gallery/user_interfaces/embedding_in_gtk3_sgskip.html)
+- [Embedding in wx #2](https://matplotlib.org/gallery/user_interfaces/embedding_in_wx2_sgskip.html)
+- [Matplotlib With Glade 3](https://matplotlib.org/gallery/user_interfaces/mpl_with_glade3_sgskip.html)
+- [Embedding in Qt](https://matplotlib.org/gallery/user_interfaces/embedding_in_qt_sgskip.html)
+- [Embedding in Tk](https://matplotlib.org/gallery/user_interfaces/embedding_in_tk_sgskip.html)
+
+## XKCD-style sketch plots
+
+Just for fun, Matplotlib supports plotting in the style of ``xkcd``.
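+
+A minimal sketch using the ``plt.xkcd()`` context manager (the sine curve is
+just a placeholder):
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+with plt.xkcd():   # the sketch style applies to everything created inside
+    fig, ax = plt.subplots()
+    x = np.linspace(0, 10, 100)
+    ax.plot(x, np.sin(x))
+    ax.set_title('A hand-drawn looking sine wave')
+
+plt.show()
+```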
+
+
+
+## Subplot example
+
+Many plot types can be combined in one figure to create
+powerful and flexible representations of data.
+
+
+
+
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+
+np.random.seed(19680801)
+data = np.random.randn(2, 100)
+
+fig, axs = plt.subplots(2, 2, figsize=(5, 5))
+axs[0, 0].hist(data[0])
+axs[1, 0].scatter(data[0], data[1])
+axs[0, 1].plot(data[0], data[1])
+axs[1, 1].hist2d(data[0], data[1])
+
+plt.show()
+```
+
+## Download
+
+- [Download Python source code: sample_plots.py](https://matplotlib.org/_downloads/6b0f2d1b3dc8d0e75eaa96feb738e947/sample_plots.py)
+- [Download Jupyter notebook: sample_plots.ipynb](https://matplotlib.org/_downloads/dcfd63fc031d50e9c085f5dc4aa458b1/sample_plots.ipynb)
diff --git a/Python/matplotlab/introductory/usage.md b/Python/matplotlab/introductory/usage.md
new file mode 100644
index 00000000..8feee2c6
--- /dev/null
+++ b/Python/matplotlab/introductory/usage.md
@@ -0,0 +1,812 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Usage Guide
+
+This tutorial covers some basic usage patterns and best-practices to
+help you get started with Matplotlib.
+
+## General Concepts
+
+``matplotlib`` has an extensive codebase that can be daunting to many
+new users. However, most of matplotlib can be understood with a fairly
+simple conceptual framework and knowledge of a few important points.
+
+Plotting requires action on a range of levels, from the most general
+(e.g., 'contour this 2-D array') to the most specific (e.g., 'color
+this screen pixel red'). The purpose of a plotting package is to assist
+you in visualizing your data as easily as possible, with all the necessary
+control -- that is, by using relatively high-level commands most of
+the time, and still have the ability to use the low-level commands when
+needed.
+
+Therefore, everything in matplotlib is organized in a hierarchy. At the top
+of the hierarchy is the matplotlib "state-machine environment" which is
+provided by the [``matplotlib.pyplot``](https://matplotlib.orgapi/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot) module. At this level, simple
+functions are used to add plot elements (lines, images, text, etc.) to
+the current axes in the current figure.
+
+::: tip Note
+
+Pyplot's state-machine environment behaves similarly to MATLAB and
+should be most familiar to users with MATLAB experience.
+
+:::
+
+The next level down in the hierarchy is the first level of the object-oriented
+interface, in which pyplot is used only for a few functions such as figure
+creation, and the user explicitly creates and keeps track of the figure
+and axes objects. At this level, the user uses pyplot to create figures,
+and through those figures, one or more axes objects can be created. These
+axes objects are then used for most plotting actions.
+
+For even more control -- which is essential for things like embedding
+matplotlib plots in GUI applications -- the pyplot level may be dropped
+completely, leaving a purely object-oriented approach.
+
+``` python
+# sphinx_gallery_thumbnail_number = 3
+import matplotlib.pyplot as plt
+import numpy as np
+```
+
+## Parts of a Figure
+
+
+
+### ``Figure``
+
+The **whole** figure. The figure keeps
+track of all the child [``Axes``](https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes), a smattering of
+'special' artists (titles, figure legends, etc), and the **canvas**.
+(Don't worry too much about the canvas, it is crucial as it is the
+object that actually does the drawing to get you your plot, but as the
+user it is more-or-less invisible to you). A figure can have any
+number of [``Axes``](https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes), but to be useful should have
+at least one.
+
+The easiest way to create a new figure is with pyplot:
+
+``` python
+fig = plt.figure() # an empty figure with no axes
+fig.suptitle('No axes on this figure') # Add a title so we know which it is
+
+fig, ax_lst = plt.subplots(2, 2) # a figure with a 2x2 grid of Axes
+```
+
+- 
+- 
+
+### ``Axes``
+
+This is what you think of as 'a plot'; it is the region of the image
+with the data space. A given figure
+can contain many Axes, but a given [``Axes``](https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes)
+object can only be in one [``Figure``](https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure). The
+Axes contains two (or three in the case of 3D)
+[``Axis``](https://matplotlib.org/api/axis_api.html#matplotlib.axis.Axis) objects (be aware of the difference
+between **Axes** and **Axis**) which take care of the data limits (the
+data limits can also be controlled via the
+[``set_xlim()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xlim.html#matplotlib.axes.Axes.set_xlim) and
+[``set_ylim()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_ylim.html#matplotlib.axes.Axes.set_ylim) ``Axes`` methods). Each
+``Axes`` has a title (set via
+[``set_title()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_title.html#matplotlib.axes.Axes.set_title)), an x-label (set via
+[``set_xlabel()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xlabel.html#matplotlib.axes.Axes.set_xlabel)), and a y-label (set via
+[``set_ylabel()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_ylabel.html#matplotlib.axes.Axes.set_ylabel)).
+
+The ``Axes`` class and its member functions are the primary entry
+point to working with the OO interface.
+
+### ``Axis``
+
+These are the number-line-like objects. They take
+care of setting the graph limits and generating the ticks (the marks
+on the axis) and ticklabels (strings labeling the ticks). The
+location of the ticks is determined by a
+[``Locator``](https://matplotlib.orgapi/ticker_api.html#matplotlib.ticker.Locator) object and the ticklabel strings
+are formatted by a [``Formatter``](https://matplotlib.orgapi/ticker_api.html#matplotlib.ticker.Formatter). The
+combination of the correct ``Locator`` and ``Formatter`` gives
+very fine control over the tick locations and labels.
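+
+A minimal sketch of that control (the particular locator and format are
+arbitrary choices):
+
+``` python
+import matplotlib.pyplot as plt
+import matplotlib.ticker as ticker
+import numpy as np
+
+x = np.linspace(0, 10, 100)
+fig, ax = plt.subplots()
+ax.plot(x, np.sin(x))
+
+# Major ticks every 2 data units, each label formatted to one decimal place
+ax.xaxis.set_major_locator(ticker.MultipleLocator(2))
+ax.xaxis.set_major_formatter(ticker.FormatStrFormatter('%.1f'))
+plt.show()
+```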
+
+### ``Artist``
+
+Basically everything you can see on the figure is an artist (even the
+``Figure``, ``Axes``, and ``Axis`` objects). This
+includes ``Text`` objects, ``Line2D`` objects,
+``collection`` objects, ``Patch`` objects ... (you get the
+idea). When the figure is rendered, all of the artists are drawn to
+the **canvas**. Most Artists are tied to an Axes; such an Artist
+cannot be shared by multiple Axes, or moved from one to another.
+
+## Types of inputs to plotting functions
+
+All of the plotting functions expect ``np.array`` or ``np.ma.masked_array`` as
+input. Classes that are 'array-like' such as [``pandas``](https://pandas.pydata.org/pandas-docs/stable/index.html#module-pandas) data objects
+and ``np.matrix`` may or may not work as intended. It is best to
+convert these to ``np.array`` objects prior to plotting.
+
+For example, to convert a [``pandas.DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame)
+
+``` python
+import pandas
+
+a = pandas.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
+a_asarray = a.values
+```
+
+and to convert a ``np.matrix``
+
+``` python
+b = np.matrix([[1,2],[3,4]])
+b_asarray = np.asarray(b)
+```
+
+## Matplotlib, pyplot and pylab: how are they related?
+
+Matplotlib is the whole package and [``matplotlib.pyplot``](https://matplotlib.orgapi/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot) is a module in
+Matplotlib.
+
+For functions in the pyplot module, there is always a "current" figure and
+axes (which is created automatically on request). For example, in the
+following example, the first call to ``plt.plot`` creates the axes, then
+subsequent calls to ``plt.plot`` add additional lines on the same axes, and
+``plt.xlabel``, ``plt.ylabel``, ``plt.title`` and ``plt.legend`` set the
+axes labels and title and add a legend.
+
+``` python
+x = np.linspace(0, 2, 100)
+
+plt.plot(x, x, label='linear')
+plt.plot(x, x**2, label='quadratic')
+plt.plot(x, x**3, label='cubic')
+
+plt.xlabel('x label')
+plt.ylabel('y label')
+
+plt.title("Simple Plot")
+
+plt.legend()
+
+plt.show()
+```
+
+
+
+``pylab`` is a convenience module that bulk imports
+[``matplotlib.pyplot``](https://matplotlib.orgapi/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot) (for plotting) and [``numpy``](https://docs.scipy.org/doc/numpy/reference/index.html#module-numpy)
+(for mathematics and working with arrays) in a single namespace.
+pylab is deprecated and its use is strongly discouraged because
+of namespace pollution. Use pyplot instead.
+
+For non-interactive plotting it is suggested
+to use pyplot to create the figures and then the OO interface for
+plotting.
+
+## Coding Styles
+
+When viewing this documentation and examples, you will find different
+coding styles and usage patterns. These styles are perfectly valid
+and have their pros and cons. Just about all of the examples can be
+converted into another style and achieve the same results.
+The only caveat is to avoid mixing the coding styles for your own code.
+
+::: tip Note
+
+Developers for matplotlib have to follow a specific style and guidelines.
+See [The Matplotlib Developers' Guide](https://matplotlib.orgdevel/index.html#developers-guide-index).
+
+:::
+
+Of the different styles, there are two that are officially supported.
+Therefore, these are the preferred ways to use matplotlib.
+
+For the pyplot style, the imports at the top of your
+scripts will typically be:
+
+``` python
+import matplotlib.pyplot as plt
+import numpy as np
+```
+
+Then one calls, for example, np.arange, np.zeros, np.pi, plt.figure,
+plt.plot, plt.show, etc. Use the pyplot interface
+for creating figures, and then use the object methods for the rest:
+
+``` python
+x = np.arange(0, 10, 0.2)
+y = np.sin(x)
+fig, ax = plt.subplots()
+ax.plot(x, y)
+plt.show()
+```
+
+
+
+So, why all the extra typing instead of the MATLAB-style (which relies
+on global state and a flat namespace)? For very simple things like
+this example, the only advantage is academic: the wordier styles are
+more explicit, more clear as to where things come from and what is
+going on. For more complicated applications, this explicitness and
+clarity becomes increasingly valuable, and the richer and more
+complete object-oriented interface will likely make the program easier
+to write and maintain.
+
+Typically one finds oneself making the same plots over and over
+again, but with different data sets, which leads to needing to write
+specialized functions to do the plotting. The recommended function
+signature is something like:
+
+``` python
+def my_plotter(ax, data1, data2, param_dict):
+ """
+ A helper function to make a graph
+
+ Parameters
+ ----------
+ ax : Axes
+ The axes to draw to
+
+ data1 : array
+ The x data
+
+ data2 : array
+ The y data
+
+ param_dict : dict
+ Dictionary of kwargs to pass to ax.plot
+
+ Returns
+ -------
+ out : list
+ list of artists added
+ """
+ out = ax.plot(data1, data2, **param_dict)
+ return out
+
+# which you would then use as:
+
+data1, data2, data3, data4 = np.random.randn(4, 100)
+fig, ax = plt.subplots(1, 1)
+my_plotter(ax, data1, data2, {'marker': 'x'})
+```
+
+
+
+or if you wanted to have 2 sub-plots:
+
+``` python
+fig, (ax1, ax2) = plt.subplots(1, 2)
+my_plotter(ax1, data1, data2, {'marker': 'x'})
+my_plotter(ax2, data3, data4, {'marker': 'o'})
+```
+
+
+
+Again, for these simple examples this style seems like overkill, however
+once the graphs get slightly more complex it pays off.
+
+## Backends
+
+### What is a backend?
+
+A lot of documentation on the website and in the mailing lists refers
+to the "backend" and many new users are confused by this term.
+matplotlib targets many different use cases and output formats. Some
+people use matplotlib interactively from the python shell and have
+plotting windows pop up when they type commands. Some people run
+[Jupyter](https://jupyter.org) notebooks and draw inline plots for
+quick data analysis. Others embed matplotlib into graphical user
+interfaces like wxpython or pygtk to build rich applications. Some
+people use matplotlib in batch scripts to generate postscript images
+from numerical simulations, and still others run web application
+servers to dynamically serve up graphs.
+
+To support all of these use cases, matplotlib can target different
+outputs, and each of these capabilities is called a backend; the
+"frontend" is the user facing code, i.e., the plotting code, whereas the
+"backend" does all the hard work behind-the-scenes to make the figure.
+There are two types of backends: user interface backends (for use in
+pygtk, wxpython, tkinter, qt4, or macosx; also referred to as
+"interactive backends") and hardcopy backends to make image files
+(PNG, SVG, PDF, PS; also referred to as "non-interactive backends").
+
+There are four ways to configure your backend. If they conflict with each other,
+the method mentioned last in the following list will be used, e.g. calling
+[``use()``](https://matplotlib.orgapi/matplotlib_configuration_api.html#matplotlib.use) will override the setting in your ``matplotlibrc``.
+
+::: tip Note
+
+Backend name specifications are not case-sensitive; e.g., 'GTK3Agg'
+and 'gtk3agg' are equivalent.
+
+:::
+
+With a typical installation of matplotlib, such as from a
+binary installer or a linux distribution package, a good default
+backend will already be set, allowing both interactive work and
+plotting from scripts, with output to the screen and/or to
+a file, so at least initially you will not need to use any of the
+methods given above.
+
+If, however, you want to write graphical user interfaces, or a web
+application server ([How to use Matplotlib in a web application server](https://matplotlib.orgfaq/howto_faq.html#howto-webapp)), or need a better
+understanding of what is going on, read on. To make things a little
+more customizable for graphical user interfaces, matplotlib separates
+the concept of the renderer (the thing that actually does the drawing)
+from the canvas (the place where the drawing goes). The canonical
+renderer for user interfaces is ``Agg`` which uses the [Anti-Grain
+Geometry](http://antigrain.com/) C++ library to make a raster (pixel) image of the figure.
+All of the user interfaces except ``macosx`` can be used with
+agg rendering, e.g., ``WXAgg``, ``GTK3Agg``, ``QT4Agg``, ``QT5Agg``,
+``TkAgg``. In addition, some of the user interfaces support other rendering
+engines. For example, with GTK+ 3, you can also select Cairo rendering
+(backend ``GTK3Cairo``).
+
+For the rendering engines, one can also distinguish between [vector](https://en.wikipedia.org/wiki/Vector_graphics) or [raster](https://en.wikipedia.org/wiki/Raster_graphics) renderers. Vector
+graphics languages issue drawing commands like "draw a line from this
+point to this point" and hence are scale free, and raster backends
+generate a pixel representation of the line whose accuracy depends on a
+DPI setting.
+
+Here is a summary of the matplotlib renderers (there is an eponymous
+backend for each; these are *non-interactive backends*, capable of
+writing to a file):
+
+
+| Renderer | Filetypes | Description |
+| --- | --- | --- |
+| AGG | png | raster graphics -- high quality images using the [Anti-Grain Geometry](http://antigrain.com/) engine |
+| PS | ps, eps | vector graphics -- [Postscript](https://en.wikipedia.org/wiki/PostScript) output |
+| PDF | pdf | vector graphics -- [Portable Document Format](https://en.wikipedia.org/wiki/Portable_Document_Format) |
+| SVG | svg | vector graphics -- [Scalable Vector Graphics](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics) |
+| Cairo | png, ps, pdf, svg | raster graphics and vector graphics -- using the [Cairo graphics](https://www.cairographics.org) library |
+
+And here are the user interfaces and renderer combinations supported;
+these are *interactive backends*, capable of displaying to the screen
+and of using appropriate renderers from the table above to write to
+a file:
+
+
+| Backend | Description |
+| --- | --- |
+| Qt5Agg | Agg rendering in a Qt5 canvas (requires [PyQt5](https://riverbankcomputing.com/software/pyqt/intro)). This backend can be activated in IPython with ``%matplotlib qt5``. |
+| ipympl | Agg rendering embedded in a Jupyter widget (requires ipympl). This backend can be enabled in a Jupyter notebook with ``%matplotlib ipympl``. |
+| GTK3Agg | Agg rendering to a GTK 3.x canvas (requires [PyGObject](https://wiki.gnome.org/action/show/Projects/PyGObject), and [pycairo](https://www.cairographics.org/pycairo/) or [cairocffi](https://pythonhosted.org/cairocffi/)). This backend can be activated in IPython with ``%matplotlib gtk3``. |
+| macosx | Agg rendering into a Cocoa canvas in OSX. This backend can be activated in IPython with ``%matplotlib osx``. |
+| TkAgg | Agg rendering to a Tk canvas (requires [TkInter](https://wiki.python.org/moin/TkInter)). This backend can be activated in IPython with ``%matplotlib tk``. |
+| nbAgg | Embed an interactive figure in a Jupyter classic notebook. This backend can be enabled in Jupyter notebooks via ``%matplotlib notebook``. |
+| WebAgg | On ``show()`` will start a tornado server with an interactive figure. |
+| GTK3Cairo | Cairo rendering to a GTK 3.x canvas (requires PyGObject, and pycairo or cairocffi). |
+| Qt4Agg | Agg rendering to a Qt4 canvas (requires [PyQt4](https://riverbankcomputing.com/software/pyqt/intro) or pyside). This backend can be activated in IPython with ``%matplotlib qt4``. |
+| WXAgg | Agg rendering to a wxWidgets canvas (requires [wxPython](https://www.wxpython.org/) 4). This backend can be activated in IPython with ``%matplotlib wx``. |
+
+### ipympl
+
+The Jupyter widget ecosystem is moving too fast to support directly in
+Matplotlib. To install ipympl, use either pip or conda:
+
+``` bash
+pip install ipympl
+jupyter nbextension enable --py --sys-prefix ipympl
+```
+
+or
+
+``` bash
+conda install ipympl -c conda-forge
+```
+
+See [jupyter-matplotlib](https://github.com/matplotlib/jupyter-matplotlib)
+for more details.
+
+### GTK and Cairo
+
+``GTK3`` backends (*both* ``GTK3Agg`` and ``GTK3Cairo``) depend on Cairo
+(pycairo>=1.11.0 or cairocffi).
+
+### How do I select PyQt4 or PySide?
+
+The ``QT_API`` environment variable can be set to either ``pyqt`` or ``pyside``
+to use ``PyQt4`` or ``PySide``, respectively.
+
+Since the default binding is ``PyQt4``, ``matplotlib`` first tries to
+import it; if the import fails, it falls back to ``PySide``.
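+
+For example, a minimal sketch of selecting PySide from the shell before starting
+Python; the script name here is just a placeholder:
+
+``` bash
+# Select the Qt binding explicitly; use pyqt for PyQt4 or pyside for PySide.
+export QT_API=pyside
+python my_plot_script.py
+```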
+
+## What is interactive mode?
+
+Use of an interactive backend (see [What is a backend?](#what-is-a-backend))
+permits--but does not by itself require or ensure--plotting
+to the screen. Whether and when plotting to the screen occurs,
+and whether a script or shell session continues after a plot
+is drawn on the screen, depends on the functions and methods
+that are called, and on a state variable that determines whether
+matplotlib is in "interactive mode". The default Boolean value is set
+by the ``matplotlibrc`` file, and may be customized like any other
+configuration parameter (see [Customizing Matplotlib with style sheets and rcParams](customizing.html)). It
+may also be set via [``matplotlib.interactive()``](https://matplotlib.org/api/matplotlib_configuration_api.html#matplotlib.interactive), and its
+value may be queried via [``matplotlib.is_interactive()``](https://matplotlib.org/api/matplotlib_configuration_api.html#matplotlib.is_interactive). Turning
+interactive mode on and off in the middle of a stream of plotting
+commands, whether in a script or in a shell, is rarely needed
+and potentially confusing, so in the following we will assume all
+plotting is done with interactive mode either on or off.
+
+::: tip Note
+
+Major changes related to interactivity, and in particular the
+role and behavior of [``show()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.show.html#matplotlib.pyplot.show), were made in the
+transition to matplotlib version 1.0, and bugs were fixed in
+1.0.1. Here we describe the version 1.0.1 behavior for the
+primary interactive backends, with the partial exception of
+*macosx*.
+
+:::
+
+Interactive mode may also be turned on via [``matplotlib.pyplot.ion()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.ion.html#matplotlib.pyplot.ion),
+and turned off via [``matplotlib.pyplot.ioff()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.ioff.html#matplotlib.pyplot.ioff).
+
+::: tip Note
+
+Interactive mode works with suitable backends in ipython and in
+the ordinary python shell, but it does *not* work in the IDLE IDE.
+If the default backend does not support interactivity, an interactive
+backend can be explicitly activated using any of the methods discussed in [What is a backend?](#what-is-a-backend).
+
+:::
+
+### Interactive example
+
+From an ordinary python prompt, or after invoking ipython with no options,
+try this:
+
+``` python
+import matplotlib.pyplot as plt
+plt.ion()
+plt.plot([1.6, 2.7])
+```
+
+Assuming you are running version 1.0.1 or higher, and you have
+an interactive backend installed and selected by default, you should
+see a plot, and your terminal prompt should also be active; you
+can type additional commands such as:
+
+``` python
+plt.title("interactive test")
+plt.xlabel("index")
+```
+
+and you will see the plot being updated after each line. Since version 1.5,
+modifying the plot by other means *should* also automatically
+update the display on most backends. Get a reference to the [``Axes``](https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes) instance,
+and call a method of that instance:
+
+``` python
+ax = plt.gca()
+ax.plot([3.1, 2.2])
+```
+
+If you are using certain backends (like ``macosx``), or an older version
+of matplotlib, you may not see the new line added to the plot immediately.
+In this case, you need to explicitly call [``draw()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.draw.html#matplotlib.pyplot.draw)
+in order to update the plot:
+
+``` python
+plt.draw()
+```
+
+### Non-interactive example
+
+Start a fresh session as in the previous example, but now
+turn interactive mode off:
+
+``` python
+import matplotlib.pyplot as plt
+plt.ioff()
+plt.plot([1.6, 2.7])
+```
+
+Nothing happened--or at least nothing has shown up on the
+screen (unless you are using the *macosx* backend, which is
+anomalous). To make the plot appear, you need to do this:
+
+``` python
+plt.show()
+```
+
+Now you see the plot, but your terminal command line is
+unresponsive; the ``show()`` command *blocks* the input
+of additional commands until you manually kill the plot
+window.
+
+What good is this--being forced to use a blocking function?
+Suppose you need a script that plots the contents of a file
+to the screen. You want to look at that plot, and then end
+the script. Without some blocking command such as show(), the
+script would flash up the plot and then end immediately,
+leaving nothing on the screen.
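+
+For instance, a minimal sketch of such a script; ``data.txt`` is a placeholder for
+a file with one number per line:
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+
+data = np.loadtxt('data.txt')  # placeholder input file
+plt.plot(data)
+plt.show()                     # blocks here until the plot window is closed
+```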
+
+In addition, non-interactive mode delays all drawing until
+show() is called; this is more efficient than redrawing
+the plot each time a line in the script adds a new feature.
+
+Prior to version 1.0, show() generally could not be called
+more than once in a single script (although sometimes one
+could get away with it); for version 1.0.1 and above, this
+restriction is lifted, so one can write a script like this:
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+
+plt.ioff()
+for i in range(3):
+ plt.plot(np.random.rand(10))
+ plt.show()
+```
+
+which makes three plots, one at a time; that is, the second plot will show up
+once the first plot is closed.
+
+### Summary
+
+In interactive mode, pyplot functions automatically draw
+to the screen.
+
+When plotting interactively, if you use
+object method calls in addition to pyplot functions, then
+call [``draw()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.draw.html#matplotlib.pyplot.draw) whenever you want to
+refresh the plot.
+
+Use non-interactive mode in scripts in which you want to
+generate one or more figures and display them before ending
+or generating a new set of figures. In that case, use
+[``show()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.show.html#matplotlib.pyplot.show) to display the figure(s) and
+to block execution until you have manually destroyed them.
+
+## Performance
+
+Whether exploring data in interactive mode or programmatically
+saving lots of plots, rendering performance can be a painful
+bottleneck in your pipeline. Matplotlib provides a couple of
+ways to greatly reduce rendering time at the cost of a slight
+change (to a settable tolerance) in your plot's appearance.
+The methods available to reduce rendering time depend on the
+type of plot that is being created.
+
+### Line segment simplification
+
+For plots that have line segments (e.g. typical line plots,
+outlines of polygons, etc.), rendering performance can be
+controlled by the ``path.simplify`` and
+``path.simplify_threshold`` parameters in your
+``matplotlibrc`` file (see
+[Customizing Matplotlib with style sheets and rcParams](customizing.html) for
+more information about the ``matplotlibrc`` file).
+The ``path.simplify`` parameter is a boolean indicating whether
+or not line segments are simplified at all. The
+``path.simplify_threshold`` parameter controls how much line
+segments are simplified; higher thresholds result in quicker
+rendering.
+
+The following script will first display the data without any
+simplification, and then display the same data with simplification.
+Try interacting with both of them:
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+import matplotlib as mpl
+
+# Setup, and create the data to plot
+y = np.random.rand(100000)
+y[50000:] *= 2
+y[np.logspace(1, np.log10(50000), 400).astype(int)] = -1
+mpl.rcParams['path.simplify'] = True
+
+mpl.rcParams['path.simplify_threshold'] = 0.0
+plt.plot(y)
+plt.show()
+
+mpl.rcParams['path.simplify_threshold'] = 1.0
+plt.plot(y)
+plt.show()
+```
+
+Matplotlib currently defaults to a conservative simplification
+threshold of ``1/9``. If you want to change your default settings
+to use a different value, you can change your ``matplotlibrc``
+file. Alternatively, you could create a new style for
+interactive plotting (with maximal simplification) and another
+style for publication quality plotting (with minimal
+simplification) and activate them as necessary. See
+[Customizing Matplotlib with style sheets and rcParams](customizing.html) for
+instructions on how to perform these actions.
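+
+As a sketch of that approach, you can also switch settings temporarily with
+``rc_context``; the threshold values below are only illustrative, and the output
+filename is a placeholder:
+
+``` python
+import matplotlib.pyplot as plt
+
+fast_draft = {'path.simplify': True, 'path.simplify_threshold': 1.0}
+publication = {'path.simplify': True, 'path.simplify_threshold': 0.05}
+
+with plt.rc_context(fast_draft):     # maximal simplification for interactive work
+    plt.plot(range(10))
+    plt.show()
+
+with plt.rc_context(publication):    # minimal simplification for final output
+    plt.plot(range(10))
+    plt.savefig('figure.png')        # placeholder filename
+```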
+
+The simplification works by iteratively merging line segments
+into a single vector until the next line segment's perpendicular
+distance to the vector (measured in display-coordinate space)
+is greater than the ``path.simplify_threshold`` parameter.
+
+::: tip Note
+
+Changes related to how line segments are simplified were made
+in version 2.1. Rendering time will still be improved by these
+parameters prior to 2.1, but rendering time for some kinds of
+data will be vastly improved in versions 2.1 and greater.
+
+:::
+
+### Marker simplification
+
+Markers can also be simplified, albeit less robustly than
+line segments. Marker simplification is only available
+to [``Line2D``](https://matplotlib.org/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D) objects (through the
+``markevery`` property). Wherever
+[``Line2D``](https://matplotlib.org/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D) construction parameters
+are passed through, such as
+[``matplotlib.pyplot.plot()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) and
+[``matplotlib.axes.Axes.plot()``](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.plot.html#matplotlib.axes.Axes.plot), the ``markevery``
+parameter can be used:
+
+``` python
+plt.plot(x, y, markevery=10)
+```
+
+The ``markevery`` argument allows for naive subsampling, or an
+attempt at evenly spaced (along the *x* axis) sampling. See the
+[Markevery Demo](https://matplotlib.org/gallery/lines_bars_and_markers/markevery_demo.html)
+for more information.
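+
+A self-contained sketch of the effect; the data are arbitrary:
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+
+x = np.linspace(0, 10, 2000)
+y = np.sin(x) + 0.1 * np.random.randn(x.size)
+
+# Draw every point of the line, but place a marker on only every 50th point
+plt.plot(x, y, '-o', markevery=50)
+plt.show()
+```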
+
+### Splitting lines into smaller chunks
+
+If you are using the Agg backend (see [What is a backend?](#what-is-a-backend)),
+then you can make use of the ``agg.path.chunksize`` rc parameter.
+This allows you to specify a chunk size, and any lines with
+greater than that many vertices will be split into multiple
+lines, each of which has no more than ``agg.path.chunksize``
+many vertices. (Unless ``agg.path.chunksize`` is zero, in
+which case there is no chunking.) For some kinds of data,
+chunking the line up into reasonable sizes can greatly
+decrease rendering time.
+
+The following script will first display the data without any
+chunk size restriction, and then display the same data with
+a chunk size of 10,000. The difference can best be seen when
+the figures are large; try maximizing the GUI window and then
+interacting with them:
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+import matplotlib as mpl
+mpl.rcParams['path.simplify_threshold'] = 1.0
+
+# Setup, and create the data to plot
+y = np.random.rand(100000)
+y[50000:] *= 2
+y[np.logspace(1,np.log10(50000), 400).astype(int)] = -1
+mpl.rcParams['path.simplify'] = True
+
+mpl.rcParams['agg.path.chunksize'] = 0
+plt.plot(y)
+plt.show()
+
+mpl.rcParams['agg.path.chunksize'] = 10000
+plt.plot(y)
+plt.show()
+```
+
+### Legends
+
+The default legend behavior for axes attempts to find the location
+that covers the fewest data points (``loc='best'``). This can be a
+very expensive computation if there are lots of data points. In
+this case, you may want to provide a specific location.
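+
+A minimal sketch of pinning the legend so that the search is skipped; the data
+are arbitrary:
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+
+y = np.random.rand(100000)
+plt.plot(y, label='noise')
+plt.legend(loc='upper right')  # explicit location; avoids the costly 'best' search
+plt.show()
+```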
+
+### Using the fast style
+
+The *fast* style can be used to automatically set
+simplification and chunking parameters to reasonable
+settings to speed up plotting large amounts of data.
+It can be used simply by running:
+
+``` python
+import matplotlib.style as mplstyle
+mplstyle.use('fast')
+```
+
+It is very lightweight, so it plays nicely with other
+styles; just make sure the fast style is applied last
+so that other styles do not overwrite its settings:
+
+``` python
+mplstyle.use(['dark_background', 'ggplot', 'fast'])
+```
+
+## Download
+
+- [Download Python source code: usage.py](https://matplotlib.org/_downloads/841a514c2538fd0de68b22f22b25f56d/usage.py)
+- [Download Jupyter notebook: usage.ipynb](https://matplotlib.org/_downloads/16d604c55fb650c0dce205aa67def02b/usage.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/pyplot_attr.md b/Python/matplotlab/pyplot_attr.md
new file mode 100644
index 00000000..959cf746
--- /dev/null
+++ b/Python/matplotlab/pyplot_attr.md
@@ -0,0 +1,279 @@
+# Object-Oriented Plotting
+
+
+## Configuration parameters
+
+* axes: color of the axis boundaries and face, size of the tick values, and whether the grid is shown
+* figure: controls dpi, edge color, figure size, and subplot settings
+* font: font family, font size, and style settings
+* grid: grid color and line style
+* legend: how the legend and the text inside it are displayed
+* lines: line properties (color, linestyle, width, etc.) and markers
+* patch: patches are 2D shapes that fill space, such as polygons and circles; controls line width, color, antialiasing, etc.
+* savefig: separate settings for saved figures, e.g. a white background for rendered files
+* verbose: how much information matplotlib prints while running, such as silent, helpful, debug, and debug-annoying
+* xtick and ytick: color, size, and direction of the major and minor ticks of the x and y axes, plus the tick-label size
+
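+These rc groups can be tuned per script with ``matplotlib.rc``. A minimal
+sketch, with purely illustrative values:
+
+```py
+import matplotlib as mpl
+
+# Each keyword maps to an rc entry such as 'lines.linewidth' or 'figure.dpi'
+mpl.rc('lines', linewidth=2, linestyle='--')
+mpl.rc('font', size=12)
+mpl.rc('figure', figsize=(8, 5), dpi=100)
+```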
+
+## Line styles
+
+linestyle (or ls) | Description
+----|---
+'-' | solid line
+':' | dotted line
+'--' | dashed line
+'None', ' ', '' | draw nothing
+'-.' | dash-dot line
+
+## Line markers
+
+marker | Description
+----|----
+'o' | circle
+'.' | point
+'D' | diamond
+'s' | square
+'h' | hexagon 1
+'*' | star
+'H' | hexagon 2
+'d' | thin diamond
+'_' | horizontal line
+'v' | downward-pointing triangle
+'8' | octagon
+'<' | left-pointing triangle
+'p' | pentagon
+'>' | right-pointing triangle
+',' | pixel
+'^' | upward-pointing triangle
+'+' | plus
+'\|' | vertical line
+'None', '', ' ' | nothing
+'x' | x
+
+
+## Colors
+
+Alias | Color
+---|---
+b | blue
+g | green
+r | red
+y | yellow
+c | cyan
+k | black
+m | magenta
+w | white
+
+
+## Plotting steps
+
+```py
+import numpy as np
+import matplotlib.pyplot as plt
+from matplotlib.ticker import MultipleLocator
+
+# Generate the data with numpy
+x = np.arange(-5, 5, 0.1)
+y = x * 3
+
+# Create a window (figure) and subplots
+# Method 1: create the window first, then add subplots (always drawn)
+fig = plt.figure(num=1, figsize=(15, 8), dpi=80)  # open a window and set its size and resolution
+ax1 = fig.add_subplot(2, 1, 1)  # add a subplot through fig; arguments: rows, columns, index
+ax2 = fig.add_subplot(2, 1, 2)  # add a subplot through fig; arguments: rows, columns, index
+print(fig, ax1, ax2)
+# Method 2: create the window and several subplots at once (blank axes are not drawn)
+fig, axarr = plt.subplots(4, 1)  # open a new window, add 4 subplots, and return the array of axes
+ax1 = axarr[0]  # pick one subplot out of the array
+print(fig, ax1)
+# Method 3: create the window and a single subplot at once (blank axes are not drawn)
+ax1 = plt.subplot(1, 1, 1, facecolor='white')  # open a new window with one subplot; facecolor sets the background color
+print(ax1)
+# Getting a reference to the window works for all three methods
+# fig = plt.gcf()  # get the current figure
+# fig = ax1.figure  # get the figure a given subplot belongs to
+
+# fig.subplots_adjust(left=0)  # set the left padding of the window to 0, i.e. no blank space on the left
+
+# Set the basic elements of the subplot
+ax1.set_title('python-drawing')  # set the title, cf. plt.title
+ax1.set_xlabel('x-name')  # set the x-axis label, cf. plt.xlabel
+ax1.set_ylabel('y-name')  # set the y-axis label, cf. plt.ylabel
+plt.axis([-6, 6, -10, 10])  # set x and y limits together; on a subplot this splits into the two calls below
+ax1.set_xlim(-5, 5)  # set the x limits, overriding the line above, cf. plt.xlim
+ax1.set_ylim(-10, 10)  # set the y limits, overriding the line above, cf. plt.ylim
+
+xmajorLocator = MultipleLocator(2)  # place an x major tick (and label) at every multiple of 2
+ymajorLocator = MultipleLocator(3)  # place a y major tick (and label) at every multiple of 3
+
+ax1.xaxis.set_major_locator(xmajorLocator)  # apply the major locator to the x axis; otherwise the default is used
+ax1.yaxis.set_major_locator(ymajorLocator)  # apply the major locator to the y axis; otherwise the default is used
+
+ax1.xaxis.grid(True, which='major')  # draw the x grid at the major ticks
+ax1.yaxis.grid(True, which='major')  # draw the y grid at the major ticks
+
+ax1.set_xticks([])  # remove the x ticks
+ax1.set_xticks((-5, -3, -1, 1, 3, 5))  # set the tick positions
+ax1.set_xticklabels(labels=['x1', 'x2', 'x3', 'x4', 'x5'], rotation=-30, fontsize='small')  # tick label text; rotation angle, fontsize label size
+
+plot1 = ax1.plot(x, y, marker='o', color='g', label='legend1')  # point plot: marker sets the symbol
+plot2 = ax1.plot(x, y, linestyle='--', alpha=0.5, color='r', label='legend2')  # line plot: linestyle, alpha (transparency), color, label (legend text)
+
+ax1.legend(loc='upper left')  # show the legend, cf. plt.legend()
+ax1.text(2.8, 7, r'y=3*x')  # place text at a given position, cf. plt.text()
+ax1.annotate('important point', xy=(2, 6), xytext=(3, 1.5),  # add an annotation; arguments: text, annotated point, text position, arrow properties
+             arrowprops=dict(facecolor='black', shrink=0.05),
+             )
+# Show the grid. which is 'major' (major ticks only), 'minor' (minor ticks only) or 'both' (default 'major'); axis is 'x', 'y' or 'both'
+ax1.grid(b=True, which='major', axis='both', alpha=0.5, color='skyblue', linestyle='--', linewidth=2)
+
+axes1 = plt.axes([.2, .3, .1, .1], facecolor='y')  # add an inset axes; rect=[left, bottom, width, height] is an absolute layout that does not take space from existing axes
+axes1.plot(x, y)  # plot on the inset axes
+plt.savefig('aa.jpg', dpi=400, bbox_inches='tight')  # save the figure; dpi is the resolution, bbox_inches controls the white space around the figure
+plt.show()  # open the windows; method 1 always draws, methods 2 and 3 skip windows whose axes are all blank
+```
+
+## plot properties
+
+Property | Value type
+---|---
+alpha | float
+animated | [True / False]
+antialiased or aa | [True / False]
+clip_box | a matplotlib.transforms.Bbox instance
+clip_on | [True / False]
+clip_path | a Path instance, a Transform, or a Patch instance
+color or c | any matplotlib color
+contains | a hit-testing function
+dash_capstyle | ['butt' / 'round' / 'projecting']
+dash_joinstyle | ['miter' / 'round' / 'bevel']
+dashes | on/off ink sequence in points
+data | (np.array xdata, np.array ydata)
+figure | a matplotlib.figure.Figure instance
+label | any string
+linestyle or ls | ['-' / '--' / '-.' / ':' / 'steps' / ...]
+linewidth or lw | float value in points
+lod | [True / False]
+marker | ['+' / ',' / '.' / '1' / '2' / '3' / '4']
+markeredgecolor or mec | any matplotlib color
+markeredgewidth or mew | float value in points
+markerfacecolor or mfc | any matplotlib color
+markersize or ms | float
+markevery | [None / integer / (startind, stride)]
+picker | used for interactive line selection
+pickradius | the pick radius of the line
+solid_capstyle | ['butt' / 'round' / 'projecting']
+solid_joinstyle | ['miter' / 'round' / 'bevel']
+transform | a matplotlib.transforms.Transform instance
+visible | [True / False]
+xdata | np.array
+ydata | np.array
+zorder | any number
+
+## Multiple plots
+
+```py
+# One window, several subplots, several datasets
+sub1 = plt.subplot(211, facecolor=(0.1843, 0.3098, 0.3098))  # split the window into 2 rows x 1 column, draw in the 1st slot, and set its background color
+sub2 = plt.subplot(212)  # split the window into 2 rows x 1 column, draw in the 2nd slot
+sub1.plot(x, y)  # draw on the first subplot
+sub2.plot(x, y)  # draw on the second subplot
+
+axes1 = plt.axes([.2, .3, .1, .1], facecolor='y')  # add an inset axes; rect=[left, bottom, width, height]
+plt.plot(x, y)  # draw on the inset axes
+axes2 = plt.axes([0.7, .2, .1, .1], facecolor='y')  # add another inset axes; rect=[left, bottom, width, height]
+plt.plot(x, y)
+plt.show()
+```
+
+## Polar plots
+
+```py
+fig = plt.figure(2)  # open a new window
+ax1 = fig.add_subplot(1, 2, 1, polar=True)  # start a polar subplot
+theta = np.arange(0, 2*np.pi, 0.02)  # array of angles
+ax1.plot(theta, 2*np.ones_like(theta), lw=2)  # plot; arguments: angle, radius, lw line width
+ax1.plot(theta, theta/6, linestyle='--', lw=2)  # plot; arguments: angle, radius, linestyle, lw line width
+
+ax2 = fig.add_subplot(1, 2, 2, polar=True)  # start another polar subplot
+ax2.plot(theta, np.cos(5*theta), linestyle='--', lw=2)
+ax2.plot(theta, 2*np.cos(4*theta), lw=2)
+
+ax2.set_rgrids(np.arange(0.2, 2, 0.2), angle=45)  # radial grid: tick values and the angle at which labels are shown
+ax2.set_thetagrids([0, 45, 90])  # angular grid, in degrees from 0 to 360
+
+plt.show()
+```
+
+## Bar charts
+
+```py
+plt.figure(3)
+x_index = np.arange(5)  # index of each bar
+x_data = ('A', 'B', 'C', 'D', 'E')
+y1_data = (20, 35, 30, 35, 27)
+y2_data = (25, 32, 34, 20, 25)
+bar_width = 0.35  # width of each individual bar
+
+rects1 = plt.bar(x_index, y1_data, width=bar_width, alpha=0.4, color='b', label='legend1')  # arguments: left offset, height, bar width, transparency, color, legend label
+rects2 = plt.bar(x_index + bar_width, y2_data, width=bar_width, alpha=0.5, color='r', label='legend2')  # arguments: left offset, height, bar width, transparency, color, legend label
+# No need to worry about centering each bar; just put the tick marks in the middle of the bars
+plt.xticks(x_index + bar_width/2, x_data)  # x-axis tick marks
+plt.legend()  # show the legend
+plt.tight_layout()  # automatically trims the outer margins; it does not control the spacing between subplots very well
+plt.show()
+```
+
+## Histograms
+
+```py
+fig, (ax0, ax1) = plt.subplots(nrows=2, figsize=(9, 6))  # add 2 subplots to the window
+sigma = 1  # standard deviation
+mean = 0  # mean
+x = mean + sigma*np.random.randn(10000)  # normally distributed random numbers
+ax0.hist(x, bins=40, density=False, histtype='bar', facecolor='yellowgreen', alpha=0.75)  # density: whether to normalize (replaces the old `normed`), histtype, facecolor, alpha
+ax1.hist(x, bins=20, density=True, histtype='bar', facecolor='pink', alpha=0.75, cumulative=True, rwidth=0.8)  # bins: number of bars, cumulative: cumulative distribution, rwidth: bar width
+plt.show()  # show all windows
+```
+
+## Scatter plots
+
+```py
+fig = plt.figure(4)  # add a window
+ax = fig.add_subplot(1, 1, 1)  # add a subplot to the window
+x = np.random.random(100)  # random array
+y = np.random.random(100)  # random array
+ax.scatter(x, y, s=x*1000, c='y', marker=(5, 1), alpha=0.5, lw=2, facecolors='none')  # x, y coordinates, s marker size, c color, marker shape, lw edge width
+plt.show()  # show all windows
+```
+
+## 3D plots
+
+```py
+from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3d projection)
+
+fig = plt.figure(5)
+ax = fig.add_subplot(1, 1, 1, projection='3d')  # create a 3D subplot
+
+x, y = np.mgrid[-2:2:20j, -2:2:20j]  # x-axis and y-axis grids
+z = x*np.exp(-x**2 - y**2)  # z-axis data
+
+ax.plot_surface(x, y, z, rstride=2, cstride=1, cmap=plt.cm.coolwarm, alpha=0.8)  # draw the 3D surface
+ax.set_xlabel('x-name')  # x-axis label
+ax.set_ylabel('y-name')  # y-axis label
+ax.set_zlabel('z-name')  # z-axis label
+
+plt.show()
+```
+
+## Patches (basic shapes)
+
+```py
+fig = plt.figure(6)  # create a window
+ax = fig.add_subplot(1, 1, 1)  # add a subplot
+rect1 = plt.Rectangle((0.1, 0.2), 0.2, 0.3, color='r')  # create a rectangle; arguments: (x, y), width, height
+circ1 = plt.Circle((0.7, 0.2), 0.15, color='r', alpha=0.3)  # create a circle; arguments: center, radius; by default it is squashed along with the window aspect ratio
+pgon1 = plt.Polygon([[0.45, 0.45], [0.65, 0.6], [0.2, 0.6]])  # create a polygon; argument: the list of vertices
+
+ax.add_patch(rect1)  # add the shape to the subplot
+ax.add_patch(circ1)  # add the shape to the subplot
+ax.add_patch(pgon1)  # add the shape to the subplot
+
+fig.canvas.draw()  # redraw the canvas
+plt.show()
+```
\ No newline at end of file
diff --git a/Python/matplotlab/pyplot_function.md b/Python/matplotlab/pyplot_function.md
new file mode 100644
index 00000000..1b8d4a16
--- /dev/null
+++ b/Python/matplotlab/pyplot_function.md
@@ -0,0 +1,184 @@
+
+
+## Module overview
+
+matplotlib.pyplot is a state-based interface to matplotlib. It provides a MATLAB-like way of plotting.
+
+
+## Usage
+pyplot is mainly intended for interactive plots and simple cases of programmatic plot generation:
+
+```py
+import numpy as np
+import matplotlib.pyplot as plt
+
+x = np.arange(0, 5, 0.1)
+y = np.sin(x)
+plt.plot(x, y)
+```
+
+
+
+## Functions
+
+| Name | Description |
+|------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
+| acorr\(x, \*\[, data\]\) | Plot the autocorrelation of x\. |
+| angle\_spectrum\(x\[, Fs, Fc, window, pad\_to, \.\.\.\]\) | Plot the angle spectrum\. |
+| annotate\(text, xy, \*args, \*\*kwargs\) | Annotate the point xy with text text\. |
+| arrow\(x, y, dx, dy, \*\*kwargs\) | Add an arrow to the axes\. |
+| autoscale\(\[enable, axis, tight\]\) | Autoscale the axis view to the data \(toggle\)\. |
+| autumn\(\) | Set the colormap to "autumn"\. |
+| axes\(\[arg\]\) | Add an axes to the current figure and make it the current axes\. |
+| axhline\(\[y, xmin, xmax\]\) | Add a horizontal line across the axis\. |
+| axhspan\(ymin, ymax\[, xmin, xmax\]\) | Add a horizontal span \(rectangle\) across the axis\. |
+| axis\(\*args\[, emit\]\) | Convenience method to get or set some axis properties\. |
+| axline\(xy1\[, xy2, slope\]\) | Add an infinitely long straight line\. |
+| axvline\(\[x, ymin, ymax\]\) | Add a vertical line across the axes\. |
+| axvspan\(xmin, xmax\[, ymin, ymax\]\) | Add a vertical span \(rectangle\) across the axes\. |
+| bar\(x, height\[, width, bottom, align, data\]\) | Make a bar plot\. |
+| barbs\(\*args\[, data\]\) | Plot a 2D field of barbs\. |
+| barh\(y, width\[, height, left, align\]\) | Make a horizontal bar plot\. |
+| bone\(\) | Set the colormap to "bone"\. |
+| box\(\[on\]\) | Turn the axes box on or off on the current axes\. |
+| boxplot\(x\[, notch, sym, vert, whis, \.\.\.\]\) | Make a box and whisker plot\. |
+| broken\_barh\(xranges, yrange, \*\[, data\]\) | Plot a horizontal sequence of rectangles\. |
+| cla\(\) | Clear the current axes\. |
+| clabel\(CS\[, levels\]\) | Label a contour plot\. |
+| clf\(\) | Clear the current figure\. |
+| clim\(\[vmin, vmax\]\) | Set the color limits of the current image\. |
+| close\(\[fig\]\) | Close a figure window\. |
+| cohere\(x, y\[, NFFT, Fs, Fc, detrend, \.\.\.\]\) | Plot the coherence between x and y\. |
+| colorbar\(\[mappable, cax, ax\]\) | Add a colorbar to a plot\. |
+| connect\(s, func\) | Bind function func to event s\. |
+| contour\(\*args\[, data\]\) | Plot contours\. |
+| contourf\(\*args\[, data\]\) | Plot contours\. |
+| cool\(\) | Set the colormap to "cool"\. |
+| copper\(\) | Set the colormap to "copper"\. |
+| csd\(x, y\[, NFFT, Fs, Fc, detrend, window, \.\.\.\]\) | Plot the cross\-spectral density\. |
+| delaxes\(\[ax\]\) | Remove an Axes \(defaulting to the current axes\) from its figure\. |
+| disconnect\(cid\) | Disconnect the callback with id cid\. |
+| draw\(\) | Redraw the current figure\. |
+| draw\_if\_interactive\(\) | |
+| errorbar\(x, y\[, yerr, xerr, fmt, ecolor, \.\.\.\]\) | Plot y versus x as lines and/or markers with attached errorbars\. |
+| eventplot\(positions\[, orientation, \.\.\.\]\) | Plot identical parallel lines at the given positions\. |
+| figimage\(X\[, xo, yo, alpha, norm, cmap, \.\.\.\]\) | Add a non\-resampled image to the figure\. |
+| figlegend\(\*args, \*\*kwargs\) | Place a legend on the figure\. |
+| fignum\_exists\(num\) | Return whether the figure with the given id exists\. |
+| figtext\(x, y, s\[, fontdict\]\) | Add text to figure\. |
+| figure\(\[num, figsize, dpi, facecolor, \.\.\.\]\) | Create a new figure, or activate an existing figure\. |
+| fill\(\*args\[, data\]\) | Plot filled polygons\. |
+| fill\_between\(x, y1\[, y2, where, \.\.\.\]\) | Fill the area between two horizontal curves\. |
+| fill\_betweenx\(y, x1\[, x2, where, step, \.\.\.\]\) | Fill the area between two vertical curves\. |
+| findobj\(\[o, match, include\_self\]\) | Find artist objects\. |
+| flag\(\) | Set the colormap to "flag"\. |
+| gca\(\*\*kwargs\) | Get the current axes, creating one if necessary\. |
+| gcf\(\) | Get the current figure\. |
+| gci\(\) | Get the current colorable artist\. |
+| get\(obj, \*args, \*\*kwargs\) | Return the value of an object's property, or print all of them\. |
+| get\_current\_fig\_manager\(\) | Return the figure manager of the current figure\. |
+| get\_figlabels\(\) | Return a list of existing figure labels\. |
+| get\_fignums\(\) | Return a list of existing figure numbers\. |
+| get\_plot\_commands\(\) | Get a sorted list of all of the plotting commands\. |
+| getp\(obj, \*args, \*\*kwargs\) | Return the value of an object's property, or print all of them\. |
+| ginput\(\[n, timeout, show\_clicks, mouse\_add, \.\.\.\]\) | Blocking call to interact with a figure\. |
+| gray\(\) | Set the colormap to "gray"\. |
+| grid\(\[b, which, axis\]\) | Configure the grid lines\. |
+| hexbin\(x, y\[, C, gridsize, bins, xscale, \.\.\.\]\) | Make a 2D hexagonal binning plot of points x, y\. |
+| hist\(x\[, bins, range, density, weights, \.\.\.\]\) | Plot a histogram\. |
+| hist2d\(x, y\[, bins, range, density, \.\.\.\]\) | Make a 2D histogram plot\. |
+| hlines\(y, xmin, xmax\[, colors, linestyles, \.\.\.\]\) | Plot horizontal lines at each y from xmin to xmax\. |
+| hot\(\) | Set the colormap to "hot"\. |
+| hsv\(\) | Set the colormap to "hsv"\. |
+| imread\(fname\[, format\]\) | Read an image from a file into an array\. |
+| imsave\(fname, arr, \*\*kwargs\) | Save an array as an image file\. |
+| imshow\(X\[, cmap, norm, aspect, \.\.\.\]\) | Display data as an image, i\.e\., on a 2D regular raster\. |
+| inferno\(\) | Set the colormap to "inferno"\. |
+| install\_repl\_displayhook\(\) | Install a repl display hook so that any stale figure are automatically redrawn when control is returned to the repl\. |
+| ioff\(\) | Turn the interactive mode off\. |
+| ion\(\) | Turn the interactive mode on\. |
+| isinteractive\(\) | Return if pyplot is in "interactive mode" or not\. |
+| jet\(\) | Set the colormap to "jet"\. |
+| legend\(\*args, \*\*kwargs\) | Place a legend on the axes\. |
+| locator\_params\(\[axis, tight\]\) | Control behavior of major tick locators\. |
+| loglog\(\*args, \*\*kwargs\) | Make a plot with log scaling on both the x and y axis\. |
+| magma\(\) | Set the colormap to "magma"\. |
+| magnitude\_spectrum\(x\[, Fs, Fc, window, \.\.\.\]\) | Plot the magnitude spectrum\. |
+| margins\(\*margins\[, x, y, tight\]\) | Set or retrieve autoscaling margins\. |
+| matshow\(A\[, fignum\]\) | Display an array as a matrix in a new figure window\. |
+| minorticks\_off\(\) | Remove minor ticks from the axes\. |
+| minorticks\_on\(\) | Display minor ticks on the axes\. |
+| new\_figure\_manager\(num, \*args, \*\*kwargs\) | Create a new figure manager instance\. |
+| nipy\_spectral\(\) | Set the colormap to "nipy\_spectral"\. |
+| pause\(interval\) | Run the GUI event loop for interval seconds\. |
+| pcolor\(\*args\[, shading, alpha, norm, cmap, \.\.\.\]\) | Create a pseudocolor plot with a non\-regular rectangular grid\. |
+| pcolormesh\(\*args\[, alpha, norm, cmap, vmin, \.\.\.\]\) | Create a pseudocolor plot with a non\-regular rectangular grid\. |
+| phase\_spectrum\(x\[, Fs, Fc, window, pad\_to, \.\.\.\]\) | Plot the phase spectrum\. |
+| pie\(x\[, explode, labels, colors, autopct, \.\.\.\]\) | Plot a pie chart\. |
+| pink\(\) | Set the colormap to "pink"\. |
+| plasma\(\) | Set the colormap to "plasma"\. |
+| plot\(\*args\[, scalex, scaley, data\]\) | Plot y versus x as lines and/or markers\. |
+| plot\_date\(x, y\[, fmt, tz, xdate, ydate, data\]\) | Plot data that contains dates\. |
+| polar\(\*args, \*\*kwargs\) | Make a polar plot\. |
+| prism\(\) | Set the colormap to "prism"\. |
+| psd\(x\[, NFFT, Fs, Fc, detrend, window, \.\.\.\]\) | Plot the power spectral density\. |
+| quiver\(\*args\[, data\]\) | Plot a 2D field of arrows\. |
+| quiverkey\(Q, X, Y, U, label, \*\*kw\) | Add a key to a quiver plot\. |
+| rc\(group, \*\*kwargs\) | Set the current rcParams\. |
+| rc\_context\(\[rc, fname\]\) | Return a context manager for temporarily changing rcParams\. |
+| rcdefaults\(\) | Restore the rcParams from Matplotlib's internal default style\. |
+| rgrids\(\[radii, labels, angle, fmt\]\) | Get or set the radial gridlines on the current polar plot\. |
+| savefig\(\*args, \*\*kwargs\) | Save the current figure\. |
+| sca\(ax\) | Set the current Axes to ax and the current Figure to the parent of ax\. |
+| scatter\(x, y\[, s, c, marker, cmap, norm, \.\.\.\]\) | A scatter plot of y vs\. |
+| sci\(im\) | Set the current image\. |
+| semilogx\(\*args, \*\*kwargs\) | Make a plot with log scaling on the x axis\. |
+| semilogy\(\*args, \*\*kwargs\) | Make a plot with log scaling on the y axis\. |
+| set\_cmap\(cmap\) | Set the default colormap, and applies it to the current image if any\. |
+| setp\(obj, \*args, \*\*kwargs\) | Set a property on an artist object\. |
+| show\(\*\[, block\]\) | Display all open figures\. |
+| specgram\(x\[, NFFT, Fs, Fc, detrend, window, \.\.\.\]\) | Plot a spectrogram\. |
+| spring\(\) | Set the colormap to "spring"\. |
+| spy\(Z\[, precision, marker, markersize, \.\.\.\]\) | Plot the sparsity pattern of a 2D array\. |
+| stackplot\(x, \*args\[, labels, colors, \.\.\.\]\) | Draw a stacked area plot\. |
+| stem\(\*args\[, linefmt, markerfmt, basefmt, \.\.\.\]\) | Create a stem plot\. |
+| step\(x, y, \*args\[, where, data\]\) | Make a step plot\. |
+| streamplot\(x, y, u, v\[, density, linewidth, \.\.\.\]\) | Draw streamlines of a vector flow\. |
+| subplot\(\*args, \*\*kwargs\) | Add a subplot to the current figure\. |
+| subplot2grid\(shape, loc\[, rowspan, colspan, fig\]\) | Create a subplot at a specific location inside a regular grid\. |
+| subplot\_mosaic\(layout, \*\[, subplot\_kw, \.\.\.\]\) | Build a layout of Axes based on ASCII art or nested lists\. |
+| subplot\_tool\(\[targetfig\]\) | Launch a subplot tool window for a figure\. |
+| subplots\(\[nrows, ncols, sharex, sharey, \.\.\.\]\) | Create a figure and a set of subplots\. |
+| subplots\_adjust\(\[left, bottom, right, top, \.\.\.\]\) | Adjust the subplot layout parameters\. |
+| summer\(\) | Set the colormap to "summer"\. |
+| suptitle\(t, \*\*kwargs\) | Add a centered title to the figure\. |
+| switch\_backend\(newbackend\) | Close all open figures and set the Matplotlib backend\. |
+| table\(\[cellText, cellColours, cellLoc, \.\.\.\]\) | Add a table to an Axes\. |
+| text\(x, y, s\[, fontdict\]\) | Add text to the axes\. |
+| thetagrids\(\[angles, labels, fmt\]\) | Get or set the theta gridlines on the current polar plot\. |
+| tick\_params\(\[axis\]\) | Change the appearance of ticks, tick labels, and gridlines\. |
+| ticklabel\_format\(\*\[, axis, style, \.\.\.\]\) | Configure the ScalarFormatter used by default for linear axes\. |
+| tight\_layout\(\*\[, pad, h\_pad, w\_pad, rect\]\) | Adjust the padding between and around subplots\. |
+| title\(label\[, fontdict, loc, pad, y\]\) | Set a title for the axes\. |
+| tricontour\(\*args, \*\*kwargs\) | Draw contour lines on an unstructured triangular grid\. |
+| tricontourf\(\*args, \*\*kwargs\) | Draw contour regions on an unstructured triangular grid\. |
+| tripcolor\(\*args\[, alpha, norm, cmap, vmin, \.\.\.\]\) | Create a pseudocolor plot of an unstructured triangular grid\. |
+| triplot\(\*args, \*\*kwargs\) | Draw a unstructured triangular grid as lines and/or markers\. |
+| twinx\(\[ax\]\) | Make and return a second axes that shares the x\-axis\. |
+| twiny\(\[ax\]\) | Make and return a second axes that shares the y\-axis\. |
+| uninstall\_repl\_displayhook\(\) | Uninstall the matplotlib display hook\. |
+| violinplot\(dataset\[, positions, vert, \.\.\.\]\) | Make a violin plot\. |
+| viridis\(\) | Set the colormap to "viridis"\. |
+| vlines\(x, ymin, ymax\[, colors, linestyles, \.\.\.\]\) | Plot vertical lines\. |
+| waitforbuttonpress\(\[timeout\]\) | Blocking call to interact with the figure\. |
+| winter\(\) | Set the colormap to "winter"\. |
+| xcorr\(x, y\[, normed, detrend, usevlines, \.\.\.\]\) | Plot the cross correlation between x and y\. |
+| xkcd\(\[scale, length, randomness\]\) | Turn on xkcd sketch\-style drawing mode\. This will only have an effect on things drawn after this function is called\. |
+| xlabel\(xlabel\[, fontdict, labelpad, loc\]\) | Set the label for the x\-axis\. |
+| xlim\(\*args, \*\*kwargs\) | Get or set the x limits of the current axes\. |
+| xscale\(value, \*\*kwargs\) | Set the x\-axis scale\. |
+| xticks\(\[ticks, labels\]\) | Get or set the current tick locations and labels of the x\-axis\. |
+| ylabel\(ylabel\[, fontdict, labelpad, loc\]\) | Set the label for the y\-axis\. |
+| ylim\(\*args, \*\*kwargs\) | Get or set the y\-limits of the current axes\. |
+| yscale\(value, \*\*kwargs\) | Set the y\-axis scale\. |
+| yticks\(\[ticks, labels\]\) | Get or set the current tick locations and labels of the y\-axis\. |
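+
+As a quick illustration, here is a minimal sketch that combines a few of the
+functions listed above; the data and the output filename are arbitrary:
+
+```py
+import numpy as np
+import matplotlib.pyplot as plt
+
+x = np.linspace(0, 2 * np.pi, 200)
+fig, (ax1, ax2) = plt.subplots(2, 1)      # subplots()
+ax1.plot(x, np.sin(x), label='sin')       # plot()
+ax2.plot(x, np.cos(x), label='cos')
+plt.suptitle('pyplot sampler')            # suptitle()
+plt.xlabel('x')                           # xlabel() acts on the current axes (ax2)
+plt.legend()                              # legend() acts on the current axes (ax2)
+plt.tight_layout()                        # tight_layout()
+plt.savefig('sampler.png')                # savefig(); placeholder filename
+plt.show()                                # show()
+```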
diff --git a/Python/matplotlab/text/annotations.md b/Python/matplotlab/text/annotations.md
new file mode 100644
index 00000000..54cd3b0d
--- /dev/null
+++ b/Python/matplotlab/text/annotations.md
@@ -0,0 +1,580 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Annotations
+
+Annotating text with Matplotlib.
+
+Table of Contents
+
+- [Annotations](#annotations)
+- [Basic annotation](#basic-annotation)
+- [Advanced Annotation](#advanced-annotation)
+  - [Annotating with Text with Box](#annotating-with-text-with-box)
+  - [Annotating with Arrow](#annotating-with-arrow)
+  - [Placing Artist at the anchored location of the Axes](#placing-artist-at-the-anchored-location-of-the-axes)
+  - [Using Complex Coordinates with Annotations](#using-complex-coordinates-with-annotations)
+  - [Using ConnectionPatch](#using-connectionpatch)
+  - [Advanced Topics](#advanced-topics)
+    - [Zoom effect between Axes](#zoom-effect-between-axes)
+    - [Define Custom BoxStyle](#define-custom-boxstyle)
+
+# Basic annotation
+
+Basic use of [``text()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.text.html#matplotlib.pyplot.text) will place text
+at an arbitrary position on the Axes. A common use case of text is to
+annotate some feature of the plot, and the
+``annotate()`` method provides helper functionality
+to make annotations easy. In an annotation, there are two points to
+consider: the location being annotated represented by the argument
+``xy`` and the location of the text ``xytext``. Both of these
+arguments are ``(x,y)`` tuples.
+
+Annotation Basic
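+
+A minimal, self-contained sketch of this pattern; the curve and the annotated
+point are arbitrary:
+
+``` python
+import numpy as np
+import matplotlib.pyplot as plt
+
+fig, ax = plt.subplots()
+t = np.arange(0.0, 5.0, 0.01)
+ax.plot(t, np.cos(2 * np.pi * t))
+ax.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
+            arrowprops=dict(facecolor='black', shrink=0.05))
+ax.set_ylim(-2, 2)
+plt.show()
+```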
+
+In this example, both the ``xy`` (arrow tip) and ``xytext`` locations
+(text location) are in data coordinates. There are a variety of other
+coordinate systems one can choose -- you can specify the coordinate
+system of ``xy`` and ``xytext`` with one of the following strings for
+``xycoords`` and ``textcoords`` (default is 'data')
+
+
+argument | coordinate system
+---|---
+'figure points' | points from the lower left corner of the figure
+'figure pixels' | pixels from the lower left corner of the figure
+'figure fraction' | (0, 0) is lower left of figure and (1, 1) is upper right
+'axes points' | points from lower left corner of axes
+'axes pixels' | pixels from lower left corner of axes
+'axes fraction' | (0, 0) is lower left of axes and (1, 1) is upper right
+'data' | use the axes data coordinate system
+
+For example, to place the text in fractional axes
+coordinates, one could do:
+
+``` python
+ax.annotate('local max', xy=(3, 1), xycoords='data',
+ xytext=(0.8, 0.95), textcoords='axes fraction',
+ arrowprops=dict(facecolor='black', shrink=0.05),
+ horizontalalignment='right', verticalalignment='top',
+ )
+```
+
+For physical coordinate systems (points or pixels) the origin is the
+bottom-left of the figure or axes.
+
+Optionally, you can enable drawing of an arrow from the text to the annotated
+point by giving a dictionary of arrow properties in the optional keyword
+argument ``arrowprops``.
+
+
+arrowprops key | description
+---|---
+width | the width of the arrow in points
+frac | the fraction of the arrow length occupied by the head
+headwidth | the width of the base of the arrow head in points
+shrink | move the tip and base some percent away from the annotated point and text
+**kwargs | any key for [matplotlib.patches.Polygon](https://matplotlib.org/api/_as_gen/matplotlib.patches.Polygon.html#matplotlib.patches.Polygon), e.g., facecolor
+
+In the example below, the ``xy`` point is in native coordinates
+(``xycoords`` defaults to 'data'). For a polar axes, this is in
+(theta, radius) space. The text in this example is placed in the
+fractional figure coordinate system. [``matplotlib.text.Text``](https://matplotlib.org/api/text_api.html#matplotlib.text.Text)
+keyword args like ``horizontalalignment``, ``verticalalignment`` and
+``fontsize`` are passed from ``annotate`` to the
+``Text`` instance.
+
+Annotation Polar
+
+For more on all the wild and wonderful things you can do with
+annotations, including fancy arrows, see [Advanced Annotation](#advanced-annotation)
+and [Annotating Plots](https://matplotlib.org/gallery/text_labels_and_annotations/annotation_demo.html).
+
+Do not proceed unless you have already read [Basic annotation](#basic-annotation),
+[``text()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.text.html#matplotlib.pyplot.text) and [``annotate()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.annotate.html#matplotlib.pyplot.annotate)!
+# Advanced Annotation
+
+## Annotating with Text with Box
+
+Let's start with a simple example.
+
+Annotate Text Arrow
+
+The [``text()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.text.html#matplotlib.pyplot.text) function in the pyplot module (or
+the text method of the Axes class) takes a bbox keyword argument; when
+given, a box is drawn around the text.
+
+``` python
+bbox_props = dict(boxstyle="rarrow,pad=0.3", fc="cyan", ec="b", lw=2)
+t = ax.text(0, 0, "Direction", ha="center", va="center", rotation=45,
+ size=15,
+ bbox=bbox_props)
+```
+
+The patch object associated with the text can be accessed by:
+
+``` python
+bb = t.get_bbox_patch()
+```
+
+The return value is an instance of FancyBboxPatch and the patch
+properties like facecolor, edgewidth, etc. can be accessed and
+modified as usual. To change the shape of the box, use the *set_boxstyle*
+method.
+
+``` python
+bb.set_boxstyle("rarrow", pad=0.6)
+```
+
+The arguments are the name of the box style with its attributes as
+keyword arguments. Currently, the following box styles are implemented.
+
+
+Class | Name | Attrs
+---|---|---
+Circle | circle | pad=0.3
+DArrow | darrow | pad=0.3
+LArrow | larrow | pad=0.3
+RArrow | rarrow | pad=0.3
+Round | round | pad=0.3, rounding_size=None
+Round4 | round4 | pad=0.3, rounding_size=None
+Roundtooth | roundtooth | pad=0.3, tooth_size=None
+Sawtooth | sawtooth | pad=0.3, tooth_size=None
+Square | square | pad=0.3
+
+Fancybox Demo
+
+Note that the attribute arguments can be specified within the style
+name, separated by commas (this form can be used as the "boxstyle" value
+of the bbox argument when initializing the text instance):
+
+``` python
+bb.set_boxstyle("rarrow,pad=0.6")
+```
+
+## Annotating with Arrow
+
+The [``annotate()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.annotate.html#matplotlib.pyplot.annotate) function in the pyplot module
+(or annotate method of the Axes class) is used to draw an arrow
+connecting two points on the plot.
+
+``` python
+ax.annotate("Annotation",
+ xy=(x1, y1), xycoords='data',
+ xytext=(x2, y2), textcoords='offset points',
+ )
+```
+
+This annotates a point at ``xy`` in the given coordinate (``xycoords``)
+with the text at ``xytext`` given in ``textcoords``. Often, the
+annotated point is specified in the *data* coordinate and the annotating
+text in *offset points*.
+See [``annotate()``](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.annotate.html#matplotlib.pyplot.annotate) for available coordinate systems.
+
+An arrow connecting two points (xy & xytext) can be optionally drawn by
+specifying the ``arrowprops`` argument. To draw only an arrow, use an
+empty string as the first argument.
+
+``` python
+ax.annotate("",
+ xy=(0.2, 0.2), xycoords='data',
+ xytext=(0.8, 0.8), textcoords='data',
+ arrowprops=dict(arrowstyle="->",
+ connectionstyle="arc3"),
+ )
+```
+
+Annotate Simple01
+
+The arrow drawing takes a few steps.
+
+1. A connecting path between the two points is created. This is
+controlled by the ``connectionstyle`` key value.
+1. If a patch object is given (*patchA* & *patchB*), the path is clipped to
+avoid the patch.
+1. The path is further shrunk by the given amount of pixels (*shrinkA*
+& *shrinkB*).
+1. The path is transmuted to an arrow patch, which is controlled by the
+``arrowstyle`` key value.
+
+Annotate Explain
+
+The creation of the connecting path between two points is controlled by
+the ``connectionstyle`` key, and the following styles are available.
+
+
+Name | Attrs
+---|---
+angle | angleA=90, angleB=0, rad=0.0
+angle3 | angleA=90, angleB=0
+arc | angleA=0, angleB=0, armA=None, armB=None, rad=0.0
+arc3 | rad=0.0
+bar | armA=0.0, armB=0.0, fraction=0.3, angle=None
+
+Note that "3" in ``angle3`` and ``arc3`` is meant to indicate that the
+resulting path is a quadratic spline segment (three control
+points). As will be discussed below, some arrow style options can only
+be used when the connecting path is a quadratic spline.
+
+The behavior of each connection style is (limitedly) demonstrated in the
+example below. (Warning: the behavior of the ``bar`` style is currently not
+well defined; it may change in the future.)
+
+Connectionstyle Demo
+
+The connecting path (after clipping and shrinking) is then mutated to
+an arrow patch, according to the given ``arrowstyle``.
+
+
+Name | Attrs
+---|---
+- | None
+-> | head_length=0.4, head_width=0.2
+-[ | widthB=1.0, lengthB=0.2, angleB=None
+\|-\| | widthA=1.0, widthB=1.0
+-\|> | head_length=0.4, head_width=0.2
+<- | head_length=0.4, head_width=0.2
+<-> | head_length=0.4, head_width=0.2
+<\|- | head_length=0.4, head_width=0.2
+<\|-\|> | head_length=0.4, head_width=0.2
+fancy | head_length=0.4, head_width=0.4, tail_width=0.4
+simple | head_length=0.5, head_width=0.5, tail_width=0.2
+wedge | tail_width=0.3, shrink_factor=0.5
+
+Fancyarrow Demo
+
+Some arrowstyles only work with connection styles that generate a
+quadratic-spline segment. They are ``fancy``, ``simple``, and ``wedge``.
+For these arrow styles, you must use the "angle3" or "arc3" connection
+style.
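+
+A small sketch of this constraint; the positions are arbitrary, and note the
+``arc3`` connection paired with the ``fancy`` arrow style:
+
+``` python
+import matplotlib.pyplot as plt
+
+fig, ax = plt.subplots()
+ax.annotate("", xy=(0.2, 0.2), xycoords='data',
+            xytext=(0.8, 0.8), textcoords='data',
+            arrowprops=dict(arrowstyle="fancy",
+                            connectionstyle="arc3,rad=0.3",  # quadratic spline, as required
+                            facecolor="0.6", edgecolor="none"))
+plt.show()
+```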
+
+If an annotation string is given, *patchA* is set to the bbox patch
+of the text by default.
+
+Annotate Simple02
+
+As in the text command, a box around the text can be drawn using
+the ``bbox`` argument.
+
+Annotate Simple03
+
+By default, the starting point is set to the center of the text
+extent. This can be adjusted with the ``relpos`` key value. The values
+are normalized to the extent of the text. For example, (0,0) means
+lower-left corner and (1,1) means top-right.
+
+Annotate Simple04
+
+## Placing Artist at the anchored location of the Axes
+
+There are classes of artists that can be placed at an anchored location
+in the Axes. A common example is the legend. This type of artist can
+be created by using the OffsetBox class. A few predefined classes are
+available in ``mpl_toolkits.axes_grid1.anchored_artists``; others are in
+``matplotlib.offsetbox``.
+
+``` python
+from matplotlib.offsetbox import AnchoredText
+at = AnchoredText("Figure 1a",
+ prop=dict(size=15), frameon=True,
+ loc='upper left',
+ )
+at.patch.set_boxstyle("round,pad=0.,rounding_size=0.2")
+ax.add_artist(at)
+```
+
+Anchored Box01
+
+The *loc* keyword has the same meaning as in the legend command.
+
+A simple application is when the size of the artist (or collection of
+artists) is known in pixels at creation time. For
+example, if you want to draw a circle with a fixed size of 20 pixels x 20
+pixels (radius = 10 pixels), you can use
+``AnchoredDrawingArea``. The instance is created with the size of the
+drawing area (in pixels), and arbitrary artists can be added to the
+drawing area. Note that the extents of the artists that are added to
+the drawing area are not related to the placement of the drawing
+area itself. Only the initial size matters.
+
+``` python
+from matplotlib.patches import Circle
+from mpl_toolkits.axes_grid1.anchored_artists import AnchoredDrawingArea
+
+ada = AnchoredDrawingArea(20, 20, 0, 0,
+                          loc='upper right', pad=0., frameon=False)
+p1 = Circle((10, 10), 10)
+ada.drawing_area.add_artist(p1)
+p2 = Circle((30, 10), 5, fc="r")
+ada.drawing_area.add_artist(p2)
+ax.add_artist(ada)  # add the anchored drawing area to an existing Axes
+```
+
+The artists that are added to the drawing area should not have a
+transform set (it will be overridden) and the dimensions of those
+artists are interpreted in pixel coordinates, i.e., the radii of the
+circles in the above example are 10 pixels and 5 pixels, respectively.
+
+Anchored Box02
+
+Sometimes, you want your artists to scale with the data coordinate (or
+coordinates other than canvas pixels). You can use the
+``AnchoredAuxTransformBox`` class. This is similar to
+``AnchoredDrawingArea`` except that the extent of the artist is
+determined during the drawing time respecting the specified transform.
+
+``` python
+from matplotlib.patches import Ellipse
+from mpl_toolkits.axes_grid1.anchored_artists import AnchoredAuxTransformBox
+
+box = AnchoredAuxTransformBox(ax.transData, loc='upper left')
+el = Ellipse((0, 0), width=0.1, height=0.4, angle=30)  # in data coordinates!
+box.drawing_area.add_artist(el)
+ax.add_artist(box)  # add the anchored box to an existing Axes
+```
+
+The ellipse in the above example will have width and height
+corresponding to 0.1 and 0.4 in data coordinates and will be
+automatically scaled when the view limits of the axes change.
+
+Anchored Box03
+
+As in the legend, the bbox_to_anchor argument can be set. Using the
+HPacker and VPacker, you can arrange artists as in the
+legend (as a matter of fact, this is how the legend is created).
+
+Anchored Box04
+
+Note that unlike the legend, the ``bbox_transform`` is set
+to IdentityTransform by default.
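+
+As a rough sketch of that machinery (the text and anchor position are arbitrary),
+two ``TextArea`` boxes can be stacked with ``VPacker`` and anchored with
+``AnchoredOffsetbox``:
+
+``` python
+import matplotlib.pyplot as plt
+from matplotlib.offsetbox import TextArea, VPacker, AnchoredOffsetbox
+
+fig, ax = plt.subplots()
+line1 = TextArea("first line")
+line2 = TextArea("second line")
+packed = VPacker(children=[line1, line2], align="left", pad=2, sep=4)
+box = AnchoredOffsetbox(loc='upper left', child=packed, frameon=True,
+                        bbox_to_anchor=(0.05, 0.95),
+                        bbox_transform=ax.transAxes)
+ax.add_artist(box)
+plt.show()
+```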
+
+## Using Complex Coordinates with Annotations
+
+The Annotation in matplotlib supports several types of coordinates as
+described in [Basic annotation](#annotations-tutorial). For an advanced user who wants
+more control, it supports a few other options.
+
+## Using ConnectionPatch
+
+The ConnectionPatch is like an annotation without text. While the annotate
+function is recommended in most situations, the ConnectionPatch is useful when
+you want to connect points in different axes.
+
+``` python
+from matplotlib.patches import ConnectionPatch
+
+# ax1 and ax2 are two existing Axes, e.g. fig, (ax1, ax2) = plt.subplots(1, 2)
+xy = (0.2, 0.2)
+con = ConnectionPatch(xyA=xy, xyB=xy, coordsA="data", coordsB="data",
+                      axesA=ax1, axesB=ax2)
+ax2.add_artist(con)
+```
+
+The above code connects point xy in the data coordinates of ``ax1`` to
+point xy in the data coordinates of ``ax2``. Here is a simple example.
+
+Connect Simple01
+
+While the ConnectionPatch instance can be added to any axes, you may want to add
+it to the axes that is latest in the drawing order to prevent it from being
+overlapped by other axes.
+
+### Advanced Topics
+
+## Zoom effect between Axes
+
+``mpl_toolkits.axes_grid1.inset_locator`` defines some patch classes useful
+for interconnecting two axes. Understanding the code requires some
+knowledge of how mpl's transforms work, but utilizing it is
+straightforward.
+
+Axes Zoom Effect
+
+## Define Custom BoxStyle
+
+You can use a custom box style. The value for ``boxstyle`` can be a
+callable object of the following form:
+
+``` python
+def __call__(self, x0, y0, width, height, mutation_size,
+ aspect_ratio=1.):
+ '''
+ Given the location and size of the box, return the path of
+ the box around it.
+
+ - *x0*, *y0*, *width*, *height* : location and size of the box
+ - *mutation_size* : a reference scale for the mutation.
+ - *aspect_ratio* : aspect-ratio for the mutation.
+ '''
+ path = ...
+ return path
+```
+
+Here is a complete example.
+
+Custom Boxstyle01
+
+However, it is recommended that you derive from the
+matplotlib.patches.BoxStyle._Base as demonstrated below.
+
+Custom Boxstyle02
+
+Similarly, you can define a custom ConnectionStyle and a custom ArrowStyle.
+See the source code of ``lib/matplotlib/patches.py`` and check
+how each style class is defined.
+
+## Download
+
+- [Download Python source code: annotations.py](https://matplotlib.org/_downloads/e9b9ec3e7de47d2ccae486e437e86de2/annotations.py)
+- [Download Jupyter notebook: annotations.ipynb](https://matplotlib.org/_downloads/c4f2a18ccd63dc25619141aee3712b03/annotations.ipynb)
+
\ No newline at end of file
diff --git a/Python/matplotlab/text/mathtext.md b/Python/matplotlab/text/mathtext.md
new file mode 100644
index 00000000..3292aff1
--- /dev/null
+++ b/Python/matplotlab/text/mathtext.md
@@ -0,0 +1,1186 @@
+---
+sidebarDepth: 3
+sidebar: auto
+---
+
+# Writing mathematical expressions
+
+An introduction to writing mathematical expressions in Matplotlib.
+
+You can use a subset of TeX markup in any matplotlib text string by placing it
+inside a pair of dollar signs ($).
+
+Note that you do not need to have TeX installed, since Matplotlib ships
+its own TeX expression parser, layout engine, and fonts. The layout engine
+is a fairly direct adaptation of the layout algorithms in Donald Knuth's
+TeX, so the quality is quite good. (Matplotlib also provides a ``usetex``
+option for those who do want to call out to TeX to generate their text; see
+[Text rendering With LaTeX](usetex.html).)
+
+Any text element can use math text. You should use raw strings (precede the
+quotes with an ``'r'``), and surround the math text with dollar signs ($), as
+in TeX. Regular text and mathtext can be interleaved within the same string.
+Mathtext can use DejaVu Sans (default), DejaVu Serif, the Computer Modern fonts
+(from (La)TeX), [STIX](http://www.stixfonts.org/) fonts (which are designed
+to blend well with Times), or a Unicode font that you provide. The mathtext
+font can be selected with the customization variable ``mathtext.fontset`` (see
+[Customizing Matplotlib with style sheets and rcParams](https://matplotlib.org/introductory/customizing.html))
+
+Here is a simple example:
+
+``` python
+# plain text
+plt.title('alpha > beta')
+```
+
+produces "alpha > beta".
+
+Whereas this:
+
+``` python
+# math text
+plt.title(r'$\alpha > \beta$')
+```
+
+produces "".
+
+::: tip Note
+
+Mathtext should be placed between a pair of dollar signs ($). To make it
+easy to display monetary values, e.g., "$100.00", if a single dollar sign
+is present in the entire string, it will be displayed verbatim as a dollar
+sign. This is a small change from regular TeX, where the dollar sign in
+non-math text would have to be escaped ('\$').
+
+:::
+
+::: tip Note
+
+While the syntax inside the pair of dollar signs ($) aims to be TeX-like,
+the text outside does not. In particular, characters such as:
+
+``` python
+# $ % & ~ _ ^ \ { } \( \) \[ \]
+```
+
+have special meaning outside of math mode in TeX. Therefore, these
+characters will behave differently depending on the rcParam ``text.usetex``
+flag. See the [usetex tutorial](usetex.html) for more
+information.
+
+:::
+
+## Subscripts and superscripts
+
+To make subscripts and superscripts, use the ``'_'`` and ``'^'`` symbols:
+
+``` python
+r'$\alpha_i > \beta_i$'
+```
+
+
+It's a mathematical formula.
+
+
+Some symbols automatically put their sub/superscripts under and over the
+operator. For example, to write the sum of ``x_i`` from ``i=0`` to
+infinity, you could do:
+
+``` python
+r'$\sum_{i=0}^\infty x_i$'
+```
+
+
+It's a mathematical formula.
+
+
+## Fractions, binomials, and stacked numbers
+
+Fractions, binomials, and stacked numbers can be created with the
+``\frac{}{}``, ``\binom{}{}`` and ``\genfrac{}{}{}{}{}{}`` commands,
+respectively:
+
+``` python
+r'$\frac{3}{4} \binom{3}{4} \genfrac{}{}{0}{}{3}{4}$'
+```
+
+produces
+
+
+It's a mathematical formula.
+
+
+Fractions can be arbitrarily nested:
+
+``` python
+r'$\frac{5 - \frac{1}{x}}{4}$'
+```
+
+produces
+
+
+It's a mathematical formula.
+
+
+Note that special care needs to be taken to place parentheses and brackets
+around fractions. Doing things the obvious way produces brackets that are too
+small:
+
+``` python
+r'$(\frac{5 - \frac{1}{x}}{4})$'
+```
+
+
+It's a mathematical formula.
+
+
+The solution is to precede the bracket with ``\left`` and ``\right`` to inform
+the parser that those brackets encompass the entire object:
+
+``` python
+r'$\left(\frac{5 - \frac{1}{x}}{4}\right)$'
+```
+
+
+It's a mathematical formula.
+
+
+## Radicals
+
+Radicals can be produced with the ``\sqrt[]{}`` command. For example:
+
+``` python
+r'$\sqrt{2}$'
+```
+
+
+It's a mathematical formula.
+
+
+Any base can (optionally) be provided inside square brackets. Note that the
+base must be a simple expression, and cannot contain layout commands such as
+fractions or sub/superscripts:
+
+``` python
+r'$\sqrt[3]{x}$'
+```
+
+
+It's a mathematical formula.
+
+
+## Fonts
+
+The default font is *italics* for mathematical symbols.
+
+::: tip Note
+
+This default can be changed using the ``mathtext.default`` rcParam. This is
+useful, for example, to use the same font as regular non-math text for math
+text, by setting it to ``regular``.
+
+:::
+
+To change fonts, e.g., to write "sin" in a Roman font, enclose the text in a
+font command:
+
+``` python
+r'$s(t) = \mathcal{A}\mathrm{sin}(2 \omega t)$'
+```
+
+
+
+
+More conveniently, many commonly used function names that are typeset in
+a Roman font have shortcuts. So the expression above could be written as
+follows:
+
+``` python
+r'$s(t) = \mathcal{A}\sin(2 \omega t)$'
+```
+
+
+
+
+Here "s" and "t" are variable in italics font (default), "sin" is in Roman
+font, and the amplitude "A" is in calligraphy font. Note in the example above
+the calligraphy ``A`` is squished into the ``sin``. You can use a spacing
+command to add a little whitespace between them:
+
+``` python
+r'$s(t) = \mathcal{A}\/\sin(2 \omega t)$'
+```
+
+
+
+
+The choices available with all fonts are:
+
+
+| Command | Result |
+| --- | --- |
+| ``\mathrm{Roman}`` | Roman (upright) |
+| ``\mathit{Italic}`` | Italic |
+| ``\mathtt{Typewriter}`` | Typewriter (monospace) |
+| ``\mathcal{CALLIGRAPHY}`` | Calligraphic |
+
+When using the [STIX](http://www.stixfonts.org/) fonts, you also have the
+choice of:
+
+
+| Command | Result |
+| --- | --- |
+| ``\mathbb{blackboard}`` | Blackboard (double-struck) |
+| ``\mathrm{\mathbb{blackboard}}`` | Blackboard, upright via ``\mathrm`` |
+| ``\mathfrak{Fraktur}`` | Fraktur |
+| ``\mathsf{sansserif}`` | Sans-serif |
+| ``\mathrm{\mathsf{sansserif}}`` | Sans-serif, upright via ``\mathrm`` |
+
+
+There are also three global "font sets" to choose from, which are
+selected using the ``mathtext.fontset`` parameter in [matplotlibrc](https://matplotlib.org/introductory/customizing.html#matplotlibrc-sample).
+
+``cm``: **Computer Modern (TeX)**
+
+
+
+``stix``: **STIX** (designed to blend well with Times)
+
+
+
+``stixsans``: **STIX sans-serif**
+
+
+
+Additionally, you can use ``\mathdefault{...}`` or its alias
+``\mathregular{...}`` to use the font used for regular text outside of
+mathtext. There are a number of limitations to this approach, most notably
+that far fewer symbols will be available, but it can be useful to make math
+expressions blend well with other text in the plot.
+
+### Custom fonts
+
+mathtext also provides a way to use custom fonts for math. This method is
+fairly tricky to use, and should be considered an experimental feature for
+patient users only. By setting the rcParam ``mathtext.fontset`` to ``custom``,
+you can then set the following parameters, which control which font file to use
+for a particular set of math characters.
+
+
+---
+
+
+
+
+
+Parameter
+Corresponds to
+
+
+
+mathtext.it
+\mathit{} or default italic
+
+mathtext.rm
+\mathrm{} Roman (upright)
+
+mathtext.tt
+\mathtt{} Typewriter (monospace)
+
+mathtext.bf
+\mathbf{} bold italic
+
+mathtext.cal
+\mathcal{} calligraphic
+
+mathtext.sf
+\mathsf{} sans-serif
+
+
+
+
+Each parameter should be set to a fontconfig font descriptor (as defined in the
+yet-to-be-written font chapter).
+
+The fonts used should have a Unicode mapping in order to find any
+non-Latin characters, such as Greek. If you want to use a math symbol
+that is not contained in your custom fonts, you can set the rcParam
+``mathtext.fallback_to_cm`` to ``True`` which will cause the mathtext system
+to use characters from the default Computer Modern fonts whenever a particular
+character can not be found in the custom font.
+
+Note that the math glyphs specified in Unicode have evolved over time, and many
+fonts may not have glyphs in the correct place for mathtext.
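+
+The sketch below shows what such a configuration might look like. The font names
+and the title expression are placeholders chosen for illustration; substitute
+fonts that are actually installed on your system:
+
+``` python
+import matplotlib.pyplot as plt
+from matplotlib import rcParams
+
+# Use the 'custom' font set and point each class of math characters at a
+# fontconfig-style descriptor (these names are examples, not requirements).
+rcParams['mathtext.fontset'] = 'custom'
+rcParams['mathtext.rm'] = 'DejaVu Serif'
+rcParams['mathtext.it'] = 'DejaVu Serif:italic'
+rcParams['mathtext.bf'] = 'DejaVu Serif:bold'
+rcParams['mathtext.fallback_to_cm'] = True  # borrow missing glyphs from Computer Modern
+
+fig, ax = plt.subplots()
+ax.set_title(r'$s(t) = \mathcal{A}\sin(2 \omega t)$')
+plt.show()
+```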
+
+## Accents
+
+An accent command may precede any symbol to add an accent above it. There are
+long and short forms for some of them.
+
+
+| Command | Result |
+| --- | --- |
+| ``\acute a`` or ``\'a`` | acute accent |
+| ``\bar a`` | bar (macron) |
+| ``\breve a`` | breve |
+| ``\ddot a`` or ``\''a`` | double dot |
+| ``\dot a`` or ``\.a`` | dot |
+| ``\grave a`` or ``\`a`` | grave accent |
+| ``\hat a`` or ``\^a`` | hat (circumflex) |
+| ``\tilde a`` or ``\~a`` | tilde |
+| ``\vec a`` | vector arrow |
+| ``\overline{abc}`` | overline |
+
+
+In addition, there are two special accents that automatically adjust to the
+width of the symbols below:
+
+
+| Command | Result |
+| --- | --- |
+| ``\widehat{xyz}`` | wide hat |
+| ``\widetilde{xyz}`` | wide tilde |
+
+
+Care should be taken when putting accents on lower-case i's and j's. Note that
+in the following ``\imath`` is used to avoid the extra dot over the i:
+
+``` python
+r"$\hat i\ \ \hat \imath$"
+```
+
+
+
+where \\(x_t\\) is the input, \\(y_t\\) is the result and the \\(w_i\\)
+are the weights.
+
+The EW functions support two variants of exponential weights.
+The default, ``adjust=True``, uses the weights \\(w_i = (1 - \alpha)^i\\),
+which gives
+
+
+\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^t x_{0}}
+{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^t}\]
+
+
+When ``adjust=False`` is specified, moving averages are calculated as
+
+
+\[\begin{split}y_0 &= x_0 \\
+y_t &= (1 - \alpha) y_{t-1} + \alpha x_t,\end{split}\]
+
+
+which is equivalent to using the weights
+
+
+\[\begin{split}w_i = \begin{cases}
+    \alpha (1 - \alpha)^i & \text{if } i < t \\
+    (1 - \alpha)^t & \text{if } i = t.
+\end{cases}\end{split}\]
+
+
+::: tip Note
+
+These equations are sometimes written in terms of \\(\alpha' = 1 - \alpha\\), e.g.
+
+
+\[y_t = \alpha' y_{t-1} + (1 - \alpha') x_t.\]
+
+
+:::
+
+The difference between the above two variants arises because we are
+dealing with series which have finite history. Consider a series of infinite
+history, with ``adjust=True``:
+
+
+\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ...}
+{1 + (1 - \alpha) + (1 - \alpha)^2 + ...}\]
+
+
+Noting that the denominator is a geometric series with initial term equal to 1
+and a ratio of \\(1 - \alpha\\) we have
+
+
+\[\begin{split}y_t &= \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ...}
+{\frac{1}{1 - (1 - \alpha)}}\\
+&= [x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ...] \times \alpha \\
+&= \alpha x_t + (1 - \alpha)[x_{t-1} + (1 - \alpha) x_{t-2} + ...] \times \alpha \\
+&= \alpha x_t + (1 - \alpha) y_{t-1}\end{split}\]
+
+
+which is the same expression as ``adjust=False`` above and therefore
+shows the equivalence of the two variants for infinite series.
+When ``adjust=False``, we have \\(y_0 = x_0\\) and
+\\(y_t = \alpha x_t + (1 - \alpha) y_{t-1}\\).
+Therefore, there is an assumption that \\(x_0\\) is not an ordinary value
+but rather an exponentially weighted moment of the infinite series up to that
+point.
+
+One must have \\(0 < \alpha \leq 1\\), and while since version 0.18.0
+it has been possible to pass \\(\alpha\\) directly, it’s often easier
+to think about either the **span**, **center of mass (com)** or **half-life**
+of an EW moment:
+
+
+\[\begin{split}\alpha =
+ \begin{cases}
+ \frac{2}{s + 1}, & \text{for span}\ s \geq 1\\
+ \frac{1}{1 + c}, & \text{for center of mass}\ c \geq 0\\
+ 1 - \exp\left(\frac{\log 0.5}{h}\right), & \text{for half-life}\ h > 0
+ \end{cases}\end{split}\]
+
+
+One must specify precisely one of **span**, **center of mass**, **half-life**
+and **alpha** to the EW functions:
+
+- **Span** corresponds to what is commonly called an “N-day EW moving average”.
+- **Center of mass** has a more physical interpretation and can be thought of
+in terms of span: \\(c = (s - 1) / 2\\).
+- **Half-life** is the period of time for the exponential weight to reduce to
+one half.
+- **Alpha** specifies the smoothing factor directly.
+
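+For instance, the following sketch (with arbitrary random data) checks that
+``span=20``, ``com=9.5`` and ``alpha=2/21`` all describe the same weighting:
+
+``` python
+import numpy as np
+import pandas as pd
+
+s = pd.Series(np.random.randn(100))
+
+# span=20, com=9.5 and alpha=2/21 imply the same smoothing factor, so the
+# resulting EW moving averages should be identical.
+by_span = s.ewm(span=20).mean()
+by_com = s.ewm(com=9.5).mean()
+by_alpha = s.ewm(alpha=2 / 21).mean()
+
+print(np.allclose(by_span, by_com) and np.allclose(by_span, by_alpha))  # True
+```
+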
+Here is an example for a univariate time series:
+
+``` python
+In [109]: s.plot(style='k--')
+Out[109]:
+
+In [110]: s.ewm(span=20).mean().plot(style='k')
+Out[110]:
+```
+
+
+
+EWM has a ``min_periods`` argument, which has the same
+meaning it does for all the ``.expanding`` and ``.rolling`` methods:
+no output values will be set until at least ``min_periods`` non-null values
+are encountered in the (expanding) window.
+
+EWM also has an ``ignore_na`` argument, which determines how
+intermediate null values affect the calculation of the weights.
+When ``ignore_na=False`` (the default), weights are calculated based on absolute
+positions, so that intermediate null values affect the result.
+When ``ignore_na=True``,
+weights are calculated by ignoring intermediate null values.
+For example, assuming ``adjust=True``, if ``ignore_na=False``, the weighted
+average of ``3, NaN, 5`` would be calculated as
+
+
+\[\frac{(1-\alpha)^2 \cdot 3 + 1 \cdot 5}{(1-\alpha)^2 + 1},\]
+
+
+whereas if ``ignore_na=True``, the weighted average would be calculated as
+
+
+\[\frac{(1-\alpha) \cdot 3 + 1 \cdot 5}{(1-\alpha) + 1}.\]
+
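+A quick numeric check of these two weightings (a sketch using \\(\alpha = 0.5\\)):
+
+``` python
+import numpy as np
+import pandas as pd
+
+alpha = 0.5
+s = pd.Series([3.0, np.nan, 5.0])
+
+# ignore_na=False (the default): weights follow absolute positions
+print(s.ewm(alpha=alpha, ignore_na=False).mean().iloc[-1])   # 4.6
+print(((1 - alpha) ** 2 * 3 + 5) / ((1 - alpha) ** 2 + 1))   # 4.6
+
+# ignore_na=True: the intermediate NaN is skipped when forming the weights
+print(s.ewm(alpha=alpha, ignore_na=True).mean().iloc[-1])    # 4.333...
+print(((1 - alpha) * 3 + 5) / ((1 - alpha) + 1))             # 4.333...
+```
+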
+The ``var()``, ``std()``, and ``cov()`` functions have a ``bias`` argument,
+specifying whether the result should contain biased or unbiased statistics.
+For example, if ``bias=True``, ``ewmvar(x)`` is calculated as
+``ewmvar(x) = ewma(x**2) - ewma(x)**2``;
+whereas if ``bias=False`` (the default), the biased variance statistics
+are scaled by debiasing factors
+
+
+\[\frac{\left(\sum_{i=0}^t w_i\right)^2}{\left(\sum_{i=0}^t w_i\right)^2 - \sum_{i=0}^t w_i^2}.\]
+
+(For \\(w_i = 1\\), this reduces to the usual \\(N / (N - 1)\\) factor,
+with \\(N = t + 1\\).)
+See [Weighted Sample Variance](http://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance)
+on Wikipedia for further details.
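+
+As a sketch of the ``bias`` argument in code (per the identity quoted above, the
+biased EW variance should equal ``ewma(x**2) - ewma(x)**2``):
+
+``` python
+import numpy as np
+import pandas as pd
+
+x = pd.Series(np.random.randn(200))
+alpha = 0.1
+
+biased = x.ewm(alpha=alpha).var(bias=True)
+manual = (x ** 2).ewm(alpha=alpha).mean() - x.ewm(alpha=alpha).mean() ** 2
+
+# The two should agree (the first observation is skipped, since the variance
+# of a single point is degenerate).
+print(np.allclose(biased.iloc[1:], manual.iloc[1:]))
+
+unbiased = x.ewm(alpha=alpha).var(bias=False)  # the default, with debiasing factors
+```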
diff --git a/Python/pandas/user_guide/cookbook.md b/Python/pandas/user_guide/cookbook.md
new file mode 100644
index 00000000..077a8902
--- /dev/null
+++ b/Python/pandas/user_guide/cookbook.md
@@ -0,0 +1,2050 @@
+# Cookbook
+
+This section lists some **short and sweet** pandas examples and links.
+
+We encourage pandas users to add to this documentation; adding links to or code for useful examples is a great first **pull request** for a pandas user.
+
+The entries are simple, concise, easy-to-follow example code, together with links to Stack Overflow or GitHub where fuller versions of the examples can be found.
+
+``pd`` and ``np`` are the standard abbreviations for pandas and NumPy. Other modules are imported explicitly so that newer users can follow along.
+
+These examples are written for Python 3; minor tweaks may be needed to run them on earlier Python versions.
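+
+For reference, a minimal setup matching those conventions:
+
+```python
+import numpy as np
+import pandas as pd
+```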
+
+## Idioms
+
+These are some neat pandas ``idioms``.
+
+[if-then/if-then-else on one column, with assignment to one or more other columns:](https://stackoverflow.com/questions/17128302/python-pandas-idiom-for-if-then-else)
+
+```python
+In [1]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ...: 'BBB': [10, 20, 30, 40],
+ ...: 'CCC': [100, 50, -30, -50]})
+ ...:
+
+In [2]: df
+Out[2]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+2 6 30 -30
+3 7 40 -50
+```
+
+### if-then…
+
+An if-then on one column:
+
+```python
+In [3]: df.loc[df.AAA >= 5, 'BBB'] = -1
+
+In [4]: df
+Out[4]:
+ AAA BBB CCC
+0 4 10 100
+1 5 -1 50
+2 6 -1 -30
+3 7 -1 -50
+```
+
+An if-then with assignment to two columns:
+
+```python
+In [5]: df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555
+
+In [6]: df
+Out[6]:
+ AAA BBB CCC
+0 4 10 100
+1 5 555 555
+2 6 555 555
+3 7 555 555
+```
+
+Add another line with different logic, to do the -else:
+
+```python
+In [7]: df.loc[df.AAA < 5, ['BBB', 'CCC']] = 2000
+
+In [8]: df
+Out[8]:
+ AAA BBB CCC
+0 4 2000 2000
+1 5 555 555
+2 6 555 555
+3 7 555 555
+```
+
+Or use pandas ``where`` after you have set up a mask:
+
+```python
+In [9]: df_mask = pd.DataFrame({'AAA': [True] * 4,
+ ...: 'BBB': [False] * 4,
+ ...: 'CCC': [True, False] * 2})
+ ...:
+
+In [10]: df.where(df_mask, -1000)
+Out[10]:
+ AAA BBB CCC
+0 4 -1000 2000
+1 5 -1000 -1000
+2 6 -1000 555
+3 7 -1000 -1000
+```
+
+[if-then-else using NumPy's where()](https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column)
+
+```python
+In [11]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]})
+ ....:
+
+In [12]: df
+Out[12]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+2 6 30 -30
+3 7 40 -50
+
+In [13]: df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')
+
+In [14]: df
+Out[14]:
+ AAA BBB CCC logic
+0 4 10 100 low
+1 5 20 50 low
+2 6 30 -30 high
+3 7 40 -50 high
+```
+
+### Splitting
+
+[Split a frame with a boolean criterion](https://stackoverflow.com/questions/14957116/how-to-split-a-dataframe-according-to-a-boolean-criterion)
+
+```python
+In [15]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]})
+ ....:
+
+In [16]: df
+Out[16]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+2 6 30 -30
+3 7 40 -50
+
+In [17]: df[df.AAA <= 5]
+Out[17]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+
+In [18]: df[df.AAA > 5]
+Out[18]:
+ AAA BBB CCC
+2 6 30 -30
+3 7 40 -50
+```
+
+### Building criteria
+
+[Select with multi-column criteria](https://stackoverflow.com/questions/15315452/selecting-with-complex-criteria-from-pandas-dataframe)
+
+```python
+In [19]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]})
+ ....:
+
+In [20]: df
+Out[20]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+2 6 30 -30
+3 7 40 -50
+```
+
+AND (&), without assignment (returns a Series):
+
+```python
+In [21]: df.loc[(df['BBB'] < 25) & (df['CCC'] >= -40), 'AAA']
+Out[21]:
+0 4
+1 5
+Name: AAA, dtype: int64
+```
+
+OR (|), without assignment (returns a Series):
+
+```python
+In [22]: df.loc[(df['BBB'] > 25) | (df['CCC'] >= -40), 'AAA']
+Out[22]:
+0 4
+1 5
+2 6
+3 7
+Name: AAA, dtype: int64
+```
+
+OR (|), with assignment (modifies the DataFrame):
+
+```python
+In [23]: df.loc[(df['BBB'] > 25) | (df['CCC'] >= 75), 'AAA'] = 0.1
+
+In [24]: df
+Out[24]:
+ AAA BBB CCC
+0 0.1 10 100
+1 5.0 20 50
+2 0.1 30 -30
+3 0.1 40 -50
+```
+
+[Select rows with data closest to a certain value using argsort](https://stackoverflow.com/questions/17758023/return-rows-in-a-dataframe-closest-to-a-user-defined-number)
+
+```python
+In [25]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]})
+ ....:
+
+In [26]: df
+Out[26]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+2 6 30 -30
+3 7 40 -50
+
+In [27]: aValue = 43.0
+
+In [28]: df.loc[(df.CCC - aValue).abs().argsort()]
+Out[28]:
+ AAA BBB CCC
+1 5 20 50
+0 4 10 100
+2 6 30 -30
+3 7 40 -50
+```
+
+[Dynamically reduce a list of criteria using a binary operator](https://stackoverflow.com/questions/21058254/pandas-boolean-operation-in-a-python-list/21058331)
+
+```python
+In [29]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]})
+ ....:
+
+In [30]: df
+Out[30]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+2 6 30 -30
+3 7 40 -50
+
+In [31]: Crit1 = df.AAA <= 5.5
+
+In [32]: Crit2 = df.BBB == 10.0
+
+In [33]: Crit3 = df.CCC > -40.0
+```
+
+One could hard-code the combination:
+
+```python
+In [34]: AllCrit = Crit1 & Crit2 & Crit3
+```
+
+...or build the list of criteria dynamically:
+
+```python
+In [35]: import functools
+
+In [36]: CritList = [Crit1, Crit2, Crit3]
+
+In [37]: AllCrit = functools.reduce(lambda x, y: x & y, CritList)
+
+In [38]: df[AllCrit]
+Out[38]:
+ AAA BBB CCC
+0 4 10 100
+```
+
+## Selection
+
+### DataFrames
+
+See the [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing) docs for more information.
+
+[Using both row labels and value conditionals](https://stackoverflow.com/questions/14725068/pandas-using-row-labels-in-boolean-indexing)
+
+```python
+In [39]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]})
+ ....:
+
+In [40]: df
+Out[40]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+2 6 30 -30
+3 7 40 -50
+
+In [41]: df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
+Out[41]:
+ AAA BBB CCC
+0 4 10 100
+2 6 30 -30
+```
+
+[Use loc for label-oriented slicing and iloc for positional slicing](https://github.com/pandas-dev/pandas/issues/2904)
+
+```python
+In [42]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]},
+ ....: index=['foo', 'bar', 'boo', 'kar'])
+ ....:
+```
+
+There are two explicit slicing methods, with a third general case:
+
+1. Positional-oriented (Python slicing style: exclusive of end);
+2. Label-oriented (non-Python slicing style: inclusive of end);
+3. General (either slicing style: depends on whether the slice contains labels or positions).
+
+
+```python
+In [43]: df.loc['bar':'kar'] # Label
+Out[43]:
+ AAA BBB CCC
+bar 5 20 50
+boo 6 30 -30
+kar 7 40 -50
+
+# Generic
+In [44]: df.iloc[0:3]
+Out[44]:
+ AAA BBB CCC
+foo 4 10 100
+bar 5 20 50
+boo 6 30 -30
+
+In [45]: df.loc['bar':'kar']
+Out[45]:
+ AAA BBB CCC
+bar 5 20 50
+boo 6 30 -30
+kar 7 40 -50
+```
+
+Ambiguity arises when an index consists of integers with a non-zero start or a non-unit increment.
+
+```python
+In [46]: data = {'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]}
+ ....:
+
+In [47]: df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4]) # Note index starts at 1.
+
+In [48]: df2.iloc[1:3] # Position-oriented
+Out[48]:
+ AAA BBB CCC
+2 5 20 50
+3 6 30 -30
+
+In [49]: df2.loc[1:3] # Label-oriented
+Out[49]:
+ AAA BBB CCC
+1 4 10 100
+2 5 20 50
+3 6 30 -30
+```
+
+[Using the inverse operator (~) to take the complement of a mask](https://stackoverflow.com/questions/14986510/picking-out-elements-based-on-complement-of-indices-in-python-pandas)
+
+```python
+In [50]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
+ ....: 'BBB': [10, 20, 30, 40],
+ ....: 'CCC': [100, 50, -30, -50]})
+ ....:
+
+In [51]: df
+Out[51]:
+ AAA BBB CCC
+0 4 10 100
+1 5 20 50
+2 6 30 -30
+3 7 40 -50
+
+In [52]: df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]
+Out[52]:
+ AAA BBB CCC
+1 5 20 50
+3 7 40 -50
+```
+
+### New columns
+
+[Efficiently and dynamically creating new columns using applymap](https://stackoverflow.com/questions/16575868/efficiently-creating-additional-columns-in-a-pandas-dataframe-using-map)
+
+```python
+In [53]: df = pd.DataFrame({'AAA': [1, 2, 1, 3],
+ ....: 'BBB': [1, 1, 2, 2],
+ ....: 'CCC': [2, 1, 3, 1]})
+ ....:
+
+In [54]: df
+Out[54]:
+ AAA BBB CCC
+0 1 1 2
+1 2 1 1
+2 1 2 3
+3 3 2 1
+
+In [55]: source_cols = df.columns # Or some subset would work too
+
+In [56]: new_cols = [str(x) + "_cat" for x in source_cols]
+
+In [57]: categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}
+
+In [58]: df[new_cols] = df[source_cols].applymap(categories.get)
+
+In [59]: df
+Out[59]:
+ AAA BBB CCC AAA_cat BBB_cat CCC_cat
+0 1 1 2 Alpha Alpha Beta
+1 2 1 1 Beta Alpha Alpha
+2 1 2 3 Alpha Beta Charlie
+3 3 2 1 Charlie Beta Alpha
+```
+
+[Keep other columns when using min() with groupby](https://stackoverflow.com/questions/23394476/keep-other-columns-when-using-min-with-groupby)
+
+```python
+In [60]: df = pd.DataFrame({'AAA': [1, 1, 1, 2, 2, 2, 3, 3],
+ ....: 'BBB': [2, 1, 3, 4, 5, 1, 2, 3]})
+ ....:
+
+In [61]: df
+Out[61]:
+ AAA BBB
+0 1 2
+1 1 1
+2 1 3
+3 2 4
+4 2 5
+5 2 1
+6 3 2
+7 3 3
+```
+
+Method 1: use idxmin() to get the index of the minimum of each group
+
+```python
+In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
+Out[62]:
+ AAA BBB
+1 1 1
+5 2 1
+6 3 2
+```
+
+Method 2: sort, then take the first value of each group
+
+```python
+In [63]: df.sort_values(by="BBB").groupby("AAA", as_index=False).first()
+Out[63]:
+ AAA BBB
+0 1 1
+1 2 1
+2 3 2
+```
+
+Notice the same results, except for the index.
+
+
+## MultiIndexing
+
+See the [MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-hierarchical) docs for more information.
+
+[Creating a MultiIndex from a labeled frame](https://stackoverflow.com/questions/14916358/reshaping-dataframes-in-pandas-based-on-column-labels)
+
+```python
+In [64]: df = pd.DataFrame({'row': [0, 1, 2],
+ ....: 'One_X': [1.1, 1.1, 1.1],
+ ....: 'One_Y': [1.2, 1.2, 1.2],
+ ....: 'Two_X': [1.11, 1.11, 1.11],
+ ....: 'Two_Y': [1.22, 1.22, 1.22]})
+ ....:
+
+In [65]: df
+Out[65]:
+ row One_X One_Y Two_X Two_Y
+0 0 1.1 1.2 1.11 1.22
+1 1 1.1 1.2 1.11 1.22
+2 2 1.1 1.2 1.11 1.22
+
+# As a labeled index
+In [66]: df = df.set_index('row')
+
+In [67]: df
+Out[67]:
+ One_X One_Y Two_X Two_Y
+row
+0 1.1 1.2 1.11 1.22
+1 1.1 1.2 1.11 1.22
+2 1.1 1.2 1.11 1.22
+
+# With hierarchical columns
+In [68]: df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_'))
+ ....: for c in df.columns])
+ ....:
+
+In [69]: df
+Out[69]:
+ One Two
+ X Y X Y
+row
+0 1.1 1.2 1.11 1.22
+1 1.1 1.2 1.11 1.22
+2 1.1 1.2 1.11 1.22
+
+# Now stack and reset the index
+
+In [70]: df = df.stack(0).reset_index(1)
+
+In [71]: df
+Out[71]:
+ level_1 X Y
+row
+0 One 1.10 1.20
+0 Two 1.11 1.22
+1 One 1.10 1.20
+1 Two 1.11 1.22
+2 One 1.10 1.20
+2 Two 1.11 1.22
+
+# And fix the labels (notice the label 'level_1' got added automatically)
+In [72]: df.columns = ['Sample', 'All_X', 'All_Y']
+
+In [73]: df
+Out[73]:
+ Sample All_X All_Y
+row
+0 One 1.10 1.20
+0 Two 1.11 1.22
+1 One 1.10 1.20
+1 Two 1.11 1.22
+2 One 1.10 1.20
+2 Two 1.11 1.22
+```
+
+### Arithmetic
+
+[Performing arithmetic with a MultiIndex that needs broadcasting](https://stackoverflow.com/questions/19501510/divide-entire-pandas-multiindex-dataframe-by-dataframe-variable/19502176#19502176)
+
+```python
+In [74]: cols = pd.MultiIndex.from_tuples([(x, y) for x in ['A', 'B', 'C']
+ ....: for y in ['O', 'I']])
+ ....:
+
+In [75]: df = pd.DataFrame(np.random.randn(2, 6), index=['n', 'm'], columns=cols)
+
+In [76]: df
+Out[76]:
+ A B C
+ O I O I O I
+n 0.469112 -0.282863 -1.509059 -1.135632 1.212112 -0.173215
+m 0.119209 -1.044236 -0.861849 -2.104569 -0.494929 1.071804
+
+In [77]: df = df.div(df['C'], level=1)
+
+In [78]: df
+Out[78]:
+ A B C
+ O I O I O I
+n 0.387021 1.633022 -1.244983 6.556214 1.0 1.0
+m -0.240860 -0.974279 1.741358 -1.963577 1.0 1.0
+```
+
+### Slicing
+
+[Slicing a MultiIndex with xs](https://stackoverflow.com/questions/12590131/how-to-slice-multindex-columns-in-pandas-dataframes)
+
+```python
+In [79]: coords = [('AA', 'one'), ('AA', 'six'), ('BB', 'one'), ('BB', 'two'),
+ ....: ('BB', 'six')]
+ ....:
+
+In [80]: index = pd.MultiIndex.from_tuples(coords)
+
+In [81]: df = pd.DataFrame([11, 22, 33, 44, 55], index, ['MyData'])
+
+In [82]: df
+Out[82]:
+ MyData
+AA one 11
+ six 22
+BB one 33
+ two 44
+ six 55
+```
+
+To take the cross section of the 1st level and 1st axis of the index:
+
+```python
+# Note: level and axis are optional, and default to zero
+In [83]: df.xs('BB', level=0, axis=0)
+Out[83]:
+ MyData
+one 33
+two 44
+six 55
+```
+
+...and now the 2nd level of the 1st axis:
+
+```python
+In [84]: df.xs('six', level=1, axis=0)
+Out[84]:
+ MyData
+AA 22
+BB 55
+```
+
+[Slicing a MultiIndex with xs, method #2](https://stackoverflow.com/questions/14964493/multiindex-based-indexing-in-pandas)
+
+```python
+In [85]: import itertools
+
+In [86]: index = list(itertools.product(['Ada', 'Quinn', 'Violet'],
+ ....: ['Comp', 'Math', 'Sci']))
+ ....:
+
+In [87]: headr = list(itertools.product(['Exams', 'Labs'], ['I', 'II']))
+
+In [88]: indx = pd.MultiIndex.from_tuples(index, names=['Student', 'Course'])
+
+In [89]: cols = pd.MultiIndex.from_tuples(headr) # Notice these are un-named
+
+In [90]: data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
+
+In [91]: df = pd.DataFrame(data, indx, cols)
+
+In [92]: df
+Out[92]:
+ Exams Labs
+ I II I II
+Student Course
+Ada Comp 70 71 72 73
+ Math 71 73 75 74
+ Sci 72 75 75 75
+Quinn Comp 73 74 75 76
+ Math 74 76 78 77
+ Sci 75 78 78 78
+Violet Comp 76 77 78 79
+ Math 77 79 81 80
+ Sci 78 81 81 81
+
+In [93]: All = slice(None)
+
+In [94]: df.loc['Violet']
+Out[94]:
+ Exams Labs
+ I II I II
+Course
+Comp 76 77 78 79
+Math 77 79 81 80
+Sci 78 81 81 81
+
+In [95]: df.loc[(All, 'Math'), All]
+Out[95]:
+ Exams Labs
+ I II I II
+Student Course
+Ada Math 71 73 75 74
+Quinn Math 74 76 78 77
+Violet Math 77 79 81 80
+
+In [96]: df.loc[(slice('Ada', 'Quinn'), 'Math'), All]
+Out[96]:
+ Exams Labs
+ I II I II
+Student Course
+Ada Math 71 73 75 74
+Quinn Math 74 76 78 77
+
+In [97]: df.loc[(All, 'Math'), ('Exams')]
+Out[97]:
+ I II
+Student Course
+Ada Math 71 73
+Quinn Math 74 76
+Violet Math 77 79
+
+In [98]: df.loc[(All, 'Math'), (All, 'II')]
+Out[98]:
+ Exams Labs
+ II II
+Student Course
+Ada Math 73 74
+Quinn Math 76 77
+Violet Math 79 80
+```
+
+[Setting portions of a MultiIndex with xs](https://stackoverflow.com/questions/19319432/pandas-selecting-a-lower-level-in-a-dataframe-to-do-a-ffill)
+
+### Sorting
+
+[Sort by specific column or an ordered list of columns, with a MultiIndex](https://stackoverflow.com/questions/14733871/mutli-index-sorting-in-pandas)
+
+```python
+In [99]: df.sort_values(by=('Labs', 'II'), ascending=False)
+Out[99]:
+ Exams Labs
+ I II I II
+Student Course
+Violet Sci 78 81 81 81
+ Math 77 79 81 80
+ Comp 76 77 78 79
+Quinn Sci 75 78 78 78
+ Math 74 76 78 77
+ Comp 73 74 75 76
+Ada Sci 72 75 75 75
+ Math 71 73 75 74
+ Comp 70 71 72 73
+```
+
+[Partial selection, the need for sortedness](https://github.com/pandas-dev/pandas/issues/2995)
+
+### Levels
+
+[Prepending a level to a MultiIndex](http://stackoverflow.com/questions/14744068/prepend-a-level-to-a-pandas-multiindex)
+
+[Flatten hierarchical columns](http://stackoverflow.com/questions/14507794/python-pandas-how-to-flatten-a-hierarchical-index-in-columns)
+
+
+
+## Missing data
+
+The [missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) docs.
+
+Fill forward a reversed timeseries.
+
+```python
+In [100]: df = pd.DataFrame(np.random.randn(6, 1),
+ .....: index=pd.date_range('2013-08-01', periods=6, freq='B'),
+ .....: columns=list('A'))
+ .....:
+
+In [101]: df.loc[df.index[3], 'A'] = np.nan
+
+In [102]: df
+Out[102]:
+ A
+2013-08-01 0.721555
+2013-08-02 -0.706771
+2013-08-05 -1.039575
+2013-08-06 NaN
+2013-08-07 -0.424972
+2013-08-08 0.567020
+
+In [103]: df.reindex(df.index[::-1]).ffill()
+Out[103]:
+ A
+2013-08-08 0.567020
+2013-08-07 -0.424972
+2013-08-06 -0.424972
+2013-08-05 -1.039575
+2013-08-02 -0.706771
+2013-08-01 0.721555
+```
+
+[Cumulative sum that resets at NaN values](http://stackoverflow.com/questions/18196811/cumsum-reset-at-nan)
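+
+One way to do this (a sketch, not the exact code from the linked answer) is to
+accumulate with the NaNs treated as zero and then subtract the running total
+reached at each NaN:
+
+```python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, np.nan, 5.0])
+
+running = s.fillna(0).cumsum()
+# at every NaN, remember the total reached so far and subtract it afterwards
+offset = running.where(s.isna()).ffill().fillna(0)
+result = running - offset
+print(result.tolist())  # [1.0, 3.0, 0.0, 3.0, 7.0, 0.0, 5.0]
+```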
+
+### Replace
+
+[Using replace with backrefs](http://stackoverflow.com/questions/16818871/extracting-value-and-creating-new-column-out-of-it)
+
+## Grouping
+
+The [grouping](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby) docs.
+
+[Basic grouping with apply](http://stackoverflow.com/questions/15322632/python-pandas-df-groupy-agg-column-reference-in-agg)
+
+Unlike agg, apply's callable is passed a sub-DataFrame, which gives you access to all the columns.
+
+```python
+In [104]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
+ .....: 'size': list('SSMMMLL'),
+ .....: 'weight': [8, 10, 11, 1, 20, 12, 12],
+ .....: 'adult': [False] * 5 + [True] * 2})
+ .....:
+
+In [105]: df
+Out[105]:
+ animal size weight adult
+0 cat S 8 False
+1 dog S 10 False
+2 cat M 11 False
+3 fish M 1 False
+4 dog M 20 False
+5 cat L 12 True
+6 cat L 12 True
+
+# List the size of the animals with the highest weight.
+In [106]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
+Out[106]:
+animal
+cat L
+dog M
+fish M
+dtype: object
+```
+
+[Using get_group](http://stackoverflow.com/questions/14734533/how-to-access-pandas-groupby-dataframe-by-key)
+
+```python
+In [107]: gb = df.groupby(['animal'])
+
+In [108]: gb.get_group('cat')
+Out[108]:
+ animal size weight adult
+0 cat S 8 False
+2 cat M 11 False
+5 cat L 12 True
+6 cat L 12 True
+```
+
+[Apply to different items in a group](http://stackoverflow.com/questions/15262134/apply-different-functions-to-different-items-in-group-object-python-pandas)
+
+```python
+In [109]: def GrowUp(x):
+ .....: avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
+ .....: avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
+ .....: avg_weight += sum(x[x['size'] == 'L'].weight)
+ .....: avg_weight /= len(x)
+ .....: return pd.Series(['L', avg_weight, True],
+ .....: index=['size', 'weight', 'adult'])
+ .....:
+
+In [110]: expected_df = gb.apply(GrowUp)
+
+In [111]: expected_df
+Out[111]:
+ size weight adult
+animal
+cat L 12.4375 True
+dog L 20.0000 True
+fish L 1.2500 True
+```
+
+[Expanding apply](http://stackoverflow.com/questions/14542145/reductions-down-a-column-in-pandas)
+
+```python
+In [112]: S = pd.Series([i / 100.0 for i in range(1, 11)])
+
+In [113]: def cum_ret(x, y):
+ .....: return x * (1 + y)
+ .....:
+
+In [114]: def red(x):
+ .....: return functools.reduce(cum_ret, x, 1.0)
+ .....:
+
+In [115]: S.expanding().apply(red, raw=True)
+Out[115]:
+0 1.010000
+1 1.030200
+2 1.061106
+3 1.103550
+4 1.158728
+5 1.228251
+6 1.314229
+7 1.419367
+8 1.547110
+9 1.701821
+dtype: float64
+```
+
+[Replacing some values with the mean of the rest of a group](http://stackoverflow.com/questions/14760757/replacing-values-with-groupby-means)
+
+```python
+In [116]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, -1, 1, 2]})
+
+In [117]: gb = df.groupby('A')
+
+In [118]: def replace(g):
+ .....: mask = g < 0
+ .....: return g.where(mask, g[~mask].mean())
+ .....:
+
+In [119]: gb.transform(replace)
+Out[119]:
+ B
+0 1.0
+1 -1.0
+2 1.5
+3 1.5
+```
+
+[Sort groups by aggregated data](http://stackoverflow.com/questions/14941366/pandas-sort-by-group-aggregate-and-column)
+
+```python
+In [120]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
+ .....: 'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
+ .....: 'flag': [False, True] * 3})
+ .....:
+
+In [121]: code_groups = df.groupby('code')
+
+In [122]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
+
+In [123]: sorted_df = df.loc[agg_n_sort_order.index]
+
+In [124]: sorted_df
+Out[124]:
+ code data flag
+1 bar -0.21 True
+4 bar -0.59 False
+0 foo 0.16 False
+3 foo 0.45 True
+2 baz 0.33 False
+5 baz 0.62 True
+```
+
+[Create multiple aggregated columns](http://stackoverflow.com/questions/14897100/create-multiple-columns-in-pandas-aggregation-function)
+
+```python
+In [125]: rng = pd.date_range(start="2014-10-07", periods=10, freq='2min')
+
+In [126]: ts = pd.Series(data=list(range(10)), index=rng)
+
+In [127]: def MyCust(x):
+ .....: if len(x) > 2:
+ .....: return x[1] * 1.234
+ .....: return pd.NaT
+ .....:
+
+In [128]: mhc = {'Mean': np.mean, 'Max': np.max, 'Custom': MyCust}
+
+In [129]: ts.resample("5min").apply(mhc)
+Out[129]:
+Mean 2014-10-07 00:00:00 1
+ 2014-10-07 00:05:00 3.5
+ 2014-10-07 00:10:00 6
+ 2014-10-07 00:15:00 8.5
+Max 2014-10-07 00:00:00 2
+ 2014-10-07 00:05:00 4
+ 2014-10-07 00:10:00 7
+ 2014-10-07 00:15:00 9
+Custom 2014-10-07 00:00:00 1.234
+ 2014-10-07 00:05:00 NaT
+ 2014-10-07 00:10:00 7.404
+ 2014-10-07 00:15:00 NaT
+dtype: object
+
+In [130]: ts
+Out[130]:
+2014-10-07 00:00:00 0
+2014-10-07 00:02:00 1
+2014-10-07 00:04:00 2
+2014-10-07 00:06:00 3
+2014-10-07 00:08:00 4
+2014-10-07 00:10:00 5
+2014-10-07 00:12:00 6
+2014-10-07 00:14:00 7
+2014-10-07 00:16:00 8
+2014-10-07 00:18:00 9
+Freq: 2T, dtype: int64
+```
+
+[Create a value-counts column and reassign it back to the DataFrame](http://stackoverflow.com/questions/17709270/i-want-to-create-a-column-of-value-counts-in-my-pandas-dataframe)
+
+```python
+In [131]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
+ .....: 'Value': [100, 150, 50, 50]})
+ .....:
+
+In [132]: df
+Out[132]:
+ Color Value
+0 Red 100
+1 Red 150
+2 Red 50
+3 Blue 50
+
+In [133]: df['Counts'] = df.groupby(['Color']).transform(len)
+
+In [134]: df
+Out[134]:
+ Color Value Counts
+0 Red 100 3
+1 Red 150 3
+2 Red 50 3
+3 Blue 50 1
+```
+
+[Shift groups of the values in a column based on the index](http://stackoverflow.com/q/23198053/190597)
+
+```python
+In [135]: df = pd.DataFrame({'line_race': [10, 10, 8, 10, 10, 8],
+ .....: 'beyer': [99, 102, 103, 103, 88, 100]},
+ .....: index=['Last Gunfighter', 'Last Gunfighter',
+ .....: 'Last Gunfighter', 'Paynter', 'Paynter',
+ .....: 'Paynter'])
+ .....:
+
+In [136]: df
+Out[136]:
+ line_race beyer
+Last Gunfighter 10 99
+Last Gunfighter 10 102
+Last Gunfighter 8 103
+Paynter 10 103
+Paynter 10 88
+Paynter 8 100
+
+In [137]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
+
+In [138]: df
+Out[138]:
+ line_race beyer beyer_shifted
+Last Gunfighter 10 99 NaN
+Last Gunfighter 10 102 99.0
+Last Gunfighter 8 103 102.0
+Paynter 10 103 NaN
+Paynter 10 88 103.0
+Paynter 8 100 88.0
+```
+
+[Select the row with the maximum value from each group](http://stackoverflow.com/q/26701849/190597)
+
+```python
+In [139]: df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
+ .....: 'service': ['mail', 'web', 'mail', 'mail', 'web'],
+ .....: 'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])
+ .....:
+
+In [140]: mask = df.groupby(level=0).agg('idxmax')
+
+In [141]: df_count = df.loc[mask['no']].reset_index()
+
+In [142]: df_count
+Out[142]:
+ host service no
+0 other web 2
+1 that mail 1
+2 this mail 2
+```
+
+[Grouping like Python's itertools.groupby](http://stackoverflow.com/q/29142487/846892)
+
+```python
+In [143]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
+
+In [144]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
+Out[144]:
+{1: Int64Index([0], dtype='int64'),
+ 2: Int64Index([1], dtype='int64'),
+ 3: Int64Index([2], dtype='int64'),
+ 4: Int64Index([3, 4, 5], dtype='int64'),
+ 5: Int64Index([6], dtype='int64'),
+ 6: Int64Index([7, 8], dtype='int64')}
+
+In [145]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
+Out[145]:
+0 0
+1 1
+2 0
+3 1
+4 2
+5 3
+6 0
+7 1
+8 2
+Name: A, dtype: int64
+```
+
+### Expanding data
+
+[Alignment and to-date](http://stackoverflow.com/questions/15489011/python-time-series-alignment-and-to-date-functions)
+
+[Rolling computation window based on values instead of counts](http://stackoverflow.com/questions/14300768/pandas-rolling-computation-with-window-based-on-values-instead-of-counts)
+
+[Rolling mean by time interval](http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval)
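+
+For the time-interval case, a minimal sketch (assuming a monotonic
+``DatetimeIndex``; the dates below are arbitrary):
+
+```python
+import numpy as np
+import pandas as pd
+
+idx = pd.date_range('2020-01-01', periods=8, freq='12H')
+s = pd.Series(np.arange(8, dtype=float), index=idx)
+
+# the window is measured in elapsed time ('2D'), not in a fixed number of rows
+print(s.rolling('2D').mean())
+```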
+
+### Splitting
+
+[Splitting a frame](http://stackoverflow.com/questions/13353233/best-way-to-split-a-dataframe-given-an-edge/15449992#15449992)
+
+Create a list of dataframes, split using a delineation based on logic included in rows.
+
+```python
+In [146]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A',
+ .....: 'A'],
+ .....: 'Data': np.random.randn(9)})
+ .....:
+
+In [147]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
+ .....: .rolling(window=3, min_periods=1).median())))[-1]
+ .....:
+
+In [148]: dfs[0]
+Out[148]:
+ Case Data
+0 A 0.276232
+1 A -1.087401
+2 A -0.673690
+3 B 0.113648
+
+In [149]: dfs[1]
+Out[149]:
+ Case Data
+4 A -1.478427
+5 A 0.524988
+6 B 0.404705
+
+In [150]: dfs[2]
+Out[150]:
+ Case Data
+7 A 0.577046
+8 A -1.715002
+```
+
+
+### Pivot
+
+The [pivot](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-pivot) docs.
+
+[Partial sums and subtotals](http://stackoverflow.com/questions/15570099/pandas-pivot-tables-row-subtotals/15574875#15574875)
+
+```python
+In [151]: df = pd.DataFrame(data={'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
+ .....: 'City': ['Toronto', 'Montreal', 'Vancouver',
+ .....: 'Calgary', 'Edmonton', 'Winnipeg',
+ .....: 'Windsor'],
+ .....: 'Sales': [13, 6, 16, 8, 4, 3, 1]})
+ .....:
+
+In [152]: table = pd.pivot_table(df, values=['Sales'], index=['Province'],
+ .....: columns=['City'], aggfunc=np.sum, margins=True)
+ .....:
+
+In [153]: table.stack('City')
+Out[153]:
+ Sales
+Province City
+AL All 12.0
+ Calgary 8.0
+ Edmonton 4.0
+BC All 16.0
+ Vancouver 16.0
+... ...
+All Montreal 6.0
+ Toronto 13.0
+ Vancouver 16.0
+ Windsor 1.0
+ Winnipeg 3.0
+
+[20 rows x 1 columns]
+```
+
+[Frequency table like plyr in R](http://stackoverflow.com/questions/15589354/frequency-tables-in-pandas-like-plyr-in-r)
+
+```python
+In [154]: grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
+
+In [155]: df = pd.DataFrame({'ID': ["x%d" % r for r in range(10)],
+ .....: 'Gender': ['F', 'M', 'F', 'M', 'F',
+ .....: 'M', 'F', 'M', 'M', 'M'],
+ .....: 'ExamYear': ['2007', '2007', '2007', '2008', '2008',
+ .....: '2008', '2008', '2009', '2009', '2009'],
+ .....: 'Class': ['algebra', 'stats', 'bio', 'algebra',
+ .....: 'algebra', 'stats', 'stats', 'algebra',
+ .....: 'bio', 'bio'],
+ .....: 'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
+ .....: 'yes', 'yes', 'yes', 'yes', 'yes'],
+ .....: 'Passed': ['yes' if x > 50 else 'no' for x in grades],
+ .....: 'Employed': [True, True, True, False,
+ .....: False, False, False, True, True, False],
+ .....: 'Grade': grades})
+ .....:
+
+In [156]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
+ .....: 'Passed': lambda x: sum(x == 'yes'),
+ .....: 'Employed': lambda x: sum(x),
+ .....: 'Grade': lambda x: sum(x) / len(x)})
+ .....:
+Out[156]:
+ Participated Passed Employed Grade
+ExamYear
+2007 3 2 3 74.000000
+2008 3 3 0 68.500000
+2009 3 2 2 60.666667
+```
+
+[Plot pandas DataFrame with year over year data](http://stackoverflow.com/questions/30379789/plot-pandas-data-frame-with-year-over-year-data)
+
+To create a year and month cross-tabulation:
+
+```python
+In [157]: df = pd.DataFrame({'value': np.random.randn(36)},
+ .....: index=pd.date_range('2011-01-01', freq='M', periods=36))
+ .....:
+
+In [158]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
+ .....: values='value', aggfunc='sum')
+ .....:
+Out[158]:
+ 2011 2012 2013
+1 -1.039268 -0.968914 2.565646
+2 -0.370647 -1.294524 1.431256
+3 -1.157892 0.413738 1.340309
+4 -1.344312 0.276662 -1.170299
+5 0.844885 -0.472035 -0.226169
+6 1.075770 -0.013960 0.410835
+7 -0.109050 -0.362543 0.813850
+8 1.643563 -0.006154 0.132003
+9 -1.469388 -0.923061 -0.827317
+10 0.357021 0.895717 -0.076467
+11 -0.674600 0.805244 -1.187678
+12 -1.776904 -1.206412 1.130127
+```
+
+### Apply
+
+[Turning embedded lists into a MultiIndex frame](http://stackoverflow.com/questions/17349981/converting-pandas-dataframe-with-categorical-values-into-binary-values)
+
+```python
+In [159]: df = pd.DataFrame(data={'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
+ .....: 'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
+ .....: index=['I', 'II', 'III'])
+ .....:
+
+In [160]: def SeriesFromSubList(aList):
+ .....: return pd.Series(aList)
+ .....:
+
+In [161]: df_orgz = pd.concat({ind: row.apply(SeriesFromSubList)
+ .....: for ind, row in df.iterrows()})
+ .....:
+
+In [162]: df_orgz
+Out[162]:
+ 0 1 2 3
+I A 2 4 8 16.0
+ B a b c NaN
+II A 100 200 NaN NaN
+ B jj kk NaN NaN
+III A 10 20 30 NaN
+ B ccc NaN NaN NaN
+```
+
+[Rolling apply with a DataFrame returning a Series](http://stackoverflow.com/questions/19121854/using-rolling-apply-on-a-dataframe-object)
+
+Rolling Apply to multiple columns where function calculates a Series before a Scalar from the Series is returned
+
+```python
+In [163]: df = pd.DataFrame(data=np.random.randn(2000, 2) / 10000,
+ .....: index=pd.date_range('2001-01-01', periods=2000),
+ .....: columns=['A', 'B'])
+ .....:
+
+In [164]: df
+Out[164]:
+ A B
+2001-01-01 -0.000144 -0.000141
+2001-01-02 0.000161 0.000102
+2001-01-03 0.000057 0.000088
+2001-01-04 -0.000221 0.000097
+2001-01-05 -0.000201 -0.000041
+... ... ...
+2006-06-19 0.000040 -0.000235
+2006-06-20 -0.000123 -0.000021
+2006-06-21 -0.000113 0.000114
+2006-06-22 0.000136 0.000109
+2006-06-23 0.000027 0.000030
+
+[2000 rows x 2 columns]
+
+In [165]: def gm(df, const):
+ .....: v = ((((df.A + df.B) + 1).cumprod()) - 1) * const
+ .....: return v.iloc[-1]
+ .....:
+
+In [166]: s = pd.Series({df.index[i]: gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
+ .....: for i in range(len(df) - 50)})
+ .....:
+
+In [167]: s
+Out[167]:
+2001-01-01 0.000930
+2001-01-02 0.002615
+2001-01-03 0.001281
+2001-01-04 0.001117
+2001-01-05 0.002772
+ ...
+2006-04-30 0.003296
+2006-05-01 0.002629
+2006-05-02 0.002081
+2006-05-03 0.004247
+2006-05-04 0.003928
+Length: 1950, dtype: float64
+```
+
+[Rolling apply with a DataFrame returning a Scalar](http://stackoverflow.com/questions/21040766/python-pandas-rolling-apply-two-column-input-into-function/21045831#21045831)
+
+Rolling Apply to multiple columns where function returns a Scalar (Volume Weighted Average Price)
+
+```python
+In [168]: rng = pd.date_range(start='2014-01-01', periods=100)
+
+In [169]: df = pd.DataFrame({'Open': np.random.randn(len(rng)),
+ .....: 'Close': np.random.randn(len(rng)),
+ .....: 'Volume': np.random.randint(100, 2000, len(rng))},
+ .....: index=rng)
+ .....:
+
+In [170]: df
+Out[170]:
+ Open Close Volume
+2014-01-01 -1.611353 -0.492885 1219
+2014-01-02 -3.000951 0.445794 1054
+2014-01-03 -0.138359 -0.076081 1381
+2014-01-04 0.301568 1.198259 1253
+2014-01-05 0.276381 -0.669831 1728
+... ... ... ...
+2014-04-06 -0.040338 0.937843 1188
+2014-04-07 0.359661 -0.285908 1864
+2014-04-08 0.060978 1.714814 941
+2014-04-09 1.759055 -0.455942 1065
+2014-04-10 0.138185 -1.147008 1453
+
+[100 rows x 3 columns]
+
+In [171]: def vwap(bars):
+ .....: return ((bars.Close * bars.Volume).sum() / bars.Volume.sum())
+ .....:
+
+In [172]: window = 5
+
+In [173]: s = pd.concat([(pd.Series(vwap(df.iloc[i:i + window]),
+ .....: index=[df.index[i + window]]))
+ .....: for i in range(len(df) - window)])
+ .....:
+
+In [174]: s.round(2)
+Out[174]:
+2014-01-06 0.02
+2014-01-07 0.11
+2014-01-08 0.10
+2014-01-09 0.07
+2014-01-10 -0.29
+ ...
+2014-04-06 -0.63
+2014-04-07 -0.02
+2014-04-08 -0.03
+2014-04-09 0.34
+2014-04-10 0.29
+Length: 95, dtype: float64
+```
+
+## Timeseries
+
+[Dropping rows outside of a time range](http://stackoverflow.com/questions/14539992/pandas-drop-rows-outside-of-time-range)
+
+[Using indexer between time](http://stackoverflow.com/questions/17559885/pandas-dataframe-mask-based-on-index)
+
+[Constructing a datetime range that excludes weekends and includes only certain times](http://stackoverflow.com/questions/24010830/pandas-generate-sequential-timestamp-with-jump/24014440#24014440?)
+
+[Vectorized lookup](http://stackoverflow.com/questions/13893227/vectorized-look-up-of-values-in-pandas-dataframe)
+
+[Aggregation and plotting time series](http://nipunbatra.github.io/2015/06/timeseries/)
+
+Turn a matrix with hours in columns and days in rows into a continuous row sequence in the form of a time series. [How to rearrange a Python pandas DataFrame?](http://stackoverflow.com/questions/15432659/how-to-rearrange-a-python-pandas-dataframe)
+
+[Dealing with duplicates when reindexing a timeseries to a specified frequency](http://stackoverflow.com/questions/22244383/pandas-df-refill-adding-two-columns-of-different-shape)
+
+Calculate the first day of the month for each entry in a DatetimeIndex:
+
+```python
+In [175]: dates = pd.date_range('2000-01-01', periods=5)
+
+In [176]: dates.to_period(freq='M').to_timestamp()
+Out[176]:
+DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01',
+ '2000-01-01'],
+ dtype='datetime64[ns]', freq=None)
+```
+
+
+
+### Resampling
+
+The [resample](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-resampling) docs.
+
+[Using Grouper instead of TimeGrouper for time grouping of values](https://stackoverflow.com/questions/15297053/how-can-i-divide-single-values-of-a-dataframe-by-monthly-averages)
+
+[Time grouping with some missing values](https://stackoverflow.com/questions/33637312/pandas-grouper-by-frequency-with-completeness-requirement)
+
+[Valid frequency arguments to Grouper](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
+
+[Grouping using a MultiIndex](https://stackoverflow.com/questions/41483763/pandas-timegrouper-on-multiindex)
+
+[Using TimeGrouper and another grouping to create subgroups, then apply a custom function](https://github.com/pandas-dev/pandas/issues/3791)
+
+[Resampling with custom periods](http://stackoverflow.com/questions/15408156/resampling-with-custom-periods)
+
+[Resample intraday frame without adding new days](http://stackoverflow.com/questions/14898574/resample-intrday-pandas-dataframe-without-add-new-days)
+
+[Resample minute data](http://stackoverflow.com/questions/14861023/resampling-minute-data)
+
+[Resample with groupby](http://stackoverflow.com/q/18677271/564538)
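+
+As a minimal illustration of time-based grouping with ``pd.Grouper`` (the data
+below is arbitrary):
+
+```python
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame({'value': np.arange(10, dtype=float)},
+                  index=pd.date_range('2020-01-01', periods=10, freq='10D'))
+
+# group rows into calendar months and aggregate each group
+print(df.groupby(pd.Grouper(freq='M')).sum())
+```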
+
+
+## Merge
+
+The [Concat](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-concatenation) docs. The [Join](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join) docs.
+
+[Append two dataframes with overlapping index (emulate R's rbind)](http://stackoverflow.com/questions/14988480/pandas-version-of-rbind)
+
+```python
+In [177]: rng = pd.date_range('2000-01-01', periods=6)
+
+In [178]: df1 = pd.DataFrame(np.random.randn(6, 3), index=rng, columns=['A', 'B', 'C'])
+
+In [179]: df2 = df1.copy()
+```
+
+Depending on how the df was constructed, ``ignore_index`` may be needed:
+
+```python
+In [180]: df = df1.append(df2, ignore_index=True)
+
+In [181]: df
+Out[181]:
+ A B C
+0 -0.870117 -0.479265 -0.790855
+1 0.144817 1.726395 -0.464535
+2 -0.821906 1.597605 0.187307
+3 -0.128342 -1.511638 -0.289858
+4 0.399194 -1.430030 -0.639760
+5 1.115116 -2.012600 1.810662
+6 -0.870117 -0.479265 -0.790855
+7 0.144817 1.726395 -0.464535
+8 -0.821906 1.597605 0.187307
+9 -0.128342 -1.511638 -0.289858
+10 0.399194 -1.430030 -0.639760
+11 1.115116 -2.012600 1.810662
+```
+
+[Self join of a DataFrame](https://github.com/pandas-dev/pandas/issues/2996)
+
+```python
+In [182]: df = pd.DataFrame(data={'Area': ['A'] * 5 + ['C'] * 2,
+ .....: 'Bins': [110] * 2 + [160] * 3 + [40] * 2,
+ .....: 'Test_0': [0, 1, 0, 1, 2, 0, 1],
+ .....: 'Data': np.random.randn(7)})
+ .....:
+
+In [183]: df
+Out[183]:
+ Area Bins Test_0 Data
+0 A 110 0 -0.433937
+1 A 110 1 -0.160552
+2 A 160 0 0.744434
+3 A 160 1 1.754213
+4 A 160 2 0.000850
+5 C 40 0 0.342243
+6 C 40 1 1.070599
+
+In [184]: df['Test_1'] = df['Test_0'] - 1
+
+In [185]: pd.merge(df, df, left_on=['Bins', 'Area', 'Test_0'],
+ .....: right_on=['Bins', 'Area', 'Test_1'],
+ .....: suffixes=('_L', '_R'))
+ .....:
+Out[185]:
+ Area Bins Test_0_L Data_L Test_1_L Test_0_R Data_R Test_1_R
+0 A 110 0 -0.433937 -1 1 -0.160552 0
+1 A 160 0 0.744434 -1 1 1.754213 0
+2 A 160 1 1.754213 0 2 0.000850 1
+3 C 40 0 0.342243 -1 1 1.070599 0
+```
+
+[How to set the index and join](http://stackoverflow.com/questions/14341805/pandas-merge-pd-merge-how-to-set-the-index-and-join)
+
+[KDB-like asof join](http://stackoverflow.com/questions/12322289/kdb-like-asof-join-for-timeseries-data-in-pandas/12336039#12336039)
+
+[Join with a criteria based on the values](http://stackoverflow.com/questions/15581829/how-to-perform-an-inner-or-outer-join-of-dataframes-with-pandas-on-non-simplisti)
+
+[Using searchsorted to merge based on values inside a range](http://stackoverflow.com/questions/25125626/pandas-merge-with-logic/2512764)
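+
+For asof-style joins, pandas also provides ``pd.merge_asof``; a minimal sketch
+(the column names and timestamps are made up):
+
+```python
+import pandas as pd
+
+trades = pd.DataFrame({'time': pd.to_datetime(['2020-01-01 10:00:01',
+                                               '2020-01-01 10:00:03']),
+                       'price': [100.0, 101.0]})
+quotes = pd.DataFrame({'time': pd.to_datetime(['2020-01-01 10:00:00',
+                                               '2020-01-01 10:00:02']),
+                       'bid': [99.5, 100.5]})
+
+# for each trade, take the most recent quote at or before its timestamp
+print(pd.merge_asof(trades, quotes, on='time'))
+```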
+
+
+
+## Plotting
+The [plotting](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization) docs.
+
+[Make Matplotlib look like R](http://stackoverflow.com/questions/14349055/making-matplotlib-graphs-look-like-r-by-default)
+
+[Setting x-axis major and minor labels](http://stackoverflow.com/questions/12945971/pandas-timeseries-plot-setting-x-axis-major-and-minor-ticks-and-labels)
+
+[Plotting multiple charts in an IPython notebook](http://stackoverflow.com/questions/16392921/make-more-than-one-chart-in-same-ipython-notebook-cell)
+
+[Creating a multi-line plot](http://stackoverflow.com/questions/16568964/make-a-multiline-plot-from-csv-file-in-matplotlib)
+
+[Plotting a heatmap](http://stackoverflow.com/questions/17050202/plot-timeseries-of-histograms-in-python)
+
+[Annotate a time-series plot](http://stackoverflow.com/questions/11067368/annotate-time-series-plot-in-matplotlib)
+
+[Annotate a time-series plot #2](http://stackoverflow.com/questions/17891493/annotating-points-from-a-pandas-dataframe-in-matplotlib-plot)
+
+[Generate embedded plots in Excel files using Pandas, Vincent and xlsxwriter](https://pandas-xlsxwriter-charts.readthedocs.io/)
+
+[Boxplot for each quartile of a stratifying variable](http://stackoverflow.com/questions/23232989/boxplot-stratified-by-column-in-python-pandas)
+
+```python
+In [186]: df = pd.DataFrame(
+ .....: {'stratifying_var': np.random.uniform(0, 100, 20),
+ .....: 'price': np.random.normal(100, 5, 20)})
+ .....:
+
+In [187]: df['quartiles'] = pd.qcut(
+ .....: df['stratifying_var'],
+ .....: 4,
+ .....: labels=['0-25%', '25-50%', '50-75%', '75-100%'])
+ .....:
+
+In [188]: df.boxplot(column='price', by='quartiles')
+Out[188]:
+```
+
+
+
+## Data in/out
+
+[Performance comparison of SQL vs HDF5](http://stackoverflow.com/questions/16628329/hdf5-and-sqlite-concurrency-compression-i-o-performance)
+
+
+### CSV
+
+The [CSV](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table) docs
+
+[read_csv in action](http://wesmckinney.com/blog/update-on-upcoming-pandas-v0-10-new-file-parser-other-performance-wins/)
+
+[Appending to a csv](http://stackoverflow.com/questions/17134942/pandas-dataframe-output-end-of-csv)
+
+[Reading a csv chunk-by-chunk](http://stackoverflow.com/questions/11622652/large-persistent-dataframe-in-pandas/12193309#12193309)
+
+[Reading only certain rows of a csv chunk-by-chunk](http://stackoverflow.com/questions/19674212/pandas-data-frame-select-rows-and-clear-memory)
+
+[Reading the first few lines of a frame](http://stackoverflow.com/questions/15008970/way-to-read-first-few-lines-for-pandas-dataframe)
+
+Reading a file that is compressed but not by ``gzip`` or ``bz2`` (the native compression formats which ``read_csv`` understands). This example reads a ``WinZip``-compressed file, and more generally shows how to open the file inside a context manager and use that handle to read from it. [See here](http://stackoverflow.com/questions/17789907/pandas-convert-winzipped-csv-file-to-data-frame)
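+
+A sketch of that pattern using the standard library's ``zipfile`` (the file
+names are placeholders):
+
+```python
+import zipfile
+
+import pandas as pd
+
+# open the archive with a context manager and hand the member's file handle
+# straight to read_csv
+with zipfile.ZipFile('archive.zip') as zf:
+    with zf.open('data.csv') as fh:
+        df = pd.read_csv(fh)
+```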
+
+[Inferring dtypes from a file](http://stackoverflow.com/questions/15555005/get-inferred-dataframe-types-iteratively-using-chunksize)
+
+[Dealing with bad lines](http://github.com/pandas-dev/pandas/issues/2886)
+
+[Dealing with bad lines II](http://nipunbatra.github.io/2013/06/reading-unclean-data-csv-using-pandas/)
+
+[Reading CSV with Unix timestamps and converting to local timezone](http://nipunbatra.github.io/2013/06/pandas-reading-csv-with-unix-timestamps-and-converting-to-local-timezone/)
+
+[Write a multi-row index CSV without writing duplicates](http://stackoverflow.com/questions/17349574/pandas-write-multiindex-rows-with-to-csv)
+
+
+#### Reading multiple files to create a single DataFrame
+
+The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all of the individual frames into a list, and then combine the frames in the list using ``pd.concat()``:
+
+```python
+In [189]: for i in range(3):
+ .....: data = pd.DataFrame(np.random.randn(10, 4))
+ .....: data.to_csv('file_{}.csv'.format(i))
+ .....:
+
+In [190]: files = ['file_0.csv', 'file_1.csv', 'file_2.csv']
+
+In [191]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
+```
+
+You can use the same approach to read all files matching a pattern. Here is an example using ``glob``:
+
+```python
+In [192]: import glob
+
+In [193]: import os
+
+In [194]: files = glob.glob('file_*.csv')
+
+In [195]: result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
+```
+
+Finally, this strategy will work with the other ``pd.read_*(...)`` functions described in the [io docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io).
+
+#### Parsing date components in multi-columns
+
+Parsing date components spread across multiple columns is faster when done with a format string:
+
+```python
+In [196]: i = pd.date_range('20000101', periods=10000)
+
+In [197]: df = pd.DataFrame({'year': i.year, 'month': i.month, 'day': i.day})
+
+In [198]: df.head()
+Out[198]:
+ year month day
+0 2000 1 1
+1 2000 1 2
+2 2000 1 3
+3 2000 1 4
+4 2000 1 5
+
+In [199]: %timeit pd.to_datetime(df.year * 10000 + df.month * 100 + df.day, format='%Y%m%d')
+ .....: ds = df.apply(lambda x: "%04d%02d%02d" % (x['year'],
+ .....: x['month'], x['day']), axis=1)
+ .....: ds.head()
+ .....: %timeit pd.to_datetime(ds)
+ .....:
+10.6 ms +- 698 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+3.21 ms +- 36.4 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+```
+
+#### Skipping rows between the header and the data
+
+```python
+In [200]: data = """;;;;
+ .....: ;;;;
+ .....: ;;;;
+ .....: ;;;;
+ .....: ;;;;
+ .....: ;;;;
+ .....: ;;;;
+ .....: ;;;;
+ .....: ;;;;
+ .....: ;;;;
+ .....: date;Param1;Param2;Param4;Param5
+ .....: ;m²;°C;m²;m
+ .....: ;;;;
+ .....: 01.01.1990 00:00;1;1;2;3
+ .....: 01.01.1990 01:00;5;3;4;5
+ .....: 01.01.1990 02:00;9;5;6;7
+ .....: 01.01.1990 03:00;13;7;8;9
+ .....: 01.01.1990 04:00;17;9;10;11
+ .....: 01.01.1990 05:00;21;11;12;13
+ .....: """
+ .....:
+```
+
+##### Option 1: pass rows explicitly to skip
+
+```python
+In [201]: from io import StringIO
+
+In [202]: pd.read_csv(StringIO(data), sep=';', skiprows=[11, 12],
+ .....: index_col=0, parse_dates=True, header=10)
+ .....:
+Out[202]:
+ Param1 Param2 Param4 Param5
+date
+1990-01-01 00:00:00 1 1 2 3
+1990-01-01 01:00:00 5 3 4 5
+1990-01-01 02:00:00 9 5 6 7
+1990-01-01 03:00:00 13 7 8 9
+1990-01-01 04:00:00 17 9 10 11
+1990-01-01 05:00:00 21 11 12 13
+```
+
+##### Option 2: read the column names, then the data
+
+```python
+In [203]: pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns
+Out[203]: Index(['date', 'Param1', 'Param2', 'Param4', 'Param5'], dtype='object')
+
+In [204]: columns = pd.read_csv(StringIO(data), sep=';', header=10, nrows=10).columns
+
+In [205]: pd.read_csv(StringIO(data), sep=';', index_col=0,
+ .....: header=12, parse_dates=True, names=columns)
+ .....:
+Out[205]:
+ Param1 Param2 Param4 Param5
+date
+1990-01-01 00:00:00 1 1 2 3
+1990-01-01 01:00:00 5 3 4 5
+1990-01-01 02:00:00 9 5 6 7
+1990-01-01 03:00:00 13 7 8 9
+1990-01-01 04:00:00 17 9 10 11
+1990-01-01 05:00:00 21 11 12 13
+```
+
+### SQL
+
+The [SQL](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql) docs
+
+[Reading from databases with SQL](http://stackoverflow.com/questions/10065051/python-pandas-and-databases-like-mysql)
+
+### Excel
+
+The [Excel](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-excel) docs
+
+[Reading from a file-like handle](http://stackoverflow.com/questions/15588713/sheets-of-excel-workbook-from-a-url-into-a-pandas-dataframe)
+
+[Modifying formatting in XlsxWriter output](http://pbpython.com/improve-pandas-excel-output.html)
+
+### HTML
+
+[Reading HTML tables from a server that cannot handle the default request header](http://stackoverflow.com/a/18939272/564538)
+
+### HDFStore
+
+The [HDFStores](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-hdf5) docs
+
+[Simple queries with a Timestamp index](http://stackoverflow.com/questions/13926089/selecting-columns-from-pandas-hdfstore-table)
+
+[Managing heterogeneous data using a linked multiple table hierarchy](http://github.com/pandas-dev/pandas/issues/3032)
+
+[Merging on-disk tables with millions of rows](http://stackoverflow.com/questions/14614512/merging-two-tables-with-millions-of-rows-in-python/14617925#14617925)
+
+[Avoiding inconsistencies when writing to a store from multiple processes/threads](http://stackoverflow.com/a/29014295/2858145)
+
+De-duplicating a large store by chunks is essentially a recursive reduction operation. [Here](http://stackoverflow.com/questions/16110252/need-to-compare-very-large-files-around-1-5gb-in-python/16110391#16110391) is a function that takes data from a CSV file chunk-by-chunk, parses the dates, and writes it out to a store chunk-by-chunk.
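+
+A sketch of that chunk-by-chunk pattern (file, key and column names are
+placeholders; HDF5 support requires PyTables):
+
+```python
+import pandas as pd
+
+with pd.HDFStore('store.h5') as store:
+    # read the CSV in manageable pieces, parse the dates, and append each piece
+    for chunk in pd.read_csv('big.csv', chunksize=100000, parse_dates=['date']):
+        store.append('data', chunk, data_columns=['date'])
+```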
+
+[Creating a store chunk-by-chunk from a csv file](http://stackoverflow.com/questions/20428355/appending-column-to-frame-of-hdf-file-in-pandas/20428786#20428786)
+
+[Appending to a store, while creating a unique index](http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397)
+
+[Large data workflows](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas)
+
+[Reading in a sequence of files, then providing a global unique index to a store while appending](http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural)
+
+[Groupby on a HDFStore with low group density](http://stackoverflow.com/questions/15798209/pandas-group-by-query-on-large-data-in-hdfstore)
+
+[Groupby on a HDFStore with high group density](http://stackoverflow.com/questions/25459982/trouble-with-grouby-on-millions-of-keys-on-a-chunked-file-in-python-pandas/25471765#25471765)
+
+[Hierarchical queries on a HDFStore](http://stackoverflow.com/questions/22777284/improve-query-performance-from-a-large-hdfstore-table-with-pandas/22820780#22820780)
+
+[Counting with a HDFStore](http://stackoverflow.com/questions/20497897/converting-dict-of-dicts-into-pandas-dataframe-memory-issues)
+
+[Troubleshooting HDFStore exceptions](http://stackoverflow.com/questions/15488809/how-to-trouble-shoot-hdfstore-exception-cannot-find-the-correct-atom-type)
+
+[Setting min_itemsize with strings](http://stackoverflow.com/questions/15988871/hdfstore-appendstring-dataframe-fails-when-string-column-contents-are-longer)
+
+[Using ptrepack to create a completely-sorted-index on a store](http://stackoverflow.com/questions/17893370/ptrepack-sortby-needs-full-index)
+
+Storing attributes to a group node:
+
+```python
+In [206]: df = pd.DataFrame(np.random.randn(8, 3))
+
+In [207]: store = pd.HDFStore('test.h5')
+
+In [208]: store.put('df', df)
+
+# you can store an arbitrary Python object via pickle
+In [209]: store.get_storer('df').attrs.my_attribute = {'A': 10}
+
+In [210]: store.get_storer('df').attrs.my_attribute
+Out[210]: {'A': 10}
+```
+
+
+
+### Binary files
+
+pandas readily accepts NumPy record arrays, if you need to read in a binary file consisting of an array of C structs. For example, given the following C program in a file called ``main.c``, compiled with ``gcc main.c -std=gnu99`` on a 64-bit machine,
+
+```c
+#include <stdio.h>
+#include <stdint.h>
+
+typedef struct _Data
+{
+ int32_t count;
+ double avg;
+ float scale;
+} Data;
+
+int main(int argc, const char *argv[])
+{
+ size_t n = 10;
+ Data d[n];
+
+ for (int i = 0; i < n; ++i)
+ {
+ d[i].count = i;
+ d[i].avg = i + 1.0;
+ d[i].scale = (float) i + 2.0f;
+ }
+
+ FILE *file = fopen("binary.dat", "wb");
+ fwrite(&d, sizeof(Data), n, file);
+ fclose(file);
+
+ return 0;
+}
+```
+
+the following Python code will read the binary file ``binary.dat`` into a pandas ``DataFrame``, where each element of the struct corresponds to a column in the frame:
+
+```python
+names = 'count', 'avg', 'scale'
+
+# note: the offsets are larger than the sizes of the types because of struct padding
+offsets = 0, 8, 16
+formats = 'i4', 'f8', 'f4'
+dt = np.dtype({'names': names, 'offsets': offsets, 'formats': formats},
+ align=True)
+df = pd.DataFrame(np.fromfile('binary.dat', dt))
+```
+
+::: tip Note
+
+The offsets of the structure elements may differ depending on the architecture of the machine on which the file was created. Using a raw binary file format like this for general data storage is not recommended, as it is not cross-platform. We recommend either HDF5 or msgpack, both of which are supported by pandas' IO facilities.
+
+:::
+
+
+## Computation
+
+[Numerical integration (sample-based) of a time series](http://nbviewer.ipython.org/5720498)
+
+### Correlation
+
+It is often useful to obtain the lower (or upper) triangular form of a correlation matrix calculated from [``DataFrame.corr()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html#pandas.DataFrame.corr). This can be achieved by passing a boolean mask to ``where``, as follows:
+
+```python
+In [211]: df = pd.DataFrame(np.random.random(size=(100, 5)))
+
+In [212]: corr_mat = df.corr()
+
+In [213]: mask = np.tril(np.ones_like(corr_mat, dtype=np.bool), k=-1)
+
+In [214]: corr_mat.where(mask)
+Out[214]:
+ 0 1 2 3 4
+0 NaN NaN NaN NaN NaN
+1 -0.018923 NaN NaN NaN NaN
+2 -0.076296 -0.012464 NaN NaN NaN
+3 -0.169941 -0.289416 0.076462 NaN NaN
+4 0.064326 0.018759 -0.084140 -0.079859 NaN
+```
+
+In addition to the named correlation types, ``DataFrame.corr`` accepts a callable. Here we compute the [distance correlation](https://en.wikipedia.org/wiki/Distance_correlation) matrix for a ``DataFrame`` object.
+
+```python
+In [215]: def distcorr(x, y):
+ .....: n = len(x)
+ .....: a = np.zeros(shape=(n, n))
+ .....: b = np.zeros(shape=(n, n))
+ .....: for i in range(n):
+ .....: for j in range(i + 1, n):
+ .....: a[i, j] = abs(x[i] - x[j])
+ .....: b[i, j] = abs(y[i] - y[j])
+ .....: a += a.T
+ .....: b += b.T
+ .....: a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
+ .....: b_bar = np.vstack([np.nanmean(b, axis=0)] * n)
+ .....: A = a - a_bar - a_bar.T + np.full(shape=(n, n), fill_value=a_bar.mean())
+ .....: B = b - b_bar - b_bar.T + np.full(shape=(n, n), fill_value=b_bar.mean())
+ .....: cov_ab = np.sqrt(np.nansum(A * B)) / n
+ .....: std_a = np.sqrt(np.sqrt(np.nansum(A**2)) / n)
+ .....: std_b = np.sqrt(np.sqrt(np.nansum(B**2)) / n)
+ .....: return cov_ab / std_a / std_b
+ .....:
+
+In [216]: df = pd.DataFrame(np.random.normal(size=(100, 3)))
+
+In [217]: df.corr(method=distcorr)
+Out[217]:
+ 0 1 2
+0 1.000000 0.199653 0.214871
+1 0.199653 1.000000 0.195116
+2 0.214871 0.195116 1.000000
+```
+
+## Timedeltas
+
+The [Timedeltas](https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html#timedeltas-timedeltas) docs.
+
+[Using timedeltas](http://github.com/pandas-dev/pandas/pull/2899)
+
+```python
+In [218]: import datetime
+
+In [219]: s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
+
+In [220]: s - s.max()
+Out[220]:
+0 -2 days
+1 -1 days
+2 0 days
+dtype: timedelta64[ns]
+
+In [221]: s.max() - s
+Out[221]:
+0 2 days
+1 1 days
+2 0 days
+dtype: timedelta64[ns]
+
+In [222]: s - datetime.datetime(2011, 1, 1, 3, 5)
+Out[222]:
+0 364 days 20:55:00
+1 365 days 20:55:00
+2 366 days 20:55:00
+dtype: timedelta64[ns]
+
+In [223]: s + datetime.timedelta(minutes=5)
+Out[223]:
+0 2012-01-01 00:05:00
+1 2012-01-02 00:05:00
+2 2012-01-03 00:05:00
+dtype: datetime64[ns]
+
+In [224]: datetime.datetime(2011, 1, 1, 3, 5) - s
+Out[224]:
+0 -365 days +03:05:00
+1 -366 days +03:05:00
+2 -367 days +03:05:00
+dtype: timedelta64[ns]
+
+In [225]: datetime.timedelta(minutes=5) + s
+Out[225]:
+0 2012-01-01 00:05:00
+1 2012-01-02 00:05:00
+2 2012-01-03 00:05:00
+dtype: datetime64[ns]
+```
+
+[Adding and subtracting deltas and dates](http://stackoverflow.com/questions/16385785/add-days-to-dates-in-dataframe)
+
+```python
+In [226]: deltas = pd.Series([datetime.timedelta(days=i) for i in range(3)])
+
+In [227]: df = pd.DataFrame({'A': s, 'B': deltas})
+
+In [228]: df
+Out[228]:
+ A B
+0 2012-01-01 0 days
+1 2012-01-02 1 days
+2 2012-01-03 2 days
+
+In [229]: df['New Dates'] = df['A'] + df['B']
+
+In [230]: df['Delta'] = df['A'] - df['New Dates']
+
+In [231]: df
+Out[231]:
+ A B New Dates Delta
+0 2012-01-01 0 days 2012-01-01 0 days
+1 2012-01-02 1 days 2012-01-03 -1 days
+2 2012-01-03 2 days 2012-01-05 -2 days
+
+In [232]: df.dtypes
+Out[232]:
+A datetime64[ns]
+B timedelta64[ns]
+New Dates datetime64[ns]
+Delta timedelta64[ns]
+dtype: object
+```
+
+[Another example](http://stackoverflow.com/questions/15683588/iterating-through-a-pandas-dataframe)
+
+Values can be set to ``NaT`` using ``np.nan``, similar to datetime.
+
+```python
+In [233]: y = s - s.shift()
+
+In [234]: y
+Out[234]:
+0 NaT
+1 1 days
+2 1 days
+dtype: timedelta64[ns]
+
+In [235]: y[1] = np.nan
+
+In [236]: y
+Out[236]:
+0 NaT
+1 NaT
+2 1 days
+dtype: timedelta64[ns]
+```
+
+## Aliasing axis names
+
+To globally provide aliases for axis names, one can define these two functions:
+
+```python
+In [237]: def set_axis_alias(cls, axis, alias):
+ .....: if axis not in cls._AXIS_NUMBERS:
+ .....: raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
+ .....: cls._AXIS_ALIASES[alias] = axis
+ .....:
+In [238]: def clear_axis_alias(cls, axis, alias):
+ .....: if axis not in cls._AXIS_NUMBERS:
+ .....: raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
+ .....: cls._AXIS_ALIASES.pop(alias, None)
+ .....:
+In [239]: set_axis_alias(pd.DataFrame, 'columns', 'myaxis2')
+
+In [240]: df2 = pd.DataFrame(np.random.randn(3, 2), columns=['c1', 'c2'],
+ .....: index=['i1', 'i2', 'i3'])
+ .....:
+
+In [241]: df2.sum(axis='myaxis2')
+Out[241]:
+i1 -0.461013
+i2 2.040016
+i3 0.904681
+dtype: float64
+
+In [242]: clear_axis_alias(pd.DataFrame, 'columns', 'myaxis2')
+```
+
+## Creating example data
+
+To create a DataFrame from every combination of some given values, like R's ``expand.grid()`` function, we can create a dict where the keys are column names and the values are lists of the data values:
+
+```python
+In [243]: def expand_grid(data_dict):
+ .....: rows = itertools.product(*data_dict.values())
+ .....: return pd.DataFrame.from_records(rows, columns=data_dict.keys())
+ .....:
+
+In [244]: df = expand_grid({'height': [60, 70],
+ .....: 'weight': [100, 140, 180],
+ .....: 'sex': ['Male', 'Female']})
+ .....:
+
+In [245]: df
+Out[245]:
+ height weight sex
+0 60 100 Male
+1 60 100 Female
+2 60 140 Male
+3 60 140 Female
+4 60 180 Male
+5 60 180 Female
+6 70 100 Male
+7 70 100 Female
+8 70 140 Male
+9 70 140 Female
+10 70 180 Male
+11 70 180 Female
+```
\ No newline at end of file
diff --git a/Python/pandas/user_guide/enhancingperf.md b/Python/pandas/user_guide/enhancingperf.md
new file mode 100644
index 00000000..099eea77
--- /dev/null
+++ b/Python/pandas/user_guide/enhancingperf.md
@@ -0,0 +1,984 @@
+# Enhancing performance
+
+In this part of the tutorial, we will investigate how to speed up certain
+functions operating on pandas ``DataFrames`` using three different techniques:
+Cython, Numba and [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval). We will see a speed improvement of ~200x
+when we use Cython and Numba on a test function operating row-wise on the
+``DataFrame``. Using [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) we will speed up a sum by a factor of
+~2.
+
+## Cython (writing C extensions for pandas)
+
+For many use cases writing pandas in pure Python and NumPy is sufficient. In some
+computationally heavy applications however, it can be possible to achieve sizable
+speed-ups by offloading work to [cython](http://cython.org/).
+
+This tutorial assumes you have refactored as much as possible in Python, for example
+by trying to remove for-loops and making use of NumPy vectorization. It’s always worth
+optimising in Python first.
+
+This tutorial walks through a “typical” process of cythonizing a slow computation.
+We use an [example from the Cython documentation](http://docs.cython.org/src/quickstart/cythonize.html)
+but in the context of pandas. Our final cythonized solution is around 100 times
+faster than the pure Python solution.
+
+### Pure Python
+
+We have a ``DataFrame`` to which we want to apply a function row-wise.
+
+``` python
+In [1]: df = pd.DataFrame({'a': np.random.randn(1000),
+ ...: 'b': np.random.randn(1000),
+ ...: 'N': np.random.randint(100, 1000, (1000)),
+ ...: 'x': 'x'})
+ ...:
+
+In [2]: df
+Out[2]:
+ a b N x
+0 0.469112 -0.218470 585 x
+1 -0.282863 -0.061645 841 x
+2 -1.509059 -0.723780 251 x
+3 -1.135632 0.551225 972 x
+4 1.212112 -0.497767 181 x
+.. ... ... ... ..
+995 -1.512743 0.874737 374 x
+996 0.933753 1.120790 246 x
+997 -0.308013 0.198768 157 x
+998 -0.079915 1.757555 977 x
+999 -1.010589 -1.115680 770 x
+
+[1000 rows x 4 columns]
+```
+
+Here’s the function in pure Python:
+
+``` python
+In [3]: def f(x):
+ ...: return x * (x - 1)
+ ...:
+
+In [4]: def integrate_f(a, b, N):
+ ...: s = 0
+ ...: dx = (b - a) / N
+ ...: for i in range(N):
+ ...: s += f(a + i * dx)
+ ...: return s * dx
+ ...:
+```
+
+We achieve our result by using ``apply`` (row-wise):
+
+``` python
+In [7]: %timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
+10 loops, best of 3: 174 ms per loop
+```
+
+But clearly this isn’t fast enough for us. Let’s take a look and see where the
+time is spent during this operation (limited to the most time consuming
+four calls) using the [prun ipython magic function](http://ipython.org/ipython-doc/stable/api/generated/IPython.core.magics.execution.html#IPython.core.magics.execution.ExecutionMagics.prun):
+
+``` python
+In [5]: %prun -l 4 df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1) # noqa E999
+ 672332 function calls (667306 primitive calls) in 0.285 seconds
+
+ Ordered by: internal time
+ List reduced from 221 to 4 due to restriction <4>
+
+ ncalls tottime percall cumtime percall filename:lineno(function)
+ 1000 0.144 0.000 0.217 0.000 :1(integrate_f)
+ 552423 0.074 0.000 0.074 0.000 :1(f)
+ 3000 0.008 0.000 0.045 0.000 base.py:4695(get_value)
+ 6001 0.005 0.000 0.012 0.000 {pandas._libs.lib.values_from_object}
+```
+
+By far the majority of time is spent inside either ``integrate_f`` or ``f``,
+hence we’ll concentrate our efforts on cythonizing these two functions.
+
+::: tip Note
+
+In Python 2, replacing the ``range`` with its lazy counterpart (``xrange``)
+would mean the ``range`` line would vanish from the profile. In Python 3, ``range`` is already lazy.
+
+:::
+
+### Plain Cython
+
+First we’re going to need to import the Cython magic function to ipython:
+
+``` python
+In [6]: %load_ext Cython
+```
+
+Now, let’s simply copy our functions over to Cython as is (the ``_plain`` suffix
+is here to distinguish between function versions):
+
+``` python
+In [7]: %%cython
+ ...: def f_plain(x):
+ ...: return x * (x - 1)
+ ...: def integrate_f_plain(a, b, N):
+ ...: s = 0
+ ...: dx = (b - a) / N
+ ...: for i in range(N):
+ ...: s += f_plain(a + i * dx)
+ ...: return s * dx
+ ...:
+```
+
+::: tip Note
+
+If you’re having trouble pasting the above into your ipython, you may need
+to be using bleeding edge ipython for paste to play well with cell magics.
+
+:::
+
+``` python
+In [4]: %timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)
+10 loops, best of 3: 85.5 ms per loop
+```
+
+Already this has roughly halved the runtime, not too bad for a simple copy and paste.
+
+### Adding type
+
+We get another huge improvement simply by providing type information:
+
+``` python
+In [8]: %%cython
+ ...: cdef double f_typed(double x) except? -2:
+ ...: return x * (x - 1)
+ ...: cpdef double integrate_f_typed(double a, double b, int N):
+ ...: cdef int i
+ ...: cdef double s, dx
+ ...: s = 0
+ ...: dx = (b - a) / N
+ ...: for i in range(N):
+ ...: s += f_typed(a + i * dx)
+ ...: return s * dx
+ ...:
+```
+
+``` python
+In [4]: %timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
+10 loops, best of 3: 20.3 ms per loop
+```
+
+Now, we’re talking! It’s now more than eight times faster than the original Python
+implementation, and we haven’t *really* modified the code. Let’s have another
+look at what’s eating up time:
+
+``` python
+In [9]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
+ 119905 function calls (114879 primitive calls) in 0.096 seconds
+
+ Ordered by: internal time
+ List reduced from 216 to 4 due to restriction <4>
+
+ ncalls tottime percall cumtime percall filename:lineno(function)
+ 3000 0.012 0.000 0.064 0.000 base.py:4695(get_value)
+ 6001 0.007 0.000 0.017 0.000 {pandas._libs.lib.values_from_object}
+ 3000 0.007 0.000 0.073 0.000 series.py:1061(__getitem__)
+ 3000 0.006 0.000 0.006 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}
+```
+
+### Using ndarray
+
+It’s calling ``Series``… a lot! It’s creating a Series from each row, and getting values from both
+the index and the series (three times for each row). Function calls are expensive
+in Python, so maybe we could minimize these by cythonizing the apply part.
+
+::: tip Note
+
+We are now passing ndarrays into the Cython function, fortunately Cython plays
+very nicely with NumPy.
+
+:::
+
+``` python
+In [10]: %%cython
+ ....: cimport numpy as np
+ ....: import numpy as np
+ ....: cdef double f_typed(double x) except? -2:
+ ....: return x * (x - 1)
+ ....: cpdef double integrate_f_typed(double a, double b, int N):
+ ....: cdef int i
+ ....: cdef double s, dx
+ ....: s = 0
+ ....: dx = (b - a) / N
+ ....: for i in range(N):
+ ....: s += f_typed(a + i * dx)
+ ....: return s * dx
+ ....: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
+ ....: np.ndarray col_N):
+ ....: assert (col_a.dtype == np.float
+ ....: and col_b.dtype == np.float and col_N.dtype == np.int)
+ ....: cdef Py_ssize_t i, n = len(col_N)
+ ....: assert (len(col_a) == len(col_b) == n)
+ ....: cdef np.ndarray[double] res = np.empty(n)
+ ....: for i in range(len(col_a)):
+ ....: res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
+ ....: return res
+ ....:
+```
+
+The implementation is simple: it allocates a result array and loops over
+the rows, applying our ``integrate_f_typed`` and storing the results in that array.
+
+::: danger Warning
+
+You can **not pass** a ``Series`` directly as a ``ndarray`` typed parameter
+to a Cython function. Instead pass the actual ``ndarray`` using the
+[``Series.to_numpy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy). The reason is that the Cython
+definition is specific to an ndarray and not the passed ``Series``.
+
+So, do not do this:
+
+``` python
+apply_integrate_f(df['a'], df['b'], df['N'])
+```
+
+But rather, use [``Series.to_numpy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy) to get the underlying ``ndarray``:
+
+``` python
+apply_integrate_f(df['a'].to_numpy(),
+ df['b'].to_numpy(),
+ df['N'].to_numpy())
+```
+
+:::
+
+::: tip Note
+
+Loops like this would be *extremely* slow in Python, but in Cython looping
+over NumPy arrays is *fast*.
+
+:::
+
+``` python
+In [4]: %timeit apply_integrate_f(df['a'].to_numpy(),
+ df['b'].to_numpy(),
+ df['N'].to_numpy())
+1000 loops, best of 3: 1.25 ms per loop
+```
+
+We’ve gotten another big improvement. Let’s check again where the time is spent:
+
+``` python
+In [11]: %prun -l 4 apply_integrate_f(df['a'].to_numpy(),
+   ....:                              df['b'].to_numpy(),
+   ....:                              df['N'].to_numpy())
+   ....:
+```
+
+As one might expect, the majority of the time is now spent in ``apply_integrate_f``,
+so if we want to squeeze out any more performance we must continue to concentrate our
+efforts here.
+
+### More advanced techniques
+
+There is still hope for improvement. Here’s an example of using some more
+advanced Cython techniques:
+
+``` python
+In [12]: %%cython
+ ....: cimport cython
+ ....: cimport numpy as np
+ ....: import numpy as np
+ ....: cdef double f_typed(double x) except? -2:
+ ....: return x * (x - 1)
+ ....: cpdef double integrate_f_typed(double a, double b, int N):
+ ....: cdef int i
+ ....: cdef double s, dx
+ ....: s = 0
+ ....: dx = (b - a) / N
+ ....: for i in range(N):
+ ....: s += f_typed(a + i * dx)
+ ....: return s * dx
+ ....: @cython.boundscheck(False)
+ ....: @cython.wraparound(False)
+ ....: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a,
+ ....: np.ndarray[double] col_b,
+ ....: np.ndarray[int] col_N):
+ ....: cdef int i, n = len(col_N)
+ ....: assert len(col_a) == len(col_b) == n
+ ....: cdef np.ndarray[double] res = np.empty(n)
+ ....: for i in range(n):
+ ....: res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
+ ....: return res
+ ....:
+```
+
+``` python
+In [4]: %timeit apply_integrate_f_wrap(df['a'].to_numpy(),
+ df['b'].to_numpy(),
+ df['N'].to_numpy())
+1000 loops, best of 3: 987 us per loop
+```
+
+Even faster, with the caveat that a bug in our Cython code (an off-by-one error,
+for example) might cause a segfault because memory access isn’t checked.
+For more about ``boundscheck`` and ``wraparound``, see the Cython docs on
+[compiler directives](http://cython.readthedocs.io/en/latest/src/reference/compilation.html?highlight=wraparound#compiler-directives).
+
+## Using Numba
+
+A recent alternative to statically compiling Cython code is to use a *dynamic JIT compiler*, Numba.
+
+Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
+
+Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
+
+::: tip Note
+
+You will need to install Numba. This is easy with ``conda``, by using: ``conda install numba``, see [installing using miniconda](https://pandas.pydata.org/pandas-docs/stable/install.html#install-miniconda).
+
+:::
+
+::: tip Note
+
+As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.
+
+:::
+
+### Jit
+
+We demonstrate how to use Numba to just-in-time compile our code. We simply
+take the plain Python code from above and annotate with the ``@jit`` decorator.
+
+``` python
+import numba
+
+
+@numba.jit
+def f_plain(x):
+ return x * (x - 1)
+
+
+@numba.jit
+def integrate_f_numba(a, b, N):
+ s = 0
+ dx = (b - a) / N
+ for i in range(N):
+ s += f_plain(a + i * dx)
+ return s * dx
+
+
+@numba.jit
+def apply_integrate_f_numba(col_a, col_b, col_N):
+ n = len(col_N)
+ result = np.empty(n, dtype='float64')
+ assert len(col_a) == len(col_b) == n
+ for i in range(n):
+ result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
+ return result
+
+
+def compute_numba(df):
+ result = apply_integrate_f_numba(df['a'].to_numpy(),
+ df['b'].to_numpy(),
+ df['N'].to_numpy())
+ return pd.Series(result, index=df.index, name='result')
+```
+
+Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a
+nicer interface by passing/returning pandas objects.
+
+``` python
+In [4]: %timeit compute_numba(df)
+1000 loops, best of 3: 798 us per loop
+```
+
+In this example, using Numba was faster than Cython.
+
+### Vectorize
+
+Numba can also be used to write vectorized functions that do not require the user to explicitly
+loop over the observations of a vector; a vectorized function will be applied to each row automatically.
+Consider the following toy example of doubling each observation:
+
+``` python
+import numba
+
+
+def double_every_value_nonumba(x):
+ return x * 2
+
+
+@numba.vectorize
+def double_every_value_withnumba(x): # noqa E501
+ return x * 2
+```
+
+``` python
+# Custom function without numba
+In [5]: %timeit df['col1_doubled'] = df.a.apply(double_every_value_nonumba) # noqa E501
+1000 loops, best of 3: 797 us per loop
+
+# Standard implementation (faster than a custom function)
+In [6]: %timeit df['col1_doubled'] = df.a * 2
+1000 loops, best of 3: 233 us per loop
+
+# Custom function with numba
+In [7]: %timeit df['col1_doubled'] = double_every_value_withnumba(df.a.to_numpy())
+1000 loops, best of 3: 145 us per loop
+```
+
+### Caveats
+
+::: tip Note
+
+Numba will execute on any function, but can only accelerate certain classes of functions.
+
+:::
+
+Numba is best at accelerating functions that apply numerical functions to NumPy
+arrays. When passed a function that only uses operations it knows how to
+accelerate, it will execute in ``nopython`` mode.
+
+If Numba is passed a function that includes something it doesn’t know how to
+work with – a category that currently includes sets, lists, dictionaries, or
+string functions – it will revert to ``object mode``. In ``object mode``,
+Numba will execute but your code will not speed up significantly. If you would
+prefer that Numba throw an error if it cannot compile a function in a way that
+speeds up your code, pass Numba the argument
+``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on
+troubleshooting Numba modes, see the [Numba troubleshooting page](http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#the-compiled-code-is-too-slow).
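+
+As a minimal sketch (the function and variable names below are illustrative),
+requesting ``nopython`` mode explicitly looks like this; if the body contained
+something Numba cannot compile, the decorator would raise an error instead of
+silently falling back to object mode:
+
+``` python
+import numba
+import numpy as np
+
+
+@numba.jit(nopython=True)
+def summed(arr):
+    # a simple numeric loop over a NumPy array compiles in nopython mode
+    total = 0.0
+    for value in arr:
+        total += value
+    return total
+
+
+summed(np.arange(10, dtype='float64'))  # 45.0
+```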
+
+Read more in the [Numba docs](http://numba.pydata.org/).
+
+## Expression evaluation via ``eval()``
+
+The top-level function [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) implements expression evaluation of
+[``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) and [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) objects.
+
+::: tip Note
+
+To benefit from using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) you need to
+install ``numexpr``. See the [recommended dependencies section](https://pandas.pydata.org/pandas-docs/stable/install.html#install-recommended-dependencies) for more details.
+
+:::
+
+The point of using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) for expression evaluation rather than
+plain Python is two-fold: 1) large [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) objects are
+evaluated more efficiently and 2) large arithmetic and boolean expressions are
+evaluated all at once by the underlying engine (by default ``numexpr`` is used
+for evaluation).
+
+::: tip Note
+
+You should not use [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) for simple
+expressions or for expressions involving small DataFrames. In fact,
+[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) is many orders of magnitude slower for
+smaller expressions/objects than plain ol’ Python. A good rule of thumb is
+to only use [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) when you have a
+``DataFrame`` with more than 10,000 rows.
+
+:::
+
+[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) supports all arithmetic expressions supported by the
+engine in addition to some extensions available only in pandas.
+
+::: tip Note
+
+The larger the frame and the larger the expression the more speedup you will
+see from using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval).
+
+:::
+
+### Supported syntax
+
+These operations are supported by [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval):
+
+- Arithmetic operations except for the left shift (``<<``) and right shift
+(``>>``) operators, e.g., ``df + 2 * pi / s ** 4 % 42 - the_golden_ratio``
+- Comparison operations, including chained comparisons, e.g., ``2 < df < df2``
+- Boolean operations, e.g., ``df < df2 and df3 < df4 or not df_bool``
+- ``list`` and ``tuple`` literals, e.g., ``[1, 2]`` or ``(1, 2)``
+- Attribute access, e.g., ``df.a``
+- Subscript expressions, e.g., ``df[0]``
+- Simple variable evaluation, e.g., ``pd.eval('df')`` (this is not very useful)
+- Math functions: *sin*, *cos*, *exp*, *log*, *expm1*, *log1p*,
+*sqrt*, *sinh*, *cosh*, *tanh*, *arcsin*, *arccos*, *arctan*, *arccosh*,
+*arcsinh*, *arctanh*, *abs*, *arctan2* and *log10*.
+
+This Python syntax is **not** allowed:
+
+- Expressions
+ - Function calls other than math functions.
+ - ``is``/``is not`` operations
+ - ``if`` expressions
+ - ``lambda`` expressions
+ - ``list``/``set``/``dict`` comprehensions
+ - Literal ``dict`` and ``set`` expressions
+ - ``yield`` expressions
+ - Generator expressions
+ - Boolean expressions consisting of only scalar values
+
+- Statements
+
+ - Neither [simple](https://docs.python.org/3/reference/simple_stmts.html)
+ nor [compound](https://docs.python.org/3/reference/compound_stmts.html)
+ statements are allowed. This includes things like ``for``, ``while``, and
+ ``if``.
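+
+As a quick, minimal sketch of the rules above (the frame ``df`` here is purely
+illustrative, and ``numexpr`` is assumed to be installed as noted earlier), a
+chained comparison combined with ``and`` is accepted, whereas a ``lambda`` or a
+list comprehension inside the expression would raise:
+
+``` python
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
+
+# the chained comparison and the ``and`` keyword are rewritten by the parser
+mask = pd.eval('-1 < df.a < 1 and df.b > 0')
+
+# by contrast, pd.eval('[x for x in df.a]') or pd.eval('lambda x: x')
+# would raise, since comprehensions and lambdas are not allowed
+```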
+
+### ``eval()`` examples
+
+[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) works well with expressions containing large arrays.
+
+First let’s create a few decent-sized arrays to play with:
+
+``` python
+In [13]: nrows, ncols = 20000, 100
+
+In [14]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]
+```
+
+Now let’s compare adding them together using plain ol’ Python versus
+[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval):
+
+``` python
+In [15]: %timeit df1 + df2 + df3 + df4
+21 ms +- 787 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
+```
+
+``` python
+In [16]: %timeit pd.eval('df1 + df2 + df3 + df4')
+8.12 ms +- 249 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+```
+
+Now let’s do the same thing but with comparisons:
+
+``` python
+In [17]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
+272 ms +- 6.92 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
+```
+
+``` python
+In [18]: %timeit pd.eval('(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)')
+19.2 ms +- 1.87 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
+```
+
+[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) also works with unaligned pandas objects:
+
+``` python
+In [19]: s = pd.Series(np.random.randn(50))
+
+In [20]: %timeit df1 + df2 + df3 + df4 + s
+103 ms +- 12.7 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
+```
+
+``` python
+In [21]: %timeit pd.eval('df1 + df2 + df3 + df4 + s')
+10.2 ms +- 215 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+```
+
+::: tip Note
+
+Operations such as
+
+``` python
+1 and 2 # would parse to 1 & 2, but should evaluate to 2
+3 or 4 # would parse to 3 | 4, but should evaluate to 3
+~1 # this is okay, but slower when using eval
+```
+
+should be performed in Python. An exception will be raised if you try to
+perform any boolean/bitwise operations with scalar operands that are not
+of type ``bool`` or ``np.bool_``. Again, you should perform these kinds of
+operations in plain Python.
+
+:::
+
+### The ``DataFrame.eval`` method
+
+In addition to the top level [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) function you can also
+evaluate an expression in the “context” of a [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame).
+
+``` python
+In [22]: df = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
+
+In [23]: df.eval('a + b')
+Out[23]:
+0 -0.246747
+1 0.867786
+2 -1.626063
+3 -1.134978
+4 -1.027798
+dtype: float64
+```
+
+Any expression that is a valid [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) expression is also a valid
+[``DataFrame.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eval.html#pandas.DataFrame.eval) expression, with the added benefit that you don’t have to
+prefix the name of the [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) to the column(s) you’re
+interested in evaluating.
+
+In addition, you can perform assignment of columns within an expression.
+This allows for *formulaic evaluation*. The assignment target can be a
+new column name or an existing column name, and it must be a valid Python
+identifier.
+
+*New in version 0.18.0.*
+
+The ``inplace`` keyword determines whether this assignment will be performed
+on the original ``DataFrame`` or return a copy with the new column.
+
+::: danger Warning
+
+For backwards compatibility, ``inplace`` defaults to ``True`` if not
+specified. This will change in a future version of pandas - if your
+code depends on an inplace assignment you should update to explicitly
+set ``inplace=True``.
+
+:::
+
+``` python
+In [24]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
+
+In [25]: df.eval('c = a + b', inplace=True)
+
+In [26]: df.eval('d = a + b + c', inplace=True)
+
+In [27]: df.eval('a = 1', inplace=True)
+
+In [28]: df
+Out[28]:
+ a b c d
+0 1 5 5 10
+1 1 6 7 14
+2 1 7 9 18
+3 1 8 11 22
+4 1 9 13 26
+```
+
+When ``inplace`` is set to ``False``, a copy of the ``DataFrame`` with the
+new or modified columns is returned and the original frame is unchanged.
+
+``` python
+In [29]: df
+Out[29]:
+ a b c d
+0 1 5 5 10
+1 1 6 7 14
+2 1 7 9 18
+3 1 8 11 22
+4 1 9 13 26
+
+In [30]: df.eval('e = a - c', inplace=False)
+Out[30]:
+ a b c d e
+0 1 5 5 10 -4
+1 1 6 7 14 -6
+2 1 7 9 18 -8
+3 1 8 11 22 -10
+4 1 9 13 26 -12
+
+In [31]: df
+Out[31]:
+ a b c d
+0 1 5 5 10
+1 1 6 7 14
+2 1 7 9 18
+3 1 8 11 22
+4 1 9 13 26
+```
+
+*New in version 0.18.0.*
+
+As a convenience, multiple assignments can be performed by using a
+multi-line string.
+
+``` python
+In [32]: df.eval("""
+ ....: c = a + b
+ ....: d = a + b + c
+ ....: a = 1""", inplace=False)
+ ....:
+Out[32]:
+ a b c d
+0 1 5 6 12
+1 1 6 7 14
+2 1 7 8 16
+3 1 8 9 18
+4 1 9 10 20
+```
+
+The equivalent in standard Python would be
+
+``` python
+In [33]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
+
+In [34]: df['c'] = df.a + df.b
+
+In [35]: df['d'] = df.a + df.b + df.c
+
+In [36]: df['a'] = 1
+
+In [37]: df
+Out[37]:
+ a b c d
+0 1 5 5 10
+1 1 6 7 14
+2 1 7 9 18
+3 1 8 11 22
+4 1 9 13 26
+```
+
+*New in version 0.18.0.*
+
+The ``query`` method gained the ``inplace`` keyword which determines
+whether the query modifies the original frame.
+
+``` python
+In [38]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
+
+In [39]: df.query('a > 2')
+Out[39]:
+ a b
+3 3 8
+4 4 9
+
+In [40]: df.query('a > 2', inplace=True)
+
+In [41]: df
+Out[41]:
+ a b
+3 3 8
+4 4 9
+```
+
+::: danger Warning
+
+Unlike with ``eval``, the default value for ``inplace`` for ``query``
+is ``False``. This is consistent with prior versions of pandas.
+
+:::
+
+### Local variables
+
+You must *explicitly reference* any local variable that you want to use in an
+expression by placing the ``@`` character in front of the name. For example,
+
+``` python
+In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=list('ab'))
+
+In [43]: newcol = np.random.randn(len(df))
+
+In [44]: df.eval('b + @newcol')
+Out[44]:
+0 -0.173926
+1 2.493083
+2 -0.881831
+3 -0.691045
+4 1.334703
+dtype: float64
+
+In [45]: df.query('b < @newcol')
+Out[45]:
+ a b
+0 0.863987 -0.115998
+2 -2.621419 -1.297879
+```
+
+If you don’t prefix the local variable with ``@``, pandas will raise an
+exception telling you the variable is undefined.
+
+When using [``DataFrame.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eval.html#pandas.DataFrame.eval) and [``DataFrame.query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query), this allows you
+to have a local variable and a [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) column with the same
+name in an expression.
+
+``` python
+In [46]: a = np.random.randn()
+
+In [47]: df.query('@a < a')
+Out[47]:
+ a b
+0 0.863987 -0.115998
+
+In [48]: df.loc[a < df.a] # same as the previous expression
+Out[48]:
+ a b
+0 0.863987 -0.115998
+```
+
+With [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) you cannot use the ``@`` prefix *at all*, because it
+isn’t defined in that context. ``pandas`` will let you know this if you try to
+use ``@`` in a top-level call to [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval). For example,
+
+``` python
+In [49]: a, b = 1, 2
+
+In [50]: pd.eval('@a + b')
+Traceback (most recent call last):
+
+ File "/opt/conda/envs/pandas/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3325, in run_code
+ exec(code_obj, self.user_global_ns, self.user_ns)
+
+ File "", line 1, in
+ pd.eval('@a + b')
+
+ File "/pandas/pandas/core/computation/eval.py", line 311, in eval
+ _check_for_locals(expr, level, parser)
+
+ File "/pandas/pandas/core/computation/eval.py", line 166, in _check_for_locals
+ raise SyntaxError(msg)
+
+ File "", line unknown
+SyntaxError: The '@' prefix is not allowed in top-level eval calls,
+please refer to your variables by name without the '@' prefix
+```
+
+In this case, you should simply refer to the variables like you would in
+standard Python.
+
+``` python
+In [51]: pd.eval('a + b')
+Out[51]: 3
+```
+
+### ``pandas.eval()`` parsers
+
+There are two different parsers and two different engines you can use as
+the backend.
+
+The default ``'pandas'`` parser allows a more intuitive syntax for expressing
+query-like operations (comparisons, conjunctions and disjunctions). In
+particular, the precedence of the ``&`` and ``|`` operators is made equal to
+the precedence of the corresponding boolean operations ``and`` and ``or``.
+
+For example, the above conjunction can be written without parentheses.
+Alternatively, you can use the ``'python'`` parser to enforce strict Python
+semantics.
+
+``` python
+In [52]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'
+
+In [53]: x = pd.eval(expr, parser='python')
+
+In [54]: expr_no_parens = 'df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0'
+
+In [55]: y = pd.eval(expr_no_parens, parser='pandas')
+
+In [56]: np.all(x == y)
+Out[56]: True
+```
+
+The same expression can be “anded” together with the word [``and``](https://docs.python.org/3/reference/expressions.html#and) as
+well:
+
+``` python
+In [57]: expr = '(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)'
+
+In [58]: x = pd.eval(expr, parser='python')
+
+In [59]: expr_with_ands = 'df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0'
+
+In [60]: y = pd.eval(expr_with_ands, parser='pandas')
+
+In [61]: np.all(x == y)
+Out[61]: True
+```
+
+The ``and`` and ``or`` operators here have the same precedence that they would
+in vanilla Python.
+
+### ``pandas.eval()`` backends
+
+There’s also the option to make [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) operate identically to plain
+ol’ Python.
+
+::: tip Note
+
+Using the ``'python'`` engine is generally *not* useful, except for testing
+other evaluation engines against it. You will achieve **no** performance
+benefits using [``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) with ``engine='python'`` and in fact may
+incur a performance hit.
+
+:::
+
+You can see this by using [``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) with the ``'python'`` engine. It
+is a bit slower (not by much) than evaluating the same expression in Python
+
+``` python
+In [62]: %timeit df1 + df2 + df3 + df4
+9.5 ms +- 241 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+```
+
+``` python
+In [63]: %timeit pd.eval('df1 + df2 + df3 + df4', engine='python')
+10.8 ms +- 898 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+```
+
+### ``pandas.eval()`` performance
+
+[``eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) is intended to speed up certain kinds of operations. In
+particular, those operations involving complex expressions with large
+[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame)/[``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) objects should see a
+significant performance benefit. Here is a plot showing the running time of
+[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) as function of the size of the frame involved in the
+computation. The two lines are two different engines.
+
+
+
+::: tip Note
+
+Operations with smallish objects (around 15k-20k rows) are faster using
+plain Python:
+
+
+
+:::
+
+This plot was created using a ``DataFrame`` with 3 columns each containing
+floating point values generated using ``numpy.random.randn()``.
+
+### Technical minutia regarding expression evaluation
+
+Expressions that would result in an object dtype or involve datetime operations
+(because of ``NaT``) must be evaluated in Python space. The main reason for
+this behavior is to maintain backwards compatibility with versions of NumPy <
+1.7. In those versions of NumPy a call to ``ndarray.astype(str)`` will
+truncate any strings that are more than 60 characters in length. Second, we
+can’t pass ``object`` arrays to ``numexpr`` thus string comparisons must be
+evaluated in Python space.
+
+The upshot is that this *only* applies to object-dtype expressions. So, if
+you have an expression–for example
+
+``` python
+In [64]: df = pd.DataFrame({'strings': np.repeat(list('cba'), 3),
+ ....: 'nums': np.repeat(range(3), 3)})
+ ....:
+
+In [65]: df
+Out[65]:
+ strings nums
+0 c 0
+1 c 0
+2 c 0
+3 b 1
+4 b 1
+5 b 1
+6 a 2
+7 a 2
+8 a 2
+
+In [66]: df.query('strings == "a" and nums == 1')
+Out[66]:
+Empty DataFrame
+Columns: [strings, nums]
+Index: []
+```
+
+the numeric part of the comparison (``nums == 1``) will be evaluated by
+``numexpr``.
+
+In general, [``DataFrame.query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query)/[``pandas.eval()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) will
+evaluate the subexpressions that *can* be evaluated by ``numexpr`` and those
+that must be evaluated in Python space transparently to the user. This is done
+by inferring the result type of an expression from its arguments and operators.
+
\ No newline at end of file
diff --git a/Python/pandas/user_guide/gotchas.md b/Python/pandas/user_guide/gotchas.md
new file mode 100644
index 00000000..86f5f018
--- /dev/null
+++ b/Python/pandas/user_guide/gotchas.md
@@ -0,0 +1,429 @@
+# Frequently Asked Questions (FAQ)
+
+## DataFrame memory usage
+
+The memory usage of a ``DataFrame`` (including the index) is shown when calling
+the [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info). A configuration option, ``display.memory_usage``
+(see [the list of options](options.html#options-available)), specifies if the
+``DataFrame``’s memory usage will be displayed when invoking the ``df.info()``
+method.
+
+For example, the memory usage of the ``DataFrame`` below is shown
+when calling [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info):
+
+``` python
+In [1]: dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
+ ...: 'complex128', 'object', 'bool']
+ ...:
+
+In [2]: n = 5000
+
+In [3]: data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}
+
+In [4]: df = pd.DataFrame(data)
+
+In [5]: df['categorical'] = df['object'].astype('category')
+
+In [6]: df.info()
+
+RangeIndex: 5000 entries, 0 to 4999
+Data columns (total 8 columns):
+int64 5000 non-null int64
+float64 5000 non-null float64
+datetime64[ns] 5000 non-null datetime64[ns]
+timedelta64[ns] 5000 non-null timedelta64[ns]
+complex128 5000 non-null complex128
+object 5000 non-null object
+bool 5000 non-null bool
+categorical 5000 non-null category
+dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
+memory usage: 289.1+ KB
+```
+
+The ``+`` symbol indicates that the true memory usage could be higher, because
+pandas does not count the memory used by values in columns with
+``dtype=object``.
+
+Passing ``memory_usage='deep'`` will enable a more accurate memory usage report,
+accounting for the full usage of the contained objects. This is optional
+as it can be expensive to do this deeper introspection.
+
+``` python
+In [7]: df.info(memory_usage='deep')
+
+RangeIndex: 5000 entries, 0 to 4999
+Data columns (total 8 columns):
+int64 5000 non-null int64
+float64 5000 non-null float64
+datetime64[ns] 5000 non-null datetime64[ns]
+timedelta64[ns] 5000 non-null timedelta64[ns]
+complex128 5000 non-null complex128
+object 5000 non-null object
+bool 5000 non-null bool
+categorical 5000 non-null category
+dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
+memory usage: 425.6 KB
+```
+
+By default the display option is set to ``True`` but can be explicitly
+overridden by passing the ``memory_usage`` argument when invoking ``df.info()``.
+
+The memory usage of each column can be found by calling the
+[``memory_usage()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html#pandas.DataFrame.memory_usage) method. This returns a ``Series`` with an index
+represented by column names and memory usage of each column shown in bytes. For
+the ``DataFrame`` above, the memory usage of each column and the total memory
+usage can be found with the ``memory_usage`` method:
+
+``` python
+In [8]: df.memory_usage()
+Out[8]:
+Index 128
+int64 40000
+float64 40000
+datetime64[ns] 40000
+timedelta64[ns] 40000
+complex128 80000
+object 40000
+bool 5000
+categorical 10920
+dtype: int64
+
+# total memory usage of dataframe
+In [9]: df.memory_usage().sum()
+Out[9]: 296048
+```
+
+By default the memory usage of the ``DataFrame``’s index is shown in the
+returned ``Series``; the memory usage of the index can be suppressed by passing
+the ``index=False`` argument:
+
+``` python
+In [10]: df.memory_usage(index=False)
+Out[10]:
+int64 40000
+float64 40000
+datetime64[ns] 40000
+timedelta64[ns] 40000
+complex128 80000
+object 40000
+bool 5000
+categorical 10920
+dtype: int64
+```
+
+The memory usage displayed by the [``info()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) method utilizes the
+[``memory_usage()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html#pandas.DataFrame.memory_usage) method to determine the memory usage of a
+``DataFrame`` while also formatting the output in human-readable units (base-2
+representation; i.e. 1KB = 1024 bytes).
+
+See also [Categorical Memory Usage](categorical.html#categorical-memory).
+
+## Using if/truth statements with pandas
+
+pandas follows the NumPy convention of raising an error when you try to convert
+something to a ``bool``. This happens in an ``if``-statement or when using the
+boolean operations: ``and``, ``or``, and ``not``. It is not clear what the result
+of the following code should be:
+
+``` python
+>>> if pd.Series([False, True, False]):
+... pass
+```
+
+Should it be ``True`` because it’s not zero-length, or ``False`` because there
+are ``False`` values? It is unclear, so instead, pandas raises a ``ValueError``:
+
+``` python
+>>> if pd.Series([False, True, False]):
+... print("I was true")
+Traceback
+ ...
+ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
+```
+
+You need to explicitly choose what you want to do with the ``DataFrame``, e.g.
+use [``any()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html#pandas.DataFrame.any), [``all()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.all.html#pandas.DataFrame.all) or [``empty``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.empty.html#pandas.DataFrame.empty).
+Alternatively, you might want to compare if the pandas object is ``None``:
+
+``` python
+>>> if pd.Series([False, True, False]) is not None:
+... print("I was not None")
+I was not None
+```
+
+Below is how to check if any of the values are ``True``:
+
+``` python
+>>> if pd.Series([False, True, False]).any():
+... print("I am any")
+I am any
+```
+
+To evaluate single-element pandas objects in a boolean context, use the method
+[``bool()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bool.html#pandas.DataFrame.bool):
+
+``` python
+In [11]: pd.Series([True]).bool()
+Out[11]: True
+
+In [12]: pd.Series([False]).bool()
+Out[12]: False
+
+In [13]: pd.DataFrame([[True]]).bool()
+Out[13]: True
+
+In [14]: pd.DataFrame([[False]]).bool()
+Out[14]: False
+```
+
+### Bitwise boolean
+
+Bitwise boolean operators like ``==`` and ``!=`` return a boolean ``Series``,
+which is almost always what you want anyway.
+
+``` python
+>>> s = pd.Series(range(5))
+>>> s == 4
+0 False
+1 False
+2 False
+3 False
+4 True
+dtype: bool
+```
+
+See [boolean comparisons](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-compare) for more examples.
+
+### Using the ``in`` operator
+
+Using the Python ``in`` operator on a ``Series`` tests for membership in the
+index, not membership among the values.
+
+``` python
+In [15]: s = pd.Series(range(5), index=list('abcde'))
+
+In [16]: 2 in s
+Out[16]: False
+
+In [17]: 'b' in s
+Out[17]: True
+```
+
+If this behavior is surprising, keep in mind that using ``in`` on a Python
+dictionary tests keys, not values, and ``Series`` are dict-like.
+To test for membership in the values, use the method [``isin()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html#pandas.Series.isin):
+
+``` python
+In [18]: s.isin([2])
+Out[18]:
+a False
+b False
+c True
+d False
+e False
+dtype: bool
+
+In [19]: s.isin([2]).any()
+Out[19]: True
+```
+
+For ``DataFrames``, likewise, ``in`` applies to the column axis,
+testing for membership in the list of column names.
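+
+A small illustrative sketch:
+
+``` python
+import pandas as pd
+
+df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
+
+'a' in df   # True: membership is tested against the column labels
+3 in df     # False, even though 3 appears among the values
+```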
+
+## ``NaN``, Integer ``NA`` values and ``NA`` type promotions
+
+### Choice of ``NA`` representation
+
+For lack of ``NA`` (missing) support from the ground up in NumPy and Python in
+general, we were given the difficult choice between either:
+
+- A *masked array* solution: an array of data and an array of boolean values
+indicating whether a value is there or is missing.
+- Using a special sentinel value, bit pattern, or set of sentinel values to
+denote ``NA`` across the dtypes.
+
+For many reasons we chose the latter. After years of production use it has
+proven, at least in my opinion, to be the best decision given the state of
+affairs in NumPy and Python in general. The special value ``NaN``
+(Not-A-Number) is used everywhere as the ``NA`` value, and there are API
+functions ``isna`` and ``notna`` which can be used across the dtypes to
+detect NA values.
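+
+For instance, a minimal sketch:
+
+``` python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([1.0, np.nan, 3.0])
+
+pd.isna(s)    # boolean Series: [False, True, False]
+pd.notna(s)   # the complement:  [True, False, True]
+```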
+
+However, it comes with a couple of trade-offs which I most certainly have
+not ignored.
+
+### Support for integer ``NA``
+
+In the absence of high performance ``NA`` support being built into NumPy from
+the ground up, the primary casualty is the ability to represent NAs in integer
+arrays. For example:
+
+``` python
+In [20]: s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
+
+In [21]: s
+Out[21]:
+a 1
+b 2
+c 3
+d 4
+e 5
+dtype: int64
+
+In [22]: s.dtype
+Out[22]: dtype('int64')
+
+In [23]: s2 = s.reindex(['a', 'b', 'c', 'f', 'u'])
+
+In [24]: s2
+Out[24]:
+a 1.0
+b 2.0
+c 3.0
+f NaN
+u NaN
+dtype: float64
+
+In [25]: s2.dtype
+Out[25]: dtype('float64')
+```
+
+This trade-off is made largely for memory and performance reasons, and also so
+that the resulting ``Series`` continues to be “numeric”.
+
+If you need to represent integers with possibly missing values, use one of
+the nullable-integer extension dtypes provided by pandas
+
+- [``Int8Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int8Dtype.html#pandas.Int8Dtype)
+- [``Int16Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int16Dtype.html#pandas.Int16Dtype)
+- [``Int32Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int32Dtype.html#pandas.Int32Dtype)
+- [``Int64Dtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Int64Dtype.html#pandas.Int64Dtype)
+
+``` python
+In [26]: s_int = pd.Series([1, 2, 3, 4, 5], index=list('abcde'),
+ ....: dtype=pd.Int64Dtype())
+ ....:
+
+In [27]: s_int
+Out[27]:
+a 1
+b 2
+c 3
+d 4
+e 5
+dtype: Int64
+
+In [28]: s_int.dtype
+Out[28]: Int64Dtype()
+
+In [29]: s2_int = s_int.reindex(['a', 'b', 'c', 'f', 'u'])
+
+In [30]: s2_int
+Out[30]:
+a 1
+b 2
+c 3
+f NaN
+u NaN
+dtype: Int64
+
+In [31]: s2_int.dtype
+Out[31]: Int64Dtype()
+```
+
+See [Nullable integer data type](integer_na.html#integer-na) for more.
+
+### ``NA`` type promotions
+
+When introducing NAs into an existing ``Series`` or ``DataFrame`` via
+[``reindex()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html#pandas.Series.reindex) or some other means, boolean and integer types will be
+promoted to a different dtype in order to store the NAs. The promotions are
+summarized in this table:
+
+Typeclass | Promotion dtype for storing NAs
+---|---
+floating | no change
+object | no change
+integer | cast to float64
+boolean | cast to object
+
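+A minimal sketch of the integer and boolean promotions from the table above:
+
+``` python
+import pandas as pd
+
+ints = pd.Series([1, 2, 3], index=list('abc'))
+bools = pd.Series([True, False, True], index=list('abc'))
+
+# reindexing introduces a missing label, forcing a dtype promotion
+ints.reindex(list('abcd')).dtype    # float64
+bools.reindex(list('abcd')).dtype   # object
+```
+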
+While this may seem like a heavy trade-off, I have found very few cases where
+this is an issue in practice, i.e. storing values greater than 2**53. Some
+explanation for the motivation is in the next section.
+
+### Why not make NumPy like R?
+
+Many people have suggested that NumPy should simply emulate the ``NA`` support
+present in the more domain-specific statistical programming language [R](https://r-project.org). Part of the reason is the NumPy type hierarchy:
+
+Typeclass | Dtypes
+---|---
+numpy.floating | float16, float32, float64, float128
+numpy.integer | int8, int16, int32, int64
+numpy.unsignedinteger | uint8, uint16, uint32, uint64
+numpy.object_ | object_
+numpy.bool_ | bool_
+numpy.character | string_, unicode_
+
+The R language, by contrast, only has a handful of built-in data types:
+``integer``, ``numeric`` (floating-point), ``character``, and
+``boolean``. ``NA`` types are implemented by reserving special bit patterns for
+each type to be used as the missing value. While doing this with the full NumPy
+type hierarchy would be possible, it would be a more substantial trade-off
+(especially for the 8- and 16-bit data types) and implementation undertaking.
+
+An alternate approach is that of using masked arrays. A masked array is an
+array of data with an associated boolean *mask* denoting whether each value
+should be considered ``NA`` or not. I am personally not in love with this
+approach as I feel that overall it places a fairly heavy burden on the user and
+the library implementer. Additionally, it exacts a fairly high performance cost
+when working with numerical data compared with the simple approach of using
+``NaN``. Thus, I have chosen the Pythonic “practicality beats purity” approach
+and traded integer ``NA`` capability for a much simpler approach of using a
+special value in float and object arrays to denote ``NA``, and promoting
+integer arrays to floating when NAs must be introduced.
+
+## Differences with NumPy
+
+For ``Series`` and ``DataFrame`` objects, [``var()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html#pandas.DataFrame.var) normalizes by
+``N-1`` to produce unbiased estimates of the sample variance, while NumPy’s
+``var`` normalizes by N, which measures the variance of the sample. Note that
+[``cov()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cov.html#pandas.DataFrame.cov) normalizes by ``N-1`` in both pandas and NumPy.
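+
+A minimal sketch of the difference:
+
+``` python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([1.0, 2.0, 3.0, 4.0])
+
+s.var()               # 1.666..., normalized by N-1 (ddof=1)
+np.var(s.to_numpy())  # 1.25, normalized by N (ddof=0)
+```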
+
+## Thread-safety
+
+As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to
+the [``copy()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html#pandas.DataFrame.copy) method. If you are doing a lot of copying of
+``DataFrame`` objects shared among threads, we recommend holding locks inside
+the threads where the data copying occurs.
+
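+A minimal sketch of that locking pattern (the lock and worker function below
+are illustrative, not part of the pandas API):
+
+``` python
+import threading
+
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame(np.random.randn(1000, 4))
+copy_lock = threading.Lock()
+
+
+def worker(frame):
+    # serialize only the copy; the per-thread work on the local copy
+    # can then proceed without holding the lock
+    with copy_lock:
+        local = frame.copy()
+    return local.sum().sum()
+
+
+threads = [threading.Thread(target=worker, args=(df,)) for _ in range(4)]
+for t in threads:
+    t.start()
+for t in threads:
+    t.join()
+```
+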
+See [this link](https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe)
+for more information.
+
+## Byte-Ordering issues
+
+Occasionally you may have to deal with data that were created on a machine with
+a different byte order than the one on which you are running Python. A common
+symptom of this issue is an error like:
+
+``` python
+Traceback
+ ...
+ValueError: Big-endian buffer not supported on little-endian compiler
+```
+
+To deal
+with this issue you should convert the underlying NumPy array to the native
+system byte order *before* passing it to ``Series`` or ``DataFrame``
+constructors using something similar to the following:
+
+``` python
+In [32]: x = np.array(list(range(10)), '>i4') # big endian
+
+In [33]: newx = x.byteswap().newbyteorder() # force native byteorder
+
+In [34]: s = pd.Series(newx)
+```
+
+See [the NumPy documentation on byte order](https://docs.scipy.org/doc/numpy/user/basics.byteswapping.html) for more
+details.
diff --git a/Python/pandas/user_guide/groupby.md b/Python/pandas/user_guide/groupby.md
new file mode 100644
index 00000000..172adab0
--- /dev/null
+++ b/Python/pandas/user_guide/groupby.md
@@ -0,0 +1,2417 @@
+# Group By: split-apply-combine
+
+By “group by” we are referring to a process involving one or more of the following
+steps:
+
+- **Splitting** the data into groups based on some criteria.
+- **Applying** a function to each group independently.
+- **Combining** the results into a data structure.
+
+Out of these, the split step is the most straightforward. In fact, in many
+situations we may wish to split the data set into groups and do something with
+those groups. In the apply step, we might wish to do one of the
+following:
+
+- **Aggregation**: compute a summary statistic (or statistics) for each
+group. Some examples:
+
+ - Compute group sums or means.
+ - Compute group sizes / counts.
+
+- **Transformation**: perform some group-specific computations and return a
+like-indexed object. Some examples:
+
+ - Standardize data (zscore) within a group.
+ - Filling NAs within groups with a value derived from each group.
+
+- **Filtration**: discard some groups, according to a group-wise computation
+that evaluates True or False. Some examples:
+
+ - Discard data that belongs to groups with only a few members.
+ - Filter out data based on the group sum or mean.
+
+- Some combination of the above: GroupBy will examine the results of the apply
+step and try to return a sensibly combined result if it doesn’t fit into
+any of the above three categories.
+
+Since the set of object instance methods on pandas data structures is generally
+rich and expressive, we often simply want to invoke, say, a DataFrame function
+on each group. The name GroupBy should be quite familiar to those who have used
+a SQL-based tool (or ``itertools``), in which you can write code like:
+
+``` sql
+SELECT Column1, Column2, mean(Column3), sum(Column4)
+FROM SomeTable
+GROUP BY Column1, Column2
+```
+
+We aim to make operations like this natural and easy to express using
+pandas. We’ll address each area of GroupBy functionality then provide some
+non-trivial examples / use cases.
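+
+As a rough, minimal sketch of the pandas counterpart (the frame ``some_table``
+below is a stand-in for the hypothetical SQL table above):
+
+``` python
+import numpy as np
+import pandas as pd
+
+some_table = pd.DataFrame({'Column1': ['x', 'x', 'y', 'y'],
+                           'Column2': ['a', 'b', 'a', 'b'],
+                           'Column3': np.random.randn(4),
+                           'Column4': np.random.randn(4)})
+
+# roughly: SELECT Column1, Column2, mean(Column3), sum(Column4) ... GROUP BY
+some_table.groupby(['Column1', 'Column2']).agg({'Column3': 'mean',
+                                                'Column4': 'sum'})
+```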
+
+See the [cookbook](cookbook.html#cookbook-grouping) for some advanced strategies.
+
+## Splitting an object into groups
+
+pandas objects can be split on any of their axes. The abstract definition of
+grouping is to provide a mapping of labels to group names. To create a GroupBy
+object (more on what the GroupBy object is later), you may do the following:
+
+``` python
+In [1]: df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
+ ...: ('bird', 'Psittaciformes', 24.0),
+ ...: ('mammal', 'Carnivora', 80.2),
+ ...: ('mammal', 'Primates', np.nan),
+ ...: ('mammal', 'Carnivora', 58)],
+ ...: index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
+ ...: columns=('class', 'order', 'max_speed'))
+ ...:
+
+In [2]: df
+Out[2]:
+ class order max_speed
+falcon bird Falconiformes 389.0
+parrot bird Psittaciformes 24.0
+lion mammal Carnivora 80.2
+monkey mammal Primates NaN
+leopard mammal Carnivora 58.0
+
+# default is axis=0
+In [3]: grouped = df.groupby('class')
+
+In [4]: grouped = df.groupby('order', axis='columns')
+
+In [5]: grouped = df.groupby(['class', 'order'])
+```
+
+The mapping can be specified many different ways:
+
+- A Python function, to be called on each of the axis labels.
+- A list or NumPy array of the same length as the selected axis.
+- A dict or ``Series``, providing a ``label -> group name`` mapping.
+- For ``DataFrame`` objects, a string indicating a column to be used to group.
+Of course ``df.groupby('A')`` is just syntactic sugar for
+``df.groupby(df['A'])``, but it makes life simpler.
+- For ``DataFrame`` objects, a string indicating an index level to be used to
+group.
+- A list of any of the above things.
+
+Collectively we refer to the grouping objects as the **keys**. For example,
+consider the following ``DataFrame``:
+
+::: tip Note
+
+A string passed to ``groupby`` may refer to either a column or an index level.
+If a string matches both a column name and an index level name, a
+``ValueError`` will be raised.
+
+:::
+
+``` python
+In [6]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
+ ...: 'foo', 'bar', 'foo', 'foo'],
+ ...: 'B': ['one', 'one', 'two', 'three',
+ ...: 'two', 'two', 'one', 'three'],
+ ...: 'C': np.random.randn(8),
+ ...: 'D': np.random.randn(8)})
+ ...:
+
+In [7]: df
+Out[7]:
+ A B C D
+0 foo one 0.469112 -0.861849
+1 bar one -0.282863 -2.104569
+2 foo two -1.509059 -0.494929
+3 bar three -1.135632 1.071804
+4 foo two 1.212112 0.721555
+5 bar two -0.173215 -0.706771
+6 foo one 0.119209 -1.039575
+7 foo three -1.044236 0.271860
+```
+
+On a DataFrame, we obtain a GroupBy object by calling [``groupby()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby).
+We could naturally group by either the ``A`` or ``B`` columns, or both:
+
+``` python
+In [8]: grouped = df.groupby('A')
+
+In [9]: grouped = df.groupby(['A', 'B'])
+```
+
+*New in version 0.24.*
+
+If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all
+but the specified columns
+
+``` python
+In [10]: df2 = df.set_index(['A', 'B'])
+
+In [11]: grouped = df2.groupby(level=df2.index.names.difference(['B']))
+
+In [12]: grouped.sum()
+Out[12]:
+ C D
+A
+bar -1.591710 -1.739537
+foo -0.752861 -1.402938
+```
+
+These will split the DataFrame on its index (rows). We could also split by the
+columns:
+
+``` python
+In [13]: def get_letter_type(letter):
+ ....: if letter.lower() in 'aeiou':
+ ....: return 'vowel'
+ ....: else:
+ ....: return 'consonant'
+ ....:
+
+In [14]: grouped = df.groupby(get_letter_type, axis=1)
+```
+
+pandas [``Index``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html#pandas.Index) objects support duplicate values. If a
+non-unique index is used as the group key in a groupby operation, all values
+for the same index value will be considered to be in one group and thus the
+output of aggregation functions will only contain unique index values:
+
+``` python
+In [15]: lst = [1, 2, 3, 1, 2, 3]
+
+In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)
+
+In [17]: grouped = s.groupby(level=0)
+
+In [18]: grouped.first()
+Out[18]:
+1 1
+2 2
+3 3
+dtype: int64
+
+In [19]: grouped.last()
+Out[19]:
+1 10
+2 20
+3 30
+dtype: int64
+
+In [20]: grouped.sum()
+Out[20]:
+1 11
+2 22
+3 33
+dtype: int64
+```
+
+Note that **no splitting occurs** until it’s needed. Creating the GroupBy object
+only verifies that you’ve passed a valid mapping.
+
+::: tip Note
+
+Many kinds of complicated data manipulations can be expressed in terms of
+GroupBy operations (though can’t be guaranteed to be the most
+efficient). You can get quite creative with the label mapping functions.
+
+:::
+
+### GroupBy sorting
+
+By default the group keys are sorted during the ``groupby`` operation. You may however pass ``sort=False`` for potential speedups:
+
+``` python
+In [21]: df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
+
+In [22]: df2.groupby(['X']).sum()
+Out[22]:
+ Y
+X
+A 7
+B 3
+
+In [23]: df2.groupby(['X'], sort=False).sum()
+Out[23]:
+ Y
+X
+B 3
+A 7
+```
+
+Note that ``groupby`` will preserve the order in which *observations* are sorted *within* each group.
+For example, the groups created by ``groupby()`` below are in the order they appeared in the original ``DataFrame``:
+
+``` python
+In [24]: df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
+
+In [25]: df3.groupby(['X']).get_group('A')
+Out[25]:
+ X Y
+0 A 1
+2 A 3
+
+In [26]: df3.groupby(['X']).get_group('B')
+Out[26]:
+ X Y
+1 B 4
+3 B 2
+```
+
+### GroupBy object attributes
+
+The ``groups`` attribute is a dict whose keys are the computed unique groups
+and whose values are the axis labels belonging to each group. In the
+above example we have:
+
+``` python
+In [27]: df.groupby('A').groups
+Out[27]:
+{'bar': Int64Index([1, 3, 5], dtype='int64'),
+ 'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}
+
+In [28]: df.groupby(get_letter_type, axis=1).groups
+Out[28]:
+{'consonant': Index(['B', 'C', 'D'], dtype='object'),
+ 'vowel': Index(['A'], dtype='object')}
+```
+
+Calling the standard Python ``len`` function on the GroupBy object just returns
+the length of the ``groups`` dict, so it is largely just a convenience:
+
+``` python
+In [29]: grouped = df.groupby(['A', 'B'])
+
+In [30]: grouped.groups
+Out[30]:
+{('bar', 'one'): Int64Index([1], dtype='int64'),
+ ('bar', 'three'): Int64Index([3], dtype='int64'),
+ ('bar', 'two'): Int64Index([5], dtype='int64'),
+ ('foo', 'one'): Int64Index([0, 6], dtype='int64'),
+ ('foo', 'three'): Int64Index([7], dtype='int64'),
+ ('foo', 'two'): Int64Index([2, 4], dtype='int64')}
+
+In [31]: len(grouped)
+Out[31]: 6
+```
+
+``GroupBy`` will tab complete column names (and other attributes):
+
+``` python
+In [32]: df
+Out[32]:
+ height weight gender
+2000-01-01 42.849980 157.500553 male
+2000-01-02 49.607315 177.340407 male
+2000-01-03 56.293531 171.524640 male
+2000-01-04 48.421077 144.251986 female
+2000-01-05 46.556882 152.526206 male
+2000-01-06 68.448851 168.272968 female
+2000-01-07 70.757698 136.431469 male
+2000-01-08 58.909500 176.499753 female
+2000-01-09 76.435631 174.094104 female
+2000-01-10 45.306120 177.540920 male
+
+In [33]: gb = df.groupby('gender')
+```
+
+``` python
+In [34]: gb. # noqa: E225, E999
+gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
+gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
+gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
+```
+
+### GroupBy with MultiIndex
+
+With [hierarchically-indexed data](advanced.html#advanced-hierarchical), it’s quite
+natural to group by one of the levels of the hierarchy.
+
+Let’s create a Series with a two-level ``MultiIndex``.
+
+``` python
+In [35]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
+ ....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
+ ....:
+
+In [36]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
+
+In [37]: s = pd.Series(np.random.randn(8), index=index)
+
+In [38]: s
+Out[38]:
+first second
+bar one -0.919854
+ two -0.042379
+baz one 1.247642
+ two -0.009920
+foo one 0.290213
+ two 0.495767
+qux one 0.362949
+ two 1.548106
+dtype: float64
+```
+
+We can then group by one of the levels in ``s``.
+
+``` python
+In [39]: grouped = s.groupby(level=0)
+
+In [40]: grouped.sum()
+Out[40]:
+first
+bar -0.962232
+baz 1.237723
+foo 0.785980
+qux 1.911055
+dtype: float64
+```
+
+If the MultiIndex has names specified, these can be passed instead of the level
+number:
+
+``` python
+In [41]: s.groupby(level='second').sum()
+Out[41]:
+second
+one 0.980950
+two 1.991575
+dtype: float64
+```
+
+The aggregation functions such as ``sum`` will take the level parameter
+directly. Additionally, the resulting index will be named according to the
+chosen level:
+
+``` python
+In [42]: s.sum(level='second')
+Out[42]:
+second
+one 0.980950
+two 1.991575
+dtype: float64
+```
+
+Grouping with multiple levels is supported.
+
+``` python
+In [43]: s
+Out[43]:
+first second third
+bar doo one -1.131345
+ two -0.089329
+baz bee one 0.337863
+ two -0.945867
+foo bop one -0.932132
+ two 1.956030
+qux bop one 0.017587
+ two -0.016692
+dtype: float64
+
+In [44]: s.groupby(level=['first', 'second']).sum()
+Out[44]:
+first second
+bar doo -1.220674
+baz bee -0.608004
+foo bop 1.023898
+qux bop 0.000895
+dtype: float64
+```
+
+*New in version 0.20.*
+
+Index level names may be supplied as keys.
+
+``` python
+In [45]: s.groupby(['first', 'second']).sum()
+Out[45]:
+first second
+bar doo -1.220674
+baz bee -0.608004
+foo bop 1.023898
+qux bop 0.000895
+dtype: float64
+```
+
+More on the ``sum`` function and aggregation later.
+
+### Grouping DataFrame with Index levels and columns
+
+A DataFrame may be grouped by a combination of columns and index levels by
+specifying the column names as strings and the index levels as ``pd.Grouper``
+objects.
+
+``` python
+In [46]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
+ ....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
+ ....:
+
+In [47]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
+
+In [48]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
+ ....: 'B': np.arange(8)},
+ ....: index=index)
+ ....:
+
+In [49]: df
+Out[49]:
+ A B
+first second
+bar one 1 0
+ two 1 1
+baz one 1 2
+ two 1 3
+foo one 2 4
+ two 2 5
+qux one 3 6
+ two 3 7
+```
+
+The following example groups ``df`` by the ``second`` index level and
+the ``A`` column.
+
+``` python
+In [50]: df.groupby([pd.Grouper(level=1), 'A']).sum()
+Out[50]:
+ B
+second A
+one 1 2
+ 2 4
+ 3 6
+two 1 4
+ 2 5
+ 3 7
+```
+
+Index levels may also be specified by name.
+
+``` python
+In [51]: df.groupby([pd.Grouper(level='second'), 'A']).sum()
+Out[51]:
+ B
+second A
+one 1 2
+ 2 4
+ 3 6
+two 1 4
+ 2 5
+ 3 7
+```
+
+*New in version 0.20.*
+
+Index level names may be specified as keys directly to ``groupby``.
+
+``` python
+In [52]: df.groupby(['second', 'A']).sum()
+Out[52]:
+ B
+second A
+one 1 2
+ 2 4
+ 3 6
+two 1 4
+ 2 5
+ 3 7
+```
+
+### DataFrame column selection in GroupBy
+
+Once you have created the GroupBy object from a DataFrame, you might want to do
+something different for each of the columns. Thus, using ``[]`` similar to
+getting a column from a DataFrame, you can do:
+
+``` python
+In [53]: grouped = df.groupby(['A'])
+
+In [54]: grouped_C = grouped['C']
+
+In [55]: grouped_D = grouped['D']
+```
+
+This is mainly syntactic sugar for the alternative and much more verbose:
+
+``` python
+In [56]: df['C'].groupby(df['A'])
+Out[56]:
+```
+
+Additionally this method avoids recomputing the internal grouping information
+derived from the passed key.
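+
+A small sketch of that reuse (the ``df`` here mirrors the ``A``/``C``/``D`` frame
+used throughout this section and is rebuilt only for illustration): both column
+selections below share the grouping information computed once for ``grouped``.
+
+``` python
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
+                   'C': np.random.randn(4),
+                   'D': np.random.randn(4)})
+
+grouped = df.groupby(['A'])    # grouping information is computed here
+c_sums = grouped['C'].sum()    # reuses that grouping
+d_means = grouped['D'].mean()  # reuses it again; no regrouping
+```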
+
+## Iterating through groups
+
+With the GroupBy object in hand, iterating through the grouped data is very
+natural and functions similarly to [``itertools.groupby()``](https://docs.python.org/3/library/itertools.html#itertools.groupby):
+
+``` python
+In [57]: grouped = df.groupby('A')
+
+In [58]: for name, group in grouped:
+ ....: print(name)
+ ....: print(group)
+ ....:
+bar
+ A B C D
+1 bar one 0.254161 1.511763
+3 bar three 0.215897 -0.990582
+5 bar two -0.077118 1.211526
+foo
+ A B C D
+0 foo one -0.575247 1.346061
+2 foo two -1.143704 1.627081
+4 foo two 1.193555 -0.441652
+6 foo one -0.408530 0.268520
+7 foo three -0.862495 0.024580
+```
+
+In the case of grouping by multiple keys, the group name will be a tuple:
+
+``` python
+In [59]: for name, group in df.groupby(['A', 'B']):
+ ....: print(name)
+ ....: print(group)
+ ....:
+('bar', 'one')
+ A B C D
+1 bar one 0.254161 1.511763
+('bar', 'three')
+ A B C D
+3 bar three 0.215897 -0.990582
+('bar', 'two')
+ A B C D
+5 bar two -0.077118 1.211526
+('foo', 'one')
+ A B C D
+0 foo one -0.575247 1.346061
+6 foo one -0.408530 0.268520
+('foo', 'three')
+ A B C D
+7 foo three -0.862495 0.02458
+('foo', 'two')
+ A B C D
+2 foo two -1.143704 1.627081
+4 foo two 1.193555 -0.441652
+```
+
+See [Iterating through groups](timeseries.html#timeseries-iterating-label).
+
+## Selecting a group
+
+A single group can be selected using
+``get_group()``:
+
+``` python
+In [60]: grouped.get_group('bar')
+Out[60]:
+ A B C D
+1 bar one 0.254161 1.511763
+3 bar three 0.215897 -0.990582
+5 bar two -0.077118 1.211526
+```
+
+Or for an object grouped on multiple columns:
+
+``` python
+In [61]: df.groupby(['A', 'B']).get_group(('bar', 'one'))
+Out[61]:
+ A B C D
+1 bar one 0.254161 1.511763
+```
+
+## Aggregation
+
+Once the GroupBy object has been created, several methods are available to
+perform a computation on the grouped data. These operations are similar to the
+[aggregating API](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-aggregate), [window functions API](computation.html#stats-aggregate),
+and [resample API](timeseries.html#timeseries-aggregate).
+
+An obvious one is aggregation via the
+``aggregate()`` or equivalently
+``agg()`` method:
+
+``` python
+In [62]: grouped = df.groupby('A')
+
+In [63]: grouped.aggregate(np.sum)
+Out[63]:
+ C D
+A
+bar 0.392940 1.732707
+foo -1.796421 2.824590
+
+In [64]: grouped = df.groupby(['A', 'B'])
+
+In [65]: grouped.aggregate(np.sum)
+Out[65]:
+ C D
+A B
+bar one 0.254161 1.511763
+ three 0.215897 -0.990582
+ two -0.077118 1.211526
+foo one -0.983776 1.614581
+ three -0.862495 0.024580
+ two 0.049851 1.185429
+```
+
+As you can see, the result of the aggregation will have the group names as the
+new index along the grouped axis. In the case of multiple keys, the result is a
+[MultiIndex](advanced.html#advanced-hierarchical) by default, though this can be
+changed by using the ``as_index`` option:
+
+``` python
+In [66]: grouped = df.groupby(['A', 'B'], as_index=False)
+
+In [67]: grouped.aggregate(np.sum)
+Out[67]:
+ A B C D
+0 bar one 0.254161 1.511763
+1 bar three 0.215897 -0.990582
+2 bar two -0.077118 1.211526
+3 foo one -0.983776 1.614581
+4 foo three -0.862495 0.024580
+5 foo two 0.049851 1.185429
+
+In [68]: df.groupby('A', as_index=False).sum()
+Out[68]:
+ A C D
+0 bar 0.392940 1.732707
+1 foo -1.796421 2.824590
+```
+
+Note that you could use the ``reset_index`` DataFrame function to achieve the
+same result as the column names are stored in the resulting ``MultiIndex``:
+
+``` python
+In [69]: df.groupby(['A', 'B']).sum().reset_index()
+Out[69]:
+ A B C D
+0 bar one 0.254161 1.511763
+1 bar three 0.215897 -0.990582
+2 bar two -0.077118 1.211526
+3 foo one -0.983776 1.614581
+4 foo three -0.862495 0.024580
+5 foo two 0.049851 1.185429
+```
+
+Another simple aggregation example is to compute the size of each group.
+This is included in GroupBy as the ``size`` method. It returns a Series whose
+index consists of the group names and whose values are the sizes of each group.
+
+``` python
+In [70]: grouped.size()
+Out[70]:
+A B
+bar one 1
+ three 1
+ two 1
+foo one 2
+ three 1
+ two 2
+dtype: int64
+```
+
+``` python
+In [71]: grouped.describe()
+Out[71]:
+ C D
+ count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
+0 1.0 0.254161 NaN 0.254161 0.254161 0.254161 0.254161 0.254161 1.0 1.511763 NaN 1.511763 1.511763 1.511763 1.511763 1.511763
+1 1.0 0.215897 NaN 0.215897 0.215897 0.215897 0.215897 0.215897 1.0 -0.990582 NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
+2 1.0 -0.077118 NaN -0.077118 -0.077118 -0.077118 -0.077118 -0.077118 1.0 1.211526 NaN 1.211526 1.211526 1.211526 1.211526 1.211526
+3 2.0 -0.491888 0.117887 -0.575247 -0.533567 -0.491888 -0.450209 -0.408530 2.0 0.807291 0.761937 0.268520 0.537905 0.807291 1.076676 1.346061
+4 1.0 -0.862495 NaN -0.862495 -0.862495 -0.862495 -0.862495 -0.862495 1.0 0.024580 NaN 0.024580 0.024580 0.024580 0.024580 0.024580
+5 2.0 0.024925 1.652692 -1.143704 -0.559389 0.024925 0.609240 1.193555 2.0 0.592714 1.462816 -0.441652 0.075531 0.592714 1.109898 1.627081
+```
+
+::: tip Note
+
+Aggregation functions **will not** return the groups that you are aggregating over
+if they are named *columns* and ``as_index=True`` (the default). The grouped columns will
+be the **indices** of the returned object.
+
+Passing ``as_index=False`` **will** return the groups that you are aggregating over, if they are
+named *columns*.
+
+:::
+
+Aggregating functions are the ones that reduce the dimension of the returned objects.
+Some common aggregating functions are tabulated below:
+
+Function | Description
+---|---
+mean() | Compute mean of groups
+sum() | Compute sum of group values
+size() | Compute group sizes
+count() | Compute count of group, excluding missing values
+std() | Standard deviation of groups
+var() | Compute variance of groups
+sem() | Standard error of the mean of groups
+describe() | Generates descriptive statistics
+first() | Compute first of group values
+last() | Compute last of group values
+nth() | Take nth value, or a subset if n is a list
+min() | Compute min of group values
+max() | Compute max of group values
+
+The aggregating functions above will exclude NA values. Any function which
+reduces a [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) to a scalar value is an aggregation function and will work;
+a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that
+``nth()`` can act as a reducer *or* a
+filter, see [here](#groupby-nth).
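+
+As a hedged illustration of that point (the frame and the ``value_range`` helper
+below are made up for this sketch), any callable that maps a Series to a scalar
+can be passed to ``agg``:
+
+``` python
+import pandas as pd
+
+df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
+                   'C': [1.0, 2.0, 3.0, 4.0]})
+
+def value_range(ser):
+    # Reduce a Series to a single scalar: its max minus its min.
+    return ser.max() - ser.min()
+
+# Any Series -> scalar callable is a valid aggregation function.
+print(df.groupby('A')['C'].agg(value_range))
+```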
+
+### Applying multiple functions at once
+
+With grouped ``Series`` you can also pass a list or dict of functions to do
+aggregation with, outputting a DataFrame:
+
+``` python
+In [72]: grouped = df.groupby('A')
+
+In [73]: grouped['C'].agg([np.sum, np.mean, np.std])
+Out[73]:
+ sum mean std
+A
+bar 0.392940 0.130980 0.181231
+foo -1.796421 -0.359284 0.912265
+```
+
+On a grouped ``DataFrame``, you can pass a list of functions to apply to each
+column, which produces an aggregated result with a hierarchical index:
+
+``` python
+In [74]: grouped.agg([np.sum, np.mean, np.std])
+Out[74]:
+ C D
+ sum mean std sum mean std
+A
+bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
+foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
+```
+
+The resulting aggregations are named for the functions themselves. If you
+need to rename, then you can add in a chained operation for a ``Series`` like this:
+
+``` python
+In [75]: (grouped['C'].agg([np.sum, np.mean, np.std])
+ ....: .rename(columns={'sum': 'foo',
+ ....: 'mean': 'bar',
+ ....: 'std': 'baz'}))
+ ....:
+Out[75]:
+ foo bar baz
+A
+bar 0.392940 0.130980 0.181231
+foo -1.796421 -0.359284 0.912265
+```
+
+For a grouped ``DataFrame``, you can rename in a similar manner:
+
+``` python
+In [76]: (grouped.agg([np.sum, np.mean, np.std])
+ ....: .rename(columns={'sum': 'foo',
+ ....: 'mean': 'bar',
+ ....: 'std': 'baz'}))
+ ....:
+Out[76]:
+ C D
+ foo bar baz foo bar baz
+A
+bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
+foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
+```
+
+::: tip Note
+
+In general, the output column names should be unique. You can’t apply
+the same function (or two functions with the same name) to the same
+column.
+
+``` python
+In [77]: grouped['C'].agg(['sum', 'sum'])
+---------------------------------------------------------------------------
+SpecificationError Traceback (most recent call last)
+ in
+----> 1 grouped['C'].agg(['sum', 'sum'])
+
+/pandas/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
+ 849 # but not the class list / tuple itself.
+ 850 func_or_funcs = _maybe_mangle_lambdas(func_or_funcs)
+--> 851 ret = self._aggregate_multiple_funcs(func_or_funcs, (_level or 0) + 1)
+ 852 if relabeling:
+ 853 ret.columns = columns
+
+/pandas/pandas/core/groupby/generic.py in _aggregate_multiple_funcs(self, arg, _level)
+ 919 raise SpecificationError(
+ 920 "Function names must be unique, found multiple named "
+--> 921 "{}".format(name)
+ 922 )
+ 923
+
+SpecificationError: Function names must be unique, found multiple named sum
+```
+
+Pandas *does* allow you to provide multiple lambdas. In this case, pandas
+will mangle the name of the (nameless) lambda functions, appending ``_<i>``
+to each subsequent lambda.
+
+``` python
+In [78]: grouped['C'].agg([lambda x: x.max() - x.min(),
+ ....: lambda x: x.median() - x.mean()])
+ ....:
+Out[78]:
+      <lambda_0>  <lambda_1>
+A
+bar 0.331279 0.084917
+foo 2.337259 -0.215962
+```
+
+:::
+
+### Named aggregation
+
+*New in version 0.25.0.*
+
+To support column-specific aggregation *with control over the output column names*, pandas
+accepts the special syntax in ``GroupBy.agg()``, known as “named aggregation”, where
+
+- The keywords are the *output* column names
+- The values are tuples whose first element is the column to select
+and the second element is the aggregation to apply to that column. Pandas
+provides the ``pandas.NamedAgg`` namedtuple with the fields ``['column', 'aggfunc']``
+to make it clearer what the arguments are. As usual, the aggregation can
+be a callable or a string alias.
+
+``` python
+In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
+ ....: 'height': [9.1, 6.0, 9.5, 34.0],
+ ....: 'weight': [7.9, 7.5, 9.9, 198.0]})
+ ....:
+
+In [80]: animals
+Out[80]:
+ kind height weight
+0 cat 9.1 7.9
+1 dog 6.0 7.5
+2 cat 9.5 9.9
+3 dog 34.0 198.0
+
+In [81]: animals.groupby("kind").agg(
+ ....: min_height=pd.NamedAgg(column='height', aggfunc='min'),
+ ....: max_height=pd.NamedAgg(column='height', aggfunc='max'),
+ ....: average_weight=pd.NamedAgg(column='weight', aggfunc=np.mean),
+ ....: )
+ ....:
+Out[81]:
+ min_height max_height average_weight
+kind
+cat 9.1 9.5 8.90
+dog 6.0 34.0 102.75
+```
+
+``pandas.NamedAgg`` is just a ``namedtuple``. Plain tuples are allowed as well.
+
+``` python
+In [82]: animals.groupby("kind").agg(
+ ....: min_height=('height', 'min'),
+ ....: max_height=('height', 'max'),
+ ....: average_weight=('weight', np.mean),
+ ....: )
+ ....:
+Out[82]:
+ min_height max_height average_weight
+kind
+cat 9.1 9.5 8.90
+dog 6.0 34.0 102.75
+```
+
+If your desired output column names are not valid python keywords, construct a dictionary
+and unpack the keyword arguments
+
+``` python
+In [83]: animals.groupby("kind").agg(**{
+ ....: 'total weight': pd.NamedAgg(column='weight', aggfunc=sum),
+ ....: })
+ ....:
+Out[83]:
+ total weight
+kind
+cat 17.8
+dog 205.5
+```
+
+Additional keyword arguments are not passed through to the aggregation functions. Only pairs
+of ``(column, aggfunc)`` should be passed as ``**kwargs``. If your aggregation function
+requires additional arguments, partially apply them with ``functools.partial()``.
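+
+A minimal sketch of that pattern, assuming a hypothetical aggregation
+``scaled_sum`` that needs an extra ``scale`` argument (both names are
+illustrative, not part of pandas):
+
+``` python
+import functools
+
+import pandas as pd
+
+df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
+                   'height': [9.1, 6.0, 9.5, 34.0]})
+
+def scaled_sum(ser, scale):
+    # Hypothetical aggregation that requires an extra argument.
+    return ser.sum() * scale
+
+# Partially apply the extra argument, then use the result in named aggregation.
+result = df.groupby('kind').agg(
+    scaled_height=('height', functools.partial(scaled_sum, scale=2.0)),
+)
+```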
+
+::: tip Note
+
+For Python 3.5 and earlier, the order of ``**kwargs`` in a function was not
+preserved. This means that the output column ordering would not be
+consistent. To ensure consistent ordering, the keys (and so output columns)
+will always be sorted for Python 3.5.
+
+:::
+
+Named aggregation is also valid for Series groupby aggregations. In this case there’s
+no column selection, so the values are just the functions.
+
+``` python
+In [84]: animals.groupby("kind").height.agg(
+ ....: min_height='min',
+ ....: max_height='max',
+ ....: )
+ ....:
+Out[84]:
+ min_height max_height
+kind
+cat 9.1 9.5
+dog 6.0 34.0
+```
+
+### Applying different functions to DataFrame columns
+
+By passing a dict to ``aggregate`` you can apply a different aggregation to the
+columns of a DataFrame:
+
+``` python
+In [85]: grouped.agg({'C': np.sum,
+ ....: 'D': lambda x: np.std(x, ddof=1)})
+ ....:
+Out[85]:
+ C D
+A
+bar 0.392940 1.366330
+foo -1.796421 0.884785
+```
+
+The function names can also be strings. In order for a string to be valid it
+must be either implemented on GroupBy or available via [dispatching](#groupby-dispatch):
+
+``` python
+In [86]: grouped.agg({'C': 'sum', 'D': 'std'})
+Out[86]:
+ C D
+A
+bar 0.392940 1.366330
+foo -1.796421 0.884785
+```
+
+### Cython-optimized aggregation functions
+
+Some common aggregations, currently only ``sum``, ``mean``, ``std``, and ``sem``, have
+optimized Cython implementations:
+
+``` python
+In [87]: df.groupby('A').sum()
+Out[87]:
+ C D
+A
+bar 0.392940 1.732707
+foo -1.796421 2.824590
+
+In [88]: df.groupby(['A', 'B']).mean()
+Out[88]:
+ C D
+A B
+bar one 0.254161 1.511763
+ three 0.215897 -0.990582
+ two -0.077118 1.211526
+foo one -0.491888 0.807291
+ three -0.862495 0.024580
+ two 0.024925 0.592714
+```
+
+Of course ``sum`` and ``mean`` are implemented on pandas objects, so the above
+code would work even without the special versions via dispatching (see below).
+
+## Transformation
+
+The ``transform`` method returns an object that is indexed the same (same size)
+as the one being grouped. The transform function must:
+
+- Return a result that is either the same size as the group chunk or
+broadcastable to the size of the group chunk (e.g., a scalar,
+``grouped.transform(lambda x: x.iloc[-1])``).
+- Operate column-by-column on the group chunk. The transform is applied to
+the first group chunk using chunk.apply.
+- Not perform in-place operations on the group chunk. Group chunks should
+be treated as immutable, and changes to a group chunk may produce unexpected
+results. For example, when using ``fillna``, ``inplace`` must be ``False``
+(``grouped.transform(lambda x: x.fillna(inplace=False))``).
+- (Optionally) operate on the entire group chunk. If this is supported, a
+fast path is used starting from the *second* chunk.
+
+For example, suppose we wished to standardize the data within each group:
+
+``` python
+In [89]: index = pd.date_range('10/1/1999', periods=1100)
+
+In [90]: ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
+
+In [91]: ts = ts.rolling(window=100, min_periods=100).mean().dropna()
+
+In [92]: ts.head()
+Out[92]:
+2000-01-08 0.779333
+2000-01-09 0.778852
+2000-01-10 0.786476
+2000-01-11 0.782797
+2000-01-12 0.798110
+Freq: D, dtype: float64
+
+In [93]: ts.tail()
+Out[93]:
+2002-09-30 0.660294
+2002-10-01 0.631095
+2002-10-02 0.673601
+2002-10-03 0.709213
+2002-10-04 0.719369
+Freq: D, dtype: float64
+
+In [94]: transformed = (ts.groupby(lambda x: x.year)
+ ....: .transform(lambda x: (x - x.mean()) / x.std()))
+ ....:
+```
+
+We would expect the result to now have mean 0 and standard deviation 1 within
+each group, which we can easily check:
+
+``` python
+# Original Data
+In [95]: grouped = ts.groupby(lambda x: x.year)
+
+In [96]: grouped.mean()
+Out[96]:
+2000 0.442441
+2001 0.526246
+2002 0.459365
+dtype: float64
+
+In [97]: grouped.std()
+Out[97]:
+2000 0.131752
+2001 0.210945
+2002 0.128753
+dtype: float64
+
+# Transformed Data
+In [98]: grouped_trans = transformed.groupby(lambda x: x.year)
+
+In [99]: grouped_trans.mean()
+Out[99]:
+2000 1.168208e-15
+2001 1.454544e-15
+2002 1.726657e-15
+dtype: float64
+
+In [100]: grouped_trans.std()
+Out[100]:
+2000 1.0
+2001 1.0
+2002 1.0
+dtype: float64
+```
+
+We can also visually compare the original and transformed data sets.
+
+``` python
+In [101]: compare = pd.DataFrame({'Original': ts, 'Transformed': transformed})
+
+In [102]: compare.plot()
+Out[102]:
+```
+
+
+
+Transformation functions that have lower dimension outputs are broadcast to
+match the shape of the input array.
+
+``` python
+In [103]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
+Out[103]:
+2000-01-08 0.623893
+2000-01-09 0.623893
+2000-01-10 0.623893
+2000-01-11 0.623893
+2000-01-12 0.623893
+ ...
+2002-09-30 0.558275
+2002-10-01 0.558275
+2002-10-02 0.558275
+2002-10-03 0.558275
+2002-10-04 0.558275
+Freq: D, Length: 1001, dtype: float64
+```
+
+Alternatively, the built-in methods could be used to produce the same outputs.
+
+``` python
+In [104]: max = ts.groupby(lambda x: x.year).transform('max')
+
+In [105]: min = ts.groupby(lambda x: x.year).transform('min')
+
+In [106]: max - min
+Out[106]:
+2000-01-08 0.623893
+2000-01-09 0.623893
+2000-01-10 0.623893
+2000-01-11 0.623893
+2000-01-12 0.623893
+ ...
+2002-09-30 0.558275
+2002-10-01 0.558275
+2002-10-02 0.558275
+2002-10-03 0.558275
+2002-10-04 0.558275
+Freq: D, Length: 1001, dtype: float64
+```
+
+Another common data transform is to replace missing data with the group mean.
+
+``` python
+In [107]: data_df
+Out[107]:
+ A B C
+0 1.539708 -1.166480 0.533026
+1 1.302092 -0.505754 NaN
+2 -0.371983 1.104803 -0.651520
+3 -1.309622 1.118697 -1.161657
+4 -1.924296 0.396437 0.812436
+.. ... ... ...
+995 -0.093110 0.683847 -0.774753
+996 -0.185043 1.438572 NaN
+997 -0.394469 -0.642343 0.011374
+998 -1.174126 1.857148 NaN
+999 0.234564 0.517098 0.393534
+
+[1000 rows x 3 columns]
+
+In [108]: countries = np.array(['US', 'UK', 'GR', 'JP'])
+
+In [109]: key = countries[np.random.randint(0, 4, 1000)]
+
+In [110]: grouped = data_df.groupby(key)
+
+# Non-NA count in each group
+In [111]: grouped.count()
+Out[111]:
+ A B C
+GR 209 217 189
+JP 240 255 217
+UK 216 231 193
+US 239 250 217
+
+In [112]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
+```
+
+We can verify that the group means have not changed in the transformed data
+and that the transformed data contains no NAs.
+
+``` python
+In [113]: grouped_trans = transformed.groupby(key)
+
+In [114]: grouped.mean() # original group means
+Out[114]:
+ A B C
+GR -0.098371 -0.015420 0.068053
+JP 0.069025 0.023100 -0.077324
+UK 0.034069 -0.052580 -0.116525
+US 0.058664 -0.020399 0.028603
+
+In [115]: grouped_trans.mean() # transformation did not change group means
+Out[115]:
+ A B C
+GR -0.098371 -0.015420 0.068053
+JP 0.069025 0.023100 -0.077324
+UK 0.034069 -0.052580 -0.116525
+US 0.058664 -0.020399 0.028603
+
+In [116]: grouped.count() # original has some missing data points
+Out[116]:
+ A B C
+GR 209 217 189
+JP 240 255 217
+UK 216 231 193
+US 239 250 217
+
+In [117]: grouped_trans.count() # counts after transformation
+Out[117]:
+ A B C
+GR 228 228 228
+JP 267 267 267
+UK 247 247 247
+US 258 258 258
+
+In [118]: grouped_trans.size() # Verify non-NA count equals group size
+Out[118]:
+GR 228
+JP 267
+UK 247
+US 258
+dtype: int64
+```
+
+::: tip Note
+
+Some functions will automatically transform the input when applied to a
+GroupBy object, but returning an object of the same shape as the original.
+Passing ``as_index=False`` will not affect these transformation methods.
+
+For example: ``fillna``, ``ffill``, ``bfill``, ``shift``.
+
+``` python
+In [119]: grouped.ffill()
+Out[119]:
+ A B C
+0 1.539708 -1.166480 0.533026
+1 1.302092 -0.505754 0.533026
+2 -0.371983 1.104803 -0.651520
+3 -1.309622 1.118697 -1.161657
+4 -1.924296 0.396437 0.812436
+.. ... ... ...
+995 -0.093110 0.683847 -0.774753
+996 -0.185043 1.438572 -0.774753
+997 -0.394469 -0.642343 0.011374
+998 -1.174126 1.857148 -0.774753
+999 0.234564 0.517098 0.393534
+
+[1000 rows x 3 columns]
+```
+
+:::
+
+### New syntax to window and resample operations
+
+*New in version 0.18.1.*
+
+Working with the resample, expanding or rolling operations on the groupby
+level used to require the application of helper functions. However,
+now it is possible to use ``resample()``, ``expanding()`` and
+``rolling()`` as methods on groupbys.
+
+The example below will apply the ``rolling()`` method on the samples of
+the column B based on the groups of column A.
+
+``` python
+In [120]: df_re = pd.DataFrame({'A': [1] * 10 + [5] * 10,
+ .....: 'B': np.arange(20)})
+ .....:
+
+In [121]: df_re
+Out[121]:
+ A B
+0 1 0
+1 1 1
+2 1 2
+3 1 3
+4 1 4
+.. .. ..
+15 5 15
+16 5 16
+17 5 17
+18 5 18
+19 5 19
+
+[20 rows x 2 columns]
+
+In [122]: df_re.groupby('A').rolling(4).B.mean()
+Out[122]:
+A
+1 0 NaN
+ 1 NaN
+ 2 NaN
+ 3 1.5
+ 4 2.5
+ ...
+5 15 13.5
+ 16 14.5
+ 17 15.5
+ 18 16.5
+ 19 17.5
+Name: B, Length: 20, dtype: float64
+```
+
+The ``expanding()`` method will accumulate a given operation
+(``sum()`` in the example) for all the members of each particular
+group.
+
+``` python
+In [123]: df_re.groupby('A').expanding().sum()
+Out[123]:
+ A B
+A
+1 0 1.0 0.0
+ 1 2.0 1.0
+ 2 3.0 3.0
+ 3 4.0 6.0
+ 4 5.0 10.0
+... ... ...
+5 15 30.0 75.0
+ 16 35.0 91.0
+ 17 40.0 108.0
+ 18 45.0 126.0
+ 19 50.0 145.0
+
+[20 rows x 2 columns]
+```
+
+Suppose you want to use the ``resample()`` method to get a daily
+frequency in each group of your dataframe and wish to complete the
+missing values with the ``ffill()`` method.
+
+``` python
+In [124]: df_re = pd.DataFrame({'date': pd.date_range(start='2016-01-01', periods=4,
+ .....: freq='W'),
+ .....: 'group': [1, 1, 2, 2],
+ .....: 'val': [5, 6, 7, 8]}).set_index('date')
+ .....:
+
+In [125]: df_re
+Out[125]:
+ group val
+date
+2016-01-03 1 5
+2016-01-10 1 6
+2016-01-17 2 7
+2016-01-24 2 8
+
+In [126]: df_re.groupby('group').resample('1D').ffill()
+Out[126]:
+ group val
+group date
+1 2016-01-03 1 5
+ 2016-01-04 1 5
+ 2016-01-05 1 5
+ 2016-01-06 1 5
+ 2016-01-07 1 5
+... ... ...
+2 2016-01-20 2 7
+ 2016-01-21 2 7
+ 2016-01-22 2 7
+ 2016-01-23 2 7
+ 2016-01-24 2 8
+
+[16 rows x 2 columns]
+```
+
+## Filtration
+
+The ``filter`` method returns a subset of the original object. Suppose we
+want to take only elements that belong to groups with a group sum greater
+than 2.
+
+``` python
+In [127]: sf = pd.Series([1, 1, 2, 3, 3, 3])
+
+In [128]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
+Out[128]:
+3 3
+4 3
+5 3
+dtype: int64
+```
+
+The argument of ``filter`` must be a function that, applied to the group as a
+whole, returns ``True`` or ``False``.
+
+Another useful operation is filtering out elements that belong to groups
+with only a couple members.
+
+``` python
+In [129]: dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
+
+In [130]: dff.groupby('B').filter(lambda x: len(x) > 2)
+Out[130]:
+ A B
+2 2 b
+3 3 b
+4 4 b
+5 5 b
+```
+
+Alternatively, instead of dropping the offending groups, we can return a
+like-indexed object where the groups that do not pass the filter are filled
+with NaNs.
+
+``` python
+In [131]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
+Out[131]:
+ A B
+0 NaN NaN
+1 NaN NaN
+2 2.0 b
+3 3.0 b
+4 4.0 b
+5 5.0 b
+6 NaN NaN
+7 NaN NaN
+```
+
+For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion.
+
+``` python
+In [132]: dff['C'] = np.arange(8)
+
+In [133]: dff.groupby('B').filter(lambda x: len(x['C']) > 2)
+Out[133]:
+ A B C
+2 2 b 2
+3 3 b 3
+4 4 b 4
+5 5 b 5
+```
+
+::: tip Note
+
+Some functions when applied to a groupby object will act as a **filter** on the input, returning
+a reduced shape of the original (and potentially eliminating groups), but with the index unchanged.
+Passing ``as_index=False`` will not affect these transformation methods.
+
+For example: ``head, tail``.
+
+``` python
+In [134]: dff.groupby('B').head(2)
+Out[134]:
+ A B C
+0 0 a 0
+1 1 a 1
+2 2 b 2
+3 3 b 3
+6 6 c 6
+7 7 c 7
+```
+
+:::
+
+## Dispatching to instance methods
+
+When doing an aggregation or transformation, you might just want to call an
+instance method on each data group. This is pretty easy to do by passing lambda
+functions:
+
+``` python
+In [135]: grouped = df.groupby('A')
+
+In [136]: grouped.agg(lambda x: x.std())
+Out[136]:
+ C D
+A
+bar 0.181231 1.366330
+foo 0.912265 0.884785
+```
+
+But, it’s rather verbose and can be untidy if you need to pass additional
+arguments. Using a bit of metaprogramming cleverness, GroupBy now has the
+ability to “dispatch” method calls to the groups:
+
+``` python
+In [137]: grouped.std()
+Out[137]:
+ C D
+A
+bar 0.181231 1.366330
+foo 0.912265 0.884785
+```
+
+What is actually happening here is that a function wrapper is being
+generated. When invoked, it takes any passed arguments and invokes the function
+with any arguments on each group (in the above example, the ``std``
+function). The results are then combined together much in the style of ``agg``
+and ``transform`` (it actually uses ``apply`` to infer the gluing, documented
+next). This enables some operations to be carried out rather succinctly:
+
+``` python
+In [138]: tsdf = pd.DataFrame(np.random.randn(1000, 3),
+ .....: index=pd.date_range('1/1/2000', periods=1000),
+ .....: columns=['A', 'B', 'C'])
+ .....:
+
+In [139]: tsdf.iloc[::2] = np.nan
+
+In [140]: grouped = tsdf.groupby(lambda x: x.year)
+
+In [141]: grouped.fillna(method='pad')
+Out[141]:
+ A B C
+2000-01-01 NaN NaN NaN
+2000-01-02 -0.353501 -0.080957 -0.876864
+2000-01-03 -0.353501 -0.080957 -0.876864
+2000-01-04 0.050976 0.044273 -0.559849
+2000-01-05 0.050976 0.044273 -0.559849
+... ... ... ...
+2002-09-22 0.005011 0.053897 -1.026922
+2002-09-23 0.005011 0.053897 -1.026922
+2002-09-24 -0.456542 -1.849051 1.559856
+2002-09-25 -0.456542 -1.849051 1.559856
+2002-09-26 1.123162 0.354660 1.128135
+
+[1000 rows x 3 columns]
+```
+
+In this example, we chopped the collection of time series into yearly chunks
+then independently called [fillna](missing_data.html#missing-data-fillna) on the
+groups.
+
+The ``nlargest`` and ``nsmallest`` methods work on ``Series`` style groupbys:
+
+``` python
+In [142]: s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
+
+In [143]: g = pd.Series(list('abababab'))
+
+In [144]: gb = s.groupby(g)
+
+In [145]: gb.nlargest(3)
+Out[145]:
+a 4 19.0
+ 0 9.0
+ 2 7.0
+b 1 8.0
+ 3 5.0
+ 7 3.3
+dtype: float64
+
+In [146]: gb.nsmallest(3)
+Out[146]:
+a 6 4.2
+ 2 7.0
+ 0 9.0
+b 5 1.0
+ 7 3.3
+ 3 5.0
+dtype: float64
+```
+
+## Flexible ``apply``
+
+Some operations on the grouped data might not fit into either the aggregate or
+transform categories. Or, you may simply want GroupBy to infer how to combine
+the results. For these, use the ``apply`` function, which can be substituted
+for both ``aggregate`` and ``transform`` in many standard use cases. However,
+``apply`` can handle some exceptional use cases, for example:
+
+``` python
+In [147]: df
+Out[147]:
+ A B C D
+0 foo one -0.575247 1.346061
+1 bar one 0.254161 1.511763
+2 foo two -1.143704 1.627081
+3 bar three 0.215897 -0.990582
+4 foo two 1.193555 -0.441652
+5 bar two -0.077118 1.211526
+6 foo one -0.408530 0.268520
+7 foo three -0.862495 0.024580
+
+In [148]: grouped = df.groupby('A')
+
+# could also just call .describe()
+In [149]: grouped['C'].apply(lambda x: x.describe())
+Out[149]:
+A
+bar count 3.000000
+ mean 0.130980
+ std 0.181231
+ min -0.077118
+ 25% 0.069390
+ ...
+foo min -1.143704
+ 25% -0.862495
+ 50% -0.575247
+ 75% -0.408530
+ max 1.193555
+Name: C, Length: 16, dtype: float64
+```
+
+The dimension of the returned result can also change:
+
+``` python
+In [150]: grouped = df.groupby('A')['C']
+
+In [151]: def f(group):
+ .....: return pd.DataFrame({'original': group,
+ .....: 'demeaned': group - group.mean()})
+ .....:
+
+In [152]: grouped.apply(f)
+Out[152]:
+ original demeaned
+0 -0.575247 -0.215962
+1 0.254161 0.123181
+2 -1.143704 -0.784420
+3 0.215897 0.084917
+4 1.193555 1.552839
+5 -0.077118 -0.208098
+6 -0.408530 -0.049245
+7 -0.862495 -0.503211
+```
+
+``apply`` on a Series can operate on a returned value from the applied function,
+that is itself a series, and possibly upcast the result to a DataFrame:
+
+``` python
+In [153]: def f(x):
+ .....: return pd.Series([x, x ** 2], index=['x', 'x^2'])
+ .....:
+
+In [154]: s = pd.Series(np.random.rand(5))
+
+In [155]: s
+Out[155]:
+0 0.321438
+1 0.493496
+2 0.139505
+3 0.910103
+4 0.194158
+dtype: float64
+
+In [156]: s.apply(f)
+Out[156]:
+ x x^2
+0 0.321438 0.103323
+1 0.493496 0.243538
+2 0.139505 0.019462
+3 0.910103 0.828287
+4 0.194158 0.037697
+```
+
+::: tip Note
+
+``apply`` can act as a reducer, transformer, *or* filter function, depending on exactly what is passed to it.
+Depending on the path taken, and exactly what you are grouping, the grouped column(s) may be included in
+the output and may also be set as the index.
+
+:::
+
+## Other useful features
+
+### Automatic exclusion of “nuisance” columns
+
+Again consider the example DataFrame we’ve been looking at:
+
+``` python
+In [157]: df
+Out[157]:
+ A B C D
+0 foo one -0.575247 1.346061
+1 bar one 0.254161 1.511763
+2 foo two -1.143704 1.627081
+3 bar three 0.215897 -0.990582
+4 foo two 1.193555 -0.441652
+5 bar two -0.077118 1.211526
+6 foo one -0.408530 0.268520
+7 foo three -0.862495 0.024580
+```
+
+Suppose we wish to compute the standard deviation grouped by the ``A``
+column. There is a slight problem, namely that we don’t care about the data in
+column ``B``. We refer to this as a “nuisance” column. If the passed
+aggregation function can’t be applied to some columns, the troublesome columns
+will be (silently) dropped. Thus, this does not pose any problems:
+
+``` python
+In [158]: df.groupby('A').std()
+Out[158]:
+ C D
+A
+bar 0.181231 1.366330
+foo 0.912265 0.884785
+```
+
+Note that ``df.groupby('A').colname.std()`` is more efficient than
+``df.groupby('A').std().colname``, so if the result of an aggregation function
+is only interesting over one column (here ``colname``), it may be filtered
+*before* applying the aggregation function.
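+
+A brief sketch of the two forms (the efficiency claim comes from the paragraph
+above and is not re-measured here; the frame is constructed only for
+illustration):
+
+``` python
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
+                   'C': np.random.randn(4),
+                   'D': np.random.randn(4)})
+
+# Select the column first, so only 'C' is aggregated...
+std_fast = df.groupby('A')['C'].std()
+
+# ...rather than aggregating every column and then selecting one.
+std_slow = df.groupby('A').std()['C']
+```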
+
+::: tip Note
+
+Any object column, even if it contains numerical values such as ``Decimal``
+objects, is considered a “nuisance” column. Such columns are excluded from
+aggregate functions automatically in groupby.
+
+If you do wish to include decimal or object columns in an aggregation with
+other non-nuisance data types, you must do so explicitly.
+
+:::
+
+``` python
+In [159]: from decimal import Decimal
+
+In [160]: df_dec = pd.DataFrame(
+ .....: {'id': [1, 2, 1, 2],
+ .....: 'int_column': [1, 2, 3, 4],
+ .....: 'dec_column': [Decimal('0.50'), Decimal('0.15'),
+ .....: Decimal('0.25'), Decimal('0.40')]
+ .....: }
+ .....: )
+ .....:
+
+# Decimal columns can be sum'd explicitly by themselves...
+In [161]: df_dec.groupby(['id'])[['dec_column']].sum()
+Out[161]:
+ dec_column
+id
+1 0.75
+2 0.55
+
+# ...but cannot be combined with standard data types or they will be excluded
+In [162]: df_dec.groupby(['id'])[['int_column', 'dec_column']].sum()
+Out[162]:
+ int_column
+id
+1 4
+2 6
+
+# Use .agg function to aggregate over standard and "nuisance" data types
+# at the same time
+In [163]: df_dec.groupby(['id']).agg({'int_column': 'sum', 'dec_column': 'sum'})
+Out[163]:
+ int_column dec_column
+id
+1 4 0.75
+2 6 0.55
+```
+
+### Handling of (un)observed Categorical values
+
+When using a ``Categorical`` grouper (as a single grouper, or as part of multiple groupers), the ``observed`` keyword
+controls whether to return a cartesian product of all possible grouper values (``observed=False``) or only those
+values that are actually observed (``observed=True``).
+
+Show all values:
+
+``` python
+In [164]: pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'],
+ .....: categories=['a', 'b']),
+ .....: observed=False).count()
+ .....:
+Out[164]:
+a 3
+b 0
+dtype: int64
+```
+
+Show only the observed values:
+
+``` python
+In [165]: pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'],
+ .....: categories=['a', 'b']),
+ .....: observed=True).count()
+ .....:
+Out[165]:
+a 3
+dtype: int64
+```
+
+The dtype of the returned group index will *always* include *all* of the categories that were grouped.
+
+``` python
+In [166]: s = pd.Series([1, 1, 1]).groupby(pd.Categorical(['a', 'a', 'a'],
+ .....: categories=['a', 'b']),
+ .....: observed=False).count()
+ .....:
+
+In [167]: s.index.dtype
+Out[167]: CategoricalDtype(categories=['a', 'b'], ordered=False)
+```
+
+### NA and NaT group handling
+
+If there are any NaN or NaT values in the grouping key, these will be
+automatically excluded. In other words, there will never be an “NA group” or
+“NaT group”. This was not the case in older versions of pandas, but users were
+generally discarding the NA group anyway (and supporting it was an
+implementation headache).
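+
+A short sketch of this behavior (data chosen only for illustration): the row
+whose key is ``NaN`` simply drops out of the result.
+
+``` python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([1, 2, 3, 4])
+key = ['a', np.nan, 'a', 'b']
+
+# The row keyed by NaN is excluded; there is no "NA group" in the output.
+print(s.groupby(key).sum())
+```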
+
+### Grouping with ordered factors
+
+Categorical variables represented as instances of pandas’ ``Categorical`` class
+can be used as group keys. If so, the order of the levels will be preserved:
+
+``` python
+In [168]: data = pd.Series(np.random.randn(100))
+
+In [169]: factor = pd.qcut(data, [0, .25, .5, .75, 1.])
+
+In [170]: data.groupby(factor).mean()
+Out[170]:
+(-2.645, -0.523] -1.362896
+(-0.523, 0.0296] -0.260266
+(0.0296, 0.654] 0.361802
+(0.654, 2.21] 1.073801
+dtype: float64
+```
+
+### Grouping with a grouper specification
+
+You may need to specify a bit more data to properly group. You can
+use the ``pd.Grouper`` to provide this local control.
+
+``` python
+In [171]: import datetime
+
+In [172]: df = pd.DataFrame({'Branch': 'A A A A A A A B'.split(),
+ .....: 'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
+ .....: 'Quantity': [1, 3, 5, 1, 8, 1, 9, 3],
+ .....: 'Date': [
+ .....: datetime.datetime(2013, 1, 1, 13, 0),
+ .....: datetime.datetime(2013, 1, 1, 13, 5),
+ .....: datetime.datetime(2013, 10, 1, 20, 0),
+ .....: datetime.datetime(2013, 10, 2, 10, 0),
+ .....: datetime.datetime(2013, 10, 1, 20, 0),
+ .....: datetime.datetime(2013, 10, 2, 10, 0),
+ .....: datetime.datetime(2013, 12, 2, 12, 0),
+ .....: datetime.datetime(2013, 12, 2, 14, 0)]
+ .....: })
+ .....:
+
+In [173]: df
+Out[173]:
+ Branch Buyer Quantity Date
+0 A Carl 1 2013-01-01 13:00:00
+1 A Mark 3 2013-01-01 13:05:00
+2 A Carl 5 2013-10-01 20:00:00
+3 A Carl 1 2013-10-02 10:00:00
+4 A Joe 8 2013-10-01 20:00:00
+5 A Joe 1 2013-10-02 10:00:00
+6 A Joe 9 2013-12-02 12:00:00
+7 B Carl 3 2013-12-02 14:00:00
+```
+
+Group by a specific column with the desired frequency. This is like resampling.
+
+``` python
+In [174]: df.groupby([pd.Grouper(freq='1M', key='Date'), 'Buyer']).sum()
+Out[174]:
+ Quantity
+Date Buyer
+2013-01-31 Carl 1
+ Mark 3
+2013-10-31 Carl 6
+ Joe 9
+2013-12-31 Carl 3
+ Joe 9
+```
+
+When there is both a named index and a column that could serve as groupers, the specification
+is ambiguous; use the ``key`` or ``level`` argument of ``pd.Grouper`` to state which one you mean.
+
+``` python
+In [175]: df = df.set_index('Date')
+
+In [176]: df['Date'] = df.index + pd.offsets.MonthEnd(2)
+
+In [177]: df.groupby([pd.Grouper(freq='6M', key='Date'), 'Buyer']).sum()
+Out[177]:
+ Quantity
+Date Buyer
+2013-02-28 Carl 1
+ Mark 3
+2014-02-28 Carl 9
+ Joe 18
+
+In [178]: df.groupby([pd.Grouper(freq='6M', level='Date'), 'Buyer']).sum()
+Out[178]:
+ Quantity
+Date Buyer
+2013-01-31 Carl 1
+ Mark 3
+2014-01-31 Carl 9
+ Joe 18
+```
+
+### Taking the first rows of each group
+
+Just like for a DataFrame or Series you can call head and tail on a groupby:
+
+``` python
+In [179]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
+
+In [180]: df
+Out[180]:
+ A B
+0 1 2
+1 1 4
+2 5 6
+
+In [181]: g = df.groupby('A')
+
+In [182]: g.head(1)
+Out[182]:
+ A B
+0 1 2
+2 5 6
+
+In [183]: g.tail(1)
+Out[183]:
+ A B
+1 1 4
+2 5 6
+```
+
+This shows the first or last n rows from each group.
+
+### Taking the nth row of each group
+
+To select from a DataFrame or Series the nth item, use
+``nth()``. This is a reduction method, and
+will return a single row (or no row) per group if you pass an int for n:
+
+``` python
+In [184]: df = pd.DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
+
+In [185]: g = df.groupby('A')
+
+In [186]: g.nth(0)
+Out[186]:
+ B
+A
+1 NaN
+5 6.0
+
+In [187]: g.nth(-1)
+Out[187]:
+ B
+A
+1 4.0
+5 6.0
+
+In [188]: g.nth(1)
+Out[188]:
+ B
+A
+1 4.0
+```
+
+If you want to select the nth not-null item, use the ``dropna`` kwarg. For a DataFrame this should be either ``'any'`` or ``'all'`` just like you would pass to dropna:
+
+``` python
+# nth(0) is the same as g.first()
+In [189]: g.nth(0, dropna='any')
+Out[189]:
+ B
+A
+1 4.0
+5 6.0
+
+In [190]: g.first()
+Out[190]:
+ B
+A
+1 4.0
+5 6.0
+
+# nth(-1) is the same as g.last()
+In [191]: g.nth(-1, dropna='any') # NaNs denote group exhausted when using dropna
+Out[191]:
+ B
+A
+1 4.0
+5 6.0
+
+In [192]: g.last()
+Out[192]:
+ B
+A
+1 4.0
+5 6.0
+
+In [193]: g.B.nth(0, dropna='all')
+Out[193]:
+A
+1 4.0
+5 6.0
+Name: B, dtype: float64
+```
+
+As with other methods, passing ``as_index=False`` will achieve a filtration, which returns the grouped row.
+
+``` python
+In [194]: df = pd.DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
+
+In [195]: g = df.groupby('A', as_index=False)
+
+In [196]: g.nth(0)
+Out[196]:
+ A B
+0 1 NaN
+2 5 6.0
+
+In [197]: g.nth(-1)
+Out[197]:
+ A B
+1 1 4.0
+2 5 6.0
+```
+
+You can also select multiple rows from each group by specifying multiple nth values as a list of ints.
+
+``` python
+In [198]: business_dates = pd.date_range(start='4/1/2014', end='6/30/2014', freq='B')
+
+In [199]: df = pd.DataFrame(1, index=business_dates, columns=['a', 'b'])
+
+# get the first, 4th, and last date index for each month
+In [200]: df.groupby([df.index.year, df.index.month]).nth([0, 3, -1])
+Out[200]:
+ a b
+2014 4 1 1
+ 4 1 1
+ 4 1 1
+ 5 1 1
+ 5 1 1
+ 5 1 1
+ 6 1 1
+ 6 1 1
+ 6 1 1
+```
+
+### Enumerate group items
+
+To see the order in which each row appears within its group, use the
+``cumcount`` method:
+
+``` python
+In [201]: dfg = pd.DataFrame(list('aaabba'), columns=['A'])
+
+In [202]: dfg
+Out[202]:
+ A
+0 a
+1 a
+2 a
+3 b
+4 b
+5 a
+
+In [203]: dfg.groupby('A').cumcount()
+Out[203]:
+0 0
+1 1
+2 2
+3 0
+4 1
+5 3
+dtype: int64
+
+In [204]: dfg.groupby('A').cumcount(ascending=False)
+Out[204]:
+0 3
+1 2
+2 1
+3 1
+4 0
+5 0
+dtype: int64
+```
+
+### Enumerate groups
+
+*New in version 0.20.2.*
+
+To see the ordering of the groups (as opposed to the order of rows
+within a group given by ``cumcount``) you can use
+``ngroup()``.
+
+Note that the numbers given to the groups match the order in which the
+groups would be seen when iterating over the groupby object, not the
+order they are first observed.
+
+``` python
+In [205]: dfg = pd.DataFrame(list('aaabba'), columns=['A'])
+
+In [206]: dfg
+Out[206]:
+ A
+0 a
+1 a
+2 a
+3 b
+4 b
+5 a
+
+In [207]: dfg.groupby('A').ngroup()
+Out[207]:
+0 0
+1 0
+2 0
+3 1
+4 1
+5 0
+dtype: int64
+
+In [208]: dfg.groupby('A').ngroup(ascending=False)
+Out[208]:
+0 1
+1 1
+2 1
+3 0
+4 0
+5 1
+dtype: int64
+```
+
+### Plotting
+
+Groupby also works with some plotting methods. For example, suppose we
+suspect that some features in a DataFrame may differ by group; in this case,
+the values in column 1 where the group is “B” are 3 higher on average.
+
+``` python
+In [209]: np.random.seed(1234)
+
+In [210]: df = pd.DataFrame(np.random.randn(50, 2))
+
+In [211]: df['g'] = np.random.choice(['A', 'B'], size=50)
+
+In [212]: df.loc[df['g'] == 'B', 1] += 3
+```
+
+We can easily visualize this with a boxplot:
+
+``` python
+In [213]: df.groupby('g').boxplot()
+Out[213]:
+A AxesSubplot(0.1,0.15;0.363636x0.75)
+B AxesSubplot(0.536364,0.15;0.363636x0.75)
+dtype: object
+```
+
+
+
+The result of calling ``boxplot`` is a dictionary whose keys are the values
+of our grouping column ``g`` (“A” and “B”). The values of the resulting dictionary
+can be controlled by the ``return_type`` keyword of ``boxplot``.
+See the [visualization documentation](visualization.html#visualization-box) for more.
+
+::: danger Warning
+
+For historical reasons, ``df.groupby("g").boxplot()`` is not equivalent
+to ``df.boxplot(by="g")``. See [here](visualization.html#visualization-box-return) for
+an explanation.
+
+:::
+
+### Piping function calls
+
+*New in version 0.21.0.*
+
+Similar to the functionality provided by ``DataFrame`` and ``Series``, functions
+that take ``GroupBy`` objects can be chained together using a ``pipe`` method to
+allow for a cleaner, more readable syntax. To read about ``.pipe`` in general terms,
+see [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-pipe).
+
+Combining ``.groupby`` and ``.pipe`` is often useful when you need to reuse
+GroupBy objects.
+
+As an example, imagine having a DataFrame with columns for stores, products,
+revenue and quantity sold. We’d like to do a groupwise calculation of *prices*
+(i.e. revenue/quantity) per store and per product. We could do this in a
+multi-step operation, but expressing it in terms of piping can make the
+code more readable. First we set the data:
+
+``` python
+In [214]: n = 1000
+
+In [215]: df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
+ .....: 'Product': np.random.choice(['Product_1',
+ .....: 'Product_2'], n),
+ .....: 'Revenue': (np.random.random(n) * 50 + 10).round(2),
+ .....: 'Quantity': np.random.randint(1, 10, size=n)})
+ .....:
+
+In [216]: df.head(2)
+Out[216]:
+ Store Product Revenue Quantity
+0 Store_2 Product_1 26.12 1
+1 Store_2 Product_1 28.86 1
+```
+
+Now, to find prices per store/product, we can simply do:
+
+``` python
+In [217]: (df.groupby(['Store', 'Product'])
+ .....: .pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())
+ .....: .unstack().round(2))
+ .....:
+Out[217]:
+Product Product_1 Product_2
+Store
+Store_1 6.82 7.05
+Store_2 6.30 6.64
+```
+
+Piping can also be expressive when you want to deliver a grouped object to some
+arbitrary function, for example:
+
+``` python
+In [218]: def mean(groupby):
+ .....: return groupby.mean()
+ .....:
+
+In [219]: df.groupby(['Store', 'Product']).pipe(mean)
+Out[219]:
+ Revenue Quantity
+Store Product
+Store_1 Product_1 34.622727 5.075758
+ Product_2 35.482815 5.029630
+Store_2 Product_1 32.972837 5.237589
+ Product_2 34.684360 5.224000
+```
+
+where ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity
+columns respectively for each Store-Product combination. The ``mean`` function can
+be any function that takes in a GroupBy object; the ``.pipe`` will pass the GroupBy
+object as a parameter into the function you specify.
+
+## Examples
+
+### Regrouping by factor
+
+Regroup columns of a DataFrame according to their sum, and sum the aggregated ones.
+
+``` python
+In [220]: df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0],
+ .....: 'c': [1, 0, 0], 'd': [2, 3, 4]})
+ .....:
+
+In [221]: df
+Out[221]:
+ a b c d
+0 1 0 1 2
+1 0 1 0 3
+2 0 0 0 4
+
+In [222]: df.groupby(df.sum(), axis=1).sum()
+Out[222]:
+ 1 9
+0 2 2
+1 1 3
+2 0 4
+```
+
+### Multi-column factorization
+
+By using ``ngroup()``, we can extract
+information about the groups in a way similar to [``factorize()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html#pandas.factorize) (as described
+further in the [reshaping API](reshaping.html#reshaping-factorize)) but which applies
+naturally to multiple columns of mixed type and different
+sources. This can be useful as an intermediate categorical-like step
+in processing, when the relationships between the group rows are more
+important than their content, or as input to an algorithm which only
+accepts the integer encoding. (For more information about support in
+pandas for full categorical data, see the [Categorical
+introduction](categorical.html#categorical) and the
+[API documentation](https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#api-arrays-categorical).)
+
+``` python
+In [223]: dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})
+
+In [224]: dfg
+Out[224]:
+ A B
+0 1 a
+1 1 a
+2 2 a
+3 3 b
+4 2 a
+
+In [225]: dfg.groupby(["A", "B"]).ngroup()
+Out[225]:
+0 0
+1 0
+2 1
+3 2
+4 1
+dtype: int64
+
+In [226]: dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
+Out[226]:
+0 0
+1 0
+2 1
+3 3
+4 2
+dtype: int64
+```
+
+### Groupby by indexer to ‘resample’ data
+
+Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.
+
+In order for resampling to work on indices that are non-datetimelike, the following procedure can be utilized.
+
+In the following examples, **df.index // 5** returns an integer array which is used to determine what gets selected for the groupby operation.
+
+::: tip Note
+
+The example below shows how we can downsample by consolidating samples into fewer samples. Here, by using **df.index // 5**, we aggregate the samples into bins. By applying the **std()** function, we summarize the information contained in many samples by their standard deviation, thereby reducing the number of samples.
+
+:::
+
+``` python
+In [227]: df = pd.DataFrame(np.random.randn(10, 2))
+
+In [228]: df
+Out[228]:
+ 0 1
+0 -0.793893 0.321153
+1 0.342250 1.618906
+2 -0.975807 1.918201
+3 -0.810847 -1.405919
+4 -1.977759 0.461659
+5 0.730057 -1.316938
+6 -0.751328 0.528290
+7 -0.257759 -1.081009
+8 0.505895 -1.701948
+9 -1.006349 0.020208
+
+In [229]: df.index // 5
+Out[229]: Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype='int64')
+
+In [230]: df.groupby(df.index // 5).std()
+Out[230]:
+ 0 1
+0 0.823647 1.312912
+1 0.760109 0.942941
+```
+
+### Returning a Series to propagate names
+
+Group DataFrame columns, compute a set of metrics and return a named Series.
+The Series name is used as the name for the column index. This is especially
+useful in conjunction with reshaping operations such as stacking in which the
+column index name will be used as the name of the inserted column:
+
+``` python
+In [231]: df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
+ .....: 'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
+ .....: 'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
+ .....: 'd': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]})
+ .....:
+
+In [232]: def compute_metrics(x):
+ .....: result = {'b_sum': x['b'].sum(), 'c_mean': x['c'].mean()}
+ .....: return pd.Series(result, name='metrics')
+ .....:
+
+In [233]: result = df.groupby('a').apply(compute_metrics)
+
+In [234]: result
+Out[234]:
+metrics b_sum c_mean
+a
+0 2.0 0.5
+1 2.0 0.5
+2 2.0 0.5
+
+In [235]: result.stack()
+Out[235]:
+a metrics
+0 b_sum 2.0
+ c_mean 0.5
+1 b_sum 2.0
+ c_mean 0.5
+2 b_sum 2.0
+ c_mean 0.5
+dtype: float64
+```
diff --git a/Python/pandas/user_guide/indexing.md b/Python/pandas/user_guide/indexing.md
new file mode 100644
index 00000000..7308ef5a
--- /dev/null
+++ b/Python/pandas/user_guide/indexing.md
@@ -0,0 +1,3114 @@
+# Indexing and selecting data
+
+The axis labeling information in pandas objects serves many purposes:
+
+- Identifies data (i.e. provides *metadata*) using known indicators, important for analysis, visualization, and interactive console display.
+- Enables automatic and explicit data alignment.
+- Allows intuitive getting and setting of subsets of the data set.
+
+In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.
+
+::: tip Note
+
+The Python and NumPy indexing operators ``[]`` and attribute operator ``.``
+provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there is little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed is not known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods exposed in this chapter.
+
+:::
+
+::: danger Warning
+
+Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called ``chained assignment`` and should be avoided. See [Returning a view versus a copy](#indexing-view-versus-copy).
+
+:::
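+
+As a minimal, hedged sketch of the pattern this warning refers to (column names
+are illustrative), prefer a single ``.loc`` call over chained indexing when
+setting values:
+
+``` python
+import pandas as pd
+
+df = pd.DataFrame({'A': [1, -2, 3], 'B': [10, 20, 30]})
+
+# Chained assignment: may write to a temporary copy and is unreliable.
+# df[df['A'] > 0]['B'] = 0
+
+# Preferred: one .loc call that sets values on the original object.
+df.loc[df['A'] > 0, 'B'] = 0
+```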
+
+::: danger Warning
+
+Indexing on an integer-based Index with floats has been clarified in 0.18.0; for a summary of the changes, see [here](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.18.0.html#whatsnew-0180-float-indexers).
+
+:::
+
+See the [MultiIndex / Advanced Indexing](advanced.html#advanced) documentation for ``MultiIndex`` and more advanced indexing.
+
+See the [cookbook](cookbook.html#cookbook-selection) for some advanced strategies.
+
+## Different choices for indexing
+
+Object selection has had a number of user-requested additions in order to support more explicit location-based indexing. Pandas now supports three types of multi-axis indexing.
+
+- ``.loc`` is primarily label based, but may also be used with a boolean array. ``.loc`` will raise ``KeyError`` when the items are not found. Allowed inputs are:
+  - A single label, e.g. ``5`` or ``'a'`` (note that ``5`` is interpreted as a
+  *label* of the index; this use is **not** an integer position along the index).
+  - A list or array of labels ``['a', 'b', 'c']``.
+  - A slice object with labels ``'a':'f'`` (note that contrary to usual Python slices, **both** the start and the stop are included when present in the index! See [Slicing with labels](#indexing-slicing-with-labels)
+  and [Endpoints are inclusive](advanced.html#advanced-endpoints-are-inclusive).)
+  - A boolean array.
+  - A ``callable`` function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
+
+  *New in version 0.18.1.*
+
+  See more at [Selection by Label](#indexing-label).
+
+- ``.iloc`` is primarily integer position based (from ``0`` to ``length-1`` of the axis), but may also be used with a boolean array. ``.iloc`` will raise ``IndexError`` if a requested indexer is out-of-bounds, except *slice* indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy *slice*
+semantics). Allowed inputs are:
+  - An integer, e.g. ``5``.
+  - A list or array of integers ``[4, 3, 0]``.
+  - A slice object with ints ``1:7``.
+  - A boolean array.
+  - A ``callable`` function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
+
+  *New in version 0.18.1.*
+
+  See more at [Selection by Position](#indexing-integer), [Advanced Indexing](advanced.html#advanced) and [Advanced Hierarchical](advanced.html#advanced-advanced-hierarchical).
+
+- ``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer. See more at [Selection By Callable](#indexing-callable).
+
+Getting values from an object with multi-axes selection uses the following
+notation (using ``.loc`` as an example, but the following applies to ``.iloc`` as
+well). Any of the axes accessors may be the null slice ``:``. Axes left out of
+the specification are assumed to be ``:``, e.g. ``p.loc['a']`` is equivalent to
+``p.loc['a', :, :]``. A short sketch of these accessors follows the table below.
+
+Object Type | Indexers
+---|---
+Series | s.loc[indexer]
+DataFrame | df.loc[row_indexer,column_indexer]
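+
+A short, hedged sketch contrasting the accessors on a tiny Series (labels and
+values are chosen only for illustration):
+
+``` python
+import pandas as pd
+
+s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
+
+s.loc['a':'b']           # label based: both endpoints included -> 'a', 'b'
+s.iloc[0:2]              # position based: half-open interval -> 'a', 'b'
+s[lambda ser: ser > 10]  # [] also accepts a callable indexer -> 'b', 'c'
+```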
+
+## Basics
+
+As mentioned when introducing the data structures in the [last section](/docs/getting_started/basics.html), the primary function of indexing with ``[]`` (a.k.a. ``__getitem__``
+for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows return type values when indexing pandas objects with ``[]``:
+
+Object Type | Selection | Return Value Type
+---|---|---
+Series | series[label] | scalar value
+DataFrame | frame[colname] | Series corresponding to colname
+
+Here we construct a simple time series data set to use for illustrating the indexing functionality:
+
+``` python
+In [1]: dates = pd.date_range('1/1/2000', periods=8)
+
+In [2]: df = pd.DataFrame(np.random.randn(8, 4),
+ ...: index=dates, columns=['A', 'B', 'C', 'D'])
+ ...:
+
+In [3]: df
+Out[3]:
+ A B C D
+2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
+2000-01-02 1.212112 -0.173215 0.119209 -1.044236
+2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
+2000-01-04 0.721555 -0.706771 -1.039575 0.271860
+2000-01-05 -0.424972 0.567020 0.276232 -1.087401
+2000-01-06 -0.673690 0.113648 -1.478427 0.524988
+2000-01-07 0.404705 0.577046 -1.715002 -1.039268
+2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
+```
+
+::: tip Note
+
+None of the indexing functionality is time series specific unless specifically stated.
+
+:::
+
+Thus, as per above, we have the most basic indexing using ``[]``:
+
+``` python
+In [4]: s = df['A']
+
+In [5]: s[dates[5]]
+Out[5]: -0.6736897080883706
+```
+
+You can pass a list of columns to ``[]`` to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner:
+
+``` python
+In [6]: df
+Out[6]:
+ A B C D
+2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
+2000-01-02 1.212112 -0.173215 0.119209 -1.044236
+2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
+2000-01-04 0.721555 -0.706771 -1.039575 0.271860
+2000-01-05 -0.424972 0.567020 0.276232 -1.087401
+2000-01-06 -0.673690 0.113648 -1.478427 0.524988
+2000-01-07 0.404705 0.577046 -1.715002 -1.039268
+2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
+
+In [7]: df[['B', 'A']] = df[['A', 'B']]
+
+In [8]: df
+Out[8]:
+ A B C D
+2000-01-01 -0.282863 0.469112 -1.509059 -1.135632
+2000-01-02 -0.173215 1.212112 0.119209 -1.044236
+2000-01-03 -2.104569 -0.861849 -0.494929 1.071804
+2000-01-04 -0.706771 0.721555 -1.039575 0.271860
+2000-01-05 0.567020 -0.424972 0.276232 -1.087401
+2000-01-06 0.113648 -0.673690 -1.478427 0.524988
+2000-01-07 0.577046 0.404705 -1.715002 -1.039268
+2000-01-08 -1.157892 -0.370647 -1.344312 0.844885
+```
+
+You may find this useful for applying a transform (in-place) to a subset of the columns.
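+
+As a small illustration (not from the original text; the frame ``df_sub`` and the ``round`` transform are arbitrary choices), the assignment below touches only the listed columns, and the matching column labels on both sides keep the alignment trivial:
+
+``` python
+import numpy as np
+import pandas as pd
+
+# Hypothetical frame; only columns 'A' and 'B' are transformed in place.
+df_sub = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'B', 'C'])
+df_sub[['A', 'B']] = df_sub[['A', 'B']].round(2)
+```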
+
+::: danger 警告
+
+pandas aligns all AXES when setting ``Series`` and ``DataFrame`` from ``.loc`` and ``.iloc``.
+
+This will **not** modify ``df`` because the column alignment happens before the value assignment.
+
+``` python
+In [9]: df[['A', 'B']]
+Out[9]:
+ A B
+2000-01-01 -0.282863 0.469112
+2000-01-02 -0.173215 1.212112
+2000-01-03 -2.104569 -0.861849
+2000-01-04 -0.706771 0.721555
+2000-01-05 0.567020 -0.424972
+2000-01-06 0.113648 -0.673690
+2000-01-07 0.577046 0.404705
+2000-01-08 -1.157892 -0.370647
+
+In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]
+
+In [11]: df[['A', 'B']]
+Out[11]:
+ A B
+2000-01-01 -0.282863 0.469112
+2000-01-02 -0.173215 1.212112
+2000-01-03 -2.104569 -0.861849
+2000-01-04 -0.706771 0.721555
+2000-01-05 0.567020 -0.424972
+2000-01-06 0.113648 -0.673690
+2000-01-07 0.577046 0.404705
+2000-01-08 -1.157892 -0.370647
+```
+
+交换列值的正确方法是使用原始值:
+
+``` python
+In [12]: df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
+
+In [13]: df[['A', 'B']]
+Out[13]:
+ A B
+2000-01-01 0.469112 -0.282863
+2000-01-02 1.212112 -0.173215
+2000-01-03 -0.861849 -2.104569
+2000-01-04 0.721555 -0.706771
+2000-01-05 -0.424972 0.567020
+2000-01-06 -0.673690 0.113648
+2000-01-07 0.404705 0.577046
+2000-01-08 -0.370647 -1.157892
+```
+
+:::
+
+## 属性访问
+
+You may access an index on a ``Series`` or a column on a ``DataFrame`` directly as an attribute:
+
+``` python
+In [14]: sa = pd.Series([1, 2, 3], index=list('abc'))
+
+In [15]: dfa = df.copy()
+```
+
+``` python
+In [16]: sa.b
+Out[16]: 2
+
+In [17]: dfa.A
+Out[17]:
+2000-01-01 0.469112
+2000-01-02 1.212112
+2000-01-03 -0.861849
+2000-01-04 0.721555
+2000-01-05 -0.424972
+2000-01-06 -0.673690
+2000-01-07 0.404705
+2000-01-08 -0.370647
+Freq: D, Name: A, dtype: float64
+```
+
+``` python
+In [18]: sa.a = 5
+
+In [19]: sa
+Out[19]:
+a 5
+b 2
+c 3
+dtype: int64
+
+In [20]: dfa.A = list(range(len(dfa.index))) # ok if A already exists
+
+In [21]: dfa
+Out[21]:
+ A B C D
+2000-01-01 0 -0.282863 -1.509059 -1.135632
+2000-01-02 1 -0.173215 0.119209 -1.044236
+2000-01-03 2 -2.104569 -0.494929 1.071804
+2000-01-04 3 -0.706771 -1.039575 0.271860
+2000-01-05 4 0.567020 0.276232 -1.087401
+2000-01-06 5 0.113648 -1.478427 0.524988
+2000-01-07 6 0.577046 -1.715002 -1.039268
+2000-01-08 7 -1.157892 -1.344312 0.844885
+
+In [22]: dfa['A'] = list(range(len(dfa.index))) # use this form to create a new column
+
+In [23]: dfa
+Out[23]:
+ A B C D
+2000-01-01 0 -0.282863 -1.509059 -1.135632
+2000-01-02 1 -0.173215 0.119209 -1.044236
+2000-01-03 2 -2.104569 -0.494929 1.071804
+2000-01-04 3 -0.706771 -1.039575 0.271860
+2000-01-05 4 0.567020 0.276232 -1.087401
+2000-01-06 5 0.113648 -1.478427 0.524988
+2000-01-07 6 0.577046 -1.715002 -1.039268
+2000-01-08 7 -1.157892 -1.344312 0.844885
+```
+
+::: danger 警告
+
+- You can use this access only if the index element is a valid Python identifier, e.g. ``s.1`` is not allowed. See [here for an explanation of valid identifiers](https://docs.python.org/3/reference/lexical_analysis.html#identifiers).
+- The attribute will not be available if it conflicts with an existing method name, e.g. ``s.min`` is not allowed.
+- Similarly, the attribute will not be available if it conflicts with any of the following list: ``index``,
+ ``major_axis``, ``minor_axis``, ``items``.
+- In any of these cases, standard indexing will still work, e.g. ``s['1']``, ``s['min']``, and ``s['index']`` will access the corresponding element or column.
+
+:::
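+
+A quick sketch of those fallbacks (the labels below are hypothetical and chosen to collide on purpose):
+
+``` python
+import pandas as pd
+
+# Labels deliberately collide with a method name and a reserved attribute.
+s = pd.Series([3, 1, 2], index=['min', '1', 'index'])
+
+s['min']    # the element labelled 'min'
+s.min       # the Series.min method, not the element
+s['index']  # the element labelled 'index'; s.index is the index object itself
+```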
+
+如果您使用的是IPython环境,则还可以使用tab-completion来查看这些可访问的属性。
+
+You can also assign a ``dict`` to a row of a ``DataFrame``:
+
+``` python
+In [24]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
+
+In [25]: x.iloc[1] = {'x': 9, 'y': 99}
+
+In [26]: x
+Out[26]:
+ x y
+0 1 3
+1 9 99
+2 3 5
+```
+
+您可以使用属性访问来修改DataFrame的Series或列的现有元素,但要小心; 如果您尝试使用属性访问权来创建新列,则会创建新属性而不是新列。在0.21.0及更高版本中,这将引发``UserWarning``:
+
+``` python
+In [1]: df = pd.DataFrame({'one': [1., 2., 3.]})
+In [2]: df.two = [4, 5, 6]
+UserWarning: Pandas doesn't allow Series to be assigned into nonexistent columns - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute_access
+In [3]: df
+Out[3]:
+ one
+0 1.0
+1 2.0
+2 3.0
+```
+
+## 切片范围
+
+The most robust and consistent way of slicing ranges along arbitrary axes is described in the [Selection by Position](#indexing-integer) section detailing the ``.iloc`` method. For now, we explain the semantics of slicing using the ``[]`` operator.
+
+使用Series,语法与ndarray完全一样,返回值的一部分和相应的标签:
+
+``` python
+In [27]: s[:5]
+Out[27]:
+2000-01-01 0.469112
+2000-01-02 1.212112
+2000-01-03 -0.861849
+2000-01-04 0.721555
+2000-01-05 -0.424972
+Freq: D, Name: A, dtype: float64
+
+In [28]: s[::2]
+Out[28]:
+2000-01-01 0.469112
+2000-01-03 -0.861849
+2000-01-05 -0.424972
+2000-01-07 0.404705
+Freq: 2D, Name: A, dtype: float64
+
+In [29]: s[::-1]
+Out[29]:
+2000-01-08 -0.370647
+2000-01-07 0.404705
+2000-01-06 -0.673690
+2000-01-05 -0.424972
+2000-01-04 0.721555
+2000-01-03 -0.861849
+2000-01-02 1.212112
+2000-01-01 0.469112
+Freq: -1D, Name: A, dtype: float64
+```
+
+请注意,设置也适用:
+
+``` python
+In [30]: s2 = s.copy()
+
+In [31]: s2[:5] = 0
+
+In [32]: s2
+Out[32]:
+2000-01-01 0.000000
+2000-01-02 0.000000
+2000-01-03 0.000000
+2000-01-04 0.000000
+2000-01-05 0.000000
+2000-01-06 -0.673690
+2000-01-07 0.404705
+2000-01-08 -0.370647
+Freq: D, Name: A, dtype: float64
+```
+
+With DataFrame, slicing inside of ``[]`` **slices the rows**. This is provided largely as a convenience since it is such a common operation.
+
+``` python
+In [33]: df[:3]
+Out[33]:
+ A B C D
+2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
+2000-01-02 1.212112 -0.173215 0.119209 -1.044236
+2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
+
+In [34]: df[::-1]
+Out[34]:
+ A B C D
+2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
+2000-01-07 0.404705 0.577046 -1.715002 -1.039268
+2000-01-06 -0.673690 0.113648 -1.478427 0.524988
+2000-01-05 -0.424972 0.567020 0.276232 -1.087401
+2000-01-04 0.721555 -0.706771 -1.039575 0.271860
+2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
+2000-01-02 1.212112 -0.173215 0.119209 -1.044236
+2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
+```
+
+## 按标签选择
+
+::: danger 警告
+
+Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called ``chained assignment`` and should be avoided. See [Returning a View versus Copy](#indexing-view-versus-copy).
+
+:::
+
+::: danger 警告
+
+``` python
+In [35]: dfl = pd.DataFrame(np.random.randn(5, 4),
+ ....: columns=list('ABCD'),
+ ....: index=pd.date_range('20130101', periods=5))
+ ....:
+
+In [36]: dfl
+Out[36]:
+ A B C D
+2013-01-01 1.075770 -0.109050 1.643563 -1.469388
+2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
+2013-01-03 -1.294524 0.413738 0.276662 -0.472035
+2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
+2013-01-05 0.895717 0.805244 -1.206412 2.565646
+```
+
+``` python
+In [4]: dfl.loc[2:3]
+TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [2] of <class 'int'>
+```
+
+String likes in slicing *can* be converted to the type of the index and lead to natural slicing.
+
+``` python
+In [37]: dfl.loc['20130102':'20130104']
+Out[37]:
+ A B C D
+2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
+2013-01-03 -1.294524 0.413738 0.276662 -0.472035
+2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
+```
+
+:::
+
+::: danger 警告
+
+Starting in 0.21.0, pandas will show a ``FutureWarning`` if indexing with a list with missing labels. In the future this will raise a ``KeyError``. See [Using loc with a list containing missing keys is deprecated](#indexing-deprecate-loc-reindex-listlike).
+
+:::
+
+pandas provides a suite of methods in order to have **purely label based indexing**. This is a strict inclusion based protocol. Every label asked for must be in the index, or a ``KeyError`` will be raised. When slicing, both the start bound **and** the stop bound are *included*, if present in the index. Integers are valid labels, but they refer to the **label and not the position**.
+
+The ``.loc`` attribute is the primary access method. The following are valid inputs:
+
+- A single label, e.g. ``5`` or ``'a'`` (note that ``5`` is interpreted as a *label* of the index; this use is **not** an integer position along the index).
+- A list or array of labels ``['a', 'b', 'c']``.
+- A slice object with labels ``'a':'f'`` (note that contrary to usual Python slices, **both** the start and the stop are included, when present in the index! See [Slicing with labels](#indexing-slicing-with-labels)).
+- A boolean array.
+- A ``callable``, see [Selection By Callable](#indexing-callable).
+
+``` python
+In [38]: s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
+
+In [39]: s1
+Out[39]:
+a 1.431256
+b 1.340309
+c -1.170299
+d -0.226169
+e 0.410835
+f 0.813850
+dtype: float64
+
+In [40]: s1.loc['c':]
+Out[40]:
+c -1.170299
+d -0.226169
+e 0.410835
+f 0.813850
+dtype: float64
+
+In [41]: s1.loc['b']
+Out[41]: 1.3403088497993827
+```
+
+请注意,设置也适用:
+
+``` python
+In [42]: s1.loc['c':] = 0
+
+In [43]: s1
+Out[43]:
+a 1.431256
+b 1.340309
+c 0.000000
+d 0.000000
+e 0.000000
+f 0.000000
+dtype: float64
+```
+
+使用DataFrame:
+
+``` python
+In [44]: df1 = pd.DataFrame(np.random.randn(6, 4),
+ ....: index=list('abcdef'),
+ ....: columns=list('ABCD'))
+ ....:
+
+In [45]: df1
+Out[45]:
+ A B C D
+a 0.132003 -0.827317 -0.076467 -1.187678
+b 1.130127 -1.436737 -1.413681 1.607920
+c 1.024180 0.569605 0.875906 -2.211372
+d 0.974466 -2.006747 -0.410001 -0.078638
+e 0.545952 -1.219217 -1.226825 0.769804
+f -1.281247 -0.727707 -0.121306 -0.097883
+
+In [46]: df1.loc[['a', 'b', 'd'], :]
+Out[46]:
+ A B C D
+a 0.132003 -0.827317 -0.076467 -1.187678
+b 1.130127 -1.436737 -1.413681 1.607920
+d 0.974466 -2.006747 -0.410001 -0.078638
+```
+
+通过标签切片访问:
+
+``` python
+In [47]: df1.loc['d':, 'A':'C']
+Out[47]:
+ A B C
+d 0.974466 -2.006747 -0.410001
+e 0.545952 -1.219217 -1.226825
+f -1.281247 -0.727707 -0.121306
+```
+
+使用标签获取横截面(相当于``df.xs('a')``):
+
+``` python
+In [48]: df1.loc['a']
+Out[48]:
+A 0.132003
+B -0.827317
+C -0.076467
+D -1.187678
+Name: a, dtype: float64
+```
+
+要使用布尔数组获取值:
+
+``` python
+In [49]: df1.loc['a'] > 0
+Out[49]:
+A True
+B False
+C False
+D False
+Name: a, dtype: bool
+
+In [50]: df1.loc[:, df1.loc['a'] > 0]
+Out[50]:
+ A
+a 0.132003
+b 1.130127
+c 1.024180
+d 0.974466
+e 0.545952
+f -1.281247
+```
+
+要明确获取值(相当于已弃用``df.get_value('a','A')``):
+
+``` python
+# this is also equivalent to ``df1.at['a','A']``
+In [51]: df1.loc['a', 'A']
+Out[51]: 0.13200317033032932
+```
+
+### 用标签切片
+
+使用``.loc``切片时,如果索引中存在开始和停止标签,则返回*位于*两者之间的元素(包括它们):
+
+``` python
+In [52]: s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])
+
+In [53]: s.loc[3:5]
+Out[53]:
+3 b
+2 c
+5 d
+dtype: object
+```
+
+如果两个中至少有一个不存在,但索引已排序,并且可以与开始和停止标签进行比较,那么通过选择在两者之间*排名的*标签,切片仍将按预期工作:
+
+``` python
+In [54]: s.sort_index()
+Out[54]:
+0 a
+2 c
+3 b
+4 e
+5 d
+dtype: object
+
+In [55]: s.sort_index().loc[1:6]
+Out[55]:
+2 c
+3 b
+4 e
+5 d
+dtype: object
+```
+
+However, if at least one of the two is absent *and* the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the above example, ``s.loc[1:6]`` would raise a ``KeyError``.
+
+有关此行为背后的基本原理,请参阅
+ [端点包含](advanced.html#advanced-endpoints-are-inclusive)。
+
+## 按位置选择
+
+::: danger 警告
+
+Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called ``chained assignment`` and should be avoided. See [Returning a View versus Copy](#indexing-view-versus-copy).
+
+:::
+
+pandas provides a suite of methods in order to get **purely integer based indexing**. The semantics follow closely Python and NumPy slicing. These are ``0-based`` indexing. When slicing, the start bound is *included*, while the upper bound is *excluded*. Trying to use a non-integer, even a **valid** label, will raise an ``IndexError``.
+
+该``.iloc``属性是主要访问方法。以下是有效输入:
+
+- 一个整数,例如``5``。
+- 整数列表或数组。``[4, 3, 0]``
+- 带有整数的切片对象``1:7``。
+- 布尔数组。
+- A ``callable``,参见[按可调用选择](#indexing-callable)。
+
+``` python
+In [56]: s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))
+
+In [57]: s1
+Out[57]:
+0 0.695775
+2 0.341734
+4 0.959726
+6 -1.110336
+8 -0.619976
+dtype: float64
+
+In [58]: s1.iloc[:3]
+Out[58]:
+0 0.695775
+2 0.341734
+4 0.959726
+dtype: float64
+
+In [59]: s1.iloc[3]
+Out[59]: -1.110336102891167
+```
+
+请注意,设置也适用:
+
+``` python
+In [60]: s1.iloc[:3] = 0
+
+In [61]: s1
+Out[61]:
+0 0.000000
+2 0.000000
+4 0.000000
+6 -1.110336
+8 -0.619976
+dtype: float64
+```
+
+使用DataFrame:
+
+``` python
+In [62]: df1 = pd.DataFrame(np.random.randn(6, 4),
+ ....: index=list(range(0, 12, 2)),
+ ....: columns=list(range(0, 8, 2)))
+ ....:
+
+In [63]: df1
+Out[63]:
+ 0 2 4 6
+0 0.149748 -0.732339 0.687738 0.176444
+2 0.403310 -0.154951 0.301624 -2.179861
+4 -1.369849 -0.954208 1.462696 -1.743161
+6 -0.826591 -0.345352 1.314232 0.690579
+8 0.995761 2.396780 0.014871 3.357427
+10 -0.317441 -1.236269 0.896171 -0.487602
+```
+
+通过整数切片选择:
+
+``` python
+In [64]: df1.iloc[:3]
+Out[64]:
+ 0 2 4 6
+0 0.149748 -0.732339 0.687738 0.176444
+2 0.403310 -0.154951 0.301624 -2.179861
+4 -1.369849 -0.954208 1.462696 -1.743161
+
+In [65]: df1.iloc[1:5, 2:4]
+Out[65]:
+ 4 6
+2 0.301624 -2.179861
+4 1.462696 -1.743161
+6 1.314232 0.690579
+8 0.014871 3.357427
+```
+
+通过整数列表选择:
+
+``` python
+In [66]: df1.iloc[[1, 3, 5], [1, 3]]
+Out[66]:
+ 2 6
+2 -0.154951 -2.179861
+6 -0.345352 0.690579
+10 -1.236269 -0.487602
+```
+
+``` python
+In [67]: df1.iloc[1:3, :]
+Out[67]:
+ 0 2 4 6
+2 0.403310 -0.154951 0.301624 -2.179861
+4 -1.369849 -0.954208 1.462696 -1.743161
+```
+
+``` python
+In [68]: df1.iloc[:, 1:3]
+Out[68]:
+ 2 4
+0 -0.732339 0.687738
+2 -0.154951 0.301624
+4 -0.954208 1.462696
+6 -0.345352 1.314232
+8 2.396780 0.014871
+10 -1.236269 0.896171
+```
+
+``` python
+# this is also equivalent to ``df1.iat[1,1]``
+In [69]: df1.iloc[1, 1]
+Out[69]: -0.1549507744249032
+```
+
+使用整数位置(等效``df.xs(1)``)得到横截面:
+
+``` python
+In [70]: df1.iloc[1]
+Out[70]:
+0 0.403310
+2 -0.154951
+4 0.301624
+6 -2.179861
+Name: 2, dtype: float64
+```
+
+超出范围的切片索引正如Python / Numpy中一样优雅地处理。
+
+``` python
+# these are allowed in python/numpy.
+In [71]: x = list('abcdef')
+
+In [72]: x
+Out[72]: ['a', 'b', 'c', 'd', 'e', 'f']
+
+In [73]: x[4:10]
+Out[73]: ['e', 'f']
+
+In [74]: x[8:10]
+Out[74]: []
+
+In [75]: s = pd.Series(x)
+
+In [76]: s
+Out[76]:
+0 a
+1 b
+2 c
+3 d
+4 e
+5 f
+dtype: object
+
+In [77]: s.iloc[4:10]
+Out[77]:
+4 e
+5 f
+dtype: object
+
+In [78]: s.iloc[8:10]
+Out[78]: Series([], dtype: object)
+```
+
+请注意,使用超出边界的切片可能会导致空轴(例如,返回一个空的DataFrame)。
+
+``` python
+In [79]: dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
+
+In [80]: dfl
+Out[80]:
+ A B
+0 -0.082240 -2.182937
+1 0.380396 0.084844
+2 0.432390 1.519970
+3 -0.493662 0.600178
+4 0.274230 0.132885
+
+In [81]: dfl.iloc[:, 2:3]
+Out[81]:
+Empty DataFrame
+Columns: []
+Index: [0, 1, 2, 3, 4]
+
+In [82]: dfl.iloc[:, 1:3]
+Out[82]:
+ B
+0 -2.182937
+1 0.084844
+2 1.519970
+3 0.600178
+4 0.132885
+
+In [83]: dfl.iloc[4:6]
+Out[83]:
+ A B
+4 0.27423 0.132885
+```
+
+一个超出范围的索引器会引发一个``IndexError``。任何元素超出范围的索引器列表都会引发
+ ``IndexError``。
+
+``` python
+>>> dfl.iloc[[4, 5, 6]]
+IndexError: positional indexers are out-of-bounds
+
+>>> dfl.iloc[:, 4]
+IndexError: single positional indexer is out-of-bounds
+```
+
+## 通过可调用选择
+
+*版本0.18.1中的新功能。*
+
+``.loc``, ``.iloc``, and also ``[]`` indexing can accept a ``callable`` as indexer. The ``callable`` must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.
+
+``` python
+In [84]: df1 = pd.DataFrame(np.random.randn(6, 4),
+ ....: index=list('abcdef'),
+ ....: columns=list('ABCD'))
+ ....:
+
+In [85]: df1
+Out[85]:
+ A B C D
+a -0.023688 2.410179 1.450520 0.206053
+b -0.251905 -2.213588 1.063327 1.266143
+c 0.299368 -0.863838 0.408204 -1.048089
+d -0.025747 -0.988387 0.094055 1.262731
+e 1.289997 0.082423 -0.055758 0.536580
+f -0.489682 0.369374 -0.034571 -2.484478
+
+In [86]: df1.loc[lambda df: df.A > 0, :]
+Out[86]:
+ A B C D
+c 0.299368 -0.863838 0.408204 -1.048089
+e 1.289997 0.082423 -0.055758 0.536580
+
+In [87]: df1.loc[:, lambda df: ['A', 'B']]
+Out[87]:
+ A B
+a -0.023688 2.410179
+b -0.251905 -2.213588
+c 0.299368 -0.863838
+d -0.025747 -0.988387
+e 1.289997 0.082423
+f -0.489682 0.369374
+
+In [88]: df1.iloc[:, lambda df: [0, 1]]
+Out[88]:
+ A B
+a -0.023688 2.410179
+b -0.251905 -2.213588
+c 0.299368 -0.863838
+d -0.025747 -0.988387
+e 1.289997 0.082423
+f -0.489682 0.369374
+
+In [89]: df1[lambda df: df.columns[0]]
+Out[89]:
+a -0.023688
+b -0.251905
+c 0.299368
+d -0.025747
+e 1.289997
+f -0.489682
+Name: A, dtype: float64
+```
+
+您可以使用可调用索引``Series``。
+
+``` python
+In [90]: df1.A.loc[lambda s: s > 0]
+Out[90]:
+c 0.299368
+e 1.289997
+Name: A, dtype: float64
+```
+
+使用这些方法/索引器,您可以在不使用临时变量的情况下链接数据选择操作。
+
+``` python
+In [91]: bb = pd.read_csv('data/baseball.csv', index_col='id')
+
+In [92]: (bb.groupby(['year', 'team']).sum()
+ ....: .loc[lambda df: df.r > 100])
+ ....:
+Out[92]:
+ stint g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
+year team
+2007 CIN 6 379 745 101 203 35 2 36 125.0 10.0 1.0 105 127.0 14.0 1.0 1.0 15.0 18.0
+ DET 5 301 1062 162 283 54 4 37 144.0 24.0 7.0 97 176.0 3.0 10.0 4.0 8.0 28.0
+ HOU 4 311 926 109 218 47 6 14 77.0 10.0 4.0 60 212.0 3.0 9.0 16.0 6.0 17.0
+ LAN 11 413 1021 153 293 61 3 36 154.0 7.0 5.0 114 141.0 8.0 9.0 3.0 8.0 29.0
+ NYN 13 622 1854 240 509 101 3 61 243.0 22.0 4.0 174 310.0 24.0 23.0 18.0 15.0 48.0
+ SFN 5 482 1305 198 337 67 6 40 171.0 26.0 7.0 235 188.0 51.0 8.0 16.0 6.0 41.0
+ TEX 2 198 729 115 200 40 4 28 115.0 21.0 4.0 73 140.0 4.0 5.0 2.0 8.0 16.0
+ TOR 4 459 1408 187 378 96 2 58 223.0 4.0 2.0 190 265.0 16.0 12.0 4.0 16.0 38.0
+```
+
+## 不推荐使用IX索引器
+
+::: danger 警告
+
+在0.20.0开始,``.ix``索引器已被弃用,赞成更加严格``.iloc``
+和``.loc``索引。
+
+:::
+
+``.ix``在推断用户想要做的事情上提供了很多魔力。也就是说,``.ix``可以根据索引的数据类型决定按*位置*或通过*标签*进行索引。多年来,这引起了相当多的用户混淆。
+
+建议的索引方法是:
+
+- ``.loc``如果你想*标记*索引。
+- ``.iloc``如果你想要*定位*索引。
+
+``` python
+In [93]: dfd = pd.DataFrame({'A': [1, 2, 3],
+ ....: 'B': [4, 5, 6]},
+ ....: index=list('abc'))
+ ....:
+
+In [94]: dfd
+Out[94]:
+ A B
+a 1 4
+b 2 5
+c 3 6
+```
+
+以前的行为,您希望从“A”列中获取索引中的第0个和第2个元素。
+
+``` python
+In [3]: dfd.ix[[0, 2], 'A']
+Out[3]:
+a 1
+c 3
+Name: A, dtype: int64
+```
+
+用``.loc``。这里我们将从索引中选择适当的索引,然后使用*标签*索引。
+
+``` python
+In [95]: dfd.loc[dfd.index[[0, 2]], 'A']
+Out[95]:
+a 1
+c 3
+Name: A, dtype: int64
+```
+
+这也可以``.iloc``通过在索引器上显式获取位置,并使用
+ *位置*索引来选择事物来表达。
+
+``` python
+In [96]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
+Out[96]:
+a 1
+c 3
+Name: A, dtype: int64
+```
+
+要获得*多个*索引器,请使用``.get_indexer``:
+
+``` python
+In [97]: dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
+Out[97]:
+ A B
+a 1 4
+c 3 6
+```
+
+## 不推荐使用缺少标签的列表进行索引
+
+::: danger 警告
+
+Starting in 0.21.0, using ``.loc`` or ``[]`` with a list containing one or more missing labels is deprecated, in favor of ``.reindex``.
+
+:::
+
+In prior versions, using ``.loc[list-of-labels]`` would work as long as *at least one* of the keys was found (otherwise it would raise a ``KeyError``). This behavior is deprecated and will show a warning message pointing to this section. The recommended alternative is to use ``.reindex()``.
+
+例如。
+
+``` python
+In [98]: s = pd.Series([1, 2, 3])
+
+In [99]: s
+Out[99]:
+0 1
+1 2
+2 3
+dtype: int64
+```
+
+找到所有键的选择保持不变。
+
+``` python
+In [100]: s.loc[[1, 2]]
+Out[100]:
+1 2
+2 3
+dtype: int64
+```
+
+以前的行为
+
+``` python
+In [4]: s.loc[[1, 2, 3]]
+Out[4]:
+1 2.0
+2 3.0
+3 NaN
+dtype: float64
+```
+
+目前的行为
+
+``` python
+In [4]: s.loc[[1, 2, 3]]
+Passing list-likes to .loc with any non-matching elements will raise
+KeyError in the future, you can use .reindex() as an alternative.
+
+See the documentation here:
+http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
+
+Out[4]:
+1 2.0
+2 3.0
+3 NaN
+dtype: float64
+```
+
+### 重新索引
+
+实现选择潜在的未找到元素的惯用方法是通过``.reindex()``。另请参阅[重建索引](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-reindexing)部分。
+
+``` python
+In [101]: s.reindex([1, 2, 3])
+Out[101]:
+1 2.0
+2 3.0
+3 NaN
+dtype: float64
+```
+
+或者,如果您只想选择*有效的*密钥,则以下是惯用且有效的; 保证保留选择的dtype。
+
+``` python
+In [102]: labels = [1, 2, 3]
+
+In [103]: s.loc[s.index.intersection(labels)]
+Out[103]:
+1 2
+2 3
+dtype: int64
+```
+
+Having a duplicated index will raise for a ``.reindex()``:
+
+``` python
+In [104]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
+
+In [105]: labels = ['c', 'd']
+```
+
+``` python
+In [17]: s.reindex(labels)
+ValueError: cannot reindex from a duplicate axis
+```
+
+通常,您可以将所需标签与当前轴相交,然后重新索引。
+
+``` python
+In [106]: s.loc[s.index.intersection(labels)].reindex(labels)
+Out[106]:
+c 3.0
+d NaN
+dtype: float64
+```
+
+但是,如果生成的索引重复,这*仍然会*提高。
+
+``` python
+In [41]: labels = ['a', 'd']
+
+In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
+ValueError: cannot reindex from a duplicate axis
+```
+
+## 选择随机样本
+
+使用该[``sample()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html#pandas.DataFrame.sample)方法随机选择Series或DataFrame中的行或列。默认情况下,该方法将对行进行采样,并接受要返回的特定行数/列数或一小部分行。
+
+``` python
+In [107]: s = pd.Series([0, 1, 2, 3, 4, 5])
+
+# When no arguments are passed, returns 1 row.
+In [108]: s.sample()
+Out[108]:
+4 4
+dtype: int64
+
+# One may specify either a number of rows:
+In [109]: s.sample(n=3)
+Out[109]:
+0 0
+4 4
+1 1
+dtype: int64
+
+# Or a fraction of the rows:
+In [110]: s.sample(frac=0.5)
+Out[110]:
+5 5
+3 3
+1 1
+dtype: int64
+```
+
+默认情况下,``sample``最多会返回每行一次,但也可以使用以下``replace``选项进行替换:
+
+``` python
+In [111]: s = pd.Series([0, 1, 2, 3, 4, 5])
+
+# Without replacement (default):
+In [112]: s.sample(n=6, replace=False)
+Out[112]:
+0 0
+1 1
+5 5
+3 3
+2 2
+4 4
+dtype: int64
+
+# With replacement:
+In [113]: s.sample(n=6, replace=True)
+Out[113]:
+0 0
+4 4
+3 3
+2 2
+4 4
+4 4
+dtype: int64
+```
+
+默认情况下,每行具有相同的选择概率,但如果您希望行具有不同的概率,则可以将``sample``函数采样权重作为
+ ``weights``。这些权重可以是列表,NumPy数组或系列,但它们的长度必须与您采样的对象的长度相同。缺失的值将被视为零的权重,并且不允许使用inf值。如果权重不总和为1,则通过将所有权重除以权重之和来对它们进行重新规范化。例如:
+
+``` python
+In [114]: s = pd.Series([0, 1, 2, 3, 4, 5])
+
+In [115]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
+
+In [116]: s.sample(n=3, weights=example_weights)
+Out[116]:
+5 5
+4 4
+3 3
+dtype: int64
+
+# Weights will be re-normalized automatically
+In [117]: example_weights2 = [0.5, 0, 0, 0, 0, 0]
+
+In [118]: s.sample(n=1, weights=example_weights2)
+Out[118]:
+0 0
+dtype: int64
+```
+
+应用于DataFrame时,只需将列的名称作为字符串传递,就可以使用DataFrame的列作为采样权重(假设您要对行而不是列进行采样)。
+
+``` python
+In [119]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
+ .....: 'weight_column': [0.5, 0.4, 0.1, 0]})
+ .....:
+
+In [120]: df2.sample(n=3, weights='weight_column')
+Out[120]:
+ col1 weight_column
+1 8 0.4
+0 9 0.5
+2 7 0.1
+```
+
+``sample``还允许用户使用``axis``参数对列而不是行进行采样。
+
+``` python
+In [121]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
+
+In [122]: df3.sample(n=1, axis=1)
+Out[122]:
+ col1
+0 1
+1 2
+2 3
+```
+
+最后,还可以``sample``使用``random_state``参数为随机数生成器设置种子,该参数将接受整数(作为种子)或NumPy RandomState对象。
+
+``` python
+In [123]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
+
+# With a given seed, the sample will always draw the same rows.
+In [124]: df4.sample(n=2, random_state=2)
+Out[124]:
+ col1 col2
+2 3 4
+1 2 3
+
+In [125]: df4.sample(n=2, random_state=2)
+Out[125]:
+ col1 col2
+2 3 4
+1 2 3
+```
+
+## Setting with enlargement
+
+The ``.loc/[]`` operations can perform enlargement when setting a non-existent key for that axis.
+
+In the ``Series`` case this is effectively an appending operation.
+
+``` python
+In [126]: se = pd.Series([1, 2, 3])
+
+In [127]: se
+Out[127]:
+0 1
+1 2
+2 3
+dtype: int64
+
+In [128]: se[5] = 5.
+
+In [129]: se
+Out[129]:
+0 1.0
+1 2.0
+2 3.0
+5 5.0
+dtype: float64
+```
+
+A ``DataFrame`` can be enlarged on either axis via ``.loc``.
+
+``` python
+In [130]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
+ .....: columns=['A', 'B'])
+ .....:
+
+In [131]: dfi
+Out[131]:
+ A B
+0 0 1
+1 2 3
+2 4 5
+
+In [132]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']
+
+In [133]: dfi
+Out[133]:
+ A B C
+0 0 1 0
+1 2 3 2
+2 4 5 4
+```
+
+This is like an ``append`` operation on the ``DataFrame``.
+
+``` python
+In [134]: dfi.loc[3] = 5
+
+In [135]: dfi
+Out[135]:
+ A B C
+0 0 1 0
+1 2 3 2
+2 4 5 4
+3 5 5 5
+```
+
+## 快速标量值获取和设置
+
+因为索引``[]``必须处理很多情况(单标签访问,切片,布尔索引等),所以它有一些开销以便弄清楚你要求的是什么。如果您只想访问标量值,最快的方法是使用在所有数据结构上实现的``at``和``iat``方法。
+
+Similarly to ``loc``, ``at`` provides **label** based scalar lookups, while ``iat`` provides **integer** based lookups analogously to ``iloc``.
+
+``` python
+In [136]: s.iat[5]
+Out[136]: 5
+
+In [137]: df.at[dates[5], 'A']
+Out[137]: -0.6736897080883706
+
+In [138]: df.iat[3, 0]
+Out[138]: 0.7215551622443669
+```
+
+您也可以使用这些相同的索引器进行设置。
+
+``` python
+In [139]: df.at[dates[5], 'E'] = 7
+
+In [140]: df.iat[3, 0] = 7
+```
+
+``at`` 如果索引器丢失,可以如上所述放大对象。
+
+``` python
+In [141]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7
+
+In [142]: df
+Out[142]:
+ A B C D E 0
+2000-01-01 0.469112 -0.282863 -1.509059 -1.135632 NaN NaN
+2000-01-02 1.212112 -0.173215 0.119209 -1.044236 NaN NaN
+2000-01-03 -0.861849 -2.104569 -0.494929 1.071804 NaN NaN
+2000-01-04 7.000000 -0.706771 -1.039575 0.271860 NaN NaN
+2000-01-05 -0.424972 0.567020 0.276232 -1.087401 NaN NaN
+2000-01-06 -0.673690 0.113648 -1.478427 0.524988 7.0 NaN
+2000-01-07 0.404705 0.577046 -1.715002 -1.039268 NaN NaN
+2000-01-08 -0.370647 -1.157892 -1.344312 0.844885 NaN NaN
+2000-01-09 NaN NaN NaN NaN NaN 7.0
+```
+
+## 布尔索引
+
+Another common operation is the use of boolean vectors to filter the data. The operators are: ``|`` for ``or``, ``&`` for ``and``, and ``~`` for ``not``. These **must** be grouped by using parentheses, since by default Python will evaluate an expression such as ``df.A > 2 & df.B < 3`` as ``df.A > (2 & df.B) < 3``, while the desired evaluation order is ``(df.A > 2) & (df.B < 3)``.
+
+Using a boolean vector to index a Series works exactly as in a NumPy ndarray:
+
+``` python
+In [143]: s = pd.Series(range(-3, 4))
+
+In [144]: s
+Out[144]:
+0 -3
+1 -2
+2 -1
+3 0
+4 1
+5 2
+6 3
+dtype: int64
+
+In [145]: s[s > 0]
+Out[145]:
+4 1
+5 2
+6 3
+dtype: int64
+
+In [146]: s[(s < -1) | (s > 0.5)]
+Out[146]:
+0 -3
+1 -2
+4 1
+5 2
+6 3
+dtype: int64
+
+In [147]: s[~(s < 0)]
+Out[147]:
+3 0
+4 1
+5 2
+6 3
+dtype: int64
+```
+
+您可以使用与DataFrame索引长度相同的布尔向量从DataFrame中选择行(例如,从DataFrame的其中一列派生的东西):
+
+``` python
+In [148]: df[df['A'] > 0]
+Out[148]:
+ A B C D E 0
+2000-01-01 0.469112 -0.282863 -1.509059 -1.135632 NaN NaN
+2000-01-02 1.212112 -0.173215 0.119209 -1.044236 NaN NaN
+2000-01-04 7.000000 -0.706771 -1.039575 0.271860 NaN NaN
+2000-01-07 0.404705 0.577046 -1.715002 -1.039268 NaN NaN
+```
+
+列表推导和``map``系列方法也可用于产生更复杂的标准:
+
+``` python
+In [149]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
+ .....: 'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
+ .....: 'c': np.random.randn(7)})
+ .....:
+
+# only want 'two' or 'three'
+In [150]: criterion = df2['a'].map(lambda x: x.startswith('t'))
+
+In [151]: df2[criterion]
+Out[151]:
+ a b c
+2 two y 0.041290
+3 three x 0.361719
+4 two y -0.238075
+
+# equivalent but slower
+In [152]: df2[[x.startswith('t') for x in df2['a']]]
+Out[152]:
+ a b c
+2 two y 0.041290
+3 three x 0.361719
+4 two y -0.238075
+
+# Multiple criteria
+In [153]: df2[criterion & (df2['b'] == 'x')]
+Out[153]:
+ a b c
+3 three x 0.361719
+```
+
+随着选择方法[通过标签选择](#indexing-label),[通过位置选择](#indexing-integer)和[高级索引](advanced.html#advanced),你可以沿着使用布尔向量与其他索引表达式中组合选择多个轴。
+
+``` python
+In [154]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']
+Out[154]:
+ b c
+3 x 0.361719
+```
+
+## 使用isin进行索引
+
+Consider the [``isin()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html#pandas.Series.isin) method of ``Series``, which returns a boolean vector that is true wherever the ``Series`` elements exist in the passed list. This allows you to select rows where one or more columns have values you want:
+
+``` python
+In [155]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
+
+In [156]: s
+Out[156]:
+4 0
+3 1
+2 2
+1 3
+0 4
+dtype: int64
+
+In [157]: s.isin([2, 4, 6])
+Out[157]:
+4 False
+3 False
+2 True
+1 False
+0 True
+dtype: bool
+
+In [158]: s[s.isin([2, 4, 6])]
+Out[158]:
+2 2
+0 4
+dtype: int64
+```
+
+``Index``对象可以使用相同的方法,当您不知道哪些搜索标签实际存在时,它们非常有用:
+
+``` python
+In [159]: s[s.index.isin([2, 4, 6])]
+Out[159]:
+4 0
+2 2
+dtype: int64
+
+# compare it to the following
+In [160]: s.reindex([2, 4, 6])
+Out[160]:
+2 2.0
+4 0.0
+6 NaN
+dtype: float64
+```
+
+In addition to that, ``MultiIndex`` allows selecting a separate level to use in the membership check:
+
+``` python
+In [161]: s_mi = pd.Series(np.arange(6),
+ .....: index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
+ .....:
+
+In [162]: s_mi
+Out[162]:
+0 a 0
+ b 1
+ c 2
+1 a 3
+ b 4
+ c 5
+dtype: int64
+
+In [163]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
+Out[163]:
+0 c 2
+1 a 3
+dtype: int64
+
+In [164]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
+Out[164]:
+0 a 0
+ c 2
+1 a 3
+ c 5
+dtype: int64
+```
+
+DataFrame也有一个[``isin()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html#pandas.DataFrame.isin)方法。调用时``isin``,将一组值作为数组或字典传递。如果values是一个数组,则``isin``返回与原始DataFrame形状相同的布尔数据框,并在元素序列中的任何位置使用True。
+
+``` python
+In [165]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
+ .....: 'ids2': ['a', 'n', 'c', 'n']})
+ .....:
+
+In [166]: values = ['a', 'b', 1, 3]
+
+In [167]: df.isin(values)
+Out[167]:
+ vals ids ids2
+0 True True True
+1 False True False
+2 True False False
+3 False False False
+```
+
+Oftentimes you'll want to match certain values with certain columns. Just make values a ``dict`` where the key is the column, and the value is a list of items you want to check for.
+
+``` python
+In [168]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}
+
+In [169]: df.isin(values)
+Out[169]:
+ vals ids ids2
+0 True True False
+1 False True False
+2 True False False
+3 False False False
+```
+
+Combine DataFrame's ``isin`` with the ``any()`` and ``all()`` methods to quickly select subsets of your data that meet a given criteria. To select rows where each column meets its own criterion:
+
+``` python
+In [170]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
+
+In [171]: row_mask = df.isin(values).all(1)
+
+In [172]: df[row_mask]
+Out[172]:
+ vals ids ids2
+0 1 a a
+```
+
+## The ``where()`` Method and Masking
+
+Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the ``where`` method in ``Series`` and ``DataFrame``.
+
+仅返回选定的行:
+
+``` python
+In [173]: s[s > 0]
+Out[173]:
+3 1
+2 2
+1 3
+0 4
+dtype: int64
+```
+
+要返回与原始形状相同的系列:
+
+``` python
+In [174]: s.where(s > 0)
+Out[174]:
+4 NaN
+3 1.0
+2 2.0
+1 3.0
+0 4.0
+dtype: float64
+```
+
+Selecting values from a DataFrame with a boolean criterion now also preserves the input data shape. ``where`` is used under the hood as the implementation. The code below is equivalent to ``df.where(df < 0)``.
+
+``` python
+In [175]: df[df < 0]
+Out[175]:
+ A B C D
+2000-01-01 -2.104139 -1.309525 NaN NaN
+2000-01-02 -0.352480 NaN -1.192319 NaN
+2000-01-03 -0.864883 NaN -0.227870 NaN
+2000-01-04 NaN -1.222082 NaN -1.233203
+2000-01-05 NaN -0.605656 -1.169184 NaN
+2000-01-06 NaN -0.948458 NaN -0.684718
+2000-01-07 -2.670153 -0.114722 NaN -0.048048
+2000-01-08 NaN NaN -0.048788 -0.808838
+```
+
+此外,在返回的副本中,``where``使用可选``other``参数替换条件为False的值。
+
+``` python
+In [176]: df.where(df < 0, -df)
+Out[176]:
+ A B C D
+2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
+2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
+2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
+2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
+2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
+2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
+2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
+2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838
+```
+
+您可能希望根据某些布尔条件设置值。这可以直观地完成,如下所示:
+
+``` python
+In [177]: s2 = s.copy()
+
+In [178]: s2[s2 < 0] = 0
+
+In [179]: s2
+Out[179]:
+4 0
+3 1
+2 2
+1 3
+0 4
+dtype: int64
+
+In [180]: df2 = df.copy()
+
+In [181]: df2[df2 < 0] = 0
+
+In [182]: df2
+Out[182]:
+ A B C D
+2000-01-01 0.000000 0.000000 0.485855 0.245166
+2000-01-02 0.000000 0.390389 0.000000 1.655824
+2000-01-03 0.000000 0.299674 0.000000 0.281059
+2000-01-04 0.846958 0.000000 0.600705 0.000000
+2000-01-05 0.669692 0.000000 0.000000 0.342416
+2000-01-06 0.868584 0.000000 2.297780 0.000000
+2000-01-07 0.000000 0.000000 0.168904 0.000000
+2000-01-08 0.801196 1.392071 0.000000 0.000000
+```
+
+默认情况下,``where``返回数据的修改副本。有一个可选参数,``inplace``以便可以在不创建副本的情况下修改原始数据:
+
+``` python
+In [183]: df_orig = df.copy()
+
+In [184]: df_orig.where(df > 0, -df, inplace=True)
+
+In [185]: df_orig
+Out[185]:
+ A B C D
+2000-01-01 2.104139 1.309525 0.485855 0.245166
+2000-01-02 0.352480 0.390389 1.192319 1.655824
+2000-01-03 0.864883 0.299674 0.227870 0.281059
+2000-01-04 0.846958 1.222082 0.600705 1.233203
+2000-01-05 0.669692 0.605656 1.169184 0.342416
+2000-01-06 0.868584 0.948458 2.297780 0.684718
+2000-01-07 2.670153 0.114722 0.168904 0.048048
+2000-01-08 0.801196 1.392071 0.048788 0.808838
+```
+
+::: tip 注意
+
+The signature for [``DataFrame.where()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html#pandas.DataFrame.where) differs from [``numpy.where()``](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html#numpy.where). Roughly, ``df1.where(m, df2)`` is equivalent to ``np.where(m, df1, df2)``.
+
+``` python
+In [186]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
+Out[186]:
+ A B C D
+2000-01-01 True True True True
+2000-01-02 True True True True
+2000-01-03 True True True True
+2000-01-04 True True True True
+2000-01-05 True True True True
+2000-01-06 True True True True
+2000-01-07 True True True True
+2000-01-08 True True True True
+```
+
+:::
+
+**Alignment**
+
+Furthermore, ``where`` aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting is possible. This is analogous to partial setting via ``.loc`` (but on the contents rather than the axis labels).
+
+``` python
+In [187]: df2 = df.copy()
+
+In [188]: df2[df2[1:4] > 0] = 3
+
+In [189]: df2
+Out[189]:
+ A B C D
+2000-01-01 -2.104139 -1.309525 0.485855 0.245166
+2000-01-02 -0.352480 3.000000 -1.192319 3.000000
+2000-01-03 -0.864883 3.000000 -0.227870 3.000000
+2000-01-04 3.000000 -1.222082 3.000000 -1.233203
+2000-01-05 0.669692 -0.605656 -1.169184 0.342416
+2000-01-06 0.868584 -0.948458 2.297780 -0.684718
+2000-01-07 -2.670153 -0.114722 0.168904 -0.048048
+2000-01-08 0.801196 1.392071 -0.048788 -0.808838
+```
+
+``where`` can also accept ``axis`` and ``level`` parameters to align the input when performing the ``where``.
+
+``` python
+In [190]: df2 = df.copy()
+
+In [191]: df2.where(df2 > 0, df2['A'], axis='index')
+Out[191]:
+ A B C D
+2000-01-01 -2.104139 -2.104139 0.485855 0.245166
+2000-01-02 -0.352480 0.390389 -0.352480 1.655824
+2000-01-03 -0.864883 0.299674 -0.864883 0.281059
+2000-01-04 0.846958 0.846958 0.600705 0.846958
+2000-01-05 0.669692 0.669692 0.669692 0.342416
+2000-01-06 0.868584 0.868584 2.297780 0.868584
+2000-01-07 -2.670153 -2.670153 0.168904 -2.670153
+2000-01-08 0.801196 1.392071 0.801196 0.801196
+```
+
+这相当于(但快于)以下内容。
+
+``` python
+In [192]: df2 = df.copy()
+
+In [193]: df.apply(lambda x, y: x.where(x > 0, y), y=df['A'])
+Out[193]:
+ A B C D
+2000-01-01 -2.104139 -2.104139 0.485855 0.245166
+2000-01-02 -0.352480 0.390389 -0.352480 1.655824
+2000-01-03 -0.864883 0.299674 -0.864883 0.281059
+2000-01-04 0.846958 0.846958 0.600705 0.846958
+2000-01-05 0.669692 0.669692 0.669692 0.342416
+2000-01-06 0.868584 0.868584 2.297780 0.868584
+2000-01-07 -2.670153 -2.670153 0.168904 -2.670153
+2000-01-08 0.801196 1.392071 0.801196 0.801196
+```
+
+*版本0.18.1中的新功能。*
+
+``where`` can accept a callable as condition and ``other`` arguments. The function must take one argument (the calling Series or DataFrame) and return valid output as the condition and ``other`` argument.
+
+``` python
+In [194]: df3 = pd.DataFrame({'A': [1, 2, 3],
+ .....: 'B': [4, 5, 6],
+ .....: 'C': [7, 8, 9]})
+ .....:
+
+In [195]: df3.where(lambda x: x > 4, lambda x: x + 10)
+Out[195]:
+ A B C
+0 11 14 7
+1 12 5 8
+2 13 6 9
+```
+
+### Mask
+
+[``mask()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mask.html#pandas.DataFrame.mask) is the inverse boolean operation of ``where``.
+
+``` python
+In [196]: s.mask(s >= 0)
+Out[196]:
+4 NaN
+3 NaN
+2 NaN
+1 NaN
+0 NaN
+dtype: float64
+
+In [197]: df.mask(df >= 0)
+Out[197]:
+ A B C D
+2000-01-01 -2.104139 -1.309525 NaN NaN
+2000-01-02 -0.352480 NaN -1.192319 NaN
+2000-01-03 -0.864883 NaN -0.227870 NaN
+2000-01-04 NaN -1.222082 NaN -1.233203
+2000-01-05 NaN -0.605656 -1.169184 NaN
+2000-01-06 NaN -0.948458 NaN -0.684718
+2000-01-07 -2.670153 -0.114722 NaN -0.048048
+2000-01-08 NaN NaN -0.048788 -0.808838
+```
+
+## 该``query()``方法
+
+[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame)对象有一个[``query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query)
+允许使用表达式进行选择的方法。
+
+You can get the values of the frame where column ``b`` has values between the values of columns ``a`` and ``c``. For example:
+
+``` python
+In [198]: n = 10
+
+In [199]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
+
+In [200]: df
+Out[200]:
+ a b c
+0 0.438921 0.118680 0.863670
+1 0.138138 0.577363 0.686602
+2 0.595307 0.564592 0.520630
+3 0.913052 0.926075 0.616184
+4 0.078718 0.854477 0.898725
+5 0.076404 0.523211 0.591538
+6 0.792342 0.216974 0.564056
+7 0.397890 0.454131 0.915716
+8 0.074315 0.437913 0.019794
+9 0.559209 0.502065 0.026437
+
+# pure python
+In [201]: df[(df.a < df.b) & (df.b < df.c)]
+Out[201]:
+ a b c
+1 0.138138 0.577363 0.686602
+4 0.078718 0.854477 0.898725
+5 0.076404 0.523211 0.591538
+7 0.397890 0.454131 0.915716
+
+# query
+In [202]: df.query('(a < b) & (b < c)')
+Out[202]:
+ a b c
+1 0.138138 0.577363 0.686602
+4 0.078718 0.854477 0.898725
+5 0.076404 0.523211 0.591538
+7 0.397890 0.454131 0.915716
+```
+
+Do the same thing but fall back on a named index ``a``, if there is no column with that name.
+
+``` python
+In [203]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))
+
+In [204]: df.index.name = 'a'
+
+In [205]: df
+Out[205]:
+ b c
+a
+0 0 4
+1 0 1
+2 3 4
+3 4 3
+4 1 4
+5 0 3
+6 0 1
+7 3 4
+8 2 3
+9 1 1
+
+In [206]: df.query('a < b and b < c')
+Out[206]:
+ b c
+a
+2 3 4
+```
+
+If instead you don't want to or cannot name your index, you can use the name ``index`` in your query expression:
+
+``` python
+In [207]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))
+
+In [208]: df
+Out[208]:
+ b c
+0 3 1
+1 3 0
+2 5 6
+3 5 2
+4 7 4
+5 0 1
+6 2 5
+7 0 1
+8 6 0
+9 7 9
+
+In [209]: df.query('index < b < c')
+Out[209]:
+ b c
+2 5 6
+```
+
+::: tip 注意
+
+如果索引的名称与列名称重叠,则列名称优先。例如,
+
+``` python
+In [210]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})
+
+In [211]: df.index.name = 'a'
+
+In [212]: df.query('a > 2') # uses the column 'a', not the index
+Out[212]:
+ a
+a
+1 3
+3 3
+```
+
+您仍然可以使用特殊标识符'index'在查询表达式中使用索引:
+
+``` python
+In [213]: df.query('index > 2')
+Out[213]:
+ a
+a
+3 3
+4 2
+```
+
+If for some reason you have a column named ``index``, then you can refer to the index as ``ilevel_0`` as well, but at this point you should consider renaming your columns to something less ambiguous.
+
+:::
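+
+For instance, a minimal sketch (the frame below is hypothetical) of falling back to ``ilevel_0``:
+
+``` python
+import pandas as pd
+
+# A column literally named "index"; 'ilevel_0' then unambiguously means
+# level 0 of the frame's index inside the query string.
+dfq = pd.DataFrame({'index': [10, 20, 30]}, index=[3, 2, 1])
+dfq.query('ilevel_0 > 1')
+```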
+
+### ``MultiIndex`` ``query()`` Syntax
+
+You can also use the levels of a ``DataFrame`` with a [``MultiIndex``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex) as if they were columns in the frame:
+
+``` python
+In [214]: n = 10
+
+In [215]: colors = np.random.choice(['red', 'green'], size=n)
+
+In [216]: foods = np.random.choice(['eggs', 'ham'], size=n)
+
+In [217]: colors
+Out[217]:
+array(['red', 'red', 'red', 'green', 'green', 'green', 'green', 'green',
+ 'green', 'green'], dtype='<U5')
+```
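+
+The executed examples for this section appear truncated above; the following is only a reconstruction sketch of the idea (random data, so no output is shown): named index levels can be referred to in the query string just like columns.
+
+``` python
+import numpy as np
+import pandas as pd
+
+n = 10
+colors = np.random.choice(['red', 'green'], size=n)
+foods = np.random.choice(['eggs', 'ham'], size=n)
+index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
+df = pd.DataFrame(np.random.randn(n, 2), index=index)
+
+df.query('color == "red"')                    # filter on one level by name
+df.query('color == "red" and food == "ham"')  # combine levels
+```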
+
+### ``query()``Python与pandas语法比较
+
+完全类似numpy的语法:
+
+``` python
+In [232]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))
+
+In [233]: df
+Out[233]:
+ a b c
+0 7 8 9
+1 1 0 7
+2 2 7 2
+3 6 2 2
+4 2 6 3
+5 3 8 2
+6 1 7 2
+7 5 1 5
+8 9 8 0
+9 1 5 0
+
+In [234]: df.query('(a < b) & (b < c)')
+Out[234]:
+ a b c
+0 7 8 9
+
+In [235]: df[(df.a < df.b) & (df.b < df.c)]
+Out[235]:
+ a b c
+0 7 8 9
+```
+
+Slightly nicer by removing the parentheses (comparison operators bind tighter than ``&`` and ``|``):
+
+``` python
+In [236]: df.query('a < b & b < c')
+Out[236]:
+ a b c
+0 7 8 9
+```
+
+使用英语而不是符号:
+
+``` python
+In [237]: df.query('a < b and b < c')
+Out[237]:
+ a b c
+0 7 8 9
+```
+
+非常接近你如何在纸上写它:
+
+``` python
+In [238]: df.query('a < b < c')
+Out[238]:
+ a b c
+0 7 8 9
+```
+
+### The ``in`` and ``not in`` operators
+
+[``query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) also supports special use of Python's ``in`` and
+``not in`` comparison operators, providing a succinct syntax for calling the
+``isin`` method of a ``Series`` or ``DataFrame``.
+
+``` python
+# get all rows where columns "a" and "b" have overlapping values
+In [239]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
+ .....: 'c': np.random.randint(5, size=12),
+ .....: 'd': np.random.randint(9, size=12)})
+ .....:
+
+In [240]: df
+Out[240]:
+ a b c d
+0 a a 2 6
+1 a a 4 7
+2 b a 1 6
+3 b a 2 1
+4 c b 3 6
+5 c b 0 2
+6 d b 3 3
+7 d b 2 1
+8 e c 4 3
+9 e c 2 0
+10 f c 0 6
+11 f c 1 2
+
+In [241]: df.query('a in b')
+Out[241]:
+ a b c d
+0 a a 2 6
+1 a a 4 7
+2 b a 1 6
+3 b a 2 1
+4 c b 3 6
+5 c b 0 2
+
+# How you'd do it in pure Python
+In [242]: df[df.a.isin(df.b)]
+Out[242]:
+ a b c d
+0 a a 2 6
+1 a a 4 7
+2 b a 1 6
+3 b a 2 1
+4 c b 3 6
+5 c b 0 2
+
+In [243]: df.query('a not in b')
+Out[243]:
+ a b c d
+6 d b 3 3
+7 d b 2 1
+8 e c 4 3
+9 e c 2 0
+10 f c 0 6
+11 f c 1 2
+
+# pure Python
+In [244]: df[~df.a.isin(df.b)]
+Out[244]:
+ a b c d
+6 d b 3 3
+7 d b 2 1
+8 e c 4 3
+9 e c 2 0
+10 f c 0 6
+11 f c 1 2
+```
+
+您可以将此与其他表达式结合使用,以获得非常简洁的查询:
+
+``` python
+# rows where cols a and b have overlapping values
+# and col c's values are less than col d's
+In [245]: df.query('a in b and c < d')
+Out[245]:
+ a b c d
+0 a a 2 6
+1 a a 4 7
+2 b a 1 6
+4 c b 3 6
+5 c b 0 2
+
+# pure Python
+In [246]: df[df.b.isin(df.a) & (df.c < df.d)]
+Out[246]:
+ a b c d
+0 a a 2 6
+1 a a 4 7
+2 b a 1 6
+4 c b 3 6
+5 c b 0 2
+10 f c 0 6
+11 f c 1 2
+```
+
+::: tip 注意
+
+Note that ``in`` and ``not in`` are evaluated in Python, since ``numexpr`` has no equivalent of this operation. However, **only the ``in``/``not in`` expression itself** is evaluated in vanilla Python. For example, in the expression
+
+``` python
+df.query('a in b + c + d')
+```
+
+``(b + c + d)`` is evaluated by ``numexpr`` and *then* the ``in``
+operation is evaluated in plain Python. In general, any operations that can be evaluated using ``numexpr`` will be.
+
+:::
+
+### Special use of the ``==`` operator with ``list`` objects
+
+Comparing a ``list`` of values to a column using ``==``/``!=`` works similarly to ``in``/``not in``.
+
+``` python
+In [247]: df.query('b == ["a", "b", "c"]')
+Out[247]:
+ a b c d
+0 a a 2 6
+1 a a 4 7
+2 b a 1 6
+3 b a 2 1
+4 c b 3 6
+5 c b 0 2
+6 d b 3 3
+7 d b 2 1
+8 e c 4 3
+9 e c 2 0
+10 f c 0 6
+11 f c 1 2
+
+# pure Python
+In [248]: df[df.b.isin(["a", "b", "c"])]
+Out[248]:
+ a b c d
+0 a a 2 6
+1 a a 4 7
+2 b a 1 6
+3 b a 2 1
+4 c b 3 6
+5 c b 0 2
+6 d b 3 3
+7 d b 2 1
+8 e c 4 3
+9 e c 2 0
+10 f c 0 6
+11 f c 1 2
+
+In [249]: df.query('c == [1, 2]')
+Out[249]:
+ a b c d
+0 a a 2 6
+2 b a 1 6
+3 b a 2 1
+7 d b 2 1
+9 e c 2 0
+11 f c 1 2
+
+In [250]: df.query('c != [1, 2]')
+Out[250]:
+ a b c d
+1 a a 4 7
+4 c b 3 6
+5 c b 0 2
+6 d b 3 3
+8 e c 4 3
+10 f c 0 6
+
+# using in/not in
+In [251]: df.query('[1, 2] in c')
+Out[251]:
+ a b c d
+0 a a 2 6
+2 b a 1 6
+3 b a 2 1
+7 d b 2 1
+9 e c 2 0
+11 f c 1 2
+
+In [252]: df.query('[1, 2] not in c')
+Out[252]:
+ a b c d
+1 a a 4 7
+4 c b 3 6
+5 c b 0 2
+6 d b 3 3
+8 e c 4 3
+10 f c 0 6
+
+# pure Python
+In [253]: df[df.c.isin([1, 2])]
+Out[253]:
+ a b c d
+0 a a 2 6
+2 b a 1 6
+3 b a 2 1
+7 d b 2 1
+9 e c 2 0
+11 f c 1 2
+```
+
+### 布尔运算符
+
+您可以使用单词``not``或``~``运算符否定布尔表达式。
+
+``` python
+In [254]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
+
+In [255]: df['bools'] = np.random.rand(len(df)) > 0.5
+
+In [256]: df.query('~bools')
+Out[256]:
+ a b c bools
+2 0.697753 0.212799 0.329209 False
+7 0.275396 0.691034 0.826619 False
+8 0.190649 0.558748 0.262467 False
+
+In [257]: df.query('not bools')
+Out[257]:
+ a b c bools
+2 0.697753 0.212799 0.329209 False
+7 0.275396 0.691034 0.826619 False
+8 0.190649 0.558748 0.262467 False
+
+In [258]: df.query('not bools') == df[~df.bools]
+Out[258]:
+ a b c bools
+2 True True True True
+7 True True True True
+8 True True True True
+```
+
+当然,表达式也可以是任意复杂的:
+
+``` python
+# short query syntax
+In [259]: shorter = df.query('a < b < c and (not bools) or bools > 2')
+
+# equivalent in pure Python
+In [260]: longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]
+
+In [261]: shorter
+Out[261]:
+ a b c bools
+7 0.275396 0.691034 0.826619 False
+
+In [262]: longer
+Out[262]:
+ a b c bools
+7 0.275396 0.691034 0.826619 False
+
+In [263]: shorter == longer
+Out[263]:
+ a b c bools
+7 True True True True
+```
+
+### Performance of ``query()``
+
+``DataFrame.query()`` using ``numexpr`` is slightly faster than Python for large frames.
+
+
+
+::: tip 注意
+
+如果您的框架超过大约200,000行,您将只看到使用``numexpr``引擎的性能优势``DataFrame.query()``。
+
+
+
+:::
+
+This plot was created using a ``DataFrame`` with 3 columns, each containing floating point values generated with ``numpy.random.randn()``.
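+
+The benchmark figure itself is not reproduced here; a rough self-check sketch (timings depend on your machine, and ``numexpr`` must be installed for ``query()`` to use it) could look like:
+
+``` python
+import timeit
+
+import numpy as np
+import pandas as pd
+
+big = pd.DataFrame(np.random.randn(1_000_000, 3), columns=list('abc'))
+
+t_bool = timeit.timeit(lambda: big[(big.a < big.b) & (big.b < big.c)], number=10)
+t_query = timeit.timeit(lambda: big.query('(a < b) & (b < c)'), number=10)
+print(t_bool, t_query)
+```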
+
+## 重复数据
+
+如果要识别和删除DataFrame中的重复行,有两种方法可以提供帮助:``duplicated``和``drop_duplicates``。每个都将用于标识重复行的列作为参数。
+
+- ``duplicated`` 返回一个布尔向量,其长度为行数,表示行是否重复。
+- ``drop_duplicates`` 删除重复的行。
+
+默认情况下,重复集的第一个观察行被认为是唯一的,但每个方法都有一个``keep``参数来指定要保留的目标。
+
+- ``keep='first'`` (默认值):标记/删除重复项,第一次出现除外。
+- ``keep='last'``:标记/删除重复项,除了最后一次出现。
+- ``keep=False``:标记/删除所有重复项。
+
+``` python
+In [264]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
+ .....: 'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
+ .....: 'c': np.random.randn(7)})
+ .....:
+
+In [265]: df2
+Out[265]:
+ a b c
+0 one x -1.067137
+1 one y 0.309500
+2 two x -0.211056
+3 two y -1.842023
+4 two x -0.390820
+5 three x -1.964475
+6 four x 1.298329
+
+In [266]: df2.duplicated('a')
+Out[266]:
+0 False
+1 True
+2 False
+3 True
+4 True
+5 False
+6 False
+dtype: bool
+
+In [267]: df2.duplicated('a', keep='last')
+Out[267]:
+0 True
+1 False
+2 True
+3 True
+4 False
+5 False
+6 False
+dtype: bool
+
+In [268]: df2.duplicated('a', keep=False)
+Out[268]:
+0 True
+1 True
+2 True
+3 True
+4 True
+5 False
+6 False
+dtype: bool
+
+In [269]: df2.drop_duplicates('a')
+Out[269]:
+ a b c
+0 one x -1.067137
+2 two x -0.211056
+5 three x -1.964475
+6 four x 1.298329
+
+In [270]: df2.drop_duplicates('a', keep='last')
+Out[270]:
+ a b c
+1 one y 0.309500
+4 two x -0.390820
+5 three x -1.964475
+6 four x 1.298329
+
+In [271]: df2.drop_duplicates('a', keep=False)
+Out[271]:
+ a b c
+5 three x -1.964475
+6 four x 1.298329
+```
+
+此外,您可以传递列表列表以识别重复。
+
+``` python
+In [272]: df2.duplicated(['a', 'b'])
+Out[272]:
+0 False
+1 False
+2 False
+3 False
+4 True
+5 False
+6 False
+dtype: bool
+
+In [273]: df2.drop_duplicates(['a', 'b'])
+Out[273]:
+ a b c
+0 one x -1.067137
+1 one y 0.309500
+2 two x -0.211056
+3 two y -1.842023
+5 three x -1.964475
+6 four x 1.298329
+```
+
+要按索引值删除重复项,请使用``Index.duplicated``然后执行切片。``keep``参数可以使用相同的选项集。
+
+``` python
+In [274]: df3 = pd.DataFrame({'a': np.arange(6),
+ .....: 'b': np.random.randn(6)},
+ .....: index=['a', 'a', 'b', 'c', 'b', 'a'])
+ .....:
+
+In [275]: df3
+Out[275]:
+ a b
+a 0 1.440455
+a 1 2.456086
+b 2 1.038402
+c 3 -0.894409
+b 4 0.683536
+a 5 3.082764
+
+In [276]: df3.index.duplicated()
+Out[276]: array([False, True, False, False, True, True])
+
+In [277]: df3[~df3.index.duplicated()]
+Out[277]:
+ a b
+a 0 1.440455
+b 2 1.038402
+c 3 -0.894409
+
+In [278]: df3[~df3.index.duplicated(keep='last')]
+Out[278]:
+ a b
+c 3 -0.894409
+b 4 0.683536
+a 5 3.082764
+
+In [279]: df3[~df3.index.duplicated(keep=False)]
+Out[279]:
+ a b
+c 3 -0.894409
+```
+
+## 类字典``get()``方法
+
+Series或DataFrame中的每一个都有一个``get``可以返回默认值的方法。
+
+``` python
+In [280]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
+
+In [281]: s.get('a') # equivalent to s['a']
+Out[281]: 1
+
+In [282]: s.get('x', default=-1)
+Out[282]: -1
+```
+
+## 该``lookup()``方法
+
+有时,您希望在给定一系列行标签和列标签的情况下提取一组值,并且该``lookup``方法允许此操作并返回NumPy数组。例如:
+
+``` python
+In [283]: dflookup = pd.DataFrame(np.random.rand(20, 4), columns = ['A', 'B', 'C', 'D'])
+
+In [284]: dflookup.lookup(list(range(0, 10, 2)), ['B', 'C', 'A', 'B', 'D'])
+Out[284]: array([0.3506, 0.4779, 0.4825, 0.9197, 0.5019])
+```
+
+## 索引对象
+
+The pandas [``Index``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html#pandas.Index) class and its subclasses can be viewed as implementing an *ordered multiset*. Duplicates are allowed. However, if you try to convert an [``Index``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html#pandas.Index) object with duplicate entries into a ``set``, an exception will be raised.
+
+[``Index``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html#pandas.Index)还提供了查找,数据对齐和重建索引所需的基础结构。[``Index``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html#pandas.Index)直接创建的最简单方法
+ 是将一个``list``或其他序列传递给
+ [``Index``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html#pandas.Index):
+
+``` python
+In [285]: index = pd.Index(['e', 'd', 'a', 'b'])
+
+In [286]: index
+Out[286]: Index(['e', 'd', 'a', 'b'], dtype='object')
+
+In [287]: 'd' in index
+Out[287]: True
+```
+
+您还可以传递一个``name``存储在索引中:
+
+``` python
+In [288]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')
+
+In [289]: index.name
+Out[289]: 'something'
+```
+
+名称(如果已设置)将显示在控制台显示中:
+
+``` python
+In [290]: index = pd.Index(list(range(5)), name='rows')
+
+In [291]: columns = pd.Index(['A', 'B', 'C'], name='cols')
+
+In [292]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)
+
+In [293]: df
+Out[293]:
+cols A B C
+rows
+0 1.295989 0.185778 0.436259
+1 0.678101 0.311369 -0.528378
+2 -0.674808 -1.103529 -0.656157
+3 1.889957 2.076651 -1.102192
+4 -1.211795 -0.791746 0.634724
+
+In [294]: df['A']
+Out[294]:
+rows
+0 1.295989
+1 0.678101
+2 -0.674808
+3 1.889957
+4 -1.211795
+Name: A, dtype: float64
+```
+
+### 设置元数据
+
+Indexes are "mostly immutable", but it is possible to set and change their metadata, like the index ``name`` (or, for a ``MultiIndex``, the ``levels`` and
+ ``codes``).
+
+您可以使用``rename``,``set_names``,``set_levels``,和``set_codes``
+直接设置这些属性。他们默认返回一份副本; 但是,您可以指定``inplace=True``使数据更改到位。
+
+有关MultiIndexes的使用,请参阅[高级索引](advanced.html#advanced)。
+
+``` python
+In [295]: ind = pd.Index([1, 2, 3])
+
+In [296]: ind.rename("apple")
+Out[296]: Int64Index([1, 2, 3], dtype='int64', name='apple')
+
+In [297]: ind
+Out[297]: Int64Index([1, 2, 3], dtype='int64')
+
+In [298]: ind.set_names(["apple"], inplace=True)
+
+In [299]: ind.name = "bob"
+
+In [300]: ind
+Out[300]: Int64Index([1, 2, 3], dtype='int64', name='bob')
+```
+
+``set_names``,``set_levels``并且``set_codes``还采用可选
+ ``level``参数
+
+``` python
+In [301]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
+
+In [302]: index
+Out[302]:
+MultiIndex([(0, 'one'),
+ (0, 'two'),
+ (1, 'one'),
+ (1, 'two'),
+ (2, 'one'),
+ (2, 'two')],
+ names=['first', 'second'])
+
+In [303]: index.levels[1]
+Out[303]: Index(['one', 'two'], dtype='object', name='second')
+
+In [304]: index.set_levels(["a", "b"], level=1)
+Out[304]:
+MultiIndex([(0, 'a'),
+ (0, 'b'),
+ (1, 'a'),
+ (1, 'b'),
+ (2, 'a'),
+ (2, 'b')],
+ names=['first', 'second'])
+```
+
+### Set operations on Index objects
+
+The two main operations are ``union (|)`` and ``intersection (&)``. These can be directly called as instance methods or used via overloaded operators. Difference is provided via the ``.difference()`` method.
+
+``` python
+In [305]: a = pd.Index(['c', 'b', 'a'])
+
+In [306]: b = pd.Index(['c', 'e', 'd'])
+
+In [307]: a | b
+Out[307]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
+
+In [308]: a & b
+Out[308]: Index(['c'], dtype='object')
+
+In [309]: a.difference(b)
+Out[309]: Index(['a', 'b'], dtype='object')
+```
+
+Also available is the ``symmetric_difference (^)`` operation, which returns elements that appear in either ``idx1`` or ``idx2``, but not in both. This is equivalent to the Index created by ``idx1.difference(idx2).union(idx2.difference(idx1))``, with duplicates dropped.
+
+``` python
+In [310]: idx1 = pd.Index([1, 2, 3, 4])
+
+In [311]: idx2 = pd.Index([2, 3, 4, 5])
+
+In [312]: idx1.symmetric_difference(idx2)
+Out[312]: Int64Index([1, 5], dtype='int64')
+
+In [313]: idx1 ^ idx2
+Out[313]: Int64Index([1, 5], dtype='int64')
+```
+
+::: tip 注意
+
+来自设置操作的结果索引将按升序排序。
+
+:::
+
+在[``Index.union()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.union.html#pandas.Index.union)具有不同dtypes的索引之间执行时,必须将索引强制转换为公共dtype。通常,虽然并非总是如此,但这是对象dtype。例外是在整数和浮点数据之间执行联合。在这种情况下,整数值将转换为float
+
+``` python
+In [314]: idx1 = pd.Index([0, 1, 2])
+
+In [315]: idx2 = pd.Index([0.5, 1.5])
+
+In [316]: idx1 | idx2
+Out[316]: Float64Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')
+```
+
+### 缺少值
+
+即使``Index``可以保存缺失值(``NaN``),但如果您不想要任何意外结果,也应该避免使用。例如,某些操作会隐式排除缺失值。
+
+``Index.fillna`` 使用指定的标量值填充缺失值。
+
+``` python
+In [317]: idx1 = pd.Index([1, np.nan, 3, 4])
+
+In [318]: idx1
+Out[318]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')
+
+In [319]: idx1.fillna(2)
+Out[319]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')
+
+In [320]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
+ .....: pd.NaT,
+ .....: pd.Timestamp('2011-01-03')])
+ .....:
+
+In [321]: idx2
+Out[321]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)
+
+In [322]: idx2.fillna(pd.Timestamp('2011-01-02'))
+Out[322]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
+```
+
+## 设置/重置索引
+
+有时您会将数据集加载或创建到DataFrame中,并希望在您已经完成之后添加索引。有几种不同的方式。
+
+### 设置索引
+
+DataFrame有一个[``set_index()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index)方法,它采用列名(对于常规``Index``)或列名列表(对于a ``MultiIndex``)。要创建新的重新索引的DataFrame:
+
+``` python
+In [323]: data
+Out[323]:
+ a b c d
+0 bar one z 1.0
+1 bar two y 2.0
+2 foo one x 3.0
+3 foo two w 4.0
+
+In [324]: indexed1 = data.set_index('c')
+
+In [325]: indexed1
+Out[325]:
+ a b d
+c
+z bar one 1.0
+y bar two 2.0
+x foo one 3.0
+w foo two 4.0
+
+In [326]: indexed2 = data.set_index(['a', 'b'])
+
+In [327]: indexed2
+Out[327]:
+ c d
+a b
+bar one z 1.0
+ two y 2.0
+foo one x 3.0
+ two w 4.0
+```
+
+The ``append`` keyword option allows you to keep the existing index and append the given columns to a MultiIndex:
+
+``` python
+In [328]: frame = data.set_index('c', drop=False)
+
+In [329]: frame = frame.set_index(['a', 'b'], append=True)
+
+In [330]: frame
+Out[330]:
+ c d
+c a b
+z bar one z 1.0
+y bar two y 2.0
+x foo one x 3.0
+w foo two w 4.0
+```
+
+其他选项``set_index``允许您不删除索引列或就地添加索引(不创建新对象):
+
+``` python
+In [331]: data.set_index('c', drop=False)
+Out[331]:
+ a b c d
+c
+z bar one z 1.0
+y bar two y 2.0
+x foo one x 3.0
+w foo two w 4.0
+
+In [332]: data.set_index(['a', 'b'], inplace=True)
+
+In [333]: data
+Out[333]:
+ c d
+a b
+bar one z 1.0
+ two y 2.0
+foo one x 3.0
+ two w 4.0
+```
+
+### 重置索引
+
+为方便起见,DataFrame上有一个新函数,它将
+ [``reset_index()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html#pandas.DataFrame.reset_index)索引值传输到DataFrame的列中并设置一个简单的整数索引。这是反向操作[``set_index()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index)。
+
+``` python
+In [334]: data
+Out[334]:
+ c d
+a b
+bar one z 1.0
+ two y 2.0
+foo one x 3.0
+ two w 4.0
+
+In [335]: data.reset_index()
+Out[335]:
+ a b c d
+0 bar one z 1.0
+1 bar two y 2.0
+2 foo one x 3.0
+3 foo two w 4.0
+```
+
+输出更类似于SQL表或记录数组。从索引派生的列的名称是存储在``names``属性中的名称。
+
+您可以使用``level``关键字仅删除索引的一部分:
+
+``` python
+In [336]: frame
+Out[336]:
+ c d
+c a b
+z bar one z 1.0
+y bar two y 2.0
+x foo one x 3.0
+w foo two w 4.0
+
+In [337]: frame.reset_index(level=1)
+Out[337]:
+ a c d
+c b
+z one bar z 1.0
+y two bar y 2.0
+x one foo x 3.0
+w two foo w 4.0
+```
+
+``reset_index`` takes an optional parameter ``drop`` which, if true, simply discards the index instead of putting the index values in the DataFrame's columns.
+
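+Continuing with the frames above, a short sketch of ``drop=True``:
+
+``` python
+# drop=True discards the index instead of moving it into columns
+data.reset_index(drop=True)            # back to a plain RangeIndex
+frame.reset_index(level=1, drop=True)  # drop only the 'a' level
+```
+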
+### 添加ad hoc索引
+
+如果您自己创建索引,则可以将其分配给``index``字段:
+
+``` python
+data.index = index
+```
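+
+For example, a minimal sketch with a hypothetical frame (the assigned index only has to match the length of the axis):
+
+``` python
+import pandas as pd
+
+df_new = pd.DataFrame({'x': [1, 2, 3]})
+df_new.index = pd.Index(['r1', 'r2', 'r3'], name='rows')
+```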
+
+## 返回视图与副本
+
+When setting values in a pandas object, care must be taken to avoid what is called ``chained indexing``. Here is an example.
+
+``` python
+In [338]: dfmi = pd.DataFrame([list('abcd'),
+ .....: list('efgh'),
+ .....: list('ijkl'),
+ .....: list('mnop')],
+ .....: columns=pd.MultiIndex.from_product([['one', 'two'],
+ .....: ['first', 'second']]))
+ .....:
+
+In [339]: dfmi
+Out[339]:
+ one two
+ first second first second
+0 a b c d
+1 e f g h
+2 i j k l
+3 m n o p
+```
+
+比较这两种访问方法:
+
+``` python
+In [340]: dfmi['one']['second']
+Out[340]:
+0 b
+1 f
+2 j
+3 n
+Name: second, dtype: object
+```
+
+``` python
+In [341]: dfmi.loc[:, ('one', 'second')]
+Out[341]:
+0 b
+1 f
+2 j
+3 n
+Name: (one, second), dtype: object
+```
+
+这两者都产生相同的结果,所以你应该使用哪个?理解这些操作的顺序以及为什么方法2(``.loc``)比方法1(链接``[]``)更受欢迎是有益的。
+
+``dfmi['one']``选择列的第一级并返回单索引的DataFrame。然后另一个Python操作``dfmi_with_one['second']``选择索引的系列``'second'``。这由变量指示,``dfmi_with_one``因为pandas将这些操作视为单独的事件。例如,单独调用``__getitem__``,因此它必须将它们视为线性操作,它们一个接一个地发生。
+
+对比这个``df.loc[:,('one','second')]``将一个嵌套的元组传递``(slice(None),('one','second'))``给一个单独的调用
+ ``__getitem__``。这允许pandas将其作为单个实体来处理。此外,这种操作顺序*可以*明显更快,并且如果需要,允许人们对*两个*轴进行索引。
+
+### 使用链式索引时为什么分配失败?
+
+The problem in the previous section is just a performance issue. What's up with the ``SettingWithCopy`` warning? We don't **usually** throw warnings around when you do something that might cost a few extra milliseconds!
+
+But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code:
+
+``` python
+dfmi.loc[:, ('one', 'second')] = value
+# becomes
+dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
+```
+
+但是这个代码的处理方式不同:
+
+``` python
+dfmi['one']['second'] = value
+# becomes
+dfmi.__getitem__('one').__setitem__('second', value)
+```
+
+See that ``__getitem__`` in there? Outside of simple cases, it's very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the ``__setitem__`` will modify ``dfmi`` or a temporary object that gets thrown out immediately afterward. **That's** what ``SettingWithCopy`` is warning you about!
+
+::: tip 注意
+
+您可能想知道我们是否应该关注``loc``
+第一个示例中的属性。但``dfmi.loc``保证``dfmi``
+本身具有修改的索引行为,因此``dfmi.loc.__getitem__``/
+ 直接``dfmi.loc.__setitem__``操作``dfmi``。当然,
+ ``dfmi.loc.__getitem__(idx)``可能是一个视图或副本``dfmi``。
+
+:::
+
+Sometimes a ``SettingWithCopy`` warning will arise when there's no obvious chained indexing going on. **These** are the bugs that ``SettingWithCopy`` is designed to catch! pandas is probably trying to warn you that you've done this:
+
+``` python
+def do_something(df):
+ foo = df[['bar', 'baz']] # Is foo a view? A copy? Nobody knows!
+ # ... many lines here ...
+ # We don't know whether this will modify df or not!
+ foo['quux'] = value
+ return foo
+```
+
+哎呀!
+
+### Evaluation order matters
+
+使用链式索引时,索引操作的顺序和类型会部分确定结果是原始对象的切片还是切片的副本。
+
+pandas has the ``SettingWithCopyWarning`` because assigning to a copy of a slice is frequently not intentional, but a mistake caused by chained indexing returning a copy where a slice was expected.
+
+如果您希望pandas或多或少地信任链接索引表达式的赋值,则可以将[选项](options.html#options)
+设置``mode.chained_assignment``为以下值之一:
+
+- ``'warn'``, the default, means a ``SettingWithCopyWarning`` is printed.
+- ``'raise'`` means pandas will raise a ``SettingWithCopyException``
+ that you have to deal with.
+- ``None`` will suppress the warnings entirely.
+
+``` python
+In [342]: dfb = pd.DataFrame({'a': ['one', 'one', 'two',
+ .....: 'three', 'two', 'one', 'six'],
+ .....: 'c': np.arange(7)})
+ .....:
+
+# This will show the SettingWithCopyWarning
+# but the frame values will be set
+In [343]: dfb['c'][dfb.a.str.startswith('o')] = 42
+```
+
+然而,这是在副本上运行,不起作用。
+
+``` python
+>>> pd.set_option('mode.chained_assignment','warn')
+>>> dfb[dfb.a.str.startswith('o')]['c'] = 42
+Traceback (most recent call last)
+ ...
+SettingWithCopyWarning:
+ A value is trying to be set on a copy of a slice from a DataFrame.
+ Try using .loc[row_index,col_indexer] = value instead
+```
+
+A chained assignment can also crop up in setting in a mixed dtype frame.
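+
+A sketch of that mixed-dtype case (hypothetical frame; the boolean selection returns a copy, so the chained assignment never reaches the original):
+
+``` python
+import pandas as pd
+
+dfm = pd.DataFrame({'a': ['one', 'two', 'three'], 'c': [1, 2, 3]})
+
+dfm[dfm['a'] == 'one']['c'] = 99       # may warn; dfm itself is unchanged
+dfm.loc[dfm['a'] == 'one', 'c'] = 99   # the reliable spelling
+```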
+
+::: tip 注意
+
+这些设置规则适用于所有``.loc/.iloc``。
+
+:::
+
+这是正确的访问方法:
+
+``` python
+In [344]: dfc = pd.DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]})
+
+In [345]: dfc.loc[0, 'A'] = 11
+
+In [346]: dfc
+Out[346]:
+ A B
+0 11 1
+1 bbb 2
+2 ccc 3
+```
+
+这有时*会*起作用,但不能保证,因此应该避免:
+
+``` python
+In [347]: dfc = dfc.copy()
+
+In [348]: dfc['A'][0] = 111
+
+In [349]: dfc
+Out[349]:
+ A B
+0 111 1
+1 bbb 2
+2 ccc 3
+```
+
+这**根本**不起作用,所以应该避免:
+
+``` python
+>>> pd.set_option('mode.chained_assignment','raise')
+>>> dfc.loc[0]['A'] = 1111
+Traceback (most recent call last)
+ ...
+SettingWithCopyException:
+ A value is trying to be set on a copy of a slice from a DataFrame.
+ Try using .loc[row_index,col_indexer] = value instead
+```
+
+::: danger 警告
+
+链式分配警告/异常旨在通知用户可能无效的分配。可能存在误报; 无意中报告链式作业的情况。
+
+:::
diff --git a/Python/pandas/user_guide/integer_na.md b/Python/pandas/user_guide/integer_na.md
new file mode 100644
index 00000000..5e5f4442
--- /dev/null
+++ b/Python/pandas/user_guide/integer_na.md
@@ -0,0 +1,175 @@
+---
+meta:
+ - name: keywords
+ content: Nullable,整型数据类型
+ - name: description
+ content: 在处理丢失的数据部分, 我们知道pandas主要使用 NaN 来代表丢失数据。因为 NaN 属于浮点型数据,这强制有缺失值的整型array强制转换成浮点型。
+---
+
+# Nullable整型数据类型
+
+*在0.24.0版本中新引入*
+
+::: tip Note
+
+IntegerArray is currently experimental. Its API or implementation may change without warning.
+
+:::
+
+In [Working with missing data](missing_data.html#missing-data), we saw that pandas primarily uses ``NaN`` to represent missing data. Because ``NaN`` is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much, but if your integer column is, say, an identifier, casting to float can be problematic. Also, some integers cannot even be represented as floating point numbers.
+
+Pandas can represent integer data with possibly missing values using [``arrays.IntegerArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.arrays.IntegerArray.html#pandas.arrays.IntegerArray). This is an [extension type](https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-extension-types) implemented within pandas. It is not the default dtype for integers, and will not be inferred; you must explicitly specify it via the ``dtype`` argument when constructing an [``array()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.array.html#pandas.array) or [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series).
+
+``` python
+In [1]: arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
+
+In [2]: arr
+Out[2]:
+<IntegerArray>
+[1, 2, NaN]
+Length: 3, dtype: Int64
+```
+
+Or the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate it from NumPy's ``'int64'`` dtype):
+
+``` python
+In [3]: pd.array([1, 2, np.nan], dtype="Int64")
+Out[3]:
+<IntegerArray>
+[1, 2, NaN]
+Length: 3, dtype: Int64
+```
+
+This array can be stored in a [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) or [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) like any NumPy array.
+
+``` python
+In [4]: pd.Series(arr)
+Out[4]:
+0 1
+1 2
+2 NaN
+dtype: Int64
+```
+
+You can also pass list-like data directly to the [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) constructor with the ``dtype`` specified.
+
+``` python
+In [5]: s = pd.Series([1, 2, np.nan], dtype="Int64")
+
+In [6]: s
+Out[6]:
+0 1
+1 2
+2 NaN
+dtype: Int64
+```
+
+By default (if you don't specify ``dtype``), NumPy is used, and you'll end up with a ``float64`` dtype Series:
+
+``` python
+In [7]: pd.Series([1, 2, np.nan])
+Out[7]:
+0 1.0
+1 2.0
+2 NaN
+dtype: float64
+```
+
+Operations involving an integer array behave similarly to NumPy arrays. Missing values are propagated and the integer dtype is preserved, but the data will be coerced to another dtype where necessary.
+
+``` python
+# arithmetic
+In [8]: s + 1
+Out[8]:
+0 2
+1 3
+2 NaN
+dtype: Int64
+
+# comparison
+In [9]: s == 1
+Out[9]:
+0 True
+1 False
+2 False
+dtype: bool
+
+# indexing / slicing
+In [10]: s.iloc[1:3]
+Out[10]:
+1 2
+2 NaN
+dtype: Int64
+
+# operate with other dtypes
+In [11]: s + s.iloc[1:3].astype('Int8')
+Out[11]:
+0 NaN
+1 4
+2 NaN
+dtype: Int64
+
+# coerce when needed
+In [12]: s + 0.01
+Out[12]:
+0 1.01
+1 2.01
+2 NaN
+dtype: float64
+```
+
+These dtypes can operate as part of a ``DataFrame``.
+
+``` python
+In [13]: df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})
+
+In [14]: df
+Out[14]:
+ A B C
+0 1 1 a
+1 2 1 a
+2 NaN 3 b
+
+In [15]: df.dtypes
+Out[15]:
+A Int64
+B int64
+C object
+dtype: object
+```
+
+These dtypes can be merged, reshaped, and cast.
+
+``` python
+In [16]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
+Out[16]:
+A Int64
+B int64
+C object
+dtype: object
+
+In [17]: df['A'].astype(float)
+Out[17]:
+0 1.0
+1 2.0
+2 NaN
+Name: A, dtype: float64
+```
+
+Reductions and groupby operations such as ``sum`` work as well.
+
+``` python
+In [18]: df.sum()
+Out[18]:
+A 3
+B 5
+C aab
+dtype: object
+
+In [19]: df.groupby('B').A.sum()
+Out[19]:
+B
+1 3
+3 0
+Name: A, dtype: Int64
+```
diff --git a/Python/pandas/user_guide/io.md b/Python/pandas/user_guide/io.md
new file mode 100644
index 00000000..55874820
--- /dev/null
+++ b/Python/pandas/user_guide/io.md
@@ -0,0 +1,7123 @@
+# IO tools (text, CSV, HDF5, …)
+
+The pandas I/O API is a set of top level ``read`` functions, such as [``pandas.read_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv), that generally return a pandas object. The corresponding ``write`` functions are object methods such as [``DataFrame.to_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv). Below is a table containing the available ``readers`` and ``writers``.
+
+Format Type | Data Description | Reader | Writer
+---|---|---|---
+text | [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) | [read_csv](#io-read-csv-table) | [to_csv](#io-store-in-csv)
+text | [JSON](https://www.json.org/) | [read_json](#io-json-reader) | [to_json](#io-json-writer)
+text | [HTML](https://en.wikipedia.org/wiki/HTML) | [read_html](#io-read-html) | [to_html](#io-html)
+text | Local clipboard | [read_clipboard](#io-clipboard) | [to_clipboard](#io-clipboard)
+binary | [MS Excel](https://en.wikipedia.org/wiki/Microsoft_Excel) | [read_excel](#io-excel-reader) | [to_excel](#io-excel-writer)
+binary | [OpenDocument](http://www.opendocumentformat.org) | [read_excel](#io-ods) |
+binary | [HDF5 Format](https://support.hdfgroup.org/HDF5/whatishdf5.html) | [read_hdf](#io-hdf5) | [to_hdf](#io-hdf5)
+binary | [Feather Format](https://github.com/wesm/feather) | [read_feather](#io-feather) | [to_feather](#io-feather)
+binary | [Parquet Format](https://parquet.apache.org/) | [read_parquet](#io-parquet) | [to_parquet](#io-parquet)
+binary | [Msgpack](https://msgpack.org/index.html) | [read_msgpack](#io-msgpack) | [to_msgpack](#io-msgpack)
+binary | [Stata](https://en.wikipedia.org/wiki/Stata) | [read_stata](#io-stata-reader) | [to_stata](#io-stata-writer)
+binary | [SAS](https://en.wikipedia.org/wiki/SAS_(software)) | [read_sas](#io-sas-reader) |
+binary | [Python Pickle Format](https://docs.python.org/3/library/pickle.html) | [read_pickle](#io-pickle) | [to_pickle](#io-pickle)
+SQL | [SQL](https://en.wikipedia.org/wiki/SQL) | [read_sql](#io-sql) | [to_sql](#io-sql)
+SQL | [Google Big Query](https://en.wikipedia.org/wiki/BigQuery) | [read_gbq](#io-bigquery) | [to_gbq](#io-bigquery)
+
+[Here](#io-perf) is an informal performance comparison for some of these IO methods.
+
+::: tip Note
+
+For examples that make use of the ``StringIO`` class, import it according to your Python version: ``from StringIO import StringIO`` for Python 2 and ``from io import StringIO`` for Python 3.
+
+:::
+
+## CSV & text files
+
+The workhorse function for reading text files (a.k.a. flat files) is
+[``read_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv). See the [cookbook](cookbook.html#cookbook-csv) for some advanced strategies.
+
+### Parsing options
+
+[``read_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv) accepts the following common arguments:
+
+#### Basic
+
+filepath_or_buffer : *various*
+
+- Either a path to a file (a [``str``](https://docs.python.org/3/library/stdtypes.html#str), [``pathlib.Path``](https://docs.python.org/3/library/pathlib.html#pathlib.Path),
+or ``py._path.local.LocalPath``), a URL (including http, ftp, and S3
+locations), or any object with a ``read()`` method (such as an open file or
+[``StringIO``](https://docs.python.org/3/library/io.html#io.StringIO)).
+
+sep : *str, defaults to ``','`` for [``read_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv), ``\t`` for [``read_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html#pandas.read_table)*
+
+- Delimiter to use. If sep is ``None``, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and the separator will be detected automatically by Python's builtin sniffer tool,
+[``csv.Sniffer``](https://docs.python.org/3/library/csv.html#csv.Sniffer). In addition, separators longer than 1 character and different from ``'\s+'`` will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data (be careful with escape characters). Regex example: ``'\\r\\t'``.
+
+delimiter : *str, default ``None``*
+
+- Alternative argument name for sep.
+
+delim_whitespace : *boolean, default False*
+
+- Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be used
+as the delimiter. Equivalent to setting ``sep='\s+'``.
+If this option is set to ``True``, nothing should be passed in for the
+``delimiter`` parameter (a small sketch follows at the end of this subsection).
+
+*New in version 0.18.1:* support for the Python parser.
+
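+For illustration, here is a minimal sketch of ``delim_whitespace`` on whitespace-separated data (the ``data`` literal is invented for this example):
+
+``` python
+from io import StringIO
+import pandas as pd
+
+data = 'a  b  c\n1  2  3\n4  5  6'   # columns separated by runs of spaces
+
+# equivalent to sep='\s+'
+pd.read_csv(StringIO(data), delim_whitespace=True)
+```
+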
+#### Column and index locations and names
+
+header : *int or list of ints, default ``'infer'``*
+
+- Row number(s) to use as the column names, and the start of the data. The default behavior is to infer the column names: if no names are passed, the behavior is identical to ``header=0`` and column names are inferred from the first line of the file; if column names are passed explicitly, then the behavior is identical to ``header=None``. Explicitly pass ``header=0`` to be able to replace existing names.
+
+- The header can be a list of ints that specify row locations for a MultiIndex on the columns, e.g. ``[0,1,3]``. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0 denotes the first line of data rather than the first line of the file.
+
+names : *array-like, default ``None``*
+
+- List of column names to use. If the file contains no header row, then you should explicitly pass ``header=None``. Duplicates in this list are not allowed.
+
+index_col : *int, str, sequence of int / str, or False, default ``None``*
+
+- Column(s) to use as the row labels of the ``DataFrame``, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.
+
+- Note: ``index_col=False`` can be used to force pandas to *not* use the first column as the index, e.g. when you have a malformed file with a delimiter at the end of each line.
+
+usecols : *list-like or callable, default ``None``*
+
+- Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in *names* or inferred from the document ``header`` row(s). For example, a valid list-like
+*usecols* parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
+
+- Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. To instantiate a DataFrame with the columns in a given order, use ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns in ``['foo', 'bar']`` order, or ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]`` for ``['bar', 'foo']`` order.
+
+- If callable, the callable function will be evaluated against the column names,
+returning names where the callable function evaluates to True:
+
+``` python
+In [1]: from io import StringIO, BytesIO
+
+In [2]: data = ('col1,col2,col3\n'
+ ...: 'a,b,1\n'
+ ...: 'a,b,2\n'
+ ...: 'c,d,3')
+ ...:
+
+In [3]: pd.read_csv(StringIO(data))
+Out[3]:
+ col1 col2 col3
+0 a b 1
+1 a b 2
+2 c d 3
+
+In [4]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])
+Out[4]:
+ col1 col3
+0 a 1
+1 a 2
+2 c 3
+
+```
+
+Using this parameter results in much faster parsing time and lower memory usage.
+
+squeeze : *boolean, default ``False``*
+
+- If the parsed data only contains one column, return a ``Series``.
+
+prefix : *str, default ``None``*
+
+- Prefix to add to column numbers when there is no header, e.g. 'X' for X0, X1, … (see the sketch at the end of this subsection).
+
+mangle_dupe_cols : *boolean, default ``True``*
+
+- Duplicate columns will be renamed as ‘X’, ‘X.1’, …, ‘X.N’ rather than ‘X’, …, ‘X’. Passing ``False`` will cause data to be overwritten if there are duplicate names in the columns.
+
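+A small sketch of the ``prefix`` option described above (the two-row ``data`` literal is invented for this example):
+
+``` python
+from io import StringIO
+import pandas as pd
+
+data = '1,2,3\n4,5,6'   # no header row in the data
+
+# with header=None, prefix='X' produces column names X0, X1, X2
+pd.read_csv(StringIO(data), header=None, prefix='X')
+```
+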
+#### General parsing configuration
+
+dtype : *Type name or dict of column -> type, default ``None``*
+
+- Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32}``
+(unsupported with ``engine='python'``). Use *str* or *object* together with suitable ``na_values`` settings to preserve and not interpret dtype.
+
+- *New in version 0.20.0:* support for the Python parser.
+
+engine : *{``'c'``, ``'python'``}*
+
+- Parser engine to use. The C engine is faster, while the Python engine is currently more feature-complete.
+
+converters : *dict, default ``None``*
+
+- Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
+
+true_values : *list, default ``None``*
+
+- Values to consider as ``True``.
+
+false_values : *list, default ``None``*
+
+- Values to consider as ``False``.
+
+skipinitialspace : *boolean, default ``False``*
+
+- Skip spaces after delimiter.
+
+skiprows : *list-like or integer, default ``None``*
+
+- Line numbers to skip (0-indexed) or number of lines to skip (int) at the start
+of the file.
+
+- If callable, the callable function will be evaluated against the row
+indices, returning True if the row should be skipped and False otherwise:
+
+``` python
+In [5]: data = ('col1,col2,col3\n'
+ ...: 'a,b,1\n'
+ ...: 'a,b,2\n'
+ ...: 'c,d,3')
+ ...:
+
+In [6]: pd.read_csv(StringIO(data))
+Out[6]:
+ col1 col2 col3
+0 a b 1
+1 a b 2
+2 c d 3
+
+In [7]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
+Out[7]:
+ col1 col2 col3
+0 a b 2
+
+```
+
+skipfooter : *int, default ``0``*
+
+- Number of lines at bottom of file to skip (unsupported with engine=’c’).
+
+nrows : *int, default ``None``*
+
+- Number of rows of file to read. Useful for reading pieces of large files.
+
+low_memory : *boolean, default ``True``*
+
+- Internally process the file in chunks, resulting in lower memory use
+while parsing, but possibly mixed type inference. To ensure no mixed
+types either set ``False``, or specify the type with the ``dtype`` parameter.
+Note that the entire file is read into a single ``DataFrame`` regardless,
+use the ``chunksize`` or ``iterator`` parameter to return the data in chunks.
+(Only valid with C parser)
+
+memory_map : *boolean, default False*
+
+- If a filepath is provided for ``filepath_or_buffer``, map the file object
+directly onto memory and access the data directly from there. Using this
+option can improve performance because there is no longer any I/O overhead.
+
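+A short sketch of the ``nrows`` and ``skipfooter`` options described above (the ``data`` literal, including its trailing summary line, is made up for illustration):
+
+``` python
+from io import StringIO
+import pandas as pd
+
+data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9\ntotal,24,36'
+
+# read only the first two data rows
+pd.read_csv(StringIO(data), nrows=2)
+
+# drop the trailing summary line instead (requires the Python engine)
+pd.read_csv(StringIO(data), skipfooter=1, engine='python')
+```
+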
+#### NA and missing data handling
+
+na_values : *scalar, str, list-like, or dict, default ``None``*
+
+- Additional strings to recognize as NA/NaN. If dict passed, specific per-column
+NA values. See [na values const](#io-navaluesconst) below
+for a list of the values interpreted as NaN by default.
+
+keep_default_na : *boolean, default ``True``*
+
+- Whether or not to include the default NaN values when parsing the data.
+Depending on whether *na_values* is passed in, the behavior is as follows:
+ - If *keep_default_na* is ``True``, and *na_values* are specified, *na_values*
+ is appended to the default NaN values used for parsing.
+ - If *keep_default_na* is ``True``, and *na_values* are not specified, only
+ the default NaN values are used for parsing.
+ - If *keep_default_na* is ``False``, and *na_values* are specified, only
+ the NaN values specified *na_values* are used for parsing.
+ - If *keep_default_na* is ``False``, and *na_values* are not specified, no
+ strings will be parsed as NaN.
+
+ Note that if *na_filter* is passed in as ``False``, the *keep_default_na* and *na_values* parameters will be ignored.
+
+na_filter : *boolean, default ``True``*
+
+- Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing ``na_filter=False`` can improve the performance of reading a large file.
+
+verbose : *boolean, default ``False``*
+
+- Indicate number of NA values placed in non-numeric columns.
+
+skip_blank_lines : *boolean, default ``True``*
+
+- If ``True``, skip over blank lines rather than interpreting as NaN values.
+
+#### Datetime handling
+
+parse_dates : *boolean or list of ints or names or list of lists or dict, default ``False``.*
+
+- If ``True`` -> try parsing the index.
+- If ``[1, 2, 3]`` -> try parsing columns 1, 2, 3 each as a separate date
+column.
+- If ``[[1, 3]]`` -> combine columns 1 and 3 and parse as a single date
+column.
+- If ``{'foo': [1, 3]}`` -> parse columns 1, 3 as date and call result ‘foo’.
+A fast-path exists for iso8601-formatted dates.
+
+infer_datetime_format : *boolean, default ``False``*
+
+- If ``True`` and parse_dates is enabled for a column, attempt to infer the datetime format to speed up the processing.
+
+keep_date_col : *boolean, default ``False``*
+
+- If ``True`` and parse_dates specifies combining multiple columns then keep the original columns.
+
+date_parser : *function, default ``None``*
+
+- Function to use for converting a sequence of string columns to an array of
+datetime instances. The default uses ``dateutil.parser.parser`` to do the
+conversion. pandas will try to call date_parser in three different ways,
+advancing to the next if an exception occurs: 1) Pass one or more arrays (as
+defined by parse_dates) as arguments; 2) concatenate (row-wise) the string
+values from the columns defined by parse_dates into a single array and pass
+that; and 3) call date_parser once for each row using one or more strings
+(corresponding to the columns defined by parse_dates) as arguments.
+
+dayfirst : *boolean, default ``False``*
+
+- DD/MM format dates, international and European format.
+
+cache_dates : *boolean, default True*
+
+- If True, use a cache of unique, converted dates to apply the datetime
+conversion. May produce significant speed-up when parsing duplicate
+date strings, especially ones with timezone offsets.
+
+*New in version 0.25.0.*
+
+#### Iteration
+
+iterator : *boolean, default ``False``*
+
+- Return TextFileReader object for iteration or getting chunks with ``get_chunk()``.
+
+chunksize : *int, default ``None``*
+
+- Return TextFileReader object for iteration. See [iterating and chunking](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking) below.
+
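+For illustration, a minimal sketch of both keywords (the four-row ``data`` literal is invented; see the iterating and chunking link above for the full discussion):
+
+``` python
+from io import StringIO
+import pandas as pd
+
+data = 'a,b\n1,2\n3,4\n5,6\n7,8'
+
+# chunksize returns a TextFileReader that yields DataFrames of that many rows
+for chunk in pd.read_csv(StringIO(data), chunksize=2):
+    print(chunk.shape)
+
+# iterator=True returns the same kind of object, pulled manually via get_chunk()
+reader = pd.read_csv(StringIO(data), iterator=True)
+first_two_rows = reader.get_chunk(2)
+```
+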
+#### Quoting, compression, and file format
+
+compression : *{``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``}, default ``'infer'``*
+
+- For on-the-fly decompression of on-disk data. If ‘infer’, then use gzip,
+bz2, zip, or xz if filepath_or_buffer is a string ending in ‘.gz’, ‘.bz2’,
+‘.zip’, or ‘.xz’, respectively, and no decompression otherwise. If using ‘zip’,
+the ZIP file must contain only one data file to be read in.
+Set to ``None`` for no decompression.
+
+*New in version 0.18.1:* support for ‘zip’ and ‘xz’ compression.
+
+*Changed in version 0.24.0:* ‘infer’ option added and set to default.
+
+thousands : *str, default ``None``*
+
+- Thousands separator.
+
+decimal : *str, default ``'.'``*
+
+- Character to recognize as decimal point. E.g. use ',' for European data.
+
+float_precision : *string, default None*
+
+- Specifies which converter the C engine should use for floating-point values.
+The options are ``None`` for the ordinary converter, ``high`` for the
+high-precision converter, and ``round_trip`` for the round-trip converter.
+
+lineterminator : *str (length 1), default ``None``*
+
+- Character to break file into lines. Only valid with C parser.
+
+quotechar : *str (length 1)*
+
+- The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
+
+quoting : *int or ``csv.QUOTE_*`` instance, default ``0``*
+
+- Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of ``QUOTE_MINIMAL`` (0), ``QUOTE_ALL`` (1), ``QUOTE_NONNUMERIC`` (2) or ``QUOTE_NONE`` (3).
+
+doublequote : *boolean, default ``True``*
+
+- When ``quotechar`` is specified and ``quoting`` is not ``QUOTE_NONE``, indicate whether or not to interpret two consecutive ``quotechar`` elements **inside** a field as a single ``quotechar`` element.
+
+escapechar : *str (length 1), default ``None``*
+
+- One-character string used to escape delimiter when quoting is ``QUOTE_NONE``.
+
+comment : *str, default ``None``*
+
+- Indicates remainder of line should not be parsed. If found at the beginning of
+a line, the line will be ignored altogether. This parameter must be a single
+character. Like empty lines (as long as ``skip_blank_lines=True``), fully
+commented lines are ignored by the parameter *header* but not by *skiprows*.
+For example, if ``comment='#'``, parsing ‘#empty\na,b,c\n1,2,3’ with *header=0* will result in ‘a,b,c’ being treated as the header.
+
+encoding : *str, default ``None``*
+
+- Encoding to use for UTF when reading/writing (e.g. ``'utf-8'``). [List of Python standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).
+
+dialect : *str or [``csv.Dialect``](https://docs.python.org/3/library/csv.html#csv.Dialect) instance, default ``None``*
+
+- If provided, this parameter will override values (default or not) for the following parameters: *delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting.* If it is necessary to override values, a ParserWarning will be issued. See [csv.Dialect](https://docs.python.org/3/library/csv.html#csv.Dialect) documentation for more details.
+
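+As a small sketch tying together two of the options above, ``compression`` and ``decimal`` (the file name ``data.csv.gz`` and the ``european`` literal are purely illustrative):
+
+``` python
+from io import StringIO
+import pandas as pd
+
+df = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.25]})
+df.to_csv('data.csv.gz', index=False, compression='gzip')
+
+# 'infer' (the default) would pick gzip from the .gz suffix;
+# compression='gzip' states it explicitly
+pd.read_csv('data.csv.gz', compression='gzip')
+
+# European-style numbers: ';' as the separator, ',' as the decimal point
+european = 'a;b\n1;3,5\n2;4,25'
+pd.read_csv(StringIO(european), sep=';', decimal=',')
+```
+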
+#### Error handling
+
+error_bad_lines : *boolean, default ``True``*
+
+- Lines with too many fields (e.g. a csv line with too many commas) will by
+default cause an exception to be raised, and no ``DataFrame`` will be
+returned. If ``False``, then these “bad lines” will dropped from the
+``DataFrame`` that is returned. See [bad lines](#io-bad-lines) below.
+
+warn_bad_lines : *boolean, default ``True``*
+
+- If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for each “bad line” will be output.
+
+### Specifying column data types
+
+You can indicate the data type for the whole ``DataFrame`` or individual
+columns:
+
+``` python
+In [8]: data = ('a,b,c,d\n'
+ ...: '1,2,3,4\n'
+ ...: '5,6,7,8\n'
+ ...: '9,10,11')
+ ...:
+
+In [9]: print(data)
+a,b,c,d
+1,2,3,4
+5,6,7,8
+9,10,11
+
+In [10]: df = pd.read_csv(StringIO(data), dtype=object)
+
+In [11]: df
+Out[11]:
+ a b c d
+0 1 2 3 4
+1 5 6 7 8
+2 9 10 11 NaN
+
+In [12]: df['a'][0]
+Out[12]: '1'
+
+In [13]: df = pd.read_csv(StringIO(data),
+ ....: dtype={'b': object, 'c': np.float64, 'd': 'Int64'})
+ ....:
+
+In [14]: df.dtypes
+Out[14]:
+a int64
+b object
+c float64
+d Int64
+dtype: object
+
+```
+
+Fortunately, pandas offers more than one way to ensure that your column(s)
+contain only one ``dtype``. If you’re unfamiliar with these concepts, you can
+see [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes) to learn more about dtypes, and
+[here](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-object-conversion) to learn more about ``object`` conversion in
+pandas.
+
+For instance, you can use the ``converters`` argument
+of [``read_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv):
+
+``` python
+In [15]: data = ("col_1\n"
+ ....: "1\n"
+ ....: "2\n"
+ ....: "'A'\n"
+ ....: "4.22")
+ ....:
+
+In [16]: df = pd.read_csv(StringIO(data), converters={'col_1': str})
+
+In [17]: df
+Out[17]:
+ col_1
+0 1
+1 2
+2 'A'
+3 4.22
+
+In [18]: df['col_1'].apply(type).value_counts()
+Out[18]:
+<class 'str'>    4
+Name: col_1, dtype: int64
+
+```
+
+Or you can use the [``to_numeric()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html#pandas.to_numeric) function to coerce the
+dtypes after reading in the data,
+
+``` python
+In [19]: df2 = pd.read_csv(StringIO(data))
+
+In [20]: df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
+
+In [21]: df2
+Out[21]:
+ col_1
+0 1.00
+1 2.00
+2 NaN
+3 4.22
+
+In [22]: df2['col_1'].apply(type).value_counts()
+Out[22]:
+<class 'float'>    4
+Name: col_1, dtype: int64
+
+```
+
+which will convert all valid parsing to floats, leaving the invalid parsing
+as ``NaN``.
+
+Ultimately, how you deal with reading in columns containing mixed dtypes
+depends on your specific needs. In the case above, if you wanted to ``NaN`` out
+the data anomalies, then [``to_numeric()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html#pandas.to_numeric) is probably your best option.
+However, if you wanted for all the data to be coerced, no matter the type, then
+using the ``converters`` argument of [``read_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv) would certainly be
+worth trying.
+
+*New in version 0.20.0:* support for the Python parser.
+
+The ``dtype`` option is supported by the ‘python’ engine.
+
+::: tip Note
+
+In some cases, reading in abnormal data with columns containing mixed dtypes
+will result in an inconsistent dataset. If you rely on pandas to infer the
+dtypes of your columns, the parsing engine will go and infer the dtypes for
+different chunks of the data, rather than the whole dataset at once. Consequently,
+you can end up with column(s) with mixed dtypes. For example,
+
+``` python
+In [23]: col_1 = list(range(500000)) + ['a', 'b'] + list(range(500000))
+
+In [24]: df = pd.DataFrame({'col_1': col_1})
+
+In [25]: df.to_csv('foo.csv')
+
+In [26]: mixed_df = pd.read_csv('foo.csv')
+
+In [27]: mixed_df['col_1'].apply(type).value_counts()
+Out[27]:
+<class 'int'>      737858
+<class 'str'>      262144
+Name: col_1, dtype: int64
+
+In [28]: mixed_df['col_1'].dtype
+Out[28]: dtype('O')
+
+```
+
+will result with *mixed_df* containing an ``int`` dtype for certain chunks
+of the column, and ``str`` for others due to the mixed dtypes from the
+data that was read in. It is important to note that the overall column will be
+marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes.
+
+:::
+
+### Specifying categorical dtype
+
+*New in version 0.19.0.*
+
+``Categorical`` columns can be parsed directly by specifying ``dtype='category'`` or
+``dtype=CategoricalDtype(categories, ordered)``.
+
+``` python
+In [29]: data = ('col1,col2,col3\n'
+ ....: 'a,b,1\n'
+ ....: 'a,b,2\n'
+ ....: 'c,d,3')
+ ....:
+
+In [30]: pd.read_csv(StringIO(data))
+Out[30]:
+ col1 col2 col3
+0 a b 1
+1 a b 2
+2 c d 3
+
+In [31]: pd.read_csv(StringIO(data)).dtypes
+Out[31]:
+col1 object
+col2 object
+col3 int64
+dtype: object
+
+In [32]: pd.read_csv(StringIO(data), dtype='category').dtypes
+Out[32]:
+col1 category
+col2 category
+col3 category
+dtype: object
+
+```
+
+Individual columns can be parsed as a ``Categorical`` using a dict
+specification:
+
+``` python
+In [33]: pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes
+Out[33]:
+col1 category
+col2 object
+col3 int64
+dtype: object
+
+```
+
+*New in version 0.21.0.*
+
+Specifying ``dtype='category'`` will result in an unordered ``Categorical``
+whose ``categories`` are the unique values observed in the data. For more
+control on the categories and order, create a
+``CategoricalDtype`` ahead of time, and pass that for
+that column’s ``dtype``.
+
+``` python
+In [34]: from pandas.api.types import CategoricalDtype
+
+In [35]: dtype = CategoricalDtype(['d', 'c', 'b', 'a'], ordered=True)
+
+In [36]: pd.read_csv(StringIO(data), dtype={'col1': dtype}).dtypes
+Out[36]:
+col1 category
+col2 object
+col3 int64
+dtype: object
+
+```
+
+When using ``dtype=CategoricalDtype``, “unexpected” values outside of
+``dtype.categories`` are treated as missing values.
+
+``` python
+In [37]: dtype = CategoricalDtype(['a', 'b', 'd']) # No 'c'
+
+In [38]: pd.read_csv(StringIO(data), dtype={'col1': dtype}).col1
+Out[38]:
+0 a
+1 a
+2 NaN
+Name: col1, dtype: category
+Categories (3, object): [a, b, d]
+
+```
+
+This matches the behavior of ``Categorical.set_categories()``.
+
+::: tip Note
+
+With ``dtype='category'``, the resulting categories will always be parsed
+as strings (object dtype). If the categories are numeric they can be
+converted using the [``to_numeric()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html#pandas.to_numeric) function, or as appropriate, another
+converter such as [``to_datetime()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime).
+
+When ``dtype`` is a ``CategoricalDtype`` with homogeneous ``categories`` (
+all numeric, all datetimes, etc.), the conversion is done automatically.
+
+``` python
+In [39]: df = pd.read_csv(StringIO(data), dtype='category')
+
+In [40]: df.dtypes
+Out[40]:
+col1 category
+col2 category
+col3 category
+dtype: object
+
+In [41]: df['col3']
+Out[41]:
+0 1
+1 2
+2 3
+Name: col3, dtype: category
+Categories (3, object): [1, 2, 3]
+
+In [42]: df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
+
+In [43]: df['col3']
+Out[43]:
+0 1
+1 2
+2 3
+Name: col3, dtype: category
+Categories (3, int64): [1, 2, 3]
+
+```
+
+:::
+
+### Naming and using columns
+
+#### Handling column names
+
+A file may or may not have a header row. pandas assumes the first row should be
+used as the column names:
+
+``` python
+In [44]: data = ('a,b,c\n'
+ ....: '1,2,3\n'
+ ....: '4,5,6\n'
+ ....: '7,8,9')
+ ....:
+
+In [45]: print(data)
+a,b,c
+1,2,3
+4,5,6
+7,8,9
+
+In [46]: pd.read_csv(StringIO(data))
+Out[46]:
+ a b c
+0 1 2 3
+1 4 5 6
+2 7 8 9
+
+```
+
+By specifying the ``names`` argument in conjunction with ``header`` you can
+indicate other names to use and whether or not to throw away the header row (if
+any):
+
+``` python
+In [47]: print(data)
+a,b,c
+1,2,3
+4,5,6
+7,8,9
+
+In [48]: pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=0)
+Out[48]:
+ foo bar baz
+0 1 2 3
+1 4 5 6
+2 7 8 9
+
+In [49]: pd.read_csv(StringIO(data), names=['foo', 'bar', 'baz'], header=None)
+Out[49]:
+ foo bar baz
+0 a b c
+1 1 2 3
+2 4 5 6
+3 7 8 9
+
+```
+
+If the header is in a row other than the first, pass the row number to
+``header``. This will skip the preceding rows:
+
+``` python
+In [50]: data = ('skip this skip it\n'
+ ....: 'a,b,c\n'
+ ....: '1,2,3\n'
+ ....: '4,5,6\n'
+ ....: '7,8,9')
+ ....:
+
+In [51]: pd.read_csv(StringIO(data), header=1)
+Out[51]:
+ a b c
+0 1 2 3
+1 4 5 6
+2 7 8 9
+
+```
+
+::: tip Note
+
+Default behavior is to infer the column names: if no names are
+passed the behavior is identical to ``header=0`` and column names
+are inferred from the first non-blank line of the file, if column
+names are passed explicitly then the behavior is identical to
+``header=None``.
+
+:::
+
+### Duplicate names parsing
+
+If the file or header contains duplicate names, pandas will by default
+distinguish between them so as to prevent overwriting data:
+
+``` python
+In [52]: data = ('a,b,a\n'
+ ....: '0,1,2\n'
+ ....: '3,4,5')
+ ....:
+
+In [53]: pd.read_csv(StringIO(data))
+Out[53]:
+ a b a.1
+0 0 1 2
+1 3 4 5
+
+```
+
+There is no more duplicate data because ``mangle_dupe_cols=True`` by default,
+which modifies a series of duplicate columns ‘X’, …, ‘X’ to become
+‘X’, ‘X.1’, …, ‘X.N’. If ``mangle_dupe_cols=False``, duplicate data can
+arise:
+
+``` python
+In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
+In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
+Out[3]:
+ a b a
+0 2 1 2
+1 5 4 5
+
+```
+
+To prevent users from encountering this problem with duplicate data, a ``ValueError``
+exception is raised if ``mangle_dupe_cols != True``:
+
+``` python
+In [2]: data = 'a,b,a\n0,1,2\n3,4,5'
+In [3]: pd.read_csv(StringIO(data), mangle_dupe_cols=False)
+...
+ValueError: Setting mangle_dupe_cols=False is not supported yet
+
+```
+
+#### Filtering columns (``usecols``)
+
+The ``usecols`` argument allows you to select any subset of the columns in a
+file, either using the column names, position numbers or a callable:
+
+*New in version 0.20.0:* support for callable *usecols* arguments
+
+``` python
+In [54]: data = 'a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz'
+
+In [55]: pd.read_csv(StringIO(data))
+Out[55]:
+ a b c d
+0 1 2 3 foo
+1 4 5 6 bar
+2 7 8 9 baz
+
+In [56]: pd.read_csv(StringIO(data), usecols=['b', 'd'])
+Out[56]:
+ b d
+0 2 foo
+1 5 bar
+2 8 baz
+
+In [57]: pd.read_csv(StringIO(data), usecols=[0, 2, 3])
+Out[57]:
+ a c d
+0 1 3 foo
+1 4 6 bar
+2 7 9 baz
+
+In [58]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['A', 'C'])
+Out[58]:
+ a c
+0 1 3
+1 4 6
+2 7 9
+
+```
+
+The ``usecols`` argument can also be used to specify which columns not to
+use in the final result:
+
+``` python
+In [59]: pd.read_csv(StringIO(data), usecols=lambda x: x not in ['a', 'c'])
+Out[59]:
+ b d
+0 2 foo
+1 5 bar
+2 8 baz
+
+```
+
+In this case, the callable is specifying that we exclude the “a” and “c”
+columns from the output.
+
+### Comments and empty lines
+
+#### Ignoring line comments and empty lines
+
+If the ``comment`` parameter is specified, then completely commented lines will
+be ignored. By default, completely blank lines will be ignored as well.
+
+``` python
+In [60]: data = ('\n'
+ ....: 'a,b,c\n'
+ ....: ' \n'
+ ....: '# commented line\n'
+ ....: '1,2,3\n'
+ ....: '\n'
+ ....: '4,5,6')
+ ....:
+
+In [61]: print(data)
+
+a,b,c
+
+# commented line
+1,2,3
+
+4,5,6
+
+In [62]: pd.read_csv(StringIO(data), comment='#')
+Out[62]:
+ a b c
+0 1 2 3
+1 4 5 6
+
+```
+
+If ``skip_blank_lines=False``, then ``read_csv`` will not ignore blank lines:
+
+``` python
+In [63]: data = ('a,b,c\n'
+ ....: '\n'
+ ....: '1,2,3\n'
+ ....: '\n'
+ ....: '\n'
+ ....: '4,5,6')
+ ....:
+
+In [64]: pd.read_csv(StringIO(data), skip_blank_lines=False)
+Out[64]:
+ a b c
+0 NaN NaN NaN
+1 1.0 2.0 3.0
+2 NaN NaN NaN
+3 NaN NaN NaN
+4 4.0 5.0 6.0
+
+```
+
+::: danger Warning
+
+The presence of ignored lines might create ambiguities involving line numbers;
+the parameter ``header`` uses row numbers (ignoring commented/empty
+lines), while ``skiprows`` uses line numbers (including commented/empty lines):
+
+``` python
+In [65]: data = ('#comment\n'
+ ....: 'a,b,c\n'
+ ....: 'A,B,C\n'
+ ....: '1,2,3')
+ ....:
+
+In [66]: pd.read_csv(StringIO(data), comment='#', header=1)
+Out[66]:
+ A B C
+0 1 2 3
+
+In [67]: data = ('A,B,C\n'
+ ....: '#comment\n'
+ ....: 'a,b,c\n'
+ ....: '1,2,3')
+ ....:
+
+In [68]: pd.read_csv(StringIO(data), comment='#', skiprows=2)
+Out[68]:
+ a b c
+0 1 2 3
+
+```
+
+If both ``header`` and ``skiprows`` are specified, ``header`` will be
+relative to the end of ``skiprows``. For example:
+
+:::
+
+``` python
+In [69]: data = ('# empty\n'
+ ....: '# second empty line\n'
+ ....: '# third emptyline\n'
+ ....: 'X,Y,Z\n'
+ ....: '1,2,3\n'
+ ....: 'A,B,C\n'
+ ....: '1,2.,4.\n'
+ ....: '5.,NaN,10.0\n')
+ ....:
+
+In [70]: print(data)
+# empty
+# second empty line
+# third emptyline
+X,Y,Z
+1,2,3
+A,B,C
+1,2.,4.
+5.,NaN,10.0
+
+
+In [71]: pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
+Out[71]:
+ A B C
+0 1.0 2.0 4.0
+1 5.0 NaN 10.0
+
+```
+
+#### Comments
+
+Sometimes comments or meta data may be included in a file:
+
+``` python
+In [72]: print(open('tmp.csv').read())
+ID,level,category
+Patient1,123000,x # really unpleasant
+Patient2,23000,y # wouldn't take his medicine
+Patient3,1234018,z # awesome
+
+```
+
+By default, the parser includes the comments in the output:
+
+``` python
+In [73]: df = pd.read_csv('tmp.csv')
+
+In [74]: df
+Out[74]:
+ ID level category
+0 Patient1 123000 x # really unpleasant
+1 Patient2 23000 y # wouldn't take his medicine
+2 Patient3 1234018 z # awesome
+
+```
+
+We can suppress the comments using the ``comment`` keyword:
+
+``` python
+In [75]: df = pd.read_csv('tmp.csv', comment='#')
+
+In [76]: df
+Out[76]:
+ ID level category
+0 Patient1 123000 x
+1 Patient2 23000 y
+2 Patient3 1234018 z
+
+```
+
+### Dealing with Unicode data
+
+The ``encoding`` argument should be used for encoded unicode data, which will
+result in byte strings being decoded to unicode in the result:
+
+``` python
+In [77]: data = (b'word,length\n'
+ ....: b'Tr\xc3\xa4umen,7\n'
+ ....: b'Gr\xc3\xbc\xc3\x9fe,5')
+ ....:
+
+In [78]: data = data.decode('utf8').encode('latin-1')
+
+In [79]: df = pd.read_csv(BytesIO(data), encoding='latin-1')
+
+In [80]: df
+Out[80]:
+ word length
+0 Träumen 7
+1 Grüße 5
+
+In [81]: df['word'][1]
+Out[81]: 'Grüße'
+
+```
+
+Some formats which encode all characters as multiple bytes, like UTF-16, won’t
+parse correctly at all without specifying the encoding. [Full list of Python
+standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).
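+
+As a sketch, re-using the data above but encoded as UTF-16:
+
+``` python
+from io import BytesIO
+import pandas as pd
+
+# Every character is encoded in at least two bytes, so the encoding has to be
+# given explicitly for the file to parse at all.
+raw = 'word,length\nTräumen,7\nGrüße,5'.encode('utf-16')
+pd.read_csv(BytesIO(raw), encoding='utf-16')
+```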
+
+### Index columns and trailing delimiters
+
+If a file has one more column of data than the number of column names, the
+first column will be used as the ``DataFrame``’s row names:
+
+``` python
+In [82]: data = ('a,b,c\n'
+ ....: '4,apple,bat,5.7\n'
+ ....: '8,orange,cow,10')
+ ....:
+
+In [83]: pd.read_csv(StringIO(data))
+Out[83]:
+ a b c
+4 apple bat 5.7
+8 orange cow 10.0
+
+```
+
+``` python
+In [84]: data = ('index,a,b,c\n'
+ ....: '4,apple,bat,5.7\n'
+ ....: '8,orange,cow,10')
+ ....:
+
+In [85]: pd.read_csv(StringIO(data), index_col=0)
+Out[85]:
+ a b c
+index
+4 apple bat 5.7
+8 orange cow 10.0
+
+```
+
+Ordinarily, you can achieve this behavior using the ``index_col`` option.
+
+There are some exception cases when a file has been prepared with delimiters at
+the end of each data line, confusing the parser. To explicitly disable the
+index column inference and discard the last column, pass ``index_col=False``:
+
+``` python
+In [86]: data = ('a,b,c\n'
+ ....: '4,apple,bat,\n'
+ ....: '8,orange,cow,')
+ ....:
+
+In [87]: print(data)
+a,b,c
+4,apple,bat,
+8,orange,cow,
+
+In [88]: pd.read_csv(StringIO(data))
+Out[88]:
+ a b c
+4 apple bat NaN
+8 orange cow NaN
+
+In [89]: pd.read_csv(StringIO(data), index_col=False)
+Out[89]:
+ a b c
+0 4 apple bat
+1 8 orange cow
+
+```
+
+If a subset of data is being parsed using the ``usecols`` option, the
+``index_col`` specification is based on that subset, not the original data.
+
+``` python
+In [90]: data = ('a,b,c\n'
+ ....: '4,apple,bat,\n'
+ ....: '8,orange,cow,')
+ ....:
+
+In [91]: print(data)
+a,b,c
+4,apple,bat,
+8,orange,cow,
+
+In [92]: pd.read_csv(StringIO(data), usecols=['b', 'c'])
+Out[92]:
+ b c
+4 bat NaN
+8 cow NaN
+
+In [93]: pd.read_csv(StringIO(data), usecols=['b', 'c'], index_col=0)
+Out[93]:
+ b c
+4 bat NaN
+8 cow NaN
+
+```
+
+### Date Handling
+
+#### Specifying date columns
+
+To better facilitate working with datetime data, [``read_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv)
+uses the keyword arguments ``parse_dates`` and ``date_parser``
+to allow users to specify a variety of columns and date/time formats to turn the
+input text data into ``datetime`` objects.
+
+The simplest case is to just pass in ``parse_dates=True``:
+
+``` python
+# Use a column as an index, and parse it as dates.
+In [94]: df = pd.read_csv('foo.csv', index_col=0, parse_dates=True)
+
+In [95]: df
+Out[95]:
+ A B C
+date
+2009-01-01 a 1 2
+2009-01-02 b 3 4
+2009-01-03 c 4 5
+
+# These are Python datetime objects
+In [96]: df.index
+Out[96]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', name='date', freq=None)
+
+```
+
+It is often the case that we may want to store date and time data separately,
+or store various date fields separately. the ``parse_dates`` keyword can be
+used to specify a combination of columns to parse the dates and/or times from.
+
+You can specify a list of column lists to ``parse_dates``, the resulting date
+columns will be prepended to the output (so as to not affect the existing column
+order) and the new column names will be the concatenation of the component
+column names:
+
+``` python
+In [97]: print(open('tmp.csv').read())
+KORD,19990127, 19:00:00, 18:56:00, 0.8100
+KORD,19990127, 20:00:00, 19:56:00, 0.0100
+KORD,19990127, 21:00:00, 20:56:00, -0.5900
+KORD,19990127, 21:00:00, 21:18:00, -0.9900
+KORD,19990127, 22:00:00, 21:56:00, -0.5900
+KORD,19990127, 23:00:00, 22:56:00, -0.5900
+
+In [98]: df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]])
+
+In [99]: df
+Out[99]:
+ 1_2 1_3 0 4
+0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
+1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
+2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
+3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
+4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
+5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
+
+```
+
+By default the parser removes the component date columns, but you can choose
+to retain them via the ``keep_date_col`` keyword:
+
+``` python
+In [100]: df = pd.read_csv('tmp.csv', header=None, parse_dates=[[1, 2], [1, 3]],
+ .....: keep_date_col=True)
+ .....:
+
+In [101]: df
+Out[101]:
+ 1_2 1_3 0 1 2 3 4
+0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 19990127 19:00:00 18:56:00 0.81
+1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 19990127 20:00:00 19:56:00 0.01
+2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD 19990127 21:00:00 20:56:00 -0.59
+3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD 19990127 21:00:00 21:18:00 -0.99
+4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD 19990127 22:00:00 21:56:00 -0.59
+5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD 19990127 23:00:00 22:56:00 -0.59
+
+```
+
+Note that if you wish to combine multiple columns into a single date column, a
+nested list must be used. In other words, ``parse_dates=[1, 2]`` indicates that
+the second and third columns should each be parsed as separate date columns
+while ``parse_dates=[[1, 2]]`` means the two columns should be parsed into a
+single column.
+
+You can also use a dict to specify custom name columns:
+
+``` python
+In [102]: date_spec = {'nominal': [1, 2], 'actual': [1, 3]}
+
+In [103]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec)
+
+In [104]: df
+Out[104]:
+ nominal actual 0 4
+0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
+1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
+2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
+3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
+4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
+5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
+
+```
+
+It is important to remember that if multiple text columns are to be parsed into
+a single date column, then a new column is prepended to the data. The *index_col*
+specification is based off of this new set of columns rather than the original
+data columns:
+
+``` python
+In [105]: date_spec = {'nominal': [1, 2], 'actual': [1, 3]}
+
+In [106]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
+ .....: index_col=0) # index is the nominal column
+ .....:
+
+In [107]: df
+Out[107]:
+ actual 0 4
+nominal
+1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
+1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
+1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
+1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
+1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
+1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
+
+```
+
+::: tip Note
+
+If a column or index contains an unparsable date, the entire column or
+index will be returned unaltered as an object data type. For non-standard
+datetime parsing, use [``to_datetime()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime) after ``pd.read_csv``.
+
+:::
+
+::: tip Note
+
+read_csv has a fast_path for parsing datetime strings in iso8601 format,
+e.g “2000-01-01T00:01:02+00:00” and similar variations. If you can arrange
+for your data to store datetimes in this format, load times will be
+significantly faster, ~20x has been observed.
+
+:::
+
+::: tip Note
+
+When passing a dict as the *parse_dates* argument, the order of
+the columns prepended is not guaranteed, because *dict* objects do not impose
+an ordering on their keys. On Python 2.7+ you may use *collections.OrderedDict*
+instead of a regular *dict* if this matters to you. Because of this, when using a
+dict for ‘parse_dates’ in conjunction with the *index_col* argument, it’s best to
+specify *index_col* as a column label rather then as an index on the resulting frame.
+
+:::
+
+#### Date parsing functions
+
+Finally, the parser allows you to specify a custom ``date_parser`` function to
+take full advantage of the flexibility of the date parsing API:
+
+``` python
+In [108]: df = pd.read_csv('tmp.csv', header=None, parse_dates=date_spec,
+ .....: date_parser=pd.io.date_converters.parse_date_time)
+ .....:
+
+In [109]: df
+Out[109]:
+ nominal actual 0 4
+0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
+1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
+2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
+3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
+4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
+5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
+
+```
+
+Pandas will try to call the ``date_parser`` function in three different ways. If
+an exception is raised, the next one is tried:
+
+1. ``date_parser`` is first called with one or more arrays as arguments,
+as defined using *parse_dates* (e.g., ``date_parser(['2013', '2013'], ['1', '2'])``).
+1. If #1 fails, ``date_parser`` is called with all the columns
+concatenated row-wise into a single array (e.g., ``date_parser(['2013 1', '2013 2'])``).
+1. If #2 fails, ``date_parser`` is called once for every row with one or more
+string arguments from the columns indicated with *parse_dates*
+(e.g., ``date_parser('2013', '1')`` for the first row, ``date_parser('2013', '2')``
+for the second, etc.).
+
+Note that performance-wise, you should try these methods of parsing dates in order:
+
+1. Try to infer the format using ``infer_datetime_format=True`` (see section below).
+1. If you know the format, use ``pd.to_datetime()``:
+``date_parser=lambda x: pd.to_datetime(x, format=...)``.
+1. If you have a really non-standard format, use a custom ``date_parser`` function.
+For optimal performance, this should be vectorized, i.e., it should accept arrays
+as arguments.
+
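+A small sketch of option 2 from the list above, with an invented two-row ``data`` literal and an assumed ``'%Y/%m/%d'`` format:
+
+``` python
+from io import StringIO
+import pandas as pd
+
+data = 'date,value\n2011/12/30,1\n2011/12/31,2'
+
+# the format is known, so hand pd.to_datetime an explicit format string
+df = pd.read_csv(StringIO(data), parse_dates=['date'],
+                 date_parser=lambda x: pd.to_datetime(x, format='%Y/%m/%d'))
+df.dtypes
+```
+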
+You can explore the date parsing functionality in
+[date_converters.py](https://github.com/pandas-dev/pandas/blob/master/pandas/io/date_converters.py)
+and add your own. We would love to turn this module into a community supported
+set of date/time parsers. To get you started, ``date_converters.py`` contains
+functions to parse dual date and time columns, year/month/day columns,
+and year/month/day/hour/minute/second columns. It also contains a
+``generic_parser`` function so you can curry it with a function that deals with
+a single date rather than the entire array.
+
+#### Parsing a CSV with mixed timezones
+
+Pandas cannot natively represent a column or index with mixed timezones. If your CSV
+file contains columns with a mixture of timezones, the default result will be
+an object-dtype column with strings, even with ``parse_dates``.
+
+``` python
+In [110]: content = """\
+ .....: a
+ .....: 2000-01-01T00:00:00+05:00
+ .....: 2000-01-01T00:00:00+06:00"""
+ .....:
+
+In [111]: df = pd.read_csv(StringIO(content), parse_dates=['a'])
+
+In [112]: df['a']
+Out[112]:
+0 2000-01-01 00:00:00+05:00
+1 2000-01-01 00:00:00+06:00
+Name: a, dtype: object
+
+```
+
+To parse the mixed-timezone values as a datetime column, pass a partially-applied
+[``to_datetime()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime) with ``utc=True`` as the ``date_parser``.
+
+``` python
+In [113]: df = pd.read_csv(StringIO(content), parse_dates=['a'],
+ .....: date_parser=lambda col: pd.to_datetime(col, utc=True))
+ .....:
+
+In [114]: df['a']
+Out[114]:
+0 1999-12-31 19:00:00+00:00
+1 1999-12-31 18:00:00+00:00
+Name: a, dtype: datetime64[ns, UTC]
+
+```
+
+#### Inferring datetime format
+
+If you have ``parse_dates`` enabled for some or all of your columns, and your
+datetime strings are all formatted the same way, you may get a large speed
+up by setting ``infer_datetime_format=True``. If set, pandas will attempt
+to guess the format of your datetime strings, and then use a faster means
+of parsing the strings. 5-10x parsing speeds have been observed. pandas
+will fallback to the usual parsing if either the format cannot be guessed
+or the format that was guessed cannot properly parse the entire column
+of strings. So in general, ``infer_datetime_format`` should not have any
+negative consequences if enabled.
+
+Here are some examples of datetime strings that can be guessed (All
+representing December 30th, 2011 at 00:00:00):
+
+- “20111230”
+- “2011/12/30”
+- “20111230 00:00:00”
+- “12/30/2011 00:00:00”
+- “30/Dec/2011 00:00:00”
+- “30/December/2011 00:00:00”
+
+Note that ``infer_datetime_format`` is sensitive to ``dayfirst``. With
+``dayfirst=True``, it will guess “01/12/2011” to be December 1st. With
+``dayfirst=False`` (default) it will guess “01/12/2011” to be January 12th.
+
+``` python
+# Try to infer the format for the index column
+In [115]: df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
+ .....: infer_datetime_format=True)
+ .....:
+
+In [116]: df
+Out[116]:
+ A B C
+date
+2009-01-01 a 1 2
+2009-01-02 b 3 4
+2009-01-03 c 4 5
+
+```
+
+#### International date formats
+
+While US date formats tend to be MM/DD/YYYY, many international formats use
+DD/MM/YYYY instead. For convenience, a ``dayfirst`` keyword is provided:
+
+``` python
+In [117]: print(open('tmp.csv').read())
+date,value,cat
+1/6/2000,5,a
+2/6/2000,10,b
+3/6/2000,15,c
+
+In [118]: pd.read_csv('tmp.csv', parse_dates=[0])
+Out[118]:
+ date value cat
+0 2000-01-06 5 a
+1 2000-02-06 10 b
+2 2000-03-06 15 c
+
+In [119]: pd.read_csv('tmp.csv', dayfirst=True, parse_dates=[0])
+Out[119]:
+ date value cat
+0 2000-06-01 5 a
+1 2000-06-02 10 b
+2 2000-06-03 15 c
+
+```
+
+### Specifying method for floating-point conversion
+
+The parameter ``float_precision`` can be specified in order to use
+a specific floating-point converter during parsing with the C engine.
+The options are the ordinary converter, the high-precision converter, and
+the round-trip converter (which is guaranteed to round-trip values after
+writing to a file). For example:
+
+``` python
+In [120]: val = '0.3066101993807095471566981359501369297504425048828125'
+
+In [121]: data = 'a,b,c\n1,2,{0}'.format(val)
+
+In [122]: abs(pd.read_csv(StringIO(data), engine='c',
+ .....: float_precision=None)['c'][0] - float(val))
+ .....:
+Out[122]: 1.1102230246251565e-16
+
+In [123]: abs(pd.read_csv(StringIO(data), engine='c',
+ .....: float_precision='high')['c'][0] - float(val))
+ .....:
+Out[123]: 5.551115123125783e-17
+
+In [124]: abs(pd.read_csv(StringIO(data), engine='c',
+ .....: float_precision='round_trip')['c'][0] - float(val))
+ .....:
+Out[124]: 0.0
+
+```
+
+### Thousand separators
+
+For large numbers that have been written with a thousands separator, you can
+set the ``thousands`` keyword to a string of length 1 so that integers will be parsed
+correctly:
+
+By default, numbers with a thousands separator will be parsed as strings:
+
+``` python
+In [125]: print(open('tmp.csv').read())
+ID|level|category
+Patient1|123,000|x
+Patient2|23,000|y
+Patient3|1,234,018|z
+
+In [126]: df = pd.read_csv('tmp.csv', sep='|')
+
+In [127]: df
+Out[127]:
+ ID level category
+0 Patient1 123,000 x
+1 Patient2 23,000 y
+2 Patient3 1,234,018 z
+
+In [128]: df.level.dtype
+Out[128]: dtype('O')
+
+```
+
+The ``thousands`` keyword allows integers to be parsed correctly:
+
+``` python
+In [129]: print(open('tmp.csv').read())
+ID|level|category
+Patient1|123,000|x
+Patient2|23,000|y
+Patient3|1,234,018|z
+
+In [130]: df = pd.read_csv('tmp.csv', sep='|', thousands=',')
+
+In [131]: df
+Out[131]:
+ ID level category
+0 Patient1 123000 x
+1 Patient2 23000 y
+2 Patient3 1234018 z
+
+In [132]: df.level.dtype
+Out[132]: dtype('int64')
+
+```
+
+### NA values
+
+To control which values are parsed as missing values (which are signified by
+``NaN``), specify a string in ``na_values``. If you specify a list of strings,
+then all values in it are considered to be missing values. If you specify a
+number (a ``float``, like ``5.0`` or an ``integer`` like ``5``), the
+corresponding equivalent values will also imply a missing value (in this case
+effectively ``[5.0, 5]`` are recognized as ``NaN``).
+
+To completely override the default values that are recognized as missing, specify ``keep_default_na=False``.
+
+The default ``NaN`` recognized values are ``['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A',
+'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', '']``.
+
+Let us consider some examples:
+
+``` python
+pd.read_csv('path_to_file.csv', na_values=[5])
+
+```
+
+In the example above ``5`` and ``5.0`` will be recognized as ``NaN``, in
+addition to the defaults. A string will first be interpreted as a numerical
+``5``, then as a ``NaN``.
+
+``` python
+pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=[""])
+
+```
+
+Above, only an empty field will be recognized as ``NaN``.
+
+``` python
+pd.read_csv('path_to_file.csv', keep_default_na=False, na_values=["NA", "0"])
+
+```
+
+Above, both ``NA`` and ``0`` as strings are ``NaN``.
+
+``` python
+pd.read_csv('path_to_file.csv', na_values=["Nope"])
+
+```
+
+The default values, in addition to the string ``"Nope"`` are recognized as
+``NaN``.
+
+### Infinity
+
+``inf`` like values will be parsed as ``np.inf`` (positive infinity), and ``-inf`` as ``-np.inf`` (negative infinity).
+These will ignore the case of the value, meaning ``Inf`` will also be parsed as ``np.inf``.
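+
+A quick sketch (the ``data`` literal is made up for this example):
+
+``` python
+from io import StringIO
+import pandas as pd
+
+data = 'a,b\ninf,1\n-Inf,2\nINF,3'
+
+# all three spellings end up as +/- np.inf in a float64 column
+pd.read_csv(StringIO(data))['a']
+```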
+
+### Returning Series
+
+Using the ``squeeze`` keyword, the parser will return output with a single column
+as a ``Series``:
+
+``` python
+In [133]: print(open('tmp.csv').read())
+level
+Patient1,123000
+Patient2,23000
+Patient3,1234018
+
+In [134]: output = pd.read_csv('tmp.csv', squeeze=True)
+
+In [135]: output
+Out[135]:
+Patient1 123000
+Patient2 23000
+Patient3 1234018
+Name: level, dtype: int64
+
+In [136]: type(output)
+Out[136]: pandas.core.series.Series
+
+```
+
+### Boolean values
+
+The common values ``True``, ``False``, ``TRUE``, and ``FALSE`` are all
+recognized as boolean. Occasionally you might want to recognize other values
+as being boolean. To do this, use the ``true_values`` and ``false_values``
+options as follows:
+
+``` python
+In [137]: data = ('a,b,c\n'
+ .....: '1,Yes,2\n'
+ .....: '3,No,4')
+ .....:
+
+In [138]: print(data)
+a,b,c
+1,Yes,2
+3,No,4
+
+In [139]: pd.read_csv(StringIO(data))
+Out[139]:
+ a b c
+0 1 Yes 2
+1 3 No 4
+
+In [140]: pd.read_csv(StringIO(data), true_values=['Yes'], false_values=['No'])
+Out[140]:
+ a b c
+0 1 True 2
+1 3 False 4
+
+```
+
+### Handling “bad” lines
+
+Some files may have malformed lines with too few fields or too many. Lines with
+too few fields will have NA values filled in the trailing fields. Lines with
+too many fields will raise an error by default:
+
+``` python
+In [141]: data = ('a,b,c\n'
+ .....: '1,2,3\n'
+ .....: '4,5,6,7\n'
+ .....: '8,9,10')
+ .....:
+
+In [142]: pd.read_csv(StringIO(data))
+---------------------------------------------------------------------------
+ParserError Traceback (most recent call last)
+ in
+----> 1 pd.read_csv(StringIO(data))
+
+/pandas/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
+ 683 )
+ 684
+--> 685 return _read(filepath_or_buffer, kwds)
+ 686
+ 687 parser_f.__name__ = name
+
+/pandas/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
+ 461
+ 462 try:
+--> 463 data = parser.read(nrows)
+ 464 finally:
+ 465 parser.close()
+
+/pandas/pandas/io/parsers.py in read(self, nrows)
+ 1152 def read(self, nrows=None):
+ 1153 nrows = _validate_integer("nrows", nrows)
+-> 1154 ret = self._engine.read(nrows)
+ 1155
+ 1156 # May alter columns / col_dict
+
+/pandas/pandas/io/parsers.py in read(self, nrows)
+ 2046 def read(self, nrows=None):
+ 2047 try:
+-> 2048 data = self._reader.read(nrows)
+ 2049 except StopIteration:
+ 2050 if self._first_chunk:
+
+/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
+
+/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
+
+/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
+
+/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
+
+/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
+
+ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
+
+```
+
+You can elect to skip bad lines:
+
+``` python
+In [29]: pd.read_csv(StringIO(data), error_bad_lines=False)
+Skipping line 3: expected 3 fields, saw 4
+
+Out[29]:
+ a b c
+0 1 2 3
+1 8 9 10
+
+```
+
+You can also use the ``usecols`` parameter to eliminate extraneous column
+data that appear in some lines but not others:
+
+``` python
+In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
+
+ Out[30]:
+ a b c
+ 0 1 2 3
+ 1 4 5 6
+ 2 8 9 10
+
+```
+
+### Dialect
+
+The ``dialect`` keyword gives greater flexibility in specifying the file format.
+By default it uses the Excel dialect but you can specify either the dialect name
+or a [``csv.Dialect``](https://docs.python.org/3/library/csv.html#csv.Dialect) instance.
+
+Suppose you had data with unenclosed quotes:
+
+``` python
+In [143]: print(data)
+label1,label2,label3
+index1,"a,c,e
+index2,b,d,f
+
+```
+
+By default, ``read_csv`` uses the Excel dialect and treats the double quote as
+the quote character, which causes it to fail when it finds a newline before it
+finds the closing double quote.
+
+We can get around this using ``dialect``:
+
+``` python
+In [144]: import csv
+
+In [145]: dia = csv.excel()
+
+In [146]: dia.quoting = csv.QUOTE_NONE
+
+In [147]: pd.read_csv(StringIO(data), dialect=dia)
+Out[147]:
+ label1 label2 label3
+index1 "a c e
+index2 b d f
+
+```
+
+All of the dialect options can be specified separately by keyword arguments:
+
+``` python
+In [148]: data = 'a,b,c~1,2,3~4,5,6'
+
+In [149]: pd.read_csv(StringIO(data), lineterminator='~')
+Out[149]:
+ a b c
+0 1 2 3
+1 4 5 6
+
+```
+
+Another common dialect option is ``skipinitialspace``, to skip any whitespace
+after a delimiter:
+
+``` python
+In [150]: data = 'a, b, c\n1, 2, 3\n4, 5, 6'
+
+In [151]: print(data)
+a, b, c
+1, 2, 3
+4, 5, 6
+
+In [152]: pd.read_csv(StringIO(data), skipinitialspace=True)
+Out[152]:
+ a b c
+0 1 2 3
+1 4 5 6
+
+```
+
+The parsers make every attempt to “do the right thing” and not be fragile. Type
+inference is a pretty big deal. If a column can be coerced to integer dtype
+without altering the contents, the parser will do so. Any non-numeric
+columns will come through as object dtype as with the rest of pandas objects.
+
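+For example, a small sketch of that inference (the three-column ``data`` literal is invented):
+
+``` python
+from io import StringIO
+import pandas as pd
+
+data = 'a,b,c\n1,2.5,x\n3,4.0,y'
+
+# column a is coerced to int64, b to float64, and c stays object
+pd.read_csv(StringIO(data)).dtypes
+```
+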
+### Quoting and Escape Characters
+
+Quotes (and other escape characters) in embedded fields can be handled in any
+number of ways. One way is to use backslashes; to properly parse this data, you
+should pass the ``escapechar`` option:
+
+``` python
+In [153]: data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'
+
+In [154]: print(data)
+a,b
+"hello, \"Bob\", nice to see you",5
+
+In [155]: pd.read_csv(StringIO(data), escapechar='\\')
+Out[155]:
+ a b
+0 hello, "Bob", nice to see you 5
+
+```
+
+### Files with fixed width columns
+
+While [``read_csv()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv) reads delimited data, the [``read_fwf()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html#pandas.read_fwf) function works
+with data files that have known and fixed column widths. The function parameters
+to ``read_fwf`` are largely the same as *read_csv* with two extra parameters, and
+a different usage of the ``delimiter`` parameter:
+
+- ``colspecs``: A list of pairs (tuples) giving the extents of the
+fixed-width fields of each line as half-open intervals (i.e., [from, to[ ).
+String value ‘infer’ can be used to instruct the parser to try detecting
+the column specifications from the first 100 rows of the data. Default
+behavior, if not specified, is to infer.
+- ``widths``: A list of field widths which can be used instead of ‘colspecs’
+if the intervals are contiguous.
+- ``delimiter``: Characters to consider as filler characters in the fixed-width file.
+Can be used to specify the filler character of the fields
+if it is not spaces (e.g., ‘~’).
+
+Consider a typical fixed-width data file:
+
+``` python
+In [156]: print(open('bar.csv').read())
+id8141 360.242940 149.910199 11950.7
+id1594 444.953632 166.985655 11788.4
+id1849 364.136849 183.628767 11806.2
+id1230 413.836124 184.375703 11916.8
+id1948 502.953953 173.237159 12468.3
+
+```
+
+In order to parse this file into a ``DataFrame``, we simply need to supply the
+column specifications to the *read_fwf* function along with the file name:
+
+``` python
+# Column specifications are a list of half-intervals
+In [157]: colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
+
+In [158]: df = pd.read_fwf('bar.csv', colspecs=colspecs, header=None, index_col=0)
+
+In [159]: df
+Out[159]:
+ 1 2 3
+0
+id8141 360.242940 149.910199 11950.7
+id1594 444.953632 166.985655 11788.4
+id1849 364.136849 183.628767 11806.2
+id1230 413.836124 184.375703 11916.8
+id1948 502.953953 173.237159 12468.3
+
+```
+
+Note how the parser automatically picks column names ``X.<column number>`` when
+the ``header=None`` argument is specified. Alternatively, you can supply just the
+column widths for contiguous columns:
+
+``` python
+# Widths are a list of integers
+In [160]: widths = [6, 14, 13, 10]
+
+In [161]: df = pd.read_fwf('bar.csv', widths=widths, header=None)
+
+In [162]: df
+Out[162]:
+ 0 1 2 3
+0 id8141 360.242940 149.910199 11950.7
+1 id1594 444.953632 166.985655 11788.4
+2 id1849 364.136849 183.628767 11806.2
+3 id1230 413.836124 184.375703 11916.8
+4 id1948 502.953953 173.237159 12468.3
+
+```
+
+The parser will take care of extra white spaces around the columns
+so it’s ok to have extra separation between the columns in the file.
+
+By default, ``read_fwf`` will try to infer the file’s ``colspecs`` by using the
+first 100 rows of the file. It can do it only in cases when the columns are
+aligned and correctly separated by the provided ``delimiter`` (default delimiter
+is whitespace).
+
+``` python
+In [163]: df = pd.read_fwf('bar.csv', header=None, index_col=0)
+
+In [164]: df
+Out[164]:
+ 1 2 3
+0
+id8141 360.242940 149.910199 11950.7
+id1594 444.953632 166.985655 11788.4
+id1849 364.136849 183.628767 11806.2
+id1230 413.836124 184.375703 11916.8
+id1948 502.953953 173.237159 12468.3
+
+```
+
+*New in version 0.20.0.*
+
+``read_fwf`` supports the ``dtype`` parameter for specifying the types of
+parsed columns to be different from the inferred type.
+
+``` python
+In [165]: pd.read_fwf('bar.csv', header=None, index_col=0).dtypes
+Out[165]:
+1 float64
+2 float64
+3 float64
+dtype: object
+
+In [166]: pd.read_fwf('bar.csv', header=None, dtype={2: 'object'}).dtypes
+Out[166]:
+0 object
+1 float64
+2 object
+3 float64
+dtype: object
+
+```
+
+### Indexes
+
+#### Files with an “implicit” index column
+
+Consider a file with one less entry in the header than the number of data
+columns:
+
+``` python
+In [167]: print(open('foo.csv').read())
+A,B,C
+20090101,a,1,2
+20090102,b,3,4
+20090103,c,4,5
+
+```
+
+In this special case, ``read_csv`` assumes that the first column is to be used
+as the index of the ``DataFrame``:
+
+``` python
+In [168]: pd.read_csv('foo.csv')
+Out[168]:
+ A B C
+20090101 a 1 2
+20090102 b 3 4
+20090103 c 4 5
+
+```
+
+Note that the dates weren’t automatically parsed. In that case you would need
+to do as before:
+
+``` python
+In [169]: df = pd.read_csv('foo.csv', parse_dates=True)
+
+In [170]: df.index
+Out[170]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', freq=None)
+
+```
+
+#### Reading an index with a ``MultiIndex``
+
+Suppose you have data indexed by two columns:
+
+``` python
+In [171]: print(open('data/mindex_ex.csv').read())
+year,indiv,zit,xit
+1977,"A",1.2,.6
+1977,"B",1.5,.5
+1977,"C",1.7,.8
+1978,"A",.2,.06
+1978,"B",.7,.2
+1978,"C",.8,.3
+1978,"D",.9,.5
+1978,"E",1.4,.9
+1979,"C",.2,.15
+1979,"D",.14,.05
+1979,"E",.5,.15
+1979,"F",1.2,.5
+1979,"G",3.4,1.9
+1979,"H",5.4,2.7
+1979,"I",6.4,1.2
+
+```
+
+The ``index_col`` argument to ``read_csv`` can take a list of
+column numbers to turn multiple columns into a ``MultiIndex`` for the index of the
+returned object:
+
+``` python
+In [172]: df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])
+
+In [173]: df
+Out[173]:
+ zit xit
+year indiv
+1977 A 1.20 0.60
+ B 1.50 0.50
+ C 1.70 0.80
+1978 A 0.20 0.06
+ B 0.70 0.20
+ C 0.80 0.30
+ D 0.90 0.50
+ E 1.40 0.90
+1979 C 0.20 0.15
+ D 0.14 0.05
+ E 0.50 0.15
+ F 1.20 0.50
+ G 3.40 1.90
+ H 5.40 2.70
+ I 6.40 1.20
+
+In [174]: df.loc[1978]
+Out[174]:
+ zit xit
+indiv
+A 0.2 0.06
+B 0.7 0.20
+C 0.8 0.30
+D 0.9 0.50
+E 1.4 0.90
+
+```
+
+#### Reading columns with a ``MultiIndex``
+
+By specifying a list of row locations for the ``header`` argument, you
+can read in a ``MultiIndex`` for the columns. Specifying non-consecutive
+rows will skip the intervening rows.
+
+``` python
+In [175]: from pandas.util.testing import makeCustomDataframe as mkdf
+
+In [176]: df = mkdf(5, 3, r_idx_nlevels=2, c_idx_nlevels=4)
+
+In [177]: df.to_csv('mi.csv')
+
+In [178]: print(open('mi.csv').read())
+C0,,C_l0_g0,C_l0_g1,C_l0_g2
+C1,,C_l1_g0,C_l1_g1,C_l1_g2
+C2,,C_l2_g0,C_l2_g1,C_l2_g2
+C3,,C_l3_g0,C_l3_g1,C_l3_g2
+R0,R1,,,
+R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
+R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
+R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
+R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
+R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2
+
+
+In [179]: pd.read_csv('mi.csv', header=[0, 1, 2, 3], index_col=[0, 1])
+Out[179]:
+C0 C_l0_g0 C_l0_g1 C_l0_g2
+C1 C_l1_g0 C_l1_g1 C_l1_g2
+C2 C_l2_g0 C_l2_g1 C_l2_g2
+C3 C_l3_g0 C_l3_g1 C_l3_g2
+R0 R1
+R_l0_g0 R_l1_g0 R0C0 R0C1 R0C2
+R_l0_g1 R_l1_g1 R1C0 R1C1 R1C2
+R_l0_g2 R_l1_g2 R2C0 R2C1 R2C2
+R_l0_g3 R_l1_g3 R3C0 R3C1 R3C2
+R_l0_g4 R_l1_g4 R4C0 R4C1 R4C2
+
+```
+
+``read_csv`` is also able to interpret a more common format
+of multi-column indices.
+
+``` python
+In [180]: print(open('mi2.csv').read())
+,a,a,a,b,c,c
+,q,r,s,t,u,v
+one,1,2,3,4,5,6
+two,7,8,9,10,11,12
+
+In [181]: pd.read_csv('mi2.csv', header=[0, 1], index_col=0)
+Out[181]:
+ a b c
+ q r s t u v
+one 1 2 3 4 5 6
+two 7 8 9 10 11 12
+
+```
+
+Note: If an ``index_col`` is not specified (e.g. you don't have an index, or wrote it
+with ``df.to_csv(..., index=False)``), then any ``names`` on the columns index will be lost.
+
+### Automatically “sniffing” the delimiter
+
+``read_csv`` is capable of inferring delimited (not necessarily
+comma-separated) files, as pandas uses the [``csv.Sniffer``](https://docs.python.org/3/library/csv.html#csv.Sniffer)
+class of the csv module. For this, you have to specify ``sep=None``.
+
+``` python
+In [182]: print(open('tmp2.sv').read())
+:0:1:2:3
+0:0.4691122999071863:-0.2828633443286633:-1.5090585031735124:-1.1356323710171934
+1:1.2121120250208506:-0.17321464905330858:0.11920871129693428:-1.0442359662799567
+2:-0.8618489633477999:-2.1045692188948086:-0.4949292740687813:1.071803807037338
+3:0.7215551622443669:-0.7067711336300845:-1.0395749851146963:0.27185988554282986
+4:-0.42497232978883753:0.567020349793672:0.27623201927771873:-1.0874006912859915
+5:-0.6736897080883706:0.1136484096888855:-1.4784265524372235:0.5249876671147047
+6:0.4047052186802365:0.5770459859204836:-1.7150020161146375:-1.0392684835147725
+7:-0.3706468582364464:-1.1578922506419993:-1.344311812731667:0.8448851414248841
+8:1.0757697837155533:-0.10904997528022223:1.6435630703622064:-1.4693879595399115
+9:0.35702056413309086:-0.6746001037299882:-1.776903716971867:-0.9689138124473498
+
+
+In [183]: pd.read_csv('tmp2.sv', sep=None, engine='python')
+Out[183]:
+ Unnamed: 0 0 1 2 3
+0 0 0.469112 -0.282863 -1.509059 -1.135632
+1 1 1.212112 -0.173215 0.119209 -1.044236
+2 2 -0.861849 -2.104569 -0.494929 1.071804
+3 3 0.721555 -0.706771 -1.039575 0.271860
+4 4 -0.424972 0.567020 0.276232 -1.087401
+5 5 -0.673690 0.113648 -1.478427 0.524988
+6 6 0.404705 0.577046 -1.715002 -1.039268
+7 7 -0.370647 -1.157892 -1.344312 0.844885
+8 8 1.075770 -0.109050 1.643563 -1.469388
+9 9 0.357021 -0.674600 -1.776904 -0.968914
+
+```
+
+### Reading multiple files to create a single DataFrame
+
+It’s best to use [``concat()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat) to combine multiple files.
+See the [cookbook](cookbook.html#cookbook-csv-multiple-files) for an example.
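+
+As a minimal sketch (the ``data_*.csv`` glob pattern and the file contents are placeholders, not part of the original example), one hypothetical way to do this is:
+
+``` python
+import glob
+
+import pandas as pd
+
+# Collect the files to combine; the naming pattern here is hypothetical.
+files = sorted(glob.glob('data_*.csv'))
+
+# Read each file and stack the resulting DataFrames into a single one.
+combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
+
+```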
+
+### Iterating through files chunk by chunk
+
+Suppose you wish to iterate through a (potentially very large) file lazily
+rather than reading the entire file into memory, such as the following:
+
+``` python
+In [184]: print(open('tmp.sv').read())
+|0|1|2|3
+0|0.4691122999071863|-0.2828633443286633|-1.5090585031735124|-1.1356323710171934
+1|1.2121120250208506|-0.17321464905330858|0.11920871129693428|-1.0442359662799567
+2|-0.8618489633477999|-2.1045692188948086|-0.4949292740687813|1.071803807037338
+3|0.7215551622443669|-0.7067711336300845|-1.0395749851146963|0.27185988554282986
+4|-0.42497232978883753|0.567020349793672|0.27623201927771873|-1.0874006912859915
+5|-0.6736897080883706|0.1136484096888855|-1.4784265524372235|0.5249876671147047
+6|0.4047052186802365|0.5770459859204836|-1.7150020161146375|-1.0392684835147725
+7|-0.3706468582364464|-1.1578922506419993|-1.344311812731667|0.8448851414248841
+8|1.0757697837155533|-0.10904997528022223|1.6435630703622064|-1.4693879595399115
+9|0.35702056413309086|-0.6746001037299882|-1.776903716971867|-0.9689138124473498
+
+
+In [185]: table = pd.read_csv('tmp.sv', sep='|')
+
+In [186]: table
+Out[186]:
+ Unnamed: 0 0 1 2 3
+0 0 0.469112 -0.282863 -1.509059 -1.135632
+1 1 1.212112 -0.173215 0.119209 -1.044236
+2 2 -0.861849 -2.104569 -0.494929 1.071804
+3 3 0.721555 -0.706771 -1.039575 0.271860
+4 4 -0.424972 0.567020 0.276232 -1.087401
+5 5 -0.673690 0.113648 -1.478427 0.524988
+6 6 0.404705 0.577046 -1.715002 -1.039268
+7 7 -0.370647 -1.157892 -1.344312 0.844885
+8 8 1.075770 -0.109050 1.643563 -1.469388
+9 9 0.357021 -0.674600 -1.776904 -0.968914
+
+```
+
+By specifying a ``chunksize`` to ``read_csv``, the return
+value will be an iterable object of type ``TextFileReader``:
+
+``` python
+In [187]: reader = pd.read_csv('tmp.sv', sep='|', chunksize=4)
+
+In [188]: reader
+Out[188]:
+
+In [189]: for chunk in reader:
+ .....: print(chunk)
+ .....:
+ Unnamed: 0 0 1 2 3
+0 0 0.469112 -0.282863 -1.509059 -1.135632
+1 1 1.212112 -0.173215 0.119209 -1.044236
+2 2 -0.861849 -2.104569 -0.494929 1.071804
+3 3 0.721555 -0.706771 -1.039575 0.271860
+ Unnamed: 0 0 1 2 3
+4 4 -0.424972 0.567020 0.276232 -1.087401
+5 5 -0.673690 0.113648 -1.478427 0.524988
+6 6 0.404705 0.577046 -1.715002 -1.039268
+7 7 -0.370647 -1.157892 -1.344312 0.844885
+ Unnamed: 0 0 1 2 3
+8 8 1.075770 -0.10905 1.643563 -1.469388
+9 9 0.357021 -0.67460 -1.776904 -0.968914
+
+```
+
+Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
+
+``` python
+In [190]: reader = pd.read_csv('tmp.sv', sep='|', iterator=True)
+
+In [191]: reader.get_chunk(5)
+Out[191]:
+ Unnamed: 0 0 1 2 3
+0 0 0.469112 -0.282863 -1.509059 -1.135632
+1 1 1.212112 -0.173215 0.119209 -1.044236
+2 2 -0.861849 -2.104569 -0.494929 1.071804
+3 3 0.721555 -0.706771 -1.039575 0.271860
+4 4 -0.424972 0.567020 0.276232 -1.087401
+
+```
+
+### Specifying the parser engine
+
+Under the hood pandas uses a fast and efficient parser implemented in C as well
+as a Python implementation which is currently more feature-complete. Where
+possible pandas uses the C parser (specified as ``engine='c'``), but may fall
+back to Python if C-unsupported options are specified. Currently, C-unsupported
+options include:
+
+- ``sep`` other than a single character (e.g. regex separators)
+- ``skipfooter``
+- ``sep=None`` with ``delim_whitespace=False``
+
+Specifying any of the above options will produce a ``ParserWarning`` unless the
+python engine is selected explicitly using ``engine='python'``.
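+
+As a hedged illustration (the inline data below is made up), a regex separator is not supported by the C engine, so selecting the Python engine explicitly avoids the ``ParserWarning``:
+
+``` python
+from io import StringIO
+
+import pandas as pd
+
+data = 'a ; b ; c\n1 ;2; 3'
+
+# A regex separator forces the Python engine; request it explicitly.
+df = pd.read_csv(StringIO(data), sep=r'\s*;\s*', engine='python')
+
+```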
+
+### Reading remote files
+
+You can pass in a URL to a CSV file:
+
+``` python
+df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',
+ sep='\t')
+
+```
+
+S3 URLs are handled as well but require installing the [S3Fs](https://pypi.org/project/s3fs/) library:
+
+``` python
+df = pd.read_csv('s3://pandas-test/tips.csv')
+
+```
+
+If your S3 bucket requires credentials you will need to set them as environment
+variables or in the ``~/.aws/credentials`` config file, refer to the [S3Fs
+documentation on credentials](https://s3fs.readthedocs.io/en/latest/#credentials).
+
+### Writing out data
+
+#### Writing to CSV format
+
+The ``Series`` and ``DataFrame`` objects have an instance method ``to_csv`` which
+allows storing the contents of the object as a comma-separated-values file. The
+function takes a number of arguments. Only the first is required.
+
+- ``path_or_buf``: A string path to the file to write or a file object. If a file object, it must be opened with ``newline=''``
+- ``sep`` : Field delimiter for the output file (default “,”)
+- ``na_rep``: A string representation of a missing value (default ‘’)
+- ``float_format``: Format string for floating point numbers
+- ``columns``: Columns to write (default None)
+- ``header``: Whether to write out the column names (default True)
+- ``index``: whether to write row (index) names (default True)
+- ``index_label``: Column label(s) for index column(s) if desired. If None
+(default), and *header* and *index* are True, then the index names are
+used. (A sequence should be given if the ``DataFrame`` uses MultiIndex).
+- ``mode`` : Python write mode, default ‘w’
+- ``encoding``: a string representing the encoding to use if the contents are
+non-ASCII, for Python versions prior to 3
+- ``line_terminator``: Character sequence denoting line end (default *os.linesep*)
+- ``quoting``: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have set a *float_format* then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric
+- ``quotechar``: Character used to quote fields (default '"')
+- ``doublequote``: Control quoting of ``quotechar`` in fields (default True)
+- ``escapechar``: Character used to escape ``sep`` and ``quotechar`` when
+appropriate (default None)
+- ``chunksize``: Number of rows to write at a time
+- ``date_format``: Format string for datetime objects
+
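+A brief illustrative sketch of a few of the arguments above (the ``'out.csv'`` path and the small frame are placeholders, not part of the original example):
+
+``` python
+import pandas as pd
+
+df = pd.DataFrame({'A': [1.23456, None], 'B': ['x', 'y']})
+
+df.to_csv('out.csv',
+          sep=';',              # use a semicolon instead of the default comma
+          na_rep='NA',          # write missing values as the string 'NA'
+          float_format='%.2f',  # two-decimal floats
+          index=False)          # omit the row index
+
+```
+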
+#### Writing a formatted string
+
+The ``DataFrame`` object has an instance method ``to_string`` which allows control
+over the string representation of the object. All arguments are optional:
+
+- ``buf`` default None, for example a StringIO object
+- ``columns`` default None, which columns to write
+- ``col_space`` default None, minimum width of each column.
+- ``na_rep`` default ``NaN``, representation of NA value
+- ``formatters`` default None, a dictionary (by column) of functions each of
+which takes a single argument and returns a formatted string
+- ``float_format`` default None, a function which takes a single (float)
+argument and returns a formatted string; to be applied to floats in the
+``DataFrame``.
+- ``sparsify`` default True, set to False for a ``DataFrame`` with a hierarchical
+index to print every MultiIndex key at each row.
+- ``index_names`` default True, will print the names of the indices
+- ``index`` default True, will print the index (ie, row labels)
+- ``header`` default True, will print the column labels
+- ``justify`` default ``left``, will print column headers left- or
+right-justified
+
+The ``Series`` object also has a ``to_string`` method, but with only the ``buf``,
+``na_rep``, ``float_format`` arguments. There is also a ``length`` argument
+which, if set to ``True``, will additionally output the length of the Series.
+
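+A small sketch of ``to_string`` with a few of the options above (the example frame is made up for illustration):
+
+``` python
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame({'x': [1.23456, np.nan], 'y': ['a', 'b']})
+
+print(df.to_string(na_rep='-',                    # show missing values as '-'
+                   float_format='{:.2f}'.format,  # callable applied to each float
+                   index=False))                  # drop the row labels
+
+```
+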
+## JSON
+
+Read and write `JSON` format files and strings.
+
+### Writing JSON
+
+A `Series` or `DataFrame` can be converted to a valid JSON string. Use `to_json` with optional parameters:
+- `path_or_buf` : the pathname or buffer to write the output to; this can be `None`, in which case a JSON string is returned.
+- `orient` :
+
+  `Series` :
+  - default is `index`;
+  - allowed values are {`split`, `records`, `index`}.
+
+  `DataFrame` :
+  - default is `columns`;
+  - allowed values are {`split`, `records`, `index`, `columns`, `values`, `table`}.
+
+  The format of the JSON string:
+
+
+ split | dict like {index -> [index], columns -> [columns], data -> [values]}
+ ------------- | -------------
+ records | list like [{column -> value}, … , {column -> value}]
+ index | dict like {index -> {column -> value}}
+ columns | dict like {column -> {index -> value}}
+ values | just the values array
+
+- `date_format` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601.
+
+- `double_precision` : the number of decimal places to use when encoding floating point values, default 10.
+
+- `force_ascii` : force encoded strings to be ASCII, default is True.
+
+- `date_unit` : the time unit to encode to, governs timestamp and ISO8601 precision. One of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds and nanoseconds respectively. Default is 'ms'.
+
+- `default_handler` : the handler to call if an object cannot otherwise be converted to a suitable format for JSON. It takes a single argument, the object to convert, and returns a serializable object.
+
+- `lines` : if `records` orient, then write each record per line as JSON.
+
+Note: `NaN`'s, `NaT`'s and `None` will be converted to `null`, and `datetime` objects will be converted based on the `date_format` and `date_unit` parameters.
+
+```python
+In [192]: dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
+
+In [193]: json = dfj.to_json()
+
+In [194]: json
+Out[194]: '{"A":{"0":-1.2945235903,"1":0.2766617129,"2":-0.0139597524,"3":-0.0061535699,"4":0.8957173022},"B":{"0":0.4137381054,
+"1":-0.472034511,"2":-0.3625429925,"3":-0.923060654,"4":0.8052440254}}'
+```
+
+#### Orient options
+
+There are a number of different options for the format of the resulting JSON file/string. Consider the following `DataFrame` and `Series`:
+
+```python
+In [195]: dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),
+ ..... : columns=list('ABC'), index=list('xyz'))
+ ..... :
+
+In [196]: dfjo
+Out[196]:
+ A B C
+x 1 4 7
+y 2 5 8
+z 3 6 9
+
+In [197]: sjo = pd.Series(dict(x=15, y=16, z=17), name='D')
+
+In [198]: sjo
+Out[198]:
+x 15
+y 16
+z 17
+Name: D, dtype: int64
+
+```
+
+**Column oriented** (the default for `DataFrame`) serializes the data as nested JSON objects with column labels acting as the primary index:
+
+```python
+In [199]: dfjo.to_json(orient="columns")
+Out[199]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'
+
+# Not available for Series
+
+```
+
+**Index oriented** (the default for `Series`) is similar to column oriented, but the index labels are now primary:
+
+```python
+In [200]: dfjo.to_json(orient="index")
+Out[200]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'
+
+In [201]: sjo.to_json(orient="index")
+Out[201]: '{"x":15,"y":16,"z":17}'
+
+```
+
+**Record oriented** serializes the data to a JSON array of column -> value records; index labels are not included. This is useful for passing `DataFrame` data to plotting libraries, for example the JavaScript library `d3.js`:
+
+```python
+In [202]: dfjo.to_json(orient="records")
+Out[202]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'
+
+In [203]: sjo.to_json(orient="records")
+Out[203]: '[15,16,17]'
+
+```
+
+**Value oriented** is a bare-bones option which serializes only to nested JSON arrays of values; column and index labels are not included:
+
+```python
+In [204]: dfjo.to_json(orient="values")
+Out[204]: '[[1,4,7],[2,5,8],[3,6,9]]'
+
+# Not available for Series
+
+```
+
+**Split oriented** serializes to a JSON object containing separate entries for values, index and columns. The name of a `Series` is also included:
+
+```python
+In [205]: dfjo.to_json(orient="split")
+Out[205]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'
+
+In [206]: sjo.to_json(orient="split")
+Out[206]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'
+
+```
+
+**Table oriented** serializes to the JSON [Table Schema](https://specs.frictionlessdata.io/json-table-schema/ "Table Schema"), allowing for the preservation of metadata including but not limited to dtypes and index names.
+
+::: tip Note
+
+Any orient option that encodes to a JSON object will not preserve the ordering of index and column labels during round-trip serialization. If you wish to preserve label ordering, use the `split` option as it uses ordered containers.
+
+:::
+
+#### Date handling
+
+Writing in ISO date format:
+
+```python
+In [207]: dfd = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
+
+In [208]: dfd['date'] = pd.Timestamp('20130101')
+
+In [209]: dfd = dfd.sort_index(1, ascending=False)
+
+In [210]: json = dfd.to_json(date_format='iso')
+
+In [211]: json
+Out[211]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
+
+```
+
+Writing in ISO date format, with microseconds:
+
+```python
+In [212]: json = dfd.to_json(date_format='iso', date_unit='us')
+
+In [213]: json
+Out[213]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
+
+```
+Epoch timestamps, in seconds:
+
+```python
+In [214]: json = dfd.to_json(date_format='epoch', date_unit='s')
+
+In [215]: json
+Out[215]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
+
+```
+
+Writing to a file, with a date index and a date column:
+
+```python
+In [216]: dfj2 = dfj.copy()
+
+In [217]: dfj2['date'] = pd.Timestamp('20130101')
+
+In [218]: dfj2['ints'] = list(range(5))
+
+In [219]: dfj2['bools'] = True
+
+In [220]: dfj2.index = pd.date_range('20130101', periods=5)
+
+In [221]: dfj2.to_json('test.json')
+
+In [222]: with open('test.json') as fh:
+ .....: print(fh.read())
+ .....:
+{"A":{"1356998400000":-1.2945235903,"1357084800000":0.2766617129,"1357171200000":-0.0139597524,"1357257600000":-0.0061535699,"1357344000000":0.8957173022},"B":{"1356998400000":0.4137381054,"1357084800000":-0.472034511,"1357171200000":-0.3625429925,"1357257600000":-0.923060654,"1357344000000":0.8052440254},"date":{"1356998400000":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,"1357344000000":true}}
+
+```
+
+#### Fallback behavior
+
+If the JSON serializer cannot handle the container contents directly, it will fall back in the following manner:
+
+- if the dtype is unsupported (e.g. `np.complex`), then the `default_handler`, if provided, will be called for each value, otherwise an exception is raised.
+
+- if an object is unsupported it will attempt the following:
+  - check if the object has defined a `toDict` method and call it. A `toDict` method should return a `dict`, which will then be JSON serialized.
+  - invoke the `default_handler` if one was provided.
+  - convert the object to a `dict` by traversing its contents. However, this will often fail with an `OverflowError` or give unexpected results.
+
+In general the best approach for unsupported objects or dtypes is to provide a `default_handler`. For example:
+
+``` python
+>>> DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json() # raises
+RuntimeError: Unhandled numpy dtype 15
+
+```
+
+can be dealt with by specifying a simple `default_handler`:
+
+``` python
+In [223]: pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)
+Out[223]: '{"0":{"0":"(1+0j)","1":"(2+0j)","2":"(1+2j)"}}'
+
+```
+
+### Reading JSON
+
+Reading a JSON string to a pandas object can take a number of parameters. The parser will try to parse a `DataFrame` if `typ` is not supplied or is `None`. To explicitly force `Series` parsing, pass `typ=series`.
+
+- `filepath_or_buffer` : a **VALID** JSON string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json.
+
+- `typ` : type of object to recover (series or frame), default 'frame'.
+
+- `orient` :
+
+  Series:
+  - default is `index`.
+  - allowed values are {`split`, `records`, `index`}.
+
+  DataFrame:
+  - default is `columns`.
+  - allowed values are {`split`, `records`, `index`, `columns`, `values`, `table`}.
+
+The format of the JSON string:
+
+
+ split | dict like {index -> [index], columns -> [columns], data -> [values]}
+ ------------- | -------------
+ records | list like [{column -> value}, … , {column -> value}]
+ index | dict like {index -> {column -> value}}
+ columns | dict like {column -> {index -> value}}
+ values | just the values array
+ table | adhering to the JSON [Table Schema](https://specs.frictionlessdata.io/json-table-schema/)
+
+- `dtype` : if True, infer dtypes; if a dict of column to dtype, then use those; if `False`, then don't infer dtypes at all. Default is True; applies only to the data.
+
+- `convert_axes` : boolean, try to convert the axes to the proper dtypes, default is `True`.
+
+- `convert_dates` : a list of columns to parse for dates; if `True`, then try to parse date-like columns, default is `True`.
+
+- `keep_default_dates` : boolean, default `True`. If parsing dates, then parse the default date-like columns.
+
+- `numpy` : direct decoding to NumPy arrays. Default is `False`; supports numeric data only, although labels may be non-numeric. Also note that the JSON ordering **MUST** be the same for each term if `numpy=True`.
+
+- `precise_float` : boolean, default `False`. Set to enable usage of a higher precision (strtod) function when decoding strings to double values. The default (`False`) uses fast but less precise builtin functionality.
+
+- `date_unit` : string, the timestamp unit to detect if converting dates. Default None. By default the timestamp precision will be detected; if this is not desired, then pass one of 's', 'ms', 'us' or 'ns' to force timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively.
+
+- `lines` : read the file as one JSON object per line.
+
+- `encoding` : the encoding to use to decode py3 bytes.
+
+- `chunksize` : when used in combination with `lines=True`, return a JsonReader which reads in `chunksize` lines per iteration.
+
+The parser will raise one of `ValueError / TypeError / AssertionError` if the JSON is not parseable.
+
+If a non-default `orient` was used when encoding to JSON, be sure to pass the same option here so that decoding produces sensible results; see [Orient Options](https://www.pypandas.cn/docs/user_guide/io.html#orient-options) for an overview.
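+
+A minimal round-trip sketch (the tiny frame below is made up for illustration): encode and decode with the same orient.
+
+``` python
+import pandas as pd
+
+df = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
+
+s = df.to_json(orient='split')          # 'split' preserves index/column order
+same = pd.read_json(s, orient='split')  # decode with the matching orient
+
+```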
+
+#### Data conversion
+
+The default of `convert_axes=True`, `dtype=True`, and `convert_dates=True` will try to parse the axes, and all of the data, into appropriate types, including dates. If you need to override specific dtypes, pass a dict to `dtype`. `convert_axes` should only be set to `False` if you need to preserve string-like numbers (e.g. '1', '2') in an axis.
+
+::: tip Note
+
+Large integer values may be converted to dates if `convert_dates=True` and the data and/or column labels appear 'date-like'. The exact threshold depends on the `date_unit` specified. 'date-like' means that the column label meets one of the following criteria:
+- it ends with `'_at'`
+- it ends with `'_time'`
+- it begins with `'timestamp'`
+- it is `'modified'`
+- it is `'date'`
+
+:::
+
+::: danger Warning
+
+When reading JSON data, automatic coercion into dtypes has some quirks:
+
+- an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization
+
+- a column that was `float` data will be converted to `integer` if it can be done safely, e.g. a column of `1.`
+- bool columns will be converted to `integer` on reconstruction
+
+Thus there are times when you may want to specify particular dtypes via the `dtype` keyword argument.
+
+:::
+
+Reading from a JSON string:
+
+```python
+In [224]: pd.read_json(json)
+Out[224]:
+ date B A
+0 2013-01-01 2.565646 -1.206412
+1 2013-01-01 1.340309 1.431256
+2 2013-01-01 -0.226169 -1.170299
+3 2013-01-01 0.813850 0.410835
+4 2013-01-01 -0.827317 0.132003
+
+```
+Reading from a file:
+
+```python
+In [225]: pd.read_json('test.json')
+Out[225]:
+ A B date ints bools
+2013-01-01 -1.294524 0.413738 2013-01-01 0 True
+2013-01-02 0.276662 -0.472035 2013-01-01 1 True
+2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
+2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
+2013-01-05 0.895717 0.805244 2013-01-01 4 True
+
+```
+Don't convert any data (but still convert axes and dates):
+
+```python
+In [226]: pd.read_json('test.json', dtype=object).dtypes
+Out[226]:
+A object
+B object
+date object
+ints object
+bools object
+dtype: object
+
+```
+Specify dtypes for conversion:
+
+```python
+In [227]: pd.read_json('test.json', dtype={'A': 'float32', 'bools': 'int8'}).dtypes
+Out[227]:
+A float32
+B float64
+date datetime64[ns]
+ints int64
+bools int8
+dtype: object
+
+```
+Preserve string indices:
+
+```python
+In [228]: si = pd.DataFrame(np.zeros((4, 4)), columns=list(range(4)),
+ .....: index=[str(i) for i in range(4)])
+ .....:
+
+In [229]: si
+Out[229]:
+ 0 1 2 3
+0 0.0 0.0 0.0 0.0
+1 0.0 0.0 0.0 0.0
+2 0.0 0.0 0.0 0.0
+3 0.0 0.0 0.0 0.0
+
+In [230]: si.index
+Out[230]: Index(['0', '1', '2', '3'], dtype='object')
+
+In [231]: si.columns
+Out[231]: Int64Index([0, 1, 2, 3], dtype='int64')
+
+In [232]: json = si.to_json()
+
+In [233]: sij = pd.read_json(json, convert_axes=False)
+
+In [234]: sij
+Out[234]:
+ 0 1 2 3
+0 0 0 0 0
+1 0 0 0 0
+2 0 0 0 0
+3 0 0 0 0
+
+In [235]: sij.index
+Out[235]: Index(['0', '1', '2', '3'], dtype='object')
+
+In [236]: sij.columns
+Out[236]: Index(['0', '1', '2', '3'], dtype='object')
+
+```
+Dates written in nanoseconds need to be read back in nanoseconds:
+
+``` python
+In [237]: json = dfj2.to_json(date_unit='ns')
+
+# Try to parse timestamps as milliseconds -> Won't Work
+In [238]: dfju = pd.read_json(json, date_unit='ms')
+
+In [239]: dfju
+Out[239]:
+ A B date ints bools
+1356998400000000000 -1.294524 0.413738 1356998400000000000 0 True
+1357084800000000000 0.276662 -0.472035 1356998400000000000 1 True
+1357171200000000000 -0.013960 -0.362543 1356998400000000000 2 True
+1357257600000000000 -0.006154 -0.923061 1356998400000000000 3 True
+1357344000000000000 0.895717 0.805244 1356998400000000000 4 True
+
+# Let pandas detect the correct precision
+In [240]: dfju = pd.read_json(json)
+
+In [241]: dfju
+Out[241]:
+ A B date ints bools
+2013-01-01 -1.294524 0.413738 2013-01-01 0 True
+2013-01-02 0.276662 -0.472035 2013-01-01 1 True
+2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
+2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
+2013-01-05 0.895717 0.805244 2013-01-01 4 True
+
+# Or specify that all timestamps are in nanoseconds
+In [242]: dfju = pd.read_json(json, date_unit='ns')
+
+In [243]: dfju
+Out[243]:
+ A B date ints bools
+2013-01-01 -1.294524 0.413738 2013-01-01 0 True
+2013-01-02 0.276662 -0.472035 2013-01-01 1 True
+2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
+2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
+2013-01-05 0.895717 0.805244 2013-01-01 4 True
+
+```
+
+#### The Numpy parameter
+
+::: tip Note
+
+This supports numeric data only. Index and column labels may be non-numeric, e.g. strings, dates etc.
+
+:::
+
+If `numpy=True` is passed to `read_json`, an attempt will be made to sniff an appropriate dtype during deserialization and to subsequently decode directly to NumPy arrays, bypassing the need for intermediate Python objects.
+
+This can provide speedups if you are deserializing a large amount of numeric data:
+
+``` python
+In [244]: randfloats = np.random.uniform(-100, 1000, 10000)
+
+In [245]: randfloats.shape = (1000, 10)
+
+In [246]: dffloats = pd.DataFrame(randfloats, columns=list('ABCDEFGHIJ'))
+
+In [247]: jsonfloats = dffloats.to_json()
+
+```
+
+``` python
+In [248]: %timeit pd.read_json(jsonfloats)
+12.4 ms +- 116 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+
+```
+
+``` python
+In [249]: %timeit pd.read_json(jsonfloats, numpy=True)
+9.56 ms +- 82.8 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+
+```
+
+The speedup is less noticeable for smaller datasets:
+
+``` python
+In [250]: jsonfloats = dffloats.head(100).to_json()
+
+```
+
+``` python
+In [251]: %timeit pd.read_json(jsonfloats)
+8.05 ms +- 120 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+
+```
+
+``` python
+In [252]: %timeit pd.read_json(jsonfloats, numpy=True)
+7 ms +- 162 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
+
+```
+::: danger Warning
+
+Direct NumPy decoding makes a number of assumptions and may fail or produce unexpected output if these assumptions are not satisfied:
+
+- data is numeric.
+- data is uniform. The dtype is sniffed from the first value decoded. A `ValueError` may be raised, or incorrect output may be produced, if this condition is not satisfied.
+
+- labels are ordered. Labels are only read from the first container; it is assumed that each subsequent row / column has been encoded in the same order. This should be satisfied if the data was encoded using `to_json`, but may not be the case if the JSON is from another source.
+
+:::
+
+
+### Normalization
+
+pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into a flat table.
+
+``` python
+In [253]: from pandas.io.json import json_normalize
+
+In [254]: data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
+ .....: {'name': {'given': 'Mose', 'family': 'Regner'}},
+ .....: {'id': 2, 'name': 'Faye Raker'}]
+ .....:
+
+In [255]: json_normalize(data)
+Out[255]:
+ id name.first name.last name.given name.family name
+0 1.0 Coleen Volk NaN NaN NaN
+1 NaN NaN NaN Mose Regner NaN
+2 2.0 NaN NaN NaN NaN Faye Raker
+
+```
+
+``` python
+In [256]: data = [{'state': 'Florida',
+ .....: 'shortname': 'FL',
+ .....: 'info': {'governor': 'Rick Scott'},
+ .....: 'counties': [{'name': 'Dade', 'population': 12345},
+ .....: {'name': 'Broward', 'population': 40000},
+ .....: {'name': 'Palm Beach', 'population': 60000}]},
+ .....: {'state': 'Ohio',
+ .....: 'shortname': 'OH',
+ .....: 'info': {'governor': 'John Kasich'},
+ .....: 'counties': [{'name': 'Summit', 'population': 1234},
+ .....: {'name': 'Cuyahoga', 'population': 1337}]}]
+ .....:
+
+In [257]: json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
+Out[257]:
+ name population state shortname info.governor
+0 Dade 12345 Florida FL Rick Scott
+1 Broward 40000 Florida FL Rick Scott
+2 Palm Beach 60000 Florida FL Rick Scott
+3 Summit 1234 Ohio OH John Kasich
+4 Cuyahoga 1337 Ohio OH John Kasich
+
+```
+The max_level parameter provides more control over which level to end normalization at. With max_level=1 the following snippet normalizes until the first nesting level of the provided dict.
+
+``` python
+In [258]: data = [{'CreatedBy': {'Name': 'User001'},
+ .....: 'Lookup': {'TextField': 'Some text',
+ .....: 'UserField': {'Id': 'ID001',
+ .....: 'Name': 'Name001'}},
+ .....: 'Image': {'a': 'b'}
+ .....: }]
+ .....:
+
+In [259]: json_normalize(data, max_level=1)
+Out[259]:
+ CreatedBy.Name Lookup.TextField Lookup.UserField Image.a
+0 User001 Some text {'Id': 'ID001', 'Name': 'Name001'} b
+
+```
+
+### Line delimited json
+
+*New in version 0.19.0.*
+
+pandas is able to read and write line-delimited json files, which are common in data processing pipelines using Hadoop or Spark.
+
+*New in version 0.21.0.*
+
+For line-delimited json files, pandas can also return an iterator which reads in `chunksize` lines at a time. This can be useful for large files or for reading from a stream.
+
+``` python
+In [260]: jsonl = '''
+ .....: {"a": 1, "b": 2}
+ .....: {"a": 3, "b": 4}
+ .....: '''
+ .....:
+
+In [261]: df = pd.read_json(jsonl, lines=True)
+
+In [262]: df
+Out[262]:
+ a b
+0 1 2
+1 3 4
+
+In [263]: df.to_json(orient='records', lines=True)
+Out[263]: '{"a":1,"b":2}\n{"a":3,"b":4}'
+
+# reader is an iterator that returns `chunksize` lines each iteration
+In [264]: reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
+
+In [265]: reader
+Out[265]:
+
+In [266]: for chunk in reader:
+ .....: print(chunk)
+ .....:
+Empty DataFrame
+Columns: []
+Index: []
+ a b
+0 1 2
+ a b
+1 3 4
+
+```
+
+### Table schema
+
+*New in version 0.20.0.*
+
+[Table Schema](https://specs.frictionlessdata.io/json-table-schema/) is a spec for describing tabular datasets as a JSON object. The JSON includes information on the field names, types, and other attributes. You can use the orient `table` to build a JSON string with two fields, `schema` and `data`.
+
+``` python
+In [267]: df = pd.DataFrame({'A': [1, 2, 3],
+ .....: 'B': ['a', 'b', 'c'],
+ .....: 'C': pd.date_range('2016-01-01', freq='d', periods=3)},
+ .....: index=pd.Index(range(3), name='idx'))
+ .....:
+
+In [268]: df
+Out[268]:
+ A B C
+idx
+0 1 a 2016-01-01
+1 2 b 2016-01-02
+2 3 c 2016-01-03
+
+In [269]: df.to_json(orient='table', date_format="iso")
+Out[269]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"}, "data": [{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
+
+```
+The `schema` field contains the `fields` key, which itself contains a list of column name to type pairs, including the `Index` or `MultiIndex` (see below for a list of types). The `schema` field also contains a `primaryKey` field if the (Multi)index is unique.
+
+The second field, `data`, contains the serialized data with the `records` orient. The index is included, and any datetimes are ISO 8601 formatted, as required by the Table Schema spec.
+
+The full list of supported types is described in the Table Schema spec. This table shows the mapping from pandas types:
+
+Pandas type | Table Schema type
+---|---
+int64 | integer
+float64 | number
+bool | boolean
+datetime64[ns] | datetime
+timedelta64[ns] | duration
+categorical | any
+object | str
+
+A few notes on the generated table schema:
+
+- The `schema` object contains a `pandas_version` field. This contains the version of pandas' dialect of the schema and will be incremented with each revision.
+- All dates are converted to UTC when serializing. Even timezone-naive values are treated as UTC with an offset of 0.
+
+``` python
+In [270]: from pandas.io.json import build_table_schema
+
+In [271]: s = pd.Series(pd.date_range('2016', periods=4))
+
+In [272]: build_table_schema(s)
+Out[272]:
+{'fields': [{'name': 'index', 'type': 'integer'},
+ {'name': 'values', 'type': 'datetime'}],
+ 'primaryKey': ['index'],
+ 'pandas_version': '0.20.0'}
+
+```
+- datetimes with a timezone (before serializing) include an additional field `tz` with the time zone name (e.g. `'US/Central'`).
+
+``` python
+In [273]: s_tz = pd.Series(pd.date_range('2016', periods=12,
+ .....: tz='US/Central'))
+ .....:
+
+In [274]: build_table_schema(s_tz)
+Out[274]:
+{'fields': [{'name': 'index', 'type': 'integer'},
+ {'name': 'values', 'type': 'datetime', 'tz': 'US/Central'}],
+ 'primaryKey': ['index'],
+ 'pandas_version': '0.20.0'}
+
+```
+- Periods are converted to timestamps before serialization, and so have the same behavior of being converted to UTC. In addition, periods will contain an additional field `freq` with the period's frequency, e.g. `'A-DEC'`.
+
+``` python
+In [275]: s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
+ .....: periods=4))
+ .....:
+
+In [276]: build_table_schema(s_per)
+Out[276]:
+{'fields': [{'name': 'index', 'type': 'datetime', 'freq': 'A-DEC'},
+ {'name': 'values', 'type': 'integer'}],
+ 'primaryKey': ['index'],
+ 'pandas_version': '0.20.0'}
+
+```
+- Categoricals use the `any` type and an `enum` constraint listing the set of possible values. Additionally, an `ordered` field is included:
+
+``` python
+In [277]: s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
+
+In [278]: build_table_schema(s_cat)
+Out[278]:
+{'fields': [{'name': 'index', 'type': 'integer'},
+ {'name': 'values',
+ 'type': 'any',
+ 'constraints': {'enum': ['a', 'b']},
+ 'ordered': False}],
+ 'primaryKey': ['index'],
+ 'pandas_version': '0.20.0'}
+
+```
+- A `primaryKey` field, containing an array of labels, is included if the index is unique:
+
+``` python
+In [279]: s_dupe = pd.Series([1, 2], index=[1, 1])
+
+In [280]: build_table_schema(s_dupe)
+Out[280]:
+{'fields': [{'name': 'index', 'type': 'integer'},
+ {'name': 'values', 'type': 'integer'}],
+ 'pandas_version': '0.20.0'}
+
+```
+- The `primaryKey` behavior is the same with MultiIndexes, but in this case the `primaryKey` is an array:
+
+``` python
+In [281]: s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
+ .....: (0, 1)]))
+ .....:
+
+In [282]: build_table_schema(s_multi)
+Out[282]:
+{'fields': [{'name': 'level_0', 'type': 'string'},
+ {'name': 'level_1', 'type': 'integer'},
+ {'name': 'values', 'type': 'integer'}],
+ 'primaryKey': FrozenList(['level_0', 'level_1']),
+ 'pandas_version': '0.20.0'}
+
+```
+- The default naming roughly follows these rules:
+
+  - For series, the `object.name` is used. If that is None, then the name is `values`.
+  - For `DataFrames`, the stringified version of the column name is used.
+  - For `Index` (not `MultiIndex`), `index.name` is used, with a fallback to `index` if that is None.
+  - For `MultiIndex`, `mi.names` is used. If any level has no name, then `level_<i>` is used.
+
+*New in version 0.23.0.*
+
+`read_json` also accepts `orient='table'` as an argument. This allows for metadata such as dtypes and index names to be preserved in a round-trippable manner.
+
+``` python
+In [283]: df = pd.DataFrame({'foo': [1, 2, 3, 4],
+ .....: 'bar': ['a', 'b', 'c', 'd'],
+ .....: 'baz': pd.date_range('2018-01-01', freq='d', periods=4),
+ .....: 'qux': pd.Categorical(['a', 'b', 'c', 'c'])
+ .....: }, index=pd.Index(range(4), name='idx'))
+ .....:
+
+In [284]: df
+Out[284]:
+ foo bar baz qux
+idx
+0 1 a 2018-01-01 a
+1 2 b 2018-01-02 b
+2 3 c 2018-01-03 c
+3 4 d 2018-01-04 c
+
+In [285]: df.dtypes
+Out[285]:
+foo int64
+bar object
+baz datetime64[ns]
+qux category
+dtype: object
+
+In [286]: df.to_json('test.json', orient='table')
+
+In [287]: new_df = pd.read_json('test.json', orient='table')
+
+In [288]: new_df
+Out[288]:
+ foo bar baz qux
+idx
+0 1 a 2018-01-01 a
+1 2 b 2018-01-02 b
+2 3 c 2018-01-03 c
+3 4 d 2018-01-04 c
+
+In [289]: new_df.dtypes
+Out[289]:
+foo int64
+bar object
+baz datetime64[ns]
+qux category
+dtype: object
+
+```
+Please note that the literal string 'index' as the name of an [Index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html#pandas.Index) is not round-trippable, nor are any names beginning with `'level_'` within a [MultiIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex). These are used by default in [DataFrame.to_json()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html#pandas.DataFrame.to_json) to indicate missing values, and the subsequent read cannot distinguish the intent.
+
+``` python
+In [290]: df.index.name = 'index'
+
+In [291]: df.to_json('test.json', orient='table')
+
+In [292]: new_df = pd.read_json('test.json', orient='table')
+
+In [293]: print(new_df.index.name)
+None
+
+```
+
+## HTML
+
+### Reading HTML content
+
+::: danger Warning
+
+We **highly encourage** you to read the [HTML Table Parsing gotchas](https://www.pypandas.cn/docs/user_guide/io.html#io-html-gotchas) below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
+
+:::
+
+The top-level `read_html()` function can accept an HTML string/file/URL and will parse HTML tables into a list of pandas `DataFrames`. Let's look at a few examples.
+
+::: tip Note
+
+`read_html` returns a `list` of `DataFrame` objects, even if there is only a single table contained in the HTML content.
+
+:::
+
+Reading a URL with no options:
+
+```python
+In [294]: url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'
+
+In [295]: dfs = pd.read_html(url)
+
+In [296]: dfs
+Out[296]:
+[ Bank Name City ST CERT Acquiring Institution Closing Date Updated Date
+ 0 The Enloe State Bank Cooper TX 10716 Legend Bank, N. A. May 31, 2019 June 18, 2019
+ 1 Washington Federal Bank for Savings Chicago IL 30570 Royal Savings Bank December 15, 2017 February 1, 2019
+ 2 The Farmers and Merchants State Bank of Argonia Argonia KS 17719 Conway Bank October 13, 2017 February 21, 2018
+ 3 Fayette County Bank Saint Elmo IL 1802 United Fidelity Bank, fsb May 26, 2017 January 29, 2019
+ 4 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... Milwaukee WI 30003 First-Citizens Bank & Trust Company May 5, 2017 March 22, 2018
+ .. ... ... .. ... ... ... ...
+ 551 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB July 27, 2001 August 19, 2014
+ 552 Malta National Bank Malta OH 6629 North Valley Bank May 3, 2001 November 18, 2002
+ 553 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust February 2, 2001 February 18, 2003
+ 554 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion December 14, 2000 March 17, 2005
+ 555 Bank of Honolulu Honolulu HI 21029 Bank of the Orient October 13, 2000 March 17, 2005
+
+ [556 rows x 7 columns]]
+
+```
+
+::: tip Note
+
+The data from the above URL changes every Monday, so the resulting data above may be slightly different from the data below.
+
+:::
+
+Reading the contents of the file from the above URL and passing it to `read_html` as a string:
+
+```python
+In [297]: with open(file_path, 'r') as f:
+ .....: dfs = pd.read_html(f.read())
+ .....:
+
+In [298]: dfs
+Out[298]:
+[ Bank Name City ST CERT Acquiring Institution Closing Date Updated Date
+ 0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha WI 35386 North Shore Bank, FSB May 31, 2013 May 31, 2013
+ 1 Central Arizona Bank Scottsdale AZ 34527 Western State Bank May 14, 2013 May 20, 2013
+ 2 Sunrise Bank Valdosta GA 58185 Synovus Bank May 10, 2013 May 21, 2013
+ 3 Pisgah Community Bank Asheville NC 58701 Capital Bank, N.A. May 10, 2013 May 14, 2013
+ 4 Douglas County Bank Douglasville GA 21649 Hamilton State Bank April 26, 2013 May 16, 2013
+ .. ... ... .. ... ... ... ...
+ 500 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB July 27, 2001 June 5, 2012
+ 501 Malta National Bank Malta OH 6629 North Valley Bank May 3, 2001 November 18, 2002
+ 502 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust February 2, 2001 February 18, 2003
+ 503 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion December 14, 2000 March 17, 2005
+ 504 Bank of Honolulu Honolulu HI 21029 Bank of the Orient October 13, 2000 March 17, 2005
+
+ [505 rows x 7 columns]]
+
+```
+
+You can even pass in an instance of `StringIO` if you so desire:
+
+```python
+In [299]: with open(file_path, 'r') as f:
+ .....: sio = StringIO(f.read())
+ .....:
+
+In [300]: dfs = pd.read_html(sio)
+
+In [301]: dfs
+Out[301]:
+[ Bank Name City ST CERT Acquiring Institution Closing Date Updated Date
+ 0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha WI 35386 North Shore Bank, FSB May 31, 2013 May 31, 2013
+ 1 Central Arizona Bank Scottsdale AZ 34527 Western State Bank May 14, 2013 May 20, 2013
+ 2 Sunrise Bank Valdosta GA 58185 Synovus Bank May 10, 2013 May 21, 2013
+ 3 Pisgah Community Bank Asheville NC 58701 Capital Bank, N.A. May 10, 2013 May 14, 2013
+ 4 Douglas County Bank Douglasville GA 21649 Hamilton State Bank April 26, 2013 May 16, 2013
+ .. ... ... .. ... ... ... ...
+ 500 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB July 27, 2001 June 5, 2012
+ 501 Malta National Bank Malta OH 6629 North Valley Bank May 3, 2001 November 18, 2002
+ 502 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust February 2, 2001 February 18, 2003
+ 503 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion December 14, 2000 March 17, 2005
+ 504 Bank of Honolulu Honolulu HI 21029 Bank of the Orient October 13, 2000 March 17, 2005
+
+ [505 rows x 7 columns]]
+
+```
+
+::: tip Note
+
+The following examples are not run by the IPython evaluator because having so many network-accessing functions slows down the documentation build. If you spot an error or an example that doesn't run, please report it on the [pandas GitHub issues page](https://www.github.com/pandas-dev/pandas/issues).
+
+:::
+
+Reading a URL and matching a table that contains specific text:
+
+```python
+match = 'Metcalf Bank'
+df_list = pd.read_html(url, match=match)
+
+```
+
+Specify a header row (by default ...
+
+
+| **-** | **a** | **b** |
+| --- | --- | --- |
+| 0 | & | -0.474063 |
+| 1 | < | -0.230305 |
+| 2 | > | -0.400654 |
+
+::: tip Note
+
+Some browsers may not show a difference in the rendering of the previous two HTML tables.
+
+:::
+
+### HTML Table Parsing Gotchas
+
+There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level pandas io function `read_html`.
+
+**Issues with [lxml](https://lxml.de/)**:
+
+- Benefits:
+  - [lxml](https://lxml.de/) is very fast.
+  - [lxml](https://lxml.de/) requires Cython to install correctly.
+
+- Drawbacks:
+  - [lxml](https://lxml.de/) does not make any guarantees about the results of its parse unless it is given [strictly valid markup](https://validator.w3.org/docs/help.html#validation_basics).
+  - In light of the above, we have chosen to allow you, the user, to use the [lxml](https://lxml.de/) backend, but **this backend will use [html5lib](https://github.com/html5lib/html5lib-python)** if [lxml](https://lxml.de/) fails to parse.
+  - It is therefore *highly recommended* that you install both **[BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup)** and **[html5lib](https://github.com/html5lib/html5lib-python)**, so that you will still get a valid result (provided everything else is valid) even if [lxml](https://lxml.de/) fails.
+
+**Issues with [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup) using [lxml](https://lxml.de/) as a backend**:
+
+- The above issues hold here as well, since **[BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup)** is essentially just a wrapper around a parser backend.
+
+**Issues with [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup) using [html5lib](https://github.com/html5lib/html5lib-python) as a backend**:
+
+- Benefits:
+  - [html5lib](https://github.com/html5lib/html5lib-python) is far more lenient than [lxml](https://lxml.de/) and consequently deals with *real-life markup* in a much saner way, rather than just, e.g., dropping an element without notifying you.
+  - [html5lib](https://github.com/html5lib/html5lib-python) *generates valid HTML5 markup from invalid markup automatically*. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does not mean it is "correct", since the process of fixing markup does not have a single definition.
+  - [html5lib](https://github.com/html5lib/html5lib-python) is pure Python and requires no additional build steps beyond its own installation.
+
+- Drawbacks:
+  - The biggest drawback to using [html5lib](https://github.com/html5lib/html5lib-python) is that it is very slow. However, consider the fact that many tables on the web are not big enough for the parsing algorithm runtime to matter; it is more likely that the bottleneck will be in the process of reading the raw text from the URL over the web, i.e., IO (input-output). For very large tables this might not be true.
+
+## Excel files
+
+The [read_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel) method can read Excel 2003 (`.xls`) files using the Python `xlrd` module, while Excel 2007+ (`.xlsx`) files can be read using either `xlrd` or `openpyxl`. The [to_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html#pandas.DataFrame.to_excel) instance method is used for saving a `DataFrame` to Excel. Generally the semantics are similar to working with [csv](https://www.pypandas.cn/docs/user_guide/io.html#io-read-csv-table) data; see the [cookbook](https://www.pypandas.cn/docs/user_guide/cookbook.html#cookbook-excel) for some advanced strategies.
+
+### Reading Excel files
+
+In the most basic use-case, `read_excel` takes a path to an Excel file, and `sheet_name` indicates which sheet to parse.
+
+```python
+# Returns a DataFrame
+pd.read_excel('path_to_file.xls', sheet_name='Sheet1')
+
+```
+
+#### `ExcelFile` class
+
+To facilitate working with multiple sheets from the same file, the `ExcelFile` class can be used to wrap the file and passed into `read_excel`. There will be a performance benefit when reading multiple sheets, as the file is read into memory only once.
+
+```python
+xlsx = pd.ExcelFile('path_to_file.xls')
+df = pd.read_excel(xlsx, 'Sheet1')
+
+```
+
+The `ExcelFile` class can also be used as a context manager.
+
+```python
+with pd.ExcelFile('path_to_file.xls') as xls:
+ df1 = pd.read_excel(xls, 'Sheet1')
+ df2 = pd.read_excel(xls, 'Sheet2')
+
+```
+
+The `sheet_names` property will generate a list of the sheet names in the file.
+
+The primary use-case for an `ExcelFile` is parsing multiple sheets with different parameters:
+
+```python
+data = {}
+# For when Sheet1's format differs from Sheet2
+with pd.ExcelFile('path_to_file.xls') as xls:
+ data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
+ na_values=['NA'])
+ data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)
+
+```
+
+Note that if the same parsing parameters are used for all sheets, the list of sheet names can simply be passed to `read_excel` with no loss in performance.
+
+```python
+# using the ExcelFile class
+data = {}
+with pd.ExcelFile('path_to_file.xls') as xls:
+ data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None,
+ na_values=['NA'])
+ data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=None,
+ na_values=['NA'])
+
+# equivalent using the read_excel function
+data = pd.read_excel('path_to_file.xls', ['Sheet1', 'Sheet2'],
+ index_col=None, na_values=['NA'])
+
+```
+
+`ExcelFile` can also be called with an `xlrd.book.Book` object as a parameter. This allows the user to control how the Excel file is read. For example, sheets can be loaded on demand by calling `xlrd.open_workbook()` with `on_demand=True`.
+
+```python
+import xlrd
+xlrd_book = xlrd.open_workbook('path_to_file.xls', on_demand=True)
+with pd.ExcelFile(xlrd_book) as xls:
+ df1 = pd.read_excel(xls, 'Sheet1')
+ df2 = pd.read_excel(xls, 'Sheet2')
+
+```
+
+#### Specifying sheets
+
+::: tip Note
+
+The second argument is `sheet_name`, not to be confused with `ExcelFile.sheet_names`.
+
+:::
+
+::: tip Note
+
+An ExcelFile's attribute `sheet_names` provides access to a list of sheets.
+
+:::
+
+- The argument `sheet_name` allows specifying the sheet or sheets to read.
+
+- The default value for `sheet_name` is 0, indicating to read the first sheet.
+
+- Pass a string to refer to the name of a particular sheet in the workbook.
+
+- Pass an integer to refer to the index of a sheet. Indices follow the Python convention, beginning at 0.
+
+- Pass a list of either strings or integers, to return a dictionary of the specified sheets.
+
+- Pass `None` to return a dictionary of all available sheets.
+
+```python
+# Returns a DataFrame
+pd.read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
+
+```
+
+Using the sheet index:
+
+```python
+# Returns a DataFrame
+pd.read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])
+
+```
+
+Using all default values:
+
+```python
+# Returns a DataFrame
+pd.read_excel('path_to_file.xls')
+
+```
+
+Using None to get all sheets:
+
+```python
+# Returns a dictionary of DataFrames
+pd.read_excel('path_to_file.xls', sheet_name=None)
+
+```
+
+Using a list to get multiple sheets:
+
+```python
+# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
+pd.read_excel('path_to_file.xls', sheet_name=['Sheet1', 3])
+
+```
+
+`read_excel` can read more than one sheet, by setting `sheet_name` to either a list of sheet names, a list of sheet positions, or `None` to read all sheets. Sheets can be specified by sheet index or sheet name, using an integer or string, respectively.
+
+#### Reading a `MultiIndex`
+
+`read_excel` can read a `MultiIndex` index, by passing a list of columns to `index_col`, and a `MultiIndex` column by passing a list of rows to `header`. If either the `index` or `columns` have serialized level names, those will be read in as well, by specifying the rows/columns that make up the levels.
+
+For example, to read in a `MultiIndex` index without names:
+
+```python
+In [314]: df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]},
+ .....: index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))
+ .....:
+
+In [315]: df.to_excel('path_to_file.xlsx')
+
+In [316]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])
+
+In [317]: df
+Out[317]:
+ a b
+a c 1 5
+ d 2 6
+b c 3 7
+ d 4 8
+
+```
+
+If the index has level names, they will be parsed as well, using the same parameters:
+
+```python
+In [318]: df.index = df.index.set_names(['lvl1', 'lvl2'])
+
+In [319]: df.to_excel('path_to_file.xlsx')
+
+In [320]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])
+
+In [321]: df
+Out[321]:
+ a b
+lvl1 lvl2
+a c 1 5
+ d 2 6
+b c 3 7
+ d 4 8
+
+```
+
+If the source file has both a `MultiIndex` index and columns, lists specifying each should be passed to `index_col` and `header`:
+
+```python
+In [322]: df.columns = pd.MultiIndex.from_product([['a'], ['b', 'd']],
+ .....: names=['c1', 'c2'])
+ .....:
+
+In [323]: df.to_excel('path_to_file.xlsx')
+
+In [324]: df = pd.read_excel('path_to_file.xlsx', index_col=[0, 1], header=[0, 1])
+
+In [325]: df
+Out[325]:
+c1 a
+c2 b d
+lvl1 lvl2
+a c 1 5
+ d 2 6
+b c 3 7
+ d 4 8
+
+```
+
+#### Parsing specific columns
+
+It is often the case that users will insert columns to do temporary computations in Excel, and you may not want to read in those columns. `read_excel` takes a `usecols` keyword to allow you to specify a subset of columns to parse.
+
+*Deprecated since version 0.24.0.*
+
+Passing in a single integer for `usecols` is deprecated. Please pass in a list of integers from 0 to `usecols` inclusive instead.
+
+If `usecols` is an integer, then it is assumed to indicate the last column to be parsed.
+
+```python
+pd.read_excel('path_to_file.xls', 'Sheet1', usecols=2)
+
+```
+
+You can also specify a comma-delimited set of Excel columns and ranges as a string:
+
+```python
+pd.read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')
+
+```
+
+If `usecols` is a list of integers, then it is assumed to be the file column indices to be parsed.
+
+```python
+pd.read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
+
+```
+
+Element order is ignored, so `usecols=[0, 1]` is the same as `[1, 0]`.
+
+*New in version 0.24.*
+
+If `usecols` is a list of strings, it is assumed that each string corresponds to a column name provided either by the user in `names` or inferred from the document header row(s). Those strings define which columns will be parsed:
+
+```python
+pd.read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])
+
+```
+
+Element order is ignored here as well, so `usecols=['baz', 'joe']` is the same as `['joe', 'baz']`.
+
+*New in version 0.24.*
+
+If `usecols` is callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to `True`.
+
+```python
+pd.read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
+
+```
+
+#### Parsing dates
+
+Datetime-like values are normally automatically converted to the appropriate dtype when reading the excel file. But if you have a column of strings that *look* like dates (but are not actually formatted as dates in excel), you can use the `parse_dates` keyword to parse those strings to datetimes:
+
+```python
+pd.read_excel('path_to_file.xls', 'Sheet1', parse_dates=['date_strings'])
+
+```
+
+#### Cell converters
+
+It is possible to transform the contents of Excel cells via the `converters` option. For instance, to convert a column to boolean:
+
+```python
+pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})
+
+```
+
+This option handles missing values and applies the conversion to missing data as expected. Because the transformations are applied cell by cell rather than to the column as a whole, the array dtype is not guaranteed. For instance, a column of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float. You can manually mask missing data to recover integer dtype:
+
+```python
+def cfun(x):
+ return int(x) if x else -1
+
+
+pd.read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})
+
+```
+
+#### Dtype specifications
+
+*New in version 0.20.*
+
+As an alternative to converters, the type of a whole column can be specified using the *dtype* keyword, which takes a dictionary mapping column names to types. To interpret data with no type inference, use the type `str` or `object`:
+
+```python
+pd.read_excel('path_to_file.xls', dtype={'MyInts': 'int64', 'MyText': str})
+
+```
+
+### Writing Excel files
+
+#### Writing Excel files to disk
+
+To write a `DataFrame` object to a sheet of an Excel file, you can use the `to_excel` instance method. The arguments are largely the same as `to_csv` described above; the first argument is the name of the excel file, and the optional second argument is the name of the sheet to which the `DataFrame` should be written. For example:
+
+```python
+df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
+
+```
+
+Files with a `.xls` extension will be written using `xlwt`, and those with a `.xlsx` extension will be written using `xlsxwriter` (if available) or `openpyxl`.
+
+The `DataFrame` will be written in a way that tries to mimic the REPL ("read-evaluate-print loop") output. The `index_label` will be placed in the second row instead of the first. You can place it in the first row by setting the `merge_cells` option in `to_excel()` to `False`:
+
+```python
+df.to_excel('path_to_file.xlsx', index_label='label', merge_cells=False)
+
+```
+
+In order to write separate `DataFrames` to separate sheets in a single Excel file, one can use the `ExcelWriter` class.
+
+```python
+with pd.ExcelWriter('path_to_file.xlsx') as writer:
+ df1.to_excel(writer, sheet_name='Sheet1')
+ df2.to_excel(writer, sheet_name='Sheet2')
+
+```
+
+::: tip Note
+
+To wring a little more performance out of `read_excel`, Excel stores all numeric data as floats internally. Because this can produce unexpected behavior when reading in data, pandas by default converts these floats back to integers when no information is lost (`1.0 --> 1`). You can disable this behavior by passing `convert_float=False`, which may give a slight performance improvement.
+
+:::
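+
+A hedged one-line sketch (the file path is a placeholder): keep Excel's float representation instead of down-casting `1.0 --> 1`.
+
+```python
+pd.read_excel('path_to_file.xls', sheet_name='Sheet1', convert_float=False)
+
+```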
+
+#### Writing Excel files to memory
+
+Pandas supports writing Excel files to buffer-like objects such as `StringIO` or `BytesIO` using `ExcelWriter`.
+
+```python
+# Safe import for either Python 2.x or 3.x
+try:
+ from io import BytesIO
+except ImportError:
+ from cStringIO import StringIO as BytesIO
+
+bio = BytesIO()
+
+# By setting the 'engine' in the ExcelWriter constructor.
+writer = pd.ExcelWriter(bio, engine='xlsxwriter')
+df.to_excel(writer, sheet_name='Sheet1')
+
+# Save the workbook
+writer.save()
+
+# Seek to the beginning and read to copy the workbook to a variable in memory
+bio.seek(0)
+workbook = bio.read()
+
+```
+::: tip Note
+
+`engine` is optional but recommended. Setting the engine determines the version of workbook produced. Setting `engine='xlrd'` will produce an Excel 2003-format workbook (xls). Using either `'openpyxl'` or `'xlsxwriter'` will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted workbook is produced.
+
+:::
+
+### Excel writer engines
+
+pandas chooses an Excel writer via two methods:
+
+1. the `engine` keyword argument
+2. the filename extension (via the default specified in config options)
+
+By default, pandas uses [XlsxWriter](https://xlsxwriter.readthedocs.io/) for `.xlsx`, [openpyxl](https://openpyxl.readthedocs.io/) for `.xlsm`, and [xlwt](http://www.python-excel.org/) for `.xls` files. If you have multiple engines installed, you can set the default engine via the [config options](https://www.pypandas.cn/docs/user_guide/options.html#options) `io.excel.xlsx.writer` and `io.excel.xls.writer`. pandas will fall back on [openpyxl](https://openpyxl.readthedocs.io/) for `.xlsx` files if [XlsxWriter](https://xlsxwriter.readthedocs.io/) is not available.
+
+To specify which writer you want to use, you can pass an engine keyword argument to `to_excel` and to `ExcelWriter`. The built-in engines are:
+
+- `openpyxl`: version 2.4 or higher is required.
+- `xlsxwriter`
+- `xlwt`
+
+```python
+# By setting the 'engine' in the DataFrame 'to_excel()' methods.
+df.to_excel('path_to_file.xlsx', sheet_name='Sheet1', engine='xlsxwriter')
+
+# By setting the 'engine' in the ExcelWriter constructor.
+writer = pd.ExcelWriter('path_to_file.xlsx', engine='xlsxwriter')
+
+# Or via pandas configuration.
+from pandas import options # noqa: E402
+options.io.excel.xlsx.writer = 'xlsxwriter'
+
+df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
+
+```
+
+### Style and formatting
+
+The look and feel of Excel worksheets created from pandas can be modified using the following parameters of the `DataFrame`'s `to_excel` method.
+
+- `float_format` : format string for floating point numbers (default `None`).
+- `freeze_panes` : a tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default `None`).
+
+Using the [XlsxWriter](https://xlsxwriter.readthedocs.io/) engine provides many options for modifying the format of an Excel worksheet created with the `to_excel` method. Excellent examples can be found in the [XlsxWriter](https://xlsxwriter.readthedocs.io/) documentation here: [https://xlsxwriter.readthedocs.io/working_with_pandas.html](https://xlsxwriter.readthedocs.io/working_with_pandas.html)
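+
+A small sketch of these two arguments (the output path is a placeholder):
+
+```python
+df.to_excel('path_to_file.xlsx',
+            sheet_name='Sheet1',
+            float_format='%.2f',   # two-decimal floats
+            freeze_panes=(1, 1))   # freeze the header row and the first column
+
+```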
+
+## OpenDocument Spreadsheets
+
+*New in version 0.25.*
+
+The [`read_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel "`read_excel`") method can also read OpenDocument spreadsheets using the `odfpy` module. The semantics and features for reading OpenDocument spreadsheets match what can be done for [Excel files](https://www.pypandas.cn/docs/user_guide/io.html#excel-files "Excel files") using `engine='odf'`.
+
+```python
+# Returns a DataFrame
+pd.read_excel('path_to_file.ods', engine='odf')
+```
+::: tip Note
+
+Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented.
+
+:::
+
+## Clipboard
+
+A handy way to grab data is to use the `read_clipboard()` method, which takes the contents of the clipboard buffer and passes them to the `read_csv` method. For instance, you can copy the following text to the clipboard (CTRL-C on many operating systems):
+
+```python
+ A B C
+x 1 4 p
+y 2 5 q
+z 3 6 r
+```
+
+And then import the data directly into a `DataFrame` by calling:
+
+```python
+>>> clipdf = pd.read_clipboard()
+>>> clipdf
+ A B C
+x 1 4 p
+y 2 5 q
+z 3 6 r
+```
+
+The `to_clipboard` method can be used to write the contents of a `DataFrame` to the clipboard. Following this, you can paste the clipboard contents into other applications (CTRL-V on many operating systems). Here we illustrate writing a `DataFrame` into the clipboard and reading it back.
+
+```python
+>>> df = pd.DataFrame({'A': [1, 2, 3],
+... 'B': [4, 5, 6],
+... 'C': ['p', 'q', 'r']},
+... index=['x', 'y', 'z'])
+>>> df
+ A B C
+x 1 4 p
+y 2 5 q
+z 3 6 r
+>>> df.to_clipboard()
+>>> pd.read_clipboard()
+ A B C
+x 1 4 p
+y 2 5 q
+z 3 6 r
+```
+
+We can see that we got the same content back, which we had earlier written to the clipboard.
+
+::: tip Note
+
+You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.
+
+:::
+
+## Pickling
+
+All pandas objects are equipped with `to_pickle` methods which use Python's `cPickle` module to save data structures to disk in the pickle format.
+
+```python
+In [326]: df
+Out[326]:
+c1 a
+c2 b d
+lvl1 lvl2
+a c 1 5
+ d 2 6
+b c 3 7
+ d 4 8
+
+In [327]: df.to_pickle('foo.pkl')
+```
+
+The `read_pickle` function in the `pandas` namespace can be used to load any pickled pandas object (or any other pickled object) from a file:
+
+```python
+In [328]: pd.read_pickle('foo.pkl')
+Out[328]:
+c1 a
+c2 b d
+lvl1 lvl2
+a c 1 5
+ d 2 6
+b c 3 7
+ d 4 8
+```
+
+::: danger Warning
+
+Loading pickled data received from untrusted sources can be unsafe.
+See: https://docs.python.org/3/library/pickle.html
+
+:::
+
+::: danger Warning
+
+[`read_pickle()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html#pandas.read_pickle "`read_pickle()`") is only guaranteed to be backwards compatible to pandas version 0.20.3.
+
+:::
+
+### Compressed pickle files
+
+*New in version 0.20.0.*
+
+[`read_pickle()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html#pandas.read_pickle "`read_pickle()`"), [`DataFrame.to_pickle()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html#pandas.DataFrame.to_pickle) and [`Series.to_pickle()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_pickle.html#pandas.Series.to_pickle) can read and write compressed pickle files. The compression types `gzip`, `bz2` and `xz` are supported for both reading and writing. The `zip` format only supports reading, and the archive must contain exactly one data file.
+
+The compression type can be an explicit parameter or be inferred from the file extension: filenames ending in `'.gz'`, `'.bz2'`, `'.zip'` or `'.xz'` imply `gzip`, `bz2`, `zip` or `xz` compression, respectively.
+
+```python
+In [329]: df = pd.DataFrame({
+ .....: 'A': np.random.randn(1000),
+ .....: 'B': 'foo',
+ .....: 'C': pd.date_range('20130101', periods=1000, freq='s')})
+ .....:
+
+In [330]: df
+Out[330]:
+ A B C
+0 -0.288267 foo 2013-01-01 00:00:00
+1 -0.084905 foo 2013-01-01 00:00:01
+2 0.004772 foo 2013-01-01 00:00:02
+3 1.382989 foo 2013-01-01 00:00:03
+4 0.343635 foo 2013-01-01 00:00:04
+.. ... ... ...
+995 -0.220893 foo 2013-01-01 00:16:35
+996 0.492996 foo 2013-01-01 00:16:36
+997 -0.461625 foo 2013-01-01 00:16:37
+998 1.361779 foo 2013-01-01 00:16:38
+999 -1.197988 foo 2013-01-01 00:16:39
+
+[1000 rows x 3 columns]
+```
+Using an explicit compression type:
+
+```python
+In [331]: df.to_pickle("data.pkl.compress", compression="gzip")
+
+In [332]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")
+
+In [333]: rt
+Out[333]:
+ A B C
+0 -0.288267 foo 2013-01-01 00:00:00
+1 -0.084905 foo 2013-01-01 00:00:01
+2 0.004772 foo 2013-01-01 00:00:02
+3 1.382989 foo 2013-01-01 00:00:03
+4 0.343635 foo 2013-01-01 00:00:04
+.. ... ... ...
+995 -0.220893 foo 2013-01-01 00:16:35
+996 0.492996 foo 2013-01-01 00:16:36
+997 -0.461625 foo 2013-01-01 00:16:37
+998 1.361779 foo 2013-01-01 00:16:38
+999 -1.197988 foo 2013-01-01 00:16:39
+
+[1000 rows x 3 columns]
+```
+Inferring compression type from the extension:
+
+```python
+In [334]: df.to_pickle("data.pkl.xz", compression="infer")
+
+In [335]: rt = pd.read_pickle("data.pkl.xz", compression="infer")
+
+In [336]: rt
+Out[336]:
+ A B C
+0 -0.288267 foo 2013-01-01 00:00:00
+1 -0.084905 foo 2013-01-01 00:00:01
+2 0.004772 foo 2013-01-01 00:00:02
+3 1.382989 foo 2013-01-01 00:00:03
+4 0.343635 foo 2013-01-01 00:00:04
+.. ... ... ...
+995 -0.220893 foo 2013-01-01 00:16:35
+996 0.492996 foo 2013-01-01 00:16:36
+997 -0.461625 foo 2013-01-01 00:16:37
+998 1.361779 foo 2013-01-01 00:16:38
+999 -1.197988 foo 2013-01-01 00:16:39
+
+[1000 rows x 3 columns]
+```
+The default is to 'infer':
+
+```python
+In [337]: df.to_pickle("data.pkl.gz")
+
+In [338]: rt = pd.read_pickle("data.pkl.gz")
+
+In [339]: rt
+Out[339]:
+ A B C
+0 -0.288267 foo 2013-01-01 00:00:00
+1 -0.084905 foo 2013-01-01 00:00:01
+2 0.004772 foo 2013-01-01 00:00:02
+3 1.382989 foo 2013-01-01 00:00:03
+4 0.343635 foo 2013-01-01 00:00:04
+.. ... ... ...
+995 -0.220893 foo 2013-01-01 00:16:35
+996 0.492996 foo 2013-01-01 00:16:36
+997 -0.461625 foo 2013-01-01 00:16:37
+998 1.361779 foo 2013-01-01 00:16:38
+999 -1.197988 foo 2013-01-01 00:16:39
+
+[1000 rows x 3 columns]
+
+In [340]: df["A"].to_pickle("s1.pkl.bz2")
+
+In [341]: rt = pd.read_pickle("s1.pkl.bz2")
+
+In [342]: rt
+Out[342]:
+0 -0.288267
+1 -0.084905
+2 0.004772
+3 1.382989
+4 0.343635
+ ...
+995 -0.220893
+996 0.492996
+997 -0.461625
+998 1.361779
+999 -1.197988
+Name: A, Length: 1000, dtype: float64
+```
+## msgpack (a binary format)
+
+pandas supports the `msgpack` format for object serialization. It is a lightweight, portable binary format, similar to binary JSON, that is highly space efficient and provides good performance both for writing (serialization) and reading (deserialization).
+
+::: danger Warning
+
+The msgpack format is deprecated as of version 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects.
+
+:::
+
+::: danger Warning
+
+[`read_msgpack()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_msgpack.html#pandas.read_msgpack "`read_msgpack()`") is only guaranteed to be backwards compatible to pandas version 0.20.3.
+
+:::
+
+```python
+In [343]: df = pd.DataFrame(np.random.rand(5, 2), columns=list('AB'))
+
+In [344]: df.to_msgpack('foo.msg')
+
+In [345]: pd.read_msgpack('foo.msg')
+Out[345]:
+ A B
+0 0.275432 0.293583
+1 0.842639 0.165381
+2 0.608925 0.778891
+3 0.136543 0.029703
+4 0.318083 0.604870
+
+In [346]: s = pd.Series(np.random.rand(5), index=pd.date_range('20130101', periods=5))
+```
+
+You can pass a list of objects and you will receive them back on deserialization.
+
+```python
+In [347]: pd.to_msgpack('foo.msg', df, 'foo', np.array([1, 2, 3]), s)
+
+In [348]: pd.read_msgpack('foo.msg')
+Out[348]:
+[ A B
+ 0 0.275432 0.293583
+ 1 0.842639 0.165381
+ 2 0.608925 0.778891
+ 3 0.136543 0.029703
+ 4 0.318083 0.604870, 'foo', array([1, 2, 3]), 2013-01-01 0.330824
+ 2013-01-02 0.790825
+ 2013-01-03 0.308468
+ 2013-01-04 0.092397
+ 2013-01-05 0.703091
+ Freq: D, dtype: float64]
+```
+You can pass `iterator=True` to iterate over the unpacked results:
+
+```python
+In [349]: for o in pd.read_msgpack('foo.msg', iterator=True):
+ .....: print(o)
+ .....:
+ A B
+0 0.275432 0.293583
+1 0.842639 0.165381
+2 0.608925 0.778891
+3 0.136543 0.029703
+4 0.318083 0.604870
+foo
+[1 2 3]
+2013-01-01 0.330824
+2013-01-02 0.790825
+2013-01-03 0.308468
+2013-01-04 0.092397
+2013-01-05 0.703091
+Freq: D, dtype: float64
+```
+You can also pass `append=True` to the writer to append to an existing pack:
+
+```python
+In [350]: df.to_msgpack('foo.msg', append=True)
+
+In [351]: pd.read_msgpack('foo.msg')
+Out[351]:
+[ A B
+ 0 0.275432 0.293583
+ 1 0.842639 0.165381
+ 2 0.608925 0.778891
+ 3 0.136543 0.029703
+ 4 0.318083 0.604870, 'foo', array([1, 2, 3]), 2013-01-01 0.330824
+ 2013-01-02 0.790825
+ 2013-01-03 0.308468
+ 2013-01-04 0.092397
+ 2013-01-05 0.703091
+ Freq: D, dtype: float64, A B
+ 0 0.275432 0.293583
+ 1 0.842639 0.165381
+ 2 0.608925 0.778891
+ 3 0.136543 0.029703
+ 4 0.318083 0.604870]
+```
+Unlike other io methods, `to_msgpack` is available both on a per-object basis, `df.to_msgpack()`, and as the top-level `pd.to_msgpack(...)`, where you can pack arbitrary collections of Python lists, dicts and scalars, intermixed with pandas objects.
+
+```python
+In [352]: pd.to_msgpack('foo2.msg', {'dict': [{'df': df}, {'string': 'foo'},
+ .....: {'scalar': 1.}, {'s': s}]})
+ .....:
+
+In [353]: pd.read_msgpack('foo2.msg')
+Out[353]:
+{'dict': ({'df': A B
+ 0 0.275432 0.293583
+ 1 0.842639 0.165381
+ 2 0.608925 0.778891
+ 3 0.136543 0.029703
+ 4 0.318083 0.604870},
+ {'string': 'foo'},
+ {'scalar': 1.0},
+ {'s': 2013-01-01 0.330824
+ 2013-01-02 0.790825
+ 2013-01-03 0.308468
+ 2013-01-04 0.092397
+ 2013-01-05 0.703091
+ Freq: D, dtype: float64})}
+```
+
+### Read/write API
+
+Msgpacks can also be read from and written to strings.
+
+```python
+In [354]: df.to_msgpack()
+Out[354]: b'\x84\xa3typ\xadblock_manager\xa5klass\xa9DataFrame\xa4axes\x92\x86\xa3typ\xa5index\xa5klass\xa5Index\xa4name\xc0\xa5dtype\xa6object\xa4data\x92\xa1A\xa1B\xa8compress\xc0\x86\xa3typ\xabrange_index\xa5klass\xaaRangeIndex\xa4name\xc0\xa5start\x00\xa4stop\x05\xa4step\x01\xa6blocks\x91\x86\xa4locs\x86\xa3typ\xa7ndarray\xa5shape\x91\x02\xa4ndim\x01\xa5dtype\xa5int64\xa4data\xd8\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\xa8compress\xc0\xa6values\xc7P\x00\xc84 \x84\xac\xa0\xd1?\x0f\xa4.\xb5\xe6\xf6\xea?\xb9\x85\x9aLO|\xe3?\xac\xf0\xd7\x81>z\xc1?\\\xca\x97\ty[\xd4?\x9c\x9b\x8a:\x11\xca\xd2?\x14zX\xd01+\xc5?4=\x19b\xad\xec\xe8?\xc0!\xe9\xf4\x8ej\x9e?\xa7>_\xac\x17[\xe3?\xa5shape\x92\x02\x05\xa5dtype\xa7float64\xa5klass\xaaFloatBlock\xa8compress\xc0'
+```
+Furthermore you can concatenate such strings to produce a list of the original objects.
+
+```python
+In [355]: pd.read_msgpack(df.to_msgpack() + s.to_msgpack())
+Out[355]:
+[ A B
+ 0 0.275432 0.293583
+ 1 0.842639 0.165381
+ 2 0.608925 0.778891
+ 3 0.136543 0.029703
+ 4 0.318083 0.604870, 2013-01-01 0.330824
+ 2013-01-02 0.790825
+ 2013-01-03 0.308468
+ 2013-01-04 0.092397
+ 2013-01-05 0.703091
+ Freq: D, dtype: float64]
+```
+## HDF5 (PyTables)
+
+`HDFStore` is a dict-like object which reads and writes pandas objects using the high performance HDF5 format, via the excellent [PyTables](https://www.pytables.org/ "PyTables") library. See the [cookbook](https://www.pypandas.cn/docs/user_guide/cookbook.html#cookbook-hdf "cookbook") for some advanced strategies.
+
+::: danger Warning
+
+pandas requires `PyTables` >= 3.0.0. There is an indexing bug in `PyTables` < 3.2 which may appear when querying stores using an index. If you see a subset of results being returned, upgrade to `PyTables` >= 3.2. Stores created previously will need to be rewritten using the updated version.
+
+:::
+
+```python
+In [356]: store = pd.HDFStore('store.h5')
+
+In [357]: print(store)
+<class 'pandas.io.pytables.HDFStore'>
+File path: store.h5
+```
+
+Objects can be written to the file just like adding key-value pairs to a dict:
+
+```python
+In [358]: index = pd.date_range('1/1/2000', periods=8)
+
+In [359]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
+
+In [360]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
+ .....: columns=['A', 'B', 'C'])
+ .....:
+
+# store.put('s', s) is an equivalent method
+In [361]: store['s'] = s
+
+In [362]: store['df'] = df
+
+In [363]: store
+Out[363]:
+<class 'pandas.io.pytables.HDFStore'>
+File path: store.h5
+```
+In a current or later Python session, you can retrieve stored objects:
+
+```python
+# store.get('df') is an equivalent method
+In [364]: store['df']
+Out[364]:
+ A B C
+2000-01-01 -0.426936 -1.780784 0.322691
+2000-01-02 1.638174 -2.184251 0.049673
+2000-01-03 -1.022803 0.889445 2.827717
+2000-01-04 1.767446 -1.305266 -0.378355
+2000-01-05 0.486743 0.954551 0.859671
+2000-01-06 -1.170458 -1.211386 -0.852728
+2000-01-07 -0.450781 1.064650 1.014927
+2000-01-08 -0.810399 0.254343 -0.875526
+
+# dotted (attribute) access provides get as well
+In [365]: store.df
+Out[365]:
+ A B C
+2000-01-01 -0.426936 -1.780784 0.322691
+2000-01-02 1.638174 -2.184251 0.049673
+2000-01-03 -1.022803 0.889445 2.827717
+2000-01-04 1.767446 -1.305266 -0.378355
+2000-01-05 0.486743 0.954551 0.859671
+2000-01-06 -1.170458 -1.211386 -0.852728
+2000-01-07 -0.450781 1.064650 1.014927
+2000-01-08 -0.810399 0.254343 -0.875526
+```
+
+Deletion of the object specified by the key:
+
+```python
+# store.remove('df') is an equivalent method
+In [366]: del store['df']
+
+In [367]: store
+Out[367]:
+<class 'pandas.io.pytables.HDFStore'>
+File path: store.h5
+```
+
+Closing a store and using a context manager:
+
+```python
+In [368]: store.close()
+
+In [369]: store
+Out[369]:
+<class 'pandas.io.pytables.HDFStore'>
+File path: store.h5
+
+In [370]: store.is_open
+Out[370]: False
+
+# Working with, and automatically closing the store using a context manager
+In [371]: with pd.HDFStore('store.h5') as store:
+ .....: store.keys()
+ .....:
+```
+
+### Read/write API
+
+`HDFStore` supports a top-level API using `read_hdf` for reading and `to_hdf` for writing, similar to how `read_csv` and `to_csv` work.
+
+```python
+In [372]: df_tl = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
+
+In [373]: df_tl.to_hdf('store_tl.h5', 'table', append=True)
+
+In [374]: pd.read_hdf('store_tl.h5', 'table', where=['index>2'])
+Out[374]:
+ A B
+3 3 3
+4 4 4
+```
+HDFStore will by default not drop rows that are all missing. This behavior can be changed by passing `dropna=True`.
+
+```python
+In [375]: df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],
+ .....: 'col2': [1, np.nan, np.nan]})
+ .....:
+
+In [376]: df_with_missing
+Out[376]:
+ col1 col2
+0 0.0 1.0
+1 NaN NaN
+2 2.0 NaN
+
+In [377]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
+ .....: format='table', mode='w')
+ .....:
+
+In [378]: pd.read_hdf('file.h5', 'df_with_missing')
+Out[378]:
+ col1 col2
+0 0.0 1.0
+1 NaN NaN
+2 2.0 NaN
+
+In [379]: df_with_missing.to_hdf('file.h5', 'df_with_missing',
+ .....: format='table', mode='w', dropna=True)
+ .....:
+
+In [380]: pd.read_hdf('file.h5', 'df_with_missing')
+Out[380]:
+ col1 col2
+0 0.0 1.0
+2 2.0 NaN
+```
+### Fixed format
+
+The examples above show storing with `put`, which writes the HDF5 to `PyTables` in a fixed array format, called the `fixed` format. These types of stores are **not** appendable once written (though you can simply remove them and rewrite), **nor** are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The `fixed` format offers very fast writing and somewhat faster reading than `table` stores. This format is used by default with `put` or `to_hdf`, or when specifying `format='fixed'` or `format='f'`.
+
+::: danger Warning
+
+A `fixed` format store will raise a `TypeError` if you try to retrieve from it using a `where`:
+
+```python
+>>> pd.DataFrame(np.random.randn(10, 2)).to_hdf('test_fixed.h5', 'df')
+>>> pd.read_hdf('test_fixed.h5', 'df', where='index>5')
+TypeError: cannot pass a where specification when reading a fixed format.
+ this store must be selected in its entirety
+```
+:::
+
+### Table format
+
+`HDFStore` supports another `PyTables` format on disk, the `table` format. Conceptually a `table` is shaped very much like a DataFrame, with rows and columns. A `table` may be appended to in the same or other sessions. In addition, delete and query operations are supported. The format is selected by passing `format='table'` or `format='t'` to `append`, `put` or `to_hdf`.
+
+This format can also be set as an option, `pd.set_option('io.hdf.default_format', 'table')`, so that `put/append/to_hdf` store in the `table` format by default.
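+
+A minimal sketch of the option route (the store file name is illustrative; the option is reset afterwards). The session below then demonstrates the `table` format via `append`:
+
+``` python
+# Make 'table' the default for put/append/to_hdf, write without format=,
+# then restore the default behaviour.
+pd.set_option('io.hdf.default_format', 'table')
+df.to_hdf('store_default.h5', 'df')
+pd.set_option('io.hdf.default_format', None)
+```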
+
+```python
+In [381]: store = pd.HDFStore('store.h5')
+
+In [382]: df1 = df[0:4]
+
+In [383]: df2 = df[4:]
+
+# append data (creates a table automatically)
+In [384]: store.append('df', df1)
+
+In [385]: store.append('df', df2)
+
+In [386]: store
+Out[386]:
+<class 'pandas.io.pytables.HDFStore'>
+File path: store.h5
+
+# select the entire object
+In [387]: store.select('df')
+Out[387]:
+ A B C
+2000-01-01 -0.426936 -1.780784 0.322691
+2000-01-02 1.638174 -2.184251 0.049673
+2000-01-03 -1.022803 0.889445 2.827717
+2000-01-04 1.767446 -1.305266 -0.378355
+2000-01-05 0.486743 0.954551 0.859671
+2000-01-06 -1.170458 -1.211386 -0.852728
+2000-01-07 -0.450781 1.064650 1.014927
+2000-01-08 -0.810399 0.254343 -0.875526
+
+# the type of stored data
+In [388]: store.root.df._v_attrs.pandas_type
+Out[388]: 'frame_table'
+```
+::: tip Note
+
+You can also create a `table` by passing `format='table'` or `format='t'` to a `put` operation.
+
+:::
+
+### Hierarchical keys
+
+Keys to a store can be specified as strings. These can be in a hierarchical path-name like format (e.g. `foo/bar/bah`), which will generate a hierarchy of sub-stores (or `Groups` in PyTables parlance). Keys can be specified without the leading '/' and are **always** absolute (e.g. 'foo' refers to '/foo'). Removal operations can remove everything in the sub-store and **below**, so be careful.
+
+```python
+In [389]: store.put('foo/bar/bah', df)
+
+In [390]: store.append('food/orange', df)
+
+In [391]: store.append('food/apple', df)
+
+In [392]: store
+Out[392]:
+<class 'pandas.io.pytables.HDFStore'>
+File path: store.h5
+
+# a list of keys are returned
+In [393]: store.keys()
+Out[393]: ['/df', '/food/apple', '/food/orange', '/foo/bar/bah']
+
+# remove all nodes under this level
+In [394]: store.remove('food')
+
+In [395]: store
+Out[395]:
+<class 'pandas.io.pytables.HDFStore'>
+File path: store.h5
+```
+
+You can walk through the group hierarchy using the `walk` method, which will yield a tuple for each group key along with the relative keys of its contents.
+
+*New in version 0.24.0.*
+
+```python
+In [396]: for (path, subgroups, subkeys) in store.walk():
+ .....: for subgroup in subgroups:
+ .....: print('GROUP: {}/{}'.format(path, subgroup))
+ .....: for subkey in subkeys:
+ .....: key = '/'.join([path, subkey])
+ .....: print('KEY: {}'.format(key))
+ .....: print(store.get(key))
+ .....:
+GROUP: /foo
+KEY: /df
+ A B C
+2000-01-01 -0.426936 -1.780784 0.322691
+2000-01-02 1.638174 -2.184251 0.049673
+2000-01-03 -1.022803 0.889445 2.827717
+2000-01-04 1.767446 -1.305266 -0.378355
+2000-01-05 0.486743 0.954551 0.859671
+2000-01-06 -1.170458 -1.211386 -0.852728
+2000-01-07 -0.450781 1.064650 1.014927
+2000-01-08 -0.810399 0.254343 -0.875526
+GROUP: /foo/bar
+KEY: /foo/bar/bah
+ A B C
+2000-01-01 -0.426936 -1.780784 0.322691
+2000-01-02 1.638174 -2.184251 0.049673
+2000-01-03 -1.022803 0.889445 2.827717
+2000-01-04 1.767446 -1.305266 -0.378355
+2000-01-05 0.486743 0.954551 0.859671
+2000-01-06 -1.170458 -1.211386 -0.852728
+2000-01-07 -0.450781 1.064650 1.014927
+2000-01-08 -0.810399 0.254343 -0.875526
+```
+::: danger Warning
+
+Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for items stored under the root node.
+
+```python
+In [8]: store.foo.bar.bah
+AttributeError: 'HDFStore' object has no attribute 'foo'
+
+# you can directly access the actual PyTables node but using the root node
+In [9]: store.root.foo.bar.bah
+Out[9]:
+/foo/bar/bah (Group) ''
+ children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]
+```
+
+Instead, use explicit string-based keys:
+
+```python
+In [397]: store['foo/bar/bah']
+Out[397]:
+ A B C
+2000-01-01 -0.426936 -1.780784 0.322691
+2000-01-02 1.638174 -2.184251 0.049673
+2000-01-03 -1.022803 0.889445 2.827717
+2000-01-04 1.767446 -1.305266 -0.378355
+2000-01-05 0.486743 0.954551 0.859671
+2000-01-06 -1.170458 -1.211386 -0.852728
+2000-01-07 -0.450781 1.064650 1.014927
+2000-01-08 -0.810399 0.254343 -0.875526
+```
+:::
+
+### Storing types
+
+#### Storing mixed types in a table
+
+Storing mixed-dtype data is supported. Strings are stored as a fixed width using the maximum size of the appended column. Subsequent attempts at appending longer strings will raise a `ValueError`.
+
+Passing `min_itemsize={'values': size}` as a parameter to append will set a larger minimum for the string columns. Storing `floats, strings, ints, bools, datetime64` is currently supported. For string columns, passing `nan_rep = 'nan'` to append will change the default nan representation on disk (which converts to/from *np.nan*); this defaults to *nan*.
+
+``` python
+In [398]: df_mixed = pd.DataFrame({'A': np.random.randn(8),
+ .....: 'B': np.random.randn(8),
+ .....: 'C': np.array(np.random.randn(8), dtype='float32'),
+ .....: 'string': 'string',
+ .....: 'int': 1,
+ .....: 'bool': True,
+ .....: 'datetime64': pd.Timestamp('20010102')},
+ .....: index=list(range(8)))
+ .....:
+
+In [399]: df_mixed.loc[df_mixed.index[3:5],
+ .....: ['A', 'B', 'string', 'datetime64']] = np.nan
+ .....:
+
+In [400]: store.append('df_mixed', df_mixed, min_itemsize={'values': 50})
+
+In [401]: df_mixed1 = store.select('df_mixed')
+
+In [402]: df_mixed1
+Out[402]:
+ A B C string int bool datetime64
+0 -0.980856 0.298656 0.151508 string 1 True 2001-01-02
+1 -0.906920 -1.294022 0.587939 string 1 True 2001-01-02
+2 0.988185 -0.618845 0.043096 string 1 True 2001-01-02
+3 NaN NaN 0.362451 NaN 1 True NaT
+4 NaN NaN 1.356269 NaN 1 True NaT
+5 -0.772889 -0.340872 1.798994 string 1 True 2001-01-02
+6 -0.043509 -0.303900 0.567265 string 1 True 2001-01-02
+7 0.768606 -0.871948 -0.044348 string 1 True 2001-01-02
+
+In [403]: df_mixed1.dtypes.value_counts()
+Out[403]:
+float64 2
+float32 1
+datetime64[ns] 1
+int64 1
+bool 1
+object 1
+dtype: int64
+
+# we have provided a minimum string column size
+In [404]: store.root.df_mixed.table
+Out[404]:
+/df_mixed/table (Table(8,)) ''
+ description := {
+ "index": Int64Col(shape=(), dflt=0, pos=0),
+ "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
+ "values_block_1": Float32Col(shape=(1,), dflt=0.0, pos=2),
+ "values_block_2": Int64Col(shape=(1,), dflt=0, pos=3),
+ "values_block_3": Int64Col(shape=(1,), dflt=0, pos=4),
+ "values_block_4": BoolCol(shape=(1,), dflt=False, pos=5),
+ "values_block_5": StringCol(itemsize=50, shape=(1,), dflt=b'', pos=6)}
+ byteorder := 'little'
+ chunkshape := (689,)
+ autoindex := True
+ colindexes := {
+ "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
+
+```
+
+#### Storing MultiIndex DataFrames
+
+Storing MultiIndex `DataFrames` as tables is very similar to storing/selecting from homogeneous-index `DataFrames`.
+
+``` python
+In [405]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
+ .....: ['one', 'two', 'three']],
+ .....: codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
+ .....: [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
+ .....: names=['foo', 'bar'])
+ .....:
+
+In [406]: df_mi = pd.DataFrame(np.random.randn(10, 3), index=index,
+ .....: columns=['A', 'B', 'C'])
+ .....:
+
+In [407]: df_mi
+Out[407]:
+ A B C
+foo bar
+foo one 0.031885 0.641045 0.479460
+ two -0.630652 -0.182400 -0.789979
+ three -0.282700 -0.813404 1.252998
+bar one 0.758552 0.384775 -1.133177
+ two -1.002973 -1.644393 -0.311536
+baz two -0.615506 -0.084551 -1.318575
+ three 0.923929 -0.105981 0.429424
+qux one -1.034590 0.542245 -0.384429
+ two 0.170697 -0.200289 1.220322
+ three -1.001273 0.162172 0.376816
+
+In [408]: store.append('df_mi', df_mi)
+
+In [409]: store.select('df_mi')
+Out[409]:
+ A B C
+foo bar
+foo one 0.031885 0.641045 0.479460
+ two -0.630652 -0.182400 -0.789979
+ three -0.282700 -0.813404 1.252998
+bar one 0.758552 0.384775 -1.133177
+ two -1.002973 -1.644393 -0.311536
+baz two -0.615506 -0.084551 -1.318575
+ three 0.923929 -0.105981 0.429424
+qux one -1.034590 0.542245 -0.384429
+ two 0.170697 -0.200289 1.220322
+ three -1.001273 0.162172 0.376816
+
+# the levels are automatically included as data columns
+In [410]: store.select('df_mi', 'foo=bar')
+Out[410]:
+ A B C
+foo bar
+bar one 0.758552 0.384775 -1.133177
+ two -1.002973 -1.644393 -0.311536
+
+```
+
+### Querying
+
+#### Querying a table
+
+`select` and `delete` operations have an optional criterion that can be specified to select/delete only a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.
+
+A query is specified using the `Term` class under the hood, as a boolean expression.
+
+- `index` and `columns` are supported indexers of `DataFrames`.
+- if `data_columns` are specified, these can be used as additional indexers.
+
+Valid comparison operators are:
+
+``=, ==, !=, >, >=, <, <=``
+
+Valid boolean expressions are combined with:
+
+- `|` : or
+- `&` : and
+- `(` and `)` : for grouping
+
+These rules are similar to how boolean expressions are used in pandas for indexing.
+
+::: tip Note
+
+- `=` will be automatically expanded to the comparison operator `==`
+- `~` is the not operator, but can only be used in very limited circumstances
+- If a list/tuple of expressions is passed, they will be combined via `&`
+
+:::
+
+The following are valid expressions:
+- ``'index >= date'``
+- ``"columns = ['A', 'D']"``
+- ``"columns in ['A', 'D']"``
+- ``'columns = A'``
+- ``'columns == A'``
+- ``"~(columns = ['A', 'B'])"``
+- ``'index > df.index[3] & string = "bar"'``
+- ``'(index > df.index[3] & index <= df.index[6]) | string = "bar"'``
+- ``"ts >= Timestamp('2012-02-01')"``
+- ``"major_axis>=20130101"``
+
+The `indexers` are on the left-hand side of the sub-expression:
+`columns`, `major_axis`, `ts`
+
+The right-hand side of the sub-expression (after a comparison operator) can be:
+
+- functions that will be evaluated, e.g. ``Timestamp('2012-02-01')``
+- strings, e.g. ``"bar"``
+- date-like, e.g. ``20130101``, or ``"20130101"``
+- lists, e.g. ``"['A', 'B']"``
+- variables that are defined in the local name space, e.g. ``date``
+
+::: tip Note
+
+Passing a string to a query by interpolating it into the query expression is not recommended. Simply assign the string of interest to a variable and use that variable in the expression. For example, do this
+
+``` python
+string = "HolyMoly'"
+store.select('df', 'index == string')
+
+```
+instead of this
+
+``` python
+string = "HolyMoly'"
+store.select('df', 'index == %s' % string)
+
+```
+
+The latter will **not** work and will raise a `SyntaxError`. Note that there's a single quote followed by a double quote in the `string` variable.
+
+
+If you *must* interpolate, use the `'%r'` format specifier
+
+``` python
+store.select('df', 'index == %r' % string)
+
+```
+
+which will quote `string`.
+
+:::
+
+Here are some examples:
+
+``` python
+In [411]: dfq = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'),
+ .....: index=pd.date_range('20130101', periods=10))
+ .....:
+
+In [412]: store.append('dfq', dfq, format='table', data_columns=True)
+
+```
+Use boolean expressions, with in-line function evaluation.
+
+``` python
+In [413]: store.select('dfq', "index>pd.Timestamp('20130104') & columns=['A', 'B']")
+Out[413]:
+ A B
+2013-01-05 0.450263 0.755221
+2013-01-06 0.019915 0.300003
+2013-01-07 1.878479 -0.026513
+2013-01-08 3.272320 0.077044
+2013-01-09 -0.398346 0.507286
+2013-01-10 0.516017 -0.501550
+
+```
+
+Use inline column reference.
+
+``` python
+In [414]: store.select('dfq', where="A>0 or C>0")
+Out[414]:
+ A B C D
+2013-01-01 -0.161614 -1.636805 0.835417 0.864817
+2013-01-02 0.843452 -0.122918 -0.026122 -1.507533
+2013-01-03 0.335303 -1.340566 -1.024989 1.125351
+2013-01-05 0.450263 0.755221 -1.506656 0.808794
+2013-01-06 0.019915 0.300003 -0.727093 -1.119363
+2013-01-07 1.878479 -0.026513 0.573793 0.154237
+2013-01-08 3.272320 0.077044 0.397034 -0.613983
+2013-01-10 0.516017 -0.501550 0.138212 0.218366
+
+```
+The `columns` keyword can be supplied to select a list of columns to be returned; this is equivalent to passing `'columns=list_of_columns_to_filter'`:
+
+``` python
+In [415]: store.select('df', "columns=['A', 'B']")
+Out[415]:
+ A B
+2000-01-01 -0.426936 -1.780784
+2000-01-02 1.638174 -2.184251
+2000-01-03 -1.022803 0.889445
+2000-01-04 1.767446 -1.305266
+2000-01-05 0.486743 0.954551
+2000-01-06 -1.170458 -1.211386
+2000-01-07 -0.450781 1.064650
+2000-01-08 -0.810399 0.254343
+
+```
+
+The `start` and `stop` parameters can be specified to limit the total search space. These are in terms of the total number of rows in the table.
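+
+A minimal sketch, reusing the `df` table stored above (the row bounds are illustrative):
+
+``` python
+# Only the first three source rows of the table are searched/returned.
+store.select('df', start=0, stop=3)
+```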
+
+::: tip Note
+
+`select` will raise a `ValueError` if the query expression has an unknown variable reference. Usually this means that you are trying to select on a column that is **not** a data_column.
+
+`select` will raise a `SyntaxError` if the query expression is not valid.
+
+:::
+
+#### Query timedelta64[ns]
+
+You can store and query using the `timedelta64[ns]` type. Terms can be specified in the format `<float>(<unit>)`, where the float may be signed (and fractional) and the unit can be `D,s,ms,us,ns` for the timedelta. Here's an example:
+
+```python
+In [416]: from datetime import timedelta
+
+In [417]: dftd = pd.DataFrame({'A': pd.Timestamp('20130101'),
+ .....: 'B': [pd.Timestamp('20130101') + timedelta(days=i,
+ .....: seconds=10)
+ .....: for i in range(10)]})
+ .....:
+
+In [418]: dftd['C'] = dftd['A'] - dftd['B']
+
+In [419]: dftd
+Out[419]:
+ A B C
+0 2013-01-01 2013-01-01 00:00:10 -1 days +23:59:50
+1 2013-01-01 2013-01-02 00:00:10 -2 days +23:59:50
+2 2013-01-01 2013-01-03 00:00:10 -3 days +23:59:50
+3 2013-01-01 2013-01-04 00:00:10 -4 days +23:59:50
+4 2013-01-01 2013-01-05 00:00:10 -5 days +23:59:50
+5 2013-01-01 2013-01-06 00:00:10 -6 days +23:59:50
+6 2013-01-01 2013-01-07 00:00:10 -7 days +23:59:50
+7 2013-01-01 2013-01-08 00:00:10 -8 days +23:59:50
+8 2013-01-01 2013-01-09 00:00:10 -9 days +23:59:50
+9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
+
+In [420]: store.append('dftd', dftd, data_columns=True)
+
+In [421]: store.select('dftd', "C<'-3.5D'")
+Out[421]:
+ A B C
+4 2013-01-01 2013-01-05 00:00:10 -5 days +23:59:50
+5 2013-01-01 2013-01-06 00:00:10 -6 days +23:59:50
+6 2013-01-01 2013-01-07 00:00:10 -7 days +23:59:50
+7 2013-01-01 2013-01-08 00:00:10 -8 days +23:59:50
+8 2013-01-01 2013-01-09 00:00:10 -9 days +23:59:50
+9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
+```
+
+#### Indexing
+
+You can create/modify an index for a table with `create_table_index` after data is already in the table (i.e. after an `append/put` operation). Creating a table index is **highly** encouraged: it will speed up your queries a great deal when you use a `select` with the indexed dimension as the `where`.
+
+::: tip Note
+
+Indexes are automagically created on the indexables and on any data columns you specify. This behavior can be turned off by passing `index=False` to `append`.
+
+:::
+
+```python
+# we have automagically already created an index (in the first section)
+In [422]: i = store.root.df.table.cols.index.index
+
+In [423]: i.optlevel, i.kind
+Out[423]: (6, 'medium')
+
+# change an index by passing new parameters
+In [424]: store.create_table_index('df', optlevel=9, kind='full')
+
+In [425]: i = store.root.df.table.cols.index.index
+
+In [426]: i.optlevel, i.kind
+Out[426]: (9, 'full')
+```
+
+Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate the index at the end.
+
+```python
+In [427]: df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
+
+In [428]: df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
+
+In [429]: st = pd.HDFStore('appends.h5', mode='w')
+
+In [430]: st.append('df', df_1, data_columns=['B'], index=False)
+
+In [431]: st.append('df', df_2, data_columns=['B'], index=False)
+
+In [432]: st.get_storer('df').table
+Out[432]:
+/df/table (Table(20,)) ''
+ description := {
+ "index": Int64Col(shape=(), dflt=0, pos=0),
+ "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
+ "B": Float64Col(shape=(), dflt=0.0, pos=2)}
+ byteorder := 'little'
+ chunkshape := (2730,)
+```
+
+Then create the index when finished appending.
+
+```python
+In [433]: st.create_table_index('df', columns=['B'], optlevel=9, kind='full')
+
+In [434]: st.get_storer('df').table
+Out[434]:
+/df/table (Table(20,)) ''
+ description := {
+ "index": Int64Col(shape=(), dflt=0, pos=0),
+ "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
+ "B": Float64Col(shape=(), dflt=0.0, pos=2)}
+ byteorder := 'little'
+ chunkshape := (2730,)
+ autoindex := True
+ colindexes := {
+ "B": Index(9, full, shuffle, zlib(1)).is_csi=True}
+
+In [435]: st.close()
+```
+See [here](https://stackoverflow.com/questions/17893370/ptrepack-sortby-needs-full-index) for how to create a completely-sorted-index (CSI) on an existing store.
+
+#### Query via data columns
+
+You can designate (and index) certain columns that you want to be able to perform queries on (other than the indexable columns, which you can always query). For instance, say you want to perform this common operation on-disk and return just the frame that matches the query. You can specify `data_columns=True` to force all columns to be data columns.
+
+```python
+In [436]: df_dc = df.copy()
+
+In [437]: df_dc['string'] = 'foo'
+
+In [438]: df_dc.loc[df_dc.index[4:6], 'string'] = np.nan
+
+In [439]: df_dc.loc[df_dc.index[7:9], 'string'] = 'bar'
+
+In [440]: df_dc['string2'] = 'cool'
+
+In [441]: df_dc.loc[df_dc.index[1:3], ['B', 'C']] = 1.0
+
+In [442]: df_dc
+Out[442]:
+ A B C string string2
+2000-01-01 -0.426936 -1.780784 0.322691 foo cool
+2000-01-02 1.638174 1.000000 1.000000 foo cool
+2000-01-03 -1.022803 1.000000 1.000000 foo cool
+2000-01-04 1.767446 -1.305266 -0.378355 foo cool
+2000-01-05 0.486743 0.954551 0.859671 NaN cool
+2000-01-06 -1.170458 -1.211386 -0.852728 NaN cool
+2000-01-07 -0.450781 1.064650 1.014927 foo cool
+2000-01-08 -0.810399 0.254343 -0.875526 bar cool
+
+# on-disk operations
+In [443]: store.append('df_dc', df_dc, data_columns=['B', 'C', 'string', 'string2'])
+
+In [444]: store.select('df_dc', where='B > 0')
+Out[444]:
+ A B C string string2
+2000-01-02 1.638174 1.000000 1.000000 foo cool
+2000-01-03 -1.022803 1.000000 1.000000 foo cool
+2000-01-05 0.486743 0.954551 0.859671 NaN cool
+2000-01-07 -0.450781 1.064650 1.014927 foo cool
+2000-01-08 -0.810399 0.254343 -0.875526 bar cool
+
+# getting creative
+In [445]: store.select('df_dc', 'B > 0 & C > 0 & string == foo')
+Out[445]:
+ A B C string string2
+2000-01-02 1.638174 1.00000 1.000000 foo cool
+2000-01-03 -1.022803 1.00000 1.000000 foo cool
+2000-01-07 -0.450781 1.06465 1.014927 foo cool
+
+# this is in-memory version of this type of selection
+In [446]: df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == 'foo')]
+Out[446]:
+ A B C string string2
+2000-01-02 1.638174 1.00000 1.000000 foo cool
+2000-01-03 -1.022803 1.00000 1.000000 foo cool
+2000-01-07 -0.450781 1.06465 1.014927 foo cool
+
+# we have automagically created this index and the B/C/string/string2
+# columns are stored separately as ``PyTables`` columns
+In [447]: store.root.df_dc.table
+Out[447]:
+/df_dc/table (Table(8,)) ''
+ description := {
+ "index": Int64Col(shape=(), dflt=0, pos=0),
+ "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
+ "B": Float64Col(shape=(), dflt=0.0, pos=2),
+ "C": Float64Col(shape=(), dflt=0.0, pos=3),
+ "string": StringCol(itemsize=3, shape=(), dflt=b'', pos=4),
+ "string2": StringCol(itemsize=4, shape=(), dflt=b'', pos=5)}
+ byteorder := 'little'
+ chunkshape := (1680,)
+ autoindex := True
+ colindexes := {
+ "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
+ "B": Index(6, medium, shuffle, zlib(1)).is_csi=False,
+ "C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
+ "string": Index(6, medium, shuffle, zlib(1)).is_csi=False,
+ "string2": Index(6, medium, shuffle, zlib(1)).is_csi=False}
+```
+There is some performance degradation by making lots of columns into *data columns*, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (of course you can simply read in the data and create a new table!).
+
+#### Iterator
+
+You can pass `iterator=True` or `chunksize=number_in_a_chunk` to `select` and `select_as_multiple` to return an iterator on the results. The default is 50,000 rows returned in a chunk.
+
+```python
+In [448]: for df in store.select('df', chunksize=3):
+ .....: print(df)
+ .....:
+ A B C
+2000-01-01 -0.426936 -1.780784 0.322691
+2000-01-02 1.638174 -2.184251 0.049673
+2000-01-03 -1.022803 0.889445 2.827717
+ A B C
+2000-01-04 1.767446 -1.305266 -0.378355
+2000-01-05 0.486743 0.954551 0.859671
+2000-01-06 -1.170458 -1.211386 -0.852728
+ A B C
+2000-01-07 -0.450781 1.064650 1.014927
+2000-01-08 -0.810399 0.254343 -0.875526
+```
+
+::: tip Note
+
+You can also use the iterator with `read_hdf`, which will open and then automatically close the store when finished iterating.
+
+```python
+for df in pd.read_hdf('store.h5', 'df', chunksize=3):
+ print(df)
+```
+
+:::
+
+Note that the chunksize keyword applies to the **source** rows. So if you are doing a query, the chunksize will subdivide the total rows in the table with the query applied, returning an iterator on potentially unequal sized chunks.
+
+Here is a recipe for generating a query and using it to create equal sized return chunks.
+
+```python
+In [449]: dfeq = pd.DataFrame({'number': np.arange(1, 11)})
+
+In [450]: dfeq
+Out[450]:
+ number
+0 1
+1 2
+2 3
+3 4
+4 5
+5 6
+6 7
+7 8
+8 9
+9 10
+
+In [451]: store.append('dfeq', dfeq, data_columns=['number'])
+
+In [452]: def chunks(l, n):
+ .....: return [l[i:i + n] for i in range(0, len(l), n)]
+ .....:
+
+In [453]: evens = [2, 4, 6, 8, 10]
+
+In [454]: coordinates = store.select_as_coordinates('dfeq', 'number=evens')
+
+In [455]: for c in chunks(coordinates, 2):
+ .....: print(store.select('dfeq', where=c))
+ .....:
+ number
+1 2
+3 4
+ number
+5 6
+7 8
+ number
+9 10
+```
+
+#### Advanced queries
+
+##### Select a single column
+
+To retrieve a single indexable or data column, use the method `select_column`. This will, for example, enable you to get the index very quickly. It returns a `Series` of the result, indexed by the row number. It does not currently accept the `where` selector.
+
+```python
+In [456]: store.select_column('df_dc', 'index')
+Out[456]:
+0 2000-01-01
+1 2000-01-02
+2 2000-01-03
+3 2000-01-04
+4 2000-01-05
+5 2000-01-06
+6 2000-01-07
+7 2000-01-08
+Name: index, dtype: datetime64[ns]
+
+In [457]: store.select_column('df_dc', 'string')
+Out[457]:
+0 foo
+1 foo
+2 foo
+3 foo
+4 NaN
+5 NaN
+6 foo
+7 bar
+Name: string, dtype: object
+```
+
+##### Selecting coordinates
+
+Sometimes you want to get the coordinates (a.k.a the index locations) of your query. This returns an `Int64Index` of the resulting locations. These coordinates can also be passed to subsequent `where` operations.
+
+```python
+In [458]: df_coord = pd.DataFrame(np.random.randn(1000, 2),
+ .....: index=pd.date_range('20000101', periods=1000))
+ .....:
+
+In [459]: store.append('df_coord', df_coord)
+
+In [460]: c = store.select_as_coordinates('df_coord', 'index > 20020101')
+
+In [461]: c
+Out[461]:
+Int64Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
+ ...
+ 990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
+ dtype='int64', length=268)
+
+In [462]: store.select('df_coord', where=c)
+Out[462]:
+ 0 1
+2002-01-02 0.440865 -0.151651
+2002-01-03 -1.195089 0.285093
+2002-01-04 -0.925046 0.386081
+2002-01-05 -1.942756 0.277699
+2002-01-06 0.811776 0.528965
+... ... ...
+2002-09-22 1.061729 0.618085
+2002-09-23 -0.209744 0.677197
+2002-09-24 -1.808184 0.185667
+2002-09-25 -0.208629 0.928603
+2002-09-26 1.579717 -1.259530
+
+[268 rows x 2 columns]
+```
+
+##### Selecting using a where mask
+
+Sometimes your query can involve creating a list of rows to select. Usually this `mask` would be a resulting `index` from an indexing operation. This example selects the rows of a datetimeindex whose month equals 5.
+
+```python
+In [463]: df_mask = pd.DataFrame(np.random.randn(1000, 2),
+ .....: index=pd.date_range('20000101', periods=1000))
+ .....:
+
+In [464]: store.append('df_mask', df_mask)
+
+In [465]: c = store.select_column('df_mask', 'index')
+
+In [466]: where = c[pd.DatetimeIndex(c).month == 5].index
+
+In [467]: store.select('df_mask', where=where)
+Out[467]:
+ 0 1
+2000-05-01 -1.199892 1.073701
+2000-05-02 -1.058552 0.658487
+2000-05-03 -0.015418 0.452879
+2000-05-04 1.737818 0.426356
+2000-05-05 -0.711668 -0.021266
+... ... ...
+2002-05-27 0.656196 0.993383
+2002-05-28 -0.035399 -0.269286
+2002-05-29 0.704503 2.574402
+2002-05-30 -1.301443 2.770770
+2002-05-31 -0.807599 0.420431
+
+[93 rows x 2 columns]
+```
+
+##### Storer object
+
+If you want to inspect the stored object, retrieve it via `get_storer`. You could use this programmatically to, say, get the number of rows in an object.
+
+```python
+In [468]: store.get_storer('df_dc').nrows
+Out[468]: 8
+```
+
+#### Multiple table queries
+
+The methods `append_to_multiple` and `select_as_multiple` can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) on which you index most/all of the columns and perform your queries. The other table(s) are data tables with an index matching the selector table's index. You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.
+
+The `append_to_multiple` method splits a given single DataFrame into multiple tables according to `d`, a dictionary that maps the table names to the list of 'columns' you want in that table. If *None* is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame. The argument `selector` defines which table is the selector table (i.e. which you can make queries from). The argument `dropna` will drop rows from the input `DataFrame` to ensure the tables are synchronized. This means that if a row to be written to one of the tables is entirely `np.NaN`, that row will be dropped from all tables.
+
+If `dropna` is False, **the user is responsible for synchronizing the tables**. Remember that entirely `np.Nan` rows are not written to the HDFStore, so if you choose to call `dropna=False`, some tables may have more rows than others, and therefore `select_as_multiple` may not work or may return unexpected results.
+
+```python
+In [469]: df_mt = pd.DataFrame(np.random.randn(8, 6),
+ .....: index=pd.date_range('1/1/2000', periods=8),
+ .....: columns=['A', 'B', 'C', 'D', 'E', 'F'])
+ .....:
+
+In [470]: df_mt['foo'] = 'bar'
+
+In [471]: df_mt.loc[df_mt.index[1], ('A', 'B')] = np.nan
+
+# you can also create the tables individually
+In [472]: store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None},
+ .....: df_mt, selector='df1_mt')
+ .....:
+
+In [473]: store
+Out[473]:
+<class 'pandas.io.pytables.HDFStore'>
+File path: store.h5
+
+# individual tables were created
+In [474]: store.select('df1_mt')
+Out[474]:
+ A B
+2000-01-01 0.475158 0.427905
+2000-01-02 NaN NaN
+2000-01-03 -0.201829 0.651656
+2000-01-04 -0.766427 -1.852010
+2000-01-05 1.642910 -0.055583
+2000-01-06 0.187880 1.536245
+2000-01-07 -1.801014 0.244721
+2000-01-08 3.055033 -0.683085
+
+In [475]: store.select('df2_mt')
+Out[475]:
+ C D E F foo
+2000-01-01 1.846285 -0.044826 0.074867 0.156213 bar
+2000-01-02 0.446978 -0.323516 0.311549 -0.661368 bar
+2000-01-03 -2.657254 0.649636 1.520717 1.604905 bar
+2000-01-04 -0.201100 -2.107934 -0.450691 -0.748581 bar
+2000-01-05 0.543779 0.111444 0.616259 -0.679614 bar
+2000-01-06 0.831475 -0.566063 1.130163 -1.004539 bar
+2000-01-07 0.745984 1.532560 0.229376 0.526671 bar
+2000-01-08 -0.922301 2.760888 0.515474 -0.129319 bar
+
+# as a multiple
+In [476]: store.select_as_multiple(['df1_mt', 'df2_mt'], where=['A>0', 'B>0'],
+ .....: selector='df1_mt')
+ .....:
+Out[476]:
+ A B C D E F foo
+2000-01-01 0.475158 0.427905 1.846285 -0.044826 0.074867 0.156213 bar
+2000-01-06 0.187880 1.536245 0.831475 -0.566063 1.130163 -1.004539 bar
+```
+
+### Delete from a table
+
+You can delete from a table selectively by specifying a `where`. When deleting rows, it is important to understand that `PyTables` deletes rows by erasing them and then **moving** the following data. Thus deleting can be a very expensive operation depending on the orientation of your data, and to get optimal performance it is worthwhile to have the dimension you are deleting be the first of the indexables.
+
+Data is ordered (on disk) in terms of the indexables. Here's a simple use case: you store panel-type data (also known as time-series cross-sectional data), with dates in the `major_axis` and ids in the `minor_axis`. The data is then interleaved like this:
+
+- date_1
+ - id_1
+ - id_2
+ - .
+ - id_n
+- date_2
+ - id_1
+ - .
+ - id_n
+
+It should be clear that a delete operation on the `major_axis` will be fairly quick, as one chunk is removed and the following data moved up. On the other hand, a delete operation on the `minor_axis` will be very expensive; in that case it would almost certainly be faster to rewrite the table using a `where` that selects all but the missing data.
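+
+As a sketch of such a selective delete, reusing the `df_dc` table appended earlier (the cut-off date is arbitrary):
+
+``` python
+# Remove only the rows of 'df_dc' dated after 2000-01-04; earlier rows stay.
+store.remove('df_dc', where="index > pd.Timestamp('20000104')")
+```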
+
+::: danger Warning
+
+Please note that HDF5 **does not reclaim space** in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again **will tend to increase the file size**.
+
+To *repack and clean* the file, use [ptrepack](https://www.pypandas.cn/docs/user_guide/io.html#io-hdf5-ptrepack "ptrepack").
+
+:::
+
+### Notes & caveats
+
+#### Compression
+
+`PyTables` allows the stored data to be compressed. This applies to all kinds of stores, not just tables. Two parameters,
+`complevel` and `complib`, are used to control compression.
+
+`complevel` specifies if and how hard data is to be compressed.
+
+`complib` specifies which compression library to use. If nothing is specified, the default library `zlib` is used. A compression library usually optimizes for either good compression rates or speed, and the results will depend on the type of data. Which type of compression to choose depends on your specific needs and data. The list of supported compression libraries:
+
+- [zlib](https://zlib.net/): The default compression library. A classic in terms of compression, it achieves good compression rates but is somewhat slow.
+- [lzo](https://www.oberhumer.com/opensource/lzo/): Fast compression and decompression.
+- [bzip2](http://bzip.org/): Good compression rates.
+- [blosc](http://www.blosc.org/): Fast compression and decompression.
+
+*New in version 0.20.2:* Support for alternative blosc compressors:
+
+- [blosc:blosclz](http://www.blosc.org/): This is the default compressor for `blosc`.
+- [blosc:lz4](https://fastcompression.blogspot.dk/p/lz4.html): A compact, very popular and fast compressor.
+- [blosc:lz4hc](https://fastcompression.blogspot.dk/p/lz4.html): A tweaked version of LZ4, which produces better compression ratios at the expense of speed.
+- [blosc:snappy](https://google.github.io/snappy/): A popular compressor used in many places.
+- [blosc:zlib](https://zlib.net/): A classic; somewhat slower than the previous ones, but achieving better compression ratios.
+- [blosc:zstd](https://facebook.github.io/zstd/): An extremely well balanced codec; it provides the best compression ratios among the ones above, at reasonably fast speed.
+
+If `complib` is defined as something other than the listed libraries, a `ValueError` is raised.
+
+::: tip Note
+
+If the library specified with the `complib` option is missing on your platform, compression falls back to `zlib` without further ado.
+
+:::
+
+Enable compression for all objects within the file:
+
+``` python
+store_compressed = pd.HDFStore('store_compressed.h5', complevel=9,
+ complib='blosc:blosclz')
+
+```
+Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled:
+
+``` python
+store.append('df', df, complib='zlib', complevel=5)
+
+```
+
+#### ptrepack
+
+`PyTables` offers better write performance when tables are compressed after they are written, as opposed to turning on compression at the very beginning. You can use the supplied `PyTables` utility `ptrepack` for this. In addition, `ptrepack` can change compression levels after the fact.
+
+```
+ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5
+
+```
+
+Furthermore, `ptrepack in.h5 out.h5` will *repack* the file to allow you to reuse previously deleted space. Alternatively, one can simply remove the file and write again, or use the `copy` method.
+
+#### Caveats
+
+::: danger Warning
+
+`HDFStore` is **not thread-safe for writing**. The underlying `PyTables` only supports concurrent reads (via threading or processes). If you need reading and writing *at the same time*, you need to serialize these operations in a single thread in a single process; otherwise you will corrupt your data. See ([GH2397](https://github.com/pandas-dev/pandas/issues/2397)) for more information.
+
+:::
+
+- If you use locks to manage write access between multiple processes, you may want to use [`fsync()`](https://docs.python.org/3/library/os.html#os.fsync) before releasing write locks. For convenience you can use `store.flush(fsync=True)` to do this for you (see the sketch after this list).
+- Once a `table` is created, its columns (DataFrame) are fixed; only exactly the same columns can be appended.
+- Be aware that timezones (e.g., `pytz.timezone('US/Eastern')`) are not necessarily equal across timezone versions. So if data is localized to a specific timezone in the HDFStore using one version of a timezone library and that data is updated with another version, the data will be converted to UTC since these timezones are not considered equal. Either use the same version of the timezone library or use `tz_convert` with the updated timezone definition.
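+
+A minimal sketch of the `fsync` pattern from the first bullet (the file name is illustrative, and the external locking mechanism itself is assumed to be handled elsewhere):
+
+``` python
+# Flush buffers and fsync the file before a hypothetical write lock
+# is released to another process.
+store = pd.HDFStore('store_locked.h5')
+store.append('df', df)
+store.flush(fsync=True)
+store.close()
+```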
+
+::: danger Warning
+
+`PyTables` will show a `NaturalNameWarning` if a column name cannot be used as an attribute selector.
+*Natural* identifiers contain only letters, numbers, and underscores, and may not begin with a number. Other identifiers cannot be used in a `where` clause and are generally a bad idea.
+
+:::
+
+### DataTypes
+
+`HDFStore` maps an object dtype to the underlying `PyTables` dtype. This means the following types are known to work:
+
+Type | Represents missing values
+---|---
+floating : float64, float32, float16 | np.nan
+integer : int64, int32, int8, uint64,uint32, uint8 |
+boolean |
+datetime64[ns] | NaT
+timedelta64[ns] | NaT
+categorical : see the section below |
+object : strings | np.nan
+
+`unicode` columns are not supported, and **will fail**.
+
+#### Categorical data
+
+You can write data that contains `category` dtypes to an `HDFStore`. Queries work the same as if it were an object array; however, `category`-dtyped data is stored in a more efficient manner.
+
+``` python
+In [477]: dfcat = pd.DataFrame({'A': pd.Series(list('aabbcdba')).astype('category'),
+ .....: 'B': np.random.randn(8)})
+ .....:
+
+In [478]: dfcat
+Out[478]:
+ A B
+0 a 1.706605
+1 a 1.373485
+2 b -0.758424
+3 b -0.116984
+4 c -0.959461
+5 d -1.517439
+6 b -0.453150
+7 a -0.827739
+
+In [479]: dfcat.dtypes
+Out[479]:
+A category
+B float64
+dtype: object
+
+In [480]: cstore = pd.HDFStore('cats.h5', mode='w')
+
+In [481]: cstore.append('dfcat', dfcat, format='table', data_columns=['A'])
+
+In [482]: result = cstore.select('dfcat', where="A in ['b', 'c']")
+
+In [483]: result
+Out[483]:
+ A B
+2 b -0.758424
+3 b -0.116984
+4 c -0.959461
+6 b -0.453150
+
+In [484]: result.dtypes
+Out[484]:
+A category
+B float64
+dtype: object
+
+```
+
+#### String columns
+
+**min_itemsize**
+
+The underlying implementation of `HDFStore` uses a fixed column width (itemsize) for string columns. A string column's itemsize is calculated as the maximum length of the data (for that column) passed to the `HDFStore` **in the first append**. Subsequent appends may introduce a string for a column that is **larger** than the column can hold, which will raise an exception (otherwise you could have a silent truncation of these columns, leading to loss of information). In the future this may be relaxed to allow a user-specified truncation.
+
+Pass `min_itemsize` on the first table creation to a-priori specify the minimum length of a particular string column. `min_itemsize` can be an integer, or a dict mapping a column name to an integer. You can pass `values` as a key to allow all indexables or data_columns to have this min_itemsize.
+
+Passing a `min_itemsize` dict will cause all passed columns to be created as data_columns automatically.
+
+::: tip Note
+
+If you are not passing any `data_columns`, then the `min_itemsize` will be the maximum of the length of any string passed.
+
+:::
+
+``` python
+In [485]: dfs = pd.DataFrame({'A': 'foo', 'B': 'bar'}, index=list(range(5)))
+
+In [486]: dfs
+Out[486]:
+ A B
+0 foo bar
+1 foo bar
+2 foo bar
+3 foo bar
+4 foo bar
+
+# A and B have a size of 30
+In [487]: store.append('dfs', dfs, min_itemsize=30)
+
+In [488]: store.get_storer('dfs').table
+Out[488]:
+/dfs/table (Table(5,)) ''
+ description := {
+ "index": Int64Col(shape=(), dflt=0, pos=0),
+ "values_block_0": StringCol(itemsize=30, shape=(2,), dflt=b'', pos=1)}
+ byteorder := 'little'
+ chunkshape := (963,)
+ autoindex := True
+ colindexes := {
+ "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
+
+# A is created as a data_column with a size of 30
+# B is size is calculated
+In [489]: store.append('dfs2', dfs, min_itemsize={'A': 30})
+
+In [490]: store.get_storer('dfs2').table
+Out[490]:
+/dfs2/table (Table(5,)) ''
+ description := {
+ "index": Int64Col(shape=(), dflt=0, pos=0),
+ "values_block_0": StringCol(itemsize=3, shape=(1,), dflt=b'', pos=1),
+ "A": StringCol(itemsize=30, shape=(), dflt=b'', pos=2)}
+ byteorder := 'little'
+ chunkshape := (1598,)
+ autoindex := True
+ colindexes := {
+ "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
+ "A": Index(6, medium, shuffle, zlib(1)).is_csi=False}
+
+```
+
+**nan_rep**
+
+String columns will serialize a `np.nan` (a missing value) with the `nan_rep` string representation. This defaults to the string value `nan`. You could inadvertently turn an actual `nan` value into a missing value.
+
+``` python
+In [491]: dfss = pd.DataFrame({'A': ['foo', 'bar', 'nan']})
+
+In [492]: dfss
+Out[492]:
+ A
+0 foo
+1 bar
+2 nan
+
+In [493]: store.append('dfss', dfss)
+
+In [494]: store.select('dfss')
+Out[494]:
+ A
+0 foo
+1 bar
+2 NaN
+
+# here you need to specify a different nan rep
+In [495]: store.append('dfss2', dfss, nan_rep='_nan_')
+
+In [496]: store.select('dfss2')
+Out[496]:
+ A
+0 foo
+1 bar
+2 nan
+
+```
+
+### External compatibility
+
+`HDFStore` writes `table` format objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, `HDFStore` can read native `PyTables` format tables.
+
+It is possible to write an `HDFStore` object that can easily be imported into `R` using the `rhdf5` library ([Package website](https://www.bioconductor.org/packages/release/bioc/html/rhdf5.html)). Create a table format store like this:
+
+``` python
+In [497]: df_for_r = pd.DataFrame({"first": np.random.rand(100),
+ .....: "second": np.random.rand(100),
+ .....: "class": np.random.randint(0, 2, (100, ))},
+ .....: index=range(100))
+ .....:
+
+In [498]: df_for_r.head()
+Out[498]:
+ first second class
+0 0.366979 0.794525 0
+1 0.296639 0.635178 1
+2 0.395751 0.359693 0
+3 0.484648 0.970016 1
+4 0.810047 0.332303 0
+
+In [499]: store_export = pd.HDFStore('export.h5')
+
+In [500]: store_export.append('df_for_r', df_for_r, data_columns=df_dc.columns)
+
+In [501]: store_export
+Out[501]:
+<class 'pandas.io.pytables.HDFStore'>
+File path: export.h5
+
+```
+In R this file can be read into a `data.frame` object using the `rhdf5` library. The following example function reads the corresponding column names and data values from the values nodes and assembles them into a `data.frame`:
+
+``` R
+# Load values and column names for all datasets from corresponding nodes and
+# insert them into one data.frame object.
+
+library(rhdf5)
+
+loadhdf5data <- function(h5File) {
+
+listing <- h5ls(h5File)
+# Find all data nodes, values are stored in *_values and corresponding column
+# titles in *_items
+data_nodes <- grep("_values", listing$name)
+name_nodes <- grep("_items", listing$name)
+data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
+name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
+columns = list()
+for (idx in seq(data_paths)) {
+ # NOTE: matrices returned by h5read have to be transposed to obtain
+ # required Fortran order!
+ data <- data.frame(t(h5read(h5File, data_paths[idx])))
+ names <- t(h5read(h5File, name_paths[idx]))
+ entry <- data.frame(data)
+ colnames(entry) <- names
+ columns <- append(columns, entry)
+}
+
+data <- data.frame(columns)
+
+return(data)
+}
+
+```
+
+Now you can import the `DataFrame` into R:
+
+``` R
+> data = loadhdf5data("transfer.hdf5")
+> head(data)
+ first second class
+1 0.4170220047 0.3266449 0
+2 0.7203244934 0.5270581 0
+3 0.0001143748 0.8859421 1
+4 0.3023325726 0.3572698 1
+5 0.1467558908 0.9085352 1
+6 0.0923385948 0.6233601 1
+
+```
+
+::: tip Note
+
+The R function lists the entire HDF5 file's contents and assembles the `data.frame` object from all matching nodes, so use this only as a starting point if you have stored multiple `DataFrame` objects in a single HDF5 file.
+
+:::
+
+### Performance
+
+- `tables` format comes with a writing performance penalty as compared to `fixed` stores. The benefit is the ability to append/delete and query (potentially very large amounts of data). Write times are generally longer as compared with regular stores, but query times can be quite fast, especially on an indexed axis.
+- You can pass `chunksize=<int>` to `append`, specifying the write chunksize (defaults to 50000). This will significantly lower your memory usage on writing (see the sketch below).
+- You can pass `expectedrows=<int>` to the first `append` to set the TOTAL number of rows that `PyTables` will expect. This will optimize read/write performance.
+- Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs).
+- A `PerformanceWarning` will be raised if you are attempting to store types that will be pickled by PyTables (rather than stored as endemic types). See [Here](https://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190) for more information and some solutions.
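+
+A minimal sketch of the chunked-write bullets above (the frame, key and sizes are illustrative):
+
+``` python
+# Append a larger frame in 10,000-row chunks and declare the expected
+# total row count up front.
+big = pd.DataFrame(np.random.randn(100000, 2), columns=list('AB'))
+store.append('big', big, chunksize=10000, expectedrows=len(big))
+```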
+
+## Feather
+
+*New in version 0.20.0.*
+
+Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data
+frames efficient, and to make sharing data across data analysis languages easy.
+
+Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
+dtypes, including extension dtypes such as categorical and datetime with tz.
+
+Several caveats.
+
+- This is a newer library, and the format, though stable, is not guaranteed to be backward compatible
+to the earlier versions.
+- The format will NOT write an ``Index``, or ``MultiIndex`` for the
+``DataFrame`` and will raise an error if a non-default one is provided. You
+can ``.reset_index()`` to store the index or ``.reset_index(drop=True)`` to
+ignore it.
+- Duplicate column names and non-string columns names are not supported
+- Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
+on an attempt at serialization.
+
+See the [Full Documentation](https://github.com/wesm/feather).
+
+``` python
+In [502]: df = pd.DataFrame({'a': list('abc'),
+ .....: 'b': list(range(1, 4)),
+ .....: 'c': np.arange(3, 6).astype('u1'),
+ .....: 'd': np.arange(4.0, 7.0, dtype='float64'),
+ .....: 'e': [True, False, True],
+ .....: 'f': pd.Categorical(list('abc')),
+ .....: 'g': pd.date_range('20130101', periods=3),
+ .....: 'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+ .....: 'i': pd.date_range('20130101', periods=3, freq='ns')})
+ .....:
+
+In [503]: df
+Out[503]:
+ a b c d e f g h i
+0 a 1 3 4.0 True a 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
+1 b 2 4 5.0 False b 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-01 00:00:00.000000001
+2 c 3 5 6.0 True c 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-01 00:00:00.000000002
+
+In [504]: df.dtypes
+Out[504]:
+a object
+b int64
+c uint8
+d float64
+e bool
+f category
+g datetime64[ns]
+h datetime64[ns, US/Eastern]
+i datetime64[ns]
+dtype: object
+
+```
+
+Write to a feather file.
+
+``` python
+In [505]: df.to_feather('example.feather')
+
+```
+
+Read from a feather file.
+
+``` python
+In [506]: result = pd.read_feather('example.feather')
+
+In [507]: result
+Out[507]:
+ a b c d e f g h i
+0 a 1 3 4.0 True a 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
+1 b 2 4 5.0 False b 2013-01-02 2013-01-02 00:00:00-05:00 2013-01-01 00:00:00.000000001
+2 c 3 5 6.0 True c 2013-01-03 2013-01-03 00:00:00-05:00 2013-01-01 00:00:00.000000002
+
+# we preserve dtypes
+In [508]: result.dtypes
+Out[508]:
+a object
+b int64
+c uint8
+d float64
+e bool
+f category
+g datetime64[ns]
+h datetime64[ns, US/Eastern]
+i datetime64[ns]
+dtype: object
+
+```
+
+## Parquet
+
+*New in version 0.21.0.*
+
+[Apache Parquet](https://parquet.apache.org/) provides a partitioned binary columnar serialization for data frames. It is designed to
+make reading and writing data frames efficient, and to make sharing data across data analysis
+languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible
+while still maintaining good read performance.
+
+Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
+dtypes, including extension dtypes such as datetime with tz.
+
+Several caveats.
+
+- Duplicate column names and non-string columns names are not supported.
+- The ``pyarrow`` engine always writes the index to the output, but ``fastparquet`` only writes non-default
+indexes. This extra column can cause problems for non-Pandas consumers that are not expecting it. You can
+force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
+- Index level names, if specified, must be strings.
+- Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
+- Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
+on an attempt at serialization.
+
+You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
+If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
+then ``pyarrow`` is tried, and falling back to ``fastparquet``.
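+
+For instance, a hedged sketch of the two routes (file names are illustrative):
+
+``` python
+# Per-call engine selection ...
+df.to_parquet('example_explicit.parquet', engine='pyarrow')
+
+# ... or set the option named above once and rely on it afterwards.
+pd.options.io.parquet.engine = 'pyarrow'
+df.to_parquet('example_option.parquet')
+```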
+
+See the documentation for [pyarrow](https://arrow.apache.org/docs/python/) and [fastparquet](https://fastparquet.readthedocs.io/en/latest/).
+
+::: tip Note
+
+These engines are very similar and should read/write nearly identical parquet format files.
+Currently ``pyarrow`` does not support timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
+These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).
+
+:::
+
+``` python
+In [509]: df = pd.DataFrame({'a': list('abc'),
+ .....: 'b': list(range(1, 4)),
+ .....: 'c': np.arange(3, 6).astype('u1'),
+ .....: 'd': np.arange(4.0, 7.0, dtype='float64'),
+ .....: 'e': [True, False, True],
+ .....: 'f': pd.date_range('20130101', periods=3),
+ .....: 'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
+ .....:
+
+In [510]: df
+Out[510]:
+ a b c d e f g
+0 a 1 3 4.0 True 2013-01-01 2013-01-01 00:00:00-05:00
+1 b 2 4 5.0 False 2013-01-02 2013-01-02 00:00:00-05:00
+2 c 3 5 6.0 True 2013-01-03 2013-01-03 00:00:00-05:00
+
+In [511]: df.dtypes
+Out[511]:
+a object
+b int64
+c uint8
+d float64
+e bool
+f datetime64[ns]
+g datetime64[ns, US/Eastern]
+dtype: object
+
+```
+
+Write to a parquet file.
+
+``` python
+In [512]: df.to_parquet('example_pa.parquet', engine='pyarrow')
+
+In [513]: df.to_parquet('example_fp.parquet', engine='fastparquet')
+
+```
+
+Read from a parquet file.
+
+``` python
+In [514]: result = pd.read_parquet('example_fp.parquet', engine='fastparquet')
+
+In [515]: result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
+
+In [516]: result.dtypes
+Out[516]:
+a object
+b int64
+c uint8
+d float64
+e bool
+f datetime64[ns]
+g datetime64[ns, US/Eastern]
+dtype: object
+
+```
+
+Read only certain columns of a parquet file.
+
+``` python
+In [517]: result = pd.read_parquet('example_fp.parquet',
+ .....: engine='fastparquet', columns=['a', 'b'])
+ .....:
+
+In [518]: result = pd.read_parquet('example_pa.parquet',
+ .....: engine='pyarrow', columns=['a', 'b'])
+ .....:
+
+In [519]: result.dtypes
+Out[519]:
+a object
+b int64
+dtype: object
+
+```
+
+### Handling indexes
+
+Serializing a ``DataFrame`` to parquet may include the implicit index as one or
+more columns in the output file. Thus, this code:
+
+``` python
+In [520]: df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
+
+In [521]: df.to_parquet('test.parquet', engine='pyarrow')
+
+```
+
+creates a parquet file with three columns if you use ``pyarrow`` for serialization:
+``a``, ``b``, and ``__index_level_0__``. If you’re using ``fastparquet``, the
+index [may or may not](https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write)
+be written to the file.
+
+This unexpected extra column causes some databases like Amazon Redshift to reject
+the file, because that column doesn’t exist in the target table.
+
+If you want to omit a dataframe’s indexes when writing, pass ``index=False`` to
+[``to_parquet()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet):
+
+``` python
+In [522]: df.to_parquet('test.parquet', index=False)
+
+```
+
+This creates a parquet file with just the two expected columns, ``a`` and ``b``.
+If your ``DataFrame`` has a custom index, you won’t get it back when you load
+this file into a ``DataFrame``.
+
+Passing ``index=True`` will always write the index, even if that’s not the
+underlying engine’s default behavior.
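+
+A minimal sketch of forcing the index to be written (illustrative file name):
+
+``` python
+# Keep the index in the file regardless of the engine's default behavior.
+df.to_parquet('test_with_index.parquet', engine='fastparquet', index=True)
+```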
+
+### Partitioning Parquet files
+
+*New in version 0.24.0.*
+
+Parquet supports partitioning of data based on the values of one or more columns.
+
+``` python
+In [523]: df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
+
+In [524]: df.to_parquet(fname='test', engine='pyarrow',
+ .....: partition_cols=['a'], compression=None)
+ .....:
+
+```
+
+The *fname* specifies the parent directory to which data will be saved.
+The *partition_cols* are the column names by which the dataset will be partitioned.
+Columns are partitioned in the order they are given. The partition splits are
+determined by the unique values in the partition columns.
+The above example creates a partitioned dataset that may look like:
+
+```
+test
+├── a=0
+│ ├── 0bac803e32dc42ae83fddfd029cbdebc.parquet
+│ └── ...
+└── a=1
+ ├── e6ab24a4f45147b49b54a662f0c412a3.parquet
+ └── ...
+
+```
+
+## SQL queries
+
+The ``pandas.io.sql`` module provides a collection of query wrappers to both
+facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction
+is provided by SQLAlchemy if installed. In addition you will need a driver library for
+your database. Examples of such drivers are [psycopg2](http://initd.org/psycopg/)
+for PostgreSQL or [pymysql](https://github.com/PyMySQL/PyMySQL) for MySQL.
+For [SQLite](https://docs.python.org/3/library/sqlite3.html) this is
+included in Python’s standard library by default.
+You can find an overview of supported drivers for each SQL dialect in the
+[SQLAlchemy docs](https://docs.sqlalchemy.org/en/latest/dialects/index.html).
+
+If SQLAlchemy is not installed, a fallback is only provided for sqlite (and
+for mysql for backwards compatibility, but this is deprecated and will be
+removed in a future version).
+This mode requires a Python database adapter which respect the [Python
+DB-API](https://www.python.org/dev/peps/pep-0249/).
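+
+As a minimal sketch of that fallback mode, assuming an in-memory SQLite database and an illustrative table name:
+
+``` python
+import sqlite3
+
+import pandas as pd
+
+# A plain DB-API connection is all the sqlite fallback needs.
+con = sqlite3.connect(':memory:')
+pd.DataFrame({'a': [1, 2, 3]}).to_sql('fallback_demo', con)
+pd.read_sql_query('SELECT * FROM fallback_demo', con)
+```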
+
+See also some [cookbook examples](cookbook.html#cookbook-sql) for some advanced strategies.
+
+The key functions are:
+
+Method | Description
+---|---
+[read_sql_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html#pandas.read_sql)(table_name, con[, schema, …]) | Read SQL database table into a DataFrame.
+[read_sql_query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html#pandas.read_sql_query)(sql, con[, index_col, …]) | Read SQL query into a DataFrame.
+[read_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html#pandas.read_sql)(sql, con[, index_col, …]) | Read SQL query or database table into a DataFrame.
+[DataFrame.to_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql)(self, name, con[, schema, …]) | Write records stored in a DataFrame to a SQL database.
+
+::: tip Note
+
+The function [``read_sql()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html#pandas.read_sql) is a convenience wrapper around
+[``read_sql_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html#pandas.read_sql_table) and [``read_sql_query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html#pandas.read_sql_query) (and for
+backward compatibility) and will delegate to specific function depending on
+the provided input (database table name or sql query).
+Table names do not need to be quoted if they have special characters.
+
+:::
+
+In the following example, we use the [SQlite](https://www.sqlite.org/) SQL database
+engine. You can use a temporary SQLite database where data are stored in
+“memory”.
+
+To connect with SQLAlchemy you use the ``create_engine()`` function to create an engine
+object from database URI. You only need to create the engine once per database you are
+connecting to.
+For more information on ``create_engine()`` and the URI formatting, see the examples
+below and the SQLAlchemy [documentation](https://docs.sqlalchemy.org/en/latest/core/engines.html)
+
+``` python
+In [525]: from sqlalchemy import create_engine
+
+# Create your engine.
+In [526]: engine = create_engine('sqlite:///:memory:')
+
+```
+
+If you want to manage your own connections you can pass one of those instead:
+
+``` python
+with engine.connect() as conn, conn.begin():
+ data = pd.read_sql_table('data', conn)
+
+```
+
+### Writing DataFrames
+
+Assuming the following data is in a ``DataFrame`` ``data``, we can insert it into
+the database using [``to_sql()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql).
+
+id | Date | Col_1 | Col_2 | Col_3
+---|---|---|---|---
+26 | 2010-10-18 | X | 27.50 | True
+42 | 2010-10-19 | Y | -12.50 | False
+63 | 2010-10-20 | Z | 5.73 | True
+
+``` python
+In [527]: data
+Out[527]:
+ id Date Col_1 Col_2 Col_3
+0 26 2010-10-18 X 27.50 True
+1 42 2010-10-19 Y -12.50 False
+2 63 2010-10-20 Z 5.73 True
+
+In [528]: data.to_sql('data', engine)
+
+```
+
+With some databases, writing large DataFrames can result in errors due to
+packet size limitations being exceeded. This can be avoided by setting the
+``chunksize`` parameter when calling ``to_sql``. For example, the following
+writes ``data`` to the database in batches of 1000 rows at a time:
+
+``` python
+In [529]: data.to_sql('data_chunked', engine, chunksize=1000)
+
+```
+
+#### SQL data types
+
+[``to_sql()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql) will try to map your data to an appropriate
+SQL data type based on the dtype of the data. When you have columns of dtype
+``object``, pandas will try to infer the data type.
+
+You can always override the default type by specifying the desired SQL type of
+any of the columns by using the ``dtype`` argument. This argument needs a
+dictionary mapping column names to SQLAlchemy types (or strings for the sqlite3
+fallback mode).
+For example, specifying to use the sqlalchemy ``String`` type instead of the
+default ``Text`` type for string columns:
+
+``` python
+In [530]: from sqlalchemy.types import String
+
+In [531]: data.to_sql('data_dtype', engine, dtype={'Col_1': String})
+
+```
+
+::: tip Note
+
+Due to the limited support for timedeltas in the different database
+flavors, columns with type ``timedelta64`` will be written as integer
+values as nanoseconds to the database and a warning will be raised.
+
+:::
+
+::: tip Note
+
+Columns of ``category`` dtype will be converted to the dense representation
+as you would get with ``np.asarray(categorical)`` (e.g. for string categories
+this gives an array of strings).
+Because of this, reading the database table back in does **not** generate
+a categorical.
+
+:::
+
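+A small sketch of that behaviour, reusing the engine from above (``'cat_data'`` is a
+hypothetical table name):
+
+``` python
+# category data is written as its dense values ...
+df_cat = pd.DataFrame({'col': pd.Categorical(['a', 'b', 'a'])})
+df_cat.to_sql('cat_data', engine, index=False)
+
+# ... and read back as plain strings (object dtype), not category
+pd.read_sql_table('cat_data', engine)['col'].dtype
+
+```
+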
+### Datetime data types
+
+Using SQLAlchemy, [``to_sql()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql) is capable of writing
+datetime data that is timezone naive or timezone aware. However, the resulting
+data stored in the database ultimately depends on the supported data type
+for datetime data of the database system being used.
+
+The following table lists supported data types for datetime data for some
+common databases. Other database dialects may have different data types for
+datetime data.
+
+Database | SQL Datetime Types | Timezone Support
+---|---|---
+SQLite | TEXT | No
+MySQL | TIMESTAMP or DATETIME | No
+PostgreSQL | TIMESTAMP or TIMESTAMP WITH TIME ZONE | Yes
+
+When writing timezone aware data to databases that do not support timezones,
+the data will be written as timezone naive timestamps that are in local time
+with respect to the timezone.
+
+[``read_sql_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html#pandas.read_sql_table) is also capable of reading datetime data that is
+timezone aware or naive. When reading ``TIMESTAMP WITH TIME ZONE`` types, pandas
+will convert the data to UTC.
+
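+As a hedged sketch, using the in-memory SQLite engine created above (SQLite has no
+timezone support, so the values are stored as naive local timestamps; ``'tz_data'`` is
+a hypothetical table name):
+
+``` python
+df_tz = pd.DataFrame({'ts': pd.date_range('2019-01-01', periods=3,
+                                           freq='H', tz='US/Pacific')})
+df_tz.to_sql('tz_data', engine, index=False)
+
+# read back; on SQLite the timestamps come back timezone naive
+pd.read_sql_table('tz_data', engine)
+
+```
+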
+#### Insertion method
+
+*New in version 0.24.0.*
+
+The parameter ``method`` controls the SQL insertion clause used.
+Possible values are:
+
+- ``None``: Uses standard SQL ``INSERT`` clause (one per row).
+- ``'multi'``: Pass multiple values in a single ``INSERT`` clause.
+It uses a special SQL syntax not supported by all backends.
+This usually provides better performance for analytic databases
+like Presto and Redshift, but has worse performance for
+traditional SQL backends if the table contains many columns.
+For more information check the SQLAlchemy [documentation](http://docs.sqlalchemy.org/en/latest/core/dml.html#sqlalchemy.sql.expression.Insert.values.params.*args); a short usage sketch follows this list.
+- callable with signature ``(pd_table, conn, keys, data_iter)``:
+This can be used to implement a more performant insertion method based on
+specific backend dialect features.
+
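+As a short, hedged sketch of the ``'multi'`` option, reusing ``engine`` and ``data``
+from above (``'data_multi'`` is a hypothetical table name):
+
+``` python
+# pass several rows per INSERT statement; combine with chunksize to bound statement size
+data.to_sql('data_multi', engine, method='multi', chunksize=1000)
+
+```
+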
+Example of a callable using PostgreSQL [COPY clause](https://www.postgresql.org/docs/current/static/sql-copy.html):
+
+``` python
+# Alternative to_sql() *method* for DBs that support COPY FROM
+import csv
+from io import StringIO
+
+def psql_insert_copy(table, conn, keys, data_iter):
+ # gets a DBAPI connection that can provide a cursor
+ dbapi_conn = conn.connection
+ with dbapi_conn.cursor() as cur:
+ s_buf = StringIO()
+ writer = csv.writer(s_buf)
+ writer.writerows(data_iter)
+ s_buf.seek(0)
+
+ columns = ', '.join('"{}"'.format(k) for k in keys)
+ if table.schema:
+ table_name = '{}.{}'.format(table.schema, table.name)
+ else:
+ table_name = table.name
+
+ sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
+ table_name, columns)
+ cur.copy_expert(sql=sql, file=s_buf)
+
+```
+
+### Reading tables
+
+[``read_sql_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html#pandas.read_sql_table) will read a database table given the
+table name and optionally a subset of columns to read.
+
+::: tip Note
+
+In order to use [``read_sql_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html#pandas.read_sql_table), you **must** have the
+SQLAlchemy optional dependency installed.
+
+:::
+
+``` python
+In [532]: pd.read_sql_table('data', engine)
+Out[532]:
+ index id Date Col_1 Col_2 Col_3
+0 0 26 2010-10-18 X 27.50 True
+1 1 42 2010-10-19 Y -12.50 False
+2 2 63 2010-10-20 Z 5.73 True
+
+```
+
+You can also specify the name of the column as the ``DataFrame`` index,
+and specify a subset of columns to be read.
+
+``` python
+In [533]: pd.read_sql_table('data', engine, index_col='id')
+Out[533]:
+ index Date Col_1 Col_2 Col_3
+id
+26 0 2010-10-18 X 27.50 True
+42 1 2010-10-19 Y -12.50 False
+63 2 2010-10-20 Z 5.73 True
+
+In [534]: pd.read_sql_table('data', engine, columns=['Col_1', 'Col_2'])
+Out[534]:
+ Col_1 Col_2
+0 X 27.50
+1 Y -12.50
+2 Z 5.73
+
+```
+
+And you can explicitly force columns to be parsed as dates:
+
+``` python
+In [535]: pd.read_sql_table('data', engine, parse_dates=['Date'])
+Out[535]:
+ index id Date Col_1 Col_2 Col_3
+0 0 26 2010-10-18 X 27.50 True
+1 1 42 2010-10-19 Y -12.50 False
+2 2 63 2010-10-20 Z 5.73 True
+
+```
+
+If needed you can explicitly specify a format string, or a dict of arguments
+to pass to [``pandas.to_datetime()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime):
+
+``` python
+pd.read_sql_table('data', engine, parse_dates={'Date': '%Y-%m-%d'})
+pd.read_sql_table('data', engine,
+ parse_dates={'Date': {'format': '%Y-%m-%d %H:%M:%S'}})
+
+```
+
+You can check if a table exists using ``has_table()``.
+
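+A minimal sketch against the engine created above (``has_table`` lives in
+``pandas.io.sql``):
+
+``` python
+from pandas.io import sql
+
+sql.has_table('data', engine)      # True once the 'data' table has been written
+sql.has_table('missing', engine)   # False for a table that does not exist
+
+```
+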
+### Schema support
+
+Reading from and writing to different schemas is supported through the ``schema``
+keyword in the [``read_sql_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html#pandas.read_sql_table) and [``to_sql()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql)
+functions. Note however that this depends on the database flavor (sqlite does not
+have schemas). For example:
+
+``` python
+df.to_sql('table', engine, schema='other_schema')
+pd.read_sql_table('table', engine, schema='other_schema')
+
+```
+
+### Querying
+
+You can query using raw SQL in the [``read_sql_query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html#pandas.read_sql_query) function.
+In this case you must use the SQL variant appropriate for your database.
+When using SQLAlchemy, you can also pass SQLAlchemy Expression language constructs,
+which are database-agnostic.
+
+``` python
+In [536]: pd.read_sql_query('SELECT * FROM data', engine)
+Out[536]:
+ index id Date Col_1 Col_2 Col_3
+0 0 26 2010-10-18 00:00:00.000000 X 27.50 1
+1 1 42 2010-10-19 00:00:00.000000 Y -12.50 0
+2 2 63 2010-10-20 00:00:00.000000 Z 5.73 1
+
+```
+
+Of course, you can specify a more “complex” query.
+
+``` python
+In [537]: pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42;", engine)
+Out[537]:
+ id Col_1 Col_2
+0 42 Y -12.5
+
+```
+
+The [``read_sql_query()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html#pandas.read_sql_query) function supports a ``chunksize`` argument.
+Specifying this will return an iterator through chunks of the query result:
+
+``` python
+In [538]: df = pd.DataFrame(np.random.randn(20, 3), columns=list('abc'))
+
+In [539]: df.to_sql('data_chunks', engine, index=False)
+
+```
+
+``` python
+In [540]: for chunk in pd.read_sql_query("SELECT * FROM data_chunks",
+ .....: engine, chunksize=5):
+ .....: print(chunk)
+ .....:
+ a b c
+0 -0.900850 -0.323746 0.037100
+1 0.057533 -0.032842 0.550902
+2 1.026623 1.035455 -0.965140
+3 -0.252405 -1.255987 0.639156
+4 1.076701 -0.309155 -0.800182
+ a b c
+0 -0.206623 0.496077 -0.219935
+1 0.631362 -1.166743 1.808368
+2 0.023531 0.987573 0.471400
+3 -0.982250 -0.192482 1.195452
+4 -1.758855 0.477551 1.412567
+ a b c
+0 -1.120570 1.232764 0.417814
+1 1.688089 -0.037645 -0.269582
+2 0.646823 -0.603366 1.592966
+3 0.724019 -0.515606 -0.180920
+4 0.038244 -2.292866 -0.114634
+ a b c
+0 -0.970230 -0.963257 -0.128304
+1 0.498621 -1.496506 0.701471
+2 -0.272608 -0.119424 -0.882023
+3 -0.253477 0.714395 0.664179
+4 0.897140 0.455791 1.549590
+
+```
+
+You can also run a plain query without creating a ``DataFrame`` with
+``execute()``. This is useful for queries that don’t return values,
+such as INSERT. This is functionally equivalent to calling ``execute`` on the
+SQLAlchemy engine or db connection object. Again, you must use the SQL syntax
+variant appropriate for your database.
+
+``` python
+from pandas.io import sql
+sql.execute('SELECT * FROM table_name', engine)
+sql.execute('INSERT INTO table_name VALUES(?, ?, ?, ?)', engine,
+ params=[('id', 1, 12.2, True)])
+
+```
+
+### Engine connection examples
+
+To connect with SQLAlchemy you use the ``create_engine()`` function to create an engine
+object from a database URI. You only need to create the engine once per database you are
+connecting to.
+
+``` python
+from sqlalchemy import create_engine
+
+engine = create_engine('postgresql://scott:tiger@localhost:5432/mydatabase')
+
+engine = create_engine('mysql+mysqldb://scott:tiger@localhost/foo')
+
+engine = create_engine('oracle://scott:tiger@127.0.0.1:1521/sidname')
+
+engine = create_engine('mssql+pyodbc://mydsn')
+
+# sqlite://<nohostname>/<path>
+# where <path> is relative:
+engine = create_engine('sqlite:///foo.db')
+
+# or absolute, starting with a slash:
+engine = create_engine('sqlite:////absolute/path/to/foo.db')
+
+```
+
+For more information see the examples in the SQLAlchemy [documentation](https://docs.sqlalchemy.org/en/latest/core/engines.html).
+
+### Advanced SQLAlchemy queries
+
+You can use SQLAlchemy constructs to describe your query.
+
+Use ``sqlalchemy.text()`` to specify query parameters in a backend-neutral way:
+
+``` python
+In [541]: import sqlalchemy as sa
+
+In [542]: pd.read_sql(sa.text('SELECT * FROM data where Col_1=:col1'),
+ .....: engine, params={'col1': 'X'})
+ .....:
+Out[542]:
+ index id Date Col_1 Col_2 Col_3
+0 0 26 2010-10-18 00:00:00.000000 X 27.5 1
+
+```
+
+If you have an SQLAlchemy description of your database you can express where conditions using SQLAlchemy expressions:
+
+``` python
+In [543]: metadata = sa.MetaData()
+
+In [544]: data_table = sa.Table('data', metadata,
+ .....: sa.Column('index', sa.Integer),
+ .....: sa.Column('Date', sa.DateTime),
+ .....: sa.Column('Col_1', sa.String),
+ .....: sa.Column('Col_2', sa.Float),
+ .....: sa.Column('Col_3', sa.Boolean),
+ .....: )
+ .....:
+
+In [545]: pd.read_sql(sa.select([data_table]).where(data_table.c.Col_3 is True), engine)
+Out[545]:
+Empty DataFrame
+Columns: [index, Date, Col_1, Col_2, Col_3]
+Index: []
+
+```
+
+You can combine SQLAlchemy expressions with parameters passed to [``read_sql()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html#pandas.read_sql) using ``sqlalchemy.bindparam()``:
+
+``` python
+In [546]: import datetime as dt
+
+In [547]: expr = sa.select([data_table]).where(data_table.c.Date > sa.bindparam('date'))
+
+In [548]: pd.read_sql(expr, engine, params={'date': dt.datetime(2010, 10, 18)})
+Out[548]:
+ index Date Col_1 Col_2 Col_3
+0 1 2010-10-19 Y -12.50 False
+1 2 2010-10-20 Z 5.73 True
+
+```
+
+### Sqlite fallback
+
+The use of sqlite is supported without using SQLAlchemy.
+This mode requires a Python database adapter which respects the [Python
+DB-API](https://www.python.org/dev/peps/pep-0249/).
+
+You can create connections like so:
+
+``` python
+import sqlite3
+con = sqlite3.connect(':memory:')
+
+```
+
+And then issue the following queries:
+
+``` python
+data.to_sql('data', con)
+pd.read_sql_query("SELECT * FROM data", con)
+
+```
+
+## Google BigQuery
+
+::: danger Warning
+
+Starting in 0.20.0, pandas has split off Google BigQuery support into the
+separate package ``pandas-gbq``. You can ``pip install pandas-gbq`` to get it.
+
+:::
+
+The ``pandas-gbq`` package provides functionality to read/write from Google BigQuery.
+
+pandas integrates with this external package. If ``pandas-gbq`` is installed, you can
+use the pandas methods ``pd.read_gbq`` and ``DataFrame.to_gbq``, which will call the
+respective functions from ``pandas-gbq``.
+
+Full documentation can be found [here](https://pandas-gbq.readthedocs.io/).
+
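+A hedged sketch; the project and table names below are placeholders, and valid Google
+Cloud credentials plus the ``pandas-gbq`` package are required:
+
+``` python
+# read the result of a query into a DataFrame
+df = pd.read_gbq('SELECT * FROM my_dataset.my_table', project_id='my-project')
+
+# write it back to another BigQuery table
+df.to_gbq('my_dataset.my_table_copy', project_id='my-project', if_exists='replace')
+
+```
+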
+## Stata format
+
+### Writing to stata format
+
+The method ``to_stata()`` will write a DataFrame
+into a .dta file. The format version of this file is always 115 (Stata 12).
+
+``` python
+In [549]: df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
+
+In [550]: df.to_stata('stata.dta')
+
+```
+
+Stata data files have limited data type support; only strings with
+244 or fewer characters, ``int8``, ``int16``, ``int32``, ``float32``
+and ``float64`` can be stored in ``.dta`` files. Additionally,
+Stata reserves certain values to represent missing data. Exporting a
+non-missing value that is outside of the permitted range in Stata for
+a particular data type will retype the variable to the next larger
+size. For example, ``int8`` values are restricted to lie between -127
+and 100 in Stata, and so variables with values above 100 will trigger
+a conversion to ``int16``. ``nan`` values in floating points data
+types are stored as the basic missing data type (``.`` in Stata).
+
+::: tip Note
+
+It is not possible to export missing data values for integer data types.
+
+:::
+
+The Stata writer gracefully handles other data types including ``int64``,
+``bool``, ``uint8``, ``uint16``, ``uint32`` by casting to
+the smallest supported type that can represent the data. For example, data
+with a type of ``uint8`` will be cast to ``int8`` if all values are less than
+100 (the upper bound for non-missing ``int8`` data in Stata), or, if values are
+outside of this range, the variable is cast to ``int16``.
+
+::: danger Warning
+
+Conversion from ``int64`` to ``float64`` may result in a loss of precision
+if ``int64`` values are larger than 2**53.
+
+:::
+
+::: danger Warning
+
+``StataWriter`` and
+``to_stata()`` only support fixed width
+strings containing up to 244 characters, a limitation imposed by the version
+115 dta file format. Attempting to write Stata dta files with strings
+longer than 244 characters raises a ``ValueError``.
+
+:::
+
+### Reading from Stata format
+
+The top-level function ``read_stata`` will read a dta file and return
+either a ``DataFrame`` or a ``StataReader`` that can
+be used to read the file incrementally.
+
+``` python
+In [551]: pd.read_stata('stata.dta')
+Out[551]:
+ index A B
+0 0 1.031231 0.196447
+1 1 0.190188 0.619078
+2 2 0.036658 -0.100501
+3 3 0.201772 1.763002
+4 4 0.454977 -1.958922
+5 5 -0.628529 0.133171
+6 6 -1.274374 2.518925
+7 7 -0.517547 -0.360773
+8 8 0.877961 -1.881598
+9 9 -0.699067 -1.566913
+
+```
+
+Specifying a ``chunksize`` yields a
+``StataReader`` instance that can be used to
+read ``chunksize`` lines from the file at a time. The ``StataReader``
+object can be used as an iterator.
+
+``` python
+In [552]: reader = pd.read_stata('stata.dta', chunksize=3)
+
+In [553]: for df in reader:
+ .....: print(df.shape)
+ .....:
+(3, 3)
+(3, 3)
+(3, 3)
+(1, 3)
+
+```
+
+For more fine-grained control, use ``iterator=True`` and specify
+``chunksize`` with each call to
+``read()``.
+
+``` python
+In [554]: reader = pd.read_stata('stata.dta', iterator=True)
+
+In [555]: chunk1 = reader.read(5)
+
+In [556]: chunk2 = reader.read(5)
+
+```
+
+Currently the ``index`` is retrieved as a column.
+
+The parameter ``convert_categoricals`` indicates whether value labels should be
+read and used to create a ``Categorical`` variable from them. Value labels can
+also be retrieved by the function ``value_labels``, which requires ``read()``
+to be called before use.
+
+The parameter ``convert_missing`` indicates whether missing value
+representations in Stata should be preserved. If ``False`` (the default),
+missing values are represented as ``np.nan``. If ``True``, missing values are
+represented using ``StataMissingValue`` objects, and columns containing missing
+values will have ``object`` data type.
+
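+A hedged sketch of both options, reusing the ``stata.dta`` file written above (which
+contains no value labels, so the returned mapping is empty):
+
+``` python
+reader = pd.read_stata('stata.dta', iterator=True)
+df = reader.read()        # read() must be called before value_labels()
+reader.value_labels()     # dict of value labels per variable; empty here
+
+# keep Stata missing-value representations instead of converting them to np.nan
+pd.read_stata('stata.dta', convert_missing=True)
+
+```
+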
+::: tip Note
+
+[``read_stata()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_stata.html#pandas.read_stata) and
+``StataReader`` support .dta formats 113-115
+(Stata 10-12), 117 (Stata 13), and 118 (Stata 14).
+
+:::
+
+::: tip Note
+
+Setting ``preserve_dtypes=False`` will upcast to the standard pandas data types:
+``int64`` for all integer types and ``float64`` for floating point data. By default,
+the Stata data types are preserved when importing.
+
+:::
+
+#### Categorical data
+
+``Categorical`` data can be exported to Stata data files as value labeled data.
+The exported data consists of the underlying category codes as integer data values
+and the categories as value labels. Stata does not have an explicit equivalent
+to a ``Categorical`` and information about whether the variable is ordered
+is lost when exporting.
+
+::: danger Warning
+
+Stata only supports string value labels, and so ``str`` is called on the
+categories when exporting data. Exporting ``Categorical`` variables with
+non-string categories produces a warning, and can result in a loss of
+information if the ``str`` representations of the categories are not unique.
+
+:::
+
+Labeled data can similarly be imported from Stata data files as ``Categorical``
+variables using the keyword argument ``convert_categoricals`` (``True`` by default).
+The keyword argument ``order_categoricals`` (``True`` by default) determines
+whether imported ``Categorical`` variables are ordered.
+
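+A hedged sketch of the round trip (``labeled.dta`` is a hypothetical file name):
+
+``` python
+df_cat = pd.DataFrame({'grade': pd.Categorical(['a', 'b', 'a', 'c'])})
+df_cat.to_stata('labeled.dta')
+
+pd.read_stata('labeled.dta')['grade']                               # Categorical again
+pd.read_stata('labeled.dta', convert_categoricals=False)['grade']   # underlying integer codes
+
+```
+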
+::: tip Note
+
+When importing categorical data, the values of the variables in the Stata
+data file are not preserved since ``Categorical`` variables always
+use integer data types between ``-1`` and ``n-1`` where ``n`` is the number
+of categories. If the original values in the Stata data file are required,
+these can be imported by setting ``convert_categoricals=False``, which will
+import original data (but not the variable labels). The original values can
+be matched to the imported categorical data since there is a simple mapping
+between the original Stata data values and the category codes of imported
+Categorical variables: missing values are assigned code ``-1``, and the
+smallest original value is assigned ``0``, the second smallest is assigned
+``1`` and so on until the largest original value is assigned the code ``n-1``.
+
+:::
+
+::: tip Note
+
+Stata supports partially labeled series. These series have value labels for
+some but not all data values. Importing a partially labeled series will produce
+a ``Categorical`` with string categories for the values that are labeled and
+numeric categories for values with no label.
+
+:::
+
+## SAS formats
+
+The top-level function [``read_sas()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html#pandas.read_sas) can read (but not write) SAS
+*xport* (.XPT) and (since v0.18.0) *SAS7BDAT* (.sas7bdat) format files.
+
+SAS files only contain two value types: ASCII text and floating point
+values (usually 8 bytes but sometimes truncated). For xport files,
+there is no automatic type conversion to integers, dates, or
+categoricals. For SAS7BDAT files, the format codes may allow date
+variables to be automatically converted to dates. By default the
+whole file is read and returned as a ``DataFrame``.
+
+Specify a ``chunksize`` or use ``iterator=True`` to obtain reader
+objects (``XportReader`` or ``SAS7BDATReader``) for incrementally
+reading the file. The reader objects also have attributes that
+contain additional information about the file and its variables.
+
+Read a SAS7BDAT file:
+
+``` python
+df = pd.read_sas('sas_data.sas7bdat')
+
+```
+
+Obtain an iterator and read an XPORT file 100,000 lines at a time:
+
+``` python
+def do_something(chunk):
+ pass
+
+rdr = pd.read_sas('sas_xport.xpt', chunksize=100000)
+for chunk in rdr:
+ do_something(chunk)
+
+```
+
+The [specification](https://support.sas.com/techsup/technote/ts140.pdf) for the xport file format is available from the SAS
+web site.
+
+No official documentation is available for the SAS7BDAT format.
+
+## Other file formats
+
+pandas itself only supports IO with a limited set of file formats that map
+cleanly to its tabular data model. For reading and writing other file formats
+into and from pandas, we recommend these packages from the broader community.
+
+### netCDF
+
+[xarray](https://xarray.pydata.org/) provides data structures inspired by the pandas ``DataFrame`` for working
+with multi-dimensional datasets, with a focus on the netCDF file format and
+easy conversion to and from pandas.
+
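+A hedged sketch of the conversion (requires the ``xarray`` package, plus a netCDF
+backend such as ``netCDF4`` for the write step; the file name is a placeholder):
+
+``` python
+import pandas as pd
+import xarray as xr
+
+df = pd.DataFrame({'value': range(4)},
+                  index=pd.date_range('2019-01-01', periods=4, name='time'))
+
+ds = df.to_xarray()          # DataFrame -> xarray.Dataset
+ds.to_netcdf('example.nc')   # write a netCDF file
+
+round_trip = xr.open_dataset('example.nc').to_dataframe()
+
+```
+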
+## Performance considerations
+
+This is an informal comparison of various IO methods, using pandas
+0.20.3. Timings are machine dependent and small differences should be
+ignored.
+
+``` python
+In [1]: sz = 1000000
+In [2]: df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})
+
+In [3]: df.info()
+<class 'pandas.core.frame.DataFrame'>
+RangeIndex: 1000000 entries, 0 to 999999
+Data columns (total 2 columns):
+A 1000000 non-null float64
+B 1000000 non-null int64
+dtypes: float64(1), int64(1)
+memory usage: 15.3 MB
+
+```
+
+Given the next test set:
+
+``` python
+import os
+import sqlite3
+
+from numpy.random import randn
+
+sz = 1000000
+df = pd.DataFrame({'A': randn(sz), 'B': [1] * sz})
+
+
+def test_sql_write(df):
+ if os.path.exists('test.sql'):
+ os.remove('test.sql')
+ sql_db = sqlite3.connect('test.sql')
+ df.to_sql(name='test_table', con=sql_db)
+ sql_db.close()
+
+
+def test_sql_read():
+ sql_db = sqlite3.connect('test.sql')
+ pd.read_sql_query("select * from test_table", sql_db)
+ sql_db.close()
+
+
+def test_hdf_fixed_write(df):
+ df.to_hdf('test_fixed.hdf', 'test', mode='w')
+
+
+def test_hdf_fixed_read():
+ pd.read_hdf('test_fixed.hdf', 'test')
+
+
+def test_hdf_fixed_write_compress(df):
+ df.to_hdf('test_fixed_compress.hdf', 'test', mode='w', complib='blosc')
+
+
+def test_hdf_fixed_read_compress():
+ pd.read_hdf('test_fixed_compress.hdf', 'test')
+
+
+def test_hdf_table_write(df):
+ df.to_hdf('test_table.hdf', 'test', mode='w', format='table')
+
+
+def test_hdf_table_read():
+ pd.read_hdf('test_table.hdf', 'test')
+
+
+def test_hdf_table_write_compress(df):
+ df.to_hdf('test_table_compress.hdf', 'test', mode='w',
+ complib='blosc', format='table')
+
+
+def test_hdf_table_read_compress():
+ pd.read_hdf('test_table_compress.hdf', 'test')
+
+
+def test_csv_write(df):
+ df.to_csv('test.csv', mode='w')
+
+
+def test_csv_read():
+ pd.read_csv('test.csv', index_col=0)
+
+
+def test_feather_write(df):
+ df.to_feather('test.feather')
+
+
+def test_feather_read():
+ pd.read_feather('test.feather')
+
+
+def test_pickle_write(df):
+ df.to_pickle('test.pkl')
+
+
+def test_pickle_read():
+ pd.read_pickle('test.pkl')
+
+
+def test_pickle_write_compress(df):
+ df.to_pickle('test.pkl.compress', compression='xz')
+
+
+def test_pickle_read_compress():
+ pd.read_pickle('test.pkl.compress', compression='xz')
+
+```
+
+When writing, the top three functions in terms of speed are
+``test_pickle_write``, ``test_feather_write`` and ``test_hdf_fixed_write_compress``.
+
+``` python
+In [14]: %timeit test_sql_write(df)
+2.37 s ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+In [15]: %timeit test_hdf_fixed_write(df)
+194 ms ± 65.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+
+In [26]: %timeit test_hdf_fixed_write_compress(df)
+119 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+
+In [16]: %timeit test_hdf_table_write(df)
+623 ms ± 125 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+In [27]: %timeit test_hdf_table_write_compress(df)
+563 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+In [17]: %timeit test_csv_write(df)
+3.13 s ± 49.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+In [30]: %timeit test_feather_write(df)
+103 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+
+In [31]: %timeit test_pickle_write(df)
+109 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+
+In [32]: %timeit test_pickle_write_compress(df)
+3.33 s ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+```
+
+When reading, the top three are ``test_feather_read``, ``test_pickle_read`` and
+``test_hdf_fixed_read``.
+
+``` python
+In [18]: %timeit test_sql_read()
+1.35 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+In [19]: %timeit test_hdf_fixed_read()
+14.3 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
+
+In [28]: %timeit test_hdf_fixed_read_compress()
+23.5 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
+
+In [20]: %timeit test_hdf_table_read()
+35.4 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
+
+In [29]: %timeit test_hdf_table_read_compress()
+42.6 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+
+In [22]: %timeit test_csv_read()
+516 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+In [33]: %timeit test_feather_read()
+4.06 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
+
+In [34]: %timeit test_pickle_read()
+6.5 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
+
+In [35]: %timeit test_pickle_read_compress()
+588 ms ± 3.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+```
+
+Space on disk (in bytes):
+
+```
+34816000 Aug 21 18:00 test.sql
+24009240 Aug 21 18:00 test_fixed.hdf
+ 7919610 Aug 21 18:00 test_fixed_compress.hdf
+24458892 Aug 21 18:00 test_table.hdf
+ 8657116 Aug 21 18:00 test_table_compress.hdf
+28520770 Aug 21 18:00 test.csv
+16000248 Aug 21 18:00 test.feather
+16000848 Aug 21 18:00 test.pkl
+ 7554108 Aug 21 18:00 test.pkl.compress
+
+```
diff --git a/Python/pandas/user_guide/merging.md b/Python/pandas/user_guide/merging.md
new file mode 100644
index 00000000..4a87d98f
--- /dev/null
+++ b/Python/pandas/user_guide/merging.md
@@ -0,0 +1,1415 @@
+# Merge, join, and concatenate
+
+pandas provides various facilities for easily combining Series or
+DataFrame objects with various kinds of set logic for the indexes
+and relational algebra functionality in the case of join / merge-type
+operations.
+
+## Concatenating objects
+
+The [``concat()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat) function (in the main pandas namespace) does all of
+the heavy lifting of performing concatenation operations along an axis while
+performing optional set logic (union or intersection) of the indexes (if any) on
+the other axes. Note that I say “if any” because there is only a single possible
+axis of concatenation for Series.
+
+Before diving into all of the details of ``concat`` and what it can do, here is
+a simple example:
+
+``` python
+In [1]: df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+ ...: 'B': ['B0', 'B1', 'B2', 'B3'],
+ ...: 'C': ['C0', 'C1', 'C2', 'C3'],
+ ...: 'D': ['D0', 'D1', 'D2', 'D3']},
+ ...: index=[0, 1, 2, 3])
+ ...:
+
+In [2]: df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
+ ...: 'B': ['B4', 'B5', 'B6', 'B7'],
+ ...: 'C': ['C4', 'C5', 'C6', 'C7'],
+ ...: 'D': ['D4', 'D5', 'D6', 'D7']},
+ ...: index=[4, 5, 6, 7])
+ ...:
+
+In [3]: df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
+ ...: 'B': ['B8', 'B9', 'B10', 'B11'],
+ ...: 'C': ['C8', 'C9', 'C10', 'C11'],
+ ...: 'D': ['D8', 'D9', 'D10', 'D11']},
+ ...: index=[8, 9, 10, 11])
+ ...:
+
+In [4]: frames = [df1, df2, df3]
+
+In [5]: result = pd.concat(frames)
+```
+
+
+
+Like its sibling function on ndarrays, ``numpy.concatenate``, ``pandas.concat``
+takes a list or dict of homogeneously-typed objects and concatenates them with
+some configurable handling of “what to do with the other axes”:
+
+``` python
+pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
+ levels=None, names=None, verify_integrity=False, copy=True)
+```
+
+- ``objs`` : a sequence or mapping of Series or DataFrame objects. If a
+dict is passed, the sorted keys will be used as the *keys* argument, unless
+it is passed, in which case the values will be selected (see below). Any None
+objects will be dropped silently unless they are all None in which case a
+ValueError will be raised.
+- ``axis`` : {0, 1, …}, default 0. The axis to concatenate along.
+- ``join`` : {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on
+other axis(es). Outer for union and inner for intersection.
+- ``ignore_index`` : boolean, default False. If True, do not use the index
+values on the concatenation axis. The resulting axis will be labeled 0, …,
+n - 1. This is useful if you are concatenating objects where the
+concatenation axis does not have meaningful indexing information. Note
+the index values on the other axes are still respected in the join.
+- ``keys`` : sequence, default None. Construct hierarchical index using the
+passed keys as the outermost level. If multiple levels passed, should
+contain tuples.
+- ``levels`` : list of sequences, default None. Specific levels (unique values)
+to use for constructing a MultiIndex. Otherwise they will be inferred from the
+keys.
+- ``names`` : list, default None. Names for the levels in the resulting
+hierarchical index.
+- ``verify_integrity`` : boolean, default False. Check whether the new
+concatenated axis contains duplicates. This can be very expensive relative
+to the actual data concatenation.
+- ``copy`` : boolean, default True. If False, do not copy data unnecessarily.
+
+Without a little bit of context many of these arguments don’t make much sense.
+Let’s revisit the above example. Suppose we wanted to associate specific keys
+with each of the pieces of the chopped up DataFrame. We can do this using the
+``keys`` argument:
+
+``` python
+In [6]: result = pd.concat(frames, keys=['x', 'y', 'z'])
+```
+
+
+
+As you can see (if you’ve read the rest of the documentation), the resulting
+object’s index has a [hierarchical index](advanced.html#advanced-hierarchical). This
+means that we can now select out each chunk by key:
+
+``` python
+In [7]: result.loc['y']
+Out[7]:
+ A B C D
+4 A4 B4 C4 D4
+5 A5 B5 C5 D5
+6 A6 B6 C6 D6
+7 A7 B7 C7 D7
+```
+
+It’s not a stretch to see how this can be very useful. More detail on this
+functionality below.
+
+::: tip Note
+
+It is worth noting that [``concat()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat) (and therefore
+``append()``) makes a full copy of the data, and that constantly
+reusing this function can create a significant performance hit. If you need
+to use the operation over several datasets, use a list comprehension.
+
+:::
+
+``` python
+frames = [ process_your_file(f) for f in files ]
+result = pd.concat(frames)
+```
+
+### Set logic on the other axes
+
+When gluing together multiple DataFrames, you have a choice of how to handle
+the other axes (other than the one being concatenated). This can be done in
+the following two ways:
+
+- Take the union of them all, ``join='outer'``. This is the default
+option as it results in zero information loss.
+- Take the intersection, ``join='inner'``.
+
+Here is an example of each of these methods. First, the default ``join='outer'``
+behavior:
+
+``` python
+In [8]: df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
+ ...: 'D': ['D2', 'D3', 'D6', 'D7'],
+ ...: 'F': ['F2', 'F3', 'F6', 'F7']},
+ ...: index=[2, 3, 6, 7])
+ ...:
+
+In [9]: result = pd.concat([df1, df4], axis=1, sort=False)
+```
+
+
+
+::: danger Warning
+
+*Changed in version 0.23.0.*
+
+The default behavior with ``join='outer'`` is to sort the other axis
+(columns in this case). In a future version of pandas, the default will
+be to not sort. We specified ``sort=False`` to opt in to the new
+behavior now.
+
+:::
+
+Here is the same thing with ``join='inner'``:
+
+``` python
+In [10]: result = pd.concat([df1, df4], axis=1, join='inner')
+```
+
+
+
+Lastly, suppose we just wanted to reuse the *exact index* from the original
+DataFrame:
+
+``` python
+In [11]: result = pd.concat([df1, df4], axis=1).reindex(df1.index)
+```
+
+Similarly, we could index before the concatenation:
+
+``` python
+In [12]: pd.concat([df1, df4.reindex(df1.index)], axis=1)
+Out[12]:
+ A B C D B D F
+0 A0 B0 C0 D0 NaN NaN NaN
+1 A1 B1 C1 D1 NaN NaN NaN
+2 A2 B2 C2 D2 B2 D2 F2
+3 A3 B3 C3 D3 B3 D3 F3
+```
+
+
+
+### Concatenating using ``append``
+
+Useful shortcuts to [``concat()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat) are the [``append()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html#pandas.DataFrame.append)
+instance methods on ``Series`` and ``DataFrame``. These methods actually predated
+``concat``. They concatenate along ``axis=0``, namely the index:
+
+``` python
+In [13]: result = df1.append(df2)
+```
+
+
+
+In the case of ``DataFrame``, the indexes must be disjoint but the columns do not
+need to be:
+
+``` python
+In [14]: result = df1.append(df4, sort=False)
+```
+
+
+
+``append`` may take multiple objects to concatenate:
+
+``` python
+In [15]: result = df1.append([df2, df3])
+```
+
+
+
+::: tip Note
+
+Unlike the ``list.append()`` method, which appends to the original list in place
+and returns ``None``, [``append()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html#pandas.DataFrame.append) here **does not** modify
+``df1`` and returns a copy of it with ``df2`` appended.
+
+:::
+
+### Ignoring indexes on the concatenation axis
+
+For ``DataFrame`` objects which don’t have a meaningful index, you may wish
+to append them and ignore the fact that they may have overlapping indexes. To
+do this, use the ``ignore_index`` argument:
+
+``` python
+In [16]: result = pd.concat([df1, df4], ignore_index=True, sort=False)
+```
+
+
+
+This is also a valid argument to [``DataFrame.append()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html#pandas.DataFrame.append):
+
+``` python
+In [17]: result = df1.append(df4, ignore_index=True, sort=False)
+```
+
+
+
+### Concatenating with mixed ndims
+
+You can concatenate a mix of ``Series`` and ``DataFrame`` objects. The
+``Series`` will be transformed to ``DataFrame`` with the column name as
+the name of the ``Series``.
+
+``` python
+In [18]: s1 = pd.Series(['X0', 'X1', 'X2', 'X3'], name='X')
+
+In [19]: result = pd.concat([df1, s1], axis=1)
+```
+
+
+
+::: tip Note
+
+Since we’re concatenating a ``Series`` to a ``DataFrame``, we could have
+achieved the same result with [``DataFrame.assign()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html#pandas.DataFrame.assign). To concatenate an
+arbitrary number of pandas objects (``DataFrame`` or ``Series``), use
+``concat``.
+
+:::
+
+If unnamed ``Series`` are passed they will be numbered consecutively.
+
+``` python
+In [20]: s2 = pd.Series(['_0', '_1', '_2', '_3'])
+
+In [21]: result = pd.concat([df1, s2, s2, s2], axis=1)
+```
+
+
+
+Passing ``ignore_index=True`` will drop all name references.
+
+``` python
+In [22]: result = pd.concat([df1, s1], axis=1, ignore_index=True)
+```
+
+
+
+### More concatenating with group keys
+
+A fairly common use of the ``keys`` argument is to override the column names
+when creating a new ``DataFrame`` based on existing ``Series``.
+Notice how the default behaviour consists of letting the resulting ``DataFrame``
+inherit the parent ``Series``’ name, when these exist.
+
+``` python
+In [23]: s3 = pd.Series([0, 1, 2, 3], name='foo')
+
+In [24]: s4 = pd.Series([0, 1, 2, 3])
+
+In [25]: s5 = pd.Series([0, 1, 4, 5])
+
+In [26]: pd.concat([s3, s4, s5], axis=1)
+Out[26]:
+ foo 0 1
+0 0 0 0
+1 1 1 1
+2 2 2 4
+3 3 3 5
+```
+
+Through the ``keys`` argument we can override the existing column names.
+
+``` python
+In [27]: pd.concat([s3, s4, s5], axis=1, keys=['red', 'blue', 'yellow'])
+Out[27]:
+ red blue yellow
+0 0 0 0
+1 1 1 1
+2 2 2 4
+3 3 3 5
+```
+
+Let’s consider a variation of the very first example presented:
+
+``` python
+In [28]: result = pd.concat(frames, keys=['x', 'y', 'z'])
+```
+
+
+
+You can also pass a dict to ``concat`` in which case the dict keys will be used
+for the ``keys`` argument (unless other keys are specified):
+
+``` python
+In [29]: pieces = {'x': df1, 'y': df2, 'z': df3}
+
+In [30]: result = pd.concat(pieces)
+```
+
+
+
+``` python
+In [31]: result = pd.concat(pieces, keys=['z', 'y'])
+```
+
+
+
+The MultiIndex created has levels that are constructed from the passed keys and
+the index of the ``DataFrame`` pieces:
+
+``` python
+In [32]: result.index.levels
+Out[32]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])
+```
+
+If you wish to specify other levels (as will occasionally be the case), you can
+do so using the ``levels`` argument:
+
+``` python
+In [33]: result = pd.concat(pieces, keys=['x', 'y', 'z'],
+ ....: levels=[['z', 'y', 'x', 'w']],
+ ....: names=['group_key'])
+ ....:
+```
+
+
+
+``` python
+In [34]: result.index.levels
+Out[34]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
+```
+
+This is fairly esoteric, but it is actually necessary for implementing things
+like GroupBy where the order of a categorical variable is meaningful.
+
+### Appending rows to a DataFrame
+
+While not especially efficient (since a new object must be created), you can
+append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
+``append``, which returns a new ``DataFrame`` as above.
+
+``` python
+In [35]: s2 = pd.Series(['X0', 'X1', 'X2', 'X3'], index=['A', 'B', 'C', 'D'])
+
+In [36]: result = df1.append(s2, ignore_index=True)
+```
+
+
+
+You should use ``ignore_index`` with this method to instruct DataFrame to
+discard its index. If you wish to preserve the index, you should construct an
+appropriately-indexed DataFrame and append or concatenate those objects.
+
+You can also pass a list of dicts or Series:
+
+``` python
+In [37]: dicts = [{'A': 1, 'B': 2, 'C': 3, 'X': 4},
+ ....: {'A': 5, 'B': 6, 'C': 7, 'Y': 8}]
+ ....:
+
+In [38]: result = df1.append(dicts, ignore_index=True, sort=False)
+```
+
+
+
+## Database-style DataFrame or named Series joining/merging
+
+pandas has full-featured, **high performance** in-memory join operations
+idiomatically very similar to relational databases like SQL. These methods
+perform significantly better (in some cases well over an order of magnitude
+better) than other open source implementations (like ``base::merge.data.frame``
+in R). The reason for this is careful algorithmic design and the internal layout
+of the data in ``DataFrame``.
+
+See the [cookbook](cookbook.html#cookbook-merge) for some advanced strategies.
+
+Users who are familiar with SQL but new to pandas might be interested in a
+[comparison with SQL](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join).
+
+pandas provides a single function, [``merge()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html#pandas.merge), as the entry point for
+all standard database join operations between ``DataFrame`` or named ``Series`` objects:
+
+``` python
+pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
+ left_index=False, right_index=False, sort=True,
+ suffixes=('_x', '_y'), copy=True, indicator=False,
+ validate=None)
+```
+
+- ``left``: A DataFrame or named Series object.
+- ``right``: Another DataFrame or named Series object.
+- ``on``: Column or index level names to join on. Must be found in both the left
+and right DataFrame and/or Series objects. If not passed and ``left_index`` and
+``right_index`` are ``False``, the intersection of the columns in the
+DataFrames and/or Series will be inferred to be the join keys.
+- ``left_on``: Columns or index levels from the left DataFrame or Series to use as
+keys. Can either be column names, index level names, or arrays with length
+equal to the length of the DataFrame or Series.
+- ``right_on``: Columns or index levels from the right DataFrame or Series to use as
+keys. Can either be column names, index level names, or arrays with length
+equal to the length of the DataFrame or Series.
+- ``left_index``: If ``True``, use the index (row labels) from the left
+DataFrame or Series as its join key(s). In the case of a DataFrame or Series with a MultiIndex
+(hierarchical), the number of levels must match the number of join keys
+from the right DataFrame or Series.
+- ``right_index``: Same usage as ``left_index`` for the right DataFrame or Series.
+- ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``. Defaults
+to ``inner``. See below for more detailed description of each method.
+- ``sort``: Sort the result DataFrame by the join keys in lexicographical
+order. Defaults to ``True``, setting to ``False`` will improve performance
+substantially in many cases.
+- ``suffixes``: A tuple of string suffixes to apply to overlapping
+columns. Defaults to ``('_x', '_y')``.
+- ``copy``: Always copy data (default ``True``) from the passed DataFrame or named Series
+objects, even when reindexing is not necessary. Cannot be avoided in many
+cases but may improve performance / memory usage. The cases where copying
+can be avoided are somewhat pathological but this option is provided
+nonetheless.
+- ``indicator``: Add a column to the output DataFrame called ``_merge``
+with information on the source of each row. ``_merge`` is Categorical-type
+and takes on a value of ``left_only`` for observations whose merge key
+only appears in ``'left'`` DataFrame or Series, ``right_only`` for observations whose
+merge key only appears in ``'right'`` DataFrame or Series, and ``both`` if the
+observation’s merge key is found in both.
+- ``validate`` : string, default None.
+If specified, checks if merge is of specified type.
+
+ - “one_to_one” or “1:1”: checks if merge keys are unique in both
+ left and right datasets.
+ - “one_to_many” or “1:m”: checks if merge keys are unique in left
+ dataset.
+ - “many_to_one” or “m:1”: checks if merge keys are unique in right
+ dataset.
+ - “many_to_many” or “m:m”: allowed, but does not result in checks.
+
+*New in version 0.21.0.*
+
+::: tip Note
+
+Support for specifying index levels as the ``on``, ``left_on``, and
+``right_on`` parameters was added in version 0.23.0.
+Support for merging named ``Series`` objects was added in version 0.24.0.
+
+:::
+
+The return type will be the same as ``left``. If ``left`` is a ``DataFrame`` or named ``Series``
+and ``right`` is a subclass of ``DataFrame``, the return type will still be ``DataFrame``.
+
+``merge`` is a function in the pandas namespace, and it is also available as a
+``DataFrame`` instance method [``merge()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html#pandas.DataFrame.merge), with the calling
+``DataFrame`` being implicitly considered the left object in the join.
+
+The related [``join()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) method uses ``merge`` internally for the
+index-on-index (by default) and column(s)-on-index join. If you are joining on
+index only, you may wish to use ``DataFrame.join`` to save yourself some typing.
+
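+For example, the following two calls are equivalent (a minimal sketch assuming two
+DataFrames ``left`` and ``right`` that share a ``key`` column):
+
+``` python
+pd.merge(left, right, on='key')
+left.merge(right, on='key')
+```
+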
+### Brief primer on merge methods (relational algebra)
+
+Experienced users of relational databases like SQL will be familiar with the
+terminology used to describe join operations between two SQL-table like
+structures (``DataFrame`` objects). There are several cases to consider which
+are very important to understand:
+
+- **one-to-one** joins: for example when joining two ``DataFrame`` objects on
+their indexes (which must contain unique values).
+- **many-to-one** joins: for example when joining an index (unique) to one or
+more columns in a different ``DataFrame``.
+- **many-to-many** joins: joining columns on columns.
+
+::: tip Note
+
+When joining columns on columns (potentially a many-to-many join), any
+indexes on the passed ``DataFrame`` objects **will be discarded**.
+
+:::
+
+It is worth spending some time understanding the result of the **many-to-many**
+join case. In SQL / standard relational algebra, if a key combination appears
+more than once in both tables, the resulting table will have the **Cartesian
+product** of the associated data. Here is a very basic example with one unique
+key combination:
+
+``` python
+In [39]: left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
+ ....: 'A': ['A0', 'A1', 'A2', 'A3'],
+ ....: 'B': ['B0', 'B1', 'B2', 'B3']})
+ ....:
+
+In [40]: right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
+ ....: 'C': ['C0', 'C1', 'C2', 'C3'],
+ ....: 'D': ['D0', 'D1', 'D2', 'D3']})
+ ....:
+
+In [41]: result = pd.merge(left, right, on='key')
+```
+
+
+
+Here is a more complicated example with multiple join keys. Only the keys
+appearing in ``left`` and ``right`` are present (the intersection), since
+``how='inner'`` by default.
+
+``` python
+In [42]: left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
+ ....: 'key2': ['K0', 'K1', 'K0', 'K1'],
+ ....: 'A': ['A0', 'A1', 'A2', 'A3'],
+ ....: 'B': ['B0', 'B1', 'B2', 'B3']})
+ ....:
+
+In [43]: right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
+ ....: 'key2': ['K0', 'K0', 'K0', 'K0'],
+ ....: 'C': ['C0', 'C1', 'C2', 'C3'],
+ ....: 'D': ['D0', 'D1', 'D2', 'D3']})
+ ....:
+
+In [44]: result = pd.merge(left, right, on=['key1', 'key2'])
+```
+
+
+
+The ``how`` argument to ``merge`` specifies how to determine which keys are to
+be included in the resulting table. If a key combination **does not appear** in
+either the left or right tables, the values in the joined table will be
+``NA``. Here is a summary of the ``how`` options and their SQL equivalent names:
+
+Merge method | SQL Join Name | Description
+---|---|---
+left | LEFT OUTER JOIN | Use keys from left frame only
+right | RIGHT OUTER JOIN | Use keys from right frame only
+outer | FULL OUTER JOIN | Use union of keys from both frames
+inner | INNER JOIN | Use intersection of keys from both frames
+
+``` python
+In [45]: result = pd.merge(left, right, how='left', on=['key1', 'key2'])
+```
+
+
+
+``` python
+In [46]: result = pd.merge(left, right, how='right', on=['key1', 'key2'])
+```
+
+
+
+``` python
+In [47]: result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
+```
+
+
+
+``` python
+In [48]: result = pd.merge(left, right, how='inner', on=['key1', 'key2'])
+```
+
+
+
+Here is another example with duplicate join keys in DataFrames:
+
+``` python
+In [49]: left = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
+
+In [50]: right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})
+
+In [51]: result = pd.merge(left, right, on='B', how='outer')
+```
+
+
+
+::: danger Warning
+
+Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, which may result in memory overflow. It is the user’s responsibility to manage duplicate values in keys before joining large DataFrames.
+
+:::
+
+### Checking for duplicate keys
+
+*New in version 0.21.0.*
+
+Users can use the ``validate`` argument to automatically check whether there
+are unexpected duplicates in their merge keys. Key uniqueness is checked before
+merge operations and so should protect against memory overflows. Checking key
+uniqueness is also a good way to ensure user data structures are as expected.
+
+In the following example, there are duplicate values of ``B`` in the right
+``DataFrame``. As this is not a one-to-one merge – as specified in the
+``validate`` argument – an exception will be raised.
+
+``` python
+In [52]: left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})
+
+In [53]: right = pd.DataFrame({'A' : [4,5,6], 'B': [2, 2, 2]})
+```
+
+``` python
+In [53]: result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
+...
+MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
+```
+
+If the user is aware of the duplicates in the right ``DataFrame`` but wants to
+ensure there are no duplicates in the left DataFrame, one can use the
+``validate='one_to_many'`` argument instead, which will not raise an exception.
+
+``` python
+In [54]: pd.merge(left, right, on='B', how='outer', validate="one_to_many")
+Out[54]:
+ A_x B A_y
+0 1 1 NaN
+1 2 2 4.0
+2 2 2 5.0
+3 2 2 6.0
+```
+
+### The merge indicator
+
+[``merge()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html#pandas.merge) accepts the argument ``indicator``. If ``True``, a
+Categorical-type column called ``_merge`` will be added to the output object
+that takes on values:
+
+Observation Origin | _merge value
+---|---
+Merge key only in 'left' frame | left_only
+Merge key only in 'right' frame | right_only
+Merge key in both frames | both
+
+``` python
+In [55]: df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
+
+In [56]: df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2, 2, 2]})
+
+In [57]: pd.merge(df1, df2, on='col1', how='outer', indicator=True)
+Out[57]:
+ col1 col_left col_right _merge
+0 0 a NaN left_only
+1 1 b 2.0 both
+2 2 NaN 2.0 right_only
+3 2 NaN 2.0 right_only
+```
+
+The ``indicator`` argument will also accept string arguments, in which case the indicator function will use the value of the passed string as the name for the indicator column.
+
+``` python
+In [58]: pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
+Out[58]:
+ col1 col_left col_right indicator_column
+0 0 a NaN left_only
+1 1 b 2.0 both
+2 2 NaN 2.0 right_only
+3 2 NaN 2.0 right_only
+```
+
+### Merge dtypes
+
+*New in version 0.19.0.*
+
+Merging will preserve the dtype of the join keys.
+
+``` python
+In [59]: left = pd.DataFrame({'key': [1], 'v1': [10]})
+
+In [60]: left
+Out[60]:
+ key v1
+0 1 10
+
+In [61]: right = pd.DataFrame({'key': [1, 2], 'v1': [20, 30]})
+
+In [62]: right
+Out[62]:
+ key v1
+0 1 20
+1 2 30
+```
+
+We are able to preserve the join keys:
+
+``` python
+In [63]: pd.merge(left, right, how='outer')
+Out[63]:
+ key v1
+0 1 10
+1 1 20
+2 2 30
+
+In [64]: pd.merge(left, right, how='outer').dtypes
+Out[64]:
+key int64
+v1 int64
+dtype: object
+```
+
+Of course if you have missing values that are introduced, then the
+resulting dtype will be upcast.
+
+``` python
+In [65]: pd.merge(left, right, how='outer', on='key')
+Out[65]:
+ key v1_x v1_y
+0 1 10.0 20
+1 2 NaN 30
+
+In [66]: pd.merge(left, right, how='outer', on='key').dtypes
+Out[66]:
+key int64
+v1_x float64
+v1_y int64
+dtype: object
+```
+
+*New in version 0.20.0.*
+
+Merging will preserve ``category`` dtypes of the merge operands. See also the section on [categoricals](categorical.html#categorical-merge).
+
+The left frame.
+
+``` python
+In [67]: from pandas.api.types import CategoricalDtype
+
+In [68]: X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
+
+In [69]: X = X.astype(CategoricalDtype(categories=['foo', 'bar']))
+
+In [70]: left = pd.DataFrame({'X': X,
+ ....: 'Y': np.random.choice(['one', 'two', 'three'],
+ ....: size=(10,))})
+ ....:
+
+In [71]: left
+Out[71]:
+ X Y
+0 bar one
+1 foo one
+2 foo three
+3 bar three
+4 foo one
+5 bar one
+6 bar three
+7 bar three
+8 bar three
+9 foo three
+
+In [72]: left.dtypes
+Out[72]:
+X category
+Y object
+dtype: object
+```
+
+The right frame.
+
+``` python
+In [73]: right = pd.DataFrame({'X': pd.Series(['foo', 'bar'],
+ ....: dtype=CategoricalDtype(['foo', 'bar'])),
+ ....: 'Z': [1, 2]})
+ ....:
+
+In [74]: right
+Out[74]:
+ X Z
+0 foo 1
+1 bar 2
+
+In [75]: right.dtypes
+Out[75]:
+X category
+Z int64
+dtype: object
+```
+
+The merged result:
+
+``` python
+In [76]: result = pd.merge(left, right, how='outer')
+
+In [77]: result
+Out[77]:
+ X Y Z
+0 bar one 2
+1 bar three 2
+2 bar one 2
+3 bar three 2
+4 bar three 2
+5 bar three 2
+6 foo one 1
+7 foo three 1
+8 foo one 1
+9 foo three 1
+
+In [78]: result.dtypes
+Out[78]:
+X category
+Y object
+Z int64
+dtype: object
+```
+
+::: tip Note
+
+The category dtypes must be *exactly* the same, meaning the same categories and the ordered attribute.
+Otherwise the result will coerce to ``object`` dtype.
+
+:::
+
+::: tip Note
+
+Merging on ``category`` dtypes that are the same can be quite performant compared to ``object`` dtype merging.
+
+:::
+
+### Joining on index
+
+[``DataFrame.join()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) is a convenient method for combining the columns of two
+potentially differently-indexed ``DataFrames`` into a single result
+``DataFrame``. Here is a very basic example:
+
+``` python
+In [79]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
+ ....: 'B': ['B0', 'B1', 'B2']},
+ ....: index=['K0', 'K1', 'K2'])
+ ....:
+
+In [80]: right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
+ ....: 'D': ['D0', 'D2', 'D3']},
+ ....: index=['K0', 'K2', 'K3'])
+ ....:
+
+In [81]: result = left.join(right)
+```
+
+
+
+``` python
+In [82]: result = left.join(right, how='outer')
+```
+
+
+
+The same as above, but with ``how='inner'``.
+
+``` python
+In [83]: result = left.join(right, how='inner')
+```
+
+
+
+The data alignment here is on the indexes (row labels). This same behavior can
+be achieved using ``merge`` plus additional arguments instructing it to use the
+indexes:
+
+``` python
+In [84]: result = pd.merge(left, right, left_index=True, right_index=True, how='outer')
+```
+
+
+
+``` python
+In [85]: result = pd.merge(left, right, left_index=True, right_index=True, how='inner');
+```
+
+
+
+### Joining key columns on an index
+
+[``join()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) takes an optional ``on`` argument which may be a column
+or multiple column names, which specifies that the passed ``DataFrame`` is to be
+aligned on that column in the ``DataFrame``. These two function calls are
+completely equivalent:
+
+``` python
+left.join(right, on=key_or_keys)
+pd.merge(left, right, left_on=key_or_keys, right_index=True,
+ how='left', sort=False)
+```
+
+Obviously you can choose whichever form you find more convenient. For
+many-to-one joins (where one of the ``DataFrame``s is already indexed by the
+join key), using ``join`` may be more convenient. Here is a simple example:
+
+``` python
+In [86]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+ ....: 'B': ['B0', 'B1', 'B2', 'B3'],
+ ....: 'key': ['K0', 'K1', 'K0', 'K1']})
+ ....:
+
+In [87]: right = pd.DataFrame({'C': ['C0', 'C1'],
+ ....: 'D': ['D0', 'D1']},
+ ....: index=['K0', 'K1'])
+ ....:
+
+In [88]: result = left.join(right, on='key')
+```
+
+
+
+``` python
+In [89]: result = pd.merge(left, right, left_on='key', right_index=True,
+ ....: how='left', sort=False);
+ ....:
+```
+
+
+
+To join on multiple keys, the passed DataFrame must have a ``MultiIndex``:
+
+``` python
+In [90]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+ ....: 'B': ['B0', 'B1', 'B2', 'B3'],
+ ....: 'key1': ['K0', 'K0', 'K1', 'K2'],
+ ....: 'key2': ['K0', 'K1', 'K0', 'K1']})
+ ....:
+
+In [91]: index = pd.MultiIndex.from_tuples([('K0', 'K0'), ('K1', 'K0'),
+ ....: ('K2', 'K0'), ('K2', 'K1')])
+ ....:
+
+In [92]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
+ ....: 'D': ['D0', 'D1', 'D2', 'D3']},
+ ....: index=index)
+ ....:
+```
+
+Now this can be joined by passing the two key column names:
+
+``` python
+In [93]: result = left.join(right, on=['key1', 'key2'])
+```
+
+
+
+The default for ``DataFrame.join`` is to perform a left join (essentially a
+“VLOOKUP” operation, for Excel users), which uses only the keys found in the
+calling DataFrame. Other join types, for example inner join, can be just as
+easily performed:
+
+``` python
+In [94]: result = left.join(right, on=['key1', 'key2'], how='inner')
+```
+
+
+
+As you can see, this drops any rows where there was no match.
+
+### Joining a single Index to a MultiIndex
+
+You can join a singly-indexed ``DataFrame`` with a level of a MultiIndexed ``DataFrame``.
+The level will match on the name of the index of the singly-indexed frame against
+a level name of the MultiIndexed frame.
+
+``` python
+In [95]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
+ ....: 'B': ['B0', 'B1', 'B2']},
+ ....: index=pd.Index(['K0', 'K1', 'K2'], name='key'))
+ ....:
+
+In [96]: index = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
+ ....: ('K2', 'Y2'), ('K2', 'Y3')],
+ ....: names=['key', 'Y'])
+ ....:
+
+In [97]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
+ ....: 'D': ['D0', 'D1', 'D2', 'D3']},
+ ....: index=index)
+ ....:
+
+In [98]: result = left.join(right, how='inner')
+```
+
+
+
+The ``join`` above is equivalent to, but less verbose and more memory efficient / faster than, the following ``merge``:
+
+``` python
+In [99]: result = pd.merge(left.reset_index(), right.reset_index(),
+ ....: on=['key'], how='inner').set_index(['key','Y'])
+ ....:
+```
+
+
+
+### Joining with two MultiIndexes
+
+This is supported in a limited way, provided that the index for the right
+argument is completely used in the join, and is a subset of the indices in
+the left argument, as in this example:
+
+``` python
+In [100]: leftindex = pd.MultiIndex.from_product([list('abc'), list('xy'), [1, 2]],
+ .....: names=['abc', 'xy', 'num'])
+ .....:
+
+In [101]: left = pd.DataFrame({'v1': range(12)}, index=leftindex)
+
+In [102]: left
+Out[102]:
+ v1
+abc xy num
+a x 1 0
+ 2 1
+ y 1 2
+ 2 3
+b x 1 4
+ 2 5
+ y 1 6
+ 2 7
+c x 1 8
+ 2 9
+ y 1 10
+ 2 11
+
+In [103]: rightindex = pd.MultiIndex.from_product([list('abc'), list('xy')],
+ .....: names=['abc', 'xy'])
+ .....:
+
+In [104]: right = pd.DataFrame({'v2': [100 * i for i in range(1, 7)]}, index=rightindex)
+
+In [105]: right
+Out[105]:
+ v2
+abc xy
+a x 100
+ y 200
+b x 300
+ y 400
+c x 500
+ y 600
+
+In [106]: left.join(right, on=['abc', 'xy'], how='inner')
+Out[106]:
+ v1 v2
+abc xy num
+a x 1 0 100
+ 2 1 100
+ y 1 2 200
+ 2 3 200
+b x 1 4 300
+ 2 5 300
+ y 1 6 400
+ 2 7 400
+c x 1 8 500
+ 2 9 500
+ y 1 10 600
+ 2 11 600
+```
+
+If that condition is not satisfied, a join with two multi-indexes can be
+done using the following code.
+
+``` python
+In [107]: leftindex = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
+ .....: ('K1', 'X2')],
+ .....: names=['key', 'X'])
+ .....:
+
+In [108]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
+ .....: 'B': ['B0', 'B1', 'B2']},
+ .....: index=leftindex)
+ .....:
+
+In [109]: rightindex = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
+ .....: ('K2', 'Y2'), ('K2', 'Y3')],
+ .....: names=['key', 'Y'])
+ .....:
+
+In [110]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
+ .....: 'D': ['D0', 'D1', 'D2', 'D3']},
+ .....: index=rightindex)
+ .....:
+
+In [111]: result = pd.merge(left.reset_index(), right.reset_index(),
+ .....: on=['key'], how='inner').set_index(['key', 'X', 'Y'])
+ .....:
+```
+
+
+
+### Merging on a combination of columns and index levels
+
+*New in version 0.23.*
+
+Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
+may refer to either column names or index level names. This enables merging
+``DataFrame`` instances on a combination of index levels and columns without
+resetting indexes.
+
+``` python
+In [112]: left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
+
+In [113]: left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+ .....: 'B': ['B0', 'B1', 'B2', 'B3'],
+ .....: 'key2': ['K0', 'K1', 'K0', 'K1']},
+ .....: index=left_index)
+ .....:
+
+In [114]: right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
+
+In [115]: right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
+ .....: 'D': ['D0', 'D1', 'D2', 'D3'],
+ .....: 'key2': ['K0', 'K0', 'K0', 'K1']},
+ .....: index=right_index)
+ .....:
+
+In [116]: result = left.merge(right, on=['key1', 'key2'])
+```
+
+
+
+::: tip Note
+
+When DataFrames are merged on a string that matches an index level in both
+frames, the index level is preserved as an index level in the resulting
+DataFrame.
+
+:::
+
+::: tip Note
+
+When DataFrames are merged using only some of the levels of a *MultiIndex*,
+the extra levels will be dropped from the resulting merge. In order to
+preserve those levels, use ``reset_index`` on those level names to move
+those levels to columns prior to doing the merge.
+
+:::
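+
+A minimal, self-contained sketch of that workaround (all names here are
+invented for the example): reset the index so every level survives the merge
+as a column.
+
+``` python
+import pandas as pd
+
+idx = pd.MultiIndex.from_tuples([('a', 1), ('b', 2)], names=['k1', 'k2'])
+mi = pd.DataFrame({'val': [10, 20]}, index=idx)
+other = pd.DataFrame({'k1': ['a', 'b'], 'other_val': [100, 200]})
+
+# merging on 'k1' alone would drop level 'k2'; resetting the index keeps it
+kept = pd.merge(mi.reset_index(), other, on='k1')
+print(kept.columns.tolist())   # ['k1', 'k2', 'val', 'other_val']
+```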
+
+::: tip Note
+
+If a string matches both a column name and an index level name, then a
+warning is issued and the column takes precedence. This will result in an
+ambiguity error in a future version.
+
+:::
+
+### Overlapping value columns
+
+The merge ``suffixes`` argument takes a tuple or list of strings to append to
+overlapping column names in the input ``DataFrame``s to disambiguate the result
+columns:
+
+``` python
+In [117]: left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]})
+
+In [118]: right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]})
+
+In [119]: result = pd.merge(left, right, on='k')
+```
+
+
+
+``` python
+In [120]: result = pd.merge(left, right, on='k', suffixes=['_l', '_r'])
+```
+
+
+
+[``DataFrame.join()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) has ``lsuffix`` and ``rsuffix`` arguments which behave
+similarly.
+
+``` python
+In [121]: left = left.set_index('k')
+
+In [122]: right = right.set_index('k')
+
+In [123]: result = left.join(right, lsuffix='_l', rsuffix='_r')
+```
+
+
+
+### Joining multiple DataFrames
+
+A list or tuple of ``DataFrames`` can also be passed to [``join()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join)
+to join them together on their indexes.
+
+``` python
+In [124]: right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K2'])
+
+In [125]: result = left.join([right, right2])
+```
+
+
+
+### Merging together values within Series or DataFrame columns
+
+Another fairly common situation is to have two like-indexed (or similarly
+indexed) ``Series`` or ``DataFrame`` objects and to want to “patch” values in
+one object with values from matching indices in the other. Here is an example:
+
+``` python
+In [126]: df1 = pd.DataFrame([[np.nan, 3., 5.], [-4.6, np.nan, np.nan],
+ .....: [np.nan, 7., np.nan]])
+ .....:
+
+In [127]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5., 1.6, 4]],
+ .....: index=[1, 2])
+ .....:
+```
+
+For this, use the [``combine_first()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.combine_first.html#pandas.DataFrame.combine_first) method:
+
+``` python
+In [128]: result = df1.combine_first(df2)
+```
+
+
+
+Note that this method only takes values from the right ``DataFrame`` if they are
+missing in the left ``DataFrame``. A related method, [``update()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html#pandas.DataFrame.update),
+alters non-NA values in place:
+
+``` python
+In [129]: df1.update(df2)
+```
+
+
+
+## Timeseries friendly merging
+
+### Merging ordered data
+
+A [``merge_ordered()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_ordered.html#pandas.merge_ordered) function allows combining time series and other
+ordered data. In particular it has an optional ``fill_method`` keyword to
+fill/interpolate missing data:
+
+``` python
+In [130]: left = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
+ .....: 'lv': [1, 2, 3, 4],
+ .....: 's': ['a', 'b', 'c', 'd']})
+ .....:
+
+In [131]: right = pd.DataFrame({'k': ['K1', 'K2', 'K4'],
+ .....: 'rv': [1, 2, 3]})
+ .....:
+
+In [132]: pd.merge_ordered(left, right, fill_method='ffill', left_by='s')
+Out[132]:
+ k lv s rv
+0 K0 1.0 a NaN
+1 K1 1.0 a 1.0
+2 K2 1.0 a 2.0
+3 K4 1.0 a 3.0
+4 K1 2.0 b 1.0
+5 K2 2.0 b 2.0
+6 K4 2.0 b 3.0
+7 K1 3.0 c 1.0
+8 K2 3.0 c 2.0
+9 K4 3.0 c 3.0
+10 K1 NaN d 1.0
+11 K2 4.0 d 2.0
+12 K4 4.0 d 3.0
+```
+
+### Merging asof
+
+*New in version 0.19.0.*
+
+A [``merge_asof()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html#pandas.merge_asof) is similar to an ordered left-join except that we match on
+nearest key rather than equal keys. For each row in the ``left`` ``DataFrame``,
+we select the last row in the ``right`` ``DataFrame`` whose ``on`` key is less
+than the left’s key. Both DataFrames must be sorted by the key.
+
+Optionally an asof merge can perform a group-wise merge. This matches the
+``by`` key equally, in addition to the nearest match on the ``on`` key.
+
+For example, we might have ``trades`` and ``quotes`` and we want to ``asof``
+merge them.
+
+``` python
+In [133]: trades = pd.DataFrame({
+ .....: 'time': pd.to_datetime(['20160525 13:30:00.023',
+ .....: '20160525 13:30:00.038',
+ .....: '20160525 13:30:00.048',
+ .....: '20160525 13:30:00.048',
+ .....: '20160525 13:30:00.048']),
+ .....: 'ticker': ['MSFT', 'MSFT',
+ .....: 'GOOG', 'GOOG', 'AAPL'],
+ .....: 'price': [51.95, 51.95,
+ .....: 720.77, 720.92, 98.00],
+ .....: 'quantity': [75, 155,
+ .....: 100, 100, 100]},
+ .....: columns=['time', 'ticker', 'price', 'quantity'])
+ .....:
+
+In [134]: quotes = pd.DataFrame({
+ .....: 'time': pd.to_datetime(['20160525 13:30:00.023',
+ .....: '20160525 13:30:00.023',
+ .....: '20160525 13:30:00.030',
+ .....: '20160525 13:30:00.041',
+ .....: '20160525 13:30:00.048',
+ .....: '20160525 13:30:00.049',
+ .....: '20160525 13:30:00.072',
+ .....: '20160525 13:30:00.075']),
+ .....: 'ticker': ['GOOG', 'MSFT', 'MSFT',
+ .....: 'MSFT', 'GOOG', 'AAPL', 'GOOG',
+ .....: 'MSFT'],
+ .....: 'bid': [720.50, 51.95, 51.97, 51.99,
+ .....: 720.50, 97.99, 720.50, 52.01],
+ .....: 'ask': [720.93, 51.96, 51.98, 52.00,
+ .....: 720.93, 98.01, 720.88, 52.03]},
+ .....: columns=['time', 'ticker', 'bid', 'ask'])
+ .....:
+```
+
+``` python
+In [135]: trades
+Out[135]:
+ time ticker price quantity
+0 2016-05-25 13:30:00.023 MSFT 51.95 75
+1 2016-05-25 13:30:00.038 MSFT 51.95 155
+2 2016-05-25 13:30:00.048 GOOG 720.77 100
+3 2016-05-25 13:30:00.048 GOOG 720.92 100
+4 2016-05-25 13:30:00.048 AAPL 98.00 100
+
+In [136]: quotes
+Out[136]:
+ time ticker bid ask
+0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
+1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
+2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
+3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
+4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
+5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
+6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
+7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
+```
+
+By default we are taking the asof of the quotes.
+
+``` python
+In [137]: pd.merge_asof(trades, quotes,
+ .....: on='time',
+ .....: by='ticker')
+ .....:
+Out[137]:
+ time ticker price quantity bid ask
+0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
+1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
+2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
+3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
+4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
+```
+
+We only asof within ``2ms`` between the quote time and the trade time.
+
+``` python
+In [138]: pd.merge_asof(trades, quotes,
+ .....: on='time',
+ .....: by='ticker',
+ .....: tolerance=pd.Timedelta('2ms'))
+ .....:
+Out[138]:
+ time ticker price quantity bid ask
+0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
+1 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN
+2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
+3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
+4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
+```
+
+We only asof within ``10ms`` between the quote time and the trade time and we
+exclude exact matches on time. Note that though we exclude the exact matches
+(of the quotes), prior quotes **do** propagate to that point in time.
+
+``` python
+In [139]: pd.merge_asof(trades, quotes,
+ .....: on='time',
+ .....: by='ticker',
+ .....: tolerance=pd.Timedelta('10ms'),
+ .....: allow_exact_matches=False)
+ .....:
+Out[139]:
+ time ticker price quantity bid ask
+0 2016-05-25 13:30:00.023 MSFT 51.95 75 NaN NaN
+1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
+2 2016-05-25 13:30:00.048 GOOG 720.77 100 NaN NaN
+3 2016-05-25 13:30:00.048 GOOG 720.92 100 NaN NaN
+4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
+```
diff --git a/Python/pandas/user_guide/missing_data.md b/Python/pandas/user_guide/missing_data.md
new file mode 100644
index 00000000..e1db65e3
--- /dev/null
+++ b/Python/pandas/user_guide/missing_data.md
@@ -0,0 +1,1477 @@
+# Working with missing data
+
+In this section, we will discuss missing (also referred to as NA) values in
+pandas.
+
+::: tip Note
+
+The choice of using ``NaN`` internally to denote missing data was largely
+for simplicity and performance reasons. It differs from the MaskedArray
+approach of, for example, ``scikits.timeseries``. We are hopeful that
+NumPy will soon be able to provide a native NA type solution (similar to R)
+performant enough to be used in pandas.
+
+:::
+
+See the [cookbook](cookbook.html#cookbook-missing-data) for some advanced strategies.
+
+## Values considered “missing”
+
+As data comes in many shapes and forms, pandas aims to be flexible with regard
+to handling missing data. While ``NaN`` is the default missing value marker for
+reasons of computational speed and convenience, we need to be able to easily
+detect this value with data of different types: floating point, integer,
+boolean, and general object. In many cases, however, the Python ``None`` will
+arise and we wish to also consider that “missing” or “not available” or “NA”.
+
+::: tip Note
+
+If you want to consider ``inf`` and ``-inf`` to be “NA” in computations,
+you can set ``pandas.options.mode.use_inf_as_na = True``.
+
+:::
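+
+A small sketch of that option in action, using ``option_context`` so the
+setting does not leak outside the block:
+
+``` python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([1.0, np.inf, -np.inf])
+
+with pd.option_context('mode.use_inf_as_na', True):
+    print(s.isna())   # inf and -inf are now reported as missing
+```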
+
+``` python
+In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
+ ...: columns=['one', 'two', 'three'])
+ ...:
+
+In [2]: df['four'] = 'bar'
+
+In [3]: df['five'] = df['one'] > 0
+
+In [4]: df
+Out[4]:
+ one two three four five
+a 0.469112 -0.282863 -1.509059 bar True
+c -1.135632 1.212112 -0.173215 bar False
+e 0.119209 -1.044236 -0.861849 bar True
+f -2.104569 -0.494929 1.071804 bar False
+h 0.721555 -0.706771 -1.039575 bar True
+
+In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
+
+In [6]: df2
+Out[6]:
+ one two three four five
+a 0.469112 -0.282863 -1.509059 bar True
+b NaN NaN NaN NaN NaN
+c -1.135632 1.212112 -0.173215 bar False
+d NaN NaN NaN NaN NaN
+e 0.119209 -1.044236 -0.861849 bar True
+f -2.104569 -0.494929 1.071804 bar False
+g NaN NaN NaN NaN NaN
+h 0.721555 -0.706771 -1.039575 bar True
+```
+
+To make detecting missing values easier (and across different array dtypes),
+pandas provides the [``isna()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html#pandas.isna) and
+[``notna()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html#pandas.notna) functions, which are also methods on
+Series and DataFrame objects:
+
+``` python
+In [7]: df2['one']
+Out[7]:
+a 0.469112
+b NaN
+c -1.135632
+d NaN
+e 0.119209
+f -2.104569
+g NaN
+h 0.721555
+Name: one, dtype: float64
+
+In [8]: pd.isna(df2['one'])
+Out[8]:
+a False
+b True
+c False
+d True
+e False
+f False
+g True
+h False
+Name: one, dtype: bool
+
+In [9]: df2['four'].notna()
+Out[9]:
+a True
+b False
+c True
+d False
+e True
+f True
+g False
+h True
+Name: four, dtype: bool
+
+In [10]: df2.isna()
+Out[10]:
+ one two three four five
+a False False False False False
+b True True True True True
+c False False False False False
+d True True True True True
+e False False False False False
+f False False False False False
+g True True True True True
+h False False False False False
+```
+
+::: danger Warning
+
+One has to be mindful that in Python (and NumPy), ``nan`` values don’t compare equal, but ``None`` values **do**.
+Note that pandas/NumPy uses the fact that ``np.nan != np.nan``, and treats ``None`` like ``np.nan``.
+
+``` python
+In [11]: None == None # noqa: E711
+Out[11]: True
+
+In [12]: np.nan == np.nan
+Out[12]: False
+```
+
+So as compared to above, a scalar equality comparison versus a ``None/np.nan`` doesn’t provide useful information.
+
+``` python
+In [13]: df2['one'] == np.nan
+Out[13]:
+a False
+b False
+c False
+d False
+e False
+f False
+g False
+h False
+Name: one, dtype: bool
+```
+
+:::
+
+### Integer dtypes and missing data
+
+Because ``NaN`` is a float, a column of integers with even one missing value
+is cast to floating-point dtype (see [Support for integer NA](gotchas.html#gotchas-intna) for more). Pandas
+provides a nullable integer array, which can be used by explicitly requesting
+the dtype:
+
+``` python
+In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
+Out[14]:
+0 1
+1 2
+2 NaN
+3 4
+dtype: Int64
+```
+
+Alternatively, the string alias ``dtype='Int64'`` (note the capital ``"I"``) can be
+used.
+
+See [Nullable integer data type](integer_na.html#integer-na) for more.
+
+### Datetimes
+
+For datetime64[ns] types, ``NaT`` represents missing values. This is a pseudo-native
+sentinel value that can be represented by NumPy in a singular dtype (datetime64[ns]).
+pandas objects provide compatibility between ``NaT`` and ``NaN``.
+
+``` python
+In [15]: df2 = df.copy()
+
+In [16]: df2['timestamp'] = pd.Timestamp('20120101')
+
+In [17]: df2
+Out[17]:
+ one two three four five timestamp
+a 0.469112 -0.282863 -1.509059 bar True 2012-01-01
+c -1.135632 1.212112 -0.173215 bar False 2012-01-01
+e 0.119209 -1.044236 -0.861849 bar True 2012-01-01
+f -2.104569 -0.494929 1.071804 bar False 2012-01-01
+h 0.721555 -0.706771 -1.039575 bar True 2012-01-01
+
+In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan
+
+In [19]: df2
+Out[19]:
+ one two three four five timestamp
+a NaN -0.282863 -1.509059 bar True NaT
+c NaN 1.212112 -0.173215 bar False NaT
+e 0.119209 -1.044236 -0.861849 bar True 2012-01-01
+f -2.104569 -0.494929 1.071804 bar False 2012-01-01
+h NaN -0.706771 -1.039575 bar True NaT
+
+In [20]: df2.dtypes.value_counts()
+Out[20]:
+float64 3
+bool 1
+datetime64[ns] 1
+object 1
+dtype: int64
+```
+
+### Inserting missing data
+
+You can insert missing values by simply assigning to containers. The
+actual missing value used will be chosen based on the dtype.
+
+For example, numeric containers will always use ``NaN`` regardless of
+the missing value type chosen:
+
+``` python
+In [21]: s = pd.Series([1, 2, 3])
+
+In [22]: s.loc[0] = None
+
+In [23]: s
+Out[23]:
+0 NaN
+1 2.0
+2 3.0
+dtype: float64
+```
+
+Likewise, datetime containers will always use ``NaT``.
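+
+A minimal sketch of that behaviour (the series here is just an example):
+
+``` python
+s = pd.Series(pd.to_datetime(['2012-01-01', '2012-01-02', '2012-01-03']))
+
+s.loc[0] = None     # stored as NaT, not None
+s.loc[1] = np.nan   # also stored as NaT
+```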
+
+For object containers, pandas will use the value given:
+
+``` python
+In [24]: s = pd.Series(["a", "b", "c"])
+
+In [25]: s.loc[0] = None
+
+In [26]: s.loc[1] = np.nan
+
+In [27]: s
+Out[27]:
+0 None
+1 NaN
+2 c
+dtype: object
+```
+
+### Calculations with missing data
+
+Missing values propagate naturally through arithmetic operations between pandas
+objects.
+
+``` python
+In [28]: a
+Out[28]:
+ one two
+a NaN -0.282863
+c NaN 1.212112
+e 0.119209 -1.044236
+f -2.104569 -0.494929
+h -2.104569 -0.706771
+
+In [29]: b
+Out[29]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 1.212112 -0.173215
+e 0.119209 -1.044236 -0.861849
+f -2.104569 -0.494929 1.071804
+h NaN -0.706771 -1.039575
+
+In [30]: a + b
+Out[30]:
+ one three two
+a NaN NaN -0.565727
+c NaN NaN 2.424224
+e 0.238417 NaN -2.088472
+f -4.209138 NaN -0.989859
+h NaN NaN -1.413542
+```
+
+The descriptive statistics and computational methods discussed in the
+[data structure overview](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-stats) (and listed [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-stats) and [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#api-dataframe-stats)) are all written to
+account for missing data. For example:
+
+- When summing data, NA (missing) values will be treated as zero.
+- If the data are all NA, the result will be 0.
+- Cumulative methods like [``cumsum()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html#pandas.DataFrame.cumsum) and [``cumprod()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumprod.html#pandas.DataFrame.cumprod) ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use ``skipna=False``.
+
+``` python
+In [31]: df
+Out[31]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 1.212112 -0.173215
+e 0.119209 -1.044236 -0.861849
+f -2.104569 -0.494929 1.071804
+h NaN -0.706771 -1.039575
+
+In [32]: df['one'].sum()
+Out[32]: -1.9853605075978744
+
+In [33]: df.mean(1)
+Out[33]:
+a -0.895961
+c 0.519449
+e -0.595625
+f -0.509232
+h -0.873173
+dtype: float64
+
+In [34]: df.cumsum()
+Out[34]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 0.929249 -1.682273
+e 0.119209 -0.114987 -2.544122
+f -1.985361 -0.609917 -1.472318
+h NaN -1.316688 -2.511893
+
+In [35]: df.cumsum(skipna=False)
+Out[35]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 0.929249 -1.682273
+e NaN -0.114987 -2.544122
+f NaN -0.609917 -1.472318
+h NaN -1.316688 -2.511893
+```
+
+## Sum/prod of empties/nans
+
+::: danger Warning
+
+This behavior is now standard as of v0.22.0 and is consistent with the default in ``numpy``; previously sum/prod of all-NA or empty Series/DataFrames would return NaN.
+See [v0.22.0 whatsnew](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.22.0.html#whatsnew-0220) for more.
+
+:::
+
+The sum of an empty or all-NA Series or column of a DataFrame is 0.
+
+``` python
+In [36]: pd.Series([np.nan]).sum()
+Out[36]: 0.0
+
+In [37]: pd.Series([]).sum()
+Out[37]: 0.0
+```
+
+The product of an empty or all-NA Series or column of a DataFrame is 1.
+
+``` python
+In [38]: pd.Series([np.nan]).prod()
+Out[38]: 1.0
+
+In [39]: pd.Series([]).prod()
+Out[39]: 1.0
+```
+
+## NA values in GroupBy
+
+NA groups in GroupBy are automatically excluded. This behavior is consistent
+with R, for example:
+
+``` python
+In [40]: df
+Out[40]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 1.212112 -0.173215
+e 0.119209 -1.044236 -0.861849
+f -2.104569 -0.494929 1.071804
+h NaN -0.706771 -1.039575
+
+In [41]: df.groupby('one').mean()
+Out[41]:
+ two three
+one
+-2.104569 -0.494929 1.071804
+ 0.119209 -1.044236 -0.861849
+```
+
+See the groupby section [here](groupby.html#groupby-missing) for more information.
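+
+If you do need to keep rows with a missing key as their own group, one hedged
+workaround is to fill the key with a sentinel value before grouping (newer
+pandas versions also accept ``groupby(..., dropna=False)``, which is not
+assumed here):
+
+``` python
+# sketch: -999 is an arbitrary sentinel assumed not to occur in the data
+df.fillna({'one': -999}).groupby('one').mean()
+```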
+
+## Cleaning / filling missing data
+
+pandas objects are equipped with various data manipulation methods for dealing
+with missing data.
+
+## Filling missing values: fillna
+
+[``fillna()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna) can “fill in” NA values with non-NA data in a couple
+of ways, which we illustrate:
+
+**Replace NA with a scalar value**
+
+``` python
+In [42]: df2
+Out[42]:
+ one two three four five timestamp
+a NaN -0.282863 -1.509059 bar True NaT
+c NaN 1.212112 -0.173215 bar False NaT
+e 0.119209 -1.044236 -0.861849 bar True 2012-01-01
+f -2.104569 -0.494929 1.071804 bar False 2012-01-01
+h NaN -0.706771 -1.039575 bar True NaT
+
+In [43]: df2.fillna(0)
+Out[43]:
+ one two three four five timestamp
+a 0.000000 -0.282863 -1.509059 bar True 0
+c 0.000000 1.212112 -0.173215 bar False 0
+e 0.119209 -1.044236 -0.861849 bar True 2012-01-01 00:00:00
+f -2.104569 -0.494929 1.071804 bar False 2012-01-01 00:00:00
+h 0.000000 -0.706771 -1.039575 bar True 0
+
+In [44]: df2['one'].fillna('missing')
+Out[44]:
+a missing
+c missing
+e 0.119209
+f -2.10457
+h missing
+Name: one, dtype: object
+```
+
+**Fill gaps forward or backward**
+
+Using the same filling arguments as [reindexing](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-reindexing), we
+can propagate non-NA values forward or backward:
+
+``` python
+In [45]: df
+Out[45]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 1.212112 -0.173215
+e 0.119209 -1.044236 -0.861849
+f -2.104569 -0.494929 1.071804
+h NaN -0.706771 -1.039575
+
+In [46]: df.fillna(method='pad')
+Out[46]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 1.212112 -0.173215
+e 0.119209 -1.044236 -0.861849
+f -2.104569 -0.494929 1.071804
+h -2.104569 -0.706771 -1.039575
+```
+
+**Limit the amount of filling**
+
+If we only want consecutive gaps filled up to a certain number of data points,
+we can use the *limit* keyword:
+
+``` python
+In [47]: df
+Out[47]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 1.212112 -0.173215
+e NaN NaN NaN
+f NaN NaN NaN
+h NaN -0.706771 -1.039575
+
+In [48]: df.fillna(method='pad', limit=1)
+Out[48]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 1.212112 -0.173215
+e NaN 1.212112 -0.173215
+f NaN NaN NaN
+h NaN -0.706771 -1.039575
+```
+
+To remind you, these are the available filling methods:
+
+Method | Action
+---|---
+pad / ffill | Fill values forward
+bfill / backfill | Fill values backward
+
+With time series data, using pad/ffill is extremely common so that the “last
+known value” is available at every time point.
+
+[``ffill()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ffill.html#pandas.DataFrame.ffill) is equivalent to ``fillna(method='ffill')``
+and [``bfill()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bfill.html#pandas.DataFrame.bfill) is equivalent to ``fillna(method='bfill')``.
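+
+For instance, a minimal sketch using the ``df`` above:
+
+``` python
+df.ffill().equals(df.fillna(method='ffill'))   # True
+df.bfill().equals(df.fillna(method='bfill'))   # True
+```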
+
+## Filling with a PandasObject
+
+You can also fill missing values using a dict or Series, as long as it is alignable: the labels of the dict or the index of the Series
+must match the columns of the frame you wish to fill. A typical
+use case is to fill each column of a DataFrame with the mean of that column.
+
+``` python
+In [49]: dff = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
+
+In [50]: dff.iloc[3:5, 0] = np.nan
+
+In [51]: dff.iloc[4:6, 1] = np.nan
+
+In [52]: dff.iloc[5:8, 2] = np.nan
+
+In [53]: dff
+Out[53]:
+ A B C
+0 0.271860 -0.424972 0.567020
+1 0.276232 -1.087401 -0.673690
+2 0.113648 -1.478427 0.524988
+3 NaN 0.577046 -1.715002
+4 NaN NaN -1.157892
+5 -1.344312 NaN NaN
+6 -0.109050 1.643563 NaN
+7 0.357021 -0.674600 NaN
+8 -0.968914 -1.294524 0.413738
+9 0.276662 -0.472035 -0.013960
+
+In [54]: dff.fillna(dff.mean())
+Out[54]:
+ A B C
+0 0.271860 -0.424972 0.567020
+1 0.276232 -1.087401 -0.673690
+2 0.113648 -1.478427 0.524988
+3 -0.140857 0.577046 -1.715002
+4 -0.140857 -0.401419 -1.157892
+5 -1.344312 -0.401419 -0.293543
+6 -0.109050 1.643563 -0.293543
+7 0.357021 -0.674600 -0.293543
+8 -0.968914 -1.294524 0.413738
+9 0.276662 -0.472035 -0.013960
+
+In [55]: dff.fillna(dff.mean()['B':'C'])
+Out[55]:
+ A B C
+0 0.271860 -0.424972 0.567020
+1 0.276232 -1.087401 -0.673690
+2 0.113648 -1.478427 0.524988
+3 NaN 0.577046 -1.715002
+4 NaN -0.401419 -1.157892
+5 -1.344312 -0.401419 -0.293543
+6 -0.109050 1.643563 -0.293543
+7 0.357021 -0.674600 -0.293543
+8 -0.968914 -1.294524 0.413738
+9 0.276662 -0.472035 -0.013960
+```
+
+This gives the same result as above, but aligns the ‘fill’ value, which is
+a Series in this case.
+
+``` python
+In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
+Out[56]:
+ A B C
+0 0.271860 -0.424972 0.567020
+1 0.276232 -1.087401 -0.673690
+2 0.113648 -1.478427 0.524988
+3 -0.140857 0.577046 -1.715002
+4 -0.140857 -0.401419 -1.157892
+5 -1.344312 -0.401419 -0.293543
+6 -0.109050 1.643563 -0.293543
+7 0.357021 -0.674600 -0.293543
+8 -0.968914 -1.294524 0.413738
+9 0.276662 -0.472035 -0.013960
+```
+
+## Dropping axis labels with missing data: dropna
+
+You may wish to simply exclude labels from a data set which refer to missing
+data. To do this, use [``dropna()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html#pandas.DataFrame.dropna):
+
+``` python
+In [57]: df
+Out[57]:
+ one two three
+a NaN -0.282863 -1.509059
+c NaN 1.212112 -0.173215
+e NaN 0.000000 0.000000
+f NaN 0.000000 0.000000
+h NaN -0.706771 -1.039575
+
+In [58]: df.dropna(axis=0)
+Out[58]:
+Empty DataFrame
+Columns: [one, two, three]
+Index: []
+
+In [59]: df.dropna(axis=1)
+Out[59]:
+ two three
+a -0.282863 -1.509059
+c 1.212112 -0.173215
+e 0.000000 0.000000
+f 0.000000 0.000000
+h -0.706771 -1.039575
+
+In [60]: df['one'].dropna()
+Out[60]: Series([], Name: one, dtype: float64)
+```
+
+An equivalent [``dropna()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dropna.html#pandas.Series.dropna) is available for Series.
+DataFrame.dropna has considerably more options than Series.dropna, which can be
+examined [in the API](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#api-dataframe-missing).
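+
+A few of those extra ``DataFrame.dropna`` options, sketched on the ``df``
+above (each call returns a new object; nothing is modified in place):
+
+``` python
+df.dropna(how='all')        # drop a row only if *every* value in it is NA
+df.dropna(thresh=2)         # keep rows that have at least 2 non-NA values
+df.dropna(subset=['two'])   # only consider column 'two' when deciding to drop
+```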
+
+## Interpolation
+
+*New in version 0.23.0:* The ``limit_area`` keyword argument was added.
+
+Both Series and DataFrame objects have [``interpolate()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate)
+that, by default, performs linear interpolation at missing data points.
+
+``` python
+In [61]: ts
+Out[61]:
+2000-01-31 0.469112
+2000-02-29 NaN
+2000-03-31 NaN
+2000-04-28 NaN
+2000-05-31 NaN
+ ...
+2007-12-31 -6.950267
+2008-01-31 -7.904475
+2008-02-29 -6.441779
+2008-03-31 -8.184940
+2008-04-30 -9.011531
+Freq: BM, Length: 100, dtype: float64
+
+In [62]: ts.count()
+Out[62]: 66
+
+In [63]: ts.plot()
+Out[63]:
+```
+
+
+
+``` python
+In [64]: ts.interpolate()
+Out[64]:
+2000-01-31 0.469112
+2000-02-29 0.434469
+2000-03-31 0.399826
+2000-04-28 0.365184
+2000-05-31 0.330541
+ ...
+2007-12-31 -6.950267
+2008-01-31 -7.904475
+2008-02-29 -6.441779
+2008-03-31 -8.184940
+2008-04-30 -9.011531
+Freq: BM, Length: 100, dtype: float64
+
+In [65]: ts.interpolate().count()
+Out[65]: 100
+
+In [66]: ts.interpolate().plot()
+Out[66]:
+```
+
+
+
+Index-aware interpolation is available via the ``method`` keyword:
+
+``` python
+In [67]: ts2
+Out[67]:
+2000-01-31 0.469112
+2000-02-29 NaN
+2002-07-31 -5.785037
+2005-01-31 NaN
+2008-04-30 -9.011531
+dtype: float64
+
+In [68]: ts2.interpolate()
+Out[68]:
+2000-01-31 0.469112
+2000-02-29 -2.657962
+2002-07-31 -5.785037
+2005-01-31 -7.398284
+2008-04-30 -9.011531
+dtype: float64
+
+In [69]: ts2.interpolate(method='time')
+Out[69]:
+2000-01-31 0.469112
+2000-02-29 0.270241
+2002-07-31 -5.785037
+2005-01-31 -7.190866
+2008-04-30 -9.011531
+dtype: float64
+```
+
+For a floating-point index, use ``method='values'``:
+
+``` python
+In [70]: ser
+Out[70]:
+0.0 0.0
+1.0 NaN
+10.0 10.0
+dtype: float64
+
+In [71]: ser.interpolate()
+Out[71]:
+0.0 0.0
+1.0 5.0
+10.0 10.0
+dtype: float64
+
+In [72]: ser.interpolate(method='values')
+Out[72]:
+0.0 0.0
+1.0 1.0
+10.0 10.0
+dtype: float64
+```
+
+You can also interpolate with a DataFrame:
+
+``` python
+In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
+ ....: 'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
+ ....:
+
+In [74]: df
+Out[74]:
+ A B
+0 1.0 0.25
+1 2.1 NaN
+2 NaN NaN
+3 4.7 4.00
+4 5.6 12.20
+5 6.8 14.40
+
+In [75]: df.interpolate()
+Out[75]:
+ A B
+0 1.0 0.25
+1 2.1 1.50
+2 3.4 2.75
+3 4.7 4.00
+4 5.6 12.20
+5 6.8 14.40
+```
+
+The ``method`` argument gives access to fancier interpolation methods.
+If you have [scipy](http://www.scipy.org) installed, you can pass the name of a 1-d interpolation routine to ``method``.
+You’ll want to consult the full scipy interpolation [documentation](http://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation) and reference [guide](http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html) for details.
+The appropriate interpolation method will depend on the type of data you are working with.
+
+- If you are dealing with a time series that is growing at an increasing rate,
+``method='quadratic'`` may be appropriate.
+- If you have values approximating a cumulative distribution function,
+then ``method='pchip'`` should work well.
+- To fill missing values with the goal of smooth plotting, consider ``method='akima'``.
+
+::: danger Warning
+
+These methods require ``scipy``.
+
+:::
+
+``` python
+In [76]: df.interpolate(method='barycentric')
+Out[76]:
+ A B
+0 1.00 0.250
+1 2.10 -7.660
+2 3.53 -4.515
+3 4.70 4.000
+4 5.60 12.200
+5 6.80 14.400
+
+In [77]: df.interpolate(method='pchip')
+Out[77]:
+ A B
+0 1.00000 0.250000
+1 2.10000 0.672808
+2 3.43454 1.928950
+3 4.70000 4.000000
+4 5.60000 12.200000
+5 6.80000 14.400000
+
+In [78]: df.interpolate(method='akima')
+Out[78]:
+ A B
+0 1.000000 0.250000
+1 2.100000 -0.873316
+2 3.406667 0.320034
+3 4.700000 4.000000
+4 5.600000 12.200000
+5 6.800000 14.400000
+```
+
+When interpolating via a polynomial or spline approximation, you must also specify
+the degree or order of the approximation:
+
+``` python
+In [79]: df.interpolate(method='spline', order=2)
+Out[79]:
+ A B
+0 1.000000 0.250000
+1 2.100000 -0.428598
+2 3.404545 1.206900
+3 4.700000 4.000000
+4 5.600000 12.200000
+5 6.800000 14.400000
+
+In [80]: df.interpolate(method='polynomial', order=2)
+Out[80]:
+ A B
+0 1.000000 0.250000
+1 2.100000 -2.703846
+2 3.451351 -1.453846
+3 4.700000 4.000000
+4 5.600000 12.200000
+5 6.800000 14.400000
+```
+
+Compare several methods:
+
+``` python
+In [81]: np.random.seed(2)
+
+In [82]: ser = pd.Series(np.arange(1, 10.1, .25) ** 2 + np.random.randn(37))
+
+In [83]: missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
+
+In [84]: ser[missing] = np.nan
+
+In [85]: methods = ['linear', 'quadratic', 'cubic']
+
+In [86]: df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})
+
+In [87]: df.plot()
+Out[87]:
+```
+
+
+
+Another use case is interpolation at *new* values.
+Suppose you have 100 observations from some distribution, and you’re
+particularly interested in what’s happening around the middle.
+You can mix pandas’ ``reindex`` and ``interpolate`` methods to interpolate
+at the new values.
+
+``` python
+In [88]: ser = pd.Series(np.sort(np.random.uniform(size=100)))
+
+# interpolate at new_index
+In [89]: new_index = ser.index | pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75])
+
+In [90]: interp_s = ser.reindex(new_index).interpolate(method='pchip')
+
+In [91]: interp_s[49:51]
+Out[91]:
+49.00 0.471410
+49.25 0.476841
+49.50 0.481780
+49.75 0.485998
+50.00 0.489266
+50.25 0.491814
+50.50 0.493995
+50.75 0.495763
+51.00 0.497074
+dtype: float64
+```
+
+### Interpolation limits
+
+Like other pandas fill methods, [``interpolate()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate) accepts a ``limit`` keyword
+argument. Use this argument to limit the number of consecutive ``NaN`` values
+filled since the last valid observation:
+
+``` python
+In [92]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan,
+ ....: np.nan, 13, np.nan, np.nan])
+ ....:
+
+In [93]: ser
+Out[93]:
+0 NaN
+1 NaN
+2 5.0
+3 NaN
+4 NaN
+5 NaN
+6 13.0
+7 NaN
+8 NaN
+dtype: float64
+
+# fill all consecutive values in a forward direction
+In [94]: ser.interpolate()
+Out[94]:
+0 NaN
+1 NaN
+2 5.0
+3 7.0
+4 9.0
+5 11.0
+6 13.0
+7 13.0
+8 13.0
+dtype: float64
+
+# fill one consecutive value in a forward direction
+In [95]: ser.interpolate(limit=1)
+Out[95]:
+0 NaN
+1 NaN
+2 5.0
+3 7.0
+4 NaN
+5 NaN
+6 13.0
+7 13.0
+8 NaN
+dtype: float64
+```
+
+By default, ``NaN`` values are filled in a ``forward`` direction. Use the
+``limit_direction`` parameter to fill ``backward`` or from ``both`` directions.
+
+``` python
+# fill one consecutive value backwards
+In [96]: ser.interpolate(limit=1, limit_direction='backward')
+Out[96]:
+0 NaN
+1 5.0
+2 5.0
+3 NaN
+4 NaN
+5 11.0
+6 13.0
+7 NaN
+8 NaN
+dtype: float64
+
+# fill one consecutive value in both directions
+In [97]: ser.interpolate(limit=1, limit_direction='both')
+Out[97]:
+0 NaN
+1 5.0
+2 5.0
+3 7.0
+4 NaN
+5 11.0
+6 13.0
+7 13.0
+8 NaN
+dtype: float64
+
+# fill all consecutive values in both directions
+In [98]: ser.interpolate(limit_direction='both')
+Out[98]:
+0 5.0
+1 5.0
+2 5.0
+3 7.0
+4 9.0
+5 11.0
+6 13.0
+7 13.0
+8 13.0
+dtype: float64
+```
+
+By default, ``NaN`` values are filled whether they are inside (surrounded by)
+existing valid values or outside them. The ``limit_area`` parameter, introduced
+in v0.23, restricts filling to either inside or outside values.
+
+``` python
+# fill one consecutive inside value in both directions
+In [99]: ser.interpolate(limit_direction='both', limit_area='inside', limit=1)
+Out[99]:
+0 NaN
+1 NaN
+2 5.0
+3 7.0
+4 NaN
+5 11.0
+6 13.0
+7 NaN
+8 NaN
+dtype: float64
+
+# fill all consecutive outside values backward
+In [100]: ser.interpolate(limit_direction='backward', limit_area='outside')
+Out[100]:
+0 5.0
+1 5.0
+2 5.0
+3 NaN
+4 NaN
+5 NaN
+6 13.0
+7 NaN
+8 NaN
+dtype: float64
+
+# fill all consecutive outside values in both directions
+In [101]: ser.interpolate(limit_direction='both', limit_area='outside')
+Out[101]:
+0 5.0
+1 5.0
+2 5.0
+3 NaN
+4 NaN
+5 NaN
+6 13.0
+7 13.0
+8 13.0
+dtype: float64
+```
+
+## Replacing generic values
+
+Often we want to replace arbitrary values with other values.
+
+[``replace()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html#pandas.Series.replace) in Series and [``replace()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace) in DataFrame provide an efficient yet
+flexible way to perform such replacements.
+
+For a Series, you can replace a single value or a list of values by another
+value:
+
+``` python
+In [102]: ser = pd.Series([0., 1., 2., 3., 4.])
+
+In [103]: ser.replace(0, 5)
+Out[103]:
+0 5.0
+1 1.0
+2 2.0
+3 3.0
+4 4.0
+dtype: float64
+```
+
+You can replace a list of values by a list of other values:
+
+``` python
+In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
+Out[104]:
+0 4.0
+1 3.0
+2 2.0
+3 1.0
+4 0.0
+dtype: float64
+```
+
+You can also specify a mapping dict:
+
+``` python
+In [105]: ser.replace({0: 10, 1: 100})
+Out[105]:
+0 10.0
+1 100.0
+2 2.0
+3 3.0
+4 4.0
+dtype: float64
+```
+
+For a DataFrame, you can specify individual values by column:
+
+``` python
+In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})
+
+In [107]: df.replace({'a': 0, 'b': 5}, 100)
+Out[107]:
+ a b
+0 100 100
+1 1 6
+2 2 7
+3 3 8
+4 4 9
+```
+
+Instead of replacing with specified values, you can treat all given values as
+missing and interpolate over them:
+
+``` python
+In [108]: ser.replace([1, 2, 3], method='pad')
+Out[108]:
+0 0.0
+1 0.0
+2 0.0
+3 0.0
+4 4.0
+dtype: float64
+```
+
+## String/regular expression replacement
+
+::: tip Note
+
+Python strings prefixed with the ``r`` character such as ``r'hello world'``
+are so-called “raw” strings. They have different semantics regarding
+backslashes than strings without this prefix. Backslashes in raw strings
+are taken literally rather than starting an escape sequence, e.g., ``r'\n' == '\\n'``. You
+should [read about them](https://docs.python.org/3/reference/lexical_analysis.html#string-literals)
+if this is unclear.
+
+:::
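+
+A tiny sketch of the difference described in the note above:
+
+``` python
+print(len(r'\n'), len('\n'))   # 2 1 -- the raw string keeps the backslash
+print(r'\n' == '\\n')          # True
+```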
+
+Replace the ‘.’ with ``NaN`` (str -> str):
+
+``` python
+In [109]: d = {'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', np.nan, 'd']}
+
+In [110]: df = pd.DataFrame(d)
+
+In [111]: df.replace('.', np.nan)
+Out[111]:
+ a b c
+0 0 a a
+1 1 b b
+2 2 NaN NaN
+3 3 NaN d
+```
+
+Now do it with a regular expression that removes surrounding whitespace
+(regex -> regex):
+
+``` python
+In [112]: df.replace(r'\s*\.\s*', np.nan, regex=True)
+Out[112]:
+ a b c
+0 0 a a
+1 1 b b
+2 2 NaN NaN
+3 3 NaN d
+```
+
+Replace a few different values (list -> list):
+
+``` python
+In [113]: df.replace(['a', '.'], ['b', np.nan])
+Out[113]:
+ a b c
+0 0 b b
+1 1 b b
+2 2 NaN NaN
+3 3 NaN d
+```
+
+list of regex -> list of regex:
+
+``` python
+In [114]: df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)
+Out[114]:
+ a b c
+0 0 astuff astuff
+1 1 b b
+2 2 dot NaN
+3 3 dot d
+```
+
+Only search in column ``'b'`` (dict -> dict):
+
+``` python
+In [115]: df.replace({'b': '.'}, {'b': np.nan})
+Out[115]:
+ a b c
+0 0 a a
+1 1 b b
+2 2 NaN NaN
+3 3 NaN d
+```
+
+Same as the previous example, but use a regular expression for
+searching instead (dict of regex -> dict):
+
+``` python
+In [116]: df.replace({'b': r'\s*\.\s*'}, {'b': np.nan}, regex=True)
+Out[116]:
+ a b c
+0 0 a a
+1 1 b b
+2 2 NaN NaN
+3 3 NaN d
+```
+
+You can pass nested dictionaries of regular expressions that use ``regex=True``:
+
+``` python
+In [117]: df.replace({'b': {'b': r''}}, regex=True)
+Out[117]:
+ a b c
+0 0 a a
+1 1 b
+2 2 . NaN
+3 3 . d
+```
+
+Alternatively, you can pass the nested dictionary like so:
+
+``` python
+In [118]: df.replace(regex={'b': {r'\s*\.\s*': np.nan}})
+Out[118]:
+ a b c
+0 0 a a
+1 1 b b
+2 2 NaN NaN
+3 3 NaN d
+```
+
+You can also use the group of a regular expression match when replacing (dict
+of regex -> dict of regex); this works for lists as well.
+
+``` python
+In [119]: df.replace({'b': r'\s*(\.)\s*'}, {'b': r'\1ty'}, regex=True)
+Out[119]:
+ a b c
+0 0 a a
+1 1 b b
+2 2 .ty NaN
+3 3 .ty d
+```
+
+You can pass a list of regular expressions, of which those that match
+will be replaced with a scalar (list of regex -> regex).
+
+``` python
+In [120]: df.replace([r'\s*\.\s*', r'a|b'], np.nan, regex=True)
+Out[120]:
+ a b c
+0 0 NaN NaN
+1 1 NaN NaN
+2 2 NaN NaN
+3 3 NaN d
+```
+
+All of the regular expression examples can also be passed via the ``regex``
+argument instead of the ``to_replace`` argument. In this case the ``value``
+argument must be passed explicitly by name, or ``regex`` must be a nested
+dictionary. The previous example would then be:
+
+``` python
+In [121]: df.replace(regex=[r'\s*\.\s*', r'a|b'], value=np.nan)
+Out[121]:
+ a b c
+0 0 NaN NaN
+1 1 NaN NaN
+2 2 NaN NaN
+3 3 NaN d
+```
+
+This can be convenient if you do not want to pass ``regex=True`` every time you
+want to use a regular expression.
+
+::: tip Note
+
+Anywhere in the above ``replace`` examples where you see a regular expression,
+a compiled regular expression is valid as well.
+
+:::
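+
+For example, the dict-of-regex example above could be written with a
+pre-compiled pattern (a sketch; the behaviour is assumed to match the string
+form):
+
+``` python
+import re
+
+pattern = re.compile(r'\s*\.\s*')
+df.replace({'b': pattern}, {'b': np.nan}, regex=True)
+```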
+
+## Numeric replacement
+
+[``replace()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace) is similar to [``fillna()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna).
+
+``` python
+In [122]: df = pd.DataFrame(np.random.randn(10, 2))
+
+In [123]: df[np.random.rand(df.shape[0]) > 0.5] = 1.5
+
+In [124]: df.replace(1.5, np.nan)
+Out[124]:
+ 0 1
+0 -0.844214 -1.021415
+1 0.432396 -0.323580
+2 0.423825 0.799180
+3 1.262614 0.751965
+4 NaN NaN
+5 NaN NaN
+6 -0.498174 -1.060799
+7 0.591667 -0.183257
+8 1.019855 -1.482465
+9 NaN NaN
+```
+
+Replacing more than one value is possible by passing a list.
+
+``` python
+In [125]: df00 = df.iloc[0, 0]
+
+In [126]: df.replace([1.5, df00], [np.nan, 'a'])
+Out[126]:
+ 0 1
+0 a -1.02141
+1 0.432396 -0.32358
+2 0.423825 0.79918
+3 1.26261 0.751965
+4 NaN NaN
+5 NaN NaN
+6 -0.498174 -1.0608
+7 0.591667 -0.183257
+8 1.01985 -1.48247
+9 NaN NaN
+
+In [127]: df[1].dtype
+Out[127]: dtype('float64')
+```
+
+You can also operate on the DataFrame in place:
+
+``` python
+In [128]: df.replace(1.5, np.nan, inplace=True)
+```
+
+::: danger Warning
+
+When replacing multiple ``bool`` or ``datetime64`` objects, the first
+argument to ``replace`` (``to_replace``) must match the type of the value
+being replaced. For example,
+
+``` python
+>>> s = pd.Series([True, False, True])
+>>> s.replace({'a string': 'new value', True: False}) # raises
+TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
+```
+
+will raise a ``TypeError`` because one of the ``dict`` keys is not of the
+correct type for replacement.
+
+However, when replacing a *single* object such as,
+
+``` python
+In [129]: s = pd.Series([True, False, True])
+
+In [130]: s.replace('a string', 'another string')
+Out[130]:
+0 True
+1 False
+2 True
+dtype: bool
+```
+
+the original ``NDFrame`` object will be returned untouched. We’re working on
+unifying this API, but for backwards compatibility reasons we cannot break
+the latter behavior. See [GH6354](https://github.com/pandas-dev/pandas/issues/6354) for more details.
+
+:::
+
+## Missing data casting rules and indexing
+
+While pandas supports storing arrays of integer and boolean type, these types
+are not capable of storing missing data. Until we can switch to using a native
+NA type in NumPy, we’ve established some “casting rules”. When a reindexing
+operation introduces missing data, the Series will be cast according to the
+rules introduced in the table below.
+
+data type | Cast to
+---|---
+integer | float
+boolean | object
+float | no cast
+object | no cast
+
+For example:
+
+``` python
+In [131]: s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])
+
+In [132]: s > 0
+Out[132]:
+0 True
+2 True
+4 True
+6 True
+7 True
+dtype: bool
+
+In [133]: (s > 0).dtype
+Out[133]: dtype('bool')
+
+In [134]: crit = (s > 0).reindex(list(range(8)))
+
+In [135]: crit
+Out[135]:
+0 True
+1 NaN
+2 True
+3 NaN
+4 True
+5 NaN
+6 True
+7 True
+dtype: object
+
+In [136]: crit.dtype
+Out[136]: dtype('O')
+```
+
+Ordinarily NumPy will complain if you try to use an object array (even if it
+contains boolean values) instead of a boolean array to get or set values from
+an ndarray (e.g. selecting values based on some criteria). If a boolean vector
+contains NAs, an exception will be generated:
+
+``` python
+In [137]: reindexed = s.reindex(list(range(8))).fillna(0)
+
+In [138]: reindexed[crit]
+---------------------------------------------------------------------------
+ValueError Traceback (most recent call last)
+ in
+----> 1 reindexed[crit]
+
+/pandas/pandas/core/series.py in __getitem__(self, key)
+ 1101 key = list(key)
+ 1102
+-> 1103 if com.is_bool_indexer(key):
+ 1104 key = check_bool_indexer(self.index, key)
+ 1105
+
+/pandas/pandas/core/common.py in is_bool_indexer(key)
+ 128 if not lib.is_bool_array(key):
+ 129 if isna(key).any():
+--> 130 raise ValueError(na_msg)
+ 131 return False
+ 132 return True
+
+ValueError: cannot index with vector containing NA / NaN values
+```
+
+However, these can be filled in using [``fillna()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna) and it will work fine:
+
+``` python
+In [139]: reindexed[crit.fillna(False)]
+Out[139]:
+0 0.126504
+2 0.696198
+4 0.697416
+6 0.601516
+7 0.003659
+dtype: float64
+
+In [140]: reindexed[crit.fillna(True)]
+Out[140]:
+0 0.126504
+1 0.000000
+2 0.696198
+3 0.000000
+4 0.697416
+5 0.000000
+6 0.601516
+7 0.003659
+dtype: float64
+```
+
+Pandas provides a nullable integer dtype, but you must explicitly request it
+when creating the series or column. Notice that we use a capital “I” in
+the ``dtype="Int64"``.
+
+``` python
+In [141]: s = pd.Series([0, 1, np.nan, 3, 4], dtype="Int64")
+
+In [142]: s
+Out[142]:
+0 0
+1 1
+2 NaN
+3 3
+4 4
+dtype: Int64
+```
+
+See [Nullable integer data type](integer_na.html#integer-na) for more.
diff --git a/Python/pandas/user_guide/options.md b/Python/pandas/user_guide/options.md
new file mode 100644
index 00000000..533cf766
--- /dev/null
+++ b/Python/pandas/user_guide/options.md
@@ -0,0 +1,711 @@
+# Options and settings
+
+## Overview
+
+pandas has an options system that lets you customize some aspects of its behaviour,
+display-related options being those the user is most likely to adjust.
+
+Options have a full “dotted-style”, case-insensitive name (e.g. ``display.max_rows``).
+You can get/set options directly as attributes of the top-level ``options`` attribute:
+
+``` python
+In [1]: import pandas as pd
+
+In [2]: pd.options.display.max_rows
+Out[2]: 15
+
+In [3]: pd.options.display.max_rows = 999
+
+In [4]: pd.options.display.max_rows
+Out[4]: 999
+```
+
+The API is composed of 5 relevant functions, available directly from the ``pandas``
+namespace:
+
+- [``get_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_option.html#pandas.get_option) / [``set_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html#pandas.set_option) - get/set the value of a single option.
+- [``reset_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.reset_option.html#pandas.reset_option) - reset one or more options to their default value.
+- [``describe_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.describe_option.html#pandas.describe_option) - print the descriptions of one or more options.
+- [``option_context()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.option_context.html#pandas.option_context) - execute a codeblock with a set of options
+that revert to prior settings after execution.
+
+**Note:** Developers can check out [pandas/core/config.py](https://github.com/pandas-dev/pandas/blob/master/pandas/core/config.py) for more information.
+
+All of the functions above accept a regexp pattern (``re.search`` style) as an argument,
+and so passing in a substring will work - as long as it is unambiguous:
+
+``` python
+In [5]: pd.get_option("display.max_rows")
+Out[5]: 999
+
+In [6]: pd.set_option("display.max_rows", 101)
+
+In [7]: pd.get_option("display.max_rows")
+Out[7]: 101
+
+In [8]: pd.set_option("max_r", 102)
+
+In [9]: pd.get_option("display.max_rows")
+Out[9]: 102
+```
+
+The following will **not work** because it matches multiple option names, e.g.
+``display.max_colwidth``, ``display.max_rows``, ``display.max_columns``:
+
+``` python
+In [10]: try:
+ ....: pd.get_option("column")
+ ....: except KeyError as e:
+ ....: print(e)
+ ....:
+'Pattern matched multiple keys'
+```
+
+**Note:** Using this form of shorthand may cause your code to break if new options with similar names are added in future versions.
+
+You can get a list of available options and their descriptions with ``describe_option``. When called
+with no argument ``describe_option`` will print out the descriptions for all available options.
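+
+For example (the output is a plain-text description and is omitted here):
+
+``` python
+# describe a single option ...
+pd.describe_option('display.max_rows')
+
+# ... or every option whose name matches the given pattern
+pd.describe_option('max_')
+```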
+
+## Getting and setting options
+
+As described above, [``get_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_option.html#pandas.get_option) and [``set_option()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html#pandas.set_option)
+are available from the pandas namespace. To change an option, call
+``set_option('option regex', new_value)``.
+
+``` python
+In [11]: pd.get_option('mode.sim_interactive')
+Out[11]: False
+
+In [12]: pd.set_option('mode.sim_interactive', True)
+
+In [13]: pd.get_option('mode.sim_interactive')
+Out[13]: True
+```
+
+**Note:** The option ‘mode.sim_interactive’ is mostly used for debugging purposes.
+
+Every option also has a default value, and you can use ``reset_option`` to revert to it:
+
+``` python
+In [14]: pd.get_option("display.max_rows")
+Out[14]: 60
+
+In [15]: pd.set_option("display.max_rows", 999)
+
+In [16]: pd.get_option("display.max_rows")
+Out[16]: 999
+
+In [17]: pd.reset_option("display.max_rows")
+
+In [18]: pd.get_option("display.max_rows")
+Out[18]: 60
+```
+
+It’s also possible to reset multiple options at once (using a regex):
+
+``` python
+In [19]: pd.reset_option("^display")
+```
+
+The ``option_context`` context manager has been exposed through
+the top-level API, allowing you to execute code with given option values. Option values
+are restored automatically when you exit the *with* block:
+
+``` python
+In [20]: with pd.option_context("display.max_rows", 10, "display.max_columns", 5):
+ ....: print(pd.get_option("display.max_rows"))
+ ....: print(pd.get_option("display.max_columns"))
+ ....:
+10
+5
+
+In [21]: print(pd.get_option("display.max_rows"))
+60
+
+In [22]: print(pd.get_option("display.max_columns"))
+0
+```
+
+## Setting startup options in Python/IPython environment
+
+Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas more efficient. To do this, create a .py or .ipy script in the startup directory of the desired profile. An example where the startup folder is in a default ipython profile can be found at:
+
+```
+$IPYTHONDIR/profile_default/startup
+```
+
+More information can be found in the [ipython documentation](https://ipython.org/ipython-doc/stable/interactive/tutorial.html#startup-files). An example startup script for pandas is displayed below:
+
+``` python
+import pandas as pd
+pd.set_option('display.max_rows', 999)
+pd.set_option('precision', 5)
+```
+
+## Frequently Used Options
+
+The following is a walk-through of the more frequently used display options.
+
+``display.max_rows`` and ``display.max_columns`` set the maximum number
+of rows and columns displayed when a frame is pretty-printed. Truncated
+lines are replaced by an ellipsis.
+
+``` python
+In [23]: df = pd.DataFrame(np.random.randn(7, 2))
+
+In [24]: pd.set_option('max_rows', 7)
+
+In [25]: df
+Out[25]:
+ 0 1
+0 0.469112 -0.282863
+1 -1.509059 -1.135632
+2 1.212112 -0.173215
+3 0.119209 -1.044236
+4 -0.861849 -2.104569
+5 -0.494929 1.071804
+6 0.721555 -0.706771
+
+In [26]: pd.set_option('max_rows', 5)
+
+In [27]: df
+Out[27]:
+ 0 1
+0 0.469112 -0.282863
+1 -1.509059 -1.135632
+.. ... ...
+5 -0.494929 1.071804
+6 0.721555 -0.706771
+
+[7 rows x 2 columns]
+
+In [28]: pd.reset_option('max_rows')
+```
+
+Once ``display.max_rows`` is exceeded, the ``display.min_rows`` option
+determines how many rows are shown in the truncated repr.
+
+``` python
+In [29]: pd.set_option('max_rows', 8)
+
+In [30]: pd.set_option('min_rows', 4)
+
+# below max_rows -> all rows shown
+In [31]: df = pd.DataFrame(np.random.randn(7, 2))
+
+In [32]: df
+Out[32]:
+ 0 1
+0 -1.039575 0.271860
+1 -0.424972 0.567020
+.. ... ...
+5 0.404705 0.577046
+6 -1.715002 -1.039268
+
+[7 rows x 2 columns]
+
+# above max_rows -> only min_rows (4) rows shown
+In [33]: df = pd.DataFrame(np.random.randn(9, 2))
+
+In [34]: df
+Out[34]:
+ 0 1
+0 -0.370647 -1.157892
+1 -1.344312 0.844885
+.. ... ...
+7 0.276662 -0.472035
+8 -0.013960 -0.362543
+
+[9 rows x 2 columns]
+
+In [35]: pd.reset_option('max_rows')
+
+In [36]: pd.reset_option('min_rows')
+```
+
+``display.expand_frame_repr`` controls whether the full repr of a wide
+DataFrame is printed across multiple lines, wrapping around in "pages" when
+its width exceeds ``display.width``.
+
+``` python
+In [37]: df = pd.DataFrame(np.random.randn(5, 10))
+
+In [38]: pd.set_option('expand_frame_repr', True)
+
+In [39]: df
+Out[39]:
+ 0 1 2 3 4 5 6 7 8 9
+0 -0.006154 -0.923061 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
+1 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 1.607920
+2 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747 -0.410001 -0.078638 0.545952 -1.219217
+3 -1.226825 0.769804 -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734 0.959726 -1.110336
+4 -0.619976 0.149748 -0.732339 0.687738 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
+
+In [40]: pd.set_option('expand_frame_repr', False)
+
+In [41]: df
+Out[41]:
+ 0 1 2 3 4 5 6 7 8 9
+0 -0.006154 -0.923061 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
+1 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 1.607920
+2 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747 -0.410001 -0.078638 0.545952 -1.219217
+3 -1.226825 0.769804 -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734 0.959726 -1.110336
+4 -0.619976 0.149748 -0.732339 0.687738 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
+
+In [42]: pd.reset_option('expand_frame_repr')
+```
+
+``display.large_repr`` lets you select whether to display dataframes that exceed
+``max_columns`` or ``max_rows`` as a truncated frame, or as a summary.
+
+``` python
+In [43]: df = pd.DataFrame(np.random.randn(10, 10))
+
+In [44]: pd.set_option('max_rows', 5)
+
+In [45]: pd.set_option('large_repr', 'truncate')
+
+In [46]: df
+Out[46]:
+ 0 1 2 3 4 5 6 7 8 9
+0 -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232 0.690579 0.995761 2.396780 0.014871
+1 3.357427 -0.317441 -1.236269 0.896171 -0.487602 -0.082240 -2.182937 0.380396 0.084844 0.432390
+.. ... ... ... ... ... ... ... ... ... ...
+8 -0.303421 -0.858447 0.306996 -0.028665 0.384316 1.574159 1.588931 0.476720 0.473424 -0.242861
+9 -0.014805 -0.284319 0.650776 -1.461665 -1.137707 -0.891060 -0.693921 1.613616 0.464000 0.227371
+
+[10 rows x 10 columns]
+
+In [47]: pd.set_option('large_repr', 'info')
+
+In [48]: df
+Out[48]:
+
+RangeIndex: 10 entries, 0 to 9
+Data columns (total 10 columns):
+0 10 non-null float64
+1 10 non-null float64
+2 10 non-null float64
+3 10 non-null float64
+4 10 non-null float64
+5 10 non-null float64
+6 10 non-null float64
+7 10 non-null float64
+8 10 non-null float64
+9 10 non-null float64
+dtypes: float64(10)
+memory usage: 928.0 bytes
+
+In [49]: pd.reset_option('large_repr')
+
+In [50]: pd.reset_option('max_rows')
+```
+
+``display.max_colwidth`` sets the maximum width of columns. Cells
+of this length or longer will be truncated with an ellipsis.
+
+``` python
+In [51]: df = pd.DataFrame(np.array([['foo', 'bar', 'bim', 'uncomfortably long string'],
+ ....: ['horse', 'cow', 'banana', 'apple']]))
+ ....:
+
+In [52]: pd.set_option('max_colwidth', 40)
+
+In [53]: df
+Out[53]:
+ 0 1 2 3
+0 foo bar bim uncomfortably long string
+1 horse cow banana apple
+
+In [54]: pd.set_option('max_colwidth', 6)
+
+In [55]: df
+Out[55]:
+ 0 1 2 3
+0 foo bar bim un...
+1 horse cow ba... apple
+
+In [56]: pd.reset_option('max_colwidth')
+```
+
+``display.max_info_columns`` sets a threshold for the number of columns below
+which per-column information is shown by ``df.info()``.
+
+``` python
+In [57]: df = pd.DataFrame(np.random.randn(10, 10))
+
+In [58]: pd.set_option('max_info_columns', 11)
+
+In [59]: df.info()
+
+RangeIndex: 10 entries, 0 to 9
+Data columns (total 10 columns):
+0 10 non-null float64
+1 10 non-null float64
+2 10 non-null float64
+3 10 non-null float64
+4 10 non-null float64
+5 10 non-null float64
+6 10 non-null float64
+7 10 non-null float64
+8 10 non-null float64
+9 10 non-null float64
+dtypes: float64(10)
+memory usage: 928.0 bytes
+
+In [60]: pd.set_option('max_info_columns', 5)
+
+In [61]: df.info()
+
+RangeIndex: 10 entries, 0 to 9
+Columns: 10 entries, 0 to 9
+dtypes: float64(10)
+memory usage: 928.0 bytes
+
+In [62]: pd.reset_option('max_info_columns')
+```
+
+``display.max_info_rows``: ``df.info()`` will usually show null-counts for each column.
+For large frames this can be quite slow. ``max_info_rows`` and ``max_info_columns``
+limit this null check to frames with smaller dimensions than specified. Note that you
+can pass ``df.info(null_counts=True)`` to override this for a particular frame.
+
+``` python
+In [63]: df = pd.DataFrame(np.random.choice([0, 1, np.nan], size=(10, 10)))
+
+In [64]: df
+Out[64]:
+ 0 1 2 3 4 5 6 7 8 9
+0 0.0 NaN 1.0 NaN NaN 0.0 NaN 0.0 NaN 1.0
+1 1.0 NaN 1.0 1.0 1.0 1.0 NaN 0.0 0.0 NaN
+2 0.0 NaN 1.0 0.0 0.0 NaN NaN NaN NaN 0.0
+3 NaN NaN NaN 0.0 1.0 1.0 NaN 1.0 NaN 1.0
+4 0.0 NaN NaN NaN 0.0 NaN NaN NaN 1.0 0.0
+5 0.0 1.0 1.0 1.0 1.0 0.0 NaN NaN 1.0 0.0
+6 1.0 1.0 1.0 NaN 1.0 NaN 1.0 0.0 NaN NaN
+7 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN
+8 NaN NaN NaN 0.0 NaN NaN NaN NaN 1.0 NaN
+9 0.0 NaN 0.0 NaN NaN 0.0 NaN 1.0 1.0 0.0
+
+In [65]: pd.set_option('max_info_rows', 11)
+
+In [66]: df.info()
+
+RangeIndex: 10 entries, 0 to 9
+Data columns (total 10 columns):
+0 8 non-null float64
+1 3 non-null float64
+2 7 non-null float64
+3 6 non-null float64
+4 7 non-null float64
+5 6 non-null float64
+6 2 non-null float64
+7 6 non-null float64
+8 6 non-null float64
+9 6 non-null float64
+dtypes: float64(10)
+memory usage: 928.0 bytes
+
+In [67]: pd.set_option('max_info_rows', 5)
+
+In [68]: df.info()
+
+RangeIndex: 10 entries, 0 to 9
+Data columns (total 10 columns):
+0 float64
+1 float64
+2 float64
+3 float64
+4 float64
+5 float64
+6 float64
+7 float64
+8 float64
+9 float64
+dtypes: float64(10)
+memory usage: 928.0 bytes
+
+In [69]: pd.reset_option('max_info_rows')
+```
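+
+For example, a short sketch (not part of the original session) of the
+``null_counts`` override mentioned above, reusing the same ``df``:
+
+``` python
+pd.set_option('max_info_rows', 5)
+
+# force per-column null counts even though df has more rows than max_info_rows
+df.info(null_counts=True)
+
+pd.reset_option('max_info_rows')
+```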
+
+``display.precision`` sets the output display precision in terms of decimal places.
+This is only a suggestion.
+
+``` python
+In [70]: df = pd.DataFrame(np.random.randn(5, 5))
+
+In [71]: pd.set_option('precision', 7)
+
+In [72]: df
+Out[72]:
+ 0 1 2 3 4
+0 -1.1506406 -0.7983341 -0.5576966 0.3813531 1.3371217
+1 -1.5310949 1.3314582 -0.5713290 -0.0266708 -1.0856630
+2 -1.1147378 -0.0582158 -0.4867681 1.6851483 0.1125723
+3 -1.4953086 0.8984347 -0.1482168 -1.5960698 0.1596530
+4 0.2621358 0.0362196 0.1847350 -0.2550694 -0.2710197
+
+In [73]: pd.set_option('precision', 4)
+
+In [74]: df
+Out[74]:
+ 0 1 2 3 4
+0 -1.1506 -0.7983 -0.5577 0.3814 1.3371
+1 -1.5311 1.3315 -0.5713 -0.0267 -1.0857
+2 -1.1147 -0.0582 -0.4868 1.6851 0.1126
+3 -1.4953 0.8984 -0.1482 -1.5961 0.1597
+4 0.2621 0.0362 0.1847 -0.2551 -0.2710
+```
+
+``display.chop_threshold`` sets the level at which pandas rounds to zero when
+it displays a Series or DataFrame. This setting does not change the
+precision at which the number is stored.
+
+``` python
+In [75]: df = pd.DataFrame(np.random.randn(6, 6))
+
+In [76]: pd.set_option('chop_threshold', 0)
+
+In [77]: df
+Out[77]:
+ 0 1 2 3 4 5
+0 1.2884 0.2946 -1.1658 0.8470 -0.6856 0.6091
+1 -0.3040 0.6256 -0.0593 0.2497 1.1039 -1.0875
+2 1.9980 -0.2445 0.1362 0.8863 -1.3507 -0.8863
+3 -1.0133 1.9209 -0.3882 -2.3144 0.6655 0.4026
+4 0.3996 -1.7660 0.8504 0.3881 0.9923 0.7441
+5 -0.7398 -1.0549 -0.1796 0.6396 1.5850 1.9067
+
+In [78]: pd.set_option('chop_threshold', .5)
+
+In [79]: df
+Out[79]:
+ 0 1 2 3 4 5
+0 1.2884 0.0000 -1.1658 0.8470 -0.6856 0.6091
+1 0.0000 0.6256 0.0000 0.0000 1.1039 -1.0875
+2 1.9980 0.0000 0.0000 0.8863 -1.3507 -0.8863
+3 -1.0133 1.9209 0.0000 -2.3144 0.6655 0.0000
+4 0.0000 -1.7660 0.8504 0.0000 0.9923 0.7441
+5 -0.7398 -1.0549 0.0000 0.6396 1.5850 1.9067
+
+In [80]: pd.reset_option('chop_threshold')
+```
+
+``display.colheader_justify`` controls the justification of the headers.
+The options are ‘right’, and ‘left’.
+
+``` python
+In [81]: df = pd.DataFrame(np.array([np.random.randn(6),
+ ....: np.random.randint(1, 9, 6) * .1,
+ ....: np.zeros(6)]).T,
+ ....: columns=['A', 'B', 'C'], dtype='float')
+ ....:
+
+In [82]: pd.set_option('colheader_justify', 'right')
+
+In [83]: df
+Out[83]:
+ A B C
+0 0.1040 0.1 0.0
+1 0.1741 0.5 0.0
+2 -0.4395 0.4 0.0
+3 -0.7413 0.8 0.0
+4 -0.0797 0.4 0.0
+5 -0.9229 0.3 0.0
+
+In [84]: pd.set_option('colheader_justify', 'left')
+
+In [85]: df
+Out[85]:
+ A B C
+0 0.1040 0.1 0.0
+1 0.1741 0.5 0.0
+2 -0.4395 0.4 0.0
+3 -0.7413 0.8 0.0
+4 -0.0797 0.4 0.0
+5 -0.9229 0.3 0.0
+
+In [86]: pd.reset_option('colheader_justify')
+```
+
+## Available options
+
+Option | Default | Function
+---|---|---
+display.chop_threshold | None | If set to a float value, all float values smaller than the given threshold will be displayed as exactly 0 by repr and friends.
+display.colheader_justify | right | Controls the justification of column headers. Used by DataFrameFormatter.
+display.column_space | 12 | No description available.
+display.date_dayfirst | False | When True, prints and parses dates with the day first, eg 20/01/2005
+display.date_yearfirst | False | When True, prints and parses dates with the year first, eg 2005/01/20
+display.encoding | UTF-8 | Defaults to the detected encoding of the console. Specifies the encoding to be used for strings returned by to_string, these are generally strings meant to be displayed on the console.
+display.expand_frame_repr | True | Whether to print out the full DataFrame repr for wide DataFrames across multiple lines, max_columns is still respected, but the output will wrap-around across multiple “pages” if its width exceeds display.width.
+display.float_format | None | The callable should accept a floating point number and return a string with the desired format of the number. This is used in some places like SeriesFormatter. See core.format.EngFormatter for an example.
+display.large_repr | truncate | For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can show a truncated table (the default), or switch to the view from df.info() (the behaviour in earlier versions of pandas). allowable settings, [‘truncate’, ‘info’]
+display.latex.repr | False | Whether to produce a latex DataFrame representation for jupyter frontends that support it.
+display.latex.escape | True | Escapes special characters in DataFrames, when using the to_latex method.
+display.latex.longtable | False | Specifies if the to_latex method of a DataFrame uses the longtable format.
+display.latex.multicolumn | True | Combines columns when using a MultiIndex
+display.latex.multicolumn_format | ‘l’ | Alignment of multicolumn labels
+display.latex.multirow | False | Combines rows when using a MultiIndex. Centered instead of top-aligned, separated by clines.
+display.max_columns | 0 or 20 | max_rows and max_columns are used in __repr__() methods to decide if to_string() or info() is used to render an object to a string. In case Python/IPython is running in a terminal this is set to 0 by default and pandas will correctly auto-detect the width of the terminal and switch to a smaller format in case all columns would not fit vertically. The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to do correct auto-detection, in which case the default is set to 20. ‘None’ value means unlimited.
+display.max_colwidth | 50 | The maximum width in characters of a column in the repr of a pandas data structure. When the column overflows, a “…” placeholder is embedded in the output.
+display.max_info_columns | 100 | max_info_columns is used in DataFrame.info method to decide if per column information will be printed.
+display.max_info_rows | 1690785 | df.info() will usually show null-counts for each column. For large frames this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller dimensions than specified.
+display.max_rows | 60 | This sets the maximum number of rows pandas should output when printing out various output. For example, this value determines whether the repr() for a dataframe prints out fully or just a truncated or summary repr. ‘None’ value means unlimited.
+display.min_rows | 10 | The numbers of rows to show in a truncated repr (when max_rows is exceeded). Ignored when max_rows is set to None or 0. When set to None, follows the value of max_rows.
+display.max_seq_items | 100 | When pretty-printing a long sequence, no more than max_seq_items will be printed. If items are omitted, they will be denoted by the addition of “…” to the resulting string. If set to None, the number of items to be printed is unlimited.
+display.memory_usage | True | This specifies if the memory usage of a DataFrame should be displayed when the df.info() method is invoked.
+display.multi_sparse | True | “Sparsify” MultiIndex display (don’t display repeated elements in outer levels within groups)
+display.notebook_repr_html | True | When True, IPython notebook will use html representation for pandas objects (if it is available).
+display.pprint_nest_depth | 3 | Controls the number of nested levels to process when pretty-printing
+display.precision | 6 | Floating point output precision in terms of number of places after the decimal, for regular formatting as well as scientific notation. Similar to numpy’s precision print option
+display.show_dimensions | truncate | Whether to print out dimensions at the end of DataFrame repr. If ‘truncate’ is specified, only print out the dimensions if the frame is truncated (e.g. not display all rows and/or columns)
+display.width | 80 | Width of the display in characters. In case python/IPython is running in a terminal this can be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width.
+display.html.table_schema | False | Whether to publish a Table Schema representation for frontends that support it.
+display.html.border | 1 | A ``border=value`` attribute is inserted in the ``<table>`` tag for the DataFrame HTML repr.
+display.html.use_mathjax | True | When True, Jupyter notebook will process table contents using MathJax, rendering mathematical expressions enclosed by the dollar symbol.
+io.excel.xls.writer | xlwt | The default Excel writer engine for ‘xls’ files.
+io.excel.xlsm.writer | openpyxl | The default Excel writer engine for ‘xlsm’ files. Available options: ‘openpyxl’ (the default).
+io.excel.xlsx.writer | openpyxl | The default Excel writer engine for ‘xlsx’ files.
+io.hdf.default_format | None | Default format for writing. If None, then put will default to ‘fixed' and append will default to ‘table'.
+io.hdf.dropna_table | True | drop ALL nan rows when appending to a table
+io.parquet.engine | None | The engine to use as a default for parquet reading and writing. If None then try ‘pyarrow’ and ‘fastparquet’
+mode.chained_assignment | warn | Controls SettingWithCopyWarning: ‘raise’, ‘warn’, or None. Raise an exception, warn, or no action if trying to use [chained assignment](indexing.html#indexing-evaluation-order).
+mode.sim_interactive | False | Whether to simulate interactive mode for purposes of testing.
+mode.use_inf_as_na | False | True means treat None, NaN, -INF, INF as NA (old way), False means None and NaN are null, but INF, -INF are not NA (new way).
+compute.use_bottleneck | True | Use the bottleneck library to accelerate computation if it is installed.
+compute.use_numexpr | True | Use the numexpr library to accelerate computation if it is installed.
+plotting.backend | matplotlib | Change the plotting backend to a different backend than the current matplotlib one. Backends can be implemented as third-party libraries implementing the pandas plotting API. They can use other plotting libraries like Bokeh, Altair, etc.
+plotting.matplotlib.register_converters | True | Register custom converters with matplotlib. Set to False to de-register.
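+
+As a minimal sketch (not part of the original docs), any option in the table
+above can be inspected without changing it via ``get_option`` and
+``describe_option``:
+
+``` python
+import pandas as pd
+
+pd.get_option('display.max_rows')       # current value, e.g. 60 by default
+pd.describe_option('display.max_rows')  # prints the description and the default
+```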
+
+## Number formatting
+
+pandas also allows you to set how numbers are displayed in the console.
+This is not done through the ``set_option`` API.
+
+Use the ``set_eng_float_format`` function
+to alter the floating-point formatting of pandas objects to produce a particular
+format.
+
+For instance:
+
+``` python
+In [87]: import numpy as np
+
+In [88]: pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)
+
+In [89]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
+
+In [90]: s / 1.e3
+Out[90]:
+a 303.638u
+b -721.084u
+c -622.696u
+d 648.250u
+e -1.945m
+dtype: float64
+
+In [91]: s / 1.e6
+Out[91]:
+a 303.638n
+b -721.084n
+c -622.696n
+d 648.250n
+e -1.945u
+dtype: float64
+```
+
+To round floats on a case-by-case basis, you can also use [``Series.round()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.round.html#pandas.Series.round) and [``DataFrame.round()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.round.html#pandas.DataFrame.round).
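+
+A small sketch (reusing ``s`` and ``df`` from the session above) of rounding
+per call rather than changing the global display precision:
+
+``` python
+s.round(3)           # Series.round
+df.round({'A': 2})   # DataFrame.round, decimals given per column
+```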
+
+## Unicode formatting
+
+::: danger Warning
+
+Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times slower).
+Use only when it is actually required.
+
+:::
+
+Some East Asian countries use Unicode characters whose width corresponds to two Latin characters.
+If a DataFrame or Series contains these characters, the default output mode may not align them properly.
+
+
+``` python
+In [92]: df = pd.DataFrame({'国籍': ['UK', '日本'], '名前': ['Alice', 'しのぶ']})
+
+In [93]: df
+Out[93]:
+ 国籍 名前
+0 UK Alice
+1 日本 しのぶ
+```
+
+
+
+Enabling ``display.unicode.east_asian_width`` allows pandas to check each character’s “East Asian Width” property.
+These characters can be aligned properly by setting this option to ``True``. However, this will result in longer render
+times than the standard ``len`` function.
+
+``` python
+In [94]: pd.set_option('display.unicode.east_asian_width', True)
+
+In [95]: df
+Out[95]:
+ 国籍 名前
+0 UK Alice
+1 日本 しのぶ
+```
+
+
+
+In addition, Unicode characters whose width is “Ambiguous” can either be 1 or 2 characters wide depending on the
+terminal setting or encoding. The option ``display.unicode.ambiguous_as_wide`` can be used to handle the ambiguity.
+
+By default, an “Ambiguous” character’s width, such as “¡” (inverted exclamation) in the example below, is taken to be 1.
+
+``` python
+In [96]: df = pd.DataFrame({'a': ['xxx', '¡¡'], 'b': ['yyy', '¡¡']})
+
+In [97]: df
+Out[97]:
+ a b
+0 xxx yyy
+1 ¡¡ ¡¡
+```
+
+
+
+Enabling ``display.unicode.ambiguous_as_wide`` makes pandas interpret these characters’ widths to be 2.
+(Note that this option will only be effective when ``display.unicode.east_asian_width`` is enabled.)
+
+However, setting this option incorrectly for your terminal will cause these characters to be aligned incorrectly:
+
+``` python
+In [98]: pd.set_option('display.unicode.ambiguous_as_wide', True)
+
+In [99]: df
+Out[99]:
+ a b
+0 xxx yyy
+1 ¡¡ ¡¡
+```
+
+
+
+## Table schema display
+
+*New in version 0.20.0.*
+
+``DataFrame`` and ``Series`` can publish a Table Schema representation.
+This is disabled by default (``False``), but can be enabled globally with the
+``display.html.table_schema`` option:
+
+``` python
+In [100]: pd.set_option('display.html.table_schema', True)
+```
+
+Only ``'display.max_rows'`` rows are serialized and published.
diff --git a/Python/pandas/user_guide/reshaping.md b/Python/pandas/user_guide/reshaping.md
new file mode 100644
index 00000000..1448831e
--- /dev/null
+++ b/Python/pandas/user_guide/reshaping.md
@@ -0,0 +1,1520 @@
+# Reshaping and pivot tables
+
+## Reshaping by pivoting DataFrame objects
+
+
+
+Data is often stored in so-called “stacked” or “record” format:
+
+``` python
+In [1]: df
+Out[1]:
+ date variable value
+0 2000-01-03 A 0.469112
+1 2000-01-04 A -0.282863
+2 2000-01-05 A -1.509059
+3 2000-01-03 B -1.135632
+4 2000-01-04 B 1.212112
+5 2000-01-05 B -0.173215
+6 2000-01-03 C 0.119209
+7 2000-01-04 C -1.044236
+8 2000-01-05 C -0.861849
+9 2000-01-03 D -2.104569
+10 2000-01-04 D -0.494929
+11 2000-01-05 D 1.071804
+```
+
+For the curious, here is how the above ``DataFrame`` was created:
+
+``` python
+import numpy as np
+import pandas as pd
+import pandas.util.testing as tm
+
+tm.N = 3
+
+
+def unpivot(frame):
+ N, K = frame.shape
+ data = {'value': frame.to_numpy().ravel('F'),
+ 'variable': np.asarray(frame.columns).repeat(N),
+ 'date': np.tile(np.asarray(frame.index), K)}
+ return pd.DataFrame(data, columns=['date', 'variable', 'value'])
+
+
+df = unpivot(tm.makeTimeDataFrame())
+```
+
+To select out everything for variable ``A`` we could do:
+
+``` python
+In [2]: df[df['variable'] == 'A']
+Out[2]:
+ date variable value
+0 2000-01-03 A 0.469112
+1 2000-01-04 A -0.282863
+2 2000-01-05 A -1.509059
+```
+
+But suppose we wish to do time series operations with the variables. A better
+representation would be where the ``columns`` are the unique variables and an
+``index`` of dates identifies individual observations. To reshape the data into
+this form, we use the [``DataFrame.pivot()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot) method (also implemented as a
+top level function [``pivot()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html#pandas.pivot)):
+
+``` python
+In [3]: df.pivot(index='date', columns='variable', values='value')
+Out[3]:
+variable A B C D
+date
+2000-01-03 0.469112 -1.135632 0.119209 -2.104569
+2000-01-04 -0.282863 1.212112 -1.044236 -0.494929
+2000-01-05 -1.509059 -0.173215 -0.861849 1.071804
+```
+
+If the ``values`` argument is omitted, and the input ``DataFrame`` has more than
+one column of values which are not used as column or index inputs to ``pivot``,
+then the resulting “pivoted” ``DataFrame`` will have [hierarchical columns](advanced.html#advanced-hierarchical) whose topmost level indicates the respective value
+column:
+
+``` python
+In [4]: df['value2'] = df['value'] * 2
+
+In [5]: pivoted = df.pivot(index='date', columns='variable')
+
+In [6]: pivoted
+Out[6]:
+ value value2
+variable A B C D A B C D
+date
+2000-01-03 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265 0.238417 -4.209138
+2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 -0.565727 2.424224 -2.088472 -0.989859
+2000-01-05 -1.509059 -0.173215 -0.861849 1.071804 -3.018117 -0.346429 -1.723698 2.143608
+```
+
+You can then select subsets from the pivoted ``DataFrame``:
+
+``` python
+In [7]: pivoted['value2']
+Out[7]:
+variable A B C D
+date
+2000-01-03 0.938225 -2.271265 0.238417 -4.209138
+2000-01-04 -0.565727 2.424224 -2.088472 -0.989859
+2000-01-05 -3.018117 -0.346429 -1.723698 2.143608
+```
+
+Note that this returns a view on the underlying data in the case where the data
+are homogeneously-typed.
+
+::: tip Note
+
+[``pivot()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html#pandas.pivot) will error with a ``ValueError: Index contains duplicate
+entries, cannot reshape`` if the index/column pair is not unique. In this
+case, consider using [``pivot_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html#pandas.pivot_table) which is a generalization
+of pivot that can handle duplicate values for one index/column pair.
+
+:::
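+
+A constructed sketch of the situation described in the note (the frame here is
+hypothetical, not from the docs): a duplicated index/column pair makes
+``pivot`` raise, while ``pivot_table`` aggregates the duplicates.
+
+``` python
+dup = pd.DataFrame({'date': ['2000-01-03', '2000-01-03'],
+                    'variable': ['A', 'A'],
+                    'value': [1.0, 2.0]})
+
+# dup.pivot(index='date', columns='variable', values='value')  # raises ValueError
+dup.pivot_table(index='date', columns='variable',
+                values='value', aggfunc='mean')                 # -> A: 1.5
+```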
+
+## Reshaping by stacking and unstacking
+
+
+
+Closely related to the [``pivot()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot) method are the
+[``stack()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.stack) and [``unstack()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html#pandas.DataFrame.unstack) methods available on
+``Series`` and ``DataFrame``. These methods are designed to work together with
+``MultiIndex`` objects (see the section on [hierarchical indexing](advanced.html#advanced-hierarchical)). Here are essentially what these methods do:
+
+- ``stack``: “pivot” a level of the (possibly hierarchical) column labels,
+returning a ``DataFrame`` with an index with a new inner-most level of row
+labels.
+- ``unstack``: (inverse operation of ``stack``) “pivot” a level of the
+(possibly hierarchical) row index to the column axis, producing a reshaped
+``DataFrame`` with a new inner-most level of column labels.
+
+
+
+The clearest way to explain is by example. Let’s take a prior example data set
+from the hierarchical indexing section:
+
+``` python
+In [8]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
+ ...: 'foo', 'foo', 'qux', 'qux'],
+ ...: ['one', 'two', 'one', 'two',
+ ...: 'one', 'two', 'one', 'two']]))
+ ...:
+
+In [9]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
+
+In [10]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
+
+In [11]: df2 = df[:4]
+
+In [12]: df2
+Out[12]:
+ A B
+first second
+bar one 0.721555 -0.706771
+ two -1.039575 0.271860
+baz one -0.424972 0.567020
+ two 0.276232 -1.087401
+```
+
+The ``stack`` function “compresses” a level in the ``DataFrame``’s columns to
+produce either:
+
+- A ``Series``, in the case of a simple column Index.
+- A ``DataFrame``, in the case of a ``MultiIndex`` in the columns.
+
+If the columns have a ``MultiIndex``, you can choose which level to stack. The
+stacked level becomes the new lowest level in a ``MultiIndex`` on the columns:
+
+``` python
+In [13]: stacked = df2.stack()
+
+In [14]: stacked
+Out[14]:
+first second
+bar one A 0.721555
+ B -0.706771
+ two A -1.039575
+ B 0.271860
+baz one A -0.424972
+ B 0.567020
+ two A 0.276232
+ B -1.087401
+dtype: float64
+```
+
+With a “stacked” ``DataFrame`` or ``Series`` (having a ``MultiIndex`` as the
+``index``), the inverse operation of ``stack`` is ``unstack``, which by default
+unstacks the **last level**:
+
+``` python
+In [15]: stacked.unstack()
+Out[15]:
+ A B
+first second
+bar one 0.721555 -0.706771
+ two -1.039575 0.271860
+baz one -0.424972 0.567020
+ two 0.276232 -1.087401
+
+In [16]: stacked.unstack(1)
+Out[16]:
+second one two
+first
+bar A 0.721555 -1.039575
+ B -0.706771 0.271860
+baz A -0.424972 0.276232
+ B 0.567020 -1.087401
+
+In [17]: stacked.unstack(0)
+Out[17]:
+first bar baz
+second
+one A 0.721555 -0.424972
+ B -0.706771 0.567020
+two A -1.039575 0.276232
+ B 0.271860 -1.087401
+```
+
+
+
+If the indexes have names, you can use the level names instead of specifying
+the level numbers:
+
+``` python
+In [18]: stacked.unstack('second')
+Out[18]:
+second one two
+first
+bar A 0.721555 -1.039575
+ B -0.706771 0.271860
+baz A -0.424972 0.276232
+ B 0.567020 -1.087401
+```
+
+
+
+Notice that the ``stack`` and ``unstack`` methods implicitly sort the index
+levels involved. Hence a call to ``stack`` and then ``unstack``, or vice versa,
+will result in a **sorted** copy of the original ``DataFrame`` or ``Series``:
+
+``` python
+In [19]: index = pd.MultiIndex.from_product([[2, 1], ['a', 'b']])
+
+In [20]: df = pd.DataFrame(np.random.randn(4), index=index, columns=['A'])
+
+In [21]: df
+Out[21]:
+ A
+2 a -0.370647
+ b -1.157892
+1 a -1.344312
+ b 0.844885
+
+In [22]: all(df.unstack().stack() == df.sort_index())
+Out[22]: True
+```
+
+The above code will raise a ``TypeError`` if the call to ``sort_index`` is
+removed.
+
+### Multiple levels
+
+You may also stack or unstack more than one level at a time by passing a list
+of levels, in which case the end result is as if each level in the list were
+processed individually.
+
+``` python
+In [23]: columns = pd.MultiIndex.from_tuples([
+ ....: ('A', 'cat', 'long'), ('B', 'cat', 'long'),
+ ....: ('A', 'dog', 'short'), ('B', 'dog', 'short')],
+ ....: names=['exp', 'animal', 'hair_length']
+ ....: )
+ ....:
+
+In [24]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)
+
+In [25]: df
+Out[25]:
+exp A B A B
+animal cat cat dog dog
+hair_length long long short short
+0 1.075770 -0.109050 1.643563 -1.469388
+1 0.357021 -0.674600 -1.776904 -0.968914
+2 -1.294524 0.413738 0.276662 -0.472035
+3 -0.013960 -0.362543 -0.006154 -0.923061
+
+In [26]: df.stack(level=['animal', 'hair_length'])
+Out[26]:
+exp A B
+ animal hair_length
+0 cat long 1.075770 -0.109050
+ dog short 1.643563 -1.469388
+1 cat long 0.357021 -0.674600
+ dog short -1.776904 -0.968914
+2 cat long -1.294524 0.413738
+ dog short 0.276662 -0.472035
+3 cat long -0.013960 -0.362543
+ dog short -0.006154 -0.923061
+```
+
+The list of levels can contain either level names or level numbers (but
+not a mixture of the two).
+
+``` python
+# df.stack(level=['animal', 'hair_length'])
+# from above is equivalent to:
+In [27]: df.stack(level=[1, 2])
+Out[27]:
+exp A B
+ animal hair_length
+0 cat long 1.075770 -0.109050
+ dog short 1.643563 -1.469388
+1 cat long 0.357021 -0.674600
+ dog short -1.776904 -0.968914
+2 cat long -1.294524 0.413738
+ dog short 0.276662 -0.472035
+3 cat long -0.013960 -0.362543
+ dog short -0.006154 -0.923061
+```
+
+### Missing data
+
+These functions are intelligent about handling missing data and do not expect
+each subgroup within the hierarchical index to have the same set of labels.
+They also can handle the index being unsorted (but you can make it sorted by
+calling ``sort_index``, of course). Here is a more complex example:
+
+``` python
+In [28]: columns = pd.MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
+ ....: ('B', 'cat'), ('A', 'dog')],
+ ....: names=['exp', 'animal'])
+ ....:
+
+In [29]: index = pd.MultiIndex.from_product([('bar', 'baz', 'foo', 'qux'),
+ ....: ('one', 'two')],
+ ....: names=['first', 'second'])
+ ....:
+
+In [30]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
+
+In [31]: df2 = df.iloc[[0, 1, 2, 4, 5, 7]]
+
+In [32]: df2
+Out[32]:
+exp A B A
+animal cat dog cat dog
+first second
+bar one 0.895717 0.805244 -1.206412 2.565646
+ two 1.431256 1.340309 -1.170299 -0.226169
+baz one 0.410835 0.813850 0.132003 -0.827317
+foo one -1.413681 1.607920 1.024180 0.569605
+ two 0.875906 -2.211372 0.974466 -2.006747
+qux two -1.226825 0.769804 -1.281247 -0.727707
+```
+
+As mentioned above, ``stack`` can be called with a ``level`` argument to select
+which level in the columns to stack:
+
+``` python
+In [33]: df2.stack('exp')
+Out[33]:
+animal cat dog
+first second exp
+bar one A 0.895717 2.565646
+ B -1.206412 0.805244
+ two A 1.431256 -0.226169
+ B -1.170299 1.340309
+baz one A 0.410835 -0.827317
+ B 0.132003 0.813850
+foo one A -1.413681 0.569605
+ B 1.024180 1.607920
+ two A 0.875906 -2.006747
+ B 0.974466 -2.211372
+qux two A -1.226825 -0.727707
+ B -1.281247 0.769804
+
+In [34]: df2.stack('animal')
+Out[34]:
+exp A B
+first second animal
+bar one cat 0.895717 -1.206412
+ dog 2.565646 0.805244
+ two cat 1.431256 -1.170299
+ dog -0.226169 1.340309
+baz one cat 0.410835 0.132003
+ dog -0.827317 0.813850
+foo one cat -1.413681 1.024180
+ dog 0.569605 1.607920
+ two cat 0.875906 0.974466
+ dog -2.006747 -2.211372
+qux two cat -1.226825 -1.281247
+ dog -0.727707 0.769804
+```
+
+Unstacking can result in missing values if subgroups do not have the same
+set of labels. By default, missing values will be replaced with the default
+fill value for that data type, ``NaN`` for float, ``NaT`` for datetimelike,
+etc. For integer types, by default data will be converted to float and missing
+values will be set to ``NaN``.
+
+``` python
+In [35]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]
+
+In [36]: df3
+Out[36]:
+exp B
+animal dog cat
+first second
+bar one 0.805244 -1.206412
+ two 1.340309 -1.170299
+foo one 1.607920 1.024180
+qux two 0.769804 -1.281247
+
+In [37]: df3.unstack()
+Out[37]:
+exp B
+animal dog cat
+second one two one two
+first
+bar 0.805244 1.340309 -1.206412 -1.170299
+foo 1.607920 NaN 1.024180 NaN
+qux NaN 0.769804 NaN -1.281247
+```
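+
+The integer upcasting mentioned above can be seen in a small constructed
+sketch (not one of the original examples):
+
+``` python
+idx = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y'), ('b', 'x')])
+pd.Series([1, 2, 3], index=idx).unstack()
+# the missing ('b', 'y') combination forces a float result:
+#      x    y
+# a  1.0  2.0
+# b  3.0  NaN
+```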
+
+*New in version 0.18.0.*
+
+Alternatively, unstack takes an optional ``fill_value`` argument, for specifying
+the value of missing data.
+
+``` python
+In [38]: df3.unstack(fill_value=-1e9)
+Out[38]:
+exp B
+animal dog cat
+second one two one two
+first
+bar 8.052440e-01 1.340309e+00 -1.206412e+00 -1.170299e+00
+foo 1.607920e+00 -1.000000e+09 1.024180e+00 -1.000000e+09
+qux -1.000000e+09 7.698036e-01 -1.000000e+09 -1.281247e+00
+```
+
+### With a MultiIndex
+
+Unstacking when the columns are a ``MultiIndex`` is also careful about doing
+the right thing:
+
+``` python
+In [39]: df[:3].unstack(0)
+Out[39]:
+exp A B A
+animal cat dog cat dog
+first bar baz bar baz bar baz bar baz
+second
+one 0.895717 0.410835 0.805244 0.81385 -1.206412 0.132003 2.565646 -0.827317
+two 1.431256 NaN 1.340309 NaN -1.170299 NaN -0.226169 NaN
+
+In [40]: df2.unstack(1)
+Out[40]:
+exp A B A
+animal cat dog cat dog
+second one two one two one two one two
+first
+bar 0.895717 1.431256 0.805244 1.340309 -1.206412 -1.170299 2.565646 -0.226169
+baz 0.410835 NaN 0.813850 NaN 0.132003 NaN -0.827317 NaN
+foo -1.413681 0.875906 1.607920 -2.211372 1.024180 0.974466 0.569605 -2.006747
+qux NaN -1.226825 NaN 0.769804 NaN -1.281247 NaN -0.727707
+```
+
+## Reshaping by Melt
+
+
+
+The top-level [``melt()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html#pandas.melt) function and the corresponding [``DataFrame.melt()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html#pandas.DataFrame.melt)
+are useful to massage a ``DataFrame`` into a format where one or more columns
+are *identifier variables*, while all other columns, considered *measured
+variables*, are “unpivoted” to the row axis, leaving just two non-identifier
+columns, “variable” and “value”. The names of those columns can be customized
+by supplying the ``var_name`` and ``value_name`` parameters.
+
+For instance,
+
+``` python
+In [41]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
+ ....: 'last': ['Doe', 'Bo'],
+ ....: 'height': [5.5, 6.0],
+ ....: 'weight': [130, 150]})
+ ....:
+
+In [42]: cheese
+Out[42]:
+ first last height weight
+0 John Doe 5.5 130
+1 Mary Bo 6.0 150
+
+In [43]: cheese.melt(id_vars=['first', 'last'])
+Out[43]:
+ first last variable value
+0 John Doe height 5.5
+1 Mary Bo height 6.0
+2 John Doe weight 130.0
+3 Mary Bo weight 150.0
+
+In [44]: cheese.melt(id_vars=['first', 'last'], var_name='quantity')
+Out[44]:
+ first last quantity value
+0 John Doe height 5.5
+1 Mary Bo height 6.0
+2 John Doe weight 130.0
+3 Mary Bo weight 150.0
+```
+
+Another way to transform is to use the [``wide_to_long()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html#pandas.wide_to_long) panel data
+convenience function. It is less flexible than [``melt()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html#pandas.melt), but more
+user-friendly.
+
+``` python
+In [45]: dft = pd.DataFrame({"A1970": {0: "a", 1: "b", 2: "c"},
+ ....: "A1980": {0: "d", 1: "e", 2: "f"},
+ ....: "B1970": {0: 2.5, 1: 1.2, 2: .7},
+ ....: "B1980": {0: 3.2, 1: 1.3, 2: .1},
+ ....: "X": dict(zip(range(3), np.random.randn(3)))
+ ....: })
+ ....:
+
+In [46]: dft["id"] = dft.index
+
+In [47]: dft
+Out[47]:
+ A1970 A1980 B1970 B1980 X id
+0 a d 2.5 3.2 -0.121306 0
+1 b e 1.2 1.3 -0.097883 1
+2 c f 0.7 0.1 0.695775 2
+
+In [48]: pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
+Out[48]:
+ X A B
+id year
+0 1970 -0.121306 a 2.5
+1 1970 -0.097883 b 1.2
+2 1970 0.695775 c 0.7
+0 1980 -0.121306 d 3.2
+1 1980 -0.097883 e 1.3
+2 1980 0.695775 f 0.1
+```
+
+## Combining with stats and GroupBy
+
+It should be no shock that combining ``pivot`` / ``stack`` / ``unstack`` with
+GroupBy and the basic Series and DataFrame statistical functions can produce
+some very expressive and fast data manipulations.
+
+``` python
+In [49]: df
+Out[49]:
+exp A B A
+animal cat dog cat dog
+first second
+bar one 0.895717 0.805244 -1.206412 2.565646
+ two 1.431256 1.340309 -1.170299 -0.226169
+baz one 0.410835 0.813850 0.132003 -0.827317
+ two -0.076467 -1.187678 1.130127 -1.436737
+foo one -1.413681 1.607920 1.024180 0.569605
+ two 0.875906 -2.211372 0.974466 -2.006747
+qux one -0.410001 -0.078638 0.545952 -1.219217
+ two -1.226825 0.769804 -1.281247 -0.727707
+
+In [50]: df.stack().mean(1).unstack()
+Out[50]:
+animal cat dog
+first second
+bar one -0.155347 1.685445
+ two 0.130479 0.557070
+baz one 0.271419 -0.006733
+ two 0.526830 -1.312207
+foo one -0.194750 1.088763
+ two 0.925186 -2.109060
+qux one 0.067976 -0.648927
+ two -1.254036 0.021048
+
+# same result, another way
+In [51]: df.groupby(level=1, axis=1).mean()
+Out[51]:
+animal cat dog
+first second
+bar one -0.155347 1.685445
+ two 0.130479 0.557070
+baz one 0.271419 -0.006733
+ two 0.526830 -1.312207
+foo one -0.194750 1.088763
+ two 0.925186 -2.109060
+qux one 0.067976 -0.648927
+ two -1.254036 0.021048
+
+In [52]: df.stack().groupby(level=1).mean()
+Out[52]:
+exp A B
+second
+one 0.071448 0.455513
+two -0.424186 -0.204486
+
+In [53]: df.mean().unstack(0)
+Out[53]:
+exp A B
+animal
+cat 0.060843 0.018596
+dog -0.413580 0.232430
+```
+
+## Pivot tables
+
+While [``pivot()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot) provides general purpose pivoting with various
+data types (strings, numerics, etc.), pandas also provides [``pivot_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html#pandas.pivot_table)
+for pivoting with aggregation of numeric data.
+
+The function [``pivot_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html#pandas.pivot_table) can be used to create spreadsheet-style
+pivot tables. See the [cookbook](cookbook.html#cookbook-pivot) for some advanced
+strategies.
+
+It takes a number of arguments:
+
+- ``data``: a DataFrame object.
+- ``values``: a column or a list of columns to aggregate.
+- ``index``: a column, Grouper, array which has the same length as data, or list of them.
+Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
+- ``columns``: a column, Grouper, array which has the same length as data, or list of them.
+Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.
+- ``aggfunc``: function to use for aggregation, defaulting to ``numpy.mean``.
+
+Consider a data set like this:
+
+``` python
+In [54]: import datetime
+
+In [55]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
+ ....: 'B': ['A', 'B', 'C'] * 8,
+ ....: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
+ ....: 'D': np.random.randn(24),
+ ....: 'E': np.random.randn(24),
+ ....: 'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)]
+ ....: + [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
+ ....:
+
+In [56]: df
+Out[56]:
+ A B C D E F
+0 one A foo 0.341734 -0.317441 2013-01-01
+1 one B foo 0.959726 -1.236269 2013-02-01
+2 two C foo -1.110336 0.896171 2013-03-01
+3 three A bar -0.619976 -0.487602 2013-04-01
+4 one B bar 0.149748 -0.082240 2013-05-01
+.. ... .. ... ... ... ...
+19 three B foo 0.690579 -2.213588 2013-08-15
+20 one C foo 0.995761 1.063327 2013-09-15
+21 one A bar 2.396780 1.266143 2013-10-15
+22 two B bar 0.014871 0.299368 2013-11-15
+23 three C bar 3.357427 -0.863838 2013-12-15
+
+[24 rows x 6 columns]
+```
+
+We can produce pivot tables from this data very easily:
+
+``` python
+In [57]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
+Out[57]:
+C bar foo
+A B
+one A 1.120915 -0.514058
+ B -0.338421 0.002759
+ C -0.538846 0.699535
+three A -1.181568 NaN
+ B NaN 0.433512
+ C 0.588783 NaN
+two A NaN 1.000985
+ B 0.158248 NaN
+ C NaN 0.176180
+
+In [58]: pd.pivot_table(df, values='D', index=['B'], columns=['A', 'C'], aggfunc=np.sum)
+Out[58]:
+A one three two
+C bar foo bar foo bar foo
+B
+A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971
+B -0.676843 0.005518 NaN 0.867024 0.316495 NaN
+C -1.077692 1.399070 1.177566 NaN NaN 0.352360
+
+In [59]: pd.pivot_table(df, values=['D', 'E'], index=['B'], columns=['A', 'C'],
+ ....: aggfunc=np.sum)
+ ....:
+Out[59]:
+ D E
+A one three two one three two
+C bar foo bar foo bar foo bar foo bar foo bar foo
+B
+A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971 2.786113 -0.043211 1.922577 NaN NaN 0.128491
+B -0.676843 0.005518 NaN 0.867024 0.316495 NaN 1.368280 -1.103384 NaN -2.128743 -0.194294 NaN
+C -1.077692 1.399070 1.177566 NaN NaN 0.352360 -1.976883 1.495717 -0.263660 NaN NaN 0.872482
+```
+
+The result object is a ``DataFrame`` having potentially hierarchical indexes on the
+rows and columns. If the ``values`` column name is not given, the pivot table
+will include all of the data that can be aggregated in an additional level of
+hierarchy in the columns:
+
+``` python
+In [60]: pd.pivot_table(df, index=['A', 'B'], columns=['C'])
+Out[60]:
+ D E
+C bar foo bar foo
+A B
+one A 1.120915 -0.514058 1.393057 -0.021605
+ B -0.338421 0.002759 0.684140 -0.551692
+ C -0.538846 0.699535 -0.988442 0.747859
+three A -1.181568 NaN 0.961289 NaN
+ B NaN 0.433512 NaN -1.064372
+ C 0.588783 NaN -0.131830 NaN
+two A NaN 1.000985 NaN 0.064245
+ B 0.158248 NaN -0.097147 NaN
+ C NaN 0.176180 NaN 0.436241
+```
+
+Also, you can use ``Grouper`` for ``index`` and ``columns`` keywords. For detail of ``Grouper``, see [Grouping with a Grouper specification](groupby.html#groupby-specify).
+
+``` python
+In [61]: pd.pivot_table(df, values='D', index=pd.Grouper(freq='M', key='F'),
+ ....: columns='C')
+ ....:
+Out[61]:
+C bar foo
+F
+2013-01-31 NaN -0.514058
+2013-02-28 NaN 0.002759
+2013-03-31 NaN 0.176180
+2013-04-30 -1.181568 NaN
+2013-05-31 -0.338421 NaN
+2013-06-30 -0.538846 NaN
+2013-07-31 NaN 1.000985
+2013-08-31 NaN 0.433512
+2013-09-30 NaN 0.699535
+2013-10-31 1.120915 NaN
+2013-11-30 0.158248 NaN
+2013-12-31 0.588783 NaN
+```
+
+You can render a nice output of the table omitting the missing values by
+calling ``to_string`` if you wish:
+
+``` python
+In [62]: table = pd.pivot_table(df, index=['A', 'B'], columns=['C'])
+
+In [63]: print(table.to_string(na_rep=''))
+ D E
+C bar foo bar foo
+A B
+one A 1.120915 -0.514058 1.393057 -0.021605
+ B -0.338421 0.002759 0.684140 -0.551692
+ C -0.538846 0.699535 -0.988442 0.747859
+three A -1.181568 0.961289
+ B 0.433512 -1.064372
+ C 0.588783 -0.131830
+two A 1.000985 0.064245
+ B 0.158248 -0.097147
+ C 0.176180 0.436241
+```
+
+Note that ``pivot_table`` is also available as an instance method on DataFrame, i.e. ``DataFrame.pivot_table()``.
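+
+For example, the first pivot table above could equally be written with the
+instance method (a sketch reusing the same ``df``):
+
+``` python
+df.pivot_table(values='D', index=['A', 'B'], columns=['C'])
+```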
+
+### Adding margins
+
+If you pass ``margins=True`` to ``pivot_table``, special ``All`` columns and
+rows will be added with partial group aggregates across the categories on the
+rows and columns:
+
+``` python
+In [64]: df.pivot_table(index=['A', 'B'], columns='C', margins=True, aggfunc=np.std)
+Out[64]:
+ D E
+C bar foo All bar foo All
+A B
+one A 1.804346 1.210272 1.569879 0.179483 0.418374 0.858005
+ B 0.690376 1.353355 0.898998 1.083825 0.968138 1.101401
+ C 0.273641 0.418926 0.771139 1.689271 0.446140 1.422136
+three A 0.794212 NaN 0.794212 2.049040 NaN 2.049040
+ B NaN 0.363548 0.363548 NaN 1.625237 1.625237
+ C 3.915454 NaN 3.915454 1.035215 NaN 1.035215
+two A NaN 0.442998 0.442998 NaN 0.447104 0.447104
+ B 0.202765 NaN 0.202765 0.560757 NaN 0.560757
+ C NaN 1.819408 1.819408 NaN 0.650439 0.650439
+All 1.556686 0.952552 1.246608 1.250924 0.899904 1.059389
+```
+
+## Cross tabulations
+
+Use [``crosstab()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html#pandas.crosstab) to compute a cross-tabulation of two (or more)
+factors. By default ``crosstab`` computes a frequency table of the factors
+unless an array of values and an aggregation function are passed.
+
+It takes a number of arguments
+
+- ``index``: array-like, values to group by in the rows.
+- ``columns``: array-like, values to group by in the columns.
+- ``values``: array-like, optional, array of values to aggregate according to
+the factors.
+- ``aggfunc``: function, optional, If no values array is passed, computes a
+frequency table.
+- ``rownames``: sequence, default ``None``, must match number of row arrays passed.
+- ``colnames``: sequence, default ``None``, if passed, must match number of column
+arrays passed.
+- ``margins``: boolean, default ``False``, Add row/column margins (subtotals)
+- ``normalize``: boolean, {‘all’, ‘index’, ‘columns’}, or {0,1}, default ``False``.
+Normalize by dividing all values by the sum of values.
+
+Any ``Series`` passed will have its name attribute used unless row or column
+names for the cross-tabulation are specified.
+
+For example:
+
+``` python
+In [65]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'
+
+In [66]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)
+
+In [67]: b = np.array([one, one, two, one, two, one], dtype=object)
+
+In [68]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)
+
+In [69]: pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
+Out[69]:
+b one two
+c dull shiny dull shiny
+a
+bar 1 0 0 1
+foo 2 1 1 0
+```
+
+If ``crosstab`` receives only two Series, it will provide a frequency table.
+
+``` python
+In [70]: df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4],
+ ....: 'C': [1, 1, np.nan, 1, 1]})
+ ....:
+
+In [71]: df
+Out[71]:
+ A B C
+0 1 3 1.0
+1 2 3 1.0
+2 2 4 NaN
+3 2 4 1.0
+4 2 4 1.0
+
+In [72]: pd.crosstab(df.A, df.B)
+Out[72]:
+B 3 4
+A
+1 1 0
+2 1 3
+```
+
+Any input passed containing ``Categorical`` data will have **all** of its
+categories included in the cross-tabulation, even if the actual data does
+not contain any instances of a particular category.
+
+``` python
+In [73]: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
+
+In [74]: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
+
+In [75]: pd.crosstab(foo, bar)
+Out[75]:
+col_0 d e
+row_0
+a 1 0
+b 0 1
+```
+
+### Normalization
+
+*New in version 0.18.1.*
+
+Frequency tables can also be normalized to show percentages rather than counts
+using the ``normalize`` argument:
+
+``` python
+In [76]: pd.crosstab(df.A, df.B, normalize=True)
+Out[76]:
+B 3 4
+A
+1 0.2 0.0
+2 0.2 0.6
+```
+
+``normalize`` can also normalize values within each row or within each column:
+
+``` python
+In [77]: pd.crosstab(df.A, df.B, normalize='columns')
+Out[77]:
+B 3 4
+A
+1 0.5 0.0
+2 0.5 1.0
+```
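+
+A small sketch (same ``df``, expected values worked out by hand) of
+normalizing within each row instead:
+
+``` python
+pd.crosstab(df.A, df.B, normalize='index')
+# B      3     4
+# A
+# 1   1.00  0.00
+# 2   0.25  0.75
+```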
+
+``crosstab`` can also be passed a third ``Series`` and an aggregation function
+(``aggfunc``) that will be applied to the values of the third ``Series`` within
+each group defined by the first two ``Series``:
+
+``` python
+In [78]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum)
+Out[78]:
+B 3 4
+A
+1 1.0 NaN
+2 1.0 2.0
+```
+
+### Adding margins
+
+Finally, one can also add margins or normalize this output.
+
+``` python
+In [79]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum, normalize=True,
+ ....: margins=True)
+ ....:
+Out[79]:
+B 3 4 All
+A
+1 0.25 0.0 0.25
+2 0.25 0.5 0.75
+All 0.50 0.5 1.00
+```
+
+## Tiling
+
+The [``cut()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html#pandas.cut) function computes groupings for the values of the input
+array and is often used to transform continuous variables to discrete or
+categorical variables:
+
+``` python
+In [80]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
+
+In [81]: pd.cut(ages, bins=3)
+Out[81]:
+[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60.0], (43.333, 60.0]]
+Categories (3, interval[float64]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]
+```
+
+If the ``bins`` keyword is an integer, then equal-width bins are formed.
+Alternatively we can specify custom bin-edges:
+
+``` python
+In [82]: c = pd.cut(ages, bins=[0, 18, 35, 70])
+
+In [83]: c
+Out[83]:
+[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
+Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
+```
+
+*New in version 0.20.0.*
+
+If the ``bins`` keyword is an ``IntervalIndex``, then those intervals will be
+used to bin the passed data:
+
+``` python
+pd.cut([25, 20, 50], bins=c.categories)
+```
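+
+A constructed sketch (not from the original docs): values that fall outside
+the supplied ``IntervalIndex`` become ``NaN``.
+
+``` python
+pd.cut([25, 20, 50, 80], bins=c.categories)
+# expected: [(18, 35], (18, 35], (35, 70], NaN]
+```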
+
+## Computing indicator / dummy variables
+
+To convert a categorical variable into a “dummy” or “indicator” ``DataFrame``,
+for example a column in a ``DataFrame`` (a ``Series``) which has ``k`` distinct
+values, use [``get_dummies()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html#pandas.get_dummies) to derive a ``DataFrame`` containing ``k``
+columns of 1s and 0s:
+
+``` python
+In [84]: df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
+
+In [85]: pd.get_dummies(df['key'])
+Out[85]:
+ a b c
+0 0 1 0
+1 0 1 0
+2 1 0 0
+3 0 0 1
+4 1 0 0
+5 0 1 0
+```
+
+Sometimes it’s useful to prefix the column names, for example when merging the result
+with the original ``DataFrame``:
+
+``` python
+In [86]: dummies = pd.get_dummies(df['key'], prefix='key')
+
+In [87]: dummies
+Out[87]:
+ key_a key_b key_c
+0 0 1 0
+1 0 1 0
+2 1 0 0
+3 0 0 1
+4 1 0 0
+5 0 1 0
+
+In [88]: df[['data1']].join(dummies)
+Out[88]:
+ data1 key_a key_b key_c
+0 0 0 1 0
+1 1 0 1 0
+2 2 1 0 0
+3 3 0 0 1
+4 4 1 0 0
+5 5 0 1 0
+```
+
+This function is often used along with discretization functions like ``cut``:
+
+``` python
+In [89]: values = np.random.randn(10)
+
+In [90]: values
+Out[90]:
+array([ 0.4082, -1.0481, -0.0257, -0.9884, 0.0941, 1.2627, 1.29 ,
+ 0.0824, -0.0558, 0.5366])
+
+In [91]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
+
+In [92]: pd.get_dummies(pd.cut(values, bins))
+Out[92]:
+ (0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
+0 0 0 1 0 0
+1 0 0 0 0 0
+2 0 0 0 0 0
+3 0 0 0 0 0
+4 1 0 0 0 0
+5 0 0 0 0 0
+6 0 0 0 0 0
+7 1 0 0 0 0
+8 0 0 0 0 0
+9 0 0 1 0 0
+```
+
+See also [``Series.str.get_dummies``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html#pandas.Series.str.get_dummies).
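+
+A minimal constructed sketch of ``Series.str.get_dummies``, which splits
+delimiter-separated strings before encoding:
+
+``` python
+pd.Series(['a|b', 'a', 'b|c']).str.get_dummies(sep='|')
+#    a  b  c
+# 0  1  1  0
+# 1  1  0  0
+# 2  0  1  1
+```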
+
+[``get_dummies()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html#pandas.get_dummies) also accepts a ``DataFrame``. By default all categorical
+variables (categorical in the statistical sense, those with *object* or
+*categorical* dtype) are encoded as dummy variables.
+
+``` python
+In [93]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
+ ....: 'C': [1, 2, 3]})
+ ....:
+
+In [94]: pd.get_dummies(df)
+Out[94]:
+ C A_a A_b B_b B_c
+0 1 1 0 0 1
+1 2 0 1 0 1
+2 3 1 0 1 0
+```
+
+All non-object columns are included untouched in the output. You can control
+the columns that are encoded with the ``columns`` keyword.
+
+``` python
+In [95]: pd.get_dummies(df, columns=['A'])
+Out[95]:
+ B C A_a A_b
+0 c 1 1 0
+1 c 2 0 1
+2 b 3 1 0
+```
+
+Notice that the ``B`` column is still included in the output; it just hasn't
+been encoded. You can drop ``B`` before calling ``get_dummies`` if you don't
+want to include it in the output, as in the sketch below.
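+
+For example (a sketch reusing the same ``df``):
+
+``` python
+pd.get_dummies(df.drop(columns=['B']))
+#    C  A_a  A_b
+# 0  1    1    0
+# 1  2    0    1
+# 2  3    1    0
+```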
+
+As with the ``Series`` version, you can pass values for the ``prefix`` and
+``prefix_sep``. By default the column name is used as the prefix, and ‘_’ as
+the prefix separator. You can specify ``prefix`` and ``prefix_sep`` in 3 ways:
+
+- string: Use the same value for ``prefix`` or ``prefix_sep`` for each column
+to be encoded.
+- list: Must be the same length as the number of columns being encoded.
+- dict: Mapping column name to prefix.
+
+``` python
+In [96]: simple = pd.get_dummies(df, prefix='new_prefix')
+
+In [97]: simple
+Out[97]:
+ C new_prefix_a new_prefix_b new_prefix_b new_prefix_c
+0 1 1 0 0 1
+1 2 0 1 0 1
+2 3 1 0 1 0
+
+In [98]: from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])
+
+In [99]: from_list
+Out[99]:
+ C from_A_a from_A_b from_B_b from_B_c
+0 1 1 0 0 1
+1 2 0 1 0 1
+2 3 1 0 1 0
+
+In [100]: from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})
+
+In [101]: from_dict
+Out[101]:
+ C from_A_a from_A_b from_B_b from_B_c
+0 1 1 0 0 1
+1 2 0 1 0 1
+2 3 1 0 1 0
+```
+
+*New in version 0.18.0.*
+
+Sometimes it is useful to keep only k-1 levels of a categorical variable
+to avoid collinearity when feeding the result to statistical models.
+You can switch to this mode by turning on ``drop_first``.
+
+``` python
+In [102]: s = pd.Series(list('abcaa'))
+
+In [103]: pd.get_dummies(s)
+Out[103]:
+ a b c
+0 1 0 0
+1 0 1 0
+2 0 0 1
+3 1 0 0
+4 1 0 0
+
+In [104]: pd.get_dummies(s, drop_first=True)
+Out[104]:
+ b c
+0 0 0
+1 1 0
+2 0 1
+3 0 0
+4 0 0
+```
+
+When a column contains only one level, it will be omitted in the result.
+
+``` python
+In [105]: df = pd.DataFrame({'A': list('aaaaa'), 'B': list('ababc')})
+
+In [106]: pd.get_dummies(df)
+Out[106]:
+ A_a B_a B_b B_c
+0 1 1 0 0
+1 1 0 1 0
+2 1 1 0 0
+3 1 0 1 0
+4 1 0 0 1
+
+In [107]: pd.get_dummies(df, drop_first=True)
+Out[107]:
+ B_b B_c
+0 0 0
+1 1 0
+2 0 0
+3 1 0
+4 0 1
+```
+
+By default new columns will have ``np.uint8`` dtype.
+To choose another dtype, use the ``dtype`` argument:
+
+``` python
+In [108]: df = pd.DataFrame({'A': list('abc'), 'B': [1.1, 2.2, 3.3]})
+
+In [109]: pd.get_dummies(df, dtype=bool).dtypes
+Out[109]:
+B float64
+A_a bool
+A_b bool
+A_c bool
+dtype: object
+```
+
+*New in version 0.23.0.*
+
+## Factorizing values
+
+To encode 1-d values as an enumerated type use [``factorize()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html#pandas.factorize):
+
+``` python
+In [110]: x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
+
+In [111]: x
+Out[111]:
+0 A
+1 A
+2 NaN
+3 B
+4 3.14
+5 inf
+dtype: object
+
+In [112]: labels, uniques = pd.factorize(x)
+
+In [113]: labels
+Out[113]: array([ 0, 0, -1, 1, 2, 3])
+
+In [114]: uniques
+Out[114]: Index(['A', 'B', 3.14, inf], dtype='object')
+```
+
+Note that ``factorize`` is similar to ``numpy.unique``, but differs in its
+handling of NaN:
+
+::: tip Note
+
+The following ``numpy.unique`` will fail under Python 3 with a ``TypeError``
+because of an ordering bug. See also
+[here](https://github.com/numpy/numpy/issues/641).
+
+:::
+
+``` python
+In [1]: x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
+In [2]: pd.factorize(x, sort=True)
+Out[2]:
+(array([ 2, 2, -1, 3, 0, 1]),
+ Index([3.14, inf, 'A', 'B'], dtype='object'))
+
+In [3]: np.unique(x, return_inverse=True)[::-1]
+Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
+```
+
+::: tip Note
+
+If you just want to handle one column as a categorical variable (like R’s factor),
+you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
+``df["cat_col"] = df["col"].astype("category")``. For full docs on [``Categorical``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html#pandas.Categorical),
+see the [Categorical introduction](categorical.html#categorical) and the
+[API documentation](https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#api-arrays-categorical).
+
+:::
+
+## Examples
+
+In this section, we will review frequently asked questions and examples. The
+column names and relevant column values are named to correspond with how this
+DataFrame will be pivoted in the answers below.
+
+``` python
+In [115]: np.random.seed([3, 1415])
+
+In [116]: n = 20
+
+In [117]: cols = np.array(['key', 'row', 'item', 'col'])
+
+In [118]: df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4))
+ .....: // [2, 1, 2, 1]).astype(str))
+ .....:
+
+In [119]: df.columns = cols
+
+In [120]: df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
+
+In [121]: df
+Out[121]:
+ key row item col val0 val1
+0 key0 row3 item1 col3 0.81 0.04
+1 key1 row2 item1 col2 0.44 0.07
+2 key1 row0 item1 col0 0.77 0.01
+3 key0 row4 item0 col2 0.15 0.59
+4 key1 row0 item2 col1 0.81 0.64
+.. ... ... ... ... ... ...
+15 key0 row3 item1 col1 0.31 0.23
+16 key0 row0 item2 col3 0.86 0.01
+17 key0 row4 item0 col3 0.64 0.21
+18 key2 row2 item2 col0 0.13 0.45
+19 key0 row2 item0 col4 0.37 0.70
+
+[20 rows x 6 columns]
+```
+
+### Pivoting with single aggregations
+
+Suppose we wanted to pivot ``df`` such that the ``col`` values are columns,
+``row`` values are the index, and the mean of ``val0`` are the values. In
+particular, the resulting DataFrame should look like:
+
+::: tip Note
+
+col col0 col1 col2 col3 col4
+row
+row0 0.77 0.605 NaN 0.860 0.65
+row2 0.13 NaN 0.395 0.500 0.25
+row3 NaN 0.310 NaN 0.545 NaN
+row4 NaN 0.100 0.395 0.760 0.24
+
+:::
+
+This solution uses [``pivot_table()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html#pandas.pivot_table). Also note that
+``aggfunc='mean'`` is the default. It is included here to be explicit.
+
+``` python
+In [122]: df.pivot_table(
+ .....: values='val0', index='row', columns='col', aggfunc='mean')
+ .....:
+Out[122]:
+col col0 col1 col2 col3 col4
+row
+row0 0.77 0.605 NaN 0.860 0.65
+row2 0.13 NaN 0.395 0.500 0.25
+row3 NaN 0.310 NaN 0.545 NaN
+row4 NaN 0.100 0.395 0.760 0.24
+```
+
+Note that we can also replace the missing values by using the ``fill_value``
+parameter.
+
+``` python
+In [123]: df.pivot_table(
+ .....: values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
+ .....:
+Out[123]:
+col col0 col1 col2 col3 col4
+row
+row0 0.77 0.605 0.000 0.860 0.65
+row2 0.13 0.000 0.395 0.500 0.25
+row3 0.00 0.310 0.000 0.545 0.00
+row4 0.00 0.100 0.395 0.760 0.24
+```
+
+Also note that we can pass in other aggregation functions as well. For example,
+we can also pass in ``sum``.
+
+``` python
+In [124]: df.pivot_table(
+ .....: values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
+ .....:
+Out[124]:
+col col0 col1 col2 col3 col4
+row
+row0 0.77 1.21 0.00 0.86 0.65
+row2 0.13 0.00 0.79 0.50 0.50
+row3 0.00 0.31 0.00 1.09 0.00
+row4 0.00 0.10 0.79 1.52 0.24
+```
+
+Another aggregation we can do is calculate the frequency with which the columns
+and rows occur together, a.k.a. “cross tabulation”. To do this, we can pass
+``size`` to the ``aggfunc`` parameter.
+
+``` python
+In [125]: df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
+Out[125]:
+col col0 col1 col2 col3 col4
+row
+row0 1 2 0 1 1
+row2 1 0 2 1 2
+row3 0 1 0 2 0
+row4 0 1 2 2 1
+```
+
+### Pivoting with multiple aggregations
+
+We can also perform multiple aggregations. For example, to perform both a
+``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.
+
+``` python
+In [126]: df.pivot_table(
+ .....: values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
+ .....:
+Out[126]:
+ mean sum
+col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
+row
+row0 0.77 0.605 NaN 0.860 0.65 0.77 1.21 NaN 0.86 0.65
+row2 0.13 NaN 0.395 0.500 0.25 0.13 NaN 0.79 0.50 0.50
+row3 NaN 0.310 NaN 0.545 NaN NaN 0.31 NaN 1.09 NaN
+row4 NaN 0.100 0.395 0.760 0.24 NaN 0.10 0.79 1.52 0.24
+```
+
+Note that to aggregate over multiple value columns, we can pass in a list to the
+``values`` parameter.
+
+``` python
+In [127]: df.pivot_table(
+ .....: values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
+ .....:
+Out[127]:
+ mean
+ val0 val1
+col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
+row
+row0 0.77 0.605 NaN 0.860 0.65 0.01 0.745 NaN 0.010 0.02
+row2 0.13 NaN 0.395 0.500 0.25 0.45 NaN 0.34 0.440 0.79
+row3 NaN 0.310 NaN 0.545 NaN NaN 0.230 NaN 0.075 NaN
+row4 NaN 0.100 0.395 0.760 0.24 NaN 0.070 0.42 0.300 0.46
+```
+
+Note that to subdivide over multiple columns, we can pass in a list to the
+``columns`` parameter.
+
+``` python
+In [128]: df.pivot_table(
+ .....: values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
+ .....:
+Out[128]:
+ mean
+ val0
+item item0 item1 item2
+col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
+row
+row0 NaN NaN NaN 0.77 NaN NaN NaN NaN NaN 0.605 0.86 0.65
+row2 0.35 NaN 0.37 NaN NaN 0.44 NaN NaN 0.13 NaN 0.50 0.13
+row3 NaN NaN NaN NaN 0.31 NaN 0.81 NaN NaN NaN 0.28 NaN
+row4 0.15 0.64 NaN NaN 0.10 0.64 0.88 0.24 NaN NaN NaN NaN
+```
+
+## Exploding a list-like column
+
+*New in version 0.25.0.*
+
+Sometimes the values in a column are list-like.
+
+``` python
+In [129]: keys = ['panda1', 'panda2', 'panda3']
+
+In [130]: values = [['eats', 'shoots'], ['shoots', 'leaves'], ['eats', 'leaves']]
+
+In [131]: df = pd.DataFrame({'keys': keys, 'values': values})
+
+In [132]: df
+Out[132]:
+ keys values
+0 panda1 [eats, shoots]
+1 panda2 [shoots, leaves]
+2 panda3 [eats, leaves]
+```
+
+We can ‘explode’ the ``values`` column, transforming each element of a list-like to a separate row, by using [``explode()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.explode.html#pandas.Series.explode). This will replicate the index values from the original row:
+
+``` python
+In [133]: df['values'].explode()
+Out[133]:
+0 eats
+0 shoots
+1 shoots
+1 leaves
+2 eats
+2 leaves
+Name: values, dtype: object
+```
+
+You can also explode the column in the ``DataFrame``.
+
+``` python
+In [134]: df.explode('values')
+Out[134]:
+ keys values
+0 panda1 eats
+0 panda1 shoots
+1 panda2 shoots
+1 panda2 leaves
+2 panda3 eats
+2 panda3 leaves
+```
+
+[``Series.explode()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.explode.html#pandas.Series.explode) will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting ``Series`` is always ``object``.
+
+``` python
+In [135]: s = pd.Series([[1, 2, 3], 'foo', [], ['a', 'b']])
+
+In [136]: s
+Out[136]:
+0 [1, 2, 3]
+1 foo
+2 []
+3 [a, b]
+dtype: object
+
+In [137]: s.explode()
+Out[137]:
+0 1
+0 2
+0 3
+1 foo
+2 NaN
+3 a
+3 b
+dtype: object
+```
+
+Here is a typical use case. You have comma-separated strings in a column and want to expand this into one row per item.
+
+``` python
+In [138]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
+ .....: {'var1': 'd,e,f', 'var2': 2}])
+ .....:
+
+In [139]: df
+Out[139]:
+ var1 var2
+0 a,b,c 1
+1 d,e,f 2
+```
+
+Creating a long-form DataFrame is now straightforward using ``explode`` and chained operations:
+
+``` python
+In [140]: df.assign(var1=df.var1.str.split(',')).explode('var1')
+Out[140]:
+ var1 var2
+0 a 1
+0 b 1
+0 c 1
+1 d 2
+1 e 2
+1 f 2
+```
diff --git a/Python/pandas/user_guide/sparse.md b/Python/pandas/user_guide/sparse.md
new file mode 100644
index 00000000..86615a91
--- /dev/null
+++ b/Python/pandas/user_guide/sparse.md
@@ -0,0 +1,565 @@
+# Sparse data structures
+
+::: tip Note
+
+``SparseSeries`` and ``SparseDataFrame`` have been deprecated. Their purpose
+is served equally well by a [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) or [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) with
+sparse values. See [Migrating](#sparse-migration) for tips on migrating.
+
+:::
+
+Pandas provides data structures for efficiently storing sparse data.
+These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these
+objects as being “compressed”: any data matching a specific value (``NaN`` / missing value by default, though any value
+can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.
+
+``` python
+In [1]: arr = np.random.randn(10)
+
+In [2]: arr[2:-2] = np.nan
+
+In [3]: ts = pd.Series(pd.SparseArray(arr))
+
+In [4]: ts
+Out[4]:
+0 0.469112
+1 -0.282863
+2 NaN
+3 NaN
+4 NaN
+5 NaN
+6 NaN
+7 NaN
+8 -0.861849
+9 -2.104569
+dtype: Sparse[float64, nan]
+```
+
+Notice the dtype, ``Sparse[float64, nan]``. The ``nan`` means that elements in the
+array that are ``nan`` aren’t actually stored, only the non-``nan`` elements are.
+Those non-``nan`` elements have a ``float64`` dtype.
+
+The sparse objects exist for memory efficiency reasons. Suppose you had a
+large, mostly NA ``DataFrame``:
+
+``` python
+In [5]: df = pd.DataFrame(np.random.randn(10000, 4))
+
+In [6]: df.iloc[:9998] = np.nan
+
+In [7]: sdf = df.astype(pd.SparseDtype("float", np.nan))
+
+In [8]: sdf.head()
+Out[8]:
+ 0 1 2 3
+0 NaN NaN NaN NaN
+1 NaN NaN NaN NaN
+2 NaN NaN NaN NaN
+3 NaN NaN NaN NaN
+4 NaN NaN NaN NaN
+
+In [9]: sdf.dtypes
+Out[9]:
+0 Sparse[float64, nan]
+1 Sparse[float64, nan]
+2 Sparse[float64, nan]
+3 Sparse[float64, nan]
+dtype: object
+
+In [10]: sdf.sparse.density
+Out[10]: 0.0002
+```
+
+As you can see, the density (% of values that have not been “compressed”) is
+extremely low. This sparse object takes up much less memory on disk (pickled)
+and in the Python interpreter.
+
+``` python
+In [11]: 'dense : {:0.2f} KB'.format(df.memory_usage().sum() / 1e3)
+Out[11]: 'dense : 320.13 KB'
+
+In [12]: 'sparse: {:0.2f} KB'.format(sdf.memory_usage().sum() / 1e3)
+Out[12]: 'sparse: 0.22 KB'
+```
+
+Functionally, their behavior should be nearly
+identical to their dense counterparts.
+
+## SparseArray
+
+[``SparseArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html#pandas.SparseArray) is an [``ExtensionArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray)
+for storing an array of sparse values (see [dtypes](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes) for more
+on extension arrays). It is a 1-dimensional ndarray-like object storing
+only values distinct from the ``fill_value``:
+
+``` python
+In [13]: arr = np.random.randn(10)
+
+In [14]: arr[2:5] = np.nan
+
+In [15]: arr[7:8] = np.nan
+
+In [16]: sparr = pd.SparseArray(arr)
+
+In [17]: sparr
+Out[17]:
+[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
+Fill: nan
+IntIndex
+Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
+```
+
+A sparse array can be converted to a regular (dense) ndarray with ``numpy.asarray()``:
+
+``` python
+In [18]: np.asarray(sparr)
+Out[18]:
+array([-1.9557, -1.6589, nan, nan, nan, 1.1589, 0.1453,
+ nan, 0.606 , 1.3342])
+```
+
+## SparseDtype
+
+The ``SparseArray.dtype`` property stores two pieces of information:
+
+1. The dtype of the non-sparse values
+2. The scalar fill value
+
+``` python
+In [19]: sparr.dtype
+Out[19]: Sparse[float64, nan]
+```
+
+A [``SparseDtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseDtype.html#pandas.SparseDtype) may be constructed by passing each of these:
+
+``` python
+In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
+Out[20]: Sparse[datetime64[ns], NaT]
+```
+
+The default fill value for a given NumPy dtype is the “missing” value for that dtype,
+though it may be overridden.
+
+``` python
+In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
+ ....: fill_value=pd.Timestamp('2017-01-01'))
+ ....:
+Out[21]: Sparse[datetime64[ns], 2017-01-01 00:00:00]
+```
+
+Finally, the string alias ``'Sparse[dtype]'`` may be used to specify a sparse dtype
+in many places:
+
+``` python
+In [22]: pd.array([1, 0, 0, 2], dtype='Sparse[int]')
+Out[22]:
+[1, 0, 0, 2]
+Fill: 0
+IntIndex
+Indices: array([0, 3], dtype=int32)
+```
+
+## Sparse accessor
+
+*New in version 0.24.0.*
+
+Pandas provides a ``.sparse`` accessor, similar to ``.str`` for string data, ``.cat``
+for categorical data, and ``.dt`` for datetime-like data. This namespace provides
+attributes and methods that are specific to sparse data.
+
+``` python
+In [23]: s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
+
+In [24]: s.sparse.density
+Out[24]: 0.5
+
+In [25]: s.sparse.fill_value
+Out[25]: 0
+```
+
+This accessor is available only on data with ``SparseDtype``, and on the [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)
+class itself, for creating a ``Series`` with sparse data from a scipy COO matrix.
+
+*New in version 0.25.0.*
+
+A ``.sparse`` accessor has been added for [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) as well.
+See [Sparse accessor](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#api-frame-sparse) for more.
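+
+A minimal sketch of the DataFrame-level accessor (the column name and values here are just for illustration):
+
+``` python
+import pandas as pd
+
+# One sparse integer column; 0 is the fill value, so only the single 1 is stored
+sdf = pd.DataFrame({"A": pd.SparseArray([0, 1, 0, 0])})
+
+sdf.sparse.density     # fraction of stored (non-fill) values: 0.25
+sdf.sparse.to_dense()  # convert the sparse columns back to dense ones
+```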
+
+## Sparse calculation
+
+You can apply NumPy [ufuncs](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
+to ``SparseArray`` and get a ``SparseArray`` as a result.
+
+``` python
+In [26]: arr = pd.SparseArray([1., np.nan, np.nan, -2., np.nan])
+
+In [27]: np.abs(arr)
+Out[27]:
+[1.0, nan, nan, 2.0, nan]
+Fill: nan
+IntIndex
+Indices: array([0, 3], dtype=int32)
+```
+
+The *ufunc* is also applied to ``fill_value``. This is needed to get
+the correct dense result.
+
+``` python
+In [28]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)
+
+In [29]: np.abs(arr)
+Out[29]:
+[1.0, 1, 1, 2.0, 1]
+Fill: 1
+IntIndex
+Indices: array([0, 3], dtype=int32)
+
+In [30]: np.abs(arr).to_dense()
+Out[30]: array([1., 1., 1., 2., 1.])
+```
+
+## Migrating
+
+In older versions of pandas, the ``SparseSeries`` and ``SparseDataFrame`` classes (documented below)
+were the preferred way to work with sparse data. With the advent of extension arrays, these subclasses
+are no longer needed. Their purpose is better served by using a regular Series or DataFrame with
+sparse values instead.
+
+::: tip Note
+
+There’s no performance or memory penalty to using a Series or DataFrame with sparse values,
+rather than a SparseSeries or SparseDataFrame.
+
+:::
+
+This section provides some guidance on migrating your code to the new style. As a reminder,
+you can use the Python ``warnings`` module to control warnings, but we recommend modifying
+your code rather than ignoring the warning.
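+
+For example, as a stopgap while migrating, something along these lines silences the deprecation warning (the warning category is assumed to be ``FutureWarning`` here):
+
+``` python
+import warnings
+
+# Stopgap only: hide the deprecation warning raised when the old
+# SparseSeries / SparseDataFrame classes are constructed.
+warnings.filterwarnings("ignore", category=FutureWarning)
+```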
+
+**Construction**
+
+From an array-like, use the regular [``Series``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) or
+[``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) constructors with [``SparseArray``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseArray.html#pandas.SparseArray) values.
+
+``` python
+# Previous way
+>>> pd.SparseDataFrame({"A": [0, 1]})
+```
+
+``` python
+# New way
+In [31]: pd.DataFrame({"A": pd.SparseArray([0, 1])})
+Out[31]:
+ A
+0 0
+1 1
+```
+
+From a SciPy sparse matrix, use [``DataFrame.sparse.from_spmatrix()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.from_spmatrix.html#pandas.DataFrame.sparse.from_spmatrix),
+
+``` python
+# Previous way
+>>> from scipy import sparse
+>>> mat = sparse.eye(3)
+>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
+```
+
+``` python
+# New way
+In [32]: from scipy import sparse
+
+In [33]: mat = sparse.eye(3)
+
+In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])
+
+In [35]: df.dtypes
+Out[35]:
+A Sparse[float64, 0.0]
+B Sparse[float64, 0.0]
+C Sparse[float64, 0.0]
+dtype: object
+```
+
+**Conversion**
+
+From sparse to dense, use the ``.sparse`` accessor methods:
+
+``` python
+In [36]: df.sparse.to_dense()
+Out[36]:
+ A B C
+0 1.0 0.0 0.0
+1 0.0 1.0 0.0
+2 0.0 0.0 1.0
+
+In [37]: df.sparse.to_coo()
+Out[37]:
+<3x3 sparse matrix of type '<class 'numpy.float64'>'
+ with 3 stored elements in COOrdinate format>
+```
+
+From dense to sparse, use [``DataFrame.astype()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype) with a [``SparseDtype``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.SparseDtype.html#pandas.SparseDtype).
+
+``` python
+In [38]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})
+
+In [39]: dtype = pd.SparseDtype(int, fill_value=0)
+
+In [40]: dense.astype(dtype)
+Out[40]:
+ A
+0 1
+1 0
+2 0
+3 1
+```
+
+**Sparse Properties**
+
+Sparse-specific properties, like ``density``, are available on the ``.sparse`` accessor.
+
+``` python
+In [41]: df.sparse.density
+Out[41]: 0.3333333333333333
+```
+
+**General differences**
+
+In a ``SparseDataFrame``, *all* columns were sparse. A [``DataFrame``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame) can have a mixture of
+sparse and dense columns. As a consequence, assigning new columns to a ``DataFrame`` with sparse
+values will not automatically convert the input to be sparse.
+
+``` python
+# Previous Way
+>>> df = pd.SparseDataFrame({"A": [0, 1]})
+>>> df['B'] = [0, 0] # implicitly becomes Sparse
+>>> df['B'].dtype
+Sparse[int64, nan]
+```
+
+Instead, you’ll need to ensure that the values being assigned are sparse
+
+``` python
+In [42]: df = pd.DataFrame({"A": pd.SparseArray([0, 1])})
+
+In [43]: df['B'] = [0, 0] # remains dense
+
+In [44]: df['B'].dtype
+Out[44]: dtype('int64')
+
+In [45]: df['B'] = pd.SparseArray([0, 0])
+
+In [46]: df['B'].dtype
+Out[46]: Sparse[int64, 0]
+```
+
+The ``SparseDataFrame.default_kind`` and ``SparseDataFrame.default_fill_value`` attributes
+have no replacement.
+
+## Interaction with scipy.sparse
+
+Use [``DataFrame.sparse.from_spmatrix()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.from_spmatrix.html#pandas.DataFrame.sparse.from_spmatrix) to create a ``DataFrame`` with sparse values from a sparse matrix.
+
+*New in version 0.25.0.*
+
+``` python
+In [47]: from scipy.sparse import csr_matrix
+
+In [48]: arr = np.random.random(size=(1000, 5))
+
+In [49]: arr[arr < .9] = 0
+
+In [50]: sp_arr = csr_matrix(arr)
+
+In [51]: sp_arr
+Out[51]:
+<1000x5 sparse matrix of type '<class 'numpy.float64'>'
+ with 517 stored elements in Compressed Sparse Row format>
+
+In [52]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
+
+In [53]: sdf.head()
+Out[53]:
+ 0 1 2 3 4
+0 0.956380 0.0 0.0 0.000000 0.0
+1 0.000000 0.0 0.0 0.000000 0.0
+2 0.000000 0.0 0.0 0.000000 0.0
+3 0.000000 0.0 0.0 0.000000 0.0
+4 0.999552 0.0 0.0 0.956153 0.0
+
+In [54]: sdf.dtypes
+Out[54]:
+0 Sparse[float64, 0.0]
+1 Sparse[float64, 0.0]
+2 Sparse[float64, 0.0]
+3 Sparse[float64, 0.0]
+4 Sparse[float64, 0.0]
+dtype: object
+```
+
+All sparse formats are supported, but matrices that are not in [``COOrdinate``](https://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse) format will be converted, copying data as needed.
+To convert back to a sparse SciPy matrix in COO format, you can use the [``DataFrame.sparse.to_coo()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sparse.to_coo.html#pandas.DataFrame.sparse.to_coo) method:
+
+``` python
+In [55]: sdf.sparse.to_coo()
+Out[55]:
+<1000x5 sparse matrix of type '<class 'numpy.float64'>'
+ with 517 stored elements in COOrdinate format>
+```
+
+[``Series.sparse.to_coo()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sparse.to_coo.html#pandas.Series.sparse.to_coo) is implemented for transforming a ``Series`` with sparse values indexed by a [``MultiIndex``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex) to a [``scipy.sparse.coo_matrix``](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix).
+
+The method requires a ``MultiIndex`` with two or more levels.
+
+``` python
+In [56]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
+
+In [57]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
+ ....: (1, 2, 'a', 1),
+ ....: (1, 1, 'b', 0),
+ ....: (1, 1, 'b', 1),
+ ....: (2, 1, 'b', 0),
+ ....: (2, 1, 'b', 1)],
+ ....: names=['A', 'B', 'C', 'D'])
+ ....:
+
+In [58]: s
+Out[58]:
+A B C D
+1 2 a 0 3.0
+ 1 NaN
+ 1 b 0 1.0
+ 1 3.0
+2 1 b 0 NaN
+ 1 NaN
+dtype: float64
+
+In [59]: ss = s.astype('Sparse')
+
+In [60]: ss
+Out[60]:
+A B C D
+1 2 a 0 3.0
+ 1 NaN
+ 1 b 0 1.0
+ 1 3.0
+2 1 b 0 NaN
+ 1 NaN
+dtype: Sparse[float64, nan]
+```
+
+In the example below, we transform the ``Series`` to a sparse representation of a 2-d array by specifying that the first and second ``MultiIndex`` levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.
+
+``` python
+In [61]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B'],
+ ....: column_levels=['C', 'D'],
+ ....: sort_labels=True)
+ ....:
+
+In [62]: A
+Out[62]:
+<3x4 sparse matrix of type '<class 'numpy.float64'>'
+ with 3 stored elements in COOrdinate format>
+
+In [63]: A.todense()
+Out[63]:
+matrix([[0., 0., 1., 3.],
+ [3., 0., 0., 0.],
+ [0., 0., 0., 0.]])
+
+In [64]: rows
+Out[64]: [(1, 1), (1, 2), (2, 1)]
+
+In [65]: columns
+Out[65]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
+```
+
+Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
+
+``` python
+In [66]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B', 'C'],
+ ....: column_levels=['D'],
+ ....: sort_labels=False)
+ ....:
+
+In [67]: A
+Out[67]:
+<3x2 sparse matrix of type '<class 'numpy.float64'>'
+ with 3 stored elements in COOrdinate format>
+
+In [68]: A.todense()
+Out[68]:
+matrix([[3., 0.],
+ [1., 3.],
+ [0., 0.]])
+
+In [69]: rows
+Out[69]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
+
+In [70]: columns
+Out[70]: [0, 1]
+```
+
+A convenience method [``Series.sparse.from_coo()``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sparse.from_coo.html#pandas.Series.sparse.from_coo) is implemented for creating a ``Series`` with sparse values from a ``scipy.sparse.coo_matrix``.
+
+``` python
+In [71]: from scipy import sparse
+
+In [72]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
+ ....: shape=(3, 4))
+ ....:
+
+In [73]: A
+Out[73]:
+<3x4 sparse matrix of type '<class 'numpy.float64'>'
+ with 3 stored elements in COOrdinate format>
+
+In [74]: A.todense()
+Out[74]:
+matrix([[0., 0., 1., 2.],
+ [3., 0., 0., 0.],
+ [0., 0., 0., 0.]])
+```
+
+The default behaviour (with ``dense_index=False``) simply returns a ``Series`` containing
+only the non-null entries.
+
+``` python
+In [75]: ss = pd.Series.sparse.from_coo(A)
+
+In [76]: ss
+Out[76]:
+0 2 1.0
+ 3 2.0
+1 0 3.0
+dtype: Sparse[float64, nan]
+```
+
+Specifying ``dense_index=True`` will result in an index that is the Cartesian product of the
+row and columns coordinates of the matrix. Note that this will consume a significant amount of memory
+(relative to ``dense_index=False``) if the sparse matrix is large (and sparse) enough.
+
+``` python
+In [77]: ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)
+
+In [78]: ss_dense
+Out[78]:
+0 0 NaN
+ 1 NaN
+ 2 1.0
+ 3 2.0
+1 0 3.0
+ 1 NaN
+ 2 NaN
+ 3 NaN
+2 0 NaN
+ 1 NaN
+ 2 NaN
+ 3 NaN
+dtype: Sparse[float64, nan]
+```
+
+## Sparse subclasses
+
+The ``SparseSeries`` and ``SparseDataFrame`` classes are deprecated. Visit their
+API pages for usage.
diff --git a/Python/pandas/user_guide/style.md b/Python/pandas/user_guide/style.md
new file mode 100644
index 00000000..f257d619
--- /dev/null
+++ b/Python/pandas/user_guide/style.md
@@ -0,0 +1,439 @@
+# Styling
+
+*New in version 0.17.1*
+
+Provisional: This is a new feature and still under development. We’ll be adding features and possibly making breaking changes in future releases. We’d love to hear your feedback.
+
+This document is written as a Jupyter Notebook, and can be viewed or downloaded [here](http://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/style.ipynb).
+
+You can apply **conditional formatting**, the visual styling of a DataFrame depending on the data within, by using the ``DataFrame.style`` property. This is a property that returns a ``Styler`` object, which has useful methods for formatting and displaying DataFrames.
+
+The styling is accomplished using CSS. You write “style functions” that take scalars, ``DataFrame``s or ``Series``, and return *like-indexed* DataFrames or Series with CSS ``"attribute: value"`` pairs for the values. These functions can be incrementally passed to the ``Styler`` which collects the styles before rendering.
+
+## Building styles
+
+Pass your style functions into one of the following methods:
+
+- ``Styler.applymap``: elementwise
+- ``Styler.apply``: column-/row-/table-wise
+
+Both of those methods take a function (and some other keyword arguments) and apply your function to the DataFrame in a certain way. ``Styler.applymap`` works through the DataFrame elementwise. ``Styler.apply`` passes each column or row of your DataFrame one at a time, or the entire table at once, depending on the ``axis`` keyword argument. For columnwise use ``axis=0``, for rowwise use ``axis=1``, and for the entire table at once use ``axis=None``.
+
+For ``Styler.applymap`` your function should take a scalar and return a single string with the CSS attribute-value pair.
+
+For ``Styler.apply`` your function should take a Series or DataFrame (depending on the axis parameter), and return a Series or DataFrame with an identical shape where each value is a string with a CSS attribute-value pair.
+
+Let’s see some examples.
+
+
+
+Here’s a boring example of rendering a DataFrame, without any (visible) styles:
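+
+A sketch of the setup the rendered examples below assume (the column names, random seed, and the placement of the ``NaN`` are assumptions chosen to mimic the output shown later):
+
+``` python
+import numpy as np
+import pandas as pd
+
+np.random.seed(24)
+df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
+df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
+               axis=1)
+df.iloc[0, 2] = np.nan  # one missing value so the null-highlighting examples have something to find
+
+# In a Jupyter notebook, the Styler returned by .style renders itself as an HTML table
+df.style
+```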
+
+
+
+*Note*: The ``DataFrame.style`` attribute is a property that returns a ``Styler`` object. ``Styler`` has a ``_repr_html_`` method defined on it, so it is rendered automatically. If you want the actual HTML back for further processing, or for writing to a file, call the ``.render()`` method, which returns a string.
+
+The above output looks very similar to the standard DataFrame HTML representation. But we’ve done some work behind the scenes to attach CSS classes to each cell. We can view these by calling the ``.render`` method.
+
+``` python
+df.style.highlight_null().render().split('\n')[:10]
+```
+
+The output (trimmed here) is a list of raw HTML lines for the rendered table, covering the column headers ``A``–``E`` and the first row of values, with each cell carrying a CSS class identifier such as ``row0_col2``.
+
+The ``row0_col2`` is the identifier for that particular cell. We’ve also prepended each row/column identifier with a UUID unique to each DataFrame so that the style from one doesn’t collide with the styling from another within the same notebook or page (you can set the ``uuid`` if you’d like to tie together the styling of two DataFrames).
+
+When writing style functions, you take care of producing the CSS attribute / value pairs you want. Pandas matches those up with the CSS classes that identify each cell.
+
+Let’s write a simple style function that will color negative numbers red and positive numbers black.
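+
+A minimal sketch of such a function:
+
+``` python
+def color_negative_red(val):
+    """Return a CSS 'color' property: red for negative scalars, black otherwise."""
+    color = 'red' if val < 0 else 'black'
+    return 'color: %s' % color
+```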
+
+
+
+In this case, the cell’s style depends only on its own value. That means we should use the ``Styler.applymap`` method, which works elementwise.
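+
+A sketch of applying it, assuming the ``df`` and ``color_negative_red`` defined above:
+
+``` python
+s = df.style.applymap(color_negative_red)
+s  # renders with negative numbers in red
+```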
+
+
+
+Notice the similarity with the standard ``df.applymap``, which operates on DataFrames elementwise. We want you to be able to reuse your existing knowledge of how to interact with DataFrames.
+
+Notice also that our function returned a string containing the CSS attribute and value, separated by a colon, just like in a ``<style>`` tag. This will be a common theme.
+
+Finally, the input shapes matched. ``Styler.applymap`` calls the function on each scalar input, and the function returns a scalar output.
+
+Now suppose you wanted to highlight the maximum value in each column. We can’t use ``.applymap`` anymore since that operated elementwise. Instead, we’ll turn to ``.apply`` which operates columnwise (or rowwise using the ``axis`` keyword). Later on we’ll see that something like ``highlight_max`` is already defined on ``Styler`` so you wouldn’t need to write this yourself.
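+
+A sketch of such a column-wise function and its use (the yellow background is an arbitrary choice):
+
+``` python
+def highlight_max(s):
+    """Highlight the maximum value in a Series with a yellow background."""
+    is_max = s == s.max()
+    return ['background-color: yellow' if v else '' for v in is_max]
+
+df.style.apply(highlight_max)  # default axis=0: one column (Series) at a time
+```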
+
+
+
+In this case the input is a ``Series``, one column at a time. Notice that the output shape of ``highlight_max`` matches the input shape, an array with ``len(s)`` items.
+
+We encourage you to use method chains to build up a style piecewise, before finally rendering at the end of the chain.
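+
+For example, a chained sketch combining the two functions defined above:
+
+``` python
+(df.style
+   .applymap(color_negative_red)
+   .apply(highlight_max))
+```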
+
+
+
+Above we used ``Styler.apply`` to pass in each column one at a time.
+
+Debugging Tip: If you’re having trouble writing your style function, try just passing it into DataFrame.apply. Internally, Styler.apply uses DataFrame.apply so the result should be the same.
+
+What if you wanted to highlight just the maximum value in the entire table? Use ``.apply(function, axis=None)`` to indicate that your function wants the entire table, not one column or row at a time. Let’s try that next.
+
+We’ll rewrite our ``highlight_max`` to handle either Series (from ``.apply(axis=0 or 1)``) or DataFrames (from ``.apply(axis=None)``). We’ll also allow the color to be adjustable, to demonstrate that ``.apply`` and ``.applymap`` pass along keyword arguments.
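+
+A sketch of such a rewrite (it relies on the ``np``/``pd`` imports from the setup sketch above):
+
+``` python
+def highlight_max(data, color='yellow'):
+    """Highlight the maximum in a Series or in an entire DataFrame."""
+    attr = 'background-color: {}'.format(color)
+    if data.ndim == 1:  # Series, from .apply(axis=0) or .apply(axis=1)
+        is_max = data == data.max()
+        return [attr if v else '' for v in is_max]
+    else:               # DataFrame, from .apply(axis=None)
+        is_max = data == data.max().max()
+        return pd.DataFrame(np.where(is_max, attr, ''),
+                            index=data.index, columns=data.columns)
+```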
+
+
+
+When using ``Styler.apply(func, axis=None)``, the function must return a DataFrame with the same index and column labels.
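+
+For instance, with the tablewise version sketched above:
+
+``` python
+df.style.apply(highlight_max, color='darkorange', axis=None)
+```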
+
+
+
+### Building Styles Summary
+
+Style functions should return strings with one or more CSS ``attribute: value`` pairs, delimited by semicolons. Use
+
+- ``Styler.applymap(func)`` for elementwise styles
+- ``Styler.apply(func, axis=0)`` for columnwise styles
+- ``Styler.apply(func, axis=1)`` for rowwise styles
+- ``Styler.apply(func, axis=None)`` for tablewise styles
+
+And crucially the input and output shapes of ``func`` must match. If ``x`` is the input then ``func(x).shape == x.shape``.
+
+## Finer control: slicing
+
+Both ``Styler.apply``, and ``Styler.applymap`` accept a ``subset`` keyword. This allows you to apply styles to specific rows or columns, without having to code that logic into your ``style`` function.
+
+The value passed to ``subset`` behaves similar to slicing a DataFrame.
+
+- A scalar is treated as a column label
+- A list (or Series or NumPy array) is treated as multiple column labels
+- A tuple is treated as ``(row_indexer, column_indexer)``
+
+Consider using ``pd.IndexSlice`` to construct the tuple for the last one.
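+
+A sketch, reusing ``highlight_max`` from above and assuming the integer row labels and ``'B'``/``'D'`` columns of the setup sketch:
+
+``` python
+df.style.apply(highlight_max,
+               subset=pd.IndexSlice[2:5, ['B', 'D']])
+```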
+
+
+
+For row and column slicing, any valid indexer to ``.loc`` will work.
+
+
+
+Only label-based slicing is supported right now, not positional.
+
+If your style function uses a ``subset`` or ``axis`` keyword argument, consider wrapping your function in a ``functools.partial``, partialing out that keyword.
+
+``` python
+my_func2 = functools.partial(my_func, subset=42)
+```
+
+## Finer Control: Display Values
+
+We distinguish the *display* value from the *actual* value in ``Styler``. To control the display value, the text that is printed in each cell, use ``Styler.format``. Cells can be formatted according to a [format spec string](https://docs.python.org/3/library/string.html#format-specification-mini-language) or a callable that takes a single value and returns a string.
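+
+A minimal sketch using a format spec string (the two-decimal percentage format is an arbitrary choice):
+
+``` python
+df.style.format("{:.2%}")
+```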
+
+
+
+Use a dictionary to format specific columns.
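+
+For example, with per-column format strings (the column names assume the setup sketch above):
+
+``` python
+df.style.format({'B': '{:0<4.0f}', 'D': '{:+.2f}'})
+```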
+
+
+
+Or pass in a callable (or dictionary of callables) for more flexible handling.
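+
+A sketch with a callable for one column:
+
+``` python
+df.style.format({'D': lambda x: "±{:.2f}".format(abs(x))})
+```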
+
+
+
+## Builtin styles
+
+Finally, we expect certain styling functions to be common enough that we’ve included a few “built-in” to the ``Styler``, so you don’t have to write them yourself.
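+
+For example, ``highlight_null`` marks missing values; a minimal sketch:
+
+``` python
+df.style.highlight_null(null_color='red')
+```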
+
+
+
+You can create “heatmaps” with the ``background_gradient`` method. These require matplotlib, and we’ll use [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) to get a nice colormap.
+
+``` python
+import seaborn as sns
+
+cm = sns.light_palette("green", as_cmap=True)
+
+s = df.style.background_gradient(cmap=cm)
+s
+
+/opt/conda/envs/pandas/lib/python3.7/site-packages/matplotlib/colors.py:479: RuntimeWarning: invalid value encountered in less
+ xa[xa < 0] = -1
+```
+
+
+
+``Styler.background_gradient`` takes the keyword arguments ``low`` and ``high``. Roughly speaking, these extend the range of your data by ``low`` and ``high`` percent so that when we convert the colors, the colormap’s entire range isn’t used. This is useful so that you can still read the text.
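+
+A sketch, reusing the ``cm`` colormap from the cell above (the ``low``/``high`` values are arbitrary):
+
+``` python
+df.style.background_gradient(cmap=cm, low=.5, high=0)
+```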
+
+
+
+There’s also ``.highlight_min`` and ``.highlight_max``.
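+
+For example (each call produces its own ``Styler``):
+
+``` python
+df.style.highlight_max(axis=0)  # per-column maxima
+df.style.highlight_min(axis=1)  # per-row minima
+```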
+
+
+
+Use ``Styler.set_properties`` when the style doesn’t actually depend on the values.
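+
+A sketch with a few static CSS properties:
+
+``` python
+df.style.set_properties(**{'background-color': 'black',
+                           'color': 'lawngreen',
+                           'border-color': 'white'})
+```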
+
+
+
+### Bar charts
+
+You can include “bar charts” in your DataFrame.
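+
+A minimal sketch (the subset and color are arbitrary choices):
+
+``` python
+df.style.bar(subset=['A', 'B'], color='#d65f5f')
+```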
+
+
+
+New in version 0.20.0 is the ability to further customize the bar chart: you can now have ``df.style.bar`` be centered on zero or on a midpoint value (in addition to the existing behavior of putting the minimum value at the left side of the cell), and you can pass a list of ``[color_negative, color_positive]``.
+
+Here’s how you can change the above with the new ``align='mid'`` option:
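+
+A sketch of that call, keeping a negative/positive color pair:
+
+``` python
+df.style.bar(subset=['A', 'B'], align='mid', color=['#d65f5f', '#5fba7d'])
+```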
+
+
+
+The following example aims to highlight the behavior of the new ``align`` options:
+
+``` python
+import pandas as pd
+from IPython.display import HTML
+
+# Test series
+test1 = pd.Series([-100, -60, -30, -20], name='All Negative')
+test2 = pd.Series([10, 20, 50, 100], name='All Positive')
+test3 = pd.Series([-10, -5, 0, 90], name='Both Pos and Neg')
+
+head = """
+