diff --git a/STYLE_GUIDE.md b/STYLE_GUIDE.md
index 7ca99ca..91c346a 100644
--- a/STYLE_GUIDE.md
+++ b/STYLE_GUIDE.md
@@ -74,7 +74,56 @@
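         We gave this image the label img_workflow; it can then be referenced as follows:
         The machine learning system workflow is shown in :numref:`img_workflow`. Note that a space must precede the colon when referencing.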
     ```
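-   * Table and section references work like image references: attach a label first, then use :numref:‘label_name’
+   * Table references work like image references: attach a label first, then use :numref:`label_name`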
+   ```python
+    The following shows how to reference a table:
+  
+    | Year | Number | Comment |
+    | --- | --- | --- |
+    | 2018 | 100 | Good year |
+    :label:`table`
+    Reference the table with :numref:`table`
+   ```
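+   * Equation references follow the same pattern: attach a label with :eqlabel:`label_name`, then use :eqref:`label_name`
+   ```python
+    The following shows how to reference an equation:
+  
+    $$\hat{\mathbf{y}}=\mathbf X \mathbf{w}+b$$
+    :eqlabel:`linear`
+    Reference the equation with :eqref:`linear`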
+   ```
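+  * Citing references: mlsys.bib currently contains all of the book's references; to add a new one, simply add an entry to that file. Cite a reference with :cite:`key`
+  ```python
+  The following shows how to cite references:
+  1. A single reference
+  This article draws on the paper :cite:`cnn2015`
+  2. Multiple references, separated by commas
+  This article draws on the papers :cite:`cnn2015,rnn2015`
+  
+  mlsys.bib should then contain the following entries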
+  @inproceedings{cnn2015,
+	title = {CNN},
+	author = {xxx},
+	year = {2015},
+	keywords = {xxx}
+  }
+  @inproceedings{rnn2015,
+	title = {RNN},
+	author = {xxx},
+	year = {2015},
+	keywords = {xxx}
+  }
+  ```
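+  * Section references: place a label right after the section heading so that the section can be referenced by it. The label format is :label:`label_name`
+  ```python
+  ### Referencing Sections
+  :label:`my_sec3`
+  ```
+  We can then reference this section with :ref:`label_name`, as in the code block below
+  ```python
+  :ref:`my_sec3` demonstrates a section reference.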
+  ```
+  
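 * Other conversion methods
     * If a figure contains many formulas, importing it with a tool may garble many of them; in that case, save the figure in .png format.
     * Use an [online background removal tool](https://www.aigei.com/bgremover/) to remove the white background from the image.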
diff --git a/appendix_machine_learning_introduction/gradient_descent.md b/appendix_machine_learning_introduction/gradient_descent.md
index 64e27d4..e7db43c 100644
--- a/appendix_machine_learning_introduction/gradient_descent.md
+++ b/appendix_machine_learning_introduction/gradient_descent.md
@@ -23,7 +23,7 @@ Rate）。
 :width:`600px`
 :label:`gradient_descent2`
 
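-Next, how is gradient descent implemented in a deep neural network? This requires computing the partial derivative of the loss with respect to the parameters of each layer, $\frac{\partial \mathcal{L}}{\partial {w}}$, which we can do with **back-propagation** [@rumelhart1986learning; @lecun2015deep].
+Next, how is gradient descent implemented in a deep neural network? This requires computing the partial derivative of the loss with respect to the parameters of each layer, $\frac{\partial \mathcal{L}}{\partial {w}}$, which we can do with **back-propagation** :cite:`rumelhart1986learning,lecun2015deep`.
 Next,
 we introduce an intermediate quantity ${\delta}=\frac{\partial \mathcal{L}}{\partial {z}}$ to denote the partial derivative of the loss function $\mathcal{L}$
 with respect to the neural network output ${z}$ (before the activation function, not $a$),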
@@ -89,6 +89,5 @@ $\frac{\partial \mathcal{L}}{\partial {b}^l}$后，我们就可以用梯度下
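 Descent, SGD) to compute the loss. Specifically, we do not compute the loss over all of the training data; instead, we randomly select some samples from the training set, for example 16, 32, 64, or 128 samples. The number of samples is called the **batch
 size**.
 In addition, choosing the learning rate is very important. If the learning rate is too large, training may fail to approach the valley around the minimum; if it is too small, training is too slow.
-Adaptive learning rate methods, such as Adam [@KingmaAdam2014], RMSProp [@tieleman2012rmsprop],
-and
-Adagrad [@duchi2011adagrad], adjust the learning rate automatically during training so that training converges quickly to the minimum.
+Adaptive learning rate methods, such as Adam :cite:`KingmaAdam2014`, RMSProp :cite:`tieleman2012rmsprop`, and
+Adagrad :cite:`duchi2011adagrad`, adjust the learning rate automatically during training so that training converges quickly to the minimum.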
diff --git a/appendix_machine_learning_introduction/neural_network.md b/appendix_machine_learning_introduction/neural_network.md
index cba1433..f9ef31c 100644
--- a/appendix_machine_learning_introduction/neural_network.md
+++ b/appendix_machine_learning_introduction/neural_network.md
@@ -116,7 +116,7 @@ $$f({z})_{i} = \frac{{\rm e}^{z_{i}}}{\sum_{k=1}^{K}{\rm e}^{z_{k}}}$$
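 ![Example of a multi-layer perceptron. $a^l_i$ denotes the value of the neuron output $z$ after the activation function, where $l$ is the layer index ($L$ denotes the output layer) and $i$ is the output index](../img/ch_basic/mlp2.png)
 
 The **multi-layer perceptron**
-(MLP) increases the expressive power of the network by stacking multiple fully connected layers. Compared with a single-layer network, a multi-layer perceptron has many intermediate layers whose outputs are not exposed to the final output; these layers are called
+(MLP) :cite:`rosenblatt1958perceptron` increases the expressive power of the network by stacking multiple fully connected layers. Compared with a single-layer network, a multi-layer perceptron has many intermediate layers whose outputs are not exposed to the final output; these layers are called
 **hidden layers**. The network in this example can be implemented with the chained matrix operations below, where $W^l$ and $b^l$ denote the weight matrix and bias of each layer, $l$ is the layer index, and $L$ denotes the output layer.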
 
 $${z} = f({W^L}f({W^3}f({W^2}f({W^1}{x} + {b^1}) + {b^2}) + {b^3}) + {b^L})$$
@@ -136,8 +136,8 @@ Map），在这个例子中因为只有一个卷积核，所以输出的通道
 :label:`conv_computation_v4`
 
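 A **convolutional neural network**
-(CNN) consists of multiple
-**convolutional layers** and is commonly used for computer vision tasks.
+(CNN) :cite:`lecun1989backpropagation` consists of multiple
+**convolutional layers** and is commonly used for computer vision tasks :cite:`krizhevsky2012imagenet,he2016deep`.
  :numref:`conv_computation_v4` illustrates an example of a convolution operation.
 From the properties of convolution, two facts follow: 1) the number of channels of a convolution kernel equals the number of input channels; 2) the number of output channels equals the number of convolution kernels.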
 
@@ -158,7 +158,7 @@ Pooling）。如 :numref:`pooling_v3`所示，假设池化的卷积核高宽为$
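 ### Sequence Models
 
 Besides images, the real world contains large amounts of time-series data, such as videos and stock prices. A **recurrent neural network**
-(RNN) is a deep learning model architecture for processing sequence data. Sequence data is a series of consecutive data points $\{x_1, x_2, \dots, x_n\}$; for example, each $x$ may represent a word in a sentence.
+(RNN) :cite:`rumelhart1986learning` is a deep learning model architecture for processing sequence data. Sequence data is a series of consecutive data points $\{x_1, x_2, \dots, x_n\}$; for example, each $x$ may represent a word in a sentence.
 
 To accept a sequence of inputs, as shown in :numref:`rnn_simple_cell2`, a vanilla recurrent neural network uses a recurrent unit (cell) as its computational unit and a hidden
 state to store information about past inputs. Specifically, for each input $x$ fed to the model, the cell repeatedly computes a new hidden state according to Equation :eqref:`aligned`, recording information about the current and past inputs. The new hidden state is then used in the computation of the next cell.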
@@ -172,5 +172,4 @@ $${h}_t = {W}[{x}_t; {h}_{t-1}] + {b}$$
 
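 However, this simple vanilla recurrent neural network suffers from severe information forgetting. For example, if the input is "I am from China, my native language is ___", the hidden state remembers the information "China", so the network can predict the word "Chinese" at the end. When the sentence is long, however, the hidden state may not retain information from far back. For example, with "I am from China, I went to study in the UK and later worked in France, my native language is ___", the information about "China" in the final hidden state may have been forgotten after many updates.
 To address this problem, many improvements were later proposed, the best known of which is long
-short-term
-memory (LSTM). There are many other sequence models, such as the Transformer, which emerged in recent years.
+short-term memory (LSTM) :cite:`Hochreiter1997lstm`. There are many other sequence models, such as the Transformer :cite:`vaswani2017attention`, which emerged in recent years.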
diff --git a/chapter_references/index.md b/chapter_references/index.md
new file mode 100644
index 0000000..07daede
--- /dev/null
+++ b/chapter_references/index.md
@@ -0,0 +1,10 @@
+```eval_rst
+
+.. only:: html
+
+   References
+   ==========
+
+```
+
+:bibliography:`../mlsys.bib`
diff --git a/index.md b/index.md
index 64c89b1..f1e265d 100644
--- a/index.md
+++ b/index.md
@@ -31,4 +31,5 @@ chapter_online_machine_learning/index
 :maxdepth: 1
 
 appendix_machine_learning_introduction/index
+chapter_references/index
 ```
\ No newline at end of file
diff --git a/mlsys.bib b/mlsys.bib
new file mode 100644
index 0000000..fd33fc6
--- /dev/null
+++ b/mlsys.bib
@@ -0,0 +1,103 @@
+@article{rosenblatt1958perceptron,
+  title={The perceptron: a probabilistic model for information storage and organization in the brain},
+  author={Rosenblatt, Frank},
+  journal={Psychological Review},
+  volume={65},
+  number={6},
+  pages={386--408},
+  year={1958},
+  publisher={American Psychological Association}
+}
+
+@article{lecun1989backpropagation,
+  title={Backpropagation applied to handwritten zip code recognition},
+  author={LeCun, Yann and Boser, Bernhard and Denker, John S and Henderson, Donnie and Howard, Richard E and Hubbard, Wayne and Jackel, Lawrence D},
+  journal={Neural computation},
+  volume={1},
+  number={4},
+  pages={541--551},
+  year={1989},
+  publisher={MIT Press}
+}
+
+@inproceedings{krizhevsky2012imagenet,
+  title={Imagenet classification with deep convolutional neural networks},
+  author={Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E},
+  booktitle={Advances in Neural Information Processing Systems},
+  pages={1097--1105},
+  year={2012}
+}
+
+@inproceedings{he2016deep,
+	title={{Deep Residual Learning for Image Recognition}},
+	author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
+	booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
+	year={2016}
+}
+
+@article{rumelhart1986learning,
+  title={Learning representations by back-propagating errors},
+  author={Rumelhart, David E and Hinton, Geoffrey E and Williams, Ronald J},
+  journal={Nature},
+  volume={323},
+  number={6088},
+  pages={533--536},
+  year={1986},
+  publisher={Nature Publishing Group}
+}
+
+@article{Hochreiter1997lstm,
+	author = {Hochreiter, Sepp and Schmidhuber, J{\"{u}}rgen},
+	isbn = {08997667 (ISSN)},
+	issn = {0899-7667},
+	journal = {Neural Computation},
+	number = {8},
+	pages = {1735--1780},
+	pmid = {9377276},
+	title = {{Long Short-Term Memory}},
+	volume = {9},
+	year = {1997}
+}
+
+@inproceedings{vaswani2017attention,
+  title={Attention is all you need},
+  author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
+  booktitle={Advances in Neural Information Processing Systems},
+  pages={5998--6008},
+  year={2017}
+}
+
+@article{lecun2015deep,
+	title={Deep learning},
+	author={LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey},
+	journal={Nature},
+	volume={521},
+	number={7553},
+	pages={436--444},
+	year={2015},
+	publisher={Nature Publishing Group}
+}
+
+@inproceedings{KingmaAdam2014,
+	title = {{Adam}: A Method for Stochastic Optimization},
+	author = {Kingma, Diederik and Ba, Jimmy},
+	booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
+	year = {2015}
+}
+
+@techreport{tieleman2012rmsprop,
+	title={Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning},
+	author={Tieleman, T and Hinton, G},
+	year={2012},
+	institution={Technical Report}
+}
+
+@article{duchi2011adagrad,
+	title={Adaptive subgradient methods for online learning and stochastic optimization},
+	author={Duchi, John and Hazan, Elad and Singer, Yoram},
+	journal={Journal of Machine Learning Research (JMLR)},
+	volume={12},
+	number={Jul},
+	pages={2121--2159},
+	year={2011}
+}
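
This change relies on every key used in a :cite: role having a matching entry in mlsys.bib. A minimal sketch of how that could be checked is given here. The helper names (`cited_keys`, `bib_keys`, `missing_keys`) and the two regular expressions are assumptions made for illustration; they are not part of the repository's tooling and only cover the simple entry and role forms that appear in this diff.

```python
import re

# Hypothetical helpers for illustration; not part of the repository.
# Matches a :cite:`key1,key2` role and captures the comma-separated key list.
CITE_PATTERN = re.compile(r":cite:`([^`]+)`")
# Matches the start of a BibTeX entry such as "@article{key," and captures the key.
BIB_KEY_PATTERN = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(markdown_text):
    """Collect every key used in a :cite: role in the given text."""
    keys = set()
    for group in CITE_PATTERN.findall(markdown_text):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib_text):
    """Collect the entry keys defined in the given BibTeX text."""
    return set(BIB_KEY_PATTERN.findall(bib_text))


def missing_keys(markdown_text, bib_text):
    """Return cited keys without a BibTeX entry, sorted for stable output."""
    return sorted(cited_keys(markdown_text) - bib_keys(bib_text))


if __name__ == "__main__":
    sample_md = (
        "Adam :cite:`KingmaAdam2014` and CNNs "
        ":cite:`krizhevsky2012imagenet,he2016deep`."
    )
    sample_bib = (
        "@inproceedings{KingmaAdam2014,\n  year = {2015}\n}\n"
        "@inproceedings{he2016deep,\n  year={2016}\n}\n"
    )
    # krizhevsky2012imagenet is cited but absent from the sample BibTeX text.
    print(missing_keys(sample_md, sample_bib))
```

In practice the two arguments would be the contents of a chapter's markdown file and of mlsys.bib, read from disk; an empty result means every cited key resolves.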