[Content additions and extensions] Collective communication (#334)

* add initial content on collective communication

* Update mlsys.bib

* update megatron-lm/dall-e citations

* [collective] basic definition

* Update collective.md

* [collective] Broadcast

* [collective] reduce

* [collective] Reduce, Allreduce, Gather, All Gather, Scatter, ReduceScatter

* [collective] reorganize op section

* Update collective.md

* [collective] format

* [collective] calculating bandwidth

* [collective] ZeRO

* [collective] ZeRO and DALL-E

* Update collective.md

* [collective] remove topology section

* [collective] ZeRO and DALL-E

* [collective] abstraction

* Update collective.md

* [collective] abstractions & allreduce to extension

* [collective] bandwidth calculation

* [collective] move comm interface to summary

* [collective] typo

* [collective] typo

* Update mlsys.bib

* Update references (#335)

* update ch03 (#338)

* update (#339)

Co-authored-by: Jiankai-Sun <jkaisun1@gmail.com>

* Fix ch10 figures (#341)

* fix #264

* Fix figures

* Add extended readings (fix #282)

* Remove extra spaces

* Fix typo

* fix #183

* update fonts in figures

* fix #184 #263

* fix #184 #263

* fix a bug

* fix a bug

* fix 183

* fix a bug

* fix a text

* Merge

* add overview figure fix #263

* fix #263

* fix the overview figure

Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>

* Recsys fix (#340)

* fix text (#325)

* fix reference

* update images of explainable ai (#267) (#328)

* update explainable ai

* update explainable ai

* fix citation errors (#60)

* fix reference error

* update explainable ai

* update explainable ai

* fix citation errors (#60)

* fix reference error

* fetch upstream

* update explainable ai

* fix citation errors (#60)

* fix reference error

* update explainable ai

* remove redundant content

* update img of explainable AI(#267)

* fix bug in mlsys.bib

* fix bug2 in mlsys.bib

* rewrite mlsys.bib

Co-authored-by: lhy <hlicn@connect.ust.hk>
Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>

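* Remove the invalid image path from the section 6.2.1 heading (#337)

The image referenced in the section 6.2.1 heading already appears in the text below, so remove the invalid image path from that heading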

Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
Co-authored-by: Cheng Lai <laicheng_VIP@163.com>

* add extension (#331)

Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>

* add explainable extension (#343)

Co-authored-by: lixiaohui <lixiaohui33@huawei.com>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>

* Update RL chapter (#349)

* fix chap12 render

* add distributed rl chapter

* fix bug

* fix issue #212

* fix typo

* update imgs

* fix chinese

* fix svg img

* update contents in rl chapter

* update marl sys

* fix a fig

* fix ref

* fix error

Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>

* [collective] add references

* [collective] fix references & add equations

* [collective] fix reference and inline comments

* [collective] fix code

* Update collective.md

Co-authored-by: Cheng Lai <laicheng_VIP@163.com>
Co-authored-by: Jiarong Han <73918561+hanjr92@users.noreply.github.com>
Co-authored-by: Jack <sjkai1@126.com>
Co-authored-by: Jiankai-Sun <jkaisun1@gmail.com>
Co-authored-by: Yao Fu <fy38607203@163.com>
Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
Co-authored-by: HaoyangLI <417493727@qq.com>
Co-authored-by: lhy <hlicn@connect.ust.hk>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
Co-authored-by: theseed <feiyuxin1000@sina.com>
Co-authored-by: huygens12 <59854698+huygens12@users.noreply.github.com>
Co-authored-by: lixiaohui <lixiaohui33@huawei.com>
Co-authored-by: Zihan Ding <1402434478@qq.com>
Peiyuan Liao
2022-05-23 13:34:50 -04:00
committed by GitHub
parent 719de7d582
commit 3788ff67ad
4 changed files with 213 additions and 24 deletions


@@ -27,10 +27,26 @@
- Distributed machine learning systems: [survey](https://dl.acm.org/doi/abs/10.1145/3377454)
- Using collective communication to support parallel training in practice: [Horovod](https://arxiv.org/abs/1802.05799)
- Engineering details of AllReduce implementations: [tree structure](https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/), [ring structure](https://github.com/baidu-research/baidu-allreduce), [2D torus structure](https://arxiv.org/abs/1811.05233), and the [CollNet algorithm](https://github.com/NVIDIA/nccl/issues/320)
- Pipeline parallelism in practice: [GPipe](https://arxiv.org/abs/1811.06965)
- Large-scale data parallelism in practice: [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677)
- Model parallelism on very large models in practice: [ZeRO](https://arxiv.org/abs/1910.02054)
- Finally, discussions of collective communication often use technical terms for the underlying communication interfaces, such as Ethernet and Infiniband. Definitions of some common terms are given here:
* [Ethernet](https://web.archive.org/web/20181222184046/http://www.mef.net/Assets/White_Papers/Metro-Ethernet-Services.pdf)
* [NVLink](https://devblogs.nvidia.com/parallelforall/how-nvlink-will-enable-faster-easier-multi-gpu-computing/)
* [AWS Elastic Fabric Adapter (EFA)](https://aws.amazon.com/cn/hpc/efa/)
* [Infiniband](https://www.infinibandta.org/about-infiniband/)
* [RDMA](http://reports.ias.ac.in/report/12829/understanding-the-concepts-and-mechanisms-of-rdma)
* [RoCE](https://www.roceinitiative.org/about-overview/)
* [IPoIB](https://www.ibm.com/docs/en/aix/7.2?topic=protocol-internet-over-infiniband-ipoib)
## References
:bibliography:`../references/distributed.bib`