mirror of
https://github.com/openmlsys/openmlsys-zh.git
synced 2026-05-12 11:06:53 +08:00
[Content additions and extensions] Collective communication (#334)
* add initial content on collective communication
* Update mlsys.bib
* update megatron-lm/dall-e citations
* [collective] basic definition
* Update collective.md
* [collective] Broadcast
* [collective] reduce
* [collective] Reduce, Allreduce, Gather, All Gather, Scatter, ReduceScatter
* [collective] reorganize op section
* Update collective.md
* [collective] format
* [collective] calculating bandwidth
* [collective] ZeRO
* [collective] ZeRO and DALL-E
* Update collective.md
* [collective] remove topology section
* [collective] ZeRO and DALL-E
* [collective] abstraction
* Update collective.md
* [collective] abstractions & allreduce to extension
* [collective] bandwidth calculation
* [collective] move comm interface to summary
* [collective] typo
* [collective] typo
* Update mlsys.bib
* Update references (#335)
* update ch03 (#338)
* update (#339)
Co-authored-by: Jiankai-Sun <jkaisun1@gmail.com>
* Fix ch10 figures (#341)
* fix #264
* Fix figures
* Add extended readings (fix #282)
* Remove extra spaces
* Fix typo
* fix #183
* update fonts in figures
* fix #184 #263
* fix #184 #263
* fix a bug
* fix a bug
* fix 183
* fix a bug
* fix a text
* Merge
* add overview figure fix #263
* fix #263
* fix the overview figure
Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
* Recsys fix (#340)
* fix text (#325)
* fix reference
* update images of explainable ai (#267) (#328)
* update explainable ai
* update explainable ai
* fix citation errors (#60)
* fix reference error
* update explainable ai
* update explainable ai
* fix citation errors (#60)
* fix reference error
* fetch upstream
* update explainable ai
* fix citation errors (#60)
* fix reference error
* update explainable ai
* remove redundant content
* update img of explainable AI(#267)
* fix bug in mlsys.bib
* fix bug2 in mlsys.bib
* rewrite mlsys.bib
Co-authored-by: lhy <hlicn@connect.ust.hk>
Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
* Remove the invalid image path from the section 6.2.1 heading (#337)
The image referenced in the section 6.2.1 heading already appears later in the text, so the invalid image path is removed from the heading.
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
Co-authored-by: Cheng Lai <laicheng_VIP@163.com>
* add extension (#331)
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
* add explainable extension (#343)
Co-authored-by: lixiaohui <lixiaohui33@huawei.com>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
* Update RL chapter (#349)
* fix chap12 render
* add distributed rl chapter
* fix bug
* fix issue #212
* fix typo
* update imgs
* fix chinese
* fix svg img
* update contents in rl chapter
* update marl sys
* fix a fig
* fix ref
* fix error
Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
* [collevtive] add references
* [collective] fix references & add equations
* [collective] fix reference and inline comments
* [collective] fix code
* Update collective.md
Co-authored-by: Cheng Lai <laicheng_VIP@163.com>
Co-authored-by: Jiarong Han <73918561+hanjr92@users.noreply.github.com>
Co-authored-by: Jack <sjkai1@126.com>
Co-authored-by: Jiankai-Sun <jkaisun1@gmail.com>
Co-authored-by: Yao Fu <fy38607203@163.com>
Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
Co-authored-by: HaoyangLI <417493727@qq.com>
Co-authored-by: lhy <hlicn@connect.ust.hk>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
Co-authored-by: theseed <feiyuxin1000@sina.com>
Co-authored-by: huygens12 <59854698+huygens12@users.noreply.github.com>
Co-authored-by: lixiaohui <lixiaohui33@huawei.com>
Co-authored-by: Zihan Ding <1402434478@qq.com>
@@ -27,10 +27,26 @@
- Distributed machine learning systems: [survey](https://dl.acm.org/doi/abs/10.1145/3377454)
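- Using collective communication (集合通讯) to support parallel training in practice: [Horovod](https://arxiv.org/abs/1802.05799)
- Using collective communication (集合通信) to support parallel training in practice: [Horovod](https://arxiv.org/abs/1802.05799)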
- Engineering implementation details of AllReduce: [tree topology](https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/), [ring topology](https://github.com/baidu-research/baidu-allreduce), [2D torus topology](https://arxiv.org/abs/1811.05233), and the [CollNet algorithm](https://github.com/NVIDIA/nccl/issues/320)
- Pipeline parallelism in practice: [gPipe](https://arxiv.org/abs/1811.06965)
- Large-scale data parallelism in practice: [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677)
- Model parallelism for very large models in practice: [ZeRO](https://arxiv.org/abs/1910.02054)
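- Finally, discussions of collective communication often use technical terms for the underlying communication interfaces, such as Ethernet and InfiniBand. Definitions of some common terms are given here: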
* [Ethernet](https://web.archive.org/web/20181222184046/http://www.mef.net/Assets/White_Papers/Metro-Ethernet-Services.pdf)
* [NVLink](https://devblogs.nvidia.com/parallelforall/how-nvlink-will-enable-faster-easier-multi-gpu-computing/)
* [AWS Elastic Fabric Adapter (EFA)](https://aws.amazon.com/cn/hpc/efa/)
* [InfiniBand](https://www.infinibandta.org/about-infiniband/)
* [RDMA](http://reports.ias.ac.in/report/12829/understanding-the-concepts-and-mechanisms-of-rdma)
* [RoCE](https://www.roceinitiative.org/about-overview/)
* [IPoIB](https://www.ibm.com/docs/en/aix/7.2?topic=protocol-internet-over-infiniband-ipoib)
## References
:bibliography:`../references/distributed.bib`