Summary

- The emergence of large machine learning models has driven rapid growth in demand for compute and memory, giving rise to distributed training systems.
- The design of distributed training systems typically follows a "divide and conquer" approach.
- With a distributed training system, one can significantly improve performance and cost efficiency, and better tolerate hardware failures.
- A distributed training system can increase compute capacity by adding devices through data parallelism.
- When the memory of a single node is insufficient, model parallelism can be used to overcome the memory limit of a single device. Model parallelism comes in two forms: intra-operator parallelism and inter-operator parallelism.
- Large model-parallel systems are prone to idle "bubbles" in device utilization, which can be mitigated by pipeline parallelism (a bubble-ratio sketch follows this list).
- Distributed training systems usually run in commodity data centers, whose networks cannot provide enough bandwidth to transfer the large volume of gradients generated during training.
- To provide massive bandwidth, machine learning clusters use heterogeneous networks: Ethernet, intra-node interconnects (NVLink), and InfiniBand.
- To remove the single-node bottleneck, Allreduce can be used to spread the computation and communication cost of gradient aggregation across all workers (a ring Allreduce sketch follows this list).
- Parameter servers let a machine learning cluster separate computation from storage, which better supports large sparse models (a minimal pull/push sketch follows this list).
- Parameter servers commonly use data replication to resolve hotspot issues, and they can also be used to mitigate the straggler problem common in synchronous training systems.
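To make the pipeline-parallelism point concrete, the following is a back-of-the-envelope sketch (not taken from the chapter itself) of the idle-time fraction in a GPipe-style schedule. It assumes `p` stages and `m` micro-batches of equal cost; the function name `gpipe_bubble_ratio` is purely illustrative.

```python
def gpipe_bubble_ratio(num_stages: int, num_microbatches: int) -> float:
    """Fraction of idle time (the "bubble") in a GPipe-style schedule.

    With p equal-cost pipeline stages and m micro-batches, one pass through
    the pipeline occupies m + p - 1 time slots, of which p - 1 are start-up
    and drain slots during which some stages sit idle.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)


# Splitting the mini-batch into more micro-batches shrinks the bubble.
for m in (1, 4, 16, 64):
    print(f"4 stages, {m:2d} micro-batches -> bubble {gpipe_bubble_ratio(4, m):.0%}")
```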
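The Allreduce point can likewise be illustrated with a minimal single-process simulation of ring Allreduce, shown below. This is only a toy sketch assuming NumPy; a real training system would rely on a collective communication library (for example NCCL or MPI) rather than code like this.

```python
import numpy as np


def ring_allreduce(grads):
    """Simulate ring Allreduce over a list of per-worker gradient vectors.

    Each gradient is split into n chunks. In the reduce-scatter phase every
    worker accumulates the full sum of one chunk; in the all-gather phase the
    reduced chunks are circulated so that every worker ends up with the full
    sum. Per-worker traffic is roughly 2 * (n - 1) / n of the gradient size,
    independent of the number of workers n.
    """
    n = len(grads)
    chunks = [np.array_split(g.copy(), n) for g in grads]

    # Reduce-scatter: at step t, worker i sends chunk (i - t) mod n to worker i + 1.
    for t in range(n - 1):
        for i in range(n):
            c = (i - t) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # All-gather: at step t, worker i forwards the reduced chunk (i + 1 - t) mod n.
    for t in range(n - 1):
        for i in range(n):
            c = (i + 1 - t) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]


# Four simulated workers, each holding an 8-element gradient.
grads = [np.random.rand(8) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```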
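Finally, here is a hypothetical single-process sketch of the parameter-server pull/push interface mentioned above. The `ParameterServer` class, its key names, and the SGD update inside `push` are illustrative assumptions rather than the API of any particular system; replication and straggler handling are deliberately omitted.

```python
from collections import defaultdict

import numpy as np


class ParameterServer:
    """Toy parameter server: workers pull parameters and push gradients."""

    def __init__(self, learning_rate: float = 0.1):
        # Each key maps to a parameter shard; unseen keys start at zero,
        # which is convenient for large sparse embedding tables.
        self.params = defaultdict(lambda: np.zeros(4))
        self.lr = learning_rate

    def pull(self, keys):
        """Return copies of the latest parameters for the requested keys."""
        return {k: self.params[k].copy() for k in keys}

    def push(self, grads):
        """Apply a plain SGD update with the gradients pushed by a worker."""
        for key, grad in grads.items():
            self.params[key] -= self.lr * grad


# One training step from a single worker's point of view.
server = ParameterServer()
weights = server.pull(["embedding/42"])                    # read current parameters
grad = {"embedding/42": np.ones_like(weights["embedding/42"])}  # placeholder gradient
server.push(grad)                                          # write the update back
print(server.pull(["embedding/42"]))
```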
Further Reading

- A survey of distributed machine learning systems
- Collective communication for parallel training in practice: Horovod
- Pipeline parallelism in practice: GPipe
- Large-scale data parallelism in practice: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- Model parallelism for very large models in practice: ZeRO
- Finally, when reading about collective communication, one often encounters specialized terms for the underlying communication interfaces, such as Ethernet and InfiniBand. Concrete definitions of some common terms are given below:
References
:bibliography:../references/distributed.bib