Files
Peiyuan Liao 3788ff67ad [内容补充与拓展]集合通信 (#334)
* add initial content on collective communication

* Update mlsys.bib

* update megatron-lm/dall-e citations

* [collective] basic definition

* Update collective.md

* [collective] Broadcast

* [collective] reduce

* [collective] Reduce, Allreduce, Gather, All Gather, Scatter, ReduceScatter

* [collective] reorganize op section

* Update collective.md

* [collective] format

* [collective] calculating bandwidth

* [collective] ZeRO

* [collective] ZeRO and DALL-E

* Update collective.md

* [collective] remove topology section

* [collective] ZeRO and DALL-E

* [collective] abstraction

* Update collective.md

* [collective] abstractions & allreduce to extension

* [collective] bandwidth calculation

* [collective] move comm interface to summary

* [collective] typo

* [collective] typo

* Update mlsys.bib

* Update references (#335)

* update ch03 (#338)

* update (#339)

Co-authored-by: Jiankai-Sun <jkaisun1@gmail.com>

* Fix ch10 figures (#341)

* fix #264

* Fix figures

* Add extended readings (fix #282)

* Remove extra spaces

* Fix typo

* fix #183

* update fonts in figures

* fix #184 #263

* fix #184 #263

* fix a bug

* fix a bug

* fix 183

* fix a bug

* fix a text

* Merge

* add overview figure fix #263

* fix #263

* fix the overview figure

Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>

* Recsys fix (#340)

* fix text (#325)

* fix reference

* update images of explainable ai (#267) (#328)

* update explainable ai

* update explainable ai

* fix citation errors (#60)

* fix reference error

* update explainable ai

* update explainable ai

* fix citation errors (#60)

* fix reference error

* fetch upstream

* update explainable ai

* fix citation errors (#60)

* fix reference error

* update explainable ai

* remove redundant content

* update img of explainable AI(#267)

* fix bug in mlsys.bib

* fix bug2 in mlsys.bib

* rewrite mlsys.bib

Co-authored-by: lhy <hlicn@connect.ust.hk>
Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>

* 删除6.2.1小节标题中无效的图片路径 (#337)

6.2.1小节标题中的图片引用在下文出现了,删除该小节标题中无效的图片路径

Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
Co-authored-by: Cheng Lai <laicheng_VIP@163.com>

* add extension (#331)

Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>

* add explainable extension (#343)

Co-authored-by: lixiaohui <lixiaohui33@huawei.com>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>

* Update RL chapter (#349)

* fix chap12 render

* add distributed rl chapter

* fix bug

* fix issue #212

* fix typo

* update imgs

* fix chinese

* fix svg img

* update contents in rl chapter

* update marl sys

* fix a fig

* fix ref

* fix error

Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>

* [collevtive] add references

* [collective] fix references & add equations

* [collective] fix reference and inline comments

* [collective] fix code

* Update collective.md

Co-authored-by: Cheng Lai <laicheng_VIP@163.com>
Co-authored-by: Jiarong Han <73918561+hanjr92@users.noreply.github.com>
Co-authored-by: Jack <sjkai1@126.com>
Co-authored-by: Jiankai-Sun <jkaisun1@gmail.com>
Co-authored-by: Yao Fu <fy38607203@163.com>
Co-authored-by: Dalong <39682259+eedalong@users.noreply.github.com>
Co-authored-by: HaoyangLI <417493727@qq.com>
Co-authored-by: lhy <hlicn@connect.ust.hk>
Co-authored-by: Luo Mai <luo.mai.cs@gmail.com>
Co-authored-by: theseed <feiyuxin1000@sina.com>
Co-authored-by: huygens12 <59854698+huygens12@users.noreply.github.com>
Co-authored-by: lixiaohui <lixiaohui33@huawei.com>
Co-authored-by: Zihan Ding <1402434478@qq.com>
2022-05-23 13:34:50 -04:00

1.6 KiB
Raw Permalink Blame History

分布式训练

随着机器学习的进一步发展科学家们设计出更大型更多功能的机器学习模型例如说GPT-3。这种模型含有大量参数需要复杂的计算以及处理海量的数据。单个机器上有限的资源无法满足训练大型机器学习模型的需求。因此我们需要设计分布式训练系统从而将一个机器学习模型任务拆分成多个子任务并将子任务分发给多个计算节点解决资源瓶颈。

在本章节中我们会引入分布式机器学习系统的相关概念设计挑战系统实现和实例研究。我们会首先讨论分布式训练系统的定义设计动机和好处。进一步我们会讨论常见的分布式训练方法数据并行模型并行和流水线并行。在实际中这些分布式训练方法会被参数服务器Parameter Servers或者是集合通信库Collective Communication Libraries实现。不同的系统实现具有各自的优势和劣势。我们会用大型预训练模型和大型深度学习推荐系统作为实例来探讨不同系统实现的利与弊。

本章的学习目标包括:

  • 掌握分布式训练相关系统组件的定义,设计动机和好处

  • 掌握常见的分布式训练方法:数据并行,模型并行和流水线并行

  • 掌握常见的分布式训练框架实现:参数服务器和集合通信

  • 理解常见分布式训练的实例,和采用不同实现方法的利弊。

:maxdepth: 2

overview
methods
pipeline
collective
parameter_servers
summary