openmlsys-zh/v1/zh_chapters/chapter_distributed_training/summary.md at main

mirror of https://github.com/openmlsys/openmlsys-zh.git synced 2026-03-24 14:00:43 +08:00




Summary

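The emergence of large machine learning models has driven rapid growth in demand for compute and memory, which in turn gave rise to distributed training systems.
Distributed training systems are usually designed around the divide-and-conquer principle.
With a distributed training system, practitioners can significantly improve training performance, train more cost-effectively, and tolerate hardware failures.
Data parallelism raises the compute available to a training job by adding devices, each of which processes a different shard of the training data.
When a model does not fit in the memory of a single device, model parallelism can partition it across devices. Model parallelism comes in two forms: intra-operator parallelism and inter-operator parallelism.
Large model-parallel systems are prone to device idle time, known as bubbles, which pipeline parallelism can reduce.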
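
To give a sense of how much idle time pipelining can recover, here is a minimal sketch that estimates the bubble fraction of a synchronous pipeline. It assumes every stage takes the same time per micro-batch and ignores communication cost; under those assumptions the idle share of the schedule is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function name `bubble_fraction` is hypothetical and chosen only for illustration.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of a synchronous pipeline schedule.

    Assumption: all stages take equal time per micro-batch and
    communication is free, so each device idles for (p - 1) slots
    out of (m + p - 1) total slots.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # More micro-batches shrink the bubble but never remove it entirely.
    for m in (1, 4, 16, 64):
        print(f"stages=4 microbatches={m} bubble={bubble_fraction(4, m):.3f}")
```

With a single micro-batch the devices behave like plain inter-operator model parallelism and are idle most of the time; splitting the mini-batch into more micro-batches lowers the bubble fraction toward zero.
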
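Distributed training systems usually run on compute clusters, where an ordinary cluster network cannot provide enough bandwidth to carry the large volume of gradients produced during training.
To supply this communication bandwidth, machine learning clusters use heterogeneous high-performance networks, including Ethernet, the accelerator interconnect NVLink, and the high-bandwidth network InfiniBand.
To avoid a single-node bottleneck, the AllReduce algorithm can spread the computation and communication of gradient aggregation across nodes while balancing the load.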
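
The load-balancing idea behind AllReduce can be sketched with a single-process simulation of one common variant, ring AllReduce. This is an illustrative toy, not the implementation used by any particular library: workers are plain Python lists, the hypothetical function `ring_allreduce` stands in for real network transfers, and the gradient length is assumed to be divisible by the number of workers.

```python
def ring_allreduce(worker_grads):
    """Simulate ring AllReduce (sum) over equally sized gradient lists.

    Assumption: len(gradient) is divisible by the number of workers.
    Returns one list per worker; every list holds the element-wise sum.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads) or length % n != 0:
        raise ValueError("gradients must have equal length divisible by n")
    chunk = length // n
    data = [list(g) for g in worker_grads]

    def exchange(chunk_to_send, accumulate):
        # Snapshot all payloads first so the sends happen "simultaneously".
        sends = []
        for i in range(n):
            c = chunk_to_send(i)
            sends.append((c, data[i][c * chunk:(c + 1) * chunk]))
        for i, (c, payload) in enumerate(sends):
            dst = (i + 1) % n  # each worker only talks to its ring neighbor
            for offset, value in enumerate(payload):
                k = c * chunk + offset
                data[dst][k] = data[dst][k] + value if accumulate else value

    # Phase 1, reduce-scatter: after n - 1 steps worker i owns the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        exchange(lambda i: (i - step) % n, accumulate=True)

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        exchange(lambda i: (i + 1 - step) % n, accumulate=False)

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5, 6],
             [10, 20, 30, 40, 50, 60],
             [100, 200, 300, 400, 500, 600]]
    for rank, result in enumerate(ring_allreduce(grads)):
        print(rank, result)
```

In this scheme every worker sends 2(n - 1) chunks of size length / n, so per-worker traffic stays close to twice the gradient size regardless of how many workers join, and no single node has to aggregate everything.
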
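Parameter servers support flexible gradient synchronization and asynchronous training, which guards against straggler servers in the cluster.
Parameter servers commonly use data replication to relieve data hot spots and to guard against hardware failures.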
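
The two parameter-server properties above can be illustrated with a small in-memory sketch. Everything here is hypothetical: the class `ToyParameterServer`, its methods, and the plain SGD update are assumptions made for illustration, and real systems replicate over a network rather than inside one process. Pushes are applied as soon as they arrive (asynchronous training, so a slow worker does not block the others), writes go to every live replica, and reads are spread across replicas by worker id.

```python
class ToyParameterServer:
    """In-memory sketch of a replicated, asynchronous parameter server."""

    def __init__(self, params, lr, num_replicas=2):
        self.lr = lr
        # Each replica keeps its own full copy of the parameters.
        self.replicas = [dict(params) for _ in range(num_replicas)]
        self.alive = [True] * num_replicas

    def _live(self):
        live = [i for i, ok in enumerate(self.alive) if ok]
        if not live:
            raise RuntimeError("all replicas have failed")
        return live

    def pull(self, worker_id=0):
        # Spread reads over live replicas to relieve hot spots.
        live = self._live()
        return dict(self.replicas[live[worker_id % len(live)]])

    def push(self, grads):
        # Asynchronous update: apply this worker's gradient immediately,
        # without waiting for gradients from any other worker.
        for i in self._live():
            for name, g in grads.items():
                self.replicas[i][name] -= self.lr * g

    def fail(self, replica_id):
        # Simulate a hardware failure of one replica.
        self.alive[replica_id] = False


if __name__ == "__main__":
    ps = ToyParameterServer({"w": 1.0}, lr=0.5)
    ps.push({"w": 1.0})        # a fast worker updates right away
    ps.fail(0)                 # the first replica goes down
    print(ps.pull(worker_id=3))  # the surviving replica still serves reads
```

Because every live replica receives each write, the surviving copy already holds the latest parameters when another replica fails, which is the fault-tolerance benefit the summary refers to.
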
