## Chapter Summary
- Large-scale machine learning models have driven rapid growth in the demand for compute and memory, giving rise to distributed training systems.
- Distributed training systems typically employ data parallelism, model parallelism, or a combination of the two, depending on memory and compute constraints (see the data-parallel training sketch after this list).
- Pipeline parallelism is another technique used by distributed training systems: a mini-batch is partitioned into micro-batches, and the forward and backward passes of different micro-batches are overlapped across pipeline stages (see the pipeline-schedule sketch below).
- Although distributed training systems usually run in compute clusters, cluster networks often lack the bandwidth needed to transmit the large gradients produced during training.
- To meet this demand for communication bandwidth, machine learning clusters integrate heterogeneous high-performance interconnects such as NVLink, NVSwitch, and InfiniBand.
- To train a model synchronously, distributed training systems rely on a range of collective communication operators, among which AllReduce is widely used to aggregate the gradients computed by distributed nodes (see the ring AllReduce sketch below).
- Parameter servers play a key role in asynchronous training and sparse model training, and they use model replication to mitigate data hotspots and server failures (see the parameter-server push/pull sketch below).
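The data-parallel pattern can be made concrete with a minimal sketch, assuming a toy linear model in plain NumPy; the names here (`local_gradient`, the shard variables) are illustrative, not from any particular framework. Each worker replica computes a gradient on its own shard of the mini-batch, and the averaged gradient is applied identically on every replica.

```python
# Minimal data-parallelism sketch (illustrative, NumPy only): each "worker"
# holds a full model replica and computes a gradient on its own shard of
# the mini-batch; the averaged gradient gives every replica the same update.
import numpy as np

rng = np.random.default_rng(0)
num_workers = 4
w = np.zeros(8)                      # model replica (a linear model)
X = rng.normal(size=(64, 8))         # global mini-batch of 64 examples
y = X @ rng.normal(size=8)           # synthetic regression targets

shards_X = np.array_split(X, num_workers)
shards_y = np.array_split(y, num_workers)

def local_gradient(w, Xs, ys):
    """Gradient of mean squared error on one worker's shard."""
    err = Xs @ w - ys
    return 2.0 * Xs.T @ err / len(ys)

for step in range(100):
    grads = [local_gradient(w, Xs, ys) for Xs, ys in zip(shards_X, shards_y)]
    g = np.mean(grads, axis=0)       # stands in for an AllReduce-style average
    w -= 0.05 * g                    # identical update on every replica

print("final loss:", np.mean((X @ w - y) ** 2))
```

In a real system, the `np.mean` over worker gradients would be performed by a collective operation such as AllReduce across machines rather than inside one process.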
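The micro-batch overlap in pipeline parallelism can be seen in a small scheduling sketch, assuming a GPipe-style forward schedule in which stage `s` runs micro-batch `m` at time step `s + m`; this illustrates only the timing, not any specific library's implementation.

```python
# Hypothetical sketch of a GPipe-style pipeline schedule: a mini-batch is
# split into micro-batches, and stage s can run the forward pass of
# micro-batch m at time step s + m, so different micro-batches overlap
# across stages instead of leaving all but one stage idle.
num_stages, num_microbatches = 4, 6

for t in range(num_stages + num_microbatches - 1):
    active = [
        f"stage{s}:mb{t - s}"
        for s in range(num_stages)
        if 0 <= t - s < num_microbatches
    ]
    print(f"t={t}: " + ", ".join(active))
    # (Backward passes drain through the stages in reverse order; omitted.)
```

The printout shows the pipeline filling up, running all stages concurrently on different micro-batches, and then draining.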
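As a sketch of how AllReduce can aggregate gradients without a central bottleneck, the following implements a ring AllReduce in plain NumPy; the function name and the "snapshot the sends" step are illustrative assumptions that model each round's transfers happening in parallel, whereas a real collective library performs them over network links.

```python
# Minimal ring AllReduce sketch (illustration, not a communication library):
# each of n nodes holds a gradient vector split into n chunks. Reduce-scatter
# accumulates one fully summed chunk per node; all-gather then circulates the
# reduced chunks so every node ends with the complete summed gradient.
import numpy as np

def ring_allreduce(grads):
    """Sum a list of equal-length vectors as if each lived on its own node."""
    n = len(grads)
    chunks = [list(np.array_split(g, n)) for g in grads]  # chunks[node][chunk]

    # Reduce-scatter: at step t, node i "sends" chunk (i - t) % n to node
    # (i + 1) % n, which adds it into its own copy.
    for t in range(n - 1):
        sends = [chunks[i][(i - t) % n] for i in range(n)]
        for i in range(n):
            c = (i - 1 - t) % n
            chunks[i][c] = chunks[i][c] + sends[(i - 1) % n]

    # Node i now holds the fully reduced chunk (i + 1) % n. All-gather:
    # circulate the reduced chunks so every node ends up with all of them.
    for t in range(n - 1):
        sends = [chunks[i][(i + 1 - t) % n] for i in range(n)]
        for i in range(n):
            chunks[i][(i - t) % n] = sends[(i - 1) % n]

    return [np.concatenate(c) for c in chunks]

rng = np.random.default_rng(1)
grads = [rng.normal(size=12) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```

Each of the `2(n - 1)` steps moves only one chunk per link, which is why the ring variant keeps per-link traffic nearly independent of the number of nodes.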
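Finally, a toy push/pull loop illustrates the parameter-server pattern; `ParameterServer`, `pull`, and `push` are hypothetical names in a single process, standing in for RPCs to a real server. Updates are applied as gradients arrive, with no barrier across workers, which is what makes the training asynchronous (a worker may compute on slightly stale parameters).

```python
# Toy parameter-server sketch (hypothetical names, single process): workers
# "pull" the current parameters, compute a gradient on local data, and
# "push" it back; the server applies each update on arrival, so workers
# never wait for one another.
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()

    def push(self, grad):
        # Applied immediately on arrival: no synchronization barrier.
        self.w -= self.lr * grad

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 8))
y = X @ rng.normal(size=8)
server = ParameterServer(dim=8)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

for step in range(200):
    Xs, ys = shards[step % 4]        # workers take turns on their shards
    w = server.pull()                # may already be stale by push time
    grad = 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)
    server.push(grad)

print("final loss:", np.mean((X @ server.w - y) ** 2))
```

A production deployment would shard `w` across several server replicas, which is the replication mechanism the summary credits with absorbing data hotspots and surviving server failures.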