中文 | English
Machine Learning Systems: Design and Implementation
An open-source book explaining the design principles and implementation experience of modern machine learning systems, covering the complete technology stack from programming interfaces and computational graphs to compilers and distributed training.
English version 1 (stable): openmlsys.github.io/html-en/
English version 2: Under reconstruction.
Table of Contents
Target Audience
- Students: Those who have mastered machine learning fundamentals and want to deeply understand the design and implementation of modern ML systems.
- Researchers: Those who need to develop custom operators or leverage distributed execution for large model development.
- Engineers: Those who build ML infrastructure and need to tune system performance or customize ML systems for business needs.
Content Overview
The book (2nd edition) consists of 9 chapters:
| Chapter | Content |
|---|---|
| Chapter 1: Introduction | Overview of ML system architecture and technology stack |
| Chapter 2: Programming Interfaces and Computational Graphs | Tensor abstraction, automatic differentiation, graph representation and execution |
| Chapter 3: AI Accelerators and Programming | GPU architecture and CUDA/Triton/CUTLASS programming models |
| Chapter 4: AI Compilers and Runtime Systems | IR design, graph optimization, kernel generation, and runtime execution |
| Chapter 5: Data Processing Systems | Data loading, data pipelines, and distributed data processing |
| Chapter 6: Training Systems | Single-node and distributed training, parallelism strategies, and training optimization |
| Chapter 7: Model Serving | Inference optimization, online serving, and model management |
| Chapter 8: RL Systems | Reinforcement learning pipelines, environment interaction, and RL system design |
| Chapter 9: Large-scale GPU Cluster Management | GPU scheduling, resource management, and large-scale training infrastructure |
Changelog
| Date | Event |
|---|---|
| 2022-01 | Project initialized; Chinese content writing begins |
| 2022-05 | Extension chapters released (Federated Learning, RL Systems, Explainable AI) |
| 2023-05 | Codebase adapted to MindSpore 2.0 |
| 2026-03 | Bilingual (CN/EN) build architecture refactored; English version launched |
Build Guide
Prerequisites
- curl
- git
- Python 3
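Before installing anything, it can help to confirm that the three prerequisites are on your `PATH`. The snippet below is a minimal sketch of such a check. The helper name `missing_prereqs` is hypothetical and not part of this repository, and the snippet assumes Python 3 is installed under the command name `python3`.

```shell
#!/bin/sh
# Sketch: report which of the listed tools are not available on PATH.
# `missing_prereqs` is a hypothetical helper, not part of the repository.
missing_prereqs() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command exists.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: Python 3 is exposed as `python3` on this machine.
missing=$(missing_prereqs curl git python3)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisites found."
fi
```

The script only reports what is missing and does not install anything, so it is safe to run repeatedly.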
Installation
```shell
# Clone the repository
git clone https://github.com/openmlsys/openmlsys-zh.git
cd openmlsys-zh

# Install Rust toolchain (Linux/macOS)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Make cargo available in the current shell (default rustup install location)
. "$HOME/.cargo/env"

# Install mdbook
cargo install mdbook
```
Build HTML
```shell
sh build_mdbook_v2.sh
# English output: .mdbook-v2/book
# Chinese output: .mdbook-v2-zh/book
```
For more details, see the Build Guide.
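After a build, a quick check of the two output directories listed above can confirm that both language versions were generated. This is a minimal sketch that assumes each output directory contains an `index.html` at its root, as mdbook's HTML output normally does. The helper name `has_book` is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Sketch: check that each build output directory contains an index.html.
# `has_book` is a hypothetical helper, not part of the repository.
has_book() {
    [ -f "$1/index.html" ]
}

# Assumption: run from the repository root, where the build script writes its output.
for dir in .mdbook-v2/book .mdbook-v2-zh/book; do
    if has_book "$dir"; then
        echo "ok: $dir"
    else
        echo "missing: $dir (run sh build_mdbook_v2.sh first)"
    fi
done
```

To preview a generated book locally, one option is Python's built-in static server, for example `python3 -m http.server 8000 --directory .mdbook-v2/book` (the `--directory` flag requires Python 3.7 or later).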
Contributing
We welcome all forms of contributions. For the full workflow, see the Contributing Guide.
Before contributing, please read:
Community
Join our WeChat group by scanning the QR code
Citation
If this book has been helpful to your research or work, please cite it as:
Plain text:
OpenMLSys Team. Machine Learning Systems: Design and Implementation. 2022. https://openmlsys.github.io/
BibTeX:
```bibtex
@book{openmlsys2022,
  title  = {Machine Learning Systems: Design and Implementation},
  author = {OpenMLSys Team},
  year   = {2022},
  url    = {https://openmlsys.github.io/},
  note   = {Open-source textbook, \url{https://github.com/openmlsys/openmlsys-zh}}
}
```
License
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
