
Summary

In this chapter, we explored how to design and implement the data preprocessing module of a machine learning system along three dimensions: usability, efficiency, and order preservation.

On the usability dimension, we focused on the data module's programming model. Drawing on the design experience of earlier parallel data processing systems, we concluded that a programming abstraction based on describing dataset transformations is well suited as the data module's programming model. On top of this model, a concrete system must not only provide a rich set of built-in operators to ease users' data preprocessing work, but also make it convenient for users to plug in custom operators.

On the efficiency dimension, we introduced specialized file format design and parallel computation architecture design from the perspectives of data loading and computation, respectively. We also applied the computation graph compilation optimizations learned in previous chapters to users' data preprocessing graphs, further raising data processing throughput.

Because models in machine learning scenarios are sensitive to the order of their data input, data modules have the special property of order preservation. We analyzed this property and showed how a real system enforces it through the ordering constraints implemented in MindSpore's Connector.

Finally, for situations where single-machine CPU preprocessing performance is insufficient, we introduced the current approaches of vertical scaling through heterogeneous processing acceleration and horizontal scaling through distributed data preprocessing. After studying this chapter, readers should have a solid understanding of data modules in machine learning systems and an awareness of the challenges these modules will face in the future.
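To make the "describe dataset transformations" programming model concrete, the following is a minimal sketch (not the API of any real library): each operator returns a new lazy dataset object, nothing executes until iteration, and a user-defined function passed to `map` plugs into the same model as a built-in operator would. All class and method names here are illustrative.

```python
# Sketch of a transformation-based dataset programming model.
# Each operator describes a step; execution is deferred to iteration.
class Dataset:
    def __init__(self, source):
        self._source = source  # zero-arg callable returning a fresh iterator

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    def map(self, fn):
        # `fn` may be a built-in transform or a user-defined (custom) operator;
        # both plug into the pipeline the same way.
        parent = self._source
        return Dataset(lambda: (fn(x) for x in parent()))

    def batch(self, size):
        parent = self._source
        def gen():
            buf = []
            for x in parent():
                buf.append(x)
                if len(buf) == size:
                    yield buf
                    buf = []
            if buf:
                yield buf
        return Dataset(gen)

    def __iter__(self):
        return self._source()

# A preprocessing pipeline is just a description of transformations.
ds = Dataset.from_list([1, 2, 3, 4]).map(lambda x: x * 10).batch(2)
print(list(ds))  # [[10, 20], [30, 40]]
```

Because every operator only wraps its parent in a new description, a real system is free to rewrite this graph of transformations (fusing maps, reordering steps) before execution, which is exactly what the compilation optimizations discussed above exploit.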
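The order-preservation constraint discussed above can be sketched in a few lines: when parallel workers finish samples out of order, each result carries a sequence number, and a reorder buffer (playing a role analogous to MindSpore's Connector, though this is a simplified stand-in, not its actual implementation) releases results strictly in input order. The worker pool is simulated here by shuffling completion order.

```python
import random

def ordered_results(indexed_results):
    """Yield values in sequence-number order.

    indexed_results: iterable of (seq, value) pairs arriving in
    arbitrary completion order, with seq numbers 0, 1, 2, ...
    """
    pending = {}   # results that arrived early, keyed by sequence number
    next_seq = 0   # the sequence number the consumer must see next
    for seq, value in indexed_results:
        pending[seq] = value
        # Release every buffered result whose turn has come.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1

# Simulate six preprocessed samples completing out of order.
completed = [(seq, seq * seq) for seq in range(6)]
random.shuffle(completed)
print(list(ordered_results(completed)))  # [0, 1, 4, 9, 16, 25]
```

The cost of this guarantee is the buffer: a sample that finishes early must wait for all of its predecessors, which is why order preservation constrains how much parallel speedup the data module can realize.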

Further Reading

  • For an example of pipeline-level parallelism, we recommend reading the implementation of PyTorch's DataLoader.
  • For an example of operator-level parallelism, we recommend reading the implementation of MindSpore's MindData.