openmlsys-zh/v1/zh_chapters/chapter_data_processing/summary.md at main

mirror of https://github.com/openmlsys/openmlsys-zh.git synced 2026-03-21 12:32:04 +08:00


Summary

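In this chapter, we examined how to design and implement the data preprocessing module of a machine learning system along three dimensions: usability, efficiency, and order preservation. For usability, we focused on the programming model of the data module. Drawing on the design experience of successful parallel data processing systems, we argued that a programming abstraction based on describing dataset transformations is a good fit for the data module. In a concrete implementation, the system should provide enough built-in operators on top of this model to make preprocessing code easy to write, and it should also make it convenient for users to plug in their own custom operators. For efficiency, we covered both data loading and computation, introducing specialized file format design for the former and parallel computation architectures for the latter. We also applied the computational graph compilation and optimization techniques from earlier chapters to the user's data preprocessing graph in order to further raise data processing throughput. Because models in machine learning are sensitive to the order of their input data, order preservation arises as a distinctive requirement; we analyzed it and used the constrained implementation of the Connector in MindSpore to show how a real system guarantees it. Finally, for cases where single-machine CPU preprocessing performance falls short, we introduced scale-up solutions based on heterogeneous acceleration and scale-out solutions based on distributed data preprocessing. We hope that after studying this chapter, readers have a solid understanding of the data module in machine learning systems and a sense of the challenges it will face in the future.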
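To make the dataset-transformation abstraction and the order-preservation requirement concrete, here is a minimal sketch in plain Python. The `Dataset` class, its `map` and `batch` methods, and the `slow_square` operator are all hypothetical names invented for illustration; they are not the API of MindSpore, MindData, or PyTorch. The sketch relies only on the standard library, where `Executor.map` returns results in input order even when workers finish out of order.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Iterator, List


class Dataset:
    """Hypothetical lazy dataset: each method returns a new Dataset
    that describes one more transformation of its parent."""

    def __init__(self, source: Callable[[], Iterable[Any]]):
        # A factory rather than an iterator, so the dataset can be re-iterated.
        self._source = source

    @classmethod
    def from_list(cls, items: List[Any]) -> "Dataset":
        return cls(lambda: iter(items))

    def map(self, fn: Callable[[Any], Any], num_workers: int = 1) -> "Dataset":
        """Apply a user-defined operator, optionally in parallel."""
        parent = self._source

        def gen() -> Iterator[Any]:
            if num_workers <= 1:
                for x in parent():
                    yield fn(x)
            else:
                with ThreadPoolExecutor(max_workers=num_workers) as pool:
                    # Executor.map yields results in input order, so the
                    # parallel stage stays order-preserving. It submits the
                    # whole input eagerly, which is fine for a small sketch.
                    yield from pool.map(fn, parent())

        return Dataset(gen)

    def batch(self, size: int) -> "Dataset":
        """Group consecutive elements into lists of at most `size` items."""
        parent = self._source

        def gen() -> Iterator[List[Any]]:
            buf: List[Any] = []
            for x in parent():
                buf.append(x)
                if len(buf) == size:
                    yield buf
                    buf = []
            if buf:  # emit the final, possibly shorter batch
                yield buf

        return Dataset(gen)

    def __iter__(self) -> Iterator[Any]:
        return iter(self._source())


def slow_square(x: int) -> int:
    # Custom operator with uneven latency, so workers finish out of order.
    time.sleep(0.01 * (5 - x % 5))
    return x * x


pipeline = Dataset.from_list(list(range(10))).map(slow_square, num_workers=4).batch(4)
batches = list(pipeline)
print(batches)
```

The pipeline is only a description until it is iterated, which is what lets a real system optimize or parallelize it before execution. A production data module would additionally bound the number of in-flight elements and use processes or native threads for CPU-heavy operators; this sketch omits both.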

Further Reading

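For an example implementation of pipeline-level parallelism, see the PyTorch DataLoader.
For an example implementation of operator-level parallelism, see MindData.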
