
Chapter Summary

  1. Model deployment is constrained by factors such as model size, runtime memory usage, inference latency, and inference power consumption.

  2. Models can be compressed in the offline phase using techniques such as quantization, pruning, and knowledge distillation. Some model optimization techniques, such as operator fusion, also reduce the model size, albeit to a lesser degree.

  3. Runtime memory usage can be reduced by optimizing the model size, the deployment framework size, and the temporary memory used at runtime. Methods for reducing the model size were summarized above. Keeping the framework code lean and modular shrinks the deployment framework, and memory pooling enables buffer reuse (memory overcommitment) to cut temporary memory usage at runtime.

  4. Model inference latency can be optimized from two directions. In the offline phase, the computational workload can be reduced through model optimization and compression. At runtime, increasing inference parallelism and optimizing operator implementations maximize the utilization of the available computing power. Beyond workload and compute, the load/store (memory access) overhead during inference must also be considered.

  5. Power consumption during inference can be reduced through offline model optimization and compression. By shrinking the computational workload, these techniques lower the power drawn during inference, so they largely overlap with the methods for optimizing inference latency.

  6. In addition to optimizing the factors above, this chapter also discussed deployment security technologies such as model obfuscation and model encryption. Secure deployment protects the model assets of enterprises and prevents attackers from compromising the deployment environment by tampering with models.
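To make the quantization idea from point 2 concrete, the following is a minimal sketch of symmetric per-tensor int8 post-training quantization; the function names and the choice of NumPy are illustrative, not taken from any specific deployment framework:

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights to int8 in [-127, 127] with one scale factor
    (symmetric per-tensor quantization)."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 values."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# int8 storage is 4x smaller than float32; the per-element
# rounding error is bounded by scale / 2
```

Real frameworks typically refine this with per-channel scales and calibration data, but the size reduction (4x for int8 versus float32) comes from exactly this change of representation.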
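The memory pooling mentioned in point 3 can be sketched as a free-list keyed by buffer size: released buffers are kept and handed back to later requests of the same size instead of being freed and reallocated. The class and method names below are hypothetical, chosen only for illustration:

```python
class MemoryPool:
    """Toy memory pool: reuse released buffers of a matching size
    rather than allocating fresh ones each time."""

    def __init__(self):
        self.free = {}  # buffer size -> list of reusable buffers

    def acquire(self, size):
        bucket = self.free.get(size)
        if bucket:
            return bucket.pop()   # reuse: no new allocation
        return bytearray(size)    # fall back to a fresh allocation

    def release(self, buf):
        # return the buffer to the pool for later reuse
        self.free.setdefault(len(buf), []).append(buf)

pool = MemoryPool()
a = pool.acquire(1024)
pool.release(a)
b = pool.acquire(1024)  # b is the same buffer object as a
```

Because inference executes the same graph repeatedly, the same set of temporary buffer sizes recurs on every run, which is why such pooling sharply reduces peak allocation traffic at runtime.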
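Operator fusion (points 2 and 4) merges adjacent operators so that intermediate results are not written out and read back between them. The sketch below only illustrates the idea at the expression level; real compilers fuse at the loop or kernel level so the intermediate truly stays in registers or cache:

```python
import numpy as np

def unfused(x, w, b):
    t1 = x @ w                 # matmul materializes an intermediate
    t2 = t1 + b                # bias add: another read/write pass
    return np.maximum(t2, 0.0) # ReLU: a third pass over memory

def fused(x, w, b):
    # conceptually one pass: matmul + bias + ReLU evaluated together,
    # eliminating the stored intermediates (and their load/store cost)
    return np.maximum(x @ w + b, 0.0)
```

Both functions compute the same result; the saving is in memory traffic, which is precisely the load/store overhead point 4 warns about.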