## Summary

In this chapter, we explored how to design and implement the data preprocessing module of a machine learning system along three dimensions: usability, efficiency, and order preservation.

On the usability dimension, we focused on the data module's programming model. Drawing on the design experience of successful parallel data processing systems, we concluded that a programming abstraction based on describing dataset transformations is well suited as the programming model for data modules. A concrete system must not only provide a sufficient set of built-in operators on top of this model to make users' data preprocessing convenient, but also consider how to let users plug in custom operators easily.

On the efficiency dimension, we introduced specialized file format design and parallel computation architectures from the perspectives of data loading and computation, respectively. We also applied the computation graph compilation optimizations learned in previous chapters to users' data preprocessing graphs, further raising data processing throughput.

In machine learning scenarios, models are sensitive to the order of their input data, which gives rise to the special requirement of order preservation. We analyzed this property and demonstrated how real systems enforce it, using the ordering constraint implemented by MindSpore's Connector as an example. Finally, for cases where single-machine CPU preprocessing performance is insufficient, we introduced the current vertical scaling approach based on heterogeneous processing acceleration and the horizontal scaling approach based on distributed data preprocessing.
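The transformation-description abstraction discussed above can be sketched in a few lines. The `Dataset` class below is a hypothetical, minimal illustration (not the API of any real framework): each call to `map` or `batch` only records a transformation, and computation happens lazily when the pipeline is iterated.

```python
class Dataset:
    """A minimal lazy dataset-transformation pipeline (illustrative sketch)."""

    def __init__(self, gen_fn):
        # gen_fn: zero-argument callable that returns a fresh iterator.
        self._gen_fn = gen_fn

    @classmethod
    def from_list(cls, data):
        return cls(lambda: iter(data))

    def map(self, fn):
        # Describe, don't execute: wrap the upstream generator in a new one.
        src = self._gen_fn
        return Dataset(lambda: (fn(x) for x in src()))

    def batch(self, size):
        src = self._gen_fn

        def gen():
            batch = []
            for x in src():
                batch.append(x)
                if len(batch) == size:
                    yield batch
                    batch = []
            if batch:  # emit a final, possibly smaller, batch
                yield batch

        return Dataset(gen)

    def __iter__(self):
        # Only here does any computation actually run.
        return self._gen_fn()


# Usage: nothing runs until list() iterates the pipeline.
ds = Dataset.from_list(range(6)).map(lambda x: x * 2).batch(2)
print(list(ds))  # [[0, 2], [4, 6], [8, 10]]
```

Because each method returns a new `Dataset` describing the whole pipeline, a system built this way is free to reorder, fuse, or parallelize the recorded transformations before execution, which is exactly what makes this abstraction amenable to the graph-compilation optimizations mentioned above.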
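The order-preservation constraint can likewise be illustrated with a small sketch. The function below is a hypothetical simplification inspired by the round-robin idea behind MindSpore's Connector, not its actual implementation: each worker writes results into its own queue, and the consumer pops from the queues in round-robin order, so output order matches input order even when workers finish out of turn.

```python
import queue
import threading


def ordered_parallel_map(fn, items, num_workers=4):
    """Apply fn to items with several workers while preserving input order.

    Worker w handles items w, w + num_workers, w + 2 * num_workers, ...
    and pushes results into its own queue; the consumer reads the queues
    round-robin, blocking on each in turn, so element i is always emitted
    before element i + 1 regardless of which worker finished first.
    """
    out_queues = [queue.Queue() for _ in range(num_workers)]

    def worker(wid):
        for idx in range(wid, len(items), num_workers):
            out_queues[wid].put(fn(items[idx]))
        out_queues[wid].put(None)  # end-of-stream marker for this worker

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()

    results, wid = [], 0
    finished = [False] * num_workers
    while not all(finished):
        if finished[wid]:
            wid = (wid + 1) % num_workers
            continue
        item = out_queues[wid].get()  # blocks until worker wid's next result
        if item is None:
            finished[wid] = True
        else:
            results.append(item)
        wid = (wid + 1) % num_workers

    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(ordered_parallel_map(lambda x: x * x, list(range(8)), num_workers=3))
    # [0, 1, 4, 9, 16, 25, 36, 49]
```

The blocking round-robin read is what turns order preservation from a property to verify into a structural invariant: fast workers simply wait in their queues until it is their turn to be consumed.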
We believe that after studying this chapter, readers will have a deep understanding of data modules in machine learning systems and an awareness of the challenges these modules will face in the future.

## Further Reading

- For an example of pipeline-level parallelism, we recommend reading [PyTorch DataLoader](https://github.com/pytorch/pytorch/tree/master/torch/utils/data).
- For an example of operator-level parallelism, we recommend reading [MindData](https://gitee.com/mindspore/mindspore/tree/master/mindspore/ccsrc/minddata).