Usability Design

In this section, we focus on how to design a user-friendly data module for machine learning systems. As mentioned earlier, usability requires the data module to provide good programming abstractions and interfaces so that users can conveniently construct data processing pipelines, while also supporting users in flexibly registering and using custom operators within the data pipeline to meet diverse and specialized requirements. We will explore this topic from two aspects: programming interface abstraction and custom operator registration mechanisms.

Programming Abstraction and Interfaces

In :numref:image_process_pipeline, we present a classic data preprocessing pipeline for training an image classification model. After loading the dataset from storage devices, we perform a series of operations on the image data, including decoding, resizing, rotation, normalization, and channel transposition. We also apply specific preprocessing operations to the dataset labels, and finally send the processed data to the accelerator chip for model computation. The programming abstractions provided by the data module should be high-level enough that users can describe this data processing logic in a few lines of code, without getting bogged down in repetitive implementation details. At the same time, the abstractions must be general enough to meet diverse data preprocessing requirements. Later in this section, a code snippet that implements the pipeline in :numref:image_process_pipeline with MindSpore's data module interfaces shows how much a well-designed programming abstraction reduces the user's programming burden.

:width:800px :label:image_process_pipeline

Programming abstractions for data computation have long been studied in the field of general-purpose data-parallel computing systems, and a broad consensus has been reached: provide LINQ-style :cite:meijer2006linq programming abstractions. Their key characteristic is that users focus on describing how datasets are created and transformed, while the efficient implementation and scheduling of these operations is delegated to the data system's runtime. Systems such as Naiad :cite:murray2013naiad, Spark :cite:zaharia2010spark, and DryadLINQ :cite:fetterly2009dryadlinq have all adopted this programming model. We use Spark as an example for a brief introduction.

Spark provides users with a programming model based on the concept of Resilient Distributed Datasets (RDD). An RDD is a read-only distributed data collection, and users describe the creation and transformation of RDDs through Spark's programming interfaces. The following code counts the lines in a log file that contain the string "ERROR". We first create a distributed dataset file by reading from a file (an RDD represents a collection of data; here file is a collection of log lines). We apply a filter operation to file to obtain a new dataset errs that retains only the log lines containing "ERROR". We then apply a map operation that turns each element of errs into the number 1, producing the dataset ones. Finally, a reduce operation sums the elements of ones, which gives the desired result: the number of log lines in file that contain "ERROR".

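```scala
// file is an RDD whose elements are the lines of the log file
val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
// map every remaining line to 1, then sum the ones
val ones = errs.map(_ => 1)
val count = ones.reduce(_+_)
```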

Four lines of code are enough to count the lines that contain a given string in a distributed dataset. This is made possible by Spark's core RDD programming abstraction. The computation flow in :numref:rdd_transformation_example also shows that, after creating the dataset, users only need to describe the operators applied to it, while the implementation and execution of those operators are handled by the system's runtime.
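The split between describing operators and executing them can be illustrated with a small single-machine sketch in Python. The `LazyDataset` class, its method names, and the sample log lines are all hypothetical; nothing here is distributed or taken from Spark's implementation. The sketch only assumes that transformations are recorded first and run when an action such as `reduce` asks for a result.

```python
import functools
from typing import Any, Callable, Iterable, Tuple


class LazyDataset:
    """Hypothetical RDD-like dataset: transformations are recorded, not run."""

    def __init__(self, source: Iterable[Any],
                 plan: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source
        self._plan = plan  # ordered (kind, function) pairs describing the pipeline

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Only extends the plan; no element is touched yet.
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def _execute(self):
        # The "runtime": the only place where user functions are invoked.
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        # An action: triggers execution of the recorded plan.
        return functools.reduce(fn, self._execute())


# Made-up log lines standing in for the contents of the log file.
log_lines = [
    "INFO  service started",
    "ERROR disk full",
    "WARN  retrying request",
    "ERROR connection lost",
]

evaluated = []  # records when the predicate actually runs


def has_error(line: str) -> bool:
    evaluated.append(line)
    return "ERROR" in line


file = LazyDataset(log_lines)
errs = file.filter(has_error)
ones = errs.map(lambda _line: 1)
evaluated_before_action = len(evaluated)  # nothing has executed so far
count = ones.reduce(lambda a, b: a + b)   # execution happens here
```

Because the user code only builds a plan, a real runtime is free to partition the data, fuse the filter and map steps, or reschedule failed work without any change to the four pipeline lines.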

:width:800px :label:rdd_transformation_example

The data modules in mainstream machine learning systems have also adopted similar programming abstractions, such as TensorFlow's data module tf.data :cite:murray2021tf and MindSpore's data module MindData. Next, we will use MindData's interface design as an example to introduce how to design good programming abstractions for the machine learning scenario to help users conveniently construct the diverse data processing pipelines needed in model training.

MindData is the data module of the machine learning system MindSpore and is primarily responsible for data preprocessing during model training. The core programming abstraction that MindData provides to users is based on Dataset transformations. A Dataset follows the data frame concept: it is a multi-row, multi-column relational table in which each column has a name.

:width:800px :label:mindspore dataset example

Based on this programming model, and on the key processing steps in the machine learning data workflow introduced in the first section, MindData provides dataset operators such as shuffle, map, and batch. Each operator takes a Dataset as input and produces a new, transformed Dataset as output. Typical dataset transformation interfaces are listed below:

:Dataset operation interfaces supported by MindSpore

| Dataset Operation | Description |
| --- | --- |
| batch | Groups multiple data rows in the dataset into a mini-batch |
| map | Applies transformation operations to each data row in the dataset |
| shuffle | Randomly shuffles the order of data rows in the dataset |
| filter | Filters data rows in the dataset, retaining only rows that pass the filter condition |
| prefetch | Prefetches data from the storage medium |
| project | Selects certain columns from the Dataset table for subsequent processing |
| zip | Merges multiple datasets into one dataset |
| repeat | In multi-epoch training, repeats the entire data pipeline multiple times |
| create_dict_iterator | Creates an iterator that returns dictionary-type data for the dataset |
| ... | ... |
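To give a feel for how these interfaces compose, the following is a minimal in-memory sketch of a column-named dataset in Python. `ToyDataset` is a hypothetical stand-in, not MindData: its method signatures are simplified for illustration (for example, `map` here takes a single function and a single column), it is eager, and it omits prefetch and zip.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]  # one data row: column name -> value


class ToyDataset:
    """Hypothetical in-memory stand-in for a column-named Dataset."""

    def __init__(self, rows: Iterable[Row]):
        self._rows = list(rows)

    def map(self, fn: Callable, input_column: str) -> "ToyDataset":
        # Apply fn to one column of every row, leaving other columns unchanged.
        return ToyDataset({**r, input_column: fn(r[input_column])} for r in self._rows)

    def filter(self, pred: Callable[[Row], bool]) -> "ToyDataset":
        return ToyDataset(r for r in self._rows if pred(r))

    def shuffle(self, seed: int = 0) -> "ToyDataset":
        rows = list(self._rows)
        random.Random(seed).shuffle(rows)
        return ToyDataset(rows)

    def project(self, columns: List[str]) -> "ToyDataset":
        return ToyDataset({c: r[c] for c in columns} for r in self._rows)

    def repeat(self, count: int) -> "ToyDataset":
        return ToyDataset(self._rows * count)

    def batch(self, batch_size: int) -> "ToyDataset":
        # Group consecutive rows; each column becomes a list of values.
        batches = []
        for i in range(0, len(self._rows), batch_size):
            chunk = self._rows[i:i + batch_size]
            batches.append({c: [r[c] for r in chunk] for c in chunk[0]})
        return ToyDataset(batches)

    def create_dict_iterator(self) -> Iterator[Row]:
        return iter(self._rows)


rows = [{"image": i, "label": i % 2, "path": f"img_{i}.png"} for i in range(6)]
pipeline = (
    ToyDataset(rows)
    .project(["image", "label"])                   # drop the "path" column
    .filter(lambda r: r["image"] != 5)             # discard one row
    .map(lambda x: x * 10, input_column="image")   # transform one column
    .batch(2)                                      # group rows into mini-batches
)
batches = list(pipeline.create_dict_iterator())
```

Every operation returns a new dataset, so calls chain naturally and the pipeline reads in the same order in which the data flows.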

The interfaces above define how datasets are transformed, while the concrete computation applied to the data is defined by data operator functions. For user convenience, MindData ships with built-in operator libraries for the data types and processing needs that are common in machine learning. For the vision domain, MindData provides common operators such as Decode, Resize, RandomRotation, Normalize, and HWC2CHW (channel transposition); for the text domain, it provides operators such as Ngram, NormalizeUTF8, and BertTokenizer; for the audio domain, it provides operators such as TimeMasking, LowpassBiquad, and ComplexNorm. These commonly used operators cover the vast majority of user requirements.

In addition to flexible Dataset transformations, MindData provides flexible Dataset creation to cope with the large number of dataset types and their varying formats and organizations. There are three main categories:

- Creating from built-in datasets: MindData has a rich set of built-in classic datasets, such as CelebADataset, Cifar10Dataset, CocoDataset, ImageFolderDataset, MnistDataset, and VOCDataset. Users who need one of these common datasets can load it with a single line of code. MindData also provides efficient loading implementations for these datasets so that users get good read performance.
- Loading from MindRecord: MindRecord is a high-performance, general-purpose data storage file format designed for MindData. Users can convert their datasets to MindRecord and then read them efficiently through the corresponding MindSpore APIs.
- Creating from a Python class: Users who already have a Python class that reads their dataset can pass it to MindData's GeneratorDataset interface to create a Dataset, which gives them great flexibility.
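As an illustration of the third category, the sketch below shows the kind of Python class a user might already have: a random-access source that exposes `__len__` and `__getitem__`. The class name and its contents are hypothetical. Assuming GeneratorDataset accepts such an object together with a list of column names, it could then be wrapped with something like `ds.GeneratorDataset(source, column_names=["x", "y"])`; the exact arguments and accepted element types should be checked against the MindSpore documentation.

```python
class SquaresSource:
    """Hypothetical user-defined dataset: sample i is the pair (i, i * i)."""

    def __init__(self, size: int):
        self._size = size

    def __len__(self) -> int:
        return self._size

    def __getitem__(self, index: int):
        if not 0 <= index < self._size:
            raise IndexError(index)
        return index, index * index  # one value per column


source = SquaresSource(4)
samples = [source[i] for i in range(len(source))]
```

In practice each returned field would typically be a numpy.ndarray (for example a decoded image and its label) rather than a plain integer.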

Finally, we implement the data processing pipeline described at the beginning of this section with MindData to demonstrate how user-friendly the Dataset-centric programming abstraction is. About 10 lines of code are enough to express the desired processing. Throughout, we only describe the logic and leave operator implementation and execution scheduling to the data module, which greatly reduces the user's programming burden.

```python
# Module paths below assume a recent MindSpore release (2.x).
import mindspore.dataset as ds
import mindspore.dataset.transforms as transforms
import mindspore.dataset.vision as vision

dataset_dir = "path/to/imagefolder_directory"
num_classes = 10  # assumed value; set it to the number of class folders in dataset_dir

# Create a dataset that reads all files in dataset_dir with 8 parallel workers
dataset = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8)

# Transformations applied, in order, to the image data
transforms_list = [vision.Decode(),
                   vision.Resize((256, 256)),
                   vision.RandomRotation((0, 15)),
                   vision.Normalize((100.0, 115.0, 121.0), (71.0, 68.0, 70.0)),
                   vision.HWC2CHW()]
# One-hot encoding applied to the labels
onehot_op = transforms.OneHot(num_classes)

# Apply the transformations to the dataset through dataset.map()
dataset = dataset.map(input_columns="image", operations=transforms_list)
dataset = dataset.map(input_columns="label", operations=onehot_op)
```
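To make clearer what the last steps of this pipeline compute, here is a NumPy sketch of per-channel normalization, channel transposition, and one-hot encoding. The helper functions are hypothetical and are not MindData's implementations; they only mirror the arithmetic implied by the operator names, and the dummy image stands in for a decoded and resized input.

```python
import numpy as np


def normalize(image_hwc: np.ndarray, mean, std) -> np.ndarray:
    # Per-channel (x - mean) / std; mean and std broadcast over the last axis.
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (image_hwc.astype(np.float32) - mean) / std


def hwc_to_chw(image_hwc: np.ndarray) -> np.ndarray:
    # Move the channel axis first: (H, W, C) -> (C, H, W).
    return np.transpose(image_hwc, (2, 0, 1))


def one_hot(label: int, num_classes: int) -> np.ndarray:
    encoded = np.zeros(num_classes, dtype=np.float32)
    encoded[label] = 1.0
    return encoded


# Dummy decoded and resized image: every pixel has value 171 in all channels.
image = np.full((256, 256, 3), 171, dtype=np.uint8)
out = hwc_to_chw(normalize(image, (100.0, 115.0, 121.0), (71.0, 68.0, 70.0)))
label = one_hot(3, num_classes=10)
```

The channel-first layout produced by the transposition is the one that many accelerator kernels expect, which is why it appears as the last image operator in the pipeline.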

Custom Operator Support

The dataset transformation-based programming abstraction, together with the rich set of built-in operators for the data types used in machine learning, covers the vast majority of user data processing needs. However, the machine learning field evolves rapidly and new data processing requirements keep emerging, so users will sometimes need a transformation operator that the data module does not provide. The data module therefore needs a well-designed registration mechanism for user-defined operators, so that users can conveniently use custom operators when constructing data processing pipelines.

In machine learning scenarios, Python is the primary programming language for users, so we can assume that user-defined operators are most often Python functions or Python classes. How difficult it is to support custom operators depends mainly on how the data module schedules computation. For example, PyTorch's DataLoader schedules computation mainly at the Python level, and thanks to Python's flexibility, inserting custom operators into its data pipeline is relatively straightforward. In contrast, systems such as TensorFlow's tf.data and MindSpore's MindData schedule computation mainly at the C++ level, which makes it more challenging to insert user-defined Python operators into the data flow. Next, we use MindData's custom operator registration and usage as an example to discuss this topic in detail.
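The following sketch shows why Python-level scheduling makes custom operators easy to insert. It is a hypothetical map-style dataset in the spirit of that design, not PyTorch's actual implementation: the transforms are ordinary callables applied in a plain Python loop, so a user-defined function needs no registration step.

```python
from typing import Callable, List, Sequence


class PythonPipelineDataset:
    """Hypothetical map-style dataset whose transforms run in plain Python."""

    def __init__(self, samples: Sequence, transforms: List[Callable]):
        self._samples = samples
        self._transforms = transforms

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, index: int):
        sample = self._samples[index]
        for transform in self._transforms:  # scheduling is an ordinary Python loop
            sample = transform(sample)
        return sample


def clip_to_unit(x: float) -> float:
    # A user-defined operator: any Python callable can be dropped into the list.
    return max(0.0, min(1.0, x))


dataset = PythonPipelineDataset([-0.5, 0.25, 3.0], [lambda x: x * 2, clip_to_unit])
processed = [dataset[i] for i in range(len(dataset))]
```

When the scheduler lives in C++ instead, the same callable has to be invoked across the language boundary, which is the difficulty the rest of this section addresses.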

:width:800px :label:mindspore operator example

Data preprocessing operators in MindData can be divided into C-level operators and Python-level operators. C-level operators provide higher execution performance, while Python-level operators can conveniently leverage rich third-party Python packages for development. To flexibly cover more scenarios, MindData supports users in developing custom operators using Python. If users pursue higher performance, MindData also supports users in compiling their C-level operators and registering them as plugins in MindSpore's data processing pipeline.

Custom Python operators are passed into dataset transformation operators such as map and filter. After the pipeline starts, MindData executes them through a Python runtime that it creates. Custom Python operators must accept and return data of the numpy.ndarray type. During execution, when MindData's Pipeline encounters a user-defined PyFunc operator in a dataset transformation, it passes the input data to the PyFunc as numpy.ndarray, and after the custom operator finishes, the result is returned to MindData as numpy.ndarray. Throughout this process, the executing dataset transformation operator (such as map or filter) is responsible for the PyFunc's runtime lifecycle and exception handling.

For user-defined C operators, the dataset-plugin repository :cite:minddata serves as MindData's operator plugin repository. It contains operators tailored to specific domains (remote sensing, medical imaging, meteorology, etc.), carries MindData's plugin capability extensions, and provides a convenient entry point for writing new MindData operators. Users write an operator, compile and install the plugin, and then use the new operator in the map operations of the MindData Pipeline.
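The contract for Python operators described above can be sketched as a small wrapper. `run_pyfunc` and `PyFuncError` are hypothetical names used only for illustration and do not correspond to MindData internals; the sketch assumes that the calling operator converts inputs to numpy.ndarray, checks the output type, and owns exception handling.

```python
import numpy as np


class PyFuncError(RuntimeError):
    """Raised by the (hypothetical) map operator when a user operator misbehaves."""


def run_pyfunc(user_fn, column_value):
    # Hand the data to the user operator as numpy.ndarray.
    array_in = np.asarray(column_value)
    try:
        result = user_fn(array_in)
    except Exception as exc:  # the calling operator owns exception handling
        name = getattr(user_fn, "__name__", "pyfunc")
        raise PyFuncError(f"{name} failed: {exc}") from exc
    # Enforce the output side of the contract on the way back.
    if not isinstance(result, np.ndarray):
        raise PyFuncError(f"expected numpy.ndarray output, got {type(result).__name__}")
    return result


def flip_horizontal(image: np.ndarray) -> np.ndarray:
    # A well-behaved user operator: ndarray in, ndarray out.
    return image[:, ::-1]


def bad_operator(image: np.ndarray):
    return image.tolist()  # violates the output contract


flipped = run_pyfunc(flip_horizontal, [[1, 2, 3], [4, 5, 6]])

try:
    run_pyfunc(bad_operator, [[1, 2]])
    contract_error = None
except PyFuncError as err:
    contract_error = str(err)
```

Fixing a single exchange type keeps the boundary between the pipeline and user code narrow: the pipeline never needs to understand arbitrary Python objects, and a violation is reported at the operator that caused it.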

:width:800px :label:mindspore_user_defined_operator
