feat: add v1/v2 versioning with language selector (#494)

* feat: add v1/v2 versioning and language selector for mdbook

- Copy current content to v1/ directory (1st Edition)
- Create v2/ directory with new TOC structure (2nd Edition) and placeholder chapters
- Add version selector (V1/V2) and language toggle (EN/ZH) in top-right nav bar
- Add build scripts: build_mdbook_v1.sh, build_mdbook_v2.sh
- Update assemble_docs_publish_tree.py to support v1/v2 deployment layout
- Fix mdbook preprocessor to use 'sections' key (v0.4.43 compatibility)
- Update .gitignore for new build artifact directories
- Deployment layout: / = v2 EN, /cn/ = v2 ZH, /v1/ = v1 EN, /v1/cn/ = v1 ZH

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* build: update CI to build and verify all four books (v1/v2 x EN/ZH)

- Clarify step names: "Build v2 (EN + ZH)" and "Build v1 (EN + ZH)"
- Add verification step to check all four index.html outputs exist
- Deploy workflow assembles: / = v2 EN, /cn/ = v2 ZH, /v1/ = v1 EN, /v1/cn/ = v1 ZH

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: gracefully skip missing TOC entries instead of crashing

resolve_toc_target() now returns None for missing files instead of
raising FileNotFoundError. This fixes the v1 EN build, where chapter
index files reference TOC entry names that do not match the actual filenames.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Yeqi Huang authored 2026-03-12 13:37:42 +00:00, committed by GitHub
parent 00db02dbfd, commit d953030747
274 changed files with 24198 additions and 29 deletions


@@ -31,8 +31,18 @@ jobs:
           python3 -m unittest discover -s tests -p 'test_ensure_book_resources.py'
           python3 -m unittest discover -s tests -p 'test_update_docs_workflow.py'
-      - name: Build English HTML with mdBook
-        run: bash build_mdbook.sh
+      - name: Build v2 (EN + ZH) with mdBook
+        run: bash build_mdbook_v2.sh
-      - name: Build Chinese HTML with mdBook
-        run: bash build_mdbook_zh.sh
+      - name: Build v1 (EN + ZH) with mdBook
+        run: bash build_mdbook_v1.sh
+      - name: Verify build outputs
+        run: |
+          for d in .mdbook-v2/book .mdbook-v2-zh/book .mdbook-v1/book .mdbook-v1-zh/book; do
+            if [ ! -f "$d/index.html" ]; then
+              echo "ERROR: $d/index.html not found"
+              exit 1
+            fi
+            echo "OK: $d/index.html exists"
+          done


@@ -29,13 +29,23 @@ jobs:
           python3 -m unittest discover -s tests -p 'test_assemble_docs_publish_tree.py'
           python3 -m unittest discover -s tests -p 'test_ensure_book_resources.py'
-      - name: Build English HTML with mdBook
-        run: bash build_mdbook.sh
+      - name: Build v2 (EN + ZH) with mdBook
+        run: bash build_mdbook_v2.sh
-      - name: Build Chinese HTML with mdBook
-        run: bash build_mdbook_zh.sh
+      - name: Build v1 (EN + ZH) with mdBook
+        run: bash build_mdbook_v1.sh
-      - name: Deploy to openmlsys.github.io
+      - name: Verify build outputs
+        run: |
+          for d in .mdbook-v2/book .mdbook-v2-zh/book .mdbook-v1/book .mdbook-v1-zh/book; do
+            if [ ! -f "$d/index.html" ]; then
+              echo "ERROR: $d/index.html not found"
+              exit 1
+            fi
+            echo "OK: $d/index.html exists"
+          done
+      - name: Assemble and deploy to openmlsys.github.io
         env:
           DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
         run: |
@@ -44,8 +54,10 @@ jobs:
           python3 tools/assemble_docs_publish_tree.py \
             --destination-root openmlsys.github.io \
             --docs-subdir docs \
-            --en-source .mdbook/book \
-            --zh-source .mdbook-zh/book
+            --v2-en-source .mdbook-v2/book \
+            --v2-zh-source .mdbook-v2-zh/book \
+            --v1-en-source .mdbook-v1/book \
+            --v1-zh-source .mdbook-v1-zh/book
           cd openmlsys.github.io
           git config user.name "github-actions[bot]"

.gitignore

@@ -16,6 +16,10 @@ env
 .mdbook-zh/
 .mdbook-zh-test/
 .mdbook-bin/
+.mdbook-v1/
+.mdbook-v1-zh/
+.mdbook-v2/
+.mdbook-v2-zh/
 task_plan.md
 findings.md
 progress.md
@@ -29,3 +33,19 @@ zh_chapters/img
 zh_chapters/references
 zh_chapters/static
 zh_chapters/mlsys.bib
+v1/en_chapters/img
+v1/en_chapters/references
+v1/en_chapters/static
+v1/en_chapters/mlsys.bib
+v1/zh_chapters/img
+v1/zh_chapters/references
+v1/zh_chapters/static
+v1/zh_chapters/mlsys.bib
+v2/en_chapters/img
+v2/en_chapters/references
+v2/en_chapters/static
+v2/en_chapters/mlsys.bib
+v2/zh_chapters/img
+v2/zh_chapters/references
+v2/zh_chapters/static
+v2/zh_chapters/mlsys.bib


@@ -15,4 +15,5 @@ command = "python3 tools/mdbook_preprocessor.py"
 mathjax-support = true
 git-repository-url = "https://github.com/openmlsys/openmlsys-zh"
 preferred-dark-theme = "navy"
-additional-css = ["theme/dark-mode-images.css"]
+additional-css = ["theme/dark-mode-images.css", "theme/version-selector.css"]
+additional-js = ["theme/version-selector.js"]


@@ -15,4 +15,5 @@ command = "python3 ../../tools/mdbook_zh_preprocessor.py"
 mathjax-support = true
 git-repository-url = "https://github.com/openmlsys/openmlsys-zh"
 preferred-dark-theme = "navy"
-additional-css = ["theme/dark-mode-images.css"]
+additional-css = ["theme/dark-mode-images.css", "theme/version-selector.css"]
+additional-js = ["theme/version-selector.js"]


@@ -0,0 +1,48 @@
/* Version and Language selectors — inline in .right-buttons */
.openmlsys-nav-selectors {
display: inline-flex;
align-items: center;
gap: 4px;
margin-right: 4px;
vertical-align: middle;
}
/* Shared style for all selector links/buttons */
.openmlsys-selector-link {
display: inline-flex;
align-items: center;
justify-content: center;
min-width: 32px;
height: 28px;
padding: 0 8px;
border-radius: 4px;
border: 1px solid transparent;
color: var(--icons, #747474);
font-size: 12px;
font-weight: 600;
text-decoration: none;
cursor: pointer;
line-height: 1;
transition: color 0.1s, background 0.1s;
}
.openmlsys-selector-link:hover {
color: var(--icons-hover, #333);
background: var(--theme-hover, rgba(0, 0, 0, 0.05));
}
/* Active/current indicator */
.openmlsys-selector-link.active {
color: var(--links, #4183c4);
border-color: var(--links, #4183c4);
font-weight: 700;
}
/* Separator between version and language groups */
.openmlsys-selector-sep {
width: 1px;
height: 18px;
background: var(--icons, #747474);
opacity: 0.3;
margin: 0 2px;
}


@@ -0,0 +1,74 @@
// Version and Language selector for OpenMLSys mdbook
(function () {
"use strict";
var path = window.location.pathname;
// Detect current version and language from URL
var currentVersion = "v2";
var currentLang = "en";
if (path.match(/\/v1(\/|$)/)) {
currentVersion = "v1";
}
if (path.match(/\/cn(\/|$)/)) {
currentLang = "zh";
}
// Build base paths
function basePath(version, lang) {
var docsRoot = path.replace(/\/v1\/.*/, "/").replace(/\/cn\/.*/, "/");
docsRoot = docsRoot.replace(/(\/docs\/?).*/, "/docs/");
if (!docsRoot.endsWith("/")) docsRoot += "/";
var p = docsRoot;
if (version === "v1") p += "v1/";
if (lang === "zh") p += "cn/";
return p;
}
var container = document.createElement("span");
container.className = "openmlsys-nav-selectors";
// --- Version links: V1 | V2 ---
var versions = [
{ label: "V1", value: "v1" },
{ label: "V2", value: "v2" },
];
versions.forEach(function (v) {
var a = document.createElement("a");
a.className = "openmlsys-selector-link";
a.textContent = v.label;
a.href = basePath(v.value, currentLang);
if (v.value === currentVersion) a.classList.add("active");
container.appendChild(a);
});
// Separator
var sep = document.createElement("span");
sep.className = "openmlsys-selector-sep";
container.appendChild(sep);
// --- Language toggle: single button that switches to the other language ---
var otherLang = currentLang === "zh" ? "en" : "zh";
var langLink = document.createElement("a");
langLink.className = "openmlsys-selector-link";
langLink.textContent = currentLang === "zh" ? "EN" : "ZH";
langLink.href = basePath(currentVersion, otherLang);
container.appendChild(langLink);
// Insert into .right-buttons, before existing icons
function insertSelector() {
var rightButtons = document.querySelector(".right-buttons");
if (rightButtons) {
rightButtons.insertBefore(container, rightButtons.firstChild);
}
}
if (document.readyState === "loading") {
document.addEventListener("DOMContentLoaded", insertSelector);
} else {
insertSelector();
}
})();

build_mdbook_v1.sh (new executable file)

@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PYTHON_BIN="$(command -v python3 || command -v python || true)"
if [[ -z "${PYTHON_BIN}" ]]; then
echo "Python is required to prepare the mdBook staging tree." >&2
exit 1
fi
if ! command -v mdbook >/dev/null 2>&1; then
echo "mdbook is not installed. Install it first, for example with: cargo install mdbook" >&2
exit 1
fi
# ── English v1 ────────────────────────────────────────────────────────────────
"${PYTHON_BIN}" "${ROOT}/tools/ensure_book_resources.py" --chapter-dir "${ROOT}/v1/en_chapters"
"${PYTHON_BIN}" "${ROOT}/tools/prepare_mdbook.py" \
--source "${ROOT}/v1/en_chapters" \
--summary-output "${ROOT}/v1/en_chapters/SUMMARY.md" \
--placeholder-prefix "[TODO: src = zh_chapters/"
mdbook build "${ROOT}/v1"
# ── Chinese v1 ────────────────────────────────────────────────────────────────
"${PYTHON_BIN}" "${ROOT}/tools/ensure_book_resources.py" --chapter-dir "${ROOT}/v1/zh_chapters"
"${PYTHON_BIN}" "${ROOT}/tools/prepare_mdbook_zh.py" \
--source "${ROOT}/v1/zh_chapters" \
--summary-output "${ROOT}/v1/zh_chapters/SUMMARY.md"
mdbook build "${ROOT}/v1/books/zh"

build_mdbook_v2.sh (new executable file)

@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PYTHON_BIN="$(command -v python3 || command -v python || true)"
if [[ -z "${PYTHON_BIN}" ]]; then
echo "Python is required to prepare the mdBook staging tree." >&2
exit 1
fi
if ! command -v mdbook >/dev/null 2>&1; then
echo "mdbook is not installed. Install it first, for example with: cargo install mdbook" >&2
exit 1
fi
# ── English v2 ────────────────────────────────────────────────────────────────
"${PYTHON_BIN}" "${ROOT}/tools/ensure_book_resources.py" --chapter-dir "${ROOT}/v2/en_chapters"
"${PYTHON_BIN}" "${ROOT}/tools/prepare_mdbook.py" \
--source "${ROOT}/v2/en_chapters" \
--summary-output "${ROOT}/v2/en_chapters/SUMMARY.md" \
--placeholder-prefix "[TODO: src = zh_chapters/"
mdbook build "${ROOT}/v2"
# ── Chinese v2 ────────────────────────────────────────────────────────────────
"${PYTHON_BIN}" "${ROOT}/tools/ensure_book_resources.py" --chapter-dir "${ROOT}/v2/zh_chapters"
"${PYTHON_BIN}" "${ROOT}/tools/prepare_mdbook_zh.py" \
--source "${ROOT}/v2/zh_chapters" \
--summary-output "${ROOT}/v2/zh_chapters/SUMMARY.md"
mdbook build "${ROOT}/v2/books/zh"


@@ -171,7 +171,7 @@ a {
 }
 .cover h2 {
-  font-size: 34px;
+  font-size: 24px;
   margin-bottom: 0px;
   padding-bottom: 20px;
 }


@@ -144,8 +144,10 @@ missing
             )
             (source / "existing.md").write_text("# 现有章节\n", encoding="utf-8")
-            with self.assertRaises(FileNotFoundError):
-                write_summary(source)
+            summary_path = write_summary(source)
+            summary = summary_path.read_text(encoding="utf-8")
+            self.assertIn("existing", summary)
+            self.assertNotIn("missing", summary)

     def test_rewrite_markdown_normalizes_common_d2l_directives(self) -> None:
         with tempfile.TemporaryDirectory() as tmpdir:


@@ -0,0 +1,48 @@
/* Version and Language selectors — inline in .right-buttons */
.openmlsys-nav-selectors {
display: inline-flex;
align-items: center;
gap: 4px;
margin-right: 4px;
vertical-align: middle;
}
/* Shared style for all selector links/buttons */
.openmlsys-selector-link {
display: inline-flex;
align-items: center;
justify-content: center;
min-width: 32px;
height: 28px;
padding: 0 8px;
border-radius: 4px;
border: 1px solid transparent;
color: var(--icons, #747474);
font-size: 12px;
font-weight: 600;
text-decoration: none;
cursor: pointer;
line-height: 1;
transition: color 0.1s, background 0.1s;
}
.openmlsys-selector-link:hover {
color: var(--icons-hover, #333);
background: var(--theme-hover, rgba(0, 0, 0, 0.05));
}
/* Active/current indicator */
.openmlsys-selector-link.active {
color: var(--links, #4183c4);
border-color: var(--links, #4183c4);
font-weight: 700;
}
/* Separator between version and language groups */
.openmlsys-selector-sep {
width: 1px;
height: 18px;
background: var(--icons, #747474);
opacity: 0.3;
margin: 0 2px;
}

theme/version-selector.js (new file)

@@ -0,0 +1,74 @@
// Version and Language selector for OpenMLSys mdbook
(function () {
"use strict";
var path = window.location.pathname;
// Detect current version and language from URL
var currentVersion = "v2";
var currentLang = "en";
if (path.match(/\/v1(\/|$)/)) {
currentVersion = "v1";
}
if (path.match(/\/cn(\/|$)/)) {
currentLang = "zh";
}
// Build base paths
function basePath(version, lang) {
var docsRoot = path.replace(/\/v1\/.*/, "/").replace(/\/cn\/.*/, "/");
docsRoot = docsRoot.replace(/(\/docs\/?).*/, "/docs/");
if (!docsRoot.endsWith("/")) docsRoot += "/";
var p = docsRoot;
if (version === "v1") p += "v1/";
if (lang === "zh") p += "cn/";
return p;
}
var container = document.createElement("span");
container.className = "openmlsys-nav-selectors";
// --- Version links: V1 | V2 ---
var versions = [
{ label: "V1", value: "v1" },
{ label: "V2", value: "v2" },
];
versions.forEach(function (v) {
var a = document.createElement("a");
a.className = "openmlsys-selector-link";
a.textContent = v.label;
a.href = basePath(v.value, currentLang);
if (v.value === currentVersion) a.classList.add("active");
container.appendChild(a);
});
// Separator
var sep = document.createElement("span");
sep.className = "openmlsys-selector-sep";
container.appendChild(sep);
// --- Language toggle: single button that switches to the other language ---
var otherLang = currentLang === "zh" ? "en" : "zh";
var langLink = document.createElement("a");
langLink.className = "openmlsys-selector-link";
langLink.textContent = currentLang === "zh" ? "EN" : "ZH";
langLink.href = basePath(currentVersion, otherLang);
container.appendChild(langLink);
// Insert into .right-buttons, before existing icons
function insertSelector() {
var rightButtons = document.querySelector(".right-buttons");
if (rightButtons) {
rightButtons.insertBefore(container, rightButtons.firstChild);
}
}
if (document.readyState === "loading") {
document.addEventListener("DOMContentLoaded", insertSelector);
} else {
insertSelector();
}
})();


@@ -28,8 +28,12 @@ def assemble_publish_tree(
     docs_subdir: str = "docs",
     en_source: Path | None = None,
     zh_source: Path | None = None,
+    v1_en_source: Path | None = None,
+    v1_zh_source: Path | None = None,
+    v2_en_source: Path | None = None,
+    v2_zh_source: Path | None = None,
 ) -> tuple[Path, Path | None]:
-    if en_source is None and zh_source is None:
+    if en_source is None and zh_source is None and v2_en_source is None:
         raise ValueError("At least one site source must be provided.")
     destination_root = destination_root.resolve()
@@ -38,15 +42,25 @@ def assemble_publish_tree(
     remove_path(docs_root)
     docs_root.parent.mkdir(parents=True, exist_ok=True)
-    if en_source is not None:
-        copy_site(en_source, docs_root)
+    # v2 (latest) is deployed at the root — /docs/
+    effective_en = v2_en_source or en_source
+    if effective_en is not None:
+        copy_site(effective_en, docs_root)
     else:
         docs_root.mkdir(parents=True, exist_ok=True)
     zh_destination: Path | None = None
-    if zh_source is not None:
+    effective_zh = v2_zh_source or zh_source
+    if effective_zh is not None:
         zh_destination = docs_root / "cn"
-        copy_site(zh_source, zh_destination)
+        copy_site(effective_zh, zh_destination)
+    # v1 is deployed under /docs/v1/
+    if v1_en_source is not None:
+        v1_root = docs_root / "v1"
+        copy_site(v1_en_source, v1_root)
+        if v1_zh_source is not None:
+            copy_site(v1_zh_source, v1_root / "cn")
     return docs_root, zh_destination
@@ -69,12 +83,32 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument(
         "--en-source",
         type=Path,
-        help="Built site to publish at docs/.",
+        help="Built site to publish at docs/ (legacy, use --v2-en-source instead).",
     )
     parser.add_argument(
         "--zh-source",
         type=Path,
-        help="Built site to publish at docs/cn/.",
+        help="Built site to publish at docs/cn/ (legacy, use --v2-zh-source instead).",
     )
+    parser.add_argument(
+        "--v1-en-source",
+        type=Path,
+        help="Built v1 English site to publish at docs/v1/.",
+    )
+    parser.add_argument(
+        "--v1-zh-source",
+        type=Path,
+        help="Built v1 Chinese site to publish at docs/v1/cn/.",
+    )
+    parser.add_argument(
+        "--v2-en-source",
+        type=Path,
+        help="Built v2 English site to publish at docs/.",
+    )
+    parser.add_argument(
+        "--v2-zh-source",
+        type=Path,
+        help="Built v2 Chinese site to publish at docs/cn/.",
+    )
     return parser.parse_args()

@@ -86,6 +120,10 @@ def main() -> int:
         docs_subdir=args.docs_subdir,
         en_source=args.en_source,
         zh_source=args.zh_source,
+        v1_en_source=args.v1_en_source,
+        v1_zh_source=args.v1_zh_source,
+        v2_en_source=args.v2_en_source,
+        v2_zh_source=args.v2_zh_source,
     )
     print(f"Assembled root site at {docs_root}")
     if zh_root is not None:


@@ -43,7 +43,7 @@ def main() -> int:
     for key, fields in parse_bib(extra_bib).items():
         bib_db.setdefault(key, fields)
-    chapters = iter_chapters(book.get("items", []))
+    chapters = iter_chapters(book.get("sections") or book.get("items") or [])
     # Pass 1: collect all :label: directives and figure labels
     ref_label_map: dict[str, str] = {}


@@ -42,7 +42,7 @@ def main() -> int:
     for key, fields in parse_bib(extra_bib).items():
         bib_db.setdefault(key, fields)
-    chapters = iter_chapters(book.get("items", []))
+    chapters = iter_chapters(book.get("sections") or book.get("items") or [])
     # Pass 1: collect all :label: directives and figure labels
     ref_label_map: dict[str, str] = {}


@@ -200,11 +200,11 @@ def parse_toc_blocks(markdown: str) -> list[list[TocItem]]:
     return blocks

-def resolve_toc_target(current_file: Path, entry: str) -> Path:
+def resolve_toc_target(current_file: Path, entry: str) -> Path | None:
     target_name = entry if entry.endswith(".md") else f"{entry}.md"
     target = (current_file.parent / target_name).resolve()
     if not target.exists():
-        raise FileNotFoundError(f"TOC entry '{entry}' from '{current_file}' does not exist")
+        return None
     return target
@@ -828,7 +828,7 @@ def render_toc_list(entries: list[TocItem], current_file: Path, title_cache: dic
             continue
         target = resolve_toc_target(current_file, entry.target)
-        if target not in title_cache:
+        if target is None or target not in title_cache:
             continue
         label = chapter_label(entry, target, title_cache)
@@ -943,7 +943,9 @@ def build_summary(source_dir: Path, title_cache: dict[Path, str]) -> str:
             for entry in block:
                 if entry.kind != "chapter" or entry.target is None:
                     continue
-                append_entry(resolve_toc_target(target, entry.target), indent + 1, entry.label or None)
+                child_target = resolve_toc_target(target, entry.target)
+                if child_target is not None:
+                    append_entry(child_target, indent + 1, entry.label or None)

     def append_prefix_chapter(target: Path, label: str | None = None) -> None:
         target = target.resolve()
@@ -969,6 +971,8 @@ def build_summary(source_dir: Path, title_cache: dict[Path, str]) -> str:
                 continue
             target = resolve_toc_target(root_index, entry.target)
+            if target is None:
+                continue
             if numbered_started:
                 append_entry(target, 0, entry.label or None)
             else:

v1/book.toml (new file)

@@ -0,0 +1,19 @@
[book]
authors = ["OpenMLSys Contributors"]
language = "en"
src = "en_chapters"
title = "Machine Learning Systems: Design and Implementation (1st Edition)"
[build]
build-dir = "../.mdbook-v1/book"
create-missing = false
[preprocessor.openmlsys]
command = "python3 tools/mdbook_preprocessor.py"
[output.html]
mathjax-support = true
git-repository-url = "https://github.com/openmlsys/openmlsys-zh"
preferred-dark-theme = "navy"
additional-css = ["theme/dark-mode-images.css", "theme/version-selector.css"]
additional-js = ["theme/version-selector.js"]

v1/books/zh/book.toml (new file)

@@ -0,0 +1,19 @@
[book]
authors = ["OpenMLSys Contributors"]
language = "zh-CN"
src = "../../zh_chapters"
title = "机器学习系统:设计和实现(第一版)"
[build]
build-dir = "../../../.mdbook-v1-zh/book"
create-missing = false
[preprocessor.openmlsys-zh]
command = "python3 tools/mdbook_zh_preprocessor.py"
[output.html]
mathjax-support = true
git-repository-url = "https://github.com/openmlsys/openmlsys-zh"
preferred-dark-theme = "navy"
additional-css = ["theme/dark-mode-images.css", "theme/version-selector.css"]
additional-js = ["theme/version-selector.js"]


@@ -0,0 +1,16 @@
/* 暗色模式下仅为正文图片添加浅灰色背景,提高透明背景图片的可读性 */
.navy .content main img,
.coal .content main img,
.ayu .content main img {
background-color: #e8e8e8;
border-radius: 4px;
padding: 8px;
}
/* 首页 frontpage 图片保持透明,不添加正文图像底色。 */
.navy .openmlsys-frontpage img,
.coal .openmlsys-frontpage img,
.ayu .openmlsys-frontpage img {
background-color: transparent !important;
padding: 0 !important;
}


@@ -0,0 +1,12 @@
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
"HTML-CSS": {
availableFonts: ["TeX"],
preferredFont: "TeX",
webFont: "TeX"
},
SVG: {
font: "TeX"
}
});
</script>


@@ -0,0 +1,48 @@
/* Version and Language selectors — inline in .right-buttons */
.openmlsys-nav-selectors {
display: inline-flex;
align-items: center;
gap: 4px;
margin-right: 4px;
vertical-align: middle;
}
/* Shared style for all selector links/buttons */
.openmlsys-selector-link {
display: inline-flex;
align-items: center;
justify-content: center;
min-width: 32px;
height: 28px;
padding: 0 8px;
border-radius: 4px;
border: 1px solid transparent;
color: var(--icons, #747474);
font-size: 12px;
font-weight: 600;
text-decoration: none;
cursor: pointer;
line-height: 1;
transition: color 0.1s, background 0.1s;
}
.openmlsys-selector-link:hover {
color: var(--icons-hover, #333);
background: var(--theme-hover, rgba(0, 0, 0, 0.05));
}
/* Active/current indicator */
.openmlsys-selector-link.active {
color: var(--links, #4183c4);
border-color: var(--links, #4183c4);
font-weight: 700;
}
/* Separator between version and language groups */
.openmlsys-selector-sep {
width: 1px;
height: 18px;
background: var(--icons, #747474);
opacity: 0.3;
margin: 0 2px;
}


@@ -0,0 +1,74 @@
// Version and Language selector for OpenMLSys mdbook
(function () {
"use strict";
var path = window.location.pathname;
// Detect current version and language from URL
var currentVersion = "v2";
var currentLang = "en";
if (path.match(/\/v1(\/|$)/)) {
currentVersion = "v1";
}
if (path.match(/\/cn(\/|$)/)) {
currentLang = "zh";
}
// Build base paths
function basePath(version, lang) {
var docsRoot = path.replace(/\/v1\/.*/, "/").replace(/\/cn\/.*/, "/");
docsRoot = docsRoot.replace(/(\/docs\/?).*/, "/docs/");
if (!docsRoot.endsWith("/")) docsRoot += "/";
var p = docsRoot;
if (version === "v1") p += "v1/";
if (lang === "zh") p += "cn/";
return p;
}
var container = document.createElement("span");
container.className = "openmlsys-nav-selectors";
// --- Version links: V1 | V2 ---
var versions = [
{ label: "V1", value: "v1" },
{ label: "V2", value: "v2" },
];
versions.forEach(function (v) {
var a = document.createElement("a");
a.className = "openmlsys-selector-link";
a.textContent = v.label;
a.href = basePath(v.value, currentLang);
if (v.value === currentVersion) a.classList.add("active");
container.appendChild(a);
});
// Separator
var sep = document.createElement("span");
sep.className = "openmlsys-selector-sep";
container.appendChild(sep);
// --- Language toggle: single button that switches to the other language ---
var otherLang = currentLang === "zh" ? "en" : "zh";
var langLink = document.createElement("a");
langLink.className = "openmlsys-selector-link";
langLink.textContent = currentLang === "zh" ? "EN" : "ZH";
langLink.href = basePath(currentVersion, otherLang);
container.appendChild(langLink);
// Insert into .right-buttons, before existing icons
function insertSelector() {
var rightButtons = document.querySelector(".right-buttons");
if (rightButtons) {
rightButtons.insertBefore(container, rightButtons.firstChild);
}
}
if (document.readyState === "loading") {
document.addEventListener("DOMContentLoaded", insertSelector);
} else {
insertSelector();
}
})();

v1/en_chapters/SUMMARY.md (new file)

@@ -0,0 +1,21 @@
# Summary
[Machine Learning Systems: Design and Implementation (1st Edition)](index.md)
[Preface](chapter_preface/index.md)
[Introduction](chapter_introduction/index.md)
[Programming Model](chapter_programming_interface/index.md)
[Computational Graph](chapter_computational_graph/index.md)
[Part I Framework Design](chapter_preface_advanced/index.md)
[AI Compiler Frontend](chapter_frontend_and_ir/index.md)
[AI Compiler Backend](chapter_backend_and_runtime/index.md)
[Hardware Accelerator](chapter_accelerator/index.md)
[Data Processing Framework](chapter_data_processing/index.md)
[Model Deployment](chapter_model_deployment/index.md)
[Distributed Training](chapter_distributed_training/index.md)
[Part II Application Scenarios](chapter_preface_extension/index.md)
[Recommender System](chapter_recommender_system/index.md)
[Federated Learning Systems](chapter_federated_learning/index.md)
[Reinforcement Learning System](chapter_reinforcement_learning/index.md)
[Explainable AI Systems](chapter_explainable_AI/index.md)
[Robotic System](chapter_rl_sys/index.md)
[Appendix: Introduction to Machine Learning](appendix_machine_learning_introduction/index.md)


@@ -0,0 +1,62 @@
## Classic Machine Learning Methods
Many classic machine learning algorithms, such as the Support Vector Machine (SVM), the K-Nearest Neighbor (KNN) classification algorithm, and the K-Means clustering algorithm, differ in various ways: some have trainable parameters while others do not, some are supervised learning algorithms while others are unsupervised, and their training processes also differ. From a systems perspective, however, they are all built on matrix operations. Below, we briefly introduce these algorithms.
### Support Vector Machine
**Support Vector Machine** (SVM) is a classic machine learning classification algorithm whose core idea is to maximize the distance from the decision boundary to the data points. Here, we use linearly separable data as an example; for non-linearly separable data, the **Kernel Method** can be applied in a similar manner.
If the training data is linearly separable, the objective of SVM is to maximize the **margin**. First, let us define the maximum margin classifier as follows:
$$\min_{{w},b} ~~~\frac{1}{2} ||{w}||^2$$
$$s.t. ~~~y_i ({w}^T {x_i} + b) \geq 1, ~~~\forall 1 \leq i \leq n$$
Its Lagrange multiplier formulation is
$$L({w},b,{\lambda}) = \frac{1}{2} ||{w}||^2 + \sum_{i=1}^n \lambda_i (1-y_i({w}^T {x_i} + b))$$
Since $\frac{1}{2} ||{w}||^2$ is convex, and $\lambda_i (1-y_i({w}^T {x_i} + b))$ is linear (and therefore also convex), the solution to the optimization problem is
$$\max_{\lambda>0} \min_{{w},b} L({w},b, {\lambda})$$
Taking the derivatives of $L$ with respect to ${w},b$, we have
$$\nabla_{{w}} L= {w} - \sum_{i=1}^n \lambda_i y_i {x_i}$$
$$\nabla_b L = - \sum_{i=1}^n \lambda_i y_i$$
Setting the derivatives of $L$ with respect to ${w},b$ to zero, we obtain ${w}^* = \sum_{i=1}^n \lambda_i y_i {x_i}$ and $\sum_{i=1}^n \lambda_i y_i = 0$.
Since, when ${\lambda}$ is fixed, the value of $b$ does not affect the objective function, we can set $b^* = 0$.
At this point, by duality theory and the KKT conditions, we obtain:
$$y_i ({w}^{*T} {x_i} + b^*) > 1 \Rightarrow \lambda_i^* = 0$$
$$\lambda_i^* > 0 \Rightarrow y_i ({w}^{*T} {x_i} + b^*) = 1$$
$${w}^* = \sum_{i=1}^n \lambda_i^* y_i {x_i}$$
If $y_i ({w}^{*T} {x_i} + b^*) = 1$, then ${x_i}$ is one of the points closest to the hyperplane $({w}^*,b^*)$; otherwise, it is not. Therefore, ${w}^*$ is a linear combination of the points ${x_i}$ that are closest to the hyperplane $({w}^*,b^*)$.
In this way, through the SVM algorithm, we achieve data classification while maximizing the distance from the decision boundary to the nearest points.
We define the ${x_i}$ satisfying $y_i ({w}^{*T} {x_i} + b^*) = 1$ as **support vectors**, and call the classifier $\hat{y}=\operatorname{sgn}({w}^{*T} {x} + b^*)$ the support vector machine.
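As a concrete check of these conditions, the feasibility constraints and the support vectors can be computed with a few matrix operations. The toy data and the hyperplane $({w}, b)$ below are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Toy linearly separable data: two positive and two negative points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A candidate separating hyperplane (w, b); by symmetry of this data, b* = 0.
w = np.array([0.25, 0.25])
b = 0.0

# y_i (w^T x_i + b) for all i, computed as one matrix operation.
margins = y * (X @ w + b)
assert np.all(margins >= 1.0)  # feasibility: every constraint is satisfied

# Support vectors are exactly the points with y_i (w^T x_i + b) == 1.
support = np.isclose(margins, 1.0)
print(X[support])                   # the two points closest to the hyperplane

# Geometric distance from the hyperplane to the support vectors: 1 / ||w||.
print(1.0 / np.linalg.norm(w))
```

Maximizing the margin corresponds to making $1/\|{w}\|$ as large as possible while all constraints remain satisfied.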
### K-Nearest Neighbor Algorithm
**K-Nearest Neighbor** (KNN) is also a traditional machine learning algorithm that can be used for basic machine learning tasks such as classification and regression. Unlike the SVM algorithm introduced above, the core idea of the K-Nearest Neighbor algorithm is not to separate data of different classes using a decision boundary, but rather to predict the properties of a data point based on the properties of its K nearest neighbors.
When KNN is used for classification, a vote is conducted to predict the class of a sample point. The voters are the K sample points closest to the observation point, where each voting sample point may be assigned different weights, and the "content" of the vote is the class label of the sample point. When processing the voting results, a majority vote decision method is used. That is, if most of the K nearest sample points belong to a certain class, then the sample point is also assigned to that class.
The KNN algorithm can be described as follows: (1) compute the distance from the point to be classified to each known-class point; (2) sort these points by distance and select the K nearest points; (3) tally the votes according to each point's weight, where the vote content is the point's class label; (4) return the class with the highest vote count as the predicted class for the point to be classified.
The KNN algorithm has several key issues that require attention, including the choice of the hyperparameter K, the distance metric, and the classification decision rule. For the hyperparameter K, it should not be too large, as this would lead to significant approximation error, nor too small, as this would lead to significant estimation error. For the distance metric, one can choose Manhattan distance, Euclidean distance, Minkowski distance, and so on. To reduce the error and impact of the K value on prediction results, we can typically impose certain rules on the classification decision, such as giving closer points larger weights and more distant points smaller weights during voting. When implementing the KNN algorithm programmatically, parameters such as weights are computed in matrix form to improve computational efficiency.
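The four steps above can be sketched as a short NumPy function. The function name, toy data, and the inverse-distance weighting scheme are our illustrative choices:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Weighted KNN classification following the four steps in the text."""
    # (1) Distance from x to every known-class point (Euclidean, via matrix ops).
    d = np.linalg.norm(X_train - x, axis=1)
    # (2) Sort by distance and select the K nearest points.
    nearest = np.argsort(d)[:k]
    # (3) Tally votes with inverse-distance weights (closer points weigh more);
    #     the small epsilon guards against division by zero on exact matches.
    w = 1.0 / (d[nearest] + 1e-12)
    votes = {}
    for idx, weight in zip(nearest, w):
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + weight
    # (4) Return the class with the highest weighted vote.
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # 1
```

Note that both the distance computation and the weighting are expressed as vectorized matrix operations, matching the efficiency point made above.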
### K-Means Clustering Algorithm
**K-Means Clustering Algorithm** is a common unsupervised clustering algorithm in machine learning. Here, we first define the clustering problem: given data points ${x_1},\cdots, {x_n} \in \mathbb{R}^d$ and $K\in \mathbb{N}$, we need to partition them into $K$ clusters ${C_1}, \cdots, {C_K} \in \mathbb{R}^d$ along with the corresponding cluster center ${ C_{(1)}}, \cdots, {C_{(n)}}$ for each data point, so as to minimize the sum of distances $\sum_i ||{x_i} - {C_{(i)}}||^2$.
The K-Means clustering algorithm solves the clustering problem as follows:
- Randomly initialize ${C_1}, \cdots, {C_K}$
- Assign each ${x_i}$ to the cluster whose center is nearest
- Compute and update ${C_K} = \frac{\sum_{{C_{(i)}}={C_K}} {x_i}}{\sum_{{C_{(i)}}={C_K}} 1}$
- Repeat the above steps until the algorithm converges
It can be proven that the K-Means clustering algorithm monotonically decreases the sum of distances $\sum_i ||{x_i} - {C_{(i)}}||^2$ and eventually converges. However, the algorithm may converge to a local minimum.
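The steps above can be sketched in NumPy as follows; the function name and the toy data are ours for illustration:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-Means sketch following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Randomly initialize the K cluster centers from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each x_i to its nearest center (all pairwise distances in one matrix op).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update each center to the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, assign

# Two well-separated blobs; the recovered centers are close to (0.5, 0.5) and (5.5, 5.5).
X = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 1.0], [5.0, 6.0],
              [1.0, 0.0], [6.0, 5.0], [1.0, 1.0], [6.0, 6.0]])
centers, assign = kmeans(X, 2)
print(centers)
```

Each iteration is a single batched distance computation followed by a mean, which is exactly the matrix-operation view emphasized in this chapter.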
In conclusion, from a systems perspective, regardless of the specific algorithm, machine learning methods that operate on high-dimensional data are ultimately implemented through matrix operations.
## References
:bibliography:`../references/appendix.bib`


@@ -0,0 +1,85 @@
## Gradient Descent and Backpropagation
The previous section provided a general introduction to classic neural networks. An important question now arises: how are the parameters in these networks determined? If the problem can be solved by a simple perceptron, the parameters can be set by hand. For deep networks, however, parameter determination must be automated. This process is known as network training, and it requires us to define a **loss function** to guide the direction of training optimization.
Common loss functions include: 1) Mean Squared Error (MSE), which measures the distance between vectors,
$\mathcal{L} = \frac{1}{N}\|{y}-\hat{{y}}\|^{2}_{2} = \frac{1}{N}\sum_{i=1}^N(y_{i}-\hat{y}_{i})^{2}$
and Mean Absolute Error (MAE),
$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}|y_{i}-\hat{y}_{i}|$
, where $N$ represents the number of data samples used for averaging, $y$ represents the ground truth labels, and $\hat{y}$ represents the predicted labels output by the network.
2) Cross Entropy, which can be used for classification tasks,
$\mathcal{L} = - \frac{1}{N} \sum_{i=1}^N \bigg(y_{i}\log\hat{y}_{i} + (1 - y_{i})\log(1 - \hat{y}_{i})\bigg)$, where the loss value is zero if and only if the predicted labels exactly match the ground-truth labels.
With the loss value computed, we can use large amounts of labeled data and optimization methods to update the model parameters. The most commonly used method is **gradient descent**. As shown in :numref:`gradient_descent2`,
initially, the model parameters ${w}$ are randomly selected. Then the partial derivative of the loss with respect to the parameters $\frac{\partial \mathcal{L}}{\partial {w}}$ is computed, and optimization is performed through repeated iterations of
${w}:={w}-\alpha\frac{\partial \mathcal{L}}{\partial {w}}$. This optimization process effectively reduces the loss value to achieve the task objective, where $\alpha$ is the **learning rate** that controls the optimization step size.
In practice, the minimum value obtained by gradient descent is very likely a local minimum rather than the global minimum. However, since deep neural networks provide strong data representation capability, the local minimum can be very close to the global minimum, and the loss value can be sufficiently small.
![Introduction to gradient descent. (Left) Only one trainable parameter $w$; (Right) Two trainable parameters ${w}=[w_1,w_2]$. After continuously updating and iterating the parameters, the loss value $\mathcal{L}$ gradually decreases. However, due to the existence of many local optima, we often cannot reach the global optimum.](../img/ch_basic/gradient_descent2.png)
:width:`600px`
:label:`gradient_descent2`
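As a concrete example, applying the update rule ${w}:={w}-\alpha\frac{\partial \mathcal{L}}{\partial {w}}$ to the hypothetical one-parameter loss $\mathcal{L}(w)=(w-3)^2$ drives $w$ toward the minimizer $w=3$:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Iterate w := w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Loss L(w) = (w - 3)^2 has gradient dL/dw = 2 * (w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

For this convex loss, each step shrinks the distance to the minimizer by a constant factor; with a non-convex loss, the same loop may stop at a local minimum, as the figure illustrates.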
The next question is: how do we implement gradient descent in deep neural networks? This requires computing the partial derivatives $\frac{\partial \mathcal{L}}{\partial {w}}$ of the parameters at each layer, which can be achieved using **backpropagation** :cite:`rumelhart1986learning,lecun2015deep`.
Next,
we introduce an intermediate quantity ${\delta}=\frac{\partial \mathcal{L}}{\partial {z}}$ to represent the partial derivative of the loss function $\mathcal{L}$
with respect to the neural network output ${z}$ (before the activation function, not $a$),
and ultimately obtain $\frac{\partial \mathcal{L}}{\partial {w}}$.
We illustrate the backpropagation algorithm with an example below.
Let the layer index be $l=1, 2, \ldots L$ (the output layer, i.e., the last layer, has index $L$).
For each network layer, we have the output ${z}^l$, the intermediate value ${\delta}^l=\frac{\partial \mathcal{L}}{\partial {z}^l}$, and an activation output ${a}^l=f({z}^l)$
(where $f$ is the activation function).
We assume the model is a multi-layer perceptron using the Sigmoid activation function, with Mean Squared Error (MSE) as the loss function. That is, we define:
- Network structure ${z}^{l}={W}^{l}{a}^{l-1}+{b}^{l}$
- Activation function ${a}^l=f({z}^l)=\frac{1}{1+{\rm e}^{-{z}^l}}$
- Loss function $\mathcal{L}=\frac{1}{2}\|{y}-{a}^{L}\|^2_2$
We can directly compute the partial derivative of the activation output with respect to the pre-activation output:
- $\frac{\partial {a}^l}{\partial {z}^l}=f'({z}^l)=f({z}^l)(1-f({z}^l))={a}^l(1-{a}^l)$
and the partial derivative of the loss function with respect to the activation output:
- $\frac{\partial \mathcal{L}}{\partial {a}^{L}}=({a}^{L}-{y})$
With these results, to further obtain the partial derivatives of the loss function with respect to each parameter, we can use the **chain rule**, detailed as follows:
First, starting from the output layer ($l=L$, the last layer), we propagate the error backward. By the chain rule, we first compute the intermediate quantity of the output layer:
- ${\delta}^{L}
=\frac{\partial \mathcal{L}}{\partial {z}^{L}}
=\frac{\partial \mathcal{L}}{\partial {a}^{L}}\frac{\partial {a}^L}{\partial {z}^{L}}=({a}^L-{y})\odot({a}^L(1-{a}^L))$
Besides the intermediate value ${\delta}^{L}$ of the output layer ($l=L$), how do we compute the intermediate values ${\delta}^{l}$ for the other layers ($l=1, 2, \ldots , L-1$)?
- Given the model structure ${z}^{l+1}={W}^{l+1}{a}^{l}+{b}^{l+1}$, we can directly obtain $\frac{\partial {z}^{l+1}}{\partial {a}^{l}}={W}^{l+1}$; moreover, we already know that $\frac{\partial {a}^l}{\partial {z}^l}={a}^l(1-{a}^l)$
- Then by the chain rule, we can obtain ${\delta}^{l}
=\frac{\partial \mathcal{L}}{\partial {z}^{l}}
=\frac{\partial \mathcal{L}}{\partial {z}^{l+1}}\frac{\partial {z}^{l+1}}{\partial {a}^{l}}\frac{\partial {a}^{l}}{\partial {z}^{l}}
=({W}^{l+1})^\top{\delta}^{l+1}\odot({a}^l(1-{a}^l))$
Having computed the intermediate values ${\delta}^l, l=1, 2, \ldots , L$ for all layers using the above derivation, we can then compute the partial derivatives of the loss function with respect to the parameters of each layer: $\frac{\partial \mathcal{L}}{\partial {W}^l}$ and $\frac{\partial \mathcal{L}}{\partial {b}^l}$, and use gradient descent to update the parameters at each layer.
- Given the model structure ${z}^l={W}^l{a}^{l-1}+{b}^l$, we can compute
$\frac{\partial {z}^{l}}{\partial {W}^l}={a}^{l-1}$ and
$\frac{\partial {z}^{l}}{\partial {b}^l}=1$
- Then by the chain rule, we can obtain $\frac{\partial \mathcal{L}}{\partial {W}^l}=\frac{\partial \mathcal{L}}{\partial {z}^l}\frac{\partial {z}^l}{\partial {W}^l}={\delta}^l({a}^{l-1})^\top$
,
$\frac{\partial \mathcal{L}}{\partial {b}^l}=\frac{\partial \mathcal{L}}{\partial {z}^l}\frac{\partial {z}^l}{\partial {b}^l}={\delta}^l$
After obtaining all partial derivatives $\frac{\partial \mathcal{L}}{\partial {W}^l}$ and
$\frac{\partial \mathcal{L}}{\partial {b}^l}$, we can update all parameters ${W}^l$
and ${b}^l$ using gradient descent:
- ${W}^l:={W}^l-\alpha\frac{\partial \mathcal{L}}{\partial {W}^l}$,
${b}^l:={b}^l-\alpha\frac{\partial \mathcal{L}}{\partial {b}^l}$
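The whole derivation can be checked numerically. The sketch below implements the $\delta$ recursion for a chain of single-neuron Sigmoid layers (so every quantity is a scalar and the transposes disappear); the weights and inputs are arbitrary illustrative values, and a finite-difference check against the loss confirms the analytic gradients:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(ws, bs, x):
    """Return activations a^0..a^L for z^l = w^l * a^{l-1} + b^l."""
    acts = [x]
    for w, b in zip(ws, bs):
        acts.append(sigmoid(w * acts[-1] + b))
    return acts

def loss(ws, bs, x, y):
    """MSE loss L = 0.5 * (y - a^L)^2."""
    return 0.5 * (y - forward(ws, bs, x)[-1]) ** 2

def backward(ws, acts, y):
    """Backpropagate delta^l; return dL/dw^l and dL/db^l per layer."""
    L = len(ws)
    # Output layer: delta^L = (a^L - y) * a^L * (1 - a^L)
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    dws, dbs = [0.0] * L, [0.0] * L
    for l in range(L - 1, -1, -1):
        dws[l] = delta * acts[l]   # dL/dw^l = delta^l * a^{l-1}
        dbs[l] = delta             # dL/db^l = delta^l
        if l > 0:                  # delta^l = w^{l+1} delta^{l+1} a^l (1 - a^l)
            delta = ws[l] * delta * acts[l] * (1 - acts[l])
    return dws, dbs

ws, bs, x, y = [0.5, -0.3], [0.1, 0.2], 0.7, 1.0
dws, dbs = backward(ws, forward(ws, bs, x), y)
```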
However, there is still one issue to address: each time gradient descent updates the parameters, it needs to compute the loss value under the current parameters. When the training dataset is large ($N$ is large), computing the loss value using the entire training set for each update would be computationally prohibitive.
To reduce the computational cost, we use **Stochastic Gradient Descent** (SGD) to compute the loss value. Specifically, instead of using all training data, we randomly select a subset of data samples from the training set to compute the loss value, such as 16, 32, 64, or 128 data samples. The number of samples is called the **batch size**.
Furthermore, setting the learning rate is also very important. If the learning rate is too large, we may not be able to approach the valley of the minimum; if it is too small, training proceeds too slowly.
Adaptive learning-rate methods, such as Adam :cite:`KingmaAdam2014`, RMSProp :cite:`tieleman2012rmsprop`, and
Adagrad :cite:`duchi2011adagrad`, automatically adjust the learning rate during training to converge quickly toward a minimum.
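A minimal sketch of the minibatch idea, using a hypothetical one-parameter linear model trained on synthetic data (adaptive methods such as Adam would replace the fixed learning rate with per-parameter adaptive steps):

```python
import random

def sgd_linear(data, lr=0.05, batch_size=4, steps=200, seed=0):
    """Minibatch SGD for a 1-D linear model y ~ w * x.

    Each update estimates the MSE gradient from a random minibatch
    instead of the full dataset, trading gradient noise for cheap steps.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # Gradient of (1/B) * sum((w*x - y)^2) with respect to w.
        g = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * g
    return w

# Synthetic noiseless data generated from y = 2x.
data = [(i / 10.0, 2 * i / 10.0) for i in range(1, 21)]
w_hat = sgd_linear(data)
```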

# Appendix: Introduction to Machine Learning
This book assumes that readers have a basic foundation in machine learning algorithms. Therefore, this chapter only provides a brief introduction to machine learning. Among the topics covered, the gradient descent method is particularly important for understanding machine learning systems and is essential knowledge.
```toc
:maxdepth: 2
:numbered:
neural_network
gradient_descent
classic_machine_learning
```

## Neural Networks
### Perceptron
![A neuron with three inputs and a single output](../img/ch_basic/single_neuron2.png)
:width:`600px`
:label:`single_neuron`
:numref:`single_neuron` shows an example of a neuron, where the input data $x$ is weighted and summed according to the weights $w$ on the connections to produce the output $z$. We call such a model a **perceptron**.
Since there is only one layer of neural connections between input and output, this model is also called a single-layer perceptron. The computation of the model in :numref:`single_neuron` can be written as: $z = w_{1}x_{1}+ w_{2}x_{2} + w_{3}x_{3}$.
When the input data is represented as a column vector ${x}=[x_1,x_2,x_3]^T$ and the model weights are represented as a row vector ${w}=[w_1,w_2,w_3]$, the output scalar $z$ can be written as:
$$z =
\begin{bmatrix}
w_1,w_2,w_3\\
\end{bmatrix}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
={w}{x}$$
We can use the output scalar $z$ as a weighted combination of the inputs to accomplish specific tasks.
For example, we can classify "good apples" and "bad apples," where $x_1,x_2,x_3$ represent three different features: 1) degree of redness, 2) presence of holes, and 3) size. If the size of the apple has no effect on this judgment, the corresponding weight would be zero.
Training this neural network essentially means selecting appropriate weights to accomplish our task. For instance, we can choose appropriate weights such that when $z$ is less than or equal to $0$, it represents a "bad apple," and when $z$ is greater than $0$, it represents a "good apple."
The final classification output label $y$ is as follows, where $1$ represents good and $0$ represents bad. Since there is only one layer between the input and output of this neuron, it can be called a single-layer neural network.
$$
y =
\begin{cases}
1 & z>0 \\
0 & z \leq 0 \\
\end{cases}$$
### Decision Boundary and Bias
By selecting appropriate weights and classifying input data based on whether $z$ is greater or less than $0$, we can obtain a **decision boundary** in the data space. As shown in :numref:`single_neuron_decision_boundary2`, using the neuron output $z=0$ as the decision boundary for the output label $y$,
without bias the decision boundary must pass through the origin. If the data sample points are not separated by the origin, classification errors will occur.
To solve this problem, a **bias** can be added to the neuron. :numref:`single_neuron_bias2`
shows a neuron model with bias $b$, which can be expressed by :eqref:`singleneuron_bias`:
$$z = w_{1}x_{1}+ w_{2}x_{2}+ w_{3}x_{3} + b$$
:eqlabel:`singleneuron_bias`
![Decision boundaries with two inputs (left) and three inputs (right). Different shaped points represent different classes of data, and we need to find $z=0$ as the decision boundary to separate the different data points. With two inputs, the decision boundary is a line; with three inputs, the decision boundary is a plane; with higher-dimensional inputs, the decision boundary is called a **hyperplane**.
Left: $z=w_{1}x_{1}+w_{2}x_{2}+b$. Right: $z=w_{1}x_{1}+w_{2}x_{2}+w_{3}x_{3}+b$. Without bias, the decision boundary must pass through the origin, so it cannot separate the data samples of different classes.](../img/ch_basic/single_neuron_decision_boundary2.png)
:width:`600px`
:label:`single_neuron_decision_boundary2`
![A single-layer neural network with bias](../img/ch_basic/single_neuron_bias2.png)
:width:`600px`
:label:`single_neuron_bias2`
With bias, the decision boundary (line, plane, or hyperplane) does not have to pass through the origin, thus enabling better classification of samples.
More precisely, the decision boundary separates the sample data into two different classes, and this boundary is
$\{x_1, x_2, x_3 | w_{1}x_{1}+ w_{2}x_{2}+ w_{3}x_{3} + b = 0\}$.
### Logistic Regression
The input-output relationship of the above neuron is linear. To provide nonlinear data representation capability, an **activation function** can be applied to the neuron output. The most common activation functions include Sigmoid, Tanh, ReLU, and Softmax.
For example, the above neuron uses $z=0$ as the boundary for classification tasks. Can we instead have the neuron output a probability? For instance, outputting values between $0$ and $1$, where $1$ means the input data belongs to a certain class with $100\%$ probability.
To make the neuron output values between $0$ and $1$, we can apply the logistic function **Sigmoid** to $z$,
as shown in :eqref:`sigmoid`. Sigmoid constrains values between 0 and 1, and a simple threshold (e.g., 0.5) can be used to determine whether the final output label belongs to a certain class. This method is called **logistic regression**.
$$a = f({z}) = \frac{1}{1+{\rm e}^{-{z}}}$$
:eqlabel:`sigmoid`
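Putting the pieces together, a neuron with bias followed by the Sigmoid gives a logistic-regression classifier. The "apple" feature weights below are made up purely for illustration:

```python
import math

def neuron(w, x, b):
    """Single neuron with bias: z = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def logistic(w, x, b):
    """Logistic regression: squash z into (0, 1) with the Sigmoid."""
    return 1.0 / (1.0 + math.exp(-neuron(w, x, b)))

# Made-up "good apple" weights: redness helps, holes hurt,
# and size is irrelevant (weight 0).
w, b = [1.5, -2.0, 0.0], -0.5
p = logistic(w, [0.9, 0.0, 0.4], b)  # very red apple, no holes
label = 1 if p > 0.5 else 0
```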
### Multiple Neurons
![Multiple neurons](../img/ch_basic/two_neurons2.png)
:width:`600px`
:label:`two_neurons2`
The above network has only one output. With multiple neurons together, we can have multiple outputs. :numref:`two_neurons2` shows a network with two outputs, where each output is connected to all inputs. This is also called a **fully-connected (FC) layer**,
which can be expressed by the following equation :eqref:`fc_cal`.
$$z_{1} &= w_{11}x_{1} + w_{12}x_{2} + w_{13}x_{3} + b_1 \notag \\ z_{2} &= w_{21}x_{1} + w_{22}x_{2} + w_{23}x_{3} + b_2$$
:eqlabel:`fc_cal`
The following expression shows the matrix form of the computation:
$$
{z} =
\begin{bmatrix}
z_1 \\
z_2
\end{bmatrix}
=
\begin{bmatrix}
w_{11} & w_{12} & w_{13}\\
w_{21} & w_{22} & w_{23}\\
\end{bmatrix}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1 \\ b_2
\end{bmatrix}
= {W}{x} + {b}$$
A network with multiple outputs can solve multi-class classification problems. For example, with 10 numerical outputs, each value represents the probability of a particular class, with each output between $0$ and $1$, and the sum of all 10 outputs equal to $1$.
This can be achieved using the **Softmax** function shown in :eqref:`e_softmax`, where $K$ is the number of outputs:
$$f({z})_{i} = \frac{{\rm e}^{z_{i}}}{\sum_{k=1}^{K}{\rm e}^{z_{k}}}$$
:eqlabel:`e_softmax`
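In code, the matrix form of the fully-connected layer and the Softmax normalization look like this (plain Python for illustration; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import math

def fc_layer(W, x, b):
    """Fully-connected layer in matrix form: z = W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(z):
    """Normalize K outputs into probabilities that sum to 1."""
    m = max(z)  # shift by max(z) to avoid overflow; result is unchanged
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

z = fc_layer([[1, 0, 0], [0, 1, 0]], [1, 2, 3], [0, 0])
p = softmax(z)
```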
### Multi-Layer Perceptron
![Multi-layer perceptron example. $a^l_i$ represents the value after the neuron output $z$ passes through the activation function, where $l$ denotes the layer index ($L$ denotes the output layer), and $i$ denotes the output index](../img/ch_basic/mlp2.png)
**Multi-Layer Perceptron** (MLP) :cite:`rosenblatt1958perceptron` enhances the network's representation capability by stacking multiple fully-connected layers. Compared to single-layer networks, the multi-layer perceptron has many intermediate layer outputs that are not exposed to the final output; these layers are called **hidden layers**. The network in this example can be implemented through the following cascaded matrix operations, where $W^l$ and $b^l$ represent the weight matrices and biases of different layers, $l$ denotes the layer index, and $L$ denotes the output layer.
$${z} = f({W^L}\cdots f({W^3}f({W^2}f({W^1}{x} + {b^1}) + {b^2}) + {b^3})\cdots + {b^L})$$
In the deep learning era, network models are essentially composed of multiple layers of neural network layers connected together. Input data passes through multiple layers of feature extraction, learning **feature vectors** at different levels of abstraction. Below we introduce some other commonly used neural network layers.
### Convolutional Networks
![Convolution operation example. The input is a three-channel data of size $4 \times 4 \times 3$ (height $\times$ width $\times$ channels). To perform convolution on each channel, the convolution kernel must also have three channels. A single convolution kernel has size $3 \times 3 \times 3 \times 1$ (height $\times$ width $\times$ input channels $\times$ output channels (number of kernels)). The number of convolution kernels determines the number of output **feature maps**. In this example, since there is only one convolution kernel, the output has 1 channel with height and width of 2. We call such high-dimensional input data **tensors**, such as RGB images, videos, outputs from previous convolutional layers, etc.](../img/ch_basic/conv_computation_v4.png)
:width:`600px`
:label:`conv_computation_v4`
**Convolutional Neural Network** (CNN) :cite:`lecun1989backpropagation` consists of multiple **convolutional layers** and is commonly used in computer vision tasks :cite:`krizhevsky2012imagenet,he2016deep`.
:numref:`conv_computation_v4` describes an example of a convolution operation.
Based on the properties of convolution, we can observe two facts: 1) the number of channels in a convolution kernel equals the number of input channels; 2) the number of output channels equals the number of convolution kernels.
In the example of :numref:`conv_computation_v4`, the convolution kernel slides by one unit at a time to perform the convolution operation; we say its **stride** is 1. Additionally, if we want the edge values of the input to also be taken into account, we need to perform **zero padding** on the edges. In the example of :numref:`conv_computation_v4`, if each channel of the input is padded with a ring of zeros on all four sides, the output size would be $4\times 4\times 1$. The number of padding rings depends on the kernel size---larger kernels require more padding.
To perform feature extraction on input image data, the number of convolution kernels is typically greater than the number of input channels, which means the output data contains many more values and the computation increases. However, features of adjacent pixels in image data are often similar, so we can perform aggregation operations on adjacent output features. **Pooling layers** serve this purpose, and we typically use two pooling methods: Max Pooling and Mean Pooling. As shown in :numref:`pooling_v3`, assuming a pooling kernel of size $2\times2$, an input of $4\times4$, and a stride of 2, the output is $2\times2$ (with stride 1, the output would instead be $3\times3$).
![$2 \times 2$
max pooling and mean pooling examples, with stride 2 and input size $4 \times 4$](../img/ch_basic/pooling_v3.png)
:width:`600px`
:label:`pooling_v3`
Both convolutional layers and fully-connected layers are commonly used. However, when the input is high-dimensional image data, convolutional layers require far fewer parameters than fully-connected layers. The operations in convolutional layers are similar to those in fully-connected layers---the former is based on high-dimensional tensor operations, while the latter is based on two-dimensional matrix operations.
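A naive single-channel sketch of both operations follows. Note that deep-learning "convolution" is implemented as cross-correlation, i.e. without flipping the kernel, and multi-channel versions simply sum the per-channel results:

```python
def conv2d(x, k, stride=1):
    """Naive valid cross-correlation of a 2-D input with a 2-D kernel."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    oh = (H - kh) // stride + 1   # output height
    ow = (W - kw) // stride + 1   # output width
    return [[sum(x[i * stride + u][j * stride + v] * k[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)] for i in range(oh)]

def max_pool(x, size=2, stride=2):
    """Max pooling over non-overlapping (for stride == size) windows."""
    H, W = len(x), len(x[0])
    oh = (H - size) // stride + 1
    ow = (W - size) // stride + 1
    return [[max(x[i * stride + u][j * stride + v]
                 for u in range(size) for v in range(size))
             for j in range(ow)] for i in range(oh)]

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
y = conv2d(x, [[1, 1, 1]] * 3)  # 4x4 input, 3x3 kernel -> 2x2 output
p = max_pool(x)                 # 2x2 window, stride 2 -> 2x2 output
```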
### Sequential Models
In real life, besides images, there is a large amount of time series data, such as videos, stock prices, and so on. **Recurrent Neural Networks** (RNN) :cite:`rumelhart1986learning` are a type of deep learning model architecture designed for processing sequential data. Sequential data is a series of continuous data $\{x_1, x_2, \dots, x_n\}$, where each $x$ might represent a word in a sentence, for example.
To receive a continuous sequence of inputs, as shown in :numref:`rnn_simple_cell2`, the vanilla recurrent neural network uses a recurrent cell as the computation unit, with a hidden state to store information from past inputs. Specifically, for each input data $x$ to the model, according to equation :eqref:`aligned`, the recurrent cell repeatedly computes new hidden states to record information from current and past inputs. The new hidden state is then used in the computation of the next cell.
$${h}_t = {W}[{x}_t; {h}_{t-1}] + {b}$$
:eqlabel:`aligned`
![Vanilla recurrent neural network. At each computation step, the recurrent cell computes the current hidden state ${h}_t$ from the previous hidden state ${h}_{t-1}$ and the current input ${x}_t$.](../img/ch_basic/rnn_simple_cell2.png)
:width:`600px`
:label:`rnn_simple_cell2`
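The recurrence in :eqref:`aligned` amounts to one matrix-vector product per time step over the concatenated vector $[{x}_t; {h}_{t-1}]$. A tiny sketch with made-up one-dimensional input and hidden state (the equation above is linear; practical cells usually also wrap the result in a nonlinearity such as tanh):

```python
def rnn_step(W, b, x_t, h_prev):
    """One cell update: h_t = W [x_t; h_{t-1}] + b."""
    v = x_t + h_prev  # list concatenation plays the role of [x_t; h_{t-1}]
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi
            for row, bi in zip(W, b)]

def rnn_run(W, b, xs, h0):
    """Carry the hidden state across the whole input sequence."""
    h = h0
    for x_t in xs:
        h = rnn_step(W, b, x_t, h)
    return h

# Hypothetical 1-D input and 1-D hidden state.
W, b = [[0.5, 0.5]], [0.0]
h_final = rnn_run(W, b, [[1.0], [1.0]], [0.0])
```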
However, this simple vanilla recurrent neural network suffers from a severe information forgetting problem. For example, if the input is "I am Chinese, my native language is ___," the hidden state remembers the information about "Chinese," enabling the network to predict the word "Chinese (language)" at the end. But when the sentence is very long, the hidden state may not remember information from too long ago. For instance, "I am Chinese, I went to study in the UK, then worked in France, my native language is ___"---at this point, the information about "Chinese" in the final hidden state may have been forgotten due to multiple updates.
To address this problem, various improved methods have been proposed, the most famous being Long Short-Term Memory (LSTM) :cite:`Hochreiter1997lstm`. There are many more sequential models, such as the Transformer :cite:`vaswani2017attention` that emerged in recent years.

# Components of Hardware Accelerators
A hardware accelerator typically comprises multiple on-chip caches and
various types of arithmetic units. In this section, we'll examine the
fundamental components of hardware accelerators, using the Nvidia Volta
GPU architecture as a representative example.
## Architecture of Accelerators
Contemporary graphics processing units (GPUs) offer remarkable computing
speed, ample memory storage, and impressive I/O bandwidth. A top-tier
GPU frequently surpasses a conventional CPU by housing double the number
of transistors, boasting a memory capacity of 16 GB or greater, and
operating at frequencies reaching up to 1 GHz. The architecture of a GPU
comprises streaming processors and a memory system, interconnected
through an on-chip network. These components can be expanded
independently, allowing for customized configurations tailored to the
target market of the GPU.
Figure :numref:`ch06/ch06-gv100` illustrates the architecture of the
Volta GV100. This architecture has:
![Volta GV100](../img/ch06/V100.png)
:label:`ch06/ch06-gv100`
1. 6 GPU processing clusters (GPCs), each containing:
1. 7 texture processing clusters (TPCs), each containing two
streaming multiprocessors (SMs).
2. 14 SMs.
2. 84 SMs, each containing:
1. 64 32-bit floating-point arithmetic units
2. 64 32-bit integer arithmetic units
3. 32 64-bit floating-point arithmetic units
4. 8 Tensor Cores
5. 4 texture units
3. 8 512-bit memory controllers.
As shown in Figure :numref:`ch06/ch06-gv100`, a GV100 GPU contains 84 SMs (Streaming
Multiprocessors), 5376 32-bit floating-point arithmetic units, 5376
32-bit integer arithmetic units, 2688 64-bit floating-point arithmetic
units, 672 Tensor Cores, and 336 texture units. A pair of memory
controllers controls an HBM2 DRAM stack. Different vendors may use
different configurations (e.g., Tesla V100 has 80 SMs).
## Memory Units
The memory units of a hardware accelerator resemble a CPU's memory
controller. However, they encounter a bottleneck when retrieving data
from the computer system's DRAM, as it is slower compared to the
processor's computational speed. Without a cache for quick access, the
DRAM bandwidth becomes inadequate to handle all transactions of the
accelerator. Consequently, if program instructions or data cannot be
swiftly retrieved from the DRAM, the accelerator's efficiency diminishes
due to prolonged idle time. To tackle this DRAM bandwidth issue, GPUs
employ a hierarchical design of memory units. Each type of memory unit
offers its own maximum bandwidth and latency. To fully exploit the
computing power and enhance processing speed, programmers must select
from the available memory units and optimize memory utilization based on
varying access speeds.
1. **Register file**: Registers serve as the swiftest on-chip memories.
In contrast to CPUs, each SM in a GPU possesses tens of thousands of
registers. Nevertheless, excessively utilizing registers for every
thread can result in a reduced number of thread blocks that can be
scheduled within the SM, leading to fewer executable threads. This
underutilization of hardware capabilities hampers performance
considerably. Consequently, programmers must judiciously determine
the appropriate number of registers to employ, taking into account
the algorithm's demands.
2. **Shared memory**: The shared memory is a level-1 cache that is
user-controllable. Each SM features a 128 KB level-1 cache, with the
ability for programmers to manage up to 96 KB as shared memory. The
shared memory offers a low access latency, requiring only a few
dozen clock cycles, and boasts an impressive bandwidth of up to 1.5
TB/s. This bandwidth is significantly higher than the peak bandwidth
of the global memory, which stands at 900 GB/s. In high-performance
computing (HPC) scenarios, engineers must possess a thorough
understanding of how to leverage shared memory effectively.
3. **Global memory**: Both GPUs and CPUs are capable of reading from
and writing to global memory. Global memory is visible and
accessible by all threads on a GPU, whereas other devices like CPUs
need to traverse buses like PCIe and NV-Link to access the global
memory. The global memory represents the largest memory space
available in a GPU, with capacities reaching over 80 GB. However, it
also exhibits the longest memory latency, with a load/store latency
that can extend to hundreds of clock cycles.
4. **Constant memory**: The constant memory is a virtual address space
in the global memory and does not occupy a physical memory block. It
serves as a high-speed memory, specifically designed for rapid
caching and efficient broadcasting of a single value to all threads
within a warp.
5. **Texture memory**: Texture memory is a specialized form of global
memory that is accessed through a dedicated texture cache to enhance
performance. In earlier GPUs without caches, the texture memory on
each SM served as the sole cache for data. However, the introduction
of level-1 and level-2 caches in modern GPUs has rendered the
texture memory's role as a cache obsolete. The texture memory proves
most beneficial in enabling GPUs to execute hardware-accelerated
operations while accessing memory units. For instance, it allows
arrays to be accessed using normalized addresses, and the retrieved
data can be automatically interpolated by the hardware.
Additionally, the texture memory supports both hardware-accelerated
bilinear and trilinear interpolation for 2D and 3D arrays,
respectively. Moreover, the texture memory facilitates automatic
handling of boundary conditions based on array indices. This means
that operations on array elements can be carried out without
explicit consideration of boundary situations, thus avoiding the
need for extra conditional branches in a thread.
## Compute Units
Hardware accelerators offer a variety of compute units to efficiently
handle various neural networks.
Figure :numref:`ch06/ch06-compute-unit` demonstrates how different
layers of neural networks select appropriate compute units.
![Compute units](../img/ch06/compute_unit.png)
:label:`ch06/ch06-compute-unit`
1. **Scalar Unit**: calculates one scalar element at a time, similar to
the standard reduced instruction set computer (RISC).
2. **1D Vector Unit**: computes multiple elements at a time, similar to
the SIMD used in traditional CPU and GPU architectures. It has been
widely used in HPC and signal processing.
3. **2D Matrix Unit**: computes a matrix-vector inner product or a
vector outer product within one operation. It reuses data to reduce
communication costs and memory footprint, yielding high performance
for matrix multiplication.
4. **3D Cube Unit**: completes a matrix multiplication within one
operation. Specially designed for neural network applications, it
reuses data to compensate for the gap between data communication
bandwidth and computing throughput.
The compute units on a GPU mostly include Scalar Units and 3D Cube
Units. As shown in Figure :numref:`ch06/ch06-SM`, each SM has 64 32-bit floating-point
arithmetic units, 64 32-bit integer arithmetic units, 32 64-bit
floating-point arithmetic units, which are Scalar Units, and 8 Tensor
Cores, which are 3D Cube Units specially designed for neural network
applications.
![Volta GV100 SM](../img/ch06/SM.png)
:label:`ch06/ch06-SM`
A Tensor Core is capable of performing one $4\times4$ matrix
multiply-accumulate operation per clock cycle, as shown in
Figure :numref:`ch06/ch06-tensorcore`.
```
D = A * B + C
```
![Tensor Core's $4\times4$ matrix multiply-accumulate operation](../img/ch06/tensor_core.png)
:label:`ch06/ch06-tensorcore`
$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices.
Input matrices $\bf{A}$ and $\bf{B}$ are FP16 matrices, and accumulation
matrices $\bf{C}$ and $\bf{D}$ can be either FP16 or FP32 matrices.
Tesla V100's Tensor Cores are programmable matrix multiply-accumulate
units that can deliver up to 125 Tensor Tera Floating-point Operations
Per Second (TFLOPS) for training and inference applications, resulting
in a ten-fold increase in computing speed when compared with common FP32
compute units.
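The 125 TFLOPS figure can be sanity-checked with back-of-the-envelope arithmetic. The SM and Tensor Core counts come from the text above; the ~1.53 GHz boost clock is an assumed value for this estimate:

```python
# Back-of-the-envelope check of the 125 TFLOPS Tensor Core figure.
# Assumptions: Tesla V100 with 80 SMs and 8 Tensor Cores per SM; each
# core performs one 4x4x4 multiply-accumulate per clock (64 MACs = 128
# floating-point operations); ~1.53 GHz boost clock (assumed value).
sms = 80
cores_per_sm = 8
flops_per_core_per_clock = 4 * 4 * 4 * 2  # 64 MACs, 2 ops each
clock_hz = 1.53e9
tflops = sms * cores_per_sm * flops_per_core_per_clock * clock_hz / 1e12
```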
## Domain Specific Architecture
![Da Vinci architecture](../img/ch06/davinci_architecture.png)
:label:`ch06/ch06-davinci_architecture`
Domain Specific Architecture (DSA) has been an area of interest in
meeting the fast-growing demand for computing power by deep neural
networks. As a typical DSA design targeting image, video, voice, and
text processing, neural network processing units (or namely deep
learning hardware accelerators) are system-on-chips (SoCs) containing
special compute units, large memory units, and the corresponding control
units. A neural processing unit, for example, Ascend chip, typically
consists of a control CPU, a number of AI computing engines, multi-level
on-chip caches or buffers, and the digital vision pre-processing (DVPP)
module.
The computing core of AI chips is composed of AI Core, which is
responsible for executing scalar- and tensor-based arithmetic-intensive
computing. Consider the Ascend chip as an example. Its AI Core adopts
the Da Vinci architecture.
Figure :numref:`ch06/ch06-davinci_architecture` shows the architecture
of an AI Core, which can be regarded as a simplified version of modern
microprocessor architecture from the control perspective. It includes
three types of basic computing units: Cube Unit, Vector Unit, and Scalar
Unit. These units are used to compute on tensors, vectors, and scalars,
respectively, in three independent pipelines centrally scheduled through
the system software to coordinate with each other for higher efficiency.
Similar to GPU designs, the Cube Unit functions as the computational
core of the AI Core and delivers parallel acceleration for matrix
multiply-accumulate operations. Specifically, it can multiply two
$16\times16$ matrices in a single instruction --- equivalent to
completing 4096 (=$16\times16\times16$) multiply-accumulate operations
within an extremely short time --- with precision comparable to FP16
operations.

# Overview
An effective computer architecture is expected to be both
energy-efficient---quantified by the number of basic operations executed
per unit of energy---and versatile---defined by the range of tasks a
chip can undertake. We can evaluate these aspects by considering two
primary chip categories. The first includes general-purpose processors
like CPUs, capable of managing a diverse array of computing tasks,
though at the cost of lower energy efficiency, averaging around 0.1
TOPS/W. Conversely, application-specific integrated circuits (ASICs)
offer enhanced energy efficiency but have more restricted task
capabilities. With respect to chip design, general-purpose processors
have integrated various acceleration technologies such as superscalar,
single-instruction multi-data (SIMD), and single-instruction
multi-thread (SIMT) to boost their energy efficiency.
General-Purpose Graphics Processing Units (GPUs) achieve a respectable
equilibrium between energy efficiency and versatility. Modern GPUs
incorporate numerous optimization designs for vector, matrix, and tensor
computing. For instance, NVIDIA GPUs are equipped with Tensor Cores,
Transformer Cores, and Structure Sparsity Cores, which are specifically
designed to expedite the distinctive types of computation prevalent in
neural networks. Despite these enhancements, GPUs' requirement to
support a wide range of computing tasks results in larger footprints and
increased power consumption.
A promising solution to this challenge is deep learning hardware
accelerators. Notable examples include Google's Tensor Processing Units
(TPUs), Apple's Neural Processing Units (NPUs), and Huawei's Ascend
Chips. For instance, Google's TPU, a chip designed to expedite deep
learning computations, uses a systolic array to optimize matrix
multiplication and convolution operations, fully utilizing local data
with minimal memory access.

# Performance Optimization Methods
Hardware accelerators boast intricate computational and memory
architectures. To maximize their performance, developers frequently need
to grasp a variety of performance optimization methods. Common methods
encompass enhancing arithmetic intensity, capitalizing effectively on
shared memory, optimizing the memory load/store pipeline, among others.
The subsequent sections will elucidate these methods through practical
programming examples, all aimed towards a singular objective:
accelerating an FP32 GEMM program.
## Implementing General Matrix Multiplication
Code `lst:cpu` shows a reference implementation of GEMM in C++.
**lst:cpu**
```cpp
float A[M][K];
float B[K][N];
float C[M][N];
float alpha, beta;
for (unsigned m = 0; m < M; ++m) {
  for (unsigned n = 0; n < N; ++n) {
    float c = 0;
    for (unsigned k = 0; k < K; ++k) {
      c += A[m][k] * B[k][n];
    }
    C[m][n] = alpha * c + beta * C[m][n];
  }
}
```
Each element in matrix $C$ is independently computed, and numerous GPU
threads can be launched to compute the corresponding elements in matrix
$C$ in parallel. The GPU kernel function is shown in
Code `lst:gpu`.
**lst:gpu**
```cpp
__global__ void gemmKernel(const float * A,
                           const float * B, float * C,
                           float alpha, float beta, unsigned M, unsigned N,
                           unsigned K) {
  unsigned int m = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int n = threadIdx.y + blockDim.y * blockIdx.y;
  if (m >= M || n >= N)
    return;
  float c = 0;
  for (unsigned k = 0; k < K; ++k) {
    c += A[m * K + k] * B[k * N + n];
  }
  c = c * alpha;
  float result = c;
  if (beta != 0) {
    result = result + C[m * N + n] * beta;
  }
  C[m * N + n] = result;
```
Figure :numref:`cuda_naive_gemm` shows the layout of the implementation.
Each element in matrix $C$ is computed by one thread. The row index $m$
and column index $n$ of the element in matrix $C$ corresponding to the
thread are computed in lines 5 and 6 of the GPU kernel. Then, in lines 9
to 11, the thread loads the row vector in matrix $A$ according to the
row index and the column vector in matrix $B$ according to the column
index, and computes the vector inner product. The thread then stores the
result back to matrix $C$ in line 18.
![Simple implementation of GEMM](../img/ch06/practise/naive.png)
:label:`cuda_naive_gemm`
The method of launching the kernel function is shown in
Code `lst:launch`.
**lst:launch**
```cpp
void gemmNaive(const float *A, const float *B, float *C,
               float alpha, float beta, unsigned M,
               unsigned N, unsigned K) {
  dim3 block(16, 16);
  dim3 grid((M - 1) / block.x + 1, (N - 1) / block.y + 1);
  gemmKernel<<<grid, block>>>(A, B, C, alpha, beta, M, N, K);
}
```
Each thread block processes $16\times16$ elements in matrix $C$.
Therefore, $((M - 1) / 16 + 1) \times ((N - 1) / 16 + 1)$ thread blocks
(with integer division) are used to compute the entire matrix $C$.
Eigen is used to generate data and compute the GEMM result on the CPU.
In addition, error computing and time profiling code are implemented for
the GPU computing result. For details, see
[first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu).
After the program is compiled and executed, output results are as
follows:
```
Average time: 48.961 ms
Max error: 0.000092
```
The peak GPU throughput can be approximated by using the following
formula: 2 $\times$ Frequency $\times$ Number of single-precision
compute units. The number of single-precision compute units equals the
number of SMs in the GPU multiplied by the number of single-precision
compute units in each SM. The results are as follows:
```
FP32 peak throughput 29767.680 GFLOPS
Average Throughput: 185.313 GFLOPS
```
A significant gap exists between the performance that can be achieved by
the current code and the peak device performance. In the overall
computation, the step with the highest computational density is the
matrix multiplication $A\times B$. Its time complexity is $O(M*N*K)$,
whereas the time complexity of the overall computation is
$O(M*N*K+2*M*N)$. Therefore, optimizing matrix multiplication is key to
improving performance.
## Enhancing Arithmetic Intensity
Arithmetic intensity is the ratio of computational instructions to
load/store instructions. Modern GPUs typically have numerous compute
units, constrained only by a limited load/store bandwidth. This
limitation often leaves these units waiting for data loading in a
program. Thus, boosting arithmetic intensity is a crucial step to
improve program performance.
In the GPU kernel function discussed previously, we can approximate its
arithmetic intensity by dividing the total number of floating-point
operations by the number of data reads. When calculating the inner
product within $K$ loops, floating-point multiplication and addition
operations occur each time elements from matrix $A$ and $B$ are loaded.
Consequently, the arithmetic intensity is 1, derived from two 32-bit
floating-point operations divided by two 32-bit data load/store
instructions.
In the original code, each thread handles one element in matrix $C$,
computing the inner product of a row in matrix $A$ and a column in
matrix $B$. In essence, we can elevate the arithmetic intensity by
amplifying the elements in matrix $C$ that each thread can process,
computing the inner product of multiple rows in matrix $A$ and multiple
columns in matrix $B$. More specifically, if $m$ elements in matrix $A$
and $n$ elements in matrix $B$ are loaded concurrently while calculating
the inner product in $K$ loops, there are $m+n$ 32-bit load/store
instructions and $2mn$ 32-bit computational instructions. Hence, the
arithmetic intensity becomes $\frac{2mn}{m+n}$. Therefore, by increasing
$m$ and $n$, we can optimize the arithmetic intensity.
In the preceding section, a `float` pointer was employed to access
global memory and store data in it, utilizing the hardware instructions
`LDG.E` and `STG.E`. Multiple `float` elements can be loaded
concurrently using the 128-bit wide instructions `LDG.E.128` and
`STG.E.128`. These wide instructions can streamline the instruction
sequence, potentially saving dozens of instruction issue cycles compared
to four standard instructions, thereby enabling the issue of more
computational instructions within the saved time. Wide instructions can
also enhance the cache line hit rate. Despite these benefits, we advise
against the blanket use of wide instructions in all code. Instead,
programmers should prioritize direct optimization methods, such as
parallel design and local data reuse.
A specific implementation is stacking four `float` numbers to form a
128-bit `float4` class. The load/store operations will be completed
using a wide instruction for the `float4` class. For details about the
code implementation, see
[util.cuh](https://github.com/openmlsys/openmlsys-cuda/blob/main/util.cuh).
Note that each thread needs to load four `float` numbers (instead of
one) from matrix $A$ and matrix $B$, requiring each thread to process
$4\times 4$ blocks (`thread tile`) in matrix $C$. Each thread loads data
from matrix $A$ and matrix $B$ from left to right and from top to
bottom, computes the data, and stores the data to matrix $C$, as shown
in Figure :numref:`use_float4`.
![Enhancing arithmetic intensity](../img/ch06/practise/use_float4.png)
:label:`use_float4`
For details about the complete code, see
[gemm_use_128.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_128.cu).
We can further increase the amount of data processed by each thread in
order to improve the arithmetic intensity more, as shown in
Figure :numref:`use_tile`. For
details about the code used to achieve this, see
[gemm_use_tile.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_tile.cu).
![Further enhancement of the arithmetic intensity by adding matrix blocks processed by each thread](../img/ch06/practise/use_tile.png)
:label:`use_tile`
The test results are as follows:
```
Max Error: 0.000092
Average Time: 6.232 ms, Average Throughput: 1378.317 GFLOPS
```
To sample and analyze performance indicators, we will use the analysis
tool Nsight Compute released by NVIDIA. This tool, designed for GPU
kernel functions, samples and collects GPU activity data by hooking
drivers. The following commands can be used to analyze the performance:
```bash
ncu --set full -o <profile_output_file> <profile_process>
```
`--set full` indicates that all data is sampled. `-o` indicates that the
result is output as a file. `<profile_output_file>` indicates the output
file name without the file name extension. `<profile_process>` indicates
the executable file to be analyzed and its arguments. For example, to
analyze `first_attempt` and name the output result
`first_attempt_prof_result`, run the following command:
```
ncu --set full -o first_attempt_prof_result ./first_attempt
```
If the system displays a message indicating that you do not have
permission to run this command, prefix it with `sudo` and run it again.
After obtaining the output file, the program `nv-nsight-cu` can be used
to view the file. We compared the profiling results of the new GPU
kernel function and the previous one.
The result shows that the number of `LDG` instructions decreases by 84%,
and the value of `Stall LG Throttle` decreases by 33%. By using wide
instructions to increase the compute density, we are able to reduce the
number of global load/store instructions, thereby cutting the amount of
time needed to wait before issuing instructions. The improvement on
`Arithmetic Intensity` proves that our analysis of the arithmetic
intensity is correct. The `gemm_use_tile.cu` test results are as follows:
```
Max Error: 0.000092
Average Time: 3.188 ms, Average Throughput: 2694.440 GFLOPS
```
The analysis using Nsight Compute shows that the code can also improve
other indicators, such as `Stall LG Throttle`.
## Caching Data in Shared Memory
By increasing the amount of data that a thread can load in one go, we
can improve the arithmetic intensity and performance. However, this
method decreases the degree of parallelism because it reduces the total
number of enabled threads. Other hardware features need to be exploited
in order to improve performance without compromising the degree of
parallelism. In earlier code, several thread blocks are enabled, each of
which processes one or more matrix blocks in matrix $C$. As shown in
Figure :numref:`duplicated_data`, thread $x$ and thread $y$ process the same
row in matrix $C$, so they load the same data from matrix $A$. The
shared memory can be used to improve the program throughput by enabling
different threads in the same thread block to load unique data and reuse
shared data.
![Threads loading redundant data](../img/ch06/practise/duplicated_data.png)
:label:`duplicated_data`
We have previously mentioned that the inner product can be computed by
loading and accumulating data in $K$ loops. Specifically, in each loop,
threads that process the same row in matrix $C$ load the same data from
matrix $A$, and threads that process the same column in matrix $C$ load
the same data from matrix $B$. However, the code needs to be optimized
by dividing $K$ loops into $\frac{K}{tileK}$ outer loops and $tileK$
inner loops. In this way, an entire block of data is loaded in each
outer loop and accumulated in each inner loop.
Figure :numref:`use_smem_store` shows the process of moving data from the
global memory to the shared memory. At the start of each outer loop, the
entire tiles of matrix $A$ and matrix $B$ are stored in the shared
memory.
Figure :numref:`use_smem_load` shows the process of moving data from the
shared memory to the register. In each inner loop, data is loaded from
the shared memory and computed. An advantage of this design is that each
thread does not need to load all the data it requires from the global
memory. Instead, the entire thread block loads the data required for all
threads from the global memory and stores the data in the shared memory.
During computational processes, each thread only needs to load the data
it requires from the shared memory.
![Writing data to the shared memory](../img/ch06/practise/use_smem_store.png)
:label:`use_smem_store`
![Loading data from the shared memory](../img/ch06/practise/use_smem_load.png)
:label:`use_smem_load`
For details about the complete code, see
[gemm_use_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_smem.cu).
The test results are as follows:
```
Max Error: 0.000092
Average Time: 0.617 ms, Average Throughput: 13925.168 GFLOPS
```
Again, we use Nsight Compute to profile the kernel function and compare
the results with the previous ones. The analysis shows some major
improvements. Specifically, the number of `LDG` instructions decreases
by 97%, which is consistent with this design, and the value of
`SM Utilization` increases by 218%, which proves that using the shared
memory can reduce the memory access latency and improve the memory
utilization. Furthermore, the performance of other indicators such as
`Pipe Fma Cycles Active` also improves significantly, demonstrating the
benefits of the shared memory.
## Reducing Register Usage
In previous sections, the data blocks that store matrix $A$ in the
shared memory are arranged in a row-first manner, and the shared memory
is loaded by row. We can instead adopt a column-first manner in order to
reduce loops and loop variables, thereby reducing the number of
registers and improving performance.
For details about the complete code, see
[gemm_transpose_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_transpose_smem.cu).
The test results are as follows:
```
Max Error: 0.000092
Average Time: 0.610 ms, Average Throughput: 14083.116 GFLOPS
```
Analysis by Nsight Compute shows that `Occupancy` increases by 1.3%.
This is because only 111 registers are used (17 fewer than used by the
previous GPU kernel function). The benefit of reducing the number of
registers varies depending on the GPU architecture. Observations have
shown that the number of `STS` instructions increases and bank conflicts
occur, meaning that using fewer registers may not have a positive impact
on other GPU architectures.
## Hiding Shared Memory Loading Latency
To load data from the shared memory, a GPU uses the `LDS` instruction.
After issuing this instruction, the GPU will execute the following
instructions without waiting for the data to be loaded to the register
unless the instructions require such data. In the previous section, each
time this instruction is issued during $tileK$ inner loops, the
mathematical operation that requires the loaded data is performed
immediately. However, the compute unit has to wait for the data to be
loaded from the shared memory, as shown in
Figure :numref:`use_smem_pipeline`. Accessing the shared memory may take
dozens of clock cycles, but computation instructions can often be
completed within only a few clock cycles. In order to significantly
accelerate memory access, we can hide the shared memory loading latency
by optimizing the pipeline. Specifically, during $tileK$ inner loops,
load instructions that prepare data for the next loop can be issued at
the beginning of each loop, as shown in
Figure :numref:`hide_smem_latency`. In this way, computation instructions in
the current operation do not require the data in the next loop. As such,
the execution of these computation instructions will not be blocked by
the instructions that load the data for the next loop.
![Pipeline of the previous GPU kernel function](../img/ch06/practise/use_smem_pipeline.png)
:label:`use_smem_pipeline`
![Pipeline that hides the shared memory loading latency](../img/ch06/practise/hide_smem_latency.png)
:label:`hide_smem_latency`
For details about the complete code, see
[gemm_hide_smem_latency.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_hide_smem_latency.cu).
The test results are as follows:
```
Max Error: 0.000092
Average Time: 0.585 ms, Average Throughput: 14686.179 GFLOPS
```
Analysis by Nsight Compute shows that the value of
`Stall Short Scoreboard` decreases by 67% when compared with that of the
previous GPU kernel function. As mentioned before, after GPU memory
load/store instructions are issued, the GPU executes the next
instruction without waiting for the data to be landed in the register.
However, it will set a flag on the Scoreboard and reset the flag after
the data is landed. If instructions that require such data need to be
executed, the GPU will execute them only after the data is landed. The
decrease of `Stall Short Scoreboard` demonstrates that hiding the access
latency of the shared memory is an effective method to better utilize
the GPU.
## Hiding Global Memory Loading Latency
To load data from the global memory, a GPU uses the `LDG`
instruction, the behavior of which is similar to the `LDS` instruction
used to load data from the shared memory as discussed in the previous
section. At the beginning of each of the $\frac{K}{tileK}$ outer loops,
instructions that load the data tiles in matrix $A$ for the next loop
are issued. Because this data is not required by any inner loop in a
given outer loop, the computational processes in the inner loop will not
wait for the read instruction to be completed, thereby hiding the global
memory loading latency. We can also defer writing the data in `buffer`
to `tile` until the last inner-loop iteration, after the first $tileK - 1$
iterations have executed, further reducing the latency of writing data to
`tile`. Figure :numref:`hide_global_latency` shows the optimized pipeline.
![Pipeline that hides the global memory loading latency](../img/ch06/practise/hide_global_latency.png)
:label:`hide_global_latency`
For details about the complete code, see
[gemm_final.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_final.cu).
The test results are as follows:
```
Max Error: 0.000092
Average Time: 0.542 ms, Average Throughput: 15838.302 GFLOPS
```
Similar to the `Stall Short Scoreboard` results obtained in the previous
section, analysis by Nsight Compute shows that the value of
`Stall Long Scoreboard` (a global memory indicator) decreases by 67%.
Such a significant decrease demonstrates that prefetching data can hide
the global memory loading latency.
## Performance Optimization Principles
So far, we have discussed various methods to enhance the performance of
an accelerator. Even though other methods exist, the principles of
performance optimization generally adhere to the following:
- Increasing parallelism through resource mapping: Multi-level
parallel resources (`blocks`, `warps`, and `threads`) are mapped to
the data needing computation and transfer to enhance program
parallelism.
- Reducing memory access latency through memory structure
optimization: Based on the recognition of data reuse within the same
`block` during computation, the reused data is stored in local
memory (like shared memory and registers) to increase locality.
- Reducing the instruction issue overhead through optimizing
instruction execution: The `#pragma unroll` directive is used to
unroll loops in order to improve the degree of parallelism at the
instruction level and reduce logic judgment. The vectorized load
instruction is used to increase bandwidth. For the Ampere
architecture, the maximum vectorized load instruction is
`LDG.E.128`, and the data type for data loading is `float4`.
- Hiding load/store latency by optimizing the memory access pipeline:
In instances where the in-memory data undergoes modifications (such
as the movement of matrix data), we can optimize the memory access
pipeline. This way, the accelerator performs computations during the
intervals between data movement, thereby concealing the latency
associated with data movement.

# Programming Methods
:label:`Programming Principles for Hardware Accelerators`
The first two sections of this chapter primarily discuss the
significance, ideas, and basic principles behind the design of hardware
accelerators. Co-optimization of software and hardware, as an important
guiding principle for building efficient AI systems, requires mutual
influence and close coupling between software algorithms/stacks and
hardware architectures in neural network applications. In order to fully
leverage the advantages of accelerators, it is necessary to design a set
of programming methods based on the hardware system architecture.
## Method Classification
Programming methods for hardware accelerators are categorized into three
approaches: using high-level computation operators, harnessing
primitives for specialized hardware units, and employing low-level
assembly languages:
1. **High-level computation operators**: Hardware accelerators often
come equipped with high-level, hardware-accelerated implementations
of operators extensively used in numerical computing and deep
learning. For instance, NVIDIA provides cuBLAS (CUDA Basic Linear
Algebra Subprograms) and cuDNN (CUDA Deep Neural Network library).
These libraries offer developers an accessible way to harness the
power of NVIDIA GPUs without delving into low-level code. These
operators are optimized for efficiency and automatically exploit
specific GPU features, such as Tensor Cores.
2. **Primitives for task-specific hardware units**: Hardware
accelerators typically feature task-specific hardware units (like
the Tensor Cores in NVIDIA GPUs) engineered to execute
mixed-precision matrix multiplication operations at high speed.
These units have associated programming primitives, such as CUDA's
Warp Matrix Multiply Accumulate (WMMA) and primitives for
loading/unloading tensors on the units.
3. **Low-level assembly languages**: Hardware accelerators also have
low-level assembly language interfaces. For instance, NVIDIA GPUs
offer the PTX ISA (Parallel Thread Execution Instruction Set
Architecture). It provides explicit control over all aspects of GPU
behavior, but it requires a deep understanding of the GPU
architecture and is more challenging to use correctly and
effectively than the high-level interfaces provided by cuBLAS and
cuDNN. PTX code is typically generated by a compiler from a
high-level language like CUDA C++.
In essence, the above three methods operate at different levels of
abstraction. High-level operators like cuBLAS and cuDNN provide
easy-to-use interfaces to powerful hardware-accelerated operations,
while the primitives provided by task-specific hardware units provide a
more detailed interface to hardware operations, and low-level assembly
languages like PTX ISA provide the most detailed, low-level control over
accelerator behavior.
## Programming Examples
We exemplify different programming methods by implementing the General
Matrix Multiplication (GEMM) with each approach. The implementation
targets an NVIDIA Volta GPU. GEMM follows the equation
$\bf{C} = \alpha \bf{A}\times \bf{B} + \beta \bf{C}$, where
$\bf{A}\in\mathbb{R}^{M\times K}, \bf{B}\in\mathbb{R}^{K\times N}, \bf{C}\in\mathbb{R}^{M\times N}$,
and $\alpha$ and $\beta$ are parameters provided by users.
### High-level Computation Operators
:label:`sec-accelerator-use-cublas`
Using an operator acceleration library directly is the most
straightforward method. NVIDIA offers two types of operator libraries:
cuBLAS and cuDNN. cuBLAS provides an interface for leveraging Tensor
Cores to accelerate GEMM operations, while cuDNN offers an interface to
accelerate neural network operations. To utilize Tensor Cores for GEMM
via cuBLAS, we can use the function `cublasGemmEx`, whose signature is
shown in Code `lst:cublasGemmEx`.
**lst:cublasGemmEx**
```cpp
cublasStatus_t cublasGemmEx(cublasHandle_t handle,
                            cublasOperation_t transa, cublasOperation_t transb,
                            int m, int n, int k,
                            const void *alpha,
                            const void *A, cudaDataType_t Atype, int lda,
                            const void *B, cudaDataType_t Btype, int ldb,
                            const void *beta,
                            void *C, cudaDataType_t Ctype, int ldc,
                            cublasComputeType_t computeType,
                            cublasGemmAlgo_t algo)
```
`handle` is the cuBLAS handle, which is created using the `cublasCreate`
function. `transa` denotes whether the matrix $\bf{A}$ is transposed,
while `transb` denotes whether the matrix $\bf{B}$ is transposed. `m`,
`n`, and `k` describe the shape of the
matrices. `alpha` and `beta` are used to scale the matrix multiplication
results. `A`, `B`, and `C` are pointers to the starting addresses of the
matrices. `Atype`, `Btype`, and `Ctype` describe the data type of the
matrices. For example, `CUDA_R_16F` indicates that the data is stored in
real 16-bit floating point type. `lda`, `ldb`, and `ldc` represent the
leading dimensions of the matrices. `computeType` is the data type used
in computation. For instance, `CUBLAS_COMPUTE_16F` implies the use of
Tensor Cores for computation in 16-bit floating point. Notably, if the
input data type is 32-bit float, we can use
`CUBLAS_COMPUTE_32F_FAST_16F` to perform the computation in 16-bit
floating point and achieve acceleration using Tensor Cores. `algo` is
the algorithm used in computation, and `CUBLAS_GEMM_DEFAULT` is commonly
used to select the default algorithm.
### Primitives for Hardware Units
The second approach to accelerator programming involves the use of
programming primitives, such as invoking the CUDA Warp Matrix Multiply
Accumulate (WMMA) API on a device. This approach hinges on the
collaborative design of software and hardware, meaning that the design
of programming APIs at this level is architecture-dependent. For
instance, in the Volta architecture, the control object of WMMA is a
$16\times16$ matrix block, processed by two Tensor Cores at a time. This
notion is tightly linked to the integration of Tensor Cores into a SM.
In the Volta architecture, NVIDIA offers three distinct sizes of WMMA
multiply-accumulate computing interfaces for FP16 input data:
$16\times16\times16$, $32\times8\times16$, and $8\times32\times16$.
The basic control unit of the WMMA API is a fragment, which refers to a
template class that specifies information such as the meaning of
matrices (multiplier or accumulator), matrix shape
(`WMMA_M, WMMA_N, or WMMA_K`), data type (FP16, FP32, etc.), and layout
(`row_major or col_major`).
Code `lst:fragment` shows the fragment types.
**lst:fragment**
```cpp
wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;
```
The data of the matrix blocks required by multiplication operations
needs to be loaded into registers as fragments. Fragments are first
initialized or cleared; after Tensor Cores perform the
multiply-accumulate operations, the fragments are stored back to global
memory. NVIDIA provides the `wmma::load_matrix_sync()` and
`wmma::store_matrix_sync()` interfaces to load or write the submatrix
blocks. The `wmma::fill_fragment()` interface is used to initialize the
data of the corresponding fragments, and the `wmma::mma_sync()`
interface is used to perform multiply-accumulate operations on
fragments.
### Low-level Assembly Language Interface
The PTX ISA offers another programming interface, for example, the
`mma.sync.aligned.m8n8k4` instruction in the Volta architecture. This
instruction uses the shape configuration of $M=8, N=8, K=4$ to perform
multiply-add operations. The basic control unit of the API is the data
element. The matrix size (modifier `.m8n8k4`), data format (modifier
`.row` or `.col`) and data formats of input accumulator D, matrix A,
matrix B, and output accumulator C (modifier `.f32` or `.f16`) need to
be specified. NVIDIA's documentation[^1] provides information about
using the PTX instruction set, helping programmers write code based on
the corresponding syntax rules, as shown in
Code `lst:ptx`.
**lst:ptx**
```cpp
half_t *a, *b;
float *C, *D;
unsigned const* A = reinterpret_cast<unsigned const*>(a);
unsigned const* B = reinterpret_cast<unsigned const*>(b);
asm volatile(
"mma.sync.aligned.m8n8k4.row.row.f32.f16.f16.f32 "
"{%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, "
"{%12,%13,%14,%15,%16,%17,%18,%19};\n"
: "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3]), "=f"(D[4]),
"=f"(D[5]), "=f"(D[6]), "=f"(D[7])
: "r"(A[0]), "r"(A[1]), "r"(B[0]), "r"(B[1]), "f"(C[0]),
"f"(C[1]), "f"(C[2]), "f"(C[3]), "f"(C[4]), "f"(C[5]),
"f"(C[6]), "f"(C[7]));
```
Data elements are directly used as the input (`unsigned` type is used
for containing FP16 data elements). Moreover, NVIDIA provides the
`ldmatrix` instruction to load data from the shared memory to fragments.
A finer-grained instruction, `mma`, can form a warp-level WMMA API of
more diversified shapes to control the mapping between threads and data
in the warp. The PTX instructions offer greater flexibility than
directly using CUDA C++ code.
[^1]: available at
<https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html>

# Hardware Accelerator
In the field of AI frameworks, hardware accelerators play a vital role
in enabling efficient neural network computations. This chapter delves
into the design of modern hardware accelerators, their programming
techniques, and the typical approaches to optimize accelerator
performance.
This chapter has the following learning objectives:
1. Understand the architecture of a modern hardware accelerator.
2. Understand the methods of programming hardware accelerators.
3. Understand the typical techniques used to optimize the performance
of accelerators.
```toc
:maxdepth: 2
Overview
Components_of_Hardware_Accelerators
Programming_Methods
Performance_Optimization_Methods
Chapter_Summary
```

# Chapter Summary
1. Hardware accelerators offer various types of on-chip caches and
computational units, enhancing the performance of deep learning
computational tasks.
2. To fully exploit the performance potential of hardware accelerators,
it's necessary to implement programmable hardware accelerators,
bringing architectural innovation.
3. To balance computational efficiency and usability, the programming
methods for hardware accelerators range from high-level computation
operators to harnessing the primitives associated with hardware
units, and to using low-level assembly languages.
4. A variety of methods are crucial to optimize accelerator
performance, which include enhancing arithmetic intensity, caching
data in shared memory, and concealing data store/load latency.

## Computation Scheduling and Execution
After operator selection and memory allocation, computation tasks can be scheduled and executed on hardware through the runtime. Depending on whether operators are compiled into a computational graph, computation scheduling can be divided into two approaches: single-operator scheduling and graph scheduling. For example, MindSpore provides the PyNative mode and Graph mode respectively. Furthermore, depending on the hardware capabilities, the execution of computational graphs can be divided into two modes: interactive execution, where operators are dispatched and executed one by one, and sink execution, where the entire computational graph or partial subgraphs are dispatched to the hardware at once.
### Single-Operator Scheduling
Single-operator scheduling, as opposed to graph-based scheduling, means that operators contained in algorithms or models are scheduled and executed one by one through the Python runtime. Examples include PyTorch's default execution mode, TensorFlow's eager mode, and MindSpore's PyNative mode. Taking MindSpore as an example, the code is shown below.
```python
import mindspore.nn as nn
from mindspore import context

# Preset single-operator (PyNative) execution mode
context.set_context(mode=context.PYNATIVE_MODE)

class Computation(nn.Cell):
    def construct(self, x, y):
        m = x * y
        n = x - y
        print(m)
        z = m + n
        return z

compute = Computation()
c = compute(1, 2)
print(c)
```
The above script defines all computation logic in the `construct` method of the `Computation` class. Since single-operator execution mode is preset in the context at the beginning of the script, the computations in `construct` will be called and executed line by line through the Python runtime, and `print` commands can be inserted at any position in the code to print intermediate computation results.
The call chain for single-operator execution is shown in :numref:`single_op_exec`. After an operator is triggered for execution on the Python side, it goes through the machine learning framework initialization, which determines information including the operator's precision, input and output types and sizes, and the corresponding hardware device. Then the framework allocates the memory required for computation, and finally hands it over to the specific hardware computing device to complete the execution.
![Single-Operator Execution](../img/ch05/single_op_exec.PNG)
:width:`800px`
:label:`single_op_exec`
The advantage of single-operator scheduling lies in its flexibility. Since operators are directly scheduled through the Python runtime, it can express arbitrarily complex computation logic, especially in scenarios requiring complex control flow and Python native data structures to implement complex algorithms. Additionally, single-operator scheduling is very convenient for debugging program correctness, as developers can print any variable that needs to be debugged during code execution. Finally, by driving operators through the Python runtime, computation tasks can be completed in coordination with Python's vast and rich ecosystem of libraries.
### Graph Scheduling
Although single-operator scheduling has the advantages described above, its disadvantages are also obvious. On one hand, it is difficult to optimize computation performance, because without global information from the computational graph, single-operator execution cannot perform optimizations such as operator fusion and algebraic simplification based on context. On the other hand, due to the lack of topological relationships in the computation, the entire computation can only be scheduled and executed serially, meaning that parallel computation cannot be achieved through the runtime. For example, the computation logic of the above sample code can be expressed as shown in :numref:`graph_exec`. From this computational graph, we can see that there is no dependency between the multiplication and subtraction operations, so these two computations can be executed in parallel. Such parallel execution information can only be analyzed after the computation is expressed as a computational graph, which is one of the advantages of graph scheduling over single-operator scheduling.
![Computational Graph](../img/ch05/graph_exec.png)
:width:`800px`
:label:`graph_exec`
Now let us introduce the scheduling methods for computational graphs. In a typical heterogeneous computing environment, there are multiple types of computing devices such as CPUs, GPUs, and NPUs. Therefore, a computational graph can be composed of operators running on different devices, forming a heterogeneous computational graph. :numref:`computation_graph` shows a typical computational graph involving heterogeneous hardware.
![Heterogeneous Hardware Computational Graph](../img/ch05/computation_graph.png)
:width:`800px`
:label:`computation_graph`
The computational graph described above consists of operators corresponding to the following types of heterogeneous hardware:
- **CPU Operators**: Operators written in C++ and executed on the host via the CPU. The performance of CPU computation depends on whether the multi-core computing capability of the CPU can be fully utilized.
- **GPU Operators**: Taking NVIDIA GPU chips as an example, GPU Kernels are dispatched one by one from the host side to the GPU device, where the GPU chip executes the operator's computation logic. Due to the large number of parallel execution units on the chip, it can provide powerful acceleration capabilities for highly parallel algorithms.
- **NPU Operators**: Taking Huawei Ascend chips as an example, Ascend is a highly integrated SoC chip. The advantage of NPUs is their support for sinking part of or the entire computational graph into the chip to complete computation. During computation, there is no interaction with the host, resulting in higher computational performance.
- **Python Operators**: Similar to CPU operators in execution mode, both are executed by the host's CPU. The difference is that the computation logic is interpreted and executed by the Python runtime through the Python interpreter.
The prerequisite for correctly expressing a heterogeneous computational graph is to accurately identify the device on which each operator executes. Examples include the CPU, GPU, and Ascend Kernels identified in the heterogeneous computational graph in :numref:`computation_graph`, as well as the Python Kernels marked to be executed by the Python runtime. Mainstream frameworks all provide the capability to specify the device on which an operator runs. Taking MindSpore as an example, a simple heterogeneous computation code is shown below.
```python
import numpy as np
from mindspore import Tensor
import mindspore.ops.operations as ops
from mindspore.common.api import jit

# Create operators and specify the hardware device for execution
add = ops.Add().add_prim_attr('primitive_target', 'CPU')
sub = ops.Sub().add_prim_attr('primitive_target', 'GPU')

# Specify execution in static computational graph mode
@jit
def compute(x, y, z):
    r = add(x, y)
    return sub(r, z)

# Create arguments
x = Tensor(np.ones([2, 2]).astype(np.float32))
y = Tensor(np.ones([2, 2]).astype(np.float32))
z = Tensor(np.ones([2, 2]).astype(np.float32))

# Execute computation
output = compute(x, y, z)
```
The above code snippet completes the computation logic of x + y - z, where the Add operator is set to execute on the CPU and the Sub operator is set to execute on the GPU, forming CPU-GPU collaborative heterogeneous computation. Through a similar tagging mechanism, arbitrarily complex multi-hardware collaborative heterogeneous computation can be expressed.
Another relatively special type of heterogeneity involves Python operators. The advantages of Python lie in its flexibility of expression, development efficiency, and rich surrounding ecosystem. Therefore, introducing Python operators into the computational graph to collaborate with operators on other heterogeneous hardware greatly enhances computation flexibility. Unlike the heterogeneity where CPU and GPU execute on different devices, Python operators and CPU operators implemented in C++ are both executed by the host-side CPU cores. The difference is that Python operators are described through a unified computational graph and therefore also need to be triggered for execution in the backend runtime. To express Python operators in the computational graph, the framework needs to provide corresponding support.
After marking the devices corresponding to operators in the computational graph, the graph is ready to be scheduled and executed. Depending on hardware capabilities, the execution of heterogeneous computational graphs can be divided into three modes: operator-by-operator interactive execution, whole-graph sink execution, and subgraph sink execution. Interactive execution is mainly for CPU and GPU scenarios, where operators in the computational graph are scheduled and executed one by one according to the dependency relationships of inputs and outputs. Whole-graph sink execution is mainly for NPU chips, whose main advantage is the ability to dispatch the entire neural network's computational graph to the device at once, independently completing the scheduling and execution of all operators in the graph without relying on the host's CPU capability, reducing the number of interactions between host and chip, and improving computational efficiency and performance through the NPU's tensor acceleration capability. Subgraph sink execution combines the previous two execution modes. Due to the flexibility of computational graph expression itself, whole-graph sink execution on NPU chips may not achieve optimal efficiency for complex scenarios. Therefore, parts with low execution efficiency on NPU chips can be separated and handed over to devices with higher execution efficiency such as CPUs or GPUs, while subgraphs more suitable for NPU computation are sunk to the NPU for computation, thus balancing both performance and flexibility.
The above heterogeneous computational graph can serve two purposes. The first is heterogeneous hardware acceleration, placing specific computations on suitable hardware for execution. The second is achieving concurrent execution between operators. From the computational graph, we can see that there is no dependency between kernel_1 and kernel_2, nor between kernel_3 and kernel_4. Therefore, these two pairs of CPU and GPU operators can logically be invoked concurrently by the framework. However, kernel_5 depends on the outputs of kernel_3 and kernel_4 as its inputs, so kernel_5 needs to wait for kernel_3 and kernel_4 to complete before being triggered for execution.
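The triggering rule described here, where an operator becomes runnable once all of its inputs have completed, can be sketched as a reference-counted wavefront. The dependency table below is hypothetical, loosely following the figure:

```python
# Hypothetical dependency table: op name -> list of input ops.
# kernel_5 must wait for both kernel_3 and kernel_4, as in the text.
deps = {
    "kernel_1": [], "kernel_2": [],
    "kernel_3": ["kernel_1"], "kernel_4": ["kernel_2"],
    "kernel_5": ["kernel_3", "kernel_4"],
}

def ready_sets(deps):
    """Yield batches of ops whose inputs are all complete; ops within one
    batch have no mutual dependencies and may be invoked concurrently."""
    pending = {op: set(ins) for op, ins in deps.items()}
    done = set()
    while pending:
        batch = sorted(op for op, ins in pending.items() if ins <= done)
        for op in batch:
            del pending[op]
        done |= set(batch)
        yield batch

print(list(ready_sets(deps)))
# [['kernel_1', 'kernel_2'], ['kernel_3', 'kernel_4'], ['kernel_5']]
```

Each yielded batch is a set of operators the framework could dispatch concurrently; kernel_5 only appears once its two producers are done.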
Although concurrency relationships between operators can be fully expressed on the computational graph, in practice, some unexpected side effects may arise due to concurrency, as shown in the following code:
```python
import mindspore as ms
from mindspore import Parameter, Tensor
import mindspore.ops.operations as ops
from mindspore.common.api import jit

# Define global variables
x = Parameter(Tensor([1.0], ms.float32), name="x")
y = Tensor([0.2], ms.float32)
z = Tensor([0.3], ms.float32)

# Specify execution in static computational graph mode
@jit
def compute(y, z):
    ops.Assign()(x, y)
    ops.Assign()(x, z)
    r = ops.Sub()(x, y)
    return r

compute(y, z)
```
The above code expresses the following computation logic:
```text
x = y
x = z
x = x - y
```
This simple computation logic, when translated to the computational graph, can be represented as shown in :numref:`side_effect_1`.
![Concurrent Operator Execution](../img/ch05/side_effect_1.png)
:width:`800px`
:label:`side_effect_1`
There are no dependencies among the three computations shown in the code, so these three operators can logically be executed concurrently on the computational graph. However, based on the code semantics, it is obvious that the program needs to be executed sequentially. The issue introduced here is called a side effect, which refers to the behavior of modifying state variables defined outside the function. Due to the introduction of side effects, incorrect concurrency relationships occur. One solution is to add dependencies between operators during the computational graph compilation phase to convert concurrent execution logic into sequential execution logic. The transformed computational graph is shown in :numref:`side_effect_2`.
![Eliminating Side Effects](../img/ch05/side_effect_2.png)
:width:`800px`
:label:`side_effect_2`
The dashed arrows in the figure represent the dependency relationships between operators. After adding dependency relationships, the operators will execute serially in the order of Assign_1, Assign_2, Sub_1, which is consistent with the original code semantics.
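The compile-time fix described above can be sketched as a pass that chains state-touching operators together. This is a simplified illustration, not MindSpore's actual side-effect representation, and it is conservative (read-after-read pairs need no edge in practice):

```python
# Hypothetical op list: (op name, set of state variables it touches).
ops = [
    ("Assign_1", {"x"}),
    ("Assign_2", {"x"}),
    ("Sub_1",    {"x", "y"}),
]

def add_control_edges(ops):
    """Chain consecutive ops touching the same state variable with an
    explicit control edge so they execute in program order."""
    edges, last_touch = [], {}
    for name, touched in ops:
        for var in sorted(touched):          # sorted for determinism
            if var in last_touch:
                edge = (last_touch[var], name)
                if edge not in edges:
                    edges.append(edge)
            last_touch[var] = name
    return edges

print(add_control_edges(ops))
# [('Assign_1', 'Assign_2'), ('Assign_2', 'Sub_1')]
```

The resulting edges force exactly the Assign_1, Assign_2, Sub_1 order described above.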
### Interactive Execution
As described above, in interactive execution mode, the framework's runtime dispatches operators to the hardware for execution one by one according to the dependency relationships of operators in the computational graph, following a certain execution order (e.g., breadth-first order). To aid understanding and comparison, we first introduce the execution method for non-heterogeneous computational graphs (where all operators in the graph run on the same type of device), as heterogeneous computational graph execution is built upon non-heterogeneous graphs.
1. Execution of Non-Heterogeneous Computational Graphs
![Non-Heterogeneous Computational Graph](../img/ch05/graph_exec_1.png)
:width:`800px`
:label:`graph_exec_1`
As shown in :numref:`graph_exec_1`, this is a non-heterogeneous computational graph where all Kernels are GPU operators. The execution methods are generally divided into serial execution and parallel execution:
![Serial Execution](../img/ch05/graph_exec_2.png)
:width:`800px`
:label:`graph_exec_2`
![Parallel Execution](../img/ch05/graph_exec_3.png)
:width:`800px`
:label:`graph_exec_3`
- **Serial Execution**: The computational graph is unfolded into an execution sequence, and operators are executed serially one by one according to the execution order, as shown in :numref:`graph_exec_2`. Its characteristics include a fixed execution order, single-threaded execution, and relatively low system resource requirements.
- **Parallel Execution**: The computational graph is unfolded according to the dependency relationships between operators. Operators with dependencies maintain their execution order through input dependencies, while operators without dependencies can be executed in parallel, as shown in :numref:`graph_exec_3`. Kernel_1 and Kernel_2 have no dependencies and can execute in parallel, and Kernel_3 and Kernel_4 have no dependencies and can execute in parallel. Its characteristics include a non-fixed execution order (the order of operators executed in each round is likely to differ), multi-threaded execution, and relatively high system resource requirements.
Serial execution and parallel execution each have their advantages and disadvantages, summarized in :numref:`serial_vs_parallel`.
:Comparison of Serial Execution and Parallel Execution
| Execution Method | Serial Execution | Parallel Execution |
|--------------|----------|------|
|Operator Execution Order | Fixed | Non-fixed |
|Operator Execution Thread |Single-threaded | Multi-threaded |
|Required Execution Resources | Lower | Higher |
:label:`serial_vs_parallel`
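The parallel mode in the table can be sketched with a small dependency-driven dispatcher built on a thread pool. This is a hypothetical illustration, not framework code, and the kernel names and edges are assumptions loosely following the figure:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Hypothetical dependency table: Kernel_1/Kernel_2 are independent
# sources; Kernel_5 joins the outputs of Kernel_3 and Kernel_4.
deps = {
    "Kernel_1": [], "Kernel_2": [],
    "Kernel_3": ["Kernel_1"], "Kernel_4": ["Kernel_2"],
    "Kernel_5": ["Kernel_3", "Kernel_4"],
}

def execute(deps, run, workers=4):
    """Dispatch each op as soon as its inputs finish. With workers=1 the
    dispatch degenerates to a fixed, single-threaded (serial) order."""
    users = {op: [] for op in deps}          # op -> ops consuming its output
    for op, ins in deps.items():
        for i in ins:
            users[i].append(op)
    remaining = {op: len(ins) for op, ins in deps.items()}
    lock, finished, left = threading.Lock(), threading.Event(), [len(deps)]
    pool = ThreadPoolExecutor(max_workers=workers)

    def task(op):
        run(op)
        ready = []
        with lock:
            left[0] -= 1
            if left[0] == 0:
                finished.set()
            for u in users[op]:
                remaining[u] -= 1
                if remaining[u] == 0:        # all inputs done: trigger it
                    ready.append(u)
        for u in ready:
            pool.submit(task, u)

    for op in [op for op, n in remaining.items() if n == 0]:
        pool.submit(task, op)
    finished.wait()
    pool.shutdown()

order = []
execute(deps, order.append)   # list.append is atomic under the GIL
```

Every run produces a valid topological order, but the interleaving of the independent pairs may differ between rounds, which is exactly the "non-fixed execution order" noted in the table.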
2. Execution of Heterogeneous Computational Graphs
![Heterogeneous Computational Graph](../img/ch05/graph_exec_4.png)
:width:`800px`
:label:`graph_exec_4`
As shown in :numref:`graph_exec_4`, this is a heterogeneous computational graph, where Kernel_1, Kernel_2, Kernel_5, and Kernel_9 are CPU operators, Kernel_6 is a Python operator (also executed on the CPU), Kernel_3 and Kernel_4 are GPU operators, and Kernel_7 and Kernel_8 are Ascend operators.
Generally, computational graph optimizations are implemented based on non-heterogeneous computational graphs, requiring all operators in the graph to be on the same device to facilitate optimizations such as operator fusion and replacement. Therefore, a heterogeneous computational graph needs to be partitioned into multiple non-heterogeneous computational graphs. The partitioning can be quite flexible, with various partitioning rules defined. Generally, partitioning rules that produce as few subgraphs as possible are used, placing as many operators on the same device into one subgraph as possible. As shown in :numref:`graph_exec_5`, five subgraphs are produced: Graph_1\_CPU, Graph_2\_GPU, Graph_3\_CPU, Graph_4\_Ascend, and Graph_5\_CPU.
![Heterogeneous Computational Graph Partitioning](../img/ch05/graph_exec_5.png)
:width:`800px`
:label:`graph_exec_5`
After partitioning a heterogeneous computational graph into multiple subgraphs, the execution methods are generally divided into subgraph partitioned execution and subgraph merged execution:
- **Subgraph Partitioned Execution**: The partitioned subgraphs are executed separately, i.e., one subgraph finishes execution before the next one starts, as shown in :numref:`graph_exec_6`. The output data of the previous subgraph is transferred to the input of the next subgraph, and the next subgraph needs to copy the input data to its own device memory. For example, Graph_2\_GPU needs to copy the output data of Graph_1\_CPU from CPU to GPU, and conversely, Graph_3\_CPU needs to copy the output data of Graph_2\_GPU from GPU to CPU. There is a certain overhead in switching execution between subgraphs.
- **Subgraph Merged Execution**: The partitioned subgraphs are merged into a single overall DAG for execution, as shown in :numref:`graph_exec_7`. Copy operators are inserted based on operator device attributes to enable data transfer between operators on different devices, and the copy operators are also incorporated into the whole graph, forming a large unified graph for execution, reducing the overhead of switching between subgraphs.
![Subgraph Partitioning](../img/ch05/graph_exec_6.png)
:width:`800px`
:label:`graph_exec_6`
![Subgraph Merging](../img/ch05/graph_exec_7.png)
:width:`800px`
:label:`graph_exec_7`
Since subgraph merged execution can reduce the overhead of switching between subgraphs, it generally achieves higher performance. A summary comparison is shown in :numref:`partitioning_vs_merging`.
:Comparison of Subgraph Partitioning and Subgraph Merging
| Execution Method | Subgraph Partitioning | Subgraph Merging|
| --------------|------------------|--------------|
| Heterogeneous Data Transfer | Copy between subgraphs | Copy between operators|
| Additional Execution Overhead | Subgraph switching overhead | None|
| Execution Concurrency Granularity | Subgraph-level concurrency | Native operator-level concurrency|
:label:`partitioning_vs_merging`
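The "fewest subgraphs" rule can be sketched as a single pass over a topologically ordered op list that starts a new subgraph whenever the device changes. The device assignments below are hypothetical, mirroring the figure:

```python
# Hypothetical (op, device) list in topological order.
ops = [("Kernel_1", "CPU"), ("Kernel_2", "CPU"),
       ("Kernel_3", "GPU"), ("Kernel_4", "GPU"),
       ("Kernel_5", "CPU"), ("Kernel_6", "CPU"),
       ("Kernel_7", "Ascend"), ("Kernel_8", "Ascend"),
       ("Kernel_9", "CPU")]

def partition(ops):
    """Group maximal runs of same-device ops into one subgraph each."""
    graphs = []
    for name, dev in ops:
        if not graphs or graphs[-1][0] != dev:
            graphs.append((dev, []))
        graphs[-1][1].append(name)
    return [f"Graph_{i + 1}_{dev}" for i, (dev, _) in enumerate(graphs)]

print(partition(ops))
# ['Graph_1_CPU', 'Graph_2_GPU', 'Graph_3_CPU', 'Graph_4_Ascend', 'Graph_5_CPU']
```

The five resulting subgraphs match those named above; a real pass would also respect data dependencies when grouping, which this linear sketch sidesteps by assuming topological order.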
3. Execution Acceleration of Heterogeneous Computational Graphs
The previous sections described two execution methods for non-heterogeneous computational graphs and two execution methods for heterogeneous computational graphs, where heterogeneous computational graphs are built upon non-heterogeneous ones. Therefore, heterogeneous computational graphs have four possible execution methods through pairwise combination. Taking MindSpore as an example, it adopts subgraph merged parallel execution, as illustrated in :numref:`graph_exec_5`. First, executing as a single whole graph avoids the overhead of subgraph switching, and then parallel execution within the whole graph maximizes the advantage of concurrent execution, achieving optimal execution performance.
![Heterogeneous Hardware Acceleration](../img/ch05/graph_exec_8.png)
:width:`800px`
:label:`graph_exec_8`
### Sink Execution
Sink execution leverages the SoC architecture of specialized chips to schedule the entire or partial computational graph onto the chip at once to complete the computation of the full data volume. For example, with Ascend chips, a computational graph composed of multiple Ascend operators can be compiled into a Task before execution. Through the interface provided by the Ascend driver, the Task containing multiple operators is dispatched to the hardware at once for scheduling and execution. Therefore, in the above example, the Ascend operators Kernel_7 and Kernel_8 can be optimized into a subgraph Graph_4\_Ascend, which is then compiled into a Task and sunk to the Ascend for execution, as shown in :numref:`graph_exec_8`.
Sink execution achieves better overall computational performance by avoiding interactions between the host side and the device side during computation. However, sink execution also has some limitations. For example, it faces significant technical challenges in scenarios involving dynamic shape operators and complex control flow.

# Graph Optimization
Graph optimization techniques at the backend primarily focus on
hardware-oriented approaches. These techniques can be categorized as
hardware-agnostic, such as memory I/O optimization, or specific to
particular hardware, such as subgraph transformation to accommodate
hardware instruction restrictions.
## Hardware-Agnostic Optimizations
Hardware-agnostic optimizations involve subgraph transformation, which
replaces a subgraph in a computational graph with a hardware-friendly
equivalent.
One example of such optimization is memory I/O optimization. In deep
learning models, operators can be categorized as either
compute-intensive (e.g., Conv and FC) or memory-intensive (e.g., ReLU
and element-wise Sum). Memory-intensive operators are mainly used for
element-wise operations. Often, both types of operators are used
together in a typical deep learning model, such as the combination of
"Conv + ReLU". By fusing ReLU and Conv into a composite operator, we
can reduce memory access latency and bandwidth pressure, thereby
improving execution efficiency.
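The memory-traffic saving can be illustrated in NumPy, using an element-wise multiply as a stand-in for Conv. This is a sketch of the idea, not a real fused kernel:

```python
import numpy as np

x = np.random.rand(1 << 20).astype(np.float32)
w = np.float32(0.5)   # multiply stands in for the Conv computation

def unfused(x, w):
    """Conv then ReLU as separate ops: the intermediate y is written to
    memory in full, then re-read, and a second output buffer is written."""
    y = x * w
    return np.maximum(y, 0.0)

def fused(x, w):
    """ReLU applied in place on the conv output's buffer, avoiding the
    allocation of (and write to) a second full-size result tensor."""
    out = np.multiply(x, w)
    np.maximum(out, 0.0, out=out)
    return out

assert np.allclose(unfused(x, w), fused(x, w))
```

A real composite kernel goes further by never materializing the intermediate at all, keeping it in registers or on-chip cache between the two stages.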
Figure :numref:`ch07/ch07-compiler-backend-03` illustrates an example of
fusing "Conv + Conv + Sum + ReLU". This fusion optimization eliminates
two read operations and two write operations, optimizing the read and
write of the outputs generated by Conv and Sum.
![Element-wise operator fusion](../img/ch07/conv_sum_relu.png)
:label:`ch07/ch07-compiler-backend-03`
Furthermore, automatic operator generation technology enables more
flexible general optimizations in addition to fusion-based optimizations
for specific operator types. An example of this technology is graph
kernel fusion (available on AI frameworks such as TensorFlow and
MindSpore). It aims to reduce inefficient memory movements and enable
intensive computing through three steps: operator expansion,
aggregation, and reconstruction.
Figure
:numref:`ch07/ch07-compiler-backend-graph-kernel` provides an
overview of graph kernel fusion, which involves the following steps:
1. Expander: Composite operators (Op1, Op3, and Op4) in the
computational graph are expanded into combinations of basic
operators, as represented by the graph nodes with dashed lines.
2. Aggregation: The basic operator (Op2) and expanded operators are
aggregated into larger operator combinations.
3. Reconstruction: The basic operators are classified based on the
input-to-output affinity, such as elemwise, broadcast, reduce, and
transform. This classification allows the derivation of general
compute rules (e.g., elemwise + reduce) to facilitate efficient
execution. The operator combination is then analyzed and filtered
iteratively, leading to the creation of new operators (New Op1 and
New Op2) through reconstruction. These new operators are designed to
be hardware-friendly.
Graph kernel fusion enables joint optimization beyond operator
boundaries by expanding and aggregating the computational graph. It
generates new hardware-friendly operators through reconstruction based
on general compute rules, thereby facilitating efficient execution.
However, it should be noted that this approach involves additional
memory movements.
![Graph kernel fusion](../img/ch07/graph_kernel.png)
:label:`ch07/ch07-compiler-backend-graph-kernel`
## Hardware-Specific Optimizations
Hardware-specific optimizations are tailored to address the restrictions
imposed by specific hardware instructions and memory formats associated
with particular hardware devices.
### Hardware Instruction Restrictions
Hardware instruction restrictions arise when certain IR nodes lack
direct operator counterparts on a specific hardware device. In such
cases, subgraph transformation can be employed to overcome these
restrictions. Let's consider an example. The Concat operator on the
accelerator supports a maximum of 63 inputs. If the Concat node in the
frontend IR exceeds this limit, we can partition the node into multiple
smaller Concat nodes. Figure
:numref:`ch07/ch07-compiler-backend-04` illustrates how we can
split a 100-input Concat node into two smaller nodes, one with 63 inputs
and the other with 37 inputs, to meet the 63-input requirement of the
accelerator.
![Partitioning of the Concat operator](../img/ch07/concat.png)
:label:`ch07/ch07-compiler-backend-04`
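A sketch of this partitioning in NumPy, with the 63-input limit taken from the text (note that the extra merge Concat has only two inputs, so it also satisfies the limit):

```python
import numpy as np

MAX_INPUTS = 63  # the accelerator's per-Concat input limit from the text

def split_concat(tensors, axis=0, limit=MAX_INPUTS):
    """Replace one oversized Concat with smaller Concats that each
    respect the hardware limit, then merge the partial results."""
    parts = [np.concatenate(tensors[i:i + limit], axis=axis)
             for i in range(0, len(tensors), limit)]
    return np.concatenate(parts, axis=axis)

inputs = [np.full((1, 2), i, dtype=np.float32) for i in range(100)]
assert np.array_equal(split_concat(inputs), np.concatenate(inputs, axis=0))
# 100 inputs -> a 63-input Concat and a 37-input Concat, as in the figure
```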
### Memory Format Restrictions
Different platforms define varying formats for different operators to
achieve optimal performance. When the formats are inconsistent with a
particular framework, a common approach is to insert format
transformation operations to reformat the operator output. However, this
introduces additional memory movements.
Figure :numref:`ch07/ch07-compiler-backend-05` provides an example to
illustrate this scenario. Consider that the default format in an AI
framework is NCHW, but the hardware accelerator is optimized for
performing convolution with inputs and outputs in NC1HWC0 format. To
bridge this gap, the output of the first Conv operator is formatted to
NCHW using a TransData operator. It is then reformatted to NC1HWC0 using
another TransData operator before being passed to the next Conv
operator. The two TransData operations (depicted as dashed lines in the
figure) are inverse operations of each other. By employing pattern
matching on the computational graph, such operations can be easily
eliminated.
![Elimination of format transformation operations](../img/ch07/transdata.png)
:label:`ch07/ch07-compiler-backend-05`
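Such a pattern-matching pass can be sketched as a linear scan that cancels adjacent inverse TransData pairs. The op tuples below are a hypothetical `(name, src_format, dst_format)` encoding, not a real framework IR:

```python
def eliminate_transdata(ops):
    """Drop adjacent TransData pairs whose formats are mutual inverses
    (e.g. NC1HWC0 -> NCHW immediately followed by NCHW -> NC1HWC0)."""
    out = []
    for op in ops:
        if (out and op[0] == "TransData" and out[-1][0] == "TransData"
                and out[-1][1] == op[2] and out[-1][2] == op[1]):
            out.pop()          # the pair is a no-op; drop both
        else:
            out.append(op)
    return out

ops = [("Conv", None, None),
       ("TransData", "NC1HWC0", "NCHW"),
       ("TransData", "NCHW", "NC1HWC0"),
       ("Conv", None, None)]
print(eliminate_transdata(ops))
# [('Conv', None, None), ('Conv', None, None)]
```

A production pass would additionally check that no other operator consumes the intermediate NCHW tensor before removing the pair.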

# AI Compiler Backend
In this chapter, we will explore the design of the AI compiler backend.
The objective of an AI compiler backend is to enhance the efficiency of
AI program execution by optimizing the Intermediate Representation (IR)
generated by the compiler frontend. This optimization enables the full
utilization of hardware capabilities. The backend achieves this goal by
applying optimizations to IR code based on hardware capabilities.
Furthermore, it selects suitable operators based on the capabilities of
target hardware to execute computations efficiently, while also
allocating memory to optimize data reuse and locality. Additionally, the
backend often incorporates an operator compiler, which optimizes the
execution strategy for code statements associated with operators.
This chapter aims to achieve the following learning objectives:
- Understand the role and architecture of an AI compiler backend.
- Understand typical methods for optimizing computational graphs.
- Understand typical methods for selecting operators.
- Understand typical methods for memory allocation.
- Understand the architecture and functionalities of operator
compilers.
```toc
:maxdepth: 2
Overview
Graph_Optimization
Operator_Selection
Memory_Allocation
Operator_Compiler
Chapter_Summary
Further_Reading
```

# Operator Selection
Following graph optimization, the compiler backend generates a sequence
of operators that can be executed on hardware. This is achieved by
selecting the most suitable operators from a set of candidate operators
for each node in the IR. Since these candidate operators have diverse
specifications, their execution efficiency varies depending on the
scenario. Therefore, the primary objective of operator selection is to
choose the operators that are most appropriate for the target device
based on the information provided by the IR.
## Basic Concepts of Operator Selection
We can think of the nodes in a backend-optimized IR as units of
execution visible to the user, each representing a hardware-agnostic
operation in the user code. In essence, operator selection chooses the
appropriate hardware-specific information, referred to as operator
information, for each node. Such information defines the following:
1. The format of an operator, which is a determinant of the operator's
performance on the target platform. Machine learning systems
commonly use NCHW and NHWC formats.
2. The data type (such as float32, float16, or int32) of an operator on
the target platform. The operators selected are those with data
types close to (or the same as) user definitions.
### Data Formats
In machine learning systems, many operations are converted into matrix
multiplication (e.g., convolution) for faster computation. Matrix
multiplication in the form of
$\mathbf{A}\times \mathbf{B} = \mathbf{C}$ is
essentially a row-by-column multiplication. Specifically, the entry *ij*
of **C** is obtained by multiplying the entries in the *i*th row of
**A** and the corresponding entries in the *j*th column of **B** and
then adding the results together. Consider the example shown in Figure
:numref:`ch07/ch07-compiler-backend-06`. Matrix data is stored in
row-major order by default, as shown at the top of the figure. However,
matrix **B** is read in column-major order in the matrix multiplication
process, as shown at the bottom.
![Matrix data layouts in matrix multiplication](../img/ch07/matmuldatalayout.png)
:label:`ch07/ch07-compiler-backend-06`
Storing matrix **B** in the reading order increases the computation
efficiency because access to contiguous blocks of memory is faster. We
can therefore see that data formats play an important role in
performance improvement.
There are two major formats in machine learning systems: NCHW and NHWC.
For an image input, N denotes the batch size, C denotes the number of
channels, and H and W denote the height and width respectively. Figure
:numref:`ch07/ch07-compiler-backend-07` depicts the logical
diagram of an input with batch size 2, channels 16, height 5, and width
4.
![Format diagram](../img/ch07/data_format.png)
:label:`ch07/ch07-compiler-backend-07`
A multidimensional matrix is flattened into 1D format before it is
written to memory. This involves indexing, which maps logical data to
physical memory.
Access to machine learning data is performed in an axis-wise order from
the last axis forward. For instance, data in NCHW format is read in the
axis order of W, H, C, and N. Equation
:eqref:`ch05/equation-01` denotes the mapping between
logical memory and physical memory for this format of data.
$$
\text{offsetnchw}(n,c,h,w) = n \times \textit{C} \times \textit{H} \times \textit{W} + c \times \textit{H} \times \textit{W} + h \times \textit{W} + w
$$
:eqlabel:`equation:ch05/equation-01`
As shown in Figure
:numref:`ch07/ch07-compiler-backend-08`, matrix elements are
flattened from the lowest dimension (i.e., W axis) forward, and
neighboring elements of an axis reside next to each other in memory. To
reach the element at the same location in the next image, an offset of
one whole image ($C*H*W$ elements) must be skipped. Assume we have a
batch of eight RGB images of size 32$\times$32, or a matrix with
$N=8,C=3,H=32,W=32$. Memory storage of these images begins from the
first channel of the first image by flattening the matrix along axis W
and then arranging matrix elements along axis H. This is performed
before the next channel is processed. The same procedure is repeated
until the last channel of the last image is processed. NCHW is the
default format on PyTorch and MindSpore.
![RGB image data in NCHW format](../img/ch07/nchw.png)
:label:`ch07/ch07-compiler-backend-08`
Access to data in NHWC format also begins at the lowest dimension (i.e.,
C axis) forward. NHWC is the default format on TensorFlow (PyTorch
refers to it as the channel-last format). Equation
:eqref:`ch05/equation-02` denotes the mapping from logical
memory to physical memory for this format of data.
$$
\text{offsetnhwc}(n,h,w,c) = n \times \textit{H} \times \textit{W} \times \textit{C} + h \times \textit{W} \times \textit{C} + w \times \textit{C} + c
$$
:eqlabel:`equation:ch05/equation-02`
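Both offset mappings can be checked against NumPy's row-major layout: filling a tensor with `arange` makes every element equal to its own flat offset, so indexing must agree with the formulas exactly (dimensions follow the earlier figure):

```python
import numpy as np

def offset_nchw(n, c, h, w, C, H, W):
    """Logical-to-physical mapping for NCHW data."""
    return n * C * H * W + c * H * W + h * W + w

def offset_nhwc(n, h, w, c, H, W, C):
    """Logical-to-physical mapping for NHWC data."""
    return n * H * W * C + h * W * C + w * C + c

# Dimensions from the earlier figure: N=2, C=16, H=5, W=4.
N, C, H, W = 2, 16, 5, 4
nchw = np.arange(N * C * H * W).reshape(N, C, H, W)
nhwc = np.arange(N * H * W * C).reshape(N, H, W, C)
assert nchw[1, 3, 2, 1] == offset_nchw(1, 3, 2, 1, C, H, W)
assert nhwc[1, 2, 1, 3] == offset_nhwc(1, 2, 1, 3, H, W, C)
```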
Figure
:numref:`ch07/ch07-compiler-backend-nchwandnhwc` compares the
logical indexing of the NCHW and NHWC formats. The \[x:1\] marks refer
to the jumps from the innermost axis to the next. For example, \[a:1\]
indicates the jump from axis W to axis H, and \[b:1\] indicates the jump
from axis C (the innermost) to axis W.
![NCHW and NHWC formats](../img/ch07/nchwandnhwc.png)
:label:`ch07/ch07-compiler-backend-nchwandnhwc`
These two formats offer a high degree of flexibility and are therefore
used on many frameworks. However, to accelerate computing on hardware,
further optimization is needed. In a machine learning system, if the
size of the user input exceeds what the compute component can pass
through the network at a time (which is often the case), the input will
be batched before computation. For further optimization, many frameworks
introduce blocked formats (which are more hardware-friendly), such as
the nChw16c and nChw8c formats of the oneAPI Deep Neural Network Library
(oneDNN) and the NC1HWC0 format on the Ascend platform. By leveraging
hardware acceleration instructions to move and compute data, matrices
can be quickly transformed into vectors, increasing the utilization of
the on-chip cache.
### Data Types
Single-precision (float32), occupying 32 bits in memory, is the most
commonly used data type in machine learning systems. In applications
where higher precision is not essential, the half-precision (float16)
data type may be used, occupying 16 bits in memory. When used on
hardware, float16 offers up to 7 times more arithmetic throughput with
less memory footprint compared with the single-precision data type ---
this allows for larger batch sizes and consequently reduced training
time. Next, we will look at the differences between half-precision
floating-point numbers and single-precision floating-point numbers.
In Figure :numref:`ch07/ch07-float32andfloat16`, *Sig* refers to the sign
bit that indicates the sign of a number, *Exponent* refers to the
exponent bits, and *Mantissa* refers to the mantissa bits.
![Binary representation of floating-point numbers](../img/ch07/floatdtype.png)
:label:`ch07/ch07-float32andfloat16`
Applying Equation
:eqref:`ch05/equation-03` will convert a float16 number in
binary scientific notation to decimal format.
$$
(-1)^{\text{Sig}}\times 2^{\text{Exponent}-15}\times (\frac{\text{Mantissa}}{1024}+1)
$$
:eqlabel:`equation:ch05/equation-03`
If the exponent bits and mantissa bits are all 0s, the number is 0. If
the exponent bits are all 0s but the mantissa bits are not, the number
is a subnormal, i.e., a very small value close to zero. If the exponent
bits are all 1s and the mantissa bits are all 0s, the number is an
infinity, either positive or negative depending on the sign bit. Not a
Number (NaN) is denoted by the exponent bits being all 1s while the
mantissa bits are not all 0s. bfloat16 is a
special data type developed by Google for machine learning on its tensor
processing units (TPUs). Although bfloat16 is not an industry-standard
IEEE 16-bit floating-point data type, it has the same exponent size as
float32, meaning that it can be easily converted to and from float32.
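As a sketch of how Equation :eqref:`ch05/equation-03` combines with the special cases above, the following hypothetical decoder interprets a raw 16-bit pattern (the function name and subnormal handling are illustrative; the cross-check uses Python's standard IEEE half-precision codec):

```python
import struct

def decode_float16(bits):
    """Decode a 16-bit pattern (as an int) per the float16 formula in the text."""
    sig = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits
    mantissa = bits & 0x3FF          # 10 mantissa bits
    if exponent == 0x1F:             # all-ones exponent: inf or NaN
        return float('nan') if mantissa else (-1) ** sig * float('inf')
    if exponent == 0:                # all-zeros exponent: zero or subnormal
        return (-1) ** sig * 2 ** -14 * (mantissa / 1024)
    return (-1) ** sig * 2 ** (exponent - 15) * (mantissa / 1024 + 1)

# 0x3C00 is 1.0: sign 0, exponent 15, mantissa 0.
assert decode_float16(0x3C00) == 1.0
# 0xC000 is -2.0: sign 1, exponent 16, mantissa 0.
assert decode_float16(0xC000) == -2.0
# Cross-check against the platform's IEEE half-precision decoder.
assert decode_float16(0x3555) == struct.unpack('<e', struct.pack('<H', 0x3555))[0]
```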
### Operator Information Library
Hardware devices support different operators based on their data format
and data type requirements. Each device maintains an operator
information library that contains a comprehensive list of operators
supported by that device. During the operator selection process, the
most suitable operators are chosen from this library. The library serves
as a reference for determining which operators are compatible and can be
efficiently executed on a particular hardware device.
## Process of Operator Selection
Operator selection involves selecting the most appropriate operator for
each operation node in an IR. Operator information contains the
supported device type, data type, and data format. After the compiler
frontend completes type inference and static analysis, the data type of
user code is derived from the IR.
Figure :numref:`ch07/ch07-compiler-backend-select` shows the operator
selection process. First, the target hardware needs to be selected (or
this step can be skipped in order to keep the default hardware selection
defined in the compiler backend). The implementation, supported data
types, and execution efficiency of a given operator vary depending on
the target hardware. Then, the compiler backend selects an operator
based on the data type and data format derived from the IR.
![Operator selection process (using GPU as an example)](../img/ch07/select_kernel.png)
:label:`ch07/ch07-compiler-backend-select`
The result of the operator selection process might not be as expected
due to software or hardware specifications. Sometimes, we might need to
adjust the precision of a particular node to find an operator with the
right data type. For example, the Conv2D operator supported by Ascend
(i.e., the backend of MindSpore) allows only the float16 data type. When
used on a float32 network on Ascend, the Conv2D operator is executable
only when its input precision is reduced from float32 to float16.
Converting operators from one format to another can be time-consuming
and incur memory movement overheads. To avoid this, data should be
transferred between operators of the same format whenever possible. In
addition, data type inconsistency may lead to reduced precision,
potentially slowing down or even preventing network convergence. As
such, thorough operator analysis is needed to ensure that the right data
type is selected.
Simply put, an operator selection algorithm is considered optimal if it
keeps the data type as consistent as possible with user settings while
also minimizing data format conversion.
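A minimal sketch of this selection process follows, assuming a hypothetical operator information library keyed by (operator, device) with per-entry data type and format; all names here are illustrative rather than a real framework API. It mirrors the Conv2D-on-Ascend case above, where a float32 request falls back to a float16 kernel via an inserted cast:

```python
# Hypothetical operator-info records: per device, the dtype/format
# combinations each operator supports.
OP_INFO = {
    ('Conv2D', 'Ascend'): [{'dtype': 'float16', 'format': 'NC1HWC0'}],
    ('Conv2D', 'GPU'):    [{'dtype': 'float32', 'format': 'NCHW'},
                           {'dtype': 'float16', 'format': 'NHWC'}],
}

def select_op(name, device, dtype, fmt):
    """Return (chosen entry, list of casts to insert), or (None, [])."""
    entries = OP_INFO.get((name, device), [])
    for e in entries:                       # prefer an exact match
        if e['dtype'] == dtype and e['format'] == fmt:
            return e, []
    for e in entries:                       # fall back: adjust precision
        if e['format'] == fmt:
            return e, [f'Cast {dtype} -> {e["dtype"]}']
    return None, []

# float32 Conv2D on Ascend: only a float16 kernel exists, so a
# precision-reducing cast is inserted, as described in the text.
op, casts = select_op('Conv2D', 'Ascend', 'float32', 'NC1HWC0')
assert op['dtype'] == 'float16' and casts == ['Cast float32 -> float16']
```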

# Memory Allocation
Memory allocation is a crucial aspect of conventional computer memory
hierarchy, acting as a link between cache and disk storage. It provides
more storage capacity than the cache and enables faster access compared
to disk storage. With the progress of deep learning, accommodating large
deep neural networks within the memory of hardware accelerators or AI
processors has become increasingly challenging. To overcome this
obstacle, various solutions have been developed, including memory reuse,
contiguous memory allocation, and in-place memory allocation. Proper
implementation of contiguous memory allocation and in-place memory
allocation can enhance the execution efficiency of operators and further
optimize performance.
## Device Memory
In a deep learning architecture, the memory closest to the hardware
accelerator (such as the GPU or AI processor) is usually referred to as
the device memory, and that closest to the CPU is referred to as the
host memory. As shown in Figure
:numref:`ch07/ch07-compiler-backend-memory-01`, the CPU can
directly access the host memory but not the device memory. Similarly,
the AI processor can directly access the device memory but not the host
memory. In a typical network training process, data needs to be loaded
from disk storage to the host memory, where it is then processed. After
that, the data is copied from the host memory to the device memory, so
that the device can directly access the data. When the computation is
finished, the user can obtain the training result once the result data
is copied from the device memory back to the host memory.
![Host memory and device memory](../img/ch07/host-device-memory.png)
:label:`ch07/ch07-compiler-backend-memory-01`
## Process of Memory Allocation
The memory allocation module allocates device memory to the input and
output of each operator in a graph. The compiler frontend interprets the
user script into an IR, based on which the compiler backend performs
operator selection and optimization to determine information such as the
shape, data type, and format of each input/output tensor of each
operator. With this information, the size of each input/output tensor of
each operator can be calculated using Equation
:eqref:`ch05/equation-04`:
$$
\text{size}=\prod_{i=0}^{\text{dim}-1}\text{shape}_i \times \text{sizeof}\left ( \text{datatype} \right )
$$
:eqlabel:`equation:ch05/equation-04`
Unaligned memory access can be time-consuming, because the transfer of
data to and from memory is most efficient in chunks of 4, 8, or 16
bytes. When the size of the data to be transferred is not a multiple of
any of these sizes, one or more empty bytes are padded to align the data
in memory.
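The size equation plus the alignment padding just described can be sketched as follows; the 16-byte alignment is an assumed example value, not a fixed rule:

```python
from functools import reduce

def tensor_size_bytes(shape, dtype_bytes, align=16):
    """Bytes for a tensor per the size equation, rounded up to `align`."""
    raw = reduce(lambda a, b: a * b, shape, 1) * dtype_bytes
    return (raw + align - 1) // align * align  # pad up to the alignment

# A float32 tensor of shape (8, 3, 32, 32): 8*3*32*32*4 = 98304 bytes,
# already a multiple of 16, so no padding is added.
assert tensor_size_bytes((8, 3, 32, 32), 4) == 98304
# A float16 tensor of shape (3, 3): 18 bytes, padded to 32.
assert tensor_size_bytes((3, 3), 2) == 32
```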
Figure
:numref:`ch07/ch07-compiler-backend-memory-02` illustrates an
example of memory allocation.
![Memory allocation example](../img/ch07/memory_allocate.png)
:label:`ch07/ch07-compiler-backend-memory-02`
In this example, memory addresses are assigned to the input tensor,
Conv2D's weight, and Conv2D's output. Subsequently, a memory address is
allocated to the input of BatchNorm. Since the input of BatchNorm is the
same as the output of Conv2D, which already has an allocated memory
address, the output address of Conv2D can be shared with the input of
BatchNorm. This approach avoids redundant memory allocation and
unnecessary memory copies. The entire training process in this example
involves allocating memory for three types based on their data lifetime:
the initial input of the graph, the weights or attributes of operators,
and the output tensor of the final operator.
Frequent allocations and deallocations of memory blocks of various sizes
using functions like `malloc` can significantly degrade performance. To
mitigate this issue, memory pools can be employed. Memory pools involve
pre-allocating a specific amount of memory, allowing memory blocks to be
dynamically allocated from the pool as needed and returned for reuse.
Memory pools are widely utilized in AI frameworks to manage frequent
allocations of device memory and ensure consistent memory lifetime for
tensors. Different AI frameworks adopt similar memory pool designs.
Figure
:numref:`ch07/ch07-compiler-backend-memory-03` presents an
example of memory allocation in an AI framework. In this case, each
tensor's memory is allocated from a pre-allocated device memory space
using double pointers to offset the start and end addresses. Weight
tensors of operators are allocated memory by offsetting from the start
address (with a lifetime lasting throughout the training process). The
output tensor of each operator is allocated memory by offsetting from
the end address (with a shorter lifetime that terminates when the tensor
is no longer needed in the computation process). This approach allows
operator memory to be allocated using offset pointers from pre-allocated
device memory, significantly reducing the time required compared to
direct memory allocations from the device.
![Memory allocation using double offset pointers](../img/ch07/device_malloc.png)
:label:`ch07/ch07-compiler-backend-memory-03`
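The double-offset-pointer scheme can be sketched as below; the class and method names are hypothetical, and real frameworks track far more state (alignment, stream ownership, lifetimes):

```python
class DoubleEndedPool:
    """Sketch: allocate by offsetting from both ends of one pre-allocated
    device memory region, as described above."""

    def __init__(self, total_size):
        self.total = total_size
        self.head = 0            # next offset for long-lived weights
        self.tail = total_size   # end offset for short-lived outputs

    def alloc_weight(self, size):
        # Weights live for the whole training process: bump from the start.
        if self.head + size > self.tail:
            raise MemoryError('pool exhausted')
        offset, self.head = self.head, self.head + size
        return offset

    def alloc_output(self, size):
        # Operator outputs are short-lived: bump down from the end.
        if self.tail - size < self.head:
            raise MemoryError('pool exhausted')
        self.tail -= size
        return self.tail

pool = DoubleEndedPool(1024)
w0 = pool.alloc_weight(256)   # offset 0
w1 = pool.alloc_weight(128)   # offset 256
o0 = pool.alloc_output(512)   # offset 512, growing toward the weights
assert (w0, w1, o0) == (0, 256, 512)
```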
## Memory Reuse
In a machine learning system, memory reuse is achieved by analyzing the
lifespan of a tensor and, once it reaches the end of its lifespan,
releasing its device memory back to the memory pool for future reuse by
other tensors. The objective of memory reuse is to enhance memory
utilization and enable the accommodation of larger models within the
constraints of limited device memory. By reusing memory instead of
continuously allocating new memory for tensors, the system can optimize
memory utilization and mitigate the memory limitations inherent in deep
learning computations.
Figure
:numref:`ch07/ch07-compiler-backend-memory-02` provides an
example, where output 1 becomes unused once the computation of the
BatchNorm operator is complete. In this case, the device memory of
output 1 can be reclaimed and reused for output 3 (if output 3 does not
require a larger memory size than output 1).
Figure
:numref:`ch07/ch07-compiler-backend-memory-04` depicts memory
lifetime using coordinate charts. The horizontal axes represent the
tensor lifetime, and the vertical axes represent the memory sizes.
During its lifetime, a tensor occupies a specific amount of device
memory. The objective of memory allocation is to find an optimal
solution that accommodates the maximum number of non-conflicting
rectangular blocks (each denoting a tensor's lifetime and memory size)
in the same memory. In Figure
:numref:`ch07/ch07-compiler-backend-memory-04`, the memory can
accommodate only four rectangular blocks (i.e., tensors T0, T1, T2, and
T3) when no memory reuse policy is applied, as shown in the left chart.
![Memory lifetime charts](../img/ch07/combine_memory_resue_and_no_reuse_cn.png)
:label:`ch07/ch07-compiler-backend-memory-04`
To determine an appropriate memory reuse policy, we face an NP-complete
problem. AI frameworks often employ greedy algorithms, such as best-fit,
which allocate memory by searching for the smallest available block in
the memory pool one at a time. However, this approach only yields a
locally optimal solution rather than a globally optimal one. To
approximate a globally optimal solution, a method called Safe Optimized
Memory Allocation Solver (SOMAS) can be considered.
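A minimal sketch of the best-fit strategy over a free list of `(offset, size)` blocks follows; the list representation is illustrative:

```python
def best_fit(free_blocks, request):
    """Greedy best-fit: choose the smallest free block that fits `request`.
    Returns (chosen offset, updated free list), or (None, free list)."""
    candidates = [b for b in free_blocks if b[1] >= request]
    if not candidates:
        return None, free_blocks
    offset, size = min(candidates, key=lambda b: b[1])  # tightest fit
    rest = [b for b in free_blocks if b != (offset, size)]
    if size > request:                     # keep the leftover piece free
        rest.append((offset + request, size - request))
    return offset, rest

free = [(0, 64), (128, 32), (256, 100)]
off, free = best_fit(free, 30)
assert off == 128                # the 32-byte block is the tightest fit
assert (128 + 30, 2) in free     # its 2-byte remainder stays reusable
```

Because each request is served in isolation, the result is only locally optimal, which is exactly why the globally informed SOMAS planning discussed next can pack more tensors into the same memory.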
SOMAS addresses the computational graph by conducting aggregative
analysis on parallel streams and data dependencies. This analysis
reveals the ancestor-descendant relationships between operators. By
generating a global set of mutually exclusive constraints concerning the
lifetime of each tensor, SOMAS combines multiple heuristic algorithms to
achieve an optimal solution for static memory planning. Through SOMAS,
an optimized memory reuse outcome is obtained, resulting in increased
reusable memory.
As shown in the right chart of Figure
:numref:`ch07/ch07-compiler-backend-memory-04`, with the SOMAS
algorithm, the number of tensors allowed in the same memory is increased
to seven.
## Optimization Techniques for Memory Allocation
In the following, we describe the typical optimization techniques for
memory allocation.
### Memory Fusion
Commonly used memory allocation methods operate at the tensor level,
often resulting in discontinuous device addresses across tensors.
However, certain specialized operators, like AllReduce for
communication, require contiguous memory allocation. Executing a
communication operator involves data transfer, computation, and waiting
for communication; the waiting is a significant performance bottleneck
in large-scale distributed systems. To minimize communication
time, we can fuse multiple communication operators into a composite
operator.
input, as depicted in Figure
:numref:`ch07/ch07-compiler-backend-memory-06`.
Additionally, the time spent in communication can be reduced during the
weight initialization task in distributed neural network training. This
task involves broadcasting the initialized weight from one process to
all processes. If a network contains multiple weights (which is often
the case), these broadcasts are repeated. To minimize communication time
in this scenario, a typical approach is to allocate contiguous memory
addresses to all weights on the network and then perform a single
broadcast operation.
![Memory fusion of communication operators](../img/ch07/memory_fusion.png)
:label:`ch07/ch07-compiler-backend-memory-06`
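The single-broadcast idea can be illustrated with NumPy views into one fused buffer; here one operation on the fused buffer stands in for the single broadcast (a sketch, not a real communication call):

```python
import numpy as np

# Pack all weights into one contiguous buffer so that one broadcast
# (rather than one per weight) suffices; each weight then becomes a
# view into the fused buffer.
weights = [np.ones((3, 3)), np.zeros(16), np.full((2, 4), 5.0)]
sizes = [w.size for w in weights]
fused = np.empty(sum(sizes), dtype=np.float64)

views, start = [], 0
for w, n in zip(weights, sizes):
    fused[start:start + n] = w.ravel()
    views.append(fused[start:start + n].reshape(w.shape))  # shares fused's memory
    start += n

# A single operation on `fused` (standing in for one broadcast over the
# contiguous region) is visible through every per-weight view.
fused *= 2.0
assert views[0][0, 0] == 2.0 and views[2][1, 3] == 10.0
```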
### In-place Operators
In the memory allocation process depicted in
Figure :numref:`ch07/ch07-compiler-backend-memory-02`, the input and
output of each operator are assigned different memory addresses.
However, this approach can lead to memory waste and performance
degradation for several other operators. Examples include optimizer
operators used to update neural network weights, Python's `+=` or `*=`
operators that modify variable values, and the `a[0]=b` operator that
updates the value of `a[0]` with `b`. These operators share a common
purpose: updating the input value. The concept of in-place can be
illustrated using the `a[0]=b` operator.
In the original implementation shown on the left of Figure
:numref:`ch07/ch07-compiler-backend-memory-08`, the operator
involves three steps: copying tensor `a` to a temporary tensor `a'`,
assigning tensor `b` to tensor `a'`, and then copying tensor `a'` back
to tensor `a`. However, by performing the operation in-place, as depicted on the
right of Figure
:numref:`ch07/ch07-compiler-backend-memory-08`, this process is
simplified to a single step: copying tensor `b` to the position
corresponding to tensor `a`. This reduces data copy time by eliminating
two copies and eliminates the need to allocate memory for tensor `a`.
![Memory allocation of an in-place operator](../img/ch07/inplace-op.png)
:label:`ch07/ch07-compiler-backend-memory-08`
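The difference can be observed directly in NumPy, where `a[0] = b[0]` and `+=` write into `a`'s existing buffer instead of allocating a new one (a small illustration):

```python
import numpy as np

a = np.zeros(4)
b = np.array([7.0])
buf_before = a.__array_interface__['data'][0]  # address of a's buffer

# In-place update: writes b's value directly into a's existing storage.
a[0] = b[0]
assert a[0] == 7.0
assert a.__array_interface__['data'][0] == buf_before  # no new allocation

# `a += 1` also updates in place (whereas `a = a + 1` would allocate
# and bind `a` to a fresh buffer).
a += 1
assert a.__array_interface__['data'][0] == buf_before
```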
## Data Compression
Deep neural networks (DNNs) in modern training heavily rely on GPUs to
effectively train intricate networks with hundreds of layers. A
prominent challenge faced by both researchers and industry professionals
is the constraint imposed by the available GPU main memory as networks
become deeper. This limitation restricts the size of networks that can
be trained. To address this issue, researchers have recognized the value
of employing DNN-layer-specific encoding schemes. Consequently, they
have directed their attention towards storing encoded representations of
the intermediate layer outputs (feature maps) that are required for the
backward pass. These encoded representations are stored during the
temporal gap between their uses and are decoded only when needed for the
backward pass. The full-fidelity feature maps are promptly discarded
after use, resulting in a noteworthy reduction in memory consumption.
## Memory Swap
Machine learning frameworks often require users to tune their memory
utilization so that the DNN fits within the memory capacity of the GPU.
This constraint keeps researchers from thoroughly investigating diverse
machine learning algorithms, compelling them to make concessions either
in terms of network architecture or by distributing the computational
load across multiple GPUs. One feasible approach is to incorporate DRAM to
facilitate memory swapping. By transferring temporarily inactive data to
DRAM, we can optimize GPU utilization. In recent studies, researchers
have implemented a cautious approach to allocating GPU memory for the
immediate computational needs of a specific layer. This strategy
effectively reduces both the maximum and average memory usage, enabling
researchers to train more extensive networks. To elaborate further, the
researchers promptly release feature maps from GPU memory in the absence
of any potential reuse. Alternatively, if there is a possibility of
future reuse but no immediate requirement, the feature maps are
offloaded to CPU memory and subsequently prefetched back to GPU memory.
The fundamental concept behind memory swapping is straightforward and
intuitive. However, implementing it well remains challenging and
requires expert knowledge in the compiler frontend. One example is
maximizing the overlap between computation time and data swapping time:
a precise cost model is essential for estimating the time required for
data movement and the time cost associated with each DNN
layer. Additionally, there are
numerous strategies to explore in auto scheduling and auto tuning.
Fortunately, there is an abundance of literature available that
addresses these issues. For additional information, please refer to the
Further Readings section.

# Operator Compiler {#sec:operator-compiler}
Operator compilers are used for compiling and optimizing operators,
which may be part of a neural network or come from the code implemented
in a domain-specific language (DSL). The compilation is the process of
*transforming* the source code from one *representation* into another.
The objective of an operator compiler is to improve the *execution
performance* of operators. An operator compiler accepts tensor
computation logic described in *dynamic languages* (e.g., Python) as the
input and outputs executable files on *specific AI processors*.
## Scheduling Strategy
An operator compiler abstracts the execution of statements in an
operator implementation into "scheduling strategies". Since an
operator typically consists of multiple statements, the focus lies in
determining the scheduling strategy for the statements within the
operator. This strategy encompasses considerations such as the
calculation order, data block movement, and other relevant factors.
If we ignore the specific processor architecture, then for the best
performance we only need to load all input tensors into the computation
core according to the *computational logic* of the operator and fetch
the result from the core for storage. *Computational logic* refers to basic
arithmetic operations (e.g., addition, subtraction, multiplication, and
division) and other function expressions (e.g., convolution,
transposition, and loss functions).
Modern computer memory hierarchy looks like a pyramid structure, as
shown in Figure
:numref:`ch05/ch05-memory_architecture`. As we move up the
pyramid, the storage elements have a higher cost but a faster access
time.
![Modern computer memory hierarchy](../img/ch05/memory_architecture.png)
:label:`ch05/ch05-memory_architecture`
Such hardware design leads to two basic types of locality:
(1) Temporal locality: the tendency to access the same memory location
several times in quick succession. As such, accessing the same location
in the L1 cache several times is more efficient than accessing different
locations in the L1 cache several times.
(2) Spatial locality: the tendency to access nearby memory locations
in quick succession. As such, accessing nearby locations in the L1 cache
several times is more efficient than moving back and forth between the
L1 cache and the main memory.
Both types of locality help improve system performance. Specifically, in
order to improve the data access speed, data to be repeatedly processed
can be placed in fixed nearby memory locations when possible.
For a serial computational task, it is also possible to decouple the
data part from the logic part and generate a range of independent groups
of data that can be executed in parallel, as shown in Figure
:numref:`ch05/ch05-parallel_computing`.
![Serial computing and parallel computing](../img/ch05/parallel_computing.png)
:label:`ch05/ch05-parallel_computing`
These specific data-oriented operations performed at program runtime are
referred to as *schedules*. A schedule defines the following aspects:
(1) When and where should each value in a function be calculated?
(2) Where should the data be stored?
(3) How long is each value cached and communicated between its producer
and consumers, and when is it independently recomputed by each consumer
instead?
Simply put, a scheduling strategy is defined by a set of algorithms
designed during compilation based on the characteristics of target
hardware architecture to improve locality and parallelism. The purpose
of this is to ensure that the resulting executable file delivers optimal
performance at runtime. These algorithms have no effect on the
computation result; instead, they only adjust the computation process in
order to shorten the computation time.
## Combining Scheduling Strategies
In the realm of operator compilers, a common optimization technique
involves combining multiple abstracted scheduling strategies into a
comprehensive and efficient scheduling set through manual template
matching. However, this approach may not be fine-tuned and can be
labor-intensive when applied to achieve refined optimization across
different operators. To illustrate this, let's consider an optimization
algorithm implemented in the Tensor Virtual Machine (TVM). It
accelerates and optimizes a multiply-accumulate code segment on the CPU
by combining several fundamental scheduling strategies.
In Code `lst:before_tvm`, the basic computational logic is as
follows: Initialize tensor C, multiply tensor A by tensor B, and
accumulate the results to tensor C.
**lst:before_tvm**
```
for (m: int32, 0, 1024) {
for (n: int32, 0, 1024) {
C[((m*1024) + n)] = 0f32
for (k: int32, 0, 1024) {
let cse_var_2: int32 = (m*1024)
let cse_var_1: int32 = (cse_var_2 + n)
C[cse_var_1] = (C[cse_var_1] + (A[(cse_var_2 + k)]*B[((k*1024) + n)]))
}
}
}
```
Assuming that the data type is float and that tensors A, B, and C are of
size 1024 $\times$ 1024, then the total memory required by the tensors
is 1024 $\times$ 1024 $\times$ 3 $\times$ sizeof(float) = 12 MB. This
far exceeds the capacity of common caches (e.g., the L1 cache is 32 KB).
Therefore, if we want to compute on Tensor A, B, and C in a single
operation, we must store them in the main memory. However, the main
memory is distant from the compute core, resulting in significantly
lower access efficiency compared to using the cache for storage.
There are several scheduling strategies that can help improve
performance: tile, reorder, and split. The size of the L1 cache is 32
KB. To ensure that data used in every computation step is stored in the
cache, tiling based on the factors of 32 is performed. In this way, only
the tiny block formed by `m.inner` $\times$ `n.inner` needs to be taken
into account, and memory access of the innermost tiny block is
independent of the outer loops. A tiny block will occupy only 32
$\times$ 32 $\times$ 3 $\times$ sizeof(float), which is 12 KB in the
cache. The optimized code is shown in Code
`lst:after_tvm`. We perform tiling on loops m and n with factor 32,
as per the previous analysis. Similarly, we split loop k with factor 4
and then reorder k.outer and k.inner so that they enclose the inner
m and n loops.
**lst:after_tvm**
```
// Obtain an outer loop by tiling for (m: int32, 0, 1024) based on factor 32.
for (m.outer: int32, 0, 32) {
// Obtain an outer loop by tiling for (n: int32, 0, 1024) based on factor 32.
for (n.outer: int32, 0, 32) {
// Obtain an inner loop by tiling for (m: int32, 0, 1024) based on factor 32.
for (m.inner.init: int32, 0, 32) {
// Obtain an inner loop by tiling for (n: int32, 0, 1024) based on factor 32.
for (n.inner.init: int32, 0, 32) {
// Obtain the corresponding factors.
C[((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)) + n.inner.init)] = 0f32
}
}
// Obtain an outer loop by splitting for (k: int32, 0, 1024) based on factor 4, with reorder.
for (k.outer: int32, 0, 256) {
// Obtain an outer loop by splitting for (k: int32, 0, 1024) based on factor 4, with reorder.
for (k.inner: int32, 0, 4) {
// Obtain an inner loop by tiling for (m: int32, 0, 1024) based on factor 32.
for (m.inner: int32, 0, 32) {
// Obtain an inner loop by tiling for (n: int32, 0, 1024) based on factor 32.
for (n.inner: int32, 0, 32) {
// Outer axis factor obtained by tiling along axis n
let cse_var_3: int32 = (n.outer*32)
// Outer axis & inner axis factors obtained by tiling along axis m
let cse_var_2: int32 = ((m.outer*32768) + (m.inner*1024))
// Outer axis & inner axis factors obtained by tiling along axes m & n
let cse_var_1: int32 = ((cse_var_2 + cse_var_3) + n.inner)
// Split the computational logic into different layers so that data involved every loop can be stored in the cache.
C[cse_var_1] = (C[cse_var_1] + (A[((cse_var_2 + (k.outer*4)) + k.inner)] * B[((((k.outer*4096) + (k.inner*1024)) + cse_var_3) + n.inner)]))
}
}
}
}
}
}
```
## Finding Optimized Strategies with Polyhedral Models
Another optimization approach is to automatically select an operator
schedule from a schedule search space. A good example of this idea is
polyhedral compilation, which improves the generalization of operator
compilation at the expense of prolonged compile time.
Polyhedral compilation mainly optimizes the loops in user code by
abstracting each loop into a multidimensional space, computing instances
into points in the space, and dependencies between the instances into
lines in the space. The main idea of this algorithm is to model the
memory access characteristics in code and adjust the execution order of
each instance within each loop. In this way, it aims to enable better
locality and parallelism of the loop code under the new schedule.
Code `lst:before_poly` is used as an example to describe the
algorithm.
**lst:before_poly**
```
for (int i = 0; i < N; i++)
for (int j = 1; j < N; j++)
a[i+1][j] = a[i][j+1] - a[i][j] + a[i][j-1];
```
As shown in Figure :numref:`ch05/ch05-poly_test`, a memory access structure is first
modeled by using the polyhedral model algorithm, and then dependencies
(denoted by arrows) between instances (denoted by nodes) are analyzed.
![Polyhedral model of the samplecode](../img/ch05/poly_test.png)
:label:`ch05/ch05-poly_test`
Complex dependency analysis and schedule transformation are then
performed to obtain an optimal solution that fits the memory model.
Using the polyhedral model algorithm, the code is optimized to that
shown in Code `lst:after_poly`.
**lst:after_poly**
```
for (int i_new = 0; i_new < N; i_new++)
for (int j_new = i_new+1; j_new < i_new+N; j_new++)
a[i_new+1][j_new-i_new] = a[i_new][j_new-i_new+1] - a[i_new][j_new-i_new] + a[i_new][j_new-i_new-1];
```
The resulting code looks relatively complex. We can model the code (as
shown in Figure :numref:`ch05/ch05-poly`) to determine its performance
improvements. Through dependency analysis, we find that the loop
dependencies present in the source code are removed in the optimized
code, thereby increasing the opportunities for parallel computing.
Specifically, parallel computing is possible when the loop dependencies
are partitioned along the dashed lines based on the green blocks, as
shown in Figure :numref:`ch05/ch05-poly`.
![Optimization result with the polyhedralmodel](../img/ch05/poly.png)
:label:`ch05/ch05-poly`
We have introduced only the polyhedral compilation technique in this
section. However, other optimization techniques are available, such as
Ansor, a heuristic search method with pruning.
## Adaptation to Instruction Sets
We have previously explored the optimization techniques of operator
compilers. In this section, we build on this foundation to examine how
operator compilers adapt to instruction sets on different chips.
Typically, a general-purpose compiler is designed to be compatible with
as many backend architectures and instruction sets as possible. However,
this can present challenges when the compiler must handle backends with
different architectures and instruction sets.
Two common programming models adopted by AI processors are single
instruction, multiple data (SIMD) and single instruction, multiple
threads (SIMT). As shown in Figures
:numref:`ch05/ch05-SIMD` and
:numref:`ch05/ch05-SIMT`, respectively, SIMD corresponds to chips
with vector instructions, while SIMT corresponds to chips that support
multiple threads. Recently, some chips have begun to combine both
programming models in order to support both multithreaded parallel
computing and vector instructions. When handling different programming
models, an operator compiler adopts different optimization strategies,
such as vectorization.
![SIMD diagram](../img/ch05/SIMD.png)
:label:`ch05/ch05-SIMD`
![SIMT diagram](../img/ch05/SIMT.png)
:label:`ch05/ch05-SIMT`
Operator compilers place a strong emphasis on differentiated support in
the frontend, midend, and backend. In the frontend, support for multiple
backend instruction sets is added, allowing AI programmers to focus on
algorithm logic without having to worry about chip differences. In the
midend, the architectures of different chips are identified, which
allows for specific optimization methods to be implemented for each
chip. When generating backend code, the instruction sets of different
chips are further identified to ensure efficient execution on target
chips.
## Expression Ability
The representation capability of an operator compiler is important
because it determines how well the frontend can express the input code
in an IR without loss of syntax information. The frontend of an operator
compiler is often fed with code programmed in flexible languages (e.g.,
PyTorch code written in Python). However, flexible expressions (e.g.,
indexing and view syntax in Python) pose high requirements on the
frontend expression ability of operator compilers. From the model
perspective, the code that manages the inputs of an operator often
contains many control flow statements. Also, some models allow for dynamic-shape
operators whose shapes vary with control flow decisions across
iterations.
Additionally, many operators lack optimized implementations provided
directly by accelerator libraries (e.g., cuDNN); such operators are
referred to as long-tail operators. Long-tail operators can have highly
flexible syntax or abundant control flow statements and sometimes
support dynamic shapes, making it extremely difficult for the frontend
of existing operator
making it extremely difficult for the frontend of existing operator
compilers to express, optimize, or accelerate them. Consequently, such
operators have to be executed by the Python interpreter or slow virtual
machines, leading to a performance bottleneck in network execution. This
is why it is imperative to improve the expression ability of the
operator compiler frontend.
# Overview
Figure :numref:`ch07/ch07-compiler-backend-01` illustrates the
architecture of the AI compiler backend, situated between the frontend
and the hardware driver layer.
![Architecture of AI compiler backend](../img/ch07/compiler-backend-architecture.pdf)
:label:`ch07/ch07-compiler-backend-01`
Graph optimization is a crucial step that involves transforming the
Intermediate Representation (IR) into a format that aligns with the
hardware features, facilitating operator selection. Since the frontend's
IR is abstracted from low-level runtime details, additional effort is
required to map the IR to a set of operators, such as MatMul,
Convolution, and ReLU. Sometimes, a single operator is sufficient to
handle a subset of the IR's functions. In such cases, the operator
fusion technique can be employed to fuse a group of IR nodes together.
Similarly, if a direct backend counterpart for a complex IR node is
unavailable, it can be partitioned into smaller operators.
Once the graph optimization is complete, the compiler backend proceeds
with operator selection, which involves matching the optimized IR with
appropriate operators that can be executed on the target device with
optimal efficiency. This process is similar to pattern matching. While
the easiest approach would be to map each IR node to a separate hardware
operator, such an approach may not be hardware-friendly. Instead,
existing compilers generally provide multiple candidate operators for
each IR node. The following steps are typically involved in the operator
selection process:
1. The IR nodes received from the frontend are partitioned or fused to
generate a low-level IR that is meaningful to the hardware.
2. The compiler backend carefully selects operator mappings for the IR
nodes, aiming to create a complete sequence of operators.
3. The backend determines the format and data type of each input and
output, ensuring fine-grained optimization on the IR.
4. Finally, the compiler backend traverses the resulting sequence of
operators, allocates input and output memory for each operator, and
loads the operators onto the target device for computation.
By following this process, the compiler backend optimizes the IR by
selecting suitable operators, determining their input and output
requirements, and allocating memory accordingly. This enables efficient
execution of the AI program on the target device.
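A minimal sketch of the selection step above, using a hypothetical candidate table (the kernel names, supported devices, and costs are invented for illustration; real backends use far richer cost models):

```python
# Each IR op maps to candidate kernels: (kernel name, supported devices, cost).
CANDIDATES = {
    "matmul": [("matmul_fp16_tensorcore", {"gpu"}, 1.0),
               ("matmul_fp32", {"cpu", "gpu"}, 3.0)],
    "relu":   [("relu_fused", {"gpu"}, 0.1),
               ("relu_plain", {"cpu", "gpu"}, 0.3)],
}

def select_operators(ir_nodes, device):
    """Map each IR node to the cheapest kernel available on the target device."""
    plan = []
    for node in ir_nodes:
        viable = [(name, cost) for name, devs, cost in CANDIDATES[node]
                  if device in devs]
        if not viable:
            raise ValueError(f"no kernel implements {node!r} on {device!r}")
        plan.append(min(viable, key=lambda nc: nc[1])[0])
    return plan
```

For example, `select_operators(["matmul", "relu"], "gpu")` picks the tensor-core and fused kernels, while the same IR on `"cpu"` falls back to the generic implementations.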
To further enhance the performance of a single operator, the compiler
backend often utilizes an operator compiler like TVM (Tensor Virtual
Machine) or XLA (Accelerated Linear Algebra). An operator compiler
analyzes the statements in an operator implementation, and it offers
various levels of optimization, including operator-level optimizations,
code generation, and runtime support. This stack is designed to enable
efficient execution of an operator on a wide range of hardware
platforms.
# Chapter Summary
1. The compiler backend performs three primary tasks: graph
optimization, operator selection, and memory allocation.
2. Graph optimization reduces resource overhead, adapts the graph to
hardware capabilities, and enhances execution performance while
maintaining the model's numerical properties.
3. Graph optimization techniques can be hardware-agnostic (e.g., memory
I/O optimization) or hardware-specific (e.g., subgraph
transformation to adapt to hardware instruction restrictions).
4. Operator selection involves mapping the compute nodes in an IR to
suitable operators for hardware execution.
5. When selecting an optimized operator, factors such as data format
and type must be considered, as they impact operator performance on
the target hardware.
6. An IR is generated after graph optimization and operator selection.
Based on the IR, memory is allocated for input and output tensors of
each operator before launching them to hardware for execution.
7. Memory reuse is designed to improve memory utilization and
accommodate larger models within limited device memory.
8. Fusion of communication operators enhances communication efficiency.
Properly allocating memory for in-place operators reduces memory
footprint and improves computing efficiency.
9. Operator compilers play a vital role in optimizing hardware
performance. Critical optimization techniques include scheduling
strategies and the polyhedral model algorithm.
# Computational Graph Basics
A computational graph contains operators (as units of operations) and
tensors (as units of data). The operator nodes in a graph are connected
with directed edges, which indicate the state of each tensor and
dependencies between operators.
Figure :numref:`ch04/ch04-simpleDAG` shows a computational graph example
of $\bf{Z}$=ReLU$(\bf{X}\times\bf{Y})$.
![Simple computational graph](../img/ch04/simple-graph.png)
:label:`ch04/ch04-simpleDAG`
## Tensors and Operators
In mathematics, tensors are a generalization of scalars and vectors.
Machine learning defines multidimensional data as tensors. The rank of a
tensor refers to the number of axes (or dimensions) the tensor has. A
scalar is a rank-0 tensor containing a single value, without axes; a
vector is a rank-1 tensor with one axis; and a three-channel RGB color
image is a rank-3 tensor with three axes. See Figure
:numref:`ch04/ch04-tensor`.
![Tensors](../img/ch04/tensor.png)
:label:`ch04/ch04-tensor`
In a machine learning framework, a tensor stores not only data itself
but also attributes such as the data type, data shape, rank, and
gradient transfer status. Table
:numref:`ch04/ch4-tensor` describes the main attributes of a
tensor.
:Tensor attributes
| Tensor Attribute | Description |
|------------------|--------------------------------------------------------------------------------- |
| shape | Length of each dimension, for example, \[3,3,3\]. |
| dim | Number of axes (or dimensions). The value is 0 for a scalar and 1 for a vector. |
| dtype | Data type, such as bool, uint8, int16, float32, and float64. |
| device | Target device, such as a CPU or GPU. |
| name | Tensor name. |
:label:`ch04/ch4-tensor`
In the following, we explore each tensor attribute with image data as an
example. Assume that our machine learning framework loads a 96-pixel by
96-pixel RGB (3-channel) image and converts the image data into a tensor
for storage. A *rank*-3 tensor of *shape* \[96,96,3\] is generated, with
the three dimensions representing the image height, image width, and
number of channels, respectively. The pixels in the RGB image are
represented by unsigned integers ranging from 0 to 255. Therefore, the
*dtype* of the resulting tensor is uint8. The image data is normalized
before it is fed into a CNN for training. Specifically, its data type is
reformatted to float32 so that it is compatible with the default data
type of common machine learning frameworks.
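The attributes above can be illustrated with plain Python, using nested lists as a stand-in for a real tensor type (the all-white pixel values are hypothetical):

```python
# A 96x96 RGB image as a nested list of uint8-range values.
H, W, C = 96, 96, 3
image = [[[255] * C for _ in range(W)] for _ in range(H)]

# shape: length of each dimension; dim (rank): number of axes.
shape = (len(image), len(image[0]), len(image[0][0]))   # (96, 96, 3)
rank = len(shape)                                       # 3
# Normalization before training: uint8 [0, 255] -> float32-style [0.0, 1.0].
normalized = [[[px / 255.0 for px in pixel] for pixel in row] for row in image]
```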
Before training, the machine learning framework determines the compute
device (i.e., CPU, GPU, or other hardware) and stores the data and
weight parameters necessary for training in the memory of the
corresponding hardware --- as specified by the *device* attribute.
Typically, the device attribute of a tensor is automatically assigned by
the machine learning framework based on the hardware environment.
Tensors are either mutable or immutable. Mutable tensors store weight
parameters and are updated based on gradient information, for example,
convolution kernel tensors that participate in convolution operations.
Immutable tensors store initial user data or data input to models, for
example, the image data tensor mentioned above.
What does a tensor look like in machine learning settings? Most tensors,
like image data and convolution kernel tensors, are "rectangular" or
"cubic" in shape. That is, such a tensor has the same number of
elements along each of its axes. However, there are specialized tensors
that have different shapes: ragged and sparse tensors. As shown in
Figure :numref:`ch04/ch04-tensor1`, a tensor is ragged if it has
variable numbers of elements along some axes. Ragged tensors enable
efficient storage and processing of irregularly shaped data, such as
variable-length texts in natural language processing (NLP) applications.
Sparse tensors often handle graph data of graph neural networks (GNNs)
and are encoded using special formats such as the coordinate list (COO)
to improve storage efficiency.
![Types of tensors](../img/ch04/tensor-class.png)
:label:`ch04/ch04-tensor1`
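As a sketch of the COO format mentioned above (the sample matrix is arbitrary): a sparse tensor is stored as (row, column, value) triples for its nonzero entries only.

```python
dense = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 0],
]

# COO keeps only nonzero entries: 2 triples instead of 9 stored elements.
coo = [(r, c, v) for r, row in enumerate(dense)
       for c, v in enumerate(row) if v != 0]

def coo_to_dense(triples, shape):
    """Rebuild the dense matrix from its COO triples."""
    out = [[0] * shape[1] for _ in range(shape[0])]
    for r, c, v in triples:
        out[r][c] = v
    return out
```

The savings grow with sparsity, which is why GNN frameworks favor such encodings for large, mostly-zero adjacency data.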
Operators are the basic compute units of neural networks. They process
tensor data and implement common computational logic in machine
learning, including data transformation, conditional control,
mathematical calculation, etc. Based on their functionalities, operators
are classified into tensor operators, neural network operators, data
flow operators, and control flow operators.
1. **Tensor operators** involve tensor structure and mathematical
operations. Typical tensor structure operations include reshaping
tensors, permuting tensor dimensions, concatenating tensors, etc.
For example, we may need to change the dimension order (between
"channels first" and "channels last") of image data tensors in
CNN applications. Mathematical operations are tensor-based and
include matrix multiplication, norm calculation, determinant
calculation, eigenvalue calculation, etc. They are often seen in the
gradient computation of machine learning models.
2. **Neural network operators**, the foundation of neural network
models, are the most common operators, including feature extraction,
activation functions, loss functions, optimization algorithms, etc.
Feature extraction refers to extracting feature tensors from input
data in CNN tasks. With the nonlinear ability introduced by
activation functions, neural networks can model highly complex
relationships and patterns in data. Optimization algorithms are used
to update model parameters so that the loss function is minimized.
3. **Data flow operators** cover data preprocessing and loading. Data
preprocessing mainly refers to data resizing, padding,
normalization, and augmentation of mostly visual and textual data,
whereas data loading involves operations such as shuffling,
batching, and pre-fetching of the dataset. Data flow operators
transform raw input data into a format meaningful to the machine
learning framework and efficiently load the data to the network for
training or inference according to the defined number of iterations,
reducing memory usage and wait time.
4. **Control flow operators**, usually found in flexible and complex
models, are used to control data flows in computational graphs.
Typical control flow operators are conditional operators and loop
operators. They are provided by either the machine learning
framework or the frontend language. Control flow operations affect
data flows in both forward and backward computation of neural
networks.
## Computational Dependencies
In a computational graph, the dependencies between operators influence
the execution sequence and parallelism of operators. The computational
graphs involved in machine learning algorithms are directed acyclic
graphs, where data flows must not lead to circular dependencies. With a
circular dependency, the training program will run into an infinite loop
and never terminate by itself. Values propagated around such a loop tend
toward infinity or 0, yielding invalid results. To analyze the execution
sequence and facilitate model topology design, the following describes
the dependencies between the compute nodes in a computational graph.
As shown in Figure :numref:`ch04/ch04-dependence`, if the Matmul1 operator is
removed from the graph, there will be no input to the downstream
activation function, and the data flow will be interrupted. We can
therefore conclude that the operators in this computational graph depend
on each other with transitive relations.
![Computational dependencies](../img/ch04/dependence.png)
:label:`ch04/ch04-dependence`
There are three types of dependencies:
1. **Direct dependency**: For example, the ReLU1 node is directly
dependent on the Matmul1 node. That is, ReLU1 can run properly only
when it receives a direct output from Matmul1.
2. **Indirect dependency**: For example, the Add node indirectly
depends on the Matmul1 node. Specifically, Matmul1's output is
processed by one or more intermediate nodes and then transmitted to
the Add node. The Add node directly or indirectly depends on the
intermediate nodes.
3. **Mutual independence**: For example, the graph shows no
input/output dependency between Matmul1 and Matmul2, meaning that
the two nodes are independent of each other.
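These three relations can be checked mechanically by reachability over the graph's edges. Below is a sketch on a hypothetical graph shaped like the figure (the node names and edges are assumptions for illustration):

```python
EDGES = {  # producer -> consumers
    "Matmul1": ["ReLU1"],
    "ReLU1": ["Add"],
    "Matmul2": ["Add"],
    "Add": [],
}

def depends_on(node, upstream, edges):
    """True if `node` directly or transitively depends on `upstream`."""
    stack, seen = list(edges[upstream]), set()
    while stack:
        n = stack.pop()
        if n == node:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(edges[n])
    return False
```

Here `depends_on("ReLU1", "Matmul1", EDGES)` is a direct dependency, `depends_on("Add", "Matmul1", EDGES)` an indirect one, and `depends_on("Matmul2", "Matmul1", EDGES)` is `False`, i.e., mutual independence.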
In the computational graph shown in Figure
:numref:`ch04/ch04-recurrent`, the Add node indirectly depends on
the Matmul node; conversely, the Matmul node directly depends on the Add
node. The two nodes are stuck waiting for each other's output to start
their computation. When input data is manually assigned to the two nodes
at the same time, they will compute endlessly, and the training process
can never terminate by itself. A circular dependency produces a positive
feedback data flow, where data values overflow to positive infinity,
underflow to negative infinity, or tend to 0. These all lead to
unexpected training results. As such, we should avoid circular
dependencies between operators when designing deep learning models.
![Circular dependency](../img/ch04/recurrent.png)
:label:`ch04/ch04-recurrent`
In machine learning frameworks, the *unrolling* method is used to
represent loop iterations. Figure
:numref:`ch04/ch04-recurrent-1` shows a computational graph
involving three loop iterations. The subgraph of the loop body is
replicated three times (once per iteration) to produce
an unrolled loop, where the resulting subgraphs are concatenated in the
iteration sequence. The subgraph of one iteration has a direct
dependency on that of the previous iteration. In one computational
graph, tensors and operators are uniquely identified across the loop
iterations, even for the same operation. Unlike circular dependencies,
loop iterations do not involve mutual dependencies between operators
with unique identifiers. When a subgraph is replicated to produce an
unrolled loop, the replicated tensors and operators are assigned new
identifiers to avoid circular dependencies.
![Unrolled loop](../img/ch04/unroll.png)
:label:`ch04/ch04-recurrent-1`
## Control Flows
A control flow maintains the sequence of computation tasks, thereby
facilitating the design of flexible and complex models. By introducing a
control flow to a model, we can execute a node iteratively any number of
times or skip a node based on specific conditions. Many deep learning
models rely on control flows for training and inference. For example,
models built on recurrent neural networks (RNNs) and reinforcement
learning rely on recurrence relations and input status conditions to
complete the computation.
Popular machine learning frameworks provide two major types of control
flows:
1. **Frontend control flows**: Python control flow statements are used
to implement control decision-making in a computational graph.
Frontend control flows are easy to use in model building. However,
because the computation process of the machine learning framework
runs on the backend hardware and the control flow is decoupled from
the data flow, the computational graph cannot run entirely on the
backend hardware. As such, control flow implementations using the
frontend language are referred to as the *out-of-graph approach*.
2. **Framework control primitives**: Machine learning frameworks come
with built-in low-level fine-grained control primitive operators.
Such operators are executable on compute hardware. When they are
introduced to a model, the computational graph can run entirely on
the backend hardware. Such control flow implementations are
referred to as the *in-graph approach*.
To explain why we need these different approaches to implement control
flows, let's look at the differences between the two approaches.
The out-of-graph approach is familiar to Python programmers. This
flexible, intuitive approach allows direct use of Python commands such
as `if-else`, `while`, and `for` in building control flows.
The in-graph approach, by contrast, is more complicated. TensorFlow
provides a range of in-graph control flow operators (such as `tf.cond`
for conditional control, `tf.while_loop` for loop control, and `tf.case`
for branch control). These operators are composites of lower-level
primitive operators. The control flow representations adopted by the
in-graph approach are in a different style from common programming ---
this improves computing performance but comes at the expense of
usability.
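The stylistic difference shows even in a toy sketch: an in-graph conditional takes the predicate and both branches as arguments, mirroring the call shape of `tf.cond`, instead of a native Python `if`. The plain-Python body below is only a stand-in, not how a framework actually lowers the operator:

```python
def cond(pred, true_fn, false_fn):
    # In a real framework, both branch subgraphs are recorded in the graph
    # and the choice becomes a hardware-executable operator; here we simply
    # call the selected branch.
    return true_fn() if pred else false_fn()

# Out-of-graph style would be: y = 2 * 3 if flag else 2 + 3
y = cond(True, lambda: 2 * 3, lambda: 2 + 3)
```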
The out-of-graph approach is easier to use. However, not all backend
compute hardware is compatible with the frontend runtime environment,
and extra efforts may be needed to execute the frontend control flows.
Nevertheless, control flows implemented using the in-graph approach are
directly executable on hardware independent of the frontend environment,
improving efficiency throughout the model building, optimization, and
execution process.
The two approaches serve different application scenarios. To run tasks
such as model training, inference, and deployment on compute hardware
independent of the frontend environment, the in-graph approach is
recommended for building control flows. For model validation purposes,
the out-of-graph approach allows for higher efficiency in generating
model code from the model algorithm.
Major machine learning frameworks support both the out-of-graph and
in-graph approaches. In the following illustrations about the impact of
control flows on forward and backward computation, we adopt the
out-of-graph approach for control flow implementations, given that
frontend control flows are more popular in practice. The most common
control flows include conditional branches and loops. For a model
containing control flow operations, the control flow is replicated to
the gradient computational graph during backpropagation, so that the
required tensor gradients can be accurately calculated.
Code `ch04/code1` shows an example of simple conditional control,
where `matmul` indicates the matrix multiplication operator.
**ch04/code1**
```python
def control(A, B, C, conditional=True):
if conditional:
y = matmul(A, B)
else:
y = matmul(A, C)
return y
```
Figure :numref:`ch04/ch04-if` depicts the forward and backward
computational graphs of Code
`ch04/code1`. When running a model containing `if`
conditions, the program needs to know which branch of each condition is
taken so that it can apply the gradient computation logic to the right
branch. In the forward computational graph, tensor $\bf{C}$ does not
participate in computation due to conditional control. Similarly, in the
backward computational graph, tensor $\bf{C}$ is skipped in gradient
computation.
![Computational graphs of conditional control](../img/ch04/if.png)
:label:`ch04/ch04-if`
A control loop allows us to execute an operation in a loop zero or
multiple times. When the loop is unrolled, each operation is assigned a
unique identifier to identify different calls to the same operation.
Each iteration directly depends on the result of the previous one.
Therefore, one or more lists of tensors need to be maintained in the
control loop for storing per-iteration intermediate results used in the
forward pass and gradient computation. Code
`ch04/code2` shows a control loop example. In its unrolled
loop, $\bf{X_i}$ and $\bf{W_i}$ are the lists of intermediate result
tensors to be maintained.
**ch04/code2**
```python
def recurrent_control(X: Tensor, W: Sequence[Tensor], cur_num=3):
for i in range(cur_num):
X = matmul(X, W[i])
return X
# Unroll the loop to obtain an equivalent representation.
def recurrent_control(X: Tensor, W: Sequence[Tensor]):
X1 = matmul(X, W) # Let W = W[0], W1 = W[1], and W2 = W[2].
X2 = matmul(X1, W1)
Y = matmul(X2, W2)
return Y
```
The forward and backward computational graphs of Code
`ch04/code2` are shown in Figure
:numref:`ch04/ch04-while`. The gradient of the control loop is
also a loop, with the same number of iterations as the forward loop. The
gradient value output by one iteration serves as the input value for
calculating the gradient of the next iteration until the loop ends.
![Computational graphs of loop control](../img/ch04/while.png)
:label:`ch04/ch04-while`
## Gradient Computation Using the Chain Rule
In the loop unrolling example in Section 3.2.3, when input tensor
$\bf{X}$ is fed into the neural network, the data is propagated forward
one layer at a time in the computational graph, and the intermediate
variables are calculated and stored until $\bf{Y}$ is output after
multilayer computation. In DNN training, the loss function result is
calculated based on the output result of forward propagation and the
label value. The model backpropagates the loss function information
through the computational graph and updates the training parameters
based on computed gradients. Typically, backpropagation works by
computing the gradients of the loss function with respect to each
parameter. Backpropagation based on other information can also work but
is not discussed here.
The chain rule method is used to calculate the gradients with respect to
each parameter during backpropagation. In calculus, the chain rule
provides a technique for finding the derivatives of composite functions.
The derivative of a composite function at a given point is the product
of the derivatives of each individual function at the corresponding
point. Assume that *f* and *g* are functions mapped from the real number
*x*. If $y=g(x)$ and $z=f(y)=f(g(x))$, the derivative of *z* with
respect to *x* is
$$\frac{\partial z}{\partial x}=\frac{\partial z}{\partial y}\frac{\partial y}{\partial x}.$$
:eqlabel:`eq:ch04/chainrule`
The backpropagation algorithm of neural networks executes the chain rule
in the sequence defined by the backward computational graph. Generally,
neural networks accept 3D tensor inputs and output 1D vectors.
Therefore, we can generalize the gradient computation of Equation
:eqref:`ch04/chainrule` from scalar composite functions as
follows: Assuming that $\bf{X}$ is an *m*-dimensional tensor,
$\bf{Y}$ is an *n*-dimensional tensor, $\bf{z}$ is a 1D vector,
$\bf{Y}=g(\bf{X})$, and $\bf{z}=f(\bf{Y})$, the partial derivative of
$\bf{z}$ with respect to each element of $\bf{X}$ is
$$\frac{\partial z}{\partial x_i}=\sum_j\frac{\partial z}{\partial y_j}\frac{ \partial y_j}{ \partial x_i}.$$
:eqlabel:`eq:ch04/chainrule-1`
The equivalent form of Equation
:eqref:`ch04/chainrule-1` is
$$\nabla_{\bf{X}}\bf{z} = (\frac{\partial \bf{Y}}{\partial\bf{X}})^{\top}\nabla_{\bf{Y}}\bf{z},$$
:eqlabel:`eq:ch04/chainrule-2`
where $\nabla_{\bf{X}}\bf{z}$ represents the gradient matrix of $\bf{z}$ with
respect to $\bf{X}$.
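A scalar instance of the chain rule can be checked numerically. Below, $g(x)=x^2$ and $f(y)=\sin y$ are arbitrary choices; a central finite difference confirms the product form of the derivative:

```python
import math

def g(x): return x * x          # y = g(x)
def f(y): return math.sin(y)    # z = f(y)

x = 0.7
# Chain rule: dz/dx = (dz/dy) * (dy/dx) = cos(x**2) * 2x.
analytic = math.cos(g(x)) * 2 * x
# Central finite difference on the composite z = f(g(x)).
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
assert abs(analytic - numeric) < 1e-6
```

Automatic differentiation in a framework applies exactly this product rule node by node along the backward computational graph, reusing the intermediate value $g(x)$ stored in the forward pass.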
Figure :numref:`ch04/ch04-chain` shows the application of the chain rule
in neural networks, illustrating both forward and backward passes in a
single graph. The neural network performs matrix multiplication twice to
obtain the predicted value $\bf{Y}$, and then performs gradient
backpropagation based on the error between the output value and label
value to update the weight parameters to minimize the error. The weight
parameters to be updated include $\bf{W}$ and $\bf{W_1}$.
![Backpropagation computational graph](../img/ch04/chain.png)
:label:`ch04/ch04-chain`
The mean square error (MSE) is selected as the loss function in this
example. Two important questions arise here: How does the loss function
transfer the gradient information to $\bf{W}$ and $\bf{W_1}$ using the
chain rule method? And why do we need to calculate the gradients of
non-parameter data $\bf{X}$ and $\bf{X_1}$? To answer these questions,
let's analyze the computation details of forward and backward
propagation. First, the loss value is calculated through forward
propagation in three steps: (1) $\bf{X_1}=\bf{XW}$; (2)
$\bf{Y}=\bf{X_1W_1}$; and (3) ${\rm Loss}=\frac{1}{2}(\bf{Y}-{\rm Label})^2$.
The loss function is calculated to minimize the distance between the
prediction value and the label value. According to the chain rule,
backpropagation is performed through Equations
:eqref:`ch04/chainrule-3` and
:eqref:`ch04/chainrule-4` to calculate the gradients of the loss
function with respect to parameters $\bf{W}$ and $\bf{W_1}$:
$$\frac{\partial {\rm Loss}}{\partial \bf{W_1}}=\frac{\partial \bf{Y}}{\partial \bf{W_1}}\frac{\partial {\rm Loss}}{\partial \bf{Y}}$$
:eqlabel:`eq:ch04/chainrule-3`
$$\frac{\partial {\rm Loss}}{\partial \bf{W}}=\frac{\partial \bf{X_1}}{\partial \bf{W}}\frac{\partial {\rm Loss}}{\partial \bf{Y}}\frac{\partial \bf{Y}}{\partial \bf{X_1}}$$
:eqlabel:`eq:ch04/chainrule-4`
Both Equations
:eqref:`ch04/chainrule-3` and
:eqref:`ch04/chainrule-4` solve
$\frac{\partial {\rm Loss}}{\partial \bf{Y}}$, which corresponds to grad
$\bf{Y}$ in Figure :numref:`ch04/ch04-chain`.
$\frac{\partial {\rm Loss}}{\partial \bf{Y}}\frac{\partial \bf{Y}}{\partial \bf{X_1}}$
in Equation
:eqref:`ch04/chainrule-4` corresponds to grad $\bf{X_1}$ in
Figure :numref:`ch04/ch04-chain`. To calculate the gradient of model
parameter $\bf{W}$, the gradient of intermediate result $\bf{X_1}$ is
calculated. This also answers the second question raised above. The
gradients of non-parameter intermediate results are calculated to
facilitate gradient computation with regard to each parameter.
Because $\bf{X_1}=\bf{XW}$, $\bf{Y}=\bf{X_1W_1}$, and
${\rm Loss}=\frac{1}{2}(\bf{Y}-{\rm Label})^2$, Equations
:eqref:`ch04/chainrule-3` and
:eqref:`ch04/chainrule-4` are expanded to Equations
:eqref:`ch04/chainrule-5` and
:eqref:`ch04/chainrule-6`, respectively, according to Equation
:eqref:`ch04/chainrule-2`. Then, we can analyze how
variables participate in gradient computation when the machine learning
framework uses the chain rule to build a backward computational graph.
$$\frac{\partial {\rm Loss}}{\partial \bf{W_1}}=\frac{\partial \bf{Y}}{\partial \bf{W_1}}\frac{\partial {\rm Loss}}{\partial \bf{Y}}=\bf{X_1}^\top(\bf{Y}-{\rm Label})$$
:eqlabel:`eq:ch04/chainrule-5`
$$\frac{\partial {\rm Loss}}{\partial \bf{W}}=\frac{\partial \bf{X_1}}{\partial \bf{W}}\frac{\partial {\rm Loss}}{\partial \bf{Y}}\frac{\partial \bf{Y}}{\partial \bf{X_1}}=\bf{X}^\top(\bf{Y}-{\rm Label})\bf{W_1}^\top$$
:eqlabel:`eq:ch04/chainrule-6`
Equation
:eqref:`ch04/chainrule-5` uses intermediate result $\bf{X_1}$ in
the forward computational graph when calculating the gradient of
$\bf{W_1}$. In Equation
:eqref:`ch04/chainrule-6`, both input $\bf{X}$ and parameter
$\bf{W_1}$ are used for calculating the gradient of parameter $\bf{W}$.
This answers the first question. The gradient information transferred
backward from downstream network layers, and the intermediate results
and parameter values in forward computation, all have roles to play in
calculating the gradient of each parameter in the graph.
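Equations :eqref:`ch04/chainrule-5` and :eqref:`ch04/chainrule-6` can be verified numerically on tiny matrices (the 2x2 values below are arbitrary, and plain-Python helpers stand in for a framework's operators):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def loss(X, W, W1, Label):
    Y = matmul(matmul(X, W), W1)                      # forward pass
    return 0.5 * sum((Y[i][j] - Label[i][j]) ** 2
                     for i in range(2) for j in range(2))

X = [[1.0, 2.0], [3.0, 4.0]]
W = [[0.5, -1.0], [2.0, 0.1]]
W1 = [[1.5, 0.2], [-0.3, 0.8]]
Label = [[1.0, 0.0], [0.0, 1.0]]

X1 = matmul(X, W)
Y = matmul(X1, W1)
D = [[Y[i][j] - Label[i][j] for j in range(2)] for i in range(2)]  # Y - Label

grad_W1 = matmul(transpose(X1), D)                        # X1^T (Y - Label)
grad_W = matmul(transpose(X), matmul(D, transpose(W1)))   # X^T (Y - Label) W1^T

# Each analytic entry of grad_W matches a central finite difference on Loss.
eps = 1e-5
for i in range(2):
    for j in range(2):
        Wp = [row[:] for row in W]; Wp[i][j] += eps
        Wm = [row[:] for row in W]; Wm[i][j] -= eps
        num = (loss(X, Wp, W1, Label) - loss(X, Wm, W1, Label)) / (2 * eps)
        assert abs(grad_W[i][j] - num) < 1e-4
```

Note how the analytic formulas reuse the forward intermediates $\bf{X_1}$ and $\bf{Y}$, which is exactly why frameworks keep those buffers alive until the backward pass.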
Based on Figure :numref:`ch04/ch04-chain` and Equations
:eqref:`ch04/chainrule-3`,
:eqref:`ch04/chainrule-4`,
:eqref:`ch04/chainrule-5` and
:eqref:`ch04/chainrule-6`, when the chain rule is used to
construct a backward computational graph, the computation process is
analyzed and the intermediate results and gradient transfer status in
the model are stored. The machine learning framework improves the
backpropagation efficiency by reusing buffered computation results.
We can generalize the chain rule to wider applications. With flexible
control flows, the machine learning framework can quickly analyze the
computation processes of the forward data flow and backward gradient
flow by using computational graph technology, effectively manage the
lifetime of each intermediate result in memory, and improve the overall
computation efficiency.
# Computational Graph Functions
Early machine learning frameworks were designed mainly for fully
connected networks and convolutional neural networks (CNNs). Such neural
networks have serial layers, whose topology structures can be
represented in simple configuration files (e.g., Caffe model definition
in Protocol Buffers format).
Conversely, modern machine learning models have ever more complex
structures. Prominent examples include mixture-of-experts (MoE),
generative adversarial network (GAN), and attention models. To improve
the training efficiency with complex model structures (e.g., loops with
branching), machine learning frameworks are expected to quickly analyze
operator dependencies, gradient computation, and training parameters, to
facilitate model optimization, formulate scheduling strategies, and
automate gradient computation. As such, machine learning system
designers call for a common data structure to understand, represent, and
execute machine learning models. To this end, machine learning
frameworks introduce the computational graph technology while still
decoupling the frontend and backend languages in design, as shown in
Figure :numref:`ch04/ch04-DAG`. From a top-level view, computational
graph technology provides the following key functions:
![Computational graph-based architecture](../img/ch04/graph.png)
:label:`ch04/ch04-DAG`
1. **Unified representation of the computation process.** Developers
tend to write machine learning programs in high-level programming
languages (e.g., Python, Julia, and C++). However, because most
devices such as hardware accelerators provide only C/C++ APIs,
implementations of machine learning systems are largely restricted
to C/C++. Computational graph technology makes it possible to run
programs written in different high-level languages on common
low-level C/C++ system modules. As a unified representation, a
computational graph describes a model's input data, computational
logic (usually referred to as operators), and execution sequence of
operators.
2. **Automatic gradient computation.** The training program receives
data samples (or the training dataset), performs forward computation
through the network, and then calculates the loss value. Based on
the loss value, the machine learning system computes the gradient
for each model parameter and then updates the model parameters. The
gradient computation method should apply universally and run
automatically, regardless of the model topology and loss computation
method. Based on the computational graph, the machine learning
system can quickly analyze the gradient transfer relations between
parameters, thereby achieving automatic gradient computation.
3. **Lifetime analysis of model variables.** During model training,
many intermediate variables are generated, for example, the
activation values in the forward pass and the gradients in the
backward pass. Some of the intermediate variables generated in the
forward pass are used in conjunction with the gradients for updating
model parameters. With a computational graph, the machine learning
system can accurately analyze the lifetime of each intermediate
variable (i.e., from the time the variable is generated to the time
it is destroyed), helping the framework optimize memory management.
4. **Execution optimization.** User programs can have different network
    structures. With computational graph technology, the machine
    learning framework can analyze the model topology and operator
    dependencies, and automatically search for operator parallelization
    strategies to improve model execution efficiency.
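The unified-representation idea above can be sketched in a few lines of plain Python: operators become nodes with explicit input edges, and execution walks the nodes in dependency order. All names (`Node`, `topo_order`, `run`) are illustrative, not any framework's API, and a scalar multiply stands in for a real tensor operator.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)                # eq=False keeps nodes hashable by identity
class Node:
    """One operator: a name, a Python callable, and its input nodes."""
    name: str
    op: callable = None
    inputs: list = field(default_factory=list)

def topo_order(outputs):
    """Depth-first walk: every node is listed after all of its inputs."""
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for i in n.inputs:
                visit(i)
            order.append(n)
    for out in outputs:
        visit(out)
    return order

def run(outputs, feeds):
    """Execute the graph; `feeds` maps leaf nodes to concrete input values."""
    values = dict(feeds)
    for n in topo_order(outputs):
        if n not in values:
            values[n] = n.op(*(values[i] for i in n.inputs))
    return [values[o] for o in outputs]

x, w = Node("X"), Node("W")                     # leaves: fed at run time
y = Node("matmul", lambda a, b: a * b, [x, w])  # scalar * stands in for matmul
print(run([y], {x: 3, w: 4}))                   # -> [12]
```

Note how the graph itself is language-neutral: once the `Node` structure exists, any backend that understands it can execute the model.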
# Generating a Computational Graph
In the previous section, we explored the ingredients of a computational
graph. Now let's proceed to the next question --- how is a computational
graph automatically generated? Machine learning frameworks support two
approaches to implementing computational graphs: static and dynamic. The
static approach builds a static (unchanging) graph from information
such as the network topology and parameter variables described in the
frontend language. Because the resulting graph is independent of the
frontend language, static graphs are especially suitable for model
deployment (e.g., deploying a facial recognition application on mobile
devices).
Unlike the static approach, the dynamic approach dynamically generates a
temporary graph based on the frontend description each time the model is
executed. Dynamic graphs are easy to debug, making it possible to
fine-tune models efficiently on the fly. Major machine learning
frameworks such as TensorFlow and MindSpore are compatible with both
approaches. And although PyTorch uses dynamic graphs, it also offers
dynamic-to-static conversion support for efficient model execution. To
choose the right approach for a specific task, we need to consider the
task requirements as well as the pros and cons of each approach.
## Static Graph
The static graph approach decouples the definition and execution
processes. That is, a static graph is compiled before it is executed, as
shown in Figure :numref:`ch04/ch04-static`.
![Generating and executing a static graph](../img/ch04/static.png)
:label:`ch04/ch04-static`
When a model program is generated using the frontend language, the
machine learning framework first analyzes the model topology for
information such as the connections between network layers, parameter
variable settings, and loss functions. The framework then compiles the
model description into fixed code (i.e., a static computational graph)
that can be invoked and executed by the computing backend. In this case,
subsequent training or inference on this model is no longer
frontend-dependent. Specifically, when input data is fed into the static
graph, the operators in the graph are directly scheduled to hardware for
execution. And to improve hardware computational efficiency, we can also
convert a static graph into other equivalent structures through various
optimization strategies.
Code `ch04/code4` shows an example of generating and executing a
simple static graph. In the frontend definition phase, some machine
learning frameworks require developers to declare predefined
configuration items (such as tensor placeholders, loss functions,
optimization functions, network topology, runtime environments, and
network executors) and to express in-graph control statements with
control flow operators. Machine learning framework design has recently
improved to provide easy-to-use APIs and a unified model building
paradigm; for example, MindSpore offers a unified frontend programming
representation that integrates dynamic and static modes. To illustrate,
let's consider the following simple model.
**ch04/code4**
```python
def model(X, flag):
if flag > 0:
Y = matmul(W1, X)
else:
Y = matmul(W2, X)
Y = Y + b
Y = relu(Y)
return Y
```
The machine learning framework does not load input data when generating
a static graph. Instead, *placeholder* tensors are used to hold places
of input data. In the static graph defined in Code
`ch04/code4`, we need to create a placeholder for input
$\bf{X}$ in line 1. Because no actual input is fed into the model during
static graph generation, the control flow defined in line 2 cannot make
control decisions at build time. As such, we need to add the control
flow operator and the computational subgraph of each branch to the
static graph. When the model receives actual inputs during runtime,
different branches are taken (by running the corresponding computational
subgraphs) depending on different inputs. However, not all machine
learning frameworks are able to compile Python control flows as their
static graph equivalents. In order to implement control flows in this
case, we can use the control primitives provided by the framework.
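A hedged sketch of such a control primitive: the hypothetical `cond` below records *both* branch subgraphs at graph-build time and defers the branch decision until real data arrives. Plain scalars stand in for tensors, and none of the names correspond to a real framework API.

```python
def cond(pred, true_fn, false_fn):
    """Hypothetical control-flow operator: BOTH branch subgraphs are part of
    the static graph; the decision is deferred to execution time."""
    branches = {"true": true_fn, "false": false_fn}   # both recorded
    def run(flag_value):                  # invoked only when data arrives
        return branches["true"]() if flag_value > 0 else branches["false"]()
    return run

W1, W2, X = 2, 3, 5                       # scalars standing in for tensors
branch = cond(None, lambda: W1 * X,       # subgraph: matmul(W1, X)
                    lambda: W2 * X)       # subgraph: matmul(W2, X)

print(branch(1))    # -> 10  (true branch selected at run time)
print(branch(-1))   # -> 15  (false branch selected at run time)
```

The key point is that the Python `if` in the model source never executes at build time; its two arms both survive inside the graph.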
![Generating a static graph](../img/ch04/static_gen.png)
:label:`ch04/ch04-static-gen`
Static computational graphs offer two distinct advantages. First, they
yield better performance with less memory. When building a static graph,
the machine learning framework acquires the complete model topology
containing global information of the model, which facilitates the
formulation of graph optimization strategies (e.g., the operator fusion
strategy that fuses two or more operators into a larger one). As shown
in Figure :numref:`ch04/ch04-static-gen`, the Add and ReLU operators are
fused into one operator to reduce the loads/stores of intermediate
results and low-level scheduling overhead, thereby improving the
execution performance and efficiency with a lower memory footprint.
Static graphs allow for many optimization strategies at build time,
which we will discuss in later sections.
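The Add/ReLU fusion just described can be sketched in plain Python (illustrative names, lists of scalars standing in for tensors): the separate operators materialize and re-read an intermediate result, while the fused operator makes a single pass per element.

```python
def add_op(y, b):
    # Separate Add: materializes an intermediate buffer for (y + b).
    return [yi + bi for yi, bi in zip(y, b)]

def relu_op(y):
    # Separate ReLU: reads that intermediate buffer back from memory.
    return [max(yi, 0.0) for yi in y]

def fused_add_relu(y, b):
    # Fused operator: one pass per element, no intermediate buffer stored.
    return [max(yi + bi, 0.0) for yi, bi in zip(y, b)]

y, b = [-2.0, 0.5, 3.0], [1.0, -1.0, 0.0]
print(relu_op(add_op(y, b)))   # -> [0.0, 0.0, 3.0]
print(fused_add_relu(y, b))    # same result, one pass instead of two
```

In a real framework the fused kernel runs on the accelerator and the savings come from avoided loads/stores and dispatch overhead; this sketch only demonstrates the semantic equivalence.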
Second, by converting static graphs into executable code within the
machine learning framework, we can directly deploy our models on various
hardware platforms to provide efficient inference services. Also, we can
store static graphs using serialization techniques for future execution
(either model training or inference), eliminating the need to rebuild
the frontend source code from scratch every time before execution.
Once the frontend code of the model is compiled into a static graph, the
graph structure is fixed. If we introduce any optimizations to the
graph, the optimized code can differ significantly from the original.
However, the optimized code is not intuitively visible, meaning that it
is sometimes impossible to locate a runtime error based on the returned
code line number in the optimized code. Consider a simple case. Assuming
that the Add and ReLU operators in Code
`ch04/code4` have been fused for optimization, if a runtime
error related to the fused operator is reported, it would be hard for us
to determine the exact error location (Add or ReLU).
In addition, during the often daunting process of model debugging and
testing, intermediate results cannot be printed in real time. To
inspect them, we need to insert additional code into the source and
then recompile and execute it, making debugging less efficient. By
contrast, the dynamic graph approach offers more flexibility.
## Dynamic Graph
Figure :numref:`ch04/ch04-eager1` shows the principle of the dynamic
graph approach. A dynamic graph is defined as it runs. The frontend
interpreter parses the graph code and the machine learning framework
distributes the operators in the graph to the backend for just-in-time
(JIT) execution. Adopting the user-friendly imperative programming
paradigm, the dynamic graph approach allows developers to create neural
network models at the frontend and is therefore favored by a vast number
of deep learning researchers.
![Dynamic graph principle](../img/ch04/eager.png)
:label:`ch04/ch04-eager1`
Next, we reuse the pseudocode in the previous section to compare the
dynamic and static graph approaches.
While these two approaches differ only slightly in their frontend
representations, they differ dramatically in terms of their compilation
and execution mechanisms. Unlike the static graph approach, the dynamic
graph approach calls the built-in operator distribution function of the
machine learning framework through the Python API to distribute Python
operators to the hardware backend (e.g., CPU, GPU, or NPU) for
accelerated computing, which then returns the computational result to
the frontend. This process does not generate a static computational
graph. Instead, the framework describes the model topology using the
frontend language, schedules and executes the model based on
computational dependencies, and dynamically generates a temporary graph.
Figure :numref:`ch04/ch04-dynamic-gen` shows the process of generating a
dynamic graph.
![Generating a dynamic graph](../img/ch04/eager-gen.png)
:label:`ch04/ch04-dynamic-gen`
Forward computation is run through the neural network in the sequence
defined by the model declaration. Once the model receives input
$\bf{X}$, the machine learning framework starts to generate a dynamic
graph by adding the input node to the graph and sending the data to the
downstream node. The control flow (if available) makes a data flow
decision immediately. For example, in Figure
:numref:`ch04/ch04-dynamic-gen`, if the conditional returns true,
only the Matmul operator node with respect to tensor $\bf{W1}$ is added
to the graph. Then, the machine learning framework inserts the Add and
ReLU operator nodes based on the operator sequence and computational
dependencies defined in the code. For each newly added operator node,
the machine learning framework distributes and executes the operator,
returns the computational result, and prepares to pass the result to the
next node. When forward computation resumes, the last dynamic graph
becomes invalid and a new dynamic graph is created according to current
input and control decision. In contrast with a static graph that
represents the entire model described in the frontend language, a
dynamic graph is generated on the fly as the control flow and data flow
evolve over time. For this reason, the machine learning framework has
few opportunities to optimize the model in the dynamic graph setting.
In the static graph setting, as the model definition is entirely
available, a complete forward computational graph and a complete
backward computational graph can be constructed simultaneously. However,
in the dynamic graph setting, gradients are calculated for
backpropagation as the forward pass proceeds. Specifically, the machine
learning framework collects information of each backward operator and
tensor participating in gradient computation based on the information of
each operator called in the forward pass. Once the forward pass ends,
the operator and tensor information for backpropagation becomes
available. With this information, the machine learning framework creates
a backward computational graph and runs it on hardware to complete
gradient computation and parameter update.
As shown in Figure :numref:`ch04/ch04-dynamic-gen`, when the Matmul operator with
respect to tensor $\bf{W1}$ is called, the framework runs the Matmul
operator to calculate the product of inputs $\bf{X}$ and $\bf{W1}$, and
then records the operator and tensor $\bf{X}$ that will participate in
backpropagation based on the backward computation process
Grad\_$\bf{W1}$=Grad\_$\bf{Y}*\bf{X}$, thereby completing the forward
pass and producing a backward computational graph.
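The tape-recording mechanism described above can be sketched as a miniature reverse-mode autodiff. `Var`, `mul`, and `backward` are illustrative names (not a framework API), and the sketch assumes a tree-shaped graph; a real framework would process nodes in reverse topological order over a general DAG.

```python
class Var:
    """A value plus the record needed for backpropagation (illustrative)."""
    def __init__(self, value, grad_fn=None):
        self.value, self.grad_fn, self.grad = value, grad_fn, 0.0

def mul(a, b):
    # Forward result plus a backward rule: d(a*b)/da = b, d(a*b)/db = a
    # (compare Grad_W1 = Grad_Y * X in the text).
    return Var(a.value * b.value,
               grad_fn=lambda g: [(a, g * b.value), (b, g * a.value)])

def add(a, b):
    return Var(a.value + b.value, grad_fn=lambda g: [(a, g), (b, g)])

def backward(out):
    # Replay the recorded tape in reverse, accumulating gradients.
    out.grad = 1.0
    stack = [out]
    while stack:
        v = stack.pop()
        if v.grad_fn is not None:
            for parent, g in v.grad_fn(v.grad):
                parent.grad += g
                stack.append(parent)

x, w = Var(3.0), Var(4.0)
y = mul(x, w)          # forward pass records the backward rule as it runs
backward(y)
print(x.grad, w.grad)  # -> 4.0 3.0
```

Each forward call stores exactly the operands the backward rule will need, which is why the backward graph only becomes complete once the forward pass ends.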
Although the optimization techniques useful in the static graph setting
do not work for dynamic graphs (because the complete network structure
is unknown until the dynamic graph runs), researchers and developers can
easily analyze errors and debug results during model testing and
optimization. This is made possible by dynamic graphs supporting JIT
computing and returning computational results immediately with the
execution of each statement.
Also, the dynamic graph approach enables flexible execution using native
control flows provided by the frontend --- unlike static graphs, which
involve complex control flows along with programming and debugging
difficulties. Consequently, the dynamic graph approach lowers the
barriers to programming for beginners while also improving the iteration
efficiency of algorithm development and model optimization.
## Dynamic Graph vs. Static Graph
The two approaches for implementing computational graphs have their pros
and cons, as described in
Table :numref:`ch04/ch4-graph`.
:Static graph vs. dynamic graph
|Feature |Static Graph |Dynamic Graph |
|---------------------------------|-------------------------------------------------|---------------------------------------------- |
|On-the-fly intermediate results |No |Yes |
|Code debugging |Difficult |Easy |
|Control flow implementation |Specialized syntax |Frontend syntax |
|Performance |Better, supporting extensive optimization strategies |Worse, supporting limited graph optimizations |
|Memory footprint |Low |High |
|Direct deployment |Yes |No |
:label:`ch04/ch4-graph`
Compared with the dynamic graph approach, the static graph approach
seems to be less user-friendly to developers because intermediate
results are not available on the fly, code debugging is difficult, and
implementing control flows is complex. However, static graphs ensure
higher execution performance than dynamic graphs. See the example in
Code `ch04/code5`.
**ch04/code5**
```python
def model(X1, X2):
Y1 = matmul(X1, W1)
Y2 = matmul(X2, W2)
Y = Y1 + Y2
output = relu(Y)
return output
```
If the static approach is used to implement Code
`ch04/code5`, the machine learning framework creates a
complete computational graph. Because tensors $\bf{Y_1}$ and $\bf{Y_2}$
are computed independently from each other, we can implement automatic
parallelism on them in order to improve the computational efficiency.
Furthermore, the static approach allows many more optimization
strategies to improve efficiency while also lowering memory footprint,
for example, fusing operators Add and ReLU to reduce the loads and
stores of the intermediate variable $\bf{Y}$. Conversely, if the dynamic
approach is used without a manually configured parallelism strategy, the
machine learning framework is unaware of the independence between
operators due to the lack of a complete computational graph.
Consequently, the framework has to execute the operators, including Add
and ReLU, in a defined order and store the intermediate variable
$\bf{Y}$. To further reduce memory footprint, the static approach
narrows down the intermediate variables to be stored for backpropagation
beforehand in the forward pass, based on the forward and backward
computational graphs defined prior to execution. This is not feasible in
the dynamic approach, where the backward computational graph is defined
only after the forward pass is complete. As such, more intermediate
variables have to be stored in the forward pass to ensure the
backpropagation efficiency, resulting in higher memory footprint.
To choose one approach over the other, we should weigh their pros and
cons against the specific task requirements. For academic research or
during model design and debugging, the dynamic graph approach is
recommended because it allows quick testing of experimental ideas and
iterative updates to the model structure. When the model structure is
fixed, however, the static graph approach offers higher efficiency for
accelerating training or deploying the model on specific hardware.
## Conversion Between and Combination of Dynamic and Static Graphs
:label:`conversion_between_and_combination_of_dynamic_and_static_graphs`
Dynamic graphs are easy to debug and suitable for model design and
testing, whereas static graphs improve execution efficiency and shorten
model training time. Is there a way for the machine learning framework
to combine the merits of both approaches? Major machine learning
frameworks, such as TensorFlow, MindSpore, PyTorch, and PaddlePaddle,
have added support to convert between dynamic and static graphs,
allowing developers to program using the dynamic graph approach and
letting the framework automatically convert the code to a static
equivalent for execution.
Table :numref:`ch04/ch4-eagertoscript` lists the APIs for dynamic graph
to static graph conversion provided by major frameworks.
:Dynamic graph to static graph conversion support of major frameworks
| Framework | Dynamic Graph to Static Graph Conversion |
|--------------|------------------------------------------|
| TensorFlow | `@tf.function`: builds a static graph, where AutoGraph automatically transforms a control flow to the equivalent static statement. |
| MindSpore | `context.set_context(mode=context.GRAPH_MODE)`: static graph mode; `@ms_function`: builds a static graph from source code. |
| PyTorch | `torch.jit.trace()`: builds a static graph by tracing operators; `torch.jit.script()`: builds a static graph from source code. |
| PaddlePaddle | `paddle.jit.TracedLayer.trace()`: builds a static graph by tracing operators. |
:label:`ch04/ch4-eagertoscript`
These dynamic-to-static conversion methods fall into the following two
categories:
1. **Tracing**: A static graph is built by tracing operator scheduling
in a dynamic graph.
2. **Source code transformation**: The frontend code is inspected and
built as static graph code. And the static graph executor is
automatically called to run the static graph.
The *tracing* method goes through two simple phases. The first is to
generate a dynamic graph, following a workflow similar to that shown in
Figure :numref:`ch04/ch04-dynamic-gen`. The machine learning framework
runs the created dynamic graph and traces the data flow and operator
scheduling in the dynamic graph to produce a static graph. Note that the
dynamic graph is not destroyed; instead, it is preserved as a static
graph for subsequent execution. As the machine learning framework
finishes executing the dynamic graph, a static graph is produced. In the
second phase when the model is called again, the machine learning
framework runs the static graph for computation. The tracing technique
only traces the operators scheduled when the dynamic graph is run for
the first time. However, if the model has a data-dependent conditional,
only one branch of the conditional can be traced --- the traced graph
would be unable to take alternate branches. Similarly, the traced graph
cannot include every iteration if there is a data-dependent loop.
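The single-branch limitation is easy to demonstrate with a toy tracer (all names illustrative): the tracer only sees the operators that actually executed for the example input, so the Python-level `if` vanishes from the trace.

```python
trace = []

def traced(name, fn):
    """Wrap an operator so each call is recorded in the trace."""
    def wrapper(*args):
        trace.append(name)
        return fn(*args)
    return wrapper

matmul_w1 = traced("matmul_W1", lambda x: 2 * x)
matmul_w2 = traced("matmul_W2", lambda x: 3 * x)

def model(x, flag):
    if flag > 0:               # Python-level branch: invisible to the tracer
        y = matmul_w1(x)
    else:
        y = matmul_w2(x)
    return y

model(5, flag=1)               # trace with an input where flag > 0
print(trace)                   # -> ['matmul_W1']  (W2 branch never recorded)
```

A graph replayed from this trace would always take the `W1` path, even for inputs where `flag <= 0`.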
Unlike dynamic graph code which is parsed and executed by the frontend
interpreter, a static graph must be first created by the graph compiler
of the machine learning framework before execution. Because the graph
compiler cannot directly deal with dynamic graph code, the source code
transformation--based method is introduced to convert the dynamic graph
code into static code description.
The *source code transformation*--based method can overcome the
drawbacks involved in the tracing method and also consists of two
phases, as shown in Figure :numref:`ch04/ch04-ast`. The first involves lexical and syntax
analysis. Specifically, the lexical analyzer scans and analyzes every
character in the dynamic graph code, splits the source text by removing
any white spaces or comments, and returns a stream of tokens. Then, the
syntax analyzer (parser) analyzes the token stream, checks it for
syntax errors, and generates a parse tree as the output of this phase. In the
second phase, the built-in translators of the machine learning framework
scan and translate each part of the abstract syntax tree to map the
grammatical structures from dynamic graph format into static graph
format. Any control flow written in the frontend language is transformed
into the corresponding static graph API in this phase, so as to include
every branch of the control flow in the resulting graph. Next, we can
easily generate static graph code from the translated syntax tree.
![Source code transformation](../img/ch04/ast.png)
:label:`ch04/ch04-ast`
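The first phase can be sketched with Python's standard `ast` module: parse the dynamic graph code into a syntax tree, then walk it to locate the control-flow constructs a framework translator would rewrite. The `IfFinder` visitor below is a stand-in for such a translator, not any framework's actual implementation.

```python
import ast
import textwrap

src = textwrap.dedent("""
    def model(X, flag):
        if flag > 0:
            Y = matmul(W1, X)
        else:
            Y = matmul(W2, X)
        return Y
""")

tree = ast.parse(src)      # lexical + syntax analysis -> abstract syntax tree

class IfFinder(ast.NodeVisitor):
    """Stand-in for a framework translator: count the Python `if`
    statements that would be rewritten into static-graph cond operators."""
    def __init__(self):
        self.ifs = 0
    def visit_If(self, node):
        self.ifs += 1
        self.generic_visit(node)

finder = IfFinder()
finder.visit(tree)
print(finder.ifs)          # -> 1 control-flow construct to translate
```

A real translator would subclass `ast.NodeTransformer` instead, replacing each `If` node with a call to the framework's cond API before emitting static graph code.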
In many cases, either tracing or source code transformation alone is
the more convenient way to convert a model to a static graph, but the
two methods can also be combined to suit different segments of a model.
PyTorch, for instance, offers both and often supports a hybrid
approach. Scripted functions can invoke traced functions, which is
useful when a model needs control flow around a simple traced module,
such as running beam search over an encoder produced by tracing in a
sequence-to-sequence model. Conversely, traced functions can call
scripted functions, which helps when only a small part of a model,
typically a feed-forward network, requires control flow.
To improve the computational efficiency, we can transform the entire
model graph for fast deployment on hardware. Alternatively, we can
consider transforming some of the model functions into static subgraphs
and embedding them into the global dynamic graph as individual
operators, so that these exact functions would run in the form of static
graphs at execution time. This not only improves computational
efficiency but also retains flexibility for code debugging.
Code `ch04/code6` shows a simple model, which can be built into a
dynamic graph as a whole. In this example, we transform the
`add_and_relu` module into a static subgraph. The model runs on the
input data in a predefined sequence, resulting in a temporary dynamic
graph. When the `Y=add_and_relu(Y,b)` statement is executed, the machine
learning framework automatically runs the static subgraph transformed
from the module, achieving a performance gain by combining the
advantages of dynamic and static graphs.
**ch04/code6**
```python
def add_and_relu(Y, b):
Y = Y + b
Y = relu(Y)
return Y
def model(X, flag):
if flag > 0:
Y = matmul(W1, X)
else:
Y = matmul(W2, X)
Y = add_and_relu(Y, b)
return Y
```
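A hedged sketch of this hybrid mode: the hypothetical `static_subgraph` decorator below stands in for real graph compilation, caching a "compiled" form on the first call while the surrounding model continues to run eagerly.

```python
compiled = {}    # cache of already-compiled subgraphs

def static_subgraph(fn):
    """Hypothetical decorator: compile fn into a static subgraph once,
    then reuse the compiled form on every later call."""
    def wrapper(*args):
        if fn not in compiled:
            compiled[fn] = fn          # stand-in for real graph compilation
        return compiled[fn](*args)
    return wrapper

@static_subgraph
def add_and_relu(y, b):
    return max(y + b, 0)

def model(x, flag, W1=2, W2=3, b=-1):
    y = W1 * x if flag > 0 else W2 * x   # eager (dynamic graph) part
    return add_and_relu(y, b)            # runs as a static subgraph

print(model(4, flag=1))   # -> 7  (relu(2*4 - 1))
```

The eager part keeps native Python control flow and easy debugging, while the hot inner function pays the compilation cost once and benefits from static-graph execution thereafter.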
Dynamic-to-static conversion is most often used at the model deployment
stage, as a workaround for the constraints of deploying dynamic graphs
on hardware: a dynamic graph requires the frontend model definition
code for topology discovery in addition to the file of already-trained
parameters. To remove this frontend dependency, once model training in
dynamic graph mode is complete, we can convert the model into static
graph format and serialize the model and parameter files, thereby
expanding the list of supported hardware.
However, translating a dynamic graph into a static graph becomes more
intricate when backward-graph dependencies and dynamic shapes are
involved, and complex graph transformations can compromise the
performance of the execution engine. To address this, frameworks such
as PyTorch have introduced more aggressive dynamic transformation
methods. PyTorch's TorchDynamo module not only performs source code
transformation at the bytecode level, but also hooks into CPython's
frame evaluation API to capture operator graphs just in time, falling
back to the interpreter for code it cannot capture. This combination of
compiler and interpreter delivers high performance while preserving the
flexibility of dynamic graphs.
# Computational Graph
In this chapter, we look at the following question: How does a machine
learning system efficiently execute such a program on hardware? We can
break this down into three sub-questions: How do we schedule and execute
the model described by a machine learning program? How do we improve the
model scheduling and execution efficiency? And can we implement
automatic gradient computation for updating the model? The key to
answering these questions is computational graph technology. To explain
this technology, this chapter explains the following key aspects:
1. Computational graph basics
2. Generation of static and dynamic computational graphs
3. Common execution methods of computational graphs
```toc
:maxdepth: 2
Computational_Graph_Functions
Computational_Graph_Basics
Generating_a_Computational_Graph
Scheduling_and_Executing_Computational_Tasks
Chapter_Summary
Further_Reading
```
# Scheduling and Executing Computational Tasks
Training a model is conducted by scheduling the execution of the
operators in a computational graph. From a broad perspective, a training
job runs a computational graph for a defined number of iterations,
relying on optimal scheduling of tasks such as data loading and training
(inference) execution. Within each iteration, we need to analyze
operator-level scheduling based on the graph topology, computational
dependencies, and control flows. We optimize the scheduling and
execution of computational graphs to make full use of computing
resources, improve computational efficiency, and shorten the model
training and inference time. The following introduces the typical
techniques of computational graph scheduling.
Depending on how the graph is generated, scheduling and execution fall
into three modes: operator scheduling, whole-graph scheduling, and
combined operator-and-subgraph scheduling. These correspond to the
dynamic graph, static graph, and hybrid dynamic-static mechanisms of
computational graph generation. The following subsections introduce
each mode in detail.
## Operator Scheduling
In operator scheduling, the operators that make up the algorithm or
model are dispatched and executed one by one by the Python runtime.
This mechanism is used when the computational graph executes in dynamic
graph mode, such as PyTorch's default execution mode and TensorFlow's
eager mode.
Operator scheduling involves two steps. First, following the call
sequence of the operator declarations in the model, the dynamic
computational graph yields a linear operator scheduling sequence.
Second, the ordered operators are distributed to instruction streams
for execution.
In Figure :numref:`ch04/ch04-diaoduzhixing`, the directed acyclic graph on
the left contains five nodes a, b, c, d, and e and four dependency edges
a-\>d, b-\>c, c-\>d, and d-\>e (e.g., a-\>d indicates that d depends on
a). According to the operator call sequence of the model code, such as
a-\>b-\>c-\>d-\>e, all operator nodes are put into the queue in turn,
and the scheduling ends.
![Operator scheduling and execution](../img/ch04/schedule.png)
:label:`ch04/ch04-diaoduzhixing`
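The linear scheduling of this example can be sketched with Kahn's algorithm over the figure's dependency edges (pure Python, illustrative names):

```python
from collections import deque

# Dependency edges from the figure: a->d, b->c, c->d, d->e,
# stored as "operator -> the inputs it waits on".
deps = {"a": [], "b": [], "c": ["b"], "d": ["a", "c"], "e": ["d"]}

def linear_schedule(deps):
    """Kahn's algorithm: repeatedly enqueue operators whose inputs are done."""
    remaining = {op: set(ins) for op, ins in deps.items()}
    ready = deque(sorted(op for op, ins in remaining.items() if not ins))
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for other, ins in remaining.items():
            if op in ins:                 # `other` was waiting on `op`
                ins.discard(op)
                if not ins:               # all inputs done: ready to run
                    ready.append(other)
    return order

print(linear_schedule(deps))   # -> ['a', 'b', 'c', 'd', 'e']
```

The resulting queue matches the model's call order a->b->c->d->e, since that order already respects every dependency edge.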
With this ordering in hand, we distribute the operators and their data
to the GPU hardware for execution. Figure
:numref:`ch04/ch04-single-op-exec` shows the trace of operator
scheduling. Once the Python runtime calls an operator, the machine
learning framework initializes it by determining information such as
the operator precision, the type and size of each input/output, and the
target device. It then allocates memory for the operator and copies
data to the target device for execution.
![Operator scheduling trace](../img/ch05/single_op_exec.PNG)
:label:`ch04/ch04-single-op-exec`
The operator scheduling method offers high flexibility because operators
are directly scheduled by the Python runtime. It facilitates the
representation of complex computational logic (such as control flows)
and use of Python-native data structures for implementing complex
algorithms. Operators are driven by the Python runtime to finish
computational tasks, facilitating easy collaboration with Python's
large, rich ecosystem.
Despite its advantages, operator scheduling also has some disadvantages.
One is that context-based runtime optimizations such as operator fusion
and algebraic simplification become difficult. This is because global
information about the computational graph is unavailable. Another
disadvantage is that computational tasks have to run in serial mode,
rather than in parallel, due to the lack of computational topology.
## Graph Scheduling
When a computational graph built with the static graph mechanism is
scheduled as a whole, operators are still sent to the hardware one by
one in some execution sequence. However, because global information
about the computational graph is available, the framework can analyze
operator dependencies and the number of computing devices, and schedule
and execute the entire graph in either of the following two ways:
1. **Serial**: executes tasks one at a time, in the order in which they
    are added to the queue. This method expands a computational graph
    into a sequence of operators, which are then run separately.
    Operators are executed in a static order using a single thread,
    thereby requiring fewer resources.
2. **Parallel**: executes tasks concurrently for higher efficiency.
    This method expands a computational graph based on operator
    dependencies. Operators are executed in the order defined by their
    input dependencies, and those without input dependencies are
    executed concurrently. This method executes operators in a dynamic
    order (which may vary in each iteration) using multiple threads,
    thereby consuming more system resources.
Within a computational graph, most operators are dependent on each other
directly or indirectly. When scheduling such operators, their sequence
must be guaranteed. Figure
:numref:`ch04/ch04-diaodu` shows a computational graph, where a
forward pass is run on the input data to produce a predicted value and
then the gradient of the loss function is computed for backpropagation.
In general, downstream operators depend on the outputs of upstream
operators. As such, we must schedule the operators in this
computational graph into a serial queue to ensure that each operator
receives the necessary inputs.
![Serial operator scheduling](../img/ch04/order.png)
:label:`ch04/ch04-diaodu`
A computational graph may also contain operators independent of each
other, for example, op1 and op2 shown in Figure
:numref:`ch04/ch04-para`. We can have each operator run on
different hardware devices to implement parallel computing. Compared
with the serial mode, parallel computing decreases execution time by
leveraging more computing resources at the same time.
![Parallel operator scheduling](../img/ch04/para.png)
:label:`ch04/ch04-para`
Serial execution and parallel execution have their own advantages and
disadvantages, as summarized in Table
:numref:`ch04/ch4-execution`.
:Comparison between serial execution and parallel execution
| Aspect | Serial Execution | Parallel Execution |
|----------------------|------------------|-------------------- |
| Execution order | Static | Dynamic |
| Execution threads | Single thread | Multiple threads |
| Resource consumption | Low | High |
:label:`ch04/ch4-execution`
A computing environment may contain more than one type of computing
device, such as CPUs and GPUs. A computational graph whose operators
run on more than one type of computing device is referred to as a
heterogeneous computational graph.
The graph contains the following types of operators based on the
computing hardware.
- **CPU operators**: They are C++ operators that run on the host CPU.
The computing performance of the CPU depends on the extent to which
the multi-core capability of the CPU is utilized.
- **GPU operators**: They run on the GPU (e.g., an NVIDIA GPU). GPU
    kernels are launched on the device one by one for execution. The
    GPU features ample parallel computing units that offer significant
    speedup to parallel algorithms.
- **Python operators**: They run on the host CPU. Unlike CPU
operators, Python operators are interpreted and executed by the
Python runtime interpreter.
We mentioned earlier that the dynamic graph mechanism relies on the
Python interpreter to dispatch operators and execute them serially in
the order defined by the model code. In this mode, data often has to be
transferred between computing devices, and communication bottlenecks can
increase the time operators spend waiting for data, reducing the overall
execution efficiency of the computational graph. The first prerequisite
for efficient graph execution is therefore to accurately identify the
device on which each operator runs and to avoid unnecessary data
transfers between devices, while scheduling independent operators on
different devices in parallel. The static graph mechanism is free of the
constraints of the Python interpreter: the computational graph is sent
to the device in one pass, which reduces the number of interactions
between the host and the accelerator and improves computing efficiency
and performance.
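The scheduling idea described above (run independent operators concurrently, dependent ones serially) can be sketched as a wave-by-wave topological traversal of the graph. This is an illustrative sketch rather than any framework's actual scheduler, and the operator names below are hypothetical.

```python
from collections import defaultdict, deque

def parallel_schedule(deps):
    """Group operators into waves that can run concurrently.

    An operator becomes ready once every operator it depends on has
    executed (Kahn's algorithm, level by level).
    deps maps each operator name to the list of operators it depends on.
    """
    indeg = {op: len(d) for op, d in deps.items()}
    users = defaultdict(list)
    for op, d in deps.items():
        for dep in d:
            users[dep].append(op)

    ready = deque(op for op, n in indeg.items() if n == 0)
    waves = []
    while ready:
        wave = sorted(ready)   # everything ready now may run in parallel
        ready.clear()
        waves.append(wave)
        for op in wave:        # retire the wave, releasing its users
            for u in users[op]:
                indeg[u] -= 1
                if indeg[u] == 0:
                    ready.append(u)
    return waves

# Example: matmul1 and matmul2 are independent; add needs both.
print(parallel_schedule({"matmul1": [], "matmul2": [], "add": ["matmul1", "matmul2"]}))
# -> [['matmul1', 'matmul2'], ['add']]
```

A real scheduler would additionally weigh device placement and transfer costs when assigning each wave's operators to CPUs and accelerators.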
Combined operator and subgraph scheduling blends the previous two
execution modes. Because computational graph structures are flexible,
executing an entire graph on the accelerator may not be optimal in
complex scenarios. For example, accelerator chips excel at
floating-point operations, whereas CPUs are good at logical judgments.
Parts of the graph that execute inefficiently on the accelerator can
therefore be split out and handed to a device that executes them more
efficiently, such as the CPU, balancing performance and flexibility.
There are different levels of parallelism: operator parallelism, model
parallelism, and data parallelism. Operator parallelism is not just
about executing independent operators in parallel; where applicable, we
can further partition an operator into multiple parallel child
operations. Model parallelism refers to partitioning a computational
graph among several devices in order to shorten the time taken by each
training iteration. Data parallelism involves training the same
computational graph on different data, reducing the total number of
iterations and improving training efficiency. We discuss these three
parallelism methods in the chapter on distributed training.
## Synchronous and Asynchronous Data Loading
As previously mentioned, a single training iteration of a computational
graph goes through three serial tasks: data loading, data preprocessing,
and model training. Each task is dependent on the output of the previous
one. To schedule the three types of tasks in iterative graph training,
we can use the synchronous and asynchronous mechanisms at the iteration
level.
1. **Synchronous**: Tasks are executed in order, one after another, and
   must wait for and coordinate with each other.
2. **Asynchronous**: As soon as a task completes, the same task in the
   next iteration can begin immediately.
If the synchronous mechanism is adopted to train the computational graph
shown in Figure :numref:`ch04/ch04-tongbu`, in each iteration, a batch of input
data is loaded, preprocessed, and then passed to the computational graph
for model training and parameter update. Tasks in the next iteration
wait until the current iteration is complete. The synchronous mechanism
wastes computation and communication resources because the data
preprocessing and model training tasks must wait until a batch of data
is completely loaded, and because the I/O channel for data loading is
idle at model training time.
![Synchronous mechanism](../img/ch04/sync.png)
:label:`ch04/ch04-tongbu`
In the asynchronous setting shown in Figure
:numref:`ch04/ch04-yibu`, after loading and passing a batch of
input data to the subsequent data preprocessing task, the I/O channel
immediately moves on to the next batch without waiting for the current
iteration to complete. In contrast with the synchronous mechanism, the
idle time between data loading, data preprocessing, and model training
in the asynchronous mechanism is notably reduced, thereby shortening the
overall training time with improved execution efficiency.
![Asynchronous mechanism](../img/ch04/async.png)
:label:`ch04/ch04-yibu`
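The asynchronous mechanism can be sketched with a background loader thread feeding a bounded prefetch queue while the main thread trains. This is a minimal single-machine illustration, not any framework's actual implementation; the batch contents and "training step" are stand-ins.

```python
import queue
import threading

def train_async(num_batches, prefetch=2):
    """Toy async training loop: the loader thread keeps up to `prefetch`
    preprocessed batches queued while the main thread consumes them."""
    q = queue.Queue(maxsize=prefetch)

    def loader():
        for i in range(num_batches):
            batch = [i] * 4        # stand-in for load + preprocess
            q.put(batch)           # blocks only when the prefetch buffer is full
        q.put(None)                # sentinel: no more data

    threading.Thread(target=loader, daemon=True).start()

    losses = []
    while (batch := q.get()) is not None:
        losses.append(sum(batch))  # stand-in for one training step
    return losses

print(train_async(3))  # -> [0, 4, 8]
```

The bounded queue is what overlaps I/O with computation: while the main thread runs a training step, the loader is already preparing the next batches.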
To further shorten the training time and improve the execution
efficiency, we can combine the asynchronous mechanism with parallel
computing, as shown in Figure
:numref:`ch04/ch04-yibubingxing`. On the one hand, the
asynchronous mechanism reduces the model's wait time for data loading
and preprocessing, allowing the model to quickly traverse the entire
dataset. On the other hand, parallel computing increases the batch size
in iterative training, increasing the efficiency of computing resources.
![Asynchronous mechanism combined with parallel computing](../img/ch04/para-async.png)
:label:`ch04/ch04-yibubingxing`
# Chapter Summary
1. The computational graph technology is introduced to machine learning
frameworks in order to achieve a trade-off between programming
flexibility and computational efficiency.
2. A computational graph contains tensors (as units of data) and
operators (as units of operations).
3. A computational graph represents the computational logic and status
of a machine learning model and offers opportunities for
optimizations.
4. A computational graph is a directed acyclic graph. Operators in the
graph are directly or indirectly dependent on or independent of each
other, without circular dependencies.
5. Control flows, represented by conditional control and loop control,
   determine how data flows in a computational graph.
6. Computational graphs come in two types: static and dynamic.
7. Static graphs support easy model deployment, offering high
   computational efficiency and a low memory footprint at the expense
   of ease of debugging.
8. Dynamic graphs provide computational results on the fly, which
increases programming flexibility and makes debugging easy for model
optimization and iterative algorithm improvement.
9. We can appropriately schedule the execution of operators based on
their dependencies reflected in computational graphs.
10. For operators that run independently, we can consider concurrent
    scheduling to achieve parallel computing; operators with
    computational dependencies must be scheduled to run serially.
11. Specific training tasks of a computational graph can run
synchronously or asynchronously. The asynchronous mechanism
effectively improves the hardware efficiency and shortens the
training time.
## Order Preservation Design
Unlike conventional data-parallel computing tasks, parallel data processing in machine learning scenarios needs to maintain order preservation to ensure experimental reproducibility. In concrete implementations, we need to guarantee that the output order of data after parallel preprocessing remains the same as the input order (i.e., SeqB and SeqA in the figure below are identical). This ensures that the output order of the data module is uniquely determined by the output order of the data shuffling component, helping users compare and debug across different experiments. Different machine learning systems adopt different approaches to ensure order preservation. We use MindSpore's implementation as an example to deepen readers' understanding of this topic.
![Data order preservation --- ensuring SeqB is identical to SeqA](../img/ch07/7.4/data_ordering.png)
:width:`800px`
:label:`data_order_definition`
MindSpore ensures order preservation by constraining the communication behavior between operator thread groups so that the input order to the current operator's downstream operator remains the same as its own input order. Based on this recursive constraint, the output order of the last operator in the entire parallel data processing pipeline is guaranteed to be the same as the input order of the first operator. In the specific implementation, MindSpore uses a Connector as the communication component between operator thread groups. The core operations on the Connector are the Push operation by the upstream operator and the Pop operation by the downstream operator. We focus on MindSpore's constraints on these two behaviors.
The usage of Connector has the following two requirements:
- The threads in both the data producer thread group and the data consumer thread group on either side of the Connector are numbered starting from 0.
- The input data order of the data producers must follow a round-robin distribution across producer threads. That is, when the producer thread group size is M, producer thread 0 holds the (0 + M \* k)-th data sample, producer thread 1 holds the (1 + M \* k)-th sample, producer thread 2 holds the (2 + M \* k)-th sample, and so on (where k=0, 1, 2, 3...).
The Connector maintains the same number of queues as the number of producer threads and ensures that when data is placed into the Connector, each producer thread's data goes only into the correspondingly numbered queue. This guarantees that the distribution of data across different queues in the Connector is the same as the distribution across different producer threads (the Push function in the code snippet). Then, when the Connector's consumer thread group retrieves data from the Connector, we need to ensure that the final data distribution across different consumer threads still follows a round-robin pattern. That is, when the consumer thread group size is N, consumer thread 0 holds the (0 + N \* k)-th data sample, consumer thread 1 holds the (1 + N \* k)-th sample, consumer thread 2 holds the (2 + N \* k)-th sample, and so on (where k=0, 1, 2, 3...). To achieve this, when a consumer thread requests data from the Connector, the Connector retrieves data from the queues in a round-robin manner, subject to the constraint that the requesting consumer thread number i and the pending data index j satisfy the relationship $i=j\%N$ (where N is the number of consumer threads). If the indices do not satisfy this relationship, the request blocks and waits. Through this communication constraint mechanism, MindSpore achieves order preservation.
![MindSpore order preservation implementation](../img/ch07/7.4/mindspore_data_order.jpeg)
:width:`800px`
:label:`mindspore_data_order_implementation`
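A minimal single-process sketch of the Connector's Push/Pop constraints might look as follows. This is an illustration of the round-robin ordering rules described above, not MindSpore's actual C++ implementation; the class and method names are ours.

```python
import threading
from collections import deque

class Connector:
    """Order-preserving buffer between producer and consumer thread groups.

    Producer p pushes samples (p + M*k) into queue p, so each queue holds
    a strided slice of the global order. A consumer i may only receive
    the j-th global sample when i == j % N, which keeps the output order
    identical to the input order.
    """
    def __init__(self, num_producers, num_consumers):
        self.queues = [deque() for _ in range(num_producers)]
        self.num_consumers = num_consumers
        self.pop_index = 0  # global index j of the next sample to hand out
        self.cv = threading.Condition()

    def push(self, producer_id, sample):
        with self.cv:
            # Each producer writes only to its own queue.
            self.queues[producer_id].append(sample)
            self.cv.notify_all()

    def pop(self, consumer_id):
        with self.cv:
            # Block until it is this consumer's turn (i == j % N) and the
            # round-robin source queue (j % M) for sample j is non-empty.
            while (self.pop_index % self.num_consumers != consumer_id
                   or not self.queues[self.pop_index % len(self.queues)]):
                self.cv.wait()
            sample = self.queues[self.pop_index % len(self.queues)].popleft()
            self.pop_index += 1
            self.cv.notify_all()
            return sample
```

With M = N = 2, pushing samples s0..s3 in round-robin order and popping with consumers 0, 1, 0, 1 returns them in exactly the original order.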
## Scaling Single-Machine Data Processing Performance
In the previous sections, we introduced how to accelerate data preprocessing through parallel architectures that leverage multi-core CPU computing power, so that data production keeps up with the rate at which accelerator chips consume data during model computation. This approach resolves the problem in most cases. However, data consumption performance is growing rapidly year over year as AI chips advance (i.e., model computation keeps getting faster), while the data module, which relies primarily on CPU computing power, cannot benefit from comparable hardware improvements as Moore's Law winds down. This makes it difficult for data production performance to keep pace with model computation performance. Moreover, in recent years the number of AI chips in AI servers has grown far faster than the number of CPUs, further widening the gap between the chips' data consumption demands and the data module's production capacity. Taking NVIDIA's DGX series servers as an example, the DGX-1 server is configured with 40 CPU cores and 8 GPU chips. In the next-generation NVIDIA DGX-2, the number of GPU chips grew to 16, while the number of CPU cores only increased from 40 to 48. Since all GPU chips share the CPU computing power during training, the average computing power available to each GPU chip (data consumer) decreased from 5 CPU cores per GPU on the DGX-1 to 3 CPU cores per GPU on the DGX-2. This CPU bottleneck prevents users from achieving the expected scaling when training with multiple cards. To address the problem of insufficient CPU computing power on a single machine, we present two common solutions: heterogeneous data processing acceleration based on CPUs plus AI chips, and distributed data preprocessing.
### Heterogeneous Computing-Based Data Preprocessing
Since AI chips have richer computing resources than CPUs, leveraging them for data preprocessing when CPU computing power becomes the bottleneck is an effective approach. Although AI chips do not possess general-purpose data processing capabilities, most time-consuming preprocessing operations are tensor-related computations, such as the Fast Fourier Transform (FFT) in speech processing and denoising in image processing, so some operations can be offloaded to AI chips for acceleration. For example, the Dvpp module on Huawei's Ascend 310 chip is a built-in hardware decoder that offers stronger image processing performance than the CPU. Dvpp supports basic image processing operations such as JPEG decoding and resizing. In practice, users can designate certain image processing operations to be completed on the Ascend 310 chip to improve data module performance.
```cpp
namespace ms = mindspore;
namespace ds = mindspore::dataset;
// Initialization operations
//...
// Build data processing operators
// 1. Decode
std::shared_ptr<ds::TensorTransform> decode(new ds::vision::Decode());
// 2. Resize
std::shared_ptr<ds::TensorTransform> resize(new ds::vision::Resize({256}));
// 3. Normalize
std::shared_ptr<ds::TensorTransform> normalize(new ds::vision::Normalize(
{0.485 * 255, 0.456 * 255, 0.406 * 255}, {0.229 * 255, 0.224 * 255, 0.225 * 255}));
// 4. Center crop
std::shared_ptr<ds::TensorTransform> center_crop(new ds::vision::CenterCrop({224, 224}));
// Build the pipeline and specify using Ascend for computation
ds::Execute preprocessor({decode, resize, center_crop, normalize}, MapTargetDevice::kAscend310, 0);
// Execute the data processing pipeline
ret = preprocessor(image, &image);
```
Compared to Dvpp, which only supports a subset of image preprocessing operations, NVIDIA's DALI :cite:`nvidia_dali` is a more general GPU-based data preprocessing acceleration framework. DALI contains the following three core concepts:
- DataNode: Represents a collection of Tensors
- Operator: An operator that transforms DataNodes. Both the input and output of an Operator are DataNodes. Notably, operators in DALI can be configured to one of three different execution modes: cpu, gpu, and mixed. In cpu mode, both the operator's input and output are DataNodes on the CPU. In gpu mode, both the input and output are DataNodes on the GPU. In mixed mode, the operator's input is a CPU DataNode while the output is a GPU DataNode.
- Pipeline: A data processing pipeline constructed by users through describing the transformation process of DataNodes using Operators
In practice, users configure whether an operator's computation is performed on the CPU or GPU by setting its execution mode. DALI also imposes the following constraint: once an operator runs in mixed or gpu mode, all of its downstream operators must execute in gpu mode.
![NVIDIA DALI overview](../img/ch07/7.5/dali_overview.png)
:width:`800px`
:label:`dali_overview`
Below is an example code snippet demonstrating the construction of a data processing pipeline with DALI. We read image data from files, apply mixed-mode decoding, and then process the images through rotation and resizing operators running on the GPU before returning the results to users. Thanks to its excellent performance, DALI is widely used in high-performance inference services and for optimizing multi-card training.
```python
import nvidia.dali as dali

pipe = dali.pipeline.Pipeline(batch_size=3, num_threads=2, device_id=0)
with pipe:
    files, labels = dali.fn.readers.file(file_root="./my_file_root")
    images = dali.fn.decoders.image(files, device="mixed")
    images = dali.fn.rotate(images, angle=dali.fn.random.uniform(range=(-45, 45)))
    images = dali.fn.resize(images, resize_x=300, resize_y=300)
    pipe.set_outputs(images, labels)
pipe.build()
outputs = pipe.run()
```
### Distributed Data Preprocessing
Distributed data preprocessing is another viable solution to address insufficient CPU computing power. A common approach is to leverage existing big data computing frameworks such as Spark or Dask for data preprocessing and write the results to a distributed file system. The training machines then only need to read the preprocessed result data and proceed with training.
![Distributed data preprocessing based on third-party distributed computing frameworks](../img/ch07/7.5/distribute.png)
:width:`800px`
:label:`distributed_data_preprocess_based_on_3rd_party_software`
Although this approach is widely used in the industry, it faces three problems:
- Since data processing and model training use different frameworks, users often need to write programs in different languages across two different frameworks, increasing the user's burden.
- Since the data processing system and the machine learning system cannot achieve zero-copy data sharing, data serialization and deserialization often become non-negligible additional overhead.
- Since big data computing frameworks are not entirely tailored for machine learning scenarios, certain distributed preprocessing operations such as global data shuffling cannot be efficiently implemented.
To better adapt to data preprocessing in machine learning scenarios, the distributed machine learning framework Ray leverages its own task scheduling capabilities to implement simple distributed data preprocessing ---
Ray Dataset :cite:`moritz2018ray`. Since data preprocessing and training reside within the same framework, this reduces the user's programming burden while also eliminating the additional overhead of serialization/deserialization through zero-copy data sharing. Ray Dataset supports simple parallel dataset transformation operators such as map, batch, filter, as well as some basic aggregation operators like mean. Ray
Dataset also supports sorting, random shuffling, GroupBy, and other global shuffle operations. This approach is currently under research and development and has not yet been widely adopted. Interested readers can consult relevant materials for further understanding.
```python
import ray

ray.data.read_parquet("foo.parquet") \
.filter(lambda x: x < 0) \
.map(lambda x: x**2) \
.random_shuffle() \
.write_parquet("bar.parquet")
```
# Data Processing Framework
In the previous two chapters, we introduced the frontend and backend of compilers, elaborating on the optimization process of transforming source programs into target programs. Beyond enabling high-performance execution on accelerator chips during training and inference, we also need to efficiently deliver data to these chips to achieve optimal end-to-end performance. Machine learning model training and inference require loading datasets from storage devices (such as local disks, memory, and remote storage systems), performing a series of processing transformations on the datasets, and sending the processed results to GPUs, Huawei Ascend, or other accelerators for model computation. Performance issues at any step in this pipeline can negatively impact training and inference throughput. In this chapter, we will focus on how to design and implement a data system tailored for machine learning scenarios, helping users easily construct various complex data pipelines while ensuring sufficiently high execution performance so that data preprocessing does not become a performance bottleneck for model training and inference.
This chapter introduces the data module in machine learning systems from three dimensions: usability, efficiency, and order preservation. In the first two sections, we discuss how to build a user-friendly data module, including how to design programming abstractions that allow users to describe complex preprocessing workflows in just a few lines of code, and how to provide rich built-in operators for usability while flexibly supporting user-defined operators to cover long-tail requirements. After users construct their data processing workflows, the data module is responsible for efficiently scheduling and executing the data pipeline to achieve optimal data processing throughput. Efficiently executing the data pipeline is a challenging task, as we face both I/O performance issues in data loading and computational performance issues in data processing. To address these challenges, we will introduce file format designs for high-throughput data loading, as well as parallel architecture designs that fully leverage multi-core CPU computing power. Moreover, unlike conventional data-parallel computing tasks, most machine learning scenarios have special `order preservation` requirements for data input and output sequences. We will dedicate a section to introducing what order preservation is and how to design corresponding components within the data module's parallel architecture to meet this requirement. After studying the above content, readers will gain a deep understanding of how to build an efficient and user-friendly data module for machine learning scenarios. Finally, as extended content, we will draw on practical experience from both academia and industry to introduce how to scale our data processing module to meet training performance requirements when single-machine processing performance is insufficient. The learning objectives of this chapter include:
- Understand the key components and their functions in the machine learning data module architecture
- Understand the design of different data module user programming interfaces
- Master file format design for high-performance data loading
- Master the parallel architecture of the data module in machine learning systems
- Master the concept and solutions for data order preservation in machine learning system data modules
- Understand two approaches for scaling single-machine data processing performance
```toc
:maxdepth: 2
requirements
program_model
performance
data_order
extension
summary
```
## Efficiency Design
In the previous section, we focused on the programming abstractions and interface design of the data module, ensuring that users can conveniently describe data processing workflows based on the APIs we provide without needing to worry too much about implementation and execution details. In this section, we will further explore the design details of key data module components such as data loading and pipeline scheduling to ensure that users can achieve optimal data processing performance. Throughout this section, we will also draw on practical experience from major existing machine learning systems to help readers deepen their understanding of these critical design approaches.
As shown in :numref:`async_data_process`, deep learning model training requires the data module to first load datasets from storage devices, perform a series of preprocessing transformations in memory, and finally send the processed data to accelerator chips for model computation. Currently, a large body of work focuses on accelerating model computation on chips through new hardware designs or operator compilation techniques, with relatively little attention paid to data processing pipeline performance issues. However, in many cases, the execution time of data preprocessing occupies a substantial proportion of the entire training task, preventing GPUs, Huawei Ascend, and other accelerators from being fully utilized. Research has shown that approximately 30% of computation time in enterprise data center workloads is spent on data preprocessing steps :cite:`murray2021tf`, and other studies have found that model training tasks on some public datasets spend 65% of their time on data preprocessing :cite:`mohan2020analyzing`. This clearly demonstrates that data module performance has a decisive impact on overall training throughput.
![Asynchronous parallel execution of data loading, preprocessing, and model computation](../img/ch07/7.3/async_data_process.png)
:width:`800px`
:label:`async_data_process`
To pursue maximum training throughput, existing systems generally choose to execute data loading, data preprocessing computation, and on-chip model computation asynchronously in parallel. These three steps form a typical producer-consumer upstream-downstream relationship. We denote the data loading rate from storage devices as F, the data preprocessing rate as P, and the on-chip data consumption rate as G. Ideally, we want G < min(F, P), so that the accelerator chip is never blocked waiting for data. However, in practice, we often encounter situations where either the data loading rate F is too low (known as I/O Bound) or the data preprocessing rate P is too low (known as CPU Bound), causing G > min(F, P) and leaving the chip underutilized. To address these critical performance issues, this section will focus on two topics:
- How to design appropriate file formats and loading methods for the specific I/O requirements of machine learning scenarios to optimize the data loading rate F.
- How to design parallel architectures that fully leverage the computing power of modern multi-core CPUs to improve the data processing rate P.
At the end of this section, we will also examine a challenging problem: how to apply the computational graph compilation techniques from previous chapters to optimize the user's data processing graph and squeeze out further throughput. With that, let us dive in.
### Efficiency of Data Loading
First, let us examine how to address the performance challenges of data loading. The first problem we face is the I/O differences caused by diverse data types and non-uniform storage formats. For example, text data may be stored in txt format, and image data may be stored in raw format or compressed formats such as JPEG. We obviously cannot design an optimal data loading scheme for every possible storage scenario. However, we can propose a unified storage format (which we call the Unirecord format) to shield against I/O differences across different data types, and then design and optimize data loading schemes based on this format. In practice, users simply need to convert their original datasets to our unified data format to benefit from efficient read performance.
![Unified data format](../img/ch07/7.3/uni_record.png)
:width:`800px`
:label:`unified_record_format`
So what other characteristics should our Unirecord have beyond unifying user storage formats? Data access in machine learning model training has the following characteristics:
- Within each epoch, all data is traversed in a random order, with each data sample visited exactly once
- Across all epochs, the data must be traversed in different random orders
The above access patterns require that our Unirecord storage format supports efficient random access. When our dataset can fit entirely in RAM, random access to Unirecord is not a major issue. However, when the dataset is too large and must be stored on local disks or distributed file systems, we need to design specific solutions. An intuitive approach is to divide a Unirecord file into an index block and a data block. The index block records metadata for each data sample, such as its size, offset within the file, and checksum values. The data block stores the actual data for each sample. When we need to perform random access on a Unirecord-format file, we first load the file's index block into memory (which is typically much smaller than the entire file) and build an in-memory index table for the data in the file. Then, when we need to randomly access a data sample, we first look up the sample's offset, size, and other information in the index table and read the data from disk based on this information. This loading approach satisfies our random access requirements on disk. Next, we will use the practical experience of MindRecord proposed by MindSpore as an example to introduce the design of a unified file format and help deepen understanding of this topic.
![File format design supporting random access](../img/ch07/7.3/file_indexing.png)
:width:`800px`
:label:`file_random_access`
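The index-block/data-block design above can be sketched as follows. The on-disk layout here (a JSON index plus an 8-byte length trailer appended after the sample bytes) is purely illustrative and is not the actual Unirecord or MindRecord format.

```python
import json
import os
import struct

def write_unirecord(path, samples):
    """Write a list of byte strings as [samples][JSON index][8-byte index size].

    The index stores one (offset, size) pair per sample, so a reader can
    seek straight to any sample without scanning the file.
    """
    index = []
    with open(path, "wb") as f:
        for s in samples:
            index.append((f.tell(), len(s)))  # record where the sample starts
            f.write(s)
        idx_bytes = json.dumps(index).encode("utf-8")
        f.write(idx_bytes)
        f.write(struct.pack("<Q", len(idx_bytes)))  # trailer: index size

def read_sample(path, i):
    """Random access to sample i: load the small index from the file tail,
    build the in-memory table, then seek and read just that sample."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (idx_len,) = struct.unpack("<Q", f.read(8))
        f.seek(-8 - idx_len, os.SEEK_END)
        index = json.loads(f.read(idx_len))
        offset, size = index[i]
        f.seek(offset)
        return f.read(size)
```

In a real format the index would also carry checksums and richer metadata, and would typically sit in a header or a separate index file rather than a trailer, but the access pattern (load small index, then seek per sample) is the same.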
#### Introduction to MindRecord
MindRecord is the unified data format introduced by MindSpore, with the goal of normalizing user datasets and optimizing the training data loading process. This file format has the following characteristics:
- Enables unified storage and access of diverse user data, making training data loading more convenient.
- Aggregated data storage for efficient reading, while being easy to manage and transfer.
- Efficient data encoding and decoding operations, transparent and imperceptible to users.
- Flexible control over partition sizes, facilitating distributed training.
Similar to the Unirecord design described earlier, a MindRecord file also consists of data files and index files. The data file contains a file header, scalar data pages, and block data pages for storing users' normalized training data. The index file contains index information generated based on scalar data (such as image labels, image filenames, etc.) for convenient retrieval and statistical analysis of dataset information. To ensure random access performance for a single MindRecord file, MindSpore recommends that each MindRecord file be smaller than 20 GB. If a dataset exceeds 20 GB, users can specify the corresponding parameters during MindRecord dataset generation to shard the original dataset into multiple MindRecord files.
![MindRecord file format composition](../img/ch07/7.3/MindRecord_format.png)
:width:`800px`
:label:`mindrecord_format`
The detailed information about the key components of the data file portion in a MindRecord file is as follows:
- **File Header**
The file header is primarily used to store the file header size, scalar data page size, block data page size, Schema information, index fields, statistical information, file partition information, and the correspondence between scalar data and block data. It serves as the metadata of the MindRecord file.
- **Scalar Data Pages**
Scalar data pages are primarily used to store integer, string, and floating-point data, such as image labels, image filenames, image dimensions, and other information that is suitable for scalar storage.
- **Block Data Pages**
Block data pages are primarily used to store binary strings, NumPy arrays, and similar data, such as binary image files themselves and dictionaries converted from text.
During training, MindRecord's reader can quickly locate and find the position of data based on index files, and read and decode the data. Additionally, MindRecord possesses certain retrieval capabilities, allowing users to filter and obtain data samples that meet their expectations by specifying query conditions.
For distributed training scenarios, MindRecord loads metadata based on the Header in data files and index files to obtain the IDs of all samples and their offset information within data files. It then performs data partitioning based on user-input num_shards (number of training nodes) and shard_id (current node ID), obtaining 1/num_shards of the data for the current node. In other words, during distributed training, multiple nodes each read only 1/num_shards of the dataset, and the effect of training on the entire dataset is achieved through AllReduce on the computation side. Furthermore, if users enable the shuffle operation, the shuffle seed is kept consistent across all nodes within each epoch, ensuring that the ID shuffle results for all samples are consistent, which in turn ensures correct data partitioning.
![MindRecord Partition strategy](../img/ch07/7.3/partition.png)
:width:`800px`
:label:`mindrecord_partition`
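The seed-synchronized partitioning described above can be sketched as follows. This is an illustrative approximation of the behavior, not MindRecord's actual implementation, and the function signature is hypothetical.

```python
import random

def partition(sample_ids, num_shards, shard_id, seed):
    """Return this node's 1/num_shards slice of the dataset.

    Every node shuffles the full id list with the SAME seed, so the
    shuffled order is identical on all nodes; each node then takes its
    own contiguous slice, giving disjoint shards with no communication.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # same seed -> same order everywhere
    per_shard = len(ids) // num_shards
    start = shard_id * per_shard
    return ids[start:start + per_shard]
```

Because the shards are disjoint and cover the dataset, gradient AllReduce across the nodes recovers the effect of training on the whole dataset each epoch; changing the seed per epoch yields a fresh global shuffle.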
### Efficiency of Data Computation
After addressing the data loading performance issue, let us continue to study how to improve data computation performance (i.e., maximizing the data processing rate P mentioned earlier). We will use the data preprocessing pipeline mentioned above as an example to study how to design the data module's scheduling and execution of user computation graphs to achieve optimal performance.
![Diagram of serialized sequential execution of data preprocessing](../img/ch07/7.3/single_pipeline.png)
:width:`800px`
:label:`serialized_data_process`
Since deep learning chips such as GPUs and Huawei Ascend do not possess general-purpose data processing capabilities,
we currently still rely primarily on CPUs to complete preprocessing computation. Mainstream AI servers are equipped with multiple multi-core CPUs, and the data module needs to design reasonable parallel architectures to fully leverage multi-core computing power, thereby improving data preprocessing performance and minimizing accelerator stalls caused by waiting for data. In this section, we will introduce two common parallel architectures: pipeline-level parallelism and operator-level parallelism. Pipeline parallelism has a clear structure, is easy to understand and implement, and is primarily adopted by machine learning systems like PyTorch that implement data modules in Python. Influenced by the scheduling and execution architecture designs of classic data-parallel systems, other systems such as Google's TensorFlow and Huawei's MindSpore primarily adopt operator-level parallelism for fine-grained CPU resource allocation to fully utilize multi-core computing power. However, fine-grained allocation means we need to set reasonable parallelism parameters for all operators involved in the data processing pipeline, which poses a significant challenge for users. Consequently, frameworks like MindSpore also provide automatic tuning of key parameters in the data flow graph. Through dynamic analysis at runtime, the system automatically searches for optimal operator parallelism parameters, greatly reducing the user's programming burden. Let us now discuss each approach in detail.
#### Pipeline Parallelism
The first common parallelism approach is pipeline-level parallelism, where the user's constructed computation pipeline is executed sequentially within a single thread/process, while multiple threads/processes are launched to execute multiple pipelines in parallel. If users need to process a total of N data samples, then with pipeline parallelism degree M, each process/thread only needs to process (N/M) samples. Pipeline parallelism has a simple architecture and is easy to implement. Within the entire parallel architecture, each executing process/thread only needs to communicate across processes/threads at the beginning and end of data execution. The data module distributes pending data tasks to each pipeline process/thread and finally aggregates the results to send to the chip for model computation. From the user's perspective, usage is also relatively convenient, requiring only the specification of the key parallelism degree parameter. Let us use PyTorch as an example for detailed elaboration.
![Diagram of pipeline-level parallel execution](../img/ch07/7.3/pipeline_parallisim.png)
:width:`800px`
:label:`pipeline_parallisim`
In PyTorch, users only need to implement a Dataset Python class to write the data processing logic. The Dataloader launches the corresponding number of Python processes based on the user-specified parallelism parameter num_workers to invoke the user-defined Dataset class for data preprocessing. The Dataloader has two types of process roles: worker processes and the main process, along with two types of inter-process communication queues: index_queue and worker_result_queue. During training, the main process sends the list of pending data tasks to each worker process through index_queue. Each worker process executes the data preprocessing logic of the user-written Dataset class and returns the processed results to the main process through worker_result_queue.
![PyTorch Dataloader parallel execution architecture](../img/ch07/7.3/pytorch_dataloader.png)
:width:`800px`
:label:`pytorch_dataloader`
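The index_queue/worker_result_queue protocol can be sketched as follows. Threads are used here for brevity, whereas PyTorch uses processes because of the GIL; the doubling transform is a hypothetical stand-in for `Dataset.__getitem__`.

```python
import queue
import threading

def worker(index_queue, result_queue, dataset):
    # Worker loop: fetch an index, run the preprocessing, return the result.
    while True:
        idx = index_queue.get()
        if idx is None:          # sentinel: shut down
            break
        result_queue.put((idx, dataset[idx] * 2))  # stand-in for __getitem__

index_q, result_q = queue.Queue(), queue.Queue()
data = [1, 2, 3, 4]
workers = [threading.Thread(target=worker, args=(index_q, result_q, data))
           for _ in range(2)]
for w in workers:
    w.start()
for i in range(len(data)):       # "main process" distributes index tasks
    index_q.put(i)
for _ in workers:                # one sentinel per worker
    index_q.put(None)
for w in workers:
    w.join()
results = dict(result_q.get() for _ in range(len(data)))
print(results)                   # → {0: 2, 1: 4, 2: 6, 3: 8} (insertion order may vary)
```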
Next, we present a code snippet of using PyTorch's Dataloader for parallel data preprocessing. We can see that we only need to implement the Dataset class to describe the data preprocessing logic and specify num_workers to achieve pipeline-level parallel data preprocessing.
```python
import torch
from torch.utils.data import DataLoader

# Describe the data preprocessing workflow
class TensorDataset:
    def __init__(self, inps):
        self.inps = inps

    def __getitem__(self, idx):
        data = self.inps[idx]
        data = data + 1
        return data

    def __len__(self):
        return self.inps.shape[0]

inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
dataset = TensorDataset(inps)
# Set parallelism degree to 3
loader = DataLoader(dataset, batch_size=2, num_workers=3)
for batch_idx, sample in enumerate(loader):
    print(sample)
```
Finally, it should be noted that PyTorch Dataloader's execution involves extensive inter-process communication. Although PyTorch has implemented shared memory-based inter-process communication for Tensor-type data to accelerate this step, when the communication data volume is large, cross-process communication can still significantly impact end-to-end data preprocessing throughput performance. Of course, this is not an architectural issue with pipeline parallelism itself, but rather a consequence of CPython's Global Interpreter Lock (GIL), which forces pipeline parallelism at the Python level to use process parallelism rather than thread parallelism. To address this issue, the PyTorch team is currently attempting to remove the GIL from CPython to achieve thread-based pipeline parallelism for improved communication efficiency :cite:`rmpygil`. Interested readers can explore this topic further.
#### Operator Parallelism
In pipeline parallelism, computing resources (CPU cores) are allocated at the pipeline granularity. In contrast, operator parallelism allocates resources at the operator granularity, pursuing a more fine-grained resource allocation approach. We aim to assign higher parallelism to operators with greater computation costs and lower parallelism to operators with lesser computation costs, achieving more efficient and reasonable CPU resource utilization. The idea of operator parallelism is in the same spirit as classic data-parallel computing system parallelism. Taking classic MapReduce execution as an example, we can see that this can also be considered a form of operator parallelism (map operators and reduce operators), where the parallelism degree of map operators and reduce operators is determined by the computation cost of each operator phase.
![Classic MapReduce parallel execution architecture](../img/ch07/7.3/map_reduce.png)
:width:`800px`
:label:`mapreduce`
In the figure below, we present the operator parallelism architecture diagram for the data preprocessing pipeline introduced at the beginning of this section. Based on the computation cost of each operator, we set the image decoding operator parallelism to 3, image resizing parallelism to 2, image random rotation operator parallelism to 4, image normalization operator parallelism to 3, and image channel transposition operator parallelism to 1. We aim to achieve efficient and full utilization of computing resources by precisely allocating resources to operators with different computation costs. In specific implementations, operator parallelism generally uses thread-level parallelism, with all operators communicating through shared memory using inter-thread queues and similar methods.
![Operator parallel execution architecture](../img/ch07/7.3/operator_parallisim.png)
:width:`800px`
:label:`operator_parallisim`
Among existing machine learning system data modules, tf.data and MindData both adopt the operator parallelism approach. Due to more efficient resource utilization and high-performance data flow scheduling implemented in C++, operator parallelism approaches often demonstrate better performance. Performance evaluations of tf.data show that it has nearly twice the performance advantage compared to PyTorch's Dataloader :cite:`murray2021tf`.
Next, we use a MindSpore-based implementation of the data preprocessing pipeline described at the beginning of this section to demonstrate how to set the parallelism degree for each operator in an operator-parallel data pipeline.
```python
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as c_transforms
import mindspore.dataset.transforms.vision.c_transforms as vision
# Load data
dataset_dir = "path/to/imagefolder_directory"
dataset = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8)
num_classes = 10  # example value; set this to the dataset's class count
onehot_op = c_transforms.OneHot(num_classes)
# Decoding operator parallelism degree: 3
dataset = dataset.map(input_columns="image", operations=vision.Decode(), num_parallel_workers=3)
# Resizing operator parallelism degree: 2
dataset = dataset.map(input_columns="image", operations=vision.Resize((256, 256)), num_parallel_workers=2)
# Random rotation operator parallelism degree: 4
dataset = dataset.map(input_columns="image", operations=vision.RandomRotation((0, 15)), num_parallel_workers=4)
# Normalization operator parallelism degree: 3
dataset = dataset.map(input_columns="image", operations=vision.Normalize((100, 115.0, 121.0), (71.0, 68.0, 70.0)), num_parallel_workers=3)
# Channel transposition operator parallelism degree: 1
dataset = dataset.map(input_columns="image", operations=vision.HWC2CHW(), num_parallel_workers=1)
dataset = dataset.map(input_columns="label", operations=onehot_op)
```
We observe that while operator parallelism has higher performance potential, it requires us to set reasonable parallelism parameters for each operator. This not only places high demands on users but also increases the risk of performance degradation due to unreasonable parameter settings. To make operator parallelism easier for users, both tf.data and MindData have added dynamic tuning of key pipeline parameters, computing reasonable parameters based on runtime performance monitoring of the pipeline execution to achieve optimal data preprocessing throughput as much as possible :cite:`murray2021tf`.
#### Data Processing Computation Graph Optimization
In the preceding text, we focused on efficiently executing the user's constructed data preprocessing computation graph through parallel architectures. However, we can consider the following question: Is the computation graph given by the user an efficient one?
If not, can we optimize and rewrite the user's data computation graph under the premise of equivalent transformation to obtain a computation graph with expected better execution performance? Indeed, this shares the same philosophy as the model computation graph compilation optimization we studied in previous chapters --- that is, achieving better execution performance by analyzing and transforming the computation graph IR to obtain a more optimal IR representation. Common data graph optimization strategies include operator fusion and map operation vectorization. Operator fusion merges operator combinations such as map+map, map+batch, map+filter, and filter+filter into equivalent composite operators, combining computations that originally required execution in two thread groups into composite computations executed in a single thread group. This reduces inter-thread synchronization and communication overhead, achieving better performance. Map operation vectorization transforms the common dataset.map(f).batch(b) operation combination into dataset.batch(b).map(parallel_for(f)), leveraging modern CPUs' parallelism-friendly SIMD instruction sets to accelerate data preprocessing.
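The map-vectorization rewrite can be illustrated with NumPy: applying a per-sample function `f` and then batching is equivalent to batching first and applying one vectorized call per batch. A toy sketch, where `f` is a hypothetical elementwise transform:

```python
import numpy as np

def f(x):  # per-sample transform (elementwise, so it also vectorizes over a batch)
    return x * 2 + 1

samples = [np.array([i], dtype=np.float32) for i in range(6)]

# dataset.map(f).batch(3): apply f element by element, then stack into batches
batched_after_map = [np.stack([f(s) for s in samples[i:i + 3]])
                     for i in range(0, 6, 3)]

# dataset.batch(3).map(vectorized f): stack first, then one call per batch
batched_before_map = [f(np.stack(samples[i:i + 3]))
                      for i in range(0, 6, 3)]

for a, b in zip(batched_after_map, batched_before_map):
    assert np.array_equal(a, b)   # the two orderings produce identical batches
print("equivalent")
```

The second form issues one NumPy (hence SIMD-friendly) call per batch instead of one Python call per sample, which is the source of the speedup the text describes.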
## Usability Design
In this section, we focus on how to design a user-friendly data module for machine learning systems. As mentioned earlier, usability requires the data module to provide good programming abstractions and interfaces so that users can conveniently construct data processing pipelines, while also supporting users in flexibly registering and using custom operators within the data pipeline to meet diverse and specialized requirements. We will explore this topic from two aspects: programming interface abstraction and custom operator registration mechanisms.
### Programming Abstraction and Interfaces
In :numref:`image_process_pipeline`, we present a classic data preprocessing pipeline for training an image classification model. After loading the dataset from storage devices, we perform a series of operations on the image data, including decoding, resizing, rotation, normalization, and channel transposition. We also apply specific preprocessing operations to the dataset labels, and finally send the processed data to the accelerator chip for model computation. We hope that the programming abstractions provided by the data module are sufficiently high-level so that users can describe the data processing logic in just a few lines of code without getting bogged down in excessive, repetitive implementation details. At the same time, we need to ensure that this set of high-level abstractions is sufficiently general to meet diverse data preprocessing requirements. Once we have a good programming abstraction, we will use a code snippet that implements the data preprocessing pipeline described in the figure below using MindSpore's data module programming interfaces as an example to demonstrate how significantly a well-designed programming abstraction can reduce the user's programming burden.
![Data preprocessing example](../img/ch07/7.2/image_process_pipeline.png)
:width:`800px`
:label:`image_process_pipeline`
In fact, programming abstractions for data computation have long been extensively studied in the field of general-purpose data-parallel computing systems, and a relatively unified consensus has been reached --- that is, to provide LINQ-style :cite:`meijer2006linq` programming abstractions. The key characteristic is to let users focus on describing dataset creation and transformations, while delegating the efficient implementation and scheduling of these operations to the data system's runtime. Some excellent systems such as Naiad :cite:`murray2013naiad`,
Spark :cite:`zaharia2010spark`, and DryadLINQ :cite:`fetterly2009dryadlinq` have all adopted this programming model. We will use Spark as an example for a brief introduction.
Spark provides users with a programming model based on the concept of Resilient Distributed Datasets (RDD). An RDD is a read-only distributed data collection. Users primarily describe the creation and transformation of RDDs through Spark's programming interfaces. Let us elaborate with a Spark example. The following code demonstrates counting the number of lines containing the "ERROR" field in a log file. We first create a distributed dataset `file` by reading from a file (as mentioned earlier, an RDD represents a collection of data; here `file` is actually a collection of log lines).
We apply a filter operation to this `file` dataset to obtain a new dataset `errs` that retains only log lines containing the "ERROR" field. Then we apply a map operation to each element in `errs` to obtain the dataset `ones`. Finally, we perform a reduce operation on the `ones` dataset to get our desired result --- the number of log lines containing the "ERROR" field in the `file` dataset.
```java
val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
val ones = errs.map(_ => 1)
val count = ones.reduce(_+_)
```
We can see that users need only four lines of code to accomplish the complex task of counting specific field occurrences in a distributed dataset. This is made possible by Spark's core RDD programming abstraction. From the computation flow visualization in :numref:`rdd_transformation_example`, we can also clearly see that after creating the dataset, users only need to describe the operators applied to the dataset, while the execution and implementation of the operators are handled by the system's runtime.
![The core of Spark programming --- RDD transformations](../img/ch07/7.2/RDD.png)
:width:`800px`
:label:`rdd_transformation_example`
The data modules in mainstream machine learning systems have also adopted similar programming abstractions, such as TensorFlow's data module tf.data :cite:`murray2021tf`
and MindSpore's data module MindData. Next, we will use MindData's interface design as an example to introduce how to design good programming abstractions for the machine learning scenario to help users conveniently construct the diverse data processing pipelines needed in model training.
MindData is the data module of the machine learning system MindSpore, primarily responsible for data preprocessing in machine learning model training. The core programming abstraction that MindData provides to users is based on Dataset transformations. Here, a Dataset is a data frame (DataFrame) concept: a multi-row, multi-column relational data table in which each column has a name.
![MindSpore Dataset example](../img/ch07/7.2/dataset_table.png)
:width:`800px`
:label:`mindspore dataset example`
Based on this programming model, combined with the key processing steps in the machine learning data workflow introduced in the first section, MindData provides users with dataset operation operators for performing shuffle, map, batch, and other transformation operations on datasets. These operators take a Dataset as input and produce a newly processed Dataset as output. We list the typical dataset transformation interfaces as follows:
:Dataset operation interfaces supported by MindSpore
| Dataset Operation | Description |
| -------------------- | ------------------------------------------------------------------ |
| batch | Groups multiple data rows in the dataset into a mini-batch |
| map | Applies transformation operations to each data row in the dataset |
| shuffle | Randomly shuffles the order of data rows in the dataset |
| filter | Filters data rows in the dataset, retaining only rows that pass the filter condition |
| prefetch | Prefetches data from the storage medium |
| project | Selects certain columns from the Dataset table for subsequent processing |
| zip | Merges multiple datasets into one dataset |
| repeat | In multi-epoch training, repeats the entire data pipeline multiple times |
| create_dict_iterator | Creates an iterator that returns dictionary-type data for the dataset |
| ... | ... |
The above describes the dataset interface abstractions, while the specific operations on datasets are actually defined by concrete data operator functions. For user convenience, MindData has built-in implementations of rich data operator libraries for common data types and their common processing needs in the machine learning domain. For the vision domain, MindData provides common operators such as Decode, Resize, RandomRotation, Normalize, and HWC2CHW (channel transposition); for the text domain, MindData provides operators such as Ngram, NormalizeUTF8, and BertTokenizer; for the audio domain, MindData provides operators such as TimeMasking, LowpassBiquad, and ComplexNorm. These commonly used operators can cover the vast majority of user requirements.
In addition to supporting flexible Dataset transformations, MindData also provides flexible Dataset creation to address the challenge of numerous dataset types with varying formats and organizations. There are mainly three categories:
- Creating from built-in datasets: MindData has a rich set of built-in classic datasets, such as CelebADataset, Cifar10Dataset, CocoDataset, ImageFolderDataset, MnistDataset, VOCDataset, etc. If users need to use these common datasets, they can achieve out-of-the-box usage with a single line of code. MindData also provides efficient implementations for loading these datasets to ensure users enjoy the best read performance.
- Loading from MindRecord: MindRecord is a high-performance, general-purpose data storage file format designed for MindData. Users can convert their datasets to MindRecord and then leverage MindSpore's relevant APIs for efficient reading.
- Creating from a Python class: If users already have a Python class for reading their dataset, they can use MindData's GeneratorDataset interface to call that Python class to create a Dataset, providing users with great flexibility.
![MindSpore Dataset multiple creation methods](../img/ch07/7.2/dataset.png)
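As a sketch of the third creation path, any Python class exposing `__getitem__` and `__len__` and returning numpy arrays can serve as a source. The class below is hypothetical; with MindSpore installed it would be wrapped via `ds.GeneratorDataset`:

```python
import numpy as np

class MyDataset:
    """A random-access source: __getitem__/__len__ returning numpy data."""
    def __init__(self, n):
        self.data = np.arange(n, dtype=np.float32)

    def __getitem__(self, idx):
        return self.data[idx] * 10   # stand-in for real preprocessing

    def __len__(self):
        return len(self.data)

source = MyDataset(4)
print([float(source[i]) for i in range(len(source))])  # → [0.0, 10.0, 20.0, 30.0]

# With MindSpore installed, the source would be wrapped as, e.g.:
# dataset = ds.GeneratorDataset(source, column_names=["data"])
```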
Finally, we use an example of implementing the data processing pipeline described at the beginning of this section using MindData to demonstrate how user-friendly the Dataset-centric data programming abstraction is. We need only about 10 lines of code to accomplish our desired complex data processing. Throughout the entire process, we focus solely on describing the logic, while delegating operator implementation and execution scheduling to the data module, which greatly reduces the user's programming burden.
```python
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as c_transforms
import mindspore.dataset.transforms.vision.c_transforms as vision
dataset_dir = "path/to/imagefolder_directory"
# create a dataset that reads all files in dataset_dir with 8 threads
dataset = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8)
# create a list of transformations to be applied to the image data
transforms_list = [vision.Decode(),
                   vision.Resize((256, 256)),
                   vision.RandomRotation((0, 15)),
                   vision.Normalize((100.0, 115.0, 121.0), (71.0, 68.0, 70.0)),
                   vision.HWC2CHW()]
num_classes = 10  # example value; set this to the dataset's class count
onehot_op = c_transforms.OneHot(num_classes)
# apply the transform to the dataset through dataset.map()
dataset = dataset.map(input_columns="image", operations=transforms_list)
dataset = dataset.map(input_columns="label", operations=onehot_op)
```
### Custom Operator Support
With the dataset transformation-based programming abstraction and the rich transformation operator support for various data types in machine learning, we can cover the vast majority of user data processing needs. However, since the machine learning field itself is rapidly evolving with new data processing requirements constantly emerging, there may be situations where a data transformation operator that users want to use is not covered by the data module. Therefore, we need to design a well-crafted user-defined operator registration mechanism so that users can conveniently use custom operators when constructing data processing pipelines.
In machine learning scenarios, Python is the primary development programming language for users, so we can assume that user-defined operators are more often Python functions or Python classes. The difficulty of supporting custom operators in the data module is mainly related to how the data module schedules computation. For example, PyTorch's dataloader primarily implements computation scheduling at the Python level, and thanks to Python's flexibility, inserting custom operators into the dataloader's data pipeline is relatively straightforward. In contrast, systems like TensorFlow's tf.data and MindSpore's MindData primarily implement computation scheduling at the C++ level, making it more challenging for the data module to flexibly insert user-defined Python operators into the data flow. Next, we will use MindData's custom operator registration and usage implementation as an example to discuss this topic in detail.
![C-level operators and Python-level operators in MindData](../img/ch07/7.2/operation.png)
:width:`800px`
:label:`mindspore operator example`
Data preprocessing operators in MindData can be divided into C-level operators and Python-level operators. C-level operators provide higher execution performance, while Python-level operators can conveniently leverage rich third-party Python packages for development. To flexibly cover more scenarios, MindData supports users in developing custom operators using Python. If users pursue higher performance, MindData also supports users in compiling their C-level operators and registering them as plugins in MindSpore's data processing pipeline.
For custom data processing operators passed into dataset transformation operators such as map and filter, MindData's Pipeline executes them through the created Python runtime after startup. It should be noted that custom Python operators must ensure that both input and output are of the numpy.ndarray type. During execution, when MindData's Pipeline encounters a user-defined PyFunc operator in a dataset transformation, it passes the input data to the user's PyFunc as numpy.ndarray type. After the custom operator finishes execution, the result is returned to MindData as numpy.ndarray. During this process, the executing dataset transformation operator (such as map, filter, etc.) is responsible for the PyFunc's runtime lifecycle and exception handling. If users pursue higher performance, MindData also supports user-defined C operators. The dataset-plugin repository :cite:`minddata` serves as MindData's operator plugin repository, encompassing operators tailored for specific domains (remote sensing, medical imaging, meteorology, etc.). This repository carries MindData's plugin capability extensions and provides a convenient entry point for users to write new MindData operators. Users can write operators, compile, and install the plugin, and then use the newly developed operators in the map operations of the MindData Pipeline.
![MindSpore custom operator registration](../img/ch07/7.2/dataset-plugin.png)
:width:`800px`
:label:`mindspore_user_defined_operator`
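A minimal sketch of a custom Python operator: a plain function whose input and output are both `numpy.ndarray`, as the contract above requires. The function name is illustrative, and the trailing comment shows (as an assumption, not verified API) how it would be handed to a map transformation:

```python
import numpy as np

def add_noise(img: np.ndarray) -> np.ndarray:
    # Custom operator: input and output must both be numpy.ndarray.
    return img + np.ones_like(img)

sample = np.zeros((2, 2), dtype=np.float32)
out = add_noise(sample)
print(out)  # → a 2x2 array of ones

# In a MindData pipeline this function would be passed to a dataset
# transformation, e.g.:
# dataset = dataset.map(input_columns="image", operations=add_noise)
```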
## Overview
Data processing in machine learning scenarios is a typical ETL (Extract, Transform, Load) process. The first stage (Extract) loads datasets from storage devices, and the second stage (Transform) performs transformations on the datasets. Although different machine learning systems adopt different technical approaches when building their data modules, the core components generally include data loading, data shuffling, data transformation, data mini-batch assembly, and data sending. The functionality of each component is described as follows:
- **Data Loading Component (Load)**: Responsible for loading and reading datasets from storage devices. It must consider both the diversity of storage devices (e.g., local disk/memory, remote disk and memory, etc.) and the diversity of dataset formats (e.g., csv format, txt format, etc.). Based on the characteristics of machine learning tasks, AI frameworks have also proposed unified data storage formats (e.g., Google's TFRecord, Huawei's MindRecord, etc.) to provide higher-performance data loading.
- **Data Shuffling Component (Shuffle)**: Responsible for randomly shuffling the order of input data according to user-specified methods to improve model robustness.
- **Data Transformation Component (Map)**: Responsible for performing data transformations, with built-in preprocessing operators for various data types, such as resizing and flipping for images, random noise addition and pitch shifting for audio, and stopword removal and random masking for text processing.
- **Data Batching Component (Batch)**: Responsible for assembling and constructing a mini-batch of data to send to training/inference.
- **Data Sending Component (Send)**: Responsible for sending processed data to accelerators such as GPUs or Huawei Ascend for subsequent model computation and updates. High-performance data modules often choose to execute data transfer to devices asynchronously with computation on accelerators to improve overall training throughput.
![Core components of the data module](../img/ch07/7.1/pipeline.png)
:width:`800px`
:label:`pipeline`
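As a rough sketch (not any framework's actual implementation), the five components above can be chained as Python generators; Send is omitted since it is device-specific:

```python
import random

def load(n):                        # Load: yield raw samples from storage
    yield from range(n)

def shuffle(stream, seed=0):        # Shuffle: buffer and randomly reorder
    buf = list(stream)
    random.Random(seed).shuffle(buf)
    yield from buf

def map_op(stream, fn):             # Map: apply a transformation per sample
    for x in stream:
        yield fn(x)

def batch(stream, size):            # Batch: assemble mini-batches
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == size:
            yield buf
            buf = []

pipeline = batch(map_op(shuffle(load(6)), lambda x: x * x), 2)
batches = list(pipeline)
print(batches)   # three mini-batches of two squared samples, in shuffled order
```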
Implementing the above components is just the foundation of a data module. We also need to focus on the following aspects:
#### Usability
Data processing involved in AI model training/inference is highly flexible. On one hand, datasets in different application scenarios vary significantly in type and characteristics. When loading datasets, the data module must support specific storage formats for multiple types such as images, text, audio, and video, as well as multiple storage device types including memory, local disks, distributed file systems, and object storage systems. The module needs to abstract and unify the I/O differences in data loading under these complex situations to reduce users' learning costs. On the other hand, different data types often have different processing requirements. In common machine learning tasks, image tasks frequently involve resizing, flipping, and blurring; text tasks require tokenization and vectorization; and speech tasks need Fast Fourier Transform, reverb enhancement, and frequency shifting. To help users address data processing needs in the vast majority of scenarios, the data module needs to support a sufficiently rich set of data preprocessing operators for various types. However, new algorithms and data processing requirements are constantly and rapidly emerging, so we need to support users in conveniently using custom processing operators within the data module to handle scenarios not covered by built-in operators, achieving the best balance between flexibility and efficiency.
#### Efficiency
Since common AI accelerators such as GPUs and Huawei Ascend are primarily designed for Tensor data type computation and do not possess general-purpose data processing capabilities, current mainstream machine learning system data modules typically use CPUs to execute data pipelines. Ideally, before each training iteration begins, the data module should have data ready to minimize the time accelerators spend waiting for data. However, data loading and preprocessing in the data pipeline often face challenging I/O performance and CPU computation performance issues. The data module needs to design file formats that support random access with high read throughput to resolve data loading bottlenecks, and also needs to design reasonable parallel architectures to efficiently execute data pipelines to address computation performance issues. To achieve high-performance training throughput, mainstream machine learning systems all adopt asynchronous execution of data processing and model computation to hide data preprocessing latency.
#### Order Preservation
Unlike conventional data-parallel computing tasks, machine learning model training is sensitive to data input order. When training models using stochastic gradient descent, data is typically fed to the model in a pseudo-random order in each epoch, with a different random order in each training epoch. Since the model's final parameters are sensitive to the order of input data, to help users better debug and ensure reproducibility across different experiments, we need to design mechanisms in the system so that the final order in which data is fed to the model is uniquely determined by the output order of the data shuffling component, rather than being made non-deterministic by parallel data transformations. We will discuss the requirements and specific implementation details of order preservation in later sections.
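The reordering idea can be sketched as follows: parallel workers finish in arbitrary order, but each result is written back into a slot indexed by its submission order, so the output order is determined solely by the upstream (shuffle) order. A toy sketch, with a hypothetical `transform` standing in for a preprocessing operator:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def transform(item):
    time.sleep(random.random() * 0.01)  # workers finish out of order
    return item * item

inputs = [3, 1, 4, 1, 5]                # order fixed by the shuffle stage

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(transform, x): i for i, x in enumerate(inputs)}
    results = [None] * len(inputs)
    for fut, i in futures.items():
        results[i] = fut.result()       # reassemble in submission order

print(results)  # → [9, 1, 16, 1, 25], regardless of completion order
```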
## Summary
In this chapter, we explored how to design and implement the data preprocessing module in machine learning systems from three dimensions: usability, efficiency, and order preservation. On the usability dimension, we focused on the programming model of the data module. By drawing on the design experience of historically excellent parallel data processing systems, we concluded that a programming abstraction based on describing dataset transformations is well-suited as the programming model for data modules. In concrete system implementations, we need not only to provide a sufficient number of built-in operators on top of this programming model to facilitate users' data preprocessing programming, but also to consider how to support users in conveniently using custom operators. On the efficiency dimension, we introduced specialized file format design and parallel computation architecture design from the perspectives of data loading and computation, respectively. We also applied the model computation graph compilation optimization techniques learned in previous chapters to optimize users' data preprocessing computation graphs, further achieving higher data processing throughput. In machine learning scenarios, models are sensitive to data input order, which gives rise to the special property of order preservation. We analyzed this property in this chapter and demonstrated how real systems ensure order preservation through the special constraint implementation of MindSpore's Connector. Finally, we also addressed situations where single-machine CPU data preprocessing performance is insufficient, introducing the current vertical scaling approach based on heterogeneous processing acceleration and the horizontal scaling approach based on distributed data preprocessing. We believe that after studying this chapter, readers will have a deep understanding of data modules in machine learning systems and an awareness of the challenges that data modules will face in the future.
## Further Reading
- For an example of pipeline-level parallelism implementation, we recommend reading [PyTorch DataLoader](https://github.com/pytorch/pytorch/tree/master/torch/utils/data).
- For an example of operator-level parallelism implementation, we recommend reading [MindData](https://gitee.com/mindspore/mindspore/tree/master/mindspore/ccsrc/minddata).

# Architecture of Machine Learning Clusters
Distributed model training is usually implemented in a compute cluster.
Next, we will introduce the composition of a compute cluster and explore
the design of a cluster network.
Figure :numref:`ch010/ch10-datacentre` shows the typical architecture of
a machine learning cluster. There are many servers deployed in such a
cluster, and each server has several hardware accelerators. To
facilitate server management, multiple servers are placed into one
*rack*, which is connected to a *top of rack (ToR) switch*. If ToR
switches are fully loaded but more new racks need to be connected, we
can add a *spine switch* between ToR switches. Such a structure forms a
multi-level tree. It is worth noting that cross-rack communication
within a cluster may encounter network bottlenecks. This is because the
network links used to construct the cluster network have the same
specifications (necessary to facilitate hardware procurement and device
management), increasing the probability of *network bandwidth
oversubscription* on the network links from the ToR switches to the
spine switch.
Network bandwidth oversubscription can be defined as a situation wherein
the peak bandwidth required exceeds the actual bandwidth available on
the network. In the cluster shown in Figure
:numref:`ch010/ch10-datacentre`, when server 1 and server 2 send
data to server 3 through their respective network links (say 10 Gb/s of
data), ToR switch 1 aggregates the data (that is, 20 Gb/s) and sends it
to spine switch 1. However, because there is only one network link (10
Gb/s) between spine switch 1 and ToR switch 1, the peak bandwidth
required is twice the actual bandwidth available, hence network
bandwidth oversubscription. In real-world machine learning clusters, the
ratio between peak bandwidth and actual bandwidth is generally between
1:4 and 1:16. One approach for avoiding network bottlenecks is to
restrict network communication within individual racks. This approach
has become a core design requirement for distributed machine learning
systems.
![Architecture of a machine learning cluster](../img/ch10/ch10-datacentre.png)
:label:`ch010/ch10-datacentre`
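The oversubscription ratio described above is simply the aggregate bandwidth demanded below a switch divided by the bandwidth available above it. A minimal sketch (the function name and link speeds are illustrative, mirroring the example in the text):

```python
def oversubscription_ratio(downlink_gbps, num_downlinks,
                           uplink_gbps, num_uplinks):
    """Peak bandwidth demanded below a switch divided by its uplink capacity."""
    return (downlink_gbps * num_downlinks) / (uplink_gbps * num_uplinks)

# ToR switch 1 in the example: two 10 Gb/s server links, one 10 Gb/s uplink.
print(oversubscription_ratio(10, 2, 10, 1))   # 2.0, i.e., oversubscribed 2:1
```

The same formula reproduces the 1:4 to 1:16 ratios quoted for real-world clusters once the downlink and uplink counts of a production switch are plugged in.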
So, how much network bandwidth is required for training a large-scale
neural network in a compute cluster? Assume a neural network has
hundreds of billions of parameters (e.g., GPT-3 --- a huge language
model released by OpenAI --- has nearly 175 billion parameters). If each
parameter is expressed with a 32-bit floating-point number, a single
model replica in data parallelism mode will generate 700 GB (175 billion
$*$ 4 bytes) of local gradient data in each round of training iteration.
If there are three model replicas, at least 1.4 TB \[700 GB $*$
$(3-1)$\] of gradient data needs to be transmitted. This is because for
$N$ replicas, only $N-1$ of them need to be transmitted for computation.
To ensure that the model replicas will not diverge from the parameters
in the main model, the average gradient --- once computed --- is
broadcast to all model replicas (1.4 TB of data) for updating local
parameters in these model replicas.
Currently, machine learning clusters generally use Ethernet to construct
networks between different racks. The bandwidth of mainstream commercial
Ethernet links ranges from 10 Gb/s to 25 Gb/s.[^1] Transmitting massive
gradients over Ethernet therefore incurs severe transmission latency.
Because of this, new machine learning clusters (such as NVIDIA DGX) are
often configured with the faster InfiniBand. A single InfiniBand link
can provide 100 Gb/s or 200 Gb/s bandwidth. Even this high-speed network
still faces high latency when transmitting TB-level local gradients.
Even if network latency is ignored, it takes at least 40 seconds to
transmit 1 TB of data on a 200 Gb/s link.
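The back-of-the-envelope numbers above can be reproduced directly (the constants mirror the text's GPT-3 example; real transfers would be slower due to latency and protocol overhead):

```python
PARAMS = 175e9            # GPT-3-scale parameter count
BYTES_PER_PARAM = 4       # 32-bit floating point
REPLICAS = 3

grad_bytes = PARAMS * BYTES_PER_PARAM        # local gradients per replica
transmitted = grad_bytes * (REPLICAS - 1)    # gradients that must travel

def transfer_seconds(num_bytes, link_gbps):
    """Lower-bound transfer time, ignoring latency and protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)

print(grad_bytes / 1e9)              # 700.0 GB per replica
print(transmitted / 1e12)            # 1.4 TB across the cluster
print(transfer_seconds(1e12, 200))   # 40.0 s for 1 TB on a 200 Gb/s link
```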
To address this issue, InfiniBand uses remote direct memory access
(RDMA) as the core of its programming interfaces. RDMA enables
InfiniBand to provide high-bandwidth, low-latency data read and write
functions. As such, its programming interfaces are vastly different from
the TCP/IP socket interfaces used by conventional Ethernet. For
compatibility purposes, people use the IP-over-InfiniBand (IPoIB)
technology, which ensures that legacy applications can invoke socket
interfaces whereas the underlying layer invokes the RDMA interfaces of
InfiniBand through IPoIB.
To support multiple accelerators (typically 2--16) within a server, a
common practice is to build a heterogeneous network on the server. Take
server 1 in Figure :numref:`ch010/ch10-datacentre` as an example. This server is
equipped with two CPUs, which communicate with each other through
QuickPath Interconnect (QPI). Within a CPU socket, the
accelerator and CPU are connected by a PCIe bus. Accelerators use
high-bandwidth memory (HBM), which offers much more bandwidth than PCIe
does. A prominent example is the NVIDIA A100 server: In this server, HBM
offers 1935 GB/s bandwidth, whereas PCIe 4.0 offers only 64 GB/s
bandwidth. PCIe needs to be shared by all accelerators within the
server, meaning that it becomes a significant communication bottleneck
when multiple accelerators simultaneously transmit data through PCIe. To
solve this problem, machine learning servers tend to use accelerator
high-speed interconnect technologies (e.g., NVIDIA GPU NVLink). Such
technology bypasses PCIe to achieve high-speed communication. A
prominent example is NVIDIA A100 GPU --- its NVLink provides 600 GB/s
bandwidth, enabling accelerators to transmit large amounts of data to
each other.
## AI Cluster Network Topology
[^1]: Network bandwidth is typically measured in Gb/s, whereas memory
bandwidth is in GB/s --- *b* stands for bit, and *B* stands for
byte.

# Collective Communication
This section delves into the application of collective communication in
the creation of distributed training systems within machine learning
clusters. Collective communication, a fundamental aspect of parallel
computing, is instrumental in developing high-performance Single Program
Multiple Data (SPMD) programs. We will begin by discussing common
operators within collective communication. Following this, we explore
the use of the AllReduce algorithm to alleviate network bottlenecks in
distributed training systems. Lastly, we will address the support
available for different collective communication algorithms within
existing machine learning systems.
## Collective Communication Operators
In this subsection, we will establish a simplified model of collective
communication before introducing commonly used collective communication
operators. These include Broadcast, Reduce, AllGather, Scatter, and
AllReduce:
![Examples of collective communication operators](../img/ch10/ch10-collective-operators.png)
:label:`ch010/ch10-collective-operators`
- **Broadcast**: The Broadcast operator is often employed in a
distributed machine learning system to transmit model parameters or
configuration files from device $i$ to all other devices. The
starting and final states of this operation, initiated by device 1
in a three-device cluster, are depicted in Figure
:numref:`ch010/ch10-collective-operators`.
- **Reduce**: In a distributed machine learning system, the Reduce
operator plays a pivotal role by consolidating computation results
from different devices. It is commonly used to aggregate the local
gradients from each device into a gradient summation. The aggregation
function $f$ typically obeys the associative and commutative laws;
common choices include sum, prod, max, and min. The operator is
initiated by all devices, with the final aggregate result stored on
device $i$. The initial and final states when device 1 executes the
Reduce operator for summation are depicted in Figure
:numref:`ch010/ch10-collective-operators`.
- **AllReduce**: The AllReduce operator, a part of collective
communication, stores the result of the Reduce function $f$ in all
devices. Figure
:numref:`ch010/ch10-collective-operators` shows the starting
and ending states when devices 1, 2, and 3 jointly execute AllReduce
to perform a summation.
- **Gather**: The Gather operator can gather data from all devices and
store it in device $i$. Figure
:numref:`ch010/ch10-collective-operators` shows the initial
and end states when device 1 invokes the Gather operator to gather
data from all devices.
- **AllGather**: The AllGather operator sends the gather result to all
devices. Figure
:numref:`ch010/ch10-collective-operators` shows the initial
and end states when devices 1, 2, and 3 invoke the AllGather
operator.
- **Scatter**: The Scatter operator is the inverse of the Gather
operator. Figure
:numref:`ch010/ch10-collective-operators` shows the initial
and end states when device 1 invokes the Scatter operator.
It's important to note that other collective communication operators may
also be deployed in distributed machine learning applications. Examples
of these are ReduceScatter, Prefix Sum, Barrier, and All-to-All.
However, this section will not delve into the specifics of these
operators.
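The semantics of these operators can be captured in a single-process reference sketch, with one list entry per device. Real systems (e.g., MPI or NCCL) execute them across processes and devices; the helper names and the running example values below are illustrative:

```python
# Reference semantics of the operators above, one list entry per device.
def broadcast(bufs, root):
    return [list(bufs[root]) for _ in bufs]

def reduce(bufs, root, f=sum):
    out = [f(vals) for vals in zip(*bufs)]      # elementwise reduction
    return [out if i == root else list(b) for i, b in enumerate(bufs)]

def allreduce(bufs, f=sum):
    out = [f(vals) for vals in zip(*bufs)]
    return [list(out) for _ in bufs]            # result lands on every device

def gather(bufs, root):
    flat = [x for b in bufs for x in b]
    return [flat if i == root else list(b) for i, b in enumerate(bufs)]

def allgather(bufs):
    flat = [x for b in bufs for x in b]
    return [list(flat) for _ in bufs]

def scatter(bufs, root):
    n = len(bufs[root]) // len(bufs)
    return [bufs[root][i * n:(i + 1) * n] for i in range(len(bufs))]

devices = [[2, 4, 6], [1, 2, 3], [4, 8, 12]]    # the text's running example
print(allreduce(devices))                       # [7, 14, 21] on all 3 devices
```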
## Gradient Averaging with AllReduce
The following discusses how to utilize AllReduce operators to implement
efficient gradient averaging in large clusters. A simple method for
computing the average gradient is to have one device in the cluster
gather the local gradients from every device and then broadcast the
computed average gradient back to all devices. Although this approach is
easy to implement, it leads to two problems. 1) Network congestion may
occur when multiple devices send data to the gathering device
simultaneously. 2) A single device may lack the computing power to
perform the gradient averaging on its own.
To solve the preceding problems, the Reduce-Broadcast implementation of
the AllReduce operator can be used to optimize the algorithm. In this
implementation, all nodes participate in network communication and
averaging computation of gradients so that the huge amount of network
and computing overheads is evenly shared across all nodes. This
implementation can solve the two problems of a single gradient gather
node. Assume that there are $M$ devices, and that each device stores a
model replica consisting of $N$ parameters/gradients. According to the
requirements of AllReduce, all parameters are partitioned into $M$
partitions based on the number of devices, with each partition
containing $N/M$ parameters. The initial and final states of the
algorithm are illustrated in the following example.
In the AllReduce example shown in Figure
:numref:`ch010/ch10-collective-operators`, there are three
devices. Each device has a model replica, and each replica has 3
parameters. According to the partitioning method of AllReduce,
parameters are partitioned into three partitions (because there are 3
devices), and each partition has 1 ($N/M$ = 3/3) parameter. In this
example, assume that device 1 has parameters 2, 4, and 6; device 2 has
parameters 1, 2, and 3; and device 3 has parameters 4, 8, and 12. After
an AllReduce operator is used for computation, the gradient summation
results 7, 14, and 21 are sent to all devices. The result 7 of partition
1 is the sum of the initial results of partition 1 in the three devices
(7 = 1 + 2 + 4). To compute the average gradient, the sum of gradients
needs to be divided by the number of devices (e.g., to obtain the final
result of partition 1, divide 7 by 3).
The AllReduce operator splits the gradient computation into $M-1$ Reduce
operators and $M-1$ Broadcast operators (where $M$ indicates the number
of nodes). Reduce operators are used to compute the summation of
gradients, and Broadcast operators are used to broadcast the summation
of gradients to all nodes.
Figure :numref:`ch010/ch10-allreduce-process` shows the execution
process of an AllReduce operator. The AllReduce operator starts with a
Reduce operator. In the first Reduce operator, the AllReduce operator
performs pairing on all nodes and enables them to jointly complete
gradient summation. In the first Reduce operator shown in Figure
:numref:`ch010/ch10-allreduce-process`, devices 1 and 2 are
paired to jointly complete the summation of data in partition 1. Device
2 sends local gradient data 1 to device 1, which adds up the received
gradient data 1 and gradient data 2 stored in local partition 1 to
obtain the intermediate gradient summation result 3. At the same time,
devices 1 and 3 are paired to jointly complete the summation of data in
partition 3, and devices 3 and 2 are paired to jointly complete the
summation of data in partition 2.
![Process of the AllReduce algorithm](../img/ch10/ch10-allreduce-process.png)
:label:`ch010/ch10-allreduce-process`
Such distributed computing of gradients performed by Reduce operators
realizes the following performance optimizations:
1. **Network optimization:** All devices receive and send data
simultaneously by utilizing their ingress and egress bandwidths.
Therefore, in the execution process of the AllReduce algorithm, the
available bandwidth is $M * B$, where $M$ indicates the number of
nodes and $B$ indicates the node bandwidth. This enables the system
to implement network bandwidth scalability.
2. **Computing power optimization:** Processors of all devices
participate in the gradient summation. Therefore, in the execution
process of the AllReduce algorithm, the total number of available
processors is $M * P$, where $M$ indicates the number of nodes and
$P$ indicates the number of processors for a single device. This
enables the system to implement computing scalability.
3. **Load balancing:** Data partitions are evenly partitioned.
Therefore, the communication and computing overheads allocated to
each device are the same.
In the Reduce operators other than the first one, the AllReduce
algorithm selects other pairing methods for different data partitions.
For example, in the second Reduce operator shown in Figure
:numref:`ch010/ch10-allreduce-process`, the AllReduce algorithm
pairs devices 1 and 3 for data summation in partition 1. Devices 1 and 2
are paired for data summation in partition 2, and devices 2 and 3 are
paired for data summation in partition 3. In a three-node AllReduce
cluster, after two Reduce operators complete execution, the data
summation result of each partition is obtained. The data summation
result (7) of partition 1 is stored on device 3, the data summation
result (14) of partition 2 is stored on device 1, and the data summation
result (21) of partition 3 is stored on device 2.
The AllReduce algorithm then enters the broadcast phase. The process in
this phase is similar to the execution process of Reduce operators. The
core difference is that, after nodes are paired, they do not add up data
--- instead, they broadcast the computation results of Reduce operators.
In the first Broadcast operator shown in Figure
:numref:`ch010/ch10-allreduce-process`, device 1 directly writes
the result (14) of partition 2 to partition 2 of device 3. Device 2
directly writes the result (21) of partition 3 to device 1, and device 3
directly writes the result of partition 1 to device 2. In a three-node
AllReduce cluster, the Broadcast operator is repeated twice in order to
notify all nodes of the Reduce computation result of each partition.
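The two phases above can be simulated end to end in a single process. The sketch below follows the same structure ($M-1$ Reduce steps, then $M-1$ Broadcast steps over a ring), with each partition holding one value as in the three-device example; the pairing formula used here is one standard ring schedule and may pair devices in a different but equivalent order from the figure:

```python
def ring_allreduce(device_data):
    """Simulate AllReduce as M-1 ring Reduce steps plus M-1 Broadcast steps."""
    m = len(device_data)
    parts = [list(d) for d in device_data]   # parts[i][p]: partition p on device i
    # Reduce phase: in step s, device i sends partition (i - s) % m to
    # device (i + 1) % m, which adds it to its own copy.
    for s in range(m - 1):
        for i in range(m):
            p = (i - s) % m
            parts[(i + 1) % m][p] += parts[i][p]
    # Device i now holds the complete sum of partition (i + 1) % m.
    # Broadcast phase: pass each finished partition around the ring,
    # overwriting instead of adding.
    for s in range(m - 1):
        for i in range(m):
            p = (i + 1 - s) % m
            parts[(i + 1) % m][p] = parts[i][p]
    return parts

print(ring_allreduce([[2, 4, 6], [1, 2, 3], [4, 8, 12]]))
# every device ends with [7, 14, 21]
```

Note the properties the text highlights: in every step all devices send and receive one partition simultaneously, so bandwidth, computation, and load are spread evenly across the ring.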
## Model Training with Collective Communication
Typically, a machine learning system flexibly combines different
collective communication operators for different clusters to maximize
communication efficiency. The following describes two cases: ZeRO and
DALL-E.
ZeRO is a neural network optimizer proposed by Microsoft. In practice,
ZeRO successfully trained the world's largest language model in 2020
(with up to 17 billion parameters). In the training process of a neural
network like this, parameters of the optimizer, gradients obtained
during backward computation, and model parameters all impose significant
pressure on the memory space of accelerators. If parameters are
represented by 32-bit floating-point numbers, the parameters of a model
with 17 billion parameters alone occupy 68 GB, and the accompanying
gradients and optimizer states multiply this footprint several times
over, far exceeding the maximum
memory capacity (80 GB) of NVIDIA A100 (an accelerator with the largest
memory available today). Therefore, we need to explore how to
efficiently split a model across different accelerators, and how to
efficiently utilize collective communication operators for model
training and inference. The following describes three optimization
technologies regarding collective communication:
1. **Parameter storage on a single node:** The bandwidth of the
accelerators inside a node in a modern cluster is much greater than
the inter-node bandwidth. Therefore, we need to minimize inter-node
communication and ensure that communication mostly happens between
accelerators inside nodes. The model slicing process shows that the
amount of communication between different slices during the forward
and backward computation of the model is far less than the average
amount of communication required for gradient averaging of model
replicas. As such, ZeRO stores all slices of a single model in the
same node, greatly improving the training efficiency.
2. **Forward computation based on the AllGather operator:** Assuming
that the model's parameters are organized layer by layer, we can
assign them to different accelerators from front to back, following
the order of the layers in the network. In forward
computation, the computation of a layer depends only on the
parameters of its adjacent layers. Given this, we can apply
AllGather computation once on all accelerators that contain model
parameters in order to extract the parameters of the next layer for
the current layer and to compute the activation value of the current
layer. To conserve memory resources, the parameters of layers other
than the current one need to be discarded immediately after the
AllGather operation is complete.
3. **Gradient averaging based on the ReduceScatter operator:**
Similarly, during backward computation, only the parameters of the
previous layer are needed to compute the activation value and
gradient of the current layer. Therefore, AllGather can be used
again to complete the gradient computation on each accelerator. At
the same time, after gradients are gathered, each accelerator needs
only the gradient corresponding to the layer with the same index as
the accelerator. In this case, the ReduceScatter operator, instead
of AllReduce, can be used to directly store the corresponding
gradient to accelerator $i$.
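Optimizations 2 and 3 can be sketched with plain lists. This is a toy single-process model of the gather-use-discard and ReduceScatter patterns, not ZeRO's implementation; the shard sizes, helper names, and gradient values are assumptions:

```python
# Toy ZeRO-style sharding: M "devices", each owning one parameter shard.
M = 4
SHARD = 2
owned = [[float(i)] * SHARD for i in range(M)]       # shard i lives on device i

def all_gather(shards):
    return [x for s in shards for x in s]            # every device sees all params

def reduce_scatter(per_device_grads):
    total = [sum(col) for col in zip(*per_device_grads)]
    n = len(total) // M
    return [total[i * n:(i + 1) * n] for i in range(M)]  # device i keeps shard i

# Forward: gather the needed parameters, use them, discard immediately.
full_params = all_gather(owned)
activation = sum(full_params)                        # stand-in for layer compute
del full_params                                      # free memory right away

# Backward: each device holds a full-size local gradient; ReduceScatter
# leaves each device with only the summed shard it owns.
local_grads = [[i + 1.0] * (SHARD * M) for i in range(M)]
print(reduce_scatter(local_grads)[2])                # [10.0, 10.0] (1+2+3+4)
```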
DALL-E is a text-based image generation model proposed by OpenAI. This
model has up to 12 billion parameters. In addition to the AllGather +
ReduceScatter technique used by ZeRO during training, the OpenAI team
made further optimizations. The following describes two optimization
technologies regarding collective communication:
1. **Matrix factorization:** The operational speeds of collective
communication operators are positively correlated with the message
length. In model training, the message length indicates the number
of model parameters. DALL-E uses matrix factorization to convert a
high-dimensional tensor into a two-dimensional matrix, and then uses
collective communication operators for transmission after
factorization. In this way, DALL-E significantly reduces the amount
of communication.
2. **Custom data types:** Another way to reduce the amount of
communication is to modify data types. As expected, the 16-bit
half-precision floating-point number representation can reduce the
amount of communication by nearly half compared with the 32-bit
floating-point number representation. However, in practice,
low-precision data types cause unstable model convergence and
compromise the final training result. OpenAI analyzes the structure
of the DALL-E model and classifies the model parameters into three
categories based on their sensitivity to the precision of data
types. The most precision-sensitive parameters are represented by
32-bit floating-point numbers and synchronized only by the AllReduce
operator, whereas the most precision-insensitive parameters are
compressed and transmitted using matrix factorization. For the
remaining parameters, such as the moments and variance parameters
involved in Adam optimization, OpenAI implements two new data types
based on the IEEE 754 standard: 1-6-9 and 0-6-10. (The first digit
indicates the number of sign bits, the second the number of exponent
bits, and the third the number of significand bits.) In addition to
conserving space, this also ensures training convergence.
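A decoder for such custom formats can be written by reusing IEEE 754 conventions. The bias ($2^{e-1}-1$) and subnormal handling below are the standard IEEE choices, which the text does not spell out, so treat them as an assumption rather than OpenAI's exact definition:

```python
def decode_custom_float(sign_bits, exp_bits, frac_bits, raw):
    """Decode an integer bit pattern as an IEEE-754-style float with custom
    field widths, e.g. (1, 6, 9) or (0, 6, 10)."""
    frac = raw & ((1 << frac_bits) - 1)
    exp = (raw >> frac_bits) & ((1 << exp_bits) - 1)
    sign = (raw >> (frac_bits + exp_bits)) & 1 if sign_bits else 0
    bias = (1 << (exp_bits - 1)) - 1        # standard IEEE bias assumption
    if exp == 0:                            # subnormal range
        value = (frac / (1 << frac_bits)) * 2.0 ** (1 - bias)
    else:
        value = (1 + frac / (1 << frac_bits)) * 2.0 ** (exp - bias)
    return -value if sign else value

# A 1-6-9 pattern with exponent == bias (31) and zero fraction encodes 1.0.
print(decode_custom_float(1, 6, 9, 31 << 9))   # 1.0
```

The 0-6-10 variant drops the sign bit entirely, spending it on an extra significand bit for quantities known to be non-negative (such as Adam's variance term).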

# Distributed Training
As the field of machine learning continues to accelerate at a rapid
pace, it has given rise to increasingly sophisticated models. These
models are characterized by a staggering quantity of parameters, the
gigantic size of a training dataset, and highly sophisticated
structures, which in turn place significant demands on both computing
and memory resources. Consequently, the limitations of single-machine
systems have become increasingly apparent, and they no longer suffice
for training these large machine learning models. This necessitates the
advent of distributed training systems, designed to alleviate the strain
on resources.
In this chapter, we dive deep into the fundamentals, design aspects, and
practical implementations of distributed machine learning systems. We
commence our discussion by elucidating what distributed training systems
entail, followed by an exploration of the rationale behind their design
and the potential benefits they offer. Subsequently, we scrutinize the
most commonly adopted methods of distributed training, encompassing data
parallelism, model parallelism, and pipeline parallelism. Each of these
methods can typically be implemented via one of two techniques:
collective communication or parameter servers, both of which come with
their unique sets of merits and drawbacks.
The key learning objectives of this chapter are as follows:
1. Grasping the advantages offered by distributed training systems.
2. Understanding widely-used parallelism methods, namely, data
parallelism, model parallelism, hybrid parallelism, and pipeline
parallelism.
3. Comprehending the architecture of a machine learning cluster.
4. Understanding collective communication operators and their
applications in distributed training systems.
5. Developing an understanding of parameter server architectures.
```toc
:maxdepth: 2
Overview
Parallelism_Methods
Pipeline_Parallelism_with_Micro-Batching
Architecture_of_Machine_Learning_Clusters
Collective_Communication
Parameter_Server
Federated_Learning
Training_Large_Language_Models
Chapter_Summary
Further_Reading
```

# Parallelism Methods
This section explores the prevalent methods for implementing distributed
training systems, discussing the design goals and a detailed examination
of each parallelism approach.
## Classification of Methods
Distributed training amalgamates multiple single-node training systems
into a parallel structure to expedite the training process without
sacrificing model accuracy. A single-node training system, depicted in
Figure :numref:`ch010/ch10-single-node`, processes training datasets
split into small batches, termed as mini-batches. Here, a mini-batch of
*data* is input into the model, guided by a training *program*, which
generates gradients to enhance model accuracy. Typically, this program
executes a deep neural network. To illustrate the execution of a neural
network, we employ a computational graph, comprising connected
operators. Each operator executes a layer of the neural network, storing
parameters to be updated during the training.
![Single-node training system](../img/ch10/ch10-single-node.png)
:label:`ch010/ch10-single-node`
The execution of a computational graph involves two phases: *forward*
and *backward* computation. In the forward phase, data is fed into the
initial operator, which calculates and generates the data required by
the downstream operator. This process is continued sequentially through
all operators until the last one concludes its computation. The backward
phase initiates from the last operator, computing gradients and updating
local parameters accordingly. The process culminates at the first
operator. Upon completion of these two phases for a given mini-batch,
the system loads the next mini-batch to update the model.
Considering a model training job, partitioning the *data* and *program*
can facilitate parallel acceleration. Table
:numref:`ch010/ch10-parallel-methods` compiles the various partition
methods. Single-node training systems follow a "single program, single
data" paradigm. For parallel computing across multiple devices, data is
partitioned and the program is replicated for simultaneous execution,
creating a "single program, multiple data" or *data parallelism*
paradigm. Another approach involves partitioning the program and
distributing its operators across devices, termed "multiple
programs, single data" or *model parallelism*. When training
exceptionally large AI models, both data and program are partitioned to
maximize the degree of parallelism (DOP), yielding a "multiple
program, multiple data" or *hybrid parallelism* paradigm.
:Parallelism methods
| Classification | Single Data | Multiple Data |
|------------------|-----------------------|-------------------- |
| Single program | single-node execution | data parallelism |
| Multiple program | model parallelism | hybrid parallelism |
:label:`ch010/ch10-parallel-methods`
## Data Parallelism
Data parallelism is used when a single node cannot provide sufficient
computing power. This is the most common parallelism approach adopted by
AI frameworks. Specific implementations include TensorFlow
DistributedStrategy, PyTorch Distributed, and Horovod
DistributedOptimizer. Given a data-parallel system, assume that the
training batch size is $N$, and that there are $M$ devices available for
parallel acceleration. To achieve data parallelism, the batch size is
partitioned into $M$ partitions, with each device getting $N/M$ training
samples. Sharing a replica of the training program, each device executes
and calculates a gradient separately over its own data partition. Each
device (indexed $i$) calculates a gradient $G_i$ based on local training
samples. To ensure that training program parameters are coherent, local
gradients $G_i$ on the $M$ devices are aggregated to calculate an
average gradient $(\sum_{i=1}^{M} G_i) / M$. To complete the training on
this mini-batch, the training program updates model parameters based on
the average gradient.
Figure :numref:`ch010/ch10-data-parallel` shows a data-parallel training
system composed of two devices. For a batch size of 64, each device is
assigned 32 training samples and shares the same neural network
parameters (or program replicas). The local training samples are passed
through the operators in the program replica in sequence for forward and
backward computation. During backward computation, the program replicas
generate local gradients. Corresponding local gradients on different
devices (e.g., gradient 1 on device 1 and gradient 1 on device 2) are
aggregated (typically by AllReduce, a collective communication
operation) to calculate an average gradient.
![Data-parallel system](../img/ch10/ch10-data-parallel.png)
:label:`ch010/ch10-data-parallel`
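The claim that averaging per-device gradients reproduces the full-batch gradient holds exactly when the partitions have equal size. This is easy to check on a one-parameter least-squares model (the data values below are made up):

```python
def grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

full = grad(w, xs, ys)          # gradient over the whole mini-batch
# Two "devices", each with half the batch (N/M = 2 samples each).
g1 = grad(w, xs[:2], ys[:2])
g2 = grad(w, xs[2:], ys[2:])
print(abs((g1 + g2) / 2 - full) < 1e-12)   # True
```

With unequal partition sizes, a plain average of per-device gradients would weight samples unevenly, which is why data-parallel systems shard batches uniformly.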
## Model Parallelism
Model parallelism is useful when memory constraints make it impossible
to train a model on a single device. For example, the memory on a single
device will be insufficient for a model that contains a large operator
(such as the compute-intensive fully connected layer for classification
purpose). In such cases, we can partition this large operator for
parallel execution. Assume that the operator has $P$ parameters and the
system consists of $N$ devices. To minimize the workload on each device
given the limited memory capacity, we can evenly assign the parameters
across the devices ($P/N$ = number of parameters per device). This
partitioning method is called **intra-operator parallelism**, which is a
typical application of model parallelism.
Figure :numref:`ch010/ch10-model-parallel-intra-op` shows an example of
intra-operator parallelism implemented by two devices. The neural
network in this example consists of two operators. To complete forward
and backward computation, operator 1 and operator 2 require 16 GB and 1
GB of memory, respectively. However, in this example, the maximum amount
of memory a single device can provide is only 10 GB. To train this
network, parallelism is implemented on operator 1. Specifically, the
parameters of operator 1 are evenly partitioned into two partitions
between device 1 and device 2, meaning that device 1 runs program
partition 1 while device 2 runs program partition 2. The network
training process starts with feeding a mini-batch of training data to
operator 1. Because the parameters of operator 1 are shared between two
devices, the data is broadcast to the two devices. Each device completes
forward computation based on the local partition of parameters. The
local computation results on the devices are aggregated before being
sent to downstream operator 2. In backward computation, the data of
operator 2 is broadcast to device 1 and device 2, so that each device
completes backward computation based on the local partition of
operator 1. The local computation results on the devices are aggregated
and returned to complete the backward computation process.
![Model-parallel system: intra-operator parallelism](../img/ch10/ch10-model-parallel-intra-op.png)
:label:`ch010/ch10-model-parallel-intra-op`
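The broadcast-compute-aggregate pattern above can be sketched as a column-split matrix multiplication on a fully connected layer. Splitting the weight matrix by output columns is one common partitioning choice (the aggregation step then reduces to concatenation); the matrices here are illustrative:

```python
def matmul(x, w):
    """Row vector x (length k) times matrix w (k x n)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def split_cols(w, parts):
    """Partition w into `parts` column blocks, one per device."""
    n = len(w[0]) // parts
    return [[row[p * n:(p + 1) * n] for row in w] for p in range(parts)]

x = [1.0, 2.0]                     # input, broadcast to both devices
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]         # 2 x 4 weight, split into 2 partitions

w1, w2 = split_cols(w, 2)          # device 1 holds w1, device 2 holds w2
y = matmul(x, w1) + matmul(x, w2)  # list concatenation == aggregation step
print(y == matmul(x, w))           # True
```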
In some cases, the overall model --- rather than specific operators ---
requires more memory than a single device can provide. Given $N$
operators and $M$ devices, we can evenly assign the operators across $M$
devices. As such, each device needs to run forward and backward
computation of only $N/M$ operators, thereby reducing the memory
overhead of each device. This application of model parallelism is called
*inter-operator parallelism*.
Figure :numref:`ch010/ch10-model-parallel-inter-op` shows an example of
inter-operator parallelism implemented by two devices. The neural
network in this example has two operators, each requiring 10 GB of
memory for computation (20 GB in total). Because the maximum memory a
single device can provide in this example is 10 GB, we can place
operator 1 on device 1 and operator 2 on device 2. In forward
computation, the output of operator 1 is sent to device 2, which uses
this output as input to complete forward computation of operator 2. In
backward computation, device 2 sends the backward computation result of
operator 2 to device 1 for backward computation of operator 1,
completing the training on a mini-batch.
![Model-parallel system: inter-operator parallelism](../img/ch10/ch10-model-parallel-inter-op.png)
:label:`ch010/ch10-model-parallel-inter-op`
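The forward/backward hand-off between the two devices can be sketched with toy scalar operators (all function names and values are illustrative; the only inter-device traffic is op1's activation going forward and op2's input gradient coming back):

```python
def op1_forward(x):   return 2 * x   # runs on device 1
def op2_forward(h):   return h + 3   # runs on device 2
def op2_backward(dy): return dy      # d(h+3)/dh = 1; sent back to device 1
def op1_backward(dh): return dh * 2  # d(2x)/dx = 2; finishes on device 1

x = 5.0
h = op1_forward(x)       # device 1 -> send activation to device 2
y = op2_forward(h)       # device 2 finishes the forward pass
dh = op2_backward(1.0)   # device 2 -> send gradient back to device 1
dx = op1_backward(dh)    # device 1 finishes the backward pass
print(y, dx)             # 13.0 2.0
```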
## Hybrid Parallelism
In training large AI models, the computing power and memory constraints
often go hand in hand. The solution to overcoming these constraints is
to adopt a hybrid of data parallelism and model parallelism, that is,
hybrid parallelism. Figure
:numref:`ch010/ch10-hybrid-parallel` shows an example of hybrid
parallelism implemented by four devices. In this example, inter-operator
parallelism is adopted to reduce memory overheads by allocating operator
1 to device 1 and operator 2 to device 2. Device 3 and device 4 are
added to the system to achieve data parallelism, thereby improving the
computing power of the system. Specifically, the training data is
partitioned to data partitions 1 and 2, and the model (consisting of
operators 1 and 2) is replicated on devices 3 and 4 respectively. This
makes it possible for the program replicas to run in parallel. During
forward computation, devices 1 and 3 run the replicas of operator 1
simultaneously and send their respective computation results to devices
2 and 4 to compute the replicas of operator 2. During backward
computation, devices 2 and 4 compute gradients simultaneously, and the
local gradients are averaged by using the AllReduce operation. The
averaged gradient is back-propagated to the replicas of operator 1 on
devices 1 and 3, and the backward computation process ends.
![Hybrid-parallel system](../img/ch10/ch10-hybrid-parallel.png)
:label:`ch010/ch10-hybrid-parallel`
# Overview
This section provides an overview of the need for distributed training
systems.
## Motivation
The principal objective of implementing distributed training systems is
to circumvent the restrictions imposed by single-node training systems,
primarily characterized by their computational and memory constraints.
### Computational Constraints
A single processor, confined by its inherent limitations, can only yield
a certain extent of computational power, quantified in terms of
*floating-point operations per second (FLOPS)*. The advent of
distributed training systems emerged as an innovative resolution to
overcome these constraints associated with a single processor's
computational prowess.
Figure :numref:`ch010/ch10-computation-increase` illustrates the
escalating demands for computational power required by machine learning
models compared to the growth rate of a processor's computational
capabilities over the past few years. In this context, computational
power is measured in petaFLOP/s-day, a unit implying the execution of
$10^{15}$ neural network operations every second for an entire day,
summing up to approximately $10^{20}$ operations in total. According to
Moore's Law, the computational power of CPUs approximately doubles every
18 months. This exponential growth principle also extends to
accelerators, such as GPUs and TPUs, which are leveraged to support
machine learning computations with their immense computational
abilities.
However, the evolution of machine learning models is outpacing this
growth rate. A few years back, machine learning models, like AlexNet,
could only recognize a limited array of objects. Now, with models like
AlphaStar, we have reached a point where machines can outperform humans
in executing certain intricate tasks. In this short timeframe, the
computational demands of machine learning models have escalated 56-fold
every 18 months.
Distributed computing is designed to reconcile this divergence between
the performance of processors and the rising demand for computational
power. By capitalizing on the myriad of processors available in
expansive data centers and cloud computing facilities and managing them
effectively through distributed training systems, we can cater to the
surging computational requirements of evolving models.
![Machine Learning Model Size vs. Hardware Computational Capability](../img/ch10/ch10-computation-increase.png)
:label:`ch010/ch10-computation-increase`
### Memory Constraints
The process of training machine learning models often necessitates
substantial memory. Take, for instance, a neural network model boasting
100 billion parameters in a 32-bit floating-point format (4 bytes); it
would demand 400 GB of memory to store all parameters. In practice,
additional memory is needed to store activation values and gradients.
Assuming these are also stored in a 32-bit floating-point format, an
extra 800 GB of memory would be required. This would result in an
overall memory requirement exceeding 1200 GB (or 1.2 TB). Nevertheless,
current accelerators, such as the NVIDIA A100, can only provide a
maximum memory of 80 GB.
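The back-of-envelope estimate above can be reproduced directly; the only assumption beyond the text is that gradients and activation values each take as much memory as the parameters themselves.

```python
# Memory estimate from the text: 100 billion parameters in 32-bit
# floating point, plus gradients and activations of the same size.
params = 100e9                  # 100 billion parameters
bytes_per_value = 4             # 32-bit float = 4 bytes

param_mem = params * bytes_per_value        # 400 GB for parameters
extra_mem = 2 * params * bytes_per_value    # 800 GB for gradients + activations

total_gb = (param_mem + extra_mem) / 1e9
print(f"parameters: {param_mem / 1e9:.0f} GB, total: {total_gb:.0f} GB")
```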
However, unlike individual accelerators, whose memory growth is largely
hindered by factors such as hardware specifications, heat dissipation,
and costs, distributed training systems have the potential to train
models with hundreds of billions of parameters across hundreds of
accelerators simultaneously. This approach can fulfill the model's
memory requirements in the terabyte range.
## System Architecture
Data centers, housing hundreds of clusters with each cluster operating
hundreds to thousands of servers, provide an ideal environment for
distributed training. We can harness the power of numerous servers in a
distributed training system to parallelly train a machine learning
model.
![Comparison between single-node computing and multi-node distributed computing](../img/ch10/ch10-single-vs-multi.png)
:label:`ch010/ch10-single-vs-multi`
For enhancing the efficiency of the distributed training system, it is
crucial to assess the computational power and memory usage of computing
tasks, ensuring no single task turns into a bottleneck. As depicted in
Figure :numref:`ch010/ch10-single-vs-multi`, the system evenly
distributes a task across all computing nodes by partitioning the input
data into segments. Each model training job, which takes a dataset
(e.g., training samples) or a group of tasks (e.g., operators) as input,
is run on a computing node (e.g., a GPU) to generate outputs (e.g.,
gradients).
Distributed execution generally comprises three steps:
1. *Partitioning* the input into smaller segments.
2. *Distributing* these partitions across multiple compute nodes for
parallel computing.
3. *Merging* the outputs from all compute nodes to generate a result
akin to that of single-node computing.
This process fundamentally adheres to the divide-and-conquer approach,
where each compute node runs a small portion of the workload in parallel
with others, thus expediting the overall computing process.
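The three steps can be sketched with gradient computation as the per-node workload. This is an illustrative sketch only: a thread pool stands in for compute nodes, and the toy model `y = w * x` with squared-error loss is an assumption, not from the text. Because every partition has the same size, averaging the per-node gradients reproduces the single-node result exactly.

```python
# Sketch of the partition-distribute-merge pattern described above.
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Step 1: split the input into n roughly equal segments."""
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]

def node_compute(segment, w=2.0):
    """Per-node workload: mean gradient of squared error for y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in segment) / len(segment)

def merge(partials):
    """Step 3: combine per-node outputs (here: average the gradients)."""
    return sum(partials) / len(partials)

data = [(x, 3.0 * x) for x in range(1, 9)]        # targets from y = 3x
parts = partition(data, 4)                         # step 1

with ThreadPoolExecutor(max_workers=4) as pool:    # step 2
    partials = list(pool.map(node_compute, parts))

grad = merge(partials)                             # step 3
# grad equals the gradient a single node would compute on all of `data`
```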
## Benefits
Distributed training systems bring the following benefits:
1. **Improved system performance:** Distributed training significantly
   improves training performance. Generally, we measure the
   performance of a distributed training system with the
   time-to-accuracy metric: the time the system takes to reach a
   target model accuracy. This metric is determined by two factors:
   the time taken to process all training samples once and the
   accuracy gained in that time. By adding parallel compute nodes, we
   can shorten the time taken to process all training samples once and
   therefore achieve smaller time-to-accuracy values.
2. **Reduced costs:** Distributed training reduces the cost of training
machine learning models. Due to the limited heat dissipation
capacity of a single node, nodes with higher computing power will
incur higher costs in terms of dissipating heat. By combining
multiple compute nodes, we can obtain the same computing power in a
more cost-effective way. This drives cloud service providers (such
as Amazon and Microsoft) to focus more on providing distributed
machine learning systems.
3. **Hardware fault protection:** Machine learning training clusters
typically run commodity hardware (such as disks and NICs). As such,
hardware faults are inevitable over long-term operations. In
single-node training, the failure of one hardware device will cause
the entire training job to fail. In distributed training, a training
job is jointly completed by multiple hardware devices. This means
that the system can transfer the workload on the faulty device to a
good one, eliminating concerns that hardware faults will interrupt
training.
# Parameter Server
:label:`parameter server`
The following describes another common distributed training system:
parameter server. In different machine learning frameworks, the
parameter server may be implemented in different ways. For example,
while TensorFlow and MindSpore come with built-in parameter server
implementations, PyTorch requires users to implement the parameter
servers themselves by using RPC interfaces.
![Architecture of a parameter serversystem](../img/ch10/ch10-parameter-servers.png)
:label:`ch010/ch10-parameter-servers`
## System Architecture
Different from the machine learning systems implemented based on
collective communication, the parameter server system assigns each
server one of two roles: training server or parameter server. The parameter server
needs to provide sufficient memory and communication resources, whereas
the training server needs to provide a large quantity of computing
resources (e.g., hardware accelerators).
Figure :numref:`ch010/ch10-parameter-servers` depicts a machine learning
cluster with two training servers and two parameter servers. Assume we
have a model that can be divided into two parameter partitions. Each
partition is assigned to a parameter server for synchronizing
parameters. In the training process, each training server holds a
complete model replica and computes a gradient on its local shard of
the training dataset. The gradient is then pushed to the corresponding
parameter server. After the
two training servers push their gradients, the parameter servers start
to compute the average gradient and update parameters accordingly. The
parameter servers then request the training servers to pull the latest
parameters and start the next round of training iteration.
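The synchronous push/pull cycle just described can be sketched as follows. The class, the plain SGD update rule, and the learning rate are illustrative assumptions; a real parameter server also shards parameters and handles network transport.

```python
# Sketch of a synchronous parameter server: two training servers push
# gradients; once both arrive, the server averages them, updates the
# parameters, and serves the new values on the next pull.

class ParameterServer:
    def __init__(self, params, lr=0.1, num_workers=2):
        self.params = dict(params)
        self.lr = lr
        self.num_workers = num_workers
        self.pending = []            # gradients pushed in this round

    def push(self, grads):
        self.pending.append(grads)
        if len(self.pending) == self.num_workers:   # all workers arrived
            for key in self.params:
                avg = sum(g[key] for g in self.pending) / self.num_workers
                self.params[key] -= self.lr * avg   # SGD update (assumed)
            self.pending = []

    def pull(self):
        return dict(self.params)

ps = ParameterServer({"w": 1.0})
ps.push({"w": 4.0})     # training server 1
ps.push({"w": 2.0})     # training server 2: triggers the update
latest = ps.pull()      # w = 1.0 - 0.1 * avg(4.0, 2.0) = 0.7
```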
## Asynchronous Distributed Training
As discussed earlier, after each round of training, training servers
need to compute an average gradient to update each model replica. This
is necessary to ensure that the parameters of all model replicas are
consistent before the next round of training begins. Such implementation
is generally referred to as *synchronous training*.
Although synchronous training helps the training system achieve higher
model accuracy, in a large system, stragglers often appear due to
various causes. Common causes include: 1) The stragglers may not be in
the same rack as other devices. Therefore, the communication bandwidth
of the stragglers is significantly lower than that of the other devices.
2) The stragglers may share local computing and communication resources
with other processes, resulting in resource contention and performance
degradation.
Stragglers will significantly impact the performance of AllReduce-based
synchronous training systems. This is because, in such systems, all
nodes participate in average-gradient computation and communication.
Therefore, the emergence of any straggler will delay the entire
AllReduce operation. To solve this problem, we could use a parameter
server that realizes *asynchronous training* of models.
In an asynchronous training system, all training servers have the same
model parameter replica at the outset of training. During training, once
they finish computing gradients, the training servers immediately push
the results to the parameter server. Based on the received gradients,
the parameter server immediately updates model parameters and requests
training servers to pull the latest parameters. In this process,
different training servers are likely to use model parameters of
different versions for gradient computation. While this method may
negatively affect model accuracy, it enables different training servers
to push and pull parameters based on their operation speeds rather than
waiting for their peers. In this sense, stragglers will not affect the
performance of the entire cluster.
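The key behavioral difference from synchronous training can be sketched in a few lines: every push is applied immediately, so a slow worker may compute its next gradient against parameters that a faster worker has already made stale. The version counter and update rule here are illustrative assumptions.

```python
# Sketch of asynchronous updates on a single parameter: pushes are
# applied at once, so pulls made at different times see different
# parameter versions.

class AsyncParameterServer:
    def __init__(self, w, lr=0.1):
        self.w = w
        self.lr = lr
        self.version = 0

    def push(self, grad):
        self.w -= self.lr * grad   # apply immediately; no waiting for peers
        self.version += 1

    def pull(self):
        return self.w, self.version

ps = AsyncParameterServer(w=1.0)
w1, v1 = ps.pull()     # a fast worker pulls version 0
ps.push(4.0)           # the fast worker's gradient is applied at once
w2, v2 = ps.pull()     # a slow worker now pulls version 1
# The slow worker computed its gradient on the stale value w1: this is
# the accuracy/speed trade-off described above.
```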
### Training Sparse Models
A substantial number of large-scale machine learning models exhibit
*sparsity*, which signifies that only a subset of their parameters
become activated when a model training or inference request is
processed. An illustrative example of this can be found in recommender
systems, where a sizable embedding table is stored on parameter servers.
In response to an inference request for a specific user, the parameter
server retrieves only the embedding pertinent to that user. A similar
scenario can be observed in mixture-of-expert models, in which a limited
number of experts are activated to process input data, contingent on the
data's characteristics.
Parameter servers can be especially beneficial in streamlining the
training of sparse machine learning models. This advantage stems from
the ability to store the sparse models on the parameter servers, leaving
the dense models---often neural networks---on the training servers where
sophisticated hardware accelerators are deployed. Operating with a lower
resource footprint, parameter servers mainly necessitate an adequate
supply of memory and network resources, rather than the more expensive
parallel cores utilized by CPUs and GPUs. As a result, this approach
significantly cuts costs when accommodating large sparse models, in
contrast to the far more expensive strategy of relying solely on GPU
servers---coordinated through collective communication---to host both
the sparse and dense models.
## Model Replication
In this section, we will discuss the ways parameter servers utilize
model replication to address issues related to data hotspots and server
failures.
### Addressing Data Hotspots
Data on the internet typically follows a power-law distribution, which
means that certain parameters are accessed more often than others during
training. For instance, the embedding item of a widely popular commodity
may be pulled by training servers much more frequently than one from a
less popular commodity. This disparity can result in a parameter server,
storing such popular data, being burdened with a disproportionately high
volume of data pull and push requests, leading to data hotspots that can
undermine system scalability.
To mitigate data hotspots, a machine learning cluster can monitor the
access frequency of each model parameter. It can then create multiple
replicas of frequently accessed parameters, distributing them across
different parameter servers. To facilitate this, a router is created
which directs a parameter query to an appropriate parameter replica.
Within this router, strategies such as random routing or round-robin
routing can be implemented to ensure a balanced access workload across
all replicas.
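The replica router described above can be sketched with round-robin routing. Server and parameter names are illustrative; a production router would also track per-key access frequency to decide which parameters deserve extra replicas.

```python
# Sketch of a round-robin replica router: hot parameters get several
# replicas and pull requests cycle through them.
from itertools import cycle

class ReplicaRouter:
    def __init__(self):
        self.placement = {}      # param key -> list of replica servers
        self.rotation = {}       # param key -> round-robin iterator

    def register(self, key, servers):
        self.placement[key] = list(servers)
        self.rotation[key] = cycle(servers)

    def route(self, key):
        """Return the server that should answer this pull request."""
        return next(self.rotation[key])

router = ReplicaRouter()
router.register("hot_embedding", ["ps-0", "ps-1", "ps-2"])   # 3 replicas
router.register("cold_embedding", ["ps-0"])                  # 1 replica

servers = [router.route("hot_embedding") for _ in range(6)]
# requests alternate: ps-0, ps-1, ps-2, ps-0, ps-1, ps-2
```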
### Managing Server Failures
Parameter servers are typically deployed for extended periods, enabling
training servers or inference servers to continually query and update
parameters. During this time, some parameter servers may experience
failures due to hardware issues (such as disk, memory, and processors)
or network partitions caused by network switch failures or network
misconfigurations.
To combat server failures, parameter servers can create replicas of all
parameters and distribute these replicas across different servers. This
distribution decreases the chance that these servers will fail
simultaneously. Generally, these replicas are located on servers placed
in separate racks, clusters, and data centers to further minimize risk.
### Maintaining Replica Consistency
Both training and inference servers can update a parameter replicated on
different servers. To ensure consistency amongst these replicas,
parameter servers must employ a replication protocol to coordinate
simultaneous updates on parameter replicas. A commonly utilized protocol
is the Leader-Follower replication. This protocol designates one of the
replicas as a leader and synchronizes all update operations on training
servers to this leader replica before propagating the updates to the
follower replicas.
Deciding on the leader replica and synchronizing updates between the
leader and follower replicas are enduring challenges in the field of
distributed systems. To address these challenges, industry professionals
have developed numerous robust algorithms, such as Paxos and Raft.
Moreover, striking a balance between availability and consistency when
replicating updates is another key concern. A strong-consistency
replication protocol, like chain replication, may lead to failure of the
training servers' push requests, making the parameter servers
unavailable. On the other hand, adopting a weak-consistency replication
protocol might result in replicas storing inconsistent parameters. To
counter this, recent developments have introduced weak-consistency
replication protocols like Adam and Ekko that leverage machine learning
workload characteristics to reduce the communication cost of
synchronizing replicas. For example, Microsoft's Adam protocol
introduces a two-phase commit protocol for accelerating parameter
synchronization while Ekko features a decentralized algorithm where
parameter servers can analyze the model updates based on the gradient
magnitude. Ekko further prioritizes the synchronization requests that
are more likely to affect the quality of model inference.
# Chapter Summary
1. The advent of large-scale machine learning models has sparked an
exponential increase in the need for computational power and memory,
leading to the emergence of distributed training systems.
2. Distributed training systems often utilize data parallelism, model
parallelism, or a combination of both, based on memory limitations
and computational constraints.
3. Pipeline parallelism is another technique adopted by distributed
training systems, which involves partitioning a mini-batch into
micro-batches and overlapping the forward and backward propagation
of different micro-batches.
4. Although distributed training systems usually function in compute
clusters, these networks sometimes lack the sufficient bandwidth for
the transmission of substantial gradients produced during training.
5. To meet the demand for comprehensive communication bandwidth,
machine learning clusters integrate heterogeneous high-performance
networks, such as NVLink, NVSwitch, and InfiniBand.
6. To accomplish synchronous training of a machine learning model,
distributed training systems frequently employ a range of collective
communication operators, among which the AllReduce operator is
popularly used for aggregating the gradients computed by distributed
nodes.
7. Parameter servers play a crucial role in facilitating asynchronous
training and sparse model training. Moreover, they leverage model
replication to address issues related to data hotspots and server
failures.
## Background
Throughout human history, technological progress, production relations, and the development of ethical regulations have evolved dynamically. When a new technology achieves a breakthrough in the laboratory, the resulting changes in value creation sequentially impact commodity forms, production relations, and other aspects. At the same time, once the value gains brought by new technology are recognized, the organizational forms of business logic, in their spontaneous adjustment process, also place demands on the path, content, and even pace of technological development, and adapt new ethical regulations when these demands are met. Through such interactions, technological systems and social systems resonate and co-evolve---this is what constitutes a technological revolution.
Over the past decade, driven by the cost-performance ratio of computational power and data scale surpassing critical thresholds, connectionist model architectures represented by deep neural networks and statistical learning paradigms (hereinafter referred to as deep learning) have achieved breakthrough advances in feature representation capabilities, greatly advancing the development of artificial intelligence and achieving remarkable results in many scenarios. For example, face recognition accuracy has reached over 97%, and Google's intelligent voice assistant achieved a 92.9% correct response rate in 2019 tests. In these typical scenarios, deep learning's intelligent performance has surpassed that of ordinary humans (and even experts), reaching a tipping point for technology replacement. In recent years, in domains where business logic is technology-friendly or where ethical regulations are temporarily sparse---such as security, real-time scheduling, process optimization, competitive gaming, and information feed distribution---artificial intelligence and deep learning have achieved rapid technical and commercial breakthroughs.
Having tasted success, no domain wants to miss out on the benefits of technological progress. However, when the commercial application of deep learning enters domains that are technology-sensitive and closely related to human survival or safety---such as autonomous driving, finance, healthcare, and judicial high-risk application scenarios---the existing business logic encounters resistance during technology replacement, leading to slowdowns or even failures in commercialization. The root cause is that the business logic and underlying ethical regulations of these scenarios center on stable, traceable accountability and responsibility distribution; yet the models produced by deep learning are black boxes: nothing about model behavior can be read from their structure or weights. This renders the accountability and responsibility-distribution mechanisms in these scenarios inoperative and creates technical and structural difficulties for AI in business applications.
Here are two specific examples. Example 1: in a financial risk control scenario, a deep learning model identifies a small subset of users suspected of fraud, but the business department does not dare to act directly on these results. Because people cannot understand how the results were obtained, they cannot determine whether the results are accurate; moreover, the results lack clear evidence and, if acted upon, cannot be justified to regulatory agencies.
Example 2: in the medical field, a deep learning model determines from test data that a patient has tuberculosis, but the doctor does not know how the diagnosis was reached and does not dare to adopt it directly, instead relying on their own experience, carefully reviewing the relevant test data, and then making their own judgment. These two examples demonstrate that black-box models seriously hinder the application and adoption of models in real-world scenarios.
Moreover, model interpretability has attracted national-level attention, with relevant institutions issuing related policies and regulations.
- In July 2017, the State Council issued the "New Generation Artificial Intelligence Development Plan," which for the first time encompassed explainable AI.
- In March 2021, the People's Bank of China released the financial industry standard "Evaluation Specification for Financial Applications of Artificial Intelligence Algorithms," which set explicit requirements for the interpretability of AI models in the financial industry.
- In August 2021, the Cyberspace Administration of China issued the "Provisions on the Management of Algorithmic Recommendations for Internet Information Services," proposing requirements for the interpretability of algorithmic recommendations in the internet industry.
- In September 2021, the Ministry of Science and Technology released the "Ethical Norms for New Generation Artificial Intelligence."
Therefore, from both the commercial promotion and regulatory perspectives, we need to open up the black box model and provide explanations for models. Explainable AI is precisely the technology that addresses this class of problems.
## Definition of Explainable AI
According to DARPA (Defense Advanced Research Projects Agency), as shown in :numref:`xai_concept`,
the concept of explainable AI is as follows: unlike existing AI systems, explainable AI systems can address the problems users face with black-box models, enabling users to know not only what a model decided but also why.
![Concept of Explainable AI (Image source: Broad Agency Announcement Explainable Artificial Intelligence (XAI) DARPA-BAA-16-53)](../img/ch11/xai_concept.png)
:width:`800px`
:label:`xai_concept`
However, neither academia nor industry has a unified definition of explainable AI (eXplainable AI, XAI). Here we list three typical definitions for discussion:
- Interpretability is the desire to directly understand the working mechanism of a model, breaking open the black box of artificial intelligence.
- Explainable AI provides human-readable and understandable explanations for decisions made by AI algorithms.
- Explainable AI is a set of methods that ensures humans can easily understand and trust the decisions made by AI agents.
Based on our practical experience and understanding, we define explainable AI as: a collection of techniques oriented toward machine learning (primarily deep neural networks), including visualization, data mining, logical reasoning, knowledge graphs, etc. The purpose is to use this collection of techniques to make deep neural networks exhibit a certain degree of understandability, so as to satisfy the information needs (such as causal or background information) of relevant users regarding models and application services, thereby establishing cognitive-level trust in AI services among users.
## Overview of Explainable AI Algorithms
With the emergence of the concept of explainable AI, XAI has received increasing attention from both academia and industry. The figure below shows the trend of explainable AI keywords in top academic conferences in the field of artificial intelligence. To provide readers with a holistic understanding of existing explainable AI algorithms, we summarize and categorize the types of XAI algorithms with reference to :cite:`2020tkde_li`, as shown in :numref:`XAI_methods`.
![Explainable AI (XAI) algorithm branches](../img/ch11/XAI_methods.PNG)
:width:`800px`
:label:`XAI_methods`
There are diverse methods for explaining models. Here, based on whether the explanation process introduces external knowledge beyond the dataset, we divide them into data-driven explanation methods and knowledge-aware explanation methods.
**Data-Driven Explanations**
Data-driven explanations refer to methods that generate explanations purely from the data itself, without requiring external information such as prior knowledge. To provide explanations, data-driven methods typically start by selecting a dataset (with global or local distribution). Then, the selected dataset or its variants are fed into the black-box model (in some cases, selecting a dataset is not necessary; for example, the maximum activation method proposed by :cite:`erhan2009visualizing`), and explanations are generated through certain analysis of the corresponding predictions from the black-box model (e.g., computing derivatives of predictions w.r.t. input features). Based on the scope of interpretability, these methods can be further divided into global methods or local methods---that is, whether they explain the global model behavior across all data points or the behavior of a subset of predictions. In particular, instance-based methods provide a special type of explanation---they directly return data instances as explanations. Although from the perspective of explanation scope, instance-based methods can also fit into global methods (representative samples) or local methods (counterfactuals), we list them separately to emphasize their distinctive way of providing explanations.
Global methods aim to provide an understanding of the model logic and complete reasoning for all predictions, based on a holistic view of its features, learned components, and structure. Several directions can be explored for global interpretability. For ease of understanding, we divide them into the following three subcategories:
(i) Model extraction---extracting an interpretable model from the original black-box model, for example, distilling the original black-box model into an interpretable decision tree through model distillation :cite:`frosst2017distilling` :cite:`zhang2019interpreting`, thereby using the rules in the decision tree to explain the original model;
(ii) Feature-based methods---estimating feature importance or relevance, as shown in :numref:`xai_global_feature_importance`. This type of explanation can provide statements such as "credit overdue records are the most important feature relied upon by the model," thereby helping to determine whether the model has bias. A typical global feature explanation method is SHAP (which can only output global explanations for tree models) :cite:`lundberg2017unified`;
(iii) Transparent model design---modifying or redesigning black-box models to improve their interpretability. This class of methods has gradually become a research hotspot, with recent related work including ProtoPNet :cite:`chen2019looks`, Interpretable CNN :cite:`zhang2018interpretable`, ProtoTree :cite:`nauta2021neural`, etc.
![Global feature importance explanation](../img/ch11/xai_global_feature_importance.png)
:width:`800px`
:label:`xai_global_feature_importance`
Global explanations can provide an overall understanding of the black-box model. However, due to the high complexity of black-box models, in practice it is often difficult to obtain simple transparent models with behavior similar to the original model through model extraction/design, and it is often difficult to abstract unified feature importance across the entire dataset. Furthermore, global explanations also lack local fidelity when generating explanations for individual observations, as globally important features may not accurately explain decisions for individual samples. Therefore, local methods have become an important research direction in recent years. Local methods attempt to verify the reasonableness of model behavior for individual instances or a set of instances. When focusing only on local behavior, complex models can become simple, so even simple functions can provide highly credible explanations for local regions. Based on the process of obtaining explanations, local methods can be divided into two categories: local approximation and propagation-based methods.
Local approximation generates understandable sub-models by simulating the behavior of the black-box model in the neighborhood of a sample. Compared to model extraction in global methods, local approximation only needs to focus on the neighborhood of the sample, making it easier to obtain sub-models that accurately describe local behavior. As shown in :numref:`xai_lime`, by generating $m$ data points $(x_i^\prime, f(x_i^\prime))$ for $i = 1, 2, \ldots, m$ (where $f$ is the black-box model decision function) near the data point of interest $x$, and linearly fitting these data points, we can obtain a linear model $g = \sum_{i=1}^{k} w_i x_i$, where $k$ represents the feature dimensionality of the data. The weights $w_i$ in the linear model can then be used to represent the importance of the $i$-th feature of data $x$ for model $f$.
![Example of local approximation method](../img/ch11/xai_lime.png)
:width:`800px`
:label:`xai_lime`
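The procedure above can be sketched end to end: perturb the instance of interest, query the black-box model, and fit a linear surrogate whose weights act as local feature importances. The black-box function, the uniform sampling scheme, and the plain least-squares fit are illustrative simplifications; methods such as LIME additionally weight samples by their proximity to the instance being explained.

```python
# Simplified sketch of local approximation: fit g(x) = w1*x1 + w2*x2 + b
# to black-box outputs on perturbed samples around the point (1, 1).
import random

def black_box(x1, x2):
    # behaves locally like 3*x1 - 2*x2 near (1, 1); nonlinear elsewhere
    return 3 * x1 - 2 * x2 + 0.05 * (x1 - 1) ** 3

random.seed(0)
x0 = (1.0, 1.0)                                  # instance to explain
samples = [(x0[0] + random.uniform(-0.1, 0.1),
            x0[1] + random.uniform(-0.1, 0.1)) for _ in range(200)]
X = [(x1, x2, 1.0) for x1, x2 in samples]        # trailing 1.0 = bias column
y = [black_box(x1, x2) for x1, x2 in samples]

def solve_ols(X, y):
    """Ordinary least squares via the normal equations, solved with
    Gaussian elimination; fine for this tiny illustrative problem."""
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    for col in range(n):                          # forward elimination
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n                                 # back substitution
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

w1, w2, bias = solve_ols(X, y)
# w1 is close to 3 and w2 close to -2: locally, feature 1 pushes the
# prediction up and feature 2 pushes it down.
```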
Propagation-based methods typically propagate certain information to directly locate relevant features. These methods include backpropagation-based methods and forward propagation-based methods. Backpropagation-based methods attribute the output contributions to input features through gradient backpropagation. As shown in :numref:`xai_gradient_based`, through gradient backpropagation, the gradient of the model output with respect to the input $\frac{d(f(x))}{dx}$ is computed as the model explanation. Common gradient propagation-based methods include the basic Gradient method, GuidedBackprop :cite:`zeiler2014visualizing`, GradCAM :cite:`selvaraju2017grad`, etc.
Forward propagation-based methods quantify the correlation between outputs and features by perturbing features and observing the differences in forward inference outputs. Common methods in this category include RISE :cite:`petsiuk2018rise`, ScoreCAM :cite:`wang2020score`, etc.
![Example of gradient-based method](../img/ch11/xai_gradient_based.PNG)
:width:`800px`
:label:`xai_gradient_based`
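As a minimal illustration of backpropagation-based attribution, the sketch below computes $\frac{d(f(x))}{dx}$ for a toy two-layer ReLU network (a stand-in for the black-box model, not a real vision network) and checks the result against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fixed two-layer network f(x) = w2 . relu(W1 x) stands in for
# the model to be explained; weights are random for illustration only.
W1 = rng.normal(size=(8, 4))
w2 = rng.normal(size=8)

def f(x):
    return w2 @ np.maximum(W1 @ x, 0.0)

def gradient_saliency(x):
    """d f(x) / dx via backpropagation: ReLU passes the gradient
    only through its active units."""
    mask = (W1 @ x > 0).astype(float)
    return W1.T @ (w2 * mask)

x = rng.normal(size=4)
g = gradient_saliency(x)

# Sanity check against central finite differences.
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(4)])
```

Methods such as GuidedBackprop and GradCAM refine this raw gradient signal, but the underlying backward pass is the same.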
**Knowledge-Aware Explanations**
Data-driven explanation methods can provide comprehensive explanations from datasets or the relationships between inputs and outputs. Building on this, external knowledge can also be leveraged to enrich explanations and make them more human-friendly. Laypersons without machine learning background knowledge may find it difficult to directly understand feature importance and the connections between features and targets. With external domain knowledge, we can not only generate explanations indicating feature importance, but also describe why certain features are more important than others. Therefore, knowledge-aware explainable AI methods have attracted increasing attention in recent years. Compared to raw datasets collected from multiple scenarios, knowledge is typically regarded as entities or relationships derived from human life experience or rigorous theoretical reasoning. Generally, knowledge can take many forms. It can reside in people's minds, or be recorded in natural language, audio, or rules with strict logic. To systematically review these methods, we categorize them based on knowledge sources into two types: general knowledge methods and knowledge base (KB) methods. The former uses unstructured data as a knowledge source to construct explanations, while the latter uses structured knowledge bases as the foundation for building explanations.
A relatively straightforward approach to providing knowledge is through human involvement. In fact, with the explosive growth of AI research and applications, the critical role of humans in AI systems has gradually become apparent. Such systems are called human-centered AI systems. :cite:`riedl2019human` argue that human-centered AI can not only enable AI systems to better understand humans from a sociocultural perspective, but also enable AI systems to help humans understand themselves. To achieve these goals, AI needs to satisfy several properties including interpretability and transparency.
Specifically, humans can play a role in AI systems by providing a considerable number of human-defined concepts. :cite:`kim2018interpretability` uses Concept Activation Vectors (CAV) to test the importance of concepts in classification tasks (TCAV). A CAV is a vector perpendicular to the decision boundary separating the activations of samples that contain a target concept of interest from those that do not. It can be obtained as follows: feed positive and negative samples of the target concept into the network, fit a linear classifier on the resulting intermediate-layer activations, and take the vector perpendicular to its decision boundary. Taking the "stripes" concept for "zebra" as an example, the user first collects data samples containing "stripes" and data samples not containing "stripes," feeds them into the network to obtain the intermediate-layer activations, fits a linear classifier to these activations using the positive and negative labels ($1$ for containing the concept, $0$ otherwise), and the CAV is the vector perpendicular to the resulting decision boundary.
As shown in :numref:`xai_tcav`, to compute the TCAV score, the "concept sensitivity" representing the importance of a concept at layer $l$ for class $k$ prediction can first be computed as the directional derivative $S_{C,k,l}(\mathbf{x})$:
$$\begin{split}
S_{C,k,l}(\mathbf{x}) = &\lim_{\epsilon\rightarrow 0}\frac{h_{l,k}(f_{l}(\mathbf{x})+\epsilon \mathbf{v}^{l}_{C})-h_{l,k}(f_{l}(\mathbf{x}))}{\epsilon} \\ = &\nabla h_{l,k}(f_{l}(\mathbf{x})) \cdot \mathbf{v}^{l}_{C}
\end{split}
\label{eq:TCAV_score}$$
where $f_{l}(\mathbf{x})$ is the activation at layer $l$, $h_{l,k}(\cdot)$ is the logit for class $k$, $\nabla h_{l,k}(\cdot)$ is the gradient of $h_{l,k}$
w.r.t. the activations at layer $l$. $\mathbf{v}^{l}_{C}$ is the CAV for concept $C$ that the user aims to explore. Positive (or negative) sensitivity indicates that concept $C$ has a positive (or negative) influence on the activation of the input.
Based on $S_{C,k,l}$,
TCAV can then be obtained by computing the ratio of samples of class $k$ with positive $S_{C,k,l}$'s:
$$\textbf{TCAV}_{Q_{C,k,l}}=\frac{\vert \{\mathbf{x}\in X_{k}:S_{C,k,l}(\mathbf{x})>0\}\vert}{\vert X_{k}\vert}
\label{eq:TCAV}$$
Combined with a $t$-test over TCAV scores from multiple trained CAVs, a $\textbf{TCAV}_{Q_{C,k,l}}$ significantly greater than 0.5 indicates that concept $C$ has a significant influence on class $k$.
![TCAV pipeline (Image source: :cite:`2020tkde_li`)](../img/ch11/xai_tcav.png)
:width:`800px`
:label:`xai_tcav`
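The computation of $S_{C,k,l}$ and the TCAV score can be sketched on synthetic activations. In this sketch the difference of class means stands in for the trained linear classifier that yields the CAV, and a small ReLU head stands in for the layers above $l$; all shapes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # dimensionality of the layer-l activations f_l(x)

# Synthetic concept data: positive activations shifted along a hidden direction.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
pos = rng.normal(size=(100, d)) + 2.0 * concept_dir  # contain the concept
neg = rng.normal(size=(100, d))                      # do not

# CAV v: normal of a linear separator between pos and neg; the
# difference of class means stands in for a trained linear classifier.
v = pos.mean(axis=0) - neg.mean(axis=0)
v /= np.linalg.norm(v)

# A small ReLU head h_{l,k}(a) = u . relu(V a) stands in for the layers above l.
V = rng.normal(size=(32, d))
u = rng.normal(size=32)

def grad_h(a):
    """Gradient of h_{l,k} w.r.t. the activation a."""
    return V.T @ (u * (V @ a > 0))

# Sensitivity S_{C,k,l}(x) = grad h . v; the TCAV score is the
# fraction of class-k activations with positive sensitivity.
X_k = rng.normal(size=(200, d)) + 0.5 * concept_dir
S = np.array([grad_h(a) @ v for a in X_k])
tcav = float((S > 0).mean())
```

Because the positive samples are shifted along `concept_dir`, the recovered CAV aligns closely with that hidden direction.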
Human knowledge can be subjective, while a KB can be objective. In current research, a KB is usually modeled as a Knowledge Graph (KG). The following uses the explainable recommendation model TB-Net, supported by MindSpore, as an example to explain how to build an explainable model using knowledge graphs. Knowledge graphs can capture rich semantic relationships between entities. One of TB-Net's objectives is to identify which pair of entities (i.e., item-item) has the most significant influence on the user, and through which relationships and key nodes they are connected. Unlike existing KG-embedding-based methods (e.g., RippleNet, which uses KG completion to predict paths between users and items), TB-Net extracts real paths, achieving both high accuracy and superior interpretability of recommendation results.
![TB-Net network training framework](../img/ch11/tb_net.png)
:width:`800px`
:label:`tb_net`
The framework of TB-Net is shown in :numref:`tb_net`: where $i_c$ represents the candidate item to be recommended, $h_n$ represents items that the user has interacted with in their history, $r$ and $e$ represent relations and entities in the knowledge graph, and their vectorized representations are concatenated to form relation matrices and entity matrices. First, TB-Net constructs a subgraph for user $u$ by connecting $i_c$ and $h_n$ through shared attribute values. Each pair of $i_c$ and $h_n$ is connected by a path composed of relations and entities. Then, TB-Net's bidirectional path propagation method propagates the computation of item, entity, and relation vectors from the left and right sides of the path to the middle node, computing the probability that the two directional flows converge at the same intermediate entity. This probability is used to represent the user's preference for the intermediate entity and serves as the basis for explanations. Finally, TB-Net identifies key paths (i.e., key entities and relations) in the subgraph, outputting recommendation results and explanations with semantic-level detail.
Take game recommendation as a scenario: a new game is recommended to a user, as shown in :numref:`xai_kg_recommendation`, where Half-Life, DOTA 2, Team Fortress 2, etc. are game titles. Among the relation attributes, game.year is the game release year, game.genres the game genre, game.developer the game developer, and game.categories the game categories. Among the attribute nodes, MOBA stands for Multiplayer Online Battle Arena, Valve is the Valve Corporation, Action is the action genre, Multi-player is multiplayer mode, Valve Anti-Cheat enabled denotes the Valve Anti-Cheat system, Free means free-to-play, and Cross-Platform means cross-platform support. The games on the right are games the user has played according to their history. The correctly recommended game in the test data is "Team Fortress 2."
![Steam game recommendation explainability example (Games played by user: Half-Life, DOTA 2. Correctly recommended game: "Team Fortress 2." Nodes with attribute information such as game.genres: Action, free-to-play; game.developer: Valve; game.categories:
Multiplayer, MOBA.)](../img/ch11/xai_kg_recommendation.png)
:width:`800px`
:label:`xai_kg_recommendation`
In :numref:`xai_kg_recommendation`, there are two highlighted relevance probabilities (38.6%, 21.1%), which are the probabilities of key paths being activated during the recommendation process as computed by the model. The red arrows highlight the key path from "Team Fortress 2" to the historical item "Half-Life." This shows that TB-Net can recommend items to users through various relational connections and identify key paths as explanations. Therefore, the explanation for recommending "Team Fortress 2" to the user can be translated into a fixed narrative: "Team Fortress 2 is an action, multiplayer online, shooting video game developed by game company Valve. It is highly correlated with the game Half-Life that the user has played before."
## Explainable AI Systems and Practice
As the demand for explainability grows rapidly across various domains, an increasing number of enterprises are integrating explainable AI toolkits to provide users with fast and convenient explainability solutions. The mainstream toolkits currently available in the industry include:
- TensorFlow team's What-if Tool, which allows users to explore learning models without writing any code, enabling non-developers to participate in model tuning.
- IBM's AIX360, which provides multiple explanation and measurement methods to evaluate model interpretability and trustworthiness across different dimensions.
- Facebook's Torch team's Captum, which offers multiple mainstream explanation methods for image and text scenarios.
- Microsoft's InterpretML, which allows users to train different white-box models and explain black-box models.
- SeldonIO's Alibi, which focuses on inspecting model internals and decision explanations, providing implementations of various white-box, black-box, single-sample, and global explanation methods.
- Huawei MindSpore's XAI tool, which provides data tools, explanation methods, white-box models, and measurement methods, offering users explanations at different levels (local, global, semantic-level, etc.).
This section uses the MindSpore XAI tool as an example to explain how to use explainable AI tools in practice to provide explanations for image classification models and tabular data classification models, thereby helping users understand models for further debugging and optimization.
The architecture of the MindSpore XAI tool is shown below. It is an explainability tool built on the MindSpore deep learning framework and can be deployed on Ascend and GPU devices.
![MindSpore XAI architecture diagram](../img/ch11/mindspore_xai.png)
:width:`800px`
:label:`mindspore_xai`
To use MindSpore Explainable AI, readers first need to install the MindSpore XAI package via pip (supporting MindSpore 1.7 or above, GPU and Ascend processors, recommended to use with JupyterLab):
```bash
pip install mindspore-xai
```
In the MindSpore XAI [official tutorial](https://www.mindspore.cn/xai/docs/zh-CN/r1.8/index.html), detailed instructions on how to install and use the provided explanation methods are available for readers to consult.
### MindSpore XAI Tool for Image Classification Explanation
Below is a code demonstration example combining the saliency map visualization method GradCAM, which is supported in MindSpore XAI version 1.8. Readers can refer to the [official tutorial](https://www.mindspore.cn/xai/docs/zh-CN/1.8/using_cv_explainers.html) to obtain the demo dataset, model, and complete script code.
```python
from mindspore_xai.explainer import GradCAM
# Typically specify the last convolutional layer
grad_cam = GradCAM(net, layer="layer4")
# 3 is the ID for the 'boat' class
saliency = grad_cam(boat_image, targets=3)
```
If the input is an image tensor of shape $1\times 3\times 224\times 224$, the returned saliency is a saliency-map tensor of shape $1\times 1\times 224\times 224$. Below we present several examples demonstrating how to use explainable AI capabilities to better understand the prediction results of image classification models, identify the key feature regions used as the basis for classification predictions, and thereby judge the reasonableness and correctness of the classification results to accelerate model optimization.
![Example where the prediction result is correct and the key features relied upon are reasonable](../img/ch11/correct_correct.png)
:width:`400px`
:label:`correct_correct`
In the figure above, the predicted label is "bicycle," and the explanation result shows that the key features relied upon are on the wheels, indicating that this classification judgment basis is reasonable and the model can be preliminarily deemed trustworthy.
![Example where the prediction result is correct but the key features relied upon are unreasonable](../img/ch11/correct_wrong.png)
:width:`400px`
:label:`correct_wrong`
In the figure above, one of the predicted labels is "person," which is correct. However, in the explanation, the highlighted region is on the horse's head, so the key feature basis is likely incorrect, and the reliability of this model needs further verification.
![Example where the prediction result is incorrect and the key features relied upon are unreasonable](../img/ch11/wrong_wrong.png)
:width:`400px`
:label:`wrong_wrong`
In the figure above, the predicted label is "boat," but there is no boat in the original image. Through the explanation result on the right side of the figure, we can see that the model used the water surface as the key basis for classification to arrive at the prediction "boat"---this basis is incorrect. By analyzing the subset of the training dataset labeled "boat," it was found that the vast majority of images labeled "boat" contain water surfaces, which likely caused the model to mistakenly learn water surfaces as a key feature for the "boat" class during training. Based on this finding, proportionally supplementing images with boats but without water surfaces can significantly reduce the probability of the model misjudging key features during learning.
### MindSpore XAI Tool for Tabular Classification Explanation
MindSpore XAI version 1.8 supports three commonly used tabular data model explanation methods in the industry: LIMETabular, SHAPKernel, and SHAPGradient.
Using LIMETabular as an example, it provides a locally interpretable model to explain individual samples for a complex, hard-to-explain model:
```python
from mindspore_xai.explainer import LIMETabular
# Convert features to feature statistics
feature_stats = LIMETabular.to_feat_stats(data, feature_names=feature_names)
# Initialize the explainer
lime = LIMETabular(net, feature_stats, feature_names=feature_names, class_names=class_names)
# Explain
lime_outputs = lime(inputs, targets, show=True)
```
The explainer displays the decision boundary for classifying the sample as setosa; the returned `lime_outputs` is structured data representing that boundary. Visualizing the explanation yields:
![LIME explanation result](../img/ch11/tabular.png)
:width:`400px`
:label:`tabular_lime`
The above explanation shows that for the setosa classification decision, the most important feature is petal length.
### MindSpore XAI Tool: White-Box Models
In addition to post-hoc explanation methods for black-box models, the XAI tool also provides industry-leading white-box models, enabling users to train on these white-box models so that during inference the model can simultaneously output both inference results and explanations. Taking TB-Net as an example (refer to :numref:`tb_net` and its [official tutorial](https://e.gitee.com/mind_spore/repos/mindspore/xai/tree/master/models/whitebox/tbnet) for usage), this method has been deployed commercially, providing millions of customers with semantic-level explainable financial product recommendation services. TB-Net leverages knowledge graphs to model the attributes of financial products and customers' historical data. In the graph, financial products with common attribute values are connected. The candidate product and the customer's historically purchased or browsed products are connected through common attribute values into paths, forming the customer's subgraph. Then, TB-Net performs bidirectional propagation computation on the paths in the graph to identify key products and key paths as the basis for recommendations and explanations.
An example of explainable recommendation is as follows: in the historical data, the customer has recently purchased or browsed financial products A, B, N, etc. Through TB-Net's bidirectional path propagation computation, the path (Product P, moderate-to-high annualized return, Product A) and the path (Product P, moderate risk level, Product N) are found to have high weights, making them key paths. TB-Net then outputs the following explanation: "Financial product P is recommended to this customer because its moderate-to-high annualized return and moderate risk level are consistent with financial products A and N that the customer has recently purchased or browsed."
![TB-Net application in financial wealth management scenario](../img/ch11/tbnet_finance.png)
:width:`800px`
:label:`tbnet_finance`
In addition to the explanation methods introduced above, MindSpore XAI also provides a series of measurement methods for evaluating the quality of different explanation methods, and will continue to add white-box models with built-in explanations. Users can directly adopt mature model architectures to quickly build their own explainable AI systems.
## Future of Explainable AI
To further advance research in explainable AI, we summarize several noteworthy research directions here.
First, knowledge-aware XAI still has significant room for research expansion. However, there are still many open questions regarding how to effectively leverage external knowledge. One issue is how to acquire or retrieve useful knowledge from such a vast knowledge space. For example, Wikipedia contains knowledge related to various fields, but if the goal is to solve a medical image classification problem, most Wikipedia entries are irrelevant or noisy, making it difficult to accurately find appropriate knowledge to incorporate into the XAI system.
Furthermore, the deployment of XAI systems also urgently needs a more standardized and unified evaluation framework. To build such a standardized and unified evaluation framework, we may need to simultaneously leverage different metrics that complement each other. Different metrics may be applicable to different tasks and users. A unified evaluation framework should have corresponding flexibility.
Finally, we believe that interdisciplinary collaboration will be beneficial. The development of XAI requires not only computer scientists to develop advanced algorithms, but also physicists, biologists, and cognitive scientists to unravel the mysteries of human cognition, as well as domain experts to contribute their domain knowledge.
## References
:bibliography:`../references/explainable.bib`

# Explainable AI Systems
Over the past decade, as the cost-performance of computing power and the scale of available data crossed critical thresholds, connectionist model architectures represented by deep neural networks and statistical learning paradigms (hereinafter referred to as deep learning) have achieved breakthrough advances in feature representation, greatly advancing the development of artificial intelligence and achieving remarkable results in many scenarios. For example, face recognition accuracy has reached over 97%, and Google's intelligent voice assistant achieved a 92.9% correct response rate in 2019 tests. In these typical scenarios, deep learning's performance has surpassed that of ordinary humans (and even experts), reaching a tipping point for technology replacement. In recent years, in domains where business logic is technology-friendly or where ethical regulation is still sparse---such as security, real-time scheduling, process optimization, competitive gaming, and information feed distribution---artificial intelligence and deep learning have achieved rapid technical and commercial breakthroughs.
Having tasted success, no domain wants to miss out on the benefits of technological progress. However, when the commercial application of deep learning enters domains that are technology-sensitive and closely related to human survival or safety---such as autonomous driving, finance, healthcare, and judicial high-risk application scenarios---the existing business logic encounters resistance during technology replacement, leading to slowdowns or even failures in commercialization. The root cause is that the business logic and underlying ethical regulations of these scenarios center on stable, traceable accountability and responsibility distribution; yet the models produced by deep learning are black boxes from which we cannot extract any information about model behavior from the model's structure or weights, rendering the accountability and distribution mechanisms in these scenarios inoperative and causing technical and structural difficulties for AI in business applications. Moreover, model interpretability has attracted national-level attention, with relevant institutions issuing related policies and regulations.
Therefore, from both the commercial promotion and regulatory perspectives, we need to open up the black box model and provide explanations for models. Explainable AI is precisely the technology that addresses this class of problems.
The learning objectives of this chapter include:
- Understand the goals and application scenarios of explainable AI
- Master the common types of explainable AI methods and their representative techniques
- Reflect on the future development of explainable AI methods
```toc
:maxdepth: 2
explainable_ai
```

## Horizontal Federated Learning
### Horizontal Federation in Cloud-Cloud Scenarios
In a horizontal federated learning system, multiple participants with the same data structure collaboratively build a machine learning model through a cloud server. A typical assumption is that the participants are honest while the server is honest but curious; therefore, no participant is allowed to leak raw gradient information to the server. The training process of such a system typically consists of the following four steps:
Step 1: Participants compute training gradients locally, mask selected gradients using encryption, differential privacy, or secret sharing techniques, and send the masked results to the server.
Step 2: The server performs secure aggregation without learning any participant's gradient information.
Step 3: The server sends the aggregated results back to the participants.
Step 4: Participants update their respective models using the decrypted gradients.
Compared to traditional distributed learning, federated learning faces the challenges of unstable training nodes and high communication costs. These challenges prevent federated learning from synchronizing weights across training nodes after every single training step, as traditional distributed learning does. To improve the computation-to-communication ratio and reduce the high energy consumption caused by frequent communication, Google proposed the Federated Averaging algorithm (FedAvg) in 2017 :cite:`fedavg`. :numref:`ch10-federated-learning-fedavg` illustrates the overall process of FedAvg. In each training round, clients perform multiple local training steps. The server then aggregates the weights from multiple clients and computes a weighted average.
![Federated Averaging Algorithm](../img/ch10/ch10-federated-learning-fedavg.png)
:width:`800px`
:label:`ch10-federated-learning-fedavg`
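The server-side aggregation step of FedAvg amounts to a data-size-weighted average of the client weights; a minimal sketch (the dict-of-arrays model format is an assumption for illustration):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server step of FedAvg: average client models weighted by local data size."""
    total = sum(client_sizes)
    return {name: sum((n / total) * w[name]
                      for w, n in zip(client_weights, client_sizes))
            for name in client_weights[0]}

# Two clients sharing a one-layer model; client A holds 3x the data of B.
w_a = {"w": np.array([1.0, 1.0])}
w_b = {"w": np.array([5.0, 9.0])}
global_w = fedavg([w_a, w_b], client_sizes=[300, 100])  # 0.75*w_a + 0.25*w_b
```

With clients holding 300 and 100 samples, the aggregate is $0.75\,w_a + 0.25\,w_b$, giving `[2.0, 3.0]` for this toy model.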
### Horizontal Federation in Device-Cloud Scenarios
The overall process of device-cloud federated learning is the same as cloud-cloud federated learning, but device-cloud federated learning faces additional challenges in the following three aspects:
1. High communication costs. Unlike cloud-cloud federated learning, the communication overhead in device-cloud federated learning primarily lies in the volume of data per communication round, whereas the overhead in cloud-cloud federated learning mainly lies in the frequency of communication. In device-cloud federated learning scenarios, the typical communication network may be WLAN or mobile data, where network communication speeds can be orders of magnitude slower than local computation, making high communication costs a critical bottleneck for federated learning.
2. System heterogeneity. Due to variations in client device hardware (CPU, memory), network connections (3G, 4G, 5G, WiFi), and power supply (battery level), each device in the federated learning network may have different storage, computation, and communication capabilities. Limitations of the network and the devices themselves may result in only a subset of devices being active at any given time. Furthermore, devices may encounter unexpected situations such as battery depletion or network disconnection, leading to temporary unavailability. This heterogeneous system architecture affects the formulation of the overall federated learning strategy.
3. Privacy concerns. Since clients in device-cloud federated learning cannot participate in every iteration round, the difficulty of data privacy protection is higher than in other distributed learning methods. Moreover, during the federated learning process, transmitting model update information between devices and the cloud still poses the risk of exposing sensitive information to third parties or the central server. Privacy protection becomes a critical issue that device-cloud federated learning must address.
To address the challenges posed by device-cloud federated learning, MindSpore Federated designed a distributed FL-Server architecture. The system consists of three components: the scheduler module, the server module, and the client module. The system architecture is shown in :numref:`ch10-federated-learning-architecture`. The functionalities of each module are described below:
- Federated Learning Scheduler:
The Federated Learning Scheduler (FL-Scheduler) assists in cluster networking and is responsible for issuing management tasks.
- Federated Learning Server:
The Federated Learning Server (FL-Server) provides client selection, time-limited communication, and distributed federated aggregation capabilities. The FL-Server must be capable of supporting tens of millions of device-cloud devices and supporting the access and secure processing logic of edge servers.
- Federated Learning Client:
The Federated Learning Client (FL-Client) is responsible for local data training and securely encrypts the uploaded weights when communicating with the FL-Server.
![Federated Learning System Architecture](../img/ch10/ch10-federated-learning-architecture.svg)
:label:`ch10-federated-learning-architecture`
In addition, MindSpore Federated has designed four key features for device-cloud federated learning:
1. Time-limited communication: After the FL-Server and FL-Client establish a connection, a global timer and counter are initiated. When the FL-Server receives model parameters from FL-Clients that meet a certain proportion of all initially connected FL-Clients within a preset time window, aggregation can proceed. If the proportion threshold is not reached within the time window, the system proceeds to the next iteration. This ensures that even with a massive number of FL-Clients, the entire federated learning process will not stall due to excessively long training times or disconnections of individual FL-Clients.
2. Loosely-coupled networking: An FL-Server cluster is used. Each FL-Server receives and distributes weights to a subset of FL-Clients, reducing the bandwidth pressure on any single FL-Server. Additionally, FL-Clients are supported to connect in a loosely-coupled manner. The mid-session withdrawal of any FL-Client will not affect the global task, and any FL-Client can obtain the complete data needed for training from any FL-Server at any time.
3. Encryption module: To prevent model gradient leakage, MindSpore Federated deploys multiple encryption algorithms: Local Differential Privacy (LDP), secure aggregation algorithms based on Multi-Party Computation (MPC), and Huawei's proprietary Sign-based Dimension Selection differential privacy algorithm (SignDS).
4. Communication compression module: MindSpore Federated uses quantization and sparsification techniques to compress and encode weights into smaller data formats when the FL-Server distributes model parameters and when FL-Clients upload model parameters, and decodes the compressed data back to the original format at the receiving end.
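As a sketch of the local differential privacy idea in the encryption module above, the snippet below clips a local weight update to a norm bound and adds Gaussian noise before upload, so the server never sees the raw update. The clipping bound and noise multiplier are illustrative parameters, not MindSpore Federated's actual mechanism:

```python
import numpy as np

def ldp_perturb(update, clip_norm=1.0, noise_mult=1.0, rng=None):
    """Clip a local update to a norm bound and add Gaussian noise
    before upload (a minimal LDP-style masking step)."""
    rng = rng or np.random.default_rng()
    scale = min(1.0, clip_norm / np.linalg.norm(update))
    clipped = update * scale  # norm of clipped update <= clip_norm
    noise = rng.normal(scale=noise_mult * clip_norm, size=update.shape)
    return clipped + noise

update = np.array([3.0, 4.0])  # raw local update, norm 5
noisy = ldp_perturb(update, clip_norm=1.0, rng=np.random.default_rng(0))
```

The clipping step bounds each client's influence, which is what lets the added noise translate into a formal privacy guarantee.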

# Federated Learning Systems
In this chapter, we introduce an important branch of deep learning --- federated learning and its related system knowledge. The learning objectives of this chapter include:
- Master the basic definitions of federated learning and become familiar with existing mainstream open-source federated learning frameworks.
- Understand horizontal federated learning algorithms.
- Understand vertical federated learning algorithms.
- Understand federated learning encryption algorithms.
- Understand cutting-edge federated learning algorithms and future research directions.
```toc
:maxdepth: 2
overview
horizontal_fl
vertical_fl
privacy_encryption_algorithm
outlook
summary
```

## Outlook
To achieve large-scale commercial deployment of federated learning, substantial research work is still needed. For instance, since we cannot inspect the distributed data in federated learning, it is very difficult to select model hyperparameters and configure optimizers, and we can only resort to simulation-based approaches for model tuning and testing. For deployment on mobile devices, individual users have very little labeled data, and sometimes data labels cannot even be obtained, raising the question of how federated learning can be applied to unsupervised learning. Furthermore, due to inconsistent data distributions across participants, training a single global model makes it difficult to evaluate the model's quality for each participant. Additionally, data has always been a core asset for companies, and different companies have been dedicated to collecting data and creating data silos, so how to effectively incentivize companies or institutions to participate in federated learning systems remains an open question. Below we introduce some efforts undertaken by MindSpore Federated and related work in the field.
**Federated Learning in Heterogeneous Scenarios**
The horizontal and vertical federated learning approaches discussed earlier all involve different participants collaboratively building a shared machine learning model. However, enterprise-level federated learning frameworks often need to adapt to various heterogeneous scenarios, such as data heterogeneity (inconsistent data scales and distributions across different clients), device heterogeneity (inconsistent computing capabilities and communication efficiency across different client devices), and model heterogeneity (inconsistent features learned by different local client models).
Two relatively mainstream directions of work in heterogeneous federated learning scenarios are:
1) Personalized federated learning strategies with local models that are highly robust to heterogeneous data:
Federated learning trains a global model to obtain a globally optimal solution based on all data. However, the data volume and distribution of different participants are different, and in many scenarios, the global model cannot capture the overall picture while also accommodating such differences. When one party's data deviates significantly from the overall distribution, the performance of federated learning may indeed be inferior to that of local training. How to maximize the overall benefit of all participants while also maximizing individual benefits is the goal of personalized federated learning.
Personalized federated learning does not require that all participants ultimately use the same model. For example, it allows each participant to fine-tune the model based on their own data after participating in federated learning, thereby generating a unique personalized model. After personalized fine-tuning, the model typically performs better on the local test set. Under this approach, different participants' models share the same structure but may have different parameters. Some other approaches allow all participants to share the same feature extraction layers but have different task classification layers. Another line of thinking introduces knowledge distillation into federated learning, using the global model from federated learning as the teacher model and the personalized model as the student model, which can alleviate the overfitting problem during personalization.
2) Research on model aggregation strategies for heterogeneous models:
Generally, under the FedAvg federated aggregation paradigm, fewer local iteration training steps and more frequent aggregation lead to better model convergence accuracy, especially when the data across different participating clients is non-IID. However, aggregation incurs communication costs, and there is a trade-off between communication cost and model accuracy in federated learning. Therefore, many researchers focus on designing adaptive aggregation schemes that find the optimal balance between local updates and global communication under a given training time budget to minimize the generalization error of the global model.
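The fine-tuning idea behind personalized federated learning in direction 1) can be sketched in a few lines of numpy. This is a toy illustration (plain least-squares clients, one FedAvg round), not any framework's API: each client fine-tunes the averaged global model on its own non-IID data and obtains a lower local error.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, steps=20):
    """A few gradient steps of least-squares training on one client's data."""
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
# Two clients whose data follow different true weights (non-IID setting).
true_w = [np.array([1.0, 2.0]), np.array([3.0, -1.0])]
data = []
for w_star in true_w:
    X = rng.normal(size=(100, 2))
    data.append((X, X @ w_star))

# One FedAvg round: average the locally trained weights.
w_global = np.mean([local_sgd(np.zeros(2), X, y) for X, y in data], axis=0)

# Personalization: each client fine-tunes the global model on its own data.
for X, y in data:
    w_pers = local_sgd(w_global, X, y)
    mse_global = np.mean((X @ w_global - y) ** 2)
    mse_pers = np.mean((X @ w_pers - y) ** 2)
    assert mse_pers <= mse_global  # fine-tuned model fits the local data better
```

The fine-tuned models share the global model's structure but end up with different parameters per client, matching the first personalization approach described above.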
**Communication Efficiency Improvement**
In the federated learning process, during each global training round, every participant needs to send the complete parameters to the server, and the server then distributes the aggregated parameters. Modern deep learning networks easily have millions or even more parameters, and transmitting such a large number of parameters incurs enormous communication overhead. To reduce communication overhead, MindSpore Federated has adopted several methods to improve communication efficiency:
1) Intelligent frequency adjustment strategy: Improve federated learning efficiency by changing the number of global model aggregation rounds, reducing the communication overhead required for training tasks to converge. One intuition is that in the early stages of the federated learning process, parameter changes across different participants are relatively consistent, so setting a lower aggregation frequency can reduce communication costs; in the later stages of federated learning, parameter changes across different participants become more inconsistent, so setting a higher aggregation frequency can enable the model to converge quickly.
2) Communication compression scheme: Quantize and sparsify weight differences, i.e., only upload a small portion of quantized weight differences in each communication round. The reason for choosing weight differences for quantization and sparsification is that their distribution is easier to fit than weight values, and they have higher sparsity. Quantization maps float32 data types to int8 or even lower-bit representations, reducing storage and communication overhead on one hand, and enabling better use of compression encoding methods for transmission on the other (such as Huffman coding, finite state entropy coding, etc.). A commonly used sparsification method is Top-K sparsification, which sorts gradients by absolute value in ascending order and only uploads the top k parameters per round. Communication compression schemes generally incur some accuracy loss, and selecting an appropriate k is a challenging problem.
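The compression scheme in item 2) can be sketched as follows. This is an illustrative numpy implementation of Top-K sparsification plus uniform int8 quantization of a weight difference (the function names and the uniform quantizer are assumptions for demonstration, not MindSpore Federated's actual scheme):

```python
import numpy as np

def compress_update(diff, k, n_bits=8):
    """Keep the k largest-magnitude entries of a weight difference,
    then uniformly quantize the kept values to n_bits integers."""
    flat = diff.ravel()
    idx = np.argsort(np.abs(flat))[-k:]       # Top-K by absolute value
    values = flat[idx]
    lo, hi = values.min(), values.max()
    scale = (hi - lo) / (2 ** n_bits - 1) if hi > lo else 1.0
    q = np.round((values - lo) / scale).astype(np.uint8)
    return idx, q, lo, scale                  # the only data sent uplink

def decompress_update(shape, idx, q, lo, scale):
    """Server side: rebuild a sparse weight difference from the payload."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = q.astype(np.float64) * scale + lo
    return flat.reshape(shape)

rng = np.random.default_rng(0)
diff = rng.normal(size=(64, 64))              # 4096 float32-sized parameters
idx, q, lo, scale = compress_update(diff, k=100)
restored = decompress_update(diff.shape, idx, q, lo, scale)
assert np.count_nonzero(restored) <= 100      # only k entries travel uplink
```

The uplink payload shrinks from 4096 floats to 100 bytes of quantized values plus indices and two scalars, at the cost of the quantization and sparsification error discussed above.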
**Federated Ecosystem**
In the preceding chapters, we introduced some technologies and practices in the field of privacy-preserving federated learning. However, as exploration deepens, the field of federated learning has become increasingly inclusive, encompassing machine learning, model compression and deployment, information security, encryption algorithms, game theory, and more. As more and more companies, universities, and institutions become involved, federated learning today is no longer merely a technical solution but a privacy-preserving ecosystem. For example, different participants wish to join the federated learning process in a sustainable manner, and there are questions about how to design incentive mechanisms to ensure that profits can be shared relatively fairly among participants while effectively deterring participants who engage in malicious attacks or destructive behavior.
Furthermore, as more laws and regulations on user data privacy protection and proper use are being introduced, establishing technical standards for federated learning has become increasingly important. Such standards can build a bridge between legal regulators and technical developers, letting enterprises know which technologies to adopt in order to better share information while complying with regulations.
At the end of 2020, the international standard for federated learning (IEEE P3652.1), approved by the IEEE Standards Committee, was officially published and implemented. This standard aims to provide guidelines for building federated learning architectures and applications, with main content including: descriptions and definitions of federated learning, scenario requirement classification and security evaluation, quantification of personalized metrics for federated learning evaluation, and requirements for joint governance. This is also the first international standard established for AI collaborative technology frameworks, marking the beginning of a new chapter for large-scale industrial application of federated learning.

## Overview
With the rapid development of artificial intelligence, large-scale and high-quality data has become increasingly important for model performance and user experience. At the same time, data utilization has become a bottleneck constraining the further development of AI. Issues related to privacy, regulation, and engineering have prevented data sharing between devices, leading to the emergence of data silos. To address this challenge, Federated Learning (FL) was proposed. The concept of federated learning was first introduced in 2016. Under the requirements of user privacy protection, data security, and government regulations, federated learning enables effective machine learning modeling using data from multiple parties.
### Definition
The core principle of federated learning is that data stays in place while the model moves. Clearly, centralizing data from all parties would fail to protect user privacy and would violate relevant laws and regulations. Federated learning allows the model to "move" across data holders, thereby enabling modeling without data leaving the local device. In federated learning, each party's data remains local, and a machine learning model is built by exchanging encrypted parameters or other intermediate information through a central server.
### Application Scenarios
In practical application scenarios, federated learning can be categorized based on the overlap of samples and features into horizontal federated learning (different samples, overlapping features), vertical federated learning (different features, overlapping samples), and federated transfer learning (neither samples nor features overlap).
**Horizontal federated learning** is suitable for scenarios where different participants possess the same features but different individuals. For example, in an advertising recommendation scenario, algorithm developers use data with the same features (click counts, dwell time, usage frequency, etc.) from different mobile phone users to build models. Since these feature data cannot leave the device, horizontal federated learning is used to jointly build models from multiple users' feature data.
**Vertical federated learning** is suitable for scenarios with substantial sample overlap but little feature overlap. For example, consider two different institutions: an insurance company and a hospital. Their user bases are likely to include most residents of the area, so the intersection of their users may be large. However, since the insurance company records users' financial behavior and credit ratings while the hospital holds users' disease and medication records, their feature intersection is small. Vertical federated learning aggregates these different features in an encrypted state to enhance model capability.
**Federated transfer learning** focuses on finding similarities between the source domain and the target domain. For example, consider two different institutions: a bank located in China and an e-commerce company located in the United States. Due to geographical limitations, the user base intersection of these two institutions is very small. Meanwhile, due to the different types of institutions, their data features also have only a small overlap. In this case, to conduct effective federated learning, transfer learning must be introduced. Federated transfer learning can address problems of small data scale on a single side and insufficient labeled samples, thereby improving model performance.
### Deployment Scenarios
Federated learning is architecturally very similar to the parameter server approach (data center distributed learning), both employing a centralized server and decentralized clients to collaboratively build a machine learning model. Furthermore, depending on the deployment scenario, federated learning can be further divided into cross-silo and cross-device federated learning. Generally, cross-silo federated learning involves users at the enterprise or organizational level, while cross-device federated learning targets portable electronic devices and mobile devices. :numref:`ch10-federated-learning-different-connection` illustrates the differences and connections among the three approaches:
![Differences and connections among data center distributed training, cross-silo federated learning, and cross-device federated learning](../img/ch10/ch10-federated-learning-different-connection.png)
:width:`800px`
:label:`ch10-federated-learning-different-connection`
### Common Frameworks
As the demand for federated learning technology from users and developers continues to grow, the number of federated learning tools and frameworks has also been increasing. Below we introduce some mainstream federated learning frameworks.
[TFF](https://www.tensorflow.org/federated) (TensorFlow Federated) is an open-source federated learning framework led by Google for machine learning and other computations on decentralized data. TFF was developed to facilitate open research and experimentation in federated learning. It trains shared global models among many participating clients who keep their training data locally. For example, federated learning has been used to train prediction models for mobile keyboards without uploading sensitive typing data to a server.
[PaddleFL](https://paddlefl.readthedocs.io/en/latest/index.html) is an open-source federated learning framework based on PaddlePaddle, proposed by Baidu. Researchers can easily replicate and compare different federated learning algorithms using PaddleFL, and developers can readily deploy PaddleFL federated learning systems in large-scale distributed clusters. PaddleFL provides various federated learning strategies (horizontal federated learning, vertical federated learning) and their applications in computer vision, natural language processing, recommendation algorithms, and other domains. Additionally, PaddleFL offers applications for traditional machine learning training strategies, such as multi-task learning and transfer learning in federated learning environments. Leveraging PaddlePaddle's large-scale distributed training capabilities and Kubernetes' elastic scheduling of training tasks, PaddleFL can be easily deployed on a fully open-source software stack.
[FATE](https://fate.fedai.org) (Federated AI Technology Enabler), proposed by WeBank, is the world's first industrial-grade open-source federated learning framework that enables enterprises and institutions to collaborate on data while ensuring data security and privacy. The FATE project uses Secure Multi-Party Computation (MPC) and Homomorphic Encryption (HE) technologies to build underlying secure computation protocols, supporting secure computation for various types of machine learning, including logistic regression, tree-based algorithms, deep learning, and transfer learning. FATE was first open-sourced in February 2019, and the FATE community was established. Community members include major domestic cloud computing and financial services companies.
[FedML](https://FedML.ai) is an open-source federated learning research and benchmark library led by the University of Southern California (USC), which facilitates the development of new federated learning algorithms and fair performance comparison. FedML supports three computing paradigms (distributed training, on-device training, and standalone simulation) for users to experiment in different system environments. FedML also facilitates diverse algorithmic research through flexible and general API design and reference baseline implementations. To enable fair comparison of various federated learning algorithms, FedML has set up comprehensive benchmark datasets, including non-independent and identically distributed (non-IID) datasets.
[PySyft](https://openmined.github.io/PySyft/index.html) is a secure and private deep learning Python library released by University College London (UCL), DeepMind, and OpenMined, encompassing federated learning, differential privacy, and multi-party learning. PySyft uses differential privacy and encrypted computation (MPC and HE) to decouple private data from model training.
[Fedlearner](https://github.com/bytedance/fedlearner) is a vertical federated learning framework proposed by ByteDance that allows joint modeling on data distributed across institutions. Fedlearner comes with surrounding infrastructure for cluster management, job management, job monitoring, and network proxying. Fedlearner adopts a cloud-native deployment approach and stores data in HDFS. Fedlearner manages and launches tasks through Kubernetes. Both participating parties of each Fedlearner task need to simultaneously launch training tasks through Kubernetes, with a Master node uniformly managing multiple training tasks and Workers handling communication.
[OpenFL](https://openfl.readthedocs.io/en/latest/index.html) is a Python framework for federated learning proposed by Intel. OpenFL aims to be a flexible, extensible, and easy-to-learn tool for data scientists.
[Flower](https://flower.dev) is an open-source federated learning system released by the University of Cambridge, primarily optimized for deploying federated learning algorithms on large-scale, heterogeneous devices.
[MindSpore Federated](https://www.mindspore.cn/en) is an open-source federated learning framework proposed by Huawei, supporting commercial deployment on tens of millions of stateless terminal devices, enabling full-scenario intelligent applications while keeping user data local. MindSpore Federated focuses on application scenarios of horizontal federated learning with large-scale participants, enabling users participating in federated learning to collaboratively build AI models without sharing local data. MindSpore Federated primarily addresses challenges in deploying federated learning in industrial scenarios, including privacy security, large-scale federated aggregation, semi-supervised federated learning, communication compression, and cross-platform deployment.

## Privacy Encryption Algorithms
During the federated learning process, user data is only used for training on local devices and does not need to be uploaded to the central FL-Server. This can prevent the direct leakage of users' personal data. However, in the federated learning framework, uploading model weights to the cloud in plaintext still poses the risk of indirectly leaking user privacy. After obtaining the plaintext weights uploaded by users, adversaries can recover users' personal training data through reconstruction, model inversion, and other attacks, leading to user privacy leakage.
The MindSpore Federated framework provides secure aggregation algorithms based on Local Differential Privacy (LDP), Multi-Party Computation (MPC), and Huawei's proprietary Sign-based Dimension Selection differential privacy algorithm (SignDS), which add noise or perturbation to the local model weights before uploading them to the cloud. These algorithms address the privacy leakage problem in federated learning while ensuring model usability.
### LDP-Based Secure Aggregation
Differential privacy is a mechanism for protecting user data privacy. Differential privacy is defined as:
$$
\mathrm{Pr}[\mathcal{K}(D)\in S] \le e^{\epsilon} \mathrm{Pr}[\mathcal{K}(D') \in S]+\delta
$$
For two datasets $D$ and $D'$ that differ by only one record, the probability that the output of a randomized algorithm $\mathcal{K}$ falls within any subset $S$ of the output space satisfies the above formula. $\epsilon$ is the differential privacy budget, $\delta$ is the perturbation parameter, and smaller values of $\epsilon$ and $\delta$ indicate that the output distributions of $\mathcal{K}$ on $D$ and $D'$ are closer.
In federated learning, suppose the model weight matrix after local training on an FL-Client is $W$. Since the model "memorizes" the characteristics of the training set during training, an adversary can use $W$ to reconstruct the user's training dataset.
MindSpore Federated provides an LDP-based secure aggregation algorithm to prevent privacy data leakage when local model weights are uploaded to the cloud.
The FL-Client generates a differential noise matrix $G$ with the same dimensions as the local model weight matrix $W$, and then adds the two together to obtain a weight matrix $W_p$ that satisfies the differential privacy definition:
$$
W_p=W+G
$$
The FL-Client uploads the noisy model weight matrix $W_p$ to the cloud-side FL-Server for federated aggregation. The noise matrix $G$ essentially adds a layer of masking to the original model, reducing the risk of the model leaking sensitive data while also affecting the convergence of model training. How to achieve a better balance between model privacy and usability remains an open research question. Experiments show that when the number of participants $n$ is sufficiently large (generally above 1000), most noise can cancel each other out, and the local differential privacy mechanism has no significant impact on the accuracy and convergence of the aggregated model.
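The noise-cancellation effect can be checked numerically. In this minimal sketch every client holds the same weights $W$ for illustration, each adds independent Gaussian noise $G$, and averaging over a large number of clients recovers $W$ closely:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, dim = 2000, 10
w = rng.normal(size=dim)                 # local weights W (same for all, for clarity)

# Each FL-Client uploads W_p = W + G with independent zero-mean Gaussian noise G.
noisy_uploads = [w + rng.normal(scale=0.5, size=dim) for _ in range(n_clients)]

# Cloud-side aggregation: the zero-mean noise largely cancels in the average.
aggregated = np.mean(noisy_uploads, axis=0)
assert np.max(np.abs(aggregated - w)) < 0.1
```

The residual error of the mean scales as $\sigma/\sqrt{n}$, which is why the text recommends on the order of 1000 or more participants for the LDP mechanism to have little impact on accuracy.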
### MPC-Based Secure Aggregation
Although differential privacy technology can adequately protect user data privacy, when the number of participating FL-Clients is small or the Gaussian noise amplitude is large, model accuracy can be significantly affected. To simultaneously satisfy both model protection and model convergence requirements, MindSpore Federated provides an MPC-based secure aggregation scheme.
In this training mode, suppose the set of participating FL-Clients is $U$. For any FL-Client $u$ and $v$, they negotiate a pair of random perturbations $p_{uv}$ and $p_{vu}$ that satisfy
$$
\label{puv}
p_{uv}=
\begin{cases}
-p_{vu}, &u{\neq}v\\
0, &u=v
\end{cases}
$$
Thus, each FL-Client $u$ adds the perturbations negotiated with other users to its original model weights $x_u$ before uploading them to the FL-Server:
$$
x_{encrypt}=x_u+\sum\limits_{v{\in}U}p_{uv}
$$
Consequently, the FL-Server aggregation result $\overline{x}$ is:
$$
\label{eq:juhejieguo}
\overline{x}=\sum\limits_{u{\in}U}(x_{u}+\sum\limits_{v{\in}U}p_{uv})=\sum\limits_{u{\in}U}x_{u}+\sum\limits_{u{\in}U}\sum\limits_{v{\in}U}p_{uv}=\sum\limits_{u{\in}U}x_{u}
$$
The above process only introduces the main idea of the aggregation algorithm. The MPC-based aggregation scheme is lossless in terms of accuracy, at the cost of additional communication rounds.
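The pairwise-mask cancellation in the aggregation formula can be verified numerically. The sketch below draws the perturbations directly from a random generator for clarity; in practice the pairs $p_{uv}$, $p_{vu}$ are negotiated through a key-agreement protocol rather than exchanged in the clear:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 4, 5
x = rng.normal(size=(n, dim))            # each FL-Client's true model weights

# Negotiated perturbations satisfying p[u][v] = -p[v][u] and p[u][u] = 0.
p = np.zeros((n, n, dim))
for u in range(n):
    for v in range(u + 1, n):
        p[u, v] = rng.normal(size=dim)
        p[v, u] = -p[u, v]

# Each client uploads x_encrypt = x_u + sum_v p_uv.
uploads = x + p.sum(axis=1)

# Individual uploads are masked, yet the server-side sum is exact (lossless).
assert not np.allclose(uploads, x)
assert np.allclose(uploads.sum(axis=0), x.sum(axis=0))
```

Antisymmetry makes the double sum $\sum_u\sum_v p_{uv}$ vanish, which is exactly why this scheme is lossless in accuracy while hiding each individual upload.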
### LDP-SignDS Algorithm-Based Secure Aggregation
For the previous dimension-wise noise-adding LDP algorithm, the noise scale added to each dimension is essentially proportional to the number of model parameters. Therefore, for high-dimensional models, a very large number of participants may be needed to mitigate the impact of noise on model convergence. To address this "dimension dependence" issue, MindSpore Federated further provides the **Sign-based Dimension Selection (SignDS)** :cite:`jiang2022signds` algorithm based on dimension selection.
The main idea of the SignDS algorithm is as follows: for each true local update $\Delta\in\mathbb{R}^{d}$, the FL-Client first selects a small subset of the most significantly updated dimensions to construct a Top-K set $S_k$, and then selects a dimension set $J$ based on this to return to the FL-Server. The FL-Server constructs a corresponding sparse update $\Delta^\prime$ based on the dimension set $J$ and aggregates all sparse updates to update the global model. Since local model updates are correlated with local data information, directly selecting the true largest update dimensions may lead to privacy leakage. To address this, the SignDS algorithm provides privacy guarantees in two aspects. On one hand, the algorithm uses an Exponential Mechanism (EM :cite:`mcsherry2007mechanism`)-based dimension selection algorithm **EM-MDS**, ensuring that the selected dimension set satisfies strict $\epsilon$-LDP guarantees; on the other hand, when constructing sparse updates, a constant value is assigned to the selected dimensions instead of directly using the actual update values, ensuring that the sparse updates are no longer directly correlated with local data. Since the dimension selection satisfies $\epsilon$-LDP and the update values assigned to the selected dimensions are independent of local data, by the post-processing property of differential privacy :cite:`dwork2014algorithmic`, the constructed sparse updates also satisfy $\epsilon$-LDP guarantees. **Compared to the previous dimension-wise noise-adding LDP algorithm, the SignDS algorithm can significantly improve training accuracy for high-dimensional models. Moreover, since FL-Clients only need to upload a small subset of dimension values rather than all model weights, the uplink communication volume of federated learning is also greatly reduced.**
Below, we provide detailed introductions to the construction of the Top-K set $S_k$ and the EM-MDS dimension selection algorithm.
First, since actual update values can be positive or negative, directly assigning the same constant value to all selected dimensions may significantly change the model update direction and affect model convergence. To solve this problem, SignDS proposes a sign-based Top-K set construction strategy. Specifically, the algorithm introduces an additional sign variable $s\in\{-1,1\}$. This variable is randomly sampled with equal probability by the FL-Client and is used to determine the Top-K set $S_k$ of the local update $\Delta$. If $s=1$, we sort $\Delta$ by **actual update values** and record the $k$ dimensions with the **largest** updates as $S_k$. We further randomly select a subset of dimensions from $S_k$ and use $s=1$ as the update value for these dimensions to construct the sparse update. Intuitively, the update values of dimensions in $S_k$ are likely to be greater than zero. Therefore, assigning $s=1$ to the selected dimensions will not cause a large deviation in the model update direction, thereby mitigating the impact on model accuracy. Similarly, when $s=-1$, we select the $k$ dimensions with the **smallest** updates as $S_k$ and assign $s=-1$ to the selected dimensions.
Next, we further introduce the EM-MDS algorithm for dimension selection. In brief, the purpose of the EM-MDS algorithm is to randomly select a dimension set $J\in\mathcal{J}$ from the output dimension domain $\mathcal{J}$ with a certain probability $\mathcal{P}$, where different dimension sets correspond to different probabilities. We assume that $J$ contains a total of $h$ dimensions, of which $\nu$ dimensions belong to the Top-K set (i.e., $|S_k \cap J|=\nu$, where $\nu\in[0,h]$), and the other $h-\nu$ dimensions belong to the non-Top-K set. Intuitively, the larger $\nu$ is, the more Top-K dimensions $J$ contains, and the better the model convergence. Therefore, we want to assign higher probabilities to dimension sets with larger $\nu$. Based on this idea, we define the score function as:
$$
u(S_{k}, J) = 𝟙(|S_k\cap J| \geq \nu_{th}) = 𝟙(\nu \geq \nu_{th})
$$
:eqlabel:`score_function`
$u(S_{k}, J)$ measures whether the number of Top-K dimensions contained in the output dimension set $J$ exceeds a certain threshold $\nu_{th}$ ($\nu_{th}\in[1,h]$): it equals 1 if exceeded, and 0 otherwise. Furthermore, the sensitivity of $u(S_{k}, J)$ can be computed as:
$$
\phi = \max_{J\in\mathcal{J}} ||u(S_{k}, J) - u(S^\prime_{k}, J)||= 1 - 0 = 1
$$
:eqlabel:`sensitivity`
Note that :eqref:`sensitivity` holds for any pair of different Top-K sets $S_k$ and $S_k^\prime$.
Based on the above definitions, the EM-MDS algorithm is described as follows:
*Given the Top-K set $S_k$ of the true local update $\Delta\in\mathbb{R}^{d}$ and the privacy budget $\epsilon$, the sampling probability of the output dimension set $J\in\mathcal{J}$ is:*
$$
\mathcal{P}=\frac{\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S_{k}, J))}{\sum_{J^\prime\in\mathcal{J}}\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S_{k}, J^\prime))}
=
\frac{\mathrm{exp}(\epsilon\cdot 𝟙(\nu \geq \nu_{th}))}{\sum_{\tau=0}^{\tau=h}\omega_{\tau}\cdot \mathrm{exp}(\epsilon\cdot 𝟙(\tau\geq\nu_{th}))}
=
\frac{\mathrm{exp}(\epsilon\cdot 𝟙(\nu \geq \nu_{th}))}{\sum_{\tau=0}^{\tau=\nu_{th}-1}\omega_{\tau} + \sum_{\tau=\nu_{th}}^{\tau=h}\omega_{\tau}\cdot \mathrm{exp}(\epsilon)}
$$
:eqlabel:`emmds`
*where $\nu$ is the number of Top-K dimensions contained in $J$, $\nu_{th}$ is the score function threshold, $J^\prime$ is any output dimension set, and $\omega_{\tau}=\binom{k}{\tau}\binom{d-k}{h-\tau}$ is the number of all sets containing $\tau$ Top-K dimensions.*
We further provide the privacy proof of the EM-MDS algorithm:
For each FL-Client, given a randomly sampled sign value $x$, let the Top-K sets of any two local updates $\Delta$ and $\Delta^\prime$ be denoted as $S_k$ and $S_k^\prime$. For any output dimension set $J\in\mathcal{J}$, let $\nu=|S_k \cap J|$ and $\nu^\prime=|S_k^\prime \cap J|$ be the intersection sizes of $J$ with the two Top-K dimension sets. According to :eqref:`emmds`, the following inequality holds:
$$
\frac{\mathrm{Pr}[J|\Delta]}{\mathrm{Pr}[J|\Delta^\prime]} = \frac{\mathrm{Pr}[J|S_{k}]}{\mathrm{Pr}[J|S^\prime_{k}]} = \frac{\frac{\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S_{k}, J))}{\sum_{J^\prime\in\mathcal{J}}\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S_{k}, J^\prime))}}{\frac{\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S^\prime_{k}, J))}{\sum_{J^\prime\in\mathcal{J}}\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S^\prime_{k}, J^\prime))}} \\
= \frac{\frac{\mathrm{exp}(\epsilon\cdot 𝟙(\nu \geq \nu_{th}))}{\sum_{\tau=0}^{\tau=h}\omega_{\tau}\cdot \mathrm{exp}(\epsilon\cdot 𝟙(\tau\geq\nu_{th}))}}{\frac{\mathrm{exp}(\epsilon\cdot 𝟙(\nu^\prime \geq \nu_{th}))}{\sum_{\tau=0}^{\tau=h}\omega_{\tau}\cdot \mathrm{exp}(\epsilon\cdot 𝟙(\tau\geq\nu_{th}))}} \\
= \frac{\mathrm{exp}(\epsilon\cdot 𝟙(\nu \geq \nu_{th}))}{\mathrm{exp}(\epsilon\cdot 𝟙(\nu^\prime \geq \nu_{th}))}
\leq \frac{\mathrm{exp}(\epsilon\cdot 1)}{\mathrm{exp}(\epsilon\cdot 0)} = \mathrm{exp}(\epsilon)
$$
*This proves that the EM-MDS algorithm satisfies the $\epsilon$-LDP guarantee.*
It is worth noting that computing :eqref:`emmds` requires first determining the Top-K dimension count threshold $\nu_{th}$. To this end, we first derive the probability distribution and expectation of the number of Top-K dimensions contained in any output dimension set $J$ given the threshold $\nu_{th}$:
$$
\mathrm{Pr}(\nu=\tau|\nu_{th})=
\begin{cases}
\omega_{\tau} / \Omega &\text{if}\quad \tau\in[0,\nu_{th}) \\
\omega_{\tau}\cdot\mathrm{exp}(\epsilon) / \Omega \quad &\text{if}\quad \tau\in[\nu_{th},h]
\end{cases}
$$
:eqlabel:`discrete-prob`
$$
\mathbb{E}[\nu|\nu_{th}] = \sum_{\tau=0}^{\tau=h}\tau\cdot \mathrm{Pr}(\nu=\tau|\nu_{th})
$$
:eqlabel:`expectation`
Here, $\Omega$ is the denominator part of $\mathcal{P}$ in :eqref:`emmds`. Intuitively, the higher $\mathbb{E}[\nu|\nu_{th}]$ is, the greater the probability that the randomly sampled set $J$ contains Top-K dimensions, and thus the better the model utility. Therefore, we determine the threshold that maximizes $\mathbb{E}[\nu|\nu_{th}]$ as the target threshold $\nu_{th}^{*}$, i.e.:
$$
\nu_{th}^{*} = \underset{\nu_{th}\in[1, h]}{\operatorname{argmax}} \mathbb{E}[\nu|\nu_{th}]
$$
:eqlabel:`threshold`
Finally, we describe the detailed workflow of the SignDS algorithm in :numref:`signds_workflow`. Given a local model update $\Delta$, we first randomly sample a sign value $s$ and construct the Top-K set $S_k$. Next, we determine the threshold $\nu_{th}^{*}$ according to :eqref:`threshold` and select the output set $J$ following the probability defined in :eqref:`emmds`. Considering that the output domain $\mathcal{J}$ contains $\binom{d}{k}$ possible dimension sets, directly sampling a combination from $\mathcal{J}$ with a certain probability would require very high computational and space costs. Therefore, we adopt an inverse sampling algorithm to improve computational efficiency. Specifically, we first sample a random value $\beta\sim U(0,1)$ from the standard uniform distribution, and determine the number of Top-K dimensions $\nu$ in the output dimension set based on the cumulative distribution function $CDF_{\tau}$ of $\mathrm{Pr}(\nu=\tau|\nu_{th})$ in :eqref:`discrete-prob`. Finally, we randomly select $\nu$ dimensions from the Top-K set $S_k$ and randomly sample $h-\nu$ dimensions from the non-Top-K set to construct the final output dimension set $J$.
![SignDS Workflow](../img/ch10/ch10-federated-learning-signds.PNG)
:width:`800px`
:label:`signds_workflow`
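The SignDS workflow above can be sketched in numpy. This is a toy illustration under assumed parameters ($d=100$, $k=10$, $h=5$, $\epsilon=2$) and is not the MindSpore Federated implementation; it covers sign sampling, Top-K construction, the threshold search of :eqref:`threshold`, inverse sampling of $\nu$, and assembly of the sparse update:

```python
import numpy as np
from math import comb

def signds_sample(delta, k, h, eps, rng):
    """One client-side SignDS step: returns a sparse update with constant values."""
    d = delta.size
    s = rng.choice([-1, 1])                       # random sign variable
    order = np.argsort(delta)                     # ascending by actual value
    topk = set(order[-k:]) if s == 1 else set(order[:k])

    # omega_tau = C(k, tau) * C(d - k, h - tau), tau = 0..h
    omega = np.array([comb(k, t) * comb(d - k, h - t) for t in range(h + 1)])
    taus = np.arange(h + 1)

    def expected_nu(nth):                         # E[nu | nu_th]
        w = omega * np.where(taus >= nth, np.exp(eps), 1.0)
        return (taus * w).sum() / w.sum()

    nth = max(range(1, h + 1), key=expected_nu)   # threshold maximizing E[nu|nu_th]

    # Inverse sampling of nu from Pr(nu = tau | nu_th).
    w = omega * np.where(taus >= nth, np.exp(eps), 1.0)
    nu = int(rng.choice(h + 1, p=w / w.sum()))

    # nu dimensions from the Top-K set, h - nu from the non-Top-K set.
    inside = rng.choice(sorted(topk), size=nu, replace=False)
    outside = rng.choice(sorted(set(range(d)) - topk), size=h - nu, replace=False)
    J = np.concatenate([inside, outside]).astype(int)

    sparse = np.zeros(d)
    sparse[J] = s                                 # constant value, not real updates
    return sparse

rng = np.random.default_rng(0)
sparse = signds_sample(rng.normal(size=100), k=10, h=5, eps=2.0, rng=rng)
assert np.count_nonzero(sparse) == 5              # only h dimensions are uploaded
```

Only the $h$ selected indices and the sign reach the FL-Server, which is the source of the uplink savings noted earlier.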

## Summary
In this chapter, we briefly introduced the background, system architecture, federated averaging algorithm, privacy encryption algorithms, and practical deployment challenges of federated learning. Federated learning is an emerging artificial intelligence paradigm that can build effective machine learning models under the two major constraints of "data protection" and "data silos." Furthermore, due to the unique characteristics of federated learning scenarios (local data not being uploaded, high security and privacy requirements, and non-IID data distributions), the development of systems and algorithms becomes more challenging: how to balance computation and communication overhead, how to ensure the model does not leak privacy, and how algorithms can converge under non-IID scenarios. These challenges require developers to have a deeper understanding of practical federated learning scenarios.

## Vertical Federated Learning
Now we introduce another type of federated learning algorithm: Vertical Federated Learning. In vertical federated learning, the participating parties possess data with the same sample space but different feature spaces. They perform secure joint modeling using shared sample data, which has broad applications in fields such as finance and advertising. Compared to horizontal federated learning, vertical federated learning requires participants to collaboratively complete data intersection, joint model training, and joint model inference. Moreover, the more participants involved, the higher the complexity of the vertical federated learning system.
Below, we use a two-party example with Enterprise A and Enterprise B to introduce the basic architecture and workflow of vertical federated learning. Suppose Enterprise A has both feature data and label data and can build models independently; Enterprise B has feature data but lacks label data and thus cannot build models independently. Due to privacy regulations and industry standards, data between the two enterprises cannot be directly shared. Enterprise A and Enterprise B can adopt a vertical federated learning solution to collaborate: data stays local, and both parties use their shared sample data for joint modeling and training. Ultimately, both parties obtain a more powerful model.
### Vertical Federation Architecture
![Vertical Federated Two-Party Architecture](../img/ch10/ch10-federated-learning-vfl-arch.svg)
:width:`800px`
:label:`federated-learning-vfl-arch`
Model training in a vertical federated learning system generally consists of the following phases:
- Sample alignment: First, align the sample data with the same ID (Identification) across Enterprise A and Enterprise B. During the data alignment phase, the system employs encryption algorithms to protect the data, ensuring that neither party's user data is exposed.
- Joint training: After determining the shared user data between Enterprise A and Enterprise B, this shared data can be used to collaboratively train a business model. During the model training process, model parameter information is transmitted in an encrypted manner. The trained federated learning model can be deployed across all participating parties in the federated learning system.
### Sample Alignment
Private Set Intersection (PSI) technology is a commonly used solution for data sample alignment in vertical federated learning. There are multiple PSI implementation approaches in the industry: circuit-based, public-key encryption-based, oblivious transfer protocol-based, and fully homomorphic encryption-based. Different PSI approaches have their own advantages and disadvantages. For example, public-key encryption-based approaches do not require an auxiliary server to run but incur high computational overhead for public-key encryption; while oblivious transfer-based approaches have high computational performance but incur large communication overhead. Therefore, in specific applications, the best balance among functionality, performance, and security should be chosen based on the actual scenario.
RSA blind signature is a classic PSI method based on public-key encryption and is one of the widely adopted technologies in current vertical federated learning systems. Below, we describe the basic workflow of the RSA blind signature algorithm using Enterprise A and Enterprise B as an example.
![Vertical Federated Sample Alignment](../img/ch10/ch10-federated-learning-vfl-data.png)
:width:`600px`
:label:`federated-learning-vfl-data`
Enterprise A acts as the server and possesses a set containing label data and sample IDs. Enterprise B acts as the client and possesses a set of sample IDs. First, Enterprise A uses the RSA algorithm to generate a private key and a public key. The private key is retained on the server side, and the public key is sent to Enterprise B.
The server uses the RSA algorithm to compute the signatures of the IDs participating in sample alignment:
$$t_j=H^{'}(K_{a:j})$$
where $K_{a:j}=(H(a_j))^d \ mod \ n$ is the RSA encryption result of $H(a_j)$ encrypted with the private key $d$. $H()$ and $H^{'}()$ are hash functions.
Similarly, on the client side, each hashed sample ID is blinded by multiplying it with a random number $R_{b,i}$ raised to the public exponent:
$$y_i=H(b_i)\cdot(R_{b,i})^e \ mod \ n$$
The client transmits the computed values $\{y_1,...,y_v\}$ to the server side. After receiving the $y_i$ values, the server signs them using the private key $d$ and computes:
$$y_i^{'}=y_i^d \ mod \ n$$
Then the server sends the computed $\{y_1^{'},...,y_v^{'}\}$ and $\{t_1,...,t_w\}$ to the client side.
Upon receiving $y_i^{'}$ and $t_j$, the client first performs the unblinding operation:
$$K_{b:i}={y_i}^{'}/R_{b,i}$$
and aligns its own ID signatures with the server's ID signatures to obtain the ID intersection $I$ in an encrypted and hashed state:
$${t_i}^{'}=H^{'}(K_{b:i}) \\ I=\{t_1,...,t_w\}\cap \{{t_1}^{'},...,{t_v}^{'}\}$$
Finally, the aligned sample ID intersection $I$ is sent to the server, and the server uses its own mapping table to independently derive the plaintext results. In this way, Enterprise A and Enterprise B complete the intersection computation of user sets in an encrypted state, and throughout the entire process, non-overlapping sample IDs of both parties are never exposed.
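As a concrete illustration, the exchange above can be sketched end to end in Python. This is a toy sketch only: the RSA modulus is far too small for real use, `H` and `Hp` stand in for $H$ and $H^{'}$, and all names are our own; a production system would rely on a vetted cryptographic library.

```python
import hashlib
import random
from math import gcd

def H(x):   # hash an ID string into an integer
    return int.from_bytes(hashlib.sha256(x.encode()).digest(), "big")

def Hp(x):  # second hash H' applied to signatures
    return hashlib.sha256(str(x).encode()).hexdigest()

# Toy RSA keypair held by the server (Enterprise A)
p, q = 1000003, 1000033
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))

server_ids = ["alice", "bob", "carol"]   # Enterprise A
client_ids = ["bob", "carol", "dave"]    # Enterprise B

# Server signs its own hashed IDs: t_j = H'((H(a_j))^d mod n)
t = {Hp(pow(H(a) % n, d, n)) for a in server_ids}

# Client blinds each hashed ID with a random R_i: y_i = H(b_i) * R_i^e mod n
R = []
for _ in client_ids:
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    R.append(r)
y = [H(b) % n * pow(r, e, n) % n for b, r in zip(client_ids, R)]

# Server signs the blinded values: y'_i = y_i^d mod n
y_signed = [pow(yi, d, n) for yi in y]

# Client unblinds (y'_i * R_i^{-1} = H(b_i)^d mod n) and hashes
t_prime = {Hp(ys * pow(r, -1, n) % n): b
           for ys, r, b in zip(y_signed, R, client_ids)}

# Intersecting signatures reveals only the shared IDs
intersection = {t_prime[s] for s in t_prime if s in t}
print(sorted(intersection))   # ['bob', 'carol']
```

Because the server only ever sees blinded values and the client only ever sees signatures, neither side learns anything about the other's non-overlapping IDs.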
### Joint Training
After sample ID alignment, developers can use the shared data to build machine learning models.
Currently, models such as linear regression, decision trees, and neural networks have been widely applied in vertical federated learning systems. During the model training process in vertical federated learning, a third-party collaborator C is generally introduced to implement the central server functionality, and it is assumed that this third-party collaborator C is trustworthy and will not collude with other participants. The central server acts as a neutral party during training, generating and distributing keys, and decrypting and computing encrypted data. However, the central server role is not mandatory; for example, in a two-party federated learning scenario, a third-party collaborator C is not needed to coordinate the training tasks of both parties, and Enterprise A, which holds the label data, can assume the role of the central server. Without loss of generality, we continue to describe the vertical federated learning joint training process using a scheme that includes the third-party collaborator C.
![Vertical Federated Joint Modeling](../img/ch10/ch10-federated-learning-vfl-train.svg)
:width:`800px`
:label:`federated-learning-vfl-train`
- Step 1: The third-party collaborator C creates a key pair and sends the public key to Enterprise A and B.
- Step 2: Enterprise A and B separately compute the intermediate results needed for gradient and loss computation, and encrypt and exchange them.
- Step 3: Enterprise A and B separately compute the encrypted gradients and add masks. Meanwhile, Enterprise A also computes the encrypted loss value. After computation, Enterprise A and B send the encrypted values to the third-party collaborator C.
- Step 4: The third-party collaborator C decrypts the gradients and loss values, and sends the results back to Enterprise A and B.
- Step 5: Enterprise A and B first remove the masks from the received gradient values, and then update their local model parameters.
Throughout the entire training process, any sensitive data between Enterprise A and B is encrypted using encryption algorithms before leaving their respective trust domains. Homomorphic Encryption (HE) is one of the commonly used algorithms in federated learning frameworks. Homomorphic encryption means that performing certain operations on two pieces of encrypted data and then directly decrypting the result yields the same outcome as performing the same operations on the original data. When this operation is addition, it is called additive homomorphic encryption. We denote the encryption function as $[[\cdot]]$.
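As a hedged illustration of this additive property, the sketch below implements a toy Paillier cryptosystem, a commonly cited instantiation of such an additively homomorphic $[[\cdot]]$. The key size is illustrative only, and all names are our own:

```python
import random
from math import gcd

# Toy Paillier additively homomorphic encryption (illustrative key size;
# real deployments use 2048-bit moduli and a vetted crypto library).
p, q = 1000003, 1000033
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1)

def L(u):
    return (u - 1) // n

def encrypt(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    mu = pow(lam, -1, n)
    return L(pow(c, lam, n2)) * mu % n

a, b = 1234, 5678
ca, cb = encrypt(a), encrypt(b)
# Multiplying ciphertexts adds the underlying plaintexts
print(decrypt(ca * cb % n2))   # 6912
```

In the training flow above, each enterprise would encrypt its intermediate gradient terms this way, so sums can be accumulated directly on ciphertexts before the collaborator decrypts the aggregate.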

# Automatic Differentiation
In the following, we describe the key methodologies applied in automatic
differentiation.
## Types of Differentiation Methods
Differentiation constitutes a collection of methodologies enabling the
efficient and precise evaluation of derivatives within computer
programs. Since the 1960s and 1970s, it has been extensively utilized
across multiple sectors including fluid mechanics, astronomy, and
mathematical finance. Its theories and implementation have been
rigorously studied over time.
With the advancement of deep learning, which has shown remarkable
progress across an expanding range of machine learning tasks in recent
years, automatic differentiation has found widespread application in
the field of machine learning. Given that many optimization algorithms
employed in machine learning models necessitate derivatives of the
models, automatic differentiation has emerged as an integral component
within mainstream machine learning frameworks such as TensorFlow and
PyTorch.
There are four primary methods to evaluate derivatives in computer
programs, each of which is explained in the following sections.
### Manual Differentiation
Manual differentiation involves the direct computation of the derivative
expression of a function, a task which hinges upon the input values
specified within a program. Although this method could seem appealing
due to its simplicity and directness, it is worth noting that it comes
with its share of limitations.
A primary drawback of manual differentiation is the need to re-derive
and re-implement the derivative every time a function changes, which can
be laborious and time-consuming. This is especially true for complex
functions or when working on large-scale projects where the function
might undergo frequent updates.
Moreover, manual differentiation can be prone to human errors. The
process of deriving complex functions often involves intricate chains of
mathematical reasoning. A slight oversight or error in any of these
steps can lead to an incorrect derivative, which, in turn, can greatly
affect the outcome of the computation. This susceptibility to mistakes
can add a layer of uncertainty to the reliability of this method.
Furthermore, in cases where high-order derivatives or partial
derivatives with respect to many variables are needed, manual
differentiation quickly becomes unfeasible due to the increase in
complexity. The difficulty of computing these derivatives correctly
grows exponentially with the number of variables and the order of the
derivative.
### Numerical Differentiation
Numerical differentiation is an approach that logically stems from the
fundamental definition of a derivative and employs the method of
difference approximation. The basic formula for numerical
differentiation can be described as follows:
$$f^{'}(x)=\lim_{h \to 0}\frac{f(x+h)-f(x)}{h}$$
In this equation, for a sufficiently small value of the step size $h$,
the difference quotient $\frac{f(x+h)-f(x)}{h}$ is used as an
approximation of the derivative. The inherent error in this
approximation is referred to as the truncation error, which
theoretically diminishes as the value of $h$ approaches zero. This
suggests that a smaller step size would yield a more accurate
approximation.
However, the scenario in practice is not always so straightforward due
to the phenomenon of round-off error. This error arises from the finite
precision of floating-point arithmetic operations in digital computer
systems. As the value of $h$ decreases, the round-off error conversely
increases, adding a degree of uncertainty to the computation.
This creates a complex interplay between truncation error and round-off
error. When the value of $h$ is large, the truncation error dominates,
whereas when $h$ is small, the round-off error is more significant.
Consequently, the total error of numerical differentiation achieves a
minimum at an optimal $h$ value that balances these two types of errors.
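This interplay can be observed directly. The short sketch below (names are our own) differentiates $\sin$ at $x=1$ with forward differences over several step sizes, assuming standard double-precision floats:

```python
import math

# Forward-difference approximation of d/dx sin(x) at x = 1.0,
# showing how total error first falls (truncation-dominated) and
# then rises again (round-off-dominated) as the step size h shrinks.
def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)
errors = {h: abs(forward_difference(math.sin, 1.0, h) - exact)
          for h in (1e-1, 1e-4, 1e-8, 1e-12)}
for h, err in errors.items():
    print(f"h={h:.0e}  error={err:.2e}")
```

For forward differences the total error typically bottoms out near $h \approx \sqrt{\epsilon}$ (about $10^{-8}$ in double precision); shrinking $h$ further only amplifies round-off.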
In a nutshell, while numerical differentiation offers the advantage of
relative simplicity in implementation, it suffers from certain
limitations with regard to accuracy. The complexities arising from the
interplay between truncation and round-off errors make it less reliable
for certain tasks, particularly when high precision is required.
Therefore, for many practical applications, more sophisticated
techniques of automatic differentiation are preferred.
### Symbolic Differentiation
Symbolic differentiation involves the use of computer programs to
automatically calculate derivatives. This is accomplished by recursively
transforming function expressions in accordance with specific
differentiation rules. These rules can be summarized as follows:
$$\frac{\partial}{\partial x}(f(x)+g(x))\rightsquigarrow\frac{\partial}{\partial x}f(x)+\frac{\partial }{\partial x}g(x)$$
$$\frac{\partial}{\partial x}(f(x)g(x))\rightsquigarrow(\frac{\partial}{\partial x}f(x))g(x)+f(x)(\frac{\partial}{\partial x}g(x))$$
Symbolic differentiation has been integrated into many modern algebraic
systems such as Mathematica, as well as machine learning frameworks like
Theano. It successfully addresses the issues related to hard-coding
derivatives inherent in manual differentiation, thus automating the
differentiation process and minimizing human error.
Despite these advantages, symbolic differentiation has its own set of
challenges. One of its primary limitations is its strict adherence to
transforming and expanding expressions recursively, without the ability
to reuse previous results of transformations. This can lead to a
phenomenon known as expression swell, which results in highly complex
and expanded expressions that can significantly slow down computation
and increase memory usage.
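A minimal recursive differentiator makes the swell tangible. The sketch below (the tuple encoding and the names `diff`/`size` are our own) applies only the sum and product rules, with no sharing of repeated subterms:

```python
# A tiny symbolic differentiator over tuple expressions ('+'/'*', left, right).
# Because the product rule duplicates subterms and nothing is reused, the
# derivative expression grows far faster than the input expression.
def diff(e, x):
    if e == x:
        return 1
    if not isinstance(e, tuple):        # constant
        return 0
    op, a, b = e
    if op == '+':                       # sum rule
        return ('+', diff(a, x), diff(b, x))
    if op == '*':                       # product rule, no result sharing
        return ('+', ('*', diff(a, x), b), ('*', a, diff(b, x)))

def size(e):
    return 1 + sum(size(s) for s in e[1:]) if isinstance(e, tuple) else 1

e = 'x'
for _ in range(8):
    e = ('*', e, e)                     # deeply nested product of x with itself
print(size(e), size(diff(e, 'x')))      # 511 4607
```

Here the derivative is roughly nine times the size of the input; automatic differentiation avoids this blow-up by caching and reusing intermediate derivative results.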
In addition, symbolic differentiation requires that the expressions to
be differentiated are defined in closed form. This constraint largely
restricts the use of control flow statements such as loops and
conditional branches, which are common in programming. This lack of
flexibility can significantly limit the design and expressivity of
neural networks within machine learning frameworks, as these often
require intricate control flow structures for more advanced operations.
### Automatic Differentiation
Automatic differentiation cleverly amalgamates the strategies of
numerical differentiation and symbolic differentiation to offer an
efficient and precise mechanism for derivative evaluation. It breaks
down the arithmetic operations in a program into a finite set of
elementary operations, for each of which the rules of derivative
evaluation are already known. Upon determining the derivative of each
elementary operation, the chain rule is applied to synthesize these
individual results, ultimately yielding the derivative of the entire
program.
The fundamental strength of automatic differentiation lies in its
ability to sidestep the primary drawbacks of both numerical and symbolic
differentiation. Unlike numerical differentiation, which suffers from
precision issues due to truncation and round-off errors, automatic
differentiation facilitates accurate derivative evaluations.
Furthermore, it mitigates the issue of expression swell, a significant
concern in symbolic differentiation, by decomposing the program into a
series of elementary expressions. Symbolic differentiation rules are
only applied to these simplified expressions, and the derivative results
are reused to enhance efficiency.
Automatic differentiation also surpasses symbolic differentiation in its
capability to handle control flow statements. It has the ability to
process branching, looping, and recursion, enhancing its flexibility and
applicability to complex computational scenarios.
In contemporary applications, automatic differentiation has found
widespread use in deep learning frameworks for the evaluation of
derivatives, given its blend of accuracy and efficiency. The subsequent
sections delve into the mechanics and implementation aspects of
automatic differentiation, elucidating its role as a crucial tool in
computational mathematics and machine learning.
## Forward Mode and Reverse Mode
Automatic differentiation can be categorized into two modes, forward and
reverse, based on the sequence in which the chain rule is applied.
Consider a composite function $y=a(b(c(x)))$. The formula to calculate
its gradient, $\frac{\partial y}{\partial x}$, is given as:
$$\frac{\partial y}{\partial x}=\frac{\partial y}{\partial a}\frac{\partial a}{\partial b}\frac{\partial b}{\partial c}\frac{\partial c}{\partial x}$$
In the forward mode of automatic differentiation, the computation of the
gradient originates from the inputs, as shown in the following
formulation:
$$\frac{\partial y}{\partial x}=(\frac{\partial y}{\partial a}(\frac{\partial a}{\partial b}(\frac{\partial b}{\partial c}\frac{\partial c}{\partial x})))$$
Conversely, in the reverse mode, the computation of the gradient begins
from the outputs, represented by the equation:
$$\frac{\partial y}{\partial x}=(((\frac{\partial y}{\partial a}\frac{\partial a}{\partial b})\frac{\partial b}{\partial c})\frac{\partial c}{\partial x})$$
To illustrate the computation methods of the two modes, let us consider
the following function and aim to evaluate its derivative,
$\frac{\partial y}{\partial x_1}$ at the point $(x_1, x_2)=(2,5)$:
$$y=f(x_1,x_2)=\ln(x_1)+x_1 x_2-\sin(x_2)$$
Figure :numref:`ch04/ch04-calculation_graph` represents the
computational graph of this function, providing a visual demonstration
of how automatic differentiation processes the function in both forward
and reverse modes. This distinction between forward and reverse modes is
particularly important when dealing with functions of multiple
variables, with each mode having specific use cases and efficiency
implications.
![Computational graph of the example function](../img/ch04/AD-example_graph.png)
:label:`ch04/ch04-calculation_graph`
### Forward Mode
![Illustration of forward-mode automatic differentiation](../img/ch04/AD-forward_example.png)
:label:`ch04/ch04-forward-mode-compute-function`
Figure :numref:`ch04/ch04-forward-mode-compute-function` elucidates the computation process within the forward mode. The sequence of elementary operations, derived from the source program, is displayed on the left. Following the chain rule and using established derivative evaluation rules, we sequentially compute each intermediate variable ${\dot{v}_i}=\frac{\partial v_i}{\partial x_1}$ from top to bottom, as depicted on the right.
Consequently, this leads to the computation of the final variable ${\dot{v}_5}=\frac{\partial y}{\partial x_1}$. In the process of derivative evaluation of a function, we obtain a set of partial derivatives of any output with respect to any input of this function.
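The forward sweep just described can be reproduced with a small dual-number sketch, in which every value carries its derivative with respect to $x_1$ alongside it (the `Dual` class and helper names are our own):

```python
import math

# Forward-mode AD with dual numbers: each Dual carries (value, derivative)
# and every overloaded operation applies the chain rule as it executes.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __sub__(self, o):
        return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o):
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

def ln(d):
    return Dual(math.log(d.val), d.dot / d.val)

def sin(d):
    return Dual(math.sin(d.val), d.dot * math.cos(d.val))

# Evaluate y = ln(x1) + x1*x2 - sin(x2) at (2, 5), seeding dx1 = 1, dx2 = 0
x1, x2 = Dual(2.0, 1.0), Dual(5.0, 0.0)
y = ln(x1) + x1 * x2 - sin(x2)
print(y.dot)   # dy/dx1 = 1/x1 + x2 = 5.5
```

Seeding $\dot{x}_1=1$, $\dot{x}_2=0$ selects the Jacobian column for $x_1$; seeding with an arbitrary vector $\mathbf{r}$ computes a Jacobian-vector product in a single pass.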
For a function $f:{\mathbf{R}^n}\to \mathbf{R}^m$, where $n$ is the number of independent input variables $x_i$ and $m$ is the number of independent output variables $y_i$, the derivative results correspond to the following Jacobian matrix:
$$
\mathbf{J}_{f}= \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}
$$
Each forward pass of function $f$ results in partial derivatives of all outputs with respect to a single input, represented by the vector below. This corresponds to one column of the Jacobian matrix. Therefore, executing $n$ forward passes gives us the full Jacobian matrix.
$$
\begin{bmatrix} \frac{\partial y_1}{\partial x_i} \\
\vdots \\
\frac{\partial y_m}{\partial x_i} \end{bmatrix}
$$
The forward mode allows us to compute Jacobian-vector products by initializing $\dot{\mathbf{x}}=\mathbf{r}$ to generate the results for a single column. As the derivative evaluation rules for elementary operations are pre-determined, we know the Jacobian matrix for all the elementary operations. Consequently, by leveraging the chain rule to evaluate the derivatives of $f$ propagated from inputs to outputs, we secure one column in the Jacobian matrix of the entire network.
$$
\mathbf{J}_{f}\mathbf{r}= \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \begin{bmatrix} r_1 \\
\vdots \\
r_n \end{bmatrix}
$$
### Reverse Mode
Figure :numref:`ch04/ch04-backward-mode-compute` illustrates the automatic differentiation process in the reverse mode. The sequence of elementary operations, derived from the source program, is displayed on the left. Beginning from $\bar{v}_5=\bar{y}=\frac{\partial y}{\partial y}=1$, we sequentially compute each intermediate variable ${\bar{v}_i}=\frac{\partial y_j}{\partial v_i}$ from bottom to top,
leveraging the chain rule and established derivative evaluation rules
(as depicted on the right). Thus, we can compute the final variables
${\bar{x}_1}=\frac{\partial y}{\partial x_1}$ and
${\bar{x}_2}=\frac{\partial y}{\partial x_2}$.
![Illustration of reverse-mode automatic differentiation](../img/ch04/AD-backward_example.png)
:label:`ch04/ch04-backward-mode-compute`
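The two-phase structure (forward evaluation, then a reverse sweep over recorded operations) can be sketched with an explicit tape; the list layout and names below are our own:

```python
import math

# Reverse-mode AD with an explicit tape for y = ln(x1) + x1*x2 - sin(x2).
# Phase 1 evaluates the program, recording each elementary operation's
# inputs and local partial derivatives; phase 2 sweeps the tape backwards.
vals, tape = [], []

def push(value, parents):
    # parents: list of (input index, local partial derivative)
    vals.append(value)
    tape.append(parents)
    return len(vals) - 1

x1 = push(2.0, [])
x2 = push(5.0, [])
v1 = push(math.log(vals[x1]), [(x1, 1.0 / vals[x1])])
v2 = push(vals[x1] * vals[x2], [(x1, vals[x2]), (x2, vals[x1])])
v3 = push(math.sin(vals[x2]), [(x2, math.cos(vals[x2]))])
v4 = push(vals[v1] + vals[v2], [(v1, 1.0), (v2, 1.0)])
y  = push(vals[v4] - vals[v3], [(v4, 1.0), (v3, -1.0)])

adjoint = [0.0] * len(vals)
adjoint[y] = 1.0                       # dy/dy = 1
for i in reversed(range(len(vals))):   # reverse sweep over the tape
    for parent, local in adjoint and tape[i]:
        adjoint[parent] += adjoint[i] * local

print(adjoint[x1])   # dy/dx1 = 1/2 + 5 = 5.5
# adjoint[x2] holds dy/dx2 = x1 - cos(x2) = 2 - cos(5)
```

One reverse sweep yields the derivatives of $y$ with respect to every input at once, which is exactly why this mode underlies backpropagation.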
Every reverse pass of function $f$ produces partial derivatives of a single output with respect to all inputs, represented by the following vector. This corresponds to a single row of the Jacobian matrix. Consequently, executing $m$ reverse passes gives us the full Jacobian matrix.
$$
\begin{bmatrix} \frac{\partial y_j}{\partial x_1} & \cdots & \frac{\partial y_j}{\partial x_n} \end{bmatrix}
$$

Similarly, we can compute vector-Jacobian products to obtain the results for a single row:

$$
\mathbf{r}^{T}\mathbf{J}_{f}= \begin{bmatrix} r_1 & \cdots & r_m \end{bmatrix} \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}
$$
The quantity of columns and rows in a Jacobian matrix directly
influences the number of forward and reverse passes needed to solve it
for a given function $f$. This characteristic is particularly
significant when determining the most efficient method of automatic
differentiation.
When the function has significantly fewer inputs than outputs
$(f:{\mathbf{R}^n}\to \mathbf{R}^m, n \ll m)$, the forward mode proves to
be more efficient. Conversely, when the function has considerably more
inputs than outputs $(f:{\mathbf{R}^n}\to \mathbf{R}^m, n \gg m)$, the
reverse mode becomes advantageous.
For an extreme case where the function maps from $n$ inputs to a single
output $f:{\mathbf{R}^n}\to \mathbf{R}$, we can evaluate all the
derivatives of the output with respect to the inputs
$(\frac{\partial y}{\partial x_1},\cdots,\frac{\partial y}{\partial x_n})$
using a single reverse pass or $n$ forward passes. This is a situation
akin to derivative evaluation for a multi-input, single-output network,
a structure frequently encountered in machine learning.
Due to this feature, reverse-mode automatic differentiation forms the
basis for the backpropagation algorithm, a key technique for training
neural networks. By enabling efficient computation of gradients,
especially in scenarios with high-dimensional input data and scalar
output (common in many machine learning applications), reverse-mode
automatic differentiation has become indispensable in the field.
However, the reverse mode does come with certain limitations. For
instance, once a source program is decomposed into a sequence of
elementary operations in the forward mode, inputs can be obtained
synchronously during the execution of these operations. This is possible
because the sequence of derivative evaluations aligns with the sequence
of operation execution. In contrast, in the reverse mode, the sequence
for derivative evaluation is the inverse of the execution sequence of
the source program, leading to a two-phased computation process. The
initial phase entails executing the source program and storing the
intermediate results in memory, while the subsequent phase involves
retrieving these intermediate results to evaluate the derivatives. Due
to the additional steps involved, the reverse mode requires more memory.
## Implementing Automatic Differentiation
This section explores typical design patterns for implementing automatic
differentiation in machine learning frameworks. These design patterns
can be broadly classified into three categories: elemental libraries,
operator overloading, and source transformation.
### Elemental Libraries
Elemental libraries encapsulate elementary expressions and their
differential expressions as library functions. When coding, users must
manually decompose a program into a set of elementary expressions and
replace them with corresponding library functions. Take the program
$a=(x+y)/z$ as an example; it needs to be manually decomposed as
follows:
```
t = x + y
a = t / z
```
Subsequently, users replace the decomposed elementary expressions with
appropriate library functions:
```
// The parameters include variables x, y, and t and their derivative variables dx, dy, and dt.
call ADAdd(x, dx, y, dy, t, dt)
// The parameters include variables t, z, and a and their derivative variables dt, dz, and da.
call ADDiv(t, dt, z, dz, a, da)
```
The library functions, ADAdd and ADDiv, use the chain rule to define the
Add and Div differential expressions, respectively. This is illustrated
in Code `lst:diff`.
**lst:diff**
```
def ADAdd(x, dx, y, dy, z, dz):
    z = x + y
    dz = dy + dx

def ADDiv(x, dx, y, dy, z, dz):
    z = x / y
    dz = dx / y - (x / (y * y)) * dy
```
Elemental libraries constitute a simple and straightforward way of
implementing automatic differentiation for programming languages.
However, this approach requires users to manually decompose a program
into elementary expressions before calling library functions for
programming. Furthermore, it is not possible to use the native
expressions found in programming languages.
### Operator Overloading
Leveraging the polymorphism characteristic inherent in modern
programming languages, the Operator Overloading design pattern redefines
the semantics of elementary operations and successfully encapsulates
their differentiation rules. During the execution phase, it methodically
documents the type, inputs, and outputs of every elementary operation
within a data structure known as a 'tape'. These tapes have the ability
to generate a trace, serving as a pathway for applying the chain rule.
This makes it possible to aggregate elementary operations either in a
forward or backward direction to facilitate differentiation. As depicted
in Code `lst:OO`,
we utilize the AutoDiff code from automatic differentiation libraries as
a case in point to overload the basic arithmetic operators in
programming languages.
**lst:OO**
```csharp
namespace AutoDiff
{
    public abstract class Term
    {
        // To overload and call operators (`+`, `*`, and `/`),
        // TermBuilder records the types, inputs, and outputs of operations in tapes.
        public static Term operator+(Term left, Term right)
        {
            return TermBuilder.Sum(left, right);
        }
        public static Term operator*(Term left, Term right)
        {
            return TermBuilder.Product(left, right);
        }
        public static Term operator/(Term numerator, Term denominator)
        {
            return TermBuilder.Product(numerator, TermBuilder.Power(denominator, -1));
        }
    }

    // Tape data structures include the following basic elements:
    // 1) Arithmetic results of operations
    // 2) Derivative evaluation results corresponding to arithmetic results of operations
    // 3) Inputs of operations
    // In addition, functions Eval and Diff are used to define the computation
    // and differentiation rules of the arithmetic operations.
    internal abstract class TapeElement
    {
        public double Value;
        public double Adjoint;
        public InputEdges Inputs;

        public abstract void Eval();
        public abstract void Diff();
    }
}
```
Operator overloading carries the advantage of tracing the program
through function calls and control flows, resulting in an implementation
process that is both simple and straightforward. However, the
requirement to trace the program during runtime introduces certain
challenges. Specifically, operator overloading is necessitated to
execute reverse-mode differentiation along the trace, which can
potentially cause a drop in performance, particularly for elementary
operations that are executed swiftly. Furthermore, due to the
constraints of runtime, operator overloading is unable to conduct
compile-time graph optimization prior to execution, and the unfolding of
control flows must be based on the information available at runtime.
Despite these challenges, operator overloading is extensively employed
in the PyTorch framework for automatic differentiation due to its
inherent simplicity and adaptability.
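The same pattern is compact in Python, where magic methods provide the overloading. The micrograd-style sketch below (class and method names are our own, not PyTorch's API) records local partial derivatives as operations execute and replays them in reverse:

```python
# Operator-overloading reverse-mode AD: __add__ and __mul__ record each
# operation's inputs and local partials; backward() applies the chain
# rule over the recorded trace in reverse topological order.
class Var:
    def __init__(self, val):
        self.val, self.grad, self.parents = val, 0.0, ()
    def __add__(self, other):
        out = Var(self.val + other.val)
        out.parents = ((self, 1.0), (other, 1.0))
        return out
    def __mul__(self, other):
        out = Var(self.val * other.val)
        out.parents = ((self, other.val), (other, self.val))
        return out
    def backward(self):
        order, seen = [], set()
        def visit(v):                  # topological order of the trace
            if v not in seen:
                seen.add(v)
                for parent, _ in v.parents:
                    visit(parent)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in v.parents:
                parent.grad += v.grad * local

a, b = Var(2.0), Var(5.0)
y = a * b + a        # dy/da = b + 1, dy/db = a
y.backward()
print(a.grad, b.grad)   # 6.0 2.0
```

Note that the trace is rebuilt on every execution, so Python-level control flow (loops, branches) is traced naturally, at the cost of per-operation runtime overhead.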
### Source Transformation
Source transformation is a design pattern that enriches programming
languages and scrutinizes a program's source code or its Abstract Syntax
Tree (AST) to automatically deconstruct the program into a set of
differentiable elementary operations, each with predefined
differentiation rules. The chain rule is then employed to amalgamate the
differential expressions of the elementary operations, resulting in a
novel program expression that conducts the differentiation. Source
Transformation is integral to machine learning frameworks such as
TensorFlow and MindSpore.
Unlike operator overloading, which functions within programming
languages, source transformation necessitates parsers and tools that
manipulate IRs. It also requires transformation rules for function calls
and control flow statements, such as loops and conditions. The principal
advantage of source transformation is that the automatic differentiation
transformation occurs only once per program, thus eliminating runtime
overhead. Additionally, the complete differentiation program is
available during compilation, enabling ahead-of-time optimization using
compilers.
However, source transformation presents a higher implementation
complexity compared to the other approaches. It must support a wider
array of data types and operations, and it necessitates preprocessors,
compilers, or interpreters of extended languages, along with a more
robust type-checking system. Even though source transformation does not
manage automatic differentiation transformation at runtime, it still
must ensure that certain intermediate variables from the forward pass
are accessible by the adjoint in reverse mode. Two modes are available
to facilitate this:
- **Tape-based mode**: This mode requires a global tape that ensures
the accessibility of intermediate variables. In this method, the
primitive function is augmented so that intermediate variables are
written to functions in the tape during the forward pass, and the
adjoint program reads these intermediate variables from the tape
during the backward pass. The tape used in source transformation
primarily stores the intermediate variables, while the tape used in
operator overloading additionally stores the executed operation
types. Given that the tape is a data structure constructed at
runtime, custom compiler optimizations are required. Moreover, tape
read and write operations must be differentiable to support
higher-order differentiation, which involves multiple applications
of reverse mode. As most tape-based tools do not differentiate tape
read and write operations, such tools do not support
reverse-over-reverse automatic differentiation.
- **Closure-based mode**: This mode was proposed to mitigate some of
the limitations observed in the tape-based mode. Within functional
programming, closures can capture the execution environment of a
statement and identify the non-local use of intermediate variables.

# Overview of AI Compilers
Like classical compilers, AI compilers also convert user-written code
into efficient machine-executable code. In the following, we delve into
the intricacies of AI compilers, discussing various concepts inherent to
general-purpose compilers such as ahead of time (AOT), just in time
(JIT), intermediate representations (IRs), pass-based optimization,
abstract syntax tree, side effects, and closures. Our focus will be
primarily on the distinctive design and functionality of AI compilers as
compared to classical compilers, rather than offering definitions of
these concepts, as these can be found in numerous other compiler-related
textbooks.
The design of AI compilers is significantly influenced by classical
compilers like the Low Level Virtual Machine (LLVM). Thus, gaining an
understanding of the basic architecture of the LLVM compiler, depicted
in Figure :numref:`ch04/llvm-basic`, will be beneficial.
![Basic architecture of the LLVM compiler](../img/ch04/LLVM_basic_architecture.png)
:label:`ch04/llvm-basic`
The LLVM compiler consists of three components: the frontend,
intermediate representations, and the backend. The frontend converts
high-level languages into IRs. The backend then transforms these IRs
into machine instructions executable on the target hardware. As their
name implies, IRs serve as a transition phase from the frontend to the
backend, where necessary optimizations can take place. The architecture
of the LLVM compiler ensures that IRs are reusable and compatible with
any newly introduced frontend or hardware. While IRs can exist on one or
more levels, LLVM typically uses a one-level structure, meaning the
frontend and backend optimizations share the same set of IRs.
AI compilers, on the other hand, commonly employ a multi-level IR
structure. An example is the multi-level IR (MLIR) design adopted by
TensorFlow, as depicted in Figure
:numref:`ch04/TF-IR`.
TensorFlow's MLIR comprises three levels of IRs: the TensorFlow graph
IR, the XLA HLO IR, and hardware-specific LLVM IR or TPU IR. The
subsequent sections briefly outline these levels and their corresponding
compilation optimization processes.
![TensorFlow's multi-level IR design](../img/ch04/TensorFlow-IR.png)
:label:`ch04/TF-IR`
The process of optimization in computational graphs is known as graph
compilation optimization. The first level of IR, the graph IR, carries
out optimization and operations (e.g., graph optimization and graph
segmentation) for an entire graph. While this complete-graph IR is
suitable for static graph execution, it proves challenging for
hardware-specific optimization due to the absence of hardware
information. To address this, hardware-specific generic compilation
optimization is applied at the mid-level of IRs. Platforms like XLA,
TensorRT, and MindSpore's graph kernel fusion enhance the execution
performance of various neural networks on specific hardware by executing
operator fusion and other optimizations for different hardware types.
The final level of IR deals exclusively with a certain type of hardware
accelerator and often comes bundled with a hardware vendor's compiler.
For instance, the TBE compiler, paired with Ascend hardware, generates
efficient execution operators based on TVM's HalideIR.
The multi-level IR design grants IRs enhanced flexibility and
facilitates more efficient pass-based optimization for each specific IR
level. However, this design has limitations. First, achieving fully
compatible IR transformation across different levels is challenging due
to the substantial engineering effort required and potential information
loss during the transformation. Optimization carried out at one IR level
might eliminate some information, and the implications of this removal
must be evaluated at the next level. As a result, IR transformation
imposes stricter constraints on the sequence in which optimization
occurs. Second, deciding at which of two adjacent levels to
perform certain IR optimizations presents a dilemma for framework
developers. Lastly, because different IR levels can define different
operator granularities, some accuracy might be compromised.
To mitigate these drawbacks, the AI compiler in the MindSpore machine
learning framework uses a unified IR design known as MindIR. Figure
:numref:`ch04/msflow`
illustrates the internal execution process of MindSpore's AI compiler.
In this process, the compiler frontend handles graph compilation and
hardware-agnostic optimization, while the compiler backend conducts
tasks like hardware-specific optimization and operator selection.
![Working process of MindSpore's AI compiler](../img/ch04/compiler_process.png)
:label:`ch04/msflow`

# Frontend Compilation Optimization
Much like classical compilers, AI compilers implement compilation
optimization to enhance the effectiveness of the IRs generated during
the compilation process. This strategy not only reduces the length of
the code and the time required for its compilation and execution but
also diminishes the energy usage of processors during execution.
Compilation optimization techniques can be divided into two categories:
hardware-agnostic optimization and hardware-specific optimization.
However, all optimization techniques applied at the frontend are
inherently hardware-agnostic, as the frontend remains oblivious to the
backend hardware specifics.
## Process of Compilation Optimization
Typically, compilation optimizers execute a sequence of optimization
passes. In each pass, an IR is used as input, which then produces a
revised IR as output. A single pass might incorporate several sub-passes
and can be conducted once or multiple times.
The overall success of compilation optimization significantly depends on
the selection and ordering of optimization operations. Not only does the
compiler execute various compilation optimization operations as needed,
but it can also adjust the number of optimization passes along with the
types and sequence of optimization operations. These adjustments are
contingent upon the set level of compilation optimization, as
illustrated in Figure :numref:`ch06/ch06-opt-pass`.
![Structural layout of an optimization pass in compilation optimization](../img/ch04/optimization_pass.png)
:label:`ch06/ch06-opt-pass`
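The pass structure described above can be sketched in a few lines of Python. This is a hypothetical toy model, not any real compiler's API: the IR is a list of instruction quadruples, each pass maps an IR to a revised IR, and the pipeline may run a pass once or several times.

```python
# Toy sketch of a pass-based optimizer: each pass is a function IR -> IR.
# The IR is a list of (op, dst, operand1, operand2) tuples (illustrative).
def constant_fold(ir):
    out = []
    for op, dst, a, b in ir:
        if op == "add" and isinstance(a, int) and isinstance(b, int):
            out.append(("const", dst, a + b, None))  # fold 2 + 3 at compile time
        else:
            out.append((op, dst, a, b))
    return out

def run_pipeline(ir, passes, repeat=1):
    # A pass may be scheduled once or multiple times, and the set and
    # order of passes depends on the chosen optimization level.
    for _ in range(repeat):
        for opt_pass in passes:
            ir = opt_pass(ir)
    return ir

ir = [("add", "x", 2, 3), ("mul", "y", "x", 4)]
print(run_pipeline(ir, [constant_fold]))
# -> [('const', 'x', 5, None), ('mul', 'y', 'x', 4)]
```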
## Prevalent Optimization Methods
Today, a wide array of frontend compilation optimization methods exist.
Analogously, machine learning frameworks also employ various
optimization methods, although these diverge from those found in
classical compilers. This section will detail three frequently employed
and versatile frontend compilation optimization methods.
### Elimination of Dead Code and Unreachable Code
Dead code refers to segments of code that yield outputs not utilized by
any other code, while unreachable code refers to segments of code that
are not included in any valid control flow path. Figure
:numref:`ch06/ch06-opt-pass-useless-code0-elimination`
demonstrates these two types of code. The removal of dead or unreachable
code can decrease the size of IRs and expedite both the compilation and
execution of a program. These types of code can result from human errors
or may manifest during other compilation optimizations.
![Elimination of dead code](../img/ch04/dead_code_elimination.png)
:label:`ch06/ch06-opt-pass-useless-code0-elimination`
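As a concrete illustration, dead code can be removed with a single backward liveness sweep over straight-line code. The sketch below is hypothetical (the tuple-based IR and names are ours, not from any particular framework): an instruction is kept only if its result is live, i.e., used by a later instruction or by the output.

```python
# Hypothetical dead-code elimination on a straight-line IR.
# Each instruction is (destination, op, operand_tuple).
def eliminate_dead_code(instrs, live_out):
    live = set(live_out)   # values required after this code fragment
    kept = []
    # Walk backwards: keep an instruction only if its destination is live,
    # then mark its string operands (variable references) as live too.
    for dst, op, args in reversed(instrs):
        if dst in live:
            kept.append((dst, op, args))
            live |= {a for a in args if isinstance(a, str)}
    return list(reversed(kept))

instrs = [
    ("a", "add", (1, 2)),
    ("b", "mul", ("a", 3)),   # dead: 'b' is never used afterwards
    ("c", "sub", ("a", 1)),
]
print(eliminate_dead_code(instrs, live_out={"c"}))
# -> [('a', 'add', (1, 2)), ('c', 'sub', ('a', 1))]
```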
As mentioned in the earlier chapter on the conversion between, and
combination of, dynamic and static graphs, the tracing method can be
employed during the process of converting dynamic graphs to static
graphs. The tracing method is considered highly effective in identifying
dead code and unreachable code. Consequently, this step is often
incorporated into the graph conversion procedure.
### Constant Propagation and Constant Folding
Constant propagation is a process that replaces specific constants with
their known values during compilation. On the other hand, constant
folding is a process that substitutes variables with constants when the
results of multiple operations can be computed directly during
compilation.
Figure :numref:`ch06/ch06-opt-pass-constant-broadcast` depicts these two
methods.
![Constant propagation and constant folding techniques](../img/ch04/constant_propagation_and_constant_folding.png)
:label:`ch06/ch06-opt-pass-constant-broadcast`
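Constant folding can be demonstrated directly on a Python abstract syntax tree. The sketch below uses the standard library's `ast` module (it is an illustration of the idea, not the folding pass of any machine learning framework): binary operations whose operands are both literal constants are replaced by their computed value before the code ever runs.

```python
# Minimal constant folding over a Python AST using the stdlib ast module.
import ast

class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            if isinstance(node.op, ast.Add):
                return ast.copy_location(
                    ast.Constant(node.left.value + node.right.value), node)
            if isinstance(node.op, ast.Mult):
                return ast.copy_location(
                    ast.Constant(node.left.value * node.right.value), node)
        return node  # operands not constant: leave the operation in place

tree = ast.parse("y = 2 * 3 + x")
folded = ast.fix_missing_locations(ConstantFolder().visit(tree))
print(ast.unparse(folded))  # -> y = 6 + x
```

Here `2 * 3` is folded to `6` at compile time, while `6 + x` survives because `x` is not a known constant (`ast.unparse` requires Python 3.9+).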
### Common Subexpression Elimination
In order to understand what common subexpression elimination entails,
let's consider the following: If an expression E has been computed and
the values of all its variables remain unchanged from the prior
computation, E is identified as a common subexpression. This concept is
visualized in
Figure :numref:`ch06/ch06-opt-pass-CSE`. As such, E doesn't need to be
computed again; it can be directly replaced with the expression result
obtained from the preceding computation.
![Common subexpression elimination process](../img/ch04/common_subexpression_elimination.png)
:label:`ch06/ch06-opt-pass-CSE`
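A minimal value-numbering-style sketch of this idea is shown below. It is hypothetical and assumes SSA-like code in which operands are never reassigned (so an available expression stays valid); a repeated `(op, operands)` pair is replaced by a copy of the first result.

```python
# Hypothetical common subexpression elimination on three-address code.
# Assumes SSA form: once computed, an expression's operands never change.
def eliminate_common_subexprs(instrs):
    available = {}   # (op, args) -> destination already holding that value
    rewritten = []
    for dst, op, args in instrs:
        key = (op, args)
        if key in available:
            # E was already computed: reuse the prior result instead.
            rewritten.append((dst, "copy", (available[key],)))
        else:
            available[key] = dst
            rewritten.append((dst, op, args))
    return rewritten

instrs = [
    ("t1", "mul", ("a", "b")),
    ("t2", "add", ("t1", "c")),
    ("t3", "mul", ("a", "b")),   # common subexpression of t1
]
print(eliminate_common_subexprs(instrs))
# -> [('t1', 'mul', ('a', 'b')), ('t2', 'add', ('t1', 'c')), ('t3', 'copy', ('t1',))]
```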
Common subexpression elimination, like the elimination of dead code and
unreachable code, is typically carried out during the graph conversion
process. In PyTorch, the TorchScript module provides a dedicated API
for common subexpression elimination; TorchScript's static
representation makes common subexpressions straightforward to identify.

# AI Compiler Frontend
Tailored for machine learning frameworks, an AI compiler is designed to
convert Python-based machine learning programs into their optimized
forms, enabling efficient native execution on heterogeneous processors.
This chapter first outlines the typical architecture of an AI compiler
before delving into the design of the compiler's frontend. The compiler
frontend incorporates various techniques, including intermediate
representations (IRs), automatic differentiation, type systems, static
analysis, and compilation optimization.
The learning objectives of this chapter include:
- Understanding the typical architecture of an AI compiler.
- Understanding the types and implementation of IRs in machine
learning frameworks.
- Understanding the methods of automatic differentiation implemented
in AI compilers.
- Understanding type systems and static analysis in AI compilers.
- Understanding common frontend compilation optimization methods used
by AI compilers.
```toc
:maxdepth: 2
Overview_of_AI_Compilers
Overview_of_AI_Compiler_Frontends
Intermediate_Representation
Automatic_Differentiation
Type_Systems_and_Static_Analysis
Frontend_Compilation_Optimization
Chapter_Summary
Further_Reading
```

# Intermediate Representation
In this section, we begin by introducing basic IR concepts and the types
of IR employed in classical compilers. Next, we address the new
requirements and challenges that arise in the IR design for machine
learning frameworks. To conclude this section, we examine the types of
IRs utilized by well-known machine learning frameworks and delve into
their implementation.
## Definition of Intermediate Representations
An IR is a data structure or a form of code that a compiler utilizes to
represent source code. Almost all compilers need IRs to model the
program code that requires analysis, transformation, and optimization.
The representational capability of an IR is crucial during the
compilation process. It must accurately depict source code without
information loss, ensure the completeness of the source-to-target code
compilation, and guarantee the effectiveness and performance of code
optimization.
As illustrated in Figure :numref:`ch04/ch04-IR`, IRs facilitate the representation of
multiple source program languages from the frontend and enable the
backend to connect to various target machines. Located between the
frontend and backend is an optimizer, which allows for the addition of
new optimization processes directly into the frontend and backend. These
processes use existing IRs as input and generate new IRs as output. By
analyzing and optimizing IRs, the optimizer enhances the extensibility
of the compilation process and minimizes the impact that might be
introduced during an optimization process on the frontend and backend.
![Compiler's optimization process](../img/ch04/IR-IR_structure.png)
:label:`ch04/ch04-IR`
With the ongoing evolution of compiler techniques, the development of
IRs has progressed through three stages. In the initial stage, IRs were
confined within a compiler and exclusively used by compiler developers.
During the middle stage, when specific compilers became open source, IRs
started being made publicly available, primarily for use by the users of
compilers and related compilation tools. In the current stage, IRs are
advancing toward facilitating an ecosystem of ecosystems (through a
unified IR approach), encouraging a growing range of stakeholders (for
example, hardware accelerator designers and machine learning framework
users) to participate in advancing AI computing.
## Types of Intermediate Representations
We will discuss various types of IR structures used by classical
compilers. Understanding these IR structures is essential for analyzing
source programs and generating optimized compiled code. Table
:numref:`ch06/ch06-categorize` offers an overview of the
different IR types. It is important to design IR structures carefully,
considering the specific requirements of the compiler's design.
:Types of IRs

| IR Structure | Characteristics | Examples |
| ------------ | --------------- | -------- |
| Linear IR | Based on linear code | Stack machine code, three-address code |
| Graphical IR | Based on graphs | Abstract syntax tree, directed acyclic graph |
| Hybrid IR | Based on both graphs and linear code | LLVM IR |

:label:`ch06/ch06-categorize`
### Linear Intermediate Representation
Linear IRs are widely used in compiler design, resembling assembly code
for abstract machines. They represent the code to be compiled as a
sequentially ordered series of operations. This ordering is important in
practical terms. Linear IRs are popular because most processors utilize
linear assembly languages.
Two common types of linear IRs are stack machine code and three-address
code. Stack machine code, a form of single-address code, offers a
straightforward and compact representation. Instructions in stack
machine code typically consist solely of an opcode that specifies an
operation, with operands stored on a stack. Most instructions retrieve
operands from the stack and push the results of their operations back
onto it. On the other hand, three-address code (3AC) emulates the
instruction format used in modern RISC machines. It employs a set of
quadruples, each containing an operator and three addresses (two
operands and one target). Figure
:numref:`ch04/ch04-linearIR` illustrates the stack machine code
and three-address code representations for the expression $a-b*5$.
![Stack machine code and three-address code](../img/ch04/IR-linear_IR.png)
:label:`ch04/ch04-linearIR`
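The two linear forms can be contrasted in code. The sketch below is illustrative (the tuple encodings are ours): the stack machine version carries no addresses in its instructions, while each 3AC quadruple names an operator, two operands, and a target.

```python
# Stack machine code vs. three-address code for the expression a - b * 5.
def run_stack_code(code, env):
    stack = []
    for instr in code:
        op = instr[0]
        if op == "push":
            v = instr[1]
            stack.append(env.get(v, v))   # variable lookup or literal
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "sub":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
    return stack[-1]

# Single-address form: operands live on the stack, opcodes carry no addresses.
stack_code = [("push", "a"), ("push", "b"), ("push", 5), ("mul",), ("sub",)]

# Three-address form: (operator, target, operand1, operand2) quadruples,
# echoing modern RISC instruction formats.
three_address_code = [
    ("mul", "t1", "b", 5),     # t1 <- b * 5
    ("sub", "t2", "a", "t1"),  # t2 <- a - t1
]

print(run_stack_code(stack_code, {"a": 20, "b": 3}))  # -> 5
```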
### Graphical Intermediate Representation
Graphical IRs store information about the compilation process in the
form of graphs. These graphs utilize nodes, edges, lists, trees, and
other elements to collectively represent an algorithm. Although all
graphical IRs consist of nodes and edges, they differ in terms of
abstraction levels and graph structures. Common examples of graphical
IRs include abstract syntax trees (ASTs), directed acyclic graphs
(DAGs), and control-flow graphs (CFGs).
An AST is a tree-structured IR that closely resembles the structure of
the source code. Figure :numref:`ch04/ch04-AST_DAG` depicts the AST for the expression
$a*5+a*5*b$. It is worth noting that the AST contains two identical copies
of $a*5$, which introduces redundancy. To address this redundancy, the
DAG offers a simplified representation where identical subtrees can be
shared by multiple parent nodes. By reusing subtrees, the DAG reduces
the cost of the evaluation process, especially when the compiler can
verify that the value of $a$ remains constant.
![AST and DAG](../img/ch04/IR-ASTDAG.png)
:label:`ch04/ch04-AST_DAG`
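The sharing of identical subtrees can be achieved by memoizing node construction, a technique often called hash consing. The sketch below is a toy illustration (nodes are plain tuples interned in a dictionary): both copies of $a*5$ collapse into one shared node, exactly as in the DAG of the figure above.

```python
# Toy DAG construction via memoized ("hash-consed") node creation.
_nodes = {}

def node(op, *children):
    key = (op, children)
    if key not in _nodes:      # first time we see this subtree: intern it
        _nodes[key] = key
    return _nodes[key]         # otherwise reuse the existing shared node

# Build a*5 + a*5*b; the two a*5 subtrees become a single shared node.
t1 = node("*", "a", 5)
expr = node("+", t1, node("*", node("*", "a", 5), "b"))
print(node("*", "a", 5) is t1)  # -> True: the subtree is shared, not copied
```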
### Hybrid Intermediate Representation
Hybrid IRs combine both linear IR and graphical IR elements. An example
of a hybrid IR is LLVM IR, which is illustrated in Figure
:numref:`ch04/ch04-LLVM_IR`. LLVM is an open-source compiler
framework with the goal of providing unified IRs for different frontends
and backends.
In LLVM IR, linear IRs are used to construct basic blocks, while
graphical IRs represent the control flow between these blocks. Each
instruction within a basic block is presented as a static single
assignment (SSA). SSA requires each variable to be defined before use,
with values assigned to them only once. Multiple SSA instructions form a
linear list within a basic block.
In the control flow graph (CFG), each node represents a basic block, and
control transfer between these blocks is implemented through edges. This
combination of linear IR for basic blocks and graphical IR for control
flow allows for a flexible and efficient representation in LLVM IR.
![LLVM IR](../img/ch04/IR-LLVMIR.png)
:label:`ch04/ch04-LLVM_IR`
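This hybrid structure can be modeled with a few lines of Python. The sketch below is illustrative, not LLVM's actual data structures: each basic block holds a linear list of SSA-style instruction strings, and the CFG edges between blocks form the graphical part.

```python
# Toy model of a hybrid IR: linear SSA instructions inside basic blocks,
# graphical control-flow edges between blocks (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class BasicBlock:
    name: str
    instrs: list = field(default_factory=list)  # linear SSA instruction list
    succs: list = field(default_factory=list)   # CFG edges to successor blocks

entry = BasicBlock("entry", instrs=["%1 = icmp sgt %x, 0", "br %1, then, else"])
then_bb = BasicBlock("then", instrs=["%2 = add %x, 1", "br merge"])
else_bb = BasicBlock("else", instrs=["%3 = sub %x, 1", "br merge"])
merge = BasicBlock("merge", instrs=["%4 = phi [%2, then], [%3, else]", "ret %4"])

# Control transfer between blocks is represented by graph edges.
entry.succs = [then_bb, else_bb]
then_bb.succs = [merge]
else_bb.succs = [merge]

print([b.name for b in entry.succs])  # -> ['then', 'else']
```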
## Intermediate Representation in Machine Learning Frameworks
Classical IRs (such as LLVM IR) primarily target programming languages
for general-purpose computation tasks, and thus fall short of satisfying
the unique requirements of machine-learning-related computation. When
designing IRs tailored for machine learning frameworks, certain vital
factors warrant attention:
- **Tensor Representation**. Given the predominance of tensor data in
machine learning frameworks, it's imperative that the IRs can
effectively handle tensor representation.
- **Automatic Differentiation**. A core aspect of machine learning
involves evaluating derivatives of neural networks and optimizers
through automatic differentiation. Accordingly, IRs must prioritize
simplicity, performance, and scalability of higher-order
differentials for automatic differentiation.
- **Computational Graph Mode**. Machine learning frameworks like
TensorFlow, PyTorch, and MindSpore operate on two computational
graph modes: static and dynamic. The static mode, with pre-defined
computational graphs, enhances optimization but compromises on
flexibility. Conversely, the dynamic mode trades running speed for
flexibility and easier debugging by executing operators immediately
in the computational graph. IRs should therefore support both modes,
enabling users to choose the one best suited for their tasks while
building algorithm models.
- **Support for Higher-order Functions and Closures**. Essential in
functional programming, higher-order functions take or return
functions, while closures bundle code blocks with references to the
surrounding environment, facilitating access to an outer function's
scope from an inner function. Such support reduces redundant code,
improves abstraction, and enhances the flexibility and simplicity of
framework representations.
- **Compilation Optimization**. Machine learning frameworks lean on
compilation optimizations, including hardware-agnostic,
hardware-specific, and deployment- or inference-related
optimizations. These rely significantly on IRs implementations.
- **Just-in-Time (JIT) Compilation**. For expedited compilation and
execution in machine learning frameworks, JIT compilation is
frequently utilized. Optimization of JIT compilation, including loop
unrolling, fusion, and inlining, plays a crucial role in optimizing
parts of data flow graphs in IRs. A flawed IR design could
potentially hamper JIT compilation performance in machine learning
frameworks, thereby impacting the program's running capabilities.
Considering these factors, developers persistently refine classical IRs
and introduce new IRs specifically tailored for machine learning
frameworks. In the following section, we will delve into the IRs
employed by various machine learning frameworks.
### Intermediate Representation in PyTorch
PyTorch is a dynamic, Python-oriented machine learning framework.
Renowned for its usability and flexibility, PyTorch simplifies the
process of writing and debugging machine learning programs. It
introduces TorchScript, a method used for constructing serializable and
optimizable models during the saving and loading of neural networks.
Particularly, TorchScript IR employs JIT compilation to convert Python
code into target model files. All TorchScript programs can be saved
within the Python process and later loaded into processes devoid of
Python dependencies.
Aligning with the imperative programming paradigm, PyTorch incorporates
the TorchScript IR, composed primarily of static single assignment
(SSA)-based linear IRs, to represent Python code. This representation
can be achieved through either the Tracing or Scripting method of JIT
compilation. TorchScript IR not only amplifies model deployment
capabilities but also bolsters compilation performance. Additionally,
TorchScript IR greatly improves the model visualization within the
PyTorch framework.
Code `lst:torchscript` illustrates the use of the Scripting method
to print a TorchScript IR graph.
**lst:torchscript**
```python
import torch

@torch.jit.script
def test_func(input):
    rv = 10.0
    for i in range(5):
        rv = rv + input
        rv = rv / 2
    return rv

print(test_func.graph)
```
Code `lst:torchscriptir` shows the structure of this IR graph.
**lst:torchscriptir**
```
graph(%input.1 : Tensor):
  %9 : int = prim::Constant[value=1]()
  %5 : bool = prim::Constant[value=1]() # test.py:6:1
  %rv.1 : float = prim::Constant[value=10.]() # test.py:5:6
  %2 : int = prim::Constant[value=5]() # test.py:6:16
  %14 : int = prim::Constant[value=2]() # test.py:8:10
  %rv : float = prim::Loop(%2, %5, %rv.1) # test.py:6:1
    block0(%i : int, %rv.9 : float):
      %rv.3 : Tensor = aten::add(%input.1, %rv.9, %9) # <string>:5:9
      %12 : float = aten::FloatImplicit(%rv.3) # test.py:7:2
      %rv.6 : float = aten::div(%12, %14) # test.py:8:7
      -> (%5, %rv.6)
  return (%rv)
```
### Intermediate Representation in JAX
The JAX framework facilitates both static and dynamic computational
graphs and employs the jaxpr (JAX expression) IR. This IR
ensures that the output, not reliant on global variables, depends solely
on the input, with both input and output encapsulating typed
information. Functionality-wise, Jaxpr IR supports an array of features
such as loops, branching, recursion, closure function differentiation,
third-order differentiation, as well as backpropagation and forward
propagation in automatic differentiation.
Jaxpr IR utilizes the A-normal Form (ANF), a form of functional
expression, demonstrated in
Code `lst:ANF`
via the ANF grammar.
**lst:ANF**
```
<aexp> ::= NUMBER | STRING | VAR | BOOLEAN | PRIMOP
| (lambda (VAR ...) <exp>)
<cexp> ::= (<aexp> <aexp> ...)
| (if <aexp> <exp> <exp>)
<exp> ::= (let ([VAR <cexp>]) <exp>) | <cexp> | <aexp>
```
The ANF segregates expressions into atomic expressions (aexp) and
compound expressions (cexp). Atomic expressions represent constants,
variables, primitives, and anonymous functions, while compound
expressions, comprising several atomic expressions, can be viewed as
invocations of anonymous or primitive functions. The first input in a
cexp represents the invoked function, and all subsequent inputs
symbolize the invoked parameters.
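ANF conversion itself can be sketched in a few lines. The code below is a hypothetical toy (expressions are tuples, fresh names are `%1`, `%2`, ...): every nested compound expression is flattened so that each intermediate result is bound with `let` and all operands become atomic.

```python
# Toy ANF conversion: flatten nested expressions into let-bindings
# whose right-hand sides have only atomic operands.
import itertools

_counter = itertools.count(1)

def to_anf(expr, bindings):
    # expr is atomic (a str variable or int constant) or a tuple (op, a, b).
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a = to_anf(a, bindings)        # recursively atomize each operand
    b = to_anf(b, bindings)
    tmp = f"%{next(_counter)}"     # fresh variable for this subexpression
    bindings.append((tmp, (op, a, b)))
    return tmp

bindings = []
result = to_anf(("*", ("+", "x", 1), "y"), bindings)
for var, rhs in bindings:
    print(f"let {var} = {rhs}")
print(f"in {result}")
# let %1 = ('+', 'x', 1)
# let %2 = ('*', '%1', 'y')
# in %2
```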
Code `lst:JaxCode` displays the Jaxpr corresponding to a function.
**lst:JaxCode**
```python
from jax import make_jaxpr
import jax.numpy as jnp

def test_func(x, y):
    ret = x + jnp.sin(y) * 3
    return jnp.sum(ret)

print(make_jaxpr(test_func)(jnp.zeros(8), jnp.ones(8)))
```
The structure of this Jaxpr is shown in
Code `lst:JaxPr`.
**lst:JaxPr**
```
{ lambda ; a:f32[8] b:f32[8]. let
c:f32[8] = sin b
d:f32[8] = mul c 3.0
e:f32[8] = add a d
f:f32[] = reduce_sum[axes=(0,)] e
in (f,) }
```
### Intermediate Representation in TensorFlow
TensorFlow utilizes dataflow programming to execute numerical
computations through dataflow graphs. TensorFlow's static graph
mechanism progresses through a series of abstractions and analyses when
running a program, transforming it from higher-level to lower-level IRs,
a process referred to as "lowering".
To cater to diverse hardware platforms, TensorFlow employs a range of IR
designs. As illustrated in
Figure :numref:`ch04/ch04-tensorflow_ecosystem`, the blue boxes denote
graph-based IRs while the green ones indicate SSA-based IRs. During the
IR transformation, each level optimizes the IR independently, precluding
communication with other levels. This absence of awareness about
optimizations performed at other levels necessitates optimal
implementation at each level, often leading to repetitive tasks and
sub-optimal efficiency. Notably, transitioning from graph-based IRs to
SSA-based IRs involves a qualitative transformation that incurs
significant costs. The inability to reuse the same optimization code
across levels also hampers development efficiency.
Multi-level IRs present a mixed bag of advantages and disadvantages. On
the plus side, they offer flexible representations, pass-based
optimization at varying levels, and efficient optimization algorithms.
On the downside, they pose challenges due to their inherent
characteristics: The transformation between different IRs often
complicates full compatibility implementation, thereby increasing
engineering workload and potentially leading to information loss. This
might make lower-level optimization challenging if information at a
higher level has been optimized. To mitigate such information loss, we
can impose stricter constraints on the optimization sequence.
Additionally, choosing the level for implementing certain optimizations
that can be performed at two adjacent levels can be a conundrum for
framework developers. Finally, defining distinct operator granularities
at different levels might impact accuracy to a certain degree.
![TensorFlow's IRdesign](../img/ch04/IR-MLIR.png)
:label:`ch04/ch04-tensorflow_ecosystem`
### Multi-Level Intermediate Representation
Multi-Level Intermediate Representation (MLIR) serves as a unified
platform for IRs rather than being a specific type of IR. Leveraging the
infrastructure provided by MLIR, developers can define IRs to suit their
needs. Thus, MLIR can be interpreted as a "compiler of compilers". It
expands beyond the TensorFlow framework and can be used to construct IRs
linking other languages to backend platforms (such as LLVM).
Despite the design of MLIR being heavily influenced by LLVM, MLIR
fosters a more open ecosystem. Given that MLIR does not confine
developers to a set group of operation or abstraction types, it offers
more latitude to define IRs and solve specific problems. To facilitate
this extensibility, MLIR introduces the concept of "dialects". These
provide a grouping mechanism for abstraction under a unique namespace.
Each dialect lays out a production and associates an operation to an IR,
thus producing an MLIR-typed IR. Within MLIR, the "operation" is the
fundamental unit of abstraction and computation. Operations can carry
application-specific semantics and encapsulate all the core IR
structures in LLVM, including instructions, functions, modules, etc.
The MLIR assembly for an operation is illustrated as follows:
```
%tensor = "toy.transpose"(%tensor) {inplace = true} : (tensor<2x3xf64>) -> tensor<3x2xf64> loc("example/file/path":12:1)
```
This MLIR operation can be dissected as follows:
- `%tensor`: The identifier for the result defined by this operation
  (prefixed with `%` to prevent naming conflicts). An operation may
  define zero or more results, represented as SSA values.
- `"toy.transpose"`: The operation name. It is usually a unique
  string, with the dialect's namespace placed before the ".". This
  refers to the transpose operation within the toy dialect.
- `(%tensor)`: A list that can contain zero or more input operands (or
  arguments), which are SSA values defined by other operations or that
  refer to block arguments.
- `inplace = true`: A dictionary that may contain zero or more
  attributes. These are constant special operands. Here, a boolean
  attribute named `inplace` with a constant value of `true` is
  defined.
- `(tensor<2x3xf64>) -> tensor<3x2xf64>`: This represents the
  operation type in a functional form, specifying the input before the
  arrow and the output after it. The data types and shapes of the input
  and output are given within the angle brackets. For instance,
  `tensor<2x3xf64>` represents a tensor with a shape of `(2, 3)` and
  data type `float64`.
- `loc("example/file/path":12:1)`: This refers to the source code
  location from where this operation originated.
As each level's IR design adheres to this assembly, it simplifies
transformation across levels, boosting the efficiency of IR
transformation. Moreover, different levels can interact to optimize the
IRs, enabling optimization to be performed at the most suitable level,
thereby negating the need for optimal performance at each level. By
transforming them into the IR at the most appropriate level, other IRs
can be optimized, enhancing both optimization and development
efficiency. TensorFlow can also employ MLIR to perform multi-level
transformations from graph-based IRs to SSA-based IRs.
### Intermediate Representation in MindSpore
MindSpore adopts graph-based functional IRs, known as MindSpore IR
(abbreviated to MindIR). MindIR employs a unified IR approach instead of
a multi-level IR structure, outlining the network's logical structure
and operator attributes. This approach eliminates model disparities
across different backends, facilitating connections to various target
machines.
MindIR primarily caters to the automatic differential transformation. It
implements a transformation method grounded in functional programming
frameworks, thereby making it similar to ANF (A-Normal Form) functional
semantics. Its defining characteristics include:
1. **Graph-based Representation**. MindSpore represents programs as
graphs which are conducive to optimization. MindSpore treats
functions as essential elements of a machine learning program,
allowing for recursive invocation, parameter passing, or returning
from other functions. This ability paves the way for representing a
range of control flow structures.
2. **Purely Functional**. In a purely functional context, the function
outcomes depend solely on parameters. Side effects are potential
issues when a function relies on or affects external states, such as
global variables. These can lead to incorrect results if code
execution sequence isn't strictly maintained. These side effects can
also impact automatic differentiation, necessitating the requirement
for pure functions. MindIR has the capability to transform
representations with side effects into purely functional
representations, ensuring correct code execution sequence while
upholding ANF functional semantics and enabling a higher degree of
automatic differentiation freedom.
3. **Closure Representation**. Reverse mode automatic differentiation
requires the storage of basic operation intermediate results in
closures for a combined connection. Closures, the combination of a
code block bundled with references to its surrounding environment,
become particularly crucial. In MindIR, the code block takes the
shape of a function diagram, with the surrounding environment
interpreted as the function invocation context.
4. **Strongly Typed**. Each node requires a specific type for achieving
optimal performance. This is particularly crucial in machine
learning frameworks where operator execution can be time-consuming.
Detecting errors at the earliest can help save valuable time.
MindIR's type and shape inference capabilities thus center on the
support for function invocation and higher-order functions.
Figure :numref:`ch04/ch04-MindIR` outlines the MindIR grammar based on
MindSpore framework's characteristics. ANode corresponds to an atomic
expression in ANF, ValueNode represents the constant value,
ParameterNode signifies the function's formal parameter, and CNode
(corresponding to a compound expression in ANF) indicates function
invocation.
![MindIR grammar](../img/ch04/IR-MindIR.png)
:label:`ch04/ch04-MindIR`
The example provided below in Code `lst:MindSporeCode` offers a deeper analysis of MindIR.
**lst:MindSporeCode**
```
def func(x, y):
    return x / y

@ms_function
def test_f(x, y):
    a = x - 1
    b = a + y
    c = b * func(a, b)
    return c
```
The ANF expression corresponding to this function is demonstrated in
Code `lst:MindIR`.
**lst:MindIR**
```
lambda (x, y)
    let a = x - 1 in
    let b = a + y in
    let func = lambda (x, y)
        let ret = x / y in
        ret end in
    let %1 = func(a, b) in
    let c = b * %1 in
    c end
```
In ANF, each expression is encapsulated as a variable utilizing the
`let` expression, with dependencies on the expression's output
represented via variable references. In contrast, MindIR packages each
expression as a node, portraying dependencies through directed edges
connecting the nodes.

# Overview of AI Compiler Frontends
Figure :numref:`ch04/compiler_frontend_structure` depicts the typical
structure of the AI compiler frontend within a machine learning
framework. As AI compilers parse source programs similarly to classical
compilers, we will not detail the parsing process here. Instead, we will
explore a feature unique to the compiler frontend in a machine learning
framework - its automatic differentiation functionality. To enact
automatic differentiation, the machine learning framework requires a new
IR structure built upon classical IRs. Consequently, this section
concentrates on IRs and automatic differentiation, and later provides a
succinct introduction to basic compiler concepts, including type
systems, static analysis, and frontend optimization.
![Typical structure of an AI compiler frontend](../img/ch04/compiler_frontend_structure.png)
:label:`ch04/compiler_frontend_structure`
An **Intermediate Representation** is a data structure, or a form of
code, employed by a compiler to represent source code. Essentially, an
IR serves as a bridge between a source language and a target language
during the compilation process. In classical compilers, IRs are divided
into linear IR, graphical IR, and hybrid IR. However, as these classical
IRs do not provide the comprehensive range of functionalities required
by machine learning frameworks, developers have extended classical IRs
and proposed numerous new IRs specifically for machine learning
frameworks.
**Automatic Differentiation** is a method used to compute derivatives
and efficiently resolve symbols for computational graphs. Combining the
benefits of both symbolic and numerical differentiation while mitigating
their shortcomings, automatic differentiation proves particularly
valuable in calculating the gradient of a function. Modern AI
algorithms, such as deep learning algorithms, use vast amounts of data
to learn models with various parameters, and typically employ a gradient
descent approach to update these parameters. Therefore, automatic
differentiation is crucial to deep learning and becomes an integral
component of training algorithms. Automatic differentiation generally
resolves IR symbols during the frontend optimization process to generate
new IRs with gradient functions.
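To make the idea concrete, forward-mode automatic differentiation can be sketched with dual numbers, where every value carries its derivative alongside it and each basic operation applies its own derivative rule (a toy illustration, not any framework's implementation):

```python
# Minimal forward-mode automatic differentiation using dual numbers.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot   # value and derivative w.r.t. the input

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def grad(f, x):
    """Derivative of scalar f at x: seed the input's dot component with 1."""
    return f(Dual(x, 1.0)).dot

# d/dx (x*x + 3x) at x = 2  ->  2x + 3 = 7
print(grad(lambda x: x * x + 3 * x, 2.0))  # 7.0
```

Each overloaded operator is one "basic operation" whose derivative rule is known; the chain rule emerges from composing them during the forward evaluation.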
**Type Systems and Static Analysis** are incorporated into the compiler
frontend to help reduce potential runtime errors. A type system can
avert type errors during program execution, while static analysis offers
insights and other information for compilation optimization, effectively
reducing issues like structural errors and security vulnerabilities in
program code.
**Frontend Compilation Optimization** aims to tackle code efficiency
issues. It is a significant aspect in both classical compilers and
machine learning frameworks and is independent of specific hardware
types.
# Chapter Summary
- Intermediate Representation (IR) serves as one of the fundamental
data structures of a compiler. It represents the transition from the
source language to the target language during the process of program
compilation.
- Classical compilers categorize IRs into three types based on their
structure: linear IR, graphical IR, and hybrid IR.
- The demands imposed by machine learning frameworks necessitate new
forms of IRs, as classical IRs fail to fully satisfy these
requirements. Therefore, innovative IRs that are more compatible
with these frameworks must be developed based on classical IRs.
- The central principle in automatic differentiation is the
decomposition of a program's arithmetic operations into a finite set
of basic operations. Knowing the derivative evaluation rules for all
these operations allows for the calculation of the derivative for
each basic operation. Subsequently, these results are aggregated
using the chain rule to obtain the derivative result for the entire
program.
- Automatic differentiation operates in two modes---forward-mode and
reverse-mode---based on the sequence adopted by the chain rule for
combining derivatives.
- Forward-mode automatic differentiation is applied when evaluating
the derivative of a network where the input dimension is smaller
than the output dimension. In contrast, reverse-mode automatic
differentiation is employed when the output dimension of a network
is smaller than the input dimension.
- Implementation methods for automatic differentiation encompass
elemental libraries, operator overloading, and source
transformation.
- Type systems, which are utilized to define various types, detail the
operations of each type and outline the interactions among types.
Comprising a set of types and the type-based rules that delineate
program behavior, type systems are extensively used in compilers,
interpreters, and static checking tools.
- Static analysis involves the inspection and verification of code
through lexical analysis, syntactic analysis, control flow analysis,
and data flow analysis, all of which are conducted without executing
the programs.
- The objective of compilation optimization is to boost the efficiency
of the IRs generated during the compilation process. Notably,
compilation optimization conducted at the frontend is
hardware-agnostic.
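The reverse-mode variant summarized above can be sketched with a small tape that records operations during the forward pass and replays their derivative rules backward (a toy illustration; `Var` and `backprop` are hypothetical names, not a real framework's API):

```python
# Minimal reverse-mode automatic differentiation with a recorded graph.
class Var:
    def __init__(self, val):
        self.val, self.grad = val, 0.0
        self._backward, self._parents = (lambda: None), ()

    def __add__(self, other):
        out = Var(self.val + other.val)
        out._parents = (self, other)
        def backward():            # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Var(self.val * other.val)
        out._parents = (self, other)
        def backward():            # product rule
            self.grad += other.val * out.grad
            other.grad += self.val * out.grad
        out._backward = backward
        return out

def backprop(out):
    """Propagate gradients from the output back through recorded ops."""
    topo, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p in v._parents:
                visit(p)
            topo.append(v)
    visit(out)
    out.grad = 1.0
    for v in reversed(topo):
        v._backward()

x, y = Var(2.0), Var(3.0)
z = x * y + x          # z = xy + x; dz/dx = y + 1 = 4, dz/dy = x = 2
backprop(z)
print(x.grad, y.grad)  # 4.0 2.0
```

One forward pass followed by one backward sweep yields the gradient with respect to every input, which is why this mode suits networks whose output dimension is smaller than the input dimension.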
# Type Systems and Static Analysis
In the realm of compiler frontends, type systems and static analysis
play instrumental roles in bolstering the compiler's abstraction
prowess, while simultaneously mitigating potential errors that may arise
during program runtime. This section delves into the basic principles,
functionalities, and quintessential examples related to type systems and
static analysis.
## Type Systems
In the context of programming languages, 'types' represent certain
attributes, which could be numerical values, expressions, or functions.
Type systems, which define these varied types, also determine the
operations applicable to each type and orchestrate the interactions
among these types. Essentially, a type system comprises a set of types
and type-oriented rules that dictate the behavior of a program. They
find extensive applications in compilers, interpreters, and static
checking tools, offering the following capabilities:
1. **Precision**: Type systems in compilers deploy type checking to
detect potential runtime errors, thus enhancing runtime safety.
Leveraging type inference and type checking, the compiler can
identify the majority of type-associated exceptions and errors,
thereby averting runtime errors such as those triggered by program
exceptions. This also ensures memory safety and thwarts invalid
computations and semantic logic errors between types.
2. **Optimization**: The information obtained from static type checking
enables the compiler to execute more efficient instructions, thereby
reducing the runtime duration.
3. **Abstraction**: A type system, when employed with adept
abstraction, can significantly boost system performance, given the
system remains secure. Such streamlined abstraction allows
developers to concentrate their efforts on high-level design.
4. **Readability**: The use of explicit type declarations amplifies
code readability, enabling readers to grasp the program code more
effectively.
Machine learning frameworks frequently use Python, a dynamically and
strongly typed language, as the frontend language for describing neural
network model structures. Python's simplicity and ease of development
have earned it popularity, despite the slower execution that results
from its interpreted execution mode.
While Python offers users dynamic and flexible semantics at the
frontend, the backend framework demands static and strongly typed IRs
that are optimization-friendly, to generate efficient backend code. To
transform Python frontend representations into their equivalent static
and strongly typed IRs, we require an effective and trustworthy static
analysis method to enhance both development and execution efficiency.
A notable example is the Hindley--Milner (HM) type system---a type
system that caters to the simply typed lambda calculus with parametric
polymorphism. Initially proposed by J. Roger Hindley, the HM type
system was subsequently expanded and validated by Robin Milner. Later,
Luis Damas conducted a comprehensive formal analysis and proof of this
system, further extending it to support polymorphic references. The HM
type system is designed to infer the type of any expression
automatically, without requiring any given type annotations. It employs
a versatile algorithm to represent expressions using simple symbols and
infer clear and intuitive definitions. This type system is widely used
for type inference and type checking in the design of programming
languages such as Haskell and OCaml.
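As a toy illustration of the unification step at the heart of HM-style inference (omitting let-polymorphism and the occurs check; every name here is hypothetical), type variables can be resolved against concrete types as follows:

```python
# Toy unification for a tiny type language: type variables are strings
# beginning with an apostrophe; composite types are tuples.
def resolve(t, subst):
    """Follow substitutions until a type variable reaches its binding."""
    while isinstance(t, str) and t in subst:
        t = subst[t]
    if isinstance(t, tuple):
        return tuple(resolve(x, subst) for x in t)
    return t

def unify(a, b, subst):
    """Extend subst so that a and b denote the same type, or fail."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if isinstance(a, str) and a.startswith("'"):
        return {**subst, a: b}
    if isinstance(b, str) and b.startswith("'"):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
        return subst
    raise TypeError(f"cannot unify {a} with {b}")

# Infer `lambda x: x + 1` against (+) : int -> int, writing a function
# type as ("fun", argument, result):
subst = unify(("fun", "'x", "'r"), ("fun", "int", "int"), {})
print(resolve("'x", subst), resolve("'r", subst))  # int int
```

No annotations were given for `'x` or `'r`; both fall out of unifying the lambda's type against the known type of the addition operator, which is the essence of HM inference.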
## Static Analysis
Once a type system has been established, we must then construct a static
analysis system. This will allow the compiler to perform static checking
and analysis of IRs. Initially, the syntax parser deciphers the program
code and forms an abstract syntax tree based on the resultant data,
which subsequently generates the corresponding IR. As this IR lacks the
abstract information stipulated in the type system, a static analysis
module is needed to process and scrutinize the IR. This paves the way
for a statically and strongly typed IR, which is indispensable for
subsequent steps such as compilation optimization, automatic
parallelization, and automatic differentiation. During the process of
compiling program code, the frontend compiler might execute static
analysis several times. In certain frameworks, the decision to terminate
compilation optimization could be based on the outcome of static
analysis.
The static analysis module is responsible for executing operations like
type inference and generic specialization on IRs, utilizing abstract
interpretations. Alongside these processes, the following operations are
also undertaken:
1. **Abstract Interpretation**: This involves an abstract interpreter
creating a generalized abstraction of a language's semantics,
garnering only the attributes needed for subsequent optimization,
and carrying out interpretive execution on ambiguous aspects.
Abstract values typically include aspects like the types and
dimensions of variables.
2. **Type Inference**: Based on abstract interpretation, the compiler
can infer the abstract types of variables or expressions within the
program code. This process is integral to facilitating subsequent
compilation optimization that hinges on type information.
3. **Generic Specialization**: During the compilation phase, the
compiler carries out type inference, a necessary precursor for
generic specialization. This helps determine the type of function to
be invoked. Subsequently, the compiler conducts type replacement
(provided it can supply the context of types), generating a distinct
function method for each type through generic specialization.
To illustrate the implementation of the static analysis module, we can
consider the example of the MindSpore framework. MindSpore employs
abstract interpretation to perform interpretive execution on uncertain
abstract semantics, thereby acquiring abstract values. These abstract
values for each node in a function graph represent the anticipated
static program information. Within an abstract interpretation method,
interpretive execution commences from the entry point of a top-level
function graph in MindIR. This is followed by topological sorting of all
nodes in the function graph, and the recursive inference of the abstract
value for each node, based on node semantics. If there are any function
subgraphs involved, interpretive execution is carried out within each
subgraph recursively. The outcome of this process is the abstract value
of the top-level function's output node. The static analysis module in
MindSpore consists of several components, such as the abstract domain
module, cache module, semantics inference module, and control flow
processing module, as illustrated in
Figure :numref:`ch04/ch04-compiler-frontend`.
![Static analysis module](../img/ch04/static_analysis_module.png)
:label:`ch04/ch04-compiler-frontend`
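The per-node, topologically ordered inference of abstract values described above can be sketched as follows, using tensor shapes as the abstract values (the node format and inference rules are hypothetical simplifications, not MindSpore's actual API):

```python
# Abstract interpretation sketch: propagate abstract values (shapes)
# through a function graph in topological order.
def infer_abstract(nodes, inputs):
    """nodes: (name, op, input_names) triples, already topologically sorted;
    inputs: abstract values of the entry parameters."""
    shapes = dict(inputs)
    rules = {
        "add":    lambda a, b: a,                # assumes matching shapes
        "matmul": lambda a, b: (a[0], b[1]),     # (m,k) @ (k,n) -> (m,n)
        "relu":   lambda a: a,                   # element-wise: shape unchanged
    }
    for name, op, ins in nodes:
        shapes[name] = rules[op](*(shapes[i] for i in ins))
    return shapes

graph = [
    ("h", "matmul", ["x", "w"]),
    ("z", "relu",   ["h"]),
]
print(infer_abstract(graph, {"x": (32, 784), "w": (784, 10)})["z"])  # (32, 10)
```

No tensor is ever materialized; only the abstract attribute (here, the shape) is interpreted, which is what makes the analysis static.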
# Machine Learning Applications
In general terms, machine learning is a technology that learns useful
knowledge from data. There are a variety of machine learning methods,
including supervised learning, unsupervised learning, and reinforcement
learning.
1. In supervised learning, the mapping relationships between inputs and
outputs are known to machines. For example, a discrete label can be
assigned to an input image.
2. In unsupervised learning, input data is provided to machines without
any labels assigned. For example, to distinguish cats and dogs among
a group of images, a machine needs to learn by itself the
characteristics of cats and dogs in order to classify them. This
unsupervised classification is also called clustering.
3. In reinforcement learning, an algorithm that runs on the machine
automatically improves itself to achieve the task objective in a
given learning environment. A well-known example of this is AlphaGo,
in which the rules of Go serve as the learning environment and the
victory score is set as the task objective.
Machine learning is applied in a variety of fields --- computer vision,
natural language processing (NLP), and intelligent decision-making, to
name just a few. Computer vision, in a narrow sense, includes all
image-based applications, such as facial recognition, object
recognition, target tracking, human pose estimation, and image
understanding. It is widely used in autonomous driving, smart city,
smart security, and other scenarios.
NLP involves both text- and speech-related applications, including
language translation, text-to-speech and speech-to-text conversion, text
understanding, and image style transfer. NLP and computer vision overlap
in many aspects. For instance, in order to generate text description for
images, or to generate or process images based on texts, machines need
to handle both language and image data.
Intelligent decision-making is usually achieved through technical means
such as computer vision, NLP, reinforcement learning, and cybernetics.
It is widely used in many scenarios, such as robotics, autonomous
driving, games, recommender systems, smart factories, and smart grids.
These machine learning applications use different underlying algorithms
--- such as support vector machine (SVM), logistic regression, and naive
Bayes --- based on the needs and characteristics of the applications. In
recent years, deep learning has progressed significantly thanks to the
availability of massive data, development of neural network algorithms,
and maturity of hardware accelerators. But despite a wide variety of
machine learning algorithms, the vast majority of computation work still
relies on vector and matrix operations, regardless of whether classical
or deep learning algorithms are employed. In this book, we therefore
discuss machine learning systems that employ neural networks.
# Machine Learning Framework Architecture
Figure :numref:`intro/framework-architecture` shows the basic
architecture of a typical, complete machine learning framework.
![Architecture of a machine learning framework](../img/intro/framework-architecture.png)
:label:`intro/framework-architecture`
1. **Programming interfaces:** A machine learning framework needs to
provide programming interfaces, usually those of high-level
programming languages (like Python), to cater for the diversified
backgrounds of machine learning developers. At the same time, the
framework also needs to support a system implementation that is
mainly based on low-level programming languages (e.g., C and C++) so
that operating system features (e.g., thread management and network
communication) and various hardware accelerators can be utilized
efficiently for optimized performance.
2. **Computational graph:** Machine learning applications, though
implemented through different programming interfaces, need to share
the same backend when the applications run. The computational graph
technology is key to realizing this backend. A computational graph,
which defines a user's machine learning application, includes many
graph nodes that represent computational operations. These nodes are
connected by edges, which represent computational dependencies.
3. **Compiler frontend:** Once a computational graph is built, the
machine learning framework analyzes and optimizes it (or the
corresponding application) through the compiler frontend. The
compiler frontend provides key functions such as intermediate
representation, automatic differentiation, type derivation, and
static analysis.
4. **Compiler backend and runtime:** After analyzing and optimizing the
computational graph, the machine learning framework uses the
compiler backend and runtime to optimize different types of
underlying hardware. In addition to optimizing the selection or
scheduling sequence of operators, common optimization technologies
usually analyze the L2/L3 cache size and the instruction pipeline
length to match hardware specifications.
5. **Heterogeneous processors:** A machine learning application is
co-executed by central processing units (CPUs) and hardware
accelerators (such as NVIDIA GPUs, Huawei Ascend processors, and
Google TPUs). During the execution, non-matrix operations (e.g.,
complex data preprocessing and computational graph scheduling) are
handled by CPUs, whereas matrix operations and certain frequently
used machine learning operators (e.g., Transformer operators and
convolution operators) are performed by hardware accelerators.
6. **Data processing:** A machine learning application needs to perform
complex preprocessing on raw data and manage a large number of
training, validation, and test datasets. The data processing module
(e.g., the tf.data module of TensorFlow, or the DataLoader module of
PyTorch) is responsible for such data-centered operations.
7. **Model deployment:** In addition to model training, model
deployment is another key function needed in a machine learning
framework. Model compression technologies --- such as model
conversion, quantization, and distillation --- enable us to run
models on hardware with limited memory. It is also necessary to
optimize model operators for specific hardware inference platforms
(e.g., NVIDIA Orin). Furthermore, in order to ensure the security of
a model (e.g., to deny unauthorized user reads), model obfuscation
must be considered in the framework's design.
8. **Distributed training:** A machine learning model is usually
trained in parallel on distributed compute nodes. Common parallel
training methods include data parallelism, model parallelism, hybrid
parallelism, and pipeline parallelism, all of which are usually
implemented through the remote procedure call (RPC), collective
communication, or parameter server.
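The dependency edges of the computational graph described above determine a valid execution order; a minimal sketch using Kahn's topological sort (illustrative names only, not a real framework's scheduler):

```python
from collections import deque

def execution_order(deps):
    """deps maps each node to the list of nodes it depends on;
    returns an order in which every node follows its dependencies."""
    indegree = {n: len(d) for n, d in deps.items()}
    users = {n: [] for n in deps}
    for n, d in deps.items():
        for p in d:
            users[p].append(n)            # reverse edges: who consumes n
    ready = deque(n for n, k in indegree.items() if k == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for u in users[n]:
            indegree[u] -= 1
            if indegree[u] == 0:          # all inputs of u are now computed
                ready.append(u)
    return order

# matmul depends on x and w; loss depends on matmul and label.
deps = {"x": [], "w": [], "label": [],
        "matmul": ["x", "w"], "loss": ["matmul", "label"]}
order = execution_order(deps)
print(order[-1])  # loss
```

The same dependency structure is what lets a runtime execute independent nodes in parallel: any two nodes not ordered by a path of edges may run concurrently.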
# Design Objectives of Machine Learning Frameworks
*Machine learning frameworks* (e.g., TensorFlow, PyTorch, and MindSpore)
were designed and implemented so that machine learning algorithms could
be developed efficiently for different applications. In a broad sense,
these frameworks achieved the following common design objectives.
1. **Neural network programming:** The huge success of deep learning
has solidified neural networks as the core of many machine learning
applications. People need to customize neural networks to meet their
specific application requirements --- such customization typically
results in the creation of convolutional neural networks (CNNs) and
self-attention neural networks. In order to develop, train, and
deploy these networks, we need a generic system software.
2. **Automatic differentiation:** The training of neural networks
involves continuously computing gradients through the combined use
of training data, data annotation, and a loss function to
iteratively improve model parameters. Computing gradients manually
is a complex and time-consuming task. Consequently, a machine
learning framework is expected to compute gradients automatically
based on a neural network application provided by developers. This
computation process is called automatic differentiation.
3. **Data management and processing:** Data is the key to machine
learning. There are several types of data, including training,
validation, and test datasets, as well as model parameters. A
machine learning system should be able to read, store, and
preprocess (data augmentation and cleansing are examples of
preprocessing) these types of data by itself.
4. **Model training and deployment:** A machine learning model is
expected to provide optimal performance. In order to achieve this,
we need to use an optimization method --- for example, mini-batch
stochastic gradient descent (SGD) --- to repeatedly compute
gradients through multi-step iteration. This process is called
training. Once the training is complete, we can then deploy the
trained model to the inference device.
5. **Hardware accelerators:** Many core operations in machine learning
can be deemed as matrix computation. To accelerate such computation,
machine learning developers leverage many specially designed
hardware components referred to as hardware accelerators or AI
chips.
6. **Distributed training:** As the volume of training data and the
number of neural network parameters increase, the amount of memory
used by a machine learning system far exceeds what a single machine
can provide. Therefore, a machine learning framework should be able
to train models on distributed machines.
Early attempts by developers to design such a framework employed
traditional methods such as *neural network libraries* (e.g., Theano and
Caffe) and *data processing frameworks* (e.g., Apache Spark and Google's
Pregel), but the results were disappointing. At that time, neural
network libraries lacked the ability to manage and process large
datasets, deploy models, or perform distributed model execution, meaning
they were not qualified enough for developing today's product-level
machine learning applications even though they supported neural network
development, automatic differentiation, and hardware accelerators.
Furthermore, data-parallel computing frameworks were not suitable for
developing neural network--centered machine learning applications
because they lacked support for neural networks, automatic
differentiation, and accelerators, although such frameworks were already
mature in supporting distributed running and data management.
These drawbacks led many enterprise developers and university
researchers to design and implement their own software frameworks for
machine learning from scratch. In only a few short years, numerous
machine learning frameworks emerged --- well-known examples of these
include TensorFlow, PyTorch, MindSpore, MXNet, PaddlePaddle, OneFlow,
and CNTK. These frameworks boosted the development of AI significantly
in both upstream and downstream industries. Table
:numref:`intro-comparison` lists the differences between machine
learning frameworks and other related systems.
:Differences between machine learning frameworks and related systems
|Design Method | Neural Network | Automatic Differentiation | Data Management | Training and Deployment | Accelerator | Distributed Training |
|----------------------------|----------------|----------------------------|-------------------|---------------------------|---------------|----------------------|
|Neural network libraries | Yes | Yes | No | No | Yes | No |
|Data processing frameworks | No | No | Yes | No | No | Yes |
|Machine learning frameworks | Yes | Yes | Yes | Yes | Yes | Yes |
:label:intro-comparison
# Application Scenarios of Machine Learning Systems
A machine learning framework is commonly utilized in diverse scenarios,
giving rise to a range of *machine learning systems*. In a broader
context, a machine learning system refers to a collective term
encompassing a variety of software and hardware systems that facilitate
and execute machine learning applications. Figure
:numref:`intro/system-ecosystem` provides an overview of the
various application scenarios for machine learning systems.
![Application scenarios of machine learning systems](../img/intro/system-ecosystem.png)
:label:`intro/system-ecosystem`
1. **Federated learning:** Laws and regulations on user privacy
protection and data protection prevent many machine learning
applications from accessing user data directly for model training
purposes. This is where federated learning --- based on a machine
learning framework --- benefits such applications.
2. **Recommender system:** Incorporating machine learning (especially
deep learning) into recommender systems has achieved major success
over the past few years. Compared with traditional rule-based
recommender systems, those based on deep learning can analyze
massive feature data of users more effectively, thereby bringing
huge improvements to the accuracy and timeliness of recommendations.
3. **Reinforcement learning:** Because reinforcement learning is
    special in the way it collects data and trains models, dedicated
    reinforcement learning systems need to be developed based on a
    machine learning framework.
4. **Explainable AI:** As machine learning becomes more and more
popular in many key areas, including finance, healthcare, and
governmental affairs, developing explainable AI systems based on a
machine learning framework is gaining wider attention.
5. **Robotics:** Robotics is another area where the use of machine
learning frameworks is gaining popularity. Compared with traditional
robot vision methods, machine learning methods have achieved
enormous success in several robot tasks, such as automatic feature
extraction, target recognition, and path planning.
6. **Graph learning:** Graphs are the most widely used data structure
and are used to express large volumes of Internet data, for
instance, social network graphs and product relationship graphs.
Machine learning algorithms have been proven effective for analyzing
large-scale graph data. A machine learning system designed to
process graph data is referred to as a graph learning system.
7. **Scientific computing:** Scientific computing covers a wide range
of traditional fields (such as electromagnetic simulation, graphics,
and weather forecast), in which many large-scale problems can be
effectively solved by machine learning methods. Therefore,
developing special machine learning systems for scientific computing
is becoming an increasingly common practice.
8. **Scheduling of a machine learning cluster:** A machine learning
cluster consists of heterogeneous processors, heterogeneous
networks, and even heterogeneous storage devices. But in a machine
learning cluster, computing tasks often have common characteristics
during their execution (e.g., iterative execution based on the
collective communication operator AllReduce). In order to take
account of the cluster's heterogeneity of devices and the common
characteristics in task execution, a machine learning cluster is
often designed to use a special scheduling method.
9. **Quantum computing:** Quantum computers are generally realized
through a hybrid architecture, in which quantum computing is
performed by quantum computers and the simulation of quantum
computers is performed by classical computers. Many simulation
systems (such as TensorFlow Quantum and MindQuantum) are realized on
the basis of a machine learning framework because the simulation
often requires massive matrix computations and gradient computation.
There are too many machine learning systems for this book to cover them
all in depth. Instead, we aim to provide a system designer's perspective
on several core systems used in federated learning, recommenders,
reinforcement learning, explainable AI, and robotics.
# Introduction
This chapter aims to provide readers with a comprehensive understanding
of machine learning systems by describing the applications of machine
learning and summarizing the design objectives and basic composition
principles of such systems.
```toc
:maxdepth: 2
Machine_Learning_Applications
Design_Objectives_of_Machine_Learning_Frameworks
Machine_Learning_Framework_Architecture
Application_Scenarios_of_Machine_Learning_Systems
Book_Organization_and_Intended_Audience
```
# Book Organization and Intended Audience
This book adopts a level-by-level approach to discuss design principles
and implementation practices of machine learning systems. The
**Framework Design** part starts with introducing key concepts that
framework users need to understand, including programming interface
design and computational graph. This part then describes the frontend
and backend techniques used in AI compilers as well as key techniques
for processing data, deploying models, and distributing training to
multiple machines. The **Application Scenarios** part elaborates on
several important types of machine learning systems, such as federated
learning and recommender systems, in an attempt to provide readers with
useful knowledge for both deploying and operating machine learning
frameworks in different application scenarios.
This book is intended for the following readers:
1. **Students:** This book provides a wealth of design principles and
hands-on experience of machine learning systems. Such knowledge will
help students better understand the theoretical pros and cons and
practical challenges of machine learning algorithms.
2. **Researchers:** This book aims to help researchers tackle various
challenges in machine learning implementation and guide them through
the design of next-generation machine learning algorithms meant to
solve large-scale practical problems.
3. **Developers:** We also hope this book will allow developers to gain
a profound understanding on the internal architecture of a machine
learning system. Such knowledge will move them a step further in
developing new functions for their applications, debugging system
performance issues, and even customizing a machine learning system
based on their business needs.
# Model Deployment {#ch:deploy}
In earlier chapters, we discussed the basic components of the machine
learning model training system. In this chapter, we look at the basics
of model deployment, a process whereby a trained model is deployed in a
runtime environment for inference. We explore the conversion from a
training model into an inference model, model compression methods that
adapt to hardware restrictions, model inference and performance
optimization, and model security protection.
The key aspects this chapter explores are as follows:
1. Conversion and optimization from a training model to an inference
model.
2. Common methods for model compression: quantization, sparsification,
and knowledge distillation.
3. Model inference process and common methods for performance
optimization.
4. Common methods for model security protection.
```toc
:maxdepth: 2
Overview
Conversion_to_Inference_Model_and_Model_Optimization
Model_Compression
Advanced_Efficient_Techniques
Model_Inference
Security_Protection_of_Models
Chapter_Summary
Further_Reading
```
# Model Compression
:label:`ch-deploy/model-compression`
The previous section briefly described the purpose of model conversion
and focused on some common model optimization methods for model
deployment. Hardware restrictions differ depending on where models are
deployed. For instance, smartphones are more sensitive to the model
size, usually supporting models only at the MB level. Larger models need
to be compressed using compression techniques before they can be
deployed on different computing hardware.
## Quantization
Model quantization is a technique that approximates continuous
floating-point weights (usually float32) with a limited number of
discrete values (usually int8), at the cost of a slight loss of
accuracy. As shown in Figure
:numref:`ch-deploy/quant-minmax`, $T$ represents the data range
before quantization. In order to reduce the model size, model
quantization represents floating-point data with fewer bits. As such,
the memory usage during inference can be reduced, and the inference on
processors that are good at processing low-precision operations can be
accelerated.
![Principles of quantization](../img/ch08/ch09-quant-minmax.png)
:label:`ch-deploy/quant-minmax`
The number of bits and the range of data represented by different data
types in a computer are different. A model may be quantized to different
bit widths based on service requirements. Generally, single-precision
floating-point numbers are used to represent a deep neural network; if
signed 8-bit integers are used to approximate the parameters, the size
of the quantized weights is reduced to a quarter of the original. Using
fewer bits to quantize a model results in a higher compression rate ---
8-bit quantization is the most commonly used in the industry. The lower
limit is 1-bit quantization, which can compress a model to 1/32 of its
original size. During inference, efficient XNOR and BitCount bit-wise
operations can be used to accelerate the inference.
According to the uniformity of the original ranges represented by the
quantized data, quantization can be further divided into linear
quantization and non-linear quantization. Because the weights and
activations of a deep neural network are usually not uniform in
practice, non-linear quantization can theoretically achieve a smaller
loss of accuracy. In real-world inference, however, non-linear
quantization typically involves higher computation complexity, meaning
that linear quantization is more commonly used. The following therefore
focuses on the principles of linear quantization.
In Equation
:eqref:`ch-deploy/quantization-q`, assume that $r$ represents
the floating-point number before quantization. We are then able to
obtain the integer $q$ after quantization.
$$q=clip(round(\frac{r}{s}+z),q_{min},q_{max})$$
:eqlabel:`ch-deploy/quantization-q`
$clip(\cdot)$ and $round(\cdot)$ indicate the truncation and rounding
operations, and $q_{min}$ and $q_{max}$ indicate the minimum and maximum
values after quantization, respectively. $s$ is the quantization
interval, and $z$ is the bias representing the data offset. The
quantization is symmetric if the bias ($z$) used in the quantization is
0, or asymmetric in other cases. Symmetric quantization reduces the
computation complexity during inference because it avoids computation
related to $z$. In contrast, asymmetric quantization determines the
minimum and maximum values based on the actual data distribution, and
the information about the quantized data is more effectively used. As
such, asymmetric quantization reduces the loss of accuracy caused by
quantization.
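The linear quantization formula can be made concrete with a minimal
NumPy sketch of asymmetric int8 quantization and dequantization; the
tensor values are illustrative:

```python
import numpy as np

def quantize(r, s, z, q_min=-128, q_max=127):
    """q = clip(round(r/s + z), q_min, q_max)."""
    return np.clip(np.round(r / s + z), q_min, q_max).astype(np.int8)

def dequantize(q, s, z):
    """Approximate recovery of the float value: r ~ s * (q - z)."""
    return s * (q.astype(np.float32) - z)

# Asymmetric quantization: derive s and z from the observed data range T.
r = np.array([-0.5, 0.0, 0.3, 1.2], dtype=np.float32)
s = (r.max() - r.min()) / (127 - (-128))     # quantization interval
z = np.round(-128 - r.min() / s)             # zero-point (bias)

q = quantize(r, s, z)
r_hat = dequantize(q, s, z)                  # equals r up to rounding error
```

Because $z$ is derived from the actual minimum of the data, the whole
int8 range is used, which is exactly the accuracy advantage of
asymmetric over symmetric quantization described above.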
According to the shared range of the quantization parameters $s$ and
$z$, quantization methods can be classified into layer-wise quantization
and channel-wise quantization: the former defines separate quantization
parameters for each layer, whereas the latter defines separate
quantization parameters for each channel. Finer-grained channel-wise
quantization yields higher quantization precision but increases the
computation complexity.
Model quantization can also be classified into quantization aware
training (QAT) and post-training quantization (PTQ) based on whether
training is involved. In QAT, fake-quantization operators are added, and
statistics on the input and output ranges before and after quantization
are collected during training to improve the accuracy of the quantized
model. This method is therefore suitable for scenarios that place strict
requirements on accuracy. In PTQ, models are directly quantized after
training, requiring only a small amount of calibration data. This method
is therefore suitable for scenarios that place strict requirements on
usability and have limited training resources.
**1. Quantization aware training**
QAT simulates quantization during training by including the accuracy
loss introduced by fake-quantization operators. In this way, the
optimizer can minimize the quantization error during training, leading
to higher model accuracy. QAT involves the following steps:
1. Initialization: Set initial values for the $q_{min}$/$q_{max}$
ranges of weights and activations.
2. Building a network for simulated quantization: Insert
fake-quantization operators after weights and activations that
require quantization.
3. Running QAT: Compute the range (i.e., $q_{min}$ and $q_{max}$) for
each weight and activation of the quantized network layer. Then,
perform forward computation with the quantization loss considered,
so that the loss can be involved in subsequent backpropagation and
network parameter update.
4. Exporting the quantized network: Obtain $q_{min}$ and $q_{max}$, and
compute the quantization parameters $s$ and $z$. Substitute the
quantization parameters into the quantized formula to transform the
network weights into quantized integer values. Then, delete the
fake-quantization operators, and add quantization and dequantization
operators before and after the quantization network layer,
respectively.
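Step 2 of QAT hinges on the fake-quantization operator, which quantizes
and immediately dequantizes a tensor so that the rounding error enters
the forward pass. A minimal NumPy sketch (forward pass only; a real
framework would also pass gradients through it, e.g. with a
straight-through estimator):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize then immediately dequantize, so the quantization error is
    visible to the rest of the network during training."""
    q_min, q_max = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())   # q_min/q_max statistics
    s = (x_max - x_min) / (q_max - q_min)           # quantization interval
    z = np.round(q_min - x_min / s)                 # zero-point (bias)
    q = np.clip(np.round(x / s + z), q_min, q_max)
    return (s * (q - z)).astype(np.float32)         # float again, with error

w = np.random.randn(4, 4).astype(np.float32)
w_fq = fake_quantize(w)   # stands in for w in the forward pass during QAT
```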
**2. Post-training quantization**
PTQ can be divided into two types: weight quantization and full
quantization. Weight quantization quantizes only the weights of a model
to compress its size, and then the weights are dequantized to the
original float32 format during inference. The subsequent inference
process is the same as that of a common float32 model. The advantage of
weight quantization is that calibration dataset and quantized operators
are not required, and that the accuracy loss is small. However, it does
not improve the inference performance, because the operators used during
inference are still float32. Full quantization quantizes both the
weights and activations of a model, and the quantized operators are
executed to accelerate model inference. The quantization of activations
requires a small number of calibration datasets (training dataset or
inputs of real scenarios) to collect the distribution of the activations
at each layer and calibrate the quantized operators. Calibration
datasets are used as the input during the quantization of activations.
After the inference, the distribution of activations at each layer is
collected to obtain quantization parameters. The process is summarized
as follows:
1. Use a histogram to represent the distribution $P_f$ of the original
float32 data.
2. Select several $q_{min}$ and $q_{max}$ values from a given search
space, quantize the activations, and obtain the quantized data
$Q_q$.
3. Use a histogram to represent the distribution of $Q_q$.
4. Compute the distribution difference between $Q_q$ and $P_f$, and
find the $q_{min}$ and $q_{max}$ values corresponding to the
smallest difference between $Q_q$ and $P_f$ in order to compute the
quantization parameters. Common indicators used to measure the
distribution differences include symmetric Kullback-Leibler
divergence and Jenson-Shannon divergence.
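The range search described above can be sketched as follows; the
candidate grid, bin count, and use of symmetric Kullback-Leibler
divergence are illustrative choices, not a specific framework's
implementation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def search_clip_range(activations, num_bins=128, num_bits=8):
    """Search a clipping threshold (q_max) whose quantized histogram is
    closest, in symmetric KL divergence, to the float32 distribution."""
    a = np.abs(activations)
    best_t, best_d = float(a.max()), np.inf
    for t in np.linspace(a.max() / 4, a.max(), 16):   # candidate thresholds
        p_hist, _ = np.histogram(a, bins=num_bins, range=(0.0, t))
        s = t / (2 ** (num_bits - 1) - 1)             # quantization interval
        q_data = np.clip(np.round(np.clip(a, 0, t) / s), 0, 127) * s
        q_hist, _ = np.histogram(q_data, bins=num_bins, range=(0.0, t))
        d = kl_divergence(p_hist.astype(float), q_hist.astype(float)) \
          + kl_divergence(q_hist.astype(float), p_hist.astype(float))
        if d < best_d:
            best_d, best_t = d, float(t)
    return best_t

acts = np.random.randn(10000).astype(np.float32)  # calibration activations
t_opt = search_clip_range(acts)    # calibrated clipping range for this layer
```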
In addition, the inherent error of quantization requires calibration
during quantization. Take the matrix multiplication
$a=\sum_{i=1}^Nw_ix_i+b$ as an example. $w$ denotes the weight, $x$ the
activation, and $b$ the bias. To overcome the quantization error, we
first calibrate the quantized mean value, and then obtain the mean value
of each channel output by the float32 operator and the quantized
operator. Assume that the mean value output by the float32 operator of
channel $i$ is $a_i$, and that output by the quantized operator after
dequantization is $a_{qi}$. From this, we can obtain the final mean
value by adding the mean value difference $a_i-a_{qi}$ between the two outputs
to the corresponding channel. In this manner, the final mean value is
consistent with that output by the float32 operator. We also need to
ensure that the distribution after quantization is the same as that
before quantization. Assume that the mean value and variance of the
weight of a channel are $E(w_c)$ and $||w_c-E(w_c)||$, and the mean
value and variance after quantization are $E(\hat{w_c})$ and
$||\hat{w_c}-E(\hat{w_c})||$, respectively. Equation
:eqref:`ch-deploy/post-quantization` is the calibration of the
weight:
$$
\begin{aligned}
\hat{w_c}\leftarrow\zeta_c(\hat{w_c}+u_c) \\
u_c=E(w_c)-E(\hat{w_c}) \\
\zeta_c=\frac{||w_c-E(w_c)||}{||\hat{w_c}-E(\hat{w_c})||}
\end{aligned}
$$
:eqlabel:`ch-deploy/post-quantization`
As a general model compression method, quantization can significantly
improve the memory and compression efficiency of neural networks, and
has been widely used.
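The per-channel weight calibration in Equation
:eqref:`ch-deploy/post-quantization` can be sketched as follows; the
weights and the simulated quantization error are synthetic:

```python
import numpy as np

def calibrate_weights(w, w_hat):
    """Per-channel weight calibration:
    u_c    = E(w_c) - E(w_hat_c)
    zeta_c = ||w_c - E(w_c)|| / ||w_hat_c - E(w_hat_c)||
    w_hat_c <- zeta_c * (w_hat_c + u_c)."""
    out = np.empty_like(w_hat)
    for c in range(w.shape[0]):                 # one output channel at a time
        u = w[c].mean() - w_hat[c].mean()
        zeta = np.linalg.norm(w[c] - w[c].mean()) / \
               (np.linalg.norm(w_hat[c] - w_hat[c].mean()) + 1e-10)
        out[c] = zeta * (w_hat[c] + u)
    return out

w = np.random.randn(8, 3, 3).astype(np.float32)            # float32 weights
noise = 0.01 * np.random.randn(8, 3, 3).astype(np.float32)
w_hat = w + noise                                          # simulated quant error
w_cal = calibrate_weights(w, w_hat)
```

After calibration, each channel's spread around its mean matches that of
the original float32 weights, which is what the $\zeta_c$ factor enforces.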
## Model Sparsification
Model sparsification reduces the memory and computation overheads by
removing some components (such as weights, feature maps, and convolution
kernels) from a neural network. It is a type of strong inductive bias
introduced to reduce the computation complexity of the model, just like
weight quantization, weight sharing, and pooling.
**1. Motivation of model sparsification**
Convolution on a convolutional neural network can be considered as a
weighted linear combination of the input and the weights of the
convolution kernel. In this sense, tiny weights have a relatively small
impact on the output. Model sparsification can be justified based on two
assumptions:
1. Most neural network models have over-parameterized weights. The
number of weight parameters can reach tens or even hundreds of
millions.
2. For most computer vision tasks such as detection, classification,
and segmentation, useful information accounts for only a small
proportion in an activation feature map generated during inference.
As such, model sparsification can be classified into two types according
to the source of sparsity: weight sparsification and activation
sparsification. Both types reduce the computation workload and model
storage requirements by reducing redundant components in a model. In
model sparsification, some weak connections are pruned based on the
absolute value of weights or activations (i.e. the weight or activation
of such connections is set to 0), with the goal of improving the model
performance. The sparsity of a model is measured by the proportion of
zero-value weights or activation tensors. Because the accuracy of a
model typically decreases as its sparsity increases, we hope to minimize
such loss when increasing the sparsity.
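The magnitude-based pruning just described can be sketched in a few
lines of NumPy; the tensor shape and target sparsity are illustrative:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Unstructured sparsification: zero out the weights with the
    smallest absolute values until the target sparsity is reached."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w).ravel())[k - 1]
    return w * (np.abs(w) > threshold)

w = np.random.randn(64, 64).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.7)
achieved = 1.0 - np.count_nonzero(w_sparse) / w_sparse.size   # close to 0.7
```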
Neurobiology was the inspiration for inventing neural networks --- it
has also inspired the sparsification of neural network models.
Neurobiologists found that most mammalian brains, including humans, have
a process called synapse pruning, which occurs between infancy and
adulthood. During synapse pruning, neuron axons and dendrites decay and
die off, and the neuron connections are continuously simplified and
reconstructed. This process allows brains to work more efficiently and
consume less energy.
**2. Structured and unstructured sparsification**
Let's first look at weight sparsification. It can be classified into
structured and unstructured sparsification. Structured sparsification
involves pruning channels or convolution kernels in order to generate
regular and smaller weight matrices that are more likely to obtain
speedup on CPUs and GPUs. However, this mode is coarse-grained, meaning
that it severely reduces the model accuracy.
In contrast, unstructured sparsification allows a weight at any location
to be pruned, meaning it is a fine-grained mode that causes less loss to
the model accuracy. However, the unstructured mode limits the speedup of
sparse models on hardware for a number of reasons:
1. The irregular layout of weights requires many control flow
instructions. For instance, the presence of zero values introduces
many `if-else` instructions for decision-making, which inevitably
reduces instruction-level parallelism.
2. The computation of convolution kernels is typically multi-threaded.
However, the irregular layout of weight matrices on memory causes
thread divergence and load imbalance, which therefore affects
thread-level parallelism.
3. The irregular layout of weight matrices on the memory hinders data
locality and reduces the cache hit rate. Consequently, the
load/store efficiency is reduced.
In an attempt to solve these problems, recent work combines structured
sparsification with unstructured sparsification. This approach
incorporates the advantages of both modes, and overcomes their
disadvantages to an extent.
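For contrast with the unstructured mode, structured sparsification can
be sketched as dropping whole convolution kernels; ranking kernels by
their L1 norm is one common criterion, used here as an illustrative
assumption:

```python
import numpy as np

def prune_channels(w, ratio=0.5):
    """Structured sparsification: drop whole convolution kernels with the
    smallest L1 norms, leaving a regular, smaller weight tensor."""
    norms = np.abs(w).sum(axis=(1, 2, 3))       # L1 norm per output channel
    keep = np.sort(np.argsort(norms)[int(len(norms) * ratio):])
    return w[keep]

w = np.random.randn(16, 8, 3, 3).astype(np.float32)   # 16 kernels (OIHW)
w_pruned = prune_channels(w, ratio=0.5)               # 8 kernels remain
```

The result is a dense, regular tensor that ordinary CPU/GPU kernels can
execute directly, which is why the structured mode obtains speedup more
easily than the unstructured one.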
**3. Sparsification strategies**
Given a neural network model, after deciding to sparsify the weights or
activations, we need to determine when and how to perform the
sparsification. The most common sparsification process is currently
pre-training, pruning, and fine-tuning. With this process, we need to
sparsify and fine-tune a converged dense model obtained through
training. Given that a pre-trained model contains the knowledge it has
learned, sparsifying such a model achieves a better effect than
sparsifying the initial model directly. In addition to pruning the
pre-trained model, we usually interleave pruning with network training.
Compared with one-shot pruning, iterative pruning is integrated more
closely with training, so that redundant convolution kernels can be
identified more efficiently. As such, iterative pruning is widely used.
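Iterative pruning interleaved with training can be sketched as a
schedule that raises sparsity step by step; the fine-tuning step below
is a stand-in (it only perturbs the weights), not a real training loop:

```python
import numpy as np

def fine_tune(w):
    """Stand-in for a few epochs of training; a real implementation
    would update w by gradient descent on the task loss."""
    return w + 0.001 * np.random.randn(*w.shape).astype(np.float32)

def iterative_prune(w, target_sparsity=0.8, steps=4):
    """Raise sparsity gradually and fine-tune in between, instead of
    pruning to the target sparsity in one shot."""
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps          # pruning schedule
        k = int(w.size * sparsity)
        threshold = np.sort(np.abs(w).ravel())[k - 1]
        mask = (np.abs(w) > threshold).astype(w.dtype)
        w = fine_tune(w * mask) * mask                     # keep pruned weights at 0
    return w

w = iterative_prune(np.random.randn(32, 32).astype(np.float32))
```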
To illustrate how to prune a network, we will use Deep
Compression [@han2015deep] as an example. Removing most weights leads to
a loss of accuracy of the neural network, as shown in Figure
:numref:`ch-deploy/deepcomp`. Fine-tuning a pruned sparse neural
network can help improve model accuracy, and the pruned network may be
quantized to represent weights using fewer bits. In addition, using
Huffman coding can further reduce the memory cost of the deep neural
network.
![Deep Compression algorithm](../img/ch08/ch09-deepcomp.png)
:label:`ch-deploy/deepcomp`
In addition to removing redundant neurons, a dictionary learning-based
method can be used to remove unnecessary weights on a deep convolutional
neural network. By learning the bases of convolution kernels, the
original convolutional kernels can be transformed into the coefficient
domain for sparsification. An example of this approach is the work by
Bagherinezhad et al. [@bagherinezhad2017lcnn], in which they proposed
that the original convolution kernel can be decomposed into a weighted
linear combination of the base of the convolution kernel and sparse
coefficient.
## Knowledge Distillation
Knowledge distillation (KD), also known as the teacher-student learning
algorithm, has gained much attention in the industry. Large deep
networks tend to deliver good performance in practice, because
over-parameterization increases the generalization capability when it
comes to new data. In KD, a large pre-trained network serves as the
teacher, and a new, deep and thin neural network serves as the student,
supervised by the teacher network. The key to this learning algorithm is
how to transfer the knowledge learned by the teacher to the student.
Hinton et al. [@Distill] first proposed a teacher-student learning
framework. It is used for the learning of deep and thin neural networks
by minimizing the differences between the teacher and student neural
networks. The teacher network is denoted as $\mathcal{N}_{T}$ with
parameters $\theta_T$, and the student network is denoted as
$\mathcal{N}_{S}$ with parameters $\theta_S$. In general, the student
network has fewer parameters than the teacher network.
[@Distill] proposed KD, which makes the classification result of the
student network more closely resemble both the ground truth and the
classification result of the teacher network, as expressed in Equation :eqref:`c2Fcn:distill`.
$$\mathcal{L}_{KD}(\theta_S) = \mathcal{H}(o_S,\mathbf{y}) +\lambda\mathcal{H}(\tau(o_S),\tau(o_T)),$$
:eqlabel:`c2Fcn:distill`
where $\mathcal{H}(\cdot,\cdot)$ is the cross-entropy function, $o_S$
and $o_T$ are outputs of the student network and the teacher network,
respectively, and $\mathbf{y}$ is the label. The first item in
Equation :eqref:`c2Fcn:distill` makes the classification result of the
student network resemble the expected ground truth, and the second item
aims to extract useful information from the teacher network and transfer
the information to the student network. $\lambda$ is a weight parameter
used to balance the two objective functions, and $\tau(\cdot)$ is a
softening function that smooths the network output.
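Equation :eqref:`c2Fcn:distill` can be sketched in NumPy as follows; the
temperature-softmax form of $\tau(\cdot)$ is a common choice, and the
logits and labels are synthetic:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, lam=0.5):
    """L_KD = H(o_S, y) + lambda * H(tau(o_S), tau(o_T))."""
    n = student_logits.shape[0]
    p_s = softmax(student_logits)
    hard_ce = -np.mean(np.log(p_s[np.arange(n), labels] + 1e-10))
    # soften both outputs with temperature tau, then take cross-entropy
    p_s_soft = softmax(student_logits / tau)
    p_t_soft = softmax(teacher_logits / tau)
    soft_ce = -np.mean(np.sum(p_t_soft * np.log(p_s_soft + 1e-10), axis=1))
    return hard_ce + lam * soft_ce

o_s = np.random.randn(8, 10)          # student logits
o_t = np.random.randn(8, 10)          # teacher logits
y = np.random.randint(0, 10, size=8)  # ground-truth labels
loss = kd_loss(o_s, o_t, y)           # scalar objective for the student
```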
Equation :eqref:`c2Fcn:distill` only extracts useful information from the
output of the teacher network classifier --- it does not mine
information from other intermediate layers of the teacher network.
Romero et al. [@FitNet] proposed an algorithm for transferring useful
information from any layer of a teacher network to a small student
network. Note that not all inputs are useful for convolutional neural
network computing and subsequent task execution. For example, in an
image containing an animal, it is important to classify and identify the
region where the animal is rather than the background information.
Therefore, selecting only the useful information from the teacher
network is an efficient approach. Zagoruyko and Komodakis [@attentionTS] proposed a
learning method based on an attention loss function to improve the
performance of the student network. This method introduces an attention
module. The attention module generates an attention map, which
identifies the importance of different areas of an input image to the
classification result. The attention map is then transferred from the
teacher network to the student network, as depicted in Figure
:numref:`ch-deploy/attentionTS`.
KD is an effective method to optimize small networks. It can be combined
with other compression methods such as pruning and quantization to train
efficient models with higher accuracy and less computation workload.
<figure id="fig:ch-deploy/attentionTS">
<div class="center">
<img src="../img/ch08/distillation.png" style="width:80.0%" />
</div>
<figcaption>Teacher-student neural network learning
algorithm</figcaption>
</figure>

# Conversion to Inference Model and Model Optimization
:label:`ch-deploy/model-optimization`
## Model Conversion
As mentioned earlier, TensorFlow, PyTorch, MindSpore, MXNet, and CNTK
define their own model data structures. This means that the inference
system needs to convert these structures to a unified one. Open Neural
Network Exchange (ONNX) is designed to implement such conversion. It
supports an extensive range of machine learning operators and converts
models from various frameworks (e.g., TensorFlow and PyTorch) into ONNX
models. Because models are structured data, the conversion process
involves converting the data structure. It starts by analyzing the
similarities and differences between two data structures. If they are
the same, data is transferred; if the structures are similar but with
slight differences, data is mapped; if the structures differ
significantly, extra semantics conversion might be required; and if they
are totally incompatible, the conversion will fail. ONNX features strong
expressive power, meaning that it can convert models from most
frameworks in the industry to compatible ONNX models. If a model is
abstracted as a graph, its data structure can be defined as follows:
1. **Topological expression of model:** The topological connections of
a model are represented as edges in a graph. From the perspective of
a model, these edges define the data flows and control flows in the
model. Based on such definitions, we can extend to the expressions
of the subgraphs, model inputs and outputs, and control flow
structures. For example, the control flow on TensorFlow 1.x is
expressed as a cyclic graph. To prevent the formation of cycles,
TensorFlow 1.x uses operators such as Enter, Exit, Switch, LoopCond,
and NextIteration, whereas ONNX uses operators such as Loop and If.
As such, when converting a TensorFlow 1.x control flow model into an
ONNX model, the control flow graph structure in the TensorFlow model
must be merged into a Loop or If operator on ONNX.
2. **Operator prototype definition:** Operators can be regarded as data
processing or control flow nodes in a model or as vertices in a
graph. An operator prototype defines the type, inputs, outputs, and
attributes of an operator. For instance, Slice has different
semantics on Caffe and ONNX. To convert a Caffe model into an ONNX
model, we need to map Slice on Caffe to Split on ONNX.
FusedBatchnorm on TensorFlow does not have a mapping operator on
Caffe. Rather, Batchnorm and Scale on Caffe need to be combined to
express the same semantics of FusedBatchnorm on TensorFlow.
Generally, the model conversion process involves converting the
topological relationships and mapping the operator prototypes
between models.
Following model conversion, some input-agnostic operations are conducted
for optimization purposes prior to model deployment, including constant
folding, operator fusion, operator replacement, and operator reordering
--- optimization methods discussed earlier in this book. For instance,
constant folding is usually performed by the compiler frontend during
compilation, whereas operator fusion and partition are often performed
(depending on the backend hardware support) once the compilation is
complete. However, some optimization operations can be performed in
their entirety only during the deployment phase.
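As an illustration of the input-agnostic optimizations listed above,
here is a toy constant-folding pass over a hypothetical graph
representation; the node format and the single `mul` folding rule are
invented for the sketch:

```python
import numpy as np

OPS = {"mul": np.multiply}          # offline-evaluable operators (sketch)

graph = [
    ("w", "const", np.array([[1.0, 2.0], [3.0, 4.0]])),
    ("s", "const", np.array(2.0)),
    ("w2", "mul", ("w", "s")),      # both inputs constant -> foldable
    ("y", "matmul", ("x", "w2")),   # depends on runtime input "x"
]

def fold_constants(graph):
    """Replace nodes whose inputs are all constants with precomputed results."""
    consts, folded = {}, []
    for name, op, payload in graph:
        if op == "const":
            consts[name] = payload
            folded.append((name, op, payload))
        elif op in OPS and all(i in consts for i in payload):
            consts[name] = OPS[op](*(consts[i] for i in payload))
            folded.append((name, "const", consts[name]))
        else:
            folded.append((name, op, payload))
    return folded

g = fold_constants(graph)           # "w2" is now a precomputed constant
```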
![Layered computer storage architecture](../img/ch08/ch09-storage.png)
:label:`ch-deploy/fusion-storage`
## Operator Fusion
:label:`ch-deploy/kernel-fusion`
Operator fusion involves combining multiple operators in a deep neural
network (DNN) model into a new operator based on certain rules, reducing
the inference latency and power consumption by lowering the computation
workload and load/store overhead during online inference.
The two main performance benefits brought by operator fusion are as
follows: First, it maximizes the utilization of registers and caches.
And second, because it combines operators, the load/store time between
the CPU and memory is reduced. Figure
:numref:`ch-deploy/fusion-storage` shows the architecture of a
computer's storage system. While the storage capacity increases from the
level-1 cache (L1) to hard disk, so too does the time for reading data.
After operator fusion is performed, the previous computation result can
be temporarily stored in the CPU's register or cache where the next
computation can directly read the result, reducing the number of I/O
operations on the memory. Furthermore, operator fusion allows some
computation to be completed in advance, eliminating redundant or even
cyclic redundant computing during forward computation.
![Convolution + Batchnorm operator fusion](../img/ch08/ch09-conv-bn-fusion.png)
:label:`ch-deploy/conv-bn-fusion`
To describe the principle of operator fusion, we will use two operators,
Convolution and Batchnorm, as shown in Figure
:numref:`ch-deploy/conv-bn-fusion`. In the figure, the
solid-colored boxes indicate operators, the resulting operators after
fusion is performed are represented by hatched boxes, and the weights or
constant tensors of operators are outlined in white. The fusion can be
understood as the simplification of an equation. The computation of
Convolution is expressed as Equation
:eqref:`ch-deploy/conv-equation`.
$$\bf{Y_{\rm conv}}=\bf{W_{\rm conv}}\cdot\bf{X_{\rm conv}}+\bf{B_{\rm conv}}$$
:eqlabel:`ch-deploy/conv-equation`
Here, we do not need to understand what each variable means. Instead, we
only need to keep in mind that Equation
:eqref:`ch-deploy/conv-equation` is an equation for
$\bf{Y_{\rm conv}}$ with respect to $\bf{X_{\rm conv}}$, and other
symbols are constants.
Equation
:eqref:`ch-deploy/bn-equation` is about the computation of
Batchnorm:
$$\bf{Y_{\rm bn}}=\gamma\frac{\bf{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$
:eqlabel:`ch-deploy/bn-equation`
Similarly, it is an equation for $\bf{Y_{\rm bn}}$ with respect to
$\bf{X_{\rm bn}}$. Other symbols in the equation represent constants.
As shown in Figure
:numref:`ch-deploy/conv-bn-fusion`, when the output of
Convolution is used as the input of Batchnorm, the formula of Batchnorm
is a function for $\bf{Y_{\rm bn}}$ with respect to $\bf{X_{\rm conv}}$.
After substituting $\bf{Y_{\rm conv}}$ into $\bf{X_{\rm bn}}$ and
uniting and extracting the constants, we obtain Equation
:eqref:`ch-deploy/conv-bn-equation-3`.
$$\bf{Y_{\rm bn}}=\bf{A}\cdot\bf{X_{\rm conv}}+\bf{B}$$
:eqlabel:`ch-deploy/conv-bn-equation-3`
Here, $\bf{A}$ and $\bf{B}$ are two matrices. It can be noticed that
Equation
:eqref:`ch-deploy/conv-bn-equation-3` is a formula for computing
Convolution. The preceding example shows that the computation of
Convolution and Batchnorm can be fused into an equivalent Convolution
operator. Such fusion is referred to as formula fusion.
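Assuming a per-output-channel Batchnorm following a Convolution, this
formula fusion amounts to folding the Batchnorm constants into the
convolution weights and bias offline; the OIHW layout and random
parameters below are illustrative:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mu, var, eps=1e-5):
    """Fold Batchnorm into the preceding Convolution:
    Y_bn = gamma * (W.X + b - mu) / sqrt(var + eps) + beta
         = (gamma/std * W) . X + (gamma * (b - mu) / std + beta)."""
    std = np.sqrt(var + eps)
    w_fused = w * (gamma / std)[:, None, None, None]   # scale each output channel
    b_fused = gamma * (b - mu) / std + beta
    return w_fused, b_fused

c_out, c_in = 4, 3
w = np.random.randn(c_out, c_in, 3, 3)                 # conv weights (OIHW)
b = np.random.randn(c_out)                             # conv bias
gamma, beta = np.random.randn(c_out), np.random.randn(c_out)
mu, var = np.random.randn(c_out), np.abs(np.random.randn(c_out))
w_f, b_f = fuse_conv_bn(w, b, gamma, beta, mu, var)    # a single Convolution now
```

For any input, the fused operator produces exactly the Batchnorm output,
so the Batchnorm node and its parameters can be deleted from the
deployed graph.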
The fusion of Convolution and Batchnorm eliminates a Batchnorm
operation, thereby reducing the quantity of parameters, the computation
workload, and the number of load/store operations. In general, this
fusion not only optimizes the power consumption and performance during
model deployment, but also brings certain benefits in compressing the
model size.
Symbols that are considered as constants in the Convolution and
Batchnorm formulas during fusion are considered as parameters during
training. Performing fusion during the training process will result in
missing model parameters. Because the fusion eliminates a Batchnorm
operator and corresponding parameters from the network, the algorithm of
the DNN is changed, degrading the accuracy to unacceptable levels.
Therefore, the fusion of Convolution and Batchnorm is an optimization
method typically used during deployment. To evaluate the optimization
effect, we constructed a sample network with Convolution and Batchnorm
using MindSpore Lite. We ran the sample network and mobilenet-v2 network
for inference in dual threads on a Huawei Mate 30 smartphone to compare
the time of running 3,000 inference epochs before and after the fusion.
As shown in Table
:numref:`ch09/ch09-conv-bn-fusion`, the inference performance of
the sample network and mobilenet-v2 network is improved considerably
after the fusion --- by 8.5% and 11.7% respectively. Such improvements
are achieved without bringing side effects and without requiring
additional hardware or operator libraries.
:Convolution + Batchnorm inference performance before and after fusion (unit: ms)
|Fusion | Sample | Mobilenet-v2 |
|---------------| --------| -------------- |
|Before fusion | 0.035 | 15.415 |
|After fusion | 0.031 | 13.606 |
:label:`ch09/ch09-conv-bn-fusion`
## Operator Replacement
The principle of operator replacement is to simplify an operator formula
by uniting like terms, extracting common factors, and employing other
mathematical methods, and then map the simplified formula to a certain
type of operators that have the same computational logic but are more
suitable for online deployment. In this way, we can reduce the
computation workload and compress the model.
![Replacement of Batchnorm](../img/ch08/ch09-bn-replace.png)
:label:`ch-deploy/bn-replace`
Figure :numref:`ch-deploy/bn-replace` depicts the replacement of
Batchnorm with Scale, which is used as an example to describe the
principle of operator replacement. After decomposing Equation
:eqref:`ch-deploy/bn-equation` (the Batchnorm formula) and
folding the constants, Batchnorm is defined as Equation
:eqref:`ch-deploy/replace-scale`:
$$\bf{Y_{bn}}=scale\cdot\bf{X_{bn}}+offset$$
:eqlabel:`ch-deploy/replace-scale`
where **scale** and **offset** are scalars. This simplified formula can
be mapped to a Scale operator.
Compared with the original Batchnorm formula, the simplified formula has
fewer parameters and involves less computation workload. This indicates
that operator replacement is an effective approach to optimizing the
power consumption and performance of a model during deployment. Symbols
that are treated as constants in Batchnorm during deployment are
trainable parameters during training, meaning that the replacement can
be performed only during deployment: performing it during training would
reduce the quantity of parameters and change the structure of the model,
weakening its expressive power and reducing the accuracy to which it
converges.
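The replacement in Figure :numref:`ch-deploy/bn-replace` amounts to
folding the four Batchnorm constants into two; a minimal sketch with
per-channel constants (the random values are illustrative):

```python
import numpy as np

def batchnorm_to_scale(gamma, beta, mu, var, eps=1e-5):
    """Fold Batchnorm constants into a Scale operator: Y = scale*X + offset."""
    std = np.sqrt(var + eps)
    return gamma / std, beta - gamma * mu / std

gamma, beta = np.random.randn(4), np.random.randn(4)
mu, var = np.random.randn(4), np.abs(np.random.randn(4))
scale, offset = batchnorm_to_scale(gamma, beta, mu, var)

x = np.random.randn(4)
y = scale * x + offset      # matches the original Batchnorm output
```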
## Operator Reordering
Another way of reducing the computation workload of an inference model
is to adjust the topological order of its operators according to certain
rules, on the condition that the inference accuracy is not degraded.
Common methods of operator reordering include moving cropping operators
(e.g., Slice, StrideSlice, and Crop) forward, and reordering Reshape,
Transpose, and BinaryOp.
![Reordering of Crop](../img/ch08/ch09-crop-reorder.png)
:label:`ch-deploy/crop-reorder`
Crop is used to cut a part out of the input feature map as the output.
After Crop is executed, the size of the feature map is reduced. As shown
in Figure :numref:`ch-deploy/crop-reorder`, moving Crop forward to cut the
feature map before other operators reduces the computation workload of
subsequent operators, thereby improving the inference performance in the
deployment phase. Such improvement is related to the operator
parameters. Note, however, that Crop can be moved forward only along
element-wise operators.
The experimental results above show that optimizing models before
inference can significantly reduce the latency, power consumption, and
memory usage.

# Overview
After training a model, we need to save it and its parameters to files
to make them persistent. However, because different training frameworks
adopt different data structures for such files, the inference system
must support models trained using different training frameworks and
convert the data in the files into a unified data structure. During the
conversion from a training model to an inference model, optimization
operations such as operator fusion and constant folding on the model can
be performed to improve the inference performance.
The hardware restrictions of different production environments must be
considered when we deploy an inference model. For instance, a
large-scale model needs to be deployed on a server in a computing or
data center with strong computing power, whereas a mid-scale model
should be deployed on an edge server, PC, or smartphone --- such devices
often have limited computing resources and memory. For simple,
small-scale models, ultra-low power microcontrollers can be used. In
addition, different hardware supports different data types (such as
float32, float16, bfloat16, and int8). To adapt to these hardware
restrictions, a trained model may need to be compressed in order to
reduce its complexity, data precision, or number of parameters.
Before a model can be used for inference, it needs to be deployed in the
runtime environment. To optimize model inference, which may be affected
by latency, memory usage, and power consumption, we can design chips
dedicated to machine learning --- such dedicated chips usually
outperform general-purpose ones in terms of energy efficiency. Another
approach is to fully leverage hardware capabilities through
software-hardware collaboration. Take a CPU as an example. When
designing and optimizing models for a specific CPU architecture, we can
suitably divide data blocks to meet the cache size, rearrange data to
facilitate contiguous data access during computing, reduce data
dependency to improve the parallelism of hardware pipelines, and use
extended instruction sets to improve the computing performance.
Because models are an important enterprise asset, it is important to
ensure their security after they are deployed in the runtime
environment. This chapter will discuss some of the common protection
measures and use model obfuscation as an example.
Some of the common methods used in the industry to address the preceding
challenges are as follows:
1. **Model compression:** Technologies that reduce the model size and
computation complexity by means of quantization and pruning. Such
technologies can be categorized according to whether retraining is
required.
2. **Operator fusion:** Technologies that combine multiple operators
into one by simplifying expressions and fusing attributes, aiming to
reduce the computation complexity and size of the model.
3. **Constant folding:** Forward computation of operators that meet
certain conditions is completed in the offline phase, reducing the
computation complexity and size of a model. This requires that the
inputs of operators be constants in the offline phase.
4. **Data format:** Based on the operator library, the hardware
restrictions, and an exploration of the optimal data format of each
layer in the network, data is rearranged or data rearrangement
operators are inserted, in order to reduce the inference latency
during model deployment.
5. **Model obfuscation:** Network nodes or branches are added and
operator names are changed for a trained model, so that it is
difficult for attackers to understand the original model structure
even if they steal the model. An obfuscated model may be directly
executed in the deployment environment, thereby ensuring the
security of the model during execution.
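As a minimal illustration of constant folding, the sketch below evaluates, in an offline pass, every node of a toy computational graph whose inputs are all constants. The `Node` class and graph layout are illustrative assumptions, not any particular framework's IR.

```python
# A minimal constant-folding pass over a toy computational graph.
# The Node class and graph structure are illustrative, not from any framework.

class Node:
    def __init__(self, name, op, inputs=(), value=None):
        self.name = name          # node identifier
        self.op = op              # "const", "add", "mul", or "input"
        self.inputs = list(inputs)
        self.value = value        # known constant value, if any

def fold_constants(nodes):
    """Replace every node whose inputs are all constants with a const node."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    for node in nodes:            # nodes are assumed topologically ordered
        if node.op in ops and all(i.op == "const" for i in node.inputs):
            node.value = ops[node.op](*(i.value for i in node.inputs))
            node.op, node.inputs = "const", []
    return nodes

# x * (2 + 3): the (2 + 3) subgraph has constant inputs in the offline
# phase, so it can be folded; x is a runtime input and cannot.
c2 = Node("c2", "const", value=2)
c3 = Node("c3", "const", value=3)
s = Node("s", "add", [c2, c3])
x = Node("x", "input")
y = Node("y", "mul", [x, s])
fold_constants([c2, c3, s, x, y])
print(s.op, s.value)   # the add node has become a constant 5
```

After folding, the `add` node is gone from the online computation; only the multiplication by the runtime input remains.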

# Model Inference
After conversion and compression, a trained model needs to be deployed
on the computation hardware in order to execute inference. Such
execution involves the following steps:
1. Preprocessing: Process raw data to suit the network input.
2. Inference execution: Deploy the model resulting from offline
conversion on the device to execute inference and compute the output
based on the input.
3. Postprocessing: Further process the output of the model, for
example, by threshold filtering.
## Preprocessing and Postprocessing
**1. Preprocessing**
Raw data, such as images, voices, and texts, is so disordered that
machine learning models cannot identify or extract useful information
from it. Preprocessing is intended to convert such data into tensors that
work for machine learning networks, eliminate irrelevant information,
restore useful true information, enhance the detectability of relevant
information, and simplify the data as much as possible. In this way,
reliability indicators related to feature extraction, image
segmentation, matching, and recognition of the models can be improved.
The following techniques are often used in data preprocessing:
1. Feature encoding: Encode the raw data that describes features into
numbers and input them to machine learning models which can process
only numerical values. Common encoding approaches include
discretization, ordinal encoding, one-hot encoding, and binary
encoding.
2. Normalization: Modify features to be on the same scale without
changing the correlation between them, eliminating the impact of
dimensions between data indicators. Common approaches include
Min-Max normalization that normalizes the data range, and Z-score
normalization that normalizes data distribution.
3. Outlier processing: An outlier is a data point that is distant from
all others in distribution. Elimination of outliers can improve the
accuracy of a model.
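The encoding and normalization techniques above can be sketched in a few lines of NumPy. These helper functions and the sample data are illustrative, not part of any specific framework:

```python
import numpy as np

# Illustrative preprocessing helpers (names and data are assumptions).

def min_max_normalize(x):
    """Scale each feature column to [0, 1] (Min-Max normalization)."""
    x = np.asarray(x, dtype=float)
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn)

def z_score_normalize(x):
    """Shift each feature column to zero mean and unit variance (Z-score)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def one_hot(labels, num_classes):
    """Encode integer category labels as one-hot vectors."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

# Two features on very different scales end up on the same [0, 1] scale,
# eliminating the impact of dimensions between data indicators.
data = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(min_max_normalize(data))
print(one_hot([0, 2, 1], 3))
```

Note that normalization rescales each column independently, so the correlation between features is preserved.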
**2. Postprocessing**
After model inference, the output data is transferred to users for
postprocessing. Common postprocessing techniques include:
1. Discretization of continuous data: Assume we expect to predict
discrete data, such as the quantity of a good, using a model, but a
regression model provides only continuous prediction values, which
have to be rounded or bounded.
2. Data visualization: This technique uses graphics and tables to
represent data so that we can find relationships in the data in
order to support analysis strategy selection.
3. Prediction range widening: Most values predicted by a regression
model are concentrated in the center, and few are in the tails. For
example, abnormal values of hospital laboratory data are used to
diagnose diseases. To increase the accuracy of prediction, we can
enlarge the values in both tails by widening the prediction range,
multiplying the values that deviate from the normal range by a
coefficient.
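Two of these postprocessing steps can be sketched directly. The bounds and coefficient below are illustrative assumptions, chosen only to demonstrate the idea:

```python
import numpy as np

# Illustrative postprocessing of raw regression outputs.

def discretize(pred, low, high):
    """Round continuous predictions to integers and bound them to [low, high]."""
    return np.clip(np.rint(pred), low, high).astype(int)

def widen_tails(pred, low, high, coeff=1.2):
    """Multiply values outside the normal range [low, high] by a coefficient
    to enlarge both tails (coeff here is an illustrative choice)."""
    pred = np.asarray(pred, dtype=float)
    outside = (pred < low) | (pred > high)
    return np.where(outside, pred * coeff, pred)

# Raw regression outputs become valid, bounded quantities.
quantities = discretize([2.4, -0.7, 9.8], low=0, high=5)
print(quantities)   # [2 0 5]
```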
## Parallel Computing
:label:`ch-deploy/parallel-inference`
Most inference models have a multi-thread mechanism that leverages the
capabilities of multiple cores in order to achieve performance
improvements. In this mechanism, the input data of operators is
partitioned, and multiple threads are used to process different data
partitions. This allows operators to be computed in parallel, thereby
multiplying the operator performance.
![Data partitioning for matrix multiplication](../img/ch08/ch09-parallel.png)
:label:`ch09_parallel`
In Figure :numref:`ch09_parallel`, the matrix in the multiplication can be
partitioned according to the rows of matrix A. Three threads can then be
used to compute A1 \* B, A2 \* B, and A3 \* B (one thread per
computation), implementing multi-thread parallel execution of the matrix
multiplication.
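The row partitioning described above can be sketched with a thread pool. This is a minimal illustration using Python's standard library, not any inference framework's internal mechanism; NumPy releases the interpreter lock inside `@`, so the per-block multiplications can genuinely overlap:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Row-partitioned matrix multiplication: A is split into blocks A1..A3,
# and each Ai @ B runs on its own thread, one thread per partition.

def parallel_matmul(a, b, num_threads=3):
    blocks = np.array_split(a, num_threads, axis=0)   # partition rows of A
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(lambda blk: blk @ b, blocks))
    return np.vstack(results)                          # reassemble the output

a = np.random.rand(6, 4)
b = np.random.rand(4, 5)
assert np.allclose(parallel_matmul(a, b), a @ b)       # same result as A @ B
```

Framework thread pools follow the same partitioning idea but reuse worker threads across operators to avoid creation and destruction overhead.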
To facilitate parallel computing of operators and avoid the overhead of
frequent thread creation and destruction, inference frameworks usually
have a thread pooling mechanism. There are two common practices:
1. Open Multi-Processing (OpenMP) API: OpenMP is an API that supports
concurrency through memory sharing across multiple platforms. It
provides interfaces that are commonly used to implement operator
parallelism. An example of such an interface is `parallel for`,
which allows `for` loops to be concurrently executed by multiple
threads.
2. Framework-provided thread pools: Such pools are more lightweight and
targeted at the AI domain compared with OpenMP interfaces, and can
deliver better performance.
## Operator Optimization
:label:`ch-deploy/kernel-optimization`
When deploying an AI model, we want model training and inference to be
performed as fast as possible in order to obtain better performance. For
a deep learning network, the scheduling of the framework takes a short
period of time, whereas operator execution is often a bottleneck for
performance. This section introduces how to optimize operators from the
perspectives of hardware instructions and algorithms.
**1. Hardware instruction optimization**
Given that most devices have CPUs, the time that CPUs spend processing
operators has a direct impact on the performance. Here we look at the
methods for optimizing hardware instructions on ARM CPUs.
**1) Assembly language**
High-level programming languages such as C++ and Java are compiled into
machine instruction sequences by compilers, and the generated
instruction sequences have a direct influence on program performance.
Assembly languages are close to machine code and can express any
instruction sequence in one-to-one mode. Programs written in assembly
languages occupy less memory, and are faster and more efficient than
those written in high-level languages.
In order to exploit the advantages of both types of languages, we can
write the parts of a program that require better performance in assembly
languages and the other parts in high-level languages. Because
convolution and matrix multiplication operators in deep learning involve
a large amount of computation, using assembly languages for code
necessary to perform such computation can improve model training and
inference performance by dozens or even hundreds of times.
Next, we use ARMv8 CPUs to illustrate the optimization related to
hardware instructions.
**2) Registers and NEON instructions**
Each ARMv8 CPU has 32 NEON registers, v0 to v31. As shown in
Figure :numref:`ch-deploy/register`, NEON register v0 can store 128
bits, which is enough for 4 float32, 8 float16, or 16 int8 values.
![Structure of the NEON register v0 of an ARMv8 CPU](../img/ch08/ch09-register.png)
:label:`ch-deploy/register`
The single instruction multiple data (SIMD) method can be used to
improve the data access and computing speed on this CPU. Compared with
single instruction single data (SISD), a NEON instruction can process
multiple data values in a NEON register at a time. For example, the
`fmla` instruction for floating-point data is used as
`fmla v0.4s, v1.4s, v2.4s`. As depicted in Figure
:numref:`ch-deploy/fmla`, the products of the corresponding
floating-point values in registers v1 and v2 are added to the value in
v0.
![fmla instruction computing](../img/ch08/ch09-fmla.png)
:label:`ch-deploy/fmla`
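The lane-wise semantics of `fmla v0.4s, v1.4s, v2.4s` can be mimicked on four float32 values with NumPy. This only models what the instruction computes, not how the hardware executes it:

```python
import numpy as np

# Semantics of `fmla v0.4s, v1.4s, v2.4s` on four float32 lanes:
# each lane of v0 accumulates the product of the matching lanes of v1 and v2.
v0 = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
v1 = np.array([10.0, 10.0, 10.0, 10.0], dtype=np.float32)
v2 = np.array([0.5, 1.0, 1.5, 2.0], dtype=np.float32)

v0 += v1 * v2          # one fused multiply-accumulate across all four lanes
print(v0)              # [ 6. 12. 18. 24.]
```

A scalar (SISD) version would need four separate multiply-accumulate operations for the same result, which is why SIMD multiplies throughput on data-parallel loops such as convolution inner kernels.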
**3) Assembly language optimization**
For assembly language programs with known functions, computational
instructions are usually fixed. In this case, non-computational
instructions are the source of performance bottlenecks. The structure of
computer storage devices resembles a pyramid, as shown in Figure
:numref:`ch-deploy/fusion-storage`. The top layer has the fastest
speed but the smallest space; conversely, the bottom layer has the
largest space but the slowest speed. L1 to L3 are referred to as caches.
When accessing data, the CPU first attempts to access the data from one
of its caches. If the data is not found, the CPU then accesses an
external main memory. Cache hit rate is introduced to measure the
proportion of data that is accessed from the cache. In this sense, the
cache hit rate must be maximized to improve the program performance.
There are some techniques to improve the cache hit rate and optimize the
assembly performance:
1. Loop unrolling: Use as many registers as possible to achieve better
performance at the cost of increasing the code size.
2. Instruction reordering: Reorder the instructions of different
execution units to improve the pipeline utilization, thereby
allowing instructions that incur latency to be executed first. In
addition to reducing the latency, this method also reduces data
dependency before and after the instruction.
3. Register blocking: Block NEON registers appropriately to reduce the
number of idle registers and reuse more registers.
4. Data rearrangement: Rearrange the computational data to ensure
contiguous memory reads and writes and improve the cache hit rate.
5. Instruction prefetching: Load the required data from the main memory
to the cache in advance to reduce the access latency.
**2. Algorithm optimization**
For most AI models, 90% or more of the inference time of the entire
network is spent on computing convolution and matrix multiplication
operators. This section focuses on the optimization of convolution
operator algorithms, which can be applied to various hardware devices.
The computation of convolution can be converted into the multiplication
of two matrices, and we have elaborated on the optimization of the GEMM
algorithm in Section :ref:`ch-deploy/parallel-inference`. For different hardware,
appropriate matrix blocking can optimize data load/store efficiency and
instruction parallelism. This helps to maximize the utilization of the
hardware's computing power, thereby improving the inference performance.
**(1) Img2col**
Img2col is often used to convert convolution into matrix multiplication.
Convolutional layers typically operate on 4D inputs in NHWC format.
Figure :numref:`ch-deploy/conv_nhwc` is a diagram of convolution. The
input shape is (1, IH, IW, IC), the convolution kernel shape is (OC, KH,
KW, IC), and the output shape is (1, OH, OW, OC).
![General convolution](../img/ch08/ch09-conv_nhwc.png)
:label:`ch-deploy/conv_nhwc`
As shown in Figure
:numref:`ch-deploy/img2col_input`, the Img2col rules for
convolution are as follows: The input is reordered to obtain the matrix
on the right. The number of rows corresponds to the number of OH \* OW
outputs. For a row vector, Img2col processes KH \* KW data points of
each input channel in sequence, from the first channel to channel IC.
![Img2col on the convolution input](../img/ch08/ch09-img2col_input.png)
:label:`ch-deploy/img2col_input`
As shown in Figure
:numref:`ch-deploy/img2col_weight`, the weights are rearranged.
One convolution kernel is expanded into one column of the weight matrix.
This means that there are OC columns in total. On each column vector, KH
\* KW data values on the first input channel are arranged first, and
then on subsequent channels until the channel IC. In this manner, the
convolution operation is converted into the multiplication of two
matrices. In practice, the data rearrangement of Img2col and GEMM is
performed simultaneously to save time.
![Img2col on the convolution kernel](../img/ch08/ch09-img2col_weight.png)
:label:`ch-deploy/img2col_weight`
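The rearrangement can be sketched as a minimal Img2col for NHWC input with batch 1, stride 1, and no padding. One simplifying assumption: patches here are flattened in HWC order rather than the channel-major order of the figures; any layout works as long as the input and weight matrices use the same one.

```python
import numpy as np

# A minimal Img2col (batch 1, stride 1, no padding), converting
# convolution into a single matrix multiplication.

def img2col(x, kh, kw):
    """x: (1, IH, IW, IC) -> (OH*OW, KH*KW*IC) matrix of input patches."""
    _, ih, iw, ic = x.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    cols = np.empty((oh * ow, kh * kw * ic))
    for i in range(oh):
        for j in range(ow):
            # one output position -> one row: the KH x KW x IC patch, flattened
            cols[i * ow + j] = x[0, i:i + kh, j:j + kw, :].reshape(-1)
    return cols

def conv_as_gemm(x, w):
    """w: (OC, KH, KW, IC). Convolution becomes (OH*OW, K) @ (K, OC)."""
    oc, kh, kw, ic = w.shape
    cols = img2col(x, kh, kw)          # rearrange the input
    wmat = w.reshape(oc, -1).T         # one convolution kernel per column
    _, ih, iw, _ = x.shape
    return (cols @ wmat).reshape(1, ih - kh + 1, iw - kw + 1, oc)

x = np.random.rand(1, 5, 5, 3)         # (1, IH, IW, IC)
w = np.random.rand(4, 3, 3, 3)         # (OC, KH, KW, IC)
out = conv_as_gemm(x, w)
print(out.shape)                       # (1, 3, 3, 4)
```

The resulting GEMM can then be blocked and parallelized with the techniques discussed earlier, which is the main reason this conversion is used in practice.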
**(2) Winograd**
Convolution can essentially be treated as matrix multiplication. The time
complexity of multiplying two 2D matrices is $O(n^3)$. The Winograd
algorithm can reduce the complexity of matrix multiplication.
Assume that a 1D convolution operation is denoted as ***F***($m$, $r$),
where $m$ indicates the number of outputs, and $r$ indicates the number
of convolution kernels. The input is
$\textit{\textbf{d}}=[d_0 \ d_1 \ d_2 \ d_3]$, and the convolution
kernel is $g=[g_0 \ g_1 \ g_2]^{\rm T}$. The convolution operation may
be written using matrices as Equation
:eqref:`ch-deploy/conv-matmul-one-dimension`, which contains six
multiplications and four additions.
$$
\textit{\textbf{F}}(2, 3)=
\left[ \begin{matrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{matrix} \right] \times \left[ \begin{matrix} g_0 \\ g_1 \\ g_2 \end{matrix} \right]=
\left[ \begin{matrix} y_0 \\ y_1 \end{matrix} \right]
$$
:eqlabel:`equ:ch-deploy/conv-matmul-one-dimension`
In the preceding equation, there are repeated elements $d_1$ and $d_2$
in the input matrix. As such, there is space for optimization for matrix
multiplication converted from convolution compared with general matrix
multiplication. The matrix multiplication result may be obtained by
computing an intermediate variable $m_0-m_3$, as shown in Equation
:eqref:`ch-deploy/conv-2-winograd`:
$$
\textit{\textbf{F}}(2, 3)=
\left[ \begin{matrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{matrix} \right] \times \left[ \begin{matrix} g_0 \\ g_1 \\ g_2 \end{matrix} \right]=
\left[ \begin{matrix} m_0+m_1+m_2 \\ m_1-m_2-m_3 \end{matrix} \right]
$$
:eqlabel:`equ:ch-deploy/conv-2-winograd`
where $m_0-m_3$ are computed as Equation
:eqref:`ch-deploy/winograd-param`:
$$
\begin{aligned}
m_0=(d_0-d_2) \times g_0 \\
m_1=(d_1+d_2) \times (\frac{g_0+g_1+g_2}{2}) \\
m_2=(d_2-d_1) \times (\frac{g_0-g_1+g_2}{2}) \\
m_3=(d_1-d_3) \times g_2
\end{aligned}
$$
:eqlabel:`equ:ch-deploy/winograd-param`
The indirect computation of $y_0$ and $y_1$ by computing $m_0-m_3$ involves
four additions of the input $d$ and four multiplications and four
additions of the output $m$. Because the weights are constant during
inference, the operations on the convolution kernel can be performed
during graph compilation, which is excluded from the online runtime. In
total, there are four multiplications and eight additions --- fewer
multiplications and more additions compared with direct computation
(which has six multiplications and four additions). In computer systems,
multiplications are generally more time-consuming than additions.
Decreasing the number of multiplications while adding a small number of
additions can accelerate computation.
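The transform above can be verified numerically. The sketch below implements F(2, 3) with the standard Winograd minimal-filtering formulas, precomputing the kernel-side terms as the text describes, and checks the result against direct convolution:

```python
import numpy as np

# Winograd F(2, 3): a 1D convolution producing 2 outputs with a 3-tap
# kernel, using 4 multiplications instead of the direct method's 6.

def winograd_f23(d, g):
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Kernel-side terms depend only on g: since weights are constant
    # during inference, these can be precomputed offline.
    u0, u1, u2 = g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2
    m0 = (d0 - d2) * u0
    m1 = (d1 + d2) * u1
    m2 = (d2 - d1) * u2
    m3 = (d1 - d3) * g2
    return np.array([m0 + m1 + m2, m1 - m2 - m3])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],     # y0
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])    # y1
assert np.allclose(winograd_f23(d, g), direct)
```

Only the four `m` products count as online multiplications; the divisions by 2 live entirely in the offline kernel transform.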
In a matrix form, the computation can be written as Equation
:eqref:`ch-deploy/winograd-matrix`, where $\odot$ indicates the
multiplication of corresponding locations, and ***A***, ***B***, and
***G*** are all constant matrices. The matrix form here is used for
clarity. In real-world use, faster computation can be achieved if the
computation is performed based on the handwritten form provided in
Equation :eqref:`ch-deploy/winograd-param`.
$$\textit{\textbf{Y}}=\textit{\textbf{A}}^{\rm T}(\textit{\textbf{G}}g) \odot (\textit{\textbf{B}}^{\rm T}d)$$
:eqlabel:`equ:ch-deploy/winograd-matrix`
$$\textit{\textbf{B}}^{\rm T}=
\left[ \begin{matrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{matrix} \right]$$
:eqlabel:`equ:ch-deploy/winograd-matrix-bt`
$$\textit{\textbf{G}}=
\left[ \begin{matrix} 1 & 0 & 0 \\ 0.5 & 0.5 & 0.5 \\ 0.5 & -0.5 & 0.5 \\ 0 & 0 & 1 \end{matrix} \right]$$
:eqlabel:`equ:ch-deploy/winograd-matrix-g`
$$\textit{\textbf{A}}^{\rm T}=
\left[ \begin{matrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{matrix} \right]$$
:eqlabel:`equ:ch-deploy/winograd-matrix-at`
In deep learning, 2D convolution is typically used. When ***F***(2, 3)
is extended to ***F***($2\times 2$, $3\times 3$), it can be written in a
matrix form, as shown in Equation
:eqref:`ch-deploy/winograd-two-dimension-matrix`. In this case,
Winograd has 16 multiplications, reducing the computation complexity by
2.25 times compared with the 36 multiplications of the original convolution.
$$\textit{\textbf{Y}}=\textit{\textbf{A}}^{\rm T}(\textit{\textbf{G}}g\textit{\textbf{G}}^{\rm T}) \odot (\textit{\textbf{B}}^{\rm T}d\textit{\textbf{B}})\textit{\textbf{A}}$$
:eqlabel:`equ:ch-deploy/winograd-two-dimension-matrix`
The logical process of Winograd can be divided into four steps, as shown
in Figure :numref:`ch-deploy/winograd`.
![Winograd steps](../img/ch08/ch09-winograd.png)
:label:`ch-deploy/winograd`
To use Winograd of ***F***($2\times 2$, $3\times 3$) for any output
size, we need to divide the output into $2\times 2$ blocks. We can then
perform the preceding four steps using the corresponding input to obtain
the corresponding output. Winograd is not limited to solving
***F***($2\times 2$, $3\times 3$). For
any ***F***($m \times m$, $r \times r$), appropriate constant matrices
***A***, ***B***, and ***G*** can be found to reduce the number of
multiplications through indirect computation. However, as $m$ and $r$
increase, the number of additions involved in input and output and the
number of multiplications of constant weights increase. In this case,
the decrease in the computation workload brought by fewer
multiplications is offset by additions and constant multiplications.
Therefore, we need to evaluate the benefits of Winograd before using it.
This section described methods for processing data and optimizing
performance during model inference. An appropriate data processing
method facilitates input feature extraction and output processing. To
fully leverage the computing power of hardware, we can use parallel
computing together with operator-level hardware instruction and
algorithm optimization. In addition, the memory usage and load/store
rate are also important for the inference performance. Therefore, it is
essential to design an appropriate memory overcommitment strategy for
inference. Related methods have been discussed in the section about the
compiler backend.

# Security Protection of Models
After training and optimizing models locally, AI service providers
deploy the models on third-party platforms (such as mobile devices, edge
devices, and cloud servers) to provide inference services. The design
and training of AI models require a large amount of time, data, and
computing power. This is why model and service providers must protect
the models (including model structures and parameters) as intellectual
property, preventing them from being stolen during transfer, storage,
and running in the deployment phase.
## Overview
The security protection of models can be divided into static protection
and dynamic protection. Static protection refers to protecting models
during transfer and storage. At present, it is widely implemented based
on file encryption, in which AI model files are transferred and stored
in ciphertext and are decrypted in the memory before being used for
inference. However, throughout the inference process, models remain in
plaintext in the memory, leaving them exposed to theft. Dynamic
protection refers to protecting models during runtime. Dynamic
protection methods currently available can be classified into three
categories. The first is trusted execution environment-based (TEE-based)
protection. TEEs are usually secure zones isolated on trusted hardware,
and AI model files are stored and transferred in non-secure zones and
run after decryption in the secure zones. Although this method
involves only a short inference latency on the CPU, it requires specific
trusted hardware, making it difficult to implement. In addition, due to
constraints on hardware resources, protecting large-scale deep models is
difficult and heterogeneous hardware acceleration is still challenging.
The second is cryptographic computing-based protection, which ensures
that models remain in ciphertext during transfer, storage, and running
using cryptographic techniques (such as homomorphic encryption and
secure multi-party computation). Although this method is free from
hardware constraints, it has large computation or communications
overheads and cannot protect model structure information. The third is
obfuscation-based protection. This method scrambles the computational
logic of models with fake nodes, so that attackers cannot understand the
models even if they obtain them. Compared with the former two methods,
obfuscation-based protection brings a smaller performance overhead and
a negligible loss of accuracy. Furthermore, it is
hardware-agnostic, and can support protection of very large models. We
will focus on protection using the obfuscation-based method.
## Model Obfuscation
Model obfuscation can automatically obfuscate the computational logic of
plaintext AI models, preventing attackers from understanding the models
even if they obtain them during transfer and storage. In addition,
models can run while still being obfuscated, thereby ensuring the
confidentiality while they are running. Obfuscation does not affect the
inference results and brings only a low performance overhead.
![Procedure of model obfuscation](../img/ch08/model_obfuscate.png)
:label:`ch-deploy/model_obfuscate`
Figure :numref:`ch-deploy/model_obfuscate` depicts the model obfuscation
procedure, which is described as follows.
1. **Interpret the given model into a computational graph:** Based on
the structure of a trained model, interpret the model file into the
graph expression (computational graph) of the model computational
logic for subsequent operations. The resulting computational graph
contains information such as node identifiers, node operator types,
node parameters, and network structures.
2. **Scramble the network structure of the computational graph[^1]:**
Scramble the relationship between nodes in the computational graph
using graph compression, augmentation, and other techniques in order
to conceal the true computational logic. In graph compression, the
key subgraph structure is matched by checking the entire graph.
These subgraphs are compressed and replaced with a single new
computing node. Graph augmentation adds new input/output edges to
the compressed graph in order to further conceal the dependencies
between nodes. An input/output edge comes from or points to an
existing node in the graph, or comes from or points to the new
obfuscation node in this step.
3. **Anonymize nodes in the computational graph:** Traverse the
computational graph processed in Step (2) and select the nodes to be
protected. For a node to be protected, we can replace the node
identifier, operator type, and other attributes that can describe
the computational logic of the model with non-semantic symbols. For
node identifier anonymization, the anonymized node identifier must
be unique in order to distinguish different nodes. For operator type
anonymization, to avoid operator type explosion caused by
large-scale computational graph anonymization, we can divide nodes
with the same operator type into several disjoint sets, and replace
the operator type of nodes in the same set with the same symbol.
Step (5) ensures that the model can be identified and executed after
node anonymization.
4. **Scramble weights of the computational graph:** Add random noise
and mapping functions to the weights to be protected. The random
noise and mapping functions can vary with weights. Step (6) ensures
that the noise of weights does not change the model execution
result. The computational graph processed in Steps (2), (3),
and (4) is then saved as a model file for subsequent operations.
5. **Transform operator interfaces:** Steps (5) and (6) transform
operators to be protected in order to generate candidate obfuscated
operators. An original operator may correspond to multiple
obfuscated operators. The quantity of candidate obfuscated operators
depends on how many sets the nodes are grouped into in Step (3). In
this step, the operator interfaces are transformed based on the
anonymized operator types and operator input/output relationship
obtained after Steps (2), (3), and (4). Such transformation can be
implemented by changing the input, output, or interface name.
Changing the input and output involves modification on the input and
output data, making the form of the obfuscated operator different
from that of the original operator. The added data includes the data
dependency introduced by graph augmentation in Step (2) and the
random noise introduced by weight obfuscation in Step (4). The
operator name is changed to the name of the anonymized operator
obtained in Step (3) to ensure that the model can still be
identified and executed after the nodes are anonymized and that the
operator name does not reveal the computational logic.
6. **Transform the operator implementation:** Transform the operator
code implementation by encrypting strings, adding redundant code,
and employing other code obfuscation techniques in order to keep the
computational logic consistent between the original operator and
obfuscated operator while also making the logic more difficult to
understand. A combination of different code obfuscation techniques
may be applied to different operators in order to realize the code
implementation transformation. In addition to equivalent code
transformation, the obfuscated operators further implement some
additional computational logic. For example, in Step (4), noise has
been added to the weights of an operator. The obfuscated operator
also implements an inverse mapping function of the weight noise,
dynamically eliminating noise in the operator execution process and
ensuring that the computation result is the same as the original
model. The generated obfuscated operators can then be saved as a
library file for subsequent operations.
7. **Deploy the model and operator library:** Deploy the obfuscated
model and corresponding operator library file on the desired device.
8. **Load the obfuscated model:** Parse the obfuscated model file and
obtain the graph expression of the model computational logic, that
is, the obfuscated computational graph obtained after Steps (2), (3),
and (4).
9. **Initialize the computational graph:** Initialize the computational
graph to generate an execution task sequence. According to security
configuration options, if runtime model security needs to be
protected, the obfuscated graph should be directly initialized to
generate an execution task sequence. Each compute unit in the
sequence corresponds to execution of one obfuscated operator or
original operator. If security protection is required during only
model transfer and storage, restore the obfuscated graph in the
memory to the source graph, and then initialize the source graph to
generate an execution task sequence. Each unit in the sequence
corresponds to the execution of an original operator. In this way,
performance overheads during inference can be further reduced.
10. **Execute inference tasks:** The model executes the compute units
sequentially on the input of the AI application in order to obtain
an inference result. If a compute unit corresponds to an obfuscated
operator, the obfuscated operator library is invoked. Otherwise, the
original operator library is invoked.
[^1]: Scrambling refers to adding noise to the computational graph.
Common methods include adding redundant nodes and edges and merging
some subgraphs.

# Chapter Summary
1. Model deployment is restricted by factors including the model size,
runtime memory usage, inference latency, and inference power
consumption.
2. Models can be compressed using techniques such as quantization,
pruning, and knowledge distillation in the offline phase. In
addition, some model optimization techniques, such as operator
fusion, can also reduce the model size, albeit to a lesser degree.
3. Runtime memory usage can be improved by optimizing the model size,
deployment framework size, and runtime temporary memory usage.
Methods for optimizing the model size have been summarized earlier.
Making the framework code simpler and more modular helps optimize
the deployment framework. Memory pooling can help implement memory
overcommitment to optimize the runtime temporary memory usage.
4. Model inference latency can be optimized from two aspects. In the
offline phase, the model computation workload can be reduced using
model optimization and compression methods. Furthermore, improving
the inference parallelism and optimizing operator implementation can
help maximize the utilization of the computing power. In addition to
the computation workload and computing power, consideration should
be given to the load/store overhead during inference.
5. Power consumption during inference can be reduced through offline
model optimization and compression technologies. By reducing the
computational workload, these technologies also facilitate power
consumption reduction, which coincides with the optimization method
for model inference latency.
6. In addition to the optimization of factors related to model
deployment, this chapter also discussed technologies regarding
deployment security, such as model obfuscation and model encryption.
Secure deployment protects the model assets of enterprises and
prevents hackers from attacking the deployment environment by
tampering with models.

# Preface
## Background
In 2020, I joined the School of Informatics at the University of Edinburgh, which is considered one of the birthplaces of Artificial Intelligence (AI) research. The university offers machine learning courses that cover a wide range of topics, including natural language processing, computer vision, and computational neuroscience. Additionally, the university is well-known for providing a complete series of fundamental courses on computer systems, such as operating systems, programming languages, compilers, and computer architecture. However, when I asked my students about how computer systems are utilized to deploy and accelerate computation in machine learning, many of them appeared puzzled. This led me to contemplate whether the University of Edinburgh, along with other universities worldwide, should expand their curricula by adding a course that bridges the gap between machine learning and computer systems.
Initially, my idea was to expand an existing course. At the time, the "AI Systems" course at the University of California, Berkeley was particularly popular. It explored various research directions in machine learning systems, with an emphasis on studying research papers. Unfortunately, many of these papers did not stand the test of time, and the course did not provide a comprehensive architectural overview of the knowledge. Consequently, students were unable to gain a complete understanding of the subject or learn how to construct a machine learning system from scratch. I then looked to other universities, where I discovered that the University of Washington offered a brief course called "Deep Learning Systems," which focused on the compilation process of machine learning programs. However, the course primarily centered around Apache TVM, a compiler stack for deep learning systems, and lacked a systematic introduction to machine learning systems. Stanford University also had a course in this area, "Machine Learning Systems Design," but it focused on topics such as data cleansing, management, and annotation, as databases were the course designer's primary expertise.
In my search for a suitable course, I expanded my scope to Microsoft Research Asia. Their "AI Systems" course seemed like the closest match to my expectations at the time, as it elaborated on the design concepts of machine learning systems. However, as I prepared to teach it to undergraduates, I realized that it provided only a general introduction to the core design concepts of machine learning systems and assumed students had a solid foundational knowledge of computer systems. It was better suited for doctoral students than undergraduates. In fact, all the courses I previously mentioned focused on studying research papers rather than on easily comprehensible textbooks that provide a clear knowledge map. Consequently, the materials involved in these courses were filled with scattered ideas, creating significant obstacles for students attempting to learn about machine learning systems.
On the flip side, 2020 was a year in which we saw the emergence of excellent course materials, providing fundamental knowledge about operating systems, databases, distributed systems, and even machine learning algorithms. However, it remained difficult to find a textbook that systematically introduces machine learning systems. Many enterprise and university labs needed to expend significant resources in order to train students and engineers from scratch and enhance their understanding of the fundamental architecture of machine learning systems. The absence of such textbooks presented a huge challenge in developing academic and industry talent. Against this backdrop, the idea of writing a textbook on machine learning systems began to take shape in my mind.
## Beginning
When I shared this idea with my friends, they recognized the immense value of writing such a textbook. However, the preparation and writing process involved could be a daunting uphill battle. My postdoctoral mentor advised me to focus on publishing high-impact papers at the beginning of my faculty career instead of spending significant amounts of time and energy on a book that may not even be published. Other professors preferred to revise existing textbooks rather than write new ones, particularly in the field of machine learning systems, which evolve rapidly through a process of trial and error. Even if a new book were published, it may become obsolete quickly due to technological advancements over time.
Despite encountering several obstacles, the idea of writing a textbook on machine learning systems never faded away, and it crystallized when I went to China for a holiday and spoke with Xuefeng Jin, the architect of MindSpore. We first met in London around Christmas time in 2019 when he was leading the development of MindSpore 1.0, which had yet to be launched. We became acquainted through our mutual interest in the development of machine learning systems. In 2018, I co-built a new machine learning framework from scratch, similar to PyTorch, with my colleagues. Although the project ended due to insufficient resources, the experience motivated me to publish several papers on machine learning systems. Xuefeng and I both recognized how challenging it was to develop AI systems and to find experts in machine learning system development. Students often focused more on machine learning algorithms and had only a superficial understanding of key system design principles. They did not realize the significance of these principles until they applied machine learning technologies in practice, but by that point, it was too late to learn them. I shared my idea with Xuefeng about writing a textbook on machine learning systems and anticipated that it might take three to four years to complete. Xuefeng had a similar idea and asked whether he could assist in any way.
Xuefeng's offer was enlightening. I started asking myself: why not break the conventional pattern of book writing, in which one or two professors chronicle the development of a discipline over many years? This pattern resembles the waterfall model in traditional software development, yet software development itself has since evolved toward open-source agile development. Why, then, should book writing follow the outdated approach? A good example of the alternative is *Dive into Deep Learning*, compiled by the MXNet open-source community. I immediately invited Hao Dong, an assistant professor at Peking University and co-founder of the TensorLayer open-source community, to collaborate with us. Excited about this prospect, Xuefeng invited his colleague, Zhiliang Gan, to join us. We were committed to creating a new textbook and finally settled down to writing.
After several rounds of discussion, we named the book **Machine Learning Systems: Design and Implementation**. Our intention was to introduce the time-tested design principles of machine learning systems and share a wealth of system implementation experience, so that students could learn how to analyze and solve problems in future work and scientific research.
## Community Building
Since the field of machine learning systems is an evolving discipline that continually nurtures a variety of research subjects, I pondered how to create an author community to ensure the book's sustainability. As my research expertise focuses on large-scale software systems, I chose to build a community by referencing several key design points of distributed systems, as follows:
- **Prevention of single-point failure or bottleneck:**
Modern distributed systems are typically designed to separate the control plane from the data plane to avoid single-point failure or bottleneck. To ensure the sustainability of the book, we decided to follow this approach and design a highly scalable writing community using a distributed mechanism. The editor spent most of their time searching for excellent, proactive, and responsible chapter owners. Chapter owners then collaborated with other authors to facilitate writing progress on a per-chapter basis, communicating with chapter authors about writing details and adhering to given deadlines. The editor and chapter owners had weekly meetings to synchronize writing progress and ensure that chapter content met the overall expectations of the editor and the community in terms of quality.
- **Iterative improvement:**
The stochastic gradient descent (SGD) optimization algorithm in deep learning uses local gradients to perform numerous iterations in complex problems and find local optimal solutions. I applied the same principles when designing the iterative improvement process for the book's quality. Similar to determining initial parameters, we drafted the first edition of the book on Overleaf. Then, we organized the content into a standard Git code repository and established a mechanism to encourage readers and community members to access issues and pull requests (PRs) on GitHub. We also set up comprehensive book building tools, continuous integration tools, and contributor seminars. This enabled us to continually improve the book's quality, aiming to achieve optimal quality. It was akin to the outcome we achieve in machine learning by following the SGD method.
- **High availability:** We established a 24/7 online writing platform for participants to develop the book and receive feedback from the community in any time zone and language around the world. The Git repository was hosted on GitHub and mirrored on Gitee to ensure high availability of the writing platform.
- **Content neutralization:** In a distributed system, the equal treatment of each node is crucial for long-term operation, as it allows for a unified approach to rectifying issues. Similarly, in writing a book, we must anticipate potential challenges such as outdated designs or the departure of writers, and mitigate them through collaboration among participants from diverse backgrounds. We emphasize the importance of creating neutral, objective, and inclusive content and ensuring that any issues that arise do not impede progress.
## Current Situation and Future Outlook
With the established mechanism, writing progressed smoothly and more participants joined the project. My former students Xiulong Yuan, Zihan Ding, Yao Fu, Jie Ren, and Wenteng Liang were also dedicated to writing and editing this book. Jiarong Han and Cheng Lai from Peng Cheng Laboratory, along with numerous MindSpore developers, all made significant contributions to the book. Many senior designers of machine learning systems also held discussions with us through various channels and provided valuable feedback for the book. In addition, many top minds from academia and industry shared their thoughts with us, and talented students worldwide participated in the writing. They included Jiankai Sun from Stanford University, Peiyuan Liao from Carnegie Mellon University, Hanchen Wang from Cambridge University, and Pei Mu from the University of Edinburgh. Kaiyan Xiao, a machine learning expert from GlaxoSmithKline PLC, also became one of the authors. Furthermore, professors Peter Pietzuch from Imperial College London and Lei Chen from Hong Kong University of Science and Technology, among others, provided continuous writing advice to enhance the book's quality.
After we implemented the "distributed system" for book writing, the book's quality has continually improved. When we released the book as an open-source project, the number of participants rapidly increased, coming as a major surprise to us. Driven by the open-source community, the English and Chinese versions of the book have been advanced. This was the first time that I realized the huge benefit of using the idea of distributed systems and the knowledge of machine learning in solving complex problems in real life.
A single tree is too weak to withstand a sandstorm. Similarly, it was the forest of friends and the power of the community that gave us the courage to take the very first and crucial step in writing this book. I hope that this way of thinking can inspire and help in finding solutions to other complex problems.
By May 2022, the core authors and editors (Luo Mai, Hao Dong, Xuefeng Jin, and Zhiliang Gan), the book coordinator (Zhipeng Tan), and the following contributors have endeavored to create this book: **Introduction** (Luo Mai, Hao Dong, and Zhiliang Gan), **Programming Model** (Cheng Lai, Luo Mai, and Hao Dong), **Computational Graph** (Jiarong Han, Luo Mai, and Hao Dong), **AI Compiler and Frontend Technology** (Zhibo Liang, Qinghua Zhang, Bingjian Huang, Jianfeng Yu, and Zhiliang Gan), **AI Compiler Backend and Runtime** (Jinjin Chu, Pei Mu, and Fubi Cai), **Hardware Accelerator** (Renwei Zhang, Jie Ren, Wenteng Liang, Chao Liu, Gang Chen, and Mingqi Li), **Data Processing** (Xiulong Yuan), **Model Deployment** (Gangqiang Han, Yehui Tang, Zhiqiang Zhai, and Shanni Li), **Distributed Training** (Luo Mai and Peiyuan Liao), **Federated Learning System** (Tiancheng Wu and Hanchen Wang), **Recommender System** (Yao Fu, Bei Pei, and Luo Mai), **Reinforcement Learning System** (Zihan Ding), **Explainable AI System** (Haoyang Li and Xiaohui Li), and **Robotic System** (Jiankai Sun and Kaiyan Xiao).
We welcome new contributors to help improve and expand the book's content. If you're interested, please contact us through our book's [OpenMLSys Community](https://openmlsys.github.io/html-en/). Let's work together to create a machine learning systems book that advances the world.
Luo Mai
Edinburgh, United Kingdom
4th May 2022

# Part I Framework Design
:label:`part-i-framework-design`
In Part I, we present a top-down approach to designing a machine
learning framework. We begin by introducing the design of programming
models for machine learning frameworks, followed by a discussion on
representing a machine learning program as a computational graph. The
machine learning program undergoes compilation by an AI compiler, which
employs a range of frontend and backend techniques. Additionally, we
will delve into the system components within a machine learning
framework that facilitate data processing, model deployment, and
distributed training.

# Part II Application Scenarios
:label:`part-ii-application-scenarios`
In Part II, we will introduce various scenarios of applying machine
learning frameworks. These scenarios include federated learning systems,
recommender systems, reinforcement learning systems, and robotic
systems.

# Bridging Python and C/C++ Functions
Developers frequently encounter the need to incorporate custom operators
into a machine learning framework. These operators implement new models,
optimizers, data processing functions, and more. Custom operators, in
particular, often require implementation in C/C++ to achieve optimized
performance. They also have Python interfaces, facilitating developers
to integrate custom operators with existing machine learning workflows
written in Python. This section will delve into the implementation
details of this process.
The Python interpreter, being implemented in C, enables the invocation
of C and C++ functions within Python. Contemporary machine learning
frameworks such as TensorFlow, PyTorch, and MindSpore rely on pybind11
to automatically generate Python functions from underlying C and C++
functions. This mechanism is known as *Python binding*. Prior to the
advent of pybind11, Python binding was accomplished using one of the
following approaches:
1. **C-APIs in Python**: This approach necessitates the inclusion of
`Python.h` in C++ programs and the utilization of Python's C-APIs to
execute Python operations. To effectively work with C-APIs,
developers must possess a comprehensive understanding of Python's
internal implementation, such as managing reference counting.
2. **Simplified Wrapper and Interface Generator (SWIG)**: SWIG serves
as a bridge between C/C++ code and Python, and it played a
significant role in the initial development of TensorFlow. Utilizing
SWIG involves crafting intricate interface statements and relying on
SWIG to automatically generate C code that interfaces with Python's
C-APIs. However, due to the lack of readability in the generated
code, the maintenance costs associated with it tend to be high.
3. **Python `ctypes` module**: This module encompasses a comprehensive
range of types found in the C language and allows direct invocation
of dynamic link libraries (DLLs). However, a limitation of this
module is its heavy reliance on native C types, which results in
insufficient support for customized types.
4. **Cython**: In basic terms, Cython can be described as the fusion
    of Python syntax with static types from the C language. It retains
    Python's syntax while automatically translating Cython functions
    into C/C++ code, empowering developers to seamlessly incorporate
    invocations of C/C++ functions within Cython code.
5. **Boost::Python (a C++ library)**: Boost::Python allows for the
exposure of C++ functions as Python functions. It operates on
similar principles to Python's C-APIs but provides a more
user-friendly interface. However, the reliance on the Boost library
introduces a significant dependency on third-party components, which
can be a potential drawback for Boost::Python.
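To make the `ctypes` approach concrete, here is a minimal sketch (independent of any machine learning framework) that loads the C standard library and calls its `abs` function from Python; the library name is resolved at runtime because it differs across platforms:

```python
import ctypes
import ctypes.util

# Locate and load the C standard library; the file name differs across
# platforms, so it is resolved at runtime rather than hard-coded.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature of abs() so ctypes marshals arguments and the
# return value with the correct native types.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

print(libc.abs(-42))  # 42
```

The explicit `argtypes`/`restype` declarations hint at the limitation noted above: every native type must be spelled out by hand, which becomes unwieldy for the customized types found in machine learning frameworks.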
In comparison to the above Python binding approaches, pybind11 shares
similarities with Boost::Python in terms of simplicity and usability.
However, pybind11 stands out due to its focus on supporting C++11 and
eliminating dependencies on Boost. As a lightweight Python library,
pybind11 is particularly suitable for exposing numerous Python functions
in complex C++ projects such as the machine learning system discussed in
this book. The combination of Code
`ch02/code2.5.1` and Code
`ch02/code2.5.2` is an example of adding a custom operator to
PyTorch with the integration of C++ and Python.
In C++:
**ch02/code2.5.1**
```cpp
//custom_add.cpp
#include <torch/extension.h>
#include <pybind11/pybind11.h>
torch::Tensor custom_add(torch::Tensor a, torch::Tensor b) {
return a + b;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("custom_add", &custom_add, "A custom add function");
}
```
In Python:
**ch02/code2.5.2**
```python
import torch
from torch.utils.cpp_extension import load
# Load the C++ extension
custom_extension = load(
name='custom_extension',
sources=['custom_add.cpp'],
verbose=True
)
# Use your custom add function
a = torch.randn(10)
b = torch.randn(10)
c = custom_extension.custom_add(a, b)
```

# Overview
With the advent of machine learning systems, the design of user-friendly
and high-performance APIs has become a paramount concern for system
designers. In the early stages of machine learning frameworks (as
depicted in Figure :numref:`ch03/framework_development_history`), developers often
opted for high-level programming languages like Lua (Torch) and Python
(Theano) to write machine learning programs. These frameworks offered
essential functions, including model definition and automatic
differentiation, which are integral to machine learning. They were
particularly well-suited for creating small-scale machine learning
applications targeted toward scientific research purposes.
<figure id="fig:ch03/framework_development_history">
<embed src="../img/ch03/framework_development_history.pdf" />
<figcaption> Evolution of Machine Learning Programming Frameworks: A
Historical Perspective</figcaption>
</figure>
The rapid advancement of deep neural networks (DNNs) since 2011 has
sparked groundbreaking achievements in various AI application domains,
such as computer vision, speech recognition, and natural language
processing. However, training DNNs requires substantial computational
power. Unfortunately, earlier frameworks like Torch (primarily using
Lua) and Theano (mainly using Python) were unable to fully harness this
computing power. On the other hand, general-purpose APIs like CUDA C for
computational accelerators such as NVIDIA GPUs have become increasingly
mature, and multi-thread libraries like POSIX Threads built on CPU
multi-core technology have gained popularity among developers.
Consequently, many machine learning users sought to develop
high-performance deep learning applications utilizing C/C++. These
requirements led to the emergence of frameworks like Caffe, which
employed C/C++ as their core APIs.
However, customization of machine learning models is often necessary to
suit specific deployment scenarios, data types, identification tasks,
and so on. This customization typically falls on the shoulders of AI
application developers, who may come from diverse backgrounds and may
not fully leverage the capabilities of C/C++. This became a significant
bottleneck that hindered the widespread adoption of programming
frameworks like Caffe, which heavily relied on C/C++.
In late 2015, Google introduced TensorFlow, which revolutionized the
landscape. In contrast to Torch, TensorFlow adopted a design where the
frontend and backend were relatively independent. The frontend,
presented to users, utilized the high-level programming language Python,
while the high-performance backend was implemented in C/C++. TensorFlow
provided numerous Python-based frontend APIs, gaining wide acceptance
among data scientists and machine learning researchers. It seamlessly
integrated into Python-dominated big data ecosystems, benefiting from
various big data development libraries such as NumPy, Pandas, SciPy,
Matplotlib, and PySpark. Python's exceptional interoperability with
C/C++, as demonstrated in multiple Python libraries, further enhanced
TensorFlow's appeal. Consequently, TensorFlow combined the flexibility
and ecosystem of Python with high-performance capabilities offered by
its C/C++ backend. This design philosophy was inherited by subsequent
frameworks like PyTorch, MindSpore, and PaddlePaddle.
Subsequently, as observed globally, prominent enterprises started
favoring open-source machine learning frameworks, leading to the
emergence of Keras and TensorLayerX. These high-level libraries
significantly expedited the development of machine learning
applications. They provided Python APIs that allowed quick importing of
existing models, and these high-level APIs were decoupled from the
intricate implementation details of specific machine learning
frameworks. As a result, Keras and TensorLayerX could be utilized across
different machine learning frameworks.
While deep neural networks continued to evolve, new challenges surfaced
regarding the APIs of machine learning frameworks. Around 2020, novel
frameworks like MindSpore and JAX emerged to tackle these challenges.
MindSpore, in addition to inheriting the hybrid interfaces (Python and
C/C++) from TensorFlow and PyTorch, expanded the scope of machine
learning programming models. This expansion facilitated efficient
support for a diverse range of AI backend chips, including NVIDIA GPU,
Huawei Ascend, and ARM. Consequently, machine learning applications can
be swiftly deployed across a wide array of heterogeneous devices.
Simultaneously, the proliferation of ultra-large datasets and
ultra-large DNNs necessitated distributed execution as a fundamental
design requirement for machine learning programming frameworks. However,
implementing distributed execution in TensorFlow and PyTorch required
developers to write substantial amounts of code for allocating datasets
and DNNs across distributed nodes. Yet, many AI developers are not
well-versed in distributed programming. In this regard, JAX and
MindSpore significantly improve the situation by enabling programs
written for a single node to run seamlessly across many nodes.

# Programming Model
Machine learning frameworks comprise various components that facilitate
the efficient development of algorithms, data processing, model
deployment, performance optimization, and hardware acceleration. When
designing the application programming interfaces (APIs) for these
components, a key consideration is striking the right balance between
framework performance and usability. To achieve optimal performance,
developers utilize C or C++, as these programming languages enable
efficient invocation of the APIs provided by the operating system and
hardware accelerators.
Regarding usability, machine learning framework users, including data
scientists, biologists, chemists, and physicists, often possess strong
industrial backgrounds and are skilled in using high-level scripting
languages like Python, Matlab, R, and Julia. While these languages offer
remarkable programming usability, they lack deep optimization
capabilities for underlying hardware or operating systems compared to C
and C++. Therefore, the core design objective of machine learning
frameworks encompasses two aspects: providing easy-to-use APIs for
implementing algorithms using high-level languages like Python, and
providing low-level APIs centered around C and C++ to assist framework
developers in implementing numerous high-performance components and
efficiently executing them on hardware. This chapter describes
strategies for achieving this design objective.
The chapter aims to achieve the following learning objectives:
1. Understanding the workflows and programming principles of machine
learning frameworks.
2. Understanding the design of neural network models and layers.
3. Understanding how machine learning frameworks bridge Python and
C/C++ functions.
4. Understanding the support for functional programming in machine
learning frameworks.
```toc
:maxdepth: 2
Overview
Machine_Learning_Workflow
Neural_Network_Programming
Functional_Programming
Bridging_Python_and_C_C++_Functions
Chapter_Summary
```

# Functional Programming
In the following, we will discuss the reasons behind the growing trend
of incorporating functional programming into the design of machine
learning frameworks.
## Benefits of Functional Programming
Training constitutes the most critical phase in machine learning, and
the manner in which training is depicted hinges significantly on
optimizer algorithms. Predominantly, contemporary machine learning tasks
utilize first-order optimizers, favored for their ease of use. With
machine learning advancing at a rapid pace, both software and hardware
are incessantly updated to stay abreast. Consequently, an increasing
number of researchers are beginning to investigate higher-order
optimizers, noted for their superior convergence performance. Frequently
utilized second-order optimizers, such as the Newton method,
quasi-Newton method, and AdaHessian, necessitate the computation of a
Hessian matrix incorporating second-order derivative information. Two
considerable challenges arise from this computation: 1) how to manage
such a hefty computational load efficiently; 2) how to express
higher-order derivatives in programmatic language.
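To make "second-order derivative information" concrete, the following toy sketch estimates $f''(x)$ with a central finite difference. This is purely illustrative: real second-order optimizers obtain Hessian information via automatic differentiation rather than numerical differencing.

```python
def second_derivative(f, x, h=1e-4):
    # Central finite-difference estimate of the second derivative:
    #   f''(x) ~ (f(x + h) - 2 f(x) + f(x - h)) / h**2
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

cubic = lambda x: x ** 3              # analytically, f''(x) = 6x
print(second_derivative(cubic, 2.0))  # close to 12.0
```

For a model with $n$ parameters, the full Hessian requires $O(n^2)$ such entries, which is why managing the computational load is the first challenge listed above.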
In recent times, numerous large AI models have been introduced, which
include (with the number of parameters noted in parentheses) OpenAI
GPT-3 (175B) in 2020; PanGu (100B), PanGu-$\alpha$ (200B), Google's
Switch Transformer (1.6T), and WuDao (1.75T) in 2021; along with
Facebook's NLLB-200 (54B) in 2022. The demand for ultra-large model
training is escalating, and data parallelism alone cannot meet this
growing requirement. Conversely, model parallelism demands manual model
segmentation, a process that is time-intensive and laborious.
Consequently, the main challenge future machine learning frameworks must
overcome is how to actualize automatic parallelism. At its core, a
machine learning model is a representation of a mathematical model.
Hence, the ability to succinctly represent machine learning models has
risen to a key concern in the design of programming paradigms for
machine learning frameworks.
Recognizing the challenges presented by the practical implementation of
machine learning frameworks, researchers have identified that functional
programming could offer beneficial solutions. Functional programming, in
computer science, is a programming paradigm that envisions computation
as the evaluation of mathematical functions, actively avoiding state
changes and data mutations. This paradigm harmonizes well with
mathematical reasoning. Neural networks are composed of interconnected
nodes, with each node performing basic mathematical operations.
Functional programming languages allow developers to portray these
mathematical operations in a language that closely mirrors the
operations, enhancing the readability and maintainability of programs.
Concurrently, in functional languages, functions are kept separate,
simplifying the management of concurrency and parallelism.
In summary, functional programming is anticipated to confer the
following benefits to machine learning frameworks:
1. It is suited for machine learning scenarios where higher-order
derivatives are needed.
2. It simplifies the development of parallel programming interfaces.
3. It results in a more concise code representation.
## Framework Support for Functional Programming
Machine learning frameworks have increasing support for functional
programming. In 2018, Google rolled out JAX. Contrary to traditional
machine learning frameworks, JAX amalgamates neural network computation
and numerical computation. Its interfaces are compatible with native
data science interfaces in Python, such as NumPy and SciPy. Moreover,
JAX extends distribution, vectorization, high-order derivation, and
hardware acceleration in a functional programming style, characterized
by Lambda closure and no side effects.
In 2020, Huawei introduced MindSpore, the functional differential
programming architecture of which allows users to concentrate on the
native mathematical expressions of machine learning models. In 2022,
taking inspiration from Google's JAX, PyTorch launched functorch.
Functorch is essentially a library aimed at providing composable vmap
(vectorization) and autodiff transforms compatible with PyTorch modules
and PyTorch autograd, thereby achieving excellent eager-mode
performance. It can be inferred that functorch meets the requirements
for distributed parallelism in PyTorch static graphs. Code
`ch02/code2.4` gives an example of functorch.
**ch02/code2.4**
```python
from functorch import combine_state_for_ensemble, vmap

# Assumes `data`, `num_models`, `device`, and an `MLP` module are defined
models = [MLP().to(device) for _ in range(num_models)]
minibatches = data[:num_models]  # one minibatch per model in the ensemble
# Stack the models' parameters and buffers so a single stateless call
# evaluates the whole ensemble
fmodel, params, buffers = combine_state_for_ensemble(models)
predictions1_vmap = vmap(fmodel, out_dims=1)(params, buffers, minibatches)
```
Functorch introduces *vmap*, standing for "vectorized map". Its role
is to adapt functions designed for individual inputs so that they can
handle batches of inputs, thereby facilitating efficient vectorized
calculations. Unlike the batch processing capabilities of standard
PyTorch modules, vmap can convert any operation to be batch-aware
without the need to alter the operation's original structure. Moreover,
vmap offers greater flexibility over batch dimensions, allowing users to
specify where the batch dimension should appear in the output (via the
`out_dims` argument), in contrast to the default behaviour of standard
PyTorch, where the first dimension is usually treated as the batch
dimension.
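The batching semantics described above can be mimicked by a toy, plain-Python sketch (a conceptual model only, not the functorch API): the function is mapped over the leading batch dimension, and `out_dims` controls where the batch dimension appears in the output:

```python
def vmap_sketch(fn, inputs, out_dims=0):
    """Toy model of vmap: map `fn` over the leading (batch) dimension of
    `inputs`, then place the batch dimension according to `out_dims`."""
    outs = [fn(x) for x in inputs]       # map over the batch dimension
    if out_dims == 0:
        return outs                      # batch remains the leading dim
    # out_dims=1: move the batch dimension one level inside each output
    return [list(group) for group in zip(*outs)]

square_each = lambda row: [v * v for v in row]
batch = [[1, 2], [3, 4]]
print(vmap_sketch(square_each, batch))              # [[1, 4], [9, 16]]
print(vmap_sketch(square_each, batch, out_dims=1))  # [[1, 9], [4, 16]]
```

Note that `square_each` itself knows nothing about batches; the wrapper alone makes it batch-aware, which is the essence of vmap.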
By tracing the development of machine learning frameworks, it becomes
evident that the functional programming paradigm has become increasingly
popular. This can be attributed to functional programming's ability to
express machine learning models intuitively and its convenience for
implementing automatic differentiation, high-order derivation, and
parallel execution. Consequently, future machine learning frameworks are
likely to adopt layered frontend interfaces that are not exclusively
designed for machine learning scenarios. Instead, they will primarily
offer differential programming in their abstraction designs, making
gradient-based software easy to develop for various applications.

# Machine Learning Workflow
In machine learning systems, the fundamental design objective of
programming models is to offer comprehensive workflow programming
support for developers. A typical machine learning task adheres to the
workflow depicted in Figure :numref:`ch03/workflow`. This workflow involves loading the
training dataset, training, testing, and debugging models. The following
APIs are defined to facilitate customization within the workflow
(assuming that high-level APIs are provided as Python functions):
1. **Data Processing API:** Users first require a data processing API
to read datasets from a disk. Subsequently, they need to preprocess
the data to make it suitable for input into machine learning models.
Code `ch02/code2.2.1` is an example of how PyTorch can be used
to load data and create data loaders for both training and testing
purposes.
**ch02/code2.2.1**
```python
import pickle
from torch.utils.data import Dataset, DataLoader
data_path = '/path/to/data'
dataset = pickle.load(open(data_path, 'rb')) # Example for a pkl file
batch_size = ... # You can make it an argument of the script
class CustomDataset(Dataset):
def __init__(self, data, labels):
self.data = data
self.labels = labels
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
sample = self.data[idx]
label = self.labels[idx]
return sample, label
training_dataset = CustomDataset(dataset['training_data'], dataset['training_labels'])
testing_dataset = CustomDataset(dataset['testing_data'], dataset['testing_labels'])
training_dataloader = DataLoader(training_dataset, batch_size=batch_size, shuffle=True) # Create a training dataloader
testing_dataloader = DataLoader(testing_dataset, batch_size=batch_size, shuffle=False) # Create a testing dataloader
```
2. **Model Definition API:** Once the data is preprocessed, users need
a model definition API to define machine learning models. These
models include model parameters and can perform inference based on
given data. Code
`ch02/code2.2.2` is an example of how to create a custom
model in PyTorch:
**ch02/code2.2.2**
```python
import torch.nn as nn

class CustomModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(CustomModel, self).__init__()
        self.linear = nn.Linear(input_size, output_size)  # A single linear layer

    def forward(self, x):
        return self.linear(x)
```
3. **Optimizer Definition API:** The outputs of models need to be
compared with user labels, and their difference is evaluated using a
loss function. The optimizer definition API enables users to define
their own loss functions and import or define optimization
algorithms based on the actual loss. These algorithms calculate
gradients and update model parameters. Code
`ch02/code2.2.3` is an example of an optimizer definition
in PyTorch:
**ch02/code2.2.3**
```python
import torch.nn as nn
import torch.optim as optim

model = CustomModel(...)
# Optimizer definition (Adam, SGD, etc.). Note that Adam takes no momentum
# argument; use optim.SGD(model.parameters(), lr=1e-4, momentum=0.9) for momentum SGD.
optimizer = optim.Adam(model.parameters(), lr=1e-4)
loss = nn.CrossEntropyLoss()  # Loss function definition
```
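The parameter update that such an optimizer performs can be illustrated without any framework. Below is a minimal sketch of gradient descent on the one-parameter squared-error loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the `sgd_step` helper is an illustrative stand-in for what `optimizer.step()` does per parameter:

```python
def sgd_step(w, grad, lr):
    """One gradient-descent update: move the parameter against its gradient."""
    return w - lr * grad

w = 0.0  # initial parameter
for _ in range(100):
    grad = 2.0 * (w - 3.0)  # gradient of the loss (w - 3)^2
    w = sgd_step(w, grad, lr=0.1)
print(round(w, 4))  # → 3.0, the minimum of the loss
```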
4. **Training API:** Given a dataset, model, loss function, and
optimizer, users require a training API to define a loop that reads
data from datasets in a mini-batch mode. In this process, gradients
are computed repeatedly, and model parameters are updated
accordingly. This iterative update process is known as *training*.
Code `ch02/code2.2.4` is an example of how to train a model in
PyTorch:
**ch02/code2.2.4**
```python
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"  # Select your training device
model.to(device)  # Move the model to the training device
model.train()  # Set the model to train mode
epochs = ...  # You can make it an argument of the script

for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(training_dataloader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()  # Zero the parameter gradients
        output = model(data)  # Forward pass
        loss_value = loss(output, target)  # Compute the loss
        loss_value.backward()  # Backpropagation
        optimizer.step()  # Update the model parameters
```
5. **Testing and Debugging APIs:** Throughout the training process,
users need a testing API to evaluate the accuracy of the model
(training concludes when the accuracy exceeds the set goal).
Additionally, a debugging API is necessary to verify the performance
and correctness of the model. Code
`ch02/code2.2.5` is an example of model evaluation in
PyTorch:
**ch02/code2.2.5**
```python
model.eval()  # Set the model to evaluation mode
overall_accuracy = []
with torch.no_grad():  # Disable gradient tracking during evaluation
    for batch_idx, (data, target) in enumerate(testing_dataloader):
        data, target = data.to(device), target.to(device)
        output = model(data)  # Forward pass
        accuracy = your_metrics(output, target)  # Compute the accuracy with your own metric function
        overall_accuracy.append(accuracy)  # Collect the per-batch accuracy
# For debugging, you can print logs inside the training or evaluation loop, or use the Python debugger.
```
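The stopping rule described above (training concludes once accuracy exceeds the set goal) can be sketched independently of any framework; `train_one_epoch` and `evaluate` below are hypothetical stand-ins for the training and testing loops shown earlier:

```python
def train_until_goal(train_one_epoch, evaluate, goal, max_epochs):
    """Alternate training and evaluation; stop once accuracy reaches the goal."""
    history = []
    for epoch in range(max_epochs):
        train_one_epoch()
        accuracy = evaluate()
        history.append(accuracy)
        if accuracy >= goal:
            break  # goal reached: stop training early
    return history

# Toy stand-ins: accuracy improves by 10 points per epoch
state = {"accuracy": 50}
history = train_until_goal(
    train_one_epoch=lambda: state.update(accuracy=state["accuracy"] + 10),
    evaluate=lambda: state["accuracy"],
    goal=90,
    max_epochs=100,
)
print(history)  # → [60, 70, 80, 90]: stops after 4 epochs, not 100
```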
![Workflow within a machine learning system](../img/ch03/workflow.pdf)
:label:`ch03/workflow`
