mirror of
https://github.com/openmlsys/openmlsys-zh.git
synced 2026-03-24 14:00:43 +08:00
feat: add v1/v2 versioning with language selector (#494)
* feat: add v1/v2 versioning and language selector for mdbook
  - Copy current content to v1/ directory (1st Edition)
  - Create v2/ directory with new TOC structure (2nd Edition) and placeholder chapters
  - Add version selector (V1/V2) and language toggle (EN/ZH) in the top-right nav bar
  - Add build scripts: build_mdbook_v1.sh, build_mdbook_v2.sh
  - Update assemble_docs_publish_tree.py to support the v1/v2 deployment layout
  - Fix the mdbook preprocessor to use the 'sections' key (v0.4.43 compatibility)
  - Update .gitignore for the new build-artifact directories
  - Deployment layout: / = v2 EN, /cn/ = v2 ZH, /v1/ = v1 EN, /v1/cn/ = v1 ZH

* build: update CI to build and verify all four books (v1/v2 x EN/ZH)
  - Clarify step names: "Build v2 (EN + ZH)" and "Build v1 (EN + ZH)"
  - Add a verification step that checks that all four index.html outputs exist
  - The deploy workflow assembles: / = v2 EN, /cn/ = v2 ZH, /v1/ = v1 EN, /v1/cn/ = v1 ZH

* fix: gracefully skip missing TOC entries instead of crashing
  - resolve_toc_target() now returns None for missing files instead of raising FileNotFoundError. This fixes the v1 EN build, where chapter index files reference TOC entry names that do not match the actual filenames.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
18  .github/workflows/main.yml (vendored)
@@ -31,8 +31,18 @@ jobs:
           python3 -m unittest discover -s tests -p 'test_ensure_book_resources.py'
           python3 -m unittest discover -s tests -p 'test_update_docs_workflow.py'

-      - name: Build English HTML with mdBook
-        run: bash build_mdbook.sh
+      - name: Build v2 (EN + ZH) with mdBook
+        run: bash build_mdbook_v2.sh

-      - name: Build Chinese HTML with mdBook
-        run: bash build_mdbook_zh.sh
+      - name: Build v1 (EN + ZH) with mdBook
+        run: bash build_mdbook_v1.sh
+
+      - name: Verify build outputs
+        run: |
+          for d in .mdbook-v2/book .mdbook-v2-zh/book .mdbook-v1/book .mdbook-v1-zh/book; do
+            if [ ! -f "$d/index.html" ]; then
+              echo "ERROR: $d/index.html not found"
+              exit 1
+            fi
+            echo "OK: $d/index.html exists"
+          done
26  .github/workflows/update_docs.yml (vendored)
@@ -29,13 +29,23 @@ jobs:
           python3 -m unittest discover -s tests -p 'test_assemble_docs_publish_tree.py'
           python3 -m unittest discover -s tests -p 'test_ensure_book_resources.py'

-      - name: Build English HTML with mdBook
-        run: bash build_mdbook.sh
+      - name: Build v2 (EN + ZH) with mdBook
+        run: bash build_mdbook_v2.sh

-      - name: Build Chinese HTML with mdBook
-        run: bash build_mdbook_zh.sh
+      - name: Build v1 (EN + ZH) with mdBook
+        run: bash build_mdbook_v1.sh

-      - name: Deploy to openmlsys.github.io
+      - name: Verify build outputs
+        run: |
+          for d in .mdbook-v2/book .mdbook-v2-zh/book .mdbook-v1/book .mdbook-v1-zh/book; do
+            if [ ! -f "$d/index.html" ]; then
+              echo "ERROR: $d/index.html not found"
+              exit 1
+            fi
+            echo "OK: $d/index.html exists"
+          done
+
+      - name: Assemble and deploy to openmlsys.github.io
         env:
           DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
         run: |
@@ -44,8 +54,10 @@ jobs:
           python3 tools/assemble_docs_publish_tree.py \
             --destination-root openmlsys.github.io \
             --docs-subdir docs \
-            --en-source .mdbook/book \
-            --zh-source .mdbook-zh/book
+            --v2-en-source .mdbook-v2/book \
+            --v2-zh-source .mdbook-v2-zh/book \
+            --v1-en-source .mdbook-v1/book \
+            --v1-zh-source .mdbook-v1-zh/book

           cd openmlsys.github.io
           git config user.name "github-actions[bot]"
20  .gitignore (vendored)
@@ -16,6 +16,10 @@ env
 .mdbook-zh/
 .mdbook-zh-test/
 .mdbook-bin/
+.mdbook-v1/
+.mdbook-v1-zh/
+.mdbook-v2/
+.mdbook-v2-zh/
 task_plan.md
 findings.md
 progress.md
@@ -29,3 +33,19 @@ zh_chapters/img
 zh_chapters/references
 zh_chapters/static
 zh_chapters/mlsys.bib
+v1/en_chapters/img
+v1/en_chapters/references
+v1/en_chapters/static
+v1/en_chapters/mlsys.bib
+v1/zh_chapters/img
+v1/zh_chapters/references
+v1/zh_chapters/static
+v1/zh_chapters/mlsys.bib
+v2/en_chapters/img
+v2/en_chapters/references
+v2/en_chapters/static
+v2/en_chapters/mlsys.bib
+v2/zh_chapters/img
+v2/zh_chapters/references
+v2/zh_chapters/static
+v2/zh_chapters/mlsys.bib
@@ -15,4 +15,5 @@ command = "python3 tools/mdbook_preprocessor.py"
 mathjax-support = true
 git-repository-url = "https://github.com/openmlsys/openmlsys-zh"
 preferred-dark-theme = "navy"
-additional-css = ["theme/dark-mode-images.css"]
+additional-css = ["theme/dark-mode-images.css", "theme/version-selector.css"]
+additional-js = ["theme/version-selector.js"]
@@ -15,4 +15,5 @@ command = "python3 ../../tools/mdbook_zh_preprocessor.py"
 mathjax-support = true
 git-repository-url = "https://github.com/openmlsys/openmlsys-zh"
 preferred-dark-theme = "navy"
-additional-css = ["theme/dark-mode-images.css"]
+additional-css = ["theme/dark-mode-images.css", "theme/version-selector.css"]
+additional-js = ["theme/version-selector.js"]
48  books/zh/theme/version-selector.css (new file)
@@ -0,0 +1,48 @@
/* Version and Language selectors — inline in .right-buttons */
.openmlsys-nav-selectors {
  display: inline-flex;
  align-items: center;
  gap: 4px;
  margin-right: 4px;
  vertical-align: middle;
}

/* Shared style for all selector links/buttons */
.openmlsys-selector-link {
  display: inline-flex;
  align-items: center;
  justify-content: center;
  min-width: 32px;
  height: 28px;
  padding: 0 8px;
  border-radius: 4px;
  border: 1px solid transparent;
  color: var(--icons, #747474);
  font-size: 12px;
  font-weight: 600;
  text-decoration: none;
  cursor: pointer;
  line-height: 1;
  transition: color 0.1s, background 0.1s;
}

.openmlsys-selector-link:hover {
  color: var(--icons-hover, #333);
  background: var(--theme-hover, rgba(0, 0, 0, 0.05));
}

/* Active/current indicator */
.openmlsys-selector-link.active {
  color: var(--links, #4183c4);
  border-color: var(--links, #4183c4);
  font-weight: 700;
}

/* Separator between version and language groups */
.openmlsys-selector-sep {
  width: 1px;
  height: 18px;
  background: var(--icons, #747474);
  opacity: 0.3;
  margin: 0 2px;
}
74  books/zh/theme/version-selector.js (new file)
@@ -0,0 +1,74 @@
// Version and Language selector for OpenMLSys mdbook
(function () {
  "use strict";

  var path = window.location.pathname;

  // Detect current version and language from URL
  var currentVersion = "v2";
  var currentLang = "en";

  if (path.match(/\/v1(\/|$)/)) {
    currentVersion = "v1";
  }
  if (path.match(/\/cn(\/|$)/)) {
    currentLang = "zh";
  }

  // Build base paths
  function basePath(version, lang) {
    var docsRoot = path.replace(/\/v1\/.*/, "/").replace(/\/cn\/.*/, "/");
    docsRoot = docsRoot.replace(/(\/docs\/?).*/, "/docs/");
    if (!docsRoot.endsWith("/")) docsRoot += "/";

    var p = docsRoot;
    if (version === "v1") p += "v1/";
    if (lang === "zh") p += "cn/";
    return p;
  }

  var container = document.createElement("span");
  container.className = "openmlsys-nav-selectors";

  // --- Version links: V1 | V2 ---
  var versions = [
    { label: "V1", value: "v1" },
    { label: "V2", value: "v2" },
  ];

  versions.forEach(function (v) {
    var a = document.createElement("a");
    a.className = "openmlsys-selector-link";
    a.textContent = v.label;
    a.href = basePath(v.value, currentLang);
    if (v.value === currentVersion) a.classList.add("active");
    container.appendChild(a);
  });

  // Separator
  var sep = document.createElement("span");
  sep.className = "openmlsys-selector-sep";
  container.appendChild(sep);

  // --- Language toggle: single button that switches to the other language ---
  var otherLang = currentLang === "zh" ? "en" : "zh";
  var langLink = document.createElement("a");
  langLink.className = "openmlsys-selector-link";
  langLink.textContent = currentLang === "zh" ? "EN" : "ZH";
  langLink.href = basePath(currentVersion, otherLang);
  container.appendChild(langLink);

  // Insert into .right-buttons, before existing icons
  function insertSelector() {
    var rightButtons = document.querySelector(".right-buttons");
    if (rightButtons) {
      rightButtons.insertBefore(container, rightButtons.firstChild);
    }
  }

  if (document.readyState === "loading") {
    document.addEventListener("DOMContentLoaded", insertSelector);
  } else {
    insertSelector();
  }
})();
32  build_mdbook_v1.sh (new executable file)
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PYTHON_BIN="$(command -v python3 || command -v python || true)"

if [[ -z "${PYTHON_BIN}" ]]; then
  echo "Python is required to prepare the mdBook staging tree." >&2
  exit 1
fi

if ! command -v mdbook >/dev/null 2>&1; then
  echo "mdbook is not installed. Install it first, for example with: cargo install mdbook" >&2
  exit 1
fi

# ── English v1 ────────────────────────────────────────────────────────────────
"${PYTHON_BIN}" "${ROOT}/tools/ensure_book_resources.py" --chapter-dir "${ROOT}/v1/en_chapters"
"${PYTHON_BIN}" "${ROOT}/tools/prepare_mdbook.py" \
  --source "${ROOT}/v1/en_chapters" \
  --summary-output "${ROOT}/v1/en_chapters/SUMMARY.md" \
  --placeholder-prefix "[TODO: src = zh_chapters/"

mdbook build "${ROOT}/v1"

# ── Chinese v1 ────────────────────────────────────────────────────────────────
"${PYTHON_BIN}" "${ROOT}/tools/ensure_book_resources.py" --chapter-dir "${ROOT}/v1/zh_chapters"
"${PYTHON_BIN}" "${ROOT}/tools/prepare_mdbook_zh.py" \
  --source "${ROOT}/v1/zh_chapters" \
  --summary-output "${ROOT}/v1/zh_chapters/SUMMARY.md"

mdbook build "${ROOT}/v1/books/zh"
32  build_mdbook_v2.sh (new executable file)
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PYTHON_BIN="$(command -v python3 || command -v python || true)"

if [[ -z "${PYTHON_BIN}" ]]; then
  echo "Python is required to prepare the mdBook staging tree." >&2
  exit 1
fi

if ! command -v mdbook >/dev/null 2>&1; then
  echo "mdbook is not installed. Install it first, for example with: cargo install mdbook" >&2
  exit 1
fi

# ── English v2 ────────────────────────────────────────────────────────────────
"${PYTHON_BIN}" "${ROOT}/tools/ensure_book_resources.py" --chapter-dir "${ROOT}/v2/en_chapters"
"${PYTHON_BIN}" "${ROOT}/tools/prepare_mdbook.py" \
  --source "${ROOT}/v2/en_chapters" \
  --summary-output "${ROOT}/v2/en_chapters/SUMMARY.md" \
  --placeholder-prefix "[TODO: src = zh_chapters/"

mdbook build "${ROOT}/v2"

# ── Chinese v2 ────────────────────────────────────────────────────────────────
"${PYTHON_BIN}" "${ROOT}/tools/ensure_book_resources.py" --chapter-dir "${ROOT}/v2/zh_chapters"
"${PYTHON_BIN}" "${ROOT}/tools/prepare_mdbook_zh.py" \
  --source "${ROOT}/v2/zh_chapters" \
  --summary-output "${ROOT}/v2/zh_chapters/SUMMARY.md"

mdbook build "${ROOT}/v2/books/zh"
@@ -171,7 +171,7 @@ a {
 }

 .cover h2 {
-  font-size: 34px;
+  font-size: 24px;
   margin-bottom: 0px;
   padding-bottom: 20px;
 }
@@ -144,8 +144,10 @@ missing
         )
         (source / "existing.md").write_text("# 现有章节\n", encoding="utf-8")

-        with self.assertRaises(FileNotFoundError):
-            write_summary(source)
+        summary_path = write_summary(source)
+        summary = summary_path.read_text(encoding="utf-8")
+        self.assertIn("existing", summary)
+        self.assertNotIn("missing", summary)

    def test_rewrite_markdown_normalizes_common_d2l_directives(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
48  theme/version-selector.css (new file)
@@ -0,0 +1,48 @@
(identical to books/zh/theme/version-selector.css above)
74  theme/version-selector.js (new file)
@@ -0,0 +1,74 @@
(identical to books/zh/theme/version-selector.js above)
@@ -28,8 +28,12 @@ def assemble_publish_tree(
     docs_subdir: str = "docs",
     en_source: Path | None = None,
     zh_source: Path | None = None,
+    v1_en_source: Path | None = None,
+    v1_zh_source: Path | None = None,
+    v2_en_source: Path | None = None,
+    v2_zh_source: Path | None = None,
 ) -> tuple[Path, Path | None]:
-    if en_source is None and zh_source is None:
+    if en_source is None and zh_source is None and v2_en_source is None:
         raise ValueError("At least one site source must be provided.")

     destination_root = destination_root.resolve()
@@ -38,15 +42,25 @@ def assemble_publish_tree(
     remove_path(docs_root)
     docs_root.parent.mkdir(parents=True, exist_ok=True)

-    if en_source is not None:
-        copy_site(en_source, docs_root)
+    # v2 (latest) is deployed at the root — /docs/
+    effective_en = v2_en_source or en_source
+    if effective_en is not None:
+        copy_site(effective_en, docs_root)
     else:
         docs_root.mkdir(parents=True, exist_ok=True)

     zh_destination: Path | None = None
-    if zh_source is not None:
+    effective_zh = v2_zh_source or zh_source
+    if effective_zh is not None:
         zh_destination = docs_root / "cn"
-        copy_site(zh_source, zh_destination)
+        copy_site(effective_zh, zh_destination)
+
+    # v1 is deployed under /docs/v1/
+    if v1_en_source is not None:
+        v1_root = docs_root / "v1"
+        copy_site(v1_en_source, v1_root)
+        if v1_zh_source is not None:
+            copy_site(v1_zh_source, v1_root / "cn")

     return docs_root, zh_destination
@@ -69,12 +83,32 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument(
         "--en-source",
         type=Path,
-        help="Built site to publish at docs/.",
+        help="Built site to publish at docs/ (legacy, use --v2-en-source instead).",
     )
     parser.add_argument(
         "--zh-source",
         type=Path,
-        help="Built site to publish at docs/cn/.",
+        help="Built site to publish at docs/cn/ (legacy, use --v2-zh-source instead).",
     )
+    parser.add_argument(
+        "--v1-en-source",
+        type=Path,
+        help="Built v1 English site to publish at docs/v1/.",
+    )
+    parser.add_argument(
+        "--v1-zh-source",
+        type=Path,
+        help="Built v1 Chinese site to publish at docs/v1/cn/.",
+    )
+    parser.add_argument(
+        "--v2-en-source",
+        type=Path,
+        help="Built v2 English site to publish at docs/.",
+    )
+    parser.add_argument(
+        "--v2-zh-source",
+        type=Path,
+        help="Built v2 Chinese site to publish at docs/cn/.",
+    )
     return parser.parse_args()
@@ -86,6 +120,10 @@ def main() -> int:
         docs_subdir=args.docs_subdir,
         en_source=args.en_source,
         zh_source=args.zh_source,
+        v1_en_source=args.v1_en_source,
+        v1_zh_source=args.v1_zh_source,
+        v2_en_source=args.v2_en_source,
+        v2_zh_source=args.v2_zh_source,
     )
     print(f"Assembled root site at {docs_root}")
    if zh_root is not None:
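The layout this diff produces (v2 at the root, Chinese under cn/, v1 nested under v1/, with the legacy --en/--zh flags acting as v2 fallbacks) can be sketched with a minimal stand-in. The `assemble` helper below is illustrative only, not the repository's actual assemble_docs_publish_tree.py:

```python
import shutil
import tempfile
from pathlib import Path

def assemble(docs_root: Path, en=None, zh=None,
             v1_en=None, v1_zh=None, v2_en=None, v2_zh=None) -> Path:
    """Copy built sites into the deployment layout:
    / = v2 EN, /cn/ = v2 ZH, /v1/ = v1 EN, /v1/cn/ = v1 ZH.
    Legacy en/zh sources act as fallbacks for v2."""
    layout = {
        docs_root: v2_en or en,              # v2 (latest) at the root
        docs_root / "cn": v2_zh or zh,
        docs_root / "v1": v1_en,
        docs_root / "v1" / "cn": v1_zh,
    }
    for dest, src in layout.items():
        if src is not None:
            # copytree creates intermediate directories as needed
            shutil.copytree(src, dest, dirs_exist_ok=True)
    return docs_root

# Build four tiny fake sites and assemble them.
tmp = Path(tempfile.mkdtemp())
sources = {}
for name in ("v2en", "v2zh", "v1en", "v1zh"):
    d = tmp / name
    d.mkdir()
    (d / "index.html").write_text(name)
    sources[name] = d

docs = assemble(tmp / "docs",
                v2_en=sources["v2en"], v2_zh=sources["v2zh"],
                v1_en=sources["v1en"], v1_zh=sources["v1zh"])
print((docs / "index.html").read_text())                # v2en
print((docs / "v1" / "cn" / "index.html").read_text())  # v1zh
```

The dict-driven layout makes the four destinations explicit and keeps the fallback logic (`v2_en or en`) in one place, mirroring the `effective_en` / `effective_zh` variables in the diff.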
@@ -43,7 +43,7 @@ def main() -> int:
     for key, fields in parse_bib(extra_bib).items():
         bib_db.setdefault(key, fields)

-    chapters = iter_chapters(book.get("items", []))
+    chapters = iter_chapters(book.get("sections") or book.get("items") or [])

     # Pass 1: collect all :label: directives and figure labels
     ref_label_map: dict[str, str] = {}
@@ -42,7 +42,7 @@ def main() -> int:
     for key, fields in parse_bib(extra_bib).items():
         bib_db.setdefault(key, fields)

-    chapters = iter_chapters(book.get("items", []))
+    chapters = iter_chapters(book.get("sections") or book.get("items") or [])

     # Pass 1: collect all :label: directives and figure labels
     ref_label_map: dict[str, str] = {}
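The key-compatibility fix in both preprocessors can be isolated to one expression. The sketch below (the `chapters_of` helper name is mine, not the repository's) shows why `or`-chaining both keys with a trailing `or []` keeps either mdBook JSON shape from crashing the preprocessor:

```python
import json

def chapters_of(book: dict) -> list:
    # mdBook 0.4.43 serializes the chapter list under "sections";
    # the preprocessor previously read only "items". Accept either,
    # and fall back to an empty list so a missing key cannot crash.
    return book.get("sections") or book.get("items") or []

new_style = json.loads('{"sections": [{"Chapter": {"name": "Intro"}}]}')
old_style = json.loads('{"items": [{"Chapter": {"name": "Intro"}}]}')
print(len(chapters_of(new_style)), len(chapters_of(old_style)), len(chapters_of({})))  # 1 1 0
```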
@@ -200,11 +200,11 @@ def parse_toc_blocks(markdown: str) -> list[list[TocItem]]:
     return blocks


-def resolve_toc_target(current_file: Path, entry: str) -> Path:
+def resolve_toc_target(current_file: Path, entry: str) -> Path | None:
     target_name = entry if entry.endswith(".md") else f"{entry}.md"
     target = (current_file.parent / target_name).resolve()
     if not target.exists():
-        raise FileNotFoundError(f"TOC entry '{entry}' from '{current_file}' does not exist")
+        return None
     return target
@@ -828,7 +828,7 @@ def render_toc_list(entries: list[TocItem], current_file: Path, title_cache: dic
             continue

         target = resolve_toc_target(current_file, entry.target)
-        if target not in title_cache:
+        if target is None or target not in title_cache:
             continue

         label = chapter_label(entry, target, title_cache)
@@ -943,7 +943,9 @@ def build_summary(source_dir: Path, title_cache: dict[Path, str]) -> str:
         for entry in block:
             if entry.kind != "chapter" or entry.target is None:
                 continue
-            append_entry(resolve_toc_target(target, entry.target), indent + 1, entry.label or None)
+            child_target = resolve_toc_target(target, entry.target)
+            if child_target is not None:
+                append_entry(child_target, indent + 1, entry.label or None)

     def append_prefix_chapter(target: Path, label: str | None = None) -> None:
         target = target.resolve()
@@ -969,6 +971,8 @@ def build_summary(source_dir: Path, title_cache: dict[Path, str]) -> str:
             continue

         target = resolve_toc_target(root_index, entry.target)
+        if target is None:
+            continue
         if numbered_started:
             append_entry(target, 0, entry.label or None)
         else:
19  v1/book.toml (new file)
@@ -0,0 +1,19 @@
[book]
authors = ["OpenMLSys Contributors"]
language = "en"
src = "en_chapters"
title = "Machine Learning Systems: Design and Implementation (1st Edition)"

[build]
build-dir = "../.mdbook-v1/book"
create-missing = false

[preprocessor.openmlsys]
command = "python3 tools/mdbook_preprocessor.py"

[output.html]
mathjax-support = true
git-repository-url = "https://github.com/openmlsys/openmlsys-zh"
preferred-dark-theme = "navy"
additional-css = ["theme/dark-mode-images.css", "theme/version-selector.css"]
additional-js = ["theme/version-selector.js"]
19  v1/books/zh/book.toml (new file)
@@ -0,0 +1,19 @@
[book]
authors = ["OpenMLSys Contributors"]
language = "zh-CN"
src = "../../zh_chapters"
title = "机器学习系统:设计和实现(第一版)"

[build]
build-dir = "../../../.mdbook-v1-zh/book"
create-missing = false

[preprocessor.openmlsys-zh]
command = "python3 tools/mdbook_zh_preprocessor.py"

[output.html]
mathjax-support = true
git-repository-url = "https://github.com/openmlsys/openmlsys-zh"
preferred-dark-theme = "navy"
additional-css = ["theme/dark-mode-images.css", "theme/version-selector.css"]
additional-js = ["theme/version-selector.js"]
16  v1/books/zh/theme/dark-mode-images.css (new file)
@@ -0,0 +1,16 @@
/* In dark mode, give only body images a light-gray background, to keep transparent-background images readable */
.navy .content main img,
.coal .content main img,
.ayu .content main img {
  background-color: #e8e8e8;
  border-radius: 4px;
  padding: 8px;
}

/* Keep frontpage images transparent; do not apply the body-image background. */
.navy .openmlsys-frontpage img,
.coal .openmlsys-frontpage img,
.ayu .openmlsys-frontpage img {
  background-color: transparent !important;
  padding: 0 !important;
}
12  v1/books/zh/theme/head.hbs (new file)
@@ -0,0 +1,12 @@
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  "HTML-CSS": {
    availableFonts: ["TeX"],
    preferredFont: "TeX",
    webFont: "TeX"
  },
  SVG: {
    font: "TeX"
  }
});
</script>
48  v1/books/zh/theme/version-selector.css (new file)
@@ -0,0 +1,48 @@
(identical to books/zh/theme/version-selector.css above)
74  v1/books/zh/theme/version-selector.js (new file)
@@ -0,0 +1,74 @@
(identical to books/zh/theme/version-selector.js above)
21  v1/en_chapters/SUMMARY.md (new file)
@@ -0,0 +1,21 @@
# Summary

[Machine Learning Systems: Design and Implementation (1st Edition)](index.md)
[Preface](chapter_preface/index.md)
[Introduction](chapter_introduction/index.md)
[Programming Model](chapter_programming_interface/index.md)
[Computational Graph](chapter_computational_graph/index.md)
[Part I Framework Design](chapter_preface_advanced/index.md)
[AI Compiler Frontend](chapter_frontend_and_ir/index.md)
[AI Compiler Backend](chapter_backend_and_runtime/index.md)
[Hardware Accelerator](chapter_accelerator/index.md)
[Data Processing Framework](chapter_data_processing/index.md)
[Model Deployment {#ch:deploy}](chapter_model_deployment/index.md)
[Distributed Training](chapter_distributed_training/index.md)
[Part II Application Scenarios](chapter_preface_extension/index.md)
[Recommender System](chapter_recommender_system/index.md)
[Federated Learning Systems](chapter_federated_learning/index.md)
[Reinforcement Learning System](chapter_reinforcement_learning/index.md)
[Explainable AI Systems](chapter_explainable_AI/index.md)
[Robotic System](chapter_rl_sys/index.md)
[Appendix: Introduction to Machine Learning](appendix_machine_learning_introduction/index.md)
@@ -0,0 +1,62 @@
## Classic Machine Learning Methods

Many classic machine learning algorithms, such as the Support Vector Machine (SVM), the K-Nearest Neighbor (KNN) classification algorithm, and the K-Means clustering algorithm, differ in various ways---some have trainable parameters while others do not, some are supervised learning algorithms while others are unsupervised, and their training processes also differ. However, from a systems perspective, they are all based on matrix operations. Below, we briefly introduce these algorithms.

### Support Vector Machine

**Support Vector Machine** (SVM) is a classic machine learning classification algorithm whose core idea is to maximize the distance from the decision boundary to the data points. Here, we use linearly separable data as an example; for non-linearly separable data, the **Kernel Method** can be applied in a similar manner.

If the training data is linearly separable, the objective of SVM is to maximize the **margin**. First, let us define the maximum margin classifier as follows:

$$\min_{{w},b} ~~~\frac{1}{2} ||{w}||^2$$

$$\text{s.t.} ~~~y_i ({w}^T {x_i} + b) \geq 1, ~~~\forall\, 1 \leq i \leq n$$

Its Lagrangian is

$$L({w},b,{\lambda}) = \frac{1}{2} ||{w}||^2 + \sum_{i=1}^n \lambda_i (1-y_i({w}^T {x_i} + b))$$

Since $\frac{1}{2} ||{w}||^2$ is convex, and $\lambda_i (1-y_i({w}^T {x_i} + b))$ is linear (and therefore also convex), the solution to the optimization problem is given by the dual problem

$$\max_{\lambda \geq 0} \min_{{w},b} L({w},b, {\lambda})$$

Taking the derivatives of $L$ with respect to ${w}$ and $b$, we have

$$\nabla_{{w}} L= {w} - \sum_{i=1}^n \lambda_i y_i {x_i}$$

$$\nabla_b L = - \sum_{i=1}^n \lambda_i y_i$$

Setting these derivatives to zero, we obtain ${w}^* = \sum_{i=1}^n \lambda_i y_i {x_i}$ and $\sum_{i=1}^n \lambda_i y_i = 0$.

Since, for fixed ${\lambda}$, the value of $b$ does not affect the objective function, we can set $b^* = 0$.

At this point, by duality theory and the KKT conditions, we obtain:

$$y_i ({w}^{*T} {x_i} + b^*) > 1 \Rightarrow \lambda_i^* = 0$$

$$\lambda_i^* > 0 \Rightarrow y_i ({w}^{*T} {x_i} + b^*) = 1$$

$${w}^* = \sum_{i=1}^n \lambda_i^* y_i {x_i}$$

If $y_i ({w}^{*T} {x_i} + b^*) = 1$, then ${x_i}$ is one of the points closest to the hyperplane $({w}^*,b^*)$; otherwise, it is not. Therefore, ${w}^*$ is a linear combination of the points ${x_i}$ that are closest to the hyperplane $({w}^*,b^*)$.

In this way, the SVM algorithm classifies the data while maximizing the distance from the decision boundary to the nearest points.

We define the ${x_i}$ satisfying $y_i ({w}^{*T} {x_i} + b^*) = 1$ as **support vectors**, and call the classifier $\hat{y}=\mathrm{sgn}({w}^{*T} {x} + b^*)$ the support vector machine.
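As a rough illustration of how a linear SVM can be trained in practice, the sketch below uses full-batch subgradient descent on the soft-margin hinge loss---a different route to roughly the same classifier than the dual derivation above. The toy data, learning rate, and regularization constant are illustrative assumptions:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Full-batch subgradient descent on lam/2*||w||^2 + mean(max(0, 1 - y(Xw+b))).

    X: (n, d) data matrix; y: labels in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                      # points violating the margin
        grad_w = lam * w - (y[mask] @ X[mask]) / n
        grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data: class +1 sits above the line x1 + x2 = 0.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-2.5, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)      # all four points classified correctly
```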
### K-Nearest Neighbor Algorithm

**K-Nearest Neighbor** (KNN) is also a traditional machine learning algorithm that can be used for basic machine learning tasks such as classification and regression. Unlike the SVM algorithm introduced above, the core idea of the K-Nearest Neighbor algorithm is not to separate data of different classes with a decision boundary, but rather to predict the properties of a data point from the properties of its K nearest neighbors.

When KNN is used for classification, the class of a sample point is predicted by a vote. The voters are the K sample points closest to the observation point, each of which may be assigned a different weight, and the "content" of each vote is the class label of the voting sample point. The voting results are decided by majority: if most of the K nearest sample points belong to a certain class, the point to be classified is assigned to that class as well.

The KNN algorithm can be described as follows: (1) compute the distance from the point to be classified to each known-class point; (2) sort these points by distance and select the K nearest ones; (3) tally the votes according to each point's weight, where the vote content is the point's class label; (4) return the class with the highest vote count as the predicted class for the point to be classified.

The KNN algorithm has several key issues that require attention, including the choice of the hyperparameter K, the distance metric, and the classification decision rule. The hyperparameter K should not be too large, as this would lead to significant approximation error, nor too small, as this would lead to significant estimation error. For the distance metric, one can choose the Manhattan distance, Euclidean distance, Minkowski distance, and so on. To reduce the error and the impact of the K value on prediction results, we can impose certain rules on the classification decision, such as giving closer points larger weights and more distant points smaller weights during voting. When implementing the KNN algorithm programmatically, quantities such as distances and weights are computed in matrix form to improve computational efficiency.
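The four steps above can be sketched in a few lines of NumPy. The toy dataset, `k=3`, and equal voting weights are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), computed with vectorized matrix operations."""
    dists = np.linalg.norm(X_train - x, axis=1)   # step 1: distance to every point
    nearest = np.argsort(dists)[:k]               # step 2: indices of the k closest
    votes = Counter(y_train[nearest].tolist())    # step 3: tally (equal weights here)
    return votes.most_common(1)[0][0]             # step 4: majority class

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.15, 0.1])))  # → 0
print(knn_predict(X_train, y_train, np.array([5.05, 5.0])))  # → 1
```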
### K-Means Clustering Algorithm

The **K-Means Clustering Algorithm** is a common unsupervised clustering algorithm in machine learning. Here, we first define the clustering problem: given data points ${x_1},\cdots, {x_n} \in \mathbb{R}^d$ and $K\in \mathbb{N}$, we need to partition them into $K$ clusters with centers ${C_1}, \cdots, {C_K} \in \mathbb{R}^d$, where ${C_{(i)}}$ denotes the center assigned to data point ${x_i}$, so as to minimize the sum of squared distances $\sum_i ||{x_i} - {C_{(i)}}||^2$.

The K-Means clustering algorithm solves the clustering problem as follows:

- Randomly initialize ${C_1}, \cdots, {C_K}$

- Assign each ${x_i}$ to the cluster whose center is nearest

- Update each center as the mean of its assigned points: ${C_k} = \frac{\sum_{i:\, {C_{(i)}}={C_k}} {x_i}}{\sum_{i:\, {C_{(i)}}={C_k}} 1}$

- Repeat the above steps until the algorithm converges

It can be proven that the K-Means clustering algorithm monotonically decreases the sum of squared distances $\sum_i ||{x_i} - {C_{(i)}}||^2$ and eventually converges. However, it may converge to a local minimum.
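The steps above (Lloyd's algorithm) can be sketched as follows; the toy data, the random-sampling initialization, and the iteration cap are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random init
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return centers, assign

# Two well-separated toy clusters.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
centers, assign = kmeans(X, k=2)
```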
To conclude this chapter: from a systems perspective, regardless of the specific algorithm, machine learning algorithms that operate on high-dimensional data are all implemented through matrix operations.

## References

:bibliography:`../references/appendix.bib`
@@ -0,0 +1,85 @@
## Gradient Descent and Backpropagation

The previous section provided a general introduction to classic neural networks. Now an important question arises: how are the parameters in these networks determined? If the problem can be solved by a simple perceptron, the parameters can be set manually. However, for deep networks, parameter determination must be automated---this is what we call network training, and it requires us to define a **loss function** to guide the direction of optimization.

Common loss functions include: 1) Mean Squared Error (MSE), which measures the distance between vectors,

$\mathcal{L} = \frac{1}{N}\|{y}-\hat{{y}}\|^{2}_{2} = \frac{1}{N}\sum_{i=1}^N(y_{i}-\hat{y}_{i})^{2}$

and Mean Absolute Error (MAE),

$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}|y_{i}-\hat{y}_{i}|$

where $N$ is the number of data samples used for averaging, $y$ denotes the ground-truth labels, and $\hat{y}$ denotes the labels predicted by the network; and 2) Cross Entropy, which can be used for classification tasks,

$\mathcal{L} = - \frac{1}{N} \sum_{i=1}^N \bigg(y_{i}\log\hat{y}_{i} + (1 - y_{i})\log(1 - \hat{y}_{i})\bigg)$

whose value is zero if and only if the predicted labels exactly match the ground-truth labels.

With the loss value computed, we can use large amounts of labeled data and optimization methods to update the model parameters. The most commonly used method is **gradient descent**. As shown in :numref:`gradient_descent2`, the model parameters ${w}$ are initially chosen at random. Then the partial derivative of the loss with respect to the parameters, $\frac{\partial \mathcal{L}}{\partial {w}}$, is computed, and optimization proceeds through repeated iterations of ${w}:={w}-\alpha\frac{\partial \mathcal{L}}{\partial {w}}$. This optimization process drives the loss value down to achieve the task objective, where $\alpha$ is the **learning rate** that controls the optimization step size. In practice, the minimum reached by gradient descent is very likely a local minimum rather than the global minimum. However, since deep neural networks provide strong data representation capability, the local minimum can be very close to the global minimum, and the loss value can be sufficiently small.

![Introduction to gradient descent. (Left) Only one trainable parameter $w$; (Right) Two trainable parameters ${w}=[w_1,w_2]$. As the parameters are updated iteratively, the loss value $\mathcal{L}$ gradually decreases. However, because many local optima exist, we often cannot reach the global optimum.](../img/ch_basic/gradient_descent2.png)
:width:`600px`
:label:`gradient_descent2`
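A minimal sketch of the update rule ${w}:={w}-\alpha\frac{\partial \mathcal{L}}{\partial {w}}$ on a one-parameter convex loss $\mathcal{L}(w) = (w-3)^2$; the loss, starting point, and learning rate are illustrative assumptions:

```python
# Minimize L(w) = (w - 3)^2 with gradient descent.
def grad(w):
    return 2.0 * (w - 3.0)      # dL/dw

w = 0.0            # initial parameter
alpha = 0.1        # learning rate (step size)
for _ in range(100):
    w = w - alpha * grad(w)     # w := w - alpha * dL/dw
print(round(w, 4))  # → 3.0
```

With $\alpha=0.1$ the iteration contracts the error by a factor $0.8$ per step, so $w$ converges quickly to the minimizer $w=3$; on a non-convex loss the same rule could instead settle in a local minimum, as the figure illustrates.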
The next question is: how do we implement gradient descent in deep neural networks? This requires computing the partial derivatives $\frac{\partial \mathcal{L}}{\partial {w}}$ of the parameters at each layer, which can be achieved using **backpropagation** :cite:`rumelhart1986learning,lecun2015deep`. We introduce an intermediate quantity ${\delta}=\frac{\partial \mathcal{L}}{\partial {z}}$ to represent the partial derivative of the loss function $\mathcal{L}$ with respect to the pre-activation output ${z}$ of a layer (i.e., before the activation function is applied, as opposed to the activation ${a}$), and from it ultimately obtain $\frac{\partial \mathcal{L}}{\partial {w}}$.

We illustrate the backpropagation algorithm with an example below. Let the layer index be $l=1, 2, \ldots, L$ (the output layer, i.e., the last layer, has index $L$). For each network layer, we have the pre-activation output ${z}^l$, the intermediate value ${\delta}^l=\frac{\partial \mathcal{L}}{\partial {z}^l}$, and an activation output ${a}^l=f({z}^l)$ (where $f$ is the activation function). We assume the model is a multi-layer perceptron using the Sigmoid activation function, with Mean Squared Error (MSE) as the loss function. That is, we define:

- Network structure ${z}^{l}={W}^{l}{a}^{l-1}+{b}^{l}$

- Activation function ${a}^l=f({z}^l)=\frac{1}{1+{\rm e}^{-{z}^l}}$

- Loss function $\mathcal{L}=\frac{1}{2}\|{y}-{a}^{L}\|^2_2$

We can directly compute the partial derivative of the activation output with respect to the pre-activation output:

- $\frac{\partial {a}^l}{\partial {z}^l}=f'({z}^l)=f({z}^l)(1-f({z}^l))={a}^l(1-{a}^l)$

and the partial derivative of the loss function with respect to the activation output of the last layer:

- $\frac{\partial \mathcal{L}}{\partial {a}^{L}}=({a}^{L}-{y})$

With these results, to further obtain the partial derivatives of the loss function with respect to each parameter, we can use the **chain rule**, detailed as follows:

First, starting from the output layer ($l=L$, the last layer), we propagate the error backward. By the chain rule, we first compute the intermediate quantity of the output layer:

- ${\delta}^{L}=\frac{\partial \mathcal{L}}{\partial {z}^{L}}=\frac{\partial \mathcal{L}}{\partial {a}^{L}}\frac{\partial {a}^L}{\partial {z}^{L}}=({a}^L-{y})\odot({a}^L(1-{a}^L))$

Besides the intermediate value ${\delta}^{L}$ of the output layer ($l=L$), how do we compute the intermediate values ${\delta}^{l}$ for the other layers ($l=1, 2, \ldots, L-1$)?

- Given the model structure ${z}^{l+1}={W}^{l+1}{a}^{l}+{b}^{l+1}$, we can directly obtain $\frac{\partial {z}^{l+1}}{\partial {a}^{l}}={W}^{l+1}$; moreover, we already know that $\frac{\partial {a}^l}{\partial {z}^l}={a}^l(1-{a}^l)$

- Then by the chain rule, we can obtain ${\delta}^{l}=\frac{\partial \mathcal{L}}{\partial {z}^{l}}=\frac{\partial \mathcal{L}}{\partial {z}^{l+1}}\frac{\partial {z}^{l+1}}{\partial {a}^{l}}\frac{\partial {a}^{l}}{\partial {z}^{l}}=({W}^{l+1})^\top{\delta}^{l+1}\odot({a}^l(1-{a}^l))$

Having computed the intermediate values ${\delta}^l, l=1, 2, \ldots, L$ for all layers using the above derivation, we can then compute the partial derivatives of the loss function with respect to the parameters of each layer, $\frac{\partial \mathcal{L}}{\partial {W}^l}$ and $\frac{\partial \mathcal{L}}{\partial {b}^l}$, and use gradient descent to update the parameters at each layer.

- Given the model structure ${z}^l={W}^l{a}^{l-1}+{b}^l$, we can compute $\frac{\partial {z}^{l}}{\partial {W}^l}={a}^{l-1}$ and $\frac{\partial {z}^{l}}{\partial {b}^l}=1$

- Then by the chain rule, we can obtain $\frac{\partial \mathcal{L}}{\partial {W}^l}=\frac{\partial \mathcal{L}}{\partial {z}^l}\frac{\partial {z}^l}{\partial {W}^l}={\delta}^l({a}^{l-1})^\top$ and $\frac{\partial \mathcal{L}}{\partial {b}^l}=\frac{\partial \mathcal{L}}{\partial {z}^l}\frac{\partial {z}^l}{\partial {b}^l}={\delta}^l$

After obtaining all partial derivatives $\frac{\partial \mathcal{L}}{\partial {W}^l}$ and $\frac{\partial \mathcal{L}}{\partial {b}^l}$, we can update all parameters ${W}^l$ and ${b}^l$ using gradient descent:

- ${W}^l:={W}^l-\alpha\frac{\partial \mathcal{L}}{\partial {W}^l}$, ${b}^l:={b}^l-\alpha\frac{\partial \mathcal{L}}{\partial {b}^l}$
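The derivation above can be checked numerically. The sketch below implements the two-layer case with Sigmoid activations and MSE loss, then compares one backpropagated gradient entry against a central-difference estimate; the layer sizes and random inputs are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Two-layer sigmoid MLP: z^l = W^l a^{l-1} + b^l, a^l = f(z^l)."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

def loss(x, y, params):
    """MSE loss L = 1/2 ||y - a^L||^2, as in the derivation above."""
    _, a2 = forward(x, params)
    return 0.5 * np.sum((y - a2) ** 2)

def backward(x, y, params):
    """Backpropagation: delta^L first, then delta^l via the chain rule."""
    W1, b1, W2, b2 = params
    a1, a2 = forward(x, params)
    delta2 = (a2 - y) * a2 * (1 - a2)          # delta^L = (a^L - y) ⊙ a^L(1 - a^L)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # delta^l = (W^{l+1})^T delta^{l+1} ⊙ a^l(1 - a^l)
    dW2 = np.outer(delta2, a1)                 # dL/dW^l = delta^l (a^{l-1})^T
    dW1 = np.outer(delta1, x)
    return dW1, delta1, dW2, delta2            # dL/db^l = delta^l

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([0.0, 1.0])
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2)]
dW1, db1, dW2, db2 = backward(x, y, params)

# Central-difference check on one weight entry of W^1.
eps = 1e-5
plus = [p.copy() for p in params]; plus[0][0, 0] += eps
minus = [p.copy() for p in params]; minus[0][0, 0] -= eps
num = (loss(x, y, plus) - loss(x, y, minus)) / (2 * eps)
print(abs(num - dW1[0, 0]) < 1e-6)  # → True
```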
However, there is still one issue to address: each time gradient descent updates the parameters, it needs to compute the loss value under the current parameters. When the training dataset is large ($N$ is large), computing the loss value over the entire training set for each update would be computationally prohibitive. To reduce the computational cost, we use **Stochastic Gradient Descent** (SGD) to compute the loss value. Specifically, instead of using all the training data, we randomly select a subset of data samples from the training set to compute the loss value, such as 16, 32, 64, or 128 data samples. The number of samples is called the **batch size**.
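A minimal sketch of drawing one such mini-batch, assuming a toy dataset and a batch size of 32:

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 1000, 32
X = rng.normal(size=(N, 5))   # toy training inputs (N samples, 5 features)
y = rng.normal(size=N)        # toy labels

# One SGD step computes the loss and gradient on a random mini-batch only.
idx = rng.choice(N, size=batch_size, replace=False)
X_batch, y_batch = X[idx], y[idx]
print(X_batch.shape)  # → (32, 5)
```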
Furthermore, setting the learning rate is also very important. If the learning rate is too large, we may not be able to approach the valley of the minimum; if it is too small, training proceeds too slowly. Adaptive learning-rate methods, such as Adam :cite:`KingmaAdam2014`, RMSProp :cite:`tieleman2012rmsprop`, and Adagrad :cite:`duchi2011adagrad`, automatically adjust the learning rate during training to achieve fast convergence to the minimum.
@@ -0,0 +1,12 @@
# Appendix: Introduction to Machine Learning

This book assumes that readers have a basic foundation in machine learning algorithms. Therefore, this chapter only provides a brief introduction to machine learning. Among the topics covered, the gradient descent method is particularly important for understanding machine learning systems and is essential knowledge.

```toc
:maxdepth: 2
:numbered:

neural_network
gradient_descent
classic_machine_learning
```
@@ -0,0 +1,156 @@
## Neural Networks

### Perceptron

![](img/ch_basic/single_neuron.svg)
:width:`600px`
:label:`single_neuron`

:numref:`single_neuron` shows an example of a neuron, where the input data $x$ is weighted and summed according to the weights $w$ on the connections to produce the output $z$. We call such a model a **perceptron**. Since there is only one layer of neural connections between input and output, this model is also called a single-layer perceptron. The computation of the model in :numref:`single_neuron` can be written as: $z = w_{1}x_{1}+ w_{2}x_{2} + w_{3}x_{3}$.

When the input data is represented as a column vector ${x}=[x_1,x_2,x_3]^T$ and the model weights are represented as a row vector ${w}=[w_1,w_2,w_3]$, the output scalar $z$ can be written as:

$$z =
\begin{bmatrix}
w_1,w_2,w_3\\
\end{bmatrix}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
={w}{x}$$

We can use the output scalar $z$ as a weighted combination of the inputs to accomplish specific tasks. For example, we can classify "good apples" and "bad apples," where $x_1,x_2,x_3$ represent three different features: 1) degree of redness, 2) presence of holes, and 3) size. If the size of the apple has no effect on this judgment, the corresponding weight would be zero. Training this neural network essentially means selecting appropriate weights to accomplish our task. For instance, we can choose weights such that $z \leq 0$ represents a "bad apple" and $z > 0$ represents a "good apple." The final classification output label $y$ is as follows, where $1$ represents good and $0$ represents bad:

$$
y =
\begin{cases}
1 & z>0 \\
0 & z \leq 0 \\
\end{cases}$$

### Decision Boundary and Bias

By selecting appropriate weights and classifying input data according to whether $z$ is greater or less than $0$, we obtain a **decision boundary** in the data space. As shown in :numref:`single_neuron_decision_boundary2`, when the neuron output $z=0$ is used as the decision boundary for the output label $y$, a neuron without bias yields a decision boundary that must pass through the origin. If the data sample points cannot be separated by a boundary through the origin, classification errors will occur. To solve this problem, a **bias** can be added to the neuron. :numref:`single_neuron_bias2` shows a neuron model with bias $b$, which can be expressed by :eqref:`singleneuron_bias`:

$$z = w_{1}x_{1}+ w_{2}x_{2}+ w_{3}x_{3} + b$$
:eqlabel:`singleneuron_bias`

![](../img/ch_basic/single_neuron_decision_boundary2.svg)
:width:`600px`
:label:`single_neuron_decision_boundary2`

![](../img/ch_basic/single_neuron_bias2.svg)
:width:`600px`
:label:`single_neuron_bias2`

With bias, the decision boundary (line, plane, or hyperplane) no longer has to pass through the origin, thus enabling better classification of the samples. More precisely, the decision boundary separates the sample data into two different classes, and this boundary is $\{x_1, x_2, x_3 \mid w_{1}x_{1}+ w_{2}x_{2}+ w_{3}x_{3} + b = 0\}$.

### Logistic Regression

The input-output relationship of the above neuron is linear. To provide nonlinear data representation capability, an **activation function** can be applied to the neuron output. The most common activation functions include Sigmoid, Tanh, ReLU, and Softmax. For example, the above neuron uses $z=0$ as the boundary for classification tasks. Can we instead have the neuron output a probability, i.e., a value between $0$ and $1$, where $1$ means the input belongs to a certain class with $100\%$ probability? To make the neuron output values between $0$ and $1$, we can apply the logistic function **Sigmoid** to $z$, as shown in :eqref:`sigmoid`. Sigmoid constrains values between 0 and 1, and a simple threshold (e.g., 0.5) can then be used to decide whether the final output label belongs to a certain class. This method is called **logistic regression**.

$$a = f({z}) = \frac{1}{1+{\rm e}^{-{z}}}$$
:eqlabel:`sigmoid`
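A minimal sketch of this neuron with a Sigmoid output, using the apple features from the text; the particular weight, bias, and input values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for the three apple features.
w = np.array([1.5, -2.0, 0.0])   # redness helps, holes hurt, size is ignored
b = 0.5                          # bias shifts the decision boundary off the origin
x = np.array([1.0, 0.0, 0.3])    # a red apple with no holes

z = w @ x + b                    # weighted sum plus bias: z = 2.0
a = sigmoid(z)                   # probability of "good apple" ≈ 0.88
label = int(a > 0.5)             # threshold at 0.5
print(label)  # → 1
```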
### Multiple Neurons

![](../img/ch_basic/two_neurons2.svg)
:width:`600px`
:label:`two_neurons2`

The above network has only one output. With multiple neurons together, we can have multiple outputs. :numref:`two_neurons2` shows a network with two outputs, where each output is connected to all inputs. Such a layer is also called a **fully-connected (FC) layer**, whose computation can be expressed as :eqref:`fc_cal`:

$$\begin{aligned}
z_{1} &= w_{11}x_{1} + w_{12}x_{2} + w_{13}x_{3} + b_1 \\
z_{2} &= w_{21}x_{1} + w_{22}x_{2} + w_{23}x_{3} + b_2
\end{aligned}$$
:eqlabel:`fc_cal`

The following expression shows the matrix form of the computation:

$$
{z} =
\begin{bmatrix}
z_1 \\
z_2
\end{bmatrix}
=
\begin{bmatrix}
w_{11} & w_{12} & w_{13}\\
w_{21} & w_{22} & w_{23}\\
\end{bmatrix}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1 \\ b_2
\end{bmatrix}
= {W}{x} + {b}$$

A network with multiple outputs can solve multi-class classification problems. For example, with 10 numerical outputs, each value can represent the probability of a particular class, with each output between $0$ and $1$ and the sum of all 10 outputs equal to $1$. This can be achieved using the **Softmax** function shown in :eqref:`e_softmax`, where $K$ is the number of outputs:

$$f({z})_{i} = \frac{{\rm e}^{z_{i}}}{\sum_{k=1}^{K}{\rm e}^{z_{k}}}$$
:eqlabel:`e_softmax`
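A sketch of :eqref:`e_softmax`; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """Softmax: exponentiate and normalize so the outputs sum to 1.
    Subtracting max(z) avoids overflow without changing the result."""
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.argmax())  # → 0 (the largest logit gets the largest probability)
```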
### Multi-Layer Perceptron

![](../img/ch_basic/mlp.png)

The **Multi-Layer Perceptron** (MLP) :cite:`rosenblatt1958perceptron` enhances the network's representation capability by stacking multiple fully-connected layers. Compared to single-layer networks, the multi-layer perceptron has many intermediate layer outputs that are not exposed in the final output; these layers are called **hidden layers**. The network in this example can be implemented through the following cascaded matrix operations, where ${W}^l$ and ${b}^l$ denote the weight matrix and bias of layer $l$, and $L$ denotes the output layer:

$${z} = f({W^L}f({W^3}f({W^2}f({W^1}{x} + {b^1}) + {b^2}) + {b^3}) + {b^L})$$
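The cascaded matrix operations above can be sketched as a simple loop over layers; the layer sizes and random weights are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                 # input, two hidden layers, output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]

a = rng.normal(size=3)               # input vector x
for W, b in zip(Ws, bs):             # z^l = W^l a^{l-1} + b^l ; a^l = f(z^l)
    a = sigmoid(W @ a + b)
print(a.shape)  # → (2,)
```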
In the deep learning era, network models are essentially stacks of neural network layers connected together. Input data passes through multiple layers of feature extraction, learning **feature vectors** at different levels of abstraction. Below we introduce some other commonly used neural network layers.
### Convolutional Networks

![](../img/ch_basic/conv_computation_v4.png)
:width:`600px`
:label:`conv_computation_v4`

The **Convolutional Neural Network** (CNN) :cite:`lecun1989backpropagation` consists of multiple **convolutional layers** and is commonly used in computer vision tasks :cite:`krizhevsky2012imagenet,he2016deep`. :numref:`conv_computation_v4` describes an example of a convolution operation. From the properties of convolution, we can observe two facts: 1) the number of channels in a convolution kernel equals the number of input channels; 2) the number of output channels equals the number of convolution kernels.

In the example of :numref:`conv_computation_v4`, the convolution kernel slides by one unit at a time as it performs the convolution operation; we say its **stride** is 1. Additionally, if we want the edge values of the input to also be taken into account, we need to apply **zero padding** to the edges. In the example of :numref:`conv_computation_v4`, if each channel of the input is padded with a ring of zeros on all four sides, the output size becomes $4\times 4\times 1$. The number of padding rings depends on the kernel size---larger kernels require more padding.
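A naive single-channel convolution (cross-correlation, as commonly implemented in deep learning frameworks) sketching the stride and zero-padding behavior described above; the input values and all-ones kernel are illustrative:

```python
import numpy as np

def conv2d(x, k, stride=1, pad=0):
    """Naive single-channel 2-D convolution (cross-correlation) with zero padding."""
    x = np.pad(x, pad)                            # 'pad' rings of zeros on all sides
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1          # output height
    ow = (x.shape[1] - kw) // stride + 1          # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * k).sum()         # elementwise multiply and sum
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
print(conv2d(x, k).shape)          # → (2, 2): no padding shrinks the output
print(conv2d(x, k, pad=1).shape)   # → (4, 4): one ring of zeros keeps the size
```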
To perform feature extraction on input image data, the number of convolution kernels is typically greater than the number of input channels, which means the output contains many more values and the computation increases. However, the features of adjacent pixels in image data are often similar, so we can aggregate adjacent output features. **Pooling layers** serve this purpose, and two pooling methods are typically used: Max Pooling and Mean Pooling. As shown in :numref:`pooling_v3`, with a pooling kernel of size $2\times2$, an input of $4\times4$, and a stride of 2, the output is $2\times2$ (with a stride of 1, the output would instead be $3\times3$).

![](../img/ch_basic/pooling_v3.png)
:width:`600px`
:label:`pooling_v3`
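Max Pooling with a $2\times2$ kernel and stride 2, as in the figure, can be sketched as follows; the input values are illustrative:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over sliding windows (non-overlapping when stride == size)."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(x))   # 2x2 output: the max of each non-overlapping 2x2 window
```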
Both convolutional layers and fully-connected layers are commonly used. However, when the input is high-dimensional image data, convolutional layers require far fewer parameters than fully-connected layers. The operations in convolutional layers are similar to those in fully-connected layers: the former are based on high-dimensional tensor operations, while the latter are based on two-dimensional matrix operations.
### Sequential Models

In real life, besides images, there is a large amount of time-series data, such as videos and stock prices. **Recurrent Neural Networks** (RNN) :cite:`rumelhart1986learning` are a type of deep learning model architecture designed for processing sequential data. Sequential data is a series of continuous data $\{x_1, x_2, \dots, x_n\}$, where each $x$ might represent, for example, a word in a sentence.

To receive a continuous sequence of inputs, as shown in :numref:`rnn_simple_cell2`, the vanilla recurrent neural network uses a recurrent cell as the computation unit, with a hidden state to store information from past inputs. Specifically, for each input $x$ to the model, according to :eqref:`aligned`, the recurrent cell repeatedly computes a new hidden state to record information from the current and past inputs. The new hidden state is then used in the computation of the next cell.

$${h}_t = {W}[{x}_t; {h}_{t-1}] + {b}$$
:eqlabel:`aligned`
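One recurrent step of :eqref:`aligned` can be sketched as below; the dimensions and random weights are illustrative, and note that practical RNN cells usually wrap this update in a nonlinearity such as tanh:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One recurrent step, literally h_t = W [x_t; h_{t-1}] + b as in the text."""
    concat = np.concatenate([x_t, h_prev])   # stack input and previous hidden state
    return W @ concat + b

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W = rng.normal(size=(d_h, d_in + d_h)) * 0.1
b = np.zeros(d_h)

h = np.zeros(d_h)                            # initial hidden state
for x_t in rng.normal(size=(5, d_in)):       # a length-5 input sequence
    h = rnn_step(x_t, h, W, b)               # hidden state carries past information
print(h.shape)  # → (4,)
```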
![](../img/ch_basic/rnn_simple_cell2.png)
:width:`600px`
:label:`rnn_simple_cell2`

However, this simple vanilla recurrent neural network suffers from a severe information-forgetting problem. For example, if the input is "I am Chinese, my native language is ___," the hidden state remembers the information about "Chinese," enabling the network to predict the word "Chinese (language)" at the end. But when the sentence is very long, the hidden state may not remember information from too long ago. For instance, in "I am Chinese, I went to study in the UK, then worked in France, my native language is ___," the information about "Chinese" in the final hidden state may have been forgotten after multiple updates. To address this problem, various improved methods have been proposed, the most famous being Long Short-Term Memory (LSTM) :cite:`Hochreiter1997lstm`. There are many more sequential models, such as the Transformer :cite:`vaswani2017attention`, which has risen to prominence in recent years.
204
v1/en_chapters/chapter_accelerator/accelerator_architecture.md
Normal file
@@ -0,0 +1,204 @@
# Components of Hardware Accelerators

A hardware accelerator typically comprises multiple on-chip caches and various types of arithmetic units. In this section, we'll examine the fundamental components of hardware accelerators, using the Nvidia Volta GPU architecture as a representative example.

## Architecture of Accelerators

Contemporary graphics processing units (GPUs) offer remarkable computing speed, ample memory storage, and impressive I/O bandwidth. A top-tier GPU frequently surpasses a conventional CPU by housing double the number of transistors, boasting a memory capacity of 16 GB or greater, and operating at frequencies reaching up to 1 GHz. The architecture of a GPU comprises streaming processors and a memory system, interconnected through an on-chip network. These components can be expanded independently, allowing for customized configurations tailored to the target market of the GPU.

Figure :numref:`ch06/ch06-gv100` illustrates the architecture of the Volta GV100. This architecture has:

![](../img/ch06/gv100.png)
:label:`ch06/ch06-gv100`

1. 6 GPU processing clusters (GPCs), each containing:
   1. 7 texture processing clusters (TPCs), each containing two streaming multiprocessors (SMs).
   2. 14 SMs.
2. 84 SMs, each containing:
   1. 64 32-bit floating-point arithmetic units
   2. 64 32-bit integer arithmetic units
   3. 32 64-bit floating-point arithmetic units
   4. 8 Tensor Cores
   5. 4 texture units
3. 8 512-bit memory controllers.

As shown in Figure :numref:`ch06/ch06-gv100`, a GV100 GPU contains 84 SMs (Streaming Multiprocessors), 5376 32-bit floating-point arithmetic units, 5376 32-bit integer arithmetic units, 2688 64-bit floating-point arithmetic units, 672 Tensor Cores, and 336 texture units. A pair of memory controllers controls an HBM2 DRAM stack. Different vendors may use different configurations (e.g., Tesla V100 has 80 SMs).
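The per-SM counts listed above scale to the chip-wide totals by simple multiplication; a quick arithmetic check:

```python
# Per-SM unit counts for Volta GV100 (from the list above), scaled by 84 SMs.
sms = 84
per_sm = {"FP32 units": 64, "INT32 units": 64, "FP64 units": 32,
          "Tensor Cores": 8, "texture units": 4}
totals = {name: count * sms for name, count in per_sm.items()}
print(totals["FP32 units"])    # → 5376
print(totals["Tensor Cores"])  # → 672
```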
|
||||
## Memory Units
|
||||
|
||||
The memory units of a hardware accelerator resemble a CPU's memory
|
||||
controller. However, they encounter a bottleneck when retrieving data
|
||||
from the computer system's DRAM, as it is slower compared to the
|
||||
processor's computational speed. Without a cache for quick access, the
|
||||
DRAM bandwidth becomes inadequate to handle all transactions of the
|
||||
accelerator. Consequently, if program instructions or data cannot be
|
||||
swiftly retrieved from the DRAM, the accelerator's efficiency diminishes
|
||||
due to prolonged idle time. To tackle this DRAM bandwidth issue, GPUs
|
||||
employ a hierarchical design of memory units. Each type of memory unit
offers its own maximum bandwidth and latency. To fully exploit the computing power and enhance processing speed, programmers must select from the available memory units and optimize memory utilization based on their varying access speeds.

1. **Register file**: Registers serve as the swiftest on-chip memories. In contrast to CPUs, each SM in a GPU possesses tens of thousands of registers. Nevertheless, excessively utilizing registers for every thread can result in a reduced number of thread blocks that can be scheduled within the SM, leading to fewer executable threads. This underutilization of hardware capabilities hampers performance considerably. Consequently, programmers must judiciously determine the appropriate number of registers to employ, taking into account the algorithm's demands.

2. **Shared memory**: The shared memory is a user-controllable level-1 cache. Each SM features a 128 KB level-1 cache, of which programmers can manage up to 96 KB as shared memory. The shared memory offers a low access latency, requiring only a few dozen clock cycles, and boasts an impressive bandwidth of up to 1.5 TB/s. This bandwidth is significantly higher than the peak bandwidth of the global memory, which stands at 900 GB/s. In high-performance computing (HPC) scenarios, engineers must possess a thorough understanding of how to leverage shared memory effectively.

3. **Global memory**: Both GPUs and CPUs are capable of reading from and writing to global memory. Global memory is visible and accessible to all threads on a GPU, whereas other devices such as CPUs must traverse buses like PCIe and NVLink to access it. The global memory represents the largest memory space available on a GPU, with capacities exceeding 80 GB. However, it also exhibits the longest memory latency, with a load/store latency that can extend to hundreds of clock cycles.

4. **Constant memory**: The constant memory is a virtual address space in the global memory and does not occupy a physical memory block. It serves as a high-speed memory, specifically designed for rapid caching and efficient broadcasting of a single value to all threads within a warp.

5. **Texture memory**: Texture memory is a specialized form of global memory that is accessed through a dedicated texture cache to enhance performance. In earlier GPUs without caches, the texture memory on each SM served as the sole cache for data. However, the introduction of level-1 and level-2 caches in modern GPUs has rendered the texture memory's role as a cache obsolete. The texture memory proves most beneficial in enabling GPUs to execute hardware-accelerated operations while accessing memory units. For instance, it allows arrays to be accessed using normalized addresses, and the retrieved data can be automatically interpolated by the hardware. Additionally, the texture memory supports hardware-accelerated bilinear and trilinear interpolation for 2D and 3D arrays, respectively. Moreover, the texture memory facilitates automatic handling of boundary conditions based on array indices. This means that operations on array elements can be carried out without explicit consideration of boundary situations, avoiding the need for extra conditional branches in a thread.

## Compute Units

Hardware accelerators offer a variety of compute units to efficiently handle various neural networks. Figure :numref:`ch06/ch06-compute-unit` demonstrates how different layers of neural networks select appropriate compute units.

![Compute units](../img/ch06/compute-unit.png)
:label:`ch06/ch06-compute-unit`

1. **Scalar Unit**: calculates one scalar element at a time, similar to a standard reduced instruction set computer (RISC) core.

2. **1D Vector Unit**: computes multiple elements at a time, similar to the SIMD units in traditional CPU and GPU architectures. It has been widely used in HPC and signal processing.

3. **2D Matrix Unit**: computes the product of a matrix and a vector, or the outer product of two vectors, within one operation. It reuses data to reduce communication costs and memory footprint, which improves the performance of matrix multiplication.

4. **3D Cube Unit**: completes matrix multiplication within one operation. Specially designed for neural network applications, it can reuse data to compensate for the gap between data communication bandwidth and computing throughput.

The compute units on a GPU mostly include Scalar Units and 3D Cube Units. As shown in Figure :numref:`ch06/ch06-SM`, each SM has 64 32-bit floating-point arithmetic units, 64 32-bit integer arithmetic units, and 32 64-bit floating-point arithmetic units, which are Scalar Units, as well as 8 Tensor Cores, which are 3D Cube Units specially designed for neural network applications.

![Streaming multiprocessor](../img/ch06/SM.png)
:label:`ch06/ch06-SM`

A Tensor Core is capable of performing one $4\times4$ matrix multiply-accumulate operation per clock cycle, as shown in Figure :numref:`ch06/ch06-tensorcore`.

```
D = A * B + C
```

![Tensor Core operation](../img/ch06/tensorcore.png)
:label:`ch06/ch06-tensorcore`

$\bf{A}$, $\bf{B}$, $\bf{C}$, and $\bf{D}$ are $4\times4$ matrices. Input matrices $\bf{A}$ and $\bf{B}$ are FP16 matrices, and accumulation matrices $\bf{C}$ and $\bf{D}$ can be either FP16 or FP32 matrices. Tesla V100's Tensor Cores are programmable matrix multiply-accumulate units that can deliver up to 125 Tensor TFLOPS (tera floating-point operations per second) for training and inference applications, roughly a ten-fold increase in computing speed compared with the common FP32 compute units.

## Domain Specific Architecture

![Da Vinci architecture](../img/ch06/davinci_architecture.png)
:label:`ch06/ch06-davinci_architecture`

Domain Specific Architecture (DSA) has been an area of interest in meeting the fast-growing demand for computing power by deep neural networks. As a typical DSA design targeting image, video, voice, and text processing, neural network processing units (also called deep learning hardware accelerators) are system-on-chips (SoCs) containing special compute units, large memory units, and the corresponding control units. A neural processing unit, for example, the Ascend chip, typically consists of a control CPU, a number of AI computing engines, multi-level on-chip caches or buffers, and the digital vision pre-processing (DVPP) module.

The computing core of an AI chip is composed of AI Cores, which are responsible for executing scalar- and tensor-based arithmetic-intensive computing. Consider the Ascend chip as an example. Its AI Core adopts the Da Vinci architecture. Figure :numref:`ch06/ch06-davinci_architecture` shows the architecture of an AI Core, which can be regarded as a simplified version of a modern microprocessor architecture from the control perspective. It includes three types of basic computing units: Cube Unit, Vector Unit, and Scalar Unit. These units compute on tensors, vectors, and scalars, respectively, in three independent pipelines that are centrally scheduled by the system software to coordinate with each other for higher efficiency. Similar to GPU designs, the Cube Unit functions as the computational core of the AI Core and delivers parallel acceleration for matrix multiply-accumulate operations. Specifically, it can multiply two $16\times16$ matrices in a single instruction, equivalent to completing 4096 ($=16\times16\times16$) multiply-accumulate operations within an extremely short time, at FP16 precision.
# Overview

An effective computer architecture is expected to be both energy-efficient---quantified by the number of basic operations executed per unit of energy---and versatile---defined by the range of tasks a chip can undertake. We can evaluate these aspects by considering two primary chip categories. The first includes general-purpose processors like CPUs, capable of managing a diverse array of computing tasks, though at the cost of lower energy efficiency, averaging around 0.1 TOPS/W. Conversely, application-specific integrated circuits (ASICs) offer enhanced energy efficiency but have more restricted task capabilities. With respect to chip design, general-purpose processors have integrated various acceleration technologies such as superscalar execution, single-instruction multiple-data (SIMD), and single-instruction multiple-thread (SIMT) to boost their energy efficiency.

General-Purpose Graphics Processing Units (GPGPUs) achieve a respectable equilibrium between energy efficiency and versatility. Modern GPUs incorporate numerous optimization designs for vector, matrix, and tensor computing. For instance, NVIDIA GPUs are equipped with Tensor Cores, a Transformer Engine, and structured-sparsity acceleration, which are specifically designed to expedite the distinctive types of computation prevalent in neural networks. Despite these enhancements, GPUs' requirement to support a wide range of computing tasks results in larger footprints and increased power consumption.

A promising solution to this challenge is deep learning hardware accelerators. Notable examples include Google's Tensor Processing Units (TPUs), Apple's Neural Processing Units (NPUs), and Huawei's Ascend chips. For instance, Google's TPU, a chip designed to expedite deep learning computations, uses a systolic array to optimize matrix multiplication and convolution operations, fully utilizing local data with minimal memory access.
v1/en_chapters/chapter_accelerator/accelerator_practise.md

# Performance Optimization Methods

Hardware accelerators boast intricate computational and memory architectures. To maximize their performance, developers frequently need to grasp a variety of performance optimization methods. Common methods encompass enhancing arithmetic intensity, capitalizing effectively on shared memory, and optimizing the memory load/store pipeline, among others. The subsequent sections will elucidate these methods through practical programming examples, all aimed at a singular objective: accelerating an FP32 GEMM program.

## Implementing General Matrix Multiplication

Code `lst:cpu` shows a reference implementation of GEMM in C++.

**lst:cpu**
```cpp
float A[M][K];
float B[K][N];
float C[M][N];
float alpha, beta;

for (unsigned m = 0; m < M; ++m) {
  for (unsigned n = 0; n < N; ++n) {
    float c = 0;
    for (unsigned k = 0; k < K; ++k) {
      c += A[m][k] * B[k][n];
    }
    C[m][n] = alpha * c + beta * C[m][n];
  }
}
```

Each element in matrix $C$ is independently computed, and numerous GPU threads can be launched to compute the corresponding elements in matrix $C$ in parallel. The GPU kernel function is shown in Code `lst:gpu`.

**lst:gpu**
```cpp
__global__ void gemmKernel(const float * A,
                           const float * B, float * C,
                           float alpha, float beta, unsigned M, unsigned N,
                           unsigned K) {
  unsigned int m = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int n = threadIdx.y + blockDim.y * blockIdx.y;
  if (m >= M || n >= N)
    return;
  float c = 0;
  for (unsigned k = 0; k < K; ++k) {
    c += A[m * K + k] * B[k * N + n];
  }
  c = c * alpha;
  float result = c;
  if (beta != 0) {
    result = result + C[m * N + n] * beta;
  }
  C[m * N + n] = result;
}
```

Figure :numref:`cuda_naive_gemm` shows the layout of the implementation. Each element in matrix $C$ is computed by one thread. The row index $m$ and column index $n$ of the element in matrix $C$ corresponding to the thread are computed in lines 5 and 6 of the GPU kernel. Then, in lines 9 to 11, the thread loads the row vector in matrix $A$ according to the row index and the column vector in matrix $B$ according to the column index, and computes their inner product. Finally, the thread stores the result back to matrix $C$ in line 18.

![Naive GEMM](../img/ch06/practise/naive_gemm.png)
:label:`cuda_naive_gemm`

The method of launching the kernel function is shown in Code `lst:launch`.

**lst:launch**
```cpp
void gemmNaive(const float *A, const float *B, float *C,
               float alpha, float beta, unsigned M,
               unsigned N, unsigned K) {
  dim3 block(16, 16);
  dim3 grid((M - 1) / block.x + 1, (N - 1) / block.y + 1);

  gemmKernel<<<grid, block>>>(A, B, C, alpha, beta, M, N, K);
}
```

Each thread block processes $16\times16$ elements in matrix $C$. Therefore, $((M - 1) / 16 + 1) \times ((N - 1) / 16 + 1)$ thread blocks are used to compute the entire matrix $C$.

Eigen is used to generate data and compute the GEMM result on the CPU. In addition, error computing and time profiling code are implemented for the GPU computing result. For details, see [first_attempt.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/first_attempt.cu). After the program is compiled and executed, the output is as follows:

```
Average time: 48.961 ms
Max error: 0.000092
```

The peak GPU throughput can be approximated by using the following formula: 2 $\times$ Frequency $\times$ Number of single-precision compute units. The number of single-precision compute units equals the number of SMs in the GPU multiplied by the number of single-precision compute units in each SM. The results are as follows:

```
FP32 peak throughput 29767.680 GFLOPS
Average Throughput: 185.313 GFLOPS
```

A significant gap exists between the performance achieved by the current code and the peak device performance. In the entire computing process, the part with the highest computing density is the matrix multiplication $A\times B$. Its time complexity is $O(M*N*K)$, whereas the time complexity of the entire computing process is $O(M*N*K+2*M*N)$. Therefore, optimizing matrix multiplication is key to improving performance.

## Enhancing Arithmetic Intensity

Arithmetic intensity is the ratio of computational instructions to load/store instructions. Modern GPUs typically have numerous compute units but only a limited load/store bandwidth, so these units often sit idle waiting for data to arrive. Thus, boosting arithmetic intensity is a crucial step in improving program performance.

In the GPU kernel function discussed previously, we can approximate the arithmetic intensity by dividing the total number of floating-point operations by the number of data loads. When calculating the inner product within the $K$ loops, one floating-point multiplication and one addition occur each time an element from matrix $A$ and an element from matrix $B$ are loaded. Consequently, the arithmetic intensity is 1, derived from two 32-bit floating-point operations divided by two 32-bit data load instructions.

In the original code, each thread handles one element in matrix $C$, computing the inner product of a row in matrix $A$ and a column in matrix $B$. We can elevate the arithmetic intensity by increasing the number of elements in matrix $C$ that each thread processes, computing the inner products of multiple rows in matrix $A$ and multiple columns in matrix $B$. More specifically, if $m$ elements from matrix $A$ and $n$ elements from matrix $B$ are loaded concurrently while calculating the inner product in the $K$ loops, there are $m+n$ 32-bit load instructions and $2mn$ 32-bit computational instructions. Hence, the arithmetic intensity becomes $\frac{2mn}{m+n}$. Therefore, by increasing $m$ and $n$, we can optimize the arithmetic intensity.

In the preceding section, a `float` pointer was employed to access global memory, utilizing the hardware instructions `LDG.E` and `STG.E`. Multiple `float` elements can be loaded concurrently using the 128-bit wide instructions `LDG.E.128` and `STG.E.128`. These wide instructions can streamline the instruction sequence, potentially saving dozens of instruction issue cycles compared to four standard instructions, thereby enabling the issue of more computational instructions within the saved time. Wide instructions can also enhance the cache line hit rate. Despite these benefits, we advise against the blanket use of wide instructions in all code. Instead, programmers should prioritize direct optimization methods, such as parallel design and local data reuse.

A specific implementation is stacking four `float` numbers to form a 128-bit `float4` type, so that the load/store operations on a `float4` are completed with a single wide instruction. For details about the code implementation, see [util.cuh](https://github.com/openmlsys/openmlsys-cuda/blob/main/util.cuh).

Note that each thread now needs to load four `float` numbers (instead of one) from matrix $A$ and matrix $B$, requiring each thread to process a $4\times 4$ block (`thread tile`) in matrix $C$. Each thread loads data from matrix $A$ and matrix $B$ from left to right and from top to bottom, computes the data, and stores the result to matrix $C$, as shown in Figure :numref:`use_float4`.

![Using float4](../img/ch06/practise/use_float4.png)
:label:`use_float4`

For details about the complete code, see [gemm_use_128.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_128.cu). We can further increase the amount of data processed by each thread in order to improve the arithmetic intensity even more, as shown in Figure :numref:`use_tile`. For details about the code used to achieve this, see [gemm_use_tile.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_tile.cu).

![Using thread tiles](../img/ch06/practise/use_tile.png)
:label:`use_tile`

The test results are as follows:

```
Max Error: 0.000092
Average Time: 6.232 ms, Average Throughput: 1378.317 GFLOPS
```

To sample and analyze performance indicators, we will use the analysis tool Nsight Compute released by NVIDIA. This tool, designed for GPU kernel functions, samples and collects GPU activity data by hooking drivers. The following command can be used to analyze the performance:

```bash
ncu --set full -o <profile_output_file> <profile_process>
```

`--set full` indicates that all data is sampled. `-o` indicates that the result is output as a file. `<profile_output_file>` indicates the output file name without the file name extension. `<profile_process>` indicates the executable file to be analyzed and its arguments. For example, to analyze `first_attempt` and name the output result `first_attempt_prof_result`, run the following command:

```bash
ncu --set full -o first_attempt_prof_result ./first_attempt
```

If the system displays a message indicating that you do not have permission to run this command, prefix it with `sudo` and run it again. After obtaining the output file, the program `nv-nsight-cu` can be used to view the file. We compared the profiling results of the new GPU kernel function and the previous one.

The result shows that the number of `LDG` instructions decreases by 84%, and the value of `Stall LG Throttle` decreases by 33%. By using wide instructions to increase the compute density, we are able to reduce the number of global load/store instructions, thereby cutting the amount of time spent waiting to issue instructions. The improvement in `Arithmetic Intensity` confirms that our analysis of the arithmetic intensity is correct. The gemm_use_tile.cu test results are as follows:

```
Max Error: 0.000092
Average Time: 3.188 ms, Average Throughput: 2694.440 GFLOPS
```

The analysis using Nsight Compute shows that the code also improves other indicators, such as `Stall LG Throttle`.

## Caching Data in Shared Memory

By increasing the amount of data that a thread loads in one go, we can improve the arithmetic intensity and performance. However, this method decreases the degree of parallelism because it reduces the total number of enabled threads. Other hardware features need to be exploited in order to improve performance without compromising the degree of parallelism. In the earlier code, several thread blocks are enabled, each of which processes one or more matrix blocks in matrix $C$. As shown in Figure :numref:`duplicated_data`, thread $x$ and thread $y$ process the same row in matrix $C$, so they load the same data from matrix $A$. The shared memory can be used to improve the program throughput by enabling different threads in the same thread block to load unique data and reuse shared data.

![Duplicated loads](../img/ch06/practise/duplicated_data.png)
:label:`duplicated_data`

We have previously mentioned that the inner product can be computed by loading and accumulating data in $K$ loops. Specifically, in each loop, threads that process the same row in matrix $C$ load the same data from matrix $A$, and threads that process the same column in matrix $C$ load the same data from matrix $B$. The code can therefore be optimized by dividing the $K$ loops into $\frac{K}{tileK}$ outer loops and $tileK$ inner loops. In this way, an entire block of data is loaded in each outer loop and accumulated in each inner loop. Figure :numref:`use_smem_store` shows the process of moving data from the global memory to the shared memory. Before each inner loop starts, the entire `tiles` of matrix $A$ and matrix $B$ are stored in the shared memory.

Figure :numref:`use_smem_load` shows the process of moving data from the shared memory to the registers. In each inner loop, data is loaded from the shared memory and computed. An advantage of this design is that each thread does not need to load all the data it requires from the global memory. Instead, the entire thread block loads the data required for all its threads from the global memory and stores the data in the shared memory. During computation, each thread only needs to load the data it requires from the shared memory.

![Storing tiles to shared memory](../img/ch06/practise/use_smem_store.png)
:label:`use_smem_store`

![Loading tiles from shared memory](../img/ch06/practise/use_smem_load.png)
:label:`use_smem_load`

For details about the complete code, see [gemm_use_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_use_smem.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.617 ms, Average Throughput: 13925.168 GFLOPS
```

Again, we use Nsight Compute to profile the kernel function and compare the results with the previous ones. The analysis shows some major improvements. Specifically, the number of `LDG` instructions decreases by 97%, which is consistent with this design. The value of `SM Utilization` increases by 218%, which proves that using the shared memory can reduce the memory access latency and improve the memory utilization. Furthermore, the performance of other indicators such as `Pipe Fma Cycles Active` also improves significantly, demonstrating the benefits of the shared memory.

## Reducing Register Usage

In the previous sections, the data blocks that store matrix $A$ in the shared memory are arranged in a row-first manner, and the shared memory is loaded by row. We can instead adopt a column-first manner in order to reduce loops and loop variables, thereby reducing the number of registers used and improving performance.

For details about the complete code, see [gemm_transpose_smem.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_transpose_smem.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.610 ms, Average Throughput: 14083.116 GFLOPS
```

Analysis by Nsight Compute shows that `Occupancy` increases by 1.3%. This is because only 111 registers are used (17 fewer than used by the previous GPU kernel function). The benefit of reducing the number of registers varies depending on the GPU architecture. Observations have shown that the number of `STS` instructions increases and bank conflicts occur, meaning that using fewer registers may not have a positive impact on other GPU architectures.

## Hiding Shared Memory Loading Latency

To load data from the shared memory, a GPU uses the `LDS` instruction. After issuing this instruction, the GPU will execute the following instructions without waiting for the data to be loaded into the register, unless those instructions require the data. In the previous section, each time this instruction is issued during the $tileK$ inner loops, the mathematical operation that requires the loaded data is performed immediately, so the compute unit has to wait for the data to be loaded from the shared memory, as shown in Figure :numref:`use_smem_pipeline`. Accessing the shared memory may take dozens of clock cycles, but computation instructions can often be completed within only a few clock cycles. In order to significantly accelerate memory access, we can hide the shared memory loading latency by optimizing the pipeline. Specifically, during the $tileK$ inner loops, the load instructions that prepare the data for the next loop can be issued at the beginning of each loop, as shown in Figure :numref:`hide_smem_latency`. Because the computation instructions in the current loop do not require the data for the next loop, their execution will not be blocked by the instructions that load that data.

![Unoptimized shared memory pipeline](../img/ch06/practise/use_smem_pipeline.png)
:label:`use_smem_pipeline`

![Hiding shared memory latency](../img/ch06/practise/hide_smem_latency.png)
:label:`hide_smem_latency`

For details about the complete code, see [gemm_hide_smem_latency.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_hide_smem_latency.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.585 ms, Average Throughput: 14686.179 GFLOPS
```

Analysis by Nsight Compute shows that the value of `Stall Short Scoreboard` decreases by 67% when compared with that of the previous GPU kernel function. As mentioned before, after a GPU memory load/store instruction is issued, the GPU executes the next instruction without waiting for the data to land in the register. Instead, it sets a flag on the scoreboard and resets the flag after the data lands. If instructions that require the data need to be executed, the GPU will execute them only after the data has landed. The decrease in `Stall Short Scoreboard` demonstrates that hiding the access latency of the shared memory is an effective way to better utilize the GPU.

## Hiding Global Memory Loading Latency

To load data from the global memory, a GPU uses the `LDG` instruction, whose behavior is similar to that of the `LDS` instruction used to load data from the shared memory, as discussed in the previous section. At the beginning of each of the $\frac{K}{tileK}$ outer loops, instructions that load the data tiles of matrix $A$ for the next loop are issued. Because this data is not required by any inner loop of the current outer loop, the computation in the inner loops will not wait for the load instruction to complete, thereby hiding the global memory loading latency. We can also write the data in `buffer` to `tile` in the last inner loop, after $tileK - 1$ inner loops have been executed, further reducing the latency of writing data to `tile`. Figure :numref:`hide_global_latency` shows the optimized pipeline.

![Hiding global memory latency](../img/ch06/practise/hide_global_latency.png)
:label:`hide_global_latency`

For details about the complete code, see [gemm_final.cu](https://github.com/openmlsys/openmlsys-cuda/blob/main/gemm_final.cu).

The test results are as follows:

```
Max Error: 0.000092
Average Time: 0.542 ms, Average Throughput: 15838.302 GFLOPS
```

Similar to the `Stall Short Scoreboard` results obtained in the previous section, analysis by Nsight Compute shows that the value of `Stall Long Scoreboard` (a global memory indicator) decreases by 67%. Such a significant decrease demonstrates that prefetching data can effectively hide the global memory loading latency.

## Performance Optimization Principles

So far, we have discussed various methods to enhance the performance of an accelerator. Even though other methods exist, the principles of performance optimization generally adhere to the following:

- Increasing parallelism through resource mapping: Multi-level parallel resources (`blocks`, `warps`, and `threads`) are mapped to the data needing computation and transfer to enhance program parallelism.

- Reducing memory access latency through memory structure optimization: Based on the recognition of data reuse within the same `block` during computation, the reused data is stored in local memory (such as shared memory and registers) to increase locality.

- Reducing the instruction issue overhead by optimizing instruction execution: The `#pragma unroll` directive is used to unroll loops in order to improve instruction-level parallelism and reduce logic judgments. Vectorized load instructions are used to increase bandwidth. For the Ampere architecture, the widest vectorized load instruction is `LDG.E.128`, and the corresponding data type for loading is `float4`.

- Hiding load/store latency by optimizing the memory access pipeline: In instances where the in-memory data undergoes modifications (such as the movement of matrix data), we can optimize the memory access pipeline so that the accelerator performs computations during the intervals between data movements, thereby concealing the latency associated with data movement.

v1/en_chapters/chapter_accelerator/accelerator_programming.md

# Programming Methods
|
||||
:label:`Programming Principles for Hardware Accelerators`
|
||||
|
||||
The first two sections of this chapter primarily discuss the
|
||||
significance, ideas, and basic principles behind the design of hardware
|
||||
accelerators. Co-optimization of software and hardware, as an important
|
||||
guiding principle for building efficient AI systems, requires mutual
|
||||
influence and close coupling between software algorithms/stacks and
|
||||
hardware architectures in neural network applications. In order to fully
|
||||
leverage the advantages of accelerators, it is necessary to design a set
|
||||
of programming methods based on the hardware system architecture.
|
||||
|
||||
## Method Classification

Programming methods for hardware accelerators fall into three
approaches: using high-level computation operators, harnessing
primitives for specialized hardware units, and employing low-level
assembly languages:

1. **High-level computation operators**: Hardware accelerators often
   come equipped with high-level, hardware-accelerated implementations
   of operators extensively used in numerical computing and deep
   learning. For instance, NVIDIA provides cuBLAS (CUDA Basic Linear
   Algebra Subprograms) and cuDNN (CUDA Deep Neural Network library).
   These libraries offer developers an accessible way to harness the
   power of NVIDIA GPUs without delving into low-level code. The
   operators they provide are optimized for efficiency and
   automatically exploit specific GPU features, such as Tensor Cores.

2. **Primitives for task-specific hardware units**: Hardware
   accelerators typically feature task-specific hardware units (like
   the Tensor Cores in NVIDIA GPUs) engineered to execute
   mixed-precision matrix multiplication operations at high speed.
   These units have associated programming primitives, such as CUDA's
   Warp Matrix Multiply Accumulate (WMMA) API and primitives for
   loading/unloading tensors on the units.

3. **Low-level assembly languages**: Hardware accelerators also expose
   low-level assembly language interfaces. For instance, NVIDIA GPUs
   offer the PTX ISA (Parallel Thread Execution Instruction Set
   Architecture). It provides explicit control over all aspects of GPU
   behavior, but it requires a deep understanding of the GPU
   architecture and is more challenging to use correctly and
   effectively than the high-level interfaces provided by cuBLAS and
   cuDNN. PTX code is typically generated by a compiler from a
   high-level language like CUDA C++.

In essence, these three methods operate at different levels of
abstraction. High-level operator libraries like cuBLAS and cuDNN
provide easy-to-use interfaces to powerful hardware-accelerated
operations; the primitives associated with task-specific hardware
units expose a more detailed interface to hardware operations; and
low-level assembly languages like the PTX ISA provide the
finest-grained, lowest-level control over accelerator behavior.

## Programming Examples

We exemplify the different programming methods by implementing General
Matrix Multiplication (GEMM) with each approach, targeting an NVIDIA
Volta GPU. GEMM follows the equation
$\bf{C} = \alpha \bf{A}\times \bf{B} + \beta \bf{C}$, where
$\bf{A}\in\mathbb{R}^{M\times K}, \bf{B}\in\mathbb{R}^{K\times N}, \bf{C}\in\mathbb{R}^{M\times N}$,
and $\alpha$ and $\beta$ are parameters provided by users.

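Before turning to hardware-specific implementations, the GEMM equation above can be pinned down with a tiny reference implementation — a plain-Python sketch for checking results, not a performant kernel:

```python
# Reference GEMM: C = alpha * A @ B + beta * C,
# with A (M x K), B (K x N), C (M x N) as lists of rows.
def gemm_ref(alpha, A, B, beta, C):
    M, K, N = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(K)) + beta * C[i][j]
             for j in range(N)] for i in range(M)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
print(gemm_ref(1.0, A, B, 0.0, C))  # plain A @ B: [[19.0, 22.0], [43.0, 50.0]]
```

Every implementation below computes exactly this function, only with very different amounts of hardware control.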
### High-level Computation Operators
:label:`sec-accelerator-use-cublas`

Using an operator acceleration library directly is the most
straightforward method. NVIDIA offers two such libraries: cuBLAS, which
provides an interface for leveraging Tensor Cores to accelerate GEMM
operations, and cuDNN, which offers an interface for accelerating
neural network operations. To perform GEMM with Tensor Cores through
cuBLAS, we can use the function `cublasGemmEx`, whose signature is
shown in Code `lst:cublasGemmEx`.

**lst:cublasGemmEx**
```cpp
cublasStatus_t cublasGemmEx(cublasHandle_t handle,
                            cublasOperation_t transa, cublasOperation_t transb,
                            int m, int n, int k,
                            const void *alpha,
                            const void *A, cudaDataType_t Atype, int lda,
                            const void *B, cudaDataType_t Btype, int ldb,
                            const void *beta,
                            void *C, cudaDataType_t Ctype, int ldc,
                            cublasComputeType_t computeType,
                            cublasGemmAlgo_t algo)
```

`handle` is the cuBLAS handle, created using the `cublasCreate`
function. `transa` and `transb` denote whether the matrices $\bf{A}$
and $\bf{B}$, respectively, are transposed. `m`, `n`, and `k` describe
the shapes of the matrices. `alpha` and `beta` scale the matrix
multiplication result and the accumulator, respectively. `A`, `B`, and
`C` are pointers to the starting addresses of the matrices, and
`Atype`, `Btype`, and `Ctype` describe their data types. For example,
`CUDA_R_16F` indicates that the data is stored as real 16-bit floating
point. `lda`, `ldb`, and `ldc` are the leading dimensions of the
matrices. `computeType` is the data type used in computation. For
instance, `CUBLAS_COMPUTE_16F` implies the use of Tensor Cores for
computation in 16-bit floating point. Notably, if the input data type
is 32-bit float, we can use `CUBLAS_COMPUTE_32F_FAST_16F` to perform
the computation in 16-bit floating point and obtain acceleration from
Tensor Cores. `algo` selects the algorithm used in computation;
`CUBLAS_GEMM_DEFAULT` is commonly used to select the default
algorithm.

### Primitives for Hardware Units

The second approach to accelerator programming involves the use of
programming primitives, such as invoking the CUDA Warp Matrix Multiply
Accumulate (WMMA) API on a device. This approach hinges on the
collaborative design of software and hardware, meaning that the design
of programming APIs at this level is architecture-dependent. For
instance, in the Volta architecture, the control object of WMMA is a
$16\times16$ matrix block, processed by two Tensor Cores at a time.
This notion is tightly linked to the way Tensor Cores are integrated
into an SM.

In the Volta architecture, NVIDIA offers three distinct sizes of WMMA
multiply-accumulate computing interfaces for FP16 input data:
$16\times16\times16$, $32\times8\times16$, and $8\times32\times16$.

The basic control unit of the WMMA API is a fragment: a template class
that specifies information such as the role of a matrix (multiplier or
accumulator), the matrix shape (`WMMA_M`, `WMMA_N`, or `WMMA_K`), the
data type (FP16, FP32, etc.), and the layout (`row_major` or
`col_major`). Code `lst:fragment` shows the fragment types.

**lst:fragment**
```cpp
wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;
```

The data of the matrix blocks required by the multiplication needs to
be loaded into registers as fragments; after the multiply-accumulate
operations performed by Tensor Cores, the fragments are stored back to
global memory. NVIDIA provides the `wmma::load_matrix_sync()` and
`wmma::store_matrix_sync()` interfaces to load and write the submatrix
blocks, the `wmma::fill_fragment()` interface to initialize the data
of a fragment, and the `wmma::mma_sync()` interface to perform
multiply-accumulate operations on fragments.

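The fragment workflow (fill the accumulator, load tiles, multiply-accumulate, store) can be mimicked in plain Python to make the tiling logic concrete. This is a sketch of the control flow only, with a hypothetical $2\times2$ tile size standing in for the $16\times16$ WMMA fragments:

```python
# Tiled matmul sketch mirroring the WMMA control flow: for each output tile,
# fill an accumulator, loop over K tiles, multiply-accumulate, then store.
T = 2  # tile size (stand-in for WMMA_M = WMMA_N = WMMA_K)

def tiled_matmul(A, B):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, T):
        for j0 in range(0, N, T):
            acc = [[0.0] * T for _ in range(T)]      # wmma::fill_fragment
            for k0 in range(0, K, T):
                # "load" the A and B tiles, then multiply-accumulate
                # (wmma::load_matrix_sync + wmma::mma_sync)
                for i in range(T):
                    for j in range(T):
                        for k in range(T):
                            acc[i][j] += A[i0 + i][k0 + k] * B[k0 + k][j0 + j]
            for i in range(T):                        # wmma::store_matrix_sync
                for j in range(T):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C

print(tiled_matmul([[1.0, 2.0], [3.0, 4.0]],
                   [[5.0, 6.0], [7.0, 8.0]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

On real hardware the innermost triple loop is what the Tensor Cores execute in a single `mma_sync` call; only the outer tiling loops remain in user code.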
### Low-level Assembly Language Interface

The PTX ISA offers yet another programming interface, for example, the
`mma.sync.aligned.m8n8k4` instruction in the Volta architecture. This
instruction uses the shape configuration $M=8, N=8, K=4$ to perform
multiply-add operations. The basic control unit of this API is the
data element. The matrix size (modifier `.m8n8k4`), the data layout
(modifier `.row` or `.col`), and the data formats of the output
accumulator D, matrix A, matrix B, and the input accumulator C
(modifier `.f32` or `.f16`) need to be specified. NVIDIA's
documentation on inline PTX assembly[^1] helps programmers write code
following the corresponding syntax rules, as shown in Code `lst:ptx`.

**lst:ptx**
```cpp
half_t *a, *b;
float *C, *D;
unsigned const* A = reinterpret_cast<unsigned const*>(a);
unsigned const* B = reinterpret_cast<unsigned const*>(b);

asm volatile(
    "mma.sync.aligned.m8n8k4.row.row.f32.f16.f16.f32 "
    "{%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, "
    "{%12,%13,%14,%15,%16,%17,%18,%19};\n"
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3]), "=f"(D[4]),
      "=f"(D[5]), "=f"(D[6]), "=f"(D[7])
    : "r"(A[0]), "r"(A[1]), "r"(B[0]), "r"(B[1]), "f"(C[0]),
      "f"(C[1]), "f"(C[2]), "f"(C[3]), "f"(C[4]), "f"(C[5]),
      "f"(C[6]), "f"(C[7]));
```

Data elements are used directly as the input (the `unsigned` type is
used to hold pairs of FP16 data elements). Moreover, NVIDIA provides
the `ldmatrix` instruction to load data from shared memory into
fragments.

The finer-grained `mma` instruction can compose warp-level WMMA
operations of more diverse shapes and control the mapping between
threads and data within a warp. PTX instructions therefore offer
greater flexibility than CUDA C++ code alone.

[^1]: available at
    <https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html>

26
v1/en_chapters/chapter_accelerator/index.md
Normal file
@@ -0,0 +1,26 @@
# Hardware Accelerator

In the field of AI frameworks, hardware accelerators play a vital role
in enabling efficient neural network computations. This chapter delves
into the design of modern hardware accelerators, their programming
techniques, and the typical approaches to optimizing accelerator
performance.

This chapter has the following learning objectives:

1. Understand the architecture of a modern hardware accelerator.

2. Understand the methods of programming hardware accelerators.

3. Understand the typical techniques used to optimize the performance
   of accelerators.

```toc
:maxdepth: 2

Overview
Components_of_Hardware_Accelerators
Programming_Methods
Performance_Optimization_Methods
Chapter_Summary
```

18
v1/en_chapters/chapter_accelerator/summary.md
Normal file
@@ -0,0 +1,18 @@
# Chapter Summary

1. Hardware accelerators offer various types of on-chip caches and
   computational units, enhancing the performance of deep learning
   computational tasks.

2. To fully exploit the performance potential of hardware
   accelerators, it is necessary to make them programmable, which in
   turn drives architectural innovation.

3. To balance computational efficiency and usability, the programming
   methods for hardware accelerators range from high-level computation
   operators, to primitives associated with specific hardware units,
   to low-level assembly languages.

4. A variety of methods are crucial for optimizing accelerator
   performance, including enhancing arithmetic intensity, caching data
   in shared memory, and concealing data load/store latency.

@@ -0,0 +1,227 @@
## Computation Scheduling and Execution

After operator selection and memory allocation, computation tasks can be scheduled and executed on hardware through the runtime. Depending on whether operators are compiled into a computational graph, computation scheduling can be divided into two approaches: single-operator scheduling and graph scheduling; MindSpore, for example, provides the PyNative mode and Graph mode respectively. Furthermore, depending on the hardware capabilities, the execution of computational graphs can be divided into two modes: interactive execution, where operators are dispatched and executed one by one, and sink execution, where the entire computational graph or partial subgraphs are dispatched to the hardware at once.

### Single-Operator Scheduling

Single-operator scheduling, as opposed to graph-based scheduling, means that the operators contained in algorithms or models are scheduled and executed one by one through the Python runtime. Examples include PyTorch's default execution mode, TensorFlow's eager mode, and MindSpore's PyNative mode. Taking MindSpore as an example, the code is shown below.

```python
import mindspore.nn as nn
from mindspore import context

# Preset single-operator (PyNative) execution mode in the context
context.set_context(mode=context.PYNATIVE_MODE)

class Computation(nn.Cell):
    def construct(self, x, y):
        m = x * y
        n = x - y
        print(m)
        z = m + n
        return z

compute = Computation()
c = compute(1, 2)
print(c)
```

The above script defines all computation logic in the `construct` method of the `Computation` class. Since single-operator execution mode is preset in the context at the beginning of the script, the computations in `construct` will be called and executed line by line through the Python runtime, and `print` commands can be inserted at any position in the code to print intermediate computation results.

The call chain for single-operator execution is shown in :numref:`single_op_exec`. After an operator is triggered for execution on the Python side, it goes through the machine learning framework's initialization, which determines information including the operator's precision, input and output types and sizes, and the corresponding hardware device. The framework then allocates the memory required for computation and finally hands the operator over to the specific hardware computing device for execution.


:width:`800px`
:label:`single_op_exec`

The advantage of single-operator scheduling lies in its flexibility. Since operators are directly scheduled through the Python runtime, it can express arbitrarily complex computation logic, especially in scenarios requiring complex control flow and Python native data structures to implement complex algorithms. Additionally, single-operator scheduling is very convenient for debugging, as developers can print any variable of interest during code execution. Finally, by driving operators through the Python runtime, computation tasks can be completed in coordination with Python's vast ecosystem of libraries.

### Graph Scheduling

Although single-operator scheduling has the advantages described above, its disadvantages are also obvious. On one hand, it is difficult to optimize computation performance: without global information from the computational graph, single-operator execution cannot perform context-based optimizations such as operator fusion and algebraic simplification. On the other hand, because the topological relationships of the computation are unavailable, the entire computation can only be scheduled and executed serially, meaning that parallel execution cannot be achieved through the runtime. For example, the computation logic of the sample code above can be expressed as shown in :numref:`graph_exec`. From this computational graph, we can see that there is no dependency between the multiplication and subtraction operations, so these two computations can be executed in parallel. Such parallelism can only be discovered after the computation is expressed as a computational graph, which is one of the advantages of graph scheduling over single-operator scheduling.


:width:`800px`
:label:`graph_exec`

Now let us introduce the scheduling methods for computational graphs. In a typical heterogeneous computing environment, there are multiple types of computing devices such as CPUs, GPUs, and NPUs. A computational graph can therefore be composed of operators running on different devices, forming a heterogeneous computational graph. :numref:`computation_graph` shows a typical computational graph involving heterogeneous hardware.


:width:`800px`
:label:`computation_graph`

The computational graph described above consists of operators corresponding to the following types of heterogeneous hardware:

- **CPU Operators**: Operators written in C++ and executed on the host CPU. The performance of CPU computation depends on whether the multi-core capability of the CPU can be fully utilized.

- **GPU Operators**: Taking NVIDIA GPU chips as an example, GPU kernels are dispatched one by one from the host side to the GPU device, where the chip executes the operator's computation logic. Thanks to the large number of parallel execution units on the chip, GPUs provide powerful acceleration for highly parallel algorithms.

- **NPU Operators**: Taking Huawei Ascend chips as an example, Ascend is a highly integrated SoC. The advantage of NPUs is their support for sinking part of or the entire computational graph into the chip to complete computation, without interacting with the host during the computation, resulting in higher computational performance.

- **Python Operators**: Similar to CPU operators in execution mode, both are executed by the host's CPU cores. The difference is that the computation logic of a Python operator is interpreted and executed by the Python interpreter.

The prerequisite for correctly expressing a heterogeneous computational graph is to accurately identify the device on which each operator executes: for example, the CPU, GPU, and Ascend kernels identified in the heterogeneous computational graph :numref:`computation_graph`, as well as the Python kernels marked to be executed by the Python runtime. Mainstream frameworks all provide the capability to specify the device on which an operator runs. Taking MindSpore as an example, a simple heterogeneous computation code is shown below.

```python
import numpy as np
from mindspore import Tensor
import mindspore.ops.operations as ops
from mindspore.common.api import jit

# Create operators and specify the hardware device for execution
add = ops.Add().add_prim_attr('primitive_target', 'CPU')
sub = ops.Sub().add_prim_attr('primitive_target', 'GPU')

# Specify execution in static computational graph mode
@jit
def compute(x, y, z):
    r = add(x, y)
    return sub(r, z)

# Create arguments
x = Tensor(np.ones([2, 2]).astype(np.float32))
y = Tensor(np.ones([2, 2]).astype(np.float32))
z = Tensor(np.ones([2, 2]).astype(np.float32))

# Execute computation
output = compute(x, y, z)
```

The above code snippet completes the computation logic of `x + y - z`, where the Add operator is set to execute on the CPU and the Sub operator on the GPU, forming CPU-GPU collaborative heterogeneous computation. Through a similar tagging mechanism, arbitrarily complex multi-hardware collaborative heterogeneous computation can be expressed.

Another relatively special type of heterogeneity involves Python operators. The advantages of Python lie in its expressive flexibility, development efficiency, and rich surrounding ecosystem, so introducing Python operators into the computational graph to collaborate with operators on other heterogeneous hardware greatly enhances computation flexibility. Unlike CPU-GPU heterogeneity, where execution happens on different devices, Python operators and CPU operators implemented in C++ are both executed by the host-side CPU cores. The difference is that Python operators are described through the unified computational graph and therefore also need to be triggered for execution by the backend runtime, which requires corresponding framework support.

After marking the devices corresponding to the operators in the computational graph, the graph is ready to be scheduled and executed. Depending on hardware capabilities, the execution of heterogeneous computational graphs can be divided into three modes:

- **Operator-by-operator interactive execution**, mainly for CPU and GPU scenarios, where operators in the computational graph are scheduled and executed one by one according to the dependency relationships of their inputs and outputs.

- **Whole-graph sink execution**, mainly for NPU chips, whose main advantage is the ability to dispatch the entire neural network's computational graph to the device at once. The device independently completes the scheduling and execution of all operators in the graph without relying on the host CPU, reducing the number of host-chip interactions and improving computational efficiency and performance through the NPU's tensor acceleration capability.

- **Subgraph sink execution**, which combines the previous two modes. Because computational graph expression is flexible, whole-graph sink execution on NPU chips may not achieve optimal efficiency in complex scenarios. Parts with low execution efficiency on the NPU can therefore be separated out and handed to devices with higher execution efficiency, such as CPUs or GPUs, while subgraphs better suited to the NPU are sunk into it for computation, balancing performance and flexibility.

The heterogeneous computational graph above can serve two purposes. The first is heterogeneous hardware acceleration, placing specific computations on suitable hardware. The second is concurrent execution between operators. From the computational graph, we can see that there is no dependency between kernel_1 and kernel_2, nor between kernel_3 and kernel_4, so these two pairs of CPU and GPU operators can logically be invoked concurrently by the framework. However, kernel_5 takes the outputs of kernel_3 and kernel_4 as its inputs, so it must wait for both to complete before being triggered for execution.

Although concurrency relationships between operators can be fully expressed in the computational graph, in practice some unexpected side effects may arise from concurrency, as shown in the following code:

```python
import mindspore as ms
from mindspore import Parameter, Tensor
import mindspore.ops.operations as ops
from mindspore.common.api import jit

# Define global variables
x = Parameter(Tensor([1.0], ms.float32), name="x")
y = Tensor([0.2], ms.float32)
z = Tensor([0.3], ms.float32)

# Specify execution in static computational graph mode
@jit
def compute(y, z):
    ops.Assign()(x, y)
    ops.Assign()(x, z)
    r = ops.Sub()(x, y)
    return r

compute(y, z)
```

The above code expresses the following computation logic:

```text
x = y
x = z
x = x - y
```

This simple computation logic, when translated to the computational graph, can be represented as shown in :numref:`side_effect_1`.


:width:`800px`
:label:`side_effect_1`

There are no data dependencies among the three computations shown in the code, so these three operators could logically be executed concurrently in the computational graph. However, based on the code semantics, the program obviously needs to be executed sequentially. The issue introduced here is called a side effect: the behavior of modifying state variables defined outside the function. Because of the side effects, incorrect concurrency relationships arise. One solution is to add dependencies between operators during the computational graph compilation phase, converting the concurrent execution logic into sequential execution logic. The transformed computational graph is shown in :numref:`side_effect_2`.


:width:`800px`
:label:`side_effect_2`


The dashed arrows in the figure represent the dependency relationships between operators. After adding these dependencies, the operators will execute serially in the order Assign_1, Assign_2, Sub_1, which is consistent with the original code semantics.

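A compiler pass for this can be sketched in a few lines: chain together every pair of operators that touch the same state variable, so a concurrent scheduler is forced to respect program order. This is a minimal illustration (the operator tuples and names are ours, not a real framework API):

```python
# Sketch of a side-effect ordering pass: operators that read or write the same
# state variable are chained with explicit dependency edges.
def add_side_effect_deps(ops):
    """ops: list of (name, reads, writes) tuples in program order.
    Returns a list of dependency edges (before, after)."""
    deps = []
    last_touch = {}  # state variable -> name of last op that touched it
    for name, reads, writes in ops:
        for var in sorted(reads | writes):
            if var in last_touch:
                deps.append((last_touch[var], name))
        for var in reads | writes:
            last_touch[var] = name
    return deps

program = [
    ("Assign_1", {"y"}, {"x"}),       # x = y
    ("Assign_2", {"z"}, {"x"}),       # x = z
    ("Sub_1",    {"x", "y"}, set()),  # x - y
]
print(add_side_effect_deps(program))
```

For the three-operator example above, the pass emits edges Assign_1 → Assign_2 and Assign_2 → Sub_1, matching the dashed arrows in :numref:`side_effect_2`.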
### Interactive Execution

As described above, in interactive execution mode, the framework's runtime dispatches operators to the hardware one by one according to the dependency relationships among operators in the computational graph, following a certain execution order (e.g., breadth-first order). To aid understanding and comparison, we first introduce the execution of non-heterogeneous computational graphs (where all operators in the graph run on the same type of device), as heterogeneous computational graph execution is built upon it.

1. Execution of Non-Heterogeneous Computational Graphs


:width:`800px`
:label:`graph_exec_1`

:numref:`graph_exec_1` shows a non-heterogeneous computational graph where all kernels are GPU operators. The execution methods are generally divided into serial execution and parallel execution:


:width:`800px`
:label:`graph_exec_2`


:width:`800px`
:label:`graph_exec_3`

- **Serial Execution**: The computational graph is unfolded into an execution sequence, and operators are executed serially one by one according to the execution order, as shown in :numref:`graph_exec_2`. Its characteristics are a fixed execution order, single-threaded execution, and relatively low system resource requirements.

- **Parallel Execution**: The computational graph is unfolded according to the dependency relationships between operators. Operators with dependencies maintain their execution order through input dependencies, while operators without dependencies can be executed in parallel, as shown in :numref:`graph_exec_3`: Kernel_1 and Kernel_2 have no dependencies and can execute in parallel, as can Kernel_3 and Kernel_4. Its characteristics are a non-fixed execution order (the order of operators is likely to differ between runs), multi-threaded execution, and relatively high system resource requirements.

Serial execution and parallel execution each have their advantages and disadvantages, summarized in :numref:`serial_vs_parallel`.

:Comparison of Serial Execution and Parallel Execution

| Execution Method | Serial Execution | Parallel Execution |
| --- | --- | --- |
| Operator execution order | Fixed | Non-fixed |
| Operator execution threads | Single-threaded | Multi-threaded |
| Required execution resources | Lower | Higher |

:label:`serial_vs_parallel`

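The difference between the two modes can be sketched with a toy scheduler: serial execution walks one fixed topological order, while parallel execution releases, in each "wave", every operator whose inputs are already complete. This is a plain-Python sketch; the kernel names mirror the figures above:

```python
# Toy graph scheduler over the five-kernel example:
# Kernel_3 depends on Kernel_1, Kernel_4 on Kernel_2, Kernel_5 on both.
from collections import deque

deps = {                      # op -> set of ops it depends on
    "Kernel_1": set(), "Kernel_2": set(),
    "Kernel_3": {"Kernel_1"}, "Kernel_4": {"Kernel_2"},
    "Kernel_5": {"Kernel_3", "Kernel_4"},
}

def serial_order(deps):
    """One fixed topological order: single-threaded, deterministic."""
    order, done = [], set()
    pending = deque(sorted(deps))
    while pending:
        op = pending.popleft()
        if deps[op] <= done:
            order.append(op)
            done.add(op)
        else:
            pending.append(op)   # inputs not ready yet, retry later
    return order

def parallel_waves(deps):
    """Group ops into waves that a multi-threaded runtime could run concurrently."""
    waves, done = [], set()
    while len(done) < len(deps):
        ready = sorted(op for op in deps if op not in done and deps[op] <= done)
        waves.append(ready)
        done.update(ready)
    return waves

print(serial_order(deps))
# ['Kernel_1', 'Kernel_2', 'Kernel_3', 'Kernel_4', 'Kernel_5']
print(parallel_waves(deps))
# [['Kernel_1', 'Kernel_2'], ['Kernel_3', 'Kernel_4'], ['Kernel_5']]
```

The three waves correspond exactly to the parallelism described above: two independent pairs, then the final kernel that must wait for both of its inputs.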
2. Execution of Heterogeneous Computational Graphs


:width:`800px`
:label:`graph_exec_4`

:numref:`graph_exec_4` shows a heterogeneous computational graph, where Kernel_1, Kernel_2, Kernel_5, and Kernel_9 are CPU operators, Kernel_6 is a Python operator (also executed on the CPU), Kernel_3 and Kernel_4 are GPU operators, and Kernel_7 and Kernel_8 are Ascend operators.

Generally, computational graph optimizations are implemented on non-heterogeneous computational graphs, requiring all operators in a graph to be on the same device to facilitate optimizations such as operator fusion and replacement. A heterogeneous computational graph therefore needs to be partitioned into multiple non-heterogeneous ones. The partitioning can be quite flexible, and various partitioning rules can be defined. Generally, rules that produce as few subgraphs as possible are used, placing as many same-device operators as possible into one subgraph. As shown in :numref:`graph_exec_5`, five subgraphs are produced: Graph_1\_CPU, Graph_2\_GPU, Graph_3\_CPU, Graph_4\_Ascend, and Graph_5\_CPU.


:width:`800px`
:label:`graph_exec_5`

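Under the "fewest subgraphs" rule, a minimal partitioning pass simply groups maximal runs of same-device operators along the execution order. A sketch with illustrative names (real frameworks must also account for branching structure and cross-device edges):

```python
# Group a topologically ordered list of (op, device) pairs into maximal
# same-device runs, yielding one non-heterogeneous subgraph per run.
def partition_by_device(ops):
    runs = []
    for op, device in ops:
        if runs and runs[-1][0] == device:
            runs[-1][1].append(op)          # extend the current same-device run
        else:
            runs.append((device, [op]))     # start a new subgraph
    return [(f"Graph_{i + 1}_{dev}", members)
            for i, (dev, members) in enumerate(runs)]

ops = [("Kernel_1", "CPU"), ("Kernel_2", "CPU"),
       ("Kernel_3", "GPU"), ("Kernel_4", "GPU"),
       ("Kernel_5", "CPU"), ("Kernel_6", "CPU"),   # Kernel_6: Python op on CPU
       ("Kernel_7", "Ascend"), ("Kernel_8", "Ascend"),
       ("Kernel_9", "CPU")]
for name, members in partition_by_device(ops):
    print(name, members)
```

Run on the nine-kernel example, this reproduces the five subgraphs named above, from Graph_1_CPU through Graph_5_CPU.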
After partitioning a heterogeneous computational graph into multiple subgraphs, the execution methods are generally divided into subgraph partitioned execution and subgraph merged execution:

- **Subgraph Partitioned Execution**: The partitioned subgraphs are executed one after another, i.e., one subgraph finishes execution before the next one starts, as shown in :numref:`graph_exec_6`. The output data of the previous subgraph is transferred to the input of the next subgraph, which must copy the input data into its own device memory. For example, Graph_2\_GPU needs to copy the output data of Graph_1\_CPU from CPU to GPU, and conversely, Graph_3\_CPU needs to copy the output data of Graph_2\_GPU from GPU to CPU. Switching execution between subgraphs incurs a certain overhead.

- **Subgraph Merged Execution**: The partitioned subgraphs are merged into a single overall DAG for execution, as shown in :numref:`graph_exec_7`. Copy operators are inserted according to operator device attributes to enable data transfer between operators on different devices, and these copy operators are incorporated into the whole graph, forming one large unified graph for execution and reducing the overhead of switching between subgraphs.


:width:`800px`
:label:`graph_exec_6`


:width:`800px`
:label:`graph_exec_7`

Since subgraph merged execution reduces the overhead of switching between subgraphs, it generally achieves higher performance. A summary comparison is shown in :numref:`partitioning_vs_merging`.

:Comparison of Subgraph Partitioning and Subgraph Merging

| Execution Method | Subgraph Partitioning | Subgraph Merging |
| --- | --- | --- |
| Heterogeneous data transfer | Copy between subgraphs | Copy between operators |
| Additional execution overhead | Subgraph switching overhead | None |
| Execution concurrency granularity | Subgraph-level concurrency | Operator-level concurrency |

:label:`partitioning_vs_merging`

3. Execution Acceleration of Heterogeneous Computational Graphs

The previous sections described two execution methods for non-heterogeneous computational graphs and two for heterogeneous ones, with the latter built upon the former, so heterogeneous computational graphs have four possible execution methods through pairwise combination. Taking MindSpore as an example, it adopts subgraph merged parallel execution, as illustrated in :numref:`graph_exec_5`. Executing the graph as a single whole avoids the overhead of subgraph switching, and parallel execution within the whole graph maximizes the advantage of concurrency, achieving optimal execution performance.


:width:`800px`
:label:`graph_exec_8`

### Sink Execution
|
||||
|
||||
Sink execution leverages the SoC architecture of specialized chips to schedule the entire or partial computational graph onto the chip at once to complete the computation of the full data volume. For example, with Ascend chips, a computational graph composed of multiple Ascend operators can be compiled into a Task before execution. Through the interface provided by the Ascend driver, the Task containing multiple operators is dispatched to the hardware at once for scheduling and execution. Therefore, in the above example, the Ascend operators Kernel_7 and Kernel_8 can be optimized into a subgraph Graph_4\_Ascend, which is then compiled into a Task and sunk to the Ascend for execution, as shown in :numref:`graph_exec_8`.
|
||||
|
||||
Sink execution achieves better overall computational performance by avoiding interactions between the host side and the device side during computation. However, sink execution also has some limitations. For example, it faces significant technical challenges in scenarios involving dynamic shape operators and complex control flow.

115 v1/en_chapters/chapter_backend_and_runtime/graph_optimizer.md Normal file
@@ -0,0 +1,115 @@

# Graph Optimization

Graph optimization techniques at the backend primarily focus on hardware-oriented approaches. These techniques can be categorized as hardware-agnostic, such as memory I/O optimization, or specific to particular hardware, such as subgraph transformation to accommodate hardware instruction restrictions.

## Hardware-Agnostic Optimizations

Hardware-agnostic optimizations involve subgraph transformation, which replaces a subgraph in a computational graph with a hardware-friendly equivalent.

One example of such optimization is memory I/O optimization. In deep learning models, operators can be categorized as either compute-intensive (e.g., Conv and FC) or memory-intensive (e.g., ReLU and element-wise Sum). Memory-intensive operators are mainly used for element-wise operations. Often, both types of operators are used together in a typical deep learning model, such as the combination of "Conv + ReLU". By fusing ReLU and Conv into a composite operator, we can reduce memory access latency and bandwidth pressure, and improve execution efficiency.

Figure :numref:`ch07/ch07-compiler-backend-03` illustrates an example of fusing "Conv + Conv + Sum + ReLU". This fusion eliminates two read operations and two write operations by optimizing the reads and writes of the outputs generated by Conv and Sum.


:label:`ch07/ch07-compiler-backend-03`
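The memory-traffic saving that fusion buys can be sketched in plain Python; here a naive 1-D convolution stands in for Conv, and the point is only that the unfused version materializes an intermediate buffer that the fused version never does:

```python
def conv1d(x, w):
    # Naive "valid" 1-D convolution, a stand-in for a real Conv operator.
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def conv_then_relu(x, w):
    # Unfused: the Conv output list is materialized, then a second
    # pass builds another list for ReLU -- two buffers, two passes.
    y = conv1d(x, w)
    return [max(v, 0.0) for v in y]

def fused_conv_relu(x, w):
    # Fused: ReLU is applied to each element as it is produced, so
    # the intermediate buffer is never materialized.
    k = len(w)
    return [max(sum(x[i + j] * w[j] for j in range(k)), 0.0)
            for i in range(len(x) - k + 1)]
```

Both variants return identical results; only the number of buffers and memory passes differs, which is exactly the saving the figure depicts.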

Furthermore, automatic operator generation technology enables more flexible general optimizations in addition to fusion-based optimizations for specific operator types. An example of this technology is graph kernel fusion (available on AI frameworks such as TensorFlow and MindSpore). It aims to reduce inefficient memory movements and enable intensive computing through three steps: operator expansion, aggregation, and reconstruction.

Figure :numref:`ch07/ch07-compiler-backend-graph-kernel` provides an overview of graph kernel fusion, which involves the following steps:

1. Expansion: Composite operators (Op1, Op3, and Op4) in the computational graph are expanded into combinations of basic operators, as represented by the graph nodes with dashed lines.

2. Aggregation: The basic operator (Op2) and the expanded operators are aggregated into larger operator combinations.

3. Reconstruction: The basic operators are classified based on their input-to-output affinity, such as elemwise, broadcast, reduce, and transform. This classification allows the derivation of general compute rules (e.g., elemwise + reduce) to facilitate efficient execution. The operator combination is then analyzed and filtered iteratively, leading to the creation of new operators (New Op1 and New Op2) through reconstruction. These new operators are designed to be hardware-friendly.

Graph kernel fusion enables joint optimization beyond operator boundaries by expanding and aggregating the computational graph. It generates new hardware-friendly operators through reconstruction based on general compute rules, thereby facilitating efficient execution. However, it should be noted that this approach involves additional memory movements.


:label:`ch07/ch07-compiler-backend-graph-kernel`

## Hardware-Specific Optimizations

Hardware-specific optimizations are tailored to address the restrictions imposed by specific hardware instructions and memory formats associated with particular hardware devices.

### Hardware Instruction Restrictions

Hardware instruction restrictions arise when certain IR nodes lack direct operator counterparts on a specific hardware device. In such cases, subgraph transformation can be employed to overcome these restrictions. Consider an example: the Concat operator on the accelerator supports a maximum of 63 inputs. If a Concat node in the frontend IR exceeds this limit, we can partition the node into multiple smaller Concat nodes. Figure :numref:`ch07/ch07-compiler-backend-04` illustrates how we can split a 100-input Concat node into two smaller nodes, one with 63 inputs and the other with 37 inputs, to meet the 63-input limit of the accelerator.


:label:`ch07/ch07-compiler-backend-04`
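The partitioning rule can be sketched as a small helper; the 63-input cap matches the accelerator restriction described above, while the node representation (a plain list of inputs) is purely illustrative:

```python
def split_concat_inputs(inputs, max_inputs=63):
    """Partition a Concat node's input list into chunks, each of which
    satisfies the accelerator's maximum-input restriction."""
    return [inputs[i:i + max_inputs]
            for i in range(0, len(inputs), max_inputs)]
```

Applied to a 100-input Concat, this yields two smaller nodes of 63 and 37 inputs, matching the split shown in the figure.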

### Memory Format Restrictions

Different platforms define varying formats for different operators to achieve optimal performance. When these formats are inconsistent with those of a particular framework, a common approach is to insert format transformation operations to reformat the operator output. However, this introduces additional memory movements.

Figure :numref:`ch07/ch07-compiler-backend-05` provides an example of this scenario. Consider that the default format in an AI framework is NCHW, but the hardware accelerator is optimized for performing convolution with inputs and outputs in NC1HWC0 format. To bridge this gap, the output of the first Conv operator is formatted to NCHW using a TransData operator. It is then reformatted to NC1HWC0 using another TransData operator before being passed to the next Conv operator. The two TransData operations (depicted as dashed lines in the figure) are inverse operations of each other. By employing pattern matching on the computational graph, such operation pairs can be easily eliminated.


:label:`ch07/ch07-compiler-backend-05`
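The pattern-matching pass can be sketched as a single scan over the operator sequence; the tuple encoding (name, source format, destination format) is an assumption made for illustration, not a framework API:

```python
def eliminate_inverse_transdata(ops):
    """Remove adjacent TransData pairs whose formats cancel out,
    e.g. (NC1HWC0 -> NCHW) followed by (NCHW -> NC1HWC0)."""
    result = []
    for op in ops:
        if (result
                and result[-1][0] == "TransData" and op[0] == "TransData"
                and result[-1][1] == op[2] and result[-1][2] == op[1]):
            result.pop()          # the pair cancels; drop both
        else:
            result.append(op)
    return result
```

Running this over the Conv → TransData → TransData → Conv chain from the figure removes both TransData nodes and leaves the two Conv operators adjacent.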

38 v1/en_chapters/chapter_backend_and_runtime/index.md Normal file
@@ -0,0 +1,38 @@

# AI Compiler Backend

In this chapter, we will explore the design of the AI compiler backend. The objective of an AI compiler backend is to enhance the efficiency of AI program execution by optimizing the intermediate representation (IR) generated by the compiler frontend, enabling full utilization of hardware capabilities. To achieve this goal, the backend applies hardware-aware optimizations to the IR. Furthermore, it selects suitable operators based on the capabilities of the target hardware to execute computations efficiently, while also allocating memory to optimize data reuse and locality. Additionally, the backend often incorporates an operator compiler, which optimizes the execution strategy for code statements associated with operators.

This chapter aims to achieve the following learning objectives:

- Understand the role and architecture of an AI compiler backend.

- Understand typical methods for optimizing computational graphs.

- Understand typical methods for selecting operators.

- Understand typical methods for memory allocation.

- Understand the architecture and functionalities of operator compilers.

```toc
:maxdepth: 2

Overview
Graph_Optimization
Operator_Selection
Memory_Allocation
Operator_Compiler
Chapter_Summary
Further_Reading
```

214 v1/en_chapters/chapter_backend_and_runtime/kernel_selecter.md Normal file
@@ -0,0 +1,214 @@

# Operator Selection

Following graph optimization, the compiler backend generates a sequence of operators that can be executed on hardware. This is achieved by selecting the most suitable operator from a set of candidates for each node in the IR. Since these candidate operators have diverse specifications, their execution efficiency varies depending on the scenario. The primary objective of operator selection is therefore to choose the operators most appropriate for the target device, based on the information provided by the IR.

## Basic Concepts of Operator Selection

We can think of the nodes in a backend-optimized IR as units of execution visible to the user, each representing a hardware-agnostic operation in the user code. In essence, operator selection involves selecting appropriate hardware information, which is referred to as operator information. Such information defines the following:

1. The format of an operator, which is a determinant of the operator's performance on the target platform. Machine learning systems commonly use the NCHW and NHWC formats.

2. The data type (such as float32, float16, or int32) of an operator on the target platform. The operators selected are those with data types close to (or the same as) the user definitions.

### Data Formats

In machine learning systems, many operations (e.g., convolution) are converted into matrix multiplication for faster computation. Matrix multiplication of the form $\mathbf{A}\times \mathbf{B} = \mathbf{C}$ is essentially a row-by-column multiplication. Specifically, the entry *ij* of **C** is obtained by multiplying the entries in the *i*th row of **A** by the corresponding entries in the *j*th column of **B** and then adding the results together. Consider the example shown in Figure :numref:`ch07/ch07-compiler-backend-06`. Matrix data is stored in row-major order by default, as shown at the top of the figure. However, matrix **B** is read in column-major order during the matrix multiplication, as shown at the bottom.


:label:`ch07/ch07-compiler-backend-06`

Storing matrix **B** in the order in which it is read increases computation efficiency, because access to contiguous blocks of memory is faster. Data formats therefore play an important role in performance improvement.

There are two major formats in machine learning systems: NCHW and NHWC. For an image input, N denotes the batch size, C denotes the number of channels, and H and W denote the height and width, respectively. Figure :numref:`ch07/ch07-compiler-backend-07` depicts the logical diagram of an input with batch size 2, 16 channels, height 5, and width 4.


:label:`ch07/ch07-compiler-backend-07`

A multidimensional matrix is flattened into a 1D layout before it is written to memory. This involves indexing, which maps logical data to physical memory.

Access to machine learning data is performed in an axis-wise order from the last axis forward. For instance, data in NCHW format is read in the axis order of W, H, C, and N. Equation :eqref:`ch05/equation-01` denotes the mapping between logical memory and physical memory for this format of data:

$$
\text{offset}_{\text{NCHW}}(n,c,h,w) = n \times C \times H \times W + c \times H \times W + h \times W + w
$$
:eqlabel:`equation:ch05/equation-01`

As shown in Figure :numref:`ch07/ch07-compiler-backend-08`, matrix elements are flattened from the lowest dimension (i.e., the W axis) forward, and neighboring elements of an axis reside next to each other in memory. To reach the same element at the same location in the next image, the whole image size ($C \times H \times W$) has to be jumped over. Assume we have a batch of eight RGB images of size $32 \times 32$, that is, a matrix with $N=8, C=3, H=32, W=32$. Memory storage of these images begins from the first channel of the first image by flattening the matrix along axis W and then arranging matrix elements along axis H. This is performed before the next channel is processed. The same procedure is repeated until the last channel of the last image is processed. NCHW is the default format in PyTorch and MindSpore.


:label:`ch07/ch07-compiler-backend-08`

Access to data in NHWC format also begins at the lowest dimension (i.e., the C axis) and proceeds forward. NHWC is the default format in TensorFlow (PyTorch refers to it as the channels-last format). Equation :eqref:`ch05/equation-02` denotes the mapping from logical memory to physical memory for this format of data:

$$
\text{offset}_{\text{NHWC}}(n,h,w,c) = n \times H \times W \times C + h \times W \times C + w \times C + c
$$
:eqlabel:`equation:ch05/equation-02`
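Both mappings can be written directly from the two equations. The helpers below are a sketch, verified against a hand-built flattened layout rather than any framework API:

```python
def offset_nchw(n, c, h, w, C, H, W):
    # Equation ch05/equation-01: W is innermost, then H, C, N.
    return n * C * H * W + c * H * W + h * W + w

def offset_nhwc(n, h, w, c, C, H, W):
    # Equation ch05/equation-02: C is innermost, then W, H, N.
    return n * H * W * C + h * W * C + w * C + c
```

For the same logical element, the two functions return different physical offsets because the axes are walked in a different order; only the innermost (fastest-varying) axis changes.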

Figure :numref:`ch07/ch07-compiler-backend-nchwandnhwc` compares the logical indexing of the NCHW and NHWC formats. The [x:1] marks refer to the jumps from the innermost axis to the next. For example, [a:1] indicates the jump from axis W to axis H, and [b:1] indicates the jump from axis C (the innermost) to axis W.


:label:`ch07/ch07-compiler-backend-nchwandnhwc`

These two formats offer a high degree of flexibility and are therefore used in many frameworks. However, to accelerate computing on hardware, further optimization is needed. In a machine learning system, if the size of the user input exceeds what the compute component can pass through the network at a time (which is often the case), the input is batched before computation. For further optimization, many frameworks introduce blocked formats, which are more hardware-friendly, such as the nChw16c and nChw8c formats of the oneAPI Deep Neural Network Library (oneDNN) and the NC1HWC0 format on the Ascend platform. By leveraging hardware acceleration instructions to move and compute data, matrices can be quickly transformed into vectors, increasing the utilization of the on-chip cache.

### Data Types

Single precision (float32), occupying 32 bits in memory, is the most commonly used data type in machine learning systems. In applications where higher precision is not essential, the half-precision (float16) data type may be used, occupying 16 bits in memory. On hardware, float16 offers up to 7 times more arithmetic throughput with a smaller memory footprint compared with single precision; this allows for larger batch sizes and consequently reduced training time. Next, we look at the differences between half-precision and single-precision floating-point numbers.

In Figure :numref:`ch07/ch07-float32andfloat16`, *Sig* refers to the sign bit that indicates the sign of a number, *Exponent* refers to the exponent bits, and *Mantissa* refers to the mantissa bits.


:label:`ch07/ch07-float32andfloat16`

Applying Equation :eqref:`ch05/equation-03` converts a normalized float16 number from its binary representation to decimal format:

$$
(-1)^{\text{Sig}}\times 2^{\text{Exponent}-15}\times \left(\frac{\text{Mantissa}}{1024}+1\right)
$$
:eqlabel:`equation:ch05/equation-03`

If the exponent bits and mantissa bits are all 0s, the number is 0. If the exponent bits are all 0s but the mantissa bits are not, the number is subnormal (very small). If the exponent bits are all 1s and the mantissa bits are all 0s, the number is an infinity, either positive or negative depending on the sign bit. Not a Number (NaN) is denoted by the exponent bits being all 1s while the mantissa bits are not all 0s. bfloat16 is a special data type developed by Google for machine learning on its tensor processing units (TPUs). Although bfloat16 is not an IEEE-standard 16-bit floating-point data type, it has the same exponent size as float32, meaning that it can be easily converted to and from float32.

### Operator Information Library

Hardware devices support different operators based on their data format and data type requirements. Each device maintains an operator information library that contains a comprehensive list of the operators supported by that device. During operator selection, the most suitable operators are chosen from this library. The library serves as a reference for determining which operators are compatible with, and can be efficiently executed on, a particular hardware device.

## Process of Operator Selection

Operator selection involves choosing the most appropriate operator for each operation node in an IR. Operator information covers the supported device type, data type, and data format. After the compiler frontend completes type inference and static analysis, the data type of the user code is derived from the IR.

Figure :numref:`ch07/ch07-compiler-backend-select` shows the operator selection process. First, the target hardware is selected (or this step can be skipped to keep the default hardware selection defined in the compiler backend). The implementation, supported data types, and execution efficiency of a given operator vary depending on the target hardware. Then, the compiler backend selects an operator based on the data type and data format derived from the IR.


:label:`ch07/ch07-compiler-backend-select`
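A toy version of this lookup, with a hypothetical operator information library (the records and their contents are invented for illustration, not taken from any real framework):

```python
# Hypothetical operator-information records: (device, data type, format).
OP_INFO_LIB = {
    "Conv2D": [("Ascend", "float16", "NC1HWC0"),
               ("GPU", "float32", "NCHW"),
               ("GPU", "float16", "NCHW")],
}

def select_operator(op_name, device, dtype, fmt):
    """Prefer an exact (dtype, format) match on the target device;
    otherwise fall back to the first candidate for that device,
    signalling that a precision or format adjustment is needed."""
    candidates = [c for c in OP_INFO_LIB[op_name] if c[0] == device]
    for cand in candidates:
        if cand[1] == dtype and cand[2] == fmt:
            return cand
    return candidates[0] if candidates else None
```

With these records, a float32 NCHW request on Ascend falls back to the float16 NC1HWC0 implementation, mirroring the Conv2D precision reduction on Ascend described in the text.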

The result of operator selection might not be as expected due to software or hardware specifications. Sometimes, we may need to adjust the precision of a particular node to find an operator with the right data type. For example, the Conv2D operator supported by Ascend (i.e., the backend of MindSpore) allows only the float16 data type. When used in a float32 network on Ascend, the Conv2D operator is executable only if its input precision is reduced from float32 to float16.

Converting operators from one format to another can be time-consuming and incur memory movement overheads. To avoid this, data should be transferred between operators of the same format whenever possible. In addition, data type inconsistency may lead to reduced precision, potentially slowing down or even preventing network convergence. As such, thorough operator analysis is needed to ensure that the right data type is selected.

Simply put, an operator selection algorithm is considered optimal if it keeps the data type as consistent as possible with user settings while also minimizing data format conversions.

261 v1/en_chapters/chapter_backend_and_runtime/memory_allocator.md Normal file
@@ -0,0 +1,261 @@

# Memory Allocation

Memory allocation is a crucial aspect of the conventional computer memory hierarchy, acting as a link between cache and disk storage: it provides more storage capacity than the cache and faster access than disk storage. With the progress of deep learning, accommodating large deep neural networks within the memory of hardware accelerators or AI processors has become increasingly challenging. To overcome this obstacle, various solutions have been developed, including memory reuse, contiguous memory allocation, and in-place memory allocation. Proper implementation of contiguous and in-place memory allocation can enhance the execution efficiency of operators and further optimize performance.

## Device Memory

In a deep learning architecture, the memory closest to the hardware accelerator (such as the GPU or AI processor) is usually referred to as device memory, and that closest to the CPU is referred to as host memory. As shown in Figure :numref:`ch07/ch07-compiler-backend-memory-01`, the CPU can directly access the host memory but not the device memory; similarly, the AI processor can directly access the device memory but not the host memory. In a typical network training process, data is loaded from disk storage to the host memory, where it is processed. The data is then copied from the host memory to the device memory so that the device can access it directly. When the computation is finished, the user can obtain the training result once the result data is copied from the device memory back to the host memory.


:label:`ch07/ch07-compiler-backend-memory-01`

## Process of Memory Allocation

The memory allocation module allocates device memory to the input and output of each operator in a graph. The compiler frontend interprets the user script into an IR, based on which the compiler backend performs operator selection and optimization to determine information such as the shape, data type, and format of each input/output tensor of each operator. With this information, the size of each tensor can be calculated using Equation :eqref:`ch05/equation-04`:

$$
\text{size}=\prod_{i=0}^{\text{dimension}}\text{shape}_i \times \text{sizeof}\left ( \text{datatype} \right )
$$
:eqlabel:`equation:ch05/equation-04`

Unaligned memory access can be time-consuming, because the transfer of data to and from memory is most efficient in chunks of 4, 8, or 16 bytes. When the size of the data to be transferred is not a multiple of one of these sizes, one or more empty bytes are padded to align the data in memory.
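Combining Equation ch05/equation-04 with the padding rule gives a small sizing helper; the 16-byte alignment value is illustrative, since the actual boundary depends on the device:

```python
from math import prod

def tensor_size(shape, dtype_bytes, align=16):
    """Bytes needed for a tensor: product of the shape dimensions times
    the element size, rounded up to an `align`-byte boundary."""
    raw = prod(shape) * dtype_bytes
    return (raw + align - 1) // align * align
```

A float32 tensor of shape (2, 3, 4, 5) needs 480 bytes, already a multiple of 16; a 3x3 float16 tensor needs 18 raw bytes, padded up to 32.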

Figure :numref:`ch07/ch07-compiler-backend-memory-02` illustrates an example of memory allocation.


:label:`ch07/ch07-compiler-backend-memory-02`

In this example, memory addresses are assigned to the input tensor, Conv2D's weight, and Conv2D's output. Subsequently, a memory address is allocated to the input of BatchNorm. Since the input of BatchNorm is the same as the output of Conv2D, which already has an allocated memory address, the output address of Conv2D can be shared with the input of BatchNorm. This approach avoids redundant memory allocation and unnecessary memory copies. The entire training process in this example involves allocating memory for three types of data, based on their lifetimes: the initial input of the graph, the weights or attributes of operators, and the output tensor of the final operator.

Frequent allocations and deallocations of memory blocks of various sizes using functions like `malloc` can significantly degrade performance. To mitigate this issue, memory pools can be employed: a specific amount of memory is pre-allocated, and memory blocks are dynamically allocated from the pool as needed and returned to it for reuse.

Memory pools are widely used in AI frameworks to manage frequent allocations of device memory and to ensure a consistent memory lifetime for tensors. Different AI frameworks adopt similar memory pool designs. Figure :numref:`ch07/ch07-compiler-backend-memory-03` presents an example of memory allocation in an AI framework. In this case, each tensor's memory is allocated from a pre-allocated device memory space using double pointers that offset from the start and end addresses. Weight tensors of operators are allocated memory by offsetting from the start address (with a lifetime lasting throughout the training process). The output tensor of each operator is allocated memory by offsetting from the end address (with a shorter lifetime that ends when the tensor is no longer needed in the computation). This approach allows operator memory to be allocated using offset pointers into pre-allocated device memory, which is significantly faster than direct memory allocations from the device.


:label:`ch07/ch07-compiler-backend-memory-03`
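The double-pointer scheme can be sketched as a tiny arena allocator: long-lived weight tensors grow from the start address, short-lived activations grow from the end address. Sizes and the class itself are illustrative, not any framework's actual allocator:

```python
class DoubleEndedPool:
    """Sketch of the double-pointer pool described above."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0               # next free offset for weight tensors
        self.tail = capacity        # boundary for activation tensors

    def alloc_weight(self, size):
        # Weights offset forward from the start address.
        if self.head + size > self.tail:
            raise MemoryError("pool exhausted")
        addr = self.head
        self.head += size
        return addr

    def alloc_activation(self, size):
        # Activations offset backward from the end address.
        if self.tail - size < self.head:
            raise MemoryError("pool exhausted")
        self.tail -= size
        return self.tail
```

Because the two regions grow toward each other, a single pre-allocated device buffer serves both lifetime classes without fragmentation between them.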

## Memory Reuse

In a machine learning system, memory reuse is achieved by analyzing the lifetime of each tensor and, once a tensor reaches the end of its lifetime, releasing its device memory back to the memory pool for future reuse by other tensors. The objective of memory reuse is to enhance memory utilization and enable larger models to fit within the constraints of limited device memory. By reusing memory instead of continuously allocating new memory for tensors, the system can optimize memory utilization and mitigate the memory limitations inherent in deep learning computations.

Figure :numref:`ch07/ch07-compiler-backend-memory-02` provides an example, where output 1 becomes unused once the computation of the BatchNorm operator is complete. In this case, the device memory of output 1 can be reclaimed and reused for output 3 (provided that output 3 does not require more memory than output 1).

Figure :numref:`ch07/ch07-compiler-backend-memory-04` depicts memory lifetime using coordinate charts. The horizontal axes represent tensor lifetime, and the vertical axes represent memory size. During its lifetime, a tensor occupies a specific amount of device memory. The objective of memory allocation is to find an optimal solution that accommodates the maximum number of non-conflicting rectangular blocks (each denoting a tensor's lifetime and memory size) in the same memory. In Figure :numref:`ch07/ch07-compiler-backend-memory-04`, the memory can accommodate only four rectangular blocks (i.e., tensors T0, T1, T2, and T3) when no memory reuse policy is applied, as shown in the left chart.


:label:`ch07/ch07-compiler-backend-memory-04`

Determining an optimal memory reuse policy is an NP-complete problem. AI frameworks often employ greedy algorithms, such as best-fit, which allocate memory by searching for the smallest available block in the memory pool one request at a time. However, this approach yields only a locally optimal solution rather than a globally optimal one. To approximate a globally optimal solution, a method called Safe Optimized Memory Allocation Solver (SOMAS) can be considered.

SOMAS analyzes the computational graph by conducting aggregative analysis on parallel streams and data dependencies. This analysis reveals the ancestor-descendant relationships between operators. By generating a global set of mutually exclusive constraints on the lifetime of each tensor, SOMAS combines multiple heuristic algorithms to achieve an optimal solution for static memory planning. Through SOMAS, an optimized memory reuse outcome is obtained, resulting in more reusable memory.

As shown in the right chart of Figure :numref:`ch07/ch07-compiler-backend-memory-04`, with the SOMAS algorithm, the number of tensors accommodated in the same memory increases to seven.
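The greedy best-fit baseline can be sketched over lifetime intervals; the (start, end, size) tensor representation and the single-pass policy are simplifications for illustration:

```python
def plan_memory(tensors):
    """Greedy best-fit sketch: each tensor is a (start, end, size)
    lifetime interval; a block freed by an expired tensor is reused
    by picking the smallest adequate free block."""
    order = sorted(range(len(tensors)), key=lambda i: tensors[i][0])
    free = []        # (size, offset) blocks currently available
    live = []        # (end, size, offset) blocks currently in use
    offsets, peak = {}, 0
    for i in order:
        start, end, size = tensors[i]
        # Retire tensors whose lifetime ended before this one starts.
        for t in [t for t in live if t[0] <= start]:
            live.remove(t)
            free.append((t[1], t[2]))
        # Best fit: the smallest free block that is large enough.
        fits = sorted(b for b in free if b[0] >= size)
        if fits:
            free.remove(fits[0])
            offset = fits[0][1]
        else:
            offset = peak        # no fit: grow the arena instead
            peak += size
        offsets[i] = offset
        live.append((end, size, offset))
    return offsets, peak
```

For three size-4 tensors with overlapping lifetimes (0, 2), (1, 3), (2, 4), no-reuse allocation would need 12 units, while the greedy plan fits them in 8 by giving the third tensor the first tensor's block. SOMAS-style solvers go further by deriving the non-overlap constraints globally before assigning offsets.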

## Optimization Techniques for Memory Allocation

In the following, we describe typical optimization techniques for memory allocation.

### Memory Fusion

Commonly used memory allocation methods operate at the tensor level, often resulting in discontinuous device addresses across tensors. However, certain specialized operators, such as the AllReduce communication operator, require contiguous memory. Executing a communication operator involves both data transfer and computation, and waiting for communication is a significant performance bottleneck in large-scale distributed systems. To minimize communication time, we can fuse multiple communication operators into a composite operator, which allows contiguous memory to be allocated for the operator input, as depicted in Figure :numref:`ch07/ch07-compiler-backend-memory-06`.

Additionally, communication time can be reduced during the weight initialization task in distributed neural network training. This task involves broadcasting the initialized weights from one process to all processes. If a network contains multiple weights (which is often the case), these broadcasts are repeated. To minimize communication time in this scenario, a typical approach is to allocate contiguous memory addresses to all weights on the network and then perform a single broadcast operation.


:label:`ch07/ch07-compiler-backend-memory-06`

### In-place Operators

In the memory allocation process depicted in
Figure :numref:`ch07/ch07-compiler-backend-memory-02`, the input and
output of each operator are assigned different memory addresses.
However, for certain operators, this approach leads to memory waste and
performance degradation. Examples include optimizer operators used to
update neural network weights, Python's `+=` and `*=` operators that
modify variable values, and the `a[0]=b` operation that updates the
value of `a[0]` with `b`. These operators share a common purpose:
updating the input value. The concept of in-place computation can be
illustrated using the `a[0]=b` operation.

In the original implementation shown on the left of Figure
:numref:`ch07/ch07-compiler-backend-memory-08`, the operation
involves three steps: copying tensor `a` to tensor `a'`, assigning
tensor `b` to tensor `a'`, and then copying tensor `a'` back to tensor
`a`. By performing the operation in-place, as depicted on the right of
Figure :numref:`ch07/ch07-compiler-backend-memory-08`, this process is
simplified to a single step: copying tensor `b` to the position
corresponding to tensor `a`. This reduces data copy time by eliminating
two copies and removes the need to allocate memory for tensor `a'`.


:label:`ch07/ch07-compiler-backend-memory-08`
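A minimal NumPy sketch of the same idea: the out-of-place version pays for an extra buffer and two copies, while the in-place version writes directly into `a`'s existing memory.

```python
import numpy as np

a = np.arange(4, dtype=np.float32)   # [0., 1., 2., 3.]
b = np.float32(10.0)

# Out-of-place: copy a to a', assign b into a', copy a' back.
a_prime = a.copy()                   # extra allocation + first copy
a_prime[0] = b
a_out = a_prime.copy()               # second copy

# In-place: a single write into a's existing buffer.
a[0] = b

print(a[0], a_out[0])  # 10.0 10.0
```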

## Data Compression

Modern deep neural network (DNN) training relies heavily on GPUs to
effectively train intricate networks with hundreds of layers. A
prominent challenge faced by both researchers and industry professionals
is the constraint imposed by the available GPU main memory as networks
become deeper. This limitation restricts the size of networks that can
be trained. To address this issue, researchers have recognized the value
of employing DNN-layer-specific encoding schemes. Consequently, they
have directed their attention towards storing encoded representations of
the intermediate layer outputs (feature maps) that are required for the
backward pass. These encoded representations are stored during the
temporal gap between their uses and are decoded only when needed for the
backward pass. The full-fidelity feature maps are promptly discarded
after use, resulting in a noteworthy reduction in memory consumption.

## Memory Swap

Machine learning frameworks often require users to optimize their memory
utilization to guarantee that the DNN fits within the memory capacity of
the GPU. This constraint restricts researchers from thoroughly
investigating diverse machine learning algorithms, compelling them to
make concessions either in terms of network architecture or by
distributing the computational load across multiple GPUs. One feasible
approach is to incorporate DRAM to facilitate memory swapping. By
transferring temporarily inactive data to DRAM, we can optimize GPU
utilization. In recent studies, researchers have implemented a cautious
approach to allocating GPU memory for the immediate computational needs
of a specific layer. This strategy effectively reduces both the maximum
and average memory usage, enabling researchers to train larger networks.
To elaborate further, the researchers promptly release feature maps from
GPU memory when there is no potential reuse. Alternatively, if there is
a possibility of future reuse but no immediate requirement, the feature
maps are offloaded to CPU memory and subsequently prefetched back to GPU
memory.
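The offload-and-prefetch policy can be sketched with two memory pools standing in for GPU and CPU memory. All names here are illustrative, not a real framework API.

```python
# Two dicts model device memory pools; list values model feature-map buffers.
gpu_pool, cpu_pool = {}, {}

def offload(name):
    # Move a feature map with no immediate reuse out of GPU memory.
    cpu_pool[name] = gpu_pool.pop(name)

def prefetch(name):
    # Bring a feature map back before the backward pass consumes it.
    gpu_pool[name] = cpu_pool.pop(name)

gpu_pool["fmap_layer3"] = [0.1, 0.2]  # produced during the forward pass
offload("fmap_layer3")                # frees GPU memory early
assert "fmap_layer3" not in gpu_pool
prefetch("fmap_layer3")               # scheduled ahead of its reuse
print(sorted(gpu_pool))  # ['fmap_layer3']
```

In a real system, the scheduling decision of *when* to call `prefetch` is what the cost model discussed below must get right, so that the transfer overlaps with computation.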

The fundamental concept behind memory swapping is straightforward and
intuitive. However, its implementation remains challenging and draws on
expertise from the compiler frontend. One key technique is maximizing
the overlap between computation time and data swapping time. A precise
cost model is essential for estimating the time required for data
movement and the time cost associated with each layer in the DNN.
Additionally, there are numerous strategies to explore in auto
scheduling and auto tuning. Fortunately, there is an abundance of
literature available that addresses these issues. For additional
information, please refer to the Further Readings section.

295 v1/en_chapters/chapter_backend_and_runtime/op_compiler.md Normal file
@@ -0,0 +1,295 @@

# Operator Compiler {#sec:operator-compiler}

Operator compilers are used for compiling and optimizing operators,
which may be part of a neural network or come from code implemented in a
domain-specific language (DSL). Compilation is the process of
*transforming* the source code from one *representation* into another.

The objective of an operator compiler is to improve the *execution
performance* of operators. An operator compiler accepts tensor
computation logic described in *dynamic languages* (e.g., Python) as
input and outputs executable files for *specific AI processors*.

## Scheduling Strategy

An operator compiler abstracts the execution of statements in an
operator implementation into "scheduling strategies". Since an operator
typically consists of multiple statements, the focus lies in determining
the scheduling strategy for the statements within the operator. This
strategy encompasses considerations such as the calculation order, data
block movement, and other relevant factors.

If we ignore the specific processor architecture, achieving the best
performance only requires loading all input tensors into the computation
core according to the *computational logic* of the operator and
retrieving the result from the core for storage. *Computational logic*
refers to basic arithmetic operations (e.g., addition, subtraction,
multiplication, and division) and other function expressions (e.g.,
convolution, transposition, and loss functions).

The modern computer memory hierarchy looks like a pyramid, as shown in
Figure :numref:`ch05/ch05-memory_architecture`. As we move up the
pyramid, the storage elements have a higher cost but a faster access
time.


:label:`ch05/ch05-memory_architecture`

Such hardware design leads to two basic types of locality:

(1) Temporal locality: the tendency to access the same memory location
several times in quick succession. As such, accessing the same location
in the L1 cache several times is more efficient than accessing different
locations in the L1 cache several times.

(2) Spatial locality: the tendency to access nearby memory locations in
quick succession. As such, accessing nearby locations in the L1 cache
several times is more efficient than moving back and forth between the
L1 cache and the main memory.

Both types of locality help improve system performance. Specifically, in
order to improve the data access speed, data to be repeatedly processed
can be placed in fixed nearby memory locations when possible.

For a serial computational task, it is also possible to decouple the
data part from the logic part and generate a range of independent groups
of data that can be executed in parallel, as shown in Figure
:numref:`ch05/ch05-parallel_computing`.


:label:`ch05/ch05-parallel_computing`

These specific data-oriented operations performed at program runtime are
referred to as *schedules*. A schedule defines the following aspects:

(1) When and where should each value in a function be calculated?

(2) Where is data stored?

(3) How long is each value stored between its computation by a producer
and its use by consumers, and when is each value instead recomputed
independently by its consumers?

Simply put, a scheduling strategy is defined by a set of algorithms
designed during compilation based on the characteristics of the target
hardware architecture to improve locality and parallelism. The purpose
of this is to ensure that the resulting executable file delivers optimal
performance at runtime. These algorithms have no effect on the
computation result; instead, they only adjust the computation process in
order to shorten the computation time.

## Combining Scheduling Strategies

In the realm of operator compilers, a common optimization technique
involves combining multiple abstracted scheduling strategies into a
comprehensive and efficient scheduling set through manual template
matching. However, this approach may not be fine-grained and can be
labor-intensive when applied to achieve refined optimization across
different operators. To illustrate this, let's consider an optimization
algorithm implemented in the Tensor Virtual Machine (TVM). It
accelerates and optimizes a multiply-accumulate code segment on the CPU
by combining several fundamental scheduling strategies.

In Code `lst:before_tvm`, the basic computational logic is as follows:
initialize tensor C, multiply tensor A by tensor B, and accumulate the
results into tensor C.

**lst:before_tvm**
```
for (m: int32, 0, 1024) {
  for (n: int32, 0, 1024) {
    C[((m*1024) + n)] = 0f32
    for (k: int32, 0, 1024) {
      let cse_var_2: int32 = (m*1024)
      let cse_var_1: int32 = (cse_var_2 + n)
      C[cse_var_1] = (C[cse_var_1] + (A[(cse_var_2 + k)]*B[((k*1024) + n)]))
    }
  }
}
```

Assuming that the data type is float and that tensors A, B, and C are of
size 1024 $\times$ 1024, the total memory required by the tensors is
1024 $\times$ 1024 $\times$ 3 $\times$ sizeof(float) = 12 MB. This far
exceeds the capacity of common caches (e.g., a 32 KB L1 cache).
Therefore, if we want to compute on tensors A, B, and C in a single
operation, we must store them in the main memory. However, the main
memory is distant from the compute core, resulting in significantly
lower access efficiency compared to using the cache for storage.
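The footprint arithmetic above can be checked directly; this quick sketch assumes `sizeof(float)` is 4 bytes.

```python
N = 1024
BYTES_PER_FLOAT = 4  # sizeof(float) on typical platforms

total_bytes = 3 * N * N * BYTES_PER_FLOAT   # tensors A, B, and C together
print(total_bytes // (1024 * 1024))         # 12 (MB)

# A 32 x 32 tile of each tensor, by contrast, fits easily in a 32 KB L1 cache:
tile_bytes = 3 * 32 * 32 * BYTES_PER_FLOAT
print(tile_bytes // 1024)                   # 12 (KB)
```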

There are several scheduling strategies that can help improve
performance: tile, reorder, and split. The size of the L1 cache is 32
KB. To ensure that the data used in every computation step is stored in
the cache, tiling with a factor of 32 is performed. In this way, only
the tiny block formed by `m.inner` $\times$ `n.inner` needs to be taken
into account, and the memory access of the innermost tiny block is
independent of the outer loops. A tiny block occupies only 32 $\times$
32 $\times$ 3 $\times$ sizeof(float), which is 12 KB, in the cache. The
optimized code is shown in Code `lst:after_tvm`. We tile loops m and n
with a factor of 32, as per the previous analysis. Similarly, we split
loop k with a factor of 4 and then reorder the k.outer and k.inner axes
toward the outermost position.

**lst:after_tvm**
```
// Obtain an outer loop by tiling for (m: int32, 0, 1024) based on factor 32.
for (m.outer: int32, 0, 32) {
  // Obtain an outer loop by tiling for (n: int32, 0, 1024) based on factor 32.
  for (n.outer: int32, 0, 32) {
    // Obtain an inner loop by tiling for (m: int32, 0, 1024) based on factor 32.
    for (m.inner.init: int32, 0, 32) {
      // Obtain an inner loop by tiling for (n: int32, 0, 1024) based on factor 32.
      for (n.inner.init: int32, 0, 32) {
        // Initialize the corresponding C element.
        C[((((m.outer*32768) + (m.inner.init*1024)) + (n.outer*32)) + n.inner.init)] = 0f32
      }
    }
    // Obtain an outer loop by splitting for (k: int32, 0, 1024) based on factor 4, with reorder.
    for (k.outer: int32, 0, 256) {
      // Obtain an inner loop by splitting for (k: int32, 0, 1024) based on factor 4, with reorder.
      for (k.inner: int32, 0, 4) {
        // Obtain an inner loop by tiling for (m: int32, 0, 1024) based on factor 32.
        for (m.inner: int32, 0, 32) {
          // Obtain an inner loop by tiling for (n: int32, 0, 1024) based on factor 32.
          for (n.inner: int32, 0, 32) {
            // Outer axis factor obtained by tiling along axis n.
            let cse_var_3: int32 = (n.outer*32)
            // Outer and inner axis factors obtained by tiling along axis m.
            let cse_var_2: int32 = ((m.outer*32768) + (m.inner*1024))
            // Outer and inner axis factors obtained by tiling along axes m and n.
            let cse_var_1: int32 = ((cse_var_2 + cse_var_3) + n.inner)
            // Split the computational logic into layers so that the data used in every loop fits in the cache.
            C[cse_var_1] = (C[cse_var_1] + (A[((cse_var_2 + (k.outer*4)) + k.inner)] * B[((((k.outer*4096) + (k.inner*1024)) + cse_var_3) + n.inner)]))
          }
        }
      }
    }
  }
}
```
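The same schedule can be mimicked in plain Python to make the loop structure concrete. This is an illustrative sketch, with the matrix size shrunk from 1024 to 64 so it runs quickly; the tiling and split factors match the TVM example.

```python
N, TILE, KSPLIT = 64, 32, 4   # N shrunk from 1024 so the sketch runs quickly

A = [[1.0] * N for _ in range(N)]
B = [[1.0] * N for _ in range(N)]
C = [[0.0] * N for _ in range(N)]

for mo in range(N // TILE):              # m.outer
    for no in range(N // TILE):          # n.outer
        for ko in range(N // KSPLIT):    # k.outer (split, reordered outward)
            for ki in range(KSPLIT):     # k.inner
                k = ko * KSPLIT + ki
                for mi in range(TILE):   # m.inner: 32x32 block of C
                    m = mo * TILE + mi
                    for ni in range(TILE):   # n.inner
                        n = no * TILE + ni
                        C[m][n] += A[m][k] * B[k][n]

# Every C[m][n] accumulates N products of 1.0 * 1.0.
print(C[0][0])  # 64.0
```

Each innermost `(mi, ni)` block repeatedly touches only a small region of A, B, and C, which is exactly the locality the tiling is designed to create.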

## Finding Optimized Strategies with Polyhedral Models

Another optimization approach is to automatically select an operator
schedule from a schedule search space. A good example of this idea is
polyhedral compilation, which improves the generality of operator
compilation at the expense of prolonged compile time.

Polyhedral compilation mainly optimizes the loops in user code by
abstracting each loop into a multidimensional space, computing instances
into points in the space, and dependencies between the instances into
lines in the space. The main idea of this algorithm is to model the
memory access characteristics of the code and adjust the execution order
of each instance within each loop. In this way, it aims to achieve
better locality and parallelism of the loop code under the new schedule.

Code `lst:before_poly` is used as an example to describe the algorithm.

**lst:before_poly**
```
for (int i = 0; i < N; i++)
  for (int j = 1; j < N; j++)
    a[i+1][j] = a[i][j+1] - a[i][j] + a[i][j-1];
```

As shown in Figure :numref:`ch05/ch05-poly_test`, a memory access
structure is first modeled by using the polyhedral model algorithm, and
then dependencies (denoted by arrows) between instances (denoted by
nodes) are analyzed.


:label:`ch05/ch05-poly_test`

Complex dependency analysis and schedule transformation are then
performed to obtain an optimal solution that fits the memory model.
Using the polyhedral model algorithm, the code is optimized to that
shown in Code `lst:after_poly`.

**lst:after_poly**
```
for (int i_new = 0; i_new < N; i_new++)
  for (int j_new = i_new+1; j_new < i_new+N; j_new++)
    a[i_new+1][j_new-i_new] = a[i_new][j_new-i_new+1] - a[i_new][j_new-i_new] + a[i_new][j_new-i_new-1];
```
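The transformation is the change of variables `j_new = j + i_new` (a loop skewing), which the following sketch verifies numerically for a small `N`:

```python
# A small numeric check (sketch) that the skewed loop nest computes the
# same values as the original loop nest.
N = 6

def init():
    # Arbitrary initial values; border rows/columns are never overwritten.
    return [[float(r * (N + 2) + c) for c in range(N + 2)]
            for r in range(N + 2)]

def run_original():
    a = init()
    for i in range(N):
        for j in range(1, N):
            a[i + 1][j] = a[i][j + 1] - a[i][j] + a[i][j - 1]
    return a

def run_skewed():
    a = init()
    for i_new in range(N):
        for j_new in range(i_new + 1, i_new + N):
            j = j_new - i_new   # undo the skew
            a[i_new + 1][j] = a[i_new][j + 1] - a[i_new][j] + a[i_new][j - 1]
    return a

print(run_original() == run_skewed())  # True
```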

The resulting code looks relatively complex. We can model the code (as
shown in Figure :numref:`ch05/ch05-poly`) to determine its performance
improvements. Through dependency analysis, we find that the loop
dependencies present in the source code are removed in the optimized
code, thereby increasing the opportunities for parallel computing.
Specifically, parallel computing is possible when the loop dependencies
are partitioned along the dashed lines based on the green blocks, as
shown in Figure :numref:`ch05/ch05-poly`.


:label:`ch05/ch05-poly`

We have only introduced the polyhedral compilation technique in this
section. However, other optimization techniques are available, such as
Ansor, a heuristic search method with pruning.

## Adaptation to Instruction Sets

We have previously explored the optimization techniques of operator
compilers. In this section, we build on this foundation to examine how
operator compilers adapt to instruction sets on different chips.
Typically, a general-purpose compiler is designed to be compatible with
as many backend architectures and instruction sets as possible. However,
this can present challenges when the compiler must handle backends with
different architectures and instruction sets.

Two common programming models adopted by AI processors are single
instruction, multiple data (SIMD) and single instruction, multiple
threads (SIMT). As shown in Figures :numref:`ch05/ch05-SIMD` and
:numref:`ch05/ch05-SIMT`, respectively, SIMD corresponds to chips with
vector instructions, while SIMT corresponds to chips that support
multiple threads. Recently, some chips have begun to combine both
programming models in order to support both multithreaded parallel
computing and vector instructions. When handling different programming
models, an operator compiler adopts different optimization strategies,
such as vectorization.


:label:`ch05/ch05-SIMD`


:label:`ch05/ch05-SIMT`

Operator compilers place a strong emphasis on differentiated support in
the frontend, midend, and backend. In the frontend, support for multiple
backend instruction sets is added, allowing AI programmers to focus on
algorithm logic without having to worry about chip differences. In the
midend, the architectures of different chips are identified, which
allows for specific optimization methods to be implemented for each
chip. When generating backend code, the instruction sets of different
chips are further identified to ensure efficient execution on target
chips.

## Expression Ability

The representation capability of an operator compiler is important
because it determines how well the frontend can express the input code
in an IR without loss of syntax information. The frontend of an operator
compiler is often fed with code programmed in flexible languages (e.g.,
PyTorch code written in Python). However, flexible expressions (e.g.,
indexing and view syntax in Python) pose high requirements on the
frontend expression ability of operator compilers. From the model
perspective, models often contain many control flow statements. Also,
some models allow for dynamic-shape operators whose shapes vary with
control flow decisions across iterations.

Additionally, a large number of operators may not have optimized
implementations provided directly by accelerator libraries (e.g.,
cuDNN). This phenomenon is referred to as long-tail operators. However,
long-tail operators can have highly flexible syntax or abundant control
flow statements and sometimes support dynamic shapes, making it
extremely difficult for the frontend of existing operator compilers to
express, optimize, or accelerate them. Consequently, such operators have
to be executed by the Python interpreter or slow virtual machines,
leading to a performance bottleneck in network execution. This is why it
is imperative to improve the expression ability of the operator compiler
frontend.

56 v1/en_chapters/chapter_backend_and_runtime/overview.md Normal file
@@ -0,0 +1,56 @@

# Overview

Figure :numref:`ch07/ch07-compiler-backend-01` illustrates the
architecture of the AI compiler backend, situated between the frontend
and the hardware driver layer.


:label:`ch07/ch07-compiler-backend-01`

Graph optimization is a crucial step that involves transforming the
Intermediate Representation (IR) into a format that aligns with the
hardware features, facilitating operator selection. Since the frontend's
IR is abstracted from low-level runtime details, additional effort is
required to map the IR to a set of operators, such as MatMul,
Convolution, and ReLU. Sometimes, a single operator is sufficient to
handle a subset of the IR's functions. In such cases, the operator
fusion technique can be employed to fuse a group of IR nodes together.
Similarly, if a direct backend counterpart for a complex IR node is
unavailable, it can be partitioned into smaller operators.

Once the graph optimization is complete, the compiler backend proceeds
with operator selection, which involves matching the optimized IR with
appropriate operators that can be executed on the target device with
optimal efficiency. This process is similar to pattern matching. While
the easiest approach would be to map each IR node to a separate hardware
operator, such an approach may not be hardware-friendly. Instead,
existing compilers generally provide multiple candidate operators for
each IR node. The following steps are typically involved in the operator
selection process:

1. The IR nodes received from the frontend are partitioned or fused to
   generate a low-level IR that is meaningful to the hardware.

2. The compiler backend carefully selects operator mappings for the IR
   nodes, aiming to create a complete sequence of operators.

3. The backend determines the format and data type of each input and
   output, ensuring fine-grained optimization on the IR.

4. Finally, the compiler backend traverses the resulting sequence of
   operators, allocates input and output memory for each operator, and
   loads the operators onto the target device for computation.
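The candidate-matching idea in step 2 can be sketched as a toy lookup; all kernel and format names here are hypothetical, invented for illustration.

```python
# Toy sketch of operator selection: each IR node has several candidate
# kernels, and the backend picks the first one whose supported data
# format matches the requested format.
candidates = {
    "MatMul": [("matmul_nchw", "NCHW"), ("matmul_nhwc", "NHWC")],
    "ReLU":   [("relu_generic", "ANY")],
}

def select(op, fmt):
    for kernel, kernel_fmt in candidates[op]:
        if kernel_fmt in (fmt, "ANY"):
            return kernel
    raise KeyError(f"no kernel for {op} with format {fmt}")

# Lower a tiny IR sequence for the NHWC data format.
plan = [select(op, "NHWC") for op in ("MatMul", "ReLU")]
print(plan)  # ['matmul_nhwc', 'relu_generic']
```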

By following this process, the compiler backend optimizes the IR by
selecting suitable operators, determining their input and output
requirements, and allocating memory accordingly. This enables efficient
execution of the AI program on the target device.

To further enhance the performance of a single operator, the compiler
backend often utilizes an operator compiler like TVM (Tensor Virtual
Machine) or XLA (Accelerated Linear Algebra). An operator compiler
analyzes the statements in an operator implementation, and it offers
various levels of optimization, including operator-level optimizations,
code generation, and runtime support. This stack is designed to enable
efficient execution of an operator on a wide range of hardware
platforms.

34 v1/en_chapters/chapter_backend_and_runtime/summary.md Normal file
@@ -0,0 +1,34 @@

# Chapter Summary

1. The compiler backend performs three primary tasks: graph
   optimization, operator selection, and memory allocation.

2. Graph optimization reduces resource overhead, adapts the graph to
   hardware capabilities, and enhances execution performance while
   maintaining the model's numerical properties.

3. Graph optimization techniques can be hardware-agnostic (e.g., memory
   I/O optimization) or hardware-specific (e.g., subgraph
   transformation to adapt to hardware instruction restrictions).

4. Operator selection involves mapping the compute nodes in an IR to
   suitable operators for hardware execution.

5. When selecting an optimized operator, factors such as data format
   and type must be considered, as they impact operator performance on
   the target hardware.

6. An IR is generated after graph optimization and operator selection.
   Based on the IR, memory is allocated for input and output tensors of
   each operator before launching them to hardware for execution.

7. Memory reuse is designed to improve memory utilization and
   accommodate larger models within limited device memory.

8. Fusion of communication operators enhances communication efficiency.
   Properly allocating memory for in-place operators reduces memory
   footprint and improves computing efficiency.

9. Operator compilers play a vital role in optimizing hardware
   performance. Critical optimization techniques include scheduling
   strategies and the polyhedral model algorithm.

@@ -0,0 +1,469 @@

# Computational Graph Basics

A computational graph contains operators (as units of operations) and
tensors (as units of data). The operator nodes in a graph are connected
with directed edges, which indicate the state of each tensor and
dependencies between operators.
Figure :numref:`ch04/ch04-simpleDAG` shows a computational graph example
of $\bf{Z}$=ReLU$(\bf{X}\times\bf{Y})$.


:label:`ch04/ch04-simpleDAG`

## Tensors and Operators

In mathematics, tensors are a generalization of scalars and vectors.
Machine learning defines multidimensional data as tensors. The rank of a
tensor refers to the number of axes (or dimensions) the tensor has. A
scalar is a rank-0 tensor containing a single value, without axes; a
vector is a rank-1 tensor with one axis; and a three-channel RGB color
image is a rank-3 tensor with three axes. See Figure
:numref:`ch04/ch04-tensor`.


:label:`ch04/ch04-tensor`

In a machine learning framework, a tensor stores not only data itself
but also attributes such as the data type, data shape, rank, and
gradient transfer status. Table
:numref:`ch04/ch4-tensor` describes the main attributes of a tensor.

:Tensor attributes

| Tensor Attribute | Description                                                                     |
|------------------|---------------------------------------------------------------------------------|
| shape            | Length of each dimension, for example, \[3,3,3\].                               |
| dim              | Number of axes (or dimensions). The value is 0 for a scalar and 1 for a vector. |
| dtype            | Data type, such as bool, uint8, int16, float32, and float64.                    |
| device           | Target device, such as a CPU or GPU.                                            |
| name             | Tensor name.                                                                    |
:label:`ch04/ch4-tensor`

In the following, we explore each tensor attribute with image data as an
example. Assume that our machine learning framework loads a 96-pixel by
96-pixel RGB (3-channel) image and converts the image data into a tensor
for storage. A *rank*-3 tensor of *shape* \[96,96,3\] is generated, with
the three dimensions representing the image height, image width, and
number of channels, respectively. The pixels in the RGB image are
represented by unsigned integers ranging from 0 to 255. Therefore, the
*dtype* of the resulting tensor is uint8. The image data is normalized
before it is fed into a CNN for training. Specifically, its data type is
reformatted to float32 so that it is compatible with the default data
type of common machine learning frameworks.
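The attributes in this example can be inspected directly; here is a brief NumPy sketch, with an all-zero image standing in for real pixel data:

```python
import numpy as np

image = np.zeros((96, 96, 3), dtype=np.uint8)   # 96x96 RGB image tensor
print(image.shape, image.ndim, image.dtype)     # (96, 96, 3) 3 uint8

# Normalization before training: reformat to float32 in [0, 1].
normalized = image.astype(np.float32) / 255.0
print(normalized.dtype)                         # float32
```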

Before training, the machine learning framework determines the compute
device (i.e., CPU, GPU, or other hardware) and stores the data and
weight parameters necessary for training in the memory of the
corresponding hardware --- as specified by the *device* attribute.
Typically, the device attribute of a tensor is automatically assigned by
the machine learning framework based on the hardware environment.
Tensors are either mutable or immutable. Mutable tensors store weight
parameters and are updated based on gradient information, for example,
convolution kernel tensors that participate in convolution operations.
Immutable tensors store initial user data or data input to models, for
example, the image data tensor mentioned above.

What does a tensor look like in machine learning settings? Most tensors,
like image data and convolution kernel tensors, are "rectangular" or
"cubic" in shape. That is, such a tensor has the same number of
elements along each of its axes. However, there are specialized tensors
that have different shapes: ragged and sparse tensors. As shown in
Figure :numref:`ch04/ch04-tensor1`, a tensor is ragged if it has
variable numbers of elements along some axes. Ragged tensors enable
efficient storage and processing of irregularly shaped data, such as
variable-length texts in natural language processing (NLP) applications.
Sparse tensors often handle graph data of graph neural networks (GNNs)
and are encoded using special formats such as the coordinate list (COO)
to improve storage efficiency.


:label:`ch04/ch04-tensor1`
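A minimal sketch of COO encoding: only the nonzero entries of a sparse matrix are kept, each as a (row, column, value) triple.

```python
dense = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 0],
]

# COO keeps one (row, col, value) triple per nonzero element.
coo = [(r, c, v)
       for r, row in enumerate(dense)
       for c, v in enumerate(row)
       if v != 0]

print(coo)  # [(0, 2, 3), (1, 0, 4)]
```

For a matrix that is mostly zeros, storing two triples instead of nine entries is exactly the storage saving the text describes.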

Operators are the basic compute units of neural networks. They process
tensor data and implement common computational logic in machine
learning, including data transformation, conditional control,
mathematical calculation, etc. Based on their functionalities, operators
are classified into tensor operators, neural network operators, data
flow operators, and control flow operators.

1. **Tensor operators** involve tensor structure and mathematical
   operations. Typical tensor structure operations include reshaping
   tensors, permuting tensor dimensions, concatenating tensors, etc.
   For example, we may need to change the dimension order (between
   "channels first" and "channels last") of image data tensors in
   CNN applications. Mathematical operations are tensor-based and
   include matrix multiplication, norm calculation, determinant
   calculation, eigenvalue calculation, etc. They are often seen in the
   gradient computation of machine learning models.

2. **Neural network operators**, the foundation of neural network
   models, are the most common operators, including feature extraction,
   activation functions, loss functions, optimization algorithms, etc.
   Feature extraction refers to extracting feature tensors from input
   data in CNN tasks. With the nonlinear ability introduced by
   activation functions, neural networks can model highly complex
   relationships and patterns in data. Optimization algorithms are used
   to update model parameters so that the loss function is minimized.

3. **Data flow operators** cover data preprocessing and loading. Data
   preprocessing mainly refers to data resizing, padding,
   normalization, and augmentation of mostly visual and textual data,
   whereas data loading involves operations such as shuffling,
   batching, and pre-fetching of the dataset. Data flow operators
   transform raw input data into a format meaningful to the machine
   learning framework and efficiently load the data to the network for
   training or inference according to the defined number of iterations,
   reducing memory usage and wait time.

4. **Control flow operators**, usually found in flexible and complex
   models, are used to control data flows in computational graphs.
   Typical control flow operators are conditional operators and loop
   operators. They are provided by either the machine learning
   framework or the frontend language. Control flow operations affect
   data flows in both forward and backward computation of neural
   networks.
|
||||
|
||||
## Computational Dependencies

In a computational graph, the dependencies between operators determine
the execution sequence and the available parallelism. The
computational graphs used in machine learning are directed acyclic
graphs (DAGs): their data flows must not contain circular
dependencies. With a circular dependency, the training program would
run into an infinite loop and never terminate by itself, and the
values flowing through the loop would diverge to infinity or vanish
toward zero, yielding invalid results. To analyze the execution
sequence and facilitate model topology design, the following describes
the dependencies between the compute nodes in a computational graph.

As shown in Figure :numref:`ch04/ch04-dependence`, if the Matmul1
operator is removed from the graph, the downstream activation function
receives no input and the data flow is interrupted. We can therefore
conclude that the operators in this computational graph depend on each
other through transitive relations.

![Dependencies in a computational graph](../img/ch04/依赖关系.svg)
:label:`ch04/ch04-dependence`

There are three types of dependencies:

1. **Direct dependency**: For example, the ReLU1 node directly
   depends on the Matmul1 node. That is, ReLU1 can run properly only
   when it receives a direct output from Matmul1.

2. **Indirect dependency**: For example, the Add node indirectly
   depends on the Matmul1 node. Specifically, Matmul1's output is
   processed by one or more intermediate nodes and then transmitted to
   the Add node. The Add node thus directly or indirectly depends on
   these intermediate nodes.

3. **Mutual independence**: For example, the graph shows no
   input/output dependency between Matmul1 and Matmul2, meaning that
   the two nodes are independent of each other.

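These dependency relations determine a valid execution order. As a minimal sketch (the node names and edges here are illustrative, loosely following the figure), a framework can topologically sort the DAG to obtain a serial schedule; mutually independent nodes such as Matmul1 and Matmul2 may additionally run in parallel:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# node -> the set of nodes it directly depends on
deps = {
    "ReLU1": {"Matmul1"},
    "ReLU2": {"Matmul2"},
    "Add":   {"ReLU1", "ReLU2"},  # indirectly depends on both Matmuls
}

# A topological order is a valid serial execution schedule:
# every node appears after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
assert order.index("Matmul1") < order.index("ReLU1") < order.index("Add")
```

A cycle in `deps` would make `static_order()` raise `graphlib.CycleError`, which is exactly the circular-dependency situation discussed next.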
In the computational graph shown in Figure
:numref:`ch04/ch04-recurrent`, the Add node indirectly depends on the
Matmul node while, conversely, the Matmul node directly depends on the
Add node. The two nodes are stuck waiting for each other's output to
start their computation. Even if input data is manually assigned to
both nodes at the same time, they will compute endlessly, and the
training process can never terminate by itself. A circular dependency
produces a positive-feedback data flow, where data values overflow to
positive infinity, underflow to negative infinity, or tend to zero.
These all lead to invalid training results. As such, we should avoid
circular dependencies between operators when designing deep learning
models.

![Circular dependency](../img/ch04/循环依赖.svg)
:label:`ch04/ch04-recurrent`

In machine learning frameworks, *unrolling* is used to represent loop
iterations. Figure :numref:`ch04/ch04-recurrent-1` shows a
computational graph involving three loop iterations. The subgraph of
the loop body is replicated three times (matching the number of
iterations) to produce an unrolled loop, and the resulting subgraphs
are concatenated in iteration order. The subgraph of each iteration
directly depends on that of the previous iteration. Within one
computational graph, tensors and operators are uniquely identified
across the loop iterations, even for the same operation. Unlike
circular dependencies, loop iterations do not involve mutual
dependencies between operators with unique identifiers: when a
subgraph is replicated to produce an unrolled loop, the replicated
tensors and operators are assigned new identifiers, which avoids
circular dependencies.

![Unrolling of loop iterations](../img/ch04/循环展开.svg)
:label:`ch04/ch04-recurrent-1`

## Control Flows

A control flow determines the order in which computation tasks
execute, thereby facilitating the design of flexible and complex
models. By introducing a control flow to a model, we can execute a
node iteratively any number of times or skip a node based on specific
conditions. Many deep learning models rely on control flows for
training and inference. For example, models built on recurrent neural
networks (RNNs) and reinforcement learning rely on recurrence
relations and input-dependent conditions to complete their
computation.

Popular machine learning frameworks provide two major types of control
flows:

1. **Frontend control flows**: Python control flow statements are
   used to implement control decision-making in a computational graph.
   Frontend control flows are easy to use in model building. However,
   because the computation process of the machine learning framework
   runs on the backend hardware and the control flow is decoupled from
   the data flow, the computational graph cannot run entirely on the
   backend hardware. Control flow implementations using the frontend
   language are therefore referred to as the *out-of-graph approach*.

2. **Framework control primitives**: Machine learning frameworks come
   with built-in low-level, fine-grained control primitive operators.
   Such operators are executable on compute hardware. When they are
   introduced to a model, the computational graph can run entirely on
   the backend hardware. Such control flow implementations are
   referred to as the *in-graph approach*.

To explain why we need these different approaches to implementing
control flows, let's look at the differences between them.

The out-of-graph approach is familiar to Python programmers. This
flexible, intuitive approach allows direct use of Python statements
such as `if-else`, `while`, and `for` in building control flows.

The in-graph approach, by contrast, is more complicated. TensorFlow
provides a range of in-graph control flow operators (such as `tf.cond`
for conditional control, `tf.while_loop` for loop control, and
`tf.case` for branch control). These operators are composites of
lower-level primitive operators. The control flow representations
adopted by the in-graph approach are in a different style from
everyday programming: this improves computing performance but comes at
the expense of usability.

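To make the idea concrete, here is an illustrative sketch in plain Python (not TensorFlow's actual implementation) of what such control primitives provide: the predicate and all branch or loop-body subgraphs are handed to the primitive as callables, so the framework itself owns the control decision instead of the frontend interpreter:

```python
def cond(pred_fn, true_fn, false_fn):
    # All three callables are recorded as part of the graph;
    # only one branch subgraph is executed at run time.
    return true_fn() if pred_fn() else false_fn()

def while_loop(cond_fn, body_fn, loop_vars):
    # The loop condition and body are both subgraphs; the framework
    # re-evaluates them until the condition becomes false.
    while cond_fn(*loop_vars):
        loop_vars = body_fn(*loop_vars)
    return loop_vars

# Sum 1..5 with the loop primitive.
i, acc = while_loop(lambda i, acc: i <= 5,
                    lambda i, acc: (i + 1, acc + i),
                    (1, 0))
assert acc == 15
assert cond(lambda: True, lambda: "A", lambda: "B") == "A"
```

In a real framework the callables would be traced into graph fragments rather than executed by the Python interpreter; the calling convention, however, looks much like this.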
The out-of-graph approach is easier to use. However, not all backend
compute hardware is compatible with the frontend runtime environment,
and extra effort may be needed to execute frontend control flows. In
contrast, control flows implemented using the in-graph approach are
directly executable on hardware independent of the frontend
environment, improving efficiency throughout the model building,
optimization, and execution process.

The two approaches serve different application scenarios. To run tasks
such as model training, inference, and deployment on compute hardware
independent of the frontend environment, the in-graph approach is
recommended for building control flows. For model validation purposes,
the out-of-graph approach allows model code to be produced from the
model algorithm more quickly.

Major machine learning frameworks support both the out-of-graph and
in-graph approaches. In the following illustrations of how control
flows affect forward and backward computation, we adopt the
out-of-graph approach, given that frontend control flows are more
popular in practice. The most common control flows are conditional
branches and loops. For a model containing control flow operations,
the control flow is replicated to the gradient computational graph
during backpropagation, so that the required tensor gradients can be
accurately calculated.

Code `ch04/code1` shows an example of simple conditional control,
where `matmul` denotes the matrix multiplication operator.

**ch04/code1**
```python
def control(A, B, C, conditional=True):
    if conditional:
        y = matmul(A, B)
    else:
        y = matmul(A, C)
    return y
```

Figure :numref:`ch04/ch04-if` depicts the forward and backward
computational graphs of Code `ch04/code1`. When running a model
containing `if` conditions, the program needs to know which branch of
each condition is taken so that it can apply the gradient computation
logic to the right branch. In the forward computational graph, tensor
$\bf{C}$ does not participate in computation due to conditional
control. Similarly, in the backward computational graph, tensor
$\bf{C}$ is skipped in gradient computation.

![Computational graph with an if condition](../img/ch04/条件控制.svg)
:label:`ch04/ch04-if`

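A quick numerical sketch (pure Python scalars standing in for tensors and `matmul`) confirms that the gradient flows only through the branch actually taken:

```python
def control(a, b, c, conditional=True):
    # scalar stand-in for the matmul-based conditional in ch04/code1
    return a * b if conditional else a * c

def grad_wrt_c(a, b, c, conditional, eps=1e-6):
    # central finite-difference estimate of dy/dc
    return (control(a, b, c + eps, conditional) -
            control(a, b, c - eps, conditional)) / (2 * eps)

# conditional=True: c does not participate, so its gradient is 0
assert abs(grad_wrt_c(2.0, 3.0, 4.0, True)) < 1e-6
# conditional=False: y = a*c, so dy/dc = a = 2
assert abs(grad_wrt_c(2.0, 3.0, 4.0, False) - 2.0) < 1e-6
```

This mirrors what the backward graph in the figure does: the gradient computation is applied only to the branch recorded during the forward pass.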
A control loop allows us to execute an operation zero or more times.
When the loop is unrolled, each operation is assigned a unique
identifier to distinguish different calls to the same operation. Each
iteration directly depends on the result of the previous one.
Therefore, one or more lists of tensors need to be maintained in the
control loop for storing per-iteration intermediate results used in
the forward pass and in gradient computation. Code `ch04/code2` shows
a control loop example. In its unrolled loop, $\bf{X_i}$ and
$\bf{W_i}$ are the lists of intermediate result tensors to be
maintained.

**ch04/code2**
```python
def recurrent_control(X: Tensor, W: Sequence[Tensor], cur_num=3):
    for i in range(cur_num):
        X = matmul(X, W[i])
    return X

# Unrolling the loop yields an equivalent representation,
# with W0 = W[0], W1 = W[1], and W2 = W[2].
def recurrent_control_unrolled(X: Tensor, W0, W1, W2):
    X1 = matmul(X, W0)
    X2 = matmul(X1, W1)
    Y = matmul(X2, W2)
    return Y
```

The forward and backward computational graphs of Code `ch04/code2` are
shown in Figure :numref:`ch04/ch04-while`. The gradient of the control
loop is also a loop, with the same number of iterations as the forward
loop. The gradient value output by one iteration serves as the input
for calculating the gradient of the next iteration, until the loop
ends.

![Computational graph with a loop](../img/ch04/循环控制.svg)
:label:`ch04/ch04-while`

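The bookkeeping described above can be sketched as follows (scalars in place of tensors; an illustrative toy, not a real framework): the forward loop stores each intermediate result $X_i$, and the backward loop replays the iterations in reverse, consuming those stored values:

```python
def forward(x, ws):
    xs = [x]                    # list of intermediate results X_i
    for w in ws:
        xs.append(xs[-1] * w)   # scalar stand-in for matmul(X, W_i)
    return xs

def backward(xs, ws, grad_y=1.0):
    grads_w = [0.0] * len(ws)
    g = grad_y
    for i in reversed(range(len(ws))):  # gradient loop runs in reverse
        grads_w[i] = g * xs[i]  # dY/dw_i needs the stored X_i
        g = g * ws[i]           # propagate gradient to the earlier step
    return grads_w, g           # g is now dY/dx

xs = forward(2.0, [3.0, 4.0, 5.0])
grads_w, grad_x = backward(xs, [3.0, 4.0, 5.0])
# Y = 2*3*4*5 = 120; dY/dx = 3*4*5 = 60; dY/dw0 = 2*4*5 = 40
assert xs[-1] == 120.0 and grad_x == 60.0 and grads_w[0] == 40.0
```

The list `xs` plays the role of the per-iteration tensor list ($\bf{X_i}$) that the framework must keep alive until backpropagation finishes.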
## Gradient Computation Using the Chain Rule

In the loop unrolling example in Section 3.2.3, when input tensor
$\bf{X}$ is fed into the neural network, the data is propagated
forward one layer at a time in the computational graph, and the
intermediate variables are calculated and stored until $\bf{Y}$ is
output after multilayer computation. In DNN training, the loss
function result is calculated from the output of forward propagation
and the label value. The model backpropagates the loss information
through the computational graph and updates the training parameters
based on the computed gradients. Typically, backpropagation works by
computing the gradients of the loss function with respect to each
parameter. Backpropagation based on other information is possible but
is not discussed here.

The chain rule is used to calculate the gradients with respect to each
parameter during backpropagation. In calculus, the chain rule provides
a technique for finding the derivatives of composite functions: the
derivative of a composite function at a given point is the product of
the derivatives of the constituent functions at the corresponding
points. Assume that $f$ and $g$ are functions of a real number $x$. If
$y=g(x)$ and $z=f(y)=f(g(x))$, the derivative of $z$ with respect to
$x$ is

$$\frac{\partial z}{\partial x}=\frac{\partial z}{\partial y}\frac{\partial y}{\partial x}.$$
:eqlabel:`eq:ch04/chainrule`

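As a quick sanity check (an illustrative example with $g(x)=x^2$ and $f(y)=\sin y$ chosen arbitrarily), the product of the two derivatives matches a finite-difference estimate of the composite derivative:

```python
import math

# Chain rule check: dz/dx = (dz/dy) * (dy/dx)
# for y = g(x) = x**2 and z = f(y) = sin(y).
def g(x): return x * x
def f(y): return math.sin(y)

x = 1.3
y = g(x)
analytic = math.cos(y) * 2 * x  # (dz/dy) * (dy/dx)

eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
assert abs(analytic - numeric) < 1e-5
```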
The backpropagation algorithm of neural networks executes the chain
rule in the sequence defined by the backward computational graph.
Generally, neural networks accept tensor inputs (e.g., 3D tensors) and
output 1D vectors. Therefore, we can generalize the scalar gradient
computation in Equation :eqref:`ch04/chainrule` to tensors as follows.
Assuming that $\bf{X}$ is an *m*-dimensional tensor, $\bf{Y}$ is an
*n*-dimensional tensor, $\bf{z}$ is a 1D vector, $\bf{Y}=g(\bf{X})$,
and $\bf{z}=f(\bf{Y})$, the partial derivative of $\bf{z}$ with
respect to each element of $\bf{X}$ is

$$\frac{\partial z}{\partial x_i}=\sum_j\frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}.$$
:eqlabel:`eq:ch04/chainrule-1`

The equivalent form of Equation :eqref:`ch04/chainrule-1` is

$$\nabla_{\bf{X}}\bf{z} = \left(\frac{\partial \bf{Y}}{\partial\bf{X}}\right)^{\top}\nabla_{\bf{Y}}\bf{z},$$
:eqlabel:`eq:ch04/chainrule-2`

where $\nabla_{\bf{X}}\bf{z}$ denotes the gradient matrix of $\bf{z}$
with respect to $\bf{X}$.

Figure :numref:`ch04/ch04-chain` shows the application of the chain
rule in neural networks, illustrating both the forward and backward
passes in a single graph. The neural network performs matrix
multiplication twice to obtain the predicted value $\bf{Y}$, and then
performs gradient backpropagation based on the error between the
output value and the label value to update the weight parameters so as
to minimize the error. The weight parameters to be updated are
$\bf{W}$ and $\bf{W_1}$.

![Application of the chain rule](../img/ch04/链式法则.svg)
:label:`ch04/ch04-chain`

The mean squared error (MSE) is chosen as the loss function in this
example. Two important questions arise here: How does the loss
function transfer the gradient information to $\bf{W}$ and $\bf{W_1}$
using the chain rule? And why do we need to calculate the gradients of
non-parameter data $\bf{X}$ and $\bf{X_1}$? To answer these questions,
let's analyze the computation details of forward and backward
propagation. First, the loss value is calculated through forward
propagation in three steps: (1) $\bf{X_1}=\bf{XW}$; (2)
$\bf{Y}=\bf{X_1W_1}$; and (3)
${\rm Loss}=\frac{1}{2}(\bf{Y}-{\rm Label})^2$.

The loss function measures the distance between the predicted value
and the label value, which training seeks to minimize. According to
the chain rule, backpropagation uses Equations
:eqref:`ch04/chainrule-3` and :eqref:`ch04/chainrule-4` to calculate
the gradients of the loss function with respect to parameters $\bf{W}$
and $\bf{W_1}$:

$$\frac{\partial {\rm Loss}}{\partial \bf{W_1}}=\frac{\partial \bf{Y}}{\partial \bf{W_1}}\frac{\partial {\rm Loss}}{\partial \bf{Y}}$$
:eqlabel:`eq:ch04/chainrule-3`

$$\frac{\partial {\rm Loss}}{\partial \bf{W}}=\frac{\partial \bf{X_1}}{\partial \bf{W}}\frac{\partial {\rm Loss}}{\partial \bf{Y}}\frac{\partial \bf{Y}}{\partial \bf{X_1}}$$
:eqlabel:`eq:ch04/chainrule-4`

Both Equations :eqref:`ch04/chainrule-3` and
:eqref:`ch04/chainrule-4` involve
$\frac{\partial {\rm Loss}}{\partial \bf{Y}}$, which corresponds to
grad $\bf{Y}$ in Figure :numref:`ch04/ch04-chain`. The term
$\frac{\partial {\rm Loss}}{\partial \bf{Y}}\frac{\partial \bf{Y}}{\partial \bf{X_1}}$
in Equation :eqref:`ch04/chainrule-4` corresponds to grad $\bf{X_1}$
in Figure :numref:`ch04/ch04-chain`. That is, to calculate the
gradient of model parameter $\bf{W}$, the gradient of intermediate
result $\bf{X_1}$ must be calculated first. This answers the second
question raised above: the gradients of non-parameter intermediate
results are calculated in order to facilitate the gradient computation
of the parameters.

Because $\bf{X_1}=\bf{XW}$, $\bf{Y}=\bf{X_1W_1}$, and
${\rm Loss}=\frac{1}{2}(\bf{Y}-{\rm Label})^2$, Equations
:eqref:`ch04/chainrule-3` and :eqref:`ch04/chainrule-4` can be
expanded to Equations :eqref:`ch04/chainrule-5` and
:eqref:`ch04/chainrule-6`, respectively, using the form of Equation
:eqref:`ch04/chainrule-2`. Then, we can analyze how the variables
participate in gradient computation when the machine learning
framework uses the chain rule to build a backward computational graph.

$$\frac{\partial {\rm Loss}}{\partial \bf{W_1}}=\frac{\partial \bf{Y}}{\partial \bf{W_1}}\frac{\partial {\rm Loss}}{\partial \bf{Y}}=\bf{X_1}^\top(\bf{Y}-{\rm Label})$$
:eqlabel:`eq:ch04/chainrule-5`

$$\frac{\partial {\rm Loss}}{\partial \bf{W}}=\frac{\partial \bf{X_1}}{\partial \bf{W}}\frac{\partial {\rm Loss}}{\partial \bf{Y}}\frac{\partial \bf{Y}}{\partial \bf{X_1}}=\bf{X}^\top(\bf{Y}-{\rm Label})\bf{W_1}^\top$$
:eqlabel:`eq:ch04/chainrule-6`

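These expansions can be checked numerically. The following sketch (scalars in place of matrices, with arbitrarily chosen values) compares the analytic gradients of Equations :eqref:`ch04/chainrule-5` and :eqref:`ch04/chainrule-6` against finite differences of the loss:

```python
def loss(X, W, W1, label):
    X1 = X * W                  # X1 = X W
    Y = X1 * W1                 # Y = X1 W1
    return 0.5 * (Y - label) ** 2

X, W, W1, label = 2.0, 0.5, 1.5, 1.0
X1, Y = X * W, X * W * W1

analytic_dW1 = X1 * (Y - label)     # scalar form of (chainrule-5)
analytic_dW = X * (Y - label) * W1  # scalar form of (chainrule-6)

eps = 1e-6
num_dW1 = (loss(X, W, W1 + eps, label) -
           loss(X, W, W1 - eps, label)) / (2 * eps)
num_dW = (loss(X, W + eps, W1, label) -
          loss(X, W - eps, W1, label)) / (2 * eps)
assert abs(analytic_dW1 - num_dW1) < 1e-5
assert abs(analytic_dW - num_dW) < 1e-5
```

Note how the analytic expressions reuse the forward intermediates `X1` and `Y`, exactly the buffered results discussed below.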
Equation :eqref:`ch04/chainrule-5` uses intermediate result $\bf{X_1}$
from the forward computational graph when calculating the gradient of
$\bf{W_1}$. In Equation :eqref:`ch04/chainrule-6`, both input $\bf{X}$
and parameter $\bf{W_1}$ are used to calculate the gradient of
parameter $\bf{W}$. This answers the first question: the gradient
information transferred backward from downstream network layers,
together with the intermediate results and parameter values from
forward computation, all play a role in calculating the gradient of
each parameter in the graph.

As Figure :numref:`ch04/ch04-chain` and Equations
:eqref:`ch04/chainrule-3`, :eqref:`ch04/chainrule-4`,
:eqref:`ch04/chainrule-5`, and :eqref:`ch04/chainrule-6` show, when
the chain rule is used to construct a backward computational graph,
the computation process is analyzed, and the intermediate results and
gradient transfer status in the model are stored. The machine learning
framework improves backpropagation efficiency by reusing these
buffered computation results.

We can generalize the chain rule to wider applications. Even with
flexible control flows, the machine learning framework can quickly
analyze the computation processes of the forward data flow and the
backward gradient flow by using computational graph technology,
effectively manage the lifetime of each intermediate result in memory,
and improve the overall computation efficiency.

# Computational Graph Functions

Early machine learning frameworks were mainly designed for fully
connected networks and convolutional neural networks (CNNs). Such
neural networks consist of serial layers, whose topology can be
represented in simple configuration files (e.g., Caffe model
definitions in Protocol Buffers format).

In contrast, modern machine learning models have increasingly complex
structures. Prominent examples include mixture-of-experts (MoE)
models, generative adversarial networks (GANs), and attention models.
To train such complex model structures (e.g., loops with branching)
efficiently, machine learning frameworks are expected to quickly
analyze operator dependencies, gradient computation, and training
parameters, so as to facilitate model optimization, formulate
scheduling strategies, and automate gradient computation. As such,
machine learning system designers call for a common data structure to
understand, represent, and execute machine learning models. To this
end, machine learning frameworks introduce computational graph
technology while still decoupling the frontend and backend languages
in their design, as shown in Figure :numref:`ch04/ch04-DAG`. From a
top-level view, computational graph technology provides the following
key functions:


|
||||
:label:`ch04/ch04-DAG`
|
||||
|
||||
1. **Unified representation of the computation process.** Developers
   tend to write machine learning programs in high-level programming
   languages (e.g., Python, Julia, and C++). However, because most
   devices such as hardware accelerators provide only C/C++ APIs,
   implementations of machine learning systems are largely restricted
   to C/C++. Computational graph technology makes it possible to run
   programs written in different high-level languages on common
   low-level C/C++ system modules. As a unified representation, a
   computational graph describes a model's input data, computational
   logic (usually referred to as operators), and the execution
   sequence of operators.

2. **Automatic gradient computation.** The training program receives
   data samples (or the training dataset), performs forward
   computation through the network, and then calculates the loss
   value. Based on the loss value, the machine learning system
   computes the gradient for each model parameter and then updates the
   model parameters. The gradient computation method should apply
   universally and run automatically, regardless of the model topology
   and loss computation method. Based on the computational graph, the
   machine learning system can quickly analyze the gradient transfer
   relations between parameters, thereby achieving automatic gradient
   computation.

3. **Lifetime analysis of model variables.** During model training,
   many intermediate variables are generated, for example, the
   activation values in the forward pass and the gradients in the
   backward pass. Some of the intermediate variables generated in the
   forward pass are used in conjunction with the gradients for
   updating model parameters. With a computational graph, the machine
   learning system can accurately analyze the lifetime of each
   intermediate variable (i.e., from the time the variable is
   generated to the time it is destroyed), helping the framework
   optimize memory management.

4. **Execution optimization.** User programs can have different
   network structures. With computational graph technology, the
   machine learning framework can analyze the model topology and
   operator dependencies, and automatically search for operator
   parallelization strategies to improve the model execution
   efficiency.
# Generating a Computational Graph

In the previous section, we explored the ingredients of a
computational graph. Now let's proceed to the next question: how is a
computational graph generated? Machine learning frameworks support two
approaches to implementing computational graphs: static and dynamic.
The static approach builds a static (unchanging) graph based on
information such as the network topology and parameter variables
described in the frontend language. Because a static graph is
independent of the frontend language, it is especially suitable for
model deployment (e.g., deploying a facial recognition application on
mobile devices).

Unlike the static approach, the dynamic approach generates a temporary
graph from the frontend description each time the model is executed.
Dynamic graphs are easy to debug, making it possible to fine-tune
models efficiently on the fly. Major machine learning frameworks such
as TensorFlow and MindSpore are compatible with both approaches. And
although PyTorch uses dynamic graphs, it also offers
dynamic-to-static conversion for efficient model execution. To choose
the right approach for a specific task, we need to consider the task
requirements as well as the pros and cons of each approach.

## Static Graph

The static graph approach decouples the definition and execution
processes. That is, a static graph is compiled before it is executed,
as shown in Figure :numref:`ch04/ch04-static`.

![Static graph generation](../img/ch04/静态生成.svg)
:label:`ch04/ch04-static`

When a model program is written in the frontend language, the machine
learning framework first analyzes the model topology for information
such as the connections between network layers, parameter variable
settings, and loss functions. The framework then compiles the model
description into fixed code (i.e., a static computational graph) that
can be invoked and executed by the computing backend. Subsequent
training or inference on the model is then no longer
frontend-dependent. Specifically, when input data is fed into the
static graph, the operators in the graph are directly scheduled onto
hardware for execution. To improve hardware efficiency, the static
graph can also be converted into equivalent structures through various
optimization strategies.

Code `ch04/code4` shows an example of generating and executing a
simple static graph. In the frontend definition phase, some machine
learning frameworks require developers to declare predefined
configuration items, including tensor placeholders, loss functions,
optimization functions, network building and runtime environments, and
network executors, as well as in-graph control statements using
control flow operators. The design of machine learning frameworks has
recently been improved to provide easy-to-use APIs and a unified model
building paradigm. For example, MindSpore enables unified frontend
programming representations featuring dynamic and static integration.
To illustrate, let's consider the following simple model.

**ch04/code4**
```python
def model(X, flag):
    if flag > 0:
        Y = matmul(W1, X)
    else:
        Y = matmul(W2, X)
    Y = Y + b
    Y = relu(Y)
    return Y
```

The machine learning framework does not load input data when
generating a static graph. Instead, *placeholder* tensors hold the
places of input data. For the static graph defined in Code
`ch04/code4`, we need to create a placeholder for input $\bf{X}$ in
line 1. Because no actual input is fed into the model during static
graph generation, the control flow defined in line 2 cannot make
control decisions at build time. As such, we need to add the control
flow operator and the computational subgraph of each branch to the
static graph. When the model receives actual inputs at run time,
different branches are taken (by running the corresponding
computational subgraphs) depending on the inputs. However, not all
machine learning frameworks are able to compile Python control flows
into static graph equivalents. To implement control flows in that
case, we can use the control primitives provided by the framework.

![Static graph generation](../img/ch04/静态图生成计算图.svg)
:label:`ch04/ch04-static-gen`

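The placeholder mechanism can be sketched in a few lines (toy classes with hypothetical names, not any real framework's API): graph construction records operations without data, and a later execution step substitutes actual inputs:

```python
class Placeholder:
    """Stands in for input data during graph construction."""

class Node:
    """An operator node: a function plus its input nodes."""
    def __init__(self, fn, inputs):
        self.fn, self.inputs = fn, inputs

def run(node, feed):
    # Recursively evaluate the static graph; `feed` supplies the
    # actual values for placeholders at execution time.
    if isinstance(node, Placeholder):
        return feed[node]
    if isinstance(node, Node):
        return node.fn(*[run(i, feed) for i in node.inputs])
    return node  # constant

x = Placeholder()
y = Node(lambda a, b: a * b, [x, 3.0])  # graph built, nothing computed
z = Node(lambda a, b: a + b, [y, 1.0])
assert run(z, {x: 2.0}) == 7.0          # data fed in only at run time
```

The separation of `Node` construction from `run` is exactly the definition/execution decoupling of the static graph approach.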
Static computational graphs offer two distinct advantages. First, they
yield better performance with less memory. When building a static
graph, the machine learning framework acquires the complete model
topology, which contains global information about the model and
facilitates the formulation of graph optimization strategies (e.g.,
operator fusion, which fuses two or more operators into a larger one).
As shown in Figure :numref:`ch04/ch04-static-gen`, the Add and ReLU
operators are fused into one operator to reduce loads/stores of
intermediate results and low-level scheduling overhead, thereby
improving execution performance and efficiency with a lower memory
footprint. Static graphs allow for many optimization strategies at
build time, which we will discuss in later sections.

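The benefit of fusing Add and ReLU can be illustrated with a toy sketch (illustrative only, not a framework's actual fusion pass): the fused operator makes one pass over the data and never materializes the intermediate sum:

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def relu(a):
    return [max(0.0, x) for x in a]

def fused_add_relu(a, b):
    # one traversal, no intermediate buffer for the sum
    return [max(0.0, x + y) for x, y in zip(a, b)]

a, b = [1.0, -2.0, 3.0], [0.5, 0.5, -4.0]
assert fused_add_relu(a, b) == relu(add(a, b))
```

On real accelerators the saving comes from avoiding a round trip of the intermediate tensor through memory and one kernel-launch overhead, which this scalar sketch only hints at.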
Second, by converting static graphs into executable code within the
machine learning framework, we can directly deploy our models on
various hardware platforms to provide efficient inference services.
Also, we can store static graphs using serialization techniques for
future execution (either model training or inference), eliminating the
need to rebuild the frontend source code from scratch every time
before execution.

Once the frontend code of the model is compiled into a static graph,
the graph structure is fixed. If we apply optimizations to the graph,
the optimized code can differ significantly from the original.
However, the optimized code is not directly visible to the developer,
meaning that it is sometimes impossible to locate a runtime error
based on the line number reported for the optimized code. Consider a
simple case. Assuming that the Add and ReLU operators in Code
`ch04/code4` have been fused for optimization, if a runtime error
related to the fused operator is reported, it is hard for us to
determine the exact error location (Add or ReLU).

In addition, in the daunting process of model debugging and testing,
intermediate results cannot be printed in real time. To make this
happen, we need to insert additional code into the source code and
then recompile it for execution, making debugging less efficient. By
contrast, the dynamic graph approach offers more flexibility.

## Dynamic Graph

Figure :numref:`ch04/ch04-eager1` shows the principle of the dynamic
graph approach. A dynamic graph is defined as it runs. The frontend
interpreter parses the model code, and the machine learning framework
distributes the operators in the graph to the backend for just-in-time
(JIT) execution. Adopting the user-friendly imperative programming
paradigm, the dynamic graph approach allows developers to create
neural network models directly at the frontend and is therefore
favored by a vast number of deep learning researchers.

![Dynamic graph generation](../img/ch04/动态生成.svg)
:label:`ch04/ch04-eager1`

Next, we reuse the pseudocode from the previous section to compare the
dynamic and static graph approaches.

While the two approaches differ only slightly in their frontend
representations, they differ dramatically in their compilation and
execution mechanisms. Unlike the static graph approach, the dynamic
graph approach calls the built-in operator distribution function of
the machine learning framework through the Python API to distribute
Python operators to the hardware backend (e.g., CPU, GPU, or NPU) for
accelerated computing, which then returns the computational result to
the frontend. This process does not generate a static computational
graph. Instead, the framework describes the model topology in the
frontend language, schedules and executes the model based on
computational dependencies, and dynamically generates a temporary
graph.

Figure :numref:`ch04/ch04-dynamic-gen` shows the process of generating
a dynamic graph.

![Dynamic graph generation](../img/ch04/动态图生成计算图.svg)
:label:`ch04/ch04-dynamic-gen`

Forward computation is run through the neural network in the sequence
|
||||
defined by the model declaration. Once the model receives input
|
||||
$\bf{X}$, the machine learning framework starts to generate a dynamic
|
||||
graph by adding the input node to the graph and sending the data to the
|
||||
downstream node. The control flow (if available) makes a data flow
|
||||
decision immediately. For example, in Figure
|
||||
:numref:`ch04/ch04-dynamic-gen`, if the conditional returns true,
|
||||
only the Matmul operator node with respect to tensor $\bf{W1}$ is added
|
||||
to the graph. Then, the machine learning framework inserts the Add and
|
||||
ReLU operator nodes based on the operator sequence and computational
|
||||
dependencies defined in the code. For each newly added operator node,
|
||||
the machine learning framework distributes and executes the operator,
|
||||
returns the computational result, and prepares to pass the result to the
|
||||
next node. When forward computation resumes, the last dynamic graph
|
||||
becomes invalid and a new dynamic graph is created according to current
|
||||
input and control decision. In contrast with a static graph that
|
||||
represents the entire model described in the frontend language, a
|
||||
dynamic graph is generated on the fly as the control flow and data flow
|
||||
evolve over time. For this reason, the machine learning framework has
|
||||
few opportunities to optimize the model in the dynamic graph setting.

In the static graph setting, as the model definition is entirely
available, a complete forward computational graph and a complete
backward computational graph can be constructed simultaneously. However,
in the dynamic graph setting, gradients are calculated for
backpropagation as the forward pass proceeds. Specifically, the machine
learning framework collects information about each backward operator and
tensor participating in gradient computation based on the information of
each operator called in the forward pass. Once the forward pass ends,
the operator and tensor information for backpropagation becomes
available. With this information, the machine learning framework creates
a backward computational graph and runs it on hardware to complete
gradient computation and parameter update.

As shown in Figure :numref:`ch04/ch04-dynamic-gen`, when the Matmul
operator with respect to tensor $\bf{W1}$ is called, the framework runs
the Matmul operator to calculate the product of inputs $\bf{X}$ and
$\bf{W1}$, and then records the operator and tensor $\bf{X}$ that will
participate in backpropagation based on the backward computation process
Grad\_$\bf{W1}$=Grad\_$\bf{Y}*\bf{X}$, thereby completing the forward
pass and producing a backward computational graph.
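
The recording step described above can be sketched as a minimal gradient tape. The `Tape` class and the scalar `matmul` stand-in are illustrative assumptions, not a real framework API:

```python
class Tape:
    """Records, during the forward pass, how to compute each gradient."""
    def __init__(self):
        self.backward_fns = []

    def record(self, backward_fn):
        self.backward_fns.append(backward_fn)

    def run_backward(self, grad_y):
        grads = {}
        for backward_fn in reversed(self.backward_fns):
            grads.update(backward_fn(grad_y))
        return grads

tape = Tape()

def matmul(x, w, name):
    y = x * w   # scalar stand-in for a matrix product
    # Record Grad_W = Grad_Y * X, as in the text, while the forward op runs.
    tape.record(lambda grad_y, x=x: {name: grad_y * x})
    return y

y = matmul(3.0, 2.0, "W1")       # forward: executes immediately
grads = tape.run_backward(1.0)   # backward graph assembled from the records
print(y, grads)                  # 6.0 {'W1': 3.0}
```

A real framework records framework operators and tensors rather than Python closures, but the pattern — accumulate backward information as each forward operator runs — is the same.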

Although the optimization techniques useful in the static graph setting
do not work for dynamic graphs (because the complete network structure
is unknown until the dynamic graph runs), researchers and developers can
easily analyze errors and debug results during model testing and
optimization. This is made possible by dynamic graphs supporting JIT
computing and returning computational results immediately with the
execution of each statement.

Also, the dynamic graph approach enables flexible execution using native
control flows provided by the frontend --- unlike static graphs, which
involve complex control flows along with programming and debugging
difficulties. Consequently, the dynamic graph approach lowers the
barriers to programming for beginners while also improving the iteration
efficiency of algorithm development and model optimization.

## Dynamic Graph vs. Static Graph

The two approaches for implementing computational graphs have their pros
and cons, as described in Table :numref:`ch04/ch4-graph`.

:Static graph vs. dynamic graph

|Feature |Static Graph |Dynamic Graph |
|---------------------------------|------------------------------------------------------------|----------------------------------------------|
|On-the-fly intermediate results |No |Yes |
|Code debugging |Difficult |Easy |
|Control flow implementation |Specialized syntax |Frontend syntax |
|Performance |Better; supports a wide range of optimization strategies |Poorer; supports limited graph optimizations |
|Memory footprint |Low |High |
|Direct deployment |Yes |No |
:label:`ch04/ch4-graph`

Compared with the dynamic graph approach, the static graph approach
seems to be less user-friendly to developers because intermediate
results are not available on the fly, code debugging is difficult, and
implementing control flows is complex. However, static graphs ensure
higher execution performance than dynamic graphs. See the example in
Code `ch04/code5`.

**ch04/code5**
```python
def model(X1, X2):
    Y1 = matmul(X1, W1)
    Y2 = matmul(X2, W2)
    Y = Y1 + Y2
    output = relu(Y)
    return output
```

If the static approach is used to implement Code `ch04/code5`, the
machine learning framework creates a complete computational graph.
Because tensors $\bf{Y_1}$ and $\bf{Y_2}$ are computed independently
from each other, we can implement automatic parallelism on them in order
to improve the computational efficiency. Furthermore, the static
approach allows many more optimization strategies to improve efficiency
while also lowering memory footprint, for example, fusing the Add and
ReLU operators to reduce the loads and stores of the intermediate
variable $\bf{Y}$. Conversely, if the dynamic approach is used without a
manually configured parallelism strategy, the machine learning framework
is unaware of the independence between operators due to the lack of a
complete computational graph. Consequently, the framework has to execute
the operators, including Add and ReLU, in a defined order and store the
intermediate variable $\bf{Y}$. To further reduce memory footprint, the
static approach determines in advance, based on the forward and backward
computational graphs defined prior to execution, which intermediate
variables must be stored for backpropagation. This is not feasible in
the dynamic approach, where the backward computational graph is defined
only after the forward pass is complete. As such, more intermediate
variables have to be stored in the forward pass to ensure
backpropagation efficiency, resulting in a higher memory footprint.
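
As a rough sketch of the automatic parallelism a static graph enables for Code `ch04/code5`, the two independent products can be dispatched concurrently. The thread pool and scalar stand-ins below are illustrative assumptions, not a framework's scheduler:

```python
from concurrent.futures import ThreadPoolExecutor

def matmul(x, w):
    return x * w            # scalar stand-in for a matrix product

def relu(y):
    return max(y, 0.0)

def model(x1, x2, w1=2.0, w2=3.0):
    # Y1 and Y2 do not depend on each other, so a static graph scheduler
    # is free to compute them in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(matmul, x1, w1)
        f2 = pool.submit(matmul, x2, w2)
        y = f1.result() + f2.result()
    return relu(y)

print(model(1.0, 1.0))  # 5.0
```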

To choose one approach over the other, we should consider their pros and
cons in addition to analyzing specific task requirements. For academic
research purposes or in the model design and debugging phases, the
dynamic graph approach is suggested because it allows for quick testing
of experimental ideas and iterative update of the model structure. In
other cases where the model structure is fixed, using the static graph
approach offers higher efficiency, whether to accelerate the training
process or to deploy the model on specific hardware.

## Conversion Between and Combination of Dynamic and Static Graphs
:label:`conversion_between_and_combination_of_dynamic_and_static_graphs`

Dynamic graphs are easy to debug and suitable for model design and
testing, whereas static graphs improve execution efficiency and shorten
model training time. Is there a way for the machine learning framework
to combine the merits of both approaches? Major machine learning
frameworks, such as TensorFlow, MindSpore, PyTorch, and PaddlePaddle,
have added support to convert between dynamic and static graphs,
allowing developers to program using the dynamic graph approach and
letting the framework automatically convert the code to a static
equivalent for execution.

Table :numref:`ch04/ch4-eagertoscript` lists the APIs for dynamic graph
to static graph conversion provided by major frameworks.

:Dynamic graph to static graph conversion support of major frameworks

| Framework | Dynamic Graph to Static Graph Conversion |
|--------------|------------------------------------------------------------------------------------------------------------------|
| TensorFlow | `@tf.function`: builds a static graph from the decorated function, where AutoGraph automatically transforms a control flow to the equivalent static statement. |
| MindSpore | `context.set_context(mode=context.GRAPH_MODE)`: static graph mode; `@ms_function`: builds a static graph from source code. |
| PyTorch | `torch.jit.trace()`: builds a static graph by tracing operators. |
| PaddlePaddle | `paddle.jit.TracedLayer.trace()`: builds a static graph by tracing operators. |
:label:`ch04/ch4-eagertoscript`

These dynamic-to-static conversion methods fall into the following two
categories:

1. **Tracing**: A static graph is built by tracing operator scheduling
   in a dynamic graph.

2. **Source code transformation**: The frontend code is inspected and
   rebuilt as static graph code, and the static graph executor is
   automatically called to run the static graph.

The *tracing* method goes through two simple phases. The first is to
generate a dynamic graph, following a workflow similar to that shown in
Figure :numref:`ch04/ch04-dynamic-gen`. The machine learning framework
runs the created dynamic graph and traces the data flow and operator
scheduling in it: rather than being destroyed after execution, the
dynamic graph is preserved as a static graph for subsequent execution,
so that once the framework finishes executing the dynamic graph, a
static graph has been produced. In the second phase, when the model is
called again, the machine learning framework runs this static graph for
computation. The tracing technique only traces the operators scheduled
when the dynamic graph is run for the first time. However, if the model
has a data-dependent conditional, only one branch of the conditional can
be traced --- the traced graph would be unable to take alternate
branches. Similarly, the traced graph cannot include every iteration if
there is a data-dependent loop.
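
This single-branch limitation can be demonstrated with a toy tracer — a deliberately simplified stand-in for APIs such as `torch.jit.trace()`, not their actual implementation:

```python
class Tracer:
    """Records every primitive op executed during one run."""
    def __init__(self):
        self.ops = []

    def op(self, fn, x):
        self.ops.append(fn)
        return fn(x)

def trace(model, example_x):
    tracer = Tracer()
    model(tracer, example_x)              # phase 1: run once and record
    def traced(x):
        for fn in tracer.ops:             # phase 2: replay; the `if` is gone
            x = fn(x)
        return x
    return traced

def model(t, x):
    if x > 0:                             # data-dependent conditional
        return t.op(lambda v: v * 2, x)
    return t.op(lambda v: v + 10, x)

traced = trace(model, example_x=1.0)      # traces only the x > 0 branch
print(traced(1.0))    # 2.0, same as eager execution
print(traced(-1.0))   # -2.0, although eager execution would give 9.0
```

The replayed graph silently applies the recorded branch to every input; a real tracer suffers from exactly this failure mode.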

Unlike dynamic graph code, which is parsed and executed by the frontend
interpreter, a static graph must first be created by the graph compiler
of the machine learning framework before execution. Because the graph
compiler cannot directly deal with dynamic graph code, the source code
transformation--based method is introduced to convert the dynamic graph
code into a static code description.

The *source code transformation*--based method can overcome the
drawbacks of the tracing method and also consists of two phases, as
shown in Figure :numref:`ch04/ch04-ast`. The first involves lexical and
syntax analysis. Specifically, the lexical analyzer scans and analyzes
every character in the dynamic graph code, splits the source text by
removing any white spaces or comments, and returns a stream of tokens.
Then, the syntax analyzer (parser) analyzes the token stream, checks it
for syntax errors, and generates an abstract syntax tree as the output
of this phase. In the second phase, the built-in translators of the
machine learning framework scan and translate each part of the abstract
syntax tree to map the grammatical structures from dynamic graph format
into static graph format. Any control flow written in the frontend
language is transformed into the corresponding static graph API in this
phase, so as to include every branch of the control flow in the
resulting graph. Static graph code can then be generated from the
translated syntax tree.


:label:`ch04/ch04-ast`
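
Python's standard `ast` module can illustrate both phases on a small snippet. The target API name `cond` is a hypothetical static-graph control-flow operator, assumed only for illustration:

```python
import ast

src = """
if flag > 0:
    y = matmul(w1, x)
else:
    y = matmul(w2, x)
"""

class IfToCond(ast.NodeTransformer):
    """Rewrite `if c: y = a / else: y = b` into `y = cond(c, a, b)`."""
    def visit_If(self, node):
        target = node.body[0].targets[0]
        call = ast.Call(func=ast.Name(id="cond", ctx=ast.Load()),
                        args=[node.test, node.body[0].value, node.orelse[0].value],
                        keywords=[])
        return ast.copy_location(ast.Assign(targets=[target], value=call), node)

tree = ast.parse(src)          # phase 1: lexical and syntax analysis
tree = IfToCond().visit(tree)  # phase 2: translate the syntax tree
ast.fix_missing_locations(tree)
print(ast.unparse(tree))       # y = cond(flag > 0, matmul(w1, x), matmul(w2, x))
```

Because both branches survive as arguments of `cond`, the generated static graph retains the full control flow, unlike a traced graph.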

In many cases, either tracing or source code transformation alone is the
more convenient way to convert a model to a static graph. The two can
also be combined to cater to the specific requirements of a model
segment. For instance, PyTorch offers both methods for transforming
dynamic graphs into static graphs, and a hybrid approach is frequently
employed. Scripted functions can invoke traced functions, which is
advantageous when control-flow mechanisms are needed around a
straightforward module, such as using beam search in a
sequence-to-sequence model whose encoder module is produced through
tracing. Traced functions, in turn, can call scripted functions, which
is beneficial when control flow is needed in only a small section of a
model, typically a feed-forward network.

To improve the computational efficiency, we can transform the entire
model graph for fast deployment on hardware. Alternatively, we can
consider transforming some of the model functions into static subgraphs
and embedding them into the global dynamic graph as individual
operators, so that these exact functions would run in the form of static
graphs at execution time. This not only improves computational
efficiency but also retains flexibility for code debugging.

Code `ch04/code6` shows a simple model, which can be built into a
dynamic graph as a whole. In this example, we transform the
`add_and_relu` module into a static subgraph. The model runs on the
input data in a predefined sequence, resulting in a temporary dynamic
graph. When the `Y=add_and_relu(Y,b)` statement is executed, the machine
learning framework automatically runs the static subgraph transformed
from the module, achieving a performance gain by combining the
advantages of dynamic and static graphs.

**ch04/code6**
```python
def add_and_relu(Y, b):
    Y = Y + b
    Y = relu(Y)
    return Y

def model(X, flag):
    if flag > 0:
        Y = matmul(W1, X)
    else:
        Y = matmul(W2, X)
    Y = add_and_relu(Y, b)
    return Y
```

Dynamic-to-static conversion is mostly used in the model deployment
stage, as a workaround to the hardware constraints on dynamic graph
deployment: deploying a dynamic graph requires the frontend model
definition code for topology discovery, in addition to the file of
already-trained parameters. To remove the frontend dependency, once
model training in dynamic graph mode is complete, we may convert the
model into static graph format and serialize the model and parameter
files, thereby expanding the list of supported hardware.
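
A serialized static graph as a frontend-independent artifact can be sketched as follows; the JSON layout and the `KERNELS` table are illustrative assumptions, not any framework's actual format:

```python
import json

# A "static graph" as plain data: op names plus parameters (illustrative).
graph = {"ops": [{"op": "matmul", "w": 2.0}, {"op": "relu"}]}
serialized = json.dumps(graph)   # ship this alongside the trained parameters

# Deployment side: no frontend model-definition code, only an op table.
KERNELS = {
    "matmul": lambda x, node: x * node["w"],   # scalar stand-in
    "relu":   lambda x, node: max(x, 0.0),
}

def run(serialized_graph, x):
    for node in json.loads(serialized_graph)["ops"]:
        x = KERNELS[node["op"]](x, node)
    return x

print(run(serialized, 3.0))   # 6.0
print(run(serialized, -3.0))  # 0.0
```

Real frameworks serialize a richer graph IR together with the weights, but the principle — the deployment target needs only the graph file, not the Python model code — is the same.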

However, the process of translating a dynamic graph into a static graph
can become more intricate when dealing with backward graph dependencies
and dynamic shapes. Additionally, the performance of the execution
engine may be compromised during complex graph transformations. To
address this, frameworks like PyTorch have introduced more aggressive
dynamic transformation methods. PyTorch's Dynamo module not only
implements source code transformation, but also hooks into the Python
execution engine through lower-level APIs. This approach resembles the
combination of a compiler and an interpreter found in modern Python code
execution engines such as CPython, resulting in high performance.

27
v1/en_chapters/chapter_computational_graph/index.md
Normal file
@@ -0,0 +1,27 @@

# Computational Graph

In this chapter, we look at the following question: how does a machine
learning system efficiently execute a machine learning program on
hardware? We can break this down into three sub-questions: How do we
schedule and execute the model described by a machine learning program?
How do we improve the model scheduling and execution efficiency? And can
we implement automatic gradient computation for updating the model? The
key to answering these questions is computational graph technology. To
explain this technology, this chapter covers the following key aspects:

1. Computational graph basics

2. Generation of static and dynamic computational graphs

3. Common execution methods of computational graphs

```toc
:maxdepth: 2

Computational_Graph_Functions
Computational_Graph_Basics
Generating_a_Computational_Graph
Scheduling_and_Executing_Computational_Tasks
Chapter_Summary
Further_Reading
```

@@ -0,0 +1,245 @@

# Scheduling and Executing Computational Tasks

Training a model is conducted by scheduling the execution of the
operators in a computational graph. From a broad perspective, a training
job runs a computational graph for a defined number of iterations,
relying on optimal scheduling of tasks such as data loading and training
(or inference) execution. Within each iteration, we need to analyze
operator-level scheduling based on the graph topology, computational
dependencies, and control flows. We optimize the scheduling and
execution of computational graphs to make full use of computing
resources, improve computational efficiency, and shorten the model
training and inference time. The following introduces the typical
techniques of computational graph scheduling.

Depending on how the graph is generated, the scheduling and execution of
a computational graph falls into three modes: operator scheduling,
whole-graph scheduling, and combined operator-and-subgraph scheduling.
These modes correspond to the dynamic graph, static graph, and hybrid
dynamic-static graph generation mechanisms, respectively.

Next, we introduce the scheduling and execution of computational graphs
in detail.

## Operator Scheduling

Operator scheduling means that the operators contained in the algorithm
or model are scheduled and executed one by one through the runtime of
the Python language. This scheduling mechanism is used when the
computational graph is executed in dynamic graph mode, such as PyTorch's
default execution mode and TensorFlow's eager mode.

Operator scheduling includes two steps. In the first step, the dynamic
computational graph obtains a linear operator scheduling sequence
according to the call order of the operators declared in the model. In
the second step, the ordered operators are distributed to instruction
streams.

In Figure :numref:`ch04/ch04-diaoduzhixing`, the directed acyclic graph
on the left contains five nodes a, b, c, d, and e and four dependency
edges a-\>d, b-\>c, c-\>d, and d-\>e (e.g., a-\>d indicates that d
depends on a). According to the operator call sequence of the model
code, such as a-\>b-\>c-\>d-\>e, all operator nodes are put into the
queue in turn, and the scheduling ends.


:label:`ch04/ch04-diaoduzhixing`
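
The two scheduling steps can be sketched by taking the call order as the schedule and checking it against the dependency edges named in the text:

```python
edges = [("a", "d"), ("b", "c"), ("c", "d"), ("d", "e")]  # a->d: d depends on a
schedule = ["a", "b", "c", "d", "e"]   # linear sequence from the operator call order

def is_valid_schedule(schedule, edges):
    """Every operator must appear after all of the operators it depends on."""
    position = {op: i for i, op in enumerate(schedule)}
    return all(position[src] < position[dst] for src, dst in edges)

print(is_valid_schedule(schedule, edges))  # True
```

The call order of well-formed model code is always a valid topological order of the dependency graph, which is why the framework can simply enqueue operators as they are called.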

With the ordering, we then prepare to distribute the ordered operators
and related data to the GPU hardware for execution. Figure
:numref:`ch04/ch04-single-op-exec` shows the trace of operator
scheduling. Once the Python runtime calls an operator, the machine
learning framework initializes the operator by determining information
such as the operator precision, the type and size of each input/output,
and the target device. It then allocates memory for the operator before
copying the memory to the specific device for execution.


:label:`ch04/ch04-single-op-exec`

The operator scheduling method offers high flexibility because operators
are directly scheduled by the Python runtime. It facilitates the
representation of complex computational logic (such as control flows)
and the use of Python-native data structures for implementing complex
algorithms. Operators are driven by the Python runtime to finish
computational tasks, facilitating easy collaboration with Python's
large, rich ecosystem.

Despite its advantages, operator scheduling also has some disadvantages.
One is that context-based runtime optimizations such as operator fusion
and algebraic simplification become difficult, because global
information about the computational graph is unavailable. Another
disadvantage is that computational tasks have to run in serial mode,
rather than in parallel, due to the lack of computational topology.

## Graph Scheduling

When a computational graph uses the static graph mechanism for
whole-graph scheduling and execution, operators are sent to the hardware
for execution one by one according to a certain execution sequence.
However, global information about the computational graph is available,
so the framework can analyze operator dependencies and the number of
computing devices, and complete the scheduling and execution of the
entire graph in the following two ways:

1. **Serial**: executes tasks one at a time, in the order that they are
   added to the queue. This method expands a computational graph into a
   sequence of operators, which are then run separately. Operators are
   executed in a static order using a single thread, thereby requiring
   fewer resources.

2. **Parallel**: executes tasks concurrently for higher efficiency.
   This method expands a computational graph based on operator
   dependencies. Operators are executed in the order defined by their
   input dependencies, and those without input dependencies are executed
   concurrently. This method executes operators in a dynamic order
   (which may vary in each iteration) using multiple threads, thereby
   consuming more system resources.

Within a computational graph, most operators are dependent on each other
directly or indirectly. When scheduling such operators, their sequence
must be guaranteed. Figure :numref:`ch04/ch04-diaodu` shows a
computational graph, where a forward pass is run on the input data to
produce a predicted value and then the gradient of the loss function is
computed for backpropagation. In general, downstream operators depend on
the output from the upstream ones. As such, we have to schedule the
operators in this computational graph to a serial queue in order to
ensure that each operator receives the necessary input.


:label:`ch04/ch04-diaodu`

A computational graph may also contain operators independent of each
other, for example, op1 and op2 shown in Figure
:numref:`ch04/ch04-para`. We can have each such operator run on a
different hardware device to implement parallel computing. Compared with
the serial mode, parallel computing decreases execution time by
leveraging more computing resources at the same time.


:label:`ch04/ch04-para`
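
A dependency-driven parallel executor can be sketched as follows (the wave-based scheduler and string "operators" are illustrative assumptions): operators whose dependencies are all satisfied — such as op1 and op2 — run concurrently in the same wave.

```python
from concurrent.futures import ThreadPoolExecutor

results = {}

def make_op(name):
    def op():
        results[name] = name.upper()   # stand-in for a real kernel launch
    return op

ops = {name: make_op(name) for name in ["op1", "op2", "add"]}
deps = {"add": {"op1", "op2"}}         # op1 and op2 are mutually independent

def run_parallel(ops, deps):
    """Run every operator whose dependencies are satisfied as one concurrent wave."""
    done, waves = set(), []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(ops):
            ready = sorted(op for op in ops
                           if op not in done and deps.get(op, set()) <= done)
            list(pool.map(lambda name: ops[name](), ready))  # concurrent wave
            done |= set(ready)
            waves.append(ready)
    return waves

print(run_parallel(ops, deps))  # [['op1', 'op2'], ['add']]
```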

Serial execution and parallel execution have their own advantages and
disadvantages, as summarized in Table :numref:`ch04/ch4-graph`.

:Comparison between serial execution and parallel execution

| Execution Method | Serial execution | Parallel execution |
|----------------------|------------------|--------------------|
| Execution Order | Static | Dynamic |
| Execution Threads | Single thread | Multiple threads |
| Resource Consumption | Low | High |
:label:`ch04/ch4-graph`

A computing environment contains more than one type of computing device,
such as a CPU, a GPU, or another accelerator. A computational graph
consisting of operators that run on more than one type of computing
device is referred to as a heterogeneous computational graph.

Based on the computing hardware, such a graph contains the following
types of operators:

- **CPU operators**: They are C++ operators that run on the host CPU.
  The computing performance of the CPU depends on the extent to which
  its multi-core capability is utilized.

- **GPU operators**: They run on the GPU (e.g., an NVIDIA GPU). GPU
  kernels are delivered to the GPU one by one for execution. The GPU
  features ample parallel computing units that offer significant speedup
  to parallel algorithms.

- **Python operators**: They run on the host CPU. Unlike CPU operators,
  Python operators are interpreted and executed by the Python runtime
  interpreter.

We mentioned earlier that the dynamic graph mechanism relies on the
Python interpreter to distribute operators and execute them serially
according to the order of operators defined by the model code. In this
mode, data often needs to be transferred between different computing
devices. Communication bottlenecks may increase the time operators spend
waiting for data, reducing the overall execution efficiency of the
computational graph. Therefore, the first condition for efficient
execution of a computational graph is to accurately identify the device
where each operator should run and to avoid, as far as possible,
transferring data between different devices, while scheduling
independent operators on different devices in parallel. The static graph
mechanism can get rid of the constraints of the Python interpreter: the
computational graph is sent to the device at one time, which reduces the
number of interactions between the host and the computing chip and
improves computing efficiency and performance.

The combined operator-and-subgraph scheduling mode is a combination of
the previous two execution modes. Because computational graph structures
are flexible, executing an entire graph on the accelerator chip may not
be optimal in complex scenarios. For example, accelerator chips excel at
floating-point operations, whereas CPUs are good at logical judgments.
Therefore, the parts that the accelerator executes inefficiently can be
split out and handed over to devices that execute them more efficiently,
such as the CPU, taking both performance and flexibility into account.

There are different levels of parallelism: operator parallelism, model
parallelism, and data parallelism. Operator parallelism is not just
about executing independent operators in parallel. Where applicable, we
can further partition an operator into multiple parallel child
operations. Model parallelism refers to partitioning a computational
graph among several devices in order to shorten the time taken by each
training iteration. And data parallelism involves training the same
computational graph on different data, reducing the total number of
iterations and improving training efficiency. We will discuss these
three parallelism methods in Chapter Distributed Training.

## Synchronous and Asynchronous Data Loading

As previously mentioned, a single training iteration of a computational
graph goes through three serial tasks: data loading, data preprocessing,
and model training. Each task is dependent on the output of the previous
one. To schedule the three types of tasks in iterative graph training,
we can use the synchronous and asynchronous mechanisms at the iteration
level.

1. **Synchronous**: Tasks are executed in order, one after the other.
   Each task has to wait for, and coordinate with, the others.

2. **Asynchronous**: When a task is complete, the same task in the next
   iteration can be executed immediately.

If the synchronous mechanism is adopted to train the computational graph
shown in Figure :numref:`ch04/ch04-tongbu`, in each iteration, a batch
of input data is loaded, preprocessed, and then passed to the
computational graph for model training and parameter update. Tasks in
the next iteration wait until the current iteration is complete. The
synchronous mechanism wastes computation and communication resources
because the data preprocessing and model training tasks must wait until
a batch of data is completely loaded, and because the I/O channel for
data loading is idle at model training time.


:label:`ch04/ch04-tongbu`

In the asynchronous setting shown in Figure
:numref:`ch04/ch04-yibu`, after loading and passing a batch of input
data to the subsequent data preprocessing task, the I/O channel
immediately moves on to the next batch without waiting for the current
iteration to complete. In contrast with the synchronous mechanism, the
idle time between data loading, data preprocessing, and model training
in the asynchronous mechanism is notably reduced, thereby shortening the
overall training time with improved execution efficiency.


:label:`ch04/ch04-yibu`
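
The asynchronous mechanism can be sketched with a bounded prefetch queue between a loader thread and the training loop (the queue size and batch names are illustrative assumptions):

```python
import queue
import threading

BATCHES = 4
prefetch = queue.Queue(maxsize=2)   # bounded buffer between loading and training

def loader():
    for i in range(BATCHES):
        prefetch.put(f"batch{i}")   # moves on to the next batch immediately
    prefetch.put(None)              # end-of-data sentinel

threading.Thread(target=loader, daemon=True).start()

trained = []
while True:
    batch = prefetch.get()          # training overlaps with loading of later batches
    if batch is None:
        break
    trained.append(batch)

print(trained)  # ['batch0', 'batch1', 'batch2', 'batch3']
```

The bounded queue keeps the I/O channel busy loading ahead while the consumer trains, which is the overlap the asynchronous mechanism exploits.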

To further shorten the training time and improve the execution
efficiency, we can combine the asynchronous mechanism with parallel
computing, as shown in Figure
:numref:`ch04/ch04-yibubingxing`. On the one hand, the asynchronous
mechanism reduces the model's wait time for data loading and
preprocessing, allowing the model to quickly traverse the entire
dataset. On the other hand, parallel computing increases the batch size
in iterative training, increasing the efficiency of computing resources.


:label:`ch04/ch04-yibubingxing`

41
v1/en_chapters/chapter_computational_graph/summary.md
Normal file
@@ -0,0 +1,41 @@

# Chapter Summary

1. The computational graph technology is introduced to machine learning
   frameworks in order to achieve a trade-off between programming
   flexibility and computational efficiency.

2. A computational graph contains tensors (as units of data) and
   operators (as units of operations).

3. A computational graph represents the computational logic and status
   of a machine learning model and offers opportunities for
   optimizations.

4. A computational graph is a directed acyclic graph. Operators in the
   graph are directly or indirectly dependent on, or independent of,
   each other, without circular dependencies.

5. Control flows, represented by conditional control and loop control,
   determine how data flows in a computational graph.

6. Computational graphs come in two types: static and dynamic.

7. Static graphs support easy model deployment, offering high
   computational efficiency and low memory footprint at the expense of
   debugging convenience.

8. Dynamic graphs provide computational results on the fly, which
   increases programming flexibility and makes debugging easy for model
   optimization and iterative algorithm improvement.

9. We can appropriately schedule the execution of operators based on
   their dependencies reflected in computational graphs.

10. For operators that run independently, we can consider concurrent
    scheduling to achieve parallel computing. For operators with
    computational dependencies, we schedule them to run in serial.

11. Specific training tasks of a computational graph can run
    synchronously or asynchronously. The asynchronous mechanism
    effectively improves hardware efficiency and shortens the training
    time.
|
||||
21
v1/en_chapters/chapter_data_processing/data_order.md
Normal file
@@ -0,0 +1,21 @@

## Order Preservation Design

Unlike conventional data-parallel computing tasks, parallel data processing in machine learning scenarios needs to maintain order preservation to ensure experimental reproducibility. In concrete implementations, we need to guarantee that the output order of data after parallel preprocessing remains the same as the input order (i.e., SeqB and SeqA in the figure below are identical). This ensures that the output order of the data module is uniquely determined by the output order of the data shuffling component, helping users compare and debug across different experiments. Different machine learning systems adopt different approaches to ensure order preservation. We use MindSpore's implementation as an example to deepen readers' understanding of this topic.


:width:`800px`
:label:`data_order_definition`

MindSpore ensures order preservation by constraining the communication behavior between operator thread groups so that the input order to the current operator's downstream operator remains the same as its own input order. Based on this recursive constraint, the output order of the last operator in the entire parallel data processing pipeline is guaranteed to be the same as the input order of the first operator. In the specific implementation, MindSpore uses a Connector as the communication component between operator thread groups. The core operations on the Connector are the Push operation by the upstream operator and the Pop operation by the downstream operator. We focus on MindSpore's constraints on these two behaviors.

The usage of Connector has the following two requirements:

- The threads in both the data producer thread group and the data consumer thread group on either side of the Connector are numbered starting from 0.

- The input data order of the data producers must follow a round-robin distribution across producer threads. That is, when the producer thread group size is M, producer thread 0 holds the (0 + M \* k)-th data sample, producer thread 1 holds the (1 + M \* k)-th sample, producer thread 2 holds the (2 + M \* k)-th sample, and so on (where k = 0, 1, 2, 3...).

The Connector maintains the same number of queues as the number of producer threads and ensures that when data is placed into the Connector, each producer thread's data goes only into the correspondingly numbered queue. This guarantees that the distribution of data across different queues in the Connector is the same as the distribution across different producer threads (the Push function in the code snippet). Then, when the Connector's consumer thread group retrieves data from the Connector, we need to ensure that the final data distribution across different consumer threads still follows a round-robin pattern. That is, when the consumer thread group size is N, consumer thread 0 holds the (0 + N \* k)-th data sample, consumer thread 1 holds the (1 + N \* k)-th sample, consumer thread 2 holds the (2 + N \* k)-th sample, and so on (where k = 0, 1, 2, 3...). To achieve this, when a consumer thread requests data from the Connector, the Connector retrieves data from the queues in a round-robin manner, subject to the constraint that the requesting consumer thread number i and the pending data index j satisfy the relationship $i=j\%N$ (where N is the number of consumer threads). If the indices do not satisfy this relationship, the request blocks and waits. Through this communication constraint mechanism, MindSpore achieves order preservation.


:width:`800px`
:label:`mindspore_data_order_implementation`
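The Push/Pop constraints above can be illustrated with a small, hypothetical Python sketch (a single-threaded simplification, not MindSpore's actual C++ implementation, and it omits the blocking-wait on the $i=j\%N$ check): the connector keeps one queue per producer, each producer pushes only into its own queue, and pops are served strictly round-robin across the queues, so the global output order equals the global input order.

```python
import queue

class Connector:
    """Toy order-preserving connector: one queue per producer thread,
    pops served strictly round-robin across those queues."""
    def __init__(self, num_producers):
        self.queues = [queue.Queue() for _ in range(num_producers)]
        self._next = 0  # index of the queue holding the next sample

    def push(self, producer_id, sample):
        # Producer i only ever fills queue i.
        self.queues[producer_id].put(sample)

    def pop(self):
        # Serve queues round-robin: sample j sits in queue j % M, so
        # popping queues 0, 1, ..., M-1, 0, ... restores global order.
        sample = self.queues[self._next].get()
        self._next = (self._next + 1) % len(self.queues)
        return sample

# Round-robin input: producer i holds samples i, i + M, i + 2M, ...
M, total = 3, 9
conn = Connector(M)
for j in range(total):
    conn.push(j % M, j)

out = [conn.pop() for _ in range(total)]
print(out)  # -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Even though the three producers could fill their queues in any interleaving, the round-robin pop discipline alone is enough to reconstruct the original sample order.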
97
v1/en_chapters/chapter_data_processing/extension.md
Normal file
@@ -0,0 +1,97 @@

## Scaling Single-Machine Data Processing Performance

In the previous sections, we introduced how to accelerate data preprocessing through parallel architectures that leverage multi-core CPU computing power to meet the throughput requirements of model computation on accelerator chips for data consumption. This approach can resolve user issues in most cases. However, data consumption performance is growing rapidly year over year with the development of AI chips (i.e., model computation speed is increasing), while the data module, which primarily relies on CPU computing power, cannot benefit from hardware performance improvements due to the gradual end of Moore's Law. This makes it difficult for data production performance to achieve year-over-year breakthroughs comparable to model computation performance. Moreover, in recent years the growth rate of AI chips in AI servers has far exceeded the growth rate of CPUs, further exacerbating the mismatch between the chips' data consumption demands and the data module's data production performance. Taking NVIDIA's DGX series servers as an example, the DGX-1 server is configured with 40 CPU cores and 8 GPU chips. By the next-generation NVIDIA DGX-2, the number of GPU chips grew to 16, while the number of CPU cores only increased from 40 to 48. Since all GPU chips share CPU computing power during training, on average, the computing power available to each GPU chip (data consumer) decreased from 5 CPU cores/GPU with NVIDIA DGX-1 to 3 CPU cores/GPU with NVIDIA DGX-2. The CPU computing power bottleneck prevents users from achieving expected scaling performance when training with multiple cards. To address the problem of insufficient CPU computing power on a single machine, we present two currently common solutions: heterogeneous data processing acceleration based on CPU+AI chips, and distributed data preprocessing scaling.

### Heterogeneous Computing-Based Data Preprocessing

Since AI chips have richer computing resources compared to CPUs, leveraging AI accelerator chips for data preprocessing when CPU computing power becomes the bottleneck is an effective approach. Although AI chips do not possess general-purpose data preprocessing capabilities, most time-consuming data preprocessing operations are Tensor-related computations, such as Fast Fourier Transform (FFT) in speech processing and denoising in image processing, enabling some operations to be offloaded to AI chips for acceleration. For example, the Dvpp module on Huawei's Ascend 310 chip is a built-in hardware decoder that offers stronger image processing performance compared to CPUs. Dvpp supports basic image processing operations such as JPEG image decoding and resizing. In actual data preprocessing, users can designate certain image processing operations to be completed on the Ascend 310 chip to improve data module performance. The following C++ snippet builds such a pipeline and designates the Ascend 310 as its execution device:
```cpp
namespace ms = mindspore;
namespace ds = mindspore::dataset;

// Initialization operations
// ...

// Build data processing operators

// 1. Decode
std::shared_ptr<ds::TensorTransform> decode(new ds::vision::Decode());
// 2. Resize
std::shared_ptr<ds::TensorTransform> resize(new ds::vision::Resize({256}));
// 3. Normalize
std::shared_ptr<ds::TensorTransform> normalize(new ds::vision::Normalize(
    {0.485 * 255, 0.456 * 255, 0.406 * 255}, {0.229 * 255, 0.224 * 255, 0.225 * 255}));
// 4. Center crop
std::shared_ptr<ds::TensorTransform> center_crop(new ds::vision::CenterCrop({224, 224}));

// Build the pipeline and specify using Ascend for computation
ds::Execute preprocessor({decode, resize, center_crop, normalize}, MapTargetDevice::kAscend310, 0);

// Execute the data processing pipeline
ret = preprocessor(image, &image);
```

Compared to Dvpp, which only supports a subset of image preprocessing operations, NVIDIA's DALI :cite:`nvidia_dali` is a more general GPU-based data preprocessing acceleration framework. DALI contains the following three core concepts:

- DataNode: represents a collection of Tensors.

- Operator: an operator that transforms DataNodes; both its input and output are DataNodes. Notably, operators in DALI can be configured to one of three execution modes: cpu, gpu, and mixed. In cpu mode, both the operator's input and output are DataNodes on the CPU. In gpu mode, both are DataNodes on the GPU. In mixed mode, the operator's input is a CPU DataNode while its output is a GPU DataNode.

- Pipeline: a data processing pipeline that users construct by describing the transformation of DataNodes through Operators.

In practice, users decide whether an operator's computation is performed by the CPU or the GPU by setting the operator's execution mode. DALI imposes one additional constraint: once an operator runs in mixed or gpu mode, all of its downstream operators are required to execute in gpu mode.


:width:`800px`
:label:`dali_overview`

Below is an example code snippet demonstrating the construction of a data processing pipeline using DALI. We read image data from files, apply mixed-mode decoding, and then process the images through rotation and resizing operators running on the GPU before returning the results to users. Owing to its excellent performance, DALI is widely used in high-performance inference services and in multi-card training performance optimization.
```python
import nvidia.dali as dali

pipe = dali.pipeline.Pipeline(batch_size=3, num_threads=2, device_id=0)
with pipe:
    files, labels = dali.fn.readers.file(file_root="./my_file_root")
    images = dali.fn.decoders.image(files, device="mixed")
    images = dali.fn.rotate(images, angle=dali.fn.random.uniform(range=(-45, 45)))
    images = dali.fn.resize(images, resize_x=300, resize_y=300)
    pipe.set_outputs(images, labels)

pipe.build()
outputs = pipe.run()
```

### Distributed Data Preprocessing

Distributed data preprocessing is another viable solution to address insufficient CPU computing power. A common approach is to leverage existing big data computing frameworks such as Spark or Dask for data preprocessing and write the results to a distributed file system. The training machines then only need to read the preprocessed result data and proceed with training.


:width:`800px`
:label:`distributed_data_preprocess_based_on_3rd_party_software`

Although this approach is widely used in industry, it faces three problems:

- Since data processing and model training use different frameworks, users often need to write programs in different languages across two different frameworks, increasing the user's burden.

- Since the data processing system and the machine learning system cannot achieve zero-copy data sharing, data serialization and deserialization often become non-negligible additional overhead.

- Since big data computing frameworks are not entirely tailored for machine learning scenarios, certain distributed preprocessing operations such as global data shuffling cannot be efficiently implemented.

To better adapt to data preprocessing in machine learning scenarios, the distributed machine learning framework Ray leverages its own task scheduling capabilities to implement simple distributed data preprocessing --- Ray Dataset :cite:`moritz2018ray`. Since data preprocessing and training reside within the same framework, this reduces the user's programming burden while also eliminating the additional overhead of serialization and deserialization through zero-copy data sharing. Ray Dataset supports simple parallel dataset transformation operators such as map, batch, and filter, as well as basic aggregation operators such as mean. Ray Dataset also supports sorting, random shuffling, GroupBy, and other global shuffle operations. This approach is still under research and development and has not yet been widely adopted; interested readers can consult the relevant materials for further details.

```python
ray.data.read_parquet("foo.parquet") \
   .filter(lambda x: x < 0) \
   .map(lambda x: x**2) \
   .random_shuffle() \
   .write_parquet("bar.parquet")
```
29
v1/en_chapters/chapter_data_processing/index.md
Normal file
@@ -0,0 +1,29 @@

# Data Processing Framework

In the previous two chapters, we introduced the frontend and backend of compilers, elaborating on the optimization process of transforming source programs into target programs. Beyond enabling high-performance execution on accelerator chips during training and inference, we also need to efficiently deliver data to these chips to achieve optimal end-to-end performance. Machine learning model training and inference require loading datasets from storage devices (such as local disks, memory, and remote storage systems), performing a series of processing transformations on the datasets, and sending the processed results to GPUs, Huawei Ascend, or other accelerators for model computation. Performance issues at any step in this pipeline can negatively impact training and inference throughput. In this chapter, we will focus on how to design and implement a data system tailored for machine learning scenarios, helping users easily construct various complex data pipelines while ensuring sufficiently high execution performance so that data preprocessing does not become a performance bottleneck for model training and inference.

This chapter introduces the data module in machine learning systems from three dimensions: usability, efficiency, and order preservation. In the first two sections, we discuss how to build a user-friendly data module, including how to design programming abstractions that allow users to describe complex preprocessing workflows in just a few lines of code, and how to provide rich built-in operators for usability while flexibly supporting user-defined operators to cover long-tail requirements. After users construct their data processing workflows, the data module is responsible for efficiently scheduling and executing the data pipeline to achieve optimal data processing throughput. Efficiently executing the data pipeline is a challenging task, as we face both I/O performance issues in data loading and computational performance issues in data processing. To address these challenges, we will introduce file format designs for high-throughput data loading, as well as parallel architecture designs that fully leverage multi-core CPU computing power. Moreover, unlike conventional data-parallel computing tasks, most machine learning scenarios have special `order preservation` requirements for data input and output sequences. We will dedicate a section to introducing what order preservation is and how to design corresponding components within the data module's parallel architecture to meet this requirement. After studying the above content, readers will gain a deep understanding of how to build an efficient and user-friendly data module for machine learning scenarios. Finally, as extended content, we will draw on practical experience from both academia and industry to introduce how to scale our data processing module to meet training performance requirements when single-machine processing performance is insufficient. The learning objectives of this chapter include:

- Understand the key components and their functions in the machine learning data module architecture

- Understand the design of different data module user programming interfaces

- Master file format design for high-performance data loading

- Master the parallel architecture of the data module in machine learning systems

- Master the concept of and solutions for data order preservation in machine learning system data modules

- Understand two approaches for scaling single-machine data processing performance


```toc
:maxdepth: 2

requirements
program_model
performance
data_order
extension
summary
```
182
v1/en_chapters/chapter_data_processing/performance.md
Normal file
@@ -0,0 +1,182 @@

## Efficiency Design

In the previous section, we focused on the programming abstractions and interface design of the data module, ensuring that users can conveniently describe data processing workflows based on the APIs we provide without needing to worry too much about implementation and execution details. In this section, we will further explore the design details of key data module components such as data loading and pipeline scheduling to ensure that users can achieve optimal data processing performance. Throughout this section, we will also draw on practical experience from major existing machine learning systems to help readers deepen their understanding of these critical design approaches.

As shown in :numref:`async_data_process`, deep learning model training requires the data module to first load datasets from storage devices, perform a series of preprocessing transformations in memory, and finally send the processed data to accelerator chips for model computation. Currently, a large body of work focuses on accelerating model computation on chips through new hardware designs or operator compilation techniques, with relatively little attention paid to data processing pipeline performance issues. However, in many cases, the execution time of data preprocessing occupies a substantial proportion of the entire training task, preventing GPUs, Huawei Ascend, and other accelerators from being fully utilized. Research has shown that approximately 30% of computation time in enterprise data center workloads is spent on data preprocessing steps :cite:`murray2021tf`, and other studies have found that model training tasks on some public datasets spend 65% of their time on data preprocessing :cite:`mohan2020analyzing`. This clearly demonstrates that data module performance has a decisive impact on overall training throughput.


:width:`800px`
:label:`async_data_process`

To pursue maximum training throughput, existing systems generally choose to execute data loading, data preprocessing computation, and on-chip model computation asynchronously in parallel. These three steps form a typical producer-consumer upstream-downstream relationship. We denote the data loading rate from storage devices as F, the data preprocessing rate as P, and the on-chip data consumption rate as G. Ideally, we want G < min(F, P), so that the accelerator chip is never blocked waiting for data. However, in practice, we often encounter situations where either the data loading rate F is too low (known as I/O Bound) or the data preprocessing rate P is too low (known as CPU Bound), causing G > min(F, P) and leaving the chip underutilized. To address these critical performance issues, this section will focus on two topics:

- How to design appropriate file formats and loading methods for the specific I/O requirements of machine learning scenarios to optimize the data loading rate F.

- How to design parallel architectures that fully leverage the computing power of modern multi-core CPUs to improve the data processing rate P.

At the end of this section, we will also examine a challenging problem: how to leverage the computational graph compilation techniques learned in previous chapters to optimize the user's data processing computation graph and further improve data processing throughput. With that, let us dive into the details.
### Efficiency of Data Loading

First, let us examine how to address the performance challenges of data loading. The first problem we face is the I/O differences caused by diverse data types and non-uniform storage formats. For example, text data may be stored in txt format, and image data may be stored in raw format or compressed formats such as JPEG. We obviously cannot design an optimal data loading scheme for every possible storage scenario. However, we can propose a unified storage format (which we call the Unirecord format) to shield against I/O differences across different data types, and then design and optimize data loading schemes based on this format. In practice, users simply need to convert their original datasets to our unified data format to benefit from efficient read performance.


:width:`800px`
:label:`unified_record_format`

So what other characteristics should our Unirecord format have beyond unifying user storage formats? Data access in machine learning model training has the following characteristics:

- Within each epoch, all data is traversed in a random order, with each data sample visited exactly once.

- Across epochs, the data must be traversed in different random orders.

The above access patterns require that our Unirecord storage format supports efficient random access. When our dataset can fit entirely in RAM, random access to Unirecord is not a major issue. However, when the dataset is too large and must be stored on local disks or distributed file systems, we need to design specific solutions. An intuitive approach is to divide a Unirecord file into an index block and a data block. The index block records metadata for each data sample, such as its size, offset within the file, and checksum values. The data block stores the actual data for each sample. When we need to perform random access on a Unirecord-format file, we first load the file's index block into memory (which is typically much smaller than the entire file) and build an in-memory index table for the data in the file. Then, when we need to randomly access a data sample, we first look up the sample's offset, size, and other information in the index table and read the data from disk based on this information. This loading approach satisfies our random access requirements on disk. Next, we will use the practical experience of MindRecord proposed by MindSpore as an example to introduce the design of a unified file format and help deepen understanding of this topic.


:width:`800px`
:label:`file_random_access`
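The index-block design above can be sketched in a few lines of Python (a toy illustration with a hypothetical record layout, not MindRecord's actual on-disk format): the writer appends each sample to a data block and records its (offset, size) pair; the reader needs only the small index to seek to any sample on demand.

```python
import io
import struct

def write_unirecord(samples):
    """Toy Unirecord: [data block][index block][trailer: index offset, count]."""
    buf = io.BytesIO()
    index = []
    for s in samples:                       # data block: raw samples back to back
        index.append((buf.tell(), len(s)))  # remember (offset, size) per sample
        buf.write(s)
    index_off = buf.tell()
    for off, size in index:                 # index block: fixed-size entries
        buf.write(struct.pack("<QQ", off, size))
    buf.write(struct.pack("<QQ", index_off, len(index)))  # trailer
    return buf.getvalue()

def read_sample(blob, i):
    """Random access: read the trailer, then index entry i, then the data."""
    index_off, count = struct.unpack_from("<QQ", blob, len(blob) - 16)
    off, size = struct.unpack_from("<QQ", blob, index_off + 16 * i)
    return blob[off:off + size]

blob = write_unirecord([b"cat", b"a-long-dog-record", b"bird"])
print(read_sample(blob, 1))  # -> b'a-long-dog-record'
```

A real format would additionally store per-sample checksums and cap the file size so the index stays small relative to the data; only the fixed-size index entries need to live in memory during training.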
#### Introduction to MindRecord

MindRecord is the unified data format introduced by MindSpore, with the goal of normalizing user datasets and optimizing the training data loading process. This file format has the following characteristics:

- Enables unified storage and access of diverse user data, making training data loading more convenient.

- Aggregates data storage for efficient reading, while being easy to manage and transfer.

- Provides efficient data encoding and decoding operations, transparent and imperceptible to users.

- Offers flexible control over partition sizes, facilitating distributed training.

Similar to the Unirecord design described earlier, a MindRecord file also consists of data files and index files. The data file contains a file header, scalar data pages, and block data pages for storing users' normalized training data. The index file contains index information generated based on scalar data (such as image labels and image filenames) for convenient retrieval and statistical analysis of dataset information. To ensure random access performance for a single MindRecord file, MindSpore recommends that each MindRecord file be smaller than 20 GB. If a dataset exceeds 20 GB, users can specify the corresponding parameters during MindRecord dataset generation to shard the original dataset into multiple MindRecord files.


:width:`800px`
:label:`mindrecord_format`

The key components of the data file portion of a MindRecord file are as follows:

- **File Header**
  The file header stores the file header size, scalar data page size, block data page size, Schema information, index fields, statistical information, file partition information, and the correspondence between scalar data and block data. It serves as the metadata of the MindRecord file.

- **Scalar Data Pages**
  Scalar data pages store integer, string, and floating-point data, such as image labels, image filenames, image dimensions, and other information suitable for scalar storage.

- **Block Data Pages**
  Block data pages store binary strings, NumPy arrays, and similar data, such as binary image files themselves and dictionaries converted from text.

During training, MindRecord's reader can quickly locate the position of data based on index files, then read and decode the data. Additionally, MindRecord possesses certain retrieval capabilities, allowing users to filter and obtain data samples that meet their expectations by specifying query conditions.

For distributed training scenarios, MindRecord loads metadata based on the Header in data files and index files to obtain the IDs of all samples and their offset information within data files. It then performs data partitioning based on the user-input num_shards (number of training nodes) and shard_id (current node ID), obtaining 1/num_shards of the data for the current node. In other words, during distributed training, each of the multiple nodes reads only 1/num_shards of the dataset, and the effect of training on the entire dataset is achieved through AllReduce on the computation side. Furthermore, if users enable the shuffle operation, the shuffle seed is kept consistent across all nodes within each epoch, ensuring that the ID shuffle results for all samples are consistent, which in turn ensures correct data partitioning.


:width:`800px`
:label:`mindrecord_partition`
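The partitioning scheme described above can be sketched as follows (a simplified illustration of the idea, not MindRecord's implementation): every node shuffles the full list of sample IDs with the same shared seed and then keeps the contiguous 1/num_shards slice selected by its shard_id, so the shards are disjoint and together cover the whole dataset.

```python
import random

def shard_sample_ids(num_samples, num_shards, shard_id, epoch_seed):
    """Return the sample IDs one node should read in the current epoch."""
    ids = list(range(num_samples))
    # The seed is shared by all nodes within an epoch, so every node
    # computes the SAME global shuffle order; partitioning stays correct.
    random.Random(epoch_seed).shuffle(ids)
    per_shard = num_samples // num_shards
    return ids[shard_id * per_shard:(shard_id + 1) * per_shard]

# Toy check: 4 nodes, 12 samples -- shards are disjoint and cover all IDs.
shards = [shard_sample_ids(12, 4, i, epoch_seed=2024) for i in range(4)]
all_ids = sorted(sum(shards, []))
print(all_ids == list(range(12)))  # -> True
```

Varying `epoch_seed` from epoch to epoch yields a fresh global order each time, while keeping it identical across nodes guarantees that no sample is read twice and none is dropped.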
### Efficiency of Data Computation

After addressing the data loading performance issue, let us continue to study how to improve data computation performance (i.e., maximizing the data processing rate P mentioned earlier). We will use the data preprocessing pipeline mentioned above as an example to study how to design the data module's scheduling and execution of user computation graphs to achieve optimal performance.


:width:`800px`
:label:`serialized_data_process`

Since deep learning chips such as GPUs and Huawei Ascend do not possess general-purpose data processing capabilities, we currently still rely primarily on CPUs to complete preprocessing computation. Mainstream AI servers are equipped with multiple multi-core CPUs, and the data module needs to design reasonable parallel architectures to fully leverage multi-core computing power, thereby improving data preprocessing performance and minimizing accelerator stalls caused by waiting for data. In this section, we will introduce two common parallel architectures: pipeline-level parallelism and operator-level parallelism. Pipeline parallelism has a clear structure, is easy to understand and implement, and is primarily adopted by machine learning systems like PyTorch that implement data modules in Python. Influenced by the scheduling and execution architecture designs of classic data-parallel systems, other systems such as Google's TensorFlow and Huawei's MindSpore primarily adopt operator-level parallelism for fine-grained CPU resource allocation to fully utilize multi-core computing power. However, fine-grained allocation means we need to set reasonable parallelism parameters for all operators involved in the data processing pipeline, which poses a significant challenge for users. Consequently, frameworks like MindSpore also provide automatic tuning of key parameters in the data flow graph. Through dynamic analysis at runtime, the system automatically searches for optimal operator parallelism parameters, greatly reducing the user's programming burden. Let us now discuss each approach in detail.
#### Pipeline Parallelism

The first common parallelism approach is pipeline-level parallelism, where the user's constructed computation pipeline is executed sequentially within a single thread/process, while multiple threads/processes are launched to execute multiple pipelines in parallel. If users need to process a total of N data samples, then with pipeline parallelism degree M, each process/thread only needs to process N/M samples. Pipeline parallelism has a simple architecture and is easy to implement. Within the entire parallel architecture, each executing process/thread only needs to communicate across processes/threads at the beginning and end of data execution. The data module distributes pending data tasks to each pipeline process/thread and finally aggregates the results to send to the chip for model computation. From the user's perspective, usage is also relatively convenient, requiring only the specification of the key parallelism degree parameter. Let us use PyTorch as an example for detailed elaboration.


:width:`800px`
:label:`pipeline_parallisim`

In PyTorch, users only need to implement a Dataset Python class to write the data processing logic. The Dataloader launches the corresponding number of Python processes based on the user-specified parallelism parameter num_workers to invoke the user-defined Dataset class for data preprocessing. The Dataloader has two types of process roles: worker processes and the main process, along with two types of inter-process communication queues: index_queue and worker_result_queue. During training, the main process sends the list of pending data tasks to each worker process through index_queue. Each worker process executes the data preprocessing logic of the user-written Dataset class and returns the processed results to the main process through worker_result_queue.


:width:`800px`
:label:`pytorch_dataloader`
|
||||
Next, we present a code snippet of using PyTorch's Dataloader for parallel data preprocessing. We can see that we only need to implement the Dataset class to describe the data preprocessing logic and specify num_workers to achieve pipeline-level parallel data preprocessing.

```python
import torch
from torch.utils.data import DataLoader

# Describe the data preprocessing workflow
class TensorDataset:
    def __init__(self, inps):
        self.inps = inps

    def __getitem__(self, idx):
        data = self.inps[idx]
        data = data + 1
        return data

    def __len__(self):
        return self.inps.shape[0]

inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
dataset = TensorDataset(inps)

# Set parallelism degree to 3
# (on platforms using the "spawn" start method, wrap the following
# in an `if __name__ == "__main__":` guard)
loader = DataLoader(dataset, batch_size=2, num_workers=3)

for batch_idx, sample in enumerate(loader):
    print(sample)
```
Finally, it should be noted that PyTorch Dataloader's execution involves extensive inter-process communication. Although PyTorch has implemented shared memory-based inter-process communication for Tensor-type data to accelerate this step, when the communication data volume is large, cross-process communication can still significantly impact end-to-end data preprocessing throughput performance. Of course, this is not an architectural issue with pipeline parallelism itself, but rather a consequence of CPython's Global Interpreter Lock (GIL), which forces pipeline parallelism at the Python level to use process parallelism rather than thread parallelism. To address this issue, the PyTorch team is currently attempting to remove the GIL from CPython to achieve thread-based pipeline parallelism for improved communication efficiency :cite:`rmpygil`. Interested readers can explore this topic further.
#### Operator Parallelism
In pipeline parallelism, computing resources (CPU cores) are allocated at the pipeline granularity. In contrast, operator parallelism allocates resources at the operator granularity, pursuing a more fine-grained resource allocation approach. We aim to assign higher parallelism to operators with greater computation costs and lower parallelism to operators with lesser computation costs, achieving more efficient and reasonable CPU resource utilization. The idea of operator parallelism is in the same spirit as classic data-parallel computing system parallelism. Taking classic MapReduce execution as an example, we can see that this can also be considered a form of operator parallelism (map operators and reduce operators), where the parallelism degree of map operators and reduce operators is determined by the computation cost of each operator phase.

![MapReduce execution as operator parallelism](../img/ch07/mapreduce.png)
:width:`800px`
:label:`mapreduce`

In the figure below, we present the operator parallelism architecture diagram for the data preprocessing pipeline introduced at the beginning of this section. Based on the computation cost of each operator, we set the image decoding operator parallelism to 3, image resizing parallelism to 2, image random rotation operator parallelism to 4, image normalization operator parallelism to 3, and image channel transposition operator parallelism to 1. We aim to achieve efficient and full utilization of computing resources by precisely allocating resources to operators with different computation costs. In specific implementations, operator parallelism generally uses thread-level parallelism, with all operators communicating through shared memory using inter-thread queues and similar methods.

![Operator-parallel data preprocessing](../img/ch07/operator_parallisim.png)
:width:`800px`
:label:`operator_parallisim`

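As a minimal stdlib-only illustration of this architecture, the following sketch gives each operator stage its own thread pool and an inter-stage queue; the pool size plays the role of the operator's parallelism degree. The `decode`/`resize` functions here are toy stand-ins for the real image operators:

```python
import queue
import threading

def decode(x):  return x + 1      # stand-in for image decoding
def resize(x):  return x * 2      # stand-in for image resizing

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()

def stage(fn, src, dst):
    # Each thread of a stage pulls from its input queue and pushes downstream
    while True:
        item = src.get()
        if item is None:          # sentinel: shut down this thread
            break
        dst.put(fn(item))

# Per-operator parallelism degree: decode = 3 threads, resize = 2 threads
decode_threads = [threading.Thread(target=stage, args=(decode, q_in, q_mid)) for _ in range(3)]
resize_threads = [threading.Thread(target=stage, args=(resize, q_mid, q_out)) for _ in range(2)]
for t in decode_threads + resize_threads:
    t.start()

N = 8
for x in range(N):
    q_in.put(x)

# Collect exactly N results; completion order may differ from input order
results = sorted(q_out.get() for _ in range(N))

# Shut down: one sentinel per thread of each stage, upstream first
for _ in decode_threads: q_in.put(None)
for t in decode_threads: t.join()
for _ in resize_threads: q_mid.put(None)
for t in resize_threads: t.join()

print(results)  # [2, 4, 6, 8, 10, 12, 14, 16]
```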
Among existing machine learning system data modules, tf.data and MindData both adopt the operator parallelism approach. Due to more efficient resource utilization and high-performance data flow scheduling implemented in C++, operator parallelism approaches often demonstrate better performance. Performance evaluations of tf.data show that it has nearly twice the performance advantage compared to PyTorch's Dataloader :cite:`murray2021tf`.

Next, we use a MindSpore-based implementation of the data preprocessing pipeline described at the beginning of this section to demonstrate how to set the parallelism degree for each operator in an operator-parallel data pipeline.

```python
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as c_transforms
import mindspore.dataset.transforms.vision.c_transforms as vision

# Load data
dataset_dir = "path/to/imagefolder_directory"
dataset = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8)

num_classes = 10  # assumed number of label classes
onehot_op = c_transforms.OneHot(num_classes)

# Each operator is applied in its own map() so that its parallelism
# degree can be set individually.
# Decoding operator parallelism degree: 3
dataset = dataset.map(input_columns="image", operations=vision.Decode(), num_parallel_workers=3)
# Resizing operator parallelism degree: 2
dataset = dataset.map(input_columns="image", operations=vision.Resize((256, 256)), num_parallel_workers=2)
# Random rotation operator parallelism degree: 4
dataset = dataset.map(input_columns="image", operations=vision.RandomRotation((0, 15)), num_parallel_workers=4)
# Normalization operator parallelism degree: 3
dataset = dataset.map(input_columns="image", operations=vision.Normalize((100.0, 115.0, 121.0), (71.0, 68.0, 70.0)), num_parallel_workers=3)
# Channel transposition operator parallelism degree: 1
dataset = dataset.map(input_columns="image", operations=vision.HWC2CHW(), num_parallel_workers=1)
dataset = dataset.map(input_columns="label", operations=onehot_op)
```
We observe that while operator parallelism has higher performance potential, it requires us to set reasonable parallelism parameters for each operator. This not only places high demands on users but also increases the risk of performance degradation due to unreasonable parameter settings. To make operator parallelism easier for users, both tf.data and MindData have added dynamic tuning of key pipeline parameters, computing reasonable parameters based on runtime performance monitoring of the pipeline execution to achieve optimal data preprocessing throughput as much as possible :cite:`murray2021tf`.
#### Data Processing Computation Graph Optimization
In the preceding text, we focused on efficiently executing the user's constructed data preprocessing computation graph through parallel architectures. However, we can consider the following question: Is the computation graph given by the user an efficient one?
If not, can we optimize and rewrite the user's data computation graph under the premise of equivalent transformation to obtain a computation graph with better expected execution performance? Indeed, this shares the same philosophy as the model computation graph compilation optimization we studied in previous chapters --- achieving better execution performance by analyzing and transforming the computation graph IR to obtain a more optimal IR representation.

Common data graph optimization strategies include operator fusion and map operation vectorization. Operator fusion merges operator combinations such as map+map, map+batch, map+filter, and filter+filter into equivalent composite operators, combining computations that originally required execution in two thread groups into composite computations executed in a single thread group. This reduces inter-thread synchronization and communication overhead, achieving better performance. Map operation vectorization transforms the common dataset.map(f).batch(b) operation combination into dataset.batch(b).map(parallel_for(f)), leveraging modern CPUs' parallelism-friendly SIMD instruction sets to accelerate data preprocessing.
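The map-vectorization rewrite can be illustrated with plain Python. Here `f` is a toy per-element transform and `parallel_for` is a hypothetical helper that turns it into a whole-batch function (the batched inner loop is what a vectorizing runtime can lower to SIMD instructions):

```python
def f(x):
    return x * 2 + 1                 # toy per-element preprocessing

def batch(data, b):
    return [data[i:i + b] for i in range(0, len(data), b)]

data = list(range(10))

# dataset.map(f).batch(b): apply f element by element, then group
out1 = batch([f(x) for x in data], b=4)

# dataset.batch(b).map(parallel_for(f)): group first, then apply f
# to a whole batch at once
def parallel_for(fn):
    return lambda chunk: [fn(x) for x in chunk]

out2 = [parallel_for(f)(chunk) for chunk in batch(data, b=4)]

assert out1 == out2                  # the rewrite is semantics-preserving
```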
|
||||
121
v1/en_chapters/chapter_data_processing/program_model.md
Normal file
121
v1/en_chapters/chapter_data_processing/program_model.md
Normal file
@@ -0,0 +1,121 @@
|
||||
## Usability Design
In this section, we focus on how to design a user-friendly data module for machine learning systems. As mentioned earlier, usability requires the data module to provide good programming abstractions and interfaces so that users can conveniently construct data processing pipelines, while also supporting users in flexibly registering and using custom operators within the data pipeline to meet diverse and specialized requirements. We will explore this topic from two aspects: programming interface abstraction and custom operator registration mechanisms.
### Programming Abstraction and Interfaces
In :numref:`image_process_pipeline`, we present a classic data preprocessing pipeline for training an image classification model. After loading the dataset from storage devices, we perform a series of operations on the image data, including decoding, resizing, rotation, normalization, and channel transposition. We also apply specific preprocessing operations to the dataset labels, and finally send the processed data to the accelerator chip for model computation. We hope that the programming abstractions provided by the data module are sufficiently high-level so that users can describe the data processing logic in just a few lines of code without getting bogged down in excessive, repetitive implementation details. At the same time, we need to ensure that this set of high-level abstractions is sufficiently general to meet diverse data preprocessing requirements. Once we have a good programming abstraction, we will use a code snippet that implements the data preprocessing pipeline described in the figure below using MindSpore's data module programming interfaces as an example to demonstrate how significantly a well-designed programming abstraction can reduce the user's programming burden.

![A typical image preprocessing pipeline](../img/ch07/image_process_pipeline.png)
:width:`800px`
:label:`image_process_pipeline`

In fact, programming abstractions for data computation have long been extensively studied in the field of general-purpose data-parallel computing systems, and a relatively unified consensus has been reached --- that is, to provide LINQ-style :cite:`meijer2006linq` programming abstractions. The key characteristic is to let users focus on describing dataset creation and transformations, while delegating the efficient implementation and scheduling of these operations to the data system's runtime. Some excellent systems such as Naiad :cite:`murray2013naiad`, Spark :cite:`zaharia2010spark`, and DryadLINQ :cite:`fetterly2009dryadlinq` have all adopted this programming model. We will use Spark as an example for a brief introduction.

Spark provides users with a programming model based on the concept of Resilient Distributed Datasets (RDD). An RDD is a read-only distributed data collection. Users primarily describe the creation and transformation of RDDs through Spark's programming interfaces. Let us elaborate with a Spark example. The following code demonstrates counting the number of lines containing the "ERROR" field in a log file. We first create a distributed dataset `file` by reading from a file (as mentioned earlier, an RDD represents a collection of data; here `file` is actually a collection of log lines). We apply a filter operation to this `file` dataset to obtain a new dataset `errs` that retains only log lines containing the "ERROR" field. Then we apply a map operation to each element in `errs` to obtain the dataset `ones`. Finally, we perform a reduce operation on the `ones` dataset to get our desired result --- the number of log lines containing the "ERROR" field in the `file` dataset.

```scala
val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
val ones = errs.map(_ => 1)
val count = ones.reduce(_ + _)
```
We can see that users need only four lines of code to accomplish the complex task of counting specific field occurrences in a distributed dataset. This is made possible by Spark's core RDD programming abstraction. From the computation flow visualization in :numref:`rdd_transformation_example`, we can also clearly see that after creating the dataset, users only need to describe the operators applied to the dataset, while the execution and implementation of the operators are handled by the system's runtime.

![RDD transformation example](../img/ch07/rdd_transformation_example.png)
:width:`800px`
:label:`rdd_transformation_example`

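The same log-counting computation can be written in ordinary Python as a local, non-distributed analogue of the RDD chain above (`log_lines` is made-up sample data):

```python
from functools import reduce

log_lines = [
    "INFO  server started",
    "ERROR disk full",
    "WARN  slow response",
    "ERROR connection reset",
]

errs = filter(lambda line: "ERROR" in line, log_lines)  # file.filter(...)
ones = map(lambda _: 1, errs)                           # errs.map(_ => 1)
count = reduce(lambda a, b: a + b, ones)                # ones.reduce(_ + _)
print(count)  # 2
```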
The data modules in mainstream machine learning systems have also adopted similar programming abstractions, such as TensorFlow's data module tf.data :cite:`murray2021tf` and MindSpore's data module MindData. Next, we will use MindData's interface design as an example to introduce how to design good programming abstractions for the machine learning scenario to help users conveniently construct the diverse data processing pipelines needed in model training.

MindData is the data module of the machine learning system MindSpore, primarily responsible for data preprocessing in machine learning model training. The core programming abstraction that MindData provides to users is based on Dataset transformations. Here, a Dataset is a data frame (DataFrame) concept: a multi-row, multi-column relational data table in which each column has a column name.

![MindSpore Dataset example](../img/ch07/mindspore_dataset_example.png)
:width:`800px`
:label:`mindspore dataset example`

Based on this programming model, combined with the key processing steps in the machine learning data workflow introduced in the first section, MindData provides users with dataset operators for performing shuffle, map, batch, and other transformations on datasets. These operators take a Dataset as input and produce a newly processed Dataset as output. Typical dataset transformation interfaces are listed as follows:

:Dataset operation interfaces supported by MindSpore

| Dataset Operation    | Description                                                        |
| -------------------- | ------------------------------------------------------------------ |
| batch                | Groups multiple data rows in the dataset into a mini-batch |
| map                  | Applies transformation operations to each data row in the dataset |
| shuffle              | Randomly shuffles the order of data rows in the dataset |
| filter               | Filters data rows in the dataset, retaining only rows that pass the filter condition |
| prefetch             | Prefetches data from the storage medium |
| project              | Selects certain columns from the Dataset table for subsequent processing |
| zip                  | Merges multiple datasets into one dataset |
| repeat               | In multi-epoch training, repeats the entire data pipeline multiple times |
| create_dict_iterator | Creates an iterator that returns dictionary-type data for the dataset |
| ...                  | ... |

The above describes the dataset interface abstractions, while the specific operations on datasets are actually defined by concrete data operator functions. For user convenience, MindData has built-in implementations of rich data operator libraries for common data types and their common processing needs in the machine learning domain. For the vision domain, MindData provides common operators such as Decode, Resize, RandomRotation, Normalize, and HWC2CHW (channel transposition); for the text domain, MindData provides operators such as Ngram, NormalizeUTF8, and BertTokenizer; for the audio domain, MindData provides operators such as TimeMasking, LowpassBiquad, and ComplexNorm. These commonly used operators can cover the vast majority of user requirements.
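To make the transformation-based abstraction concrete, here is a toy, pure-Python sketch in which each operation returns a new Dataset, so pipelines are built by chaining. This is an illustration only, not MindData's actual implementation, which schedules these operators in C++:

```python
import random

class Dataset:
    def __init__(self, rows):
        self.rows = list(rows)

    def map(self, fn):
        return Dataset(fn(r) for r in self.rows)

    def filter(self, pred):
        return Dataset(r for r in self.rows if pred(r))

    def shuffle(self, seed=0):
        rows = list(self.rows)
        random.Random(seed).shuffle(rows)
        return Dataset(rows)

    def batch(self, b):
        return Dataset(self.rows[i:i + b] for i in range(0, len(self.rows), b))

pipeline = (Dataset(range(10))
            .map(lambda x: x * x)        # square each row
            .filter(lambda x: x % 2 == 0)  # keep even squares
            .shuffle(seed=42)
            .batch(2))                   # group into mini-batches of 2
print(pipeline.rows)
```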

In addition to supporting flexible Dataset transformations, MindData also provides flexible Dataset creation to address the challenge of numerous dataset types with varying formats and organizations. There are mainly three categories:

- Creating from built-in datasets: MindData has a rich set of built-in classic datasets, such as CelebADataset, Cifar10Dataset, CocoDataset, ImageFolderDataset, MnistDataset, VOCDataset, etc. If users need to use these common datasets, they can achieve out-of-the-box usage with a single line of code. MindData also provides efficient implementations for loading these datasets to ensure users enjoy the best read performance.

- Loading from MindRecord: MindRecord is a high-performance, general-purpose data storage file format designed for MindData. Users can convert their datasets to MindRecord and then leverage MindSpore's relevant APIs for efficient reading.

- Creating from a Python class: If users already have a Python class for reading their dataset, they can use MindData's GeneratorDataset interface to call that Python class to create a Dataset, providing users with great flexibility.

![MindData Dataset creation](../img/ch07/minddata_dataset.png)

Finally, we use an example of implementing the data processing pipeline described at the beginning of this section using MindData to demonstrate how user-friendly the Dataset-centric data programming abstraction is. We need only about 10 lines of code to accomplish our desired complex data processing. Throughout the entire process, we focus solely on describing the logic, while delegating operator implementation and execution scheduling to the data module, which greatly reduces the user's programming burden.

```python
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as c_transforms
import mindspore.dataset.transforms.vision.c_transforms as vision

dataset_dir = "path/to/imagefolder_directory"
num_classes = 10  # assumed number of label classes

# create a dataset that reads all files in dataset_dir with 8 threads
dataset = ds.ImageFolderDatasetV2(dataset_dir, num_parallel_workers=8)

# create a list of transformations to be applied to the image data
transforms_list = [vision.Decode(),
                   vision.Resize((256, 256)),
                   vision.RandomRotation((0, 15)),
                   vision.Normalize((100.0, 115.0, 121.0), (71.0, 68.0, 70.0)),
                   vision.HWC2CHW()]
onehot_op = c_transforms.OneHot(num_classes)

# apply the transforms to the dataset through dataset.map()
dataset = dataset.map(input_columns="image", operations=transforms_list)
dataset = dataset.map(input_columns="label", operations=onehot_op)
```
### Custom Operator Support
With the dataset transformation-based programming abstraction and the rich transformation operator support for various data types in machine learning, we can cover the vast majority of user data processing needs. However, since the machine learning field itself is rapidly evolving with new data processing requirements constantly emerging, there may be situations where a data transformation operator that users want to use is not covered by the data module. Therefore, we need to design a well-crafted user-defined operator registration mechanism so that users can conveniently use custom operators when constructing data processing pipelines.

In machine learning scenarios, Python is the primary development programming language for users, so we can assume that user-defined operators are more often Python functions or Python classes. The difficulty of supporting custom operators in the data module is mainly related to how the data module schedules computation. For example, PyTorch's dataloader primarily implements computation scheduling at the Python level, and thanks to Python's flexibility, inserting custom operators into the dataloader's data pipeline is relatively straightforward. In contrast, systems like TensorFlow's tf.data and MindSpore's MindData primarily implement computation scheduling at the C++ level, making it more challenging for the data module to flexibly insert user-defined Python operators into the data flow. Next, we will use MindData's custom operator registration and usage implementation as an example to discuss this topic in detail.

![MindData operator categories](../img/ch07/minddata_operator.png)
:width:`800px`
:label:`mindspore operator example`

Data preprocessing operators in MindData can be divided into C-level operators and Python-level operators. C-level operators provide higher execution performance, while Python-level operators can conveniently leverage rich third-party Python packages for development. To flexibly cover more scenarios, MindData supports users in developing custom operators using Python. If users pursue higher performance, MindData also supports users in compiling their C-level operators and registering them as plugins in MindSpore's data processing pipeline.

For custom data processing operators passed into dataset transformation operators such as map and filter, MindData's Pipeline executes them through the created Python runtime after startup. Note that custom Python operators must take numpy.ndarray inputs and return numpy.ndarray outputs. During execution, when MindData's Pipeline encounters a user-defined PyFunc operator in a dataset transformation, it passes the input data to the user's PyFunc as numpy.ndarray; after the custom operator finishes execution, the result is returned to MindData as numpy.ndarray. Throughout this process, the executing dataset transformation operator (such as map or filter) is responsible for the PyFunc's runtime lifecycle and exception handling.

If users pursue higher performance, MindData also supports user-defined C operators. The dataset-plugin repository :cite:`minddata` serves as MindData's operator plugin repository, encompassing operators tailored for specific domains (remote sensing, medical imaging, meteorology, etc.). This repository carries MindData's plugin capability extensions and provides a convenient entry point for users to write new MindData operators. Users can write an operator, compile and install the plugin, and then use the newly developed operator in the map operations of the MindData Pipeline.

![User-defined operators in MindData](../img/ch07/mindata_pyfunc.png)
:width:`800px`
:label:`mindspore_user_defined_operator`

## Overview
Data processing in machine learning scenarios is a typical ETL (Extract, Transform, Load) process. The first stage (Extract) loads datasets from storage devices, and the second stage (Transform) performs transformations on the datasets. Although different machine learning systems adopt different technical approaches when building their data modules, the core components generally include data loading, data shuffling, data transformation, data mini-batch assembly, and data sending. The functionality of each component is described as follows:

- **Data Loading Component (Load)**: Responsible for loading and reading datasets from storage devices. It must consider both the diversity of storage devices (e.g., local disk/memory, remote disk and memory, etc.) and the diversity of dataset formats (e.g., csv format, txt format, etc.). Based on the characteristics of machine learning tasks, AI frameworks have also proposed unified data storage formats (e.g., Google's TFRecord, Huawei's MindRecord, etc.) to provide higher-performance data loading.

- **Data Shuffling Component (Shuffle)**: Responsible for randomly shuffling the order of input data according to user-specified methods to improve model robustness.

- **Data Transformation Component (Map)**: Responsible for performing data transformations, with built-in preprocessing operators for various data types, such as resizing and flipping for images, random noise addition and pitch shifting for audio, and stopword removal and random masking for text processing.

- **Data Batching Component (Batch)**: Responsible for assembling and constructing a mini-batch of data to send to training/inference.

- **Data Sending Component (Send)**: Responsible for sending processed data to accelerators such as GPUs or Huawei Ascend for subsequent model computation and updates. High-performance data modules often choose to execute data transfer to devices asynchronously with computation on accelerators to improve overall training throughput.


:width:`800px`
:label:`pipeline`

Implementing the above components is just the foundation of a data module. We also need to focus on the following aspects:
#### Usability
Data processing involved in AI model training/inference is highly flexible. On one hand, datasets in different application scenarios vary significantly in type and characteristics. When loading datasets, the data module must support specific storage formats for multiple types such as images, text, audio, and video, as well as multiple storage device types including memory, local disks, distributed file systems, and object storage systems. The module needs to abstract and unify the I/O differences in data loading under these complex situations to reduce users' learning costs.

On the other hand, different data types often have different processing requirements. In common machine learning tasks, image tasks frequently involve resizing, flipping, and blurring; text tasks require tokenization and vectorization; and speech tasks need Fast Fourier Transform, reverb enhancement, and frequency shifting. To help users address data processing needs in the vast majority of scenarios, the data module needs to support a sufficiently rich set of data preprocessing operators for various types. However, new algorithms and data processing requirements are constantly and rapidly emerging, so we need to support users in conveniently using custom processing operators within the data module to handle scenarios not covered by built-in operators, achieving the best balance between flexibility and efficiency.
#### Efficiency
Since common AI accelerators such as GPUs and Huawei Ascend are primarily designed for Tensor computation and do not possess general-purpose data processing capabilities, the data modules of current mainstream machine learning systems typically execute data pipelines on CPUs. Ideally, before each training iteration begins, the data module should have the data ready so as to minimize the time accelerators spend waiting for it. However, data loading and preprocessing in the data pipeline often face both I/O and CPU computation bottlenecks. The data module therefore needs file formats that support random access with high read throughput to resolve data loading bottlenecks, as well as well-designed parallel architectures to execute data pipelines efficiently. To achieve high training throughput, mainstream machine learning systems all execute data processing asynchronously with model computation to hide data preprocessing latency.
#### Order Preservation
Unlike conventional data-parallel computing tasks, machine learning model training is sensitive to data input order. When training models using stochastic gradient descent, data is typically fed to the model in a pseudo-random order in each epoch, with a different random order in each training epoch. Since the model's final parameters are sensitive to the order of input data, to help users better debug and ensure reproducibility across different experiments, we need to design mechanisms in the system so that the final order in which data is fed to the model is uniquely determined by the output order of the data shuffling component, rather than being made non-deterministic by parallel data transformations. We will discuss the requirements and specific implementation details of order preservation in later sections.
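One way to meet this requirement is to tag each sample with its position in the shuffled order and reorder results before they leave the data module. The following sketch (with a deliberately scrambled completion order standing in for parallel workers) shows such a reordering buffer:

```python
import heapq
import random

def preprocess(x):
    return x * 10                      # stand-in for a data transform

samples = list(range(8))               # order decided by the shuffle component

# Parallel workers may finish in arbitrary order; simulate that here
finished = [(idx, preprocess(x)) for idx, x in enumerate(samples)]
random.Random(1).shuffle(finished)

# Reordering buffer: release result i only after results 0..i-1 are released
heap, next_idx, ordered = [], 0, []
for idx, result in finished:
    heapq.heappush(heap, (idx, result))
    while heap and heap[0][0] == next_idx:
        _, r = heapq.heappop(heap)
        ordered.append(r)
        next_idx += 1

print(ordered)  # [0, 10, 20, 30, 40, 50, 60, 70]: input order restored
```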
## Summary
In this chapter, we explored how to design and implement the data preprocessing module in machine learning systems from three dimensions: usability, efficiency, and order preservation.

On the usability dimension, we focused on the programming model of the data module. By drawing on the design experience of historically excellent parallel data processing systems, we concluded that a programming abstraction based on describing dataset transformations is well-suited as the programming model for data modules. In concrete system implementations, we need not only to provide a sufficient number of built-in operators on top of this programming model to facilitate users' data preprocessing programming, but also to consider how to support users in conveniently using custom operators.

On the efficiency dimension, we introduced specialized file format design and parallel computation architecture design from the perspectives of data loading and computation, respectively. We also applied the model computation graph compilation optimization techniques learned in previous chapters to optimize users' data preprocessing computation graphs, further achieving higher data processing throughput.

In machine learning scenarios, models are sensitive to data input order, which gives rise to the special property of order preservation. We analyzed this property in this chapter and demonstrated how real systems ensure order preservation through the special constraint implementation of MindSpore's Connector. Finally, we also addressed situations where single-machine CPU data preprocessing performance is insufficient, introducing the current vertical scaling approach based on heterogeneous processing acceleration and the horizontal scaling approach based on distributed data preprocessing. We believe that after studying this chapter, readers will have a deep understanding of data modules in machine learning systems and an awareness of the challenges that data modules will face in the future.

## Further Reading

- For an example of pipeline-level parallelism implementation, we recommend reading [PyTorch DataLoader](https://github.com/pytorch/pytorch/tree/master/torch/utils/data).
- For an example of operator-level parallelism implementation, we recommend reading [MindData](https://gitee.com/mindspore/mindspore/tree/master/mindspore/ccsrc/minddata).
# Architecture of Machine Learning Clusters

Distributed model training is usually implemented in a compute cluster. Next, we will introduce the composition of a compute cluster and explore the design of a cluster network.

Figure :numref:`ch010/ch10-datacentre` shows the typical architecture of a machine learning cluster. There are many servers deployed in such a cluster, and each server has several hardware accelerators. To facilitate server management, multiple servers are placed into one *rack*, which is connected to a *top of rack (ToR) switch*. If ToR switches are fully loaded but more new racks need to be connected, we can add a *spine switch* between ToR switches. Such a structure forms a multi-level tree. It is worth noting that cross-rack communication within a cluster may encounter network bottlenecks. This is because the network links used to construct the cluster network have the same specifications (necessary to facilitate hardware procurement and device management), increasing the probability of *network bandwidth oversubscription* on the network links from the ToR switches to the spine switch.
Network bandwidth oversubscription can be defined as a situation wherein the peak bandwidth required exceeds the actual bandwidth available on the network. In the cluster shown in Figure :numref:`ch010/ch10-datacentre`, when server 1 and server 2 each send data to server 3 through their respective network links (say 10 Gb/s of data each), ToR switch 1 aggregates the data (that is, 20 Gb/s) and sends it to spine switch 1. However, because there is only one network link (10 Gb/s) between spine switch 1 and ToR switch 1, the peak bandwidth required is twice the actual bandwidth available, hence network bandwidth oversubscription. In real-world machine learning clusters, the ratio between peak bandwidth and actual bandwidth is generally between 1:4 and 1:16. One approach to avoiding network bottlenecks is to restrict network communication to within individual racks. This approach has become a core design requirement for distributed machine learning systems.

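The arithmetic in this example can be checked in a few lines. The following is an illustrative back-of-the-envelope sketch; the link speeds are the ones assumed in the text above.

```python
# Two servers each push 10 Gb/s toward a single 10 Gb/s ToR-to-spine uplink.
senders = 2
per_link_gbps = 10   # each server's link to the ToR switch
uplink_gbps = 10     # the single link from ToR switch 1 to spine switch 1

peak_demand_gbps = senders * per_link_gbps        # 20 Gb/s arriving at the ToR
oversubscription = peak_demand_gbps / uplink_gbps # 2.0: a 1:2 oversubscribed link
print(peak_demand_gbps, oversubscription)
```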

|
||||
:label:`ch010/ch10-datacentre`
|
||||
|
||||
So, how much network bandwidth is required for training a large-scale neural network in a compute cluster? Assume a neural network has hundreds of billions of parameters (e.g., GPT-3 --- a huge language model released by OpenAI --- has nearly 175 billion parameters). If each parameter is expressed with a 32-bit floating-point number, a single model replica in data parallelism mode will generate 700 GB (175 billion $*$ 4 bytes) of local gradient data in each round of training iteration. If there are three model replicas, at least 1.4 TB \[700 GB $*$ $(3-1)$\] of gradient data needs to be transmitted. This is because for $N$ replicas, only $N-1$ of them need to be transmitted for computation. To ensure that the model replicas will not diverge from the parameters in the main model, the average gradient --- once computed --- is broadcast to all model replicas (1.4 TB of data) for updating local parameters in these model replicas.

Currently, machine learning clusters generally use Ethernet to construct networks between different racks. The bandwidth of mainstream commercial Ethernet links ranges from 10 Gb/s to 25 Gb/s.[^1] Using Ethernet to transmit massive gradients will encounter severe transmission latency. Because of this, new machine learning clusters (such as NVIDIA DGX) are often configured with the faster InfiniBand. A single InfiniBand link can provide 100 Gb/s or 200 Gb/s bandwidth. Even this high-speed network still faces high latency when transmitting TB-level local gradients. Even if network latency is ignored, it takes at least 40 seconds to transmit 1 TB of data on a 200 Gb/s link.

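The data volumes and transfer times quoted above follow from simple unit conversions. The sketch below reproduces them using the figures from the text and the usual gigabits-to-bytes conversion:

```python
def gradient_volume_gb(num_params, bytes_per_param=4):
    """Size of one full set of 32-bit gradients, in gigabytes."""
    return num_params * bytes_per_param / 1e9

def transfer_seconds(num_bytes, link_gbps):
    """Time to move num_bytes over a link of link_gbps gigabits per second."""
    bytes_per_second = link_gbps * 1e9 / 8  # Gb/s -> bytes/s (8 bits per byte)
    return num_bytes / bytes_per_second

print(gradient_volume_gb(175e9))    # GPT-3-scale model: 700.0 GB of gradients
print(transfer_seconds(1e12, 200))  # 1 TB over a 200 Gb/s InfiniBand link: 40.0 s
print(transfer_seconds(1e12, 25))   # 1 TB over a 25 Gb/s Ethernet link: 320.0 s
```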
To address this issue, InfiniBand uses remote direct memory access (RDMA) as the core of its programming interfaces. RDMA enables InfiniBand to provide high-bandwidth, low-latency data read and write functions. As such, its programming interfaces are vastly different from the TCP/IP socket interfaces used by conventional Ethernet. For compatibility purposes, people use the IP-over-InfiniBand (IPoIB) technology, which lets legacy applications keep invoking socket interfaces while the underlying layer invokes the RDMA interfaces of InfiniBand through IPoIB.

To support multiple accelerators (typically 2--16) within a server, a common practice is to build a heterogeneous network on the server. Take server 1 in Figure :numref:`ch010/ch10-datacentre` as an example. This server is equipped with two CPUs, which communicate with each other through QuickPath Interconnect (QPI). Within a CPU interface (socket), the accelerator and CPU are connected by a PCIe bus. Accelerators use high-bandwidth memory (HBM), which offers much more bandwidth than PCIe does. A prominent example is the NVIDIA A100 server: in this server, HBM offers 1935 GB/s bandwidth, whereas PCIe 4.0 offers only 64 GB/s bandwidth. PCIe is shared by all accelerators within the server, meaning that it becomes a significant communication bottleneck when multiple accelerators simultaneously transmit data through PCIe. To solve this problem, machine learning servers tend to use accelerator high-speed interconnect technologies (e.g., NVIDIA NVLink) that bypass PCIe to achieve high-speed communication. A prominent example is the NVIDIA A100 GPU: its NVLink provides 600 GB/s bandwidth, enabling accelerators to transmit large amounts of data to each other.

## AI Cluster Network Topology

[^1]: Network bandwidth is typically measured in Gb/s, whereas memory bandwidth is in GB/s --- *b* stands for bit, and *B* stands for byte.

274 v1/en_chapters/chapter_distributed_training/collective.md Normal file
@@ -0,0 +1,274 @@
# Collective Communication

This section delves into the application of collective communication in the creation of distributed training systems within machine learning clusters. Collective communication, a fundamental aspect of parallel computing, is instrumental in developing high-performance Single Program Multiple Data (SPMD) programs. We will begin by discussing common operators within collective communication. Following this, we explore the use of the AllReduce algorithm to alleviate network bottlenecks in distributed training systems. Lastly, we will address the support available for different collective communication algorithms within existing machine learning systems.

## Collective Communication Operators

In this subsection, we will establish a simplified model of collective communication before introducing commonly used collective communication operators. These include Broadcast, Reduce, AllGather, Scatter, and AllReduce:


:label:`ch010/ch10-collective-operators`

- **Broadcast**: The Broadcast operator is often employed in a distributed machine learning system to transmit model parameters or configuration files from device $i$ to all other devices. The starting and final states of this operation, initiated by device 1 in a three-device cluster, are depicted in Figure :numref:`ch010/ch10-collective-operators`.

- **Reduce**: In a distributed machine learning system, the Reduce operator plays a pivotal role by consolidating computation results from different devices. It is commonly used to aggregate local gradients from each device to compute the gradient summation. This operator employs functions, represented as $f$, which often obey the associative and commutative laws. Such functions, including sum, prod, max, and min, are initiated by all devices, with the final aggregate result stored in device $i$. The initial and final states when device 1 executes the Reduce operator for summation are depicted in Figure :numref:`ch010/ch10-collective-operators`.

- **AllReduce**: The AllReduce operator, a part of collective communication, stores the result of the Reduce function $f$ in all devices. Figure :numref:`ch010/ch10-collective-operators` shows the starting and ending states when devices 1, 2, and 3 jointly execute AllReduce to perform a summation.

- **Gather**: The Gather operator can gather data from all devices and store it in device $i$. Figure :numref:`ch010/ch10-collective-operators` shows the initial and end states when device 1 invokes the Gather operator to gather data from all devices.

- **AllGather**: The AllGather operator sends the gather result to all devices. Figure :numref:`ch010/ch10-collective-operators` shows the initial and end states when devices 1, 2, and 3 invoke the AllGather operator.

- **Scatter**: The Scatter operator is the inverse of the Gather operator. Figure :numref:`ch010/ch10-collective-operators` shows the initial and end states when device 1 invokes the Scatter operator.

It's important to note that other collective communication operators may also be deployed in distributed machine learning applications. Examples of these are ReduceScatter, Prefix Sum, Barrier, and All-to-All. However, this section will not delve into the specifics of these operators.

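The operators above can be modeled in a few lines of single-process code, treating each "device" as an entry in a list of buffers. This is an illustrative sketch only, not the API of a real communication library such as NCCL or MPI:

```python
def broadcast(buffers, root):
    """Copy the root device's buffer to every device."""
    return [list(buffers[root]) for _ in buffers]

def reduce_sum(buffers, root):
    """Sum all buffers element-wise; only the root keeps the result."""
    summed = [sum(vals) for vals in zip(*buffers)]
    return [summed if i == root else list(b) for i, b in enumerate(buffers)]

def all_reduce(buffers):
    """Reduce followed by Broadcast: every device holds the summed result."""
    summed = [sum(vals) for vals in zip(*buffers)]
    return [list(summed) for _ in buffers]

def gather(buffers, root):
    """Concatenate all buffers onto the root device."""
    flat = [v for b in buffers for v in b]
    return [flat if i == root else list(b) for i, b in enumerate(buffers)]

# Three devices, each holding a local gradient vector.
devices = [[2, 4, 6], [1, 2, 3], [4, 8, 12]]
print(all_reduce(devices)[0])  # [7, 14, 21] on every device
```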
## Gradient Averaging with AllReduce

The following discusses how to utilize AllReduce operators to implement efficient gradient averaging in large clusters. A naive method for computing the average gradient is to have one device in the cluster gather the local gradients from every device and then broadcast the computed average gradient to all devices. Although this approach is easy to implement, it leads to two problems. First, network congestion may occur when multiple devices send data to the gather device simultaneously. Second, the gradient averaging computation cannot fit on a single device because of its limited computing power.

To solve the preceding problems, the Reduce-Broadcast implementation of the AllReduce operator can be used to optimize the algorithm. In this implementation, all nodes participate in both the network communication and the averaging computation of gradients, so that the huge network and computing overheads are shared evenly across all nodes. This implementation solves the two problems of a single gradient gather node. Assume that there are $M$ devices, and that each device stores a model replica consisting of $N$ parameters/gradients. According to the requirements of AllReduce, all parameters need to be partitioned into $M$ partitions based on the number of devices, with each partition containing $N/M$ parameters. The initial and end states of the algorithm are provided.

In the AllReduce example shown in Figure :numref:`ch010/ch10-collective-operators`, there are three devices. Each device has a model replica, and each replica has 3 parameters. According to the partitioning method of AllReduce, parameters are partitioned into three partitions (because there are 3 devices), and each partition has 1 ($N/M$ = 3/3) parameter. In this example, assume that device 1 has parameters 2, 4, and 6; device 2 has parameters 1, 2, and 3; and device 3 has parameters 4, 8, and 12. After an AllReduce operator is used for computation, the gradient summation results 7, 14, and 21 are sent to all devices. The result 7 of partition 1 is the sum of the initial results of partition 1 in the three devices (7 = 2 + 1 + 4). To compute the average gradient, the sum of gradients needs to be divided by the number of devices (e.g., to obtain the final result of partition 1, divide 7 by 3).

The AllReduce operator splits the gradient computation into $M-1$ Reduce operators and $M-1$ Broadcast operators (where $M$ indicates the number of nodes). Reduce operators are used to compute the summation of gradients, and Broadcast operators are used to broadcast the summation of gradients to all nodes.

Figure :numref:`ch010/ch10-allreduce-process` shows the execution process of an AllReduce operator. The AllReduce operator starts with a Reduce operator. In the first Reduce operator, the AllReduce operator performs pairing on all nodes and enables them to jointly complete gradient summation. In the first Reduce operator shown in Figure :numref:`ch010/ch10-allreduce-process`, devices 1 and 2 are paired to jointly complete the summation of data in partition 1. Device 2 sends local gradient data 1 to device 1, which adds up the received gradient data 1 and gradient data 2 stored in local partition 1 to obtain the intermediate gradient summation result 3. At the same time, devices 1 and 3 are paired to jointly complete the summation of data in partition 3, and devices 3 and 2 are paired to jointly complete the summation of data in partition 2.


:label:`ch010/ch10-allreduce-process`

Such distributed computing of gradients performed by Reduce operators realizes the following performance optimizations:

1. **Network optimization:** All devices receive and send data simultaneously by utilizing their ingress and egress bandwidths. Therefore, in the execution process of the AllReduce algorithm, the available bandwidth is $M * B$, where $M$ indicates the number of nodes and $B$ indicates the node bandwidth. This enables the system to implement network bandwidth scalability.

2. **Computing power optimization:** Processors of all devices participate in the gradient summation. Therefore, in the execution process of the AllReduce algorithm, the total number of available processors is $M * P$, where $M$ indicates the number of nodes and $P$ indicates the number of processors per device. This enables the system to implement computing scalability.

3. **Load balancing:** Data partitions are evenly partitioned. Therefore, the communication and computing overheads allocated to each device are the same.

In the Reduce operators other than the first one, the AllReduce algorithm selects other pairing methods for different data partitions. For example, in the second Reduce operator shown in Figure :numref:`ch010/ch10-allreduce-process`, the AllReduce algorithm pairs devices 1 and 3 for data summation in partition 1. Devices 1 and 2 are paired for data summation in partition 2, and devices 2 and 3 are paired for data summation in partition 3. In a three-node AllReduce cluster, after two Reduce operators complete execution, the data summation result of each partition is obtained. The data summation result (7) of partition 1 is stored on device 3, the data summation result (14) of partition 2 is stored on device 1, and the data summation result (21) of partition 3 is stored on device 2.

The AllReduce algorithm then enters the broadcast phase. The process in this phase is similar to the execution process of Reduce operators. The core difference is that, after nodes are paired, they do not add up data --- instead, they broadcast the computation results of Reduce operators. In the first Broadcast operator shown in Figure :numref:`ch010/ch10-allreduce-process`, device 1 directly writes the result (14) of partition 2 to partition 2 of device 3. Device 2 directly writes the result (21) of partition 3 to device 1, and device 3 directly writes the result of partition 1 to device 2. In a three-node AllReduce cluster, the Broadcast operator is repeated twice in order to notify all nodes of the Reduce computation result of each partition.

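The two phases described above can be simulated in plain Python. The sketch below mirrors the running example (three devices, three partitions, two Reduce steps followed by two Broadcast steps, with pairings rotating by one partition per step). It is a single-process illustration of the algorithm's data movement, not a real implementation:

```python
def ring_all_reduce(buffers):
    m = len(buffers)
    buffers = [list(b) for b in buffers]
    # Reduce phase: after m-1 steps, device d holds the full sum of
    # partition (d + 1) % m. Each step stages all sends before applying
    # them, mimicking simultaneous transfers.
    for s in range(m - 1):
        staged = [(d, (d - s) % m, buffers[d][(d - s) % m]) for d in range(m)]
        for d, p, v in staged:
            buffers[(d + 1) % m][p] += v
    # Broadcast phase: circulate each completed partition so that every
    # device ends up with all the summed partitions.
    for s in range(m - 1):
        staged = [(d, (d + 1 - s) % m, buffers[d][(d + 1 - s) % m]) for d in range(m)]
        for d, p, v in staged:
            buffers[(d + 1) % m][p] = v
    return buffers

# The running example: devices hold [2, 4, 6], [1, 2, 3], and [4, 8, 12].
print(ring_all_reduce([[2, 4, 6], [1, 2, 3], [4, 8, 12]]))
# every device ends with [7, 14, 21]
```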
## Model Training with Collective Communication

Typically, a machine learning system flexibly combines different collective communication operators for different clusters to maximize communication efficiency. The following describes two cases: ZeRO and DALL-E.

ZeRO is a neural network optimizer proposed by Microsoft. In practice, ZeRO successfully trained the world's largest language model of 2020 (with up to 17 billion parameters). In the training process of such a neural network, the optimizer state, the gradients obtained during backward computation, and the model parameters all impose significant pressure on the memory of accelerators. If parameters are represented by 32-bit floating-point numbers, a model with 17 billion parameters requires 68 GB of memory for the parameters alone (and several times that once gradients and optimizer state are included), far exceeding the maximum memory capacity (80 GB) of NVIDIA A100, the accelerator with the largest memory available today. Therefore, we need to explore how to efficiently split a model across different accelerators, and how to efficiently utilize collective communication operators for model training and inference. The following describes three optimization technologies regarding collective communication:

1. **Parameter storage on a single node:** The bandwidth of the accelerators inside a node in a modern cluster is much greater than the inter-node bandwidth. Therefore, we need to minimize inter-node communication and ensure that communication mostly happens between accelerators inside nodes. The model slicing process shows that the amount of communication between different slices during the forward and backward computation of the model is far less than the average amount of communication required for gradient averaging of model replicas. As such, ZeRO stores all slices of a single model in the same node, greatly improving the training efficiency.

2. **Forward computation based on the AllGather operator:** Assuming that the parameters in a model are linear by layer, we can assign the parameters to different accelerators from front to back based on the sequence of these parameters on the network. In forward computation, the computation of a layer depends only on the parameters of its adjacent layers. Given this, we can apply AllGather computation once on all accelerators that contain model parameters in order to extract the parameters of the next layer for the current layer and to compute the activation value of the current layer. To conserve memory resources, the parameters of layers other than the current one need to be discarded immediately after the AllGather operation is complete.

3. **Gradient averaging based on the ReduceScatter operator:** Similarly, during backward computation, only the parameters of the previous layer are needed to compute the activation value and gradient of the current layer. Therefore, AllGather can be used again to complete the gradient computation on each accelerator. At the same time, after gradients are gathered, each accelerator needs only the gradient corresponding to the layer with the same index as the accelerator. In this case, the ReduceScatter operator, instead of AllReduce, can be used to directly store the corresponding gradient to accelerator $i$.

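A toy sketch of the AllGather-based forward pass in item 2: each of three "devices" owns one layer's parameters; before a layer is computed, its parameters are fetched (the AllGather step), used, and discarded immediately. All names and numbers here are illustrative, not ZeRO's actual API:

```python
layer_weights = [2.0, 3.0, 0.5]          # one scalar "weight" per layer
shards = dict(enumerate(layer_weights))  # device i stores layer i's weight only

def sharded_forward(x):
    for i in range(len(layer_weights)):
        w = shards[i]   # AllGather step: fetch layer i's parameters on demand
        x = w * x       # compute layer i's activation
        del w           # discard non-local parameters to conserve memory
    return x

print(sharded_forward(1.0))  # 2.0 * 3.0 * 0.5 = 3.0
```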
DALL-E is a text-based image generation model proposed by OpenAI. This model has up to 12 billion parameters. In addition to the AllGather + ReduceScatter technique used by ZeRO during training, the OpenAI team made further optimizations. The following describes two optimization technologies regarding collective communication:

1. **Matrix factorization:** The operational speeds of collective communication operators are positively correlated with the message length. In model training, the message length indicates the number of model parameters. DALL-E uses matrix factorization to convert a high-dimensional tensor into a two-dimensional matrix, and then uses collective communication operators for transmission after factorization. In this way, DALL-E significantly reduces the amount of communication.

2. **Custom data types:** Another way to reduce the amount of communication is to modify data types. As expected, the 16-bit half-precision floating-point representation can reduce the amount of communication by nearly half compared with the 32-bit floating-point representation. However, in practice, low-precision data types cause unstable model convergence and compromise the final training result. OpenAI analyzed the structure of the DALL-E model and classified the model parameters into three categories based on their sensitivity to the precision of data types. The most precision-sensitive parameters are represented by 32-bit floating-point numbers and synchronized only by the AllReduce operator, whereas the most precision-insensitive parameters are compressed and transmitted using matrix factorization. For the remaining parameters, such as the moment and variance parameters involved in Adam optimization, OpenAI implemented two new data types based on the IEEE 754 standard: 1-6-9 and 0-6-10. (The first digit indicates the number of sign bits, the second the number of exponent bits, and the third the number of significand bits.) In addition to conserving space, this also ensures training convergence.

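A quick sanity check on the two custom formats: the three digits of each format (sign bits, exponent bits, significand bits) must sum to 16, matching the width of a half-precision value:

```python
# (sign bits, exponent bits, significand bits) for each custom format
formats = {"1-6-9": (1, 6, 9), "0-6-10": (0, 6, 10)}
for name, (sign, exponent, significand) in formats.items():
    total = sign + exponent + significand
    print(name, "->", total, "bits")  # both formats occupy 16 bits
```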
54 v1/en_chapters/chapter_distributed_training/index.md Normal file
@@ -0,0 +1,54 @@
# Distributed Training

As the field of machine learning continues to accelerate at a rapid pace, it has given rise to increasingly sophisticated models. These models are characterized by a staggering quantity of parameters, the gigantic size of a training dataset, and highly sophisticated structures, which in turn place significant demands on both computing and memory resources. Consequently, the limitations of single-machine systems have become increasingly apparent, and they no longer suffice for training these large machine learning models. This necessitates the advent of distributed training systems, designed to alleviate the strain on resources.

In this chapter, we dive deep into the fundamentals, design aspects, and practical implementations of distributed machine learning systems. We commence our discussion by elucidating what distributed training systems entail, followed by an exploration of the rationale behind their design and the potential benefits they offer. Subsequently, we scrutinize the most commonly adopted methods of distributed training, encompassing data parallelism, model parallelism, and pipeline parallelism. Each of these methods can typically be implemented via one of two techniques: collective communication or parameter servers, both of which come with their unique sets of merits and drawbacks.

The key learning objectives of this chapter are as follows:

1. Grasping the advantages offered by distributed training systems.

2. Understanding widely-used parallelism methods, namely, data parallelism, model parallelism, hybrid parallelism, and pipeline parallelism.

3. Comprehending the architecture of a machine learning cluster.

4. Understanding collective communication operators and their applications in distributed training systems.

5. Developing an understanding of parameter server architectures.

```toc
:maxdepth: 2

Overview
Parallelism_Methods
Pipeline_Parallelism_with_Micro-Batching
Architecture_of_Machine_Learning_Clusters
Collective_Communication
Parameter_Server
Federated_Learning
Training_Large_Language_Models
Chapter_Summary
Further_Reading
```

174 v1/en_chapters/chapter_distributed_training/methods.md Normal file
@@ -0,0 +1,174 @@
# Parallelism Methods

This section explores the prevalent methods for implementing distributed training systems, discussing the design goals of each parallelism approach and examining each one in detail.

## Classification of Methods

Distributed training amalgamates multiple single-node training systems into a parallel structure to expedite the training process without sacrificing model accuracy. A single-node training system, depicted in Figure :numref:`ch010/ch10-single-node`, processes training datasets split into small batches, termed mini-batches. Here, a mini-batch of *data* is input into the model, guided by a training *program*, which generates gradients to enhance model accuracy. Typically, this program executes a deep neural network. To illustrate the execution of a neural network, we employ a computational graph, comprising connected operators. Each operator executes a layer of the neural network, storing parameters to be updated during the training.


:label:`ch010/ch10-single-node`

The execution of a computational graph involves two phases: *forward* and *backward* computation. In the forward phase, data is fed into the initial operator, which calculates and generates the data required by the downstream operator. This process continues sequentially through all operators until the last one concludes its computation. The backward phase initiates from the last operator, computing gradients and updating local parameters accordingly. The process culminates at the first operator. Upon completion of these two phases for a given mini-batch, the system loads the next mini-batch to update the model.

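The two phases can be illustrated with a two-operator scalar "network": the forward pass runs left to right, and the backward pass applies the chain rule right to left, producing a gradient for each operator's parameter. A minimal sketch with made-up values:

```python
def forward(x, w1, w2):
    h = w1 * x   # operator 1 produces the data operator 2 needs
    y = w2 * h   # operator 2 concludes the forward phase
    return h, y

def backward(x, w1, w2, h, grad_y):
    grad_w2 = grad_y * h   # gradient for operator 2's parameter
    grad_h = grad_y * w2   # gradient flowing back to operator 1
    grad_w1 = grad_h * x   # gradient for operator 1's parameter
    return grad_w1, grad_w2

x, w1, w2 = 2.0, 3.0, 4.0
h, y = forward(x, w1, w2)           # h = 6.0, y = 24.0
print(backward(x, w1, w2, h, 1.0))  # (8.0, 6.0)
```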
Considering a model training job, partitioning the *data* and *program* can facilitate parallel acceleration. Table `ch010/ch10-parallel-methods` compiles various partition methods. Single-node training systems enable a "single program, single data" paradigm. For parallel computing across multiple devices, data is partitioned and the program is replicated for simultaneous execution, creating a "single program, multiple data" or *data parallelism* paradigm. Another approach involves partitioning the program, distributing its operators across devices --- termed "multiple programs, single data" or *model parallelism*. When training exceptionally large AI models, both data and program are partitioned to optimize the degree of parallelism (DOP), yielding a "multiple program, multiple data" or *hybrid parallelism* paradigm.

:Parallelism methods

| Classification   | Single Data           | Multiple Data      |
|------------------|-----------------------|--------------------|
| Single program   | single-node execution | data parallelism   |
| Multiple program | model parallelism     | hybrid parallelism |
:label:`ch010/ch10-parallel-methods`

## Data Parallelism

Data parallelism is used when a single node cannot provide sufficient computing power. This is the most common parallelism approach adopted by AI frameworks; specific implementations include TensorFlow DistributedStrategy, PyTorch Distributed, and Horovod DistributedOptimizer. Given a data-parallel system, assume that the training batch size is $N$, and that there are $M$ devices available for parallel acceleration. To achieve data parallelism, the batch is partitioned into $M$ partitions, with each device getting $N/M$ training samples. Sharing a replica of the training program, each device executes and calculates a gradient separately over its own data partition: each device (indexed $i$) calculates a gradient $G_i$ based on its local training samples. To ensure that training program parameters stay coherent, the local gradients $G_i$ on different devices are aggregated to calculate an average gradient $(\sum_{i=1}^{M} G_i) / M$. To complete the training on this mini-batch, the training program updates model parameters based on the average gradient.

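A minimal numeric sketch of this scheme, using a toy one-parameter model (all names and numbers are illustrative). The key property it demonstrates: with equally sized shards, averaging the per-device gradients reproduces the full-batch gradient exactly:

```python
def local_gradient(samples, w):
    # gradient of the mean squared error of w*x against y, w.r.t. w
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # N = 4, y = 2x
M = 2
shards = [batch[i::M] for i in range(M)]   # N/M = 2 samples per device

w = 0.0
local_grads = [local_gradient(s, w) for s in shards]  # one gradient per device
avg_grad = sum(local_grads) / M            # the AllReduce-style average
print(avg_grad)                            # -30.0
print(local_gradient(batch, w))            # -30.0: identical to the full batch
```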
Figure :numref:`ch010/ch10-data-parallel` shows a data-parallel training system composed of two devices. For a batch size of 64, each device is assigned 32 training samples and shares the same neural network parameters (or program replicas). The local training samples are passed through the operators in the program replica in sequence for forward and backward computation. During backward computation, the program replicas generate local gradients. Corresponding local gradients on different devices (e.g., gradient 1 on device 1 and gradient 1 on device 2) are aggregated (typically by AllReduce, a collective communication operation) to calculate an average gradient.


:label:`ch010/ch10-data-parallel`

## Model Parallelism

Model parallelism is useful when memory constraints make it impossible to train a model on a single device. For example, the memory on a single device will be insufficient for a model that contains a large operator (such as the compute-intensive fully connected layer for classification purposes). In such cases, we can partition this large operator for parallel execution. Assume that the operator has $P$ parameters and the system consists of $N$ devices. To minimize the workload on each device given the limited memory capacity, we can evenly assign the parameters across the devices ($P/N$ parameters per device). This partitioning method is called **intra-operator parallelism**, which is a typical application of model parallelism.

Figure :numref:`ch010/ch10-model-parallel-intra-op` shows an example of
|
||||
intra-operator parallelism implemented by two devices. The neural
|
||||
network in this example consists of two operators. To complete forward
|
||||
and backward computation, operator 1 and operator 2 require 16 GB and 1
|
||||
GB of memory, respectively. However, in this example, the maximum amount
|
||||
of memory a single device can provide is only 10 GB. To train this
|
||||
network, parallelism is implemented on operator 1. Specifically, the
|
||||
parameters of operator 1 are evenly partitioned into two partitions
|
||||
between device 1 and device 2, meaning that device 1 runs program
|
||||
partition 1 while device 2 runs program partition 2. The network
|
||||
training process starts with feeding a mini-batch of training data to
|
||||
operator 1. Because the parameters of operator 1 are shared between two
|
||||
devices, the data is broadcast to the two devices. Each device completes
|
||||
forward computation based on the local partition of parameters. The
|
||||
local computation results on the devices are aggregated before being
|
||||
sent to downstream operator 2. In backward computation, the data of
|
||||
operator 2 is broadcast to device 1 and device 2, so that each device
|
||||
completes backward computation based on the local partition of
|
||||
operator 1. The local computation results on the devices are aggregated
|
||||
and returned to complete the backward computation process.

:label:`ch010/ch10-model-parallel-intra-op`
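As a sketch of this flow (broadcast the input, compute on local parameter partitions, aggregate the partial results), the following toy example splits the weight matrix of a fully connected layer column-wise across two simulated devices. The helper names (`intra_op_forward`, `split_columns`) are illustrative, not a real framework API; plain Python lists stand in for device memory.

```python
def matmul(x, w):
    """x: length-m vector, w: m x n matrix -> length-n vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]

def split_columns(w, num_devices):
    """Evenly partition the columns of w, giving P/N parameters per device."""
    cols = len(w[0])
    per_dev = cols // num_devices
    return [[row[d * per_dev:(d + 1) * per_dev] for row in w]
            for d in range(num_devices)]

def intra_op_forward(x, w, num_devices=2):
    partitions = split_columns(w, num_devices)           # shard parameters
    partials = [matmul(x, part) for part in partitions]  # broadcast x, compute locally
    return [v for p in partials for v in p]              # aggregate (concatenate)

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
# The sharded computation reproduces the single-device result.
assert intra_op_forward(x, w) == matmul(x, w)
```

Real systems shard tensors on accelerators and aggregate with collective communication, but the arithmetic identity is the same: a column-partitioned matrix multiply, concatenated, equals the unpartitioned one.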
In some cases, the overall model --- rather than specific operators --- requires more memory than a single device can provide. Given $N$ operators and $M$ devices, we can evenly assign the operators across the $M$ devices. As such, each device needs to run forward and backward computation of only $N/M$ operators, thereby reducing the memory overhead of each device. This application of model parallelism is called **inter-operator parallelism**.
Figure :numref:`ch010/ch10-model-parallel-inter-op` shows an example of inter-operator parallelism implemented by two devices. The neural network in this example has two operators, each requiring 10 GB of memory for computation (20 GB in total). Because the maximum memory a single device can provide in this example is 10 GB, we can place operator 1 on device 1 and operator 2 on device 2. In forward computation, the output of operator 1 is sent to device 2, which uses this output as input to complete forward computation of operator 2. In backward computation, device 2 sends the backward computation result of operator 2 to device 1 for backward computation of operator 1, completing the training on a mini-batch.

:label:`ch010/ch10-model-parallel-inter-op`
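The inter-operator flow above can be sketched with two toy operators placed on simulated devices: the only cross-device traffic is the activation sent forward and the gradient sent backward. The `Scale` operator and `device` strings are hypothetical stand-ins, not a real framework API.

```python
class Scale:
    """Toy operator y = k * x placed on a (simulated) device."""
    def __init__(self, k, device):
        self.k, self.device = k, device
    def forward(self, x):
        return self.k * x          # runs on self.device
    def backward(self, grad_out):
        return self.k * grad_out   # dL/dx = k * dL/dy

op1 = Scale(2.0, device="device-1")
op2 = Scale(3.0, device="device-2")

# Forward: device 1 computes op1, then "sends" the activation to device 2.
a = op1.forward(5.0)      # computed on device 1
y = op2.forward(a)        # computed on device 2

# Backward: device 2 sends op2's input gradient back to device 1.
g2 = op2.backward(1.0)    # gradient leaving device 2
g1 = op1.backward(g2)     # gradient w.r.t. the original input, on device 1
```

Because each device holds only its own operator's parameters, peak per-device memory is roughly halved, at the cost of the two inter-device transfers per mini-batch.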
## Hybrid Parallelism

In training large AI models, computing power and memory constraints often go hand in hand. The solution to overcoming both constraints is to adopt a hybrid of data parallelism and model parallelism, that is, hybrid parallelism. Figure :numref:`ch010/ch10-hybrid-parallel` shows an example of hybrid parallelism implemented by four devices. In this example, inter-operator parallelism is adopted to reduce memory overhead by allocating operator 1 to device 1 and operator 2 to device 2. Device 3 and device 4 are added to the system to achieve data parallelism, thereby increasing the computing power of the system. Specifically, the training data is partitioned into data partitions 1 and 2, and the model (consisting of operators 1 and 2) is replicated on devices 3 and 4, respectively. This makes it possible for the program replicas to run in parallel. During forward computation, devices 1 and 3 run the replicas of operator 1 simultaneously and send their respective computation results to devices 2 and 4 to compute the replicas of operator 2. During backward computation, devices 2 and 4 compute gradients simultaneously, and the local gradients are averaged using the AllReduce operation. The averaged gradient is back-propagated to the replicas of operator 1 on devices 1 and 3, completing the backward computation process.

:label:`ch010/ch10-hybrid-parallel`
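The AllReduce averaging step used by the data-parallel replicas can be sketched as follows. This naive version gathers all local gradients and redistributes the element-wise average to every replica; production systems achieve the same result with bandwidth-efficient ring or tree algorithms. The function name `allreduce_mean` is an illustrative stand-in.

```python
def allreduce_mean(local_grads):
    """local_grads: one gradient vector per replica.
    Returns one averaged copy per replica, as AllReduce would leave it."""
    n = len(local_grads)
    dim = len(local_grads[0])
    avg = [sum(g[i] for g in local_grads) / n for i in range(dim)]
    return [list(avg) for _ in local_grads]   # every replica holds the same average

# Gradients computed by the two replicas of operator 2 (devices 2 and 4).
grads_dev2 = [1.0, 3.0]
grads_dev4 = [3.0, 5.0]
synced = allreduce_mean([grads_dev2, grads_dev4])
assert synced[0] == synced[1] == [2.0, 4.0]   # replicas now agree
```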

137 v1/en_chapters/chapter_distributed_training/overview.md (new file)
@@ -0,0 +1,137 @@
# Overview

This section provides an overview of the need for distributed training systems.
## Motivation

The principal objective of implementing distributed training systems is to circumvent the restrictions imposed by single-node training systems, primarily their computational and memory constraints.

### Computational Constraints

A single processor, confined by its inherent limitations, can yield only a certain amount of computational power, quantified in terms of *floating-point operations per second (FLOPS)*. Distributed training systems emerged as a way to overcome the constraints on a single processor's computational capability.
Figure :numref:`ch010/ch10-computation-increase` illustrates the escalating demand for computational power from machine learning models compared with the growth rate of a processor's computational capabilities over the past few years. In this context, computational power is measured in petaFLOP/s-days, a unit implying the execution of $10^{15}$ neural network operations per second for an entire day, or approximately $10^{20}$ operations in total. According to Moore's Law, the computational power of CPUs approximately doubles every 18 months. This exponential growth principle also extends to accelerators, such as GPUs and TPUs, which are leveraged to support machine learning computations with their immense computational abilities.

However, the evolution of machine learning models is outpacing this growth rate. A few years back, machine learning models, like AlexNet, could only recognize a limited array of objects. Now, with models like AlphaStar, machines can outperform humans in certain intricate tasks. In this short timeframe, the computational demands of machine learning models have escalated 56-fold every 18 months.
Distributed computing is designed to reconcile this divergence between the performance of processors and the rising demand for computational power. By capitalizing on the myriad of processors available in expansive data centers and cloud computing facilities, and managing them effectively through distributed training systems, we can cater to the surging computational requirements of evolving models.


:label:`ch010/ch10-computation-increase`
### Memory Constraints

The process of training machine learning models often necessitates substantial memory. Take, for instance, a neural network model with 100 billion parameters in a 32-bit floating-point format (4 bytes per parameter); it would demand 400 GB of memory just to store the parameters. In practice, additional memory is needed to store activation values and gradients. Assuming these are also stored in a 32-bit floating-point format, an extra 800 GB of memory would be required, resulting in an overall memory requirement exceeding 1200 GB (1.2 TB). Current accelerators, however, such as the NVIDIA A100, provide at most 80 GB of memory.
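The memory estimate above can be reproduced directly; the extra-buffer factor of 2 simply mirrors the text's 800 GB budget for activations and gradients, not a universal rule.

```python
GB = 10**9
params = 100 * 10**9        # 100 billion parameters
bytes_per_value = 4         # 32-bit floating point

param_mem = params * bytes_per_value         # memory for parameters alone
grad_and_act_mem = 2 * param_mem             # the text's extra 800 GB budget
total = param_mem + grad_and_act_mem         # overall requirement

assert param_mem == 400 * GB                 # 400 GB for parameters
assert total == 1200 * GB                    # 1.2 TB in total
assert total // GB > 80                      # far beyond one 80 GB accelerator
```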
Unlike individual accelerators, whose memory growth is largely hindered by factors such as hardware specifications, heat dissipation, and cost, distributed training systems can train models with hundreds of billions of parameters across hundreds of accelerators simultaneously, fulfilling memory requirements in the terabyte range.
## System Architecture

Data centers, housing hundreds of clusters with each cluster operating hundreds to thousands of servers, provide an ideal environment for distributed training. We can harness the power of numerous servers in a distributed training system to train a machine learning model in parallel.


:label:`ch010/ch10-single-vs-multi`
To enhance the efficiency of a distributed training system, it is crucial to assess the computational power and memory usage of computing tasks, ensuring that no single task turns into a bottleneck. As depicted in Figure :numref:`ch010/ch10-single-vs-multi`, the system evenly distributes a task across all computing nodes by partitioning the input data into segments. Each model training job, which takes a dataset (e.g., training samples) or a group of tasks (e.g., operators) as input, is run on a computing node (e.g., a GPU) to generate outputs (e.g., gradients).
Distributed execution generally comprises three steps:

1. *Partitioning* the input into smaller segments.

2. *Distributing* these partitions across multiple compute nodes for parallel computing.

3. *Merging* the outputs from all compute nodes to generate a result akin to that of single-node computing.

This process fundamentally adheres to the divide-and-conquer approach, where each compute node runs a small portion of the workload in parallel with the others, thus expediting the overall computing process.
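The three steps can be sketched with a thread pool standing in for compute nodes. Here the "job" is just summing sample values, and merging the partial sums reproduces the single-node result; the helper names are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_nodes):
    """Step 1: split the input into roughly equal segments."""
    k = len(data) // num_nodes
    return [data[i * k:(i + 1) * k] if i < num_nodes - 1 else data[i * k:]
            for i in range(num_nodes)]

def distributed_sum(data, num_nodes=4):
    parts = partition(data, num_nodes)
    # Step 2: run each partition on a "compute node" in parallel.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial = list(pool.map(sum, parts))
    # Step 3: merge the partial outputs into the final result.
    return sum(partial)

data = list(range(1, 101))
assert distributed_sum(data) == sum(data) == 5050   # matches single-node computing
```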
## Benefits

Distributed training systems bring the following benefits:

1. **Improved system performance:** Distributed training significantly improves training performance. Generally, we use the time-to-accuracy metric to measure the performance of a distributed training system. This metric is determined by two factors: the time taken to process all training samples once and the accuracy improvement achieved within that time. By adding parallel compute nodes, we can shorten the time taken to process all training samples once and therefore achieve smaller time-to-accuracy values.

2. **Reduced costs:** Distributed training reduces the cost of training machine learning models. Due to the limited heat dissipation capacity of a single node, nodes with higher computing power incur higher cooling costs. By combining multiple compute nodes, we can obtain the same computing power in a more cost-effective way. This drives cloud service providers (such as Amazon and Microsoft) to focus more on providing distributed machine learning systems.

3. **Hardware fault protection:** Machine learning training clusters typically run commodity hardware (such as disks and NICs), so hardware faults are inevitable over long-term operation. In single-node training, the failure of one hardware device causes the entire training job to fail. In distributed training, a training job is jointly completed by multiple hardware devices, which means the system can transfer the workload from a faulty device to a healthy one, preventing hardware faults from interrupting training.
170 v1/en_chapters/chapter_distributed_training/parameter_servers.md (new file)
@@ -0,0 +1,170 @@
# Parameter Server
:label:`parameter server`

The following describes another common distributed training system: the parameter server. Different machine learning frameworks implement parameter servers in different ways. For example, while TensorFlow and MindSpore come with built-in parameter server implementations, PyTorch requires users to implement parameter servers themselves using RPC interfaces.


:label:`ch010/ch10-parameter-servers`
## System Architecture

Different from machine learning systems built on collective communication, the parameter server system assigns one of two roles to each server: training server or parameter server. A parameter server needs to provide sufficient memory and communication resources, whereas a training server needs to provide a large quantity of computing resources (e.g., hardware accelerators).

Figure :numref:`ch010/ch10-parameter-servers` depicts a machine learning cluster with two training servers and two parameter servers. Assume we have a model that can be divided into two parameter partitions. Each partition is assigned to a parameter server, which synchronizes that partition's parameters. In the training process, each training server holds a complete model replica and computes a gradient based on its local training dataset shard. The gradient is then pushed to the corresponding parameter server. After both training servers push their gradients, the parameter servers compute the average gradient and update the parameters accordingly. The parameter servers then have the training servers pull the latest parameters and start the next round of training iteration.
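The synchronous push/pull round described above can be sketched as a toy, framework-free parameter server: each training server pushes its local gradient, and once all pushes for a round arrive, the server averages them, applies an SGD step, and serves pulls of the updated parameters. The class and learning rate are illustrative assumptions.

```python
class ParameterServer:
    def __init__(self, params, num_workers, lr=0.5):
        self.params = list(params)
        self.num_workers = num_workers
        self.lr = lr
        self.pending = []          # gradients pushed in the current round

    def push(self, grad):
        self.pending.append(grad)
        if len(self.pending) == self.num_workers:      # round complete
            avg = [sum(g[i] for g in self.pending) / self.num_workers
                   for i in range(len(self.params))]   # average gradient
            self.params = [p - self.lr * g
                           for p, g in zip(self.params, avg)]  # SGD update
            self.pending = []

    def pull(self):
        return list(self.params)   # training servers fetch latest parameters

ps = ParameterServer([1.0, 1.0], num_workers=2)
ps.push([1.0, 1.0])   # gradient from training server 1
ps.push([1.0, 3.0])   # gradient from training server 2; triggers the update
assert ps.pull() == [0.5, 0.0]   # 1 - 0.5 * average gradient
```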
## Asynchronous Distributed Training

As discussed earlier, after each round of training, an average gradient must be computed to update each model replica. This is necessary to ensure that the parameters of all model replicas are consistent before the next round of training begins. Such an implementation is generally referred to as *synchronous training*.

Although synchronous training helps the training system achieve higher model accuracy, in a large system stragglers often appear for various reasons. Common causes include: 1) a straggler may not be in the same rack as the other devices, so its communication bandwidth is significantly lower than that of the other devices; 2) a straggler may share local computing and communication resources with other processes, resulting in resource contention and performance degradation.
Stragglers significantly impact the performance of AllReduce-based synchronous training systems, because in such systems all nodes participate in average-gradient computation and communication; any straggler therefore delays the entire AllReduce operation. To solve this problem, we can use a parameter server that realizes *asynchronous training* of models.

In an asynchronous training system, all training servers have the same model parameter replica at the outset of training. During training, as soon as a training server finishes computing its gradients, it immediately pushes the results to the parameter server. Based on the received gradients, the parameter server immediately updates the model parameters and has the training server pull the latest parameters. In this process, different training servers are likely to use model parameters of different versions for gradient computation. While this may negatively affect model accuracy, it enables training servers to push and pull parameters at their own speeds rather than waiting for their peers. In this sense, stragglers do not affect the performance of the entire cluster.
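A hypothetical asynchronous variant of the earlier sketch applies each pushed gradient immediately and bumps a version counter, so a fast worker never waits for a straggler, at the cost that workers may compute gradients against stale parameter versions.

```python
class AsyncParameterServer:
    def __init__(self, params, lr=0.5):
        self.params = list(params)
        self.lr = lr
        self.version = 0           # how many updates have been applied

    def push(self, grad):
        # Apply immediately; there is no barrier across training servers.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]
        self.version += 1

    def pull(self):
        # A puller may observe any version, depending on arrival order.
        return self.version, list(self.params)

ps = AsyncParameterServer([1.0, 1.0])
ps.push([1.0, 0.0])                      # fast worker's gradient, applied at once
assert ps.pull() == (1, [0.5, 1.0])
ps.push([0.0, 1.0])                      # straggler's gradient arrives later
assert ps.pull() == (2, [0.5, 0.5])      # possibly computed on stale parameters
```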
### Training Sparse Models

A substantial number of large-scale machine learning models exhibit *sparsity*: only a subset of their parameters is activated when a model training or inference request is processed. An illustrative example can be found in recommender systems, where a sizable embedding table is stored on parameter servers. In response to an inference request for a specific user, the parameter server retrieves only the embedding pertinent to that user. A similar scenario can be observed in mixture-of-expert models, in which a limited number of experts are activated to process input data, contingent on the data's characteristics.

Parameter servers can be especially beneficial in streamlining the training of sparse machine learning models. This advantage stems from the ability to store the sparse models on the parameter servers, leaving the dense models --- often neural networks --- on the training servers where sophisticated hardware accelerators are deployed. Operating with a lower resource footprint, parameter servers mainly require an adequate supply of memory and network resources, rather than the expensive parallel cores of CPUs and GPUs. As a result, this approach significantly cuts costs when accommodating large sparse models, in contrast to the costlier strategy of relying solely on GPU servers --- coordinated through collective communication --- to host both sparse and dense models.
## Model Replication

In this section, we discuss the ways parameter servers utilize model replication to address issues related to data hotspots and server failures.
### Addressing Data Hotspots

Data on the internet typically follows a power-law distribution, which means that certain parameters are accessed more often than others during training. For instance, the embedding item of a widely popular commodity may be pulled by training servers much more frequently than that of a less popular commodity. This disparity can leave a parameter server that stores such popular data burdened with a disproportionately high volume of data pull and push requests, leading to data hotspots that undermine system scalability.

To mitigate data hotspots, a machine learning cluster can monitor the access frequency of each model parameter. It can then create multiple replicas of frequently accessed parameters, distributing them across different parameter servers. To facilitate this, a router is created that directs each parameter query to an appropriate parameter replica. Within this router, strategies such as random routing or round-robin routing can be implemented to ensure a balanced access workload across all replicas.
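The round-robin strategy mentioned above can be sketched in a few lines: a hot parameter is replicated on several parameter servers, and the router cycles through them so that no single server absorbs all the traffic. Server names and the class are illustrative assumptions.

```python
class RoundRobinRouter:
    def __init__(self, replicas):
        self.replicas = replicas   # parameter servers holding a copy
        self.next = 0

    def route(self):
        """Return the parameter server that should serve the next query."""
        server = self.replicas[self.next % len(self.replicas)]
        self.next += 1
        return server

# A hot embedding is replicated on three parameter servers.
router = RoundRobinRouter(["ps-0", "ps-1", "ps-2"])
picks = [router.route() for _ in range(6)]
assert picks == ["ps-0", "ps-1", "ps-2", "ps-0", "ps-1", "ps-2"]
```

Random routing would replace the counter with a uniform choice over `replicas`; round-robin gives a deterministic, perfectly even spread, while random routing avoids shared counter state across router instances.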
### Managing Server Failures

Parameter servers are typically deployed for extended periods, enabling training servers or inference servers to continually query and update parameters. During this time, some parameter servers may experience failures due to hardware issues (such as disks, memory, and processors) or network partitions caused by network switch failures or network misconfigurations.

To combat server failures, parameter servers can create replicas of all parameters and distribute these replicas across different servers. This distribution decreases the chance that these servers will fail simultaneously. Generally, these replicas are located on servers placed in separate racks, clusters, and data centers to further minimize risk.
### Maintaining Replica Consistency

Both training and inference servers can update a parameter replicated on different servers. To ensure consistency among these replicas, parameter servers must employ a replication protocol to coordinate simultaneous updates on parameter replicas. A commonly utilized protocol is leader-follower replication. This protocol designates one of the replicas as the leader and synchronizes all update operations from training servers to this leader replica before propagating the updates to the follower replicas.
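A minimal sketch of leader-follower replication follows: every write is applied on the leader first and then propagated to the followers, leaving all copies consistent. Propagation is shown as synchronous and failure-free for brevity; the class names and key format are illustrative, and real protocols such as Raft also handle leader failure and replica lag.

```python
class Replica:
    def __init__(self):
        self.params = {}

class LeaderFollowerGroup:
    def __init__(self, num_followers):
        self.leader = Replica()
        self.followers = [Replica() for _ in range(num_followers)]

    def write(self, key, value):
        self.leader.params[key] = value   # apply on the leader first
        for f in self.followers:          # then propagate to followers
            f.params[key] = value

    def read(self, key):
        # Any replica may serve reads once propagation completes.
        return self.followers[0].params[key]

group = LeaderFollowerGroup(num_followers=2)
group.write("embedding/42", [0.1, 0.2])
assert group.read("embedding/42") == [0.1, 0.2]
# All replicas hold identical parameters after the write.
assert all(f.params == group.leader.params for f in group.followers)
```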
Deciding on the leader replica and synchronizing updates between the leader and follower replicas are enduring challenges in the field of distributed systems. To address these challenges, practitioners have developed numerous robust algorithms, such as Paxos and Raft.

Moreover, striking a balance between availability and consistency when replicating updates is another key concern. A strong-consistency replication protocol, like chain replication, may cause the training servers' push requests to fail, making the parameter servers unavailable. On the other hand, adopting a weak-consistency replication protocol might result in replicas storing inconsistent parameters. To counter this, recent systems have introduced weak-consistency replication protocols, such as Adam and Ekko, that leverage machine learning workload characteristics to reduce the communication cost of synchronizing replicas. For example, Microsoft's Adam introduces a two-phase commit protocol for accelerating parameter synchronization, while Ekko features a decentralized algorithm in which parameter servers analyze model updates based on gradient magnitude; Ekko then prioritizes the synchronization requests that are more likely to affect the quality of model inference.

33 v1/en_chapters/chapter_distributed_training/summary.md (new file)
@@ -0,0 +1,33 @@
# Chapter Summary

1. The advent of large-scale machine learning models has sparked an exponential increase in the need for computational power and memory, leading to the emergence of distributed training systems.

2. Distributed training systems often utilize data parallelism, model parallelism, or a combination of both, based on memory limitations and computational constraints.

3. Pipeline parallelism is another technique adopted by distributed training systems, which involves partitioning a mini-batch into micro-batches and overlapping the forward and backward propagation of different micro-batches.

4. Although distributed training systems usually run in compute clusters, the cluster networks sometimes lack sufficient bandwidth for transmitting the substantial gradients produced during training.

5. To meet the demand for communication bandwidth, machine learning clusters integrate heterogeneous high-performance networks, such as NVLink, NVSwitch, and InfiniBand.

6. To accomplish synchronous training of a machine learning model, distributed training systems frequently employ a range of collective communication operators, among which the AllReduce operator is popularly used for aggregating the gradients computed by distributed nodes.

7. Parameter servers play a crucial role in facilitating asynchronous training and sparse model training. Moreover, they leverage model replication to address issues related to data hotspots and server failures.

245 v1/en_chapters/chapter_explainable_AI/explainable_ai.md (new file)
@@ -0,0 +1,245 @@
## Background

Throughout human history, technological progress, production relations, and the development of ethical regulations have evolved dynamically. When a new technology achieves a breakthrough in the laboratory, the resulting changes in value creation sequentially impact commodity forms, production relations, and other aspects. At the same time, once the value gains brought by a new technology are recognized, the organizational forms of business logic, in their spontaneous adjustment process, also place demands on the path, content, and even pace of technological development, and new ethical regulations are adapted as these demands are met. Through such interactions, technological systems and social systems resonate and co-evolve --- this is what constitutes a technological revolution.
Over the past decade, driven by the cost-performance ratio of computational power and data scale surpassing critical thresholds, connectionist model architectures represented by deep neural networks and statistical learning paradigms (hereinafter referred to as deep learning) have achieved breakthrough advances in feature representation capabilities, greatly advancing the development of artificial intelligence and achieving remarkable results in many scenarios. For example, face recognition accuracy has reached over 97%, and Google's intelligent voice assistant achieved a 92.9% correct response rate in 2019 tests. In these typical scenarios, deep learning's intelligent performance has surpassed that of ordinary humans (and even experts), reaching a tipping point for technology replacement. In recent years, in domains where business logic is technology-friendly or where ethical regulations are temporarily sparse --- such as security, real-time scheduling, process optimization, competitive gaming, and information feed distribution --- artificial intelligence and deep learning have achieved rapid technical and commercial breakthroughs.

Having tasted success, no domain wants to miss out on the benefits of technological progress. However, when the commercial application of deep learning enters domains that are technology-sensitive and closely related to human survival or safety --- such as the high-risk scenarios of autonomous driving, finance, healthcare, and the judiciary --- the existing business logic encounters resistance during technology replacement, leading to slowdowns or even failures in commercialization. The root cause is that the business logic and underlying ethical regulations of these scenarios center on stable, traceable accountability and responsibility distribution; yet the models produced by deep learning are black boxes, from whose structure or weights we cannot extract any information about model behavior, rendering the accountability mechanisms in these scenarios inoperative and causing technical and structural difficulties for AI in business applications.

Here are two specific examples. Example 1: in a financial risk-control scenario, a deep learning model identifies a small subset of users suspected of fraud, but the business department does not dare to act directly on these results. Because people cannot understand how the results were obtained, they cannot determine whether the results are accurate. Moreover, the results lack clear evidence and, if acted upon, cannot be justified to regulatory agencies. Example 2: in the medical field, a deep learning model determines that a patient has tuberculosis based on the patient's test data, but the doctor does not know how the diagnosis was reached and does not dare to adopt it directly, instead relying on their own experience, carefully reviewing the relevant test data, and then making their own judgment. These two examples demonstrate that black-box models seriously hinder the application and promotion of models in real-world scenarios.
Moreover, model interpretability has attracted national-level attention, with relevant institutions issuing related policies and regulations:

- In July 2017, the State Council issued the "New Generation Artificial Intelligence Development Plan," which for the first time encompassed explainable AI.

- In March 2021, the People's Bank of China released the financial industry standard "Evaluation Specification for Financial Applications of Artificial Intelligence Algorithms," which set explicit requirements for the interpretability of AI models in the financial industry.

- In August 2021, the Cyberspace Administration of China issued the "Provisions on the Management of Algorithmic Recommendations for Internet Information Services," proposing requirements for the interpretability of algorithmic recommendations in the internet industry.

- In September 2021, the Ministry of Science and Technology released the "Ethical Norms for New Generation Artificial Intelligence."

Therefore, from both the commercial and regulatory perspectives, we need to open up black-box models and provide explanations for them. Explainable AI is precisely the technology that addresses this class of problems.
## Definition of Explainable AI

According to DARPA (the Defense Advanced Research Projects Agency), as shown in :numref:`xai_concept`, the concept of explainable AI is this: unlike existing AI systems, explainable AI systems can address the problems users face with black-box models, enabling users to know not only *what*, but also *why*.


:width:`800px`
:label:`xai_concept`
However, neither academia nor industry has a unified definition of explainable AI (eXplainable AI, XAI). Here we list three typical definitions for discussion:

- Interpretability is the desire to directly understand the working mechanism of a model, breaking open the black box of artificial intelligence.

- Explainable AI provides human-readable and understandable explanations for decisions made by AI algorithms.

- Explainable AI is a set of methods that ensures humans can easily understand and trust the decisions made by AI agents.

Based on our practical experience and understanding, we define explainable AI as a collection of techniques oriented toward machine learning (primarily deep neural networks), including visualization, data mining, logical reasoning, and knowledge graphs. The purpose is to use this collection of techniques to make deep neural networks exhibit a certain degree of understandability, so as to satisfy the information needs (such as causal or background information) of relevant users regarding models and application services, thereby establishing users' cognitive-level trust in AI services.
## Overview of Explainable AI Algorithms

With the emergence of the concept of explainable AI, XAI has received increasing attention from both academia and industry. The figure below shows the trend of explainable AI keywords in top academic conferences in the field of artificial intelligence. To provide readers with a holistic understanding of existing explainable AI algorithms, we summarize and categorize the types of XAI algorithms with reference to :cite:`2020tkde_li`, as shown in :numref:`XAI_methods`.


:width:`800px`
:label:`XAI_methods`
There are diverse methods for explaining models. Here, based on whether the explanation process introduces external knowledge beyond the dataset, we divide them into data-driven explanation methods and knowledge-aware explanation methods.

**Data-Driven Explanations**

Data-driven explanations refer to methods that generate explanations purely from the data itself, without requiring external information such as prior knowledge. To provide explanations, data-driven methods typically start by selecting a dataset (with a global or local distribution). The selected dataset or its variants are then fed into the black-box model (in some cases, selecting a dataset is not necessary; for example, in the maximum activation method proposed by :cite:`erhan2009visualizing`), and explanations are generated through some analysis of the black-box model's corresponding predictions (e.g., computing derivatives of predictions with respect to input features). Based on the scope of interpretability, these methods can be further divided into global methods and local methods --- that is, whether they explain the global model behavior across all data points or the behavior of a subset of predictions. In particular, instance-based methods provide a special type of explanation: they directly return data instances as explanations. Although from the perspective of explanation scope, instance-based methods could also be classified as global (representative samples) or local (counterfactuals), we list them separately to emphasize their distinctive way of providing explanations.
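One concrete instance of "computing derivatives of predictions with respect to input features" is a finite-difference saliency score for a toy black-box model. The stand-in model and helper below are assumptions for illustration; features with larger absolute derivatives matter more for this particular prediction, yielding a local, data-driven explanation.

```python
def black_box(x):
    # Stand-in model: the second feature dominates the prediction.
    return 0.5 * x[0] + 3.0 * x[1] + 0.1 * x[2]

def saliency(model, x, eps=1e-6):
    """Approximate |d model(x) / d x_i| for each input feature i."""
    scores = []
    base = model(x)
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps                       # perturb one feature
        scores.append(abs(model(bumped) - base) / eps)
    return scores

x = [1.0, 1.0, 1.0]
scores = saliency(black_box, x)
# The derivatives are roughly [0.5, 3.0, 0.1], so feature 1 is the most
# important feature for this prediction.
assert max(range(3), key=lambda i: scores[i]) == 1
```

Gradient-based saliency methods for neural networks follow the same idea but obtain the derivatives exactly via backpropagation instead of finite differences.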
|
||||
|
||||
Global methods aim to provide an understanding of the model logic and complete reasoning for all predictions, based on a holistic view of its features, learned components, and structure. Several directions can be explored for global interpretability. For ease of understanding, we divide them into the following three subcategories:
|
||||
(i)
|
||||
Model extraction---extracting an interpretable model from the original black-box model, for example, distilling the original black-box model into an interpretable decision tree through model distillation :cite:`frosst2017distilling` :cite:`zhang2019interpreting`, thereby using the rules in the decision tree to explain the original model;
|
||||
(ii)
|
||||
Feature-based methods---estimating feature importance or relevance, as shown in :numref:`xai_global_feature_importance`.
|
||||
This type of explanation can provide explanations such as "credit overdue records are the most important feature relied upon by the model," thereby helping to determine whether the model has bias. A typical global feature explanation method is SHAP (which can only output global explanations for tree models) :cite:`lundberg2017unified`.
|
||||
(iii) Transparent model design---modifying or redesigning black-box models to improve their interpretability. This class of methods has also gradually become a research hotspot, with recent related work including ProtoPNet :cite:`chen2019looks`, Interpretable CNN :cite:`zhang2018interpretable`, ProtoTree :cite:`nauta2021neural`, etc.
|
||||
|
||||

|
||||
:width:`800px`
|
||||
:label:`xai_global_feature_importance`
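
The feature-based idea above can be sketched as permutation importance: shuffle one feature column at a time and measure how much the model's accuracy drops across the whole dataset. This is a generic illustration of global feature importance, not the SHAP algorithm; the toy predictor and data are invented for the example.

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Global importance: accuracy drop when each feature column is shuffled."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    importances = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the link between feature j and the target
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        importances.append(base - accuracy(X_perm))
    return importances

# Toy black box that only looks at feature 0, so feature 1 should score 0.
predict = lambda row: int(row[0] > 0.5)
X = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
y = [1, 0, 1, 0]
imp = permutation_importance(predict, X, y, n_features=2)
print(imp)
```

Because the toy model ignores feature 1, its importance is exactly zero, which is the kind of sanity check this class of explanations supports.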

Global explanations can provide an overall understanding of the black-box model. However, due to the high complexity of black-box models, in practice it is often difficult to obtain simple transparent models with behavior similar to the original model through model extraction/design, and it is often difficult to abstract unified feature importance across the entire dataset. Furthermore, global explanations also lack local fidelity when generating explanations for individual observations, as globally important features may not accurately explain decisions for individual samples. Therefore, local methods have become an important research direction in recent years. Local methods attempt to verify the reasonableness of model behavior for individual instances or a set of instances. When focusing only on local behavior, complex models can become simple, so even simple functions can provide highly credible explanations for local regions. Based on the process of obtaining explanations, local methods can be divided into two categories: local approximation and propagation-based methods.

Local approximation generates understandable sub-models by simulating the behavior of the black-box model in the neighborhood of a sample. Compared to model extraction in global methods, local approximation only needs to focus on the neighborhood of the sample, making it easier to obtain sub-models that accurately describe local behavior. As shown in :numref:`xai_lime`, by generating $m$ data points $(x_i^\prime, f(x_i^\prime))$ for $i=1,2,\ldots,m$ (where $f$ is the black-box model decision function) near the data point of interest $x$, and linearly fitting these data points, we can obtain a linear model $g=\sum_{i=1}^k w_i x^i$, where $k$ represents the feature dimensionality of the data. The weights $w_i$ in the linear model can then be used to represent the importance of the $i$-th feature of data $x$ for model $f$.

![Local linear approximation (LIME)](../img/ch11/ch13/xai-lime.png)
:width:`800px`
:label:`xai_lime`
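
The local-approximation recipe above can be sketched in a few lines of NumPy: sample perturbations around $x$, query the black box, and read the local feature importances off a least-squares linear fit. The quadratic black-box function here is invented for illustration and is not the LIME library.

```python
import numpy as np

def local_linear_explanation(f, x, n_samples=500, radius=0.1, seed=0):
    """Fit g(x') ~ f(x') on perturbations around x; return linear weights."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(scale=radius, size=(n_samples, x.size))
    y = np.array([f(p) for p in X])
    # Least-squares fit with an intercept column appended.
    A = np.hstack([X, np.ones((n_samples, 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w[:-1]  # per-feature weights (intercept dropped)

# Toy black box: f(x) = 3*x0 - 2*x1 + x0*x1, explained at the point (1, 1).
f = lambda p: 3.0 * p[0] - 2.0 * p[1] + p[0] * p[1]
w = local_linear_explanation(f, np.array([1.0, 1.0]))
print(w)  # near (1, 1) the local slopes are roughly 4 and -1
```

At $(1,1)$ the true partial derivatives are $3 + x_1 = 4$ and $-2 + x_0 = -1$, so the recovered weights act as the local importances the text describes.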

Propagation-based methods typically propagate certain information to directly locate relevant features. These methods include backpropagation-based methods and forward propagation-based methods. Backpropagation-based methods attribute the output contributions to input features through gradient backpropagation. As shown in :numref:`xai_gradient_based`, through gradient backpropagation, the gradient of the model output with respect to the input $\frac{d(f(x))}{dx}$ is computed as the model explanation. Common gradient propagation-based methods include the basic Gradient method, GuidedBackprop :cite:`zeiler2014visualizing`, GradCAM :cite:`selvaraju2017grad`, etc. Forward propagation-based methods quantify the correlation between outputs and features by perturbing features and observing the differences in forward inference outputs. Common methods in this category include RISE :cite:`petsiuk2018rise`, ScoreCAM :cite:`wang2020score`, etc.

![Gradient-based saliency explanation](../img/ch11/ch13/xai-gradient-based.png)
:width:`800px`
:label:`xai_gradient_based`
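
The forward-propagation idea can be sketched without any deep-learning framework: occlude one input position at a time and record how much the model's score drops. The 1-D "image" and scoring function below are invented for illustration; methods like RISE apply the same principle with random masks over real images.

```python
def occlusion_saliency(score, x, baseline=0.0):
    """Score drop when each input position is replaced by a baseline value."""
    base = score(x)
    saliency = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline  # perturb one feature, keep the rest
        saliency.append(base - score(occluded))
    return saliency

# Toy scorer that mostly depends on positions 1 and 2.
score = lambda v: 2.0 * v[1] + 1.0 * v[2] + 0.1 * sum(v)
x = [1.0, 1.0, 1.0, 1.0]
s = occlusion_saliency(score, x)
print(s)  # positions 1 and 2 dominate the saliency
```

Position 1 receives the largest saliency because occluding it removes the largest share of the score, mirroring how perturbation maps highlight decisive pixels.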

**Knowledge-Aware Explanations**

Data-driven explanation methods can provide comprehensive explanations from datasets or the relationships between inputs and outputs. Building on this, external knowledge can also be leveraged to enrich explanations and make them more human-friendly. Laypersons without machine learning background knowledge may find it difficult to directly understand feature importance and the connections between features and targets. With external domain knowledge, we can not only generate explanations indicating feature importance, but also describe why certain features are more important than others. Therefore, knowledge-aware explainable AI methods have attracted increasing attention in recent years. Compared to raw datasets collected from multiple scenarios, knowledge is typically regarded as entities or relationships derived from human life experience or rigorous theoretical reasoning. Generally, knowledge can take many forms. It can reside in people's minds, or be recorded in natural language, audio, or rules with strict logic. To systematically review these methods, we categorize them based on knowledge sources into two types: general knowledge methods and knowledge base (KB) methods. The former uses unstructured data as a knowledge source to construct explanations, while the latter uses structured knowledge bases as the foundation for building explanations.

A relatively straightforward approach to providing knowledge is through human involvement. In fact, with the explosive growth of AI research and applications, the critical role of humans in AI systems has gradually become apparent. Such systems are called human-centered AI systems. :cite:`riedl2019human` argue that human-centered AI can not only enable AI systems to better understand humans from a sociocultural perspective, but also enable AI systems to help humans understand themselves. To achieve these goals, AI needs to satisfy several properties including interpretability and transparency.

Specifically, humans can play a role in AI systems by providing a considerable number of human-defined concepts. :cite:`kim2018interpretability` uses Concept Activation Vectors (CAV) to test the importance of concepts in classification tasks (TCAV). A CAV is a vector perpendicular to the decision boundary between the activation and non-activation of a target concept of interest. This vector can be obtained as follows: input positive and negative samples of the target concept, perform linear regression to get the decision boundary, and thereby obtain the CAV. Taking the "stripes" concept for "zebra" as an example, the user first collects data samples containing "stripes" and data samples not containing "stripes," feeds them into the network, obtains the activation values of intermediate layers, fits these based on positive and negative sample labels ($1$ for containing the concept, $0$ for not containing the concept) to the intermediate layer activation values, obtains the decision boundary, and the CAV is the perpendicular vector to this decision boundary.
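
The CAV recipe above can be sketched as the weight vector of a linear classifier trained to separate concept from non-concept activations: the weights are normal to the decision boundary, which is exactly the CAV. A few perceptron updates suffice on the invented, linearly separable toy activations below (the real TCAV pipeline uses activations from an actual network layer).

```python
def fit_cav(pos_acts, neg_acts, epochs=20, lr=0.1):
    """Perceptron weights separating concept (label 1) from non-concept (0).
    The weight vector is normal to the decision boundary, i.e. the CAV."""
    dim = len(pos_acts[0])
    w, b = [0.0] * dim, 0.0
    data = [(a, 1) for a in pos_acts] + [(a, 0) for a in neg_acts]
    for _ in range(epochs):
        for a, label in data:
            pred = 1 if sum(wi * ai for wi, ai in zip(w, a)) + b > 0 else 0
            err = label - pred
            w = [wi + lr * err * ai for wi, ai in zip(w, a)]
            b += lr * err
    return w

# Toy layer activations: "striped" inputs activate unit 0 strongly.
pos = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
neg = [[0.1, 0.9], [0.2, 0.7], [0.0, 1.0]]
cav = fit_cav(pos, neg)
print(cav)
```

The learned CAV points toward unit 0, the direction along which the "stripes" concept activates.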

As shown in :numref:`xai_tcav`, to compute the TCAV score, the "concept sensitivity" representing the importance of a concept at layer $l$ for class $k$ prediction can first be computed as the directional derivative $S_{C,k,l}(\mathbf{x})$:

$$\begin{split}
S_{C,k,l}(\mathbf{x}) = &\lim_{\epsilon\rightarrow 0}\frac{h_{l,k}(f_{l}(\mathbf{x})+\epsilon \mathbf{v}^{l}_{C})-h_{l,k}(f_{l}(\mathbf{x}))}{\epsilon} \\ = &\nabla h_{l,k}(f_{l}(\mathbf{x})) \cdot \mathbf{v}^{l}_{C}
\end{split}$$

where $f_{l}(\mathbf{x})$ is the activation at layer $l$, $h_{l,k}(\cdot)$ is the logit for class $k$, and $\nabla h_{l,k}(\cdot)$ is the gradient of $h_{l,k}$ w.r.t. the activations at layer $l$. $\mathbf{v}^{l}_{C}$ is the CAV for concept $C$ that the user aims to explore. Positive (or negative) sensitivity indicates that concept $C$ has a positive (or negative) influence on the model's prediction for the input.

Based on $S_{C,k,l}$, the TCAV score can then be obtained by computing the ratio of class-$k$ samples with positive $S_{C,k,l}$:

$$\textbf{TCAV}_{Q_{C,k,l}}=\frac{\vert \{\mathbf{x}\in X_{k}:S_{C,k,l}(\mathbf{x})>0\}\vert}{\vert X_{k}\vert}$$

Combined with a hypothesis test based on the $t$-distribution, a value of $\textbf{TCAV}_{Q_{C,k,l}}$ greater than 0.5 indicates that concept $C$ has a significant influence on class $k$.

![Computing the TCAV score](../img/ch11/ch13/xai-tcav.png)
:width:`800px`
:label:`xai_tcav`
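
Given per-sample gradients $\nabla h_{l,k}$ and a CAV, the TCAV score in the equation above is simply the fraction of class-$k$ samples whose directional derivative is positive. A sketch with invented gradient vectors:

```python
def tcav_score(gradients, cav):
    """Fraction of samples whose directional derivative grad . cav is positive."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    positive = sum(1 for g in gradients if dot(g, cav) > 0)
    return positive / len(gradients)

cav = [1.0, 0.0]  # concept direction along the first activation unit
grads = [[0.5, -0.2], [0.3, 0.9], [-0.1, 0.4], [0.8, 0.0]]
tcav = tcav_score(grads, cav)
print(tcav)  # → 0.75: 3 of 4 gradients align with the concept
```

A score well above 0.5 (here 0.75) suggests the concept positively influences the class, subject to the significance test mentioned above.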

Human knowledge can be subjective, while a KB can be objective. In current research, a KB is usually modeled as a Knowledge Graph (KG). The following uses the explainable recommendation model TB-Net, supported by MindSpore, as an example to explain how to build an explainable model using knowledge graphs. Knowledge graphs can capture rich semantic relationships between entities. One of TB-Net's objectives is to identify which pair of entities (i.e., item-item) has the most significant influence on the user, and through which relationships and key nodes they are connected. Unlike existing KG embedding-based methods (RippleNet uses KG completion methods to predict paths between users and items), TB-Net extracts real paths to achieve high accuracy and superior interpretability of recommendation results.

![TB-Net framework](../img/ch11/ch13/tb-net.png)
:width:`800px`
:label:`tb_net`

The framework of TB-Net is shown in :numref:`tb_net`: where $i_c$ represents the candidate item to be recommended, $h_n$ represents items that the user has interacted with in their history, $r$ and $e$ represent relations and entities in the knowledge graph, and their vectorized representations are concatenated to form relation matrices and entity matrices. First, TB-Net constructs a subgraph for user $u$ by connecting $i_c$ and $h_n$ through shared attribute values. Each pair of $i_c$ and $h_n$ is connected by a path composed of relations and entities. Then, TB-Net's bidirectional path propagation method propagates the computation of item, entity, and relation vectors from the left and right sides of the path to the middle node, computing the probability that the two directional flows converge at the same intermediate entity. This probability is used to represent the user's preference for the intermediate entity and serves as the basis for explanations. Finally, TB-Net identifies key paths (i.e., key entities and relations) in the subgraph, outputting recommendation results and explanations with semantic-level detail.
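
The subgraph construction step described above relies on enumerating paths that connect the candidate item to historical items through shared attribute entities. A minimal breadth-first sketch on an invented toy graph (this is an illustration of the idea, not TB-Net's actual implementation):

```python
from collections import deque

def connecting_paths(graph, candidate, history, max_hops=2):
    """Enumerate alternating (item, relation, entity, ...) paths from the
    candidate item to any historical item, breadth-first."""
    paths, queue = [], deque([[candidate]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in history and len(path) > 1:
            paths.append(path)   # reached a historical item: record the path
            continue
        if len(path) >= 2 * max_hops + 1:
            continue             # hop limit reached
        for relation, neighbor in graph.get(node, []):
            if neighbor not in path:  # avoid cycles
                queue.append(path + [relation, neighbor])
    return paths

# Toy knowledge graph: items link to shared attribute entities via relations.
graph = {
    "Team Fortress 2": [("developer", "Valve"), ("genre", "Action")],
    "Valve": [("developer", "Half-Life")],
    "Action": [("genre", "DOTA 2")],
}
paths = connecting_paths(graph, "Team Fortress 2", {"Half-Life", "DOTA 2"})
print(paths)
```

Each returned path alternates items, relations, and entities, which is exactly the raw material that TB-Net's bidirectional propagation scores to pick out key paths.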

Take game recommendation as a scenario, where a new game is recommended to a user, as shown in :numref:`xai_kg_recommendation`; Half-Life, DOTA 2, Team Fortress 2, etc. are game titles. In the relation attributes, game.year represents the game release year, game.genres represents game genre, game.developer represents the game developer, and game.categories represents game categories. In the attribute nodes, MOBA stands for Multiplayer Online Battle Arena, Valve is the Valve Corporation, Action stands for action genre, Multi-player stands for multiplayer mode, Valve Anti-Cheat enabled represents the Valve Anti-Cheat system, Free means free-to-play, and Cross-Platform means cross-platform support. The games on the right are games the user has played according to their history. The correctly recommended game in the test data is "Team Fortress 2."

![Knowledge-graph-based game recommendation example](../img/ch11/ch13/xai-kg-recommendation.png)
:width:`800px`
:label:`xai_kg_recommendation`

In :numref:`xai_kg_recommendation`, there are two highlighted relevance probabilities (38.6%, 21.1%), which are the probabilities of key paths being activated during the recommendation process as computed by the model. The red arrows highlight the key path from "Team Fortress 2" to the historical item "Half-Life." This shows that TB-Net can recommend items to users through various relational connections and identify key paths as explanations. Therefore, the explanation for recommending "Team Fortress 2" to the user can be translated into a fixed narrative: "Team Fortress 2 is an action, multiplayer online, shooting video game developed by game company Valve. It is highly correlated with the game Half-Life that the user has played before."

## Explainable AI Systems and Practice

As the demand for explainability grows rapidly across various domains, an increasing number of enterprises are integrating explainable AI toolkits to provide users with fast and convenient explainability solutions. The mainstream toolkits currently available in the industry include:
- TensorFlow team's What-If Tool, which allows users to explore trained models without writing any code, enabling non-developers to participate in model tuning.
- IBM's AIX360, which provides multiple explanation and measurement methods to evaluate model interpretability and trustworthiness across different dimensions.
- Facebook's PyTorch team's Captum, which offers multiple mainstream explanation methods for image and text scenarios.
- Microsoft's InterpretML, which allows users to train different white-box models and explain black-box models.
- SeldonIO's Alibi, which focuses on inspecting model internals and decision explanations, providing implementations of various white-box, black-box, single-sample, and global explanation methods.
- Huawei MindSpore's XAI tool, which provides data tools, explanation methods, white-box models, and measurement methods, offering users explanations at different levels (local, global, semantic-level, etc.).

This section uses the MindSpore XAI tool as an example to explain how to use explainable AI tools in practice to provide explanations for image classification models and tabular data classification models, thereby helping users understand models for further debugging and optimization. The architecture of the MindSpore XAI tool is shown below. It is an explainability tool built on the MindSpore deep learning framework and can be deployed on Ascend and GPU devices.

![MindSpore XAI architecture](../img/ch11/ch13/mindspore-xai.png)
:width:`800px`
:label:`mindspore_xai`

To use MindSpore Explainable AI, readers first need to install the MindSpore XAI package via pip (supporting MindSpore 1.7 or above, GPU and Ascend processors; recommended to use with JupyterLab):

```bash
pip install mindspore-xai
```

In the MindSpore XAI [official tutorial](https://www.mindspore.cn/xai/docs/zh-CN/r1.8/index.html), detailed instructions on how to install and use the provided explanation methods are available for readers to consult.

### MindSpore XAI Tool for Image Classification Explanation

Below is a code example using GradCAM, a saliency-map visualization method supported in MindSpore XAI version 1.8. Readers can refer to the [official tutorial](https://www.mindspore.cn/xai/docs/zh-CN/1.8/using_cv_explainers.html) to obtain the demo dataset, model, and complete script code.

```python
from mindspore_xai.explainer import GradCAM

# Typically specify the last convolutional layer
grad_cam = GradCAM(net, layer="layer4")

# 3 is the ID for the 'boat' class
saliency = grad_cam(boat_image, targets=3)
```

If the input is an image tensor of dimension $1\times 3\times 224\times 224$, the returned saliency is a saliency map tensor of dimension $1\times 1\times 224\times 224$. Below we present several examples demonstrating how to use explainable AI capabilities to better understand the prediction results of image classification models, identify the key feature regions used as the basis for classification predictions, and thereby judge the reasonableness and correctness of the classification results to accelerate model optimization.

![Correct prediction with correct key features](../img/ch11/ch13/correct-correct.png)
:width:`400px`
:label:`correct_correct`

In the figure above, the predicted label is "bicycle," and the explanation result shows that the key features relied upon are on the wheels, indicating that this classification judgment basis is reasonable and the model can be preliminarily deemed trustworthy.

![Correct prediction with incorrect key features](../img/ch11/ch13/correct-wrong.png)
:width:`400px`
:label:`correct_wrong`

In the figure above, one of the predicted labels is "person," which is correct. However, in the explanation, the highlighted region is on the horse's head, so the key feature basis is likely incorrect, and the reliability of this model needs further verification.

![Incorrect prediction with incorrect key features](../img/ch11/ch13/wrong-wrong.png)
:width:`400px`
:label:`wrong_wrong`

In the figure above, the predicted label is "boat," but there is no boat in the original image. Through the explanation result on the right side of the figure, we can see that the model used the water surface as the key basis for classification to arrive at the prediction "boat"---this basis is incorrect. By analyzing the subset of the training dataset labeled "boat," it was found that the vast majority of images labeled "boat" contain water surfaces, which likely caused the model to mistakenly learn water surfaces as a key feature for the "boat" class during training. Based on this finding, proportionally supplementing images with boats but without water surfaces can significantly reduce the probability of the model misjudging key features during learning.

### MindSpore XAI Tool for Tabular Classification Explanation

MindSpore XAI version 1.8 supports three commonly used tabular data model explanation methods in the industry: LIMETabular, SHAPKernel, and SHAPGradient.

Using LIMETabular as an example, it provides a locally interpretable model to explain individual samples for a complex, hard-to-explain model:

```python
from mindspore_xai.explainer import LIMETabular

# Convert features to feature statistics
feature_stats = LIMETabular.to_feat_stats(data, feature_names=feature_names)

# Initialize the explainer
lime = LIMETabular(net, feature_stats, feature_names=feature_names, class_names=class_names)

# Explain
lime_outputs = lime(inputs, targets, show=True)
```

The explainer displays the decision boundary for classifying the sample as setosa. The returned `lime_outputs` is a structured object representing that decision boundary. Visualizing the explanation yields:

![LIMETabular explanation for the setosa class](../img/ch11/ch13/tabular-lime.png)
:width:`400px`
:label:`tabular_lime`

The above explanation shows that for the setosa classification decision, the most important feature is petal length.

### MindSpore XAI Tool: White-Box Models

In addition to post-hoc explanation methods for black-box models, the XAI tool also provides industry-leading white-box models, enabling users to train on these white-box models so that during inference the model can simultaneously output both inference results and explanations. Taking TB-Net as an example (refer to :numref:`tb_net` and its [official tutorial](https://e.gitee.com/mind_spore/repos/mindspore/xai/tree/master/models/whitebox/tbnet) for usage), this method has been deployed commercially, providing millions of customers with semantic-level explainable financial product recommendation services. TB-Net leverages knowledge graphs to model the attributes of financial products and customers' historical data. In the graph, financial products with common attribute values are connected. The candidate product and the customer's historically purchased or browsed products are connected through common attribute values into paths, forming the customer's subgraph. Then, TB-Net performs bidirectional propagation computation on the paths in the graph to identify key products and key paths as the basis for recommendations and explanations.

An example of explainable recommendation is as follows: in the historical data, the customer has recently purchased or browsed financial products A, B, N, etc. Through TB-Net's bidirectional path propagation computation, it is found that the path (Product P, moderate-to-high annualized return, Product A) and the path (Product P, moderate risk level, Product N) have high weights, making them key paths. At this point, TB-Net outputs the following explanation: "Financial product P is recommended to this customer because its moderate-to-high annualized return and moderate risk level are consistent with financial products A and N that the customer has recently purchased or browsed."

![TB-Net explainable financial product recommendation](../img/ch11/ch13/tbnet-finance.png)
:width:`800px`
:label:`tbnet_finance`

In addition to the explanation methods introduced above, MindSpore XAI also provides a series of measurement methods for evaluating the quality of different explanation methods, and will continue to add white-box models with built-in explanations. Users can directly adopt mature model architectures to quickly build their own explainable AI systems.

## Future of Explainable AI

To further advance research in explainable AI, we summarize several noteworthy research directions here.

First, knowledge-aware XAI still has significant room for expansion, but many open questions remain regarding how to effectively leverage external knowledge. One issue is how to acquire or retrieve useful knowledge from such a vast knowledge space. For example, Wikipedia contains knowledge related to various fields, but if the goal is to solve a medical image classification problem, most Wikipedia entries are irrelevant or noisy, making it difficult to accurately find appropriate knowledge to incorporate into the XAI system.

Furthermore, the deployment of XAI systems urgently needs a more standardized and unified evaluation framework. Building such a framework may require combining different metrics that complement each other, since different metrics may be applicable to different tasks and users; a unified evaluation framework should therefore offer corresponding flexibility.

Finally, we believe that interdisciplinary collaboration will be beneficial. The development of XAI requires not only computer scientists to develop advanced algorithms, but also physicists, biologists, and cognitive scientists to unravel the mysteries of human cognition, as well as domain experts to contribute their domain knowledge.

## References

:bibliography:`../references/explainable.bib`

v1/en_chapters/chapter_explainable_AI/index.md
@@ -0,0 +1,21 @@

# Explainable AI Systems

Over the past decade, driven by the cost-performance ratio of computational power and data scale surpassing critical thresholds, connectionist model architectures represented by deep neural networks and statistical learning paradigms (hereinafter referred to as deep learning) have achieved breakthrough advances in feature representation capabilities, greatly advancing the development of artificial intelligence and achieving remarkable results in many scenarios. For example, face recognition accuracy has reached over 97%, and Google's intelligent voice assistant achieved a 92.9% correct response rate in 2019 tests. In these typical scenarios, deep learning's intelligent performance has surpassed that of ordinary humans (and even experts), reaching a tipping point for technology replacement. In recent years, in domains where business logic is technology-friendly or where ethical regulations are temporarily sparse---such as security, real-time scheduling, process optimization, competitive gaming, and information feed distribution---artificial intelligence and deep learning have achieved rapid technical and commercial breakthroughs.

Having tasted success, no domain wants to miss out on the benefits of technological progress. However, when the commercial application of deep learning enters domains that are technology-sensitive and closely related to human survival or safety---such as autonomous driving, finance, healthcare, and judicial high-risk application scenarios---the existing business logic encounters resistance during technology replacement, leading to slowdowns or even failures in commercialization. The root cause is that the business logic and underlying ethical regulations of these scenarios center on stable, traceable accountability and responsibility distribution; yet the models produced by deep learning are black boxes: no information about model behavior can be extracted from a model's structure or weights. This renders the accountability and responsibility-distribution mechanisms in these scenarios inoperative and causes technical and structural difficulties for AI in business applications. Moreover, model interpretability has attracted national-level attention, with relevant institutions issuing related policies and regulations.

Therefore, from both the commercial-promotion and regulatory perspectives, we need to open up black-box models and provide explanations for them. Explainable AI is precisely the technology that addresses this class of problems.

The learning objectives of this chapter include:

- Understand the goals and application scenarios of explainable AI

- Master the common types of explainable AI methods and their representative techniques

- Reflect on the future development of explainable AI methods

```toc
:maxdepth: 2

explainable_ai
```

v1/en_chapters/chapter_federated_learning/horizontal_fl.md
@@ -0,0 +1,57 @@

## Horizontal Federated Learning

### Horizontal Federation in Cloud-Cloud Scenarios

In a horizontal federated learning system, multiple participants with the same data structure collaboratively build a machine learning model through a cloud server. A typical assumption is that the participants are honest while the server is honest but curious; therefore, no participant is allowed to leak raw gradient information to the server. The training process of such a system typically consists of the following four steps:

Step 1: Participants compute training gradients locally, mask selected gradients using encryption, differential privacy, or secret sharing techniques, and send the masked results to the server.

Step 2: The server performs secure aggregation without learning any participant's gradient information.

Step 3: The server sends the aggregated results back to the participants.

Step 4: Participants update their respective models using the decrypted gradients.
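
The masking in Step 1 can be illustrated with pairwise additive masks, one simple form of secret sharing: each pair of participants derives a shared random mask that one adds and the other subtracts, so each individual update looks random to the server while the masks cancel in the aggregate. This is a toy sketch of the idea, not a production secure-aggregation protocol (which also needs key agreement and dropout handling).

```python
import random

def mask_updates(updates, seed=42):
    """Add pairwise-cancelling random masks to each participant's update."""
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            # Stand-in for a shared secret negotiated by the pair (i, j).
            rng = random.Random(seed * 100003 + i * 1009 + j)
            for k in range(dim):
                m = rng.uniform(-1, 1)
                masked[i][k] += m  # participant i adds the mask
                masked[j][k] -= m  # participant j subtracts it
    return masked

updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked = mask_updates(updates)
# Server-side aggregation (Step 2): masks cancel, only the sum is revealed.
aggregate = [sum(col) for col in zip(*masked)]
print(aggregate)  # ≈ [9.0, 12.0], the true sum of the raw updates
```

The server sees only the masked vectors and their sum, matching the "honest but curious" threat model described above.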

Compared to traditional distributed learning, federated learning faces the challenges of unstable training nodes and high communication costs. These challenges prevent federated learning from synchronizing weights across different training nodes after every single training step, as traditional distributed learning does. To improve the computation-to-communication ratio and reduce the high energy consumption caused by frequent communication, Google proposed the Federated Averaging algorithm (FedAvg) in 2017 :cite:`fedavg`. :numref:`ch10-federated-learning-fedavg` illustrates the overall process of FedAvg. In each training round, clients perform multiple local training steps. Then the server aggregates the weights from multiple clients and computes a weighted average.

![Overall process of FedAvg](../img/ch10/ch10-federated-learning-fedavg.png)
:width:`800px`
:label:`ch10-federated-learning-fedavg`
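
The server-side aggregation in FedAvg reduces to a sample-count-weighted average of client weight vectors. A minimal sketch (the client weights and data sizes are invented for illustration):

```python
def fedavg(client_weights, client_sizes):
    """Server-side FedAvg: average client weight vectors, weighted by the
    number of local training samples each client holds."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]

# Two clients after several local training steps; client 0 has 3x more data.
weights = [[1.0, 2.0], [5.0, 6.0]]
sizes = [30, 10]
avg = fedavg(weights, sizes)
print(avg)  # → [2.0, 3.0]
```

Weighting by sample count keeps the aggregate closer to clients with more data, which is the behavior the figure depicts for each round.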
|
||||
|
||||
### Horizontal Federation in Device-Cloud Scenarios
|
||||
|
||||
The overall process of device-cloud federated learning is the same as cloud-cloud federated learning, but device-cloud federated learning faces additional challenges in the following three aspects:
|
||||
|
||||
1. High communication costs. Unlike cloud-cloud federated learning, the communication overhead in device-cloud federated learning primarily lies in the volume of data per communication round, whereas the overhead in cloud-cloud federated learning mainly lies in the frequency of communication. In device-cloud federated learning scenarios, the typical communication network may be WLAN or mobile data, where network communication speeds can be orders of magnitude slower than local computation, making high communication costs a critical bottleneck for federated learning.
|
||||
|
||||
2. System heterogeneity. Due to variations in client device hardware (CPU, memory), network connections (3G, 4G, 5G, WiFi), and power supply (battery level), each device in the federated learning network may have different storage, computation, and communication capabilities. Limitations of the network and the devices themselves may result in only a subset of devices being active at any given time. Furthermore, devices may encounter unexpected situations such as battery depletion or network disconnection, leading to temporary unavailability. This heterogeneous system architecture affects the formulation of the overall federated learning strategy.

3. Privacy concerns. Since clients in device-cloud federated learning cannot participate in every iteration round, the difficulty of data privacy protection is higher than in other distributed learning methods. Moreover, during the federated learning process, transmitting model update information between devices and the cloud still poses the risk of exposing sensitive information to third parties or the central server. Privacy protection becomes a critical issue that device-cloud federated learning must address.

To address the challenges posed by device-cloud federated learning, MindSpore Federated designed a distributed FL-Server architecture. The system consists of three components: the scheduler module, the server module, and the client module. The system architecture is shown in :numref:`ch10-federated-learning-architecture`. The functionalities of each module are described below:

- Federated Learning Scheduler:

  The Federated Learning Scheduler (FL-Scheduler) assists in cluster networking and is responsible for issuing management tasks.

- Federated Learning Server:

  The Federated Learning Server (FL-Server) provides client selection, time-limited communication, and distributed federated aggregation capabilities. The FL-Server must be able to support tens of millions of devices and handle the access and secure processing logic of edge servers.

- Federated Learning Client:

  The Federated Learning Client (FL-Client) is responsible for local data training and securely encrypts the uploaded weights when communicating with the FL-Server.

![Device-cloud federated architecture](../img/ch10/ch10-federated-learning-architecture.png)
:label:`ch10-federated-learning-architecture`

In addition, MindSpore Federated has designed four key features for device-cloud federated learning:

1. Time-limited communication: After the FL-Server and FL-Client establish a connection, a global timer and counter are initiated. When the FL-Server receives model parameters from FL-Clients that meet a certain proportion of all initially connected FL-Clients within a preset time window, aggregation can proceed. If the proportion threshold is not reached within the time window, the system proceeds to the next iteration. This ensures that even with a massive number of FL-Clients, the entire federated learning process will not stall due to excessively long training times or disconnections of individual FL-Clients.

2. Loosely-coupled networking: An FL-Server cluster is used. Each FL-Server receives and distributes weights to a subset of FL-Clients, reducing the bandwidth pressure on any single FL-Server. Additionally, FL-Clients connect in a loosely coupled manner: the mid-session withdrawal of any FL-Client does not affect the global task, and any FL-Client can obtain the complete data needed for training from any FL-Server at any time.

3. Encryption module: To prevent model gradient leakage, MindSpore Federated deploys multiple encryption algorithms: Local Differential Privacy (LDP), secure aggregation algorithms based on Multi-Party Computation (MPC), and Huawei's proprietary Sign-based Dimension Selection differential privacy algorithm (SignDS).

4. Communication compression module: MindSpore Federated uses quantization and sparsification techniques to compress and encode weights into smaller data formats when the FL-Server distributes model parameters and when FL-Clients upload model parameters, and decodes the compressed data back to the original format at the receiving end.
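The quantize-on-send, decode-on-receive round trip described in point 4 can be sketched in a few lines. This is a minimal illustration of uniform int8 quantization under names of our own choosing, not MindSpore Federated's actual codec:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Uniformly map float32 weights onto int8; return the code plus scale/offset."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 or 1.0          # guard against a constant tensor
    code = np.round((weights - w_min) / scale) - 128
    return code.astype(np.int8), scale, w_min

def dequantize_int8(code: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """Recover approximate float32 weights at the receiving end."""
    return (code.astype(np.float32) + 128) * scale + w_min

w = np.random.randn(1000).astype(np.float32)
code, scale, w_min = quantize_int8(w)
w_rec = dequantize_int8(code, scale, w_min)
assert code.nbytes * 4 == w.nbytes        # payload is 4x smaller than float32
assert np.abs(w - w_rec).max() <= scale   # quantization error is bounded by the step
```

Lower-bit codes shrink the payload further at the cost of a larger quantization step, which is exactly the accuracy/communication trade-off the module has to balance.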
v1/en_chapters/chapter_federated_learning/index.md
@@ -0,0 +1,20 @@
# Federated Learning Systems

In this chapter, we introduce an important branch of deep learning --- federated learning and its related system knowledge. The learning objectives of this chapter include:

- Master the basic definitions of federated learning and become familiar with existing mainstream open-source federated learning frameworks.
- Understand horizontal federated learning algorithms.
- Understand vertical federated learning algorithms.
- Understand federated learning encryption algorithms.
- Understand cutting-edge federated learning algorithms and future research directions.

```toc
:maxdepth: 2

overview
horizontal_fl
vertical_fl
privacy_encryption_algorithm
outlook
summary
```
v1/en_chapters/chapter_federated_learning/outlook.md
@@ -0,0 +1,35 @@
## Outlook

To achieve large-scale commercial deployment of federated learning, substantial research work is still needed. For instance, since we cannot inspect the distributed data in federated learning, it is very difficult to select model hyperparameters and configure optimizers, and we can only resort to simulation-based approaches for model tuning and testing. For deployment on mobile devices, individual users have very little labeled data, and sometimes data labels cannot even be obtained, raising the question of how federated learning can be applied to unsupervised learning. Furthermore, due to inconsistent data distributions across participants, training a single global model makes it difficult to evaluate the model's quality for each participant. Additionally, data has always been a core asset for companies, and different companies have been dedicated to collecting data and creating data silos, so how to effectively incentivize companies or institutions to participate in federated learning systems remains an open question. Below we introduce some efforts undertaken by MindSpore Federated and related work in the field.

**Federated Learning in Heterogeneous Scenarios**

The horizontal and vertical federated learning approaches discussed earlier all involve different participants collaboratively building a shared machine learning model. However, enterprise-level federated learning frameworks often need to adapt to various heterogeneous scenarios, such as data heterogeneity (inconsistent data scales and distributions across different clients), device heterogeneity (inconsistent computing capabilities and communication efficiency across different client devices), and model heterogeneity (inconsistent features learned by different local client models).

Two relatively mainstream directions of work in heterogeneous federated learning scenarios are:

1) Personalized federated learning strategies with local models that are highly robust to heterogeneous data:

Federated learning trains a global model to obtain a globally optimal solution based on all data. However, the data volume and distribution of different participants are different, and in many scenarios, the global model cannot capture the overall picture while also accommodating such differences. When one party's data deviates significantly from the overall distribution, the performance of federated learning may indeed be inferior to that of local training. How to maximize the overall benefit of all participants while also maximizing individual benefits is the goal of personalized federated learning.

Personalized federated learning does not require that all participants ultimately use the same model. For example, it allows each participant to fine-tune the model based on their own data after participating in federated learning, thereby generating a unique personalized model. After personalized fine-tuning, the model typically performs better on the local test set. Under this approach, different participants' models share the same structure but may have different parameters. Some other approaches allow all participants to share the same feature extraction layers but have different task classification layers. Another line of thinking introduces knowledge distillation into federated learning, using the global model from federated learning as the teacher model and the personalized model as the student model, which can alleviate the overfitting problem during personalization.

2) Research on model aggregation strategies for heterogeneous models:

Generally, under the FedAvg federated aggregation paradigm, fewer local iteration training steps and more frequent aggregation lead to better model convergence accuracy, especially when the data across different participating clients is non-IID. However, aggregation incurs communication costs, and there is a trade-off between communication cost and model accuracy in federated learning. Therefore, many researchers focus on designing adaptive aggregation schemes that find the optimal balance between local updates and global communication under a given training time budget to minimize the generalization error of the global model.

**Communication Efficiency Improvement**

In the federated learning process, during each global training round, every participant needs to send the complete parameters to the server, and the server then distributes the aggregated parameters. Modern deep learning networks easily have millions or even more parameters, and transmitting such a large number of parameters incurs enormous communication overhead. To reduce communication overhead, MindSpore Federated has adopted several methods to improve communication efficiency:

1) Intelligent frequency adjustment strategy: Improve federated learning efficiency by changing the number of global model aggregation rounds, reducing the communication overhead required for training tasks to converge. One intuition is that in the early stages of the federated learning process, parameter changes across different participants are relatively consistent, so setting a lower aggregation frequency can reduce communication costs; in the later stages of federated learning, parameter changes across different participants become more inconsistent, so setting a higher aggregation frequency can enable the model to converge quickly.

2) Communication compression scheme: Quantize and sparsify weight differences, i.e., only upload a small portion of quantized weight differences in each communication round. The reason for choosing weight differences for quantization and sparsification is that their distribution is easier to fit than weight values, and they have higher sparsity. Quantization maps float32 data types to int8 or even lower-bit representations, reducing storage and communication overhead on one hand, and enabling better use of compression encoding methods for transmission on the other (such as Huffman coding, finite state entropy coding, etc.). A commonly used sparsification method is Top-K sparsification, which sorts the entries of the weight difference by absolute value and uploads only the k largest-magnitude entries per round. Communication compression schemes generally incur some accuracy loss, and selecting an appropriate k is a challenging problem.
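The Top-K selection just described can be sketched as follows. This is a toy illustration with names of our own choosing; production schemes additionally quantize the surviving values and often apply error feedback:

```python
import numpy as np

def top_k_sparsify(delta: np.ndarray, k: int):
    """Keep only the k entries of the weight difference with the largest magnitude."""
    idx = np.argsort(np.abs(delta))[-k:]   # indices of the k largest |delta|
    return idx, delta[idx]                 # client uploads indices + values only

delta = np.array([0.05, -1.2, 0.3, 0.01, 0.9, -0.4])
idx, vals = top_k_sparsify(delta, k=2)

sparse = np.zeros_like(delta)
sparse[idx] = vals                         # server-side reconstruction
# Only the two largest-magnitude entries (-1.2 and 0.9) survive; the rest are zero.
```

With k much smaller than the model dimension, the uplink carries `k` index/value pairs instead of the full difference vector, which is where the communication saving comes from.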
**Federated Ecosystem**

In the preceding chapters, we introduced some technologies and practices in the field of privacy-preserving federated learning. However, as exploration deepens, the field of federated learning has become increasingly inclusive, encompassing machine learning, model compression and deployment, information security, encryption algorithms, game theory, and more. As more and more companies, universities, and institutions become involved, federated learning today is no longer merely a technical solution but a privacy-preserving ecosystem. For example, different participants wish to join the federated learning process in a sustainable manner, and there are questions about how to design incentive mechanisms to ensure that profits can be shared relatively fairly among participants while effectively deterring participants who engage in malicious attacks or destructive behavior.

Furthermore, as more laws and regulations on user data privacy protection and proper use are being introduced, establishing technical standards for federated learning has become increasingly important. Such standards can build a bridge between legal regulators and technical developers, letting enterprises know which technologies to adopt in order to better share information while complying with regulations.

At the end of 2020, the international standard for federated learning (IEEE P3652.1), approved by the IEEE Standards Committee, was officially published and implemented. This standard aims to provide guidelines for building federated learning architectures and applications, with main content including: descriptions and definitions of federated learning, scenario requirement classification and security evaluation, quantification of personalized metrics for federated learning evaluation, and requirements for joint governance. This is also the first international standard established for AI collaborative technology frameworks, marking the beginning of a new chapter for large-scale industrial application of federated learning.
v1/en_chapters/chapter_federated_learning/overview.md
@@ -0,0 +1,47 @@
## Overview

With the rapid development of artificial intelligence, large-scale and high-quality data has become increasingly important for model performance and user experience. At the same time, data utilization has become a bottleneck constraining the further development of AI. Issues related to privacy, regulation, and engineering have prevented data sharing between devices, leading to the emergence of data silos. To address this challenge, Federated Learning (FL) was proposed. The concept of federated learning was first introduced in 2016. Under the requirements of user privacy protection, data security, and government regulations, federated learning enables effective machine learning modeling using data from multiple parties.

### Definition

The core principle of federated learning is that data stays in place while the model moves. Clearly, centralizing data from all parties would fail to protect user privacy and would violate relevant laws and regulations. Federated learning allows the model to "move" across data holders, thereby enabling modeling without data leaving the local device. In federated learning, each party's data remains local, and a machine learning model is built by exchanging encrypted parameters and other intermediate information, typically through a central server.

### Application Scenarios

In practical application scenarios, federated learning can be categorized based on the overlap of samples and features into horizontal federated learning (different samples, overlapping features), vertical federated learning (different features, overlapping samples), and federated transfer learning (neither samples nor features overlap).

**Horizontal federated learning** is suitable for scenarios where different participants possess the same features but different individuals. For example, in an advertising recommendation scenario, algorithm developers use data with the same features (click counts, dwell time, usage frequency, etc.) from different mobile phone users to build models. Since these feature data cannot leave the device, horizontal federated learning is used to jointly build models from multiple users' feature data.

**Vertical federated learning** is suitable for scenarios with substantial sample overlap but little feature overlap. For example, consider two different institutions: an insurance company and a hospital. Their user bases are likely to include most residents of the area, so the intersection of their users may be large. However, since the insurance company records users' financial behavior and credit ratings while the hospital holds users' disease and medication records, their feature intersection is small. Vertical federated learning aggregates these different features in an encrypted state to enhance model capability.

**Federated transfer learning** focuses on finding similarities between the source domain and the target domain. For example, consider two different institutions: a bank located in China and an e-commerce company located in the United States. Due to geographical limitations, the user base intersection of these two institutions is very small. Meanwhile, due to the different types of institutions, their data features also have only a small overlap. In this case, to conduct effective federated learning, transfer learning must be introduced. Federated transfer learning can address problems of small data scale on a single side and insufficient labeled samples, thereby improving model performance.

### Deployment Scenarios

Federated learning is architecturally very similar to the parameter server approach (data center distributed learning), both employing a centralized server and decentralized clients to collaboratively build a machine learning model. Furthermore, depending on the deployment scenario, federated learning can be further divided into cross-silo and cross-device federated learning. Generally, cross-silo federated learning involves users at the enterprise or organizational level, while cross-device federated learning targets portable electronic devices and mobile devices. :numref:`ch10-federated-learning-different-connection` illustrates the differences and connections among the three approaches:

![Differences and connections among parameter server, cross-silo, and cross-device learning](../img/ch10/ch10-federated-learning-different-connection.png)
:width:`800px`
:label:`ch10-federated-learning-different-connection`

### Common Frameworks

As the demand for federated learning technology from users and developers continues to grow, the number of federated learning tools and frameworks has also been increasing. Below we introduce some mainstream federated learning frameworks.

[TFF](https://www.tensorflow.org/federated) (TensorFlow Federated) is an open-source federated learning framework led by Google for machine learning and other computations on decentralized data. TFF was developed to facilitate open research and experimentation in federated learning. It trains shared global models among many participating clients who keep their training data locally. For example, federated learning has been used to train prediction models for mobile keyboards without uploading sensitive typing data to a server.

[PaddleFL](https://paddlefl.readthedocs.io/en/latest/index.html) is an open-source federated learning framework based on PaddlePaddle, proposed by Baidu. Researchers can easily replicate and compare different federated learning algorithms using PaddleFL, and developers can readily deploy PaddleFL federated learning systems in large-scale distributed clusters. PaddleFL provides various federated learning strategies (horizontal federated learning, vertical federated learning) and their applications in computer vision, natural language processing, recommendation algorithms, and other domains. Additionally, PaddleFL offers applications for traditional machine learning training strategies, such as multi-task learning and transfer learning in federated learning environments. Leveraging PaddlePaddle's large-scale distributed training capabilities and Kubernetes' elastic scheduling of training tasks, PaddleFL can be easily deployed on full-stack open-source software.

[FATE](https://fate.fedai.org) (Federated AI Technology Enabler), proposed by WeBank, is the world's first industrial-grade open-source federated learning framework that enables enterprises and institutions to collaborate on data while ensuring data security and privacy. The FATE project uses Secure Multi-Party Computation (MPC) and Homomorphic Encryption (HE) technologies to build underlying secure computation protocols, supporting secure computation for various types of machine learning, including logistic regression, tree-based algorithms, deep learning, and transfer learning. FATE was first open-sourced in February 2019, and the FATE community was established. Community members include major Chinese cloud computing and financial services companies.

[FedML](https://FedML.ai) is an open-source federated learning research and benchmark library led by the University of Southern California (USC), which facilitates the development of new federated learning algorithms and fair performance comparison. FedML supports three computing paradigms (distributed training, on-device training, and standalone simulation) for users to experiment in different system environments. FedML also facilitates diverse algorithmic research through flexible and general API design and reference baseline implementations. To enable fair comparison of various federated learning algorithms, FedML has set up comprehensive benchmark datasets, including non-Independent and Identically Distributed (IID) datasets.

[PySyft](https://openmined.github.io/PySyft/index.html) is a secure and private deep learning Python library released by University College London (UCL), DeepMind, and OpenMined, encompassing federated learning, differential privacy, and multi-party learning. PySyft uses differential privacy and encrypted computation (MPC and HE) to decouple private data from model training.

[Fedlearner](https://github.com/bytedance/fedlearner) is a vertical federated learning framework proposed by ByteDance that allows joint modeling on data distributed across institutions. Fedlearner comes with surrounding infrastructure for cluster management, job management, job monitoring, and network proxying. Fedlearner adopts a cloud-native deployment approach and stores data in HDFS. Fedlearner manages and launches tasks through Kubernetes. Both participating parties of each Fedlearner task need to simultaneously launch training tasks through Kubernetes, with a Master node uniformly managing multiple training tasks and Workers handling communication.

[OpenFL](https://openfl.readthedocs.io/en/latest/index.html) is a Python framework for federated learning proposed by Intel. OpenFL aims to be a flexible, extensible, and easy-to-learn tool for data scientists.

[Flower](https://flower.dev) is an open-source federated learning system released by the University of Cambridge, primarily optimized for deploying federated learning algorithms on large-scale, heterogeneous devices.

[MindSpore Federated](https://www.mindspore.cn/en) is an open-source federated learning framework proposed by Huawei, supporting commercial deployment on tens of millions of stateless terminal devices, enabling full-scenario intelligent applications while keeping user data local. MindSpore Federated focuses on application scenarios of horizontal federated learning with large-scale participants, enabling users participating in federated learning to collaboratively build AI models without sharing local data. MindSpore Federated primarily addresses challenges in deploying federated learning in industrial scenarios, including privacy security, large-scale federated aggregation, semi-supervised federated learning, communication compression, and cross-platform deployment.
@@ -0,0 +1,139 @@
## Privacy Encryption Algorithms

During the federated learning process, user data is only used for training on local devices and does not need to be uploaded to the central FL-Server. This can prevent the direct leakage of users' personal data. However, in the federated learning framework, uploading model weights to the cloud in plaintext still poses the risk of indirectly leaking user privacy. After obtaining the plaintext weights uploaded by users, adversaries can recover users' personal training data through reconstruction, model inversion, and other attacks, leading to user privacy leakage.

The MindSpore Federated framework provides secure aggregation algorithms based on Local Differential Privacy (LDP), Multi-Party Computation (MPC), and Huawei's proprietary Sign-based Dimension Selection differential privacy algorithm (SignDS), which add noise or perturbation to the local model weights before uploading them to the cloud. These algorithms address the privacy leakage problem in federated learning while ensuring model usability.

### LDP-Based Secure Aggregation

Differential privacy is a mechanism for protecting user data privacy, defined as follows:

$$
Pr[\mathcal{K}(D)\in S] \le e^{\epsilon} Pr[\mathcal{K}(D') \in S]+\delta
$$

For two datasets $D$ and $D'$ that differ in only one record, the probability that the output of a randomized algorithm $\mathcal{K}$ falls within any output set $S$ satisfies the above formula. $\epsilon$ is the differential privacy budget, $\delta$ is the perturbation parameter, and smaller values of $\epsilon$ and $\delta$ indicate that the output distributions of $\mathcal{K}$ on $D$ and $D'$ are closer.

In federated learning, suppose the model weight matrix after local training on an FL-Client is $W$. Since the model "memorizes" the characteristics of the training set during training, an adversary can use $W$ to reconstruct the user's training dataset.

MindSpore Federated provides an LDP-based secure aggregation algorithm to prevent privacy data leakage when local model weights are uploaded to the cloud.

The FL-Client generates a differential-privacy noise matrix $G$ with the same dimensions as the local model weight matrix $W$, and then adds the two together to obtain a weight matrix $W_p$ that satisfies the differential privacy definition:

$$
W_p=W+G
$$

The FL-Client uploads the noisy model weight matrix $W_p$ to the cloud-side FL-Server for federated aggregation. The noise matrix $G$ essentially adds a layer of masking to the original model, reducing the risk of the model leaking sensitive data while also affecting the convergence of model training. How to achieve a better balance between model privacy and usability remains an open research question. Experiments show that when the number of participants $n$ is sufficiently large (generally above 1000), most noise can cancel each other out, and the local differential privacy mechanism has no significant impact on the accuracy and convergence of the aggregated model.
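The noise-cancellation effect for large $n$ can be checked numerically. The sketch below is our own simplified simulation (every client holds the same weights, Gaussian noise with a scale `sigma` we chose), not the framework's noise mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 100, 2000, 0.1            # model size, number of clients, noise scale

true_w = rng.standard_normal(d)         # pretend every client trained to the same W
# Each client uploads W_p = W + G with independent zero-mean Gaussian noise G.
noisy_uploads = [true_w + rng.normal(0.0, sigma, d) for _ in range(n)]
aggregated = np.mean(noisy_uploads, axis=0)   # FedAvg-style server aggregation

# Averaging n independent noises shrinks their effect to roughly sigma / sqrt(n),
# so with n = 2000 the aggregate is very close to the noiseless weights.
assert np.abs(aggregated - true_w).max() < 0.05
```

Rerunning with small `n` (say 10) makes the residual noise an order of magnitude larger, which is why the text recommends on the order of a thousand participants or more.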
### MPC-Based Secure Aggregation

Although differential privacy technology can adequately protect user data privacy, when the number of participating FL-Clients is small or the Gaussian noise amplitude is large, model accuracy can be significantly affected. To simultaneously satisfy both model protection and model convergence requirements, MindSpore Federated provides an MPC-based secure aggregation scheme.
In this training mode, suppose the set of participating FL-Clients is $U$. For any pair of FL-Clients $u$ and $v$, they negotiate a pair of random perturbations $p_{uv}$ and $p_{vu}$ that satisfy

$$
p_{uv}=
\begin{cases}
-p_{vu}, &u{\neq}v\\
0, &u=v
\end{cases}
$$

Thus, each FL-Client $u$ adds the perturbations negotiated with the other clients to its original model weights $x_u$ before uploading them to the FL-Server:

$$
x_{encrypt}=x_u+\sum\limits_{v{\in}U}p_{uv}
$$

Consequently, the FL-Server aggregation result $\overline{x}$ is:

$$
\overline{x}=\sum\limits_{u{\in}U}(x_{u}+\sum\limits_{v{\in}U}p_{uv})=\sum\limits_{u{\in}U}x_{u}+\sum\limits_{u{\in}U}\sum\limits_{v{\in}U}p_{uv}=\sum\limits_{u{\in}U}x_{u}
$$

The above process only outlines the main idea of the aggregation algorithm. The MPC-based aggregation scheme is lossless in terms of accuracy, at the cost of additional communication rounds.
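The cancellation of the pairwise masks can be verified in a few lines. This is a minimal numeric sketch under our own setup (random masks in plain NumPy; a real protocol derives them from key agreement), not the framework's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 4, 5
x = rng.standard_normal((n, d))          # each client's true weights x_u

# Negotiate antisymmetric pairwise masks: p[u, v] = -p[v, u], p[u, u] = 0.
p = rng.standard_normal((n, n, d))
p = p - p.transpose(1, 0, 2)             # enforce p_uv = -p_vu (diagonal becomes 0)

x_encrypt = x + p.sum(axis=1)            # what each client actually uploads

# Server-side sum: every p_uv meets its matching -p_vu, so the masks vanish
# and the server recovers the exact sum of the true weights.
assert np.allclose(x_encrypt.sum(axis=0), x.sum(axis=0))
```

Note that each individual upload `x_encrypt[u]` looks random to the server; only the sum over all clients is meaningful, which is the sense in which the scheme is lossless.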
### LDP-SignDS Algorithm-Based Secure Aggregation

For the previous dimension-wise noise-adding LDP algorithm, the noise scale added to each dimension is essentially proportional to the number of model parameters. Therefore, for high-dimensional models, a very large number of participants may be needed to mitigate the impact of noise on model convergence. To address this "dimension dependence" issue, MindSpore Federated further provides the **Sign-based Dimension Selection (SignDS)** :cite:`jiang2022signds` algorithm based on dimension selection.

The main idea of the SignDS algorithm is as follows: for each true local update $\Delta\in\mathbb{R}^{d}$, the FL-Client first selects a small subset of the most significantly updated dimensions to construct a Top-K set $S_k$, and then selects a dimension set $J$ based on this to return to the FL-Server. The FL-Server constructs a corresponding sparse update $\Delta^\prime$ based on the dimension set $J$ and aggregates all sparse updates to update the global model. Since local model updates are correlated with local data information, directly selecting the true largest update dimensions may lead to privacy leakage. To address this, the SignDS algorithm provides privacy guarantees in two aspects. On one hand, the algorithm uses an Exponential Mechanism (EM :cite:`mcsherry2007mechanism`)-based dimension selection algorithm **EM-MDS**, ensuring that the selected dimension set satisfies strict $\epsilon$-LDP guarantees; on the other hand, when constructing sparse updates, a constant value is assigned to the selected dimensions instead of directly using the actual update values, ensuring that the sparse updates are no longer directly correlated with local data. Since the dimension selection satisfies $\epsilon$-LDP and the update values assigned to the selected dimensions are independent of local data, by the post-processing property of differential privacy :cite:`dwork2014algorithmic`, the constructed sparse updates also satisfy $\epsilon$-LDP guarantees. **Compared to the previous dimension-wise noise-adding LDP algorithm, the SignDS algorithm can significantly improve training accuracy for high-dimensional models. Moreover, since FL-Clients only need to upload a small subset of dimension values rather than all model weights, the uplink communication volume of federated learning is also greatly reduced.**

Below, we provide detailed introductions to the construction of the Top-K set $S_k$ and the EM-MDS dimension selection algorithm.

First, since actual update values can be positive or negative, directly assigning the same constant value to all selected dimensions may significantly change the model update direction and affect model convergence. To solve this problem, SignDS proposes a sign-based Top-K set construction strategy. Specifically, the algorithm introduces an additional sign variable $s\in\\{-1,1\\}$. This variable is randomly sampled with equal probability by the FL-Client and is used to determine the Top-K set $S_k$ of the local update $\Delta$. If $s=1$, we sort $\Delta$ by **actual update values** and record the $k$ dimensions with the **largest** updates as $S_k$. We further randomly select a subset of dimensions from $S_k$ and use $s=1$ as the update value for these dimensions to construct the sparse update. Intuitively, the update values of dimensions in $S_k$ are likely to be greater than zero. Therefore, assigning $s=1$ to the selected dimensions will not cause a large deviation in the model update direction, thereby mitigating the impact on model accuracy. Similarly, when $s=-1$, we select the $k$ dimensions with the **smallest** updates as $S_k$ and assign $s=-1$ to the selected dimensions.
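The sign-based Top-K construction can be sketched as follows. This is a simplified illustration with our own function names, covering only the step described above (the private selection of $J$ via EM-MDS is a separate step):

```python
import numpy as np

rng = np.random.default_rng(1)

def sign_topk_set(delta: np.ndarray, k: int):
    """Sample s uniformly from {-1, 1}; take the k largest updates if s = 1,
    the k smallest if s = -1, as the Top-K set S_k."""
    s = int(rng.choice([-1, 1]))
    order = np.argsort(delta)               # ascending by actual update value
    topk = order[-k:] if s == 1 else order[:k]
    return s, set(topk.tolist())

delta = np.array([0.8, -0.5, 0.1, -0.9, 0.4])
s, S_k = sign_topk_set(delta, k=2)
# If s = 1, S_k is the indices of the two largest updates, {0, 4};
# if s = -1, the two smallest, {1, 3}. The constant s itself later serves
# as the update value assigned to the dimensions selected from S_k.
```

Because the dimensions in $S_k$ share the sign of $s$ by construction, writing the constant $s$ into them preserves the rough direction of the true update.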

Next, we further introduce the EM-MDS algorithm for dimension selection. In brief, the purpose of the EM-MDS algorithm is to randomly select a dimension set $J\in\mathcal{J}$ from the output dimension domain $\mathcal{J}$ with a certain probability $\mathcal{P}$, where different dimension sets correspond to different probabilities. We assume that $J$ contains a total of $h$ dimensions, of which $\nu$ dimensions belong to the Top-K set (i.e., $|S_k \cap J|=\nu$, where $\nu\in[0,h]$), and the other $h-\nu$ dimensions belong to the non-Top-K set. Intuitively, the larger $\nu$ is, the more Top-K dimensions $J$ contains, and the better the model convergence. Therefore, we want to assign higher probabilities to dimension sets with larger $\nu$. Based on this idea, we define the score function as:

$$
u(S_{k}, J) = \mathbb{1}(|S_k\cap J| \geq \nu_{th}) = \mathbb{1}(\nu \geq \nu_{th})
$$
:eqlabel:`score_function`

$u(S_{k}, J)$ measures whether the number of Top-K dimensions contained in the output dimension set $J$ exceeds a certain threshold $\nu_{th}$ ($\nu_{th}\in[1,h]$): it equals 1 if exceeded, and 0 otherwise. Furthermore, the sensitivity of $u(S_{k}, J)$ can be computed as:

$$
\phi = \max_{J\in\mathcal{J}} ||u(S_{k}, J) - u(S^\prime_{k}, J)|| = 1 - 0 = 1
$$
:eqlabel:`sensitivity`

Note that :eqref:`sensitivity` holds for any pair of different Top-K sets $S_k$ and $S_k^\prime$.

Based on the above definitions, the EM-MDS algorithm is described as follows:

*Given the Top-K set $S_k$ of the true local update $\Delta\in\mathbb{R}^{d}$ and the privacy budget $\epsilon$, the sampling probability of the output dimension set $J\in\mathcal{J}$ is:*

$$
\mathcal{P}=\frac{\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S_{k}, J))}{\sum_{J^\prime\in\mathcal{J}}\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S_{k}, J^\prime))}
=\frac{\mathrm{exp}(\epsilon\cdot \mathbb{1}(\nu \geq \nu_{th}))}{\sum_{\tau=0}^{\tau=h}\omega_{\tau}\cdot \mathrm{exp}(\epsilon\cdot \mathbb{1}(\tau\geq\nu_{th}))}
=\frac{\mathrm{exp}(\epsilon\cdot \mathbb{1}(\nu \geq \nu_{th}))}{\sum_{\tau=0}^{\tau=\nu_{th}-1}\omega_{\tau} + \sum_{\tau=\nu_{th}}^{\tau=h}\omega_{\tau}\cdot \mathrm{exp}(\epsilon)}
$$
:eqlabel:`emmds`

*where $\nu$ is the number of Top-K dimensions contained in $J$, $\nu_{th}$ is the score function threshold, $J^\prime$ is any output dimension set, and $\omega_{\tau}=\binom{k}{\tau}\binom{d-k}{h-\tau}$ is the number of all sets containing $\tau$ Top-K dimensions.*

We further provide the privacy proof of the EM-MDS algorithm:

For each FL-Client, given a randomly sampled sign value $s$, let the Top-K sets of any two local updates $\Delta$ and $\Delta^\prime$ be denoted as $S_k$ and $S_k^\prime$. For any output dimension set $J\in\mathcal{J}$, let $\nu=|S_k \cap J|$ and $\nu^\prime=|S_k^\prime \cap J|$ be the intersection sizes of $J$ with the two Top-K dimension sets. According to :eqref:`emmds`, the following inequality holds:

$$
\frac{\mathrm{Pr}[J|\Delta]}{\mathrm{Pr}[J|\Delta^\prime]} = \frac{\mathrm{Pr}[J|S_{k}]}{\mathrm{Pr}[J|S^\prime_{k}]} = \frac{\frac{\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S_{k}, J))}{\sum_{J^\prime\in\mathcal{J}}\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S_{k}, J^\prime))}}{\frac{\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S^\prime_{k}, J))}{\sum_{J^\prime\in\mathcal{J}}\mathrm{exp}(\frac{\epsilon}{\phi}\cdot u(S^\prime_{k}, J^\prime))}}
= \frac{\frac{\mathrm{exp}(\epsilon\cdot \mathbb{1}(\nu \geq \nu_{th}))}{\sum_{\tau=0}^{\tau=h}\omega_{\tau}\cdot \mathrm{exp}(\epsilon\cdot \mathbb{1}(\tau\geq\nu_{th}))}}{\frac{\mathrm{exp}(\epsilon\cdot \mathbb{1}(\nu^\prime \geq \nu_{th}))}{\sum_{\tau=0}^{\tau=h}\omega_{\tau}\cdot \mathrm{exp}(\epsilon\cdot \mathbb{1}(\tau\geq\nu_{th}))}} \\
= \frac{\mathrm{exp}(\epsilon\cdot \mathbb{1}(\nu \geq \nu_{th}))}{\mathrm{exp}(\epsilon\cdot \mathbb{1}(\nu^\prime \geq \nu_{th}))}
\leq \frac{\mathrm{exp}(\epsilon\cdot 1)}{\mathrm{exp}(\epsilon\cdot 0)} = \mathrm{exp}(\epsilon)
$$

*This proves that the EM-MDS algorithm satisfies the $\epsilon$-LDP guarantee.*

It is worth noting that computing :eqref:`emmds` requires first determining the Top-K dimension count threshold $\nu_{th}$. To this end, we first derive the probability distribution and expectation of the number of Top-K dimensions contained in any output dimension set $J$ given the threshold $\nu_{th}$:

$$
\mathrm{Pr}(\nu=\tau|\nu_{th})=
\begin{cases}
\omega_{\tau} / \Omega & \text{if} \quad \tau\in[0,\nu_{th}) \\
\omega_{\tau}\cdot\mathrm{exp}(\epsilon) / \Omega & \text{if} \quad \tau\in[\nu_{th},h]
\end{cases}
$$
:eqlabel:`discrete-prob`

$$
\mathbb{E}[\nu|\nu_{th}] = \sum_{\tau=0}^{\tau=h}\tau\cdot \mathrm{Pr}(\nu=\tau|\nu_{th})
$$
:eqlabel:`expectation`

Here, $\Omega$ is the denominator of $\mathcal{P}$ in :eqref:`emmds`. Intuitively, the higher $\mathbb{E}[\nu|\nu_{th}]$ is, the greater the probability that the randomly sampled set $J$ contains Top-K dimensions, and thus the better the model utility. Therefore, we determine the threshold that maximizes $\mathbb{E}[\nu|\nu_{th}]$ as the target threshold $\nu_{th}^{*}$, i.e.:

$$
\nu_{th}^{*} = \underset{\nu_{th}\in[1, h]}{\operatorname{argmax}} \mathbb{E}[\nu|\nu_{th}]
$$
:eqlabel:`threshold`
Finally, we describe the detailed workflow of the SignDS algorithm in :numref:`signds_workflow`. Given a local model update $\Delta$, we first randomly sample a sign value $s$ and construct the Top-K set $S_k$. Next, we determine the threshold $\nu_{th}^{*}$ according to :eqref:`threshold` and select the output set $J$ following the probability defined in :eqref:`emmds`. Considering that the output domain $\mathcal{J}$ contains $\binom{d}{h}$ possible dimension sets, directly sampling a combination from $\mathcal{J}$ with a certain probability would require very high computational and space costs. Therefore, we adopt an inverse sampling algorithm to improve computational efficiency. Specifically, we first sample a random value $\beta\sim U(0,1)$ from the standard uniform distribution, and determine the number of Top-K dimensions $\nu$ in the output dimension set based on the cumulative distribution function $CDF_{\tau}$ of $\mathrm{Pr}(\nu=\tau|\nu_{th})$ in :eqref:`discrete-prob`. Finally, we randomly select $\nu$ dimensions from the Top-K set $S_k$ and randomly sample $h-\nu$ dimensions from the non-Top-K set to construct the final output dimension set $J$.
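The inverse-sampling step can be sketched as follows. This is a minimal illustration of sampling $\nu$ from :eqref:`discrete-prob` and assembling $J$; the function and variable names are our own, not the SignDS reference implementation.

```python
import math
import random

def sample_output_dims(s_k, d, h, nu_th, eps, rng=random):
    """Inverse-transform sampling of the output set J (illustrative).

    Weight for each tau: omega_tau = C(k, tau) * C(d - k, h - tau),
    multiplied by exp(eps) when tau >= nu_th.
    """
    k = len(s_k)
    weights = []
    for tau in range(h + 1):
        w = math.comb(k, tau) * math.comb(d - k, h - tau)
        weights.append(w * math.exp(eps) if tau >= nu_th else w)
    omega = sum(weights)

    # Draw beta ~ U(0,1) and walk the CDF to determine nu.
    beta, cdf = rng.random(), 0.0
    nu = max(t for t in range(h + 1) if weights[t] > 0)  # fallback
    for tau in range(h + 1):
        cdf += weights[tau] / omega
        if beta <= cdf:
            nu = tau
            break

    # nu dimensions from the Top-K set, h - nu from the rest.
    non_topk = [i for i in range(d) if i not in s_k]
    return rng.sample(sorted(s_k), nu) + rng.sample(non_topk, h - nu)

J = sample_output_dims(s_k={0, 3, 7}, d=10, h=4, nu_th=2, eps=2.0)
```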


:width:`800px`
:label:`signds_workflow`

v1/en_chapters/chapter_federated_learning/summary.md
Normal file
@@ -0,0 +1,3 @@

## Summary

In this chapter, we briefly introduced the background, system architecture, federated averaging algorithm, privacy encryption algorithms, and practical deployment challenges of federated learning. Federated learning is an emerging artificial intelligence paradigm that can build effective machine learning models under the two major constraints of "data protection" and "data silos." Furthermore, due to the unique characteristics of federated learning scenarios (local data not being uploaded, high security and privacy requirements, and non-IID data distributions), the development of systems and algorithms becomes more challenging: how to balance computation and communication overhead, how to ensure the model does not leak privacy, and how algorithms can converge under non-IID scenarios. These challenges require developers to have a deeper understanding of practical federated learning scenarios.

v1/en_chapters/chapter_federated_learning/vertical_fl.md
Normal file
@@ -0,0 +1,61 @@

## Vertical Federated Learning

Now we introduce another type of federated learning algorithm: Vertical Federated Learning. In vertical federated learning, the participating parties possess data with the same sample space but different feature spaces. They perform secure joint modeling using shared sample data, which has broad applications in fields such as finance and advertising. Compared to horizontal federated learning, vertical federated learning requires participants to collaboratively complete data intersection, joint model training, and joint model inference. Moreover, the more participants involved, the higher the complexity of the vertical federated learning system.

Below, we use a two-party example with Enterprise A and Enterprise B to introduce the basic architecture and workflow of vertical federated learning. Suppose Enterprise A has both feature data and label data and can build models independently; Enterprise B has feature data but lacks label data and thus cannot build models independently. Due to privacy regulations and industry standards, data between the two enterprises cannot be directly shared. Enterprise A and Enterprise B can adopt a vertical federated learning solution to collaborate: data stays local, and both parties use their shared sample data for joint modeling and training. Ultimately, both parties obtain a more powerful model.

### Vertical Federation Architecture


:width:`800px`
:label:`federated-learning-vfl-arch`

Model training in a vertical federated learning system generally consists of the following phases:

- Sample alignment: First, align the sample data with the same ID (Identification) across Enterprise A and Enterprise B. During the data alignment phase, the system employs encryption algorithms to protect the data, ensuring that neither party's user data is exposed.
- Joint training: After determining the shared user data between Enterprise A and Enterprise B, this shared data can be used to collaboratively train a business model. During the model training process, model parameter information is transmitted in an encrypted manner. The trained federated learning model can be deployed across all participating parties in the federated learning system.

### Sample Alignment

Private Set Intersection (PSI) technology is a commonly used solution for data sample alignment in vertical federated learning. There are multiple PSI implementation approaches in the industry: circuit-based, public-key encryption-based, oblivious transfer protocol-based, and fully homomorphic encryption-based. Each approach has its own advantages and disadvantages. For example, public-key encryption-based approaches do not require an auxiliary server but incur high computational overhead for the public-key operations, while oblivious transfer-based approaches offer high computational performance but incur large communication overhead. Therefore, in specific applications, the best balance among functionality, performance, and security should be chosen based on the actual scenario.

RSA blind signature is a classic PSI method based on public-key encryption and is one of the widely adopted technologies in current vertical federated learning systems. Below, we describe the basic workflow of the RSA blind signature algorithm using Enterprise A and Enterprise B as an example.


:width:`600px`
:label:`federated-learning-vfl-data`

Enterprise A acts as the server and possesses a set containing label data and sample IDs. Enterprise B acts as the client and possesses a set of sample IDs. First, Enterprise A uses the RSA algorithm to generate a private key and a public key. The private key is retained on the server side, and the public key is sent to Enterprise B.

The server uses the RSA algorithm to compute the signatures of the IDs participating in sample alignment:
$$t_j=H^{'}(K_{a:j})$$
where $K_{a:j}=(H(a_j))^d \ mod \ n$ is the result of encrypting $H(a_j)$ with the private key $d$, and $H()$ and $H^{'}()$ are hash functions.

Similarly, on the client side, the sample IDs are hashed, encrypted with the public key $e$, and multiplied by a random number $R_{b,i}$ for blinding perturbation:
$$y_i=H(b_i)\cdot(R_{b,i})^e \ mod \ n$$
The client transmits the computed values $\{y_1,...,y_v\}$ to the server side. After receiving the $y_i$ values, the server signs them using the private key $d$ and computes:
$$y_i^{'}=y_i^d \ mod \ n$$
Then the server sends the computed $\{y_1^{'},...,y_v^{'}\}$ and $\{t_1,...,t_w\}$ to the client side.

Upon receiving $y_i^{'}$ and $t_j$, the client first performs the unblinding operation:
$$K_{b:i}={y_i}^{'}/R_{b,i}$$
and aligns its own ID signatures with the server's ID signatures to obtain the ID intersection $I$ in an encrypted and hashed state:
$$
{t_i}^{'}=H^{'}(K_{b:i}) \\
I=\{t_1,...,t_w\}\cap \{{t_1}^{'},...,{t_v}^{'}\}
$$

Finally, the aligned sample ID intersection $I$ is sent to the server, and the server uses its own mapping table to independently derive the plaintext results. In this way, Enterprise A and Enterprise B complete the intersection computation of user sets in an encrypted state, and throughout the entire process, the non-overlapping sample IDs of both parties are never exposed.
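The full exchange can be illustrated end to end with toy RSA parameters. This is an insecure, illustrative sketch (tiny primes, our own variable names); real deployments use 2048-bit keys and hardened libraries.

```python
import hashlib
import math
import random

# Toy RSA key pair (illustration only; never use such small primes).
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))          # private exponent

H = lambda x: int(hashlib.sha256(x.encode()).hexdigest(), 16) % n
Hp = lambda x: hashlib.sha256(str(x).encode()).hexdigest()

def rand_coprime(n):
    """Pick a blinding factor invertible mod n."""
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            return r

server_ids = ["u1", "u2", "u3"]            # Enterprise A (holds d)
client_ids = ["u2", "u3", "u4"]            # Enterprise B (holds e)

# Server: t_j = H'( H(a_j)^d mod n )
t = {Hp(pow(H(a), d, n)) for a in server_ids}

# Client: blind each hashed ID, y_i = H(b_i) * R_i^e mod n
R = {b: rand_coprime(n) for b in client_ids}
y = {b: (H(b) * pow(R[b], e, n)) % n for b in client_ids}

# Server signs the blinded values: y'_i = y_i^d mod n
y_signed = {b: pow(y[b], d, n) for b in client_ids}

# Client unblinds (y'_i / R_i = H(b_i)^d mod n) and intersects signatures.
t_prime = {b: Hp((y_signed[b] * pow(R[b], -1, n)) % n) for b in client_ids}
intersection = {b for b in client_ids if t_prime[b] in t}
# u2 and u3, the shared IDs, are recovered without revealing u1 or u4.
```

Note how blinding works: $(H(b)\cdot R^e)^d = H(b)^d\cdot R \ mod \ n$, so dividing out $R$ leaves exactly the server-side signature $H(b)^d$, which the client can match against the $t_j$ values.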

### Joint Training

After sample ID alignment, developers can use the shared data to build machine learning models.

Currently, models such as linear regression, decision trees, and neural networks have been widely applied in vertical federated learning systems. During model training in vertical federated learning, a third-party collaborator C is generally introduced to implement the central server functionality, and it is assumed that this third-party collaborator C is trustworthy and will not collude with other participants. The central server acts as a neutral party during training, generating and distributing keys, and decrypting and computing encrypted data. However, the central server role is not mandatory; for example, in a two-party federated learning scenario, a third-party collaborator C is not needed to coordinate the training tasks of both parties, and Enterprise A, which holds the label data, can assume the role of the central server. Without loss of generality, we continue to describe the vertical federated learning joint training process using a scheme that includes the third-party collaborator C.


:width:`800px`
:label:`federated-learning-vfl-train`

- Step 1: The third-party collaborator C creates a key pair and sends the public key to Enterprises A and B.
- Step 2: Enterprises A and B separately compute the intermediate results needed for gradient and loss computation, and encrypt and exchange them.
- Step 3: Enterprises A and B separately compute the encrypted gradients and add masks. Meanwhile, Enterprise A also computes the encrypted loss value. After computation, Enterprises A and B send the encrypted values to the third-party collaborator C.
- Step 4: The third-party collaborator C decrypts the gradients and loss values, and sends the results back to Enterprises A and B.
- Step 5: Enterprises A and B first remove the masks from the received gradient values, and then update their local model parameters.

Throughout the entire training process, any sensitive data exchanged between Enterprises A and B is encrypted before leaving their respective trust domains. Homomorphic Encryption (HE) is one of the commonly used algorithms in federated learning frameworks. Homomorphic encryption means that performing certain operations on two pieces of encrypted data and then decrypting the result yields the same outcome as performing the same operations on the original data. When this operation is addition, the scheme is called additively homomorphic. We denote the encryption function as $[[\cdot]]$.
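As a concrete illustration of additive homomorphism, the following sketch implements textbook Paillier encryption with toy parameters: multiplying two ciphertexts decrypts to the sum of the plaintexts, i.e. $[[u]]\cdot[[v]]$ decrypts to $u+v$. This is not a production implementation; the primes are tiny and chosen only for the demonstration.

```python
import math
import random

# Textbook Paillier with toy parameters (illustration only).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1                                   # standard choice g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def encrypt(m):
    """c = g^m * r^n mod n^2 with a random r coprime to n."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    return ((pow(c, lam, n2) - 1) // n * mu) % n

a, b = 1234, 5678
# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
assert decrypt((encrypt(a) * encrypt(b)) % n2) == (a + b) % n
```

This is precisely the property the protocol above relies on: parties can sum encrypted gradient contributions without ever seeing each other's plaintexts.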

v1/en_chapters/chapter_frontend_and_ir/ad.md
Normal file
@@ -0,0 +1,462 @@

# Automatic Differentiation

In the following, we describe the key methodologies applied in automatic differentiation.

## Types of Differentiation Methods

Differentiation constitutes a collection of methodologies enabling the efficient and precise evaluation of derivatives within computer programs. Since the 1960s and 1970s, it has been extensively utilized across multiple sectors including fluid mechanics, astronomy, and mathematical finance. Its theories and implementation have been rigorously studied over time.

With the advancement of deep learning, which has shown remarkable progress across an expanding range of machine learning tasks in recent years, automatic differentiation has found widespread application in the field of machine learning. Given that many optimization algorithms employed in machine learning models necessitate derivatives of the models, automatic differentiation has emerged as an integral component within mainstream machine learning frameworks such as TensorFlow and PyTorch.

There are four primary methods to evaluate derivatives in computer programs, each of which is explained in the following sections.

### Manual Differentiation

Manual differentiation involves the direct computation of the derivative expression of a function, a task which hinges upon the input values specified within a program. Although this method may seem appealing due to its simplicity and directness, it comes with its share of limitations.

A primary drawback of manual differentiation is the need to re-derive and re-implement the derivative every time a function changes, which can be laborious and time-consuming. This is especially true for complex functions or when working on large-scale projects where the function might undergo frequent updates.

Moreover, manual differentiation can be prone to human errors. The process of deriving complex functions often involves intricate chains of mathematical reasoning. A slight oversight or error in any of these steps can lead to an incorrect derivative, which, in turn, can greatly affect the outcome of the computation. This susceptibility to mistakes can add a layer of uncertainty to the reliability of this method.

Furthermore, in cases where high-order derivatives or partial derivatives with respect to many variables are needed, manual differentiation quickly becomes unfeasible due to the increase in complexity. The difficulty of computing these derivatives correctly grows exponentially with the number of variables and the order of the derivative.

### Numerical Differentiation

Numerical differentiation is an approach that logically stems from the fundamental definition of a derivative and employs the method of difference approximation. The basic formula for numerical differentiation can be described as follows:

$$f^{'}(x)=\lim_{h \to 0}\frac{f(x+h)-f(x)}{h}$$

In this equation, for a sufficiently small value of the step size $h$, the difference quotient $\frac{f(x+h)-f(x)}{h}$ is used as an approximation of the derivative. The inherent error in this approximation is referred to as the truncation error, which theoretically diminishes as the value of $h$ approaches zero. This suggests that a smaller step size would yield a more accurate approximation.

However, the scenario in practice is not always so straightforward due to the phenomenon of round-off error. This error arises from the finite precision of floating-point arithmetic operations in digital computer systems. As the value of $h$ decreases, the round-off error conversely increases, adding a degree of uncertainty to the computation.

This creates a complex interplay between truncation error and round-off error. When the value of $h$ is large, the truncation error dominates, whereas when $h$ is small, the round-off error is more significant. Consequently, the total error of numerical differentiation achieves a minimum at an optimal $h$ value that balances these two types of errors.

In a nutshell, while numerical differentiation offers the advantage of relative simplicity in implementation, it suffers from certain limitations with regard to accuracy. The complexities arising from the interplay between truncation and round-off errors make it less reliable for certain tasks, particularly when high precision is required. Therefore, for many practical applications, more sophisticated techniques of automatic differentiation are preferred.
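The truncation/round-off trade-off can be observed directly. The following sketch (our own example, using $f=\sin$ at $x=1$, whose exact derivative is $\cos(1)$) shows that the forward-difference error first shrinks and then grows again as $h$ decreases:

```python
import math

def forward_diff(f, x, h):
    """Forward-difference approximation f'(x) ≈ (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # exact derivative of sin at x = 1
errors = {h: abs(forward_diff(math.sin, 1.0, h) - exact)
          for h in (1e-1, 1e-5, 1e-13)}

# Truncation error dominates at large h, round-off error at tiny h,
# so the total error is not monotone in h.
assert errors[1e-5] < errors[1e-1]   # shrinking h first helps...
assert errors[1e-5] < errors[1e-13]  # ...but too small a step hurts again
```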

### Symbolic Differentiation

Symbolic differentiation involves the use of computer programs to automatically calculate derivatives. This is accomplished by recursively transforming function expressions in accordance with specific differentiation rules. These rules can be summarized as follows:

$$\frac{\partial}{\partial x}(f(x)+g(x))\rightsquigarrow\frac{\partial}{\partial x}f(x)+\frac{\partial }{\partial x}g(x)$$

$$\frac{\partial}{\partial x}(f(x)g(x))\rightsquigarrow(\frac{\partial}{\partial x}f(x))g(x)+f(x)(\frac{\partial}{\partial x}g(x))$$

Symbolic differentiation has been integrated into many modern algebraic systems such as Mathematica, as well as machine learning frameworks like Theano. It successfully addresses the issues related to hard-coding derivatives inherent in manual differentiation, thus automating the differentiation process and minimizing human error.

Despite these advantages, symbolic differentiation has its own set of challenges. One of its primary limitations is its strict adherence to transforming and expanding expressions recursively, without the ability to reuse previous results of transformations. This can lead to a phenomenon known as expression swell, which results in highly complex and expanded expressions that can significantly slow down computation and increase memory usage.

In addition, symbolic differentiation requires that the expressions to be differentiated are defined in closed form. This constraint largely restricts the use of control flow statements such as loops and conditional branches, which are common in programming. This lack of flexibility can significantly limit the design and expressivity of neural networks within machine learning frameworks, as these often require intricate control flow structures for more advanced operations.
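The two rewrite rules can be implemented as a small recursive rewriter on expression trees. This is an illustrative sketch (our own encoding, not Mathematica's or Theano's implementation); note how the product rule duplicates its subexpressions, which is exactly the source of expression swell:

```python
# Expressions as nested tuples: 'x', constants, ('add', a, b), ('mul', a, b).
def diff(expr):
    """Recursively apply the sum and product rules."""
    if expr == 'x':
        return 1
    if isinstance(expr, (int, float)):
        return 0
    op, a, b = expr
    if op == 'add':
        return ('add', diff(a), diff(b))
    if op == 'mul':  # product rule: both a and b are copied into the result
        return ('add', ('mul', diff(a), b), ('mul', a, diff(b)))
    raise ValueError(op)

def evaluate(expr, x):
    if expr == 'x':
        return x
    if isinstance(expr, (int, float)):
        return expr
    op, a, b = expr
    va, vb = evaluate(a, x), evaluate(b, x)
    return va + vb if op == 'add' else va * vb

f = ('mul', 'x', ('mul', 'x', 'x'))      # x^3
assert evaluate(diff(f), 3.0) == 27.0    # d(x^3)/dx = 3x^2 = 27 at x = 3
```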

### Automatic Differentiation

Automatic differentiation cleverly amalgamates the strategies of numerical differentiation and symbolic differentiation to offer an efficient and precise mechanism for derivative evaluation. It breaks down the arithmetic operations in a program into a finite set of elementary operations, for each of which the rules of derivative evaluation are already known. Upon determining the derivative of each elementary operation, the chain rule is applied to synthesize these individual results, ultimately yielding the derivative of the entire program.

The fundamental strength of automatic differentiation lies in its ability to sidestep the primary drawbacks of both numerical and symbolic differentiation. Unlike numerical differentiation, which suffers from precision issues due to truncation and round-off errors, automatic differentiation facilitates accurate derivative evaluations. Furthermore, it mitigates the issue of expression swell, a significant concern in symbolic differentiation, by decomposing the program into a series of elementary expressions. Symbolic differentiation rules are only applied to these simplified expressions, and the derivative results are reused to enhance efficiency.

Automatic differentiation also surpasses symbolic differentiation in its capability to handle control flow statements. It has the ability to process branching, looping, and recursion, enhancing its flexibility and applicability to complex computational scenarios.

In contemporary applications, automatic differentiation has found widespread use in deep learning frameworks for the evaluation of derivatives, given its blend of accuracy and efficiency. The subsequent sections delve into the mechanics and implementation aspects of automatic differentiation, elucidating its role as a crucial tool in computational mathematics and machine learning.

## Forward Mode and Reverse Mode

Automatic differentiation can be categorized into two modes, forward and reverse, based on the sequence in which the chain rule is applied. Consider a composite function $y=a(b(c(x)))$. The formula to calculate its gradient, $\frac{\partial y}{\partial x}$, is given as:

$$\frac{\partial y}{\partial x}=\frac{\partial y}{\partial a}\frac{\partial a}{\partial b}\frac{\partial b}{\partial c}\frac{\partial c}{\partial x}$$

In the forward mode of automatic differentiation, the computation of the gradient originates from the inputs, as shown in the following formulation:

$$\frac{\partial y}{\partial x}=(\frac{\partial y}{\partial a}(\frac{\partial a}{\partial b}(\frac{\partial b}{\partial c}\frac{\partial c}{\partial x})))$$

Conversely, in the reverse mode, the computation of the gradient begins from the outputs, represented by the equation:

$$\frac{\partial y}{\partial x}=(((\frac{\partial y}{\partial a}\frac{\partial a}{\partial b})\frac{\partial b}{\partial c})\frac{\partial c}{\partial x})$$

To illustrate the computation methods of the two modes, let us consider the following function and aim to evaluate its derivative, $\frac{\partial y}{\partial x_1}$, at the point $(x_1, x_2)=(2,5)$:

$$y=f(x_1,x_2)=\ln(x_1)+x_1 x_2-\sin(x_2)$$

Figure :numref:`ch04/ch04-calculation_graph` represents the computational graph of this function, providing a visual demonstration of how automatic differentiation processes the function in both forward and reverse modes. This distinction between forward and reverse modes is particularly important when dealing with functions of multiple variables, with each mode having specific use cases and efficiency implications.


:label:`ch04/ch04-calculation_graph`

### Forward Mode


:label:`ch04/ch04-forward-mode-compute-function`

Figure :numref:`ch04/ch04-forward-mode-compute-function` elucidates the computation process within the forward mode. The sequence of elementary operations, derived from the source program, is displayed on the left. Following the chain rule and using established derivative evaluation rules, we sequentially compute each intermediate variable ${\dot{v}_i}=\frac{\partial v_i}{\partial x_1}$ from top to bottom, as depicted on the right. Consequently, this leads to the computation of the final variable ${\dot{v}_5}=\frac{\partial y}{\partial x_1}$. In the process of derivative evaluation of a function, we obtain a set of partial derivatives of any output with respect to any input of this function. For a function $f:{\mathbf{R}^n}\to \mathbf{R}^m$, where $n$ is the number of independent input variables $x_i$ and $m$ is the number of independent output variables $y_i$, the derivative results correspond to the following Jacobian matrix:

$$
\mathbf{J}_{f}= \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}
$$

Each forward pass of function $f$ results in partial derivatives of all outputs with respect to a single input, represented by the vector below. This corresponds to one column of the Jacobian matrix. Therefore, executing $n$ forward passes gives us the full Jacobian matrix.

$$
\begin{bmatrix} \frac{\partial y_1}{\partial x_i} \\
\vdots \\
\frac{\partial y_m}{\partial x_i} \end{bmatrix}
$$

The forward mode allows us to compute Jacobian-vector products by initializing $\dot{\mathbf{x}}=\mathbf{r}$ to generate the results for a single column. As the derivative evaluation rules for elementary operations are pre-determined, we know the Jacobian matrix for all the elementary operations. Consequently, by leveraging the chain rule to evaluate the derivatives of $f$ propagated from inputs to outputs, we secure one column in the Jacobian matrix of the entire network.

$$
\mathbf{J}_{f}\mathbf{r}= \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \begin{bmatrix} r_1 \\
\vdots \\
r_n \end{bmatrix}
$$
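A forward pass seeded with $\dot{x}_1=1$ (i.e. $\mathbf{r}=e_1$) can be sketched with dual numbers, each carrying a value and its derivative. This illustrative implementation evaluates $\frac{\partial y}{\partial x_1}$ for the example function $y=\ln(x_1)+x_1 x_2-\sin(x_2)$ at $(2,5)$:

```python
import math

class Dual:
    """Forward-mode AD via dual numbers: carries (value, derivative)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __sub__(self, o):
        return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o):  # product rule
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)

def log(d):
    return Dual(math.log(d.val), d.dot / d.val)

def sin(d):
    return Dual(math.sin(d.val), d.dot * math.cos(d.val))

# Seed x1.dot = 1 to obtain the column dy/dx1 in one pass.
x1, x2 = Dual(2.0, 1.0), Dual(5.0, 0.0)
y = log(x1) + x1 * x2 - sin(x2)
assert abs(y.dot - 5.5) < 1e-12   # dy/dx1 = 1/x1 + x2 = 0.5 + 5
```

Computing $\frac{\partial y}{\partial x_2}$ as well would require a second pass seeded with $\dot{x}_2=1$, mirroring the "one column per forward pass" property above.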

### Reverse Mode

Figure :numref:`ch04/ch04-backward-mode-compute` illustrates the automatic differentiation process in the reverse mode. The sequence of elementary operations, derived from the source program, is displayed on the left. Beginning from $\bar{v}_5=\bar{y}=\frac{\partial y}{\partial y}=1$, we sequentially compute each intermediate variable ${\bar{v}_i}=\frac{\partial y_j}{\partial v_i}$ from bottom to top, leveraging the chain rule and established derivative evaluation rules (as depicted on the right). Thus, we can compute the final variables ${\bar{x}_1}=\frac{\partial y}{\partial x_1}$ and ${\bar{x}_2}=\frac{\partial y}{\partial x_2}$.


:label:`ch04/ch04-backward-mode-compute`

Every reverse pass of function $f$ produces partial derivatives of a single output with respect to all inputs, represented by the following vector. This corresponds to a single row of the Jacobian matrix. Consequently, executing $m$ reverse passes gives us the full Jacobian matrix.

$$
\begin{bmatrix} \frac{\partial y_j}{\partial x_1} & \cdots & \frac{\partial y_j}{\partial x_n} \end{bmatrix}
$$

Similarly, we can compute vector-Jacobian products to obtain the results for a single row:

$$
\mathbf{r}^{T}\mathbf{J}_{f}= \begin{bmatrix} r_1 & \cdots & r_m \end{bmatrix} \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}
$$
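Reverse mode can be sketched with a small tape: the forward pass records, for each operation, how to push its adjoint back to its parents, and the tape is then replayed in reverse starting from $\bar{y}=1$. This illustrative implementation (our own names, not a framework API) recovers both partials of the example function $y=\ln(x_1)+x_1 x_2-\sin(x_2)$ at $(2,5)$ in a single reverse pass:

```python
import math

class Var:
    """Minimal reverse-mode AD: a shared tape of adjoint-propagation steps."""
    tape = []
    def __init__(self, val):
        self.val, self.grad = val, 0.0
    def accum(self, g):
        self.grad += g
    def __add__(self, o):
        out = Var(self.val + o.val)
        Var.tape.append(lambda: (self.accum(out.grad), o.accum(out.grad)))
        return out
    def __sub__(self, o):
        out = Var(self.val - o.val)
        Var.tape.append(lambda: (self.accum(out.grad), o.accum(-out.grad)))
        return out
    def __mul__(self, o):
        out = Var(self.val * o.val)
        Var.tape.append(lambda: (self.accum(o.val * out.grad),
                                 o.accum(self.val * out.grad)))
        return out

def log(v):
    out = Var(math.log(v.val))
    Var.tape.append(lambda: v.accum(out.grad / v.val))
    return out

def sin(v):
    out = Var(math.sin(v.val))
    Var.tape.append(lambda: v.accum(out.grad * math.cos(v.val)))
    return out

x1, x2 = Var(2.0), Var(5.0)
y = log(x1) + x1 * x2 - sin(x2)   # forward pass records the tape
y.grad = 1.0                      # seed dy/dy = 1
for step in reversed(Var.tape):   # replay the tape in reverse
    step()
assert abs(x1.grad - 5.5) < 1e-12                     # 1/x1 + x2
assert abs(x2.grad - (2.0 - math.cos(5.0))) < 1e-12   # x1 - cos(x2)
```

Note that the forward pass must store every intermediate value (here, inside the closures on the tape) so the reverse pass can use them; this is the memory cost discussed below.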

The number of columns and rows in a Jacobian matrix directly influences the number of forward and reverse passes needed to compute it for a given function $f$. This characteristic is particularly significant when determining the most efficient mode of automatic differentiation.

When the function has significantly fewer inputs than outputs $(f:{\mathbf{R}^n}\to \mathbf{R}^m, n \ll m)$, the forward mode proves to be more efficient. Conversely, when the function has considerably more inputs than outputs $(f:{\mathbf{R}^n}\to \mathbf{R}^m, n \gg m)$, the reverse mode becomes advantageous.

For an extreme case where the function maps from $n$ inputs to a single output $f:{\mathbf{R}^n}\to \mathbf{R}$, we can evaluate all the derivatives of the output with respect to the inputs $(\frac{\partial y}{\partial x_1},\cdots,\frac{\partial y}{\partial x_n})$ using a single reverse pass or $n$ forward passes. This is a situation akin to derivative evaluation for a multi-input, single-output network, a structure frequently encountered in machine learning.

Due to this feature, reverse-mode automatic differentiation forms the basis for the backpropagation algorithm, a key technique for training neural networks. By enabling efficient computation of gradients, especially in scenarios with high-dimensional input data and scalar output (common in many machine learning applications), reverse-mode automatic differentiation has become indispensable in the field.
|
||||
|
||||
However, the reverse mode does come with certain limitations. For
|
||||
instance, once a source program is decomposed into a sequence of
|
||||
elementary operations in the forward mode, inputs can be obtained
|
||||
synchronously during the execution of these operations. This is possible
|
||||
because the sequence of derivative evaluations aligns with the sequence
|
||||
of operation execution. In contrast, in the reverse mode, the sequence
|
||||
for derivative evaluation is the inverse of the execution sequence of
|
||||
the source program, leading to a two-phased computation process. The
|
||||
initial phase entails executing the source program and storing the
|
||||
intermediate results in memory, while the subsequent phase involves
|
||||
retrieving these intermediate results to evaluate the derivatives. Due
|
||||
to the additional steps involved, the reverse mode requires more memory.
|
||||
|
||||
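The two-phase process described above can be sketched in a few lines of plain Python (illustrative names only, not any framework's API): the first phase runs $y=(x_1+x_2)\cdot x_3$ and stores the intermediates the adjoint will need, and the second phase reads them back in reverse order.

```python
def forward(x1, x2, x3):
    # Phase 1: run the program and record the intermediates needed later.
    t = x1 + x2
    y = t * x3
    saved = {"t": t, "x3": x3}   # stored in memory for the backward phase
    return y, saved

def backward(saved, dy=1.0):
    # Phase 2: visit operations in reverse, reading saved intermediates.
    dt = dy * saved["x3"]        # y = t * x3  =>  dy/dt  = x3
    dx3 = dy * saved["t"]        #             =>  dy/dx3 = t
    dx1 = dt                     # t = x1 + x2 =>  dt/dx1 = 1
    dx2 = dt                     #             =>  dt/dx2 = 1
    return dx1, dx2, dx3

y, saved = forward(1.0, 2.0, 4.0)
print(y, backward(saved))  # 12.0 (4.0, 4.0, 3.0)
```

The `saved` dictionary is exactly the extra memory the reverse mode pays for: the forward phase must keep every intermediate that some adjoint will later read.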
## Implementing Automatic Differentiation

This section explores typical design patterns for implementing automatic differentiation in machine learning frameworks. These design patterns can be broadly classified into three categories: elemental libraries, operator overloading, and source transformation.

### Elemental Libraries

Elemental libraries encapsulate elementary expressions and their differential expressions as library functions. When coding, users must manually decompose a program into a set of elementary expressions and replace them with the corresponding library functions. Take the program $a=(x+y)/z$ as an example; it needs to be manually decomposed as follows:

```
t = x + y
a = t / z
```

Subsequently, users replace the decomposed elementary expressions with the appropriate library functions:

```
// The parameters include variables x, y, and t and their derivative variables dx, dy, and dt.
call ADAdd(x, dx, y, dy, t, dt)
// The parameters include variables t, z, and a and their derivative variables dt, dz, and da.
call ADDiv(t, dt, z, dz, a, da)
```

The library functions ADAdd and ADDiv use the chain rule to define the differential expressions of Add and Div, respectively. This is illustrated in Code `lst:diff`.

**lst:diff**
```python
def ADAdd(x, dx, y, dy, z, dz):
    z = x + y
    dz = dx + dy

def ADDiv(x, dx, y, dy, z, dz):
    z = x / y
    dz = dx / y - (x / (y * y)) * dy
```
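Because Python cannot write results back through parameters the way the pseudocode suggests, a directly runnable variant (a sketch with illustrative names, returning each result and its derivative instead of mutating `z` and `dz`) looks like this:

```python
def ad_add(x, dx, y, dy):
    # z = x + y; the derivative follows the sum rule.
    return x + y, dx + dy

def ad_div(x, dx, y, dy):
    # z = x / y; the derivative follows the quotient rule:
    # dz = dx / y - (x / y^2) * dy
    return x / y, dx / y - (x / (y * y)) * dy

# Differentiate a = (x + y) / z with respect to x at (x, y, z) = (1, 2, 3):
# seed dx = 1 and all other derivative inputs with 0.
t, dt = ad_add(1.0, 1.0, 2.0, 0.0)
a, da = ad_div(t, dt, 3.0, 0.0)
print(a, da)  # a = 1.0, da = 1/3
```

Seeding a different input's derivative with 1 yields the partial derivative with respect to that input, one forward pass per input.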
Elemental libraries constitute a simple and straightforward way of implementing automatic differentiation for programming languages. However, this approach requires users to manually decompose a program into elementary expressions and program against the library functions, and it precludes the use of the native expressions of the programming language.
### Operator Overloading

Leveraging the polymorphism inherent in modern programming languages, the operator overloading design pattern redefines the semantics of elementary operations to encapsulate their differentiation rules. During execution, it records the type, inputs, and outputs of every elementary operation in a data structure known as a tape. The tape yields a trace of the program, along which the chain rule can be applied: elementary operations are aggregated in either the forward or the backward direction to carry out differentiation. As depicted in Code `lst:OO`, we take the AutoDiff library as an example of overloading the basic arithmetic operators of a programming language.

**lst:OO**
```csharp
namespace AutoDiff
{
    public abstract class Term
    {
        // The overloaded operators (`+`, `*`, and `/`) call TermBuilder,
        // which records the types, inputs, and outputs of operations in tapes.
        public static Term operator+(Term left, Term right)
        {
            return TermBuilder.Sum(left, right);
        }
        public static Term operator*(Term left, Term right)
        {
            return TermBuilder.Product(left, right);
        }
        public static Term operator/(Term numerator, Term denominator)
        {
            return TermBuilder.Product(numerator, TermBuilder.Power(denominator, -1));
        }
    }

    // Tape data structures include the following basic elements:
    // 1) Arithmetic results of operations
    // 2) Derivative evaluation results corresponding to those arithmetic results
    // 3) Inputs of operations
    // In addition, the functions Eval and Diff define the computation and
    // differentiation rules of the arithmetic operations.
    internal abstract class TapeElement
    {
        public double Value;
        public double Adjoint;
        public InputEdges Inputs;

        public abstract void Eval();
        public abstract void Diff();
    }
}
```
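The same tape mechanism can be sketched in a few lines of Python by overloading `__add__` and `__mul__` (an illustrative sketch, not the actual implementation of any framework): each overloaded operator records its inputs and local derivatives on a tape, and a reverse sweep over the tape applies the chain rule.

```python
class Var:
    tape = []  # records (output, [(input, local_derivative), ...]) per op

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def __add__(self, other):
        out = Var(self.value + other.value)
        Var.tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        Var.tape.append((out, [(self, other.value), (other, self.value)]))
        return out

def backward(output):
    output.grad = 1.0
    # Walk the tape in reverse, accumulating adjoints via the chain rule.
    for out, inputs in reversed(Var.tape):
        for inp, local in inputs:
            inp.grad += out.grad * local

x, y = Var(2.0), Var(3.0)
z = x * y + x          # the tape records one Mul and one Add
backward(z)
print(x.grad, y.grad)  # 4.0 2.0
```

Note that the trace is built as the program runs: control flow in user code simply determines which operations land on the tape.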
Operator overloading has the advantage of tracing the program through function calls and control flow, which makes the implementation simple and straightforward. However, the need to trace the program at runtime introduces certain challenges. Specifically, reverse-mode differentiation must be executed along the trace, which can degrade performance, particularly for elementary operations that execute quickly relative to the tracing overhead. Furthermore, because tracing happens at runtime, operator overloading cannot perform compile-time graph optimization before execution, and control flow must be unrolled based on the information available at runtime. Despite these challenges, operator overloading is employed extensively for automatic differentiation in the PyTorch framework due to its inherent simplicity and adaptability.
### Source Transformation

Source transformation is a design pattern that extends programming languages and analyzes a program's source code or its abstract syntax tree (AST) to automatically decompose the program into a set of differentiable elementary operations, each with predefined differentiation rules. The chain rule is then employed to combine the differential expressions of the elementary operations, producing a new program expression that performs the differentiation. Source transformation is integral to machine learning frameworks such as TensorFlow and MindSpore.

Unlike operator overloading, which works within the programming language, source transformation requires parsers and tools that manipulate IRs. It also requires transformation rules for function calls and control flow statements, such as loops and conditions. The principal advantage of source transformation is that the automatic differentiation transformation occurs only once per program, thus eliminating runtime overhead. Additionally, the complete differentiation program is available during compilation, enabling ahead-of-time optimization by compilers.

However, source transformation presents a higher implementation complexity than the other approaches. It must support a wider array of data types and operations, and it requires preprocessors, compilers, or interpreters for the extended language, along with a more robust type-checking system. Even though source transformation does not perform the automatic differentiation transformation at runtime, it must still ensure that certain intermediate variables from the forward pass are accessible to the adjoint in reverse mode. Two modes are available to facilitate this:

- **Tape-based mode**: This mode requires a global tape that ensures the accessibility of intermediate variables. The primitive function is augmented so that intermediate variables are written to the tape during the forward pass, and the adjoint program reads these intermediate variables from the tape during the backward pass. The tape used in source transformation primarily stores the intermediate variables, whereas the tape used in operator overloading additionally stores the executed operation types. Because the tape is a data structure constructed at runtime, custom compiler optimizations are required. Moreover, tape read and write operations must themselves be differentiable to support higher-order differentiation, which involves multiple applications of reverse mode. As most tape-based tools do not differentiate tape reads and writes, such tools do not support reverse-over-reverse automatic differentiation.

- **Closure-based mode**: This mode was proposed to mitigate some of the limitations observed in the tape-based mode. Within functional programming, closures can capture the execution environment of a statement and identify the non-local use of intermediate variables.
# Overview of AI Compilers

Like classical compilers, AI compilers convert user-written code into efficient machine-executable code. In the following, we delve into the intricacies of AI compilers, touching on various concepts shared with general-purpose compilers such as ahead-of-time (AOT) compilation, just-in-time (JIT) compilation, intermediate representations (IRs), pass-based optimization, abstract syntax trees, side effects, and closures. Our focus is primarily on the distinctive design and functionality of AI compilers as compared to classical compilers, rather than on definitions of these concepts, which can be found in numerous other compiler textbooks.

The design of AI compilers is significantly influenced by classical compilers like LLVM (Low Level Virtual Machine). Thus, gaining an understanding of the basic architecture of the LLVM compiler, depicted in Figure :numref:`ch04/llvm-basic`, will be beneficial.


:label:`ch04/llvm-basic`
The LLVM compiler consists of three components: the frontend, the intermediate representation, and the backend. The frontend converts high-level languages into IRs. The backend then transforms these IRs into machine instructions executable on the target hardware. As their name implies, IRs serve as a transition phase from the frontend to the backend, where the necessary optimizations can take place. The architecture of the LLVM compiler ensures that IRs are reusable and compatible with any newly introduced frontend or hardware. While IRs can exist on one or more levels, LLVM typically uses a one-level structure, meaning the frontend and backend optimizations share the same set of IRs.

AI compilers, on the other hand, commonly employ a multi-level IR structure. An example is the multi-level IR (MLIR) design adopted by TensorFlow, as depicted in Figure :numref:`ch04/TF-IR`. TensorFlow's MLIR comprises three levels of IRs: the TensorFlow graph IR, the XLA HLO IR, and hardware-specific LLVM IR or TPU IR. The subsequent sections briefly outline these levels and their corresponding compilation optimization processes.


:label:`ch04/TF-IR`
The process of optimization on computational graphs is known as graph compilation optimization. The first level of IR, the graph IR, carries out optimizations and operations (e.g., graph optimization and graph segmentation) on an entire graph. While this whole-graph IR is suitable for static graph execution, it is ill-suited to hardware-specific optimization because it carries no hardware information. To address this, hardware-specific generic compilation optimization is applied at the mid-level IR. Platforms like XLA, TensorRT, and MindSpore's graph kernel fusion enhance the execution performance of various neural networks on specific hardware by performing operator fusion and other optimizations for different hardware types.

The final level of IR deals exclusively with a certain type of hardware accelerator and often comes bundled with a hardware vendor's compiler. For instance, the TBE compiler, paired with Ascend hardware, is based on TVM's HalideIR, from which its efficient execution operators are generated.

The multi-level IR design grants IRs enhanced flexibility and facilitates more efficient pass-based optimization at each specific IR level. However, this design has limitations. First, achieving fully compatible IR transformation across different levels is challenging due to the substantial engineering effort required and potential information loss during the transformation. Optimization carried out at one IR level might eliminate some information, and the implications of this removal must be evaluated at the next level. As a result, IR transformation imposes stricter constraints on the sequence in which optimizations occur. Second, deciding at which of two adjacent levels to perform certain IR optimizations presents a dilemma for framework developers. Lastly, because different IR levels can define different operator granularities, some accuracy might be compromised.
To mitigate these drawbacks, the AI compiler in the MindSpore machine learning framework uses a unified IR design known as MindIR. Figure :numref:`ch04/msflow` illustrates the internal execution process of MindSpore's AI compiler. In this process, the compiler frontend handles graph compilation and hardware-agnostic optimization, while the compiler backend conducts tasks like hardware-specific optimization and operator selection.


:label:`ch04/msflow`
# Frontend Compilation Optimization

Much like classical compilers, AI compilers implement compilation optimization to enhance the effectiveness of the IRs generated during the compilation process. This strategy reduces not only the length of the code and the time required for its compilation and execution but also the energy used by processors during execution. Compilation optimization techniques can be divided into two categories: hardware-agnostic optimization and hardware-specific optimization. However, all optimization techniques applied at the frontend are inherently hardware-agnostic, as the frontend remains oblivious to the backend hardware specifics.

## Process of Compilation Optimization

Typically, compilation optimizers execute a sequence of optimization passes. Each pass takes an IR as input and produces a revised IR as output. A single pass might incorporate several sub-passes and can be run once or multiple times.

The overall success of compilation optimization significantly depends on the selection and ordering of optimization operations. Not only does the compiler execute various compilation optimization operations as needed, but it can also adjust the number of optimization passes along with the types and sequence of optimization operations. These adjustments are contingent on the configured compilation optimization level, as illustrated in Figure :numref:`ch06/ch06-opt-pass`.


:label:`ch06/ch06-opt-pass`
## Prevalent Optimization Methods

Today, a wide array of frontend compilation optimization methods exists. Machine learning frameworks likewise employ various optimization methods, although these diverge from those found in classical compilers. This section details three frequently employed and versatile frontend compilation optimization methods.

### Elimination of Dead Code and Unreachable Code

Dead code refers to code whose outputs are not used by any other code, while unreachable code refers to code that lies on no valid control flow path. Figure :numref:`ch06/ch06-opt-pass-useless-code0-elimination` illustrates these two types of code. Removing dead or unreachable code decreases the size of the IR and speeds up both compilation and execution. Such code can result from human error or may arise as a by-product of other compilation optimizations.


:label:`ch06/ch06-opt-pass-useless-code0-elimination`
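A minimal sketch of a dead-code elimination pass (illustrative only, operating on a toy list-of-assignments IR rather than any real compiler's data structures): a backward sweep keeps an assignment only if its target is needed by a live instruction or by the program's outputs.

```python
def eliminate_dead_code(instrs, live_outputs):
    """instrs: list of (target, input_vars) tuples in execution order.
    Keeps only assignments whose targets are (transitively) needed."""
    needed = set(live_outputs)
    kept = []
    for target, input_vars in reversed(instrs):
        if target in needed:
            kept.append((target, input_vars))
            needed |= set(input_vars)  # inputs of a live instruction are live
    return list(reversed(kept))

program = [
    ("t1", ["x", "y"]),   # t1 = x + y
    ("t2", ["x"]),        # t2 = x * 2   (dead: t2 is never read)
    ("out", ["t1"]),      # out = t1
]
print(eliminate_dead_code(program, ["out"]))
# [('t1', ['x', 'y']), ('out', ['t1'])]
```

Unreachable-code elimination works on the control-flow graph instead: basic blocks not reachable from the entry block are dropped wholesale.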
In Section [Conversion Between and Combination of Dynamic and Static Graphs](#subsec:conversion_between_and_combination_of_dynamic_and_static_graphs), it was previously mentioned that the tracing method can be employed when converting dynamic graphs to static graphs. The tracing method is highly effective at identifying dead code and unreachable code, so this step is often incorporated into the graph conversion procedure.
### Constant Propagation and Constant Folding

Constant propagation replaces a variable that is known to hold a constant value with that value during compilation. Constant folding, on the other hand, replaces an expression with a constant when the result of its operations can be computed directly during compilation. Figure :numref:`ch06/ch06-opt-pass-constant-broadcast` depicts these two methods.


:label:`ch06/ch06-opt-pass-constant-broadcast`
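As a hedged illustration (a toy pass over Python's own `ast` module, not how any production AI compiler implements it), the two transformations compose naturally: propagation substitutes known constant variables, and folding then evaluates any sub-expression whose operands are all constants.

```python
import ast

class FoldPass(ast.NodeTransformer):
    """A toy constant-propagation + constant-folding pass on a Python AST."""
    def __init__(self, known_consts):
        self.known = known_consts

    def visit_Name(self, node):
        # Constant propagation: replace a variable known to be constant.
        if node.id in self.known:
            return ast.copy_location(ast.Constant(self.known[node.id]), node)
        return node

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            ops = {ast.Add: lambda a, b: a + b, ast.Mult: lambda a, b: a * b}
            fn = ops.get(type(node.op))
            if fn:  # constant folding: evaluate at "compile" time
                return ast.copy_location(
                    ast.Constant(fn(node.left.value, node.right.value)), node)
        return node

def optimize(expr_src, known_consts):
    tree = ast.parse(expr_src, mode="eval")
    tree = ast.fix_missing_locations(FoldPass(known_consts).visit(tree))
    return ast.unparse(tree)

print(optimize("x * 60 + 30", {"x": 2}))  # 150
print(optimize("x * 60 + y", {"x": 2}))   # 120 + y
```

The second call shows the partial case: folding collapses `2 * 60` even though the whole expression cannot be reduced to a constant.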
### Common Subexpression Elimination

To understand what common subexpression elimination entails, consider the following: if an expression E has already been computed and the values of all its variables have remained unchanged since that computation, E is identified as a common subexpression. This concept is visualized in Figure :numref:`ch06/ch06-opt-pass-CSE`. E does not need to be computed again; it can be directly replaced with the result obtained from the preceding computation.


:label:`ch06/ch06-opt-pass-CSE`
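A minimal sketch of the idea (illustrative, assuming an SSA-like IR so that variable values cannot change between the two computations): a table keyed by `(operator, operands)` detects a repeated expression and replaces its recomputation with a copy of the earlier result.

```python
def eliminate_common_subexpressions(instrs):
    """instrs: list of (target, op, operands) in SSA-like form."""
    seen = {}   # (op, operands) -> variable already holding the result
    out = []
    for target, op, operands in instrs:
        key = (op, tuple(operands))
        if key in seen:
            out.append((target, "copy", [seen[key]]))  # reuse earlier result
        else:
            seen[key] = target
            out.append((target, op, operands))
    return out

program = [
    ("t1", "add", ["a", "b"]),
    ("t2", "mul", ["t1", "c"]),
    ("t3", "add", ["a", "b"]),   # common subexpression: same as t1
]
print(eliminate_common_subexpressions(program))
# t3 becomes a copy of t1
```

A later copy-propagation pass would typically remove the `copy` instruction entirely by renaming uses of `t3` to `t1`.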
Common subexpression elimination, like the elimination of dead code and unreachable code, is typically carried out during the graph conversion process. In PyTorch, the TorchScript module provides a dedicated API for common subexpression elimination, which is a natural fit because common subexpressions are straightforward to identify within TorchScript graphs.
# AI Compiler Frontend

Tailored for machine learning frameworks, an AI compiler is designed to convert Python-based machine learning programs into their optimized forms, enabling efficient native execution on heterogeneous processors. This chapter first outlines the typical architecture of an AI compiler before delving into the design of the compiler's frontend. The compiler frontend incorporates various techniques, including intermediate representations (IRs), automatic differentiation, type systems, static analysis, and compilation optimization.

The learning objectives of this chapter include:

- Understanding the typical architecture of an AI compiler.

- Understanding the types and implementation of IRs in machine learning frameworks.

- Understanding the methods of automatic differentiation implemented in AI compilers.

- Understanding type systems and static analysis in AI compilers.

- Understanding common frontend compilation optimization methods used by AI compilers.

```toc
:maxdepth: 2

Overview_of_AI_Compilers
Overview_of_AI_Compiler_Frontends
Intermediate_Representation
Automatic_Differentiation
Type_Systems_and_Static_Analysis
Frontend_Compilation_Optimization
Chapter_Summary
Further_Reading
```
# Intermediate Representation

In this section, we begin by introducing basic IR concepts and the types of IR employed in classical compilers. Next, we address the new requirements and challenges that arise in IR design for machine learning frameworks. To conclude, we examine the types of IRs utilized by well-known machine learning frameworks and delve into their implementation.

## Definition of Intermediate Representations

An IR is a data structure or form of code that a compiler uses to represent source code. Almost all compilers need IRs to model the program code that requires analysis, transformation, and optimization. The representational capability of an IR is crucial during the compilation process: it must depict the source code accurately and without information loss, ensure the completeness of the source-to-target compilation, and guarantee the effectiveness and performance of code optimization.

As illustrated in Figure :numref:`ch04/ch04-IR`, IRs make it possible to represent multiple source languages at the frontend and to connect the backend to various target machines. Located between the frontend and backend is an optimizer, which allows new optimization processes to be added without touching the frontend and backend. These processes take existing IRs as input and generate new IRs as output. By analyzing and optimizing IRs, the optimizer enhances the extensibility of the compilation process and minimizes the impact that an optimization process might have on the frontend and backend.


:label:`ch04/ch04-IR`

With the ongoing evolution of compiler techniques, the development of IRs has progressed through three stages. In the initial stage, IRs were confined within a compiler and used exclusively by compiler developers. In the middle stage, as certain compilers became open source, IRs were made publicly available, primarily for use by the users of compilers and related compilation tools. In the current stage, IRs are advancing toward facilitating an ecosystem of ecosystems (through a unified IR approach), encouraging more and more stakeholders (for example, hardware accelerator designers and machine learning framework users) to participate in advancing AI computing.
## Types of Intermediate Representations

We will discuss the various types of IR structures used by classical compilers. Understanding these IR structures is essential for analyzing source programs and generating optimized compiled code. Table :numref:`ch06/ch06-categorize` offers an overview of the different IR types. IR structures must be designed carefully, in accordance with the specific requirements of the compiler's design.

:Types of IRs

| IR Structure | Characteristics                      | Examples                                     |
| ------------ | ------------------------------------ | -------------------------------------------- |
| Linear IR    | Based on linear code                 | Stack machine code, three-address code       |
| Graphical IR | Based on graphs                      | Abstract syntax tree, directed acyclic graph |
| Hybrid IR    | Based on both graphs and linear code | LLVM IR                                      |

:label:`ch06/ch06-categorize`
### Linear Intermediate Representation

Linear IRs are widely used in compiler design, resembling assembly code for abstract machines. They represent the code to be compiled as a sequentially ordered series of operations, and this ordering matters in practice. Linear IRs are popular because most processors execute linear assembly languages.

Two common types of linear IRs are stack machine code and three-address code. Stack machine code, a form of single-address code, offers a straightforward and compact representation. Instructions in stack machine code typically consist solely of an opcode that specifies an operation, with operands stored on a stack. Most instructions retrieve operands from the stack and push the results of their operations back onto it. Three-address code (3AC), on the other hand, emulates the instruction format of modern RISC machines. It employs a set of quadruples, each containing an operator and three addresses (two operands and one target). Figure :numref:`ch04/ch04-linearIR` illustrates the stack machine code and three-address code representations of the expression $a-b*5$.
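In textual form (a sketch consistent with the usual conventions; exact mnemonics vary by textbook), the two representations of $a - b * 5$ might look like:

```
; stack machine code        ; three-address code
push a                      t1 <- b * 5
push b                      t2 <- a - t1
push 5
multiply                    ; pops 5 and b, pushes b * 5
subtract                    ; pops b * 5 and a, pushes a - b * 5
```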

|
||||
:label:`ch04/ch04-linearIR`
|
||||
|
||||
### Graphical Intermediate Representation

Graphical IRs store information about the compilation process in the form of graphs, which use nodes, edges, lists, trees, and other elements to collectively represent an algorithm. Although all graphical IRs consist of nodes and edges, they differ in abstraction level and graph structure. Common examples of graphical IRs include abstract syntax trees (ASTs), directed acyclic graphs (DAGs), and control-flow graphs (CFGs).

An AST is a tree-structured IR that closely mirrors the structure of the source code. Figure :numref:`ch04/ch04-AST_DAG` depicts the AST for the expression $a*5+a*5*b$. Note that the AST contains two identical copies of $a*5$, which introduces redundancy. To address this redundancy, a DAG offers a simplified representation in which identical subtrees can be shared by multiple parent nodes. By reusing subtrees, the DAG reduces the cost of evaluation, especially when the compiler can verify that the value of $a$ remains constant.


:label:`ch04/ch04-AST_DAG`
### Hybrid Intermediate Representation

Hybrid IRs combine elements of both linear and graphical IRs. An example is LLVM IR, illustrated in Figure :numref:`ch04/ch04-LLVM_IR`. LLVM is an open-source compiler framework whose goal is to provide unified IRs for different frontends and backends.

In LLVM IR, linear IRs are used to construct basic blocks, while graphical IRs represent the control flow between these blocks. Each instruction within a basic block is presented in static single assignment (SSA) form. SSA requires each variable to be defined before use and assigned a value only once. Multiple SSA instructions form a linear list within a basic block.

In the control-flow graph (CFG), each node represents a basic block, and control transfer between blocks is implemented through edges. This combination of linear IR for basic blocks and graphical IR for control flow gives LLVM IR a flexible and efficient representation.


:label:`ch04/ch04-LLVM_IR`
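As a small concrete illustration (standard LLVM IR syntax, independent of the figure; the function itself is made up), the following function body is a single basic block whose instructions are in SSA form, with each `%` value assigned exactly once:

```llvm
define i32 @axpy(i32 %a, i32 %x, i32 %y) {
entry:                          ; one basic block, i.e., one CFG node
  %prod = mul i32 %a, %x        ; each SSA value is defined exactly once
  %sum = add i32 %prod, %y
  ret i32 %sum
}
```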
## Intermediate Representation in Machine Learning Frameworks
|
||||
|
||||
Classical IRs (such as LLVM IR) primarily target programming languages
|
||||
for general-purpose computation tasks, which falls short of satisfying
|
||||
the unique requirements of machine-learning-related computation. When
|
||||
designing IRs tailored for machine learning frameworks, certain vital
|
||||
factors warrant attention:
|
||||
|
||||
- **Tensor Representation**. Given the predominance of tensor data in
|
||||
machine learning frameworks, it's imperative that the IRs can
|
||||
effectively handle tensor representation.
|
||||
|
||||
- **Automatic Differentiation**. A core aspect of machine learning
|
||||
involves evaluating derivatives of neural networks and optimizers
|
||||
through automatic differentiation. Accordingly, IRs must prioritize
|
||||
simplicity, performance, and scalability of higher-order
|
||||
differentials for automatic differentiation.
|
||||
|
||||
- **Computational Graph Mode**. Machine learning frameworks like
|
||||
TensorFlow, PyTorch, and MindSpore operate on two computational
|
||||
graph modes: static and dynamic. The static mode, with pre-defined
|
||||
computational graphs, enhances optimization but compromises on
|
||||
flexibility. Conversely, the dynamic mode trades running speed for
|
||||
flexibility and easier debugging by executing operators immediately
|
||||
in the computational graph. IRs should therefore support both modes,
|
||||
enabling users to choose the one best suited for their tasks while
|
||||
building algorithm models.
|
||||
|
||||
- **Support for Higher-order Functions and Closures**. Essential in
|
||||
functional programming, higher-order functions take or return
|
||||
functions, while closures bundle code blocks with references to the
|
||||
surrounding environment, facilitating access to an outer function's
|
||||
scope from an inner function. Such support reduces redundant code,
|
||||
improves abstraction, and enhances the flexibility and simplicity of
|
||||
framework representations.
|
||||
|
||||
- **Compilation Optimization**. Machine learning frameworks lean on
|
||||
compilation optimizations, including hardware-agnostic,
|
||||
hardware-specific, and deployment- or inference-related
|
||||
optimizations. These rely significantly on IRs implementations.
|
||||
|
||||
- **Just-in-Time (JIT) Compilation**. JIT compilation is frequently
  used to speed up compilation and execution in machine learning
  frameworks. JIT optimizations, including loop unrolling, fusion, and
  inlining, play a crucial role in optimizing parts of the data flow
  graphs in IRs. A flawed IR design can hamper JIT compilation
  performance and, in turn, the program's runtime performance.

Considering these factors, developers continually refine classical IRs
and introduce new IRs tailored specifically to machine learning
frameworks. The following sections examine the IRs employed by several
machine learning frameworks.

### Intermediate Representation in PyTorch

PyTorch is a dynamic, Python-oriented machine learning framework.
Renowned for its usability and flexibility, PyTorch simplifies the
writing and debugging of machine learning programs. It introduces
TorchScript, a way to construct serializable and optimizable models
when saving and loading neural networks.

Specifically, TorchScript IR employs JIT compilation to convert Python
code into target model files. All TorchScript programs can be saved
from a Python process and later loaded into processes that have no
Python dependency.

Aligning with the imperative programming paradigm, PyTorch
incorporates the TorchScript IR, a linear IR based primarily on Static
Single Assignment (SSA) form, to represent Python code. This
representation can be produced through either the tracing or scripting
method of JIT compilation. TorchScript IR not only strengthens model
deployment capabilities but also improves compilation performance.
Additionally, it greatly improves model visualization within the
PyTorch framework.

Code `lst:torchscript` illustrates the use of the scripting method
to print a TorchScript IR graph.

**lst:torchscript**
```python
import torch

@torch.jit.script
def test_func(input):
    rv = 10.0
    for i in range(5):
        rv = rv + input
        rv = rv / 2
    return rv

print(test_func.graph)
```

Code `lst:torchscriptir` shows the structure of this IR graph.

**lst:torchscriptir**
```
graph(%input.1 : Tensor):
  %9 : int = prim::Constant[value=1]()
  %5 : bool = prim::Constant[value=1]() # test.py:6:1
  %rv.1 : float = prim::Constant[value=10.]() # test.py:5:6
  %2 : int = prim::Constant[value=5]() # test.py:6:16
  %14 : int = prim::Constant[value=2]() # test.py:8:10
  %rv : float = prim::Loop(%2, %5, %rv.1) # test.py:6:1
    block0(%i : int, %rv.9 : float):
      %rv.3 : Tensor = aten::add(%input.1, %rv.9, %9) # <string>:5:9
      %12 : float = aten::FloatImplicit(%rv.3) # test.py:7:2
      %rv.6 : float = aten::div(%12, %14) # test.py:8:7
      -> (%5, %rv.6)
  return (%rv)
```

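For comparison, the tracing method of JIT compilation builds a TorchScript IR graph by recording the operators executed on an example input rather than by compiling the source code. A minimal sketch, using an illustrative function of our own rather than the listing above:

```python
import torch

def scale_and_shift(x):
    # A plain Python function; no decorator is needed for tracing.
    return x * 2.0 + 1.0

# torch.jit.trace runs the function once on the example input and records
# the executed operators into an SSA-based TorchScript IR graph.
traced = torch.jit.trace(scale_and_shift, torch.ones(3))
print(traced.graph)
```

Because tracing records only the path taken for the example input, data-dependent control flow is not preserved; the scripting method shown above is needed in that case.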
### Intermediate Representation in JAX

The JAX framework supports both static and dynamic computational
graphs and employs the JAX Program Representation (Jaxpr) IR. This IR
ensures that the output depends solely on the input rather than on any
global variables, with both input and output carrying type
information. Functionality-wise, Jaxpr IR supports an array of
features such as loops, branching, recursion, closure function
differentiation, and third-order differentiation, as well as
backpropagation and forward propagation in automatic differentiation.

Jaxpr IR utilizes A-normal Form (ANF), a form of functional
expression, whose grammar is demonstrated in Code `lst:ANF`.

**lst:ANF**
```
<aexp> ::= NUMBER | STRING | VAR | BOOLEAN | PRIMOP
        |  (lambda (VAR ...) <exp>)
<cexp> ::= (<aexp> <aexp> ...)
        |  (if <aexp> <exp> <exp>)
<exp>  ::= (let ([VAR <cexp>]) <exp>) | <cexp> | <aexp>
```

The ANF grammar divides expressions into atomic expressions (aexp) and
compound expressions (cexp). Atomic expressions represent constants,
variables, primitives, and anonymous functions, while compound
expressions, comprising several atomic expressions, can be viewed as
invocations of anonymous or primitive functions. The first element of
a cexp is the invoked function, and all subsequent elements are the
arguments of the invocation.

Code `lst:JaxCode` displays the Jaxpr corresponding to a function.

**lst:JaxCode**
```python
from jax import make_jaxpr
import jax.numpy as jnp

def test_func(x, y):
    ret = x + jnp.sin(y) * 3
    return jnp.sum(ret)

print(make_jaxpr(test_func)(jnp.zeros(8), jnp.ones(8)))
```

The structure of this Jaxpr is shown in Code `lst:JaxPr`.

**lst:JaxPr**
```
{ lambda ; a:f32[8] b:f32[8]. let
    c:f32[8] = sin b
    d:f32[8] = mul c 3.0
    e:f32[8] = add a d
    f:f32[] = reduce_sum[axes=(0,)] e
  in (f,) }
```

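As a sketch of the higher-order differentiation mentioned above, `jax.grad` can be nested and the resulting derivative function inspected with `make_jaxpr`; the function here is illustrative:

```python
import jax

def cube(x):
    return x ** 3

# Nesting jax.grad three times yields the third derivative; for x**3 this
# is the constant 6. make_jaxpr prints the Jaxpr of the transformed function.
third_derivative = jax.grad(jax.grad(jax.grad(cube)))
print(jax.make_jaxpr(third_derivative)(1.0))
print(third_derivative(2.0))
```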
### Intermediate Representation in TensorFlow

TensorFlow utilizes dataflow programming to execute numerical
computations through dataflow graphs. When running a program,
TensorFlow's static graph mechanism progresses through a series of
abstractions and analyses, transforming the program from higher-level
to lower-level IRs, a process referred to as "lowering".

To cater to diverse hardware platforms, TensorFlow employs a range of
IR designs. As illustrated in
Figure :numref:`ch04/ch04-tensorflow_ecosystem`, the blue boxes denote
graph-based IRs while the green ones indicate SSA-based IRs. During
the IR transformation, each level optimizes its IR independently,
without communicating with the other levels. Because no level is aware
of the optimizations performed at the others, each level must
implement its optimizations in full, often leading to repetitive work
and sub-optimal efficiency. Notably, transitioning from graph-based
IRs to SSA-based IRs involves a qualitative transformation that incurs
significant costs. The inability to reuse the same optimization code
across levels also hampers development efficiency.

Multi-level IRs thus present a mixed bag of advantages and
disadvantages. On the plus side, they offer flexible representations,
pass-based optimization at varying levels, and efficient optimization
algorithms. On the downside, they pose challenges inherent to their
design: the transformation between different IRs often complicates
full compatibility, increasing the engineering workload and
potentially leading to information loss. Lower-level optimization may
become difficult if the relevant information was already consumed by
optimization at a higher level. To mitigate such information loss, we
can impose stricter constraints on the optimization sequence.
Additionally, for optimizations that could be performed at either of
two adjacent levels, choosing the level at which to implement them can
be a conundrum for framework developers. Finally, defining distinct
operator granularities at different levels might impact accuracy to a
certain degree.


|
||||
:label:`ch04/ch04-tensorflow_ecosystem`
|
||||
|
||||
### Multi-Level Intermediate Representation

Multi-Level Intermediate Representation (MLIR) serves as a unified
platform for IRs rather than being a specific type of IR. Leveraging
the infrastructure provided by MLIR, developers can define IRs to suit
their needs. Thus, MLIR can be interpreted as a "compiler of
compilers". It extends beyond the TensorFlow framework and can be used
to construct IRs linking other languages to backend platforms (such as
LLVM).

Although the design of MLIR is heavily influenced by LLVM, MLIR
fosters a more open ecosystem. Because MLIR does not confine
developers to a fixed set of operation or abstraction types, it offers
more latitude to define IRs and solve specific problems. To facilitate
this extensibility, MLIR introduces the concept of "dialects", which
provide a grouping mechanism for abstractions under a unique
namespace. Each dialect defines its productions and associates
operations with an IR, thus producing an MLIR-typed IR. Within MLIR,
the "operation" is the fundamental unit of abstraction and
computation. Operations can carry application-specific semantics and
can encapsulate all the core IR structures in LLVM, including
instructions, functions, and modules.

The MLIR assembly for an operation is illustrated as follows:

```
%tensor = "toy.transpose"(%tensor) {inplace = true} : (tensor<2x3xf64>) -> tensor<3x2xf64> loc("example/file/path":12:1)
```

This MLIR operation can be dissected as follows:

- `%tensor`: The identifier for the result defined by this operation
  (prefixed with a `%` to prevent naming conflicts). An operation may
  define zero or more results, represented as SSA values.

- `"toy.transpose"`: The operation name. It is usually a unique
  string, with the dialect's namespace prefixed before the ".". This
  refers to the transpose operation within the toy dialect.

- `(%tensor)`: A list that can contain zero or more input operands (or
  arguments), which are SSA values defined by other operations or that
  refer to block arguments.

- `inplace = true`: A dictionary that may contain zero or more
  attributes, which are constant special operands. Here, a boolean
  attribute named `inplace` with a constant value of `true` is
  defined.

- `(tensor<2x3xf64>) -> tensor<3x2xf64>`: This represents the
  operation type in a functional form, specifying the input before the
  arrow and the output after it. The data types and shapes of the
  input and output are contained within the parentheses. For instance,
  `tensor<2x3xf64>` represents a tensor with a shape of `(2, 3)` and
  data type `float64`.

- `loc("example/file/path":12:1)`: This refers to the source code
  location from which this operation originated.

As each level's IR design adheres to this assembly, transformation
across levels is simplified, boosting the efficiency of IR
transformation. Moreover, different levels can interact to optimize
the IRs, enabling each optimization to be performed at the most
suitable level and negating the need for optimal implementation at
every level: other IRs can be optimized by transforming them into the
IR at the most appropriate level, enhancing both optimization and
development efficiency. TensorFlow can also employ MLIR to perform
multi-layer transformation from graph-based IRs to SSA-based IRs.

### Intermediate Representation in MindSpore

MindSpore adopts a graph-based functional IR known as MindSpore IR
(abbreviated to MindIR). MindIR employs a unified IR approach instead
of a multi-level IR structure, describing the network's logical
structure and operator attributes. This approach eliminates model
disparities across different backends, facilitating connections to
various target machines.

MindIR primarily caters to automatic differential transformation. It
implements a transformation method grounded in functional programming
frameworks, thereby making it similar to ANF (A-normal form)
functional semantics. Its defining characteristics include:

1. **Graph-based Representation**. MindSpore represents programs as
   graphs, which are conducive to optimization. MindSpore treats
   functions as essential elements of a machine learning program,
   allowing them to be invoked recursively, passed as parameters, or
   returned from other functions. This paves the way for representing
   a range of control flow structures.

2. **Purely Functional**. In a purely functional context, a function's
   outcome depends solely on its parameters. Side effects arise when a
   function relies on or modifies external state, such as global
   variables, and can lead to incorrect results if the code execution
   sequence is not strictly maintained. Side effects also complicate
   automatic differentiation, hence the requirement for pure
   functions. MindIR can transform representations with side effects
   into purely functional representations, ensuring the correct code
   execution sequence while upholding ANF functional semantics and
   enabling a higher degree of freedom in automatic differentiation.

3. **Closure Representation**. Reverse-mode automatic differentiation
   requires storing the intermediate results of basic operations in
   closures and then connecting them in combination. Closures, which
   bundle a code block with references to its surrounding environment,
   thus become particularly crucial. In MindIR, the code block takes
   the shape of a function graph, with the surrounding environment
   interpreted as the function invocation context.

4. **Strongly Typed**. Each node requires a specific type, which is
   essential for achieving optimal performance. This is particularly
   crucial in machine learning frameworks, where operator execution
   can be time-consuming and detecting errors as early as possible
   saves valuable time. MindIR's type and shape inference capabilities
   therefore center on support for function invocation and
   higher-order functions.

Figure :numref:`ch04/ch04-MindIR` outlines the MindIR grammar, based
on the characteristics of the MindSpore framework. ANode corresponds
to an atomic expression in ANF; ValueNode represents a constant value,
and ParameterNode a function's formal parameter; CNode, corresponding
to a compound expression in ANF, indicates function invocation.


:label:`ch04/ch04-MindIR`

The example in Code `lst:MindSporeCode` offers a deeper analysis of
MindIR.

**lst:MindSporeCode**
```
def func(x, y):
    return x / y

@ms_function
def test_f(x, y):
    a = x - 1
    b = a + y
    c = b * func(a, b)
    return c
```

The ANF expression corresponding to this function is demonstrated in
Code `lst:MindIR`.

**lst:MindIR**
```
lambda (x, y)
    let a = x - 1 in
    let b = a + y in
    let func = lambda (x, y)
        let ret = x / y in
        ret end in
    let %1 = func(a, b) in
    let c = b * %1 in
    c end
```

In ANF, each expression is bound to a variable with a `let`
expression, and dependencies on an expression's output are represented
through variable references. In contrast, MindIR packages each
expression as a node, representing dependencies through directed edges
connecting the nodes.
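The let-binding transformation above can be sketched in a few lines of Python: every compound sub-expression is bound to a fresh variable, making each dependency explicit. The expression encoding and variable naming here are illustrative, not MindSpore's implementation.

```python
import itertools

fresh = itertools.count(1)

def to_anf(expr, bindings):
    """expr is a number, a variable name, or an (op, lhs, rhs) tuple.
    Returns an atom; compound sub-expressions are appended to bindings."""
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr
    lhs_atom = to_anf(lhs, bindings)
    rhs_atom = to_anf(rhs, bindings)
    name = f"%{next(fresh)}"
    bindings.append((name, (op, lhs_atom, rhs_atom)))
    return name

# Flatten c = (x - 1) * ((x - 1) + y) into let-bindings.
bindings = []
result = to_anf(("*", ("-", "x", 1), ("+", ("-", "x", 1), "y")), bindings)
for name, (op, a, b) in bindings:
    print(f"let {name} = {a} {op} {b} in")
print(result)
```

Each printed `let` line corresponds to one node of the resulting graph, with the variable names standing in for the directed edges.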
# Overview of AI Compiler Frontends

Figure :numref:`ch04/compiler_frontend_structure` depicts the typical
structure of the AI compiler frontend within a machine learning
framework. As AI compilers parse source programs similarly to
classical compilers, we do not detail the parsing process here.
Instead, we explore a feature unique to the compiler frontend in a
machine learning framework: its automatic differentiation
functionality. To enact automatic differentiation, the machine
learning framework requires a new IR structure built upon classical
IRs. Consequently, this section concentrates on IRs and automatic
differentiation, and later provides a succinct introduction to basic
compiler concepts, including type systems, static analysis, and
frontend optimization.


:label:`ch04/compiler_frontend_structure`

An **Intermediate Representation** is a data structure, or a form of
code, employed by a compiler to represent source code. Essentially, an
IR serves as a bridge between a source language and a target language
during the compilation process. In classical compilers, IRs are
divided into linear IR, graphical IR, and hybrid IR. However, as these
classical IRs do not provide the comprehensive range of
functionalities required by machine learning frameworks, developers
have extended classical IRs and proposed numerous new IRs specifically
for machine learning frameworks.

**Automatic Differentiation** is a method used to compute derivatives
and efficiently resolve symbols for computational graphs. Combining
the benefits of both symbolic and numerical differentiation while
mitigating their shortcomings, automatic differentiation proves
particularly valuable in calculating the gradient of a function.
Modern AI algorithms, such as deep learning algorithms, use vast
amounts of data to learn models with various parameters, and typically
employ a gradient descent approach to update these parameters.
Automatic differentiation is therefore crucial to deep learning and an
integral component of training algorithms. It generally resolves IR
symbols during the frontend optimization process to generate new IRs
with gradient functions.

**Type Systems and Static Analysis** are incorporated into the
compiler frontend to help reduce potential runtime errors. A type
system can avert type errors during program execution, while static
analysis offers insights and other information for compilation
optimization, effectively reducing issues such as structural errors
and security vulnerabilities in program code.

**Frontend Compilation Optimization** aims to tackle code efficiency
issues. It is a significant aspect in both classical compilers and
machine learning frameworks and is independent of specific hardware
types.
# Chapter Summary

- Intermediate Representation (IR) serves as one of the fundamental
  data structures of a compiler. It represents the transition from the
  source language to the target language during the process of program
  compilation.

- Classical compilers categorize IRs into three types based on their
  structure: linear IR, graphical IR, and hybrid IR.

- The demands imposed by machine learning frameworks necessitate new
  forms of IRs, as classical IRs fail to fully satisfy these
  requirements. Therefore, innovative IRs that are more compatible
  with these frameworks must be developed based on classical IRs.

- The central principle of automatic differentiation is the
  decomposition of a program's arithmetic operations into a finite set
  of basic operations. Knowing the derivative evaluation rules for all
  these operations allows the derivative of each basic operation to be
  calculated. These results are then combined using the chain rule to
  obtain the derivative of the entire program.

- Automatic differentiation operates in two modes, forward mode and
  reverse mode, depending on the order in which the chain rule
  combines derivatives.

- Forward-mode automatic differentiation is applied when evaluating
  the derivative of a network whose input dimension is smaller than
  its output dimension. In contrast, reverse-mode automatic
  differentiation is employed when the output dimension of a network
  is smaller than its input dimension.

- Implementation methods for automatic differentiation encompass
  elemental libraries, operator overloading, and source
  transformation.

- Type systems, which are utilized to define various types, detail the
  operations of each type and outline the interactions among types.
  Comprising a set of types and the type-based rules that delineate
  program behavior, type systems are extensively used in compilers,
  interpreters, and static checking tools.

- Static analysis involves the inspection and verification of code
  through lexical analysis, syntactic analysis, control flow analysis,
  and data flow analysis, all of which are conducted without executing
  the programs.

- The objective of compilation optimization is to boost the efficiency
  of the IRs generated during the compilation process. Notably,
  compilation optimization conducted at the frontend is
  hardware-agnostic.

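The operator-overloading implementation method listed above can be sketched with dual numbers for forward-mode automatic differentiation: each value carries a (value, derivative) pair, and each overloaded operator applies the corresponding derivative rule. This is a minimal illustration, not any framework's actual implementation.

```python
class Dual:
    """A number paired with its derivative with respect to one input."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = self._lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (u * v)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def f(x):
    return x * x + x  # f'(x) = 2x + 1

x = Dual(3.0, 1.0)  # seed the input derivative with 1
y = f(x)
print(y.val, y.dot)  # 12.0 7.0
```

One forward pass computes both the value and the derivative, which is why forward mode suits functions whose input dimension is small, as noted above.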
# Type Systems and Static Analysis

In compiler frontends, type systems and static analysis play
instrumental roles in bolstering the compiler's abstraction
capabilities while mitigating potential errors that may arise at
program runtime. This section delves into the basic principles,
functionalities, and representative examples of type systems and
static analysis.

## Type Systems

In the context of programming languages, 'types' represent certain
attributes, which could be numerical values, expressions, or
functions. Type systems, which define these varied types, also
determine the operations applicable to each type and orchestrate the
interactions among types. Essentially, a type system comprises a set
of types and type-oriented rules that dictate the behavior of a
program. Type systems find extensive application in compilers,
interpreters, and static checking tools, offering the following
capabilities:

1. **Precision**: Type systems in compilers deploy type checking to
   detect potential runtime errors, thus enhancing runtime safety.
   Leveraging type inference and type checking, the compiler can
   identify the majority of type-associated exceptions and errors,
   thereby averting runtime errors such as those triggered by program
   exceptions. This also ensures memory safety and thwarts invalid
   computations and semantic logic errors between types.

2. **Optimization**: The information obtained from static type
   checking enables the compiler to emit more efficient instructions,
   thereby reducing running time.

3. **Abstraction**: A type system, when employed with adept
   abstraction, can significantly boost system performance while the
   system remains secure. Such streamlined abstraction allows
   developers to concentrate their efforts on high-level design.

4. **Readability**: The use of explicit type declarations improves
   code readability, enabling readers to grasp the program code more
   effectively.

Machine learning frameworks frequently use Python, a dynamically and
strongly typed language, as the frontend language for describing
neural network model structures. Python's simplicity and ease of
development have earned it popularity, despite its slower execution
due to its interpreted execution mode.

While Python offers users dynamic and flexible semantics at the
frontend, the backend framework demands static and strongly typed IRs
that are optimization-friendly, in order to generate efficient backend
code. To transform Python frontend representations into equivalent
static and strongly typed IRs, we require an effective and trustworthy
static analysis method to enhance both development and execution
efficiency.

A notable example is the Hindley-Milner (HM) type system, a type
system for the simply typed lambda calculus with parametric
polymorphism. Initially proposed by J. Roger Hindley, the HM type
system was subsequently expanded and validated by Robin Milner. Later,
Luis Damas conducted a comprehensive formal analysis and proof of the
system, extending it to support polymorphic references. The HM type
system is designed to infer the type of any expression automatically,
without requiring type annotations. It employs a versatile algorithm
to represent expressions using simple symbols and to infer clear and
intuitive definitions. It is widely used for type inference and type
checking in the design of programming languages such as Haskell and
OCaml.

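A toy illustration of HM-style inference: unification over type variables and function types for a tiny expression language (integer literals, variables, lambdas, application). Let-polymorphism, generalization, and the occurs check are omitted, and all names are our own; this is a sketch, not a full HM implementation.

```python
class TypeVar:
    """An unknown type, to be resolved by unification."""
    _n = 0
    def __init__(self):
        TypeVar._n += 1
        self.name = f"t{TypeVar._n}"

class Fun:
    """A function type arg -> res."""
    def __init__(self, arg, res):
        self.arg, self.res = arg, res

INT = "int"

def prune(t, subst):
    # Follow substitution chains to the representative type.
    while isinstance(t, TypeVar) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    a, b = prune(a, subst), prune(b, subst)
    if isinstance(a, TypeVar):
        subst[a] = b
    elif isinstance(b, TypeVar):
        subst[b] = a
    elif isinstance(a, Fun) and isinstance(b, Fun):
        unify(a.arg, b.arg, subst)
        unify(a.res, b.res, subst)
    elif a != b:
        raise TypeError(f"cannot unify {a} and {b}")

def infer(expr, env, subst):
    kind = expr[0]
    if kind == "lit":                     # ("lit", 3)
        return INT
    if kind == "var":                     # ("var", "x")
        return env[expr[1]]
    if kind == "lam":                     # ("lam", "x", body)
        tv = TypeVar()
        body_t = infer(expr[2], {**env, expr[1]: tv}, subst)
        return Fun(tv, body_t)
    if kind == "app":                     # ("app", f, arg)
        f_t = infer(expr[1], env, subst)
        arg_t = infer(expr[2], env, subst)
        res = TypeVar()
        unify(f_t, Fun(arg_t, res), subst)
        return prune(res, subst)

subst = {}
# (lambda x: x) applied to 1 is inferred as int, with no annotations given.
t = infer(("app", ("lam", "x", ("var", "x")), ("lit", 1)), {}, subst)
print(t)  # int
```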
## Static Analysis

Once a type system has been established, we must construct a static
analysis system that allows the compiler to perform static checking
and analysis of IRs. Initially, the syntax parser deciphers the
program code and forms an abstract syntax tree from the resulting
data, from which the corresponding IR is generated. As this IR lacks
the abstract information stipulated by the type system, a static
analysis module is needed to process and scrutinize the IR. This paves
the way for a statically and strongly typed IR, which is indispensable
for subsequent steps such as compilation optimization, automatic
parallelization, and automatic differentiation. While compiling
program code, the frontend compiler may execute static analysis
several times. In certain frameworks, the decision to terminate
compilation optimization is based on the outcome of static analysis.

The static analysis module is responsible for executing operations
like type inference and generic specialization on IRs, utilizing
abstract interpretation. These processes involve the following
operations:

1. **Abstract Interpretation**: An abstract interpreter creates a
   generalized abstraction of a language's semantics, retaining only
   the attributes needed for subsequent optimization, and carries out
   interpretive execution on the ambiguous aspects. Abstract values
   typically include attributes such as the types and dimensions of
   variables.

2. **Type Inference**: Based on abstract interpretation, the compiler
   infers the abstract types of variables and expressions within the
   program code. This process is integral to subsequent compilation
   optimizations that hinge on type information.

3. **Generic Specialization**: During the compilation phase, the
   compiler carries out type inference, a necessary precursor for
   generic specialization, to determine the type of function to be
   invoked. The compiler then conducts type replacement (provided it
   can supply the type context), generating a distinct function method
   for each type through generic specialization.

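Point 1 above can be made concrete with a toy abstract interpreter that propagates only abstract values, here (dtype, shape) pairs, through a small list of operations instead of executing them on concrete tensors. The op encoding and shape rules are illustrative, not taken from any real framework:

```python
def abstract_eval(ops, env):
    """env maps names to (dtype, shape); each op is (out, opname, in1, in2).
    Shape/type errors are caught statically, before any real execution."""
    for out, opname, a, b in ops:
        (dta, sha), (dtb, shb) = env[a], env[b]
        if opname == "add":
            assert dta == dtb and sha == shb, "add: operands must match"
            env[out] = (dta, sha)
        elif opname == "matmul":
            assert dta == dtb and sha[1] == shb[0], "matmul: inner dims"
            env[out] = (dta, (sha[0], shb[1]))
        else:
            raise ValueError(f"unknown op {opname}")
    return env

env = {"x": ("f32", (2, 3)), "w": ("f32", (3, 4)), "b": ("f32", (2, 4))}
env = abstract_eval([("y", "matmul", "x", "w"), ("z", "add", "y", "b")], env)
print(env["z"])  # ('f32', (2, 4))
```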
To illustrate the implementation of the static analysis module,
consider the MindSpore framework. MindSpore employs abstract
interpretation to perform interpretive execution on uncertain abstract
semantics, thereby acquiring abstract values. The abstract values of
the nodes in a function graph represent the anticipated static program
information. In this abstract interpretation method, interpretive
execution commences at the entry point of a top-level function graph
in MindIR, followed by topological sorting of all nodes in the
function graph and recursive inference of the abstract value of each
node based on its semantics. If function subgraphs are involved,
interpretive execution is carried out within each subgraph
recursively. The outcome of this process is the abstract value of the
top-level function's output node. The static analysis module in
MindSpore consists of several components, including the abstract
domain module, cache module, semantics inference module, and control
flow processing module, as illustrated in
Figure :numref:`ch04/ch04-compiler-frontend`.


:label:`ch04/ch04-compiler-frontend`
# Machine Learning Applications

In general terms, machine learning is a technology that learns useful
knowledge from data. There are a variety of machine learning methods,
including supervised learning, unsupervised learning, and
reinforcement learning.

1. In supervised learning, the mapping relationships between inputs
   and outputs are made known to machines. For example, a discrete
   label can be assigned to each input image.

2. In unsupervised learning, input data is provided to machines
   without any labels assigned. For example, to distinguish cats and
   dogs among a group of images, a machine needs to learn by itself
   the characteristics of cats and dogs in order to classify them.
   This kind of unsupervised classification is also called clustering.

3. In reinforcement learning, an algorithm that runs on the machine
   automatically improves itself to achieve the task objective in a
   given learning environment. A well-known example of this is
   AlphaGo, in which the rules of Go serve as the learning environment
   and the victory score is set as the task objective.

Machine learning is applied in a variety of fields: computer vision,
natural language processing (NLP), and intelligent decision-making, to
name just a few. Computer vision, in a narrow sense, includes all
image-based applications, such as facial recognition, object
recognition, target tracking, human pose estimation, and image
understanding. It is widely used in autonomous driving, smart city,
smart security, and other scenarios.

NLP involves both text- and speech-related applications, including
language translation, text-to-speech and speech-to-text conversion,
text understanding, and image style transfer. NLP and computer vision
overlap in many aspects. For instance, in order to generate text
descriptions for images, or to generate or process images based on
text, machines need to handle both language and image data.

Intelligent decision-making is usually achieved through technical
means such as computer vision, NLP, reinforcement learning, and
cybernetics. It is widely used in many scenarios, such as robotics,
autonomous driving, games, recommender systems, smart factories, and
smart grids.

These machine learning applications use different underlying
algorithms, such as support vector machine (SVM), logistic regression,
and naive Bayes, based on the needs and characteristics of the
applications. In recent years, deep learning has progressed
significantly thanks to the availability of massive data, the
development of neural network algorithms, and the maturity of hardware
accelerators. But despite the wide variety of machine learning
algorithms, the vast majority of computation still relies on vector
and matrix operations, regardless of whether classical or deep
learning algorithms are employed. In this book, we therefore discuss
machine learning systems that employ neural networks.
72  v1/en_chapters/chapter_introduction/architecture.md  Normal file
@@ -0,0 +1,72 @@
# Machine Learning Framework Architecture

Figure :numref:`intro/framework-architecture` shows the basic
architecture of a typical, complete machine learning framework.

![Basic architecture of a machine
:label:`intro/framework-architecture`

1. **Programming interfaces:** A machine learning framework needs to
   provide programming interfaces, usually those of high-level
   programming languages (like Python), to cater for the diversified
   backgrounds of machine learning developers. At the same time, the
   framework also needs to support a system implementation based mainly
   on low-level programming languages (e.g., C and C++) so that operating
   system features (e.g., thread management and network communication)
   and various hardware accelerators can be utilized efficiently for
   optimized performance.

2. **Computational graph:** Machine learning applications, though
   implemented through different programming interfaces, need to share
   the same backend when they run. The computational graph technology is
   key to realizing this backend. A computational graph, which defines a
   user's machine learning application, includes many graph nodes that
   represent computational operations. These nodes are connected by
   edges, which represent computational dependencies.

3. **Compiler frontend:** Once a computational graph is built, the
   machine learning framework analyzes and optimizes it (or the
   corresponding application) through the compiler frontend. The compiler
   frontend provides key functions such as intermediate representation,
   automatic differentiation, type derivation, and static analysis.

4. **Compiler backend and runtime:** After analyzing and optimizing the
   computational graph, the machine learning framework uses the compiler
   backend and runtime to optimize for different types of underlying
   hardware. In addition to optimizing the selection or scheduling
   sequence of operators, common optimization technologies usually
   analyze the L2/L3 cache size and the instruction pipeline length to
   match hardware specifications.

5. **Heterogeneous processors:** A machine learning application is
   co-executed by central processing units (CPUs) and hardware
   accelerators (such as NVIDIA GPUs, Huawei Ascend processors, and
   Google TPUs). During the execution, non-matrix operations (e.g.,
   complex data preprocessing and computational graph scheduling) are
   handled by CPUs, whereas matrix operations and certain frequently used
   machine learning operators (e.g., Transformer operators and
   convolution operators) are performed by hardware accelerators.

6. **Data processing:** A machine learning application needs to perform
   complex preprocessing on raw data and manage a large number of
   training, validation, and test datasets. The data processing module
   (e.g., the tf.data module of TensorFlow, or the DataLoader module of
   PyTorch) is responsible for such data-centered operations.

7. **Model deployment:** In addition to model training, model deployment
   is another key function needed in a machine learning framework. Model
   compression technologies --- such as model conversion, quantization,
   and distillation --- enable us to run models on hardware with limited
   memory. It is also necessary to optimize model operators for specific
   hardware inference platforms (e.g., NVIDIA Orin). Furthermore, in
   order to ensure the security of a model (e.g., to deny unauthorized
   user reads), model obfuscation must be considered in the framework's
   design.

8. **Distributed training:** A machine learning model is usually trained
   in parallel on distributed compute nodes. Common parallel training
   methods include data parallelism, model parallelism, hybrid
   parallelism, and pipeline parallelism, all of which are usually
   implemented through remote procedure calls (RPC), collective
   communication, or parameter servers.
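The computational graph described in item 2 --- nodes as operations, edges as dependencies --- can be sketched minimally. The `Node` class below is a hypothetical illustration, not the API of any framework mentioned in this chapter:

```python
class Node:
    """A node in a computational graph: an operation plus its input edges."""

    def __init__(self, op, inputs=()):
        self.op = op          # the computational operation at this node
        self.inputs = inputs  # edges: nodes this node depends on

    def evaluate(self):
        # Edges encode dependencies: inputs are evaluated before this node.
        return self.op(*(n.evaluate() for n in self.inputs))


# Build y = (a + b) * c as a tiny graph of five nodes.
a = Node(lambda: 2.0)
b = Node(lambda: 3.0)
c = Node(lambda: 4.0)
add = Node(lambda x, y: x + y, (a, b))
mul = Node(lambda x, y: x * y, (add, c))
```

A real framework would additionally record shapes and gradients on each node, which is what enables the compiler frontend's automatic differentiation and static analysis.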
86  v1/en_chapters/chapter_introduction/design.md  Normal file
@@ -0,0 +1,86 @@
# Design Objectives of Machine Learning Frameworks

*Machine learning frameworks* (e.g., TensorFlow, PyTorch, and MindSpore)
were designed and implemented so that machine learning algorithms could
be developed efficiently for different applications. In a broad sense,
these frameworks achieved the following common design objectives.

1. **Neural network programming:** The huge success of deep learning has
   solidified neural networks as the core of many machine learning
   applications. People need to customize neural networks to meet their
   specific application requirements --- such customization typically
   results in the creation of convolutional neural networks (CNNs) and
   self-attention neural networks. In order to develop, train, and deploy
   these networks, we need generic system software.

2. **Automatic differentiation:** The training of neural networks
   involves continuously computing gradients through the combined use of
   training data, data annotation, and a loss function to iteratively
   improve model parameters. Computing gradients manually is a complex
   and time-consuming task. Consequently, a machine learning framework is
   expected to compute gradients automatically based on a neural network
   application provided by developers. This computation process is called
   automatic differentiation.

3. **Data management and processing:** Data is the key to machine
   learning. There are several types of data, including training,
   validation, and test datasets, as well as model parameters. A machine
   learning system should be able to read, store, and preprocess (data
   augmentation and cleansing are examples of preprocessing) these types
   of data by itself.

4. **Model training and deployment:** A machine learning model is
   expected to provide optimal performance. In order to achieve this, we
   need to use an optimization method --- for example, mini-batch
   stochastic gradient descent (SGD) --- to repeatedly compute gradients
   through multi-step iteration. This process is called training. Once
   the training is complete, we can then deploy the trained model to the
   inference device.

5. **Hardware accelerators:** Many core operations in machine learning
   can be viewed as matrix computation. To accelerate such computation,
   machine learning developers leverage many specially designed hardware
   components referred to as hardware accelerators or AI chips.

6. **Distributed training:** As the volume of training data and the
   number of neural network parameters increase, the amount of memory
   used by a machine learning system far exceeds what a single machine
   can provide. Therefore, a machine learning framework should be able to
   train models on distributed machines.

Early attempts by developers to design such a framework employed
traditional methods such as *neural network libraries* (e.g., Theano and
Caffe) and *data processing frameworks* (e.g., Apache Spark and Google's
Pregel), but the results were disappointing. At that time, neural network
libraries lacked the ability to manage and process large datasets, deploy
models, or perform distributed model execution, meaning they were not
adequate for developing today's product-level machine learning
applications even though they supported neural network development,
automatic differentiation, and hardware accelerators. Furthermore,
data-parallel computing frameworks were not suitable for developing
neural network--centered machine learning applications because they
lacked support for neural networks, automatic differentiation, and
accelerators, although such frameworks were already mature in supporting
distributed running and data management.

These drawbacks led many enterprise developers and university researchers
to design and implement their own software frameworks for machine
learning from scratch. In only a few short years, numerous machine
learning frameworks emerged --- well-known examples include TensorFlow,
PyTorch, MindSpore, MXNet, PaddlePaddle, OneFlow, and CNTK. These
frameworks boosted the development of AI significantly in both upstream
and downstream industries. Table :numref:`intro-comparison` lists the
differences between machine learning frameworks and other related
systems.

:Differences between machine learning frameworks and related systems

|Design Method | Neural Network | Automatic Differentiation | Data Management | Training and Deployment | Accelerator | Distributed Training |
|----------------------------|----------------|----------------------------|-------------------|---------------------------|---------------|----------------------|
|Neural network libraries | Yes | Yes | No | No | Yes | No |
|Data processing frameworks | No | No | Yes | No | No | Yes |
|Machine learning frameworks | Yes | Yes | Yes | Yes | Yes | Yes |

:label:intro-comparison
78  v1/en_chapters/chapter_introduction/ecosystem.md  Normal file
@@ -0,0 +1,78 @@
# Application Scenarios of Machine Learning Systems

A machine learning framework is commonly utilized in diverse scenarios,
giving rise to a range of *machine learning systems*. In a broader
context, a machine learning system is a collective term encompassing a
variety of software and hardware systems that facilitate and execute
machine learning applications. Figure :numref:`intro/system-ecosystem`
provides an overview of the various application scenarios for machine
learning systems.

![Overview of machine learning syste
:label:`intro/system-ecosystem`

1. **Federated learning:** Laws and regulations on user privacy
   protection and data protection prevent many machine learning
   applications from accessing user data directly for model training
   purposes. This is where federated learning --- based on a machine
   learning framework --- benefits such applications.

2. **Recommender system:** Incorporating machine learning (especially
   deep learning) into recommender systems has achieved major success
   over the past few years. Compared with traditional rule-based
   recommender systems, those based on deep learning can analyze massive
   user feature data more effectively, thereby bringing huge improvements
   to the accuracy and timeliness of recommendations.

3. **Reinforcement learning:** Because reinforcement learning is special
   in terms of the way it collects data and trains models, it is
   necessary to develop dedicated reinforcement learning systems based on
   a machine learning framework.

4. **Explainable AI:** As machine learning becomes more and more popular
   in many key areas, including finance, healthcare, and governmental
   affairs, developing explainable AI systems based on a machine learning
   framework is gaining wider attention.

5. **Robotics:** Robotics is another area where the use of machine
   learning frameworks is gaining popularity. Compared with traditional
   robot vision methods, machine learning methods have achieved enormous
   success in several robot tasks, such as automatic feature extraction,
   target recognition, and path planning.

6. **Graph learning:** Graphs are the most widely used data structure
   and are used to express large volumes of Internet data, for instance,
   social network graphs and product relationship graphs. Machine
   learning algorithms have been proven effective for analyzing
   large-scale graph data. A machine learning system designed to process
   graph data is referred to as a graph learning system.

7. **Scientific computing:** Scientific computing covers a wide range of
   traditional fields (such as electromagnetic simulation, graphics, and
   weather forecasting), in which many large-scale problems can be
   effectively solved by machine learning methods. Therefore, developing
   special machine learning systems for scientific computing is becoming
   an increasingly common practice.

8. **Scheduling of a machine learning cluster:** A machine learning
   cluster consists of heterogeneous processors, heterogeneous networks,
   and even heterogeneous storage devices. But in a machine learning
   cluster, computing tasks often have common characteristics during
   their execution (e.g., iterative execution based on the collective
   communication operator AllReduce). To account for both the cluster's
   device heterogeneity and the common characteristics of task execution,
   a machine learning cluster is often designed to use a special
   scheduling method.

9. **Quantum computing:** Quantum computers are generally realized
   through a hybrid architecture, in which quantum computing is performed
   by quantum computers and the simulation of quantum computers is
   performed by classical computers. Many simulation systems (such as
   TensorFlow Quantum and MindQuantum) are realized on the basis of a
   machine learning framework because the simulation often requires
   massive matrix computations and gradient computation.

There are too many machine learning systems for this book to cover them
all in depth. Instead, we aim to provide a system designer's perspective
on several core systems used in federated learning, recommenders,
reinforcement learning, explainable AI, and robotics.
18  v1/en_chapters/chapter_introduction/index.md  Normal file
@@ -0,0 +1,18 @@
# Introduction

This chapter aims to provide readers with a comprehensive understanding
of machine learning systems by describing the applications of machine
learning and summarizing the design objectives and basic composition
principles of such systems.

```toc
:maxdepth: 2

Machine_Learning_Applications
Design_Objectives_of_Machine_Learning_Frameworks
Machine_Learning_Framework_Architecture
Application_Scenarios_of_Machine_Learning_Systems
Book_Organization_and_Intended_Audience
```
33  v1/en_chapters/chapter_introduction/readers.md  Normal file
@@ -0,0 +1,33 @@
# Book Organization and Intended Audience

This book adopts a level-by-level approach to discuss design principles
and implementation practices of machine learning systems. The
**Framework Design** part starts by introducing key concepts that
framework users need to understand, including programming interface
design and computational graphs. This part then describes the frontend
and backend techniques used in AI compilers as well as key techniques for
processing data, deploying models, and distributing training to multiple
machines. The **Application Scenarios** part elaborates on several
important types of machine learning systems, such as federated learning
and recommender systems, in an attempt to provide readers with useful
knowledge for both deploying and operating machine learning frameworks in
different application scenarios.

This book is intended for the following readers:

1. **Students:** This book provides a wealth of design principles and
   hands-on experience of machine learning systems. Such knowledge will
   help students better understand the theoretical pros and cons and
   practical challenges of machine learning algorithms.

2. **Researchers:** This book aims to help researchers tackle various
   challenges in machine learning implementation and guide them through
   the design of next-generation machine learning algorithms meant to
   solve large-scale practical problems.

3. **Developers:** We also hope this book will allow developers to gain a
   profound understanding of the internal architecture of a machine
   learning system. Such knowledge will move them a step further in
   developing new functions for their applications, debugging system
   performance issues, and even customizing a machine learning system
   based on their business needs.
36  v1/en_chapters/chapter_model_deployment/index.md  Normal file
@@ -0,0 +1,36 @@
# Model Deployment {#ch:deploy}

In earlier chapters, we discussed the basic components of the machine
learning model training system. In this chapter, we look at the basics of
model deployment, a process whereby a trained model is deployed in a
runtime environment for inference. We explore the conversion from a
training model into an inference model, model compression methods that
adapt to hardware restrictions, model inference and performance
optimization, and model security protection.

The key aspects this chapter explores are as follows:

1. Conversion and optimization from a training model to an inference
   model.

2. Common methods for model compression: quantization, sparsification,
   and knowledge distillation.

3. Model inference process and common methods for performance
   optimization.

4. Common methods for model security protection.

```toc
:maxdepth: 2

Overview
Conversion_to_Inference_Model_and_Model_Optimization
Model_Compression
Advanced_Efficient_Techniques
Model_Inference
Security_Protection_of_Models
Chapter_Summary
Further_Reading
```
369  v1/en_chapters/chapter_model_deployment/model_compression.md  Normal file
@@ -0,0 +1,369 @@
# Model Compression
:label:`ch-deploy/model-compression`

The previous section briefly described the purpose of model conversion
and focused on some common model optimization methods for model
deployment. Hardware restrictions differ depending on where models are
deployed. For instance, smartphones are more sensitive to the model size,
usually supporting models only at the MB level. Larger models need to be
compressed before they can be deployed on different computing hardware.

## Quantization

Model quantization is a technique that approximates continuous
floating-point weights (usually float32) with a limited number of
discrete values (usually int8), at the cost of a slight reduction in
accuracy. As shown in Figure :numref:`ch-deploy/quant-minmax`, $T$
represents the data range before quantization. In order to reduce the
model size, model quantization represents floating-point data with fewer
bits. As such, the memory usage during inference can be reduced, and
inference can be accelerated on processors that are good at processing
low-precision operations.

![Data range before and after quant
:label:`ch-deploy/quant-minmax`

The number of bits and the range of data represented by different data
types in a computer are different. Based on service requirements, a model
may be quantized to different bit widths. Generally, single-precision
floating-point numbers are used to represent a deep neural network. If
8-bit signed integers can be used to approximate the parameters, the size
of the quantized weights may be reduced to a quarter of the original
size. Using fewer bits to quantize a model results in a higher
compression rate --- 8-bit quantization is the most commonly used in
industry. The lower limit is 1-bit quantization, which can compress a
model to 1/32 of its original size; during inference, efficient XNOR and
BitCount bit-wise operations can then be used to accelerate computation.

According to the uniformity of the original ranges represented by the
quantized data, quantization can be further divided into linear
quantization and non-linear quantization. Because the weights and
activations of a deep neural network are usually not uniformly
distributed in practice, non-linear quantization can theoretically
achieve a smaller loss of accuracy. In real-world inference, however,
non-linear quantization typically involves higher computation complexity,
meaning that linear quantization is more commonly used. The following
therefore focuses on the principles of linear quantization.

In Equation :eqref:`ch-deploy/quantization-q`, assume that $r$ represents
the floating-point number before quantization. We are then able to obtain
the integer $q$ after quantization.

$$q=clip(round(\frac{r}{s}+z),q_{min},q_{max})$$
:eqlabel:`ch-deploy/quantization-q`

$clip(\cdot)$ and $round(\cdot)$ indicate the truncation and rounding
operations, and $q_{min}$ and $q_{max}$ indicate the minimum and maximum
values after quantization, respectively. $s$ is the quantization
interval, and $z$ is the bias representing the data offset. The
quantization is symmetric if the bias ($z$) is 0, or asymmetric
otherwise. Symmetric quantization reduces the computation complexity
during inference because it avoids computation related to $z$. In
contrast, asymmetric quantization determines the minimum and maximum
values based on the actual data distribution, so the information capacity
of the quantized data is used more effectively. As such, asymmetric
quantization reduces the loss of accuracy caused by quantization.
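The equation above can be sketched in NumPy. This is a minimal illustration of asymmetric int8 linear quantization, with $s$ and $z$ derived from an observed min/max range; the function names are ours, not a library API:

```python
import numpy as np

def quantize(r, s, z, q_min=-128, q_max=127):
    # q = clip(round(r / s + z), q_min, q_max)
    return np.clip(np.round(r / s + z), q_min, q_max).astype(np.int8)

def dequantize(q, s, z):
    # Inverse mapping: r ≈ s * (q - z)
    return s * (q.astype(np.float32) - z)

# Derive s and z from the observed float range (asymmetric quantization).
r = np.array([-1.0, -0.5, 0.0, 0.4, 1.5], dtype=np.float32)
q_min, q_max = -128, 127
s = (float(r.max()) - float(r.min())) / (q_max - q_min)
z = round(q_min - float(r.min()) / s)
q = quantize(r, s, z)
r_hat = dequantize(q, s, z)  # round-trip error is at most s/2
```

Setting $z=0$ and a symmetric range $[-t, t]$ instead would give the symmetric variant, which drops the offset term at inference time.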
According to the shared range of the quantization parameters $s$ and $z$,
quantization methods may be classified into layer-wise quantization and
channel-wise quantization. In the former, separate quantization
parameters are defined for each layer, whereas the latter defines
separate quantization parameters for each channel. Finer-grained
channel-wise quantization yields higher quantization precision but
increases the computation complexity.

Model quantization can also be classified into quantization aware
training (QAT) and post-training quantization (PTQ) based on whether
training is involved. In QAT, fake-quantization operators are added, and
statistics on the input and output ranges before and after quantization
are collected during training to improve the accuracy of the quantized
model. This method is therefore suitable for scenarios that place strict
requirements on accuracy. In PTQ, models are directly quantized after
training, requiring only a small amount of calibration data. This method
is therefore suitable for scenarios that place strict requirements on
usability and have limited training resources.

**1. Quantization aware training**

QAT simulates quantization during training by including the accuracy loss
introduced by fake-quantization operators. In this way, the optimizer can
minimize the quantization error during training, leading to higher model
accuracy. QAT involves the following steps:

1. Initialization: Set initial values for the $q_{min}$/$q_{max}$ ranges
   of weights and activations.

2. Building a network for simulated quantization: Insert
   fake-quantization operators after weights and activations that require
   quantization.

3. Running QAT: Compute the range (i.e., $q_{min}$ and $q_{max}$) for
   each weight and activation of the quantized network layer. Then,
   perform forward computation with the quantization loss considered, so
   that the loss can be involved in subsequent backpropagation and
   network parameter updates.

4. Exporting the quantized network: Obtain $q_{min}$ and $q_{max}$, and
   compute the quantization parameters $s$ and $z$. Substitute the
   quantization parameters into the quantization formula to transform the
   network weights into quantized integer values. Then, delete the
   fake-quantization operators, and add quantization and dequantization
   operators before and after the quantized network layer, respectively.
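Steps 2 and 3 above hinge on the fake-quantization operator: it quantizes and immediately dequantizes a tensor, so the forward pass stays in floating point but carries the quantization error into the loss. A minimal sketch follows (our own helper, not a framework operator; a real QAT setup would additionally route gradients through it with a straight-through estimator):

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Simulate asymmetric quantization in float: quantize, then dequantize."""
    q_min, q_max = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    s = max(x_max - x_min, 1e-8) / (q_max - q_min)   # quantization interval
    z = round(q_min - x_min / s)                      # zero-point offset
    q = np.clip(np.round(x / s + z), q_min, q_max)
    return s * (q - z)  # float tensor carrying the rounding error

# The forward pass sees values snapped to the quantization grid.
x = np.linspace(-1.0, 1.0, 13)
x_fq = fake_quantize(x)  # per-element error is at most s/2
```

Because `fake_quantize` returns float values, it can be inserted after any weight or activation without changing the rest of the network, which is exactly what step 2 requires.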
**2. Post-training quantization**

PTQ can be divided into two types: weight quantization and full
quantization. Weight quantization quantizes only the weights of a model
to compress its size; the weights are then dequantized to the original
float32 format during inference, and the subsequent inference process is
the same as that of a common float32 model. The advantages of weight
quantization are that a calibration dataset and quantized operators are
not required, and that the accuracy loss is small. However, it does not
improve inference performance, because the operators used during
inference still run in float32. Full quantization quantizes both the
weights and activations of a model, and the quantized operators are
executed to accelerate model inference. The quantization of activations
requires a small calibration dataset (training data or real-scenario
inputs) to collect the distribution of the activations at each layer and
calibrate the quantized operators. The calibration dataset is used as the
input during the quantization of activations. After inference, the
distribution of activations at each layer is collected to obtain the
quantization parameters. The process is summarized as follows:

1. Use a histogram to represent the distribution $P_f$ of the original
   float32 data.

2. Select several $q_{min}$ and $q_{max}$ values from a given search
   space, quantize the activations, and obtain the quantized data $Q_q$.

3. Use a histogram to represent the distribution of $Q_q$.

4. Compute the distribution difference between $Q_q$ and $P_f$, and find
   the $q_{min}$ and $q_{max}$ values corresponding to the smallest
   difference in order to compute the quantization parameters. Common
   indicators used to measure distribution differences include symmetric
   Kullback-Leibler divergence and Jensen-Shannon divergence.
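Steps 1-4 amount to a threshold search: for each candidate clip range, quantize the calibration activations, histogram the result, and keep the range whose distribution is closest to the float distribution. A simplified sketch using symmetric 8-bit quantization and symmetric KL divergence (the function names and the candidate grid are our own choices, not a standard API):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """Symmetrized KL divergence between two (unnormalized) histograms."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return kl(p, q) + kl(q, p)

def search_threshold(x, num_bits=8, num_bins=128, candidates=20):
    """Pick the symmetric clip threshold whose quantized histogram Q_q
    is closest to the float histogram P_f (steps 1-4 above)."""
    t_max = float(np.abs(x).max())
    levels = 2 ** (num_bits - 1) - 1          # 127 for int8
    bins = np.linspace(-t_max, t_max, num_bins + 1)
    p_f, _ = np.histogram(x, bins=bins)        # step 1: float distribution
    best_t, best_d = t_max, float("inf")
    for t in np.linspace(t_max / candidates, t_max, candidates):
        s = t / levels
        # step 2: quantize with clip range [-t, t], then dequantize
        x_q = np.clip(np.round(x / s), -levels, levels) * s
        q_q, _ = np.histogram(x_q, bins=bins)  # step 3: quantized distribution
        d = kl_divergence(p_f.astype(np.float64), q_q.astype(np.float64))
        if d < best_d:                         # step 4: keep the closest range
            best_t, best_d = t, d
    return best_t

acts = np.random.RandomState(0).normal(size=10_000)  # stand-in calibration data
t_best = search_threshold(acts)
```

For bell-shaped activation distributions, the selected threshold is typically smaller than the observed maximum: clipping rare outliers buys finer resolution for the bulk of the values.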
In addition, the inherent error of quantization requires calibration
during quantization. Take the matrix multiplication
$a=\sum_{i=1}^Nw_ix_i+b$ as an example, where $w$ denotes the weight, $x$
the activation, and $b$ the bias. To overcome the quantization error, we
first calibrate the quantized mean value: we obtain the mean value of
each channel output by the float32 operator and by the quantized
operator. Assume that the mean value output by the float32 operator of
channel $i$ is $a_i$, and that output by the quantized operator after
dequantization is $a_{qi}$. We can then obtain the final mean value by
adding the mean difference $a_i-a_{qi}$ of the two to the corresponding
channel, so that the final mean value is consistent with that output by
the float32 operator. We also need to ensure that the distribution after
quantization is the same as that before quantization. Assume that the
mean value and deviation norm of the weight of channel $c$ are $E(w_c)$
and $||w_c-E(w_c)||$, and that the corresponding values after
quantization are $E(\hat{w_c})$ and $||\hat{w_c}-E(\hat{w_c})||$,
respectively. Equation :eqref:`ch-deploy/post-quantization` is the
calibration of the weight:

$$
\begin{aligned}
\hat{w_c}\leftarrow\zeta_c(\hat{w_c}+u_c) \\
u_c=E(w_c)-E(\hat{w_c}) \\
\zeta_c=\frac{||w_c-E(w_c)||}{||\hat{w_c}-E(\hat{w_c})||}
\end{aligned}
$$
:eqlabel:`ch-deploy/post-quantization`

As a general model compression method, quantization can significantly
improve the memory and compression efficiency of neural networks, and it
has been widely used.

## Model Sparsification

Model sparsification reduces the memory and computation overheads by
removing some components (such as weights, feature maps, and convolution
kernels) from a neural network. It is a type of strong inductive bias
introduced to reduce the computation complexity of the model, just like
weight quantization, weight sharing, and pooling.

**1. Motivation of model sparsification**

Convolution in a convolutional neural network can be considered a
weighted linear combination of the input and the weights of the
convolution kernel. In this sense, tiny weights have a relatively small
impact on the output. Model sparsification can be justified based on two
assumptions:

1. Most neural network models have over-parameterized weights. The number
   of weight parameters can reach tens or even hundreds of millions.

2. For most computer vision tasks such as detection, classification, and
   segmentation, useful information accounts for only a small proportion
   of an activation feature map generated during inference.

As such, model sparsification can be classified into two types according
to the source of sparsity: weight sparsification and activation
sparsification. Both types reduce the computation workload and model
storage requirements by reducing redundant components in a model. In
model sparsification, some weak connections are pruned based on the
absolute value of weights or activations (i.e., the weight or activation
of such connections is set to 0), with the goal of improving the model
performance. The sparsity of a model is measured by the proportion of
zero-value weights or activation tensors. Because the accuracy of a model
typically decreases as its sparsity increases, we hope to minimize such
loss when increasing the sparsity.
|
||||
Neurobiology was the inspiration for inventing neural networks --- it
|
||||
has also inspired the sparsification of neural network models.
|
||||
Neurobiologists found that most mammalian brains, including humans, have
|
||||
a process called synapse pruning, which occurs between infancy and
|
||||
adulthood. During synapse pruning, neuron axons and dendrites decay and
|
||||
die off, and the neuron connections are continuously simplified and
|
||||
reconstructed. This process allows brains to work more efficiently and
|
||||
consume less energy.
|
||||
|
||||
**2. Structured and unstructured sparsification**

Let's first look at weight sparsification, which can be classified into
structured and unstructured sparsification. Structured sparsification
involves pruning channels or convolution kernels in order to generate
regular, smaller weight matrices that are more likely to obtain speedup
on CPUs and GPUs. However, this mode is coarse-grained, meaning that it
severely reduces the model accuracy.

In contrast, unstructured sparsification allows a weight at any location
to be pruned, meaning it is a fine-grained mode that causes less loss to
the model accuracy. However, the unstructured mode limits the speedup of
sparse models on hardware for a number of reasons:

1. The irregular layout of weights requires many control flow
   instructions. For instance, the presence of zero values introduces
   many `if-else` instructions for decision-making, which inevitably
   reduces instruction-level parallelism.

2. The computation of convolution kernels is typically multi-threaded.
   However, the irregular layout of weight matrices in memory causes
   thread divergence and load imbalance, which therefore affects
   thread-level parallelism.

3. The irregular layout of weight matrices in memory hinders data
   locality and reduces the cache hit rate. Consequently, the
   load/store efficiency is reduced.
In an attempt to solve these problems, recent work combines structured
sparsification with unstructured sparsification. This approach
incorporates the advantages of both modes, and overcomes their
disadvantages to an extent.
**3. Sparsification strategies**

Given a neural network model, after deciding to sparsify the weights or
activations, we need to determine when and how to perform the
sparsification. The most common sparsification process is currently
pre-training, pruning, and fine-tuning, in which a converged dense model
obtained through training is sparsified and then fine-tuned. Because a
pre-trained model contains the knowledge it has learned, sparsifying
such a model achieves a better effect than sparsifying the initial model
directly. In addition to pruning the pre-trained model, we usually
interleave pruning with network training. Compared with one-shot
pruning, iterative pruning is integrated more closely with training, so
that redundant convolution kernels can be identified more efficiently.
As such, iterative pruning is widely used.
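A single unstructured pruning step can be sketched in NumPy as follows; this is a minimal illustration of magnitude pruning, and the function name and threshold rule are our own, not from any framework discussed here:

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out roughly the `sparsity` fraction of weights with the
    smallest absolute value (unstructured magnitude pruning)."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.array([[0.01, -0.5], [0.2, -0.03]])
pruned = magnitude_prune(w, 0.5)
print(np.mean(pruned == 0))  # fraction of zero-valued weights
```

In an iterative pruning pipeline, a step like this would alternate with fine-tuning epochs so accuracy can recover between pruning rounds.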
To illustrate how to prune a network, we will use Deep
Compression [@han2015deep] as an example. Removing most weights leads to
a loss of accuracy of the neural network, as shown in Figure
:numref:`ch-deploy/deepcomp`. Fine-tuning a pruned sparse neural
network can help improve model accuracy, and the pruned network may be
quantized to represent weights using fewer bits. In addition, using
Huffman coding can further reduce the memory cost of the deep neural
network.


:label:`ch-deploy/deepcomp`
In addition to removing redundant neurons, a dictionary learning-based
method can be used to remove unnecessary weights in a deep convolutional
neural network. By learning the bases of convolution kernels, the
original convolution kernels can be transformed into the coefficient
domain for sparsification. An example of this approach is the work by
Bagherinezhad et al. [@bagherinezhad2017lcnn], who proposed that the
original convolution kernel can be decomposed into a weighted linear
combination of convolution kernel bases and sparse coefficients.
## Knowledge Distillation

Knowledge distillation (KD), also known as the teacher-student learning
algorithm, has gained much attention in the industry. Large deep
networks tend to deliver good performance in practice, because
over-parameterization increases the generalization capability when it
comes to new data. In KD, a large pre-trained network serves as the
teacher, and a new deep and thin neural network serves as the student,
which is supervised by the teacher network. The key to this learning
algorithm is how to transfer the knowledge acquired by the teacher to
the student.
Hinton et al. [@Distill] first proposed a teacher-student learning
framework. It is used for the learning of deep and thin neural networks
by minimizing the differences between the teacher and student neural
networks. The teacher network is denoted as $\mathcal{N}_{T}$ with
parameters $\theta_T$, and the student network is denoted as
$\mathcal{N}_{S}$ with parameters $\theta_S$. In general, the student
network has fewer parameters than the teacher network.

[@Distill] proposed KD, which makes the classification result of the
student network more closely resemble the ground truth as well as the
classification result of the teacher network, as shown in Equation
:eqref:`c2Fcn:distill`.

$$\mathcal{L}_{KD}(\theta_S) = \mathcal{H}(o_S,\mathbf{y}) +\lambda\mathcal{H}(\tau(o_S),\tau(o_T)),$$
:eqlabel:`c2Fcn:distill`
where $\mathcal{H}(\cdot,\cdot)$ is the cross-entropy function, $o_S$
and $o_T$ are the outputs of the student network and the teacher
network, respectively, and $\mathbf{y}$ is the label. The first term in
Equation :eqref:`c2Fcn:distill` makes the classification result of the
student network resemble the expected ground truth, and the second term
aims to extract useful information from the teacher network and transfer
it to the student network; $\lambda$ is a weight parameter used to
balance the two objective functions, and $\tau(\cdot)$ is a softening
function that smooths the network output.
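A minimal NumPy sketch of this loss is shown below. We assume $\tau(\cdot)$ is a temperature-scaled softmax (a common choice, though the text does not fix one), and `o_s`, `o_t`, and `y` are hypothetical logit vectors and a one-hot label:

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer outputs."""
    z = x / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -np.sum(p * np.log(q + eps))

def kd_loss(o_s, o_t, y, lam=0.5, temperature=4.0):
    """Hard label term plus lambda-weighted soft (teacher) term."""
    hard = cross_entropy(y, softmax(o_s))
    soft = cross_entropy(softmax(o_t, temperature), softmax(o_s, temperature))
    return hard + lam * soft
```

With `lam=0` the loss reduces to ordinary supervised cross-entropy; increasing `lam` lets the teacher's softened distribution guide the student more strongly.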
Equation :eqref:`c2Fcn:distill` only extracts useful information from the
output of the teacher network classifier --- it does not mine
information from other intermediate layers of the teacher network.
Romero et al. [@FitNet] proposed an algorithm for transferring useful
information from any layer of a teacher network to a small student
network. Note that not all inputs are useful for convolutional neural
network computing and subsequent task execution. For example, in an
image containing an animal, it is important to classify and identify the
region where the animal is rather than the background information.
Therefore, selecting useful information from the teacher network is an
efficient approach. Zagoruyko and Komodakis [@attentionTS] proposed a
learning method based on an attention loss function to improve the
performance of the student network. This method introduces an attention
module, which generates an attention map that identifies the importance
of different areas of an input image to the classification result. The
attention map is then transferred from the teacher network to the
student network, as depicted in Figure
:numref:`ch-deploy/attentionTS`.
KD is an effective method to optimize small networks. It can be combined
with other compression methods such as pruning and quantization to train
efficient models with higher accuracy and less computation workload.

<figure id="fig:ch-deploy/attentionTS">
<div class="center">
<img src="../img/ch08/distillation.png" style="width:80.0%" />
</div>
<figcaption>Teacher-student neural network learning
algorithm</figcaption>
</figure>
# Conversion to Inference Model and Model Optimization
:label:`ch-deploy/model-optimization`

## Model Conversion

As mentioned earlier, TensorFlow, PyTorch, MindSpore, MXNet, and CNTK
define their own model data structures. This means that the inference
system needs to convert these structures to a unified one. Open Neural
Network Exchange (ONNX) is designed to implement such conversion. It
supports an extensive range of machine learning operators and converts
models from various frameworks (e.g., TensorFlow and PyTorch) into ONNX
models. Because models are structured data, the conversion process
involves converting the data structure. It starts by analyzing the
similarities and differences between two data structures. If they are
the same, data is transferred; if the structures are similar but with
slight differences, data is mapped; if the structures differ
significantly, extra semantics conversion might be required; and if they
are totally incompatible, the conversion will fail. ONNX features strong
expressive power, meaning that it can convert models from most
frameworks in the industry to compatible ONNX models. If a model is
abstracted as a graph, its data structure can be defined as follows:
1. **Topological expression of model:** The topological connections of
   a model are represented as edges in a graph. From the perspective of
   a model, these edges define the data flows and control flows in the
   model. Based on such definitions, we can extend to the expressions
   of the subgraphs, model inputs and outputs, and control flow
   structures. For example, the control flow on TensorFlow 1.x is
   expressed as a cyclic graph. To prevent the formation of cycles,
   TensorFlow 1.x uses operators such as Enter, Exit, Switch, LoopCond,
   and NextIteration, whereas ONNX uses operators such as Loop and If.
   As such, when converting a TensorFlow 1.x control flow model into an
   ONNX model, the control flow graph structure in the TensorFlow model
   must be merged into a Loop or If operator on ONNX.
2. **Operator prototype definition:** Operators can be regarded as data
   processing or control flow nodes in a model or as vertices in a
   graph. An operator prototype defines the type, inputs, outputs, and
   attributes of an operator. For instance, Slice has different
   semantics on Caffe and ONNX. To convert a Caffe model into an ONNX
   model, we need to map Slice on Caffe to Split on ONNX.
   FusedBatchnorm on TensorFlow does not have a mapping operator on
   Caffe. Rather, Batchnorm and Scale on Caffe need to be combined to
   express the same semantics as FusedBatchnorm on TensorFlow.
   Generally, the model conversion process involves converting the
   topological relationships and mapping the operator prototypes
   between models.
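The Slice-versus-Split mapping above can be made concrete with a small NumPy sketch. Caffe's Slice takes cut indices along an axis while ONNX's Split takes section sizes; converting one to the other is an attribute translation. The helper names here are our own illustrations, not real framework APIs:

```python
import numpy as np

def caffe_style_slice(x, slice_points, axis=0):
    """Caffe-style Slice: cut `x` at the given indices along `axis`."""
    return np.split(x, slice_points, axis=axis)

def onnx_style_split(x, sizes, axis=0):
    """ONNX-style Split: cut `x` into chunks of the given sizes along `axis`."""
    points = np.cumsum(sizes)[:-1]  # translate section sizes into cut indices
    return np.split(x, points, axis=axis)

x = np.arange(12).reshape(6, 2)
a = caffe_style_slice(x, [2, 5])    # pieces of 2, 3, and 1 rows
b = onnx_style_split(x, [2, 3, 1])  # the same pieces, expressed as sizes
print(all(np.array_equal(p, q) for p, q in zip(a, b)))  # True
```

The `np.cumsum(...)[:-1]` line is exactly the kind of attribute mapping a model converter performs when the two operator prototypes differ only in how they parameterize the same computation.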
Following model conversion, some input-agnostic operations are conducted
for optimization purposes prior to model deployment, including constant
folding, operator fusion, operator replacement, and operator reordering
--- optimization methods discussed earlier in this book. For instance,
constant folding is usually performed during the compilation executed on
the compiler frontend, whereas operator fusion and partition are often
performed (depending on the backend hardware support) once the
compilation is complete. However, some optimization operations can only
be performed in their entirety during the deployment phase.


:label:`ch-deploy/fusion-storage`
## Operator Fusion
:label:`ch-deploy/kernel-fusion`

Operator fusion involves combining multiple operators in a deep neural
network (DNN) model into a new operator based on certain rules, reducing
the inference latency and power consumption by lowering the computation
workload and load/store overhead during online inference.
The two main performance benefits brought by operator fusion are as
follows: First, it maximizes the utilization of registers and caches.
Second, because it combines operators, the load/store time between the
CPU and memory is reduced. Figure
:numref:`ch-deploy/fusion-storage` shows the architecture of a
computer's storage system. While the storage capacity increases from the
level-1 cache (L1) to hard disk, so too does the time for reading data.
After operator fusion is performed, the previous computation result can
be temporarily stored in the CPU's register or cache, where the next
computation can directly read the result, reducing the number of I/O
operations on the memory. Furthermore, operator fusion allows some
computation to be completed in advance, eliminating redundant or even
cyclic redundant computing during forward computation.


:label:`ch-deploy/conv-bn-fusion`
To describe the principle of operator fusion, we will use two operators,
Convolution and Batchnorm, as shown in Figure
:numref:`ch-deploy/conv-bn-fusion`. In the figure, the
solid-colored boxes indicate operators, the resulting operators after
fusion is performed are represented by hatched boxes, and the weights or
constant tensors of operators are outlined in white. The fusion can be
understood as the simplification of an equation. The computation of
Convolution is expressed as Equation
:eqref:`ch-deploy/conv-equation`.

$$\bf{Y_{\rm conv}}=\bf{W_{\rm conv}}\cdot\bf{X_{\rm conv}}+\bf{B_{\rm conv}}$$
:eqlabel:`equ:ch-deploy/conv-equation`

Here, we do not need to understand what each variable means. Instead, we
only need to keep in mind that Equation
:eqref:`ch-deploy/conv-equation` is an equation for
$\bf{Y_{\rm conv}}$ with respect to $\bf{X_{\rm conv}}$, and the other
symbols are constants.
Equation
:eqref:`ch-deploy/bn-equation` describes the computation of
Batchnorm:

$$\bf{Y_{\rm bn}}=\gamma\frac{\bf{X_{\rm bn}}-\mu_{\mathcal{B}}}{\sqrt{{\sigma_{\mathcal{B}}}^{2}+\epsilon}}+\beta$$
:eqlabel:`equ:ch-deploy/bn-equation`

Similarly, it is an equation for $\bf{Y_{\rm bn}}$ with respect to
$\bf{X_{\rm bn}}$. The other symbols in the equation represent
constants.
As shown in Figure
:numref:`ch-deploy/conv-bn-fusion`, when the output of
Convolution is used as the input of Batchnorm, the formula of Batchnorm
becomes a function for $\bf{Y_{\rm bn}}$ with respect to
$\bf{X_{\rm conv}}$. After substituting $\bf{Y_{\rm conv}}$ for
$\bf{X_{\rm bn}}$ and collecting and extracting the constants, we obtain
Equation
:eqref:`ch-deploy/conv-bn-equation-3`.

$$\bf{Y_{\rm bn}}=\bf{A}\cdot\bf{X_{\rm conv}}+\bf{B}$$
:eqlabel:`equ:ch-deploy/conv-bn-equation-3`

Here, $\bf{A}$ and $\bf{B}$ are two matrices. Note that Equation
:eqref:`ch-deploy/conv-bn-equation-3` has the form of a Convolution
computation. The preceding example shows that the computation of
Convolution and Batchnorm can be fused into an equivalent Convolution
operator. Such fusion is referred to as formula fusion.
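The derivation above can be checked numerically. The NumPy sketch below folds Batchnorm constants into a linear layer $Y = WX + B$ (standing in for Convolution, which is also linear in its input); all parameter names are illustrative:

```python
import numpy as np

def fuse_linear_bn(W, B, gamma, beta, mu, var, eps=1e-5):
    """Fold Batchnorm constants into the preceding linear weights so that
    Y_bn = A @ X + b, with A and b precomputed offline."""
    scale = gamma / np.sqrt(var + eps)   # per-output-channel factor
    A = W * scale[:, None]               # scale each output row of W
    b = (B - mu) * scale + beta          # fold the shift into the bias
    return A, b

# Verify the fused operator matches linear followed by Batchnorm.
rng = np.random.default_rng(0)
W, B = rng.normal(size=(4, 3)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mu, var = rng.normal(size=4), rng.uniform(1, 2, size=4)
x = rng.normal(size=3)

y_ref = gamma * ((W @ x + B) - mu) / np.sqrt(var + 1e-5) + beta
A, b = fuse_linear_bn(W, B, gamma, beta, mu, var)
print(np.allclose(A @ x + b, y_ref))  # True
```

Because `A` and `b` are computed once offline, the deployed model executes a single linear operator where it previously executed two.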
The fusion of Convolution and Batchnorm eliminates a Batchnorm
operation, reducing the number of parameters and the computation
workload, which in turn reduces the load/store operations. In general,
this fusion not only optimizes the power consumption and performance
during model deployment, but also brings certain benefits in
compressing the model size.
Symbols that are treated as constants in the Convolution and Batchnorm
formulas during fusion are trainable parameters during training.
Performing the fusion during training would therefore drop model
parameters: because the fusion eliminates a Batchnorm operator and its
parameters from the network, the algorithm of the DNN is changed,
degrading the accuracy to unacceptable levels. Therefore, the fusion of
Convolution and Batchnorm is an optimization method typically used
during deployment. To evaluate the optimization effect, we constructed a
sample network with Convolution and Batchnorm using MindSpore Lite. We
ran the sample network and the mobilenet-v2 network for inference in
dual threads on a Huawei Mate 30 smartphone to compare the time of
running 3,000 inference epochs before and after the fusion. As shown in
Table :numref:`ch09/ch09-conv-bn-fusion`, the inference performance of
the sample network and mobilenet-v2 network is improved considerably
after the fusion --- by 8.5% and 11.7% respectively. Such improvements
are achieved without side effects and without requiring additional
hardware or operator libraries.
:Convolution + Batchnorm inference performance before and after fusion (unit: ms)

| Fusion        | Sample | Mobilenet-v2 |
|---------------|--------|--------------|
| Before fusion | 0.035  | 15.415       |
| After fusion  | 0.031  | 13.606       |
:label:`ch09/ch09-conv-bn-fusion`
## Operator Replacement

The principle of operator replacement is to simplify an operator formula
by uniting like terms, extracting common factors, and employing other
mathematical methods, and then map the simplified formula to a certain
type of operator that has the same computational logic but is more
suitable for online deployment. In this way, we can reduce the
computation workload and compress the model.


:label:`ch-deploy/bn-replace`
Figure :numref:`ch-deploy/bn-replace` depicts the replacement of
Batchnorm with Scale, which is used as an example to describe the
principle of operator replacement. After decomposing Equation
:eqref:`ch-deploy/bn-equation` (the Batchnorm formula) and
folding the constants, Batchnorm is defined as Equation
:eqref:`ch-deploy/replace-scale`:

$$\bf{Y_{\rm bn}}=scale\cdot\bf{X_{\rm bn}}+offset$$
:eqlabel:`equ:ch-deploy/replace-scale`

where **scale** and **offset** are scalars. This simplified formula can
be mapped to a Scale operator.
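The folded constants can be computed offline from the Batchnorm parameters; a minimal NumPy sketch (parameter names illustrative):

```python
import numpy as np

def bn_to_scale(gamma, beta, mu, var, eps=1e-5):
    """Fold Batchnorm constants into a Scale operator: Y = scale * X + offset."""
    scale = gamma / np.sqrt(var + eps)
    offset = beta - mu * scale
    return scale, offset

gamma, beta, mu, var = 1.5, 0.2, 0.4, 2.0
scale, offset = bn_to_scale(gamma, beta, mu, var)

x = 3.0
y_bn = gamma * (x - mu) / np.sqrt(var + 1e-5) + beta  # original Batchnorm
print(np.isclose(scale * x + offset, y_bn))  # True
```

At inference time only `scale` and `offset` need to be stored and applied, which is exactly the saving the replacement aims for.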
Compared with the original Batchnorm formula, the simplified formula has
fewer parameters and involves less computation. This indicates that
operator replacement is an effective approach to optimizing the power
consumption and performance of a model during deployment. Symbols that
are treated as constants in Batchnorm during deployment are not
constants during training, meaning that the replacement can be performed
only during deployment. Operator replacement reduces the number of
parameters and changes the structure of the model, which can weaken its
expressive power and reduce its accuracy at convergence.
## Operator Reordering

Another way of reducing the computation workload of an inference model
is to adjust the topological order of its operators according to certain
rules, on the condition that the inference accuracy is not degraded.
Common methods of operator reordering include moving cropping operators
(e.g., Slice, StrideSlice, and Crop) forward, and reordering Reshape,
Transpose, and BinaryOp.


:label:`ch-deploy/crop-reorder`
Crop is used to cut a part out of the input feature map as the output.
After Crop is executed, the size of the feature map is reduced. As shown
in Figure :numref:`ch-deploy/crop-reorder`, moving Crop forward to cut
the feature map before other operators reduces the computation workload
of subsequent operators, thereby improving the inference performance in
the deployment phase. Such improvement is related to the operator
parameters. Note, however, that Crop can be moved forward only along
element-wise operators.
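For element-wise operators, cropping before the operator yields the same result as cropping after it, which is why the move is safe; a NumPy sketch with ReLU as the element-wise operator:

```python
import numpy as np

def relu(x):
    """An element-wise operator."""
    return np.maximum(x, 0)

def crop(x, r0, r1, c0, c1):
    """Cut a rectangular region out of a feature map."""
    return x[r0:r1, c0:c1]

fmap = np.random.default_rng(1).normal(size=(8, 8))

after = crop(relu(fmap), 2, 6, 2, 6)   # original order: operator, then Crop
before = relu(crop(fmap, 2, 6, 2, 6))  # reordered: Crop, then operator on a 4x4 map
print(np.array_equal(after, before))   # True, but `before` touches far fewer elements
```

The reordered variant computes ReLU on 16 elements instead of 64; for non-element-wise operators (e.g., convolution, whose output depends on neighboring pixels) this equality would not hold, which is the restriction stated above.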
The experimental results above show that optimizing models before
inference can significantly reduce the latency, power consumption, and
memory usage.
# Overview

After training a model, we need to save it and its parameters to files
to make them persistent. However, because different training frameworks
adopt different data structures for such files, the inference system
must support models trained using different training frameworks and
convert the data in the files into a unified data structure. During the
conversion from a training model to an inference model, optimization
operations such as operator fusion and constant folding can be performed
on the model to improve the inference performance.
The hardware restrictions of different production environments must be
considered when we deploy an inference model. For instance, a
large-scale model needs to be deployed on a server in a computing or
data center with strong computing power, whereas a mid-scale model
should be deployed on an edge server, PC, or smartphone --- such devices
often have limited computing resources and memory. For simple,
small-scale models, ultra-low power microcontrollers can be used. In
addition, different hardware supports different data types (such as
float32, float16, bfloat16, and int8). To adapt to the hardware
restrictions, a trained model may sometimes need to be compressed in
order to reduce model complexity or data precision and reduce model
parameters.

Before a model can be used for inference, it needs to be deployed in the
runtime environment. To optimize model inference, which may be affected
by latency, memory usage, and power consumption, we can design chips
dedicated for machine learning --- such dedicated chips usually
outperform general-purpose ones in terms of energy efficiency. Another
approach is to fully leverage hardware capabilities through
software-hardware collaboration. Take a CPU as an example. When
designing and optimizing models for a specific CPU architecture, we can
suitably divide data blocks to fit the cache size, rearrange data to
facilitate contiguous data access during computing, reduce data
dependency to improve the parallelism of hardware pipelines, and use
extended instruction sets to improve the computing performance.
Because models are an important enterprise asset, it is important to
ensure their security after they are deployed in the runtime
environment. This chapter will discuss some of the common protection
measures and use model obfuscation as an example.

Some of the common methods used in the industry to address the preceding
challenges are as follows:
1. **Model compression:** Technologies that reduce the model size and
   computation complexity by means of quantization and pruning. Such
   technologies can be categorized according to whether retraining is
   required.

2. **Operator fusion:** Technologies that combine multiple operators
   into one by simplifying expressions and fusing attributes, aiming to
   reduce the computation complexity and size of the model.

3. **Constant folding:** Forward computation of operators that meet
   certain conditions is completed in the offline phase, reducing the
   computation complexity and size of a model. This requires that the
   inputs of such operators be constants in the offline phase.

4. **Data format optimization:** Based on the operator library and
   hardware restrictions, the optimal data format of each network layer
   is explored, and data is rearranged or data rearrangement operators
   are inserted, in order to reduce the inference latency during model
   deployment.

5. **Model obfuscation:** Network nodes or branches are added and
   operator names are changed for a trained model, so that it is
   difficult for attackers to understand the original model structure
   even if they steal the model. An obfuscated model can be directly
   executed in the deployment environment, thereby ensuring the
   security of the model during execution.
# Model Inference

After conversion and compression, a trained model needs to be deployed
on the computation hardware in order to execute inference. Such
execution involves the following steps:

1. Preprocessing: Process raw data to suit the network input.

2. Inference execution: Deploy the model resulting from offline
   conversion on the device to execute inference and compute the output
   based on the input.

3. Postprocessing: Further process the output of the model, for
   example, by threshold filtering.
## Preprocessing and Postprocessing

**1. Preprocessing**

Raw data, such as images, voices, and texts, is so disordered that
machine learning models cannot identify or extract useful information
from it. Preprocessing is intended to convert such data into tensors
that work for machine learning networks, eliminate irrelevant
information, restore useful true information, enhance the detectability
of relevant information, and simplify the data as much as possible. In
this way, reliability indicators related to feature extraction, image
segmentation, matching, and recognition of the models can be improved.

The following techniques are often used in data preprocessing:
1. Feature encoding: Encode the raw data that describes features into
   numbers and input them to machine learning models, which can process
   only numerical values. Common encoding approaches include
   discretization, ordinal encoding, one-hot encoding, and binary
   encoding.

2. Normalization: Modify features to be on the same scale without
   changing the correlation between them, eliminating the impact of
   dimensions between data indicators. Common approaches include
   Min-Max normalization, which normalizes the data range, and Z-score
   normalization, which normalizes the data distribution.

3. Outlier processing: An outlier is a data point that is distant from
   all others in distribution. Eliminating outliers can improve the
   accuracy of a model.
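The two normalization approaches above can be sketched in a few lines of NumPy (a minimal illustration):

```python
import numpy as np

def min_max_normalize(x):
    """Rescale features to the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())

def z_score_normalize(x):
    """Center features to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()

x = np.array([2.0, 4.0, 6.0, 8.0])
print(min_max_normalize(x))   # values rescaled into [0, 1]
print(z_score_normalize(x))   # zero mean, unit standard deviation
```

Min-Max is sensitive to outliers (a single extreme value compresses all other features toward 0), which is one reason outlier processing is usually performed before normalization.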
**2. Postprocessing**

After model inference, the output data is transferred to users for
postprocessing. Common postprocessing techniques include:

1. Discretization of continuous data: Assume we expect to predict
   discrete data, such as the quantity of a good, using a model, but a
   regression model only provides continuous prediction values, which
   have to be rounded or bounded.

2. Data visualization: This technique uses graphics and tables to
   represent data so that we can find relationships in the data in
   order to support analysis strategy selection.

3. Prediction range widening: Most values predicted by a regression
   model are concentrated in the center, and few are in the tails. For
   example, abnormal values of hospital laboratory data are used to
   diagnose diseases. To increase the accuracy of prediction, we can
   enlarge the values in both tails by widening the prediction range,
   multiplying the values that deviate from the normal range by a
   coefficient.
## Parallel Computing
:label:`ch-deploy/parallel-inference`

Most inference frameworks have a multi-thread mechanism that leverages
the capabilities of multiple cores in order to achieve performance
improvements. In this mechanism, the input data of operators is
partitioned, and multiple threads are used to process different data
partitions. This allows operators to be computed in parallel, thereby
multiplying the operator performance.

|
||||
:label:`ch09_parallel`
|
||||
|
||||
In Figure :numref:`ch09_parallel`, the matrix in the multiplication can be
|
||||
partitioned according to the rows of matrix A. Three threads can then be
|
||||
used to compute A1 \* B, A2 \* B, and A3 \* B (one thread per
|
||||
computation), implementing multi-thread parallel execution of the matrix
|
||||
multiplication.
|
||||
|
||||
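This row-partitioned scheme can be sketched with Python's standard thread pool (the partition count of three mirrors the figure; NumPy releases the GIL inside matrix multiplication, so the partitions can genuinely overlap):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_matmul(A, B, num_threads=3):
    """Compute A @ B by splitting A into row blocks, one block per thread."""
    blocks = np.array_split(A, num_threads, axis=0)   # A1, A2, A3
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(lambda Ai: Ai @ B, blocks))
    return np.vstack(partials)                        # stack A1@B, A2@B, A3@B

rng = np.random.default_rng(0)
A, B = rng.normal(size=(9, 4)), rng.normal(size=(4, 5))
print(np.allclose(parallel_matmul(A, B), A @ B))  # True
```

Row partitioning needs no synchronization between threads, since each output block depends only on its own slice of A, which is why it is a natural parallelization axis for matrix multiplication.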
To facilitate parallel computing of operators and avoid the overhead of
frequently creating and destroying threads, inference frameworks usually
have a thread pooling mechanism. There are two common practices:

1. Open Multi-Processing (OpenMP) API: OpenMP is an API that supports
   shared-memory concurrency across multiple platforms. It provides
   interfaces that are commonly used to implement operator parallelism.
   An example of such an interface is `parallel for`, which allows
   `for` loops to be concurrently executed by multiple threads.

2. Framework-provided thread pools: Such pools are more lightweight and
   targeted at the AI domain compared with OpenMP interfaces, and can
   deliver better performance.
## Operator Optimization
:label:`ch-deploy/kernel-optimization`

When deploying an AI model, we want model training and inference to be
performed as fast as possible in order to obtain better performance. For
a deep learning network, the scheduling of the framework takes only a
short period of time, whereas operator execution is often a bottleneck
for performance. This section introduces how to optimize operators from
the perspectives of hardware instructions and algorithms.

**1. Hardware instruction optimization**

Given that most devices have CPUs, the time that CPUs spend processing
operators has a direct impact on performance. Here we look at the
methods for optimizing hardware instructions on ARM CPUs.
**1) Assembly language**
|
||||
|
||||
High-level programming languages such as C++ and Java are compiled as
|
||||
machine instruction code sequences by compilers, which often have a
|
||||
direct influence on which capabilities these languages offer. Assembly
|
||||
languages are close to machine code and can implement any instruction
|
||||
code sequence in one-to-one mode. Programs written in assembly languages
|
||||
occupy less memory, and are faster and more efficient than those written
|
||||
in high-level languages.
|
||||
|
||||
In order to exploit the advantages of both types of languages, we can
|
||||
write the parts of a program that require better performance in assembly
|
||||
languages and the other parts in high-level languages. Because
|
||||
convolution and matrix multiplication operators in deep learning involve
|
||||
a large amount of computation, using assembly languages for code
|
||||
necessary to perform such computation can improve model training and
|
||||
inference performance by dozens or even hundreds of times.
|
||||
|
||||
Next, we use ARMv8 CPUs to illustrate the optimization related to
|
||||
hardware instructions.
|
||||
|
||||
**2) Registers and NEON instructions**
|
||||
|
||||
Each ARMv8 CPU has 32 NEON registers, that is, v0 to v31. As shown in
|
||||
Figure :numref:`ch-deploy/register`, NEON register v0 can store 128
|
||||
bits, which is equal to the capacity of 4 float32, 8 float16, or 16
|
||||
int8.
|
||||
|
||||

|
||||
:label:`ch-deploy/register`
|
||||
|
||||
The single instruction multiple data (SIMD) method can be used to
|
||||
improve the data access and computing speed on this CPU. Compared with
|
||||
single data single instruction (SISD), the NEON instruction can process
|
||||
multiple data values in the NEON register at a time. For example, the
|
||||
`fmla` instruction for floating-point data is used as
|
||||
`fmla v0.4s, v1.4s, v2.4s`. As depicted in Figure
|
||||
:numref:`ch-deploy/fmla`, the products of the corresponding
|
||||
floating-point values in registers v1 and v2 are added to the value in
|
||||
v0.
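The lane-wise semantics of this instruction can be written out in
scalar C++ (an emulation for illustration only; a real NEON
implementation performs all four lane updates in one instruction):

```cpp
#include <array>

// Scalar emulation of `fmla v0.4s, v1.4s, v2.4s`: each of the four
// float32 lanes of v0 is updated as v0[i] += v1[i] * v2[i].
using Vec4 = std::array<float, 4>;

void fmla(Vec4& v0, const Vec4& v1, const Vec4& v2) {
  for (int i = 0; i < 4; ++i) v0[i] += v1[i] * v2[i];
}
```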


:label:`ch-deploy/fmla`

**3) Assembly language optimization**

For an assembly routine whose function is fixed, the computational
instructions are usually fixed as well. In this case, non-computational
instructions are the source of the performance bottleneck. The
structure of computer storage resembles a pyramid, as shown in Figure
:numref:`ch-deploy/fusion-storage`. The top layer is the fastest but
the smallest; conversely, the bottom layer offers the largest space but
the slowest speed. L1 to L3 are referred to as caches. When accessing
data, the CPU first attempts to read it from one of its caches; if the
data is not found there, the CPU falls back to the external main
memory. The cache hit rate measures the proportion of accesses served
from the cache, and maximizing it is key to improving program
performance.

There are several techniques for improving the cache hit rate and
optimizing assembly performance:

1. Loop unrolling: Use as many registers as possible to achieve better
   performance, at the cost of a larger code size.

2. Instruction reordering: Reorder the instructions of different
   execution units to improve pipeline utilization, allowing
   long-latency instructions to be issued earlier. Besides hiding
   latency, this also reduces data dependencies between adjacent
   instructions.

3. Register blocking: Block NEON registers appropriately to reduce the
   number of idle registers and reuse registers more effectively.

4. Data rearrangement: Rearrange the computational data to ensure
   contiguous memory reads and writes and improve the cache hit rate.

5. Data prefetching: Load the required data from the main memory into
   the cache in advance to reduce access latency.
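As a small illustration of the first technique (written in portable C++
rather than assembly; the function name is mine), a dot product can be
unrolled four ways with independent accumulators:

```cpp
#include <cstddef>

// Dot product with 4-way loop unrolling. Four independent
// accumulators break the dependency chain on a single register,
// letting the CPU pipeline the multiply-add operations. The tail
// loop handles the leftover elements when n is not a multiple of 4.
float dot_unrolled(const float* a, const float* b, std::size_t n) {
  float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += a[i] * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  float s = (s0 + s1) + (s2 + s3);
  for (; i < n; ++i) s += a[i] * b[i];
  return s;
}
```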

**2. Algorithm optimization**

For most AI models, 90% or more of the total inference time is spent on
convolution and matrix multiplication operators. This section therefore
focuses on optimizing convolution algorithms, and the techniques apply
across hardware devices. The computation of a convolution can be
converted into the multiplication of two matrices, and we have
elaborated on the optimization of the GEMM algorithm in Section
:ref:`ch-deploy/parallel-inference`. For a given hardware target,
appropriate matrix blocking can optimize load/store efficiency and
instruction-level parallelism, maximizing the utilization of the
hardware's computing power and thereby improving inference performance.

**(1) Img2col**

Img2col is often used to convert convolution into matrix
multiplication. Convolutional layers typically operate on 4D inputs in
NHWC format. Figure :numref:`ch-deploy/conv_nhwc` is a diagram of such
a convolution. The input shape is (1, IH, IW, IC), the convolution
kernel shape is (OC, KH, KW, IC), and the output shape is (1, OH, OW,
OC).


:label:`ch-deploy/conv_nhwc`

As shown in Figure :numref:`ch-deploy/img2col_input`, the Img2col rule
for the input is as follows: the input is reordered into the matrix on
the right, whose number of rows equals the number of outputs, OH \* OW.
Each row vector gathers the KH \* KW data points of the corresponding
receptive field, channel by channel, from the first input channel to
channel IC.


:label:`ch-deploy/img2col_input`

As shown in Figure :numref:`ch-deploy/img2col_weight`, the weights are
rearranged so that each convolution kernel is expanded into one column
of the weight matrix, giving OC columns in total. Within each column
vector, the KH \* KW values of the first input channel come first,
followed by those of subsequent channels up to channel IC. In this
manner, the convolution is converted into the multiplication of two
matrices. In practice, the data rearrangement of Img2col and the GEMM
packing are performed together to save time.


:label:`ch-deploy/img2col_weight`
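The input rearrangement can be sketched as follows (a simplified
version assuming stride 1, no padding, and a single image; these
simplifications and the function name are mine, not the book's):

```cpp
#include <vector>

// Img2col for one NHWC image, stride 1, no padding. The output matrix
// has OH*OW rows and KH*KW*IC columns: the row for output position
// (oh, ow) gathers the KH*KW window of each input channel in turn,
// matching the row layout described in the text.
std::vector<float> img2col(const std::vector<float>& in, int IH, int IW,
                           int IC, int KH, int KW) {
  int OH = IH - KH + 1, OW = IW - KW + 1;
  std::vector<float> out(static_cast<std::size_t>(OH) * OW * KH * KW * IC);
  std::size_t p = 0;
  for (int oh = 0; oh < OH; ++oh)
    for (int ow = 0; ow < OW; ++ow)
      for (int c = 0; c < IC; ++c)          // channel-major within a row
        for (int kh = 0; kh < KH; ++kh)
          for (int kw = 0; kw < KW; ++kw)
            out[p++] = in[((oh + kh) * IW + (ow + kw)) * IC + c];
  return out;
}
```

Multiplying this matrix by the rearranged weight matrix then yields the
(OH \* OW) x OC output.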

**(2) Winograd**

Convolution can essentially be treated as matrix multiplication, and
multiplying two 2D matrices has time complexity $O(n^3)$. The Winograd
algorithm reduces the number of multiplications this requires.

Assume that a 1D convolution operation is denoted as ***F***($m$, $r$),
where $m$ indicates the number of outputs and $r$ indicates the size of
the convolution kernel. For the input
$\textit{\textbf{d}}=[d_0 \ d_1 \ d_2 \ d_3]$ and the convolution
kernel $g=[g_0 \ g_1 \ g_2]^{\rm T}$, the convolution can be written in
matrix form as Equation
:eqref:`ch-deploy/conv-matmul-one-dimension`, which contains six
multiplications and four additions.

$$
\textit{\textbf{F}}(2, 3)=
\left[ \begin{matrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{matrix} \right] \times \left[ \begin{matrix} g_0 \\ g_1 \\ g_2 \end{matrix} \right]=
\left[ \begin{matrix} y_0 \\ y_1 \end{matrix} \right]
$$
:eqlabel:`equ:ch-deploy/conv-matmul-one-dimension`

In the preceding equation, the elements $d_1$ and $d_2$ appear twice in
the input matrix. This repetition leaves room for optimization compared
with general matrix multiplication. The result can instead be obtained
by computing intermediate variables $m_0$ to $m_3$, as shown in
Equation :eqref:`ch-deploy/conv-2-winograd`:

$$
\textit{\textbf{F}}(2, 3)=
\left[ \begin{matrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{matrix} \right] \times \left[ \begin{matrix} g_0 \\ g_1 \\ g_2 \end{matrix} \right]=
\left[ \begin{matrix} m_0+m_1+m_2 \\ m_1-m_2-m_3 \end{matrix} \right]
$$
:eqlabel:`equ:ch-deploy/conv-2-winograd`

where $m_0$ to $m_3$ are computed as Equation
:eqref:`ch-deploy/winograd-param`:

$$
\begin{aligned}
m_0=(d_0-d_2) \times g_0 \\
m_1=(d_1+d_2) \times (\frac{g_0+g_1+g_2}{2}) \\
m_2=(d_2-d_1) \times (\frac{g_0-g_1+g_2}{2}) \\
m_3=(d_1-d_3) \times g_2
\end{aligned}
$$
:eqlabel:`equ:ch-deploy/winograd-param`

Computing $y_0$ and $y_1$ indirectly via $m_0$ to $m_3$ involves four
additions on the input $d$, four multiplications, and four additions on
the outputs $m$. Because the weights are constant during inference, the
operations on the convolution kernel can be performed at graph
compilation time and excluded from the online runtime. In total, there
are four multiplications and eight additions --- fewer multiplications
but more additions than the direct computation (six multiplications and
four additions). In computer systems, a multiplication is generally
more expensive than an addition, so trading a few multiplications for a
small number of extra additions accelerates the computation.
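A minimal sketch of ***F***(2, 3) (scalar code for illustration; the
two halved kernel terms would be precomputed at graph compilation time,
as noted above):

```cpp
#include <array>

// Winograd F(2,3): two outputs of a 1D convolution of the 4-element
// input d with the 3-tap kernel g, using four multiplications.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
  float u1 = (g[0] + g[1] + g[2]) * 0.5f;  // precomputable offline
  float u2 = (g[0] - g[1] + g[2]) * 0.5f;  // precomputable offline
  float m0 = (d[0] - d[2]) * g[0];
  float m1 = (d[1] + d[2]) * u1;
  float m2 = (d[2] - d[1]) * u2;
  float m3 = (d[1] - d[3]) * g[2];
  return {m0 + m1 + m2, m1 - m2 - m3};
}
```

The result matches the direct convolution
$y_0 = d_0 g_0 + d_1 g_1 + d_2 g_2$ and
$y_1 = d_1 g_0 + d_2 g_1 + d_3 g_2$.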

In matrix form, the computation can be written as Equation
:eqref:`ch-deploy/winograd-matrix`, where $\odot$ denotes element-wise
multiplication and ***A***, ***B***, and ***G*** are constant matrices.
The matrix form is given for clarity --- in practice, computing the
handwritten form in Equation :eqref:`ch-deploy/winograd-param` directly
is faster.

$$\textit{\textbf{Y}}=\textit{\textbf{A}}^{\rm T}(\textit{\textbf{G}}g) \odot (\textit{\textbf{B}}^{\rm T}d)$$
:eqlabel:`equ:ch-deploy/winograd-matrix`

$$\textit{\textbf{B}}^{\rm T}=
\left[ \begin{matrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{matrix} \right]$$
:eqlabel:`equ:ch-deploy/winograd-matrix-bt`

$$\textit{\textbf{G}}=
\left[ \begin{matrix} 1 & 0 & 0 \\ 0.5 & 0.5 & 0.5 \\ 0.5 & -0.5 & 0.5 \\ 0 & 0 & 1 \end{matrix} \right]$$
:eqlabel:`equ:ch-deploy/winograd-matrix-g`

$$\textit{\textbf{A}}^{\rm T}=
\left[ \begin{matrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{matrix} \right]$$
:eqlabel:`equ:ch-deploy/winograd-matrix-at`

In deep learning, 2D convolution is typically used. When ***F***(2, 3)
is extended to ***F***(2$\times$2, 3$\times$3), it can be written in
matrix form as Equation
:eqref:`ch-deploy/winograd-two-dimension-matrix`. In this case,
Winograd uses 16 multiplications instead of the 36 required by direct
convolution, reducing the multiplication count by a factor of 2.25.

$$\textit{\textbf{Y}}=\textit{\textbf{A}}^{\rm T}(\textit{\textbf{G}}g\textit{\textbf{G}}^{\rm T}) \odot (\textit{\textbf{B}}^{\rm T}d\textit{\textbf{B}})\textit{\textbf{A}}$$
:eqlabel:`equ:ch-deploy/winograd-two-dimension-matrix`

The logical process of Winograd can be divided into four steps, as
shown in Figure :numref:`ch-deploy/winograd`.


:label:`ch-deploy/winograd`

To use Winograd of ***F***(2$\times$2, 3$\times$3) for an arbitrary
output size, we divide the output into 2$\times$2 tiles and perform the
four steps above on the corresponding input tile to obtain each output
tile. Winograd is not limited to ***F***(2$\times$2, 3$\times$3): for
any ***F***($m \times m$, $r \times r$), suitable constant matrices
***A***, ***B***, and ***G*** can be found that reduce the number of
multiplications through indirect computation. However, as $m$ and $r$
grow, the number of additions on the input and output transforms and
the number of multiplications by constant weights increase. At some
point, the savings from fewer multiplications are offset by these
additions and constant multiplications, so the benefit of Winograd
should be evaluated before it is applied.

This section described methods for processing data and optimizing
performance during model inference. An appropriate data processing
method facilitates input feature extraction and output processing. To
fully leverage the computing power of the hardware, we can use parallel
computing together with operator-level hardware instruction and
algorithm optimization. In addition, memory usage and load/store
throughput also matter for inference performance, so it is essential to
design an appropriate memory overcommitment strategy for inference;
related methods were discussed in the section on the compiler backend.

165 v1/en_chapters/chapter_model_deployment/model_security.md (new file)
@@ -0,0 +1,165 @@

# Security Protection of Models

After training and optimizing models locally, AI service providers
deploy them on third-party platforms (such as mobile devices, edge
devices, and cloud servers) to provide inference services. Designing
and training AI models requires a large amount of time, data, and
computing power, which is why model and service providers must protect
the intellectual property embodied in the models (including model
structures and parameters) from being stolen during transfer, storage,
and execution in the deployment phase.

## Overview

The security protection of models can be divided into static protection
and dynamic protection. Static protection protects models during
transfer and storage. At present, it is widely implemented with file
encryption: AI model files are transferred and stored as ciphertext and
are decrypted in memory before being used for inference. However,
throughout the inference process the model remains in plaintext in
memory, leaving it exposed to theft. Dynamic protection protects models
at runtime, and current methods fall into three categories. The first
is trusted execution environment-based (TEE-based) protection. TEEs are
secure zones isolated on trusted hardware; AI model files are stored
and transferred encrypted in non-secure zones and run after decryption
inside the secure zones. Although this method adds only a small
inference latency on the CPU, it requires specific trusted hardware,
making it difficult to deploy widely. In addition, hardware resource
constraints make it hard to protect large-scale deep models, and
heterogeneous hardware acceleration remains challenging. The second is
cryptographic computing-based protection, which keeps models in
ciphertext during transfer, storage, and execution using cryptographic
techniques such as homomorphic encryption and secure multi-party
computation. Although this method is free from hardware constraints, it
incurs large computation or communication overheads and cannot protect
model structure information. The third is obfuscation-based protection,
which scrambles the computational logic of a model (for example, with
fake nodes) so that attackers cannot understand the model even if they
obtain it. Compared with the former two methods, obfuscation incurs a
smaller performance overhead and negligible accuracy loss. Furthermore,
it is hardware-agnostic and can protect very large models. We will
focus on obfuscation-based protection.

## Model Obfuscation

Model obfuscation automatically obfuscates the computational logic of a
plaintext AI model, preventing attackers from understanding the model
even if they obtain it during transfer or storage. In addition, the
model can run while still obfuscated, ensuring confidentiality at
runtime. Obfuscation does not affect the inference results and incurs
only a small performance overhead.


:label:`ch-deploy/model_obfuscate`

Figure :numref:`ch-deploy/model_obfuscate` depicts the model
obfuscation procedure, which is described as follows.

1. **Interpret the given model into a computational graph:** Based on
   the structure of the trained model, interpret the model file into a
   graph expression (computational graph) of the model's computational
   logic for subsequent operations. The resulting computational graph
   contains information such as node identifiers, node operator types,
   node parameters, and the network structure.

2. **Scramble the network structure of the computational graph[^1]:**
   Scramble the relationships between nodes in the computational graph
   using graph compression, graph augmentation, and other techniques in
   order to conceal the true computational logic. Graph compression
   matches key subgraph structures across the entire graph and replaces
   each matched subgraph with a single new computing node. Graph
   augmentation adds new input/output edges to the compressed graph to
   further conceal the dependencies between nodes; such an edge comes
   from or points to either an existing node in the graph or a new
   obfuscation node added in this step.

3. **Anonymize nodes in the computational graph:** Traverse the
   computational graph produced in Step (2) and select the nodes to be
   protected. For each selected node, replace the node identifier,
   operator type, and other attributes that describe the model's
   computational logic with non-semantic symbols. For node identifier
   anonymization, each anonymized identifier must be unique so that
   nodes remain distinguishable. For operator type anonymization, to
   avoid an explosion of operator types when anonymizing a large
   computational graph, nodes of the same operator type can be divided
   into several disjoint sets, with all nodes in the same set mapped to
   the same symbol. Step (5) ensures that the model can still be
   identified and executed after node anonymization.

4. **Scramble the weights of the computational graph:** Add random
   noise and mapping functions to the weights to be protected; the
   noise and mapping functions can differ across weights. Step (6)
   ensures that the weight noise does not change the model execution
   result. The computational graph processed by Steps (2), (3), and (4)
   is then saved as a model file for subsequent operations.

5. **Transform operator interfaces:** Steps (5) and (6) transform the
   operators to be protected in order to generate candidate obfuscated
   operators. An original operator may correspond to multiple
   obfuscated operators; the number of candidates depends on how many
   sets the nodes were grouped into in Step (3). In this step, the
   operator interfaces are transformed based on the anonymized operator
   types and the operator input/output relationships obtained after
   Steps (2), (3), and (4). The transformation can change the input,
   the output, or the interface name. Changing the input and output
   modifies the input and output data, making the form of the
   obfuscated operator different from that of the original operator;
   the added data includes the data dependencies introduced by graph
   augmentation in Step (2) and the random noise introduced by weight
   obfuscation in Step (4). The operator name is changed to the
   anonymized name from Step (3), ensuring that the model can still be
   identified and executed after anonymization and that the operator
   name does not reveal the computational logic.

6. **Transform the operator implementation:** Transform the operator
   code by encrypting strings, adding redundant code, and applying
   other code obfuscation techniques, keeping the computational logic
   of the obfuscated operator consistent with the original while making
   it harder to understand. Different combinations of code obfuscation
   techniques may be applied to different operators. Besides equivalent
   code transformation, the obfuscated operators also implement some
   additional computational logic. For example, for an operator whose
   weights were scrambled in Step (4), the obfuscated operator also
   implements the inverse mapping of the weight noise, dynamically
   eliminating the noise during execution and ensuring that the
   computation result matches the original model. The generated
   obfuscated operators are then saved as a library file for subsequent
   operations.

7. **Deploy the model and operator library:** Deploy the obfuscated
   model and the corresponding operator library file on the target
   device.

8. **Load the obfuscated model:** Parse the obfuscated model file to
   obtain the graph expression of the model's computational logic, that
   is, the obfuscated computational graph produced by Steps (2), (3),
   and (4).

9. **Initialize the computational graph:** Initialize the computational
   graph to generate an execution task sequence. According to the
   security configuration, if runtime model security must be protected,
   the obfuscated graph is initialized directly, and each compute unit
   in the resulting sequence corresponds to the execution of one
   obfuscated or original operator. If protection is required only
   during model transfer and storage, the obfuscated graph is first
   restored in memory to the source graph, which is then initialized so
   that each unit corresponds to an original operator; this further
   reduces inference overhead.

10. **Execute inference tasks:** The model executes the compute units
    sequentially on the input of the AI application to obtain the
    inference result. If a compute unit corresponds to an obfuscated
    operator, the obfuscated operator library is invoked; otherwise,
    the original operator library is invoked.
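The weight-scrambling idea of Steps (4) and (6) can be sketched as
follows. This is a toy illustration, not the actual obfuscation scheme:
the affine map, the `Scramble`/`obf_dot` names, and the single shared
key are all my simplifications; a real obfuscator uses per-weight noise
and mapping functions.

```cpp
#include <cstddef>
#include <vector>

// Offline (Step 4): scramble each weight with an affine map w' = s*w + n.
struct Scramble { float s, n; };

std::vector<float> scramble(const std::vector<float>& w, Scramble k) {
  std::vector<float> out;
  for (float x : w) out.push_back(k.s * x + k.n);
  return out;
}

// Runtime (Step 6): the obfuscated operator applies the inverse map
// on the fly, so the computed result matches the original model while
// the stored weights never appear in plaintext.
float obf_dot(const std::vector<float>& in,
              const std::vector<float>& w_scrambled, Scramble k) {
  float acc = 0.f;
  for (std::size_t i = 0; i < in.size(); ++i)
    acc += in[i] * ((w_scrambled[i] - k.n) / k.s);
  return acc;
}
```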

[^1]: Scrambling refers to adding noise to the computational graph.
    Common methods include adding redundant nodes and edges and merging
    some subgraphs.

38 v1/en_chapters/chapter_model_deployment/summary.md (new file)
@@ -0,0 +1,38 @@

# Chapter Summary

1. Model deployment is constrained by factors including the model
   size, runtime memory usage, inference latency, and inference power
   consumption.

2. Models can be compressed offline using techniques such as
   quantization, pruning, and knowledge distillation. In addition, some
   model optimization techniques, such as operator fusion, can also
   reduce the model size, albeit to a lesser degree.

3. Runtime memory usage can be improved by optimizing the model size,
   the deployment framework size, and the temporary memory used at
   runtime. Methods for optimizing the model size were summarized
   above; making the framework code simpler and more modular shrinks
   the deployment framework; and memory pooling enables memory
   overcommitment, reducing the temporary memory used at runtime.

4. Model inference latency can be optimized from two directions. In
   the offline phase, the computation workload can be reduced using
   model optimization and compression. At runtime, improving inference
   parallelism and optimizing operator implementations helps maximize
   the utilization of the available computing power. Besides workload
   and computing power, the load/store overhead during inference also
   deserves attention.

5. Power consumption during inference can be reduced through offline
   model optimization and compression. By reducing the computational
   workload, these techniques lower power consumption as well, which
   coincides with the optimizations for inference latency.

6. In addition to optimizing the factors above, this chapter also
   discussed deployment security technologies such as model obfuscation
   and model encryption. Secure deployment protects the model assets of
   enterprises and prevents attackers from compromising the deployment
   environment by tampering with models.

54 v1/en_chapters/chapter_preface/index.md (new file)
@@ -0,0 +1,54 @@
# Preface
## Background
In 2020, I joined the School of Informatics at the University of Edinburgh, which is considered one of the birthplaces of Artificial Intelligence (AI) research. The university offers machine learning courses that cover a wide range of topics, including natural language processing, computer vision, and computational neuroscience. Additionally, the university is well-known for providing a complete series of fundamental courses on computer systems, such as operating systems, programming languages, compilers, and computer architecture. However, when I asked my students about how computer systems are utilized to deploy and accelerate computation in machine learning, many of them appeared puzzled. This led me to contemplate whether the University of Edinburgh, along with other universities worldwide, should expand their curricula by adding a course that bridges the gap between machine learning and computer systems.
Initially, my idea was to expand an existing course. At the time, the "AI Systems" course at the University of California, Berkeley was particularly popular. It explored various research directions in machine learning systems, with an emphasis on studying research papers. Unfortunately, many of these papers did not stand the test of time, and the course did not provide a comprehensive architectural overview of the knowledge. Consequently, students were unable to gain a complete understanding of the subject or learn how to construct a machine learning system from scratch. I then looked to other universities, where I discovered that the University of Washington offered a brief course called "Deep Learning Systems," which focused on the compilation process of machine learning programs. However, the course primarily centered around Apache TVM, a compiler stack for deep learning systems, and lacked a systematic introduction to machine learning systems. Stanford University also had a course in this area, "Machine Learning Systems Design," but it focused on topics such as data cleansing, management, and annotation, as databases were the course designer's primary expertise.
In my search for a suitable course, I expanded my scope to Microsoft Research Asia. Their "AI Systems" course seemed like the closest match to my expectations at the time, as it elaborated on the design concepts of machine learning systems. However, as I prepared to teach it to undergraduates, I realized that it provided only a general introduction to the core design concepts of machine learning systems and assumed students had a solid foundational knowledge of computer systems. It was better suited for doctoral students than undergraduates. In fact, all the courses I previously mentioned focused on studying research papers rather than on easily comprehensible textbooks that provide a clear knowledge map. Consequently, the materials involved in these courses were filled with scattered ideas, creating significant obstacles for students attempting to learn about machine learning systems.
On the flip side, 2020 was a year in which we saw the emergence of excellent course materials, providing fundamental knowledge about operating systems, databases, distributed systems, and even machine learning algorithms. However, it remained difficult to find a textbook that systematically introduces machine learning systems. Many enterprise and university labs needed to expend significant resources in order to train students and engineers from scratch and enhance their understanding of the fundamental architecture of machine learning systems. The absence of such textbooks presented a huge challenge in developing academic and industry talent. Against this backdrop, the idea of writing a textbook on machine learning systems began to take shape in my mind.
## Beginning
When I shared this idea with my friends, they recognized the immense value of writing such a textbook. However, the preparation and writing process involved could be a daunting uphill battle. My postdoctoral mentor advised me to focus on publishing high-impact papers at the beginning of my faculty career instead of spending significant amounts of time and energy on a book that may not even be published. Other professors preferred to revise existing textbooks rather than write new ones, particularly in the field of machine learning systems, which evolve rapidly through a process of trial and error. Even if a new book were published, it may become obsolete quickly due to technological advancements over time.
Despite encountering several obstacles, the idea of writing a textbook on machine learning systems did not fade away until I went to China for a holiday and spoke with Xuefeng Jin, the architect of MindSpore. We first met in London around Christmas time in 2019 when he was leading the development of MindSpore 1.0, which had yet to be launched. We became acquainted through our mutual interest in the development of machine learning systems. In 2018, I co-built a new machine learning framework from scratch, similar to PyTorch, with my colleagues. Although the project ended due to insufficient resources, the experience motivated me to publish several papers on machine learning systems. Xuefeng and I both recognized how challenging it was to develop AI systems and to find experts in machine learning system development. Students often focused more on machine learning algorithms and had only a superficial understanding of key system design principles. They did not realize the significance of these principles until they applied machine learning technologies in practice, but by that point, it was too late to learn them. I shared my idea with Xuefeng about writing a textbook on machine learning systems and anticipated that it might take three to four years to complete. Xuefeng had a similar idea and asked whether he could assist in any way.
|
||||
|
||||
Xuefeng's offer was enlightening. I started asking myself: why not break the conventional pattern of book writing, which follows the chronicle of discipline development over years by one or two professors. This pattern is similar to the waterfall model in traditional software development, but with technological advancements, software development has evolved to open-source agile development. Therefore, why should book writing follow the outdated approach? A good example of this is the \emph{Deep Dive into Deep Learning} book, compiled by the MXNet open-source community. I immediately invited Hao Dong, an assistant professor at Peking University and co-founder of the TensorLayer open-source community, to collaborate with us. Excited about this prospect, Xuefeng invited his colleague, Zhiliang Gan, to join us. We were committed to creating a new textbook and finally settled down to writing.
|
||||
|
||||
After several rounds of discussion, we named the book **Machine Learning Systems: Design and Implementation**. Our intention was to introduce the time-tested design principles of machine learning systems and share a wealth of system implementation experience, so that students could learn how to analyze and solve problems in future work and scientific research.
|
||||
|
||||
## Community Building
Since the field of machine learning systems is an evolving discipline that continually nurtures a variety of research subjects, I pondered how to create an author community that would ensure the book's sustainability. As my research expertise focuses on large-scale software systems, I chose to build the community by referencing several key design points of distributed systems, as follows:

- **Prevention of single-point failure or bottleneck:** Modern distributed systems are typically designed to separate the control plane from the data plane to avoid single-point failures or bottlenecks. To ensure the sustainability of the book, we followed this approach and designed a highly scalable writing community using a distributed mechanism. The editor spent most of their time searching for excellent, proactive, and responsible chapter owners. Chapter owners then collaborated with other authors to facilitate writing progress on a per-chapter basis, communicating with chapter authors about writing details and adhering to given deadlines. The editor and chapter owners held weekly meetings to synchronize writing progress and ensure that chapter content met the quality expectations of the editor and the community.

- **Iterative improvement:** The stochastic gradient descent (SGD) optimization algorithm in deep learning uses local gradients to perform numerous iterations on complex problems and find locally optimal solutions. I applied the same principle when designing the iterative improvement process for the book's quality. Similar to choosing initial parameters, we drafted the first edition of the book on Overleaf. Then, we organized the content into a standard Git code repository and established a mechanism to encourage readers and community members to file issues and pull requests (PRs) on GitHub. We also set up comprehensive book building tools, continuous integration tools, and contributor seminars. This enabled us to continually improve the book's quality, iterating toward an optimum much as SGD does in machine learning.

- **High availability:** We established a 24/7 online writing platform for participants to develop the book and receive feedback from the community in any time zone and language around the world. The Git repository was hosted on GitHub and mirrored on Gitee to ensure high availability of the writing platform.

- **Content neutralization:** In a distributed system, the equal treatment of each node is crucial for long-term operation, as it allows for a unified approach to rectifying issues. Similarly, in writing a book, we must anticipate potential challenges such as outdated designs or the departure of writers, and mitigate them through collaboration among participants from diverse backgrounds. We emphasized the importance of creating neutral, objective, and inclusive content, and ensured that any issues that arose did not impede progress.
## Current Situation and Future Outlook
With the established mechanism, writing progressed smoothly and more participants joined the project. My former students Xiulong Yuan, Zihan Ding, Yao Fu, Jie Ren, and Wenteng Liang were also dedicated to writing and editing this book. Jiarong Han and Cheng Lai from Peng Cheng Laboratory, along with numerous MindSpore developers, all made significant contributions to the book. Many senior designers of machine learning systems also held discussions with us through various channels and provided valuable feedback, and many top minds from academia and industry shared their thoughts with us. Talented students from around the world participated in the writing, including Jiankai Sun from Stanford University, Peiyuan Liao from Carnegie Mellon University, Hanchen Wang from Cambridge University, and Pei Mu from the University of Edinburgh. Kaiyan Xiao, a machine learning expert from GlaxoSmithKline PLC, also became one of the authors. Furthermore, professors Peter Pietzuch from Imperial College London and Lei Chen from Hong Kong University of Science and Technology, among others, provided continuous writing advice to enhance the book's quality.

After we implemented this "distributed system" for book writing, the book's quality improved continually. When we released the book as an open-source project, the number of participants rapidly increased, which came as a major surprise to us. Driven by the open-source community, both the English and Chinese versions of the book have advanced steadily. This was the first time that I realized the huge benefit of applying the ideas of distributed systems and the knowledge of machine learning to solving complex problems in real life.

A single tree is too weak to withstand a sandstorm. Similarly, it was the forest of friends and the power of the community that gave us the courage to take the very first and crucial step in writing this book. I hope that this way of thinking can inspire and help in finding solutions to other complex problems.

By May 2022, the core authors and editors (Luo Mai, Hao Dong, Xuefeng Jin, and Zhiliang Gan), the book coordinator (Zhipeng Tan), and the following contributors have endeavored to create this book: **Introduction** (Luo Mai, Hao Dong, and Zhiliang Gan), **Programming Model** (Cheng Lai, Luo Mai, and Hao Dong), **Computational Graph** (Jiarong Han, Luo Mai, and Hao Dong), **AI Compiler and Frontend Technology** (Zhibo Liang, Qinghua Zhang, Bingjian Huang, Jianfeng Yu, and Zhiliang Gan), **AI Compiler Backend and Runtime** (Jinjin Chu, Pei Mu, and Fubi Cai), **Hardware Accelerator** (Renwei Zhang, Jie Ren, Wenteng Liang, Chao Liu, Gang Chen, and Mingqi Li), **Data Processing** (Xiulong Yuan), **Model Deployment** (Gangqiang Han, Yehui Tang, Zhiqiang Zhai, and Shanni Li), **Distributed Training** (Luo Mai and Peiyuan Liao), **Federated Learning System** (Tiancheng Wu and Hanchen Wang), **Recommender System** (Yao Fu, Bei Pei, and Luo Mai), **Reinforcement Learning System** (Zihan Ding), **Explainable AI System** (Haoyang Li and Xiaohui Li), and **Robotic System** (Jiankai Sun and Kaiyan Xiao).

We welcome new contributors to help improve and expand the book's content. If you're interested, please contact us through our book's [OpenMLSys Community](https://openmlsys.github.io/html-en/). Let's work together to create a machine learning systems book that advances the world.

Luo Mai

Edinburgh, United Kingdom

4th May 2022
13 v1/en_chapters/chapter_preface_advanced/index.md (new file)
@@ -0,0 +1,13 @@
# Part I Framework Design
:label:`part-i-framework-design`

In Part I, we present a top-down approach to designing a machine learning framework. We begin by introducing the design of programming models for machine learning frameworks, followed by a discussion on representing a machine learning program as a computational graph. The machine learning program undergoes compilation by an AI compiler, which employs a range of frontend and backend techniques. Additionally, we delve into the system components within a machine learning framework that facilitate data processing, model deployment, and distributed training.
7 v1/en_chapters/chapter_preface_extension/index.md (new file)
@@ -0,0 +1,7 @@
# Part II Application Scenarios
:label:`part-ii-application-scenarios`

In Part II, we introduce various scenarios of applying machine learning frameworks. These scenarios include federated learning systems, recommender systems, reinforcement learning systems, and robotic systems.
@@ -0,0 +1,98 @@
# Bridging Python and C/C++ Functions

Developers frequently need to incorporate custom operators into a machine learning framework. These operators implement new models, optimizers, data processing functions, and more. Custom operators often require implementation in C/C++ to achieve optimal performance, while also exposing Python interfaces so that developers can integrate them with existing machine learning workflows written in Python. This section delves into the implementation details of this process.

The Python interpreter, being implemented in C, enables the invocation of C and C++ functions within Python. Contemporary machine learning frameworks such as TensorFlow, PyTorch, and MindSpore rely on pybind11 to automatically generate Python functions from underlying C and C++ functions. This mechanism is known as *Python binding*. Prior to the advent of pybind11, Python binding was accomplished using one of the following approaches:

1. **C-APIs in Python**: This approach requires including `Python.h` in C++ programs and using Python's C-APIs to execute Python operations. To work effectively with the C-APIs, developers must possess a comprehensive understanding of Python's internal implementation, such as managing reference counting.

2. **Simplified Wrapper and Interface Generator (SWIG)**: SWIG serves as a bridge between C/C++ code and Python, and it played a significant role in the initial development of TensorFlow. Using SWIG involves crafting intricate interface statements and relying on SWIG to automatically generate C code that interfaces with Python's C-APIs. However, because the generated code lacks readability, its maintenance costs tend to be high.

3. **Python `ctypes` module**: This module provides a comprehensive range of C-language types and allows direct invocation of dynamic link libraries (DLLs). However, its heavy reliance on native C types results in insufficient support for customized types.

4. **Cython**: In basic terms, Cython can be described as the fusion of Python syntax with static types from the C language. It retains Python's syntax while automatically translating Cython functions into C/C++ code, empowering developers to seamlessly incorporate invocations of C/C++ functions.

5. **Boost::Python (a C++ library)**: Boost::Python allows for the exposure of C++ functions as Python functions. It operates on similar principles to Python's C-APIs but provides a more user-friendly interface. However, its reliance on the Boost library introduces a significant dependency on third-party components, which can be a potential drawback.

In comparison to the above Python binding approaches, pybind11 is similar to Boost::Python in terms of simplicity and usability, but it stands out for its focus on C++11 and its elimination of the Boost dependency. As a lightweight Python library, pybind11 is particularly suitable for exposing numerous Python functions in complex C++ projects such as the machine learning systems discussed in this book. The combination of Code `ch02/code2.5.1` and Code `ch02/code2.5.2` is an example of adding a custom operator to PyTorch with the integration of C++ and Python.

In C++:
**ch02/code2.5.1**
```cpp
// custom_add.cpp
#include <torch/extension.h>
#include <pybind11/pybind11.h>

torch::Tensor custom_add(torch::Tensor a, torch::Tensor b) {
  return a + b;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("custom_add", &custom_add, "A custom add function");
}
```
In Python:

**ch02/code2.5.2**
```python
import torch
from torch.utils.cpp_extension import load

# Load the C++ extension
custom_extension = load(
    name='custom_extension',
    sources=['custom_add.cpp'],
    verbose=True
)

# Use your custom add function
a = torch.randn(10)
b = torch.randn(10)
c = custom_extension.custom_add(a, b)
```
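For contrast with the pybind11 workflow above, the `ctypes` approach from the earlier list needs no build step when the target symbols already live in a shared library. The sketch below is an illustration added here (not from the original text): it calls the C standard library's `abs` directly; binding a custom library such as one compiled from `custom_add.cpp` would follow the same `CDLL`/`argtypes` pattern.

```python
import ctypes
import ctypes.util

# Locate and load the C standard library (path differs per platform).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature so ctypes marshals arguments and results correctly;
# without this, ctypes falls back to treating everything as a C int anyway,
# but being explicit is good practice for non-trivial signatures.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

result = libc.abs(-42)
print(result)  # 42
```

Note how every type must be spelled out by hand; this is the "heavy reliance on native C types" limitation mentioned in the list above.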
@@ -0,0 +1,87 @@
# Overview

With the advent of machine learning systems, the design of user-friendly and high-performance APIs has become a paramount concern for system designers. In the early stages of machine learning frameworks (as depicted in Figure :numref:`ch03/framework_development_history`), developers often opted for high-level programming languages like Lua (Torch) and Python (Theano) to write machine learning programs. These frameworks offered essential functions, including model definition and automatic differentiation, which are integral to machine learning. They were particularly well suited for creating small-scale machine learning applications targeted toward scientific research purposes.
<figure id="fig:ch03/framework_development_history">
<embed src="../img/ch03/framework_development_history.pdf" />
<figcaption>Evolution of Machine Learning Programming Frameworks: A Historical Perspective</figcaption>
</figure>
The rapid advancement of deep neural networks (DNNs) since 2011 has sparked groundbreaking achievements in various AI application domains, such as computer vision, speech recognition, and natural language processing. However, training DNNs requires substantial computational power, and earlier frameworks like Torch (primarily using Lua) and Theano (mainly using Python) were unable to fully harness it. Meanwhile, general-purpose APIs for computational accelerators, such as CUDA C for NVIDIA GPUs, had matured, and multi-thread libraries like POSIX Threads built on CPU multi-core technology had gained popularity among developers. Consequently, many machine learning users sought to develop high-performance deep learning applications in C/C++. These requirements led to the emergence of frameworks like Caffe, which employed C/C++ as their core APIs.

However, customization of machine learning models is often necessary to suit specific deployment scenarios, data types, identification tasks, and so on. This customization typically falls on the shoulders of AI application developers, who come from diverse backgrounds and may not fully leverage the capabilities of C/C++. This became a significant bottleneck that hindered the widespread adoption of programming frameworks like Caffe, which heavily relied on C/C++.

In late 2015, Google introduced TensorFlow, which revolutionized the landscape. In contrast to Torch, TensorFlow adopted a design where the frontend and backend were relatively independent. The frontend, presented to users, utilized the high-level programming language Python, while the high-performance backend was implemented in C/C++. TensorFlow provided numerous Python-based frontend APIs, gaining wide acceptance among data scientists and machine learning researchers. It seamlessly integrated into Python-dominated big data ecosystems, benefiting from various big data development libraries such as NumPy, Pandas, SciPy, Matplotlib, and PySpark. Python's exceptional interoperability with C/C++, as demonstrated in multiple Python libraries, further enhanced TensorFlow's appeal. Consequently, TensorFlow combined the flexibility and ecosystem of Python with the high-performance capabilities of its C/C++ backend. This design philosophy was inherited by subsequent frameworks like PyTorch, MindSpore, and PaddlePaddle.

Subsequently, prominent enterprises worldwide began favoring open-source machine learning frameworks, leading to the emergence of Keras and TensorLayerX. These high-level libraries significantly expedited the development of machine learning applications. They provided Python APIs that allowed quick importing of existing models, and these high-level APIs were decoupled from the intricate implementation details of specific machine learning frameworks. As a result, Keras and TensorLayerX could be utilized across different machine learning frameworks.

While deep neural networks continued to evolve, new challenges surfaced regarding the APIs of machine learning frameworks. Around 2020, novel frameworks like MindSpore and JAX emerged to tackle these challenges. MindSpore, in addition to inheriting the hybrid interfaces (Python and C/C++) from TensorFlow and PyTorch, expanded the scope of machine learning programming models. This expansion facilitated efficient support for a diverse range of AI backend chips, including NVIDIA GPU, Huawei Ascend, and ARM, so that machine learning applications can be swiftly deployed across a wide array of heterogeneous devices.

Simultaneously, the proliferation of ultra-large datasets and ultra-large DNNs made distributed execution a fundamental design requirement for machine learning programming frameworks. However, implementing distributed execution in TensorFlow and PyTorch required developers to write substantial amounts of code for allocating datasets and DNNs across distributed nodes, yet many AI developers are not well versed in distributed programming. In this regard, JAX and MindSpore significantly improve the situation by allowing a program written for a single node to execute seamlessly across many nodes.
49 v1/en_chapters/chapter_programming_interface/index.md (new file)
@@ -0,0 +1,49 @@
# Programming Model

Machine learning frameworks comprise various components that facilitate the efficient development of algorithms, data processing, model deployment, performance optimization, and hardware acceleration. When designing the application programming interfaces (APIs) for these components, a key consideration is striking the right balance between framework performance and usability. To achieve optimal performance, developers utilize C or C++, as these programming languages enable efficient invocation of the APIs provided by the operating system and hardware accelerators.

Regarding usability, machine learning framework users, including data scientists, biologists, chemists, and physicists, often possess strong domain backgrounds and are skilled in high-level scripting languages like Python, MATLAB, R, and Julia. While these languages offer remarkable programming usability, they lack the deep optimization capabilities for underlying hardware and operating systems that C and C++ provide. Therefore, the core design objective of machine learning frameworks encompasses two aspects: providing easy-to-use APIs for implementing algorithms in high-level languages like Python, and providing low-level APIs centered around C and C++ to help framework developers implement numerous high-performance components and execute them efficiently on hardware. This chapter describes strategies for achieving this design objective.

The chapter aims to achieve the following learning objectives:

1. Understanding the workflows and programming principles of machine learning frameworks.

2. Understanding the design of neural network models and layers.

3. Understanding how machine learning frameworks bridge Python and C/C++ functions.

4. Understanding the support for functional programming in machine learning frameworks.

```toc
:maxdepth: 2

Overview
Machine_Learning_Workflow
Neural_Network_Programming
Functional_Programming
Bridging_Python_and_C_C++_Functions
Chapter_Summary
```
@@ -0,0 +1,115 @@
# Functional Programming

In the following, we discuss the reasons behind the growing trend of incorporating functional programming into the design of machine learning frameworks.

## Benefits of Functional Programming

Training constitutes the most critical phase in machine learning, and the manner in which training is expressed hinges significantly on optimizer algorithms. Predominantly, contemporary machine learning tasks utilize first-order optimizers, favored for their ease of use. With machine learning advancing at a rapid pace, both software and hardware are constantly updated to keep up. Consequently, an increasing number of researchers are beginning to investigate higher-order optimizers, noted for their superior convergence performance. Frequently used second-order optimizers, such as the Newton method, quasi-Newton methods, and AdaHessian, necessitate the computation of a Hessian matrix incorporating second-order derivative information. Two considerable challenges arise from this computation: 1) how to manage such a heavy computational load efficiently; 2) how to express higher-order derivatives in a programming language.

In recent years, numerous large AI models have been introduced, including (with the number of parameters noted in parentheses) OpenAI GPT-3 (175B) in 2020; PanGu (100B), PanGu-$\alpha$ (200B), Google's Switch Transformer (1.6T), and WuDao (1.75T) in 2021; along with Facebook's NLLB-200 (54B) in 2022. The demand for ultra-large model training is escalating, and data parallelism alone cannot meet this growing requirement. Conversely, model parallelism demands manual model segmentation, a process that is time-intensive and laborious. Consequently, the main challenge future machine learning frameworks must overcome is how to actualize automatic parallelism. At its core, a machine learning model is a representation of a mathematical model. Hence, the ability to succinctly represent machine learning models has become a key concern in the design of programming paradigms for machine learning frameworks.

Recognizing these challenges in the practical implementation of machine learning frameworks, researchers have identified that functional programming could offer beneficial solutions. Functional programming, in computer science, is a programming paradigm that envisions computation as the evaluation of mathematical functions, actively avoiding state changes and data mutations. This paradigm harmonizes well with mathematical reasoning. Neural networks are composed of interconnected nodes, with each node performing basic mathematical operations. Functional programming languages allow developers to express these mathematical operations in code that closely mirrors them, enhancing the readability and maintainability of programs. Concurrently, in functional languages, functions are kept free of side effects, simplifying the management of concurrency and parallelism.

In summary, functional programming is anticipated to confer the following benefits on machine learning frameworks:

1. It suits machine learning scenarios where higher-order derivatives are needed.

2. It simplifies the development of parallel programming interfaces.

3. It yields a more concise code representation.
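To make the second-order challenge above concrete, here is a small, framework-free sketch (added for illustration, not from the original text) that approximates the Hessian of a scalar function with central finite differences, the kind of matrix a Newton-style optimizer must form or approximate:

```python
def hessian(f, x, eps=1e-4):
    """Approximate the Hessian of scalar function f at point x (a list of floats)."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Central second difference estimating d^2 f / (dx_i dx_j)
            xpp = list(x); xpp[i] += eps; xpp[j] += eps
            xpm = list(x); xpm[i] += eps; xpm[j] -= eps
            xmp = list(x); xmp[i] -= eps; xmp[j] += eps
            xmm = list(x); xmm[i] -= eps; xmm[j] -= eps
            H[i][j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * eps * eps)
    return H

# f(x, y) = x^2 * y has analytic Hessian [[2y, 2x], [2x, 0]].
f = lambda v: v[0] ** 2 * v[1]
H = hessian(f, [1.0, 2.0])  # approximately [[4, 2], [2, 0]]
```

The nested loop costs $O(n^2)$ function evaluations per Hessian, which for models with millions of parameters is exactly the "heavy computational load" challenge noted above; automatic differentiation of higher-order derivatives, the second challenge, is what functional frameworks address.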
## Framework Support for Functional Programming

Machine learning frameworks increasingly support functional programming. In 2018, Google rolled out JAX. Unlike traditional machine learning frameworks, JAX amalgamates neural network computation and numerical computation, and its interfaces are compatible with native data science interfaces in Python, such as NumPy and SciPy. Moreover, JAX provides distribution, vectorization, higher-order differentiation, and hardware acceleration in a functional programming style, characterized by lambda closures and freedom from side effects.

In 2020, Huawei introduced MindSpore, whose functional differential programming architecture allows users to concentrate on the native mathematical expressions of machine learning models. In 2022, taking inspiration from Google's JAX, PyTorch launched functorch, a library that provides composable vmap (vectorization) and autodiff transforms compatible with PyTorch modules and PyTorch autograd, thereby achieving excellent eager-mode performance. It can be inferred that functorch meets the requirements for distributed parallelism in PyTorch static graphs. Code `ch02/code2.4` gives an example of functorch.
**ch02/code2.4**
```python
# Assumes `MLP`, `data`, `num_models`, and `device` are defined elsewhere.
from functorch import combine_state_for_ensemble, vmap

minibatches = data[:num_models]
models = [MLP().to(device) for _ in range(num_models)]
fmodel, params, buffers = combine_state_for_ensemble(models)
predictions1_vmap = vmap(fmodel, out_dims=1)(params, buffers, minibatches)
```
Functorch introduces *vmap*, standing for "vectorized map". Its role is to adapt functions designed for individual inputs so that they can handle batches of inputs, facilitating efficient vectorized calculations. Unlike the batch processing capabilities of standard PyTorch modules, vmap can make any operation batch-aware without altering the operation's original structure. Moreover, vmap offers greater flexibility over batch dimensions, allowing users to specify which dimension should be treated as the batch dimension (via the `out_dims` argument), in contrast to standard PyTorch, where the first dimension is usually the batch dimension.
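The idea can be mimicked in plain Python. The toy sketch below is an illustration added here (functorch's real vmap operates on tensors, not lists): it lifts a per-example function to a batched one and lets the caller pick the output batch dimension, analogous to `out_dims`.

```python
def toy_vmap(fn, out_dim=0):
    """Lift fn (single example -> list of features) to operate on a whole batch."""
    def batched(batch):
        outs = [fn(x) for x in batch]  # apply fn to each example
        if out_dim == 0:
            return outs  # batch stays the leading dimension
        # out_dim == 1: transpose so the batch becomes the second dimension
        return [list(col) for col in zip(*outs)]
    return batched

fn = lambda x: [x * x, x + 1]  # per-example computation
batched_fn = toy_vmap(fn, out_dim=1)
result = batched_fn([1, 2, 3])  # [[1, 4, 9], [2, 3, 4]]
```

With `out_dim=0` the result would instead be `[[1, 2], [4, 3], [9, 4]]`, one row per example, which mirrors how `out_dims` repositions the batch axis in functorch.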
By tracing the development of machine learning frameworks, it becomes evident that the functional programming paradigm has become increasingly popular. This can be attributed to functional programming's ability to express machine learning models intuitively and its convenience for implementing automatic differentiation, higher-order differentiation, and parallel execution. Consequently, future machine learning frameworks are likely to adopt layered frontend interfaces that are not exclusively designed for machine learning scenarios. Instead, they will primarily offer differential programming in their abstraction designs, making gradient-based software easy to develop for various applications.
129 v1/en_chapters/chapter_programming_interface/ml_workflow.md (new file)
@@ -0,0 +1,129 @@
# Machine Learning Workflow

In machine learning systems, the fundamental design objective of programming models is to offer comprehensive workflow programming support for developers. A typical machine learning task adheres to the workflow depicted in Figure :numref:`ch03/workflow`. This workflow involves loading the training dataset, training, testing, and debugging models. The following APIs are defined to facilitate customization within the workflow (assuming that high-level APIs are provided as Python functions):

1. **Data Processing API:** Users first require a data processing API to read datasets from a disk. Subsequently, they need to preprocess the data to make it suitable for input into machine learning models. Code `ch02/code2.2.1` is an example of how PyTorch can be used to load data and create data loaders for both training and testing purposes.
**ch02/code2.2.1**
```python
import pickle
from torch.utils.data import Dataset, DataLoader

data_path = '/path/to/data'
with open(data_path, 'rb') as f:
    dataset = pickle.load(f)  # Example for a pkl file
batch_size = ...  # You can make it an argument of the script

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        label = self.labels[idx]
        return sample, label

training_dataset = CustomDataset(dataset['training_data'], dataset['training_labels'])
testing_dataset = CustomDataset(dataset['testing_data'], dataset['testing_labels'])

training_dataloader = DataLoader(training_dataset, batch_size=batch_size, shuffle=True)  # Create a training dataloader
testing_dataloader = DataLoader(testing_dataset, batch_size=batch_size, shuffle=False)  # Create a testing dataloader
```
2. **Model Definition API:** Once the data is preprocessed, users need a model definition API to define machine learning models. These models include model parameters and can perform inference based on given data. Code `ch02/code2.2.2` is an example of how to create a custom model in PyTorch:
**ch02/code2.2.2**
|
||||
```python
|
||||
import torch.nn as nn
|
||||
class CustomModel(nn.Module):
|
||||
def __init__(self, input_size, output_size):
|
||||
super(CustomModel, self).__init__()
|
||||
self.linear = nn.Linear(input_size, output_size) # A single linear layer
|
||||
|
||||
def forward(self, x):
|
||||
return self.linear(x)
|
||||
```
|
||||
|
||||
3. **Optimizer Definition API:** The outputs of models need to be
    compared with user labels, and their difference is evaluated using a
    loss function. The optimizer definition API enables users to define
    their own loss functions and import or define optimization
    algorithms based on the actual loss. These algorithms calculate
    gradients and update model parameters. Code `ch02/code2.2.3` is an
    example of an optimizer definition in PyTorch:

**ch02/code2.2.3**
```python
import torch.nn as nn
import torch.optim as optim

model = CustomModel(...)
# Optimizer definition (Adam, SGD, etc.)
# Note: unlike SGD, Adam does not take a momentum argument.
optimizer = optim.Adam(model.parameters(), lr=1e-4)
loss = nn.CrossEntropyLoss()  # Loss function definition
```
4. **Training API:** Given a dataset, model, loss function, and
    optimizer, users require a training API to define a loop that reads
    data from datasets in a mini-batch mode. In this process, gradients
    are computed repeatedly, and model parameters are updated
    accordingly. This iterative update process is known as *training*.
    Code `ch02/code2.2.4` is an example of how to train a model in
    PyTorch:

**ch02/code2.2.4**
```python
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"  # Select your training device
model.to(device)  # Move the model to the training device
model.train()  # Set the model to train mode
epochs = ...  # You can make it an argument of the script
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(training_dataloader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()  # Zero the parameter gradients
        output = model(data)  # Forward pass
        loss_value = loss(output, target)  # Compute the loss
        loss_value.backward()  # Backpropagation
        optimizer.step()  # Update the model parameters
```
5. **Testing and Debugging APIs:** Throughout the training process,
    users need a testing API to evaluate the accuracy of the model
    (training concludes when the accuracy exceeds the set goal).
    Additionally, a debugging API is necessary to verify the performance
    and correctness of the model. Code `ch02/code2.2.5` is an example of
    model evaluation in PyTorch:

**ch02/code2.2.5**
```python
model.eval()  # Set the model to evaluation mode
overall_accuracy = []
with torch.no_grad():  # Disable gradient tracking during evaluation
    for batch_idx, (data, target) in enumerate(testing_dataloader):
        data, target = data.to(device), target.to(device)
        output = model(data)  # Forward pass
        accuracy = your_metrics(output, target)  # Compute the accuracy with your own metric
        overall_accuracy.append(accuracy)  # Collect the accuracy
# For debugging, you can print logs inside the training or evaluation loop, or use the Python debugger.
```
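The evaluation snippet above leaves `your_metrics` as a placeholder for a user-defined metric. As a minimal sketch (the function name and signature are illustrative, not part of any library), a top-1 classification accuracy metric could look like this:

```python
import torch


def your_metrics(output, target):
    """Top-1 accuracy: the fraction of samples whose highest-scoring
    class matches the ground-truth label."""
    predictions = output.argmax(dim=1)  # Index of the largest logit per sample
    return (predictions == target).float().mean().item()
```

Averaging these per-batch values gives an estimate of the test accuracy; for the exact accuracy over the whole test set, weight each batch's value by its batch size before averaging.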

|
||||
:label:`ch03/workflow`
|
||||