From cb90db338bddfd5407c5d420320d90c669edb452 Mon Sep 17 00:00:00 2001 From: Alden <83172530+AldenWangExis@users.noreply.github.com> Date: Sat, 21 Feb 2026 11:19:07 +0800 Subject: [PATCH] [UPDATE] add CMU 15-779 course pages (#834) Add CMU 15-779 (LLM systems) notes in CN/EN and link them from the deep generative models roadmap. --- docs/深度生成模型/roadmap.en.md | 2 ++ docs/深度生成模型/roadmap.md | 2 ++ docs/深度生成模型/大语言模型/CMU15-779.en.md | 35 ++++++++++++++++++++ docs/深度生成模型/大语言模型/CMU15-779.md | 35 ++++++++++++++++++++ 4 files changed, 74 insertions(+) create mode 100644 docs/深度生成模型/大语言模型/CMU15-779.en.md create mode 100644 docs/深度生成模型/大语言模型/CMU15-779.md diff --git a/docs/深度生成模型/roadmap.en.md b/docs/深度生成模型/roadmap.en.md index e736d7f9..d85305af 100644 --- a/docs/深度生成模型/roadmap.en.md +++ b/docs/深度生成模型/roadmap.en.md @@ -18,6 +18,8 @@ The GPT series by OpenAI has demonstrated remarkable performance under the guida - [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): As the title suggests, this course teaches you to build all the core components of an LLM from scratch, such as the tokenizer, model architecture, training optimizer, low-level operators, data cleaning, and post-training algorithms. Each assignment has a 40-50 page PDF handout—very rigorous. Highly recommended if you want to fully understand every low-level detail of LLMs. +- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): A systems and performance-oriented course that explains how high-level models are decomposed into low-level kernels and executed efficiently on heterogeneous accelerators and in distributed environments. Topics include CUDA, ML compilation, graph-level optimizations, auto-parallelization, and LLM serving/inference acceleration, along with weekly paper reviews and a final systems project. 
+ - [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): This CMU course focuses on system-level optimization of LLMs, including GPU acceleration, distributed training/inference, and cutting-edge techniques. Great for students in systems research to gain a holistic understanding of the field. (Disclosure: One of my papers on PD decoupling is included in the syllabus, hence the personal recommendation.) Assignments involve implementing a mini-PyTorch framework and then building system-level LLM optimizations on top of it. - [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) and [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): Compared to the previous two, these courses focus more on higher-level algorithms and applications. Each lecture includes many recommended readings, making them suitable for gaining a broad understanding of LLM research frontiers. You can then dive deeper into any subfield that interests you based on the reading materials. 
diff --git a/docs/深度生成模型/roadmap.md b/docs/深度生成模型/roadmap.md index 1dbc139a..275b1252 100644 --- a/docs/深度生成模型/roadmap.md +++ b/docs/深度生成模型/roadmap.md @@ -18,6 +18,8 @@ OpenAI 的 GPT 系列让大语言模型在 Scaling Law 的指引下展现出惊 - [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): 正如课程标题写的,在这门课程中你将从头编写大语言模型的所有核心组件,例如 Tokenizer,模型架构,训练优化器,底层算子,训练数据清洗,后训练算法等等。每次作业的 handout 都有四五十页 pdf,相当硬核。如果你想充分吃透大语言模型的所有底层细节,那么非常推荐学习这门课程。 +- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): 偏系统与性能优化视角,重点讲清“高层模型如何分解成 kernel 并在异构加速器与分布式环境中高效执行”,覆盖 CUDA、ML 编译、图级优化、自动并行化、LLM Serving 与推理加速等内容,并配套按周论文阅读与期末系统项目。 + - [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): CMU 的大语言模型系统课程,侧重底层系统优化,例如 GPU 加速,分布式训练和推理,以及各种前沿技术。非常适合从事系统领域的同学对这个方向有个全貌性的了解。课表里还包含了一篇我发表的 PD 分离相关的文章,因此私心推荐一下。课程作业的话会让你先实现一个迷你 PyTorch,然后在上面实现各种大语言模型的系统级优化。 - [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) 和 [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): 和前两门课相比,这两门课更偏重上层算法和应用,而且每节课都列举了很多相关阅读材料,适合对大语言模型发展前沿的各个方向都有个粗糙的认识,如果对某个子领域感兴趣的话再循着参考资料深入学习。 diff --git a/docs/深度生成模型/大语言模型/CMU15-779.en.md b/docs/深度生成模型/大语言模型/CMU15-779.en.md new file mode 100644 index 00000000..54808a42 --- /dev/null +++ b/docs/深度生成模型/大语言模型/CMU15-779.en.md @@ -0,0 +1,35 @@ +# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition) + +## Course Overview + +- University: Carnegie Mellon University +- Prerequisites: No strict prerequisites; an intro ML background and hands-on deep learning training experience are recommended; familiarity with PyTorch helps; basic CUDA/GPU knowledge will significantly ease the learning curve +- Programming Language: Python (systems and kernel-level topics involve CUDA/hardware concepts) +- Course Difficulty: 🌟🌟🌟🌟 +- Estimated Study Hours: 80-120 hours + +This course takes a systems-first view
of modern machine learning and LLM infrastructure. The core question it repeatedly answers is: how does a model written in a high-level framework (e.g., PyTorch) get decomposed into low-level kernels, and how is it executed efficiently on heterogeneous accelerators (GPUs/TPUs) and in distributed environments? The syllabus covers GPU programming, ML compilers, graph-level optimizations, distributed training and auto-parallelization, and LLM serving and inference acceleration. It is a strong fit if you want to connect “framework-level experience” with “kernels, compilation, hardware, and cluster execution.” + +The workload is organized around recurring pre-lecture reading assignments (paper reviews) and a team-based final course project (proposal, presentation, report). For self-study, it is best to follow the schedule week by week rather than treating it as a slide-only course. + +## Topics Covered + +The course is structured as lectures, with major themes including: + +1. ML systems fundamentals via TensorFlow/PyTorch (abstractions, execution models) +2. GPU architecture and CUDA programming (memory, performance tuning) +3. Transformer and attention case studies (FlashAttention and IO-aware attention) +4. Advanced CUDA techniques (warp specialization, mega kernels) +5. ML compilation (tile-based DSLs like Triton, kernel auto-tuning such as Ansor, graph-level optimizations such as TASO/PET, superoptimization such as Mirage) +6. Parallelization and distributed training (ZeRO/FSDP, model/pipeline parallelism, auto-parallelization such as Alpa) +7. LLM serving and inference (batching, PagedAttention, RadixAttention, speculative decoding) +8.
Post-training and architectures (PEFT like LoRA/QLoRA, MoE architectures/kernels/parallelism) + +## Course Resources + +- Course Website: +- Schedule (slides and reading list per lecture): +- Slides (PDF): +- Logistics (grading, paper reviews, course project): +- Materials (intro deep learning materials): + diff --git a/docs/深度生成模型/大语言模型/CMU15-779.md b/docs/深度生成模型/大语言模型/CMU15-779.md new file mode 100644 index 00000000..b8a1fc05 --- /dev/null +++ b/docs/深度生成模型/大语言模型/CMU15-779.md @@ -0,0 +1,35 @@ +# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition) + +## 课程简介 + +- 所属大学:Carnegie Mellon University +- 先修要求:无硬性先修要求;建议具备机器学习入门与深度学习训练经验,熟悉 PyTorch;了解 CUDA/GPU 基础会显著提升学习效率 +- 编程语言:Python(系统与算子层面内容可能涉及 CUDA/硬件概念) +- 课程难度:🌟🌟🌟🌟 +- 预计学时:80-120 学时 + +这门课从系统视角出发,反复回答一个核心问题:一个用高层框架(例如 PyTorch)写出来的模型,是如何被分解为底层 kernel,并在异构硬件加速器(GPU/TPU)与分布式环境中高效执行的。课程覆盖 GPU 编程、ML 编译器、图级优化、分布式训练与自动并行化、LLM Serving 与推理加速等主题,强系统导向,适合希望把“框架层经验”向“算子/编译/硬件/集群执行”打通的人。 + +从教学组织上看,这门课会要求你持续完成课前论文阅读(paper review / reading assignments),并以小组形式完成期末系统类课程项目(proposal、presentation、report),因此自学时建议把它当成一个“按周推进的系统训练营”,而不是只看几份 slide。 + +## 课程内容 + +课程内容以 lecture 为主线,主题大致包括: + +1. ML 系统基础:以 TensorFlow/PyTorch 为例理解计算图、执行模型与系统抽象 +2. GPU 架构与 CUDA 编程:硬件与编程模型、内存与性能优化要点 +3. Transformer 与 Attention 案例:FlashAttention 等 IO-aware attention 优化思路 +4. 高级 CUDA 编程:warp specialization、mega kernel 等低延迟/高吞吐优化技术 +5. ML 编译:Tile-based DSL(Triton 等)、内核自动调优(Ansor 等)、图级优化(TASO/PET 等)、超优化(Mirage) +6. 并行化与分布式训练:ZeRO/FSDP、模型/流水线并行、自动并行化(Alpa 等) +7. LLM 推理与服务:批处理、PagedAttention、RadixAttention、推测解码等 +8. 后训练与模型结构:参数高效微调(LoRA/QLoRA)、MoE(架构、kernel、并行化) + +## 课程资源 + +- 课程网站: +- 课程安排(含每讲 slide 与阅读列表): +- 课程讲义(PDF slides): +- 课程规则与项目要求(Grading、Paper Review、Course Project): +- 预备材料(深度学习入门材料汇总): +