From cb90db338bddfd5407c5d420320d90c669edb452 Mon Sep 17 00:00:00 2001
From: Alden <83172530+AldenWangExis@users.noreply.github.com>
Date: Sat, 21 Feb 2026 11:19:07 +0800
Subject: [PATCH] [UPDATE] add CMU 15-779 course pages (#834)
Add CMU 15-779 (LLM systems) notes in CN/EN and link them from the deep generative models roadmap.
---
docs/深度生成模型/roadmap.en.md | 2 ++
docs/深度生成模型/roadmap.md | 2 ++
docs/深度生成模型/大语言模型/CMU15-779.en.md | 35 ++++++++++++++++++++
docs/深度生成模型/大语言模型/CMU15-779.md | 35 ++++++++++++++++++++
4 files changed, 74 insertions(+)
create mode 100644 docs/深度生成模型/大语言模型/CMU15-779.en.md
create mode 100644 docs/深度生成模型/大语言模型/CMU15-779.md
diff --git a/docs/深度生成模型/roadmap.en.md b/docs/深度生成模型/roadmap.en.md
index e736d7f9..d85305af 100644
--- a/docs/深度生成模型/roadmap.en.md
+++ b/docs/深度生成模型/roadmap.en.md
@@ -18,6 +18,8 @@ The GPT series by OpenAI has demonstrated remarkable performance under the guida
- [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): As the title suggests, this course teaches you to build all the core components of an LLM from scratch, such as the tokenizer, model architecture, training optimizer, low-level operators, data cleaning, and post-training algorithms. Each assignment has a 40-50 page PDF handout—very rigorous. Highly recommended if you want to fully understand every low-level detail of LLMs.
+- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): A systems- and performance-oriented course that explains how high-level models are decomposed into low-level kernels and executed efficiently on heterogeneous accelerators and in distributed environments. Topics include CUDA, ML compilation, graph-level optimizations, auto-parallelization, and LLM serving/inference acceleration, along with weekly paper reviews and a final systems project.
+
- [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): This CMU course focuses on system-level optimization of LLMs, including GPU acceleration, distributed training/inference, and cutting-edge techniques. Great for students in systems research to gain a holistic understanding of the field. (Disclosure: One of my papers on PD decoupling is included in the syllabus, hence the personal recommendation.) Assignments involve implementing a mini-PyTorch framework and then building system-level LLM optimizations on top of it.
- [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) and [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): Compared to the previous two, these courses focus more on higher-level algorithms and applications. Each lecture includes many recommended readings, making them suitable for gaining a broad understanding of LLM research frontiers. You can then dive deeper into any subfield that interests you based on the reading materials.
diff --git a/docs/深度生成模型/roadmap.md b/docs/深度生成模型/roadmap.md
index 1dbc139a..275b1252 100644
--- a/docs/深度生成模型/roadmap.md
+++ b/docs/深度生成模型/roadmap.md
@@ -18,6 +18,8 @@ OpenAI 的 GPT 系列让大语言模型在 Scaling Law 的指引下展现出惊
- [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): 正如课程标题写的,在这门课程中你将从头编写大语言模型的所有核心组件,例如 Tokenizer,模型架构,训练优化器,底层算子,训练数据清洗,后训练算法等等。每次作业的 handout 都有四五十页 pdf,相当硬核。如果你想充分吃透大语言模型的所有底层细节,那么非常推荐学习这门课程。
+- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): 偏系统与性能优化视角,重点讲清“高层模型如何分解成 kernel 并在异构加速器与分布式环境中高效执行”,覆盖 CUDA、ML 编译、图级优化、自动并行化、LLM Serving 与推理加速等内容,并配套按周论文阅读与期末系统项目。
+
- [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): CMU 的大语言模型系统课程,侧重底层系统优化,例如 GPU 加速,分布式训练和推理,以及各种前沿技术。非常适合从事系统领域的同学对这个方向有个全貌性的了解。课表里还包含了一篇我发表的 PD 分离相关的文章,因此私心推荐一下。课程作业的话会让你先实现一个迷你 PyTorch,然后在上面实现各种大语言模型的系统级优化。
- [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) 和 [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): 和前两门课相比,这两门课更偏重上层算法和应用,而且每节课都列举了很多相关阅读材料,适合对大语言模型发展前沿的各个方向都有个粗糙的认识,如果对某个子领域感兴趣的话再寻着参考资料深入学习。
diff --git a/docs/深度生成模型/大语言模型/CMU15-779.en.md b/docs/深度生成模型/大语言模型/CMU15-779.en.md
new file mode 100644
index 00000000..54808a42
--- /dev/null
+++ b/docs/深度生成模型/大语言模型/CMU15-779.en.md
@@ -0,0 +1,35 @@
+# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)
+
+## Course Overview
+
+- University: Carnegie Mellon University
+- Prerequisites: No hard prerequisites; an introductory ML background and hands-on deep learning training experience are recommended, as is familiarity with PyTorch; basic CUDA/GPU knowledge will make the course much easier to follow
+- Programming Language: Python (systems and kernel-level topics involve CUDA/hardware concepts)
+- Course Difficulty: 🌟🌟🌟🌟
+- Estimated Study Hours: 80-120 hours
+
+This course takes a systems-first view of modern machine learning and LLM infrastructure. The core question it returns to throughout is: how does a model written in a high-level framework (e.g., PyTorch) get decomposed into low-level kernels, and how is it executed efficiently on heterogeneous accelerators (GPUs/TPUs) and in distributed environments? The syllabus covers GPU programming, ML compilers, graph-level optimizations, distributed training and auto-parallelization, and LLM serving and inference acceleration. It is a strong fit if you want to connect framework-level experience with kernels, compilation, hardware, and cluster execution.
+
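As a minimal illustration of the high-level-graph-to-kernels pipeline the course is organized around, here is a toy executor that walks a computation graph in topological order and dispatches each node to a registered "kernel". All names here are illustrative stand-ins, not APIs of any real framework:

```python
# Toy computation-graph executor: a stand-in for how a framework
# lowers a high-level graph to a sequence of kernel calls.
# All names are illustrative, not from any real framework.
from typing import Callable

KERNELS: dict[str, Callable] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def execute(graph: list[tuple[str, str, list[str]]],
            inputs: dict[str, float]) -> dict[str, float]:
    """graph: list of (output_name, op, input_names) in topological order."""
    env = dict(inputs)
    for out, op, args in graph:
        # Dispatch each node to its "kernel" -- frameworks and compilers
        # do the same at vastly larger scale, with fusion and memory
        # planning layered on top.
        env[out] = KERNELS[op](*(env[a] for a in args))
    return env

# y = (x1 + x2) * x1
graph = [("t", "add", ["x1", "x2"]), ("y", "mul", ["t", "x1"])]
result = execute(graph, {"x1": 2.0, "x2": 3.0})
```
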
+The workload is organized around consistent pre-lecture reading assignments (paper reviews) and a team-based final course project (proposal, presentation, report). For self-study, it is best to follow the schedule week by week rather than treating it as a slide-only course.
+
+## Topics Covered
+
+The course is structured as lectures, with major themes including:
+
+1. ML systems fundamentals via TensorFlow/PyTorch (abstractions, execution models)
+2. GPU architecture and CUDA programming (memory, performance tuning)
+3. Transformer and attention case studies (FlashAttention and IO-aware attention)
+4. Advanced CUDA techniques (warp specialization, mega kernels)
+5. ML compilation (tile-based DSLs like Triton, kernel auto-tuning, graph-level optimizations, superoptimization such as Mirage)
+6. Parallelization and distributed training (ZeRO/FSDP, model/pipeline parallelism, auto-parallelization such as Alpa)
+7. LLM serving and inference (batching, PagedAttention, RadixAttention, speculative decoding)
+8. Post-training and architectures (PEFT like LoRA/QLoRA, MoE architectures/kernels/parallelism)
+
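To make one of the serving ideas concrete, here is a hypothetical sketch of PagedAttention-style KV-cache management: each sequence owns a block table mapping logical token positions to fixed-size physical blocks, so cache memory is allocated on demand instead of being reserved contiguously up front (all names and sizes below are illustrative):

```python
# Minimal sketch of a paged KV cache in the spirit of PagedAttention:
# logical token positions map to fixed-size physical blocks via a
# per-sequence block table. Illustrative only, not a real serving API.
BLOCK_SIZE = 4  # tokens per physical block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve cache space for one new token; return (block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:                   # current block full (or first token)
            table.append(self.free_blocks.pop())  # allocate a block on demand
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
slots = [cache.append_token(seq_id=0) for _ in range(6)]  # 6 tokens -> 2 blocks
```

The win over contiguous preallocation is that a sequence only ever holds the blocks its tokens actually occupy, so many sequences of unpredictable length can share one physical pool with minimal fragmentation.
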
+## Course Resources
+
+- Course Website:
+- Schedule (slides and reading list per lecture):
+- Slides (PDF):
+- Logistics (grading, paper reviews, course project):
+- Materials (intro deep learning materials):
+
diff --git a/docs/深度生成模型/大语言模型/CMU15-779.md b/docs/深度生成模型/大语言模型/CMU15-779.md
new file mode 100644
index 00000000..b8a1fc05
--- /dev/null
+++ b/docs/深度生成模型/大语言模型/CMU15-779.md
@@ -0,0 +1,35 @@
+# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)
+
+## 课程简介
+
+- 所属大学:Carnegie Mellon University
+- 先修要求:无硬性先修要求;建议具备机器学习入门与深度学习训练经验,熟悉 PyTorch;了解 CUDA/GPU 基础会显著提升学习效率
+- 编程语言:Python(系统与算子层面内容可能涉及 CUDA/硬件概念)
+- 课程难度:🌟🌟🌟🌟
+- 预计学时:80-120 学时
+
+这门课从系统视角出发,围绕一个核心问题展开:一个用高层框架(例如 PyTorch)写出来的模型,是如何被分解为底层 kernel,并在异构硬件加速器(GPU/TPU)与分布式环境中高效执行的?课程覆盖 GPU 编程、ML 编译器、图级优化、分布式训练与自动并行化、LLM Serving 与推理加速等主题,系统导向很强,适合希望把“框架层经验”向“算子/编译/硬件/集群执行”打通的人。
+
+从教学组织上,这门课会要求你持续完成课前论文阅读(paper review / reading assignments),并以小组形式完成期末系统类课程项目(proposal、presentation、report),因此自学时建议把它当成一个“按周推进的系统训练营”,而不是只看几份 slide。
+
+## 课程内容
+
+课程内容以 lecture 为主线,主题大致包括:
+
+1. ML 系统基础:以 TensorFlow/PyTorch 为例理解计算图、执行模型与系统抽象
+2. GPU 架构与 CUDA 编程:硬件与编程模型、内存与性能优化要点
+3. Transformer 与 Attention 案例:FlashAttention 等 IO-aware attention 优化思路
+4. 高级 CUDA 编程:warp specialization、mega kernel 等低延迟/高吞吐优化技术
+5. ML 编译:Tile-based DSL(Triton 等)、内核自动调优(Ansor 等)、图级优化(TASO/PET 等)、超优化(Mirage)
+6. 并行化与分布式训练:ZeRO/FSDP、模型/流水线并行、自动并行化(Alpa 等)
+7. LLM 推理与服务:批处理、PagedAttention、RadixAttention、推测解码等
+8. 后训练与模型结构:参数高效微调(LoRA/QLoRA)、MoE(架构、kernel、并行化)
+
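以推测解码为例,下面用一段极简的贪心版示意代码说明“草稿模型提案、目标模型并行验证、命中接受 / 未命中回退”的基本流程。其中的函数与变量命名均为示意用的假设,不对应任何真实推理框架的 API:

```python
# 推测解码(贪心匹配版)的极简示意:draft 一次提出多个 token,
# target 并行验证;逐位置命中则接受,未命中则截断并回退。
# 所有命名均为示意,不对应任何真实框架。
def speculative_step(draft_tokens: list[int],
                     target_tokens: list[int]) -> list[int]:
    """draft_tokens: 草稿模型提出的 k 个 token。
    target_tokens: 目标模型在相同前缀上逐位置的贪心输出(k+1 个,
    最后一个是全部草稿命中时额外多得的 token)。
    返回本步实际接受的 token 序列。"""
    accepted: list[int] = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)   # 命中:接受草稿 token
        else:
            accepted.append(t)   # 未命中:用目标模型输出替换并截断
            return accepted
    accepted.append(target_tokens[len(draft_tokens)])  # 全部命中:多得一个
    return accepted

out = speculative_step([5, 7, 9], [5, 7, 2, 4])
```

真实系统中的接受准则是基于 draft/target 概率比的随机采样版本,这里的逐位置贪心匹配只是最容易理解的特例;加速来源在于目标模型对多个位置的验证可以在一次前向中并行完成。
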
+## 课程资源
+
+- 课程网站:
+- 课程安排(含每讲 slide 与阅读列表):
+- 课程讲义(PDF slides):
+- 课程规则与项目要求(Grading、Paper Review、Course Project):
+- 预备材料(深度学习入门材料汇总):
+