[UPDATE] add CMU 15-779 course pages (#834)

Add CMU 15-779 (LLM systems) notes in CN/EN and link them from the deep generative models roadmap.
This commit is contained in:
Alden
2026-02-21 11:19:07 +08:00
committed by GitHub
parent f45cd57844
commit cb90db338b
4 changed files with 74 additions and 0 deletions


@@ -18,6 +18,8 @@ The GPT series by OpenAI has demonstrated remarkable performance under the guida
- [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): As the title suggests, this course teaches you to build all the core components of an LLM from scratch, such as the tokenizer, model architecture, training optimizer, low-level operators, data cleaning, and post-training algorithms. Each assignment has a 40-50 page PDF handout—very rigorous. Highly recommended if you want to fully understand every low-level detail of LLMs.
- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): A systems and performance-oriented course that explains how high-level models are decomposed into low-level kernels and executed efficiently on heterogeneous accelerators and in distributed environments. Topics include CUDA, ML compilation, graph-level optimizations, auto-parallelization, and LLM serving/inference acceleration, along with weekly paper reviews and a final systems project.
- [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): This CMU course focuses on system-level optimization of LLMs, including GPU acceleration, distributed training/inference, and cutting-edge techniques. Great for students in systems research to gain a holistic understanding of the field. (Disclosure: One of my papers on PD decoupling is included in the syllabus, hence the personal recommendation.) Assignments involve implementing a mini-PyTorch framework and then building system-level LLM optimizations on top of it.
- [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) and [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): Compared to the previous two, these courses focus more on higher-level algorithms and applications. Each lecture includes many recommended readings, making them suitable for gaining a broad understanding of LLM research frontiers. You can then dive deeper into any subfield that interests you based on the reading materials.


@@ -18,6 +18,8 @@ OpenAI 的 GPT 系列让大语言模型在 Scaling Law 的指引下展现出惊
- [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): As the course title says, you will write all the core components of a large language model from scratch, such as the tokenizer, model architecture, training optimizer, low-level operators, training data cleaning, and post-training algorithms. Each assignment comes with a 40-50 page PDF handout — quite hardcore. Highly recommended if you want to fully digest every low-level detail of LLMs.
- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): A systems- and performance-oriented course centered on how high-level models are decomposed into kernels and executed efficiently on heterogeneous accelerators and in distributed environments. It covers CUDA, ML compilation, graph-level optimization, auto-parallelization, and LLM serving and inference acceleration, with weekly paper readings and a final systems project.
- [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): CMU's LLM systems course, focused on low-level system optimization such as GPU acceleration, distributed training and inference, and various cutting-edge techniques. Great for students working on systems research to get a big-picture view of the field. The syllabus also includes a paper of mine on PD (prefill-decode) disaggregation, hence the personal recommendation. For the assignments, you first implement a mini PyTorch and then build various LLM system-level optimizations on top of it.
- [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) and [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): Compared with the previous two, these courses lean more toward higher-level algorithms and applications, and every lecture lists many related readings. They are well suited for getting a rough overview of the frontiers of LLM research; if a subfield interests you, you can follow the references to study it in depth.


@@ -0,0 +1,35 @@
# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)
## Course Overview
- University: Carnegie Mellon University
- Prerequisites: No strict prerequisites; an intro ML background and hands-on deep learning training experience are recommended; familiarity with PyTorch helps, and basic CUDA/GPU knowledge will make the material significantly easier to follow
- Programming Language: Python (systems and kernel-level topics involve CUDA/hardware concepts)
- Course Difficulty: 🌟🌟🌟🌟
- Estimated Study Hours: 80-120 hours
This course takes a systems-first view of modern machine learning and LLM infrastructure. The core question it repeatedly answers is: how does a model written in a high-level framework (e.g., PyTorch) get decomposed into low-level kernels, and how is it executed efficiently on heterogeneous accelerators (GPUs/TPUs) and in distributed environments? The syllabus covers GPU programming, ML compilers, graph-level optimizations, distributed training and auto-parallelization, and LLM serving and inference acceleration. It is a strong fit if you want to connect framework-level experience with kernels, compilation, hardware, and cluster execution.
The workload is organized around consistent pre-lecture reading assignments (paper reviews) and a team-based final course project (proposal, presentation, report). For self-study, it is best to follow the schedule week by week rather than treating it as a slide-only course.
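The lowering-and-fusion idea in the overview above can be sketched with a toy compiler pass. The IR, node format, and pass below are hypothetical illustrations (not any real framework's internals): adjacent elementwise ops in a tiny op graph are rewritten into one fused "kernel", the same memory-traffic argument that motivates real graph-level optimizers.

```python
# Hypothetical mini-IR: each node is (op_name, arg_names, output_name).
# Fusing add+relu into one kernel avoids writing the intermediate "t0"
# out to memory and reading it back -- the core win of operator fusion.

def add(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

def relu(xs):
    return [max(x, 0.0) for x in xs]

# High-level graph computing out = relu(add(a, b)).
graph = [("add", ("a", "b"), "t0"), ("relu", ("t0",), "out")]

def fuse_add_relu(graph):
    """Rewrite add followed immediately by relu into one fused node."""
    fused, i = [], 0
    while i < len(graph):
        op, args, dst = graph[i]
        nxt = graph[i + 1] if i + 1 < len(graph) else None
        if op == "add" and nxt and nxt[0] == "relu" and nxt[1] == (dst,):
            fused.append(("fused_add_relu", args, nxt[2]))
            i += 2
        else:
            fused.append(graph[i])
            i += 1
    return fused

def run(graph, env):
    """Interpret the graph by dispatching each node to a 'kernel'."""
    kernels = {
        "add": add,
        "relu": relu,
        "fused_add_relu": lambda a, b: [max(x + y, 0.0) for x, y in zip(a, b)],
    }
    for op, args, dst in graph:
        env[dst] = kernels[op](*(env[a] for a in args))
    return env["out"]
```

The fused and unfused graphs produce identical results; only the execution schedule (and, on real hardware, the memory traffic) differs.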
## Topics Covered
The course is structured as lectures, with major themes including:
1. ML systems fundamentals via TensorFlow/PyTorch (abstractions, execution models)
2. GPU architecture and CUDA programming (memory, performance tuning)
3. Transformer and attention case studies (FlashAttention and IO-aware attention)
4. Advanced CUDA techniques (warp specialization, mega kernels)
5. ML compilation (tile-based DSLs like Triton, kernel auto-tuning, graph-level optimizations, superoptimization such as Mirage)
6. Parallelization and distributed training (ZeRO/FSDP, model/pipeline parallelism, auto-parallelization such as Alpa)
7. LLM serving and inference (batching, PagedAttention, RadixAttention, speculative decoding)
8. Post-training and architectures (PEFT like LoRA/QLoRA, MoE architectures/kernels/parallelism)
## Course Resources
- Course Website: <https://www.cs.cmu.edu/~zhihaoj2/15-779/>
- Schedule (slides and reading list per lecture): <https://www.cs.cmu.edu/~zhihaoj2/15-779/schedule.html>
- Slides (PDF): <https://www.cs.cmu.edu/~zhihaoj2/15-779/slides/>
- Logistics (grading, paper reviews, course project): <https://www.cs.cmu.edu/~zhihaoj2/15-779/logistics.html>
- Materials (intro deep learning materials): <https://www.cs.cmu.edu/~zhihaoj2/15-779/materials.html>


@@ -0,0 +1,35 @@
# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)
## Course Overview
- University: Carnegie Mellon University
- Prerequisites: No hard prerequisites; an intro ML background and hands-on deep learning training experience are recommended; familiarity with PyTorch and basic CUDA/GPU knowledge will noticeably improve learning efficiency
- Programming Language: Python (systems and kernel-level topics may involve CUDA/hardware concepts)
- Course Difficulty: 4/5
- Estimated Study Hours: 80-120 hours
From a systems perspective, this course systematically answers one core question: how is a model written in a high-level framework (e.g., PyTorch) decomposed into low-level kernels and executed efficiently on heterogeneous accelerators (GPUs/TPUs) and in distributed environments? It covers GPU programming, ML compilers, graph-level optimization, distributed training and auto-parallelization, and LLM serving and inference acceleration. With its strong systems orientation, it suits anyone who wants to connect framework-level experience to kernels, compilation, hardware, and cluster execution.
Organizationally, the course requires continuous pre-lecture paper reading (paper reviews / reading assignments) and a team-based final systems project (proposal, presentation, report), so for self-study it is best treated as a week-by-week systems bootcamp rather than a few decks of slides.
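One representative trick from the course's attention material is the online (streaming) softmax behind FlashAttention-style IO-aware kernels. The pure-Python sketch below is illustrative, not the real kernel: scores arrive one tile at a time, and a running max and normalizer are rescaled on the fly so the full score row never has to be materialized at once.

```python
import math

def softmax(xs):
    # Reference: numerically stable softmax over the whole row at once.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def online_softmax(tiles):
    # Streaming version: consume scores tile by tile, keeping a running
    # max m and normalizer s, and rescaling past exponentials whenever
    # a new max is seen. (FlashAttention goes further and accumulates
    # the attention output directly instead of storing exponentials.)
    m, s, es = float("-inf"), 0.0, []
    for tile in tiles:
        m_new = max([m] + list(tile))
        c = math.exp(m - m_new)      # rescale factor for old stats
        s *= c
        es = [e * c for e in es]
        for x in tile:
            e = math.exp(x - m_new)
            es.append(e)
            s += e
        m = m_new
    return [e / s for e in es]
```

Both functions agree to floating-point precision, but the streaming form only ever needs one tile of scores plus O(1) running statistics, which is what makes the IO-aware tiling possible.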
## Topics Covered
The course is organized around lectures, with themes roughly including:
1. ML systems fundamentals: computation graphs, execution models, and system abstractions via TensorFlow/PyTorch
2. GPU architecture and CUDA programming: hardware and programming model, memory, and key performance-tuning points
3. Transformer and attention case studies: FlashAttention and other IO-aware attention optimizations
4. Advanced CUDA programming: warp specialization, mega kernels, and other low-latency/high-throughput techniques
5. ML compilation: tile-based DSLs (e.g., Triton), kernel auto-tuning (e.g., Ansor), graph-level optimization (e.g., TASO/PET), superoptimization (Mirage)
6. Parallelization and distributed training: ZeRO/FSDP, model/pipeline parallelism, auto-parallelization (e.g., Alpa)
7. LLM inference and serving: batching, PagedAttention, RadixAttention, speculative decoding, etc.
8. Post-training and model architectures: parameter-efficient fine-tuning (LoRA/QLoRA), MoE (architecture, kernels, parallelization)
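As a flavor of the serving topics above, here is a toy sketch of PagedAttention-style KV-cache paging. All names, sizes, and the class API are illustrative assumptions (not vLLM's actual interface): each sequence's KV cache lives in fixed-size blocks drawn from a shared physical pool, with a per-sequence block table, so memory is allocated on demand and reclaimed exactly when a sequence finishes.

```python
BLOCK_SIZE = 4  # tokens per physical KV block (illustrative)

class KVBlockPool:
    """Hypothetical paged KV-cache allocator with per-sequence block tables."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        # Allocate a fresh block only when the current one is full,
        # so fragmentation is bounded by one partially filled block.
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            if not self.free:
                raise MemoryError("KV pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_seq(self, seq_id):
        # Return every block to the pool the moment the sequence ends.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = KVBlockPool(num_blocks=8)
for _ in range(6):   # sequence A: 6 tokens -> ceil(6/4) = 2 blocks
    pool.append_token("A")
for _ in range(9):   # sequence B: 9 tokens -> ceil(9/4) = 3 blocks
    pool.append_token("B")
```

The block table is what lets an attention kernel gather a sequence's KV entries from non-contiguous physical memory, which is the key to high batch utilization in serving.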
## Course Resources
- Course Website: <https://www.cs.cmu.edu/~zhihaoj2/15-779/>
- Schedule (slides and reading list per lecture): <https://www.cs.cmu.edu/~zhihaoj2/15-779/schedule.html>
- Slides (PDF): <https://www.cs.cmu.edu/~zhihaoj2/15-779/slides/>
- Logistics (grading, paper reviews, course project): <https://www.cs.cmu.edu/~zhihaoj2/15-779/logistics.html>
- Materials (intro deep learning materials): <https://www.cs.cmu.edu/~zhihaoj2/15-779/materials.html>