From cb90db338bddfd5407c5d420320d90c669edb452 Mon Sep 17 00:00:00 2001 From: Alden <83172530+AldenWangExis@users.noreply.github.com> Date: Sat, 21 Feb 2026 11:19:07 +0800 Subject: [PATCH] [UPDATE] add CMU 15-779 course pages (#834) Add CMU 15-779 (LLM systems) notes in CN/EN and link them from the deep generative models roadmap. --- docs/深度生成模型/roadmap.en.md | 2 ++ docs/深度生成模型/roadmap.md | 2 ++ docs/深度生成模型/大语言模型/CMU15-779.en.md | 35 ++++++++++++++++++++ docs/深度生成模型/大语言模型/CMU15-779.md | 35 ++++++++++++++++++++ 4 files changed, 74 insertions(+) create mode 100644 docs/深度生成模型/大语言模型/CMU15-779.en.md create mode 100644 docs/深度生成模型/大语言模型/CMU15-779.md diff --git a/docs/深度生成模型/roadmap.en.md b/docs/深度生成模型/roadmap.en.md index e736d7f9..d85305af 100644 --- a/docs/深度生成模型/roadmap.en.md +++ b/docs/深度生成模型/roadmap.en.md @@ -18,6 +18,8 @@ The GPT series by OpenAI has demonstrated remarkable performance under the guida - [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): As the title suggests, this course teaches you to build all the core components of an LLM from scratch, such as the tokenizer, model architecture, training optimizer, low-level operators, data cleaning, and post-training algorithms. Each assignment has a 40-50 page PDF handout—very rigorous. Highly recommended if you want to fully understand every low-level detail of LLMs. +- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): A systems and performance-oriented course that explains how high-level models are decomposed into low-level kernels and executed efficiently on heterogeneous accelerators and in distributed environments. Topics include CUDA, ML compilation, graph-level optimizations, auto-parallelization, and LLM serving/inference acceleration, along with weekly paper reviews and a final systems project. 
+ - [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): This CMU course focuses on system-level optimization of LLMs, including GPU acceleration, distributed training/inference, and cutting-edge techniques. Great for students in systems research to gain a holistic understanding of the field. (Disclosure: One of my papers on PD decoupling is included in the syllabus, hence the personal recommendation.) Assignments involve implementing a mini-PyTorch framework and then building system-level LLM optimizations on top of it. - [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) and [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): Compared to the previous two, these courses focus more on higher-level algorithms and applications. Each lecture includes many recommended readings, making them suitable for gaining a broad understanding of LLM research frontiers. You can then dive deeper into any subfield that interests you based on the reading materials. 
diff --git a/docs/深度生成模型/roadmap.md b/docs/深度生成模型/roadmap.md index 1dbc139a..275b1252 100644 --- a/docs/深度生成模型/roadmap.md +++ b/docs/深度生成模型/roadmap.md @@ -18,6 +18,8 @@ OpenAI 的 GPT 系列让大语言模型在 Scaling Law 的指引下展现出惊 - [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): 正如课程标题写的,在这门课程中你将从头编写大语言模型的所有核心组件,例如 Tokenizer,模型架构,训练优化器,底层算子,训练数据清洗,后训练算法等等。每次作业的 handout 都有四五十页 pdf,相当硬核。如果你想充分吃透大语言模型的所有底层细节,那么非常推荐学习这门课程。 +- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): 偏系统与性能优化视角,重点讲清“高层模型如何分解成 kernel 并在异构加速器与分布式环境中高效执行”,覆盖 CUDA、ML 编译、图级优化、自动并行化、LLM Serving 与推理加速等内容,并配套按周论文阅读与期末系统项目。 + - [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): CMU 的大语言模型系统课程,侧重底层系统优化,例如 GPU 加速,分布式训练和推理,以及各种前沿技术。非常适合从事系统领域的同学对这个方向有个全貌性的了解。课表里还包含了一篇我发表的 PD 分离相关的文章,因此私心推荐一下。课程作业的话会让你先实现一个迷你 PyTorch,然后在上面实现各种大语言模型的系统级优化。 - [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) 和 [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): 和前两门课相比,这两门课更偏重上层算法和应用,而且每节课都列举了很多相关阅读材料,适合对大语言模型发展前沿的各个方向都有个粗糙的认识,如果对某个子领域感兴趣的话再循着参考资料深入学习。 diff --git a/docs/深度生成模型/大语言模型/CMU15-779.en.md b/docs/深度生成模型/大语言模型/CMU15-779.en.md new file mode 100644 index 00000000..54808a42 --- /dev/null +++ b/docs/深度生成模型/大语言模型/CMU15-779.en.md @@ -0,0 +1,35 @@ +# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition) + +## Course Overview + +- University: Carnegie Mellon University +- Prerequisites: No strict prerequisites; an intro ML background and hands-on deep learning training experience are recommended; familiarity with PyTorch helps; basic CUDA/GPU knowledge will significantly ease the learning curve +- Programming Language: Python (systems and kernel-level topics involve CUDA/hardware concepts) +- Course Difficulty: 🌟🌟🌟🌟 +- Estimated Study Hours: 80-120 hours + +This course takes a systems-first view
of modern machine learning and LLM infrastructure. The core question it repeatedly answers is: how does a model written in a high-level framework (e.g., PyTorch) get decomposed into low-level kernels, and how is it executed efficiently on heterogeneous accelerators (GPUs/TPUs) and in distributed environments? The syllabus covers GPU programming, ML compilers, graph-level optimizations, distributed training and auto-parallelization, and LLM serving and inference acceleration. It is a strong fit if you want to connect “framework-level experience” with “kernels, compilation, hardware, and cluster execution.” + +The workload is organized around recurring pre-lecture reading assignments (paper reviews) and a team-based final course project (proposal, presentation, report). For self-study, it is best to follow the schedule week by week rather than treating it as a slide-only course. + +## Topics Covered + +The course is structured as lectures, with major themes including: + +1. ML systems fundamentals via TensorFlow/PyTorch (abstractions, execution models) +2. GPU architecture and CUDA programming (memory, performance tuning) +3. Transformer and attention case studies (FlashAttention and IO-aware attention) +4. Advanced CUDA techniques (warp specialization, mega kernels) +5. ML compilation (tile-based DSLs like Triton, kernel auto-tuning such as Ansor, graph-level optimizations such as TASO/PET, superoptimization such as Mirage) +6. Parallelization and distributed training (ZeRO/FSDP, model/pipeline parallelism, auto-parallelization such as Alpa) +7. LLM serving and inference (batching, PagedAttention, RadixAttention, speculative decoding) +8.
Post-training and architectures (PEFT like LoRA/QLoRA, MoE architectures/kernels/parallelism) + +## Course Resources + +- Course Website: +- Schedule (slides and reading list per lecture): +- Slides (PDF): +- Logistics (grading, paper reviews, course project): +- Materials (intro deep learning materials): + diff --git a/docs/深度生成模型/大语言模型/CMU15-779.md b/docs/深度生成模型/大语言模型/CMU15-779.md new file mode 100644 index 00000000..b8a1fc05 --- /dev/null +++ b/docs/深度生成模型/大语言模型/CMU15-779.md @@ -0,0 +1,35 @@ +# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition) + +## 课程简介 + +- 所属大学:Carnegie Mellon University +- 先修要求:无硬性先修要求;建议具备机器学习入门与深度学习训练经验,熟悉 PyTorch;了解 CUDA/GPU 基础会显著提升学习效率 +- 编程语言:Python(系统与算子层面内容可能涉及 CUDA/硬件概念) +- 课程难度:🌟🌟🌟🌟 +- 预计学时:80-120 学时 + +这门课从系统视角出发,反复回答一个核心问题:一个用高层框架(例如 PyTorch)写出来的模型,是如何被分解为底层 kernel,并在异构硬件加速器(GPU/TPU)与分布式环境中高效执行的。课程覆盖 GPU 编程、ML 编译器、图级优化、分布式训练与自动并行化、LLM Serving 与推理加速等主题,强系统导向,适合希望把“框架层经验”向“算子/编译/硬件/集群执行”打通的人。 + +从教学组织上看,这门课会要求你持续完成课前论文阅读(paper review / reading assignments),并以小组形式完成期末系统类课程项目(proposal、presentation、report),因此自学时建议把它当成一个“按周推进的系统训练营”,而不是只看几份 slide。 + +## 课程内容 + +课程内容以 lecture 为主线,主题大致包括: + +1. ML 系统基础:以 TensorFlow/PyTorch 为例理解计算图、执行模型与系统抽象 +2. GPU 架构与 CUDA 编程:硬件与编程模型、内存与性能优化要点 +3. Transformer 与 Attention 案例:FlashAttention 等 IO-aware attention 优化思路 +4. 高级 CUDA 编程:warp specialization、mega kernel 等低延迟/高吞吐优化技术 +5. ML 编译:Tile-based DSL(Triton 等)、内核自动调优(Ansor 等)、图级优化(TASO/PET 等)、超优化(Mirage) +6. 并行化与分布式训练:ZeRO/FSDP、模型/流水线并行、自动并行化(Alpa 等) +7. LLM 推理与服务:批处理、PagedAttention、RadixAttention、推测解码等 +8. 后训练与模型结构:参数高效微调(LoRA/QLoRA)、MoE(架构、kernel、并行化) + +## 课程资源 + +- 课程网站: +- 课程安排(含每讲 slide 与阅读列表): +- 课程讲义(PDF slides): +- 课程规则与项目要求(Grading、Paper Review、Course Project): +- 预备材料(深度学习入门材料汇总): +