mirror of
https://github.com/PKUFlyingPig/cs-self-learning.git
synced 2026-03-20 12:06:23 +08:00
[UPDATE] add CMU 15-779 course pages (#834)
Add CMU 15-779 (LLM systems) notes in CN/EN and link them from the deep generative models roadmap.
@@ -18,6 +18,8 @@ The GPT series by OpenAI has demonstrated remarkable performance under the guida

- [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): As the title suggests, this course teaches you to build all the core components of an LLM from scratch, such as the tokenizer, model architecture, training optimizer, low-level operators, data cleaning, and post-training algorithms. Each assignment has a 40-50 page PDF handout; very rigorous. Highly recommended if you want to fully understand every low-level detail of LLMs.
- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): A systems and performance-oriented course that explains how high-level models are decomposed into low-level kernels and executed efficiently on heterogeneous accelerators and in distributed environments. Topics include CUDA, ML compilation, graph-level optimizations, auto-parallelization, and LLM serving/inference acceleration, along with weekly paper reviews and a final systems project.
- [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): This CMU course focuses on system-level optimization of LLMs, including GPU acceleration, distributed training/inference, and cutting-edge techniques. Great for students in systems research to gain a holistic understanding of the field. (Disclosure: One of my papers on PD decoupling is included in the syllabus, hence the personal recommendation.) Assignments involve implementing a mini-PyTorch framework and then building system-level LLM optimizations on top of it.
- [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) and [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): Compared to the previous two, these courses focus more on higher-level algorithms and applications. Each lecture includes many recommended readings, making them suitable for gaining a broad understanding of LLM research frontiers. You can then dive deeper into any subfield that interests you based on the reading materials.
@@ -18,6 +18,8 @@ OpenAI 的 GPT 系列让大语言模型在 Scaling Law 的指引下展现出惊

- [Stanford CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2025/index.html): As the course title suggests, you will write all the core components of a large language model from scratch: the tokenizer, model architecture, training optimizer, low-level operators, training data cleaning, post-training algorithms, and more. Every assignment handout is a 40-50 page PDF; quite hardcore. Highly recommended if you want to fully digest every low-level detail of LLMs.
- [CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)](./大语言模型/CMU15-779.md) / [EN](./大语言模型/CMU15-779.en.md): A systems- and performance-oriented course focused on explaining how high-level models are decomposed into kernels and executed efficiently on heterogeneous accelerators and in distributed environments. It covers CUDA, ML compilation, graph-level optimizations, auto-parallelization, and LLM serving/inference acceleration, with weekly paper readings and a final systems project.
- [CMU 11868: Large Language Model Systems](https://llmsystem.github.io/llmsystem2025spring/): CMU's LLM systems course, focused on low-level system optimization such as GPU acceleration, distributed training and inference, and various cutting-edge techniques. Great for students working in systems to get a big-picture view of the field. (The syllabus also includes a paper of mine on PD decoupling, hence the personal recommendation.) For assignments, you first implement a mini PyTorch and then build various system-level LLM optimizations on top of it.
- [CMU 11667: Large Language Models: Methods and Applications](https://cmu-llms.org/) and [CMU 11711: Advanced NLP](https://www.phontron.com/class/anlp-fall2024/): Compared to the previous two, these courses lean more toward higher-level algorithms and applications, and each lecture lists many related readings. They are well suited for getting a rough map of the research frontiers of LLMs; if a subfield interests you, you can then follow the references to study it in depth.
35
docs/深度生成模型/大语言模型/CMU15-779.en.md
Normal file
@@ -0,0 +1,35 @@

# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)

## Course Overview

- University: Carnegie Mellon University
- Prerequisites: No strict prerequisites; an intro ML background and hands-on deep learning training experience are recommended; familiarity with PyTorch helps; basic CUDA/GPU knowledge will significantly improve learning efficiency
- Programming Language: Python (systems and kernel-level topics involve CUDA/hardware concepts)
- Course Difficulty: 🌟🌟🌟🌟
- Estimated Study Hours: 80-120 hours

This course takes a systems-first view of modern machine learning and LLM infrastructure. The core question it repeatedly answers is: how does a model written in a high-level framework (e.g., PyTorch) get decomposed into low-level kernels, and how is it executed efficiently on heterogeneous accelerators (GPUs/TPUs) and in distributed environments? The syllabus covers GPU programming, ML compilers, graph-level optimizations, distributed training and auto-parallelization, and LLM serving and inference acceleration. It is a strong fit if you want to connect "framework-level experience" with "kernels, compilation, hardware, and cluster execution."

The workload is organized around consistent pre-lecture reading assignments (paper reviews) and a team-based final course project (proposal, presentation, report). For self-study, it is best to follow the schedule week by week rather than treating it as a slide-only course.

## Topics Covered

The course is structured as lectures, with major themes including:

1. ML systems fundamentals via TensorFlow/PyTorch (abstractions, execution models)
2. GPU architecture and CUDA programming (memory, performance tuning)
3. Transformer and attention case studies (FlashAttention and IO-aware attention)
4. Advanced CUDA techniques (warp specialization, mega kernels)
5. ML compilation (tile-based DSLs like Triton, kernel auto-tuning, graph-level optimizations, superoptimization such as Mirage)
6. Parallelization and distributed training (ZeRO/FSDP, model/pipeline parallelism, auto-parallelization such as Alpa)
7. LLM serving and inference (batching, PagedAttention, RadixAttention, speculative decoding)
8. Post-training and architectures (PEFT like LoRA/QLoRA, MoE architectures/kernels/parallelism)
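To make the attention case study concrete: the core idea behind FlashAttention-style IO-aware attention is an online softmax computed over key/value tiles, so the full score matrix is never materialized. The following is a simplified single-head NumPy sketch of that rescaling trick (function names and shapes are ours, not the course's or the paper's code):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard softmax(Q K^T / sqrt(d)) V, materializing the full score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Same result via an online softmax over key/value tiles."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running row-wise max of scores seen so far
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)        # (n, tile) block of scores
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)        # rescale previously accumulated sums
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        out = out * scale[:, None] + P @ Vj
        m = m_new
    return out / l[:, None]
```

The real kernel additionally tiles over queries and fuses everything into one GPU kernel so tiles stay in on-chip SRAM; this sketch only shows why blockwise rescaling reproduces the naive result exactly.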
## Course Resources

- Course Website: <https://www.cs.cmu.edu/~zhihaoj2/15-779/>
- Schedule (slides and reading list per lecture): <https://www.cs.cmu.edu/~zhihaoj2/15-779/schedule.html>
- Slides (PDF): <https://www.cs.cmu.edu/~zhihaoj2/15-779/slides/>
- Logistics (grading, paper reviews, course project): <https://www.cs.cmu.edu/~zhihaoj2/15-779/logistics.html>
- Materials (introductory deep learning resources): <https://www.cs.cmu.edu/~zhihaoj2/15-779/materials.html>
35
docs/深度生成模型/大语言模型/CMU15-779.md
Normal file
@@ -0,0 +1,35 @@

# CMU 15-779: Advanced Topics in Machine Learning Systems (LLM Edition)

## Course Overview

- University: Carnegie Mellon University
- Prerequisites: No hard prerequisites; an introductory machine learning background and hands-on deep learning training experience are recommended, as is familiarity with PyTorch; knowing the basics of CUDA/GPUs will significantly improve learning efficiency
- Programming Language: Python (systems and kernel-level topics may involve CUDA/hardware concepts)
- Course Difficulty: 4/5
- Estimated Study Hours: 80-120 hours

From a systems perspective, this course systematically answers one core question: how is a model written in a high-level framework (e.g., PyTorch) decomposed into low-level kernels and executed efficiently on heterogeneous accelerators (GPUs/TPUs) and in distributed environments? It covers GPU programming, ML compilers, graph-level optimizations, distributed training and auto-parallelization, and LLM serving and inference acceleration. Strongly systems-oriented, it suits anyone who wants to connect "framework-level experience" with "kernel/compiler/hardware/cluster execution."

In terms of organization, the course requires continuous pre-lecture paper reading (paper reviews / reading assignments) and a team-based final systems project (proposal, presentation, report), so for self-study it is best approached as a week-by-week systems bootcamp rather than a few slide decks to skim.

## Topics Covered

The lectures are the backbone of the course; major themes include:

1. ML systems fundamentals: computation graphs, execution models, and system abstractions via TensorFlow/PyTorch
2. GPU architecture and CUDA programming: hardware and programming model, memory and performance optimization essentials
3. Transformer and attention case studies: FlashAttention and other IO-aware attention optimizations
4. Advanced CUDA programming: warp specialization, mega kernels, and other low-latency/high-throughput techniques
5. ML compilation: tile-based DSLs (e.g., Triton), kernel auto-tuning (e.g., Ansor), graph-level optimizations (e.g., TASO/PET), superoptimization (Mirage)
6. Parallelization and distributed training: ZeRO/FSDP, model/pipeline parallelism, auto-parallelization (e.g., Alpa)
7. LLM inference and serving: batching, PagedAttention, RadixAttention, speculative decoding, etc.
8. Post-training and model architecture: parameter-efficient fine-tuning (LoRA/QLoRA), MoE (architecture, kernels, parallelization)
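As a concrete illustration of the parameter-efficient fine-tuning theme, LoRA freezes the base weight and learns only a low-rank update. A minimal NumPy sketch of the forward pass (the function name, shapes, and `alpha` scaling here are illustrative, not any specific library's API):

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=8):
    """Frozen base layer x @ W plus a trainable low-rank update (alpha/r) * x @ A @ B.

    A has shape (d, r) and B has shape (r, k) with r much smaller than d and k,
    so the adapter adds only r * (d + k) trainable parameters.
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B
```

With the common initialization B = 0, the adapter starts as an exact no-op, so fine-tuning begins from the pretrained model's behavior; only A and B receive gradients while W stays frozen.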
## Course Resources

- Course Website: <https://www.cs.cmu.edu/~zhihaoj2/15-779/>
- Schedule (with per-lecture slides and reading lists): <https://www.cs.cmu.edu/~zhihaoj2/15-779/schedule.html>
- Lecture Slides (PDF): <https://www.cs.cmu.edu/~zhihaoj2/15-779/slides/>
- Logistics and project requirements (grading, paper review, course project): <https://www.cs.cmu.edu/~zhihaoj2/15-779/logistics.html>
- Preparatory materials (a collection of introductory deep learning materials): <https://www.cs.cmu.edu/~zhihaoj2/15-779/materials.html>