| Date | Presenter | Topics/Readings | Papers | Slides | Selected Review |
|---|---|---|---|---|---|
| Aug 28 | Minjia Zhang | Course Introduction | | | |
| Aug 30 | Minjia Zhang | Training Efficiency | | slides | |
| Sept 4 | Minjia Zhang | Serving Efficiency | | slides | |
| Sept 6 | Minjia Zhang | Efficient and Effective Algorithms | | slides | |
|
System Optimizations for Training Massive Models

| Date | Presenter | Topics/Readings | Papers | Slides | Selected Review |
|---|---|---|---|---|---|
| Sept 11 | Nicholas Satchanov | (Tensor-Slicing Parallelism) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | pdf | slides | review |
| Sept 11 | Bhagyashree Taleka | (Pipeline Parallelism) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | pdf | slides | review |
| Sept 13 | Qinjun Jiang, Tong Wei | (3D Parallelism) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | pdf | slides | review |
| Sept 13 | Yihe Zhang | (Sequence Parallelism) Reducing Activation Recomputation in Large Transformer Models | pdf | slides | review |
| Sept 18 | Jiankun Wang | (Sequence Parallelism) Ring Attention with Blockwise Transformers for Near-Infinite Context | pdf | slides | |
| Sept 18 | Shuning Zhang | (ZeRO-Style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | pdf | slides | review |
| Sept 20 | Yueming Yuan, Ananth Madan | ZeRO-Offload: Democratizing Billion-Scale Model Training | pdf | slides | review |
| Sept 20 | Nikhil Kanamarla, Xinyi Song | ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | pdf | slides | review |
| Sept 25 | Hyungyo Kim, Noelle Crawford | ZeRO++: Extremely Efficient Collective Communication for Giant Model Training | pdf | slides | review |
| Sept 25 | Zelei Shao | Mixed Precision Training | pdf | slides | review |
| Sept 27 | Khoa Pham, Julian Yu | (Auto-Parallelism) Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | pdf | slides | review |
| Oct 2 | Ashley Chen | Coop: Memory is not a Commodity | pdf | slides | review |
| Oct 2 | Aydan Pirani | (Pipeline Parallelism) Zero Bubble Pipeline Parallelism | pdf | slides | review |
|
System Optimizations for Low Inference Latency and Cost

| Date | Presenter | Topics/Readings | Papers | Slides | Selected Review |
|---|---|---|---|---|---|
| Oct 4 | Deema Alnuhait | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | pdf | slides | review |
| Oct 4 | Nachuan Wang, Akul Gupta | Efficiently Scaling Transformer Inference | pdf | slides | review |
| Oct 9 | Jianping Li | (vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention | pdf | slides | review |
| Oct 9 | Anay Bhakat | SGLang: Efficient Execution of Structured Language Model Programs | pdf | slides | review |
| Oct 11 | Rahul Bothra, Chengyi Wang | vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | pdf | slides | review |
| Oct 13 | | Proposal Due | | | |
| Oct 16 | Sarthak Chakraborty, Ben Civjan | Orca: A Distributed Serving System for Transformer-Based Generative Models | pdf | slides | review |
| Oct 18 | Ryan Ziegler, Sultan Durrani | TVM: An Automated End-to-End Optimizing Compiler for Deep Learning | pdf | slides | review |
| Oct 23 | Krut Patel | Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations | pdf | slides | review |
|
Efficient Algorithms to Make DL Models Smaller, Faster, and Cheaper

| Date | Presenter | Topics/Readings | Papers | Slides | Selected Review |
|---|---|---|---|---|---|
| Oct 25 | Raunak Shah | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | pdf | slides | review |
| Oct 25 | Aditi Tiwari | SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | pdf | slides | review |
| Oct 30 | Xiaoke Li | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | pdf | slides | review |
| Oct 30 | Neel Dani | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | pdf | slides | review |
| Nov 1 | Ya-Ting Pai | Efficient Streaming Language Models with Attention Sinks | pdf | slides | review |
| Nov 1 | Hanyang Chen | Fast Inference from Transformers via Speculative Decoding | pdf | slides | review |
| Nov 6 | Ruize Gao | QLoRA: Efficient Finetuning of Quantized LLMs | pdf | slides | review |
|
Efficiency Improvements for Emerging Real-World Models and Applications

| Date | Presenter | Topics/Readings | Papers | Slides | Selected Review |
|---|---|---|---|---|---|
| Nov 8 | Xiaocong Yang, Aryan Bhardwaj | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | pdf | slides | review |
| Nov 13 | Dazhen Chen | Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference | pdf | slides | review |
| Nov 13 | Chenghao Mo | The Illustrated AlphaFold | pdf | slides | review |
| Nov 15 | Simon Sun | CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs | pdf | slides | review |
| Nov 20 | Jiatong Li | Mamba: Linear-Time Sequence Modeling with Selective State Spaces | pdf | slides | |
| Nov 20 | Jiaqi Lou, Divya Koya | Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models | pdf | slides | |
| Nov 22 | Haoran Yuan | Scalable Diffusion Models with Transformers | pdf | | |
| Nov 27 | | Fall break | | | |
| Nov 29 | | Fall break | | | |
| Dec 4 | | Final Presentation | | | |
| Dec 6 | | Final Presentation | | | |
| Dec 13 | | Final Report Due | | | |