(2024 Fall) CS 598 AIE - AI Efficiency: Systems & Algorithms

Schedule (Tentative)

Date Presenter Topics/Readings Papers Slides Selected Review
Aug 28 Minjia Zhang Course Introduction
Aug 30 Minjia Zhang Training Efficiency slides
Sept 4 Minjia Zhang Serving Efficiency slides
Sept 6 Minjia Zhang Efficient and Effective Algorithms slides
System Optimizations for Training Massive Models
Sept 11 Nicholas Satchanov (Tensor-slicing parallelism) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism pdf slides review
Sept 11 Bhagyashree Taleka (Pipeline Parallelism) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism pdf slides review
Sept 13 Qinjun Jiang, Tong Wei (3D parallelism) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM pdf slides review
Sept 13 Yihe Zhang (Sequence Parallelism) Reducing Activation Recomputation in Large Transformer Models pdf slides review
Sept 18 Jiankun Wang (Sequence Parallelism) Ring Attention with Blockwise Transformers for Near-Infinite Context pdf slides
Sept 18 Shuning Zhang (ZeRO-style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models pdf slides review
Sept 20 Yueming Yuan, Ananth Madan ZeRO-Offload: Democratizing Billion-Scale Model Training pdf slides review
Sept 20 Nikhil Kanamarla, Xinyi Song ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning pdf slides review
Sept 25 Hyungyo Kim, Noelle Crawford ZeRO++: Extremely Efficient Collective Communication for Giant Model Training pdf slides review
Sept 25 Zelei Shao Mixed Precision Training pdf slides review
Sept 27 Khoa Pham, Julian Yu (Auto-Parallelism) Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning pdf slides review
Oct 2 Ashley Chen Coop: Memory is not a Commodity pdf slides review
Oct 2 Aydan Pirani (Pipeline Parallelism) Zero Bubble Pipeline Parallelism pdf slides review
System Optimizations for Low Inference Latency and Cost
Oct 4 Deema Alnuhait FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning pdf slides review
Oct 4 Nachuan Wang, Akul Gupta Efficiently Scaling Transformer Inference pdf slides review
Oct 9 Jianping Li (vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention pdf slides review
Oct 9 Anay Bhakat SGLang: Efficient Execution of Structured Language Model Programs pdf slides review
Oct 11 Rahul Bothra, Chengyi Wang vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention pdf slides review
Oct 13 Proposal Due
Oct 16 Sarthak Chakraborty, Ben Civjan Orca: A Distributed Serving System for Transformer-Based Generative Models pdf slides review
Oct 18 Ryan Ziegler, Sultan Durrani TVM: An Automated End-to-End Optimizing Compiler for Deep Learning pdf slides review
Oct 23 Krut Patel Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations pdf slides review
Efficient Algorithms to Make DL Models Smaller, Faster, and Cheaper
Oct 25 Raunak Shah AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration pdf slides review
Oct 25 Aditi Tiwari SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models pdf slides review
Oct 30 Xiaoke LI GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers pdf slides review
Oct 30 Neel Dani H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models pdf slides review
Nov 1 Ya-Ting Pai Efficient Streaming Language Models with Attention Sinks pdf slides review
Nov 1 Hanyang Chen Fast Inference from Transformers via Speculative Decoding pdf slides review
Nov 6 Ruize Gao QLoRA: Efficient Finetuning of Quantized LLMs pdf slides review
Efficiency Improvements for Emerging Real-World Models and Applications
Nov 8 Xiaocong Yang, Aryan Bhardwaj Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity pdf slides review
Nov 13 Dazhen Chen Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference pdf slides review
Nov 13 Chenghao Mo The Illustrated AlphaFold pdf slides review
Nov 15 Simon Sun CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs pdf slides review
Nov 20 Jiatong Li Mamba: Linear-Time Sequence Modeling with Selective State Spaces pdf slides
Nov 20 Jiaqi Lou, Divya Koya Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models pdf slides
Nov 22 Haoran Yuan Scalable Diffusion Models with Transformers pdf
Nov 27 Fall break
Nov 29 Fall break
Dec 4 Final Presentation
Dec 6 Final Presentation
Dec 13 Final Report Due