| Date | Presenter | Topics/Readings | Slides |
| --- | --- | --- | --- |
| Jan 16 | Minjia Zhang | Course Introduction | |
| Jan 18 | Minjia Zhang | Training Efficiency | pdf |
| Jan 23 | Minjia Zhang | Inference Efficiency | pdf |
| **System Optimizations for Training Massive Models** | | | |
| Jan 25 | Olatunji Ruwase (Invited speaker) | DeepSpeed Library | pdf |
| Jan 30 | Yiqi Liu | (ZeRO-style data parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC 2020) | pdf |
| Feb 1 | Haoyang Zhang | ZeRO-Offload: Democratizing Billion-Scale Model Training (ATC 2021) | pdf |
| Feb 6 | Yufeng Du | (Tensor-slicing parallelism) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (arXiv 2019) | pdf |
| Feb 8 | Siyuan Chai | ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (SC 2021) | pdf |
| Feb 13 | Gangmuk Lim, Ahan Gupta | POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging (ICML 2022); (Sequence parallelism) Reducing Activation Recomputation in Large Transformer Models (arXiv 2022) | pdf, pdf |
| **System Optimizations for Low Inference Latency and Cost** | | | |
| Feb 15 | Yuhao Ge, Yuqi Xue | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (NeurIPS 2022); (vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) | pdf, pdf |
| Feb 20 | Yanzhuo Chen | ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs (IPDPS 2023 Best Paper) | pdf |
| Feb 22 | Aditya Prerepa | Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) | pdf |
| Feb 27 | Vignesh Suresh | Efficiently Scaling Transformer Inference (arXiv 2022) | pdf |
| Feb 29 | Steven Gao | FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) | pdf |
| **Efficient Algorithms to Make DL Models Smaller, Faster, and Cheaper** | | | |
| Mar 5 | Xinyu Lian, Mayank Bhatia | ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers (NeurIPS 2022); SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (ICML 2023) | pdf, pdf |
| **Pre-proposal meetings** | | | |
| Mar 7 | Students & Instructors | Pre-proposal: 15 minutes per group (must be scheduled ahead of time) | |
| **Efficient Algorithms to Make DL Models Smaller, Faster, and Cheaper (cont.)** | | | |
| Mar 19 | Selin Yildirim, - | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICLR 2023); H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (ICML 2023) | pdf, - |
| Mar 21 | Akhil Bhimaraju, Lingzhi Zhao | Efficient Streaming Language Models with Attention Sinks (arXiv 2023); Fast Inference from Transformers via Speculative Decoding (ICML 2023) | pdf, pdf |
| Mar 26 | Akshat Sharma, Henry Zhu | Mixed Precision Training (arXiv 2017); QLoRA: Efficient Finetuning of Quantized LLMs (NeurIPS 2023) | pdf, pdf |
| **System and Algorithm Co-Design for Efficient Training and Inference** | | | |
| Mar 28 | Wanyu Zhao | E.T.: Re-thinking Self-Attention for Transformer Models on GPUs (SC 2021) | pdf |
| Apr 2 | Bakshree Mishra | Training and Inference of Large Language Models using 8-bit Floating Point (arXiv 2023) | pdf |
| Apr 11 | Chunyuan Li (Invited speaker) | Invited talk on multi-modal models | pdf |
| Apr 16 | Wei Wen (Invited speaker) | Invited talk on NAS + DLRM from Meta | |
| **Efficiency Improvements for Emerging Real-World Models and Applications** | | | |
| Apr 4 | Ritik Dutta | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (JMLR 2022) | pdf |
| Apr 9 | Hari Umesh | When Parameter-efficient Tuning Meets General-purpose Vision-language Models (arXiv 2023) | pdf |
| Apr 18 | James Soole | InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation (arXiv 2023) | pdf |
| Apr 23 | Tanay Dixit | AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (arXiv 2023) | pdf |
| Apr 25 | Haochen Shen | Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv 2023) | pdf |
| Apr 30 | Zhenrui Yue | Scalable Diffusion Models with Transformers (CVPR 2023) | pdf |
| TBD | | Final Project Presentations | |