    (Pipeline Parallelism) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
    NeurIPS 2019
    (ZeRO-style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
    SC 2020
    (3D Parallelism) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
    SC 2021
    (Sequence Parallelism) Reducing Activation Recomputation in Large Transformer Models
    Arxiv 2022
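
A toy sketch of the forward-pass schedule behind GPipe-style pipeline parallelism: with S stages and M micro-batches, stage s runs micro-batch m at step s + m, so the idle "bubble" fraction is (S - 1) / (M + S - 1). Pure Python, not any library's implementation; names and sizes are illustrative.

# GPipe-style forward schedule (toy illustration only).
def gpipe_forward_schedule(num_stages: int, num_microbatches: int):
    """Return one dict per time step mapping stage -> micro-batch index."""
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        step = {}
        for s in range(num_stages):
            m = t - s
            if 0 <= m < num_microbatches:
                step[s] = m  # stage s runs forward on micro-batch m at time t
        steps.append(step)
    return steps

if __name__ == "__main__":
    S, M = 4, 8
    for t, step in enumerate(gpipe_forward_schedule(S, M)):
        print(f"t={t}: {step}")
    print("bubble fraction:", (S - 1) / (M + S - 1))  # 3/11 for this setting
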
System Optimizations for Training Large Models on Limited GPU Resources
    (Gradient checkpointing aka rematerialization) Training Deep Nets with Sublinear Memory Cost
    Arxiv 2016
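
A minimal PyTorch sketch of the rematerialization idea: activations inside each block are dropped during the forward pass and recomputed during backward via torch.utils.checkpoint. The model and sizes are illustrative, not the paper's setup.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, dim=1024, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)]
        )

    def forward(self, x):
        for block in self.blocks:
            # use_reentrant=False selects the non-reentrant implementation
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
x = torch.randn(4, 1024, requires_grad=True)
model(x).sum().backward()  # block activations are recomputed during backward
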
    ZeRO-Offload: Democratizing Billion-Scale Model Training
    USENIX ATC 2021
    ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
    SC 2021
    POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging
    ICML 2022
    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
    Arxiv 2023
    
    DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
    SC 2022
    
    ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
    IPDPS 2023 Best Paper
    
    (vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention
    SOSP 2023
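
A pure-Python sketch of the block-table bookkeeping behind PagedAttention: each sequence's KV cache lives in fixed-size blocks drawn from a shared pool, so memory is allocated on demand rather than reserved per sequence. This models only the bookkeeping, not vLLM's kernels; the class and sizes are illustrative.

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class KVBlockManager:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # shared physical block pool
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens cached so far

    def append_token(self, seq_id: int):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:           # current block full (or none yet)
            if not self.free_blocks:
                raise RuntimeError("out of KV blocks; preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free_sequence(self, seq_id: int):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

mgr = KVBlockManager(num_blocks=8)
for _ in range(20):              # 20 tokens -> ceil(20 / 16) = 2 blocks
    mgr.append_token(seq_id=0)
print(mgr.block_tables[0])       # logical blocks mapped to physical block ids
mgr.free_sequence(0)
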
    FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
    ICML 2023
    SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
    ICML 2023
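
A sketch of the SmoothQuant smoothing step: activation outliers are migrated into the weights with a per-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha), so Y = (X / s) @ (s * W) is mathematically unchanged but easier to quantize. The calibration data and alpha here are toy values, not the paper's full recipe.

import torch

def smooth_scales(x_calib, weight, alpha=0.5):
    # x_calib: [num_tokens, in_features]; weight: [in_features, out_features]
    act_max = x_calib.abs().amax(dim=0).clamp(min=1e-5)  # per input channel
    w_max = weight.abs().amax(dim=1).clamp(min=1e-5)     # per input channel
    return act_max.pow(alpha) / w_max.pow(1.0 - alpha)

x = torch.randn(256, 64)
x[:, 3] *= 50.0                       # simulate an activation outlier channel
w = torch.randn(64, 128) * 0.02

s = smooth_scales(x, w)
x_smooth, w_smooth = x / s, w * s[:, None]

print(torch.allclose(x @ w, x_smooth @ w_smooth, atol=1e-4))  # output preserved
print(x.abs().max().item(), x_smooth.abs().max().item())      # flatter activations
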
    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    ICLR 2023
    Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
    ICML 2023
    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
    NeurIPS 2023
    Efficient Streaming Language Models with Attention Sinks
    Arxiv 2023
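
A toy sketch of the StreamingLLM cache policy: keep a few initial "attention sink" tokens plus a sliding window of recent tokens, and evict the KV entries in between. Sizes are illustrative; this models only which positions stay cached.

def streaming_cache_positions(seq_len, num_sinks=4, window=1024):
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                  # nothing to evict yet
    sinks = list(range(num_sinks))                   # always keep initial tokens
    recent = list(range(seq_len - window, seq_len))  # plus the recent window
    return sinks + recent

print(len(streaming_cache_positions(100_000)))  # 1028 cached positions, not 100000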
    
    Fast Inference from Transformers via Speculative Decoding
    ICML 2023
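
A simplified sketch of one verification step in speculative decoding: the draft model proposes a token from q, and the target distribution p either accepts it with probability min(1, p/q) or resamples from the residual max(0, p - q), which preserves the target distribution. Real systems verify a whole drafted block per target forward pass; the toy distributions here are illustrative.

import torch

def verify_token(p, q, drafted):
    """p, q: target/draft probability vectors over the vocab; drafted: sample from q."""
    accept_prob = torch.clamp(p[drafted] / q[drafted], max=1.0)
    if torch.rand(()) < accept_prob:
        return drafted                        # keep the cheap draft token
    residual = torch.clamp(p - q, min=0.0)    # otherwise correct the distribution
    return int(torch.multinomial(residual / residual.sum(), 1))

vocab = 10
p = torch.softmax(torch.randn(vocab), dim=0)  # toy target-model distribution
q = torch.softmax(torch.randn(vocab), dim=0)  # toy draft-model distribution
drafted = int(torch.multinomial(q, 1))
print("emitted token:", verify_token(p, q, drafted))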
 
    QLoRA: Efficient Finetuning of Quantized LLMs
    NeurIPS 2023
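
A minimal sketch of the LoRA adapter used in QLoRA-style finetuning: the base weight stays frozen (QLoRA additionally stores it in 4-bit NF4, omitted here) and only the low-rank matrices A and B receive gradients. Layer sizes and rank are illustrative.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)       # frozen (4-bit in QLoRA)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(512, 512)
print([n for n, p in layer.named_parameters() if p.requires_grad])  # only LoRA A, B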
 
    E.T.: Re-thinking Self-Attention for Transformer Models on GPUs
    SC 2021
 
    ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
    Arxiv 2023
 
    
Training and Inference of Large Language Models using 8-bit Floating Point
    Arxiv 2023  
 
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
    Arxiv 2023  
    
 
When Parameter-efficient Tuning Meets General-purpose Vision-language Models
    Arxiv 2023  
    
 
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
    Arxiv 2023  
    
 
DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems
    WWW 2021