(Pipeline Parallelism) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
NeurIPS 2019
(ZeRO-style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
SC 2020
(3D Parallelism) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
SC 2021
(Sequence Parallelism) Reducing Activation Recomputation in Large Transformer Models
Arxiv 2022
(Sequence Parallelism) Ring Attention with Blockwise Transformers for Near-Infinite Context
Arxiv 2023
(Pipeline Parallelism) Zero Bubble Pipeline Parallelism
Arxiv 2023
ZeRO-Offload: Democratizing Billion-Scale Model Training
USENIX ATC 2021
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
SC 2021
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
NeurIPS 2024
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
OSDI 2022
Mixed Precision Training
Arxiv 2017
(Gradient Checkpointing, a.k.a. Rematerialization) Training Deep Nets with Sublinear Memory Cost
Arxiv 2016
Coop: Memory is not a Commodity
NeurIPS 2023
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Arxiv 2023
Efficiently Scaling Transformer Inference
MLSys 2023 Best Paper
(vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention
SOSP 2023
SGLang: Efficient Execution of Structured Language Model Programs
Arxiv 2024
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Arxiv 2024
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
OSDI 2018
Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
MAPL 2019
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
ICML 2023
Efficient Streaming Language Models with Attention Sinks
Arxiv 2023
Fast Inference from Transformers via Speculative Decoding
ICML 2023
QLoRA: Efficient Finetuning of Quantized LLMs
NeurIPS 2023
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Arxiv 2024
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Arxiv 2023
Scalable Diffusion Models with Transformers
ICCV 2023
CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
ICDE 2024
Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
ISCA 2022
The Illustrated AlphaFold
Blog 2024