(Pipeline Parallelism) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
NeurIPS 2019
(ZeRO-style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
SC 2020
(3D Parallelism) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
SC 2021
(Sequence Parallelism) Reducing Activation Recomputation in Large Transformer Models
arXiv 2022
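The GPipe entry above hinges on micro-batch pipelining; a minimal sketch, assuming a toy two-stage PyTorch model (layer sizes and micro-batch count are illustrative, and the per-GPU placement and overlapped execution of a real pipeline are omitted):

```python
# Toy GPipe-style micro-batching: the mini-batch is split into micro-batches
# that flow through the pipeline stages; gradients are accumulated, then one
# optimizer step is taken. Real pipelining places each stage on its own GPU.
import torch
import torch.nn as nn
import torch.nn.functional as F

stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())  # would live on cuda:0
stage1 = nn.Sequential(nn.Linear(512, 10))              # would live on cuda:1
opt = torch.optim.SGD(list(stage0.parameters()) + list(stage1.parameters()), lr=1e-3)

x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
num_micro_batches = 4  # more micro-batches -> smaller pipeline bubble

opt.zero_grad()
for xb, yb in zip(x.chunk(num_micro_batches), y.chunk(num_micro_batches)):
    loss = F.cross_entropy(stage1(stage0(xb)), yb)
    (loss / num_micro_batches).backward()  # accumulate gradients across micro-batches
opt.step()
```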
System Optimizations for Training Large Models on Limited GPU Resources
(Gradient checkpointing aka rematerialization) Training Deep Nets with Sublinear Memory Cost
arXiv 2016
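A minimal sketch of the rematerialization idea above, using PyTorch's torch.utils.checkpoint on a toy block (sizes are illustrative): activations inside the checkpointed block are not stored during the forward pass and are recomputed during backward, trading compute for memory.

```python
# Gradient checkpointing: trade recomputation for activation memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # forward without storing intermediates
y.sum().backward()                             # block is re-run here to compute gradients
```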
ZeRO-Offload: Democratizing Billion-Scale Model Training
USENIX ATC 2021
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
SC 2021
POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging
ICML 2022
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
arXiv 2023
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
SC 2022
ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
IPDPS 2023 Best Paper
(vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention
SOSP 2023
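A minimal sketch of the paged KV-cache bookkeeping that PagedAttention is built on; the class, block size, and free-list allocator here are illustrative stand-ins rather than vLLM's API, and the actual K/V tensors are omitted.

```python
# Toy paged KV cache: each sequence's logical token positions map to fixed-size
# physical blocks through a per-sequence block table, so cache memory is
# allocated on demand and freed without fragmentation.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one new token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:          # current block is full (or first token)
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE  # where K/V would be written

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):                      # decode 40 tokens for sequence 0
    cache.append_token(seq_id=0)
print(cache.block_tables[0])             # 3 blocks cover 40 tokens
cache.free(0)
```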
FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
ICML 2023
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023
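A minimal sketch of SmoothQuant's scale migration, following the paper's per-channel formula s_j = max|X_j|^alpha / max|W_j|^(1-alpha); the tensors and alpha=0.5 here are illustrative. Activation outliers are divided out and folded into the weights so both sides become easier to quantize, while the linear layer's output is unchanged.

```python
# SmoothQuant-style smoothing: move quantization difficulty from activations
# to weights via a per-input-channel scale, keeping X @ W.T numerically equal
# to (X / s) @ (W * s).T before any quantization is applied.
import torch

torch.manual_seed(0)
X = torch.randn(64, 512) * torch.rand(512) * 10   # activations with outlier channels
W = torch.randn(1024, 512)                        # linear weight (out_features, in_features)
alpha = 0.5

act_max = X.abs().amax(dim=0)                     # per input channel
w_max = W.abs().amax(dim=0)
s = act_max.pow(alpha) / w_max.pow(1 - alpha)

X_smooth, W_smooth = X / s, W * s                 # fold s into the weights
assert torch.allclose(X @ W.T, X_smooth @ W_smooth.T, rtol=1e-3, atol=1e-2)
```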
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
ICML 2023
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
NeurIPS 2023
Efficient Streaming Language Models with Attention Sinks
arXiv 2023
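A minimal sketch of the attention-sink cache policy above: always keep the first few "sink" positions plus a recent window and evict the middle. The sizes are illustrative, and the paper's handling of positional indices for the retained entries is omitted.

```python
# StreamingLLM-style KV cache policy: keep the first `num_sinks` positions
# (attention sinks) and the most recent `window` positions, drop the rest.
def evict(cache_positions, num_sinks=4, window=1024):
    if len(cache_positions) <= num_sinks + window:
        return cache_positions
    return cache_positions[:num_sinks] + cache_positions[-window:]

positions = list(range(5000))            # 5000 cached token positions
kept = evict(positions)
print(len(kept), kept[:6], kept[-2:])    # 1028 [0, 1, 2, 3, 3976, 3977] [4998, 4999]
```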
Fast Inference from Transformers via Speculative Decoding
ICML 2023
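A minimal sketch of the draft-then-verify loop behind speculative decoding, simplified to greedy verification rather than the paper's rejection-sampling acceptance rule; draft_model and target_model are hypothetical stand-ins that return a next token id.

```python
# Greedy speculative decoding sketch: a cheap draft model proposes k tokens,
# the expensive target model checks them, and the longest prefix that matches
# the target's own choices is accepted, so several tokens can be emitted per
# target-model "pass".
import random

def draft_model(prefix):
    # Hypothetical cheap model: deterministic next-token guess.
    return (sum(prefix) * 31 + 7) % 100

def target_model(prefix):
    # Hypothetical strong model: usually agrees with the draft, sometimes not.
    return draft_model(prefix) if random.random() < 0.8 else random.randrange(100)

def speculative_step(prefix, k=4):
    # 1) Draft k tokens autoregressively with the cheap model.
    draft = []
    for _ in range(k):
        draft.append(draft_model(prefix + draft))
    # 2) Verify: a real system scores all k positions in one batched forward
    #    pass of the target model; here we query it position by position.
    accepted = []
    for i in range(k):
        t = target_model(prefix + draft[:i])
        if t == draft[i]:
            accepted.append(draft[i])       # draft token confirmed
        else:
            accepted.append(t)              # take the target's token and stop
            break
    return prefix + accepted                # always advances by at least one token

random.seed(0)
seq = [1, 2, 3]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```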
QLoRA: Efficient Finetuning of Quantized LLMs
NeurIPS 2023
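A minimal sketch of the LoRA adapter that QLoRA fine-tunes on top of a frozen base layer; the paper's 4-bit NF4 quantization of the frozen weights and its paged optimizers are omitted, and the rank, scaling, and sizes are illustrative.

```python
# LoRA-style adapter: freeze the base linear layer and train only a low-rank
# update B @ A added to its output. QLoRA additionally stores the frozen base
# weights in 4-bit NF4; that part is not shown here.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # frozen (quantized in QLoRA)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(4, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)   # torch.Size([4, 1024]) 16384
```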
E.T.: Re-thinking Self-Attention for Transformer Models on GPUs
SC 2021
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
arXiv 2023
Training and Inference of Large Language Models using 8-bit Floating Point
arXiv 2023
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
arXiv 2023
When Parameter-efficient Tuning Meets General-purpose Vision-language Models
arXiv 2023
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
arXiv 2023
DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems
WWW 2021