Publications Related to AI Efficiency

Efficient and Scalable System Strategies for Training Massive Models

(Tensor-Slicing Parallelism) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Arxiv 2019
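
A minimal single-process sketch of the tensor-slicing idea: the weight matrix of a linear layer is split column-wise across "ranks", each rank computes a partial output independently, and the shards are concatenated (an all-gather in the distributed setting). Names and shapes here are illustrative, not Megatron-LM's actual API.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 16)                          # [batch, hidden]
w = torch.randn(16, 32)                         # full linear-layer weight

ranks = 2
shards = torch.chunk(w, ranks, dim=1)           # each "rank" holds a column slice
partial_outs = [x @ shard for shard in shards]  # computed independently per rank
y_parallel = torch.cat(partial_outs, dim=1)     # all-gather along the output dim

assert torch.allclose(y_parallel, x @ w, atol=1e-5)
```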

(Pipeline Parallelism) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
NeurIPS 2019

(ZeRO-style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
SC 2020
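
A back-of-the-envelope sketch of the memory accounting behind ZeRO with mixed-precision Adam: 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter, with stages 1/2/3 sharding optimizer state, then gradients, then parameters across the data-parallel GPUs. The function name is illustrative; the 7.5B-parameter, 64-GPU case follows the paper's running example.

```python
def zero_bytes_per_gpu(num_params, n_gpus, stage):
    p, g, o = 2.0, 2.0, 12.0          # fp16 params, fp16 grads, fp32 optimizer state (per param)
    if stage >= 1: o /= n_gpus        # ZeRO-1: shard optimizer state
    if stage >= 2: g /= n_gpus        # ZeRO-2: also shard gradients
    if stage >= 3: p /= n_gpus        # ZeRO-3: also shard parameters
    return num_params * (p + g + o)

for s in range(4):                    # 7.5B parameters across 64 data-parallel GPUs
    print(f"ZeRO-{s}: {zero_bytes_per_gpu(7.5e9, 64, s) / 1e9:.1f} GB per GPU")
# -> 120.0, 31.4, 16.6, 1.9 GB per GPU
```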

(3D Parallelism) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
SC 2021

(Sequence Parallelism) Reducing Activation Recomputation in Large Transformer Models
Arxiv 2022

(Sequence Parallelism) Ring Attention with Blockwise Transformers for Near-Infinite Context
Arxiv 2023

(Pipeline Parallelism) Zero Bubble Pipeline Parallelism
Arxiv 2023

ZeRO-Offload: Democratizing Billion-Scale Model Training
USENIX ATC 2021

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
SC 2021

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
NeurIPS 2024

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
OSDI 2022

Mixed Precision Training
Arxiv 2017
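
A minimal mixed-precision training step in the spirit of the paper: compute runs mostly in fp16, the weights keep an fp32 master copy, and the loss is scaled so small gradients do not underflow. torch.cuda.amp is used here as a present-day stand-in for the paper's recipe, not its original implementation.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()              # weights stay fp32 (the "master copy")
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                  # dynamic loss scaling

x = torch.randn(32, 512, device="cuda")
with torch.cuda.amp.autocast():                       # fp16 compute where it is safe
    loss = model(x).float().pow(2).mean()
scaler.scale(loss).backward()                         # backward on the scaled loss
scaler.step(opt)                                      # unscales grads, skips step on inf/nan
scaler.update()
```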

(Gradient checkpointing aka rematerialization) Training Deep Nets with Sublinear Memory Cost
Arxiv 2016
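
A minimal sketch of the rematerialization idea using PyTorch's torch.utils.checkpoint (a present-day stand-in, not the paper's code): activations inside the wrapped block are not saved for backward and are recomputed when gradients are needed, trading extra compute for sublinear activation memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(8, 1024, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward pass does not store intermediates
y.sum().backward()                             # block is re-executed here to compute gradients
```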

Coop: Memory is not a Commodity
NeurIPS 2023

System Optimizations for Low Inference Latency and Cost

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
NeurIPS 2022

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Arxiv 2023
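
Both FlashAttention papers compute exact softmax attention while tiling Q/K/V through on-chip SRAM instead of materializing the full [seq, seq] score matrix in HBM. A quick way to exercise a fused kernel of this kind from PyTorch is scaled_dot_product_attention, which may dispatch to a FlashAttention-style backend on supported GPUs (the backend choice is an assumption, not a guarantee).

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)  # [batch, heads, seq, head_dim]
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # exact attention, no [seq, seq] tensor
```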

Efficiently Scaling Transformer Inference
MLSys 2023 Best Paper

(vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention
SOSP 2023
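
PagedAttention stores the KV cache in fixed-size blocks and keeps a per-request block table mapping logical token positions to physical blocks, much like virtual-memory pages, so cache memory is allocated on demand rather than as one large contiguous reservation. A toy allocator sketch; class and method names are illustrative, not vLLM's API.

```python
BLOCK_SIZE = 16

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)      # pool of physical KV-cache block ids
        self.table = []                    # logical block index -> physical block id

    def append_token(self, pos):
        if pos % BLOCK_SIZE == 0:          # current block full (or first token): grab a new block
            self.table.append(self.free.pop())
        block_id = self.table[pos // BLOCK_SIZE]
        return block_id, pos % BLOCK_SIZE  # physical slot where this token's K/V is written

req = BlockTable(free_blocks=range(1000))
for t in range(40):                        # 40 generated tokens -> 3 blocks of 16
    req.append_token(t)
print(req.table)                           # e.g. [999, 998, 997]
```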

SGLang: Efficient Execution of Structured Language Model Programs
Arxiv 2024

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Arxiv 2024

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
OSDI 2018

Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
MAPL 2019
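
Triton lets you write tiled GPU kernels in Python while the compiler handles scheduling and memory coalescing. A minimal element-wise add kernel in the style of the standard Triton tutorials (not tied to any artifact from the paper):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements                     # guard the last partial tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```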

Efficient Algorithms to Make DL Models Smaller, Faster, and Cheaper

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MLSys 2024

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023
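
The three post-training quantization papers above all start from, and improve on, the simple round-to-nearest baseline sketched below: per-group absmax scaling of weights to int4, dequantized for use. GPTQ adds Hessian-aware weight updates, AWQ rescales salient channels using activation statistics, and SmoothQuant migrates activation outliers into the weights; none of that machinery appears in this sketch.

```python
import torch

def quantize_rtn(w, group_size=128, bits=4):
    qmax = 2 ** (bits - 1) - 1                        # 7 for symmetric int4
    groups = w.reshape(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)               # dequantized weights

w = torch.randn(4096, 4096)
w_q = quantize_rtn(w)
print((w - w_q).abs().mean())                         # round-to-nearest quantization error
```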

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
NeurIPS 2023

Efficient Streaming Language Models with Attention Sinks
Arxiv 2023
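
H2O and StreamingLLM (the attention-sinks paper) both bound KV-cache growth by evicting entries; StreamingLLM in particular keeps a handful of initial "attention sink" tokens plus a sliding window of recent tokens. A toy version of that eviction policy, with illustrative parameter values:

```python
def streaming_keep_indices(seq_len, num_sinks=4, window=2044):
    """Indices of KV-cache entries to retain: sink tokens + recent window."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))              # nothing to evict yet
    recent_start = seq_len - window
    return list(range(num_sinks)) + list(range(recent_start, seq_len))

print(len(streaming_keep_indices(100_000)))      # cache stays bounded at 2048 entries
print(streaming_keep_indices(10))                # short sequences are kept intact
```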

Fast Inference from Transformers via Speculative Decoding
ICML 2023
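
Speculative decoding has a small draft model propose k tokens, which the large target model then verifies in a single parallel forward pass, so several tokens can be accepted per expensive target call. The greedy-verification sketch below only conveys the control flow; the paper's rejection-sampling rule, which preserves the target model's sampling distribution, is omitted, and both *_next_token callables are hypothetical interfaces.

```python
def speculative_step(target_next_token, draft_next_token, prefix, k=4):
    proposals, ctx = [], list(prefix)
    for _ in range(k):                        # cheap: k sequential draft-model calls
        t = draft_next_token(ctx)
        proposals.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in proposals:                       # in practice scored in one batched target pass
        if target_next_token(ctx) != t:       # stop at the first disagreement
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next_token(ctx))   # the target always contributes one more token
    return accepted
```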

QLoRA: Efficient Finetuning of Quantized LLMs
NeurIPS 2023

Efficiency Improvements for Emerging Models and Applications

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
JMLR 2022
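
Switch Transformers replace a dense FFN with many expert FFNs and route each token to exactly one expert (top-1 routing), so parameter count scales without a matching increase in per-token FLOPs. A toy router sketch, ignoring expert capacity limits and the load-balancing loss:

```python
import torch

num_experts, d_model = 8, 512
experts = torch.nn.ModuleList([torch.nn.Linear(d_model, d_model) for _ in range(num_experts)])
router = torch.nn.Linear(d_model, num_experts)

x = torch.randn(16, d_model)                       # 16 tokens
probs = torch.softmax(router(x), dim=-1)
expert_idx = probs.argmax(dim=-1)                  # top-1 expert per token

y = torch.zeros_like(x)
for e in range(num_experts):                       # each token passes through one expert only
    mask = expert_idx == e
    if mask.any():
        y[mask] = probs[mask, e, None] * experts[e](x[mask])  # output scaled by the gate value
```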

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Arxiv 2023

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Arxiv 2023

Scalable Diffusion Models with Transformers
ICCV 2023

CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
ICDE 2024

Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
ISCA 2022

The Illustrated AlphaFold
Blog 2024