Publications Related to AI Efficiency

Parallelism Strategies for Training Massive Models

(Tensor-Slicing Parallelism; sketched at the end of this section) Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
arXiv 2019

(Pipeline Parallelism) GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
NeurIPS 2019

(ZeRO-style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
SC 2020

(3D Parallelism) Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
SC 2021

(Sequence Parallelism) Reducing Activation Recomputation in Large Transformer Models
arXiv 2022
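
The tensor-slicing scheme of the Megatron-LM entry above splits each Transformer MLP across GPUs: the first weight matrix is partitioned by columns and the second by rows, so the partial results need only a single all-reduce per block. Below is a minimal single-process sketch of that arithmetic in plain PyTorch; the loop over shards stands in for the participating ranks, and all sizes are placeholders rather than anything from Megatron's code.

    import torch

    torch.manual_seed(0)
    d_model, d_ff, world_size = 8, 32, 4

    x = torch.randn(2, d_model)        # (batch, hidden)
    w1 = torch.randn(d_model, d_ff)    # first MLP weight
    w2 = torch.randn(d_ff, d_model)    # second MLP weight

    # Reference: the unpartitioned forward pass.
    ref = torch.relu(x @ w1) @ w2

    # Megatron-style split: w1 by columns, w2 by rows, one shard per "rank".
    w1_shards = w1.chunk(world_size, dim=1)
    w2_shards = w2.chunk(world_size, dim=0)

    # Each rank computes its partial result independently; summing the partials
    # stands in for the single all-reduce that real tensor parallelism performs.
    partials = [torch.relu(x @ a) @ b for a, b in zip(w1_shards, w2_shards)]
    out = torch.stack(partials).sum(dim=0)

    assert torch.allclose(out, ref, atol=1e-5)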

System Optimizations for Training Large Models on Limited GPU Resources

(Gradient checkpointing, a.k.a. rematerialization; sketched at the end of this section) Training Deep Nets with Sublinear Memory Cost
arXiv 2016

ZeRO-Offload: Democratizing Billion-Scale Model Training
USENIX ATC 2021

ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
SC 2021

POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging
ICML 2022
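
Gradient checkpointing ("Training Deep Nets with Sublinear Memory Cost" above, and one ingredient of POET's rematerialization-plus-paging schedule) drops intermediate activations during the forward pass and recomputes them during backward, trading extra compute for a much smaller activation footprint. A minimal sketch with PyTorch's built-in checkpoint utility; the block sizes and count are arbitrary placeholders.

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint

    # A stack of blocks whose intermediate activations we do not want to keep.
    blocks = nn.ModuleList(
        [nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
         for _ in range(8)]
    )

    x = torch.randn(16, 1024, requires_grad=True)

    h = x
    for block in blocks:
        # The forward pass discards this block's intermediates; they are
        # recomputed on the backward pass, cutting activation memory.
        h = checkpoint(block, h, use_reentrant=False)

    h.sum().backward()
    print(x.grad.shape)  # gradients still flow as usual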

System Optimizations for Low Inference Latency and Cost

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
NeurIPS 2022

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
arXiv 2023

DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
SC 2022

ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
IPDPS 2023 (Best Paper)

(vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention
SOSP 2023

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
ICML 2023
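
FlashAttention and FlashAttention-2 above compute exact attention without materializing the full sequence-by-sequence score matrix that dominates memory traffic in the naive implementation. The sketch below contrasts that naive computation with PyTorch 2's scaled_dot_product_attention, which can dispatch to a FlashAttention-style fused kernel on supported GPUs; shapes are illustrative only.

    import math
    import torch
    import torch.nn.functional as F

    batch, heads, seq, head_dim = 2, 12, 512, 64
    q = torch.randn(batch, heads, seq, head_dim)
    k = torch.randn(batch, heads, seq, head_dim)
    v = torch.randn(batch, heads, seq, head_dim)

    # Naive attention: materializes a (seq x seq) score matrix per head.
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    naive = torch.softmax(scores, dim=-1) @ v

    # Fused attention: same result, no full score matrix; on supported GPUs
    # this call dispatches to a FlashAttention-style kernel.
    fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    assert torch.allclose(naive, fused, atol=1e-4)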

Efficient Algorithms to Make DL Models Smaller, Faster, and Cheaper

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
NeurIPS 2022

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
ICML 2023

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
NeurIPS 2023

Efficient Streaming Language Models with Attention Sinks
arXiv 2023

Fast Inference from Transformers via Speculative Decoding
ICML 2023

QLoRA: Efficient Finetuning of Quantized LLMs
NeurIPS 2023
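
The post-training quantization entries above (ZeroQuant, SmoothQuant, GPTQ) all build on the same round-to-nearest baseline: store weights in int8 with a per-channel scale and dequantize on the fly. The sketch below shows only that baseline; the papers' contributions lie in reducing the resulting error (e.g., by migrating activation outliers or using second-order weight updates), which is not reproduced here.

    import torch

    def quantize_int8_per_channel(w: torch.Tensor):
        """Symmetric absmax quantization, one scale per output channel (row)."""
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0
        q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.to(torch.float32) * scale

    w = torch.randn(4096, 4096)                 # placeholder weight matrix
    q, scale = quantize_int8_per_channel(w)

    # int8 storage is 4x smaller than fp32; the price is quantization error.
    err = (dequantize(q, scale) - w).abs().mean()
    print(f"mean abs error: {err:.5f}")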

System and Algorithm Co-Design for Efficient Training and Inference

(FP16 training with loss scaling; sketched at the end of this section) Mixed Precision Training
arXiv 2017

E.T.: Re-Thinking Self-Attention for Transformer Models on GPUs
SC 2021

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
arXiv 2023

Training and Inference of Large Language Models using 8-bit Floating Point
arXiv 2023
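
The Mixed Precision Training entry above runs the forward and backward passes in fp16 while the optimizer updates an fp32 master copy of the weights, with loss scaling so that small gradients do not underflow. The sketch below uses PyTorch's AMP utilities, which implement that recipe; mixed precision only engages when a CUDA device is available, and the model and sizes are placeholders.

    import torch
    from torch import nn

    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)   # loss scaling

    x = torch.randn(32, 512, device=device)
    y = torch.randint(0, 10, (32,), device=device)

    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        # Forward in fp16 where numerically safe; parameters stay in fp32.
        with torch.cuda.amp.autocast(enabled=use_cuda, dtype=torch.float16):
            loss = nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()   # backprop the scaled loss
        scaler.step(opt)                # unscale grads; skip step on overflow
        scaler.update()                 # adapt the loss scale for the next step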

Efficiency Improvements for Emerging Models and Applications

(Mixture-of-Experts; routing sketched at the end of this section) Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
JMLR 2022

InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
arXiv 2023

When Parameter-Efficient Tuning Meets General-Purpose Vision-Language Models
arXiv 2023

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
arXiv 2023

DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems
WWW 2021
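
The Switch Transformer entry above replaces a dense feed-forward layer with many expert networks and routes each token to exactly one of them, so per-token compute stays roughly constant while the parameter count grows with the number of experts. Below is a minimal top-1 routing sketch; the class name and sizes are placeholders, and it omits the capacity factor, load-balancing loss, and expert parallelism that the paper adds.

    import torch
    from torch import nn

    class SwitchFFN(nn.Module):
        """Simplified top-1 mixture-of-experts feed-forward layer."""
        def __init__(self, d_model=256, d_ff=1024, num_experts=8):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                               nn.Linear(d_ff, d_model))
                 for _ in range(num_experts)]
            )

        def forward(self, x):                     # x: (tokens, d_model)
            probs = torch.softmax(self.router(x), dim=-1)
            gate, expert_idx = probs.max(dim=-1)  # send each token to one expert
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                sel = expert_idx == i
                if sel.any():
                    # Scale by the router probability so the gate stays trainable.
                    out[sel] = gate[sel].unsqueeze(-1) * expert(x[sel])
            return out

    tokens = torch.randn(64, 256)                 # a flattened batch of tokens
    print(SwitchFFN()(tokens).shape)              # torch.Size([64, 256])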