(ZeRO-style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
SC 2020
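A minimal single-process sketch of the core ZeRO idea: rather than every data-parallel rank holding full optimizer state for all parameters, each rank owns the state for only a 1/N slice, updates that slice, and the updated slices are gathered back into the replicated parameter vector. The shard count, toy quadratic loss, and momentum-SGD update below are illustrative stand-ins, not the paper's Adam-based formulation or its reduce-scatter/all-gather communication schedule.

```python
import numpy as np

N_RANKS = 4
params = np.random.randn(16).astype(np.float32)            # replicated on every rank
shards = np.array_split(np.arange(params.size), N_RANKS)    # each rank owns one slice
momentum = [np.zeros(len(s), dtype=np.float32) for s in shards]  # optimizer state is sharded

def step(full_grad: np.ndarray, lr: float = 0.1, beta: float = 0.9) -> None:
    """Each 'rank' updates only the parameter slice whose state it owns."""
    updated = []
    for rank, idx in enumerate(shards):
        g = full_grad[idx]                   # a reduce-scatter would deliver this slice
        momentum[rank] = beta * momentum[rank] + g
        updated.append(params[idx] - lr * momentum[rank])
    params[:] = np.concatenate(updated)      # stand-in for the all-gather of updated shards

for _ in range(3):
    grad = 2.0 * params                      # gradient of a toy quadratic loss
    step(grad)
print("loss:", float(np.sum(params ** 2)))
```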
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
OSDI 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
NeurIPS 2022
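The recurrence FlashAttention is built on can be sketched in NumPy: stream over key/value blocks, keep a running row-wise max, softmax denominator, and un-normalized output, and never materialize the full score matrix. The real contribution is fusing this into a GPU kernel that tiles through SRAM; the block size and shapes here are arbitrary.

```python
import numpy as np

def attention_tiled(Q, K, V, block=32):
    """Exact softmax attention computed one key/value block at a time (online softmax)."""
    scale = 1.0 / np.sqrt(Q.shape[-1])
    m = np.full(Q.shape[0], -np.inf)          # running row-wise max of scores
    l = np.zeros(Q.shape[0])                  # running softmax denominator
    acc = np.zeros_like(Q)                    # running un-normalized output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        correction = np.exp(m - m_new)        # rescale old statistics to the new max
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=1)
        acc = acc * correction[:, None] + P @ Vb
        m = m_new
    return acc / l[:, None]

Q, K, V = (np.random.randn(8, 16), np.random.randn(100, 16), np.random.randn(100, 16))
S = (Q @ K.T) / np.sqrt(16)
ref = (np.exp(S - S.max(1, keepdims=True)) / np.exp(S - S.max(1, keepdims=True)).sum(1, keepdims=True)) @ V
assert np.allclose(attention_tiled(Q, K, V), ref)   # matches the naive computation exactly
```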
Orca: A Distributed Serving System for Transformer-Based Generative Models
OSDI 2022
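A toy scheduler loop illustrating Orca's iteration-level (continuous) batching: the batch is refilled and drained every decode step, so short requests exit immediately and new ones join without waiting for the whole batch to finish. The request representation and the stand-in decode step are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    remaining: int            # tokens still to generate

def decode_step(batch):
    """Stand-in for one batched forward pass emitting one token per active request."""
    for r in batch:
        r.remaining -= 1

def serve(requests, max_batch=3):
    waiting, active, step = deque(requests), [], 0
    while waiting or active:
        # Iteration-level scheduling: admit new requests every step, not once per batch.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        decode_step(active)
        step += 1
        for r in [r for r in active if r.remaining == 0]:
            print(f"step {step}: request {r.rid} finished")
        active = [r for r in active if r.remaining > 0]

serve([Request(0, 2), Request(1, 5), Request(2, 3), Request(3, 1)])
```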
Efficiently Scaling Transformer Inference
MLSys 2023
(vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention
SOSP 2023
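The bookkeeping behind PagedAttention can be sketched as a block-table allocator: the KV cache is carved into fixed-size physical blocks, each sequence maps to a list of block ids, and a new block is claimed only when the previous one fills up, so memory grows with actual sequence length instead of a padded maximum. This sketch tracks ids only; the real system stores key/value tensors in the blocks and adds sharing/copy-on-write for common prefixes.

```python
BLOCK_SIZE = 16               # KV entries per block; the real block size is a config knob

class PagedKVCache:
    """Maps each sequence to a list of fixed-size cache blocks drawn from a shared pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids
        self.lengths: dict[int, int] = {}              # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; returns the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                   # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("cache exhausted; a real server would preempt or swap")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token(seq_id=0)
print(cache.block_tables[0])    # 20 tokens -> two 16-entry blocks
```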
SGLang: Efficient Execution of Structured Language Model Programs
NeurIPS 2024
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023
Fast Inference from Transformers via Speculative Decoding
ICML 2023
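The acceptance rule at the heart of speculative decoding can be shown with toy, context-free distributions standing in for the target model p and the cheaper draft model q: a drafted token x is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the residual distribution proportional to max(0, p - q), which keeps the output distributed exactly as p. A real implementation drafts autoregressively, verifies all k drafted tokens with one target forward pass, and samples one extra token from p when every draft is accepted; those details are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8
p = rng.dirichlet(np.ones(VOCAB))   # toy target-model next-token distribution
q = rng.dirichlet(np.ones(VOCAB))   # toy draft-model next-token distribution

def speculative_step(k: int = 4) -> list[int]:
    """Draft up to k tokens from q, accepting or rejecting each against p."""
    out = []
    for _ in range(k):
        x = int(rng.choice(VOCAB, p=q))            # draft proposal
        if rng.random() < min(1.0, p[x] / q[x]):   # accept with prob min(1, p/q)
            out.append(x)
        else:
            residual = np.maximum(p - q, 0.0)      # resample from max(0, p - q), normalized
            residual /= residual.sum()
            out.append(int(rng.choice(VOCAB, p=residual)))
            break                                  # stop at the first rejection
    return out

print(speculative_step())
```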