(ZeRO-style Data Parallelism) ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
SC 2020
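The core ZeRO-1 idea is that optimizer state, which dominates memory under mixed-precision training, need not be replicated across data-parallel ranks: each rank owns and updates only a shard of it, and the updated parameters are reassembled afterwards. A minimal single-process sketch of that stage-1 partitioning, with made-up sizes and SGD-with-momentum standing in for Adam (not the paper's implementation):

```python
# Toy single-process simulation of ZeRO stage-1 optimizer-state sharding.
# Names, sizes, and the optimizer are illustrative, not from the paper.
import numpy as np

WORLD_SIZE = 4          # number of simulated data-parallel ranks
NUM_PARAMS = 16         # toy flattened parameter count (divisible by WORLD_SIZE)
LR, MOMENTUM = 0.1, 0.9

rng = np.random.default_rng(0)
params = rng.normal(size=NUM_PARAMS).astype(np.float32)
# Every rank sees the same (already all-reduced) gradient in data parallelism.
grads = rng.normal(size=NUM_PARAMS).astype(np.float32)

shard_size = NUM_PARAMS // WORLD_SIZE
# Each rank materializes momentum only for its own shard, so optimizer-state
# memory shrinks by a factor of WORLD_SIZE versus plain data parallelism.
momentum_shards = [np.zeros(shard_size, dtype=np.float32) for _ in range(WORLD_SIZE)]

def zero1_step(params, grads):
    updated_shards = []
    for rank in range(WORLD_SIZE):
        lo, hi = rank * shard_size, (rank + 1) * shard_size
        m = momentum_shards[rank]
        m[:] = MOMENTUM * m + grads[lo:hi]               # update local optimizer state
        updated_shards.append(params[lo:hi] - LR * m)    # update local parameter shard
    # "All-gather": every rank ends up with the full updated parameter vector.
    return np.concatenate(updated_shards)

params = zero1_step(params, grads)
print(params)
```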
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
OSDI 2022
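Alpa automatically searches a hierarchical space that combines intra-operator parallelism (sharding individual operators across devices) with inter-operator parallelism (splitting the model into pipeline stages). A toy NumPy illustration of those two axes only, with invented shapes and no actual multi-device execution or search (not Alpa's API):

```python
# Toy illustration of the two parallelism axes Alpa automates, on one process.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8)).astype(np.float32)      # a toy activation batch
w1 = rng.normal(size=(8, 8)).astype(np.float32)     # stage-1 weight
w2 = rng.normal(size=(8, 4)).astype(np.float32)     # stage-2 weight

def intra_op_matmul(x, w, num_devices=2):
    """Intra-operator parallelism: shard w column-wise across 'devices'."""
    shards = np.split(w, num_devices, axis=1)
    partials = [x @ shard for shard in shards]       # each device's local matmul
    return np.concatenate(partials, axis=1)          # gather along columns

# Inter-operator parallelism: stage 1 and stage 2 would live on different
# device groups and overlap across microbatches; here they simply run in order.
h = intra_op_matmul(x, w1)     # pipeline stage 1 (itself sharded intra-op)
y = intra_op_matmul(h, w2)     # pipeline stage 2

assert np.allclose(y, (x @ w1) @ w2, atol=1e-4)      # sharding preserves the result
print(y.shape)
```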
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
NeurIPS 2022
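FlashAttention avoids materializing the full N x N attention matrix by processing keys/values in blocks and maintaining a running softmax maximum and denominator per query. A minimal NumPy sketch of that online-softmax tiling, with an illustrative block size and without the query tiling or GPU SRAM management of the real kernel:

```python
# Block-wise attention with an online softmax; shapes and block size are toy values.
import numpy as np

def flash_attention_like(q, k, v, block_size=64):
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)          # running max of each query's scores
    row_sum = np.zeros(n)                  # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale                     # (n, block) partial scores
        new_max = np.maximum(row_max, scores.max(axis=1))
        # Rescale previously accumulated output and denominator to the new max.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(128, 16)) for _ in range(3))
# Reference: standard softmax attention, materializing the full score matrix.
s = (q @ k.T) / np.sqrt(16)
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p / p.sum(axis=1, keepdims=True)) @ v
assert np.allclose(flash_attention_like(q, k, v), ref, atol=1e-6)
```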
Orca: A Distributed Serving System for Transformer-Based Generative Models
OSDI 2022
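Orca's key scheduling idea is iteration-level (continuous) batching: scheduling decisions are made at every model iteration, so finished requests leave the batch immediately and queued requests join without waiting for the whole batch to drain. A toy simulation of that loop, with made-up request lengths and batch size:

```python
# Toy iteration-level scheduler; request lengths and MAX_BATCH are illustrative.
from collections import deque
from dataclasses import dataclass

MAX_BATCH = 2

@dataclass
class Request:
    rid: int
    tokens_left: int    # how many decode steps this request still needs

waiting = deque(Request(rid=i, tokens_left=n) for i, n in enumerate([3, 5, 2, 4]))
running: list[Request] = []
step = 0

while waiting or running:
    # Iteration-level scheduling: top up the running batch at every step.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One model iteration generates one token for every running request.
    for req in running:
        req.tokens_left -= 1
    finished = [r.rid for r in running if r.tokens_left == 0]
    running = [r for r in running if r.tokens_left > 0]

    step += 1
    print(f"step {step}: batch={[r.rid for r in running]} finished={finished}")
```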
(vLLM) Efficient Memory Management for Large Language Model Serving with PagedAttention
SOSP 2023
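PagedAttention manages the KV cache like virtual memory: each sequence's cache is a list of fixed-size blocks, mapped through a per-sequence block table onto physical blocks drawn from a shared pool, so memory is allocated on demand and reclaimed exactly when a sequence finishes. A minimal sketch of that bookkeeping, with illustrative block and pool sizes and no actual key/value tensors:

```python
# Toy PagedAttention-style block-table bookkeeping (sizes are illustrative).
BLOCK_SIZE = 16                           # tokens per KV-cache block
free_blocks = list(range(8))              # physical block ids available in the pool
block_tables: dict[int, list[int]] = {}   # seq_id -> logical-to-physical mapping
seq_lens: dict[int, int] = {}

def append_token(seq_id: int) -> None:
    """Account for one more generated token, allocating a block only when needed."""
    table = block_tables.setdefault(seq_id, [])
    length = seq_lens.get(seq_id, 0)
    if length % BLOCK_SIZE == 0:          # current block is full (or none exists yet)
        if not free_blocks:
            raise MemoryError("KV cache exhausted; preempt or swap a sequence")
        table.append(free_blocks.pop())   # grab one physical block on demand
    seq_lens[seq_id] = length + 1

def free_sequence(seq_id: int) -> None:
    """Return all physical blocks of a finished sequence to the pool."""
    free_blocks.extend(block_tables.pop(seq_id, []))
    seq_lens.pop(seq_id, None)

for _ in range(40):                       # sequence 0 generates 40 tokens
    append_token(0)
print(block_tables[0])                    # ceil(40 / 16) = 3 physical blocks in use
free_sequence(0)
```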
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023
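GPTQ quantizes a layer's weights one column at a time and folds each column's rounding error into the not-yet-quantized columns, using the inverse Hessian of the layer's calibration inputs. A simplified NumPy sketch of that loop, without the blocked/lazy updates and Cholesky tricks of the paper, with illustrative shapes, random calibration data, and a plain round-to-nearest baseline for comparison:

```python
# Simplified GPTQ-style column loop; shapes, data, and the 4-bit grid are toy values.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_calib = 8, 16, 256
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, n_calib))             # calibration activations

# Per-row symmetric 4-bit grid (round-to-nearest onto signed integer levels).
scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
quantize = lambda w: np.clip(np.round(w / scale), -8, 7) * scale

H = 2.0 * X @ X.T
H += 0.01 * np.mean(np.diag(H)) * np.eye(d_in)   # damping for invertibility
Hinv = np.linalg.inv(H)

Q = np.zeros_like(W)
W_work = W.copy()
for i in range(d_in):
    Q[:, i] = quantize(W_work[:, [i]])[:, 0]
    err = (W_work[:, i] - Q[:, i]) / Hinv[i, i]
    # Compensate: push the rounding error onto the remaining columns.
    W_work[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])

rtn = quantize(W)                                # plain round-to-nearest baseline
print("GPTQ-style output error:", np.linalg.norm(Q @ X - W @ X))
print("RTN output error:       ", np.linalg.norm(rtn @ X - W @ X))
```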
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023
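SmoothQuant makes activations quantization-friendly by migrating their per-channel outliers into the weights with a scale s, chosen so the layer output is mathematically unchanged. A minimal NumPy sketch with toy shapes and the paper's default alpha = 0.5:

```python
# Per-channel scale migration; shapes and the injected outlier are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
X[:, 3] *= 50.0                       # one input channel with large outliers
W = rng.normal(size=(8, 4))
alpha = 0.5                           # migration strength (paper default)

act_max = np.abs(X).max(axis=0)       # per input-channel activation range
w_max = np.abs(W).max(axis=1)         # per input-channel weight range
s = act_max**alpha / w_max**(1 - alpha)

X_smooth = X / s                      # divide activations by s per channel
W_smooth = W * s[:, None]             # multiply weights by s per channel

# The product is mathematically unchanged, but the activation outlier shrank.
assert np.allclose(X @ W, X_smooth @ W_smooth, atol=1e-6)
print(np.abs(X).max(), "->", np.abs(X_smooth).max())
```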
Fast Inference from Transformers via Speculative Decoding
ICML 2023
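Speculative decoding drafts tokens with a cheap model and verifies them with the target model using an accept/reject rule that preserves the target distribution exactly. A toy NumPy sketch of that rule for a single token over a small vocabulary, using fixed toy distributions rather than model outputs; real systems verify a whole drafted sequence with one target forward pass:

```python
# Accept/reject verification for one drafted token; p and q are toy distributions.
import numpy as np

rng = np.random.default_rng(0)

def verify_one_token(p, q):
    """Return a token distributed exactly according to p, using a draft from q."""
    draft = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True                        # accept the drafted token
    residual = np.maximum(p - q, 0.0)             # on rejection, resample from
    residual /= residual.sum()                    # the normalized residual
    return rng.choice(len(p), p=residual), False

p = np.array([0.6, 0.2, 0.1, 0.1])   # target model's next-token distribution
q = np.array([0.3, 0.4, 0.2, 0.1])   # draft model's next-token distribution

samples = [verify_one_token(p, q)[0] for _ in range(20000)]
print(np.bincount(samples, minlength=4) / len(samples))  # approximately equals p
```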
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
arXiv 2024
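Mamba's selective state-space layer makes the SSM parameters (the step size and the input/output projections) functions of the input and evaluates the recurrence as a scan. A toy single-channel NumPy sketch of that recurrence, with illustrative dimensions, random projections, and a simplified discretization:

```python
# Toy selective-SSM recurrence for one channel; all sizes and projections are illustrative.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_state = 32, 4
x = rng.normal(size=seq_len)                 # one input channel over time

A = -np.exp(rng.normal(size=d_state))        # fixed, negative (stable) diagonal A
W_B, W_C = (rng.normal(size=d_state) * 0.5 for _ in range(2))

h = np.zeros(d_state)
y = np.zeros(seq_len)
for t in range(seq_len):
    # Selection: the step size and projections depend on the current input x[t].
    delta = np.log1p(np.exp(x[t]))           # softplus keeps the step size positive
    B = W_B * x[t]                           # input-dependent input projection
    C = W_C * x[t]                           # input-dependent output projection
    # Simple discretization of the diagonal SSM.
    A_bar = np.exp(delta * A)
    B_bar = delta * B
    h = A_bar * h + B_bar * x[t]             # recurrent state update (the scan)
    y[t] = C @ h

print(y[:5])
```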