Training and Inference Systems
Goal of the course
Our goal is to bring students and collaborators together to read and discuss AI systems papers. The scope of our AI systems seminar covers pipeline training, communication scheduling, communication libraries, datacenter network topology, RDMA congestion control, distributed inference, and more. The papers presented by our students and collaborators appeared in recent editions of SIGCOMM, NSDI, SOSP, OSDI, EuroSys, ASPLOS, SC, and ATC.
Meeting Schedule
Week 10:
ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning, ASPLOS 2023——2023.09.19
Presenter: Yuxiao Wang
Week 9:
Lyra: Elastic Scheduling for Deep Learning Clusters, EuroSys 2023——2023.09.13
Presenter: Ying Zheng
Week 8:
An Overview of Collective Communication Library——2023.09.05
Presenter: Zhiyi Yao
Week 7:
Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference——2023.08.29
Presenter: Yuxiao Wang
Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys 2023——2023.08.29
Presenter: Yuxiao Wang
Week 6:
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs, NSDI 2023——2023.08.15
Presenter: Jiaxin Zhu
Re-architecting Congestion Management in Lossless Ethernet, NSDI 2020——2023.08.15
Presenter: Chao Peng
Week 5:
Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters, OSDI 2023——2023.08.08
Presenter: Ying Zheng
Week 4:
Congestion Detection in Lossless Networks, SIGCOMM 2021——2023.08.01
Presenter: Leyi Ye
Week 3:
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving, OSDI 2023——2023.07.25
Presenter: Yuxiao Wang
· Demonstrates that model parallelism can additionally be used for statistical multiplexing of multiple devices when serving multiple models
· Explores the new trade-off space and presents a novel serving system
Week 2:
Titan: A Scheduler for Foundation Model Fine-tuning Workloads, SoCC 2022——2023.07.18
Presenter: Yuxiao Wang
· Designs a scheduler to efficiently fine-tune foundation models (FMs) in a large-scale GPU cluster
Week 1:
Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs, ASPLOS 2023——2023.07.11
Presenter: Yuxiao Wang
· A non-intrusive deep learning workload scheduler based on interpretable models