AI System Seminar

Training and Inference Systems

Goal of the course

Our goal is to bring students and collaborators together to read and discuss AI systems papers. The scope of our AI system seminar covers pipeline training, communication scheduling, communication libraries, datacenter network topology, RDMA congestion control, distributed inference, etc. The papers presented by our students and collaborators appeared in recent SIGCOMM/NSDI/SOSP/OSDI/EuroSys/ASPLOS/SC/ATC proceedings.

Meeting Schedule

Week 10:

ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning, ASPLOS 2023——2023.09.19

Presenter: Yuxiao Wang

Week 9:

Lyra: Elastic Scheduling for Deep Learning Clusters, EuroSys 2023——2023.09.13

Presenter: Ying Zheng

Week 8:

An Overview of Collective Communication Libraries——2023.09.05

Presenter: Zhiyi Yao

Week 7:

Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference——2023.08.29

Presenter: Yuxiao Wang

Tabi: An Efficient Multi-Level Inference System for Large Language Models, EuroSys 2023——2023.08.29

Presenter: Yuxiao Wang

Week 6:

TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs, NSDI 2023——2023.08.15

Presenter: Jiaxin Zhu

Re-architecting Congestion Management in Lossless Ethernet, NSDI 2020——2023.08.15

Presenter: Chao Peng

Week 5:

Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters, OSDI 2023——2023.08.08

Presenter: Ying Zheng

Week 4:

Congestion Detection in Lossless Networks, SIGCOMM 2021——2023.08.01

Presenter: Leyi Ye

Week 3:

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving, OSDI 2023——2023.07.25

Presenter: Yuxiao Wang
· demonstrate that model parallelism can additionally be used for statistical multiplexing across multiple devices when serving multiple models
· explore the new trade-off space and present a novel serving system

Week 2:

Titan: A Scheduler for Foundation Model Fine-tuning Workloads, SoCC 2022——2023.07.18

Presenter: Yuxiao Wang
· design a scheduler to efficiently fine-tune FMs in a large-scale GPU cluster

Week 1:

Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs, ASPLOS 2023——2023.07.11

Presenter: Yuxiao Wang
· A non-intrusive deep learning workload scheduler based on interpretable models