Postdoctoral Position: Optimization & Learning for Collective Communication in Datacenters (LLM training & inference)

Forum 'Thèses et Post-Docs' - Topic created on 2025-04-15 by Youcef Magnouche

The Operations Research team of the Data Communication Network Algorithm and Measurement Technology Laboratory at the Huawei France Research Center, located in Boulogne-Billancourt, is looking for highly motivated candidates for a postdoctoral position on Optimization & Learning for Collective Communication in Datacenters (LLM training & inference).
Topic

Training and inference of large language models (LLMs) at scale require efficient communication across GPUs distributed within a data center. This communication overhead, which stems from exchanging model parameters, gradients, and activations, is a critical bottleneck for training speed and efficiency, as it can lower GPU utilization. Because these operations follow regular, periodic communication patterns, Collective Communication Libraries (CCLs) can be leveraged to organize and schedule the data exchange between GPUs. To further improve this communication process, several optimization techniques can be applied at the application layer (e.g., TE-CCL for efficient communication scheduling) or at the network layer (e.g., load balancing, traffic engineering, routing, and flow buffering/scheduling [1]). Various decision-making methods, from linear programming to reinforcement learning, can be used to model and efficiently control such networks.
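To make this "regular and periodic pattern" concrete, below is a minimal sketch, written for illustration only, of the transfer schedule generated by the classic ring all-reduce collective; real CCLs (e.g., NCCL) pipeline these chunks and overlap them with computation, which is not modelled here.

```python
# Minimal sketch of the transfer schedule behind a ring all-reduce, the
# canonical collective used to aggregate gradients across n GPUs.
# Illustration only: real CCLs pipeline chunks and overlap computation.

def ring_allreduce_schedule(n):
    """Return (phase, step, sender, receiver, chunk) tuples.

    Phase 1 (reduce-scatter): after n-1 steps, rank i holds the fully
    reduced chunk (i+1) mod n. Phase 2 (all-gather): after n-1 further
    steps, every rank holds the complete reduced tensor.
    """
    xfers = []
    for s in range(n - 1):  # reduce-scatter
        for i in range(n):
            xfers.append(("reduce-scatter", s, i, (i + 1) % n, (i - s) % n))
    for s in range(n - 1):  # all-gather
        for i in range(n):
            xfers.append(("all-gather", s, i, (i + 1) % n, (i - s + 1) % n))
    return xfers

for x in ring_allreduce_schedule(4):
    print(x)  # a fully deterministic, periodic communication pattern
```

Every GPU sends exactly 2(n-1) chunks along fixed ring edges, so the entire demand is known before the collective starts; this is the regularity that scheduling and routing optimization can exploit.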

This research topic concerns a single data center hosting multiple servers interconnected through a known, regular network topology. Frameworks such as Ray [4] parallelize training/inference jobs over the GPUs assigned to each user by the datacenter controller. The chosen parallelization model, together with the GPU placement, shapes the communication pattern induced by the AI jobs, as different types of data (models, parameters, gradients, and activations) are exchanged between the GPUs by CCLs [2]. The generated traffic has three notable characteristics: 1) flows are transferred from multiple GPUs to multiple GPUs; 2) traffic sizes vary; 3) traffic follows a specific, recognizable pattern.
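As a toy illustration of how the parallelization model and GPU placement shape the physical traffic, the same logical ring all-reduce induces very different server-to-server loads under different placements; the placements and sizes below are invented for the example:

```python
# Toy example: the server-to-server traffic matrix induced by one logical
# ring all-reduce under two GPU placements. Placements and sizes are
# invented for illustration; intra-server transfers are treated as free.

from collections import defaultdict

def traffic_matrix(rank_to_server, chunk_bytes):
    """Aggregate inter-server traffic: each rank sends 2*(n-1) chunks
    to its ring successor over the whole collective."""
    n = len(rank_to_server)
    tm = defaultdict(int)
    for rank, src in rank_to_server.items():
        dst = rank_to_server[(rank + 1) % n]
        if dst != src:
            tm[(src, dst)] += 2 * (n - 1) * chunk_bytes
    return dict(tm)

packed      = {0: "s0", 1: "s0", 2: "s1", 3: "s1"}  # ring crosses servers twice
interleaved = {0: "s0", 1: "s1", 2: "s0", 3: "s1"}  # ring crosses servers 4 times
print(traffic_matrix(packed, chunk_bytes=1))        # {('s0','s1'): 6, ('s1','s0'): 6}
print(traffic_matrix(interleaved, chunk_bytes=1))   # twice the cross-server load
```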

In the public cloud, multiple LLMs belonging to different tenants can be trained simultaneously, sharing the same resources, such as link bandwidth. Network resources are limited and must be efficiently managed to prevent the congestion and losses that would dramatically degrade training/inference efficiency.

Two research parts have to be considered:

1) Offline part - Bi-level optimization problem:
To achieve optimal performance, GPU allocation and task scheduling must be optimized jointly with traffic routing. This requires a bi-level optimization algorithm, where:
· Upper level: decides the GPU assignment of each job and the schedule of the associated tasks.
· Lower level: optimizes routing to minimize the training completion time.

In this part, the ongoing concurrent traffic must be estimated to ensure lossless, congestion-free communication. By solving these problems jointly, we can enhance network efficiency, reduce delays, and accelerate LLM training. An abstract sketch of the bi-level structure is given below.
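In abstract form, the problem could be written as the following bi-level program; the notation is ours, introduced purely for illustration, and developing the actual model is part of the postdoc:

```latex
% x : upper-level decisions (GPU assignment of jobs, task schedule)
% r : lower-level decisions (traffic routing)
% R(x) : routings feasible under the capacity left by estimated concurrent traffic
\begin{align*}
\min_{x \in \mathcal{X}} \quad & F\bigl(x, r^{*}(x)\bigr)
    && \text{(e.g., overall job completion time)} \\
\text{s.t.} \quad & r^{*}(x) \in \operatorname*{arg\,min}_{r \in \mathcal{R}(x)} \; T(x, r)
    && \text{(routing minimizing communication time)}
\end{align*}
```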

2) Online part - CCL buffering and load balancing:
Buffering a CCL consists in holding back some of its traffic transfers for some time before resuming their transmission. This decision cannot be taken efficiently offline, since there is no precise knowledge of which concurrent training/inference jobs will be active at the same time. The mechanism helps optimize resource utilization and, ultimately, the final job completion time. In this part, we need to decide which flows to delay, taking into account the current network resources (bandwidth, congestion levels) and the current job characteristics, as well as which load-balancing decisions to take given the network status. Task buffering and load balancing must dynamically adapt to considerable uncertainty: a) randomness in background traffic and b) dynamic job arrivals. By effectively managing dynamic task buffering, we can reduce congestion, improve network efficiency, and accelerate LLM training (see the policy sketch below).
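As a deliberately simplistic stand-in for the policy to be learned, a threshold rule over bottleneck-link utilization already captures the shape of the online decision; all names and values below are illustrative:

```python
# Toy online buffering rule: release a pending CCL transfer only if the
# most loaded link on its path is below a utilization threshold.
# A learned RL policy would replace this fixed-threshold heuristic.

def buffering_policy(pending, link_load, capacity, threshold=0.8):
    """Release a pending transfer only if its bottleneck link is cool.

    pending:   list of (flow_id, path), where path is a list of link ids
    link_load: dict, link id -> bandwidth currently in use
    capacity:  dict, link id -> link capacity
    Returns (released, delayed) lists of flow ids.
    """
    released, delayed = [], []
    for flow, path in pending:
        util = max(link_load[l] / capacity[l] for l in path)
        (released if util < threshold else delayed).append(flow)
    return released, delayed

load = {"a": 90.0, "b": 20.0}  # link "a" is congested
cap  = {"a": 100.0, "b": 100.0}
print(buffering_policy([("f1", ["a"]), ("f2", ["b"])], load, cap))
# -> (['f2'], ['f1']): f1 is buffered until link "a" cools down
```

An RL agent would replace the fixed threshold with a state-dependent policy trained under the random background traffic and dynamic job arrivals listed above.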

Methodology
From a methodological perspective, the candidate will first investigate routing and task-scheduling problems in datacenters for LLM training & inference where optimization is relevant (e.g., [2]). This will inform the design of a bi-level optimization model for the problem, from which an efficient heuristic must be derived for fast optimization. The second step of the postdoc focuses on task buffering and load balancing. The candidate will first carry out a thorough literature analysis on this topic and investigate the potential of reinforcement learning. From a theoretical point of view, the candidate will explore both problem modelling and solution methods able to unveil the structural properties of an optimal policy; knowing, for instance, that an optimal policy is of threshold type drastically shrinks the search space and speeds up learning convergence. Finally, the whole pipeline will be evaluated in simulations with realistic state spaces, including benchmarks against state-of-the-art approaches.

Specific Requirements
Candidates should hold a Ph.D. in Operations Research, Computer Science, or Applied Mathematics from a university or a Grande École. They should have a solid background in combinatorial optimization and reinforcement learning. Knowledge of telecommunications, LLM training, or CCLs will be appreciated.

English: Operational
Contacts
Huawei FRC: Dr. Youcef Magnouche (youcef.magnouche@huawei.com)

Application  
To apply, please send a complete CV, a cover letter, transcripts of your University/Grande École studies, and references. The position is for 12 months (with a possible extension to 18 months), starting as soon as possible.

Deadline:
Applications should be submitted as soon as possible; we will continue accepting them until the position is filled.

Huawei
The Huawei France Research Center (FRC), located in Boulogne-Billancourt in the Paris area, is responsible for advanced research in the fields of Algorithm and Software Design, Aesthetics, MBB & Home Devices, and Parallel Computing, with the goal of creating and designing innovative technologies and software platforms.

References
[1] R. Li, D. Fu, C. Shi, Z. Huang, and G. Lu, "Efficient LLMs Training and Inference: An Introduction," IEEE Access, vol. 13, pp. 32944-32970, 2025. https://doi.org/10.1109/ACCESS.2024.3501358
[2] X. Liu, B. Arzani, S. K. R. Kakarla, L. Zhao, V. Liu, M. Castro, S. Kandula, and L. Marshall, "Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem," in Proceedings of the ACM SIGCOMM 2024 Conference (SIGCOMM '24), pp. 16-37, 2024. https://doi.org/10.1145/3651890.3672249
[3] Z. Ye, W. Gao, Q. Hu, P. Sun, X. Wang, Y. Luo, T. Zhang, and Y. Wen, "Deep Learning Workload Scheduling in GPU Datacenters: A Survey," ACM Computing Surveys, vol. 56, no. 6, article 146, 38 pages, 2024. https://doi.org/10.1145/3638757
[4] Ray: ray.io