2025

On Provable Benefits of Muon in Federated Learning

Xinwen Zhang, Hongchang Gao

Preprint. 2025

The recently introduced optimizer, Muon, has gained increasing attention due to its superior performance across a wide range of applications. However, its effectiveness in federated learning remains unexplored. To address this gap, this paper investigates the performance of Muon in the federated learning setting. Specifically, we propose a new algorithm, FedMuon, and establish its convergence rate for nonconvex problems. Our theoretical analysis reveals multiple favorable properties of FedMuon. In particular, due to its orthonormalized update direction, the learning rate of FedMuon is independent of problem-specific parameters and, importantly, it can naturally accommodate heavy-tailed noise. Extensive experiments on a variety of neural network architectures validate the effectiveness of the proposed algorithm.
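
The mechanism behind these properties is Muon's orthonormalized update: the (matrix-shaped) momentum is replaced by an approximation of its orthogonal polar factor, so every step has unit singular values regardless of the gradient scale. The snippet below is a minimal sketch of that idea rather than the authors' FedMuon implementation; the cubic Newton-Schulz variant, the step size, and the momentum coefficient are illustrative assumptions.

```python
import numpy as np

def orthonormalize(M, steps=5):
    """Approximate U V^T from the SVD M = U S V^T via cubic Newton-Schulz iterations."""
    X = M / (np.linalg.norm(M) + 1e-8)   # scale so all singular values lie in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # converges toward the nearest orthogonal factor
    return X

def local_muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical local Muon-style step on a weight matrix W."""
    momentum = beta * momentum + (1 - beta) * grad
    W = W - lr * orthonormalize(momentum)  # the update direction has unit singular values
    return W, momentum
```

Because the norm of the update direction is fixed by construction, the step size does not need to be tuned against the gradient scale, which is consistent with the learning-rate and heavy-tailed-noise properties stated above.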

Nonconvex Decentralized Stochastic Bilevel Optimization under Heavy-Tailed Noises

Xinwen Zhang, Yihan Zhang, Hongchang Gao

Preprint. 2025

Existing decentralized stochastic bilevel optimization methods assume that the lower-level loss function is strongly convex and that the stochastic gradient noise has finite variance. These strong assumptions are typically not satisfied in real-world machine learning models. To address these limitations, we develop a novel decentralized stochastic bilevel optimization algorithm for nonconvex bilevel problems under heavy-tailed noises. Specifically, we develop a normalized stochastic variance-reduced bilevel gradient descent algorithm that does not rely on any clipping operation. Moreover, we establish its convergence rate by innovatively bounding the interdependent gradient sequences under heavy-tailed noises for nonconvex decentralized bilevel optimization problems. To the best of our knowledge, this is the first decentralized bilevel optimization algorithm with rigorous theoretical guarantees under heavy-tailed noises. Extensive experimental results confirm the effectiveness of our algorithm in handling heavy-tailed noises.
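
The key mechanism named in the abstract is to pair a variance-reduced gradient estimator with normalization instead of clipping, so a single heavy-tailed sample cannot produce an arbitrarily large step. The single-node sketch below illustrates this combination in our own notation; the STORM-style recursive estimator, the constants, and the omission of the decentralized and bilevel structure are simplifying assumptions, not the paper's algorithm.

```python
import numpy as np

def normalized_vr_step(x, grad_new, grad_old, d_prev, lr=0.01, beta=0.9):
    """Variance-reduced direction followed by a normalized (unit-length) step."""
    # recursive estimator: grad_new and grad_old are evaluated on the same minibatch
    # at the current and previous iterates, respectively
    d = grad_new + (1 - beta) * (d_prev - grad_old)
    # normalization bounds every step, so no clipping threshold is required
    x = x - lr * d / (np.linalg.norm(d) + 1e-12)
    return x, d
```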

On the Convergence of Stochastic Smoothed Multi-Level Compositional Gradient Descent Ascent

Xinwen Zhang, Hongchang Gao

Advances in Neural Information Processing Systems (NeurIPS) 2025

Multi-level compositional optimization is a fundamental framework in machine learning with broad applications. While recent advances have addressed compositional minimization problems, the stochastic multi-level compositional minimax problem introduces significant new challenges, most notably the biased nature of stochastic gradients for both the primal and dual variables. In this work, we address this gap by proposing a novel stochastic multi-level compositional gradient descent-ascent algorithm, incorporating a smoothing technique under the nonconvex-PL condition. We establish a convergence rate to an $(\epsilon, \epsilon/\sqrt{\kappa})$-stationary point with an improved dependence on the condition number of $O(\kappa^{3/2})$, where $\epsilon$ denotes the solution accuracy and $\kappa$ represents the condition number. Moreover, we design a novel stage-wise algorithm with variance reduction to address the biased gradient issue under the two-sided PL condition. This algorithm enables a translation from an $(\epsilon, \epsilon/\sqrt{\kappa})$-stationary point to an $\epsilon$-stationary point. Finally, extensive experiments validate the effectiveness of our algorithms.
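
For concreteness, one common way to write this problem class is the $K$-level compositional minimax objective below (our notation; the paper's exact formulation may differ):

$$\min_{x}\ \max_{y}\ F(x,y) \;=\; f_K\Bigl(f_{K-1}\bigl(\cdots f_1(x,y)\cdots\bigr)\Bigr), \qquad f_k(\cdot) = \mathbb{E}_{\xi_k}\bigl[f_k(\cdot;\xi_k)\bigr].$$

The bias mentioned above arises because the chain-rule gradient involves functions of expectations, e.g. $\nabla f_2\bigl(\mathbb{E}[f_1(x,y;\xi_1)]\bigr)$, whereas a plug-in estimator only has access to $\nabla f_2\bigl(f_1(x,y;\xi_1)\bigr)$; the two differ in expectation whenever $f_2$ is nonlinear.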

Sharpness-Aware Optimization Through Variance Suppression on Deep AUC Maximization

Xinwen Zhang, Hongchang Gao

The IEEE International Conference on Data Mining (ICDM) 2025

🔹 Motivation: Sharpness-Aware Minimization (SAM) enhances generalization, yet prior studies rarely explore the minimax optimization perspective.
🔹 Challenge: The loss landscape of minimax optimization is inherently more complex.
🔹 Algorithm: We propose VaSSO-SGDAM, a Variance-Suppressed Sharpness-Aware Optimization algorithm for Deep AUC Maximization (see the sketch after this list).
🔹 Theory: We establish the first convergence guarantee for sharpness-aware optimization in AUC maximization.
🔹 Experiments: Experiments on benchmark and medical datasets show that VaSSO-SGDAM consistently outperforms existing baselines, and eigenvalue spectral analysis reveals a smoother AUC loss landscape.
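
As referenced above, the sketch below illustrates the variance-suppression idea on a single minimization variable: the SAM perturbation is computed from an exponentially averaged gradient direction rather than from the raw stochastic gradient. This is a simplified illustration in our own notation, not the VaSSO-SGDAM algorithm, which operates on the minimax AUC objective with both primal and dual updates; all constants are hypothetical.

```python
import numpy as np

def variance_suppressed_sam_step(w, grad_fn, d_prev, lr=0.01, rho=0.05, theta=0.9):
    """SAM-style step whose perturbation direction is an EMA of past stochastic gradients."""
    g = grad_fn(w)
    d = theta * d_prev + (1 - theta) * g               # suppress the variance of the perturbation direction
    w_adv = w + rho * d / (np.linalg.norm(d) + 1e-12)  # perturb toward the locally sharper neighborhood
    w_new = w - lr * grad_fn(w_adv)                    # descend using the gradient at the perturbed point
    return w_new, d
```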

2024

A Federated Stochastic Multi-level Compositional Minimax Algorithm for Deep AUC Maximization

Xinwen Zhang, Ali Payani, Myungjin Lee, Richard Souvenir, Hongchang Gao

International Conference on Machine Learning (ICML) 2024

AUC maximization is an effective approach to the imbalanced data classification problem in federated learning. In the past few years, several federated AUC maximization approaches have been developed based on minimax optimization. However, directly solving a minimax optimization problem to maximize the AUC score cannot achieve satisfactory performance. To address this issue, we propose to maximize AUC by optimizing a federated multi-level compositional minimax problem. Specifically, we develop a novel federated multi-level compositional minimax algorithm with rigorous theoretical guarantees, addressing this new learning paradigm in both algorithmic design and theoretical analysis. To the best of our knowledge, this is the first work studying the multi-level minimax optimization problem. Additionally, extensive empirical evaluations confirm the efficacy of our proposed approach.
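
Setting the multi-level structure aside, the federated part of such algorithms typically follows the familiar local-update template: every client runs several stochastic gradient descent-ascent steps on its own data, and the server periodically averages both the primal and the dual variables. The sketch below shows that generic template only; it is not the paper's algorithm, and all names and constants are illustrative.

```python
import numpy as np

def federated_gda_round(clients, x, y, local_steps=5, lr_x=0.01, lr_y=0.01):
    """One communication round: local descent-ascent on every client, then server averaging."""
    xs, ys = [], []
    for grad_x, grad_y in clients:            # each client exposes stochastic gradient oracles
        xc, yc = x.copy(), y.copy()
        for _ in range(local_steps):
            xc = xc - lr_x * grad_x(xc, yc)   # primal descent step
            yc = yc + lr_y * grad_y(xc, yc)   # dual ascent step
        xs.append(xc)
        ys.append(yc)
    # the server averages both the primal and the dual iterates
    return sum(xs) / len(xs), sum(ys) / len(ys)
```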

2023

Federated Compositional Deep AUC Maximization

Xinwen Zhang, Yihan Zhang, Tianbao Yang, Richard Souvenir, Hongchang Gao

Advances in Neural Information Processing Systems (NeurIPS) 2023

Federated learning has attracted increasing attention due to the promise of balancing privacy and large-scale learning; numerous approaches have been proposed. However, most existing approaches focus on problems with balanced data, and prediction performance is far from satisfactory for many real-world applications where the number of samples in different classes is highly imbalanced. To address this challenging problem, we develop a novel federated learning method for imbalanced data by directly optimizing the area under the curve (AUC) score. In particular, we formulate the AUC maximization problem as a federated compositional minimax optimization problem, develop a local stochastic compositional gradient descent ascent with momentum algorithm, and provide bounds on the computational and communication complexities of our algorithm. To the best of our knowledge, this is the first work to achieve such favorable theoretical results. Finally, extensive experimental results confirm the efficacy of our method.
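
The reason AUC maximization becomes a minimax problem is the well-known reformulation of the pairwise square-loss surrogate: introducing two auxiliary primal scalars $a, b$ and one dual scalar $\alpha$ removes the dependence on pairs of examples. In the notation below (ours, with $h_w$ the model score and $x^+$, $x^-$ a positive and a negative example), the surrogate satisfies

$$\mathbb{E}_{x^+,x^-}\Bigl[\bigl(1 - h_w(x^+) + h_w(x^-)\bigr)^2\Bigr] \;=\; \min_{a,b}\ \max_{\alpha}\ \mathbb{E}_{(x,y)}\bigl[F(w,a,b,\alpha;x,y)\bigr]$$

for a suitable per-example function $F$, so the objective can be optimized one example at a time as a minimax problem in $(w,a,b)$ and $\alpha$. The compositional structure studied in this work is built on top of this minimax formulation.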
