Recently, the research team of Assistant Professor Yaodong Yang at the Institute for Artificial Intelligence, Peking University, published a paper titled "Efficient and scalable reinforcement learning for large-scale network control" in the journal Nature Machine Intelligence. The paper introduces a model-based decentralized policy optimization method that enables efficient decentralized collaborative training and decision-making in multi-agent systems, markedly improving the scalability and applicability of AI decision models in large-scale multi-agent environments.

Figure 1 Screenshot of the paper
Achieving efficient and scalable decision-making in large-scale multi-agent systems is a crucial objective in AI development. These systems rely on extensive interaction data among numerous agents and substantial computational resources to learn cooperative strategies for complex tasks, a core paradigm known as multi-agent reinforcement learning (MARL). Significant progress in this field has led to applications such as game AI.
Currently, two primary learning paradigms exist: centralized learning and independent learning. Centralized learning requires each agent to have global observation capabilities, which substantially increases algorithmic complexity and communication costs and thus limits scalability in large-scale systems. Conversely, independent learning simplifies system and algorithmic complexity but often results in unstable learning and suboptimal decision performance. Notably, in real-world scenarios beyond gaming, practical interaction constraints and cost factors make it difficult for existing methods to scale effectively. In urban traffic systems, for instance, controlling traffic signals involves frequent large-scale communication, which increases power consumption and susceptibility to signal interference, while computational complexity grows exponentially with the number of traffic lights. Designing multi-agent reinforcement learning methods that can extend decision-making to complex real-world systems with numerous agents, under limited data and communication budgets, is therefore imperative.
This research addresses these challenges by reducing the reliance of multi-agent learning methods on global communication and extensive interaction data, enabling reinforcement learning algorithms to be deployed widely and scaled efficiently in complex large-scale systems, a significant step toward scalable decision-making paradigms.

Figure 2 The difference between centralized learning and independent learning, the starting point of this study, and the types of networked systems involved
In this study, the research team decoupled the dynamic characteristics of large-scale multi-agent systems at the agent level, representing inter-agent relationships as networked structures under various topologies, including linear, circular, and mesh configurations with homogeneous or heterogeneous nodes, thereby reducing system processing complexity. Previous studies have also modeled inter-agent relationships in a networked manner to enhance algorithmic scalability. However, such system decompositions often rely on strong assumptions that may not align with real-world system characteristics. Therefore, the team proposed a more general networked system model to characterize the relationship between the dynamics of decoupled multi-agent systems and real-world system dynamics, capable of handling a broader range of cooperative multi-agent tasks. This concept bridges the gap between standard network systems and general multi-agent systems, providing a necessary theoretical framework and analytical tools for decentralized multi-agent system research.
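
To make the decoupling concrete, here is a minimal, hypothetical sketch of such a networked system in Python. The class, the ring topology, and the toy local dynamics are illustrative assumptions, not the paper's implementation; the key property being shown is that each agent's next state depends only on its own state and action and on those of its graph neighbors:

```python
import numpy as np

class NetworkedSystem:
    """Toy networked multi-agent system: the global transition factorizes
    into per-agent local transitions that see only neighboring states."""

    def __init__(self, adjacency, local_step):
        self.adjacency = adjacency    # adjacency[i] = list of neighbor ids
        self.local_step = local_step  # f(i, s_i, a_i, neighbor_states)

    def step(self, states, actions):
        # Agent i never touches the full global state, only N(i).
        return [
            self.local_step(i, states[i], actions[i],
                            [states[j] for j in self.adjacency[i]])
            for i in range(len(states))
        ]

# A circular (ring) topology with five homogeneous agents.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}

def toy_dynamics(i, s, a, neighbor_states):
    # Illustrative local dynamics: drift toward the neighborhood mean,
    # nudged by the agent's own action.
    return 0.5 * s + 0.5 * float(np.mean(neighbor_states)) + a

system = NetworkedSystem(ring, toy_dynamics)
next_states = system.step([0.0, 1.0, 2.0, 3.0, 4.0], [0.1] * 5)
```

Swapping the ring for a linear or mesh adjacency changes only the `adjacency` dictionary, which is what makes this kind of formulation agnostic to the topologies mentioned above.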

Furthermore, building on this generalized networked system, the team extended model-learning theory from single-agent learning to multi-agent systems, enabling each agent to independently learn local state transitions, value functions based on neighborhood information, and a decentralized policy, thereby transforming complex large-scale decision problems into more tractable optimization problems. As a result, large AI systems can achieve satisfactory decision performance even under limited sample data and information exchange. As early as the 1990s, reinforcement learning pioneer Richard Sutton proposed model-based methods that learn a system's intrinsic dynamics to assist policy learning and improve sample efficiency. In this work, the research team coupled localized model learning with decentralized policy optimization, proposing a model-based decentralized policy optimization method. The method is efficient and scalable: even when inter-agent information exchange is limited, agents can approximate monotonically improving policies using minimal local information. Specifically, agents use well-trained localized models to predict future states and rely on local communication to convey these predictions.
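
As a deliberately simplified illustration of the localized model-learning step, the sketch below has each agent fit a model of its own next state from its neighborhood's states and its own action, using only locally collected data. The linear least-squares model and every name here are assumptions standing in for whatever function approximator the paper actually uses:

```python
import numpy as np

class LocalDynamicsModel:
    """Per-agent dynamics model: predicts the agent's next local state
    from [own state, neighbor states, own action], with no global state."""

    def __init__(self, in_dim):
        self.w = np.zeros(in_dim)

    def fit(self, X, y):
        # X rows: [s_i, states of N(i), a_i]; y: observed next state s_i'.
        self.w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, x):
        # Predicted next local state; such predictions (rather than raw
        # global observations) are what an agent shares with its neighbors.
        return x @ self.w

# Each agent trains on its own transition buffer, collected locally.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))             # [s_i, s_left, s_right, a_i]
y = X @ np.array([0.5, 0.2, 0.2, 1.0])    # toy ground-truth local dynamics
model = LocalDynamicsModel(4).fit(X, y)
```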

Figure 3 Multi-agent model learning process
To mitigate model prediction errors, the team adopted a branched rollout strategy, replacing a few long rollouts with many short ones to reduce the compounding errors of model learning and prediction, thereby promoting approximately monotonic improvement during policy learning.
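
A minimal sketch of the branched-rollout idea follows; the function names and the horizon `k` are illustrative, and `model` and `policy` stand for the learned localized model and the current policy:

```python
def branched_rollouts(model, policy, real_states, k=5):
    """Generate synthetic transitions via many short model rollouts.

    Each rollout branches off a state actually visited in the real system
    and runs for only k steps, so model errors can compound for at most k
    steps instead of across one entire long imagined trajectory.
    """
    synthetic = []
    for s in real_states:              # one short branch per real state
        for _ in range(k):
            a = policy(s)              # act with the current policy
            s_next = model(s, a)       # imagine the next state in the model
            synthetic.append((s, a, s_next))
            s = s_next
    return synthetic
```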

The team further proved that the extended value functions and policy gradients obtained from the system decoupling closely approximate their true counterparts, establishing a crucial theoretical link between decentralized model learning and monotonic policy improvement.
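
The paper's precise statement is not reproduced here, but guarantees of this kind in model-based reinforcement learning (for example, the branched-rollout analysis of MBPO by Janner et al.) typically take the following shape:

```latex
% General shape of a model-based improvement guarantee (illustrative; not
% the paper's exact theorem). The true return J(\pi) is lower-bounded by
% the return \hat{J}(\pi) estimated under the learned model, minus a gap
% C that grows with the model error \epsilon_m, the policy shift
% \epsilon_\pi, and the rollout length k; improving \hat{J}(\pi) by more
% than C therefore guarantees improvement in the real system.
J(\pi) \;\ge\; \hat{J}(\pi) - C(\epsilon_m, \epsilon_\pi, k)
```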

Experiments on multiple benchmarks show that the method scales to large networked systems, such as power grids and traffic networks with hundreds of agents, achieving strong decision performance at low communication cost.

Figure 4 Effect of the method in intelligent traffic control scenarios
In intelligent traffic control scenarios, traffic signals controlled by this method regulate complex traffic flows by receiving traffic information only from adjacent intersections. This works because, under the designed networked structure, overall traffic conditions are indirectly transmitted and aggregated through the urban road network to neighboring intersections; by analyzing observations from these adjacent intersections, each signal can infer and predict traffic changes across the city and make well-informed decisions, as the sketch below illustrates.
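
The multi-hop aggregation behind this traffic example can be illustrated with a toy diffusion process (purely illustrative; the graph, the signal, and the averaging rule are assumptions, not the paper's algorithm): information injected at one node reaches a node h hops away only after h rounds of strictly local exchange, yet it does arrive.

```python
import numpy as np

def diffuse(adjacency, signal, rounds):
    """One round averages each node's value with its neighbors', so
    information travels exactly one hop per round."""
    x = np.asarray(signal, dtype=float)
    for _ in range(rounds):
        x = np.array([
            np.mean([x[i]] + [x[j] for j in adjacency[i]])
            for i in range(len(x))
        ])
    return x

# A six-intersection road corridor (line graph): congestion appears at
# intersection 0 and, after five rounds of neighbor-only exchange, is
# visible at intersection 5.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(diffuse(line, [1, 0, 0, 0, 0, 0], rounds=5))
```

The method's scalability is also validated in smart grids, where it achieves low power losses in power networks with hundreds of nodes.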

Figure 5 Results of the method in smart grid control scenarios
The first author of the paper is Chengdong Ma, a doctoral student at the Institute for Artificial Intelligence, Peking University. The corresponding author is Yaodong Yang. Aming Li, a researcher at the Multi-Agent Research Center of the College of Engineering and the Institute for Artificial Intelligence, and Professor Yali Du from King's College London are co-corresponding authors.