Highway On-Ramp Merging Human-Like Decision Making Based on BC-MAAC Algorithm


Introduction

The rapid development of autonomous driving technology has led to widespread deployment of intelligent connected systems. In mixed traffic environments where connected autonomous vehicles (CAVs) coexist with human-driven vehicles (HDVs), highway on-ramp merging remains one of the most challenging scenarios: CAVs must understand their surroundings and make appropriate driving decisions to merge safely and efficiently without disrupting traffic flow. Traditional approaches to on-ramp merging rely on mathematical models or deep reinforcement learning (DRL), but these methods often struggle to design reward functions that capture human-like intelligence and cooperative behavior.

To address these challenges, this paper proposes a novel human-like decision-making scheme for highway on-ramp merging based on the BC-MAAC (Behavior Cloning – Multi-Actor Attention Critic) algorithm. The approach integrates behavior cloning (BC) with multi-agent reinforcement learning (MARL), leveraging expert demonstrations to shape reward functions and improve decision-making. Additionally, an action masking mechanism is introduced to filter out unsafe or ineffective actions, enhancing learning efficiency.

Background and Related Work

Challenges in Highway On-Ramp Merging

Highway on-ramp merging requires CAVs to coordinate with surrounding vehicles while maintaining safety and efficiency. Traditional rule-based and optimization-based methods have been applied to this problem, but they often lack flexibility in dynamic environments. Reinforcement learning (RL) and DRL have shown promise in autonomous driving tasks, but training controllers that ensure safety remains difficult.

Multi-Agent Reinforcement Learning (MARL)

MARL has gained attention for its scalability and robustness in cooperative decision-making tasks. Previous studies have explored MARL frameworks for intersection management, lane-changing, and highway merging. However, many existing approaches do not fully capture human-like coordination, leading to suboptimal performance in complex scenarios.

Imitation Learning and Expert Guidance

Imitation learning, particularly behavior cloning, has been used to incorporate human expertise into RL frameworks. By learning from expert demonstrations, autonomous agents can develop more natural and safe driving behaviors. Recent work has combined imitation learning with DRL to improve decision-making in autonomous driving, but few studies have applied this approach to multi-agent highway merging scenarios.

The BC-MAAC Algorithm

Overview

The BC-MAAC algorithm combines behavior cloning with the MAAC (Multi-Actor Attention Critic) framework to enhance cooperative decision-making in highway on-ramp merging. The key contributions include:

  1. Expert Demonstration Collection: Human experts control multiple CAVs in a simulated environment, generating expert trajectories that capture cooperative merging behavior (a behavior-cloning sketch follows this list).
  2. Reward Shaping with KL Divergence: The KL divergence between the expert policy and the agent’s current policy is used to shape the reward function, guiding the agent toward human-like behavior.
  3. Action Masking: Unsafe or invalid actions are filtered out at each step, improving learning efficiency and safety.
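
To make the first contribution concrete, the sketch below pre-trains a policy network on expert (observation, action) pairs with a standard behavior-cloning cross-entropy loss. It is a minimal illustration only: the network sizes, tensor shapes, and training loop are assumptions, not the paper's exact implementation.

```python
# Hypothetical behavior-cloning pre-training on expert (observation, action) pairs.
# Layer sizes, tensor shapes, and the dataset format are illustrative assumptions.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),          # logits over discrete actions
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def bc_pretrain(policy, expert_obs, expert_actions, epochs=50, lr=1e-3):
    """Minimize cross-entropy between the policy's action logits and expert actions.

    expert_obs:     float tensor of shape (N, obs_dim)
    expert_actions: long tensor of shape (N,) with discrete action indices
    """
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = policy(expert_obs)                # (N, n_actions)
        loss = loss_fn(logits, expert_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```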

Algorithm Architecture

The BC-MAAC framework operates under the centralized training with decentralized execution (CTDE) paradigm. Each CAV has its own policy network, while a centralized critic network evaluates joint actions based on global observations. The algorithm employs an attention mechanism to selectively focus on relevant neighboring vehicles, improving decision-making efficiency.
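
The layout below sketches this CTDE structure under illustrative assumptions (agent count, dimensions, and network sizes are placeholders): each CAV owns a small actor that acts from its local observation, while a single centralized critic scores the joint observations and actions during training.

```python
# Sketch of centralized training with decentralized execution (CTDE).
# Agent count, dimensions, and the joint-input critic are illustrative assumptions.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS = 4, 25, 5   # assumed sizes

# One actor per CAV, used at execution time with local observations only.
actors = [nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                        nn.Linear(64, N_ACTIONS)) for _ in range(N_AGENTS)]

# Centralized critic, used only during training: it sees all observations
# plus all (one-hot) actions and outputs one value estimate per agent.
critic = nn.Sequential(
    nn.Linear(N_AGENTS * (OBS_DIM + N_ACTIONS), 128), nn.ReLU(),
    nn.Linear(128, N_AGENTS),
)

def act(local_obs):
    """Decentralized execution: each agent decides from its own observation."""
    return [actor(obs).argmax(dim=-1) for actor, obs in zip(actors, local_obs)]
```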

Policy and Critic Networks

• Policy Networks: Each CAV’s policy network takes local observations as input and outputs actions. The network is trained to maximize both the expected reward and similarity to expert behavior.

• Critic Networks: A centralized critic evaluates joint actions using global state information. The attention mechanism dynamically weights the influence of other vehicles, allowing CAVs to focus on the most relevant interactions.
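
One common way to realize such an attention mechanism is scaled dot-product attention over the other agents' encodings, as in the minimal single-head sketch below; the embedding size and the single attention head are assumptions rather than the paper's exact design.

```python
# Illustrative single-head attention over other agents' encodings, as one way a
# centralized critic can weight the most relevant interactions for each agent.
import math
import torch
import torch.nn as nn

class AgentAttention(nn.Module):
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.scale = math.sqrt(embed_dim)

    def forward(self, ego: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
        # ego:    (batch, embed_dim)            encoding of the agent being evaluated
        # others: (batch, n_others, embed_dim)  encodings of the other agents
        q = self.q(ego).unsqueeze(1)                       # (batch, 1, d)
        k, v = self.k(others), self.v(others)              # (batch, n, d)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)
        return (attn @ v).squeeze(1)                       # weighted context for the critic
```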

Reward Function

The reward function consists of two components:

  1. Traditional MARL Reward: Encourages safety, efficiency, and smooth merging by penalizing collisions, rewarding higher speeds, and encouraging safe headway distances.
  2. Expert Guidance Reward: Measures the KL divergence between the agent’s policy and the expert policy, ensuring that learned behaviors remain close to human-like strategies.
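
A minimal sketch of how the two components could be combined is given below, assuming discrete action distributions from both the current policy and the BC-trained expert policy; the KL weight and the form of the base reward are illustrative, not the paper's exact values.

```python
# Hypothetical shaped reward: base MARL reward minus a KL penalty that keeps the
# learned policy close to the behavior-cloned expert policy. Weights are illustrative.
import torch
import torch.nn.functional as F

def shaped_reward(base_reward: float,
                  agent_logits: torch.Tensor,    # (n_actions,) from the current policy
                  expert_logits: torch.Tensor,   # (n_actions,) from the BC expert policy
                  kl_weight: float = 0.1) -> float:
    agent_log_probs = F.log_softmax(agent_logits, dim=-1)
    expert_probs = F.softmax(expert_logits, dim=-1)
    # KL(expert || agent): grows when the agent's action distribution drifts
    # away from the expert's, pulling the learned behavior toward human-like strategies.
    kl = F.kl_div(agent_log_probs, expert_probs, reduction="sum")
    return base_reward - kl_weight * kl.item()
```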

Action Masking

To prevent unsafe actions, an action masking mechanism filters out invalid maneuvers. For example:
• Lane changes are prohibited before reaching the merging zone.

• Acceleration commands are ignored if the vehicle is already at its maximum speed, and deceleration commands if it is already at its minimum speed.

This mechanism ensures that only feasible actions are considered, reducing exploration inefficiencies.
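
The sketch below illustrates one way such a mask could be implemented for a discrete action set; the action indices, speed thresholds, and merging-zone flag are assumptions chosen for illustration.

```python
# Illustrative action mask for a discrete action set; indices and rules are assumptions.
import numpy as np

ACTIONS = {0: "LANE_LEFT", 1: "IDLE", 2: "LANE_RIGHT", 3: "FASTER", 4: "SLOWER"}

def action_mask(speed: float, v_min: float, v_max: float, in_merging_zone: bool) -> np.ndarray:
    """Return a boolean mask over ACTIONS: True means the action is allowed."""
    mask = np.ones(len(ACTIONS), dtype=bool)
    if not in_merging_zone:           # no lane changes before reaching the merging zone
        mask[0] = mask[2] = False
    if speed >= v_max:                # already at the upper speed limit
        mask[3] = False
    if speed <= v_min:                # already at the minimum speed
        mask[4] = False
    return mask

def masked_argmax(logits: np.ndarray, mask: np.ndarray) -> int:
    """Pick the best-scoring action among those still allowed."""
    masked = np.where(mask, logits, -np.inf)
    return int(np.argmax(masked))
```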

Experimental Setup

Simulation Environment

The experiments were conducted using the Highway-env simulation platform, which models highway on-ramp merging scenarios (a minimal configuration sketch follows the list below). The environment includes:
• A 520-meter highway with a 100-meter merging lane.

• Randomly spawned CAVs and HDVs with varying densities (simple and hard modes).

• Realistic vehicle dynamics, including longitudinal and lateral control.
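
For orientation, a rough setup along these lines is sketched below, assuming a recent highway-env release with the Gymnasium API; the configuration keys shown are standard highway-env options, and the values do not reproduce the paper's customized merge scenario.

```python
# Rough highway-env setup for a merge scenario. The keys below are common
# highway-env configuration options; the values are illustrative placeholders.
import gymnasium as gym
import highway_env  # noqa: F401  (importing registers the merge-v0 environment)

env = gym.make("merge-v0")
env.unwrapped.configure({
    "duration": 40,                 # episode length in steps (assumed)
    "policy_frequency": 5,          # decision frequency in Hz (assumed)
    "simulation_frequency": 15,     # physics update frequency in Hz (assumed)
})
obs, info = env.reset()

done = truncated = False
while not (done or truncated):
    action = env.action_space.sample()   # placeholder for the trained BC-MAAC policy
    obs, reward, done, truncated, info = env.step(action)
env.close()
```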

Training and Evaluation

The BC-MAAC algorithm was compared against baseline MARL methods, including MAAC, MAA2C, MAPPO, and MAACKTR. Performance metrics (a small aggregation sketch follows the list) included:
• Success Rate: Percentage of successful merges without collisions.

• Average Speed: Measures traffic efficiency.

• Average Reward: Combines safety and efficiency metrics.
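
As a small illustration, the helper below aggregates these three metrics from a list of logged evaluation episodes; the episode record format is hypothetical and chosen only for the example.

```python
# Hypothetical aggregation of evaluation metrics over test episodes.
from statistics import mean

def summarize(episodes):
    """episodes: list of dicts like {"collided": bool, "speeds": [...], "reward": float}."""
    success_rate = mean(0.0 if ep["collided"] else 1.0 for ep in episodes)
    avg_speed = mean(mean(ep["speeds"]) for ep in episodes)
    avg_reward = mean(ep["reward"] for ep in episodes)
    return {"success_rate": success_rate,
            "average_speed": avg_speed,
            "average_reward": avg_reward}
```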

Results and Analysis

Training Performance

In both simple and hard traffic density scenarios, BC-MAAC achieved higher average rewards and speeds compared to baseline methods. The inclusion of expert guidance helped the algorithm converge faster and maintain stable performance.

Testing Performance

• Simple Mode: BC-MAAC achieved a 100% success rate, outperforming all baselines. The average speed and reward were significantly higher, demonstrating efficient and safe merging.

• Hard Mode: Despite increased complexity, BC-MAAC maintained a 93.4% success rate, with notable improvements in speed and reward over baseline methods.

Impact of Collision Penalty

A sensitivity analysis of the collision penalty coefficient revealed that excessively strict penalties can reduce efficiency by encouraging overly conservative driving. A balanced penalty coefficient (e.g., 20) provided the best trade-off between safety and performance.

Conclusion

The BC-MAAC algorithm presents a robust solution for highway on-ramp merging in mixed traffic environments. By integrating behavior cloning with MARL, the approach enables CAVs to learn human-like cooperative strategies while maintaining safety and efficiency. Key innovations include:
• Expert-guided reward shaping via KL divergence.

• Attention mechanisms for efficient multi-agent coordination.

• Action masking to enforce safe behaviors.

Future work will explore integrating prediction models to further enhance decision-making robustness. The proposed framework demonstrates significant potential for real-world deployment in intelligent transportation systems.

doi.org/10.19734/j.issn.1001-3695.2024.06.0204
