Deep Reinforcement Learning Tracking Control for Robotic Manipulator Based on Selective Hindsight Experience Replay
Introduction

Robotic systems have expanded their applications across diverse fields including healthcare, military operations, and entertainment. Among various robotic tasks, trajectory tracking for robotic manipulators remains a critical research area. Traditional control methods such as PID control, sliding mode control, and model predictive control often require precise system models or extensive parameter tuning. While neural networks have been employed to enhance stability and adaptability, they face challenges such as local optima when training data is insufficient. Deep reinforcement learning (DRL) has emerged as a promising solution by enabling agents to learn through environmental interactions, thereby overcoming data scarcity issues.

This article presents a novel DRL-based control method for robotic manipulator trajectory tracking, integrating Selective Hindsight Experience Replay (SHER) with the Deep Deterministic Policy Gradient (DDPG) algorithm. The proposed approach enhances exploration efficiency by selectively reinforcing useful experiences, thereby improving convergence speed and stability.

Problem Formulation

Dynamic Model of Robotic Manipulator

The dynamics of a robotic manipulator are derived using the Euler-Lagrange equations, which describe the relationship between joint positions, velocities, and applied torques. For a two-degree-of-freedom manipulator, the inertia matrix, Coriolis force matrix, and gravitational force matrix are explicitly defined. The system’s state includes joint positions, velocities, and their respective desired values. The control objective is to design a DRL-based controller that enables the manipulator to track a time-varying reference trajectory without prior knowledge of the system model.
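The Euler-Lagrange form described above, M(q)q̈ + C(q, q̇)q̇ + G(q) = τ, can be sketched for a planar two-degree-of-freedom arm as follows. The link masses and lengths below are illustrative assumptions, not the paper's values; the matrices are the standard ones for point masses lumped at the link ends.

```python
import numpy as np

# Hypothetical parameters (not from the paper): unit masses and lengths.
m1, m2 = 1.0, 1.0      # link masses (kg)
l1, l2 = 1.0, 1.0      # link lengths (m)
g = 9.81               # gravitational acceleration (m/s^2)

def dynamics(q, dq, tau):
    """Forward dynamics: solve M(q) ddq = tau - C(q, dq) dq - G(q) for ddq."""
    c2, s2 = np.cos(q[1]), np.sin(q[1])
    # Inertia matrix M(q)
    M = np.array([
        [m1*l1**2 + m2*(l1**2 + 2*l1*l2*c2 + l2**2), m2*(l1*l2*c2 + l2**2)],
        [m2*(l1*l2*c2 + l2**2),                      m2*l2**2],
    ])
    # Coriolis/centrifugal matrix C(q, dq)
    C = np.array([
        [-2*m2*l1*l2*s2*dq[1], -m2*l1*l2*s2*dq[1]],
        [ m2*l1*l2*s2*dq[0],    0.0],
    ])
    # Gravity vector G(q)
    G = np.array([
        (m1 + m2)*g*l1*np.cos(q[0]) + m2*g*l2*np.cos(q[0] + q[1]),
        m2*g*l2*np.cos(q[0] + q[1]),
    ])
    return np.linalg.solve(M, tau - C @ dq - G)
```

Integrating this forward dynamics function with a small time step gives the simulation environment the DRL agent interacts with.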

Control Objective

Given a desired trajectory, the goal is to train a DRL agent to generate control torques that minimize tracking errors. The state space includes current joint positions and velocities, while the action space consists of torque commands. The reward function is designed to penalize tracking errors, encouraging the agent to learn optimal control policies.
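A reward of this kind can be sketched as a negative weighted sum of squared position and velocity errors; the weights here are illustrative assumptions, since the article does not reproduce the paper's exact coefficients.

```python
import numpy as np

def tracking_reward(q, dq, q_d, dq_d, w_q=1.0, w_dq=0.1):
    """Penalize deviation from the desired trajectory.

    w_q, w_dq: assumed weights on position and velocity errors.
    The reward is 0 at perfect tracking and negative otherwise.
    """
    e_q = q - q_d       # joint position error
    e_dq = dq - dq_d    # joint velocity error
    return -(w_q * np.sum(e_q**2) + w_dq * np.sum(e_dq**2))
```

Because the reward is maximal (zero) exactly when the errors vanish, maximizing expected return drives the policy toward accurate tracking.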

Methodology

DDPG Algorithm

DDPG is chosen as the base algorithm due to its ability to handle continuous action spaces and its compatibility with experience replay techniques. The framework consists of an actor network, which generates control actions, and a critic network, which evaluates action quality. Both networks are updated using gradient descent, with target networks providing stable learning signals through soft updates.
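Two pieces of this machinery are compact enough to sketch: the temporal-difference target the critic regresses toward, and the Polyak soft update that keeps the target networks slowly tracking the learned ones. Parameters are represented here as dictionaries of arrays for illustration; the soft-update rate is an assumed typical value.

```python
import numpy as np

def td_target(r, gamma, q_next, done):
    """Critic regression target: y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    return r + gamma * (1.0 - done) * q_next

def soft_update(target, source, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'.

    tau=0.005 is an assumed typical rate, not the paper's setting.
    """
    for k in target:
        target[k] = tau * source[k] + (1.0 - tau) * target[k]
    return target
```

The small `tau` means target-network outputs change slowly, which stabilizes the bootstrapped learning signal.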

Selective Hindsight Experience Replay (SHER)

Unlike traditional Hindsight Experience Replay (HER), which relabels failed experiences with alternative goals, SHER selectively modifies rewards for experiences that reduce tracking errors. By reinforcing beneficial actions, SHER accelerates learning and improves policy quality. The key steps include:

  1. Experience Sampling: Randomly selecting transitions from the replay buffer.
  2. Reward Modification: Adjusting rewards for transitions where actions lead to reduced tracking errors.
  3. Buffer Update: Replacing original experiences with enhanced versions to guide future learning.
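The selection-and-modification steps above can be sketched as follows. The transition layout, the "error decreased" selection rule, and the bonus magnitude are all illustrative assumptions about how such a filter might look, not the paper's exact formulation.

```python
def sher_refine(batch, bonus=0.5):
    """Selectively boost rewards for error-reducing transitions.

    Each transition is assumed to carry the tracking-error norm before
    and after the action; `bonus` is an assumed reinforcement magnitude.
    """
    refined = []
    for (s, a, r, s_next, err_before, err_after) in batch:
        if err_after < err_before:      # action reduced the tracking error
            r = r + bonus               # reinforce this beneficial experience
        refined.append((s, a, r, s_next, err_before, err_after))
    return refined
```

Only transitions that demonstrably shrink the tracking error are reinforced, which is the "selective" filter distinguishing SHER from HER's goal relabeling.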

Algorithm Workflow

  1. Initialization: The actor and critic networks are initialized with random weights.
  2. Exploration: The agent interacts with the environment, storing transitions in the replay buffer.
  3. Experience Refinement: SHER modifies rewards for selected experiences to emphasize beneficial actions.
  4. Network Updates: The critic and actor networks are trained using mini-batch gradient descent.
  5. Target Network Synchronization: Soft updates ensure stable learning.
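The five steps above fit together in a standard off-policy training loop, sketched below with the environment and the gradient updates stubbed out; all dimensions and step counts are placeholders.

```python
import random
from collections import deque
import numpy as np

buffer = deque(maxlen=10_000)   # replay buffer (assumed capacity)

def env_step(state, action):
    """Stand-in for the manipulator simulation (not the paper's model)."""
    next_state = state + 0.01 * action
    reward = -float(np.sum(next_state**2))
    return next_state, reward

state = np.zeros(2)
for step in range(100):
    # 2. Exploration: noisy action, store the transition.
    action = np.random.uniform(-1.0, 1.0, size=2)
    next_state, reward = env_step(state, action)
    buffer.append((state, action, reward, next_state))
    if len(buffer) >= 32:
        batch = random.sample(buffer, 32)   # 1. sample a mini-batch
        # 3. SHER reward refinement, 4. critic/actor gradient steps,
        # and 5. soft target updates would be applied to `batch` here.
    state = next_state
```

The loop interleaves environment interaction with learning, so the replay buffer steadily accumulates (and, under SHER, refines) the experience that drives the network updates.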

Experimental Validation

Simulation Setup

A two-degree-of-freedom robotic manipulator is modeled in a simulated environment. The manipulator tracks a sinusoidal reference trajectory under disturbances. Training parameters, including learning rates and discount factors, are carefully selected to balance exploration and exploitation.
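A setup of this shape can be generated as below; the amplitudes, frequencies, time step, and disturbance level are illustrative assumptions, since the article does not list the exact values.

```python
import numpy as np

dt, T = 0.01, 10.0
t = np.arange(0.0, T, dt)
# Sinusoidal reference for both joints (assumed amplitude 0.5 rad).
q_d  = np.stack([0.5 * np.sin(t),  0.5 * np.cos(t)], axis=1)
dq_d = np.stack([0.5 * np.cos(t), -0.5 * np.sin(t)], axis=1)
# Additive disturbance torque applied during evaluation (assumed Gaussian).
rng = np.random.default_rng(0)
tau_dist = 0.1 * rng.standard_normal((t.size, 2))
```

At each control step the agent observes the current and desired joint states and outputs a torque, to which the disturbance is added before integration.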

Comparative Algorithms

The proposed DDPG with SHER is compared against:

• Standard DDPG

• Soft Actor-Critic (SAC)

• DDPG with HER (without selective filtering)

• Traditional PID control

Performance Metrics

Tracking performance is evaluated using:

• Average Tracking Error: Measures deviation from the desired trajectory.

• Convergence Speed: Indicates how quickly the algorithm reaches a stable policy.

• Robustness: Assesses performance under external disturbances.

Results

  1. Convergence Behavior: DDPG with SHER achieves faster convergence and higher rewards compared to baseline methods.
  2. Tracking Accuracy: The proposed method exhibits the lowest steady-state tracking errors, outperforming PID and other DRL variants.
  3. Disturbance Rejection: SHER-enhanced DRL maintains superior tracking performance even with added noise, demonstrating robustness.

Discussion

The success of SHER lies in its ability to prioritize experiences that contribute to error reduction, thereby accelerating policy improvement. Unlike HER, which may introduce misleading goals in trajectory tracking tasks, SHER ensures that only relevant experiences are reinforced. This selective approach mitigates the risk of suboptimal convergence and enhances learning efficiency.

Conclusion

This article introduces a DRL-based control strategy for robotic manipulator trajectory tracking, leveraging SHER to improve exploration and policy quality. Experimental results confirm that the proposed method achieves superior convergence, accuracy, and robustness compared to existing approaches. Future work will extend the framework to higher-dimensional manipulators and real-world applications.

For further details, refer to the original publication: https://doi.org/10.19734/j.issn.1001-3695.2024.07.0234