Deep Reinforcement Learning Approach for Solving Takeout Delivery Problem

The rapid growth of the online food delivery industry has transformed modern lifestyles, with platforms promising 30-minute deliveries becoming increasingly prevalent. However, this expansion has also highlighted challenges related to delivery rider efficiency and earnings. The takeout delivery problem, which involves optimizing delivery routes to maximize rider benefits while ensuring timely service, is a critical aspect of this industry. Traditional approaches often struggle with solution accuracy and stability, especially as the problem scales. This paper introduces a novel deep reinforcement learning (DRL) algorithm, termed DRL-MFA, to address these challenges by modeling the problem as a Minimum Ratio Traveling Salesman Problem (MRTSP).

Introduction

The takeout delivery industry has experienced exponential growth, with billions of orders processed annually. However, the efficiency of delivery routes directly impacts rider earnings, customer satisfaction, and platform profitability. Current methods for solving routing problems, such as heuristic algorithms, often suffer from low accuracy and instability, particularly for large-scale instances. The MRTSP, an extension of the classic Traveling Salesman Problem (TSP), introduces a nonlinear objective function that minimizes the ratio of total travel distance to total earnings, making it more complex yet more reflective of real-world scenarios.

This paper proposes DRL-MFA, a deep reinforcement learning-based approach, to solve the MRTSP efficiently. The algorithm leverages a Markov Decision Process (MDP) framework to model interactions between the delivery agent (rider) and the environment (delivery network). By incorporating multi-feature aggregation and attention mechanisms, DRL-MFA enhances feature representation and decision-making, outperforming traditional heuristic methods in both accuracy and stability.

Problem Formulation

The takeout delivery problem is modeled as an MRTSP, where the goal is to minimize the cost-benefit ratio—total distance traveled divided by total earnings. The problem is defined on a weighted complete graph, where nodes represent delivery locations (including the central hub), edges represent paths between locations, and weights correspond to distances and associated earnings. Key assumptions include constant rider speed, Euclidean distance calculations, and no capacity constraints.

The MRTSP objective function is nonlinear, distinguishing it from the linear TSP. This complexity arises because the ratio of distance to earnings must be minimized, requiring solutions that balance both factors simultaneously. Traditional heuristic algorithms, such as genetic algorithms or particle swarm optimization, often struggle with such nonlinearities, leading to suboptimal solutions or slow convergence.
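Written out, and using illustrative notation rather than the paper's own symbols, the objective is to find the tour that minimizes total travel distance divided by total earnings:

```latex
\min_{\pi \in \Pi} \;
\frac{\sum_{k=1}^{n} d\bigl(\pi(k),\, \pi(k+1)\bigr)}
     {\sum_{k=1}^{n} r\bigl(\pi(k),\, \pi(k+1)\bigr)},
\qquad \pi(n+1) := \pi(1),
```

where \pi ranges over permutations of the n locations, d(i, j) is the Euclidean distance between locations i and j, and r(i, j) is the earnings associated with that leg (earnings may equivalently be attached to nodes rather than edges). Because the ratio couples the two sums, an edge that looks attractive by distance alone may be poor once its earnings contribution is accounted for, which is precisely the nonlinearity that linear TSP formulations do not capture.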

Methodology

The DRL-MFA algorithm is structured around an encoder-decoder framework, enhanced with multi-feature aggregation and attention mechanisms.

Encoder Design

The encoder processes three types of input features: node coordinates, distance matrices, and earnings matrices. A Multi-Feature Aggregation (MFA) sublayer is introduced to combine these features effectively. The MFA sublayer employs separate fully connected networks to embed each feature type, concatenates them, and applies another fully connected layer to produce a unified representation. This approach ensures complementary feature interactions and improves the model’s ability to capture nonlinear relationships.
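As a rough illustration of this design, the following PyTorch sketch embeds each feature type separately and then fuses the results. The layer widths and the row-wise treatment of the distance and earnings matrices are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiFeatureAggregation(nn.Module):
    """Embed coordinates, distance rows, and earnings rows, then fuse them."""

    def __init__(self, n_nodes: int, d_model: int = 128):
        super().__init__()
        self.coord_proj = nn.Linear(2, d_model)       # (x, y) of each node
        self.dist_proj = nn.Linear(n_nodes, d_model)  # node's row of the distance matrix
        self.earn_proj = nn.Linear(n_nodes, d_model)  # node's row of the earnings matrix
        self.fuse = nn.Linear(3 * d_model, d_model)   # unify the three feature views

    def forward(self, coords, dist, earn):
        # coords: (B, N, 2); dist, earn: (B, N, N)
        h = torch.cat(
            [self.coord_proj(coords), self.dist_proj(dist), self.earn_proj(earn)],
            dim=-1,
        )
        return self.fuse(h)  # (B, N, d_model) unified node embeddings
```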

Following the MFA sublayer, the encoder uses multiple self-attention (SA) layers to refine node embeddings. Each SA layer consists of a multi-head attention (MHA) mechanism and a feed-forward (FF) network, with residual connections and batch normalization to stabilize training. The MHA mechanism computes attention scores between nodes, while the FF network applies nonlinear transformations to enhance feature representation.
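A single encoder layer matching this description could be sketched as follows; the head count, hidden width, and use of PyTorch's built-in multi-head attention are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Multi-head attention + feed-forward, each with a residual connection and batch norm."""

    def __init__(self, d_model: int = 128, n_heads: int = 8, d_ff: int = 512):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.bn1 = nn.BatchNorm1d(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.bn2 = nn.BatchNorm1d(d_model)

    def forward(self, h):
        # h: (B, N, d_model) node embeddings
        attn_out, _ = self.mha(h, h, h)  # attention scores between all pairs of nodes
        h = self.bn1((h + attn_out).transpose(1, 2)).transpose(1, 2)   # residual + batch norm
        h = self.bn2((h + self.ff(h)).transpose(1, 2)).transpose(1, 2)  # residual + batch norm
        return h
```

Stacking several such layers produces the refined node embeddings consumed by the decoder.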

Decoder Design

The decoder generates delivery routes by sequentially selecting nodes based on learned probabilities. At each step, it considers a context vector composed of the graph embedding, the first visited node, and the last visited node. This context is processed through a fully connected layer and fed into a multi-head attention layer, which computes attention scores between the context and all nodes.

A single-head attention layer then produces a probability distribution over unvisited nodes, guiding the selection of the next delivery location. Two action selection strategies are employed: greedy search for deterministic decisions during testing and sampling search for exploration during training.
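The decoding step described above might be sketched as follows; the tensor shapes, the mean-pooled graph embedding, and the tanh clipping constant are assumptions in the spirit of attention-based routing models rather than the paper's exact specification.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step: score the candidate nodes given the current context."""

    def __init__(self, d_model: int = 128, n_heads: int = 8, clip: float = 10.0):
        super().__init__()
        self.context_proj = nn.Linear(3 * d_model, d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.clip = clip

    def forward(self, node_emb, first_idx, last_idx, visited_mask):
        # node_emb: (B, N, d); visited_mask: (B, N), True where a node was already visited
        B = node_emb.size(0)
        batch = torch.arange(B, device=node_emb.device)
        graph_emb = node_emb.mean(dim=1)  # graph embedding
        context = torch.cat(
            [graph_emb, node_emb[batch, first_idx], node_emb[batch, last_idx]], dim=-1
        )
        q = self.context_proj(context).unsqueeze(1)      # (B, 1, d) context query
        q, _ = self.mha(q, node_emb, node_emb)           # attention between context and all nodes
        # single-head compatibility scores, clipped, masked, turned into probabilities
        scores = torch.bmm(q, node_emb.transpose(1, 2)).squeeze(1)
        scores = self.clip * torch.tanh(scores / node_emb.size(-1) ** 0.5)
        scores = scores.masked_fill(visited_mask, float("-inf"))
        return torch.softmax(scores, dim=-1)             # distribution over unvisited nodes
```

At training time the next node would be sampled from the returned distribution (sampling search); at test time the highest-probability node would be chosen (greedy search).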

Training Approach

The model is trained using the REINFORCE algorithm with a baseline to reduce variance in gradient estimates. The baseline is derived from a separate network trained to predict expected rewards, improving learning stability. The loss function minimizes the expected cost-benefit ratio, and gradients are computed using policy gradients. The Adam optimizer is used for parameter updates, with training conducted over multiple epochs.
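The following sketch shows the shape of one such update; `rollout` and `baseline_net` are placeholder names, and treating the baseline as a separately trained predictor of the tour cost follows the description above but is otherwise an assumption.

```python
import torch
import torch.nn.functional as F

def reinforce_step(model, baseline_net, batch, opt_model, opt_baseline):
    """One policy-gradient update minimizing the expected distance/earnings ratio."""
    # rollout (placeholder) decodes a tour and returns its cost-benefit ratio and
    # the summed log-probability of the selected nodes (sampling search).
    cost, log_prob = rollout(model, batch, greedy=False)   # (B,), (B,)
    with torch.no_grad():
        baseline = baseline_net(batch)                     # predicted expected ratio, (B,)
    advantage = cost - baseline                            # negative when better than expected
    policy_loss = (advantage * log_prob).mean()            # REINFORCE with baseline
    opt_model.zero_grad(); policy_loss.backward(); opt_model.step()

    # Keep the baseline network close to the observed costs.
    baseline_loss = F.mse_loss(baseline_net(batch), cost.detach())
    opt_baseline.zero_grad(); baseline_loss.backward(); opt_baseline.step()
    return cost.mean().item()
```

Both optimizers here would typically be Adam, matching the description above, with the loop repeated over all batches in each epoch.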

Experimental Results

The performance of DRL-MFA is evaluated on both classic MRTSP benchmarks and real-world takeout delivery datasets from Changchun, China.

Classic Benchmark Comparisons

DRL-MFA is compared against five heuristic algorithms: Gravitational Search Algorithm (GSA), Biogeography-Based Optimization (BBO), Most Valuable Player Algorithm (MVPA), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA). Results show that DRL-MFA consistently achieves superior performance, particularly for larger problem instances. For example, in a 10-node benchmark, DRL-MFA attains the optimal solution with zero deviation, while heuristic methods exhibit significant gaps.

Real-World Case Study

The Changchun dataset includes delivery locations across nine administrative districts, with problem sizes ranging from 35 to 100 nodes. DRL-MFA outperforms heuristic methods and a baseline DRL model without MFA, demonstrating its robustness and scalability. The inclusion of the MFA sublayer is critical, as it improves convergence speed and solution quality, especially for larger instances.

Sensitivity Analysis

A sensitivity analysis explores how pricing strategies affect delivery routes and rider earnings. Three pricing components are examined:

  1. Long-distance premiums: Increasing earnings for deliveries beyond 5 km incentivizes riders to prioritize longer routes, reducing the cost-benefit ratio.
  2. Mid-distance adjustments: Higher earnings for 3–5 km deliveries shift rider preferences toward shorter routes, balancing distance and earnings.
  3. Base fee variations: Higher base fees for short-distance deliveries make them more attractive, reducing overall travel distance while increasing earnings.

The analysis confirms that pricing strategies significantly influence route optimization, providing actionable insights for platform operators.
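To make these pricing levers concrete, the sketch below shows how a distance-tiered earnings function of this kind might be parameterized; all thresholds and fee values are hypothetical and not taken from the paper.

```python
def leg_earnings(distance_km: float,
                 base_fee: float = 3.0,
                 mid_bonus: float = 1.0,
                 long_premium_per_km: float = 0.8) -> float:
    """Hypothetical tiered earnings for a single delivery leg.

    base_fee            -- paid on every delivery (dominant for short trips)
    mid_bonus           -- extra payment for 3-5 km deliveries
    long_premium_per_km -- per-km premium beyond 5 km
    """
    earnings = base_fee
    if 3.0 <= distance_km <= 5.0:
        earnings += mid_bonus
    elif distance_km > 5.0:
        earnings += long_premium_per_km * (distance_km - 5.0)
    return earnings

# Raising long_premium_per_km makes long legs lower the distance/earnings ratio,
# which is the effect described for long-distance premiums above.
print(leg_earnings(2.0), leg_earnings(4.0), leg_earnings(7.5))
```

Adjusting these hypothetical parameters shifts which legs the optimized route favors, mirroring the three pricing effects listed above.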

Conclusion

The DRL-MFA algorithm presents a robust solution to the takeout delivery problem, addressing limitations of traditional heuristic methods. By integrating multi-feature aggregation and attention mechanisms, the model captures complex feature interactions and delivers high-quality solutions efficiently. Experimental results validate its superiority in both classic benchmarks and real-world scenarios, while sensitivity analysis highlights the impact of pricing on route optimization.

Future work could extend this approach to multi-rider scenarios or incorporate dynamic factors like traffic conditions. Overall, DRL-MFA offers a promising direction for enhancing delivery efficiency and rider earnings in the rapidly evolving food delivery industry.

doi.org/10.19734/j.issn.1001-3695.2024.05.0179
