Оптимізація руху транспорту в простій мережі за допомогою глибокого навчання з підкріпленням | Збірник наукових праць "Управління розвитком складних систем"

Title (English):

Optimization of transport traffic in a simple network using deep learning with reinforcement

Author(s):

Левицький В. В.

Author(s) (English):

Levytskyi Volodymyr

Keywords (Ukrainian):

DDPG; Aimsun; Q-learning; дорожній рух

Keywords (English):

DDPG; Aimsun; Q-learning; traffic

Abstract (Ukrainian):

Оптимізація транспортного потоку в міських умовах залишається одним із ключових викликів сучасних досліджень, навіть попри значний обсяг наукових праць, присвячених цій темі. Незважаючи на досягнення, ця проблема все ще не має універсального рішення, яке б ефективно працювало в реальних сценаріях. Однією з основних складностей є опрацювання великого масиву вхідних даних, зокрема даних про дорожній рух, що постійно надходять із датчиків, встановлених по всій міській дорожній мережі. Традиційно, через масштабність завдання, дослідники зосереджувалися на розробці систем із локалізованими агентами. Такі агенти зазвичай управляють трафіком на окремих перехрестях, при цьому їхня координація здійснюється в рамках багатопотокових агентних систем. Однак в сучасних підходах враховується обсяг і складність вхідних даних завдяки застосуванню методів глибокого навчання. Зокрема, пропонується використання алгоритму глибокого детермінованого градієнта політики (DDPG), на основі якого можна обробляти великі вхідні масиви даних. У рамках експериментального дослідження була випробувана проста модель перехрестя, щоб перевірити ефективність підходу. Алгоритм DDPG засвідчив переваги в простій моделі порівняно з Q-learning. DDPG видавав нагороду 3-4 бали в діапазоні, тоді як нагорода Q-learning була в діапазоні 2-4 бали. Для оцінювання продуктивності підходу DDPG порівняно із Q-learning і випадковими таймінгами основним критерієм є середня винагорода за епізод. DDPG і Q-навчання досягають схожих рівнів винагороди, проте DDPG демонструє стабільну конвергенцію (0.04-0.21 бали), тоді як Q-learning залишається нестабільним (0.04-0.43 бали). Дослідження продуктивності внутрішнього епізоду засвідчує, що DDPG досягає покращень переважно ближче до кінця епізоду. Загалом цей алгоритм показав себе успішно для такого сценарію, а отримані результати можуть слугувати основою для подальших удосконалень і застосувань у складніших дорожніх сценаріях.

Abstract (English):

Optimizing traffic flow in urban environments remains a key research challenge despite the large body of work devoted to it, and the problem still lacks a universal solution that works effectively in real-world scenarios. One of the main difficulties is processing the large volume of input data, in particular traffic data that arrives continuously from sensors installed throughout the urban road network. Because of the scale of the task, researchers have traditionally focused on systems with localized agents. Such agents usually manage traffic at individual intersections, and they are coordinated within multi-agent systems. Modern approaches instead handle the volume and complexity of the input data with deep learning methods. In particular, this work proposes using the deep deterministic policy gradient (DDPG) algorithm, which can process large inputs. In the experimental study, the approach was tested on a simple intersection model. DDPG performed better than Q-learning in this simple model: its reward was in the range of 4-4.3 points, while the reward of Q-learning was in the range of 2-4 points. The main criterion for comparing DDPG with Q-learning and with random timings is the average reward per episode. DDPG and Q-learning reach similar reward levels, but DDPG converges stably (0.04-0.21 points), while Q-learning remains unstable (0.04-0.43 points). Analysis of intra-episode performance shows that DDPG achieves its improvements mainly toward the end of an episode. Overall, the algorithm proved successful in this scenario, and the results can serve as a basis for further improvements and for applications to more complex traffic scenarios.

Publisher:

Kyiv National University of Construction and Architecture

Journal title, issue, year (Ukrainian):

Управління розвитком складних систем, номер 61, 2025

Journal title, issue, year (English):

Management of Development of Complex Systems, number 61, 2025

Article language:

Ukrainian

Document format:

application/pdf

Document:

151-159.pdf

Publication date:

07 April 2025

Issue number:

61

Section:

INFORMATION TECHNOLOGIES FOR MANAGEMENT

Author's university:

Kyiv National University of Construction and Architecture, Kyiv
