Modern small networks for image classification. An analysis of their characteristics

Author(s):

Гозак Я. Д., Палій С. В.

Author(s) (English):

Hozak Ya., Paliy S.

Publication date:

29.11.2024

Abstract (translated from Ukrainian):

As demand grows for running deep learning models on resource-constrained devices such as smartphones, IoT sensors, and edge computing platforms, the need for efficient convolutional neural networks (CNNs) has become paramount. The article offers a comprehensive review of several state-of-the-art lightweight CNN architectures designed to address these challenges by reducing computational complexity and memory usage while remaining competitive in image classification tasks. The key architectures reviewed include MobileNets, ShuffleNet, DiceNet, and ESPNet, each of which uses a different strategy to optimize network efficiency. MobileNets introduces the concept of depthwise separable convolutions, which decompose the standard convolution operation into a depthwise convolution and a point-wise (1x1) convolution. This substantially reduces the number of parameters and computations compared with traditional convolutions. ShuffleNet, on the other hand, uses group convolutions and channel shuffling to improve efficiency, allowing feature maps to be split and recombined, which reduces computational cost without significant harm to accuracy. DiceNet builds on these concepts by introducing a multi-branch architecture with different dilation rates to extract features at different scales, improving both accuracy and efficiency in low-resource environments. ESPNet uses efficient spatial pyramid structures together with point-wise convolutions to process diverse spatial features at different scales while remaining highly computationally efficient. Despite these advances, a common bottleneck in these architectures is the reliance on point-wise (1x1) convolutions, which, although more efficient than standard convolutions, still contribute significantly to the overall computational cost, especially in the deeper layers of the network.
In addition, filter sizes are often optimized for performance in cloud environments but may not be ideal for edge environments, where computational speed and energy efficiency are decisive. We see potential in changing the filter size in some layers to 2x2, the smallest filter that can extract spatial information. Attention should also be paid to how information propagates between channels and to how the number of channels is formed, with the 1x1 convolution replaced by another predictable mathematical operation.

Abstract (English):

With the increasing demand for deploying deep learning models on resource-constrained devices, such as smartphones, IoT sensors, and edge computing platforms, the need for efficient convolutional neural networks (CNNs) has become paramount. This paper offers a comprehensive review of several state-of-the-art lightweight CNN architectures designed to address these challenges by reducing computational complexity and memory usage while maintaining competitive performance in image classification tasks. Key architectures reviewed include MobileNets, ShuffleNet, DiceNet, and ESPNet, each of which employs a distinct strategy to optimize network efficiency. MobileNets introduce the concept of depthwise separable convolutions, which decompose the standard convolution operation into a depthwise convolution and a point-wise (1x1) convolution. This drastically reduces the number of parameters and computations compared to traditional convolutions. ShuffleNet, on the other hand, leverages group convolutions and channel shuffling to enhance efficiency, allowing feature maps to be split and recombined, which reduces computational cost without significantly compromising accuracy. DiceNet builds upon these concepts by introducing a multi-branch architecture with different dilation rates to capture features at multiple scales, enhancing both accuracy and efficiency in low-resource environments. ESPNet employs efficient spatial pyramid structures, along with point-wise convolutions, to handle diverse spatial features at different scales while remaining highly computationally efficient. Despite these advancements, a common bottleneck across these architectures is the reliance on point-wise (1x1) convolutions, which, while more efficient than standard convolutions, still contribute significantly to the overall computational cost, particularly in deeper layers of the network.
Furthermore, filter sizes are often optimized for performance in a cloud-based setting but may not be ideal for edge environments, where computational and energy efficiency are crucial. We see potential in changing filter sizes in some layers to 2x2, the smallest filter that can extract spatial information. It is also worth paying attention to how information spreads across channels and to how the number of channels is formed, with the 1x1 convolution replaced by a generic yet predictable mathematical operation.
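
The parameter arithmetic behind these claims can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' method: the helper names (`conv_params`, `separable_params`, `channel_shuffle`), the choice of 256 input and output channels, and the omission of bias terms are all assumptions made for the example.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with the given number of groups (biases ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


def separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k convolution followed by a point-wise 1x1 convolution."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one k x k filter per input channel
    pointwise = conv_params(c_in, c_out, 1)              # mixes information across channels
    return depthwise, pointwise


def channel_shuffle(channels, groups):
    """Interleave channels so that each group receives channels from every group.

    Equivalent to reshaping to (groups, per_group), transposing, and flattening.
    """
    assert len(channels) % groups == 0
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


# Assumed layer size for illustration only: 256 input and 256 output channels.
C = 256
for k in (3, 2):
    standard = conv_params(C, C, k)
    depthwise, pointwise = separable_params(C, C, k)
    share = pointwise / (depthwise + pointwise)
    print(f"{k}x{k}: standard={standard}, depthwise={depthwise}, "
          f"pointwise={pointwise}, pointwise share={share:.3f}")

# Six channels in two groups: the shuffle interleaves the two halves.
print(channel_shuffle(list(range(6)), groups=2))
```

Under these assumptions, the point-wise convolution accounts for roughly 97% of the separable block's weights with a 3x3 depthwise filter and roughly 98% with a 2x2 filter, which illustrates why the abstract singles out the 1x1 convolution as the remaining bottleneck.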
