Department of Applied Mathematics, Charles University, Czech Republic
Abstract
Deep neural networks have evolved from heuristic pattern recognition tools into mathematically grounded systems whose theoretical understanding spans approximation theory, geometry, dynamical systems, and optimal control. This article develops a unified theoretical framework that interprets classical and modern neural architectures through the lenses of geometric separability, statistical learning, and dynamical systems theory. Drawing exclusively from foundational and contemporary contributions in machine learning, control theory, and high-dimensional geometry, we provide a comprehensive synthesis of the evolution from perceptrons and support vector machines to residual networks and neural ordinary differential equations.
The study begins by examining early geometric formulations of classification, including linear threshold units and shattering properties, and situates these within modern capacity analysis. We then analyze multilayer feedforward networks in terms of universal approximation and storage capacity, addressing both width and depth considerations. Special attention is devoted to the power of depth and residual connections, emphasizing the reinterpretation of deep networks as discretized dynamical systems.
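To fix ideas (the notation here is ours and purely illustrative), a residual block that updates a hidden state by
\[
x_{k+1} = x_k + h\, f(x_k, \theta_k)
\]
is one explicit Euler step of size \(h\) for the continuous-time system \(\dot{x}(t) = f(x(t), \theta(t))\), so that depth plays the role of integration time and a trained residual network approximates a flow map.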
A central contribution of this work is an integrative exploration of neural ordinary differential equations and their mean-field optimal control formulations. We explain how continuous-depth models unify discrete architectures and reveal new insights into controllability, interpolation, and long-time behavior. The mean-field perspective connects parameter learning with population-level dynamics, clarifying the role of measure-theoretic interpolation and turnpike phenomena in training trajectories.
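In this continuous-depth picture, and again only as a schematic sketch with illustrative notation, supervised training can be phrased as a mean-field optimal control problem of the form
\[
\min_{\theta(\cdot)} \;\; \mathbb{E}_{(x_0, y) \sim \mu_0}\!\left[ \ell\big(\Phi(x(T)), y\big) \right] + \int_0^T R\big(\theta(t)\big)\, dt
\qquad \text{subject to} \qquad \dot{x}(t) = f\big(x(t), \theta(t)\big), \quad x(0) = x_0,
\]
where \(\mu_0\) denotes the data distribution, \(\Phi\) a readout map, \(\ell\) a loss, and \(R\) a regularizer; the single control \(\theta(\cdot)\) is shared across the whole population of trajectories, which is what ties parameter learning to population-level dynamics.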
We further investigate geometric and topological perspectives, including manifold learning and invertible architectures, demonstrating how controllability conditions determine expressive power in neural ODE frameworks. Stability considerations and identity-preserving structures are analyzed to explain empirical success in deep training regimes. Optimization landscape properties, stochastic gradient methods, and automatic differentiation are contextualized within this broader dynamical view.
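A representative stability condition in this setting, stated informally and with illustrative notation, asks that the vector field have no strongly expanding directions along trajectories,
\[
\operatorname{Re}\, \lambda_i\!\left( \frac{\partial f}{\partial x}\big(x(t), \theta(t)\big) \right) \le 0 \quad \text{for all } t \text{ and all } i,
\]
which motivates identity-preserving and antisymmetric weight parameterizations that keep the forward dynamics, and hence training, well behaved over large depths.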
Finally, the article synthesizes classical statistical learning theory with modern transformer-based dynamics and cluster formation in self-attention systems, positioning deep learning as a theory of measure evolution under learned flows. By integrating insights from approximation theory, control, geometry, and statistical learning, this work provides a coherent theoretical narrative that clarifies both the mathematical foundations and the emerging research directions of deep learning as a dynamical systems discipline.
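As a schematic illustration of this measure-evolution viewpoint (the notation is ours), a sequence of tokens \(x_1, \dots, x_n\) propagated through stacked self-attention layers can be modeled as interacting particles,
\[
\dot{x}_i(t) = \sum_{j=1}^{n} \frac{e^{\langle Q x_i(t),\, K x_j(t)\rangle}}{\sum_{m=1}^{n} e^{\langle Q x_i(t),\, K x_m(t)\rangle}}\, V x_j(t), \qquad i = 1, \dots, n,
\]
whose empirical measure evolves under the learned flow and, under suitable conditions, concentrates into a small number of clusters.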
Keywords
Deep learning, neural ordinary differential equations, universal approximation