Meta-Learned Dynamic Distillation for Automated Hyperparameter Optimization in Machine Learning Systems
Abstract
We propose a meta-learned dynamic distillation framework for automated hyperparameter optimization in machine learning systems, which adaptively adjusts knowledge transfer intensity between teacher and student models during training. Traditional distillation methods rely on static intensity schedules or manual tuning, often leading to suboptimal performance when faced with dataset shifts or model uncertainty. The proposed method addresses this limitation by formulating distillation intensity as a learnable hyperparameter, optimized in real time through a bilevel optimization scheme. Our framework integrates three key components: an Adaptive Distillation Controller (ADC) that meta-learns intensity adjustments based on gradient dynamics and validation loss, a bilevel optimization engine that jointly minimizes student loss and intensity regularization, and a curriculum-aware memory buffer that stabilizes training through historical trajectory analysis. The ADC employs a recurrent neural network to dynamically modulate intensity, while the bilevel optimizer ensures efficient meta-gradient computation without unrolled computational graphs. Furthermore, the memory buffer captures gradient variance patterns to inform intensity adjustments, enabling robust adaptation to evolving training conditions. Experiments demonstrate that our method outperforms conventional static distillation approaches across multiple benchmarks, achieving superior model performance with reduced manual intervention. The unified treatment of hyperparameter optimization and knowledge distillation not only eliminates the need for handcrafted schedules but also provides a principled way to balance task-specific learning and teacher guidance. This work advances the state of automated machine learning by introducing a scalable, adaptive solution for model tuning.
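To make the bilevel structure concrete: the outer problem updates the controller parameters φ to minimize validation loss plus an intensity regularizer, min_φ L_val(θ*(φ)) + λR(α_φ), while the inner problem trains the student on a convex combination of the task loss and the distillation term, θ*(φ) = argmin_θ (1 - α_φ)L_task(θ) + α_φ L_KD(θ). The sketch below is a minimal PyTorch-style illustration of the inner-loop objective and an RNN-based controller; the GRU architecture, its two input features (gradient variance and validation loss), and all names such as AdaptiveDistillationController and distillation_step are illustrative assumptions of this summary, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDistillationController(nn.Module):
    """Hypothetical ADC sketch: meta-learns the distillation intensity alpha."""

    def __init__(self, feat_dim: int = 2, hidden_dim: int = 16):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)  # recurrent state across training steps
        self.head = nn.Linear(hidden_dim, 1)
        self.h = torch.zeros(1, hidden_dim)          # controller hidden state

    def forward(self, grad_variance: float, val_loss: float) -> torch.Tensor:
        feats = torch.tensor([[grad_variance, val_loss]])
        # Detaching the carried state truncates backprop to one step,
        # so each outer update keeps a fresh, short computational graph.
        self.h = self.rnn(feats, self.h.detach())
        # Squash to (0, 1): alpha = 0 is pure task loss, alpha = 1 pure teacher guidance.
        return torch.sigmoid(self.head(self.h)).squeeze()

def distillation_step(student, teacher, x, y, alpha, T: float = 4.0):
    """Inner-loop objective: (1 - alpha) * task CE + alpha * temperature-scaled KD."""
    s_logits = student(x)
    with torch.no_grad():                # teacher is frozen
        t_logits = teacher(x)
    task = F.cross_entropy(s_logits, y)
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # standard temperature scaling
    return (1 - alpha) * task + alpha * kd

In the full framework, the controller would additionally be meta-updated on validation loss (the outer problem); that outer step, and the memory buffer that supplies the gradient-variance feature, are omitted here for brevity.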
Keywords:
Adaptive Knowledge Distillation, Bilevel Optimization, Automated Hyperparameter Optimization (HPO)
Copyright Notice & License:
All articles published in the Journal of Engineering Systems and Applications (JESA) are licensed under the Creative Commons Attribution–NonCommercial 4.0 International License (CC BY-NC 4.0).
Under this license:
- Attribution — Users must give appropriate credit to the author(s), provide a link to the license, and indicate if any changes were made.
- Non-Commercial Use — The work may not be used for commercial purposes without explicit written permission from the copyright holder.
- Adaptations Allowed — Users may remix, transform, or build upon the material for non-commercial purposes, as long as proper attribution is maintained.
- No Additional Restrictions — Users may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
- Third-Party Material — Any third-party material included in the article is subject to the copyright terms of its respective owner, unless otherwise indicated.
Full license text: https://creativecommons.org/licenses/by-nc/4.0/

