Efficient Deployment of Multimodal Large Models: A Survey on Technical Innovations, Industrial Applications, and Challenges of Heterogeneous MoE Architecture, Low-bit Quantization, and Cloud-Edge-End Collaboration (2024-2026)
Abstract
This survey focuses on the efficient deployment of multimodal large models (MLLMs) and systematically reviews the core technologies from 2024 to 2026. At the architectural level, heterogeneous Mixture-of-Experts (MoE) architectures represented by ERNIE 4.5 balance cross-modal parameter sharing against modality-specific computation through modality-isolated routing and a router orthogonal loss, significantly improving computational efficiency. In model compression and inference optimization, methods such as Layerwise Ultra-Low Bit Quantization (LUQ) and MoEQuant effectively address the quantization challenges posed by multimodal data, while TensorRT-LLM and ONNX Runtime provide deeply optimized inference-engine solutions for hardware architectures such as Blackwell. In deployment strategy, cloud-edge-end collaborative architectures represented by DistMLLM, AIVD, and NANOMIND meet low-latency and high-privacy requirements while achieving cost savings and energy-efficiency gains through adaptive offloading, dynamic resource scheduling, and hardware-software co-design. Application cases show that these technologies effectively address industry pain points such as high latency and data-privacy risk in medical edge diagnostic devices and in quality-inspection and predictive-maintenance scenarios in smart manufacturing, balancing efficiency and performance.
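The modality-isolated routing described above can be illustrated with a minimal sketch: each token is dispatched only to experts in its own modality's pool, plus a small shared pool. This is a hypothetical toy illustration of the isolation constraint, not ERNIE 4.5's actual router (a real router uses a learned gating network to score experts; here experts are sampled at random, and all pool names and sizes are invented for the example).

```python
# Toy sketch of modality-isolated MoE routing (hypothetical, illustrative only).
import random

random.seed(0)

TEXT_EXPERTS = ["T0", "T1", "T2"]    # experts reserved for text tokens
VISION_EXPERTS = ["V0", "V1", "V2"]  # experts reserved for vision tokens
SHARED_EXPERTS = ["S0"]              # experts shared across modalities

def route(token_modality: str, top_k: int = 2) -> list:
    """Select top_k experts from the token's modality pool, plus shared experts."""
    pool = TEXT_EXPERTS if token_modality == "text" else VISION_EXPERTS
    # A real router would rank experts with a learned gate and add an
    # orthogonality penalty on the gate weights; random.sample stands in
    # here purely to show the isolation constraint.
    return random.sample(pool, k=top_k) + SHARED_EXPERTS

tokens = [("text", "hello"), ("vision", "<patch_17>"), ("text", "world")]
for modality, tok in tokens:
    experts = route(modality)
    # Isolation invariant: a text token never reaches a vision expert,
    # and vice versa; only the shared pool is common.
    assert all(e.startswith(("S", modality[0].upper())) for e in experts)
    print(tok, "->", experts)
```

The router orthogonal loss mentioned in the abstract would, in a real implementation, additionally push the gate vectors of different experts toward orthogonality so that experts specialize rather than collapse onto the same tokens.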
Keywords:
Multimodal Large Models; Heterogeneous MoE; Low-bit Quantization; Cloud-Edge-End Collaboration; Industrial Deployment

Copyright Notice & License:
This article is published in Silence under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original author(s) and source are properly credited.
The author(s) retain copyright of the work. By publishing in the journal, the author(s) grant Silence a non-exclusive, worldwide, and irrevocable license to publish, archive, and disseminate this article in all forms and media.

References
Liang, C. X., Tian, P., Yin, C. H., et al. (2024). A comprehensive survey and guide to multimodal large language models in vision-language tasks. arXiv preprint arXiv:2411.06284. https://doi.org/10.48550/arXiv.2411.06284
Technode. (2025, June 30). Baidu open-sources ERNIE 4.5 multimodal AI model. https://00l.xyz/03U3s
Baidu Inc. (2025). ERNIE 4.5: Multimodal large language model [Computer software]. GitHub. https://github.com/PaddlePaddle/ERNIE
Koparkar, S. (2025, December 3). Mixture of experts powers the most intelligent frontier AI models, runs 10x faster to deliver 1/10 the token cost on NVIDIA Blackwell NVL72. NVIDIA Blog. https://blogs.nvidia.com/blog/mixture-of-experts-blackwell-nvl72
Dutt, R., Hanspal, H., Xia, G., et al. (2025). Exploiting mixture-of-experts redundancy unlocks multimodal generative abilities. arXiv preprint arXiv:2503.22517. https://doi.org/10.48550/arXiv.2503.22517
Bhatnagar, S., Xu, A., Tan, K.-H., et al. (2025). LUQ: Layerwise ultra-low bit quantization for multimodal large language models. arXiv preprint arXiv:2509.23729. https://arxiv.org/abs/2509.23729
Chen, M., Wu, M., Jin, H., et al. (2025). Int v.s. fp: A comprehensive study of fine-grained low-bit quantization formats. arXiv preprint arXiv:2510.25602. https://arxiv.org/abs/2510.25602
Fedus, W., Zoph, B., Shazeer, N., et al. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39. http://jmlr.org/papers/v23/21-0998.html
ONNX Runtime. (2025). TensorRT execution provider. https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html
Mao, H. Z., Xin, J., Yu, Y., et al. (2025). Enable Blackwell inference with TensorRT model optimizer [Video]. NVIDIA On-Demand. https://www.nvidia.com/en-us/on-demand/session/gtc25-s72609/
Qiu, H., Biswas, A., Zhao, Z., et al. (2025). ModServe: Modality- and stage-aware resource disaggregation for scalable multimodal model serving. arXiv preprint arXiv:2502.00937. https://arxiv.org/abs/2502.00937
Li, Y., Zhang, S., Zeng, Y., et al. (2025). Tiny but mighty: A software-hardware co-design approach for efficient multimodal inference on battery-powered small devices. arXiv preprint arXiv:2510.05109. https://arxiv.org/abs/2510.05109
Longpre, S., Akiki, C., Lund, C., et al. (2025). Economies of open intelligence: Tracing power & participation in the model ecosystem. arXiv preprint arXiv:2512.03073. https://doi.org/10.48550/arXiv.2512.03073
Alghareeb, F. S., & Hasan, B. T. (2025). Leveraging multithreading on edge computing for smart healthcare based on intelligent multimodal classification approach. Computerized Medical Imaging and Graphics, 124, Article 102594. https://doi.org/10.1016/j.compmedimag.2025.102594
Rydziński, S., Zahradnikova, B., Sutova, Z., et al. (2024). A predictive quality inspection framework for the manufacturing process in the context of Industry 4.0. Sensors, 24(17), 5644. https://doi.org/10.3390/s24175644
Baidu Research Blog. (2025, June 30). Announcement of open source release of the ERNIE 4.5 model family. https://ernie.baidu.com/blog/posts/ernie4.5/
ERNIE Team. (2025). ERNIE 4.5 technical report. Baidu Inc. https://yiyan.baidu.com/blog/publication/ERNIE_Technical_Report.pdf
Hu, X., Chen, Z., Yang, D., et al. (2025). MoEQuant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804. https://doi.org/10.48550/arXiv.2505.03804
Fang, Z., Lin, Z., Hu, S., et al. (2025). HFedMoE: Resource-aware heterogeneous federated learning with mixture-of-experts. arXiv preprint arXiv:2601.00583. https://arxiv.org/abs/2601.00583
NVIDIA. (2025). TensorRT-LLM: NVIDIA’s open-source library for accelerating large language model inference. https://developer.nvidia.com/tensorrt-llm
NVIDIA. (2025). NVIDIA/tensorrt-llm [Source code]. GitHub. https://github.com/NVIDIA/TensorRT-LLM
Yao, D., Yang, C., Tong, Z., et al. (2025). VecInfer: Efficient LLM inference with low-bit KV cache via outlier-suppressed vector quantization. arXiv preprint arXiv:2510.06175. https://arxiv.org/abs/2510.06175
Yuan, X., Chen, H., Liu, L., et al. (2025). DistMLLM: Enhancing multimodal large language model serving in heterogeneous edge computing. Sensors, 25(24), 7612. https://doi.org/10.3390/s25247612
Hu, Y., Yang, Z., Zhao, C., et al. (2025). AIVD: Adaptive edge-cloud collaboration for accurate and efficient industrial visual detection. arXiv preprint arXiv:2601.04734. https://arxiv.org/abs/2601.04734
Pasdar, A., Lee, Y. C., Hassanzadeh, T., et al. (2021). Resource recommender for cloud-edge engineering. Information, 12(6), 224. https://doi.org/10.3390/info12060224
Wang, Z., Zhao, H., Yang, Y., et al. (2025). OmniFuser: Adaptive multimodal fusion for service-oriented predictive maintenance. arXiv preprint arXiv:2511.01320. https://arxiv.org/abs/2511.01320
Kang, S. (2025). Trustworthy equipment monitoring via cascaded anomaly detection and thermal localization. arXiv preprint arXiv:2512.24755. https://arxiv.org/abs/2512.24755
Lueh, K. L. (2025, September 9). Industrial AI market: 10 insights on how AI is transforming manufacturing. IoT Analytics. https://iot-analytics.com/industrial-ai-market-insights-how-ai-is-transforming-manufacturing/
Li, X. (2025, September 1). Artificial intelligence empowers the digital-intelligent transformation of China’s manufacturing industry. People’s Tribune, 4. https://paper.people.com.cn/rmlt/pc/content/202509/01/content_30104680.html
Putteti, S., Santhi, G., Mittoor, G. R., et al. (2025). Intelligent industrial IoT: A data-driven approach for smart manufacturing and predictive maintenance. In 2025 Third International Conference on Augmented Intelligence and Sustainable Systems (ICAISS) (pp. 1032–1040). IEEE. https://doi.org/10.1109/ICAISS61471.2025.11041978
Prasser, D. R. (2025, July 21). Future of manufacturing: 13 trends driving 2026-2035 growth. StartUs Insights. https://www.startus-insights.com/innovators-guide/future-of-manufacturing/
Zhang, W., Yang, D., Gao, S., et al. (2026). Conclusions. In Wireless Networks (pp. 131–152). Springer. https://doi.org/10.1007/978-3-032-07667-0_6
Yu, W., Liu, Y., Dillon, T., et al. (2023). Edge computing-assisted IoT framework with an autoencoder for fault detection in manufacturing predictive maintenance. IEEE Transactions on Industrial Informatics, 19(4), 5701–5710. https://doi.org/10.1109/TII.2022.3178732
Wang, X., Li, X., Wang, N., et al. (2022). Fine-grained cloud edge collaborative dynamic task scheduling based on DNN layer-partitioning. In 2022 18th International Conference on Mobility, Sensing and Networking (MSN) (pp. 155–162). IEEE. https://doi.org/10.1109/MSN57253.2022.00037
Doslouglu, T., & MacDonald, M. (2022). Circuit design for predictive maintenance. arXiv preprint arXiv:2211.10248. https://doi.org/10.48550/arXiv.2211.10248
Zhao, F., Wu, Y., Hu, M., et al. (2025). Current progress of digital twin construction using medical imaging. Journal of Applied Clinical Medical Physics, 26(8), Article e70226. https://doi.org/10.1002/acm2.70226
Wang, K., Zhou, X., & Guan, J. (2026). The construction of an integrated cloud network digital intelligence platform for rail transit based on artificial intelligence. Scientific Reports, 16, Article 393. https://doi.org/10.1038/s41598-026-39968-2
Hao, Y., Cheng, C., Li, J., et al. (2025). Multimodal integration in health care: Development with applications in disease management. Journal of Medical Internet Research, 27, Article e76557. https://doi.org/10.2196/76557
Zhou, S., Xia, Y., Huang, H., et al. (2025). Enhancing multi-modal models with heterogeneous MoE adapters for fine-tuning. In 2025 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1–6). IEEE. https://doi.org/10.1109/ICME59968.2025.11210043
Wang, Y., & Zhao, J. (2025). A unified and resource-aware framework for adaptive inference acceleration on edge and embedded platforms. Electronics, 14(11), 2188. https://doi.org/10.3390/electronics14112188
Shimabukuro, J. (2025, October 28). Under-radar AI disruptors (projections from late-Oct. 2025). ETC Journal. https://etcjournal.com/2025/10/28/under-radar-ai-disruptors-projections-from-late-oct-2025/
Laurent, A. (2025, December). Latest AI research (Dec 2025): GPT-5, agents & trends. Intuition Labs. https://intuitionlabs.ai/articles/latest-ai-research-trends-2025
NVIDIA. (2025). Overview - TensorRT edge-LLM. https://nvidia.github.io/TensorRT-Edge-LLM/0.4.0/developer_guide/01.1_Overview.html
Intel Market Research. (2025, September 16). AI medical edge computing system market growth analysis, dynamics, key players and innovations, outlook and forecast 2025-2032. https://www.intelmarketresearch.com/ai-medical-edge-computing-system-market-7571
