Efficient Deployment of Multimodal Large Models: A Survey on Technical Innovations, Industrial Applications, and Challenges of Heterogeneous MoE Architecture, Low-bit Quantization, and Cloud-Edge-End Collaboration (2024-2026)

Yaolin Zhang (Corresponding Author)
Jinan University, Guangzhou, China
Pengrong Huang
South China Agricultural University, Guangzhou, China
Silence
Published: 2026-02-16

Abstract

This survey focuses on the efficient deployment of multimodal large models (MLLMs) and systematically reviews the core technologies of 2024 to 2026. At the architectural level, heterogeneous Mixture-of-Experts (MoE) architectures represented by ERNIE 4.5 balance cross-modal parameter sharing against modality-specific computation through modality-isolated routing and a router orthogonal loss, significantly improving computational efficiency. In model compression and inference optimization, methods such as Layerwise Ultra-Low Bit Quantization (LUQ) and MoEQuant address the quantization challenges posed by multimodal data, while TensorRT-LLM and ONNX Runtime provide inference engines deeply optimized for hardware architectures such as Blackwell. On the deployment side, cloud-edge-end collaborative architectures represented by DistMLLM, AIVD, and NANOMIND meet low-latency and high-privacy requirements while reducing cost and energy consumption through adaptive offloading, dynamic resource scheduling, and hardware-software co-design. Application cases show that these technologies address industry pain points such as high latency and data privacy in medical edge diagnostic devices and in quality inspection and predictive maintenance scenarios in smart manufacturing, balancing efficiency and performance.
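The modality-isolated routing and router orthogonal loss mentioned above can be illustrated with a toy sketch: each modality gets its own gating matrix over the expert pool, so text and vision tokens are routed independently, and an auxiliary loss penalizes overlap between the gating subspaces of different modalities. The class and loss form below are illustrative assumptions for exposition, not the ERNIE 4.5 implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ModalityIsolatedRouter:
    """Toy modality-isolated MoE router: one gating matrix per modality
    over a shared expert pool (names and loss form are illustrative)."""

    def __init__(self, d_model, n_experts, modalities, seed=0):
        rng = np.random.default_rng(seed)
        # Separate gating parameters per modality -> isolated routing.
        self.gates = {m: rng.standard_normal((d_model, n_experts)) * 0.02
                      for m in modalities}

    def route(self, tokens, modality, top_k=2):
        logits = tokens @ self.gates[modality]        # (n_tokens, n_experts)
        probs = softmax(logits)
        # Select the top-k experts for each token of this modality.
        topk = np.argsort(-probs, axis=-1)[:, :top_k]
        return topk, probs

    def orthogonal_loss(self):
        # One possible orthogonality penalty: energy of the cross-Gram
        # matrix G_a^T G_b between every pair of modality gates, which is
        # zero when the gating subspaces are mutually orthogonal.
        mats = list(self.gates.values())
        loss = 0.0
        for i in range(len(mats)):
            for j in range(i + 1, len(mats)):
                loss += float(np.sum((mats[i].T @ mats[j]) ** 2))
        return loss
```

In a real training loop this penalty would be scaled and added to the task loss, nudging the per-modality routers toward non-overlapping expert specializations while the expert weights themselves remain shared.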
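The low-bit quantization methods surveyed all build on uniform quantization of weight tensors; the minimal sketch below shows symmetric per-tensor quantization to n bits, the basic primitive on top of which LUQ-style layerwise bit allocation and MoEQuant-style expert-aware calibration operate. The function names are ours, for illustration only.

```python
import numpy as np

def quantize_symmetric(w, n_bits):
    """Uniform symmetric quantization of a weight tensor to n_bits.
    Real low-bit schemes add per-layer bit allocation, outlier handling,
    and calibration on top of this primitive."""
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for 4-bit signed
    scale = np.abs(w).max() / qmax        # map the largest magnitude to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor; the error per element is
    bounded by scale / 2 (half a quantization step)."""
    return q.astype(np.float32) * scale
```

At 4 bits and below, this naive per-tensor scheme degrades sharply on the heavy-tailed activations of multimodal layers, which is precisely the gap that the layerwise and expert-balanced methods cited in this survey aim to close.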

Keywords:

Multimodal Large Models; Heterogeneous MoE; Low-bit Quantization; Cloud-Edge-End Collaboration; Industrial Deployment

Journal Info

ISSN: 3054-4386
Publisher: Panorama Scholarly Group

How to Cite

Zhang, Y., & Huang, P. (2026). Efficient deployment of multimodal large models: A survey on technical innovations, industrial applications, and challenges of heterogeneous MoE architecture, low-bit quantization, and cloud-edge-end collaboration (2024-2026). Silence, 1(1), 13-39. https://doi.org/10.5281/zenodo.18681507

References

Liang, C. X., Tian, P., Yin, C. H., et al. (2024). A comprehensive survey and guide to multimodal large language models in vision-language tasks. arXiv preprint arXiv:2411.06284. https://doi.org/10.48550/arXiv.2411.06284

Technode. (2025, June 30). Baidu open-sources ERNIE 4.5 multimodal AI model. https://00l.xyz/03U3s

Baidu Inc. (2025). ERNIE 4.5: Multimodal large language model [Computer software]. GitHub. https://github.com/PaddlePaddle/ERNIE

Koparkar, S. (2025, December 3). Mixture of experts powers the most intelligent frontier AI models, runs 10x faster to deliver 1/10 the token cost on NVIDIA Blackwell NVL72. NVIDIA Blog. https://blogs.nvidia.com/blog/mixture-of-experts-blackwell-nvl72

Dutt, R., Hanspal, H., Xia, G., et al. (2025). Exploiting mixture-of-experts redundancy unlocks multimodal generative abilities. arXiv preprint arXiv:2503.22517. https://doi.org/10.48550/arXiv.2503.22517

Bhatnagar, S., Xu, A., Tan, K.-H., et al. (2025). LUQ: Layerwise ultra-low bit quantization for multimodal large language models. arXiv preprint arXiv:2509.23729. https://arxiv.org/abs/2509.23729

Chen, M., Wu, M., Jin, H., et al. (2025). Int v.s. fp: A comprehensive study of fine-grained low-bit quantization formats. arXiv preprint arXiv:2510.25602. https://arxiv.org/abs/2510.25602

Fedus, W., Zoph, B., Shazeer, N., et al. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39. http://jmlr.org/papers/v23/21-0998.html

ONNX Runtime. (2025). TensorRT execution provider. https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html

Mao, H. Z., Xin, J., Yu, Y., et al. (2025). Enable Blackwell inference with TensorRT model optimizer [Video]. NVIDIA On-Demand. https://www.nvidia.com/en-us/on-demand/session/gtc25-s72609/

Qiu, H., Biswas, A., Zhao, Z., et al. (2025). ModServe: Modality- and stage-aware resource disaggregation for scalable multimodal model serving. arXiv preprint arXiv:2502.00937. https://arxiv.org/abs/2502.00937

Li, Y., Zhang, S., Zeng, Y., et al. (2025). Tiny but mighty: A software-hardware co-design approach for efficient multimodal inference on battery-powered small devices. arXiv preprint arXiv:2510.05109. https://arxiv.org/abs/2510.05109

Longpre, S., Akiki, C., Lund, C., et al. (2025). Economies of open intelligence: Tracing power & participation in the model ecosystem. arXiv preprint arXiv:2512.03073. https://doi.org/10.48550/arXiv.2512.03073

Alghareeb, F. S., & Hasan, B. T. (2025). Leveraging multithreading on edge computing for smart healthcare based on intelligent multimodal classification approach. Computerized Medical Imaging and Graphics, 124, Article 102594. https://doi.org/10.1016/j.compmedimag.2025.102594

Rydziński, S., Zahradnikova, B., Sutova, Z., et al. (2024). A predictive quality inspection framework for the manufacturing process in the context of Industry 4.0. Sensors, 24(17), 5644. https://doi.org/10.3390/s24175644

Baidu Research Blog. (2025, June 30). Announcement of open source release of the ERNIE 4.5 model family. https://ernie.baidu.com/blog/posts/ernie4.5/

ERNIE Team. (2025). ERNIE 4.5 technical report. Baidu Inc. https://yiyan.baidu.com/blog/publication/ERNIE_Technical_Report.pdf

Hu, X., Chen, Z., Yang, D., et al. (2025). MoEQuant: Enhancing quantization for mixture-of-experts large language models via expert-balanced sampling and affinity guidance. arXiv preprint arXiv:2505.03804. https://doi.org/10.48550/arXiv.2505.03804

Fang, Z., Lin, Z., Hu, S., et al. (2025). HFedMoE: Resource-aware heterogeneous federated learning with mixture-of-experts. arXiv preprint arXiv:2601.00583. https://arxiv.org/abs/2601.00583

NVIDIA. (2025). TensorRT-LLM: NVIDIA’s open-source library for accelerating large language model inference. https://developer.nvidia.com/tensorrt-llm

NVIDIA. (2025). NVIDIA/tensorrt-llm [Source code]. GitHub. https://github.com/NVIDIA/TensorRT-LLM

Yao, D., Yang, C., Tong, Z., et al. (2025). VecInfer: Efficient LLM inference with low-bit KV cache via outlier-suppressed vector quantization. arXiv preprint arXiv:2510.06175. https://arxiv.org/abs/2510.06175

Yuan, X., Chen, H., Liu, L., et al. (2025). DistMLLM: Enhancing multimodal large language model serving in heterogeneous edge computing. Sensors, 25(24), 7612. https://doi.org/10.3390/s25247612

Hu, Y., Yang, Z., Zhao, C., et al. (2025). AIVD: Adaptive edge-cloud collaboration for accurate and efficient industrial visual detection. arXiv preprint arXiv:2601.04734. https://arxiv.org/abs/2601.04734

Pasdar, A., Lee, Y. C., Hassanzadeh, T., et al. (2021). Resource recommender for cloud-edge engineering. Information, 12(6), 224. https://doi.org/10.3390/info12060224

Wang, Z., Zhao, H., Yang, Y., et al. (2025). OmniFuser: Adaptive multimodal fusion for service-oriented predictive maintenance. arXiv preprint arXiv:2511.01320. https://arxiv.org/abs/2511.01320

Kang, S. (2025). Trustworthy equipment monitoring via cascaded anomaly detection and thermal localization. arXiv preprint arXiv:2512.24755. https://arxiv.org/abs/2512.24755

Lueh, K. L. (2025, September 9). Industrial AI market: 10 insights on how AI is transforming manufacturing. IoT Analytics. https://iot-analytics.com/industrial-ai-market-insights-how-ai-is-transforming-manufacturing/

Li, X. (2025, September 1). Artificial intelligence empowers the digital-intelligent transformation of China’s manufacturing industry. People’s Tribune, 4. https://paper.people.com.cn/rmlt/pc/content/202509/01/content_30104680.html

Putteti, S., Santhi, G., Mittoor, G. R., et al. (2025). Intelligent industrial IoT: A data-driven approach for smart manufacturing and predictive maintenance. In 2025 Third International Conference on Augmented Intelligence and Sustainable Systems (ICAISS) (pp. 1032–1040). IEEE. https://doi.org/10.1109/ICAISS61471.2025.11041978

Prasser, D. R. (2025, July 21). Future of manufacturing: 13 trends driving 2026-2035 growth. StartUs Insights. https://www.startus-insights.com/innovators-guide/future-of-manufacturing/

Zhang, W., Yang, D., Gao, S., et al. (2026). Conclusions. In Wireless Networks (pp. 131–152). Springer. https://doi.org/10.1007/978-3-032-07667-0_6

Yu, W., Liu, Y., Dillon, T., et al. (2023). Edge computing-assisted IoT framework with an autoencoder for fault detection in manufacturing predictive maintenance. IEEE Transactions on Industrial Informatics, 19(4), 5701–5710. https://doi.org/10.1109/TII.2022.3178732

Wang, X., Li, X., Wang, N., et al. (2022). Fine-grained cloud edge collaborative dynamic task scheduling based on DNN layer-partitioning. In 2022 18th International Conference on Mobility, Sensing and Networking (MSN) (pp. 155–162). IEEE. https://doi.org/10.1109/MSN57253.2022.00037

Doslouglu, T., & MacDonald, M. (2022). Circuit design for predictive maintenance. arXiv preprint arXiv:2211.10248. https://doi.org/10.48550/arXiv.2211.10248

Zhao, F., Wu, Y., Hu, M., et al. (2025). Current progress of digital twin construction using medical imaging. Journal of Applied Clinical Medical Physics, 26(8), Article e70226. https://doi.org/10.1002/acm2.70226

Wang, K., Zhou, X., & Guan, J. (2026). The construction of an integrated cloud network digital intelligence platform for rail transit based on artificial intelligence. Scientific Reports, 16, Article 393. https://doi.org/10.1038/s41598-026-39968-2

Hao, Y., Cheng, C., Li, J., et al. (2025). Multimodal integration in health care: Development with applications in disease management. Journal of Medical Internet Research, 27, Article e76557. https://doi.org/10.2196/76557

Zhou, S., Xia, Y., Huang, H., et al. (2025). Enhancing multi-modal models with heterogeneous MoE adapters for fine-tuning. In 2025 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1–6). IEEE. https://doi.org/10.1109/ICME59968.2025.11210043

Wang, Y., & Zhao, J. (2025). A unified and resource-aware framework for adaptive inference acceleration on edge and embedded platforms. Electronics, 14(11), 2188. https://doi.org/10.3390/electronics14112188

Shimabukuro, J. (2025, October 28). Under-radar AI disruptors (projections from late-Oct. 2025). ETC Journal. https://etcjournal.com/2025/10/28/under-radar-ai-disruptors-projections-from-late-oct-2025/

Laurent, A. (2025, December). Latest AI research (Dec 2025): GPT-5, agents & trends. Intuition Labs. https://intuitionlabs.ai/articles/latest-ai-research-trends-2025

NVIDIA. (2025). Overview - TensorRT edge-LLM. https://nvidia.github.io/TensorRT-Edge-LLM/0.4.0/developer_guide/01.1_Overview.html

Intel Market Research. (2025, September 16). AI medical edge computing system market growth analysis, dynamics, key players and innovations, outlook and forecast 2025-2032. https://www.intelmarketresearch.com/ai-medical-edge-computing-system-market-7571