Review
. 2024 Nov 12;11(12):nwae403.
doi: 10.1093/nsr/nwae403. eCollection 2024 Dec.

A survey on multimodal large language models


Shukang Yin et al. Natl Sci Rev 2024; 11(12): nwae403.

Abstract

The multimodal large language model (MLLM), represented by GPT-4V, has recently emerged as a new research hotspot; it uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and optical character recognition (OCR)-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics on how MLLMs can be extended to support more granularity, modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.

Keywords: large language model; multimodal large language model; vision language model.


Figures

Figure 1.
A timeline of representative MLLMs. We are witnessing rapid growth in this field. More works can be found on our released GitHub page, which is updated daily.
Figure 2.
An illustration of a typical MLLM architecture. It includes an encoder, a connector and an LLM. An optional generator can be attached to the LLM to generate modalities besides text. The encoder takes in images, audio or videos and outputs features, which are processed by the connector so that the LLM can better understand them. There are broadly three types of connector: projection-based, query-based and fusion-based. The former two types adopt token-level fusion, processing features into tokens to be sent along with text tokens, while the last type enables feature-level fusion inside the LLM.
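The token-level fusion described above can be illustrated with a minimal sketch of a projection-based connector: encoder features are linearly mapped to the LLM embedding size and concatenated with the text token embeddings. All dimensions and the random features below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 1024, 4096   # assumed encoder / LLM embedding sizes
n_patches, n_text = 256, 16    # assumed numbers of visual patches and text tokens

W = rng.normal(0, 0.02, size=(d_vision, d_llm))        # learned projection matrix
vision_feats = rng.normal(size=(n_patches, d_vision))  # output of the vision encoder
text_embeds = rng.normal(size=(n_text, d_llm))         # embedded text tokens

visual_tokens = vision_feats @ W                        # project into LLM space
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (272, 4096): visual tokens prepended to text tokens
```

A query-based connector would instead compress the patch features into a fixed number of learned query tokens before concatenation, but the resulting LLM input has the same token-sequence form.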
Scheme 1.
A simplified template to structure the caption data. {&lt;image&gt;} is the placeholder for the visual tokens, and {caption} is the caption for the image. Note that only the part marked in red is used for loss calculation.
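The caption template and its loss masking can be sketched as follows. This is a hypothetical word-level reconstruction, not the paper's implementation: in practice the mask is built over subword tokens, and only the caption part (the part "marked in red" in the scheme) contributes to the loss.

```python
def build_caption_sample(caption: str) -> dict:
    """Build a training sample where only caption words are supervised."""
    prompt = "<image> "   # placeholder later replaced by visual tokens
    text = prompt + caption
    # loss mask: 0 for the prompt placeholder, 1 for each caption word
    mask = [0] * len(prompt.split()) + [1] * len(caption.split())
    return {"text": text, "loss_mask": mask}

sample = build_caption_sample("a cat sitting on a mat")
print(sample["text"])       # <image> a cat sitting on a mat
print(sample["loss_mask"])  # [0, 1, 1, 1, 1, 1, 1]
```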
Figure 3.
Comparison of three typical learning paradigms, adapted from [76].
Scheme 2.
A simplified template to structure the multimodal instruction data. &lt;instruction&gt; is a textual description of the task. {&lt;image&gt;, &lt;text&gt;} and &lt;output&gt; are the input and output from the data sample. Note that &lt;text&gt; in the input may be missing for some datasets; for example, image caption datasets merely have &lt;image&gt;. The example is adapted from [81].
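A sketch of filling such an instruction template is shown below. The wrapper wording and the helper are illustrative assumptions, not the template from [81]; the structure (an instruction, an image-plus-optional-text input, and an output) follows the scheme.

```python
TEMPLATE = (
    "Instruction: {instruction}\n"
    "Input: {image}{text}\n"
    "Response: {output}"
)

def format_sample(instruction: str, text: str = "", output: str = "",
                  image_token: str = "<image>") -> str:
    """Fill the instruction template; <text> may be absent (e.g. caption data)."""
    text_part = f" {text}" if text else ""
    return TEMPLATE.format(instruction=instruction, image=image_token,
                           text=text_part, output=output)

s = format_sample("Describe the image briefly.", output="A dog runs on grass.")
print(s)
```

During training, only the response portion would typically be supervised, mirroring the loss masking of the caption template in Scheme 1.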
Scheme 3.
Instruction templates for VQA datasets, cited from [51]. &lt;Image&gt; and {Question} are the image and the question in the original VQA datasets, respectively.
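Converting a plain VQA pair into instruction format with such templates can be sketched as below. The template strings here are invented for illustration; the actual templates in [51] differ in wording.

```python
import random

# Hypothetical pool of hand-written VQA instruction templates
VQA_TEMPLATES = [
    "<Image> {Question}",
    "<Image> Question: {Question} Answer:",
    "<Image> Given the image, answer the following question. {Question}",
]

def to_instruction(question: str, seed: int = 0) -> str:
    """Sample a template and fill in the original VQA question."""
    random.seed(seed)
    template = random.choice(VQA_TEMPLATES)
    return template.format(Question=question)

print(to_instruction("What color is the car?"))
```

Sampling among several phrasings of the same task is what gives the model robustness to instruction wording at inference time.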
Scheme 4.
A simplified example of the template to structure an M-ICL query, adapted from [81]. For illustration, we list two in-context examples and a query divided by a dashed line. {instruction} and {response} are texts from the data sample. &lt;image&gt; is a placeholder to represent the multimodal input (an image in this case). &lt;BOS&gt; and &lt;EOS&gt; are tokens denoting the start and the end of the input to the LLM, respectively.
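Assembling such an M-ICL query can be sketched as follows: a few (image, instruction, response) demonstrations are concatenated before the test query, with &lt;BOS&gt;/&lt;EOS&gt; bracketing the whole LLM input. The function and its formatting details are a hypothetical sketch of the scheme, not code from [81].

```python
def build_micl_prompt(demos: list[tuple[str, str]], query_instruction: str) -> str:
    """Concatenate in-context demonstrations before the unanswered query."""
    parts = ["<BOS>"]
    for instruction, response in demos:      # completed examples
        parts.append(f"<image> {instruction} {response}")
    parts.append(f"<image> {query_instruction}")  # query: response left open
    parts.append("<EOS>")
    return "\n".join(parts)

demos = [("Describe the image.", "A cat on a sofa."),
         ("Describe the image.", "Two boats at sea.")]
prompt = build_micl_prompt(demos, "Describe the image.")
print(prompt)
```

No parameters are updated: the model is expected to infer the task format from the demonstrations alone.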

References

    1. Zhao WX, Zhou K, Li J et al. A survey of large language models. arXiv: 2303.18223.
    2. Xu B, Poo MM. Large language models and brain-inspired general intelligence. Natl Sci Rev 2023; 10: nwad267. 10.1093/nsr/nwad267
    3. Peng B, Li C, He P et al. Instruction tuning with GPT-4. arXiv: 2304.03277.
    4. Brown T, Mann B, Ryder N et al. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2020, 1877–1901.
    5. Wei J, Wang X, Schuurmans D et al. Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2024, 24824–37.
