Review
. 2024 Nov 12;11(12):nwae403.
doi: 10.1093/nsr/nwae403. eCollection 2024 Dec.

A survey on multimodal large language models


Shukang Yin et al. Natl Sci Rev 2024; 11(12): nwae403.

Abstract

The multimodal large language model (MLLM), represented by GPT-4V, has recently emerged as a new research hotspot; it uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and optical character recognition (OCR)-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics on how MLLMs can be extended to support more granularity, modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.

Keywords: large language model; multimodal large language model; vision language model.


Figures

Figure 1.
A timeline of representative MLLMs. We are witnessing rapid growth in this field. More works can be found on our released GitHub page, which is updated daily.
Figure 2.
An illustration of a typical MLLM architecture. It includes an encoder, a connector and an LLM. An optional generator can be attached to the LLM to generate modalities besides text. The encoder takes in images, audio or videos and outputs features, which are processed by the connector so that the LLM can better understand them. There are broadly three types of connector: projection-based, query-based and fusion-based. The former two types adopt token-level fusion, processing features into tokens to be sent along with text tokens, while the last type enables feature-level fusion inside the LLM.
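The token-level fusion described above can be illustrated with a minimal sketch of a projection-based connector: encoder features are linearly mapped to the LLM embedding size and concatenated with the text token embeddings. All dimensions and the random features below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 1024, 4096   # assumed encoder / LLM embedding sizes
n_patches, n_text = 256, 16    # assumed numbers of visual patches and text tokens

W = rng.normal(0, 0.02, size=(d_vision, d_llm))        # learned projection matrix
vision_feats = rng.normal(size=(n_patches, d_vision))  # output of the vision encoder
text_embeds = rng.normal(size=(n_text, d_llm))         # embedded text tokens

visual_tokens = vision_feats @ W                        # project into LLM space
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (272, 4096): visual tokens prepended to text tokens
```

A query-based connector would instead compress the patch features into a fixed number of learned query tokens before concatenation, but the resulting LLM input has the same token-sequence form.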
Scheme 1.
A simplified template to structure the caption data. {&lt;image&gt;} is the placeholder for the visual tokens, and {caption} is the caption for the image. Note that only the part marked in red is used for loss calculation.
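The caption template and its loss masking can be sketched as follows. This is a hypothetical word-level reconstruction, not the paper's implementation: in practice the mask is built over subword tokens, and only the caption part (the part "marked in red" in the scheme) contributes to the loss.

```python
def build_caption_sample(caption: str) -> dict:
    """Build a training sample where only caption words are supervised."""
    prompt = "<image> "   # placeholder later replaced by visual tokens
    text = prompt + caption
    # loss mask: 0 for the prompt placeholder, 1 for each caption word
    mask = [0] * len(prompt.split()) + [1] * len(caption.split())
    return {"text": text, "loss_mask": mask}

sample = build_caption_sample("a cat sitting on a mat")
print(sample["text"])       # <image> a cat sitting on a mat
print(sample["loss_mask"])  # [0, 1, 1, 1, 1, 1, 1]
```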
Figure 3.
Comparison of three typical learning paradigms, adapted from [76].
Scheme 2.
A simplified template to structure the multimodal instruction data. &lt;instruction&gt; is a textual description of the task. {&lt;image&gt;, &lt;text&gt;} and &lt;output&gt; are the input and output from the data sample. Note that &lt;text&gt; in the input may be missing for some datasets; for example, image caption datasets merely have &lt;image&gt;. The example is adapted from [81].
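A sketch of filling such an instruction template is shown below. The wrapper wording and the helper are illustrative assumptions, not the template from [81]; the structure (an instruction, an image-plus-optional-text input, and an output) follows the scheme.

```python
TEMPLATE = (
    "Instruction: {instruction}\n"
    "Input: {image}{text}\n"
    "Response: {output}"
)

def format_sample(instruction: str, text: str = "", output: str = "",
                  image_token: str = "<image>") -> str:
    """Fill the instruction template; <text> may be absent (e.g. caption data)."""
    text_part = f" {text}" if text else ""
    return TEMPLATE.format(instruction=instruction, image=image_token,
                           text=text_part, output=output)

s = format_sample("Describe the image briefly.", output="A dog runs on grass.")
print(s)
```

During training, only the response portion would typically be supervised, mirroring the loss masking of the caption template in Scheme 1.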
Scheme 3.
Instruction templates for VQA datasets, cited from [51]. &lt;Image&gt; and {Question} are the image and the question in the original VQA datasets, respectively.
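Converting a plain VQA pair into instruction format with such templates can be sketched as below. The template strings here are invented for illustration; the actual templates in [51] differ in wording.

```python
import random

# Hypothetical pool of hand-written VQA instruction templates
VQA_TEMPLATES = [
    "<Image> {Question}",
    "<Image> Question: {Question} Answer:",
    "<Image> Given the image, answer the following question. {Question}",
]

def to_instruction(question: str, seed: int = 0) -> str:
    """Sample a template and fill in the original VQA question."""
    random.seed(seed)
    template = random.choice(VQA_TEMPLATES)
    return template.format(Question=question)

print(to_instruction("What color is the car?"))
```

Sampling among several phrasings of the same task is what gives the model robustness to instruction wording at inference time.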
Scheme 4.
A simplified example of the template to structure an M-ICL query, adapted from [81]. For illustration, we list two in-context examples and a query divided by a dashed line. {instruction} and {response} are texts from the data sample. &lt;image&gt; is a placeholder to represent the multimodal input (an image in this case). &lt;BOS&gt; and &lt;EOS&gt; are tokens denoting the start and the end of the input to the LLM, respectively.
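Assembling such an M-ICL query can be sketched as follows: a few (image, instruction, response) demonstrations are concatenated before the test query, with &lt;BOS&gt;/&lt;EOS&gt; bracketing the whole LLM input. The function and its formatting details are a hypothetical sketch of the scheme, not code from [81].

```python
def build_micl_prompt(demos: list[tuple[str, str]], query_instruction: str) -> str:
    """Concatenate in-context demonstrations before the unanswered query."""
    parts = ["<BOS>"]
    for instruction, response in demos:      # completed examples
        parts.append(f"<image> {instruction} {response}")
    parts.append(f"<image> {query_instruction}")  # query: response left open
    parts.append("<EOS>")
    return "\n".join(parts)

demos = [("Describe the image.", "A cat on a sofa."),
         ("Describe the image.", "Two boats at sea.")]
prompt = build_micl_prompt(demos, "Describe the image.")
print(prompt)
```

No parameters are updated: the model is expected to infer the task format from the demonstrations alone.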

References

    1. Zhao WX, Zhou K, Li J et al. A survey of large language models. arXiv: 2303.18223.
    2. Xu B, Poo MM. Large language models and brain-inspired general intelligence. Natl Sci Rev 2023; 10: nwad267. 10.1093/nsr/nwad267
    3. Peng B, Li C, He P et al. Instruction tuning with GPT-4. arXiv: 2304.03277.
    4. Brown T, Mann B, Ryder N et al. Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2020, 1877–1901.
    5. Wei J, Wang X, Schuurmans D et al. Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates, 2024, 24824–37.
