Designing a multi-modal LLM is hard.
State-of-the-art multi-modal LLMs are built on existing LLM architectures, with modifications that specifically handle different sources of input, and that is where the difficulty comes from. The latest NVIDIA paper divides the commonly used multi-modal architectures into two categories:
- decoder-based;
- cross-attention-based.
One of my earlier Medium articles discussed the latest paper from Meta, which uses a decoder-based architecture. It converts an input image into a latent vector with a VAE encoder, addressing the problem that the image space is continuous while the text space is discrete.
However, the problem with the cross-attention-based architecture is different. For example, in the multi-modal LLM Flamingo, the critical issue is that a generic vision model produces vision embeddings whose temporal and spatial dimensions vary from input to input, and these embeddings must be converted to match the language input dimension before they are fed into the cross-attention layers.
In this post, I'll dive deep into the Perceiver Resampler, Flamingo's unique design on top of the vision encoder, to explain how this issue was solved. I'll also explore the Perceiver Resampler's origin: the Induced Set Attention Block from Set Transformer, which further inspired DeepMind's Perceiver model for learning fixed-length latent embeddings from generic input data.
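To make the fixed-length idea concrete before going into the details, here is a minimal NumPy sketch of latent cross-attention. It is a simplified illustration, not Flamingo's actual Perceiver Resampler: it uses a single head and a single layer, the weights are random rather than learned, and the names `resample`, `latents`, `w_q`, `w_k`, and `w_v` are assumptions made for this example. The point is only that a fixed set of latent queries yields an output of the same shape no matter how many visual features come in.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(features, latents, w_q, w_k, w_v):
    """Single-head cross-attention: fixed latents query variable-length features."""
    q = latents @ w_q                         # (num_latents, d)
    k = features @ w_k                        # (n_features, d)
    v = features @ w_v                        # (n_features, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_latents, n_features)
    return softmax(scores, axis=-1) @ v       # (num_latents, d)

rng = np.random.default_rng(0)
d, num_latents = 16, 4

# In a trained model the latents and projections are learned; here they are random.
latents = rng.normal(size=(num_latents, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

# Two inputs with different numbers of visual tokens (frames x patches, flattened).
short_clip = rng.normal(size=(1 * 9, d))   # 1 frame, 9 patches
long_clip = rng.normal(size=(8 * 9, d))    # 8 frames, 9 patches each

out_short = resample(short_clip, latents, w_q, w_k, w_v)
out_long = resample(long_clip, latents, w_q, w_k, w_v)
print(out_short.shape, out_long.shape)     # (4, 16) (4, 16)
```

Because the number of output rows is set by the number of latents rather than by the input length, the language model's cross-attention layers always see the same number of visual tokens.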