As we've been anticipating, models have been getting more and more capable of understanding different kinds of inputs. We've seen image transformer models (see my blogs on fine-tuning Flux and the research behind MM1), and now we're starting to see video models hit the scene.
In December of 2024, Meta unveiled their new Apollo family of models. Alongside the release, they also published a paper detailing their research and work on Large Multimodal Models (LMMs). The paper is full of great details, so rather than try to cover it all, I'll be focusing on the four major design choices they highlighted when building their model.
Let’s dive in!
Embedding
Let's first lay out some quick concepts that are important for understanding what's going on here. Every Transformer relies on embeddings for its input. However, user input is typically first converted from something the user understands (text, video) into tokens and then into embeddings. To convert to embeddings, we use an embedding model. For multi-modal inputs, we typically use a different encoder for each input type, as sketched below.
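To make that concrete, here is a minimal sketch (not Apollo's actual architecture, and all dimensions are hypothetical) of the idea: text tokens go through a lookup-table embedder, images go through a separate patch encoder, and both land in the same embedding space so the transformer can consume them as one sequence.

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 512      # hypothetical embedding size
VOCAB_SIZE = 32_000   # hypothetical text vocabulary size

# Text path: token IDs -> embeddings via a lookup table.
text_embedder = nn.Embedding(VOCAB_SIZE, HIDDEN_DIM)

# Vision path: raw pixels -> patch embeddings via a simple patchify +
# linear projection (a stand-in for a real image/video encoder like a ViT).
class PatchEncoder(nn.Module):
    def __init__(self, patch_size=16, channels=3, hidden_dim=HIDDEN_DIM):
        super().__init__()
        self.proj = nn.Conv2d(channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # (B, C, H, W)
        patches = self.proj(images)                # (B, D, H/ps, W/ps)
        return patches.flatten(2).transpose(1, 2)  # (B, num_patches, D)

vision_embedder = PatchEncoder()

# Toy inputs: a batch of token IDs and one 224x224 image.
token_ids = torch.randint(0, VOCAB_SIZE, (1, 8))
images = torch.randn(1, 3, 224, 224)

text_emb = text_embedder(token_ids)   # (1, 8, 512)
image_emb = vision_embedder(images)   # (1, 196, 512)

# Because both modalities share the same embedding dimension, they can be
# concatenated into a single sequence for the transformer.
sequence = torch.cat([text_emb, image_emb], dim=1)
print(sequence.shape)  # torch.Size([1, 204, 512])
```

The key takeaway is that each modality gets its own encoder, but they all have to agree on the embedding dimension the transformer expects.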