As we've been anticipating, models have been getting more and more capable of understanding different kinds of inputs. We've seen image transformer models (see my blogs on fine-tuning Flux and the research behind MM1), and now we're starting to see video models hit the scene.
In December of 2024, Meta unveiled their new Apollo family of models. Alongside the release, they also published a paper detailing their research and work on Large Multimodal Models (LMMs). The paper is full of great details, so rather than try to cover it all, I'll be focusing on the four major design choices they highlighted when building their model.
Let’s dive in!
Embedding
Let's first lay out some quick concepts that are important for understanding what's going on here. Every Transformer relies on embeddings for its input. However, user input is typically first converted from something the user understands (text, video) into tokens and then into embeddings. To convert to embeddings, we use an embedding model. For multi-modal inputs, we typically use a different encoder for each input type, as sketched below.
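To make that concrete, here is a minimal sketch (not Apollo's actual architecture, and all dimensions are hypothetical) of the idea: text tokens go through a lookup-table embedder, images go through a separate patch encoder, and both land in the same embedding space so the transformer can consume them as one sequence.

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 512      # hypothetical embedding size
VOCAB_SIZE = 32_000   # hypothetical text vocabulary size

# Text path: token IDs -> embeddings via a lookup table.
text_embedder = nn.Embedding(VOCAB_SIZE, HIDDEN_DIM)

# Vision path: raw pixels -> patch embeddings via a simple patchify +
# linear projection (a stand-in for a real image/video encoder like a ViT).
class PatchEncoder(nn.Module):
    def __init__(self, patch_size=16, channels=3, hidden_dim=HIDDEN_DIM):
        super().__init__()
        self.proj = nn.Conv2d(channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                     # (B, C, H, W)
        patches = self.proj(images)                # (B, D, H/ps, W/ps)
        return patches.flatten(2).transpose(1, 2)  # (B, num_patches, D)

vision_embedder = PatchEncoder()

# Toy inputs: a batch of token IDs and one 224x224 image.
token_ids = torch.randint(0, VOCAB_SIZE, (1, 8))
images = torch.randn(1, 3, 224, 224)

text_emb = text_embedder(token_ids)   # (1, 8, 512)
image_emb = vision_embedder(images)   # (1, 196, 512)

# Because both modalities share the same embedding dimension, they can be
# concatenated into a single sequence for the transformer.
sequence = torch.cat([text_emb, image_emb], dim=1)
print(sequence.shape)  # torch.Size([1, 204, 512])
```

The key takeaway is that each modality gets its own encoder, but they all have to agree on the embedding dimension the transformer expects.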