The Problem of Captioning Video at More Than 1fps

The ability of machine learning systems to recognize the events that occur within a video is crucial to the future of AI-based video generation – not least because video datasets require accurate captions in order to produce models that adhere to a user's request, and that don't excessively hallucinate.

An example of a captioning schema from Google's VidReCap project. Source: https://sites.google.com/view/vidrecap

Manually captioning the scale of videos needed for effective training datasets is an unconscionable prospect. Though it is possible to train AI systems to auto-caption videos, a great many human-generated examples are still needed as ground truth, for variety and coverage.

More importantly, nearly every current AI-based video-captioning model operates at 1fps, which is not a dense enough capture rate to discern variations in a great many scenarios: sudden micro-expression changes for emotion-recognition systems; rapid events in high-speed sports such as basketball; violent actions; rapid cuts in dramatic movies, where systems such as PySceneDetect may fail to identify them (or are not being used); and many other scenarios where the window of attention clearly needs to be more intense.

Click to play. Rapid but life-changing action in what can otherwise be one of the slowest sports on the planet, as Alex Higgins clinches the world championship against Ray Reardon in 1982. Source: https://www.youtube.com/watch?v=_1PuqKno_Ok

Move Fast and Break Logic

This low rate is the standard for various logistical reasons. For one, video captioning is a resource-intensive activity, whether the system is studying one sequential frame at a time, or else using various methods to semantically cohere a string of frames into an interpretable caption sequence. In either case, the context window is inevitably restricted by hardware constraints.

Another reason for 1fps being the current standard is that videos are not generally packed with rapid events; it is therefore redundant to give 300 frames of a static snooker table the same attention as the split-second in which a potted black ball wins the championship (see example above).

It is possible to use broader secondary cues to identify pivotal moments in a sports video, such as the sustained crowd reaction to a rapid slam-dunk in a basketball game. However, such clues may occur for other reasons (such as unexpected player injuries), and cannot be relied on. This is one example of how a mislabeled video dataset can lead to a generative video model that hallucinates or misinterprets instructions, i.e., because the model might show a player injury when it was asked to generate a slam-dunk (because the 'secondary clue' of crowd agitation was not exclusive to one specific kind of event).

This is in many ways a 'budgetary' problem, and in other ways a procedural one. Frameworks to date have operated on the principle that sparse keyframes can effectively capture essential information, but this works better for establishing genre and other facets of a video's subject matter, since evidence, in that case, persists over multiple frames.

F-16

A new paper from China is offering a solution, in the form of the first multimodal large language model (MLLM, or simply LLM) that can analyze video at 16fps instead of the standard 1fps, while avoiding the major pitfalls of increasing the analysis rate.

In tests, the authors claim that the new system, titled F-16, outperforms proprietary state-of-the-art models such as GPT-4o and Google's Gemini-1.5 Pro. While other current models were able to match or exceed F-16's results in trials, the competing models were far larger and unwieldier.

Though F-16 was trained on some serious hardware (as we'll examine shortly), inference is usually far less demanding than training. Therefore we can hope that the code (promised for a near-future release) will be capable of running on mid-range or high-end domestic GPUs.

What's needed for the vitality of the hobbyist scene (and that includes the professional VFX scene, much of the time) is a video-captioning model of this kind that can operate, perhaps quantized, on consumer systems, so that the entire generative video scene does not migrate to API-based commercial systems, or force users to hook local frameworks up to commercial online GPU services.

Beyond Scaling Up

The authors note that this kind of approach is a practical alternative to scaling up datasets. One can also infer that if you were going to throw more data at the problem, this is still the kind of approach that could be preferable, because the new system distinguishes events in a more granular way.

They state:

‘Low frame rate sampling can result in critical visual information loss, particularly in videos with rapidly changing scenes, intricate details, or fast motion. Additionally, if keyframes are missed, yet the model is trained on labels that rely on keyframe information, it may struggle to align its predictions with the expected content, potentially leading to hallucinations and degraded performance…

‘… F-16 achieves SOTA performance in general video QA among models of similar size and demonstrates a clear advantage in high-frame-rate video understanding, outperforming commercial models such as GPT-4o. This work opens new directions for advancing high-frame-rate video comprehension in multimodal LLM research.’

The new paper is titled Enhancing LLM Video Understanding with 16 Frames Per Second, and comes from eight authors across Tsinghua University and ByteDance.

Methodology

Since consecutive frames often contain redundant information, F-16 applies a high-frame-rate aligner to compress and encode key motion details while retaining visual semantics. Each frame is first processed by a pretrained image encoder, extracting feature representations before being passed to an aligner based on Gaussian Error Linear Units (GELUs).

F-16’s architecture processes video at 16 FPS, capturing more frames than traditional low-frame-rate models, and its high-frame-rate aligner preserves visual semantics while efficiently encoding motion dynamics without adding extra visual tokens. Source: https://arxiv.org/pdf/2503.13956

To handle the increased frame count efficiently, F-16 groups frames into small processing windows, merging visual features using a three-layer Multi-Layer Perceptron (MLP), helping to retain only the most relevant motion details and reducing unnecessary duplication, while preserving the temporal flow of actions. A spatial max-pooling layer further compresses the token count, keeping computational costs within bounds.

The processed video tokens are then fed into the Qwen2-7B LLM, which generates textual responses based on the extracted visual features and a given user prompt.

By structuring video input in this way, F-16 enables, the authors assert, more precise event recognition in dynamic scenes, while still maintaining efficiency.
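
To make the data flow concrete, here is a minimal PyTorch sketch of how such an aligner could be wired up, based purely on the description above: per-frame features from a frozen image encoder are grouped into short windows, fused by a three-layer GELU MLP, and spatially max-pooled before being handed to the LLM. The class name, window size, and feature dimensions (1152 for SigLIP-style features, 3584 for Qwen2-7B's hidden size) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class HighFrameRateAligner(nn.Module):
    """Illustrative sketch: group per-frame features into short windows,
    fuse each window with a three-layer GELU MLP, then spatially max-pool
    to keep the visual token count under control."""

    def __init__(self, feat_dim=1152, llm_dim=3584, window=4, pool_kernel=2):
        super().__init__()
        self.window = window
        # Three-layer MLP with GELU activations fuses the frames in a window
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * window, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Spatial max-pooling further compresses the token count
        self.pool = nn.MaxPool2d(pool_kernel)

    def forward(self, frame_feats):
        # frame_feats: [T, H, W, C] features from a frozen image encoder
        # (SigLIP in the paper); T is assumed to be a multiple of `window`
        t, h, w, c = frame_feats.shape
        n = t // self.window
        x = frame_feats.view(n, self.window, h, w, c)
        x = x.permute(0, 2, 3, 1, 4).reshape(n, h, w, self.window * c)
        x = self.mlp(x)                      # fuse each window of frames
        x = x.permute(0, 3, 1, 2)            # [N, C, H, W] for pooling
        x = self.pool(x)                     # spatial compression
        return x.flatten(2).transpose(1, 2)  # [N, tokens, C] for the LLM
```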

The Short Version

F-16 extends a pretrained image LLM, LLaVA-OneVision, to process video by reworking its visual input pipeline. While standard image LLMs handle isolated frames, F-16's high-frame-rate aligner reformats multiple frames into a form the model can process more efficiently; this avoids overwhelming the system with redundant information while preserving key motion cues essential for accurate video understanding.

To ensure compatibility with its image-based foundation, F-16 reuses pretrained parameters by restructuring its aligner into sub-matrices. This approach allows it to integrate knowledge from single-frame models while adapting to sequential video input.

The aligner first compresses frame sequences into a format optimized for the LLM, preserving the most informative features while discarding unnecessary details. The architectural design enables the system to process high-frame-rate video while keeping computational demands under control, which the authors posit as evidence that scaling isn't the only (or the best) way forward for video captioning.
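
The article doesn't spell out the exact initialization, but one plausible reading of 'restructuring the aligner into sub-matrices' is that the pretrained single-frame projector weight is tiled across each frame slot of the multi-frame aligner, scaled so that the initial output stays close to the single-frame behavior. The following is a hedged sketch of that idea only, not the paper's confirmed method:

```python
import torch

def init_from_single_frame(single_w, single_b, window=4):
    """Hedged sketch: build the first multi-frame aligner layer from a
    pretrained single-frame projector by tiling its weight into `window`
    sub-matrices. Dividing by `window` makes the initial output equal to
    the mean of the single-frame projections, so behavior starts close
    to the image model's. (Illustrative assumption, not the paper's code.)"""
    # single_w: [llm_dim, feat_dim], single_b: [llm_dim]
    multi_w = torch.cat([single_w / window] * window, dim=1)  # [llm_dim, window*feat_dim]
    multi_b = single_b.clone()
    return multi_w, multi_b
```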

Varying the Pace

Since processing video at 16 FPS improves motion understanding but increases computational cost, particularly during inference, F-16 introduces a variable-frame-rate decoding method, allowing it to adjust the frame rate dynamically without retraining.

The single-frame and high frame rate aligners available to F-16.

This flexibility enables the model to operate efficiently at lower FPS when high precision isn't required, reducing computational overhead.

At test time, when a lower frame rate is selected, F-16 reuses previously trained aligner parameters by repeating input frames to match the expected dimensions. This ensures the model can still process video effectively without modifying its architecture.

Unlike naive downsampling (i.e., simply removing frames), which risks losing critical motion details, this method preserves the aligner's learned motion representations, maintaining accuracy even at reduced frame rates. For general video comprehension, a lower FPS setting can speed up inference without significant performance loss, while high-speed motion analysis can still leverage the full 16 FPS capability.
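
As a rough illustration of this frame-repetition trick, the hedged sketch below repeats sparsely sampled frame features so that the aligner still receives windows of the size it was trained on; the function name and FPS values are assumptions for clarity, not taken from the paper:

```python
import torch

def pad_to_aligner_window(frame_feats, window=4, native_fps=16, test_fps=8):
    """Illustrative sketch: when sampling below the native 16 FPS, repeat
    each frame so the trained aligner still sees full windows, rather than
    retraining or naively dropping frames."""
    repeat = max(1, native_fps // test_fps)              # e.g. 16/8 -> repeat each frame twice
    frame_feats = frame_feats.repeat_interleave(repeat, dim=0)
    # Trim so the sequence length is a multiple of the aligner window
    t = (frame_feats.shape[0] // window) * window
    return frame_feats[:t]
```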

Data and Tests

Built on Qwen2-7B, F-16 extends LLaVA-OneVision using SigLIP as an image encoder. With video frames sampled at 16 FPS, up to 1,760 frames can be obtained from each video. For longer video clips, frames were uniformly (i.e., more sparsely) sampled.
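
A simple way to picture this sampling policy: take frame indices at 16 FPS, and if a clip would exceed the 1,760-frame budget, fall back to uniform sampling across the whole video. The sketch below, with an assumed helper name, illustrates the idea:

```python
import numpy as np

def sample_frame_indices(total_frames, video_fps, target_fps=16, max_frames=1760):
    """Hedged sketch: sample at 16 FPS, but fall back to uniform (sparser)
    sampling once a video would exceed the 1,760-frame budget."""
    step = max(1, round(video_fps / target_fps))
    idx = np.arange(0, total_frames, step)
    if len(idx) > max_frames:
        # Longer clips: spread the budget uniformly across the whole video
        idx = np.linspace(0, total_frames - 1, max_frames).round().astype(int)
    return idx
```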

For training, F-16 used the same general video datasets as LLaVA-Video, including LLaVA-Video-178K, NExT-QA, ActivityNet-QA, and PerceptionTest.

F-16 was additionally fine-tuned on the high-speed sports datasets FineGym, Diving48, and SoccerNet. The authors also curated a collection of 276 NBA games played between November 13 and November 25, 2024, focusing on whether a shot was successful (a task requiring high-frame-rate processing).

The model was evaluated using the NSVA test set, with performance measured by F1 score.

Gymnastics and diving models were evaluated based on event recognition accuracy, while soccer and basketball models tracked passes and shot outcomes.

The model was trained for 1 epoch using 128 NVIDIA H100 GPUs (and at a standard-issue 80GB of VRAM per GPU, this entailed the use of 10.24 terabytes of GPU memory; even by current standards, this is the highest-specced GPU cluster I have personally come across in keeping up with computer vision research literature). A learning rate of 2×10⁻⁵ was used during training.

Additionally, the model was fine-tuned on the sports data using LoRA adapters on 64 GPUs for 5 epochs. Here, only the LLM was trained, leaving the image encoder frozen.
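
For readers wanting to approximate a comparable setup once the code is released, a hedged sketch using Hugging Face's peft library appears below; the rank, alpha, and target modules are illustrative assumptions, since the article doesn't specify them:

```python
from peft import LoraConfig, get_peft_model

def add_sports_lora(model):
    """Hedged sketch: attach LoRA adapters to the language model while the
    image encoder stays frozen. Hyperparameters below are assumptions, not
    values reported for F-16."""
    # Freeze all base weights, including the image encoder
    for p in model.parameters():
        p.requires_grad = False

    lora_cfg = LoraConfig(
        r=64, lora_alpha=128, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Qwen2-style attention projections
    )
    # Only the injected LoRA adapters remain trainable
    return get_peft_model(model, lora_cfg)
```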

Opposing frameworks tested in the initial round for 'general video understanding' were GPT-4o; Gemini-1.5-Pro; Qwen2-VL-7B; VideoLLaMA2-7B; VideoChat2-HD-7B; LLaVA-OV-7B; MiniCPM-V2.6-8B; LLaVA-Video-7B; and NVILA-7B.

The models were evaluated on Video-MME; VideoVista; TemporalBench; MotionBench; NeXT-QA; MLVU; and LongVideoBench.

Comparison of video QA results across models, showing FPS limits and performance on multiple benchmarks. F-16 achieves SOTA among 7B models on Video-MME, NQA, TPB, and MB, rivaling proprietary models such as GPT-4o and Gemini-1.5-Pro.

Of these results, the authors state:

‘On the Video-MME Short, Medium, and NeXT-QA datasets – each designed for short video understanding – our model surpasses the previous 7B SOTA model by 3.2%, 1.0%, and 0.9% in accuracy, highlighting its strong performance on short videos.

‘For benchmarks evaluating long video understanding, such as Video-MME Long, LongVideoBench, and MLVU, the challenge is greater due to sparser frame sampling, causing frames within the processing window to exhibit more significant variations.

‘This increases the challenge for the modality aligner to effectively encode temporal changes within the limited token representation. As a result, F-16 experiences a slight performance drop compared to [LLaVA-Video-7B], which is trained on the same video dataset.’

F-16's high-frame-rate processing, the authors continue, also resulted in a 13.5% improvement on TemporalBench and a 2.5% gain on MotionBench, compared to existing 7B models, and performed at a similar level to commercial models such as GPT-4o and Gemini-1.5-Pro.

High-Speed Sports Video Understanding

F-16 was tested on the FineGym, Diving48, SoccerNet, and NBA datasets to evaluate its ability to understand high-speed sports actions.

Using the 10,000 manually annotated NBA clips, the training focused on ball movement and player actions, and on whether the models could correctly determine if a shot was successful, using the NSVA test set evaluated with F1 score.

Results of high-speed sports video analysis. F-16 with the high-frame-rate aligner performed better than its low-frame-rate counterpart across all sports tasks. GPT-4o and Gemini-1.5-Pro were also evaluated on NBA and SoccerNet QA, where in-domain training knowledge was not required.

On FineGym, which measures gymnastics action recognition, F-16 performed 13.8% better than the previous 7B SOTA model, demonstrating improved fine-grained motion understanding.

Diving48 required identifying complex movement sequences such as takeoff, somersault, twist, and flight phases, and F-16 showed higher accuracy in recognizing these transitions.

For SoccerNet, the model analyzed 10-second clips, identifying ball passes, and the results showed an improvement over existing 7B models, indicating that higher FPS contributes to tracking small and rapid movements.

In the NBA dataset, F-16's ability to determine shot outcomes approached the accuracy of larger proprietary models such as GPT-4o and Gemini-1.5-Pro, further suggesting that higher frame rates enhance its ability to process dynamic motion.

Variable Frame-Rates

F-16 was tested at different frame rates to measure its adaptability. Instead of retraining, it handled lower FPS by repeating frames to match the aligner's input structure. This approach retained more performance than simply removing frames (which is prone to cause accuracy loss).

The results indicate that while reducing FPS had some impact on motion recognition, F-16 still outperformed low-frame-rate models and maintained strong results even below 16 FPS.

Left, the time consumption of different F-16 modules during inference, measured on 300 videos from the Video-MME Long set at varying test FPS and sequence lengths. Right, a comparison between Video-MME performance for models trained and tested at different FPS. The solid line represents models trained and tested at the same FPS, while the dashed line shows performance when a model trained at 16 FPS is tested at a lower frame rate.

F-16's high-frame-rate processing increased computational requirements, though its aligner helped manage these costs by compressing redundant visual tokens.

The model required more FLOPs per video than lower-FPS models, but also achieved better accuracy per token, suggesting that its frame selection and token compression strategies helped offset the added computation.

Conclusion

It's difficult to overstate either the importance or the challenges of this particular strand of research – especially this year, which is set to be the breakthrough year for generative video, throwing the shortcomings of video dataset curation and captioning quality into sharp relief.

It should also be emphasized that the challenges involved in getting accurate descriptions of internal video details cannot be solved exclusively by throwing VRAM, time, or disk space at the issue. The method by which events are isolated/extracted from otherwise long and tedious tracts of video (as with golf or snooker clips, for instance) will benefit from a rethink of the semantic approaches and mechanisms currently dominating SOTA solutions – because some of those limitations were established in more resource-impoverished times.

(Incidentally, even if 16fps seems like a very low frame rate for 2025, it's interesting to note that this is also the native training speed of video clips used in the hugely popular Wan 2.1 generative video model, and the speed at which it subsequently operates with fewest issues. Hopefully the research scene will keep an eye on possible 'standards entropy' here; sometimes obsolete constraints can perpetuate future standards.)

 

First published Wednesday, March 19, 2025