The long orbit to benchmarking long video understanding

Pipeline

Long video datasets are difficult to construct due to the extensive manual effort required to select, watch, understand and annotate long videos with free-form natural language. Answering challenging questions about longer videos is often a multimodal task that may involve listening to the audio track in addition to watching the video. It can also be a non-linear task, because it is sometimes necessary to rewind and rewatch key parts to answer a question. Proposing suitable high-level questions that are not trivially solved by observing only a few frames is also difficult for people to do consistently and with sufficient variety.

To solve this problem we propose a semi-automatic pipeline that first generates candidate multiple-choice questions using a number of strong vision-language models (VLMs) and large language models (LLMs) with carefully designed prompts, and then lets human annotators filter and correct the proposed questions to reduce errors and bias. To reduce human effort, we leverage automatic tools to (1) find suitable videos, (2) extract useful signals, and then (3) automatically generate video-level captions, questions and answers.
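The overall flow can be sketched as a few composable stages. This is a minimal illustration only: the stage functions, field names and filter thresholds below are assumptions for the sake of the example, not the actual implementation.

```python
from dataclasses import dataclass, field

# Hypothetical skeleton of the semi-automatic pipeline: automated stages
# produce candidates, and a human review step filters them at the end.

@dataclass
class Candidate:
    question: str
    answer: str
    decoys: list = field(default_factory=list)

def find_videos(corpus):
    """Stage 1 (assumed criteria): keep long, non-static videos."""
    return [v for v in corpus if v["duration_s"] >= 120 and not v["is_static"]]

def extract_signals(video):
    """Stage 2 (assumed fields): gather ASR text and frame captions."""
    return {"asr": video.get("asr", ""), "frames": video.get("frame_captions", [])}

def generate_candidates(signals):
    """Stage 3: an LLM would propose QA pairs from the captions.

    Placeholder: emits one trivial candidate instead of calling a model.
    """
    return [Candidate(question="What happens overall?", answer=signals["asr"][:40])]

def human_review(candidates, keep):
    """Final stage: raters filter or correct; `keep` stands in for a rater."""
    return [c for c in candidates if keep(c)]
```

In a real run, `generate_candidates` would be replaced by prompted VLM/LLM calls, and `human_review` by an annotation tool; the point is only that every automated stage feeds a human verification step.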

Our pipeline begins with the selection of video content. We filter videos to increase visual and demographic diversity. We also remove videos with mostly static content, as well as gaming videos and animated content. In the next stage, we extract two kinds of captions from the resulting videos: automatic speech recognition (ASR) captions and frame captions. For the latter, we prompt a VLM to describe video frames sampled at one frame per second. The next step summarizes these captions by segmenting the video into shots, grouping them by topic, and prompting an LLM to summarize ASR and frame-level captions into shot-level captions.
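The shot-grouping step could look something like the sketch below. The source does not specify the segmentation method, so the word-overlap heuristic and the prompt wording here are purely illustrative assumptions.

```python
# Hedged sketch: per-second frame captions are segmented into shots with a
# simple Jaccard word-overlap heuristic, then a summarization prompt is
# assembled for an LLM. Threshold and prompt text are assumptions.

def _overlap(a, b):
    """Jaccard word overlap between two captions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def segment_into_shots(frame_captions, threshold=0.2):
    """Start a new shot when consecutive captions share few words."""
    shots, current = [], []
    for cap in frame_captions:
        if current and _overlap(current[-1], cap) < threshold:
            shots.append(current)
            current = []
        current.append(cap)
    if current:
        shots.append(current)
    return shots

def build_summary_prompt(shot_captions, asr_text):
    """Assemble an (assumed) prompt asking an LLM for a shot-level caption."""
    frames = "\n".join(f"- {c}" for c in shot_captions)
    return (
        "Summarize the following frame captions and ASR transcript "
        f"into one shot-level caption.\nFrames:\n{frames}\nASR: {asr_text}"
    )
```

A production system would more likely use visual shot-boundary detection than caption overlap, but the structure (segment, group, summarize with an LLM) is the same.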

Given these captions, the pipeline generates multiple-choice questions in two stages. In the first stage, we prompt an LLM to generate a set of challenging questions and answers, providing it with the video captions as context. In the second stage, we prompt the LLM with a generated question-answer pair and the video captions, and ask it to generate four decoy answers. Decoys must be incorrect but plausible answers to the question. The final stage of the pipeline is human verification, where we ask human raters to filter or correct incorrect questions, answers and decoys.
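The two-stage structure (QA pairs first, then decoys for each pair) can be sketched as below. The prompt wording and the validity check are assumptions for illustration; only the two-stage design and the four-decoy count come from the text.

```python
# Illustrative sketch of the two-stage question generation. These are not
# the authors' actual prompts; the LLM call itself is left abstract.

def qa_prompt(video_captions):
    """Stage 1: ask an LLM for challenging QA pairs given the captions."""
    return (
        "Given these video captions, write challenging question-answer "
        "pairs that require watching most of the video:\n" + video_captions
    )

def decoy_prompt(question, answer, video_captions, n_decoys=4):
    """Stage 2: ask for plausible-but-wrong alternatives to one QA pair."""
    return (
        f"Question: {question}\nCorrect answer: {answer}\n"
        f"Captions:\n{video_captions}\n"
        f"Write {n_decoys} incorrect but plausible decoy answers."
    )

def valid_item(answer, decoys, n_decoys=4):
    """Assumed sanity check before human review: right decoy count,
    and no decoy duplicates the correct answer."""
    return len(decoys) == n_decoys and all(
        d.strip().lower() != answer.strip().lower() for d in decoys
    )
```

Automated checks like `valid_item` can catch degenerate outputs cheaply, leaving the harder judgment calls (is a decoy truly plausible? is the answer correct?) to the human raters in the final stage.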