Florence-2: Mastering Multiple Vision Tasks with a Single VLM Model | by Lihi Gur Arie, PhD | Oct, 2024

Loading the Florence-2 Model and a Sample Image

After installing and importing the required libraries (as demonstrated in the accompanying Colab notebook), we begin by loading the Florence-2 model, its processor, and the input image of a camera:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load model:
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype='auto').eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image:
image = Image.open(img_path)

Auxiliary Functions

In this tutorial, we will use several auxiliary functions. The most important is the run_example core function, which generates a response from the Florence-2 model.

The run_example function combines the task prompt with any additional text input (if provided) into a single prompt. Using the processor, it generates text and image embeddings that serve as inputs to the model. The magic happens during the model.generate step, where the model's response is produced. Here's a breakdown of some key parameters:

  • max_new_tokens=1024: Sets the maximum length of the output, allowing for detailed responses.
  • do_sample=False: Ensures a deterministic response.
  • num_beams=3: Applies beam search, keeping the 3 most likely tokens at each step and exploring multiple candidate sequences to find the best overall output.
  • early_stopping=False: Ensures beam search continues until all beams reach the maximum length or an end-of-sequence token is generated.

Finally, the model's output is decoded and post-processed with processor.batch_decode and processor.post_process_generation to produce the final text response, which is returned by the run_example function.

def run_example(image, task_prompt, text_input=''):

    prompt = task_prompt + text_input

    inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda', torch.float16)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"].cuda(),
        pixel_values=inputs["pixel_values"].cuda(),
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
        early_stopping=False,
    )

    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )

    return parsed_answer
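As a quick usage sketch (assuming the model, processor and image loaded above), note that the parsed answer is a dictionary keyed by the task prompt:

results = run_example(image, task_prompt='<CAPTION>')
print(results['<CAPTION>'])  # e.g. a one-sentence caption of the image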

Additionally, we utilize auxiliary functions to visualize the results (draw_bbox, draw_ocr_bboxes and draw_polygons) and to handle the conversion between bounding box formats (convert_bbox_to_florence2 and convert_florence2_to_bbox). These can be explored in the attached Colab notebook.
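For reference, here is a minimal sketch of what the box-format conversion can look like. It assumes Florence-2 quantizes coordinates into 1,000 location bins per axis (the '<loc_0>' to '<loc_999>' tokens used below); the notebook's actual helpers may differ in the details:

import re

def convert_bbox_to_florence2(bbox, width, height):
    # Quantize a pixel-space (x1, y1, x2, y2) box into Florence-2 '<loc_..>' tokens.
    # Assumes 1,000 location bins per axis (values 0-999).
    x1, y1, x2, y2 = bbox
    bins = [int(x1 / width * 999), int(y1 / height * 999),
            int(x2 / width * 999), int(y2 / height * 999)]
    return ''.join(f'<loc_{b}>' for b in bins)

def convert_florence2_to_bbox(loc_str, width, height):
    # Map '<loc_..>' tokens back to a pixel-space (x1, y1, x2, y2) box.
    x1, y1, x2, y2 = (int(v) for v in re.findall(r'<loc_(\d+)>', loc_str))
    return (x1 / 999 * width, y1 / 999 * height,
            x2 / 999 * width, y2 / 999 * height)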

Florence-2 can perform a variety of visual tasks. Let's explore some of its capabilities, starting with image captioning.

1. Caption Generation Related Tasks:

1.1 Generate Captions

Florence-2 can generate image captions at various levels of detail, using the '<CAPTION>', '<DETAILED_CAPTION>' or '<MORE_DETAILED_CAPTION>' task prompts.

print(run_example(image, task_prompt='<CAPTION>'))
# Output: 'A black camera sitting on top of a wooden table.'

print(run_example(image, task_prompt='<DETAILED_CAPTION>'))
# Output: 'The image shows a black Kodak V35 35mm film camera sitting on top of a wooden table with a blurred background.'

print(run_example(image, task_prompt='<MORE_DETAILED_CAPTION>'))
# Output: 'The image is a close-up of a Kodak VR35 digital camera. The camera is black in color and has the Kodak logo on the top left corner. The body of the camera is made of wood and has a textured grip for easy handling. The lens is in the center of the body and is surrounded by a gold-colored ring. On the top right corner, there is a small LCD screen and a flash. The background is blurred, but it appears to be a wooded area with trees and greenery.'

The model accurately describes the image and its surroundings. It even identifies the camera's brand and model, demonstrating its OCR ability. However, in the '<MORE_DETAILED_CAPTION>' task there are minor inconsistencies, which is expected from a zero-shot model.

1.2 Generate Caption for a Given Bounding Box

Florence-2 can generate captions for specific regions of an image defined by bounding boxes. For this, it takes the bounding box location as input. You can extract the category with '<REGION_TO_CATEGORY>' or a description with '<REGION_TO_DESCRIPTION>'.

For your convenience, I added a widget to the Colab notebook that lets you draw a bounding box on the image, along with code to convert it to Florence-2 format.

task_prompt = '<REGION_TO_CATEGORY>'
box_str = '<loc_335><loc_412><loc_653><loc_832>'
results = run_example(image, task_prompt, text_input=box_str)
# Output: 'camera lens'

task_prompt = '<REGION_TO_DESCRIPTION>'
box_str = '<loc_335><loc_412><loc_653><loc_832>'
results = run_example(image, task_prompt, text_input=box_str)
# Output: 'camera'

In this case, '<REGION_TO_CATEGORY>' identified the lens, while '<REGION_TO_DESCRIPTION>' was less specific. However, this performance may vary with different images.

2. Object Detection Related Tasks:

2.1 Generate Bounding Boxes and Text for Objects

Florence-2 can identify densely packed regions in the image and provide their bounding box coordinates along with related labels or captions. To extract bounding boxes with labels, use the '<OD>' task prompt:

results = run_example(image, task_prompt='<OD>')
draw_bbox(image, results['<OD>'])

To extract bounding boxes with captions, use the '<DENSE_REGION_CAPTION>' task prompt:

results = run_example(image, task_prompt='<DENSE_REGION_CAPTION>')
draw_bbox(image, results['<DENSE_REGION_CAPTION>'])

The image on the left shows the results of the '<OD>' task prompt, while the image on the right demonstrates '<DENSE_REGION_CAPTION>'.
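For readers who skip the notebook, a minimal draw_bbox sketch with matplotlib could look like the one below. It assumes the parsed output is a dict with pixel-space 'bboxes' and matching 'labels', which is the format post_process_generation returns for detection-style tasks:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

def draw_bbox(image, detection):
    # Plot each box with its label; `detection` is a dict like
    # {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['...', ...]}.
    fig, ax = plt.subplots()
    ax.imshow(image)
    for (x1, y1, x2, y2), label in zip(detection['bboxes'], detection['labels']):
        ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                       fill=False, edgecolor='red', linewidth=2))
        ax.text(x1, y1, label, color='white', backgroundcolor='red', fontsize=8)
    ax.axis('off')
    plt.show()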

2.2 Text-Grounded Object Detection

Florence-2 can also perform text-grounded object detection. By providing specific object names or descriptions as input, Florence-2 detects bounding boxes around the specified objects.

task_prompt = '<CAPTION_TO_PHRASE_GROUNDING>'
results = run_example(image, task_prompt, text_input="lens. camera. table. logo. flash.")
draw_bbox(image, results['<CAPTION_TO_PHRASE_GROUNDING>'])

CAPTION_TO_PHRASE_GROUNDING task with the text input: "lens. camera. table. logo. flash."

3. Segmentation Related Tasks:

Florence-2 can also generate segmentation polygons grounded by text ('<REFERRING_EXPRESSION_SEGMENTATION>') or by bounding boxes ('<REGION_TO_SEGMENTATION>'):

results = run_example(image, task_prompt='<REFERRING_EXPRESSION_SEGMENTATION>', text_input="camera")
draw_polygons(image, results['<REFERRING_EXPRESSION_SEGMENTATION>'])

results = run_example(image, task_prompt='<REGION_TO_SEGMENTATION>', text_input="<loc_345><loc_417><loc_648><loc_845>")
draw_polygons(image, results['<REGION_TO_SEGMENTATION>'])

The image on the left shows the results of the '<REFERRING_EXPRESSION_SEGMENTATION>' task with the text 'camera' as input. The image on the right demonstrates the '<REGION_TO_SEGMENTATION>' task with a bounding box around the lens provided as input.
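Similarly, here is a minimal draw_polygons sketch, assuming the segmentation output is a dict of the form {'polygons': [...], 'labels': [...]}, where each instance holds one or more flat [x1, y1, x2, y2, ...] coordinate lists in pixel space:

from PIL import ImageDraw

def draw_polygons(image, segmentation):
    # Overlay each predicted polygon on a copy of the image and return it.
    img = image.copy()
    draw = ImageDraw.Draw(img)
    for instance in segmentation['polygons']:
        for coords in instance:
            points = list(zip(coords[0::2], coords[1::2]))  # flat list -> (x, y) pairs
            if len(points) >= 3:  # a polygon needs at least three vertices
                draw.polygon(points, outline='red')
    return img  # display with img.show(), or inline in Colab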

4. OCR Related Tasks:

Florence-2 demonstrates strong OCR capabilities. It can extract text from an image with the '<OCR>' task prompt, and extract both text and its location with '<OCR_WITH_REGION>':

task_prompt = '<OCR_WITH_REGION>'
results = run_example(image, task_prompt)
draw_ocr_bboxes(image, results['<OCR_WITH_REGION>'])
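For plain text extraction without locations, the '<OCR>' prompt is used the same way. A short usage sketch:

results = run_example(image, task_prompt='<OCR>')
print(results['<OCR>'])  # the transcribed text as a single string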

Florence-2 is a versatile Vision-Language Model (VLM), capable of handling multiple vision tasks within a single model. Its zero-shot capabilities are impressive across diverse tasks such as image captioning, object detection, segmentation and OCR. While Florence-2 performs well out-of-the-box, additional fine-tuning can further adapt the model to new tasks or improve its performance on unique, custom datasets.