Table Extraction from PDFs using Multimodal (Vision) LLMs

A couple of weeks ago, a colleague and I participated in an internal hackathon where the task was to come up with an interesting use case using the recent multi-modal Large Language Models (LLMs). Multi-modal LLMs take not only text inputs via their prompt like earlier LLMs, but can also accept non-text modalities such as images and audio. Some examples of multi-modal LLMs are GPT-4o from OpenAI, Gemini 1.5 from Google, and Claude-3.5 Sonnet from Anthropic. The hackathon provided access to GPT-4o through Azure, Microsoft's cloud computing platform. We didn't win; there were other entries that were better than ours, both in terms of the originality of their ideas and the quality of their implementations. However, we learned some cool new things during the hackathon, and figured that these might be of general interest to others as well, hence this post.

Our idea was to use GPT-4o to extract and codify tables found in academic papers as semi-structured data (i.e. JSON). We could then either query the JSON data to search within tables, or convert it to Markdown for downstream LLMs to query easily via their text interface. We had originally intended to extend the idea to figures and charts, but we couldn't get that pipeline working end to end.

Here is what our pipeline looked like.

  1. Academic papers are usually available as PDFs. We use the PyMuPDF library to split the PDF file into a set of image files, where each image file corresponds to a page in the paper.
  2. We then send each page image through the Table Transformer, which returns bounding box information for each table it detects on the page, as well as a confidence score. The Table Transformer model we used was microsoft/table-transformer-detection.
  3. We crop each table out of the pages using the bounding box information, and then send each table to GPT-4o as part of a prompt asking it to convert the table to a JSON structure. GPT-4o responds with a JSON structure representing the table. (A rough sketch of these three steps follows this list.)
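
For concreteness, here is a minimal sketch of steps 1 through 3, assuming PyMuPDF for rendering pages and the Hugging Face transformers implementation of the Table Transformer for detection. The file names, DPI, and confidence threshold are placeholders rather than the exact values we used.

import fitz  # PyMuPDF
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# Step 1: render each PDF page to an image file
doc = fitz.open("paper.pdf")
page_files = []
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=150)
    fname = f"page-{i:03d}.png"
    pix.save(fname)
    page_files.append(fname)

# Step 2: detect tables on each page with the pre-trained Table Transformer
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

table_files = []
for fname in page_files:
    image = Image.open(fname).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Convert raw model outputs into (score, label, box) triples in pixel coordinates
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=0.9, target_sizes=target_sizes)[0]

    # Step 3 (first half): crop out each detected table and save it
    for j, box in enumerate(results["boxes"]):
        left, top, right, bottom = [round(c) for c in box.tolist()]
        crop = image.crop((left, top, right, bottom))
        crop_fname = fname.replace(".png", f"-table-{j}.png")
        crop.save(crop_fname)
        table_files.append(crop_fname)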

This pipeline was based on my colleague's idea. I like how it progressively simplifies the task: it splits each page of the incoming PDF into its own image, then uses a pre-trained Table Transformer to crop the tables out of those pages, and only then passes each table to GPT-4o to convert to JSON. The table image is passed into the prompt as a "data URL", which is just the base-64 encoding of the image formatted as "data:{mime_type};base64,{base64_encoded_data}". The Table Transformer, while not perfect, proved remarkably successful at identifying tables in the text. I say remarkable because we used a pre-trained model, but perhaps it isn't that remarkable when you consider that it was probably trained on tables in academic papers as well.
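
Building that data URL takes just a few lines of Python. Here is a minimal sketch; the helper name to_data_url is my own, not something from our hackathon code.

import base64
import mimetypes

def to_data_url(image_path: str) -> str:
    # Guess the MIME type from the file extension, e.g. "image/png"
    mime_type, _ = mimetypes.guess_type(image_path)
    # Base-64 encode the raw image bytes
    with open(image_path, "rb") as f:
        base64_encoded_data = base64.b64encode(f.read()).decode("utf-8")
    # Format as a data URL that can be embedded directly in the prompt
    return f"data:{mime_type};base64,{base64_encoded_data}"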

Our prompt for GPT-4o looked something like this:

System: You are an AI model that specializes in detecting tables and extracting and interpreting table content from images. Follow the instructions below step by step:
1. Recognize whether the given image is a table or not. If it is not a table, print "None". If it is a table, go to the next step.
2. Accurately convert the table's content into a structured JSON format.

General instructions:
1. Do not output anything extra.
2. A table must contain rows and columns.

User: Given the image, detect whether it is a table or not. If it is a table, convert it to JSON format.
{image_data_url}
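
Wiring this prompt up to GPT-4o on Azure looks roughly like the sketch below, assuming the openai Python client's AzureOpenAI class; the deployment name, API version, environment variable names, and file name are placeholders.

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder env vars
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

system_prompt = "You are an AI model that specializes in detecting tables ..."  # full system prompt from above
user_prompt = "Given the image, detect whether it is a table or not. If it is a table, convert it to JSON format."
image_data_url = to_data_url("page-003-table-0.png")  # helper sketched earlier

response = client.chat.completions.create(
    model="gpt-4o",  # the Azure deployment name (placeholder)
    temperature=0.0,
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                # the cropped table image goes in as a data URL
                {"type": "image_url", "image_url": {"url": image_data_url}},
            ],
        },
    ],
)
table_json = response.choices[0].message.content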

For the figure pipeline, I tried to use an OWL-ViT (Vision Transformer for Open World Localization) model in place of the Table Transformer, but it was not as successful at detecting figures in the text, probably because it appears to be trained to detect objects in natural images. Unfortunately, we couldn't find a pre-trained model that would work for this particular case. Another issue was converting the figure into a semi-structured JSON representation; we ended up asking GPT-4o to describe the image as text.
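
For reference, the OWL-ViT attempt can be driven through the transformers zero-shot object detection pipeline roughly as shown below; the checkpoint and candidate labels are illustrative, not necessarily the exact ones we tried.

from PIL import Image
from transformers import pipeline

# Zero-shot object detection: OWL-ViT scores the image against free-text labels
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image = Image.open("page-005.png").convert("RGB")
detections = detector(image, candidate_labels=["figure", "chart", "diagram"])

for d in detections:
    # Each detection has a label, a confidence score, and a pixel bounding box
    print(d["label"], round(d["score"], 3), d["box"])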

One suggestion from some of my TWIML (non-work) colleagues was to ask GPT-4o to return bounding boxes for the figures it finds in a page, and then use those to extract the figures and send them back to GPT-4o for description. Unfortunately that did not work, but it was definitely worth trying. As LLMs get more and more capable, I think it makes sense to rethink our pipelines to delegate more and more work to the LLM, or at least to verify that it can't do something before moving on to older (and harder to implement) solutions.
