The code to implement this complete workflow is available on GitHub.
Let's go through these steps one by one.
1. Text Extraction
The documents used in this example include the AI advisory feedback that we provide to companies after an advisory session. These companies include startups and established companies that want to integrate AI into their business or want to advance their existing AI solutions. The feedback document is a semi-structured document whose format is shown below. The names and other information in this document have been changed due to privacy constraints.
The AI experts provide their assessment for each field. However, with hundreds of such documents, extracting insights from the data becomes a challenging task. To gain insights into this data, it needs to be converted into a concise, structured format that can be analyzed using existing statistical or machine learning methods. Performing this conversion manually is not only labor-intensive and time-consuming but also prone to errors.
In addition to the readily visible information in the document, such as the company name, consultation date, and expert(s) involved, I aimed to extract specific details for further analysis. These included the major industry or domain each company operates in, a concise description of the current solutions offered, the AI topic(s), company type, AI maturity level, aim, and a brief summary of the recommendations. This extraction needed to be performed on the detailed text associated with each field. Additionally, the feedback template has evolved over time, which has resulted in documents with inconsistent formats.
Before we discuss the text extraction from the documents, please note that the following libraries need to be installed to run the complete code used in this article.
# Install the required libraries
!pip install tqdm  # For displaying a progress bar for document processing
!pip install requests  # For making HTTP requests
!pip install pandas  # For data manipulation and analysis
!pip install python-docx  # For processing Word documents
!pip install plotly  # For creating interactive visualizations
!pip install numpy  # For numerical computations
!pip install scikit-learn  # For machine learning algorithms and tools
!pip install matplotlib  # For creating static, animated, and interactive plots
!pip install openai  # For interacting with the OpenAI API
!pip install seaborn  # For statistical data visualization
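The snippets below also rely on a handful of imports. The following block is a minimal set inferred from the functions in this article; adjust it to your own setup.
# Standard library modules used by the snippets below
import os
import json

# Third-party libraries installed above
import docx                # python-docx, for reading .docx files
import requests            # for calling the OpenAI API over HTTP
import pandas as pd        # for tabular data handling
from tqdm import tqdm      # for progress bars during document processing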
The following code extracts text from a document (.docx format) using the python-docx library. It is important to extract text from all the formats, including paragraphs, tables, headers, and footers.
def extract_text_from_docx(docx_path: str):
    """
    Extract text content from a Word (.docx) file.
    """
    doc = docx.Document(docx_path)
    full_text = []

    # Extract text from paragraphs
    for para in doc.paragraphs:
        full_text.append(para.text)

    # Extract text from tables
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                full_text.append(cell.text)

    # Extract text from headers and footers
    for section in doc.sections:
        header = section.header
        footer = section.footer
        for para in header.paragraphs:
            full_text.append(para.text)
        for para in footer.paragraphs:
            full_text.append(para.text)

    return '\n'.join(full_text).strip()
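As a quick sanity check, the function can be called on a single document. The file path below is a placeholder, not a file from the actual dataset.
# Hypothetical file path for illustration
text = extract_text_from_docx("advisory_docs/company_feedback.docx")
print(text[:500])  # preview the first 500 characters of the extracted text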
2. Set LLM Prompts
We need to instruct the LLM on how to extract the required information from the documents. Also, we need to explain the meaning of each field of interest to be extracted so that it can extract the semantically matching information from the documents. This is particularly important because a required field comprising a few words can be interpreted in several ways. For instance, we need to explain what we mean by "aim", which mainly refers to the company's plans for AI integration or how it wants to advance its current solution. Therefore, crafting the right prompt for this purpose is crucial.
I set the instructions in the system prompt to guide the LLM's behavior. The input prompt comprises the data to be processed by the LLM. The system prompt is shown below.
# System prompt with extraction instructions
system_message = """
You are an expert in analyzing and extracting information from the feedback forms written by AI experts after AI advisory sessions with companies.
Please carefully read the provided feedback form and extract the following 15 pieces of key information. Make sure that the key names are exactly the same as
given below. Do not create any additional key names other than these 15.
Key names and their descriptions:
1. Company name: name of the company seeking AI advisory
2. Country: Company's country [output 'N/A' if not available]
3. Consultation Date [output 'N/A' if not available]
4. Experts: persons providing AI consultancy [output 'N/A' if not available]
5. Consultation type: Regular or pop-up [output 'N/A' if not available]
6. Area/domain: Field of the company's operations. Some examples: healthcare, industrial manufacturing, business development, education, etc.
7. Current Solution: description of the current solution offered by the company. The company could currently be in the ideation phase. Some examples of the 'Current Solution' field include i) Recommendation system for cars, houses, and other items, ii) Professional guidance system, iii) AI-based matchmaking service for educational peer-to-peer support. [Be very specific and concise]
8. AI field: AI's sub-field in use or required. Some examples: image processing, large language models, computer vision, natural language processing, predictive modeling, speech recognition, etc. [This field is not explicitly available in the document. Extract it by the semantic understanding of the overall document.]
9. AI maturity level: low, moderate, high [output 'N/A' if not available].
10. Company type: 'startup' or 'established company'
11. Aim: The AI tasks the company is seeking. Some examples: i) to enhance AI-driven systems for diagnosing heart diseases, ii) to automate identification of key variable combinations in customer surveys, iii) to develop an AI-based system for automated quotation generation from engineering drawings, iv) to build and manage enterprise-grade LLM applications. [Be very specific and concise]
12. Identified target market: The targeted customers. Some examples: healthcare professionals, construction companies, hospitality, educational institutions, etc.
13. Data Requirement Assessment: The type of data required for the intended AI integration. Some examples: transcripts of therapy sessions, patient data, textual data, image data, videos, etc.
14. FAIR Services Sought: The services expected from FAIR. For instance, technical advice, proof of concept.
15. Recommendations: A brief summary of the recommendations in the form of keywords or a phrase list. Some examples: i) Focus on data balance, monitor for bias, prioritize transparency, ii) Explore machine learning algorithms, implement decision trees, gradient boosting. [Be very specific and concise]
Guidelines:
- Very important: do not make up anything. If the information for a required field is not available, output 'N/A' for it.
- Output in JSON format. The JSON should contain the above 15 keys.
"""
It is important to emphasize what the LLM should focus on: for instance, the number of key elements to be extracted, using exactly the same field names as specified, and not inventing any information that is not available. An explanation of each field and some examples of the required information (where possible) are also important. It is worth mentioning that an optimal prompt may not be crafted on the first attempt.
3. Process Documents
Processing the documents refers to sending the data to an LLM for parsing. I used OpenAI's gpt-4o-mini model for document parsing, which is an affordable and intelligent small model for fast, lightweight tasks. GPT-4o mini is cheaper and more capable than GPT-3.5 Turbo. However, the lightweight versions of open LLMs such as Llama, Mistral, or Phi-3 can also be tested for this purpose.
The following code walks through a directory and its sub-directories to find the AI advisory documents (.docx format), extract text from each document, and send the document to gpt-4o-mini via an API call.
def process_files(directory_path: str, api_key: str, system_message: str):
    """
    Process all .docx files in the given directory and its subdirectories,
    send their content to the LLM, and store the JSON responses.
    """
    json_outputs = []
    docx_files = []

    # Walk through the directory and its subdirectories to find .docx files
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            if file.endswith(".docx"):
                docx_files.append(os.path.join(root, file))

    if not docx_files:
        print("No .docx files found in the specified directory or sub-directories.")
        return json_outputs

    # Iterate through all .docx files in the directory with a progress bar
    for file_path in tqdm(docx_files, desc="Processing files...", unit="file"):
        filename = os.path.basename(file_path)
        extracted_text = extract_text_from_docx(file_path)

        # Prepare the user message with the extracted text
        input_message = extracted_text

        # Prepare the API request payload
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": input_message}
            ],
            "max_tokens": 2000,
            "temperature": 0.2
        }

        # Send the request to the LLM API
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

        # Extract the JSON response
        json_response = response.json()
        content = json_response['choices'][0]['message']['content'].strip("```json\n").strip("```")
        parsed_json = json.loads(content)

        # Normalize the parsed JSON output
        normalized_json = normalize_json_output(parsed_json)

        # Append the normalized JSON output to the list
        json_outputs.append(normalized_json)

    return json_outputs
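A typical invocation might look like the sketch below. The directory path is a placeholder, and the API key is read from an environment variable rather than being hard-coded.
# Hypothetical usage; set OPENAI_API_KEY in your environment first
api_key = os.environ["OPENAI_API_KEY"]
json_outputs = process_files("advisory_docs", api_key, system_message)
print(f"Processed {len(json_outputs)} documents")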
In the call's payload, I set the maximum number of tokens (max_tokens) to 2000 to accommodate the input/output tokens. I set a relatively low temperature (0.2) so that the LLM does not exhibit high creativity, which is not required for this task. A high temperature may lead to hallucinations, where the LLM may invent new information.
The LLM's response is obtained as a JSON object and is further parsed and normalized, as discussed in the next section.
4. Parse LLM Output
As shown in the above code, the response from the API is obtained as a JSON object (parsed_json), which is further normalized using the following function.
def normalize_json_output(json_output):
    """
    Normalize the keys and convert list values to comma-separated strings.
    """
    normalized_output = {}
    for key, value in json_output.items():
        normalized_key = key.lower().replace(" ", "_")
        if isinstance(value, list):
            normalized_output[normalized_key] = ', '.join(value)
        else:
            normalized_output[normalized_key] = value
    return normalized_output
This function standardizes the keys of the JSON object by converting them to lowercase and replacing spaces with underscores. Additionally, it converts any list values into comma-separated strings to make the data easier to work with and analyze.
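For example, a raw LLM output with mixed-case keys and a list value would be normalized as follows (the values are illustrative):
raw = {"Company name": "ExampleCo", "Experts": ["Jane Doe", "John Smith"]}
print(normalize_json_output(raw))
# {'company_name': 'ExampleCo', 'experts': 'Jane Doe, John Smith'}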
The normalized JSON objects (json_outputs), containing the extracted key information from all the documents, are finally saved to an Excel file.
def save_json_to_excel(json_outputs, output_file_path: str):
    """
    Save the list of JSON objects to an Excel file with a SNO. column.
    """
    # Convert the list of JSON objects to a DataFrame
    df = pd.DataFrame(json_outputs)

    # Add a serial number (SNO.) column
    df.insert(0, 'SNO.', range(1, len(df) + 1))

    # Ensure all columns are consistent and save the DataFrame to an Excel file
    df.to_excel(output_file_path, index=False)
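With the records from process_files in hand, saving them is a one-liner; the output filename below is illustrative. Note that pandas needs an Excel engine such as openpyxl installed to write .xlsx files.
# Save the extracted records to a spreadsheet
save_json_to_excel(json_outputs, "ai_advisory_summary.xlsx")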
A snapshot of the Excel file is shown below. LLM-powered parsing produced precise information pertaining to the required fields. The "N/A" entries in the snapshot represent data unavailable in the documents (old feedback templates missing this information).