CV Data Extraction

Introduction

Whether you are applying for a job or hiring for a large firm, reviewing each CV in detail is often impractical because of the sheer volume of candidates. Instead, leveraging CV data extraction to focus on how well key job requirements align with a candidate's CV can lead to a successful match for both the employer and the candidate.

Imagine having your profile checked against a job description automatically: it is now easy to assess your match for a position and identify any gaps in your qualifications relative to the job requirements.

For example, if a job posting highlights experience in project management and proficiency in a specific software, the candidate should ensure these skills are clearly visible on their CV. This targeted approach helps hiring managers quickly identify qualified candidates and ensures the candidate is considered for positions where they can thrive.

By emphasizing the most relevant qualifications, the hiring process becomes more efficient, and both parties benefit from a good match. The company finds the right talent more quickly, and the candidate is more likely to land a job that matches their skills and experience.

Learning Outcomes

  • Understand the importance of data extraction from CVs for automation and analysis.
  • Gain proficiency in using Python libraries for text extraction from various file formats.
  • Learn how to preprocess images to improve text extraction accuracy.
  • Explore techniques for handling case sensitivity and normalizing tokens in extracted text.
  • Identify key tools and libraries essential for effective CV data extraction.
  • Develop practical skills in extracting text from both images and PDF files.
  • Recognize the challenges involved in CV data extraction and effective solutions.

This article was published as a part of the Data Science Blogathon.

To effectively extract data from resumes and CVs, leveraging the right tools is essential for streamlining the process and ensuring accuracy. This section highlights key libraries and technologies that improve the efficiency of CV data extraction, enabling better analysis and insights from candidate profiles.

Python

Python has libraries and methods that can split sentences or paragraphs into words. You can achieve word tokenization using different approaches, such as split() (basic tokenization) or the Natural Language Toolkit (NLTK) and spaCy libraries for more advanced tokenization.

Simple tokenization (splitting sentences with split()) does not recognize punctuation and other special characters.

sentences="Immediately is a gorgeous day!."
sentences.break up()
['Today', 'is', 'a', 'beautiful', 'day!.']

Libraries: NLTK and SpaCy

Python has a more powerful tool for tokenization: the Natural Language Toolkit (NLTK).

In NLTK (Natural Language Toolkit), the punkt tokenizer tokenizes text using a pre-trained model for unsupervised sentence splitting and word tokenization.

import nltk
nltk.download('punkt')
from nltk import word_tokenize

sentences = "Today is a beautiful day!."
print(sentences)
words = word_tokenize(sentences)
print(words)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ss529\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Today is a beautiful day!.
['Today', 'is', 'a', 'beautiful', 'day', '!', '.']

Key Features of punkt:

  • It can tokenize a given text into sentences and words without needing any prior information about the language's grammar or syntax.
  • It uses machine learning models to detect sentence boundaries, which is useful in languages where punctuation does not strictly separate sentences.
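For instance, here is a minimal sketch of punkt-based sentence splitting; the sample text is illustrative:

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = "Today is a beautiful day! We will extract skills from CVs."
print(sent_tokenize(text))
# ['Today is a beautiful day!', 'We will extract skills from CVs.']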

spaCy is an advanced NLP library that offers accurate tokenization and other language processing features.
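A minimal spaCy tokenization sketch is shown below; it assumes the small English model en_core_web_sm has already been downloaded (python -m spacy download en_core_web_sm):

import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

doc = nlp("Today is a beautiful day!.")
tokens = [token.text for token in doc]
print(tokens)
# ['Today', 'is', 'a', 'beautiful', 'day', '!', '.']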

Regular Expressions: custom tokenization based on patterns, but it requires manual setup.

import re
regular = "[A-Za-z]+[\W]?"
re.findall(regular, sentences)
['Today ', 'is ', 'a ', 'beautiful ', 'day!']

Pytesseract

It is a Python-based optical character recognition (OCR) tool used for reading text in images.
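A minimal usage sketch follows; the filename sample_cv.png is an assumption, and it presumes the Tesseract engine itself is installed on the system:

import pytesseract
from PIL import Image

# On Windows, point pytesseract to the Tesseract executable if it is not on PATH
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

text = pytesseract.image_to_string(Image.open('sample_cv.png'))
print(text)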

Pillow Library

An open-source library for handling various image formats, useful for image manipulation.

Images or PDF Files

Resumes may be in PDF or image formats.

PDFPlumber or PyPDF2

To extract text from a PDF and tokenize it into words, you can follow these steps in Python:

  • Extract text from a PDF using a library like PyPDF2 or pdfplumber.
  • Tokenize the extracted text using any tokenization method, such as split(), NLTK, or spaCy.

Getting Words from PDF Files or Images

For PDF files we will need pdfplumber, and for images, OCR.

If you want to extract text from an image (instead of a PDF) and then tokenize and score it based on predefined words for different fields, you can achieve this by following these steps:

Install the pytesseract OCR engine.

It will help extract text from images.

pip install pytesseract Pillow nltk

Install the Pillow library

It will help handle various image formats.

When it comes to image processing and manipulation in Python, such as resizing, cropping, or converting between different formats, the open-source library that often comes to mind is Pillow.

Let's see how Pillow works. To view the image in a Jupyter Notebook, I use display() and place the variable holding the image inside the brackets.

from PIL import Image
image = Image.open('art.jfif')
display(image)

To resize and save the image, the resize and save methods are used; the width is set to 400 and the height to 450, as in the sketch below.
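A minimal sketch, continuing the Pillow example above; the output filename art_resized.jpg is an assumption:

# Pillow's resize takes a (width, height) tuple
resized_image = image.resize((400, 450))
resized_image.save('art_resized.jpg')
display(resized_image)  # display() is available inside a Jupyter Notebook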

Key Features of Pillow:

  • Image Formats – supports many different image formats.
  • Image Manipulation Functions – one can resize and crop images, convert color images to grayscale, and so on.

Install NLTK for tokenization (or spaCy)

Enhance your text processing capabilities by installing NLTK or spaCy, two powerful libraries for tokenization in natural language processing.

Download Tesseract and Configure the Path

Download Tesseract from GitHub and integrate it into your script by adding the necessary path for OCR functionality.

  • Windows: point pytesseract to the installed executable, e.g. pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
  • macOS: brew install tesseract
  • Linux: install via the package manager (e.g., sudo apt install tesseract-ocr).
  • pip install pytesseract Pillow

There are several OCR tools; among them is Tesseract, a Google-developed, open-source library that supports many languages.

Pytesseract is used in Python-based projects and acts as a wrapper for the Tesseract OCR engine.

Image and PDF Text Extraction Techniques

In the digital age, extracting text from images and PDF files has become essential for various applications, including data analysis and document processing. This section explores effective techniques for preprocessing images and leveraging powerful libraries to improve optical character recognition (OCR) and streamline text extraction from diverse file formats.

Preprocessing Images for Enhanced OCR Performance

Preprocessing images can improve OCR performance by following the steps mentioned below.

  • Images to Grayscale: images are converted to grayscale to reduce background noise and keep the focus on the text itself; this is useful for images with varying lighting conditions.

    from PIL import ImageOps
    image = ImageOps.grayscale(image)

  • Thresholding: apply binary thresholding to make the text stand out by converting the image into a black-and-white format.
  • Resizing: upscale smaller images for better text recognition.
  • Noise Removal: remove noise or artifacts in the image using filters (e.g., Gaussian blur). A short OpenCV sketch of these last three steps follows this list.
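Here is a minimal OpenCV sketch of thresholding, resizing, and noise removal; the threshold value, scaling factor, and kernel size are illustrative choices, not requirements:

import cv2

# Read the image directly as grayscale
img = cv2.imread('CV1.png', cv2.IMREAD_GRAYSCALE)

# Binary thresholding: pixels above 150 become white, the rest black
_, thresh = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)

# Upscale the image 2x for better character recognition
resized = cv2.resize(thresh, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

# Gaussian blur to remove small noise and artifacts
denoised = cv2.GaussianBlur(resized, (3, 3), 0)

cv2.imwrite('CV1_preprocessed.png', denoised)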
import nltk
import pytesseract
from PIL import Image
import cv2

from nltk.tokenize import word_tokenize

nltk.download('punkt')
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\ss529\anaconda3\Tesseract-OCR\tesseract.exe'

image = input("Name of the file: ")
imag = cv2.imread(image)

# Convert to a grayscale image
gray = cv2.cvtColor(imag, cv2.COLOR_BGR2GRAY)

def text_from_image(image):
    # Open the image and run OCR on it
    img = Image.open(image)
    text = pytesseract.image_to_string(img)
    return text

image = "CV1.png"

text1 = text_from_image(image)

# Tokenize the extracted text
tokens = word_tokenize(text1)

print(tokens)

To know how many words match the requirements, we compare the tokens against field-specific keywords and give 10 points for every matching word.

# Compare tokens with specific words, ignore duplicates, and calculate a score
def compare_tokens_and_score(tokens, specific_words, score_per_match=10):
    matched_words = set(word.lower() for word in tokens if word.lower() in specific_words)
    total_score = len(matched_words) * score_per_match
    return total_score

# Fields with different skills (keywords kept lowercase so they match lowercased tokens)
fields_keywords = {

    "Data_Science_Carrier": {'supervised machine learning', 'unsupervised machine learning', 'data', 'analysis', 'statistics', 'python'},

}

# Helper to tokenize extracted text
def tokenize_text(text):
    return word_tokenize(text)

# Score based on specific words for the chosen field
def process_image_for_field(image, field):
    if field not in fields_keywords:
        print(f"Field '{field}' is not defined.")
        return

    # Extract text from the image
    text = text_from_image(image)

    # Tokenize the extracted text
    tokens = tokenize_text(text)

    # Compare tokens with specific words for the chosen field
    specific_words = fields_keywords[field]
    total_score = compare_tokens_and_score(tokens, specific_words)
    print(f"Field: {field}")
    print("Total Score:", total_score)


image = "CV1.png"
field = 'Data_Science_Carrier'
process_image_for_field(image, field)

To handle case sensitivity (e.g., "Data Science" vs. "data science"), we can convert all tokens and keywords to lowercase.

tokens = word_tokenize(extracted_text.lower())

Using lemmatization with NLP libraries like NLTK or spaCy reduces words to their base form (e.g., "running" to "run").

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # required by WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def normalize_tokens(tokens):
    return [lemmatizer.lemmatize(token.lower()) for token in tokens]

Getting Text from PDF Files

Let us now explore the steps required to get text from PDF files.

Install Required Libraries

You will need the following libraries:

  • PyPDF2
  • pdfplumber
  • spacy
  • nltk

Using pip

pip install PyPDF2 pdfplumber nltk spacy
python -m spacy download en_core_web_sm
import PyPDF2

def text_from_pdf(pdf_file):
    with open(pdf_file, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text() + "\n"
    return text
import pdfplumber
from nltk.tokenize import word_tokenize

def text_from_pdf(pdf_file):
    with pdfplumber.open(pdf_file) as pdf:
        text = ""
        for page in pdf.pages:
            text += page.extract_text() + "\n"
    return text

pdf_file = "SoniaSingla-DataScience-Bio.pdf"

# Extract text from the PDF
text = text_from_pdf(pdf_file)

# Tokenize the extracted text
tokens = word_tokenize(text)

print(tokens)

Normalizing Tokens for Consistency

To handle a PDF file instead of an image and ensure that repeated words do not receive multiple scores, we modify the earlier code. We will extract text from the PDF file, tokenize it, and compare the tokens against specific words from different fields. The code calculates the score based on unique matched words.

import pdfplumber
import nltk
from nltk.tokenize import word_tokenize


nltk.download('punkt')


def extract_text_from_pdf(pdf_file):
    with pdfplumber.open(pdf_file) as pdf:
        text = ""
        for page in pdf.pages:
            text += page.extract_text() + "\n"
    return text


def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens


def compare_tokens_and_score(tokens, specific_words, score_per_match=10):
    # Use a set to store unique matched words and prevent duplicates
    unique_matched_words = set(word.lower() for word in tokens if word.lower() in specific_words)
    # Calculate the total score based on unique matches
    total_score = len(unique_matched_words) * score_per_match
    return unique_matched_words, total_score

# Define sets of specific words for different fields
fields_keywords = {

    "Data_Science_Carrier": {'supervised machine learning', 'unsupervised machine learning', 'data', 'analysis', 'statistics', 'python'},

    # Add more fields and keywords here
}

# Step 4: Select the field and calculate the score based on specific words for that field
def process_pdf_for_field(pdf_file, field):
    if field not in fields_keywords:
        print(f"Field '{field}' is not defined.")
        return

    text = extract_text_from_pdf(pdf_file)

    tokens = tokenize_text(text)

    specific_words = fields_keywords[field]
    unique_matched_words, total_score = compare_tokens_and_score(tokens, specific_words)

    print(f"Field: {field}")
    print("Unique matched words:", unique_matched_words)
    print("Total Score:", total_score)


pdf_file = "SoniaSingla-DataScience-Bio.pdf"
field = 'data_science'
process_pdf_for_field(pdf_file, field)

This will produce an error message because the 'data_science' field is not defined (the dictionary key is 'Data_Science_Carrier').


When the error is removed by passing the defined field name, it works fine.


To handle case sensitivity properly and ensure that words like "data" and "Data" are treated as the same word while still scoring it only once (even if it appears multiple times with different cases), you can normalize the case of both the tokens and the specific words. We can do this by converting both the tokens and the specific words to lowercase during the comparison while preserving the original casing for the final output of matched words, as in the sketch below.
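A minimal sketch of that idea, restating compare_tokens_and_score so the comparison is case-insensitive while the reported matches keep the spelling found in the CV; the mapping logic here is an illustration, not the article's original code:

def compare_tokens_and_score(tokens, specific_words, score_per_match=10):
    # Lowercase the keyword set once for case-insensitive comparison
    specific_lower = {word.lower() for word in specific_words}

    # Map each lowercased match to the first original spelling encountered,
    # so "Data" and "data" count as one match but the output keeps "Data"
    original_spelling = {}
    for token in tokens:
        key = token.lower()
        if key in specific_lower and key not in original_spelling:
            original_spelling[key] = token

    matched_words = set(original_spelling.values())
    total_score = len(original_spelling) * score_per_match
    return matched_words, total_score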

  • Using pdfplumber to extract the text from the PDF file.
  • Using OCR to convert images into machine-readable text.
  • Using pytesseract, the Python wrapper for the Tesseract engine, to convert image content into text.

Conclusion

We explored the crucial process of extracting and analyzing data from CVs, focusing on automation techniques using Python. We learned how to utilize essential libraries like NLTK, spaCy, Pytesseract, and Pillow for effective text extraction from various file formats, including PDFs and images. By applying methods for tokenization, text normalization, and scoring, we gained insights into how to align candidates' qualifications with job requirements efficiently. This systematic approach not only streamlines the hiring process for employers but also enhances candidates' chances of securing positions that match their skills.

Key Takeaways

  • Efficient data extraction from CVs is vital for automating the hiring process.
  • Tools like NLTK, spaCy, Pytesseract, and Pillow are essential for text extraction and processing.
  • Proper tokenization methods help in accurately analyzing the content of CVs.
  • Implementing a scoring mechanism based on keywords enhances the matching process between candidates and job requirements.
  • Normalizing tokens through techniques like lemmatization improves text analysis accuracy.

Frequently Asked Questions

Q1. How do I extract text from a PDF?

A. Use the PyPDF2 or pdfplumber libraries to extract text from a PDF.

Q2. How do I extract text from a CV in image format?

A. If the CV is in image format (a scanned document or photo), you can use OCR (Optical Character Recognition) to extract text from the image. The most commonly used tool for this in Python is pytesseract, which is a wrapper for Tesseract OCR.

Q3. How do I handle poor-quality images in OCR?

A. Improving the quality of images before feeding them into OCR can significantly increase text extraction accuracy. Techniques like grayscale conversion, thresholding, and noise reduction using tools like OpenCV can help.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.

I have done my Master of Science in Biotechnology and Master of Science in Bioinformatics from reputed universities. I have written several research papers, reviewed them, and am currently an Advisory Editorial Board Member at IJPBS.
I look forward to opportunities in IT to utilize the skills I gained during work and internships.
https://aster28.github.io/SoniaSinglaBio/website/