Topic Modeling Open-Source Research with the OpenAlex API | by Alex Davis | Jul, 2024

While we ingest the data from the API, we will apply some criteria. First, we will only ingest documents where the year is between 2016 and 2022. We want fairly recent language, as the terms and taxonomy of certain subjects can change over long periods of time.

We will also add keywords and conduct several searches. While normally we might ingest random subject areas, we will use keywords to narrow our search. This way, we have an idea of how many high-level topics we have, and can compare that to the output of the model. Below, we create a function where we can add keywords and conduct searches through the API.

import pandas as pd
import requests

def import_data(pages, start_year, end_year, search_terms):

    """
    This function is used to use the OpenAlex API, conduct a search on works, and return a dataframe with relevant works.

    Inputs:
        - pages: int, number of pages to loop through
        - search_terms: str, keywords to search for (must be formatted according to OpenAlex standards)
        - start_year and end_year: int, years to set as a range for filtering works
    """

    #create an empty dataframe
    search_results = pd.DataFrame()

    for page in range(1, pages):

        #use parameters to conduct request and format to a dataframe
        response = requests.get(f'https://api.openalex.org/works?page={page}&per-page=200&filter=publication_year:{start_year}-{end_year},type:article&search={search_terms}')
        data = pd.DataFrame(response.json()['results'])

        #append to empty dataframe
        search_results = pd.concat([search_results, data])

    #subset to relevant features
    search_results = search_results[["id", "title", "display_name", "publication_year", "publication_date",
                                     "type", "countries_distinct_count", "institutions_distinct_count",
                                     "has_fulltext", "cited_by_count", "keywords", "referenced_works_count", "abstract_inverted_index"]]

    return(search_results)

We conduct five different searches, each covering a different technology area. These technology areas are inspired by the DoD “Critical Technology Areas”. See more here:

Here is an example of a search using the required OpenAlex syntax:

#search for Trusted AI and Autonomy
ai_search = import_data(35, 2016, 2024, "'artificial intelligence' OR 'deep learn' OR 'neural net' OR 'autonomous' OR drone")
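The remaining four searches follow the same pattern with different keyword strings. As a rough sketch of the compilation step (the variable names for the other searches are placeholders, not the exact terms used in this project), the results can be combined and deduplicated on the OpenAlex id:

#combine the five searches and drop duplicate works by id (search names are hypothetical)
all_searches = pd.concat([ai_search, biotech_search, energy_search, space_search, hypersonics_search])
all_searches = all_searches.drop_duplicates(subset = 'id').reset_index(drop = True)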

After compiling our searches and dropping duplicate documents, we must clean the data to prepare it for our topic model. There are two main issues with our current output.

  1. The abstracts are returned as an inverted index (due to legal reasons). However, we can use these to return the original text.
  2. Once we obtain the original text, it will be raw and unprocessed, creating noise and hurting our model. We will conduct traditional NLP preprocessing to get it ready for the model.

Below is a function to return the original text from an inverted index.

def undo_inverted_index(inverted_index):

    """
    The purpose of the function is to 'undo' an inverted index. It inputs an inverted index and
    returns the original string.
    """

    #create empty lists to store uninverted index
    word_index = []
    words_unindexed = []

    #loop through index and return key-value pairs
    for k, v in inverted_index.items():
        for index in v: word_index.append([k, index])

    #sort by the index
    word_index = sorted(word_index, key = lambda x : x[1])

    #join only the values and flatten
    for pair in word_index:
        words_unindexed.append(pair[0])
    words_unindexed = ' '.join(words_unindexed)

    return(words_unindexed)
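As a quick illustration (assuming the combined results live in a dataframe called all_searches, and that some works may be missing an inverted index), the function can be applied to the abstract_inverted_index column:

#reconstruct abstracts, skipping works with no inverted index (column name 'abstract' is our choice)
all_searches['abstract'] = all_searches['abstract_inverted_index'].apply(
    lambda idx: undo_inverted_index(idx) if isinstance(idx, dict) else None)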

Now that we have the raw text, we can conduct our traditional preprocessing steps, such as standardization, removing stop words, lemmatization, etc. Below are functions that can be mapped to a list or series of documents.

import re
import nltk

def preprocess(text):

    """
    This function takes in a string, converts it to lowercase, cleans
    it (removes special characters and numbers), and tokenizes it.
    """

    #convert to lowercase
    text = text.lower()

    #remove special characters and digits
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)

    #tokenize
    tokens = nltk.word_tokenize(text)

    return(tokens)

def remove_stopwords(tokens):

    """
    This function takes in a list of tokens (from the 'preprocess' function) and
    removes a list of stopwords. Custom stopwords can be added to the 'custom_stopwords' list.
    """

    #set default and custom stopwords
    stop_words = nltk.corpus.stopwords.words('english')
    custom_stopwords = []
    stop_words.extend(custom_stopwords)

    #filter out stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return(filtered_tokens)

def lemmatize(tokens):

    """
    This function conducts lemmatization on a list of tokens (from the 'remove_stopwords' function).
    This shortens each word down to its root form to improve modeling results.
    """

    #initialize lemmatizer and lemmatize
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return(lemmatized_tokens)

def clean_text(text):

    """
    This function uses the previously defined functions to take a string and
    run it through the entire data preprocessing process.
    """

    #clean, tokenize, and lemmatize a string
    tokens = preprocess(text)
    filtered_tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize(filtered_tokens)
    clean_text = ' '.join(lemmatized_tokens)

    return(clean_text)
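As a minimal usage sketch (assuming the reconstructed abstracts are stored in an 'abstract' column of all_searches, and that the NLTK resources used above have not yet been downloaded), the pipeline can be mapped over the series of documents:

#download the NLTK resources used by the functions above
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

#drop works without an abstract, then clean each document
all_searches = all_searches.dropna(subset = ['abstract'])
all_searches['abstract_clean'] = all_searches['abstract'].apply(clean_text)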

Now that we have a preprocessed series of documents, we can create our first topic model!
