TL;DR
Within the humanitarian response world there are tens of thousands of tabular (CSV and Excel) datasets, many of which contain vital data for helping save lives. Data is provided by hundreds of different organizations with different naming conventions, languages and data standards, so having information (metadata) about what each column represents in tables is key for finding the right data and understanding how it fits together. Much of this metadata is set manually, which is time-consuming and error-prone, so any automated method can have a real impact towards helping people. In this article we revisit a previous analysis “Predicting Metadata of Humanitarian Datasets with GPT-3” to see how advances in the last 18 months open the way for more efficient and less time-consuming methods for setting metadata on tabular data.
Using metadata-tagged CSV and Excel datasets from the Humanitarian Data Exchange (HDX) we show that fine-tuning GPT-4o-mini works well for predicting Humanitarian Exchange Language (HXL) tags and attributes for the most common tags related to location and dates. However, for less well-represented tags and attributes the technique can be limited due to poor quality training data where humans have made mistakes in manually labelling data, or simply aren’t using all possible HXL metadata combinations. It also has the limitation of not being able to adjust when the metadata standard changes, since the training data wouldn’t reflect those changes.
Given more powerful LLMs are now available, we tested a technique to directly prompt GPT-4o or GPT-4o-mini rather than fine-tuning, providing the full HXL core schema definition in the system prompt now that larger context windows are available. This approach was shown to be more accurate than fine-tuning when using GPT-4o, able to support rarer HXL tags and attributes and requiring no custom training data, making it easier to manage and deploy. It is however more expensive, but not if using GPT-4o-mini, albeit with a slight decrease in performance. Using this approach we provide a simple Python class in a GitHub Gist that can be used in data processing pipelines to automatically add HXL metadata tags and attributes to tabular datasets.
About 18 months ago I wrote a blog post Predicting Metadata of Humanitarian Datasets with GPT-3.
That’s right, with GPT-3, not even 3.5! 🙂
Even so, back then Large Language Model (LLM) fine-tuning produced great performance for predicting Humanitarian Exchange Language (HXL) metadata fields for tabular datasets on the amazing Humanitarian Data Exchange (HDX). In that study, the training data represented the distribution of HXL data on HDX and so consisted of the most common tags relating to location and dates. These are very important for linking different datasets together in location and time, a crucial factor in using data to optimize humanitarian response.
The LLM field has since advanced … a LOT.
So in this article, we will revisit the technique, expand it to cover less frequent HXL tags and attributes, and explore other options now available to us for situations where a complex, high-cardinality taxonomy needs to be applied to data. We will also explore the ability to predict less frequent HXL standard tags and attributes not currently represented in the human-labeled training data.
You can follow along with this analysis by opening these notebooks in Google Colab or running them locally:
Please refer to the README in the repo for installation instructions.
For this study, and with help from the HDX team, we will use data extracted from the HDX platform using a crawler process they run to track the use of HXL metadata tags and attributes on the platform. You can find great HXL resources on GitHub, but if you want to follow along with this analysis I have also saved the source data on Google Drive, because the crawler will take days to process the hundreds of thousands of tabular datasets on HDX.
The data looks like this, with one row per HXL-tagged table column …
The HXL postcard is a really nice overview of the most common HXL tags and attributes in the core schema. For our analysis, we will apply the full standard as found on HDX, which provides a spreadsheet of supported tags and attributes …
The generate-test-train-data.ipynb notebook provides all the steps taken to create test and training datasets, but here are some key points to note:
1. Removal of automated pipeline repeat HXL data
In this study, I removed duplicate data created by automated pipelines that upload data to HDX, by using an MD5 hash of the column names in each tabular dataset (CSV and Excel files). For example, a CSV file of population statistics created by an organization is often very similar for each country-specific CSV or Excel file, so we only take one example. This has a balancing effect on the data, providing more variation of HXL tags and attributes by removing very similar repeat files.
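The dedup step can be sketched as follows (a minimal illustration, not the notebook’s exact code; `columns_fingerprint` is a hypothetical helper name):

```python
import hashlib

def columns_fingerprint(column_names) -> str:
    """MD5 hash of a table's column names, used to detect repeat file layouts."""
    joined = "|".join(column_names)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Country-specific files from the same pipeline share a layout, so they share a
# fingerprint and only one example would be kept for training
afg = ["country", "adm1", "indicator", "value"]
ssd = ["country", "adm1", "indicator", "value"]
print(columns_fingerprint(afg) == columns_fingerprint(ssd))  # True
```

Hashing the joined column names rather than the file contents means country-specific variants of the same product collapse to a single example.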
2. Constraining data to valid HXL
About 50% of the HDX data with HXL tags uses a tag or attribute which isn’t specified in the HXL Core Schema, so this data is removed from the training and test sets.
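The validity check can be sketched like so (illustrative only; `approved_tags`, `approved_attributes` and `is_valid_hxl` are toy stand-ins for the sets parsed from the Core Schema spreadsheet):

```python
import pandas as pd

# Toy stand-ins for the tags/attributes parsed from the HXL Core Schema
approved_tags = {"#affected", "#adm1", "#date"}
approved_attributes = {"+f", "+code", "+name"}

def is_valid_hxl(hxl: str) -> bool:
    """True if the tag and every attribute appear in the (toy) core schema."""
    parts = hxl.split("+")
    tag, attributes = parts[0], [f"+{a}" for a in parts[1:]]
    return tag in approved_tags and all(a in approved_attributes for a in attributes)

rows = pd.DataFrame({"hxl": ["#affected+f", "#affected+undefinedattr", "#adm1+code"]})
valid_rows = rows[rows["hxl"].apply(is_valid_hxl)]
# "#affected+undefinedattr" is dropped: the attribute is not in the schema
```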
3. Data enrichment
As a (mostly!) human being, when deciding what HXL tags and attributes to use on a column, I take a peek at the data for that column and also at the data as a whole in the table. For this analysis we do the same for the LLM fine-tuning and prompt data, adding in data excerpts for each column. A table description is also added using an LLM (GPT-3.5-Turbo) summary of the data to make them consistent, as summaries on HDX can vary in style, ranging from pages to a few words.
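Building the per-column data excerpts can be sketched like this (a simplified illustration; `column_excerpt` is a hypothetical helper and the notebook may sample values differently):

```python
import pandas as pd

def column_excerpt(df: pd.DataFrame, column: str, n: int = 11) -> list:
    """First n non-null values of a column, to show the LLM what the data looks like."""
    return df[column].dropna().astype(str).head(n).tolist()

df = pd.DataFrame({"indicator": ["earthquake"] * 3, "value": [1.2, None, 3.4]})
print(column_excerpt(df, "indicator"))  # ['earthquake', 'earthquake', 'earthquake']
print(column_excerpt(df, "value"))      # ['1.2', '3.4']
```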
4. Carefully splitting data to create train/test sets
Many machine learning pipelines split data randomly to create training and test sets. However, for HDX data this would result in columns and files from the same organization being in both train and test. I felt this was a bit too easy for testing predictions, and so instead split the data by organization to ensure organizations in the test set were not in the training data. Additionally, subsidiaries of the same parent organization (eg “ocha-iraq” and “ocha-libya”) were not allowed to be in both the training and test sets, again to make the predictions more realistic. My aim was to test prediction for organizations as if their data had never been seen before.
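The organization-aware split can be sketched in plain pandas (illustrative only; deriving `parent_org` from the name prefix is a simplifying assumption):

```python
import pandas as pd

rows = pd.DataFrame({
    "org": ["ocha-iraq", "ocha-libya", "unhcr", "wfp"],
    "column_name": ["adm1", "adm2", "affected", "date"],
})

# Group subsidiaries under a parent organization, e.g. "ocha-iraq" -> "ocha",
# so no parent appears in both train and test
rows["parent_org"] = rows["org"].str.split("-").str[0]

test_parents = {"ocha"}  # held-out organizations
test_set = rows[rows["parent_org"].isin(test_parents)]
train_set = rows[~rows["parent_org"].isin(test_parents)]

assert set(train_set["parent_org"]).isdisjoint(set(test_set["parent_org"]))
```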
After all of the above and down-sampling to save costs, we are left with 2,883 rows in the training set and 485 rows in the test set.
In my original article I opted for using a completion model, but with the release of GPT-4o-mini I instead generated prompts appropriate for fine-tuning a chat model (see here for more information about the available models).
Each prompt has the form …
{
"messages": [
{
"role": "system",
"content": "<SYSTEM PROMPT>"
},
{
"role": "user",
"content": "<INPUT PROMPT>"
},
{
"role": "assistant",
"content": "<EXPECTED OUTPUT>"
}
]
}
Note: The above has been formatted for readability, but JSONL will have everything on a single line per record.
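Writing the chat prompts out in that JSONL format is one `json.dumps` per line (a minimal sketch with a toy prompt record):

```python
import json

prompts = [
    {"messages": [
        {"role": "system", "content": "You are an assistant that replies with HXL tags and attributes"},
        {"role": "user", "content": "What are the HXL tags and attributes for a column with these details? ..."},
        {"role": "assistant", "content": "#affected+infected"},
    ]}
]

# One record per line, no pretty-printing
with open("hxl_chat_prompts_train.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps(p) + "\n")

# Reading it back line by line recovers the records
with open("hxl_chat_prompts_train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
assert loaded == prompts
```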
Using the data excerpts, LLM-generated table description and column name we collated, we can now generate prompts which look like this …
{
"messages": [
{
"role": "system",
"content": "You are an assistant that replies with HXL tags and attributes"
},
{
"role": "user",
"content": "What are the HXL tags and attributes for a column with these details?
resource_name='admin1-summaries-earthquake.csv';
dataset_description='The dataset contains earthquake data for various
administrative regions in Afghanistan,
including country name, admin1 name, latitude,
longitude, aggregation type, indicator name,
and indicator value. The data includes maximum
earthquake values recorded in different regions,
with corresponding latitude and longitude coordinates.
The dataset provides insights into the seismic
activity in different administrative areas of
Afghanistan.';
column_name:'indicator';
examples: ['earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake', 'earthquake']"
},
{
"function": "assistant",
"content material": "#indicator+identify"
}
]
}
We now have test and training data in the right format for fine-tuning an OpenAI chat model, so let’s tune our model …
def fine_tune_model(train_file, model_name="gpt-4o-mini"):
    """
    Fine-tune an OpenAI model using training data.

    Args:
        train_file (str): The file containing the prompts to use for fine-tuning.
        model_name (str): The name of the model to fine-tune. Default is "gpt-4o-mini".
    Returns:
        str: The ID of the fine-tuned model.
    """
    # Upload file to OpenAI for fine-tuning
    file = client.files.create(
        file=open(train_file, "rb"),
        purpose="fine-tune"
    )
    file_id = file.id
    print(f"Uploaded training file with ID: {file_id}")

    # Start the fine-tuning job
    ft = client.fine_tuning.jobs.create(
        training_file=file_id,
        model=model_name
    )
    ft_id = ft.id
    print(f"Fine-tuning job started with ID: {ft_id}")

    # Monitor the status of the fine-tuning job
    ft_result = client.fine_tuning.jobs.retrieve(ft_id)
    while ft_result.status != 'succeeded':
        print(f"Current status: {ft_result.status}")
        time.sleep(120)  # Wait 120 seconds before checking again
        ft_result = client.fine_tuning.jobs.retrieve(ft_id)
        if 'failed' in ft_result.status.lower():
            sys.exit()
    print(f"Fine-tuning job {ft_id} succeeded!")

    # Retrieve the fine-tuned model
    fine_tuned_model = ft_result.fine_tuned_model
    print(f"Fine-tuned model: {fine_tuned_model}")
    return fine_tuned_model

model = fine_tune_model("hxl_chat_prompts_train.jsonl", model_name="gpt-4o-mini-2024-07-18")
In the above we’re using the new GPT-4o-mini model, which OpenAI is currently offering free to fine-tune …
“Now through September 23, GPT-4o mini is free to fine-tune up to a daily limit of 2M training tokens. Overages over 2M training tokens will be charged at $3.00/1M tokens. Starting September 24, fine-tuning training will cost $3.00/1M tokens. Check out the fine-tuning docs for more details on free access.”
Even at $3.00/1 million tokens, the costs are quite low for this task, coming out at about $7 per fine-tuning run for just over 2 million tokens in the test file. Bearing in mind that fine-tuning should be a rare event for this particular task, once we have such a model it can be reused.
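As a sanity check on that figure, the arithmetic is just tokens divided by one million times the per-million rate (the token count and epoch count below are stand-in assumptions, not the actual run):

```python
def fine_tune_cost(training_tokens: int, rate_per_million: float = 3.00, epochs: int = 1) -> float:
    """Estimate fine-tuning cost in USD; OpenAI bills per trained token (tokens x epochs)."""
    return training_tokens / 1_000_000 * rate_per_million * epochs

# Just over 2M tokens at $3.00/1M lands in the ~$7 range quoted above
print(round(fine_tune_cost(2_300_000), 2))  # 6.9
```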
The fine-tuning produces the following output …
Uploaded training file with ID: file-XXXXXXXXXXXXXXX
Fine-tuning job started with ID: ftjob-XXXXXXXXXXXXXXX
Current status: validating_files
Current status: validating_files
Current status: running
Current status: running
Current status: running
... (repeated while the job runs)
Fine-tuning job ftjob-XXXXXXXXXXXXXXX succeeded!
Fine-tuned model: ft:gpt-4o-mini-2024-07-18::XXXXXXX
It took about 45 minutes.
Now that we have a shiny new fine-tuned model for predicting HXL tags and attributes, we can use the test file to take it for a spin …
def make_chat_predictions(prompts, model, temperature=0.1, max_tokens=13):
    """
    Generate chat predictions based on given prompts using the OpenAI chat model.

    Args:
        prompts (list): A list of prompts, where each prompt is a dictionary containing a list of messages.
                        Each message in the list has a 'role' (either 'system', 'user', or 'assistant') and 'content'.
        model (str): The name or ID of the OpenAI chat model to use for predictions.
        temperature (float, optional): Controls the randomness of the predictions. Higher values (e.g., 0.5) make the
                                       output more random, while lower values (e.g., 0.1) make it more deterministic.
                                       Defaults to 0.1.
        max_tokens (int, optional): The maximum number of tokens in the predicted response. Defaults to 13.

    Returns:
        pandas.DataFrame: A DataFrame containing the results of the chat predictions. Each row in the DataFrame
                          corresponds to a prompt and includes the prompt messages, the actual message, and the
                          predicted message.
    """
    results = []
    for p in prompts:
        actual = p["messages"][-1]["content"]
        p["messages"] = p["messages"][0:2]
        completion = client.chat.completions.create(
            model=model,
            messages=p["messages"],
            temperature=temperature,
            max_tokens=max_tokens
        )
        predicted = completion.choices[0].message.content
        predicted = filter_for_schema(predicted)
        res = {
            "prompt": p["messages"],
            "actual": actual,
            "predicted": predicted
        }
        print(f"Predicted: {predicted}; Actual: {actual}")
        results.append(res)
    results = pd.DataFrame(results)
    return results
def filter_for_schema(text):
    """
    Filters the input text to extract approved HXL schema tokens.

    Args:
        text (str): The input text to be filtered.

    Returns:
        str: The filtered text containing only approved HXL schema tokens.
    """
    if " " in text:
        text = text.replace(" ", "")
    tokens_raw = text.split("+")
    tokens = [tokens_raw[0]]
    for t in tokens_raw[1:]:
        tokens.append(f"+{t}")
    filtered = []
    for t in tokens:
        if t in APPROVED_HXL_SCHEMA:
            if t not in filtered:
                filtered.append(t)
    filtered = "".join(filtered)
    if len(filtered) > 0 and filtered[0] != '#':
        filtered = ""
    return filtered
def output_prediction_metrics(results, prediction_field="predicted", actual_field="actual"):
    """
    Prints out model performance report for HXL tag prediction. Metrics are for
    just predicting tags, as well as predicting tags and attributes.

    Parameters
    ----------
    results : dataframe
        Dataframe of results
    prediction_field : str
        Field name of element with prediction. Useful for comparing raw and post-processed predictions.
    actual_field: str
        Field name of the actual result for comparison with prediction
    """
    y_test = []
    y_pred = []
    y_justtag_test = []
    y_justtag_pred = []
    for index, r in results.iterrows():
        if actual_field not in r and prediction_field not in r:
            print("Provided results do not contain expected values.")
            sys.exit()
        y_pred.append(r[prediction_field])
        y_test.append(r[actual_field])
        actual_tag = r[actual_field].split("+")[0]
        predicted_tag = r[prediction_field].split("+")[0]
        y_justtag_test.append(actual_tag)
        y_justtag_pred.append(predicted_tag)

    print(f"LLM results for {prediction_field}, {len(results)} predictions ...")
    print("\nJust HXL tags ...\n")
    print(f"Accuracy: {round(accuracy_score(y_justtag_test, y_justtag_pred), 2)}")
    print(
        f"Precision: {round(precision_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0), 2)}"
    )
    print(
        f"Recall: {round(recall_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0), 2)}"
    )
    print(
        f"F1: {round(f1_score(y_justtag_test, y_justtag_pred, average='weighted', zero_division=0), 2)}"
    )

    print(f"\nTags and attributes with {prediction_field} ...\n")
    print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 2)}")
    print(
        f"Precision: {round(precision_score(y_test, y_pred, average='weighted', zero_division=0), 2)}"
    )
    print(
        f"Recall: {round(recall_score(y_test, y_pred, average='weighted', zero_division=0), 2)}"
    )
    print(
        f"F1: {round(f1_score(y_test, y_pred, average='weighted', zero_division=0), 2)}"
    )
    return
with open(TEST_FILE) as f:
    X_test = [json.loads(line) for line in f]

results = make_chat_predictions(X_test, model)
output_prediction_metrics(results)
print("Done")
Note in the above that all predictions are filtered for allowed tags and attributes as defined in the HXL standard.
This gives the following results …
LLM results for predicted, 458 predictions ...

Just HXL tags ...
Accuracy: 0.83
Precision: 0.85
Recall: 0.83
F1: 0.82
Tags and attributes with predicted ...
Accuracy: 0.61
Precision: 0.6
Recall: 0.61
F1: 0.57
‘Just HXL tags’ means predicting the first part of the HXL, for example if the full HXL is #affected+infected+f, the model correctly predicted the #affected part. ‘Tags and attributes’ means predicting the full HXL string, ie ‘#affected+infected+f’, a much harder challenge due to all the combinations possible.
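The two levels of scoring can be sketched as: compare everything before the first ‘+’ for the tag-only metric, and the whole string for tags-plus-attributes (toy values below, not the actual evaluation data):

```python
actual = ["#affected+infected+f", "#adm1+name"]
predicted = ["#affected+f", "#adm1+name"]

# Tag-only comparison: everything before the first '+'
tag_matches = [a.split("+")[0] == p.split("+")[0] for a, p in zip(actual, predicted)]
# Full-string comparison: tag plus all attributes must match exactly
full_matches = [a == p for a, p in zip(actual, predicted)]

print(sum(tag_matches) / len(tag_matches))    # 1.0 - both tags correct
print(sum(full_matches) / len(full_matches))  # 0.5 - attributes missed on the first row
```

This is why tag-only accuracy is always at least as high as tag-plus-attribute accuracy.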
The performance isn’t perfect, but not that bad, especially as we have balanced the dataset to reduce the number of location and date tags and attributes (ie made this study a bit more challenging). Given there are tens of thousands of humanitarian response tables outside HDX, even the above performance would likely add value.
Let’s look into cases where predictions didn’t agree with the human-labeled data …
The predictions were saved to a spreadsheet, and I manually went through many of the predictions that didn’t agree with the labels. You can find this analysis here and summarized below …
What’s interesting is that in some cases the LLM is actually correct, for example in adding extra HXL attributes which the human-labeled data doesn’t include. There are also cases where the human-labeled HXL was perfectly reasonable, but the LLM predicted another tag or attribute that could also be interpreted as correct. For example a #region can also be an #admin1 in some countries, and whether something is an +id or +code is often difficult to decide; both are appropriate.
Using the above categories, I created a new test set where the expected HXL tags were corrected. On re-running the prediction we get improved results …
Just HXL tags ...
Accuracy: 0.88
Precision: 0.88
Recall: 0.88
F1: 0.88
Tags and attributes with predicted ...
Accuracy: 0.66
Precision: 0.71
Recall: 0.66
F1: 0.66
The above shows that the human-labeled data itself can be incorrect. The HXL standard is designed excellently, but can be a challenge to memorize for developers and data scientists when setting HXL tags and attributes on data. There are some amazing tools already provided by the HXL community, but sometimes the HXL is still incorrect. This introduces a problem for the fine-tuning approach, which relies on this human-labeled data for training, especially for less well represented tags and attributes that humans aren’t using very often. It also has the limitation of not being able to adjust when the metadata standard changes, since the training data wouldn’t reflect those changes.
Since the initial analysis 18 months ago, various LLM providers have advanced their models significantly. OpenAI of course released GPT-4o as their flagship product, which importantly has a context window of 128k tokens and is another data point suggesting costs of foundational models are decreasing (see for example GPT-4-Turbo compared to GPT-4o here). Given these factors, I wondered …
If models are becoming more powerful and cheaper to use, could we avoid fine-tuning altogether and use them to predict HXL tags and attributes by prompting alone?
Not only could this mean less engineering work to clean data and fine-tune models, it would have a big advantage in being able to include HXL tags and attributes which aren’t included in the human-labeled training data but are part of the HXL standard. That is one potentially huge advantage of powerful LLMs, being able to classify with zero- and few-shot prompting.
Models like GPT-4o are trained on web data, so I thought I’d first do a test using one of our prompts to see if it already knew everything there was to know about HXL tags …
What we see is that it seems to know about HXL syntax, but the answer is incorrect (the right answer is ‘#affected+infected’), and it has chosen tags and attributes that aren’t in the HXL standard. It’s actually similar to what we see with human-tagged HXL.
How about we provide the most important parts of the HXL standard in the system prompt?
def generate_hxl_standard_prompt(local_data_file):
    """
    Generate a standard prompt for predicting Humanitarian Markup Language (HXL) tags and attributes.

    Args:
        local_data_file (str): The path to the local data file containing core hashtags and attributes.

    Returns:
        str: The generated HXL standard prompt.
    """
    core_hashtags = pd.read_excel(local_data_file, sheet_name='Core hashtags')
    core_hashtags = core_hashtags.loc[core_hashtags["Release status"] == "Released"]
    core_hashtags = core_hashtags[["Hashtag", "Hashtag long description", "Sample HXL"]]

    core_attributes = pd.read_excel(local_data_file, sheet_name='Core attributes')
    core_attributes = core_attributes.loc[core_attributes["Status"] == "Released"]
    core_attributes = core_attributes[["Attribute", "Attribute long description", "Suggested hashtags (selected)"]]

    print(core_hashtags.shape)
    print(core_attributes.shape)

    core_hashtags = core_hashtags.to_dict(orient='records')
    core_attributes = core_attributes.to_dict(orient='records')

    hxl_prompt = f"""
    You are an AI assistant that predicts Humanitarian Markup Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

    CORE HASHTAGS:

    {json.dumps(core_hashtags, indent=4)}

    CORE ATTRIBUTES:

    {json.dumps(core_attributes, indent=4)}

    Key points:

    - ALWAYS predict hash tags
    - NEVER predict a tag which isn't a valid core hashtag
    - NEVER start with an attribute, you must always start with a core hashtag
    - Always try to predict an attribute if possible
    - Do not use attribute +code if the data examples are human readable names

    You must return your result as a JSON record with the fields 'predicted' and 'reasoning', each is of type string.
    """

    print(len(hxl_prompt.split(" ")))
    print(hxl_prompt)

    return hxl_prompt
This gives us a prompt like this …
You are an AI assistant that predicts Humanitarian Markup Language (HXL) tags and attributes for columns of data where the HXL standard is defined as follows:

CORE HASHTAGS:
[
{
"Hashtag": "#access",
"Hashtag long description": "Accessiblity and constraints on access to a market, distribution point, facility, etc.",
"Sample HXL": "#access +type"
},
{
"Hashtag": "#activity",
"Hashtag long description": "A programme, project, or other activity. This hashtag applies to all levels; use the attributes +activity, +project, or +programme to distinguish different hierarchical levels.",
"Sample HXL": "#activity +project"
},
{
"Hashtag": "#adm1",
"Hashtag long description": "Top-level subnational administrative area (e.g. a governorate in Syria).",
"Sample HXL": "#adm1 +code"
},
{
"Hashtag": "#adm2",
"Hashtag long description": "Second-level subnational administrative area (e.g. a subdivision in Bangladesh).",
"Sample HXL": "#adm2 +name"
},
{
"Hashtag": "#adm3",
"Hashtag long description": "Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).",
"Sample HXL": "#adm3 +code"
},
{
"Hashtag": "#adm4",
"Hashtag long description": "Fourth-level subnational administrative area (e.g. a barangay in the Philippines).",
"Sample HXL": "#adm4 +name"
},
{
"Hashtag": "#adm5",
"Hashtag long description": "Fifth-level subnational administrative area (e.g. a ward of a city).",
"Sample HXL": "#adm5 +code"
},
{
"Hashtag": "#affected",
"Hashtag long description": "Number of people or households affected by an emergency. Subset of #population; superset of #inneed.",
"Sample HXL": "#affected +f +children"
},
{
"Hashtag": "#beneficiary",
"Hashtag long description": "General (non-numeric) information about a person or group meant to benefit from aid activities, e.g. "lactating women".",
"Sample HXL": "#beneficiary +name"
},
{
"Hashtag": "#capacity",
"Hashtag long description": "The response capacity of the entity being described (e.g. "25 beds").",
"Sample HXL": "#capacity +num"
},
... Truncated for brevity
},
{
"Hashtag": "#targeted",
"Hashtag long description": "Number of people or households targeted for humanitarian assistance. Subset of #inneed; superset of #reached.",
"Sample HXL": "#targeted +f +adult"
},
{
"Hashtag": "#value",
"Hashtag long description": "A monetary value, such as the price of goods in a market, a project budget, or the amount of cash transferred to beneficiaries. May be used together with #currency in financial or cash data.",
"Sample HXL": "#value +transfer"
}
]
CORE ATTRIBUTES:
[
{
"Attribute": "+abducted",
"Attribute long description": "Hashtag refers to people who have been abducted.",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached"
},
{
"Attribute": "+activity",
"Attribute long description": "The implementers classify this activity as an "activity" proper (may imply different hierarchical levels in different contexts).",
"Suggested hashtags (selected)": "#activity"
},
{
"Attribute": "+adolescents",
"Attribute long description": "Adolescents, loosely defined (precise age range varies); may overlap +children and +adult. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+adolescents +age12_17".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+adults",
"Attribute long description": "Adults, loosely defined (precise age range varies); may overlap +adolescents and +elderly. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+adults +age18_64".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+approved",
"Attribute long description": "Date or time when something was approved.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+bounds",
"Attribute long description": "Boundary data (e.g. inline GeoJSON).",
"Suggested hashtags (selected)": "#geo"
},
{
"Attribute": "+budget",
"Attribute long description": "Used with #value to indicate that the amount is planned/approved/budgeted rather than actually spent.",
"Suggested hashtags (selected)": "#value"
},
{
"Attribute": "+canceled",
"Attribute long description": "Date or time when something (e.g. an #activity) was canceled.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+children",
"Attribute long description": "The associated hashtag applies to non-adults, loosely defined (precise age range varies; may overlap +infants and +adolescents). You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+children +age3_11".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+cluster",
"Attribute long description": "Identifies a sector as a formal IASC humanitarian cluster.",
"Suggested hashtags (selected)": "#sector"
},
{
"Attribute": "+code",
"Attribute long description": "A unique, machine-readable code.",
"Suggested hashtags (selected)": "#region, #country, #adm1, #adm2, #adm3, #adm4, #adm5, #loc, #beneficiary, #activity, #org, #sector, #subsector, #indicator, #output, #crisis, #cause, #impact, #severity, #service, #need, #currency, #item, #need, #service, #channel, #modality, #event, #group, #status"
},
{
"Attribute": "+converted",
"Attribute long description": "Date or time used for converting a monetary value to another currency.",
"Suggested hashtags (selected)": "#date"
},
{
"Attribute": "+coord",
"Attribute long description": "Geodetic coordinates (lat+lon together).",
"Suggested hashtags (selected)": "#geo"
},
{
"Attribute": "+dest",
"Attribute long description": "Place of destination (intended or actual).",
"Suggested hashtags (selected)": "#region, #country, #adm1, #adm2, #adm3, #adm4, #adm5, #loc"
},
{
"Attribute": "+displaced",
"Attribute long description": "Displaced people or households. Refers to all types of displacement: use +idps or +refugees to be more specific.",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
{
"Attribute": "+elderly",
"Attribute long description": "Elderly people, loosely defined (precise age range varies). May overlap +adults. You can optionally create custom attributes in addition to this to add precise age ranges, e.g. "+elderly +age65plus".",
"Suggested hashtags (selected)": "#affected, #inneed, #targeted, #reached, #population"
},
... Truncated for brevity
{
"Attribute": "+url",
"Attribute long description": "The data consists of web links related to the main hashtag (e.g. for an #org, #service, #activity, #loc, etc).",
"Suggested hashtags (selected)": "#contact, #org, #activity, #service, #meta"
},
{
"Attribute": "+used",
"Attribute long description": "Refers to a #service, #item, etc. that affected people have actually consumed or otherwise taken advantage of.",
"Suggested hashtags (selected)": "#service, #item"
}
]
Key points:
- ALWAYS predict hash tags
- NEVER predict a tag which isn't a valid core hashtag
- NEVER start with an attribute, you must always start with a core hashtag
- Always try to predict an attribute if possible
You must return your result as a JSON record with the fields 'predicted' and 'reasoning', each is of type string.
It’s quite long (the above has been truncated), but it encapsulates the HXL standard.
Another advantage of the direct prompting method is that we can also ask the LLM to provide its reasoning when predicting HXL. This can of course include hallucination, but I’ve always found it useful for refining prompts.
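Parsing such a response is then just a `json.loads`, with the reasoning available for prompt debugging (a sketch using a made-up response string; real model output may vary):

```python
import json

# A hypothetical raw model response, possibly wrapped in markdown fences
raw = '```json\n{"predicted": "#affected+infected", "reasoning": "Column counts people affected by disease."}\n```'

# Strip any markdown fences before parsing, as the article's call_gpt function does
cleaned = raw.replace("```json", "").replace("```", "")
record = json.loads(cleaned)
print(record["predicted"])   # #affected+infected
print(record["reasoning"])   # useful when refining the system prompt
```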
For the user prompt, we will use the same information that we used for fine-tuning, including the data excerpt and LLM-generated table summary …
What are the HXL tags and attributes for a column with these details? resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC Appeals Data for South Sudan8.csv';
dataset_description='The dataset contains information on various
appeals and events related to South Sudan,
including details such as the type of appeal,
status, sector, amount requested and funded,
start and end dates, as well as country-specific
information like country code, region, and average
household size. The data includes appeals for
different crises such as floods, population
movements, cholera outbreaks, and Ebola preparedness,
with details on beneficiaries and confirmation needs.
The dataset also includes metadata such as IDs,
names, and translation modules for countries and regions.';
column_name:'support';
examples: ['18401', '17770', '17721', '16858', '15268', '15113', '14826', '14230', '12788', '9286', '8561']
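The per-column user prompt above is just string assembly. As an illustrative sketch — the helper name and exact wording are assumptions, not the original pipeline's code, but they mirror the prompt format shown:

```python
def make_user_prompt(resource_name, dataset_description, column_name, examples):
    """Build the per-column user prompt in the format used above.
    Hypothetical helper, shown for illustration."""
    return (
        "What are the HXL tags and attributes for a column with these details? "
        f"resource_name='{resource_name}'; "
        f"dataset_description='{dataset_description}'; "
        f"column_name:'{column_name}'; "
        f"examples: {examples}"
    )

p = make_user_prompt("appeals.csv", "IFRC appeals data for South Sudan",
                     "status_display", ["Active", "Closed"])
```

Including both the column name and a sample of values gives the model two independent signals, which matters when column names are cryptic (like `dtype.id`).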
Putting it all together, and prompting both GPT-4o-mini and GPT-4o for comparison …
def call_gpt(prompt, system_prompt, model, temperature, top_p, max_tokens):
    """
    Calls the GPT model to generate a response based on the given prompt and system prompt.

    Args:
        prompt (str): The user's input prompt.
        system_prompt (str): The system prompt.
        model (str): The name or ID of the GPT model to use.
        temperature (float): Controls the randomness of the generated output. Higher values (e.g., 0.8) make the output more random, while lower values (e.g., 0.2) make it more deterministic.
        top_p (float): Controls the diversity of the generated output. Higher values (e.g., 0.8) make the output more diverse, while lower values (e.g., 0.2) make it more focused.
        max_tokens (int): The maximum number of tokens to generate in the response.

    Returns:
        dict or None: The generated response as a dictionary, or None if an error occurred during generation.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        stream=False,
        response_format={"type": "json_object"}
    )
    result = response.choices[0].message.content
    result = result.replace("```json", "").replace("```", "")
    try:
        result = json.loads(result)
        result["predicted"] = result["predicted"].replace(" ", "")
    except Exception:
        print(result)
        result = None
    return result
def make_prompt_predictions(prompts, model, temperature=0.1, top_p=0.1,
                            max_tokens=2000, debug=False, actual_field="actual"):
    """
    Generate predictions for a given set of prompts using the specified model.

    Args:
        prompts (pandas.DataFrame): A DataFrame containing the prompts to generate predictions for.
        model (str): The name of the model to use for prediction.
        temperature (float, optional): The temperature parameter for the model's sampling. Defaults to 0.1.
        top_p (float, optional): The top-p parameter for the model's sampling. Defaults to 0.1.
        max_tokens (int, optional): The maximum number of tokens to generate for each prompt. Defaults to 2000.
        debug (bool, optional): Whether to print debug information during prediction. Defaults to False.
        actual_field (str, optional): The name of the column in the prompts DataFrame that contains the actual values. Defaults to "actual".

    Returns:
        pandas.DataFrame: A DataFrame containing the results of the predictions, including the prompt, actual value, predicted value, and reasoning.
    """
    num_prompts = len(prompts)
    print(f"Number of prompts: {num_prompts}")
    results = []
    for index, p in prompts.iterrows():
        if index % 50 == 0:
            print(f"{index/num_prompts*100:.2f}% complete")
        prompt = p["prompt"]
        prompt = ast.literal_eval(prompt)
        prompt = prompt[1]["content"]
        actual = p[actual_field]
        result = call_gpt(prompt, hxl_prompt, model, temperature, top_p, max_tokens)
        if result is None:
            print(" !!!!! No LLM result")
            predicted = ""
            reasoning = ""
        else:
            predicted = result["predicted"]
            reasoning = result["reasoning"]
        if debug is True:
            print(f"Actual: {actual}; Predicted: {predicted}; Reasoning: {reasoning}")
        results.append({
            "prompt": prompt,
            "actual": actual,
            "predicted": predicted,
            "reasoning": reasoning
        })
    results = pd.DataFrame(results)
    print(f"\n\n===================== {model} Results =========================\n\n")
    output_prediction_metrics(results)
    print("\n\n=================================================================")
    results["match"] = results["predicted"] == results["actual"]
    results.to_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-{model}-results.xlsx", index=False)
    return results
for model in ["gpt-4o-mini", "gpt-4o"]:
    print(f"Model: {model}")
    results = make_prompt_predictions(X_test, model, temperature=0.1, top_p=0.1, max_tokens=2000)
We get …
===================== gpt-4o-mini Results =========================

LLM results for predicted, 458 predictions ...

Just HXL tags ...
Accuracy: 0.77
Precision: 0.83
Recall: 0.77
F1: 0.77

Tags and attributes with predicted ...
Accuracy: 0.53
Precision: 0.54
Recall: 0.53
F1: 0.5

===================== gpt-4o Results =========================

LLM results for predicted, 458 predictions ...

Just HXL tags ...
Accuracy: 0.86
Precision: 0.86
Recall: 0.86
F1: 0.85

Tags and attributes with predicted ...
Accuracy: 0.71
Precision: 0.7
Recall: 0.71
F1: 0.69

=================================================================
As a reminder, the fine-tuned model produced the following results …

Just HXL tags ...
Accuracy: 0.83
Precision: 0.85
Recall: 0.83
F1: 0.82

Tags and attributes with predicted ...
Accuracy: 0.61
Precision: 0.6
Recall: 0.61
F1: 0.57
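The `output_prediction_metrics` helper that produces these figures isn't shown in this section. As a simplified stand-in (the real helper also reports precision, recall and F1), the two accuracy figures distinguish between matching only the core hashtag and matching the full tag+attribute string:

```python
def prediction_metrics(pairs):
    """Accuracy on the core hashtag alone, and on the full tag+attribute
    string. Simplified sketch of the metrics comparison used above."""
    core = lambda s: s.split("+")[0]  # '#status+name' -> '#status'
    n = len(pairs)
    tag_acc = sum(core(a) == core(p) for a, p in pairs) / n
    full_acc = sum(a == p for a, p in pairs) / n
    return round(tag_acc, 2), round(full_acc, 2)

# One of two predictions matches exactly; both get the core tag right
print(prediction_metrics([("#status+name", "#status+code"),
                          ("#adm1+code", "#adm1+code")]))  # (1.0, 0.5)
```

This split explains why "Just HXL tags" accuracy is always higher than "Tags and attributes": the latter is a strictly harder exact-match criterion.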
How does prompting-only GPT-4o compare with GPT-4o-mini?
Looking at the above, we see that GPT-4o-mini prompting-only predicts just tags with 77% accuracy, which is less than GPT-4o-mini fine-tuning (83%) and GPT-4o prompting-only (86%). That said, the performance is still good and would improve HXL coverage even if used as-is.
How does prompting-only compare with the fine-tuned model?
GPT-4o prompting-only gave the best results of all models, with 86% accuracy on tags and 71% on tags and attributes. In fact, the performance may well be better after a bit more review of the test data to correct incorrectly human-labeled tags.
Let’s take a closer look at the cases GPT-4o got wrong …
df = pd.read_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-prompting-only-prediction-gpt-4o-results.xlsx")
breaks = df[df["match"] == False]
print(breaks.shape)
for index, row in breaks.iterrows():
    print("\n======================================== ")
    pprint.pp(f"\nPrompt: {row['prompt']}")
    print()
    print("Actual", row["actual"])
    print("Predicted", row["predicted"])
    print()
    pprint.pp(f'Reasoning: \n{row["reasoning"]}')
('\n'
'Prompt: What are the HXL tags and attributes for a column with these '
'details? '
"resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
"Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
'contains information on various appeals and events related to South Sudan, '
'including details such as the type of appeal, status, sector, amount '
'requested and funded, start and end dates, as well as country-specific '
'information like country code, region, and average household size. The data '
'includes appeals for different crises such as floods, population movements, '
'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
'confirmation needs. The dataset also includes metadata such as IDs, names, '
"and translation modules for countries and regions.'; column_name:'dtype.id'; "
"examples: ['12', '5', '1', '1', '12', '12', '1', '6', '1', '1', '7']")
Actual #cause+id
Predicted #meta+id
('Reasoning: \n'
"The column 'dtype.id' contains numeric identifiers (e.g., '12', '5', '1') "
'which are likely to be internal identifiers for data records. According to '
'the HXL standard, the appropriate hashtag for internal identifiers is '
"'#meta' with the attribute '+id'.")
========================================
('\n'
'Prompt: What are the HXL tags and attributes for a column with these '
'details? '
"resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
"Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
'contains information on various appeals and events related to South Sudan, '
'including details such as the type of appeal, status, sector, amount '
'requested and funded, start and end dates, as well as country-specific '
'information like country code, region, and average household size. The data '
'includes appeals for different crises such as floods, population movements, '
'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
'confirmation needs. The dataset also includes metadata such as IDs, names, '
"and translation modules for countries and regions.'; "
"column_name:'dtype.name'; examples: ['Flood', 'Population Movement', "
"'Epidemic', 'Epidemic', 'Flood', 'Flood', 'Epidemic', 'Complex Emergency', "
"'Epidemic', 'Epidemic', 'Civil Unrest']")
Actual #cause+name
Predicted #event+type
('Reasoning: \n'
"The examples provided in the column ('Flood', 'Population Movement', "
"'Epidemic', 'Complex Emergency', 'Civil Unrest') describe different types of "
'events or incidents within a crisis or emergency. According to the HXL '
'standard, the appropriate hashtag for this type of data is #event, and the '
'attribute +type is used to specify the type or category of the event.')
========================================
('\n'
'Prompt: What are the HXL tags and attributes for a column with these '
'details? '
"resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
"Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
'contains information on various appeals and events related to South Sudan, '
'including details such as the type of appeal, status, sector, amount '
'requested and funded, start and end dates, as well as country-specific '
'information like country code, region, and average household size. The data '
'includes appeals for different crises such as floods, population movements, '
'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
'confirmation needs. The dataset also includes metadata such as IDs, names, '
"and translation modules for countries and regions.'; "
"column_name:'status_display'; examples: ['Active', 'Active', 'Closed', "
"'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed', "
"'Closed']")
Actual #status+name
Predicted #status+code
('Reasoning: \n'
"The column 'status_display' contains values such as 'Active' and 'Closed', "
'which describe the status of appeals or events. The appropriate HXL hashtag '
'for project or activity status is #status. Since the values are categorical '
'and represent different statuses, the attribute +code is suitable to '
'indicate these status codes.')
========================================
('\n'
'Prompt: What are the HXL tags and attributes for a column with these '
'details? '
"resource_name='/content/drive/MyDrive/Colab/hxl-metadata-prediction/data/IFRC "
"Appeals Data for South Sudan8.csv'; dataset_description='The dataset "
'contains information on various appeals and events related to South Sudan, '
'including details such as the type of appeal, status, sector, amount '
'requested and funded, start and end dates, as well as country-specific '
'information like country code, region, and average household size. The data '
'includes appeals for different crises such as floods, population movements, '
'cholera outbreaks, and Ebola preparedness, with details on beneficiaries and '
'confirmation needs. The dataset also includes metadata such as IDs, names, '
"and translation modules for countries and regions.'; "
"column_name:'region.id'; examples: ['0', '0', '0', '0', '0', '0', '0', '0', "
"'0', '0', '0']")
Actual #adm1+code
Predicted #region+id
('Reasoning: \n'
"The column 'region.id' contains numeric identifiers for regions, which "
'aligns with the HXL tag #region and the attribute +id. The examples provided '
'are all numeric, indicating that these are likely unique identifiers for '
'regions.')
========================================
Notice how we now have a ‘Reasoning’ field to indicate why the tags were chosen. This is useful and could be an important part of refining the prompt to improve performance.
Looking at the sample above, we see some familiar scenarios that were found when analyzing the fine-tuned model’s failed predictions …
- +id and +code ambiguity
- #region and #adm1 used interchangeably
- #event versus more detailed tags like #cause
These seem to fall into the category where two tags are possible for a given column, given their HXL definitions. But there are some real discrepancies which would need more investigation.
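One quick way to separate these systematic ambiguities from one-off discrepancies is to tally the actual → predicted pairs across all failure cases. A small sketch, using a hand-picked sample in place of the `breaks` DataFrame loaded above:

```python
from collections import Counter

# Tally actual -> predicted pairs to surface recurring confusions.
# `mismatches` is an illustrative sample; in practice it would be built
# from the (actual, predicted) columns of the failure-case DataFrame.
mismatches = [("#cause+id", "#meta+id"), ("#adm1+code", "#region+id"),
              ("#status+name", "#status+code"), ("#adm1+code", "#region+id")]
confusions = Counter(f"{a} -> {p}" for a, p in mismatches)
for pair, count in confusions.most_common():
    print(pair, count)
```

Pairs that recur frequently (like #adm1+code → #region+id) point at ambiguity in the schema definitions, and are good candidates for an extra disambiguation rule in the system prompt.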
That said, using GPT-4o to predict HXL tags and attributes yields the best results, and I believe at an acceptable level given that a lot of data is missing HXL metadata altogether and many of the datasets which have it carry incorrect tags and attributes.
Let’s see how costs compare for each technique and model …
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """
    Returns the number of tokens in a text string using tiktoken.
    See: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

    Args:
        string (str): The text string to count the tokens for.
        encoding_name (str): The name of the encoding to use.

    Returns:
        num_tokens: The number of tokens in the text string.
    """
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
def calc_costs(data, model, method="prompting"):
    """
    Calculate token costs for a given dataset, method and model.
    Note: Only covers inference costs, not fine-tuning.

    Args:
        data (pandas.DataFrame): The data to get the tokens for.
        model (str): The model to use, e.g. "gpt-4o-mini"
        method (str, optional): The method to use. Defaults to "prompting".

    Returns:
        input_tokens: The number of input tokens.
        output_tokens: The number of output tokens.
    """
    # See https://openai.com/api/pricing/ ($ per million tokens)
    price = {
        "gpt-4o-mini": {
            "input": 0.150,
            "output": 0.600
        },
        "gpt-4o": {
            "input": 5.00,
            "output": 15.00
        }
    }
    input_tokens = 0
    output_tokens = 0
    for index, p in data.iterrows():
        prompt = p["prompt"]
        prompt = ast.literal_eval(prompt)
        input_text = prompt[1]["content"]
        # If prompting, we must also include the system prompt
        if method == "prompting":
            input_text += " " + hxl_prompt
        output_text = p["Corrected actual"]
        input_tokens += num_tokens_from_string(str(input_text), "cl100k_base")
        output_tokens += num_tokens_from_string(str(output_text), "cl100k_base")
    input_cost = input_tokens / 1000000 * price[model]["input"]
    output_cost = output_tokens / 1000000 * price[model]["output"]
    print(f"\nFor {data.shape[0]} table columns where we predicted HXL tags ...")
    print(f"{method} prediction with model {model}, {input_tokens} input tokens = ${input_cost}")
    print(f"{method} prediction with model {model}, {output_tokens} output tokens = ${output_cost}\n")

hxl_prompt = generate_hxl_standard_prompt(HXL_SCHEMA_LOCAL_FILE, debug=False)
X_test2 = pd.read_excel(f"{LOCAL_DATA_DIR}/hxl-metadata-fine-tune-prediction-results-review.xlsx", sheet_name=0)
calc_costs(X_test2, method="fine-tuning", model="gpt-4o-mini")
calc_costs(X_test2, method="prompting", model="gpt-4o-mini")
calc_costs(X_test2, method="prompting", model="gpt-4o")
Which gives …

For 458 table columns where we predicted HXL tags ...
fine-tuning prediction with model gpt-4o-mini, 99738 input tokens = $0.014960699999999999
fine-tuning prediction with model gpt-4o-mini, 2001 output tokens = $0.0012006

For 458 table columns where we predicted HXL tags ...
prompting prediction with model gpt-4o-mini, 2688812 input tokens = $0.4033218
prompting prediction with model gpt-4o-mini, 2001 output tokens = $0.0012006

For 458 table columns where we predicted HXL tags ...
prompting prediction with model gpt-4o, 2688812 input tokens = $13.44406
prompting prediction with model gpt-4o, 2001 output tokens = $0.030015000000000003
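As a quick sanity check, the headline GPT-4o figure follows directly from the per-million-token pricing arithmetic used above:

```python
# Reproduce the GPT-4o prompting input cost: 2,688,812 input tokens
# at the $5.00 per million tokens rate used in the pricing table above.
input_tokens = 2_688_812
price_per_million_input = 5.00
cost = input_tokens / 1_000_000 * price_per_million_input
print(f"${cost:.5f}")  # → $13.44406
```

The ~33x gap versus GPT-4o-mini ($13.44 vs $0.40) is almost entirely the input-pricing ratio, since both methods send the same full-schema system prompt.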
Note: the above is only the inference cost; there will be a very small additional cost in generating table data summaries with GPT-3.5.
Given the test set, predicting HXL for 458 columns …
Fine-tuning:
As expected, inference costs for the fine-tuned GPT-4o-mini model (which cost about $7 to fine-tune) are very low, about $0.02.
Prompting-only:
- GPT-4o prompting-only is expensive, due to the HXL standard being passed into the system prompt every time, and comes out at $13.44.
- GPT-4o-mini, albeit with reduced performance, is a more reasonable $0.40.
So ease of use comes with a cost if using GPT-4o, but GPT-4o-mini is an attractive alternative.
Finally, it’s worth noting that in many cases setting HXL tags doesn’t need to happen in real time, for example in a crawler process that corrects already-uploaded datasets. This means that the new OpenAI batch API could be used, reducing costs by 50%.
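As a sketch of what that could look like, the Batch API takes a JSONL file of chat-completion requests. The column names, prompts and file names below are illustrative, but the request fields mirror those used in `call_gpt` above:

```python
import json

# Write one batch request per column needing HXL tags, in the JSONL
# format the OpenAI Batch API expects. Illustrative sketch only.
system_prompt = "...HXL core schema system prompt..."
columns = [("status_display", ["Active", "Closed"])]

with open("hxl_batch_requests.jsonl", "w") as f:
    for i, (name, examples) in enumerate(columns):
        request = {
            "custom_id": f"column-{i}-{name}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user",
                     "content": f"What are the HXL tags and attributes for "
                                f"column '{name}'? examples: {examples}"},
                ],
                "response_format": {"type": "json_object"},
            },
        }
        f.write(json.dumps(request) + "\n")
```

The file would then be uploaded with `client.files.create(..., purpose="batch")` and submitted via `client.batches.create(input_file_id=..., endpoint="/v1/chat/completions", completion_window="24h")`, with results returned within 24 hours at roughly half the per-token price.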
Putting this all together, I created a GitHub gist hxl_utils.py. Check this out from GitHub and place the file in your current working directory.
Let’s download a file to test it with …
# See HDX for this file: https://data.humdata.org/dataset/sudan-acled-conflict-data
DATAFILE_URL="https://data.humdata.org/dataset/5efad450-8b15-4867-b7b3-8a25b455eed8/resource/3352a0d8-2996-4e70-b618-3be58699be7f/download/sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx"
local_data_file = f"{LOCAL_DATA_DIR}/{DATAFILE_URL.split('/')[-1]}"

# Save the data file locally
urllib.request.urlretrieve(DATAFILE_URL, local_data_file)

# Read it to get a dataframe
df = pd.read_excel(local_data_file, sheet_name=1)
And using this dataframe, let’s predict HXL tags …
from hxl_utils import HXLUtils

hxl_utils = HXLUtils(LOCAL_DATA_DIR, model="gpt-4o")
data = hxl_utils.add_hxl(df, "sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx")
print("\n\nAFTER: \n\n")
display(data)
And there we have it, some lovely HXL tags!
Let’s see how well GPT-4o-mini does …
hxl_utils = HXLUtils(LOCAL_DATA_DIR, model="gpt-4o-mini")
data = hxl_utils.add_hxl(df, "sudan_hrp_civilian_targeting_events_and_fatalities_by_month-year_as-of-25jul2024.xlsx")
Which gives …
Pretty good! GPT-4o gave “#affected+killed+num” for the last column, where GPT-4o-mini gave “#affected+num”, but this could likely be resolved with some deft prompt engineering.
Admittedly this wasn’t a particularly challenging dataset, but it was able to correctly predict tags for events and fatalities, which are less common than locations and dates.
I think a big takeaway here is that the direct-prompting technique produces good results without the need for training. Yes, it is more expensive for inference, but maybe not if a data scientist is required to curate incorrectly human-labeled fine-tuning data. It will depend on the organization and metadata use-case.
Here are some areas that might be considered in future work …
Improved test data
This analysis did a quick review of the test set to correct HXL tags which were wrong in the data or had multiple possible values. More time could be spent on this; as always in machine learning, ground truth is key.
Prompt engineering and hyperparameter tuning
The above analysis uses very basic prompts with no real engineering or techniques applied; these could definitely be improved for better performance. With an evaluation set and a framework such as Promptflow, prompt variants could be tested. Additionally, we might add more context data, for example when deciding administrative levels, which can vary per country. Finally, we have used fixed hyperparameters for temperature and top_p, as well as completion token length. All of these could be tuned, leading to better performance.
Cost optimization
The prompting-only approach definitely looks to be a strong option and simplifies how an organization can automatically set HXL tags on their data using GPT-4o. There are of course cost implications with this model, it being more expensive, but predictions occur only on low-volume schema changes, not when the underlying data itself changes, and with new options for batch submission on OpenAI and ever-decreasing LLM costs, this technique looks viable for many organizations. GPT-4o-mini also performs well and is a fraction of the cost.
Application to other metadata standards
It would be interesting to apply this technique to other metadata and labeling standards; I’m sure many organizations are already using LLMs for this.
Please like this article if so inclined, and I’d be delighted if you followed me! You can find more articles here.