TL;DR
We examined the structured output capabilities of Google Gemini Pro, Anthropic Claude, and OpenAI GPT. In their best-performing configurations, all three models can generate structured outputs at the scale of thousands of JSON objects. However, the API capabilities differ significantly in the effort required to prompt the models to produce JSONs and in their ability to adhere to the suggested data model layouts.
More specifically, the top commercial vendor offering consistent structured outputs right out of the box appears to be OpenAI, with their latest Structured Outputs API released on August 6th, 2024. OpenAI's GPT-4o can directly integrate with Pydantic data models, formatting JSONs based on the required fields and field descriptions.
Anthropic's Claude Sonnet 3.5 takes second place because it requires a 'tool call' trick to reliably produce JSONs. While Claude can interpret field descriptions, it does not directly support Pydantic models.
Finally, Google Gemini 1.5 Pro ranks third due to its cumbersome API, which requires the use of the poorly documented genai.protos.Schema class as a data model for reliable JSON production. Moreover, there appears to be no easy way to guide Gemini's output using field descriptions.
Here are the test results in a summary table:
Here is the link to the testbed notebook:
https://github.com/iterative/datachain-examples/blob/main/formats/JSON-outputs.ipynb
Introduction to the problem
The ability to generate structured output from an LLM is not critical when it is used as a generic chatbot. However, structured outputs become indispensable in two emerging LLM applications:
• LLM-based analytics (such as AI-driven judgments and unstructured data analysis)
• Building LLM agents
In both cases, it is crucial that the communication from an LLM adheres to a well-defined format. Without this consistency, downstream applications risk receiving inconsistent inputs, leading to potential errors.
Unfortunately, while most modern LLMs offer methods designed to produce structured outputs (such as JSON), these methods often encounter two significant issues:
1. They periodically fail to produce a valid structured object.
2. They generate a valid object but fail to adhere to the requested data model (see the detection sketch after this list).
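A minimal detection sketch for both failure modes, with ExpectedResponse as a hypothetical stand-in for whatever data model a downstream application expects:
```python
import json
from pydantic import BaseModel, ValidationError

class ExpectedResponse(BaseModel):  # hypothetical target data model
    sentiment: str
    key_issues: list[str]

def check_llm_output(raw: str) -> ExpectedResponse | None:
    try:
        data = json.loads(raw)  # issue 1: the output is not valid JSON at all
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        return None
    try:
        return ExpectedResponse(**data)  # issue 2: valid JSON, but the wrong data model
    except ValidationError as e:
        print(f"Schema mismatch: {e}")
        return None
```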
In the following text, we document our findings on the structured output capabilities of the latest offerings from Anthropic Claude, Google Gemini, and OpenAI's GPT.
Anthropic Claude Sonnet 3.5
At first glance, Anthropic Claude's API appears straightforward, as it features a section titled 'Increasing JSON Output Consistency' that begins with an example in which the user requests a moderately complex structured output and gets a result right away:
import os
import anthropic

PROMPT = """
You are a Customer Insights AI.
Analyze this feedback and output in JSON format with keys: "sentiment" (positive/negative/neutral),
"key_issues" (list), and "action_items" (list of dicts with "team" and "task").
"""

source_files = "gs://datachain-demo/chatbot-KiT/"
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

completion = (
    client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=PROMPT,
        messages=[{"role": "user", "content": "User: Book me a ticket. Bot: I do not know."}],
    )
)
print(completion.content[0].text)
However, if we actually run the code above a few times, we will notice that conversion of the output to JSON frequently fails because the LLM prepends the JSON with a prefix that was not requested:
Here is the analysis of that feedback in JSON format:
{
"sentiment": "negative",
"key_issues": [
"Bot unable to perform requested task",
"Lack of functionality",
"Poor user experience"
],
"action_items": [
{
"team": "Development",
"task": "Implement ticket booking functionality"
},
{
"team": "Knowledge Base",
"task": "Create and integrate a database of ticket booking information and procedures"
},
{
"team": "UX/UI",
"task": "Design a user-friendly interface for ticket booking process"
},
{
"team": "Training",
"task": "Improve bot's response to provide alternatives or direct users to appropriate resources when unable to perform a task"
}
]
}
If we attempt to gauge the frequency of this issue, it affects roughly 14–20% of requests, making reliance on Claude's 'structured prompt' feature questionable. This problem is evidently well known to Anthropic, as their documentation provides two additional recommendations:
1. Provide inline examples of valid output.
2. Coerce the LLM to begin its response with a valid preamble.
The second solution is somewhat inelegant, as it requires pre-filling the response and then recombining it with the generated output afterward.
Taking these recommendations into account, here is an example of code that implements both techniques and evaluates the validity of the returned JSON string. This prompt was tested across 50 different dialogs by Karlsruhe Institute of Technology using Iterative's DataChain library:
import os
import json
import anthropic
from datachain import File, DataChain, Column

source_files = "gs://datachain-demo/chatbot-KiT/"
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

PROMPT = """
You are a Customer Insights AI.
Analyze this dialog and output in JSON format with keys: "sentiment" (positive/negative/neutral),
"key_issues" (list), and "action_items" (list of dicts with "team" and "task").
Example:
{
    "sentiment": "negative",
    "key_issues": [
        "Bot unable to perform requested task",
        "Poor user experience"
    ],
    "action_items": [
        {
            "team": "Development",
            "task": "Implement ticket booking functionality"
        },
        {
            "team": "UX/UI",
            "task": "Design a user-friendly interface for ticket booking process"
        }
    ]
}
"""

prefill = '{"sentiment":'

def eval_dialogue(file: File) -> str:
    completion = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=PROMPT,
        messages=[
            {"role": "user", "content": file.read()},
            # Pre-fill the assistant turn to coerce a valid JSON preamble
            {"role": "assistant", "content": prefill},
        ],
    )
    # Recombine the prefill with the generated continuation
    json_string = prefill + completion.content[0].text
    try:
        # Attempt to parse the string as JSON
        json_data = json.loads(json_string)
        return json_string
    except json.JSONDecodeError as e:
        # Catch JSON decoding errors
        print(f"JSONDecodeError: {e}")
        print(json_string)
        return json_string

chain = (
    DataChain.from_storage(source_files, type="text")
    .filter(Column("file.path").glob("*.txt"))
    .map(claude=eval_dialogue)
    .exec()
)
The results have improved, but they are still not perfect. Roughly one out of every 50 calls returns an error similar to this:
JSONDecodeError: Expecting value: line 2 column 1 (char 14)
{"sentiment":
Human: I want you to analyze the dialog I just shared
This implies that the Sonnet 3.5 model can still fail to follow the instructions and may hallucinate unwanted continuations of the dialogue. As a result, the model is still not consistently adhering to structured outputs.
Fortunately, there is another approach to explore within the Claude API: utilizing function calls. These functions, referred to as 'tools' in Anthropic's API, inherently require structured input to operate. To leverage this, we can create a mock function and configure the call to align with our desired JSON object structure:
import os
import json
import anthropic
from datachain import File, DataChain, Column
from pydantic import BaseModel, Field, ValidationError

class ActionItem(BaseModel):
    team: str
    task: str

class EvalResponse(BaseModel):
    sentiment: str = Field(description="dialog sentiment (positive/negative/neutral)")
    key_issues: list[str] = Field(description="list of 5 issues found in the dialog")
    action_items: list[ActionItem] = Field(description="list of dicts with 'team' and 'task'")

source_files = "gs://datachain-demo/chatbot-KiT/"
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

PROMPT = """
You are assigned to evaluate this chatbot dialog and send the results to the manager via the send_to_manager tool.
"""

def eval_dialogue(file: File) -> str:
    completion = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=PROMPT,
        tools=[
            {
                "name": "send_to_manager",
                "description": "Send bot evaluation results to a manager",
                "input_schema": EvalResponse.model_json_schema(),
            }
        ],
        messages=[{"role": "user", "content": file.read()}],
    )
    try:  # We are only interested in the ToolUseBlock part
        json_dict = completion.content[1].input
    except IndexError as e:
        # Catch cases where Claude refuses to use tools
        print(f"IndexError: {e}")
        print(completion)
        return str(completion)
    try:
        # Attempt to convert the tool dict to an EvalResponse object
        EvalResponse(**json_dict)
        return completion
    except ValidationError as e:
        # Catch Pydantic validation errors
        print(f"Pydantic error: {e}")
        print(completion)
        return str(completion)

tool_chain = (
    DataChain.from_storage(source_files, type="text")
    .filter(Column("file.path").glob("*.txt"))
    .map(claude=eval_dialogue)
    .exec()
)
After running this code 50 times, we encountered one erratic response, which looked like this:
IndexError: list index out of range
Message(id='msg_018V97rq6HZLdxeNRZyNWDGT',
content=[TextBlock(
text="I apologize, but I don't have the ability to directly print anything.
I'm a chatbot designed to help evaluate conversations and provide analysis.
Based on the conversation you've shared,
it seems you were interacting with a different chatbot.
That chatbot doesn't appear to have printing capabilities either.
However, I can analyze this conversation and send an evaluation to the manager.
Would you like me to do that?", type='text')],
model='claude-3-5-sonnet-20240620',
role='assistant',
stop_reason='end_turn',
stop_sequence=None, type='message',
usage=Usage(input_tokens=1676, output_tokens=95))
In this instance, the model became confused and failed to execute the function call, instead returning a text block and stopping prematurely (with stop_reason = 'end_turn'). Fortunately, the Claude API offers a way to prevent this behavior and force the model to always emit a tool call rather than a text block. By adding the following line to the configuration, you can ensure the model adheres to the intended function call behavior:
tool_choice = {"type": "tool", "name": "send_to_manager"}
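For context, here is a minimal sketch of the modified call; it reuses the PROMPT, client, and send_to_manager tool definition from the previous listing, with dialog_text as a placeholder for the conversation to evaluate. With a forced tool call, the tool block is expected to be the first element of completion.content:
```python
completion = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=PROMPT,
    tools=[
        {
            "name": "send_to_manager",
            "description": "Send bot evaluation results to a manager",
            "input_schema": EvalResponse.model_json_schema(),
        }
    ],
    # Force Claude to emit a send_to_manager tool call instead of a text block
    tool_choice={"type": "tool", "name": "send_to_manager"},
    messages=[{"role": "user", "content": dialog_text}],  # dialog_text: conversation to evaluate
)
json_dict = completion.content[0].input  # the ToolUseBlock now comes first
```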
By forcing the use of tools, Claude Sonnet 3.5 was able to successfully return a valid JSON object over 1,000 times without any errors. And if you are not interested in building this function call yourself, LangChain provides an Anthropic wrapper that simplifies the process with an easy-to-use call format:
from langchain_anthropic import ChatAnthropic

model = ChatAnthropic(model="claude-3-opus-20240229", temperature=0)
structured_llm = model.with_structured_output(Joke)
structured_llm.invoke("Tell me a joke about cats. Make sure to call the Joke function.")
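Note that Joke is not defined in the snippet above; a minimal sketch of the Pydantic model, following the example in the LangChain documentation:
```python
from pydantic import BaseModel, Field

class Joke(BaseModel):
    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline to the joke")
```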
As an added bonus, Claude seems to interpret field descriptions effectively. This means that if you are dumping a JSON schema from a Pydantic class defined like this:
class EvalResponse(BaseModel):
    sentiment: str = Field(description="dialog sentiment (positive/negative/neutral)")
    key_issues: list[str] = Field(description="list of 5 issues found in the dialog")
    action_items: list[ActionItem] = Field(description="list of dicts with 'team' and 'task'")
you may actually receive an object that follows your desired description.
Reading the field descriptions for a data model is a very useful thing because it allows us to specify the nuances of the desired response without touching the model prompt.
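To see exactly what the model receives, you can dump the schema yourself; the field descriptions appear under each property, which is how these hints reach Claude without any prompt changes:
```python
import json

# Print the JSON schema passed as the tool's input_schema; each property
# carries the text from its Field(description=...) annotation.
print(json.dumps(EvalResponse.model_json_schema(), indent=2))
```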
Google Gemini Pro 1.5
Google's documentation clearly states that prompt-based methods for generating JSON are unreliable and restricts more advanced configurations, such as using an OpenAPI "schema" parameter, to the flagship Gemini Pro model family. Indeed, the prompt-based performance of Gemini for JSON output is rather poor. When simply asked for a JSON, the model frequently wraps the output in a Markdown preamble:
```json
{
  "sentiment": "negative",
  "key_issues": [
    "Bot misunderstood user confirmation.",
    "Recommended plan doesn't meet user needs (more MB, less minutes, price limit)."
  ],
  "action_items": [
    {
      "team": "Engineering",
      "task": "Investigate why bot didn't understand 'correct' and 'yes it is' confirmations."
    },
    {
      "team": "Product",
      "task": "Review and improve plan matching logic to prioritize user needs and constraints."
    }
  ]
}
```
A more nuanced configuration requires switching Gemini into a 'JSON' mode by specifying the output MIME type:
generation_config={"response_mime_type": "application/json"}
But this also fails to work reliably, because from time to time the model still fails to return a parseable JSON string.
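For reference, here is a minimal sketch of this mode with the google-generativeai SDK, wrapped in a retry loop of our own to cope with the occasional unparseable string (PROMPT and google_api_key are assumed to be defined as in the other listings):
```python
import json
import google.generativeai as genai

genai.configure(api_key=google_api_key)  # assumes google_api_key is defined
model = genai.GenerativeModel(
    model_name="gemini-1.5-pro-latest",
    system_instruction=PROMPT,  # assumes the same evaluation prompt as above
    generation_config={"response_mime_type": "application/json"},
)

def generate_json(dialog_text: str, retries: int = 3) -> dict | None:
    for _ in range(retries):
        response = model.generate_content(dialog_text)
        try:
            return json.loads(response.text)
        except json.JSONDecodeError:
            continue  # JSON mode still occasionally yields an unparseable string
    return None
```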
Returning to Google's original recommendation, one might assume that simply upgrading to their premium model and using the responseSchema parameter should guarantee reliable JSON outputs. Unfortunately, the reality is more complex. Google offers several ways to configure the responseSchema: by providing an OpenAPI model, an instance of a user class, or a reference to Google's proprietary genai.protos.Schema.
While all these methods are effective at producing valid JSONs, only the latter consistently ensures that the model emits all 'required' fields. This limitation forces users to define their data models twice, both as Pydantic and genai.protos.Schema objects, while also losing the ability to convey additional information to the model through field descriptions:
class ActionItem(BaseModel):
    team: str
    task: str

class EvalResponse(BaseModel):
    sentiment: str = Field(description="dialog sentiment (positive/negative/neutral)")
    key_issues: list[str] = Field(description="list of three issues found in the dialog")
    action_items: list[ActionItem] = Field(description="list of dicts with 'team' and 'task'")

g_str = genai.protos.Schema(type=genai.protos.Type.STRING)

g_action_item = genai.protos.Schema(
    type=genai.protos.Type.OBJECT,
    properties={
        'team': genai.protos.Schema(type=genai.protos.Type.STRING),
        'task': genai.protos.Schema(type=genai.protos.Type.STRING),
    },
    required=['team', 'task']
)

g_evaluation = genai.protos.Schema(
    type=genai.protos.Type.OBJECT,
    properties={
        'sentiment': genai.protos.Schema(type=genai.protos.Type.STRING),
        'key_issues': genai.protos.Schema(type=genai.protos.Type.ARRAY, items=g_str),
        'action_items': genai.protos.Schema(type=genai.protos.Type.ARRAY, items=g_action_item),
    },
    required=['sentiment', 'key_issues', 'action_items']
)

def gemini_setup():
    genai.configure(api_key=google_api_key)
    return genai.GenerativeModel(
        model_name='gemini-1.5-pro-latest',
        system_instruction=PROMPT,
        generation_config={
            "response_mime_type": "application/json",
            "response_schema": g_evaluation,
        },
    )
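With this setup in place, a minimal usage sketch might look as follows (dialog_text is a placeholder for the conversation to evaluate); since response_schema enforces the required keys, the Pydantic twin serves only as a final sanity check:
```python
import json

model = gemini_setup()
response = model.generate_content(dialog_text)  # dialog_text: conversation to evaluate

# response_schema guarantees valid JSON with all required fields,
# so validating against the Pydantic twin is just a sanity check here.
evaluation = EvalResponse(**json.loads(response.text))
```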
OpenAI GPT-4o
Among the three LLM providers we have examined, OpenAI offers the most flexible solution with the simplest configuration. Their "Structured Outputs API" can directly accept a Pydantic model, enabling it to read both the data model and the field descriptions effortlessly:
from pydantic import BaseModel, Field, field_validator

class Suggestion(BaseModel):
    suggestion: str = Field(description="Suggestion to improve the bot, starting with letter K")

class Evaluation(BaseModel):
    result: str = Field(description="whether a dialog was successful, either Yes or No")
    explanation: str = Field(description="rationale behind the decision on result")
    suggestions: list[Suggestion] = Field(description="Six ways to improve a bot")

    @field_validator("result")
    def check_literal(cls, value):
        if not (value in ["Yes", "No"]):
            print(f"Literal Yes/No not followed: {value}")
        return value

    @field_validator("suggestions")
    def count_suggestions(cls, value):
        if len(value) != 6:
            print(f"Array length of 6 not followed: {value}")
        count = sum(1 for item in value if item.suggestion.startswith("K"))
        if len(value) != count:
            print(f"{len(value) - count} suggestions do not start with K")
        return value

def eval_dialogue(client, file: File) -> Evaluation:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": prompt},  # `prompt` is defined elsewhere in the notebook
            {"role": "user", "content": file.read()},
        ],
        response_format=Evaluation,
    )
    return completion.choices[0].message.parsed
In terms of robustness, OpenAI presents a graph comparing the success rates of their 'Structured Outputs' API versus prompt-based solutions, with the former achieving a success rate very close to 100%.
However, the devil is in the details. While OpenAI's JSON performance is 'close to 100%', it is not entirely bulletproof. Even with a perfectly configured request, we found that a broken JSON still occurs in about one out of every few thousand calls, especially if the prompt is not carefully crafted, and a retry is then required.
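A simple retry wrapper around the parsing call, sketched below under the same assumptions as the previous listing, is usually enough to absorb these rare failures (this is our own pattern, not an official OpenAI recipe):
```python
def eval_with_retry(client, file: File, retries: int = 3) -> Evaluation | None:
    for _ in range(retries):
        try:
            return eval_dialogue(client, file)
        except Exception as e:  # e.g. a refusal or a malformed completion
            print(f"Retrying after error: {e}")
    return None
```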
Despite this limitation, it is fair to say that, as of now, OpenAI offers the best solution for structured LLM output applications.
Note: the author is not affiliated with OpenAI, Anthropic, or Google, but contributes to the open-source development of LLM orchestration and evaluation tools like DataChain.
Links
Test Jupyter notebook:
https://github.com/iterative/datachain-examples/blob/main/formats/JSON-outputs.ipynb
Anthropic JSON API:
https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency
Anthropic function calling:
https://docs.anthropic.com/en/docs/build-with-claude/tool-use#forcing-tool-use
LangChain Structured Output API:
https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/
Google Gemini JSON API:
https://ai.google.dev/gemini-api/docs/json-mode?lang=python
Google genai.protos.Schema examples:
OpenAI “Structured Outputs” announcement:
https://openai.com/index/introducing-structured-outputs-in-the-api/
OpenAI’s Structured Outputs API:
https://platform.openai.com/docs/guides/structured-outputs/introduction