MEHMET BALIOGLU

How to Generate OpenAI GPT Output in JSON Format Using Python for Legal Text Analysis

JSON output with OpenAI GPT function calls

Generating consistent JSON output is now possible thanks to the function calling feature in OpenAI’s gpt-4-0613 and gpt-3.5-turbo-0613 models. Function calling introduces a systematic approach to generating structured data: you describe functions to the model, and it produces output in JSON format rather than as unstructured text. This development addresses the inefficiencies of previous approaches, such as prompt engineering, which were inconsistent and necessitated significant post-processing.

One notable application of this feature is in the domain of legal text analysis. Legal documents are often characterized by dense, unstructured text, which makes information extraction particularly challenging. However, by describing functions to the gpt-4-0613 and gpt-3.5-turbo-0613 models, it is possible to efficiently extract information from unstructured legal texts and convert it into a structured JSON format. For instance, users can define functions that instruct the models to identify and extract specific elements, such as case references, statutes, or contractual terms. The models then process the text and generate structured outputs that systematically organize the extracted information.

As of this writing, you can describe functions only to the gpt-4-0613 and gpt-3.5-turbo-0613 models. So, for example, the gpt-4 and gpt-3.5-turbo models won’t work.
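
If you are unsure whether your API key has access to these snapshots, you can list the models available to your account. Below is a minimal sketch using the same openai Python library as the rest of this article; filtering on the "-0613" suffix is just a convenient heuristic for spotting the June 2023 snapshots, not an official flag.

import openai

# List the models available to this API key and keep the June 2023 snapshots,
# which are the ones that support function calling at the time of writing.
models = openai.Model.list()
function_calling_models = [m.id for m in models.data if m.id.endswith("-0613")]
print(function_calling_models)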

In this article, I will address the following points:

  • What Does GPT’s Function Calling Bring to the Table?
  • OpenAI GPT JSON Output: What are the Challenges?
  • How to Get JSON Response from OpenAI GPT: A Legal Text Analysis Example

What Does GPT’s Function Calling Bring to the Table?

Function calling in the gpt-4-0613 and gpt-3.5-turbo-0613 models brings a new dimension to the generation of structured data. By allowing you to describe functions to these models and receive JSON output instead of unstructured text, it effectively eliminates the hassles of the past, when “prompt engineering” was the norm. That hit-or-miss approach was fraught with inconsistencies and required significant post-processing and debugging.

The benefits of GPT’s function calling are manifold:

Enhanced Reliability: By producing structured JSON outputs, the models offer a higher degree of consistency and reliability.

Interoperability: Structured JSON output allows multiple systems to communicate seamlessly, since JSON is a universal language in modern software systems.

Diverse Applications: The new feature finds applications in a variety of sectors, including:

  • Chat assistants
  • Data extraction and analysis
  • Customer support
  • E-commerce

A significant advantage is that JSON, widely regarded as the common language of contemporary software systems, enables seamless communication between multiple systems. Consequently, we are edging closer to employing Large Language Models (LLMs) as the backbone of intricate applications.

OpenAI GPT JSON Output: What are the Challenges?

However, a word of caution is necessary. We are still navigating the early stages of this technology, and it is important to be cognizant of certain limitations. Despite the remarkable strides made in AI development, challenges surrounding determinism persist. To elucidate for those who may not be technically inclined, determinism implies that identical inputs should consistently yield the same outputs.

For instance, a Twitter user highlighted a case where the system deviated from the expected units of measurement. When asked to create a recipe, GPT-4 chose “clove” as a unit for garlic instead of the pre-determined units of grams or teaspoons. It may be argued that AI should be treated more like a supervised “intern” than a tool embedded in a process. On the other hand, “clove” is indeed a more natural unit for garlic, so in this case it was not an error but a beneficial refinement. However, it was unexpected, and therein lies the challenge.

It’s important to note that results may vary. Based on my observations, the reliability of the JSON output is inherently tied to the cognitive load on GPT, which, at present, is non-deterministic. Most of the time, I have succeeded in getting the desired output in JSON format, but in some instances there are glitches.

There are two strategies to mitigate the cognitive load on the model:

1. Employ a more concise context: Ensure that your input doesn’t exceed 50% of the maximum context capacity of your model. For instance, if you’re using gpt-3.5-turbo-0613, which has a maximum context of 4096 tokens, your input should be limited to 2048 tokens or fewer (see the token-counting sketch after this list). The objective is to strike a balance between not overwhelming the model and providing enough context for meaningful output.

2. Refrain from extracting excessive information simultaneously: In my legal text analysis example, when I attempted to extract legal terms, arguments, normative conclusions, and case references within a single JSON schema, I encountered incorrect output, such as the model confusing legal terms with case names. The challenge here is to sustain consistently accurate output while keeping API costs low. You may therefore consider dividing the JSON schema into smaller segments and iterating over the input text for different output objectives. Bear in mind that this approach will increase API costs.
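
For the first strategy, it helps to measure your input before sending it. Below is a minimal sketch using the tiktoken library; input_text is a placeholder for your own document, and the 2048-token threshold simply reflects the 50% rule of thumb above.

import tiktoken

# Tokenizer used by gpt-3.5-turbo-0613 (and gpt-4-0613).
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo-0613")

input_text = "..."  # your legal text goes here
num_tokens = len(encoding.encode(input_text))

# Keep the input below roughly half of the 4096-token context window, leaving
# room for the system prompt, the function schema, and the JSON output.
if num_tokens > 2048:
    print(f"Input is {num_tokens} tokens; consider splitting it into chunks.")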

Another factor that affects the quality of the JSON output is the choice of GPT model. All other things being equal, GPT-4 generally produces more reliable output than GPT-3.5-turbo. However, GPT-4 is also considerably more expensive per token, so the choice of model depends on the specific use case. If you are able to review the output regularly and cost is a concern, the GPT-3.5-turbo model may be the better choice.

How to Get JSON Response from OpenAI GPT: A Legal Text Analysis Example

In the example that follows, I analyze an excerpt from an investment arbitration case, Glamis Gold v United States of America. My input text comprises 600 tokens, and I aim to extract the case name, year, applicable treaties, description, and case references.

After importing the openai library, I define a Python dictionary named schema_detailed_cases, which outlines the schema, or structure, of an object meant to store information about arbitral cases. The object has a property named referred_arbitral_cases, which is an array of objects. Each object in this array should have the properties case_name, year, applicable_treaties, description, and reference. The schema defines the data type and description for each of these properties.

import openai

# The openai library reads the API key from the OPENAI_API_KEY environment
# variable; alternatively, set openai.api_key explicitly before making requests.

# JSON Schema describing the structured output we want the model to return.
schema_detailed_cases = {
	"type": "object",
	"properties": {
		"referred_arbitral_cases": {
			"type": "array",
			"items": {
				"type": "object",
				"properties": {
					"case_name": {
						"type": "string",
						"description": "The name of the case."
					},
					"year": {
						"type": "integer",
						"description": "The year the case was decided."
					},
					"applicable_treaties": {
						"type": "array",
						"items": {
							"type": "string",
							"description": "The treaties applicable to the case."
						}
					},
					"description": {
						"type": "string",
						"description": "A brief description of the case."
					},
					"reference": {
						"type": "string",
						"description": "The reference to the case decision."
					}
				},
				"required": ["case_name", "year", "applicable_treaties", "description", "reference"]
			}
		}
	},
	"required": ["referred_arbitral_cases"]
}


text = """The Tribunal notes that numerous NAFTA tribunals have wrestled with the question of the scope and bounds of "fair and equitable treatment" and the duties and obligations that this treatment requires of a State Party. Probably the most comprehensive review was done by the tribunal in Waste Management, in which it attempted a survey of the holdings to date in NAFTA jurisprudence:
Taken together, the S.D. Myers, Mondev, ADF and Loewen cases suggest that the minimum standard of treatment... of fair and equitable treatment is infringed by conduct attributable to the State and harmful to the claimant if the conduct is arbitrary, grossly unfair, unjust or idiosyncratic, is discriminatory and exposes the claimant to sectional or racial prejudice, or involves a lack of due process leading to an outcome which offends judicial propriety - as might be the case with a manifest failure of natural justice in judicial proceedings or a complete lack of transparency and candour in an administrative process. In applying this standard it is relevant that the treatment is in breach of representations made by the host State which were reasonably relied on by the claimant.1128
The tribunal in GAMI primarily followed this line of reasoning, extracting four "implications" that it found particularly salient:

"The failure to fulfill the objectives of administrative regulations without more does not necessarily rise to a breach of international law;"
"A failure to satisfy requirements of national law does not necessarily violate international law;"
"Proof of a good faith effort by the Government to achieve the objectives of its laws and regulations may counter-balance instances of disregard of legal or regulatory requirements;" and
"The record as a whole - not isolated events - determines whether there has been a breach of international law."1129
Waste Management, Award, ¶ 98 (Apr. 30, 2004). As noted above at footnote 1087, Claimant is not arguing a duty of non-discrimination as a duty separate from those included in the requirement of fair and equitable treatment under Article 1105.

GAMI Investments, Final Award, ¶ 97 (Nov. 15, 2004).

The tribunal in International Thunderbird Gaming had a slightly different holding: "the Tribunal views acts that would give rise to a breach of the minimum standard of treatment prescribed by the NAFTA and customary international law as those that, weighed against the given factual context, amount to a gross denial of justice or manifest arbitrariness falling below acceptable international standards."1130 Although bad faith would meet the standards described, most tribunals agree that a breach of Article 1105 does not require bad faith.1131

International Thunderbird, Award, ¶ 194 (Jan. 26, 2006).

See Loewen, Award, ¶ 132 (June 26, 2003); Mondev, Award, ¶ 115 (Oct 11, 2002); Waste Management, Award, ¶ 93 (Apr. 30, 2004)."""


# Describe the extract_info function via its JSON Schema and force the model to
# call it, so the response comes back as structured arguments rather than prose.
completion = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[
        {"role": "system", "content": "You are a helpful legal assistant."},
        {"role": "user", "content": text},
    ],
    functions=[{"name": "extract_info", "parameters": schema_detailed_cases}],
    function_call={"name": "extract_info"},  # force a call to extract_info
    temperature=0,  # reduce randomness for more repeatable extractions
)

# The structured result arrives as a JSON string in function_call.arguments.
print(completion.choices[0].message.function_call.arguments)

This is the output:

{
  "referred_arbitral_cases": [
    {
      "case_name": "Waste Management",
      "year": 2004,
      "applicable_treaties": ["NAFTA"],
      "description": "The tribunal attempted a survey of the holdings to date in NAFTA jurisprudence, suggesting that the minimum standard of treatment of fair and equitable treatment is infringed by conduct attributable to the State and harmful to the claimant if the conduct is arbitrary, grossly unfair, unjust or idiosyncratic, is discriminatory and exposes the claimant to sectional or racial prejudice, or involves a lack of due process leading to an outcome which offends judicial propriety.",
      "reference": "Waste Management, Award, ¶ 98 (Apr. 30, 2004)"
    },
    {
      "case_name": "GAMI Investments",
      "year": 2004,
      "applicable_treaties": ["NAFTA"],
      "description": "The tribunal primarily followed the line of reasoning of Waste Management, extracting four implications that it found particularly salient.",
      "reference": "GAMI Investments, Final Award, ¶ 97 (Nov. 15, 2004)"
    },
    {
      "case_name": "International Thunderbird Gaming",
      "year": 2006,
      "applicable_treaties": ["NAFTA"],
      "description": "The tribunal held that acts that would give rise to a breach of the minimum standard of treatment prescribed by the NAFTA and customary international law are those that, weighed against the given factual context, amount to a gross denial of justice or manifest arbitrariness falling below acceptable international standards.",
      "reference": "International Thunderbird, Award, ¶ 194 (Jan. 26, 2006)"
    },
    {
      "case_name": "Loewen",
      "year": 2003,
      "applicable_treaties": ["NAFTA"],
      "description": "The tribunal agreed that a breach of Article 1105 does not require bad faith.",
      "reference": "Loewen, Award, ¶ 132 (June 26, 2003)"
    },
    {
      "case_name": "Mondev",
      "year": 2002,
      "applicable_treaties": ["NAFTA"],
      "description": "The tribunal agreed that a breach of Article 1105 does not require bad faith.",
      "reference": "Mondev, Award, ¶ 115 (Oct 11, 2002)"
    }
  ]
}
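
Note that function_call.arguments comes back as a JSON string, not as a Python object. For downstream processing you will typically parse it, and you can optionally check that it conforms to the schema. Here is a minimal sketch that continues the example above; it assumes the third-party jsonschema package is installed.

import json
import jsonschema

# Parse the JSON string returned by the model into a Python dictionary.
arguments = completion.choices[0].message.function_call.arguments
result = json.loads(arguments)

# Optionally verify that the output actually conforms to schema_detailed_cases.
jsonschema.validate(instance=result, schema=schema_detailed_cases)

for case in result["referred_arbitral_cases"]:
    print(case["case_name"], case["year"])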

The JSON output is not bad; however, it failed to capture two arbitration cases: ADF v. USA and S.D. Myers v. Canada. Furthermore, it did not produce the full names of the cases and only included the claimants. This is common practice in citations, though: parties and tribunals often refer to cases using only a single name, such as Loewen or Mondev. I kept the user prompt short, so it might be worthwhile to steer the GPT model further through user or system prompts, as sketched below.
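
One way to attempt that steering is to make the system prompt more explicit about what counts as a complete answer. The prompt below is only an illustrative sketch of this idea, not a tested recipe; it reuses the text, schema, and settings from the example above.

# A more directive system prompt, asking the model to capture every cited case,
# including those mentioned only inside quoted passages.
completion = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful legal assistant. Extract every arbitral case "
                "referred to in the text, including cases mentioned only in "
                "passing or inside quoted passages."
            ),
        },
        {"role": "user", "content": text},
    ],
    functions=[{"name": "extract_info", "parameters": schema_detailed_cases}],
    function_call={"name": "extract_info"},
    temperature=0,
)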

If the comprehensiveness of the response is important to you and the output can be limited to predefined options, it might be a good idea to present the allowed values to the model as an array. In our case, there are more than 800 known investment arbitration cases, in addition to legacy cases from before the treaty arbitration era, which are still frequently cited by tribunals. This implies that if we provide all the case names to the model as an array, we will likely overwhelm the model’s cognitive capacity. As a result, we are still waiting for GPT models with larger context windows. However, I should note that this function-calling approach works well for the extraction of keywords or normative conclusions.
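
For completeness, here is what the “predefined options” idea could look like in the schema itself. JSON Schema supports an enum of allowed values for a property; the handful of case names below is purely illustrative and nowhere near the full list of known cases.

# Constrain case_name to a predefined list of values via a JSON Schema enum.
# With 800+ known cases, a full enum would consume a large share of the model's
# context window, which is exactly the trade-off discussed above.
known_cases = [
    "Waste Management", "GAMI Investments", "International Thunderbird Gaming",
    "Loewen", "Mondev", "ADF", "S.D. Myers",
]

case_properties = schema_detailed_cases["properties"]["referred_arbitral_cases"]["items"]["properties"]
case_properties["case_name"] = {
    "type": "string",
    "enum": known_cases,
    "description": "The name of the case, chosen from the predefined list.",
}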