JSONSchemaBench: Structured Output Benchmark for Language Models
Core Contribution
- Benchmark introduction: JSONSchemaBench, comprising 10K real-world JSON schemas for evaluating constrained decoding frameworks
  "We introduce JSONSchemaBench, a benchmark for constrained decoding comprising 10K real-world JSON schemas that encompass a wide range of constraints with varying complexity"
- Three-dimensional evaluation framework assessing the efficiency, coverage, and quality of constrained decoding approaches
  "We present an evaluation framework to assess constrained decoding approaches across three critical dimensions: efficiency in generating constraint-compliant outputs, coverage of diverse constraint types, and quality of the generated outputs"
- Six frameworks evaluated: Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini
  "We evaluate six state-of-the-art constrained decoding frameworks, including Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini"
Key Findings
Efficiency Results
- 50% speedup potential from constrained decoding over unconstrained generation
  "Constrained decoding can speed up the generation process by 50% compared to unconstrained decoding"
- Guidance achieves the best throughput via its guidance acceleration technique
  "Guidance achieves even higher efficiency, which it accomplishes by fast-forwarding certain generation steps with its guidance acceleration"
- Compilation time varies significantly: Outlines has "significantly higher compilation time," while Guidance and Llamacpp have "minimal grammar compilation time"
Coverage Results
- Coverage disparity: the best framework supports twice as many schemas as the worst
  "Frameworks demonstrate significant differences in their actual support for real-world JSON schemas, with the best framework supporting twice as many schemas as the worst"
- Guidance leads empirical coverage on 6 of 8 datasets
  "Guidance shows the highest empirical coverage on six out of the eight datasets"
- Compliance rate is critical: Guidance demonstrates the "highest compliance rate across all datasets, making it the most reliable option for ensuring schema compliance"
Quality Results
- Performance improvement: constrained decoding improves downstream task accuracy by up to 4%
  "Constrained decoding consistently improves the performance of downstream tasks up to 4%, even for tasks with minimal structure like GSM8k"
- Guidance best for quality: "Guidance consistently delivers the best performance across all tasks, with approximately a 3% improvement over the LM-only approach in every task"
Core Concepts
Coverage Definitions
- Declared Coverage: the framework processes the schema without rejection or runtime errors
  "A schema is considered declared covered if the framework processes the schema without explicitly rejecting it or encountering runtime errors such as exceptions or crashes"
- Empirical Coverage: experiments show the framework produces schema-compliant outputs
  "A schema is considered empirically covered if our experiments show that the constraints generated by the framework result in LM outputs that are schema-compliant"
- True Coverage: the framework's constraints are precisely equivalent to the original JSON Schema definition
  "A schema is considered truly covered if the framework produces constraints that are precisely equivalent to the original JSON Schema definition"
- Compliance Rate: ratio of empirical to declared coverage
  "Compliance Rate = C_Empirical/C_Declared... estimates the reliability of the constrained decoding framework in guaranteeing compliance given it accepts a given schema"
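The three coverage metrics compose simply. A minimal sketch with hypothetical per-schema results (the flags and counts are illustrative, not from the paper):

```python
# Each entry: (accepted_by_framework, output_was_schema_compliant).
# Hypothetical results for four schemas run through one engine.
results = [
    (True, True),    # declared and empirically covered
    (True, False),   # accepted, but the LM output violated the schema
    (True, True),
    (False, False),  # rejected at compile time: not declared covered
]

declared = sum(1 for accepted, _ in results if accepted)
empirical = sum(1 for accepted, ok in results if accepted and ok)

declared_coverage = declared / len(results)    # 3/4
empirical_coverage = empirical / len(results)  # 2/4
compliance_rate = empirical / declared         # C_Empirical / C_Declared = 2/3
```

Note that a framework can have low declared coverage yet a perfect compliance rate: the metric only scores schemas the engine claims to handle.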
Failure Modes
- Over-constrained: framework rejects valid JSON instances
  "A framework is over-constrained if it rejects JSON instances that are valid according to a given JSON Schema. This means the engine is too strict and excludes outputs that should be allowed"
- Under-constrained: framework allows invalid JSON instances
  "A framework is under-constrained if it allows JSON instances that are invalid according to a given JSON Schema. This means the engine is overly permissive and allows outputs that should be rejected"
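The two failure modes can be contrasted on one toy schema. The checker functions below are hypothetical stand-ins for buggy engine implementations, not code from any real framework:

```python
# Reference schema: an object with a required "name" string of at most 5 chars.
schema = {
    "type": "object",
    "required": ["name"],
    "properties": {"name": {"type": "string", "maxLength": 5}},
}

def under_constrained_check(instance):
    # Bug: ignores maxLength entirely -> admits instances the schema rejects.
    return isinstance(instance, dict) and isinstance(instance.get("name"), str)

def over_constrained_check(instance):
    # Bug: wrongly caps the length at 3 instead of 5 -> rejects valid instances.
    return (isinstance(instance, dict)
            and isinstance(instance.get("name"), str)
            and len(instance["name"]) <= 3)

valid = {"name": "Ada"}            # valid per the schema
valid_longer = {"name": "Alice"}   # still valid (exactly 5 chars)
invalid = {"name": "Archimedes"}   # violates maxLength
```

The under-constrained checker accepts `invalid`; the over-constrained one rejects `valid_longer`. Both bugs are invisible to declared coverage and only surface in empirical coverage.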
Evaluation Methodology
Efficiency Metrics
- Grammar Compilation Time (GCT): Time spent on grammar compilation
- Time to First Token (TTFT): Time from generation start to first token
- Time per Output Token (TPOT): Average time per token after first
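The three metrics fall out of per-token timestamps. A sketch with made-up timings (all numbers are illustrative):

```python
# Hypothetical wall-clock timestamps, in seconds.
compile_start, compile_end = 0.00, 0.12  # grammar compilation window
gen_start = 0.12                          # generation begins after compilation
token_times = [0.30, 0.33, 0.36, 0.40]    # emission time of each output token

gct = compile_end - compile_start                     # Grammar Compilation Time
ttft = token_times[0] - gen_start                     # Time to First Token
# Average inter-token gap after the first token:
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
```

Separating GCT from TTFT matters because some engines (e.g. per the paper, Outlines) pay a large one-time compilation cost that amortizes over repeated use of the same schema.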
Dataset Composition
- 10 dataset suites with varying complexity; the GitHub schemas are further grouped by size:
  "We split the data into five collections based on the schema size: trivial, small, medium, large, ultra"
- Sources include:
- GitHub schemas (6,000 from Baazizi et al., 2021)
- JSON Schema Test Suite (official test cases)
- Schema Store (largest independent collection)
- GlaiveAI function calling dataset (2,000 schemas)
- Kubernetes configuration files
- Washington Post ANS specification
Experimental Setup
- Model used: Llama-3.1-8B-Instruct for efficiency, Llama-3.2-1B-Instruct for coverage
- Hardware: Single NVIDIA A100-SXM4-80GB GPU with AMD EPYC 7543 CPU
- Validation: jsonschema Python library with Draft2020-12, format checks enabled
- Generation: Greedy decoding, zero temperature, single run, 40s timeouts
JSON Schema Test Suite Analysis
- 45 categories covering JSON Schema features (43 after filtering)
  "The test suite organizes its test cases into 45 categories, each of which corresponds to a feature of JSON Schema"
- Guidance dominates:
  "Guidance outperforms other engines at all coverage levels, achieving full coverage on 13 categories and moderate coverage on 21"
- Failure patterns vary:
  "Outlines, Llamacpp, and Guidance follow a consistent failure pattern, with most errors occurring during compilation and over-constrained failures being more frequent than under-constrained ones. In contrast, XGrammar minimizes compilation errors but shows the highest number of under-constrained failures"
Quality Evaluation Tasks
Three Reasoning Tasks
- Last Letter: CoT reasoning + answer in a-z
  "Input: Ian Peter Bernard Stephen Output: nrdn"
- Shuffle Objects: CoT reasoning + answer in A-E
  "Input: Sequence of exchanges among individuals + choices Output: A-E"
- GSM8K: CoT reasoning + answer as integer/float
  "Input: Basic calculation problems Output: Number, e.g. 8"
- JSON structure: all tasks use the {"reasoning": <reasoning>, "answer": <final answer>} format
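A plausible schema for that shared output format, sketched below; the exact schema likely differs per task (e.g. a numeric `answer` for GSM8K), so treat the field types as assumptions:

```python
import json

# Hypothetical schema for the shared {"reasoning": ..., "answer": ...} format.
answer_schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["reasoning", "answer"],
    "additionalProperties": False,
}

# Example model output for the Last Letter task.
raw = '{"reasoning": "last letters of each name: n, r, d, n", "answer": "nrdn"}'
parsed = json.loads(raw)  # parse succeeds only if the output is well-formed JSON
```

Constrained decoding guarantees that `json.loads` never fails here; without it, a single stray token makes the whole output unparseable.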
Problem Context
Motivation
- Machine-oriented applications require structured outputs
  "Unlike traditional natural language processing (NLP) tasks where the output is aimed at review by humans, output in these applications is often consumed by machines such as controller and service APIs"
- Probabilistic generation is problematic:
  "However, the LM generation process is probabilistic and does not provide guarantees on the output's structure, making it challenging to deploy LMs in applications requiring structured inputs and high reliability"
- JSON Schema as the standard:
  "JSON Schema has emerged as a key specification language for constrained decoding... Commercial LM providers, such as OpenAI, have embraced constrained decoding by incorporating support for JSON Schema directly into their APIs"
Research Gap
- Evaluation under-explored:
  "The evaluation of constrained decoding remains an under-explored topic, with no consensus on what defines the effectiveness of constrained decoding"
- No framework comparisons: prior studies "fail to provide comparisons across different constrained decoding frameworks"
- Limited benchmarks: previous benchmarks are "narrowly focused on specific tasks or rely on formal-grammar–based artificial setups, that have unclear relevance to real-world use cases"
Technical Implementation
Constrained Decoding Algorithm
- Core mechanism: token masking at each generation step
  "Constrained decoding intervenes in the decoding process of LMs by masking out invalid tokens based on given constraints and prefix tokens"
- Algorithm steps: update constraint state → compute mask → calculate logits → apply mask → sample token → append to output
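The steps above can be sketched on a toy vocabulary. Real engines derive the mask from a compiled grammar and take logits from an LM forward pass; here the "constraint" is just prefix-matching against a fixed target string, and the logits are uniform, purely for illustration:

```python
VOCAB = ["{", '"a"', ":", "1", "}"]
TARGET = '{"a":1}'  # the only output the toy constraint admits

def compute_mask(prefix):
    # Step 2: mark each vocabulary token as valid/invalid given the prefix.
    return [TARGET.startswith(prefix + tok) for tok in VOCAB]

def fake_logits(prefix):
    # Stand-in for the LM forward pass (step 3): uniform scores.
    return [0.0] * len(VOCAB)

output = ""
while output != TARGET:
    mask = compute_mask(output)                       # compute mask
    logits = fake_logits(output)                      # calculate logits
    masked = [l if ok else float("-inf")              # apply mask
              for l, ok in zip(logits, mask)]
    best = max(range(len(VOCAB)), key=lambda i: masked[i])  # greedy "sample"
    output += VOCAB[best]                             # append to output
```

Because masking happens before sampling, an invalid token can never be chosen, which is what makes the structural guarantee hard rather than probabilistic.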
Addressing Coverage Bias
- Intersection-based metrics: efficiency is calculated only on schemas all engines support
  "To ensure fairness, we calculate efficiency metrics on the intersection of covered instances across all engines"
- Rationale: "Engines with lower coverage often process simpler, shorter schemas, which naturally compile and generate faster"
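This fairness adjustment is a straightforward set intersection; a sketch with hypothetical engine names, schema IDs, and timings:

```python
# Hypothetical coverage sets: which schema IDs each engine handled.
covered = {
    "guidance": {"s1", "s2", "s3", "s4"},
    "outlines": {"s1", "s3", "s4"},
    "xgrammar": {"s1", "s2", "s3"},
}
# Hypothetical per-schema TTFT measurements (seconds).
ttft = {"s1": 0.08, "s2": 0.20, "s3": 0.12, "s4": 0.50}

# Only schemas every engine covers enter the efficiency average.
common = set.intersection(*covered.values())
mean_ttft = sum(ttft[s] for s in common) / len(common)
```

Averaging over each engine's own covered set instead would reward low-coverage engines, since (per the paper) the schemas they manage to cover tend to be the easy, fast ones.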
Key Tools & Frameworks
Open Source Engines
- Guidance (Guidance AI, 2023): Dynamic constraint computation, token healing
- Outlines (Willard & Louf, 2023): Regex-based constraints
- Llamacpp (Gerganov et al., 2023): Grammar module, GBNF (GGML BNF) grammars
- XGrammar (Dong et al., 2024): Concurrent compilation with pre-filling
Closed Source APIs
- OpenAI: JSON Schema support in API
- Gemini: Limited schema support (<1% on some datasets)
Validation & Testing
- jsonschema Python library (Berman, 2025): Draft2020-12 validation
- JSON Schema Test Suite (JSON Schema Org, 2024): Official correctness tests
- Bowtie (2025): Cross-implementation comparison tool
Related Work
JSON Schema Collections
- Baazizi et al., 2021: 6,000+ schemas from GitHub repositories
- Attouche et al., 2022: Witness generation algorithm evaluation
- Schema Store: Largest independent JSON schema collection
Constrained Decoding Research
- Early methods:
  - Deutsch et al., 2019: General-purpose constrained sequential inference
  - Shin et al., 2021: Constrained language models as semantic parsers
  - Scholak et al., 2021: PICARD for incremental parsing
- Optimization frameworks:
  - Beurer-Kellner et al., 2023: Prompting as programming query language
  - Kuchnik et al., 2023: ReLM validation
  - Zheng et al., 2024: SGLang structured execution
Quality Concerns
- Geng et al., 2023: Grammar-constrained decoding without finetuning
- Tam et al., 2024: Study on format restriction impacts
- Kurt, 2024: Response to performance decline concerns, argues issues stem from "inadequate prompting, insufficient contextual information, and poorly crafted schemas"
- Geng et al., 2024: Tokenization ambiguity issues
Applications
- Yao et al., 2023: Web navigation (ReAct)
- Polak & Morgan, 2024: Materials data extraction
- Schick et al., 2023: Toolformer for tool use
Advanced Features
Optimization Techniques
- Grammar caching: Reuse compiled constraints
- Parallel execution: "Mask computation can run in parallel with the LM's forward pass, and grammar compilation can be performed concurrently with pre-filling computations"
- Speculative decoding: Constraint-based approach (GuidanceAI, 2024)
- Token healing: Guidance's technique for handling tokenization boundaries (GuidanceAI, 2024)
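Grammar caching in particular has a very small core; a toy sketch where `compile_grammar` is a hypothetical stand-in for a real engine's compiler:

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts how many real compilations happen

@lru_cache(maxsize=None)
def compile_grammar(schema_json: str):
    # Stand-in for an expensive grammar compilation step.
    CALLS["n"] += 1
    return ("compiled", schema_json)

s = '{"type": "string"}'
compile_grammar(s)
compile_grammar(s)  # cache hit: the compiler body does not run again
```

Caching keys on the serialized schema, so it pays off exactly in the API-serving pattern the paper targets, where the same schema is reused across many requests and the GCT cost amortizes to near zero.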
JSON Schema Features
- Most common keywords: type, properties, required, description, items, enum
- Format constraints: date-time, email, uri, uuid (each complex to implement)
- Advanced features: $defs, $ref, if-then-else, minItems, maxItems, pattern matching
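An illustrative schema (hypothetical, not from the benchmark) exercising most of the keywords listed above:

```json
{
  "$defs": {
    "tag": { "type": "string", "enum": ["draft", "final"] }
  },
  "type": "object",
  "description": "Example document record",
  "properties": {
    "id": { "type": "string", "format": "uuid" },
    "created": { "type": "string", "format": "date-time" },
    "tags": {
      "type": "array",
      "items": { "$ref": "#/$defs/tag" },
      "minItems": 1,
      "maxItems": 5
    }
  },
  "required": ["id", "created"]
}
```

Even this small example shows why coverage is hard: an engine must handle reference resolution (`$ref`), array cardinality, enumerations, and format constraints together, and a gap in any one of them makes the whole schema uncovered.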
Limitations & Future Work
Current Challenges
- Under-constraining tradeoffs:
  "Under-constraining effectively delegates responsibility to the LM, which may produce valid output despite a lack of strict constraints... requires careful implementation and transparency to ensure reliability"
- Feature representation: "Not all features are equally represented in real-world schemas... strong or weak performance on specific features can have disproportionate impacts depending on their prevalence"
- Test suite limitations: "No straightforward correspondence between test suite performance and empirical coverage"
Aspirational Goals
- GitHub-Ultra dataset: retained "as an aspirational target for future advancements" despite being too hard for current frameworks
- Format validation: the paper excluded 'format' tests but notes "We hope to extend this work to include these optional tests in a follow-up"
- Formal verification: true coverage requires a "formal verification method that is capable of exhaustively comparing the schema's semantics against the framework's implementation"
Repository & Resources
- Benchmark release: https://github.com/guidance-ai/jsonschemabench
- JSON Schema specification: Draft 2020-12 (Wright et al., 2022)
- Model used: Llama-3.1-8B-Instruct (Grattafiori et al., 2024)
