Runbook

This is the full set of instructions the AI agent follows when generating a Croissant file. You can copy it and use it directly with your own agent (Claude Code, Codex, Gemini CLI, etc.) and your own PDF — no web app needed.

Use it locally

1. Copy the runbook below (or download it)

2. Save it as RUNBOOK.md

3. Place your PDF in the same directory

4. Run your agent with the runbook as the system prompt:

claude --system-prompt "$(cat RUNBOOK.md)" \
  "Generate a Croissant file for the dataset in paper.pdf"

In RUNBOOK.md, replace {{pdf_filename}} with your actual filename (paper.pdf in the command above) and fill in the optional template variables.
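If you prefer to fill in the template variables with a script rather than by hand, a minimal sketch could look like the following. The function name `fill_template` is hypothetical, and leaving unknown placeholders untouched is an assumption, not behavior defined by the runbook.

```python
import re


def fill_template(runbook_text: str, variables: dict[str, str]) -> str:
    """Replace {{name}} placeholders with values from `variables`.

    Placeholders without a matching key are left as-is (an assumption),
    so optional variables can stay unfilled.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return variables.get(name, match.group(0))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, runbook_text)
```

For example, `fill_template(text, {"pdf_filename": "paper.pdf"})` would fill in the required variable and keep the optional ones as placeholders.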

Setup, Extraction, Generation, Validation Loop, Reporting

Inputs

{{pdf_filename}} (required)
{{results_dir}}
{{huggingface_url}}
{{dataset_name}}

Step 1

Environment Setup

Install mlcroissant, create output dirs, verify PDF exists
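The setup step might be sketched as below. This is only an illustration: the function name `prepare_workspace` and the default directory name `results` are assumptions, and installing the library (for example with `pip install mlcroissant`) is left to the shell.

```python
from pathlib import Path


def prepare_workspace(pdf_filename: str, results_dir: str = "results") -> Path:
    """Verify the PDF exists and create the output directory.

    `results` as the default directory name is a hypothetical choice.
    """
    pdf_path = Path(pdf_filename)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    out_dir = Path(results_dir)
    # parents=True and exist_ok=True make the call safe to repeat.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```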

Step 2

Read & Analyze Paper

Extract dataset identity, creators, structure, characteristics, and RAI metadata

Step 3

Cross-Reference HF

If HuggingFace URL provided, confirm splits, formats, license (paper is primary source)
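One way to read "paper is primary source" is sketched below: values extracted from the paper always win, HuggingFace values only fill gaps, and disagreements are recorded for the report. The function name `merge_metadata` and the flat dictionary shape are assumptions made for illustration.

```python
def merge_metadata(paper: dict, hub: dict) -> tuple[dict, list[str]]:
    """Merge HuggingFace metadata into paper metadata.

    Paper values take precedence. Hub values are used only where the
    paper value is missing or empty. Keys with differing values are
    returned as conflicts so they can be mentioned in the summary.
    """
    merged = dict(paper)
    conflicts: list[str] = []
    for key, hub_value in hub.items():
        paper_value = paper.get(key)
        if paper_value in (None, "", []):
            merged[key] = hub_value      # gap in the paper: fill from the hub
        elif paper_value != hub_value:
            conflicts.append(key)        # keep the paper value, note the conflict
    return merged, conflicts
```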

Step 4

Build Croissant JSON-LD

Construct @context, distribution, recordSets, RAI fields per Croissant 1.0 spec

Output: croissant.json

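As a rough illustration of the document this step produces, the sketch below builds a skeleton with the top-level keys named above. The `@context` shown is deliberately abbreviated and is not sufficient to pass validation on its own; the full context should be taken from the Croissant 1.0 specification. The function name and its parameters are hypothetical.

```python
import json


def build_croissant_skeleton(name: str, description: str, license_url: str, url: str) -> dict:
    """Return a minimal Croissant-style JSON-LD skeleton.

    The @context is abbreviated (an assumption for brevity). Copy the
    complete context from the Croissant 1.0 spec before validating.
    """
    return {
        "@context": {
            "@language": "en",
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "rai": "http://mlcommons.org/croissant/RAI/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
            "recordSet": "cr:recordSet",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        "license": license_url,
        "url": url,
        "distribution": [],  # file objects and file sets go here
        "recordSet": [],     # record sets and their fields go here
    }


def write_croissant(doc: dict, path: str) -> None:
    """Serialize the document as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(doc, handle, indent=2, ensure_ascii=False)
```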
Step 5

Evaluate Outputs

Validate JSON syntax, check the file against the Croissant schema, and inspect the record sets
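The first of these checks, JSON syntax, can be sketched with the standard library alone. The structural check below is only a coarse stand-in: real schema validation is the job of the mlcroissant library, and `REQUIRED_KEYS` is a hypothetical minimal set chosen for illustration.

```python
import json

# Hypothetical minimal set of top-level keys; not the full Croissant schema.
REQUIRED_KEYS = ("@context", "@type", "name")


def check_croissant_text(text: str) -> list[str]:
    """Return a list of error strings; an empty list means the checks passed."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"syntax: {exc.msg} (line {exc.lineno}, column {exc.colno})"]

    if not isinstance(doc, dict):
        return ["structure: top-level value must be a JSON object"]

    return [
        f"structure: missing top-level key {key!r}"
        for key in REQUIRED_KEYS
        if key not in doc
    ]
```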

Step 6

Iterate on Errors

Fix validation failures using common-fixes table, re-validate (max 3 rounds)

Step 7

Write Executive Summary

Document extracted fields, inferences, gaps, validation results, and recommendations

Output: summary.md

Step 8

Write Validation Report

Structured JSON with stages, pass/fail counts, and output file manifest

Output: validation_report.json

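A report with stages, pass/fail counts, and an output file manifest might be assembled as in the sketch below. The key names (`stages`, `summary`, `outputs`) and the shape of each stage entry are assumptions; the runbook itself defines the actual schema.

```python
import json
from pathlib import Path


def build_validation_report(stages: list[dict], output_files: list[str], results_dir: str) -> dict:
    """Build a report dictionary from per-stage results.

    Each stage is assumed to look like {"name": str, "passed": bool}.
    """
    passed = sum(1 for stage in stages if stage["passed"])
    return {
        "stages": stages,
        "summary": {"passed": passed, "failed": len(stages) - passed},
        "outputs": [
            {"file": name, "exists": (Path(results_dir) / name).is_file()}
            for name in output_files
        ],
    }


def write_report(report: dict, results_dir: str) -> Path:
    """Write the report as validation_report.json inside results_dir."""
    path = Path(results_dir) / "validation_report.json"
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return path
```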
Step 9

Final Checklist

Run verification script, confirm all 3 output files exist and are valid
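A verification script along these lines could cover the final check. It is a sketch, not the runbook's own script: "valid" is interpreted here as "parses as JSON" for the two JSON files and "non-empty" for the Markdown summary, which is an assumption.

```python
import json
from pathlib import Path

REQUIRED_OUTPUTS = ("croissant.json", "summary.md", "validation_report.json")


def verify_outputs(results_dir: str) -> dict[str, bool]:
    """Map each required output file to True if it exists and looks valid."""
    results: dict[str, bool] = {}
    for name in REQUIRED_OUTPUTS:
        path = Path(results_dir) / name
        if not path.is_file():
            results[name] = False
            continue
        text = path.read_text(encoding="utf-8")
        if name.endswith(".json"):
            try:
                json.loads(text)
                results[name] = True
            except json.JSONDecodeError:
                results[name] = False
        else:
            # For summary.md, only check that it is not empty.
            results[name] = bool(text.strip())
    return results
```

Calling `all(verify_outputs("results").values())` gives a single pass/fail answer for the checklist.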

Required Outputs

croissant.json
summary.md
validation_report.json

Validation Loop

Steps 4 to 6 form an iteration loop. If schema validation fails, the agent applies fixes from the common-fixes table and re-validates, for up to 3 rounds, and then keeps the best result.
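The control flow of this loop can be sketched as follows. The `validate` and `fix` callables are hypothetical stand-ins for the schema validator and the common-fixes table, and treating "best result" as "fewest validation errors" is an assumption.

```python
from typing import Callable


def validation_loop(
    doc: dict,
    validate: Callable[[dict], list[str]],
    fix: Callable[[dict, list[str]], dict],
    max_rounds: int = 3,
) -> tuple[dict, list[str], int]:
    """Fix and re-validate until clean or until max_rounds is reached.

    Returns the best document seen (fewest errors), its errors, and the
    number of fix rounds that were run.
    """
    current = doc
    errors = validate(current)
    best_doc, best_errors = current, errors
    rounds = 0

    while errors and rounds < max_rounds:
        current = fix(current, errors)
        errors = validate(current)
        rounds += 1
        # Keep the candidate only if it strictly improves on the best so far.
        if len(errors) < len(best_errors):
            best_doc, best_errors = current, errors

    return best_doc, best_errors, rounds
```

Because the best candidate is tracked separately, a fix that makes things worse in a later round does not replace an earlier, better document.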