Runbook
This is the full set of instructions the AI agent follows when generating a Croissant file. You can copy it and use it directly with your own agent (Claude Code, Codex, Gemini CLI, etc.) and your own PDF — no web app needed.
Use it locally
1. Copy the runbook below (or download it)
2. Save it as RUNBOOK.md
3. Place your PDF in the same directory
4. Run your agent with the runbook as the system prompt:
claude --system-prompt RUNBOOK.md \
  "Generate a Croissant file for the dataset in paper.pdf"
Replace {{pdf_filename}} with your actual filename and fill in the optional template variables.
Inputs
1. Environment Setup: Install mlcroissant, create the output directories, and verify that the PDF exists.
2. Read & Analyze Paper: Extract the dataset's identity, creators, structure, characteristics, and RAI metadata.
3. Cross-Reference HF: If a Hugging Face URL is provided, confirm splits, formats, and license against it (the paper remains the primary source).
4. Build Croissant JSON-LD: Construct the @context, distribution, recordSets, and RAI fields per the Croissant 1.0 spec.
5. Evaluate Outputs: Validate the JSON syntax and the Croissant schema, then inspect the record sets.
6. Iterate on Errors: Fix validation failures using the common-fixes table, then re-validate (max 3 rounds).
7. Write Executive Summary: Document extracted fields, inferences, gaps, validation results, and recommendations.
8. Write Validation Report: Produce structured JSON with stages, pass/fail counts, and an output file manifest.
9. Final Checklist: Run the verification script and confirm that all 3 output files exist and are valid.
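As a rough illustration of the build, evaluate, and report steps above, the following Python sketch assembles a minimal Croissant-style JSON-LD document, checks its JSON syntax, and summarizes stage results. It is a sketch under stated assumptions, not the runbook's actual implementation: the function names, the report field names (`stages`, `passed`, `failed`, `files`), the `data.csv` identifier, and the `text` column are all hypothetical, and the `@context` is heavily abbreviated compared with the full context in the Croissant 1.0 spec.

```python
import json


def build_croissant_skeleton(name, description, license_url, file_url):
    """Return a minimal Croissant-style JSON-LD dict.

    Assumption: the @context here is abbreviated for readability; a real
    file should carry the full context from the Croissant 1.0 spec.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "description": description,
        "license": license_url,
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "data.csv",  # hypothetical file identifier
                "contentUrl": file_url,
                "encodingFormat": "text/csv",
            }
        ],
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "default",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "default/text",
                        "dataType": "sc:Text",
                        # The field reads the hypothetical "text" column
                        # from the file object declared above.
                        "source": {
                            "fileObject": {"@id": "data.csv"},
                            "extract": {"column": "text"},
                        },
                    }
                ],
            }
        ],
    }


def check_json_syntax(text):
    """First evaluation stage: return (ok, message) for a JSON string."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "ok"


def build_report(stage_results, output_files):
    """Summarize (stage_name, passed) pairs into a report dict.

    The report layout is an assumption, not the runbook's exact schema.
    """
    passed = sum(1 for _, ok in stage_results if ok)
    return {
        "stages": [{"name": n, "passed": ok} for n, ok in stage_results],
        "passed": passed,
        "failed": len(stage_results) - passed,
        "files": list(output_files),
    }
```

With mlcroissant installed, the schema stage would typically go further by loading the written file, for example with `mlcroissant.Dataset(jsonld=...)`, which raises an error when the metadata does not conform.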
Required Outputs
Validation Loop
Steps 4–6 form an iteration loop. If schema validation fails, the agent applies fixes from the common-fixes table and re-validates — up to 3 rounds before keeping the best result.
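The loop can be sketched in Python as follows. This is a minimal sketch under assumptions: `validate` and `apply_fixes` are hypothetical callables standing in for schema validation and the common-fixes table, and "best result" is interpreted here as the candidate with the fewest validation errors, which the runbook may define differently.

```python
def run_validation_loop(doc, validate, apply_fixes, max_rounds=3):
    """Fix and re-validate up to max_rounds times, keeping the best candidate.

    validate(doc) -> list of error strings (empty list means valid).
    apply_fixes(doc, errors) -> new, hopefully improved doc.
    Returns (best_doc, best_errors, rounds_used).
    """
    current, errors = doc, validate(doc)
    best_doc, best_errors = current, errors
    rounds = 0
    while errors and rounds < max_rounds:
        current = apply_fixes(current, errors)
        errors = validate(current)
        rounds += 1
        # Keep a candidate only if it has strictly fewer errors.
        if len(errors) < len(best_errors):
            best_doc, best_errors = current, errors
    return best_doc, best_errors, rounds


# Toy stand-ins for illustration only (not the runbook's real checks).
REQUIRED_KEYS = ("name", "description", "license", "url")


def toy_validate(doc):
    return [f"missing {key}" for key in REQUIRED_KEYS if key not in doc]


def toy_apply_fixes(doc, errors):
    # Fix only the first reported error per round, using a placeholder value.
    key = errors[0].split(" ", 1)[1]
    return {**doc, key: "TODO"}
```

For instance, `run_validation_loop({"name": "demo"}, toy_validate, toy_apply_fixes)` has three missing keys to fill and resolves one per round, so it uses all three rounds. Starting from an empty dict, it stops at the round limit with one error still outstanding and returns that partially fixed document.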