A .ai.pdf is a fully valid PDF that also embeds a
Brotli-compressed semantic XML layer — so AI parsers read structure
directly, with no OCR and no brittle extraction.
The Problem
AI systems parsing PDFs today have two bad options. .ai.pdf eliminates both.
OCR: Slow. Lossy. Layout-dependent. Tables shatter. Reading order is guessed. Scanned pages with unusual fonts silently drop content.
Text extraction: Column detection fails on complex layouts. Multi-column papers come out as interleaved noise. Citations, cross-references, and figure captions are silently stripped. Math is lost entirely.
Full structure embedded as XML — headings, tables, equations, citations, figures, bbox coordinates.
Opens unchanged in every PDF reader. Legacy readers ignore the embedded layer completely.
Rust, Python, TypeScript — byte-identical output enforced by shared conformance fixtures.
Architecture
Embedded Unicode CID/Type0 font. Per-block page coordinates and bbox written back into the visual layer. Renders identically in Acrobat, Preview, Chrome — any reader.
Brotli-compressed XML validated against the V1 schema. Linked via PDF /AF. Detected by a fast literal byte-scan for /Subtype /application#2Faipdf+xml+br, with a structural lopdf fallback for third-party re-saved PDFs.
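The fast detection path described above can be sketched as a plain substring search over the raw file bytes. This is a minimal illustration, not the project's implementation: the function name has_semantic_layer is hypothetical, and only the marker string is taken from the text.

```python
# Marker bytes as quoted in the text above; "#2F" is the PDF name escape for "/".
MARKER = b"/Subtype /application#2Faipdf+xml+br"


def has_semantic_layer(pdf_bytes: bytes) -> bool:
    """Hypothetical helper: return True if the literal marker occurs in the file.

    This is only the fast path. A miss does not prove the layer is absent,
    because a re-saved PDF may serialize the dictionary differently; that is
    the case the structural fallback is described as covering.
    """
    return MARKER in pdf_bytes


if __name__ == "__main__":
    sample = b"%PDF-1.7\n<< /Type /EmbeddedFile " + MARKER + b" >>\n%%EOF"
    print(has_semantic_layer(sample))
```

A literal scan needs no PDF parser, so it is cheap to run over many files. The cost is that it only matches this exact byte sequence.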
Title, authors, format version. Searchable by standard PDF tooling without touching the embedded file.
Format
Every block carries id, page, bbox, and role. Version negotiation accepts any 1.x; unknown elements are silently ignored.
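The two reader rules above (accept any 1.x, skip unknown elements) can be sketched as follows. Everything structural here is an assumption made for illustration: the root element, its version attribute, the tag names in KNOWN_BLOCKS, and the function names are hypothetical and are not taken from the V1 schema. Only the four per-block attributes come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; the real V1 block vocabulary is not shown here.
KNOWN_BLOCKS = {"heading", "paragraph", "table"}


def accepts_version(version: str) -> bool:
    """Accept any 1.x version string, assuming a 'major.minor' form."""
    major, sep, minor = version.partition(".")
    return major == "1" and sep == "." and minor.isascii() and minor.isdigit()


def read_blocks(xml_text: str) -> list[dict]:
    """Collect known blocks with their id, page, bbox, and role attributes."""
    root = ET.fromstring(xml_text)
    # Assumption: the format version is an attribute on the root element.
    version = root.get("version", "")
    if not accepts_version(version):
        raise ValueError(f"unsupported format version: {version!r}")
    blocks = []
    for el in root:
        if el.tag not in KNOWN_BLOCKS:
            continue  # unknown elements are silently ignored
        blocks.append({
            "type": el.tag,
            "id": el.get("id"),
            "page": el.get("page"),
            "bbox": el.get("bbox"),
            "role": el.get("role"),
        })
    return blocks
```

Skipping unknown elements rather than failing is what lets a reader written against 1.0 keep working on a later 1.x file that adds new block types.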
Block types
Inputs
Direct semantic XML conforming to the V1 schema
HTML5 — headings, tables, lists, code, figures
Markdown via pulldown-cmark — GFM tables, fenced code, blockquotes, images
Typst — headings, lists, $…$ equations, image() figures
aipdf ingest — text extraction + optional tesseract OCR
Export
All read-side transforms are byte-identical across Rust, Python, and TypeScript — enforced by shared conformance tests.
^-separated cells, |-separated rows.
SDKs
Write-side operations (build, ingest) delegate to the installed binary — Rust stays authoritative. Read-side transforms are pure-library.
CLI
Agent Integration
The Python package ships a Model Context Protocol server. Any MCP-compatible agent can inspect, extract, validate, build, and convert .ai.pdf files natively.
Security
The semantic layer stores no embeddings, model output, prompts, or executable content. sanitize_xml runs on every XML path, both build and extract.
Rejected: <!DOCTYPE, <?xml-stylesheet, <script, /JavaScript, /Launch. Natural-language text is never banned.
External entity resolution disabled. Decompressed payload capped at 16 MiB.
Disallowed-marker list kept byte-identical in Rust core, Python SDK, and TypeScript SDK.
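The marker and size checks described in this section can be sketched as below. This is not the project's sanitize_xml: the function name check_payload is hypothetical, and case-sensitive literal matching is an assumption. The marker list and the 16 MiB limit are the ones stated above.

```python
MAX_DECOMPRESSED_BYTES = 16 * 1024 * 1024  # 16 MiB cap from the text above

# Markers listed as rejected in the text above.
DISALLOWED_MARKERS = (
    b"<!DOCTYPE",
    b"<?xml-stylesheet",
    b"<script",
    b"/JavaScript",
    b"/Launch",
)


def check_payload(xml_bytes: bytes) -> None:
    """Hypothetical check: raise ValueError on an oversized or marked payload.

    Assumption: markers are matched as case-sensitive literal byte strings.
    """
    if len(xml_bytes) > MAX_DECOMPRESSED_BYTES:
        raise ValueError("decompressed payload exceeds 16 MiB")
    for marker in DISALLOWED_MARKERS:
        if marker in xml_bytes:
            raise ValueError(f"disallowed marker found: {marker.decode()}")


if __name__ == "__main__":
    check_payload(b"<doc><p>Plain prose about JavaScript is fine.</p></doc>")
    print("ok")
```

The sketch checks the size after decompression for simplicity. A production reader would more likely enforce the cap while streaming the Brotli output, so an oversized payload is never fully held in memory.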