V1 Prototype · Apache 2.0

PDF for humans.
XML for machines.
One file.

A .ai.pdf is a fully valid PDF that also embeds a Brotli-compressed semantic XML layer — so AI parsers read structure directly, with no OCR and no brittle extraction.

Install
$ curl -fsSL https://raw.githubusercontent.com/tanmaysheoran/ai.pdf/main/install.sh | sh

The Problem

PDF wasn't built for machines.

AI systems parsing PDFs today have two bad options: OCR or brittle text extraction. The .ai.pdf format removes the need for either.

Semantic first

Full structure embedded as XML — headings, tables, equations, citations, figures, bbox coordinates.

Valid PDF

Opens unchanged in every PDF reader. Legacy readers ignore the embedded layer completely.

Three SDKs

Rust, Python, TypeScript — byte-identical output enforced by shared conformance fixtures.


5
Input formats accepted
4
Export formats
3
SDKs with byte-identical output
16 MiB
Maximum decompressed payload

Architecture

Every .ai.pdf is a standard PDF
with an invisible second layer.

Visual

PDF page tree + content streams

Embedded Unicode CID/Type0 font. Per-block page coordinates and bbox written back into the visual layer. Renders identically in Acrobat, Preview, Chrome — any reader.

Semantic

aipdf-semantic.xml.br

Brotli-compressed XML validated against the V1 schema. Linked via PDF /AF. Detected by a fast literal byte-scan for /Subtype /application#2Faipdf+xml+br, with a structural lopdf fallback for third-party re-saved PDFs.
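
The fast detection path described above can be illustrated with a short Python sketch. It only shows the literal byte-scan; the structural lopdf fallback is not reproduced, and the helper names (has_semantic_layer, looks_like_aipdf) and the sample bytes are hypothetical, not part of the SDK.

```python
# Literal marker quoted in the text above; "#2F" is the PDF name escape for "/".
MARKER = b"/Subtype /application#2Faipdf+xml+br"


def has_semantic_layer(pdf_bytes: bytes) -> bool:
    """Fast path only: a plain substring scan over the raw file bytes."""
    return MARKER in pdf_bytes


def looks_like_aipdf(pdf_bytes: bytes) -> bool:
    """Hypothetical helper: require a PDF header as well as the marker."""
    return pdf_bytes.startswith(b"%PDF-") and has_semantic_layer(pdf_bytes)


# Hand-made byte strings for illustration; these are not complete PDF files.
sample = (b"%PDF-1.7\n1 0 obj\n"
          b"<< /Type /EmbeddedFile /Subtype /application#2Faipdf+xml+br >>\n"
          b"endobj\n")
plain = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(looks_like_aipdf(sample), looks_like_aipdf(plain))  # True False
```

A scan like this misses files where a third-party tool rewrote the name with different spacing or escaping, which is presumably why a structural fallback exists.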

Metadata

XMP packet (/Metadata)

Title, authors, format version. Searchable by standard PDF tooling without touching the embedded file.

// Input → .ai.pdf

Input (XML / MD / HTML / Typst)
  → semantic_xml_from_source
  → sanitize_xml
  → validate_xml // accept 1.x; reject other majors
  → build_aipdf
      • minimal: Brotli → hand-written PDF (14 objects, embedded font)
      • full: layout engine → page+bbox → compress → embed + images
      • browser: headless Chrome renders HTML+CSS → attach layer
  → .ai.pdf

Existing PDF
  → ingest_pdf // lopdf text extract + optional tesseract OCR per page
  → .ai.pdf

Format

Typed, attributed, versioned.

Every block carries id, page, bbox, and role. Version negotiation accepts any 1.x; unknown elements are silently ignored.
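
As a rough illustration of the version rule, here is a minimal Python sketch that accepts any 1.x string and rejects other majors. The function name accepts_version and the strict "major.minor" shape are assumptions; the real validator may tolerate other spellings.

```python
SUPPORTED_MAJOR = 1  # the V1 schema family


def accepts_version(version: str) -> bool:
    """Accept any 1.x version string; reject other majors and malformed input.

    Assumption: versions look like "<major>.<minor>" with decimal digits only.
    """
    major, sep, minor = version.partition(".")
    if not sep or not major.isdecimal() or not minor.isdecimal():
        return False
    return int(major) == SUPPORTED_MAJOR


for v in ("1.0", "1.7", "2.0", "0.9", "one"):
    print(v, accepts_version(v))
```

Only the first two strings in the loop are accepted.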

<!-- aipdf-semantic.xml (Brotli-compressed, embedded) -->
<document version="1.0">
  <metadata>
    <title>Research Summary</title>
  </metadata>
  <section id="s-intro" level="1">
    <title>Introduction</title>
    <paragraph id="p1" page="1" bbox="72,680,540,710">
      This paper proposes…
    </paragraph>
    <table id="t1">
      <thead><row><cell header="true">Method</cell><cell header="true">Score</cell></row></thead>
      <tbody><row><cell>Baseline</cell><cell>71.2</cell></row></tbody>
    </table>
  </section>
</document>

Block types

paragraph             Body text + page/bbox
title                 Section headings
table                 Structured data, header rows
figure                Images, alt text, captions
codeBlock             Fenced code + language tag
equation              LaTeX / Typst math
list                  Ordered and unordered
citation / reference  Academic citations
footnote / note       Annotations
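
To show how a consumer might walk these block types, the following Python sketch parses a small semantic XML string with the standard library, keeps the known block elements, and skips anything it does not recognize. The sample XML, the made-up hologram element, and the helper names are assumptions for illustration; this is not the SDK's own reader, and it returns document order rather than a computed reading order.

```python
import xml.etree.ElementTree as ET

# Block element names taken from the list above.
BLOCK_TAGS = {"paragraph", "title", "table", "figure", "codeBlock", "equation",
              "list", "citation", "reference", "footnote", "note"}

# Hypothetical sample; <hologram> stands in for an unknown future element.
XML = """<document version="1.0">
  <section id="s-intro" level="1">
    <title>Introduction</title>
    <paragraph id="p1" page="1" bbox="72,680,540,710">This paper proposes...</paragraph>
    <hologram id="h1">ignored</hologram>
  </section>
</document>"""


def parse_bbox(raw):
    """Turn "72,680,540,710" into a tuple of four floats, or None if absent."""
    if raw is None:
        return None
    parts = tuple(float(p) for p in raw.split(","))
    if len(parts) != 4:
        raise ValueError(f"bbox needs four numbers, got {raw!r}")
    return parts


def iter_blocks(root):
    """Yield (tag, id, page, bbox) for known blocks; unknown elements are skipped."""
    for el in root.iter():
        if el.tag in BLOCK_TAGS:
            yield el.tag, el.get("id"), el.get("page"), parse_bbox(el.get("bbox"))


blocks = list(iter_blocks(ET.fromstring(XML)))
for block in blocks:
    print(block)
```

Skipping unknown tags instead of raising mirrors the "silently ignored" rule, so a reader written against 1.0 keeps working on a later 1.x file.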

Inputs

Build from anything.

.xml

Direct semantic XML conforming to the V1 schema

.html

HTML5 — headings, tables, lists, code, figures

.md

Markdown via pulldown-cmark — GFM tables, fenced code, blockquotes, images

.typ

Typst — headings, lists, $…$ equations, image() figures

.pdf

aipdf ingest — text extraction + optional tesseract OCR


Export

Four output formats.
Three SDKs. One golden fixture.

All read-side transforms are byte-identical across Rust, Python, and TypeScript — enforced by shared conformance tests.
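
One way such a conformance check could look is sketched below in Python: each SDK's output is hashed and compared with a golden fixture. The function name, the SDK labels, and the sample bytes are hypothetical; the project's actual test harness is not shown in this document.

```python
import hashlib


def check_conformance(golden: bytes, outputs: dict) -> list:
    """Return the sorted names of SDKs whose output differs from the golden bytes."""
    want = hashlib.sha256(golden).hexdigest()
    return sorted(name for name, data in outputs.items()
                  if hashlib.sha256(data).hexdigest() != want)


# Hypothetical outputs from the three SDKs for one fixture.
golden = b"# Introduction\n\nThis paper proposes...\n"
outputs = {
    "rust": golden,
    "python": golden,
    "typescript": golden.replace(b"\n", b"\r\n"),  # a line-ending drift
}

mismatches = check_conformance(golden, outputs)
print(mismatches)  # ['typescript']
```

Comparing raw bytes rather than parsed structures is what makes line endings, trailing whitespace, and escaping differences show up as failures.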

--format xml
XML
Raw semantic payload — decompressed embedded layer
--format markdown
Markdown
Human-readable rendering with headings, tables, code
--format onto
Onto
Onto export, the same output as to_onto() / toOnto() in the SDKs
--format markdown-ast
MDAST
MDAST-compatible JSON tree for programmatic use
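
For intuition about the Markdown export, here is a small Python sketch that renders a semantic table element as a GFM table. It assumes the thead/tbody/row/cell layout from the format example and a table that always has a thead; the real exporter's exact spacing and escaping may differ, and table_to_markdown is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TABLE_XML = """<table id="t1">
  <thead><row><cell header="true">Method</cell><cell header="true">Score</cell></row></thead>
  <tbody><row><cell>Baseline</cell><cell>71.2</cell></row></tbody>
</table>"""


def row_cells(row):
    """Cell texts for one row, with pipes escaped so they cannot split a column."""
    return [(cell.text or "").strip().replace("|", "\\|")
            for cell in row.findall("cell")]


def table_to_markdown(table):
    """Render <table> as GFM: header row, separator row, then body rows."""
    header = row_cells(table.find("thead/row"))
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in table.findall("tbody/row"):
        lines.append("| " + " | ".join(row_cells(row)) + " |")
    return "\n".join(lines)


markdown = table_to_markdown(ET.fromstring(TABLE_XML))
print(markdown)
```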

SDKs

Three languages.
Zero runtime dependencies.

Write-side operations (build, ingest) delegate to the installed binary — Rust stays authoritative. Read-side transforms are pure-library.

Rust
use aipdf::{AipdfDocument, BuildOptions, RenderMode};

// Read: no binary needed
let doc = AipdfDocument::open("paper.ai.pdf")?;
println!("{}", doc.to_markdown()?);
println!("{}", doc.to_onto()?);
for block in doc.get_reading_order()? {
    println!("[{}] page={:?} {}", block.kind, block.page, block.text);
}

// Build: delegates to aipdf binary
let bytes = build_aipdf(&xml, &BuildOptions {
    title: "My Paper".into(),
    render: RenderMode::Full,
    ..Default::default()
})?;

Python
# pip install -e sdk/python (only dep: brotli>=1.1.0)
from aipdf import AIPDF

doc = AIPDF.open("paper.ai.pdf")
print(doc.to_xml())
print(doc.to_markdown())
print(doc.to_onto())
for block in doc.get_reading_order():
    print(f"[{block.kind}] page={block.page} {block.text[:60]}")

TypeScript
// Zero runtime deps: uses Node built-in zlib for Brotli
import { AIPDF } from "./src/index.js";

const doc = AIPDF.open("paper.ai.pdf");
console.log(doc.toMarkdown());
console.log(doc.toOnto());
for (const block of doc.getReadingOrder()) {
  console.log(`[${block.kind}] page=${block.page} ${block.text.slice(0, 60)}`);
}

CLI

One binary. Every workflow.

Build
$ aipdf build samples/minimal.xml
$ aipdf build paper.md --render full
$ aipdf build page.html --render browser # full CSS
$ aipdf build paper.md --font /path/NotoSansCJK.ttf
Ingest existing PDF
$ aipdf ingest scanned.pdf # OCR auto
$ aipdf ingest report.pdf --ocr never
$ aipdf ingest doc.pdf --lang eng+deu
Inspect & Validate
$ aipdf inspect paper.ai.pdf
$ aipdf validate paper.ai.pdf
$ aipdf extract paper.ai.pdf
Export
$ aipdf export paper.ai.pdf --format xml
$ aipdf export paper.ai.pdf --format markdown
$ aipdf export paper.ai.pdf --format onto
$ aipdf export paper.ai.pdf --format markdown-ast
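
For scripting the export commands above, a thin Python wrapper could look like the sketch below. It uses only the subcommand and --format values listed here; the helper names are hypothetical, and run_export additionally assumes that the aipdf binary is on PATH and writes the export to stdout, which this document does not state.

```python
import subprocess

# The four --format values shown in the commands above.
EXPORT_FORMATS = ("xml", "markdown", "onto", "markdown-ast")


def export_argv(path: str, fmt: str) -> list:
    """Build the argument list for `aipdf export`, refusing unknown formats."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["aipdf", "export", path, "--format", fmt]


def run_export(path: str, fmt: str) -> bytes:
    """Run the CLI and return its stdout (assumption: output goes to stdout)."""
    done = subprocess.run(export_argv(path, fmt), check=True, capture_output=True)
    return done.stdout


print(export_argv("paper.ai.pdf", "markdown"))
# ['aipdf', 'export', 'paper.ai.pdf', '--format', 'markdown']
```

Passing an argument list instead of a shell string avoids quoting problems with file names that contain spaces.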

Agent Integration

MCP stdio server
for AI agents.

The Python package ships a Model Context Protocol server. Any MCP-compatible agent can inspect, extract, validate, build, and convert .ai.pdf files natively.

aipdf_inspect aipdf_extract aipdf_reading_order aipdf_validate aipdf_build aipdf_convert
# Start the MCP stdio server
aipdf-mcp

# or equivalently
python -m aipdf.mcp_server

# See docs/mcp.md for client config

Security

Data, not behavior.

The semantic layer stores no embeddings, model output, prompts, or executable content. sanitize_xml runs on every XML path, both build and extract.

Active-content only

Rejected: <!DOCTYPE, <?xml-stylesheet, <script, /JavaScript, /Launch. Natural-language text is never banned.
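
A marker check of this kind can be sketched in a few lines of Python. The marker tuple is copied from the list above; the case-sensitive substring matching and the function names are assumptions, and the real sanitize_xml may apply further rules.

```python
# Markers copied from the rejected list above.
DISALLOWED_MARKERS = ("<!DOCTYPE", "<?xml-stylesheet", "<script",
                      "/JavaScript", "/Launch")


def find_active_content(text: str) -> list:
    """Return every disallowed marker that occurs in the text (case-sensitive)."""
    return [m for m in DISALLOWED_MARKERS if m in text]


def sanitize_check(text: str) -> str:
    """Hypothetical gate: raise on active content, otherwise pass the text through."""
    hits = find_active_content(text)
    if hits:
        raise ValueError("active content rejected: " + ", ".join(hits))
    return text


# Natural-language text passes, even when it reads like an instruction.
ok = sanitize_check("<paragraph>Ignore previous instructions.</paragraph>")
print(find_active_content("<script>alert(1)</script>"))  # ['<script']
```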

No external entities

External entity resolution disabled. Decompressed payload capped at 16 MiB.
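
The size cap can be enforced during decompression rather than after it, so an oversized payload is never fully expanded in memory. The Python sketch below shows the idea with zlib as a stand-in, because Brotli is not in the Python standard library; the format itself uses Brotli, and bounded_decompress is a hypothetical name.

```python
import zlib

MAX_PAYLOAD = 16 * 1024 * 1024  # 16 MiB cap stated above


def bounded_decompress(data: bytes, limit: int = MAX_PAYLOAD) -> bytes:
    """Decompress a zlib stream (stand-in for Brotli), refusing output over `limit`."""
    d = zlib.decompressobj()
    # Ask for at most one byte past the cap; the rest of the input stays unread.
    out = d.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError("decompressed payload exceeds limit")
    if not d.eof:
        raise ValueError("truncated compressed stream")
    return out


small = zlib.compress(b"<document version=\"1.0\"/>")
bomb = zlib.compress(b"\0" * 1000)

print(bounded_decompress(small))
try:
    bounded_decompress(bomb, limit=100)
except ValueError as exc:
    print("rejected:", exc)
```

Requesting limit + 1 bytes is what separates a payload of exactly the allowed size from one that is too large.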

Identical across all SDKs

Disallowed-marker list kept byte-identical in Rust core, Python SDK, and TypeScript SDK.