# Doctrail full manual

Generated from mkdocs navigation by `scripts/build_llms_full.py`.

<!-- index.md -->

# Doctrail

<div class="dt-start-callout" markdown>
To begin right away, point your agent at [doctrail.org/llms.txt](https://doctrail.org/llms.txt) and/or run `uvx doctrail`.
</div>

Doctrail is a software library that allows researchers to perform and validate the large-scale enrichment of text corpora with large language models. It is written to be driven by agents (Claude Code, Codex) as much as humans, though humans must understand how it works. It grew naturally out of several applied computational social science research projects, eventually evolving to become a standalone tool.

Here is an example.

![Ingest a folder of documents, inspect the YAML codebook, enrich, and query the results — all in the terminal](assets/demo.gif)


## What did I just watch?

Let's step through it.

First, we ingested a folder of documents (pdf, doc, docx, xlsx, html; doctrail handles ~a dozen file types). This is your corpus. "Ingest" here means that we pull the text out of each file and put it into a database on your computer.

Second, we looked at what they were (indeed, html, docx, and pdf files).

Third, and most important, we looked at a set of instructions to an LLM (called a 'prompt') and a codebook. Together they define:

- What goes *into* the LLM: one row from our Federalist papers database each time, plus our prompt explaining what we want *out*. The prompt said "Code this... using the codebook below." It instructs the model to identify the author and to measure the "fear of disunion" the text shows, and it explains what those measures mean.
- What must come *out* of the LLM. This is a schema that enforces only certain responses. When we ask for the coding of a text for social science, we do not want ChatGPT to say "My, what a brilliant question..." We want only the outputs, and *types* of output, that we define. Here, that is defined by the schema (i.e. codebook), in this case like:

`author: {enum: ["Hamilton", "Madison", "Jay"]}`

This means that we want a field called "author" and the only values it can take are one of those three (an enum, or categorical variable).

Similarly with the field "fear of disunion":

`fear_of_disunion: {type: integer, minimum: 0, maximum: 5}`

This field can *only* hold an integer between 0 and 5.
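To make the idea of an enforced schema concrete, here is a minimal sketch in Python of what checking a response against those two fields amounts to. The function `check_response` is hypothetical and written only for illustration; it is not Doctrail's validation code, which this page does not show.

```python
# Hypothetical stand-in for the two schema fields shown above.
# Doctrail's own validation is not shown in this manual and may differ.
ALLOWED_AUTHORS = {"Hamilton", "Madison", "Jay"}


def check_response(response: dict) -> list[str]:
    """Return a list of problems; an empty list means the response conforms."""
    problems = []
    if response.get("author") not in ALLOWED_AUTHORS:
        problems.append("author must be one of Hamilton, Madison, Jay")
    score = response.get("fear_of_disunion")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(score, int) or isinstance(score, bool) or not 0 <= score <= 5:
        problems.append("fear_of_disunion must be an integer from 0 to 5")
    return problems


print(check_response({"author": "Madison", "fear_of_disunion": 4}))
print(check_response({"author": "Franklin", "fear_of_disunion": 9}))
```

The first call returns an empty list; the second reports one problem per field. A chatty free-text reply would fail both checks, which is the point of declaring the schema.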

Fourth, we examined the output with a SQL command.

The rest of this page explains more about how Doctrail works, the mental model that is most helpful for working with it, and gives two examples of real social science projects that Doctrail facilitated.

## Your corpus as a grid

Doctrail turns a pile of documents (your corpus) into a table, or a grid. Most of your workflow with it is therefore best thought of as "adding columns to my documents, which are now a grid."

First, it creates an SQLite database. SQLite is the most widely deployed database engine in the world. It exists as a single file on disk and can be easily shared, copied, or backed up. It can be opened in any database browser, and in fact Doctrail is at its best when you are toggling between *looking* at your LLM enrichments in one window, updating your prompt in another, and re-running your enrichment in a third. Increasingly, however, a terminal agent (again, like Claude Code or Codex) can take over much of this, given the right instructions.

It is very important that we use a database for all this.

## Why a database, and what's in it?

At a high level, a relational database consists of tables that can be linked to each other with keys.

In doctrail, your files get turned into text and ingested into a table called `documents`.

Doctrail's own bookkeeping tables (such as `_enrichments`) are prefixed by `_`, so they cluster together and stay out of the way.

As Doctrail uses LLMs to enrich the files, the results are stored in an append-only log. SQL queries are then used to reconstruct pieces of these into other tables, or views, that you can inspect and do useful work on. The internal machinery is complex, and many thousands of lines of code define the behaviour. The key idea is that every input to the LLM, and every output from the LLM, is always captured in the database and fully auditable. This means it can be reconstructed in arbitrary ways as discussed below.
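The relationship between an append-only log and a reconstructed view can be sketched with Python's built-in `sqlite3` module. The table and view names below (`coding_log`, `latest_wide`) and their columns are invented for illustration and are not Doctrail's real schema; the sketch only shows the general pattern of never overwriting a coding and rebuilding a wide table with a saved query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for illustration; not Doctrail's real schema.
    CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE coding_log (
        log_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- rows are only ever appended
        doc_id INTEGER REFERENCES documents(id),
        field  TEXT,
        value  TEXT
    );
    -- A view is a saved query: keep the newest entry per (document, field)
    -- and lay the fields out as columns.
    CREATE VIEW latest_wide AS
    SELECT d.filename,
           MAX(CASE WHEN l.field = 'author' THEN l.value END) AS author,
           MAX(CASE WHEN l.field = 'fear_of_disunion' THEN l.value END) AS fear_of_disunion
    FROM documents d
    JOIN coding_log l ON l.doc_id = d.id
    WHERE l.log_id = (SELECT MAX(log_id) FROM coding_log
                      WHERE doc_id = l.doc_id AND field = l.field)
    GROUP BY d.id;
""")
conn.execute("INSERT INTO documents VALUES (1, 'federalist_10.pdf')")
log = [(1, "author", "Hamilton"), (1, "fear_of_disunion", "3"),
       (1, "author", "Madison")]  # a later run revises the author
conn.executemany("INSERT INTO coding_log (doc_id, field, value) VALUES (?, ?, ?)", log)

rows = conn.execute("SELECT * FROM latest_wide").fetchall()
print(rows)
```

The view shows only the most recent author coding, while the earlier one is still in the log and can be recovered with a different query. That is what makes the record auditable.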

In the end, all this is intended to make it trivial to iterate on a prompt and codebook, to confirm its behavior on a new random sample, and only then to run it on thousands, tens of thousands, or hundreds of thousands of documents in the corpus.

There are many ancillary benefits to using a database as the storage engine, including:

* Your corpus and its enrichments stay together in a single file, linked by keys;
* Each write is atomic and incremental, meaning you can resume large runs that get interrupted and no data should be lost or corrupted;
* The corpus is never loaded into computer memory all at once. Loading everything is harmless for small corpora, but once a corpus grows to hundreds of thousands or millions of documents, it becomes awkward, inefficient, and sometimes impossible to hold it all in memory and repeatedly rewrite it to disk;
* You can keep the database open as writes are happening and inspect the enrichments directly as they come in;
* You can easily filter your documents and inspect their enrichments;
* You get the standard database guarantees: types, keys, and unique constraints that keep the data consistent;
* SQLite is portable and can be read by numerous other types of software.
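The point about atomic, incremental writes and resumable runs can be illustrated with a small sketch. Everything here is hypothetical: the `results` table, the `enrich_pending` helper, and `fake_model`, which stands in for a real LLM call. It shows the general technique of committing one row at a time and skipping rows that already have a result, not Doctrail's internals.

```python
import sqlite3


def fake_model(text: str) -> int:
    """Stand-in for a real LLM call: 1 if the text mentions 'union'."""
    return int("union" in text.lower())


def enrich_pending(conn: sqlite3.Connection, limit: int) -> int:
    """Code up to `limit` documents that have no result yet; return how many."""
    pending = conn.execute(
        """SELECT d.id, d.text FROM documents d
           LEFT JOIN results r ON r.doc_id = d.id
           WHERE r.doc_id IS NULL ORDER BY d.id LIMIT ?""",
        (limit,),
    ).fetchall()
    for doc_id, text in pending:
        with conn:  # one transaction per row: commit on success, roll back on error
            conn.execute("INSERT INTO results VALUES (?, ?)", (doc_id, fake_model(text)))
    return len(pending)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("CREATE TABLE results (doc_id INTEGER PRIMARY KEY, mentions_union INTEGER)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(1, "The Union must hold."), (2, "On taxation."), (3, "Union and liberty.")])
conn.commit()

first = enrich_pending(conn, limit=2)    # pretend the run was interrupted here
second = enrich_pending(conn, limit=10)  # rerun: only the unfinished row is coded
print(first, second)
```

Because each result is committed as it arrives, the second call finds only the one remaining document and nothing is coded twice.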

## Typical workflows

Here are two main ways that doctrail can be used.

**The qualitative triage loop**. I have 3,000 court decisions, and I have a hunch that some much smaller number of them contain data relevant to my research question. I might want to code these with variables, or I may simply want to closely read and think about what is in them and the differences between them. One could use keywords to filter down the number, but it is difficult to think of every possible keyword that could define the research question. One wishes to employ human-like 'understanding' of meaning and to screen each of them individually, looking at the court decision and the research question and saying: "Is this relevant?" In doctrail this would be a cheap screening enrichment (`relevant: boolean`, i.e. the column name would be `relevant` and it could take a value of `0` or `1`), run in batch mode on a cheap model, and it would result in a few hundred cases for closer analysis. It is then simple to pull random samples from the excluded group to ensure that relevant documents were not left on the table. Not all of the remaining few hundred documents may be relevant, but Doctrail has quickly triaged the corpus down to a candidate set.

**A measurement pipeline.** Many times, one will wish to construct a qualitative measure and apply it identically to every row, producing a measure of some feature of a text that is theoretically relevant. Such a measure might only apply to a subset of documents in a large corpus, so Doctrail can perform the first task of reducing the population of relevant records with a SQL filter, and send only that subset off for enrichment. Whether that measure can be trusted is the subject of the next section.
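As a rough illustration of the triage loop, the sketch below builds a synthetic screening table and pulls an audit sample from the rows the screen rejected. The `screened` table and its data are made up for the example; in a real project the query would run against whatever view holds your `relevant` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per court decision with its screening result.
conn.execute("CREATE TABLE screened (doc_id INTEGER PRIMARY KEY, relevant INTEGER)")
conn.executemany(
    "INSERT INTO screened VALUES (?, ?)",
    [(i, 1 if i % 10 == 0 else 0) for i in range(1, 3001)],  # synthetic data
)

# The kept set goes on to closer reading or further enrichment.
kept = conn.execute("SELECT COUNT(*) FROM screened WHERE relevant = 1").fetchone()[0]

# Audit sample: 25 random decisions the screen rejected, for a human to read.
audit = conn.execute(
    "SELECT doc_id FROM screened WHERE relevant = 0 ORDER BY RANDOM() LIMIT 25"
).fetchall()
print(kept, len(audit))
```

If a human reading the audit sample finds relevant decisions among the rejected rows, the screening prompt needs tightening before the kept set can be trusted.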

## Validation

The qualitative coding of some feature in a document is a claim; one will often want to know if such claims are credible. Setting aside the question of truth, the two questions one will ask about any measure are: is it reliable? is it valid?

By **reliability**, we simply want to know whether different coders roughly converge on the same claims. If coders have low agreement about how some feature should be coded, you may have to rethink your measure. `doctrail icr` codes a random sample under several coders, and `doctrail icr-report` scores their agreement (Krippendorff's alpha, Cohen's kappa). Thus, doctrail allows you to randomly sample from your corpus, code such samples with several LLMs (and humans, for that matter), and test the reliability of the measure before running it across the full corpus.
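For readers who have not met these statistics, here is a minimal, self-contained sketch of Cohen's kappa for two coders. It is a textbook implementation written for illustration, not the code behind `doctrail icr-report`, and the example labels are invented.

```python
from collections import Counter


def cohens_kappa(coder_a: list, coder_b: list) -> float:
    """Cohen's kappa for two coders who labelled the same items in the same order."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must label the same non-empty set of items")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Agreement expected by chance, from each coder's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented codings of ten documents on a yes/no question.
coder_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
coder_b = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 2))
```

Here the coders agree on 8 of 10 items, but half of that agreement would be expected by chance given their label frequencies, so kappa comes out at about 0.6 rather than 0.8.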

Human coders are stored like LLM coders in the ledger; both are simply a coder identity. This means one can pool them and test agreement with the same command. To get human codings in, `doctrail overrides-export` writes a CSV template for a run (open it in Excel or anything), a human codes or corrects the rows, and `doctrail overrides-import` reads it back; the human then sits in the ledger as just another coder.

**Validity** is accuracy against a trusted standard. Because a human coder is just another coder in the comparison, the same `doctrail icr-report` gives you this for free: its pairwise table reports how closely each model agrees with the human, so the human-versus-model row is your validity measure. When you would rather eyeball cases than read a statistic, `doctrail review` opens a web UI that walks a human through the model's codings and shows a running accuracy.

These two affordances allow one to validate a codebook on a small random sample, read the disagreements, revise, and only scale once the LLM is behaving.

Doctrail's validation framework is in active development. A key idea is that doctrail *itself* is not intended to be your validation software. It is the canonical store of codings, and provides affordances for getting values in and out, but the statistics one creates will often need to be tailored closely to a specific project, and Doctrail facilitates getting your codes into a rectangle so you can do that.
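What "getting your codes into a rectangle" means can be shown with a short sketch. The long-format codings and the coder labels (`human`, `model_a`, `model_b`) below are synthetic, and the code assumes the human coded every document; it illustrates the reshaping step and a simple human-versus-model agreement rate, not any Doctrail export format.

```python
import csv
import io

# Synthetic long-format codings: (document, coder, value). Coder labels are made up.
codings = [
    (1, "human", "Madison"), (1, "model_a", "Madison"), (1, "model_b", "Hamilton"),
    (2, "human", "Jay"),     (2, "model_a", "Jay"),     (2, "model_b", "Jay"),
]

coders = sorted({coder for _, coder, _ in codings})
wide = {}
for doc_id, coder, value in codings:
    wide.setdefault(doc_id, {})[coder] = value

# One row per document, one column per coder: the "rectangle".
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["doc_id", *coders])
for doc_id in sorted(wide):
    writer.writerow([doc_id, *(wide[doc_id].get(c, "") for c in coders)])
rectangle = buffer.getvalue()

# Share of documents on which each model matches the human coder.
agreement = {
    c: sum(wide[d].get(c) == wide[d]["human"] for d in wide) / len(wide)
    for c in coders if c != "human"
}
print(rectangle)
print(agreement)
```

Once the codings are in this shape, any project-specific statistic can be computed in the tool of your choice.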

## Two example use cases

One project began with over 100,000 rows scraped from the Chinese internet. These include media reports, government press releases, announcements on the websites of hospitals, and more. The research question involved identifying the subset of these documents that described details of a specific policy, and then measuring the implementation of that policy in a systematic manner. First, Doctrail used a small LLM to run a 'relevance' filter on the documents; the prompt simply described the research question and said "Is this document relevant?" This removed the majority of the corpus. Of course, we then sampled from the 'irrelevant' set to make sure they were indeed irrelevant. Now, on a defined subset, it was possible and meaningful to apply successive enrichments that extracted structured data such as: `policy_name`, `year_began`, `fund_name`, `amount`, `families_involved`, and so forth. This is a far better approach than trying to define a dictionary upfront. And because everything is in SQLite, it became simple to progressively refine the funnel, so that in the end only a few dozen documents with highly diagnostic evidence were the subject of analysis.

Another project combined tens of thousands of editorials from three PRC state media outlets, in both English and Chinese. First, the table of editorials had to be turned into a table of country-editorial pairs, because this was the unit of analysis. SQLite made this simple and kept all our data together, linked by keys. We could then run successive enrichments across this reshaped table, before finally validating them against human codes. The codes stored in SQLite then fed directly into a build pipeline that performed the modeling and produced the descriptive statistics, tables, and validation measures, meaning that later changes in the database are automatically carried through all outputs.

## Other features

1. **Cache-friendly by default**. As long as the codebook is written with row-specific `{placeholder}` text at the end, most commercial model providers will give a large discount to the cached tokens, significantly reducing the inference costs;
2. **Batch mode.** `doctrail enrich <name> --execution-mode batch` submits through the providers' batch APIs (OpenAI, Anthropic, Gemini) at roughly half the regular price. Large runs are sharded into provider jobs, `doctrail batch watch` follows progress, results reconcile into the same ledger, and partially failed shards simply retry on the next append-mode run;
3. **Packed screening.** For rare-hit boolean screens over short texts, `pack_size` groups many rows into one call and `pack_response_mode: selected_indexes` has the model return only the indexes of the hits — so the 99% of rows that don't match cost almost no output tokens. This can significantly reduce costs for cheap screens;
4. **Cost guardrails.** Before a run, Doctrail estimates the spend and asks you to confirm once it crosses a threshold (default $5), so a misconfigured run cannot use all your money while you sleep; `--skip-cost-check` bypasses it and `--cost-threshold` moves the line;
5. **Model-agnostic.** OpenAI, Anthropic, and Gemini are built in, and OpenRouter is wired in too, so an enrichment can point at any of hundreds of models by name. You can instruct your agent to get Doctrail to list all available models on OpenRouter;
6. **Run diffing.** `doctrail diff-runs` shows precisely where two runs disagree — prompt v1 against v2, or one model against another — so you can see what a codebook change actually moved, then diagnose hard cases;
7. **Ingest from Zotero.** Besides a folder of files (~a dozen formats), `doctrail ingest --zotero` pulls a Zotero library or collection straight into the corpus, so your reference manager can be the source. You have to set this up first.

## Where to go

Use the [quick start](quickstart.md) to install and get going, the [tutorial](tutorial.md) for the guided walkthrough, the [code books](yaml.md) page for the complete config surface, and the [reference](cli.md) for exact commands and flags.

Or, better yet, don't do any of that! Just point your agent (e.g. Codex or Claude Code, whether terminal-based or a desktop application) at [llms.txt](https://doctrail.org/llms.txt) and tell it you want to enrich a pile of documents.

As long as you, the human operator, have a fairly clear mental model of how the machinery works, you don't have to manage the implementation details. You describe your goals to the agent, inspect and iterate on the codebook, and get the agent to use Doctrail to carry it out. Doctrail is designed to be driven by agents.

While enrichments are happening, you can open the SQLite database in a tool such as [TablePlus](https://tableplus.com/) to inspect them and then iterate.

<!-- quickstart.md -->

# Quick start

## Installation

doctrail is built to be driven by an agent. The setup is short:

1. Install [uv](https://docs.astral.sh/uv/) if you don't have it. uv is a Python package manager. It can be installed like this:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

(In general, read install scripts before piping them into a shell; this is the official uv installer.)

2. Install doctrail: `uv tool install doctrail`

3. Tell your agent to run `doctrail` — it prints how to operate itself and points to `doctrail agent`, the full operating guide.

(Alternatively, don't install anything. Just point your agent at https://doctrail.org/llms.txt and tell it you want to install uv and doctrail and start enriching.)

The rest of this page explains what it gets you, and how to drive doctrail yourself if you prefer.

## See it work, no API key needed

Before pointing it at your own files or spending a cent, run the tutorial:

```bash
doctrail init test
doctrail run test
```

This scaffolds a small corpus, a code book, and saved model responses into the current folder, then runs the whole pipeline offline. The [tutorial](tutorial.md) walks through exactly what just happened.

## On your own files

The assumption is simple: you are in a project folder, and your documents are in a subfolder of it.

1. Set an API key for your provider, or let `doctrail init` create a `.env` for you.
2. Run `doctrail init` in your project folder.
3. Ingest your documents, write one code book, dry-run it, run a small sample, then open the database in any SQLite browser and look at the grid.

```bash
doctrail ingest --input-dir ./data --yes
doctrail enrich <name> --dry-run
doctrail enrich <name> --limit 5
```

If you would rather not learn the commands, you do not have to: install doctrail, then tell your agent to run `doctrail` and order it around.

## Before real model calls

The tutorial above uses saved replay responses, so it does not need an API key. Your own enrichments do. Put the key in the project folder's `.env` file, which is usually the cleanest option:

```bash
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
GEMINI_API_KEY=...
GOOGLE_API_KEY=...
OPENROUTER_API_KEY=...
```

You only need the line for the provider you plan to use. Doctrail also reads keys already exported in your shell environment, so a global key in `~/.zshenv` or `~/.bashrc` is fine if you want one default across projects. A project `.env` is better when different projects should use different providers or accounts. Do not commit `.env`; `doctrail init` adds it to `.gitignore`.
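If you are unsure which keys are visible, a quick check like the following sketch can help. The helper `configured_providers` is hypothetical and only inspects exported environment variables; it does not read a `.env` file, and grouping `GEMINI_API_KEY` with `GOOGLE_API_KEY` as alternatives for Gemini is an assumption based on how the variable names are listed.

```python
import os

# Variable names as listed above; the labels on the left are just display names.
PROVIDER_KEYS = {
    "OpenAI": ["OPENAI_API_KEY"],
    "Anthropic": ["ANTHROPIC_API_KEY"],
    "Gemini": ["GEMINI_API_KEY", "GOOGLE_API_KEY"],  # assumed to be alternatives
    "OpenRouter": ["OPENROUTER_API_KEY"],
}


def configured_providers(env=None) -> list[str]:
    """Return providers with at least one non-empty key in the environment."""
    env = os.environ if env is None else env
    return [name for name, keys in PROVIDER_KEYS.items()
            if any(env.get(k, "").strip() for k in keys)]


print(configured_providers())
```

An empty list means no key is exported in the current shell, in which case the project `.env` file is the place to look.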

<!-- tutorial.md -->

# Tutorial

If you want to understand how to use doctrail, follow these instructions very carefully, on your own computer, in order. It is only by doing this slowly once that the mental model clicks.

What you are about to do: turn a pile of documents into a grid, then add typed columns to that grid.[^column] "Typed" means each new column must take a certain kind of value — a number, a category ("enum"), a short string — and you decide which. You declare your schema (think: codebook) and the model fills the column in.

## Setup

After following the [quickstart](quickstart.md) to install doctrail, open your terminal, navigate to a fresh empty directory, and type:

```bash
doctrail init test
```

This sets up a hidden folder called `.doctrail` (configs live there), a folder called `./data` containing 18 Federalist files — 10 PDFs, 3 HTML files, 5 Word documents — plus UN speech excerpts, and `./out/database.db`, into which those files have already been ingested. The Federalist files include the twelve whose authorship was disputed for 150 years.

Install a database viewer — [TablePlus](https://tableplus.com/) is good — and open `out/database.db`. You will see a `documents` table: one row per file, with the extracted text. Your documents are now in a grid.

Alternatively, open your coding agent in the project folder, give it `https://doctrail.org/llms.txt`, and tell it to get started. That page is the full agent-facing manual in one file: commands, YAML structure, schema examples, and the basic workflow.

## First enrichment

You would like to know things about these files. Who wrote each one — Hamilton, Madison, or Jay? And how hard does the author lean on fear of disunion — measured on a scale from 0 to 5, per a detailed codebook you supply?

We have prepared this so you can test. Examine the file `.doctrail/enrichments/test.yml`. It declares three things: which rows to read, what to ask (the prompt — read it, it is a real codebook), and the shape every answer must take (the schema: an enum for the author, an integer 0–5 for the fear scale, a one-sentence rationale).

Now type:

```bash
doctrail run test
```

This is not really calling a model. We saved the responses ahead of time and the config says `model: replay` — everything else is exactly what a real run does. (When you use your own API key later, the only thing that changes is the model name.)

Open your database again. There is now a `_enrichments` table holding every answer, and a view called `v_documents_enriched`. That view is just a saved SQL query that lays the answers out wide: one row per document, one column per answer. You can change it, or get your agent to make more of them. Sort by `fear_of_disunion`. Check the model's authorship calls against the scholarly consensus column we included. Madison wins.

If you had only wanted a taste, you could have run:

```bash
doctrail run test --limit 5
```

The word `test` here is just the name of the enrichment.

## Second enrichment: your own quantity

Maybe you want to measure some other political or sociological quantity. Define a schema for it and put it in `.doctrail/enrichments/`. We prepared a second one, `securitization.yml`, over a different corpus (`doctrail init test` also gave you `./data/un_speeches/` — excerpts from UN General Debate addresses). It asks a classic question from international relations: does the speaker frame some issue — migration, climate, a pandemic — as an existential threat that justifies extraordinary measures? This schema is more complex: it has a boolean gate, and the issue and intensity fields only mean something when the gate is true. Look at it, then:

```bash
doctrail run securitization --limit 100
```

The code book defines the model, the prompt, and how the model must respond. That "how" is the schema, and it is the whole trick: the model is forced to return structured data matching your codebook, every time, and doctrail stores it where SQL can reach it.

## Kicking it up a notch

Two more ideas, because this is where the real power is.

One document, many annotations. A speech mentions many countries. You don't want one answer per speech — you want one answer per country per speech. The `country_mentions.yml` enrichment shows the pattern: the schema returns a list of objects (country, stance), and doctrail explodes them so each mention becomes its own row in the review view.
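The "explode" step can be pictured with a small sketch. The JSON answer, the stance labels, and the `mentions` table below are all invented for illustration; this is the general pattern of turning one list-valued answer into one row per item, not Doctrail's implementation.

```python
import json
import sqlite3

# A made-up structured answer for one speech: a list of (country, stance) objects.
answer = json.loads(
    '[{"country": "France", "stance": "supportive"},'
    ' {"country": "Brazil", "stance": "critical"}]'
)

conn = sqlite3.connect(":memory:")
# Hypothetical table; the unit of analysis is the country-speech pair.
conn.execute("CREATE TABLE mentions (speech_id INTEGER, country TEXT, stance TEXT)")
conn.executemany(
    "INSERT INTO mentions VALUES (?, ?, ?)",
    [(1, item["country"], item["stance"]) for item in answer],
)
rows = conn.execute("SELECT * FROM mentions ORDER BY country").fetchall()
print(rows)
```

One speech has become two rows, each of which can be filtered, counted, or joined back to the speech by `speech_id`.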

Multiple models as coders. Anything you measure with one model, you can measure with two and ask how much they agree. We canned responses from two different models:

```bash
doctrail icr country_stance -m replay/coder-a -m replay/coder-b
doctrail icr-report --field stance
```

That prints Krippendorff's alpha and Cohen's kappa — the same intercoder reliability statistics you would report for human coders. If the models can't agree, your codebook is not as clear as you thought, exactly like with research assistants.

To see what the numbers mean, we prepared two more, one at each extreme:

```bash
doctrail icr mentions_climate -m replay/coder-a -m replay/coder-b
doctrail icr-report --field mentions_climate
doctrail icr optimism -m replay/coder-a -m replay/coder-b
doctrail icr-report --field optimism
```

The first asks whether the speech mentions climate change at all. Agreement is near perfect — of course it is: that is a fact of the text, and the codebook says exactly what counts. The second asks how "optimistic" the speech is, 0 to 5, and we wrote that prompt the way people write prompts when they are not thinking: no anchors, no examples, no definition of what a 3 is. Agreement collapses. Now look at why:

```bash
doctrail view pivot icr_optimism -e optimism --by-model
```

Open that view in your database browser and read the rows where the coders differ. The texts are not ambiguous — the codebook is: one coder read "optimism" as tone, the other as concrete commitments, and both readings are defensible because we never said which we meant. The fix is not a better model; it is a better codebook. That loop — code, measure agreement, read the disagreements, tighten the codebook — is the whole discipline, and it now costs minutes instead of a semester of research assistants.

## Now do it on your files

You have now created and run enrichments end to end: ingest, declare a codebook, enrich, inspect the view, scale up, validate.

You are ready to open your own project folder in the terminal (how to use the terminal is another topic — if you can't do that, we cannot help you) and point doctrail at your files.

However, the best way to use doctrail is with a terminal agent, like Claude Code or Codex. OpenAI and Anthropic also produce desktop software that bundles the same functionality. That way you barely need to know how terminals work, or you can skip the terminal entirely and use the GUI: open your agent in the right folder, and tell it to run `doctrail docs` — the full manual prints straight from the CLI, no internet required. Then order it around. A typical way of working: describe the outputs you want; the agent whips up a YAML template; you look at it; you tell the agent to run it with `--limit 10`; you open the database and look at the first `v_` view. If you like what you see, run the lot.

You only need a pile of files you want the text out of, and questions you want answered about each one. Think in terms of the grid: your files are rows, and you are adding typed columns.

One last practical point: the tutorial used replayed responses, so it did not need a provider key. Real enrichments do. Before you swap `model: replay` for OpenAI, Anthropic, Gemini, or OpenRouter, put the relevant key in your project `.env` file, or export it from your shell startup file such as `~/.zshenv` or `~/.bashrc`. The environment variable names are `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY` or `GOOGLE_API_KEY`, and `OPENROUTER_API_KEY`.

[^column]: Note that, under the hood, doctrail does *not* simply add a column to your document grid. It is easy to think of it that way, but this would be the wrong data model. The reason is that your schema may become complicated, and you may want to add lots of enrichments of many forms. Perhaps one document needs multiple sets of the same type of enrichment. For example, a document may deal with country1, country2, and country3; in that case, your unit of analysis is not a document but a "country-document", and you have a many-to-one relationship between countries and documents. What doctrail actually does is explained in [the data model](data-model.md).

<!-- yaml.md -->

# Code books

A code book is the small YAML file you write for each enrichment: it says which rows to read, what to ask, and what shape the answer must take. (YAML is just the plain-text format it is written in — and you will usually have an agent write it.) A code book draws on four ideas: a database, SQL queries, enrichments, and schemas. The tested examples in `tests/schema_examples/` are the first source of truth. The snippets below are either copied from that corpus or smoke-checked with `doctrail enrich --dry-run`.

## Minimal single-field enum

Source: `tests/schema_examples/single_field/classify_language.yml`.

```yaml
name: classify_language
description: "Classify the language of the document"
model: gpt-4o-mini
input:
  query: all_docs
  input_columns: ["raw_content"]
output_column: detected_language
schema:
  enum: ["english", "chinese", "mixed", "other"]
prompt: |
  Analyze this document and classify its primary language.
  Return one of: english, chinese, mixed, or other.
```

## Boolean field

Source: `tests/schema_examples/single_field/validate_content.yml`.

```yaml
name: validate_content
description: "Validate document content quality"
input:
  query: all_docs
  input_columns: ["raw_content", "filename"]
schema:
  content_valid: {type: "boolean"}
prompt: |
  Check if this document has valid, readable content.
  Return 'true' if the document has valid, readable content.
  Return 'false' if it's empty, corrupted, or unreadable.
```

## Full current-path config

Smoke-checked with `doctrail enrich --config config.yml packed_relevance --dry-run` and `doctrail enrich --config config.yml donor_payment --dry-run`.

```yaml
database: ./docs.db
default_model: gpt-4o-mini
default_table: documents
project: docs_smoke

sql_queries:
  all_docs: |
    SELECT rowid, sha1 FROM documents ORDER BY rowid
  donor_docs: |
    SELECT rowid, sha1 FROM documents
    WHERE raw_content LIKE '%donor%'
    ORDER BY rowid

enrichments:
  - name: packed_relevance
    input:
      query: all_docs
      input_columns: ["filename", "raw_content:240"]
    system_prompt: "Return only data matching the schema."
    append_file: context.md
    prompt: |
      Mark each item true if it mentions donor-family payment evidence.
    output_column: is_relevant
    schema: {type: "boolean"}
    dedupe_scope: query
    key_column: sha1
    pack_size: 5
    pack_response_mode: selected_indexes

  - name: donor_payment
    input:
      query: donor_docs
      input_columns: ["filename", "raw_content:500"]
    prompt: |
      Extract donor-family payment evidence from the supplied document.
      Use the filename only as source context.

      Codebook:
      - condolence_money: direct cash or ceremonial money given to a donor family.
      - relief_fund: money from a relief, hardship, charity, or assistance fund.
      - subsidy: public or institutional subsidy, reimbursement, or benefit.
      - none: no donor-family payment evidence appears.

      If payment_type is none, amount must be null and evidence should briefly say no payment evidence was found.
      If the amount is not stated, amount must be null.
    schema:
      payment_type: {enum: ["condolence_money", "relief_fund", "subsidy", "none"]}
      amount: {type: "float", minimum: 0, optional: true}
      evidence: {type: "string", maxLength: 300}
      tags: {enum_list: ["cash", "policy", "medical", "unclear"], max_items: 3}
      evidence_items:
        type: "array"
        maxItems: 3
        items:
          type: "object"
          properties:
            phrase: {type: "string", maxLength: 120}
            payment_kind: {enum: ["cash", "subsidy", "unknown"]}
```

## Prompt construction

Write `prompt` as a codebook, not just a question. Define every enum value, anchor every numeric scale point, and state gate/null behavior for optional fields. For example, if `has_payment` is false, say which evidence fields must be null; if a score runs from 0 to 5, define 0, the middle, and 5.

Keep the prompt as static as possible. Doctrail renders the prompt first, adds schema instructions there for JSON-mode paths or sends provider-native schema payloads as constant request structure, and appends per-row `input_columns` content last. OpenAI automatic prompt caching and Gemini implicit caching on Gemini 2.5 and newer models can bill repeated prefix tokens at cached-input rates when the provider reports a cache hit, though Gemini does not guarantee savings on every request. Doctrail does not currently set Anthropic `cache_control` markers.

Prefer `input_columns` with `:N` truncation, such as `raw_content:3000`, over `{column}` interpolation inside the prompt. Placeholders still work, but a row-specific token breaks the shared prompt prefix there, so everything after it re-bills at full input price per row. If one is truly needed, put it at the very end.
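The effect of placement on the shared prefix can be seen in a toy comparison. The codebook text and row texts below are placeholders, and characters are used as a rough proxy for tokens; real providers cache on tokens and apply their own minimum lengths and rules, so treat this only as an illustration of why row-specific content belongs at the end.

```python
import os

# Placeholder codebook and rows, invented for this sketch.
CODEBOOK = "Code this text using the codebook below.\nScale: 0 is none, 5 is extreme.\n"
rows = ["We the people of New York", "Among the numerous advantages"]

# Layout A: static prompt first, row text appended last.
appended = [CODEBOOK + "\nTEXT:\n" + text for text in rows]
# Layout B: row text interpolated at the top, before the codebook.
interpolated = ["TEXT:\n" + text + "\n" + CODEBOOK for text in rows]

# os.path.commonprefix compares character by character, so it works on any strings.
shared_a = len(os.path.commonprefix(appended))
shared_b = len(os.path.commonprefix(interpolated))
print(shared_a, shared_b)
```

In layout A the whole codebook is shared across requests; in layout B the requests diverge after the first few characters, so nothing useful can be reused.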

## Token economy pattern

Use a cheap packed pass before an expensive extraction pass when most rows are likely irrelevant. In the example above, `packed_relevance` keeps each row small with `raw_content:240`, sends five rows per request with `pack_size: 5`, and uses `pack_response_mode: selected_indexes` so the model only has to name the relevant item indexes.

That pattern is meant for triage. The packed pass writes a normalized answer for each input row, including false answers for rows the model does not select, so append mode can skip completed packed batches on rerun. Run the richer `donor_payment` extraction only on the narrowed SQL query or a saved relevance view.
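The normalization step described above can be sketched as follows. The function `expand_selected` is hypothetical, and zero-based indexes are an assumption (the indexing Doctrail actually uses is not stated here); the sketch only shows how a short list of selected indexes becomes an explicit true or false answer for every row in the pack.

```python
def expand_selected(pack: list[str], selected: list[int]) -> dict[str, bool]:
    """Map every key in a pack to True/False from the indexes the model returned.

    Assumes zero-based indexes; this is an illustrative choice, not a
    statement about Doctrail's behaviour.
    """
    bad = [i for i in selected if not 0 <= i < len(pack)]
    if bad:
        raise ValueError(f"indexes out of range for pack of {len(pack)}: {bad}")
    chosen = set(selected)
    return {key: (i in chosen) for i, key in enumerate(pack)}


pack = ["sha1-a", "sha1-b", "sha1-c", "sha1-d", "sha1-e"]  # five rows, as with pack_size: 5
result = expand_selected(pack, [3])
print(result)
```

The model's output is a single index, yet every row ends up with a stored answer, which is what lets a rerun skip packs that are already complete.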

## Multi-table inputs

Source: `tests/schema_examples/multi_field/multi_table_review.yml`, with its legacy `output_table` comment removed. The feature is the `table.column` input syntax.

```yaml
name: multi_table_review
description: "Review and refine extraction using data from multiple tables"
model: gpt-4o
input:
  query: docs_with_entities
  input_columns:
    - "documents.raw_content:1000"
    - "documents.filename"
    - "extracted_entities.organizations"
    - "extracted_entities.locations"
    - "extracted_entities.key_terms"
    - "extracted_entities.entity_count"
schema:
  organizations: {type: "array", items: {type: "string"}, maxItems: 10}
  locations: {type: "array", items: {type: "string", lang: "zh"}, maxItems: 5}
  key_terms: {type: "array", items: {type: "string"}, maxItems: 15}
  entity_count: {type: "integer", minimum: 0}
  quality_score: {type: "float", minimum: 0.0, maximum: 1.0}
  corrections_made: {type: "string", maxLength: 500}
prompt: |
  Review the previous entity extraction and source document supplied below.

  Use the existing entity fields as draft model output, not as ground truth.
  Correct omissions, remove spurious entities, and explain major corrections.
```

## Language validation

Source: `tests/schema_examples/advanced/test_array_language_validation.yml`.

```yaml title="fragment"
schema:
  summary_zh: {type: "string", lang: "zh"}
  evidence_en: {type: "array", items: {type: "string", lang: "en"}, maxItems: 5}
  keywords_zh: {type: "array", items: {type: "string", lang: "zh"}, maxItems: 10}
  confidence: {type: "float", minimum: 0, maximum: 1}
```

## Multiple models

Source: `tests/schema_examples/advanced/test_enrich_multi_model.yml`.

```yaml
enrichments:
  - name: compare_models
    description: "Compare outputs from multiple models"
    model: ["gpt-4o-mini", "gpt-3.5-turbo"]
    input:
      query: all_docs
      input_columns: ["raw_content:500"]
    schema:
      summary: {type: "string"}
      sentiment: {enum: ["positive", "negative", "neutral"]}
    prompt: |
      Provide a brief summary of this document and determine its sentiment.
      Summary should be one sentence only.
```

## YAML imports

Source: `tests/schema_examples/test_yaml_imports.yml`.

```yaml title="fragment"
database: placeholder
default_model: gpt-4o-mini

sql_queries:
  all_docs: |
    SELECT rowid, sha1 FROM documents
    ORDER BY rowid

enrichments:
  - !import single_field/classify_language.yml
  - !import single_field/validate_content.yml
```

## Schema surface

| Schema form | Status | Note |
| --- | --- | --- |
| `{type: "string"}` | stable | text |
| `{type: "integer"}` | stable | whole numbers |
| `{type: "float"}` | stable | decimals |
| `{type: "boolean"}` | stable | true/false |
| `{enum: [...]}` | stable | one choice |
| `{enum_list: [...]}` | stable | several choices |
| `{type: "array", items: ...}` | stable | JSON list |
| `{type: "object", properties: ...}` | stable | JSON object |
| `optional: true` | stable | allows null |
| `nullable: true` | stable | alias for `optional: true` |
| `minimum`, `maximum` | stable | numeric bounds |
| `minLength`, `maxLength`, `pattern` | stable | string bounds |
| `minItems`, `maxItems` | stable | array bounds |
| `lang: "zh"` or `lang: "en"` | stable | language check |
| `convert: "chinese_to_pinyin"` | internal | plugin hook |
| `{type: "number"}` | deprecated | use `float` |
| `schema: [...]` | deprecated | use `enum` |

## Config key stability

Batch 4 inventory source: `.get(...)` and direct config reads across `src/doctrail/`.

| Key | Status | Note |
| --- | --- | --- |
| `database` | stable | SQLite path |
| `project` | stable | run tag |
| `default_table` | stable | source fallback |
| `default_model` | stable | model fallback |
| `sql_queries` | stable | named SQL |
| `enrichments` | stable | task list |
| `name` | stable | enrichment id |
| `input.query` | stable | named or inline SQL |
| `input.input_columns` | stable | prompt payload |
| `prompt` | stable | user prompt |
| `system_prompt` | stable | system message |
| `append_file` | stable | prompt appendix |
| `model` | stable | one or many |
| `models` | stable | model config map |
| `schema` | stable | output contract |
| `output_column` | stable | single-field alias |
| `key_column` | stable | row key |
| `dedupe_scope` | stable | query/prompt/enrichment |
| `pack_size` | stable | sync grouping |
| `pack_response_mode` | stable | exhaustive or selected_indexes |
| `output_table` | deprecated | compatibility hint |
| `dedupe-scope` | deprecated | use underscore |
| `--execution-mode openai-batch` | deprecated | use `batch` |
| `output_columns` | internal | derived/legacy |
| `all_fields_optional` | internal | broad optionality |
| `min_input_chars` | internal | skip threshold |
| `truncate` | internal | prefer CLI flag |
| `reasoning_effort` | internal | GPT-5 control |
| `exports` | internal | old export path |
| `output_naming` | internal | export filenames |
| `views.priority_columns` | internal | view sorting |
| `zotero` credentials | internal | ingest plugin |
| `documents_path` | internal | init/ingest helper |

<!-- data-model.md -->

# Data model

Source tables or views keep the original inputs. `documents` is the name of the default table that stores the primary corpus.

The rows you want to enrich are defined in SQL:

```yaml
sql_queries:
  all_docs: |
    SELECT rowid, sha1 FROM documents
```

The query must return a stable key column. `sha1` is the default because Doctrail assumes that the text came from a file (such as a PDF, DOCX, or HTML file), and a SHA-1 hash is an effectively unique identifier for a file. Use `key_column` when the source uses a different identifier.
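
To see why a hash makes a stable key, here is a small sketch using Python's standard `hashlib`. It assumes the key is computed over raw bytes; exactly which bytes Doctrail hashes at ingest is not stated here, and `sha1_of_bytes` is a hypothetical helper.

```python
import hashlib

def sha1_of_bytes(data: bytes) -> str:
    """Return the 40-character hexadecimal SHA-1 digest of some bytes."""
    return hashlib.sha1(data).hexdigest()

# Identical content always yields the same key; different content yields a different key.
a = sha1_of_bytes(b"Federalist No. 10")
b = sha1_of_bytes(b"Federalist No. 10")
c = sha1_of_bytes(b"Federalist No. 51")
print(a == b, a == c)
```

A key with this property lets results be joined back to the same source row across reruns, which a position-dependent value such as a row number cannot guarantee after re-ingestion.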

## Schema at a glance

This diagram shows how the ledger tables link back to your documents. Joins are by the key column (default `sha1`); these are logical relationships, not enforced SQLite foreign keys. Views (`v_`) are not shown because they are pivots computed over `_enrichments`.

```mermaid
erDiagram
    documents ||--o{ "_enrichments" : "key_value = sha1"
    documents ||--o{ "_enrichment_audit" : "key_value = sha1"
    "_enrichment_runs" ||--o{ "_enrichments" : run_id
    "_enrichment_runs" ||--o{ "_enrichment_audit" : run_id
    "_enrichment_runs" ||--o{ "_enrichment_run_items" : run_id
    "_prompts" ||--o{ "_enrichments" : prompt_hash
    documents {
        text sha1 PK
        text filename
        text raw_content
    }
    "_enrichments" {
        int id PK
        text key_value FK
        text enrichment_name
        text field_name
        text value
        text value_type
        text model
        text prompt_hash
        text run_id FK
    }
    "_enrichment_audit" {
        int id PK
        text key_value FK
        text raw_json
        text projection_json
        text run_id FK
    }
    "_enrichment_runs" {
        text run_id PK
        text enrichment_name
        text model
        text dedupe_scope
    }
    "_enrichment_run_items" {
        int id PK
        text run_id FK
        text key_value
    }
    "_prompts" {
        text prompt_id PK
        text prompt_hash
        text prompt_text
    }
```

## Normalized storage

`_enrichment_audit` stores raw calls, raw responses, prompt/query provenance, errors, and the normalized projection payload used for rebuilds.

`_enrichments` stores parsed fields in long form:

```text
key_value | enrichment_name | field_name | value | model | prompt_hash | run_id
```

A row's identity is the combination of key value, enrichment name, field name, model, and prompt hash. An upsert keeps only the current value for that identity, while raw responses and projection payloads remain available in the audit history. Migration side tables preserve recoverable legacy duplicates.
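
The identity rule can be sketched with SQLite's upsert syntax. The table `demo_enrichments` is a simplified, hypothetical stand-in for `_enrichments`, and updating `run_id` on conflict is an assumption rather than documented behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE demo_enrichments (
        key_value TEXT, enrichment_name TEXT, field_name TEXT,
        value TEXT, model TEXT, prompt_hash TEXT, run_id TEXT,
        UNIQUE (key_value, enrichment_name, field_name, model, prompt_hash)
    )
""")

# Insert a row, or replace the value when the same identity already exists.
UPSERT = """
    INSERT INTO demo_enrichments VALUES (?, ?, ?, ?, ?, ?, ?)
    ON CONFLICT (key_value, enrichment_name, field_name, model, prompt_hash)
    DO UPDATE SET value = excluded.value, run_id = excluded.run_id
"""
conn.execute(UPSERT, ("abc", "sentiment", "label", "positive", "gpt-4o-mini", "h1", "run1"))
conn.execute(UPSERT, ("abc", "sentiment", "label", "neutral", "gpt-4o-mini", "h1", "run2"))

rows = conn.execute("SELECT value, run_id FROM demo_enrichments").fetchall()
print(rows)
```

The second write shares all five identity columns with the first, so the table still holds one row. A different model or prompt hash would produce a second row instead.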

`_prompts` stores prompt text, system prompt text, model, and prompt hash.

`_enrichment_runs` stores run-level provenance: query SQL/hash, model, prompt id, key column, source name, dedupe scope, project, and execution mode.

`_enrichment_run_items` stores the exact input rowset for a run when materialization is enabled.

Doctrail-managed tables use a leading underscore so they do not look like source tables. Schema migration 2 renames older `enrichments`, `enrichment_audit`, `enrichment_runs`, and related bookkeeping tables when it opens an existing database.

## Dedupe

Append mode skips a row only after a successful normalized result exists in `_enrichments` for the dedupe scope. Audit rows alone do not count as completion.

Null answers are still answers. A parsed null is stored in `_enrichments` with `value_type = 'null'`, so append mode will not resubmit the same row for the same dedupe scope. View type detection ignores those null rows when deciding whether a field should be cast as numeric.
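
The completion rule can be illustrated with a toy database. The tables `demo_enrichments` and `demo_audit` are hypothetical simplifications, and the query is a sketch of the idea, not the SQL Doctrail runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (sha1 TEXT PRIMARY KEY);
    CREATE TABLE demo_enrichments (key_value TEXT, field_name TEXT, value TEXT, value_type TEXT);
    CREATE TABLE demo_audit (key_value TEXT, raw_json TEXT);
    INSERT INTO documents VALUES ('a'), ('b'), ('c');
    -- 'a' has a parsed answer, 'b' has a parsed null, 'c' has only an audit row.
    INSERT INTO demo_enrichments VALUES ('a', 'label', 'positive', 'string');
    INSERT INTO demo_enrichments VALUES ('b', 'label', NULL, 'null');
    INSERT INTO demo_audit VALUES ('c', '{"error": "timeout"}');
""")

# Rows still pending are those with no normalized result, regardless of audit history.
pending = [r[0] for r in conn.execute("""
    SELECT sha1 FROM documents
    WHERE sha1 NOT IN (SELECT key_value FROM demo_enrichments)
    ORDER BY sha1
""")]
print(pending)
```

Row `b` is treated as done even though its value is null, while row `c` would be resubmitted because only an audit record exists for it.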

Scopes:

| Scope | Meaning |
| --- | --- |
| `query` | same enrichment, model, prompt, and query |
| `prompt` | same enrichment, model, and prompt |
| `enrichment` | same enrichment and model |
| `name` | legacy alias for `enrichment` |
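
One way to read the table is as a choice of which fields make up the "already done" key. The function below is a hypothetical sketch of that reading; `dedupe_key` and its parameters are illustrative names, not part of Doctrail's API.

```python
def dedupe_key(scope: str, enrichment: str, model: str,
               prompt_hash: str, query_hash: str) -> tuple:
    """Return the tuple of fields that must match for a row to count as already done."""
    if scope == "name":  # legacy alias for `enrichment`
        scope = "enrichment"
    parts = {
        "enrichment": (enrichment, model),
        "prompt": (enrichment, model, prompt_hash),
        "query": (enrichment, model, prompt_hash, query_hash),
    }
    return parts[scope]

# Changing only the prompt produces a new key under `prompt` scope but not under `enrichment`.
old = dedupe_key("prompt", "sentiment", "gpt-4o-mini", "h1", "q1")
new = dedupe_key("prompt", "sentiment", "gpt-4o-mini", "h2", "q1")
print(old != new)
```

Under this reading, narrower scopes such as `query` cause more reprocessing after a change, while `enrichment` reuses earlier answers even when the prompt or query has been edited.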

## Views

Run views show one persisted run in wide form. They are best for pilot runs, final runs, and human review.

Pivot views build reusable wide analysis surfaces over normalized enrichments.

Spec views are YAML-defined review surfaces. They can include source columns, enrichment columns, and one exploded JSON-array field.

Final views and final tables layer human overrides or materialize an editable dataset without changing the original model run.

Doctrail-managed views are prefixed with `v_`. For example, a run view is named like `v_run_<enrichment>_<timestamp>`, a spec view named `payments_review` is created as `v_payments_review`, and a final view is named like `v_final_<enrichment>_<timestamp>`.

Model-collapse caveat: default views choose a current/latest value per field. Use `--by-model`, run-specific views, or explicit fields when comparing models.
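
To make the long-to-wide step concrete, here is a small sketch of a pivot view built with conditional aggregation. The table `demo_enrichments` and the view `v_demo_pivot` are hypothetical, and this is not the SQL that Doctrail generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demo_enrichments (key_value TEXT, field_name TEXT, value TEXT, model TEXT);
    INSERT INTO demo_enrichments VALUES
        ('a', 'summary', 'Short text', 'gpt-4o-mini'),
        ('a', 'sentiment', 'positive', 'gpt-4o-mini'),
        ('b', 'summary', 'Other text', 'gpt-4o-mini'),
        ('b', 'sentiment', 'neutral', 'gpt-4o-mini');

    -- One output row per document, one column per field.
    CREATE VIEW v_demo_pivot AS
    SELECT key_value,
           MAX(CASE WHEN field_name = 'summary' THEN value END) AS summary,
           MAX(CASE WHEN field_name = 'sentiment' THEN value END) AS sentiment
    FROM demo_enrichments
    GROUP BY key_value;
""")

wide = conn.execute("SELECT * FROM v_demo_pivot ORDER BY key_value").fetchall()
print(wide)
```

Once a second model writes the same fields, a pivot like this has to pick one value per cell, which is the collapse the caveat above describes. Grouping by model as well keeps the outputs separate.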

<!-- cli.md -->

# CLI

Generated from the live Click command tree.

## doctrail

```text
Usage: doctrail [OPTIONS] COMMAND [ARGS]...

  SQLite document enrichment with normalized outputs and derived views.

  Agents: run `doctrail agent` for the full operating manual in one shot —
  the mental model, the enrichment workflow, and troubleshooting. Start
  there; everything else is discoverable from it.

  Humans: `doctrail docs` prints the complete reference manual; this --help lists every command.

Options:
  --skip-requirements  Skip system requirements check
  -v, --version        Show version
  --help               Show this message and exit.

Commands:
  agent                Print the full agent guide: mental model, workflow, troubleshooting.
  batch                Manage submitted provider batch enrichment runs.
  diff-runs            Show where two runs disagree.
  docs                 Print the packaged manual: everything an agent needs, in one file.
  document             Get a single document by ID.
  edit                 Open a project enrichment YAML in $EDITOR.
  enrich               Enrich database content using LLM processing.
  export               Export enriched data in various formats.
  finalize             Materialize an editable final table from a run or existing review view.
  icr                  Run intercoder reliability: enrich sampled rows with multiple models.
  icr-report           Compute intercoder reliability statistics from enrichment codings.
  ingest               Ingest documents from local directories, Zotero, or plugins.
  init                 Initialize a doctrail project in the current directory.
  list-enrichments     List all available enrichments from a configuration file.
  models               List doctrail model identifiers by backend.
  new                  Create a new custom enrichment.
  overrides-export     Export one run to a CSV template for human review and overrides.
  overrides-import     Import human overrides from a CSV and refresh the final merged view.
  query                Query the database.
  rebuild-enrichments  Rebuild _enrichments exactly from projection payloads stored in...
  review               Validate enrichment accuracy with a web UI.
  run                  Enrich database content using LLM processing.
  runs                 List recent enrichment runs with persisted run IDs and summary counts.
  serve                Start the Doctrail multi-database server.
  skill                Print or install the packaged Doctrail skill.
  sql                  Execute a read-only SQL query (SELECT only).
  stats                Get database statistics.
  view                 Manage derived views for reviewing and analyzing normalized enrichments.
```

### doctrail agent

```text
Usage: doctrail agent [OPTIONS]

  Print the full agent guide: mental model, workflow, troubleshooting.

  This is the entry point for an LLM or coding agent driving doctrail. It prints the complete
  operating manual to stdout, no install required. Same content as `doctrail skill`; `agent` is the
  name agents reach for.

Options:
  --help  Show this message and exit.
```

### doctrail batch

```text
Usage: doctrail batch [OPTIONS] COMMAND [ARGS]...

  Manage submitted provider batch enrichment runs.

Options:
  --help  Show this message and exit.

Commands:
  cancel  Cancel all active batch shards for a run.
  poll    Poll batch jobs once and reconcile any completed outputs.
  watch   Poll until a batch-backed run is fully reconciled.
```

### doctrail batch cancel

```text
Usage: doctrail batch cancel [OPTIONS]

  Cancel all active batch shards for a run.

Options:
  --db-path TEXT  Path to SQLite database
  --run-id TEXT   Run ID to cancel  [required]
  --help          Show this message and exit.
```

### doctrail batch poll

```text
Usage: doctrail batch poll [OPTIONS]

  Poll batch jobs once and reconcile any completed outputs.

Options:
  --db-path TEXT  Path to SQLite database
  --run-id TEXT   Poll only one run ID
  --help          Show this message and exit.
```

### doctrail batch watch

```text
Usage: doctrail batch watch [OPTIONS]

  Poll until a batch-backed run is fully reconciled.

Options:
  --db-path TEXT    Path to SQLite database
  --run-id TEXT     Run ID to watch  [required]
  --interval FLOAT  Polling interval in seconds  [default: 5.0]
  --help            Show this message and exit.
```

### doctrail diff-runs

```text
Usage: doctrail diff-runs [OPTIONS]

  Show where two runs disagree.

Options:
  --db-path PATH   Path to SQLite database
  --run-a TEXT     First run ID  [required]
  --run-b TEXT     Second run ID  [required]
  --limit INTEGER  Max differing cells to show  [default: 20]
  --json           Output as JSON
  --help           Show this message and exit.
```

### doctrail docs

```text
Usage: doctrail docs [OPTIONS]

  Print the packaged manual: everything an agent needs, in one file.

Options:
  --help  Show this message and exit.
```

### doctrail document

```text
Usage: doctrail document [OPTIONS]

  Get a single document by ID.

Options:
  --db-path PATH        Path to SQLite database  [required]
  --id TEXT             Document ID (primary key value)  [required]
  --format [text|json]  Output format
  --help                Show this message and exit.
```

### doctrail edit

```text
Usage: doctrail edit [OPTIONS] NAME

  Open a project enrichment YAML in $EDITOR.

Options:
  --help  Show this message and exit.
```

### doctrail enrich

```text
Usage: doctrail enrich [OPTIONS] [ENRICHMENT_NAMES]...

  Enrich database content using LLM processing.

  Run enrichments by name:     doctrail enrich language     doctrail enrich language summarize

  In a doctrail project (.doctrail/ folder), enrichments are loaded from
  .doctrail/enrichments/<name>.yml and merged with .doctrail/config.yml. Model outputs are written
  to normalized tables; use `doctrail view create` or `doctrail view pivot` to inspect them in a
  wide, human-readable form.

Options:
  --config TEXT                   Path to config YAML (auto-detects .doctrail/config.yml)
  --enrichments TEXT              (Legacy) Enrichment task names
  --limit INTEGER                 Limit number of rows to process
  --overwrite                     Overwrite existing data in output columns
  --verbose                       Enable verbose logging
  --log-updates                   Log updates to a file
  --model TEXT                    Override the default model for all enrichments
  --db-path TEXT                  Override the database path from config
  --output-db TEXT                Write enrichments to this database instead of the source database
  --batch-size INTEGER            Override batch size for processing
  --rowid INTEGER                 Process only a specific row by rowid
  --sha1 TEXT                     Process only a specific row by sha1 hash
  --truncate                      Truncate long inputs to fit model context window instead of
                                  failing
  --skip-cost-check               Skip cost estimation and confirmation
  --cost-threshold FLOAT          Cost threshold for confirmation prompt (default: $5.00)
  --where TEXT                    Filter the enrichment query with an outer SQL WHERE predicate
  --query TEXT                    Replace the SQL query from config entirely
  --project TEXT                  Tag enrichments with a project name for filtering (e.g.,
                                  mock_compliance)
  --dry-run                       Preview without calling LLM: show row counts, schema, and sample
                                  input
  --dedupe-scope [query|prompt|enrichment|name]
                                  Append-mode dedupe scope. Overrides per-enrichment dedupe_scope.
  --materialize-inputs / --no-materialize-inputs
                                  Persist the exact input rowset for each run
  --execution-mode [sync|batch|openai-batch]
                                  How to execute the enrichment work. batch maps direct OpenAI
                                  models to /v1/batches -> /v1/chat/completions, direct Claude
                                  models to /v1/messages/batches -> /v1/messages, and direct Gemini
                                  models to File API upload ->
                                  /v1beta/models/{model}:batchGenerateContent. openai-batch is
                                  accepted as a legacy alias.  [default: sync]
  --allow-column-collision        Allow enrichment field names that match source table columns
  --help                          Show this message and exit.
```

### doctrail export

```text
Usage: doctrail export [OPTIONS]

  Export enriched data in various formats.

Options:
  --config TEXT       Path to the configuration YAML file  [required]
  --export-type TEXT  Type of export to run (e.g., parallel-translation, case-summaries)  [required]
  --output-dir TEXT   Override the default output directory from config
  --verbose           Enable verbose logging
  --help              Show this message and exit.
```

### doctrail finalize

```text
Usage: doctrail finalize [OPTIONS]

  Materialize an editable final table from a run or existing review view.

Options:
  --db-path PATH  Path to SQLite database
  --run-id TEXT   Materialize the final surface for one run ID
  --view TEXT     Materialize an existing review view instead of a run
  --table TEXT    Writable table name to create
  --replace       Replace the target table if it already exists
  --help          Show this message and exit.
```

### doctrail icr

```text
Usage: doctrail icr [OPTIONS] ENRICHMENT_NAME

  Run intercoder reliability: enrich sampled rows with multiple models.

  Example:     doctrail icr threat_coding -m openrouter/google/gemini-2.5-flash -m gpt-4o-mini
  --sample 50 --seed 42

Options:
  -m, --models TEXT   Models to use as coders (repeat for multiple)  [required]
  --sample INTEGER    Sample N rows (default: all)
  --stratify-by TEXT  Stratify sample by this enrichment field
  --seed INTEGER      Random seed for reproducibility
  --config TEXT       Path to config YAML (auto-detects .doctrail/config.yml)
  --db-path TEXT      Database path override
  --overwrite         Re-run models that already have codings
  --skip-cost-check   Skip cost confirmation
  --project TEXT      Tag enrichments with a project name
  --verbose           Verbose logging
  --help              Show this message and exit.
```

### doctrail icr-report

```text
Usage: doctrail icr-report [OPTIONS]

  Compute intercoder reliability statistics from enrichment codings.

  Example:     doctrail icr-report --db-path out/db.db --field hostility_level

Options:
  --db-path TEXT                  Database path (defaults to .doctrail/config.yml)
  --field TEXT                    Field name to analyse  [required]
  --enrichment-name TEXT          Filter by enrichment name
  -m, --models TEXT               Specific models to compare (repeat)
  --sample-id TEXT                Filter to specific ICR sample
  --level [nominal|ordinal|interval]
                                  Measurement level (auto-detected if omitted)
  -o, --output TEXT               Write CSV coding matrix to this path
  --verbose                       Verbose logging
  --help                          Show this message and exit.
```

### doctrail ingest

```text
Usage: doctrail ingest [OPTIONS]

  Ingest documents from local directories, Zotero, or plugins.

  Supported local file types: txt/md; csv/tsv; pdf; epub; mobi; doc; rtf; docx; xlsx; xls; pptx;
  ppt; djvu; mhtml/mht; html/htm; png/jpg/jpeg/gif/bmp/tiff/tif.

  Examples:     doctrail ingest --input-dir ./docs --db-path ./data.db     doctrail ingest --zotero
  --collection "Papers" --db-path ./lit.db     doctrail ingest --plugin zotero --collection "My
  Research"

Options:
  --config TEXT                   Path to the configuration YAML file
  --db-path TEXT                  SQLite database file, or a directory to use/create doctrail.db in
  --table TEXT                    Table name for documents
  --verbose                       Enable detailed logging
  --input-dir TEXT                Input directory (can repeat)
  --force                         Force ingest even if schema mismatch
  --overwrite                     Overwrite existing documents
  --limit INTEGER                 Limit files to process
  --include-pattern TEXT          Only process matching files
  --exclude-pattern TEXT          Skip matching files
  --workers INTEGER RANGE         Number of extraction worker threads  [x>=1]
  --pdf-engine [auto|pymupdf|pdftotext|mutool|mac-ocr]
                                  PDF extraction strategy
  --ocr-engine [auto|textra|ocrmypdf|mac-ocr]
                                  OCR backend when OCR is needed
  --readability                   Use readability for HTML
  --html-extractor [default|smart]
  --skip-garbage-check            Skip garbage detection
  -y, --yes                       Skip prompts
  --fulltext                      Create FTS index
  --manifest TEXT                 Path to manifest.json
  --zotero                        Zotero mode
  --collection TEXT               Zotero collection name
  --plugin TEXT                   Plugin name
  --plugin-dir TEXT               Custom plugins directory
  --cache-db TEXT                 [doi_connector] Cache database
  --project TEXT                  [doi_connector] Project name
  --base-path TEXT                [doi_connector] Base path
  --api-key TEXT                  [zotero] API key
  --user-id TEXT                  [zotero] User ID
  --zotero-dir TEXT               [zotero] Data directory
  --help                          Show this message and exit.
```

### doctrail init

```text
Usage: doctrail init [OPTIONS] [[test]] [[fed|un|econ-threat]]

  Initialize a doctrail project in the current directory.

  Creates: - .doctrail/config.yml     (project settings) - .doctrail/enrichments/   (your analysis
  tasks) - out/{name}.db            (database, unless --database is used) - .env
  (API key, unless --no-env is used)

  Doctrail stores model outputs in normalized enrichment tables and then materializes user-facing
  views for review and analysis.

  Example:     cd my_research/     doctrail init

Options:
  --name TEXT                     Project name (used for database filename)
  --api-key TEXT                  API key (or set interactively)
  --provider [openai|gemini|anthropic|openrouter]
                                  LLM provider (default: openai)
  --docs TEXT                     Path to documents folder (relative to current dir)
  --database TEXT                 Path to SQLite database to use in config
  --no-docs                       Skip document-folder setup for query-first projects
  --no-env                        Do not create a .env file; rely on existing environment variables
  -y, --yes                       Skip prompts, use defaults
  -e, --enrichments TEXT          Enrichments to set up (can repeat)
  --help                          Show this message and exit.
```

### doctrail list-enrichments

```text
Usage: doctrail list-enrichments [OPTIONS]

  List all available enrichments from a configuration file.

Options:
  --config TEXT  Path to the configuration YAML file  [required]
  --help         Show this message and exit.
```

### doctrail models

```text
Usage: doctrail models [OPTIONS]

  List doctrail model identifiers by backend.

Options:
  -p, --provider TEXT  Filter by backend (e.g. openai, anthropic, gemini, cli, openrouter/openai)
  -s, --search TEXT    Search models by name or identifier
  --refresh            Force refresh the underlying pricing or batch catalog cache
  -n, --limit INTEGER  Max models to display per section  [default: 10]
  --all                Show the full OpenRouter catalog with pricing
  --openai-batch       Show the verified OpenAI batch model catalog and batch pricing
  --json               Output as JSON
  --help               Show this message and exit.
```

### doctrail new

```text
Usage: doctrail new [OPTIONS] [NAME]

  Create a new custom enrichment.

  Example:     doctrail new sentiment --prompt "Classify the sentiment" --enum
  "positive,negative,neutral"

Options:
  -p, --prompt TEXT               Instructions for the LLM
  -o, --output TEXT               Output column/field name
  --type [string|integer|number|boolean|array]
                                  Output type (default: string)
  --enum TEXT                     Comma-separated enum values (e.g., "positive,negative,neutral")
  --overwrite                     Overwrite an existing enrichment YAML
  --help                          Show this message and exit.
```

### doctrail overrides-export

```text
Usage: doctrail overrides-export [OPTIONS]

  Export one run to a CSV template for human review and overrides.

Options:
  --db-path PATH  Path to SQLite database
  --run-id TEXT   Run ID to export for review  [required]
  --output PATH   CSV path to write (defaults to ./overrides_<run>.csv)
  --help          Show this message and exit.
```

### doctrail overrides-import

```text
Usage: doctrail overrides-import [OPTIONS]

  Import human overrides from a CSV and refresh the final merged view.

Options:
  --db-path PATH   Path to SQLite database
  --run-id TEXT    Run ID to import overrides into  [required]
  --input PATH     CSV created by overrides-export  [required]
  --reviewer TEXT  Reviewer name stored with imported overrides
  --help           Show this message and exit.
```

### doctrail query

```text
Usage: doctrail query [OPTIONS] [QUERY_OR_ID]

  Query the database.

  Examples:     doctrail query                    # List documents     doctrail query 1
  # Show details of document #1     doctrail query -c                 # List with content preview
  doctrail query --json             # Full JSON output     doctrail query "SELECT ..."       #
  Custom SQL

Options:
  -n, --limit INTEGER  Limit number of rows (default: 10)
  -t, --table TEXT     Table name (defaults to project default_table, then documents)
  --json               Output as JSON
  -c, --content        Show content preview
  --help               Show this message and exit.
```

### doctrail rebuild-enrichments

```text
Usage: doctrail rebuild-enrichments [OPTIONS]

  Rebuild _enrichments exactly from projection payloads stored in _enrichment_audit.

Options:
  --db-path PATH     Path to SQLite database
  --run-id TEXT      Restrict rebuild to one run ID
  --enrichment TEXT  Restrict rebuild to one enrichment name
  --key-value TEXT   Restrict rebuild to one document key
  -y, --yes          Skip confirmation
  --help             Show this message and exit.
```

### doctrail review

```text
Usage: doctrail review [OPTIONS] DB_PATH

  Validate enrichment accuracy with a web UI.

  Opens a browser-based interface for rapid Y/N validation of LLM classifications. Results are saved
  to the human_audit table.

  Examples:     doctrail review /path/to/db.db --field is_relevant --sample 50     doctrail review
  db.db --field language --sample 100 --table documents     doctrail review db.db --field
  is_relevant --config enrichment.yml

Options:
  --field TEXT        Field name to review (e.g., is_relevant)  [required]
  --sample INTEGER    Sample size per class (default: 50)
  --port INTEGER      Port to run server on (default: 8765)
  --table TEXT        Table name (default: articles)
  --config PATH       Config file to get truncation from input_columns
  --truncate INTEGER  Content truncation limit (overrides config)
  --help              Show this message and exit.
```

### doctrail run

```text
Usage: doctrail run [OPTIONS] [ENRICHMENT_NAMES]...

  Enrich database content using LLM processing.

  Run enrichments by name:     doctrail enrich language     doctrail enrich language summarize

  In a doctrail project (.doctrail/ folder), enrichments are loaded from
  .doctrail/enrichments/<name>.yml and merged with .doctrail/config.yml. Model outputs are written
  to normalized tables; use `doctrail view create` or `doctrail view pivot` to inspect them in a
  wide, human-readable form.

Options:
  --config TEXT                   Path to config YAML (auto-detects .doctrail/config.yml)
  --enrichments TEXT              (Legacy) Enrichment task names
  --limit INTEGER                 Limit number of rows to process
  --overwrite                     Overwrite existing data in output columns
  --verbose                       Enable verbose logging
  --log-updates                   Log updates to a file
  --model TEXT                    Override the default model for all enrichments
  --db-path TEXT                  Override the database path from config
  --output-db TEXT                Write enrichments to this database instead of the source database
  --batch-size INTEGER            Override batch size for processing
  --rowid INTEGER                 Process only a specific row by rowid
  --sha1 TEXT                     Process only a specific row by sha1 hash
  --truncate                      Truncate long inputs to fit model context window instead of
                                  failing
  --skip-cost-check               Skip cost estimation and confirmation
  --cost-threshold FLOAT          Cost threshold for confirmation prompt (default: $5.00)
  --where TEXT                    Filter the enrichment query with an outer SQL WHERE predicate
  --query TEXT                    Replace the SQL query from config entirely
  --project TEXT                  Tag enrichments with a project name for filtering (e.g.,
                                  mock_compliance)
  --dry-run                       Preview without calling LLM: show row counts, schema, and sample
                                  input
  --dedupe-scope [query|prompt|enrichment|name]
                                  Append-mode dedupe scope. Overrides per-enrichment dedupe_scope.
  --materialize-inputs / --no-materialize-inputs
                                  Persist the exact input rowset for each run
  --execution-mode [sync|batch|openai-batch]
                                  How to execute the enrichment work. batch maps direct OpenAI
                                  models to /v1/batches -> /v1/chat/completions, direct Claude
                                  models to /v1/messages/batches -> /v1/messages, and direct Gemini
                                  models to File API upload ->
                                  /v1beta/models/{model}:batchGenerateContent. openai-batch is
                                  accepted as a legacy alias.  [default: sync]
  --allow-column-collision        Allow enrichment field names that match source table columns
  --help                          Show this message and exit.
```

### doctrail runs

```text
Usage: doctrail runs [OPTIONS]

  List recent enrichment runs with persisted run IDs and summary counts.

Options:
  --db-path PATH     Path to SQLite database
  --enrichment TEXT  Filter to one enrichment name
  --limit INTEGER    Max runs to list  [default: 20]
  --json             Output as JSON
  --help             Show this message and exit.
```
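
For scripted use, `doctrail runs --json` can be called from Python. This is a minimal sketch that uses only the flags listed above. The helper names `build_runs_command` and `list_runs` are hypothetical, and the shape of the JSON output is not documented in this section, so the result is returned unparsed beyond `json.loads`.

```python
import json
import subprocess


def build_runs_command(db_path=None, enrichment=None, limit=20):
    """Build the argv for `doctrail runs --json` (hypothetical helper)."""
    cmd = ["doctrail", "runs", "--json", "--limit", str(limit)]
    if db_path:
        cmd += ["--db-path", str(db_path)]
    if enrichment:
        cmd += ["--enrichment", enrichment]
    return cmd


def list_runs(**kwargs):
    """Run the command and decode its stdout as JSON.

    Assumes `doctrail` is on PATH; the JSON structure is not specified here.
    """
    result = subprocess.run(
        build_runs_command(**kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

A run ID taken from this output can then be passed to `doctrail view create --run-id <run_id>`, as in the view workflow described under `doctrail view`.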

### doctrail serve

```text
Usage: doctrail serve [OPTIONS]

  Start the Doctrail multi-database server.

  The server provides HTTP endpoints for searching and enriching multiple SQLite databases.
  Configure databases in a YAML file:

  # doctrail-server.yaml
  server:
    host: 0.0.0.0
    port: 8000
  databases:
    literature: /data/literature    # directory containing .db file
    organs: /data/organs

  Each database directory should contain:
    - A .db file (the SQLite database)
    - Optional: doctrail.yaml (schema config)
    - Optional: chroma_db/ (vector store for semantic search)
    - Optional: help.md (database-specific documentation)

  Example:
      doctrail serve --config doctrail-server.yaml
      doctrail serve --port 9000

Options:
  --config PATH   Server configuration file (default: doctrail-server.yaml)
  --host TEXT     Override host from config
  --port INTEGER  Override port from config
  --verbose       Enable verbose logging
  --help          Show this message and exit.
```
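
Before pointing `doctrail serve` at a directory, it can help to check that the directory has the layout the help text describes. The sketch below is not Doctrail's own validation; `describe_database_dir` is a hypothetical helper that only looks for the four items named above.

```python
from pathlib import Path


def describe_database_dir(directory):
    """Report which of the documented files a database directory contains.

    Hypothetical pre-flight check; the server may apply different rules.
    """
    root = Path(directory)
    return {
        # Required: a .db file (the SQLite database)
        "db_files": sorted(p.name for p in root.glob("*.db")),
        # Optional: schema config
        "has_config": (root / "doctrail.yaml").is_file(),
        # Optional: vector store for semantic search
        "has_vector_store": (root / "chroma_db").is_dir(),
        # Optional: database-specific documentation
        "has_help": (root / "help.md").is_file(),
    }
```

An empty `db_files` list would indicate that the directory lacks the one item the help text does not mark as optional.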

### doctrail skill

```text
Usage: doctrail skill [OPTIONS]

  Print or install the packaged Doctrail skill.

Options:
  --install  Install the packaged Doctrail skill into ~/.claude/skills/doctrail/SKILL.md
  --force    Overwrite an existing installed skill when used with --install
  --help     Show this message and exit.
```

### doctrail sql

```text
Usage: doctrail sql [OPTIONS]

  Execute a read-only SQL query (SELECT only).

Options:
  --db-path PATH        Path to SQLite database  [required]
  -q, --query TEXT      SQL SELECT query  [required]
  --format [text|json]  Output format
  --help                Show this message and exit.
```
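
How `doctrail sql` enforces its read-only guarantee is not stated here. As a point of comparison, the sketch below shows one standard way to get the same property with Python's built-in `sqlite3` module: opening the database through a `mode=ro` URI so that SQLite itself rejects writes. The function name `run_readonly` is hypothetical, and this is an assumption-laden illustration, not a description of Doctrail's implementation.

```python
import sqlite3
from pathlib import Path


def run_readonly(db_path, query):
    """Execute a query on a connection that SQLite opens read-only."""
    # as_uri() yields a percent-encoded file:// URI; mode=ro forbids writes.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

With this approach a `SELECT` returns rows as usual, while an `INSERT`, `UPDATE`, or `DELETE` raises `sqlite3.OperationalError`.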

### doctrail stats

```text
Usage: doctrail stats [OPTIONS]

  Get database statistics.

Options:
  --db-path PATH        Path to SQLite database  [required]
  --format [text|json]  Output format
  --help                Show this message and exit.
```

### doctrail view

```text
Usage: doctrail view [OPTIONS] [[list|new|refresh|create|pivot|spec|render]] [NAME]

  Manage derived views for reviewing and analyzing normalized enrichments.

  Doctrail stores model outputs in normalized tables (`_enrichments`, `_enrichment_audit`,
  `_enrichment_runs`). Views are the user-facing surface: they join source rows with selected
  enrichment fields in a wide format.

  Commands:
      doctrail view                     List all views in database
      doctrail view create              List recent runs / enrichments
      doctrail view create <enrichment> Materialize the latest run view for one enrichment
      doctrail view create --run-id <run_id>  Materialize one specific persisted run
      doctrail view new <name>          Create a custom view SQL file
      doctrail view spec <name|path>    Create/apply a YAML view spec
      doctrail view refresh             Execute all .doctrail/views/*.sql and *.yml
      doctrail view pivot <name> -e <enrichment>  Create a reusable wide analysis view
      doctrail view render <name>       Export a materialized view to HTML, CSV, or JSON

  View workflow:
      doctrail runs
      doctrail view create --run-id <run_id>
      doctrail overrides-export --run-id <run_id>

  Pivot examples:
      doctrail view pivot my_review -e framing_v6
      doctrail view pivot my_review -e framing --fields hostility,frame --include "title,raw_content:500"
      doctrail view pivot icr_check -e framing --by-model --include title

  View spec example:
      doctrail view spec payments_review
      doctrail view refresh
      doctrail view render payments_review --output payments_review.html

  Review shortcut:
      doctrail view create my_enrichment
      doctrail query "SELECT * FROM v_run_my_enrichment_20260228_1430 LIMIT 20"

Options:
  --table TEXT              Source table to join against (default: from config)
  --run-id TEXT             Run ID to build a view for
  -e, --enrichment TEXT     Enrichment name (for pivot action)
  --fields TEXT             Comma-separated field names to include (default: all)
  --include TEXT            Source columns to include, with optional :N truncation (e.g.
                            "title,raw_content:500")
  --by-model                Create per-model columns for ICR comparison
  --output TEXT             Output path for render action
  --format [html|csv|json]  Output format for render action  [default: html]
  --limit INTEGER           Optional row limit for render action
  --help                    Show this message and exit.
```
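
The `--include` option takes a comma-separated list of source columns, each with an optional `:N` truncation length, as in `"title,raw_content:500"`. The sketch below illustrates that syntax only. `parse_include` is a hypothetical function, not Doctrail's parser, and it makes no claim about how edge cases such as empty items are handled by the real CLI.

```python
def parse_include(spec):
    """Split an --include value into (column, truncation) pairs.

    Truncation is None when no :N suffix is given. Hypothetical helper.
    """
    columns = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # this sketch skips empty items such as a trailing comma
        name, sep, limit = item.partition(":")
        columns.append((name.strip(), int(limit) if sep else None))
    return columns
```

For the value used in the pivot examples, `parse_include("title,raw_content:500")` gives `[("title", None), ("raw_content", 500)]`: `title` is included whole and `raw_content` is cut to 500 characters.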