Code books

A code book is the small YAML file you write for each enrichment: it says which rows to read, what to ask, and what shape the answer must take. (YAML is the plain-text format it is written in, and you will usually have an agent write it.) A code book draws on four ideas: a database, SQL queries, enrichments, and schemas. The tested examples in tests/schema_examples/ are the first source of truth: every snippet below is either copied from that corpus or smoke-checked with doctrail enrich --dry-run.

Minimal single-field enum

Source: tests/schema_examples/single_field/classify_language.yml.

name: classify_language
description: "Classify the language of the document"
model: gpt-4o-mini
input:
  query: all_docs
  input_columns: ["raw_content"]
output_column: detected_language
schema:
  enum: ["english", "chinese", "mixed", "other"]
prompt: |
  Analyze this document and classify its primary language.
  Return one of: english, chinese, mixed, or other.

Boolean field

Source: tests/schema_examples/single_field/validate_content.yml.

name: validate_content
description: "Validate document content quality"
input:
  query: all_docs
  input_columns: ["raw_content", "filename"]
schema:
  content_valid: {type: "boolean"}
prompt: |
  Check if this document has valid, readable content.
  Return 'true' if the document has valid, readable content.
  Return 'false' if it's empty, corrupted, or unreadable.

Full current-path config

Smoke-checked with doctrail enrich --config config.yml packed_relevance --dry-run and doctrail enrich --config config.yml donor_payment --dry-run.

database: ./docs.db
default_model: gpt-4o-mini
default_table: documents
project: docs_smoke

sql_queries:
  all_docs: |
    SELECT rowid, sha1 FROM documents ORDER BY rowid
  donor_docs: |
    SELECT rowid, sha1 FROM documents
    WHERE raw_content LIKE '%donor%'
    ORDER BY rowid

enrichments:
  - name: packed_relevance
    input:
      query: all_docs
      input_columns: ["filename", "raw_content:240"]
    system_prompt: "Return only data matching the schema."
    append_file: context.md
    prompt: |
      Mark each item true if it mentions donor-family payment evidence.
    output_column: is_relevant
    schema: {type: "boolean"}
    dedupe_scope: query
    key_column: sha1
    pack_size: 5
    pack_response_mode: selected_indexes

  - name: donor_payment
    input:
      query: donor_docs
      input_columns: ["filename", "raw_content:500"]
    prompt: |
      Extract donor-family payment evidence from the supplied document.
      Use the filename only as source context.

      Codebook:
      - condolence_money: direct cash or ceremonial money given to a donor family.
      - relief_fund: money from a relief, hardship, charity, or assistance fund.
      - subsidy: public or institutional subsidy, reimbursement, or benefit.
      - none: no donor-family payment evidence appears.

      If payment_type is none, amount must be null and evidence should briefly say no payment evidence was found.
      If the amount is not stated, amount must be null.
    schema:
      payment_type: {enum: ["condolence_money", "relief_fund", "subsidy", "none"]}
      amount: {type: "float", minimum: 0, optional: true}
      evidence: {type: "string", maxLength: 300}
      tags: {enum_list: ["cash", "policy", "medical", "unclear"], max_items: 3}
      evidence_items:
        type: "array"
        maxItems: 3
        items:
          type: "object"
          properties:
            phrase: {type: "string", maxLength: 120}
            payment_kind: {enum: ["cash", "subsidy", "unknown"]}

Prompt construction

Write the prompt as a code book, not just a question. Define every enum value, anchor every numeric scale point, and state the gate and null behavior for optional fields. For example, if has_payment is false, say which evidence fields must be null; if a score runs from 0 to 5, define at least 0, the middle, and 5.
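
As an illustration of that advice, here is a minimal sketch of an enrichment whose prompt spells out a gate, a null rule, and an anchored scale. The enrichment name, field names, and wording are hypothetical and are not taken from tests/schema_examples/; only the schema forms come from the schema surface table, and the snippet has not been smoke-checked with --dry-run.

```yaml
# Hypothetical code book: payment_gate, has_payment, confidence, and evidence
# are illustrative names, not part of the tested corpus.
name: payment_gate
input:
  query: all_docs
  input_columns: ["filename", "raw_content:3000"]
schema:
  has_payment: {type: "boolean"}
  confidence: {type: "integer", minimum: 0, maximum: 5}
  evidence: {type: "string", maxLength: 300, optional: true}
prompt: |
  Decide whether the document reports a payment to a donor family.

  Codebook:
  - has_payment: true only if a specific payment is described; false otherwise.
  - confidence, from 0 to 5:
    0 = the text gives no usable evidence either way.
    1 = a payment is hinted at but nothing specific is said.
    2 = a payment is mentioned but payer and recipient are both unclear.
    3 = a payment is described but payer or recipient is ambiguous.
    4 = payer and recipient are clear but the payment is paraphrased.
    5 = payer, recipient, and payment are all stated explicitly.
  - evidence: a short quotation that supports the decision.

  If has_payment is false, evidence must be null.
```

The gate field comes first in the schema so the null rule for evidence can refer back to it, and every value the model may return has a written definition.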

Keep the prompt as static as possible. Doctrail renders the prompt first. On JSON-mode paths it adds the schema instructions to that prompt; on provider-native paths it sends the schema payload as constant request structure. Per-row input_columns content is appended last, so everything before it is identical from row to row. With OpenAI automatic prompt caching, and with Gemini implicit caching on Gemini 2.5 and newer models, repeated prefix tokens can be billed at cached-input rates when the provider reports a cache hit, though Gemini does not guarantee savings on every request. Doctrail does not currently set Anthropic cache_control markers.

Prefer input_columns with :N truncation, such as raw_content:3000, over {column} interpolation inside the prompt. Placeholders still work, but a row-specific value breaks the shared prompt prefix at that point, so everything after it is billed at the full input price for every row. If a placeholder is truly needed, put it at the very end of the prompt.
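
The two layouts can be sketched side by side. This fragment is illustrative only: the prompt wording is hypothetical, {filename} is assumed to be a valid instance of the {column} placeholder form, and nothing here has been smoke-checked.

```yaml
# Preferred: the prompt text is identical for every row, and Doctrail appends
# the truncated column content after it.
input:
  query: all_docs
  input_columns: ["raw_content:3000"]
prompt: |
  Summarize the document in one sentence.

# Alternative, shown commented out: a {column} placeholder inside the prompt.
# It sits on the last line so the text before it can still form a shared prefix.
# prompt: |
#   Summarize the document in one sentence.
#   Source file: {filename}
```

In the first form the only per-row text is what input_columns appends; in the second, anything placed after the placeholder would differ in position from row to row.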

Token economy pattern

Use a cheap packed pass before an expensive extraction pass when most rows are likely irrelevant. In the example above, packed_relevance keeps each row small with raw_content:240, sends five rows per request with pack_size: 5, and uses pack_response_mode: selected_indexes so the model only has to name the relevant item indexes.

That pattern is meant for triage. The packed pass writes a normalized answer for each input row, including false answers for rows the model does not select, so append mode can skip completed packed batches on rerun. Run the richer donor_payment extraction only on the narrowed SQL query or a saved relevance view.
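
One way to wire the second pass is sketched below. The view name relevance_view, its is_relevant column, and the 1 used for true are all assumptions: this page does not say where Doctrail stores the packed pass output, so adapt the SQL to wherever the is_relevant values land in your database. The fragment has not been smoke-checked.

```yaml
sql_queries:
  # Hypothetical saved view: assumes the packed_relevance answers are exposed
  # as an is_relevant column next to rowid and sha1.
  relevant_docs: |
    SELECT rowid, sha1 FROM relevance_view
    WHERE is_relevant = 1
    ORDER BY rowid

enrichments:
  - name: donor_payment
    input:
      query: relevant_docs   # only rows the cheap pass marked relevant
      input_columns: ["filename", "raw_content:500"]
    # prompt and schema as in the full config above
```

The expensive extraction then reads a longer slice of each row (raw_content:500) but only for the rows that survived triage.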

Multi-table inputs

Source: tests/schema_examples/multi_field/multi_table_review.yml, with its legacy output_table comment removed. The feature shown is the table.column syntax in input_columns.

name: multi_table_review
description: "Review and refine extraction using data from multiple tables"
model: gpt-4o
input:
  query: docs_with_entities
  input_columns:
    - "documents.raw_content:1000"
    - "documents.filename"
    - "extracted_entities.organizations"
    - "extracted_entities.locations"
    - "extracted_entities.key_terms"
    - "extracted_entities.entity_count"
schema:
  organizations: {type: "array", items: {type: "string"}, maxItems: 10}
  locations: {type: "array", items: {type: "string", lang: "zh"}, maxItems: 5}
  key_terms: {type: "array", items: {type: "string"}, maxItems: 15}
  entity_count: {type: "integer", minimum: 0}
  quality_score: {type: "float", minimum: 0.0, maximum: 1.0}
  corrections_made: {type: "string", maxLength: 500}
prompt: |
  Review the previous entity extraction and source document supplied below.

  Use the existing entity fields as draft model output, not as ground truth.
  Correct omissions, remove spurious entities, and explain major corrections.
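
The docs_with_entities query is referenced above but not shown. A minimal sketch of what it could look like follows, assuming both tables carry a sha1 column to join on. Only the table names come from the input_columns list; the join key is an assumption, and whether Doctrail needs the query to select the joined columns itself is not covered here.

```yaml
sql_queries:
  # Hypothetical definition: the sha1 join key is assumed, not confirmed.
  docs_with_entities: |
    SELECT d.rowid, d.sha1
    FROM documents AS d
    JOIN extracted_entities AS e ON e.sha1 = d.sha1
    ORDER BY d.rowid
```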

Language validation

Source: tests/schema_examples/advanced/test_array_language_validation.yml.

fragment

schema:
  summary_zh: {type: "string", lang: "zh"}
  evidence_en: {type: "array", items: {type: "string", lang: "en"}, maxItems: 5}
  keywords_zh: {type: "array", items: {type: "string", lang: "zh"}, maxItems: 10}
  confidence: {type: "float", minimum: 0, maximum: 1}

Multiple models

Source: tests/schema_examples/advanced/test_enrich_multi_model.yml.

enrichments:
  - name: compare_models
    description: "Compare outputs from multiple models"
    model: ["gpt-4o-mini", "gpt-3.5-turbo"]
    input:
      query: all_docs
      input_columns: ["raw_content:500"]
    schema:
      summary: {type: "string"}
      sentiment: {enum: ["positive", "negative", "neutral"]}
    prompt: |
      Provide a brief summary of this document and determine its sentiment.
      Summary should be one sentence only.

YAML imports

Source: tests/schema_examples/test_yaml_imports.yml.

fragment

database: placeholder
default_model: gpt-4o-mini

sql_queries:
  all_docs: |
    SELECT rowid, sha1 FROM documents
    ORDER BY rowid

enrichments:
  - !import single_field/classify_language.yml
  - !import single_field/validate_content.yml

Schema surface

Schema form	Status	Note
`{type: "string"}`	stable	text
`{type: "integer"}`	stable	whole numbers
`{type: "float"}`	stable	decimals
`{type: "boolean"}`	stable	true/false
`{enum: [...]}`	stable	one choice
`{enum_list: [...]}`	stable	several choices
`{type: "array", items: ...}`	stable	JSON list
`{type: "object", properties: ...}`	stable	JSON object
`optional: true`	stable	allows null
`nullable: true`	stable	alias
`minimum`, `maximum`	stable	numeric bounds
`minLength`, `maxLength`, `pattern`	stable	string bounds
`minItems`, `maxItems`	stable	array bounds
`lang: "zh"` or `lang: "en"`	stable	language check
`convert: "chinese_to_pinyin"`	internal	plugin hook
`{type: "number"}`	deprecated	use `float`
`schema: [...]`	deprecated	use `enum`
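
To make the two deprecated rows concrete, here is a hedged before-and-after sketch written as two YAML documents. The rating field is hypothetical, and reading schema: [...] as a bare list of enum values is an interpretation of the table rather than something confirmed by the test corpus.

```yaml
# Deprecated form, shown commented out:
# schema:
#   rating: {type: "number"}
# Current form: float instead of number.
schema:
  rating: {type: "float", minimum: 0, maximum: 1}
---
# Deprecated form, shown commented out:
# schema: ["english", "chinese", "mixed", "other"]
# Current form: a single-field enum, as in classify_language.yml.
output_column: detected_language
schema:
  enum: ["english", "chinese", "mixed", "other"]
```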

Config key stability

Inventory source (batch 4): every .get(...) call and direct config read across src/doctrail/.

Key	Status	Note
`database`	stable	SQLite path
`project`	stable	run tag
`default_table`	stable	source fallback
`default_model`	stable	model fallback
`sql_queries`	stable	named SQL
`enrichments`	stable	task list
`name`	stable	enrichment id
`input.query`	stable	named or inline SQL
`input.input_columns`	stable	prompt payload
`prompt`	stable	user prompt
`system_prompt`	stable	system message
`append_file`	stable	prompt appendix
`model`	stable	one or many
`models`	stable	model config map
`schema`	stable	output contract
`output_column`	stable	single-field alias
`key_column`	stable	row key
`dedupe_scope`	stable	query/prompt/enrichment
`pack_size`	stable	sync grouping
`pack_response_mode`	stable	exhaustive or selected_indexes
`output_table`	deprecated	compatibility hint
`dedupe-scope`	deprecated	use underscore
`--execution-mode openai-batch`	deprecated	use `batch`
`output_columns`	internal	derived/legacy
`all_fields_optional`	internal	broad optionality
`min_input_chars`	internal	skip threshold
`truncate`	internal	prefer CLI flag
`reasoning_effort`	internal	GPT-5 control
`exports`	internal	old export path
`output_naming`	internal	export filenames
`views.priority_columns`	internal	view sorting
`zotero` credentials	internal	ingest plugin
`documents_path`	internal	init/ingest helper
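
As a small sketch of how the deprecated keys in the table map onto current ones, the fragment below reuses the packed_relevance settings from the full config. It is illustrative only and has not been smoke-checked.

```yaml
enrichments:
  - name: packed_relevance
    # Deprecated spelling, shown commented out:
    # dedupe-scope: query
    dedupe_scope: query   # the table lists query, prompt, and enrichment
    key_column: sha1
    pack_size: 5
    pack_response_mode: selected_indexes   # or exhaustive, per the table
    # output_table is listed as a deprecated compatibility hint and is
    # left out of this sketch.
```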