Custom Metadata¶

RAGWire supports fully custom metadata fields for any domain — legal, HR, medical, e-commerce, and more. Define your fields in a YAML file and RAGWire builds a typed Pydantic schema automatically.

Custom Metadata via YAML File¶

The easiest way to define custom metadata fields is a YAML config file. RAGWire builds a Pydantic model from your field definitions and uses with_structured_output for reliable, type-safe extraction — no manual JSON parsing.

1. Create `metadata.yaml`¶

fields:
  - name: organization
    description: "Organization name in lowercase"

  - name: doc_type
    description: "Type of document"
    values: ["contract", "policy", "report", "memo"]

  - name: effective_year
    description: "Year the document is effective"
    type: integer

  - name: jurisdiction
    description: "Country or region the document applies to, or null"

2. Reference it in `config.yaml`¶

metadata:
  config_file: "metadata.yaml"

That's it. RAGWire reads the YAML at startup, builds a Pydantic schema from your fields, and uses it for every ingested document. The extracted values are stored in Qdrant and can be filtered at query time exactly like the built-in fields.

See Domain Examples below for ready-to-use schemas for legal, HR, healthcare, and more.

Field definitions¶

Top-level keys

Key	Required	Description
`prompt`	No	Custom extraction prompt. Document text is injected automatically — no placeholder needed. Defaults to the built-in RAGWire prompt if omitted.
`fields`	Yes	List of metadata field definitions (see below)

Field definition keys

Key	Required	Description
`name`	Yes	Field key stored in metadata
`description`	Yes	Instruction sent to the LLM describing what to extract
`type`	No	`string` (default) \| `list` \| `integer`
`values`	No	Example/allowed values — shown to LLM as format hints

Open-ended lists

For type: list fields, values are format examples only — not a whitelist. The LLM will extract any value in the same format, even if it's not in the list.

Custom Extraction Prompt¶

By default RAGWire uses a built-in extraction prompt. You can override it per-config using a top-level prompt key in your metadata.yaml. Write your instructions — the document text is appended automatically.

prompt: |
  You are a <domain> expert. Extract the metadata fields below from the document.
  Be thorough — only return null if the information is genuinely absent.

fields:
  - name: my_field
    description: "What to extract"

Note

The prompt key is optional. When omitted, the default RAGWire extraction prompt is used. No {content} placeholder needed — document text is appended automatically.

Domain Examples¶

Health & Gym Supplement Research Papers¶

prompt: |
  You are an expert research paper analyst specializing in health, fitness, and sports science.
  Your task is to extract structured metadata from the research paper below.

  ## Extraction Rules
  1. **Be thorough**: Extract every field you can find. A field should only be null if the information is completely absent.
  2. **Be precise**: Extract exactly what is stated. Do not infer, assume, or hallucinate values not present in the document.
  3. **Lists**: Scan the entire document and extract ALL matching values — not just the first occurrence.
  4. **Strings**: Normalize to lowercase. Trim extra whitespace.
  5. **Integers**: Return the numeric value only — no units, symbols, or surrounding text.
  6. **Null**: Return null only when the field is genuinely not mentioned anywhere in the document.

fields:
  - name: title
    description: "Full title of the research paper exactly as it appears in the document. Do not paraphrase."

  - name: authors
    description: "List of full author names exactly as they appear in the paper (e.g. 'John A. Smith'). Extract all authors."
    type: list

  - name: publication_year
    description: "Year the paper was published or last revised. Extract the 4-digit year only."
    type: integer

  - name: research_focus
    description: "List of all primary research topics covered, in lowercase-hyphenated format. Not limited to the examples — extract any focus area mentioned in the paper."
    type: list
    values: ["muscle-growth", "recovery", "performance", "endurance", "cognitive-function", "fat-loss", "safety", "hormonal"]

Legal / HR¶

fields:
  - name: organization
    description: "Organization name in lowercase"

  - name: doc_type
    description: "Type of document"
    values: ["contract", "policy", "report", "memo", "nda", "agreement"]

  - name: effective_year
    description: "Year the document is effective"
    type: integer

  - name: jurisdiction
    description: "Country or region the document applies to, or null"

  - name: parties
    description: "List of parties involved in the document (individuals or organizations)"
    type: list

  - name: status
    description: "Current status of the document"
    values: ["active", "expired", "draft", "terminated"]

Healthcare / Medical¶

fields:
  - name: condition
    description: "List of medical conditions or diseases discussed, in lowercase-hyphenated format"
    type: list

  - name: treatment_type
    description: "List of treatments or interventions discussed"
    type: list
    values: ["medication", "surgery", "therapy", "diagnostic", "preventive", "rehabilitation"]

  - name: specialty
    description: "List of medical specialties the document covers"
    type: list
    values: ["cardiology", "oncology", "neurology", "orthopedics", "general-practice", "psychiatry"]

  - name: publication_year
    description: "Year the document was published"
    type: integer

  - name: patient_population
    description: "Target patient population described in the document, or null"

  - name: evidence_level
    description: "Level of clinical evidence"
    values: ["systematic-review", "rct", "cohort-study", "case-study", "expert-opinion"]

  - name: key_finding
    description: "Most important clinical finding or recommendation, maximum 200 characters"

Complete Custom Metadata Example¶

Here is a full end-to-end workflow for a legal/HR use case using custom metadata fields.

1. `metadata.yaml`¶

fields:
  - name: organization
    description: "Organization name in lowercase"

  - name: doc_type
    description: "Type of document"
    values: ["contract", "policy", "report", "memo"]

  - name: effective_year
    description: "Year the document is effective"
    type: integer

  - name: jurisdiction
    description: "Country or region the document applies to, or null"

  - name: parties
    description: "List of parties involved in the document (individuals or organizations)"
    type: list

2. `config.yaml`¶

embeddings:
  provider: "openai"
  model: "text-embedding-3-small"

llm:
  provider: "ollama"
  model: "qwen3.5:9b"
  base_url: "http://localhost:11434"

metadata:
  config_file: "metadata.yaml"

vectorstore:
  url: "http://localhost:6333"
  collection_name: "legal_docs"
  use_sparse: true

retriever:
  search_type: "hybrid"
  top_k: 5
  auto_filter: false   # keep false for agent use — use get_filter_context() instead

3. Ingest and retrieve¶

from ragwire import RAGWire

rag = RAGWire("config.yaml")

stats = rag.ingest_directory("data/")
print(f"Ingested {stats['processed']} docs, {stats['chunks_created']} chunks")

results = rag.retrieve("data protection policy", top_k=1)
print(results[0].metadata)
# {
#     "organization": "acme corp",
#     "doc_type": "policy",
#     "effective_year": 2024,
#     "jurisdiction": "eu",
#     "source": "data/acme_data_policy.pdf",
#     ...
# }

4. Filter by custom fields¶

Explicit filters — pass a dict directly, no LLM extraction:

results = rag.retrieve(
    "employee data handling responsibilities",
    filters={"doc_type": "policy"}
)

results = rag.retrieve(
    "termination clauses",
    filters={"doc_type": "contract", "jurisdiction": "eu", "effective_year": 2024}
)

Agent-controlled — expose get_filter_context and search_documents as two separate tools. The agent calls get_filter_context first to understand available metadata, then decides what filters to pass to search_documents:

from typing import Optional

@tool
def get_filter_context(query: str) -> str:
    """Get available metadata fields, stored values, and filter suggestions for a query.
    Call this before search_documents when the query involves specific metadata.
    """
    return rag.get_filter_context(query)

@tool
def search_documents(query: str, filters: Optional[dict] = None) -> str:
    """Search the document knowledge base with optional filters."""
    results = rag.retrieve(query, top_k=5, filters=filters)
    if not results:
        return "No relevant documents found."
    chunks = []
    for doc in results:
        source = doc.metadata.get("file_name", "unknown")
        meta_parts = [
            f"{k}={str(v)[:100]}"
            for k, v in doc.metadata.items()
            if k != "file_name" and v not in (None, "", [])
        ]
        header = f"[{source}" + (f" | {', '.join(meta_parts)}" if meta_parts else "") + "]"
        chunks.append(f"{header}\n{doc.page_content}")
    return "\n\n---\n\n".join(chunks)

# Agent flow:
# 1. get_filter_context("EU data protection policies")
#    → sees jurisdiction: ["eu", "us"], doc_type: ["policy", "contract"]
#    → extracted: {"jurisdiction": "eu", "doc_type": "policy"}
# 2. search_documents("EU data protection policies", filters={"jurisdiction": "eu", "doc_type": "policy"})

5. Run the example script¶

A complete runnable script is available at examples/custom_metadata_usage.py:

python examples/custom_metadata_usage.py