Metadata — Schema, Extraction & Filtering¶

Understanding metadata is critical for building precise RAG applications. RAGWire automatically extracts and stores metadata on every chunk — and you can filter by it at query time.

Metadata Schema¶

Every chunk stored in Qdrant carries the following metadata fields:

LLM-Extracted Fields¶

Extracted once per document from the first chunk using your configured LLM.

Default schema — Finance

The fields below are the default metadata schema, designed for financial documents. You are not locked into these fields. RAGWire lets you define any fields you need via a simple YAML file. See Custom Metadata for details.

Field	Type	Example	Description
`company_name`	`str`	`"apple"`	Company name, normalized to lowercase
`doc_type`	`str`	`"10-k"`	Document type (`10-k`, `10-q`, `8-k`)
`fiscal_quarter`	`str`	`"q1"`	Fiscal quarter (`q1`–`q4`) or `null`
`fiscal_year`	`list[int]`	`[2025]`	Fiscal year(s) covered by the document

File-Level Fields¶

Set automatically from the file at ingestion time.

Field	Type	Example	Description
`source`	`str`	`"/data/Apple_10k_2025.pdf"`	Full file path
`file_name`	`str`	`"Apple_10k_2025.pdf"`	Original filename
`file_type`	`str`	`"pdf"`	File extension
`file_hash`	`str`	`"abc123..."`	SHA256 hash — used for deduplication

Chunk-Level Fields¶

Set per chunk at ingestion time.

Field	Type	Example	Description
`chunk_id`	`str`	`"abc123_0"`	Unique chunk identifier (`file_hash_index`)
`chunk_hash`	`str`	`"def456..."`	SHA256 hash of chunk content
`chunk_index`	`int`	`0`	Position of this chunk within the document
`total_chunks`	`int`	`42`	Total chunks in the document
`created_at`	`str`	`"2026-03-22T10:00:00+00:00"`	UTC ISO timestamp set at ingestion

Inspecting Metadata on Retrieved Chunks¶

from ragwire import RAGWire

rag = RAGWire("config.yaml")
results = rag.retrieve("What is Apple's revenue?", top_k=3)

for doc in results:
    print(doc.metadata)

Example output:

{
    "company_name": "apple",
    "doc_type": "10-k",
    "fiscal_quarter": None,
    "fiscal_year": [2025],
    "source": "/data/Apple_10k_2025.pdf",
    "file_name": "Apple_10k_2025.pdf",
    "file_type": "pdf",
    "file_hash": "a1b2c3...",
    "chunk_id": "a1b2c3_0",
    "chunk_hash": "d4e5f6...",
    "chunk_index": 0,
    "total_chunks": 42,
    "created_at": "2026-03-22T10:00:00.000000"
}

Discovering Available Fields and Values¶

RAGWire provides two ways to inspect fields — use the right one for your purpose:

Method	Returns	Use for
`rag.filter_fields`	Semantic/LLM-extracted fields only	Building filter prompts, agent prompts
`rag.discover_metadata_fields()`	All fields including system fields	Collection inspection, debugging

# Filterable fields only — use these for filter prompts
rag.filter_fields
# → ['company_name', 'doc_type', 'fiscal_quarter', 'fiscal_year']

# All fields — includes file_hash, chunk_id, source, created_at, etc.
rag.discover_metadata_fields()
# → ['company_name', 'doc_type', 'fiscal_year', 'file_name', 'file_hash', 'chunk_id', ...]

# Get stored values for filterable fields
rag.get_field_values(rag.filter_fields)
# → {'company_name': ['apple', 'microsoft'], 'doc_type': ['10-k', '10-q'], ...}

# Raise the limit for high-cardinality fields (default limit=50)
rag.get_field_values("file_name", limit=200)
# → ['Apple_10k_2025.pdf', 'Microsoft_10k_2025.pdf', ...]

Results are ordered by frequency — most common values first.

Filtering at Query Time¶

RAGWire supports three filtering modes:

Mode	How	When to use
Explicit	Pass `filters=` dict to `retrieve()`	Programmatic pipelines, known inputs
Auto-filter	Set `auto_filter: true` in config	Simple chatbots, no agent involved
Agent-controlled	Call `extract_filters()` or `get_filter_context()` manually	Agents that need to reason about filters

auto_filter is off by default

No filter extraction happens automatically unless auto_filter: true is set in config.yaml. For agents, keep the default and use extract_filters() or get_filter_context() to control extraction explicitly.

# Explicit — LLM extraction skipped entirely
results = rag.retrieve("What is the revenue?", filters={"company_name": "apple"})

# Auto-filter — requires auto_filter: true in config.yaml
results = rag.retrieve("What is Apple's revenue for 2025?")

# Agent-controlled — extract, inspect, adjust, then retrieve
filters = rag.extract_filters("What is Apple's revenue for 2025?")
# → {"company_name": "apple", "fiscal_year": 2025}
results = rag.retrieve("What is Apple's revenue for 2025?", filters=filters)

Filter by company¶

results = rag.retrieve(
    "What is the total revenue?",
    top_k=5,
    filters={"company_name": "apple"}
)

Filter by document type¶

results = rag.retrieve(
    "What are the risk factors?",
    top_k=5,
    filters={"doc_type": "10-k"}
)

Filter by fiscal year¶

# Single year — pass as int
results = rag.retrieve(
    "What is the net income?",
    top_k=5,
    filters={"fiscal_year": 2025}
)

# Multiple years — matches documents covering ANY of the years (OR logic)
results = rag.retrieve(
    "Compare net income across 2023 and 2024",
    top_k=10,
    filters={"fiscal_year": [2023, 2024]}
)

Combined filters¶

results = rag.retrieve(
    "What is the revenue breakdown by segment?",
    top_k=5,
    filters={"company_name": "apple", "fiscal_year": 2025}
)

Filter by file name¶

results = rag.retrieve(
    "What are the capital expenditures?",
    top_k=5,
    filters={"file_name": "Apple_10k_2025.pdf"}
)

Filters in Hybrid Search¶

Filters work identically with hybrid_search():

results = rag.hybrid_search(
    "Apple revenue fiscal 2025",
    k=5,
    filters={"company_name": "apple"}
)

Agent-Controlled Filtering¶

For agents, keep auto_filter off and use two tools — one for metadata awareness, one for retrieval:

Two-tool pattern (recommended)¶

from typing import Optional

@tool
def get_filter_context(query: str) -> str:
    """Get available metadata fields, stored values, and filter suggestions for a query.

    Call this before search_documents when the query involves specific metadata
    (company, year, document type, etc.). Use it to decide what filters to apply.
    Safe to call per sub-query in multi-query flows — always fresh from Qdrant.
    """
    return rag.get_filter_context(query)

@tool
def search_documents(query: str, filters: Optional[dict] = None) -> str:
    """Search the document knowledge base. Pass filters decided from get_filter_context."""
    results = rag.retrieve(query, top_k=5, filters=filters)
    if not results:
        return "No relevant documents found."
    return "\n\n---\n\n".join(
        f"[{doc.metadata.get('file_name', 'unknown')}]\n{doc.page_content}"
        for doc in results
    )

Agent flow:

1. Agent calls get_filter_context("Apple revenue 2025")
   → sees fields, stored values, extracted: {"company_name": "apple", "fiscal_year": 2025}
   → decides filters

2. Agent calls search_documents("Apple revenue 2025", filters={"company_name": "apple", "fiscal_year": 2025})

The agent calls get_filter_context only when metadata is relevant — skips it for purely semantic queries. For multi-query tasks each sub-query gets its own fresh context.

What `get_filter_context()` returns¶

## RAGWire Filter Context

### Available Metadata Fields and Stored Values
- **company_name**: ['apple', 'microsoft', 'google']
- **doc_type**: ['10-k', '10-q']
- **fiscal_year**: [2023, 2024, 2025]

### Extracted Filters from Query
- **company_name**: `apple`
- **fiscal_year**: `2025`

### Instructions
1. Review the extracted filters above.
2. If an extracted value does not match or closely relate to any stored value, adjust or drop that filter.
3. If the query has no clear metadata intent, pass an empty dict {} as filters.
4. Pass the final filters dict to the retrieval tool as filters=.

For custom metadata fields (legal, HR, medical, or any non-financial domain), see Custom Metadata.