API Reference¶
All public APIs are importable directly from ragwire:
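For example (all names as documented on this page; pick the ones you need):

```python
from ragwire import (
    RAGWire,
    MarkItDownLoader,
    MetadataExtractor,
    get_embedding,
    setup_colored_logging,
)
```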
Core API¶
These are the primary user-facing APIs. Most applications only need these.
RAGWire¶
The main orchestrator. Handles the full pipeline from config loading to ingestion and retrieval.
RAGWire(config_path)¶
Initialize the pipeline from a YAML config file.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `config_path` | `str` | Yes | Path to `config.yaml` |
Raises:
- `FileNotFoundError` — config file not found
- `ValueError` — missing required config keys (e.g. `llm.model`)
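A minimal initialization sketch, assuming a `config.yaml` in the working directory:

```python
from ragwire import RAGWire

# Raises FileNotFoundError if the path is wrong,
# ValueError if required keys (e.g. llm.model) are missing
rag = RAGWire("config.yaml")
```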
rag.ingest_documents(file_paths)¶
Ingest a list of documents into the vector store. Skips files already ingested (SHA256 deduplication).
| Parameter | Type | Required | Description |
|---|---|---|---|
| `file_paths` | `list[str]` | Yes | List of file paths to ingest |
Returns: dict
```python
{
    "total": 3,            # Total files submitted
    "processed": 2,        # Successfully ingested
    "skipped": 1,          # Already in vector store (duplicate)
    "failed": 0,           # Failed to load or process
    "chunks_created": 84,  # Total chunks added to Qdrant
    "errors": []           # List of {"file": ..., "error": ...} dicts
}
```
```python
stats = rag.ingest_documents([
    "data/Apple_10k_2025.pdf",
    "data/Microsoft_10k_2025.pdf",
])
print(f"Processed: {stats['processed']}, Chunks: {stats['chunks_created']}")
```
A progress bar (tqdm) is shown automatically while ingestion runs.
rag.ingest_directory(directory, recursive, extensions)¶
Ingest all supported documents from a directory. Internally calls ingest_documents().
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `directory` | `str` | Yes | — | Path to the directory |
| `recursive` | `bool` | No | `False` | Search subdirectories |
| `extensions` | `list[str]` | No | loader config | File extensions to include |
Returns: dict — same stats dict as ingest_documents()
```python
# Ingest all PDFs/DOCX in a folder
stats = rag.ingest_directory("data/")

# Recursively include subdirectories
stats = rag.ingest_directory("data/", recursive=True)

# Only specific extensions
stats = rag.ingest_directory("data/", extensions=[".pdf"])
```
rag.retrieve(query, top_k, filters)¶
Retrieve the most relevant chunks for a query.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | `str` | Yes | — | Search query |
| `top_k` | `int` | No | config value | Number of results to return |
| `filters` | `dict` | No | `None` | Metadata filters (see Metadata & Filtering) |
Returns: list[Document]
Each Document has:
- `doc.page_content` — the chunk text
- `doc.metadata` — dict with all metadata fields (see Metadata Schema)
Filter behaviour:

- If `filters` is passed → used as-is, no LLM call (always, regardless of `auto_filter` setting)
- If `filters` is not passed and `auto_filter: true` in config → LLM extracts filters from the query
- If `filters` is not passed and `auto_filter: false` (default) → no filtering, pure semantic search
When to use auto-filter vs explicit filters: Use explicit filters in programmatic pipelines where you control the inputs (faster, zero LLM overhead). Enable auto_filter in simple user-facing chatbots. For agents, keep auto_filter: false and use rag.extract_filters(query) to give the agent full control over whether and how to apply filters.
```python
# Explicit filters — LLM extraction skipped
results = rag.retrieve(
    "What is the net income?",
    top_k=5,
    filters={"company_name": "apple", "fiscal_year": 2025}
)

# auto_filter: true in config — LLM extracts {"company_name": "apple", "fiscal_year": 2025}
results = rag.retrieve("What is Apple's net income for 2025?")

# auto_filter: false (default) — pure semantic search, no filter extraction
results = rag.retrieve("What is Apple's net income for 2025?")

for doc in results:
    print(doc.metadata.get("company_name"))
    print(doc.page_content[:300])
```
rag.hybrid_search(query, k, filters)¶
Perform hybrid search combining dense (semantic) and sparse (keyword) vectors. Requires use_sparse: true in config.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | `str` | Yes | — | Search query |
| `k` | `int` | No | `5` | Number of results |
| `filters` | `dict` | No | `None` | Metadata filters |
Returns: list[Document]
Hybrid search requires sparse vectors
hybrid_search() only performs true hybrid (dense + sparse) search when both conditions are met:
- `use_sparse: true` in `config.yaml` — collection must be created with sparse vector support
- `pip install fastembed` — required for sparse encoding
If either is missing, the call silently falls back to dense-only similarity search. There is no error raised.
If your collection was created with use_sparse: false, you must set force_recreate: true and re-ingest to enable hybrid search.
retrieve() vs hybrid_search() — when to use which:

| | `retrieve()` | `hybrid_search()` |
|---|---|---|
| Search type | Whatever is set in `config.yaml` (`similarity`, `mmr`, or `hybrid`) | Always hybrid (dense + sparse), regardless of config |
| Auto-filter | Only when `auto_filter: true` in config (default `false`) | Same — respects `auto_filter` setting |
| `top_k` default | From `config.yaml` | `k=5` parameter |
| Typical use | Primary method for all RAG flows | Override to force hybrid on a single call |
If your config.yaml already has search_type: "hybrid", both methods produce identical results. Use hybrid_search() only when your config is set to similarity or mmr and you want to force hybrid for a specific call.
```python
# Use retrieve() in most cases — honours config search type
results = rag.retrieve("Apple revenue fiscal 2025", top_k=5)

# Use hybrid_search() to force hybrid regardless of config
results = rag.hybrid_search(
    "Apple revenue fiscal 2025",
    k=5,
    filters={"company_name": "apple"}
)
```
rag.extract_filters(query)¶
Extract metadata filters from a natural language query without triggering retrieval. Returns the raw extracted dict so an agent can inspect, adjust, or discard before passing to retrieve().
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | `str` | Yes | Natural language query |
Returns: dict of extracted filters, or None if nothing was extracted.
Note
This method always runs regardless of the auto_filter config setting. It gives agents explicit control — call it manually, decide what to do, then pass the result to retrieve(filters=...).
```python
# Agent workflow — full control over filters
filters = rag.extract_filters("muscle building studies from 2023")
# → {"research_focus": "muscle building", "publication_year": 2023}

# Agent validates against stored values
stored = rag.get_field_values(rag.filter_fields)
if filters.get("research_focus") not in stored.get("research_focus", []):
    filters.pop("research_focus")  # drop uncertain filter, rely on semantic search

results = rag.retrieve("muscle building studies from 2023", filters=filters)
```
rag.get_filter_context(query, limit)¶
Build a ready-made markdown prompt block for an agent — contains available metadata fields, their stored values, the filters extracted from the current query, and instructions for the agent on how to act on them. Append or prepend to your agent's task prompt.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | `str` | Yes | — | Natural language query |
| `limit` | `int` | No | `50` | Max stored values to show per field |
Returns: str — formatted markdown block ready to inject into an agent prompt.
```python
filter_context = rag.get_filter_context("muscle building studies from 2023")
agent_prompt = filter_context + "\n\n" + your_task_prompt
```
The returned block looks like:
```markdown
## RAGWire Filter Context

### Available Metadata Fields and Stored Values
- **research_focus**: ["muscle-growth", "endurance", "recovery", ...]
- **publication_year**: [2022, 2023, 2024]
- **authors**: ["john smith", "jane doe", ...]

### Extracted Filters from Query
- **research_focus**: `muscle building`
- **publication_year**: `2023`

### Instructions
1. Review the extracted filters above.
2. If an extracted value does not match or closely relate to any stored value, adjust or drop that filter.
3. If the query has no clear metadata intent, pass an empty dict {} as filters.
4. Pass the final filters dict to the retrieval tool as filters=.
```
Typical agent workflow
Use get_filter_context() to give the agent full situational awareness. The agent can then call rag.retrieve(query, filters=adjusted_filters) with a well-informed decision on which filters to apply.
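That workflow can be sketched as follows (the `my_agent.decide_filters` call is hypothetical — your agent framework supplies that part):

```python
query = "muscle building studies from 2023"
filter_context = rag.get_filter_context(query)

# Hypothetical agent call: the agent reads the context block and
# returns a filters dict, possibly empty, per the instructions in it
adjusted_filters = my_agent.decide_filters(filter_context)

results = rag.retrieve(query, filters=adjusted_filters)
```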
rag.filter_fields¶
Property. Returns the metadata fields used for filtering and auto-filter extraction — the semantic/LLM-extracted fields only. System fields like file_hash, chunk_id, source, chunk_index, created_at are excluded.
Use this when building dynamic filter prompts for an LLM agent. Using discover_metadata_fields() instead would include system fields that have no value as filters.
```python
fields = rag.filter_fields
# Default: ['company_name', 'doc_type', 'fiscal_quarter', 'fiscal_year']
# Custom: whatever fields are defined in your metadata.yaml

values = rag.get_field_values(fields)
# → {'company_name': ['apple', 'microsoft'], 'doc_type': ['10-k'], ...}
```
rag.discover_metadata_fields()¶
Return all metadata field names present in the collection, including system fields. Scrolls one point — fast regardless of collection size.
Use this for collection inspection or debugging. For building filter prompts, use rag.filter_fields instead.
Returns: list[str]
```python
fields = rag.discover_metadata_fields()
print(fields)
# ['company_name', 'doc_type', 'fiscal_year', 'fiscal_quarter',
#  'file_name', 'file_type', 'file_hash', 'chunk_id', 'chunk_index', ...]
```
rag.get_field_values(fields, limit)¶
Return unique values for one or more metadata fields using Qdrant's facet API. Results are ordered by frequency (most common values first).
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `fields` | `str \| list[str]` | Yes | — | Field name or list of field names |
| `limit` | `int` | No | `50` | Max unique values to return per field. Increase for high-cardinality fields (e.g. `file_name`). |
Returns:
- list — if fields is a str
- dict[str, list] — if fields is a list
```python
# Single field — returns a list of up to 50 unique values
rag.get_field_values("company_name")
# → ['apple', 'microsoft', 'google']

# Multiple fields — returns a dict
rag.get_field_values(["company_name", "doc_type"])
# → {'company_name': ['apple', 'microsoft', 'google'], 'doc_type': ['10-k', '10-q']}

# High-cardinality field — raise the limit
rag.get_field_values("file_name", limit=200)
# → ['Apple_10k_2025.pdf', 'Microsoft_10k_2025.pdf', ...]

# Typical agent workflow — use filter_fields, not discover_metadata_fields()
values = rag.get_field_values(rag.filter_fields)
results = rag.retrieve("revenue", filters={"company_name": values["company_name"][0]})
```
rag.extract_metadata(text)¶
Extract structured metadata from text using the configured LLM.
Automatically passes stored collection values so the LLM reuses existing entity names (e.g. "apple inc.") rather than extracting inconsistent variants ("apple", "Apple Inc."). This grounding is applied transparently — you do not need to pass stored values manually.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `text` | `str` | Yes | Document text to extract metadata from (first 10,000 chars used) |
Returns: dict
```python
metadata = rag.extract_metadata(open("report.txt").read())
print(metadata)
# {'company_name': 'apple inc.', 'doc_type': '10-k', 'fiscal_quarter': None, 'fiscal_year': [2025]}
```
rag.get_stats()¶
Get statistics about the current collection.
Returns: dict
```python
{
    "collection_name": "financial_docs",
    "total_documents": 420,  # Total chunks in Qdrant
    "vector_size": 768,      # Embedding dimension
    "indexed": 420           # Number of indexed vectors
}
```

```python
stats = rag.get_stats()
print(f"Collection: {stats['collection_name']}, Chunks: {stats['total_documents']}")
```
Config Reference — llm and embeddings¶
All parameters below are set in config.yaml and read automatically by RAGWire at startup.
llm section¶
Controls the LLM used for metadata extraction (and filter extraction during retrieval).
| Key | Required | Default | Description |
|---|---|---|---|
| `provider` | Yes | — | `ollama`, `openai`, `google`, `groq`, `anthropic` |
| `model` | Yes | — | Model name (e.g. `qwen3.5:9b`, `gpt-4o-mini`) |
| `base_url` | Ollama only | `http://localhost:11434` | Ollama server URL |
| `num_ctx` | Ollama only | LangChain default | Context window size — only set this if you need to override the default |
| `api_key` | Google / Groq / Anthropic | — | API key (or use `${ENV_VAR}` syntax) |
OpenAI
OpenAI reads OPENAI_API_KEY from the environment automatically — no api_key field needed in config.
```yaml
# Ollama
llm:
  provider: "ollama"
  model: "qwen3.5:9b"
  base_url: "http://localhost:11434"
  num_ctx: 16384
```

```yaml
# OpenAI
llm:
  provider: "openai"
  model: "gpt-4o-mini"
```

```yaml
# Google Gemini
llm:
  provider: "google"
  model: "gemini-2.5-flash"
  api_key: "${GOOGLE_API_KEY}"
```

```yaml
# Groq
llm:
  provider: "groq"
  model: "llama-3.3-70b-versatile"
  api_key: "${GROQ_API_KEY}"
```

```yaml
# Anthropic
llm:
  provider: "anthropic"
  model: "claude-haiku-4-5-20251001"
  api_key: "${ANTHROPIC_API_KEY}"
```
embeddings section¶
Controls the embedding model used to encode documents and queries into vectors.
| Key | Required | Default | Description |
|---|---|---|---|
| `provider` | Yes | — | `ollama`, `openai`, `google`, `huggingface`, `fastembed` |
| `model` | Most providers | provider default | Embedding model name |
| `base_url` | Ollama only | `http://localhost:11434` | Ollama server URL |
| `num_ctx` | Ollama only | LangChain default | Context window size — only set this if you need to override the default |
| `api_key` | Google only | — | API key (or use `${ENV_VAR}` syntax) |
| `model_name` | HuggingFace / FastEmbed only | see below | Model identifier (uses `model_name` key, not `model`) |
| `model_kwargs` | HuggingFace only | `{}` | Passed to the HuggingFace model constructor (e.g. `{"device": "cpu"}`) |
| `encode_kwargs` | HuggingFace only | `{}` | Passed to the encode call (e.g. `{"normalize_embeddings": true}`) |
Default models per provider:
| Provider | Default model |
|---|---|
| `ollama` | `nomic-embed-text` |
| `openai` | `text-embedding-3-small` |
| `google` | `models/embedding-001` |
| `huggingface` | `sentence-transformers/all-MiniLM-L6-v2` |
| `fastembed` | `BAAI/bge-small-en-v1.5` |
```yaml
# Ollama
embeddings:
  provider: "ollama"
  model: "nomic-embed-text"
  base_url: "http://localhost:11434"
  num_ctx: 16384
```

```yaml
# OpenAI
embeddings:
  provider: "openai"
  model: "text-embedding-3-small"
```

```yaml
# Google Gemini
embeddings:
  provider: "google"
  model: "models/gemini-embedding-001"
  api_key: "${GOOGLE_API_KEY}"
```

```yaml
# HuggingFace (local)
embeddings:
  provider: "huggingface"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  model_kwargs:
    device: "cpu"
  encode_kwargs:
    normalize_embeddings: true
```

```yaml
# FastEmbed (local, sparse-capable)
embeddings:
  provider: "fastembed"
  model_name: "BAAI/bge-small-en-v1.5"
```
retriever section¶
Controls retrieval behaviour.
| Key | Required | Default | Description |
|---|---|---|---|
| `search_type` | No | `"similarity"` | `"similarity"` \| `"mmr"` \| `"hybrid"` (hybrid requires `use_sparse: true`) |
| `top_k` | No | `5` | Number of results returned by `retrieve()` |
| `auto_filter` | No | `false` | If `true`, LLM automatically extracts metadata filters from every query passed to `retrieve()` / `hybrid_search()`. If `false`, no filter extraction happens unless `filters=` is passed explicitly or `rag.extract_filters()` is called manually. |
```yaml
retriever:
  search_type: "hybrid"
  top_k: 5
  auto_filter: false  # set true to enable automatic filter extraction from queries
```
Agent use case
Keep auto_filter: false when an agent is driving retrieval. Use rag.extract_filters(query) to let the agent inspect and adjust filters before calling retrieve(filters=...).
MarkItDownLoader¶
Converts documents (PDF, DOCX, XLSX, PPTX, TXT, MD) to markdown text.
When to use MarkItDownLoader directly: Use it when you need to convert documents to text before passing them to a custom pipeline, or when you want to inspect/transform the text before ingestion.
MarkItDownLoader.load(file_path)¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `file_path` | `str` | Yes | Path to the document |
Returns: dict
```python
{
    "success": True,
    "text_content": "# Apple Inc.\n\n...",  # Markdown text
    "file_name": "Apple_10k_2025.pdf",
    "file_type": "pdf",
    "error": None  # Error message if success=False
}
```

```python
loader = MarkItDownLoader()
result = loader.load("data/Apple_10k_2025.pdf")

if result["success"]:
    print(f"Loaded {len(result['text_content'])} characters")
else:
    print(f"Error: {result['error']}")
```
loader.load_batch(file_paths)¶
Load multiple documents in one call. Returns results in the same order as the input list.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `file_paths` | `list[str]` | Yes | List of file paths to load |
Returns: list[dict] — same structure as load() for each file.
```python
loader = MarkItDownLoader()
results = loader.load_batch(["doc1.pdf", "doc2.pdf", "doc3.docx"])

for result in results:
    if result["success"]:
        print(f"{result['file_name']}: {len(result['text_content'])} chars")
    else:
        print(f"{result['file_name']}: {result['error']}")
```
loader.load_directory(directory, extensions, recursive)¶
Load all supported documents from a directory.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `directory` | `str` | Yes | — | Path to directory |
| `extensions` | `list[str]` | No | all supported | File extensions to include |
| `recursive` | `bool` | No | `False` | Scan subdirectories |
Returns: list[dict]
```python
loader = MarkItDownLoader()
results = loader.load_directory("data/", extensions=[".pdf", ".docx"], recursive=True)
texts = [r["text_content"] for r in results if r["success"]]
```
Text Splitters¶
All splitters return a RecursiveCharacterTextSplitter instance with a .split_text(text) method.
Choosing a splitter:
- get_markdown_splitter — best for PDF/DOCX/reports (converted to markdown by MarkItDown); respects document structure
- get_splitter — best for plain text, HTML, or any content without markdown headers
- get_code_splitter — best for source code files; splits on class/function boundaries
Chunk size guidance: Larger chunks (8k–12k chars) preserve more context per chunk — good for long-form financial/legal docs. Smaller chunks (500–2k chars) give more precise retrieval — good for FAQ-style content. chunk_overlap prevents context being cut mid-sentence; 20% of chunk size is a sensible default.
get_markdown_splitter(chunk_size, chunk_overlap)¶
Splits on markdown headers first (##, ###, ####), then paragraphs. Best for PDF/DOCX converted via MarkItDown.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | `int` | `1000` | Max characters per chunk |
| `chunk_overlap` | `int` | `200` | Overlap between chunks |
```python
splitter = get_markdown_splitter(chunk_size=10000, chunk_overlap=2000)
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks")
```
get_splitter(chunk_size, chunk_overlap, separators)¶
Generic recursive splitter. Splits on `"\n\n"` → `"\n"` → `" "` → `""`.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | `int` | `1000` | Max characters per chunk |
| `chunk_overlap` | `int` | `200` | Overlap between chunks |
| `separators` | `list[str]` | `["\n\n", "\n", " ", ""]` | Custom separators |
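No example is shown for this splitter above; a minimal sketch, assuming it is imported from ragwire like the other splitters on this page:

```python
from ragwire import get_splitter

# Custom separators: fall back from blank lines to single newlines to words
splitter = get_splitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(plain_text)
```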
get_code_splitter(chunk_size, chunk_overlap)¶
Splits on code structure: class, def, comments. Best for source code files.
```python
splitter = get_code_splitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_text(source_code)
```
get_embedding¶
Factory function — returns an embedding model instance for the configured provider.
get_embedding(config)¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `config` | `dict` | Yes | Provider config dict with `provider` key |
Supported providers: ollama, openai, huggingface, google, fastembed
Returns: Embedding model with .embed_query(text) and .embed_documents(texts) methods.
```python
# Ollama
embedding = get_embedding({
    "provider": "ollama",
    "model": "nomic-embed-text",
    "base_url": "http://localhost:11434",
})

# OpenAI
embedding = get_embedding({
    "provider": "openai",
    "model": "text-embedding-3-small",
})

# HuggingFace
embedding = get_embedding({
    "provider": "huggingface",
    "model_name": "sentence-transformers/all-MiniLM-L6-v2",
    "model_kwargs": {"device": "cpu"},
})

vector = embedding.embed_query("What is Apple's revenue?")
print(f"Dimension: {len(vector)}")
```
MetadataExtractor¶
Extract structured metadata from document text using an LLM.
MetadataExtractor(llm, schema_model)¶
Uses with_structured_output with a Pydantic model for reliable, type-safe extraction — no manual JSON parsing.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `llm` | `Any` | Yes | LangChain chat model instance |
| `schema_model` | `BaseModel` | No | Pydantic model defining fields and types. Defaults to `FinancialMetadata` |
```python
from ragwire import MetadataExtractor, FinancialMetadata
from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3.5:9b", base_url="http://localhost:11434")

# Default — uses FinancialMetadata schema (company_name, doc_type, fiscal_quarter, fiscal_year)
extractor = MetadataExtractor(llm)

# Custom Pydantic schema
from pydantic import BaseModel, Field
from typing import Optional, List

class MySchema(BaseModel):
    organization: Optional[str] = Field(None, description="Organization name in lowercase")
    doc_type: Optional[str] = Field(None, description="contract | policy | report")
    effective_year: Optional[int] = Field(None, description="Year the document is effective")
    tags: Optional[List[str]] = Field(None, description="List of topic tags")

extractor = MetadataExtractor(llm, schema_model=MySchema)
```
extractor.extract(text, stored_values)¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `text` | `str` | Yes | Document text (first 10,000 chars used) |
| `stored_values` | `dict` | No | Existing field values from the collection. When provided, the LLM reuses stored names (e.g. "apple inc.") instead of extracting inconsistent variants. Pass `rag.get_field_values(fields)` or use `rag.extract_metadata()`, which injects this automatically. |
Returns: dict
```python
# Basic extraction
metadata = extractor.extract(document_text)

# With grounding — LLM reuses stored entity names
stored = rag.get_field_values(rag.filter_fields)
metadata = extractor.extract(document_text, stored_values=stored)
print(metadata)
```
MetadataExtractor.from_yaml(llm, yaml_path)¶
Create an extractor from a YAML file. Builds a Pydantic model dynamically from the field definitions.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `llm` | `Any` | Yes | LangChain chat model instance |
| `yaml_path` | `str` | Yes | Path to metadata YAML config file |
Returns: MetadataExtractor
```python
extractor = MetadataExtractor.from_yaml(llm, "metadata.yaml")
metadata = extractor.extract(document_text)
```
See Custom Metadata for the YAML format including type and values field options.
extractor.extract_batch(texts)¶
| Parameter | Type | Description |
|---|---|---|
| `texts` | `list[str]` | List of document texts |
Returns: list[dict]
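A sketch combining it with MarkItDownLoader.load_batch() from above (reusing the `extractor` built earlier; file names are illustrative):

```python
loader = MarkItDownLoader()
results = loader.load_batch(["doc1.pdf", "doc2.pdf"])
texts = [r["text_content"] for r in results if r["success"]]

# One metadata dict per input text, in order
all_metadata = extractor.extract_batch(texts)
for meta in all_metadata:
    print(meta.get("company_name"))
```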
DocumentMetadata¶
Pydantic schema for chunk metadata. Useful for type-checking or building typed wrappers.
```python
meta = DocumentMetadata(
    company_name="apple",
    doc_type="10-k",
    fiscal_year=[2025],
    source="/data/Apple_10k_2025.pdf",
    file_name="Apple_10k_2025.pdf",
    file_type="pdf",
    file_hash="abc123...",
    chunk_id="abc123_0",
    chunk_hash="def456...",
    chunk_index=0,
    total_chunks=42,
)
print(meta.model_dump())
```
See Metadata & Filtering for the full field reference.
Logging¶
Use setup_logging for plain text logs (production, log files). Use setup_colored_logging during development — color-codes log levels so warnings and errors stand out at a glance.
setup_logging(log_level, log_file, console_output, format_string)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `log_level` | `str` | `"INFO"` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
| `log_file` | `str` | `None` | Optional path to write logs to file |
| `console_output` | `bool` | `True` | Print logs to stdout |
| `format_string` | `str` | `None` | Custom log format string |
Returns: logging.Logger
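A minimal usage sketch, assuming the parameters above (the log path is illustrative):

```python
from ragwire import setup_logging

# Plain-text logs to stdout and a file — suited to production
logger = setup_logging(log_level="INFO", log_file="logs/rag.log")
logger.info("Ingestion started")
```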
setup_colored_logging(log_level, log_file)¶
Same as setup_logging but with colored console output — errors in red, warnings in yellow, info in green. Useful during development to spot issues quickly.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `log_level` | `str` | `"INFO"` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
| `log_file` | `str` | `None` | Optional path to write plain-text logs to file |
Returns: logging.Logger
```python
from ragwire import setup_colored_logging

logger = setup_colored_logging(log_level="DEBUG")
logger.info("Pipeline started")   # green
logger.warning("Slow response")   # yellow
logger.error("LLM call failed")   # red
```
You can also enable colored logging from config.yaml — no code change needed:
```yaml
logging:
  level: "INFO"
  colored: true
  console_output: true
  # log_file: "logs/rag.log"  # uncomment to also write to file
```
Low-level / Advanced API¶
These APIs are exported for advanced use cases — custom pipelines, direct vector store access, or building on top of RAGWire internals. Most users will not need these directly.
QdrantStore¶
Direct Qdrant collection management. Use this when you need fine-grained control over the vector store outside of RAGWire.
QdrantStore(config, embedding, collection_name)¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `config` | `dict` | Yes | Vectorstore config (`url`, `api_key`) |
| `embedding` | `Any` | Yes | Embedding model instance |
| `collection_name` | `str` | No | Collection name |
Methods¶
| Method | Returns | Description |
|---|---|---|
| `set_collection(name)` | `None` | Set active collection |
| `get_store(use_sparse)` | `QdrantVectorStore` | Get LangChain vectorstore instance |
| `create_collection(use_sparse)` | `None` | Create a new collection |
| `delete_collection()` | `None` | Delete the collection |
| `collection_exists()` | `bool` | Check if collection exists |
| `file_hash_exists(file_hash)` | `bool` | Check if file already ingested |
| `get_collection_info()` | `CollectionInfo` | Get Qdrant collection metadata |
| `get_metadata_keys()` | `list[str]` | Scroll one point, return all metadata field names |
| `get_field_values(fields, limit)` | `dict` | Unique values per field via Qdrant facet API |
| `create_payload_indexes(fields)` | `None` | Create keyword indexes for facet API (auto-called during ingestion) |
```python
store = QdrantStore(
    config={"url": "http://localhost:6333"},
    embedding=embedding,
    collection_name="my_docs",
)
store.create_collection(use_sparse=True)
vectorstore = store.get_store(use_sparse=True)
docs = vectorstore.similarity_search("revenue", k=5)
```
store.get_metadata_keys()¶
Scrolls one point from the collection and returns all metadata field names present. Use this when you don't know what fields were stored — e.g. inspecting a collection built by someone else, or verifying custom metadata was extracted correctly.
```python
fields = store.get_metadata_keys()
# → ['company_name', 'doc_type', 'fiscal_year', 'file_name', 'chunk_index', ...]
```
store.get_field_values(fields, limit)¶
Returns unique values for each requested field using Qdrant's facet API. Requires payload indexes on those fields — call create_payload_indexes() first if you haven't ingested via RAGWire (which does this automatically).
| Parameter | Type | Default | Description |
|---|---|---|---|
| `fields` | `list[str]` | — | Field names (without `metadata.` prefix) |
| `limit` | `int` | `50` | Max unique values per field |
Returns: dict[str, list]
```python
# Discover fields first, then get values for the ones you care about
fields = store.get_metadata_keys()
# → ['company_name', 'doc_type', 'fiscal_year', ...]

values = store.get_field_values(["company_name", "doc_type"])
# → {'company_name': ['apple', 'microsoft'], 'doc_type': ['10-k', '10-q']}

# High-cardinality field — raise the limit
values = store.get_field_values(["file_name"], limit=200)
```
Using RAGWire instead?
If you're using RAGWire, prefer rag.filter_fields + rag.get_field_values() for filter prompts, and rag.discover_metadata_fields() for collection inspection — they are thin wrappers over these same methods and don't require you to manage the QdrantStore instance directly.
Retrieval Functions¶
Use these when building a custom retrieval layer outside of RAGWire.
Choosing a search strategy:
| Strategy | Use when |
|---|---|
| `similarity` | General semantic search; fast, good default |
| `hybrid` | Queries mix semantic meaning with exact keywords (e.g. ticker symbols, product names, IDs) |
| `mmr` | You want diverse results — avoids returning 5 nearly identical chunks from the same page |
get_retriever(vectorstore, top_k, search_type)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vectorstore` | `QdrantVectorStore` | — | Vector store instance |
| `top_k` | `int` | `5` | Number of results |
| `search_type` | `str` | `"similarity"` | `"similarity"`, `"mmr"`, `"hybrid"` |
Returns: LangChain retriever with .invoke(query) method.
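A usage sketch, assuming a `vectorstore` obtained from `QdrantStore.get_store()` as shown above:

```python
retriever = get_retriever(vectorstore, top_k=5, search_type="mmr")
docs = retriever.invoke("Apple revenue fiscal 2025")
for doc in docs:
    print(doc.metadata.get("file_name"))
```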
hybrid_search(vectorstore, query, k, filters)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vectorstore` | `QdrantVectorStore` | — | Vector store instance |
| `query` | `str` | — | Search query |
| `k` | `int` | `5` | Number of results |
| `filters` | `dict` | `None` | Plain metadata filter dict (same format as `rag.retrieve()` filters) |
Returns: list[Document]
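A usage sketch (per the note above, the collection must have been created with `use_sparse: true`, or this falls back to dense-only search):

```python
results = hybrid_search(
    vectorstore,
    "AAPL ticker revenue",  # keyword-heavy query where sparse vectors help
    k=5,
    filters={"company_name": "apple"},
)
```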
mmr_search(vectorstore, query, k, fetch_k, lambda_mult, filters)¶
Maximal Marginal Relevance — retrieves diverse, non-redundant results. Use this when a regular similarity search returns several near-identical chunks from the same section of a document, and you want results spread across different parts.
fetch_k controls how many candidates are retrieved first, then MMR selects the most diverse k from them. A larger fetch_k gives MMR more candidates to choose from. lambda_mult controls the balance: 0.0 = maximise diversity, 1.0 = maximise relevance (same as similarity search), 0.5 = balanced default.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vectorstore` | `QdrantVectorStore` | — | Vector store instance |
| `query` | `str` | — | Search query |
| `k` | `int` | `5` | Number of results to return |
| `fetch_k` | `int` | `20` | Candidates fetched before MMR selection |
| `lambda_mult` | `float` | `0.5` | Diversity (0.0 = max diverse, 1.0 = max relevant) |
| `filters` | `dict` | `None` | Plain metadata filter dict (same format as `rag.retrieve()` filters) |
Returns: list[Document]
```python
# Balanced — good default
results = mmr_search(vectorstore, "Apple revenue and earnings", k=5)

# More diverse — useful when documents are long and repetitive
results = mmr_search(vectorstore, "Apple revenue and earnings", k=5, lambda_mult=0.3)
```
Hashing Utilities¶
Used internally by the pipeline for SHA256 deduplication. Exposed for custom ingestion workflows.
Why deduplication matters: Without it, re-running ingestion on the same files doubles the chunks in Qdrant, degrading retrieval quality and wasting storage. RAGWire checks file_hash before ingesting — if a file with the same hash already exists in the collection, the file is skipped entirely.
| Function | Parameters | Returns | Description |
|---|---|---|---|
| `sha256_text(text)` | `text: str` | `str` | SHA256 of a text string |
| `sha256_file_from_path(path)` | `path: str \| Path` | `str` | SHA256 of a file (streamed, memory-efficient) |
| `sha256_chunk(chunk_id, content)` | `chunk_id: str, content: str` | `str` | SHA256 of a chunk (id + content combined) |
```python
from ragwire import sha256_file_from_path

file_hash = sha256_file_from_path("data/Apple_10k_2025.pdf")
print(file_hash)  # "a1b2c3d4..."
```
get_logger¶
Get a child logger under the ragwire namespace. Used internally by all modules.
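A minimal sketch — assuming get_logger takes a child logger name (the signature is not spelled out on this page):

```python
from ragwire import get_logger

# Hypothetical name argument — yields a logger under the ragwire namespace
logger = get_logger("ingest")
logger.debug("custom pipeline step")
```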