
API Reference

All public APIs are importable directly from ragwire:

from ragwire import RAGWire, MarkItDownLoader, get_embedding, QdrantStore, ...

Core API

These are the primary user-facing APIs. Most applications only need these.


RAGWire

The main orchestrator. Handles the full pipeline from config loading to ingestion and retrieval.

from ragwire import RAGWire

RAGWire(config_path)

Initialize the pipeline from a YAML config file.

| Parameter | Type | Required | Description |
|---|---|---|---|
| config_path | str | Yes | Path to config.yaml |

Raises:

  • FileNotFoundError — config file not found
  • ValueError — missing required config keys (e.g. llm.model)
rag = RAGWire("config.yaml")

rag.ingest_documents(file_paths)

Ingest a list of documents into the vector store. Skips files already ingested (SHA256 deduplication).

| Parameter | Type | Required | Description |
|---|---|---|---|
| file_paths | list[str] | Yes | List of file paths to ingest |

Returns: dict

{
    "total": 3,           # Total files submitted
    "processed": 2,       # Successfully ingested
    "skipped": 1,         # Already in vector store (duplicate)
    "failed": 0,          # Failed to load or process
    "chunks_created": 84, # Total chunks added to Qdrant
    "errors": []          # List of {"file": ..., "error": ...} dicts
}
stats = rag.ingest_documents([
    "data/Apple_10k_2025.pdf",
    "data/Microsoft_10k_2025.pdf",
])
print(f"Processed: {stats['processed']}, Chunks: {stats['chunks_created']}")

A progress bar (tqdm) is shown automatically while ingestion runs.


rag.ingest_directory(directory, recursive, extensions)

Ingest all supported documents from a directory. Internally calls ingest_documents().

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| directory | str | Yes | | Path to the directory |
| recursive | bool | No | False | Search subdirectories |
| extensions | list[str] | No | loader config | File extensions to include |

Returns: dict — same stats dict as ingest_documents()

# Ingest all PDFs/DOCX in a folder
stats = rag.ingest_directory("data/")

# Recursively include subdirectories
stats = rag.ingest_directory("data/", recursive=True)

# Only specific extensions
stats = rag.ingest_directory("data/", extensions=[".pdf"])

rag.retrieve(query, top_k, filters)

Retrieve the most relevant chunks for a query.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| query | str | Yes | | Search query |
| top_k | int | No | config value | Number of results to return |
| filters | dict | No | None | Metadata filters (see Metadata & Filtering) |

Returns: list[Document]

Each Document has:

  • doc.page_content — the chunk text
  • doc.metadata — dict with all metadata fields (see Metadata Schema)

Filter behaviour:

  • If filters is passed → used as-is, no LLM call (always, regardless of auto_filter setting)
  • If filters is not passed and auto_filter: true in config → LLM extracts filters from the query
  • If filters is not passed and auto_filter: false (default) → no filtering, pure semantic search
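All three cases end in the same place: a plain dict of metadata key/value pairs that constrains which chunks are searched. As a toy illustration of what a filters dict means (the real matching happens inside Qdrant, not in Python like this):

```python
# Toy illustration only: a filters dict matches chunks whose metadata
# contains every given key/value pair. Field names are taken from the
# examples in this document; Qdrant performs the actual filtering.

def matches(metadata: dict, filters: dict) -> bool:
    """True if every filter key is present in metadata with an equal value."""
    return all(metadata.get(key) == value for key, value in filters.items())

chunks = [
    {"company_name": "apple", "fiscal_year": 2025},
    {"company_name": "microsoft", "fiscal_year": 2025},
    {"company_name": "apple", "fiscal_year": 2024},
]

hits = [c for c in chunks if matches(c, {"company_name": "apple", "fiscal_year": 2025})]
print(hits)  # [{'company_name': 'apple', 'fiscal_year': 2025}]
```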

When to use auto-filter vs explicit filters: Use explicit filters in programmatic pipelines where you control the inputs (faster, zero LLM overhead). Enable auto_filter in simple user-facing chatbots. For agents, keep auto_filter: false and use rag.extract_filters(query) to give the agent full control over whether and how to apply filters.

# Explicit filters — LLM extraction skipped
results = rag.retrieve(
    "What is the net income?",
    top_k=5,
    filters={"company_name": "apple", "fiscal_year": 2025}
)

# auto_filter: true in config — LLM extracts {"company_name": "apple", "fiscal_year": 2025}
results = rag.retrieve("What is Apple's net income for 2025?")

# auto_filter: false (default) — pure semantic search, no filter extraction
results = rag.retrieve("What is Apple's net income for 2025?")

for doc in results:
    print(doc.metadata.get("company_name"))
    print(doc.page_content[:300])

rag.hybrid_search(query, k, filters)

Perform hybrid search combining dense (semantic) and sparse (keyword) vectors. Requires use_sparse: true in config.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| query | str | Yes | | Search query |
| k | int | No | 5 | Number of results |
| filters | dict | No | None | Metadata filters |

Returns: list[Document]

Hybrid search requires sparse vectors

hybrid_search() only performs true hybrid (dense + sparse) search when both conditions are met:

  1. use_sparse: true in config.yaml — collection must be created with sparse vector support
  2. pip install fastembed — required for sparse encoding

If either is missing, the call silently falls back to dense-only similarity search. There is no error raised. If your collection was created with use_sparse: false, you must set force_recreate: true and re-ingest to enable hybrid search.

retrieve() vs hybrid_search() — when to use which:

| | retrieve() | hybrid_search() |
|---|---|---|
| Search type | Whatever is set in config.yaml (similarity, mmr, or hybrid) | Always hybrid (dense + sparse), regardless of config |
| Auto-filter | Only when auto_filter: true in config (default false) | Same — respects auto_filter setting |
| top_k default | From config.yaml | k=5 parameter |
| Typical use | Primary method for all RAG flows | Override to force hybrid on a single call |

If your config.yaml already has search_type: "hybrid", both methods produce identical results. Use hybrid_search() only when your config is set to similarity or mmr and you want to force hybrid for a specific call.

# Use retrieve() in most cases — honours config search type
results = rag.retrieve("Apple revenue fiscal 2025", top_k=5)

# Use hybrid_search() to force hybrid regardless of config
results = rag.hybrid_search(
    "Apple revenue fiscal 2025",
    k=5,
    filters={"company_name": "apple"}
)

rag.extract_filters(query)

Extract metadata filters from a natural language query without triggering retrieval. Returns the raw extracted dict so an agent can inspect, adjust, or discard before passing to retrieve().

| Parameter | Type | Required | Description |
|---|---|---|---|
| query | str | Yes | Natural language query |

Returns: dict of extracted filters, or None if nothing was extracted.

Note

This method always runs regardless of the auto_filter config setting. It gives agents explicit control — call it manually, decide what to do, then pass the result to retrieve(filters=...).

# Agent workflow — full control over filters
filters = rag.extract_filters("muscle building studies from 2023")
# → {"research_focus": "muscle building", "publication_year": 2023}

# Agent validates against stored values (extract_filters may return None)
stored = rag.get_field_values(rag.filter_fields)
if filters and filters.get("research_focus") not in stored.get("research_focus", []):
    filters.pop("research_focus", None)  # drop uncertain filter, rely on semantic search

results = rag.retrieve("muscle building studies from 2023", filters=filters)

rag.get_filter_context(query, limit)

Build a ready-made markdown prompt block for an agent — contains available metadata fields, their stored values, the filters extracted from the current query, and instructions for the agent on how to act on them. Append or prepend to your agent's task prompt.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| query | str | Yes | | Natural language query |
| limit | int | No | 50 | Max stored values to show per field |

Returns: str — formatted markdown block ready to inject into an agent prompt.

filter_context = rag.get_filter_context("muscle building studies from 2023")
agent_prompt = filter_context + "\n\n" + your_task_prompt

The returned block looks like:

## RAGWire Filter Context

### Available Metadata Fields and Stored Values
- **research_focus**: ["muscle-growth", "endurance", "recovery", ...]
- **publication_year**: [2022, 2023, 2024]
- **authors**: ["john smith", "jane doe", ...]

### Extracted Filters from Query
- **research_focus**: `muscle building`
- **publication_year**: `2023`

### Instructions
1. Review the extracted filters above.
2. If an extracted value does not match or closely relate to any stored value, adjust or drop that filter.
3. If the query has no clear metadata intent, pass an empty dict {} as filters.
4. Pass the final filters dict to the retrieval tool as filters=.
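Step 2 of those instructions (drop or adjust filters whose values don't match stored values) can be sketched mechanically. This is only an illustration that uses difflib for fuzzy matching; a real agent would apply its own judgement:

```python
# Sketch of filter validation against stored collection values.
# difflib stands in for the agent's judgement here; not ragwire's code.
from difflib import get_close_matches

def validate_filters(extracted: dict, stored: dict) -> dict:
    """Keep a filter only if its value matches, or closely matches, a stored value."""
    valid = {}
    for field, value in extracted.items():
        candidates = [str(v) for v in stored.get(field, [])]
        if str(value) in candidates:
            valid[field] = value          # exact match: keep as-is
            continue
        close = get_close_matches(str(value), candidates, n=1, cutoff=0.6)
        if close:
            valid[field] = close[0]       # adjust to the stored spelling
        # otherwise: drop the filter entirely
    return valid

stored = {"research_focus": ["muscle-growth", "endurance"], "publication_year": [2022, 2023]}
extracted = {"research_focus": "muscle growth", "publication_year": 2023}
print(validate_filters(extracted, stored))
# {'research_focus': 'muscle-growth', 'publication_year': 2023}
```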

Typical agent workflow

Use get_filter_context() to give the agent full situational awareness. The agent can then call rag.retrieve(query, filters=adjusted_filters) with a well-informed decision on which filters to apply.


rag.filter_fields

Property. Returns the metadata fields used for filtering and auto-filter extraction — the semantic/LLM-extracted fields only. System fields like file_hash, chunk_id, source, chunk_index, created_at are excluded.

Use this when building dynamic filter prompts for an LLM agent. Using discover_metadata_fields() instead would include system fields that have no value as filters.

fields = rag.filter_fields
# Default: ['company_name', 'doc_type', 'fiscal_quarter', 'fiscal_year']
# Custom:  whatever fields are defined in your metadata.yaml

values = rag.get_field_values(fields)
# → {'company_name': ['apple', 'microsoft'], 'doc_type': ['10-k'], ...}

rag.discover_metadata_fields()

Return all metadata field names present in the collection, including system fields. Scrolls one point — fast regardless of collection size.

Use this for collection inspection or debugging. For building filter prompts, use rag.filter_fields instead.

Returns: list[str]

fields = rag.discover_metadata_fields()
print(fields)
# ['company_name', 'doc_type', 'fiscal_year', 'fiscal_quarter',
#  'file_name', 'file_type', 'file_hash', 'chunk_id', 'chunk_index', ...]

rag.get_field_values(fields, limit)

Return unique values for one or more metadata fields using Qdrant's facet API. Results are ordered by frequency (most common values first).

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| fields | str \| list[str] | Yes | | Field name or list of field names |
| limit | int | No | 50 | Max unique values to return per field. Increase for high-cardinality fields (e.g. file_name). |

Returns:

  • list — if fields is a str
  • dict[str, list] — if fields is a list

# Single field — returns a list of up to 50 unique values
rag.get_field_values("company_name")
# → ['apple', 'microsoft', 'google']

# Multiple fields — returns a dict
rag.get_field_values(["company_name", "doc_type"])
# → {'company_name': ['apple', 'microsoft', 'google'], 'doc_type': ['10-k', '10-q']}

# High-cardinality field — raise the limit
rag.get_field_values("file_name", limit=200)
# → ['Apple_10k_2025.pdf', 'Microsoft_10k_2025.pdf', ...]

# Typical agent workflow — use filter_fields, not discover_metadata_fields()
values = rag.get_field_values(rag.filter_fields)
results = rag.retrieve("revenue", filters={"company_name": values["company_name"][0]})

rag.extract_metadata(text)

Extract structured metadata from text using the configured LLM.

Automatically passes stored collection values so the LLM reuses existing entity names (e.g. "apple inc.") rather than extracting inconsistent variants ("apple", "Apple Inc."). This grounding is applied transparently — you do not need to pass stored values manually.

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Document text to extract metadata from (first 10,000 chars used) |

Returns: dict

with open("report.txt") as f:
    metadata = rag.extract_metadata(f.read())
print(metadata)
# {'company_name': 'apple inc.', 'doc_type': '10-k', 'fiscal_quarter': None, 'fiscal_year': [2025]}

rag.get_stats()

Get statistics about the current collection.

Returns: dict

{
    "collection_name": "financial_docs",
    "total_documents": 420,   # Total chunks in Qdrant
    "vector_size": 768,       # Embedding dimension
    "indexed": 420            # Number of indexed vectors
}
stats = rag.get_stats()
print(f"Collection: {stats['collection_name']}, Chunks: {stats['total_documents']}")


Config Reference — llm and embeddings

All parameters below are set in config.yaml and read automatically by RAGWire at startup.


llm section

Controls the LLM used for metadata extraction (and filter extraction during retrieval).

| Key | Required | Default | Description |
|---|---|---|---|
| provider | Yes | | ollama, openai, google, groq, anthropic |
| model | Yes | | Model name (e.g. qwen3.5:9b, gpt-4o-mini) |
| base_url | Ollama only | http://localhost:11434 | Ollama server URL |
| num_ctx | Ollama only | LangChain default | Context window size — only set this if you need to override the default |
| api_key | Google / Groq / Anthropic | | API key (or use ${ENV_VAR} syntax) |

OpenAI

OpenAI reads OPENAI_API_KEY from the environment automatically — no api_key field needed in config.

# Ollama
llm:
  provider: "ollama"
  model: "qwen3.5:9b"
  base_url: "http://localhost:11434"
  num_ctx: 16384

# OpenAI
llm:
  provider: "openai"
  model: "gpt-4o-mini"

# Google Gemini
llm:
  provider: "google"
  model: "gemini-2.5-flash"
  api_key: "${GOOGLE_API_KEY}"

# Groq
llm:
  provider: "groq"
  model: "llama-3.3-70b-versatile"
  api_key: "${GROQ_API_KEY}"

# Anthropic
llm:
  provider: "anthropic"
  model: "claude-haiku-4-5-20251001"
  api_key: "${ANTHROPIC_API_KEY}"
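The ${ENV_VAR} values above are placeholders resolved from the environment. A minimal sketch of how such substitution could work (an illustrative assumption, not ragwire's actual config loader):

```python
# Illustrative ${ENV_VAR} expansion for config values. How ragwire resolves
# these internally is an assumption here; this only shows the idea.
import os
import re

_ENV_PATTERN = re.compile(r"\$\{(\w+)\}")

def expand_env(value: str) -> str:
    """Replace ${NAME} with os.environ[NAME]; leave unset variables untouched."""
    return _ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)

os.environ["GROQ_API_KEY"] = "gsk_demo_key"
print(expand_env("${GROQ_API_KEY}"))   # gsk_demo_key
print(expand_env("${NOT_SET_12345}"))  # ${NOT_SET_12345}
```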

embeddings section

Controls the embedding model used to encode documents and queries into vectors.

| Key | Required | Default | Description |
|---|---|---|---|
| provider | Yes | | ollama, openai, google, huggingface, fastembed |
| model | Most providers | provider default | Embedding model name |
| base_url | Ollama only | http://localhost:11434 | Ollama server URL |
| num_ctx | Ollama only | LangChain default | Context window size — only set this if you need to override the default |
| api_key | Google only | | API key (or use ${ENV_VAR} syntax) |
| model_name | HuggingFace / FastEmbed only | see below | Model identifier (uses model_name key, not model) |
| model_kwargs | HuggingFace only | {} | Passed to the HuggingFace model constructor (e.g. {"device": "cpu"}) |
| encode_kwargs | HuggingFace only | {} | Passed to the encode call (e.g. {"normalize_embeddings": true}) |

Default models per provider:

| Provider | Default model |
|---|---|
| ollama | nomic-embed-text |
| openai | text-embedding-3-small |
| google | models/embedding-001 |
| huggingface | sentence-transformers/all-MiniLM-L6-v2 |
| fastembed | BAAI/bge-small-en-v1.5 |

# Ollama
embeddings:
  provider: "ollama"
  model: "nomic-embed-text"
  base_url: "http://localhost:11434"
  num_ctx: 16384

# OpenAI
embeddings:
  provider: "openai"
  model: "text-embedding-3-small"

# Google Gemini
embeddings:
  provider: "google"
  model: "models/gemini-embedding-001"
  api_key: "${GOOGLE_API_KEY}"

# HuggingFace (local)
embeddings:
  provider: "huggingface"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  model_kwargs:
    device: "cpu"
  encode_kwargs:
    normalize_embeddings: true

# FastEmbed (local, sparse-capable)
embeddings:
  provider: "fastembed"
  model_name: "BAAI/bge-small-en-v1.5"

retriever section

Controls retrieval behaviour.

| Key | Required | Default | Description |
|---|---|---|---|
| search_type | No | "similarity" | "similarity" \| "mmr" \| "hybrid" (hybrid requires use_sparse: true) |
| top_k | No | 5 | Number of results returned by retrieve() |
| auto_filter | No | false | If true, LLM automatically extracts metadata filters from every query passed to retrieve() / hybrid_search(). If false, no filter extraction happens unless filters= is passed explicitly or rag.extract_filters() is called manually. |

retriever:
  search_type: "hybrid"
  top_k: 5
  auto_filter: false   # set true to enable automatic filter extraction from queries

Agent use case

Keep auto_filter: false when an agent is driving retrieval. Use rag.extract_filters(query) to let the agent inspect and adjust filters before calling retrieve(filters=...).


MarkItDownLoader

Converts documents (PDF, DOCX, XLSX, PPTX, TXT, MD) to markdown text.

from ragwire import MarkItDownLoader

When to use MarkItDownLoader directly: Use it when you need to convert documents to text before passing them to a custom pipeline, or when you want to inspect/transform the text before ingestion.

MarkItDownLoader.load(file_path)

| Parameter | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the document |

Returns: dict

{
    "success": True,
    "text_content": "# Apple Inc.\n\n...",  # Markdown text
    "file_name": "Apple_10k_2025.pdf",
    "file_type": "pdf",
    "error": None                            # Error message if success=False
}
loader = MarkItDownLoader()
result = loader.load("data/Apple_10k_2025.pdf")

if result["success"]:
    print(f"Loaded {len(result['text_content'])} characters")
else:
    print(f"Error: {result['error']}")

loader.load_batch(file_paths)

Load multiple documents in one call. Returns results in the same order as the input list.

| Parameter | Type | Required | Description |
|---|---|---|---|
| file_paths | list[str] | Yes | List of file paths to load |

Returns: list[dict] — same structure as load() for each file.

loader = MarkItDownLoader()
results = loader.load_batch(["doc1.pdf", "doc2.pdf", "doc3.docx"])

for result in results:
    if result["success"]:
        print(f"{result['file_name']}: {len(result['text_content'])} chars")
    else:
        print(f"{result['file_name']}: {result['error']}")

loader.load_directory(directory, extensions, recursive)

Load all supported documents from a directory.

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| directory | str | Yes | | Path to directory |
| extensions | list[str] | No | all supported | File extensions to include |
| recursive | bool | No | False | Scan subdirectories |

Returns: list[dict]

loader = MarkItDownLoader()
results = loader.load_directory("data/", extensions=[".pdf", ".docx"], recursive=True)
texts = [r["text_content"] for r in results if r["success"]]

Text Splitters

from ragwire import get_splitter, get_markdown_splitter, get_code_splitter

All splitters return a RecursiveCharacterTextSplitter instance with a .split_text(text) method.

Choosing a splitter:

  • get_markdown_splitter — best for PDF/DOCX/reports (converted to markdown by MarkItDown); respects document structure
  • get_splitter — best for plain text, HTML, or any content without markdown headers
  • get_code_splitter — best for source code files; splits on class/function boundaries

Chunk size guidance: Larger chunks (8k–12k chars) preserve more context per chunk — good for long-form financial/legal docs. Smaller chunks (500–2k chars) give more precise retrieval — good for FAQ-style content. chunk_overlap prevents context being cut mid-sentence; 20% of chunk size is a sensible default.
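As a rough picture of how chunk_size and chunk_overlap interact, here is a naive fixed-offset chunker. It is an illustration only; the actual splitters below are LangChain RecursiveCharacterTextSplitter instances that split on separators, not at fixed offsets:

```python
# Naive fixed-offset chunking, for intuition only: each chunk starts
# chunk_size - chunk_overlap characters after the previous one.

def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "x" * 2500
chunks = naive_chunks(text, chunk_size=1000, chunk_overlap=200)  # 20% overlap
print([len(c) for c in chunks])  # [1000, 1000, 900, 100]
```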

get_markdown_splitter(chunk_size, chunk_overlap)

Splits on markdown headers first (##, ###, ####), then paragraphs. Best for PDF/DOCX converted via MarkItDown.

| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 1000 | Max characters per chunk |
| chunk_overlap | int | 200 | Overlap between chunks |

splitter = get_markdown_splitter(chunk_size=10000, chunk_overlap=2000)
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks")

get_splitter(chunk_size, chunk_overlap, separators)

Generic recursive splitter. Splits on "\n\n", then "\n", then a single space, then "" (character level).

| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 1000 | Max characters per chunk |
| chunk_overlap | int | 200 | Overlap between chunks |
| separators | list[str] | ["\n\n", "\n", " ", ""] | Custom separators |

splitter = get_splitter(chunk_size=5000, chunk_overlap=500)
chunks = splitter.split_text(text)

get_code_splitter(chunk_size, chunk_overlap)

Splits on code structure: class, def, comments. Best for source code files.

splitter = get_code_splitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_text(source_code)

get_embedding

Factory function — returns an embedding model instance for the configured provider.

from ragwire import get_embedding

get_embedding(config)

| Parameter | Type | Required | Description |
|---|---|---|---|
| config | dict | Yes | Provider config dict with provider key |

Supported providers: ollama, openai, huggingface, google, fastembed

Returns: Embedding model with .embed_query(text) and .embed_documents(texts) methods.

# Ollama
embedding = get_embedding({
    "provider": "ollama",
    "model": "nomic-embed-text",
    "base_url": "http://localhost:11434",
})

# OpenAI
embedding = get_embedding({
    "provider": "openai",
    "model": "text-embedding-3-small",
})

# HuggingFace
embedding = get_embedding({
    "provider": "huggingface",
    "model_name": "sentence-transformers/all-MiniLM-L6-v2",
    "model_kwargs": {"device": "cpu"},
})

vector = embedding.embed_query("What is Apple's revenue?")
print(f"Dimension: {len(vector)}")

MetadataExtractor

Extract structured metadata from document text using an LLM.

from ragwire import MetadataExtractor

MetadataExtractor(llm, schema_model)

Uses with_structured_output with a Pydantic model for reliable, type-safe extraction — no manual JSON parsing.

| Parameter | Type | Required | Description |
|---|---|---|---|
| llm | Any | Yes | LangChain chat model instance |
| schema_model | BaseModel | No | Pydantic model defining fields and types. Defaults to FinancialMetadata |

from ragwire import MetadataExtractor, FinancialMetadata
from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3.5:9b", base_url="http://localhost:11434")

# Default — uses FinancialMetadata schema (company_name, doc_type, fiscal_quarter, fiscal_year)
extractor = MetadataExtractor(llm)

# Custom Pydantic schema
from pydantic import BaseModel, Field
from typing import Optional, List

class MySchema(BaseModel):
    organization: Optional[str] = Field(None, description="Organization name in lowercase")
    doc_type: Optional[str] = Field(None, description="contract | policy | report")
    effective_year: Optional[int] = Field(None, description="Year the document is effective")
    tags: Optional[List[str]] = Field(None, description="List of topic tags")

extractor = MetadataExtractor(llm, schema_model=MySchema)

extractor.extract(text, stored_values)

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Document text (first 10,000 chars used) |
| stored_values | dict | No | Existing field values from the collection. When provided, the LLM reuses stored names (e.g. "apple inc.") instead of extracting inconsistent variants. Pass rag.get_field_values(fields) or use rag.extract_metadata() which injects this automatically. |

Returns: dict

{
    "company_name": "apple inc.",
    "doc_type": "10-k",
    "fiscal_quarter": None,
    "fiscal_year": [2025]
}
# Basic extraction
metadata = extractor.extract(document_text)

# With grounding — LLM reuses stored entity names
stored = rag.get_field_values(rag.filter_fields)
metadata = extractor.extract(document_text, stored_values=stored)
print(metadata)

MetadataExtractor.from_yaml(llm, yaml_path)

Create an extractor from a YAML file. Builds a Pydantic model dynamically from the field definitions.

| Parameter | Type | Required | Description |
|---|---|---|---|
| llm | Any | Yes | LangChain chat model instance |
| yaml_path | str | Yes | Path to metadata YAML config file |

Returns: MetadataExtractor

extractor = MetadataExtractor.from_yaml(llm, "metadata.yaml")
metadata = extractor.extract(document_text)

See Custom Metadata for the YAML format including type and values field options.


extractor.extract_batch(texts)

| Parameter | Type | Description |
|---|---|---|
| texts | list[str] | List of document texts |

Returns: list[dict]


DocumentMetadata

Pydantic schema for chunk metadata. Useful for type-checking or building typed wrappers.

from ragwire import DocumentMetadata
meta = DocumentMetadata(
    company_name="apple",
    doc_type="10-k",
    fiscal_year=[2025],
    source="/data/Apple_10k_2025.pdf",
    file_name="Apple_10k_2025.pdf",
    file_type="pdf",
    file_hash="abc123...",
    chunk_id="abc123_0",
    chunk_hash="def456...",
    chunk_index=0,
    total_chunks=42,
)
print(meta.model_dump())

See Metadata & Filtering for the full field reference.


Logging

from ragwire import setup_logging, setup_colored_logging

Use setup_logging for plain text logs (production, log files). Use setup_colored_logging during development — color-codes log levels so warnings and errors stand out at a glance.

setup_logging(log_level, log_file, console_output, format_string)

| Parameter | Type | Default | Description |
|---|---|---|---|
| log_level | str | "INFO" | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| log_file | str | None | Optional path to write logs to file |
| console_output | bool | True | Print logs to stdout |
| format_string | str | None | Custom log format string |

Returns: logging.Logger

logger = setup_logging(log_level="DEBUG", log_file="logs/rag.log")
logger.info("Pipeline started")

setup_colored_logging(log_level, log_file)

Same as setup_logging but with colored console output — errors in red, warnings in yellow, info in green. Useful during development to spot issues quickly.

| Parameter | Type | Default | Description |
|---|---|---|---|
| log_level | str | "INFO" | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| log_file | str | None | Optional path to write plain-text logs to file |

Returns: logging.Logger

from ragwire import setup_colored_logging

logger = setup_colored_logging(log_level="DEBUG")
logger.info("Pipeline started")   # green
logger.warning("Slow response")   # yellow
logger.error("LLM call failed")   # red

You can also enable colored logging from config.yaml — no code change needed:

logging:
  level: "INFO"
  colored: true
  console_output: true
  # log_file: "logs/rag.log"   # uncomment to also write to file

Low-level / Advanced API

These APIs are exported for advanced use cases — custom pipelines, direct vector store access, or building on top of RAGWire internals. Most users will not need these directly.


QdrantStore

Direct Qdrant collection management. Use this when you need fine-grained control over the vector store outside of RAGWire.

from ragwire import QdrantStore

QdrantStore(config, embedding, collection_name)

| Parameter | Type | Required | Description |
|---|---|---|---|
| config | dict | Yes | Vectorstore config (url, api_key) |
| embedding | Any | Yes | Embedding model instance |
| collection_name | str | No | Collection name |

Methods

| Method | Returns | Description |
|---|---|---|
| set_collection(name) | None | Set active collection |
| get_store(use_sparse) | QdrantVectorStore | Get LangChain vectorstore instance |
| create_collection(use_sparse) | None | Create a new collection |
| delete_collection() | None | Delete the collection |
| collection_exists() | bool | Check if collection exists |
| file_hash_exists(file_hash) | bool | Check if file already ingested |
| get_collection_info() | CollectionInfo | Get Qdrant collection metadata |
| get_metadata_keys() | list[str] | Scroll one point, return all metadata field names |
| get_field_values(fields, limit) | dict | Unique values per field via Qdrant facet API |
| create_payload_indexes(fields) | None | Create keyword indexes for facet API (auto-called during ingestion) |

store = QdrantStore(
    config={"url": "http://localhost:6333"},
    embedding=embedding,
    collection_name="my_docs",
)
store.create_collection(use_sparse=True)
vectorstore = store.get_store(use_sparse=True)

docs = vectorstore.similarity_search("revenue", k=5)

store.get_metadata_keys()

Scrolls one point from the collection and returns all metadata field names present. Use this when you don't know what fields were stored — e.g. inspecting a collection built by someone else, or verifying custom metadata was extracted correctly.

fields = store.get_metadata_keys()
# → ['company_name', 'doc_type', 'fiscal_year', 'file_name', 'chunk_index', ...]

store.get_field_values(fields, limit)

Returns unique values for each requested field using Qdrant's facet API. Requires payload indexes on those fields — call create_payload_indexes() first if you haven't ingested via RAGWire (which does this automatically).

| Parameter | Type | Default | Description |
|---|---|---|---|
| fields | list[str] | | Field names (without metadata. prefix) |
| limit | int | 50 | Max unique values per field |

Returns: dict[str, list]

# Discover fields first, then get values for the ones you care about
fields = store.get_metadata_keys()
# → ['company_name', 'doc_type', 'fiscal_year', ...]

values = store.get_field_values(["company_name", "doc_type"])
# → {'company_name': ['apple', 'microsoft'], 'doc_type': ['10-k', '10-q']}

# High-cardinality field — raise the limit
values = store.get_field_values(["file_name"], limit=200)

Using RAGWire instead?

If you're using RAGWire, prefer rag.filter_fields + rag.get_field_values() for filter prompts, and rag.discover_metadata_fields() for collection inspection — they are thin wrappers over these same methods and don't require you to manage the QdrantStore instance directly.


Retrieval Functions

Use these when building a custom retrieval layer outside of RAGWire.

from ragwire import get_retriever, hybrid_search, mmr_search

Choosing a search strategy:

| Strategy | Use when |
|---|---|
| similarity | General semantic search; fast, good default |
| hybrid | Queries mix semantic meaning with exact keywords (e.g. ticker symbols, product names, IDs) |
| mmr | You want diverse results — avoids returning 5 nearly identical chunks from the same page |

get_retriever(vectorstore, top_k, search_type)

| Parameter | Type | Default | Description |
|---|---|---|---|
| vectorstore | QdrantVectorStore | | Vector store instance |
| top_k | int | 5 | Number of results |
| search_type | str | "similarity" | "similarity", "mmr", "hybrid" |

Returns: LangChain retriever with .invoke(query) method.

hybrid_search(vectorstore, query, k, filters)

| Parameter | Type | Default | Description |
|---|---|---|---|
| vectorstore | QdrantVectorStore | | Vector store instance |
| query | str | | Search query |
| k | int | 5 | Number of results |
| filters | dict | None | Plain metadata filter dict (same format as rag.retrieve() filters) |

Returns: list[Document]

mmr_search(vectorstore, query, k, fetch_k, lambda_mult, filters)

Maximal Marginal Relevance — retrieves diverse, non-redundant results. Use this when a regular similarity search returns several near-identical chunks from the same section of a document, and you want results spread across different parts.

fetch_k controls how many candidates are retrieved first, then MMR selects the most diverse k from them. A larger fetch_k gives MMR more candidates to choose from. lambda_mult controls the balance: 0.0 = maximise diversity, 1.0 = maximise relevance (same as similarity search), 0.5 = balanced default.

| Parameter | Type | Default | Description |
|---|---|---|---|
| vectorstore | QdrantVectorStore | | Vector store instance |
| query | str | | Search query |
| k | int | 5 | Number of results to return |
| fetch_k | int | 20 | Candidates fetched before MMR selection |
| lambda_mult | float | 0.5 | Diversity (0.0 = max diverse, 1.0 = max relevant) |
| filters | dict | None | Plain metadata filter dict (same format as rag.retrieve() filters) |

Returns: list[Document]

# Balanced — good default
results = mmr_search(vectorstore, "Apple revenue and earnings", k=5)

# More diverse — useful when documents are long and repetitive
results = mmr_search(vectorstore, "Apple revenue and earnings", k=5, lambda_mult=0.3)
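The mechanism described above (fetch candidates, then greedily pick a diverse subset) can be sketched in plain Python. This toy version scores raw dot products on 2-D vectors and only illustrates the lambda_mult trade-off; it is not the library's implementation:

```python
# Greedy MMR selection sketch: each pick balances relevance to the query
# against redundancy with what has already been picked.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mmr_select(query, candidates, k, lambda_mult):
    """Return k candidate indices, trading relevance against diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = dot(query, candidates[i])
            redundancy = max((dot(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # two near-duplicates + one off-topic

print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [0, 1] — pure relevance
print(mmr_select(query, candidates, k=2, lambda_mult=0.0))  # [0, 2] — pure diversity
```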

Hashing Utilities

Used internally by the pipeline for SHA256 deduplication. Exposed for custom ingestion workflows.

Why deduplication matters: Without it, re-running ingestion on the same files doubles the chunks in Qdrant, degrading retrieval quality and wasting storage. RAGWire checks file_hash before ingesting — if a file with the same hash already exists in the collection, the file is skipped entirely.
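The file-level check can be sketched with stdlib hashlib. This mirrors the streamed-hashing idea behind sha256_file_from_path below; ragwire itself checks file_hash against the collection rather than an in-memory set:

```python
# Streamed SHA256 dedup sketch: hash files in fixed-size blocks so large
# files never load fully into memory, then skip hashes already seen.
import hashlib

def file_sha256(path: str, block_size: int = 65536) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

seen_hashes = set()  # ragwire stores file_hash in Qdrant instead of a set

def should_ingest(path: str) -> bool:
    h = file_sha256(path)
    if h in seen_hashes:
        return False  # duplicate — skip
    seen_hashes.add(h)
    return True
```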

from ragwire import sha256_text, sha256_file_from_path, sha256_chunk

| Function | Parameters | Returns | Description |
|---|---|---|---|
| sha256_text(text) | text: str | str | SHA256 of a text string |
| sha256_file_from_path(path) | path: str \| Path | str | SHA256 of a file (streamed, memory-efficient) |
| sha256_chunk(chunk_id, content) | chunk_id: str, content: str | str | SHA256 of a chunk (id + content combined) |

from ragwire import sha256_file_from_path

file_hash = sha256_file_from_path("data/Apple_10k_2025.pdf")
print(file_hash)  # "a1b2c3d4..."

get_logger

Get a child logger under the ragwire namespace. Used internally by all modules.

from ragwire import get_logger

logger = get_logger(__name__)
logger.info("Custom module log")