API Reference¶
All public APIs are importable directly from ragwire:
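For example (all names as documented on this page; pick the ones you need):

```python
from ragwire import (
    RAGWire,
    MarkItDownLoader,
    MetadataExtractor,
    get_embedding,
    setup_colored_logging,
)
```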
Core API¶
These are the primary user-facing APIs. Most applications only need these.
RAGWire¶
The main orchestrator. Handles the full pipeline from config loading to ingestion and retrieval.
RAGWire(config_path)¶
Initialize the pipeline from a YAML config file.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `config_path` | `str` | Yes | Path to `config.yaml` |
Raises:
- `FileNotFoundError` — config file not found
- `ValueError` — missing required config keys (e.g. `llm.model`)
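A minimal initialization sketch, assuming a `config.yaml` in the working directory:

```python
from ragwire import RAGWire

# Raises FileNotFoundError if the path is wrong,
# ValueError if required keys (e.g. llm.model) are missing
rag = RAGWire("config.yaml")
```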
rag.ingest_documents(file_paths)¶
Ingest a list of documents into the vector store. Skips files already ingested (SHA256 deduplication).
| Parameter | Type | Required | Description |
|---|---|---|---|
| `file_paths` | `list[str]` | Yes | List of file paths to ingest |
Returns: dict
```python
{
    "total": 3,            # Total files submitted
    "processed": 2,        # Successfully ingested
    "skipped": 1,          # Already in vector store (duplicate)
    "failed": 0,           # Failed to load or process
    "chunks_created": 84,  # Total chunks added to Qdrant
    "errors": []           # List of {"file": ..., "error": ...} dicts
}
```
```python
stats = rag.ingest_documents([
    "data/Apple_10k_2025.pdf",
    "data/Microsoft_10k_2025.pdf",
])
print(f"Processed: {stats['processed']}, Chunks: {stats['chunks_created']}")
```
A progress bar (tqdm) is shown automatically while ingestion runs.
rag.ingest_directory(directory, recursive, extensions)¶
Ingest all supported documents from a directory. Internally calls ingest_documents().
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `directory` | `str` | Yes | — | Path to the directory |
| `recursive` | `bool` | No | `False` | Search subdirectories |
| `extensions` | `list[str]` | No | loader config | File extensions to include |
Returns: dict — same stats dict as ingest_documents()
```python
# Ingest all PDFs/DOCX in a folder
stats = rag.ingest_directory("data/")

# Recursively include subdirectories
stats = rag.ingest_directory("data/", recursive=True)

# Only specific extensions
stats = rag.ingest_directory("data/", extensions=[".pdf"])
```
rag.retrieve(query, top_k, filters)¶
Retrieve the most relevant chunks for a query.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | `str` | Yes | — | Search query |
| `top_k` | `int` | No | config value | Number of results to return |
| `filters` | `dict` | No | `None` | Metadata filters (see Metadata & Filtering) |
Returns: list[Document]
Each Document has:
- `doc.page_content` — the chunk text
- `doc.metadata` — dict with all metadata fields (see Metadata Schema)
Filter behaviour:

- If `filters` is passed → used as-is, no LLM call (always, regardless of `auto_filter` setting)
- If `filters` is not passed and `auto_filter: true` in config → LLM extracts filters from the query
- If `filters` is not passed and `auto_filter: false` (default) → no filtering, pure semantic search
When to use auto-filter vs explicit filters: Use explicit filters in programmatic pipelines where you control the inputs (faster, zero LLM overhead). Enable auto_filter in simple user-facing chatbots. For agents, keep auto_filter: false and use rag.extract_filters(query) to give the agent full control over whether and how to apply filters.
```python
# Explicit filters — LLM extraction skipped
results = rag.retrieve(
    "What is the net income?",
    top_k=5,
    filters={"company_name": "apple", "fiscal_year": 2025}
)

# auto_filter: true in config — LLM extracts {"company_name": "apple", "fiscal_year": 2025}
results = rag.retrieve("What is Apple's net income for 2025?")

# auto_filter: false (default) — pure semantic search, no filter extraction
results = rag.retrieve("What is Apple's net income for 2025?")

for doc in results:
    print(doc.metadata.get("company_name"))
    print(doc.page_content[:300])
```
rag.hybrid_search(query, k, filters)¶
Perform hybrid search combining dense (semantic) and sparse (keyword) vectors. Requires use_sparse: true in config.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | `str` | Yes | — | Search query |
| `k` | `int` | No | `5` | Number of results |
| `filters` | `dict` | No | `None` | Metadata filters |
Returns: list[Document]
Hybrid search requires sparse vectors
hybrid_search() only performs true hybrid (dense + sparse) search when both conditions are met:
- `use_sparse: true` in `config.yaml` — collection must be created with sparse vector support
- `pip install fastembed` — required for sparse encoding
If either is missing, the call silently falls back to dense-only similarity search. There is no error raised.
If your collection was created with use_sparse: false, you must set force_recreate: true and re-ingest to enable hybrid search.
retrieve() vs hybrid_search() — when to use which:

| | `retrieve()` | `hybrid_search()` |
|---|---|---|
| Search type | Whatever is set in `config.yaml` (`similarity`, `mmr`, or `hybrid`) | Always hybrid (dense + sparse), regardless of config |
| Auto-filter | Only when `auto_filter: true` in config (default `false`) | Same — respects `auto_filter` setting |
| `top_k` default | From `config.yaml` | `k=5` parameter |
| Typical use | Primary method for all RAG flows | Override to force hybrid on a single call |
If your config.yaml already has search_type: "hybrid", both methods produce identical results. Use hybrid_search() only when your config is set to similarity or mmr and you want to force hybrid for a specific call.
```python
# Use retrieve() in most cases — honours config search type
results = rag.retrieve("Apple revenue fiscal 2025", top_k=5)

# Use hybrid_search() to force hybrid regardless of config
results = rag.hybrid_search(
    "Apple revenue fiscal 2025",
    k=5,
    filters={"company_name": "apple"}
)
```
rag.extract_filters(query)¶
Extract metadata filters from a natural language query without triggering retrieval. Returns the raw extracted dict so an agent can inspect, adjust, or discard before passing to retrieve().
| Parameter | Type | Required | Description |
|---|---|---|---|
| `query` | `str` | Yes | Natural language query |
Returns: dict of extracted filters, or None if nothing was extracted.
Note
This method always runs regardless of the auto_filter config setting. It gives agents explicit control — call it manually, decide what to do, then pass the result to retrieve(filters=...).
```python
# Agent workflow — full control over filters
filters = rag.extract_filters("muscle building studies from 2023")
# → {"research_focus": "muscle building", "publication_year": 2023}

# Agent validates against stored values
stored = rag.get_field_values(rag.filter_fields)
if filters.get("research_focus") not in stored.get("research_focus", []):
    filters.pop("research_focus")  # drop uncertain filter, rely on semantic search

results = rag.retrieve("muscle building studies from 2023", filters=filters)
```
rag.get_filter_context(query, limit)¶
Build a ready-made markdown prompt block for an agent — contains available metadata fields, their stored values, the filters extracted from the current query, and instructions for the agent on how to act on them. Append or prepend to your agent's task prompt.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | `str` | Yes | — | Natural language query |
| `limit` | `int` | No | `50` | Max stored values to show per field |
Returns: str — formatted markdown block ready to inject into an agent prompt.
```python
filter_context = rag.get_filter_context("muscle building studies from 2023")
agent_prompt = filter_context + "\n\n" + your_task_prompt
```
The returned block looks like:
```markdown
## RAGWire Filter Context

### Available Metadata Fields and Stored Values
- **research_focus**: ["muscle-growth", "endurance", "recovery", ...]
- **publication_year**: [2022, 2023, 2024]
- **authors**: ["john smith", "jane doe", ...]

### Extracted Filters from Query
- **research_focus**: `muscle building`
- **publication_year**: `2023`

### Instructions
1. Review the extracted filters above.
2. If an extracted value does not match or closely relate to any stored value, adjust or drop that filter.
3. If the query has no clear metadata intent, pass an empty dict {} as filters.
4. Pass the final filters dict to the retrieval tool as filters=.
```
Typical agent workflow
Use get_filter_context() to give the agent full situational awareness. The agent can then call rag.retrieve(query, filters=adjusted_filters) with a well-informed decision on which filters to apply.
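That workflow can be sketched as follows (the `my_agent.decide_filters` call is hypothetical — your agent framework supplies that part):

```python
query = "muscle building studies from 2023"
filter_context = rag.get_filter_context(query)

# Hypothetical agent call: the agent reads the context block and
# returns a filters dict, possibly empty, per the instructions in it
adjusted_filters = my_agent.decide_filters(filter_context)

results = rag.retrieve(query, filters=adjusted_filters)
```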
rag.filter_fields¶
Property. Returns the metadata fields used for filtering and auto-filter extraction — the semantic/LLM-extracted fields only. System fields like file_hash, chunk_id, source, chunk_index, created_at are excluded.
Use this when building dynamic filter prompts for an LLM agent. Using discover_metadata_fields() instead would include system fields that have no value as filters.
```python
fields = rag.filter_fields
# Default: ['company_name', 'doc_type', 'fiscal_quarter', 'fiscal_year']
# Custom: whatever fields are defined in your metadata.yaml

values = rag.get_field_values(fields)
# → {'company_name': ['apple', 'microsoft'], 'doc_type': ['10-k'], ...}
```
rag.discover_metadata_fields()¶
Return all metadata field names present in the collection, including system fields. Scrolls one point — fast regardless of collection size.
Use this for collection inspection or debugging. For building filter prompts, use rag.filter_fields instead.
Returns: list[str]
```python
fields = rag.discover_metadata_fields()
print(fields)
# ['company_name', 'doc_type', 'fiscal_year', 'fiscal_quarter',
#  'file_name', 'file_type', 'file_hash', 'chunk_id', 'chunk_index', ...]
```
rag.get_field_values(fields, limit)¶
Return unique values for one or more metadata fields using Qdrant's facet API. Results are ordered by frequency (most common values first).
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `fields` | `str \| list[str]` | Yes | — | Field name or list of field names |
| `limit` | `int` | No | `50` | Max unique values to return per field. Increase for high-cardinality fields (e.g. `file_name`). |
Returns:
- list — if fields is a str
- dict[str, list] — if fields is a list
```python
# Single field — returns a list of up to 50 unique values
rag.get_field_values("company_name")
# → ['apple', 'microsoft', 'google']

# Multiple fields — returns a dict
rag.get_field_values(["company_name", "doc_type"])
# → {'company_name': ['apple', 'microsoft', 'google'], 'doc_type': ['10-k', '10-q']}

# High-cardinality field — raise the limit
rag.get_field_values("file_name", limit=200)
# → ['Apple_10k_2025.pdf', 'Microsoft_10k_2025.pdf', ...]

# Typical agent workflow — use filter_fields, not discover_metadata_fields()
values = rag.get_field_values(rag.filter_fields)
results = rag.retrieve("revenue", filters={"company_name": values["company_name"][0]})
```
rag.extract_metadata(text)¶
Extract structured metadata from text using the configured LLM.
Automatically passes stored collection values so the LLM reuses existing entity names (e.g. "apple inc.") rather than extracting inconsistent variants ("apple", "Apple Inc."). This grounding is applied transparently — you do not need to pass stored values manually.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `text` | `str` | Yes | Document text to extract metadata from (first 10,000 chars used) |
Returns: dict
```python
metadata = rag.extract_metadata(open("report.txt").read())
print(metadata)
# {'company_name': 'apple inc.', 'doc_type': '10-k', 'fiscal_quarter': None, 'fiscal_year': [2025]}
```
rag.get_stats()¶
Get statistics about the current collection.
Returns: dict
```python
{
    "collection_name": "financial_docs",
    "total_documents": 420,  # Total chunks in Qdrant
    "vector_size": 768,      # Embedding dimension
    "indexed": 420           # Number of indexed vectors
}
```

```python
stats = rag.get_stats()
print(f"Collection: {stats['collection_name']}, Chunks: {stats['total_documents']}")
```
Config Reference — llm and embeddings¶
All parameters below are set in config.yaml and read automatically by RAGWire at startup.
llm section¶
Controls the LLM used for metadata extraction (and filter extraction during retrieval).
| Key | Required | Default | Description |
|---|---|---|---|
| `provider` | Yes | — | `ollama`, `openai`, `google`, `groq`, `anthropic` |
| `model` | Yes | — | Model name (e.g. `qwen3.5:9b`, `gpt-4o-mini`) |
| `base_url` | Ollama only | `http://localhost:11434` | Ollama server URL |
| `num_ctx` | Ollama only | LangChain default | Context window size — only set this if you need to override the default |
| `api_key` | Google / Groq / Anthropic | — | API key (or use `${ENV_VAR}` syntax) |
OpenAI
OpenAI reads OPENAI_API_KEY from the environment automatically — no api_key field needed in config.
```yaml
# Ollama
llm:
  provider: "ollama"
  model: "qwen3.5:9b"
  base_url: "http://localhost:11434"
  num_ctx: 16384
```

```yaml
# OpenAI
llm:
  provider: "openai"
  model: "gpt-4o-mini"
```

```yaml
# Google Gemini
llm:
  provider: "google"
  model: "gemini-2.5-flash"
  api_key: "${GOOGLE_API_KEY}"
```

```yaml
# Groq
llm:
  provider: "groq"
  model: "llama-3.3-70b-versatile"
  api_key: "${GROQ_API_KEY}"
```

```yaml
# Anthropic
llm:
  provider: "anthropic"
  model: "claude-haiku-4-5-20251001"
  api_key: "${ANTHROPIC_API_KEY}"
```
embeddings section¶
Controls the embedding model used to encode documents and queries into vectors.
| Key | Required | Default | Description |
|---|---|---|---|
| `provider` | Yes | — | `ollama`, `openai`, `google`, `huggingface`, `fastembed` |
| `model` | Most providers | provider default | Embedding model name |
| `base_url` | Ollama only | `http://localhost:11434` | Ollama server URL |
| `num_ctx` | Ollama only | LangChain default | Context window size — only set this if you need to override the default |
| `api_key` | Google only | — | API key (or use `${ENV_VAR}` syntax) |
| `model_name` | HuggingFace / FastEmbed only | see below | Model identifier (uses `model_name` key, not `model`) |
| `model_kwargs` | HuggingFace only | `{}` | Passed to the HuggingFace model constructor (e.g. `{"device": "cpu"}`) |
| `encode_kwargs` | HuggingFace only | `{}` | Passed to the encode call (e.g. `{"normalize_embeddings": true}`) |
Default models per provider:
| Provider | Default model |
|---|---|
| `ollama` | `nomic-embed-text` |
| `openai` | `text-embedding-3-small` |
| `google` | `models/embedding-001` |
| `huggingface` | `sentence-transformers/all-MiniLM-L6-v2` |
| `fastembed` | `BAAI/bge-small-en-v1.5` |
```yaml
# Ollama
embeddings:
  provider: "ollama"
  model: "nomic-embed-text"
  base_url: "http://localhost:11434"
  num_ctx: 16384
```

```yaml
# OpenAI
embeddings:
  provider: "openai"
  model: "text-embedding-3-small"
```

```yaml
# Google Gemini
embeddings:
  provider: "google"
  model: "models/gemini-embedding-001"
  api_key: "${GOOGLE_API_KEY}"
```

```yaml
# HuggingFace (local)
embeddings:
  provider: "huggingface"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  model_kwargs:
    device: "cpu"
  encode_kwargs:
    normalize_embeddings: true
```

```yaml
# FastEmbed (local, sparse-capable)
embeddings:
  provider: "fastembed"
  model_name: "BAAI/bge-small-en-v1.5"
```
retriever section¶
Controls retrieval behaviour.
| Key | Required | Default | Description |
|---|---|---|---|
| `search_type` | No | `"similarity"` | `"similarity"` \| `"mmr"` \| `"hybrid"` (hybrid requires `use_sparse: true`) |
| `top_k` | No | `5` | Number of results returned by `retrieve()` |
| `auto_filter` | No | `false` | If `true`, LLM automatically extracts metadata filters from every query passed to `retrieve()` / `hybrid_search()`. If `false`, no filter extraction happens unless `filters=` is passed explicitly or `rag.extract_filters()` is called manually. |
```yaml
retriever:
  search_type: "hybrid"
  top_k: 5
  auto_filter: false  # set true to enable automatic filter extraction from queries
```
Agent use case
Keep auto_filter: false when an agent is driving retrieval. Use rag.extract_filters(query) to let the agent inspect and adjust filters before calling retrieve(filters=...).
MarkItDownLoader¶
Converts documents (PDF, DOCX, XLSX, PPTX, TXT, MD) to markdown text.
When to use MarkItDownLoader directly: Use it when you need to convert documents to text before passing them to a custom pipeline, or when you want to inspect/transform the text before ingestion.
MarkItDownLoader.load(file_path)¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `file_path` | `str` | Yes | Path to the document |
Returns: dict
```python
{
    "success": True,
    "text_content": "# Apple Inc.\n\n...",  # Markdown text
    "file_name": "Apple_10k_2025.pdf",
    "file_type": "pdf",
    "error": None  # Error message if success=False
}
```

```python
loader = MarkItDownLoader()
result = loader.load("data/Apple_10k_2025.pdf")

if result["success"]:
    print(f"Loaded {len(result['text_content'])} characters")
else:
    print(f"Error: {result['error']}")
```
loader.load_batch(file_paths)¶
Load multiple documents in one call. Returns results in the same order as the input list.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `file_paths` | `list[str]` | Yes | List of file paths to load |
Returns: list[dict] — same structure as load() for each file.
```python
loader = MarkItDownLoader()
results = loader.load_batch(["doc1.pdf", "doc2.pdf", "doc3.docx"])

for result in results:
    if result["success"]:
        print(f"{result['file_name']}: {len(result['text_content'])} chars")
    else:
        print(f"{result['file_name']}: {result['error']}")
```
loader.load_directory(directory, extensions, recursive)¶
Load all supported documents from a directory.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `directory` | `str` | Yes | — | Path to directory |
| `extensions` | `list[str]` | No | all supported | File extensions to include |
| `recursive` | `bool` | No | `False` | Scan subdirectories |
Returns: list[dict]
```python
loader = MarkItDownLoader()
results = loader.load_directory("data/", extensions=[".pdf", ".docx"], recursive=True)
texts = [r["text_content"] for r in results if r["success"]]
```
Text Splitters¶
All splitters return a RecursiveCharacterTextSplitter instance with a .split_text(text) method.
Choosing a splitter:
- get_markdown_splitter — best for PDF/DOCX/reports (converted to markdown by MarkItDown); respects document structure
- get_splitter — best for plain text, HTML, or any content without markdown headers
- get_code_splitter — best for source code files; splits on class/function boundaries
Chunk size guidance: Larger chunks (8k–12k chars) preserve more context per chunk — good for long-form financial/legal docs. Smaller chunks (500–2k chars) give more precise retrieval — good for FAQ-style content. chunk_overlap prevents context being cut mid-sentence; 20% of chunk size is a sensible default.
get_markdown_splitter(chunk_size, chunk_overlap)¶
Splits on markdown headers first (##, ###, ####), then paragraphs. Best for PDF/DOCX converted via MarkItDown.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | `int` | `1000` | Max characters per chunk |
| `chunk_overlap` | `int` | `200` | Overlap between chunks |
```python
splitter = get_markdown_splitter(chunk_size=10000, chunk_overlap=2000)
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks")
```
get_splitter(chunk_size, chunk_overlap, separators)¶
Generic recursive splitter. Splits on `"\n\n"` → `"\n"` → `" "` → `""`.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `chunk_size` | `int` | `1000` | Max characters per chunk |
| `chunk_overlap` | `int` | `200` | Overlap between chunks |
| `separators` | `list[str]` | `["\n\n", "\n", " ", ""]` | Custom separators |
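No example is shown for this splitter above; a minimal sketch, assuming it is imported from ragwire like the other splitters on this page:

```python
from ragwire import get_splitter

# Custom separators: fall back from blank lines to single newlines to words
splitter = get_splitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(plain_text)
```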
get_code_splitter(chunk_size, chunk_overlap)¶
Splits on code structure: class, def, comments. Best for source code files.
```python
splitter = get_code_splitter(chunk_size=2000, chunk_overlap=200)
chunks = splitter.split_text(source_code)
```
get_embedding¶
Factory function — returns an embedding model instance for the configured provider.
get_embedding(config)¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `config` | `dict` | Yes | Provider config dict with `provider` key |
Supported providers: ollama, openai, huggingface, google, fastembed
Returns: Embedding model with .embed_query(text) and .embed_documents(texts) methods.
```python
# Ollama
embedding = get_embedding({
    "provider": "ollama",
    "model": "nomic-embed-text",
    "base_url": "http://localhost:11434",
})

# OpenAI
embedding = get_embedding({
    "provider": "openai",
    "model": "text-embedding-3-small",
})

# HuggingFace
embedding = get_embedding({
    "provider": "huggingface",
    "model_name": "sentence-transformers/all-MiniLM-L6-v2",
    "model_kwargs": {"device": "cpu"},
})

vector = embedding.embed_query("What is Apple's revenue?")
print(f"Dimension: {len(vector)}")
```
MetadataExtractor¶
Extract structured metadata from document text using an LLM.
MetadataExtractor(llm, schema_model)¶
Uses with_structured_output with a Pydantic model for reliable, type-safe extraction — no manual JSON parsing.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `llm` | `Any` | Yes | LangChain chat model instance |
| `schema_model` | `BaseModel` | No | Pydantic model defining fields and types. Defaults to `FinancialMetadata` |
```python
from ragwire import MetadataExtractor, FinancialMetadata
from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3.5:9b", base_url="http://localhost:11434")

# Default — uses FinancialMetadata schema (company_name, doc_type, fiscal_quarter, fiscal_year)
extractor = MetadataExtractor(llm)

# Custom Pydantic schema
from pydantic import BaseModel, Field
from typing import Optional, List

class MySchema(BaseModel):
    organization: Optional[str] = Field(None, description="Organization name in lowercase")
    doc_type: Optional[str] = Field(None, description="contract | policy | report")
    effective_year: Optional[int] = Field(None, description="Year the document is effective")
    tags: Optional[List[str]] = Field(None, description="List of topic tags")

extractor = MetadataExtractor(llm, schema_model=MySchema)
```
extractor.extract(text, stored_values)¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `text` | `str` | Yes | Document text (first 10,000 chars used) |
| `stored_values` | `dict` | No | Existing field values from the collection. When provided, the LLM reuses stored names (e.g. "apple inc.") instead of extracting inconsistent variants. Pass `rag.get_field_values(fields)` or use `rag.extract_metadata()`, which injects this automatically. |
Returns: dict
```python
# Basic extraction
metadata = extractor.extract(document_text)

# With grounding — LLM reuses stored entity names
stored = rag.get_field_values(rag.filter_fields)
metadata = extractor.extract(document_text, stored_values=stored)
print(metadata)
```
MetadataExtractor.from_yaml(llm, yaml_path)¶
Create an extractor from a YAML file. Builds a Pydantic model dynamically from the field definitions.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `llm` | `Any` | Yes | LangChain chat model instance |
| `yaml_path` | `str` | Yes | Path to metadata YAML config file |
Returns: MetadataExtractor
```python
extractor = MetadataExtractor.from_yaml(llm, "metadata.yaml")
metadata = extractor.extract(document_text)
```
See Custom Metadata for the YAML format including type and values field options.
extractor.extract_batch(texts)¶
| Parameter | Type | Description |
|---|---|---|
| `texts` | `list[str]` | List of document texts |
Returns: list[dict]
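A sketch combining it with MarkItDownLoader.load_batch() from above (reusing the `extractor` built earlier; file names are illustrative):

```python
loader = MarkItDownLoader()
results = loader.load_batch(["doc1.pdf", "doc2.pdf"])
texts = [r["text_content"] for r in results if r["success"]]

# One metadata dict per input text, in order
all_metadata = extractor.extract_batch(texts)
for meta in all_metadata:
    print(meta.get("company_name"))
```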
DocumentMetadata¶
Pydantic schema for chunk metadata. Useful for type-checking or building typed wrappers.
```python
meta = DocumentMetadata(
    company_name="apple",
    doc_type="10-k",
    fiscal_year=[2025],
    source="/data/Apple_10k_2025.pdf",
    file_name="Apple_10k_2025.pdf",
    file_type="pdf",
    file_hash="abc123...",
    chunk_id="abc123_0",
    chunk_hash="def456...",
    chunk_index=0,
    total_chunks=42,
)
print(meta.model_dump())
```
See Metadata & Filtering for the full field reference.
Logging¶
Use setup_logging for plain text logs (production, log files). Use setup_colored_logging during development — color-codes log levels so warnings and errors stand out at a glance.
setup_logging(log_level, log_file, console_output, format_string)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `log_level` | `str` | `"INFO"` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
| `log_file` | `str` | `None` | Optional path to write logs to file |
| `console_output` | `bool` | `True` | Print logs to stdout |
| `format_string` | `str` | `None` | Custom log format string |
Returns: logging.Logger
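A minimal usage sketch, assuming the parameters above (the log path is illustrative):

```python
from ragwire import setup_logging

# Plain-text logs to stdout and a file — suited to production
logger = setup_logging(log_level="INFO", log_file="logs/rag.log")
logger.info("Ingestion started")
```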
setup_colored_logging(log_level, log_file)¶
Same as setup_logging but with colored console output — errors in red, warnings in yellow, info in green. Useful during development to spot issues quickly.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `log_level` | `str` | `"INFO"` | `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
| `log_file` | `str` | `None` | Optional path to write plain-text logs to file |
Returns: logging.Logger
```python
from ragwire import setup_colored_logging

logger = setup_colored_logging(log_level="DEBUG")
logger.info("Pipeline started")   # green
logger.warning("Slow response")   # yellow
logger.error("LLM call failed")   # red
```
You can also enable colored logging from config.yaml — no code change needed:
```yaml
logging:
  level: "INFO"
  colored: true
  console_output: true
  # log_file: "logs/rag.log"  # uncomment to also write to file
```
Low-level / Advanced API¶
These APIs are exported for advanced use cases — custom pipelines, direct vector store access, or building on top of RAGWire internals. Most users will not need these directly.
QdrantStore¶
Direct Qdrant collection management. Use this when you need fine-grained control over the vector store outside of RAGWire.
QdrantStore(config, embedding, collection_name)¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| `config` | `dict` | Yes | Vectorstore config (`url`, `api_key`) |
| `embedding` | `Any` | Yes | Embedding model instance |
| `collection_name` | `str` | No | Collection name |
Methods¶
| Method | Returns | Description |
|---|---|---|
| `set_collection(name)` | `None` | Set active collection |
| `get_store(use_sparse)` | `QdrantVectorStore` | Get LangChain vectorstore instance |
| `create_collection(use_sparse)` | `None` | Create a new collection |
| `delete_collection()` | `None` | Delete the collection |
| `collection_exists()` | `bool` | Check if collection exists |
| `file_hash_exists(file_hash)` | `bool` | Check if file already ingested |
| `get_collection_info()` | `CollectionInfo` | Get Qdrant collection metadata |
| `get_metadata_keys()` | `list[str]` | Scroll one point, return all metadata field names |
| `get_field_values(fields, limit)` | `dict` | Unique values per field via Qdrant facet API |
| `create_payload_indexes(fields)` | `None` | Create keyword indexes for facet API (auto-called during ingestion) |
```python
store = QdrantStore(
    config={"url": "http://localhost:6333"},
    embedding=embedding,
    collection_name="my_docs",
)
store.create_collection(use_sparse=True)
vectorstore = store.get_store(use_sparse=True)
docs = vectorstore.similarity_search("revenue", k=5)
```
store.get_metadata_keys()¶
Scrolls one point from the collection and returns all metadata field names present. Use this when you don't know what fields were stored — e.g. inspecting a collection built by someone else, or verifying custom metadata was extracted correctly.
```python
fields = store.get_metadata_keys()
# → ['company_name', 'doc_type', 'fiscal_year', 'file_name', 'chunk_index', ...]
```
store.get_field_values(fields, limit)¶
Returns unique values for each requested field using Qdrant's facet API. Requires payload indexes on those fields — call create_payload_indexes() first if you haven't ingested via RAGWire (which does this automatically).
| Parameter | Type | Default | Description |
|---|---|---|---|
| `fields` | `list[str]` | — | Field names (without `metadata.` prefix) |
| `limit` | `int` | `50` | Max unique values per field |
Returns: dict[str, list]
```python
# Discover fields first, then get values for the ones you care about
fields = store.get_metadata_keys()
# → ['company_name', 'doc_type', 'fiscal_year', ...]

values = store.get_field_values(["company_name", "doc_type"])
# → {'company_name': ['apple', 'microsoft'], 'doc_type': ['10-k', '10-q']}

# High-cardinality field — raise the limit
values = store.get_field_values(["file_name"], limit=200)
```
Using RAGWire instead?
If you're using RAGWire, prefer rag.filter_fields + rag.get_field_values() for filter prompts, and rag.discover_metadata_fields() for collection inspection — they are thin wrappers over these same methods and don't require you to manage the QdrantStore instance directly.
Retrieval Functions¶
Use these when building a custom retrieval layer outside of RAGWire.
Choosing a search strategy:
| Strategy | Use when |
|---|---|
| `similarity` | General semantic search; fast, good default |
| `hybrid` | Queries mix semantic meaning with exact keywords (e.g. ticker symbols, product names, IDs) |
| `mmr` | You want diverse results — avoids returning 5 nearly identical chunks from the same page |
get_retriever(vectorstore, top_k, search_type)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vectorstore` | `QdrantVectorStore` | — | Vector store instance |
| `top_k` | `int` | `5` | Number of results |
| `search_type` | `str` | `"similarity"` | `"similarity"`, `"mmr"`, `"hybrid"` |
Returns: LangChain retriever with .invoke(query) method.
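A usage sketch, assuming a `vectorstore` obtained from `QdrantStore.get_store()` as shown above:

```python
retriever = get_retriever(vectorstore, top_k=5, search_type="mmr")
docs = retriever.invoke("Apple revenue fiscal 2025")
for doc in docs:
    print(doc.metadata.get("file_name"))
```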
hybrid_search(vectorstore, query, k, filters)¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vectorstore` | `QdrantVectorStore` | — | Vector store instance |
| `query` | `str` | — | Search query |
| `k` | `int` | `5` | Number of results |
| `filters` | `dict` | `None` | Plain metadata filter dict (same format as `rag.retrieve()` filters) |
Returns: list[Document]
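A usage sketch (per the note above, the collection must have been created with `use_sparse: true`, or this falls back to dense-only search):

```python
results = hybrid_search(
    vectorstore,
    "AAPL ticker revenue",  # keyword-heavy query where sparse vectors help
    k=5,
    filters={"company_name": "apple"},
)
```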
mmr_search(vectorstore, query, k, fetch_k, lambda_mult, filters)¶
Maximal Marginal Relevance — retrieves diverse, non-redundant results. Use this when a regular similarity search returns several near-identical chunks from the same section of a document, and you want results spread across different parts.
fetch_k controls how many candidates are retrieved first, then MMR selects the most diverse k from them. A larger fetch_k gives MMR more candidates to choose from. lambda_mult controls the balance: 0.0 = maximise diversity, 1.0 = maximise relevance (same as similarity search), 0.5 = balanced default.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vectorstore` | `QdrantVectorStore` | — | Vector store instance |
| `query` | `str` | — | Search query |
| `k` | `int` | `5` | Number of results to return |
| `fetch_k` | `int` | `20` | Candidates fetched before MMR selection |
| `lambda_mult` | `float` | `0.5` | Diversity (0.0 = max diverse, 1.0 = max relevant) |
| `filters` | `dict` | `None` | Plain metadata filter dict (same format as `rag.retrieve()` filters) |
Returns: list[Document]
```python
# Balanced — good default
results = mmr_search(vectorstore, "Apple revenue and earnings", k=5)

# More diverse — useful when documents are long and repetitive
results = mmr_search(vectorstore, "Apple revenue and earnings", k=5, lambda_mult=0.3)
```
Hashing Utilities¶
Used internally by the pipeline for SHA256 deduplication. Exposed for custom ingestion workflows.
Why deduplication matters: Without it, re-running ingestion on the same files doubles the chunks in Qdrant, degrading retrieval quality and wasting storage. RAGWire checks file_hash before ingesting — if a file with the same hash already exists in the collection, the file is skipped entirely.
| Function | Parameters | Returns | Description |
|---|---|---|---|
| `sha256_text(text)` | `text: str` | `str` | SHA256 of a text string |
| `sha256_file_from_path(path)` | `path: str \| Path` | `str` | SHA256 of a file (streamed, memory-efficient) |
| `sha256_chunk(chunk_id, content)` | `chunk_id: str, content: str` | `str` | SHA256 of a chunk (id + content combined) |
```python
from ragwire import sha256_file_from_path

file_hash = sha256_file_from_path("data/Apple_10k_2025.pdf")
print(file_hash)  # "a1b2c3d4..."
```
get_logger¶
Get a child logger under the ragwire namespace. Used internally by all modules.
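A minimal sketch — assuming get_logger takes a child logger name (the signature is not spelled out on this page):

```python
from ragwire import get_logger

# Hypothetical name argument — yields a logger under the ragwire namespace
logger = get_logger("ingest")
logger.debug("custom pipeline step")
```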