RAGWire with HuggingFace

Use HuggingFace sentence-transformers for local embeddings — no API key, runs on CPU or GPU.

RAGWire does not support HuggingFace as a chat LLM provider. Pair its embeddings with Ollama, OpenAI, Groq, or Anthropic for the metadata-extraction LLM.

Prerequisites

  • RAGWire installed: pip install "ragwire[huggingface]"
  • Qdrant running: docker run -d -p 6333:6333 qdrant/qdrant

1. Install Dependencies

# HuggingFace for embeddings + Ollama for LLM (fully local, no cost)
pip install "ragwire[huggingface]" "ragwire[ollama]"

# Or with OpenAI for LLM
pip install "ragwire[huggingface]" "ragwire[openai]"

pip install fastembed               # For hybrid search

2. Configuration

HuggingFace Embeddings + Ollama LLM (fully local)

embeddings:
  provider: "huggingface"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"   # 384-dim, fast
  # model_name: "BAAI/bge-large-en-v1.5"                 # 1024-dim, higher quality
  model_kwargs:
    device: "cpu"     # "cuda" for GPU, "mps" for Apple Silicon
  encode_kwargs:
    normalize_embeddings: true

llm:
  provider: "ollama"
  model: "qwen3:8b"
  base_url: "http://localhost:11434"
  num_ctx: 16384

vectorstore:
  url: "http://localhost:6333"
  collection_name: "my_docs"
  use_sparse: true
  force_recreate: false

retriever:
  search_type: "hybrid"
  top_k: 5
  auto_filter: false   # set true to enable LLM-based filter extraction from every query
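
With normalize_embeddings: true the model returns unit-length vectors, so the plain dot product the vector store computes equals cosine similarity. A toy illustration with hand-made 2-D vectors (not real embeddings):

```python
import math

def normalize(vec):
    """Scale a vector to unit length, as normalize_embeddings: true does."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)
# For unit vectors the dot product equals cosine similarity,
# so the store can use the cheaper metric without changing rankings.
print(round(dot(ua, ub), 6), round(cosine(a, b), 6))  # → 0.96 0.96
```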

HuggingFace Embeddings + OpenAI LLM

embeddings:
  provider: "huggingface"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  model_kwargs:
    device: "cpu"

llm:
  provider: "openai"
  model: "gpt-4o-mini"

vectorstore:
  url: "http://localhost:6333"
  collection_name: "my_docs"
  use_sparse: true
  force_recreate: false

retriever:
  search_type: "hybrid"
  top_k: 5
  auto_filter: false   # set true to enable LLM-based filter extraction from every query

3. Python Usage

from ragwire import RAGWire

rag = RAGWire("config.yaml")

# Ingest
stats = rag.ingest_documents(["data/Apple_10k_2025.pdf"])
print(f"Chunks created: {stats['chunks_created']}")

# Retrieve
results = rag.retrieve("What is Apple's total revenue?", top_k=5)
for doc in results:
    print(doc.metadata.get("company_name"), doc.page_content[:200])
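
To feed the retrieved chunks into your own prompt, they can be joined into a single context string under a size budget. A minimal sketch; the build_context helper and the stand-in Document class are illustrative, not part of RAGWire's API:

```python
# Stand-in mirroring the page_content / metadata fields used above;
# the real objects come from rag.retrieve().
class Document:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

def build_context(docs, max_chars=2000):
    """Join retrieved chunks into one prompt context, labeling each
    with its source metadata and stopping at a rough size budget."""
    parts, used = [], 0
    for i, doc in enumerate(docs, 1):
        entry = f"[{i}] ({doc.metadata.get('company_name', 'unknown')}) {doc.page_content}"
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)

docs = [Document("Total revenue was $391B.", {"company_name": "Apple"})]
print(build_context(docs))  # → [1] (Apple) Total revenue was $391B.
```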

4. Run the Example

python examples/basic_usage.py

Model Options

Model                                      Dimensions  Notes
sentence-transformers/all-MiniLM-L6-v2     384         Fast, lightweight, good general purpose
sentence-transformers/all-mpnet-base-v2    768         Better quality, still fast
BAAI/bge-large-en-v1.5                     1024        High quality, larger model
BAAI/bge-m3                                1024        Multilingual, very strong
intfloat/e5-large-v2                       1024        Strong retrieval performance
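
Because the Qdrant collection must be created with the matching vector size, a small lookup can catch a dimension mismatch before ingestion. The values come from the table above; collection_dim is a hypothetical helper, not a RAGWire function:

```python
# Embedding dimensions from the table above; the vector collection
# must be created with the matching size (see the force_recreate note).
MODEL_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "BAAI/bge-large-en-v1.5": 1024,
    "BAAI/bge-m3": 1024,
    "intfloat/e5-large-v2": 1024,
}

def collection_dim(model_name):
    """Return the vector size for a known model, or fail loudly."""
    try:
        return MODEL_DIMS[model_name]
    except KeyError:
        raise ValueError(f"Unknown model {model_name!r}; check its card on the Hub")

print(collection_dim("sentence-transformers/all-MiniLM-L6-v2"))  # → 384
```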

GPU Acceleration

embeddings:
  provider: "huggingface"
  model_name: "BAAI/bge-large-en-v1.5"
  model_kwargs:
    device: "cuda"     # NVIDIA GPU
    # device: "mps"    # Apple Silicon
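
If the config is generated programmatically, the device string can be chosen at runtime instead of hardcoded. A sketch that uses PyTorch's CUDA/MPS availability checks when torch is installed and falls back to CPU otherwise (pick_device is an illustrative helper, not part of RAGWire):

```python
import importlib.util

def pick_device():
    """Choose a value for model_kwargs.device: prefer CUDA, then
    Apple-Silicon MPS, falling back to CPU. Imports torch only
    when it is actually installed."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```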

Notes

  • Models are downloaded from HuggingFace Hub on first use and cached locally (~/.cache/huggingface/).
  • If you change the model after ingestion, set force_recreate: true once to rebuild the collection (dimensions may differ).