
RAGWire — System Overview

RAGWire is a production-grade RAG (Retrieval-Augmented Generation) toolkit. It has two primary workflows: Ingestion (storing documents) and Retrieval (finding relevant chunks for a query). Both are orchestrated by the central RAGWire class.


High-Level Architecture

graph TB
    User(["User / Application"])

    subgraph Core ["RAGWire — pipeline.py"]
        direction LR
        Ingest["ingest_documents()"]
        Retrieve["retrieve()"]
    end

    subgraph DocProc ["① Document Processing"]
        direction LR
        Loader["MarkItDownLoader\nFile → Markdown"]
        Splitter["Text Splitter\nMarkdown / Recursive"]
        Hasher["SHA256 Hasher\nDeduplication"]
    end

    subgraph AI ["② Intelligence Layer"]
        direction LR
        Extractor["MetadataExtractor\nLLM → structured JSON"]
        Embedder["Embedding Model\nText → Dense Vector"]
    end

    subgraph StorageLayer ["③ Storage & Retrieval"]
        direction LR
        VectorStore["QdrantStore\nDense + Sparse Vectors"]
        Retriever["Retriever\nSimilarity / MMR / Hybrid"]
    end

    subgraph External ["External Services"]
        direction LR
        LLM["LLM Provider\nOllama · OpenAI · Gemini\nGroq · Anthropic"]
        EmbedProvider["Embedding Provider\nOllama · OpenAI · HuggingFace\nGoogle · FastEmbed"]
        QdrantDB[("Qdrant\nVector DB")]
    end

    User -->|"ingest_documents()"| Ingest
    User -->|"retrieve()"| Retrieve

    Ingest --> DocProc
    Ingest --> AI
    Ingest --> StorageLayer

    Retrieve --> AI
    Retrieve --> StorageLayer

    Extractor <-->|"prompt / response"| LLM
    Embedder <-->|"text / vector"| EmbedProvider
    VectorStore <-->|"upsert / search"| QdrantDB
    Retriever <-->|"vector search"| QdrantDB
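
The Retriever node above lists MMR (maximal marginal relevance) as one of its search types. As a rough illustration of what that mode trades off, here is a minimal pure-Python sketch of the standard greedy MMR selection. It is not RAGWire's implementation; the function names `cosine` and `mmr_select`, the `lambda_mult` parameter, and the sample vectors are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance against redundancy.

    lambda_mult=1.0 reduces to plain similarity ranking; lower values
    penalise candidates that resemble documents already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(index):
            relevance = cosine(query_vec, doc_vecs[index])
            redundancy = max(
                (cosine(doc_vecs[index], doc_vecs[chosen]) for chosen in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical 2-D "embeddings": documents 0 and 1 are near-duplicates.
sample_docs = [[0.9, 0.1], [0.9, 0.12], [0.8, -0.6]]
diverse = mmr_select([1.0, 0.0], sample_docs, k=2, lambda_mult=0.5)
by_similarity = mmr_select([1.0, 0.0], sample_docs, k=2, lambda_mult=1.0)
```

With these sample vectors, the MMR run passes over the near-duplicate document 1 in favour of document 2, while the pure-similarity run keeps both near-duplicates.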

Two Workflows at a Glance

|  | Ingestion | Retrieval |
|---|---|---|
| Input | List of file paths | Natural language query |
| Output | Stats dict (processed, skipped, chunks) | List of Document objects |
| LLM used for | Extracting metadata from document content | Extracting filters from query |
| Qdrant operation | add_documents (upsert) | similarity_search / hybrid search |
| Deduplication | SHA256 file hash checked before ingestion | N/A |
| Caching | _stored_values_cache invalidated after run | _stored_values_cache populated on first call |
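
The Deduplication and Caching rows describe two small patterns that are easier to see in code. The sketch below is a toy stand-in, not RAGWire's actual pipeline: the class `ToyPipeline`, the helper `sha256_of_file`, the `_seen_hashes` set, and the `lookups` counter are hypothetical, and what the real `_stored_values_cache` holds is not stated here, so the cached payload is a placeholder. The stats dict also omits the `chunks` count because the toy does no splitting.

```python
import hashlib


def sha256_of_file(path, block_size=65536):
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


class ToyPipeline:
    """Hypothetical stand-in that mimics the dedup and cache behaviour only."""

    def __init__(self):
        self._seen_hashes = set()         # assumption: stands in for hashes kept in the store
        self._stored_values_cache = None  # None means "not populated yet"
        self.lookups = 0                  # counts simulated (expensive) store lookups

    def ingest_documents(self, paths):
        stats = {"processed": 0, "skipped": 0}
        for path in paths:
            file_hash = sha256_of_file(path)
            if file_hash in self._seen_hashes:
                stats["skipped"] += 1     # identical bytes were ingested before
                continue
            self._seen_hashes.add(file_hash)
            stats["processed"] += 1
        self._stored_values_cache = None  # invalidate the cache after the run
        return stats

    def stored_values(self):
        # Populate lazily on first call; later calls reuse the cached result.
        if self._stored_values_cache is None:
            self.lookups += 1
            self._stored_values_cache = sorted(self._seen_hashes)
        return self._stored_values_cache
```

Hashing file bytes rather than file names means a renamed copy is still skipped, and clearing the cache at the end of ingestion keeps the next retrieval from seeing stale values.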

Configuration-Driven Design

Everything is driven by config.yaml. The Config class loads the YAML, resolves ${ENV_VAR} placeholders, and returns a plain dict. Each component reads its own section:

config.yaml
├── loader       → MarkItDownLoader (file extensions)
├── splitter     → Text splitter (chunk_size, strategy)
├── embeddings   → Embedding factory (provider, model)
├── llm          → LLM factory (provider, model) + MetadataExtractor
├── metadata     → Optional custom metadata YAML path
├── vectorstore  → QdrantStore (url, collection, use_sparse)
├── retriever    → Retriever (search_type, top_k)
└── logging      → Logging setup (level, colored, log_file)
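
To make the `${ENV_VAR}` resolution step concrete, here is a minimal sketch that works on an already-parsed config dict, so it needs only the standard library. It is an illustration under assumptions, not the Config class itself: the function name `resolve_placeholders`, the `api_key` key, the example values, and the choice to raise `KeyError` for an unset variable are all hypothetical.

```python
import os
import re

# Matches ${NAME} where NAME looks like a conventional environment variable.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_placeholders(value, env=None):
    """Return a copy of value with ${NAME} placeholders replaced from env.

    Walks nested dicts and lists; non-string leaves are returned unchanged.
    Raising on a missing variable is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: resolve_placeholders(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_placeholders(item, env) for item in value]
    if isinstance(value, str):

        def substitute(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name!r} is not set")
            return env[name]

        return _PLACEHOLDER.sub(substitute, value)
    return value


# Hypothetical parsed config; section names follow the tree above.
raw_config = {
    "llm": {"provider": "openai", "api_key": "${OPENAI_API_KEY}"},
    "vectorstore": {"url": "${QDRANT_URL}", "collection": "docs", "use_sparse": True},
    "retriever": {"search_type": "hybrid", "top_k": 5},
}
example_env = {"OPENAI_API_KEY": "sk-example", "QDRANT_URL": "http://localhost:6333"}
config = resolve_placeholders(raw_config, example_env)

# Each component would then read only its own section, for example:
retriever_settings = config["retriever"]
```

Passing the environment as a parameter keeps the function easy to test; in normal use it falls back to `os.environ`.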