RAGWire — System Overview

RAGWire is a production-grade RAG (Retrieval-Augmented Generation) toolkit. It has two primary workflows: Ingestion (storing documents) and Retrieval (finding relevant chunks for a query). Both are orchestrated by the central RAGWire class.

High-Level Architecture

graph TB
    User(["User / Application"])

    subgraph Core ["RAGWire — pipeline.py"]
        direction LR
        Ingest["ingest_documents()"]
        Retrieve["retrieve()"]
    end

    subgraph DocProc ["① Document Processing"]
        direction LR
        Loader["MarkItDownLoader\nFile → Markdown"]
        Splitter["Text Splitter\nMarkdown / Recursive"]
        Hasher["SHA256 Hasher\nDeduplication"]
    end

    subgraph AI ["② Intelligence Layer"]
        direction LR
        Extractor["MetadataExtractor\nLLM → structured JSON"]
        Embedder["Embedding Model\nText → Dense Vector"]
    end

    subgraph StorageLayer ["③ Storage & Retrieval"]
        direction LR
        VectorStore["QdrantStore\nDense + Sparse Vectors"]
        Retriever["Retriever\nSimilarity / MMR / Hybrid"]
    end

    subgraph External ["External Services"]
        direction LR
        LLM["LLM Provider\nOllama · OpenAI · Gemini\nGroq · Anthropic"]
        EmbedProvider["Embedding Provider\nOllama · OpenAI · HuggingFace\nGoogle · FastEmbed"]
        QdrantDB[("Qdrant\nVector DB")]
    end

    User -->|"ingest_documents()"| Ingest
    User -->|"retrieve()"| Retrieve

    Ingest --> DocProc
    Ingest --> AI
    Ingest --> StorageLayer

    Retrieve --> AI
    Retrieve --> StorageLayer

    Extractor <-->|"prompt / response"| LLM
    Embedder <-->|"text / vector"| EmbedProvider
    VectorStore <-->|"upsert / search"| QdrantDB
    Retriever <-->|"vector search"| QdrantDB
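
The SHA256 Hasher node in the diagram is what lets ingestion skip files it has already seen. The following is a minimal sketch of that idea using only the Python standard library. The helper names `sha256_of_file` and `filter_new_files` and the in-memory `seen_hashes` set are hypothetical; they are not taken from RAGWire's code, which may store and look up hashes differently.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, block_size=65536):
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def filter_new_files(paths, seen_hashes):
    """Split paths into (new, skipped) according to their content hash.

    `seen_hashes` is an assumed in-memory set standing in for whatever
    record of ingested files the real pipeline consults.
    """
    new, skipped = [], []
    for path in paths:
        file_hash = sha256_of_file(path)
        if file_hash in seen_hashes:
            skipped.append(path)      # identical content was ingested before
        else:
            seen_hashes.add(file_hash)
            new.append(path)
    return new, skipped
```

Because the hash is computed over file contents, a renamed copy of an already ingested file is also skipped in this sketch.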

Two Workflows at a Glance

| | Ingestion | Retrieval |
|---|---|---|
| Input | List of file paths | Natural language query |
| Output | Stats dict (processed, skipped, chunks) | List of `Document` objects |
| LLM used for | Extracting metadata from document content | Extracting filters from query |
| Qdrant operation | `add_documents` (upsert) | `similarity_search` / hybrid search |
| Deduplication | SHA256 file hash checked before ingestion | n/a |
| Caching | `_stored_values_cache` invalidated after run | `_stored_values_cache` populated on first call |
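
The Caching row describes a lazy cache: the first retrieval populates `_stored_values_cache`, and an ingestion run invalidates it so new values are picked up. The toy class below sketches only that lifecycle. What the cache actually holds is an assumption here (distinct metadata values already in the collection), the keyword match stands in for the LLM-based filter extraction, and `StoredValuesCacheDemo`, `store_reads`, and the stats key names are hypothetical.

```python
class StoredValuesCacheDemo:
    """Toy stand-in for the pipeline; only the cache lifecycle is modelled."""

    def __init__(self, store):
        self._store = store                # assumed: list of metadata dicts
        self._stored_values_cache = None   # None means "not populated yet"
        self.store_reads = 0               # counts simulated reads of the store

    def _get_stored_values(self):
        # Populate lazily on first use; later calls reuse the cached result.
        if self._stored_values_cache is None:
            self.store_reads += 1
            values = {}
            for meta in self._store:
                for key, value in meta.items():
                    values.setdefault(key, set()).add(value)
            self._stored_values_cache = values
        return self._stored_values_cache

    def ingest_documents(self, metadata_rows):
        self._store.extend(metadata_rows)
        # Invalidate so the next retrieval sees the newly stored values.
        self._stored_values_cache = None
        return {"processed": len(metadata_rows), "skipped": 0,
                "chunks": len(metadata_rows)}

    def retrieve(self, query):
        # Keyword matching replaces the LLM step: keep any stored value
        # that appears as a word in the query.
        words = set(query.lower().split())
        filters = {}
        for key, values in self._get_stored_values().items():
            for value in sorted(values):
                if value.lower() in words:
                    filters[key] = value
        return filters
```

In this sketch, repeated `retrieve()` calls between ingestion runs read the store only once, which is the point of invalidating on write rather than rebuilding on every query.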

Configuration-Driven Design

All component settings come from `config.yaml`. The `Config` class loads the YAML, resolves `${ENV_VAR}` placeholders, and returns a plain dict. Each component reads its own section:

config.yaml
├── loader       → MarkItDownLoader (file extensions)
├── splitter     → Text splitter (chunk_size, strategy)
├── embeddings   → Embedding factory (provider, model)
├── llm          → LLM factory (provider, model) + MetadataExtractor
├── metadata     → Optional custom metadata YAML path
├── vectorstore  → QdrantStore (url, collection, use_sparse)
├── retriever    → Retriever (search_type, top_k)
└── logging      → Logging setup (level, colored, log_file)
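
To make the `${ENV_VAR}` resolution step concrete, here is a small sketch of how placeholders could be substituted in an already parsed configuration dict. It is not RAGWire's `Config` implementation: `resolve_env` is a hypothetical helper, raising `KeyError` on an unset variable is an assumed policy, and the values in `raw_config` are made up (only the section and key names come from the tree above).

```python
import os
import re

# Matches ${NAME}, where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value):
    """Recursively replace ${NAME} in strings with os.environ["NAME"].

    Dicts and lists are rebuilt, so the input is left unmodified.
    """
    if isinstance(value, dict):
        return {key: resolve_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item) for item in value]
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in os.environ:
                # Assumed policy: fail loudly instead of leaving the placeholder.
                raise KeyError(f"environment variable {name} is not set")
            return os.environ[name]
        return _PLACEHOLDER.sub(substitute, value)
    return value  # numbers, booleans, None pass through unchanged


# A dict as it might look after YAML parsing; all values are illustrative.
raw_config = {
    "embeddings": {"provider": "ollama", "model": "example-embedding-model"},
    "vectorstore": {"url": "${QDRANT_URL}", "collection": "docs", "use_sparse": True},
    "retriever": {"search_type": "hybrid", "top_k": 5},
}
```

With `QDRANT_URL` exported in the environment, `resolve_env(raw_config)["vectorstore"]["url"]` yields its value, while non-string settings such as `top_k` are returned as they were.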