
Enterprise RAG System

Intelligent Document Retrieval with Multi-Layer Caching

A production-grade RAG architecture that combines Azure AI Search with intelligent caching strategies to deliver accurate document-based answers at scale. Designed for enterprise environments requiring high throughput, low latency, and cost-efficient LLM operations.

The Challenge

Key problems this architecture solves

📚

Enterprise knowledge is scattered across thousands of documents (PDF, DOCX, internal wikis), making information discovery slow and inefficient.

🔍

Traditional keyword search misses semantic context and relationships between concepts, leading to incomplete or irrelevant results.

⚠️

Direct LLM queries without retrieval lead to hallucinations and incorrect information, undermining trust in AI-powered systems.

💰

Repeatedly processing large document sets drives up costs and response times, making enterprise deployments economically infeasible.

System Architecture

End-to-end data flow and components

📥 Document Ingestion
Azure Blob Storage (Document Repository) → Azure Functions (Parsing & Chunking)

🧮 Embedding & Indexing
Azure OpenAI Embeddings (text-embedding-3-large) → Azure AI Search (Vector Database)

🎯 User Query Path
Redis Cache (Semantic Caching ⚡) → Hybrid Search (Vector + BM25) → Re-ranking (Cross-Encoder)

🤖 LLM Processing
GPT-5 (Context + Prompt) → Response Streaming (Real-time Delivery)

System Components

Technical building blocks and Azure services

📁

Document Storage

Azure Blob Storage

Centralized repository for enterprise documents (PDF, DOCX, TXT). Supports versioning and metadata tagging for efficient organization.

⚙️

Document Processing

Azure Functions

Serverless parsing and chunking pipeline. Extracts text, splits it into 1000-token segments with 200-token overlap, and preserves document structure and metadata.
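A minimal sketch of the chunking step, assuming tiktoken for token counting; the function and its parameters are illustrative, not the production pipeline:

```python
# Sliding-window chunker: 1000-token windows that advance 800 tokens,
# giving the 200-token overlap described above.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 1000, overlap: int = 200) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break  # the final window already covers the end of the document
    return chunks
```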

🧮

Vector Embeddings

Azure OpenAI Embeddings

Converts text chunks into high-dimensional vectors capturing semantic meaning. Enables similarity-based retrieval beyond keyword matching.
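A sketch of batch embedding with the openai SDK's AzureOpenAI client; the endpoint, key, and API version are placeholders, and the model string must match the Azure deployment name:

```python
# Embed a batch of text chunks with Azure OpenAI. Endpoint, key, and
# API version are placeholders for illustration.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(
        model="text-embedding-3-large",  # Azure deployment name
        input=texts,
    )
    return [item.embedding for item in resp.data]
```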

🗄️

Hybrid Search Index

Azure AI Search

Combines vector similarity search with BM25 keyword ranking. Provides both semantic understanding and exact term matching for optimal retrieval accuracy.
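A hybrid query sketch with the azure-search-documents SDK; the index name and field names are assumptions about the schema:

```python
# Hybrid retrieval: BM25 over the raw query text plus k-NN over the
# embedding field. Azure AI Search fuses the two rankings server-side
# via Reciprocal Rank Fusion. Index and field names are assumptions.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="enterprise-docs",
    credential=AzureKeyCredential("<api-key>"),
)

def hybrid_search(query: str, query_vector: list[float], k: int = 20) -> list[dict]:
    vector_query = VectorizedQuery(
        vector=query_vector, k_nearest_neighbors=k, fields="content_vector"
    )
    results = search_client.search(
        search_text=query,              # BM25 keyword leg
        vector_queries=[vector_query],  # vector similarity leg
        top=k,
    )
    return [{"content": doc["content"], "source": doc["source"]} for doc in results]
```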

⚡

Semantic Cache

Azure Cache for Redis

Multi-tier caching strategy. Stores exact query matches and semantically similar queries to minimize redundant LLM calls and reduce response latency.
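A simplified sketch of the two cache tiers with redis-py; connection details are placeholders, and a production deployment would use Redis vector search rather than the linear scan shown here:

```python
# Two-tier cache: exact match on a query hash, then a semantic match
# against cached query embeddings (threshold 0.95, TTL 24 hours, as in
# the data flow below). A simplified sketch, not production code.
import hashlib
import json

import numpy as np
import redis

r = redis.Redis(host="<redis-host>", port=6380, ssl=True, password="<access-key>")
SIM_THRESHOLD = 0.95
TTL_SECONDS = 24 * 3600  # entries expire after 24 hours

def _exact_key(query: str) -> str:
    return "exact:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()

def cache_lookup(query: str, query_vector: list[float]) -> str | None:
    # Tier 1: exact match on the normalized query text.
    if (hit := r.get(_exact_key(query))) is not None:
        return hit.decode()
    # Tier 2: cosine similarity against previously cached query embeddings.
    q = np.asarray(query_vector)
    q = q / np.linalg.norm(q)
    for sem_key in r.scan_iter("sem:*"):
        entry = json.loads(r.get(sem_key))
        v = np.asarray(entry["embedding"])
        if float(q @ (v / np.linalg.norm(v))) > SIM_THRESHOLD:
            return entry["answer"]
    return None  # cache miss: fall through to retrieval

def cache_store(query: str, query_vector: list[float], answer: str) -> None:
    key = _exact_key(query)
    r.setex(key, TTL_SECONDS, answer)
    payload = json.dumps({"embedding": query_vector, "answer": answer})
    r.setex("sem:" + key.removeprefix("exact:"), TTL_SECONDS, payload)
```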

🤖

Language Model

Azure OpenAI (GPT-5)

Generates contextual answers using retrieved document chunks. Engineered prompts ensure grounded responses with source citations and minimal hallucination.
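A sketch of grounded, streaming generation, reusing the AzureOpenAI client from the embedding example above; the deployment name and prompt wording are illustrative:

```python
# Grounded answer generation: retrieved chunks are packed into the prompt
# with explicit grounding and citation instructions, and deltas are
# streamed back as they arrive. "gpt-5" stands in for the deployment name.
SYSTEM_PROMPT = (
    "Answer ONLY from the provided context. Cite the source document for "
    "every claim. If the context does not contain the answer, say so."
)

def generate_answer(question: str, chunks: list[dict]):
    context = "\n\n".join(f"[{c['source']}]\n{c['content']}" for c in chunks)
    stream = client.chat.completions.create(
        model="gpt-5",  # Azure deployment name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        stream=True,  # tokens are forwarded to the user as they arrive
    )
    for event in stream:
        if event.choices and event.choices[0].delta.content:
            yield event.choices[0].delta.content
```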

Data Flow: Question to Answer

Step-by-step journey through the system

1
User Query Received
User submits a question through the interface.
"What is our company's remote work policy?"
2
Semantic Cache Check
The system checks the Redis cache for an identical or semantically similar question, using a vector-similarity threshold above 0.95.
Cache Hit: Instant Response
Cache Miss: Continue
3
Hybrid Search Execution
The query is converted to an embedding, then Azure AI Search is queried using both vector similarity and BM25 keyword matching.
20 candidate chunks retrieved
4
Document Retrieval & Re-ranking
The 20 candidates are re-ranked with a cross-encoder model, which scores each query-chunk pair jointly to select the most relevant chunks (see the sketch after this list).
Top 5 chunks selected
5
Context Assembly & Prompt Construction
Selected chunks are combined with an engineered prompt template that includes instructions for grounding and source citation.
~3000 tokens
6
LLM Generation
GPT-5 processes the prompt and context, generates an answer with source references, and streams the response for better UX.
7
Response Caching
The final answer is cached in Redis under both the exact query and its semantic embedding for future reuse.
TTL: 24 hours
8
Answer Delivered
A structured response with the answer text and source document citations is returned to the user.
"According to our Employee Handbook (Section 4.2), remote work is available 2 days per week for eligible positions..."