ContextOS

Document QA with cross-encoder reranking and live eval dashboard

GitHub | Live Demo

Stack: MERN, PostgreSQL, pgvector, NVIDIA NIM APIs, Redis

Overview

ContextOS addresses knowledge fragmentation and retrieval latency in enterprise environments. Standard LLMs hallucinate or lack proprietary context, and basic RAG systems suffer from low precision and high latency. ContextOS combines sub-second hybrid retrieval (dense + sparse) with cross-encoder reranking to deliver accurate, grounded answers from uploaded proprietary documents.

System diagram

Frontend Architecture

Built with React and Tailwind CSS for a modern, responsive design.

Query Flow: Controlled inputs capture the query. Submission triggers an async Axios/Fetch call to the FastAPI backend with optimistic UI updates and loading skeletons.
File Uploads: Drag-and-drop zone using HTML5 APIs and FormData for multipart requests, with polling/SSE to track ingestion status.
Rendering: Markdown parser renders LLM responses with citations linked back to specific chunk IDs. Streaming responses mask latency.

React UI and State Management

Backend Architecture & API

FastAPI is the core of the backend because of its native async support and high performance. I/O-bound work is fully non-blocking.

Redis Caching Flow: Each request maps to a deterministic cache key. A hit returns in under 10 ms; a miss runs retrieval and caches the result with a TTL.
Async Execution: CPU-bound tasks (like BM25 tokenization or parsing PDFs) are offloaded using asyncio.to_thread() to prevent event-loop blocking.

Chunking Strategy: Enforces 500-token chunks with parent-child chunking. Small child chunks are indexed for highly specific retrieval, and the larger parent chunk is passed to the LLM for broader context.
Hybrid Retrieval: PostgreSQL with pgvector uses cosine distance (the <=> operator) for semantic matching, while BM25 provides sparse scoring for exact keyword matches. Reciprocal Rank Fusion (RRF) combines the two result lists by rank rather than raw score, so the dense and sparse scores do not need to be normalized against each other.
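
As a rough sketch of the fusion step, RRF gives each chunk a score of 1 / (k + rank) in every list where it appears and sums those contributions. The function name, the chunk IDs, and k = 60 (a commonly used default) are assumptions for illustration, not values taken from the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Only the position matters; raw similarity scores are never used.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["c1", "c2", "c3"]   # hypothetical pgvector results, best first
sparse = ["c3", "c1", "c4"]  # hypothetical BM25 results, best first
fused = reciprocal_rank_fusion([dense, sparse])

if __name__ == "__main__":
    for chunk_id, score in fused:
        print(chunk_id, round(score, 5))
```

Chunks that appear in both lists (c1 and c3 here) rise above chunks found by only one retriever, which is the behavior hybrid retrieval relies on.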

Cross-Encoder Reranking:

Vector search is fast enough for first-stage retrieval, but it embeds the query and each document independently, so it can miss fine-grained query-document interactions.
The top 10 RRF results are passed to NVIDIA's /v1/reranking endpoint. A cross-encoder evaluates the query and each document together for deeper semantic matching, and the top 3 are kept.
This second stage improved retrieval precision by 30% and reduces the risk of hallucinations by grounding the LLM in more relevant context.
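
The shape of this second stage can be sketched with a pluggable scoring function. The snippet below does not call NVIDIA's endpoint: overlap_score is a toy stand-in for a cross-encoder, and rerank, score_pair, first_stage_k, and final_k are hypothetical names used only to show the top-10-in, top-3-out flow.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    first_stage_k: int = 10,
    final_k: int = 3,
) -> list[str]:
    """Score each (query, document) pair jointly and keep the best few."""
    shortlist = candidates[:first_stage_k]
    scored = [(score_pair(query, doc), i, doc) for i, doc in enumerate(shortlist)]
    # Highest score first; ties keep the first-stage order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for _, _, doc in scored[:final_k]]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms found in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)


if __name__ == "__main__":
    docs = [
        "redis cache ttl",
        "pgvector cosine distance",
        "cosine distance in pgvector index",
        "bm25 sparse",
    ]
    print(rerank("pgvector cosine distance", docs, overlap_score))
```

In a real deployment, score_pair would be replaced by a batched call to the reranking service, since scoring pairs one at a time over the network would add latency.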

Database & Infrastructure

I initially considered Mongo + FAISS, but managing two separate infrastructures led to synchronization nightmares. Moving to PostgreSQL + pgvector unified metadata and vector storage.

Schema: Uses the native UUID type for keys, which is stored in 16 bytes and is more compact to index than text IDs, and ON DELETE CASCADE so that deleting a document automatically deletes its chunks.
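
The cascade behavior can be demonstrated with a small runnable sketch. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, so UUIDs are stored as text here rather than in a native UUID column, and the table and column names are assumptions rather than the project's actual schema.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE documents (
        id TEXT PRIMARY KEY,
        filename TEXT NOT NULL
    );
    CREATE TABLE chunks (
        id TEXT PRIMARY KEY,
        document_id TEXT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
        content TEXT NOT NULL
    );
    """
)

# Insert one document with three chunks.
doc_id = str(uuid.uuid4())
conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, "handbook.pdf"))
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [(str(uuid.uuid4()), doc_id, f"chunk {i}") for i in range(3)],
)
chunks_before = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]

# Deleting the parent row removes its chunks without a second DELETE.
conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
chunks_after = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]

if __name__ == "__main__":
    print(chunks_before, chunks_after)
```

Pushing the cleanup into the schema means application code cannot forget to delete orphaned chunks when a document is removed.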

Performance: Achieved sub-1.5s latency through parallel execution of BM25 and Dense Embeddings, combined with asyncpg connection pooling.
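
Running the two retrievers in parallel might look like the sketch below. dense_search and sparse_search are hypothetical placeholders that simulate I/O with asyncio.sleep; in the real system they would be database queries issued through the connection pool.

```python
import asyncio


async def dense_search(query: str) -> list[str]:
    """Placeholder for the embedding + pgvector lookup (simulated I/O wait)."""
    await asyncio.sleep(0.05)
    return ["c1", "c2"]


async def sparse_search(query: str) -> list[str]:
    """Placeholder for the BM25 lookup (simulated I/O wait)."""
    await asyncio.sleep(0.05)
    return ["c2", "c3"]


async def hybrid_candidates(query: str) -> dict[str, list[str]]:
    # Both coroutines are awaited concurrently, so the total wait is roughly
    # the slower of the two rather than their sum.
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    return {"dense": dense, "sparse": sparse}


if __name__ == "__main__":
    print(asyncio.run(hybrid_candidates("what is RRF?")))
```

asyncio.gather returns results in the order the awaitables were passed, so the dense and sparse lists can be unpacked positionally regardless of which finishes first.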

Deployment: Vercel (Frontend), Render (Backend), Supabase (PostgreSQL)

Demo

What I'd Improve

Redesign the async ingestion worker architecture around a task queue such as Celery, backed by a broker such as RabbitMQ, to support queueing 100,000+ documents.
Implement event-based Redis cache invalidation (e.g., automatically wiping specific cache keys when a document is deleted).
Implement IP-based rate limiting using Redis and FastAPI middleware (e.g., slowapi) to prevent DoS attacks on expensive embedding endpoints.
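
As an illustration of the rate-limiting idea, here is a minimal fixed-window counter keyed by client IP. It is a sketch only: it keeps counts in a local dict instead of Redis, does not use slowapi, and the class name, limit, and window length are assumptions.

```python
import time
from typing import Optional


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_ip, window index) -> request count. Old windows are never
        # purged here; with Redis, a key expiry would handle that cleanup.
        self._counters: dict[tuple[str, int], int] = {}

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // self.window_seconds)
        key = (client_ip, window)
        count = self._counters.get(key, 0) + 1
        self._counters[key] = count
        return count <= self.limit


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit=2, window_seconds=60)
    print([limiter.allow("203.0.113.7", now=0.0) for _ in range(3)])
```

A middleware would call allow() with the request's client address and respond with HTTP 429 when it returns False. A shared store such as Redis is needed once the backend runs more than one process, because a local dict is not visible across workers.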


RAG pipelines and multi-agent systems engineer. I approach AI through a builder's lens — interested in both the architecture and the outcome.

Version

2024 © Edition
