Stack: MERN, PostgreSQL, pgvector, NVIDIA NIM APIs, Redis
Overview
ContextOS solves the problem of "knowledge fragmentation and retrieval latency" in enterprise environments. Standard LLMs hallucinate or lack proprietary context. Basic RAG systems suffer from low precision and high latency. ContextOS provides a highly optimized, sub-second hybrid retrieval system (Dense + Sparse) with cross-encoder reranking to deliver accurate, grounded answers from uploaded proprietary documents.

Figure: System diagram
Frontend Architecture
Built with React and Tailwind CSS for a modern, responsive design.
- Query Flow: Controlled inputs capture the query. Submission triggers an async Axios/Fetch call to the FastAPI backend with optimistic UI updates and loading skeletons.
- File Uploads: Drag-and-drop zone using HTML5 APIs and FormData for multipart requests, with polling/SSE to track ingestion status.
- Rendering: Markdown parser renders LLM responses with citations linked back to specific chunk IDs. Streaming responses mask latency.

Figure: React UI and State Management
Backend Architecture & API
FastAPI serves as the backend core because of its native async support and high performance. I/O-bound work (database queries, cache lookups, external API calls) runs without blocking the event loop.
- Redis Caching Flow: Each query maps to a deterministic cache key. A hit returns in under 10 ms; a miss runs retrieval and caches the result with a TTL.
- Async Execution: CPU-bound tasks (like BM25 tokenization or parsing PDFs) are offloaded using asyncio.to_thread() to prevent event-loop blocking.
- Chunking Strategy: Parent-child chunking with enforced 500-token chunks. Small child chunks are indexed for highly specific retrieval, while the larger parent chunk is passed to the LLM for broader context.
- Hybrid Retrieval: PostgreSQL with pgvector ranks chunks by cosine distance (the <=> operator) for semantic similarity, while BM25 provides sparse scoring for exact keyword matches. Reciprocal Rank Fusion (RRF) then merges the two result lists by rank position, so the dense and sparse scores never need to share a scale.
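To make the fusion step concrete, here is a minimal sketch of RRF over two ranked lists of chunk IDs. The function name, the sample IDs, and the constant k=60 (the value commonly used for RRF) are assumptions for illustration, not details taken from the ContextOS codebase.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of chunk IDs into one list.

    A chunk at 1-based rank r in a list contributes 1 / (k + r) to its
    fused score. k=60 is an assumed default, not a value from the project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the dense (pgvector) and sparse (BM25) stages.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Chunks that appear in both lists accumulate two contributions, so they rise above chunks found by only one retriever, regardless of how the raw distance and BM25 scores are scaled.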

Cross-Encoder Reranking:
- Vector search works well as a fast first stage, but it embeds the query and each document independently, so it can miss finer-grained relevance between them.
- The top 10 RRF results are passed to NVIDIA's /v1/reranking endpoint. Cross-encoders evaluate the query and document together for deeper semantic matching, narrowing the list to the best 3.
- This second stage improved retrieval precision by 30% and reduces hallucinations by handing the LLM better-grounded context.
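The shape of this second stage can be sketched as a function that scores each (query, passage) pair jointly and keeps the best few. This sketch does not call the NVIDIA endpoint: `score_fn` is a hypothetical hook where a real cross-encoder call would go, and `overlap_score` is a toy word-overlap stand-in used only so the example runs on its own.

```python
from typing import Callable, List, Sequence


def rerank_top_n(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    n: int = 3,
) -> List[str]:
    """Score every (query, passage) pair and return the n best passages."""
    scored = [(score_fn(query, text), idx) for idx, text in enumerate(candidates)]
    # Highest score first; ties keep the first-stage (RRF) order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidates[idx] for _, idx in scored[:n]]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of query words in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


# Hypothetical first-stage candidates, already ordered by RRF.
candidates = [
    "refund policy for enterprise plans",
    "office hours and holidays",
    "enterprise refund requests are processed in 30 days",
    "how to reset a password",
]
best = rerank_top_n("enterprise refund policy", candidates, overlap_score, n=2)
```

Keeping the scorer injectable means the first-stage code stays unchanged if the reranking model or provider is swapped later.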
Database & Infrastructure
I initially considered MongoDB + FAISS, but keeping two separate stores in sync was a recurring source of problems. Moving to PostgreSQL + pgvector unified metadata and vector storage in a single database.
- Schema: Uses UUID native types for ultra-fast indexing and ON DELETE CASCADE to ensure deleting a document automatically wipes its chunks.
- Performance: Achieved sub-1.5s latency through parallel execution of BM25 and Dense Embeddings, combined with asyncpg connection pooling.
- Deployment: Vercel (Frontend), Render (Backend), Supabase (PostgreSQL)
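The parallel execution of the sparse and dense stages can be sketched with asyncio.gather, with the CPU-bound BM25 step pushed to a worker thread via asyncio.to_thread. Both search functions below are hypothetical stand-ins that sleep to simulate work; in the real system the dense stage would be an embedding call plus a pgvector query over an asyncpg pool.

```python
import asyncio
import time


def bm25_search(query: str) -> list[str]:
    """Stand-in for CPU-bound sparse scoring (assumed, simulated with sleep)."""
    time.sleep(0.2)
    return ["c1", "c9", "c3"]


async def dense_search(query: str) -> list[str]:
    """Stand-in for the embedding call and vector query (assumed)."""
    await asyncio.sleep(0.2)
    return ["c3", "c1", "c7"]


async def retrieve(query: str) -> tuple[list[str], list[str]]:
    # Run both stages at once: the coroutine stays on the event loop while
    # the blocking BM25 function runs in a worker thread.
    dense, sparse = await asyncio.gather(
        dense_search(query),
        asyncio.to_thread(bm25_search, query),
    )
    return dense, sparse


def timed_retrieve(query: str):
    """Run one retrieval and report the wall-clock time it took."""
    start = time.perf_counter()
    dense, sparse = asyncio.run(retrieve(query))
    return dense, sparse, time.perf_counter() - start


dense, sparse, elapsed = timed_retrieve("enterprise refund policy")
```

Because the two stages overlap, total latency tracks the slower stage rather than the sum of both, which is where the latency saving comes from.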
What I'd Improve
- Redesign the async ingestion worker architecture around a task queue (e.g., Celery with a RabbitMQ broker) to support queueing 100,000+ documents concurrently.
- Implement event-based Redis cache invalidation (e.g., automatically wiping specific cache keys when a document is deleted).
- Implement IP-based rate limiting using Redis and FastAPI middleware (e.g., slowapi) to prevent DoS attacks on expensive embedding endpoints.
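As a rough sketch of the cache invalidation idea, the class below pairs deterministic cache keys with an index from document ID to the keys that depend on it, so deleting a document can evict exactly those entries. It uses plain dictionaries as an in-memory stand-in for Redis, and every name here (`QueryCache`, `make_key`, the `ctx:` prefix) is hypothetical.

```python
import hashlib
import json


class QueryCache:
    """In-memory stand-in for Redis: cached answers plus a doc -> keys index."""

    def __init__(self):
        self._values = {}       # cache key -> cached answer
        self._keys_by_doc = {}  # document ID -> set of dependent cache keys

    @staticmethod
    def make_key(query: str, top_k: int) -> str:
        # Normalize case and whitespace so equivalent queries share a key.
        normalized = " ".join(query.lower().split())
        payload = json.dumps({"q": normalized, "k": top_k}, sort_keys=True)
        return "ctx:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def put(self, query, top_k, answer, source_doc_ids):
        key = self.make_key(query, top_k)
        self._values[key] = answer
        # Record which documents this answer was built from.
        for doc_id in source_doc_ids:
            self._keys_by_doc.setdefault(doc_id, set()).add(key)

    def get(self, query, top_k):
        return self._values.get(self.make_key(query, top_k))

    def on_document_deleted(self, doc_id) -> int:
        """Evict every cached answer that cited the deleted document."""
        keys = self._keys_by_doc.pop(doc_id, set())
        for key in keys:
            self._values.pop(key, None)
        return len(keys)


cache = QueryCache()
cache.put("What is  RRF?", 3, "answer A", ["doc-1", "doc-2"])
cache.put("pricing", 3, "answer B", ["doc-2"])
evicted = cache.on_document_deleted("doc-1")
```

In a Redis-backed version, the per-document index could live in Redis sets alongside the cached values, with the existing TTL still acting as a safety net for anything the event handler misses.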