Architecture

AI Agentic Workflow & Retrieval-Augmented Generation (RAG) Architecture

System Flow Diagram

User Prompt → Semantic Cache (Redis) → Document Search (Vector DB)
                                                     ↓
                                           Orchestration Layer
                                                     ↓
                                            LLM (GPT-4 / Claude)
                                                     ↓
                                        Agent Action / API Output

Request Workflow & Logic

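A user sends a prompt to the API, which first checks the Redis cache for a matching earlier query and returns the cached response on a hit. On a miss, the API embeds the query, searches the vector database (Pinecone) for the most relevant document chunks, passes the retrieved text into the LLM context window, and updates the chat history with the response.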
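
The workflow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: `handle_prompt`, `cache_key`, and the injected `embed`, `search`, and `generate` callables are hypothetical names, a plain dict stands in for Redis, and the cache is keyed on a normalized exact match rather than on semantic similarity. Recording cache hits in the chat history is also an assumption.

```python
import hashlib
from typing import Callable, List, MutableMapping, Tuple


def cache_key(prompt: str) -> str:
    """Build a cache key from the prompt, ignoring case and extra whitespace."""
    normalized = " ".join(prompt.lower().split())
    return "rag:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def handle_prompt(
    prompt: str,
    cache: MutableMapping[str, str],                  # stand-in for Redis
    embed: Callable[[str], List[float]],              # hypothetical embedding call
    search: Callable[[List[float], int], List[str]],  # hypothetical vector search
    generate: Callable[[str], str],                   # hypothetical LLM call
    history: List[Tuple[str, str]],
    top_k: int = 3,
) -> str:
    key = cache_key(prompt)
    answer = cache.get(key)
    if answer is None:
        # Cache miss: embed the query, retrieve chunks, and call the LLM.
        vector = embed(prompt)
        chunks = search(vector, top_k)
        context = "\n\n".join(chunks)
        answer = generate(f"Context:\n{context}\n\nQuestion: {prompt}")
        cache[key] = answer
    # Assumption: the chat history records cache hits as well as misses.
    history.append((prompt, answer))
    return answer
```

Passing the embedding, search, and generation steps in as callables keeps the orchestration logic testable without network access; in a deployment they would wrap the real embedding model, vector database client, and LLM client.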

Engineering Considerations

Vector Indexing

Use hybrid search, which combines sparse (keyword-based) and dense (embedding-based) relevance scores, to improve match accuracy over either method used alone.
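
One common way to combine the two signals is a weighted sum of normalized scores. The sketch below assumes each retriever returns a mapping from document ID to raw score; `hybrid_rank`, `min_max`, and the `alpha` weight are illustrative names, and min-max normalization is one choice among several (reciprocal rank fusion is another). Managed vector databases may implement hybrid scoring differently.

```python
from __future__ import annotations


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Fuse scores: alpha weights the dense side, (1 - alpha) the sparse side."""
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = hybrid_rank(
    dense={"a": 0.9, "b": 0.5, "c": 0.1},
    sparse={"b": 12.0, "c": 3.0, "d": 6.0},
)
```

A document missing from one retriever's results contributes zero for that side, so documents found by both retrievers tend to rank higher.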

Prompt Versioning

Decouple prompt templates from code by storing them in a centralized configuration layer.
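
A minimal sketch of versioned prompt lookup, assuming templates are stored as JSON keyed by name and version. The `PROMPT_CONFIG` contents, the `render_prompt` helper, and the template text are hypothetical; the inline JSON string stands in for whatever configuration store is actually used.

```python
import json
from string import Template

# Stand-in for a document fetched from the configuration layer.
PROMPT_CONFIG = json.loads(r"""
{
  "rag_answer": {
    "v1": "Answer using the context.\nContext: $context\nQuestion: $question",
    "v2": "Answer only from the context; say 'unknown' if it is missing.\nContext: $context\nQuestion: $question"
  }
}
""")


def render_prompt(name: str, version: str, **variables: str) -> str:
    """Look up a template by name and version, then fill in its variables."""
    template = Template(PROMPT_CONFIG[name][version])
    # substitute() raises KeyError if a placeholder is left unfilled.
    return template.substitute(**variables)


prompt = render_prompt("rag_answer", "v2", context="RAG adds retrieval.", question="What is RAG?")
```

Because the version is an explicit argument, a new template can be rolled out or rolled back by changing configuration rather than redeploying code.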

Context Cost

Filter out duplicate retrieved documents before building the context to reduce the token count and lower API costs.
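
A minimal sketch of that filter, assuming retrieved chunks arrive as plain strings and that two chunks count as duplicates when they match after lowercasing and whitespace normalization. `dedupe_chunks` is an illustrative name; near-duplicate detection would need a fuzzier comparison.

```python
from __future__ import annotations

import hashlib


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop repeated chunks, keeping the first occurrence and original order."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        # Normalize case and whitespace so trivially different copies collide.
        normalized = " ".join(chunk.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


context_chunks = dedupe_chunks(["Hello  world", "hello world", "Other"])
```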

Recommended Infrastructure Stack

Service              Purpose / Role
Pinecone / pgvector  Stores document vector embeddings for fast search lookup.
Redis Cloud          Caches user chat history and semantic queries.
AWS ECS Fargate      Hosts the FastAPI backend that manages RAG orchestration.
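
To illustrate what caching semantic queries involves, the sketch below returns a stored answer when a new query embedding is close enough to a cached one. It is an in-memory stand-in, not a Redis integration: the `SemanticCache` class, the linear scan, and the 0.95 cosine threshold are all assumptions chosen for illustration.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], answer: str) -> None:
        self.entries.append((list(embedding), answer))

    def get(self, embedding: list[float]) -> str | None:
        """Return the answer of the most similar entry at or above the threshold."""
        best_answer, best_score = None, self.threshold
        for vector, answer in self.entries:
            score = cosine(embedding, vector)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer
```

The threshold trades hit rate against the risk of returning an answer to a different question, so it is worth tuning against real query pairs.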

Security Isolation Policy

Sanitize user prompts to reduce the risk of prompt injection attacks, and check token counts before sending requests to the LLM API.
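
A minimal sketch of such a pre-flight check. Everything here is an assumption for illustration: the `check_prompt` helper, the 4000-token budget, the two example patterns, and the rough four-characters-per-token estimate, which should be replaced by the target model's real tokenizer. Pattern matching catches only known phrasings and is not a complete defense against injection.

```python
import re

MAX_PROMPT_TOKENS = 4000  # hypothetical budget; set per model and plan

# Example phrasings only; a real deny-list would be broader and maintained.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (the |your )?system prompt", re.IGNORECASE),
]
# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def estimate_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token), not a real tokenizer."""
    return max(1, (len(text) + 3) // 4)


def check_prompt(prompt: str, max_tokens: int = MAX_PROMPT_TOKENS) -> str:
    """Return a cleaned prompt, or raise ValueError if it should be rejected."""
    cleaned = CONTROL_CHARS.sub("", prompt).strip()
    if not cleaned:
        raise ValueError("empty prompt")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(cleaned):
            raise ValueError("prompt matches a known injection pattern")
    if estimate_tokens(cleaned) > max_tokens:
        raise ValueError("prompt exceeds token budget")
    return cleaned
```

Running the check before the embedding and LLM calls means rejected prompts incur no API cost.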

DevOps & Deployment Configuration

Track model evaluation metrics using platforms like LangSmith or Helicone.
