What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is a pipeline that combines a language model with external knowledge at query time.
Instead of relying only on knowledge learned during training, RAG retrieves relevant information from your own documents at query time and injects it into the prompt, so the model can generate responses grounded in up-to-date or private data.
Why RAG is needed
Large language models are powerful, but they have important limitations:
- they cannot access private documents by default
- they do not know about newly created or frequently changing data
- they may hallucinate when asked about unknown topics
RAG addresses these limitations by separating knowledge storage from generation.
The model remains general-purpose, while your data is retrieved dynamically and provided as context only when needed.
The core idea behind RAG
At a high level, RAG works by:
- Converting documents into embeddings
- Storing those embeddings in a vector store
- Retrieving the most relevant content for a given query
- Injecting that content into the prompt sent to the model
The model then generates an answer using both:
- the user’s question
- the retrieved context
This keeps responses grounded in your data without retraining the model.
The main building blocks
Vector stores
A vector store holds embeddings that represent your documents in numerical form.
Similarity search is used to find the most relevant documents for a given query.
How to create one:
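A minimal, illustrative sketch in Python. Real systems use an embedding model and a dedicated vector database (FAISS, pgvector, a hosted service, etc.); here `embed` is a toy bag-of-words stand-in, and the "store" is just an in-memory list of (embedding, text) pairs. All names (`embed`, `cosine`, `store`) are illustrative, not a real library API:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a sparse bag-of-words vector.
    # In practice you would call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "RAG retrieves relevant documents at query time",
    "Vector stores hold embeddings for similarity search",
    "The moon orbits the earth",
]

# The "vector store": each document stored alongside its embedding.
store = [(embed(doc), doc) for doc in documents]
```

The essential design point survives the simplification: documents are embedded once at indexing time, and queries are compared against those stored vectors with a similarity measure.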
Context retrieval
When a user asks a question, the query is embedded and compared against the vector store.
The most relevant entries are retrieved and passed downstream.
How to retrieve context:
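A minimal sketch of the retrieval step, again assuming a toy bag-of-words `embed` and an in-memory list as the store (both illustrative, not a real library API). The query is embedded with the same function used for the documents, every entry is scored, and the top-k texts are returned:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, store, k: int = 2) -> list[str]:
    # Embed the query, score every stored entry, return the k best texts.
    q = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

store = [(embed(d), d) for d in [
    "Vector stores hold document embeddings",
    "Similarity search finds relevant entries",
    "Bananas are yellow",
]]

top = retrieve("How does similarity search find relevant documents?", store, k=2)
```

Production retrievers replace the linear scan with an approximate nearest-neighbor index, but the contract is the same: query in, ranked relevant texts out.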
Prompt augmentation
Retrieved content must be merged into a prompt in a controlled way.
This step defines:
- how much context is included
- how it is framed
- how the model should use it
How to inject context:
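A minimal sketch of prompt augmentation. The template, the `max_chunks` cap, and the instruction wording are all illustrative choices, but they show the three controls listed above: how much context is included, how it is framed, and how the model is told to use it:

```python
def build_prompt(question: str, context_chunks: list[str], max_chunks: int = 3) -> str:
    # Bound how much context is included so the prompt stays within limits.
    context = "\n".join(f"- {chunk}" for chunk in context_chunks[:max_chunks])
    # Frame the context explicitly and instruct the model how to use it.
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What does a vector store hold?",
    ["A vector store holds embeddings that represent documents numerically."],
)
```

The resulting string is what actually gets sent to the model; because the grounding happens entirely in this prompt, tightening or loosening the instruction line is often the quickest way to tune how strictly answers stick to the retrieved data.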