What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a pipeline that combines a language model with external knowledge at query time.

Instead of relying only on the knowledge encoded in the model's weights during training, RAG retrieves relevant information from your own documents and injects it into the prompt, allowing the model to generate responses that are grounded in up-to-date or private data.

Why RAG is needed

Large language models are powerful, but they have important limitations:

  • they cannot access private documents by default
  • they do not know about newly created or frequently changing data
  • they may hallucinate when asked about unknown topics

RAG addresses these limitations by separating knowledge storage from generation.

The model remains general-purpose, while your data is retrieved dynamically and provided as context only when needed.

The core idea behind RAG

At a high level, RAG works by:

  1. Converting documents into embeddings
  2. Storing those embeddings in a vector store
  3. Retrieving the most relevant content for a given query
  4. Injecting that content into the prompt sent to the model

The model then generates an answer using both:

  • the user’s question
  • the retrieved context

This keeps responses grounded in your data without retraining the model.
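The four steps above can be sketched end to end. This is a toy illustration: the hashed bag-of-words `embed` function stands in for a real embedding model, and the final prompt would be sent to a language model rather than printed.

```python
import hashlib
import math

def embed(text, dim=32):
    # Step 1: toy stand-in for a real embedding model.
    # Each word is hashed into one of `dim` buckets.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    return vec

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

docs = [
    "RAG retrieves documents at query time.",
    "Vector stores hold document embeddings.",
]
store = [(embed(d), d) for d in docs]              # Step 2: vector store

query = "Where are embeddings stored?"
q = embed(query)
best = max(store, key=lambda e: cosine(q, e[0]))[1]  # Step 3: retrieve

prompt = f"Context:\n{best}\n\nQuestion: {query}"    # Step 4: augment
# The augmented prompt would now be sent to the language model.
```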

The main building blocks

Vector stores

A vector store holds embeddings that represent your documents in numerical form.

At query time, a similarity search finds the documents whose embeddings are closest to the query's embedding.

How to create one:
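A minimal in-memory sketch, assuming a toy hashed bag-of-words `embed` function in place of a real embedding model (production systems use a trained model and a dedicated vector database):

```python
import hashlib

def embed(text, dim=16):
    # Toy deterministic embedding: hash each word into a bucket.
    # A real system would call an embedding model here.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    return vec

class VectorStore:
    """Minimal in-memory vector store: keeps (embedding, text) pairs."""
    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append((embed(text), text))

store = VectorStore()
store.add("RAG retrieves documents at query time.")
store.add("Vector stores hold document embeddings.")
print(len(store.entries))  # 2 entries stored
```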

Context retrieval

When a user asks a question, the query is embedded and compared against the vector store.

The most relevant entries are retrieved and passed downstream.

How to retrieve context:
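A minimal sketch of top-k retrieval by cosine similarity, again assuming the toy hashed embedding from above as a stand-in for a real model:

```python
import hashlib
import math

def embed(text, dim=16):
    # Toy deterministic embedding (hashed bag of words);
    # a real system would call an embedding model here.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    return vec

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def retrieve(query, entries, k=2):
    """Embed the query and return the k most similar stored texts."""
    q = embed(query)
    ranked = sorted(entries, key=lambda e: cosine(q, e[0]), reverse=True)
    return [text for _, text in ranked[:k]]

entries = [(embed(t), t) for t in [
    "Vector stores hold document embeddings.",
    "RAG injects retrieved context into the prompt.",
    "Bananas are rich in potassium.",
]]
print(retrieve("How does RAG use context?", entries, k=1))  # best-matching snippet
```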

Prompt augmentation

Retrieved content must be merged into a prompt in a controlled way.

This step defines:

  • how much context is included
  • how it is framed
  • how the model should use it

How to inject context:
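A simple sketch of prompt assembly. The template wording and the `max_snippets` cap are illustrative choices, not a standard; real systems tune both to the model and the token budget.

```python
def build_prompt(question, snippets, max_snippets=3):
    """Merge retrieved snippets into a prompt with explicit framing."""
    # Cap how much context is included.
    context = "\n".join(f"- {s}" for s in snippets[:max_snippets])
    # Frame the context and tell the model how to use it.
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What does a vector store hold?",
    ["A vector store holds embeddings.", "Embeddings are numeric vectors."],
)
print(prompt)
```

Instructing the model to answer only from the provided context, and to admit when the context is insufficient, is a common way to reduce hallucination in the final answer.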