Large Language Model

Large Language Models (LLMs) are deep learning models that contain billions of parameters and are conventionally built on transformer architectures. They are trained on vast text corpora, often trillions of words, to generate responses probabilistically.

The core function of an LLM, particularly the variant underlying tools like ChatGPT, is to take an input (a piece of text, and sometimes accompanying images or sound) and produce a prediction for what comes next in the passage.

Here is an overview of the key concepts and steps involved in going from a user’s input (prompt) to the generated reply:

Key Concepts and the LLM Workflow

The processing of a user’s input, or prompt, through an LLM involves several essential stages:

1. Input Processing: Tokenization and Embedding

The first step in processing input text is to break it down into tokens (words or subword fragments), each of which is then mapped to a numerical vector called an embedding.
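This step can be sketched in a few lines. The whitespace tokenizer, tiny vocabulary, and fixed two-dimensional vectors below are all toy stand-ins: real LLMs use learned subword tokenizers (such as BPE) and embedding matrices with thousands of dimensions.

```python
# Toy vocabulary mapping words to token ids; real tokenizers learn
# subword units from data rather than splitting on whitespace.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

# Each token id maps to a vector; real embeddings are learned during
# training and have thousands of dimensions, not two.
EMBEDDINGS = {
    0: [0.0, 0.0],
    1: [0.1, 0.9],
    2: [0.8, 0.2],
    3: [0.4, 0.5],
}

def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each word to a token id."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up the embedding vector for each token id."""
    return [EMBEDDINGS[t] for t in token_ids]

ids = tokenize("The cat sat")
vectors = embed(ids)
```

Unknown words fall back to the `<unk>` id, mirroring how real tokenizers guarantee every input can be encoded.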

2. Contextual Transformation: The Transformer Core

The sequence of embedding vectors then passes through the stacked layers of the transformer network, where attention mechanisms let each token’s representation incorporate context from the surrounding tokens.
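A minimal sketch of the scaled dot-product attention at the heart of each transformer layer: every position’s vector becomes a weighted average of all positions, with weights derived from vector similarity. Real transformers add learned query/key/value projections and multiple attention heads; this toy version attends over the raw vectors directly.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn arbitrary scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors: list[list[float]]) -> list[list[float]]:
    """Scaled dot-product attention over a sequence of vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # Similarity of this position with every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Weighted average of all vectors -> context-aware representation.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

contextual = attention([[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]])
```

After this step each output vector mixes information from the whole sequence, which is what makes the representation "contextual".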

3. Output Generation: Prediction and Sampling

Once the input has been fully processed, the LLM generates the response one token at a time: it predicts a probability distribution over its vocabulary and samples the next token from that distribution, repeating until the reply is complete.
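The prediction-and-sampling step can be sketched as follows. The model’s final layer yields one score (logit) per vocabulary token; a softmax turns these into probabilities, and the next token is drawn from that distribution. The three-word vocabulary and logit values here are illustrative; the temperature parameter controls how flat the distribution is (lower values make the most likely token dominate).

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 1.0,
                      rng=None) -> str:
    """Sample one token from a softmax over temperature-scaled logits."""
    rng = rng or random.Random()
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(tokens, weights=probs, k=1)[0]

# Toy logits for the continuation of "The cat sat on the ...".
logits = {"mat": 2.5, "hat": 1.0, "moon": 0.1}
next_token = sample_next_token(logits, temperature=0.7, rng=random.Random(0))
```

At a very low temperature the sampler behaves almost greedily, nearly always picking the highest-scoring token.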

Augmenting LLMs with External Information (RAG)

While LLMs are powerful based on their massive training data, a technique called Retrieval-Augmented Generation (RAG) is commonly used to enhance performance, particularly by incorporating information that was not available in the original training data. RAG is considered a standard way of augmenting a model to specialize in a task without needing to retrain the underlying model.

The RAG process typically involves:

  1. Data Encoding: External documents (unstructured text, PDFs, knowledge graphs, etc.) are converted into LLM embeddings (numerical representations) and stored in a vector database.

  2. Retrieval: When a user poses a query, a document retriever selects the most relevant documents by comparing the prompt’s embedding to the stored embeddings based on semantic distance.

  3. Augmentation: This relevant retrieved content is then integrated into the LLM’s prompt, a process sometimes referred to as “prompt stuffing,” which guides the model’s response generation.

  4. Generation: The LLM generates the final output using both the initial query and the augmented context from the retrieved documents.
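The four steps above can be sketched end to end. The bag-of-words "embeddings", the in-memory list standing in for a vector database, and the prompt returned in place of a real LLM call are all toy assumptions; a production system would use a learned embedding model, a real vector store, and an actual generation call.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Data encoding: embed documents and store them (our "vector database").
documents = [
    "The warranty covers parts for two years.",
    "Shipping takes five business days.",
]
index = [(doc, embed(doc)) for doc in documents]

def rag_answer(query: str) -> str:
    # 2. Retrieval: pick the stored document closest to the query embedding.
    q = embed(query)
    best_doc, _ = max(index, key=lambda pair: cosine(q, pair[1]))
    # 3. Augmentation: "prompt stuffing" - prepend the retrieved context.
    prompt = f"Context: {best_doc}\nQuestion: {query}\nAnswer:"
    # 4. Generation: the stuffed prompt would be sent to the LLM here;
    #    we return it as-is instead of making a model call.
    return prompt

prompt = rag_answer("How long does the warranty last?")
```

Note that the query never has to match the document word for word in a real system: semantic embeddings retrieve by meaning, where this toy version retrieves by word overlap.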

By incorporating RAG, LLMs can access domain-specific or updated information, helping to reduce common issues like AI hallucinations and allowing for the inclusion of verifiable citations in responses. Tools like Google’s NotebookLM utilize RAG principles by digesting and analyzing user-uploaded sources (like PDFs, Google Docs, websites, and YouTube videos) using multimodal capabilities (Gemini 1.5) to glean and synthesize insights, and provide citations linked back to the original passages.

Why has RAG died?

The difference between traditional Retrieval-Augmented Generation (RAG) and newer agentic approaches lies primarily in the control flow and the model’s ability to selectively utilize capabilities (referred to as tools) to accomplish a task. Agentic strategies are considered crucial steps beyond earlier forms of RAG, such as “naive chunk retrieval”.

Here is how these two approaches differ:

The old Retrieval-Augmented Generation (RAG)

RAG is defined as a standard way of augmenting an LLM to specialize in a task without needing to retrain the underlying model. The core mechanism of RAG is focused on efficiently retrieving relevant external data and integrating it directly into the LLM’s input (the prompt) for generation.

The new Agentic Approach

Agentic approaches elevate the LLM from a simple sequence generator to an orchestrator capable of complex reasoning and taking actions using various external components or “tools”.
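An agent loop of this kind can be sketched as follows. The `llm_decide` stub stands in for the model’s reasoning step, and both tools are hard-coded placeholders (the retriever illustrates RAG used as one tool among several); a real system would parse tool calls out of model output and dispatch them dynamically.

```python
def search_docs(query: str) -> str:
    """Stand-in retriever tool: RAG used as one tool among several."""
    return "Found: the warranty lasts two years."

def calculator(expr: str) -> str:
    """Toy arithmetic tool; eval is restricted to plain expressions."""
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"search_docs": search_docs, "calculator": calculator}

def llm_decide(question: str, observations: list[str]):
    """Stub for the LLM's reasoning step: pick a tool or finish.
    A real agent would prompt the model and parse its chosen action."""
    if not observations:
        return ("search_docs", question)   # first, gather context
    return ("FINISH", observations[-1])    # then answer from what was found

def run_agent(question: str, max_steps: int = 5) -> str:
    """Orchestration loop: decide, act, observe, repeat until done."""
    observations: list[str] = []
    for _ in range(max_steps):
        action, arg = llm_decide(question, observations)
        if action == "FINISH":
            return arg
        observations.append(TOOLS[action](arg))
    return "Gave up after max_steps."

answer = run_agent("How long is the warranty?")
```

The key structural difference from plain RAG is the loop: the model chooses which capability to invoke at each step, rather than receiving a single stuffed prompt.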

In essence, RAG focuses primarily on augmenting the context input with retrieved information, while agentic approaches focus on orchestrating multiple steps and tools, one of which can be a RAG system, to generate a comprehensive and reasoned reply.