Introduction
As large language models (LLMs) become more widely adopted, the challenges that come with them grow. One of the most pressing issues is hallucination—when an LLM generates plausible but incorrect or misleading information due to gaps in its training data. This situation occurs because LLMs are stochastic text generators, predicting words based on statistical probabilities rather than true understanding. Since their knowledge is static and frozen at the time of training, they struggle with new, proprietary, or real-time information. To address this problem, researchers have explored several techniques, but one—the focus of this installment—stands out as a practical and effective solution: retrieval-augmented generation (RAG).
RAG enables the integration of custom documents into the context window of a large language model, allowing the LLM to interface directly with data that was not present in its initial training corpus. For industries such as investment management, where data can be private or must be updated in real time to gain a competitive edge, this approach is a game changer. Because LLMs often have training cutoffs that can be a year or more in the past, the static knowledge baked into these models can become stale. Investors need access to timely, accurate information—without hallucinations.
While RAG significantly reduces hallucinations and enables near real-time updates, it is not a perfect solution. Its effectiveness depends on several factors, including how documents are structured, how metadata are extracted, and how retrieval is optimized.
This installment of The Automation Ahead series explores these concepts through the lens of an investment analyst. After introducing the key components of RAG step by step, we present a case study that evaluates its performance in extracting executive compensation and governance details from corporate proxy statements.
We explore two key applications:
- Interfacing proxy statements: Testing how well RAG can extract specific governance and compensation details from individual proxy statements.
- Automated multicompany extraction: Evaluating whether RAG can scale up and accurately extract multiple compensation details across portfolio companies for structured table comparison.
This installment of The Automation Ahead series is designed for both those interested in understanding the inner workings of RAG and those focused on its practical applications for investment analysis. For readers who want a more thorough analysis, we provide Jupyter Notebooks that explore the case study in detail, along with visualization code that illustrates key concepts. The full implementation can be found on the CFA Institute Research and Policy Center GitHub page.1
Discover More Use Cases
Interested in other impactful RAG applications? Explore our GitHub page for tutorials and examples, including the following:
- Material News Updater: A system that monitors financial news sources for material updates affecting portfolio stocks.
Visit the Research and Policy Center Labs GitHub page, "Retrieval Augmented Generation," for the full Jupyter Notebook tutorials and hands-on examples.
Under the Surface of RAG Workflows
In this section, we dive into the underlying mechanisms that make RAG work. For those interested in understanding how RAG retrieves and processes relevant information, keep reading. If you are just looking for applications of RAG, feel free to skip ahead to the section titled “Applying a RAG Workflow.”
RAG leverages LLM embeddings to retrieve the most relevant information for a given query. These embeddings are vector representations of words, sentences, or entire passages of text that capture their semantic meaning. Exhibit 1 illustrates a sentence being transformed into an embedding vector.
Exhibit 1. Text to Vector Demonstration
These vectors are generated by embedding models that have been trained on vast amounts of text to understand context. These models differ from generative counterparts, such as GPT (generative pretrained transformer), in that they transform an entire passage of text into a fixed-length vector representation rather than generating word tokens one by one2. Through their training process, the models learn to position semantically similar texts near one another. For example, assume we have three sentences:
- Investing in smaller companies is riskier. [blue]
- Smaller company investments come with higher risks. [blue]
- Physics explores particle behavior. [green]
The embedding model will place the first two sentences near one another and far from the third, which is completely unrelated. Exhibit 2 provides a way to visualize this in three-dimensional space (v = [x, y, z]).
Exhibit 2. 3D Embedding Demonstration
In reality, these vectors have a much higher dimension, making them impossible for us to visualize. For example, OpenAIʼs text-embedding-3-small model has a dimension of 1,536 (v = [v1, v2, v3, … , v1536]). This high dimension is needed to capture the semantic nuance of our language.
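To make this concrete, the following is a minimal sketch, using the three example sentences above and OpenAI's embeddings API, that converts text into vectors and measures how similar they are. The exact similarity values will vary by model and are not taken from the article.

import numpy as np
from openai import OpenAI

client = OpenAI()

sentences = [
    "Investing in smaller companies is riskier.",
    "Smaller company investments come with higher risks.",
    "Physics explores particle behavior.",
]

# Embed each sentence into a 1,536-dimensional vector
response = client.embeddings.create(model="text-embedding-3-small", input=sentences)
vectors = np.array([item.embedding for item in response.data])
print(vectors.shape)  # (3, 1536)

# Cosine similarity: the two investing sentences should score much higher with each other
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # semantically similar pair
print(cosine(vectors[0], vectors[2]))  # unrelated pair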
In a RAG workflow, the goal is to retrieve the most relevant information for a given query. When dealing with timely financial data, such as a newly released earnings call transcript, the information is inherently new and would not have been part of the LLMʼs training data. To make this data accessible, it must first be embedded and stored in a vector database, where it can be accessed by the LLM when responding to a query.
To achieve this, the data (in this scenario, the earnings call transcript) are first separated into chunks. These chunks can consist of words, sentences, or passages of text, but their size is constrained by the maximum input length of the embedding model. This step is critical to ensuring the accuracy of the final LLM output and has been the focus of significant research aimed at improving LLMsʼ ability to deliver precise and contextually relevant responses.
Once the data are divided into individual chunks, they are embedded using an embedding model and stored in a vector database. The database serves as a repository, allowing the LLM to retrieve relevant information dynamically when prompted with a query. Exhibit 3 illustrates the transformation of a document into chunks, the embedding of those chunks, and their storage in a vector database.
Exhibit 3. Vector Database Creation
We can now think of this vector database as a high-dimensional space, where each chunk of text is positioned based on its contextual meaning. When an LLM receives a query, it first converts the query into an embedding and then searches for the most semantically similar embeddings within the database. The closest-matching text chunks are then retrieved and appended to the original query to provide additional context, helping the model generate a more accurate response.
To illustrate this process in a simplified way, we use a toy example based on a transcript of Apple’s Q4 2024 earnings conference call3. The goal is to conceptually demonstrate how a retrieval system prioritizes relevant information by simulating how an LLM might identify the most useful sentences from a document.
We start our toy example with the following question:
Q: What are some of the market risks that could impact near-term EBITDA?
To demonstrate how retrieval works, we are still running the full semantic similarity process—converting the question into an embedding, comparing it against stored embeddings, and using k-nearest neighbors (k-NN) to identify the most relevant text. Here k is a hyperparameter that specifies the number of chunks that have the highest vector similarity. For clarity, however, we have preselected two sentences that are highly relevant to the query and seven that are unrelated.
The two related sentences are as follows:
A: Most commodities are going to move down in price while NAND and DRAM increased during the course of the September quarter, and we expect them to increase during the December quarter4.
B: If you look at how we did for the quarter in China, we were relatively flat year over year, and a key component of that improvement relative to the year-over-year performance that we had been achieving is that there was a sequential improvement in foreign exchange.
The seven unrelated sentences are as follows:
C: And we love celebrating the craft of great storytellers who know how to put on a show.
D: I had an incredible time during launch day in September alongside our team at Apple Fifth Avenue where energy and enthusiasm filled the air.
E: Today, users choose Apple Pay for purchases across tens of millions of retailers worldwide.
F: In honor of World Teachers' Day, Apple was proud to share new resources for teachers to engage their students in ways that aim to make learning easy and fun.
G: With AirPods 4, we’ve broken new ground in comfort and design with our best-ever open-ear headphones available for the first time with active noise cancellation.
H: The iPhone active installed base grew to a new all-time high in total and in every geographic segment.
I: The latest reports from 451 Research indicated customer satisfaction of 96% for Watch in the U.S.
Exhibit 4 shows chunks A and B (highlighted in [yellow]) as the closest embeddings to the query (highlighted in [red]).
Exhibit 4. 3D Transcript Chunk Embedding Visualization
If we specified k = 2 for our nearest neighbor calculation, these top two relevant sentences would then be automatically appended to the prompt of the language model to increase the accuracy of answering this question. For illustrative purposes, we can see how this added context can improve the LLM’s ability to answer the question.
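To make the retrieval step concrete, here is a minimal sketch of the k-NN selection. It assumes the nine transcript sentences above are stored in a list named chunks and uses OpenAI's text-embedding-3-small model; it is illustrative rather than the notebook's exact code.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Convert texts into embedding vectors with OpenAI's embeddings API
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# `chunks` is assumed to be the list of the nine transcript sentences (A-I) above
query = "What are some of the market risks that could impact near-term EBITDA?"
chunk_vectors = embed(chunks)
query_vector = embed([query])[0]

# Cosine similarity between the query and every chunk
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)

# k-NN retrieval: keep the k most similar chunks (here, k = 2 returns sentences A and B)
k = 2
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Append the retrieved context to the original question
context = "\n".join(top_chunks)
augmented_prompt = f"Question: {query}\nContext:\n{context}"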
Original Query Response
The following is an excerpt from GPT-4o that answers the question based only on the original query:
Prompt:
You are a financial analyst assistant.
Question:
What are some of the market risks that could impact Apple's near-term EBITDA?
Response:
Apple's near-term EBITDA could be impacted by supply chain disruptions, particularly if geopolitical tensions or natural disasters affect the availability of key components. Additionally, fluctuations in consumer demand due to economic downturns or shifts in consumer preferences could also negatively impact their revenue and profitability.
Contextual Query Response
Now the top two sentences are appended as context automatically when we specify a prompt template that pulls in these top k sentences. Here is the full prompt with the context:
Prompt with Context:
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.
Context:
{Most commodities are going to move down in price while NAND and DRAM increased during the course of the September quarter, and we expect them to increase during the December quarter.
If you look at how we did for the quarter in China, we were relatively flat year over year, and a key component of that improvement relative to the year-over-year performance that we had been achieving is that there was a sequential improvement in foreign exchange.}
Question:
What are some of the market risks that could impact Apple's near-term EBITDA? Give your answer in two sentences.
Response with Context:
Some market risks that could potentially impact Apple's near-term EBITDA include fluctuations in foreign exchange rates, which can affect revenue and profitability, especially in key markets like China. Additionally, changes in commodity prices, such as increases in NAND and DRAM costs, could raise production expenses and impact margins.
We can see that the first response relied on general knowledge about Apple but lacked the accuracy and timeliness provided by the retrieved context.
With RAG, the modelʼs responses are anchored in specific, retrievable information rather than just its pretrained knowledge. This approach helps reduce hallucinations, improves transparency by making the sources of information clear, and ensures that outputs are based on more relevant and up-to-date data. In domains such as finance, where real-time and accurate information is crucial, this ability to incorporate external data can be particularly useful5.
Now that we have an intuitive grasp on underlying RAG concepts, we move on to building a RAG pipeline and applying it to the investment domain.
Applying a RAG Workflow
While structured financial data—such as earnings reports and balance sheets—are best handled through application programming interfaces (APIs) and traditional data pipelines, qualitative disclosures in corporate filings present a different challenge. Proxy statements contain key details about executive compensation, board independence, and insider transactions—information that is not captured in structured datasets. Since these filings are released periodically, they will not be included in an LLMʼs pretraining data, making them a good test case for RAG: Can it retrieve and analyze the latest filings in a way that is actually useful for analysts? To explore this question further, we present the following case study.
Interfacing Proxy Statements: Case Study
Suppose you are a portfolio manager overseeing a discretionary equity portfolio. You want to analyze governance practices across your portfolio companies, comparing pay structures, performance incentives, and other governance risks. Manually reviewing proxy statements is time consuming, and while structured databases capture some high-level figures, they often miss the full context. You are interested in seeing whether RAG can help streamline this process.
To test its effectiveness, we will explore two different approaches:
- Direct questioning on a vector database of portfolio proxy statements: We will embed proxy statements from portfolio companies and use RAG to retrieve specific governance details. This will help us assess how well the system identifies relevant passages and whether it can extract targeted information accurately.
- Automated multicompany multivariable extraction: We will test whether RAG can analyze multiple companies at once, pulling key compensation details and structuring them into a table for cross-comparison analysis. This approach will evaluate the systemʼs ability to extract relevant data at scale and its consistency in formatting responses.
Before going further, it helps to step back and see how a basic RAG workflow fits together from end to end. Exhibit 5 provides an overview of the major components involved in a RAG pipeline.
Exhibit 5. End-to-End RAG Workflow

Core Components of RAG Workflows
In essence, there are five major components in every RAG pipeline, each playing a role in ensuring information is ingested, processed, retrieved, and eventually used to produce contextually relevant answers. What follows is a closer look at each stage for this case study. If you are an investment professional looking to build or understand the mechanics of RAG systems, the following breakdown walks you through each stage in detail. If you are more interested in high-level applications, feel free to skip ahead to the next section, “Testing the RAG Workflow.”
With our use case defined, we will now walk through the full RAG pipeline to see how it handles real-world proxy statements.
Quick Note: Choosing the Right Tools
It is important to highlight a common challenge developers face when beginning LLM projects: selecting suitable tools. Many start by using abstraction libraries, such as LangChain, which simplify interactions with chat model APIs and provide convenient tools for rapid prototyping. However, developers often encounter challenges with increased complexity, reduced transparency, and difficulties in debugging due to the abstraction layers provided by these libraries. The ideal approach depends heavily on balancing rapid prototyping needs against long-term goals of scalability, maintainability, and efficiency.
A hybrid approach is frequently the most effective solution. As we will demonstrate in our case study, utilizing LangChain for initial document handling tasks—such as document loading, text splitting, metadata extraction and vector database creation—can streamline early development stages. Later, for more complex chat completions, switching to direct API interaction (e.g., OpenAI) offers greater transparency, customization, and precision.
Step 1. Data Ingestion and Parsing
The first step in any RAG workflow is data ingestion—collecting, parsing, and preparing the data for retrieval. In our case, that means downloading proxy statement PDFs and converting them into a format suitable for embedding. For this case study, we will be using three illustrative portfolio companies—Apple, Amazon, and Microsoft. We first download their proxy statements (Form DEF 14A) from the SEC’s EDGAR website and save them to our DEF14A PDFs folder.
from langchain.document_loaders import PyPDFLoader
import os

# Directory where PDFs are stored
pdf_folder = "Retrieval Augmented Generation/DEF14A-PDFs"

# Get list of PDF files in the folder
pdf_files = [os.path.join(pdf_folder, f) for f in os.listdir(pdf_folder) if f.lower().endswith(".pdf")]

# Load each PDF into a list of documents
documents = []
for file_path in pdf_files:
    loader = PyPDFLoader(file_path)
    docs = loader.load()
    documents.extend(docs)
Once the documents are collected, the next step is parsing them into a usable format. PDFs often contain a mix of text, images, and tables, making it difficult to extract structured data efficiently. Traditional PDF parsers primarily convert PDFs into plain text, which can lead to loss of important metadata, such as section headers and table structures, that are crucial for accurate information retrieval.
To address this, we will use Mistral’s Optical Character Recognition (OCR) API, which converts PDFs into markdown files with preserved structure, including section headers and table formatting. This approach provides several advantages:
- Improved text splitting: Section headers allow for more logical segmentation of content.
- Enhanced metadata extraction: Structured outputs enable easier classification of different sections of the document.
- Better query accuracy: By preserving document hierarchy, RAG models can retrieve relevant information with greater precision.
Later, we will compare this markdown-based approach with a more naive text extraction strategy that does not preserve structure, highlighting the benefits of structured parsing in enhancing retrieval accuracy and efficiency.
from pathlib import Path
from mistralai import DocumentURLChunk
import json

def parse_pdf(client: object, pdf_path: Path) -> str:
    # Upload PDF to Mistral's OCR service
    uploaded_file = client.files.upload(
        file={
            "file_name": pdf_path.stem,
            "content": pdf_path.read_bytes()
        },
        purpose="ocr"
    )
    # Get URL for the uploaded file from Mistral API
    signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)
    # Process PDF with OCR, returning markdown text
    pdf_response = client.ocr.process(
        document=DocumentURLChunk(document_url=signed_url.url),
        model="mistral-ocr-latest"
    )
    # Extract and return the markdown content
    response_dict = json.loads(pdf_response.model_dump_json())
    markdown_text = "\n\n".join(page["markdown"] for page in response_dict["pages"])
    return markdown_text
The following is a markdown output example:
# UNITED STATES
# UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, DC 20549
## SCHEDULE 14A
PROXY STATEMENT PURSUANT TO SECTION 14(a)
OF THE SECURITIES EXCHANGE ACT OF 1934\n(Amendment No. )
Filed by the Registrant
Filed by a Party other than the Registrant
## Check the appropriate box:
Preliminary Proxy Statement
Confidential
Note that for simplicity, we are not parsing images in this example, even though Mistral’s OCR API supports it. We have included table extraction, but image processing is beyond the scope of this demonstration6.
Step 2. Chunking the Markdown Files
Now that we have converted our PDFs into structured markdown, the next step is to split them into smaller, meaningful chunks. Instead of using naive character-based splitting, which can break the semantic structure of text, we will use markdown headers to ensure logical segmentation.
from langchain_text_splitters import MarkdownHeaderTextSplitter

def split_markdown_by_headers(pdf_name: str, markdown: str) -> list:
    # Define header levels to split on
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3")
    ]
    # Initialize the Markdown splitter
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
    # Split the markdown content based on headers
    md_header_splits = markdown_splitter.split_text(markdown)
    # Attach the original file name as metadata for traceability
    for split in md_header_splits:
        split.metadata['file_name'] = pdf_name
    return md_header_splits
With this function, our markdown content is now divided into semantically meaningful chunks, preserving the structure of the proxy statements. This will improve retrieval performance when querying our RAG system in later steps.
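To tie Steps 1 and 2 together, a minimal driver loop along the following lines (assuming a Mistral API key is available in the environment) produces the md_header_splits list used in the next step; it is a sketch rather than the notebook's exact code.

from mistralai import Mistral
import os

# Instantiate the Mistral client used by parse_pdf (assumes MISTRAL_API_KEY is set)
mistral_client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Parse and split each proxy statement PDF collected in Step 1
md_header_splits = []
for file_path in pdf_files:
    markdown = parse_pdf(mistral_client, Path(file_path))
    md_header_splits.extend(split_markdown_by_headers(os.path.basename(file_path), markdown))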
Step 3. Embedding and Storing Documents in the Vector Database
With the markdown chunks prepared, the next step is to convert them into numerical vector embeddings and store them in a vector database. We will use Chroma, a lightweight and high-performance vector store, to store these embeddings. The embedding process is handled by OpenAIʼs OpenAIEmbeddings model, which converts the text into numerical representations optimized for retrieval.
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Store document embeddings in a vector database
vectorstore = Chroma.from_documents(
    documents=md_header_splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="vector_db_md_split"
)
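Because we pass persist_directory, the embeddings are written to disk. In later sessions, the same store can be reloaded without re-embedding the documents; a minimal sketch follows.

# Reload the persisted vector store in a new session (no re-embedding required)
vectorstore = Chroma(
    persist_directory="vector_db_md_split",
    embedding_function=OpenAIEmbeddings()
)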
Step 4. Define the Retriever and Set Up the Prompt
Now that our structured markdown content has been embedded and stored, we need to define a retriever. The retriever is responsible for searching the vector database and returning the most relevant chunks of text to be added to the modelʼs context window when responding to queries.
We will use k-NN for retrieval and set k equal to 100, which identifies the top 100 most semantically similar text chunks to the input query. Similarity is measured with a metric such as Euclidean distance, which captures the straight-line distance between vectors, or cosine similarity, which captures the angle between them. Retrieving this many chunks ensures that the model has enough relevant context for accurate analysis.
# Create a retriever from the vectorstore
retriever = vectorstore.as_retriever(search_kwargs={"k": 100})
Next, we need to set up the prompt to send to the retriever. We do this by creating a prompt template in which we define the instructions for the LLM and the inputs that will change based on each question. In our case, the inputs are the question we are asking and the relevant context the retriever finds in the database based on our question.
from langchain_core.prompts import ChatPromptTemplate

# Define your prompt template explicitly
template = """Execute the following query based only on the following context.
Query: \n {query}
Context: \n {context}
"""

def create_rag_prompt(query: str, retriever, template: str) -> str:
    # Create the prompt object from the prompt template
    prompt = ChatPromptTemplate.from_template(template)
    # Retrieve relevant documents (context) for a given query
    context_docs = retriever.get_relevant_documents(query)
    # Format context documents into a single string
    context_str = str(context_docs)
    # Format the final prompt we will send to the OpenAI chat completions API with context and query
    final_prompt = prompt.format(context=context_str, query=query)
    return final_prompt
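As an illustrative usage, a single call builds the context-enriched prompt for one of the questions we pose later in Exhibit 6:

# Build the final prompt for an example query
query = "Does Apple have a dedicated committee in place to oversee artificial intelligence?"
final_prompt = create_rag_prompt(query, retriever, template)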
We are now ready to query the model for response generation.
Step 5. Response Generation
The final step in the RAG workflow is response generation, where we use the context-enriched prompt to obtain answers from the large language model. For this example, we are leveraging OpenAIʼs API to generate structured responses. Structured outputs ensure the modelʼs responses conform precisely to a specified format. This will be very useful for the second approach, in which we ask the LLM to output the executive compensation details into a table for cross comparison. For general interfacing with the documents, we use a simple output structure that forces the LLM to explain its reasoning before providing an answer.
from openai import OpenAI
from dotenv import load_dotenv
from pydantic import BaseModel

# Define the structured output
class ExplainOutputs(BaseModel):
    explanation: str
    answer: str

# Load environment API keys, then define the OpenAI API client
load_dotenv()
client = OpenAI()

# Send structured prompt to OpenAI's ChatCompletion API
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are an expert at interfacing documents and providing responses in a structured format."},
        {"role": "user", "content": final_prompt}
    ],
    response_format=ExplainOutputs,  # Enforces structured output
)
structured_response = completion.choices[0].message.parsed
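The parsed object exposes the two fields defined in ExplainOutputs, which can be logged or displayed directly:

# Inspect the model's reasoning and final answer separately
print(structured_response.explanation)
print(structured_response.answer)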
We now have all the components of our RAG pipeline in place—from ingestion through structured response generation—and we are ready to move into the next phase: testing the RAG workflow.
Testing the RAG Workflow
Now that we have set up each component of our RAG pipeline, we can evaluate its effectiveness in handling real-world proxy statements. We will test two approaches: (1) directly questioning the vector database to extract specific details and (2) performing an automated extraction across multiple companies for comparative analysis. We begin by exploring the first method—direct questioning on a vector database of portfolio proxy statements.
Direct Questioning on a Vector Database of Portfolio Proxy Statements
At this point we have ingested proxy statement PDFs for our portfolio companies—Apple, Amazon, and Microsoft—into our vector database. This can be scaled up to any number of portfolio companies with minimal effort once the ingestion pipeline is defined. Essentially, the process is only constrained by processing time and compute budget.
To assess how effectively our RAG setup retrieves relevant information, we will pose targeted queries focused on board independence and governance practices. While the following examples are not comprehensive, they give us a general sense of whether the LLM accurately retrieves valuable, contextually relevant information.
Exhibit 6 provides information we have created to evaluate the responses of the RAG workflow. It contains questions posed directly to the RAG system, along with the relevant text passages found in the document, the LLM-generated responses, and the accuracy evaluations.
Exhibit 6. Proxy Statement Direct Questions
Query | Relevant Text Passages | LLM Answer | Accuracy Evaluation | Evaluation Notes |
---|---|---|---|---|
“Does Apple have a dedicated committee in place to oversee artificial intelligence?” | “Apple’s full Board directly oversees corporate and product strategy and receives regular updates on emerging technologies, including artificial intelligence. The Audit Committee assists the Board in its oversight of privacy-related artificial intelligence matters. In June 2024, the Company announced its Responsible AI principles, which aim to guide how the Company develops artificial intelligence tools.” | “No, Apple does not have a dedicated committee solely for artificial intelligence, but it has integrated oversight mechanisms within its existing governance structure.” | ✅ Accurate | The LLM correctly concluded that Apple does not have a dedicated committee. As stated in the document, the board oversees AI-related activities with assistance from the Audit Committee. |
“What is Microsoft’s policy on executives’ derivatives trading?” | “Our directors and executive officers are prohibited from trading in options, puts, calls, or other derivative instruments related to Microsoft equity or debt securities. They also are prohibited from purchasing Microsoft common stock on margin, borrowing against Microsoft common stock held in a margin account, or pledging Microsoft common stock as collateral for a loan.” | “Microsoft prohibits its directors and executive officers from trading in options, puts, calls, or other derivative instruments related to Microsoft equity or debt securities. They are also prohibited from purchasing Microsoft common stock on margin, borrowing against Microsoft common stock held in a margin account, or pledging Microsoft common stock as collateral for a loan.” | ✅ Accurate | The LLM correctly states Microsoft’s derivatives trading policy. |
“List executive board members and independent board members for Amazon to be elected for the next fiscal period” | “The Board has determined that the following directors are independent as defined by Nasdaq rules: Gen. Alexander, Ms. Cooper, Ms. Gorelick, Dr. Huttenlocher, Ms. McGrath, Dr. Ng, Mrs. Nooyi, Mr. Rubinstein, Mr. Smith, Ms. Stonesifer, and Mr. Weeks.” “Total Number of Directors 13*” “* Ms. McGrath is not standing for re-election at the Annual Meeting.” | Executive Board Members: – Jeffrey P. Bezos – Andrew R. Jassy Independent Board Members: – Keith B. Alexander – Edith W. Cooper – Jamie S. Gorelick – Daniel P. Huttenlocher – Andrew Y. Ng – Indra K. Nooyi – Jonathan J. Rubinstein – Brad D. Smith – Patricia Q. Stonesifer – Wendell P. Weeks | ✅ Accurate | The LLM identified the current members up for reelection and correctly classified them as either independent or executive board members. |
“What was Satya Nadella’s Total Compensation without the inclusion of other compensation? Give it to me as % of Total Compensation to the 4th decimal place.” | “Satya Nadella – Chairman and Chief Executive Officer 2024: Salary: $2,500,000 | Stock Awards: $71,236,392 | Non-Equity Incentive Plan: $5,200,000 | Other Compensation: $169,791 | Total: $79,106,183” | ‘99.7864%’ | ❌ Inaccurate | This question required first extracting the total compensation of $79,106,183, then subtracting the other compensation of $169,791, and dividing the result by the original total compensation. Its answer is very close to the correct answer of 99.7854%, but the LLM seems to have hallucinated the last two digits. |
Overall, the RAG pipeline was able to accurately retrieve general queries but struggled when we asked for detailed mathematical calculations. This is an area where function calling and agents can dramatically increase accuracy. In the last query, the LLM came very close with its calculation, but it hallucinated two digits—a small but significant error in a financial context.
To showcase how agents address this problem, we connected the LLM to an agent equipped with a Python interpreter, enabling it to run exact computations. The result was the correct answer, generated with full traceability. For those interested in exploring the implementation further, the Jupyter Notebook contains a working example of this agent, with more to come in future releases.
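For the compensation question above, the exact computation the agent's Python tool would run is trivial; a sketch follows, with figures taken from Microsoft's summary compensation table quoted in Exhibit 6.

# Total compensation excluding "other compensation", as a % of total compensation
total_compensation = 79_106_183
other_compensation = 169_791
pct = (total_compensation - other_compensation) / total_compensation * 100
print(f"{pct:.4f}%")  # 99.7854%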
Now we will try a more nuanced question. For executive compensation, say we want to know the long-term equity awards the chief executive actually realized over the fiscal period. Under Financial Accounting Standards Board (FASB) Accounting Standards Codification (ASC) Topic 718, companies are required to report the fair value of stock awards at grant date—not the realized value of vested awards for the period. This requirement helps with comparability across companies. Thus, summary compensation tables quote total compensation in accordance with ASC Topic 718. The realized value over the period typically appears in a separate table, such as “Grants Vested” or “Stock Vested.” To extract this value, we prompt the RAG pipeline with the following:
“What is Tim Cook’s realized long-term incentives for the fiscal year 2024 based on shares vested?”
RAG response:
“Tim Cook's realized long-term incentives for the fiscal year 2024 based on shares vested include 707,410 shares, resulting in a value realized of $122,808,793.”
The response is accurate: the LLM retrieved the correct realized value. This will be a useful baseline for testing whether the LLM can maintain this level of nuance when we request the same information for multiple companies, and for multiple variables at once, in the next section.
Automated Multicompany Multivariable Extraction
In the previous section, we demonstrated that the RAG workflow effectively handles individual, targeted queries. Now, we evaluate the system’s capability in extracting multiple compensation details simultaneously from proxy statements across multiple companies, and structuring this information into a table for direct comparison. The goal isn’t to showcase the system’s strengths, but rather to explore the limitations of a simple RAG setup. We also use this task to demonstrate how an LLM can evaluate outputs—a valuable technique for scaling the evaluation process to larger datasets.
This task is notably more complex. Our structured prompt (provided below) requests both qualitative and quantitative compensation details from the proxy statements of Apple, Amazon, and Microsoft. Including both data types allows us to evaluate whether the large language model handles numerical (quantitative) or descriptive (qualitative) data more effectively.
Here’s the structured prompt used for this extraction:
query = """You are an expert financial analysis assistant specializing in extracting detailed executive compensation data from proxy statements. Your task is to extract the following fields from the provided proxy statements and output the information in the given structure. For any missing or not applicable information, use “N/A”. Extract these fields for each company: 1. Company Name: Full legal name of the company. 2. Company CEO: Name of the Chief Executive Officer. 3. Coverage Period: The fiscal period covered by the proxy (e.g., "Fiscal Year 2024" or "Fiscal Year Ended June 30, 2024"). 4. Total Target LTI (Full Grant Amount): The total target long-term incentive amount, including all equity components. 5. LTI Package Grant Date: The date when the LTI grant was awarded. 6. Annual LTI Grant?: Indicate “Yes” if equity awards are granted annually, or “No” if not. 7. Time-Based RSU Vesting Schedule: Details on vesting for time-based equity (e.g., "Three equal annual installments starting April 1, 2026" or quarterly vesting over a specified period). If not applicable, indicate “N/A”. 8. Performance-Based RSU Vesting Schedule: Details on vesting for performance-based equity, including the performance period and conditions (e.g., "Vests on October 1, 2026 based on performance from October 1, 2023 to September 30, 2026"). If not applicable, indicate “N/A”. 9. Compensation Governance Arrangements: Information on the oversight mechanisms (e.g., independent compensation committees, clawback provisions, and stock ownership policies). 10. CEO Pay Alignment Mechanisms: How the compensation is structured to align CEO pay with long-term shareholder value. 11. Performance Metrics Used (Detailed): Provide specifics of the performance metrics applied, including the measurement period and targets (e.g., "Relative TSR compared to the S&P 500 over a 3-year period (October 2023 – October 2026), payout ranges from 0%–200%" or details on revenue growth targets). 12. Realized Base Salary: Actual base salary paid in the period. 13. Realized STIs: Actual short-term incentives (bonuses) paid. 14. Realized Long-Term Awards: Value of equity awards that have vested in the period. 15. Realized Other Compensation: Additional benefits (e.g., security costs, deferred compensation, change-in-control benefits). 16. Realized Total Compensation: The sum of all compensation elements actually received in the period. Output your findings in a JSON format. """
Ground Truth
We evaluate the LLM output in Exhibit 7, which was manually created and represents our basis for truth in evaluation.
Exhibit 7. Ground Truth
Company | CEO | Coverage Period | Total Target LTI | LTI Grant Date | Annual LTI Grant? | Time-Based RSU Schedule | Performance-Based RSU Schedule | Governance Arrangements | CEO Pay Alignment | Performance Metrics | Realized Base Salary | Realized STIs | Realized LT Awards | Realized Other Compensation | Realized Total Compensation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Amazon.com, Inc. | Andrew R. Jassy | Fiscal Year 2023 | $211,933,520 | 2021 | No | RSUs vest over 5–10 years (80% between 2026–2031) | N/A | Independent Leadership Development and Compensation Committee; clawback provisions and anti-hedging policies | Long-term RSU grants aligned directly with long-term shareholder value; no cash bonuses or short-term metrics | N/A | $365,000 | N/A | $27,830,305 | $992,764 | $29,188,069 |
Apple Inc. | Tim Cook | Fiscal Year 2024 | $50,000,000 | 2023-10-01 | Yes | Three equal annual installments starting April 1, 2026 | Vests October 1, 2026 based on performance from October 1, 2023 to September 30, 2026 | Independent People & Compensation Committee oversight; clawback and stock ownership policies | Long-term incentives tied to Apple’s relative TSR compared to the S&P 500 over a three-year performance period | Relative TSR compared to the S&P 500 over a 3-year performance period (Oct 2023–Oct 2026), payout ranges 0%–200% | $3,000,000 | $12,000,000 | $122,808,793 | $1,520,856 | $139,329,649 |
Microsoft Corporation | Satya Nadella | Fiscal Year 2024 | $50,000,000 | 2023-09-18 | Yes | N/A | Vests August 31, 2026, based on 3-year performance period (July 1, 2023 – June 30, 2026) | Independent Compensation Committee oversight; strong clawback provisions; minimum stock ownership requirements | Over 95% performance-based pay with strict alignment to long-term financial goals | Relative TSR vs. S&P 500 over 3-year period (FY24–FY26), Revenue growth targets for Azure Cloud Services, Microsoft Cloud (ex-Azure), Search & News Advertising, Xbox; payout ranges 0%–150% | $2,500,000 | $5,200,000 | $44,011,776 | $169,791 | $51,881,567 |
Evaluating Results
We now evaluate the RAG workflow outputs systematically.
To systematically evaluate the automated extraction, we will use a separate LLM (“LLM Judge”) tasked with comparing the model-generated outputs against this ground truth table. For each extracted field, the LLM Judge assigns an accuracy score between 0 and 1:
- 0 means completely incorrect or missing.
- 1 means fully correct and accurately matching the provided ground truth.
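A minimal sketch of such a judge, reusing the structured-output pattern from Step 5, is shown below. The prompt wording here is illustrative; the full prompt is in the Jupyter Notebook.

from openai import OpenAI
from pydantic import BaseModel

class JudgeScore(BaseModel):
    reasoning: str
    score: float  # 0 = incorrect or missing, 1 = fully correct

client = OpenAI()

def judge_field(field_name: str, ground_truth: str, generated_answer: str) -> JudgeScore:
    # Ask the judge model to compare one extracted field against the ground truth
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a strict evaluator. Compare the generated answer with the ground truth and assign an accuracy score between 0 and 1."},
            {"role": "user", "content": f"Field: {field_name}\nGround truth: {ground_truth}\nGenerated answer: {generated_answer}"},
        ],
        response_format=JudgeScore,
    )
    return completion.choices[0].message.parsed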
Essentially, we are simply instructing an LLM to look at the generated answer and the ground truth and provide a score of correctness of the generated answer between 0 and 1. This approach allows for a standardized evaluation of nuanced qualitative and quantitative data, which is especially valuable given the complexity and subjective nature of some executive compensation disclosures. Exhibit 8 shows a heat map of the LLM Judge evaluations.
Exhibit 8. LLM Judge Evaluation Heat Map

Overall average score = 62%.
For example, the LLM assigned a score of 0.25 to governance arrangements for Apple. Next, we show the ground truth, the LLM answer, and the reasoning for the score:
- Ground truth: Independent People and Compensation Committee oversight; clawback and stock ownership policies.
- LLM answer: The People and Compensation Committee oversees compensation and engages with shareholders.
- Reasoning: The ground truth specifies oversight by an Independent People and Compensation Committee, specifically mentioning “clawback and stock ownership policies.” The generated answer refers only to “the People and Compensation Committee” and adds an element of engaging with shareholders, while omitting independence, the oversight of people, and the details regarding clawback and stock ownership policies. Thus, while it captures a portion related to compensation oversight, it misses several key details.
Interested readers should see the section titled “Evaluate the Outputs Using LLM as a Judge” in the Jupyter Notebook on the RPC Labs GitHub page, where you will find the full prompt and the reasonings behind each score.
Nuance Capture and Reasoning Capabilities Analysis
From the heat map, it is clear that performance varies significantly by company:
- Apple achieved high accuracy.
- Microsoft showed moderate accuracy.
- Amazon performed poorly.
The poor performance on Amazon highlights the model’s difficulty in handling nuanced compensation structures. Amazon’s CEO received a significant grant in 2021, intended to cover several subsequent years without annual grants, causing the model to miss critical details, such as grant date and vesting schedules. This nuance, requiring deeper understanding beyond direct information retrieval, caused inaccuracies in realized compensation values.
Similarly, the nuance we tested earlier surrounding the realized long-term awards was not captured for any of the companies. This inaccuracy had a cascading effect, causing discrepancies in the total compensation figures. It is likely that instructing the LLM to extract multiple variables simultaneously limited its ability to perform step-by-step (chain-of-thought) reasoning for each item, ultimately leading to lower accuracy in responses.
Quantitative vs. Qualitative Data
To further analyze performance, we distinguish between qualitative (descriptive) and quantitative (numeric) data extraction7. Qualitative data are descriptive information, such as the coverage period, details around grants, and vesting schedules, whereas quantitative data include numeric values, such as realized base salary, short-term incentives (STIs), and total realized compensation.
The accuracy scores for each category are as follows:
- Qualitative average score = 66%.
- Quantitative average score = 55%.
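These category averages are straightforward to compute from the judge scores; the sketch below assumes a pandas DataFrame named scores with one row per company/field pair and columns "field" and "score" (an assumed layout, not the notebook's exact structure).

import pandas as pd

# Label each field as qualitative or quantitative, per the categories in footnote 7
qualitative_fields = {
    "CEO", "Coverage Period", "LTI Grant Date", "Annual LTI Grant",
    "Time-Based RSU Vesting", "Performance-Based RSU Vesting",
    "Governance Arrangements", "CEO Pay Alignment", "Performance Metrics",
}
scores["category"] = scores["field"].map(
    lambda f: "qualitative" if f in qualitative_fields else "quantitative"
)
print(scores.groupby("category")["score"].mean())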
This suggests that our RAG pipeline is currently better at retrieving qualitative information. The accuracy of quantitative extraction heavily depends on document structure. When quantitative data are dispersed across multiple tables or presented without sufficient contextual clues, the LLM struggles compared to qualitative data, which usually includes richer context. This may be why performance was best for Apple: The realized compensation table is clearly laid out in a single table.
Importance of Document Structuring and Metadata Extraction
Earlier, we emphasized how critical document structure and metadata extraction are for enhancing RAG performance. To illustrate this, we compared our structured approach with a naive strategy—extracting plain text and chunking it by a fixed character limit without any metadata. Exhibit 9 shows the LLM Judge evaluations on the naive strategy.
Exhibit 9. Naive Chunking Strategy LLM Judge Evaluations

Overall average score = 32%.
The naive method dropped the overall average score from 62% to 32%, a decline of 30 percentage points, clearly demonstrating the substantial benefits provided by document structuring and metadata enrichment.
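For reference, the naive baseline amounts to fixed-size character chunking of the plain-text PyPDFLoader output from Step 1, with no header metadata attached; a sketch follows (the chunk sizes are illustrative).

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Fixed character-limit chunking with no structural or metadata awareness
naive_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
naive_splits = naive_splitter.split_documents(documents)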
These insights reinforce the importance of careful workflow design when implementing RAG in professional settings.
Key Takeaways
The following are the most important takeaways from this case study.
- RAG is useful for basic information retrieval: It provides analysts and portfolio managers with an intuitive, conversational method of accessing information from large documents. However, its accuracy notably decreases when responses require calculations, counting, or nuanced interpretation from multiple locations in a document. This is an area where more advanced methods, such as agents and function calling, can play a key role.
- Limitations with nuanced reasoning: When tasked with extracting multiple complex variables simultaneously from multiple documents, RAG struggles with nuanced and detailed reasoning, significantly reducing accuracy.
- Qualitative versus quantitative performance: The model consistently demonstrates higher accuracy with qualitative, descriptive information, whereas quantitative data, especially numeric data requiring interpretation or aggregation, tend to pose greater challenges.
- Document structuring and metadata are critical: Clear structuring of documents and inclusion of detailed metadata considerably improve the accuracy and effectiveness of RAG workflows.
- RAG as part of a broader toolkit: While RAG is beneficial for certain retrieval tasks, it alone does not provide comprehensive or robust automation capabilities. A broader, integrated approach using additional tools and techniques, such as function calling and agents, is necessary for achieving reliable and accurate automation workflows. Additionally, there are many enhancement mechanisms that can be applied to improve the ultimate accuracy of the RAG system.
Conclusion
This report has walked readers through each component of the retrieval-augmented generation workflow, highlighting both its strengths and limitations. RAG proves beneficial for general information retrieval tasks but reveals significant weaknesses when questions require detailed numerical analysis or nuanced interpretation.
Ultimately, RAG is a useful but limited tool in isolation. To increase accuracy and create robust automation systems, it must be integrated with additional methods, such as function calling and intelligent agents. The next installment of The Automation Ahead series will explore these tools, building toward a comprehensive approach to automation.
[1] https://github.com/CFA-Institute-RPC/The-Automation-Ahead/
[2] A token is a unit of text, such as a word, character, or subword, that models use as the building blocks for processing language.
[3] The transcript is available at https://seekingalpha.com/article/4731856-apple-inc-aapl-q4-2024-earnings-call-transcript.
[4] NAND and DRAM are types of semiconductor memory used in electronic devices.
[5] Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, and Lili Qiu, “Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make Your LLMs Use External Data More Wisely,” arXiv (preprint). https://arxiv.org/pdf/2409.14924v1
[6] For more information on multimodal RAG, see Monica Riedler and Stefan Langer, “Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications,” arXiv (29 October 2024): https://arxiv.org/abs/2410.21943. See also Sergio Paniego, “Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs),” Hugging Face: https://huggingface.co/learn/cookbook/en/multimodal_rag_using_document_retrieval_and_vlms
[7] Qualitative variables include: CEO, Coverage Period, LTI Grant Date, Annual LTI Grant, Time-Based RSU Vesting, Performance-Based RSU Vesting, Governance Arrangements, CEO Pay Alignment, Performance Metrics; quantitative variables include: Total Target LTI Grant, Realized Base Salary, Realized STIs, Realized LT Awards, Realized Other Compensation, Realized Total Compensation.