Practical Guide for LLMs in the Financial Industry
Introduction
By Alex Spyrou and Brian Pisaneschi, CFA
Large language models (LLMs) are advanced artificial intelligence (AI) models trained to understand and generate human-like text based on vast datasets, often containing millions or even billions of sentences. At the core of LLMs are deep neural networks that learn patterns, relationships, and contextual nuances in language. By processing sequences of words, phrases, and sentences, these models can predict and generate coherent responses, answer questions, create summaries, and even carry out complex, specialized tasks.
In the financial industry, the adoption of LLMs is still in its early stages, but interest is rapidly growing. Financial institutions are beginning to explore how these models can enhance various processes, such as analyzing financial reports, automating customer service, detecting fraud, and conducting market sentiment analysis. While some organizations are experimenting with these technologies, widespread integration is limited due to such factors as data privacy concerns, regulatory compliance, and the need for specialized fine-tuning to ensure accuracy in finance-specific applications.
In response to these challenges, many organizations are adopting a hybrid approach that combines frontier large-scale LLMs with retrieval-augmented generation (RAG) systems.[1] This approach leverages the strengths of LLMs for general language understanding while incorporating domain-specific data through retrieval mechanisms to improve accuracy and relevance. However, the value of smaller, domain-specific models remains significant, especially for tasks requiring efficient processing or where data privacy and regulatory compliance are of utmost concern. These models offer tailored solutions that can be fine-tuned to meet the stringent demands of the financial industry, providing a complementary alternative to larger, more generalized systems.
This paper serves as a starting point for financial professionals and organizations looking to integrate LLMs into their workflows. It provides a broad overview of various financial LLMs and techniques available for their application, exploring how to select, evaluate, and deploy these tools effectively.
Open-Source vs. Closed-Source Models: Benefits and Challenges
Transitioning from exploring the potential of LLMs in finance to selecting the right model is a critical step. With the vast and evolving array of models available, financial institutions face an important decision: choosing between open-source and closed-source options. Each model type presents unique benefits and challenges, impacting such factors as customization, data control, and cost. In this guide, we delve into the distinctions between open-source and closed-source models, examining how each can be strategically leveraged to meet the specific needs of financial professionals and support various use cases in a secure, compliant, and efficient manner.
Open-source models provide unrestricted access to the underlying code and parameters, allowing organizations to customize them for proprietary applications. These models, such as Meta’s LLaMA series and the FinLLMs built on it (open-source LLMs fine-tuned on financial text), can be fine-tuned on unique datasets to adapt to specific financial contexts. Fine-tuning open-source models is cost-effective, allowing frequent updates at a fraction of the cost of training a model from scratch.
In contrast, closed-source models (e.g., ChatGPT, Claude, BloombergGPT) are commercially licensed and do not allow access to the internal model parameters or training data. While often pretrained on extensive datasets and optimized for various tasks, these models offer limited customization potential. Financial institutions using closed-source models must rely on external application programming interfaces (APIs, which are protocols allowing software applications to communicate with each other), incurring higher operational costs, particularly for large-scale tasks.
Benefits of Using Open-Source Models
Open-source models offer several distinct advantages:
- Cost efficiency: Fine-tuning open-source models offers significant cost savings because they can be adapted to specific datasets at a fraction of the cost of developing a comparable high-performing model from scratch. Lightweight adaptation of open-source LLMs usually costs less than $300 per training session, making it an economical choice for ongoing adjustments and specialized applications.
- Flexibility and customization: Open-source models can be fine-tuned to optimize for specialized tasks, such as real-time stock price prediction, targeted sentiment analysis, credit scoring, and fraud detection. This flexibility enables real-time adaptability and seamless integration with proprietary financial datasets, allowing institutions to tailor models to their unique requirements.
- Transparency and interpretability: With open-source models, developers can access and modify the model architecture directly, which enables greater control over interpretability features. For example, developers can adjust how the model processes specific inputs, test different interpretability techniques, or even insert custom layers or logic to improve transparency in model outputs. This level of access can lead to better transparency, especially in such fields as finance, where understanding model behavior is crucial for trust and regulatory compliance.
- Data privacy: Open-source architectures can be deployed on an organization’s own infrastructure, whether on premises or within a private cloud, which keeps proprietary data within controlled environments. In contrast, using closed-source architectures requires organizations to interact with third-party platforms, which often necessitates sending proprietary data through external APIs. This could expose sensitive information to external entities, raising potential security and privacy concerns—especially in regulated industries, such as finance, where data confidentiality and compliance are paramount.
Challenges of Open-Source Models
While open-source models offer distinct advantages, they also come with unique challenges:
- Data quality and curation: Financial data are complex and varied, covering everything from structured financial statement data to unstructured and alternative data, such as social media sentiment.[2] For open-source models to perform well, they require carefully curated, task-specific datasets. In closed-source models, however, much of this heavy lifting has already been handled by large data science teams, who have pretrained the models on vast, high-quality datasets, minimizing the need for rigorous data preparation. Open-source models, in contrast, rely on the organization’s own data preprocessing efforts to ensure accuracy, particularly when working with unstructured or rapidly changing data sources, where insufficient curation can introduce noise and affect model reliability.
- Hallucinations and accuracy challenges: Financial language is highly technical and context dependent, which can make LLMs susceptible to “hallucinations”—plausible-sounding but incorrect responses. Mitigating these errors often requires advanced techniques, such as reinforcement learning from human feedback (RLHF), which can improve model accuracy but demands substantial resources and specialized domain expertise, often beyond the reach of smaller teams. Additional methods to enhance accuracy include setting strict rules to limit model responses when confidence is low, using RAG to source real-time data, keeping models updated to reflect current information, and using chain of thought (CoT) prompting to encourage step-by-step reasoning. Each of these techniques contributes to improved reliability, but they require careful implementation to be effective.
- Maintenance and expertise requirements: Open-source models need skilled data scientists and machine learning engineers for ongoing fine-tuning and retraining to ensure they remain accurate over time. This might be a barrier for smaller institutions with limited access to in-house AI experts.
Summary of Benefits and Challenges
In summary, open-source models offer flexibility and cost-effectiveness, making them ideal for organizations that prioritize customizability. However, they may require more technical expertise and maintenance. Closed-source models, though more costly, provide out-of-the-box reliability and may reduce operational complexity for teams with limited AI expertise.
Evaluating and Fine-Tuning Models
Incorporating LLMs into financial workflows requires more than just selecting a capable model; it demands a comprehensive evaluation of its suitability for specific tasks and fine-tuning to meet the unique demands of those tasks. These processes ensure that models deliver the required performance and accuracy in the context of their intended applications.
This section outlines structured methods for evaluating LLMs, including the use of task-based datasets and benchmarks tailored to financial tasks. Additionally, it provides insights into adaptation techniques and best practices to optimize models for real-world deployment.
Evaluating Model Suitability: Task-Based Evaluation Datasets
To effectively use LLMs for financial tasks, selecting the right model suited to the specific task is crucial. Evaluating a model’s suitability involves benchmarking, a process that has advanced significantly in the machine learning community in recent years. Evaluation benchmarks consist of task-specific datasets and metrics tailored to various objectives. For example, classification tasks may use such metrics as F1 score or accuracy, while regression tasks may rely on mean squared error (MSE) or R-squared. A model’s performance on these benchmarks helps assess its suitability for a given task.
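To make these metrics concrete, the short sketch below computes accuracy and F1 score for a toy sentiment-classification task and MSE for a toy regression task using scikit-learn. The labels, predictions, and return figures are invented purely for illustration.

```python
# A minimal sketch of benchmark-style metric computation with scikit-learn.
# All labels and values below are invented for illustration only.
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Classification task (e.g., sentiment analysis): gold labels vs. model output
y_true = ["positive", "neutral", "negative", "positive", "neutral"]
y_pred = ["positive", "negative", "negative", "positive", "neutral"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))

# Regression task (e.g., a numeric forecast): mean squared error
returns_true = [0.012, -0.034, 0.008, 0.021]
returns_pred = [0.010, -0.020, 0.005, 0.030]
print("MSE:", mean_squared_error(returns_true, returns_pred))
```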
The following list shows some key financial tasks for which LLMs are commonly evaluated, along with example datasets. The datasets are annotated corpora derived from financial documents, news, and blogs; each pairs an input query to the LLM with the corresponding answer (ground truth).
- Sentiment analysis: This task assesses a model’s ability to gauge market sentiment from financial content, such as news headlines, blog posts, and reports. Notable datasets include the Financial PhraseBank (FPB)[3] and FiQA-SA.[4]
- Example of FPB:
- Input prompt: Analyze the sentiment of this statement extracted from a financial news article. Provide your answer as either negative, positive, or neutral. Text: “We have analyzed Kaupthing Bank Sweden and found a business which fits well into Alandsbanken,” said Alandsbanken’s chief executive Peter Wiklof in a statement.
- Answer: Positive
- Numerical reasoning in conversational AI: Focused on complex question-answering, this task evaluates the model’s ability to perform sophisticated numerical reasoning over financial documents, often through such datasets as FinQA[5] and ConvFinQA,[6] which involve analysis of earnings reports.
- Stock movement prediction: For this task, models are evaluated on their ability to predict stock price trends (rise or fall) based on curated datasets, such as ACL18[7] and BigData22.[8]
- Financial text summarization: This task evaluates a model’s ability to produce coherent and informative summaries of financial documents, a crucial skill for interpreting dense financial information. Common datasets used for assessing financial summarization include ECTSum[9] and EDTSum.[10]
- Stock trading strategy formulation: This advanced task evaluates a model’s proficiency in synthesizing diverse information to create and simulate trading strategies. FinTrade[11] is one dataset curated for this task.
These evaluation datasets provide a structured approach for evaluating LLMs in finance, allowing organizations to select models that meet the specific demands of their financial applications.
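As a concrete starting point, the snippet below loads the FPB dataset cited above from the Hugging Face Hub with the `datasets` library and prints one record. The dataset ID comes from the endnotes; the split and column names are assumptions to verify against the dataset card.

```python
# A minimal sketch of inspecting a financial evaluation dataset.
# The dataset ID comes from the FPB link in the endnotes; split and column
# names are assumptions -- check the dataset card on the Hugging Face Hub.
from datasets import load_dataset

fpb = load_dataset("TheFinAI/en-fpb")  # Financial PhraseBank (FPB)
print(fpb)                # shows the available splits and columns
example = fpb["test"][0]  # assumes a "test" split exists
print(example)            # typically an input text and a gold sentiment label
```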
Evaluation Benchmarks
Evaluation benchmarks have been pivotal in standardizing the assessment of language models across a range of financial tasks. These benchmarks typically involve evaluation datasets specifically curated for different tasks, as mentioned earlier. Some notable examples in the financial domain are FLUE, FLARE, and FinBEN.
- FLUE (Financial Language Understanding Evaluation): Launched in 2022, FLUE was the first evaluation benchmark tailored for financial natural language processing (NLP) tasks. It covers core NLP tasks, such as named entity recognition, news headline classification, and sentiment analysis, providing a foundation for evaluating models’ basic understanding of financial language.
- FLARE: Introduced in 2023, FLARE expanded the scope of financial benchmarks by including both NLP and financial prediction tasks, such as stock movement forecasting. It integrates time-series data, allowing for a more comprehensive evaluation of models on tasks that require temporal insights.
- FinBEN: The most recent and extensive benchmark, released in the summer of 2024, FinBEN covers 36 datasets and 24 tasks across multiple categories, including information extraction, risk management, decision making, and text generation. This makes FinBEN a versatile framework for assessing LLMs across a wide spectrum of complex financial applications.
Case Study
In this review, we conducted a comprehensive analysis of existing language models developed or fine-tuned specifically for financial tasks, comparing them against general purpose models, such as the GPT series, that are trained on broad, non-domain-specific datasets. Our objective was to assess the suitability and effectiveness of each category—domain-specific models versus out-of-the-box general purpose models—for a range of financial applications. By comparing the models’ performance across various tasks, we aim to provide some general insights into which types of models are better suited for specific financial tasks and under what conditions domain adaptation adds value.[12]
Next, we outline key financial tasks, followed by some insights into model performance in each area.
Sentiment Analysis and Headline Classification
FinLLMs, such as FinMA 7B or FinGPT 7B, consistently outperform general purpose LLMs in sentiment analysis and headline classification in financial contexts. This is due to domain-specific instruction tuning, which enables FinLLMs to better understand nuanced financial sentiment and terminology. General purpose models, lacking specialized financial knowledge, often struggle to interpret financial sentiment accurately.
Numerical Reasoning and Question Answering
General purpose LLMs, such as GPT-based models, outperform FinLLMs in complex numerical reasoning and question-answering tasks. This is due to their extensive pretraining on a broad range of mathematical and reasoning data, which enhances their ability to handle intricate calculations and logical reasoning. FinLLMs, which are less exposed to mathematical data, show limitations in these tasks, highlighting a need for mathematical pretraining to improve performance in financial reasoning.
Stock Movement Prediction
Stock movement prediction remains challenging for both FinLLMs and general purpose LLMs, with no model achieving consistently high accuracy. However, FinLLMs, such as FinMA 7B full—a LLaMA 7 billion parameter model fine-tuned for complex financial tasks—demonstrate better results in specific evaluations, indicating that domain-specific pretraining and fine-tuning improve predictive performance. The inherent complexity of stock movement prediction, however, suggests that further model adaptations may be necessary to achieve reliable and robust results.
In contrast, traditional time-series models, such as regression techniques and statistical approaches (e.g., autoregressive integrated moving average, or ARIMA), as well as deep learning techniques, such as long short-term memory (LSTM), are often better suited for stock price prediction tasks. These models, specifically trained to process sequential numerical data, are computationally efficient and generally require fewer resources to train and deploy. Their targeted nature makes them a practical choice for tasks focused exclusively on numerical time-series data.
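To illustrate how lightweight such a baseline can be, the sketch below fits an ARIMA model to a synthetic price series with `statsmodels`. The data, random seed, and (p, d, q) order are invented for illustration, not a recommended specification.

```python
# A minimal ARIMA baseline sketch with statsmodels. The synthetic series
# and the (p, d, q) order are invented for illustration only.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, 250))  # synthetic random-walk prices

model = ARIMA(prices, order=(1, 1, 0)).fit()  # AR(1) on first differences
forecast = model.forecast(steps=5)            # predict the next five steps
print(forecast)
```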
Financial Text Summarization
Financial text summarization remains challenging for both FinLLMs and general purpose LLMs because it requires a nuanced understanding of complex financial language. General purpose models, such as GPT and Google’s Gemini—often exceeding 1 trillion parameters—tend to perform slightly better because of broad fine-tuning for coherence and conciseness. However, smaller models, such as InvestLM 65B—a LLaMA-based FinLLM fine-tuned for financial advice—demonstrate that targeted domain-specific tuning can allow them to match the performance of larger general purpose models in summarization tasks.
Summary
FinLLMs excel in tasks requiring financial language understanding, such as sentiment analysis, but tend to lag behind general purpose models in areas needing complex reasoning or mathematical skills. In such tasks as stock prediction and summarization, both model types encounter limitations, though FinLLMs gain some advantage through specialized tuning. Notably, the larger scale of such models as GPT and Gemini affects their performance, resource requirements, and suitability for specific financial applications, with the larger models offering more nuanced language comprehension but at a higher computational cost.
Model Adaptation Techniques
LLM adaptation techniques are methods to tailor large language models for domain-specific tasks, but they differ in complexity, cost, and the level of customization achieved.
- In-context learning (ICL): ICL is a technique in which task demonstrations are integrated into the prompt in a natural language format. This approach allows pretrained LLMs to address new tasks without fine-tuning the model.
  - Zero-shot learning: In zero-shot learning, the model performs a task without seeing any task-specific examples, relying solely on its pretrained knowledge to generalize and generate relevant responses.
  - One-shot learning: One-shot learning provides the model with a single input–output example, which it uses alongside its pretrained knowledge to understand and complete the task.
  - Few-shot learning: In few-shot learning, the model is given a handful of input–output examples, enabling it to better understand task patterns and produce more accurate responses for new tasks.
  - CoT prompting: CoT prompting improves large language models’ reasoning by including intermediate steps in the prompt, especially enhancing performance in complex tasks when paired with few-shot prompting. This approach is particularly useful in such fields as finance, where it boosts model accuracy for tasks requiring layered calculations and logical decisions. (A prompt sketch combining few-shot examples with CoT appears after this list.)
- Model fine-tuning:[13] Fine-tuning involves retraining a model on domain-specific data by updating its parameters to improve performance on specialized tasks. This process requires more resources but produces a model with deep task-specific knowledge. There are two primary approaches: full fine-tuning and parameter-efficient fine-tuning.
  - Full fine-tuning: This approach updates all model parameters with domain-specific data, making it the most powerful but also the most resource-intensive and costly method. It is ideal for tasks where high accuracy and deep contextual understanding are essential.
  - Parameter-efficient fine-tuning (PEFT): Techniques such as low-rank adaptation (LoRA) and quantized LoRA (QLoRA) modify only a subset of model parameters, allowing for faster, less resource-intensive fine-tuning. These methods are effective for creating specialized models without the high computational costs of full fine-tuning.
- Retrieval-augmented generation: RAG is a powerful technique for adapting large language models by combining information retrieval with language generation. This approach is particularly effective for tasks that require access to external, dynamic information sources, such as question-answering systems. Rather than relying solely on pretrained knowledge, RAG retrieves relevant documents from an external knowledge base and includes them in the model’s prompt, enhancing both accuracy and relevance. (A minimal retrieval sketch appears after Exhibit 1.)
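To illustrate in-context learning, the sketch below assembles a few-shot sentiment prompt that includes brief chain-of-thought demonstrations. The demonstrations are invented, and `query_llm` is a hypothetical stand-in for whatever API client or local model your organization uses.

```python
# A minimal sketch of few-shot prompting with chain-of-thought reasoning.
# The demonstrations are invented; query_llm is a hypothetical placeholder
# for your own model call (an API client or a local Hugging Face pipeline).

FEW_SHOT_EXAMPLES = """\
Text: The company reported record quarterly revenue, beating estimates.
Reasoning: Record revenue and an earnings beat are good news for investors.
Sentiment: positive

Text: The firm announced layoffs after missing profit targets.
Reasoning: Layoffs and missed targets signal deteriorating performance.
Sentiment: negative
"""

def build_prompt(text: str) -> str:
    """Combine task instructions, few-shot demos, and the new input."""
    return (
        "Classify the sentiment of the financial text as positive, "
        "negative, or neutral. Reason step by step before answering.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Text: {text}\n"
        "Reasoning:"
    )

prompt = build_prompt("The bank's net interest margin was flat year over year.")
# response = query_llm(prompt)  # hypothetical model call
print(prompt)
```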
Exhibit 1 provides a comparison of model adaptation techniques, highlighting their unique features and practical applications.
Exhibit 1. Model Adaptation Techniques
| Feature | ICL | Full Fine-Tuning | PEFT | RAG |
|---|---|---|---|---|
| Parameter Updates | None | All parameters | Limited parameters | None (uses external retrieval) |
| Cost and Resources | Low, quick implementation | High, computationally intensive | Moderate, less computationally intensive | Moderate, requires retrieval infrastructure |
| Degree of Specialization | Moderate, flexible | High, deep task alignment | High, task alignment with efficiency | High relevance without internal changes |
| Use Case | Rapid customization for general tasks | Precision in domain-specific tasks | Specialized tasks with reduced resource needs | Real-time integration of external data |
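As referenced in the RAG entry above, the following minimal retrieval sketch embeds a small document store with `sentence-transformers`, retrieves the passage most similar to the user’s question, and prepends it to the prompt. The documents are invented, the encoder choice is an assumption, and the final model call remains a hypothetical placeholder.

```python
# A minimal RAG sketch: embed documents, retrieve the best match for a
# question, and prepend it to the prompt. The documents are invented.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small general encoder

documents = [
    "Q3 revenue rose 12% year over year, driven by wealth management fees.",
    "The board approved a $500 million share buyback program in October.",
    "Net interest margin compressed to 2.1% amid falling rates.",
]
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

question = "What happened to the net interest margin?"
q_embedding = embedder.encode(question, convert_to_tensor=True)

# Cosine similarity between the question and every document
scores = util.cos_sim(q_embedding, doc_embeddings)[0]
best_doc = documents[int(scores.argmax())]

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context: {best_doc}\n\n"
    f"Question: {question}\n"
    "Answer:"
)
# response = query_llm(prompt)  # hypothetical model call, as before
print(prompt)
```

In production, the in-memory list would typically be replaced by a vector database so the knowledge base can be refreshed without touching the model.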
Practical Steps for Adapting Open-Source LLMs
Financial professionals interested in building and deploying LLMs can leverage open-source models on such platforms as Hugging Face, which hosts a variety of financial and general purpose models. The following guide outlines the key steps to ensure effective model development and deployment.
Step 1: Identify Your Task
Begin by defining the specific task the model will perform. Clearly identifying the purpose is essential because different tasks require different model capabilities. A well-defined task helps guide model selection and fine-tuning efforts, ensuring you choose a model aligned with your needs.
Step 2: Choose a Model
Browse the Hugging Face Model Hub to find a model suited to your task. Consider popular open-source financial models, such as the following:
- FinGPT: Useful for a wide range of financial applications
- FinMA: Optimized for sentiment analysis and headline classification in financial contexts
- InvestLM: Effective for providing financial advice and summarization
- FinLLaMA: Effective in tasks requiring structured reasoning and complex calculations, such as in forming stock trading strategies
Also, evaluate general purpose open-source models (e.g., LLaMA3, Falcon, Mistral, Qwen, Gemma) if they align with your requirements. Review each model’s documentation to understand its strengths, limitations, and suitability for your task.
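To test-drive a candidate quickly, you can load it with the `transformers` pipeline API, as in the sketch below. The model ID is an assumption chosen for cheap experimentation; any causal language model from the Hub can be substituted.

```python
# A minimal sketch of test-driving a candidate open-source model with the
# transformers pipeline API. The model ID is an assumption; substitute any
# causal LM from the Hugging Face Hub.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
out = generator(
    "Classify the sentiment of this headline as positive, negative, or "
    "neutral: 'Regional bank raises full-year guidance.'\nSentiment:",
    max_new_tokens=10,
)
print(out[0]["generated_text"])
```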
Step 3: Adapt the Model with Your Data
Once you’ve selected a model, the next step is to tailor it to your specific needs by fine-tuning. Fine-tuning involves training the model on data that reflects your unique requirements, enhancing its relevance and accuracy. Use high-quality, task-specific datasets where possible, such as proprietary financial data.
For example, if you’re working on sentiment analysis, training with financial sentiment datasets will make the model more responsive to the nuances of financial language and sentiment.
Step 4: Evaluate Model Performance
After adaptation, it’s essential to evaluate the model to ensure it performs well on your task. Use financial evaluation benchmarks, such as FLARE or FinBEN, which offer standardized datasets and metrics to assess accuracy, relevance, and other key performance indicators.
Evaluation is crucial for identifying strengths and potential areas of improvement, providing confidence that the model meets your standards before deployment.
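A hedged sketch of this evaluation step appears below: it iterates over a benchmark split, compares model outputs with gold labels, and reports accuracy and macro F1. The `predict_sentiment` function is a placeholder for your adapted model, and the split and column names are assumptions to verify against the dataset card.

```python
# A minimal evaluation-loop sketch. predict_sentiment is a hypothetical
# wrapper around your adapted model; split and column names are assumptions.
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score

def predict_sentiment(text: str) -> str:
    # Placeholder: call your fine-tuned model here.
    return "neutral"

data = load_dataset("TheFinAI/en-fpb")["test"]  # assumed split name
y_true, y_pred = [], []
for row in data:
    y_true.append(row["answer"])                   # assumed gold-label column
    y_pred.append(predict_sentiment(row["text"]))  # assumed text column

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
```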
Practical Recommendations
Given the rapid pace of change in financial data, keeping your models updated is crucial. Rather than engaging in extensive retraining each time, consider using such techniques as LoRA, which enables frequent, low-cost updates. By integrating these lightweight updates, you can keep your models in sync with the latest financial news and trends without incurring the time and resource costs of a full retraining process.
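Because LoRA stores its updates in a small adapter file, refreshing a model can be as simple as loading a newly trained adapter onto the unchanged base model, as in the hedged sketch below (the model ID and adapter path are assumptions).

```python
# A minimal sketch of swapping in a freshly trained LoRA adapter without
# retraining the base model. The model ID and adapter path are assumptions.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "adapters/latest-news-update")

# Optionally merge the adapter into the base weights for faster inference:
model = model.merge_and_unload()
```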
To maintain consistent performance, it’s beneficial to regularly evaluate your models against standardized benchmarks, such as FinBEN. This ongoing evaluation helps ensure that your models continue to meet accuracy and relevance standards as financial tasks and market demands evolve. Regular benchmarking acts as a quality check, confirming that your AI solutions remain aligned with your organization’s goals over time.
Finally, as you decide between open-source and closed-source models, it’s essential to weigh the tradeoffs. Open-source models offer flexibility and cost savings, making them an excellent choice for organizations looking to customize solutions affordably. However, they may require a higher level of technical expertise and ongoing maintenance. Closed-source models, while more costly, provide a ready-to-use solution that may reduce the operational burden on teams with limited AI resources. Each choice has its merits, so consider your organization’s priorities, resources, and long-term goals when selecting the best approach.
Conclusion
In today’s data-driven financial industry, LLMs offer transformative opportunities for streamlining processes and gaining insights. By using open-source models and financial benchmarks, financial professionals can develop cost-effective, customized AI solutions tailored to their unique needs. With the right models, tools, and evaluation benchmarks, financial institutions can harness the full potential of AI, gaining speed and precision in navigating complex markets and making data-informed decisions.
Note to Readers: For practical resources and further details, please visit our new RPC Labs GitHub page. There, you’ll find
- a sample notebook with examples on running a Hugging Face LLM and
- the full results of the models’ evaluation, including benchmark comparisons and performance analysis.
GLOSSARY
Hallucinations: In the context of language models, hallucinations refer to plausible-sounding but incorrect or fabricated information generated by the model. This can be particularly problematic in fields such as finance, where accuracy is crucial.
Chain of thought (CoT) prompting: A technique that guides the model to generate step-by-step reasoning or explanations in its responses. CoT prompting helps improve the accuracy of complex tasks, such as numerical reasoning or logical problem solving, by encouraging the model to break down its thought process.
Retrieval-augmented generation (RAG): A model adaptation method that combines a language model with a retrieval system to fetch relevant information from an external source in real time. This allows the model to incorporate up-to-date or domain-specific knowledge into its responses, enhancing accuracy and relevance without changing the model’s internal parameters.
Headline classification: Financial news headlines contain important time-sensitive information on price changes. This task, initially developed for the gold commodity domain, can be used to analyze the various hidden meanings in news headlines that might be of interest to investors and policymakers.
[1]T. Tully, J. Redfern, and D. Xiao, “2024: The State of Generative AI in the Enterprise,” Menlo Ventures (20 November 2024). https://menlovc.com/2024-the-state-of-generative-ai-in-the-enterprise/.
[2]For an overview of structured, unstructured, and alternative data types, see B. Pisaneschi, “Unstructured Data and AI: Fine-Tuning LLMs to Enhance the Investment Process,” CFA Institute (1 May 2024). https://rpc.cfainstitute.org/research/reports/2024/unstructured-data-and-ai.
[3]https://huggingface.co/datasets/TheFinAI/en-fpb.
[4]https://huggingface.co/datasets/TheFinAI/fiqa-sentiment-classification?row=5.
[5]https://huggingface.co/datasets/TheFinAI/flare-finqa.
[6]https://huggingface.co/datasets/ChanceFocus/flare-convfinqa.
[7]https://huggingface.co/datasets/TheFinAI/flare-sm-acl.
[8]https://huggingface.co/datasets/TheFinAI/flare-sm-bigdata.
[9]https://huggingface.co/datasets/TheFinAI/flare-ectsum.
[10]https://huggingface.co/datasets/TheFinAI/flare-edtsum.
[11]https://huggingface.co/datasets/TheFinAI/FinTrade_train.
[12]To view a full comparison table with evaluation benchmark results, see the table on the RPC Labs GitHub webpage: https://github.com/CFA-Institute-RPC/The-Automation-Ahead/blob/main/Practical%20Guide%20For%20LLMs%20In%20the%20Financial%20Industry/FinLLM%20Comparison%20Table.md.
[13]For an overview of fine-tuning techniques and an environmental, social, and governance (ESG) case study showcasing their value for investment professionals, see Pisaneschi, “Unstructured Data and AI.” https://rpc.cfainstitute.org/research/reports/2024/unstructured-data-and-ai.