Synthetic Data in Investment Management

doi:10.56227/25.1.24

THEME: TECHNOLOGY

28 July 2025 Research Reports

James Tait

Generative AI can create synthetic datasets to augment workflows such as scenario stress-testing and portfolio optimization. This report explains how, with guidance on assessing data quality and a case study fine-tuning an LLM for sentiment analysis.

Code for the synthetic data case study is available in a GitHub repository.

Report Overview

The investment management industry depends increasingly on timely and high-quality data to drive investment decisions. Yet firms regularly encounter challenges around both data quality and data quantity, such as lack of historical data, costly data collection, data imbalances, and privacy concerns. Synthetic data, which is data that has been artificially generated to replicate the statistical properties of real data, offers a potential solution to these challenges.

The Data Dilemma in Investment Management

“Synthetic Data in Investment Management” discusses the potential of synthetic data in investment management. The report focuses on generative AI approaches to synthetic data generation, including variational autoencoders, generative adversarial networks, diffusion models, and large language models. Unlike more-traditional methods, such as Monte Carlo simulation and bootstrapping, these generative techniques are better suited to modeling the complexities of real-world data and are capable of generating data modalities frequently encountered in finance, such as time-series, tabular, and textual data.
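
For context, the two traditional methods named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not code from the report: the return values are made up, the function names are hypothetical, and the Monte Carlo step assumes returns are normally distributed, which is exactly the kind of simplifying assumption that generative models try to relax.

```python
import random
import statistics


def bootstrap_returns(real_returns, n_samples, seed=0):
    """Bootstrapping: resample observed returns with replacement."""
    rng = random.Random(seed)
    return [rng.choice(real_returns) for _ in range(n_samples)]


def monte_carlo_returns(real_returns, n_samples, seed=0):
    """Monte Carlo: draw from a normal distribution fitted to the observed data."""
    rng = random.Random(seed)
    mu = statistics.mean(real_returns)
    sigma = statistics.stdev(real_returns)
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]


# Hypothetical daily returns, for illustration only.
real = [0.004, -0.012, 0.007, 0.001, -0.003, 0.015, -0.008, 0.002]

boot = bootstrap_returns(real, 1000)   # only ever contains values already seen
mc = monte_carlo_returns(real, 1000)   # new values, but tied to the normal assumption
```

The bootstrap can never produce a return that was not already observed, and the Monte Carlo sample inherits whatever distribution was assumed. Neither captures features such as time dependence unless these are modeled explicitly.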

Despite the potential of generative AI approaches to synthetic data, these methods are currently not widely used in the industry. Promising academic research in this area has yet to transition into widespread adoption.

Bridging Research and Practice

The report aims to shed light on these generative methods by summarizing academic publications that offer proofs of concept, showing how generative AI-based synthetic data can improve tasks such as model training, portfolio optimization, stress testing, and risk analysis. It discusses current practices used to evaluate synthetic data quality and concludes with a case study.

The case study shows how synthetic financial text data was used to improve the performance (F1-score) of a large language model fine-tuned for financial sentiment analysis by nearly 10 percentage points.
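
To make the metric concrete, the sketch below computes a per-class F1-score and an unweighted (macro) average on a toy three-class sentiment example. The labels and predictions are invented, the function names are hypothetical, and macro averaging is an assumption: this page does not state which averaging scheme the case study uses.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true))
    return sum(f1_score(y_true, y_pred, label) for label in labels) / len(labels)


# Invented labels and predictions, for illustration only.
y_true = ["pos", "neg", "neu", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu"]

pos_f1 = f1_score(y_true, y_pred, "pos")   # precision 2/3, recall 1 -> 0.8
overall = macro_f1(y_true, y_pred)
```

A gain of 10 percentage points means the score itself rises by 0.10 (for example, from 0.70 to 0.80), which is different from a 10% relative improvement.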

A Roadmap for Adoption

The report expects the integration of synthetic data to mirror the recent experimental adoption of large language models (LLMs) across the industry: potentially transformative, but currently lacking standardized frameworks and guidance. Practitioners should begin the integration process by assessing their workflows and identifying pain points that synthetic data could address. Starting with simpler, more transparent methodologies, practitioners can experiment with progressively more sophisticated models, frequently evaluating and comparing performance against real-world data and benchmark models. Staying up to date with the latest research is essential, as new methods and use cases continue to emerge in this rapidly evolving field.

Key Takeaways

Synthetic data can address key data constraints in financial workflows, including data scarcity, dataset imbalances, and privacy issues.
Traditional statistical approaches to synthetic data creation (e.g., bootstrapping, Monte Carlo simulation) remain useful but can struggle to model complex or unstructured data.
Generative AI models can create flexible, high-fidelity synthetic data across modalities, including textual, time-series, and tabular data, by learning deeper patterns in real datasets.
These models can support core investment tasks, such as model training, backtesting, portfolio optimization, risk modeling, and financial sentiment analysis.
Ensuring synthetic data quality is critical. Use both qualitative methods (e.g., visualizations) and quantitative methods (e.g., statistical tests and train-on-synthetic, test-on-real evaluation).
Benchmark synthetic data-augmented workflows against real-data-only baselines, and update models regularly to avoid data drift. If synthetic data is already being used, benchmark newer, generative approaches against existing implementations.
More data isn’t always better — experiment with different synthetic-to-real data ratios to optimize results.
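
The evaluation advice in the takeaways above (a real-data-only baseline, train-on-synthetic, test-on-real, and a sweep over synthetic-to-real ratios) can be sketched with a deliberately simple toy setup. Everything here is an assumption for illustration: the data are simulated, the "generator" is just a slightly biased copy of the real distribution, and the classifier is a one-feature threshold rather than anything used in the report.

```python
import random


def make_data(n, mean0, mean1, rng):
    """Toy labelled data: one feature, two classes drawn from unit-variance normals."""
    data = [(rng.gauss(mean0, 1.0), 0) for _ in range(n // 2)]
    data += [(rng.gauss(mean1, 1.0), 1) for _ in range(n - n // 2)]
    return data


def fit_threshold(train):
    """'Train' a classifier: place the threshold midway between the class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def accuracy(threshold, test):
    """Share of test points whose side of the threshold matches their label."""
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)


rng = random.Random(42)
real_train = make_data(40, 0.0, 2.0, rng)     # scarce real training data
real_test = make_data(2000, 0.0, 2.0, rng)    # held-out real data, never used for training
synthetic = make_data(400, 0.0, 2.4, rng)     # stand-in for an imperfect generator

# 1. Real-data-only baseline.
baseline = accuracy(fit_threshold(real_train), real_test)

# 2. Train on synthetic, test on real (TSTR).
tstr = accuracy(fit_threshold(synthetic), real_test)

# 3. Sweep synthetic-to-real ratios, always scoring on held-out real data.
results = {}
for ratio in (0.0, 0.5, 1.0, 2.0, 5.0):
    n_syn = int(ratio * len(real_train))
    mixed = real_train + rng.sample(synthetic, n_syn)
    results[ratio] = accuracy(fit_threshold(mixed), real_test)
```

The key design choice is that every configuration is scored on the same held-out real data, so the baseline, the TSTR score, and each ratio are directly comparable. Because the stand-in generator is biased, adding more synthetic data pulls the threshold away from its ideal position, which is one way that more data can fail to help.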

Synthetic data, powered by generative AI, stands at the frontier of innovation in investment management — offering not just a workaround to data scarcity, but a catalyst for smarter, faster, and more resilient decision-making. As the industry experiments with AI more broadly, embracing synthetic data isn't just a technical upgrade — it's a strategic imperative.

Publisher Information

CFA Institute doi.org/10.56227/25.1.24

For further reading on how advanced AI techniques — including those that rely on novel data approaches like synthetic data — are applied throughout investment management, see AI in Asset Management from CFA Institute Research Foundation.