Reproducible Quantitative Research – Beyond Pure MCP Workflows

Reproducibility is the cornerstone of credible quantitative research. In both academic papers and proprietary trading strategy development, results mean little if others cannot replicate them. Yet in quantitative finance, reproducibility remains challenging due to proprietary data, complex methodologies, and now, increasingly autonomous AI agents.
The latest AI coding assistants, such as Claude Code, Google Gemini, and OpenAI's Codex, combined with the Model Context Protocol (MCP), have revolutionized research workflows.
They can compress weeks of development into hours.
But this power comes with a hidden cost: when AI agents operate autonomously, they can undermine the very reproducibility that makes research credible.
In this post, we'll explore how modern AI tools are transforming quantitative research, why pure agentic workflows threaten reproducibility, and what a better approach to these challenges looks like.
The Evolution: From Text Generator to Active Researcher
AI assistants have rapidly evolved from simple code completion tools into active research partners. Early large language models could only suggest text based on their training.
Today's AI agents can autonomously:
- Fetch and analyze historical market data
- Execute complex multi-step research workflows
- Run backtests and statistical tests
- Generate visualizations and reports
- Commit results to version control
This transformation was enabled by giving LLMs tool-use capabilities. Claude Code was the first to achieve this: since its initial version, it has not just suggested code but actively taken actions on your behalf. It maintains project-wide awareness, navigates documentation, and performs complex tasks from natural language prompts.
By now, both OpenAI and Google have caught up—with Codex and Gemini Code—matching the functionality of Claude Code.
To generalize tool use, Anthropic introduced the Model Context Protocol (MCP) in late 2024.
Understanding MCP: Power and Pitfalls
The Model Context Protocol (MCP) is Anthropic's open-source standard that provides a uniform API for AI models to interact with external tools and services. Instead of hard-coding specific integrations, MCP servers expose tools that AI agents can invoke as needed.
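To make the idea concrete, here is a minimal sketch of an MCP server exposing a single market-data tool, written against the MCP Python SDK's FastMCP helper. The server name, the tool, and its stub return value are illustrative placeholders, not part of any existing connector.

```python
# Minimal MCP server sketch (assumes the MCP Python SDK: `pip install mcp`).
# The daily_bars tool is a hypothetical placeholder; a real server would call an actual data API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("quant-research")


@mcp.tool()
def daily_bars(symbol: str, start: str, end: str) -> list[dict]:
    """Return daily OHLCV bars for `symbol` between `start` and `end` (ISO dates)."""
    # Placeholder: a real implementation would query a market-data provider here.
    return [{"date": start, "open": 0.0, "high": 0.0, "low": 0.0, "close": 0.0, "volume": 0}]


if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so an MCP-capable agent can discover and call it
```

Once a server like this is registered in an agent's configuration, the model discovers `daily_bars` through the protocol's tool listing and calls it with structured arguments; the applications below all follow this same pattern.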
MCP in Quantitative Research
Common MCP applications include:
- Querying financial databases for market data (like Polygon.io's recent MCP connector)
- Executing trades through broker APIs: Alpaca, Tasty, IBKR, etc.
- Fetching social sentiment from Reddit or Twitter
- Screening and analyzing indicators for trading signals (e.g., TradingView MCP)
- Reading and analyzing research papers from arXiv
- And many more
Zen-MCP: The Multi-Model Orchestrator
While not strictly related to quantitative finance, zen-mcp is worth mentioning (and using).
This open-source orchestrator extends agentic coding tools to enable multi-model AI workflows. In practice, this means you can use OpenAI's, Google's, and Anthropic's (and many other) models in the same session. For example, one model can design the task, another can implement it, and a third can review it.
The different models can collaborate and chat with each other, which is impressive to see in action.
The Reproducibility Problem
While powerful, autonomous AI agents introduce several reproducibility challenges.
1. The Opacity Problem
When AI agents autonomously fetch data, perform analysis, and generate results, the exact steps often remain hidden. Unlike executing a script, where every transformation is visible, AI agent workflows can be black boxes. You might get results, but understanding how those results were obtained becomes difficult or impossible.
2. Non-Deterministic Execution
AI models may take different approaches to solving the same problem across runs. This non-determinism means:
- The same research question might yield different methodologies
- Data processing steps may vary
- Rate limits or tier changes can trigger model fallbacks (e.g., Opus → Sonnet), altering tool choices and outputs
Note: Read more about defeating nondeterminism in LLM inference on the Thinking Machines blog; a small mitigation sketch follows.
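When your workflow does call a model directly, you can at least pin and record the knobs that influence determinism. The sketch below uses the OpenAI Python SDK (the model name and prompt are illustrative); even with `temperature=0` and a fixed `seed`, providers only promise best-effort reproducibility, which is exactly why the model's output should not be the load-bearing part of a strategy.

```python
# Sketch: pin and record decoding parameters when calling a model directly.
# Even with temperature=0 and a fixed seed, reproducibility is best-effort only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "Propose a momentum backtest methodology."}],
    temperature=0,
    seed=42,         # best-effort determinism, not a guarantee
)

# Record what actually served the request alongside the output.
print(response.model)               # resolved model version
print(response.system_fingerprint)  # backend configuration identifier; changes when the stack changes
print(response.choices[0].message.content)
```

Logging the resolved model version and system fingerprint next to the output at least lets you detect when the backend serving your requests has changed between runs.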
3. Hidden State and Dependencies
AI agents compound the notorious "hidden state" problem of Jupyter notebooks, where cell execution order affects results, by:
- Dynamically choosing data sources without documentation
- Using different libraries or methods without explicit tracking
- Making assumptions that aren't recorded
Danger Zone: Agentic Trading
Several open-source projects use MCP for quantitative analysis and trading:
- Maverick MCP: "financial data analysis, technical indicators, and portfolio optimization tools directly to your Claude Desktop"
- PrimoAgent: "multi agent AI stock analysis system ... to provide comprehensive daily trading insights and next-day price predictions"
- Alpaca's example of building an MCP-based trading workflow
As teaching demos, they are excellent: they reduce integration friction and demonstrate how quickly you can reach a working prototype. In practice, however, running trading strategies this way is too risky.
In these projects, the trading logic relies on the model’s output, which is inherently non-deterministic and can change over time. To make things worse, you may get downgraded to a cheaper model mid-session due to rate limits or quota exhaustion.
This means the exact same results cannot be reproduced, even if the rules, data, and environment are fixed.
A Simple Solution
Instead of relying on MCP agents to research and trade, use them to generate code that you review, version-control, and run in a controlled environment. Over time, the accumulated code can be curated into a strategy or research library.
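What this looks like in practice: rather than letting an agent run the backtest inside its own session, ask it to emit a self-contained script that you review, commit, and re-run yourself. The sketch below is hypothetical; the file paths, the frozen CSV snapshot, and the toy momentum rule are placeholders for whatever your workflow actually uses.

```python
# momentum_backtest.py -- hypothetical AI-generated script, reviewed and committed by a human.
# Everything that affects the result is pinned: data snapshot, parameters, and random seed.
import hashlib
from pathlib import Path

import numpy as np
import pandas as pd

DATA_SNAPSHOT = Path("data/spx_daily_2010_2024.csv")  # frozen extract, not a live API call
LOOKBACK_DAYS = 90
SEED = 42


def main() -> None:
    np.random.seed(SEED)  # pin randomness even if the current logic is deterministic

    # Verify we are computing on the exact snapshot committed alongside the code.
    digest = hashlib.sha256(DATA_SNAPSHOT.read_bytes()).hexdigest()
    print(f"data sha256: {digest}")

    prices = pd.read_csv(DATA_SNAPSHOT, parse_dates=["date"], index_col="date")["close"]
    momentum = prices.pct_change(LOOKBACK_DAYS)
    signal = (momentum > 0).astype(int).shift(1)   # long when trailing return is positive
    daily_returns = prices.pct_change() * signal
    equity = (1 + daily_returns.fillna(0)).cumprod()

    Path("results").mkdir(exist_ok=True)
    equity.to_csv("results/momentum_equity_curve.csv")
    print(f"final equity multiple: {equity.iloc[-1]:.3f}")


if __name__ == "__main__":
    main()
```

The strategy itself is deliberately trivial; the point is the shape of the artifact: pinned inputs, explicit parameters, a checksum of the data, and outputs written to disk, so a second run by you or a reviewer produces the same numbers.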
Based on our experience with hundreds of hours of AI-assisted strategy and product development, we recommend:
1. Treat AI as a Code Generator, Not an Autonomous Agent
- Use AI to generate reproducible scripts and analysis code
- Review AI-generated plans and code before execution
- Maintain human oversight of critical decisions
2. Version Control Everything
- Regularly commit all analysis scripts, strategies, and utilities to version control
- Include the models and the prompts used to generate the code (e.g., in the Pull Request description)
- Document data sources, extraction timestamps, and filters in a data catalog
- Persist backtest results, metrics, and visualizations in structured storage (a run-manifest sketch follows this list)
3. Use the right model for the task
- Use the most capable model to create a detailed plan for the task (e.g., Anthropic Opus or GPT-5 at the time of writing)
- Review the plan using different models to gain confidence (e.g., OpenAI's o3-pro, Gemini-2.5-pro)
- Use the detailed implementation plan to generate the code; a less capable model can be used here (e.g., Anthropic Sonnet)
- Review the generated code using different models (e.g., Gemini-2.5-pro, R1, and o3-pro)
- Conduct the final review of the generated code using human expertise
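To tie these recommendations together, each run can emit a small manifest recording exactly what produced the result. Below is a minimal sketch; the field names, file paths, and model identifiers are assumptions, not a standard format.

```python
# write_manifest.py -- sketch of a per-run manifest capturing code, data, and model provenance.
# All paths and model names below are illustrative placeholders.
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


manifest = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
    "script": {"path": "momentum_backtest.py", "sha256": sha256("momentum_backtest.py")},
    "data": {"path": "data/spx_daily_2010_2024.csv", "sha256": sha256("data/spx_daily_2010_2024.csv")},
    # Provenance of the generated code: which models planned, implemented, and reviewed it.
    "models": {"plan": "claude-opus", "implement": "claude-sonnet", "review": ["gemini-2.5-pro", "o3-pro"]},
    "prompt_ref": "prompts/momentum_plan.md",  # hypothetical path to the saved prompt
    "seed": 42,
}

Path("results").mkdir(exist_ok=True)
Path("results/manifest.json").write_text(json.dumps(manifest, indent=2))
```

Committing this manifest next to the results turns "trust me, the agent did it" into something a reviewer can actually re-run.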
Note:
Over time, we hope every MCP server and agentic tool will generate audit logs of every tool call, including inputs, outputs, model IDs, and timestamps. This would resolve several reproducibility issues.
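Until that happens, you can approximate such an audit trail yourself by wrapping your own tool functions. A sketch follows; the log location and the `model_id` field are assumptions rather than any tool's built-in feature.

```python
# audit.py -- sketch: wrap tool functions so every call is appended to a JSONL audit log.
import functools
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("logs/tool_calls.jsonl")  # hypothetical location


def audited(tool_fn, model_id: str = "unknown"):
    """Wrap a tool function so every invocation is logged with inputs, outputs, and a timestamp."""

    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        result = tool_fn(*args, **kwargs)
        AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tool": tool_fn.__name__,
            "model_id": model_id,                        # which model requested the call, if known
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
        }
        with AUDIT_LOG.open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")
        return result

    return wrapper


# Usage sketch: daily_bars_logged = audited(daily_bars, model_id="claude-sonnet")
```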
Conclusion
Pure MCP agentic workflows are productivity rockets, but also reproducibility traps. For credible research, treat agents as compilers and planners rather than autonomous researchers. Generate code, pin environments and data, and log every run.
If a result can’t be reproduced from code, config, data snapshot, and a manifest, it’s not research—it’s a demo.