Reproducible Quantitative Research – Beyond Pure MCP Workflows

Reproducibility is the cornerstone of credible quantitative research. In both academic papers and proprietary trading strategy development, results mean little if others cannot replicate them. Yet in quantitative finance, reproducibility remains challenging due to proprietary data, complex methodologies, and now, increasingly autonomous AI agents.
The latest AI coding assistants, such as Claude Code, Google Gemini, and OpenAI's Codex, combined with the Model Context Protocol (MCP), have revolutionized research workflows.
They can compress weeks of development into hours.
But this power comes with a hidden cost: when AI agents operate autonomously, they can undermine the very reproducibility that makes research credible.
In this post, we'll explore how modern AI tools are transforming quantitative research, why pure agentic workflows threaten reproducibility, and what a better approach to these challenges looks like.
The Evolution: From Text Generator to Active Researcher
AI assistants have rapidly evolved from simple code completion tools into active research partners. Early large language models could only suggest text based on their training.
Today's AI agents can autonomously:
- Fetch and analyze historical market data
- Execute complex multi-step research workflows
- Run backtests and statistical tests
- Generate visualizations and reports
- Commit results to version control
This transformation was enabled by giving LLMs tool-use capabilities. Claude Code was the first to achieve this: since its initial version, it has not just suggested code but actively taken actions on your behalf. It maintains project-wide awareness, navigates documentation, and performs complex tasks from natural language prompts.
By now, both OpenAI and Google have caught up—with Codex and Gemini Code—matching the functionality of Claude Code.
To generalize tool use, Anthropic introduced the Model Context Protocol (MCP) in late 2024.
Understanding MCP: Power and Pitfalls
The Model Context Protocol (MCP) is Anthropic's open-source standard that provides a uniform API for AI models to interact with external tools and services. Instead of hard-coding specific integrations, MCP servers expose tools that AI agents can invoke as needed.
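To make the idea concrete, here is a minimal sketch of an MCP server exposing a single market-data tool, written against the MCP Python SDK's FastMCP helper. The server name, the tool, and its stub return value are illustrative placeholders, not part of any existing connector.

```python
# Minimal MCP server sketch (assumes the MCP Python SDK: `pip install mcp`).
# The daily_bars tool is a hypothetical placeholder; a real server would call an actual data API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("quant-research")


@mcp.tool()
def daily_bars(symbol: str, start: str, end: str) -> list[dict]:
    """Return daily OHLCV bars for `symbol` between `start` and `end` (ISO dates)."""
    # Placeholder: a real implementation would query a market-data provider here.
    return [{"date": start, "open": 0.0, "high": 0.0, "low": 0.0, "close": 0.0, "volume": 0}]


if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio so an MCP-capable agent can discover and call it
```

Once a server like this is registered in an agent's configuration, the model discovers `daily_bars` through the protocol's tool listing and calls it with structured arguments; the applications below all follow this same pattern.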
MCP in Quantitative Research
Common MCP applications include:
- Querying financial databases for market data (like Polygon.io's recent MCP connector)
- Executing trades through broker APIs: Alpaca, Tasty, IBKR, etc.
- Fetching social sentiment from Reddit or Twitter
- Screening and analyzing indicators for trading signals (e.g., TradingView MCP)
- Reading and analyzing research papers from arXiv
- And many more
Zen-MCP: The Multi-Model Orchestrator
While not strictly related to quantitative finance, zen-mcp is worth mentioning (and using).
This open-source orchestrator extends agentic coding tools to enable multi-model AI workflows. In practice, this means you can use OpenAI's, Google's, and Anthropic's (and many other) models in the same session. For example, one model can design the task, another can implement it, and a third can review it.
The different models can collaborate and chat with each other, which is impressive to see in action.
The Reproducibility Problem
While powerful, autonomous AI agents introduce several reproducibility challenges.
1. The Opacity Problem
When AI agents autonomously fetch data, perform analysis, and generate results, the exact steps often remain hidden. Unlike executing a script, where every transformation is visible, AI agent workflows can be black boxes. You might get results, but understanding how those results were obtained becomes difficult or impossible.
2. Non-Deterministic Execution
AI models may take different approaches to solving the same problem across runs. This non-determinism means:
- The same research question might yield different methodologies
- Data processing steps may vary
- Rate limits or tier changes can trigger model fallbacks (e.g., Opus → Sonnet), altering tool choices and outputs
Note: Read more about defeating nondeterminism in LLM inference on the Thinking Machines blog; a small mitigation sketch follows.
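When your workflow does call a model directly, you can at least pin and record the knobs that influence determinism. The sketch below uses the OpenAI Python SDK (the model name and prompt are illustrative); even with `temperature=0` and a fixed `seed`, providers only promise best-effort reproducibility, which is exactly why the model's output should not be the load-bearing part of a strategy.

```python
# Sketch: pin and record decoding parameters when calling a model directly.
# Even with temperature=0 and a fixed seed, reproducibility is best-effort only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": "Propose a momentum backtest methodology."}],
    temperature=0,
    seed=42,         # best-effort determinism, not a guarantee
)

# Record what actually served the request alongside the output.
print(response.model)               # resolved model version
print(response.system_fingerprint)  # backend configuration identifier; changes when the stack changes
print(response.choices[0].message.content)
```

Logging the resolved model version and system fingerprint next to the output at least lets you detect when the backend serving your requests has changed between runs.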
3. Hidden State and Dependencies
AI agents compound the notorious "hidden state" problem of Jupyter notebooks, where cell execution order affects results, by:
- Dynamically choosing data sources without documentation
- Using different libraries or methods without explicit tracking
- Making assumptions that aren't recorded
Danger Zone: Agentic Trading
Several open-source projects use MCP for quantitative analysis and trading:
- Maverick MCP: "financial data analysis, technical indicators, and portfolio optimization tools directly to your Claude Desktop"
- PrimoAgent: "multi agent AI stock analysis system ... to provide comprehensive daily trading insights and next-day price predictions"
- Alpaca's example of building an MCP-based trading workflow
As teaching demos, they are excellent: they reduce integration friction and demonstrate how quickly you can reach a working prototype. In practice, however, running trading strategies this way is too risky.
In these projects, the trading logic relies on the model’s output, which is inherently non-deterministic and can change over time. To make things worse, you may get downgraded to a cheaper model mid-session due to rate limits or quota exhaustion.
This means the exact same results cannot be reproduced, even if the rules, data, and environment are fixed.
A Simple Solution
Instead of relying on MCP agents to research and trade, use them to generate code that you review, version-control, and run in a controlled environment. Over time, the accumulated code can be curated into a strategy or research library.
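What this looks like in practice: rather than letting an agent run the backtest inside its own session, ask it to emit a self-contained script that you review, commit, and re-run yourself. The sketch below is hypothetical; the file paths, the frozen CSV snapshot, and the toy momentum rule are placeholders for whatever your workflow actually uses.

```python
# momentum_backtest.py -- hypothetical AI-generated script, reviewed and committed by a human.
# Everything that affects the result is pinned: data snapshot, parameters, and random seed.
import hashlib
from pathlib import Path

import numpy as np
import pandas as pd

DATA_SNAPSHOT = Path("data/spx_daily_2010_2024.csv")  # frozen extract, not a live API call
LOOKBACK_DAYS = 90
SEED = 42


def main() -> None:
    np.random.seed(SEED)  # pin randomness even if the current logic is deterministic

    # Verify we are computing on the exact snapshot committed alongside the code.
    digest = hashlib.sha256(DATA_SNAPSHOT.read_bytes()).hexdigest()
    print(f"data sha256: {digest}")

    prices = pd.read_csv(DATA_SNAPSHOT, parse_dates=["date"], index_col="date")["close"]
    momentum = prices.pct_change(LOOKBACK_DAYS)
    signal = (momentum > 0).astype(int).shift(1)   # long when trailing return is positive
    daily_returns = prices.pct_change() * signal
    equity = (1 + daily_returns.fillna(0)).cumprod()

    Path("results").mkdir(exist_ok=True)
    equity.to_csv("results/momentum_equity_curve.csv")
    print(f"final equity multiple: {equity.iloc[-1]:.3f}")


if __name__ == "__main__":
    main()
```

The strategy itself is deliberately trivial; the point is the shape of the artifact: pinned inputs, explicit parameters, a checksum of the data, and outputs written to disk, so a second run by you or a reviewer produces the same numbers.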
Based on our experience with hundreds of hours of AI-assisted strategy and product development, we recommend:
1. Treat AI as a Code Generator, Not an Autonomous Agent
- Use AI to generate reproducible scripts and analysis code
- Review AI-generated plans and code before execution
- Maintain human oversight of critical decisions
2. Version Control Everything
- Regularly commit all analysis scripts, strategies, and utilities to version control
- Include the models and the prompts used to generate the code (e.g., in the Pull Request description)
- Document data sources, extraction timestamps, and filters in a data catalog
- Persist backtest results, metrics, and visualizations in structured storage (a run-manifest sketch follows this list)
3. Use the right model for the task
- Use the most capable model to create a detailed plan for the task (e.g., Anthropic Opus or GPT-5 at the time of writing)
- Review the plan using different models to gain confidence (e.g., OpenAI's o3-pro, Gemini-2.5-pro)
- Use the detailed implementation plan to generate the code; a less capable model can be used here (e.g., Anthropic Sonnet)
- Review the generated code using different models (e.g., Gemini-2.5-pro, R1, and o3-pro)
- Conduct the final review of the generated code using human expertise
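To tie these recommendations together, each run can emit a small manifest recording exactly what produced the result. Below is a minimal sketch; the field names, file paths, and model identifiers are assumptions, not a standard format.

```python
# write_manifest.py -- sketch of a per-run manifest capturing code, data, and model provenance.
# All paths and model names below are illustrative placeholders.
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


manifest = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
    "script": {"path": "momentum_backtest.py", "sha256": sha256("momentum_backtest.py")},
    "data": {"path": "data/spx_daily_2010_2024.csv", "sha256": sha256("data/spx_daily_2010_2024.csv")},
    # Provenance of the generated code: which models planned, implemented, and reviewed it.
    "models": {"plan": "claude-opus", "implement": "claude-sonnet", "review": ["gemini-2.5-pro", "o3-pro"]},
    "prompt_ref": "prompts/momentum_plan.md",  # hypothetical path to the saved prompt
    "seed": 42,
}

Path("results").mkdir(exist_ok=True)
Path("results/manifest.json").write_text(json.dumps(manifest, indent=2))
```

Committing this manifest next to the results turns "trust me, the agent did it" into something a reviewer can actually re-run.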
Note:
Over time, we hope every MCP server and agentic tool will generate audit logs of every tool call, including inputs, outputs, model IDs, and timestamps. This would resolve several reproducibility issues.
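Until that happens, you can approximate such an audit trail yourself by wrapping your own tool functions. A sketch follows; the log location and the `model_id` field are assumptions rather than any tool's built-in feature.

```python
# audit.py -- sketch: wrap tool functions so every call is appended to a JSONL audit log.
import functools
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("logs/tool_calls.jsonl")  # hypothetical location


def audited(tool_fn, model_id: str = "unknown"):
    """Wrap a tool function so every invocation is logged with inputs, outputs, and a timestamp."""

    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        result = tool_fn(*args, **kwargs)
        AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tool": tool_fn.__name__,
            "model_id": model_id,                        # which model requested the call, if known
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
        }
        with AUDIT_LOG.open("a") as f:
            f.write(json.dumps(record, default=str) + "\n")
        return result

    return wrapper


# Usage sketch: daily_bars_logged = audited(daily_bars, model_id="claude-sonnet")
```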
Conclusion
Pure MCP agentic workflows are productivity rockets, but also reproducibility traps. For credible research, treat agents as compilers and planners rather than autonomous researchers. Generate code, pin environments and data, and log every run.
If a result can’t be reproduced from code, config, data snapshot, and a manifest, it’s not research—it’s a demo.