Abstract
The development of AI coding agents has entered a new stage. Model architecture and computing power still matter, but they are no longer the only decisive factors. High-quality real-world developer behavior data, expert code review and human engineering judgment are becoming the new competitive moat.
Recent reports revealed Project Marlin, an Anthropic initiative developed in partnership with Snorkel AI to improve Claude Code. The project recruits about 1,000 professional software engineers to evaluate code, design tasks and conduct A/B testing. Each task takes roughly one hour and pays $280, far above the industry average.
At the same time, Cursor, Elon Musk’s xAI and OpenAI are all pursuing valuable developer data through different strategies. Cursor collects product usage data under its privacy policy. xAI is pursuing large-scale capital and ecosystem integration. OpenAI trains Codex through real coding tasks in isolated sandboxes.
This article examines how Project Marlin operates, how Snorkel AI transformed from a weak-supervision company into an expert-data supplier, and how the data strategies of major AI companies differ. It draws on reported statistics, task examples and industry research findings.
The central argument is simple: the next generation of AI coding agents will not be determined only by larger models. It will be shaped by who can learn most effectively from real software engineering workflows.
1. Project Marlin: Anthropic’s High-Cost Strategy to Improve Claude Code
1.1 Project Background and Core Goal
Claude Code is Anthropic’s project-level AI coding agent. It can read entire repositories, modify code across files, run tests and submit completed changes. It is no longer a simple code-generation chatbot. It is closer to an autonomous software engineering assistant.
Boris Cherny, the lead of Claude Code, has described how deeply the tool is used in his own work. By his account, he went more than two months writing almost no code by hand. In that period, Claude Code submitted 22 pull requests on one day and a peak of 27 on another. This pattern is also common inside Anthropic, where AI tools now handle a large share of internal coding work.
At the same time, Anthropic is aware of a key limitation. Existing code datasets can teach models how to generate syntactically correct code. But they do not fully teach models how experienced engineers think.
Production-grade software is not only about passing syntax checks. It requires security awareness, maintainability, robustness and careful trade-offs. These qualities often come from years of engineering practice. They are difficult to capture from public code alone.
This is the background of Project Marlin. According to reports from Business Insider, Anthropic partnered with Snorkel AI to build a dedicated training and evaluation pipeline for Claude Code. The goal is to turn expert engineering judgment into structured training data.
In other words, Project Marlin is not just buying code. It is buying the reasoning process behind good code.
1.2 How External Engineers Work in Project Marlin
Project Marlin recruits around 1,000 external software engineers. These engineers participate in structured evaluation tasks. The workflow is relatively strict.
First, an engineer selects a GitHub repository from a list of thousands of open-source projects. Then the engineer designs a realistic coding task and writes the corresponding prompt. Claude Code is asked to generate two different solutions for the same task.
After that, the engineer performs A/B testing. The two outputs are compared across several dimensions:
- Code correctness
- Runtime safety
- Operational reliability
- Long-term maintainability
- Compatibility with existing project logic
Some complex tasks require multiple rounds of discussion with Snorkel’s review team. Only after review and verification can the task be finalized.
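The comparison step above can be pictured as a structured record. The following is a minimal sketch, not Snorkel's or Anthropic's actual schema: the `ABComparison` class, the 1 to 5 scale and the unweighted total are all assumptions made for illustration. Only the five dimension names come from the list above.

```python
from dataclasses import dataclass

# Dimension names mirror the list above; the 1-5 scale is an assumption.
DIMENSIONS = (
    "correctness",
    "runtime_safety",
    "operational_reliability",
    "maintainability",
    "compatibility",
)


@dataclass
class ABComparison:
    """Hypothetical record of one engineer's A/B judgment."""
    repo: str
    prompt: str
    scores_a: dict[str, int]
    scores_b: dict[str, int]
    notes: str = ""

    def validate(self) -> None:
        # Every dimension must be scored for both solutions, within range.
        for label, scores in (("A", self.scores_a), ("B", self.scores_b)):
            for name in DIMENSIONS:
                if name not in scores:
                    raise ValueError(f"solution {label} is missing {name}")
                if not 1 <= scores[name] <= 5:
                    raise ValueError(f"solution {label}: {name} out of range")

    def preferred(self) -> str:
        """Return 'A', 'B' or 'tie' by unweighted total (an assumption)."""
        self.validate()
        total_a = sum(self.scores_a[d] for d in DIMENSIONS)
        total_b = sum(self.scores_b[d] for d in DIMENSIONS)
        if total_a == total_b:
            return "tie"
        return "A" if total_a > total_b else "B"
```

A real pipeline would likely weight the dimensions differently and keep the reviewer's written rationale alongside the scores, since the rationale is the part that captures engineering judgment.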
The compensation is unusually high. Each independent task takes about one hour and pays $280. Comparable software engineering evaluation work on platforms such as Scale AI and Mercor usually pays around $110 per hour. At one task per hour, Marlin's effective rate is roughly 2.5 times the market average. Top participants can earn more than $3,000 per week by completing tasks continuously.
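The arithmetic behind these figures is easy to check. The short sketch below assumes each task takes exactly one hour, so the per-task fee can be compared directly with an hourly rate. The variable names are illustrative.

```python
import math

marlin_fee = 280    # USD per task, about one hour each (from the text)
market_rate = 110   # USD per hour on comparable platforms (from the text)

# Ratio of the effective hourly rates, assuming exactly one hour per task.
ratio = marlin_fee / market_rate
print(round(ratio, 2))  # 2.55

# Smallest whole number of tasks whose combined pay reaches $3,000.
tasks_needed = math.ceil(3000 / marlin_fee)
print(tasks_needed)  # 11
```

Under that assumption, a weekly income above $3,000 corresponds to at least 11 completed tasks.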
This pricing reveals the true value of the project. Anthropic is not paying for routine labeling work. It is paying for expert judgment, engineering taste and practical experience.
Two task examples show the type of capability Anthropic wants to capture.
The first task asks engineers to refactor a metadata-processing module. The goal is to improve code clarity and maintainability without changing the original business behavior.
The second task focuses on security. Engineers need to help Claude Code fix a command-injection vulnerability in MLflow, a widely used open-source machine learning platform. The vulnerability appears when the platform downloads Python packages during model loading. The fix must block dangerous behavior while preserving legitimate pip parameters.
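To make the shape of such a fix concrete, here is a minimal sketch of one common defense against command injection: build an argument list, validate options against an allowlist, and never pass the command through a shell. This is not the MLflow patch. The function names and both allowlists are hypothetical, and the source does not say which pip options the real fix preserves.

```python
import subprocess
import sys

# Hypothetical allowlists for illustration only.
ALLOWED_FLAGS = {"--no-deps", "--upgrade", "--no-cache-dir"}
ALLOWED_VALUE_FLAGS = {"--index-url", "--extra-index-url"}


def build_pip_command(requirements, extra_args):
    """Return an argv list for `pip install`, or raise ValueError."""
    cmd = [sys.executable, "-m", "pip", "install"]
    args = iter(extra_args)
    for arg in args:
        if arg in ALLOWED_FLAGS:
            cmd.append(arg)
        elif arg in ALLOWED_VALUE_FLAGS:
            # These options take a separate value as the next argument.
            value = next(args, None)
            if value is None or value.startswith("-"):
                raise ValueError(f"missing value for {arg}")
            cmd.extend([arg, value])
        else:
            raise ValueError(f"pip option not allowed: {arg!r}")
    for req in requirements:
        # A requirement that looks like an option could smuggle in a flag.
        if not req or req.startswith("-"):
            raise ValueError(f"suspicious requirement: {req!r}")
        cmd.append(req)
    return cmd


def install(requirements, extra_args=()):
    # Passing an argv list with the default shell=False means characters
    # such as ";" or "&&" are never interpreted by a shell.
    cmd = build_pip_command(list(requirements), list(extra_args))
    subprocess.run(cmd, check=True)
```

The design choice here is to reject anything unrecognized rather than trying to strip dangerous characters, which keeps legitimate options working while failing closed on everything else.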
These examples are very different from simple benchmark problems. They require judgment, caution and domain understanding. They also reflect the types of issues that appear in real engineering work.
1.3 Why Claude Code Needs Stronger Risk Control
Anthropic’s investment in Project Marlin is closely related to the risk profile of AI coding agents.
Claude Code is not just suggesting code in a text box. It can modify local files, execute system commands and interact with code repositories. If such an agent produces unsafe or flawed code, the consequences can be serious.
A normal coding assistant may give a wrong answer. A full-process coding agent may change project files, run scripts or affect production workflows. This raises the standard for reliability.
To reduce risk, Anthropic has built permission controls and sandboxing mechanisms. By default, high-risk file changes and command execution require user approval. This protects users, but it can also create approval fatigue during long tasks.
To address this problem, Anthropic introduced more refined sandbox controls. The agent can operate within predefined file-system and network boundaries. This gives it more room to work while keeping sensitive operations restricted.
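The idea of predefined boundaries can be sketched as a small policy check. This is an illustrative assumption about how such a check might look, not Anthropic's implementation. The `SandboxPolicy` class and its method names are invented for this example.

```python
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import urlparse


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical file-system and network boundary for an agent."""
    writable_roots: tuple[Path, ...]
    allowed_hosts: frozenset[str]

    def may_write(self, target: str) -> bool:
        # Resolve ".." segments and symlinks before comparing, so a path
        # cannot escape a root through traversal.
        resolved = Path(target).resolve()
        return any(
            resolved.is_relative_to(root.resolve())
            for root in self.writable_roots
        )

    def may_connect(self, url: str) -> bool:
        host = urlparse(url).hostname
        return host is not None and host in self.allowed_hosts


policy = SandboxPolicy(
    writable_roots=(Path("/tmp/project"),),
    allowed_hosts=frozenset({"pypi.org"}),
)
print(policy.may_write("/tmp/project/src/app.py"))        # True
print(policy.may_write("/tmp/project/../../etc/passwd"))  # False
print(policy.may_connect("https://example.com/"))         # False
```

Operations inside the boundary can proceed without a prompt, while anything outside it still falls back to explicit user approval.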
This shows a broader direction in AI coding tools. The goal is no longer only to make AI write correct code. The goal is to make AI write code that is safe, maintainable and suitable for real software projects.
That type of improvement requires human review at scale. It also requires feedback from engineers who understand how production systems fail.
2. Snorkel AI: From Weak Supervision to Expert Data Infrastructure
2.1 Company Background and Strategic Shift
Snorkel AI plays a central role in Project Marlin. The company was spun out of Stanford AI Lab in 2019. It was founded by Alex Ratner and his PhD advisor Chris Ré. Ré is a Stanford professor, MacArthur Fellow and serial entrepreneur. He also founded SambaNova, a company once valued at $5 billion.
Snorkel’s early focus was weak supervision. The company aimed to solve a long-standing problem in AI development: manual data labeling was too slow and expensive. AI teams often spent about 80% of their time preparing and labeling data. Snorkel proposed using rules, programs and labeling functions to reduce manual work.
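A labeling function is a small program that votes on a label or abstains. The sketch below illustrates the idea in plain Python with a toy spam task. The functions and the task are invented for this example, and the simple majority vote stands in as a simplification for the statistical label model that weak-supervision systems typically use to weigh noisy votes.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_reply(text: str) -> int:
    return HAM if len(text.split()) <= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]


def weak_label(text: str) -> int:
    """Combine labeling-function votes by majority; abstain on ties."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

Each rule is cheap to write and individually unreliable. The value comes from combining many of them across a large unlabeled dataset instead of labeling each example by hand.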
The team published more than 60 academic papers around weak supervision. Its open-source tools were adopted by major technology companies, including Google and Intel.
However, the rise of foundation models changed the data market.
Basic labeled data is no longer the scarcest resource. Today, the most valuable data often comes from experts. Senior engineers, doctors, lawyers and other specialists provide judgments that cannot be easily replaced by simple labels.
Snorkel has adapted to this change. The company once focused on reducing human involvement in labeling. Now it helps organize expert teams to evaluate advanced AI systems.
Project Marlin is a clear example of this shift. Snorkel is not just managing data. It is building a pipeline for expert human judgment.
2.2 Expert Evaluation Workflow and Market Position
Snorkel has built a mature workflow for expert evaluation projects. The process usually includes four steps.
First, the task is defined. Second, scoring standards and verification rules are created. Third, the expert review pipeline is launched. Finally, results go through multi-level review and final judgment.
All operational records are retained. This makes the process traceable. Snorkel also provides a unified evaluation environment. The same task can be tested repeatedly across different model versions. This makes scores more reproducible and comparable.
To protect objectivity, external engineers are not told which model version they are evaluating. This helps reduce bias in A/B testing.
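Blinding of this kind can be sketched in a few lines. The example below is an assumption about how anonymized presentation might work, not a description of Snorkel's tooling. The names `blind_pair` and `unblind` and the identifiers `model_x` and `model_y` are hypothetical.

```python
import random


def blind_pair(output_x: str, output_y: str, rng: random.Random):
    """Shuffle two model outputs into anonymous slots 'A' and 'B'.

    Returns (shown, key). The reviewer sees only `shown`; the `key`
    mapping slots back to models stays with the review team.
    """
    pair = [("model_x", output_x), ("model_y", output_y)]
    rng.shuffle(pair)
    shown = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return shown, key


def unblind(choice: str, key: dict) -> str:
    """Map the reviewer's choice of slot back to the underlying model."""
    return key[choice]
```

Randomizing the slot order per task prevents a reviewer from learning that, for example, the newer model always appears first.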
Snorkel’s pricing varies by field. Public legal evaluation tasks may pay between $10 and $100 per qualified task. Software engineering tasks, such as those in Project Marlin, can reach $280 per task. This gap shows the premium value of developer expertise.
Snorkel’s client list includes major AI companies such as Google, Mistral AI and Anthropic. In May 2025, the company completed its Series D financing at a valuation of $1.3 billion. Kate Jensen, Anthropic’s head of revenue, said Anthropic would continue working with professional data providers such as Snorkel to unlock the potential of Claude models.
This reflects a larger industry change. Data service companies such as Snorkel, Scale AI and Mercor are no longer only labeling vendors. They have become part of the hidden supply chain behind frontier AI models.
They now control access to expert data. And expert data increasingly determines the upper bound of model performance.
3. Different Strategies for Capturing Developer Data
Anthropic is not the only company competing for real developer data. Cursor, xAI and OpenAI are also building their own data strategies.
Their methods are different, but the goal is similar. Each company wants access to real software development processes. This includes prompts, edits, tool calls, failed attempts, test results and human corrections.
3.1 Cursor: Collecting Data Through Product Usage
Cursor follows a product-driven data strategy.
Its privacy policy defines how user data may be used. When privacy mode is enabled, Cursor and third parties do not use code content for model training. When privacy mode is disabled, Cursor may collect repository data, prompts, user edits and code snippets to improve AI features and train underlying models.
This gives Cursor a strong advantage. It already sits inside the daily workflow of many developers. Its product naturally captures continuous interaction data.
Cursor’s in-house Tab model produces edits totaling more than 10 billion characters per day. Total request volume has grown nearly 100-fold since the initial version.
This usage data supports the development of Cursor’s Composer models. The Composer series uses reinforcement learning to improve tool-calling behavior and long-horizon task execution. The latest Composer 2.5 is optimized for complex tasks that may require hundreds of steps.
Cursor’s data advantage is not based on one-time labeling. It comes from repeated use in real projects. This creates a long-term stream of developer behavior data.
3.2 xAI and Elon Musk: Acquiring Data Through Capital Strategy
Elon Musk’s xAI is taking a more aggressive path. Its strategy is based on capital integration and ecosystem control.
In February 2026, xAI was reportedly merged into SpaceX. In late April, SpaceX obtained an option to acquire Anysphere, Cursor’s parent company, for $600 billion within the year. Another possible path was deep strategic cooperation with an upfront payment of $10 billion.
The target of these moves is clear: Cursor owns valuable real developer behavior data. For a frontier AI company, this data can become a strategic asset.
On May 25, Musk announced on X that Grok V9-Medium had completed training. The model reportedly has 1.5 trillion parameters, about three times the size of existing production models. Musk also said this version had not yet used Cursor’s data for supplementary training. He suggested that Grok’s coding ability would improve significantly after adding that data.
The model was scheduled for release in mid-June. If the plan proceeds as described, Grok could become one of the first large models to learn systematically and at scale from real developer workflow data.
3.3 OpenAI Codex: Learning Through Sandboxed Engineering Tasks
OpenAI Codex uses another approach. It trains through real engineering tasks in isolated environments.
The new Codex, released in 2025, is powered by the codex-1 model. It is trained through reinforcement learning on large numbers of real coding tasks. The goal is to make Codex generate code that follows human coding styles and pull request norms.
Codex can run tests, inspect failures and revise its changes. This allows it to iterate until a task is completed.
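The run, inspect and revise cycle can be written as a simple loop. This is a generic sketch of the pattern, not OpenAI's implementation. The function `iterate_until_green`, its callback parameters and the round limit are assumptions for illustration.

```python
from typing import Callable


def iterate_until_green(
    run_tests: Callable[[], list[str]],
    revise: Callable[[list[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Run tests, feed failures to a revision step, and repeat.

    `run_tests` returns the names of failing tests (empty means green).
    `revise` receives those names and changes the code in response.
    Returns True if the tests pass within the round limit.
    """
    for _ in range(max_rounds):
        failures = run_tests()
        if not failures:
            return True
        revise(failures)
    # One final check after the last revision.
    return not run_tests()
```

The round limit matters in practice: without it, an agent that cannot fix a failure would keep revising indefinitely.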
All Codex tasks run in isolated sandboxes. These sandboxes are preloaded with the user’s repository. The design improves operational safety while allowing the agent to interact with realistic project environments.
Codex has become OpenAI’s flagship agentic coding platform. According to Axios, it has surpassed 5 million weekly active users. This large user base generates a significant amount of task data, which can continue to feed model improvement.
3.4 The Core of the Competition: Process Data
The four players follow different paths.
Anthropic purchases expert judgment through third-party evaluation teams. Cursor accumulates behavior data through product usage. xAI seeks data access through capital and strategic integration. OpenAI trains Codex in sandboxed engineering environments.
Despite these differences, they are competing for the same thing: development process data.
Static code tells a model what code looks like. Process data teaches a model how good code is produced.
That difference is critical. Real development includes code review, debugging, failed attempts, test execution, rollback, refactoring and trade-off decisions. These steps reveal how engineers actually work.
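One way to picture process data is as an ordered trace of events rather than a final snapshot of code. The schema below is purely illustrative. The class names, event kinds and fields are assumptions, not a format used by any of the companies discussed.

```python
from dataclasses import dataclass, field
from typing import Literal

EventKind = Literal[
    "prompt", "edit", "tool_call", "test_run", "rollback", "review_comment"
]


@dataclass
class ProcessEvent:
    kind: EventKind
    detail: str
    succeeded: bool = True


@dataclass
class SessionTrace:
    """Hypothetical record of how one change was produced, step by step."""
    task: str
    events: list[ProcessEvent] = field(default_factory=list)

    def failed_attempts(self) -> int:
        # Failed steps are part of the signal, not noise to be discarded.
        return sum(1 for event in self.events if not event.succeeded)


trace = SessionTrace(task="fix flaky date parser")
trace.events.append(ProcessEvent("prompt", "parser fails on ISO week dates"))
trace.events.append(ProcessEvent("test_run", "2 tests failing", succeeded=False))
trace.events.append(ProcessEvent("edit", "handle week-based year"))
trace.events.append(ProcessEvent("test_run", "all tests passing"))
print(trace.failed_attempts())  # 1
```

A static dataset would keep only the final edit. The trace also preserves the failing test run that motivated it.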
4. The New Moat: Human Engineering Intuition
A research project called SWE-chat highlights the importance of real-world developer data.
The research team collected 6,000 real AI coding agent sessions. These sessions included more than 63,000 user prompts and 355,000 tool-call records.
The results were revealing. Only 44% of the code generated by AI agents was eventually retained and submitted by users. More than half of the output was deleted, modified or rejected.
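A retention figure like this can be computed from per-session counts. The sketch below assumes a line-level definition of retention and uses made-up toy numbers. The source does not give SWE-chat's exact metric definition, so the field names and the formula are assumptions.

```python
def retention_rate(sessions: list[dict]) -> float:
    """Share of AI-generated lines that users kept, across all sessions."""
    generated = sum(s["lines_generated"] for s in sessions)
    retained = sum(s["lines_retained"] for s in sessions)
    if generated == 0:
        raise ValueError("no generated lines to measure")
    return retained / generated


# Toy data for illustration only, not taken from the study.
toy_sessions = [
    {"lines_generated": 40, "lines_retained": 10},
    {"lines_generated": 60, "lines_retained": 40},
]
print(retention_rate(toy_sessions))  # 0.5
```

Pooling lines across sessions, as done here, weights long sessions more heavily than averaging a per-session rate would.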
This finding shows the limits of traditional coding benchmarks. Benchmarks such as HumanEval are now highly saturated. Higher public benchmark scores do not always translate into better real-world usability.
The real development process is messy. It contains repeated attempts, partial failures, corrections and human judgment. This is where the next stage of AI coding competition will happen.
As AI models become stronger, the scarcest resource is no longer only public code. It is the part of human expertise that models have not fully absorbed.
That includes engineering intuition, code review ability, security awareness and judgment under complex constraints.
Anthropic’s decision to pay $280 per task and recruit around 1,000 engineers for A/B evaluation reflects this shift. The company is trying to convert expert experience into training data.
In the current phase of AI coding agent development, the winner will not simply be the company with the largest model. The winner will be the company that can best transform real engineering experience into model capability.
The gap between coding models will increasingly depend on data quality, expert feedback and process-level learning.
5. Conclusion
The competition for real developer data shows that AI coding agents have entered a new stage.
Anthropic’s Project Marlin uses Snorkel AI to collect expert engineering evaluations. Cursor relies on product usage to accumulate developer behavior data. xAI is pursuing data through capital strategy and ecosystem integration. OpenAI trains Codex through sandboxed reinforcement learning on real coding tasks.
These strategies differ in execution, but they point to the same conclusion. Real development scenarios, human engineering experience and process data are becoming the core moat for AI coding tools.
Snorkel AI’s transformation also reflects the evolution of the AI data industry. After basic labeling became common, expert data became the new high-value resource. In coding, this resource is especially important. A useful AI coding agent must do more than generate syntactically valid code. It must learn how senior engineers think about maintenance, reliability and security.
The SWE-chat finding is a clear reminder. If only 44% of AI-generated code is retained by users, AI coding agents still have a long way to go. They must continue learning from real developers.
In the next phase, competition among AI coding tools will focus on how developer data is collected, evaluated and applied. The rivalry among Anthropic, Cursor, xAI and OpenAI will push AI coding agents closer to real software engineering workflows.
Over time, these tools may evolve from coding assistants into core development partners. But that evolution will depend less on benchmark scores and more on real-world engineering judgment.




