Abstract

Self-evolving AI agents have become an important research direction in large language model applications. The goal is clear: agents should not remain static after deployment. They should be able to learn from experience, acquire new skills, and improve their task performance over time.

However, most existing self-evolution methods still depend on strong external support. They often require successful historical trajectories, manually designed skill libraries, labeled datasets, or explicit human feedback. These conditions are difficult to guarantee in real-world environments. As a result, many self-evolving agent systems work well in controlled experiments but struggle in open deployment scenarios.

To address this limitation, the research team led by Sun Lichao from Lehigh University proposed the OpenSkill framework. OpenSkill is designed to help LLM agents acquire executable and transferable skills without relying on task-specific supervision signals. It builds a closed-loop process around open-world knowledge acquisition, leakage-free skill evolution, and zero-shot evaluation.

Experiments show that OpenSkill achieves strong results across benchmark tests, cross-model skill transfer, and ablation studies. On SkillsBench, it improves the pass rate of Opus 4.6 to 43.6% and GPT 5.2 to 42.1%, outperforming the strongest baseline methods by 8.9 and 8.8 percentage points, respectively. Skill migration tests also show clear performance gains across weaker models, ranging from 5.5 to 14.8 percentage points.

This article explains OpenSkill’s research motivation, architecture, workflow, experimental results, limitations, and future development directions. It also discusses why supervision-free skill learning may become a key path for the next generation of autonomous AI agents.

1. Industry Background and Core Pain Points

Self-evolving LLM agents are designed to improve over time. Unlike fixed AI assistants, they are expected to learn new skills, adapt to changing environments, and handle more complex tasks through continuous iteration.

This capability is valuable in many scenarios. In enterprise operations, agents may need to learn internal workflows. In software engineering, they may need to master project-specific tools. In intelligent office systems, they may need to handle documents, calendars, forms, and cross-platform tasks. In customer service, agents may need to update their response strategies as business rules change.

The problem is that most current self-evolution methods still rely on supervised resources.

Traditional approaches usually depend on three types of input. The first is successful historical trajectories. These are records of how a task was completed correctly in the past. The second is a pre-built skill library, often created by humans or derived from curated data. The third is explicit supervision, such as human feedback, annotated samples, or labeled evaluation signals.

These resources are useful in laboratory environments. They make the learning process easier to control and evaluate. But they are much harder to obtain in open-world deployment.

In real business environments, successful trajectories may be limited or inconsistent. Skill libraries may not cover new tasks. Human experts may not have enough time to supervise every iteration. In some cases, the agent may face tasks that have no prior examples at all.

This creates a major bottleneck. Once external supervision is missing, many self-evolving agents stop improving. Their learning process depends too much on prepared data.

OpenSkill attempts to solve this problem. Instead of assuming that supervision signals already exist, it explores how agents can acquire and refine skills from open-world resources and simulated execution feedback.

This makes OpenSkill different from conventional methods based on manual planning, supervised learning, or simple LLM generation. Its main contribution is to reduce the dependency on task-specific supervision while still producing executable skills.

2. The Three-Stage Workflow of OpenSkill

OpenSkill takes several inputs: task instructions, executable environments, base large language models, tool access, and open-world resources. Based on these inputs, it builds a three-stage skill evolution pipeline.

The three stages are:

open-world knowledge acquisition
leakage-free skill evolution
zero-shot target evaluation

Together, they form a closed-loop process for skill generation, testing, optimization, and deployment.
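
The closed loop described above can be sketched as three functions chained together. This is a minimal illustration under stated assumptions, not the OpenSkill implementation: the names `Skill`, `acquire_knowledge`, `evolve_skills`, `evaluate_zero_shot`, and `run_pipeline` are hypothetical, and each stage is a stub that a real system would replace with retrieval, LLM calls, and environment execution.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Skill:
    """Hypothetical container: a named piece of reusable guidance for an agent."""
    name: str
    body: str


def acquire_knowledge(task: str, corpus: List[str]) -> List[str]:
    """Stage 1 (stub): keep corpus snippets that share a word with the task."""
    words = set(task.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def evolve_skills(task: str, notes: List[str]) -> List[Skill]:
    """Stage 2 (stub): turn each retrieved note into one candidate skill."""
    return [Skill(name=f"skill_{i}", body=note) for i, note in enumerate(notes)]


def evaluate_zero_shot(skills: List[Skill], hidden_tasks: List[str],
                       solve: Callable[[str, List[Skill]], bool]) -> float:
    """Stage 3: pass rate on tasks that stages 1 and 2 never saw."""
    if not hidden_tasks:
        return 0.0
    passed = sum(1 for t in hidden_tasks if solve(t, skills))
    return passed / len(hidden_tasks)


def run_pipeline(task: str, corpus: List[str], hidden_tasks: List[str],
                 solve: Callable[[str, List[Skill]], bool]) -> Tuple[List[Skill], float]:
    """Chain the three stages; hidden tasks are only touched in the last one."""
    notes = acquire_knowledge(task, corpus)
    skills = evolve_skills(task, notes)
    return skills, evaluate_zero_shot(skills, hidden_tasks, solve)
```

The one design point worth noting is that `hidden_tasks` is passed only to the final stage, which mirrors the separation between skill evolution and target evaluation.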

2.1 Open-World Knowledge Acquisition

The first stage is knowledge acquisition.

In this stage, OpenSkill searches open-world resources for task-related knowledge. These resources may include public documents, web information, technical references, tutorials, examples, and general background knowledge.

This is different from using a manually curated dataset. Open-world resources are broad, diverse, and noisy. They may contain useful information, but they may also include outdated, duplicated, or conflicting content.

The goal of this stage is not to directly train the model with raw web data. Instead, the framework collects useful reference information for later skill generation.

OpenSkill also performs initial filtering and verification. This helps remove obviously irrelevant or invalid content. The cleaner the retrieved information is, the better the later skill evolution process becomes.
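
The article does not say which filtering rules OpenSkill applies, so the following is only a sketch of what an initial filter might look like. It assumes two simple heuristics that are not taken from the source: duplicates are detected by normalized text, and relevance is approximated by word overlap with the task. The function names and the `min_overlap` threshold are hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_snippets(task: str, snippets: List[str], min_overlap: int = 2) -> List[str]:
    """Drop empty, duplicated, and weakly related snippets (assumed heuristics)."""
    task_words = set(re.findall(r"[a-z0-9]+", task.lower()))
    seen = set()
    kept = []
    for snippet in snippets:
        key = normalize(snippet)
        if not key or key in seen:
            continue  # empty or duplicate content
        seen.add(key)
        overlap = task_words & set(re.findall(r"[a-z0-9]+", key))
        if len(overlap) >= min_overlap:
            kept.append(snippet)  # keep the original, un-normalized text
    return kept
```

A production filter would likely need more than this, for example source credibility scoring and conflict detection, but even a cheap pass like this one reduces the amount of noise reaching skill generation.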

This stage is important because self-evolving agents need external knowledge. If an agent only relies on its internal model parameters, it may lack the latest or most task-specific information. Open-world acquisition gives it a broader knowledge foundation.

2.2 Leakage-Free Skill Evolution

The second stage is the core of the framework.

After collecting open-world knowledge, OpenSkill generates multiple candidate skills for the target task. These candidate skills are then tested and refined in virtual task environments.

The key point is that this process does not use task-specific human supervision. It relies on execution feedback from the virtual environment.

In practice, the framework generates skills, tests them, compares their behavior, removes weak candidates, and keeps the ones that perform better. This creates an iterative improvement loop.

The term "leakage-free" is also important. It means that the skill evolution process is designed so that target evaluation data is never exposed to it and intermediate information does not flow into the wrong stage. This reduces the risk of overfitting to hidden test data. It also helps protect skill assets and sensitive task information during practical deployment.

After several rounds of testing and filtering, OpenSkill keeps skills that are stable, executable, and more likely to generalize.
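
The generate, test, prune, and refine cycle described above can be sketched as a short loop. This is an illustration under assumptions rather than the framework's actual algorithm: `generate`, `score`, and `refine` are hypothetical callables, with `score` standing in for execution feedback from a virtual task environment, and the defaults for `rounds` and `keep` are arbitrary.

```python
from typing import Callable, List


def evolve(generate: Callable[[], List[str]],
           score: Callable[[str], float],
           refine: Callable[[str], str],
           rounds: int = 3,
           keep: int = 2) -> List[str]:
    """Generate candidates, then repeatedly score, prune, and refine them.

    No human labels or target-test data are consulted; the only signal is
    whatever `score` returns for a candidate skill.
    """
    candidates = generate()
    for _ in range(rounds):
        ranked = sorted(candidates, key=score, reverse=True)
        survivors = ranked[:keep]                 # remove weak candidates
        refined = [refine(s) for s in survivors]  # propose improved variants
        # Keep survivors next to their refinements so a bad refinement
        # cannot displace a skill that already performs well.
        candidates = list(dict.fromkeys(survivors + refined))
    return sorted(candidates, key=score, reverse=True)[:keep]
```

Keeping survivors alongside their refined variants is a deliberate choice in this sketch: it makes each round non-regressing with respect to the scoring signal, at the cost of a slightly larger candidate pool.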

This design gives the framework a practical advantage. It does not need a human expert to label every step. It also does not require a ready-made skill library. The agent can build skills through interaction, testing, and refinement.

2.3 Zero-Shot Target Evaluation

The final stage is zero-shot target evaluation.

Once the skills are evolved, they are deployed to the target agent. The agent is then evaluated on hidden test tasks that were not exposed during the skill generation process.

This matters because visible evaluation data can easily cause overfitting. If a framework repeatedly optimizes against known test cases, the final result may look strong but fail in real tasks.

Hidden evaluation provides a better measure of generalization. It shows whether the evolved skills can actually help agents perform better in unseen scenarios.
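
One way to make the separation between evolution and hidden evaluation enforceable is a simple guard that fails loudly when the two overlap. The sketch below assumes tasks carry string identifiers, which the source does not state; `LeakageError`, `assert_no_leakage`, and `pass_rate` are hypothetical names.

```python
from typing import Iterable, List, Set


class LeakageError(RuntimeError):
    """Raised when evolution-stage inputs overlap with hidden evaluation tasks."""


def assert_no_leakage(evolution_task_ids: Iterable[str],
                      hidden_task_ids: Iterable[str]) -> None:
    """Fail fast if any hidden task was visible during skill evolution."""
    overlap: Set[str] = set(evolution_task_ids) & set(hidden_task_ids)
    if overlap:
        raise LeakageError(f"hidden tasks seen during evolution: {sorted(overlap)}")


def pass_rate(results: Iterable[bool]) -> float:
    """Fraction of hidden tasks that passed; 0.0 for an empty run."""
    outcomes: List[bool] = list(results)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Running the guard before every evaluation is cheap, and it turns an easy-to-miss methodological error into an explicit failure.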

The evaluation result can also be fed back into earlier stages. This allows OpenSkill to refine knowledge retrieval and skill evolution over multiple cycles.

In this sense, the framework is not just a one-time skill generator. It is a self-improving system that can continue refining its process.

3. Experimental Results and Data Analysis

The research team evaluated OpenSkill from three angles:

benchmark performance
cross-model skill migration
ablation studies

The results show that OpenSkill performs well across multiple evaluation settings.

3.1 Benchmark Performance

The team tested OpenSkill on three benchmark datasets and two mainstream target agents.

On SkillsBench, OpenSkill achieved the best performance among the automatic methods tested. It raised the overall pass rate of Opus 4.6 to 43.6% and that of GPT 5.2 to 42.1%.

Compared with the strongest baseline algorithms, these results represent gains of 8.9 percentage points for Opus 4.6 and 8.8 percentage points for GPT 5.2.
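
The article reports final pass rates and gains but not the baseline figures themselves. Assuming the gains are plain differences in percentage points, the implied baseline pass rates and the corresponding relative improvements can be recovered with a little arithmetic. These are derived values, not numbers reported by the research team, and the function names are hypothetical.

```python
def implied_baseline(pass_rate_pct: float, gain_pp: float) -> float:
    """Baseline pass rate implied by a final rate and a gain in percentage points."""
    return round(pass_rate_pct - gain_pp, 1)


def relative_gain_pct(pass_rate_pct: float, gain_pp: float) -> float:
    """Improvement relative to the implied baseline, in percent."""
    baseline = pass_rate_pct - gain_pp
    return round(100 * gain_pp / baseline, 1)
```

Under that assumption, `implied_baseline(43.6, 8.9)` gives 34.7 for Opus 4.6 and `implied_baseline(42.1, 8.8)` gives 33.3 for GPT 5.2, which corresponds to relative gains of about 25.6% and 26.4%. The distinction matters when comparing results: a gain of 8.9 percentage points is not the same as an 8.9% improvement.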

These improvements are meaningful because SkillsBench focuses on practical agent skills. A higher pass rate means the framework is not only producing theoretical outputs. It is improving the agent’s ability to complete executable tasks.

OpenSkill also performs well on SocialMaze and ScienceWorld. These benchmarks test different aspects of agent behavior. SocialMaze focuses more on interactive and social decision-making tasks. ScienceWorld emphasizes scientific reasoning and environment-based task execution.

Strong results across these datasets suggest that OpenSkill is not limited to one narrow domain. It has broader applicability across different types of agent tasks.

3.2 Cross-Model Skill Migration

Skill migration is one of the most important indicators of practical value.

If a skill only works on the model that generated it, its usefulness is limited. In real deployments, teams often use multiple models with different sizes, costs, and capabilities. A good skill should be transferable across model families or model tiers.

The research team tested this by generating skills with Opus 4.6 through OpenSkill. These skills were then directly transferred to four weaker large language models without additional adaptation.

The result was positive. All four target models achieved clear performance gains after receiving the migrated skills.

The improvement range was between 5.5 and 14.8 percentage points compared with the baseline without auxiliary skills.

This result shows that OpenSkill-generated skills have strong portability. They are not tied to only one model. They can help weaker models perform better, which is valuable for cost-sensitive deployment.

For enterprises, this has direct cost implications. A company may use a stronger model to generate skills, then transfer those skills to cheaper or smaller models for daily execution. This can reduce cost while preserving part of the performance benefit.
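
A minimal sketch of this pattern follows, assuming that a skill is plain text that can be prepended to a prompt. The source does not specify how skills are represented or injected, so `ModelFn`, `with_skills`, and `gain_pp` are hypothetical, and the model is reduced to a function from prompt to answer.

```python
from typing import Callable, List

ModelFn = Callable[[str], str]  # hypothetical interface: prompt in, answer out


def with_skills(model: ModelFn, skills: List[str]) -> ModelFn:
    """Wrap any model so the same skill text is prepended to every prompt."""
    header = "\n\n".join(skills)

    def wrapped(prompt: str) -> str:
        return model(f"{header}\n\n{prompt}" if header else prompt)

    return wrapped


def gain_pp(with_rate: float, without_rate: float) -> float:
    """Gain in percentage points between two pass rates given as fractions."""
    return round(100 * (with_rate - without_rate), 1)
```

Because the wrapper depends only on the prompt interface, the same skill list can be attached to a stronger model during generation and to a cheaper model during daily execution without any adaptation step.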

3.3 Ablation Study Results

Ablation studies help identify which parts of the framework are most important.

The OpenSkill experiments show that performance peaks at 3 iteration rounds, with the highest overall score of 82.7%.

This is an important finding. More iterations do not always mean better performance. Beyond three rounds, overall performance begins to decline.

The likely explanation is over-optimization. Excessive iteration may cause skills to fit too closely to the simulated environment. This reduces their ability to generalize to real tasks.
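
A common safeguard against this kind of over-optimization is early stopping: track a score after each round and stop once it no longer improves. The sketch below is a generic version of that idea, not a procedure described in the source. The function name and the `patience` parameter are hypothetical, and the scores are assumed to come from a validation signal that is separate from the hidden test set.

```python
from typing import Sequence


def select_round(scores: Sequence[float], patience: int = 1) -> int:
    """Return the 1-based index of the best round seen before stopping.

    Rounds are scanned in order. Scanning stops once `patience` consecutive
    rounds fail to beat the best score so far.
    """
    if not scores:
        raise ValueError("need at least one round")
    best_i = 0
    stale = 0
    for i in range(1, len(scores)):
        if scores[i] > scores[best_i]:
            best_i, stale = i, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_i + 1
```

A larger `patience` tolerates temporary dips at the cost of running more rounds, which ties this choice directly to the compute budget.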

The ablation experiments also show that both open-world retrieval and virtual verification independently improve performance. When used together, they produce the best effect.

This confirms the value of the two core modules. Open-world retrieval provides external knowledge. Virtual verification tests whether the generated skills are executable and useful.

Another important result is the consistency between the virtual verifier and real evaluation. The virtual task verifier covers 88.9% of real test intentions.

This means that the simulated environment can capture most of the key requirements of real tasks. It is not perfect, but it is reliable enough to guide skill evolution.
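
The article does not explain how this coverage figure is computed. One plausible reading, shown here purely as an assumption, treats each real test as checking a set of labeled intents and measures what share of them is also exercised by at least one virtual check. The function name and the exact-match comparison of labels are hypothetical.

```python
from typing import Iterable, Set


def intent_coverage(real_intents: Iterable[str],
                    verifier_intents: Iterable[str]) -> float:
    """Share of real test intents that some virtual check also exercises."""
    real: Set[str] = set(real_intents)
    if not real:
        raise ValueError("no real intents to cover")
    return len(real & set(verifier_intents)) / len(real)
```

Note that the metric is deliberately one-sided: virtual checks with no real counterpart do not lower the score, since they cost compute but do not create blind spots.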

4. Current Limitations and Future Directions

Although OpenSkill achieves strong results, it still has clear limitations.

The research team identifies three areas that need further improvement.

4.1 Uneven Quality of Open-World Knowledge

The first challenge is data quality.

Open-world resources are useful, but they are not always reliable. They may include noise, outdated information, duplicated content, biased claims, or contradictory explanations.

If low-quality information enters the skill generation process, the resulting skills may become unstable or incorrect.

Future versions of OpenSkill need stronger knowledge filtering. The framework must better evaluate source credibility, remove irrelevant content, and resolve conflicting information.

This is especially important for high-stakes domains such as finance, healthcare, law, cybersecurity, and enterprise automation.

4.2 Limited Fidelity of Virtual Tasks

The second challenge is simulation fidelity.

Virtual task environments are useful for testing skills, but they cannot fully reproduce the complexity of real-world tasks. Real environments often include unexpected events, incomplete information, changing user goals, and dynamic tool behavior.

If the virtual environment is too simple, skills may perform well during simulation but fail in real deployment.

The current experiments show that the virtual verifier covers 88.9% of real test intentions, which is a strong result. Even so, roughly one in nine real test intentions (11.1%) remains outside what the simulation checks.

Future work should improve simulation design. Virtual environments need to capture more real-world uncertainty, edge cases, and dynamic interactions.

4.3 High Operational Cost

The third challenge is cost.

OpenSkill requires multiple rounds of knowledge retrieval, skill generation, virtual testing, filtering, and evaluation. This process consumes compute resources and time.

For large research teams, this may be acceptable. For individual developers and small teams, the cost may become a barrier.

Future optimization should focus on efficiency. The framework needs better caching, fewer redundant evaluations, faster skill filtering, and lower-cost model usage.
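
As one example of cutting redundant evaluations, results can be memoized per skill and task so that an unchanged skill is never re-run across rounds. This sketch assumes that evaluation is deterministic for a given skill text and task, which may not hold when the evaluator itself samples from an LLM. `CachedEvaluator` and its interface are hypothetical.

```python
import hashlib
from typing import Callable, Dict, Tuple


class CachedEvaluator:
    """Memoize (skill, task) evaluations so repeated rounds do not re-run them."""

    def __init__(self, evaluate: Callable[[str, str], float]) -> None:
        self._evaluate = evaluate
        self._cache: Dict[Tuple[str, str], float] = {}
        self.calls = 0  # number of real (uncached) evaluations

    @staticmethod
    def _key(skill_text: str, task_id: str) -> Tuple[str, str]:
        # Hash the skill text so long skills do not bloat the cache keys.
        digest = hashlib.sha256(skill_text.encode("utf-8")).hexdigest()
        return digest, task_id

    def __call__(self, skill_text: str, task_id: str) -> float:
        key = self._key(skill_text, task_id)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._evaluate(skill_text, task_id)
        return self._cache[key]
```

The `calls` counter makes the saving measurable: comparing it with the total number of lookups shows how many virtual executions the cache avoided.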

A more efficient OpenSkill pipeline would make supervision-free skill evolution easier to deploy in real products.

5. Application Value and Industry Impact

OpenSkill has important value for both academic research and industrial deployment.

For researchers, it offers a new way to study agent self-evolution without relying on heavy supervision. It shifts the focus from manually prepared training signals to open-world knowledge and execution-based feedback.

For developers, it provides a practical framework for building more adaptive agents. Instead of manually writing every skill, teams can allow agents to generate and refine skills based on business tasks.

For enterprises, OpenSkill may reduce the workload of manual labeling and skill library construction. It can also shorten the iteration cycle for AI agents.

The skill migration results are especially valuable. Since OpenSkill-generated skills can improve weaker models, teams may use stronger models for skill creation and smaller models for deployment. This creates a more flexible cost-performance strategy.

From a broader industry perspective, OpenSkill points to a possible future for autonomous agents.

The next generation of agents may not depend entirely on human-written skills. They may retrieve knowledge, simulate tasks, generate candidate skills, test them, and transfer successful skills across models.

If the framework’s limitations are addressed, this approach could be applied to many fields, including:

intelligent customer service
automated operations and maintenance
AI office assistants
workflow automation
software engineering agents
research assistants
enterprise knowledge agents

OpenSkill is therefore not only an algorithmic contribution. It is also a step toward more scalable agent learning.

6. Conclusion

OpenSkill is a promising framework for self-evolving LLM agents. It enables agents to acquire and refine skills without relying on task-specific supervision signals.

Its three-stage workflow is clear:

open-world knowledge acquisition
leakage-free skill evolution
zero-shot target evaluation

Experimental results show strong performance. On SkillsBench, OpenSkill improves Opus 4.6 to 43.6% and GPT 5.2 to 42.1%, exceeding the strongest baselines by 8.9 and 8.8 percentage points. Cross-model transfer also brings gains of 5.5 to 14.8 percentage points across weaker models. Ablation studies show that 3 iteration rounds produce the best result, with a peak score of 82.7%. The virtual verifier covers 88.9% of real test intentions.

At the same time, OpenSkill is not yet a complete solution. It still faces challenges in data quality, virtual simulation fidelity, and operational cost.

Even so, its direction is important. It reduces the dependence of agent self-evolution on labeled data, manual feedback, and pre-built skill libraries. It also shows that transferable skills can improve different models, including weaker ones.

As open-world knowledge retrieval, virtual simulation, and self-evolution algorithms continue to improve, frameworks like OpenSkill may become a key foundation for future autonomous agents.

In the long term, the most capable agents will not only follow instructions. They will learn new skills, test their own strategies, migrate knowledge across models, and continuously adapt to changing environments.