Context bloat has long been one of the most stubborn pain points for production-grade AI Agents. During long multi-turn conversations and complex iterative tasks, continuously appending full dialogue content will lead to a sharp surge in token consumption, raise operational costs, and even cause large language models (LLMs) to lose focus on core information due to overly redundant context. This article thoroughly dissects a complete set of context optimization solutions adopted by the open-source project Gliding Horse. The system achieves remarkable results: after up to 50 rounds of continuous interaction, the total token volume remains nearly stable, and the agent can still accurately recall details from early conversation rounds. We will break down five core engineering designs, including summary-based history storage, structured prompt partitioning, background batch information extraction, knowledge graph conversion for massive data, and the full context assembly workflow. All technical logic and practical cases are derived from real deployment, providing replicable optimization ideas for AI Agent developers.
1. Introduction: The Predicament of Traditional Context Management
For conventional AI Agent systems, the most common way to handle multi-turn dialogue is to concatenate the complete content of every round of conversation into the overall context. When the interaction extends to 50 rounds or more, the context window is filled with a large amount of repetitive and lengthy raw text. This brings two prominent drawbacks. First, the token usage rises linearly with the number of dialogue rounds, following an O(n) growth trend, which significantly increases API calling costs and inference latency. Second, excessive messy information interferes with the LLM’s judgment. The model is prone to being distracted by trivial content, unable to lock onto key tasks and historical decisions, resulting in degraded output quality and frequent logical errors.
The Gliding Horse project addresses this dilemma through a series of context engineering refinements. Its core idea is to separate "index references" from "detailed raw data": the main context retains concise summaries to control token volume, while the complete original content is stored in an independent structured database. When the agent needs details, it calls dedicated retrieval tools to fetch them on demand. This design reduces the token growth of long-running tasks to nearly O(1) in practice: the summary list still gains one short entry per round, so growth is strictly linear, but with a constant small enough that the total stays nearly flat over dozens of rounds, while the complete memory of historical interactions is preserved. The following sections walk through each design module, with practical cases and technical principles.
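The difference in growth can be made concrete with some rough arithmetic. The sketch below is illustrative only: the per-round figures (800 tokens for a full round, 40 for a summary) are hypothetical and not measurements from the project.

```python
def full_history_tokens(rounds: int, tokens_per_round: int) -> int:
    """Context size when every round's full text stays in the prompt."""
    return rounds * tokens_per_round


def summary_history_tokens(rounds: int, tokens_per_summary: int) -> int:
    """Context size when only one short summary per round stays in the prompt."""
    return rounds * tokens_per_summary


# Hypothetical figures, chosen only to show the shape of the curves.
FULL_ROUND_TOKENS = 800
SUMMARY_TOKENS = 40

for n in (10, 50):
    print(
        f"rounds={n:>2}  full={full_history_tokens(n, FULL_ROUND_TOKENS):>6}  "
        f"summaries={summary_history_tokens(n, SUMMARY_TOKENS):>5}"
    )
```

Both curves are linear in the number of rounds. Under these assumed figures the summary-based history grows 20 times more slowly, which is why the total appears nearly flat over a few dozen rounds.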
2. Core Design 1: Historical Context – Summary + IRI Reference Mechanism
The first and most foundational optimization is reforming the storage form of multi-turn historical content. Instead of splicing full dialogue texts, the system generates a standardized summary for each round of LLM responses, and only aggregates these summaries into the main context. To avoid the loss of detailed information caused by pure summarization, each summary is bound with a unique IRI (Internationalized Resource Identifier) as a data access address.
2.1 Working Mode and Practical Case
After each round of task execution, the agent is forced to output a concise summary that records the core objectives, decisions and results of the current round. The historical context is composed of these sequential summaries. A typical example of a multi-round analysis task is as follows:
- Round 1 Summary: The user needs to analyze Q2 sales data to support inventory planning.
- Round 2 Summary: Confirmed the data source is sales_q2.csv, covering three dimensions: region, product and sales volume.
- Round 3 Summary: Decided to use Python for analysis, divided into data cleaning, grouping analysis and predictive modeling.
- Round 4 Summary: Data cleaning completed; 12 abnormal data entries found in the East China region, pending confirmation.
When the model only needs to track overall task progress, these summaries are sufficient. If the agent needs to verify specific details from a given historical round (such as the exact analysis steps decided in Round 3), it can locate the complete record through the IRI attached to that summary. The IRI points to the storage address of the original content in the database, for example memory:session-042/block-003.
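As a minimal sketch of how a summary and its IRI might travel together, consider the structure below. The names `HistoryEntry`, `make_iri` and `render_history` are hypothetical, and the IRI pattern only mirrors the shape of the example above; the project's actual scheme may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryEntry:
    round_no: int
    summary: str  # short text kept in the main context
    iri: str      # address of the full record in the external store


def make_iri(session_id: str, round_no: int) -> str:
    # Assumed naming pattern, modeled on memory:session-042/block-003.
    return f"memory:session-{session_id}/block-{round_no:03d}"


def render_history(entries: list[HistoryEntry]) -> str:
    """Build the history text: one summary line per round, followed by its IRI."""
    return "\n".join(
        f"Round {e.round_no} Summary: {e.summary} [{e.iri}]" for e in entries
    )


entries = [
    HistoryEntry(3, "Decided to use Python for analysis.", make_iri("042", 3)),
    HistoryEntry(4, "Data cleaning completed; 12 abnormal entries pending.", make_iri("042", 4)),
]
print(render_history(entries))
```

Only the rendered lines enter the prompt. The full text of each round lives elsewhere and is reachable through the bracketed IRI.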
2.2 Technical Implementation with Graph Database
The system uses JSON-LD as the unified data address bus and Oxigraph, a high-performance graph database, to store all original dialogue records. When the model initiates a detail query, it calls the built-in graph query tool and executes standard SPARQL statements to retrieve the complete content corresponding to the IRI:
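PREFIX mem: <http://example.org/mem#>  # placeholder namespace; substitute the project's actual prefix IRI
SELECT ?content WHERE {
  <memory:session-042/block-003> mem:content ?content .
}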
This mechanism realizes the decoupling of "lightweight index" and "heavyweight raw data". Although a small number of additional tokens are consumed to store summaries and IRIs in a single round, for long-running tasks with dozens of rounds, the overall token consumption is drastically reduced. The model no longer carries massive full texts, and the memory capability for historical details is completely retained.
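The lookup pattern itself is simple enough to sketch without a real graph database. The code below is a standard-library stand-in, not the Oxigraph API: `TinyTripleStore` and `fetch_details` are hypothetical names, and the predicate is kept as the plain string "mem:content" for illustration.

```python
from typing import Optional

Triple = tuple[str, str, str]  # (subject IRI, predicate, object value)


class TinyTripleStore:
    """In-memory stand-in for a graph database; NOT the Oxigraph API."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def select_object(self, subject: str, predicate: str) -> Optional[str]:
        """Similar in spirit to: SELECT ?o WHERE { <subject> predicate ?o }."""
        for s, p, o in self._triples:
            if s == subject and p == predicate:
                return o
        return None


def fetch_details(store: TinyTripleStore, iri: str) -> str:
    """Tool-style wrapper: resolve an IRI from a summary to its full content."""
    content = store.select_object(iri, "mem:content")
    return content if content is not None else f"No record found for {iri}"


store = TinyTripleStore()
store.add(
    "memory:session-042/block-003",
    "mem:content",
    "Full Round 3 transcript: use Python; steps are cleaning, grouping, modeling.",
)
print(fetch_details(store, "memory:session-042/block-003"))
```

In a real deployment the `select_object` call would be replaced by a SPARQL query against the graph database. The point of the sketch is that the main context only ever needs the IRI string.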
3. Core Design 2: Prompt Partitioning – Leveraging Human Attention Rules
LLMs show a positional bias similar to the primacy and recency effects in human memory: content at the beginning and end of a prompt is retained most reliably, while information in the middle is more easily overlooked. Many developers mix system settings, historical content and user input into a single undifferentiated prompt block, which often causes the model to lose track of its core role definition and execution rules.
Aiming at this characteristic, the project divides the entire prompt into four fixed sequential partitions, arranging different types of content according to priority to maximize the model’s information absorption efficiency:
- Fixed prompts (First position): Including role positioning, mandatory output formats and basic operation rules. These are the immutable "fundamental rules" of the agent. Placing them at the forefront ensures the model always clarifies its identity and execution standards, avoiding behavioral deviations.
- Dynamic prompts (Middle position): Covering 5W2H constraint rules, tool lists and accumulated experience. This content changes from task to task; placing it in the middle keeps it from displacing the core rules at the front.
- Historical summaries (After dynamic prompts): The aggregated multi-round summary content mentioned above, which lets the model understand the overall task progress without occupying core attention positions.
- Current user input (Last position): Utilizing the recency effect, the model prioritizes processing the latest user requirements, ensuring timely response to current tasks.
This partitioned layout standardizes the prompt structure, eliminates the chaos of mixed content, and effectively improves the model’s execution compliance and response accuracy.
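The four-partition ordering can be sketched as a single assembly function. The function name `build_prompt` and the section titles are hypothetical, chosen only for illustration; the source specifies the order of the partitions, not their labels.

```python
def build_prompt(fixed: str, dynamic: str, history: str, user_input: str) -> str:
    """Join the four partitions in their fixed order, skipping empty ones."""
    sections = [
        ("FIXED RULES", fixed),              # role, output format, base rules
        ("DYNAMIC CONTEXT", dynamic),        # constraints, tools, experience
        ("HISTORY SUMMARIES", history),      # per-round summaries with IRIs
        ("CURRENT USER INPUT", user_input),  # always last (recency effect)
    ]
    return "\n\n".join(
        f"## {title}\n{body.strip()}" for title, body in sections if body.strip()
    )


prompt = build_prompt(
    fixed="You are a data analysis agent. End every reply with a summary.",
    dynamic="Tools: query_graph",
    history="Round 1 Summary: The user needs to analyze Q2 sales data.",
    user_input="Which products drove the decline in South China?",
)
print(prompt)
```

Keeping the order in one place means callers cannot accidentally place user input ahead of the fixed rules, whichever parts happen to be empty in a given turn.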
4. Core Design 3: Background Batch Extraction – Reusing User Input Value
User input contains a large amount of core requirements, constraint conditions and decision information, which is a valuable data source easily overlooked by many Agent systems. This project builds an independent background batch processing mechanism to fully mine the value of user dialogue content.
The system adopts a sliding window to collect user inputs from recent rounds. After accumulating a certain amount of content, it automatically invokes the LLM to extract structured key information. The extracted results are distributed to two storage destinations:
- L2 Blackboard: Stores active key information of the current task. These highlights are directly injected into the real-time context of subsequent agent execution, so the model can quickly obtain core requirements without repeatedly traversing historical user conversations.
- L0 Knowledge Graph: Stores the entities, logical relationships and key decisions extracted from user inputs as independent IRI nodes for permanent retention. When a task of the same type starts again, the agent can retrieve the accumulated experience directly from the knowledge graph.
This design puts user input to a second use: it enriches the context while keeping token volume under control, and over time lets the agent accumulate reusable task experience.
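A minimal sketch of the sliding-window batching follows, under several assumptions. The class name, the window and threshold sizes, and the `memory:fact-NNN` IRI pattern are all hypothetical. The LLM call is replaced by a keyword-based stub, and the batch runs synchronously here rather than in a background worker.

```python
from collections import deque
from typing import Callable

Extractor = Callable[[list[str]], dict]


class BackgroundBatchExtractor:
    def __init__(self, extract: Extractor, window_size: int = 5, batch_threshold: int = 3):
        self._window: deque[str] = deque(maxlen=window_size)  # recent user inputs
        self._pending = 0                                     # inputs since last batch
        self._threshold = batch_threshold
        self._extract = extract
        self.blackboard: list[str] = []            # L2: active highlights
        self.knowledge_graph: dict[str, str] = {}  # L0: IRI -> stored fact

    def on_user_input(self, text: str) -> bool:
        """Record one input; run extraction once enough inputs have accumulated."""
        self._window.append(text)
        self._pending += 1
        if self._pending < self._threshold:
            return False
        result = self._extract(list(self._window))
        self.blackboard = result.get("highlights", [])
        for fact in result.get("facts", []):
            # Overlapping windows can surface the same fact twice; skip duplicates.
            if fact not in self.knowledge_graph.values():
                iri = f"memory:fact-{len(self.knowledge_graph) + 1:03d}"
                self.knowledge_graph[iri] = fact
        self._pending = 0
        return True


def stub_extract(inputs: list[str]) -> dict:
    """Stand-in for the LLM extraction call: keep sentences stating a constraint."""
    hits = [t for t in inputs if "must" in t.lower()]
    return {"highlights": hits, "facts": hits}


extractor = BackgroundBatchExtractor(stub_extract)
for message in ["Analyze Q2 sales.", "The report must cover East China.", "Use sales_q2.csv."]:
    extractor.on_user_input(message)
print(extractor.blackboard)
```

Passing the extractor in as a callable keeps the batching logic testable without a model, and makes it straightforward to move the call onto a background thread or task queue later.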
5. Core Design 4: Large Tool Result Processing – Knowledge Graph Instead of Simple Truncation
When the agent calls external tools such as SQL queries and web crawlers, it often receives very large results (such as thousands of lines of logs or hundreds of search records). Traditional approaches usually truncate the text directly or compress it into a plain summary. However, truncation may cut off critical details, and a plain summary loses the original data structure, making it impossible for the model to verify specific content later.
The solution adopted here is to convert large tool execution results into local knowledge graphs:
- Split massive data into independent nodes, and assign a unique IRI to each node;
- Only retain the overall statistical summary and graph query entry in the main context, without importing the original full data;
- When the model needs to view detailed data, call the query_graph tool to access the corresponding node via IRI.
Practical Case
Suppose a SQL query returns 2000 rows of sales data. The system will generate a local knowledge graph for all data rows. The content injected into the main context is only a short summary and query entrance:
The query returns 2000 pieces of data. Summary: East China's sales increased by 35%, while South China's sales decreased by 12%. You can query detailed data via query_graph(IRI: result-042).
If the agent needs to confirm which specific products caused the decline in South China's sales, it can initiate a graph query to obtain complete details. For extremely oversized results that take a long time to generate knowledge graphs, the system adds a chunking fallback mechanism: split the data into multiple blocks, generate summaries and IRIs for each block, and only record the IRI list in the context. This balances operational efficiency and data completeness.
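The node-per-row conversion and the chunking fallback can be sketched with a plain dictionary standing in for the graph store. The tool name query_graph comes from the text above, but the Python signatures, the helper names `store_tool_result` and `store_in_chunks`, and the `row-NNNN` and `chunk-NNN` IRI patterns are assumptions. The statistical summary, which would come from an aggregation step or an LLM, is omitted.

```python
from typing import Any


def store_tool_result(rows: list[dict[str, Any]], result_id: str,
                      store: dict[str, Any]) -> str:
    """Store each row as its own node; return only a short stub for the context."""
    root = f"result-{result_id}"
    for i, row in enumerate(rows):
        store[f"{root}/row-{i:04d}"] = row
    return (f"The query returns {len(rows)} rows. "
            f"You can query detailed data via query_graph(IRI: {root}).")


def query_graph(store: dict[str, Any], iri: str) -> list[Any]:
    """Return the node at `iri`, or every node beneath it when `iri` is a root."""
    if iri in store:
        return [store[iri]]
    return [node for key, node in sorted(store.items()) if key.startswith(iri + "/")]


def store_in_chunks(rows: list[Any], result_id: str,
                    store: dict[str, Any], chunk_size: int) -> list[str]:
    """Fallback for oversized results: one node per block of rows."""
    iris = []
    for start in range(0, len(rows), chunk_size):
        iri = f"result-{result_id}/chunk-{start // chunk_size:03d}"
        store[iri] = rows[start:start + chunk_size]
        iris.append(iri)
    return iris


rows = [{"region": "South China", "product": f"P{i}", "sales": 100 - i} for i in range(2000)]
graph: dict[str, Any] = {}
print(store_tool_result(rows, "042", graph))
print(query_graph(graph, "result-042/row-0007"))
```

The stub string is all that enters the prompt, regardless of whether the result has 20 rows or 2000. In the chunked variant, only the returned IRI list would be recorded in the context.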
6. Complete Context Assembly and Runtime Workflow
Combining all the above modules, the project forms a standardized full-process context construction and execution workflow, which is jointly undertaken by components such as Prompt Builder, L2 Blackboard, L3 Projection Engine and AgentRunner:
- The Prompt Builder loads fixed prompts (role, format) and dynamic prompts (constraints, tools) in sequence according to the partition rules;
- Obtains key highlights extracted by background batches from the L2 Blackboard and injects them into the middle area of the prompt;
- Reads historical summaries and supporting IRI lists, and splices them after dynamic content;
- Appends the latest user input at the end to form a complete prompt and sends it to the LLM for execution;
- After the LLM outputs the result, the system splits the content: the generated summary is written back to the L1 historical record, and the complete original content is stored in the L0 knowledge graph with a new IRI.
The whole workflow forms a closed loop of "assembly - execution - storage - iteration", realizing automatic iteration and continuous optimization of the context system during long-term operation.
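One turn of this loop can be sketched as follows, with heavy simplification. The `SUMMARY:` marker used to separate the summary from the full reply is a hypothetical output convention, the LLM is any callable that maps a prompt to text, and the assembly step is reduced to joining history lines with the new input.

```python
from typing import Callable


def split_reply(reply: str) -> tuple[str, str]:
    """Split a reply into (full body, summary) at the last 'SUMMARY:' marker."""
    body, marker, summary = reply.rpartition("SUMMARY:")
    if not marker:  # marker missing: fall back to a truncated copy as the summary
        text = reply.strip()
        return text, text[:200]
    return body.strip(), summary.strip()


def run_turn(user_input: str, history: list[str], raw_store: dict[str, str],
             llm: Callable[[str], str], session_id: str) -> str:
    """Assemble, execute, then store; returns the IRI of the new full record."""
    round_no = len(history) + 1
    prompt = "\n".join([*history, f"User: {user_input}"])  # assembly (simplified)
    body, summary = split_reply(llm(prompt))               # execution
    iri = f"memory:session-{session_id}/block-{round_no:03d}"
    raw_store[iri] = body                                  # full text to the L0 store
    history.append(f"Round {round_no} Summary: {summary} [{iri}]")  # summary to L1
    return iri


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return "Removed 12 abnormal rows and normalized region names.\nSUMMARY: Data cleaning completed."


history: list[str] = []
raw_store: dict[str, str] = {}
run_turn("Please clean the data.", history, raw_store, fake_llm, "042")
print(history[0])
```

Each call appends exactly one summary line to the history and one full record to the store, so the next turn's prompt is built from the summaries alone.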
7. Four Major Core Advantages and Summary
The entire set of context optimization designs can be summarized into four core points, which together solve the three major pain points of token bloat, memory loss and structural chaos in traditional Agent context:
- Summary + IRI reference for history: Reduces the token growth of long-cycle tasks from O(n) to nearly O(1), and supports on-demand retrieval of historical details through graph databases.
- Fixed partition prompt layout: Utilizes the LLM’s attention characteristics to ensure core rules and task requirements are not ignored.
- Background batch extraction: Makes fuller use of user input, so that key information is available in real time and is also accumulated permanently.
- Knowledge graph for large results: Abandons simple truncation and compression, retains data structure and details while controlling context size.
The essence of this set of designs is to transform the agent’s "working desktop": instead of stacking all raw materials on the desktop, only place concise indexes. All detailed content is stored in an independent structured database. The model queries details through indexes when needed, which is both efficient and orderly. This set of architectures has been open-sourced on GitHub as part of the Gliding Horse project, providing a complete reference for Agent developers.




