Two dark clouds over Agent: real-time interaction with the environment and learning from experience

date
Oct 19, 2025
slug
two-dark-clouds-over-agent
status
Published
tags
Agent
summary
Notes on boj's talk about the two dark clouds over Agent
type
Post
Speaker: Co-Founder & Chief Scientist, Pine AI
 
The challenge of real-time interaction
  • High latency in voice interaction (tens of seconds)
  • GUI operation is 3-5 times slower than human actions
  • The serial bottleneck of the traditional ReAct loop
Technical breakthrough
  • SEAL architecture (Streaming, Event-driven Agent Loop)
    • Perception layer: streaming processing of speech signals
    • Thinking layer: Interactive ReAct with asynchronous observation, thinking, and action
    • Execution layer: VLA / TTS with a feedback loop
 
The challenge of learning from experience
Core challenges
  • Every task starts from scratch
  • Unable to accumulate domain knowledge
  • Lack of proficiency improvement
 
Three major paradigms of agent learning from experience
  1. Post-training: RL parameter update
  2. In-context Learning: attention soft update
  3. Externalized Learning:
      • RAG: persistent experience storage
      • Tool Generation: agent self-evolution
 
Shunyu Yao pointed out two issues: first, the lack of interaction with real people during an agent's task execution; second, the absence of a mechanism for learning from experience.
(So I went to read that blog)

The Second Half - Shunyu Yao

In the first half, we kept developing new training methods and models, scoring well on benchmarks, then building harder benchmarks and scoring well on those too, cycling through this process repeatedly. Eventually we found an effective method capable of generalization: reinforcement learning.
This recipe has been largely standardized and requires little new thinking: as long as the cycle above keeps running, performance keeps improving. That is exactly why a fundamental rethinking of evaluation is necessary.
The issue is that despite using AI to defeat world champions in chess and Go, surpass most humans on the SAT and bar exams, and achieve gold-medal levels in competitions, the world hasn't changed much—at least from an economic or GDP perspective.
The author refers to it as the utility problem.
Previous evaluation settings differ from the real-world setup in many ways. Two examples:
  • Evaluations should run automatically. Typically, an agent receives a task and acts autonomously, subsequently earning a task reward. However, in reality, the agent must continuously interact with humans throughout the entire task process—you can’t just send an extremely long message to customer support, wait ten minutes, and expect to receive a detailed reply that solves all your problems.
  • The evaluation "should" follow the independent and identically distributed (i.i.d.) principle. If the test set contains 500 tasks, each task must be executed independently, and the overall evaluation result is derived by aggregating task metrics. However, in reality, task processing tends to be sequential rather than parallel. As Google engineers become more familiar with the codebase, their ability to handle Google3 issues continuously improves; meanwhile, software engineering agents—even when addressing numerous problems within the same codebase—fail to achieve such incremental progress. We clearly need long-term memory mechanisms (existing methods already enable this), but academia lacks both suitable benchmarks to validate its necessity and the academic courage to question the foundational assumption of machine learning: the i.i.d. hypothesis.
In the first half of AI development, these assumptions were fine for building benchmarks, because when AI capability was still low, raising intelligence usually raised utility as well. Now that a general recipe reliably works under these assumptions, the key to playing the second half is to:
  • Develop novel evaluation settings or tasks for practical applications.
  • Solve them with the established recipe, or refine the recipe by introducing novel components. Repeat this cycle.
While the first half of the game is filled with incremental methods and models, the second half will, to some extent, filter them out: unless new premises that break the standard assumptions can be established, the general recipe will completely overshadow those incremental methods, and only by breaking the assumptions does truly disruptive research become possible.
I also came across an expression that struck me as incredibly clever; I absolutely adore the following passage:
Thinking, or reasoning, is a strange kind of action - it does not directly affect the external world, yet the space of reasoning is open-ended and combinatorially infinite — you can think about a word, a sentence, a whole passage, or 10000 random English words, but the world around you doesn’t immediately change. In the classical RL theory, it is a terrible deal and makes decision-making impossible. Imagine you need to choose one out of two boxes, and there’s only one box with $1M and the other one empty. You’re expected to earn $500k. Now imagine I add infinite empty boxes. You’re expected to earn nothing. But by adding reasoning into the action space of any RL environment, we make use of the language pre-training priors to generalize, and we afford to have flexible test-time compute for different decisions. It is a really magical thing and I apologize for not fully making sense of it here, I might need to write another blog post just for it. You’re welcome to read ReAct for the original story of reasoning for agents and read my vibes at the time. For now, my intuitive explanation is: even though you add infinite empty boxes, you have seen them throughout your life in all kinds of games, and choosing these boxes prepares you to better choose the box with money for any given game. My abstract explanation would be: language generalizes through reasoning in agents.

Section 1: Agent interacting with the environment in real time

Real-time interaction challenges of voice agents

Fundamental contradiction: Serial processing vs. real-time requirements

  • Must wait: first listen, then think; only after thinking can one speak.
  • Blocking wait: every stage becomes a bottleneck
    • user finishes speaking (VAD) → speech recognition (ASR) → complete sentence
    • complete sentence → LLM with thinking → complete output after thinking
    • complete thinking → split into sentences → speech synthesis (TTS) → voice response
  • Cumulative delay: the total delay far exceeds human tolerance (a rough illustration follows below)
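
To see why the serial pipeline feels slow, the sketch below simply sums hypothetical per-stage latencies. The numbers are made-up assumptions for illustration, not measurements; the point is only that every stage blocks the next one.

```python
# Hypothetical per-stage latencies (seconds) in a fully serial voice pipeline.
# All numbers are illustrative assumptions, not measurements from any real system.
stages = {
    "wait for end of speech (VAD)": 0.8,
    "speech recognition (ASR)": 0.6,
    "LLM thinking + full answer": 4.0,
    "sentence split + TTS": 1.2,
}

total = sum(stages.values())
for name, latency in stages.items():
    print(f"{name:32s} {latency:4.1f} s")
print(f"{'TOTAL (serial)':32s} {total:4.1f} s")  # each stage must finish before the next starts
```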

The dilemma of fast versus slow response

A fast response makes mistakes easily, while a slow response burns the user's patience.
The agent cannot anticipate and deliberate while listening.

technology bottleneck

perception phase
  • Voice: waiting for the entire sentence to end before converting to text results in high latency; feeding fragmented speech into the speech recognition model leads to low recognition accuracy.
  • Vision: high prefill latency for 2K-token screenshots
thinking phase
  • Complete input is required to think.
  • Unable to predict user intent.
  • Test-time scaling exacerbates the delay.
execution phase
  • Can only act after thinking ends.
  • Every step of the GUI operation requires taking a new screenshot for consideration.

Architecture innovation: SEAL (Streaming, Event-driven Agent Loop)

Core idea: abstract all interactions into asynchronous event streams to achieve low-latency, interruptible real-time interaction (a minimal sketch of such a loop follows below).
  1. Perception layer
Converts continuous real-world signals (speech, GUI video) into discrete event streams
  2. Thinking layer
Asynchronous event processing: think while listening, speak while thinking, and generate interleaved sequences of thought and action.
  3. Execution layer
Converts discrete action commands back into continuous real-world signals (TTS voice, mouse movements)
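To make the three layers concrete, here is a minimal sketch of an event-driven agent loop using Python's asyncio. Everything in it (the Event type, the hard-coded event sequence, the canned prelude reply) is my own illustration under assumptions, not Pine AI's actual SEAL implementation.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # e.g. "speech_fragment", "interrupt", "ui_change"
    payload: str

async def perception(queue: asyncio.Queue) -> None:
    """Perception layer: turn continuous signals into discrete events (stubbed here)."""
    for kind, payload in [("speech_start", ""), ("speech_fragment", "I'd like to..."),
                          ("interrupt", ""), ("speech_fragment", "actually, cancel that")]:
        await queue.put(Event(kind, payload))
        await asyncio.sleep(0.1)           # simulate events arriving in real time

async def thinking(queue: asyncio.Queue, actions: asyncio.Queue) -> None:
    """Thinking layer: consume events asynchronously, emit interleaved thoughts/actions."""
    context = []
    while True:
        event = await queue.get()
        context.append(event)                        # keep partial context across interrupts
        if event.kind == "speech_fragment":
            await actions.put("speak('Mm-hmm')")     # prelude while deeper thinking continues
        elif event.kind == "interrupt":
            await actions.put("stop_speaking()")     # interruptible: no restart from scratch

async def execution(actions: asyncio.Queue) -> None:
    """Execution layer: turn action commands back into continuous signals (stubbed here)."""
    while True:
        command = await actions.get()
        print("EXECUTE:", command)                   # would drive TTS / the mouse in a real system

async def main() -> None:
    events, actions = asyncio.Queue(), asyncio.Queue()
    workers = [asyncio.create_task(thinking(events, actions)),
               asyncio.create_task(execution(actions))]
    await perception(events)
    await asyncio.sleep(0.5)
    for w in workers:
        w.cancel()

asyncio.run(main())
```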

Layer 1 perception layer

Input: continuous signals (voice stream, GUI video stream)
Output: speech_start, interrupt, laugh, speech_fragment, ui_change, etc.
A streaming speech perception model replaces VAD + ASR
Streaming speech-aware models based on open-source autoregressive LLMs
  • Unlike traditional ASR models such as Whisper, it is built on an autoregressive LLM and processes speech in a streaming fashion, reducing speech recognition latency.
    • Streaming processing of input speech tokens
    • Streaming output of text and acoustic events
  • Post-trained from an open-source LLM
    • Retaining dialogue context and supporting in-context learning significantly improve the recognition accuracy of user personal information and domain-specific terminology.
    • With world knowledge and common sense, the recognition rate for brand names, amounts, etc., is significantly improved.
 
The output information is rich, encompassing not only text but also acoustic events.
Real-time transcription text segments
Special tokens (acoustic events), as illustrated by the toy parser after this list:
  • <speak_start>
  • <speak_end>
  • <interrupt>
  • <emotion:happy>
  • <laugh><sigh>
  • <music>
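
To illustrate how a downstream consumer might separate transcript text from acoustic-event tokens, here is a toy parser. The token names follow the list above, but the example stream and the parsing logic are my own assumptions.

```python
import re

# Hypothetical streamed output of the perception model: text interleaved with special tokens.
stream = ("<speak_start> I want to cancel my <emotion:happy> subscription "
          "<laugh> next month <speak_end>")

TOKEN = re.compile(r"<([^<>]+)>")

def split_stream(chunk: str):
    """Yield ('event', name) for special tokens and ('text', word) for transcript words."""
    pos = 0
    for match in TOKEN.finditer(chunk):
        for word in chunk[pos:match.start()].split():
            yield ("text", word)
        yield ("event", match.group(1))
        pos = match.end()
    for word in chunk[pos:].split():
        yield ("text", word)

for kind, value in split_stream(stream):
    print(kind, value)
```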

Layer 2: Thinking Layer

Based on an event-driven loop, it enables interruptible, asynchronous listening while thinking and speaking while thinking.
Input
Discrete event stream (from the event queue)
Output
Interleaved thoughts and action commands

Core innovation: Interactive ReAct

Traditional ReAct vs. Interactive ReAct:
(diagrams comparing the traditional ReAct loop with the Interactive ReAct loop)

Interactive ReAct: Think while Listening

Traditional ReAct: once interrupted, all previous thoughts are invalidated and thinking must restart from the beginning.
Interactive ReAct: the interrupted thought process is preserved; after the new user input is appended, the model continues thinking from the previous context.

Interactive ReAct: Speak while Thinking

Use "preludes" to buy time for deeper thinking about the event and to reduce first-word latency (a sketch of both mechanisms follows).
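
Here is a minimal sketch of both ideas: preserving an interrupted thought, and emitting a prelude while deeper reasoning continues. The message format and the canned preludes are hypothetical, not the actual Pine AI implementation.

```python
from typing import Dict, List

def handle_interrupt(context: List[Dict], new_user_fragment: str) -> List[Dict]:
    """Interactive ReAct: keep the partial thought and append the new input,
    instead of discarding everything and restarting (traditional ReAct)."""
    context.append({"role": "user", "content": new_user_fragment})
    context.append({"role": "assistant",
                    "content": "(continuing previous reasoning with the new information)"})
    return context

def prelude_for(event_kind: str) -> str:
    """Speak while thinking: emit a short, safe prelude immediately,
    while the deeper reasoning runs asynchronously."""
    return {"speech_fragment": "Sure, let me check that for you...",
            "interrupt": "Oh, okay."}.get(event_kind, "One moment...")

# Example: the partial thought survives the interruption.
context = [{"role": "assistant", "content": "Thought: the user probably wants to cancel..."}]
context = handle_interrupt(context, "Actually I just want to change the billing date.")
print(prelude_for("speech_fragment"))
print(context[-1])
```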

Layer 3: Execution Layer

Convert discrete action commands into continuous real-world signals.
Input
speak(…), click(…)
Output
Continuous signals (voice waveform, mouse trajectory)

The last mile of GUI operation

The agent struggles to output coordinates. Solution: draw inspiration from VLA models in robotics and post-train the model with RL so that it can directly output actions.
  • Option 1: the main model directly outputs mouse click coordinates.
  • Option 2: train a standalone VLA model to mimic human mouse movement patterns, adopting a closed-loop feedback model of "move, fine-tune, click" (a sketch of this loop follows below).
More human-like speech synthesis: generate labeled text first, then produce speech with TTS.
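The closed-loop idea from Option 2 can be sketched as follows. `locate_target` stands in for a vision/VLA call and the actuators are stubs, so this only illustrates the observe-move-refine-click loop, not Pine AI's model.

```python
def locate_target(screenshot, cursor):
    """Hypothetical VLA call: estimated offset (dx, dy) from the cursor to the target."""
    target = (640, 360)                                   # stand-in for a detected button on screen
    return (target[0] - cursor[0], target[1] - cursor[1])

def move_click_loop(take_screenshot, move_mouse, click, tolerance=3.0, max_steps=10):
    """Closed-loop 'move, fine-tune, click': re-observe after every move
    instead of trusting a single predicted coordinate."""
    cursor = (0.0, 0.0)
    for _ in range(max_steps):
        dx, dy = locate_target(take_screenshot(), cursor)
        if abs(dx) < tolerance and abs(dy) < tolerance:
            click()
            return True
        cursor = move_mouse(cursor, dx * 0.8, dy * 0.8)   # move most of the way, then refine
    return False

# Minimal stubs so the sketch runs end-to-end.
ok = move_click_loop(take_screenshot=lambda: None,
                     move_mouse=lambda c, dx, dy: (c[0] + dx, c[1] + dy),
                     click=lambda: print("click"))
print("clicked:", ok)
```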

Section 2: Agent learning from experience

Paradigm 1: Post-Training
Method: parameter update (post-training)
  • Update weights through gradient descent
  • Requires a large amount of annotated data
  • The model is fixed after training
  • The learning process is slow and expensive
Paradigm 2: In-Context Learning
Method: in-context learning
  • Implicit learning through the attention mechanism
  • Uses long context as temporary memory
  • Learning effects are limited to the current conversation and are not permanent
Paradigm 3: Externalized Learning
Method: externalizing knowledge and processes
  • RAG: efficient, reliable, hallucination-free knowledge
  • Tool generation: codify processes to achieve self-evolution
  • Transcends the limitations of parametric knowledge
Best practice: contextual embeddings + contextual BM25 + reranking + top-20 chunks
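A minimal sketch of that retrieval recipe: both scorers and the reranker are passed in as toy functions so the merge-then-rerank flow stays visible; a real system would use an embedding model and a BM25 index built over context-augmented chunks.

```python
def hybrid_retrieve(query, chunks, embed_score, bm25_score, rerank, top_k=20):
    """Contextual embeddings + contextual BM25, merged, reranked, top-k returned.
    The scorers are injected; in practice they would be an embedding model and a BM25
    index built over chunks that were prepended with document-level context."""
    semantic = sorted(chunks, key=lambda c: embed_score(query, c), reverse=True)[:150]
    lexical = sorted(chunks, key=lambda c: bm25_score(query, c), reverse=True)[:150]
    candidates = list(dict.fromkeys(semantic + lexical))   # merge, preserve order, dedupe
    return rerank(query, candidates)[:top_k]

# Toy scorers so the sketch runs; they only count overlapping words.
chunks = ["refund policy: 30 days", "shipping takes 5 days", "refunds need a receipt"]
score = lambda q, c: len(set(q.split()) & set(c.split()))
top = hybrid_retrieve("refund policy days", chunks, score, score,
                      rerank=lambda q, cs: sorted(cs, key=lambda c: score(q, c), reverse=True))
print(top)
```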
 
Fine-tuning vs. RAG: An Empirical Comparison of Knowledge Injection Methods
Based on the paper: Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
Core insight of the paper: RAG is not only more effective but also avoids the issues of knowledge forgetting and hallucinations that may arise from fine-tuning.
 
Tool Generation - Enabling Agent Self-Evolution
Minimum predefinition principle
  • Minimalist architecture: equipped with only a single core component (Web proxy)
  • Avoid over-engineering: do not presuppose complex tools and workflows
  • Generality first: reduce domain-specific hardcoding
Maximum self-evolution mechanism
Core abilities (a sketch of this loop follows the list)
  1. Self-create tools: generate new tools based on task requirements
  2. Capability enhancement: iteratively improve the performance of existing tools
  3. Experience reuse: solidify successful patterns into reusable components
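
A heavily simplified sketch of the create-then-reuse loop: `llm_generate_tool_code` is a hard-coded stand-in for an LLM writing a new tool, and the registry is an in-memory dict standing in for persistent storage.

```python
registry = {}   # persistent store of tools the agent has created (could live on disk)

def llm_generate_tool_code(task_description: str) -> str:
    """Stand-in for an LLM call that writes a new tool; hard-coded for this demo."""
    return (
        "def parse_invoice_total(text):\n"
        "    \"\"\"Extract the total amount from an invoice-like string.\"\"\"\n"
        "    import re\n"
        "    m = re.search(r'total[:\\s$]*([0-9]+\\.[0-9]{2})', text, re.I)\n"
        "    return float(m.group(1)) if m else None\n"
    )

def get_or_create_tool(name: str, task_description: str):
    """Experience reuse: return an existing tool, or generate and solidify a new one."""
    if name not in registry:
        namespace = {}
        exec(llm_generate_tool_code(task_description), namespace)   # register generated code
        registry[name] = namespace[name]
    return registry[name]

tool = get_or_create_tool("parse_invoice_total", "extract totals from invoices")
print(tool("Invoice #42, Total: $107.10"))   # a second call would reuse, not regenerate
```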
 
MCP-Zero Active Tool Discovery
Traditional methods' dilemma:
  • Full injection: the complete toolset occupies a large number of tokens → context explosion
  • Static retrieval: selection is based only on the initial query and cannot anticipate how the task evolves; e.g., debugging a file requires file system + code analysis + command execution.
MCP-Zero: from passive to active
Core concept: enable the agent to proactively identify capability gaps and request tools on demand (a sketch follows the list)
  1. Active tool request: the agent generates structured requirements
  2. Hierarchical semantic routing: first filter servers, then match tools
  3. Iterative capability expansion: dynamically discover and build toolchains during execution
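
A hedged sketch of this idea as described above: the agent emits a structured capability request, and a two-stage router filters servers before matching tools. The registry contents and the word-overlap scoring are invented for illustration and do not reflect MCP-Zero's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class ToolRequest:
    server_hint: str     # which kind of MCP server the agent thinks it needs
    capability: str      # natural-language description of the missing capability

# Toy registry: server -> {tool_name: description}. Real MCP servers expose far more.
SERVERS = {
    "filesystem": {"read_file": "read a file from disk", "list_dir": "list directory contents"},
    "code-analysis": {"find_refs": "find references to a symbol", "lint": "run a linter"},
    "shell": {"run_command": "execute a shell command"},
}

def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def route(request: ToolRequest, top_servers: int = 2):
    """Hierarchical semantic routing: filter servers first, then match tools inside them."""
    servers = sorted(SERVERS, key=lambda s: overlap(request.server_hint, s), reverse=True)
    best = []
    for server in servers[:top_servers]:
        for tool, desc in SERVERS[server].items():
            best.append((overlap(request.capability, desc), server, tool))
    return sorted(best, reverse=True)[0]

# The agent notices a capability gap while debugging and requests a tool on demand.
print(route(ToolRequest(server_hint="filesystem", capability="read a file from disk")))
```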
 
Externalizing learning to transcend the limitations of attention is an inevitable trend.
 
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. (Rich Sutton, The Bitter Lesson)
