Summary
A high-signal panel on the future of AI agents, featuring leading voices from academia and industry exploring robustness, recursive improvement, and standards for real-world deployment.
The panel brought together Zhou Yu, Alane Suhr, Rebecca Qian, Robert Parker, Vinay Rao, and Shunyu Yao, alongside a technical audience of researchers, founders, and engineers from top academic labs and startups.
1. Robustness Before Demos
- Model brittleness & entropy: Robert Parker noted that small prompt or latency changes can destabilize agent behavior. Without consistency guarantees tying each subtask's output to the next subtask's input, multi-step agents accumulate errors and drift into incoherent, entropic behavior.
- Agent OS: Parker called for OS-level semantics—shared memory, error correction, formal task graphs—comparable to early operating system breakthroughs.
- Tool-call verifiability: Vinay Rao emphasized the burden of verifying when and how to call external tools, and of rolling back when a call returns unreliable output (a minimal sketch of this pattern follows below).
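The verification-and-rollback pattern Rao described can be sketched as a guard around every tool call: validate the arguments before calling, check the result afterward, and restore a snapshot of agent state if the check fails. The code below is a minimal illustration only; `guarded_tool_call`, the `state` dict, and both validators are hypothetical names, not an API anyone on the panel referenced.

```python
from copy import deepcopy

class ToolCallError(Exception):
    """Raised when a tool call cannot be verified or has been rolled back."""

def guarded_tool_call(state, tool, args, validate_args, validate_result):
    """Call an external tool with pre/post validation and state rollback.

    state           -- mutable agent state (scratchpad, working memory, ...)
    tool            -- callable wrapping the external tool
    validate_args   -- predicate checked before the call is made
    validate_result -- predicate checked on the tool's output
    """
    if not validate_args(args):
        raise ToolCallError(f"refusing to call {tool.__name__}: invalid args {args!r}")

    snapshot = deepcopy(state)              # rollback point for in-memory state
    try:
        result = tool(**args)
        if not validate_result(result):
            raise ToolCallError(f"{tool.__name__} returned an unverifiable result")
        state.setdefault("observations", []).append(result)   # commit only after verification
        return result
    except Exception:
        state.clear()
        state.update(snapshot)              # roll back: discard any partial effects
        raise
```

Rollback here only covers the agent's own state; undoing a tool's external side effects (a sent email, a written file) is the harder part of the problem and is not addressed by this sketch.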
2. Measuring What Matters
- Beyond leaderboards: Rebecca Qian compared agent evaluation to autonomous driving—it must consider workflows, safety, and real-world dynamics.
- Community benchmarks: Shunyu Yao emphasized shared benchmarks like InterCode and FinanceBench to enable comparable, domain-specific evaluation (a generic workflow-level harness is sketched below).
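To make "evaluate workflows, not prompts" concrete, a benchmark harness in that spirit scores an agent on a full multi-step episode and judges the final state of the environment rather than any single model reply. The sketch below is generic; the `Task` fields and `run_episode` loop are invented for illustration and do not reflect the actual interfaces of InterCode or FinanceBench.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One end-to-end episode: an initial environment plus a success check on the final state."""
    name: str
    initial_env: dict
    success: Callable[[dict], bool]
    max_steps: int = 20

def run_episode(agent_step: Callable[[dict], dict], task: Task) -> bool:
    """Run the agent until it declares itself done or exhausts the step budget."""
    env = dict(task.initial_env)
    for _ in range(task.max_steps):
        action = agent_step(env)                    # agent observes env, returns an action dict
        if action.get("done"):
            break
        env.update(action.get("effects", {}))       # apply the action's effects to the environment
    return task.success(env)                        # judged on the workflow outcome, not a reply

def evaluate(agent_step: Callable[[dict], dict], tasks: list) -> float:
    """Fraction of tasks whose end-to-end workflow succeeds."""
    return sum(run_episode(agent_step, t) for t in tasks) / len(tasks)
```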
3. Continuous Improvement & Recursive Agents
- Self-adjusting agents: Shunyu Yao and Vinay Rao discussed agents that run tests, refine themselves, and avoid reward hacking through recursive feedback (see the loop sketched after this list).
- Trusted recursion: Parker pointed to recursive reasoning and parsing as necessary for tasks current LLMs fail at, such as large-scale code refactoring.
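One way to make the "self-adjusting agent" idea concrete is a loop that proposes a revision, reruns a fixed external test suite, and accepts the revision only if it reduces the failure count; keeping the tests outside the agent's control is a minimal guard against reward hacking. The loop below is a hypothetical sketch: `propose_revision` and `run_tests` are placeholder callables, not part of any system the panelists described.

```python
def self_improve(artifact, propose_revision, run_tests, max_rounds=5):
    """Iteratively refine an artifact (code, plan, prompt) against a fixed test suite.

    propose_revision(artifact, failures) -> candidate artifact
    run_tests(artifact)                  -> list of failure descriptions ([] means all pass)

    The test suite is fixed and external: the agent may rewrite the artifact
    but never the tests, which is the minimal defence against reward hacking.
    """
    failures = run_tests(artifact)
    for _ in range(max_rounds):
        if not failures:
            return artifact, True                    # converged: every external check passes
        candidate = propose_revision(artifact, failures)
        candidate_failures = run_tests(candidate)
        if len(candidate_failures) < len(failures):  # accept only strict improvement
            artifact, failures = candidate, candidate_failures
        # otherwise discard the candidate and retry with the same feedback
    return artifact, not failures                    # best artifact found, and whether it fully passes
```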

4. Safety, Standards & Governance
- Brakes & protocols: Zhou Yu called for peer-style agent protocols that can reject unsafe calls; Rao likened current systems to “cars without brakes” (a hypothetical policy check is sketched after this list).
- Full-stack safety: Parker urged that future stacks must enforce process boundaries akin to hardware-level protections.
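The "brakes" metaphor maps naturally onto a policy layer that every outbound action must clear before it reaches a real tool or process boundary. The check below is one hypothetical shape for that layer (an allowlist plus banned argument patterns); none of the names come from the panel, and a production system would need far richer policies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentAction:
    tool: str
    args: dict

class UnsafeActionError(Exception):
    """Raised when the policy layer refuses to execute an action."""

# Hypothetical policy: only allowlisted tools may run, and some argument patterns are banned outright.
ALLOWED_TOOLS = {"search", "read_file", "run_tests"}
BANNED_SUBSTRINGS = ("rm -rf", "DROP TABLE")

def enforce_policy(action: AgentAction) -> AgentAction:
    """The 'brakes': every agent action passes through here before any tool executes it."""
    if action.tool not in ALLOWED_TOOLS:
        raise UnsafeActionError(f"tool {action.tool!r} is not on the allowlist")
    flat_args = " ".join(str(v) for v in action.args.values())
    if any(bad in flat_args for bad in BANNED_SUBSTRINGS):
        raise UnsafeActionError("arguments match a banned pattern")
    return action            # safe to hand off across the process boundary
```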
5. Academia × Industry Collaboration
- Data vs. compute asymmetry: Academia needs access to real-world usage data; startups need long-term evaluation frameworks.
- Joint projects: Collaborative models (e.g., CMU-style) help bridge speculative research and live deployments.
6. Vision & Futures
- Shunyu Yao: Agents as autonomous data scientists
- Rebecca Qian: Oversight agents for scalable evaluation
- Alane Suhr: Decomposable tasks, educational agents
- Zhou Yu: Exponential returns from self-improving agents
- Parker & Rao: Parser-driven recursive agents that understand their limits
Key Takeaways
- Robustness before demos: Stability and task coherence are foundational
- Tool calls cost trust: Each external call requires validation and observability
- Evaluate workflows, not prompts: End-to-end benchmarks matter
- Agent OS is coming: Process semantics > prompt hacks
- Safety accelerates adoption: Brakes don’t slow cars—they enable speed
- Collaboration compounds: Data-sharing and reproducible testbeds benefit all
