Timestamped index of the main-stage livestreams. Each row links to where the speaker is introduced. Timestamps & summaries derived from the video transcripts.
swyx frames the entire day around “loops” — his Loopcraft idea that AI engineering is about choosing which loop you work in, going up a level for scale and down a level for reliability. Sets up the day's “software factories” theme.
Castro breaks knowledge into intrinsic (in the model), extrinsic (retrieved) and learned, tracing the productivity curve from IntelliSense to Copilot to agents. He demos Microsoft Foundry's “agent optimizer,” which builds a real learning loop from an agent's own traces to capture each organization's differentiated knowledge.
OpenAI argues AI didn't end engineering but returned it to its problem-solving roots — “AI engineers are eating the world.” They note the accelerating release cadence (from ~15 months to ~6 weeks between models) and, instead of a live demo, bring on a special guest.
The OpenClaw “ClawFather,” now at OpenAI, describes going from babysitting 10 terminal windows to managing one long-running “manager” agent that delegates to a team — enabled by server-side compaction, coordination and triggers. His theme: the bottleneck keeps moving, from tokens to compute to human attention.
Joining remotely, Li introduces GLM-5.2 and explains that “GLM” (General Language Model) dates to a 2021 paper, making Z.ai one of the earliest large-model labs. He positions GLM as a top open-weights model that's strong well beyond coding.
Hugging Face's Thom Wolf interviews MiniMax's Olive Song about M3 — a ~400B-parameter (20B active) open model that also understands vision. They dig into its 1M-token context and new MiniMax Sparse Attention (MSA) architecture, framing MiniMax among China's top open “AI dragons.”
Diggs gives a brief intro to the conference's first Security Track and the shift from traditional app-sec to agentic security. He frames three obstacles — insecure AI-generated code, safely deploying autonomous agents, and the geopolitics of model access — and points people to the track. (The printed schedule listed Manoj Nair for this slot.)
A two-person talk on the one thing you lose when an agent misbehaves in production: reproducibility. Using a broker-API example where an agent sells 1,000 shares instead of $1,000 worth of shares, they show why turning temperature to zero doesn't make a broken reasoning path debuggable.
Kushan (ex-founding engineer at Sohm) argues browser agents underperform because the infra around the model is poor, not the model itself. He demos a compressed page representation that lets a cheaper model plan long action sequences and recover from failures far faster. (This window also carries other short demos.)
Tížková defines the “software factory” as the whole autonomous software lifecycle — collecting signals, prioritizing, orchestrating, validating and continuously improving — not just code generation. She argues writing code is the easy part, a swarm of coding agents isn't a factory, and organizations must rebuild from the ground up rather than bolt one on.
Mutagent applies the build-loop idea to building agents themselves: an offline loop (iterate, test, evaluate, improve) and an online loop (monitor production traces, diagnose, feed back). Their thesis is that doing this loop by hand doesn't scale to hundreds of agents, so the loop itself should run agentically.
Holtz argues for conducting an orchestra of agents rather than running an assembly line. Using his Conductor tooling, he covers a central “feed the beast” database of all company context (Slack, Discord, meetings) exposed via a SQL tool, “free-range” sandboxed agents, and carefully authored AGENTS.md / CLAUDE.md files.
As reviewer (not author) of code across tens of thousands of teams, Gupta shares what he found in over a million AI-generated pull requests. He covers how hard it is to even identify fully “vibe-coded” PRs and argues the future is agents simulating users to validate code — rather than humans reviewing endless slop.
Nori's CEO argues coding agents can do far more than write code — including visual artifacts. His insight: don't hand agents human tools like PowerPoint or Figma; give them the right medium (HTML), so a model that looks bad at spatial tasks (Simon Willison's pelican-SVG test) can build good decks end-to-end.
A 14-year mobile engineer asks why the promised 10x from AI agents hasn't materialized. Using the factory-electrification analogy — real gains came only when factories were redesigned around small distributed motors, not by swapping the steam engine — he argues we must redesign the whole workflow, not bolt agents onto the old one.
Pragmatic Engineer author Gergely Orosz hosts a deep-technical fireside with Turbopuffer founder/CEO Simon Eskildsen, from his origin story (PowerPoint, FrontPage, WoW-fueled English) to Turbopuffer's CPU-first architecture. Includes a comedic Jensen Huang “do you vape?” anecdote and why CPUs are surprisingly scarce at the hyperscalers.
Hou, who leads engineering on Google's Antigravity coding product, argues you should “get out of the model's way” — like giving Messi the ball. He walks through Antigravity 2.0 decoupling the IDE from a standalone agent manager (subagents, worktrees, scheduled tasks, voice) and the principle of “scaling with intelligence.”
Warp's founder — a 20-year engineer who hasn't written code in six months — argues software engineering is becoming “factory engineering,” moving from chat/autocomplete to interactive agents to full automation. He covers open-sourcing Warp (60k+ stars, 800k+ developers) and building self-improving factories.
OpenGov engineer Gabe demos “OG Assist,” an AI assistant embedded across OpenGov's government ERP products (budgeting, procurement, permitting). Agents make tool calls against product data and can read/act on the current screen, backed by automated evals in CI and deterministic human-in-the-loop approval gates.
Salomon (creator of AgentCraft and MCP-UI) argues humans — not models — are now the bottleneck, because steering and reviewing many agents is exhausting. His fix borrows from gaming: AgentCraft is an RTS/Sims-style orchestrator that represents each agent as a unit you can spawn and supervise.
Notion's AI engineering lead talks “Token Town”: building AI-native products sustainably so you don't go from “AI-pilled” to “AI-poor.” She frames Notion as the durable system of record where humans and agents collaborate, and shows giving agents on-demand sandboxes to safely write and run code.
Gupta describes BAML's provocative practices — no code reviews, everyone in parallel, no standard tooling — while shipping a programming language that can't tolerate slop. His answer: a tiny, stable architecture.md (not CLAUDE.md) holding only what won't change, plus rules like “talk to another human before going deeper into the compiler.”
Mistele argues most people build loops wrong: piping a prompt into a coding agent yields 40,000-line PRs nobody reads. Riffing on Geoffrey Huntley's “Ralph” and Peter Steinberger's loop philosophy, he lays out loop engineering for real-world teams with real customers, regulatory obligations and SLAs.
The MC welcomes everyone back, thanks the sponsors (presenting sponsor Microsoft) and sets up a closing block on building software factories that actually work — and don't produce slop.
7:33:03
Dex Horthy (HumanLayer)
Harness Engineering Is Not Enough: Why Software Factories Fail
Horthy (who coined “context engineering”) pushes back on the “you're the bottleneck, just spend more tokens, stop reading the code” narrative. Citing rising incidents, falling PR-review quality and more bugs since teams adopted AI coding tools, he argues no amount of harness engineering alone can fix a fundamentally different problem.
A tutorial (not a pitch) on using elementary type systems and compiler knowledge to make AI agents provably safe. Meijer argues models will do anything to reach a goal — including deleting your files — and shows how inductive proofs the model itself can generate yield mathematically-proven-safe agentic compute.
Robinson explains how Cursor trains its own models and how the model “learns to train itself” via recursive improvement. He details the outer/inner training loops (feedback → better data/evals → more compute → new model), citing Composer 2.5 as Cursor's most popular model and teasing a notable new one soon.
Allie Howe closes Day 1: the raw materials for software factories now exist (bigger context, better memory, vision, verification, agent security), but teams need the discipline to wield them — and, echoing Dex's talk, engineering as a practice is not dead.
The livestream opens on walk-in music, then IBM developer advocate and MC Tejas Kumar welcomes Day 2 of the AI Engineer World's Fair (7,000 attendees, 18 tracks). He recaps Day 1 themes of loops, reliability and evals before thanking the sponsors led by Microsoft.
Anthropic member of technical staff Thariq Shihipar (introduced on-stage as 'Tariq') gives the first-ever 'Field Guide to Fable,' Anthropic's new frontier model rolling out later that day. He organizes working with the new class of models into four parts — unhobbling Claude, finding your unknowns, dealing with the grief, and being unreasonable — and teases a 12:30 fireside with Cat Wu and Simon Willison. (printed: Thariq Shihipar)
Sonar CEO Tariq Shaukat argues that as models generate plausible but unverified 'slop,' deliberate verification is what actually unlocks enterprise value. He pitches Sonar's code-verification approach and an 'agent-centric development cycle (ACDC)' that builds verification into the agentic, CI, and code-maintenance loops. (printed: Tariq Shaukat)
Amazon AGI Lab member of technical staff Antje Barth (name garbled to 'onjab' in captions) explains why agents that can click and use tools still fail at end-to-end work that lives 'in the seams' between apps, making reliability rather than capability the next hard problem. She introduces 'perception agents' that read and understand a screen, pointing to Amazon's computer-use track work and GitHub repo. (printed: Antje Barth)
Google DeepMind VP of Research Benoit Schillings (ex-Google X, Waymo/Glass) describes his team's mandate to build whatever technology Gemini needs one month to one year out, focused largely on code. He traces code-focused ML from the 2018 'Pitchfork' project at X to today, arguing ML can eventually surface research breakthroughs humans cannot perceive. (printed: Benoit Schillings)
Arize co-founder and CPO Aparna Dhinakaran opens the Evals track, noting Arize runs 100M+ evals a month and top teams run thousands of evaluators. She argues evals must move from scoring single prompts to evaluating traces of tool-calling, reasoning, multi-agent, long-horizon systems, and previews Arize's 'Signal' long-running eval agent. (printed co-speaker Laurie Voss did not appear.)
In a break-time expo talk, OpenGov engineer Gabe Dees Mesa walks through how the government-ERP company built and scaled 'OG Assist,' an in-product AI assistant. He covers their agent harness, the A2A protocol, eval/observability traces, human-in-the-loop tool approvals, and sandboxing. (Unlisted filler; the same talk is replayed later at 7:31.)
The 'First Steps Toward Automated AI Research' talk (printed as Richard Socher; the speaker is unnamed on stream but cites recursive.com and AIX Ventures) pitches the 'Eureka machine' that automates research the way evolution generated inventions. It argues recursive self-improvement is the next S-curve stacked on prior exponentials, and that AI engineers will increasingly manage AI that itself does AI research.
Meta software-engineering tech lead Nishan Gupta (Meta Superintelligence Lab infra) argues evaluation must shift from benchmark scores to production system behavior — task completion, tool correctness, planning, and failure recovery. He frames a hierarchy of agentic failure modes and a continuous, always-on evaluation loop. (Unlisted ~4-minute filler; replayed during lunch at 3:34.)
2:20:34
Han Xiao (Jina AI / Elastic)
Autoresearch for Dense Retrieval: Test-Time Compute with Frozen Embedding Models
Han Xiao (founder of Jina AI, acquired by Elastic in October, where he now leads model inference/training) asks whether small retrieval models can improve at inference time like big models do. Running auto-research overnight, he shows that 'search is test-time compute': assembling more retrieval and reranking over a single frozen embedder instead of scaling the model. (printed: Han Xiao)
2:41:10
Dominik Tornow (Resonate)
Durable Execution / The Prompt Is a Platform (expo/demo)
Resonate founder and CEO Dominik Tornow argues that as agents generate bespoke implementations on demand, value moves from implementation to specification: 'the prompt is a platform.' He describes Resonate's deliberately minimal durable-execution protocol, centered on a durable promise and a durable task, synthesized onto existing infrastructure such as NATS. (Unlisted filler; captions render 'Dominic Turno'.)
Sakana AI research scientist Stefania Druga tackles context rot in long-horizon agents with a memory harness that runs entirely on-device on local models (GLM, DeepSeek V4 flash on an M3 Ultra). She frames this as 'sovereign AI,' demoing evals she ran non-stop on her own Mac back in Tokyo. (printed: Stefania Druga)
Visual Labs founder 'Bash' argues the last human skill in software is figuring out what to build — 'you can't prompt the room, you can prompt your AI.' He recounts an internal hackathon where 17 of 21 agent ideas were dropped for creating no business value, reframing requirement elicitation and product management as the real bottleneck. (Unlisted filler.)
Prime Intellect research engineer Elie Bakouch argues recursive self-improvement should be studied in the open, since no third-party benchmark for it exists. He traces the nanoGPT 'speedrun' lineage (Karpathy to Keller Jordan's modded-nanoGPT) into an environment where multi-agent systems propose, run, judge and scale ML research ideas on Prime Intellect's training stack. (printed: Elie Bakouch)
A verbatim replay of Nishan Gupta's earlier Meta talk on production evaluation for agentic systems, aired as lunch-time filler. Same content: scenario-driven offline evals, observability/distributed tracing, and a continuous evaluation loop mapping every metric to a business outcome.
3:41:23
Google DeepMind gen-media team (Demetrios, Shane, Nicole)
Fireside: Generative Media — Veo, Omni & Nano Banana
A lunch-time fireside (the AI Engineer host with three Google DeepMind generative-media researchers) covering the week's launches: Nano Banana 2 Light, the fastest and cheapest image model in the family, and the Gemini Omni Flash APIs pre-announced at I/O. The panel discusses Veo video models, Omni Thinking / Gemini RL, and why generative media deserves as much exploration as coding.
Decoding AI founder/CEO Paul Iusztin (with co-presenter Louis-François Bouchard) demos building a personal 'AI research OS' that sits between coding/agent harnesses (Codex, NotebookLM) and a 10,000-note Obsidian/Readwise/Notion 'second brain.' The system surfaces high-signal notes for whatever project you're starting and ships with open code. (Unlisted demo leading into the afternoon sessions.)
4:39:55
Tim Sweeney (Weights & Biases / CoreWeave)
Closing the Loop: An Autonomous AI Research Agent (Arya)
Weights & Biases / CoreWeave principal engineer Tim Sweeney (not the Epic Games founder) introduces 'Arya,' an AI research-and-iteration agent, with a live auto-research demo that optimizes a model overnight. He shows how W&B Weave collects production traffic, generates insights with LLM judges, runs nightly eval suites, and promotes the best candidate model with confidence. (printed: Tim Sweeney)
5:00:48
(speaker not named on stream)
Role-Playing Language Agents & Persona-Fidelity Evals
An unlisted talk on role-playing language agents (RPLAs) — Character.AI, Hello History — and an open-source persona framework the speaker built called 'companion,' running on Claude Opus 4.7. The thesis questions what persona benchmarks (e.g. InCharacter's 80.7% 'gold standard') actually measure, using an Alexander Hamilton example to show high scores hiding shallow, stereotyped personas.
5:05:54
Zhengyao Jiang (Weco)
An AI Agent Became the #1 Contributor in OpenAI's Hiring Challenge
Weco co-founder/CEO Zhengyao Jiang (PhD, UCL; introduced as 'Junya') describes 'Aiden/AIDE,' a multi-agent self-improving auto-research agent, entering OpenAI's 'Parameter Golf' hiring challenge. Over about 22 days Aiden set seven leaderboard records (the best human set three) and had the highest community H-index — work other participants forked and built on. (printed: Zhengyao Jiang)
5:22:04
(speaker not named on stream)
Designing Agentic Systems: the Relocation Scout (demo)
An unlisted talk on the engineering discipline of architecting an agent rather than one-shot prompting one, using 'Relocation Scout,' a house-hunting agent, as the running example. The speaker emphasizes systems thinking and structured, queryable memory 'contracts' so an agent's output can safely become another step's input.
5:35:06
Lakshya Agrawal (GEPA)
Self-Improvement of Context, Harness & Weights via Reflective Optimization
Lakshya Agrawal presents reflective optimization (GEPA) — self-improving prompts, agents, and model weights from textual feedback instead of gradient descent — to address sample-inefficiency when rollouts are slow or expensive. He notes GEPA plugs into any framework/model with zero hard dependencies and cites Dropbox, Shopify, and an OpenAI blog post using it. (printed: Lakshya Agrawal)
Independent researcher Raymond (also at OpenPros) applies recursive language models (RLMs) to coding agents, framing today's agents as 'mismanaged geniuses' whose missing layer is specifying, managing, reusing and verifying work. In an RLM the full prompt/context is the object of computation, explored symbolically via a read-eval-print loop; his slides are live at recursivecodingagents.com. (Unlisted filler.)
Tejas Bhakta (introduced as 'Tis'; ex-Tesla inference optimization) shows how auto-research — Karpathy-style goal-seeking while-loops — makes models about 3x faster by generating GPU/CUDA kernels that beat hand-tuning. He stresses that humans still supply the high-level ideas while the loop tunes parameters and verifies for speed and correctness, warning that ~80% of attempts reward-hack. (printed: Tejas Bhakta)
In an unlisted talk, Victor pitches 'Polygraph,' an agent-agnostic meta-harness that gives coding agents cross-repo, organization-wide 'photographic' memory so intent and prior sessions never have to be re-explained per repository. He likens today's repo-bound, memoryless agents to a genius who can see only one-thousandth of the codebase (trypolygraph.com).
Amnara CEO Ean argues an agent's identity isn't its model or runtime but its log — the append-only event history of inputs, outputs, tool calls and results, analogous to a Skyrim save file. Because every state transition is written to the log, the log alone is enough to resume the agent on any runtime. (Unlisted filler.)
Roland (who left xAI's agent-infra team with his co-founder) lays out a 2026 blueprint for productizing auto-research around three ideas, starting with 'the loop is the product.' He uses OpenClaw's car-haggling loop as the canonical example and discusses deploying always-on, long-horizon agents that scale with customers. (printed: Roland Gavrilescu, Julian Bright)
6:48:28
Rushab (MachineCraft)
A Factory That Taught Itself to Remember — Fabric (demo)
Rushab, who runs MachineCraft — a 100-person thermoforming factory in India with no ML budget — recounts building a 36-agent system ('Fabric') that runs the company's entire go-to-market. Without training any model, they chunked hundreds of gigabytes of private history (quotes, drawings, emails) into a retrieval stack of vector + relationship-graph + CRM databases and three model providers, operated from a single Cursor tab. (Unlisted filler.)
Erina (ex-Microsoft; introduced as 'Arena') and teammate Arunachalam Manikandan present 'Project Paradox' from Supercell's AI Innovation Lab — a modular framework for stateful game agents with per-agent RAG memory, emotion vectors, and belief scores. She shows how auto-research (controlled experiments that keep only changes surviving measurement) beats endless prompt hand-tuning for long-horizon agents. (printed: Erina Karati, Arunachalam Manikandan)
Nori Agentic CEO Amole argues coding agents can do far more than write code, and shows how to make them produce slides, docs and video by giving them the right medium — HTML rather than a human-oriented canvas. He references Simon Willison's 'pelican on a bicycle' SVG test and pitches Nory 'Sessions'/Norybot for building decks from your own company data. (Unlisted filler.)
7:23:11
Zion
10X: Reimagining the Mobile Dev Workflow (expo/demo)
Mobile engineer Zion (14 years) argues the promised 10x AI productivity hasn't arrived because teams swapped the tool but kept the old workflow — like early factories that merely replaced steam engines with electric ones. He pitches reimagining mobile development around agents (cloud builds/previews, no local Xcode/Android Studio, agent-opened PRs) to remove that friction. (Unlisted filler.)
A replay of OpenGov engineer Gabe Dees Mesa's earlier talk on building and scaling 'OG Assist,' shown as afternoon-break filler before the closing keynotes and cut off by walk-in music. (Duplicate of the 1:43 talk.)
Oliver Wyman Americas director of technology Deina Delias takes over as MC for the closing keynote block, thanks the live and online audience and sponsors, and introduces the closing keynote speaker.
Google engineering leader and author Addy Osmani argues the engineer of the future owns the 'verdict' — the accountable production decision — over increasingly automated, agent-driven work. He frames roles rebundling around what part of the system you can own (prototype, build, sweep, grow, maintain) and reprises harness/loop engineering, arguing that, like past abstractions, agents will grow rather than shrink demand for engineering. (printed: Addy Osmani)
8:05:06
George Cameron & Micah Hill-Smith (Artificial Analysis)
Micah Hill-Smith and George Cameron, co-founders of Artificial Analysis (an AI benchmarking company that tests chips, clouds, models and agents), present 'the cost of intelligence' using their agentic knowledge-work evals. They distinguish tasks bounded by the intelligence frontier, where you pay for more capability, from ceiling-bounded tasks where you pick the cheapest model that clears the bar. (printed title: 'Trends in AI')
Arena (LMArena) co-founder/CTO Wei-Lin Chiang (PhD, UC Berkeley; creator of Chatbot Arena and early LLM-as-a-judge work) recaps how Arena measures 'intelligence in the real world' across the year's breakthroughs, now with 10M+ monthly visitors. He describes category-sliced, rubric-based agentic evaluation to help developers pick the best model. (printed: Wei-Lin Chiang)
MC Deina Delias closes Day 2, thanking the audience, organizers and sponsors (Microsoft as presenting sponsor). She notes that the startup battlefield and closing notes take place the following night, then sends everyone off until the next morning.
Pre-show walk-in music and a countdown loop over the main-stage feed before the program begins. An announcer welcomes the room to the AI Engineer World's Fair and hands off to the day's MC.
Developer relations engineer Ralph Shabri from Replit opens the final day, noting ~7,000 attendees and framing the day's theme as harness engineering. He previews keynotes from Anthropic, Stanford and DSPy and thanks presenting sponsor Microsoft. (He repeatedly calls it 'Day four', though the printed schedule labels this Session Day 3.)
Amplify investment partner Barr Yaron presents her annual State of AI Engineering survey, run this year in partnership with Notion and Vercel with 1,048 respondents. She reports AI engineering is more a discipline than a title, skewing to senior engineers who are new to AI, and walks through adoption trends across the stack.
0:35:43
John Ousterhout (Stanford)
TCP and RDMA are Killing Inference Throughput; Homa Can Fix It
Stanford professor emeritus John Ousterhout argues AI network workloads are shifting from large throughput-bound transfers to small latency-sensitive messages that TCP and RoCE RDMA handle poorly. He introduces Homa, a clean-slate Stanford transport protocol that can cut tail latency by an order of magnitude. (Printed name 'Ousterhout' appears as 'Sterhout' in captions.)
0:54:14
Maxime Rivest & Isaac Miller (DSPy)
The Unreasonable Effectiveness of Separating the Task from the Model
DSPy core contributors Maxime Rivest and Isaac Miller make the case for treating AI programs like functions with defined inputs/outputs, separated from the model, harness and implementation details. They pitch DSPy, an open-source Python framework, as the way to make AI workflows reusable, composable, testable and optimizable. (Captions render 'Maxim Rest'.)
Instagram co-founder and Anthropic member of technical staff Mike Krieger sits for a fireside with swyx on how his model usage and role shifted from Chief Product Officer to hands-on building. He describes moving from step-by-step engineering to describing a goal and letting stronger internal models (Mythos/Fable) run with it.
1:39:57
Emil Eifrem (Neo4j)
Thinner Agents on a Smarter Substrate: The Ontology-based Semantic Layer
Neo4j founder/CEO Emil Eifrem lays out a blueprint for an ontology-based semantic layer so teams stop re-wiring data sources for every new enterprise agent. It combines a business ontology, a technical ontology and execution traces so many thin agents can share one smarter graph substrate. (Captions render 'Neo Forj'.)
An unlisted break-slot capstone talk on an RL-guided agent that selects bounded remediation actions for failed ETL data jobs on AWS (Glue, EventBridge, Lambda, CloudWatch, S3). It separates deterministic anomaly rules, a tabular Q-learning policy for action choice, and a safety override to keep actions explainable and within trusted boundaries.
Anthropic platform-team leads Katelyn (captioned 'Caitlyn') Lesse and Angela Jiang challenge the idea that all tokens are fungible, proposing that tokens be given distinct 'jobs' beyond execution. They demo strategies that pair executors with advisor, grader and 'dreaming' tokens, showing some are far more token-efficient per useful output, backed by Anthropic's managed-agent primitives.
2:09:51
Paul Iusztin & Louis-François Bouchard (Towards AI / Decoding ML)
An unlisted video demo on building a personal 'AI research OS' that sits between coding harnesses and a second brain (Obsidian, Readwise, Notion) to surface high-signal notes. Presented by Paul Iusztin (captioned 'Pauline'), founder/CEO of Decoding AI and co-author of the LLM Engineer's Handbook, and Louis-François Bouchard (captioned 'Lu France Bush'), co-founder/CTO of Towards AI and creator of the 'What's AI' channel.
2:20:08
Nikita Kothari (Salesforce)
MCPs, CLIs, and Skills: Choosing the Right Tooling Layer for Agentic Development
Salesforce senior member of technical staff Nikita Kothari (Agentforce) compares three agent tooling layers—MCP, CLIs and structured Skills—and gives a rubric for when to use each. She frames the tooling/'plumbing' layer as what makes or breaks production agents, citing context explosion, invisible failures and security surface, and advises composable CLIs and skills-as-code.
Nori Agentic CEO Amole argues coding agents can do far more than write code, including making slides, docs and video if you give them the right medium. His fix is to let agents author artifacts as code/HTML rather than manipulate human-oriented canvases, demoed via Nory Sessions building decks from company data.
An unlisted talk by Isidora, who runs a 225-year-old Virginia wedding venue and built AI agents for it and other venues. She frames agents as high-IQ, low-EQ interns and presents a layered prompt architecture—immutable brand identity, real-time signals, and per-audience conditioning—so a voice-critical brand never goes off-message.
2:50:03
Michael Grinich (WorkOS)
Auth for Agents: Unblock Autonomous AI with auth.md
WorkOS founder Michael Grinich (captioned 'Greenwich') traces the shift from human-in-the-loop coding to long-running autonomous agents and argues the next bottleneck is letting agents securely access web services on a user's behalf. He pitches an auth standard for agents so they can act autonomously across services, calling agents the next major software platform.
An unlisted talk by Rushab, who runs MachineCraft, a ~100-person thermoforming-machine factory in India with no ML team. He describes building 36 AI agents that run go-to-market by ingesting hundreds of gigabytes of private company history (quotes, drawings, emails) into a retrieval/graph 'company brain' rather than training a model.
3:15:06
Mike Chambers (AWS)
Harness Engineering: Building the Production Cage for Powerful Domain Agents
AWS senior AI specialist and developer advocate Mike Chambers distinguishes two kinds of agents and live-codes deploying a domain agent with a production 'harness' on AWS. He walks through scaling components and guardrails for powerful agents using Bedrock and AgentCore-style tooling.
An unlisted ~1-hour lunch-slot Oxford-style debate hosted by Allie Howe (Insecure Agents podcast, Keycard) on whether there is a gap between the hype around agent 'loops'/software factories and what works in practice. Panelists include Ian Livingstone (CEO, Keycard), Geoffrey Huntley (creator of the Ralph loop), a Sentry developer, and Dex Horthy (CEO, HumanLayer); the winner is whichever side changes the most minds.
Morgan Stanley ML researcher Brendan Rappazzo presents Loophole, a personal open-source project where you specify your morals, one agent codifies them into a legal system, and two adversarial agents hunt for contradictions. He frames it via a DNA/forensics analogy about codifying the nuance of one's own moral beliefs.
An unlisted talk by data scientist Sail Shik and senior data solutions engineer Ankosha Soi of Persistica on why dumping every tool schema into the prompt (a 'fat agent') breaks at scale. They show hundreds of tools consuming ~127K tokens per request and causing lost-in-the-middle failures, and propose semantic routing with just-in-time context.
5:05:04
Giselle van Dongen (Restate)
Every step you take, every call you make — the reliable agent stack
This talk covers the infrastructure needed to run agents reliably in production as they become persistent, asynchronous, long-running processes. It introduces Restate, an open-source durable-execution framework (inspired by Apache Flink) that supplies durable execution, retries and recovery so agents survive crashes. (Speaker named per printed schedule; not stated in captions.)
An unlisted talk by Resonate founder/CEO Dominik Tornow (captioned 'Dominic Turno') arguing that general-purpose implementations will be replaced by bespoke ones generated on demand, moving reuse upstream to specifications. He reframes Resonate's durable-execution product as a small protocol (durable promise + durable task) from which trusted server implementations can be synthesized.
5:35:05
Sarah Sanders (PostHog)
We let an AI agent execute Bash and lived to talk about it
PostHog context engineer Sarah Sanders dissects the security of PostHog's 'wizard', an agentic CLI that reads your codebase, installs the SDK, instruments events and builds dashboards in minutes. She walks the wizard's threat model and lessons on prompt-injection/supply-chain risk, noting the wizard, 'warlock' and 'context mill' are open source.
5:55:37
Shashi (Super Agentic)
Turbocharge Your Agents' Retrieval with Turbo (online-track demo)
An unlisted online-track talk by Super Agentic founder Shashi on cutting agent-retrieval memory ~5x by compressing embeddings in the KV cache from 32 bits to 3–4 bits. It explains the KV-cache growth problem on local/Mac devices and applies the 'TurboQuant' compression technique (Google Research, ICLR 2026) without losing search quality.
6:00:47
Kay Malcolm (Oracle)
No Memory, No Harness: Why the Database Is the Last Line of Defense
Oracle outbound database PM lead Kay Malcolm shows how AI sped up individuals but hurt team productivity when context (from Codex) wasn't shared across her globally distributed team. Her fix builds a unified dependency graph ('Polygraph') feeding a meta-harness so agents can read/write across hundreds of repos as one codebase, positioning the Oracle database as the durable memory layer.
Vercel 'chief of software' Andrew Qu recounts a year-long experiment that led to an agent explosion at Vercel, building on the AI SDK's unified provider interface, model fallbacks, secure code execution and durability. He argues it's never been easier to automate HR, finance and sales work, pitching Vercel's agent-building product (Eve).
An unlisted talk by Amnara CEO Isan arguing an agent's identity is neither the model nor the runtime but its append-only event log—like a Skyrim save file. Treating the log as the primitive (every input, output, tool call and permission is an event) makes agents resumable and reliable, since models and tools just read and append to it.
6:54:59
Philipp Schmid (Google DeepMind)
Agents Without Code: How Skills, YAML, and Filesystems Replaced Python
Google DeepMind's Philipp Schmid builds the same GitHub PR-review agent three ways, deleting code each time until the agent is 'just files': markdown skills, an agents.md, a bash tool and a filesystem in a sandboxed Gemini Interactions API. He cites Manus, LangChain and Vercel repeatedly refactoring or removing harness code as models improve, and urges 'build to delete' via the Antigravity harness.
An unlisted ~30-minute talk on evaluating role-playing language agents (RPLAs) behind tools like Character AI and Hello History, using an open 'companion' framework on Claude Opus 4.7. The speaker, a data scientist running a labor-market-intermediary analytics lab, argues standard 'in-character' benchmarks measure fluency and can't catch the dominant failure of anachronistic persona compositing.
YouTuber Theo Browne frames the AI moment through model 'eras'—Sonnet 3.5 (reliable tool calls), Opus 4.5 (long-running tasks) and Mythos/Fable—and jokes about collective 'AI psychosis'. His takeaway: with models this capable it's time to build audaciously (your own AWS, compete with Slack); if your idea doesn't feel stupid, it isn't big enough.
Y Combinator president/CEO Garry Tan answers Theo's 'what do we build' from the founder/investor side, claiming a roughly 400x jump in his own coding output and companies where one person does what took a thousand. He urges the room to build AI-native companies and the compounding libraries beneath them, previewing the Startup Battlefield.
Airtable founder/CEO Howie Liu (introduced as CEO of Hyper Agent, a product within Airtable) gives a product-form-factor view of agents along a spectrum from completions to fully 'employable' agents. He applies Airtable's horizontal-platform philosophy to agents and closes by offering Hyper Agent credits usable across Opus 4.8, Fable 5, GLM 5.2 and others.
8:37:13
swyx with Howie Liu; judges Theo Browne & Joshua Xu
swyx closes the event by teaming with Hyper Agent's 'Founding 500' program: 500 founders were narrowed to 20 who competed, with three finalists pitching live. Judged by Theo Browne, Joshua Xu and Howie Liu, the agent-native finalists (including comment.io) collectively win $100K in prizes. (Printed as 'Startup Battlefield — Howie Liu, Joshua Xu, swyx'.)
MC Ralph Shabri returns to celebrate the Battlefield finalists and judges (Josh, Howie, Theo) and closes AI Engineer World's Fair 2026 after 4 days, ~7,000 attendees and 40 tracks. He thanks sponsors and crew and teases a New York event in October before a final show reel.