Engineering brief

Autonomous Coding Agents That Run for Hours, Not Minutes

AI Jason

The Brief

Coding agents have a bad habit of quitting early on complex tasks. A new pattern—using an LLM to judge whether the goal is actually met—keeps agents working autonomously for hours. It works for migrations and large refactors, but the catch is brutal: your goal prompt must define "done" with near-legal precision. Vague instructions produce loops of incoherence. Engineering leads who invest in robust goal files will see real gains; everyone else will watch agents fail spectacularly.

Decision relevance

Read this for workflow impact, implementation trade-offs, and the claims that need technical scrutiny before they reach team planning.

Summary

Modern coding agents often quit early on complex tasks, declaring victory after superficial fixes. This video shows how Codex's new 'goal' feature and Hermes' 'persist ghost' address that by using a secondary LLM call to judge completion: the agent stops only when the judge agrees the objective has been met. It's a clear evolution from the earlier 'Ralph loop' hack, where the agent was simply re-run programmatically with no intelligent gating.
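The judge-gated loop described above can be sketched in a few lines. This is a minimal illustration, not the Codex or Hermes implementation: `agent_step` and `judge` are hypothetical callables standing in for the worker agent and the secondary LLM call, and `max_iterations` is an assumed safety cap.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    done: bool
    reason: str


def run_until_done(
    goal: str,
    agent_step: Callable[[str, List[str]], str],   # hypothetical worker call
    judge: Callable[[str, List[str]], Verdict],    # hypothetical second LLM call
    max_iterations: int = 50,                      # assumed safety cap
) -> Tuple[bool, List[str]]:
    """Repeat agent steps until the judge agrees the goal is met."""
    history: List[str] = []
    for _ in range(max_iterations):
        history.append(agent_step(goal, history))
        verdict = judge(goal, history)
        if verdict.done:
            return True, history
        # Feed the judge's objection back so the next step can address it.
        history.append(f"judge: not done, {verdict.reason}")
    return False, history


# Stand-ins for real model calls, only to make the sketch runnable.
def fake_agent(goal: str, history: List[str]) -> str:
    steps = [h for h in history if h.startswith("agent:")]
    return f"agent: fixed failing test {len(steps) + 1}"


def fake_judge(goal: str, history: List[str]) -> Verdict:
    steps = [h for h in history if h.startswith("agent:")]
    return Verdict(done=len(steps) >= 3, reason="tests still failing")


finished, log = run_until_done("all tests pass", fake_agent, fake_judge)
```

A plain re-run loop is the same code with the `judge` call removed: it runs a fixed number of times whether or not the work is finished, which is the gap the gating step closes.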

The real value surfaces when teams push beyond well‑scoped tickets into migrations, large refactors, or experimental work where the path isn’t known upfront. The feature can run overnight (9‑hour migration demo) and handle ambiguous goals like 'cut Docker image size by 60%' by letting the agent explore multiple approaches. However, the magic depends entirely on prompt engineering: the goal must define done criteria explicitly, give quantifiable stop conditions, and often requires an alignment interview with the agent first.
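To make "quantifiable stop conditions" concrete, here is a small sketch that turns the 'cut Docker image size by 60%' example into a check a judge could evaluate. The `SizeGoal` class, its field names, and the 1200 MB baseline are assumptions for illustration; they are not part of any tool mentioned in the video.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeGoal:
    baseline_mb: float      # measured size before the agent starts (assumed value below)
    reduction_pct: int      # 60 means "cut by 60%"

    def target_mb(self) -> float:
        # Largest size that still counts as done.
        return self.baseline_mb * (100 - self.reduction_pct) / 100

    def is_met(self, current_mb: float) -> bool:
        return current_mb <= self.target_mb()

    def render(self) -> str:
        # Text that could be pasted into a goal prompt as the done criterion.
        return (
            f"Done when the built image is at most {self.target_mb():.0f} MB "
            f"(baseline {self.baseline_mb:.0f} MB, {self.reduction_pct}% reduction) "
            "and the existing test suite still passes."
        )


goal = SizeGoal(baseline_mb=1200, reduction_pct=60)
criterion = goal.render()
```

Stating the threshold as a number leaves the agent free to explore different approaches while giving the judge something it can verify instead of interpret.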

There’s a catch. For multi‑week missions (SEO, ad optimization) lacking immediate verifiable feedback, the loop still breaks down. The speaker’s team is prototyping a 'mission' layer that schedules runs over days/weeks with human‑in‑the‑loop sanity checks—early results show improved Twitter engagement. So while the goal feature marks real progress, it’s still a power tool, not a set‑and‑forget solution. Teams that invest time in crafting robust goal files (e.g., using the open‑source 'go body' helper) will see the most impact; those who toss vague instructions will see agents loop into incoherence or premature stops.
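One way to picture a 'mission' layer is as a schedule of agent runs with periodic human checkpoints. The sketch below only plans the calendar; the names `ScheduledRun` and `plan_mission` and the cadence parameters are hypothetical, and it says nothing about how the speaker's prototype actually works.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class ScheduledRun:
    day: date
    needs_human_review: bool


def plan_mission(
    start: date,
    days: int,            # total mission length in days
    every_n_days: int,    # gap between agent runs
    review_every: int,    # a human checks in after every Nth run
) -> List[ScheduledRun]:
    """Lay out agent runs over a mission window, flagging human checkpoints."""
    runs: List[ScheduledRun] = []
    for index, offset in enumerate(range(0, days, every_n_days), start=1):
        runs.append(
            ScheduledRun(
                day=start + timedelta(days=offset),
                needs_human_review=(index % review_every == 0),
            )
        )
    return runs


plan = plan_mission(date(2025, 1, 1), days=14, every_n_days=2, review_every=3)
review_days = [run.day for run in plan if run.needs_human_review]
```

The checkpoints stand in for the feedback signal that slow-moving goals such as SEO cannot provide within a single run.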

Why It Matters

Turns coding agents from short-task helpers into long-running autonomous workers for migrations, refactors, and complex tests.

Editorial analysis

Key claims

  • LLM‑judged loops reduce false completions, but you must define done well or they fail spectacularly.

Practical use cases

  • Use this as input for tooling evaluation, workflow planning, and technical due diligence.

Risks / caveats

  • Beware the hype that agents now 'understand' goals; prompts still need the precision of a near-legal contract.

Who should care

  • Engineering managers, tech leads, and CTOs evaluating AI or developer tooling decisions.

Bottom Line

LLM‑judged loops reduce false completions, but you must define done well or they fail spectacularly.

Video and thumbnails remain the property of their respective creators. tldw.news provides editorial analysis, commentary, and discovery links to original content.