Welcome back. AI agents keep getting smarter, but hosting them on your laptop or Mac Mini only takes you so far. Manus just shipped an update that changes how you run agents around the clock.

Also: How to cut your agent's context window by 90%, train your AI on 2,000 DESIGN.md files from top products, and where Karpathy says LLMs go after coding.

Today’s Insights

  • Powerful new updates and hacks for devs

  • AI evals: The new compute bottleneck

  • How to stop Codex from giving up mid-task

  • Trending social posts, top repos, and more

TODAY IN PROGRAMMING


Manus ships persistent cloud machines for 24/7 bots: The AI agent startup just released Cloud Computer, a persistent virtual machine for always-on bots and software. Unlike standard sandboxes that reset, this stays live and keeps your tools installed across sessions. Developers can now deploy Slack bots, schedule scrapers, or self-host apps using plain-English prompts. It runs on Ubuntu and includes SSH and a web terminal. Watch how it works.

Cursor and Anthropic ship enterprise security agents: The AI coding startup just launched Security Review for Teams and Enterprise, using always-on agents to catch vulnerabilities in every pull request and send scan results straight to Slack. Just hours later, the Claude creator released Claude Security in public beta for Enterprise users. It uses Opus 4.7 to scan codebases, reduce false positives, and suggest patches that engineers can quickly approve.

Theo Browne urges Anthropic staff to leave in an open letter: Developer and content creator Theo Browne dropped an open letter criticizing Anthropic’s leadership and urging engineers to quit. Browne described a culture of fear and poor code quality, even offering to help anyone jumping ship to match their current vesting schedules. While some devs call it clickbait, Browne insists the post reflects genuine, widespread developer frustration.

PRESENTED BY YOU.COM

Teams often pick an API by checking a benchmark table and calling it done, a shortcut that can miss what really matters in production. This guide from You.com explains why raw latency is a misleading signal and what to measure instead.

Stop optimizing for the wrong number. Download the guide.

INSIGHT

AI evals are becoming the new compute bottleneck

Source: The Code, Superhuman

The bill is finally due. For a long time, training models was the main expense, while testing was just an afterthought. Recently, Hugging Face flipped that script: running a top-tier benchmark now costs as much as a used car every single time. Because of this, most teams can't afford to verify the leaderboards they use to choose their models.

The old shortcuts don't work anymore. Back when benchmarks were just simple Q&A tests, labs could shrink them to save money without changing the rankings. This worked because most questions were too easy to really show the difference between models. Agent benchmarks changed everything. These involve long, multi-step tasks where cutting corners means losing the exact data you need.

Running a test once isn't enough. An agent might nail a task on a Monday but fail it completely the following day. A new Princeton paper shows that recent gains in capability haven't actually made agents more consistent. Current leaderboards only report the best-case results. They ignore the fact that these agents rarely behave the same way twice.
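The gap between best-case and typical behavior is easy to see with a quick sketch. The run results below are made up, but they show why a leaderboard that reports only the best of k runs can make a flaky agent look like a solved problem:

```python
import statistics

# Hypothetical pass/fail results for one agent on one benchmark task,
# re-run 10 times (1 = success, 0 = failure). Data is invented.
runs = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

best_of_k = max(runs)              # what best-case leaderboards report
mean_rate = statistics.mean(runs)  # what you actually get on average
spread = statistics.pstdev(runs)   # run-to-run variability

print(f"best-of-k:    {best_of_k}")        # 1   -> looks solved
print(f"mean success: {mean_rate:.1f}")    # 0.6 -> fails 4 times in 10
print(f"std dev:      {spread:.2f}")       # 0.49
```

Reporting the mean and spread across repeated runs (rather than a single best score) is what the consistency critique boils down to.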

Whoever has the money writes the leaderboard. Most teams end up starting their evals from scratch because there isn't a standard way to share results. Hugging Face's Every Eval Ever project is trying to fix this with a shared format. Without it, only the massive labs building frontier models will have the budget to actually grade them.

IN THE KNOW

What’s trending on socials and headlines

Meme of the day.

Beyond Coding: OpenAI founding member Andrej Karpathy used his Sequoia Ascent 2026 fireside to map out three new horizons for LLMs that go way past faster code.
3,800 bookmarks

Pretty Agents: Sick of agents shipping ugly UIs? This site has 2,000 DESIGN.md files modeled on the world's best products for your AI to study.
10,000 bookmarks

Drop and Done: A developer used Cursor's new SDK to build a Kanban board where you just drop in tasks, and the agents handle the rest.
5,200 likes

Brain Drain: A 22-year-old dev says six months of running 6 to 8 Claude Code terminals at once has left him noticeably less sharp, and his friends agree.
783,200 views

Context Cut: A dev cut his agent's context window by 90% after fixing one thing most AI builders never check. His prompts had nothing to do with it.
152,100 views

Teacher Trap: Why does fine-tuning on data from the strongest LLMs sometimes hurt your model? This viral interview question covers an effect every AI engineer should know.
32,400 views

Stack Check: A developer shared a 20-tool LLM engineering stack for production. How does yours compare?
4,500 comments

AI CODING HACK

How to stop Codex from giving up mid-task

Source: X/thsottiaux

Coding agents often stall mid-task, but OpenAI’s new Ralph loop pattern lets Codex run until your goal is finished. Codex lead Tibo Sottiaux recently announced that version 0.128.0 now supports tasks that can run for days.

To enable this experimental CLI feature, add these lines to your “~/.codex/config.toml”:

[features]
goals = true

After updating, use the “/goal” command to keep Codex running until the task is finished. You can pause, resume, or clear tasks anytime.

/goal Refactor the auth module to use JWTs, update every call site, and fix any failing tests.

Just remember to set a token limit in your config first to avoid unexpected costs.
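Putting it together, your config might look like the sketch below. The `goals` flag comes from the announcement; the token-limit key is a commented-out placeholder, since the actual setting name varies by version — check your Codex CLI's config reference before relying on it:

```toml
# ~/.codex/config.toml
[features]
goals = true  # enables the experimental /goal command

# Hypothetical cost guardrail -- the real key name is an assumption here;
# confirm it against your Codex CLI version's documentation.
# max_tokens_per_goal = 500000
```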

TOP & TRENDING RESOURCES

Click here to watch the tutorial in action.

Top Tutorial

6 ways to use Claude Code efficiently: This tutorial gives developers six practical tips for getting the most out of Claude Code. You'll learn to maintain control by avoiding loops, reviewing AI actions in plan mode, and building custom agents. It also emphasizes providing clear instructions, verifying all output, and the necessity of writing your own code.

Top Tool

GitBar: Manage pull requests directly from your macOS menu bar. Get instant updates on reviews, conflicts, and approvals with a live PR count. Built with React Native, it’s a no-cost, one-click solution for tracking your team’s workflow.

Top Repo

Dynamic Context Pruning (2.5k ⭐): This plugin cuts OpenCode token costs by compressing and auto-cleaning conversation context. It preserves history with placeholders for pruned data before it hits the LLM.
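The placeholder idea behind that plugin is simple enough to sketch. This is not the plugin's actual code — the message format and size threshold are assumptions — just a minimal illustration of swapping bulky old tool output for a stub before the context hits the LLM:

```python
# Minimal sketch of placeholder-based context pruning.
# Message schema and threshold are illustrative assumptions.

PRUNE_THRESHOLD = 200  # prune tool outputs longer than this many chars

def prune_context(messages):
    """Replace large, older tool outputs with short placeholders,
    leaving the most recent message untouched."""
    pruned = []
    for i, msg in enumerate(messages):
        is_last = i == len(messages) - 1
        if (msg["role"] == "tool"
                and len(msg["content"]) > PRUNE_THRESHOLD
                and not is_last):
            pruned.append({
                "role": msg["role"],
                "content": f"[pruned: {len(msg['content'])} chars of tool output]",
            })
        else:
            pruned.append(msg)
    return pruned

history = [
    {"role": "user", "content": "Run the test suite."},
    {"role": "tool", "content": "FAILED test_auth.py ... " + "x" * 5000},
    {"role": "user", "content": "Summarize the failures."},
]
slim = prune_context(history)
print(slim[1]["content"])  # -> "[pruned: 5024 chars of tool output]"
```

The history itself stays intact on disk; only the copy sent to the model is slimmed down, which is where the token savings come from.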

Trending Paper

Continually improving agent harness (by Cursor): Integrating AI models is hard because rigid context windows and tool errors cause "context rot," which drags down performance. Continually fine-tuning a dynamic, model-specific "harness" drastically reduces these errors and makes the agent’s coding far more reliable.

Grow customers & revenue: Join companies like Google, IBM, and Datadog. Showcase your product to our 250K+ engineers and 150K+ followers on socials. Get in touch.

What did you think of today's newsletter?

Your feedback helps us create better emails for you!


You can also reply directly to this email if you have suggestions, feedback, or questions.

Until next time — The Code team

Keep Reading