Welcome back. AI agents keep getting smarter, but hosting them on your laptop or Mac Mini only takes you so far. Manus just shipped an update that changes how you run agents around the clock.
Also: How to cut your agent's context window by 90%, train your AI on 2,000 DESIGN.md files from top products, and where Karpathy says LLMs go after coding.
Today’s Insights
Powerful new updates and hacks for devs
AI evals: The new compute bottleneck
How to stop Codex from giving up mid-task
Trending social posts, top repos, and more

TODAY IN PROGRAMMING
Manus ships persistent cloud machines for 24/7 bots: The AI agent startup just released Cloud Computer, a persistent virtual machine for always-on bots and software. Unlike standard sandboxes that reset, this stays live and keeps your tools installed across sessions. Developers can now deploy Slack bots, schedule scrapers, or self-host apps using plain-English prompts. It runs on Ubuntu and includes SSH and a web terminal. Watch how it works.
Cursor and Anthropic ship enterprise security agents: The AI coding startup just launched Security Review for Teams and Enterprise, using always-on agents to catch vulnerabilities in every pull request and send scan results straight to Slack. Just hours later, the Claude creator released Claude Security in public beta for Enterprise users. It uses Opus 4.7 to scan codebases, reduce false positives, and suggest patches that engineers can quickly approve.
Theo Browne urges Anthropic staff to leave in an open letter: Developer and content creator Theo Browne dropped an open letter criticizing Anthropic’s leadership and urging engineers to quit. Browne described a culture of fear and poor code quality, even offering to help anyone jumping ship to match their current vesting schedules. While some devs call it clickbait, Browne insists the post reflects genuine, widespread developer frustration.

PRESENTED BY YOU.COM
Teams often pick an API by checking a benchmark table and calling it done, a shortcut that can miss what really matters in production. This guide from You.com explains why raw latency is a misleading signal and what to measure instead.
Stop optimizing for the wrong number. Download the guide.

INSIGHT
AI evals are becoming the new compute bottleneck

Source: The Code, Superhuman
The bill is finally due. For a long time, training models was the main expense and testing was an afterthought. Hugging Face just flipped that script: a single run of a top-tier benchmark now costs as much as a used car. As a result, most teams can't afford to verify the leaderboards they use to choose their models.
The old shortcuts don't work anymore. Back when benchmarks were simple Q&A tests, labs could shrink them to save money without changing the rankings, because most questions were too easy to separate models anyway. Agent benchmarks changed that: they involve long, multi-step tasks, and cutting corners throws away exactly the signal you need.
Running a test once isn't enough. An agent might nail a task one day and fail it completely the next. A new Princeton paper shows that recent gains in capability haven't made agents any more consistent, yet current leaderboards only report best-case results, ignoring the fact that these agents rarely behave the same way twice.
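To make that concrete, here's a minimal sketch (hypothetical task names and numbers, not data from the Princeton paper) of reporting the mean and spread over repeated runs instead of a single best-case score:
import statistics

# Hypothetical pass/fail results (1 = success, 0 = failure) for one
# agent on the same two tasks, each attempted five times.
runs = {
    "fix-auth-bug": [1, 0, 1, 0, 0],
    "migrate-schema": [1, 1, 1, 0, 1],
}

for task, results in runs.items():
    print(f"{task}: best-case {max(results):.0%}, "
          f"mean {statistics.mean(results):.0%} over {len(results)} runs")

# A best-case leaderboard shows 100% on both tasks; the mean reveals
# the agent actually succeeds only 40% and 80% of the time.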
Whoever has the money writes the leaderboard. Most teams end up starting their evals from scratch because there isn't a standard way to share results. Hugging Face's Every Eval Ever project is trying to fix this with a shared format. Without it, only the massive labs building frontier models will have the budget to actually grade them.
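For a feel of what "shareable" means in practice, a result record in a common format might look something like this sketch (a hypothetical shape, not the actual Every Eval Ever schema):
# Hypothetical shape for a shareable eval result; field names are
# illustrative, not the actual Every Eval Ever schema.
result = {
    "benchmark": "agent-tasks-v2",     # assumed benchmark identifier
    "model": "example-model",          # hypothetical model name
    "runs": 5,                         # repeated trials, not a single shot
    "mean_score": 0.62,
    "stdev": 0.07,
    "harness": "example-harness 1.4",  # recorded so others can reproduce it
}
With records like this, teams could reuse each other's expensive runs instead of re-grading every model from scratch.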

IN THE KNOW
What’s trending on socials and headlines

Meme of the day.
Beyond Coding: OpenAI founding member Andrej Karpathy used his Sequoia Ascent 2026 fireside to map out three new horizons for LLMs that go way past faster code.
3,800 bookmarks
Pretty Agents: Sick of agents shipping ugly UIs? This site has 2,000 DESIGN.md files modeled on the world's best products for your AI to study.
10,000 bookmarks
Drop and Done: A developer used Cursor's new SDK to build a Kanban board where you just drop in tasks, and the agents handle the rest.
5,200 likes
Brain Drain: A 22-year-old dev says six months of running 6 to 8 Claude Code terminals at once has left him noticeably less sharp, and his friends agree.
783,200 views
Context Cut: A dev cut his agent's context window by 90% after fixing one thing most AI builders never check. His prompts had nothing to do with it.
152,100 views
Teacher Trap: Why does fine-tuning on data from the strongest LLMs sometimes hurt your model? This viral interview question highlights an effect every AI engineer should know.
32,400 views
Stack Check: A developer shared a 20-tool LLM engineering stack for production. How does yours compare?
4,500 comments

AI CODING HACK
How to stop Codex from giving up mid-task
Coding agents often quit mid-task, but OpenAI’s new Ralph loop pattern lets Codex keep working until your goal is finished. Codex head Tibo Sottiaux recently announced that version 0.128.0 now supports tasks that can run for days.
To enable this experimental CLI feature, add these lines to your “~/.codex/config.toml”:
[features]
goals = true

After updating, use the “/goal” command to keep Codex running until the task is finished. You can pause, resume, or clear tasks anytime. For example:

/goal Refactor the auth module to use JWTs, update every call site, and fix any failing tests.

Just remember to set a token limit in your config first to avoid unexpected costs.

TOP & TRENDING RESOURCES
Top Tutorial
6 ways to use Claude Code efficiently: This tutorial gives developers six practical tips for getting the most out of Claude Code. You'll learn to maintain control by avoiding loops, reviewing AI actions in plan mode, and building custom agents. It also emphasizes providing clear instructions, verifying all output, and still writing some of the code yourself.
Top Tool
GitBar: Manage pull requests directly from your macOS menu bar. Get instant updates on reviews, conflicts, and approvals with a live PR count. Built with React Native, it’s a no-cost, one-click solution for tracking your team’s workflow.
Top Repo
Dynamic Context Pruning (2.5k ⭐): This plugin cuts OpenCode token costs by compressing and auto-cleaning conversation context. It preserves history with placeholders for pruned data before it hits the LLM.
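As a rough illustration of the placeholder idea (a hypothetical sketch, not this plugin's actual code): long tool outputs in older messages get swapped for short markers before the history is sent to the model.
# Hypothetical sketch of placeholder-based pruning; not the plugin's code.
MAX_CHARS = 2_000
KEEP_RECENT = 5  # leave the most recent messages untouched

def prune(messages: list[dict]) -> list[dict]:
    pruned = []
    for msg in messages[:-KEEP_RECENT]:
        if msg["role"] == "tool" and len(msg["content"]) > MAX_CHARS:
            # Replace bulky tool output with a placeholder that keeps
            # the conversation's shape while dropping the tokens.
            msg = {**msg, "content": f"[pruned {len(msg['content'])} chars]"}
        pruned.append(msg)
    return pruned + messages[-KEEP_RECENT:]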
Trending Paper
Continually improving agent harness (by Cursor): Integrating AI models into agent products is a challenge: rigid context windows and tool errors lead to "context rot," which drags down performance. By continually tuning a dynamic, model-specific harness, you can drastically reduce these errors and make the agent’s coding much more reliable.
Grow customers & revenue: Join companies like Google, IBM, and Datadog. Showcase your product to our 250K+ engineers and 150K+ followers on socials. Get in touch.
What did you think of today's newsletter?
You can also reply directly to this email if you have suggestions, feedback, or questions.
Until next time — The Code team
