Anthropic ships a security guidance plugin, Datacurve reveals DeepSWE benchmark

Welcome back. As developers, we usually take benchmarks from frontier AI labs at face value (even though we know benchmarks don’t tell the entire story). Now, a Silicon Valley startup is putting current agentic models through the wringer with a new approach: by simulating real-world developer workflows, they’ve uncovered some pretty shocking results.

Also: A former Dropbox principal engineer's AI-era playbook, what world models actually do, and why a Socket engineer uses AI to write better code more slowly.

Today’s Insights

Powerful new updates and hacks for devs
How to evaluate multi-agent systems at scale
How to enforce better coding principles in Claude Code
Trending social posts, top repos, and more

TODAY IN PROGRAMMING

Anthropic ships security guardrails for AI-written code: The AI lab just dropped a security-guidance plugin that builds vulnerability review directly into Claude Code. It scans each edit for risky patterns like unsafe eval() calls, then runs a background model review of the full diff after every turn. A deeper agentic check kicks in on commits, reading surrounding code to validate findings. Internal teams saw a 30 to 40 percent drop in security comments on pull requests. Click here to install it.

Datacurve reveals a benchmark that exposes coding models: An SF-based startup just unveiled DeepSWE to surface real differences between top models that standard leaderboards tend to miss. Rather than scraping GitHub, the team built custom tasks and verifiers that test behavior instead of matching implementation details. The results are eye-opening. GPT-5.5 is out front at 70%, while Claude Sonnet 4.6 sits at 32%, and Gemini 3 Flash comes in at 5%. See the leaderboard.

Hackers exploit Ghost CMS to compromise 700 websites: The open-source publishing platform behind blogs at Harvard, Oxford, and DuckDuckGo is in the middle of a wide-scale attack. Attackers are using an SQL injection bug, patched all the way back in February, to grab Admin API keys and slip malicious JavaScript into live posts. Nearly half of the affected sites are developer and tech blogs whose owners never installed the fix. If you run an instance, patch it now.

PRESENTED BY COMP AI

Get SOC 2 and ISO-ready in days (readers save $2,000)

You don’t have months to get audit-ready. With Comp AI, you can get SOC 2 or ISO 27001 audit-ready in just a few days, so you move fast and close enterprise customers.

Connects to your stack and automates evidence collection
Keeps you audit-ready 24/7
Eliminates 90% of the work

Automate your compliance with Comp AI (The Code readers save $2,000).

INSIGHT

How to evaluate multi-agent systems at scale

Source: The Code, Superhuman

Agents are scaling up fast. Anthropic reports that Claude's longest agent sessions now run over 45 minutes. That's nearly double what they were three months ago. Software engineering still drives about half of all API tool calls, and production is now full of long-running, multi-step agent traffic.

That breaks the old debugging playbook. When something fails, an engineer opens the bad run, walks the handoff, and finds the bug. That works fine for 50 runs. At 10,000, it falls apart: failures hide across thousands of traces, and the same broken handoff repeats until the pattern is too big to ignore.

That's the new shape of multi-agent failures. Recent research suggests 40% of agentic projects die before shipping, mostly over reliability nobody could fix. The same handoffs keep breaking the same way. The same specialists keep missing the same signals. Review thresholds fire for the wrong cases. The pattern repeats until someone catches it.

The recipe: OpenAI just shipped a cookbook aimed at this problem. The workflow analyzes 1,000 simulated agent traces and groups similar failures to pinpoint which agent or tool caused the issue. It's a clear sign that the real challenge has shifted from model quality to total system visibility.
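
To make the grouping idea concrete, here is a minimal Python sketch. It is not the cookbook's code. It assumes each trace is a dict with a list of steps, and the field names (agent, tool, ok, error), the helper names, and the sample data are all hypothetical. Failed steps are bucketed by a signature of agent, tool, and a normalized error message, so repeated breakages rise to the top.

```python
import re
from collections import Counter


def normalize(error: str) -> str:
    """Collapse run-specific details (numbers, hex ids) so similar errors match."""
    error = re.sub(r"0x[0-9a-f]+|\d+", "<n>", error.lower())
    return re.sub(r"\s+", " ", error).strip()


def group_failures(traces):
    """Count failed steps by (agent, tool, normalized error), most frequent first."""
    counts = Counter()
    for trace in traces:
        for step in trace["steps"]:
            if not step["ok"]:
                signature = (step["agent"], step["tool"], normalize(step["error"]))
                counts[signature] += 1
    return counts.most_common()


# Hypothetical sample traces, for illustration only.
traces = [
    {"steps": [
        {"agent": "planner", "tool": "search", "ok": True, "error": ""},
        {"agent": "billing", "tool": "refund_api", "ok": False,
         "error": "Timeout after 30s on order 1182"},
    ]},
    {"steps": [
        {"agent": "billing", "tool": "refund_api", "ok": False,
         "error": "Timeout after 45s on order 907"},
    ]},
    {"steps": [
        {"agent": "triage", "tool": "classifier", "ok": False,
         "error": "Missing field: priority"},
    ]},
]

for (agent, tool, error), count in group_failures(traces):
    print(f"{count:>3}  {agent}/{tool}: {error}")
```

Exact-match signatures are the simplest possible grouping. A real pipeline would likely need fuzzier matching, but even this version shows which handoff fails most often without anyone opening a single trace.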

IN THE KNOW

What’s trending on socials and headlines

Wrong Muscle: A former Dropbox principal engineer just shared his playbook for the AI era and why most devs train the wrong muscle.
Past the Slop: An ex-Google engineer's post on world models covers what AI's buzziest term actually does (211K views).
Long Memory: An AI engineer's two-step setup turns Obsidian into Codex's persistent memory layer (1.8K bookmarks).
Live Edits: A developer built a Claude skill that turns any HTML output into a Google Docs-style canvas, letting you comment in-browser and make changes (180K views).
Harness School: This website is the best place to learn harness engineering, the discipline of making Codex and Claude Code actually reliable (7.3K bookmarks).
SOUL Layer: Before any tool or memory loads, the SOUL.md file decides who your agent actually is. This breakdown covers what belongs inside.
Slow Code: A Socket engineer makes the case for using AI to write better code more slowly, sharing the multi-model review skill that finds tons of bugs in his PRs.

AI CODING HACK

How to enforce better coding principles in Claude Code

Claude often makes bad assumptions, over-engineers simple tasks, and touches code it shouldn't. After Andrej Karpathy flagged these issues in January, a developer packed the fixes into a 70-line CLAUDE.md file. It hit number one on GitHub trending with over 220K stars. Just drop it into your project root:

curl -fL -o CLAUDE.md https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md

This human-readable file enforces four principles: surface tradeoffs instead of guessing, scope changes strictly to the request, write minimal code, and define success criteria for verification.

If you have an existing CLAUDE.md, use the append command in the README to merge them.

P.S. Get 50+ AI coding hacks for Claude Code, Cursor, and Codex here.

TOP & TRENDING RESOURCES

Top Tool

Parsewise: Turn messy PDFs, spreadsheets, and text files into clean, reliable answers with a single API call. It handles multi-doc processing, contradiction checks, and full lineage automatically, so you can skip the complex parsing pipelines and ship solid AI features faster.

Top Repo

ECC (195K ⭐): This agent optimization system plugs into Claude Code, Cursor, and Codex to turn messy configs into a streamlined workflow. It helps you ship faster and safer by automating token efficiency, security audits, and research-backed patterns across every repo.

Trending Cookbook

How Anthropic keeps Claude in check across different apps: As AI agents grow more capable, their potential to cause widespread damage expands rapidly. Ultimately, engineers must prioritize strict environmental containment over fallible human approvals or model defenses to safely deploy them.

Our most-clicked story from yesterday

This 34-project roadmap shows you exactly how to build a full LLM stack from the ground up.

Grow customers & revenue: Join companies like Google, IBM, and Datadog. Showcase your product to our 290K+ engineers and 150K+ followers on socials. Get in touch.

What did you think of today's newsletter?

Your feedback helps us create better emails for you!

You can also reply directly to this email if you have suggestions, feedback, or questions.

Until next time — The Code team