
April 27, 2026

Running Opus 4.6 Full-Time: An Honest Review

Three months in, Opus 4.7 happened and a free local model caught what I couldn't. Notes from the daily driver seat.


Anthropic shipped Opus 4.6 instead of Sonnet 5 a couple of months ago. Nobody expected that. I moved my whole stack to it the morning it landed.

What follows is the honest version: the review I wrote in February when 4.6 was new, then the April update after 4.7 happened and the picture changed.

February: 4.6 first impressions

The good: 1M token context window (beta, API), smarter code review, better self-correction, adaptive thinking that picks when to dig deeper.

The expensive: $5/M in, $25/M out. Doubles to $10/$37.50 over 200k context. 2-4x GPT-5 pricing.
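If you want to feel the curve rather than squint at rates, a back-of-the-envelope helps. A minimal sketch in TypeScript, using the rates above; the token counts are invented round numbers, not measurements, and I’m assuming the higher rate applies to the whole request once it crosses 200k.

```typescript
// Back-of-the-envelope Opus 4.6 cost at the listed rates.
// Assumption: the long-context rate applies to the whole request
// once the prompt crosses 200k tokens.
const IN = 5 / 1_000_000;          // $ per input token, under 200k
const OUT = 25 / 1_000_000;        // $ per output token, under 200k
const IN_LONG = 10 / 1_000_000;    // $ per input token, over 200k
const OUT_LONG = 37.5 / 1_000_000; // $ per output token, over 200k

function cost(inputTokens: number, outputTokens: number): number {
  const long = inputTokens > 200_000;
  return inputTokens * (long ? IN_LONG : IN) + outputTokens * (long ? OUT_LONG : OUT);
}

console.log(cost(80_000, 4_000).toFixed(2));  // a chunky review turn: ~$0.50
console.log(cost(400_000, 8_000).toFixed(2)); // a big monorepo pass: ~$4.30
```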

It’s more careful than 4.5. Pauses more, reads more files, catches more edge cases. But slower (1-2 min tasks now take 5-10), more corporate in tone, and context gathering is still weak. Gave it a monorepo, asked about React best practices; it ignored the React Native app until I prompted again.

Net: 5-10% smarter in some ways, 3-5% worse in others. Use it if code quality matters more than speed and you can absorb the cost. Skip it if budget’s tight or you liked 4.5’s vibe.

That was February. Here’s what happened next.


April update: three months in

What actually happened

April 16: Opus 4.7. April 23: GPT-5.5. A week apart, not the same day.

The same week as GPT-5.5, Anthropic published a post-mortem on Claude Code quality. Three regressions over the previous month: a default reasoning-effort drop from high to medium, a thinking-cache bug that compounded across turns, a verbosity-reduction prompt that cost 3% on intelligence evals. All fixed in v2.1.116+. Usage limits reset. Credit to Anthropic for naming the actual mechanisms instead of waving at “model variance.”

So the “4.6 felt worse before 4.7 even existed” feeling I’d had wasn’t imagined. Now it’s documented.

The Reddit and X discourse collapses three separate things into one “4.7 is bad” narrative: those Claude Code regressions (now fixed), the new tokenizer (input tokens up 1.0-1.35x, real per-task cost up 1.5-3x), and 4.7’s deliberate shift to literal instruction following (“will not silently generalize”), which breaks prompts that worked on 4.6.

4.7 isn’t quite the disaster the discourse painted. Anthropic’s claimed double-digit benchmark gains in finance, legal, and coding are probably real on the benchmarks they ran. But on my actual work, with vanilla recommended settings, 4.7 is already starting to feel like the same kind of dumb 4.6 had drifted into by the end. The little I’ve tested of GPT-5.5 has been night and day.

The moment that decided it

I was debugging a race condition in a redirect callback. Asked Opus 4.7. It read the file, made a recommendation, moved on. Confident. Wrong direction.

I had Kimi-k2.6 running locally via Ollama as a second opinion. Free model. It pulled the same file and immediately flagged a stray await on shouldShowPendingIntro() at line 941. Same race window. Opus had read the file and not seen it.
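Roughly the shape of it, reconstructed from memory. Hypothetical names, simplified structure; a sketch of the bug class, not Stardust’s actual code.

```typescript
// Hypothetical reconstruction of the bug class, not the real code.
// The stray await yields to the event loop before the flag is set,
// so anything running in that gap sees stale state.
let pendingIntro = false;

async function shouldShowPendingIntro(): Promise<boolean> {
  return Promise.resolve(true); // pretend this reads storage
}

async function onRedirect(navigate: (route: string) => void) {
  // the stray await: control returns to the caller right here
  if (await shouldShowPendingIntro()) {
    pendingIntro = true; // set one microtask too late
  }
  navigate(pendingIntro ? "/intro" : "/home");
}

// Whatever runs in the same tick as the redirect trigger lands inside
// the window the await opened and reads the old flag.
void onRedirect((route) => console.log("navigating to", route));
console.log("guard check:", pendingIntro); // false: the race window
```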

A free local model found what an Opus call could not. Once you’ve seen that, the math doesn’t recover.

The workflow

The real story isn’t the model swap. It’s the workflow.

I run cmux. CLI on top of tmux that turns each agent into its own surface. Codex in one. Claude Code in another. Kimi-k2.6 in a third. I coordinate from my own pane. The agents talk to each other directly via XML envelopes codified in AGENTS.md: <msg from="claude" to="codex">…</msg>. Sending a diff to a peer for review is a few keystrokes: cmux send to type it, cmux send-key Enter to submit, cmux read-screen to read the reply.
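For the curious, one round-trip looks roughly like this. The three cmux subcommands are the ones named above; the argument shapes and the sendToPeer helper are my own sketch, not cmux’s documented API.

```typescript
// Sketch of one peer-review round-trip between panes. Assumes cmux is
// on PATH and that send / send-key / read-screen take the pane name
// first; treat the argument shapes as guesses.
import { execFileSync } from "node:child_process";

function cmux(...args: string[]): string {
  return execFileSync("cmux", args, { encoding: "utf8" });
}

function sendToPeer(pane: string, from: string, to: string, body: string): string {
  // Wrap the message in the AGENTS.md envelope so the receiving agent
  // knows who is talking and who should reply.
  const envelope = `<msg from="${from}" to="${to}">${body}</msg>`;
  cmux("send", pane, envelope);     // type the envelope into the peer's pane
  cmux("send-key", pane, "Enter");  // submit it
  return cmux("read-screen", pane); // read whatever the peer printed back
}

// e.g. hand a diff from Claude to Codex for review
const reply = sendToPeer("codex", "claude", "codex", "Review this diff: ...");
console.log(reply);
```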

Why it works:

  1. Cross-family review catches what same-family review misses. When two of the three models agree, the bug is real. When they disagree, the disagreement itself is the signal.
  2. Asking for review went from “annoying” to “free.” Default-on, not opt-in. Bugs caught at the cheap end of the cost curve.
  3. Peers find classes of bug a single model skips. Race conditions in async callbacks. Tautological branches. Comment drift. Dead surfaces.
  4. Trust is calibrated explicitly. Kimi hallucinates file contents, so every concrete claim (file path, line number, symbol name) gets verified before anyone acts on it. A sketch of that check follows this list.
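The check itself is nothing fancy. A minimal version, assuming a Node script; it only proves the file, line, and symbol exist, not that the claim is right. The path in the example is made up.

```typescript
// Minimal claim-check for point 4: before acting on a peer's
// "file X, line N, symbol S", confirm all three actually exist.
import { existsSync, readFileSync } from "node:fs";

function verifyClaim(file: string, line: number, symbol: string): boolean {
  if (!existsSync(file)) return false; // the path has to exist
  const lines = readFileSync(file, "utf8").split("\n");
  if (line < 1 || line > lines.length) return false; // the line has to exist
  // allow a little slack: the symbol should appear near the cited line
  const nearby = lines.slice(Math.max(0, line - 3), line + 2).join("\n");
  return nearby.includes(symbol);
}

// e.g. Kimi's claim from the redirect bug, against a made-up path
console.log(verifyClaim("src/auth/redirectCallback.ts", 941, "shouldShowPendingIntro"));
```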

The workflow isn’t Claude-specific. It works because the agents are different.

Where I landed

GPT-5.5 in ChatGPT and Codex for the heavy lifting. Kimi-k2.6 via Ollama as the free local second opinion. Claude Code is still in the panel, but as one voice, not the panel itself. Waiting for OpenAI Business to clear to x20. When it does, most of my Claude Code use goes away.

Stardust, my own thing, is closing in on friends-and-family. The app goes to App Store review this week. That’s part of why multi-agent stopped being a curiosity and became a requirement.

The infrastructure read

There’s a layer under all this worth naming.

Anthropic is compute-constrained by their own admission. Run-rate revenue jumped from $9B end-of-2025 to $30B by April. They’re scrambling: $100B Amazon Trainium for 5GW, Google TPU 1GW → 4GW by 2027, own-chip plans, an exec hunt for European data centers.

The April post-mortem reads differently in that light. Verbosity caps, cache eviction, reduced reasoning effort. Capacity-saving optimizations. Useful for the system, bad for the user.

OpenAI has near-term runway via the Nvidia 10GW partnership plus $50B Amazon and $30B SoftBank rounds in February. Stargate isn’t full production until 2029 and the $100B Nvidia tranche is reportedly on ice, but the headroom today is real. That’s part of why GPT-5.5 lands sharper and faster.

Long-term I still think Google wins. They own the whole stack: Ironwood TPUs, Gemini, GCP, plus three billion Workspace users as built-in distribution. Anthropic increasingly runs on Google TPUs. Google has put up to $40B into Anthropic. OpenAI will spend the decade paying off data centers; Google already owns theirs.

I might be wrong by July. That’s the cycle now. But the choice to weight my daily work toward GPT-5.5 with Kimi as backup looks more defensible once you read 4.7’s regressions as downstream of a capacity problem its company hasn’t solved.

The framing

You can’t run on a release schedule. You run on what’s working today. What’s working today is a free local model, a frontier OpenAI model and a tmux script gluing them together.

The lesson isn’t “switch to GPT-5.5.” It’s: stop paying for the assumption that one model should be the answer to everything.

If you’re paying Opus prices for daily work, run the same comparison against a free local model before your next invoice.

Keep going?

If you got something out of this, drop me an email or like the LinkedIn share. If the signal is there, I’ll keep writing. If not, I’ll put the energy back into Stardust.

A single blog covering dev, design, AI and tech in one place, written by someone shipping every day, is what I’d want to read myself. That’s mostly why I do it. But it helps to know if anyone else wants to read it too.
