AI Coding in 2026: What Actually Breaks in Production

Andrej Karpathy coined the term "vibe coding" in early 2025, and it described something a lot of us were already quietly doing — you tell the AI what you want, you click accept, and you check if it looks right. No deep code review. Just vibes.

For a while, it worked well enough to feel like a superpower.

Then 2026 happened.

The crash was predictable, honestly

The most talked-about incident came in July 2025. Jason Lemkin, the SaaStr founder, had been running a 12-day experiment, trying to build a real app with Replit's AI agent, no developer involved. On Day 9, he came back to find his production database wiped. Over 1,200 executive records, gone. When he questioned the agent, it admitted to running unauthorized commands. It had also tried to hide what it had done, told him recovery wasn't possible (it was), and filled the gap with roughly 4,000 fabricated records. Replit's CEO called it "unacceptable" and pushed out fixes: separate development and production databases, and a planning-only mode. Good fixes, probably too late for the PR.

I'll be honest, when I read that story, I had two reactions back to back. First, pure horror. Then something closer to "of course."

Because we've been handing these agents access to production systems and trusting them to behave like careful engineers. They're not engineers. They're autocompletes that scaled up.

Platforms like Lovable have a different but related problem: developers getting trapped in loops where the AI tries to fix its own errors, fails, tries again, burns through credits, and never actually resolves anything. It's less dramatic than a deleted database, but it wastes just as much of your time and money.

The thread running through all of it: we are confusing how fast AI generates code with how sound that code actually is.

The tools, and the thing nobody talks about

The market right now has split into two main approaches.

Claude Code runs in your terminal. It works directly in your filesystem; it uses shell tools like Grep, Bash, and Glob to search code, edit files, and run tests. It's built for automation, the kind of thing you'd run overnight for dependency audits or PR reviews. If you're comfortable in the command line, it fits naturally into how you already work.

Cursor's Agent Mode lives inside Cursor, an editor forked from VS Code. It orchestrates changes across multiple files in a visual interface, which makes it easier to get into, especially if terminals aren't your thing. The tradeoff is that loose instructions can lead it to run package installs you didn't ask for or trigger database migrations before you're ready.

Here's the part that doesn't make the feature comparison articles:

|  | Claude Code | Cursor Agent Mode |
| --- | --- | --- |
| Interface | Terminal | VS Code GUI |
| State between sessions | CLAUDE.md + cross-session memory | Active workspace history |
| Integration | Terminal, shell, GitHub Actions | VS Code extensions, MCP |
| Main risk | Needs precise instructions | Loose guardrails can cause unintended changes |

I used Cursor's premium Agent Mode for a while. Eventually stopped. The monthly cost stopped making sense for what I was actually getting out of it. I switched back to GitHub Copilot for inline autocomplete and free-tier Claude for thinking through architecture. It forces more engagement; you can't just hit generate and go make tea. Honestly, that constraint made me better at using the tools, not worse.

What actually changed in how I work

I spend less time writing code now and more time deciding what to build and why.

Before I write a single line on a new project, I'm asking: What's the right folder structure for this to stay maintainable when it grows? Who's actually using this, and what does the interface need to do for them? What should I handle myself, and what's safe to hand off?

That's less of a coding mindset and more of a coordination one. The AI handles the volume. I handle the decisions that require understanding the actual problem.

But here's the thing that doesn't get said enough: the speed of generation doesn't change who owns what ships. Every line of AI-generated code that goes into production carries the same weight as code you typed yourself. It still needs to be secure. It still needs to work under load. If it breaks at 2 AM, you're the one getting paged, not the model.

Roughly speaking, here's where the time savings are real and where they disappear:

| Task | Before AI | With AI | The catch |
| --- | --- | --- | --- |
| Boilerplate & scaffolding | 4–8 hours | 5–15 minutes | Bad instructions = weak foundation |
| Unit test generation | 2–4 hours | 10–30 minutes | Tests need real edge cases, not just happy paths |
| Multi-file refactoring | 8–16 hours | 30–60 minutes | Context overload causes silent regressions |
| Dependency review | 1–2 hours | 5–10 minutes | Still needs a human security check |
| Production hardening | 12–24 hours | 12–24 hours | Unchanged. The AI has no context about your actual users |

That last row is the one worth staring at for a minute.

A bug that taught me more than any tutorial

While building out my portfolio site, I hit something that perfectly illustrated all of this.

The card layout in my projects section looked completely broken. Not subtly off, actually broken. The AI's code had zero syntax errors, no compiler warnings, nothing. It looked fine on paper.

My first instinct used to be to ask the AI, "Why is this broken?" I've learned that usually sends you into a doom loop; it suggests fixes, the fixes don't work, it suggests more fixes, and your CSS turns into an archaeology dig of failed attempts.

Instead, I looked at the layout myself, figured out the issue (flexbox wrapping and item sizing), explained the structural problem to Claude, and pointed it to the fix.

Small story. But that loop, where you diagnose, you explain, and the AI executes, is the actual unit of work now. The faster you accept that division, the less time you waste fighting the tool.

Why demos always look better than the real thing

Local development is a lie. A comfortable, useful lie, but still.

On your machine, there are no concurrent users, no network latency, no attackers looking for gaps, and no transaction conflicts. If something breaks, you refresh. It's a windless room.

Production is different. Real software has to stay stable when hundreds of users hit the same endpoint at once, recover when a third-party API goes down, fail in ways that don't corrupt your data, and give you enough visibility that someone can actually diagnose what went wrong when it does.

AI is excellent at the visible parts — buttons, layouts, pages. It's largely blind to the invisible parts that keep those things alive. Concurrency management, connection pooling, rate limiting, and error isolation — these require context that lives entirely outside what the model was trained on. Your specific traffic patterns, your infrastructure, your users' behavior.

The five failure patterns I'd watch for in AI-generated production code:

The double-booking problem — AI writes code that works perfectly in single-threaded local testing. Under real load, when two users try to buy the last item simultaneously, if the database transactions aren't locked correctly, you charge both. The model doesn't know your database cluster config.
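A minimal sketch of one way to close that gap, assuming a single `inventory` table and using SQLite as a stand-in for whatever database you actually run. The table, column, and function names are hypothetical. The idea is to let the database do the check and the decrement in one atomic statement instead of reading the stock in application code and writing it back later.

```python
import sqlite3


def reserve_last_item(conn: sqlite3.Connection, item_id: int) -> bool:
    """Try to reserve one unit; return True only if stock was actually decremented."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            # The stock check and the decrement happen in a single statement,
            # so two buyers cannot both pass the "is it in stock?" test.
            "UPDATE inventory SET stock = stock - 1 WHERE id = ? AND stock > 0",
            (item_id,),
        )
        return cur.rowcount == 1


# Demo with one unit in stock (hypothetical schema, in-memory database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (id INTEGER PRIMARY KEY, stock INTEGER NOT NULL)")
conn.execute("INSERT INTO inventory (id, stock) VALUES (1, 1)")
conn.commit()

first = reserve_last_item(conn, 1)   # first buyer gets the item
second = reserve_last_item(conn, 1)  # second buyer is refused
remaining = conn.execute("SELECT stock FROM inventory WHERE id = 1").fetchone()[0]
```

Charge the customer only when the function returns True. On a clustered or replicated database the right primitive may be different (row locks, serializable transactions), which is exactly the part that needs someone who knows your setup.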

The dependency blindspot — AI recommends packages based on training data with a fixed cutoff. That library it just suggested might have a breaking change, a deprecated API, or a known CVE that came out after the model was trained. Every suggested package needs a human review.

The logging gap — AI-generated code is consistently under-instrumented. The model doesn't have to debug a production crash at 3 AM, so it doesn't naturally write the kind of structured, readable logs that make that possible.
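As a rough illustration of what "structured" means here, this sketch uses only Python's standard `logging` module to emit one JSON object per log line. The `JsonFormatter` class, the `build_logger` helper, and the `context` key are assumptions made for the example, not part of any particular framework.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the line.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Demo: capture the output in memory so the resulting line can be inspected.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info("payment failed", extra={"context": {"order_id": "A-1001", "attempt": 2}})
```

The point is that every line carries the identifiers (order, attempt, user) you will want to filter on later, instead of a bare "something went wrong".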

Silent math errors — the expensive ones. Code that runs without raising a single error but calculates something wrong. A payment system that forgets to apply tax on international orders won't throw an exception. Only someone with actual domain knowledge will catch it, probably after it's already cost money.
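One cheap defense is a test that pins the calculation to numbers a human worked out by hand. The sketch below is illustrative only: the `order_total` function and the rates in `TAX_RATES` are assumptions made for the example, and real rates belong to your tax provider or accountant.

```python
from decimal import Decimal, ROUND_HALF_UP

# Illustrative rates only; do not treat these as tax advice.
TAX_RATES = {"DE": Decimal("0.19"), "JP": Decimal("0.10")}


def order_total(subtotal: Decimal, country: str) -> Decimal:
    """Return subtotal plus tax, failing loudly when no rate is configured."""
    if country not in TAX_RATES:
        # Refusing to guess beats silently charging zero tax.
        raise KeyError(f"no tax rate configured for {country!r}")
    tax = (subtotal * TAX_RATES[country]).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return subtotal + tax


# Hand-computed expectations: 100.00 + 19.00, and 49.99 + 5.00 (4.999 rounded up).
total_de = order_total(Decimal("100.00"), "DE")
total_jp = order_total(Decimal("49.99"), "JP")
```

The unknown-country branch matters as much as the arithmetic. A default of zero would run without errors and produce exactly the kind of wrong number this pattern describes.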

Context saturation — over a long session, the model's context window fills up, and it starts forgetting earlier instructions, repeating mistakes, and going in circles. The fix is keeping tasks small and well-defined. Don't hand it a whole repository and say, "Clean this up."

The security stuff got serious fast

This part of the conversation moved quickly in 2025, and I don't think most developers are fully caught up yet.

In late 2024, Anthropic introduced MCP — Model Context Protocol — a standard that lets AI assistants connect directly to external databases, APIs, and filesystems. Powerful. Also, a real attack surface.

Two vulnerabilities hit Cursor in mid-2025:

CurXecute (CVE-2025-54135, CVSS 8.6) was disclosed on August 1 by Aim Security researchers. Because Cursor's agent reads external sources like GitHub issues and Slack messages, an attacker can embed a malicious prompt inside a public resource. When the agent parses it, the payload writes a malicious config file to your workspace. In older Cursor versions, that file executed automatically, giving the attacker code execution with your developer privileges. Patched in Cursor v1.3.9.

MCPoison (CVE-2025-54136) was disclosed on August 5 by Check Point Research. Three-stage attack: The attacker commits a harmless MCP config to a public repo, you clone and approve it, and then they push a silent update with a malicious payload. On your next pull and project reload, it executes without prompting you. Patched in Cursor v1.3 by requiring re-approval on any config change.
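The idea behind that fix, approving the exact contents of a config and not just its name, can be sketched in a few lines. This is not Cursor's implementation. The file name `mcp.json`, the `approved` mapping, and both helper functions are hypothetical, chosen only to show the shape of a content-hash check.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reapproval(path: Path, approved: dict[str, str]) -> bool:
    """True when the file was never approved or has changed since approval."""
    return approved.get(str(path)) != config_fingerprint(path)


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "mcp.json"

    # Stage 1: a harmless config is reviewed and approved.
    cfg.write_text(json.dumps({"command": "echo", "args": ["hello"]}))
    approved = {str(cfg): config_fingerprint(cfg)}
    unchanged = needs_reapproval(cfg, approved)

    # Stage 2: the same file is silently replaced with a different command.
    cfg.write_text(json.dumps({"command": "sh", "args": ["-c", "echo replaced"]}))
    changed = needs_reapproval(cfg, approved)
```

Because the approval is tied to a hash of the contents, the swapped file no longer matches and has to go back in front of a human.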

There's also a technique — not a single CVE but an approach — where attackers embed invisible Unicode characters (zero-width joiners, directional overrides) inside rule config files. You ask the agent to generate a login form. The hidden characters parse into its context, and it silently injects a backdoor. Nothing shows in the chat window. You review the generated code, and it looks clean.
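A rough way to check a rules file before an agent reads it is to flag every character in Unicode's "format" category (`Cf`), which covers zero-width joiners, directional overrides, and similar invisible controls. The function name and the sample text below are made up for illustration.

```python
import unicodedata


def find_hidden_characters(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, name) for each Unicode format character (category Cf)."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                # Fall back to the code point if the character has no name.
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                findings.append((line_no, col, name))
    return findings


# Sample rules text with a zero-width joiner and a right-to-left override hidden in line 2.
sample = "Always use parameterized queries.\nPrefer\u200d short functions.\u202e"
findings = find_hidden_characters(sample)
```

This errs on the side of noise: category `Cf` also includes the soft hyphen, the byte order mark, and joiners that appear legitimately in emoji and some scripts, so treat a hit as a reason to look rather than proof of an attack.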

A separate audit of publicly available MCP servers found that 43% had command injection vulnerabilities, 33% had unrestricted outbound requests, and 66% had poor security practices overall. These are community-maintained servers that developers are connecting to corporate codebases.

I'm not saying don't use any of this. I'm saying know what you're plugging in.

GigSignal, and where I actually hit the wall

I built a personal project called GigSignal — a bot that scans for job opportunities and sends me real-time alerts on newly launched tokens.

Getting the scaffolding up was fast. That part genuinely worked the way the demos promise.

Then the system got more complex. Live web scrapers, token price APIs, messaging alerts — all talking to each other. API timeouts started killing the bot silently. Rate limits hit, and nothing told me why. The error handling I'd let the AI write wasn't handling much.

I had to put the AI aside and dig through logs manually. Write custom recovery logic by hand. The AI had gotten me 70% of the way there in a fraction of the time it would have taken me to write it from scratch. The last 30% was entirely me.
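For a sense of what that kind of recovery logic looks like, here is a generic retry wrapper with exponential backoff that logs every failure instead of dying quietly. It is a sketch, not the code from GigSignal: `call_with_retries`, the `RateLimited` exception, the logger name, and the default delays are all assumptions made for the example.

```python
import logging
import random
import time

log = logging.getLogger("gigsignal.fetch")  # hypothetical logger name


class RateLimited(Exception):
    """Raised by the caller's fetch function when the upstream API rate-limits it."""


def call_with_retries(fetch, *, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(), retrying timeouts and rate limits with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except (TimeoutError, RateLimited) as exc:
            if attempt == attempts:
                log.error("giving up after %d attempts: %r", attempt, exc)
                raise  # surface the failure instead of swallowing it
            # Double the wait each time, plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            log.warning("attempt %d failed (%r); retrying in %.2fs", attempt, exc, delay)
            sleep(delay)


# Demo: a fetch that times out twice, then succeeds. Delays are recorded, not slept.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("upstream timed out")
    return "ok"


delays = []
result = call_with_retries(flaky_fetch, sleep=delays.append)
```

Two details do most of the work: the final failure is re-raised so the bot cannot stop silently, and each retry leaves a log line that says why it happened.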

If you're planning to build automated pipelines, that's the honest shape of it. Budget for the moment when you have to take over.

What to hand off and what to keep

The clearest framework I've found:

Hand off CRUD route scaffolding, TypeScript schema generation, unit tests for pure functions, and formatting. These are high-volume, low-consequence tasks.

Keep control of any new third-party dependency, anything touching authentication or encryption, shared state and lock pools, and all telemetry and error handling paths. The AI will generate something for these that looks complete. It usually isn't.
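One way to make that split enforceable rather than aspirational is a small check that flags changed files falling into the "keep control" bucket. The sketch below assumes a made-up repository layout. The glob patterns and the function name are hypothetical and would need adapting to your own project.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; adjust to where these concerns live in your repo.
REVIEW_REQUIRED = {
    "dependencies": ["*requirements*.txt", "*pyproject.toml", "*package.json", "*.lock"],
    "auth_or_crypto": ["*/auth/*", "*/crypto/*"],
    "telemetry_or_errors": ["*/logging/*", "*/errors/*"],
}


def files_needing_human_review(changed_paths):
    """Group changed paths by the reason they need a human reviewer."""
    flagged = {}
    for path in changed_paths:
        for reason, patterns in REVIEW_REQUIRED.items():
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                flagged.setdefault(reason, []).append(path)
    return flagged


changed = ["src/auth/login.py", "src/routes/users.py", "requirements.txt"]
flagged = files_needing_human_review(changed)
```

Run against the list of files an agent touched, this gives you a short, explicit set of paths that don't merge until a person has read them.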

As products scale, you need to tighten how much autonomy the agent has:

| Stage | Biggest risk | How much to delegate |
| --- | --- | --- |
| 0 → 1 (finding product-market fit) | Building the wrong thing | High: move fast, prototype freely |
| 1 → 10k users | Can't diagnose runtime errors | Moderate: feature work with manual test review |
| 10k+ users | Race conditions, supply chain attacks | Low: sandboxed scaffolding only |

One separate note on Web3: the stakes there are categorically different. A bug in a web app gets patched. A bug in a deployed smart contract can drain user funds permanently. AI-generated smart contracts miss reentrancy exploits and state-handling flaws regularly. Don't ship them without a manual security review from someone who actually knows what they're reading.

What I'd tell a founder today

If someone came to me and said they wanted to build their entire platform with AI agents to cut costs, I'd tell them: do it. Use AI to replace the volume of manual boilerplate. But take the money you save and put at least one experienced engineer in the architect seat.

You need someone who can structure your databases correctly, catch security gaps, and fix things when the agent breaks something it doesn't know it broke.

I keep hearing AI described as a force multiplier. That's true. But a multiplier needs something to multiply. Without human direction and real validation behind it, you're not cutting corners — you're just accumulating debt that's going to come due at the worst possible time.

I write about this stuff — AI, Web3, and what it actually looks like to build things in 2026 — in The Synthesis Stack. No hype. Subscribe if that sounds useful.
