TDD with AI
Omer Atagun
Ever since I started writing about testing with AI on this very blog, I’ve had quite the journey at work trying to balance raw speed with the experience and scars I’ve accumulated over time.
This is the story of how I leveraged AI without losing my sanity, using a TDD-first mindset. Wear your helmets, tighten your seatbelts, and let’s go.
Optimisation opportunity
Imagine a graph we need to traverse as a tree. We need traversal in both directions: forward and reverse.
The initial idea was intentionally naive: get it working with a very basic implementation where database calls scale with the number of hops. The goal wasn’t performance; it was shape. I wanted the input and output to be exactly right so we could lock behavior down with tests before touching optimisation.
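For flavour, the naive shape looked roughly like this (table and column names here are hypothetical): the application keeps a frontier of node ids and fires one query per hop, so database round trips grow with traversal depth.

```sql
-- One hop of the naive traversal, called once per hop from application code.
-- :frontier_ids holds the node ids discovered in the previous hop.
SELECT e.from_id, e.to_id
FROM   edges e
WHERE  e.from_id = ANY(:frontier_ids)   -- forward direction
   OR  e.to_id   = ANY(:frontier_ids);  -- reverse direction
```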
So that’s what we did.
We wrote stress tests and benchmarks using testcontainers to stay close to production behavior. Inputs and outputs were spot-on. Then we layered in extra specifications as unit and integration tests.
Now the fun part begins.
I told the AI to optimise, but not with a vague “make this faster”; I gave it direction and constraints about how to think about the problem.
I put AI to the test: “do something” vs “be instructed”
“Do something”
I said:
“Go look into this service, analyse it, and come back with optimisation opportunities.”
Sonnet 4.5 did what most engineers do when skimming unfamiliar code: “Let’s cache this bad boy.”
Early returns, hashmaps, tracking visited nodes in memory: all technically reasonable ideas.
Except… we run multiple pods in a Kubernetes cluster.
Each pod would maintain its own in-memory cache with its own TTL. Consistency becomes fuzzy, cache warm-up becomes painful, and until caches stabilize, each node still fires hundreds of queries per graph traversal.
Yes, you could try to fix this with shared cache hit/miss tracking, but that’s not optimisation; that’s complexity debt.
“Do a single recursive CTE query”
This time I gave a very explicit instruction: turn the traversal into a single recursive CTE query.
It took hours. But eventually, we got there.
A massive query with lots of bindings, covering both the base and recursive steps. Proper composite indexes, well-thought-out relations: all technically solid.
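A trimmed-down sketch of the shape (hypothetical table and column names; the real query carried far more bindings):

```sql
-- One recursive CTE covering both traversal directions at once.
WITH RECURSIVE reachable AS (
  -- Base step: edges touching the start node, in either direction.
  SELECT e.from_id, e.to_id
  FROM   edges e
  WHERE  e.from_id = :start_id OR e.to_id = :start_id
  UNION
  -- Recursive step: follow edges touching anything found so far.
  SELECT e.from_id, e.to_id
  FROM   edges e
  JOIN   reachable r ON e.from_id = r.to_id OR e.to_id = r.from_id
)
SELECT * FROM reachable;
```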
I was positively surprised.
This was the real difference between “figure something out” and “solve this problem in this way”. AI is a statistical model. It optimizes for correctness, not for appropriateness. Its first suggestion was reasonable, but it didn’t even glance at the k8s setup, so it couldn’t possibly know caching was the wrong trade-off.
Once guided, it delivered exactly what I asked for. But that came with a hidden cost.
The blind spot: “it works” vs “it scales”
At low data volume, the recursive CTE felt fine.
But real production data isn’t small, and recursive queries don’t degrade linearly.
Under the hood, Postgres was evaluating a join condition that looked harmless:
… JOIN edges ON from_id = current_id OR to_id = current_id …
That OR inside a recursive join is a silent killer.
The planner can’t commit to a single index path, so each recursion level multiplies the amount of work. Scans grow, merges explode, and execution plans balloon.
Without a defensive approach, the outcome would have been predictable:
- timeouts
- massive execution plans
- queries that look correct but never finish
Thankfully, I didn’t discover this on a customer’s bill, at least not this time. That’s why it’s called experience.
Yet again, I had to explicitly tell Sonnet that timeouts were non-negotiable. Graph traversals can go sideways very quickly, and once infinite or near-infinite queries start piling up, they can take an entire database down.
The save: defensive engineering first
Here’s the twist: I didn’t deploy the query blindly.
Two deliberate design decisions saved us:
- Per-workspace scoping: a misbehaving query was limited to a single tenant.
- A hard 15-second query timeout: runaway queries couldn’t block the system.
Instead of bringing the entire service down, only the offending workspace felt the impact.
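In Postgres terms, those rails look roughly like this (a sketch; the names are illustrative, statement_timeout is the real setting):

```sql
-- Guard rail 1: a hard per-statement timeout, set at the start of the
-- transaction that runs the traversal. A runaway query is cancelled after
-- 15 seconds instead of piling up and starving the database.
SET LOCAL statement_timeout = '15s';

-- Guard rail 2: per-workspace scoping. Both the base and the recursive term
-- of the traversal filter on the tenant, e.g.
--   WHERE e.workspace_id = :workspace_id AND (e.from_id = :start_id OR ...)
-- so a pathological graph can only burn the budget of its own workspace.
```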
That’s not luck. And it’s not statistics.
That’s engineering.
It’s also the part of the job you only appreciate before something becomes an incident. This is where engineers design systems; coding is just one slice of the work, usually the one that wrecks your back.
Even the dopamine doesn’t come from “the code worked”, but from realizing the solution survived reality.
The fix: split, simplify, and respect index usage
Next, I benchmarked properly.
I replicated production-like tables, generated mock data, and mirrored production indexes. You don’t need a perfect clone; you need something close enough to reveal how things behave under stress.
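Roughly, the benchmark setup looked like this (a sketch with made-up sizes and hypothetical names; the point is realistic shape and the same indexes, not exact numbers):

```sql
-- Production-like edge table filled with synthetic data.
CREATE TABLE edges (
  workspace_id bigint NOT NULL,
  from_id      bigint NOT NULL,
  to_id        bigint NOT NULL
);

-- A million random edges spread across ~100 workspaces and ~50k nodes.
INSERT INTO edges (workspace_id, from_id, to_id)
SELECT (random() * 100)::bigint,
       (random() * 50000)::bigint,
       (random() * 50000)::bigint
FROM   generate_series(1, 1000000);

-- Mirror the production composite indexes, one per traversal direction,
-- and refresh planner statistics.
CREATE INDEX ON edges (workspace_id, from_id);
CREATE INDEX ON edges (workspace_id, to_id);
ANALYZE edges;
```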
For small datasets, everything looked fine. I pulled median numbers from Databricks to estimate realistic loads.
Seemed okay.
Ship it, yeet!
It worked exactly as expected… for median users.
Within hours of release, timeout alerts started firing in crowded regions. Heavy users had arrived.
A feature that returns nothing is not a feature.
A quick EXPLAIN ANALYZE made the problem obvious: indexes were effectively abandoned. A single recursive CTE with an OR condition throws away most of what makes relational databases fast.
What the database was really doing:
- scan A
- scan B
- merge
- deduplicate
So I changed the approach.
Instead of one recursive query handling both directions, I split it into:
- one traversal for forward edges
- one traversal for reverse edges
- merge and deduplicate in application code, not in the database
That small refactor turned a query that timed out under load into one that consistently runs in ~30 ms.
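Sketched against the same hypothetical schema as above, the split looks like this; the application merges and de-duplicates the two result sets afterwards.

```sql
-- Forward traversal: only follows from_id -> to_id, so it sits cleanly on
-- the (workspace_id, from_id) index.
WITH RECURSIVE forward_reach AS (
  SELECT e.to_id
  FROM   edges e
  WHERE  e.workspace_id = :workspace_id AND e.from_id = :start_id
  UNION
  SELECT e.to_id
  FROM   edges e
  JOIN   forward_reach f ON e.from_id = f.to_id
  WHERE  e.workspace_id = :workspace_id
)
SELECT to_id FROM forward_reach;

-- Reverse traversal: only follows to_id -> from_id, driven by the
-- (workspace_id, to_id) index. No OR anywhere, so the planner has exactly
-- one access path per recursion level.
WITH RECURSIVE reverse_reach AS (
  SELECT e.from_id
  FROM   edges e
  WHERE  e.workspace_id = :workspace_id AND e.to_id = :start_id
  UNION
  SELECT e.from_id
  FROM   edges e
  JOIN   reverse_reach r ON e.to_id = r.from_id
  WHERE  e.workspace_id = :workspace_id
)
SELECT from_id FROM reverse_reach;
```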
Why such a dramatic difference?
Because in databases, efficiency isn’t about correctness; it’s about predictable access paths. By separating traversal directions, each query could use its index cleanly. No OR. No bitmap chaos. No exponential blow-ups.
What this taught me (and what I want you to take away)
Three inseparable truths stand out.
1. Tests catch bugs; engineering catches failures
Tests prove logic. Design protects production.
Don’t let AI take the wheel entirely.
AI can help with both if you ask the right questions.
When I wrote the original query, I was testing correctness. When I rewrote it, I was engineering for failure with AI as an assistant.
Those are different goals.
2. Defensive defaults save your future self
Timeouts, scoped queries, resource limits: they feel like overhead until the day they save you.
Treat them not as optional extras, but as core safety rails. Bake them into every AI-assisted session you have.
3. Mindset beats tools
You can have the best language, the smartest LLM, the fanciest deployment setup, but without the habit of asking “what could go wrong here?”, you’ll eventually pay for it.
Closing thoughts
I’m writing this not as a tutorial, but as a reminder to myself and to anyone building systems for real users at real scale:
Elegance in code doesn’t guarantee stability in production. Reliability is earned through anticipation, not just validation.
Build with that mindset first, and everything else becomes faster and more enjoyable with AI.
Cheers to building software that survives the real world.