Bit Byte Bit

Anthropic's Steganography Controversy Explained in Non-Technical Terms

Zarar Siddiqi — Wed, 01 Jul 2026 16:26:51 GMT

You may have heard that Anthropic was caught doing something sneaky. I want to explain it without any of the technical stuff, because the lesson is for everyone, especially if you’re paying for AI tools or deciding whether your team should use them.

Someone looked at the internals of Claude Code and found it had been quietly doing something Anthropic hadn’t told anyone about. It wasn’t exactly malicious, but the point is that it was concealed from the user.

Every time it sent a message from the user to the backend (i.e., a prompt), it made a small, inconsequential change to what was sent over the wire from your computer to Anthropic. Instead of sending a date like YYYY-MM-DD, it sent it like YYYY/MM/DD (notice the hyphen replaced with a slash) whenever the user was from certain parts of China. It did similar things if you were using Claude through a reseller, etc. The important point isn’t the details of what was sent, but how it was sent.

Companies collect information about their users all the time so that part isn’t the problem. If Anthropic wanted to know who its users are, they could just ask or relay that information plainly like, {"location": "Shanghai"} as part of the data that is sent to them from your Claude Code to Anthropic servers. What’s bothering people is how they did it.

They own the whole chain as the tool is theirs and the servers are theirs. If they wanted this information, they could have written it down plainly, in the open, the way every normal company does. Instead they chose to hide it and to hide it inside the one part of the message a developer actually relies heavily on: the prompt. It’s like a contractor you hired scribbling notes about you in invisible ink, on the very documents they hand back to you. This is known as steganography, the practice of concealing secret information within an ordinary, non-secret piece of information.

You don’t hide something you’re allowed to do. You hide something when you don’t want the other person to know you did it. This is irking people because if they do something like this here, it’s hard to believe they won’t do it elsewhere as well, and it becomes harder to trust their word when it comes to security and privacy. And nobody would have known, except one person happened to take the software apart, and that’s the problem: the fact that we only found it by accident.

These AI tools aren’t little chat windows anymore. They’re agents that run on your computer with total access to it. They can run commands, read your files, and reach out to the internet, all on your behalf. You hand them the keys to the house because they legitimately do something useful for you, and you inherently trust them.

So let’s think about that. A company shipped software that runs on your machine with the keys to everything, and it was quietly doing something it never disclosed. The hidden mark (e.g., the hyphen/slash swapping to reveal Chinese users) itself was nothing but it proves they’re willing to run things you can’t see. And you can’t check what you can’t see. Essentially, you trust the person who shows you their work over the one who says “just trust me.” Openness earns trust and hiding loses it and this is crystal clear evidence that Anthropic went out of their way to hide it from you. As an aside, the way they hid it is so sloppy that it makes you wonder whether “big tech” developers really are what they are propped up to be.

So where does that leave us? I think it’s a real argument for running these tools on AI models you can run yourself, on your own machine, where your data and your work never leave the building. The local models (like the one I wrote about here) aren’t quite as sophisticated as the big ones yet, and “open” doesn’t automatically mean “safe” but the direction to run more locally is the right one (without even considering the cost angle). You want to be moving toward tools you can see into.

I’m not telling anyone to throw out their tools tomorrow. I’m saying that when you’re choosing who to trust with the keys, pay less attention to what a company promises and more to whether you can check for yourself. The ones worth trusting are the ones who don’t ask you to take their word for it.

Enforcing Invariants in AI-Generated Code with ADRs and Contracts

Zarar Siddiqi — Tue, 30 Jun 2026 17:21:13 GMT

I had earlier written about using Hooks to enforce certain rules in AI-generated code. The idea was to use deterministic checks rather than prose-based guidance which can’t be enforced at time of generation. A hook here is just a script that runs at a point in the agent’s lifecycle and can block it from continuing. The broader idea is to enforce invariants and in this post I will show how we can use:

Classic Architectural Decision Records (ADRs) to record and enforce invariants
RFC 2119 keywords like SHALL and MUST to record and enforce invariants

But first, what is an invariant? Borrowed from Domain-Driven Design, an invariant is a rule that must always hold true for your system to be in a valid state. It’s a promise the code makes to itself that this condition is never allowed to break. An LLM will produce code that looks correct yet quietly violates a rule you assumed was safe, and it has no inherent memory of the constraints your system depends on. The job, then, is to encode those invariants where the AI can’t ignore them, so that generated code is forced to honour them.

Architecture Decision Records as Invariants

Decisions you have made about your architecture can be thought of as invariants that need to be followed. As you incrementally make decisions, we need to ensure they are recorded so agents can treat them as invariants (i.e., rules to be followed). ADRs provide a structured method to do this. To take advantage of them, we need to:

Figure out when we need to record one
Actually record it
Point agents to treat the ADRs as invariants

Knowing when to record one

The hardest part is knowing when to record one. Architectural decisions rarely announce themselves. They slip by inside an ordinary coding session when you pick one storage approach over another, add a major dependency, introduce a new abstraction, or replace an established pattern. To catch these, I use an ADR auto-suggest skill that watches the shape of a conversation and flags an architectural inflection point as it happens. It looks for the tell-tale signals: an “X vs Y” comparison, “replace X with Y” or “deprecate X” language, the introduction of a new system service, or any pattern-setting choice that future work will inherit. It deliberately ignores bug fixes, styling, and behaviour-preserving refactors so it doesn’t fire on noise. The skill never writes the record itself; its only job is to recognize the moment and steer me toward the /adr command. It runs passively most of the time, but I can also invoke it manually whenever I want a second opinion on whether a decision I’m about to make deserves to be recorded.

Recording it

Once a decision is worth capturing, the /adr command does the mechanical work. It finds the next sequential ID, creates a new NNNN-short-kebab-title.md file, and fills in the frontmatter and template: a Context section for the “why now”, a Decision stated in a sentence or two, the Options considered with their trade-offs, and the Consequences that follow. The ADR is staged alongside the related feature commit, so there’s history next to the related code rather than in a separate commit (example ADR here). Just as importantly, the command keeps an index page current with a table of live decisions and one for superseded decisions, so there is always a single, ordered map of every invariant the architecture has committed to.
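The mechanical part of that step, picking the next ID and building the NNNN-short-kebab-title.md name, can be sketched in a few lines of shell. This is a minimal sketch and not the actual /adr command: the function name next_adr_filename and the slug rules (lowercase, runs of non-alphanumerics collapsed to a hyphen) are assumptions for illustration.

```shell
# Hypothetical helper: print the next ADR filename for a directory and a title.
# Assumes existing records are named NNNN-some-title.md (four-digit prefix).
next_adr_filename() {
  adr_dir=$1
  adr_title=$2
  adr_last=0
  for adr_file in "$adr_dir"/[0-9][0-9][0-9][0-9]-*.md; do
    [ -e "$adr_file" ] || continue                 # glob matched nothing
    adr_num=${adr_file##*/}                        # strip the directory
    adr_num=${adr_num%%-*}                         # keep the NNNN prefix
    adr_num=$(printf '%s' "$adr_num" | sed 's/^0*//')   # drop leading zeros (avoids octal parsing)
    adr_num=${adr_num:-0}
    if [ "$adr_num" -gt "$adr_last" ]; then
      adr_last=$adr_num
    fi
  done
  # kebab-case the title: lowercase, non-alphanumerics become single hyphens
  adr_slug=$(printf '%s' "$adr_title" | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')
  printf '%04d-%s.md\n' "$((adr_last + 1))" "$adr_slug"
}
```

With 0001-... and 0009-... already in the directory, next_adr_filename docs/adrs 'Replace X with Y!' prints 0010-replace-x-with-y.md (docs/adrs is a placeholder path).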

Treating ADRs as invariants

Recording a decision is pointless if agents never read it and the index page is the entry point to all such decisions. I point the agent at it so that before touching anything architectural it consults the relevant ADRs and treats their Decision sections as hard constraints. I add a deterministic check in the same spirit as the Hooks from before that verifies the ADRs were actually consulted before code is allowed through. The check confirms that any change touching an area governed by an ADR references that ADR, and fails the run otherwise. This closes the loop: the decision is recorded, surfaced, and then enforced, so an invariant can’t be silently violated simply because the agent didn’t bother to look.

Below is a Stop hook which runs when the agent thinks it’s finished. Like every Claude Code hook it receives a blob of JSON on stdin, which is where the path to the session transcript comes from. Each ADR declares the paths it governs as a scope list of globs in its frontmatter, and the hook compares the files changed in the working tree against those globs. If a changed file falls under an ADR’s scope, that ADR has to have been opened during the session, which I detect by scanning the transcript for the ADR’s file path. If a governed file was touched but its ADR was never read, the hook exits non-zero and hands the agent the reason, forcing it to go back and consult the record before it can stop.

#!/usr/bin/env bash
# .claude/hooks/check-adrs.sh - runs on Stop
transcript=$(jq -r '.transcript_path')   # hook input arrives as JSON on stdin
changed=$(git diff --name-only HEAD)

for adr in site/internal/src/content/docs/adrs/[0-9]*.md; do
  # read the scope globs from the ADR's frontmatter, one per line;
  # while/read stops the shell from expanding the globs against the filesystem
  while IFS= read -r scope; do
    while IFS= read -r file; do
      if [[ -n $file && $file == $scope ]] && ! grep -qF "$adr" "$transcript"; then
        echo "BLOCKED: $file is governed by $adr, which was never consulted." >&2
        exit 2                           # exit 2 -> agent must address it
      fi
    done <<< "$changed"
  done < <(yq --front-matter=extract '.scope[]' "$adr")
done

It doesn’t try to judge whether the code honours the decision, only that the decision was read. That alone removes the most common problem, which is an agent ignoring a constraint it never looked at. This leaves the RFC 2119 specs to do the more semantic checks, as described below.
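One detail worth knowing when writing scope entries: the [[ $file == $scope ]] comparison in the hook uses plain shell pattern matching, not gitignore-style globbing, so * also matches across / and a bare directory name without a wildcard matches nothing beneath it. The sketch below shows this with case, which applies the same pattern rules; the helper name governed_by and the paths are hypothetical.

```shell
# governed_by SCOPE FILE: succeeds when FILE matches the shell pattern SCOPE.
governed_by() {
  case "$2" in
    $1) return 0 ;;   # left unquoted on purpose so SCOPE is treated as a pattern
    *)  return 1 ;;
  esac
}

governed_by 'lib/billing/*' 'lib/billing/invoice.ex'   && echo "direct child: governed"
governed_by 'lib/billing/*' 'lib/billing/tax/rates.ex' && echo "nested file: governed (* crosses /)"
governed_by 'lib/billing'   'lib/billing/invoice.ex'   || echo "bare directory: not governed"
```

So a scope of lib/billing/* is enough to cover the whole subtree, while lib/billing on its own would silently govern nothing.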

RFC 2119 Keywords as Invariants

Where an ADR records a decision, RFC 2119 keywords record behaviour. Words like MUST, MUST NOT, SHALL, SHOULD and MAY turn a requirement into a rule and pairing them with a Gherkin style given/when/then makes each rule concrete enough to check. Written this way a requirement stops being a suggestion and becomes an invariant the code has to satisfy.

Keeping the spec in sync

A spec is only an invariant if it matches the behaviour of the code. The real risk with spec-driven work is drift, where you change your mind during implementation and the spec becomes out of date. I use OpenSpec mostly for this reason. It produces the spec as part of planning and rewrites it after the fact when the implementation diverges, so the keywords and scenarios stay in sync. Here’s an example of a spec it generated that conforms to the RFC. A single requirement reads like this.

#### Requirement: Fulfillment record creation

The system SHALL create a fulfillment record when an order is completed.
Each record MUST have a unique identifier.

Scenario: Order completed with an add-on
  WHEN an order containing an add-on is marked complete
  THEN a fulfillment record is created for that add-on
  AND the record is assigned a unique id

The keyword carries the weight. SHALL and MUST are the invariant, the scenario is how you check it.
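Since the keyword carries the weight, a cheap guard is to flag any requirement that has none. This is a minimal sketch and not something OpenSpec provides as far as I know: the function name requirements_missing_keyword is hypothetical, and it assumes each requirement starts with a "Requirement:" heading like the one above and counts only SHALL and MUST (including their NOT forms) as binding.

```shell
# Print the title of every requirement whose body has no SHALL or MUST.
requirements_missing_keyword() {
  awk '
    function report() { if (title != "" && !ok) print title }
    /^#+ Requirement:/ {
      report()                                  # close out the previous requirement
      title = $0
      sub(/^#+ Requirement:[ ]*/, "", title)    # keep only the requirement name
      ok = 0
      next
    }
    /(^|[^A-Za-z])(MUST|SHALL)([^A-Za-z]|$)/ { ok = 1 }   # uppercase keyword as a whole word
    END { report() }
  ' "$1"
}
```

Run against a spec file, it prints nothing when every requirement is binding, and prints the offending requirement names otherwise, which makes it easy to wire into a failing check.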

Pointing the agent at the spec

A spec is useful only when the agent reads it before generating code. With OpenSpec this is automatic, since any code it generates consults the existing specs and checks for violations first. Without it you wire the same behaviour by hand. Keep the specs in a known directory, tell the agent in its instructions to load the relevant ones before writing code, and back that with a check like the ADR hook above that fails the run if a changed file’s spec was never opened. The tooling differs but the invariant concept still holds, which is that no code ships without its spec being consulted.

Wrapping up

ADRs and RFC 2119 specs solve the same problem from two ends. ADRs pin down the decisions that shape the architecture and specs pin down the behaviour the code has to honour. Both only work if the agent actually reads them, which is why each one is backed by a deterministic check rather than a prompt.

These checks are deliberately shallow. They check that a rule was consulted, not that the code truly honours it, and the harder semantic judgment still falls to you and your tests. What they help with is the most common problem, which is the agent never knowing the rule existed in the first place.

Prose tells an agent what you would like, a check decides what it can ship. The more of your invariants you can move from prose to checks, the less you have to trust that the model remembered, and the more your codebase stays in a state that you recognize.

Four Ways to Plan Agent Work, and When to Switch

Zarar Siddiqi — Mon, 22 Jun 2026 18:31:17 GMT

This post is for developers who want to learn about the different planning approaches to take during agentic software development. This is my experience, and all of this could be wrong, but it does work for me.

The type of planning I reach for depends on the type of change being introduced, and I’ll walk through these from the lightest to the heaviest. But picking the right approach up front is only part of it, as often I start in one mode and realize partway through that I’ve under-planned. The last section covers how I notice that and change direction mid-flight.

Planning Approaches

Before walking through each approach, here’s the whole idea in one chart. The way I see it, the effort of planning boils down to two things: 1) do I actually know what I want, and 2) how much breaks if I get it wrong? Everything below is just those two questions playing out.

Simple changes

If a change is dead simple and you feel the chances of the agent messing it up are low to none, you can enter a direct prompt. The three most recent changes I made using this approach are:

Double the thickness of the border of the selected item on the product page
Change default radius of ad targeting from 15 km to 25 km
Use <.stat> component to display statistics instead of raw HTML on dashboard

These are all changes that have easy verification and are 1-15 lines of changes, if that. I find there is no need to get into any sort of complex planning mode here because you could even hand-roll these changes out easily.

Agent plan mode

Most agents now ship with some sort of plan mode which allows you to scope and bound the change before implementation. The plan mode can be invoked using /plan in Claude Code, and other agents have a similar, if not identical, command. I reserve agent plan mode for changes where I know what I want to implement but where I want a “preview” of what the agent is about to do.

This doesn’t need to be a heavyweight blow-by-blow plan, but something that gives me confidence that it’s about to do the right things, and an opportunity for me to provide guardrails before it does it.

These are the three most recent changes I’ve requested an agent-driven plan for:

The url is not a required field, but if it is provided, use the url path validator that we implemented recently which allows a / or a http(s) url to be used here.
Add a clickable link next to subscriber count on campaign audience page that opens a modal showing a random sample of targeted subscribers with name and email filters.
Remove the image_aspect thumbnail mode setting for event images as we now derive what aspect ratio to use based on whether the event has single or multiple images

Spec Driven Development Execution

There are changes where I know what I want but I:

Expect a bigger blast radius across the codebase (e.g., more than 3-5 files)
Want to spend more time on technical design and architecture
Want a more comprehensive historical record on why the change was introduced

In these cases, I will use a tool like OpenSpec or Superpowers to create a more comprehensive plan which will consider customer needs, technical designs and post-implementation documentation.

I’m a big fan of OpenSpec entirely because of how lightweight and non-prescriptive it is. I will start with an /opsx:propose command which will create the high-level proposal, a Gherkin-based spec which can be used as invariants by agents for future changes, a technical design and a task list for review. I will spend perhaps 10-30 minutes reviewing the plan and tweaking it before going into execution.

These are the three most recent changes I’ve done using /opsx:propose:

Centralize object-level authorization in a shared on_mount hook and replace the per-LiveView/IDOR-open object checks with one shared hook
Create deep links from the order email customer receives to the refund page for that order’s items, while brand settings and not showing user refund links when not applicable to the item they purchased
Derive a brand’s default URL slug from brand name with a mnemonic suffix when URL slug collisions occur (use an existing Elixir library to create suffixes - not random numbers like currently)

The above cases do a significant refactor (1), touch a crucial part of the customer experience which is the email they get (2) and change the internals of how we calculate public-facing URLs (3). These all affect the technical design of the feature to a degree where I want to review the impact in detail to ensure SRP, DRY etc are followed to my liking.

Spec Driven Development Brainstorming

In all the above cases so far, I have known what I want the outcome and design to be at some level. In many other cases, I find myself in a position where:

I don’t know what exact customer experience I’d like but have a general sense of what outcome I want
I have many options on how to implement a particular feature, and am not sure which is the right approach for me as I haven’t weighed the trade-offs
I haven’t implemented something like this before and need to come up with an approach which will help me narrow down scope and create MVPs

The OpenSpec equivalent to this is /opsx:explore and the Superpowers equivalent is /brainstorm.

I find myself reaching for this planning approach a lot as it goes wide by diverging across different concerns before attempting to converge. In these cases I don’t want to arrive at a solution quickly but want to uncover risks, align on the best customer experience, whittle down MVPs, and explore trade-offs and architectural patterns. I may spend hours to days in this mode going back and forth, and only after I’m comfortable will I convert the chat history into an /opsx:propose.

These are the three most recent changes I’ve initiated using /opsx:explore, where I spent at least a full day in this mode before hammering out a spec:

Currently, admins must set an absolute price on every combination individually (e.g., 3 sizes x 3 colors = 9 prices to manage). Changing the base price requires updating all combinations manually. This doesn’t match the Uber Eats-style additive pricing model the modifier system was designed to support, where each option value has a +/- price adjustment and the final price is computed automatically. Let’s make this easier for admins by specifying base price and providing adjustments per combination. This change affects payouts, reports, analytics and possibly other areas of the app.
I want to extend the role-based system by having customers create their own roles instead of pre-defined set of roles. They should also be able to map permissions to users (not just roles). I will later also seek to add object-level authorization (instead of just account based)
Let merchants connect external email-marketing providers (Mailchimp, Squarespace) so that mailing-list signups and opted-in customers are automatically pushed to the merchant’s own audience on those platforms.

As you can probably see, these are big changes with wide ranging impacts to multiple app modules. I see myself as not necessarily knowing what the solution here will be, and want to spend a lot of time branching out and exploring before aligning on a direction. Direct prompting is entirely useless here, agent planning mode is insufficient, and even creating a spec for immediate execution seems risky.
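The four approaches above can be compressed into a small decision function. This is only a sketch of my two questions (do I know what I want, and how much breaks if I get it wrong): the function name pick_planning_mode is hypothetical, and the exact cut-offs (one file and 15 lines for a direct prompt, more than five files for a spec) are assumptions loosely based on the ranges mentioned earlier, not hard rules.

```shell
# pick_planning_mode KNOWS_OUTCOME(yes|no) FILES_TOUCHED LINES_CHANGED
pick_planning_mode() {
  if [ "$1" != "yes" ]; then
    echo "brainstorm"        # outcome unclear: diverge before converging
  elif [ "$2" -gt 5 ]; then
    echo "spec"              # wide blast radius: review design and keep a record
  elif [ "$2" -le 1 ] && [ "$3" -le 15 ]; then
    echo "direct prompt"     # tiny, easily verified change
  else
    echo "plan mode"         # known change, but worth a preview first
  fi
}

pick_planning_mode yes 1 10    # direct prompt
pick_planning_mode yes 3 40    # plan mode
pick_planning_mode yes 8 200   # spec
pick_planning_mode no 2 20     # brainstorm
```

In practice the inputs are estimates, which is why noticing a wrong estimate partway through matters as much as the initial pick.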

Changing Direction Mid-Flight

Everything above assumes I pick the right planning mode up front. In practice the more useful skill is noticing, partway through, that I picked too light a mode and bumping up a tier before I’ve wasted too much time. Here’s what I watch for.

Switching from simplistic prompts to agent plan mode

The giveaway is when the diff surprises me. I asked to swap raw HTML for the <.stat> component on the dashboard, expecting a handful of lines, and the agent starts reaching into how the dashboard loads its data. The moment a “1-15 line” change touches a file I didn’t picture, I stop and re-issue it as a /plan so I can see the full scope before it runs.

Switching from agent plan mode to spec execution

This happens when the plan coming back is bigger than the preview I expected. I asked for the optional-url-validator change expecting a bounded plan, and instead it touches five or six files, or contains a design decision I can’t make in thirty seconds, e.g., where the validation should live, whether it’s shared or a one-time use. When a “preview” turns into something I need to review for SRP and DRY, plan mode is no longer good enough and I need a spec to review.

Switching from spec execution to brainstorming

The clearest signal is that I’m starting to extensively rewrite the spec instead of just reviewing it. If I sit down to tweak an /opsx:propose output and keep changing the approach, or worse, I keep changing the Gherkin invariant because I don’t actually know what the correct behaviour should be, then I admit defeat and go into brainstorming/explore mode. I find that if I don’t do this, I end up baking my confusion into the code.

Conclusion

Your mileage may vary on these and you may have an entirely different approach, and that’s completely fine. The type of tools available to us has exponentially increased over the last three years, and there’s no single “right” way of approaching things. The key is to de-risk the implementation at the right juncture so that what you end up producing conforms to your mental model of how the software internally works, and the changeability of those internals as customer needs invariably shift.

Don't rely on instructions; use Agent Hooks to enforce guardrails

Zarar Siddiqi — Sat, 20 Jun 2026 02:57:25 GMT

This post is for developers who use AGENTS.md or CLAUDE.md to provide guardrails for agent-generated code, but find that the agent sometimes ignores rules. If you want a deterministic check that will work 100% of the time, read on about agent hooks.

First, a clarification. Agent Hooks are different from Git hooks, which many developers are familiar with. The most popular Git hook might be the pre-commit hook, which is called just before a commit is created and is a popular place to do perhaps a git pull or some code formatting (e.g., prettier or mix format) to ensure your code is formatted as per the language’s standards. The limitation of a pre-commit hook is that it gets executed well after you have generated the code and just before you think you’re done (i.e., commit time).

Agent hooks are invoked while the agent (e.g., Claude Code) is doing work and allow developers to interject themselves into the agent’s workflow, rather than after the work is done (e.g., code review). Here’s a list of Claude Code Hooks which we’ll refer to. As a caution, not all agents have the same hooks. Unlike Skills, where a standard exists, Hooks are a bit of a mess so you’ll have to see what hooks your agent makes available to you. I’m going to be doing two deterministic checks for things which have bitten me in the past:

Ensure that the agent never uses a raw <input> tag directly because I want it to use the design components I have
Ensure that the agent never tells me it’s done while my design-system ratchet test is failing

These two fire at completely different points in the agent’s lifecycle. The first runs before the agent executes a tool; the second runs when the agent thinks it’s finished.

Every hook gets a blob of JSON on stdin, and the shape of that blob depends on the event. That’s what the jq calls below are digging into. I’ll show you exactly what each hook receives so the paths the jq tool is using make sense. I’m using jq but you could have written a Python script, a shell script or anything that the agent could call.

1. No raw <input> tags

This one is a PreToolUse hook. PreToolUse fires right before Claude Code runs a tool, and it’s the one place where you can actually stop the tool from running, by exiting with code 2. Whatever you write to stderr when exiting with code 2 is shown to the agent as feedback. Any other non-zero code, such as exit 1, only logs a non-blocking error and lets the tool through.

I want every form field to go through my own <.cinput> component, not a bare <input>. So I check the content the agent is about to write and block it if I see the tag. This goes in .claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.content // .tool_input.new_string // empty' | grep -q '<input' && { echo 'Use the <.cinput> design component, not a raw <input> tag.' >&2; exit 2; } || exit 0"
          }
        ]
      }
    ]
  }
}

Here’s what the hook actually sees on stdin when the agent goes to write a file:

{
  "hook_event_name": "PreToolUse",
  "tool_name": "Write",
  "tool_input": {
    "file_path": "lib/amplify_web/components/form.ex",
    "content": "...the code the agent wants to write..."
  },
  "session_id": "…", "cwd": "…", "transcript_path": "…"
}

That’s why the jq pulls .tool_input.content. A Write puts the whole file under content, but an Edit puts it under new_string instead (with old_string alongside it), so I fall back to .tool_input.new_string to cover both. The agent never gets to put a raw <input> on disk, as the write dies and my message tells it to go use the component instead.
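If the inline one-liner gets unwieldy, the same check can live in a small script. This is a minimal sketch, assuming the tag being blocked is a raw <input>; the function name block_raw_input is hypothetical. It reads the proposed content on stdin and returns 2 with a message on stderr when it finds the tag, following the exit-code convention described above.

```shell
# Reads the content the agent wants to write on stdin.
# Returns 2 (blocking) if it contains a raw <input tag, 0 otherwise.
# Note that <.cinput> does not match, because the pattern needs "<" directly before "input".
block_raw_input() {
  if grep -q '<input'; then
    echo 'Use the <.cinput> design component, not a raw <input> tag.' >&2
    return 2
  fi
  return 0
}

# Inside a hook script the payload would be piped through jq first, e.g.:
#   jq -r '.tool_input.content // .tool_input.new_string // empty' | block_raw_input || exit $?
```

Keeping the check in a function also makes it easy to try by hand: printf '<input type="text">' | block_raw_input should print the message and return 2.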

What I could’ve done instead, but didn’t trust:

Just writing “always use <.cinput>, never raw <input>” in CLAUDE.md. That’s the exact thing the agent ignores half the time and the reason you’re reading this.
A mix credo (or eslint or pick your language’s linter) rule. Better, but it only catches the tag whenever something actually runs the linter, which the agent may not bother to do, and even then it’s at lint/commit time, well after the code is already written.

2. Don’t let it stop until the ratchet test passes

This one’s a Stop hook, which fires the moment the agent decides it’s finished. It’s the inverse of PreToolUse as instead of blocking an action before it happens, it refuses to let the agent end the turn at all. Exit 2 here means “no, keep working,” and the stderr message tells it why.

I keep a ratchet test that locks in design-system decisions I’ve made at test/amplify_web/design_system_ratchet_test.exs. The thing that’s bitten me most is the agent announcing it’s done with that ratchet red. The agent may run the tests it thinks it needs to verify its work, but the ratchet test doesn’t always get picked up as it’s more of a “global” check rather than specific to a feature. So I gate the finish on exactly that test, not the whole suite (it’s faster, and it’s the decision I actually care about):

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "[ \"$(jq -r '.stop_hook_active')\" = true ] && exit 0; mix test test/amplify_web/design_system_ratchet_test.exs >/dev/null 2>&1 || { echo 'Design-system ratchet test is failing: fix it before you call it done.' >&2; exit 2; }"
          }
        ]
      }
    ]
  }
}

The stdin for a Stop hook is much thinner since there’s no tool to inspect, just the fact that the agent wants to wrap up:

{
  "hook_event_name": "Stop",
  "stop_hook_active": false,
  "session_id": "…", "cwd": "…", "transcript_path": "…"
}

No tool_input here since there’s no tool invocation happening. The Stop hook runs my gate and decides whether the turn is allowed to end, so the jq only reaches for .stop_hook_active. Now the agent literally can’t wrap up until the ratchet test is passing.

One important point that tripped me up: that stop_hook_active check at the front is not optional. Once a Stop hook has forced a continuation, that flag comes back true on the next stop, and if you don’t bail out when you see it, a permanently-red ratchet will trap the agent in an infinite “fix → stop → blocked → fix” loop until you kill the session, so we must check the flag and let it stop.
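The control flow of that gate is easier to see pulled out into a function. This is a minimal sketch with a hypothetical name, stop_gate: it takes the stop_hook_active value as its first argument and the command to run as the rest, so the loop guard and the blocking exit can be exercised without a real agent or test suite.

```shell
# stop_gate ACTIVE CMD [ARGS...]
# Returns 0 (let the agent stop) if a previous Stop hook already forced a
# continuation, or if CMD passes. Returns 2 (keep working) if CMD fails.
stop_gate() {
  if [ "$1" = "true" ]; then
    return 0                       # loop guard: never block twice in a row
  fi
  shift
  if "$@" >/dev/null 2>&1; then
    return 0
  fi
  echo 'Gate command is failing, fix it before you call it done.' >&2
  return 2
}

# In a hook script this would look roughly like:
#   stop_gate "$(jq -r '.stop_hook_active')" mix test test/amplify_web/design_system_ratchet_test.exs || exit $?
```

Swapping the real test command for true or false is a quick way to confirm all three paths: stop_gate false true allows the stop, stop_gate false false blocks it, and stop_gate true false lets it through because of the guard.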

What I could’ve done instead, but didn’t trust:

CLAUDE.md (“always run the ratchet before saying you’re done”). Ignored, same as everything else in this category.
A PostToolUse hook running the ratchet after every edit. It works, but it fires constantly mid-task when the code is legitimately half-finished, so it’s slow and noisy. The Stop gate runs once, at the only moment that matters: when the agent claims it’s done.
Leaving it to pre-commit or CI. Catches it eventually, but only at commit/push time, i.e., after the agent’s already declared victory and I’ve moved on. That’s the exact “too late” problem with the pre-commit hook I opened this post complaining about.

One more trap that applies to both: jq fails silently. Get a path wrong (.tool_input.content, .stop_hook_active) and jq returns null, your check matches nothing, and the gate quietly does nothing while looking like it works. Test each one against a real hook payload before you trust it.

That’s it. Two checks at two different points in the loop, both deterministic, both fire every single time and give you more confidence that the agent isn’t going sideways by ignoring your MUST DO VERY IMPORTANT DON’T FORGET instructions in CLAUDE.md!

Run a local coding model with pi and LM Studio

Zarar Siddiqi — Wed, 17 Jun 2026 16:13:58 GMT

This post is for people who generally use Claude, Codex, or Gemini but have heard you can run open-source models locally for free. The goal is to get you set up in no time so you can play around with the power of local models.

If you already use a coding agent, you already know how this works. A coding agent (e.g., Claude Code) talks to a model over an HTTP API: it sends your request, the model sends back tokens, and the agent uses tools (read, edit, run) to do real work. With Claude Code or Codex, that API lives in a datacenter and you reach it over the internet.

Running locally changes one main thing: where the endpoint is. Instead of pointing your agent at a remote provider, you point it at a server running on your own machine. Same request in, same completion out except the model just happens to be sitting on your local computer.

Here’s the architecture, side by side:

The three pieces on the local side:

pi — the coding agent (the equivalent of Claude Code / Codex). It sends requests and runs tools.
LM Studio — a local server that hosts the model and exposes an OpenAI-compatible endpoint on http://localhost:1234/v1. You can use other options like ollama which is headless but we’re going to stick with the easiest way (I think) and use LM Studio.
qwen/qwen3.6-27b — the actual model, running on your hardware which you can download through LM Studio.

Because pi speaks the OpenAI chat-completions protocol and LM Studio serves an OpenAI-compatible endpoint, hooking them together is a drop-in. You tell pi the base URL is localhost:1234, and that’s the whole trick.

Step 1: Check what your machine can run

Before downloading anything, check your hardware. Two browser tools detect your GPU, VRAM, and RAM and tell you which models will actually run (and how well):

https://www.canirun.ai/ detects your hardware and grades models from “runs great” to “too heavy.”
https://www.caniusellm.com/ does a similar check and adds quantization recommendations (which INT4/INT8/FP16 build fits your specs).

A 27B model like qwen3.6-27b at a 4-bit quant is roughly 15–16 GB of weights before you add any context, so a 24 GB GPU is a comfortable sweet spot. If your machine is smaller, the checker will point you at a model that fits.

I’m running this on an Apple M5 with 128 GB of unified memory. Because Apple Silicon shares that memory pool between CPU and GPU, the whole 128 GB is available to the model. A model like qwen3.6-27b plus a generous context window barely makes a dent, so I can run the full 256K window without thinking about it and even reach for higher-quality quants (Q6/Q8). If you’ve got a machine in this class you have plenty of headroom; on a smaller GPU, let the hardware checker steer your model and quant choice.

Quantization just means a compressed version of the weights. Q4 is the usual sweet spot between quality and size. If it fits and runs, you’re good, don’t overthink it for your first model.
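The weight sizes are simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. Here is a back-of-the-envelope sketch. The effective bits-per-weight figures for Q4 and Q8 are my assumptions (real quant formats store a little extra for scaling data), so treat the output as a rough guide, not a spec.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Effective bits per weight are assumed values, not published figures.
for label, bits in [("Q4 (~4.5 bits, assumed)", 4.5), ("Q8 (~8.5 bits, assumed)", 8.5), ("FP16", 16)]:
    print(f"{label}: {weight_size_gb(27, bits):.1f} GB")
```

At the assumed 4.5 bits per weight, a 27B model lands at roughly 15.2 GB, which lines up with the 15 to 16 GB quoted above. Context (the KV cache) comes on top of that.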

Step 2: Install LM Studio and load your model

Download LM Studio (macOS, Windows, Linux) and install it. The docs walk through the app if you want them.

Inside LM Studio, use the model search to download qwen3.6-27b (or whichever model the hardware check recommended), picking the quant that fits your machine. Load it, then turn on the local server: open the Developer tab and toggle Start server. That’s what exposes the OpenAI-compatible endpoint at http://localhost:1234/v1 that pi will talk to. (Server docs.)

Step 3: Install pi

pi is the coding agent. Install it with the official script:

curl -fsSL https://pi.dev/install.sh | sh

Or via npm if you prefer: npm install -g @mariozechner/pi-coding-agent. Either way, run pi --version to confirm it worked, and see the quickstart docs for first-run setup and logging into cloud providers. You could also use OpenCode but I prefer pi so let’s run with it.

Step 4: Point pi at your local model

pi finds custom providers in a models.json file in your agent directory. Open it:

vi ~/.pi/agent/models.json

Here’s how I configured mine:

{
  "providers": {
    "lmstudio": {
      "baseUrl": "http://localhost:1234/v1",
      "api": "openai-completions",
      "apiKey": "lm-studio",
      "models": [
        {
          "id": "qwen/qwen3.6-27b",
          "input": [
            "text"
          ]
        }
      ]
    }
  }
}

A few notes on what’s going on here:

"api": "openai-completions" tells pi to use the OpenAI chat-completions protocol. This is the part that makes any OpenAI-compatible local server (LM Studio, Ollama, vLLM) just work.
"apiKey": "lm-studio" is required but ignored by LM Studio — any non-empty string is fine.
The "id" must match exactly what LM Studio exposes. If you’re not sure, run curl http://localhost:1234/v1/models and copy the id from there.
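The id check is easy to automate. This sketch compares the ids in a models.json-shaped config against a /v1/models-shaped response. The helper names are hypothetical, and the sample response is hand-written in the standard OpenAI list format (a "data" array of objects with an "id"), not captured from a real server.

```python
def configured_ids(models_config: dict, provider: str) -> list[str]:
    """Model ids declared for one provider in a models.json-style config."""
    return [m["id"] for m in models_config["providers"][provider]["models"]]


def missing_ids(models_config: dict, models_response: dict, provider: str = "lmstudio") -> list[str]:
    """Configured ids that the server does not list."""
    served = {m["id"] for m in models_response["data"]}
    return [mid for mid in configured_ids(models_config, provider) if mid not in served]


# Same shape as the config above, trimmed to the fields this check needs.
config = {"providers": {"lmstudio": {"models": [{"id": "qwen/qwen3.6-27b"}]}}}

# Hand-written stand-in for the output of: curl http://localhost:1234/v1/models
response = {"object": "list", "data": [{"id": "qwen/qwen3.6-27b", "object": "model"}]}

print(missing_ids(config, response))
```

An empty list means every configured id is being served. Anything else is the exact string to go fix in models.json.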

Once it’s saved, you’ll see the model in pi’s model picker (/model). And whenever you want to jump back to a cloud model, the same picker switches you over, with local and remote living side by side.

Step 5: Set your context size and reload the model

The context window is the model’s working memory, meaning everything it can “see” at once. On a cloud model this is fixed for you. Locally, you choose it when you load the model, because a bigger context costs more VRAM (the KV cache grows with context length).

In LM Studio, set the context length on the model’s load settings, then reload the model for it to take effect.

How much context you want depends on what you’re doing and how much VRAM you can spare on top of the weights:

| What you’re doing | Context size | Extra VRAM (rough) |
| --- | --- | --- |
| A couple files | 16K | ~1 GB |
| Simple coding, UI development | 64K | ~4 GB |
| Multi-file refactors, medium complexity tasks | 128K | ~8 GB |
| Planning with full-repo context | 256K | ~16 GB |

These are ballpark KV-cache numbers on top of the ~15–16 GB the 27B weights already use, so the full 256K window is pretty heavy. You have to budget for it before you crank the slider all the way to the right. When in doubt, start at 64K. It’s plenty for day-to-day work and unless you find the agent making mistakes and forgetting things a lot, you can stick around here.
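Budgeting is just addition: weights plus the KV-cache figure for your chosen context has to fit in VRAM. A small sketch using the ballpark numbers from the table above; the 15.5 GB weight figure is my midpoint of the quoted range, and the function name is made up.

```python
# Rough extra VRAM per context size, copied from the table above (GB).
KV_CACHE_GB = {"16K": 1, "64K": 4, "128K": 8, "256K": 16}

# Assumed midpoint of the ~15 to 16 GB quoted for the 27B weights at Q4.
WEIGHTS_GB = 15.5


def largest_context_that_fits(vram_gb: float, weights_gb: float = WEIGHTS_GB):
    """Largest context size from the table whose total stays within vram_gb."""
    fitting = [ctx for ctx, extra in KV_CACHE_GB.items() if weights_gb + extra <= vram_gb]
    return fitting[-1] if fitting else None


for vram in (16, 24, 32):
    print(f"{vram} GB VRAM -> {largest_context_that_fits(vram)}")
```

On a 24 GB card this says 128K just squeezes in (23.5 GB), but that leaves almost nothing for anything else on the GPU, which is another reason to start at 64K.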

256K locally vs 1M in the cloud

qwen3.6-27b supports up to 256K tokens of context natively. For comparison, Claude Opus 4.8 runs in a 1M-token window. That gap is the main thing to keep in mind when deciding what to run where.

256K is still a lot and easily holds a focused slice of a codebase: the files for a feature, a module and its tests, a long debugging session. For most everyday coding, single-feature work, and contained refactors, you won’t feel a limitation here but your mileage may vary.

Where the 1M cloud window pulls ahead is the whole-codebase, long-horizon stuff: loading an entire large repo at once, agentic tasks that run for hundreds of steps and accumulate huge history, or analyses that need to hold a giant document set in view. If you find yourself constantly trimming what you feed the model, that’s the sign to reach for a cloud model for that task.

A reasonable rule of thumb:

Local (qwen3.6-27b): focused edits, day-to-day coding, contained refactors, anything you want to run offline, privately, or at zero cost. A great use case is throwaway work where you don’t want to spend tokens.
Cloud (Claude / Codex / Gemini): whole-repo context, the hardest reasoning, and long agentic runs where the bigger window and top-tier capability earn their keep.

Re-use Claude Code skills

If you’ve built up SKILL.md skills for Claude Code, pi uses the same open Agent Skills standard (originally from Anthropic, now adopted across Claude Code, Codex, Gemini CLI, and more).

The catch is that pi only auto-discovers skills in its own locations by default. It does not look in ~/.claude/skills unless you tell it to. So if your Claude skills aren’t showing up in pi, this is why.

Point pi at them in its settings file. Note this is settings.json, not the models.json from Step 4:

vi ~/.pi/agent/settings.json

Here’s a full settings.json that points pi at your skills and boots it straight into your local model:

{
  "defaultProvider": "lmstudio",
  "defaultModel": "qwen/qwen3.6-27b",
  "skills": [
    "~/.claude/skills",
    "~/.codex/skills"
  ]
}

The skills array is a list of directories pi scans for SKILL.md folders. (defaultProvider and defaultModel are optional; they match the provider name and model id from your models.json in Step 4 so pi starts on your local model automatically.)

Then restart pi as it only scans skill locations at startup, so a running session won’t pick up the change. After the restart, pi loads skills exactly the way Claude Code does with only the descriptions loaded in context, and the full instructions load on demand when a task matches (or when you force it with /skill:name).
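If your skills still don't show up, it helps to check what is actually sitting in those directories. The sketch below is not pi's discovery logic, just my approximation of "scan each directory for folders containing a SKILL.md". The function name and the first-directory-wins rule for duplicate names are assumptions.

```python
from pathlib import Path


def find_skills(skill_dirs: list[str]) -> dict[str, Path]:
    """Map skill folder name -> SKILL.md path for every folder that has one."""
    found: dict[str, Path] = {}
    for raw in skill_dirs:
        root = Path(raw).expanduser()
        if not root.is_dir():
            continue  # a missing directory is skipped, not an error
        for child in sorted(root.iterdir()):
            skill_file = child / "SKILL.md"
            if child.is_dir() and skill_file.is_file():
                # Assumption: the first directory listed wins on a name clash.
                found.setdefault(child.name, skill_file)
    return found


for name, path in find_skills(["~/.claude/skills", "~/.codex/skills"]).items():
    print(f"{name}: {path}")
```

If this prints nothing, the problem is the directory contents or the path, not pi.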

Quick gotchas before you start

A list of the things that tripped me up:

Start LM Studio’s local server. Loading the model isn’t enough and the server has to be switched on, or pi can’t reach localhost:1234. This is the most common problem.
Match the model id exactly between LM Studio and your models.json. curl http://localhost:1234/v1/models shows you the truth.
Expect it to be slower than cloud. A 27B running on your GPU won’t hit cloud token rates, and the first load takes a moment. That’s normal.

My workflow: Plan with Claude, build locally

pi isn’t local-only. It has built-in support for Claude, so you can run cloud and local models from the same agent. Authenticate once with /login (it works with a Claude Pro/Max subscription or an Anthropic API key, stored in ~/.pi/agent/auth.json), and Claude shows up in the same /model picker as your local model.

That unlocks the workflow that makes local models genuinely practical: plan with Claude, execute locally.

Use a Claude model for the thinking, such as architecting a feature or breaking the work into steps, where top-tier reasoning and the 1M window come into play.
Then switch to your local model (/model) to execute the plan: the repetitive edits, running tests, grinding through the refactor — for free, offline, and private.
Flip back and forth as much as you want within a single session. Cloud for the hard thinking, local for the volume.
I use OpenSpec, so I create the plans with Claude and execute them locally.

That’s the whole setup. Check your hardware, install LM Studio and load a model that fits, install pi, drop a provider into models.json, set your context, and start coding. You’re running a capable coding model entirely on your own machine.

Spec-Driven Development: From Vibe Coding to Structured Development

Zarar Siddiqi — Wed, 25 Feb 2026 00:50:43 GMT

Introduction

If you’ve used an AI coding tool in the last year, you’ve probably had the experience: you describe what you want, the AI generates something that looks right, you run it, and... it doesn’t quite work. You refine your prompt. The AI fixes one thing and breaks another. Three iterations later you’re debugging code you didn’t write and don’t fully understand.

This is the failure mode of what Andrej Karpathy called “vibe coding” and it’s become the default way most developers interact with AI. Spec driven development (SDD) is the emerging counter movement. Instead of throwing prompts at an LLM and hoping for the best, you write a structured specification first, then let the AI implement against it.

The idea isn’t new. We’ve been writing requirements documents since forever, but the tooling is new. Tools like GitHub’s Spec Kit, Amazon’s Kiro, and Fission AI’s OpenSpec are attempting to formalize this workflow into something repeatable. Whether that formalization helps or hinders depends entirely on what you’re building, how you’re building it, and the tradeoffs you’re willing to make.

Our team uses OpenSpec, so most of the practical examples in this post come from that experience. But the principles apply regardless of which tool you pick.

The Problem: Why “Just Prompting” Breaks Down

The pitch for AI assisted coding is attractive: describe what you want in English and get working code back. And for simple tasks (a helper function, a config change, renaming a module), it works remarkably well. The challenge starts when changes aren’t trivial and require edits to multiple files or packages/modules.

The core issue is context loss. When you’re five prompts deep into a feature, the AI has no persistent memory of the architectural decisions you made in prompt one. It doesn’t know you chose a specific idempotency strategy for a reason. It doesn’t remember that you explicitly avoided storing raw card data outside the tokenization boundary. Every new prompt starts from a partial view of the world, and the AI fills in the gaps with whatever patterns it’s seen most in training data.

In payments systems, this produces particularly dangerous failures. Reconciliation logic scattered across three different modules because each prompt generated its own approach. A refund handler that doesn’t account for partial captures. Currency conversion applied twice because the AI didn’t know about the upstream normalization step. And perhaps most critically in our domain, security flaws: API keys committed to source, missing input validation on transaction amounts, authorization checks that live on the client instead of the server. Studies have found that roughly 45% of AI generated code contains security vulnerabilities. In a payments context, that’s not just a bug but a compliance issue.

The other failure is architectural drift. Without a shared plan, each prompt/response cycle makes locally reasonable decisions that are globally incoherent. The AI can’t refactor itself out of architectural problems it doesn’t understand. You ask it to add retry logic to a payment gateway call and it builds a standalone retry mechanism, unaware that you already have a circuit breaker pattern in your infrastructure layer. Once the codebase reaches a certain size, the context window can only see fragments of it. You end up with a system that processes transactions but that nobody, including the AI, fully understands anymore.

This isn’t the AI being dumb. It’s the natural consequence of building without a map.

What Spec Driven Development Actually Is

At its simplest, spec driven development means: write down what you’re building before you write the code, and make that written artifact the thing your AI agent works from.

That might sound like waterfall, but it’s not, or at least it doesn’t have to be. The key differences are timescale and scope. Traditional waterfall specs were project level documents written over weeks and often carved in stone. SDD specs are feature level documents written in minutes and meant to evolve. You’re not planning an entire system upfront; you’re planning the next meaningful chunk of work in enough detail that an AI can implement it without guessing.

A typical SDD workflow looks like this:

Define requirements. What should this feature do? Who is it for? What are the acceptance criteria? What are the edge cases?
Create a technical design. How should it be implemented? What’s the data model? What APIs are involved? What patterns should be followed?
Break it into tasks. What are the discrete, testable units of work? In what order should they be done?
Implement. The AI executes against the task list, one piece at a time, with the full spec as context.

You’re not writing all of this yourself. You describe the intent in natural language, and the AI generates the spec artifacts: the proposal, the requirements, the design, the task breakdown. Your job is to review, refine, and correct. You steer and the AI does the heavy lifting. This is what makes the process fast enough to be practical. Writing a 200 line spec by hand for every feature would be painful. Having the AI draft it in 30 seconds and then spending 5 minutes reviewing and adjusting it is a different proposition entirely.

The spec becomes a persistent artifact, a “super prompt” that doesn’t disappear when your chat session ends. It lives in version control alongside your code. When the AI drifts, you point it back to the spec. When requirements change, you update the spec and regenerate.

The fundamental shift is that the specification becomes the source of truth, and code becomes the derived artifact. Traditional documentation describes code that already exists. SDD inverts that relationship. You define the behaviour, constraints, and architecture in the spec, and the AI produces code that conforms to it. The spec isn’t something you write after the fact to explain what was built but the input that determines what gets built. Code is the output.

The Tooling Landscape

Three tools have emerged as the most prominent in this space. Each takes a different philosophical approach.

GitHub Spec Kit

Spec Kit is an open source CLI from GitHub that scaffolds a spec driven workflow into your existing project. It’s agent agnostic, working with GitHub Copilot, Claude Code, Gemini CLI, and others. The workflow follows rigid phases driven by slash commands: /speckit.constitution to establish project principles, /speckit.specify to create feature specs, /speckit.plan for a technical plan, /speckit.tasks for work items, and /speckit.implement to execute.

Strengths: Thorough documentation output, the “constitution” concept for project wide principles, works with many agents.

Weaknesses: Heavyweight. It can generate a lot of artifacts for simple changes. Rigid phase gates mean you can’t easily jump back and forth between planning and implementing.

Amazon Kiro

Kiro is a full IDE (a VS Code fork) with spec driven development baked into the editing experience. The workflow follows a similar shape (requirements → design → tasks → implement) but is tightly integrated with the editor. It generates user stories with acceptance criteria, creates technical design documents, and produces task lists. It also introduces “Hooks,” user defined prompts triggered by file changes.

Strengths: Most polished integrated experience. The Hooks system is excellent and something you’d have to configure manually if you decide to do it on your own. No context switching between planning and editing because of the IDE integration.

Weaknesses: You’re locked into their IDE and limited to Claude models. Can be overkill for small changes. One developer reported a simple bug fix generating 4 user stories with 16 acceptance criteria. The overhead can be significant.

OpenSpec (Fission AI)

OpenSpec is the most lightweight of the three. It’s a TypeScript CLI with a fluid, iterative workflow and no rigid phase gates. Where Spec Kit enforces a strict sequence and Kiro wraps everything in an IDE, OpenSpec gets out of your way and lets you move between planning artifacts freely.

Its distinguishing philosophy is “brownfield first.” While the other tools are optimized for building new things from scratch, OpenSpec is designed to work with existing codebases. Each change produces a “spec delta,” a document that captures what’s being added, modified, or removed relative to the existing system. Over time, these deltas merge into a living specification that reflects the current state of the system.

OpenSpec also handles change history better. Every completed change is archived with its full artifact set: the original proposal, the spec deltas, the design, and the task list. This means you can go back and see not just what changed in the system, but why it changed, what alternatives were considered in the design, and what the original acceptance criteria were. Spec Kit and Kiro generate artifacts during planning but don’t have the same structured archive and merge cycle. In OpenSpec, the openspec/changes/archive/ directory becomes a chronological record of every significant change to the system, and the openspec/specs/ directory is always the merged, current truth. For regulated environments where auditability matters, this distinction is significant.

Strengths: Works with 20+ AI tools including Claude Code, Cursor, Copilot, Windsurf, and many others. The brownfield focus is valuable in our context as most real work is on existing codebases. Fluid workflow lets you update any artifact at any time and you are not forced into a linear way of working. The archive/merge cycle produces both a living spec and an auditable change history.

Weaknesses: Less hand holding in the spec writing process, which is the trade-off for being able to move back and forth between spec and implementation. The tool is newer and the ecosystem is still growing.

Installing OpenSpec

OpenSpec requires Node.js 20.19.0 or higher.

Install OpenSpec globally:

npm install -g @fission-ai/openspec@latest

Then navigate to your project directory and initialize:

cd your-project
openspec init

The init process will ask which AI tool you’re using and configure the appropriate slash commands or agent instructions for your environment.

OpenSpec also works with pnpm, yarn, bun, and nix. See the official installation docs for alternative paths.

Keeping OpenSpec Updated

Upgrade the package:

npm install -g @fission-ai/openspec@latest

Then refresh agent instructions in each project:

openspec update

OpenSpec’s Workflow in Depth

Understanding the full lifecycle of an OpenSpec change is worth the time, because the artifacts it generates serve different roles on the team in different ways.

The Core Commands

OpenSpec’s workflow is built around the opsx slash commands. Here’s the complete set, the ones you interact with the most are bolded:

| Command | Purpose |
| --- | --- |
| /opsx:onboard | Guided tutorial through the complete workflow using real code |
| /opsx:explore | Think through ideas, investigate problems, clarify requirements before committing to a change |
| /opsx:new | Create a new change folder with metadata |
| /opsx:continue | Progress a change to its next phase (proposal → design → tasks) |
| /opsx:ff | “Fast forward”: generate all planning artifacts at once |
| /opsx:apply | Implement tasks, writing code and checking off items |
| /opsx:verify | Validate that implementation matches the artifacts (completeness, correctness, coherence) |
| /opsx:sync | Merge delta specs into main specs without archiving (useful for long running changes) |
| /opsx:archive | Archive a completed change, merging delta specs into main specs |
| /opsx:bulk-archive | Archive multiple completed changes at once, handling spec conflicts |

The typical flow is new → ff → apply → archive, but the power of OpenSpec is that you can break out of that sequence at any point. Need to revisit the design after you’ve started implementing? Just edit design.md. Want to add acceptance criteria while coding? Update the spec delta. There are no phase gates forcing you to “finish” one stage before moving to another.

Starting a Change: Explore vs. New

One of the first decisions in any OpenSpec workflow is how you enter it. There are two entry points, and choosing the right one makes a real difference in the quality of what comes out the other side.

/opsx:new is for when you know what you’re building. You have a clear feature in mind, you understand the requirements well enough to describe them, and you’re ready to start generating planning artifacts. Maybe you’ve already discussed this in a planning meeting. Maybe you’ve built something similar before. Maybe the ticket is well defined and you just need to formalize it. In these cases, /opsx:new add-feature-name followed by /opsx:ff gets you from zero to a full set of planning documents in minutes.

You: /opsx:new add-payment-retry-with-exponential-backoff

AI:  Created openspec/changes/add-payment-retry-with-exponential-backoff/
     Ready to create: proposal

You: /opsx:ff

AI:  Creating all planning artifacts...
     ✓ proposal.md
     ✓ specs/
     ✓ design.md
     ✓ tasks.md
     Ready for implementation!

This works well when the problem space is familiar and constraints are understood. You’ve implemented retry logic before, you know the gateway’s behavior, you just need to formalize it.

/opsx:explore is for when you’re still figuring things out. Maybe you have a vague idea but the requirements aren’t clear. Maybe you’re investigating a production anomaly and you don’t know where the root cause is yet. Maybe a stakeholder said “we need to support instant payments” and you need to figure out what that actually means for your settlement architecture before you can spec anything.

Explore opens a broader, open ended conversation with the AI before any change folder is created and before any artifacts are generated. There’s no structure imposed, no proposal template to fill out, no pressure to define acceptance criteria. You’re just thinking out loud with an assistant that has context on your codebase.

You: /opsx:explore

AI:  What would you like to explore?

You: We’re seeing intermittent failures on our direct debit 
     collections via the pain.008 pathway. Some mandates are 
     being rejected with AM04 (insufficient funds) but the 
     retry logic doesn’t seem to be kicking in correctly. I’m 
     not sure if it’s a status mapping issue, a timing problem 
     with our batch processing window, or something in our 
     mandate lifecycle management. I want to figure out where 
     to focus before committing to a specific fix.

AI:  Let me help investigate. Based on your codebase, I can 
     see a few potential areas...
     [Analyzes code paths, suggests profiling approaches, 
      identifies likely bottlenecks, asks clarifying questions]

The explore conversation might last five minutes or an hour. It might branch into several tangents before converging on a clear problem statement. That’s intentional. You’re doing the messy, nonlinear thinking that produces good requirements, and you’re doing it before the spec exists rather than discovering gaps during implementation.

This is also where existing requirements documents from business analysts become valuable. In payments organizations, BAs often produce detailed feature requirements independently of any SDD tool: field mapping spreadsheets, business rule matrices, sample ISO 20022 payloads (pain.001, pain.008, pacs.008), regulatory constraint documents, and workflow diagrams. These artifacts are the raw material that explore turns into actionable specs.

Rather than asking the AI to invent requirements from scratch, you feed it the BA’s work:

You: /opsx:explore

AI:  What would you like to explore?

You: Our BA has written up requirements for adding SEPA Instant 
     Credit Transfers. I’m attaching their document which 
     includes the field mappings from our internal format to 
     pacs.008, the business rules for amount limits and BIC 
     validation, and sample XML payloads. I need to understand 
     how this fits into our existing payment orchestration 
     layer and what the technical implications are before we 
     spec the implementation.

AI:  I’ve reviewed the BA requirements document. Let me walk 
     through the key integration points...
     [Maps BA requirements against existing codebase, identifies 
      gaps, flags technical decisions that need to be made]

The explore phase becomes a bridge between the BA’s domain knowledge and the engineering reality of the codebase. The BA doesn’t need to know about your GenServer architecture or your Ecto schema conventions. The developer doesn’t need to memorize the ISO 20022 payload structure. Explore lets both perspectives converge into a proposal that reflects both business intent and technical feasibility.

When you’ve reached clarity, you transition naturally into the structured workflow:

You: OK, the main complexity is in the real-time settlement 
     confirmation flow. The BA’s field mappings look solid 
     but we need to add timeout handling for the 10 second 
     SCT Inst window. Let’s spec that.

You: /opsx:new add-sepa-instant-credit-transfers

AI:  Created openspec/changes/add-sepa-instant-credit-transfers/
     Ready to create: proposal

Now the proposal and specs will be grounded in both the BA’s requirements and the technical understanding you built during exploration, rather than being generated from a one line prompt.

When to use which:

Use /opsx:new when you can describe the feature or fix in a sentence and you’re confident in the scope. Use /opsx:explore when any of the following are true: you’re unsure what the root cause of a problem is, the requirements are ambiguous or underspecified, you need to evaluate multiple approaches before committing to one, or you want to pressure test an idea before investing in formal planning. In practice, we find ourselves using explore more often than we initially expected. The few minutes spent thinking before speccing consistently produce better specs, which in turn produce better code.

The Artifact Lifecycle

When you run /opsx:new add-idempotent-refunds, OpenSpec creates a change directory:

openspec/changes/add-idempotent-refunds/
├── .openspec.yaml          # Metadata: change name, status, timestamps
└── (ready for artifacts)

Running /opsx:ff (or stepping through with /opsx:continue) generates the planning artifacts:

openspec/changes/add-idempotent-refunds/
├── .openspec.yaml
├── proposal.md             # Why we’re doing this, what’s changing, scope
├── specs/                  # Requirements and scenarios (the spec delta)
│   └── refunds/
│       └── spec.md         # Functional requirements with ADDED/MODIFIED/REMOVED markers
├── design.md               # Technical approach, data model, component structure
└── tasks.md                # Ordered implementation checklist

Each of these artifacts has a specific purpose and a specific audience. Let’s look at what goes into them.

proposal.md is the “why” document. It describes the motivation for the change, the scope of what’s included and excluded, and any constraints or dependencies. This is the document you’d share in a planning meeting or attach to a ticket. It answers the question: “Why are we doing this, and what does ‘done’ look like at a high level?” For a refunds feature, this might capture that the driver is duplicate refund incidents costing the business money, that the scope includes full and partial refunds but excludes chargebacks, and that the constraint is backwards compatibility with the existing refund API contract.

specs/ contains the spec delta, the functional requirements for this specific change. Requirements are marked as ADDED, MODIFIED, or REMOVED relative to the current system. Each requirement uses structured language (“The system SHALL...”) with clear acceptance criteria and scenarios. This is where edge cases live. This is where you define what happens when a refund is submitted with the same idempotency key as a previous request, what the system does when the gateway returns a timeout mid refund, or how partial refunds interact with the original transaction’s settlement status.

design.md is the technical blueprint. It covers the data model, API contracts, component architecture, sequence flows, and any technology choices specific to this feature. For the refunds example, it’s where you’d document the idempotency key storage strategy, the state machine transitions for refund lifecycle, and the gateway adapter interface for multi acquirer support.

tasks.md breaks the work into discrete, ordered implementation steps. Each task is small enough to verify independently, ideally something that can be implemented in under 30 minutes. Tasks have clear completion criteria so both the developer and the AI know when they’re done.
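Because tasks.md is a checklist, progress is easy to read mechanically. The sketch below assumes tasks are written as Markdown checkboxes ("- [ ]" and "- [x]"), which is an assumption about the file format, not something OpenSpec guarantees. The sample tasks are invented for the refunds example.

```python
import re

# Assumed format: "- [ ] task" for pending, "- [x] task" for done.
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def task_progress(tasks_md: str) -> tuple[list[str], list[str]]:
    """Split checklist lines into (done, pending); other lines are ignored."""
    done, pending = [], []
    for line in tasks_md.splitlines():
        match = CHECKBOX.match(line)
        if not match:
            continue
        bucket = done if match.group(1).lower() == "x" else pending
        bucket.append(match.group(2).strip())
    return done, pending


# Invented sample content for illustration.
sample = """\
## 1. Idempotency key storage
- [x] 1.1 Add idempotency_key column to refunds
- [ ] 1.2 Reject reuse of a key with a different amount
## 2. Gateway adapter
- [ ] 2.1 Return the stored response on duplicate key
"""

done, pending = task_progress(sample)
print(f"{len(done)} done, {len(pending)} pending; next: {pending[0]}")
```

The same idea works as a quick pre-archive check: if pending is not empty, the change isn't ready.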

What Happens at Archive

When all tasks are complete and verified, /opsx:archive does something important: it merges the spec deltas from the change back into the main openspec/specs/ directory. The change folder moves to openspec/changes/archive/, preserving the history. The main specs now reflect the updated state of the system.

This is the mechanism that turns specs into a living document. After a dozen features have been built and archived, openspec/specs/ contains a comprehensive, up to date description of what the system does. Not what it was designed to do originally, but what it actually does right now.
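Conceptually, the merge is a small state update: ADDED inserts a requirement, MODIFIED replaces one, REMOVED deletes one. Here is a toy sketch of that idea. It is not OpenSpec's implementation; the Delta type, the dict-of-requirements model, and the strictness about missing or duplicate names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    op: str         # "ADDED", "MODIFIED", or "REMOVED"
    name: str       # requirement name, used as the key
    text: str = ""  # new requirement text (unused for REMOVED)


def merge_deltas(main_spec: dict[str, str], deltas: list[Delta]) -> dict[str, str]:
    """Return a new spec with the deltas applied in order."""
    merged = dict(main_spec)  # leave the caller's spec untouched
    for d in deltas:
        if d.op == "ADDED":
            if d.name in merged:
                raise ValueError(f"cannot add existing requirement: {d.name}")
            merged[d.name] = d.text
        elif d.op in ("MODIFIED", "REMOVED"):
            if d.name not in merged:
                raise ValueError(f"cannot change unknown requirement: {d.name}")
            if d.op == "MODIFIED":
                merged[d.name] = d.text
            else:
                del merged[d.name]
        else:
            raise ValueError(f"unknown delta op: {d.op}")
    return merged


main = {"refund-full": "The system SHALL support full refunds."}
change = [
    Delta("ADDED", "refund-idempotency", "The system SHALL deduplicate refunds by idempotency key."),
    Delta("MODIFIED", "refund-full", "The system SHALL support full and partial refunds."),
]
print(sorted(merge_deltas(main, change)))
```

Apply a dozen archived changes in order and the resulting dict is the "current truth", while the list of deltas is the history.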

Who Benefits: SDD Across Roles

One of the underappreciated aspects of spec driven development is that the artifacts aren’t just for the developer writing the code. They create value across every role that touches the project.

For Developers

The immediate benefit is implementation quality. Instead of translating a vague Jira ticket into code via a series of increasingly frustrated prompts, you’re working from a spec that already captures requirements, edge cases, and technical decisions. The AI produces better code because it has better context. You spend less time debugging and reworking because misunderstandings surface during spec review, not during code review.

The longer term benefit is onboarding and maintenance. When you come back to a feature six months later, or when a new developer joins the team, the spec explains not just what the code does but why it was built that way. The proposal captures the business motivation. The design doc captures the technical rationale. The spec captures the behavioral contract.

For Business Analysts and Product Managers

The proposal and spec artifacts are written in structured natural language, not code. A BA or PM can read proposal.md and immediately understand the scope, motivation, and acceptance criteria for a change without needing to parse a pull request.

More importantly, they can contribute to these documents. If the spec says “The system SHALL retry failed direct debit collections up to 3 times” and the BA knows the scheme rules mandate a maximum of 2 retries with specific interval requirements, they can flag that in the spec before any code is written. The spec becomes a shared contract between product and engineering, reviewable by both sides.

BAs in payments organizations often produce detailed requirements documents that exist outside of any development tool: field mapping spreadsheets between internal formats and ISO 20022 messages, business rule matrices for transaction routing, sample payloads for pain.001 or pacs.008 messages, regulatory constraint documents, and scheme specific validation rules. These documents don’t need to be rewritten into OpenSpec format. Instead, they serve as input to the /opsx:explore conversation and as reference material that the proposal and specs can point to. The spec might say “Field mappings follow the BA’s pain.008 mapping document (see docs/ba-requirements/sepa-dd-field-mappings.xlsx)” rather than duplicating that content. OpenSpec captures the engineering requirements; the BA’s documents capture the domain requirements. The two reference each other.

For teams practicing any kind of requirements analysis, the spec delta format (ADDED/MODIFIED/REMOVED) maps naturally to how BAs think about change impact. You can see at a glance exactly what existing behavior is changing and what’s new.

For QA Engineers

The specs are essentially test plans waiting to happen. Each requirement with its acceptance criteria maps directly to test cases. “WHEN a refund is submitted with an idempotency key matching a previously completed refund, THEN the system SHALL return the original refund response without processing a duplicate” is a test case in all but name.
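
As a rough illustration of how directly that requirement translates into a test, here is a minimal JavaScript sketch. The in-memory RefundService, its submitRefund method, and the field names are hypothetical stand-ins invented for this example; they are not part of OpenSpec or of any real payment API.

```javascript
// Hypothetical in-memory refund service, used only to illustrate the spec scenario.
class RefundService {
  constructor() {
    this.completed = new Map(); // idempotency key -> original refund response
    this.processedCount = 0;    // how many refunds were actually processed
  }

  submitRefund({ idempotencyKey, paymentId, amountMinor }) {
    // WHEN the key matches a previously completed refund,
    // THEN return the original response without processing a duplicate.
    if (this.completed.has(idempotencyKey)) {
      return this.completed.get(idempotencyKey);
    }
    this.processedCount += 1;
    const response = {
      refundId: `rf_${this.processedCount}`,
      paymentId,
      amountMinor,
      status: "completed",
    };
    this.completed.set(idempotencyKey, response);
    return response;
  }
}

// The spec scenario expressed as a test case: returns true when it passes.
function testDuplicateRefundReturnsOriginal() {
  const service = new RefundService();
  const request = { idempotencyKey: "key-123", paymentId: "pay_1", amountMinor: 2500 };
  const first = service.submitRefund(request);
  const second = service.submitRefund(request);
  return second === first && service.processedCount === 1;
}
```

The WHEN clause becomes the test setup and the THEN clause becomes the assertion, which is why QA can start drafting cases from the spec before any implementation exists.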

QA can review specs before implementation begins, catching gaps in test coverage at the cheapest possible point in the development cycle. In payments, where edge cases around timeouts, partial failures, and concurrent operations are where bugs hide, having QA eyes on the spec early is especially valuable. They can also use specs to verify completeness: does the implementation actually cover every scenario in the spec? OpenSpec’s /opsx:verify command automates part of this check, but human QA review of the spec itself is where the real value lies.

For Tech Leads and Principal Engineers

The design document is where architectural oversight happens. A principal can review design.md to ensure the proposed approach fits the system’s overall architecture, without needing to wait for a code review to discover that someone introduced a new database table that duplicates an existing one, or bypassed the payment gateway abstraction layer by calling the acquirer API directly.

The proposal document is equally valuable at this level. It provides enough context to make prioritization decisions, estimate impact on downstream systems like settlement and reconciliation, and flag dependencies before work begins.

For organizations running architecture review boards or design review processes, OpenSpec artifacts slot directly into those workflows. The artifacts are markdown in version control, which means they can be reviewed through the same pull request process as code.

For the Whole Team

The openspec/specs/ directory, the living spec that accumulates as changes are archived, becomes something like institutional memory for the project. It captures not just the current state of the system but the evolution of requirements over time. New team members can browse the specs to understand the system. Archived changes provide an audit trail of what changed, when, and why.

This is especially valuable for distributed teams where not everyone is in every meeting. The spec is always available, always current, and always in the repo.

Bridging BA Requirements and Engineering Specs

In most payments organizations, business analysts produce detailed requirements documents long before any developer opens an IDE. These documents are the product of weeks of domain analysis: field mapping spreadsheets that map internal data structures to ISO 20022 message formats like pain.001 or pacs.008, business rule matrices that define transaction routing logic, validation rules derived from scheme specifications, sample XML payloads, regulatory constraint summaries, and workflow diagrams for processes like mandate lifecycle management or chargeback dispute flows.

These artifacts are enormously valuable. They represent concentrated domain knowledge that no AI model has in its training data. But they exist outside of any development tool, usually in Confluence pages, shared drives, or Excel files. The question is: where do they plug into the SDD workflow?

The answer is /opsx:explore.

Rather than asking the AI to invent payment domain requirements from scratch (which it will get wrong in subtle, dangerous ways), you feed the BA’s existing documents into the explore conversation. The AI reads the BA’s field mappings, business rules, and sample payloads, then maps them against your actual codebase. The explore conversation becomes a structured dialogue about feasibility: which business rules can be implemented directly, which ones require architectural changes, where the BA’s requirements conflict with existing behavior, and what edge cases the BA may not have considered from a technical perspective.

You: /opsx:explore

AI:  What would you like to explore?

You: The BA team has finalized requirements for adding SEPA 
     Direct Debit collections. I’m attaching their document 
     which includes the pain.008 field mappings from our 
     internal mandate format, the business rules for collection 
     timing windows, the R-transaction handling matrix (AM04, 
     MS02, MD01 etc.), and sample XML for both first and 
     recurring collections. I need to understand what this 
     means for our payment orchestration layer before we plan 
     the implementation.

AI:  I’ve reviewed the BA’s requirements. Let me walk through 
     the integration points with your existing codebase...
     
     Your mandate schema already covers most of the creditor 
     reference fields, but the BA’s mapping requires three 
     additional fields for the CreditorSchemeIdentification 
     that aren’t in your current model...
     
     The R-transaction handling matrix defines 12 reason codes 
     but your existing rejection handler only supports 4 
     generic rejection states. This will need a more granular 
     status model...

This conversation might reveal that 70% of the BA’s requirements map cleanly to existing patterns and 30% require new design decisions. Those design decisions then flow into the proposal and spec with full context, rather than being invented by the AI from a one sentence prompt.
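
To make the "more granular status model" from that conversation concrete, here is a small JavaScript sketch covering only the three reason codes named above. The names are the ISO 20022 reason names for those codes; the category and retryable values are assumptions made for illustration, and the BA's R-transaction matrix remains the authoritative source for how each code should be handled.

```javascript
// Illustrative only: three of the reason codes mentioned in the transcript.
// `category` and `retryable` are assumed values, not scheme rules.
const R_TRANSACTION_REASONS = {
  AM04: { name: "InsufficientFunds", category: "funds", retryable: true },
  MS02: { name: "NotSpecifiedReasonCustomerGenerated", category: "debtor_refusal", retryable: false },
  MD01: { name: "NoMandate", category: "mandate", retryable: false },
};

// Map a reason code to a granular collection status instead of one generic "rejected" state.
function toCollectionStatus(reasonCode) {
  const reason = R_TRANSACTION_REASONS[reasonCode];
  if (!reason) {
    // Unknown codes are surfaced for manual review rather than silently bucketed.
    return { status: "rejected_unmapped", reasonCode, retryable: false };
  }
  return { status: `rejected_${reason.category}`, reasonCode, retryable: reason.retryable };
}
```

A table like this is the kind of design decision that belongs in design.md, with the spec pointing back to the BA's matrix for the full list of codes.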

The key principle: the BA’s documents are input to the spec, not replaced by it. The OpenSpec proposal can reference them directly (“Field mappings follow the BA’s pain.008 mapping document, see docs/ba-requirements/sepa-dd-field-mappings.xlsx”). The spec captures the engineering interpretation of business requirements, while the BA’s artifacts remain the authoritative source for domain rules. The two complement each other.

For teams with a strong BA function, this workflow turns explore into the most valuable step in the entire process. It’s where domain expertise meets technical reality, and where misunderstandings between product and engineering get caught before they become expensive.

Beyond Epics and User Stories

For years, the standard way to decompose work in software organizations has been the Agile hierarchy: Epics break into Features, Features break into User Stories, User Stories break into Tasks. Each layer adds structure, and each layer adds overhead. Grooming sessions to refine stories. Estimation ceremonies to assign points. Sprint planning to negotiate what fits. Story splitting when something is “too big.” Acceptance criteria written in Given/When/Then format.

This process was designed for a world where humans wrote every line of code, and work needed to be decomposed into pieces small enough for one developer to complete in a sprint. The granularity served a coordination function: if three developers are working on the same feature in parallel, you need clearly bounded units of work to avoid stepping on each other.

With AI agents handling the bulk of code generation, developers now work in significantly larger chunks. A feature that would have been split into 8 user stories with 24 tasks can be described as a single spec and implemented in one session. The AI doesn’t need two week sprints to context switch between stories. It doesn’t need story points to estimate effort. It doesn’t care whether a unit of work is a 3 or a 5. It needs a clear description of what to build and enough context to build it correctly.

The overhead of the old hierarchy was always significant. Ceremonies consume 15-30% of a team’s time. The BA writes detailed requirements and translates them into epics and stories. The tech lead estimates them. The developer re-interprets them during implementation. Each translation step is an opportunity for information loss.

SDD collapses this. Instead of an Epic with 5 Features containing 20 User Stories containing 60 Tasks, you have a proposal that captures the business intent, a spec that defines the requirements, a design that describes the technical approach, and a task list that the AI executes against. The entire chain from “why are we doing this” to “what code gets written” lives in one change folder, reviewable as a single unit. There’s no translation loss and the spec is the shared artifact that all of those roles read, review, and contribute to.

This doesn’t mean you abandon planning. It means the unit of planning shifts from “what can one developer finish in two days” to “what is the next coherent change to the system.” That change might be small (fix a settlement rounding error) or large (add multi currency support). The spec scales to fit either. And because the AI generates the task breakdown from the spec, you don’t need to spend planning meetings debating granularity. The tasks exist to guide the AI’s implementation, not to fill a sprint board.

SDD offers a natural landing spot for a switch to AI-first development. You keep the parts that matter (clear requirements, technical review, incremental delivery) and shed the parts that were only ever coordination scaffolding.

A Practical Workflow: Adding Multi Currency Settlement to a Payment Engine

Let’s walk through a concrete example using OpenSpec to add multi currency settlement support to a payment processing platform.

Step 1: Explore (optional but useful)

If the requirements aren’t fully clear, or if you have BA created documents that need to be digested before planning, start with exploration:

You: /opsx:explore
AI:  What would you like to explore?

You: We need to support settlement in multiple currencies. 
     Right now everything settles in the merchant’s home 
     currency, but we’re onboarding merchants in the EU who 
     need EUR settlement even when the cardholder pays in GBP.
     I’ve attached the BA’s requirements document which includes 
     the field mappings, currency pair rules, and the regulatory 
     constraints for cross border settlement. I need to 
     understand how this fits into our existing reconciliation 
     pipeline and whether our current batch architecture can 
     handle currency grouping.

AI:  I’ve reviewed the BA requirements document. Let me map 
     these against your existing settlement pipeline...
     [Analyzes BA requirements against codebase, identifies 
      integration points, flags technical decisions needed]

This step is for thinking, not committing. No change folder is created yet.

Step 2: Create the change

You: /opsx:new add-multi-currency-settlement

AI:  Created openspec/changes/add-multi-currency-settlement/
     Ready to create: proposal

Step 3: Generate planning artifacts

If you have a clear picture of what to build, fast forward through all planning stages:

You: /opsx:ff

AI:  Creating all planning artifacts...
     ✓ proposal.md
     ✓ specs/
     ✓ design.md
     ✓ tasks.md
     Ready for implementation!

Alternatively, use /opsx:continue to step through one phase at a time, reviewing and refining each artifact before moving to the next.

Step 4: Review and refine the artifacts

This is the most important step. Read the generated proposal, specs, and design. Push back on anything that doesn’t match your intent:

You: The design proposes fetching FX rates at settlement time. 
     We need to lock the rate at authorization and store it on 
     the transaction record. Also the spec is missing the 
     requirement that settlement files must include both the 
     original transaction currency and the settlement currency.

AI:  I’ll update design.md to lock FX rates at authorization 
     and add the dual currency settlement file requirement 
     to the spec.
     [Updates artifacts]
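
The two corrections in that exchange can be sketched in a few lines of JavaScript. This is a minimal illustration under stated assumptions, not what OpenSpec would generate: the function names, the record fields, and the six decimal place rate format are all hypothetical, and both currencies are assumed to use two minor-unit digits (as GBP and EUR do).

```javascript
// Rates are decimal strings scaled to 6 places and handled as BigInt,
// so amounts never pass through floating point.
const RATE_SCALE = 1000000n;

function parseRate(rateString) {
  // "1.17" -> 1170000n
  const [whole, fraction = ""] = rateString.split(".");
  return BigInt(whole) * RATE_SCALE + BigInt(fraction.padEnd(6, "0").slice(0, 6));
}

// Lock the quoted rate at authorization and store it on the transaction record.
function authorize({ id, amountMinor, currency, settlementCurrency }, quotedRate) {
  return {
    id,
    amountMinor, // integer minor units, e.g. pence
    currency,
    settlementCurrency,
    fxRate: quotedRate, // settlement reads this stored value instead of fetching a new rate
    fxRateLockedAt: new Date().toISOString(),
  };
}

// Build one settlement file row carrying both the original and the settlement currency.
function settlementFileRow(txn) {
  const scaled = BigInt(txn.amountMinor) * parseRate(txn.fxRate);
  // Round half up to the nearest minor unit.
  const settlementMinor = (scaled + RATE_SCALE / 2n) / RATE_SCALE;
  return {
    transactionId: txn.id,
    originalCurrency: txn.currency,
    originalAmountMinor: txn.amountMinor,
    settlementCurrency: txn.settlementCurrency,
    settlementAmountMinor: Number(settlementMinor),
    fxRate: txn.fxRate,
  };
}
```

Because the rate lives on the transaction, the settlement batch can be rebuilt later and still produce the same amounts, which is the point of locking at authorization.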

Step 5: Implement

You: /opsx:apply

AI:  Working through tasks...
     ✓ 1.1 Add fx_rate and settlement_currency to transactions schema
     ✓ 1.2 Create FX rate locking module with gateway adapter
     ✓ 1.3 Update authorization flow to capture and persist rate
     ✓ 2.1 Modify settlement batch builder for multi currency grouping
     ✓ 2.2 Update reconciliation to match on settlement currency
     ✓ 2.3 Add settlement file format with dual currency columns
     All tasks complete!

Step 6: Verify

You: /opsx:verify

AI:  Checking implementation against artifacts...
     ✓ All requirements covered
     ✓ Design patterns followed
     ✓ All tasks marked complete
     ⚠ Spec mentions “handle rate expiration gracefully” but 
       current implementation raises an unhandled error when 
       the cached rate is older than 30 minutes. Consider a 
       re-fetch with merchant notification.
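
A fix along the lines the warning suggests might look like the following JavaScript sketch. The 30 minute limit comes from the verify output above; everything else, including the resolveRate name and the injected fetchRate and notifyMerchant callbacks, is a hypothetical shape chosen for illustration. It is kept synchronous for brevity, whereas a real rate lookup would almost certainly be asynchronous.

```javascript
const MAX_RATE_AGE_MS = 30 * 60 * 1000; // 30 minutes, per the verify warning

// `fetchRate` and `notifyMerchant` are hypothetical callbacks supplied by the caller;
// `now` and `fetchedAt` are millisecond timestamps.
function resolveRate(cachedRate, { now, fetchRate, notifyMerchant }) {
  const ageMs = now - cachedRate.fetchedAt;
  if (ageMs <= MAX_RATE_AGE_MS) {
    return cachedRate; // still fresh, use as is
  }
  // Expired: re-fetch instead of raising, and tell the merchant the rate changed.
  const fresh = {
    pair: cachedRate.pair,
    rate: fetchRate(cachedRate.pair),
    fetchedAt: now,
  };
  notifyMerchant({
    pair: cachedRate.pair,
    previousRate: cachedRate.rate,
    newRate: fresh.rate,
  });
  return fresh;
}
```

Passing the clock and the callbacks in as arguments keeps the expiry rule easy to test against the spec scenario without touching a real rate provider.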

Step 7: Archive

You: /opsx:archive

AI:  ✓ Merged specs into openspec/specs/
     ✓ Archived to openspec/changes/archive/2026-02-24-add-multi-currency-settlement/
     Specs updated. Ready for the next feature.

Handling Interrupts and Stopping Mid Implementation

Real work isn’t linear. OpenSpec handles context switches gracefully because your plan lives in the artifacts, not in the AI’s memory.

One important detail: during /opsx:apply, you can stop the AI at any point while monitoring its output. If you see it heading in the wrong direction on task 1.3, you stop it, correct course, and resume. The task list in tasks.md tracks what’s been completed, so nothing about your plan is lost. This is fundamentally different from vibe coding, where stopping mid stream means losing context and hoping you can reconstruct where you were.

The same principle applies to bigger interrupts. You can pause an entire feature to handle something urgent:

You: I’m working on multi-currency-settlement but we have an 
     urgent issue: the nightly batch is double-settling some 
     transactions.

You: /opsx:new fix-duplicate-settlement-batch

AI:  Created openspec/changes/fix-duplicate-settlement-batch/

You: /opsx:ff
     [Plans the fix]

You: /opsx:apply
     [Implements the fix]

You: /opsx:archive
     ✓ Archived fix-duplicate-settlement-batch

You: Let’s get back to multi currency settlement.

You: /opsx:apply add-multi-currency-settlement
AI:  Resuming add-multi-currency-settlement...
     Picking up at task 2.2: Update reconciliation to match 
     on settlement currency...

The multi currency feature picks up exactly where it left off. The artifacts held the plan while you were away, whether that was five minutes or five days.
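
The reason resuming is cheap is that "where was I?" reduces to reading a file. As a minimal JavaScript sketch, assuming tasks.md records progress as markdown checkboxes (for example "- [x] 2.1 ..." for done and "- [ ] 2.2 ..." for pending), finding the resume point could look like this. The nextPendingTask helper is hypothetical and is not an OpenSpec API.

```javascript
// Assumes one task per line in the form "- [x] 1.1 Title" or "- [ ] 1.2 Title".
function nextPendingTask(tasksMarkdown) {
  for (const line of tasksMarkdown.split(/\r?\n/)) {
    const match = line.match(/^\s*- \[( |x|X)\] (\S+) (.+)$/);
    if (match && match[1] === " ") {
      return { id: match[2], title: match[3].trim() };
    }
  }
  return null; // every task is checked off
}
```

Nothing here depends on the AI's conversation history, which is why the gap between sessions can be five minutes or five days.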

When NOT to Use Spec Driven Development

SDD is not the right tool for every change. Here are some cases where the overhead isn’t worth it:

Quick bug fixes. If you know exactly what’s wrong and the fix is a one line change to a gateway timeout value, writing a spec is like filing a building permit to hang a picture frame. Just fix it.

Exploratory prototyping. When you’re trying to figure out what to build, not how to build it, specs slow you down. Vibe coding is genuinely great for rapid exploration. If you’re prototyping a new merchant dashboard layout to see what feels right, just build it iteratively.

Highly visual or interactive work. SDD tools are text based. If your feature is primarily about UI layout, animation, or interaction design, you’ll spend more time describing the visual result in markdown than you’d spend just building it with visual feedback (though pairing SDD with TideWave can work wonders for UI work).

Trivial features. Updating an error message string, renaming a config key, bumping a dependency version. These don’t need a spec. Use your judgment about the complexity threshold.

Rapidly changing requirements. If you’re in a phase where the payment scheme keeps revising the spec and requirements shift weekly, maintaining your own specs becomes overhead that fights against your pace. Get to stability first, then spec the features that need to stick.

The general rule: if you can hold the entire change in your head and verify it by looking at it, you probably don’t need a spec. If the change involves multiple files, multiple concerns, or behavior you can’t verify visually, a spec starts paying for itself.

What to Watch Out For

Having used these tools and studied the experiences of others, I’ve found these to be the main traps:

Spec bloat. The AI loves to generate exhaustive specifications. A feature that would take you 30 minutes to implement can produce 800+ lines of markdown. You have to be disciplined about trimming specs to what’s actually useful. If you’re not reading the spec carefully, it’s worse than not having one because you’ll have false confidence that edge cases are covered when they’re not.

The waterfall trap. SDD can slide into big design up front if you’re not careful and start bundling many features into one spec. If changing the spec feels expensive or bureaucratic, you’ve over formalized. OpenSpec’s fluid workflow helps here since there are no phase gates, but you still need the discipline to keep specs lightweight enough to throw away and rewrite if you find yourself going down the wrong path.

Spec drift. The spec says one thing; the code does another. This happens when you make implementation fixes outside the spec workflow. Either update the spec when you deviate, or accept that the spec is aspirational rather than authoritative. OpenSpec’s /opsx:sync command can help keep specs aligned during long running changes.

The AI ignores its own spec. This is a real and documented problem. Context windows are larger, but that doesn’t mean the AI attends to everything in them equally. People have reported that AI agents generate code that contradicts the spec they just wrote, creating duplicate classes, ignoring constraints, or implementing patterns the spec explicitly avoided. The /opsx:verify step exists specifically to catch this.

Review fatigue. SDD adds a new category of artifact to review. You’re now reviewing specs AND code. If your team doesn’t value spec review as highly as code review, specs become rubber stamped documents that provide an illusion of rigour.

Over application to small changes. The tooling doesn’t scale down well. Applying the full SDD workflow to a minor feature creates overhead that dwarfs the implementation time. You need a personal threshold for when to spec and when to just build.

The Waterfall Question

Every discussion of SDD eventually arrives at the same question: isn’t this just waterfall with better marketing?

The comparison is fair to raise and unfair to leave unexamined. Traditional waterfall failed because of long feedback loops: months of design, months of implementation, and discovery at the end that the design didn’t match reality. The feedback cycle was measured in quarters.

SDD, practiced well, has feedback cycles measured in minutes to hours. You write a spec for a single feature, not an entire system. You review the generated design before implementation starts. You implement in small, verifiable tasks. And critically, changing the spec and regenerating is cheap. The whole point is that code is a derived artifact you can throw away and recreate.

SDD can slide into waterfall like rigidity if you treat specs as immutable, if the spec writing phase becomes its own bottleneck, or if you use SDD as a substitute for iterative discovery. As Gojko Adzic observed, the movement builds on solid intent-first ideas but could reintroduce rigidity if practitioners aren’t thoughtful about it.

The Thoughtworks perspective captures the nuance well: the problems of vibe coding come from being too fast, spontaneous, and haphazard, while the problems of waterfall come from being too slow, rigid, and disconnected from reality. SDD, when practiced well, occupies the middle ground. It provides a mechanism for shorter and more effective feedback loops than either extreme.

The honest answer is that SDD sits on a spectrum. At one end, you have “spec as lightweight sketch,” a quick outline that gives the AI direction without constraining it. At the other end, you have “spec as source of truth,” a comprehensive document that the code must conform to. OpenSpec’s fluid approach leans toward the lighter end of that spectrum, which is why it appeals to teams who want discipline without ceremony.

Pros and Cons

What SDD Gives You

Reduced rework. Catching misunderstandings at the spec level is dramatically cheaper than catching them in code. When a BA’s field mapping is wrong, you want to discover that while reviewing a proposal, not while debugging a failed settlement file at 2 AM.

Persistent context. Specs survive session boundaries, tool switches, and team changes. Six months from now, when someone asks why the FX rate locking works the way it does, the spec and its proposal explain both the what and the why.

Reviewable intent across roles. You can review a spec without reading any code. Product managers, BAs, QA, and principals can participate in spec review and catch requirement gaps before implementation begins. In a payments context, this means compliance can review the spec for regulatory alignment without needing to read Elixir.

What SDD Costs You

Time upfront. Writing and reviewing specs takes time that vibe coding doesn’t require. For simple tasks, this overhead is pure cost with minimal benefit.

False precision. Detailed specs can create an illusion of completeness. Just because the spec covers edge cases on paper doesn’t mean the AI will implement them correctly. You still need to test.

Tool immaturity. These tools are all early stage. Expect rough edges, breaking changes, and workflow gaps. The ecosystem is moving fast, which means today’s best practices may be obsolete in six months.

Where This Is Heading

Spec driven development is less than a year old as a named practice, and the tooling is evolving fast. The fundamental insight, that AI agents produce better code when given structured intent rather than ad hoc prompts, seems durable even if the specific tools don’t survive.

What’s interesting is the convergence. BDD (Behavior Driven Development), TDD (Test Driven Development), and now SDD all share the same DNA: define the desired behavior before writing the implementation. SDD is that idea adapted for a world where the implementer is an AI agent rather than a human developer.

The open question is whether specs will remain the domain of dedicated tools, or whether this discipline gets absorbed into the AI coding tools themselves. We’re already seeing Cursor, Claude Code, and Copilot add planning and multi step reasoning capabilities that accomplish some of what SDD tools do, without the explicit spec writing step.

For now, the practical takeaway is simple: if you’re doing anything more complex than a quick prototype with AI coding tools, some form of structured planning, whether you call it SDD or just “thinking before prompting,” will produce better results than vibing your way through it. The tools can help enforce that discipline, but the discipline itself is what matters.

The spec isn’t the point. The thinking is.

It's a Great Time to be a Software Engineer

Zarar Siddiqi — Wed, 07 Jan 2026 02:02:56 GMT

Published first here.

Here are some thoughts on AI development based on my experience of the last two years. As with any list, these are in no particular order.

Get excited. AI is only coming for your job if you treat it as an optional part of your job. It’s here to help you become a better and more efficient software engineer. Embrace it wholeheartedly just like you embraced IDEs in favour of text editors. Using AI doesn’t make you a lesser programmer and not using it doesn’t make you special in any way. In fact, not using it or resisting it makes you look out-of-touch. This is what you have been waiting for to love your job again, and it just might remind you that you got into this business because it feels great to create things, not necessarily code things.
Most code (upwards of 80%) should be AI generated at this point. If it’s not, there is something inherently flawed about your workflow. Just put your pride aside, and acknowledge that AI is a better programmer than you. Your coding skills are now worth little, but your software engineering skills are worth a lot more. Invest in the latter, don’t cling on to the former. AI code is still “your” code so you can take the same pride in it as you did before. You just learned how to type faster. A lot faster!
SRP, DRY, SOLID and clean design/code should be the focus of the programmer. Guiding AIs to get these right requires understanding the business context in which the software is being used, which AI doesn’t know. How a feature is expected to change in the future, and what trade-offs need to be made there is something you need to be an expert at. Do I create a new module? Is this method named appropriately? Is it taking too many parameters? Am I violating Demeter’s Law? Is this file getting too big? Should I separate these two concerns? What would make this more reusable? These are the decisions you should be spending time on. This requires understanding the product more than you needed to in the past. You’re not only a Software Engineer, you’re a Product Engineer, and that requires a deep understanding of something you may have ignored in the past.
Context management (or engineering) is where efficiencies are to be gained. If you find yourself repeating things to a forgetful AI, then that’s a problem to be solved. Simple solutions include Claude Skills and more sophisticated ones include using Beads. Your workflow should be constantly “saving” things to memory to make you more efficient. Sometimes I find myself frustrated by having to remind Claude that it needs to “do X first when it’s doing Y” - those rules should be codified. Don’t treat AGENTS.md or any other instruction file as a static document or it’ll waste your time. How to manage your own context (and your team’s) is something to dedicate time to. If you work in a large company, this is an especially interesting challenge as you have to balance alignment and autonomy, hard rules and guidelines, etc.
Everyone should read a book where you build an LLM from scratch. It’s going to be painful and, like me, you’re probably going to have to re-read chapters just to get it through your head (I did, many times), but when it clicks, you’ll be better off for it. Though chances are you’ll never develop your own LLM and will probably use a frontier model most of the time, it helps to know how things work under the hood. You’ll need to tweak model parameters at some point in your career, and having this foundational knowledge will be the difference between winging it and knowing what you’re doing.
Code review is the new bottleneck. The good news is that we already have tools popping up that make this easier (e.g., Code Rabbit). For reviewing code locally, multi-agent workflows work great. A separate agent contextualized to review code for correctness, security, etc., with its own rules and guidelines, is easy to implement, e.g. claude-code review --aspect "correctness" src/ > /tmp/review_correctness.md. If you’re not using multi-agent workflows, this is an easy place to start. Here are a couple of other candidates: 1) an agent dedicated to writing good commit messages based on git diff, 2) a test refactoring agent which gets invoked to clean up tests; shoving test clean up rules into the “development” context may be too much, so a separate focused agent will work better.
There is no excuse not to have clean code. Refactoring is cheap, writing tests is cheaper. If you have code that’s not clean, generate higher-level tests for it, and then ask the agent to refactor. The tests will serve as your guiding light on whether something went wrong. This is especially valuable in brownfield codebases where changes are the riskiest. Having dedicated workflows to “clean up code” is another example of easy to implement multi-agent workflows.
Documentation is free. Whether it be inline code documentation, architectural diagrams or Correction of Error analysis, what used to take days now takes minutes. There is simply no excuse not to have comprehensive and up-to-date documentation, both from a product and engineering point of view. Not only should your code describe what it does where clarity is needed, it should also indicate the business rules behind it (whether it be inline or linked to external docs). A programmer reading the code should have a single point of entry to understand both the design decisions and the context in which the customer is using it.
Cost optimization is now part of software engineering. Not every task needs Claude Opus, and knowing when to delegate to cheaper AIs is a skill. Even better, a free one like Qwen Code should be installed locally for simple tasks and basic CRUD operations (which is about 90% of all development). Complex refactoring with business context is worth the Opus pricing. You should have mental models about which model to reach for given the problem at hand. Track your AI costs per feature just like you’d track compute costs on AWS so you can optimize your workflow and not just the code. Running expensive models on trivial tasks is wasteful and unprofessional.
High-Level System Design is where you are needed. AI will crush implementation details but architectural decisions require human judgment that understands business constraints, team capabilities, and long-term maintenance burden. You need to get better at system design, understanding trade-offs between different architectural patterns, and making decisions that account for factors AI can’t know - like the fact your team hates microservices or that you’re planning to acquire a company next quarter. This is where your value multiplies.

Scope discipline when AI makes building fast

Zarar Siddiqi — Tue, 23 Dec 2025 16:00:39 GMT

I needed to show users contextual messages, e.g., banners for announcements, modals for important actions, tours for onboarding. I already use PostHog for analytics and PostHog allows the user to create apps which can provide this functionality while being tightly integrated with their analytics capabilities.

But I built my own system instead. Here’s why, and why the hardest part wasn’t building but knowing when to stop.

The Business Problem

I run an event management platform. Venues use it to sell tickets, manage events, run marketing campaigns, create ads etc. The product has grown sophisticated. New features ship, but users don’t find them.

This isn’t a documentation problem as users don’t read docs. It’s not an email problem either. Announcement emails get 20% open rates if you’re lucky. The features exist and users need them. They just don’t know they’re there.

The real problem breaks down into specifics:

Feature discovery. I ship something new, maybe a better way to handle refunds or a new analytics dashboard. The users who would benefit most never click on it because they don’t know it exists.

Contextual nudges. A user logs in through SSO but hasn’t set a password. That’s fine for now, but if SSO breaks they’re locked out. I want to prompt them to set a password, but only when it’s relevant, not in an email they’ll ignore.

Onboarding flows. New users need guidance. Not a wall of text. Step by step tours that show them where things are. “Click here to create your first event. Now add tickets. Now publish.”

Multi-tenant complexity. This isn’t a simple user model. I have accounts (the venue), users within accounts (staff), and customers (ticket buyers). A message might be relevant to one account but not another. Dismissing a message as one user shouldn’t dismiss it for your colleague.

Non-intrusive UX. Whatever I build needs to be easy to dismiss. Remember that the user dismissed it. Not show it again. Respect their attention.

These requirements shaped everything that followed.

Why PostHog Was a Serious Candidate

PostHog was the obvious choice since I already use it and it provides a way to track user behavior. It has capabilities you can extend for this kind of thing, and the AI was quick to suggest building a PostHog custom app which extended its core features to deliver the following using these approaches:

Surveys work as modal-style messages. Create a popover or modal, target by URL or user properties, built-in dismiss tracking. No code needed for basic use cases.

Feature flags with payloads could drive banner content. The flag controls who sees the message. The payload contains the content. Evaluate client-side and render.

const flag = posthog.getFeatureFlag('welcome-banner')
const payload = posthog.getFeatureFlagPayload('welcome-banner')
// payload: { title: "Welcome!", content: "...", style: "info" }

Site Apps let you write custom JavaScript that runs in PostHog’s context. Full control if surveys and flags aren’t enough.

I seriously considered this path. PostHog handles targeting UI, cohort management, percentage rollouts. That’s real value. The code was also already there and it was tempting to go with this but a pause was needed.

Why PostHog Didn’t Work

The problems started when I mapped PostHog’s features to my actual requirements.

No dismissal tracking for feature flags. Surveys track dismissals automatically. Feature flags don’t. If I use flags for banners, I’d need to:

Send a custom event when user dismisses: posthog.capture('message_dismissed', { message_id: 'xyz' })
Create a cohort of users who have that event
Exclude that cohort from the feature flag
Repeat for every message

That’s a lot of manual cohort management. It gets messy fast.

No per-account targeting. PostHog targets users, not accounts. My multi-tenant model needs messages scoped to specific accounts. User A on Account X dismisses a message. User A on Account Y should still see it. User B on Account X should also still see it.

PostHog would require setting account_id as a user property, then creating cohorts per account. That doesn’t scale to hundreds of accounts.
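
The scoping rule described above can be sketched in a few lines of plain Elixir. This is only an illustration of the requirement, not PostHog's API or the final system: the DismissalSketch module, its function names, and the tuple key are assumptions, and an in-memory MapSet stands in for real storage.

```elixir
defmodule DismissalSketch do
  @moduledoc "Hypothetical sketch: dismissals keyed by user, account, and message."

  # Record that this user dismissed this message while acting in this account.
  def dismiss(dismissals, user_id, account_id, message_id) do
    MapSet.put(dismissals, {user_id, account_id, message_id})
  end

  # A message stays visible unless this exact user/account pair dismissed it.
  def visible?(dismissals, user_id, account_id, message_id) do
    not MapSet.member?(dismissals, {user_id, account_id, message_id})
  end
end

dismissals = DismissalSketch.dismiss(MapSet.new(), "user_a", "account_x", "msg_1")

# User A on Account X dismissed it, so it is hidden there.
IO.inspect(DismissalSketch.visible?(dismissals, "user_a", "account_x", "msg_1")) # false
# The same user on another account, and a colleague on the same account, still see it.
IO.inspect(DismissalSketch.visible?(dismissals, "user_a", "account_y", "msg_1")) # true
IO.inspect(DismissalSketch.visible?(dismissals, "user_b", "account_x", "msg_1")) # true
```

Keying on all three values is what makes a user-only cohort model awkward: the dismissal belongs to the pair, not to the user.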

No path-based targeting for feature flags. Surveys can target by URL. Feature flags can’t. I’d need to check the path client-side:

if (window.location.pathname.startsWith('/dashboard')) {
  const flag = posthog.getFeatureFlag('dashboard-message')
}

That works, but now I’m writing conditional logic in JavaScript for every message. The targeting that should be configuration becomes code.

No tours. PostHog has nothing like driver.js. No step-by-step walkthroughs. I’d integrate driver.js separately and use feature flags to control when tours trigger. At that point I’m building half the system myself anyway.

Server-side control. My app is Phoenix LiveView. I’ve worked hard to keep logic server-side. Adding PostHog’s JavaScript SDK for messaging means rendering decisions happen in the browser. State lives in two places. Debugging gets harder.

The dependency question. PostHog is great today. But SaaS products change pricing, get acquired, pivot. Messaging is core infrastructure for my product. If PostHog changed their pricing model or discontinued a feature, I’d need to rebuild under pressure. Owning it from the start avoids that risk.

My decision: custom for system messages and tours, PostHog for surveys and A/B tests where their tooling genuinely adds value. Hybrid approach and the right tool for the job.

This was an example of AI confidently providing very reasonable-sounding options. Without someone sitting down and mapping the requirements to the solution with a view to long-term maintenance, one could easily have gone down a reasonable but dangerous path. In my view, software engineering knowledge is key, and without it chaos will ensue. It’s why I’m genuinely liking the Product Engineer title which is being thrown around. Maybe more on that some other time.

Brainstorming with AI

I used Claude to brainstorm the system. This is where things got dangerous.

The initial ideas list was ambitious:

Snooze and remind later with configurable intervals
Message dependencies (show X only after Y is dismissed)
Role-based targeting
Feature flag integration
Time-bounded messages with start and end dates
Event-triggered auto-dismissal
Multiple dismiss behaviors (close, complete, snooze)

AI is good at generating possibilities. Too good. Every feature seemed reasonable and each one addressed a real use case. The ideas kept expanding the deeper I went into exploration.

Here’s the thing about AI-assisted development: it makes building fast. Features that would take days take hours and even though this sounds like a benefit, it’s actually a trap.

When building is slow, you naturally filter ideas. “That would take a week” is a forcing function for scope and you have a tendency to avoid it. When building is fast, that filter disappears and it becomes an extra hour. Suddenly, you realize that you’ve been going back-and-forth on things you may not need because it’s so easy to cover this case and that case.

I had to keep asking myself one question: Is this the problem that I wanted to solve when I started this feature?

The problem was users not knowing about new features. That’s it. Show a message. Let them dismiss it. Everything else is optimization for problems I don’t have yet.

The ideas list is just that, a bunch of half-thought ideas, not a backlog. Not a roadmap. Not a commitment. When someone says “you should add snooze functionality” I can say “that’s on the ideas list” without it meaning anything about when or if I’ll build it.

Backlogs should be short. If it’s not solving the current problem it doesn’t belong there.

The Temptation of Clean Code

The hardest moment wasn’t writing features, but trying to figure out the code I was going to delete even though it worked. As an example, Claude had generated a complete implementation of snooze functionality. The schema changes were done, the UI was wired up and test cases were written as well. The code was clean:

defp handle_messaging_event("snooze_message", %{"message-id" => id, "days" => days}, socket) do
  scope = socket.assigns[:current_scope]
  account_id = socket.assigns[:messaging_account_id]
  snooze_until = DateTime.add(DateTime.utc_now(), String.to_integer(days), :day)

  Messaging.snooze_message(id, scope, account_id, snooze_until)

  updated_messages = remove_message_from_assigns(socket.assigns.app_messages, id)
  {:halt, assign(socket, :app_messages, updated_messages)}
end

It looked good and it worked. I was so close to having this capability and was tempted to just commit it and move on.

But I didn’t need it, at least not right now and just because the code is in front of you doesn’t mean you have to take it. That’s the real danger of AI-assisted development. It’s not that the code is bad. Often it’s excellent. The danger is that good code is seductive. You want to keep it. You rationalize why you might need it someday. You commit it “just in case.”

This is the moment where you have to view clean code as technical debt, not because there’s something wrong with the code, but because it adds to your maintenance burden without an immediate benefit. The potential benefit is in the future and once you discount it by time, it becomes a bad idea to keep it. Especially when it was so easy to generate in the first place.

MVP Scope

What I kept:

Banners and modals (two display types cover most cases)
Path-based targeting (show message only on certain pages)
Function-based targeting (custom Elixir rules for complex conditions)
Per-user and per-account dismissals (remember who saw what)

What I cut:

Snooze and remind later (just dismiss it)
Role-based targeting (function rules can check roles if needed)
Message dependencies (not needed for announcing features)
Feature flag integration (can add later)
Time-bounded messages (same)

The JSON targeting field was my escape hatch. It means I can add role targeting, feature flags, and dependencies later without database migrations. I’m not saying no to these features. I’m saying not yet. And “not yet” is fine because none of them solve the problem of telling users about new features.
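
To make the escape hatch concrete, here is a minimal sketch of how a JSON-style targeting map could be evaluated. The TargetingSketch module and the "paths" key are assumptions for illustration, not the actual schema. The point is that keys the evaluator doesn't know about are simply ignored, so new targeting types can be stored before any code understands them.

```elixir
defmodule TargetingSketch do
  @moduledoc "Hypothetical sketch: decide visibility from a JSON-style targeting map."

  # Only the "paths" key is understood here. Any other key is ignored, which is
  # what lets new targeting types appear in the stored JSON without a migration.
  def visible?(targeting, current_path) when is_map(targeting) and is_binary(current_path) do
    case Map.get(targeting, "paths") do
      nil -> true
      [] -> true
      prefixes -> Enum.any?(prefixes, &String.starts_with?(current_path, &1))
    end
  end
end

IO.inspect(TargetingSketch.visible?(%{"paths" => ["/dashboard"]}, "/dashboard/events")) # true
IO.inspect(TargetingSketch.visible?(%{"paths" => ["/dashboard"]}, "/settings"))         # false
# A key this version doesn't know about is ignored rather than rejected.
IO.inspect(TargetingSketch.visible?(%{"roles" => ["admin"]}, "/settings"))              # true
```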

Implementation

Two patterns carry the system.

Global event handling with attach_hook. This is where AI genuinely helped. I needed dismiss events to work from any LiveView without adding handlers to each one. Claude suggested attach_hook for handle_event:

def on_mount(:default, _params, session, socket) do
  socket =
    if connected?(socket) do
      socket
      |> assign(:app_messages, %{})
      |> assign(:messaging_account_id, session["account_id"])
      |> attach_hook(:messaging_params, :handle_params, &handle_messaging_params/3)
      |> attach_hook(:messaging_events, :handle_event, &handle_messaging_event/3)
    else
      socket
    end

  # on_mount callbacks must return {:cont, socket} or {:halt, socket}
  {:cont, socket}
end

defp handle_messaging_event("dismiss_message", %{"message-id" => message_id}, socket) do
  scope = socket.assigns[:current_scope]
  account_id = socket.assigns[:messaging_account_id]

  if scope && account_id do
    Messaging.dismiss_message(message_id, scope, account_id)
  end

  updated_messages = remove_message_from_assigns(socket.assigns.app_messages, message_id)
  {:halt, assign(socket, :app_messages, updated_messages)}
end

defp handle_messaging_event(_event, _params, socket), do: {:cont, socket}

The elegance here is the fallthrough. Events this hook doesn’t care about return {:cont, socket} and flow to the LiveView’s own handlers. Events it does handle return {:halt, socket} and stop there. One module handles all messaging events globally. Individual LiveViews don’t know messaging exists.

This pattern is powerful. Add the on_mount to a live_session and every page in that session gets messaging. No changes to individual LiveViews. No copy-paste of event handlers. The layout renders banners and modals from @app_messages. The hook handles dismissals. Everything works.
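
The halt/continue contract is easier to see outside of LiveView. The sketch below is a plain-Elixir model of it under stated assumptions: HookChainSketch, the hook function, and the view handler are all hypothetical, and this is not how LiveView implements attach_hook internally.

```elixir
defmodule HookChainSketch do
  @moduledoc "Hypothetical model of halt/continue dispatch. Not LiveView's implementation."

  # Runs each hook in order. A hook returning {:halt, state} stops the chain.
  # {:cont, state} passes the event to the next hook and, if no hook halted,
  # finally to the view's own handler.
  def dispatch(hooks, view_handler, event, params, state) do
    result =
      Enum.reduce_while(hooks, {:cont, state}, fn hook, {:cont, acc} ->
        case hook.(event, params, acc) do
          {:halt, new_state} -> {:halt, {:halt, new_state}}
          {:cont, new_state} -> {:cont, {:cont, new_state}}
        end
      end)

    case result do
      {:halt, final_state} -> final_state
      {:cont, final_state} -> view_handler.(event, params, final_state)
    end
  end
end

# The messaging hook only cares about dismissals and lets everything else through.
messaging_hook = fn
  "dismiss_message", %{"message-id" => id}, state -> {:halt, Map.delete(state, id)}
  _event, _params, state -> {:cont, state}
end

view_handler = fn event, _params, state -> Map.put(state, :last_view_event, event) end
state = %{"msg_1" => "Welcome!"}

# Handled by the hook: the view handler never runs.
IO.inspect(HookChainSketch.dispatch([messaging_hook], view_handler, "dismiss_message", %{"message-id" => "msg_1"}, state))
# Not a messaging event: falls through to the view handler.
IO.inspect(HookChainSketch.dispatch([messaging_hook], view_handler, "save", %{}, state))
```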

Function-based rules. This is where the power lives. I’m in a functional programming environment so everything should boil down to a simple function call with no side effects:

def evaluate(func_name, scope, _account_id, _message_id) do
  case func_name do
    "has_no_password" -> has_no_password(scope)
    _ -> :show
  end
end

def has_no_password(scope) do
  case scope do
    %{unified_user: %{hashed_password: nil}} -> :show
    _ -> :hide
  end
end

Adding a new rule means adding a function. The function receives context (user, account, message) and returns :show or :hide. No configuration files. No complex DSL. Just functions.

Need to check if a user has created events? Write a function. Need to check subscription tier? Write a function. The MessageRules module becomes a library of predicates that the messaging system evaluates.

Lessons

Right tool for the right job. I didn’t replace PostHog entirely. I use it for surveys and A/B tests where their tooling adds value. For core messaging that touches my domain model, custom won. The hybrid approach means I get the best of both.

AI makes scope discipline harder, not easier. Claude will build whatever you ask for. That’s the problem. The speed removes the natural friction that used to prevent scope creep. You have to actively ask: is this the current problem I’m solving? If not, it goes on the ideas list, not the backlog.

Good code is seductive. The hardest code to delete is code that works. AI generates clean, functional implementations quickly. The temptation is to keep them. Resist. Just because it’s in front of you doesn’t mean you have to take it.

Technical patterns matter. The on_mount hook and attach_hook pattern solved a hard problem: how do you add cross-cutting behavior to every LiveView without modifying each one? Understanding LiveView’s lifecycle deeply made this possible.

Functions over configuration. In a functional language, the most flexible targeting system is just a function. No complex rule engines. No JSON DSL. Just scope in, show or hide out. Everything else is syntax sugar over that core idea.

Embedding-Based Tool Selection for AI Agents

Zarar Siddiqi — Sun, 21 Dec 2025 17:36:56 GMT

When I first built our AI assistant, it had five tools. Look up an order. Process a refund. Check ticket availability. Simple stuff. Fast forward six months and we’re at nearly 40 tools spanning orders, events, marketing campaigns, contests, and customer management.

The problem became obvious during a routine cost review: we were burning thousands of tokens on every single request just describing tools the model would never use. Someone asks “What time does my show start?” and we’re sending the full spec for process_refund, create_email_campaign, and manage_contest_prizes. Wasteful.

The Tool Explosion Problem

Each tool definition isn’t trivial. You need a name, a description detailed enough for the LLM to understand when to use it, and parameter specifications with types and constraints. Here’s what one looks like in our codebase:

%ToolDefinition{
  name: "process_refund",
  description: """
  Process a refund for a specific order. Validates the refund amount
  against the original order total and available balance. Requires
  order_id from get_order_details. Returns confirmation with refund ID.
  """,
  parameters: [
    %{name: "order_id", type: :string, required: true},
    %{name: "amount", type: :number, required: true},
    %{name: "reason", type: :string, required: false}
  ],
  handler: {RefundsRegistry, :handle_process_refund},
  category: :refunds
}

Multiply by 40 and you’re looking at 3,000+ tokens before the user even says anything. The costs add up, latency increases, and here’s the kicker: having too many tools actually makes the model worse at picking the right one. More noise, more confusion.

Semantic Selection with Embeddings

The fix is conceptually simple. Instead of sending every tool on every request, we embed all tool descriptions into vectors and store them in Postgres using pgvector. When a query comes in, we embed it too, then find the 5-10 most semantically similar tools using cosine distance.

The query “refund order #12345” gets embedded, compared against all tool embeddings, and returns process_refund, calculate_refund_amount, get_order_details. We send only those to the LLM.

This cuts our tool payload by 75-90% on most requests. The model sees fewer, more relevant options and picks better.
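
For intuition, the selection step can be modelled in memory. This is a sketch only: in the real system the comparison happens inside Postgres, and the SimilaritySketch module, the tiny two-dimensional vectors, and the 0.6 distance bound below are made up for illustration. It assumes non-zero vectors of equal length.

```elixir
defmodule SimilaritySketch do
  @moduledoc "Hypothetical in-memory model of top-K selection by cosine distance."

  # Cosine distance is 1 minus cosine similarity. Assumes non-zero vectors.
  def cosine_distance(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    1.0 - dot / (norm.(a) * norm.(b))
  end

  # Keep tools within max_distance of the query, closest first, at most k of them.
  def top_k(query_vec, tool_vecs, k, max_distance) do
    tool_vecs
    |> Enum.map(fn {name, vec} -> {name, cosine_distance(query_vec, vec)} end)
    |> Enum.filter(fn {_name, dist} -> dist <= max_distance end)
    |> Enum.sort_by(fn {_name, dist} -> dist end)
    |> Enum.take(k)
    |> Enum.map(fn {name, _dist} -> name end)
  end
end

# Toy embeddings: real ones have hundreds of dimensions.
tools = [
  {"process_refund", [1.0, 0.0]},
  {"get_order_details", [0.8, 0.6]},
  {"create_email_campaign", [0.0, 1.0]}
]

IO.inspect(SimilaritySketch.top_k([1.0, 0.1], tools, 2, 0.6))
# ["process_refund", "get_order_details"]
```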

Choosing an Embedding Provider

We debated two main approaches: calling OpenAI’s embedding API or running our own model.

OpenAI’s text-embedding-3-small is the path of least resistance. It’s a REST call, returns 1536-dimensional vectors, costs about a thousandth of a cent per embedding, and just works. The semantic understanding is excellent. The downside is the external dependency. Every query needs a network round-trip, your data touches their servers, and you’re subject to their rate limits and outages.

Running something like ModernBERT locally is appealing for different reasons. Zero marginal cost, sub-millisecond latency since there’s no network hop, and complete data privacy. But now you’re managing infrastructure. You need a server running the model, monitoring, scaling considerations, and you’re on the hook for model selection and updates. For a small team, that operational burden is real.

There’s also a hybrid approach: use OpenAI in production for reliability, run a local model in development and testing to avoid API costs and flakiness. We built our system with a provider abstraction to make this possible:

defmodule Amplify.EmbeddingProvider do
  @callback generate_embedding(String.t()) :: {:ok, list(float())} | {:error, any()}
  @callback dimensions() :: pos_integer()
  @callback model_id() :: String.t()
end

Switching providers is a config change. The abstraction cost an extra hour upfront but buys flexibility later.
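
As a rough sketch of what "a config change" can look like, the example below pairs a behaviour with a deterministic stand-in provider and resolves the active one from application config. Everything under the Sketch namespace, the :sketch_app config key, and the fake vectors are assumptions for illustration and not the actual Amplify code.

```elixir
defmodule Sketch.EmbeddingProvider do
  @moduledoc "Hypothetical behaviour plus a config-driven lookup of the active provider."

  @callback generate_embedding(String.t()) :: {:ok, list(float())} | {:error, any()}
  @callback dimensions() :: pos_integer()
  @callback model_id() :: String.t()

  # The active provider is whatever module the config names.
  def impl, do: Application.get_env(:sketch_app, :embedding_provider, Sketch.FakeProvider)
end

defmodule Sketch.FakeProvider do
  @moduledoc "Deterministic stand-in for development and tests. The vectors carry no meaning."
  @behaviour Sketch.EmbeddingProvider

  @impl true
  def generate_embedding(text) when is_binary(text) do
    # Same text always yields the same vector, so tests stay repeatable.
    seed = :erlang.phash2(text, 1000)
    {:ok, for(i <- 1..dimensions(), do: rem(seed + i, 100) / 100)}
  end

  @impl true
  def dimensions, do: 8

  @impl true
  def model_id, do: "fake-test-model"
end

# Switching providers is then a single config entry.
Application.put_env(:sketch_app, :embedding_provider, Sketch.FakeProvider)

provider = Sketch.EmbeddingProvider.impl()
{:ok, vector} = provider.generate_embedding("refund order #12345")
IO.inspect({provider.model_id(), length(vector)})
# {"fake-test-model", 8}
```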

Why We Went With OpenAI

For our volume, OpenAI was the obvious choice. We process hundreds of queries daily, not millions. At $0.00001 per embedding, we’re talking pennies per month. The reliability is excellent, the semantic quality is strong for our e-commerce domain, and there’s zero infrastructure to manage.

If we were processing millions of queries or had strict data residency requirements, the calculus would be different. But for a small team running a ticketing platform, paying a few cents to avoid running another service is a good trade.

Generating Embeddings in Development

Adding a new tool or updating an existing one means regenerating embeddings. In development, it’s a mix task:

mix generate_tool_embeddings

This iterates through all tool definitions, calls OpenAI for each, and upserts the results into the tool_embeddings table. Takes about 10 seconds for 40 tools. The task is idempotent so you can run it whenever.

The implementation is straightforward. We convert each ToolDefinition to embedding text that captures the name, description, and parameter info, then store the vector alongside the tool name and model ID.
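
The conversion to embedding text might look something like the sketch below. The exact format is an assumption, as is the EmbeddingTextSketch module. What matters is that name, description, and parameters all end up in the string that gets embedded.

```elixir
defmodule EmbeddingTextSketch do
  @moduledoc "Hypothetical sketch: flatten a tool definition into text for embedding."

  def to_text(%{name: name, description: description, parameters: params}) do
    param_text =
      Enum.map_join(params, ", ", fn p ->
        requirement = if p.required, do: "required", else: "optional"
        "#{p.name} (#{p.type}, #{requirement})"
      end)

    # Trim the description because heredocs end with a trailing newline.
    "#{name}: #{String.trim(description)} Parameters: #{param_text}"
  end
end

tool = %{
  name: "process_refund",
  description: "Process a refund for a specific order.\n",
  parameters: [
    %{name: "order_id", type: :string, required: true},
    %{name: "reason", type: :string, required: false}
  ]
}

IO.puts(EmbeddingTextSketch.to_text(tool))
# process_refund: Process a refund for a specific order. Parameters: order_id (string, required), reason (string, optional)
```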

Generating Embeddings in Production

For production, we built a simple admin page. Navigate to the AI operations screen, see the current embedding count, click a button to regenerate. Non-technical team members can trigger it after tool updates without touching the console.

The alternative is shelling into the production console:

Amplify.Services.ToolSelector.regenerate_embeddings()

Either way, regeneration is safe to run anytime. It deletes existing embeddings and creates fresh ones. The whole process takes seconds.

One gotcha: if you ever switch embedding providers, you must regenerate everything. OpenAI’s 1536-dimension vectors are incompatible with a local model’s 768-dimension vectors. We store model_id with each embedding to catch mismatches and make debugging easier.

Handling Multi-Step Operations

Pure similarity search has a gap. If someone says “refund order #12345”, we’ll find process_refund. But the LLM also needs get_order_details to look up the order before it can refund anything. Those two tools aren’t semantically similar enough to both appear in the top results.

We solved this with category expansion. Each tool has a category like :orders, :refunds, or :events. When we select tools via similarity, we expand to include related categories:

@category_expansions %{
  orders: [:orders, :refunds, :customers],
  refunds: [:orders, :refunds, :payments],
  events: [:events, :tickets]
}

So finding process_refund (category :refunds) automatically pulls in order lookup tools. The LLM gets everything it needs for multi-step workflows.
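
A minimal sketch of the expansion step, assuming it pulls in every tool whose category appears in the expansion list of a selected tool. The CategoryExpansionSketch module and the simplified tool maps are hypothetical, and the real implementation may cap or rank the expanded set differently.

```elixir
defmodule CategoryExpansionSketch do
  @moduledoc "Hypothetical sketch: widen a similarity result to related categories."

  @category_expansions %{
    orders: [:orders, :refunds, :customers],
    refunds: [:orders, :refunds, :payments],
    events: [:events, :tickets]
  }

  # selected_names comes from the similarity search. all_tools is the registry.
  def expand(selected_names, all_tools) do
    categories =
      all_tools
      |> Enum.filter(&(&1.name in selected_names))
      # Categories without an expansion entry just map to themselves.
      |> Enum.flat_map(&Map.get(@category_expansions, &1.category, [&1.category]))
      |> MapSet.new()

    all_tools
    |> Enum.filter(&MapSet.member?(categories, &1.category))
    |> Enum.map(& &1.name)
  end
end

tools = [
  %{name: "process_refund", category: :refunds},
  %{name: "get_order_details", category: :orders},
  %{name: "create_email_campaign", category: :marketing}
]

IO.inspect(CategoryExpansionSketch.expand(["process_refund"], tools))
# ["process_refund", "get_order_details"]
```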

The pgvector Query

For those curious about the database side, here’s the actual query we run:

SELECT name, 1 - (embedding <=> $1) as similarity
FROM tool_embeddings
WHERE (embedding <=> $1) <= $3
ORDER BY embedding <=> $1
LIMIT $2

The <=> operator is pgvector’s cosine distance. We filter by a similarity threshold (0.4 by default) to avoid returning completely irrelevant tools, then take the top K results. The whole thing runs in under 10ms.
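
One detail worth spelling out: the WHERE clause compares a distance, while the threshold is described as a similarity. If the 0.4 is read as a minimum similarity, the value bound to $3 would be the corresponding distance, 1 - 0.4 = 0.6. The sketch below shows that conversion. ThresholdSketch and query_params are hypothetical names, the reading of the threshold is an assumption, and encoding the vector for the database driver is left out.

```elixir
defmodule ThresholdSketch do
  @moduledoc "Hypothetical sketch: turn a minimum similarity into a pgvector distance bound."

  # pgvector's <=> returns cosine distance, which is 1 minus cosine similarity.
  def max_distance(min_similarity) when is_number(min_similarity) do
    1.0 - min_similarity
  end

  # Parameters in the order the query uses them: $1 vector, $2 limit, $3 distance bound.
  def query_params(query_vector, top_k, min_similarity) do
    [query_vector, top_k, max_distance(min_similarity)]
  end
end

[_vector, limit, bound] = ThresholdSketch.query_params([0.1, 0.2], 5, 0.4)
IO.inspect({limit, Float.round(bound, 2)})
# {5, 0.6}
```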

Testing Without Hitting OpenAI

We use Mimic for mocking in tests. Every test that touches tool selection stubs the embedding provider to return consistent vectors:

Mimic.stub(EmbeddingProvider, :generate, fn _text ->
  {:ok, List.duplicate(0.1, 1536)}
end)

This keeps tests fast, deterministic, and free of API dependencies. We can simulate failures too, testing that the system gracefully falls back to using all tools when embedding generation fails.

What We Learned

A few things surprised us along the way.

The similarity threshold matters more than we expected. Too high and you filter out useful tools. Too low and you’re back to noise. We settled on 0.4 after some experimentation but it’s worth tuning for your domain.

Category expansion was an afterthought that became essential. Pure semantic similarity misses the dependencies between tools. If your assistant does multi-step operations, you need something like this.

The provider abstraction was worth it even though we haven’t switched providers. It forced us to think cleanly about the interface and made testing much easier. The Mimic stubs work because there’s a clear boundary to mock.

Cold start is a real concern. If your embeddings table is empty, you need a fallback. We log a warning and use all tools, which isn’t ideal but prevents complete failure.
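
The fallback can be sketched as a single function that treats both a failed embedding call and an empty search result the same way. FallbackSketch and the injected embed_fun and search_fun are hypothetical, and the IO.puts call stands in for a proper logger warning.

```elixir
defmodule FallbackSketch do
  @moduledoc "Hypothetical sketch: fall back to all tools when selection cannot run."

  # embed_fun returns {:ok, vector} or {:error, reason}.
  # search_fun returns a list of tool names (empty when the table is empty).
  def select_tools(query, all_tools, embed_fun, search_fun) do
    with {:ok, vector} <- embed_fun.(query),
         [_ | _] = names <- search_fun.(vector) do
      Enum.filter(all_tools, &(&1.name in names))
    else
      other ->
        # Stand-in for a logger warning in a real application.
        IO.puts(:stderr, "tool selection fell back to all tools: #{inspect(other)}")
        all_tools
    end
  end
end

tools = [%{name: "process_refund"}, %{name: "get_order_details"}]
search = fn _vector -> ["process_refund"] end

# Normal path: only the matched tool is returned.
IO.inspect(FallbackSketch.select_tools("refund", tools, fn _ -> {:ok, [0.1]} end, search))
# Embedding failure: every tool is returned.
IO.inspect(FallbackSketch.select_tools("refund", tools, fn _ -> {:error, :timeout} end, search))
```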

Results

After rolling this out, our per-request token usage for tool definitions dropped 60-80%. Latency improved by about 200ms since the model processes fewer tokens. Tool selection accuracy actually got slightly better because there’s less noise confusing the model.

The embedding costs are negligible. We’re at maybe $0.01 per day for our volume. The whole system adds a 10ms database query per request, which disappears in the noise of the LLM call.

For anyone dealing with tool explosion in their AI agents, this approach is worth considering. The implementation isn’t complex, the costs are minimal, and the benefits compound as your tool count grows.

React2Shell serves as a good reminder why JavaScript is no fun

Zarar Siddiqi — Sun, 07 Dec 2025 20:49:57 GMT

Got this email from Vercel after the JavaScript ecosystem had another moment.

I understand the urgency of the matter, but blocking deployments is a step too far as I don’t even have RSC enabled, and need to make changes to the app. This isn’t a simple upgrade as I was on Next 14 and React 18. The peer-dependency situation is hell as there are many libraries that have hardcoded dependencies on React 18. In some cases you can use legacy-peer-deps=true to get around things, but in others you have to update critical libraries that you had no real need to update, e.g., PixiJS 7 -> 8, and those have breaking changes. Not to mention that the codemods that need to be applied from Next 14 to 15 leave you with a sense of dread. But a weekend later, all is done.

What this experience reminded me of is why I got out of the JavaScript world and switched to Elixir/Phoenix (about 80% migration done). Half the time I was chasing library upgrades and framework changes, rather than focusing on actual valuable work. Since I’ve shifted away from it, it’s not an exaggeration that boilerplate plus maintenance of libraries has gone from taking about 30-50% of time down to about 5%.

All the JavaScript frameworks ultimately either:

Aim to manage state so that reactive experiences can be easier to develop.
Want to provide some performance optimization (e.g., SSR)

Compared to how Phoenix/LiveView does it, they’ve all failed. Every. Single. One. There is a ton of boilerplate and your app is always worried about the next breaking change that some library which doesn’t care about backward compatibility is about to publish, leaving you behind.

You can’t avoid JavaScript but I’ve found exiting the React/JavaScript ecosystem has made for a healthier programming lifestyle.

Good riddance to Auth0 and social logins

Zarar Siddiqi — Fri, 21 Nov 2025 16:36:31 GMT

Three months ago I got rid of social logins, fired Auth0 and made the decision to use Magic Links (easy in Phoenix - hello mix phx.gen.auth).

I had made the decision to go with Auth0 after I had implemented my own social logins for FB and Google, and wanted to support GitHub, in addition to regular username/passwords. I had grand ideas on how I’d use Auth0 to manage permissions, implement their “Actions” as middleware, and create user journeys which route people this way and that way, but most of all it was about security. I didn’t want to get into the business of managing passwords and tokens, and wanted to focus on value-added features, you know, things people pay for.

But here I am a year later reversing it all, and these are some reasons why. These are in no particular order.

85% of customers just use regular email/passwords so the additional options on login just confuse them. I wanted to make things easier for people and had thought I would reduce signup friction if I added social logins. This was an incorrect assumption.
People mistakenly use the email from their regular email/password combo to log in through a social login where they also have an account, and it’s chaos to manage that, especially from a customer support standpoint.
I didn’t want to have to manage keys from Meta, Google and the like, e.g., policy updates, token expirations, other random “Required Actions” being thrown at me in the Meta Developer Portal where each page takes 20 seconds to load.
I had overestimated how much work implementing my own security system would be. Phoenix 1.8 made it dead simple to implement Magic Links and with the help of Claude, the whole thing took a weekend to go from zero to in-production.
I liked the idea of outsourcing authentication to a user’s email provider (usually Gmail) and let it be the weakest link in the chain. They think about this stuff all the time so I don’t need to. They basically implement MFA for me.
I didn’t want to incur a needless cost. Though I’m on Auth0’s “Startup Program”, they can jack up fees any time and I didn’t like that unpredictability. I’m startup poor.
Managing permissions in a separate system was too complex. I’m using RBAC which isn’t hard to implement, and crossing network and system boundaries just to see “does this user have access?” seemed overkill. I did shove all the permissions in the JWT via an action, but any update to those permissions required pulling that information again, and my users update permissions all the time. Too much work to sync state and it felt needlessly complex.
Implementing resource-based authorization turned out to be much simpler using Elixir’s LetMe library. Just give me a database I can query rather than an API I have to call - so much simpler. The UI to manage it was also snappier to implement in LiveView as we’re avoiding REST calls.
There was very little control when using Auth0’s Universal Login on how you’d like to customize the screen. I realize login screens have to be simple, but just putting some other links was hard. Just having the ability to customize branding isn’t enough.
A lot of users were confused by the redirect to a “different” site even though I had a custom domain going. The older users felt like they had done something wrong when the header of the site changed.
I implemented account spoofing but it was a pain as I couldn’t just decrypt the token from Auth0 as they don’t provide the secret to do so (rightfully so, I suppose). I had a really funky workaround for operations people to fix customer issues via spoofing. My implementation could’ve been better, but it’s still unnecessary code complexity.
I hate Meta and Google and want to steer my users away from them, not towards them.
I felt I could secure customer data by using proper system admin, storage and encryption practices, and didn’t feel I was getting any additional security benefit by outsourcing this info to an external provider. I had overestimated how complex this was to manage (knock on wood). I felt having a good cloud and storage provider was more important than having a good identity management provider.

All in all, I learned a lot through this whole process and Auth0 did help me get off the ground quickly back when Phoenix 1.8 wasn’t released. Maybe this is just another case of Elixir/Phoenix making development a joy more than an indictment of hiring identity providers.

Pass on polymorphic tables

Zarar Siddiqi — Fri, 31 Oct 2025 19:04:37 GMT

A decision I often have to make when designing databases is whether to create a polymorphic table or not. Let me give an example. I have:

A discount table with an id which stores discount codes
A mailing_list table where signing up gives you a discount
A contest table where participating gives you a discount

You could design this as:

discount(id, code)
mailing_list_discounts(mailing_list_id, discount_id)
contest_discounts(contest_id, discount_id)

Or you could go:

discount(id, code)
auto_discounts(mailing_list_id, contest_id, discount_id)

One of mailing_list_id or contest_id would always be null. The query complexity for either will be very similar given proper indexing, and I find it generally easier to look for things in fewer tables than more, so auto_discounts is very tempting.

The challenge comes in when you have to store information related to specific types of discounts, say contest_winning_attempt, in which case we now have to have null values in every row where it doesn’t apply, i.e., every mailing-list row.

Another issue is that when querying against a single table you need to add a check for null. For example, if you want to find out which discount codes were awarded to people who didn’t win on their first try, you’d have to do:

SELECT * FROM auto_discounts
WHERE contest_id IS NOT NULL AND contest_winning_attempt != 1;

You need the contest_id IS NOT NULL check or it would pick up records that have nothing to do with contests. You could add a discriminator like type which is either contest or mailing_list and that makes things even more verbose and error prone.

So as much as I sometimes get critiqued for having too many tables in my app, I generally opt for separate tables to keep concerns separate (though they are somewhat related). There’s also the benefit of optimizing indexes on a per-table basis, less chance of misuse since columns that don’t concern you just aren’t there to misuse, and you keep scaling flexibility on the table (e.g., future sharding).

Stop Worrying and Love the Bomb

Zarar Siddiqi — Sat, 11 Oct 2025 14:29:33 GMT

A variant of this comment on HN is something I’ve heard too often:

I used to have this hard-to-get, in-demand skill that paid lots of money and felt like even though programming languages, libraries and web frameworks were always evolving I could always keep up because I’m smart...I find it way less fun to be waiting around for agents to do stuff and it’s way harder to get into flow state managing multiple of these things. It makes me want to move into something completely different like sales

Yeah, it sucks that a skill we all had has been commoditized. You always heard stories about factory workers getting their jobs outsourced or automated, but we (incorrectly) thought that this wouldn’t happen to white-collar jobs like software engineering. In hindsight, this was quite naive but now that we’re here, we have no choice but to deal with it and just embrace it like a helpless man on a beach staring at an incoming tsunami. Ride the wave or drown.

And yes, we all have job loss anxiety, especially those of us who don’t have FT jobs but contract their way around, but I want to acknowledge just how relieved of actual work stress I have been because of AI. Simply put, no upcoming feature request scares me because I know I have help. Have to look at a legacy codebase and make a change? No problem, I can understand the codebase 50x faster than I could have before. A customer requested a major change that requires a bunch of refactoring? Child’s play. I want to parse through a bunch of logs to find a needle in a haystack? Done. Need to create a plan for a double-entry accounting system and I’m not even familiar with accounting concepts? Bring it on. Code has poor documentation and you’re procrastinating? Not any more. You get the idea.

All these things used to stress me because the research, investigation or learning curve for these things was steep and time-consuming. Some of the curves may still be steep, but it certainly isn’t time consuming anymore, and that in itself has improved my quality of life. The amount of time AI has freed up for me to simply read a book, or watch a TV show, or spend more time with family is significant. My weekends used to be swamped with jumpcomedy.com work and I actually went for a bike ride because I know something that would take me 20 hours will take me 4, and I actually have the option to take up leisure time. This is wild to me.

Do I miss writing code from scratch? I don’t know the answer to that question. I do know that I don’t miss getting stuck even though getting stuck is how I learned many things. I do know that I like seeing my ideas come to life faster - in minutes and hours instead of weeks and months. This means I get to experiment more, so maybe I’ve replaced “getting stuck learning” with “experimental learning”. The number of iterations has increased and so has their speed, so I’m going through the inspect-and-adapt loop many more times than before. It’s the Lean Startup cycle on overdrive.

But again, do I miss writing code from scratch? Gun to my head. I’d say...no. That sounds like a betrayal of my art and profession, but code was always a means to an end, and I seem to be getting to the end a lot faster, so what am I feeling sad about? Is it nostalgia? Is it letting go of something you invested in for so long? Is it that writing code was part of your identity? Is it that learning how to code feels like a sunk cost? Are my degrees all for naught? No, yes, a little bit, maybe.

What matters now is that we have arrived in an entirely new world and I have some skills that meet the moment. It turns out that the core principles of software engineering have little to do with syntax and language, and the skills that I now use most aren’t too different from the ones I used pre-AI, but they seem to be more valuable because AI seems to work against some of them. Some examples:

Problem decomposition, i.e., breaking a larger problem into manageable parts and then focusing on each one intentionally without bloating your problem/context space
A dedication to the YAGNI/KISS principles because it’s so easy to generate code with AI and implement things you might not need
Finding the right application architecture and abstractions because you know your customer’s needs better than AI and are better at anticipating change, because you talk to customers
Greater focus on the business problem rather than the “how” which is what code is. I find myself knowing more about the problem domain than before.

These aren’t necessarily hard engineering skills but they are engineering skills which I’ve elevated to be my “primary” skills, replacing writing code. I feel some nostalgia, but I also see results faster which more than makes up for it. Is this going to eventually result in my unemployment and loss of income? Maybe. But the future is worry and the past is regrets, so may as well just live in the moment.

Everything that's wrong with Google Search in one image

Zarar Siddiqi — Wed, 24 Sep 2025 22:11:17 GMT

I typed in Midjourney to search for Midjourney because I wanted to use Midjourney. Here’s what I got instead. It’s the fifth result down on the page. So if you want to rank high on Google, not only do you need to build a great product to have enough backlinks but also pay Google so other lesser products can’t pay themselves to be ahead of you.

SAD! Thank you for your attention to this matter.

I'm rejuvenated by the Elixir EU Conference

Zarar Siddiqi — Fri, 16 May 2025 19:57:53 GMT

Finished my first ever Elixir conference and it feels great for three reasons:

Learned a lot
Realized that I don't know a lot
Felt grateful to be part of a community where I can possibly fill the gap

More than ever I've felt I've made the right decision opting for the Elixir/Erlang stack for jumpcomedy.com. It was comforting to hear from people who were facing similar decisions 7-8 years ago and took the cold plunge. They were risk takers, and I'm merely a follower who has the benefit of a vast amount of work done by immensely talented people. Here are five highlights of many from the conference.

Electric SQL / Phoenix Sync

This is a game-changer. In a world where real-time views are the norm and attention spans are decreasing fast, having data sync'd between the UI and the data store is ever more important for a great UX. Phoenix Sync does this by looking at the Postgres replication logs and syncing your LiveViews (or even JS front-ends) in near-instant time. Combine this with the power of LiveView streams and you literally have to change five characters, and an INSERT done from a Postgres GUI will show up in your LiveView UI. I kid you not - it seems too good to be true to not have to write any code to make this happen, but that's how it works.

Ash AI

There was a lot of AI talk and some amazing demos using LangChain but what excited me most was Ash Framework's AI extension, simply because I think they've nailed the declarative syntax and even made Model Context Protocol implementations, dare I say, rather trivial. Elixir/Phoenix has always given, and I don't say this lightly, a 5x productivity boost compared to anything else and this implementation just demonstrated how easy they've made it to power up your apps with AI.

LiveVue / JS Escape Hatches

Any good Elixir abstraction library provides a clean escape hatch, and LiveVue (and LiveReact and LiveSvelte) do just that - instead of sending HTML diffs based on state changes across the wire, they sync props, which means that you can use any front-end framework with Phoenix backends seamlessly for cases where UIs need to have heavy JavaScript for whatever reason. This is a fear of many who contemplate LiveView which is extremely server-centric, i.e., not having control over front-end state can keep a lot of developers away, especially those who rely on framework utilities in their app (npm i is just too easy). This addresses that problem.

Type Checking vs Type Inference

Jose Valim spoke about what's coming up in Elixir 1.19, which will have support for even more inferred types, notably inferring the types of keys in Maps. He suggested that in 18-24 months Elixir will have type declarations and that it's a matter of when, not if. This would close a huge perceived gap in the language and pave the way for developers who are used to TypeScript and Java to give Elixir a shot. Not having types never scared me because it forced me to not depend on types, which promoted a behaviour of writing more obvious code, smaller functions and better tests. There are already a lot of big companies using Elixir, like BBC, Bleacher Report, etc. but typing would open up the door even more to larger enterprises who often see it as a requirement simply because that's what they're used to.

Waffle

I have to give a shout out to the many lightning talks, and one of them was Waffle, which is a classic example of a small Elixir library that does one job, does it well, and doesn't try to do anything else. Waffle is a file processor/uploader with extensions to S3, Azure etc. I like how the author said that "we're pretty much done" with this library and that's what I love about Elixir libraries: a "last commit" of months or even years ago doesn't imply that the library is abandoned, it's that it just did what it was supposed to do really well and there's nothing more to do.

Chris McCord, the creator of Phoenix, gave the closing keynote titled "Code Generators are Dead. Long Live Code Generators", and man, he looked a little jaded as he was introducing how AIs he built are replacing the traditional Phoenix code generators. He literally built a customized TodoApp in 20 minutes and pondered the role of developers in the age of AI while doing it, and also while low-key dissing "vibe coding". His main idea seemed to be that AI isn't going to replace developers, though he understands why that might be a fear: people who cut-and-pasted from Stack Overflow will continue to cut-and-paste AI-generated code, and once things normalize, the human will still end up being needed.

I don't know what to think of this but AI hits different than the Stack Overflow metaphor simply because Stack Overflow still required you to connect the dots. A lot of AIs don't. A metaphor that resonates with me is one comparing AI with the industrial revolution, which revolutionized everything but demanded even more labour than before. Time will tell.

DORA: AI boosting productivity, hindering delivery

Zarar Siddiqi — Fri, 22 Nov 2024 00:47:14 GMT

The new DORA report got released and has 30 pages dedicated to AI adoption. One particular finding stuck out to me: an increase in AI adoption was correlated with an increase in code quality, quicker code reviews, decrease in code complexity, decrease in technical debt and improved documentation. However, it was also correlated with worse results on the key DORA metrics of lead time, throughput and change failure rate, while having no positive impact on product performance.

This is counterintuitive because you would imagine improving your software development process would lead to better software delivery performance. After all, if you're writing better code and having that code reviewed faster, you'd think that it would have some notable downstream impacts. Turns out it doesn't and the authors hypothesize that:

...the fundamental paradigm shift that AI has produced in terms of respondent productivity and code generation speed may have caused the field to forget one of DORA’s most basic principles—the importance of small batch sizes. That is, since AI allows respondents to produce a much greater amount of code in the same amount of time, it is possible, even likely, that changelists are growing in size. DORA has consistently shown that larger changes are slower and more prone to creating instability.

Basically, a rush to the red light effect where the deployment process of organizations isn't keeping up with the increase in developer productivity, leading to larger changes which lead to riskier deploys. This is fascinating to me and confirms one of my long-held beliefs (biases?) that keeping change size small is central to continuous delivery since, all things being equal:

Smaller the change, the lower the risk
Smaller the change, the more changes you need to make, the better you become at making changes
Larger the change, the more coordination required, making the change more expensive, thus reducing the incentive to make the change
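The first point can be put into rough numbers with a toy model. This is a sketch under a loud assumption of mine, not something from the DORA report: every changed line independently introduces a defect with the same small probability. The function name and the 0.1% rate are made up for illustration.

```typescript
// Toy model (an assumption, not DORA data): each changed line independently
// introduces a defect with probability perLineDefectRate.
function changeFailureProbability(
  linesChanged: number,
  perLineDefectRate: number
): number {
  // Probability that at least one changed line carries a defect.
  return 1 - Math.pow(1 - perLineDefectRate, linesChanged);
}

// Made-up rate for illustration only.
const assumedDefectRate = 0.001;

// One 1,000-line deploy versus one 50-line deploy.
const bigBatchRisk = changeFailureProbability(1000, assumedDefectRate);
const smallBatchRisk = changeFailureProbability(50, assumedDefectRate);

console.log({ bigBatchRisk, smallBatchRisk });
```

Under these made-up numbers the big deploy carries a defect roughly 63% of the time and the small one roughly 5% of the time. The total number of defects shipped doesn't change much if you ship the same code either way, but when a small deploy breaks you only have 50 lines to look through.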

The positive impact of smaller batch sizes in increasing quality and performance is best highlighted in Reinertsen's work which is summed up neatly in this HBR article from 2012 and the DORA authors have been hammering that point home for years.

I don't know if they have enough data to confirm this hypothesis but it is a reasonable one. I'd like to take it a step further and hypothesize that this is only a short-term trend. It is relatively easy for developers to adopt AI tools since it can be as simple as opening up a new browser window. However, for pipelines and deployment processes it will be slower since it requires deeper changes to platform-level toolsets which span beyond individuals to teams and departments. We may be seeing a lag between the "left" and "right" side of the delivery process and I think this is a short-run problem which will go away sooner than expected because of the urgency around AI.

For example, though the report doesn't explicitly link the two, earlier in the document they do point out that organizations are:

...willing to forgo the typical huge bureaucracy involved in adopting new technology because they felt an urgency to adopt AI, questioning "what if our competitor takes those actions before us"

Sidebar: This reminds me of the JDK Version Index which is conceptually similar to the Big Mac Index and a quick indicator of how far behind an organization is with the industry. Big companies are usually around 4-5 versions (sometimes more) behind what the widely accepted JDK version is. I remember when JDK 1.5 came out with generics, a lot of gigs I had insisted on sticking around with 1.4 for years because there just wasn't enough impetus to change something seen as so core. Similar stories can be heard about Java 8 and Java 17.

AI is different and the adoption speed is going to be faster since there is a perceived foregoing of tangible benefits if you don't. It's amazing how lumbering organizations who always cite their size and "oh we're so complex" as reasons not to adapt quickly, when faced with a crisis (or perceived crisis), are able to suddenly slice through the red tape to get stuff done (Shock Doctrine, anyone?).

Covid responses are a great example of this. Unfortunately, in my experience whatever process waste that was slashed during Covid has crept back in. Though I doubt anything will drive the impetus to change as Covid did, AI looks to have similar impact so DORA's findings of AI's positive impact on developers not translating to software delivery performance may just be a short-term phenomenon.

Good software development habits

Zarar Siddiqi — Thu, 05 Sep 2024 13:21:02 GMT

This post is not advice, it's what's working for me.

It's easy to pick up bad habits and hard to create good ones. Writing down what's working for me helps me maintain any good habits I've worked hard to develop. Here's an unordered list of 10 things that have helped me increase speed and maintain a respectable level of quality in the product I'm currently developing.

Keep commits small enough that you wonder if you're taking this "keep commits small" thing a little too far. You just never know when you have to revert a particular change and there's a sense of bliss knowing where you introduced a bug six days ago and only reverting that commit without going through the savagery of merge conflicts. My rule of thumb: compiling software should be committable.
Live Kent Beck's holy words of wisdom: "for each desired change, make the change easy (warning: this may be hard), then make the easy change". Aim for at least half of all commits to be refactorings. Continuous refactoring is thinking of changes I can make in under 10 minutes that improve something. Doing this pays off whenever a bigger requirement comes in and you find yourself making a small change to satisfy it only because of those smaller improvements. Big refactorings are a bad idea.
All code is a liability. Undeployed code is the grim reaper of liabilities. I need to know if it works or at least doesn't break anything. Tests give you confidence, production gives you approval. The hosting costs might rack up a little with so many deploys but it's a small price to pay for knowing the last thing you did was a true sign of progression. Working software is the primary measure of progress, says one of the agile principles. Working and progress are doing a lot of heavy lifting in that sentence, so I've defined them for myself. Working is something being working enough to be deployed, and if it's code that's contributing to a capability, that's progress.
Know when you're testing the framework's capability. If you are, don't do it. The framework is already tested by people who know a lot more than you, and you have to trust them that the useState() hook does what it's supposed to do. If you keep components small, then you reduce the need for a lot of tests as the framework will be doing most of the heavy lifting in the component. If the component is big, then you introduce more complexity and now you need to write a lot of tests.
If a particular function doesn't fit anywhere, create a new module (or class or component) for it and you'll find a home for it later. It's better to create a new independent construct than to jam it into an existing module where you know deep down it doesn't make sense. Worst comes to worst, it lives as an independent module which isn't too bad anyway.
If you don't know what an API should look like, write the tests first as it'll force you to think of the "customer" which in this case is you. You'll invariably discover cases that you would not have thought of if you had just written the code first and tests after. You don't have to be religious about TDD and it's OK to work in larger batches (e.g., write more than just a couple lines of code before making it pass). The amount of code to write in a red/failing state doesn't always have to be small. You know what you're doing, don't let dogma get in the way of productivity.
Copy-paste is OK once. The second time you're introducing duplication (i.e., three copies), don't. You should have enough data points to create a good enough abstraction. The risk of diverging implementations of the same thing is too high at this point, and consolidation is needed. It's better to have some wonky parameterization than it is to have multiple implementations of nearly the same thing. Improving the parameters will be easier than to consolidate four different implementations if this situation comes up again.
Designs get stale. You can slow the rate at which they get stale by refactoring, but ultimately you'll need to change how things work. Don't feel too bad about moving away from something that was dear to you a while ago and something you felt proud about at the time. You did the right thing then and shouldn't beat yourself up for not getting it right enough that you wouldn't need to change anything. Most of the time writing software is changing software. Just accept it and move on. There's no such thing as the perfect design, and change is at the core of software development. How good you are at changing things is how good you are at software development.
Technical debt can be classified into three main types: 1) things that are preventing you from doing stuff now, 2) things that will prevent you from doing stuff later, and 3) things that might prevent you from doing stuff later. Every other classification is a subset of these three. Minimize having lots of stuff in #1 and try to focus on #2. Ignore #3.
Testability is correlated with good design. Something not being easily testable hints that the design needs to be changed. Sometimes that design is your test design. As an example, if you find yourself finding it difficult to mock em.getRepository(User).findOneOrFail({id}), then chances are you either need to put that call into its own function that can be mocked, or write a test utility which allows for easier mocking of the entity manager methods. Tests go unwritten when it's hard to test, not because you don't want to test.
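To make the last point concrete, here's a minimal sketch of the "put that call into its own function" option. Everything in it is hypothetical: the User shape, the FindUser type and greetingFor are names I made up for illustration, and the ORM call only appears in a comment.

```typescript
// Hypothetical entity shape, standing in for a real ORM entity.
interface User {
  id: number;
  name: string;
}

// The hard-to-mock ORM call is hidden behind one small function type.
type FindUser = (id: number) => Promise<User>;

// Business logic depends on the function, not on the entity manager.
async function greetingFor(id: number, findUser: FindUser): Promise<string> {
  const user = await findUser(id);
  return `Hello, ${user.name}!`;
}

// In production the wrapper would delegate to the ORM, along the lines of:
//   const findUser: FindUser = (id) => em.getRepository(User).findOneOrFail({ id });

// In a test, a plain function replaces the whole entity manager.
const fakeFindUser: FindUser = async (id) => ({ id, name: "Ada" });

greetingFor(1, fakeFindUser).then((greeting) => console.log(greeting));
```

The test no longer needs to know that an entity manager exists, which is usually the sign that the seam is in the right place.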

There's probably a lot more, but 10 is a nice number.

Symptoms of a System in Stasis

Zarar Siddiqi — Mon, 26 Aug 2024 13:11:34 GMT

Here's an unordered list of symptoms that might indicate more profound issues with your organizational culture - one that's preventing you from delivering on your true potential.

You're "doing agile" by using some sort of iterative development pattern, and have gained enough efficiencies compared to any other way of working that you've ceased trying to systematically improve the development process. There are important milestones to hit and probing the development process doesn't appear like a way to hit those since all the juice out of the lemon is perceived to have been squeezed.
Retrospectives happen as a formality more than an inquiry of how the team is working within and across the organization. Only limited options of change are on the table, most are off limits. Improvement ideas are not stifled or suppressed, but everyone knows the Overton Window of what's up for debate.
Standardization is seen to increase efficiency and considered as a productivity gaining approach. Deviation from the common process is generally viewed as a defect to be corrected rather than a possibly novel paradigm-shift.
Process metrics dominate value metrics. Greater focus is given to team velocity metrics than to customer outcome metrics. The latter is seen to drive the former, with no evidence supporting this perceived causality.
The team is subdivided into skill-sets based on technologies, creating hand-offs within the team leading to longer queues and wait times. This forces managers to allocate work based more on individual skill availability than overall team capacity.
Work is started more often than it is finished, leading to a pile of zombie initiatives where managers grapple with the sunk cost fallacy as they try to extract positive interpretations out of negative outcomes.
The cost of organizational change/restructuring is seen as too high, but at the same time there's acknowledgement that the current structure is not conducive to delivering the value everyone agrees needs to be delivered. Nobody wants to tackle this math as it's politically explosive.
Competitor offerings are seen as a proxy for what customers desire more than a direct line of communication with the customer. The team's communication with customers is mediated by multiple layers of organization constructs, making it difficult to discern their actual needs.
Work fills the time allocated to it (Parkinson's Law) with most projects being relatively on-time, thus providing little incentive to interrogate the system in which they are delivered.
Most ideas for the next project come from above, not from the software development team, which is seen as a CPU to execute work, more than a source of ideas on what to do next. Work is pushed to the team rather than pulled by the team.

The anatomy of a 2AM mental breakdown

Zarar Siddiqi — Tue, 20 Aug 2024 15:02:48 GMT

Around 2AM this morning I had a realization that this was the most stressed I have ever been. On the verge of a complete breakdown.

Why? Because I noticed around 10PM that jumpcomedy.com was entirely broken with all HTTP POST calls made by RTK Query failing. Nothing worked and though I had deployed recent changes, none of them would cause this. I was at a complete loss as to where to look, especially as this is working locally. Posting on the usual Discords (NextJS, Vercel) is leading to dead silence. I'm alone and have to fix this issue which I didn't cause.

This isn't the first production defect I've introduced in my 25 years of working, but this is the first one where I had absolutely nobody to turn to in a time of crisis while customer complaints are piling up at a rate never seen before. No production support, no SRE, no Sr. Engineer, no manager to make it go away. Nothing. And here's the worst part: people who have taken a chance on me to the point where their entire small businesses depend on me are sad. Not only do I have no idea how to fix this, I'm also hurting people. This absolutely sucks. I felt shame, sorrow, and incompetence. Oh the incompetence and the imposter syndrome that comes with it.

The thoughts that were crossing my mind were bizarre: do I just shut this business down? Do I send a mass apology email to my customers and just ask them to pick a different event management provider? What do I do because I don't know where to look and it's been four hours already.

Enter Eminem. Alright, calm down, relax, start breathin'

I started breathing but it didn't help a damn as I still didn't know what the issue was. No matter how many console.log() statements I sprinkled around, nothing made sense. Was it the headers, the length of the API token, the sequence of calls...but it was just working. Why? WHY? WHY???? IS THIS HAPPENING? And why are GET and DELETE calls working?

It's OK. The world won't end. So what if your business entirely fails and you're paraded at the next tech conference as a case of what not to do. Oh well, that's your destiny, just deal with it BUT right now deal with this goddamn bug that you didn't cause but have to suffer through. The only clue I have is that it's working on localhost, which reminded me of that old joke where during a production outage the junior developer tells his boss, "but it's working on my machine". Well buddy, you're the junior developer. Also, you're a sack of shit. No, no, don't go there. There's plenty of time for self-reflection and self-hate later, but right now just see why those cursed POST calls are failing with:

TypeError: failed to execute 'fetch' on 'window': …with a request object that has already been used

Now that error message is a complete red herring and tells me nothing. It may as well have said, "The Lannisters refuse to pay their debts and flight UA763 from Miami is delayed".

Haha. I start making jokes to add some levity to the situation. It's not so bad, life is about nature and trees and sooner this business shuts down and you take a boat to a deserted island, the sooner you can start your memoirs and the first chapter of the memoir would be: TypeError: failed to execute 'fetch'.

My wife. Oh my poor wife. She offered me a cup of tea and ruffled my hair. "It's OK, big companies have production outages too". Ah, that's so sweet of her. I told her to go to bed while I question every major life decision leading up to this moment. Oh shit, what's this? It's customer emails piling up in my inbox. Lovely.

"Hey Zarar, I can't change the the price of my event"

"Hi Zarar, I'm trying to remove a promo code and it won't let me"

....

Please, can I just delete my email at this point and take a bus to the northern wilderness? Because I still have no clue what's going on and now I'm thinking maybe I should take that break to clear my head. You know, like they say in those self-help books, but what they don't say is that every five minutes I'm getting an email saying something's broken and my response is basically, "I apologize. Working on it". But I'm not working on it, I'm just staring at the screen putting debug statements where I feel Chrome Inspector is saying, "Bro you serious? You think there's a bug on this line?"

Ah, what's this? A Chrome update came in today? Could that have caused it? Hmmm...hope, I see hope. DASHED. HOPE IS DASHED! This is reproducible in Firefox and Edge. Edge? Even Edge is like WTF. Back to console.log() and break points. Now I'm dealing with source maps and libraries that don't publish source maps so now I'm looking at code that looks like this:

eC=Math.random().toString(36).slice(2),eE="__reactFiber$"+eC,ex="__reactProps$"+eC,ez="__reactContainer$"+eC,eP="__reactEvents$"+eC,eN="__reactListeners$"+eC,e_="__reactHandles$"+eC,eL="__reactResources$"+eC,eT="__reactMarker$"+eC;

This is no good. Let me just try reverting to a version from a month ago. Nothing. Three months ago? Nothing. Still failing. A year ago? Zilch.

OK, so you re-ask the question: what's happening in prod that's not happening locally? Or vice-versa. Some candidates:

Sentry is disabled locally
Databases are pointing to docker instead of cloud providers

Got rid of Sentry in production. Nothing. Pointed to PROD databases locally. Nothing. Maybe I should take that break, if only to calculate the financial damage and the much more significant reputational damage.

What else is different? Maybe PostHog, I have the api_key blanked out locally to reduce costs, so let me just add it to see what gives. Shot in the dark. 1 in a million chance. Let's do it.

WHAT?! REPRODUCED ON LOCALHOST. GIVE ME THAT FUCKING CUP OF TEA NOW!

Next commit: take out PostHog and everything is working.

At this point I'm thinking of all the people I've recommended PostHog to as this "amazing tool which shows you what your users are experiencing". How naive I was. Right now I hate PostHog more than anything and can't believe I was about to pay for that product (still a good product, I'm overreacting here). But still, in the moment I wanted to burn the company down.

But I did feel good about finding the defect because soon after many people reported the same:

https://github.com/PostHog/posthog/issues/24471

https://github.com/reduxjs/redux-toolkit/issues/4573
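For anyone who runs into the same message, here's a minimal sketch of how a Request ends up "already used". It only illustrates this class of error and is not a reconstruction of what happened in the issues above: inspectBody is a hypothetical stand-in for any code that reads a request body before the real call goes out, and the URL is a placeholder that never gets called.

```typescript
// Hypothetical helper: stands in for any code that reads a request body
// (logging, recording, analytics) before the real network call happens.
async function inspectBody(request: Request): Promise<string> {
  // Reading the body consumes it and marks the Request as used.
  return request.text();
}

async function demo(): Promise<{ bodyUsed: boolean; reuseFailed: boolean }> {
  // Placeholder URL; no network call is made anywhere in this sketch.
  const request = new Request("https://example.com/api/events", {
    method: "POST",
    body: JSON.stringify({ price: 20 }),
  });

  await inspectBody(request);

  let reuseFailed = false;
  try {
    // Building a Request from a used one throws a TypeError. fetch(request)
    // goes through the same construction step, so it fails the same way.
    new Request(request);
  } catch (error) {
    reuseFailed = error instanceof TypeError;
  }

  return { bodyUsed: request.bodyUsed, reuseFailed };
}

demo().then((result) => console.log(result));
```

This also fits with GET and DELETE calls surviving while POST calls die: a request without a body has nothing to consume. If something really does need to read the body first, calling request.clone() and reading the clone leaves the original usable.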

I write more on zarar.dev. I don’t post as much on Substack as the editor isn’t really conducive to code.

Empowered Developers Write Clean Code

Zarar Siddiqi — Mon, 12 Aug 2024 18:09:42 GMT

Side note: I’m writing a lot more on my blog zarar.dev because I don’t want to spam people’s inboxes in Substack as many of the posts can be technical and not really suited for Substack formatting (e.g., code snippets).

However, very excited to have Tom Howlett, Head of Product Management at Sonar as our guest for this episode of the Continuous Delivery Podcast.