Building Hardened: An AI Software Builder That Actually Ships Production Code
I have been building products with AI for the better part of a year now. Eight of them, across wildly different domains: proptech, fintech, marketing automation, LLM routing, video tooling. Through all of it, I kept refining the same development loop: build, review with a different perspective, log the mistakes, feed them back. Repeat.
At some point I had an uncomfortable realisation. The most valuable thing I had built was not any of the products. It was the process.
So I am building it into a product. It is called Hardened, and it lives at hardened.build.
What Existing Tools Get Wrong
Let me be blunt. Tools like Bolt, Lovable, and v0 are impressive for what they are: rapid prototyping environments. You describe what you want, and you get a working UI in seconds. That is genuinely useful.
But there is a canyon between "working prototype" and "production-ready software." These tools generate code with a single model, in a single pass, with no verification step. The output looks right. It runs. But it is riddled with the kind of subtle issues that only surface when real users hit edge cases: accessibility failures, state mutations, race conditions in serverless environments, missing error boundaries.
I know this because I spent months cataloguing exactly these failures. A single-model system has blind spots: it reviews its own output with the same assumptions that produced it. In my testing, self-review caught maybe 40% of real issues; cross-perspective verification caught closer to 85%.
Introducing Forge
At the core of Hardened is Forge, an ensemble AI engine built specifically for software construction.
Forge is not a single model. It is an orchestrated ensemble that separates the concerns of planning, building, and reviewing into distinct stages, each handled by different reasoning perspectives. The architecture is deliberately adversarial: the component that builds code and the component that reviews it operate with fundamentally different priorities and evaluation criteria.
Think of it like a construction project. You would never have the same person design the building, pour the concrete, and sign off on the structural inspection. Forge enforces the same separation of concerns, but for code.
The 5-Stage Pipeline
Hardened is not a chatbot that writes code. It is an automated pipeline with five discrete stages:
1. Forge Planner takes a project brief and produces a structured implementation plan. Architecture decisions, file structure, dependency choices, security considerations. This is the thinking stage. No code is written here.
2. Forge Builder executes the plan. It writes the actual code inside an isolated E2B sandbox. The sandbox means the code runs in a real environment: npm install, pnpm build, pnpm lint all happen for real, not in simulation.
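At heart, the sandbox step is "run the real toolchain, stop at the first failure, and report what broke." A minimal sketch of that contract, with a hypothetical `CommandRunner` interface standing in for the actual E2B SDK (whose real API differs):

```typescript
// Hypothetical stand-in for a sandbox that can execute shell commands.
// The real E2B SDK exposes its own API; this only sketches the contract.
interface CommandRunner {
  run(cmd: string): Promise<{ exitCode: number; stderr: string }>;
}

interface BuildOutcome {
  ok: boolean;
  failedStep?: string; // first command that exited non-zero
  stderr?: string;
}

// Run install/build/lint in order, aborting at the first failure so the
// Reviewer sees a real toolchain error rather than a simulated one.
async function runBuildSteps(sandbox: CommandRunner): Promise<BuildOutcome> {
  const steps = ["npm install", "pnpm build", "pnpm lint"];
  for (const cmd of steps) {
    const result = await sandbox.run(cmd);
    if (result.exitCode !== 0) {
      return { ok: false, failedStep: cmd, stderr: result.stderr };
    }
  }
  return { ok: true };
}
```

The point of the real environment is exactly this: a failing `pnpm build` is a hard stop, not something a model can talk its way past.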
3. Forge Reviewer is where it gets interesting. A completely different reasoning engine reviews every file the Builder produced. It checks against the original plan, runs through a pedantic checklist (accessibility, error handling, type safety, performance), and assigns a Quality Score from 0 to 100.
```ts
interface QualityReport {
  score: number            // 0-100 composite
  categories: {
    typeSafety: number     // strict mode, no `any` leaks
    accessibility: number  // WCAG 2.1 AA baseline
    errorHandling: number  // boundaries, fallbacks, edge cases
    performance: number    // bundle size, re-renders, queries
    security: number       // input validation, auth, XSS
    architecture: number   // separation, naming, patterns
  }
  failures: ReviewFailure[]
  passed: boolean          // score >= 80 to ship
}
```

If the score is below 80, the code does not ship. Full stop. The Reviewer sends failures back to the Builder for another pass.
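One way the gate could work, assuming the composite is an unweighted mean of the category scores (the real weighting is not specified in the post):

```typescript
interface CategoryScores {
  typeSafety: number;
  accessibility: number;
  errorHandling: number;
  performance: number;
  security: number;
  architecture: number;
}

const SHIP_THRESHOLD = 80; // score >= 80 to ship

// Composite as an unweighted mean of category scores — an assumption;
// the real weighting across categories may differ.
function compositeScore(c: CategoryScores): number {
  const values = Object.values(c);
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return Math.round(mean);
}

function passed(c: CategoryScores): boolean {
  return compositeScore(c) >= SHIP_THRESHOLD;
}
```

A build that averages 72 across categories gets bounced back to the Builder; one that averages 90 ships.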
4. Learning Engine logs every failure the Reviewer catches into a PATTERNS.md file scoped to the project. If the same failure category appears three or more times, it gets promoted to a hard rule. The next time the Builder runs, it reads these patterns and avoids them.
This is the part I am most excited about. It means Hardened gets measurably better at building your specific project over time. Sprint 01 might score a 72 and need two revision cycles. By Sprint 06, the Builder has internalised your project's patterns and consistently hits 90+ on the first pass.
5. Deployer pushes the reviewed build to production. Vercel, Netlify, or a custom target. The user gets a build report showing exactly what passed, what was flagged, and what the final score was.
The entire pipeline is orchestrated by Inngest: event-driven, step-based, with automatic retries and observability. No cron jobs. No fragile shell scripts. Each stage is an Inngest function that triggers the next.
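The "retry each stage, then fail the pipeline" idea is simple enough to sketch in plain TypeScript. This is a toy runner, not the Inngest API, just to show the shape of bounded-retry stage execution:

```typescript
// A pipeline stage: a name plus an async unit of work. Toy model only —
// Inngest's real step functions handle durability, events, and retries.
type Stage<I, O> = { name: string; run: (input: I) => Promise<O> };

// Run one stage with a bounded number of retries before giving up.
async function runStage<I, O>(
  stage: Stage<I, O>,
  input: I,
  retries = 2,
): Promise<O> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await stage.run(input);
    } catch (err) {
      lastError = err; // transient failure: try again until exhausted
    }
  }
  throw new Error(
    `stage ${stage.name} failed after ${retries + 1} attempts: ${lastError}`,
  );
}
```

Chaining stages is then just awaiting one stage's output as the next stage's input; the orchestrator's real job is making that chain survive timeouts and partial failures.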
Why Ensemble Verification Matters
I need to stress this because it is the core thesis: adversarial verification is not a nice-to-have. It is the entire architecture.
When a single system reviews its own output, it has the same blind spots that produced the output in the first place. It agrees with itself. It thinks the patterns it chose are reasonable because they are the patterns it would choose.
Forge's ensemble approach deliberately introduces cognitive diversity. The Builder and Reviewer are not just different instances of the same engine. They operate with different architectural biases, different evaluation priorities, and different failure mode sensitivities. The Builder optimises for shipping. The Reviewer optimises for correctness. These are frequently in tension, and that tension is productive.
In internal testing, self-review catches roughly 40% of meaningful issues. Forge's cross-perspective review catches closer to 85%. That gap is the entire product thesis.
The Feedback Flywheel
The Learning Engine deserves special attention because it creates a compounding advantage.
Most AI code generators are stateless. Every request starts from zero. They have no memory of what went wrong last time, what patterns your project uses, or what mistakes keep recurring.
Forge maintains a project-scoped pattern registry. Categories include accessibility (A11Y), design system compliance (DSC), TypeScript strictness (TS), error handling (ERR), performance (PERF), architecture (ARCH), and testing (TEST).
When a pattern accumulates three or more occurrences, it is promoted from an observation to a hard rule. The Builder reads these rules before writing any code. The result is a system that makes fewer of the same mistakes over time, specific to your codebase, your conventions, your edge cases.
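The promotion rule — three occurrences turns an observation into a hard rule — can be sketched like this. The category codes come from the post; the registry shape is my assumption:

```typescript
// Category codes from the post's pattern registry.
type PatternCategory = "A11Y" | "DSC" | "TS" | "ERR" | "PERF" | "ARCH" | "TEST";

interface PatternEntry {
  category: PatternCategory;
  note: string;       // e.g. "missing aria-label on icon buttons" (hypothetical)
  occurrences: number;
  hardRule: boolean;  // promoted once occurrences reach the threshold
}

const PROMOTION_THRESHOLD = 3;

// Record one Reviewer failure; promote the pattern to a hard rule
// once it has been seen PROMOTION_THRESHOLD times.
function recordFailure(
  registry: Map<string, PatternEntry>,
  category: PatternCategory,
  note: string,
): PatternEntry {
  const key = `${category}:${note}`;
  const entry = registry.get(key) ?? { category, note, occurrences: 0, hardRule: false };
  entry.occurrences += 1;
  entry.hardRule = entry.occurrences >= PROMOTION_THRESHOLD;
  registry.set(key, entry);
  return entry;
}
```

Everything with `hardRule: true` would be serialised into the context the Builder reads before its next pass.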
This is not fine-tuning. It is structured context injection. The base models remain unchanged. What changes is the instruction set they operate under, and that instruction set grows more precise with every build cycle.
The Stack
- Framework: Next.js 15, TypeScript, Tailwind
- Database: Supabase (auth, storage, project metadata)
- Pipeline orchestration: Inngest (event-driven, serverless step functions)
- Code execution: E2B sandboxes (isolated containers per build)
- AI engine: Forge (proprietary ensemble)
- Payments: Stripe
- Domain: hardened.build
The Plan
14 sprints across three phases:
- Core Pipeline (Sprints 01-06): Get the 5-stage pipeline working end-to-end. Planner, Builder, Reviewer, Learning Engine, Deployer. This is the hard part.
- Polish (Sprints 07-10): Dashboard, build history, project settings, quality score visualisation, team features.
- Scale (Sprints 11-14): Agency workflows, white-labelling, API access, usage-based billing at volume.
Pricing will be Free (to try it), Pro at $49/mo, Team at $149/mo, Agency at $399/mo. The free tier gives you enough to see the quality difference. Pro is for solo developers who want to ship confidently. Agency is for studios managing multiple client projects.
Why This Is the Hardest Thing I Have Built
The other products in the Uptrail portfolio are relatively contained. A classifier, a marketing swarm, a video tool. They solve one problem well.
Hardened is different. It is an AI system that orchestrates other AI systems to produce software. The failure modes are compounding: a bad plan produces bad code, a bad review lets bad code through, a bad learning engine teaches the wrong lessons. Every stage needs to be rock solid because every downstream stage depends on it.
I am not going to pretend this is easy. The pipeline coordination alone (timeouts, partial failures, sandbox cleanup, token budget management across the Forge ensemble) is genuinely complex. Inngest handles a lot of the orchestration pain, but the domain logic is intricate.
But the payoff, if it works, is significant. Not another prototype generator. An actual software builder that produces code you would be comfortable deploying to production, with a transparent quality score and a paper trail of every decision.
What Is Next
Sprint 01 just kicked off. The first milestone is a working Planner–Builder–Reviewer loop for a simple Next.js app.
I will be writing about the build as it progresses. The scores, the failures, the patterns that emerge. All of it, publicly.
The honest question I keep asking myself: can a methodology that works brilliantly when I am in the loop translate into a fully automated product that works for strangers? I think so. But I will not know until real users push it past my assumptions. That is what build-in-public is for.