Yes, You Can Run a Dark Factory on Your Codebase
What works today, what arrives next, and how to start this week.
I keep meeting CTOs and engineering leads who have read one or two writeups about the dark software factory and come away with the impression that this is something other companies do. Not their company, not their codebase, not their team; for them it is impossible. I think they are wrong, and I am writing this because the same conversation has stalled at the same point with enough different people that I want the answer in a form I can point to.
The short version: yes, you can run this on your codebase. The recipe is lighter than the existing writeups make it look, and the categories of work where it already produces good results cover what I estimate to be roughly half of what most engineering teams ship. Over the next five years the range of feasible work widens every year, and the teams that start now compound on it.
I should also say upfront that we build a dark-factory harness for a living, which means I have a commercial interest in you adopting the pattern. I am still going to tell you where it does not work yet.
A dark software factory is a setup where AI agents read intent (e.g. a ticket), write the code, run the tests, and push to production, and where humans do not review the code line by line. The name borrows from lights-out manufacturing, where robots run the production floor and the lights can stay off because nobody is on it. In the software version the humans move to the beginning and end of the pipeline. They decide what to build, they set the policy the factory has to obey, and they look at the outcomes and decide whether what shipped is good enough. The writing of the code is the machine’s job. I have written some longer pieces on dark factories, and they are linked at the bottom of this one if you want the background.
Let’s spend some time on what works today, because I think this is where CTOs underestimate their environment. Dark factories can produce working results for the following kinds of work, against the harness infrastructure most teams already have. These are the bread and butter of engineering work, and they are often inward-looking.
- Integrations, API clients, and SDKs. The spec is the upstream, the API documentation is the contract, and your existing integration tests against that service are the grader. Most teams already have these tests, and you have most of the parts already.
- Migrations and refactors. A migration is a particularly clean case because the old behaviour is the specification for the new behaviour, and the outputs of the previous system grade whether the new system is doing the right thing. Database migrations, framework upgrades, language version bumps, and the long backlog of “rewrite this in the new style” tickets all live here. Migrations are often boring, often urgent, and often the place where senior engineers spend time they should be spending on something else.
- Internal APIs and CRUD on well-typed schemas. When you have a schema, the schema is the contract and the contract is the grader.
- Codegen from formal specifications. OpenAPI, GraphQL, protobuf, JSON Schema. The specification is machine-readable, and the gap between specification and implementation is exactly the gap the factory was built to close. If you are already using these formats, you are most of the way there.
- Infrastructure as code and policy. Cloud APIs are well-documented, the desired state is declarative, and your security policy is itself a kind of specification. Terraform, Pulumi, Kubernetes manifests, IAM policy, and the operational glue around them all work well, and the audit trail the factory leaves is often cleaner than what humans produce under deadline pressure.
- Bug fixes with a reproducing test. I want to flag this one because most teams overlook it. When you have a failing test that reproduces a bug, the test is the grader by definition, and the factory’s job is to make it pass without breaking the rest of the suite. The backlog of bugs-with-reproductions is long in almost every codebase I see, and the factory can chew through it in the background while the humans work on harder things.
- Parsers, serializers, and format converters. Input/output pairs grade the work, and they are usually easy to assemble from real data.
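What these categories share is that a grader already exists. As a minimal sketch of the migration case, where the old behaviour grades the new one, the following compares a rewritten function against its predecessor on recorded inputs. Every name here (`grade_against_reference`, `old_slugify`, `new_slugify`) is a hypothetical illustration, not part of any particular harness.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class GradeReport:
    """Outcome of comparing a candidate against a reference implementation."""
    total: int = 0
    mismatches: List[Tuple[Any, Any, Any]] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # An empty input set proves nothing, so it does not count as a pass.
        return self.total > 0 and not self.mismatches


def grade_against_reference(
    reference: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> GradeReport:
    """Run both implementations on each input and record every disagreement."""
    report = GradeReport()
    for value in inputs:
        report.total += 1
        expected = reference(value)
        try:
            actual = candidate(value)
        except Exception as exc:  # a crash in the candidate is a mismatch too
            report.mismatches.append((value, expected, repr(exc)))
            continue
        if actual != expected:
            report.mismatches.append((value, expected, actual))
    return report


# Hypothetical "old" implementation: its behaviour is the specification.
def old_slugify(title: str) -> str:
    return "-".join(title.lower().split())


# Hypothetical rewrite "in the new style" that must behave identically.
def new_slugify(title: str) -> str:
    return re.sub(r"\s+", "-", title.strip().lower())


recorded_inputs = ["Hello World", "  Hello   World ", ""]
report = grade_against_reference(old_slugify, new_slugify, recorded_inputs)
```

In this sketch the rewrite is accepted only when `report.passed` is true. The same shape covers parsers and format converters if `reference` is replaced by a lookup into recorded input/output pairs.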
Add these seven categories up honestly for a typical SaaS company and they cover something between forty and sixty percent of what the engineering team ships. This is the part of the conversation that gets lost when the category is discussed through its flagship examples. The pattern is not aiming at a narrow industry niche. It is aiming at the unglamorous middle of the codebase where most of the work actually lives, and that middle is where it works today.
Beyond the today-list is a near edge of work that sits right at the threshold of being dark-factory-ready, and where the harness tooling has matured enough in the last six months that I expect routine adoption within the next year:
- Deep feature work behind feature flags becomes feasible because the flag collapses the cost of shipping the wrong thing, and the verification calculus changes once the rollback is one click.
- Performance work against benchmarks becomes feasible because the benchmark is the grader, and the practice that has worked in compiler engineering for decades is moving into application code as the harnesses get better at running benchmarks without flakiness.
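To make "the benchmark is the grader" concrete, here is a minimal sketch of a regression gate. It assumes the harness has already collected timing samples for the baseline and the candidate. The function name `benchmark_gate` and the 5% default tolerance are illustrative assumptions, not a recommendation.

```python
import statistics
from typing import Sequence


def benchmark_gate(
    baseline_ms: Sequence[float],
    candidate_ms: Sequence[float],
    max_regression: float = 0.05,
) -> bool:
    """Return True if the candidate's median runtime is at most
    `max_regression` (as a fraction) slower than the baseline's median.

    Medians are used instead of means so that a single noisy sample
    does not flip the decision.
    """
    if not baseline_ms or not candidate_ms:
        raise ValueError("need at least one timing sample per side")
    baseline = statistics.median(baseline_ms)
    candidate = statistics.median(candidate_ms)
    return candidate <= baseline * (1.0 + max_regression)


# Example samples in milliseconds; the 250.0 outlier mimics a flaky run.
accepted = benchmark_gate([100.0, 102.0, 98.0, 250.0], [103.0, 104.0, 105.0])
rejected = benchmark_gate([100.0, 102.0, 98.0, 250.0], [110.0, 111.0, 112.0])
```

A production gate would likely want more samples and a proper statistical comparison. The point of the sketch is only that the accept/reject decision is mechanical once a trusted benchmark exists.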
And then? What’s next? We will see the next rung on the dark factory ladder. This next ring includes the kinds of work where Level 6 self-optimisation in the harness starts to matter, and where the surrounding telemetry needs to be more mature than most teams have today:
- Net-new product features with strong telemetry, where the A/B test in production decides whether a feature works. Consumer apps get this first because their telemetry is shaped for it, and B2B SaaS follows shortly after.
- Self-serve onboarding flows, dynamic onboarding flows, and growth surfaces, where engagement and conversion are measurable and the factory iterates against them.
- Internal tools and dashboards, where usage telemetry tells you whether the tool is working, and where internal users are reachable and willing to give honest feedback when it is not, or are iterating on the tool themselves.
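As a deliberately simplified sketch of what it means for telemetry to decide, the following applies a two-sided two-proportion z-test to conversion counts and turns the result into a decision. The function name, the decision labels, and the 0.05 threshold are assumptions for illustration. Real experimentation platforms also deal with issues this sketch ignores, such as peeking at interim results and tracking multiple metrics.

```python
import math


def ab_test_decision(
    control_conversions: int,
    control_users: int,
    variant_conversions: int,
    variant_users: int,
    alpha: float = 0.05,
) -> str:
    """Return "ship", "roll back", or "inconclusive" for a conversion A/B test."""
    p_control = control_conversions / control_users
    p_variant = variant_conversions / variant_users
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (control_conversions + variant_conversions) / (
        control_users + variant_users
    )
    std_err = math.sqrt(
        pooled * (1.0 - pooled) * (1.0 / control_users + 1.0 / variant_users)
    )
    if std_err == 0.0:
        return "inconclusive"
    z = (p_variant - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    if p_value >= alpha:
        return "inconclusive"
    return "ship" if z > 0 else "roll back"
```

The human contribution in this picture is choosing the metric and the threshold. The factory then iterates against the decision function.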
I am less confident about specifics further out, but the gradient is visible. Whole subsystems specified end to end and shipped autonomously, with the factory making architectural choices inside a policy system set by humans. Level 7 ideation is when the factory reads customer signals and proposes its own work, including what not to do.
Cross-company harness learning, where the patterns that work in one factory propagate to others without revealing private code. Continuous architectural evolution, where the shape of the codebase is itself something the factory tunes against measured outcomes rather than something a senior engineer redraws every two years.
The fastest way to find out whether the pattern works for you is to try it out.
Run one ticket end to end without human interaction and ship the result. Note what worked and what did not, and then pick the next ticket. The learnings compound quickly because the harness tuning you do for the first tickets makes the next ones cheaper, and the team’s intent-writing skill sharpens with every ticket that ships.
Level up the harness. You need a fenced environment so the agent runs in a devcontainer it cannot escape, you need the agent to see your tickets and your documentation and your design notes, you need a policy layer that controls credentials and outbound network calls, and you need a workflow for outcome review by humans.
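The policy layer is the least familiar of these parts, so here is a minimal sketch of the idea: a deny-by-default check that the harness consults before the agent makes an outbound call or reads a credential. The `AgentPolicy` class, its fields, and the host and credential names are all hypothetical. This is not the API of any specific harness.

```python
from dataclasses import dataclass
from typing import FrozenSet
from urllib.parse import urlsplit


@dataclass(frozen=True)
class AgentPolicy:
    """Deny-by-default policy: anything not explicitly listed is refused."""
    allowed_hosts: FrozenSet[str]
    allowed_credentials: FrozenSet[str]

    def allows_request(self, url: str) -> bool:
        # urlsplit lowercases the hostname; it is None when the URL has none.
        host = urlsplit(url).hostname
        return host is not None and host in self.allowed_hosts

    def allows_credential(self, name: str) -> bool:
        return name in self.allowed_credentials


# Hypothetical policy for one subsystem: one upstream API, one read-only token.
policy = AgentPolicy(
    allowed_hosts=frozenset({"api.example.com"}),
    allowed_credentials=frozenset({"CI_READ_TOKEN"}),
)
```

With this shape, `policy.allows_request("https://api.example.com/v1/items")` is permitted while a call to any unlisted host is refused. In practice the same rules would also need to be enforced at the network boundary of the devcontainer, not only in application code.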
Some of your competitors are doing this already. The gap they open while you doubt, discuss, and postpone decisions is a gap you will need more effort to close later. The advantage of dark factories is not the model, and it is not the harness, and it is not even the discipline taken alone. It is having practised all three for long enough that the institutional knowledge is sharp, and that knowledge and practice cannot be bought.
We create human, a dark factory and enterprise AI solution. human is the harness for the today-list. The devcontainer, the context pipes, the policy layer, and the lifecycle skills are all in place, and you can point it at one subsystem this week. It is the tool I wished I had as a CTO when I started running Claude Code against real codebases, and it is the tool I would pick today if I were starting fresh on the seven categories above.
The architecture is also pointed at Level 6 of dark factories, because the lifecycle is integrated end to end and the signals that a self-optimising harness needs are already being collected at every stage. The today-list gets you shipping, and the path from there to the next ring and the ring after that is what the product is being built for.
If you have read the writeups about dark factories and concluded that this is for other companies, I think you are wrong. Getting started is easier than you think. The test infrastructure you already have is most of what you need. The harness around it is a known shape and other people have already built it. Install human, point it at the subsystem you picked, and ship the first ticket. The rest follows from there.
Further reading
- What is a Dark Software Factory? — the first piece in this series, with the origin of the term, the five levels framework, and the verification ceiling argument.
- Beyond the Dark Factory — the second piece, on Level 6 self-optimisation and Level 7 autonomous ideation.