The Living Product
What happens when software doesn't just build itself, but senses its environment, decides how to grow, and evolves?
Imagine a product that behaves less like a machine and more like a living organism. It senses its environment — what users do, what the market demands, where friction lives. It decides how to grow. It builds new structures, tests whether they work, and encodes what it learns into its own DNA. Then it does it again. Continuously, autonomously, getting smarter every cycle.
No product manager decided what to build. No designer opened Figma. No developer wrote a ticket. The product evolved — not randomly, but intelligently, guided by a constant stream of real-world evidence.
Map the architecture against the biological definition of life, and the parallels aren't metaphorical — they're structural.
It has organized internal subsystems. It metabolizes raw inputs into useful output. It maintains homeostasis by pruning complexity. It grows in capability over time. It responds to environmental changes. It adapts across successive cycles. It even exhibits the requirements for evolution — variation, selection, heritability, and generational time.
The only property of life it doesn't exhibit is reproduction. That's an open question for later.
Every piece of technology required to make this work exists today. What doesn't exist — yet — is the architecture that connects those pieces into a single, continuous loop. That's what we're building.
The Dark Factory is the vision — and the race is on
I'm the Chief Product & Technology Officer at Willow, an EdTech platform that helps school counselors guide students through college and career planning. We serve thousands of schools, and like every product team right now, we're watching AI reshape how software gets built. What follows is the architecture we're pursuing — and why we think we're positioned to pull it off.
In early 2026, a concept called the "Dark Factory" broke through in the software world. Borrowed from manufacturing — where fully automated plants run with the lights off because no human is on the floor — the idea is radical: give AI a specification, and it produces working, tested, deployed software. No human writes the code. No human reviews it. No human is in the production loop at all. Spec in, software out, lights off.
This is not the same as "AI writes most of our code." When engineers use Cursor or Claude Code to produce features faster, a human is still directing, reviewing, iterating. That's a faster version of the old process. The Dark Factory is a fundamentally new one. Nobody has fully achieved it yet — but real progress is being made. StrongDM published the most credible attempt in February 2026, and players like BCG, EY, and Fujitsu are investing heavily in the infrastructure. The building blocks are emerging fast.
But here's what everyone is missing: even the most ambitious dark factory only builds what you tell it to build. The factory automates the how. Nobody is even attempting to automate the what. That's where this goes next.
Two systems, one loop
The Self-Directing Product is two systems working together. The first — the Product Intelligence Engine — listens to the world and decides what to build. The second — the Dark Product Factory — builds it. Together, they form a closed loop that never stops.
Listen
The system collects signals from everywhere — how users actually behave in the product, what sales prospects ask for, what users say in micro-surveys, what competitors are shipping, what policy changes are creating new demand, what support tickets reveal about friction. It's always listening, across every channel, structuring every signal into evidence.
Understand
An intelligence engine synthesizes thousands of signals into a ranked map of opportunities. Not feature requests — opportunities. Unmet user needs, underserved market segments, competitive gaps, emerging regulatory requirements. Each opportunity is scored by evidence strength, market impact, strategic fit, and feasibility. The map updates itself weekly.
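One way such scoring could work is a weighted sum across the dimensions named above. This is a minimal sketch, not the actual engine; the dimension names, weights, and `Opportunity` structure are all hypothetical, and a real system would learn the weights from its calibration loop rather than hard-coding them:

```python
from dataclasses import dataclass

# Hypothetical weights; a production system would tune these from feedback.
WEIGHTS = {"evidence": 0.35, "impact": 0.30, "fit": 0.20, "feasibility": 0.15}

@dataclass
class Opportunity:
    name: str
    evidence: float      # 0-1: how strongly fused signals support this need
    impact: float        # 0-1: estimated effect on business metrics
    fit: float           # 0-1: alignment with product strategy
    feasibility: float   # 0-1: how buildable it is today

    def score(self) -> float:
        return (WEIGHTS["evidence"] * self.evidence
                + WEIGHTS["impact"] * self.impact
                + WEIGHTS["fit"] * self.fit
                + WEIGHTS["feasibility"] * self.feasibility)

def rank(opportunities: list[Opportunity]) -> list[Opportunity]:
    """Return opportunities sorted highest-scoring first."""
    return sorted(opportunities, key=lambda o: o.score(), reverse=True)
```

The weekly re-ranking the article describes would simply be this function re-run over freshly scored opportunities.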
Decide
The system selects the highest-impact opportunity and translates it into a precise product objective — not "build a feed feature" but "counselors need a way to reach students who never visit their office, targeting 20% student engagement within 30 days, measured by response rate."
Design
AI generates multiple design approaches, each representing a fundamentally different way to solve the problem. Not random variations — informed explorations constrained by the product's design system, accessibility requirements, and domain-specific UX patterns.
Test — synthetic first, then real
A swarm of AI persona agents — a veteran counselor at a Title I school, a first-generation student on their phone between classes, a district admin justifying the budget — interacts with every design variant. They catch broken flows, confusing navigation, and obviously bad designs. But they're a pre-filter, not the final word. The top designs then ship to a small cohort of real users behind feature flags. Synthetic users screen. Real users validate. The best design wins.
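The pre-filter role of the persona swarm can be sketched as a simple vote: each persona agent attempts a walkthrough of a variant, and only variants that enough personas can complete advance to the real-user cohort. In this illustrative sketch, personas are stand-in predicate functions; in practice each would be an LLM-driven agent:

```python
def screen(variants: list[str], personas: list, threshold: float = 0.8) -> list[str]:
    """Keep variants that a sufficient fraction of persona agents can
    successfully complete. Survivors go to real users behind feature flags;
    this step only filters out obviously broken or confusing designs."""
    survivors = []
    for variant in variants:
        passes = sum(1 for persona in personas if persona(variant))
        if passes / len(personas) >= threshold:
            survivors.append(variant)
    return survivors
```

The threshold is deliberately a screen, not a final judgment: the winning design is still chosen by real-user behavior, as the article stresses.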
Build and ship
The winning design becomes a full specification. AI agents write the code, write the tests, validate against scenarios they've never seen (stored separately, like a holdout set in machine learning, so they can't game the results), and deploy behind a feature flag to a subset of real users.
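The holdout idea can be made concrete: acceptance scenarios are stored outside the coding agents' context and only run at validation time, so the agents cannot tailor the implementation to them. The scenario shapes below are purely illustrative:

```python
# Hypothetical holdout set: kept separate from the building agents,
# analogous to a test set withheld during model training.
HOLDOUT_SCENARIOS = [
    {"input": {"role": "counselor", "action": "post"}, "expect": "visible_to_students"},
    {"input": {"role": "student", "action": "post"}, "expect": "rejected"},
]

def validate(feature, scenarios: list[dict]) -> tuple[bool, list[dict]]:
    """Run the built feature against scenarios it has never seen.
    Returns (passed, failing_scenarios)."""
    failures = [s for s in scenarios if feature(s["input"]) != s["expect"]]
    return len(failures) == 0, failures
```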
Learn — three loops, compounding
Real-world performance feeds back into the system through three distinct calibration loops, each on its own timeline. Persona calibration: did the synthetic users predict what real users actually did? Sharpen the personas. Opportunity scoring: did the Intelligence Engine correctly rank which opportunities would move the business? Refine the weights. Design generation: did the Factory's first-pass designs get closer to the winner this time? Tighten the constraints. Each loop compounds independently. Together, they make the entire organism smarter every cycle.
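The first of those loops, persona calibration, can be sketched as tracking each synthetic persona's prediction accuracy against real-user outcomes and down-weighting unreliable personas over time. The exponential-moving-average form here is an assumption, one simple way to make the loop compound:

```python
def update_reliability(reliability: float, predicted: bool, observed: bool,
                       lr: float = 0.1) -> float:
    """Exponential moving average of a persona's prediction accuracy.

    reliability: current trust score in [0, 1]
    predicted:   what the synthetic persona said real users would do
    observed:    what real users behind the feature flag actually did
    """
    hit = 1.0 if predicted == observed else 0.0
    return (1 - lr) * reliability + lr * hit
```

Personas whose reliability decays would be regenerated or retired; personas that keep predicting real behavior earn more weight in the screening vote.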
Heal — fix what's broken
Bugs, errors, regressions, things that worked and stopped working. The system detects failures through monitoring and user reports, diagnoses root causes, and ships fixes autonomously. This is the immune system — always running, always repairing.
Strengthen — improve what's underperforming
Every shipped feature carries success metrics. When a feature is working but not hitting its targets — staff are posting to the feed, but only at 0.6 per month instead of the target 1.0 — nothing is broken. It just isn't strong enough yet. The system continuously monitors every feature against its stated goals, generates improvement hypotheses, tests them, and iterates. Maybe the compose button needs to be more prominent. Maybe a weekly prompt would nudge posting frequency. The system tries, measures, and adjusts — in parallel with the Intelligence Engine deciding what to build next. Growth and strengthening happen simultaneously, just like a living organism doesn't stop conditioning existing muscle while it's also building new structures.
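The healing/strengthening split the last two sections describe amounts to a routing decision per feature: broken goes to the immune system, below-target goes to the improvement loop, on-target needs nothing. A minimal sketch, using the article's feed example:

```python
def assess(metric: float, target: float, broken: bool) -> str:
    """Route a shipped feature based on its health.

    'heal'       -> failures detected; the immune system ships a fix
    'strengthen' -> working but below target; generate improvement hypotheses
    'healthy'    -> meeting its stated goal; leave it alone
    """
    if broken:
        return "heal"
    if metric >= target:
        return "healthy"
    return "strengthen"
```

For the feed example: staff posting at 0.6 per month against a target of 1.0, with nothing broken, routes to "strengthen", which is where hypotheses like a more prominent compose button or a weekly prompt would be generated and tested.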
Then it starts again. The next opportunity is already ranked and waiting. The factory never stops. And neither does the immune system, or the training regimen.
A product that hears everything
The quality of an autonomous product is bounded by the quality of its inputs. A self-directing product needs ears everywhere:
Behavioral telemetry — what your weekly active users actually do. Where they linger. Where they bounce. What they search for that doesn't exist. The workarounds they build when the product falls short. This is the highest-fidelity signal because it captures revealed preference, not stated preference.
Sales conversations — every call, demo, and meeting with prospects and customers, automatically transcribed and analyzed. Why do schools buy? Why don't they? What do they wish the product could do? What competitor feature did they just mention for the third time this month?
Contextual micro-surveys — two questions, maximum, triggered at the exact right moment. "You just finished building a student's graduation plan. What was the hardest part?" At most one survey per user per week, rotating, always contextual, never annoying.
Market intelligence — continuous monitoring of competitor product updates, education policy changes, state mandates, EdTech thought leadership, RFP language from district procurement. A new state requirement for career readiness plans just affected 2,300 schools? That's not just news — that's a scored, ranked product opportunity with a time-sensitivity flag.
Student outcomes — the ultimate signal. Which features correlate with students actually completing FAFSA, enrolling in college, entering career pathways? This is lagged and hard to collect, but it's the ground truth that calibrates everything else.
Signal fusion is the moat. Anyone can wire up autonomous coding agents. Very few teams have multi-channel signal infrastructure and a real user base to calibrate against. Most product teams make decisions based on one or two signal types — usually stakeholder opinions and customer feature requests. A living product fuses six or more signal types continuously, weighs them against each other, and surfaces opportunities that no single source would reveal on its own.
A sales conversation mentions a need. Telemetry confirms the behavioral pattern. A survey provides the emotional context. A policy change creates market urgency. Each signal alone is ambiguous. Together, they form high-confidence product intelligence. This is the central differentiator — not the factory, but the nervous system that directs it.
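One plausible fusion rule, sketched under assumed channel weights: each channel contributes evidence, but a claim supported by only one channel stays capped at low confidence, so that corroboration across channels is what unlocks a high score. The weights and cap here are illustrative, not Willow's actual model:

```python
# Hypothetical channel weights; telemetry rates highest because it captures
# revealed rather than stated preference.
CHANNEL_WEIGHTS = {
    "telemetry": 0.30,
    "sales": 0.20,
    "survey": 0.15,
    "market": 0.15,
    "support": 0.10,
    "outcomes": 0.10,
}

def fused_confidence(signals: dict[str, float]) -> float:
    """signals maps channel name -> strength in [0, 1].
    A single uncorroborated channel is capped: one loud signal alone
    stays ambiguous, matching the corroboration principle above."""
    score = sum(CHANNEL_WEIGHTS[ch] * s for ch, s in signals.items())
    corroborating = sum(1 for s in signals.values() if s > 0)
    if corroborating < 2:
        score = min(score, 0.25)
    return score
```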
Not a factory. An organism.
The "Dark Factory" analogy is useful but incomplete. A factory — even an autonomous one — executes instructions. It's inert without a spec. What this architecture actually describes is something categorically different: a system that exhibits the properties of a living organism.
It takes in raw inputs and transforms them into useful output — metabolism. It maintains internal stability by pruning complexity and balancing exploration against exploitation — homeostasis. It detects and reacts to environmental changes — a new state mandate, a competitor move, a behavioral shift. It adapts across successive cycles, encoding what works into its own knowledge for future generations.
In the factory metaphor, humans are managers overseeing a production line. In the living product, humans are more like gardeners.
They don't build the organism, but they shape the conditions for its growth, introduce traits the organism wouldn't develop on its own, and prune what isn't working. The system handles the sensing, growing, and adapting. Humans own the intentional direction — the innovations that aren't visible in current signals because they don't exist yet.
The closest analogies aren't factories at all. They're recommendation engines — Netflix, Spotify — systems that collect signals, rank options, serve the best one, measure engagement, and optimize. But those systems choose from existing content. The Living Product chooses what to create.
The spec layer is everything
The Intelligence Engine's output is a spec. The Factory's input is a spec. If the spec layer is weak, both systems fail regardless of how good they individually are. This is not one open question among many — it's the linchpin of the entire architecture.
Current AI is remarkably good at generating code from clear specifications, and remarkably poor at generating clear specifications from vague intentions.
The Intent Engine — the system that translates a business outcome like "increase counselor engagement" into a precise, testable, buildable specification — is the layer where the hardest unsolved problems live.
What does a machine-generated spec look like concretely? How is spec quality validated before it enters the factory? When a spec produces a feature that misses its targets, how does the system distinguish "bad spec" from "bad execution"?
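On the first of those questions, here is one plausible shape a machine-generated spec could take, purely illustrative, using the counselor-outreach objective from earlier. The key property is that the spec is testable: it carries a measurable target and machine-checkable acceptance criteria, and anything without them is rejected before entering the factory:

```python
# Illustrative spec shape; field names are assumptions, not a real schema.
spec = {
    "objective": "Counselors can reach students who never visit their office",
    "target_metric": {"name": "student_response_rate", "target": 0.20,
                      "window_days": 30},
    "constraints": ["existing design system", "WCAG accessibility requirements"],
    "acceptance_criteria": [
        "counselor can compose and send a broadcast in under 3 steps",
        "student receives the message on mobile",
    ],
}

def is_buildable(candidate: dict) -> bool:
    """Gate before the factory: a spec must carry a measurable target
    and at least one acceptance criterion, or it is sent back for repair."""
    has_target = bool(candidate.get("target_metric", {}).get("target"))
    has_criteria = len(candidate.get("acceptance_criteria", [])) > 0
    return has_target and has_criteria
```

A gate like this also gives the system a handle on the bad-spec-versus-bad-execution question: a feature that fails its acceptance criteria points at execution, while a feature that passes them but misses its target metric points at the spec.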
This is where Willow has a genuine and underappreciated advantage. We've already built a curriculum factory — a working system that generates, refines, and ships structured educational content through an AI production loop with human oversight. That's a spec-to-output pipeline running in production today.
The hard problems of translating domain intent into structured, testable output are problems we're already solving on the curriculum side. The prompt architectures, quality gates, and iteration patterns from that system are directly transferable to product spec generation.
Most teams attempting autonomous development have zero experience with AI-driven spec generation at scale. We have a working factory that does exactly this in an adjacent domain. The curriculum factory isn't just a cultural proof point — it's foundational infrastructure for the spec layer.
The barriers worth exploring
This is new territory. The architecture is plausible, the technology is ready, but there are real questions that can only be answered by building.
Can synthetic users screen effectively?
Synthetic persona agents can catch obviously broken designs and screen for baseline usability. But they're a pre-filter, not a crystal ball. The real question is whether the calibration loop — comparing synthetic predictions against real user behavior cycle after cycle — can close the fidelity gap fast enough to be useful. We're betting on directional accuracy improving over time, not precision from day one.
Can AI distinguish genuine opportunities from noise?
Not every feature request is an opportunity. Not every usage pattern is a signal. The hardest judgment call in product development is separating the things that matter from the things that are merely loud. Humans develop this instinct over years. We don't yet know whether an AI system can develop it over cycles — but the calibration loop (where the system's predictions are tested against reality) is designed to teach it exactly this skill.
Will the system keep polishing instead of innovating?
An optimization system naturally converges on incremental improvements — like an organism that's well-adapted to its current environment but can't survive a change. The system needs the equivalent of genetic drift: an explicit exploration budget dedicating a percentage of cycles to low-evidence, high-potential ideas. Without it, the product over-fits to today's signals and misses tomorrow's opportunities.
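An explicit exploration budget can be sketched as an epsilon-style selection rule: most cycles exploit the top-ranked, evidence-backed opportunity, while a fixed fraction draws from a pool of low-evidence wildcards (the human-injected bets described later). The function and parameter names are hypothetical:

```python
import random

def pick_opportunity(ranked: list, wildcards: list,
                     explore_rate: float = 0.10, rng=random.random):
    """ranked: evidence-backed opportunities, best first.
    wildcards: low-evidence, high-potential ideas, often human-injected.
    With probability explore_rate, spend the cycle on a wildcard;
    otherwise exploit the top-ranked opportunity."""
    if wildcards and rng() < explore_rate:
        return random.choice(wildcards)
    return ranked[0]
```

Injecting the random source as a parameter keeps the rule testable; the 10% default mirrors the exploration budget mentioned later in the article.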
Will the product grow too complex?
Organisms that grow too complex lose to leaner competitors. Every new feature adds surface area, maintenance cost, and cognitive load. The system needs metabolic cost accounting — measuring not just whether a feature drives engagement, but whether it's worth the complexity it introduces. The system must be as comfortable pruning as it is growing.
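Metabolic cost accounting could be as simple as a net-value ledger per feature: a feature must pay for the complexity and maintenance burden it introduces, or it becomes a pruning candidate. All quantities here are assumed to be normalized scores; the bookkeeping is the point, not the units:

```python
def net_value(engagement_value: float, complexity_cost: float,
              maintenance_cost: float) -> float:
    """A feature's metabolic balance: what it contributes minus what
    it costs the organism to carry."""
    return engagement_value - complexity_cost - maintenance_cost

def prune_candidates(features: dict) -> list[str]:
    """features maps name -> (value, complexity, maintenance).
    Features with a negative balance are flagged for pruning."""
    return [name for name, costs in features.items() if net_value(*costs) < 0]
```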
How long before the system earns autonomy?
The system is an infant organism that needs supervision. The first cycles will be wrong. This is expected — the calibration loops exist to fix exactly this. But it means the system needs human oversight at first, with autonomy expanding as competence is demonstrated — which is how every living thing develops. The question is how steep that learning curve is, and how many cycles it takes before review becomes optional.
Freed to do the most human things
The point of the Living Product isn't to remove humans. It's to liberate them from the work the system can handle so they spend every moment on the few things only humans can do — the highest-leverage, most creative, most consequential decisions.
Right now, product teams burn most of their time on execution. Translating a known need into a spec. Iterating on designs. Writing code. Fixing bugs. Running tests. Monitoring dashboards. Triaging tickets.
All of it necessary, almost none of it uniquely human.
The Living Product absorbs that entire layer — signal collection, opportunity ranking, design generation, building, shipping, healing, and strengthening — running continuously in the background.
That frees the humans for work the system will never do on its own.
Vision. Where should this product be in three years? What market shift is coming that doesn't show up in any current signal? What would make this not just useful but transformative? The system optimizes within the current landscape. Humans reshape the landscape.
Intuition. The signal fusion engine can tell you what the data says. It can't tell you what the data means. Someone who's spent a thousand hours with school counselors knows things that no telemetry pipeline will ever surface — the unspoken frustrations, the political dynamics of a district, the way a counselor's face changes when you show them something that actually solves their problem.
Relationships. Our CEO isn't on sales calls because an AI can't talk. He's there because trust is built between humans. The partnerships, the reputation, the handshake that closes a deal — that's irreplaceable, and it's where a leader should spend 100% of their energy, not relaying feature requests to a product team.
The bold bets. The system will never independently propose something radical — a pivot, a new market, a feature so different it has no signal in current data. That's the 10% exploration budget, and it's human-directed. Humans inject the mutations that the organism can't generate on its own. The system handles evolution. Humans handle revolution.
Think of it this way: the Living Product is the operating system running in the background. Humans are the ones deciding what kind of computer to build.
The unfair advantages
We have the users. Thousands of weekly active users generating behavioral data right now. That's not just a customer base — it's a calibration engine. Every prediction the system makes can be tested against reality. Most teams building AI development tools don't have this. They build and hope. We build, measure, and learn.
We have the domain knowledge. Hundreds of hours of user research with school counselors and students. Deep understanding of EdTech workflows, pain points, and buying patterns. This becomes the training signal for synthetic user personas far more accurate than generic models — because they're built from real conversations with real users, not demographic stereotypes.
We have the codebase consistency. A single developer maintaining a 200,000-line codebase means every pattern is predictable. AI agents write better code when the codebase has consistent conventions. Most companies' codebases are a patchwork of styles from dozens of developers over many years. Ours is clean, consistent, and machine-readable.
We already have a software factory — it's just not fully dark yet. Willow ships features through an AI-augmented pipeline today:
- Our designer works in collaboration with AI, but she's still in the loop — iterating through rounds of revision, evaluating layout choices, refining interactions.
- I build with AI coding agents and a sub-agent architecture that handles specialized tasks, but I'm still steering — reviewing output, giving feedback, pushing through iteration cycles until the code is right.
- Our QA lead tests with AI assistance, but she's still the one catching edge cases — filing bugs, verifying fixes, looping back when something doesn't hold up.
And it's not just the software side. Our curriculum team has built a content factory that's already solving the hardest piece — translating domain intent into structured, testable AI output. The factory mindset isn't isolated to one function. It's the operating culture.
We're not theorizing about a factory. We're running one — and the whole team gets it.
Everyone on the team is already operating as a supervisor of AI-driven workflows, not as the primary producer. Every week, the balance shifts a little further — more autonomy for the agents, fewer loops back to the humans.
The path from where we are to a fully dark factory isn't a leap. It's a gradient, and we're already well along it. What remains is systematically removing the friction in each loop: giving the design agents better constraints so they converge faster, giving the coding agents deeper codebase context so they need less correction, giving the testing agents more comprehensive scenarios so they catch what a human catches.
Each improvement shrinks the human loop until it becomes optional rather than essential.
The timing is right — and about to get better. The foundational models crossed a reliability threshold in late 2025 that makes long-horizon autonomous coding viable. The tooling for synthetic user testing is emerging. The concept of spec-driven autonomous development went from theoretical to documented in production in early 2026.
And Anthropic's next frontier model, Mythos, is on the horizon — expected to be a leap in capability, not an increment.
Here's the strategic bet: it's a stretch to think current models can run this entire system end to end today. But if we build the architecture now — the signal infrastructure, the intelligence engine, the factory pipeline — by the time we're done, the model powering it may be exactly right.
We're building the tracks. The faster train is coming.
The product that evolves
We shape the conditions for growth. The product senses its environment, decides how to adapt, builds new capabilities, tests them against reality, and encodes what it learns. Every cycle it gets fitter — more attuned to what users need, more effective at building solutions they'll love, more accurate at predicting what will work.
The question isn't whether this is possible. The pieces exist. The question is who builds it first — and what kind of advantage a living product creates over one that's merely built.