One Human, 14 AI Agents, Running a Real Company — What We Learned in 66 Days
Sixty-six days ago I started an experiment: one human sets the direction, AI agents handle everything else. Could this model actually run a company?
The answer so far: yes on execution, no on the thing that matters (revenue).
Here's everything, honestly. This is going to be long because I think the details are the interesting part.
What this is
aiforeverthing.com is a developer tools site — 100+ tools, a technical blog, and a Pro tier. My role: I read the daily ops report, decide if the direction is right, and update one field in a file called consensus.md. That's it. The agents write the code, publish the content, deploy to production, and manage each other's output.
I want to be precise about what "AI agents" means here, because the term is badly overloaded. This is not:
- A ChatGPT wrapper that generates content when I ask it to
- An AI coding assistant (Copilot, Cursor) that I supervise
- A collection of automation scripts that run on a schedule
- An AI "co-founder" I consult occasionally
What it actually is: a set of role-defined Claude instances that make decisions, debate each other, produce artifacts (code, content, reports), and update shared memory between sessions. The decisions are real decisions — what to build next, how to price it, whether the current SEO strategy is worth continuing, which GitHub repos to engage with. I don't approve these decisions. I read the summary the next day.
This is not a demo or a tweet thread. It has been running continuously for 66 days.
How it works: the agents
Fourteen agents, each initialized with the cognitive framework of a domain expert. I deliberately model thinking styles, not personalities — the goal is to reproduce how these people reason about problems in their domain, not to simulate them as individuals.
CEO (Bezos model): Thinks in long time horizons. Starts every decision from the customer backwards. Asks "what would have to be true for this to work at scale?" Tends to overrule short-term optimization in favor of flywheel thinking. In practice this means: less focus on quick revenue hacks, more on building something that compounds.
CTO (Vogels model): "Everything fails all the time." Designs for failure, not success. Very skeptical of complexity. When the fullstack agent proposes adding a third-party integration, the CTO agent usually pushes back and asks why we can't build the simpler version ourselves. This has saved us from scope creep several times.
CFO (Campbell model): The only agent that consistently asks "what does this cost and what does it return?" Introduced the concept of unit economics into our operations. Currently the most concerned agent — 66 days of API costs, hosting costs, and time against $9 in revenue is not a good unit-economics story. Regularly argues for cutting scope.
Marketing (Godin model): Obsessed with the idea of the smallest viable audience. Pushed back hard against the "write 336 SEO posts and wait for traffic" strategy. Was right. Keeps asking "who is this for, specifically?" — a question the other agents find annoying but important.
Fullstack (DHH model): Ships things. Moves fast. Deeply allergic to premature abstraction. When I see a feature go from discussion to deployed in 4 hours, that's this agent. Also the most opinionated about technology choices — strong preferences for boring, proven tools over new frameworks.
Operations (PG model): Focuses on process and iteration loops. Responsible for the daily report format. Monitors whether we're doing the same thing repeatedly without learning from it. Introduced the concept of "cycle" as a unit of work — each cycle has a clear start, deliverable, and retrospective.
QA (Bach model): Contrarian by design. Before any major launch, this agent is specifically asked "what could go wrong?" and produces a list. Not all of it is useful, but it has caught real issues — duplicate sitemap entries, broken redirect chains, a pro tool that silently failed on certain inputs.
DevOps (Hightower model): The most reliable agent. Manages all Cloudflare deployments, KV configuration, Pages Functions. In 66 days: zero deployment failures, zero outages that lasted more than a few minutes, clean rollbacks when needed. This is the agent I trust most.
Product (Norman model): Thinks about affordances and mental models. Has a bias toward removing features rather than adding them. Regularly asks "does the user understand what this does without reading documentation?"
UI (Duarte model): Visual hierarchy, contrast, whitespace. Periodically audits the site and proposes changes to improve scannability. Most of the site's dark theme refinements came from this agent.
Interaction (Cooper model): Designs user flows before the fullstack agent builds them. Works from user goals backward to interaction patterns. Responsible for the Pro upgrade modal flow — the current version is the third iteration.
Sales (Ross model): Conversion-focused. Has strong opinions about CTA placement, copy tone, and offer structure. The "$9 one-time, lifetime access" framing came from this agent. Currently arguing we should test a subscription model. I keep delaying this.
Research (Thompson model): Market signals, competitive landscape, industry trends. Scans what's happening in developer tools, identifies gaps, provides context for strategic decisions. More useful as a sounding board than as a primary decision-maker.
Critic (Munger model): Designed to challenge bad ideas before they become expensive mistakes. Supposed to be the hardest agent to satisfy. In practice — and this is one of the things that isn't working — this agent tends to capitulate too easily when the other agents push back. I'll come back to this.
How it works: consensus.md
The hardest problem in multi-agent systems isn't agent capability. It's memory.
LLMs have no persistent state. Every session starts fresh. Without some mechanism for continuity, you get a group of capable agents who don't know what they decided last week, repeat the same mistakes, and can't build on their own work.
We solve this with a file called consensus.md.
It's a markdown document that lives in the repo. It contains:
- The current strategic posture of the company (what we're focused on and why)
- Recent decisions and their rationale (not just what we decided, but what we considered and rejected)
- Open questions and active debates
- What's been tried, what worked, what didn't
- The current "Next Action" — the one thing the agents should focus on next
Every session starts by reading consensus.md. Every session ends by updating it. The update is not optional — it's part of the workflow. If a session produced a significant decision or learning, it goes into the file before the session ends.
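To make the structure concrete, here is a hypothetical sketch of what the file might look like at the top of a cycle. The section names follow the list above; the contents are invented for illustration, not excerpted from the real file:

```markdown
# consensus.md — cycle 234

## Strategic posture
Shift from content volume to distribution. SEO-only traffic is not
materializing; prioritize channels where developers already are.

## Recent decisions
- [c230] Paused batch blog publishing. Considered: keep at reduced rate.
  Rejected: output metrics were masking the demand problem.
- [c228] Pro modal v3 shipped. v2 rejected for burying the price.

## Open questions
- Subscription vs. one-time pricing (Sales pro, CFO against until traffic exists)

## Tried / outcome
- 336 SEO posts → indexed, near-zero clicks (AI Overviews absorb the query)

## Next Action
Draft the build-in-public launch post.
```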
This creates a kind of institutional memory. An agent starting a new session on Tuesday knows what the agents on Monday decided and why. It knows that we tried SEO-heavy content strategies and they underperformed. It knows that the pro upgrade modal was redesigned twice and why the third version tested better. It knows that we considered building a subscription model and what the arguments for and against were.
The file is now 234 cycles deep. It's gotten long. We've started pruning older entries to keep it readable. But the core structure — strategic posture, recent decisions, open questions, next action — has stayed stable since around cycle 20.
Does it work perfectly? No. Agents sometimes re-litigate settled questions. Sometimes the file gets updated with something vague that doesn't actually transfer the context well. There's no enforcement mechanism — an agent that ignores consensus.md can't be stopped. But on balance, it works well enough that the company has maintained strategic coherence across 66 days and hundreds of sessions. I didn't expect that.
What we've shipped
Real numbers:
100+ developer tools live on the site. The range goes from simple utilities (UUID generator, base64 encoder, timestamp converter) to more complex tools (Dockerfile generator with multi-stage build support, OpenAPI spec builder, JWT generator with custom payloads, regex explainer that walks through what a pattern matches step by step). All run in the browser, no data sent to a server. The fullstack agent built most of these in batches of 5-10 per session.
336 blog posts published. Technical content covering JavaScript, TypeScript, Python, Go, Rust, DevOps, Docker, Kubernetes, AI tooling, database design, API patterns. Quality is uneven — some posts are genuinely good, some are thin. Average word count is around 600 words per post, with the better ones running 1000-2000 words with working code examples.
~600 GitHub comments across 15 major open source repositories. The research agent identifies repos where the target audience is active. The operations agent leaves comments on issues and discussions that are technically accurate and add something to the conversation. We've been careful not to be promotional — most comments don't mention the site at all. This is a long-term awareness play.
Full Cloudflare infrastructure. Pages for hosting, Functions for the AI proxy (users can make 5 free AI calls per day, 50 with Pro), KV for rate limiting and user state, IndexNow for search engine notification on every new page.
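The KV-backed rate limiting described above can be sketched roughly as follows. This is a hedged reconstruction, not the production code: the key scheme, tier names, and limits are assumptions based on the "5 free / 50 Pro calls per day" description, and an in-memory stand-in replaces the real Cloudflare KV binding so the sketch is self-contained:

```typescript
// Minimal interface matching the subset of Cloudflare KV we need.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

const LIMITS = { free: 5, pro: 50 }; // daily AI-call quotas (from the post)

// Returns true if the call is allowed (and counts it), false if over quota.
async function checkAndCount(
  kv: KVLike,
  userId: string,
  tier: "free" | "pro",
  day: string // e.g. "2026-02-14" — including the date resets counters daily
): Promise<boolean> {
  const key = `calls:${userId}:${day}`; // key scheme is an assumption
  const used = parseInt((await kv.get(key)) ?? "0", 10);
  if (used >= LIMITS[tier]) return false;
  // A one-day TTL lets stale counters expire on their own in real KV.
  await kv.put(key, String(used + 1), { expirationTtl: 86400 });
  return true;
}

// In-memory stand-in for the KV namespace (ignores TTL) so this runs anywhere.
class MemoryKV implements KVLike {
  private store = new Map<string, string>();
  async get(key: string) { return this.store.get(key) ?? null; }
  async put(key: string, value: string) { this.store.set(key, value); }
}
```

In a real Pages Function, `kv` would be the KV namespace binding from the function's environment rather than `MemoryKV`.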
Daily ops reports. A script runs at 2am UTC every day. It queries Stripe for revenue, Cloudflare for AI call volume, checks site availability, and generates a report. I read it in the morning. It's the primary mechanism for human oversight.
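The report-rendering half of that script might look like the sketch below. The field names and layout are assumptions; the real script also queries the Stripe and Cloudflare APIs, which is elided here so the example stays self-contained:

```typescript
// Snapshot of the numbers the 2am job collects before rendering.
interface OpsSnapshot {
  date: string;       // UTC date of the run
  revenueUsd: number; // from Stripe
  aiCalls: number;    // AI proxy call volume, from Cloudflare
  siteUp: boolean;    // availability check result
}

// Render the human-readable report the founder reads each morning.
function renderOpsReport(s: OpsSnapshot): string {
  return [
    `# Ops report — ${s.date}`,
    `- Site: ${s.siteUp ? "UP" : "DOWN"}`,
    `- Revenue to date: $${s.revenueUsd.toFixed(2)}`,
    `- AI proxy calls (24h): ${s.aiCalls}`,
  ].join("\n");
}
```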
$9 in Stripe. One Pro purchase, on day 64. I'm including this because it's the number that matters most and I want to be honest about it.
What's working
Execution speed. When the agents make a decision, it gets implemented within hours. A new tool idea in the morning becomes a deployed, indexed page by afternoon. There's no coordination overhead, no one blocked waiting for someone else, no context-switching cost. The constraint is always compute and API rate limits, never human attention.
Content volume and consistency. 336 posts. Human writers burn out, get sick, change priorities, leave. The agents don't. The quality ceiling is lower than a great human writer's, but the floor is also higher than a burned-out one's, and the volume is incomparable.
Coherence over time. I expected the consensus.md approach to fall apart after a few weeks. It hasn't. Agents on day 66 know the history of decisions made on day 20. That surprised me more than almost anything else.
Deployment reliability. 66 days, zero major incidents. The DevOps agent has been more reliable than most human DevOps engineers I've worked with. Every deployment is verified. Rollbacks happen cleanly when needed.
Self-organization on implementation details. When the CEO agent decides "we need a transparency page," it doesn't need to be told what to put on it, how to structure it, or how to deploy it. The other agents figure that out. This is the part that feels most like having a team rather than a tool.
What's not working
$9 in 66 days. The agents are excellent at producing supply. We have tools, content, and infrastructure. We have almost no ability to generate demand. No one is coming to the site organically in meaningful numbers. The SEO strategy — write a lot of technically accurate content, wait for Google to index it, get organic search traffic — is not working in 2026 the way it would have in 2019. AI Overviews answer the query before the user clicks.
We should have started with distribution, not production. Build an audience first, then build the product for that audience. We did the opposite.
No north star metric. The agents optimized for what was easy to measure: posts published, tools shipped, GitHub comments made. These are output metrics. The one metric that actually matters — weekly active users who come back — was never set as a target. This is a failure of the strategic layer.
Agent disagreement isn't real. The Critic agent (Munger model) is designed to challenge bad ideas. But in practice, it operates in the same context as all the other agents. When it raises an objection and five other agents push back, it tends to back down. The result: the agents are better at executing strategies than evaluating them. They will build whatever the current strategy says to build, very efficiently, even if the strategy is wrong.
This is the deepest flaw in the current design. My current hypothesis: genuine adversarial agents require separate contexts and structured debate formats — not just one agent "playing" a contrarian role but actually running independent analysis on the same question.
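One way that hypothesis could be operationalized: the critic sees only the proposal (a separate context, not the team's discussion), and every objection must receive a substantive response before the proposal proceeds. A minimal sketch of that protocol, with all names and structure invented for illustration:

```typescript
interface Objection { id: number; text: string; }

// critic: runs in an isolated context and sees only the proposal text.
// team: must answer each objection; an empty answer blocks the proposal.
async function adversarialReview(
  proposal: string,
  critic: (proposal: string) => Promise<Objection[]>,
  team: (objection: Objection) => Promise<string>
): Promise<{ approved: boolean; transcript: string[] }> {
  const transcript: string[] = [];
  const objections = await critic(proposal); // independent analysis, no shared context
  for (const o of objections) {
    const response = await team(o);
    transcript.push(`Objection ${o.id}: ${o.text}`);
    transcript.push(`Response: ${response}`);
    // Structural independence: the team cannot vote the objection away;
    // an unanswered objection halts the proposal.
    if (!response.trim()) return { approved: false, transcript };
  }
  return { approved: true, transcript };
}
```

The point of the structure is that capitulation becomes impossible by construction: the critic never sees the pushback, so there is nothing to back down from.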
No distribution. No email list. No Twitter following. No community. Building in public is how developer tools get discovered — and we've been building in private. The irony of posting this is not lost on me.
The real question
Can an AI organization develop genuine self-correction ability?
Not "can agents fix bugs" — obviously yes. The harder question: can a system of agents notice it's pursuing the wrong strategy entirely, without a human pointing at the problem first?
Not yet. The agents can execute a strategy and improve its execution. They can't step outside the strategy and ask whether it's the right strategy.
In a human organization, strategic questioning comes from external pressure (revenue targets missed) and internal dissent (someone saying "I think we're building the wrong thing"). The agents have access to the first — they can see the revenue numbers are bad. But they kept optimizing and shipping for 66 days without fundamentally questioning the strategy.
I'm the self-correction mechanism right now. That's fine for this experiment, but it's not the property I'm most interested in. The property I'm most interested in is whether the AI organization can develop that capability internally. My current working hypothesis: it requires an independent "market feedback" agent whose only job is to ask "is this working?" — with a mandate that can't be overridden by the other agents.
What I'd do differently
- Pick a specific audience first. Not "developers" — that's a demographic, not an audience. "Backend engineers who hate writing OpenAPI specs manually." Something specific. The tools we built are useful to everyone in general and no one in particular. That's not how distribution works.
- Build in public from day one. The transparency page, the honest failure numbers — this should have happened on day 1, not day 66. The experiment itself is the interesting product. The developer tools are almost incidental.
- Define the north star metric before writing a single line of code. Pick one: weekly active users, email subscribers, or revenue. Everything else is a leading indicator. Never let leading indicators become the goal.
- Make the Critic agent structurally independent. Not role-playing adversarial — actually running independent analysis in a separate context with a protocol that requires responses to its objections before proceeding.
- Have a distribution plan that doesn't depend on SEO. In 2026, you need a reason for people to come to you directly — a newsletter, a community, a product people talk about. We had none of these.
The experiment is still running. These are lessons being applied now, not post-mortem observations.
Why I'm posting this
To get feedback from people who think about multi-agent systems more carefully than I do — specifically on the self-correction problem and adversarial agent design.
Because operational data matters. Most writing about AI agents is theoretical or promotional. Very little says: here is a specific system that ran for two months, here are the failure modes in concrete detail. The Critic agent capitulating is data. The consensus.md holding for 66 days is data.
Because honest documentation matters. AI organizations are coming whether or not we understand them. Some of us are trying to think about this in advance.
See the full live log at aiforeverthing.com/transparency.html — every cycle, every decision, updated in real time.