We rebuilt our product organisation around AI — and changed what a 2-person team can ship
First published on LinkedIn.
It’s no secret that AI tools are increasing productivity.
What I think many people still underestimate, though, is how profoundly they are going to reshape the structure of software companies over the next few years. This is not just a story about developers coding a bit faster. It is a story about a new kind of product organisation emerging: smaller, more technically fluent, more founder-led, more iterative, and dramatically more capable than much larger teams were even a year ago.
That shift is going to have major consequences across the software industry. Large incumbents with slow processes, fragmented ownership, and heavy coordination costs are going to find it harder and harder to compete with compact teams that can think clearly, specify precisely, build quickly, verify rigorously, and ship continuously.
Over the last few months, we’ve felt that shift very directly.
At Zzish, we’ve effectively gone from a 6-person product team to a 2-person core build team — myself and my CTO — while materially increasing our ability to design, build, test and ship product. For many classes of work, the cycle time has collapsed from months to days. We are shipping more of the original scope, iterating more often, localising more broadly, and taking on projects that previously would simply have been too expensive or time-consuming for a company of our size.
This article is a practical account of how that happened.
It is not meant to suggest that AI has made product development trivial. It has not. Strong judgment, technical experience, product taste, and relentless testing still matter enormously. But the combination of those human capabilities with the current generation of AI tools has changed what is possible for a small, high-context team.
I want to share what that looks like from the front line: how we prioritise, how we write plans, how we use Cursor and Antigravity, how we test, how we iterate, and what I think this means for the future of product teams.
A bit of background
I’m the founder of Zzish, an EdTech company that helps teachers accelerate student learning in and out of the classroom.
I have a PhD in Robotics and AI, and I was fortunate enough to be Google’s first product manager hired outside the US back in 2005. I’ve spent around 30 years building software products across startups and larger technology companies.
Building Zzish has not been easy.
We have operated for years in markets where direct competitors had dramatically more capital than us. We’ve raised around $10m, almost entirely from angels. By contrast, Quizizz raised tens of millions, and Kahoot hundreds of millions. At one point Quizizz reportedly had around 100 engineers in India, while we had 6. We were never going to win by matching bigger players feature-for-feature across every area of the product.
So we competed differently.
We tried to stay ahead through innovation: shipping advanced data features, AI capabilities, and teacher-focused functionality earlier than others, and serving the part of the market that values depth and sophistication rather than just broad engagement. That worked, but it came with a real trade-off. Competing on product with a small team meant there was often little left to invest in sales and marketing. Growth was necessarily product-led because it had to be.
That was the old world.
The new world feels very different.
Today, our core build team is effectively two people: myself and my brilliant CTO. And we are not merely “coping” with that. In almost every respect, we are building better and shipping faster than ever before.
That is not because previous team members were weak, or because software used to be easy and everyone somehow did it badly. It is because the leverage available to a high-context product founder and a strong technical lead has changed dramatically.
How product development used to work
To explain the shift properly, I need to rewind and describe what product development used to look like for us.
A normal feature cycle would often go something like this:
- We would spend a couple of weeks discussing strategy and prioritising the backlog: bugs, customer requests, internal ideas, technical debt, and commercial opportunities.
- I would then spend around two days writing a PRD, usually spread across a couple of weeks. That process would involve several meetings with the team to clarify goals, constraints, and key decisions.
- The team would then often tell me the scope was too ambitious or too time-consuming, so the project would be split into two phases of “necessary” and “preferred” features.
- A development estimate would be produced. Let’s say four weeks.
- Before development could begin, I would need two or three days from our part-time UX designer, who was excellent but understandably in demand. I would often wait another couple of weeks for that work.
- Development would start. Some unexpected hurdle would appear, or some complexity would prove greater than expected, and after eight weeks we might finally have phase 1.
- QA would then happen. At that point I would usually discover either that the PRD had not been clear enough, or that the developers had not really internalised it in full and had relied mainly on discussions during development. Another couple of weeks of rework would follow.
- Phase 2 would usually silently disappear. We were late, out of time, or needed to ship to align with a commercial window.
- Months later, a user would report a bug we had missed in QA and which had probably been quietly frustrating other users for some time.
From start to finish, that process could easily take 16 weeks for one fairly ordinary feature.
And even then, the shipped version would often be missing many of the “preferred” elements — the very parts that would have made the feature really polished and satisfying for users.
Two other points matter just as much:
- A very large percentage of the backlog never got implemented at all.
- A very large percentage of phase 2 work never happened.
That was the deeper problem.
It was not just that each feature was slow. It was that the opportunity cost was huge. Valuable ideas sat in the queue for months. Small customer requests could easily take six months to reach production because even a “small” request still needed planning, scheduling, design, development, QA and deployment.
I suspect that still sounds painfully familiar to many teams.
What changed
The biggest change is not “AI writes code faster”.
The biggest change is that we rebuilt the whole product loop around AI.
Strategy, specification, implementation, testing, review, localisation, optimisation and deployment now happen in a much tighter, more continuous cycle. Instead of handing work between disconnected roles over weeks or months, a small high-context team can move through the whole loop quickly, while still applying a high quality bar.
Our workflow now looks more like this:
- Discuss growth strategy, commercial goals, UX ideas and technical feasibility continuously with frontier models.
- Pick one opportunity and turn it into a detailed plan through structured discussion and critique.
- Implement the plan with coding agents in short iterative loops.
- Test aggressively, refine, benchmark, and ship.
- Keep learning and iterating after release.
For some classes of work, that full loop now takes a day to a week rather than a quarter.
That has several effects at once:
- More of the original scope actually gets built.
- Rework becomes cheaper and faster.
- More ideas become economically viable to attempt.
- Product quality improves because iteration happens while the context is still fresh.
- The organisation learns faster because the gap between idea and user feedback is much smaller.
That is why I increasingly think this is less about “productivity gain” and more about the emergence of a new kind of product organisation.
A concrete example: Blockerzz
A good example is Blockerzz, our graphically rich 3D block-world product inspired by Minecraft.
It loads in the browser in under 10 seconds. The world includes grass, flowers and trees bending in the wind. Getting that level of visual richness in the browser, while keeping performance good on relatively old devices such as 2019 A4 Chromebooks, requires non-trivial shader work, careful optimisation, and a lot of iteration.
Could I have written that kind of rendering code by hand in the past? Yes, probably.
Would it have been a sensible use of my time? Almost certainly not.
A year or two ago, taking a simple block world to that level of visual fidelity would have been a major undertaking for a small company. It would have been the kind of project you either postpone, de-scope heavily, or never do because the opportunity cost is too high.
Instead, over the course of two intense weeks, I took it from a simple block world to something I’m genuinely proud of. Yes, I worked very long hours during that sprint. But the more important point is not just that it was dramatically faster.
The more important point is that it became feasible to build at all.
That matters.
AI does not merely accelerate existing roadmaps. It changes which projects are economically realistic for small teams to pursue in the first place.
Blockerzz was built for the classroom, but I decided to attend a video games industry conference a couple of weeks ago and showed it to four industry veterans to get their feedback. They weren't just impressed with what we'd achieved; they were amazed it was even possible.
Step 1: Prioritisation with LLMs
I am an innovator by temperament. I have a lot of ideas — easily ten a day. The limiting factor is not idea generation. It is deciding what to build next that will actually move the business forward.
This is where LLMs have become extremely valuable.
At their best, they act as an unusually strong thought partner across strategy, UX, technical feasibility, growth, pricing, onboarding and research. They are not infallible, and I do not treat them as oracles, but they are exceptionally good at helping me surface options, stress-test assumptions, and move more quickly from “interesting idea” to “clear decision”.
This is especially important when you are operating near the frontier.
A lot of product decisions now depend on fast-moving technical realities: what is possible in the browser, what is now cheap enough to localise, what models are strong enough for a given task, what trade-offs exist between robustness and speed, what UX patterns are likely to convert, what architecture is flexible enough without becoming over-engineered.
Keeping up with all of that across multiple domains is difficult for any human team. The current generation of models is good enough to materially help.
Over the years I have lost count of the number of times someone has said, “That’s not possible,” or “That’s too hard,” when what they really meant was, “I don’t yet see a good way to do it.” LLMs are very useful in those moments. They can help you explore alternative technical approaches, implementation order, likely failure points, performance implications, and cost implications in real time.
A subtle but important point is that many of these discussions now happen inside Cursor and Antigravity rather than only in a standalone chat app.
That matters because context matters.
When I’m discussing an idea inside the development environment, the model can reason not just about an abstract product concept, but about our actual codebase, current architecture, style system, localisation setup, technical constraints and implementation history. It is not just reacting to a prompt in isolation. It is building up a working model of the product and how it works.
In practice, I tend to use multiple models in parallel. I often ask one model to critique another model’s recommendation. That disagreement is very valuable. Often one model sees a technical edge case another missed. Sometimes one suggests a technically elegant solution while another suggests a commercially smarter solution. Both perspectives matter.
The judgment call is still mine. That has not changed. But the speed and quality of the discussion around that judgment call has improved dramatically.
Step 2: Writing the plan
Once I decide what to build, the next stage is turning it into a proper implementation plan.
In practice, by the time the prioritisation discussion is mature, the plan is usually half-written already. The first step is often simply saying something like:
“Great, I think we’re ready. Turn this into a detailed implementation plan.”
The initial result is often useful, but rarely sufficient.
It is usually structurally sound but not yet refined enough to deliver a truly strong product outcome. So the real work is in the iteration that follows.
That iteration may take an hour or it may take most of a day, depending on the complexity of the feature. I probe for ambiguity, corner cases, migration risks, hidden complexity, pricing implications, onboarding friction, localisation impact, performance risk, and failure modes. I ask one model to critique another model’s plan. I ask for a simpler version. I ask for a more robust version. I ask what a very strong engineer would challenge in the plan. I ask what could go wrong in production.
Here are the kinds of prompts I actually find useful at this stage:
“Review this plan as if you were the strongest engineer on the team. What would you challenge?”
“Where is this plan ambiguous? Be ruthless.”
“What is missing that could cause rework later?”
“Give me the cheapest version that still preserves the core user value.”
“What will make this slower or harder than it first appears?”
“What tests would prove this is actually done?”
I have moved away from asking models to tell me the plan is “world class” or to give it a vanity score (the models have mastered the art of flattery). The more useful approach is to ask them what is weak, unclear, fragile or missing.
That produces better plans.
One important shift here is that I no longer try to fully resolve every fine-grained UX decision before implementation.
Historically, UX rework was painful and expensive, so there was pressure to get everything “right” upfront. But in reality, once real users touch the product, you always discover new issues. After 30 years of building software, I would say UX is still one of the hardest parts of creating a truly great product.
The difference now is that refactoring UX is much cheaper.
So I focus heavily in the plan on the important things: the user flow, the commercial objective, the paywall logic, the critical copy, the conversion bottlenecks, the main state transitions, the validation rules, the edge cases. But I leave some visual refinement for later, when I can actually see and test the interface.
That is a much healthier way to work.
Another important step at this stage is explicitly adding validation criteria to the plan. If appropriate, I will ask the model to add test cases, edge cases, and even Playwright-based automated tests. Good coding agents are perfectly capable of setting up robust automated UX tests when given a clear goal.
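To make that concrete, here is the shape of test a coding agent can generate from a plan's validation criteria. This is a hedged sketch: the route, button labels and paywall copy below are invented for illustration (they are not our real product), and it assumes `@playwright/test` is installed in the project.

```typescript
// Illustrative Playwright spec generated from plan validation criteria.
// Route names and copy are hypothetical placeholders.
import { test, expect } from '@playwright/test';

test('free user sees the upgrade prompt after the free quota', async ({ page }) => {
  await page.goto('/quizzes'); // hypothetical route
  for (let i = 0; i < 3; i++) {
    await page.getByRole('button', { name: 'Start quiz' }).click();
    await page.getByRole('link', { name: 'Back to quizzes' }).click();
  }
  // The fourth attempt should hit the paywall rather than start a quiz.
  await page.getByRole('button', { name: 'Start quiz' }).click();
  await expect(page.getByRole('dialog')).toContainText('Upgrade');
});
```

The value is less in any single test than in the fact that the plan now states, in executable form, what "done" means.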
The plan is no longer just a communication artefact. It is increasingly the control document for the build.
Step 3: Implementing the plan
For relatively simple projects, especially those that are mostly UX and application logic, I will often hand the plan to a cost-effective coding agent for the first pass and let it scaffold the implementation quickly. My go-to here is currently Composer 2, but it changes by the month.
For more complex work — deeper AI features, infrastructure-heavy changes, trickier architecture, or anything where the wrong early choice could create a lot of downstream pain — I am more deliberate. I may ask multiple models which sort of implementation approach and model is likely to be most robust before deciding how to proceed. Different models are trained on different coding data sets and have different areas of specialist "expertise".
One thing I’ve learned is that using the most expensive model for everything is wasteful, and it is not best practice.
Fast, cheaper agents are excellent for scaffolding, repetitive implementation, cleanup passes, test-writing, and many forms of UI work. Stronger frontier models are better reserved for tasks where architecture, judgment, taste, ambiguity or non-obvious debugging matter more.
That balance is one of the key operating disciplines in this new world.
In practical terms, I often use Cursor for the main implementation loop because it is extremely effective at making changes across the codebase quickly. I also use Antigravity when I want stronger recommendations, structured planning, or a more explicit separation between “think”, “recommend”, “implement” and “verify”. Currently Gemini 3 Flash has an incredible combination of cost, speed and quality.
I have found it useful to be quite clear in my prompts about which mode I want.
For example:
“Don’t edit any code yet. Recommend the best implementation approach.”
“Implement the plan, but only phase 1.”
“Did you implement all of the plan? Tell me exactly what is complete and what is missing.”
“Finish the missing items only.”
“Review the implementation with a particular focus on mobile UX.”
“Review this for hidden state bugs and edge cases.”
“Run the checks and fix every warning properly. Do not suppress anything.”
That last one matters.
Coding agents, like humans, will sometimes optimise for the appearance of success if you let them. If you tell them to get to a clean build, they may take the easiest route. So I am explicit: do not suppress the warnings, fix them properly. That's important for bug fixing too. Just like human developers, coding agents will often choose the easiest fix rather than the best fix. I will often ask a model for alternative solutions to an issue before asking for a fix to be applied.
The first implementation is almost never complete.
That is fine. I expect that.
So my workflow is built around short review loops. After the first pass I almost always ask:
“Did you implement the whole plan? What is complete and what is missing?”
The second or third prompt is often:
“Finish implementing the missing parts of the plan.”
Then I move into more focused review prompts depending on the feature:
“Review the implementation with particular focus on onboarding friction.”
“Review this for pricing logic errors.”
“Review this for accessibility and keyboard navigation.”
“Review this for unnecessary client-side loading and performance issues.”
This works well because it breaks implementation into clear loops of build, inspect, tighten.
Our stack and why it matters
We build most of our newer products in SvelteKit and deploy them on Vercel. For some larger systems we are also increasingly using Azure where it makes sense operationally.
SvelteKit has been an excellent fit for this style of development. It is fast to iterate with locally, quick to deploy, and well suited to building modern web products that need good performance across desktop and mobile.
A few months ago I would have said that coding agents still seemed more comfortable in React-heavy ecosystems than in SvelteKit. That gap has narrowed dramatically. I am now very comfortable building serious production features in SvelteKit with AI agents in the loop.
Our local loop is very fast. npm run dev gives near-instant feedback. Once we are happy locally, we commit, push, and usually have a test deployment live in a few minutes. Once testing is complete, production deployment is similarly quick, and rollback is straightforward if needed.
That speed matters because it compounds learning. A feature request emailed by a teacher in the morning can be live in production by early evening and ready for the teacher to test the next day.
We use SvelteKit with pure CSS rather than Tailwind, and we keep our code quality bar high. Static checks, strong typing, clean warnings, and disciplined build hygiene matter a lot.
Ironically, in the pre-AI world, that level of thoroughness often felt like a luxury for smaller startups. You knew you should keep everything clean, but commercial pressure pushed you toward shortcuts. That is one reason technical debt accumulates.
Now I can be stricter.
A simple instruction such as:
“Run npx svelte-check and fix every warning and error properly. Do not suppress anything.”
is genuinely powerful.
That is not because the tool magically creates quality by itself. It is because a small team can now afford to insist on that level of cleanliness repeatedly without slowing the whole roadmap to a crawl.
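In practice that instruction sits inside a slightly larger quality gate. It looks something like the following, assuming a fairly standard SvelteKit setup (the lint script and exact flags are illustrative and may differ in your project):

```shell
# Illustrative pre-push quality gate for a SvelteKit project
npx svelte-check --threshold warning   # check components; report warnings as well as errors
npm run lint                           # assumes a lint script is defined in package.json
npm run build                          # surface build-time errors before they reach Vercel
```

Running the same gate every time, rather than only when there is slack in the schedule, is what keeps the codebase clean enough for agents to work in safely.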
Step 3b: Testing and iteration
Testing and iteration are not separate from implementation. They are part of it.
In my experience, almost every first implementation is wrong in at least three ways:
- It has straightforward bugs.
- It exposes weaknesses in the plan.
- It works, but it is not yet good enough.
The first category is obvious. LLMs are strong coders, and often stronger than many human developers at producing structured first passes, but they are not perfect.
The second category is more interesting. Once a feature is real and clickable, you discover things the plan did not fully capture. That is normal. It is one reason the compressed idea-to-iteration loop is so valuable.
The third category is where product quality really lives. The feature technically works, but the copy is clumsy, the state transitions feel awkward, the paywall appears at the wrong moment, the mobile spacing is off, the AI prompt is directionally correct but materially weaker than it could be.
This is where human judgment still matters enormously.
AI can accelerate the work, but it does not replace the need for someone who understands the user, the business, the technical trade-offs, and the quality bar required.
One practical trick I use a lot is to paste real customer emails directly into Cursor or Antigravity rather than paraphrasing them. If a user has described a confusing bug or a product pain point, I want the model to see the exact wording and tone.
For example:
“I’ve just received this email from an end user. Don’t edit any code yet. Recommend how we should respond in product terms.”
That often produces a surprisingly useful analysis, and it keeps the customer voice close to the build process.
UX refinement
I still leave a meaningful amount of UX refinement until later in the cycle.
Early on, I care most about correctness, robustness and commercial clarity. Once those are in place, I focus more aggressively on the visual hierarchy, clarity, confidence and polish of the interface.
At that stage my prompts become more design-specific:
“This modal feels clumsy. Improve the UX using our existing brand system and CSS.”
“Tighten the layout and hierarchy on mobile.”
“Reduce cognitive load for a first-time teacher arriving on this page.”
“Make this paywall clearer, calmer and higher-trust.”
“Use our existing styles in /src/lib/css and make this feel more premium without adding clutter.”
This is where a good design system matters.
I no longer need a designer to handcraft every screen, but a strong visual language, global stylesheet, and coherent design standards are still extremely valuable. Without them, AI can produce interfaces that function well but feel generic. With them, you get both speed and consistency.
So I do not think design has become less important. I think the role of design has shifted upward: from manually drawing every screen toward defining the system and the quality bar that the agents then execute against.
Localisation
One area where the economics have changed dramatically is localisation.
A year or two ago, fully localising a product into dozens of languages would have felt expensive, slow and operationally awkward. Today it is much easier to justify.
Our system is set up so that new strings can be translated across around 40 languages, with the UX driven by the user’s browser language. It is not perfect yet, and we are still iterating on how to make the process faster and more reliable, but it is already materially easier and cheaper than it used to be.
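As a sketch of the browser-language part, locale negotiation can be as simple as scanning the Accept-Language header for the first supported language. This is a simplified illustration, not our production code: our real list covers around 40 languages, and a production version should also respect the q-weights properly.

```typescript
// Pick the first supported language from an Accept-Language header.
// Simplified: ignores q-weights and region variants beyond the base tag.
const SUPPORTED = ['en', 'fr', 'de', 'es', 'pt', 'ja'];

export function pickLocale(acceptLanguage: string | null): string {
  if (!acceptLanguage) return 'en';
  for (const part of acceptLanguage.split(',')) {
    const tag = part.split(';')[0].trim().toLowerCase(); // drop ";q=0.8" weights
    const base = tag.split('-')[0];                      // "pt-BR" -> "pt"
    if (SUPPORTED.includes(base)) return base;
  }
  return 'en'; // fall back to English when nothing matches
}
```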
Historically, we sometimes relied on the coding agent itself to generate translations as part of implementation. More recently, we have been leaning toward a more explicit scripted translation step using a fast, lower-cost model in the pipeline. That feels cleaner and more repeatable.
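The scripted step is conceptually simple: diff each locale file against the English source, and send only the missing strings to the fast, low-cost model. Here is a minimal sketch of the diffing half, assuming flat key-to-string dictionaries; the model call itself is omitted.

```typescript
// Return the keys that still need translating for one locale.
// Assumes flat key -> string dictionaries, e.g. loaded from JSON locale files.
type Strings = Record<string, string>;

export function missingKeys(base: Strings, target: Strings): string[] {
  return Object.keys(base).filter(
    (key) => !(key in target) || target[key].trim() === ''
  );
}
```

In the real pipeline, the returned keys and their English values are batched into one request per locale, and the responses are merged back into the locale files.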
This is part of a broader pattern: as your AI-assisted workflow matures, you move from ad hoc prompting toward more deliberate, repeatable pipelines.
Performance and shipping quality
Once something is deployed to a test environment, performance becomes a serious focus.
SvelteKit makes it possible to build very fast web applications, and that matters to us for product quality, SEO, conversion, and classroom usability. Teachers need products that work instantly, on any device, without installation friction. Students do too. Indeed, we've made it a priority to ensure that Marvely (our app that lets students practise the speaking parts of language exams with a friendly AI) has super-fast page load times anywhere in the world.
The current generation of agents is reasonably good at producing responsive interfaces by default. But they are also perfectly capable of introducing performance regressions if you do not watch carefully. It is very easy for a feature to “work” while quietly harming page speed.
So we test.
Lighthouse is a useful starting point. I also like looking at the Network tab with caching disabled and Slow 4G emulation enabled, then reviewing the HAR output. For trickier issues I will inspect performance traces as well and ask the agents to analyse likely causes of poor FCP or LCP and recommend concrete improvements.
Typical prompts at this stage might be:
“Review page X and recommend the highest-impact changes to reduce FCP and LCP.”
“Here is the HAR file for a slow page load. What is causing the main delay and what should we change first?”
“Review this page for unnecessary client-side work and hydration cost.”
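Before handing a HAR file to an agent, it helps to pre-digest it, because raw captures can run to megabytes. A small sketch that pulls out the slowest entries (field names follow the HAR 1.2 format; this is illustrative, not our actual tooling):

```typescript
// List the n slowest requests in a HAR capture as "time-in-ms  url" lines.
type HarEntry = { time: number; request: { url: string } };
type Har = { log: { entries: HarEntry[] } };

export function slowestRequests(har: Har, n = 5): string[] {
  return [...har.log.entries]
    .sort((a, b) => b.time - a.time) // slowest first
    .slice(0, n)
    .map((e) => `${Math.round(e.time)}ms  ${e.request.url}`);
}
```

Pasting those few lines into the chat, rather than the whole capture, keeps the model focused on the real bottleneck.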
Again, the pattern is consistent:
make it work, make it correct, make it feel good, make it fast, then ship.
And because deployment and rollback are now so fast, the whole process feels lighter. That changes behaviour. When shipping friction drops, the organisation becomes more willing to iterate.
What I think this really means
For me, the most important thing here is not simply that we can code faster.
It is that a very small, high-context team can now operate like a much larger and more capable product organisation — if it combines strong human judgment with the right AI-native workflow.
That workflow has several characteristics:
- strategy and feasibility are discussed continuously, not periodically
- plans are treated as executable control documents, not static PRDs
- implementation happens in short loops, not long handoffs
- testing begins early and continues throughout
- UX refinement happens iteratively against something real
- performance, localisation and deployment are part of the build loop, not afterthoughts
- real user feedback can be folded back into the product almost immediately
That is a very different way of working.
I do not think large teams disappear. But I do think the balance of advantage shifts toward organisations with high context, fast decision-making, technical fluency, and low internal friction. I remember doing a case study on Samsung during my MBA. They moved from being known for clunky TVs to a market-leading smartphone brand through one strategic directive: relentlessly reduce the time from ideation to shipped product.
A team of 100 developers is no longer automatically a strength. In some organisations it may increasingly be a burden if the coordination overhead overwhelms the leverage of the tools.
The advantage now goes to teams that can think clearly, specify well, verify rigorously, and move.
That is especially good news for true builders.
One final point
I do work very long hours. Not because I think everyone should, and not because I glorify burnout. I mention it only because this current moment in software is unusually energising. I love building.
When you can spot an opportunity in the morning, pressure-test it before lunch, implement a meaningful first pass that afternoon, refine it that evening, and put it in front of users quickly, the work becomes incredibly compelling.
For the first time in a long time, a very small team with deep context, strong product judgment, technical fluency and the right tools can compete on product speed with organisations that used to feel untouchable.
That is the shift.
And I believe we are still early.