ai founders spend three months on inference optimization and two weeks on pricing. then they wonder why margin barely moved. the math is upside down. caching saves you 20%. a tier change saves you 80%.
the canonical move in 2025-2026 looks the same at every ai-first company that crossed the gross margin threshold investors require. Cursor capped pro at 500 fast requests and moved heavy use to a metered "max mode." Replit shifted agent runs to a credit balance. Lovable moved its agent product to a credit ladder anchored to model cost. none of them shipped a smaller model or a smarter cache as the headline change. all of them shipped a price change.
if you sell ai and your gross margin is under 50%, you don't have an engineering problem. you have a pricing problem that engineering can't solve fast enough.
the math of an ai seat
a $200/month seat at 80% target gross margin can absorb $40 of inference cost before margin breaks. a power user running agent loops with 100k-token context windows and tool-use cascades burns $150 in inference. that customer is now at 25% gross margin on the inference line alone, 55 points under target. the gap gets buried in the blended cogs number, which is what shows up on the board deck.
founders see this when they finally pull cost per customer. the top 5% of users by inference are 60-80% of total inference spend. the median user is profitable. the heavy tail is the company.
$150 of inference on a $200 seat is the wrong shape for a healthy ai business. it says you're spending $0.75 in inference for every $1 of revenue from that user, which leaves $0.25 to pay for r&d, payroll, the office, and your own salary. and that's the inference line alone, before customer success, support, the auth provider, the email sender, the data warehouse. the seat is a money loser at the seat price.
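the seat math is short enough to check by hand. a minimal python sketch, using only the numbers from this section ($200 seat, 80% target, $150 of inference). the function names are illustrative, not from any billing library.

```python
def inference_budget(seat_price: float, target_margin: float) -> float:
    """Inference spend a seat can absorb before it drops below the target gross margin."""
    return seat_price * (1.0 - target_margin)


def gross_margin(seat_price: float, inference_cost: float) -> float:
    """Gross margin on the inference line alone, as a fraction of seat revenue."""
    return (seat_price - inference_cost) / seat_price


seat_price = 200.0
budget = inference_budget(seat_price, target_margin=0.80)
power_user_margin = gross_margin(seat_price, inference_cost=150.0)

print(f"inference budget per seat: ${budget:.2f}")      # $40.00
print(f"power user gross margin: {power_user_margin:.0%}")  # 25%
```

the power user overshoots the $40 budget by $110 a month, and that is before any other cost of serving them.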
founders try to fix this with engineering. "if we cache the embeddings, swap to Claude Haiku for the easy steps, batch the tool calls." all of that is real. all of it takes a quarter. and at the end of the quarter, the same power user is now running 3x more agent loops because the product got better, so the saved cost re-spends itself.
why engineering takes a quarter and pricing takes two weeks
three engineering levers, in rough order of how much founders overweight them.
smaller models for the easy steps. ship a router that sends simple prompts to a cheaper model. real savings — 20-40% in published benchmarks. engineering is a sprint. eval is a quarter. then OpenAI releases something new and you redo it.
prompt caching and context window discipline. Anthropic's prompt caching shipped in 2024 and is now table stakes. it saves 50-90% on the cached portion and nothing on the parts that vary. meaningful for agent products with long threads. footnote for one-shot products.
fine-tuning and distillation. real economics for high-volume narrow tasks. months of work, fragile to base-model drift. by the time you ship, the model providers shipped a model that's 30% cheaper and you ask whether the fine-tune was worth it.
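why caching is meaningful for long threads and a footnote for one-shot products comes down to one multiplication: the share of input tokens that hit the cache, times the discount on those tokens. a rough sketch, assuming the 90% top-end discount quoted above. the 80% and 10% cached shares are made-up illustrations, and the sketch ignores output tokens and any cache-write surcharge.

```python
def caching_savings(cached_share: float, cache_discount: float) -> float:
    """Fraction of input-token spend saved by prompt caching.

    cached_share: fraction of input tokens served from the cache (0..1).
    cache_discount: price reduction on cached tokens (0..1), e.g. 0.90.
    """
    if not (0.0 <= cached_share <= 1.0 and 0.0 <= cache_discount <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    return cached_share * cache_discount


# Hypothetical workloads: a long agent thread re-sends most of its context,
# a one-shot prompt shares only a short system preamble.
agent_thread = caching_savings(cached_share=0.80, cache_discount=0.90)
one_shot = caching_savings(cached_share=0.10, cache_discount=0.90)

print(f"agent thread input savings: {agent_thread:.0%}")
print(f"one-shot input savings: {one_shot:.0%}")
```

the discount only applies to the part of the prompt that repeats, so the headline percentage shrinks in proportion to how much of each request is new.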
three pricing levers, in rough order of how fast they ship.
usage caps inside the existing tier. "pro is $20/mo and includes 500 requests." one engineer-week. a customer email. revenue protected. the heavy tail self-selects up.
a metered tier above the flat tier. "max mode" or "agent mode" billed at cost-plus. two weeks. the power users keep using the product, the company keeps the margin, the median user doesn't notice.
bring-your-own-key for the heaviest workloads. "enterprise customers can connect their own Anthropic account." a quarter, but only because of contracts. once shipped, every BYOK customer pays their provider directly, so their inference cost drops off your cogs line entirely.
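the first two levers reduce to a small billing function. a sketch, assuming the $20 / 500-request pro tier from the example above. the $0.02 cost per request and the 25% cost-plus markup are hypothetical numbers, not anyone's actual pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    base_price: float       # flat monthly price in dollars
    included_requests: int  # usage cap inside the flat tier
    markup: float           # cost-plus markup on metered usage (0.25 = cost + 25%)


def monthly_bill(plan: Plan, requests: int, cost_per_request: float) -> float:
    """Flat price plus metered overage billed at cost-plus."""
    overage = max(0, requests - plan.included_requests)
    return plan.base_price + overage * cost_per_request * (1.0 + plan.markup)


def margin(revenue: float, requests: int, cost_per_request: float) -> float:
    """Gross margin on the inference line for one customer."""
    return (revenue - requests * cost_per_request) / revenue


pro = Plan(base_price=20.0, included_requests=500, markup=0.25)
COST = 0.02  # hypothetical inference cost per request, in dollars

for label, requests in [("median user", 200), ("heavy user", 5000)]:
    flat = margin(pro.base_price, requests, COST)
    bill = monthly_bill(pro, requests, COST)
    metered = margin(bill, requests, COST)
    print(f"{label}: flat margin {flat:.0%}, bill ${bill:.2f}, metered margin {metered:.0%}")
```

under these assumptions the median user's bill does not change, while the heavy user moves from a deeply negative margin to a positive one. the margin on the heavy user is still thin, because cost-plus only earns the markup on the overage.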
what changed at Cursor, Replit, Lovable
each of these companies hit the same wall in 2025. headline arr up, gross margin trending down because the heavy users were getting heavier. the change wasn't a model swap. it was a pricing reorganization in public.
Cursor moved fast/slow requests to a quota system, then added a metered "max" mode for agent loops. the $20 entry price stayed. the median user was unaffected. the top decile stopped getting subsidized. gross margin reportedly moved from the mid-40s toward the 60s within two quarters.
Replit reorganized agent runs around credits with explicit per-action pricing. the average user paid roughly what they paid before. the heavy user — running long agent jobs over a private codebase — paid more, because the cost was real and the flat price had been a subsidy.
Lovable moved from a flat seat to a credit model with a transparent per-message cost. founders complained on twitter. customers stayed. gross margin moved up double-digit points within a quarter.
none of these companies got better at inference. they got better at pricing.
the founder objection that always comes up
"if we change pricing, customers will churn." sometimes. mostly they don't. the median customer doesn't notice because their usage was under the cap. the heavy customer notices and pays — they were the ones getting the most value from the product, which is why they were using it the most. a small group churns. revenue per customer goes up. gross margin goes up. the company survives.
the fastest way to fix ai gross margin isn't a smaller model — it's the email that says "starting next month, the heaviest 5% of users will be on a usage tier."
how zift handles this
zift pulls your stripe revenue, your Anthropic and OpenAI invoices, and your cloud bill into one weekly view of gross margin — blended and by customer decile. on the first of the month you get the number, the cohort that moved it, and the inference line that drove the change. before you debate the next caching project, you see whether the math says caching or pricing.
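a by-hand version of a by-customer view fits in a few lines. this is a hypothetical sketch on made-up data (twenty $20 seats, one heavy user), not zift's implementation. the function names and the numbers are assumptions for illustration.

```python
import math


def top_share(costs: list[float], top_fraction: float = 0.05) -> float:
    """Share of total inference spend that comes from the heaviest users."""
    ranked = sorted(costs, reverse=True)
    k = max(1, math.ceil(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


def margin_by_decile(customers: list[tuple[float, float]]) -> list[float]:
    """Gross margin per decile of inference cost, heaviest decile first.

    customers: (monthly_revenue, monthly_inference_cost) pairs.
    """
    ranked = sorted(customers, key=lambda c: c[1], reverse=True)
    n = len(ranked)
    margins = []
    for d in range(10):
        bucket = ranked[d * n // 10:(d + 1) * n // 10]
        revenue = sum(r for r, _ in bucket)
        cost = sum(c for _, c in bucket)
        margins.append((revenue - cost) / revenue if revenue else float("nan"))
    return margins


# Made-up data: 19 light users at $2 of inference, one heavy user at $100.
customers = [(20.0, 2.0)] * 19 + [(20.0, 100.0)]

share = top_share([cost for _, cost in customers])
deciles = margin_by_decile(customers)
blended = 1.0 - sum(c for _, c in customers) / sum(r for r, _ in customers)

print(f"top 5% share of inference spend: {share:.0%}")
print(f"blended gross margin: {blended:.0%}")
print(f"heaviest decile margin: {deciles[0]:.0%}, next decile: {deciles[1]:.0%}")
```

the blended number looks tolerable while the heaviest decile is deeply negative, which is the pattern a decile cut is meant to surface.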
if you're a finance lead at an ai-first company with multiple products or model providers, zift handles that too.
inference margin is a pricing decision. the engineering work is real and you should do it anyway. but the line that moves the gross margin number this quarter is the one on your pricing page, not the one in your inference router.
