For AI startups burning cash on GPT/Claude

Cut your LLM bill ~80% by distilling your own logs

model-shrink ingests your prompt/response logs and distills a small fine-tuned open model that handles the repetitive 80% of your traffic — at a fraction of the cost. Start with a free eval report that proves the savings before you train anything.

Get my free eval report →

See pricing

No signup for the free report · No data stored · Works with your existing logs

~80%

of traffic is repetitive

10-25×

cheaper per token

0

ML engineers needed

From log file to dollar savings

The free report is the whole pitch: proof, in your own numbers, that distillation pays off — before you commit to anything.

Upload your real logs

Drop in JSONL or a JSON array of your prompt/response pairs — OpenAI and Anthropic chat formats work out of the box. Nothing is stored.
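To make the two accepted file shapes concrete, here is a minimal sketch of how a log file could be read into prompt/response pairs. The field names `prompt` and `response` and the helper `load_pairs` are assumptions for illustration only; the actual model-shrink schema is not specified on this page.

```python
import json


def load_pairs(text):
    """Parse log text that is either a JSON array or JSONL into a list of dicts.

    The "prompt" and "response" keys are hypothetical field names.
    """
    stripped = text.strip()
    if not stripped:
        return []
    if stripped.startswith("["):
        # The whole file is a single JSON array of records.
        records = json.loads(stripped)
    else:
        # JSONL: one JSON object per non-empty line.
        records = [json.loads(line) for line in stripped.splitlines() if line.strip()]
    # Keep only records that carry both assumed fields.
    return [r for r in records if "prompt" in r and "response" in r]
```

Records missing either field are dropped rather than raising, so a partially malformed export still yields a usable sample.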

See what's distillable

We cluster your traffic by task type and score how much of each cluster a small fine-tuned model (7B to 8B parameters) can absorb at acceptable quality.

Get the dollar proof

A clear cost-savings dashboard: your current black-box spend vs. the distilled setup, in ₹ and $, with a per-workload breakdown.
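The arithmetic behind a savings figure like this can be sketched in a few lines. This is a simplified model, not the report's actual method: it assumes the routable share is measured in token spend, that the small model is a fixed multiple cheaper per token, and it ignores any hosting fee. The function name `blended_savings` is hypothetical.

```python
def blended_savings(routable_fraction, cost_ratio):
    """Fraction of spend saved when `routable_fraction` of token spend moves
    to a model that is `cost_ratio` times cheaper per token.

    Simplifying assumptions: no fixed hosting cost, same token counts per request.
    """
    if not 0.0 <= routable_fraction <= 1.0:
        raise ValueError("routable_fraction must be between 0 and 1")
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    # Spend that stays on the frontier model, plus the discounted routed spend.
    remaining = (1.0 - routable_fraction) + routable_fraction / cost_ratio
    return 1.0 - remaining


for ratio in (10, 25):
    print(f"{ratio}x cheaper: {blended_savings(0.8, ratio):.1%} saved")
```

Under these assumptions, routing 80% of spend to a model that is 10× to 25× cheaper saves roughly 72% to 77%. The savings approach the routable share itself as the cost ratio grows.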

How it works

Export your logs

Pull a sample of prompt/response pairs from your gateway, LangSmith, or DB. A few hundred rows is enough to project savings.

Run the free eval

Upload them. In seconds you get a distillation report — routable %, savings %, and which workloads to distill first.

Distill & deploy

On a paid plan we fine-tune a small open model on your data and host it behind one endpoint. Route the easy 80%, keep the frontier model for the long tail.
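On your side, the routing step can be as small as a lookup on task type. The sketch below is illustrative only: `make_router`, the task labels, and the two handler callables are hypothetical stand-ins for your own classifier and model clients, not part of any model-shrink API.

```python
from typing import Callable, Set


def make_router(
    distilled_tasks: Set[str],
    small_model: Callable[[str], str],
    frontier_model: Callable[[str], str],
) -> Callable[[str, str], str]:
    """Return a function that sends known-easy task types to the small model
    and everything else to the frontier model."""

    def route(task_type: str, prompt: str) -> str:
        # Task types outside the distilled set fall through to the frontier model.
        handler = small_model if task_type in distilled_tasks else frontier_model
        return handler(prompt)

    return route


# Usage with stub handlers standing in for real model clients.
route = make_router(
    {"ticket_triage"},
    small_model=lambda p: f"small:{p}",
    frontier_model=lambda p: f"frontier:{p}",
)
```

Defaulting unknown task types to the frontier model keeps the long tail on the stronger model, so a misclassified request costs money rather than quality.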

“We were spending more on GPT-4o for ticket triage than on our servers. Routing the easy classifications to a small model was the obvious win — we just needed the numbers to justify it.”

— The pitch we hear from every AI team. Run the report and see your own.

Run my free eval report →

Stop overpaying the black box.

Free report today. Hosted distilled endpoint from $39/mo.

Get started free