A beginner's visual glossary

Running open-source LLMs,
in metaphors.

You don't need the jargon yet. You need a mental model. Here are the three metaphors that unlock everything — kitchens, cars, and dinner service.

Metaphor 1

The Kitchen

Covers: training · inference · managed inference · self-hosted inference

Training = writing the cookbook

A team of chefs spends months and millions of dollars testing every recipe, writing down what works. The output is a 10,000-page cookbook. This is what Meta, DeepSeek, Alibaba, Anthropic do. You will never train a model.

Inference = using the cookbook to cook dinner

You open the cookbook, find the recipe, cook the meal. Fast. Cheap. Reproducible. Every API call you make is inference. The entire sweep is about finding the cheapest, fastest kitchen to do the cooking.

cookbook training (once, expensive) chef inference (per question) meal

Where do you cook?

Once you want to do inference (cook), you have two options:

Managed inference = eating at a restaurant

Walk in, order, eat, pay per dish. Chef is theirs, kitchen is theirs, ingredients are theirs. You don't install anything.

Examples: Cerebras · Groq · Together · Fireworks · DeepInfra · Mistral · OpenRouter

Pay per token ($0.15 per 1M in, $0.60 per 1M out).

Self-hosted = renting a commercial kitchen

Rent the kitchen by the hour. Bring your own cookbook (model weights). Install your own stove (vLLM). Cook as many meals as you want at no extra cost.

Examples: RunPod · Vast.ai · Lambda · Modal

Pay per hour ($0.49/hr A6000 → $2.99/hr H100).

Break-even question: at what meal-volume does renting the kitchen beat eating out? That's what Notebook 2's math was solving.

Metaphor 2

The Transport

Covers: serverless · dedicated · pod · instance · cluster · GPU classes · vLLM

Two ways to get around

You've decided to rent your own kitchen (self-hosted). How do you pay for it?

Serverless = Uber

Request a ride, wait 2 minutes for the car to arrive (cold start), get driven, pay per mile. No car when idle. First trip of the morning = always a wait.

Good for: spiky, unpredictable workloads. Bad for: "I need answers NOW, every 5 minutes."

Dedicated = leased car in your driveway

Car is always there, engine warm, no wait. Meter runs 24/7 whether you drive or sleep. You pay for availability, not mileage.

Good for: sustained heavy use. Bad for: a workload that runs 30 min/day.

Pod, Instance, Cluster

The vocabulary soup is sloppy across providers. Here's the floor:

pod / instance / VM

One rented hotel room. One or more GPUs inside. This is 99% of what you'll use.

cluster

Multiple rooms wired together as one logical unit. You won't need this.

node

One room inside a cluster. Synonym for pod when context is clusters.

Translation rule: when a doc says "deploy to a cluster," for our sweep you should mentally read "deploy to one pod."

GPU classes = kitchen sizes

Bigger kitchen = more ingredients fit + more meals served per hour. Rule of thumb: model weights at FP16 take ~2GB per billion parameters.

RTX 3090/4090
24GB
$0.20-0.40/hr
A6000
48GB
$0.49/hr
A100 40GB
40GB
$1.00-1.50/hr
A100 80GB
80GB
$1.39-1.49/hr
H100 80GB
80GB
$2.39-2.99/hr
H200/B200
141GB+
$4.00+/hr

Llama 70B in FP16 (full-precision): ~140GB needed. Requires H100 tight, or two A100-80GBs.

Llama 70B in FP8 (half-precision, nearly identical quality): ~70GB. Fits comfortably on one H100.

vLLM = the stove in the kitchen

The rented kitchen is empty. The cookbook (weights) is just paper. vLLM is the stove that turns weights into cooking. It's the program that loads the model into GPU memory and exposes an HTTP endpoint you can send questions to. Critically, vLLM speaks OpenAI's API shape — so any client that works with OpenAI works with vLLM. Zero integration.

Alternatives: TGI (HuggingFace's version), Ollama (laptop-scale), llama.cpp (CPU-only). For serious serving, vLLM wins.

Metaphor 3

The Dinner Service

Covers: throughput · latency · TTFT · rate limits · quantization · tool calling · Groq Compound

How fast is fast?

Two different questions. Providers brag about one, you usually feel the other.

Latency (TTFT) = how long until the first bite arrives

From the moment you order until the first spoon of soup touches your lips. This is what feels snappy. Low TTFT = "instant."

order first bite TTFT
Throughput (tok/s) = how fast food keeps coming

Once the first bite lands, how fast does the next one arrive? At 3000 tok/s (Cerebras), a full paragraph appears in under a second.

tokens arriving at X tok/s

Common trap: high throughput + high latency = "it thinks for a beat, then floods you with text." That's OpenRouter-via-DeepInfra for a large model. Cerebras is the opposite: low latency AND high throughput. The rare combo.

Rate limits = buffet rules

The buffet staff watches how much you eat. They apply caps in three ways — and Claude Code's appetite makes most free tiers choke.

RPM — Requests Per Minute

"Don't call our name more than X times a minute."

Rarely the bottleneck.

TPM — Tokens Per Minute

"Don't eat more than X grams per minute."

The real killer. Groq free = 12K TPM. Claude Code sends 20K in one turn.

RPD — Requests Per Day

"Don't come back more than X times a day."

Hard daily ceiling.

Why Groq's free tier is broken for Claude Code: the harness replays the entire conversation history every turn. So "just one more message" actually sends 20,000+ tokens. The 12K TPM wall rejects it. The Compound endpoint has 70K TPM — that one survives.

Quantization = JPEG for model weights

You can compress the cookbook. Same recipes, smaller file, faster to cook, slight loss of detail.

FP16 / BF16

original print

140GB for Llama 70B. Quality: 100%.

FP8

high-res JPEG

70GB. Quality: ~100% in practice.

Q8

good JPEG

35GB. Quality: 99%. Safe default.

Q4

heavy compression

17GB. Visible quality loss on hard reasoning.

Don't let a sketchy free provider silently serve you Q4 — if the model feels dumb, ask what quantization they use.

Tool calling = chef goes shopping

Some chefs can only cook with ingredients already on the counter. Others mid-meal will run to the market, check prices, grab fresh herbs, and come back.

Without tool calling

Chef only uses what's in front of them. Answers come purely from what the model already knows. Can't look up the current weather, can't run a calculation, can't read a file.

With tool calling

Chef pauses, writes you a note — "I need to grab rosemary, back in 2 minutes." You (or the harness) run the errand, bring the rosemary back, chef keeps cooking. This is how Claude Code's Edit/Write/Bash tools work.

The #1 compatibility check when picking an OSS provider: does this model emit clean, reliable tool-call JSON? Older Llamas hallucinate tool names. GLM 4.7, DeepSeek V3, and Qwen3-Coder are rock-solid. Some providers don't expose tool calling even when the model supports it.

Groq Compound = full-service chef-and-errand-runner

The chef AND the grocery runner, in one person.

Regular tool calling: chef stops cooking, you run to the market, come back. Round trip takes 30 seconds.

Groq Compound: chef keeps cooking, hands the grocery list to an apprentice who's already in the kitchen, apprentice sprints out and back. You never leave your seat. No round trip.

Technically: a single API call triggers an internal agentic loop on Groq's servers — web search, code execution, Wolfram Alpha, browser automation all happen server-side. Your client just waits for the finished answer.

Why it matters to us: the free tier gives you 70K TPM + 250 requests/day on this endpoint — roughly ~5M tokens/day free. For peak-hour experimentation, that's wild value.

The 30-second cheat sheet

Inference = cooking with a pre-written cookbook.

Managed = restaurant. Self-hosted = rented kitchen.

Serverless = Uber. Dedicated = leased car.

vLLM = the stove. H100 = industrial kitchen.

Latency / TTFT = time to first bite. Throughput = rate of later bites.

TPM is the real killer. Groq plain = 12K. Claude Code eats 20K a turn.

Quantization = JPEG for weights. FP8 is the safe default.

Groq Compound = chef-and-errand-runner. 5M free tokens/day.