A beginner's visual glossary
You don't need the jargon yet. You need a mental model. Here are the three metaphors that unlock everything — kitchens, cars, and dinner service.
Covers: training · inference · managed inference · self-hosted inference
A team of chefs spends months and millions of dollars testing every recipe, writing down what works. The output is a 10,000-page cookbook. This is what Meta, DeepSeek, Alibaba, Anthropic do. You will never train a model.
You open the cookbook, find the recipe, cook the meal. Fast. Cheap. Reproducible. Every API call you make is inference. The entire sweep is about finding the cheapest, fastest kitchen to do the cooking.
Once you want to do inference (cook), you have two options:
Walk in, order, eat, pay per dish. Chef is theirs, kitchen is theirs, ingredients are theirs. You don't install anything.
Examples: Cerebras · Groq · Together · Fireworks · DeepInfra · Mistral · OpenRouter
Pay per token ($0.15 per 1M in, $0.60 per 1M out).
Rent the kitchen by the hour. Bring your own cookbook (model weights). Install your own stove (vLLM). Cook as many meals as you want at no extra cost.
Examples: RunPod · Vast.ai · Lambda · Modal
Pay per hour ($0.49/hr A6000 → $2.99/hr H100).
Break-even question: at what meal-volume does renting the kitchen beat eating out? That's what Notebook 2's math was solving.
Covers: serverless · dedicated · pod · instance · cluster · GPU classes · vLLM
You've decided to rent your own kitchen (self-hosted). How do you pay for it?
Request a ride, wait 2 minutes for the car to arrive (cold start), get driven, pay per mile. No car when idle. First trip of the morning = always a wait.
Good for: spiky, unpredictable workloads. Bad for: "I need answers NOW, every 5 minutes."
Car is always there, engine warm, no wait. Meter runs 24/7 whether you drive or sleep. You pay for availability, not mileage.
Good for: sustained heavy use. Bad for: a workload that runs 30 min/day.
The vocabulary soup is sloppy across providers. Here's the floor:
pod / instance / VM
One rented hotel room. One or more GPUs inside. This is 99% of what you'll use.
cluster
Multiple rooms wired together as one logical unit. You won't need this.
node
One room inside a cluster. Synonym for pod when context is clusters.
Translation rule: when a doc says "deploy to a cluster," for our sweep you should mentally read "deploy to one pod."
Bigger kitchen = more ingredients fit + more meals served per hour. Rule of thumb: model weights at FP16 take ~2GB per billion parameters.
Llama 70B in FP16 (full-precision): ~140GB needed. Requires H100 tight, or two A100-80GBs.
Llama 70B in FP8 (half-precision, nearly identical quality): ~70GB. Fits comfortably on one H100.
The rented kitchen is empty. The cookbook (weights) is just paper. vLLM is the stove that turns weights into cooking. It's the program that loads the model into GPU memory and exposes an HTTP endpoint you can send questions to. Critically, vLLM speaks OpenAI's API shape — so any client that works with OpenAI works with vLLM. Zero integration.
Alternatives: TGI (HuggingFace's version), Ollama (laptop-scale), llama.cpp (CPU-only). For serious serving, vLLM wins.
Covers: throughput · latency · TTFT · rate limits · quantization · tool calling · Groq Compound
Two different questions. Providers brag about one, you usually feel the other.
From the moment you order until the first spoon of soup touches your lips. This is what feels snappy. Low TTFT = "instant."
Once the first bite lands, how fast does the next one arrive? At 3000 tok/s (Cerebras), a full paragraph appears in under a second.
Common trap: high throughput + high latency = "it thinks for a beat, then floods you with text." That's OpenRouter-via-DeepInfra for a large model. Cerebras is the opposite: low latency AND high throughput. The rare combo.
The buffet staff watches how much you eat. They apply caps in three ways — and Claude Code's appetite makes most free tiers choke.
RPM — Requests Per Minute
"Don't call our name more than X times a minute."
Rarely the bottleneck.
TPM — Tokens Per Minute
"Don't eat more than X grams per minute."
The real killer. Groq free = 12K TPM. Claude Code sends 20K in one turn.
RPD — Requests Per Day
"Don't come back more than X times a day."
Hard daily ceiling.
Why Groq's free tier is broken for Claude Code: the harness replays the entire conversation history every turn. So "just one more message" actually sends 20,000+ tokens. The 12K TPM wall rejects it. The Compound endpoint has 70K TPM — that one survives.
You can compress the cookbook. Same recipes, smaller file, faster to cook, slight loss of detail.
FP16 / BF16
original print
140GB for Llama 70B. Quality: 100%.
FP8
high-res JPEG
70GB. Quality: ~100% in practice.
Q8
good JPEG
35GB. Quality: 99%. Safe default.
Q4
heavy compression
17GB. Visible quality loss on hard reasoning.
Don't let a sketchy free provider silently serve you Q4 — if the model feels dumb, ask what quantization they use.
Some chefs can only cook with ingredients already on the counter. Others mid-meal will run to the market, check prices, grab fresh herbs, and come back.
Chef only uses what's in front of them. Answers come purely from what the model already knows. Can't look up the current weather, can't run a calculation, can't read a file.
Chef pauses, writes you a note — "I need to grab rosemary, back in 2 minutes." You (or the harness) run the errand, bring the rosemary back, chef keeps cooking. This is how Claude Code's Edit/Write/Bash tools work.
The #1 compatibility check when picking an OSS provider: does this model emit clean, reliable tool-call JSON? Older Llamas hallucinate tool names. GLM 4.7, DeepSeek V3, and Qwen3-Coder are rock-solid. Some providers don't expose tool calling even when the model supports it.
Regular tool calling: chef stops cooking, you run to the market, come back. Round trip takes 30 seconds.
Groq Compound: chef keeps cooking, hands the grocery list to an apprentice who's already in the kitchen, apprentice sprints out and back. You never leave your seat. No round trip.
Technically: a single API call triggers an internal agentic loop on Groq's servers — web search, code execution, Wolfram Alpha, browser automation all happen server-side. Your client just waits for the finished answer.
Why it matters to us: the free tier gives you 70K TPM + 250 requests/day on this endpoint — roughly ~5M tokens/day free. For peak-hour experimentation, that's wild value.
The 30-second cheat sheet
Inference = cooking with a pre-written cookbook.
Managed = restaurant. Self-hosted = rented kitchen.
Serverless = Uber. Dedicated = leased car.
vLLM = the stove. H100 = industrial kitchen.
Latency / TTFT = time to first bite. Throughput = rate of later bites.
TPM is the real killer. Groq plain = 12K. Claude Code eats 20K a turn.
Quantization = JPEG for weights. FP8 is the safe default.
Groq Compound = chef-and-errand-runner. 5M free tokens/day.