1. The one-sentence definition
An LLM (Large Language Model) is a function that takes a sequence of text and returns a likely next chunk of text, one chunk at a time, until something tells it to stop.
That's it. Everything else in this post is mechanical detail about how that function is shaped, sampled, conditioned, and billed. The model has no goals, no memory between calls, no plan. Every clever-looking thing it does — answer questions, write code, follow instructions, "think step by step" — emerges from running that next-chunk loop fast enough to feel like conversation.
That framing is the single most useful idea in this post. Hold it. The rest follows from it.
2. Tokens — the atoms of an LLM
An LLM does not read characters and it does not read words. It reads tokens: sub-word fragments produced by a tokenizer that is built for the model family and fixed before the model itself is trained. In English a token averages about four characters, but the exact split depends on the model family.
Open the Anthropic or OpenAI tokenizer pages and paste in a sentence; you'll see the splits. A worked example (illustrative; exact splits vary by tokenizer):
Text: "The quick brown fox jumps over the lazy dog."
Tokens: ["The", " quick", " brown", " fox", " jumps",
" over", " the", " lazy", " dog", "."]
Count: 10 tokens — about one per word, which is typical for plain English.
Text: "antidisestablishmentarianism"
Tokens: ["ant", "id", "ises", "tab", "lish", "ment", "arian", "ism"]
Count: 8 tokens — long rare words fragment into many tokens.
Text: "🎉 こんにちは"
Tokens: ["🎉", " こ", "んに", "ち", "は"]
Count: 5 tokens — emoji and non-Latin scripts cost more per character.
Three things every developer should know about tokens:
- You are billed per token, not per character. Roughly: ~750 English words ≈ 1,000 tokens. A 5-page document is around 2,500 tokens. A novel is 100k+.
- Input and output are priced differently. Output tokens cost 4–5× input tokens on most APIs. The model generates one at a time and that's where the latency lives, so vendors price accordingly.
- Token boundaries leak into behaviour. Models are slightly better at things that match their tokenizer's natural splits. This is why "spell this word backwards" or "count the letters in strawberry" trips up LLMs — they don't see letters, they see chunks.
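To make the billing arithmetic concrete, here is a minimal sketch of a back-of-envelope estimator. It assumes the ~750 words ≈ 1,000 tokens rule of thumb and uses placeholder prices of $3 and $15 per million tokens; the function names are made up for this post, and a real count should come from your provider's tokenizer.

```python
# Rough token and cost estimator. All constants are rules of thumb,
# not values returned by any provider's tokenizer or price list.
TOKENS_PER_WORD = 1000 / 750  # ~750 English words ≈ 1,000 tokens


def estimate_tokens(text: str) -> int:
    """Approximate token count from a whitespace word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float = 3.0,    # assumed input price
    output_price_per_million: float = 15.0,  # assumed output price (5x input)
) -> float:
    """Input and output tokens are billed at different per-million rates."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


prompt = "word " * 750                    # a 750-word prompt
print(estimate_tokens(prompt))            # about 1,000 tokens
print(estimate_cost_usd(1_000, 500))      # cost of 1,000 tokens in, 500 out
```

Note how the 500 output tokens cost more than the 1,000 input tokens: that asymmetry is why capping output length is usually the first cost lever to pull.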
3. The prediction loop — how text actually gets generated
Here is what a single LLM call looks like from the inside, simplified:
1. You send N tokens (your prompt) to the model.
2. The model computes a probability distribution over its vocabulary —
roughly 100,000–200,000 possible next tokens, each with a number.
3. A sampler picks one token from that distribution.
4. That token is appended to the sequence.
5. The model is re-run on the (now N+1)-token sequence.
6. Repeat from step 2 until a stop condition triggers.
Stop conditions (any one):
- a special "end-of-message" token is sampled
- the max_tokens limit is reached
- a stop sequence the caller specified is produced
- the model emits a tool-use request
Two consequences of this loop matter:
- Output is intrinsically incremental. The model cannot "plan" the last sentence before writing the first. Anything that looks like planning (reasoning steps, drafts) is the model literally generating those tokens first and then conditioning the rest on what it just wrote.
- Latency scales with output length. Input tokens are processed in parallel; output tokens are produced sequentially. That's why "respond in one word" is fast and "write me a 5,000-word essay" is slow — each output token requires another forward pass through the entire model.
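The loop above can be sketched in a few lines with a toy stand-in for the model. Everything here is invented for illustration: the lookup table only looks at the last token (a real LLM conditions on the whole sequence), and the sampler is greedy. The point is the shape of the loop and its stop conditions, not the model.

```python
# Toy stand-in for a model: maps the last token to a distribution over
# next tokens. The table is made up purely to make the loop runnable.
TOY_MODEL = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.8, "<end>": 0.2},
    "sat": {"<end>": 1.0},
}


def next_token_distribution(tokens: list[str]) -> dict[str, float]:
    """Step 2: return a probability for each candidate next token."""
    return TOY_MODEL[tokens[-1]]


def greedy_sampler(distribution: dict[str, float]) -> str:
    """Step 3: pick the single most likely token."""
    return max(distribution, key=distribution.get)


def generate(prompt: list[str], max_tokens: int, stop: tuple[str, ...] = ()) -> list[str]:
    tokens = list(prompt)
    output: list[str] = []
    while len(output) < max_tokens:          # stop: max_tokens reached
        token = greedy_sampler(next_token_distribution(tokens))
        if token == "<end>":                 # stop: end-of-message token
            break
        tokens.append(token)                 # steps 4-5: append, then re-run
        output.append(token)
        if token in stop:                    # stop: caller's stop sequence
            break
    return output


print(generate(["<start>"], max_tokens=10))  # → ['the', 'cat', 'sat']
print(generate(["<start>"], max_tokens=2))   # → ['the', 'cat']
```

Each pass through the `while` loop is one forward pass, which is why output length drives latency.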
4. Temperature and sampling — why outputs vary
Step 3 above — "a sampler picks one token from that distribution" — is where temperature and top-p live. The model emits a probability for every possible next token; the sampler decides which one to use.
- Temperature = 0: always pick the most likely token. Deterministic-ish. Best for classification, extraction, structured output.
- Temperature = 0.7: sharpen the distribution a bit in favour of the likelier tokens, then sample. A common setting for writing. API defaults vary by provider and are often 1.0.
- Temperature = 1.0: sample from the model's distribution as-is. Values above 1.0 flatten it: more creative, more variance, more risk of nonsense. Useful for brainstorming, dangerous for facts. Some APIs cap temperature at 1.0.
- Top-p (nucleus sampling): instead of sampling from the whole vocabulary, sample only from the smallest set of tokens whose probabilities add up to p. top_p=0.9 means "consider the most likely tokens that together account for 90% of probability mass." Often used alongside temperature.
A pragmatic rule: if you want consistent, machine-parseable output, set temperature=0. If you want prose that sounds varied across calls, leave it at the default. Temperature above 1.0 is almost never the right answer outside of brainstorming.
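For the curious, here is a small sketch of what a sampler might do with temperature and top-p. It is the textbook formulation (softmax over logits divided by temperature, then nucleus filtering), not any vendor's actual implementation, and the logits are made-up scores.

```python
import math
import random


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits / temperature. Lower values sharpen, higher values flatten."""
    if temperature <= 0:
        # Treat 0 as greedy: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtracted for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose mass reaches p, then renormalise."""
    kept: dict[str, float] = {}
    mass = 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept.items()}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


logits = {"Paris": 4.0, "London": 2.0, "banana": 0.0}  # made-up scores
for t in (0.0, 0.7, 1.0, 1.5):
    print(t, apply_temperature(logits, t))

nucleus = top_p_filter(apply_temperature(logits, 1.0), p=0.9)
print(nucleus, sample(nucleus, random.Random(0)))
```

Running it shows the top token's probability falling as temperature rises, and top-p dropping the unlikely "banana" before any sampling happens.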
5. The context window — the LLM is stateless
An LLM is a stateless function. It has no memory of any previous conversation. Every API call is independent. When ChatGPT or Claude "remembers" your earlier messages, what's actually happening is that the client is re-sending the entire conversation history with every request.
from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Turn 1
messages = [{"role": "user", "content": "My name is Ada."}]
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,  # required by the Messages API
    messages=messages,
)
# → "Nice to meet you, Ada."
# Turn 2: the model has no memory, so we resend everything
messages = [
    {"role": "user", "content": "My name is Ada."},
    {"role": "assistant", "content": "Nice to meet you, Ada."},
    {"role": "user", "content": "What's my name?"},
]
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=messages,
)
# → "Your name is Ada."
The context window is the maximum number of tokens you can pack into one of those calls. In 2026, modern models offer 200k–1M token windows — enough to fit entire codebases or short books. But there are two catches:
- You pay for every token you send. A million-token context window doesn't mean you should use it. A million input tokens at $3/M is $3 per question.
- Attention degrades with length. Models pay less reliable attention to the middle of very long contexts ("lost in the middle"). Important instructions belong at the start or end.
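Since every re-sent token is billed, a common response is to cap how much history goes into each call. The sketch below keeps the most recent messages that fit a token budget. The four-characters-per-token estimate and both function names are assumptions for illustration; a production version would count with the provider's tokenizer.

```python
def rough_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the most recent messages whose estimated total fits the budget."""
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):       # walk newest to oldest
        cost = rough_tokens(message["content"])
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()                           # restore chronological order
    # Drop leading assistant turns so the list still starts with a user message.
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "My name is Ada. " * 10},
    {"role": "assistant", "content": "Nice to meet you, Ada. " * 10},
    {"role": "user", "content": "What's my name?"},
]
print(len(trim_history(history, budget_tokens=1_000)))  # everything fits
print(trim_history(history, budget_tokens=20))          # only the latest turn
```

The second call shows the trade-off: the trimmed request is cheap, but the turn that contained the name is gone, so the model can no longer answer. Anything that must survive trimming belongs in the system prompt or a running summary.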
6. The three roles — system, user, assistant
Every modern LLM API frames the conversation as a list of messages, each with a role:
- system: instructions about the assistant (persona, rules, output format). Written once and applied to the whole conversation, but because the API is stateless it is sent with every request. Example: "You are a senior Python developer. Respond in JSON."
- user: what the human is saying right now.
- assistant: what the model has said before (or, occasionally, a partial response you want the model to continue from).
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,  # required by the Messages API
    system="You are a terse, dry-witted code reviewer. Reply in < 50 words.",
    messages=[
        {"role": "user", "content": "What do you think of `i = i + 1`?"},
        {"role": "assistant", "content": "Functional. Unloved by linters."},
        {"role": "user", "content": "And `i += 1`?"},
    ],
)
Under the hood these roles are concatenated into a single token stream with special markers between them — but conceptually keeping them separate is what lets the model know who said what. The system prompt is structurally privileged: it sets the tone for the entire conversation and is harder for the user to override.
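To picture that concatenation, here is a purely illustrative sketch. The `<|role|>` and `<|end|>` markers are hypothetical: each model family uses its own special tokens, and the API applies them for you, so you would never write this yourself.

```python
def render_conversation(system: str, messages: list[dict]) -> str:
    """Flatten roles into one string using made-up markers.

    The <|role|> and <|end|> markers are hypothetical stand-ins for the
    special tokens a real model family would use.
    """
    parts = [f"<|system|>\n{system}<|end|>"]
    for message in messages:
        role = message["role"]
        content = message["content"]
        parts.append(f"<|{role}|>\n{content}<|end|>")
    parts.append("<|assistant|>\n")  # open slot: the model continues from here
    return "\n".join(parts)


prompt = render_conversation(
    "You are a terse code reviewer.",
    [{"role": "user", "content": "And `i += 1`?"}],
)
print(prompt)
```

Seen this way, a chat "reply" is just the next-token loop continuing the text after the final assistant marker.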
7. Why LLMs hallucinate (and what to do about it)
A hallucination is when a model produces something fluent and confident that is factually wrong. It's not a bug. It's a direct consequence of the prediction loop.
Remember: at every step, the model picks a likely next token. "I don't know" is almost never the most likely next token — because the training data is full of confident answers, not confessions of ignorance. When the model genuinely doesn't have the answer, the most likely next token is usually still something that sounds like an answer, because that's what answers look like.
Four practical defences, ranked by effectiveness:
- Ground the model with retrieved facts (RAG). Don't ask "what's the population of Andorra?" — fetch the figure first and inject it into the prompt with a sentence like "Use only the facts below." See our RAG with Django post for a working pipeline.
- Force structured output and verify it. Ask for JSON matching a schema. Parse the output with json.loads() and validate with Pydantic. If verification fails, retry or surface the error rather than displaying the bad data.
- Lower the temperature. temperature=0 doesn't eliminate hallucinations but it makes the most likely (often most-trained-on) answer surface more consistently. Better for factual queries.
- Permit "I don't know." Add to your system prompt: "If the provided context does not answer the question, reply exactly 'INSUFFICIENT_CONTEXT'." Then check for that sentinel in your code. This is the most under-used technique.
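The verify-and-sentinel defences can be combined in one small function. This is a sketch under assumptions: the `{"population": ..., "source": ...}` schema is hypothetical, and validation is done by hand with the standard library to keep the example self-contained (Pydantic would replace the manual checks).

```python
import json

SENTINEL = "INSUFFICIENT_CONTEXT"


def parse_answer(reply: str) -> dict:
    """Return a validated answer, or raise ValueError so the caller can retry."""
    text = reply.strip()
    if text == SENTINEL:
        return {"status": "no_answer"}  # the model admitted it does not know
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    # Hypothetical schema: {"population": <int>, "source": <str>}
    population = data.get("population")
    source = data.get("source")
    if isinstance(population, bool) or not isinstance(population, int) or population < 0:
        raise ValueError("population must be a non-negative integer")
    if not isinstance(source, str) or not source:
        raise ValueError("source must be a non-empty string")
    return {"status": "ok", "population": population, "source": source}


# 12345 is a placeholder figure, not real data.
print(parse_answer('{"population": 12345, "source": "provided context"}'))
print(parse_answer("INSUFFICIENT_CONTEXT"))
```

The caller then has three clean outcomes to handle: a validated answer, an explicit "no answer", or a `ValueError` that triggers a retry or an error message.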
8. Picking a model — small, medium, large
Most providers ship a model family in three sizes. Anthropic's Claude line (the API the rest of this post uses) is a typical example:
| Model | Best for | Speed | Cost (rough) |
|---|---|---|---|
| Haiku | Classification, extraction, simple tasks, high volume | Fastest | ~$1/M in, $5/M out |
| Sonnet | RAG, summarisation, code, general chat | Mid | ~$3/M in, $15/M out |
| Opus | Complex reasoning, multi-step planning, research | Slowest | ~$15/M in, $75/M out |
The trap most beginners fall into is reaching for the biggest model by default. Don't. Bigger models are not just slower and pricier; they are over-qualified for most production tasks. A well-prompted Haiku can often match a larger model on classification while costing about 15× less per token than Opus at the rates above. Start small, measure quality, and only escalate when the cheaper model genuinely cannot handle the task.
A practical pattern is model routing: pick the cheapest adequate model per task. We cover this in detail in AI-Native Architecture.
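As a rough illustration of routing, the sketch below picks the cheapest model whose tier is judged adequate for a task type. The prices are the rough figures from the table above; the tier numbers, the task-to-tier mapping, and the short model labels are assumptions you would replace with your own measurements and real model IDs.

```python
# Rough per-million-token prices from the table above, cheapest first.
MODELS = [
    {"name": "haiku", "tier": 1, "input_price": 1.0, "output_price": 5.0},
    {"name": "sonnet", "tier": 2, "input_price": 3.0, "output_price": 15.0},
    {"name": "opus", "tier": 3, "input_price": 15.0, "output_price": 75.0},
]

# Hypothetical mapping from task type to the minimum adequate tier.
MIN_TIER = {
    "classification": 1,
    "extraction": 1,
    "summarisation": 2,
    "rag": 2,
    "code": 2,
    "planning": 3,
}
DEFAULT_TIER = 2  # unknown task types fall back to the mid tier


def route(task_type: str) -> str:
    """Return the cheapest model whose tier meets the task's minimum."""
    required = MIN_TIER.get(task_type, DEFAULT_TIER)
    adequate = [m for m in MODELS if m["tier"] >= required]
    return min(adequate, key=lambda m: m["input_price"])["name"]


for task in ("classification", "rag", "planning", "something-new"):
    print(task, "->", route(task))
```

The mapping is the part that deserves real effort: fill it in from measured quality on your own data rather than intuition.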
9. Streaming — why your UI should never block
Because tokens are produced sequentially, the SDK can stream them to your application as they arrive rather than buffering the full response. This matters because a 500-token response can take several seconds end-to-end; streaming makes the experience feel instant.
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about TCP."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
In a web app, you pipe the SDK stream out through Server-Sent Events or a WebSocket so the browser renders tokens as they arrive. The user sees the first word in 200ms instead of staring at a spinner for four seconds. The total cost and latency don't change — perceived performance does.
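Here is a minimal sketch of the Server-Sent Events side of that pipe. It only does the framing: in a real app the `chunks` iterable would be the SDK's text stream and the frames would be written to an HTTP response with the `text/event-stream` content type. The closing `done` event is a convention invented for this example, not part of the SSE standard.

```python
import json
from typing import Iterable, Iterator


def sse_events(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text chunks as Server-Sent Events frames."""
    for chunk in chunks:
        # JSON-encode so newlines inside a chunk cannot break the SSE framing.
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker so the browser knows to stop listening.
    yield "event: done\ndata: {}\n\n"


for frame in sse_events(["Packets ", "drift ", "like leaves\n"]):
    print(repr(frame))
```

On the browser side, an `EventSource` listener would parse each `data:` payload and append the text to the page as it arrives.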
10. Prompt caching — why your AI bill is what it is
A common pattern: a long system prompt or RAG context that is the same across many calls, plus a small user-message tail that varies. Without help, you'd pay full input price for the entire prompt on every call.
Prompt caching lets you mark parts of the prompt as cacheable. The provider stores the model's internal representation of those tokens and reuses it on subsequent calls — usually for around five minutes — charging you a fraction of the normal input price (around 10% for cache reads).
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # 2,000 tokens of instructions
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": user_question},  # only this varies
    ],
)
On a chatty app where the system prompt is constant and dominates each request, this can cut the input portion of your API bill by up to roughly 90%; output tokens are billed as usual. The cache-write call costs a bit more than a normal call, then every read within the TTL window is dramatically cheaper. The break-even is usually around 2–3 requests.
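The arithmetic is easy to check yourself. The sketch below compares input cost with and without caching, assuming every request lands inside the cache window. The 10% read rate comes from the text above; the 1.25× write surcharge and the $3 per million input price are assumptions, so substitute your provider's current figures.

```python
def prompt_cost(
    cached_tokens: int,
    varying_tokens: int,
    requests: int,
    price_per_million: float = 3.0,  # assumed input price in USD
    write_multiplier: float = 1.25,  # assumed cache-write surcharge
    read_multiplier: float = 0.10,   # cache reads at ~10% of input price
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in USD, input tokens only."""
    if requests <= 0:
        return 0.0, 0.0
    per_token = price_per_million / 1_000_000
    uncached = requests * (cached_tokens + varying_tokens) * per_token
    write = cached_tokens * write_multiplier * per_token  # first request
    reads = (requests - 1) * cached_tokens * read_multiplier * per_token
    tail = requests * varying_tokens * per_token          # never cached
    return uncached, write + reads + tail


for n in (1, 2, 100):
    uncached, cached = prompt_cost(cached_tokens=2_000, varying_tokens=50, requests=n)
    print(n, round(uncached, 5), round(cached, 5))
```

Under these assumptions a single request is slightly more expensive with caching, two requests already come out ahead, and the saving on input approaches the read discount as the request count grows.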
11. Tool use — how an LLM does anything outside its head
An LLM by itself only generates text. To do something — query a database, fetch a URL, send an email — it needs tools. The mechanism is simple in concept and consistent across vendors:
- You declare tools in the API call: a name, a description, and a JSON schema of arguments.
- If the model decides a tool is needed, it returns a structured tool_use block instead of a normal text response, with the tool name and arguments it wants you to run.
- Your code runs the tool, gets the result, and sends a follow-up call containing the tool's output.
- The model resumes and either calls another tool or produces a final answer.
tools = [{
    "name": "get_weather",
    "description": "Get the current temperature in Celsius for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in London?"}],
)
# If response.stop_reason == "tool_use", run the requested tool,
# then send a follow-up with the tool result. The model resumes from there.
Loop tool calls and you have an agent: a model + tools + a controller that runs the cycle until the model declares it's done. Agents are not magic; they are this loop with a turn limit, error handling, and (ideally) safety guardrails.
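The controller loop itself is small. The sketch below runs it against a stubbed model so it works offline: `fake_model` stands in for the API call, the tool returns a placeholder value, and the message shapes are simplified rather than the exact format any vendor's API expects.

```python
def get_weather(city: str) -> str:
    """Stub tool: a real implementation would call a weather service."""
    return f"14 C in {city}"  # placeholder value, not real data


TOOLS = {"get_weather": get_weather}


def fake_model(messages: list[dict]) -> dict:
    """Stand-in for an API call: requests the tool once, then answers."""
    last = messages[-1]
    if last["role"] == "tool":
        answer = "It is " + last["content"] + "."
        return {"stop_reason": "end_turn", "text": answer}
    return {"stop_reason": "tool_use", "name": "get_weather", "input": {"city": "London"}}


def run_agent(question: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):                    # turn limit
        response = fake_model(messages)
        if response["stop_reason"] != "tool_use":
            return response["text"]               # the model declared it is done
        tool = TOOLS.get(response["name"])
        try:
            result = tool(**response["input"]) if tool else "error: unknown tool"
        except Exception as exc:                  # report tool failures to the model
            result = f"error: {exc}"
        messages.append({"role": "tool", "content": result})
    return "Stopped: turn limit reached."


print(run_agent("What's the weather in London?"))  # → It is 14 C in London.
```

Swapping `fake_model` for a real API call, and the simplified messages for the vendor's tool-result format, is all that separates this from a working agent. The turn limit and the error path are the parts not to skip.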
12. Training vs inference — what you actually do
A quick demystification of three terms that get conflated:
- Pre-training — done once by the vendor, on a massive corpus of text, costs tens of millions of dollars. You never do this.
- Fine-tuning — taking a pre-trained model and continuing training on a smaller, task-specific dataset. Useful but rarely necessary in 2026 — prompt engineering plus RAG covers most of what fine-tuning used to do, at a fraction of the cost. You almost never need to do this either.
- Inference — sending tokens in, getting tokens out. This is what you do every time you call the API. Inference is what "using an LLM" means for 99% of applications.
If someone says "we should fine-tune a model," ask first: can we solve it with a better system prompt? With RAG? With structured output and a smaller model? Almost always yes.
13. Your first LLM app — 30 lines of Python
Enough theory. Here is a working chat loop with streaming, structured output, and basic error handling. pip install anthropic, set ANTHROPIC_API_KEY, run it.
import json
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

SYSTEM = (
    "You are a concise assistant. When asked a factual question with a "
    "numeric answer, respond with JSON of the form "
    '{"answer": <number>, "unit": "<unit>", "confidence": "high|medium|low"}. '
    "Otherwise reply in plain prose, < 80 words."
)

history = []
while True:
    user_text = input("you ▸ ").strip()
    if not user_text:
        break
    history.append({"role": "user", "content": user_text})
    print("ai ▸ ", end="", flush=True)
    chunks = []
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=SYSTEM,
        messages=history,
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            chunks.append(text)
    print()
    reply = "".join(chunks)
    history.append({"role": "assistant", "content": reply})
    # If the reply looks like JSON, parse it (defensive: the model may not comply)
    if reply.lstrip().startswith("{"):
        try:
            print(" ↳ parsed:", json.loads(reply))
        except json.JSONDecodeError:
            pass
What's happening, in the order you've now learned about:
- We instantiate a client and keep a history list: that's our re-sent conversation context.
- The SYSTEM string is the persistent instruction set.
- Each turn appends the user's message, then opens a streaming request.
- The model emits tokens one at a time; we print each as it arrives and accumulate the full reply.
- We append the model's reply to history so the next turn includes it.
- We attempt to parse JSON because we asked for it, but defensively, because the model might still respond in prose.
That's a complete, useful AI app. From here, every feature you can imagine (RAG, tool use, agents, multi-modal input, async batch processing, prompt caching) is an additive layer on top.
14. Myths busted
- "LLMs reason." They generate text that describes reasoning. Sometimes that is indistinguishable from reasoning; sometimes it falls apart. Treat reasoning chains as evidence to verify, not truth to trust.
- "You need to fine-tune for your domain." Almost never. Strong prompts plus retrieval cover ~95% of "we have proprietary data" use cases at a tenth of the engineering cost.
- "Bigger model = better answers." Bigger model = more capable on the hardest problems. For most production tasks the smallest capable model wins on speed, cost, and (counter-intuitively) reliability.
- "The model remembers our previous chat." No. The client re-sends history every turn. Long sessions get expensive; that is a cost you control by trimming or summarising history before it bloats.
- "Temperature 0 makes it deterministic." Not quite — batching, hardware non-determinism, and provider-side caching can still cause minor variation between calls. Lower variance, not zero.
- "You need GPUs to use AI." To train, yes. To use a hosted API, no — you need a laptop and an API key. Most production AI applications are nothing but HTTP calls.
- "LLMs hallucinate because they are broken." They hallucinate because the prediction loop optimises for plausibility, not truth. Once you internalise that, you stop being surprised and start building defences in code.
15. Where to go next
You now have the mental model. The journey from here is depth in specific directions, depending on what you want to build:
- Build production patterns — read Generative AI with Python: From API Calls to Production Patterns for retries, async, streaming over Server-Sent Events, and cost tracking.
- Add retrieval (RAG) — read RAG with Django: Chat Over Your Wagtail CMS Content for a complete pipeline with pgvector.
- Ship AI inside a Django app — read How to Add an AI Chatbot to Any Django Site in a Weekend for a minimal integration, then Django + Celery: Async AI Tasks Without Blocking Workers for the right way to offload long-running calls.
- Design AI-native systems — read AI-Native Architecture for prompt registries, token budgets, model routing, and human-in-the-loop review.
Summary
An LLM is a stateless next-token predictor. Everything else in the field — prompts, context windows, temperature, hallucinations, RAG, tool use, agents, prompt caching — is mechanical detail layered on that single fact. Once you hold the loop in your head, AI stops feeling like magic and starts feeling like what it is: a fast, fluent, beautifully-flawed function call you can build on.
Three things to take away if you're skimming the end of this post: start small (Haiku, not Opus), set temperature=0 when you want consistency, and always re-send the conversation history (trimmed or summarised once it grows long), because the model never remembers. Build from there.