Generative AI with Python: From API Calls to Production Patterns
GenAI is not magic and it is not a framework you configure — it is a capability you integrate. This post covers everything a working Python developer needs to go from a first API call to patterns that hold up in production: the right mental model for how LLMs work, token arithmetic, prompt engineering that actually changes output quality, structured extraction, tool use, embeddings, RAG, streaming, retries, and cost tracking.
1. The Right Mental Model for LLMs
A large language model is a stateless function. You give it text; it gives you text back. It has no memory between calls, no running state, no awareness of previous requests. Every conversation turn you send the entire conversation history — the model sees it fresh each time and generates the next token based on probability distributions shaped by training.
Concretely, the model's job is: given a sequence of tokens, predict the most probable next token, append it, repeat until it decides to stop. Temperature controls how deterministic that sampling is — temperature 0 always picks the most probable next token (deterministic but boring), temperature 1 samples according to raw probabilities (more varied, occasionally surprising), values above 1 amplify unlikely tokens (creative but unstable).
This matters for your code. When you see confusing model behaviour, ask: what did the full input actually look like? The model can only respond to what you sent. Debugging starts with printing the exact messages array that reached the API.
2. The Python GenAI Ecosystem
The library landscape consolidates quickly. Here is where things stand today and when to reach for each:
# Direct vendor SDKs — start here
pip install anthropic # Claude (Anthropic) — best reasoning, tool use
pip install openai # GPT-4o, o3, embeddings (text-embedding-3-small)
pip install google-generativeai # Gemini 2.0
# Embeddings & vector search
pip install pgvector # PostgreSQL vector extension ORM support
pip install sentence-transformers # local embedding models (no API needed)
pip install chromadb # in-process vector store for prototyping
# Orchestration (reach for these only when you need them)
pip install langchain-core # chains, runnables — useful for complex pipelines
pip install llama-index-core # document ingestion + retrieval pipelines
# Token counting
pip install tiktoken # OpenAI tokenizer (cl100k_base ≈ Claude too)
Rule of thumb: start with the vendor SDK directly. Orchestration frameworks
like LangChain and LlamaIndex add value when you have multi-step pipelines with many moving
parts, but they also add abstraction layers that make debugging harder. For a single LLM
call in a Django view, anthropic.Anthropic().messages.create() is all you need.
If you are building provider-agnostic code, look at LiteLLM — it wraps every major provider behind one interface so you can swap models without rewriting call sites.
3. Tokens, Context Windows & Cost
A token is roughly 4 characters of English text — not a word, not a character. A thousand tokens is about 750 words. Models do not see characters; they see token IDs. This matters for three things: context window limits, latency, and cost.
Count tokens before you send, not after you get a bill:
import tiktoken
enc = tiktoken.get_encoding('cl100k_base') # closest approximation for Claude
def count_tokens(messages: list[dict]) -> int:
total = 0
for msg in messages:
total += 4 # per-message overhead
for value in msg.values():
total += len(enc.encode(str(value)))
return total + 2 # reply priming
messages = [
{'role': 'user', 'content': 'Explain async/await in Python in two sentences.'}
]
print(count_tokens(messages)) # → ~20 tokens
Claude's models have these context windows as of mid-2026:
# Context window limits (input + output combined)
claude-haiku-4-5 → 200 000 tokens # fast, cheap, good for classification
claude-sonnet-4-6 → 200 000 tokens # best balance of quality and cost
claude-opus-4-7 → 200 000 tokens # highest capability, highest cost
# Rough pricing guide (input / output per million tokens — verify on pricing page)
# Haiku: ~$0.80 / $4
# Sonnet: ~$3 / $15
# Opus: ~$15 / $75
The practical implication: a 200k token context window can hold a ~150,000 word document — roughly a full novel — plus the entire conversation history. The limit rarely bites in typical applications, but token cost does. Build token counting and usage logging from day one.
4. Your First LLM Call
The Anthropic Python SDK is the most straightforward way to call Claude. Install it, set
your key, call messages.create():
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from env
response = client.messages.create(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
system = 'You are a concise Python tutor. Explain concepts in plain English.',
messages = [
{'role': 'user', 'content': 'What is the difference between a list and a tuple?'},
],
)
print(response.content[0].text)
# "Lists are mutable — you can change them after creation.
# Tuples are immutable — once created, they cannot be modified.
# Use tuples for fixed data like coordinates; lists for collections that change."
# Token usage
print(response.usage.input_tokens, response.usage.output_tokens)
The response object is not just a string. It carries usage data, stop reason, and the full
content array (which may contain multiple content blocks when tool use is involved).
Always check response.stop_reason — if it is "max_tokens" instead
of "end_turn", the response was cut short and you need to increase
max_tokens.
Multi-turn conversations
To hold a conversation, append each turn to the messages list yourself:
history = []
def chat(user_message: str) -> str:
history.append({'role': 'user', 'content': user_message})
response = client.messages.create(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
system = 'You are a Python tutor.',
messages = history,
)
assistant_reply = response.content[0].text
history.append({'role': 'assistant', 'content': assistant_reply})
return assistant_reply
print(chat('What is a decorator?'))
print(chat('Show me an example with a timing decorator.')) # has full context
In a web application, history lives in a database or cache keyed on the
session — not in a global variable. See the
Django chatbot post for the full model +
view pattern.
5. Prompt Engineering Patterns
The system prompt is the highest-leverage variable in your application. A bad system prompt produces bad output regardless of model size. Here are the patterns that reliably improve output quality:
Zero-shot: just ask clearly
system = """
You are a code reviewer for a Python backend team.
Rules:
- Identify bugs, not style preferences
- Explain WHY something is a bug, not just what it is
- If the code is correct, say so — do not invent issues
- Use bullet points, one bug per bullet
- Be direct. No preamble.
"""
Few-shot: show the model the format you want
When zero-shot output format is inconsistent, add 2–3 examples in the messages array before the real request:
messages = [
# Example 1
{'role': 'user', 'content': 'Classify: "The server is down again"'},
{'role': 'assistant', 'content': 'category: incident\nurgency: high'},
# Example 2
{'role': 'user', 'content': 'Classify: "Could you add dark mode?"'},
{'role': 'assistant', 'content': 'category: feature_request\nurgency: low'},
# Real request
{'role': 'user', 'content': f'Classify: "{user_ticket}"'},
]
Chain-of-thought: ask it to think first
For reasoning tasks, instruct the model to work through the problem before answering. This produces measurably better output on multi-step problems:
system = """
When asked to debug code:
1. First, trace through the code step by step in a block.
2. Then provide the diagnosis and fix after .
Your final answer should come after the thinking block, not inside it.
"""
Prompting anti-patterns to avoid
- "Be as detailed as possible" — leads to padding and repetition. Specify length: "answer in 3 bullet points" or "answer in under 100 words".
- "Don't hallucinate" — the model cannot comply with this instruction; it doesn't know when it's hallucinating. Instead: "if you are not certain, say 'I don't know'".
- Huge system prompts with contradictory rules — models follow the most recent instruction when rules conflict. Keep the system prompt focused on one domain.
- Putting critical instructions only in the system prompt — for truly important constraints (format, length, language), repeat them at the end of the user message too.
6. Structured Output: Getting JSON from LLMs
LLMs are text generators. Getting reliably valid JSON requires explicit guidance. There are two reliable approaches: asking for JSON in the prompt with a strict schema example, or using tool use to force structured output (covered in the next section).
JSON mode via prompt
import json
import anthropic
client = anthropic.Anthropic()
def extract_invoice_data(raw_text: str) -> dict:
response = client.messages.create(
model = 'claude-sonnet-4-6',
max_tokens = 512,
system = """Extract invoice data and return ONLY valid JSON.
No markdown, no explanation, no code fences. Raw JSON only.
Schema:
{
"vendor": "string",
"amount": number,
"currency": "string (ISO 4217)",
"date": "string (YYYY-MM-DD)",
"line_items": [{"description": "string", "amount": number}]
}""",
messages = [{'role': 'user', 'content': raw_text}],
)
text = response.content[0].text.strip()
# Strip accidental markdown fences if the model disobeys
if text.startswith('```'):
text = text.split('```')[1]
if text.startswith('json'):
text = text[4:]
return json.loads(text)
Validation matters. Always wrap json.loads() in a try/except
and validate the result with Pydantic before using it downstream:
from pydantic import BaseModel, field_validator
from decimal import Decimal
class LineItem(BaseModel):
description: str
amount: Decimal
class Invoice(BaseModel):
vendor: str
amount: Decimal
currency: str
date: str
line_items: list[LineItem]
@field_validator('currency')
@classmethod
def must_be_iso(cls, v: str) -> str:
if len(v) != 3 or not v.isupper():
raise ValueError('currency must be 3-letter ISO 4217 code')
return v
try:
raw = extract_invoice_data(invoice_text)
invoice = Invoice(**raw)
except (json.JSONDecodeError, ValueError) as e:
# Retry once or raise to caller
raise ValueError(f'LLM returned invalid invoice structure: {e}')
7. Tool Use & Function Calling
Tool use (also called function calling) lets the model decide to call a Python function rather than generating free text. The model does not execute the function — it returns a structured call spec, your code executes it, and you send the result back for the final response.
This is the cleanest way to get structured output: define a tool with a JSON Schema, the model will always return a valid call spec conforming to that schema.
import anthropic
import json
client = anthropic.Anthropic()
# Define tools the model can call
tools = [
{
'name': 'get_weather',
'description': 'Get current weather for a city.',
'input_schema': {
'type': 'object',
'properties': {
'city': {'type': 'string', 'description': 'City name'},
'country': {'type': 'string', 'description': 'ISO country code'},
},
'required': ['city'],
},
},
{
'name': 'search_docs',
'description': 'Search the internal knowledge base.',
'input_schema': {
'type': 'object',
'properties': {
'query': {'type': 'string'},
'limit': {'type': 'integer', 'default': 5},
},
'required': ['query'],
},
},
]
def run_tool(name: str, inputs: dict) -> str:
"""Dispatch tool calls to actual Python functions."""
if name == 'get_weather':
return json.dumps({'temp': 18, 'condition': 'cloudy', 'city': inputs['city']})
if name == 'search_docs':
return json.dumps({'results': ['Doc A', 'Doc B']})
raise ValueError(f'Unknown tool: {name}')
def agent_loop(user_message: str) -> str:
messages = [{'role': 'user', 'content': user_message}]
while True:
response = client.messages.create(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
tools = tools,
messages = messages,
)
# Model finished — return text
if response.stop_reason == 'end_turn':
return response.content[0].text
# Model wants to call tools
if response.stop_reason == 'tool_use':
# Add the assistant's tool-call turn to history
messages.append({'role': 'assistant', 'content': response.content})
# Execute each requested tool and collect results
tool_results = []
for block in response.content:
if block.type == 'tool_use':
result = run_tool(block.name, block.input)
tool_results.append({
'type': 'tool_result',
'tool_use_id': block.id,
'content': result,
})
# Feed results back to the model
messages.append({'role': 'user', 'content': tool_results})
# Loop — the model will now generate a final response
print(agent_loop('What is the weather in London?'))
The agent_loop function illustrates the agentic pattern: the model may call
multiple tools in sequence before producing a final answer. Each iteration of the while loop
is one round-trip to the API. Add a max_iterations guard to prevent runaway loops:
MAX_ITERATIONS = 10
def agent_loop(user_message: str) -> str:
messages = [{'role': 'user', 'content': user_message}]
for _ in range(MAX_ITERATIONS):
response = client.messages.create(...)
if response.stop_reason == 'end_turn':
return response.content[0].text
# ... handle tool_use ...
raise RuntimeError('Agent exceeded max iterations without reaching end_turn')
8. Embeddings & Semantic Search
An embedding is a dense vector — typically 256 to 3072 floating-point numbers — that encodes the semantic meaning of a piece of text. Two texts with similar meanings have vectors that are close together in that high-dimensional space, measured by cosine similarity or dot product.
Embeddings power semantic search (find documents by meaning, not keyword), clustering, deduplication, and the retrieval step in RAG.
Generating embeddings with the OpenAI SDK
from openai import OpenAI
import numpy as np
client = OpenAI() # reads OPENAI_API_KEY from env
def embed(texts: list[str]) -> list[list[float]]:
"""Embed a batch of texts. Max ~8000 tokens per text."""
response = client.embeddings.create(
model = 'text-embedding-3-small', # 1536-dim, cheapest
input = texts,
)
return [item.embedding for item in response.data]
def cosine_similarity(a: list[float], b: list[float]) -> float:
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Example
query_vec = embed(['How do I handle async errors in Python?'])[0]
doc_vecs = embed([
'asyncio error handling with try/except',
'Django ORM queryset caching',
'Python async/await exception patterns',
])
similarities = [cosine_similarity(query_vec, dv) for dv in doc_vecs]
# → [0.91, 0.23, 0.88] — first and third are relevant, second is not
Storing embeddings in PostgreSQL with pgvector
# Install: pip install pgvector psycopg2-binary
# Database: CREATE EXTENSION IF NOT EXISTS vector;
from pgvector.django import VectorField, CosineDistance
from django.db import models
class Document(models.Model):
content = models.TextField()
embedding = VectorField(dimensions=1536)
created = models.DateTimeField(auto_now_add=True)
class Meta:
indexes = [
# HNSW index — fast approximate nearest neighbour search
models.Index(
fields=['embedding'],
name='doc_embedding_hnsw',
opclasses=['vector_cosine_ops'],
)
]
def semantic_search(query: str, limit: int = 5) -> list[Document]:
query_vec = embed([query])[0]
return (
Document.objects
.annotate(distance=CosineDistance('embedding', query_vec))
.order_by('distance')[:limit]
)
9. Retrieval-Augmented Generation (RAG)
RAG is the pattern of augmenting an LLM prompt with retrieved context before asking it to answer. It solves the two biggest LLM limitations: knowledge cutoffs and hallucination on proprietary data. The LLM is not expected to know the answer — it is expected to read the retrieved documents and synthesise an answer from them.
The full pipeline has five stages:
- Ingest: split documents into chunks (~500 tokens with 50-token overlap)
- Embed: embed each chunk and store the vector in a database
- Retrieve: embed the user query, find the top-k nearest chunks
- Augment: insert the retrieved chunks into the prompt as context
- Generate: call the LLM and return its answer
# Minimal RAG pipeline in Python
import anthropic
from openai import OpenAI
ac = anthropic.Anthropic()
oc = OpenAI()
CHUNK_SIZE = 500 # tokens
CHUNK_OVERLAP = 50
TOP_K = 4
# ── Stage 1 & 2: Ingest & embed ──────────────────────────────────────────
def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
words = text.split()
chunks = []
start = 0
while start < len(words):
end = start + size
chunks.append(' '.join(words[start:end]))
start = end - overlap
return chunks
def ingest_document(doc_id: int, text: str) -> None:
from myapp.models import DocumentChunk # your Django model
chunks = chunk_text(text)
embeddings = oc.embeddings.create(
model = 'text-embedding-3-small', input = chunks
).data
DocumentChunk.objects.bulk_create([
DocumentChunk(
document_id = doc_id,
content = chunk,
embedding = emb.embedding,
)
for chunk, emb in zip(chunks, embeddings)
])
# ── Stage 3 & 4 & 5: Query-time RAG ─────────────────────────────────────
def rag_query(user_question: str) -> str:
# Embed the question
q_vec = oc.embeddings.create(
model = 'text-embedding-3-small', input = [user_question]
).data[0].embedding
# Retrieve top-k relevant chunks
from myapp.models import DocumentChunk
from pgvector.django import CosineDistance
chunks = (
DocumentChunk.objects
.annotate(dist=CosineDistance('embedding', q_vec))
.order_by('dist')[:TOP_K]
)
# Build the augmented prompt
context = '\n\n---\n\n'.join(c.content for c in chunks)
prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, say "I don't have that information."
CONTEXT:
{context}
QUESTION: {user_question}"""
response = ac.messages.create(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
messages = [{'role': 'user', 'content': prompt}],
)
return response.content[0].text
The phrase "using ONLY the context below" is load-bearing. Without it, the model blends retrieved context with its training knowledge, making hallucinations impossible to audit. Grounding the model in the retrieved text means every answer is traceable to a source chunk.
For a full production RAG implementation with Wagtail CMS as the document source, see RAG with Django: Chat Over Your Wagtail CMS Content.
10. Async & Streaming Responses
Streaming matters for user experience. A response that starts appearing in 300ms feels instant even if the full reply takes 8 seconds. Without streaming, users watch a spinner for the full duration and see text appear all at once.
Synchronous streaming
import anthropic
client = anthropic.Anthropic()
with client.messages.stream(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
messages = [{'role': 'user', 'content': 'Explain the GIL in Python.'}],
) as stream:
for text in stream.text_stream:
print(text, end='', flush=True)
# After the stream — get the complete message with usage stats
message = stream.get_final_message()
print(f'\n\nInput tokens: {message.usage.input_tokens}')
print(f'Output tokens: {message.usage.output_tokens}')
Async streaming (for async frameworks)
import asyncio
import anthropic
async def stream_response(question: str) -> str:
client = anthropic.AsyncAnthropic()
accumulated = ''
async with client.messages.stream(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
messages = [{'role': 'user', 'content': question}],
) as stream:
async for text in stream.text_stream:
accumulated += text
print(text, end='', flush=True)
return accumulated
asyncio.run(stream_response('What is asyncio.gather()?'))
Streaming in Django with Server-Sent Events
In a Django view, wrap the generator in a StreamingHttpResponse with the
text/event-stream content type. Set X-Accel-Buffering: no to
prevent nginx from buffering the stream before it reaches the browser:
from django.http import StreamingHttpResponse
import json
def stream_view(request):
def event_generator():
with client.messages.stream(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
messages = [{'role': 'user', 'content': request.GET.get('q', '')}],
) as stream:
for text in stream.text_stream:
yield f'data: {json.dumps({"token": text})}\n\n'
yield 'data: [DONE]\n\n'
resp = StreamingHttpResponse(event_generator(), content_type='text/event-stream')
resp['Cache-Control'] = 'no-cache'
resp['X-Accel-Buffering'] = 'no'
return resp
11. Production Patterns
Retries with exponential backoff
LLM APIs rate-limit under load and occasionally return 529 (overloaded). The Anthropic SDK retries automatically by default (2 retries). For higher-traffic scenarios, configure the retry behaviour explicitly:
import anthropic
from anthropic import APIStatusError, RateLimitError
client = anthropic.Anthropic(
max_retries = 4, # SDK handles exponential backoff automatically
timeout = 60.0, # per-request timeout in seconds
)
def safe_create(messages: list[dict], **kwargs) -> str:
try:
response = client.messages.create(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
messages = messages,
**kwargs,
)
return response.content[0].text
except RateLimitError:
raise # Let the caller decide — or queue for async retry via Celery
except APIStatusError as e:
if e.status_code == 529: # model overloaded
raise # retry at a higher level
raise # re-raise all other API errors
Prompt caching (Anthropic)
If your system prompt or context documents are the same across many requests, enable prompt
caching. Anthropic caches the first 1,024+ tokens marked with cache_control
and charges ~10% of the normal input token price on cache hits:
LONG_SYSTEM_CONTEXT = "..." * 500 # a large, stable context
response = client.messages.create(
model = 'claude-sonnet-4-6',
max_tokens = 1024,
system = [
{
'type': 'text',
'text': LONG_SYSTEM_CONTEXT,
'cache_control': {'type': 'ephemeral'}, # cache this block
}
],
messages = [{'role': 'user', 'content': user_query}],
)
# First request: normal cost. Subsequent requests within 5 min: ~10% cost.
# Check cache hit: response.usage.cache_read_input_tokens > 0
Cost tracking
Log every token count to a database table. This lets you attribute cost to users, endpoints, or features and catch runaway usage before the bill arrives:
from django.db import models
class LLMUsageLog(models.Model):
endpoint = models.CharField(max_length=100)
model = models.CharField(max_length=60)
input_tokens = models.PositiveIntegerField()
output_tokens = models.PositiveIntegerField()
cache_read = models.PositiveIntegerField(default=0)
user_id = models.IntegerField(null=True)
created = models.DateTimeField(auto_now_add=True)
@property
def estimated_cost_usd(self) -> float:
# Claude Sonnet-4-6 pricing (verify on Anthropic pricing page)
input_cost = (self.input_tokens - self.cache_read) * 3.0 / 1_000_000
cache_cost = self.cache_read * 0.3 / 1_000_000
output_cost = self.output_tokens * 15.0 / 1_000_000
return round(input_cost + cache_cost + output_cost, 6)
def log_usage(response, endpoint: str, user_id: int | None = None) -> None:
LLMUsageLog.objects.create(
endpoint = endpoint,
model = response.model,
input_tokens = response.usage.input_tokens,
output_tokens = response.usage.output_tokens,
cache_read = getattr(response.usage, 'cache_read_input_tokens', 0),
user_id = user_id,
)
Production checklist
- Never hardcode API keys. Use environment variables or a secrets manager. Rotate if exposed.
- Set
max_tokensexplicitly. Without it, some SDKs default to the model maximum, producing unexpectedly large responses and costs. - Validate all LLM output with Pydantic before using it as data. Text is never a safe input to downstream logic.
- Rate-limit per user at the application layer, not just at the API level. A single user can exhaust your API quota if you don't enforce per-session limits.
- Handle
stop_reason == "max_tokens". Either increase the limit or return a partial-response warning to the user. - Log every call — model, tokens, endpoint, user — from day one. Debugging cost spikes after the fact without logs is painful.
- Use async Celery tasks for non-interactive LLM calls (batch processing, background enrichment). Never block an HTTP worker thread on a multi-second API call. See the Celery async AI tasks post for the full pattern.
- Test with
model="claude-haiku-4-5"during development. It is fast and cheap. Switch to a more capable model only when you measure a quality gap.