Python AI GenAI LLM

Generative AI with Python: From API Calls to Production Patterns

GenAI is not magic and it is not a framework you configure — it is a capability you integrate. This post covers everything a working Python developer needs to go from a first API call to patterns that hold up in production: the right mental model for how LLMs work, token arithmetic, prompt engineering that actually changes output quality, structured extraction, tool use, embeddings, RAG, streaming, retries, and cost tracking.

1. The Right Mental Model for LLMs

A large language model is a stateless function. You give it text; it gives you text back. It has no memory between calls, no running state, no awareness of previous requests. Every conversation turn you send the entire conversation history — the model sees it fresh each time and generates the next token based on probability distributions shaped by training.

Concretely, the model's job is: given a sequence of tokens, predict the most probable next token, append it, repeat until it decides to stop. Temperature controls how deterministic that sampling is — temperature 0 always picks the most probable next token (deterministic but boring), temperature 1 samples according to raw probabilities (more varied, occasionally surprising), values above 1 amplify unlikely tokens (creative but unstable).

This matters for your code. When you see confusing model behaviour, ask: what did the full input actually look like? The model can only respond to what you sent. Debugging starts with printing the exact messages array that reached the API.

INPUT (messages list) system "You are a helpful..." user "Summarise this..." assistant "Sure! ..." LLM stateless · no memory P(next token | all tokens) temperature · max_tokens · top_p COMPLETION response.content[0].text "Here's the summary:" + usage.input_tokens + usage.output_tokens History is YOUR responsibility — resend it every turn. The model sees nothing between requests.
Every LLM call is stateless. You send the full conversation history; the model sees nothing between requests. History management is your code's job.

2. The Python GenAI Ecosystem

The library landscape consolidates quickly. Here is where things stand today and when to reach for each:

# Direct vendor SDKs — start here
pip install anthropic          # Claude (Anthropic) — best reasoning, tool use
pip install openai             # GPT-4o, o3, embeddings (text-embedding-3-small)
pip install google-generativeai  # Gemini 2.0

# Embeddings & vector search
pip install pgvector            # PostgreSQL vector extension ORM support
pip install sentence-transformers  # local embedding models (no API needed)
pip install chromadb            # in-process vector store for prototyping

# Orchestration (reach for these only when you need them)
pip install langchain-core      # chains, runnables — useful for complex pipelines
pip install llama-index-core    # document ingestion + retrieval pipelines

# Token counting
pip install tiktoken            # OpenAI tokenizer (cl100k_base ≈ Claude too)

Rule of thumb: start with the vendor SDK directly. Orchestration frameworks like LangChain and LlamaIndex add value when you have multi-step pipelines with many moving parts, but they also add abstraction layers that make debugging harder. For a single LLM call in a Django view, anthropic.Anthropic().messages.create() is all you need.

If you are building provider-agnostic code, look at LiteLLM — it wraps every major provider behind one interface so you can swap models without rewriting call sites.


3. Tokens, Context Windows & Cost

A token is roughly 4 characters of English text — not a word, not a character. A thousand tokens is about 750 words. Models do not see characters; they see token IDs. This matters for three things: context window limits, latency, and cost.

Count tokens before you send, not after you get a bill:

import tiktoken

enc = tiktoken.get_encoding('cl100k_base')  # closest approximation for Claude

def count_tokens(messages: list[dict]) -> int:
    total = 0
    for msg in messages:
        total += 4  # per-message overhead
        for value in msg.values():
            total += len(enc.encode(str(value)))
    return total + 2  # reply priming

messages = [
    {'role': 'user', 'content': 'Explain async/await in Python in two sentences.'}
]
print(count_tokens(messages))  # → ~20 tokens

Claude's models have these context windows as of mid-2026:

# Context window limits (input + output combined)
claude-haiku-4-5  →  200 000 tokens   # fast, cheap, good for classification
claude-sonnet-4-6 →  200 000 tokens   # best balance of quality and cost
claude-opus-4-7   →  200 000 tokens   # highest capability, highest cost

# Rough pricing guide (input / output per million tokens — verify on pricing page)
# Haiku:   ~$0.80  / $4
# Sonnet:  ~$3     / $15
# Opus:    ~$15    / $75

The practical implication: a 200k token context window can hold a ~150,000 word document — roughly a full novel — plus the entire conversation history. The limit rarely bites in typical applications, but token cost does. Build token counting and usage logging from day one.


4. Your First LLM Call

The Anthropic Python SDK is the most straightforward way to call Claude. Install it, set your key, call messages.create():

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model      = 'claude-sonnet-4-6',
    max_tokens = 1024,
    system     = 'You are a concise Python tutor. Explain concepts in plain English.',
    messages   = [
        {'role': 'user', 'content': 'What is the difference between a list and a tuple?'},
    ],
)

print(response.content[0].text)
# "Lists are mutable — you can change them after creation.
#  Tuples are immutable — once created, they cannot be modified.
#  Use tuples for fixed data like coordinates; lists for collections that change."

# Token usage
print(response.usage.input_tokens, response.usage.output_tokens)

The response object is not just a string. It carries usage data, stop reason, and the full content array (which may contain multiple content blocks when tool use is involved). Always check response.stop_reason — if it is "max_tokens" instead of "end_turn", the response was cut short and you need to increase max_tokens.

Multi-turn conversations

To hold a conversation, append each turn to the messages list yourself:

history = []

def chat(user_message: str) -> str:
    history.append({'role': 'user', 'content': user_message})

    response = client.messages.create(
        model    = 'claude-sonnet-4-6',
        max_tokens = 1024,
        system   = 'You are a Python tutor.',
        messages = history,
    )

    assistant_reply = response.content[0].text
    history.append({'role': 'assistant', 'content': assistant_reply})
    return assistant_reply

print(chat('What is a decorator?'))
print(chat('Show me an example with a timing decorator.'))  # has full context

In a web application, history lives in a database or cache keyed on the session — not in a global variable. See the Django chatbot post for the full model + view pattern.


5. Prompt Engineering Patterns

The system prompt is the highest-leverage variable in your application. A bad system prompt produces bad output regardless of model size. Here are the patterns that reliably improve output quality:

Zero-shot: just ask clearly

system = """
You are a code reviewer for a Python backend team.

Rules:
- Identify bugs, not style preferences
- Explain WHY something is a bug, not just what it is
- If the code is correct, say so — do not invent issues
- Use bullet points, one bug per bullet
- Be direct. No preamble.
"""

Few-shot: show the model the format you want

When zero-shot output format is inconsistent, add 2–3 examples in the messages array before the real request:

messages = [
    # Example 1
    {'role': 'user',      'content': 'Classify: "The server is down again"'},
    {'role': 'assistant', 'content': 'category: incident\nurgency: high'},

    # Example 2
    {'role': 'user',      'content': 'Classify: "Could you add dark mode?"'},
    {'role': 'assistant', 'content': 'category: feature_request\nurgency: low'},

    # Real request
    {'role': 'user',      'content': f'Classify: "{user_ticket}"'},
]

Chain-of-thought: ask it to think first

For reasoning tasks, instruct the model to work through the problem before answering. This produces measurably better output on multi-step problems:

system = """
When asked to debug code:
1. First, trace through the code step by step in a  block.
2. Then provide the diagnosis and fix after .

Your final answer should come after the thinking block, not inside it.
"""

Prompting anti-patterns to avoid

  • "Be as detailed as possible" — leads to padding and repetition. Specify length: "answer in 3 bullet points" or "answer in under 100 words".
  • "Don't hallucinate" — the model cannot comply with this instruction; it doesn't know when it's hallucinating. Instead: "if you are not certain, say 'I don't know'".
  • Huge system prompts with contradictory rules — models follow the most recent instruction when rules conflict. Keep the system prompt focused on one domain.
  • Putting critical instructions only in the system prompt — for truly important constraints (format, length, language), repeat them at the end of the user message too.

6. Structured Output: Getting JSON from LLMs

LLMs are text generators. Getting reliably valid JSON requires explicit guidance. There are two reliable approaches: asking for JSON in the prompt with a strict schema example, or using tool use to force structured output (covered in the next section).

JSON mode via prompt

import json
import anthropic

client = anthropic.Anthropic()

def extract_invoice_data(raw_text: str) -> dict:
    response = client.messages.create(
        model      = 'claude-sonnet-4-6',
        max_tokens = 512,
        system     = """Extract invoice data and return ONLY valid JSON.
No markdown, no explanation, no code fences. Raw JSON only.

Schema:
{
  "vendor": "string",
  "amount": number,
  "currency": "string (ISO 4217)",
  "date": "string (YYYY-MM-DD)",
  "line_items": [{"description": "string", "amount": number}]
}""",
        messages = [{'role': 'user', 'content': raw_text}],
    )

    text = response.content[0].text.strip()
    # Strip accidental markdown fences if the model disobeys
    if text.startswith('```'):
        text = text.split('```')[1]
        if text.startswith('json'):
            text = text[4:]
    return json.loads(text)

Validation matters. Always wrap json.loads() in a try/except and validate the result with Pydantic before using it downstream:

from pydantic import BaseModel, field_validator
from decimal import Decimal


class LineItem(BaseModel):
    description: str
    amount: Decimal


class Invoice(BaseModel):
    vendor: str
    amount: Decimal
    currency: str
    date: str
    line_items: list[LineItem]

    @field_validator('currency')
    @classmethod
    def must_be_iso(cls, v: str) -> str:
        if len(v) != 3 or not v.isupper():
            raise ValueError('currency must be 3-letter ISO 4217 code')
        return v


try:
    raw = extract_invoice_data(invoice_text)
    invoice = Invoice(**raw)
except (json.JSONDecodeError, ValueError) as e:
    # Retry once or raise to caller
    raise ValueError(f'LLM returned invalid invoice structure: {e}')

7. Tool Use & Function Calling

Tool use (also called function calling) lets the model decide to call a Python function rather than generating free text. The model does not execute the function — it returns a structured call spec, your code executes it, and you send the result back for the final response.

This is the cleanest way to get structured output: define a tool with a JSON Schema, the model will always return a valid call spec conforming to that schema.

import anthropic
import json

client = anthropic.Anthropic()

# Define tools the model can call
tools = [
    {
        'name': 'get_weather',
        'description': 'Get current weather for a city.',
        'input_schema': {
            'type': 'object',
            'properties': {
                'city':    {'type': 'string', 'description': 'City name'},
                'country': {'type': 'string', 'description': 'ISO country code'},
            },
            'required': ['city'],
        },
    },
    {
        'name': 'search_docs',
        'description': 'Search the internal knowledge base.',
        'input_schema': {
            'type': 'object',
            'properties': {
                'query': {'type': 'string'},
                'limit': {'type': 'integer', 'default': 5},
            },
            'required': ['query'],
        },
    },
]


def run_tool(name: str, inputs: dict) -> str:
    """Dispatch tool calls to actual Python functions."""
    if name == 'get_weather':
        return json.dumps({'temp': 18, 'condition': 'cloudy', 'city': inputs['city']})
    if name == 'search_docs':
        return json.dumps({'results': ['Doc A', 'Doc B']})
    raise ValueError(f'Unknown tool: {name}')


def agent_loop(user_message: str) -> str:
    messages = [{'role': 'user', 'content': user_message}]

    while True:
        response = client.messages.create(
            model    = 'claude-sonnet-4-6',
            max_tokens = 1024,
            tools    = tools,
            messages = messages,
        )

        # Model finished — return text
        if response.stop_reason == 'end_turn':
            return response.content[0].text

        # Model wants to call tools
        if response.stop_reason == 'tool_use':
            # Add the assistant's tool-call turn to history
            messages.append({'role': 'assistant', 'content': response.content})

            # Execute each requested tool and collect results
            tool_results = []
            for block in response.content:
                if block.type == 'tool_use':
                    result = run_tool(block.name, block.input)
                    tool_results.append({
                        'type':        'tool_result',
                        'tool_use_id': block.id,
                        'content':     result,
                    })

            # Feed results back to the model
            messages.append({'role': 'user', 'content': tool_results})
            # Loop — the model will now generate a final response

print(agent_loop('What is the weather in London?'))

The agent_loop function illustrates the agentic pattern: the model may call multiple tools in sequence before producing a final answer. Each iteration of the while loop is one round-trip to the API. Add a max_iterations guard to prevent runaway loops:

MAX_ITERATIONS = 10

def agent_loop(user_message: str) -> str:
    messages = [{'role': 'user', 'content': user_message}]

    for _ in range(MAX_ITERATIONS):
        response = client.messages.create(...)
        if response.stop_reason == 'end_turn':
            return response.content[0].text
        # ... handle tool_use ...

    raise RuntimeError('Agent exceeded max iterations without reaching end_turn')

8. Embeddings & Semantic Search

An embedding is a dense vector — typically 256 to 3072 floating-point numbers — that encodes the semantic meaning of a piece of text. Two texts with similar meanings have vectors that are close together in that high-dimensional space, measured by cosine similarity or dot product.

Embeddings power semantic search (find documents by meaning, not keyword), clustering, deduplication, and the retrieval step in RAG.

TEXT "async/await in Python" EMBEDDING MODEL text-embedding-3-small or sentence-transformers VECTOR (1536 floats) [0.021, -0.148, 0.305, 0.067, -0.211, ...] Similar text → close vectors · cosine similarity comparison Embeddings are deterministic for the same model — same input always produces the same vector.
Text → embedding model → dense vector. Similar texts produce geometrically close vectors. This enables search-by-meaning rather than search-by-keyword.

Generating embeddings with the OpenAI SDK

from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from env

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts. Max ~8000 tokens per text."""
    response = client.embeddings.create(
        model = 'text-embedding-3-small',  # 1536-dim, cheapest
        input = texts,
    )
    return [item.embedding for item in response.data]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Example
query_vec = embed(['How do I handle async errors in Python?'])[0]
doc_vecs  = embed([
    'asyncio error handling with try/except',
    'Django ORM queryset caching',
    'Python async/await exception patterns',
])

similarities = [cosine_similarity(query_vec, dv) for dv in doc_vecs]
# → [0.91, 0.23, 0.88]  — first and third are relevant, second is not

Storing embeddings in PostgreSQL with pgvector

# Install: pip install pgvector psycopg2-binary
# Database: CREATE EXTENSION IF NOT EXISTS vector;

from pgvector.django import VectorField, CosineDistance
from django.db import models


class Document(models.Model):
    content   = models.TextField()
    embedding = VectorField(dimensions=1536)
    created   = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            # HNSW index — fast approximate nearest neighbour search
            models.Index(
                fields=['embedding'],
                name='doc_embedding_hnsw',
                opclasses=['vector_cosine_ops'],
            )
        ]


def semantic_search(query: str, limit: int = 5) -> list[Document]:
    query_vec = embed([query])[0]
    return (
        Document.objects
        .annotate(distance=CosineDistance('embedding', query_vec))
        .order_by('distance')[:limit]
    )

9. Retrieval-Augmented Generation (RAG)

RAG is the pattern of augmenting an LLM prompt with retrieved context before asking it to answer. It solves the two biggest LLM limitations: knowledge cutoffs and hallucination on proprietary data. The LLM is not expected to know the answer — it is expected to read the retrieved documents and synthesise an answer from them.

The full pipeline has five stages:

  • Ingest: split documents into chunks (~500 tokens with 50-token overlap)
  • Embed: embed each chunk and store the vector in a database
  • Retrieve: embed the user query, find the top-k nearest chunks
  • Augment: insert the retrieved chunks into the prompt as context
  • Generate: call the LLM and return its answer
# Minimal RAG pipeline in Python
import anthropic
from openai import OpenAI

ac = anthropic.Anthropic()
oc = OpenAI()

CHUNK_SIZE    = 500   # tokens
CHUNK_OVERLAP = 50
TOP_K         = 4


# ── Stage 1 & 2: Ingest & embed ──────────────────────────────────────────
def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    words  = text.split()
    chunks = []
    start  = 0
    while start < len(words):
        end = start + size
        chunks.append(' '.join(words[start:end]))
        start = end - overlap
    return chunks


def ingest_document(doc_id: int, text: str) -> None:
    from myapp.models import DocumentChunk  # your Django model
    chunks    = chunk_text(text)
    embeddings = oc.embeddings.create(
        model = 'text-embedding-3-small', input = chunks
    ).data

    DocumentChunk.objects.bulk_create([
        DocumentChunk(
            document_id = doc_id,
            content     = chunk,
            embedding   = emb.embedding,
        )
        for chunk, emb in zip(chunks, embeddings)
    ])


# ── Stage 3 & 4 & 5: Query-time RAG ─────────────────────────────────────
def rag_query(user_question: str) -> str:
    # Embed the question
    q_vec = oc.embeddings.create(
        model = 'text-embedding-3-small', input = [user_question]
    ).data[0].embedding

    # Retrieve top-k relevant chunks
    from myapp.models import DocumentChunk
    from pgvector.django import CosineDistance

    chunks = (
        DocumentChunk.objects
        .annotate(dist=CosineDistance('embedding', q_vec))
        .order_by('dist')[:TOP_K]
    )

    # Build the augmented prompt
    context = '\n\n---\n\n'.join(c.content for c in chunks)
    prompt  = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, say "I don't have that information."

CONTEXT:
{context}

QUESTION: {user_question}"""

    response = ac.messages.create(
        model      = 'claude-sonnet-4-6',
        max_tokens = 1024,
        messages   = [{'role': 'user', 'content': prompt}],
    )
    return response.content[0].text

The phrase "using ONLY the context below" is load-bearing. Without it, the model blends retrieved context with its training knowledge, making hallucinations impossible to audit. Grounding the model in the retrieved text means every answer is traceable to a source chunk.

For a full production RAG implementation with Wagtail CMS as the document source, see RAG with Django: Chat Over Your Wagtail CMS Content.


10. Async & Streaming Responses

Streaming matters for user experience. A response that starts appearing in 300ms feels instant even if the full reply takes 8 seconds. Without streaming, users watch a spinner for the full duration and see text appear all at once.

Synchronous streaming

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model      = 'claude-sonnet-4-6',
    max_tokens = 1024,
    messages   = [{'role': 'user', 'content': 'Explain the GIL in Python.'}],
) as stream:
    for text in stream.text_stream:
        print(text, end='', flush=True)

# After the stream — get the complete message with usage stats
message = stream.get_final_message()
print(f'\n\nInput tokens: {message.usage.input_tokens}')
print(f'Output tokens: {message.usage.output_tokens}')

Async streaming (for async frameworks)

import asyncio
import anthropic

async def stream_response(question: str) -> str:
    client = anthropic.AsyncAnthropic()
    accumulated = ''

    async with client.messages.stream(
        model      = 'claude-sonnet-4-6',
        max_tokens = 1024,
        messages   = [{'role': 'user', 'content': question}],
    ) as stream:
        async for text in stream.text_stream:
            accumulated += text
            print(text, end='', flush=True)

    return accumulated


asyncio.run(stream_response('What is asyncio.gather()?'))

Streaming in Django with Server-Sent Events

In a Django view, wrap the generator in a StreamingHttpResponse with the text/event-stream content type. Set X-Accel-Buffering: no to prevent nginx from buffering the stream before it reaches the browser:

from django.http import StreamingHttpResponse
import json


def stream_view(request):
    def event_generator():
        with client.messages.stream(
            model    = 'claude-sonnet-4-6',
            max_tokens = 1024,
            messages = [{'role': 'user', 'content': request.GET.get('q', '')}],
        ) as stream:
            for text in stream.text_stream:
                yield f'data: {json.dumps({"token": text})}\n\n'
        yield 'data: [DONE]\n\n'

    resp = StreamingHttpResponse(event_generator(), content_type='text/event-stream')
    resp['Cache-Control']     = 'no-cache'
    resp['X-Accel-Buffering'] = 'no'
    return resp

11. Production Patterns

Retries with exponential backoff

LLM APIs rate-limit under load and occasionally return 529 (overloaded). The Anthropic SDK retries automatically by default (2 retries). For higher-traffic scenarios, configure the retry behaviour explicitly:

import anthropic
from anthropic import APIStatusError, RateLimitError

client = anthropic.Anthropic(
    max_retries = 4,        # SDK handles exponential backoff automatically
    timeout     = 60.0,     # per-request timeout in seconds
)

def safe_create(messages: list[dict], **kwargs) -> str:
    try:
        response = client.messages.create(
            model      = 'claude-sonnet-4-6',
            max_tokens = 1024,
            messages   = messages,
            **kwargs,
        )
        return response.content[0].text

    except RateLimitError:
        raise  # Let the caller decide — or queue for async retry via Celery

    except APIStatusError as e:
        if e.status_code == 529:  # model overloaded
            raise  # retry at a higher level
        raise  # re-raise all other API errors

Prompt caching (Anthropic)

If your system prompt or context documents are the same across many requests, enable prompt caching. Anthropic caches the first 1,024+ tokens marked with cache_control and charges ~10% of the normal input token price on cache hits:

LONG_SYSTEM_CONTEXT = "..." * 500  # a large, stable context

response = client.messages.create(
    model      = 'claude-sonnet-4-6',
    max_tokens = 1024,
    system = [
        {
            'type': 'text',
            'text': LONG_SYSTEM_CONTEXT,
            'cache_control': {'type': 'ephemeral'},  # cache this block
        }
    ],
    messages = [{'role': 'user', 'content': user_query}],
)
# First request: normal cost. Subsequent requests within 5 min: ~10% cost.
# Check cache hit: response.usage.cache_read_input_tokens > 0

Cost tracking

Log every token count to a database table. This lets you attribute cost to users, endpoints, or features and catch runaway usage before the bill arrives:

from django.db import models


class LLMUsageLog(models.Model):
    endpoint      = models.CharField(max_length=100)
    model         = models.CharField(max_length=60)
    input_tokens  = models.PositiveIntegerField()
    output_tokens = models.PositiveIntegerField()
    cache_read    = models.PositiveIntegerField(default=0)
    user_id       = models.IntegerField(null=True)
    created       = models.DateTimeField(auto_now_add=True)

    @property
    def estimated_cost_usd(self) -> float:
        # Claude Sonnet-4-6 pricing (verify on Anthropic pricing page)
        input_cost  = (self.input_tokens  - self.cache_read) * 3.0  / 1_000_000
        cache_cost  =  self.cache_read                       * 0.3  / 1_000_000
        output_cost =  self.output_tokens                    * 15.0 / 1_000_000
        return round(input_cost + cache_cost + output_cost, 6)


def log_usage(response, endpoint: str, user_id: int | None = None) -> None:
    LLMUsageLog.objects.create(
        endpoint      = endpoint,
        model         = response.model,
        input_tokens  = response.usage.input_tokens,
        output_tokens = response.usage.output_tokens,
        cache_read    = getattr(response.usage, 'cache_read_input_tokens', 0),
        user_id       = user_id,
    )

Production checklist

  • Never hardcode API keys. Use environment variables or a secrets manager. Rotate if exposed.
  • Set max_tokens explicitly. Without it, some SDKs default to the model maximum, producing unexpectedly large responses and costs.
  • Validate all LLM output with Pydantic before using it as data. Text is never a safe input to downstream logic.
  • Rate-limit per user at the application layer, not just at the API level. A single user can exhaust your API quota if you don't enforce per-session limits.
  • Handle stop_reason == "max_tokens". Either increase the limit or return a partial-response warning to the user.
  • Log every call — model, tokens, endpoint, user — from day one. Debugging cost spikes after the fact without logs is painful.
  • Use async Celery tasks for non-interactive LLM calls (batch processing, background enrichment). Never block an HTTP worker thread on a multi-second API call. See the Celery async AI tasks post for the full pattern.
  • Test with model="claude-haiku-4-5" during development. It is fast and cheap. Switch to a more capable model only when you measure a quality gap.