Python Django AI Architecture

AI-Native Architecture: Designing Django Applications Built Around Intelligence

There is a meaningful difference between an application that uses AI and one that is built around it. Bolting an LLM call onto an existing Django view gets you a feature. Designing the application from the start with AI as a load-bearing component changes everything: how you handle latency, how you manage cost, how you store and version your prompts, how you measure quality, and how you fail gracefully when the model misbehaves. This post covers the architectural decisions that actually matter in production AI-native Django systems.

1. AI-Augmented vs AI-Native

Imagine you are building a contract review platform: every document upload triggers a multi-step LLM analysis, every page view shows AI-extracted clauses, every search runs semantic retrieval. The AI is not a feature on a page — it is the page. That changes how you build the entire application.

Most Django applications that use AI today are AI-augmented: a standard request-response web application with one or two endpoints that call an LLM. The existing architecture does not change. The AI call is a side-effect — if the API is down, the app still works, just with a degraded feature.

An AI-native application is different in kind. The LLM is not a feature; it is the value proposition. A document analysis platform, a code review assistant, an intelligent customer support system — in these systems, if the AI fails, the application fails. That changes every architectural decision you make.

[Diagram. AI-augmented: core application (views / URLs / ORM, business logic, auth / permissions, PostgreSQL / Redis) with an optional bolt-on AI feature; the app works if AI is down. AI-native: an AI core (LLM orchestration, prompt registry, vector store / RAG, token & cost tracker) wrapped by an app layer (auth / API / UI, async task queue, fallback logic, human-in-the-loop); AI is load-bearing, and the app is the delivery mechanism.]
AI-augmented: AI is a bolt-on feature; the app works without it. AI-native: AI is the core value; the app layer delivers and safeguards it.

The practical consequence is that AI-native applications need infrastructure the augmented approach never requires: a prompt registry, token budgets, fallback chains, quality measurement, cost attribution, and async task queues as a first-class concern rather than an afterthought. Each of these is covered below.


2. Async-First by Design

LLM API calls are slow. Response time scales with output length: a short classification finishes in under a second, a 2,000-token RAG answer takes 8–15 seconds, and an Opus multi-step analysis can take 30+ seconds. The worst mistake in AI-native architecture is making synchronous LLM calls in HTTP request handlers. A Django worker thread blocked on an API call for 8 seconds is 8 seconds it cannot handle any other request. Under load, this queues up and cascades into timeouts.

The correct model is 202 Accepted + Celery task + polling or WebSocket: the HTTP handler accepts the request, enqueues the work, returns immediately, and the client polls for the result or receives it over a WebSocket.

# models.py
import uuid
from django.conf import settings
from django.db import models


class AIRequest(models.Model):
    class Status(models.TextChoices):
        PENDING    = 'pending'
        PROCESSING = 'processing'
        DONE       = 'done'
        FAILED     = 'failed'

    id         = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    user       = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    prompt_key = models.CharField(max_length=60)   # which prompt template to use
    payload    = models.JSONField()                 # user-supplied variables
    status     = models.CharField(max_length=12, choices=Status, default=Status.PENDING)
    result     = models.TextField(blank=True)
    error      = models.TextField(blank=True)
    created    = models.DateTimeField(auto_now_add=True)
    completed  = models.DateTimeField(null=True, blank=True)
# views.py
from django.http import JsonResponse
from django.views.decorators.http import require_POST
from django.contrib.auth.decorators import login_required
import json

from .models import AIRequest
from .tasks import run_ai_request


@login_required
@require_POST
def submit_request(request):
    try:
        body = json.loads(request.body)
    except json.JSONDecodeError:
        return JsonResponse({'error': 'Invalid JSON body'}, status=400)
    if 'prompt_key' not in body:
        return JsonResponse({'error': 'prompt_key is required'}, status=400)
    req = AIRequest.objects.create(
        user       = request.user,
        prompt_key = body['prompt_key'],
        payload    = body.get('payload', {}),
    )
    run_ai_request.delay(str(req.id))   # non-blocking — Celery handles it
    return JsonResponse({'id': str(req.id), 'status': 'pending'}, status=202)


@login_required
def poll_request(request, request_id):
    try:
        req = AIRequest.objects.get(id=request_id, user=request.user)
    except AIRequest.DoesNotExist:
        return JsonResponse({'error': 'Not found'}, status=404)
    return JsonResponse({
        'id':     str(req.id),
        'status': req.status,
        'result': req.result if req.status == 'done' else None,
        'error':  req.error  if req.status == 'failed' else None,
    })
# tasks.py
from celery import shared_task
from django.utils import timezone
import anthropic

from .client import client                                       # singleton — see §8
from .models import AIRequest
from .prompts.registry import render_prompt, get_prompt, model_for  # defined in §3 and §5
from .tracking import log_usage


@shared_task(bind=True, max_retries=2, default_retry_delay=5)
def run_ai_request(self, request_id: str):
    req = AIRequest.objects.get(id=request_id)
    req.status = AIRequest.Status.PROCESSING
    req.save(update_fields=['status'])

    try:
        system, user_msg = render_prompt(req.prompt_key, req.payload)
        tmpl = get_prompt(req.prompt_key)

        response = client.messages.create(
            model       = model_for(req.prompt_key),
            max_tokens  = tmpl.max_tokens,
            temperature = tmpl.temperature,
            system      = system,
            messages    = [{'role': 'user', 'content': user_msg}],
        )

        req.result    = response.content[0].text
        req.status    = AIRequest.Status.DONE
        req.completed = timezone.now()
        req.save(update_fields=['result', 'status', 'completed'])
        log_usage(response, request=req, version=tmpl.version)

    except anthropic.RateLimitError as exc:
        raise self.retry(exc=exc, countdown=30)

    except Exception as exc:
        req.status = AIRequest.Status.FAILED
        req.error  = str(exc)
        req.save(update_fields=['status', 'error'])
        raise

For real-time feedback, replace polling with Django Channels WebSockets: the Celery task sends the result to a channel group, and the browser receives it without polling. See the async AI tasks post for the full WebSocket pattern.

Stream tokens for perceived latency

Waiting 10 seconds with a spinner feels broken. Waiting 10 seconds while tokens stream into the page feels alive — even though the total time is identical. For any user-facing LLM output longer than a sentence, streaming is non-optional. Pair a Celery task that streams from the SDK with a Server-Sent Events endpoint or a WebSocket that pushes each chunk to the browser:

# tasks.py — streaming variant (shares the imports shown in tasks.py above)
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

from .client import client

@shared_task
def stream_ai_request(request_id: str, channel_name: str):
    req  = AIRequest.objects.get(id=request_id)
    tmpl = get_prompt(req.prompt_key)
    system, user_msg = render_prompt(req.prompt_key, req.payload)

    pieces = []
    with client.messages.stream(
        model       = model_for(req.prompt_key),
        max_tokens  = tmpl.max_tokens,
        temperature = tmpl.temperature,
        system      = system,
        messages    = [{'role': 'user', 'content': user_msg}],
    ) as stream:
        for text in stream.text_stream:
            pieces.append(text)
            # Push each chunk to the browser over Channels.
            async_to_sync(get_channel_layer().group_send)(
                channel_name, {'type': 'ai.chunk', 'text': text},
            )
        response = stream.get_final_message()

    req.result = ''.join(pieces)
    req.status = AIRequest.Status.DONE
    req.save(update_fields=['result', 'status'])
    log_usage(response, request=req, version=tmpl.version)

Streaming changes the cost model too: you cannot abort a non-streaming request once it's in flight, but you can cancel a stream as soon as the user navigates away — saving the rest of the output tokens. For long generations behind a user's "stop" button, this matters.
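That cancellation loop can be sketched with the stream abstracted as a plain iterator and the user's stop signal as a `should_cancel` predicate. In the Celery task, `should_cancel` could poll a Redis flag set by a "stop" endpoint; both names are illustrative, not part of the code above:

```python
def stream_with_cancel(text_stream, emit, should_cancel):
    """Consume a token stream, pushing chunks via `emit` until cancelled.

    text_stream:   iterable of text chunks (e.g. stream.text_stream)
    emit:          callback that pushes one chunk to the client
    should_cancel: polled before each chunk; returning True aborts the
                   stream instead of waiting for the rest of the output
    Returns (text_so_far, was_cancelled).
    """
    pieces = []
    for text in text_stream:
        if should_cancel():
            return ''.join(pieces), True
        pieces.append(text)
        emit(text)
    return ''.join(pieces), False
```

Returning early exits the `with client.messages.stream(...)` block, which closes the underlying connection and stops further generation.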


3. Prompt Management as Code

In a prototype, prompts live in strings inside view functions. In production, this is a maintenance disaster. Prompts change more frequently than code. Different features need different variants. You need to A/B test them. You need to roll back a bad prompt without redeploying. You need to audit which prompt version produced which output.

Treat prompts the way you treat database migrations: versioned, reviewable, and stored separately from application logic. (For the engineering fundamentals of writing the prompts themselves — few-shot, structured output, tool use — see the GenAI with Python guide; this section is about the surrounding infrastructure.)

# prompts/registry.py
from dataclasses import dataclass
from pathlib import Path
from string import Template
import yaml


@dataclass(frozen=True)
class PromptTemplate:
    key:         str
    version:     str
    system:      str
    user:        str
    max_tokens:  int  = 1024
    temperature: float = 1.0


_registry: dict[str, PromptTemplate] = {}


def load_prompts(directory: str = 'prompts/templates') -> None:
    """Load all YAML prompt templates from disk into the registry."""
    for path in Path(directory).glob('*.yaml'):
        data = yaml.safe_load(path.read_text())
        tmpl = PromptTemplate(
            key         = data['key'],
            version     = data['version'],
            system      = data['system'],
            user        = data['user'],
            max_tokens  = data.get('max_tokens',  1024),
            temperature = data.get('temperature', 1.0),
        )
        _registry[tmpl.key] = tmpl


def get_prompt(key: str) -> PromptTemplate:
    if key not in _registry:
        raise KeyError(f'Unknown prompt key: {key!r}')
    return _registry[key]


def render_prompt(key: str, variables: dict) -> tuple[str, str]:
    """Return (system, user) strings with variables substituted."""
    tmpl   = get_prompt(key)
    system = Template(tmpl.system).safe_substitute(variables)
    user   = Template(tmpl.user).safe_substitute(variables)
    return system, user

Call load_prompts() once at startup in your app's AppConfig.ready() so the registry is populated before any request arrives:

# ai/apps.py
from django.apps import AppConfig


class AiConfig(AppConfig):
    name = 'ai'

    def ready(self):
        from .prompts.registry import load_prompts
        load_prompts()  # populates _registry from YAML files on disk

A YAML prompt template looks like this:

# prompts/templates/document_summary.yaml
key:         document_summary
version:     "1.3"
max_tokens:  1024
temperature: 0.3   # low — extracting facts, not generating creative prose

system: |
  You are a precise document analyst working for a legal team.
  Your job is to extract the key facts from a document and
  present them as a concise structured summary.

  Rules:
  - Summarise in no more than 5 bullet points
  - Each bullet must be a single, complete sentence
  - Use plain English — no legal jargon unless quoting the document
  - If the document contains no meaningful information, say so

user: |
  Document title: $title
  Document type:  $doc_type

  Content:
  $content

  Provide the summary.

Pick temperature deliberately per prompt. Classification, extraction, and routing prompts almost always want temperature: 0 — you want the same input to produce the same output, every time. Summarisation and Q&A can tolerate 0.2–0.5. Creative writing and brainstorming want 0.7+. Setting one global temperature in your client config is a mistake; bake the value into each prompt template.

Store the prompt key and version alongside every AI output in the database. When output quality degrades after a prompt change, you can immediately identify which version is responsible and roll back by reverting the YAML file.

For larger teams, store prompts in the database with a Django admin interface for non-technical stakeholders to edit them — but still version-control the canonical defaults as YAML so changes go through code review before they reach production.


4. RAG as Infrastructure, Not a Feature

Retrieval-Augmented Generation is often implemented as a one-off feature: embed some documents, run a similarity search, append results to a prompt. In an AI-native application, RAG is infrastructure — as fundamental as your database. It needs its own models, its own indexing pipeline, its own freshness guarantees, and its own health monitoring.

[Diagram. Ingest pipeline (background, Celery): documents from Wagtail / S3 → chunk (~500 words, 50 overlap) → embed (text-embedding-3-small) → store (pgvector, HNSW). Query pipeline (request-time, async Celery task): user query → embed query → top-k retrieve → augment prompt → generate grounded response with source chunk IDs. Both pipelines share one pgvector store; storing source chunk IDs in every AI output row enables auditing, attribution, and freshness invalidation.]
RAG as infrastructure: a background ingest pipeline keeps the vector store fresh; a query-time pipeline retrieves and augments before calling the LLM. Both share one pgvector store.
# models.py — RAG infrastructure models
from django.db import models
from pgvector.django import VectorField, HnswIndex


class KnowledgeDocument(models.Model):
    """Source documents fed into the RAG pipeline."""
    title      = models.CharField(max_length=300)
    source_url = models.URLField(blank=True)
    content    = models.TextField()
    checksum   = models.CharField(max_length=64, unique=True)  # SHA-256 of content
    indexed_at = models.DateTimeField(null=True)
    created    = models.DateTimeField(auto_now_add=True)


class KnowledgeChunk(models.Model):
    """Embedded chunk of a KnowledgeDocument."""
    document    = models.ForeignKey(KnowledgeDocument, on_delete=models.CASCADE,
                                    related_name='chunks')
    content     = models.TextField()
    chunk_index = models.PositiveIntegerField()
    embedding   = VectorField(dimensions=1536)

    class Meta:
        indexes = [
            HnswIndex(
                name='chunk_embedding_hnsw',
                fields=['embedding'],
                m=16,
                ef_construction=64,
                opclasses=['vector_cosine_ops'],
            )
        ]
        ordering = ['document', 'chunk_index']
# rag/pipeline.py
from openai import OpenAI
from pgvector.django import CosineDistance
from .models import KnowledgeChunk

oc = OpenAI()


def embed_text(text: str) -> list[float]:
    return oc.embeddings.create(
        model='text-embedding-3-small', input=[text]
    ).data[0].embedding


def retrieve(query: str, top_k: int = 5, min_score: float = 0.7) -> list[KnowledgeChunk]:
    """Return the top-k most relevant chunks for a query.

    min_score is cosine similarity (0..1). 0.7 keeps results closely related to
    the query; tune per dataset — too high returns empty lists, too low surfaces
    unrelated chunks. Measure on a labelled eval set before changing.
    """
    query_vec = embed_text(query)
    return list(
        KnowledgeChunk.objects
        .annotate(distance=CosineDistance('embedding', query_vec))
        .filter(distance__lt=(1 - min_score))   # cosine distance, not similarity
        .order_by('distance')
        .select_related('document')[:top_k]
    )


def build_context(chunks: list[KnowledgeChunk]) -> tuple[str, list[int]]:
    """Returns (context string, list of chunk IDs) for audit logging."""
    parts = []
    ids   = []
    for chunk in chunks:
        parts.append(f'[Source: {chunk.document.title}]\n{chunk.content}')
        ids.append(chunk.id)
    return '\n\n---\n\n'.join(parts), ids
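The ingest side of the pipeline also needs a chunker. The word-window split below is a minimal sketch of the "~500 words, 50 overlap" strategy from the pipeline description; production pipelines often prefer sentence- or heading-aware splitting:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word windows of `chunk_size` words, with `overlap`
    words shared between consecutive chunks so context near a boundary
    appears in both neighbours."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```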

Store the chunk IDs alongside every AI output row. This enables three things: citation (show users where the answer came from), auditing (reproduce the exact context a given answer was generated from), and freshness invalidation (when a source document changes, flag all AI outputs that depended on its chunks for review).
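The freshness-invalidation step reduces to a set intersection. A sketch, assuming the chunk IDs are stored as a list per output row (for example in a JSONField; that field is illustrative, not part of the models above):

```python
def outputs_to_flag(output_chunk_ids: dict[int, list[int]],
                    stale_chunk_ids: set[int]) -> set[int]:
    """Given a mapping of AI-output id -> chunk ids used in its context, and
    the chunk ids replaced by a re-ingested document, return the ids of
    outputs that were grounded on now-stale chunks and need review or
    regeneration."""
    return {
        output_id
        for output_id, chunk_ids in output_chunk_ids.items()
        if stale_chunk_ids & set(chunk_ids)
    }
```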


5. Cost Architecture

LLM costs are invisible until they arrive as a bill. At prototype scale, a few thousand tokens per request is negligible. At production scale — thousands of requests per day — poor cost hygiene can produce a bill that kills your margins. Cost architecture means designing token spend as deliberately as you design database queries.

Token budget enforcement

# tracking.py
from django.db import models
from django.conf import settings


class TokenBudget(models.Model):
    """Per-user or per-tenant daily token budget."""
    user        = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    daily_limit = models.PositiveIntegerField(default=100_000)  # tokens/day
    used_today  = models.PositiveIntegerField(default=0)
    reset_date  = models.DateField()

    def has_capacity(self, estimated_tokens: int) -> bool:
        return (self.used_today + estimated_tokens) <= self.daily_limit

    def consume(self, tokens: int) -> None:
        TokenBudget.objects.filter(pk=self.pk).update(
            used_today=models.F('used_today') + tokens
        )


class LLMUsageLog(models.Model):
    """Immutable log of every LLM call for cost attribution and debugging."""
    request        = models.ForeignKey('AIRequest', on_delete=models.SET_NULL,
                                       null=True, related_name='usage_logs')
    prompt_key     = models.CharField(max_length=60)
    prompt_version = models.CharField(max_length=20, blank=True, default='')
    model          = models.CharField(max_length=60)
    input_tokens   = models.PositiveIntegerField()                # uncached input
    output_tokens  = models.PositiveIntegerField()
    cache_read     = models.PositiveIntegerField(default=0)       # cache hits  — 10% of input
    cache_write    = models.PositiveIntegerField(default=0)       # cache writes — 125% of input
    latency_ms     = models.PositiveIntegerField(default=0)
    created        = models.DateTimeField(auto_now_add=True)

    @property
    def cost_usd(self) -> float:
        # Sonnet 4.6 pricing — verify on Anthropic pricing page.
        # input_tokens / cache_read / cache_write are mutually exclusive counters
        # from the Anthropic API — don't double-count by subtracting one from another.
        input_cost       = self.input_tokens  * 3.00  / 1_000_000
        cache_read_cost  = self.cache_read    * 0.30  / 1_000_000
        cache_write_cost = self.cache_write   * 3.75  / 1_000_000
        output_cost      = self.output_tokens * 15.00 / 1_000_000
        return round(input_cost + cache_read_cost + cache_write_cost + output_cost, 6)


def log_usage(response, *, request, version: str = '', latency_ms: int = 0) -> LLMUsageLog:
    """Persist an LLMUsageLog row from an Anthropic Messages response."""
    usage = response.usage
    return LLMUsageLog.objects.create(
        request        = request,
        prompt_key     = request.prompt_key,
        prompt_version = version,
        model          = response.model,
        input_tokens   = usage.input_tokens,
        output_tokens  = usage.output_tokens,
        # These attributes only appear on the usage object when prompt caching is in use.
        cache_read     = getattr(usage, 'cache_read_input_tokens',     0) or 0,
        cache_write    = getattr(usage, 'cache_creation_input_tokens', 0) or 0,
        latency_ms     = latency_ms,
    )

The reset_date field needs a nightly reset task. Wire it with Celery Beat:

# tasks.py — run nightly via Celery Beat
from celery import shared_task
from django.utils import timezone
from .tracking import TokenBudget


@shared_task
def reset_daily_token_budgets():
    today = timezone.now().date()
    TokenBudget.objects.filter(reset_date__lt=today).update(
        used_today=0, reset_date=today
    )
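Enforcement belongs in the submit view, before the task is enqueued. Exact token counts are only known after the API responds, so gate on a rough estimate; the ~4-characters-per-token ratio is a common heuristic for English text, not an exact tokenizer:

```python
def estimate_request_tokens(payload: dict, max_output_tokens: int) -> int:
    """Pre-flight token estimate: ~4 characters per token for the input,
    plus the prompt template's worst-case output allowance."""
    input_chars = sum(len(str(value)) for value in payload.values())
    return input_chars // 4 + max_output_tokens


# In submit_request, before AIRequest.objects.create (sketch; tmpl fetched
# via get_prompt(body['prompt_key'])):
#
#     budget   = TokenBudget.objects.get(user=request.user)
#     estimate = estimate_request_tokens(body.get('payload', {}), tmpl.max_tokens)
#     if not budget.has_capacity(estimate):
#         return JsonResponse({'error': 'Daily token budget exceeded'}, status=429)
```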

Model routing: use the cheapest model that is good enough

Not every request needs your most capable model. Classification, extraction, and short summarisation tasks that Haiku handles just as well cost roughly 15× less on Haiku than on Opus. Route by task type:

# prompts/registry.py (extend PromptTemplate)
MODEL_ROUTING = {
    # Classification, routing, simple extraction → Haiku
    'ticket_classify':    'claude-haiku-4-5-20251001',
    'intent_detection':   'claude-haiku-4-5-20251001',
    'sentiment_score':    'claude-haiku-4-5-20251001',

    # Summarisation, Q&A, RAG responses → Sonnet
    'document_summary':   'claude-sonnet-4-6',
    'rag_answer':         'claude-sonnet-4-6',
    'code_review':        'claude-sonnet-4-6',

    # Complex reasoning, multi-step analysis → Opus
    'contract_analysis':  'claude-opus-4-7',
    'architecture_audit': 'claude-opus-4-7',
}


def model_for(prompt_key: str) -> str:
    return MODEL_ROUTING.get(prompt_key, 'claude-sonnet-4-6')  # sensible default

Prompt caching for stable system prompts

import anthropic

client = anthropic.Anthropic()

# System prompts and RAG context that do not change between calls
# are prime candidates for prompt caching.
# Anthropic caches blocks marked cache_control for 5 minutes.
# Cache hits cost ~10% of normal input token price.

def call_with_cache(system: str, user_msg: str, context: str, model: str) -> str:
    response = client.messages.create(
        model      = model,
        max_tokens = 1024,
        system = [
            {
                'type': 'text',
                'text': system,
                'cache_control': {'type': 'ephemeral'},  # cache the system prompt
            }
        ],
        messages = [
            {
                'role': 'user',
                'content': [
                    {
                        'type': 'text',
                        'text': context,
                        'cache_control': {'type': 'ephemeral'},  # cache retrieved context
                    },
                    {
                        'type': 'text',
                        'text': user_msg,
                        # No cache_control — this changes every request
                    },
                ],
            }
        ],
    )
    # Check cache hit: response.usage.cache_read_input_tokens > 0
    return response.content[0].text

6. Observability for AI Systems

Standard application observability (error rates, latency percentiles, uptime) is necessary but not sufficient for AI-native applications. You also need to measure quality — and quality is not binary, it does not show up in logs, and it degrades silently when you change a prompt or swap a model version.

Three layers of AI observability

  • Infrastructure metrics: latency (p50/p95/p99), token consumption per endpoint, API error rate, Celery queue depth. These go into Grafana or Datadog like any other service metric.
  • Quality metrics: user thumbs-up/down signals, output length distribution, refusal rate (outputs containing "I cannot"), hallucination detection on known-answer test cases. Track these as custom events.
  • Cost metrics: cost per request, cost per user, cost per prompt key, cache hit rate. Alert when any endpoint exceeds its budget threshold.
# models.py — quality feedback
class AIOutputFeedback(models.Model):
    class Rating(models.IntegerChoices):
        THUMBS_DOWN = -1
        THUMBS_UP   =  1

    request    = models.OneToOneField(AIRequest, on_delete=models.CASCADE,
                                      related_name='feedback')
    user       = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    rating     = models.SmallIntegerField(choices=Rating)
    comment    = models.TextField(blank=True)
    created    = models.DateTimeField(auto_now_add=True)


# Management command to compute daily quality report
# management/commands/ai_quality_report.py
from django.core.management.base import BaseCommand
from django.db.models import Avg, Count
from django.utils import timezone
from datetime import timedelta


class Command(BaseCommand):
    help = 'Print daily AI quality summary'

    def handle(self, *args, **options):
        yesterday = timezone.now().date() - timedelta(days=1)

        from myapp.models import AIRequest, AIOutputFeedback, LLMUsageLog

        requests   = AIRequest.objects.filter(created__date=yesterday)
        done       = requests.filter(status='done').count()
        failed     = requests.filter(status='failed').count()
        total_cost = sum(
            log.cost_usd for log in
            LLMUsageLog.objects.filter(created__date=yesterday)
        )
        thumbs_up = AIOutputFeedback.objects.filter(
            created__date=yesterday, rating=1
        ).count()
        thumbs_dn = AIOutputFeedback.objects.filter(
            created__date=yesterday, rating=-1
        ).count()

        self.stdout.write(f'--- AI Quality Report: {yesterday} ---')
        self.stdout.write(f'Requests completed: {done}  failed: {failed}')
        self.stdout.write(f'Total cost:         ${total_cost:.4f}')
        self.stdout.write(f'User feedback:      +{thumbs_up} / -{thumbs_dn}')
        if thumbs_up + thumbs_dn > 0:
            rate = thumbs_up / (thumbs_up + thumbs_dn) * 100
            self.stdout.write(f'Approval rate:      {rate:.1f}%')

Regression testing for prompts

Before deploying a prompt change, run it against a fixed set of test cases with known expected outputs. This is your quality gate — equivalent to a test suite for code:

# tests/test_prompts.py
import pytest
import anthropic
from prompts.registry import render_prompt


GOLDEN_CASES = [
    {
        'prompt_key': 'ticket_classify',
        'payload': {'ticket': 'The login button does nothing on mobile Safari'},
        'expected_contains': 'bug',
    },
    {
        'prompt_key': 'ticket_classify',
        'payload': {'ticket': 'Please add dark mode to the dashboard'},
        'expected_contains': 'feature',
    },
    {
        'prompt_key': 'sentiment_score',
        'payload': {'text': 'This product completely ruined my morning.'},
        'expected_contains': 'negative',
    },
]


@pytest.mark.parametrize('case', GOLDEN_CASES)
def test_prompt_regression(case):
    client  = anthropic.Anthropic()
    system, user_msg = render_prompt(case['prompt_key'], case['payload'])

    response = client.messages.create(
        model    = 'claude-haiku-4-5-20251001',
        max_tokens = 256,
        system   = system,
        messages = [{'role': 'user', 'content': user_msg}],
    )

    output = response.content[0].text.lower()
    assert case['expected_contains'] in output, (
        f"Prompt {case['prompt_key']!r} regression: "
        f"expected {case['expected_contains']!r} in output.\n"
        f"Got: {output}"
    )

These tests make real API calls and incur token cost — mark them with a custom pytest marker (e.g. @pytest.mark.llm) and run them as a pre-deploy gate in CI, separate from the main unit test suite.
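Registering the marker keeps pytest from warning about an unknown mark. A conftest.py sketch, using the `llm` marker name suggested above:

```python
# conftest.py
def pytest_configure(config):
    # Register the custom marker so `pytest --strict-markers` accepts it.
    config.addinivalue_line(
        'markers', 'llm: tests that make real (billable) LLM API calls'
    )
```

Then run `pytest -m "not llm"` in the regular CI job and `pytest -m llm` as the pre-deploy gate.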


7. Resilience: Fallbacks, Timeouts, and Circuit Breakers

LLM APIs have real downtime and real rate limits. An AI-native application that crashes when the model API returns a 529 is not production-ready. You need fallback strategies at every layer.

Timeout discipline

import anthropic
import httpx

# Set explicit timeouts; the SDK's default (10 minutes) is far too generous
# for interactive endpoints.
# connect: time allowed to establish the TCP connection
# read:    maximum wait between chunks of the response body
client = anthropic.Anthropic(
    timeout=httpx.Timeout(
        connect = 5.0,    # seconds
        read    = 90.0,   # generous for long responses; reduce per-endpoint if possible
        write   = 5.0,
        pool    = 5.0,
    ),
    max_retries = 3,       # SDK retries with exponential backoff automatically
)

Fallback chain

import anthropic
import httpx
import logging

log    = logging.getLogger(__name__)
client = anthropic.Anthropic(
    timeout     = httpx.Timeout(connect=5.0, read=90.0, write=5.0, pool=5.0),
    max_retries = 3,
)

# Try primary model → cheaper fallback → cached response → graceful degradation
def resilient_call(system: str, user_msg: str, cache_key: str) -> str:
    PRIMARY  = 'claude-sonnet-4-6'
    FALLBACK = 'claude-haiku-4-5-20251001'

    for model in [PRIMARY, FALLBACK]:
        try:
            resp = client.messages.create(
                model      = model,
                max_tokens = 1024,
                system     = system,
                messages   = [{'role': 'user', 'content': user_msg}],
            )
            return resp.content[0].text

        except anthropic.RateLimitError:
            log.warning('Rate limited on %s — trying fallback', model)
            continue

        except anthropic.APIStatusError as e:
            if e.status_code == 529:
                log.warning('Model overloaded (%s) — trying fallback', model)
                continue
            raise

    # Both models failed — try the Redis cache for a recent similar response
    from django.core.cache import cache
    cached = cache.get(cache_key)
    if cached:
        log.warning('All models failed — serving cached response for %s', cache_key)
        return cached

    # Nothing works — raise so the Celery task marks the request as failed
    raise RuntimeError('All LLM fallbacks exhausted')
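The section title also promises a circuit breaker, so here is a minimal in-process sketch: after a run of consecutive failures the circuit opens and calls fail fast for a cooldown period, sparing a struggling API. State lives in the worker process; for fleet-wide behaviour across Celery workers, back the counters with Redis instead:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch (per-process state only)."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        """True if a call may proceed; False while the circuit is open."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            self.opened_at = None   # half-open: let one call probe the API
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

In `resilient_call`, check `breaker.allow()` before hitting the API (skipping straight to the cached response when it returns False) and call `record_success` / `record_failure` after each attempt.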

Human-in-the-loop for high-stakes outputs

For outputs that carry risk — legal documents, medical summaries, financial reports — route low-confidence results to a human review queue instead of returning them directly:

class AIReviewQueue(models.Model):
    """Outputs requiring human sign-off before delivery."""
    class Priority(models.TextChoices):
        HIGH   = 'high'
        MEDIUM = 'medium'
        LOW    = 'low'

    request       = models.OneToOneField(AIRequest, on_delete=models.CASCADE)
    reason        = models.CharField(max_length=200)  # why it was flagged
    priority      = models.CharField(max_length=8, choices=Priority, default=Priority.MEDIUM)
    reviewed_by   = models.ForeignKey('auth.User', null=True, blank=True,
                                      on_delete=models.SET_NULL, related_name='reviews')
    reviewed_at   = models.DateTimeField(null=True)
    approved      = models.BooleanField(null=True)  # None = pending
    created       = models.DateTimeField(auto_now_add=True)


def needs_human_review(output: str, prompt_key: str) -> tuple[bool, str]:
    """Return (needs_review, reason) based on output content."""
    FLAGGED_PATTERNS = [
        ('I cannot', 'model refusal — potential safety filter trigger'),
        ('I am not able', 'model refusal — out-of-scope request'),
        ('consult a', 'professional referral — may need expert validation'),
    ]
    output_lower = output.lower()
    for pattern, reason in FLAGGED_PATTERNS:
        if pattern.lower() in output_lower:
            return True, reason
    # High-stakes prompt keys always get reviewed
    HIGH_STAKES = {'contract_analysis', 'medical_summary', 'financial_report'}
    if prompt_key in HIGH_STAKES:
        return True, 'high-stakes prompt key — mandatory human review'
    return False, ''

8. Django Project Layout for AI-Native Apps

The flat myapp/ structure that works fine for a standard Django project becomes a liability in an AI-native application. Separate concerns that evolve at different rates:

myproject/
├── config/                  # Django settings, URLs, WSGI/ASGI
│   ├── settings/
│   │   ├── base.py
│   │   ├── production.py
│   │   └── local.py
│   └── urls.py
│
├── ai/                      # All AI concerns — no Django views in here
│   ├── client.py            # Anthropic/OpenAI client singletons
│   ├── prompts/
│   │   ├── registry.py      # PromptTemplate loader and renderer
│   │   └── templates/       # YAML prompt files
│   │       ├── document_summary.yaml
│   │       └── ticket_classify.yaml
│   ├── rag/
│   │   ├── pipeline.py      # chunk, embed, retrieve
│   │   ├── ingest.py        # Celery tasks for background indexing
│   │   └── models.py        # KnowledgeDocument, KnowledgeChunk
│   ├── tasks.py             # Celery tasks that call the LLM
│   ├── tracking.py          # log_usage(), cost calculations
│   └── resilience.py        # fallback chain, circuit breaker
│
├── requests/                # HTTP-facing request/response handling
│   ├── models.py            # AIRequest, TokenBudget, AIOutputFeedback
│   ├── views.py             # submit, poll, feedback endpoints
│   ├── serializers.py       # DRF serializers
│   └── urls.py
│
├── documents/               # Domain app — knows nothing about AI
│   ├── models.py            # Document, DocumentVersion
│   ├── views.py
│   └── signals.py           # triggers RAG re-index on document save
│
└── tests/
    ├── test_prompts.py      # golden-case regression tests
    ├── test_rag.py          # retrieval accuracy tests
    └── test_cost.py         # budget enforcement tests

The key principle: the ai/ package is pure Python — no Django views, no URL configs, no templates. The requests/ app handles HTTP. Domain apps (documents/) know nothing about AI. This keeps the AI layer testable in isolation and swappable without touching the HTTP layer.

Wire the layers together through Celery tasks and Django signals, not direct calls:

# documents/signals.py
from django.db.models.signals import post_save
from django.dispatch import receiver
from .models import Document
from ai.rag.ingest import reindex_document


@receiver(post_save, sender=Document)
def trigger_rag_reindex(sender, instance, **kwargs):
    """Re-index the document in the vector store whenever it is saved."""
    reindex_document.delay(instance.id)  # ingest task handles checksum comparison and skips if unchanged
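The checksum comparison that the signal comment relies on is cheap to sketch: hash the content, compare it with the stored value, and skip the chunk-and-embed work when nothing changed. This matches the SHA-256 `checksum` field on KnowledgeDocument; the task body shown in comments is illustrative:

```python
import hashlib


def content_checksum(content: str) -> str:
    """SHA-256 hex digest, matching KnowledgeDocument.checksum (64 chars)."""
    return hashlib.sha256(content.encode('utf-8')).hexdigest()


# Inside the reindex task (sketch):
#
#     doc = Document.objects.get(id=document_id)
#     if content_checksum(doc.content) == stored_checksum:
#         return  # unchanged — skip chunking and embedding entirely
```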

9. Production Checklist

  • Every LLM call goes through a Celery task. No synchronous LLM calls in Django request handlers. Period.
  • Every prompt is versioned in a YAML file. Prompt key and version are logged alongside every AI output row.
  • Token budgets are enforced per user before enqueuing tasks. Reject requests that would exceed the daily limit, not after spending the tokens.
  • Prompt caching is enabled for all system prompts over 1,024 tokens. Check cache_read_input_tokens in usage logs — a cache hit rate below 60% on high-volume prompts is leaving money on the table.
  • Model routing is explicit. Classification → Haiku. Summarisation → Sonnet. Complex reasoning → Opus. Unrouted defaults to Sonnet.
  • Timeouts are set on the SDK client, not just on Celery tasks. Both are needed — the SDK timeout prevents a single stalled connection; the Celery soft_time_limit prevents a stalled task from blocking a worker.
  • Fallback chain tested in staging. Inject anthropic.RateLimitError in tests and verify the application degrades gracefully rather than crashing.
  • Golden-case regression tests run in CI before every prompt deploy. A prompt change that fails 1 of 20 test cases does not merge.
  • Quality feedback is captured and reviewed weekly. Approval rate below 80% on any prompt key triggers a prompt review.
  • RAG chunk IDs are stored with every output. When a source document is updated, the dependent AI outputs are flagged for review or regeneration.
  • Cost dashboards are visible to the team, not just engineering. Product managers who request AI features should see what they cost.
  • Human-in-the-loop queues exist for high-stakes outputs before you launch, not after the first incident.