AI-Native Architecture: Designing Django Applications Built Around Intelligence
There is a meaningful difference between an application that uses AI and one that is built around it. Bolting an LLM call onto an existing Django view gets you a feature. Designing the application from the start with AI as a load-bearing component changes everything: how you handle latency, how you manage cost, how you store and version your prompts, how you measure quality, and how you fail gracefully when the model misbehaves. This post covers the architectural decisions that actually matter in production AI-native Django systems.
1. AI-Augmented vs AI-Native
Imagine you are building a contract review platform: every document upload triggers a multi-step LLM analysis, every page view shows AI-extracted clauses, every search runs semantic retrieval. The AI is not a feature on a page — it is the page. That changes how you build the entire application.
Most Django applications that use AI today are AI-augmented: a standard request-response web application with one or two endpoints that call an LLM. The existing architecture does not change. The AI call is a side-effect — if the API is down, the app still works, just with a degraded feature.
An AI-native application is different in kind. The LLM is not a feature; it is the value proposition. A document analysis platform, a code review assistant, an intelligent customer support system — in these systems, if the AI fails, the application fails. That changes every architectural decision you make.
The practical consequence is that AI-native applications need infrastructure the augmented approach never requires: a prompt registry, token budgets, fallback chains, quality measurement, cost attribution, and async task queues as a first-class concern rather than an afterthought. Each of these is covered below.
2. Async-First by Design
LLM API calls are slow. Response time scales with output length: a short classification finishes in under a second, a 2,000-token RAG answer takes 8–15 seconds, and an Opus multi-step analysis can take 30 seconds or more. The worst mistake in AI-native architecture is making synchronous LLM calls in HTTP request handlers. Every 8 seconds a Django worker spends blocked on an API call is 8 seconds it cannot handle any other request. Under load, this queues up and cascades into timeouts.
The correct model is 202 Accepted + Celery task + polling or WebSocket: the HTTP handler accepts the request, enqueues the work, returns immediately, and the client polls for the result or receives it over a WebSocket.
# models.py
import uuid
from django.conf import settings
from django.db import models
class AIRequest(models.Model):
class Status(models.TextChoices):
PENDING = 'pending'
PROCESSING = 'processing'
DONE = 'done'
FAILED = 'failed'
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
prompt_key = models.CharField(max_length=60) # which prompt template to use
payload = models.JSONField() # user-supplied variables
status = models.CharField(max_length=12, choices=Status, default=Status.PENDING)
result = models.TextField(blank=True)
error = models.TextField(blank=True)
created = models.DateTimeField(auto_now_add=True)
completed = models.DateTimeField(null=True, blank=True)
# views.py
from django.http import JsonResponse
from django.views.decorators.http import require_POST
from django.contrib.auth.decorators import login_required
import json
from .models import AIRequest
from .tasks import run_ai_request
@login_required
@require_POST
def submit_request(request):
body = json.loads(request.body)
req = AIRequest.objects.create(
user = request.user,
prompt_key = body['prompt_key'],
payload = body.get('payload', {}),
)
run_ai_request.delay(str(req.id)) # non-blocking — Celery handles it
return JsonResponse({'id': str(req.id), 'status': 'pending'}, status=202)
@login_required
def poll_request(request, request_id):
req = AIRequest.objects.get(id=request_id, user=request.user)
return JsonResponse({
'id': str(req.id),
'status': req.status,
'result': req.result if req.status == 'done' else None,
'error': req.error if req.status == 'failed' else None,
})
# tasks.py
from celery import shared_task
from django.utils import timezone
import anthropic
from .client import client # singleton — see §8
from .models import AIRequest
from .prompts.registry import render_prompt, get_prompt, model_for # defined in §3 and §5
from .tracking import log_usage
@shared_task(bind=True, max_retries=2, default_retry_delay=5)
def run_ai_request(self, request_id: str):
req = AIRequest.objects.get(id=request_id)
req.status = AIRequest.Status.PROCESSING
req.save(update_fields=['status'])
try:
system, user_msg = render_prompt(req.prompt_key, req.payload)
tmpl = get_prompt(req.prompt_key)
response = client.messages.create(
model = model_for(req.prompt_key),
max_tokens = tmpl.max_tokens,
temperature = tmpl.temperature,
system = system,
messages = [{'role': 'user', 'content': user_msg}],
)
req.result = response.content[0].text
req.status = AIRequest.Status.DONE
req.completed = timezone.now()
req.save(update_fields=['result', 'status', 'completed'])
log_usage(response, request=req, version=tmpl.version)
except anthropic.RateLimitError as exc:
raise self.retry(exc=exc, countdown=30)
except Exception as exc:
req.status = AIRequest.Status.FAILED
req.error = str(exc)
req.save(update_fields=['status', 'error'])
raise
For real-time feedback, replace polling with Django Channels WebSockets: the Celery task sends the result to a channel group, and the browser receives it without polling. See the async AI tasks post for the full WebSocket pattern.
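A minimal consumer for that pattern might look like the sketch below — assuming Channels is installed with a channel layer configured; the AIResultConsumer name, the URL kwarg, and the 'ai.result' event name are illustrative (the 'ai.chunk' handler matches the streaming task shown later):
# consumers.py — minimal Channels consumer sketch (names are illustrative)
from channels.generic.websocket import AsyncJsonWebsocketConsumer

class AIResultConsumer(AsyncJsonWebsocketConsumer):
    async def connect(self):
        # One group per AIRequest, keyed by the id from the URL route.
        request_id = self.scope['url_route']['kwargs']['request_id']
        self.group_name = f'ai_{request_id}'
        await self.channel_layer.group_add(self.group_name, self.channel_name)
        await self.accept()

    async def disconnect(self, close_code):
        await self.channel_layer.group_discard(self.group_name, self.channel_name)

    async def ai_result(self, event):
        # Final result pushed by the non-streaming Celery task.
        await self.send_json({'status': 'done', 'result': event['result']})

    async def ai_chunk(self, event):
        # Incremental chunks pushed by the streaming variant below.
        await self.send_json({'text': event['text']})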
Stream tokens for perceived latency
Waiting 10 seconds with a spinner feels broken. Waiting 10 seconds while tokens stream into the page feels alive — even though the total time is identical. For any user-facing LLM output longer than a sentence, streaming is non-optional. Pair a Celery task that streams from the SDK with a Server-Sent Events endpoint or a WebSocket that pushes each chunk to the browser:
# tasks.py — streaming variant
from asgiref.sync import async_to_sync
from celery import shared_task
from channels.layers import get_channel_layer
from .client import client
from .models import AIRequest
from .prompts.registry import render_prompt, get_prompt, model_for
from .tracking import log_usage
@shared_task
def stream_ai_request(request_id: str, channel_name: str):
req = AIRequest.objects.get(id=request_id)
tmpl = get_prompt(req.prompt_key)
system, user_msg = render_prompt(req.prompt_key, req.payload)
pieces = []
with client.messages.stream(
model = model_for(req.prompt_key),
max_tokens = tmpl.max_tokens,
temperature = tmpl.temperature,
system = system,
messages = [{'role': 'user', 'content': user_msg}],
) as stream:
for text in stream.text_stream:
pieces.append(text)
# Push each chunk to the browser over Channels.
async_to_sync(get_channel_layer().group_send)(
channel_name, {'type': 'ai.chunk', 'text': text},
)
response = stream.get_final_message()
req.result = ''.join(pieces)
req.status = AIRequest.Status.DONE
req.save(update_fields=['result', 'status'])
log_usage(response, request=req, version=tmpl.version)
Streaming changes the cost model too: you cannot abort a non-streaming request once it's in flight, but you can cancel a stream as soon as the user navigates away — saving the rest of the output tokens. For long generations behind a user's "stop" button, this matters.
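One way to wire that stop button is a short-lived cancellation flag in the Django cache — a sketch (the ai-cancel key naming and the cancel_request view are illustrative, not part of the code above):
# views.py — "stop" endpoint sets a cancellation flag (sketch)
from django.core.cache import cache
from django.http import JsonResponse

def cancel_request(request, request_id):
    cache.set(f'ai-cancel:{request_id}', True, timeout=300)
    return JsonResponse({'status': 'cancelling'})

# In stream_ai_request, check the flag between chunks:
#     for text in stream.text_stream:
#         if cache.get(f'ai-cancel:{request_id}'):
#             break  # leaving the `with` block closes the stream; no further output tokens are generated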
3. Prompt Management as Code
In a prototype, prompts live in strings inside view functions. In production, this is a maintenance disaster. Prompts change more frequently than code. Different features need different variants. You need to A/B test them. You need to roll back a bad prompt without redeploying. You need to audit which prompt version produced which output.
Treat prompts the way you treat database migrations: versioned, reviewable, and stored separately from application logic. (For the engineering fundamentals of writing the prompts themselves — few-shot, structured output, tool use — see the GenAI with Python guide; this section is about the surrounding infrastructure.)
# prompts/registry.py
from dataclasses import dataclass
from pathlib import Path
from string import Template
import yaml
@dataclass(frozen=True)
class PromptTemplate:
key: str
version: str
system: str
user: str
max_tokens: int = 1024
temperature: float = 1.0
_registry: dict[str, PromptTemplate] = {}
def load_prompts(directory: str = 'prompts/templates') -> None:
"""Load all YAML prompt templates from disk into the registry."""
for path in Path(directory).glob('*.yaml'):
data = yaml.safe_load(path.read_text())
tmpl = PromptTemplate(
key = data['key'],
version = data['version'],
system = data['system'],
user = data['user'],
max_tokens = data.get('max_tokens', 1024),
temperature = data.get('temperature', 1.0),
)
_registry[tmpl.key] = tmpl
def get_prompt(key: str) -> PromptTemplate:
if key not in _registry:
raise KeyError(f'Unknown prompt key: {key!r}')
return _registry[key]
def render_prompt(key: str, variables: dict) -> tuple[str, str]:
"""Return (system, user) strings with variables substituted."""
tmpl = get_prompt(key)
system = Template(tmpl.system).safe_substitute(variables)
user = Template(tmpl.user).safe_substitute(variables)
return system, user
Call load_prompts() once at startup in your app's AppConfig.ready()
so the registry is populated before any request arrives:
# ai/apps.py
from django.apps import AppConfig
class AiConfig(AppConfig):
name = 'ai'
def ready(self):
from .prompts.registry import load_prompts
load_prompts() # populates _registry from YAML files on disk
A YAML prompt template looks like this:
# prompts/templates/document_summary.yaml
key: document_summary
version: "1.3"
max_tokens: 1024
temperature: 0.3 # low — extracting facts, not generating creative prose
system: |
You are a precise document analyst working for a legal team.
Your job is to extract the key facts from a document and
present them as a concise structured summary.
Rules:
- Summarise in no more than 5 bullet points
- Each bullet must be a single, complete sentence
- Use plain English — no legal jargon unless quoting the document
- If the document contains no meaningful information, say so
user: |
Document title: $title
Document type: $doc_type
Content:
$content
Provide the summary.
Pick temperature deliberately per prompt. Classification, extraction, and routing prompts
almost always want temperature: 0 — you want the same input to produce the
same output, every time. Summarisation and Q&A can tolerate 0.2–0.5. Creative writing
and brainstorming want 0.7+. Setting one global temperature in your client config is a
mistake; bake the value into each prompt template.
Store the prompt key and version alongside every AI output in
the database. When output quality degrades after a prompt change, you can immediately
identify which version is responsible and roll back by reverting the YAML file.
For larger teams, store prompts in the database with a Django admin interface for non-technical stakeholders to edit them — but still version-control the canonical defaults as YAML so changes go through code review before they reach production.
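A sketch of that hybrid, using a hypothetical PromptOverride model: the admin edits rows, and get_prompt() checks for an enabled override before falling back to the YAML default.
# prompts/models.py — hypothetical admin-editable override (YAML stays the reviewed default)
from django.db import models

class PromptOverride(models.Model):
    key = models.CharField(max_length=60, unique=True)
    version = models.CharField(max_length=20)
    system = models.TextField()
    user = models.TextField()
    enabled = models.BooleanField(default=False)
    updated = models.DateTimeField(auto_now=True)

# Register it in admin.py with admin.site.register(PromptOverride); in
# registry.get_prompt(), return an enabled override (wrapped in a PromptTemplate)
# before falling back to the YAML-loaded entry.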
4. RAG as Infrastructure, Not a Feature
Retrieval-Augmented Generation is often implemented as a one-off feature: embed some documents, run a similarity search, append results to a prompt. In an AI-native application, RAG is infrastructure — as fundamental as your database. It needs its own models, its own indexing pipeline, its own freshness guarantees, and its own health monitoring.
# models.py — RAG infrastructure models
from django.db import models
from pgvector.django import VectorField, HnswIndex
class KnowledgeDocument(models.Model):
"""Source documents fed into the RAG pipeline."""
title = models.CharField(max_length=300)
source_url = models.URLField(blank=True)
content = models.TextField()
checksum = models.CharField(max_length=64, unique=True) # SHA-256 of content
indexed_at = models.DateTimeField(null=True)
created = models.DateTimeField(auto_now_add=True)
class KnowledgeChunk(models.Model):
"""Embedded chunk of a KnowledgeDocument."""
document = models.ForeignKey(KnowledgeDocument, on_delete=models.CASCADE,
related_name='chunks')
content = models.TextField()
chunk_index = models.PositiveIntegerField()
embedding = VectorField(dimensions=1536)
class Meta:
indexes = [
HnswIndex(
name='chunk_embedding_hnsw',
fields=['embedding'],
m=16,
ef_construction=64,
opclasses=['vector_cosine_ops'],
)
]
ordering = ['document', 'chunk_index']
# rag/pipeline.py
from openai import OpenAI
from pgvector.django import CosineDistance
from .models import KnowledgeChunk
oc = OpenAI()
def embed_text(text: str) -> list[float]:
return oc.embeddings.create(
model='text-embedding-3-small', input=[text]
).data[0].embedding
def retrieve(query: str, top_k: int = 5, min_score: float = 0.7) -> list[KnowledgeChunk]:
"""Return the top-k most relevant chunks for a query.
min_score is cosine similarity (0..1). 0.7 keeps results closely related to
the query; tune per dataset — too high returns empty lists, too low surfaces
unrelated chunks. Measure on a labelled eval set before changing.
"""
query_vec = embed_text(query)
return list(
KnowledgeChunk.objects
.annotate(distance=CosineDistance('embedding', query_vec))
.filter(distance__lt=(1 - min_score)) # cosine distance, not similarity
.order_by('distance')
.select_related('document')[:top_k]
)
def build_context(chunks: list[KnowledgeChunk]) -> tuple[str, list[int]]:
"""Returns (context string, list of chunk IDs) for audit logging."""
parts = []
ids = []
for chunk in chunks:
parts.append(f'[Source: {chunk.document.title}]\n{chunk.content}')
ids.append(chunk.id)
return '\n\n---\n\n'.join(parts), ids
Store the chunk IDs alongside every AI output row. This enables three things: citation (show users where the answer came from), auditing (reproduce the exact context a given answer was generated from), and freshness invalidation (when a source document changes, flag all AI outputs that depended on its chunks for review).
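The indexing pipeline with those freshness guarantees can be a Celery task that re-embeds a document only when its content has actually changed — a sketch, assuming the caller passes a KnowledgeDocument id and accepting naive fixed-size chunking as a simplification:
# rag/ingest.py — background (re)indexing sketch
import hashlib
from celery import shared_task
from django.utils import timezone
from .models import KnowledgeDocument, KnowledgeChunk
from .pipeline import embed_text

def split_into_chunks(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    # Naive fixed-size chunking — swap in a sentence- or heading-aware splitter.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

@shared_task
def reindex_document(document_id: int):
    doc = KnowledgeDocument.objects.get(id=document_id)
    checksum = hashlib.sha256(doc.content.encode()).hexdigest()
    if doc.indexed_at and doc.checksum == checksum:
        return  # content unchanged — skip the embedding spend
    doc.chunks.all().delete()
    for i, piece in enumerate(split_into_chunks(doc.content)):
        KnowledgeChunk.objects.create(
            document=doc,
            content=piece,
            chunk_index=i,
            embedding=embed_text(piece),
        )
    doc.checksum = checksum
    doc.indexed_at = timezone.now()
    doc.save(update_fields=['checksum', 'indexed_at'])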
5. Cost Architecture
LLM costs are invisible until they arrive as a bill. At prototype scale, a few thousand tokens per request is negligible. At production scale — thousands of requests per day — poor cost hygiene can produce a bill that kills your margins. Cost architecture means designing token spend as deliberately as you design database queries.
Token budget enforcement
# tracking.py
from django.db import models
from django.conf import settings
class TokenBudget(models.Model):
"""Per-user or per-tenant daily token budget."""
    user = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
daily_limit = models.PositiveIntegerField(default=100_000) # tokens/day
used_today = models.PositiveIntegerField(default=0)
reset_date = models.DateField()
def has_capacity(self, estimated_tokens: int) -> bool:
return (self.used_today + estimated_tokens) <= self.daily_limit
def consume(self, tokens: int) -> None:
TokenBudget.objects.filter(pk=self.pk).update(
used_today=models.F('used_today') + tokens
)
class LLMUsageLog(models.Model):
"""Immutable log of every LLM call for cost attribution and debugging."""
request = models.ForeignKey('AIRequest', on_delete=models.SET_NULL,
null=True, related_name='usage_logs')
prompt_key = models.CharField(max_length=60)
prompt_version = models.CharField(max_length=20, blank=True, default='')
model = models.CharField(max_length=60)
input_tokens = models.PositiveIntegerField() # uncached input
output_tokens = models.PositiveIntegerField()
cache_read = models.PositiveIntegerField(default=0) # cache hits — 10% of input
cache_write = models.PositiveIntegerField(default=0) # cache writes — 125% of input
latency_ms = models.PositiveIntegerField(default=0)
created = models.DateTimeField(auto_now_add=True)
@property
def cost_usd(self) -> float:
# Sonnet 4.6 pricing — verify on Anthropic pricing page.
# input_tokens / cache_read / cache_write are mutually exclusive counters
# from the Anthropic API — don't double-count by subtracting one from another.
input_cost = self.input_tokens * 3.00 / 1_000_000
cache_read_cost = self.cache_read * 0.30 / 1_000_000
cache_write_cost = self.cache_write * 3.75 / 1_000_000
output_cost = self.output_tokens * 15.00 / 1_000_000
return round(input_cost + cache_read_cost + cache_write_cost + output_cost, 6)
def log_usage(response, *, request, version: str = '', latency_ms: int = 0) -> LLMUsageLog:
"""Persist an LLMUsageLog row from an Anthropic Messages response."""
usage = response.usage
return LLMUsageLog.objects.create(
request = request,
prompt_key = request.prompt_key,
prompt_version = version,
model = response.model,
input_tokens = usage.input_tokens,
output_tokens = usage.output_tokens,
# These attributes only appear on the usage object when prompt caching is in use.
cache_read = getattr(usage, 'cache_read_input_tokens', 0) or 0,
cache_write = getattr(usage, 'cache_creation_input_tokens', 0) or 0,
latency_ms = latency_ms,
)
The reset_date field needs a nightly reset task. Wire it with Celery Beat:
# tasks.py — run nightly via Celery Beat
from celery import shared_task
from django.utils import timezone
from .tracking import TokenBudget
@shared_task
def reset_daily_token_budgets():
today = timezone.now().date()
TokenBudget.objects.filter(reset_date__lt=today).update(
used_today=0, reset_date=today
)
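Enforcement belongs in the view, before the task is enqueued. A sketch extending submit_request from §2 — the four-characters-per-token estimate and the 2,048-token output headroom are rough assumptions:
# views.py — reject before spending tokens, not after (sketch)
from django.utils import timezone
from .tracking import TokenBudget

def has_budget(user, payload: dict) -> bool:
    budget, _ = TokenBudget.objects.get_or_create(
        user=user, defaults={'reset_date': timezone.now().date()}
    )
    estimated = len(str(payload)) // 4 + 2048  # ~4 chars/token input + output headroom
    return budget.has_capacity(estimated)

# In submit_request, before AIRequest.objects.create(...):
#     if not has_budget(request.user, body.get('payload', {})):
#         return JsonResponse({'error': 'Daily token budget exceeded'}, status=429)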
Model routing: use the cheapest model that is good enough
Not every request needs your most capable model. Classification, extraction, and short summarisation tasks that Haiku handles just as well cost roughly one-fifteenth of what Opus charges. Route by task type:
# prompts/registry.py (extend PromptTemplate)
MODEL_ROUTING = {
# Classification, routing, simple extraction → Haiku
'ticket_classify': 'claude-haiku-4-5-20251001',
'intent_detection': 'claude-haiku-4-5-20251001',
'sentiment_score': 'claude-haiku-4-5-20251001',
# Summarisation, Q&A, RAG responses → Sonnet
'document_summary': 'claude-sonnet-4-6',
'rag_answer': 'claude-sonnet-4-6',
'code_review': 'claude-sonnet-4-6',
# Complex reasoning, multi-step analysis → Opus
'contract_analysis': 'claude-opus-4-7',
'architecture_audit': 'claude-opus-4-7',
}
def model_for(prompt_key: str) -> str:
return MODEL_ROUTING.get(prompt_key, 'claude-sonnet-4-6') # sensible default
Prompt caching for stable system prompts
import anthropic
client = anthropic.Anthropic()
# System prompts and RAG context that do not change between calls
# are prime candidates for prompt caching.
# Anthropic caches blocks marked cache_control for 5 minutes.
# Cache hits cost ~10% of normal input token price.
def call_with_cache(system: str, user_msg: str, context: str, model: str) -> str:
response = client.messages.create(
model = model,
max_tokens = 1024,
system = [
{
'type': 'text',
'text': system,
'cache_control': {'type': 'ephemeral'}, # cache the system prompt
}
],
messages = [
{
'role': 'user',
'content': [
{
'type': 'text',
'text': context,
'cache_control': {'type': 'ephemeral'}, # cache retrieved context
},
{
'type': 'text',
'text': user_msg,
# No cache_control — this changes every request
},
],
}
],
)
# Check cache hit: response.usage.cache_read_input_tokens > 0
return response.content[0].text
6. Observability for AI Systems
Standard application observability (error rates, latency percentiles, uptime) is necessary but not sufficient for AI-native applications. You also need to measure quality — and quality is not binary, it does not show up in logs, and it degrades silently when you change a prompt or swap a model version.
Three layers of AI observability
- Infrastructure metrics: latency (p50/p95/p99), token consumption per endpoint, API error rate, Celery queue depth. These go into Grafana or Datadog like any other service metric.
- Quality metrics: user thumbs-up/down signals, output length distribution, refusal rate (outputs containing "I cannot"), hallucination detection on known-answer test cases. Track these as custom events.
- Cost metrics: cost per request, cost per user, cost per prompt key, cache hit rate. Alert when any endpoint exceeds its budget threshold.
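Cost per prompt key and cache hit rate fall straight out of the LLMUsageLog table with an ORM aggregation — a sketch of a weekly rollup:
# reporting.py — per-prompt-key token totals and cache hit rate (sketch)
from datetime import timedelta
from django.db.models import Sum
from django.utils import timezone
from .tracking import LLMUsageLog

def usage_by_prompt_key(days: int = 7) -> list[dict]:
    since = timezone.now() - timedelta(days=days)
    rows = list(
        LLMUsageLog.objects.filter(created__gte=since)
        .values('prompt_key')
        .annotate(
            input=Sum('input_tokens'),
            output=Sum('output_tokens'),
            cached=Sum('cache_read'),
            written=Sum('cache_write'),
        )
    )
    for row in rows:
        cached = row['cached'] or 0
        uncached = (row['input'] or 0) + (row['written'] or 0)
        row['cache_hit_rate'] = cached / (cached + uncached) if (cached + uncached) else 0.0
    return rows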
# models.py — quality feedback
class AIOutputFeedback(models.Model):
class Rating(models.IntegerChoices):
THUMBS_DOWN = -1
THUMBS_UP = 1
request = models.OneToOneField(AIRequest, on_delete=models.CASCADE,
related_name='feedback')
user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
rating = models.SmallIntegerField(choices=Rating)
comment = models.TextField(blank=True)
created = models.DateTimeField(auto_now_add=True)
# Management command to compute daily quality report
# management/commands/ai_quality_report.py
from django.core.management.base import BaseCommand
from django.db.models import Avg, Count
from django.utils import timezone
from datetime import timedelta
class Command(BaseCommand):
help = 'Print daily AI quality summary'
def handle(self, *args, **options):
yesterday = timezone.now().date() - timedelta(days=1)
from myapp.models import AIRequest, AIOutputFeedback, LLMUsageLog
requests = AIRequest.objects.filter(created__date=yesterday)
done = requests.filter(status='done').count()
failed = requests.filter(status='failed').count()
total_cost = sum(
log.cost_usd for log in
LLMUsageLog.objects.filter(created__date=yesterday)
)
thumbs_up = AIOutputFeedback.objects.filter(
created__date=yesterday, rating=1
).count()
thumbs_dn = AIOutputFeedback.objects.filter(
created__date=yesterday, rating=-1
).count()
self.stdout.write(f'--- AI Quality Report: {yesterday} ---')
self.stdout.write(f'Requests completed: {done} failed: {failed}')
self.stdout.write(f'Total cost: ${total_cost:.4f}')
self.stdout.write(f'User feedback: +{thumbs_up} / -{thumbs_dn}')
if thumbs_up + thumbs_dn > 0:
rate = thumbs_up / (thumbs_up + thumbs_dn) * 100
self.stdout.write(f'Approval rate: {rate:.1f}%')
Regression testing for prompts
Before deploying a prompt change, run it against a fixed set of test cases with known expected outputs. This is your quality gate — equivalent to a test suite for code:
# tests/test_prompts.py
import pytest
import anthropic
from prompts.registry import render_prompt
GOLDEN_CASES = [
{
'prompt_key': 'ticket_classify',
'payload': {'ticket': 'The login button does nothing on mobile Safari'},
'expected_contains': 'bug',
},
{
'prompt_key': 'ticket_classify',
'payload': {'ticket': 'Please add dark mode to the dashboard'},
'expected_contains': 'feature',
},
{
'prompt_key': 'sentiment_score',
'payload': {'text': 'This product completely ruined my morning.'},
'expected_contains': 'negative',
},
]
@pytest.mark.parametrize('case', GOLDEN_CASES)
def test_prompt_regression(case):
client = anthropic.Anthropic()
system, user_msg = render_prompt(case['prompt_key'], case['payload'])
response = client.messages.create(
model = 'claude-haiku-4-5-20251001',
max_tokens = 256,
system = system,
messages = [{'role': 'user', 'content': user_msg}],
)
output = response.content[0].text.lower()
assert case['expected_contains'] in output, (
f"Prompt {case['prompt_key']!r} regression: "
f"expected {case['expected_contains']!r} in output.\n"
f"Got: {output}"
)
These tests make real API calls and incur token cost — mark them with a custom pytest
marker (e.g. @pytest.mark.llm) and run them as a pre-deploy gate in CI,
separate from the main unit test suite.
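Registering the marker is a one-liner in conftest.py, after which the suites split cleanly on the command line: pytest -m "not llm" on every push, pytest -m llm as the pre-deploy gate.
# conftest.py — register the custom marker so -m "not llm" can exclude live-API tests
def pytest_configure(config):
    config.addinivalue_line(
        'markers', 'llm: tests that call the live LLM API and cost tokens'
    )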
7. Resilience: Fallbacks, Timeouts, and Circuit Breakers
LLM APIs have real downtime and real rate limits. An AI-native application that crashes when the model API returns a 529 is not production-ready. You need fallback strategies at every layer.
Timeout discipline
import anthropic
import httpx
# Set explicit timeouts — the SDK's default timeout is far too generous for a
# production endpoint. With httpx.Timeout:
#   connect: time to establish the TCP connection
#   read: maximum wait for each chunk of the response
client = anthropic.Anthropic(
    timeout=httpx.Timeout(
        connect = 5.0,  # seconds
        read = 90.0,    # generous for long responses; reduce per-endpoint if possible
        write = 5.0,
        pool = 5.0,
    ),
    max_retries = 3,  # SDK retries with exponential backoff automatically
)
Fallback chain
import anthropic
import httpx
import logging
log = logging.getLogger(__name__)
client = anthropic.Anthropic(
    timeout = httpx.Timeout(connect=5.0, read=90.0, write=5.0, pool=5.0),
    max_retries = 3,
)
# Try primary model → cheaper fallback → cached response → graceful degradation
def resilient_call(system: str, user_msg: str, cache_key: str) -> str:
PRIMARY = 'claude-sonnet-4-6'
FALLBACK = 'claude-haiku-4-5-20251001'
for model in [PRIMARY, FALLBACK]:
try:
resp = client.messages.create(
model = model,
max_tokens = 1024,
system = system,
messages = [{'role': 'user', 'content': user_msg}],
)
return resp.content[0].text
except anthropic.RateLimitError:
log.warning('Rate limited on %s — trying fallback', model)
continue
except anthropic.APIStatusError as e:
if e.status_code == 529:
log.warning('Model overloaded (%s) — trying fallback', model)
continue
raise
# Both models failed — try the Redis cache for a recent similar response
from django.core.cache import cache
cached = cache.get(cache_key)
if cached:
log.warning('All models failed — serving cached response for %s', cache_key)
return cached
# Nothing works — raise so the Celery task marks the request as failed
raise RuntimeError('All LLM fallbacks exhausted')
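The circuit breaker promised in the heading is the last layer: after repeated failures, stop calling the API entirely for a cool-down period instead of hammering a provider that is already struggling. A minimal cache-backed sketch, with illustrative thresholds:
# resilience.py — minimal cache-backed circuit breaker (sketch)
from django.core.cache import cache

FAILURE_THRESHOLD = 5      # consecutive failures before the breaker opens
COOL_DOWN_SECONDS = 120    # how long to short-circuit once open

class CircuitOpen(RuntimeError):
    """Raised instead of calling the API while the breaker is open."""

def guard() -> None:
    if cache.get('llm-breaker-open'):
        raise CircuitOpen('LLM circuit breaker is open — skipping API call')

def record_failure() -> None:
    failures = cache.get('llm-breaker-failures', 0) + 1
    cache.set('llm-breaker-failures', failures, timeout=COOL_DOWN_SECONDS)
    if failures >= FAILURE_THRESHOLD:
        cache.set('llm-breaker-open', True, timeout=COOL_DOWN_SECONDS)

def record_success() -> None:
    cache.delete('llm-breaker-failures')
Call guard() at the top of resilient_call(), record_failure() in its except blocks, and record_success() after a good response; the breaker then fails fast during the cool-down rather than letting every request walk the whole fallback chain.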
Human-in-the-loop for high-stakes outputs
For outputs that carry risk — legal documents, medical summaries, financial reports — route low-confidence results to a human review queue instead of returning them directly:
class AIReviewQueue(models.Model):
"""Outputs requiring human sign-off before delivery."""
class Priority(models.TextChoices):
HIGH = 'high'
MEDIUM = 'medium'
LOW = 'low'
request = models.OneToOneField(AIRequest, on_delete=models.CASCADE)
reason = models.CharField(max_length=200) # why it was flagged
priority = models.CharField(max_length=8, choices=Priority, default=Priority.MEDIUM)
reviewed_by = models.ForeignKey('auth.User', null=True, blank=True,
on_delete=models.SET_NULL, related_name='reviews')
reviewed_at = models.DateTimeField(null=True)
approved = models.BooleanField(null=True) # None = pending
created = models.DateTimeField(auto_now_add=True)
def needs_human_review(output: str, prompt_key: str) -> tuple[bool, str]:
"""Return (needs_review, reason) based on output content."""
FLAGGED_PATTERNS = [
('I cannot', 'model refusal — potential safety filter trigger'),
('I am not able', 'model refusal — out-of-scope request'),
('consult a', 'professional referral — may need expert validation'),
]
output_lower = output.lower()
for pattern, reason in FLAGGED_PATTERNS:
if pattern.lower() in output_lower:
return True, reason
# High-stakes prompt keys always get reviewed
HIGH_STAKES = {'contract_analysis', 'medical_summary', 'financial_report'}
if prompt_key in HIGH_STAKES:
return True, 'high-stakes prompt key — mandatory human review'
return False, ''
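Wiring the check into the Celery task from §2 is a few lines — a sketch of the branch that runs after the result is saved (the route_for_review helper name is illustrative, and the import path depends on where needs_human_review lives in your project):
# tasks.py — flag outputs for human review after saving the result (sketch)
from .models import AIReviewQueue
from .review import needs_human_review  # wherever the helper above lives

def route_for_review(req) -> None:
    flagged, reason = needs_human_review(req.result, req.prompt_key)
    if flagged:
        AIReviewQueue.objects.create(request=req, reason=reason)
        # Hold delivery to the user until a reviewer sets approved=True.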
8. Django Project Layout for AI-Native Apps
The flat myapp/ structure that works fine for a standard Django project
becomes a liability in an AI-native application. Separate concerns that evolve at
different rates:
myproject/
├── config/ # Django settings, URLs, WSGI/ASGI
│ ├── settings/
│ │ ├── base.py
│ │ ├── production.py
│ │ └── local.py
│ └── urls.py
│
├── ai/ # All AI concerns — no Django views in here
│ ├── client.py # Anthropic/OpenAI client singletons
│ ├── prompts/
│ │ ├── registry.py # PromptTemplate loader and renderer
│ │ └── templates/ # YAML prompt files
│ │ ├── document_summary.yaml
│ │ └── ticket_classify.yaml
│ ├── rag/
│ │ ├── pipeline.py # chunk, embed, retrieve
│ │ ├── ingest.py # Celery tasks for background indexing
│ │ └── models.py # KnowledgeDocument, KnowledgeChunk
│ ├── tasks.py # Celery tasks that call the LLM
│ ├── tracking.py # log_usage(), cost calculations
│ └── resilience.py # fallback chain, circuit breaker
│
├── requests/ # HTTP-facing request/response handling
│ ├── models.py # AIRequest, TokenBudget, AIOutputFeedback
│ ├── views.py # submit, poll, feedback endpoints
│ ├── serializers.py # DRF serializers
│ └── urls.py
│
├── documents/ # Domain app — knows nothing about AI
│ ├── models.py # Document, DocumentVersion
│ ├── views.py
│ └── signals.py # triggers RAG re-index on document save
│
└── tests/
├── test_prompts.py # golden-case regression tests
├── test_rag.py # retrieval accuracy tests
└── test_cost.py # budget enforcement tests
The key principle: the ai/ package is pure Python — no Django views, no
URL configs, no templates. The requests/ app handles HTTP. Domain apps
(documents/) know nothing about AI. This keeps the AI layer testable in
isolation and swappable without touching the HTTP layer.
Wire the layers together through Celery tasks and Django signals, not direct calls:
# documents/signals.py
from django.db.models.signals import post_save
from django.dispatch import receiver
from .models import Document
from ai.rag.ingest import reindex_document
@receiver(post_save, sender=Document)
def trigger_rag_reindex(sender, instance, **kwargs):
"""Re-index the document in the vector store whenever it is saved."""
reindex_document.delay(instance.id) # ingest task handles checksum comparison and skips if unchanged
9. Production Checklist
- Every LLM call goes through a Celery task. No synchronous LLM calls in Django request handlers. Period.
- Every prompt is versioned in a YAML file. Prompt key and version are logged alongside every AI output row.
- Token budgets are enforced per user before enqueuing tasks. Reject requests that would exceed the daily limit, not after spending the tokens.
- Prompt caching is enabled for all system prompts over 1,024 tokens. Check cache_read_input_tokens in usage logs — a cache hit rate below 60% on high-volume prompts is leaving money on the table.
- Model routing is explicit. Classification → Haiku. Summarisation → Sonnet. Complex reasoning → Opus. Unrouted defaults to Sonnet.
- Timeouts are set on the SDK client, not just on Celery tasks. Both are needed — the SDK timeout prevents a single stalled connection; the Celery soft_time_limit prevents a stalled task from blocking a worker.
- Fallback chain tested in staging. Inject anthropic.RateLimitError in tests and verify the application degrades gracefully rather than crashing.
- Golden-case regression tests run in CI before every prompt deploy. A prompt change that fails 1 of 20 test cases does not merge.
- Quality feedback is captured and reviewed weekly. Approval rate below 80% on any prompt key triggers a prompt review.
- RAG chunk IDs are stored with every output. When a source document is updated, the dependent AI outputs are flagged for review or regeneration.
- Cost dashboards are visible to the team, not just engineering. Product managers who request AI features should see what they cost.
- Human-in-the-loop queues exist for high-stakes outputs before you launch, not after the first incident.