Profiling Python for Energy Efficiency: Beyond Performance
Most Python profilers measure time. But fast code isn't always green code — a highly parallelised function can finish in half the time while drawing twice the power. This post shows you how to measure the energy your Python applications actually consume, where the hidden hotspots are, and how to write code that's lighter on both the CPU and the planet, with practical Django examples throughout.
1. Why Energy Efficiency Matters for Python Developers
The data centre industry accounts for roughly 1–2% of global electricity consumption — a figure that grows every year as AI workloads, streaming, and cloud-native architectures scale. The software running inside those data centres determines how much of that electricity is actually necessary.
For Python developers, this is increasingly a practical concern, not just an ethical one:
- Cloud cost: energy translates directly into compute bills. A 20% reduction in CPU cycles is a 20% reduction in vCPU-hours.
- ESG reporting: enterprises are now required to report Scope 3 emissions, which includes software-related compute. Your Django API is in scope.
- Edge and IoT: battery-powered devices die faster when your code is wasteful. This is a correctness issue, not a preference.
- Regulatory pressure: the EU's Corporate Sustainability Reporting Directive (CSRD) and similar frameworks are pushing software teams to quantify their footprint.
The good news: the techniques that reduce energy consumption overlap significantly with those that improve performance. Measuring energy is the first step — and Python now has excellent tools for it.
2. The Performance-vs-Energy Gap
Time and energy are related but not equivalent. Energy = Power × Time. If you make a function run twice as fast by spinning up four CPU cores instead of one, you've reduced time but potentially increased total energy draw.
import time
import multiprocessing
def cpu_bound_task(n):
    return sum(i * i for i in range(n))
# Single process: runs in ~2s, uses ~1 core at ~15W = ~30 J
start = time.perf_counter()
result = cpu_bound_task(10_000_000)
print(f"Single process: {time.perf_counter() - start:.2f}s")
# Four processes: runs in ~0.6s, but uses ~4 cores at ~15W each = ~36 J
# Faster wall-clock time, but MORE total energy consumed
# Note: use multiprocessing (not threading) — the GIL prevents Python threads
# from running CPU-bound work in parallel
start = time.perf_counter()
with multiprocessing.Pool(4) as pool:
    pool.map(cpu_bound_task, [2_500_000] * 4)
print(f"Four processes: {time.perf_counter() - start:.2f}s")
The four-process version is faster by wall-clock time but draws more total energy. For a batch job that runs once a night this may not matter. For a web server handling 10,000 requests per minute it absolutely does — and the only way to know the real picture is to measure.
3. Measuring CPU Energy with pyRAPL
pyRAPL wraps
Intel's RAPL (Running Average Power Limit) interface — the hardware counters that CPUs expose
for measuring energy consumed by the CPU package and DRAM. It works on Linux with Intel
Sandy Bridge (2011) and later CPUs, and requires read access to
/sys/class/powercap/intel-rapl/.
pip install pyRAPL
# Grant access to RAPL on Linux (required once per boot)
sudo chmod -R a+r /sys/class/powercap/intel-rapl/
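Under the hood, these numbers come from plain sysfs files, so it is worth seeing what pyRAPL wraps. Here is a minimal sketch of a direct read, assuming a single-socket machine with the default intel-rapl:0 package domain; do_work is a hypothetical stand-in for the code you want to measure:
from pathlib import Path

DOMAIN = Path("/sys/class/powercap/intel-rapl/intel-rapl:0")

def read_energy_uj() -> int:
    # Cumulative energy counter in microjoules
    return int((DOMAIN / "energy_uj").read_text())

before = read_energy_uj()
do_work()  # hypothetical: the code under measurement
after = read_energy_uj()

delta_uj = after - before
if delta_uj < 0:  # the counter wraps at max_energy_range_uj
    delta_uj += int((DOMAIN / "max_energy_range_uj").read_text())
print(f"CPU package energy: {delta_uj / 1_000_000:.4f} J")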
Decorator usage
import pyRAPL
pyRAPL.setup()
@pyRAPL.measureit(number=5) # average over 5 runs
def process_queryset(records):
    return [r.value * 1.2 for r in records]
process_queryset(my_records)
# prints: label=process_queryset | duration=0.034s | pkg=12400µJ | dram=3200µJ
Context manager usage
import pyRAPL
pyRAPL.setup()
meter = pyRAPL.Measurement('data-pipeline')
meter.begin()
# Your code here
results = [expensive_transform(row) for row in dataset]
meter.end()
pkg_joules = meter.result.pkg[0] / 1_000_000 # µJ → J
dram_joules = meter.result.dram[0] / 1_000_000 # µJ → J
print(f"CPU package energy: {pkg_joules:.4f} J")
print(f"DRAM energy: {dram_joules:.4f} J")
print(f"Total: {pkg_joules + dram_joules:.4f} J")
Comparing two implementations
import pyRAPL
pyRAPL.setup()
data = list(range(1_000_000))
# Implementation A — naive string building
def build_string_naive(items):
    result = ""
    for item in items:
        result += str(item) + ","
    return result
# Implementation B — join
def build_string_join(items):
    return ",".join(str(item) for item in items)
for label, fn in [("naive", build_string_naive), ("join", build_string_join)]:
    meter = pyRAPL.Measurement(label)
    meter.begin()
    fn(data)
    meter.end()
    joules = meter.result.pkg[0] / 1_000_000
    print(f"{label:10s} → {joules:.4f} J")
# naive → 0.8412 J
# join → 0.1937 J (4.3× less energy)
AMD and Apple Silicon: RAPL is Intel-specific. Recent AMD CPUs expose similar counters on Linux 5.8+ through perf's RAPL events (and the powercap interface on newer kernels). Apple Silicon exposes power data through powermetrics on macOS. Neither is supported by pyRAPL directly — use CodeCarbon (below) for cross-platform coverage.
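To illustrate the macOS route, powermetrics can be sampled from a subprocess. This is a sketch, not a supported API: it assumes the cpu_power sampler's text output contains a "CPU Power: <n> mW" line, which varies by chip generation.
import re
import subprocess

# One 1-second sample from the CPU power sampler (needs sudo)
out = subprocess.run(
    ["sudo", "powermetrics", "--samplers", "cpu_power", "-i", "1000", "-n", "1"],
    capture_output=True, text=True,
).stdout

match = re.search(r"CPU Power:\s*(\d+)\s*mW", out)
if match:
    print(f"CPU draw: {int(match.group(1)) / 1000:.2f} W")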
4. Tracking CO₂ Emissions with CodeCarbon
CodeCarbon goes a step further than raw joules: it estimates the CO₂ equivalent emissions of your code, using the energy intensity of the local electricity grid (pulled from real-time data where available). It works cross-platform and cross-architecture.
pip install codecarbon
Context manager
from codecarbon import EmissionsTracker
with EmissionsTracker(project_name="my-django-api") as tracker:
    train_model(X_train, y_train)
# emissions.csv is written automatically with:
# timestamp, duration, energy (kWh), emissions (kgCO2eq), country, region, cloud_provider
Decorator
from codecarbon import track_emissions
@track_emissions(project_name="recommendation-engine", output_dir="/var/log/carbon")
def generate_recommendations(user_id: int) -> list:
    # ... heavy ML inference ...
    return results
Programmatic access to results
from codecarbon import EmissionsTracker
tracker = EmissionsTracker(save_to_file=False) # in-memory only
tracker.start()
run_batch_job()
emissions_data = tracker.stop() # returns kg CO2eq as a float
energy_kwh = tracker._total_energy.kWh  # note: private attribute, may change between versions
print(f"Energy consumed: {energy_kwh:.6f} kWh")
print(f"CO2 equivalent: {emissions_data * 1000:.4f} gCO2eq")
# Equivalent to driving roughly:
km_equivalent = emissions_data / 0.21  # kg CO2 ÷ 0.21 kgCO2/km (EU average car)
print(f"Equivalent to driving {km_equivalent:.2f} km")
Offline mode (air-gapped servers)
from codecarbon import OfflineEmissionsTracker
# No network access, so pin the grid region yourself instead of relying on
# real-time lookups. AWS eu-west-2 (London) runs at roughly ~230 gCO2/kWh.
tracker = OfflineEmissionsTracker(
    country_iso_code="GBR",  # ISO 3166-1 alpha-3 code for the grid to assume
    cloud_provider="aws",
    cloud_region="eu-west-2",
    save_to_file=True,
    output_dir="/var/log/carbon",
)
with tracker:
    run_etl_pipeline()
5. Linux perf Energy Counters
The Linux perf tool exposes RAPL events directly without any Python dependency.
It's the lowest-overhead option and works for profiling whole processes, including
multi-process Django deployments under Gunicorn.
# Check which energy events are available on your CPU
perf list | grep energy
# Profile a script — CPU package + DRAM + integrated GPU
perf stat \
-e power/energy-pkg/ \
-e power/energy-ram/ \
-e power/energy-gpu/ \
python manage.py process_batch --date 2026-05-09
# Output (example):
# Performance counter stats for 'python manage.py process_batch':
# 14.32 Joules power/energy-pkg/
# 3.81 Joules power/energy-ram/
# 0.00 Joules power/energy-gpu/
#
# 4.218 seconds time elapsed
Measuring a running Gunicorn deployment
# RAPL counters are package-wide: perf cannot attach power events to a single
# PID, so measure the whole machine while the workers handle traffic
sudo perf stat -a -e power/energy-pkg/,power/energy-ram/ sleep 30
# Divide joules by the window length to get average watts (J / s = W):
# 12.4 J / 30 s ≈ 0.41 W
# Subtract an idle-window baseline, then divide by the worker count
# (pgrep -fc "gunicorn.*myapp") to approximate per-worker draw
Parsing perf output in Python
import subprocess
import re
def measure_energy(command: list[str]) -> dict[str, float]:
    result = subprocess.run(
        ["perf", "stat", "-e", "power/energy-pkg/,power/energy-ram/",
         "--field-separator", ";", *command],
        capture_output=True,
        text=True,  # perf stat writes its counter output to stderr
    )
    energy = {}
    for line in result.stderr.splitlines():
        if "energy-pkg" in line:
            energy["pkg_joules"] = float(line.split(";")[0].strip())
        elif "energy-ram" in line:
            energy["ram_joules"] = float(line.split(";")[0].strip())
    return energy
stats = measure_energy(["python", "myscript.py"])
print(f"CPU: {stats['pkg_joules']:.2f} J | RAM: {stats['ram_joules']:.2f} J")
6. GPU Energy with pynvml
If your Django application runs ML inference or image processing on an NVIDIA GPU,
pynvml (the Python binding for NVML) gives you real-time power draw per device.
pip install nvidia-ml-py
import threading
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def measure_gpu_energy(fn, *args, sample_interval=0.1, **kwargs):
    """Run fn(*args, **kwargs) and return (result, joules)."""
    samples = []
    stop = threading.Event()

    def sampler():
        # Poll instantaneous power draw until the main thread signals stop
        while not stop.is_set():
            mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # milliwatts
            samples.append(mw / 1000)  # → watts
            time.sleep(sample_interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    result = fn(*args, **kwargs)
    stop.set()
    t.join()
    # Rectangular integration: energy ≈ sum(power_samples) × interval
    joules = sum(samples) * sample_interval
    return result, joules
result, gpu_joules = measure_gpu_energy(run_inference, model, input_tensor)
print(f"GPU energy: {gpu_joules:.3f} J")
pynvml.nvmlShutdown()
Tip: Enable GPU power persistence mode to avoid cold-start spikes skewing
short measurements: sudo nvidia-smi -pm 1.
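For a quick spot check without writing any Python, nvidia-smi exposes the same NVML power readings from the shell (field support depends on the driver and card):
nvidia-smi --query-gpu=power.draw,power.limit --format=csv
# power.draw [W], power.limit [W]
# 68.42 W, 250.00 W   (example output)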
7. Common Python Energy Hotspots
Armed with measurement tools, here are the patterns that consistently show up as energy offenders across production Python codebases.
String concatenation in loops
Each += allocates a new string object and copies the existing contents into it, so the loop does quadratic work in the total output size. Most of that cost is pure allocation and copying that a single join or a write buffer avoids.
# High energy — new allocation on every iteration
def build_csv_naive(rows: list[dict]) -> str:
    output = ""
    for row in rows:
        output += f"{row['id']},{row['name']},{row['value']}\n"
    return output

# Low energy — single allocation at the end
def build_csv_join(rows: list[dict]) -> str:
    return "\n".join(f"{row['id']},{row['name']},{row['value']}" for row in rows)

# Lowest energy for large outputs — streaming to a buffer
import io

def build_csv_buffer(rows: list[dict]) -> str:
    buf = io.StringIO()
    for row in rows:
        buf.write(f"{row['id']},{row['name']},{row['value']}\n")
    return buf.getvalue()
NumPy vectorisation vs Python loops
Python loops run one bytecode instruction per element. NumPy operations run compiled C on whole arrays. The energy difference scales with dataset size.
import numpy as np
prices = list(range(1_000_000))
# Pure Python — each element processed one at a time in the interpreter
def apply_tax_python(prices: list[float], rate: float) -> list[float]:
    return [p * (1 + rate) for p in prices]

# NumPy — C-level loop over contiguous memory
def apply_tax_numpy(prices: np.ndarray, rate: float) -> np.ndarray:
    return prices * (1 + rate)
prices_np = np.array(prices, dtype=np.float64)
# Typical result on 1M elements:
# Python loop: ~0.19 J ~180ms
# NumPy: ~0.009 J ~8ms (21× less energy, 22× faster)
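To reproduce numbers like these on your own hardware, the pyRAPL comparison loop from section 3 drops straight in:
import pyRAPL

pyRAPL.setup()

for label, fn, arg in [
    ("python", apply_tax_python, prices),
    ("numpy", apply_tax_numpy, prices_np),
]:
    meter = pyRAPL.Measurement(label)
    meter.begin()
    fn(arg, 0.2)  # 20% tax rate, arbitrary
    meter.end()
    print(f"{label:8s} → {meter.result.pkg[0] / 1_000_000:.4f} J")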
Idle CPU in threads vs asyncio
Threads blocked on I/O still cost the kernel work: every completion is a context switch, and a large pool of waiting threads generates frequent wake-ups that keep interrupting the CPU's descent into low-power C-states. asyncio multiplexes the same waits onto a single thread, so the CPU idles more deeply between I/O completions.
import asyncio
import httpx
# Threading — each thread holds a kernel thread context, prevents deep sleep
def fetch_all_sync(urls: list[str]) -> list[str]:
    import concurrent.futures
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        return list(pool.map(lambda u: httpx.get(u).text, urls))
# asyncio — single thread, CPU genuinely idles between I/O completions
async def fetch_all_async(urls: list[str]) -> list[str]:
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        return [r.text for r in responses]
# asyncio version typically uses 40-60% less CPU energy for I/O-bound workloads
Memory allocation pressure
Frequent small allocations trigger the garbage collector, which is itself CPU-intensive. Reusing objects rather than recreating them reduces both memory and energy.
from dataclasses import dataclass
# High GC pressure — new dict created for every record
def process_records_dict(records):
    results = []
    for r in records:
        results.append({
            "id": r.id,
            "value": r.value * 1.2,
            "label": r.label.upper(),
        })
    return results
# Lower pressure — __slots__ dataclass avoids per-instance __dict__
@dataclass(slots=True)  # slots=True requires Python 3.10+
class ProcessedRecord:
    id: int
    value: float
    label: str

def process_records_slots(records):
    return [ProcessedRecord(r.id, r.value * 1.2, r.label.upper()) for r in records]
Generator vs list materialisation
Materialising a large list to iterate over it once wastes both memory and the energy needed to allocate and then garbage-collect it.
# Unnecessary materialisation — builds full list in RAM before iterating
total = sum([x * x for x in range(10_000_000)])
# Generator — produces one value at a time, negligible memory overhead
total = sum(x * x for x in range(10_000_000))
# Same result, but the list version allocates ~80 MB it immediately discards
8. Django-Specific Patterns
N+1 queries: the silent energy killer
Each database round-trip involves network I/O, OS scheduler work, and query parsing — all of which burn CPU cycles. An N+1 query that fires 500 SQL statements does 500× the work of a single joined query.
# N+1 — one query per article to fetch the author (500 queries for 500 articles)
articles = Article.objects.all()[:500]
for article in articles:
    print(article.author.name)  # hits the DB on every iteration

# One query — JOIN fetches authors in a single round-trip
articles = Article.objects.select_related("author").all()[:500]
for article in articles:
    print(article.author.name)  # author already loaded by the JOIN
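For many-to-many or reverse foreign-key relations, where a single JOIN would duplicate rows, prefetch_related gets the same effect with one extra batched query. A sketch assuming Article has a tags relation:
# Two queries total: one for articles, one IN-query for all their tags
articles = Article.objects.prefetch_related("tags")[:500]
for article in articles:
    names = [tag.name for tag in article.tags.all()]  # no extra queries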
defer() and only() — fetch what you need
Transferring unused columns from the database wastes network bandwidth, deserialisation
CPU, and ORM memory. Use only() to restrict the fetch to fields you actually use.
# Fetches all columns including large body text, thumbnails, etc.
articles = Article.objects.all()
# Only the two fields needed for a listing page
articles = Article.objects.only("title", "published_at")
# Exclude one heavy field while keeping everything else
articles = Article.objects.defer("body")
iterator() for large querysets
Django's default queryset evaluation loads the entire result set into memory. On large tables
this allocates hundreds of megabytes that the GC then has to collect.
iterator(chunk_size=…) streams results in batches.
# Loads all 500k rows into RAM at once
for record in DataPoint.objects.filter(processed=False):
    process(record)

# Streams in chunks of 2000 — constant memory, lower GC pressure
for record in DataPoint.objects.filter(processed=False).iterator(chunk_size=2000):
    process(record)
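The write side deserves the same care: saving each record individually inside that loop would reintroduce one round-trip per row. A sketch batching the writes with bulk_update, assuming process mutates the instance and DataPoint has a processed flag:
batch = []
for record in DataPoint.objects.filter(processed=False).iterator(chunk_size=2000):
    process(record)
    record.processed = True
    batch.append(record)
    if len(batch) >= 2000:
        DataPoint.objects.bulk_update(batch, ["processed"])
        batch.clear()
if batch:  # flush the final partial batch
    DataPoint.objects.bulk_update(batch, ["processed"])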
Database-side aggregation
Pulling rows into Python to sum or average them is always more expensive than asking the database to do it. The database does the work in optimised C, sends a single row back, and your application never allocates the intermediate data.
from django.db.models import Avg, Sum
# High energy — transfers all rows to Python, then computes in the interpreter
orders = list(Order.objects.filter(status="completed").values_list("total", flat=True))
average = sum(orders) / len(orders)
# Low energy — single SQL query, one row returned
result = Order.objects.filter(status="completed").aggregate(
total_revenue=Sum("total"),
avg_order=Avg("total"),
)
average = result["avg_order"]
Caching as an energy multiplier
The cheapest computation is the one you don't do. A well-placed cache entry eliminates not just CPU time but the entire energy cost of the underlying computation — database query, template rendering, or API call included.
from django.core.cache import cache
from django.views import View
from django.http import JsonResponse
class ProductListView(View):
CACHE_KEY = "product_list_v1"
CACHE_TTL = 300 # 5 minutes
def get(self, request):
data = cache.get(self.CACHE_KEY)
if data is None:
# Expensive: DB query + serialisation + template render
data = list(
Product.objects.select_related("category")
.only("id", "name", "price", "category__name")
.values("id", "name", "price", "category__name")
)
cache.set(self.CACHE_KEY, data, self.CACHE_TTL)
return JsonResponse({"products": data})
9. Integrating Energy Budgets into CI/CD
Measuring energy in development is useful. Tracking it over time in CI is what prevents gradual regressions — the death-by-a-thousand-cuts where each PR looks fine in isolation but the codebase gets 40% heavier over six months. One caveat: hosted CI runners rarely expose RAPL, so CodeCarbon falls back to estimating power from the CPU model's TDP. Treat CI numbers as a consistent relative baseline rather than ground truth.
Writing an energy regression test
from codecarbon import EmissionsTracker
ENERGY_BUDGET_KWH = 0.001 # fail if we exceed 1 Wh (3600 J) for this operation
def test_batch_export_energy_budget():
    """Batch export must not exceed the energy budget."""
    tracker = EmissionsTracker(save_to_file=False, log_level="error")
    tracker.start()
    # Run the operation under test
    export_all_records(date="2026-05-09")
    tracker.stop()  # stop first so the final meter reading is included
    energy_kwh = tracker._total_energy.kWh  # private attribute; pin your CodeCarbon version
    assert energy_kwh < ENERGY_BUDGET_KWH, (
        f"export_all_records used {energy_kwh:.6f} kWh — "
        f"exceeds budget of {ENERGY_BUDGET_KWH} kWh"
    )
Logging energy metrics to stdout for CI collection
import json
from codecarbon import EmissionsTracker
def run_with_energy_report(fn, label: str):
    tracker = EmissionsTracker(save_to_file=False, log_level="error")
    tracker.start()
    result = fn()
    emissions_kg = tracker.stop()
    energy_kwh = tracker._total_energy.kWh
    # Emit a structured log line for CI to parse
    print(json.dumps({
        "metric": "energy",
        "label": label,
        "energy_kwh": round(energy_kwh, 8),
        "co2_grams": round(emissions_kg * 1000, 4),
    }))
    return result
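On the collection side, a few lines are enough to compare those metric lines against a stored baseline and fail the build on regression. A sketch: baseline.json is a hypothetical file mapping labels to allowed kWh.
import json
import sys

baseline = json.load(open("baseline.json"))  # hypothetical: {"label": kwh, ...}
for line in sys.stdin:  # pipe the captured job log in
    try:
        metric = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip ordinary log lines
    if metric.get("metric") != "energy":
        continue
    allowed = baseline.get(metric["label"], float("inf")) * 1.10  # 10% headroom
    if metric["energy_kwh"] > allowed:
        sys.exit(f"Energy regression: {metric['label']} used {metric['energy_kwh']} kWh")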
GitHub Actions step
# .github/workflows/energy.yml
name: Energy budget check
on: [push, pull_request]
jobs:
  energy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - run: pip install codecarbon pytest
      - name: Run energy regression tests
        run: pytest tests/test_energy.py -v --tb=short
      - name: Upload emissions log
        uses: actions/upload-artifact@v4
        with:
          name: emissions
          path: emissions.csv
          if-no-files-found: ignore  # trackers created with save_to_file=False write no CSV
10. Green Code Principles: A Checklist
Energy efficiency is not a single technique — it's a habit of measurement and discipline. Here's the practical checklist I apply before marking any performance-sensitive path as done:
- Measure before optimising. Use pyRAPL, CodeCarbon, or perf stat to establish a baseline. Intuition about what's slow is usually wrong.
- Minimise database round-trips. N separate queries cost 10–100× more energy than one query that does the same aggregation. Push grouping, filtering, and summing into SQL — not into Python loops.
- Avoid materialising data you don't need. Use only(), defer(), iterator(), and generators to keep the memory footprint small and the GC quiet.
- Use vectorised libraries for numeric work. NumPy, Pandas, and Polars execute at C speed with minimal allocations. Pure Python loops are a last resort.
- Prefer asyncio over threads for I/O-bound code. Fewer OS threads means the CPU spends more time in low-power idle states.
- Cache aggressively, invalidate precisely. Every cache hit eliminates 100% of the downstream computation energy.
- Set energy budgets in CI. Treat energy regressions the same way you treat test failures — they're real bugs.
- Choose the right tool for the job. A Celery task that runs nightly doesn't need the same energy scrutiny as a per-request hot path. Measure what matters.
The overlap between energy-efficient code and well-written code is substantial. Code that avoids unnecessary allocations, defers computation, batches I/O, and reuses results tends to be faster, cheaper, and easier to reason about — as well as greener. Measuring energy is just another lens on code quality.