Profiling Python for Energy Efficiency: Beyond Performance
Most Python profilers measure time. But fast code isn't always green code — a highly parallelised function can finish in half the time while drawing twice the power. This post shows you how to measure the energy your Python applications actually consume, where the hidden hotspots are, and how to write code that's lighter on both the CPU and the planet, with practical Django examples throughout.
1. Why Energy Efficiency Matters for Python Developers
The data centre industry accounts for roughly 1–2% of global electricity consumption — a figure that grows every year as AI workloads, streaming, and cloud-native architectures scale. The software running inside those data centres determines how much of that electricity is actually necessary.
For Python developers, this is increasingly a practical concern, not just an ethical one:
- Cloud cost: energy translates directly into compute bills. A 20% reduction in CPU cycles is a 20% reduction in vCPU-hours.
- ESG reporting: enterprises are now required to report Scope 3 emissions, which includes software-related compute. Your Django API is in scope.
- Edge and IoT: battery-powered devices die faster when your code is wasteful. This is a correctness issue, not a preference.
- Regulatory pressure: the EU's Corporate Sustainability Reporting Directive (CSRD) and similar frameworks are pushing software teams to quantify their footprint.
The good news: the techniques that reduce energy consumption overlap significantly with those that improve performance. Measuring energy is the first step — and Python now has excellent tools for it.
2. The Performance-vs-Energy Gap
Time and energy are related but not equivalent. Energy = Power × Time. If you make a function run twice as fast by spinning up four CPU cores instead of one, you've reduced time but potentially increased total energy draw.
import time
import multiprocessing
def cpu_bound_task(n):
    return sum(i * i for i in range(n))
# Single process: runs in ~2s, uses ~1 core at ~15W = ~30 J
start = time.perf_counter()
result = cpu_bound_task(10_000_000)
print(f"Single process: {time.perf_counter() - start:.2f}s")
# Four processes: runs in ~0.6s, but uses ~4 cores at ~15W each = ~36 J
# Faster wall-clock time, but MORE total energy consumed
# Note: use multiprocessing (not threading) — the GIL prevents Python threads
# from running CPU-bound work in parallel
start = time.perf_counter()
with multiprocessing.Pool(4) as pool:
    pool.map(cpu_bound_task, [2_500_000] * 4)
print(f"Four processes: {time.perf_counter() - start:.2f}s")
The four-process version is faster by wall-clock time but draws more total energy. For a batch job that runs once a night this may not matter. For a web server handling 10,000 requests per minute it absolutely does — and the only way to know the real picture is to measure.
3. Measuring CPU Energy with pyRAPL
pyRAPL wraps
Intel's RAPL (Running Average Power Limit) interface — the hardware counters that CPUs expose
for measuring energy consumed by the CPU package and DRAM. It works on Linux with Intel
Sandy Bridge (2011) and later CPUs, and requires read access to
/sys/class/powercap/intel-rapl/.
pip install pyRAPL
# Grant access to RAPL on Linux (required once per boot)
sudo chmod -R a+r /sys/class/powercap/intel-rapl/
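Under the hood, these numbers come from plain sysfs files, so it is worth seeing what pyRAPL wraps. Here is a minimal sketch of a direct read, assuming a single-socket machine with the default intel-rapl:0 package domain; do_work is a hypothetical stand-in for the code you want to measure:
from pathlib import Path

DOMAIN = Path("/sys/class/powercap/intel-rapl/intel-rapl:0")

def read_energy_uj() -> int:
    # Cumulative energy counter in microjoules
    return int((DOMAIN / "energy_uj").read_text())

before = read_energy_uj()
do_work()  # hypothetical: the code under measurement
after = read_energy_uj()

delta_uj = after - before
if delta_uj < 0:  # the counter wraps at max_energy_range_uj
    delta_uj += int((DOMAIN / "max_energy_range_uj").read_text())
print(f"CPU package energy: {delta_uj / 1_000_000:.4f} J")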
Decorator usage
import pyRAPL
pyRAPL.setup()
@pyRAPL.measureit(number=5) # average over 5 runs
def process_queryset(records):
    return [r.value * 1.2 for r in records]
process_queryset(my_records)
# prints: label=process_queryset | duration=0.034s | pkg=12400µJ | dram=3200µJ
Context manager usage
import pyRAPL
pyRAPL.setup()
meter = pyRAPL.Measurement('data-pipeline')
meter.begin()
# Your code here
results = [expensive_transform(row) for row in dataset]
meter.end()
pkg_joules = meter.result.pkg[0] / 1_000_000 # µJ → J
dram_joules = meter.result.dram[0] / 1_000_000 # µJ → J
print(f"CPU package energy: {pkg_joules:.4f} J")
print(f"DRAM energy: {dram_joules:.4f} J")
print(f"Total: {pkg_joules + dram_joules:.4f} J")
Comparing two implementations
import pyRAPL
pyRAPL.setup()
data = list(range(1_000_000))
# Implementation A — naive string building
def build_string_naive(items):
    result = ""
    for item in items:
        result += str(item) + ","
    return result
# Implementation B — join
def build_string_join(items):
    return ",".join(str(item) for item in items)
for label, fn in [("naive", build_string_naive), ("join", build_string_join)]:
    meter = pyRAPL.Measurement(label)
    meter.begin()
    fn(data)
    meter.end()
    joules = meter.result.pkg[0] / 1_000_000
    print(f"{label:10s} → {joules:.4f} J")
# naive → 0.8412 J
# join → 0.1937 J (4.3× less energy)
AMD and Apple Silicon: RAPL is Intel-specific. Recent AMD CPUs expose similar counters on Linux 5.8+ through perf's RAPL events (and the powercap interface on newer kernels). Apple Silicon exposes power data through powermetrics on macOS. Neither is supported by pyRAPL directly — use CodeCarbon (below) for cross-platform coverage.
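To illustrate the macOS route, powermetrics can be sampled from a subprocess. This is a sketch, not a supported API: it assumes the cpu_power sampler's text output contains a "CPU Power: <n> mW" line, which varies by chip generation.
import re
import subprocess

# One 1-second sample from the CPU power sampler (needs sudo)
out = subprocess.run(
    ["sudo", "powermetrics", "--samplers", "cpu_power", "-i", "1000", "-n", "1"],
    capture_output=True, text=True,
).stdout

match = re.search(r"CPU Power:\s*(\d+)\s*mW", out)
if match:
    print(f"CPU draw: {int(match.group(1)) / 1000:.2f} W")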
4. Tracking CO₂ Emissions with CodeCarbon
CodeCarbon goes a step further than raw joules: it estimates the CO₂ equivalent emissions of your code, using the energy intensity of the local electricity grid (pulled from real-time data where available). It works cross-platform and cross-architecture.
pip install codecarbon
Context manager
from codecarbon import EmissionsTracker
with EmissionsTracker(project_name="my-django-api") as tracker:
    train_model(X_train, y_train)
# emissions.csv is written automatically with:
# timestamp, duration, energy (kWh), emissions (kgCO2eq), country, region, cloud_provider
Decorator
from codecarbon import track_emissions
@track_emissions(project_name="recommendation-engine", output_dir="/var/log/carbon")
def generate_recommendations(user_id: int) -> list:
    # ... heavy ML inference ...
    return results
Programmatic access to results
from codecarbon import EmissionsTracker
tracker = EmissionsTracker(save_to_file=False) # in-memory only
tracker.start()
run_batch_job()
emissions_data = tracker.stop() # returns kg CO2eq as a float
energy_kwh = tracker._total_energy.kWh  # note: private attribute, may change between versions
print(f"Energy consumed: {energy_kwh:.6f} kWh")
print(f"CO2 equivalent: {emissions_data * 1000:.4f} gCO2eq")
# Equivalent to driving roughly:
km_equivalent = emissions_data / 0.21  # kg CO2 ÷ 0.21 kgCO2/km (EU average car)
print(f"Equivalent to driving {km_equivalent:.2f} km")
Offline mode (air-gapped servers)
from codecarbon import OfflineEmissionsTracker
# No network access, so pin the grid region yourself instead of relying on
# real-time lookups. AWS eu-west-2 (London) runs at roughly ~230 gCO2/kWh.
tracker = OfflineEmissionsTracker(
    country_iso_code="GBR",  # ISO 3166-1 alpha-3 code for the grid to assume
    cloud_provider="aws",
    cloud_region="eu-west-2",
    save_to_file=True,
    output_dir="/var/log/carbon",
)
with tracker:
    run_etl_pipeline()
5. Linux perf Energy Counters
The Linux perf tool exposes RAPL events directly without any Python dependency.
It's the lowest-overhead option and works for profiling whole processes, including
multi-process Django deployments under Gunicorn.
# Check which energy events are available on your CPU
perf list | grep energy
# Profile a script — CPU package + DRAM + integrated GPU
perf stat \
-e power/energy-pkg/ \
-e power/energy-ram/ \
-e power/energy-gpu/ \
python manage.py process_batch --date 2026-05-09
# Output (example):
# Performance counter stats for 'python manage.py process_batch':
# 14.32 Joules power/energy-pkg/
# 3.81 Joules power/energy-ram/
# 0.00 Joules power/energy-gpu/
#
# 4.218 seconds time elapsed
Measuring a running Gunicorn deployment
# RAPL counters are package-wide: perf cannot attach power events to a single
# PID, so measure the whole machine while the workers handle traffic
sudo perf stat -a -e power/energy-pkg/,power/energy-ram/ sleep 30
# Divide joules by the window length to get average watts (J / s = W):
# 12.4 J / 30 s ≈ 0.41 W
# Subtract an idle-window baseline, then divide by the worker count
# (pgrep -fc "gunicorn.*myapp") to approximate per-worker draw
Parsing perf output in Python
import subprocess
import re
def measure_energy(command: list[str]) -> dict[str, float]:
    result = subprocess.run(
        ["perf", "stat", "-e", "power/energy-pkg/,power/energy-ram/",
         "--field-separator", ";", *command],
        capture_output=True,
        text=True,  # perf stat writes its counter output to stderr
    )
    energy = {}
    for line in result.stderr.splitlines():
        if "energy-pkg" in line:
            energy["pkg_joules"] = float(line.split(";")[0].strip())
        elif "energy-ram" in line:
            energy["ram_joules"] = float(line.split(";")[0].strip())
    return energy
stats = measure_energy(["python", "myscript.py"])
print(f"CPU: {stats['pkg_joules']:.2f} J | RAM: {stats['ram_joules']:.2f} J")
6. GPU Energy with pynvml
If your Django application runs ML inference or image processing on an NVIDIA GPU,
pynvml (the Python binding for NVML) gives you real-time power draw per device.
pip install nvidia-ml-py
import threading
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def measure_gpu_energy(fn, *args, sample_interval=0.1, **kwargs):
    """Run fn(*args, **kwargs) and return (result, joules)."""
    samples = []
    stop = threading.Event()

    def sampler():
        # Poll instantaneous power draw until the main thread signals stop
        while not stop.is_set():
            mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # milliwatts
            samples.append(mw / 1000)  # → watts
            time.sleep(sample_interval)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    result = fn(*args, **kwargs)
    stop.set()
    t.join()
    # Rectangular integration: energy ≈ sum(power_samples) × interval
    joules = sum(samples) * sample_interval
    return result, joules
result, gpu_joules = measure_gpu_energy(run_inference, model, input_tensor)
print(f"GPU energy: {gpu_joules:.3f} J")
pynvml.nvmlShutdown()
Tip: Enable GPU power persistence mode to avoid cold-start spikes skewing
short measurements: sudo nvidia-smi -pm 1.
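For a quick spot check without writing any Python, nvidia-smi exposes the same NVML power readings from the shell (field support depends on the driver and card):
nvidia-smi --query-gpu=power.draw,power.limit --format=csv
# power.draw [W], power.limit [W]
# 68.42 W, 250.00 W   (example output)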
7. Common Python Energy Hotspots
Armed with measurement tools, here are the patterns that consistently show up as energy offenders across production Python codebases.
String concatenation in loops
Each += allocates a new string object and copies the existing contents into it, so the loop does quadratic work in the total output size. Most of that cost is pure allocation and copying that a single join or a write buffer avoids.
# High energy — new allocation on every iteration
def build_csv_naive(rows: list[dict]) -> str:
    output = ""
    for row in rows:
        output += f"{row['id']},{row['name']},{row['value']}\n"
    return output

# Low energy — single allocation at the end
def build_csv_join(rows: list[dict]) -> str:
    return "\n".join(f"{row['id']},{row['name']},{row['value']}" for row in rows)

# Lowest energy for large outputs — streaming to a buffer
import io

def build_csv_buffer(rows: list[dict]) -> str:
    buf = io.StringIO()
    for row in rows:
        buf.write(f"{row['id']},{row['name']},{row['value']}\n")
    return buf.getvalue()
NumPy vectorisation vs Python loops
Python loops run one bytecode instruction per element. NumPy operations run compiled C on whole arrays. The energy difference scales with dataset size.
import numpy as np
prices = list(range(1_000_000))
# Pure Python — each element processed one at a time in the interpreter
def apply_tax_python(prices: list[float], rate: float) -> list[float]:
    return [p * (1 + rate) for p in prices]

# NumPy — C-level loop over contiguous memory
def apply_tax_numpy(prices: np.ndarray, rate: float) -> np.ndarray:
    return prices * (1 + rate)
prices_np = np.array(prices, dtype=np.float64)
# Typical result on 1M elements:
# Python loop: ~0.19 J ~180ms
# NumPy: ~0.009 J ~8ms (21× less energy, 22× faster)
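To reproduce numbers like these on your own hardware, the pyRAPL comparison loop from section 3 drops straight in:
import pyRAPL

pyRAPL.setup()

for label, fn, arg in [
    ("python", apply_tax_python, prices),
    ("numpy", apply_tax_numpy, prices_np),
]:
    meter = pyRAPL.Measurement(label)
    meter.begin()
    fn(arg, 0.2)  # 20% tax rate, arbitrary
    meter.end()
    print(f"{label:8s} → {meter.result.pkg[0] / 1_000_000:.4f} J")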
Idle CPU in threads vs asyncio
Threads blocked on I/O still cost the kernel work: every completion is a context switch, and a large pool of waiting threads generates frequent wake-ups that keep interrupting the CPU's descent into low-power C-states. asyncio multiplexes the same waits onto a single thread, so the CPU idles more deeply between I/O completions.
import asyncio
import httpx
# Threading — each thread holds a kernel thread context, prevents deep sleep
def fetch_all_sync(urls: list[str]) -> list[str]:
    import concurrent.futures
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        return list(pool.map(lambda u: httpx.get(u).text, urls))
# asyncio — single thread, CPU genuinely idles between I/O completions
async def fetch_all_async(urls: list[str]) -> list[str]:
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        return [r.text for r in responses]
# asyncio version typically uses 40-60% less CPU energy for I/O-bound workloads
Memory allocation pressure
Frequent small allocations trigger the garbage collector, which is itself CPU-intensive. Reusing objects rather than recreating them reduces both memory and energy.
from dataclasses import dataclass
# High GC pressure — new dict created for every record
def process_records_dict(records):
    results = []
    for r in records:
        results.append({
            "id": r.id,
            "value": r.value * 1.2,
            "label": r.label.upper(),
        })
    return results
# Lower pressure — __slots__ dataclass avoids per-instance __dict__
@dataclass(slots=True)  # slots=True requires Python 3.10+
class ProcessedRecord:
    id: int
    value: float
    label: str

def process_records_slots(records):
    return [ProcessedRecord(r.id, r.value * 1.2, r.label.upper()) for r in records]
Generator vs list materialisation
Materialising a large list to iterate over it once wastes both memory and the energy needed to allocate and then garbage-collect it.
# Unnecessary materialisation — builds full list in RAM before iterating
total = sum([x * x for x in range(10_000_000)])
# Generator — produces one value at a time, negligible memory overhead
total = sum(x * x for x in range(10_000_000))
# Same result, but the list version allocates ~80 MB it immediately discards
8. Django-Specific Patterns
N+1 queries: the silent energy killer
Each database round-trip involves network I/O, OS scheduler work, and query parsing — all of which burn CPU cycles. An N+1 query that fires 500 SQL statements does 500× the work of a single joined query.
# N+1 — one query per article to fetch the author (500 queries for 500 articles)
articles = Article.objects.all()[:500]
for article in articles:
    print(article.author.name)  # hits the DB on every iteration

# One query — JOIN fetches authors in a single round-trip
articles = Article.objects.select_related("author").all()[:500]
for article in articles:
    print(article.author.name)  # author already loaded by the JOIN
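For many-to-many or reverse foreign-key relations, where a single JOIN would duplicate rows, prefetch_related gets the same effect with one extra batched query. A sketch assuming Article has a tags relation:
# Two queries total: one for articles, one IN-query for all their tags
articles = Article.objects.prefetch_related("tags")[:500]
for article in articles:
    names = [tag.name for tag in article.tags.all()]  # no extra queries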
defer() and only() — fetch what you need
Transferring unused columns from the database wastes network bandwidth, deserialisation
CPU, and ORM memory. Use only() to restrict the fetch to fields you actually use.
# Fetches all columns including large body text, thumbnails, etc.
articles = Article.objects.all()
# Only the two fields needed for a listing page
articles = Article.objects.only("title", "published_at")
# Exclude one heavy field while keeping everything else
articles = Article.objects.defer("body")
iterator() for large querysets
Django's default queryset evaluation loads the entire result set into memory. On large tables
this allocates hundreds of megabytes that the GC then has to collect.
iterator(chunk_size=…) streams results in batches.
# Loads all 500k rows into RAM at once
for record in DataPoint.objects.filter(processed=False):
    process(record)

# Streams in chunks of 2000 — constant memory, lower GC pressure
for record in DataPoint.objects.filter(processed=False).iterator(chunk_size=2000):
    process(record)
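The write side deserves the same care: saving each record individually inside that loop would reintroduce one round-trip per row. A sketch batching the writes with bulk_update, assuming process mutates the instance and DataPoint has a processed flag:
batch = []
for record in DataPoint.objects.filter(processed=False).iterator(chunk_size=2000):
    process(record)
    record.processed = True
    batch.append(record)
    if len(batch) >= 2000:
        DataPoint.objects.bulk_update(batch, ["processed"])
        batch.clear()
if batch:  # flush the final partial batch
    DataPoint.objects.bulk_update(batch, ["processed"])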
Database-side aggregation
Pulling rows into Python to sum or average them is always more expensive than asking the database to do it. The database does the work in optimised C, sends a single row back, and your application never allocates the intermediate data.
from django.db.models import Avg, Sum
# High energy — transfers all rows to Python, then computes in the interpreter
orders = list(Order.objects.filter(status="completed").values_list("total", flat=True))
average = sum(orders) / len(orders)
# Low energy — single SQL query, one row returned
result = Order.objects.filter(status="completed").aggregate(
total_revenue=Sum("total"),
avg_order=Avg("total"),
)
average = result["avg_order"]
Caching as an energy multiplier
The cheapest computation is the one you don't do. A well-placed cache entry eliminates not just CPU time but the entire energy cost of the underlying computation — database query, template rendering, or API call included.
from django.core.cache import cache
from django.views import View
from django.http import JsonResponse
class ProductListView(View):
CACHE_KEY = "product_list_v1"
CACHE_TTL = 300 # 5 minutes
def get(self, request):
data = cache.get(self.CACHE_KEY)
if data is None:
# Expensive: DB query + serialisation + template render
data = list(
Product.objects.select_related("category")
.only("id", "name", "price", "category__name")
.values("id", "name", "price", "category__name")
)
cache.set(self.CACHE_KEY, data, self.CACHE_TTL)
return JsonResponse({"products": data})
9. Integrating Energy Budgets into CI/CD
Measuring energy in development is useful. Tracking it over time in CI is what prevents gradual regressions — the death-by-a-thousand-cuts where each PR looks fine in isolation but the codebase gets 40% heavier over six months. One caveat: hosted CI runners rarely expose RAPL, so CodeCarbon falls back to estimating power from the CPU model's TDP. Treat CI numbers as a consistent relative baseline rather than ground truth.
Writing an energy regression test
from codecarbon import EmissionsTracker
ENERGY_BUDGET_KWH = 0.001 # fail if we exceed 1 Wh (3600 J) for this operation
def test_batch_export_energy_budget():
    """Batch export must not exceed the energy budget."""
    tracker = EmissionsTracker(save_to_file=False, log_level="error")
    tracker.start()
    # Run the operation under test
    export_all_records(date="2026-05-09")
    tracker.stop()  # stop first so the final meter reading is included
    energy_kwh = tracker._total_energy.kWh  # private attribute; pin your CodeCarbon version
    assert energy_kwh < ENERGY_BUDGET_KWH, (
        f"export_all_records used {energy_kwh:.6f} kWh — "
        f"exceeds budget of {ENERGY_BUDGET_KWH} kWh"
    )
Logging energy metrics to stdout for CI collection
import json
from codecarbon import EmissionsTracker
def run_with_energy_report(fn, label: str):
    tracker = EmissionsTracker(save_to_file=False, log_level="error")
    tracker.start()
    result = fn()
    emissions_kg = tracker.stop()
    energy_kwh = tracker._total_energy.kWh
    # Emit a structured log line for CI to parse
    print(json.dumps({
        "metric": "energy",
        "label": label,
        "energy_kwh": round(energy_kwh, 8),
        "co2_grams": round(emissions_kg * 1000, 4),
    }))
    return result
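On the collection side, a few lines are enough to compare those metric lines against a stored baseline and fail the build on regression. A sketch: baseline.json is a hypothetical file mapping labels to allowed kWh.
import json
import sys

baseline = json.load(open("baseline.json"))  # hypothetical: {"label": kwh, ...}
for line in sys.stdin:  # pipe the captured job log in
    try:
        metric = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip ordinary log lines
    if metric.get("metric") != "energy":
        continue
    allowed = baseline.get(metric["label"], float("inf")) * 1.10  # 10% headroom
    if metric["energy_kwh"] > allowed:
        sys.exit(f"Energy regression: {metric['label']} used {metric['energy_kwh']} kWh")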
GitHub Actions step
# .github/workflows/energy.yml
name: Energy budget check
on: [push, pull_request]
jobs:
  energy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - run: pip install codecarbon pytest
      - name: Run energy regression tests
        run: pytest tests/test_energy.py -v --tb=short
      - name: Upload emissions log
        uses: actions/upload-artifact@v4
        with:
          name: emissions
          path: emissions.csv
          if-no-files-found: ignore  # trackers created with save_to_file=False write no CSV
10. Green Code Principles: A Checklist
Energy efficiency is not a single technique — it's a habit of measurement and discipline. Here's the practical checklist I apply before marking any performance-sensitive path as done:
- Measure before optimising. Use pyRAPL, CodeCarbon, or perf stat to establish a baseline. Intuition about what's slow is usually wrong.
- Minimise database round-trips. N separate queries cost 10–100× more energy than one query that does the same aggregation. Push grouping, filtering, and summing into SQL — not into Python loops.
- Avoid materialising data you don't need. Use only(), defer(), iterator(), and generators to keep the memory footprint small and the GC quiet.
- Use vectorised libraries for numeric work. NumPy, Pandas, and Polars execute at C speed with minimal allocations. Pure Python loops are a last resort.
- Prefer asyncio over threads for I/O-bound code. Fewer OS threads means the CPU spends more time in low-power idle states.
- Cache aggressively, invalidate precisely. Every cache hit eliminates 100% of the downstream computation energy.
- Set energy budgets in CI. Treat energy regressions the same way you treat test failures — they're real bugs.
- Choose the right tool for the job. A Celery task that runs nightly doesn't need the same energy scrutiny as a per-request hot path. Measure what matters.
The overlap between energy-efficient code and well-written code is substantial. Code that avoids unnecessary allocations, defers computation, batches I/O, and reuses results tends to be faster, cheaper, and easier to reason about — as well as greener. Measuring energy is just another lens on code quality.