Profiling & Benchmarking


📖 Concept

Profiling and benchmarking are the foundation of performance optimization. Profiling identifies where your code spends time and memory, while benchmarking measures how fast specific operations run under controlled conditions. The golden rule: never optimize without measuring first.

Key profiling tools:

Tool              Type                    Granularity                      Overhead    Best For
timeit            Benchmarking            Statement/expression             Minimal     Micro-benchmarks
cProfile          Deterministic profiler  Function-level                   Low-medium  Finding slow functions
line_profiler     Line profiler           Line-level                       High        Pinpointing bottleneck lines
memory_profiler   Memory profiler         Line-level                       High        Finding memory-hungry code
scalene           Hybrid profiler         Line-level (CPU + memory + GPU)  Low         All-in-one production profiling
py-spy            Sampling profiler       Function-level                   Near-zero   Profiling running processes

timeit is the standard-library module for micro-benchmarks. It runs a statement many times, disables garbage collection during timing, and (on the command line) reports the best of several repeats. Prefer timeit over manual time.time() calls: it handles repetition and GC interference automatically, and the repeated runs double as warm-up.
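For example, timeit.repeat returns one total per repetition, and the minimum is usually the most stable statistic (a minimal sketch; absolute times will vary by machine):

```python
import timeit

# Time a set-membership test; repeat() returns one total time per run.
# Taking the min filters out noise from OS scheduling and caches.
times = timeit.repeat(
    stmt="9999 in data",
    setup="data = set(range(10_000))",
    repeat=5,
    number=100_000,
)
print(f"best of 5: {min(times):.4f}s for 100,000 lookups")
```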

cProfile instruments every function call and records call count, total time, and cumulative time. Run it from the command line with python -m cProfile -s cumulative script.py or programmatically. The output reveals your program's hot path — the chain of functions consuming the most time.
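Programmatically, the same report can be produced with a Profile object plus pstats (a minimal sketch; the sum-of-squares loop is a stand-in workload):

```python
import cProfile
import io
import pstats

profiler = cProfile.Profile()
profiler.enable()
total = sum(i * i for i in range(100_000))  # stand-in workload
profiler.disable()

# Render the top 3 entries of the hot path, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(3)
print(stream.getvalue())
```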

scalene is the modern choice for comprehensive profiling. It profiles CPU time (separating Python vs. native code), memory allocation, memory copy overhead, and even GPU usage — all with low overhead and per-line granularity. Unlike cProfile, it uses sampling rather than instrumentation, so it does not slow your program significantly.
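Typical invocations look like the following (a sketch based on scalene's documented CLI; available flags vary by version):

```shell
pip install scalene
scalene my_script.py             # per-line CPU + memory profile
python -m scalene my_script.py   # equivalent module form
```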

Benchmarking strategies:

  • Isolate the code under test — remove I/O and database calls
  • Use representative data — small test inputs may hide algorithmic complexity
  • Run multiple iterations — single runs are unreliable due to OS scheduling, caching, and GC
  • Compare relative performance — absolute numbers vary across machines; ratios are portable
  • Profile in production-like conditions — development machines often differ from deployment environments
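The "representative data" point is easy to demonstrate: a list and a set look equally fast at small sizes, but the O(n) vs. O(1) membership gap only appears as n grows (a minimal sketch):

```python
import timeit

for n in (10, 100_000):
    # Worst-case lookup: the last element, repeated 100 times
    list_t = timeit.timeit(f"{n - 1} in data", setup=f"data = list(range({n}))", number=100)
    set_t = timeit.timeit(f"{n - 1} in data", setup=f"data = set(range({n}))", number=100)
    print(f"n={n:>7}: list {list_t:.6f}s  set {set_t:.6f}s  ratio {list_t / max(set_t, 1e-12):.0f}x")
```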

💻 Code Example

# ============================================================
# 1. timeit — Micro-benchmarking
# ============================================================
import timeit


# --- Compare string concatenation strategies ---
def benchmark_string_methods():
    """Benchmark different approaches to building strings."""

    # Method 1: String concatenation with +
    concat_time = timeit.timeit(
        stmt="""
result = ""
for i in range(1000):
    result += str(i)
""",
        number=1000,
    )

    # Method 2: str.join()
    join_time = timeit.timeit(
        stmt='result = "".join(str(i) for i in range(1000))',
        number=1000,
    )

    # Method 3: io.StringIO (import in setup so it is not timed)
    stringio_time = timeit.timeit(
        stmt="""
buf = io.StringIO()
for i in range(1000):
    buf.write(str(i))
result = buf.getvalue()
""",
        setup="import io",
        number=1000,
    )

    print("String building benchmarks (1000 iterations):")
    print(f"  += concatenation : {concat_time:.4f}s")
    print(f"  str.join()       : {join_time:.4f}s")
    print(f"  io.StringIO      : {stringio_time:.4f}s")


# --- timeit from the command line ---
# python -m timeit -s "data = list(range(10000))" "sorted(data)"
# python -m timeit -s "data = list(range(10000))" "data.sort()"


# ============================================================
# 2. cProfile — Function-level profiling
# ============================================================
import cProfile
import pstats
import io


def fibonacci_naive(n):
    """Deliberately unoptimized recursive Fibonacci."""
    if n <= 1:
        return n
    return fibonacci_naive(n - 1) + fibonacci_naive(n - 2)


def fibonacci_memo(n, cache={}):
    """Memoized Fibonacci (the mutable default is used deliberately as a cache)."""
    if n in cache:
        return cache[n]
    if n <= 1:
        return n
    cache[n] = fibonacci_memo(n - 1, cache) + fibonacci_memo(n - 2, cache)
    return cache[n]


def profile_and_compare():
    """Profile both Fibonacci implementations and compare."""
    for label, func, arg in [
        ("Naive (n=30)", fibonacci_naive, 30),
        ("Memoized (n=100)", fibonacci_memo, 100),
    ]:
        profiler = cProfile.Profile()
        profiler.enable()
        result = func(arg)
        profiler.disable()

        stream = io.StringIO()
        stats = pstats.Stats(profiler, stream=stream)
        stats.sort_stats("cumulative")
        stats.print_stats(5)

        print(f"\n--- {label} (result={result}) ---")
        print(stream.getvalue())


# ============================================================
# 3. Context manager profiler for targeted use
# ============================================================
from contextlib import contextmanager
import time


@contextmanager
def profile_block(label="block", top_n=5):
    """Profile a code block and print top functions by cumulative time."""
    profiler = cProfile.Profile()
    wall_start = time.perf_counter()
    profiler.enable()
    try:
        yield profiler
    finally:
        profiler.disable()
        wall_elapsed = time.perf_counter() - wall_start
        stream = io.StringIO()
        stats = pstats.Stats(profiler, stream=stream)
        stats.sort_stats("cumulative")
        stats.print_stats(top_n)
        print(f"\n[{label}] Wall time: {wall_elapsed:.4f}s")
        print(stream.getvalue())


# ============================================================
# 4. Benchmarking with statistics (robust measurement)
# ============================================================
import statistics


def robust_benchmark(func, *args, iterations=100, warmup=10, **kwargs):
    """Run a function many times and report a statistical summary."""
    # Warm-up phase: let OS and CPU caches settle before measuring
    for _ in range(warmup):
        func(*args, **kwargs)

    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)

    ordered = sorted(timings)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0,
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[min(int(len(ordered) * 0.95), len(ordered) - 1)],
        "iterations": iterations,
    }


def print_benchmark(label, results):
    """Pretty-print benchmark results."""
    print(f"\n{label}:")
    print(f"  Mean   : {results['mean'] * 1000:.3f} ms")
    print(f"  Median : {results['median'] * 1000:.3f} ms")
    print(f"  Stdev  : {results['stdev'] * 1000:.3f} ms")
    print(f"  Min    : {results['min'] * 1000:.3f} ms")
    print(f"  Max    : {results['max'] * 1000:.3f} ms")
    print(f"  P95    : {results['p95'] * 1000:.3f} ms")


# ============================================================
# 5. Usage — putting it all together
# ============================================================
if __name__ == "__main__":
    import random

    benchmark_string_methods()
    profile_and_compare()

    # Targeted profiling of a code block (import done above, so it is not profiled)
    with profile_block("sorting benchmark"):
        data = [random.randint(0, 1_000_000) for _ in range(100_000)]
        sorted(data)

    # Robust benchmark comparison
    data = list(range(50_000))
    results_sorted = robust_benchmark(sorted, data)
    results_reversed = robust_benchmark(sorted, data, reverse=True)

    print_benchmark("sorted(ascending)", results_sorted)
    print_benchmark("sorted(descending)", results_reversed)

🏋️ Practice Exercise

Exercises:

  1. Use timeit to compare three ways of checking membership: in on a list, in on a set, and in on a dict. Test with 10, 1,000, and 100,000 elements. Plot or tabulate the results and explain the Big-O behavior you observe.

  2. Profile a real function (e.g., reading a CSV and computing aggregates) with cProfile. Generate a sorted stats report, identify the top 3 hotspot functions, and propose optimizations for each.

  3. Build the robust_benchmark function shown above. Add percentile calculations (p50, p90, p99) and a coefficient of variation metric. Use it to compare list comprehension vs. map() vs. a generator expression for transforming 1 million integers.

  4. Install scalene (pip install scalene) and profile a script that performs both CPU-bound computation and memory allocation. Compare scalene's output to cProfile — what additional insights does scalene provide (Python vs. C time, memory allocation, copy overhead)?

  5. Write a decorator @profile_calls that logs every call to a function with its arguments, return value, and execution time. Include a class-level summary that tracks total calls and cumulative time. Use it to instrument a recursive algorithm and analyze the call pattern.

  6. Create a benchmarking suite that compares dict vs. OrderedDict vs. defaultdict for 10,000 insertions and 10,000 lookups. Present results in a formatted table with mean, median, and standard deviation columns.

⚠️ Common Mistakes

  • Using time.time() instead of time.perf_counter() for benchmarking. time.time() measures wall-clock time with lower resolution and is affected by system clock adjustments (NTP sync). perf_counter() uses the highest-resolution monotonic clock available and is the correct choice for measuring elapsed time in benchmarks.
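You can verify these clock properties yourself with time.get_clock_info:

```python
import time

# Inspect the clocks backing time.time() and time.perf_counter()
for name in ("time", "perf_counter"):
    info = time.get_clock_info(name)
    print(f"{name:>12}: monotonic={info.monotonic} resolution={info.resolution}")
# perf_counter is monotonic, so it never jumps backwards mid-measurement
```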

  • Running a single iteration and drawing conclusions. Execution time varies due to OS scheduling, CPU frequency scaling, garbage collection, and cache effects. Always run multiple iterations and report statistical summaries (mean, median, standard deviation, percentiles).

  • Profiling with cProfile and assuming the overhead does not affect results. cProfile instruments every function call, which can add 10-30% overhead and distort the relative cost of small functions. For micro-benchmarks, use timeit; for production profiling with minimal overhead, use py-spy or scalene.

  • Benchmarking with unrealistic data sizes. An algorithm that is fast on 100 elements may be catastrophically slow on 1 million elements due to algorithmic complexity (O(n^2) vs O(n log n)). Always benchmark with production-representative data volumes.

  • Forgetting to disable garbage collection during micro-benchmarks. timeit does this automatically, but manual timing loops do not. A GC pause during a timing run can produce misleading outliers. Use gc.disable() around critical measurement sections, or use timeit which handles this.
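A minimal pattern for manual timing loops (timeit already does the equivalent internally):

```python
import gc
import time

gc.disable()  # prevent a collection pause from landing inside the timed region
try:
    start = time.perf_counter()
    data = [str(i) for i in range(100_000)]  # code under test
    elapsed = time.perf_counter() - start
finally:
    gc.enable()  # always restore GC, even if the timed code raises

print(f"elapsed: {elapsed * 1000:.2f} ms")
```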
