Optimization Techniques
📖 Concept
Once profiling has identified bottlenecks, targeted optimization techniques can deliver dramatic speedups. The key principle is to optimize at the right level: algorithm first, data structure second, implementation third, and low-level tricks last.
Algorithm & data structure optimization provides the largest gains. Replacing an O(n^2) nested loop with an O(n) hash-based lookup can turn a 10-minute job into a 1-second job. Common patterns:
- Use sets for membership testing — `x in my_set` is O(1) vs. O(n) for lists
- Use dicts for grouping and counting — `collections.Counter`, `defaultdict`
- Use heapq for top-k problems — O(n log k) instead of O(n log n) full sort
- Use bisect for sorted sequence operations — O(log n) binary search
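As a quick sketch of the last point, `bisect` finds an insertion index in a sorted sequence in O(log n). The grade-bucketing example below (boundaries and letters are illustrative, adapted from the classic pattern in the `bisect` docs) shows the index doubling as a bucket lookup:

```python
import bisect

boundaries = [60, 70, 80, 90]   # must already be sorted
grades = "FDCBA"

def grade(score):
    # bisect_right returns the insertion index in O(log n);
    # that index is exactly the grade bucket for this score
    return grades[bisect.bisect_right(boundaries, score)]

print([grade(s) for s in [55, 60, 77, 90, 100]])  # → ['F', 'D', 'C', 'A', 'A']
```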
Python-specific optimizations:
- List comprehensions and generator expressions are faster than equivalent for-loops because the iteration happens in C
- Built-in functions (`sum`, `map`, `filter`, `min`, `max`, `sorted`) run in C and avoid Python loop overhead
- Local variable access is faster than global or attribute lookups — assign frequently accessed attributes to local variables in tight loops
- String interning and `__slots__` reduce memory overhead for objects created in large quantities
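To illustrate the `__slots__` point, a minimal sketch (the class names are made up for the example; exact byte sizes vary by Python version):

```python
import sys

class PointDict:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointSlots:
    __slots__ = ("x", "y")      # fixed attribute storage, no per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

a, b = PointDict(1, 2), PointSlots(1, 2)
print(hasattr(a, "__dict__"), hasattr(b, "__dict__"))  # → True False
# The regular instance pays for an extra dict per object:
print(sys.getsizeof(a.__dict__))
```

With millions of instances, eliminating that per-object dict is a substantial memory saving.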
NumPy vectorization replaces Python loops with array operations executed in compiled C/Fortran. Operations that process millions of elements in a Python loop can run 10-100x faster as vectorized NumPy operations. The key is to express computation as whole-array operations rather than element-by-element iteration.
Cython compiles Python-like code to C, enabling C-level performance for computation-heavy functions while retaining Python interoperability. Adding type annotations to Cython code allows the compiler to generate optimized C code that bypasses the Python object model entirely.
C extensions via ctypes or cffi let you call existing C libraries directly from Python, and pybind11 makes it straightforward to write C++ extensions with a Pythonic API. These approaches are appropriate when critical inner loops need maximum performance and the overhead of Python's interpreter is the bottleneck.
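A minimal `ctypes` sketch, assuming a Unix-like system where the C math library can be located (on other platforms the lookup differs):

```python
import ctypes
import ctypes.util

# Locate libm; fall back to the current process's symbols (works on Linux,
# where the interpreter is already linked against libm)
libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(libm_path) if libm_path else ctypes.CDLL(None)

# Declare the C signature so ctypes marshals doubles correctly
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # calls C sqrt directly, no Python-level math module
```

For a single call like this the FFI overhead outweighs any gain; the technique pays off when an existing C library does substantial work per call.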
Concurrency is another optimization dimension: multiprocessing for CPU-bound parallelism (bypasses the GIL), threading / asyncio for I/O-bound concurrency, and concurrent.futures for a high-level interface to both.
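A small sketch of the I/O-bound case: three simulated network calls run concurrently under `asyncio`, so total wall time tracks the slowest call rather than the sum (here `asyncio.sleep` stands in for real I/O):

```python
import asyncio
import time

async def fetch(name, delay):
    # asyncio.sleep simulates a network call; real code would await an HTTP client
    await asyncio.sleep(delay)
    return name

async def main():
    start = time.perf_counter()
    # gather() interleaves all three coroutines on a single thread
    results = await asyncio.gather(
        fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1)
    )
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")  # roughly 0.1s total, not 0.3s
    return results, elapsed

results, elapsed = asyncio.run(main())
```

The same three calls in a plain loop would take the sum of the delays; no extra threads or processes are involved.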
💻 Code Example
```python
# ============================================================
# 1. Data structure choices — dramatic impact on performance
# ============================================================
import time
from collections import defaultdict, Counter
import heapq
import bisect


def demo_set_vs_list_lookup():
    """Demonstrate O(1) set lookup vs O(n) list lookup."""
    data_list = list(range(1_000_000))
    data_set = set(data_list)
    targets = [500_000, 999_999, 1_000_001]  # exists, exists, missing

    # List lookup — O(n) per lookup
    start = time.perf_counter()
    for target in targets:
        _ = target in data_list
    list_time = time.perf_counter() - start

    # Set lookup — O(1) per lookup
    start = time.perf_counter()
    for target in targets:
        _ = target in data_set
    set_time = time.perf_counter() - start

    print(f"List lookup: {list_time*1000:.3f} ms")
    print(f"Set lookup : {set_time*1000:.3f} ms")
    print(f"Speedup    : {list_time/set_time:.0f}x")


def demo_heapq_top_k():
    """Find top-k elements without full sort."""
    import random
    data = [random.random() for _ in range(1_000_000)]

    # Full sort approach — O(n log n)
    start = time.perf_counter()
    top_10_sort = sorted(data, reverse=True)[:10]
    sort_time = time.perf_counter() - start

    # heapq approach — O(n log k) where k=10
    start = time.perf_counter()
    top_10_heap = heapq.nlargest(10, data)
    heap_time = time.perf_counter() - start

    print(f"\nTop-10 via sorted()      : {sort_time*1000:.3f} ms")
    print(f"Top-10 via heapq.nlargest: {heap_time*1000:.3f} ms")


# ============================================================
# 2. Loop optimization techniques
# ============================================================
def loop_optimization_demo():
    """Compare loop styles for performance."""
    data = list(range(1_000_000))

    # Slow: Python for-loop with append
    start = time.perf_counter()
    result1 = []
    for x in data:
        result1.append(x * x)
    loop_time = time.perf_counter() - start

    # Faster: list comprehension (C-level iteration)
    start = time.perf_counter()
    result2 = [x * x for x in data]
    comp_time = time.perf_counter() - start

    # Faster: map() with lambda (C-level iteration)
    start = time.perf_counter()
    result3 = list(map(lambda x: x * x, data))
    map_time = time.perf_counter() - start

    print(f"\nFor-loop + append  : {loop_time*1000:.3f} ms")
    print(f"List comprehension : {comp_time*1000:.3f} ms")
    print(f"map() + lambda     : {map_time*1000:.3f} ms")


# ============================================================
# 3. Local variable optimization in tight loops
# ============================================================
import math


def compute_distances_slow(points):
    """Slow: global/attribute lookups on every iteration."""
    results = []
    for x, y in points:
        dist = math.sqrt(x * x + y * y)
        results.append(dist)
    return results


def compute_distances_fast(points):
    """Fast: cache lookups as local variables."""
    sqrt = math.sqrt  # Local reference avoids repeated attribute lookup
    return [sqrt(x * x + y * y) for x, y in points]


# ============================================================
# 4. NumPy vectorization — replacing Python loops with C
# ============================================================
def numpy_vectorization_demo():
    """Demonstrate the speedup of NumPy over pure Python."""
    try:
        import numpy as np
    except ImportError:
        print("NumPy not installed — skipping vectorization demo")
        return

    size = 1_000_000

    # Pure Python: element-by-element
    py_list_a = list(range(size))
    py_list_b = list(range(size))

    start = time.perf_counter()
    py_result = [a + b for a, b in zip(py_list_a, py_list_b)]
    py_time = time.perf_counter() - start

    # NumPy: vectorized operation (runs in C)
    np_a = np.arange(size)
    np_b = np.arange(size)

    start = time.perf_counter()
    np_result = np_a + np_b
    np_time = time.perf_counter() - start

    print(f"\nPython list addition : {py_time*1000:.3f} ms")
    print(f"NumPy vectorized add : {np_time*1000:.3f} ms")
    print(f"Speedup              : {py_time/np_time:.0f}x")

    # More complex: element-wise conditional operations
    data = np.random.randn(size)

    start = time.perf_counter()
    # Vectorized: clamp negatives to 0 first so sqrt never sees them
    # (np.where evaluates both branches, so sqrt on the raw array would
    # emit RuntimeWarnings for negative inputs)
    normalized = np.sqrt(np.where(data > 0, data, 0.0))
    vec_time = time.perf_counter() - start

    start = time.perf_counter()
    # Python loop: slow element-by-element
    normalized_py = [math.sqrt(x) if x > 0 else 0.0 for x in data]
    py_loop_time = time.perf_counter() - start

    print(f"\nConditional sqrt (NumPy) : {vec_time*1000:.3f} ms")
    print(f"Conditional sqrt (Python): {py_loop_time*1000:.3f} ms")
    print(f"Speedup                  : {py_loop_time/vec_time:.0f}x")


# ============================================================
# 5. functools optimizations — caching and partial
# ============================================================
from functools import lru_cache


@lru_cache(maxsize=256)
def expensive_computation(n):
    """Simulates an expensive calculation with automatic caching."""
    time.sleep(0.001)  # Simulate work
    return sum(i * i for i in range(n))


def lru_cache_demo():
    """Show cache hit/miss behavior."""
    # First calls — cache misses
    start = time.perf_counter()
    for i in range(100):
        expensive_computation(i)
    cold_time = time.perf_counter() - start

    # Repeated calls — cache hits
    start = time.perf_counter()
    for i in range(100):
        expensive_computation(i)
    warm_time = time.perf_counter() - start

    info = expensive_computation.cache_info()
    print("\nLRU Cache demo:")
    print(f"  Cold (misses): {cold_time*1000:.1f} ms")
    print(f"  Warm (hits)  : {warm_time*1000:.1f} ms")
    print(f"  Cache info   : {info}")


# ============================================================
# 6. Multiprocessing for CPU-bound parallelism
# ============================================================
from concurrent.futures import ProcessPoolExecutor


def cpu_bound_task(n):
    """Simulate a CPU-intensive calculation."""
    return sum(i * i for i in range(n))


def parallel_vs_sequential():
    """Compare sequential vs parallel execution."""
    tasks = [500_000] * 8  # 8 identical tasks

    # Sequential
    start = time.perf_counter()
    sequential_results = [cpu_bound_task(n) for n in tasks]
    seq_time = time.perf_counter() - start

    # Parallel with ProcessPoolExecutor
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        parallel_results = list(executor.map(cpu_bound_task, tasks))
    par_time = time.perf_counter() - start

    print(f"\nSequential : {seq_time*1000:.1f} ms")
    print(f"Parallel   : {par_time*1000:.1f} ms")
    print(f"Speedup    : {seq_time/par_time:.1f}x")


# ============================================================
# Usage
# ============================================================
if __name__ == "__main__":
    demo_set_vs_list_lookup()
    demo_heapq_top_k()
    loop_optimization_demo()
    numpy_vectorization_demo()
    lru_cache_demo()
    parallel_vs_sequential()
```
🏋️ Practice Exercise
Exercises:
1. Write a function that finds duplicate values in a list of 1 million integers. Implement three approaches: nested loop O(n^2), sorted comparison O(n log n), and set-based O(n). Benchmark all three and verify the algorithmic complexity difference in practice.
2. Take a function that processes a list of dictionaries using a for-loop (e.g., filtering by a condition and transforming a field). Rewrite it using a list comprehension, `map`/`filter`, and a generator expression. Benchmark all four approaches with 1 million records.
3. Install NumPy and rewrite a pure Python function that computes pairwise Euclidean distances between two lists of 2D points. Compare the pure Python nested-loop version with a vectorized NumPy version. Measure the speedup for 10,000 points.
4. Use `functools.lru_cache` to optimize a recursive function (e.g., Levenshtein edit distance or the partition function). Measure the speedup and print cache hit/miss statistics using `cache_info()`. Experiment with different `maxsize` values.
5. Implement a CPU-bound task (e.g., computing prime numbers up to N) and run it on 8 inputs sequentially vs. in parallel using `ProcessPoolExecutor`. Measure the speedup and explain why it does not achieve a perfect 8x improvement.
6. Profile a function that uses string concatenation in a loop (`+=`), then optimize it using `str.join()`, `io.StringIO`, and `"".join(list)`. Benchmark all four and explain why `+=` is O(n^2) for strings while `join` is O(n).
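As a warm-up for the last exercise, a small sketch of the comparison (input size is illustrative; for stable numbers prefer `timeit`):

```python
import io
import time

def concat_plus(parts):
    s = ""
    for p in parts:
        s += p              # may recopy the growing string → O(n^2) worst case
    return s

def concat_join(parts):
    return "".join(parts)   # one pass, one final allocation → O(n)

def concat_stringio(parts):
    buf = io.StringIO()     # append-friendly in-memory buffer
    for p in parts:
        buf.write(p)
    return buf.getvalue()

parts = ["x"] * 100_000
for fn in (concat_plus, concat_join, concat_stringio):
    start = time.perf_counter()
    result = fn(parts)
    print(f"{fn.__name__:16s}: {(time.perf_counter() - start) * 1000:.2f} ms")
```

Note that CPython sometimes optimizes `+=` in place when the string has a single reference, so the quadratic blow-up is most visible in code that defeats that optimization; the asymptotic argument still favors `join`.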
⚠️ Common Mistakes
- Optimizing before profiling. Developers often guess which code is slow and optimize the wrong thing. Always profile first with `cProfile` or `scalene` to find the actual bottleneck. The 80/20 rule applies: 80% of execution time is typically in 20% of the code.
- Using NumPy but still iterating element-by-element with Python loops. The entire point of NumPy is vectorized operations that execute in C. Writing `for i in range(len(arr)): arr[i] = arr[i] * 2` defeats the purpose — use `arr *= 2` instead.
- Choosing `multiprocessing` for I/O-bound tasks or `threading` for CPU-bound tasks. Due to the GIL, threads cannot achieve true parallelism for CPU-bound work. Use `multiprocessing` or `ProcessPoolExecutor` for CPU-bound tasks, and `asyncio` or `threading` for I/O-bound tasks.
- Overusing `lru_cache` without considering memory. Each cached result stays in memory until it is evicted. For functions with many unique arguments, the cache grows unboundedly (with `maxsize=None`) and can cause memory issues. Set an appropriate `maxsize` and monitor with `cache_info()`.
- Premature micro-optimization of Python code when the real solution is choosing a better algorithm. Shaving 10% off an O(n^2) algorithm is meaningless if an O(n log n) alternative exists. Focus on algorithmic complexity before implementation-level tricks.
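A tiny sketch of the bounded-cache point: with `maxsize=2`, the least recently used entry is evicted, so a revisited argument misses again (the function here is trivial on purpose):

```python
from functools import lru_cache

@lru_cache(maxsize=2)
def square(n):
    return n * n

square(1); square(2); square(3)   # three misses; maxsize=2 evicts the entry for 1
square(1)                         # misses again — it was already evicted
print(square.cache_info())  # → CacheInfo(hits=0, misses=4, maxsize=2, currsize=2)
```

`cache_info()` makes this eviction behavior observable, which is exactly how you would check whether a cache is actually earning its memory footprint.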