Optimization Techniques


📖 Concept

Once profiling has identified bottlenecks, targeted optimization techniques can deliver dramatic speedups. The key principle is to optimize at the right level: algorithm first, data structure second, implementation third, and low-level tricks last.

Algorithm & data structure optimization provides the largest gains. Replacing an O(n^2) nested loop with an O(n) hash-based lookup can turn a 10-minute job into a 1-second job. Common patterns:

  • Use sets for membership testing — x in my_set is O(1) vs. O(n) for lists
  • Use dicts for grouping and counting — collections.Counter, defaultdict
  • Use heapq for top-k problems — O(n log k) instead of O(n log n) full sort
  • Use bisect for sorted sequence operations — O(log n) binary search
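
The code example below benchmarks sets and heapq; the two tools it does not exercise, Counter and bisect, can be sketched as:

```python
from collections import Counter
import bisect

# Counter: O(n) counting without manual dict bookkeeping
words = ["red", "blue", "red", "green", "red"]
counts = Counter(words)
print(counts.most_common(1))  # [('red', 3)]

# bisect: O(log n) lookup in an already-sorted list
sorted_scores = [10, 25, 40, 55, 70]
idx = bisect.bisect_left(sorted_scores, 40)
print(idx)  # 2: position of 40 in the sorted list

# bisect.insort keeps the list sorted on insert
# (the search is O(log n), but shifting elements makes it O(n) overall)
bisect.insort(sorted_scores, 33)
print(sorted_scores)  # [10, 25, 33, 40, 55, 70]
```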

Python-specific optimizations:

  • List comprehensions and generator expressions are usually faster than equivalent for-loops because they avoid the per-iteration name lookup and method call of list.append
  • Built-in functions (sum, min, max, sorted) run in C and avoid Python loop overhead; map and filter do too, though a Python lambda passed to them still costs a Python-level call per element
  • Local variable access is faster than global or attribute lookups — assign frequently accessed attributes to local variables in tight loops
  • String interning and __slots__ reduce memory overhead for objects created in large quantities
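
For the __slots__ point, a minimal sketch comparing a regular class with a slotted one (the class names here are illustrative):

```python
import sys


class PointDict:
    """Regular class: every instance carries its own __dict__."""
    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Slotted class: fixed attribute layout, no per-instance __dict__."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p1, p2 = PointDict(1, 2), PointSlots(1, 2)
print(sys.getsizeof(p1.__dict__))   # dict overhead paid per instance
print(hasattr(p2, "__dict__"))      # False: the slotted instance has none
```

The saving is per instance, so it only matters when you create objects in large quantities; the trade-off is that you can no longer attach arbitrary attributes at runtime.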

NumPy vectorization replaces Python loops with array operations executed in compiled C/Fortran. Operations that process millions of elements in a Python loop can run 10-100x faster as vectorized NumPy operations. The key is to express computation as whole-array operations rather than element-by-element iteration.

Cython compiles Python-like code to C, enabling C-level performance for computation-heavy functions while retaining Python interoperability. Adding type annotations to Cython code allows the compiler to generate optimized C code that bypasses the Python object model entirely.

C extensions via ctypes or cffi let you call existing C libraries directly from Python, and pybind11 makes it straightforward to write C++ extensions with a Pythonic API. These approaches are appropriate when critical inner loops need maximum performance and the overhead of Python's interpreter is the bottleneck.
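
As a small illustration of the ctypes approach, the sketch below calls sqrt from the system C math library. It assumes a POSIX-like system where ctypes.util.find_library can locate libm, and skips gracefully otherwise:

```python
import ctypes
import ctypes.util

# Locate the C math library (e.g. libm.so.6 on Linux); may be None elsewhere
libm_path = ctypes.util.find_library("m")
if libm_path is None:
    print("libm not found: skipping ctypes demo")
else:
    libm = ctypes.CDLL(libm_path)
    # Declare the C signature: double sqrt(double).
    # Without this, ctypes would assume int arguments and return values.
    libm.sqrt.restype = ctypes.c_double
    libm.sqrt.argtypes = [ctypes.c_double]
    print(libm.sqrt(2.0))
```

Note the explicit restype/argtypes declarations: ctypes cannot read C headers, so you must describe each function's signature yourself.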

Concurrency is another optimization dimension: multiprocessing for CPU-bound parallelism (bypasses the GIL), threading / asyncio for I/O-bound concurrency, and concurrent.futures for a high-level interface to both.
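
The I/O-bound side can be sketched with threads. In this hedged example, time.sleep stands in for a real network or disk call; the GIL is released while sleeping, so the waits overlap:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(n):
    """Stand-in for a blocking I/O call: sleeps, then returns a result."""
    time.sleep(0.1)
    return n * 2


start = time.perf_counter()
sequential = [fake_io_task(n) for n in range(8)]
seq_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fake_io_task, range(8)))
thr_time = time.perf_counter() - start

print(f"Sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
# Expect roughly 0.8s sequential vs roughly 0.1s threaded: the sleeps overlap.
assert sequential == threaded
```

The same speedup would vanish for CPU-bound work, where only one thread can execute Python bytecode at a time.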

💻 Code Example

# ============================================================
# 1. Data structure choices — dramatic impact on performance
# ============================================================
import time
from collections import defaultdict, Counter
import heapq
import bisect


def demo_set_vs_list_lookup():
    """Demonstrate O(1) set lookup vs O(n) list lookup."""
    data_list = list(range(1_000_000))
    data_set = set(data_list)
    targets = [500_000, 999_999, 1_000_001]  # exists, exists, missing

    # List lookup — O(n) per lookup
    start = time.perf_counter()
    for target in targets:
        _ = target in data_list
    list_time = time.perf_counter() - start

    # Set lookup — O(1) per lookup
    start = time.perf_counter()
    for target in targets:
        _ = target in data_set
    set_time = time.perf_counter() - start

    print(f"List lookup: {list_time*1000:.3f} ms")
    print(f"Set lookup : {set_time*1000:.3f} ms")
    print(f"Speedup    : {list_time/set_time:.0f}x")


def demo_heapq_top_k():
    """Find top-k elements without full sort."""
    import random
    data = [random.random() for _ in range(1_000_000)]

    # Full sort approach — O(n log n)
    start = time.perf_counter()
    top_10_sort = sorted(data, reverse=True)[:10]
    sort_time = time.perf_counter() - start

    # heapq approach — O(n log k) where k=10
    start = time.perf_counter()
    top_10_heap = heapq.nlargest(10, data)
    heap_time = time.perf_counter() - start

    print(f"\nTop-10 via sorted()      : {sort_time*1000:.3f} ms")
    print(f"Top-10 via heapq.nlargest: {heap_time*1000:.3f} ms")

# ============================================================
# 2. Loop optimization techniques
# ============================================================
def loop_optimization_demo():
    """Compare loop styles for performance."""
    data = list(range(1_000_000))

    # Slow: Python for-loop with append
    start = time.perf_counter()
    result1 = []
    for x in data:
        result1.append(x * x)
    loop_time = time.perf_counter() - start

    # Faster: list comprehension (no repeated append lookup/call)
    start = time.perf_counter()
    result2 = [x * x for x in data]
    comp_time = time.perf_counter() - start

    # map() with a lambda: the lambda still executes as Python code for
    # every element, so this is often no faster than the comprehension
    start = time.perf_counter()
    result3 = list(map(lambda x: x * x, data))
    map_time = time.perf_counter() - start

    print(f"\nFor-loop + append  : {loop_time*1000:.3f} ms")
    print(f"List comprehension : {comp_time*1000:.3f} ms")
    print(f"map() + lambda     : {map_time*1000:.3f} ms")

# ============================================================
# 3. Local variable optimization in tight loops
# ============================================================
import math


def compute_distances_slow(points):
    """Slow: global/attribute lookups on every iteration."""
    results = []
    for x, y in points:
        dist = math.sqrt(x * x + y * y)
        results.append(dist)
    return results


def compute_distances_fast(points):
    """Fast: cache lookups as local variables."""
    sqrt = math.sqrt  # Local reference avoids repeated attribute lookup
    return [sqrt(x * x + y * y) for x, y in points]

# ============================================================
# 4. NumPy vectorization — replacing Python loops with C
# ============================================================
def numpy_vectorization_demo():
    """Demonstrate the speedup of NumPy over pure Python."""
    try:
        import numpy as np
    except ImportError:
        print("NumPy not installed — skipping vectorization demo")
        return

    size = 1_000_000

    # Pure Python: element-by-element
    py_list_a = list(range(size))
    py_list_b = list(range(size))

    start = time.perf_counter()
    py_result = [a + b for a, b in zip(py_list_a, py_list_b)]
    py_time = time.perf_counter() - start

    # NumPy: vectorized operation (runs in C)
    np_a = np.arange(size)
    np_b = np.arange(size)

    start = time.perf_counter()
    np_result = np_a + np_b
    np_time = time.perf_counter() - start

    print(f"\nPython list addition : {py_time*1000:.3f} ms")
    print(f"NumPy vectorized add : {np_time*1000:.3f} ms")
    print(f"Speedup              : {py_time/np_time:.0f}x")

    # More complex: element-wise conditional operation
    data = np.random.randn(size)

    start = time.perf_counter()
    # np.where evaluates both branches, so clamp negatives before sqrt
    # to avoid RuntimeWarnings on negative inputs
    normalized = np.where(data > 0, np.sqrt(np.maximum(data, 0.0)), 0.0)
    vec_time = time.perf_counter() - start

    start = time.perf_counter()
    # Python loop: slow element-by-element
    normalized_py = [math.sqrt(x) if x > 0 else 0.0 for x in data]
    py_loop_time = time.perf_counter() - start

    print(f"\nConditional sqrt (NumPy) : {vec_time*1000:.3f} ms")
    print(f"Conditional sqrt (Python): {py_loop_time*1000:.3f} ms")
    print(f"Speedup                  : {py_loop_time/vec_time:.0f}x")

# ============================================================
# 5. functools optimizations — caching with lru_cache
# ============================================================
from functools import lru_cache


@lru_cache(maxsize=256)
def expensive_computation(n):
    """Simulates an expensive calculation with automatic caching."""
    time.sleep(0.001)  # Simulate work
    return sum(i * i for i in range(n))


def lru_cache_demo():
    """Show cache hit/miss behavior."""
    # First calls — cache misses
    start = time.perf_counter()
    for i in range(100):
        expensive_computation(i)
    cold_time = time.perf_counter() - start

    # Repeated calls — cache hits
    start = time.perf_counter()
    for i in range(100):
        expensive_computation(i)
    warm_time = time.perf_counter() - start

    info = expensive_computation.cache_info()
    print("\nLRU Cache demo:")
    print(f"  Cold (misses): {cold_time*1000:.1f} ms")
    print(f"  Warm (hits)  : {warm_time*1000:.1f} ms")
    print(f"  Cache info   : {info}")

# ============================================================
# 6. Multiprocessing for CPU-bound parallelism
# ============================================================
from concurrent.futures import ProcessPoolExecutor


def cpu_bound_task(n):
    """Simulate a CPU-intensive calculation."""
    return sum(i * i for i in range(n))


def parallel_vs_sequential():
    """Compare sequential vs parallel execution."""
    tasks = [500_000] * 8  # 8 identical tasks

    # Sequential
    start = time.perf_counter()
    sequential_results = [cpu_bound_task(n) for n in tasks]
    seq_time = time.perf_counter() - start

    # Parallel with ProcessPoolExecutor
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        parallel_results = list(executor.map(cpu_bound_task, tasks))
    par_time = time.perf_counter() - start

    print(f"\nSequential : {seq_time*1000:.1f} ms")
    print(f"Parallel   : {par_time*1000:.1f} ms")
    print(f"Speedup    : {seq_time/par_time:.1f}x")

# ============================================================
# Usage
# ============================================================
if __name__ == "__main__":
    demo_set_vs_list_lookup()
    demo_heapq_top_k()
    loop_optimization_demo()
    numpy_vectorization_demo()
    lru_cache_demo()
    parallel_vs_sequential()

🏋️ Practice Exercise

Exercises:

  1. Write a function that finds duplicate values in a list of 1 million integers. Implement three approaches: nested loop O(n^2), sorted comparison O(n log n), and set-based O(n). Benchmark all three and verify the algorithmic complexity difference in practice.

  2. Take a function that processes a list of dictionaries using a for-loop (e.g., filtering by a condition and transforming a field). Rewrite it using list comprehension, map/filter, and a generator expression. Benchmark all four approaches with 1 million records.

  3. Install NumPy and rewrite a pure Python function that computes pairwise Euclidean distances between two lists of 2D points. Compare the pure Python nested-loop version with a vectorized NumPy version. Measure the speedup for 10,000 points.

  4. Use functools.lru_cache to optimize a recursive function (e.g., Levenshtein edit distance or the partition function). Measure the speedup and print cache hit/miss statistics using cache_info(). Experiment with different maxsize values.

  5. Implement a CPU-bound task (e.g., computing prime numbers up to N) and run it on 8 inputs sequentially vs. in parallel using ProcessPoolExecutor. Measure the speedup and explain why it does not achieve a perfect 8x improvement.

  6. Profile a function that uses string concatenation in a loop (+=), then optimize it using str.join(), io.StringIO, and "".join(list). Benchmark all four and explain why += is O(n^2) for strings while join is O(n).

⚠️ Common Mistakes

  • Optimizing before profiling. Developers often guess which code is slow and optimize the wrong thing. Always profile first with cProfile or scalene to find the actual bottleneck. The 80/20 rule applies: 80% of execution time is typically in 20% of the code.

  • Using NumPy but still iterating element-by-element with Python loops. The entire point of NumPy is vectorized operations that execute in C. Writing for i in range(len(arr)): arr[i] = arr[i] * 2 defeats the purpose — use arr *= 2 instead.

  • Choosing multiprocessing for I/O-bound tasks or threading for CPU-bound tasks. Due to the GIL, threads cannot achieve true parallelism for CPU-bound work. Use multiprocessing or ProcessPoolExecutor for CPU-bound tasks, and asyncio or threading for I/O-bound tasks.

  • Overusing lru_cache without considering memory. Each cached result stays in memory until the cache is full. For functions with many unique arguments, the cache grows unboundedly (with maxsize=None) and can cause memory issues. Always set an appropriate maxsize and monitor with cache_info().

  • Premature micro-optimization of Python code when the real solution is choosing a better algorithm. Shaving 10% off an O(n^2) algorithm is meaningless if an O(n log n) alternative exists. Focus on algorithmic complexity before implementation-level tricks.
