Optimization Techniques


📖 Concept

Once profiling has identified bottlenecks, targeted optimization techniques can deliver dramatic speedups. The key principle is to optimize at the right level: algorithm first, data structure second, implementation third, and low-level tricks last.

Algorithm & data structure optimization provides the largest gains. Replacing an O(n^2) nested loop with an O(n) hash-based lookup can turn a 10-minute job into a 1-second job. Common patterns:

  • Use sets for membership testing — x in my_set is O(1) vs. O(n) for lists
  • Use dicts for grouping and counting — collections.Counter, defaultdict
  • Use heapq for top-k problems — O(n log k) instead of O(n log n) full sort
  • Use bisect for sorted sequence operations — O(log n) binary search
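
The code example below benchmarks sets and heapq; the two tools it does not exercise, Counter and bisect, can be sketched as:

```python
from collections import Counter
import bisect

# Counter: O(n) counting without manual dict bookkeeping
words = ["red", "blue", "red", "green", "red"]
counts = Counter(words)
print(counts.most_common(1))  # [('red', 3)]

# bisect: O(log n) lookup in an already-sorted list
sorted_scores = [10, 25, 40, 55, 70]
idx = bisect.bisect_left(sorted_scores, 40)
print(idx)  # 2: position of 40 in the sorted list

# bisect.insort keeps the list sorted on insert
# (the search is O(log n), but shifting elements makes it O(n) overall)
bisect.insort(sorted_scores, 33)
print(sorted_scores)  # [10, 25, 33, 40, 55, 70]
```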

Python-specific optimizations:

  • List comprehensions and generator expressions are usually faster than equivalent for-loops because they avoid the per-iteration name lookup and method call of list.append
  • Built-in functions (sum, min, max, sorted) run in C and avoid Python loop overhead; map and filter do too, though a Python lambda passed to them still costs a Python-level call per element
  • Local variable access is faster than global or attribute lookups — assign frequently accessed attributes to local variables in tight loops
  • String interning and __slots__ reduce memory overhead for objects created in large quantities
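
For the __slots__ point, a minimal sketch comparing a regular class with a slotted one (the class names here are illustrative):

```python
import sys


class PointDict:
    """Regular class: every instance carries its own __dict__."""
    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Slotted class: fixed attribute layout, no per-instance __dict__."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p1, p2 = PointDict(1, 2), PointSlots(1, 2)
print(sys.getsizeof(p1.__dict__))   # dict overhead paid per instance
print(hasattr(p2, "__dict__"))      # False: the slotted instance has none
```

The saving is per instance, so it only matters when you create objects in large quantities; the trade-off is that you can no longer attach arbitrary attributes at runtime.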

NumPy vectorization replaces Python loops with array operations executed in compiled C/Fortran. Operations that process millions of elements in a Python loop can run 10-100x faster as vectorized NumPy operations. The key is to express computation as whole-array operations rather than element-by-element iteration.

Cython compiles Python-like code to C, enabling C-level performance for computation-heavy functions while retaining Python interoperability. Adding type annotations to Cython code allows the compiler to generate optimized C code that bypasses the Python object model entirely.

C extensions via ctypes or cffi let you call existing C libraries directly from Python, and pybind11 makes it straightforward to write C++ extensions with a Pythonic API. These approaches are appropriate when critical inner loops need maximum performance and the overhead of Python's interpreter is the bottleneck.
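
As a small illustration of the ctypes approach, the sketch below calls sqrt from the system C math library. It assumes a POSIX-like system where ctypes.util.find_library can locate libm, and skips gracefully otherwise:

```python
import ctypes
import ctypes.util

# Locate the C math library (e.g. libm.so.6 on Linux); may be None elsewhere
libm_path = ctypes.util.find_library("m")
if libm_path is None:
    print("libm not found: skipping ctypes demo")
else:
    libm = ctypes.CDLL(libm_path)
    # Declare the C signature: double sqrt(double).
    # Without this, ctypes would assume int arguments and return values.
    libm.sqrt.restype = ctypes.c_double
    libm.sqrt.argtypes = [ctypes.c_double]
    print(libm.sqrt(2.0))
```

Note the explicit restype/argtypes declarations: ctypes cannot read C headers, so you must describe each function's signature yourself.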

Concurrency is another optimization dimension: multiprocessing for CPU-bound parallelism (bypasses the GIL), threading / asyncio for I/O-bound concurrency, and concurrent.futures for a high-level interface to both.
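
The I/O-bound side can be sketched with threads. In this hedged example, time.sleep stands in for a real network or disk call; the GIL is released while sleeping, so the waits overlap:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(n):
    """Stand-in for a blocking I/O call: sleeps, then returns a result."""
    time.sleep(0.1)
    return n * 2


start = time.perf_counter()
sequential = [fake_io_task(n) for n in range(8)]
seq_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fake_io_task, range(8)))
thr_time = time.perf_counter() - start

print(f"Sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
# Expect roughly 0.8s sequential vs roughly 0.1s threaded: the sleeps overlap.
assert sequential == threaded
```

The same speedup would vanish for CPU-bound work, where only one thread can execute Python bytecode at a time.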

💻 Code Example

# ============================================================
# 1. Data structure choices — dramatic impact on performance
# ============================================================
import time
from collections import defaultdict, Counter
import heapq
import bisect


def demo_set_vs_list_lookup():
    """Demonstrate O(1) set lookup vs O(n) list lookup."""
    data_list = list(range(1_000_000))
    data_set = set(data_list)
    targets = [500_000, 999_999, 1_000_001]  # exists, exists, missing

    # List lookup — O(n) per lookup
    start = time.perf_counter()
    for target in targets:
        _ = target in data_list
    list_time = time.perf_counter() - start

    # Set lookup — O(1) per lookup
    start = time.perf_counter()
    for target in targets:
        _ = target in data_set
    set_time = time.perf_counter() - start

    print(f"List lookup: {list_time*1000:.3f} ms")
    print(f"Set lookup : {set_time*1000:.3f} ms")
    print(f"Speedup    : {list_time/set_time:.0f}x")


def demo_heapq_top_k():
    """Find top-k elements without full sort."""
    import random
    data = [random.random() for _ in range(1_000_000)]

    # Full sort approach — O(n log n)
    start = time.perf_counter()
    top_10_sort = sorted(data, reverse=True)[:10]
    sort_time = time.perf_counter() - start

    # heapq approach — O(n log k) where k=10
    start = time.perf_counter()
    top_10_heap = heapq.nlargest(10, data)
    heap_time = time.perf_counter() - start

    print(f"\nTop-10 via sorted()      : {sort_time*1000:.3f} ms")
    print(f"Top-10 via heapq.nlargest: {heap_time*1000:.3f} ms")

# ============================================================
# 2. Loop optimization techniques
# ============================================================
def loop_optimization_demo():
    """Compare loop styles for performance."""
    data = list(range(1_000_000))

    # Slow: Python for-loop with append
    start = time.perf_counter()
    result1 = []
    for x in data:
        result1.append(x * x)
    loop_time = time.perf_counter() - start

    # Faster: list comprehension (no repeated append lookup/call)
    start = time.perf_counter()
    result2 = [x * x for x in data]
    comp_time = time.perf_counter() - start

    # map() with a lambda: the lambda still executes as Python code for
    # every element, so this is often no faster than the comprehension
    start = time.perf_counter()
    result3 = list(map(lambda x: x * x, data))
    map_time = time.perf_counter() - start

    print(f"\nFor-loop + append  : {loop_time*1000:.3f} ms")
    print(f"List comprehension : {comp_time*1000:.3f} ms")
    print(f"map() + lambda     : {map_time*1000:.3f} ms")

# ============================================================
# 3. Local variable optimization in tight loops
# ============================================================
import math


def compute_distances_slow(points):
    """Slow: global/attribute lookups on every iteration."""
    results = []
    for x, y in points:
        dist = math.sqrt(x * x + y * y)
        results.append(dist)
    return results


def compute_distances_fast(points):
    """Fast: cache lookups as local variables."""
    sqrt = math.sqrt  # Local reference avoids repeated attribute lookup
    return [sqrt(x * x + y * y) for x, y in points]

# ============================================================
# 4. NumPy vectorization — replacing Python loops with C
# ============================================================
def numpy_vectorization_demo():
    """Demonstrate the speedup of NumPy over pure Python."""
    try:
        import numpy as np
    except ImportError:
        print("NumPy not installed — skipping vectorization demo")
        return

    size = 1_000_000

    # Pure Python: element-by-element
    py_list_a = list(range(size))
    py_list_b = list(range(size))

    start = time.perf_counter()
    py_result = [a + b for a, b in zip(py_list_a, py_list_b)]
    py_time = time.perf_counter() - start

    # NumPy: vectorized operation (runs in C)
    np_a = np.arange(size)
    np_b = np.arange(size)

    start = time.perf_counter()
    np_result = np_a + np_b
    np_time = time.perf_counter() - start

    print(f"\nPython list addition : {py_time*1000:.3f} ms")
    print(f"NumPy vectorized add : {np_time*1000:.3f} ms")
    print(f"Speedup              : {py_time/np_time:.0f}x")

    # More complex: element-wise conditional operation
    data = np.random.randn(size)

    start = time.perf_counter()
    # np.where evaluates both branches, so clamp negatives before sqrt
    # to avoid RuntimeWarnings on negative inputs
    normalized = np.where(data > 0, np.sqrt(np.maximum(data, 0.0)), 0.0)
    vec_time = time.perf_counter() - start

    start = time.perf_counter()
    # Python loop: slow element-by-element
    normalized_py = [math.sqrt(x) if x > 0 else 0.0 for x in data]
    py_loop_time = time.perf_counter() - start

    print(f"\nConditional sqrt (NumPy) : {vec_time*1000:.3f} ms")
    print(f"Conditional sqrt (Python): {py_loop_time*1000:.3f} ms")
    print(f"Speedup                  : {py_loop_time/vec_time:.0f}x")

# ============================================================
# 5. functools optimizations — caching with lru_cache
# ============================================================
from functools import lru_cache


@lru_cache(maxsize=256)
def expensive_computation(n):
    """Simulates an expensive calculation with automatic caching."""
    time.sleep(0.001)  # Simulate work
    return sum(i * i for i in range(n))


def lru_cache_demo():
    """Show cache hit/miss behavior."""
    # First calls — cache misses
    start = time.perf_counter()
    for i in range(100):
        expensive_computation(i)
    cold_time = time.perf_counter() - start

    # Repeated calls — cache hits
    start = time.perf_counter()
    for i in range(100):
        expensive_computation(i)
    warm_time = time.perf_counter() - start

    info = expensive_computation.cache_info()
    print("\nLRU Cache demo:")
    print(f"  Cold (misses): {cold_time*1000:.1f} ms")
    print(f"  Warm (hits)  : {warm_time*1000:.1f} ms")
    print(f"  Cache info   : {info}")

# ============================================================
# 6. Multiprocessing for CPU-bound parallelism
# ============================================================
from concurrent.futures import ProcessPoolExecutor


def cpu_bound_task(n):
    """Simulate a CPU-intensive calculation."""
    return sum(i * i for i in range(n))


def parallel_vs_sequential():
    """Compare sequential vs parallel execution."""
    tasks = [500_000] * 8  # 8 identical tasks

    # Sequential
    start = time.perf_counter()
    sequential_results = [cpu_bound_task(n) for n in tasks]
    seq_time = time.perf_counter() - start

    # Parallel with ProcessPoolExecutor
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        parallel_results = list(executor.map(cpu_bound_task, tasks))
    par_time = time.perf_counter() - start

    print(f"\nSequential : {seq_time*1000:.1f} ms")
    print(f"Parallel   : {par_time*1000:.1f} ms")
    print(f"Speedup    : {seq_time/par_time:.1f}x")

# ============================================================
# Usage
# ============================================================
if __name__ == "__main__":
    demo_set_vs_list_lookup()
    demo_heapq_top_k()
    loop_optimization_demo()
    numpy_vectorization_demo()
    lru_cache_demo()
    parallel_vs_sequential()

🏋️ Practice Exercise

Exercises:

  1. Write a function that finds duplicate values in a list of 1 million integers. Implement three approaches: nested loop O(n^2), sorted comparison O(n log n), and set-based O(n). Benchmark all three and verify the algorithmic complexity difference in practice.

  2. Take a function that processes a list of dictionaries using a for-loop (e.g., filtering by a condition and transforming a field). Rewrite it using list comprehension, map/filter, and a generator expression. Benchmark all four approaches with 1 million records.

  3. Install NumPy and rewrite a pure Python function that computes pairwise Euclidean distances between two lists of 2D points. Compare the pure Python nested-loop version with a vectorized NumPy version. Measure the speedup for 10,000 points.

  4. Use functools.lru_cache to optimize a recursive function (e.g., Levenshtein edit distance or the partition function). Measure the speedup and print cache hit/miss statistics using cache_info(). Experiment with different maxsize values.

  5. Implement a CPU-bound task (e.g., computing prime numbers up to N) and run it on 8 inputs sequentially vs. in parallel using ProcessPoolExecutor. Measure the speedup and explain why it does not achieve a perfect 8x improvement.

  6. Profile a function that uses string concatenation in a loop (+=), then optimize it using str.join(), io.StringIO, and "".join(list). Benchmark all four and explain why += is O(n^2) for strings while join is O(n).

⚠️ Common Mistakes

  • Optimizing before profiling. Developers often guess which code is slow and optimize the wrong thing. Always profile first with cProfile or scalene to find the actual bottleneck. The 80/20 rule applies: 80% of execution time is typically in 20% of the code.

  • Using NumPy but still iterating element-by-element with Python loops. The entire point of NumPy is vectorized operations that execute in C. Writing for i in range(len(arr)): arr[i] = arr[i] * 2 defeats the purpose — use arr *= 2 instead.

  • Choosing multiprocessing for I/O-bound tasks or threading for CPU-bound tasks. Due to the GIL, threads cannot achieve true parallelism for CPU-bound work. Use multiprocessing or ProcessPoolExecutor for CPU-bound tasks, and asyncio or threading for I/O-bound tasks.

  • Overusing lru_cache without considering memory. Each cached result stays in memory until the cache is full. For functions with many unique arguments, the cache grows unboundedly (with maxsize=None) and can cause memory issues. Always set an appropriate maxsize and monitor with cache_info().

  • Premature micro-optimization of Python code when the real solution is choosing a better algorithm. Shaving 10% off an O(n^2) algorithm is meaningless if an O(n log n) alternative exists. Focus on algorithmic complexity before implementation-level tricks.
