collections Module


📖 Concept

The collections module provides specialized container datatypes that extend Python's built-in dict, list, set, and tuple with additional functionality. These are not exotic data structures — they are production workhorses that solve common problems more efficiently and expressively than rolling your own.

namedtuple creates lightweight, immutable tuple subclasses with named fields. Instances use the same memory as regular tuples (no per-instance __dict__), support indexing and unpacking, and are perfect for representing records, coordinates, database rows, or any fixed-structure data. For new code, typing.NamedTuple provides the same functionality with type annotations.

deque (double-ended queue) supports thread-safe, O(1) appends and pops from both ends. Lists are O(n) for insert(0, x) and pop(0) because they shift every remaining element. A deque with maxlen automatically evicts the oldest element when full — perfect for sliding windows, bounded buffers, and recent-history tracking.

defaultdict eliminates the "check if key exists, initialize if not" pattern. You provide a factory function (int, list, set, or a custom callable) that creates default values for missing keys. Nested defaultdicts enable auto-vivifying tree structures.

OrderedDict was essential before Python 3.7, when regular dicts did not guarantee ordering. Since 3.7, regular dicts maintain insertion order, but OrderedDict still offers unique features: move_to_end(key), equality that considers order, and popitem(last=False) for FIFO (or last=True for LIFO) behavior. It's the backbone of LRU cache implementations.

ChainMap groups multiple dicts into a single view. Lookups search each dict in order. It's ideal for layered configuration (defaults < config file < environment < CLI args), template variable scoping, and any scenario where you need to overlay multiple namespaces without copying.

Container     When to Use                            Key Advantage
namedtuple    Fixed-structure records                Memory-efficient, immutable
deque         FIFO queues, sliding windows           O(1) both-end operations
defaultdict   Grouping, counting, accumulating       Auto-initialization
OrderedDict   Order-sensitive equality, LRU caches   move_to_end, ordered equality
ChainMap      Layered lookups, scoped configs        No-copy dict overlay
Counter       Frequency counting, multisets          Arithmetic on counts

💻 Code Example

# ============================================================
# namedtuple: lightweight immutable records
# ============================================================
from collections import namedtuple, deque, defaultdict, OrderedDict, ChainMap, Counter
from typing import NamedTuple, Optional
import time
import json


# Classic namedtuple
HttpResponse = namedtuple("HttpResponse", ["status", "headers", "body"])

resp = HttpResponse(status=200, headers={"Content-Type": "application/json"}, body="{}")
print(resp.status)  # 200
print(resp[0])      # 200 (tuple indexing still works)
status, headers, body = resp  # Unpacking works

# Convert to dict for serialization
print(resp._asdict())  # {'status': 200, 'headers': {...}, 'body': '{}'}

# Create a modified copy (namedtuples are immutable)
error_resp = resp._replace(status=404, body='{"error": "not found"}')
print(error_resp.status)  # 404
print(resp.status)        # 200 (original unchanged)


# typing.NamedTuple — modern approach with type hints and defaults
class DatabaseConfig(NamedTuple):
    host: str
    port: int = 5432
    database: str = "myapp"
    pool_min: int = 2
    pool_max: int = 10

    @property
    def connection_string(self) -> str:
        return f"postgresql://{self.host}:{self.port}/{self.database}"


config = DatabaseConfig(host="prod-db-1", pool_max=20)
print(config.connection_string)  # postgresql://prod-db-1:5432/myapp
print(config.pool_max)           # 20
44
# ============================================================
# deque: high-performance double-ended queue
# ============================================================

# Bounded deque as a sliding window / recent history
class RecentActivityTracker:
    """Track the N most recent user actions."""

    def __init__(self, max_size: int = 100):
        self._actions: deque = deque(maxlen=max_size)

    def record(self, action: str) -> None:
        self._actions.append({"action": action, "timestamp": time.time()})

    def recent(self, n: int = 10) -> list[dict]:
        """Get the n most recent actions (newest first)."""
        return list(reversed(self._actions))[:n]

    @property
    def count(self) -> int:
        return len(self._actions)


tracker = RecentActivityTracker(max_size=5)
for action in ["login", "view_page", "edit_profile", "upload", "logout", "login"]:
    tracker.record(action)
print(tracker.count)  # 5 (oldest "login" was evicted)


# Deque as an efficient FIFO queue
task_queue: deque = deque()
task_queue.append("task_1")  # Add to right (enqueue)
task_queue.append("task_2")
task_queue.append("task_3")
next_task = task_queue.popleft()  # Remove from left (dequeue) — O(1)!
print(next_task)  # "task_1"

# Rotate elements
d = deque([1, 2, 3, 4, 5])
d.rotate(2)     # Rotate right by 2
print(list(d))  # [4, 5, 1, 2, 3]
d.rotate(-2)    # Rotate left by 2
print(list(d))  # [1, 2, 3, 4, 5]
88
89
# ============================================================
# defaultdict: automatic initialization patterns
# ============================================================

# Group items by category
products = [
    ("electronics", "laptop"),
    ("clothing", "shirt"),
    ("electronics", "phone"),
    ("clothing", "jacket"),
    ("electronics", "tablet"),
    ("food", "apple"),
]

by_category = defaultdict(list)
for category, product in products:
    by_category[category].append(product)

print(dict(by_category))
# {'electronics': ['laptop', 'phone', 'tablet'],
#  'clothing': ['shirt', 'jacket'],
#  'food': ['apple']}


# Count occurrences
word_counts = defaultdict(int)
words = "the cat sat on the mat the cat".split()
for word in words:
    word_counts[word] += 1
print(dict(word_counts))  # {'the': 3, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1}


# Nested defaultdict: auto-vivifying tree
def tree():
    """Create an auto-vivifying tree (infinite nesting)."""
    return defaultdict(tree)


config_tree = tree()
config_tree["database"]["primary"]["host"] = "db-1.prod"
config_tree["database"]["primary"]["port"] = 5432
config_tree["database"]["replica"]["host"] = "db-2.prod"
config_tree["cache"]["redis"]["host"] = "redis.prod"


# Convert to a regular dict for serialization
def tree_to_dict(t):
    if isinstance(t, defaultdict):
        return {k: tree_to_dict(v) for k, v in t.items()}
    return t


print(json.dumps(tree_to_dict(config_tree), indent=2))


# defaultdict with set for unique value tracking
user_logins = defaultdict(set)
login_events = [
    ("alice", "192.168.1.1"),
    ("bob", "10.0.0.1"),
    ("alice", "192.168.1.2"),
    ("alice", "192.168.1.1"),  # Duplicate IP for alice
    ("bob", "10.0.0.1"),       # Duplicate IP for bob
]

for user, ip in login_events:
    user_logins[user].add(ip)

print(dict(user_logins))
# {'alice': {'192.168.1.1', '192.168.1.2'}, 'bob': {'10.0.0.1'}}
158
159
# ============================================================
# Counter: frequency analysis and multiset operations
# ============================================================

# Word frequency analysis
text = "to be or not to be that is the question"
word_freq = Counter(text.split())
print(word_freq.most_common(3))  # [('to', 2), ('be', 2), ('or', 1)]

# Arithmetic on Counters (multiset operations)
inventory_a = Counter(apples=5, oranges=3, bananas=2)
inventory_b = Counter(apples=2, oranges=5, grapes=4)

combined = inventory_a + inventory_b
print(combined)  # Counter({'oranges': 8, 'apples': 7, 'grapes': 4, 'bananas': 2})

difference = inventory_a - inventory_b  # Only positive counts are kept
print(difference)  # Counter({'apples': 3, 'bananas': 2})

# Intersection (minimum) and union (maximum)
common = inventory_a & inventory_b
print(common)  # Counter({'oranges': 3, 'apples': 2})

total = inventory_a | inventory_b
print(total)  # Counter({'apples': 5, 'oranges': 5, 'grapes': 4, 'bananas': 2})
185
186
# ============================================================
# OrderedDict: order-aware dictionary operations
# ============================================================

class LRUCache:
    """Simple LRU cache using OrderedDict."""

    def __init__(self, capacity: int):
        self._cache: OrderedDict = OrderedDict()
        self._capacity = capacity

    def get(self, key: str) -> Optional[str]:
        if key not in self._cache:
            return None
        self._cache.move_to_end(key)  # Mark as recently used
        return self._cache[key]

    def put(self, key: str, value: str) -> None:
        if key in self._cache:
            self._cache.move_to_end(key)
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # Remove oldest (least recently used)

    def __repr__(self) -> str:
        items = ", ".join(f"{k}={v}" for k, v in self._cache.items())
        return f"LRUCache([{items}])"


cache = LRUCache(capacity=3)
cache.put("a", "1")
cache.put("b", "2")
cache.put("c", "3")
print(cache)  # LRUCache([a=1, b=2, c=3])

cache.get("a")       # Access "a" — moves it to the end
cache.put("d", "4")  # Evicts "b" (least recently used)
print(cache)  # LRUCache([c=3, a=1, d=4])


# OrderedDict equality considers order
od1 = OrderedDict([("a", 1), ("b", 2)])
od2 = OrderedDict([("b", 2), ("a", 1)])
print(od1 == od2)  # False (different order)

d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}
print(d1 == d2)  # True (regular dicts ignore order in equality)
235
236
# ============================================================
# ChainMap: layered configuration
# ============================================================

# Layered configuration: CLI > env > config file > defaults
defaults = {"debug": False, "log_level": "INFO", "port": 8080, "host": "0.0.0.0"}
config_file = {"log_level": "WARNING", "port": 9090}
env_vars = {"debug": True}
cli_args = {"port": 3000}

# Highest priority first
settings = ChainMap(cli_args, env_vars, config_file, defaults)

print(settings["port"])       # 3000 (from cli_args)
print(settings["debug"])      # True (from env_vars)
print(settings["log_level"])  # WARNING (from config_file)
print(settings["host"])       # 0.0.0.0 (from defaults)

# Mutations only affect the first mapping
settings["new_key"] = "value"
print(cli_args)  # {'port': 3000, 'new_key': 'value'}

# Create a child scope (useful for template engines, variable scoping)
local_scope = settings.new_child({"port": 5000, "local_var": True})
print(local_scope["port"])       # 5000 (local override)
print(local_scope["debug"])      # True (inherited from parent chain)
print(local_scope["local_var"])  # True
264
265
# ============================================================
# Practical: combining collections for a metrics aggregator
# ============================================================

class MetricsAggregator:
    """Aggregate application metrics using collections."""

    def __init__(self, window_size: int = 1000):
        self._recent_latencies: deque = deque(maxlen=window_size)
        self._error_counts: Counter = Counter()
        self._endpoint_hits: defaultdict = defaultdict(int)
        self._status_by_endpoint: defaultdict = defaultdict(Counter)

    def record(self, endpoint: str, status: int, latency_ms: float) -> None:
        self._recent_latencies.append(latency_ms)
        self._endpoint_hits[endpoint] += 1
        self._status_by_endpoint[endpoint][status] += 1
        if status >= 400:
            self._error_counts[f"{endpoint}:{status}"] += 1

    def avg_latency(self) -> float:
        if not self._recent_latencies:
            return 0.0
        return sum(self._recent_latencies) / len(self._recent_latencies)

    def top_errors(self, n: int = 5) -> list:
        return self._error_counts.most_common(n)

    def summary(self) -> dict:
        return {
            "avg_latency_ms": round(self.avg_latency(), 2),
            "total_requests": sum(self._endpoint_hits.values()),
            "top_errors": self.top_errors(),
        }


metrics = MetricsAggregator(window_size=100)
metrics.record("/api/users", 200, 45.2)
metrics.record("/api/users", 200, 52.1)
metrics.record("/api/orders", 500, 1200.5)
metrics.record("/api/orders", 500, 980.3)
metrics.record("/api/auth", 401, 12.0)
print(metrics.summary())

🏋️ Practice Exercise

Exercises:

  1. Build a namedtuple-based LogEntry with fields timestamp, level, service, message. Write a parser that reads log lines into LogEntry instances, then use Counter to find the top 5 most common error messages and defaultdict to group entries by service.

  2. Implement a bounded task queue using deque(maxlen=N) that supports enqueue, dequeue, peek, and is_full. When the queue is full and a new item is enqueued, the oldest item should be silently dropped (not raise an error). Add a drain() method that yields all items.

  3. Create a ChainMap-based template variable resolver that supports nested scopes. Implement enter_scope(vars) (pushes a new scope), exit_scope() (pops the current scope), and resolve(name) (looks up a variable through all scopes). Test with three levels of nesting.

  4. Use defaultdict(lambda: defaultdict(list)) to build an inverted index: given a list of documents (strings), create a mapping from each word to the list of (doc_id, position) pairs where it appears. Then implement a search(word) function that returns matching documents with highlighted positions.

  5. Implement a complete LRU cache using OrderedDict that supports get, put, delete, resize (change capacity), and stats (hits, misses, evictions). Write unit tests that verify eviction order and hit/miss counts.

  6. Build a Counter-based shopping cart that supports adding items, removing items, merging two carts (+), finding common items between carts (&), and computing total price given a price lookup dict. Demonstrate all multiset operations.

⚠️ Common Mistakes

  • Using defaultdict when a regular dict with .setdefault() or .get(key, default) would suffice. defaultdict creates the default value on any missing-key subscript access (d[key]), including accidental reads, which can silently populate the dict with unintended keys. Use dict.get(key, default) when you only want to read without side effects.
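The read side effect is easy to reproduce. A minimal sketch (names here are illustrative only):

```python
from collections import defaultdict

counts = defaultdict(int)
counts["seen"] += 1

# A mere *read* of a missing key inserts it with the default value
if counts["typo_key"] > 0:  # This lookup creates 'typo_key' -> 0
    pass
print(dict(counts))  # {'seen': 1, 'typo_key': 0}

# Side-effect-free alternative: plain dict with .get()
plain = {"seen": 1}
print(plain.get("typo_key", 0))  # 0, and 'typo_key' is NOT inserted
print(plain)                     # {'seen': 1}
```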

  • Forgetting that deque.maxlen silently discards elements from the opposite end when the deque is full. If you append to the right on a full deque, the leftmost element is dropped without warning. This is by design for sliding windows but can cause data loss if you expect an error.
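The silent eviction works from both ends — appending on one side drops from the other. A quick demonstration:

```python
from collections import deque

window = deque([1, 2, 3], maxlen=3)
window.append(4)      # Full: leftmost element (1) is silently dropped
print(list(window))   # [2, 3, 4]
window.appendleft(0)  # Full: rightmost element (4) is silently dropped
print(list(window))   # [0, 2, 3]
```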

  • Assuming OrderedDict is needed in Python 3.7+. Regular dicts maintain insertion order since 3.7. Only use OrderedDict when you need order-sensitive equality (OrderedDict([('a',1),('b',2)]) != OrderedDict([('b',2),('a',1)])), move_to_end(), or popitem(last=False) for FIFO behavior.

  • Mutating a namedtuple field directly. Namedtuples are immutable — point.x = 10 raises AttributeError. Use point._replace(x=10) to create a modified copy. If you need mutability, use a dataclass instead.
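The AttributeError and the _replace workaround, side by side (Point is an illustrative example type):

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(x=1, y=2)

try:
    p.x = 10  # Direct assignment is forbidden on a namedtuple
except AttributeError as exc:
    print(f"Immutable: {exc}")

moved = p._replace(x=10)  # Build a modified copy instead
print(moved)  # Point(x=10, y=2)
print(p)      # Point(x=1, y=2) — original unchanged
```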

  • Not converting defaultdict back to a regular dict before serialization. json.dumps(defaultdict(...)) works, but the output type info is lost. More importantly, passing a defaultdict to code that doesn't expect auto-vivification can cause subtle bugs. Always dict(my_defaultdict) at boundaries.
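A sketch of the boundary conversion: serialize a plain-dict snapshot so downstream code gets normal KeyError semantics instead of auto-vivification.

```python
import json
from collections import defaultdict

groups = defaultdict(list)
for key, value in [("a", 1), ("a", 2), ("b", 3)]:
    groups[key].append(value)

# Convert at the boundary: callers receive a plain dict
snapshot = dict(groups)
print(json.dumps(snapshot))  # {"a": [1, 2], "b": [3]}

# The plain dict raises on missing keys instead of silently growing
try:
    snapshot["missing"]
except KeyError:
    print("plain dict raises KeyError as expected")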
