Serialization: JSON, CSV, Pickle, YAML & TOML


📖 Concept

Serialization is converting Python objects to a storable/transmittable format, and deserialization is the reverse. Python ships with robust support for JSON, CSV, and pickle in the standard library, with TOML reading added in Python 3.11. YAML requires the third-party PyYAML package.
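As a minimal illustration of the round trip (the dictionary here is arbitrary):

```python
import json

# Serialize: Python object → JSON text
obj = {"lang": "python", "version": 3}
text = json.dumps(obj)

# Deserialize: JSON text → an equal Python object
assert json.loads(text) == obj
```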

Format comparison:

| Format | Human Readable | Types                                                    | Security               | Use Case                       |
|--------|----------------|----------------------------------------------------------|------------------------|--------------------------------|
| JSON   | Yes            | Strings, numbers, bools, null, arrays, objects           | Safe                   | APIs, configs, data exchange   |
| CSV    | Yes            | Strings only (everything is text)                        | Safe                   | Tabular data, spreadsheets     |
| pickle | No (binary)    | Any Python object                                        | DANGEROUS              | Internal caching, ML models    |
| TOML   | Yes            | Strings, ints, floats, bools, datetimes, arrays, tables  | Safe                   | Config files (pyproject.toml)  |
| YAML   | Yes            | Rich types including dates, nulls                        | Risky (safe_load only) | Config files, Kubernetes, CI/CD |

Critical security warnings:

  • Never unpickle untrusted data. pickle.loads() can execute arbitrary code — it's a remote code execution vulnerability. Only use pickle for data you created yourself.
  • Always use yaml.safe_load(), never yaml.load() without Loader=SafeLoader. The default loader can construct arbitrary Python objects from YAML.
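When you must load pickles at all, the pickle documentation's "Restricting Globals" section recommends subclassing pickle.Unpickler and overriding find_class with an allow-list. A minimal sketch (the allow-list below is illustrative, not exhaustive):

```python
import builtins
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Only resolve a small allow-list of safe built-in classes."""

    SAFE_BUILTINS = {"dict", "list", "set", "tuple", "str", "int", "float", "bool"}

    def find_class(self, module, name):
        # Permit only whitelisted names from builtins; reject everything else.
        if module == "builtins" and name in self.SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads with an allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers round-trip fine (no globals are referenced)...
assert restricted_loads(pickle.dumps({"a": [1, 2]})) == {"a": [1, 2]}

# ...but a payload referencing os.system is rejected instead of resolved.
import os
try:
    restricted_loads(pickle.dumps(os.system))
except pickle.UnpicklingError as e:
    print("blocked:", e)
```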

JSON tips: Python's json module only handles basic types. For datetime, Decimal, UUID, etc., provide a custom encoder via the default= parameter or by subclassing JSONEncoder. The third-party orjson and ujson libraries are roughly 5-10x faster for large payloads.
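For deserialization, the mirror image of default= is the object_hook= parameter, which is called for every decoded JSON object. One way to get lossless round-trips is to tag the encoded value; the "__datetime__" key below is just a convention for this sketch, not a standard:

```python
import json
from datetime import datetime

def encode_extra(obj):
    # Tag datetimes so the decoder can recognise and restore them.
    if isinstance(obj, datetime):
        return {"__datetime__": obj.isoformat()}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

def decode_extra(d):
    # Called for every decoded JSON object (dict).
    if "__datetime__" in d:
        return datetime.fromisoformat(d["__datetime__"])
    return d

event = {"name": "deploy", "at": datetime(2024, 5, 1, 12, 30)}
blob = json.dumps(event, default=encode_extra)
restored = json.loads(blob, object_hook=decode_extra)
assert restored == event  # the datetime survives the round trip
```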

CSV tips: Always use the csv module rather than splitting on commas — it handles quoting, escaping, and multi-line fields correctly. DictReader/DictWriter provide column-name access.
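For files whose delimiter you don't control, csv.Sniffer can guess the dialect from a sample. It is a heuristic, so verify the result on your own data; the semicolon-delimited sample below is made up for illustration:

```python
import csv
import io

sample = "name;dept;salary\nAlice;Engineering;120000\nBob;Marketing;95000\n"

# Sniff the dialect (delimiter, quoting) from a sample of the file.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ';'

rows = list(csv.DictReader(io.StringIO(sample), dialect=dialect))
assert rows[0]["dept"] == "Engineering"
```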

💻 Code Example

```python
# ============================================================
# JSON — the universal data interchange format
# ============================================================
import base64
import json
from datetime import datetime, date
from decimal import Decimal
from pathlib import Path
from uuid import UUID, uuid4

# Basic usage
data = {"name": "Alice", "scores": [95, 87, 92], "active": True}
json_str = json.dumps(data, indent=2)  # Python → JSON string
parsed = json.loads(json_str)          # JSON string → Python

# Write/read JSON files
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)  # Pretty-print, allow unicode

with open("data.json", "r", encoding="utf-8") as f:
    loaded = json.load(f)

# Custom encoder for types JSON doesn't support natively
class AppJSONEncoder(json.JSONEncoder):
    """Handle datetime, Decimal, UUID, set, bytes, Path."""

    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)  # Preserve precision (float would lose it)
        if isinstance(obj, UUID):
            return str(obj)
        if isinstance(obj, set):
            return sorted(obj)  # Sets aren't ordered
        if isinstance(obj, bytes):
            return base64.b64encode(obj).decode("ascii")
        if isinstance(obj, Path):
            return str(obj)
        return super().default(obj)  # Raise TypeError for unknown types

# Usage
record = {
    "id": uuid4(),
    "created": datetime.now(),
    "price": Decimal("19.99"),
    "tags": {"python", "coding"},
}
print(json.dumps(record, cls=AppJSONEncoder, indent=2))

# Alternative: use the default= parameter (simpler for one-offs)
json.dumps(record, default=str)  # Converts anything unknown to its str() form


# ============================================================
# CSV — tabular data
# ============================================================
import csv

# Writing CSV
employees = [
    {"name": "Alice", "dept": "Engineering", "salary": 120000},
    {"name": "Bob", "dept": "Marketing", "salary": 95000},
    {"name": 'Charlie "Chuck"', "dept": "Sales", "salary": 88000},
]

with open("employees.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "dept", "salary"])
    writer.writeheader()
    writer.writerows(employees)

# Reading CSV
with open("employees.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # All values are strings — cast as needed
        name = row["name"]
        salary = int(row["salary"])
        print(f"{name}: salary={salary:,}")

# BAD — never split on comma manually
# line.split(",")  # Breaks on: 'Charlie "Chuck"', fields with commas, etc.

# GOOD — the csv module handles quoting, escaping, multi-line fields
# Also handles different dialects (excel, unix, custom)


# ============================================================
# pickle — Python-specific binary serialization
# ============================================================
import pickle

# WARNING: Never unpickle data from untrusted sources!
# pickle.loads() can execute arbitrary code.

class TrainedModel:
    """Simulate an ML model with fitted parameters."""
    def __init__(self, weights, accuracy):
        self.weights = weights
        self.accuracy = accuracy

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

model = TrainedModel(weights=[0.5, 0.3, 0.2], accuracy=0.95)

# Serialize to bytes
pickled = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)

# Serialize to file
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize (ONLY from trusted sources!)
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
    print(loaded_model.predict([1, 2, 3]))  # 0.5*1 + 0.3*2 + 0.2*3 = 1.7

# Customize pickling with __getstate__ / __setstate__
class DatabaseConnection:
    """Connections can't be pickled — customize the behavior."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self._connection = self._connect()  # Not serializable

    def _connect(self):
        return f"<Connection to {self.host}:{self.port}>"

    def __getstate__(self):
        """Called during pickling — return what to serialize."""
        state = self.__dict__.copy()
        del state["_connection"]  # Remove non-serializable attribute
        return state

    def __setstate__(self, state):
        """Called during unpickling — restore the object."""
        self.__dict__.update(state)
        self._connection = self._connect()  # Reconnect


# ============================================================
# TOML — Python 3.11+ built-in reader (pyproject.toml, configs)
# ============================================================
import tomllib  # Python 3.11+ (read-only)

toml_str = """
[project]
name = "my-package"
version = "1.0.0"
requires-python = ">=3.11"

[project.dependencies]
requests = ">=2.28"
pydantic = ">=2.0"

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --tb=short"

[[project.authors]]
name = "Alice"
email = "alice@example.com"
"""

config = tomllib.loads(toml_str)
print(config["project"]["name"])  # "my-package"
print(config["tool"]["pytest"]["ini_options"]["addopts"])

# Read from file (must open in binary mode for tomllib)
# with open("pyproject.toml", "rb") as f:
#     config = tomllib.load(f)

# For WRITING TOML, use the third-party 'tomli-w' package:
# import tomli_w
# with open("config.toml", "wb") as f:
#     tomli_w.dump(config, f)


# ============================================================
# YAML — human-friendly config format (requires PyYAML)
# ============================================================
# pip install pyyaml
import yaml

yaml_str = """
database:
  host: localhost
  port: 5432
  credentials:
    username: admin
    password: secret123

services:
  - name: api
    replicas: 3
    ports:
      - 8080
      - 8443
  - name: worker
    replicas: 5
"""

# ALWAYS use safe_load — never yaml.load() without SafeLoader
# yaml.load() can execute arbitrary Python code from YAML!
config = yaml.safe_load(yaml_str)
print(config["database"]["host"])     # "localhost"
print(config["services"][0]["name"])  # "api"

# Write YAML
output = yaml.dump(
    config,
    default_flow_style=False,  # Block style (readable)
    sort_keys=False,           # Preserve insertion order
    allow_unicode=True,
)
print(output)

# Safe dump — only outputs standard YAML types
with open("config.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f)
```

🏋️ Practice Exercises

  1. Write a JSONSerializer class with a custom encoder that handles datetime, Decimal, UUID, set, Path, dataclass, and Enum types. Add a corresponding object_hook decoder that round-trips datetimes and UUIDs back to their original types.

  2. Build a CSV processing pipeline: read a CSV file of sales records, filter rows by date range, compute aggregates (total revenue, average order value per category), and write the results to a new CSV. Handle malformed rows gracefully with error logging.

  3. Create a config file loader that auto-detects the format from the file extension (.json, .yaml, .toml) and returns a unified dictionary. Support environment variable interpolation (e.g., ${DATABASE_URL} in values gets replaced with the actual env var).

  4. Demonstrate the security risk of pickle: write a malicious pickle payload that executes os.system("echo HACKED") when loaded. Then implement a RestrictedUnpickler (subclass pickle.Unpickler) that only allows specific safe classes.

  5. Build a simple document store: a class that saves/loads Python dicts as JSON files in a directory, with get(id), put(id, data), delete(id), and list_all() methods. Use atomic writes to prevent corruption.

⚠️ Common Mistakes

  • Using pickle for data exchange between systems or persisting user-supplied data. Pickle is insecure (arbitrary code execution), Python-version-specific, and not human-readable. Use JSON for interchange and structured formats for configs.

  • Calling yaml.load(data) without specifying Loader=yaml.SafeLoader. The default loader can instantiate arbitrary Python objects from YAML tags like !!python/object/apply:os.system [rm -rf /]. Always use yaml.safe_load().

  • Not passing newline='' when opening CSV files. The csv module handles line endings itself. Without newline='', you get extra blank rows on Windows because Python's universal newline translation converts the \r\n the csv module writes into \r\r\n.

  • Assuming CSV values are typed. Everything from csv.reader and csv.DictReader is a string. You must explicitly cast: int(row['age']), float(row['price']), datetime.fromisoformat(row['date']). Missing casts cause subtle bugs in comparisons and arithmetic.

  • Using json.dumps(obj, default=str) as a universal fix. While convenient, it silently converts unknown objects to their str() representation, which often can't be deserialized back. Write a proper custom encoder that handles each type explicitly.
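The information loss from default=str is easy to demonstrate:

```python
import json
from datetime import datetime

stamp = datetime(2024, 1, 15, 9, 0)

# default=str silently flattens the datetime to a plain string...
blob = json.dumps({"ts": stamp}, default=str)
restored = json.loads(blob)
assert isinstance(restored["ts"], str)  # type information is gone

# ...so what used to be a datetime comparison now raises TypeError.
try:
    restored["ts"] < stamp
except TypeError:
    print("cannot compare str to datetime")
```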

