Serialization: JSON, CSV, Pickle, YAML & TOML


📖 Concept

Serialization is converting Python objects to a storable/transmittable format, and deserialization is the reverse. Python ships with robust support for JSON, CSV, and pickle in the standard library, with TOML reading added in Python 3.11. YAML requires the third-party PyYAML package.
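As a minimal illustration of the round trip (the dictionary here is arbitrary):

```python
import json

# Serialize: Python object → JSON text
obj = {"lang": "python", "version": 3}
text = json.dumps(obj)

# Deserialize: JSON text → an equal Python object
assert json.loads(text) == obj
```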

Format comparison:

| Format | Human Readable | Types                                                    | Security               | Use Case                       |
|--------|----------------|----------------------------------------------------------|------------------------|--------------------------------|
| JSON   | Yes            | Strings, numbers, bools, null, arrays, objects           | Safe                   | APIs, configs, data exchange   |
| CSV    | Yes            | Strings only (everything is text)                        | Safe                   | Tabular data, spreadsheets     |
| pickle | No (binary)    | Any Python object                                        | DANGEROUS              | Internal caching, ML models    |
| TOML   | Yes            | Strings, ints, floats, bools, datetimes, arrays, tables  | Safe                   | Config files (pyproject.toml)  |
| YAML   | Yes            | Rich types including dates, nulls                        | Risky (safe_load only) | Config files, Kubernetes, CI/CD |

Critical security warnings:

  • Never unpickle untrusted data. pickle.loads() can execute arbitrary code — it's a remote code execution vulnerability. Only use pickle for data you created yourself.
  • Always use yaml.safe_load(), never yaml.load() without Loader=SafeLoader. The default loader can construct arbitrary Python objects from YAML.
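When you must load pickles at all, the pickle documentation's "Restricting Globals" section recommends subclassing pickle.Unpickler and overriding find_class with an allow-list. A minimal sketch (the allow-list below is illustrative, not exhaustive):

```python
import builtins
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Only resolve a small allow-list of safe built-in classes."""

    SAFE_BUILTINS = {"dict", "list", "set", "tuple", "str", "int", "float", "bool"}

    def find_class(self, module, name):
        # Permit only whitelisted names from builtins; reject everything else.
        if module == "builtins" and name in self.SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads with an allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers round-trip fine (no globals are referenced)...
assert restricted_loads(pickle.dumps({"a": [1, 2]})) == {"a": [1, 2]}

# ...but a payload referencing os.system is rejected instead of resolved.
import os
try:
    restricted_loads(pickle.dumps(os.system))
except pickle.UnpicklingError as e:
    print("blocked:", e)
```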

JSON tips: Python's json module only handles basic types. For datetime, Decimal, UUID, etc., provide a custom encoder via the default= parameter or by subclassing JSONEncoder. The third-party orjson and ujson libraries are roughly 5-10x faster for large payloads.
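For deserialization, the mirror image of default= is the object_hook= parameter, which is called for every decoded JSON object. One way to get lossless round-trips is to tag the encoded value; the "__datetime__" key below is just a convention for this sketch, not a standard:

```python
import json
from datetime import datetime

def encode_extra(obj):
    # Tag datetimes so the decoder can recognise and restore them.
    if isinstance(obj, datetime):
        return {"__datetime__": obj.isoformat()}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

def decode_extra(d):
    # Called for every decoded JSON object (dict).
    if "__datetime__" in d:
        return datetime.fromisoformat(d["__datetime__"])
    return d

event = {"name": "deploy", "at": datetime(2024, 5, 1, 12, 30)}
blob = json.dumps(event, default=encode_extra)
restored = json.loads(blob, object_hook=decode_extra)
assert restored == event  # the datetime survives the round trip
```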

CSV tips: Always use the csv module rather than splitting on commas — it handles quoting, escaping, and multi-line fields correctly. DictReader/DictWriter provide column-name access.
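For files whose delimiter you don't control, csv.Sniffer can guess the dialect from a sample. It is a heuristic, so verify the result on your own data; the semicolon-delimited sample below is made up for illustration:

```python
import csv
import io

sample = "name;dept;salary\nAlice;Engineering;120000\nBob;Marketing;95000\n"

# Sniff the dialect (delimiter, quoting) from a sample of the file.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ';'

rows = list(csv.DictReader(io.StringIO(sample), dialect=dialect))
assert rows[0]["dept"] == "Engineering"
```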

💻 Code Example

```python
# ============================================================
# JSON — the universal data interchange format
# ============================================================
import base64
import json
from datetime import datetime, date
from decimal import Decimal
from pathlib import Path
from uuid import UUID, uuid4

# Basic usage
data = {"name": "Alice", "scores": [95, 87, 92], "active": True}
json_str = json.dumps(data, indent=2)  # Python → JSON string
parsed = json.loads(json_str)          # JSON string → Python

# Write/read JSON files
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)  # Pretty-print, allow unicode

with open("data.json", "r", encoding="utf-8") as f:
    loaded = json.load(f)

# Custom encoder for types JSON doesn't support natively
class AppJSONEncoder(json.JSONEncoder):
    """Handle datetime, Decimal, UUID, set, bytes, Path."""

    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)  # Preserve precision (float would lose it)
        if isinstance(obj, UUID):
            return str(obj)
        if isinstance(obj, set):
            return sorted(obj)  # Sets aren't ordered
        if isinstance(obj, bytes):
            return base64.b64encode(obj).decode("ascii")
        if isinstance(obj, Path):
            return str(obj)
        return super().default(obj)  # Raise TypeError for unknown types

# Usage
record = {
    "id": uuid4(),
    "created": datetime.now(),
    "price": Decimal("19.99"),
    "tags": {"python", "coding"},
}
print(json.dumps(record, cls=AppJSONEncoder, indent=2))

# Alternative: use the default= parameter (simpler for one-offs)
json.dumps(record, default=str)  # Converts anything unknown to its str() form


# ============================================================
# CSV — tabular data
# ============================================================
import csv

# Writing CSV
employees = [
    {"name": "Alice", "dept": "Engineering", "salary": 120000},
    {"name": "Bob", "dept": "Marketing", "salary": 95000},
    {"name": 'Charlie "Chuck"', "dept": "Sales", "salary": 88000},
]

with open("employees.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "dept", "salary"])
    writer.writeheader()
    writer.writerows(employees)

# Reading CSV
with open("employees.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # All values are strings — cast as needed
        name = row["name"]
        salary = int(row["salary"])
        print(f"{name}: salary={salary:,}")

# BAD — never split on comma manually
# line.split(",")  # Breaks on: 'Charlie "Chuck"', fields with commas, etc.

# GOOD — the csv module handles quoting, escaping, multi-line fields
# Also handles different dialects (excel, unix, custom)


# ============================================================
# pickle — Python-specific binary serialization
# ============================================================
import pickle

# WARNING: Never unpickle data from untrusted sources!
# pickle.loads() can execute arbitrary code.

class TrainedModel:
    """Simulate an ML model with fitted parameters."""
    def __init__(self, weights, accuracy):
        self.weights = weights
        self.accuracy = accuracy

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

model = TrainedModel(weights=[0.5, 0.3, 0.2], accuracy=0.95)

# Serialize to bytes
pickled = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)

# Serialize to file
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize (ONLY from trusted sources!)
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
    print(loaded_model.predict([1, 2, 3]))  # 0.5*1 + 0.3*2 + 0.2*3 = 1.7

# Customize pickling with __getstate__ / __setstate__
class DatabaseConnection:
    """Connections can't be pickled — customize the behavior."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self._connection = self._connect()  # Not serializable

    def _connect(self):
        return f"<Connection to {self.host}:{self.port}>"

    def __getstate__(self):
        """Called during pickling — return what to serialize."""
        state = self.__dict__.copy()
        del state["_connection"]  # Remove non-serializable attribute
        return state

    def __setstate__(self, state):
        """Called during unpickling — restore the object."""
        self.__dict__.update(state)
        self._connection = self._connect()  # Reconnect


# ============================================================
# TOML — Python 3.11+ built-in reader (pyproject.toml, configs)
# ============================================================
import tomllib  # Python 3.11+ (read-only)

toml_str = """
[project]
name = "my-package"
version = "1.0.0"
requires-python = ">=3.11"

[project.dependencies]
requests = ">=2.28"
pydantic = ">=2.0"

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --tb=short"

[[project.authors]]
name = "Alice"
email = "alice@example.com"
"""

config = tomllib.loads(toml_str)
print(config["project"]["name"])  # "my-package"
print(config["tool"]["pytest"]["ini_options"]["addopts"])

# Read from file (must open in binary mode for tomllib)
# with open("pyproject.toml", "rb") as f:
#     config = tomllib.load(f)

# For WRITING TOML, use the third-party 'tomli-w' package:
# import tomli_w
# with open("config.toml", "wb") as f:
#     tomli_w.dump(config, f)


# ============================================================
# YAML — human-friendly config format (requires PyYAML)
# ============================================================
# pip install pyyaml
import yaml

yaml_str = """
database:
  host: localhost
  port: 5432
  credentials:
    username: admin
    password: secret123

services:
  - name: api
    replicas: 3
    ports:
      - 8080
      - 8443
  - name: worker
    replicas: 5
"""

# ALWAYS use safe_load — never yaml.load() without SafeLoader
# yaml.load() can execute arbitrary Python code from YAML!
config = yaml.safe_load(yaml_str)
print(config["database"]["host"])     # "localhost"
print(config["services"][0]["name"])  # "api"

# Write YAML
output = yaml.dump(
    config,
    default_flow_style=False,  # Block style (readable)
    sort_keys=False,           # Preserve insertion order
    allow_unicode=True,
)
print(output)

# Safe dump — only outputs standard YAML types
with open("config.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f)
```

🏋️ Practice Exercises

  1. Write a JSONSerializer class with a custom encoder that handles datetime, Decimal, UUID, set, Path, dataclass, and Enum types. Add a corresponding object_hook decoder that round-trips datetimes and UUIDs back to their original types.

  2. Build a CSV processing pipeline: read a CSV file of sales records, filter rows by date range, compute aggregates (total revenue, average order value per category), and write the results to a new CSV. Handle malformed rows gracefully with error logging.

  3. Create a config file loader that auto-detects the format from the file extension (.json, .yaml, .toml) and returns a unified dictionary. Support environment variable interpolation (e.g., ${DATABASE_URL} in values gets replaced with the actual env var).

  4. Demonstrate the security risk of pickle: write a malicious pickle payload that executes os.system("echo HACKED") when loaded. Then implement a RestrictedUnpickler (subclass pickle.Unpickler) that only allows specific safe classes.

  5. Build a simple document store: a class that saves/loads Python dicts as JSON files in a directory, with get(id), put(id, data), delete(id), and list_all() methods. Use atomic writes to prevent corruption.

⚠️ Common Mistakes

  • Using pickle for data exchange between systems or persisting user-supplied data. Pickle is insecure (arbitrary code execution), Python-version-specific, and not human-readable. Use JSON for interchange and structured formats for configs.

  • Calling yaml.load(data) without specifying Loader=yaml.SafeLoader. The default loader can instantiate arbitrary Python objects from YAML tags like !!python/object/apply:os.system [rm -rf /]. Always use yaml.safe_load().

  • Not passing newline='' when opening CSV files. The csv module handles line endings itself. Without newline='', you get extra blank rows on Windows because Python's universal newline translation converts the \r\n the csv module writes into \r\r\n.

  • Assuming CSV values are typed. Everything from csv.reader and csv.DictReader is a string. You must explicitly cast: int(row['age']), float(row['price']), datetime.fromisoformat(row['date']). Missing casts cause subtle bugs in comparisons and arithmetic.

  • Using json.dumps(obj, default=str) as a universal fix. While convenient, it silently converts unknown objects to their str() representation, which often can't be deserialized back. Write a proper custom encoder that handles each type explicitly.
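The information loss from default=str is easy to demonstrate:

```python
import json
from datetime import datetime

stamp = datetime(2024, 1, 15, 9, 0)

# default=str silently flattens the datetime to a plain string...
blob = json.dumps({"ts": stamp}, default=str)
restored = json.loads(blob)
assert isinstance(restored["ts"], str)  # type information is gone

# ...so what used to be a datetime comparison now raises TypeError.
try:
    restored["ts"] < stamp
except TypeError:
    print("cannot compare str to datetime")
```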

