Serialization: JSON, CSV, Pickle, YAML & TOML
📖 Concept
Serialization converts Python objects into a storable or transmittable format; deserialization is the reverse. Python ships with robust support for JSON, CSV, and pickle in the standard library, with TOML reading added in Python 3.11 (`tomllib`). YAML requires the third-party PyYAML package.
Format comparison:
| Format | Human Readable | Types | Security | Use Case |
|---|---|---|---|---|
| JSON | Yes | Strings, numbers, bools, null, arrays, objects | Safe | APIs, configs, data exchange |
| CSV | Yes | Strings only (everything is text) | Safe | Tabular data, spreadsheets |
| pickle | No (binary) | Any Python object | DANGEROUS | Internal caching, ML models |
| TOML | Yes | Strings, ints, floats, bools, datetimes, arrays, tables | Safe | Config files (pyproject.toml) |
| YAML | Yes | Rich types including dates, nulls | Risky (`safe_load` only) | Config files, Kubernetes, CI/CD |
Critical security warnings:
- **Never unpickle untrusted data.** `pickle.loads()` can execute arbitrary code — it's a remote code execution vulnerability. Only use pickle for data you created yourself.
- **Always use `yaml.safe_load()`**, never `yaml.load()` without `Loader=SafeLoader`. The default loader can construct arbitrary Python objects from YAML.
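To make the pickle warning concrete, here is a minimal sketch of how a pickled payload runs code at load time. A harmless `print` stands in for an attacker's payload (the `Evil` class name is purely illustrative — any callable, including `os.system`, could be substituted):

```python
import pickle

class Evil:
    # __reduce__ tells pickle how to "reconstruct" the object.
    # Here reconstruction means calling print(...) — pickle will
    # invoke whatever callable we return, with these arguments.
    def __reduce__(self):
        return (print, ("code ran during unpickling!",))

payload = pickle.dumps(Evil())
pickle.loads(payload)  # the print call executes during deserialization
```

This is why unpickling untrusted bytes is equivalent to running untrusted code.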
JSON tips: Python's `json` module only handles basic types. For `datetime`, `Decimal`, `UUID`, etc., provide a custom encoder via the `default=` parameter or by subclassing `JSONEncoder`. The third-party `orjson` and `ujson` libraries are often 5-10x faster for large payloads.
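Custom encoding only solves half the problem: the decoder also needs a way to recognize the special values. One common sketch is to wrap them in tagged dicts (the `__dt__` key here is an arbitrary convention, not part of the `json` API) and undo the wrapping with `object_hook`:

```python
import json
from datetime import datetime

def encode(obj):
    # Wrap datetimes in a tagged dict so the decoder can spot them.
    if isinstance(obj, datetime):
        return {"__dt__": obj.isoformat()}
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")

def decode(d):
    # object_hook is called for every JSON object (dict) during parsing.
    if set(d) == {"__dt__"}:
        return datetime.fromisoformat(d["__dt__"])
    return d

original = {"event": "deploy", "at": datetime(2024, 5, 1, 12, 30)}
s = json.dumps(original, default=encode)
restored = json.loads(s, object_hook=decode)
assert restored == original  # the datetime survives the round trip
```

The same pattern extends to `UUID`, `Decimal`, and sets: one tag per type.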
CSV tips: Always use the `csv` module rather than splitting on commas — it handles quoting, escaping, and multi-line fields correctly. `DictReader`/`DictWriter` provide column-name access.
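A quick illustration of why naive splitting fails, using an in-memory string so no file is needed:

```python
import csv
import io

line = 'Widget,"1,200",Sales\n'

# Naive split breaks the quoted field "1,200" into two pieces.
print(line.strip().split(","))              # ['Widget', '"1', '200"', 'Sales']

# csv.reader respects the quoting rules and keeps the field intact.
print(next(csv.reader(io.StringIO(line))))  # ['Widget', '1,200', 'Sales']
```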
💻 Code Example
```python
# ============================================================
# JSON — the universal data interchange format
# ============================================================
import json
from datetime import datetime, date
from decimal import Decimal
from pathlib import Path
from uuid import UUID, uuid4

# Basic usage
data = {"name": "Alice", "scores": [95, 87, 92], "active": True}
json_str = json.dumps(data, indent=2)  # Python → JSON string
parsed = json.loads(json_str)          # JSON string → Python

# Write/read JSON files
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)  # Pretty-print, allow unicode

with open("data.json", "r", encoding="utf-8") as f:
    loaded = json.load(f)

# Custom encoder for types JSON doesn't support natively
class AppJSONEncoder(json.JSONEncoder):
    """Handle datetime, Decimal, UUID, set, bytes, Path."""

    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)  # Preserve precision (float would lose it)
        if isinstance(obj, UUID):
            return str(obj)
        if isinstance(obj, set):
            return sorted(list(obj))  # Sets aren't ordered
        if isinstance(obj, bytes):
            import base64
            return base64.b64encode(obj).decode("ascii")
        if isinstance(obj, Path):
            return str(obj)
        return super().default(obj)  # Raise TypeError for unknown types

# Usage
record = {
    "id": uuid4(),
    "created": datetime.now(),
    "price": Decimal("19.99"),
    "tags": {"python", "coding"},
}
print(json.dumps(record, cls=AppJSONEncoder, indent=2))

# Alternative: use the default= parameter (simpler for one-offs)
json.dumps(record, default=str)  # Converts anything unknown to string


# ============================================================
# CSV — tabular data
# ============================================================
import csv

# Writing CSV
employees = [
    {"name": "Alice", "dept": "Engineering", "salary": 120000},
    {"name": "Bob", "dept": "Marketing", "salary": 95000},
    {"name": 'Charlie "Chuck"', "dept": "Sales", "salary": 88000},
]

with open("employees.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "dept", "salary"])
    writer.writeheader()
    writer.writerows(employees)

# Reading CSV
with open("employees.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # All values are strings — cast as needed
        name = row["name"]
        salary = int(row["salary"])
        print(f"{name}: salary={salary:,}")

# BAD — never split on comma manually
# line.split(",")  # Breaks on: 'Charlie "Chuck"', fields with commas, etc.

# GOOD — csv module handles quoting, escaping, multi-line fields
# Also handles different dialects (excel, unix, custom)


# ============================================================
# pickle — Python-specific binary serialization
# ============================================================
import pickle

# WARNING: Never unpickle data from untrusted sources!
# pickle.loads() can execute arbitrary code.

class TrainedModel:
    """Simulate an ML model with fitted parameters."""
    def __init__(self, weights, accuracy):
        self.weights = weights
        self.accuracy = accuracy

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

model = TrainedModel(weights=[0.5, 0.3, 0.2], accuracy=0.95)

# Serialize to bytes
pickled = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)

# Serialize to file
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize (ONLY from trusted sources!)
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
    print(loaded_model.predict([1, 2, 3]))  # 0.5*1 + 0.3*2 + 0.2*3 = 1.7

# Customize pickling with __getstate__ / __setstate__
class DatabaseConnection:
    """Connections can't be pickled — customize the behavior."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self._connection = self._connect()  # Not serializable

    def _connect(self):
        return f"<Connection to {self.host}:{self.port}>"

    def __getstate__(self):
        """Called during pickling — return what to serialize."""
        state = self.__dict__.copy()
        del state["_connection"]  # Remove non-serializable attribute
        return state

    def __setstate__(self, state):
        """Called during unpickling — restore the object."""
        self.__dict__.update(state)
        self._connection = self._connect()  # Reconnect


# ============================================================
# TOML — Python 3.11+ built-in reader (pyproject.toml, configs)
# ============================================================
import tomllib  # Python 3.11+ (read-only)

toml_str = """
[project]
name = "my-package"
version = "1.0.0"
requires-python = ">=3.11"

[project.dependencies]
requests = ">=2.28"
pydantic = ">=2.0"

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --tb=short"

[[project.authors]]
name = "Alice"
email = "alice@example.com"
"""

config = tomllib.loads(toml_str)
print(config["project"]["name"])  # "my-package"
print(config["tool"]["pytest"]["ini_options"]["addopts"])

# Read from file (must open in binary mode for tomllib)
# with open("pyproject.toml", "rb") as f:
#     config = tomllib.load(f)

# For WRITING TOML, use the third-party 'tomli-w' package:
# import tomli_w
# with open("config.toml", "wb") as f:
#     tomli_w.dump(config, f)


# ============================================================
# YAML — human-friendly config format (requires PyYAML)
# ============================================================
# pip install pyyaml
import yaml

yaml_str = """
database:
  host: localhost
  port: 5432
  credentials:
    username: admin
    password: secret123

services:
  - name: api
    replicas: 3
    ports:
      - 8080
      - 8443
  - name: worker
    replicas: 5
"""

# ALWAYS use safe_load — never yaml.load() without SafeLoader
# yaml.load() can execute arbitrary Python code from YAML!
config = yaml.safe_load(yaml_str)
print(config["database"]["host"])     # "localhost"
print(config["services"][0]["name"])  # "api"

# Write YAML
output = yaml.dump(
    config,
    default_flow_style=False,  # Block style (readable)
    sort_keys=False,           # Preserve insertion order
    allow_unicode=True,
)
print(output)

# Safe dump — only outputs standard YAML types
with open("config.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f)
```
🏋️ Practice Exercise
Exercises:
1. Write a `JSONSerializer` class with a custom encoder that handles `datetime`, `Decimal`, `UUID`, `set`, `Path`, dataclass, and `Enum` types. Add a corresponding `object_hook` decoder that round-trips datetimes and UUIDs back to their original types.
2. Build a CSV processing pipeline: read a CSV file of sales records, filter rows by date range, compute aggregates (total revenue, average order value per category), and write the results to a new CSV. Handle malformed rows gracefully with error logging.
3. Create a config file loader that auto-detects the format from the file extension (`.json`, `.yaml`, `.toml`) and returns a unified dictionary. Support environment variable interpolation (e.g., `${DATABASE_URL}` in values gets replaced with the actual env var).
4. Demonstrate the security risk of pickle: write a malicious pickle payload that executes `os.system("echo HACKED")` when loaded. Then implement a `RestrictedUnpickler` (subclass `pickle.Unpickler`) that only allows specific safe classes.
5. Build a simple document store: a class that saves/loads Python dicts as JSON files in a directory, with `get(id)`, `put(id, data)`, `delete(id)`, and `list_all()` methods. Use atomic writes to prevent corruption.
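For the atomic-write requirement in the last exercise, one common pattern (a sketch — `atomic_write_json` is a hypothetical helper name) is to write to a temporary file in the same directory and then `os.replace` it into place:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the SAME directory as the target so the
    # final rename stays on one filesystem (a requirement for atomicity).
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on failure
        raise
```

If the process crashes mid-write, the target file still holds the previous complete version; only the temp file is lost.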
⚠️ Common Mistakes
- Using `pickle` for data exchange between systems or persisting user-supplied data. Pickle is insecure (arbitrary code execution), Python-version-specific, and not human-readable. Use JSON for interchange and structured formats for configs.
- Calling `yaml.load(data)` without specifying `Loader=yaml.SafeLoader`. The default loader can instantiate arbitrary Python objects from YAML tags like `!!python/object/apply:os.system ["rm -rf /"]`. Always use `yaml.safe_load()`.
- Not passing `newline=''` when opening CSV files. The `csv` module handles line endings internally. Without `newline=''`, you get extra blank rows on Windows because Python's universal newline translation doubles the `\r`.
- Assuming CSV values are typed. Everything from `csv.reader` and `csv.DictReader` is a string. You must explicitly cast: `int(row['age'])`, `float(row['price'])`, `datetime.fromisoformat(row['date'])`. Missing casts cause subtle bugs in comparisons and arithmetic.
- Using `json.dumps(obj, default=str)` as a universal fix. While convenient, it silently converts unknown objects to their `str()` representation, which often can't be deserialized back. Write a proper custom encoder that handles each type explicitly.
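The last pitfall is easy to demonstrate: after a `default=str` round trip, the type information is simply gone:

```python
import json
from datetime import datetime

record = {"ts": datetime(2024, 1, 1), "tags": {3, 1, 2}}

s = json.dumps(record, default=str)
back = json.loads(s)

print(type(back["ts"]).__name__)  # 'str' — no longer a datetime
print(type(back["tags"]).__name__)  # 'str' — the set came back as plain text
```

Nothing in the JSON output records that `"2024-01-01 00:00:00"` used to be a `datetime`, so no decoder can restore it automatically.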