Data Visualization


📖 Concept

Data visualization transforms raw numbers into insight. Python's visualization ecosystem is built on three major libraries: Matplotlib (the foundational layer), Seaborn (statistical visualization built on Matplotlib), and Plotly (interactive, web-based charts). Understanding when and how to use each is essential for effective data communication.

Matplotlib is Python's original plotting library and remains the most flexible. It uses a hierarchical object model: Figure (the canvas) contains one or more Axes (individual plots), which contain plot elements (lines, bars, text). There are two APIs:

  • pyplot API (plt.plot(), plt.bar()) — stateful, MATLAB-like, convenient for quick plots
  • Object-oriented API (fig, ax = plt.subplots()) — explicit, preferred for production code and multi-panel figures
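
The difference between the two APIs is easiest to see side by side. The following is a minimal sketch with made-up data; the variable names and the `Agg` backend (chosen so it runs without a display) are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# pyplot (stateful) style: every call acts on the "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title("pyplot API")
pyplot_ax = plt.gca()  # handle to the implicit current axes

# Object-oriented style: hold explicit references to the Figure and each Axes
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))
ax_left.plot(x, y)
ax_left.set_title("line")
ax_right.bar(x, y)
ax_right.set_title("bars")
fig.tight_layout()
```

With explicit `ax_left` and `ax_right` handles there is no ambiguity about which panel a call modifies, which is why the object-oriented style tends to scale better to multi-panel figures.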

Seaborn provides high-level functions for statistical visualization. Built on Matplotlib, it offers beautiful defaults, automatic statistical aggregation, and tight integration with pandas DataFrames. Key function categories:

  • Relational: scatterplot(), lineplot() for continuous relationships
  • Categorical: boxplot(), violinplot(), barplot(), stripplot() for category comparisons
  • Distribution: histplot(), kdeplot(), ecdfplot() for understanding data spread
  • Matrix: heatmap(), clustermap() for correlation and similarity matrices

Plotly produces interactive charts that users can hover over, zoom, and filter. Plotly Express provides a concise API similar to Seaborn's, while the graph_objects module offers fine-grained control. Plotly is well suited to dashboards, web applications, and exploratory analysis in Jupyter notebooks.

Choosing the right chart type:

Goal            | Chart Types
Distribution    | Histogram, KDE, Box plot, Violin
Comparison      | Bar chart, Grouped bar, Dot plot
Relationship    | Scatter plot, Bubble chart, Heatmap
Composition     | Pie chart (sparingly), Stacked bar, Treemap
Trend over time | Line chart, Area chart

Best practices: Always label axes and provide a title. Use colorblind-friendly palettes. Keep the data-to-ink ratio high (minimize chartjunk). Choose chart types based on the story you want to tell, not visual appeal.
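
The practices above can be sketched in a few lines. The region names and revenue figures are hypothetical, and Seaborn's "colorblind" palette is just one reasonable choice of accessible colors.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

regions = ["North", "South", "West"]  # hypothetical categories
revenue = [120, 95, 140]              # hypothetical values

# One color per category from Seaborn's colorblind-friendly palette
palette = sns.color_palette("colorblind", n_colors=len(regions))

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(regions, revenue, color=palette)

# Label both axes (with units) and give the chart a title
ax.set_xlabel("Region")
ax.set_ylabel("Revenue (thousand $)")
ax.set_title("Revenue by Region")

# Drop the top and right borders, which carry no information
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

fig.tight_layout()
```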

💻 Code Example

# ============================================================
# Data Visualization: Matplotlib, Seaborn, and Plotly
# ============================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import gaussian_kde

# --- Sample Data ---
np.random.seed(42)
n = 200
df = pd.DataFrame({
    "age": np.random.normal(35, 10, n).astype(int).clip(18, 65),
    "salary": np.random.normal(75000, 20000, n).clip(30000, 150000),
    "experience": np.random.normal(8, 4, n).clip(0, 30),
    "department": np.random.choice(
        ["Engineering", "Marketing", "Sales", "HR"], n,
        p=[0.4, 0.25, 0.2, 0.15]
    ),
    "satisfaction": np.random.uniform(1, 10, n).round(1),
})
df["salary"] = df["salary"] + df["experience"] * 2000  # add correlation


# ============================================================
# 1. Matplotlib: Object-Oriented API (production style)
# ============================================================
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle("Employee Dashboard", fontsize=16, fontweight="bold")

# Panel 1: Histogram with KDE overlay
ax = axes[0, 0]
ax.hist(df["salary"], bins=30, edgecolor="white", alpha=0.7,
        color="#2196F3", density=True, label="Histogram")
# Overlay a kernel density estimate computed with SciPy's gaussian_kde
kde = gaussian_kde(df["salary"])
x_range = np.linspace(df["salary"].min(), df["salary"].max(), 200)
ax.plot(x_range, kde(x_range), color="#FF5722", linewidth=2, label="KDE")
ax.set_xlabel("Salary ($)")
ax.set_ylabel("Density")
ax.set_title("Salary Distribution")
ax.xaxis.set_major_formatter(ticker.FuncFormatter(
    lambda x, _: f"${x/1000:.0f}K"
))
ax.legend()

# Panel 2: Scatter plot with regression line
ax = axes[0, 1]
colors = {"Engineering": "#2196F3", "Marketing": "#4CAF50",
          "Sales": "#FF9800", "HR": "#9C27B0"}
for dept, group in df.groupby("department"):
    ax.scatter(group["experience"], group["salary"],
               alpha=0.6, label=dept, color=colors[dept], s=30)
# Add a linear trendline (degree-1 least-squares fit)
z = np.polyfit(df["experience"], df["salary"], 1)
p = np.poly1d(z)
x_sorted = np.sort(df["experience"])
ax.plot(x_sorted, p(x_sorted),
        "r--", linewidth=2, label=f"Trend (slope={z[0]:,.0f})")
ax.set_xlabel("Years of Experience")
ax.set_ylabel("Salary ($)")
ax.set_title("Experience vs Salary")
ax.legend(fontsize=8)

# Panel 3: Box plot by department, ordered by median salary
ax = axes[1, 0]
dept_order = df.groupby("department")["salary"].median().sort_values().index
bp = ax.boxplot(
    [df.loc[df["department"] == d, "salary"] for d in dept_order],
    patch_artist=True, notch=True,
)
# Set tick labels separately: boxplot's `labels` argument is deprecated
# in recent Matplotlib releases in favor of `tick_labels`
ax.set_xticklabels(dept_order)
for patch, dept in zip(bp["boxes"], dept_order):
    patch.set_facecolor(colors[dept])
    patch.set_alpha(0.7)
ax.set_ylabel("Salary ($)")
ax.set_title("Salary by Department")
ax.tick_params(axis="x", rotation=15)

# Panel 4: Bar chart of average satisfaction (error bars show 1 std)
ax = axes[1, 1]
dept_satisfaction = (
    df.groupby("department")["satisfaction"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=True)
)
bars = ax.barh(dept_satisfaction.index, dept_satisfaction["mean"],
               xerr=dept_satisfaction["std"], capsize=5,
               color=[colors[d] for d in dept_satisfaction.index],
               alpha=0.8, edgecolor="white")
ax.set_xlabel("Satisfaction Score (1-10)")
ax.set_title("Average Satisfaction by Department")
ax.set_xlim(0, 10)
for bar, val in zip(bars, dept_satisfaction["mean"]):
    ax.text(val + 0.3, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}", va="center", fontweight="bold")

plt.tight_layout()
plt.savefig("employee_dashboard.png", dpi=150, bbox_inches="tight")
plt.show()


# ============================================================
# 2. Seaborn: Statistical Visualization
# ============================================================
sns.set_theme(style="whitegrid", palette="husl", font_scale=1.1)

# FacetGrid: distribution per department
g = sns.FacetGrid(df, col="department", col_wrap=2,
                  height=4, aspect=1.2)
g.map_dataframe(sns.histplot, x="salary", kde=True, bins=20)
g.set_titles("{col_name}")
g.set_axis_labels("Salary ($)", "Count")
plt.tight_layout()
plt.show()

# Pair plot: multi-variable relationships
sns.pairplot(df[["age", "salary", "experience", "satisfaction",
                 "department"]],
             hue="department", diag_kind="kde",
             plot_kws={"alpha": 0.5, "s": 20})
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()

# Heatmap: correlation matrix
fig, ax = plt.subplots(figsize=(8, 6))
numeric_cols = df.select_dtypes(include=[np.number])
corr = numeric_cols.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # upper triangle mask
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm",
            center=0, square=True, linewidths=0.5, ax=ax,
            cbar_kws={"shrink": 0.8})
ax.set_title("Correlation Matrix")
plt.tight_layout()
plt.show()

# Violin + strip plot combo
fig, ax = plt.subplots(figsize=(10, 6))
sns.violinplot(data=df, x="department", y="salary", inner=None,
               alpha=0.3, ax=ax)
sns.stripplot(data=df, x="department", y="salary", size=3,
              alpha=0.6, jitter=True, ax=ax)
ax.set_title("Salary Distribution by Department (Violin + Strip)")
plt.tight_layout()
plt.show()


# ============================================================
# 3. Plotly Express: Interactive Charts
# ============================================================
# NOTE: Plotly outputs render in Jupyter notebooks or as HTML files.
# Uncomment the lines below to generate interactive charts.

# import plotly.express as px
#
# # Interactive scatter with hover data
# fig = px.scatter(
#     df, x="experience", y="salary", color="department",
#     size="satisfaction", hover_data=["age"],
#     title="Interactive: Experience vs Salary",
#     labels={"experience": "Years of Experience",
#             "salary": "Annual Salary ($)"},
#     template="plotly_white",
# )
# fig.update_traces(marker=dict(opacity=0.7, line=dict(width=0.5)))
# fig.show()  # opens in browser or renders in notebook
#
# # Animated scatter over age groups
# df["age_bin"] = pd.cut(df["age"], bins=5).astype(str)
# fig = px.scatter(
#     df, x="experience", y="salary", color="department",
#     animation_frame="age_bin", size="satisfaction",
#     range_y=[20000, 180000], range_x=[0, 35],
#     title="Salary by Experience (Animated by Age Group)",
# )
# fig.show()


# ============================================================
# 4. Styling & Best Practices
# ============================================================
# Apply a style temporarily with a context manager
with plt.style.context("seaborn-v0_8-paper"):
    fig, ax = plt.subplots(figsize=(8, 5))
    dept_counts = df["department"].value_counts()
    colors_list = [colors.get(d, "#999") for d in dept_counts.index]
    bars = ax.bar(dept_counts.index, dept_counts.values,
                  color=colors_list, edgecolor="white", linewidth=1.5)

    # Annotate bars with values
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width() / 2., height + 1,
                f"{int(height)}", ha="center", va="bottom",
                fontweight="bold", fontsize=12)

    ax.set_ylabel("Number of Employees")
    ax.set_title("Headcount by Department",
                 fontsize=14, fontweight="bold")
    ax.spines[["top", "right"]].set_visible(False)  # remove chartjunk
    plt.tight_layout()
    plt.savefig("headcount.png", dpi=150, bbox_inches="tight",
                facecolor="white")
    plt.show()

🏋️ Practice Exercise

  1. Create a 2x2 subplot dashboard using Matplotlib's object-oriented API showing: (a) a histogram of a numeric column, (b) a scatter plot with a trendline, (c) a horizontal bar chart with error bars, and (d) a pie chart with percentage labels. Apply consistent styling across all panels.

  2. Use Seaborn to create a pair plot of at least 4 numeric variables colored by a categorical variable. Then create a heatmap of the correlation matrix with annotations. Interpret which variables are most strongly correlated and why.

  3. Build a Seaborn FacetGrid that shows the distribution of salaries across departments, with each panel representing a different experience level bin (0-5, 5-10, 10-15, 15+ years). Add KDE overlays and consistent axis limits.

  4. Create a publication-quality figure with Matplotlib that includes: a custom color palette, removed top/right spines, formatted tick labels (e.g., "$50K" instead of 50000), a legend outside the plot area, and export it as both PNG (150 dpi) and SVG.

  5. (Bonus) Use Plotly Express to create an interactive scatter plot with hover tooltips showing all data fields, color by category, size by a numeric variable, and add dropdown filters. Export it as a self-contained HTML file.

⚠️ Common Mistakes

  • Using the pyplot stateful API (plt.plot()) for complex multi-panel figures — always use the object-oriented API (fig, ax = plt.subplots()) for anything beyond quick exploratory plots.

  • Forgetting plt.tight_layout() or bbox_inches='tight' when saving, resulting in cut-off labels and overlapping titles.

  • Choosing chart types for visual appeal rather than data appropriateness — pie charts for more than 5 categories, 3D bar charts when 2D suffices, or line charts for non-sequential categorical data.

  • Not considering colorblind accessibility. Avoid palettes that rely on red/green contrast alone. Use Seaborn's 'colorblind' palette or a perceptually uniform Matplotlib colormap such as viridis, plasma, or cividis.

  • Creating overly complex visualizations that obscure the message. Effective charts have a high data-to-ink ratio — remove gridlines, borders, and decorations that do not aid interpretation.
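
The cut-off-label mistake listed above is cheap to guard against at save time. The sketch below uses long, rotated tick labels (hypothetical team names) and saves with both `tight_layout()` and `bbox_inches="tight"`. Writing to an in-memory buffer is an assumption made so the example leaves no files behind; in practice a filename would be passed instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical labels, deliberately long enough to overflow a small figure
labels = ["Engineering (platform)", "Marketing (growth)", "Sales (enterprise)"]
values = [12, 7, 9]

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(labels, values)
ax.tick_params(axis="x", rotation=45)
ax.set_ylabel("Headcount")
ax.set_title("Headcount by Team")

# tight_layout() adjusts spacing so labels and titles fit inside the figure
fig.tight_layout()

# bbox_inches="tight" crops or expands the saved image to include all artists
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150, bbox_inches="tight")
png_bytes = buffer.getvalue()
```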
