Data Visualization
📖 Concept
Data visualization transforms raw numbers into insight. Python's visualization ecosystem is built on three major libraries: Matplotlib (the foundational layer), Seaborn (statistical visualization built on Matplotlib), and Plotly (interactive, web-based charts). Understanding when and how to use each is essential for effective data communication.
Matplotlib is Python's original plotting library and remains the most flexible. It uses a hierarchical object model: Figure (the canvas) contains one or more Axes (individual plots), which contain plot elements (lines, bars, text). There are two APIs:
- pyplot API (plt.plot(), plt.bar()): stateful, MATLAB-like, convenient for quick plots
- Object-oriented API (fig, ax = plt.subplots()): explicit, preferred for production code and multi-panel figures
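As a rough illustration of the two APIs and the Figure → Axes → element hierarchy, here is a minimal sketch. The sample data, the variable names, and the choice of the non-interactive "Agg" backend are assumptions made so the snippet runs without a display; they are not part of the lesson's own example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]      # made-up sample data
y = [0, 1, 4, 9, 16]

# pyplot (stateful) API: each call acts on the implicit "current" figure/axes
plt.figure()
plt.plot(x, y)
plt.title("pyplot API")
plt.close()

# Object-oriented API: the Figure and each Axes are explicit objects
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))
ax_line.plot(x, y)
ax_line.set_title("Line")
ax_bar.bar(x, y)
ax_bar.set_title("Bars")

# The hierarchy can be inspected: Figure -> Axes -> plot elements
n_axes = len(fig.axes)        # Axes contained in the Figure
n_lines = len(ax_line.lines)  # Line2D objects in the first Axes
n_bars = len(ax_bar.patches)  # Rectangle patches in the second Axes
print(n_axes, n_lines, n_bars)
```

Because every call in the second half names the Axes it targets, adding a third panel does not change which plot any existing line of code draws on.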
Seaborn provides high-level functions for statistical visualization. Built on Matplotlib, it offers beautiful defaults, automatic statistical aggregation, and tight integration with pandas DataFrames. Key function categories:
- Relational: scatterplot(), lineplot() for continuous relationships
- Categorical: boxplot(), violinplot(), barplot(), stripplot() for category comparisons
- Distribution: histplot(), kdeplot(), ecdfplot() for understanding data spread
- Matrix: heatmap(), clustermap() for correlation and similarity matrices
Plotly enables interactive charts that users can hover over, zoom, and filter. Plotly Express provides a concise API similar to Seaborn's, while the graph_objects module offers fine-grained control. Plotly is well suited to dashboards, web applications, and exploratory analysis in Jupyter notebooks.
Choosing the right chart type:
| Goal | Chart Types |
|---|---|
| Distribution | Histogram, KDE, Box plot, Violin |
| Comparison | Bar chart, Grouped bar, Dot plot |
| Relationship | Scatter plot, Bubble chart, Heatmap |
| Composition | Pie chart (sparingly), Stacked bar, Treemap |
| Trend over time | Line chart, Area chart |
Best practices: Always label axes and provide a title. Use colorblind-friendly palettes. Keep the data-to-ink ratio high (minimize chartjunk). Choose chart types based on the story the data needs to tell, not on visual appeal.
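A minimal sketch of these practices on a single bar chart follows. The region names and revenue figures are made up, and the "Agg" backend is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up sample data for illustration only
categories = ["North", "South", "East", "West"]
revenue = [120, 95, 140, 80]

# Seaborn's "colorblind" palette, returned as a list of RGB tuples
palette = sns.color_palette("colorblind", n_colors=len(categories))

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, revenue, color=palette)

# Label both axes and give the chart a title
ax.set_xlabel("Region")
ax.set_ylabel("Revenue (thousand $)")
ax.set_title("Revenue by Region")

# Remove non-data ink: despine() hides the top and right borders by default
sns.despine(ax=ax)

fig.tight_layout()
```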
💻 Code Example
# ============================================================
# Data Visualization: Matplotlib, Seaborn, and Plotly
# ============================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# --- Sample Data ---
np.random.seed(42)
n = 200
df = pd.DataFrame({
    "age": np.random.normal(35, 10, n).astype(int).clip(18, 65),
    "salary": np.random.normal(75000, 20000, n).clip(30000, 150000),
    "experience": np.random.normal(8, 4, n).clip(0, 30),
    "department": np.random.choice(
        ["Engineering", "Marketing", "Sales", "HR"], n,
        p=[0.4, 0.25, 0.2, 0.15]
    ),
    "satisfaction": np.random.uniform(1, 10, n).round(1),
})
df["salary"] = df["salary"] + df["experience"] * 2000  # add correlation


# ============================================================
# 1. Matplotlib: Object-Oriented API (production style)
# ============================================================
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle("Employee Dashboard", fontsize=16, fontweight="bold")

# Panel 1: Histogram with KDE overlay
ax = axes[0, 0]
ax.hist(df["salary"], bins=30, edgecolor="white", alpha=0.7,
        color="#2196F3", density=True, label="Histogram")
# Overlay a kernel density estimate computed with SciPy
kde = gaussian_kde(df["salary"])
x_range = np.linspace(df["salary"].min(), df["salary"].max(), 200)
ax.plot(x_range, kde(x_range), color="#FF5722", linewidth=2, label="KDE")
ax.set_xlabel("Salary ($)")
ax.set_ylabel("Density")
ax.set_title("Salary Distribution")
ax.xaxis.set_major_formatter(ticker.FuncFormatter(
    lambda x, _: f"${x/1000:.0f}K"
))
ax.legend()

# Panel 2: Scatter plot with regression line
ax = axes[0, 1]
colors = {"Engineering": "#2196F3", "Marketing": "#4CAF50",
          "Sales": "#FF9800", "HR": "#9C27B0"}
for dept, group in df.groupby("department"):
    ax.scatter(group["experience"], group["salary"],
               alpha=0.6, label=dept, color=colors[dept], s=30)
# Add trendline (degree-1 least-squares fit)
z = np.polyfit(df["experience"], df["salary"], 1)
p = np.poly1d(z)
ax.plot(sorted(df["experience"]), p(sorted(df["experience"])),
        "r--", linewidth=2, label=f"Trend (slope={z[0]:,.0f})")
ax.set_xlabel("Years of Experience")
ax.set_ylabel("Salary ($)")
ax.set_title("Experience vs Salary")
ax.legend(fontsize=8)

# Panel 3: Box plot by department
ax = axes[1, 0]
dept_order = df.groupby("department")["salary"].median().sort_values().index
# Note: Matplotlib 3.9+ renames the `labels` parameter to `tick_labels`
bp = ax.boxplot(
    [df[df["department"] == d]["salary"] for d in dept_order],
    labels=dept_order, patch_artist=True, notch=True,
)
for patch, dept in zip(bp["boxes"], dept_order):
    patch.set_facecolor(colors[dept])
    patch.set_alpha(0.7)
ax.set_ylabel("Salary ($)")
ax.set_title("Salary by Department")
ax.tick_params(axis="x", rotation=15)

# Panel 4: Bar chart of average satisfaction
ax = axes[1, 1]
dept_satisfaction = (
    df.groupby("department")["satisfaction"]
    .agg(["mean", "std"])
    .sort_values("mean", ascending=True)
)
bars = ax.barh(dept_satisfaction.index, dept_satisfaction["mean"],
               xerr=dept_satisfaction["std"], capsize=5,
               color=[colors[d] for d in dept_satisfaction.index],
               alpha=0.8, edgecolor="white")
ax.set_xlabel("Satisfaction Score (1-10)")
ax.set_title("Average Satisfaction by Department")
ax.set_xlim(0, 10)
for bar, val in zip(bars, dept_satisfaction["mean"]):
    ax.text(val + 0.3, bar.get_y() + bar.get_height() / 2,
            f"{val:.1f}", va="center", fontweight="bold")

plt.tight_layout()
plt.savefig("employee_dashboard.png", dpi=150, bbox_inches="tight")
plt.show()


# ============================================================
# 2. Seaborn: Statistical Visualization
# ============================================================
sns.set_theme(style="whitegrid", palette="husl", font_scale=1.1)

# FacetGrid: distribution per department
g = sns.FacetGrid(df, col="department", col_wrap=2,
                  height=4, aspect=1.2)
g.map_dataframe(sns.histplot, x="salary", kde=True, bins=20)
g.set_titles("{col_name}")
g.set_axis_labels("Salary ($)", "Count")
plt.tight_layout()
plt.show()

# Pair plot: multi-variable relationships
sns.pairplot(df[["age", "salary", "experience", "satisfaction",
                 "department"]],
             hue="department", diag_kind="kde",
             plot_kws={"alpha": 0.5, "s": 20})
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()

# Heatmap: correlation matrix
fig, ax = plt.subplots(figsize=(8, 6))
numeric_cols = df.select_dtypes(include=[np.number])
corr = numeric_cols.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # upper triangle mask
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm",
            center=0, square=True, linewidths=0.5, ax=ax,
            cbar_kws={"shrink": 0.8})
ax.set_title("Correlation Matrix")
plt.tight_layout()
plt.show()

# Violin + strip plot combo
fig, ax = plt.subplots(figsize=(10, 6))
sns.violinplot(data=df, x="department", y="salary", inner=None,
               alpha=0.3, ax=ax)
sns.stripplot(data=df, x="department", y="salary", size=3,
              alpha=0.6, jitter=True, ax=ax)
ax.set_title("Salary Distribution by Department (Violin + Strip)")
plt.tight_layout()
plt.show()


# ============================================================
# 3. Plotly Express: Interactive Charts
# ============================================================
# NOTE: Plotly outputs render in Jupyter notebooks or as HTML files.
# Uncomment the lines below to generate interactive charts.

# import plotly.express as px
#
# # Interactive scatter with hover data
# fig = px.scatter(
#     df, x="experience", y="salary", color="department",
#     size="satisfaction", hover_data=["age"],
#     title="Interactive: Experience vs Salary",
#     labels={"experience": "Years of Experience",
#             "salary": "Annual Salary ($)"},
#     template="plotly_white",
# )
# fig.update_traces(marker=dict(opacity=0.7, line=dict(width=0.5)))
# fig.show()  # opens in browser or renders in notebook
#
# # Animated scatter over age groups
# df["age_bin"] = pd.cut(df["age"], bins=5).astype(str)
# fig = px.scatter(
#     df, x="experience", y="salary", color="department",
#     animation_frame="age_bin", size="satisfaction",
#     range_y=[20000, 180000], range_x=[0, 35],
#     title="Salary by Experience (Animated by Age Group)",
# )
# fig.show()


# ============================================================
# 4. Styling & Best Practices
# ============================================================
# Custom style context manager
with plt.style.context("seaborn-v0_8-paper"):
    fig, ax = plt.subplots(figsize=(8, 5))
    dept_counts = df["department"].value_counts()
    colors_list = [colors.get(d, "#999") for d in dept_counts.index]
    bars = ax.bar(dept_counts.index, dept_counts.values,
                  color=colors_list, edgecolor="white", linewidth=1.5)

    # Annotate bars with values
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width() / 2., height + 1,
                f"{int(height)}", ha="center", va="bottom",
                fontweight="bold", fontsize=12)

    ax.set_ylabel("Number of Employees")
    ax.set_title("Headcount by Department",
                 fontsize=14, fontweight="bold")
    ax.spines[["top", "right"]].set_visible(False)  # remove chartjunk
    plt.tight_layout()
    plt.savefig("headcount.png", dpi=150, bbox_inches="tight",
                facecolor="white")
    plt.show()
🏋️ Practice Exercise
Create a 2x2 subplot dashboard using Matplotlib's object-oriented API showing: (a) a histogram of a numeric column, (b) a scatter plot with a trendline, (c) a horizontal bar chart with error bars, and (d) a pie chart with percentage labels. Apply consistent styling across all panels.
Use Seaborn to create a pair plot of at least 4 numeric variables colored by a categorical variable. Then create a heatmap of the correlation matrix with annotations. Interpret which variables are most strongly correlated and why.
Build a Seaborn FacetGrid that shows the distribution of salaries across departments, with each panel representing a different experience level bin (0-5, 5-10, 10-15, 15+ years). Add KDE overlays and consistent axis limits.
Create a publication-quality figure with Matplotlib that includes: a custom color palette, removed top/right spines, formatted tick labels (e.g., "$50K" instead of 50000), a legend outside the plot area, and export it as both PNG (150 dpi) and SVG.
(Bonus) Use Plotly Express to create an interactive scatter plot with hover tooltips showing all data fields, color by category, size by a numeric variable, and add dropdown filters. Export it as a self-contained HTML file.
⚠️ Common Mistakes
Using the pyplot stateful API (plt.plot()) for complex multi-panel figures. Use the object-oriented API (fig, ax = plt.subplots()) for anything beyond quick exploratory plots.
Forgetting plt.tight_layout() or bbox_inches='tight' when saving, resulting in cut-off labels and overlapping titles.
Choosing chart types for visual appeal rather than data appropriateness: pie charts for more than 5 categories, 3D bar charts when 2D suffices, or line charts for non-sequential categorical data.
Not considering colorblind accessibility. Avoid red/green-only palettes; use Seaborn's 'colorblind' palette or perceptually uniform Matplotlib colormaps such as viridis or plasma.
Creating overly complex visualizations that obscure the message. Effective charts have a high data-to-ink ratio — remove gridlines, borders, and decorations that do not aid interpretation.