pandas: Data Analysis


📖 Concept

pandas is Python's premier data manipulation library, providing the DataFrame and Series data structures that make working with tabular data intuitive, expressive, and fast. Built on top of NumPy, pandas adds labeled axes, flexible indexing, rich I/O capabilities, and a comprehensive API for data cleaning, transformation, and analysis.

Core data structures:

  • Series — a one-dimensional labeled array. Think of it as a column in a spreadsheet: it has an index (labels) and values (a NumPy array underneath). Supports vectorized operations, alignment by label, and mixed-type handling via object dtype
  • DataFrame — a two-dimensional labeled structure with columns of potentially different types. It is the workhorse of pandas — conceptually a dict of Series sharing the same index
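To make the alignment-by-label point concrete, here is a minimal sketch (the series names and values are hypothetical, not part of the lesson's example data):

```python
import pandas as pd

# A Series is a labeled 1-D array; arithmetic aligns on the index,
# not on position.
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

total = s1 + s2  # "a" and "d" have no partner label -> NaN
print(total)

# A DataFrame is conceptually a dict of Series sharing one index
df = pd.DataFrame({"x": s1, "y": s1 * 2})
print(df.loc["b", "y"])  # 40
```

Note that the result of `s1 + s2` contains NaN for labels present in only one operand — alignment never silently pairs values by position.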

Essential operations for data analysis:

  • I/O — pd.read_csv(), pd.read_excel(), pd.read_sql(), pd.read_parquet(), and their .to_*() counterparts
  • Selection — .loc[] (label-based), .iloc[] (position-based), boolean indexing, .query() for SQL-like filtering
  • Transformation — .apply(), .map(), .assign(), .pipe() for method chaining
  • Aggregation — .groupby() with .agg(), .transform(), .filter(); pivot tables via .pivot_table()
  • Merging — pd.merge() (SQL-style joins), pd.concat() (stacking), .join() (index-based)
  • Missing data — .isna(), .fillna(), .dropna(), .interpolate()

Method chaining is the idiomatic way to write pandas code. Instead of creating intermediate variables, chain operations: df.query("age > 25").groupby("city").agg(mean_salary=("salary", "mean")).sort_values("mean_salary", ascending=False). This produces readable, pipeline-style transformations.

Performance tips:

  • Use category dtype for low-cardinality string columns — reduces memory by 90%+
  • Prefer .query() over boolean indexing for readability and sometimes speed
  • Avoid row-wise .apply() — it is essentially a Python loop. Use vectorized operations or .transform() instead
  • Load only needed columns with usecols parameter in read_csv()
  • For datasets > 1GB, consider read_parquet() (columnar, compressed, faster I/O)
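A quick sketch of two of the tips above — avoiding row-wise .apply() and using the category dtype. The column names and sizes are illustrative assumptions, and the timing difference varies by machine:

```python
import time
import numpy as np
import pandas as pd

# Row-wise .apply() runs a Python-level loop; the vectorized form runs in C.
df = pd.DataFrame({"salary": np.random.randint(50_000, 150_000, 200_000)})

t0 = time.perf_counter()
slow = df["salary"].apply(lambda s: s * 0.25)  # row by row
t1 = time.perf_counter()
fast = df["salary"] * 0.25                     # vectorized
t2 = time.perf_counter()

assert slow.equals(fast)  # identical results, very different cost
print(f"apply: {t1 - t0:.4f}s  vectorized: {t2 - t1:.4f}s")

# Category dtype: a low-cardinality string column stored far more compactly
cities = pd.Series(np.random.choice(["NYC", "LA", "Chicago"], 200_000))
print(cities.memory_usage(deep=True))
print(cities.astype("category").memory_usage(deep=True))
```

The vectorized form is typically one to two orders of magnitude faster here, and the category conversion shrinks the column because each row stores a small integer code instead of a full string object.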

💻 Code Example

# ============================================================
# pandas: Data Analysis Fundamentals
# ============================================================
import pandas as pd
import numpy as np

# --- Creating DataFrames ---
# From dictionary
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [28, 34, 22, 45, 31],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [85000, 92000, 65000, 120000, 78000],
    "department": ["Engineering", "Marketing", "Engineering",
                   "Management", "Marketing"],
})

# From CSV (common real-world usage)
# df = pd.read_csv("employees.csv", parse_dates=["hire_date"],
#                  usecols=["name", "age", "salary", "department"])

# Basic inspection
print(df.shape)       # (5, 5)
print(df.dtypes)      # column data types
print(df.describe())  # statistical summary
df.info()             # memory usage + null counts (prints directly)

# --- Selection & Filtering ---
# Label-based selection with .loc[]
engineers = df.loc[df["department"] == "Engineering", ["name", "salary"]]

# Position-based selection with .iloc[]
first_three = df.iloc[:3, :3]  # first 3 rows, first 3 columns

# Boolean indexing
high_earners = df[df["salary"] > 80000]

# .query() — cleaner syntax for complex filters
result = df.query("age > 25 and city == 'NYC'")

# Multiple conditions
filtered = df[(df["salary"] > 70000) & (df["department"] == "Engineering")]

# --- Transformation & New Columns ---
# Vectorized operations (preferred)
df["salary_k"] = df["salary"] / 1000
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 35, 50, 100],
                         labels=["Junior", "Mid", "Senior", "Executive"])

# .assign() for method chaining (returns a new DataFrame)
df_enhanced = df.assign(
    tax=lambda x: x["salary"] * 0.25,
    net_salary=lambda x: x["salary"] * 0.75,
    name_upper=lambda x: x["name"].str.upper(),
)

# --- GroupBy Aggregation ---
# Single aggregation
dept_means = df.groupby("department")["salary"].mean()

# Multiple aggregations with named output
dept_stats = (
    df.groupby("department")
    .agg(
        avg_salary=("salary", "mean"),
        max_salary=("salary", "max"),
        headcount=("name", "count"),
        avg_age=("age", "mean"),
    )
    .sort_values("avg_salary", ascending=False)
)
print(dept_stats)

# .transform() — returns a same-shape result (useful for normalization)
df["salary_zscore"] = (
    df.groupby("department")["salary"]
    .transform(lambda x: (x - x.mean()) / x.std())
)

# --- Merging & Joining ---
departments = pd.DataFrame({
    "department": ["Engineering", "Marketing", "Management", "Sales"],
    "budget": [500000, 300000, 200000, 400000],
    "floor": [3, 2, 5, 1],
})

# SQL-style merge; pd.merge() defaults to an inner join,
# so how="left" is needed to keep every employee row
merged = pd.merge(df, departments, on="department", how="left")

# Concatenate DataFrames vertically
new_employees = pd.DataFrame({
    "name": ["Frank", "Grace"],
    "age": [29, 38],
    "city": ["NYC", "Chicago"],
    "salary": [71000, 95000],
    "department": ["Sales", "Engineering"],
})
all_employees = pd.concat([df, new_employees], ignore_index=True)

# --- Pivot Tables ---
pivot = df.pivot_table(
    values="salary",
    index="department",
    columns="city",
    aggfunc="mean",
    fill_value=0,
)
print(pivot)

# --- Handling Missing Data ---
df_with_nulls = df.copy()
df_with_nulls.loc[1, "salary"] = np.nan
df_with_nulls.loc[3, "city"] = np.nan

print(df_with_nulls.isna().sum())  # count nulls per column
df_filled = df_with_nulls.fillna({
    "salary": df_with_nulls["salary"].median(),
    "city": "Unknown",
})
df_dropped = df_with_nulls.dropna(subset=["salary"])  # drop rows missing salary

# --- Method Chaining (idiomatic pandas) ---
report = (
    df
    .query("salary > 60000")
    .assign(bonus=lambda x: x["salary"] * 0.1)
    .groupby("department")
    .agg(
        avg_total_comp=("salary", lambda x: (x + x * 0.1).mean()),
        team_size=("name", "count"),
    )
    .sort_values("avg_total_comp", ascending=False)
    .reset_index()
)
print(report)

# --- Performance: Category dtype ---
# A low-cardinality string column stored as category uses far less memory
df["department"] = df["department"].astype("category")
print(df["department"].cat.categories)  # unique categories
print(df.memory_usage(deep=True))       # compare memory usage

🏋️ Practice Exercise

  1. Load a CSV file (or create a DataFrame with at least 100 rows of synthetic data) containing columns: name, age, department, salary, hire_date, and city. Perform exploratory analysis: find the top 5 highest-paid employees, the average salary per department, and the department with the highest employee count.

  2. Write a method-chained pipeline that filters employees hired after 2020, groups by department, computes the mean and median salary, adds a column indicating whether the department average is above the company-wide average, and sorts by mean salary descending.

  3. Given two DataFrames — one with employee info and another with department budgets — perform a left merge, then use .groupby() and .transform() to add a column showing each employee's salary as a percentage of their department's total budget.

  4. Create a DataFrame with deliberately missing values in multiple columns. Demonstrate three different imputation strategies: fill with column mean, forward-fill (for time-series data), and fill categoricals with mode. Compare the resulting distributions.

  5. Use pd.pivot_table() to create a summary showing average salary by department and city, with row and column margins. Then use .style to highlight the highest value in each column.

  6. Profile the memory usage of a large DataFrame (1M+ rows). Optimize it by converting object columns to category dtype, downcasting numeric types with pd.to_numeric(downcast="integer"), and using read_csv() with dtype parameter. Measure the memory reduction.

⚠️ Common Mistakes

  • Using .apply() with a Python function row-by-row when a vectorized pandas operation exists — this is orders of magnitude slower and negates pandas' performance advantages.

  • Chaining indexing like df[condition]['column'] = value which triggers SettingWithCopyWarning. Always use .loc[] for assignment: df.loc[condition, 'column'] = value.

  • Ignoring the inplace trap — most pandas methods return new DataFrames. Relying on inplace=True is discouraged: it rarely saves memory, breaks method chaining, and the pandas team has moved toward removing it from most methods.

  • Not specifying dtype or parse_dates when reading CSV files, leading to incorrect type inference (e.g., IDs read as integers instead of strings, dates as plain strings).

  • Loading an entire huge CSV into memory when you only need a few columns. Use usecols parameter or switch to Parquet format for large datasets.
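The .loc[] assignment and dtype pitfalls above can be sketched as follows (the column names and CSV contents are hypothetical):

```python
import io
import pandas as pd

df = pd.DataFrame({"age": [28, 34, 22], "salary": [85000, 92000, 65000]})

# Wrong: df[df["age"] > 30]["salary"] = 0
#   -> chained indexing; triggers SettingWithCopyWarning and may not stick.
# Right: a single .loc[] call that selects rows and column together.
df.loc[df["age"] > 30, "salary"] = 0
print(df)

# Specify dtypes up front so IDs keep leading zeros instead of
# being inferred as integers, and dates become real datetimes.
csv = io.StringIO("employee_id,hire_date\n00123,2021-04-01\n00456,2019-07-15\n")
typed = pd.read_csv(csv, dtype={"employee_id": "string"},
                    parse_dates=["hire_date"])
print(typed.dtypes)
```

With the single .loc[] call, pandas writes directly into the original DataFrame; the chained form may write into a temporary copy that is immediately discarded.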
