pandas: Data Analysis
📖 Concept
pandas is Python's premier data manipulation library, providing the DataFrame and Series data structures that make working with tabular data intuitive, expressive, and fast. Built on top of NumPy, pandas adds labeled axes, flexible indexing, rich I/O capabilities, and a comprehensive API for data cleaning, transformation, and analysis.
Core data structures:
- Series — a one-dimensional labeled array. Think of it as a column in a spreadsheet: it has an index (labels) and values (a NumPy array underneath). Supports vectorized operations, alignment by label, and mixed-type handling via the `object` dtype.
- DataFrame — a two-dimensional labeled structure with columns of potentially different types. It is the workhorse of pandas — conceptually a dict of Series sharing the same index.
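A minimal sketch of the two structures; the data values here are invented for illustration. Note how arithmetic between two Series aligns by index label, not by position:

```python
import pandas as pd

# A Series: labeled 1-D array (index + values)
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["c", "b", "a"])

# Alignment happens by label: a=10+3, b=20+2, c=30+1
aligned = s1 + s2

# A DataFrame: conceptually a dict of Series sharing one index
df = pd.DataFrame({"x": s1, "y": s1 * 2})
print(aligned)
print(df)
```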
Essential operations for data analysis:
- I/O — `pd.read_csv()`, `pd.read_excel()`, `pd.read_sql()`, `pd.read_parquet()`, and their `.to_*()` counterparts
- Selection — `.loc[]` (label-based), `.iloc[]` (position-based), boolean indexing, `.query()` for SQL-like filtering
- Transformation — `.apply()`, `.map()`, `.assign()`, `.pipe()` for method chaining
- Aggregation — `.groupby()` with `.agg()`, `.transform()`, `.filter()`; pivot tables via `.pivot_table()`
- Merging — `pd.merge()` (SQL-style joins), `pd.concat()` (stacking), `.join()` (index-based)
- Missing data — `.isna()`, `.fillna()`, `.dropna()`, `.interpolate()`
Method chaining is the idiomatic way to write pandas code. Instead of creating intermediate variables, chain operations: `df.query("age > 25").groupby("city").agg(mean_salary=("salary", "mean")).sort_values("mean_salary", ascending=False)`. This produces readable, pipeline-style transformations.
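The chain above runs end to end; the small DataFrame here is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [28, 34, 22, 45, 31],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [85000, 92000, 65000, 120000, 78000],
})

# One pipeline, no intermediate variables:
# filter -> group -> aggregate -> sort
report = (
    df.query("age > 25")
      .groupby("city")
      .agg(mean_salary=("salary", "mean"))
      .sort_values("mean_salary", ascending=False)
)
print(report)
```

Each step returns a new object, so the pipeline never mutates `df` — the same property that makes chains easy to reorder or extend.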
Performance tips:
- Use `category` dtype for low-cardinality string columns — often cuts memory by 90% or more
- Prefer `.query()` over boolean indexing for readability (and sometimes speed)
- Avoid row-wise `.apply()` — it is essentially a Python loop; use vectorized operations or `.transform()` instead
- Load only the columns you need with the `usecols` parameter of `read_csv()`
- For datasets over 1 GB, consider `read_parquet()` (columnar, compressed, faster I/O)
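The `category` saving is easy to verify yourself. This sketch builds a synthetic 100,000-row, 3-value city column (the data and sizes are illustrative; exact byte counts vary by pandas version):

```python
import numpy as np
import pandas as pd

# Synthetic low-cardinality column: 100,000 rows, 3 unique values
rng = np.random.default_rng(0)
cities = pd.Series(rng.choice(["NYC", "LA", "Chicago"], size=100_000))

# deep=True counts the actual string payloads, not just pointers
as_object = cities.memory_usage(deep=True)
as_category = cities.astype("category").memory_usage(deep=True)

print(f"object:    {as_object:,} bytes")
print(f"category:  {as_category:,} bytes")
print(f"reduction: {1 - as_category / as_object:.0%}")
```

The category version stores each row as a small integer code plus one copy of each unique label, which is why the saving grows with row count.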
💻 Code Example
```python
# ============================================================
# pandas: Data Analysis Fundamentals
# ============================================================
import pandas as pd
import numpy as np

# --- Creating DataFrames ---
# From dictionary
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [28, 34, 22, 45, 31],
    "city": ["NYC", "LA", "NYC", "Chicago", "LA"],
    "salary": [85000, 92000, 65000, 120000, 78000],
    "department": ["Engineering", "Marketing", "Engineering",
                   "Management", "Marketing"],
})

# From CSV (common real-world usage)
# df = pd.read_csv("employees.csv", parse_dates=["hire_date"],
#                  usecols=["name", "age", "salary", "department"])

# Basic inspection
print(df.shape)       # (5, 5)
print(df.dtypes)      # column data types
print(df.describe())  # statistical summary
df.info()             # prints memory usage + null counts (returns None)

# --- Selection & Filtering ---
# Label-based selection with .loc[]
engineers = df.loc[df["department"] == "Engineering", ["name", "salary"]]

# Position-based selection with .iloc[]
first_three = df.iloc[:3, :3]  # first 3 rows, first 3 columns

# Boolean indexing
high_earners = df[df["salary"] > 80000]

# .query() — cleaner syntax for complex filters
result = df.query("age > 25 and city == 'NYC'")

# Multiple conditions (note the parentheses around each condition)
filtered = df[(df["salary"] > 70000) & (df["department"] == "Engineering")]

# --- Transformation & New Columns ---
# Vectorized operations (preferred)
df["salary_k"] = df["salary"] / 1000
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 35, 50, 100],
                         labels=["Junior", "Mid", "Senior", "Executive"])

# .assign() for method chaining (returns a new DataFrame)
df_enhanced = df.assign(
    tax=lambda x: x["salary"] * 0.25,
    net_salary=lambda x: x["salary"] * 0.75,
    name_upper=lambda x: x["name"].str.upper(),
)

# --- GroupBy Aggregation ---
# Single aggregation
dept_means = df.groupby("department")["salary"].mean()

# Multiple aggregations with named output
dept_stats = (
    df.groupby("department")
    .agg(
        avg_salary=("salary", "mean"),
        max_salary=("salary", "max"),
        headcount=("name", "count"),
        avg_age=("age", "mean"),
    )
    .sort_values("avg_salary", ascending=False)
)
print(dept_stats)

# .transform() — returns a same-shape result (useful for normalization)
df["salary_zscore"] = (
    df.groupby("department")["salary"]
    .transform(lambda x: (x - x.mean()) / x.std())
)

# --- Merging & Joining ---
departments = pd.DataFrame({
    "department": ["Engineering", "Marketing", "Management", "Sales"],
    "budget": [500000, 300000, 200000, 400000],
    "floor": [3, 2, 5, 1],
})

# SQL-style merge (pd.merge() defaults to an inner join;
# how="left" keeps every row of df)
merged = pd.merge(df, departments, on="department", how="left")

# Concatenate DataFrames vertically
new_employees = pd.DataFrame({
    "name": ["Frank", "Grace"],
    "age": [29, 38],
    "city": ["NYC", "Chicago"],
    "salary": [71000, 95000],
    "department": ["Sales", "Engineering"],
})
all_employees = pd.concat([df, new_employees], ignore_index=True)

# --- Pivot Tables ---
pivot = df.pivot_table(
    values="salary",
    index="department",
    columns="city",
    aggfunc="mean",
    fill_value=0,
)
print(pivot)

# --- Handling Missing Data ---
df_with_nulls = df.copy()
# Cast to float first: assigning NaN into an int64 column relies on a
# silent upcast that recent pandas versions warn about
df_with_nulls["salary"] = df_with_nulls["salary"].astype("float64")
df_with_nulls.loc[1, "salary"] = np.nan
df_with_nulls.loc[3, "city"] = np.nan

print(df_with_nulls.isna().sum())  # count nulls per column
df_filled = df_with_nulls.fillna({
    "salary": df_with_nulls["salary"].median(),
    "city": "Unknown",
})
df_dropped = df_with_nulls.dropna(subset=["salary"])  # drop rows missing salary

# --- Method Chaining (idiomatic pandas) ---
report = (
    df
    .query("salary > 60000")
    .assign(bonus=lambda x: x["salary"] * 0.1)
    .groupby("department")
    .agg(
        avg_total_comp=("salary", lambda x: (x + x * 0.1).mean()),
        team_size=("name", "count"),
    )
    .sort_values("avg_total_comp", ascending=False)
    .reset_index()
)
print(report)

# --- Performance: Category dtype ---
# An object (string) column uses several times the memory of category codes
df["department"] = df["department"].astype("category")
print(df["department"].cat.categories)  # unique categories
print(df.memory_usage(deep=True))       # compare memory usage
```
🏋️ Practice Exercise
1. Load a CSV file (or create a DataFrame with at least 100 rows of synthetic data) containing the columns name, age, department, salary, hire_date, and city. Perform exploratory analysis: find the top 5 highest-paid employees, the average salary per department, and the department with the highest employee count.
2. Write a method-chained pipeline that filters employees hired after 2020, groups by department, computes the mean and median salary, adds a column indicating whether the department average is above the company-wide average, and sorts by mean salary descending.
3. Given two DataFrames — one with employee info and another with department budgets — perform a left merge, then use `.groupby()` and `.transform()` to add a column showing each employee's salary as a percentage of their department's total budget.
4. Create a DataFrame with deliberately missing values in multiple columns. Demonstrate three different imputation strategies: fill with the column mean, forward-fill (for time-series data), and fill categoricals with the mode. Compare the resulting distributions.
5. Use `pd.pivot_table()` to create a summary showing average salary by department and city, with row and column margins. Then use `.style` to highlight the highest value in each column.
6. Profile the memory usage of a large DataFrame (1M+ rows). Optimize it by converting object columns to `category` dtype, downcasting numeric types with `pd.to_numeric(downcast="integer")`, and passing the `dtype` parameter to `read_csv()`. Measure the memory reduction.
⚠️ Common Mistakes
- Using `.apply()` with a Python function row by row when a vectorized pandas operation exists — this is orders of magnitude slower and negates pandas' performance advantages.
- Chained indexing like `df[condition]['column'] = value`, which triggers SettingWithCopyWarning and may silently write to a temporary copy. Always use `.loc[]` for assignment: `df.loc[condition, 'column'] = value`.
- Ignoring the `inplace` trap — most pandas methods return new DataFrames. Relying on `inplace=True` is discouraged by the pandas team and breaks method chaining.
- Not specifying `dtype` or `parse_dates` when reading CSV files, leading to incorrect type inference (e.g., IDs with leading zeros read as integers instead of strings, dates left as plain strings).
- Loading an entire huge CSV into memory when you only need a few columns. Use the `usecols` parameter, or switch to the Parquet format for large datasets.
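The chained-indexing pitfall is worth seeing concretely. A small sketch with made-up data, contrasting the broken pattern with the `.loc[]` fix:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["eng", "mkt", "eng"],
    "salary": [80, 90, 70],
})

# WRONG: chained indexing -- the first [] may return a copy, so the
# assignment can land on a temporary and never reach df:
#   df[df["dept"] == "eng"]["salary"] = 100   # SettingWithCopyWarning

# RIGHT: one .loc[] call with (row mask, column label) is guaranteed
# to modify df itself
df.loc[df["dept"] == "eng", "salary"] = 100
print(df)
```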