Python Pandas Cheat Sheet: The Complete Quick Reference
Pandas is the backbone of data analysis in Python. Whether you are cleaning messy spreadsheets, merging datasets, or building summary reports, you will reach for pandas more than any other library. This cheat sheet covers the most common operations, with copy-paste code examples you can use immediately.
Import pandas at the top of every script:
import pandas as pd
import numpy as np
Creating DataFrames
Build a DataFrame from a dictionary of lists:
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [30, 25, 35],
'salary': [70000, 60000, 90000]
})
Create from a list of dictionaries (useful when each row comes from an API response):
records = [
{'name': 'Alice', 'age': 30},
{'name': 'Bob', 'age': 25},
{'name': 'Charlie', 'age': 35}
]
df = pd.DataFrame(records)
Create from a NumPy array with custom column names:
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
Reading and Writing Data
| Format | Read | Write |
|---|---|---|
| CSV | `pd.read_csv('file.csv')` | `df.to_csv('file.csv', index=False)` |
| Excel | `pd.read_excel('file.xlsx')` | `df.to_excel('file.xlsx', index=False)` |
| JSON | `pd.read_json('file.json')` | `df.to_json('file.json', orient='records')` |
| SQL | `pd.read_sql(query, conn)` | `df.to_sql('table', conn, if_exists='replace')` |
| Parquet | `pd.read_parquet('file.parquet')` | `df.to_parquet('file.parquet')` |
Common read_csv parameters you will use constantly:
df = pd.read_csv(
'data.csv',
sep=',', # delimiter
header=0, # row number for column names
names=['a', 'b', 'c'], # custom column names
usecols=['a', 'b'], # only load specific columns
dtype={'a': str, 'b': int}, # force column types
parse_dates=['date_col'], # parse as datetime
na_values=['N/A', 'null'], # treat these as NaN
nrows=1000, # only read first 1000 rows
encoding='utf-8'
)
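To see a few of these parameters working together, here is a minimal, self-contained sketch; the data and column names are invented, and `io.StringIO` stands in for a real file path:

```python
import io

import pandas as pd

# A small CSV held in memory instead of on disk
csv_text = """id,name,amount,notes
1,Alice,100,ok
2,Bob,N/A,null
3,Charlie,250,ok
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['id', 'name', 'amount'],  # skip the notes column entirely
    dtype={'id': str},                 # keep ids as strings, not integers
    na_values=['N/A', 'null'],         # both tokens become NaN
)

print(df.columns.tolist())        # ['id', 'name', 'amount']
print(df['amount'].isna().sum())  # 1 — Bob's N/A became NaN
```

Because `usecols` is applied at parse time, the skipped column is never loaded into memory, which matters for wide files.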
Read a specific Excel sheet:
df = pd.read_excel('report.xlsx', sheet_name='Q1 Sales')
Read from a SQL database:
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql("SELECT * FROM customers WHERE state = 'CA'", conn)
Viewing and Inspecting Data
| Command | What It Returns |
|---|---|
| `df.head(10)` | First 10 rows |
| `df.tail(5)` | Last 5 rows |
| `df.shape` | Tuple of (rows, columns) |
| `df.info()` | Column names, types, non-null counts |
| `df.describe()` | Summary statistics for numeric columns |
| `df.dtypes` | Data type of each column |
| `df.columns` | List of column names |
| `df.index` | The row index |
| `df.nunique()` | Count of unique values per column |
| `df.value_counts('col')` | Frequency counts for a column |
| `df.sample(5)` | 5 random rows |
| `df.memory_usage(deep=True)` | Memory consumption per column |
# Quick data audit in three lines
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
Selecting Data
Column access — three ways to grab a column:
df['name'] # returns a Series
df.name # dot notation (only for simple names without spaces that don't clash with DataFrame methods)
df[['name', 'age']] # multiple columns — returns a DataFrame
loc — select by label (row/column names):
df.loc[0, 'name'] # single value
df.loc[0:4, 'name':'salary'] # slice of rows and columns (inclusive)
df.loc[df['age'] > 30, ['name', 'salary']] # filter rows, select columns
iloc — select by integer position:
df.iloc[0, 1] # first row, second column
df.iloc[0:5, 0:3] # first 5 rows, first 3 columns
df.iloc[-1] # last row
df.iloc[:, [0, 2]] # all rows, first and third columns
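The inclusive/exclusive difference between loc and iloc trips people up constantly. A quick sketch with an invented non-default index to make the difference visible:

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Charlie', 'Dana'],
     'age': [30, 25, 35, 28]},
    index=[10, 11, 12, 13]  # labels that differ from positions
)

# loc slices by label and INCLUDES the endpoint
print(len(df.loc[10:12]))  # 3 rows (labels 10, 11, 12)

# iloc slices by position and EXCLUDES the endpoint, like Python lists
print(len(df.iloc[0:2]))   # 2 rows (positions 0 and 1)
```

When the index is the default 0, 1, 2, ... the two look interchangeable, which is exactly why the off-by-one surfaces later on real data.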
Filtering Data
Boolean indexing — the most common pattern:
df[df['age'] > 30]
df[df['name'] == 'Alice']
df[(df['age'] > 25) & (df['salary'] > 65000)] # AND — use &
df[(df['age'] < 25) | (df['age'] > 35)] # OR — use |
df[~(df['name'] == 'Bob')] # NOT — use ~
isin for matching against a list of values:
df[df['name'].isin(['Alice', 'Charlie'])]
df[~df['state'].isin(['CA', 'NY'])] # exclude specific values
query for readable string-based filtering:
df.query('age > 30 and salary > 65000')
df.query('name in ["Alice", "Charlie"]')
min_age = 25
df.query('age > @min_age') # reference Python variables with @
between for range filtering:
df[df['age'].between(25, 35)] # inclusive on both ends
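Boolean indexing and `query` are interchangeable for most filters. A small sketch with invented data showing they select the same rows:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [30, 25, 35],
    'salary': [70000, 60000, 90000],
})

mask_result = df[(df['age'] > 25) & (df['salary'] > 65000)]
query_result = df.query('age > 25 and salary > 65000')

print(mask_result['name'].tolist())      # ['Alice', 'Charlie']
print(mask_result.equals(query_result))  # True — same rows either way
```

`query` reads better for long conditions; boolean masks compose better when the condition is built up programmatically.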
Sorting
df.sort_values('salary') # ascending
df.sort_values('salary', ascending=False) # descending
df.sort_values(['department', 'salary'], ascending=[True, False]) # multi-column
df.sort_index() # sort by row index
df.nlargest(10, 'salary') # top 10 by salary
df.nsmallest(5, 'age') # bottom 5 by age
Adding and Removing Columns
# Add a new column
df['bonus'] = df['salary'] * 0.10
df['full_name'] = df['first'] + ' ' + df['last']
df['category'] = np.where(df['salary'] > 80000, 'senior', 'junior')
# Add with multiple conditions
df['tier'] = np.select(
[df['salary'] > 100000, df['salary'] > 70000],
['gold', 'silver'],
default='bronze'
)
# Rename columns
df.rename(columns={'old_name': 'new_name', 'age': 'years'}, inplace=True)
# Drop columns
df.drop(columns=['bonus', 'tier'], inplace=True)
# Reorder columns
df = df[['name', 'age', 'salary']]
Groupby and Aggregation
Basic groupby — split, apply, combine:
df.groupby('department')['salary'].mean()
df.groupby('department')['salary'].agg(['mean', 'median', 'count'])
df.groupby(['department', 'title'])['salary'].sum()
Multiple aggregations on different columns:
df.groupby('department').agg(
avg_salary=('salary', 'mean'),
total_sales=('sales', 'sum'),
headcount=('employee_id', 'count'),
max_tenure=('years', 'max')
).reset_index()
Transform — apply a function and return a same-sized result:
# Add a column with the department average salary
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')
# Percent of department total
df['pct_of_dept'] = df['salary'] / df.groupby('department')['salary'].transform('sum')
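The key property of transform is that the result has the same length as the original DataFrame, so it can be assigned straight back as a column. A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Sales', 'Sales'],
    'salary': [100, 80, 60, 40],
})

# Each row receives its own department's mean, aligned row-for-row with df
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')

print(df['dept_avg'].tolist())  # [90.0, 90.0, 50.0, 50.0]
```

A plain `.agg('mean')` would collapse to one row per department; transform broadcasts that value back to every member row.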
Merging and Joining
| Merge Type | What It Keeps |
|---|---|
| `inner` | Only rows with keys in both DataFrames |
| `left` | All rows from left, matching rows from right |
| `right` | All rows from right, matching rows from left |
| `outer` | All rows from both DataFrames |
# Merge on a common column
merged = pd.merge(orders, customers, on='customer_id', how='left')
# Merge on differently named columns
merged = pd.merge(orders, products, left_on='prod_id', right_on='product_id')
# Merge on multiple columns
merged = pd.merge(df1, df2, on=['year', 'month'], how='inner')
# Concatenate DataFrames vertically (stacking rows)
combined = pd.concat([df1, df2, df3], ignore_index=True)
# Concatenate horizontally (adding columns side by side)
combined = pd.concat([df1, df2], axis=1)
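The practical difference between merge types shows up when keys don't match. A small sketch with invented orders and customers tables:

```python
import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 2, 3], 'total': [50, 75, 20]})
customers = pd.DataFrame({'customer_id': [1, 2], 'name': ['Alice', 'Bob']})

# A left merge keeps all orders; customer 3 has no match, so name is NaN
merged = pd.merge(orders, customers, on='customer_id', how='left')

print(len(merged))                  # 3 — every order survives
print(merged['name'].isna().sum())  # 1 — the unmatched customer
```

With `how='inner'` the unmatched order would silently disappear, which is why checking row counts before and after a merge is a good habit.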
Pivoting and Reshaping
Pivot table — like Excel pivot tables:
pivot = df.pivot_table(
values='revenue',
index='region',
columns='quarter',
aggfunc='sum',
fill_value=0,
margins=True # add row/column totals
)
Melt — unpivot from wide to long format:
# Wide format: columns are Q1, Q2, Q3, Q4
# Long format: one column for quarter, one for value
long_df = df.melt(
id_vars=['product'],
value_vars=['Q1', 'Q2', 'Q3', 'Q4'],
var_name='quarter',
value_name='revenue'
)
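To make the wide-to-long reshaping concrete, here is a runnable sketch with two invented products and two quarters:

```python
import pandas as pd

wide = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'Q1': [10, 20],
    'Q2': [30, 40],
})

long_df = wide.melt(id_vars=['product'], value_vars=['Q1', 'Q2'],
                    var_name='quarter', value_name='revenue')

# 2 products x 2 quarters -> 4 rows, with columns product/quarter/revenue
print(long_df.shape)  # (4, 3)
print(sorted(long_df.columns))
```

Long format is what groupby and most plotting libraries expect, so melt is usually the first step when data arrives with one column per period.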
Crosstab for frequency tables:
pd.crosstab(df['department'], df['gender'], margins=True)
Handling Missing Data
df.isnull().sum() # count NaN per column
df.isnull().sum().sum() # total NaN in entire DataFrame
df.dropna() # drop rows with any NaN
df.dropna(subset=['email']) # drop rows where email is NaN
df.dropna(thresh=3) # keep rows with at least 3 non-NaN values
df.fillna(0) # replace NaN with 0
df['salary'] = df['salary'].fillna(df['salary'].median()) # fill with median
df.ffill() # forward fill (carry previous value; fillna(method='ffill') is deprecated)
df.bfill() # backward fill
df.interpolate() # linear interpolation for numeric data
Replace specific values:
df.replace({'N/A': np.nan, 'missing': np.nan})
df['status'] = df['status'].replace({'active': 1, 'inactive': 0})
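The different fill strategies produce quite different results; a quick sketch on a small invented Series makes the contrast visible:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())        # [1.0, 1.0, 1.0, 4.0] — carry last value forward
print(s.fillna(0).tolist())      # [1.0, 0.0, 0.0, 4.0] — constant fill
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0] — linear between neighbors
```

Forward fill suits time series where a value persists until updated; interpolation suits smoothly varying measurements; a constant fill suits counts where missing genuinely means zero.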
String Operations
All string methods are available via the .str accessor:
df['name'].str.lower()
df['name'].str.upper()
df['name'].str.title()
df['name'].str.strip() # remove leading/trailing whitespace
df['name'].str.replace('old', 'new')
df['name'].str.contains('alice', case=False) # boolean mask
df['name'].str.startswith('A')
df['name'].str.len()
df['email'].str.split('@').str[1] # extract domain from email
# Extract with regex
df['year'] = df['date_str'].str.extract(r'(\d{4})')
# Multiple string conditions
mask = df['description'].str.contains('urgent|critical', case=False, na=False)
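Because every `.str` method returns a Series, they chain naturally. A small sketch with invented email addresses, combining strip, lower, and split:

```python
import pandas as pd

emails = pd.Series([' Alice@Example.com ', 'bob@test.org'])

# Normalize whitespace and case, then keep everything after the @
domains = emails.str.strip().str.lower().str.split('@').str[1]

print(domains.tolist())  # ['example.com', 'test.org']
```

Chaining keeps the whole cleanup pipeline vectorized, with no Python-level loop over rows.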
Date and Time Operations
# Convert string to datetime
df['date'] = pd.to_datetime(df['date_str'])
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5
# Date arithmetic
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days
df['next_week'] = df['date'] + pd.Timedelta(days=7)
# Resample time series (requires datetime index)
df.set_index('date').resample('ME')['revenue'].sum() # monthly totals ('M' is deprecated since pandas 2.2)
df.set_index('date').resample('QE')['revenue'].mean() # quarterly averages
# Filter by date range
df[df['date'].between('2025-01-01', '2025-12-31')]
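Once a column is converted with `to_datetime`, the `.dt` accessor exposes all the components at once. A runnable sketch with two invented dates:

```python
import pandas as pd

df = pd.DataFrame({'date_str': ['2025-01-15', '2025-06-21']})
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')

print(df['date'].dt.year.tolist())     # [2025, 2025]
print(df['date'].dt.quarter.tolist())  # [1, 2]
print(df['date'].dt.day_name().tolist())
```

Note that `.dt` only works after conversion; on the raw string column it raises an AttributeError, which is a common first stumble with dates in pandas.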
Apply, Map, and Lambda
apply — run any function on rows or columns:
# Apply to a column (Series)
df['salary_k'] = df['salary'].apply(lambda x: round(x / 1000, 1))
# Apply to each row (axis=1)
df['summary'] = df.apply(lambda row: f"{row['name']} earns {row['salary']}", axis=1)
# Apply a named function
def categorize(salary):
if salary > 100000:
return 'high'
elif salary > 60000:
return 'medium'
return 'low'
df['band'] = df['salary'].apply(categorize)
map — apply a mapping dictionary or function to a Series:
df['department_code'] = df['department'].map({
'Engineering': 'ENG',
'Marketing': 'MKT',
'Sales': 'SLS'
})
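One gotcha worth knowing: when you map with a dictionary, any value missing from the dictionary becomes NaN rather than passing through unchanged. A quick sketch with an invented department not covered by the mapping:

```python
import pandas as pd

s = pd.Series(['Engineering', 'Marketing', 'Legal'])
codes = s.map({'Engineering': 'ENG', 'Marketing': 'MKT'})

print(codes.tolist()[:2])      # ['ENG', 'MKT']
print(codes.isna().sum())      # 1 — 'Legal' had no mapping, so it became NaN
```

If you want unmapped values preserved instead, `s.replace({...})` leaves anything not in the dictionary untouched.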
applymap (renamed to map on DataFrame in pandas 2.1+) — element-wise on entire DataFrame:
# Format all numeric values to 2 decimal places
df[['col1', 'col2']].map(lambda x: f'{x:.2f}')
Useful One-Liners
# Remove duplicate rows
df.drop_duplicates()
df.drop_duplicates(subset=['email'], keep='last')
# Change column types
df['price'] = df['price'].astype(float)
df['id'] = df['id'].astype(str)
# Clip outliers
df['score'] = df['score'].clip(lower=0, upper=100)
# Rank values
df['salary_rank'] = df['salary'].rank(ascending=False)
# Cumulative sum
df['running_total'] = df['revenue'].cumsum()
# Percentage change
df['pct_change'] = df['revenue'].pct_change()
# Bin continuous data into categories
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, 100],
labels=['young', 'mid', 'senior', 'veteran'])
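A runnable sketch of the binning one-liner, using a few invented ages; note that `pd.cut` bins are open on the left and closed on the right by default, so 25 falls into 'young' but 26 falls into 'mid':

```python
import pandas as pd

ages = pd.Series([22, 30, 45, 70])
groups = pd.cut(ages, bins=[0, 25, 35, 50, 100],
                labels=['young', 'mid', 'senior', 'veteran'])

print(groups.tolist())  # ['young', 'mid', 'senior', 'veteran']
```

Pass `right=False` if you want bins closed on the left instead, and `pd.qcut` if you want quantile-based bins with roughly equal counts.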
Performance Tips
For large datasets, these habits make a measurable difference:
# Read only the columns you need
df = pd.read_csv('huge_file.csv', usecols=['id', 'name', 'amount'])
# Use category dtype for low-cardinality string columns
df['state'] = df['state'].astype('category')
# Use read_csv with chunksize for files that do not fit in memory
for chunk in pd.read_csv('huge.csv', chunksize=100000):
process(chunk)
# Avoid iterrows — use vectorized operations instead
# Slow:
for idx, row in df.iterrows():
df.loc[idx, 'new'] = row['a'] + row['b']
# Fast:
df['new'] = df['a'] + df['b']
This cheat sheet covers the operations you will use in the vast majority of pandas workflows. Bookmark it, keep it open while you work, and refer back to it until the syntax becomes second nature.
Learn Python from scratch in our free Python for Business Beginners textbook.