Python for Excel: Automate Spreadsheets with Python
Learn how to read, write, analyse, and automate Excel spreadsheets with Python. Covers openpyxl for Excel files and pandas for data analysis, with practical copy-paste examples.
Overview
If you spend hours every week doing the same thing in Excel — copying data between sheets, applying formulas, reformatting reports, merging files — Python can do it for you in seconds.
This guide covers two libraries:
| Library | Best For | Think of It As |
|---|---|---|
| openpyxl | Reading and writing .xlsx files directly (formatting, cell-by-cell control) | A robot that clicks through Excel for you |
| pandas (3.x) | Analysing, filtering, transforming, and summarising data | Excel on steroids — pivot tables, VLOOKUP, and more in one line |
You do not need to be a Python expert. If you have read our Python for Beginners guide, you have everything you need.
Setup
Install the Libraries
pip install openpyxl pandas xlsxwriter
openpyxl— read and write.xlsxfiles with full formatting controlpandas— data analysis (uses openpyxl under the hood for Excel I/O). pandas 3.0 brought significant performance improvements with Copy-on-Write semantics enabled by default. For very large datasets,polarsis still a faster alternative, but the gap has narrowedxlsxwriter— alternative Excel writer with advanced formatting (optional)
pandas 3.0 string dtype change: pandas 3.0 changed how string columns are inferred. Strings now default to the str dtype instead of object. This means methods like df["column"].str.contains(...) work more reliably out of the box. If you are following older tutorials or upgrading existing scripts, you may notice different dtype outputs when calling df.dtypes — this is expected and all the string operations shown in this guide remain the same.
Create a Sample File
Save this as sample_data.xlsx to follow along (or use any Excel file you already have):
| Name | Department | Salary | Start Date | Status |
|---|---|---|---|---|
| Alice Johnson | Engineering | 95000 | 2022-03-15 | Active |
| Bob Smith | Marketing | 72000 | 2021-07-01 | Active |
| Carol White | Engineering | 105000 | 2020-01-10 | Active |
| Dave Brown | Sales | 68000 | 2023-06-20 | Inactive |
| Eve Davis | Marketing | 78000 | 2022-11-05 | Active |
| Frank Miller | Sales | 71000 | 2021-09-14 | Active |
| Grace Lee | Engineering | 112000 | 2019-04-22 | Active |
| Henry Wilson | Marketing | 65000 | 2024-01-08 | Active |
Part 1: openpyxl — Direct Excel Control
Reading an Excel File
from openpyxl import load_workbook
# Open the file
wb = load_workbook("sample_data.xlsx")
ws = wb.active # Get the active sheet
# Print every row
for row in ws.iter_rows(min_row=2, values_only=True): # min_row=2 skips the header
name, dept, salary, start_date, status = row
print(f"{name} | {dept} | ${salary:,}")
Reading Specific Cells
from openpyxl import load_workbook
wb = load_workbook("sample_data.xlsx")
ws = wb.active
# Read a specific cell
print(ws["A2"].value) # "Alice Johnson"
print(ws["C2"].value) # 95000
# Read a range
for row in ws["A2":"C5"]:
for cell in row:
print(cell.value, end=" | ")
print()
Writing a New Excel File
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws.title = "Employees"
# Write headers
headers = ["Name", "Department", "Salary"]
ws.append(headers)
# Write data rows
data = [
["Alice Johnson", "Engineering", 95000],
["Bob Smith", "Marketing", 72000],
["Carol White", "Engineering", 105000],
]
for row in data:
ws.append(row)
# Save
wb.save("new_report.xlsx")
print("File saved.")
Formatting Cells
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side, numbers
wb = Workbook()
ws = wb.active
ws.title = "Formatted Report"
# Header row
headers = ["Name", "Department", "Salary", "Status"]
ws.append(headers)
# Style the header
header_font = Font(bold=True, color="FFFFFF", size=12)
header_fill = PatternFill(start_color="BE1E2D", end_color="BE1E2D", fill_type="solid")
for cell in ws[1]:
cell.font = header_font
cell.fill = header_fill
cell.alignment = Alignment(horizontal="center")
# Add data
data = [
["Alice Johnson", "Engineering", 95000, "Active"],
["Bob Smith", "Marketing", 72000, "Active"],
["Dave Brown", "Sales", 68000, "Inactive"],
]
for row in data:
ws.append(row)
# Format salary column as currency
for row in ws.iter_rows(min_row=2, min_col=3, max_col=3):
for cell in row:
cell.number_format = '$#,##0'
# Auto-adjust column widths
for col in ws.columns:
max_length = max(len(str(cell.value or "")) for cell in col)
ws.column_dimensions[col[0].column_letter].width = max_length + 4
# Conditional formatting — highlight "Inactive" in red
from openpyxl.styles import PatternFill
red_fill = PatternFill(start_color="FFE0E0", end_color="FFE0E0", fill_type="solid")
for row in ws.iter_rows(min_row=2):
if row[3].value == "Inactive":
for cell in row:
cell.fill = red_fill
wb.save("formatted_report.xlsx")
print("Formatted report saved.")
Adding Formulas
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
# Headers and data
ws.append(["Product", "Price", "Quantity", "Total"])
ws.append(["Widget A", 10.50, 100, "=B2*C2"])
ws.append(["Widget B", 25.00, 50, "=B3*C3"])
ws.append(["Widget C", 5.75, 200, "=B4*C4"])
# Summary row
ws.append([])
ws.append(["", "", "Grand Total:", "=SUM(D2:D4)"])
wb.save("formulas.xlsx")
Formulas are saved as text in openpyxl. They are evaluated when you open the file in Excel, not by Python. If you need Python to calculate values, do the maths in Python and write the result.
Working with Multiple Sheets
from openpyxl import Workbook
wb = Workbook()
# Rename default sheet
ws1 = wb.active
ws1.title = "Summary"
# Create additional sheets
ws2 = wb.create_sheet("Q1 Data")
ws3 = wb.create_sheet("Q2 Data")
# Write to each sheet
ws1.append(["Quarter", "Revenue"])
ws1.append(["Q1", "=SUM('Q1 Data'!B2:B100)"])
ws1.append(["Q2", "=SUM('Q2 Data'!B2:B100)"])
ws2.append(["Month", "Revenue"])
ws2.append(["January", 50000])
ws2.append(["February", 55000])
ws2.append(["March", 60000])
ws3.append(["Month", "Revenue"])
ws3.append(["April", 62000])
ws3.append(["May", 58000])
ws3.append(["June", 67000])
wb.save("multi_sheet.xlsx")
Part 2: pandas — Data Analysis Power
pandas is where Python truly replaces Excel. It handles the tasks you would normally do with VLOOKUP, pivot tables, and complex formulas — but faster and reproducibly.
Reading an Excel File into pandas
import pandas as pd
# Read the file — pandas creates a "DataFrame" (a table)
df = pd.read_excel("sample_data.xlsx")
# Preview the data
print(df.head()) # First 5 rows
print(df.shape) # (rows, columns) — e.g. (8, 5)
print(df.columns) # Column names
print(df.dtypes) # Data types of each column
print(df.describe()) # Summary statistics for numeric columns
Filtering Data (Like Excel's Filter)
import pandas as pd
df = pd.read_excel("sample_data.xlsx")
# Filter: only active employees
active = df[df["Status"] == "Active"]
print(active)
# Filter: Engineering department with salary > 100k
senior_eng = df[(df["Department"] == "Engineering") & (df["Salary"] > 100000)]
print(senior_eng)
# Filter: anyone in Marketing OR Sales
mkt_sales = df[df["Department"].isin(["Marketing", "Sales"])]
print(mkt_sales)
# Filter: names containing "son"
has_son = df[df["Name"].str.contains("son", case=False)]
print(has_son)
Sorting
# Sort by salary (highest first)
sorted_df = df.sort_values("Salary", ascending=False)
print(sorted_df)
# Sort by department, then by salary within each department
sorted_df = df.sort_values(["Department", "Salary"], ascending=[True, False])
print(sorted_df)
Adding and Modifying Columns
import pandas as pd
df = pd.read_excel("sample_data.xlsx")
# Add a new column — annual bonus (10% of salary)
df["Bonus"] = df["Salary"] * 0.10
# Add a column based on a condition
df["Pay Grade"] = df["Salary"].apply(lambda s: "Senior" if s >= 90000 else "Standard")
# Rename a column
df = df.rename(columns={"Start Date": "Hire Date"})
# Format a column
df["Salary Formatted"] = df["Salary"].apply(lambda s: f"${s:,.0f}")
print(df[["Name", "Salary Formatted", "Bonus", "Pay Grade"]])
Group By — Like Pivot Tables
This is the pandas equivalent of Excel's pivot tables and SUMIFS.
import pandas as pd
df = pd.read_excel("sample_data.xlsx")
# Average salary by department
dept_avg = df.groupby("Department")["Salary"].mean()
print(dept_avg)
# Department
# Engineering 104000.0
# Marketing 71666.7
# Sales 69500.0
# Multiple aggregations
summary = df.groupby("Department").agg(
Headcount=("Name", "count"),
Avg_Salary=("Salary", "mean"),
Max_Salary=("Salary", "max"),
Min_Salary=("Salary", "min"),
Total_Payroll=("Salary", "sum"),
).round(0)
print(summary)
VLOOKUP Equivalent — Merging DataFrames
import pandas as pd
# Main employee data
employees = pd.DataFrame({
"Name": ["Alice", "Bob", "Carol"],
"Dept_ID": [101, 102, 101],
})
# Department lookup table
departments = pd.DataFrame({
"Dept_ID": [101, 102, 103],
"Dept_Name": ["Engineering", "Marketing", "Sales"],
"Manager": ["Jane", "Tom", "Sarah"],
})
# VLOOKUP: merge on Dept_ID
result = employees.merge(departments, on="Dept_ID", how="left")
print(result)
# Name Dept_ID Dept_Name Manager
# 0 Alice 101 Engineering Jane
# 1 Bob 102 Marketing Tom
# 2 Carol 101 Engineering Jane
Combining Multiple Excel Files
import pandas as pd
import glob
# Read all .xlsx files in a folder and combine them
files = glob.glob("reports/*.xlsx")
all_data = []
for file in files:
df = pd.read_excel(file)
df["Source File"] = file # Track which file each row came from
all_data.append(df)
combined = pd.concat(all_data, ignore_index=True)
print(f"Combined {len(files)} files — {len(combined)} total rows")
# Save the combined data
combined.to_excel("combined_report.xlsx", index=False)
This is one of the most common time-savers. If you receive 12 monthly reports as separate Excel files and need to combine them, this script replaces an hour of manual copy-paste.
Removing Duplicates
# Remove exact duplicate rows
df_clean = df.drop_duplicates()
# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=["Name", "Department"])
# Keep the last occurrence instead of the first
df_clean = df.drop_duplicates(subset=["Name"], keep="last")
print(f"Removed {len(df) - len(df_clean)} duplicates")
Handling Missing Data
# Find missing values
print(df.isnull().sum()) # Count of missing values per column
# Fill missing values
df["Department"] = df["Department"].fillna("Unknown")
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows where a specific column is missing
df_clean = df.dropna(subset=["Email"])
Writing Back to Excel
import pandas as pd
df = pd.read_excel("sample_data.xlsx")
# Basic write
df.to_excel("output.xlsx", index=False)
# Write multiple sheets to one file
with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
# Active employees on one sheet
active = df[df["Status"] == "Active"]
active.to_excel(writer, sheet_name="Active", index=False)
# Inactive on another
inactive = df[df["Status"] == "Inactive"]
inactive.to_excel(writer, sheet_name="Inactive", index=False)
# Summary on a third
summary = df.groupby("Department").agg(
Count=("Name", "count"),
Avg_Salary=("Salary", "mean"),
).round(0)
summary.to_excel(writer, sheet_name="Summary")
print("Report saved with 3 sheets.")
Part 3: Real-World Recipes
Recipe 1: Monthly Report Formatter
Takes raw data and produces a formatted Excel report with a summary sheet.
import pandas as pd
def format_monthly_report(input_file, output_file):
"""Read raw data and create a formatted monthly report."""
df = pd.read_excel(input_file)
with pd.ExcelWriter(output_file, engine="xlsxwriter") as writer:
# Raw data sheet
df.to_excel(writer, sheet_name="Data", index=False)
# Summary sheet
summary = df.groupby("Department").agg(
Employees=("Name", "count"),
Total_Payroll=("Salary", "sum"),
Average_Salary=("Salary", "mean"),
).round(0)
summary.to_excel(writer, sheet_name="Summary")
# Format the summary sheet
workbook = writer.book
worksheet = writer.sheets["Summary"]
money_fmt = workbook.add_format({"num_format": "$#,##0"})
worksheet.set_column("B:D", 18, money_fmt)
header_fmt = workbook.add_format({
"bold": True,
"bg_color": "#BE1E2D",
"font_color": "#FFFFFF",
"border": 1,
})
for col_num, value in enumerate(["Department"] + list(summary.columns)):
worksheet.write(0, col_num, value, header_fmt)
print(f"Report saved to {output_file}")
format_monthly_report("sample_data.xlsx", "monthly_report.xlsx")
Recipe 2: Find Differences Between Two Spreadsheets
import pandas as pd
def compare_spreadsheets(file_old, file_new, key_column="Name"):
"""Compare two spreadsheets and show what changed."""
old = pd.read_excel(file_old)
new = pd.read_excel(file_new)
# New rows (in new but not in old)
added = new[~new[key_column].isin(old[key_column])]
print(f"\n--- Added ({len(added)} rows) ---")
print(added)
# Removed rows (in old but not in new)
removed = old[~old[key_column].isin(new[key_column])]
print(f"\n--- Removed ({len(removed)} rows) ---")
print(removed)
# Changed rows (same key, different values)
common_keys = set(old[key_column]) & set(new[key_column])
old_common = old[old[key_column].isin(common_keys)].set_index(key_column).sort_index()
new_common = new[new[key_column].isin(common_keys)].set_index(key_column).sort_index()
changes = old_common.compare(new_common)
if not changes.empty:
print(f"\n--- Changed ({len(changes)} rows) ---")
print(changes)
else:
print("\nNo changes in existing rows.")
compare_spreadsheets("last_month.xlsx", "this_month.xlsx")
Recipe 3: Split One File into Multiple Files by Column
import pandas as pd
import os
def split_by_column(input_file, column, output_folder="split_output"):
"""Split an Excel file into separate files based on a column value."""
df = pd.read_excel(input_file)
os.makedirs(output_folder, exist_ok=True)
for value in df[column].unique():
subset = df[df[column] == value]
safe_name = str(value).replace("/", "-").replace(" ", "_")
output_path = os.path.join(output_folder, f"{safe_name}.xlsx")
subset.to_excel(output_path, index=False)
print(f" {output_path} — {len(subset)} rows")
print(f"\nSplit into {df[column].nunique()} files.")
split_by_column("sample_data.xlsx", "Department")
# Creates: split_output/Engineering.xlsx, split_output/Marketing.xlsx, split_output/Sales.xlsx
Recipe 4: Email-Ready Summary from Excel Data
import pandas as pd
def generate_email_summary(input_file):
"""Generate a plain-text summary suitable for pasting into an email."""
df = pd.read_excel(input_file)
total_employees = len(df)
active = len(df[df["Status"] == "Active"])
total_payroll = df["Salary"].sum()
avg_salary = df["Salary"].mean()
dept_summary = df.groupby("Department").agg(
Count=("Name", "count"),
Avg_Salary=("Salary", "mean"),
).round(0)
summary = f"""
EMPLOYEE REPORT SUMMARY
=======================
Total Employees: {total_employees}
Active: {active}
Total Payroll: ${total_payroll:,.0f}
Average Salary: ${avg_salary:,.0f}
BY DEPARTMENT:
{dept_summary.to_string()}
"""
print(summary)
return summary
generate_email_summary("sample_data.xlsx")
Common Tasks Cheat Sheet
| Excel Task | pandas Equivalent |
|---|---|
| Open a file | pd.read_excel("file.xlsx") |
| Save a file | df.to_excel("file.xlsx", index=False) |
| Filter rows | df[df["Column"] == "value"] |
| Sort | df.sort_values("Column") |
| Remove duplicates | df.drop_duplicates() |
| VLOOKUP | df.merge(other_df, on="key") |
| Pivot table | df.groupby("Column").agg(...) |
| SUMIF | df[df["Col"] == "X"]["Amount"].sum() |
| COUNTIF | len(df[df["Col"] == "X"]) |
| CONCATENATE | df["Full"] = df["First"] + " " + df["Last"] |
| IF formula | df["New"] = df["Col"].apply(lambda x: "A" if x > 50 else "B") |
| Fill blanks | df["Col"].fillna("default") |
| Add new column | df["New"] = df["Col"] * 1.1 |
| Read specific sheet | pd.read_excel("file.xlsx", sheet_name="Sheet2") |
| Read all sheets | pd.read_excel("file.xlsx", sheet_name=None) (returns dict) |
Save this cheat sheet — it covers 90% of what most people need when moving from Excel to Python. When in doubt, search "pandas equivalent of [Excel function]" and you will find it.
Next Steps
- Start with one repetitive Excel task you do weekly and automate it
- Use
pandasfor data analysis andopenpyxlwhen you need precise cell formatting - Explore our Python for Beginners guide if any code in this article was unclear
- Look into scheduling your Python scripts with Task Scheduler (Windows) or cron (Mac/Linux) to run them automatically
Related Articles
Was this article helpful?
