Learning Objectives
- Generate matplotlib/seaborn charts and save them as image files or in-memory buffers
- Create PDF reports with FPDF2 or ReportLab including charts, text, tables, and page numbers
- Create PowerPoint slides with embedded charts using python-pptx
- Create HTML email reports with inline chart images using smtplib and email.mime
- Build a reporting pipeline: parameterized function that generates a complete report
- Use Jinja2 templates for HTML reports with chart insertion slots
- Schedule automated report generation with cron (Unix) or Task Scheduler (Windows)
In This Chapter
- 31.1 Why Reports Still Matter
- 31.2 Charts to Bytes: The Foundation
- 31.3 PDF Generation with FPDF2
- 31.4 PDF Generation with ReportLab
- 31.5 PowerPoint Generation with python-pptx
- 31.6 HTML Email Reports
- 31.7 Jinja2 Templates for HTML Reports
- 31.8 Building a Report Pipeline
- 31.9 Scheduling with Cron and Task Scheduler
- 31.10 Progressive Project: Automated Climate Report
- 31.11 Reporting Pitfalls
- 31.12 HTML to PDF: A Different Approach
- 31.13 Excel Reports with openpyxl
- 31.14 CSV and Parquet for Data Delivery
- 31.15 Error Handling and Monitoring for Scheduled Reports
- 31.16 Incremental Reports and Data Freshness
- 31.17 Versioning and Report History
- 31.18 Parameterization and Bulk Generation
- 31.19 Security Considerations for Reports
- 31.20 Reports in the Era of LLMs
- 31.21 Check Your Understanding
- 31.22 Chapter Summary
- 31.23 Spaced Review
Chapter 31: Automated Reporting — Generating Charts for PDFs, Slides, and Emails
"Dashboards are for people who want to explore. Reports are for people who want the answer delivered." — unattributed
31.1 Why Reports Still Matter
In a world of dashboards, it is tempting to think that the report is obsolete. Why produce a static PDF when you could have an interactive dashboard? Why send a weekly email when users could check a live URL? The argument for dashboards is strong for exploratory use cases.
But reports persist, and for good reasons.
Push vs. pull. A dashboard is pull-based: users must remember to check it. A report is push-based: it arrives in the user's inbox (or on their desk) without any action from them. For stakeholders who receive information from dozens of sources, push is much more effective — the report shows up in their morning email, they read it over coffee, and they act on it. No visit to a URL required.
Audit trail. A report is a snapshot. "Here are the numbers as of last Thursday." This is a permanent record that can be cited, archived, and referred to later. A dashboard changes as the underlying data updates, which is great for real-time decision-making but bad for "what did we know on Thursday?" questions. Reports create an audit trail that dashboards do not.
Narrative structure. A report has a beginning, middle, and end. The author decides the story arc, the emphasis, and the conclusions. A dashboard is a tool for the user to explore — the narrative is whatever the user builds from their clicks. For communicating a specific conclusion to a specific audience, the controlled narrative of a report is often more effective than the freedom of a dashboard.
Compliance and regulation. Many industries require formal reports — financial reports to regulators, clinical trial reports to the FDA, environmental reports to governments. These are paper (or PDF) deliverables by law, not dashboards. The tooling for producing them must generate persistent documents with specific formats.
Distribution. A PDF can be emailed, printed, archived, forwarded, and viewed on any device. A dashboard URL requires the recipient to click, wait for load, and navigate. For wide distribution to people who may not have technical setups, the PDF wins.
This chapter covers the tools for generating reports automatically from Python: charts in memory, PDF generation with FPDF2 and ReportLab, PowerPoint with python-pptx, HTML emails with smtplib, Jinja2 templates for HTML reports, and scheduling. The chapter is applied — there is no new conceptual material — but the tools are specific and worth learning.
31.2 Charts to Bytes: The Foundation
Every report-generation pipeline starts with the same primitive: produce a chart and get its bytes. You build a matplotlib or seaborn figure, save it to an in-memory buffer, and read the bytes. The bytes can then be written to a file, embedded in a PDF, inserted into a PowerPoint slide, or attached to an email.
The standard pattern uses io.BytesIO:
import io
import matplotlib.pyplot as plt
def chart_to_bytes(fig, format="png", dpi=150):
    buf = io.BytesIO()
    fig.savefig(buf, format=format, dpi=dpi, bbox_inches="tight")
    buf.seek(0)
    return buf.getvalue()
The function takes a matplotlib Figure and returns its bytes in the specified format (png, pdf, svg, jpg). The BytesIO object acts as an in-memory file, so you avoid the overhead of writing to disk and reading back. The seek(0) is not strictly required before getvalue(), which returns the buffer's full contents regardless of position, but it matters whenever you pass the buffer object itself to a library that reads from the current position, as later sections do.
Usage:
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
chart_bytes = chart_to_bytes(fig, format="png", dpi=300)
plt.close(fig) # free memory
The plt.close(fig) is important in loops or long-running scripts: matplotlib figures accumulate in memory unless explicitly closed, and a report script that generates 100 charts can exhaust memory otherwise.
The same pattern works for any format. For PDF-embedded charts, use format="pdf". For web emails, format="png". For vector output in reports, format="svg". The downstream library (FPDF2, ReportLab, python-pptx) accepts the bytes and embeds them appropriately.
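To see the formats side by side, here is a self-contained sketch using a compact variant of the helper above, assuming a headless Agg backend (appropriate for servers and cron jobs):

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: works on servers and in cron jobs
import matplotlib.pyplot as plt

def chart_to_bytes(fig, format="png", dpi=150):
    buf = io.BytesIO()
    fig.savefig(buf, format=format, dpi=dpi, bbox_inches="tight")
    return buf.getvalue()

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
# One figure, three delivery formats
outputs = {fmt: chart_to_bytes(fig, format=fmt) for fmt in ("png", "svg", "pdf")}
plt.close(fig)
```

Each value starts with its format's signature bytes (the PNG signature, an XML declaration for SVG, %PDF- for PDF), which makes a cheap sanity check in tests.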
31.3 PDF Generation with FPDF2
FPDF2 is a Python port of the PHP FPDF library. It is simple, has few dependencies, and is sufficient for most report use cases. Install with pip install fpdf2.
A minimal PDF:
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.set_font("helvetica", size=12)
pdf.cell(0, 10, "Hello, World!", ln=True)
pdf.output("hello.pdf")
The cell function writes a text cell with a specified height and content. ln=True moves to the next line after the cell. output saves the PDF to a file.
For a report with charts:
from fpdf import FPDF
import io
import matplotlib.pyplot as plt
# Generate chart in memory
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
ax.set_title("Sample Chart")
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=150, bbox_inches="tight")
buf.seek(0)
plt.close(fig)
# Build PDF
pdf = FPDF()
pdf.add_page()
pdf.set_font("helvetica", size=20)
pdf.cell(0, 15, "Monthly Report", ln=True, align="C")
pdf.set_font("helvetica", size=12)
pdf.multi_cell(0, 6, "This is the summary paragraph for the monthly report. It describes the key findings and provides context for the charts below.")
pdf.image(buf, x=20, y=None, w=170)
pdf.output("report.pdf")
pdf.image accepts an in-memory BytesIO object (or a file path) and embeds it at the current position. The w parameter sets the width in the document's unit, which is millimeters by default for FPDF() (w=170 nearly fills an A4 page, 210 mm wide, leaving margins); x=20 sets the left offset; y=None means "current y position." After the image, subsequent content flows below it.
FPDF2's main functions:
- add_page() — start a new page.
- set_font(family, style, size) — change the active font.
- cell(w, h, text, ln, align, border) — write a single-line cell.
- multi_cell(w, h, text) — write a multi-line text block with word wrap.
- image(path_or_bytes, x, y, w, h) — embed an image.
- set_y(y) and set_x(x) — position the cursor.
- output(filename) — save the PDF.
For page numbers, subclass FPDF and override the footer method:
class ReportPDF(FPDF):
    def footer(self):
        self.set_y(-15)
        self.set_font("helvetica", size=8)
        self.cell(0, 10, f"Page {self.page_no()}", align="C")
pdf = ReportPDF()
pdf.add_page()
# ... content ...
pdf.output("report.pdf")
The footer method is called automatically on each page. Similarly, override header for page headers.
FPDF2 is simple enough to produce readable PDFs with modest effort. For more complex layouts (tables with custom formatting, multi-column documents, programmatic PDF manipulation), ReportLab is a more powerful alternative.
31.4 PDF Generation with ReportLab
ReportLab is the most capable Python PDF library. It supports complex layouts, tables, vector graphics, and nearly any PDF feature. It is more complex than FPDF2 but much more flexible.
A minimal ReportLab document:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [
    Paragraph("Monthly Report", styles["Title"]),
    Spacer(1, 12),
    Paragraph("This is the summary paragraph.", styles["BodyText"]),
    Spacer(1, 12),
    Image("chart.png", width=400, height=300),
]
doc.build(story)
ReportLab uses a different model from FPDF2. You build a list of flowable objects (Paragraph, Spacer, Image, Table, PageBreak) and pass it to doc.build(). ReportLab handles the layout — flowing content across pages, wrapping text, avoiding widow/orphan lines.
For tables:
from reportlab.platypus import Table, TableStyle
from reportlab.lib import colors
data = [
    ["Category", "Q1", "Q2", "Q3", "Q4"],
    ["Revenue", "1.2M", "1.5M", "1.8M", "2.1M"],
    ["Costs", "0.8M", "0.9M", "1.0M", "1.1M"],
    ["Profit", "0.4M", "0.6M", "0.8M", "1.0M"],
]
table = Table(data)
table.setStyle(TableStyle([
    ("BACKGROUND", (0, 0), (-1, 0), colors.grey),
    ("TEXTCOLOR", (0, 0), (-1, 0), colors.white),
    ("ALIGN", (0, 0), (-1, -1), "CENTER"),
    ("FONTNAME", (0, 0), (-1, 0), "Helvetica-Bold"),
    ("BOTTOMPADDING", (0, 0), (-1, 0), 12),
    ("BACKGROUND", (0, 1), (-1, -1), colors.beige),
    ("GRID", (0, 0), (-1, -1), 1, colors.black),
]))
story.append(table)
The TableStyle uses cell ranges specified as (col, row) tuples. (0, 0) is the top-left, (-1, -1) is the bottom-right. Each style directive (BACKGROUND, TEXTCOLOR, ALIGN, FONTNAME, GRID) applies to the specified range.
For in-memory chart embedding in ReportLab:
import io
from reportlab.platypus import Image
chart_bytes = chart_to_bytes(fig, format="png")  # from Section 31.2
image = Image(io.BytesIO(chart_bytes), width=400, height=300)
story.append(image)
ReportLab accepts either a filename or an in-memory buffer. The in-memory approach avoids temporary files, which simplifies deployments.
ReportLab also supports custom page templates, headers/footers, table of contents, bookmarks, multi-column layouts, and direct drawing with its Canvas API. For reports that go beyond simple "text + image + table" layouts, ReportLab is the Python standard.
31.5 PowerPoint Generation with python-pptx
For clients who want slides instead of PDFs, python-pptx generates PowerPoint (.pptx) files from Python. Install with pip install python-pptx.
A minimal presentation:
from pptx import Presentation
from pptx.util import Inches, Pt
prs = Presentation()
# Title slide
slide_layout = prs.slide_layouts[0] # Title slide layout
slide = prs.slides.add_slide(slide_layout)
slide.shapes.title.text = "Monthly Report"
slide.placeholders[1].text = "Q4 2024 Summary"  # subtitle placeholder
# Chart slide
slide_layout = prs.slide_layouts[5] # Title only layout
slide = prs.slides.add_slide(slide_layout)
slide.shapes.title.text = "Revenue Trend"
img_path = "chart.png"
slide.shapes.add_picture(img_path, Inches(1), Inches(2), width=Inches(8))
prs.save("report.pptx")
The slide_layouts list contains PowerPoint's built-in layouts: 0=Title Slide, 1=Title and Content, 5=Title Only, 6=Blank, and more. Different templates have different layouts; check prs.slide_layouts to see what is available in your template.
Position and size use Inches() or Pt() for explicit units. Inches(1) is one inch; Pt(14) is 14 points. All placement is absolute — no auto-layout like in HTML or matplotlib.
For in-memory charts:
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=150, bbox_inches="tight")
buf.seek(0)
slide.shapes.add_picture(buf, Inches(1), Inches(2), width=Inches(8))
add_picture accepts a BytesIO buffer or a file path. In-memory is usually preferred.
For text boxes with custom formatting:
from pptx.util import Pt
from pptx.dml.color import RGBColor
text_box = slide.shapes.add_textbox(Inches(1), Inches(5), Inches(8), Inches(1))
text_frame = text_box.text_frame
p = text_frame.paragraphs[0]
run = p.add_run()
run.text = "Key insight: revenue grew 15% YoY"
run.font.size = Pt(18)
run.font.bold = True
run.font.color.rgb = RGBColor(0x1F, 0x77, 0xB4)
The nested shapes → text_frame → paragraphs → runs hierarchy mirrors the PowerPoint XML structure. Each "run" is a span of text with consistent formatting.
python-pptx can also modify existing templates. If your company has a branded PowerPoint template with custom colors and fonts, you can load it and insert content:
prs = Presentation("company_template.pptx")
slide_layout = prs.slide_layouts[1] # whatever layout fits
slide = prs.slides.add_slide(slide_layout)
# ... add content ...
This is how most production PowerPoint automation works. The designer creates a template in PowerPoint with branding, and Python fills in the content. The result looks hand-crafted but is generated automatically.
31.6 HTML Email Reports
For the most lightweight delivery, send a report as an HTML email with inline charts. Python's standard library (smtplib and email.mime) handles this without any external dependencies.
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
msg = MIMEMultipart("related")
msg["Subject"] = "Weekly Report - 2024-11-15"
msg["From"] = "reports@mycompany.com"
msg["To"] = "team@mycompany.com"
html = """
<html>
<body>
<h1>Weekly Report</h1>
<p>Here are the latest metrics:</p>
<img src="cid:chart1" />
<p>Revenue grew 15% week-over-week.</p>
</body>
</html>
"""
msg.attach(MIMEText(html, "html"))
with open("chart.png", "rb") as f:
    img = MIMEImage(f.read())
img.add_header("Content-ID", "<chart1>")
msg.attach(img)
with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("user", "password")
    server.send_message(msg)
The key pattern is CID (Content-ID) references. The HTML includes <img src="cid:chart1" />, and the image is attached with a matching Content-ID: <chart1>. Email clients render this as an inline image without needing an external URL. This approach works in Gmail, Outlook, Apple Mail, and most other clients.
For multiple charts, attach each with a unique CID:
for i, chart_bytes in enumerate(charts):
    img = MIMEImage(chart_bytes)
    img.add_header("Content-ID", f"<chart{i}>")
    msg.attach(img)
The HTML then references cid:chart0, cid:chart1, etc.
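To keep the HTML in sync with the Content-ID headers, the img tags can be generated from the same zero-based index. A hypothetical helper sketch:

```python
# Hypothetical helper: emit one <img> per attached chart, using the same
# zero-based index as the Content-ID attachment loop.
def build_chart_html(n_charts):
    return "\n".join(f'<img src="cid:chart{i}" />' for i in range(n_charts))
```

Generating both sides from one loop variable means a renamed or reordered chart cannot silently break the references.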
Caveats:
- Email client rendering is inconsistent. Some clients block images by default, so the email should still be readable without images (use text descriptions).
- HTML emails support a limited subset of CSS. Inline styles are safer than style sheets. Avoid flexbox and grid; use tables for layout.
- Large images are blocked or shrunk by some clients. Keep individual images under ~500 KB.
- Attaching many images bloats the email size. Consider linking to a dashboard for detailed views.
For production email delivery, use a transactional email service (SendGrid, Mailgun, AWS SES, Postmark) instead of raw SMTP. These services handle deliverability, bounces, tracking, and scaling in ways that raw smtplib does not.
31.7 Jinja2 Templates for HTML Reports
For HTML reports (either for emails or for standalone HTML files), Jinja2 is the standard Python templating library. Jinja2 lets you write an HTML template with substitution slots and fill them in with data.
import base64
from jinja2 import Template
template_str = """
<html>
<head><title>{{ title }}</title></head>
<body>
<h1>{{ title }}</h1>
<p>Generated: {{ date }}</p>
<h2>Summary</h2>
<table>
<tr><th>Metric</th><th>Value</th><th>Change</th></tr>
{% for metric in metrics %}
<tr>
<td>{{ metric.name }}</td>
<td>{{ metric.value }}</td>
<td>{{ metric.change }}</td>
</tr>
{% endfor %}
</table>
<h2>Charts</h2>
{% for chart in charts %}
<h3>{{ chart.title }}</h3>
<img src="data:image/png;base64,{{ chart.image_b64 }}" />
{% endfor %}
</body>
</html>
"""
template = Template(template_str)
html = template.render(
    title="Monthly Report",
    date="2024-11-15",
    metrics=[
        {"name": "Revenue", "value": "$1.2M", "change": "+15%"},
        {"name": "Users", "value": "45,678", "change": "+7%"},
    ],
    charts=[
        {"title": "Revenue Trend", "image_b64": base64.b64encode(chart_bytes).decode()},
    ],
)
with open("report.html", "w") as f:
    f.write(html)
The template uses {{ variable }} for substitutions and {% for %} / {% endfor %} for loops. Jinja2 supports conditionals ({% if %}), filters ({{ variable | filter }}), inheritance, macros, and many other features.
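Conditionals and filters make templates robust to thin data. A minimal sketch (upper is a built-in filter; the format filter applies printf-style formatting):

```python
from jinja2 import Template

# Conditionals and filters: upper-case the metric name, printf-format the
# value, and fall back to a message when the metrics list is empty.
tmpl = Template(
    "{% if metrics %}"
    "{% for m in metrics %}{{ m.name | upper }}: {{ '%.1f' | format(m.value) }}\n{% endfor %}"
    "{% else %}No data available.{% endif %}"
)
out = tmpl.render(metrics=[{"name": "revenue", "value": 1.234}])
```

The else branch is exactly the kind of edge case Section 31.11 warns about: an empty dataset should produce a readable message, not a blank report.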
For inline images, encode the chart bytes as base64 and use a data URI:
import base64
image_b64 = base64.b64encode(chart_bytes).decode("utf-8")
# Then in template: <img src="data:image/png;base64,{{ image_b64 }}" />
This embeds the image directly in the HTML, making the file self-contained — no external image files to track. Good for single-file reports that can be opened in a browser or attached as files. Note, however, that many email clients (Gmail in particular) refuse to render data-URI images inside message bodies, so for inline images in email prefer the CID approach from Section 31.6.
For production use, load templates from files:
from jinja2 import Environment, FileSystemLoader
env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template("report.html")
html = template.render(**context)
This lets you keep templates in a templates/ folder separate from Python code, and it supports Jinja2's template inheritance ({% extends "base.html" %}) for consistent layouts across multiple report types.
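Inheritance can be sketched without touching the filesystem by using an in-memory DictLoader; in production the same two templates would live as files under templates/ as shown above:

```python
from jinja2 import Environment, DictLoader

# In-memory stand-in for a templates/ folder: base.html defines the frame,
# weekly.html fills the content block.
env = Environment(loader=DictLoader({
    "base.html": "<html><body><h1>{{ title }}</h1>{% block content %}{% endblock %}</body></html>",
    "weekly.html": '{% extends "base.html" %}{% block content %}<p>{{ summary }}</p>{% endblock %}',
}))
html = env.get_template("weekly.html").render(title="Weekly Report", summary="All metrics up.")
```

Every report type extends the same base, so a branding change edits one file.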
31.8 Building a Report Pipeline
A complete report pipeline is a parameterized function that takes inputs (date range, filters) and produces outputs (PDF file, email, saved charts). The pipeline structure:
def generate_monthly_report(start_date, end_date, output_dir="reports"):
    # 1. Load data
    df = load_data(start_date, end_date)
    # 2. Compute metrics
    metrics = compute_metrics(df)
    # 3. Generate charts
    charts = {}
    charts["trend"] = build_trend_chart(df)
    charts["breakdown"] = build_breakdown_chart(df)
    charts["comparison"] = build_comparison_chart(df)
    # 4. Build PDF
    pdf_path = f"{output_dir}/report_{end_date}.pdf"
    build_pdf(pdf_path, metrics, charts)
    # 5. Build HTML
    html_path = f"{output_dir}/report_{end_date}.html"
    html_content = build_html(html_path, metrics, charts)
    # 6. Send email (optional)
    send_email(recipients=["team@company.com"], pdf_path=pdf_path, html=html_content)
    return {"pdf": pdf_path, "html": html_path}
Each step is a pure function (except for send_email), which makes the pipeline easy to test and modify. To generate a report for a different period, call the function with different dates. To add a new chart, modify build_trend_chart and the PDF/HTML builders. To change the email recipients, update the call to send_email.
For configuration, use a YAML or JSON file for parameters that change between runs:
# report_config.yaml
recipients:
  - team@company.com
  - management@company.com
output_dir: /var/reports
charts:
  trend: true
  breakdown: true
  comparison: false
style:
  brand_color: "#1F77B4"
  font: "Helvetica"
Load the config in the pipeline:
import datetime
import yaml

with open("report_config.yaml") as f:
    config = yaml.safe_load(f)

generate_monthly_report(
    start_date=datetime.date(2024, 10, 1),
    end_date=datetime.date(2024, 10, 31),
    output_dir=config["output_dir"],
)
Configuration files separate code from operational parameters, making the pipeline easier to deploy and modify without code changes.
31.9 Scheduling with Cron and Task Scheduler
Reports that run on a schedule (weekly, monthly, quarterly) need to be triggered automatically. The standard tools:
Cron (Linux/macOS): add an entry to the system crontab with crontab -e:
# Run the report every Monday at 8 AM
0 8 * * 1 /usr/bin/python3 /path/to/generate_report.py
# Run monthly on the 1st at 9 AM
0 9 1 * * /usr/bin/python3 /path/to/generate_report.py
The cron syntax is minute hour day-of-month month day-of-week command. Use * for "any" and specific numbers for fixed values. Cron handles scheduling and runs the command in the background; if local mail is configured, any output is emailed to the account's user, and otherwise it is discarded, so redirect stdout/stderr to a log file in the command if you need a record.
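For unattended jobs it is worth capturing output explicitly rather than relying on cron's mail. A crontab sketch (paths are placeholders):

```shell
# Weekly report, Mondays at 8 AM, with stdout and stderr appended to a log
0 8 * * 1 /usr/bin/python3 /path/to/generate_report.py >> /var/log/reports/weekly.log 2>&1
```

The 2>&1 redirects stderr into the same log, so tracebacks land next to normal output.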
Windows Task Scheduler: create a scheduled task via the Task Scheduler GUI or schtasks.exe command. Specify the Python interpreter, the script path, and the schedule. Similar functionality to cron but with a GUI.
Python scheduling libraries: schedule, APScheduler, and rocketry provide in-process scheduling for long-running Python applications. Useful when you want the scheduling logic to live with the code rather than in the OS.
Airflow / Prefect / Dagster: workflow orchestration tools for complex pipelines. Overkill for a single report but appropriate for a data team with many scheduled jobs.
Cloud schedulers: AWS EventBridge, GCP Cloud Scheduler, Azure Logic Apps. Run serverless functions on schedules. Useful for cloud-native deployments.
For most report pipelines, cron (or Task Scheduler on Windows) is sufficient. The schedule is simple, the script is a one-liner, and the deployment is just a file on a server. Avoid the complexity of Airflow unless you have a real need.
31.10 Progressive Project: Automated Climate Report
The chapter's climate project produces a monthly PDF report with four charts and a narrative summary. The pipeline:
import matplotlib.pyplot as plt
import pandas as pd
from fpdf import FPDF
import io
from datetime import datetime
def generate_climate_report(year, month, output_path):
    # Load data
    df = pd.read_csv("climate.csv", parse_dates=["date"])
    df = df[(df["date"].dt.year == year) & (df["date"].dt.month == month)]
    # Compute metrics
    temp_avg = df["temperature_anomaly"].mean()
    co2_avg = df["co2_ppm"].mean()
    trend = "rising" if df["temperature_anomaly"].iloc[-1] > df["temperature_anomaly"].iloc[0] else "falling"
    # Build charts in memory
    def make_chart(fig_fn):
        buf = io.BytesIO()
        fig = fig_fn()
        fig.savefig(buf, format="png", dpi=150, bbox_inches="tight")
        plt.close(fig)
        buf.seek(0)
        return buf
    chart_temp = make_chart(lambda: plot_temperature(df))
    chart_co2 = make_chart(lambda: plot_co2(df))
    chart_scatter = make_chart(lambda: plot_scatter(df))
    chart_monthly = make_chart(lambda: plot_monthly(df))
    # Build PDF
    pdf = FPDF()
    pdf.add_page()
    # Header
    pdf.set_font("helvetica", "B", 20)
    pdf.cell(0, 15, f"Climate Report: {year}-{month:02d}", ln=True, align="C")
    pdf.set_font("helvetica", "", 11)
    pdf.cell(0, 8, f"Generated: {datetime.now():%Y-%m-%d}", ln=True, align="C")
    pdf.ln(10)
    # Summary (core PDF fonts are latin-1 only, so write "CO2", not a subscript glyph)
    pdf.set_font("helvetica", "B", 14)
    pdf.cell(0, 10, "Summary", ln=True)
    pdf.set_font("helvetica", "", 11)
    pdf.multi_cell(0, 6, f"Average temperature anomaly: {temp_avg:.2f} °C. Average CO2 concentration: {co2_avg:.1f} ppm. The monthly temperature trend is {trend}.")
    pdf.ln(5)
    # Charts, one per section
    pdf.set_font("helvetica", "B", 12)
    pdf.cell(0, 8, "Temperature Anomaly", ln=True)
    pdf.image(chart_temp, w=180)
    pdf.add_page()
    pdf.cell(0, 8, "CO2 Concentration", ln=True)
    pdf.image(chart_co2, w=180)
    # ... etc for other charts
    pdf.output(output_path)

# Run it
generate_climate_report(2024, 11, "climate_report_2024_11.pdf")
Schedule with cron to run on the first day of each month:
0 9 1 * * /usr/bin/python3 /path/to/generate_climate_report.py
The pipeline produces a dated PDF that is saved to disk (and optionally emailed). Each month the script runs automatically and the report arrives without any human intervention.
31.11 Reporting Pitfalls
Font issues in PDF. Some fonts are not embedded in all PDF viewers. Stick with standard fonts (Helvetica, Times, Courier) unless you explicitly embed custom fonts.
Image resolution. Reports printed on paper need 300+ DPI images. Screen-only reports can use 150 DPI. Lower resolutions save file size but look pixelated when printed.
Email deliverability. Emails from raw smtplib often end up in spam. Use a reputable SMTP provider or a transactional email service for production.
PDF file size. A report with many high-DPI images can balloon past email attachment limits (25 MB for Gmail). Compress images, use PNG instead of TIFF, or split into multiple PDFs.
Time zones in scheduled jobs. Cron runs in the server's timezone. A job scheduled for "9 AM" runs at 9 AM server time, which may not be 9 AM for the recipients. Use UTC for servers and document the schedule clearly.
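One concrete mitigation is to stamp the report itself with an unambiguous time. A one-line sketch:

```python
from datetime import datetime, timezone

# Stamp the report in UTC so the "generated at" time is unambiguous for
# recipients in any timezone.
generated_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
```

Putting this string in the report header means readers never have to guess which clock produced the numbers.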
Broken templates. Jinja2 templates can have subtle bugs (missing variables, wrong filters). Test templates with edge cases (empty data, missing fields) before deploying.
Silent failures. A scheduled job that fails silently produces no report. Set up monitoring — email on failure, or log to a monitoring system. A cron job that runs every Monday but silently errors every other week will produce half the expected reports, and no one will notice until a stakeholder asks.
Stale data. A report is only as good as its data source. If the data pipeline is broken, the report is wrong. Validate the data before generating the report — check expected row counts, non-null values in key columns, and date ranges.
31.12 HTML to PDF: A Different Approach
An alternative to FPDF2 and ReportLab is to build the report as HTML and convert to PDF. This approach leverages the rich styling capabilities of HTML/CSS and produces PDFs that look identical to the HTML preview.
The main Python tools for HTML-to-PDF conversion:
WeasyPrint (pip install weasyprint): a pure-Python HTML/CSS to PDF renderer. Supports modern CSS including flexbox, grid, and media queries. Excellent for report documents but does not support JavaScript (so interactive charts do not work — you need static images).
from weasyprint import HTML
html_content = template.render(**context) # from Section 31.7
HTML(string=html_content).write_pdf("report.pdf")
pdfkit (pip install pdfkit + install wkhtmltopdf separately): a Python wrapper for the wkhtmltopdf command-line tool. Good CSS support and simple API, but requires installing a separate binary.
Playwright (for more complex cases): a browser automation library that can render HTML to PDF using a real headless browser (Chromium). Supports JavaScript and interactive content. Overkill for simple reports but essential for HTML that depends on JS.
The HTML-to-PDF approach has several advantages:
- Rich styling: full CSS including modern layout features.
- Reusable templates: the same Jinja2 template can produce HTML and PDF.
- Design workflow: designers can mock up reports in HTML/CSS without touching Python PDF libraries.
- Preview in browser: open the HTML in a browser to see exactly what the PDF will look like.
Disadvantages:
- Font embedding: HTML-to-PDF converters may not embed all fonts correctly. For print, test carefully.
- Page breaks: HTML does not have a native page-break concept, so controlling where pages split in a long document requires CSS page-break-before/page-break-after (or the newer break-before/break-after) rules.
- Performance: HTML-to-PDF conversion is slower than direct PDF generation with FPDF2 or ReportLab, especially for Playwright, which spins up a browser.
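The page-break problem is solvable with paged-media CSS. A sketch of rules a converter like WeasyPrint honors when splitting pages (with_paged_css is a hypothetical helper):

```python
# Paged-media CSS: force section breaks and keep tables unbroken when an
# HTML-to-PDF converter paginates the document.
PAGED_CSS = """
@page { size: A4; margin: 2cm; }
h2 { page-break-before: always; }   /* each section starts on a new page */
table { page-break-inside: avoid; } /* keep tables on one page */
"""

def with_paged_css(body_html, css=PAGED_CSS):
    return f"<html><head><style>{css}</style></head><body>{body_html}</body></html>"

doc = with_paged_css("<h1>Report</h1><h2>Section 1</h2>")
# HTML(string=doc).write_pdf("report.pdf")  # WeasyPrint call from Section 31.12
```

The @page rule sets paper size and margins; the h2 rule gives each section its own page, which mirrors what a hand-built PDF would do with add_page().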
For simple reports, direct PDF generation is usually faster. For complex, richly-styled reports — especially ones that need to look like a polished HTML page — the HTML-to-PDF approach is often easier.
31.13 Excel Reports with openpyxl
Sometimes stakeholders want data in Excel format, not PDF. openpyxl (pip install openpyxl) is the Python library for reading and writing Excel files. It supports cell formatting, charts, formulas, and multiple worksheets.
A minimal Excel report:
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.chart import LineChart, Reference
wb = Workbook()
ws = wb.active
ws.title = "Monthly Report"
# Header
ws["A1"] = "Monthly Revenue Report"
ws["A1"].font = Font(size=16, bold=True)
# Data
headers = ["Month", "Revenue", "Costs", "Profit"]
for col, header in enumerate(headers, start=1):
    cell = ws.cell(row=3, column=col, value=header)
    cell.font = Font(bold=True)
    cell.fill = PatternFill("solid", fgColor="CCCCCC")
data = [
    ("Jan", 1.2, 0.8, 0.4),
    ("Feb", 1.5, 0.9, 0.6),
    ("Mar", 1.8, 1.0, 0.8),
]
for row_idx, row_data in enumerate(data, start=4):
    for col_idx, value in enumerate(row_data, start=1):
        ws.cell(row=row_idx, column=col_idx, value=value)
# Chart
chart = LineChart()
chart.title = "Revenue over time"
data_ref = Reference(ws, min_col=2, min_row=3, max_col=2, max_row=6)
cats_ref = Reference(ws, min_col=1, min_row=4, max_row=6)
chart.add_data(data_ref, titles_from_data=True)
chart.set_categories(cats_ref)
ws.add_chart(chart, "F3")
wb.save("report.xlsx")
The result is an .xlsx file with formatted headers, data, and a native Excel chart. Stakeholders can open it in Excel and modify it if needed.
Excel reports are less polished than PDFs but more useful for audiences who will do further analysis. Finance teams especially prefer Excel because they can pivot tables, change formulas, and extend the analysis. For read-only reports, PDF is better; for editable reports, Excel.
An alternative pattern is to embed matplotlib charts as images in Excel using openpyxl.drawing.image.Image:
from openpyxl.drawing.image import Image
img = Image("chart.png")
ws.add_image(img, "F3")
This is simpler than openpyxl's native chart system but produces static images rather than editable Excel charts. Use depending on whether the stakeholder wants to edit the charts.
31.14 CSV and Parquet for Data Delivery
Not every report needs fancy formatting. Sometimes the most useful deliverable is raw data — a CSV file or Parquet file that the stakeholder can load into their own tool.
# Simple CSV export
df.to_csv("data.csv", index=False)
# Excel-friendly export: utf-8-sig writes a BOM so Excel detects UTF-8,
# and date_format controls how datetime columns are serialized
df.to_csv("data.csv", index=False, encoding="utf-8-sig", date_format="%Y-%m-%d")
# Parquet for larger datasets or cross-tool compatibility
df.to_parquet("data.parquet", compression="snappy")
When to deliver data files instead of formatted reports:
- The stakeholder will analyze the data further in their own tool (Excel, Tableau, R).
- The data is the primary output and visualization is secondary.
- The data is too large to fit in a PDF or PowerPoint.
- The stakeholder needs column-level access (sorting, filtering, pivoting).
When to deliver formatted reports:
- The audience is non-technical.
- The conclusion matters more than the raw data.
- The output will be printed or presented.
- Regulatory requirements specify a document format.
Often a good compromise is to deliver both: a formatted report with the headline findings, plus the underlying data as a CSV attachment or download link. Stakeholders who want the conclusions read the report; those who want to dig deeper have the raw data.
31.15 Error Handling and Monitoring for Scheduled Reports
A scheduled report that fails silently is worse than no report at all — users come to expect the weekly email, don't notice when it doesn't arrive, and base decisions on stale data. Production report pipelines need explicit error handling and monitoring.
Catch exceptions at the top level:
import logging
import traceback

logging.basicConfig(level=logging.INFO, filename="report.log")

def main():
    try:
        result = generate_monthly_report(year=2024, month=11)
        logging.info(f"Report generated: {result}")
        notify_success(result)
    except Exception as e:
        logging.error(f"Report generation failed: {e}")
        logging.error(traceback.format_exc())
        notify_failure(e)
        raise

if __name__ == "__main__":
    main()
The try/except catches any error, logs it, and notifies a monitoring system. The raise at the end ensures the process exits with a non-zero status, which cron and other schedulers can detect and alert on.
Notify on both success and failure. A report that succeeds silently is fine — users get the email, they know it worked. But a failure should alert someone immediately: send an email to the data team, post to a Slack channel, create a ticket in PagerDuty. The goal is that failures are noticed within hours, not days.
Use a separate logging channel. Don't rely on cron's default email of stdout/stderr. Route logs to a centralized system (CloudWatch, Datadog, Loggly, or even a simple file that is monitored). This way you can audit when reports ran, how long they took, and what errors occurred.
Validate outputs before sending. Before emailing a report to users, check that the output is sensible:
```python
if df.empty:
    raise ValueError("Input data is empty — aborting report generation")
if len(df) < MINIMUM_EXPECTED_ROWS:
    logging.warning(f"Only {len(df)} rows, expected at least {MINIMUM_EXPECTED_ROWS}")
if not all(col in df.columns for col in REQUIRED_COLUMNS):
    raise ValueError(f"Missing required columns: {set(REQUIRED_COLUMNS) - set(df.columns)}")
```
These sanity checks catch upstream data issues before they propagate into the report. A report with zero rows or missing columns is worse than a report that failed cleanly — it looks correct but is misleading.
Monitor the cron job itself. Cron jobs can fail before they reach your Python code (permissions, path issues, Python not found). Use a cron monitoring service like Cronitor, Healthchecks.io, or Dead Man's Snitch. These services expect regular "heartbeats" from your cron job and alert you if the heartbeat is missed. Simple, cheap, and catches failures that your Python-level error handling cannot.
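The heartbeat pattern is simple enough to sketch with the standard library. The check UUID below is a placeholder, not a real endpoint; Healthchecks.io-style services accept a plain GET on success and (by convention) a `/fail` suffix on failure:

```python
import urllib.request

# Hypothetical check UUID -- replace with your own monitoring URL.
BASE = "https://hc-ping.com/00000000-0000-0000-0000-000000000000"

def heartbeat_url(base: str, ok: bool) -> str:
    """Success pings hit the base URL; failures append /fail."""
    return base if ok else base + "/fail"

def send_heartbeat(ok: bool = True) -> None:
    """Fire-and-forget ping; monitoring problems must never crash the job."""
    try:
        urllib.request.urlopen(heartbeat_url(BASE, ok), timeout=10)
    except OSError:
        pass
```

Call `send_heartbeat(True)` as the last line of a successful run and `send_heartbeat(False)` in the exception handler; the monitoring service alerts when the expected ping is missed.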
Keep a manual override. If automated generation fails, you still need to deliver the report. Maintain a way to run the pipeline manually with the current parameters — a documented command, a one-click trigger, or at worst a copy-paste script. When the scheduled job fails at 4 AM, you want to be able to manually kick off a replacement run in a minute, not dig through old code for an hour.
31.16 Incremental Reports and Data Freshness
Most reports are built from a snapshot of data at a specific moment. For reports that cover a period (a month, a quarter), the data source must be stable for that period — if new data arrives after you generate the report, the numbers become misaligned.
Strategies for handling data freshness:
Snapshot-at-generation: query the database at the moment of generation. The report reflects the state at that instant. Simple and obvious, but the numbers can differ between runs (e.g., if you regenerate a report, it may show different values because new data arrived in the interim).
Snapshot-at-cutoff: use a cutoff date (e.g., "data as of end of day on the last day of the month"). The query filters to rows with timestamps before the cutoff. The report is reproducible because the cutoff is fixed.
Archive snapshots: on a schedule (daily, weekly), export a copy of the relevant data to a timestamped file. Reports read from the archived file rather than the live database. This ensures perfect reproducibility but adds storage overhead.
Document the cutoff prominently: regardless of strategy, mention the data cutoff in the report. "Data as of 2024-11-15 23:59 UTC" in the header. This tells readers how to interpret any discrepancies they notice.
For most internal reports, snapshot-at-cutoff is the right default. It is reproducible, easy to explain, and does not require extra storage. Archive snapshots are for reports that might be reviewed years later (regulatory filings, annual reports).
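The snapshot-at-cutoff strategy amounts to one timestamp filter. A minimal sketch, assuming each record carries a `ts` field (in practice this filter would live in the SQL query's WHERE clause):

```python
import datetime

def filter_to_cutoff(rows, cutoff):
    """Keep only rows at or before the cutoff, so regenerating the
    report later yields identical numbers even if late data arrived."""
    return [r for r in rows if r["ts"] <= cutoff]

cutoff = datetime.datetime(2024, 11, 30, 23, 59, tzinfo=datetime.timezone.utc)
rows = [
    {"ts": datetime.datetime(2024, 11, 15, tzinfo=datetime.timezone.utc), "value": 10},
    {"ts": datetime.datetime(2024, 12, 2, tzinfo=datetime.timezone.utc), "value": 99},  # late arrival
]
stable = filter_to_cutoff(rows, cutoff)  # the late row is excluded
```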
31.17 Versioning and Report History
A quiet but useful practice is versioning your reports: keeping every generated report indefinitely (or for a defined retention period). This provides an audit trail and lets you refer back to "what did we report in October?" questions.
```python
import os

output_path = f"reports/{year}/{month:02d}/report_{year}_{month:02d}.pdf"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
generate_report(output_path=output_path)
```
The directory structure reports/2024/11/report_2024_11.pdf keeps everything organized and makes it easy to find specific historical reports. For larger scales, consider object storage (S3, GCS) instead of a local filesystem.
Version control for the code that generates reports is also important. Keep the report script in git, tag releases with dates, and document what changed between versions. A report generated with version 1.0 of the code may have different metrics than one generated with version 2.0, and knowing the version helps reconcile discrepancies.
Some teams store a metadata file alongside each report:
```json
{
  "report_date": "2024-11-30",
  "generated_at": "2024-12-01T09:00:00Z",
  "code_version": "1.3.2",
  "data_cutoff": "2024-11-30T23:59:00Z",
  "input_rows": 1457832,
  "recipients": ["team@company.com"],
  "duration_seconds": 42.3
}
```
This metadata is invaluable for debugging historical discrepancies and for auditing compliance. It is trivial to generate and worth the effort for any production report pipeline.
31.18 Parameterization and Bulk Generation
A well-designed report pipeline is parameterized: the same script can generate reports for different periods, different customers, different departments, or different configurations. This parameterization enables bulk generation — producing many reports with the same pipeline.
Parameterize by period:
```python
def generate_report(start_date, end_date, output_path):
    df = load_data(start_date, end_date)
    # ... generate report ...
```
To produce monthly reports for an entire year:
```python
import datetime
from dateutil.relativedelta import relativedelta

for month in range(1, 13):
    start = datetime.date(2024, month, 1)
    end = start + relativedelta(months=1) - datetime.timedelta(days=1)
    generate_report(start, end, f"reports/2024_{month:02d}.pdf")
```
Parameterize by entity:
```python
def generate_customer_report(customer_id, output_path):
    df = load_customer_data(customer_id)
    # ... generate report ...

for customer_id in customer_ids:
    generate_customer_report(customer_id, f"reports/customer_{customer_id}.pdf")
```
A company with 1000 customers and 1 report per customer = 1000 reports, all generated from the same script with different parameters. Each report is customized to that customer (their data, their name, their metrics), but the generation logic is shared.
Parameterize by configuration:
```python
def generate_report(config):
    df = load_data(**config["query"])
    charts = build_charts(df, config["charts"])
    build_pdf(output_path=config["output"], charts=charts, **config["report"])

configs = [
    {"query": {...}, "charts": {...}, "report": {...}, "output": "report_A.pdf"},
    {"query": {...}, "charts": {...}, "report": {...}, "output": "report_B.pdf"},
]

for config in configs:
    generate_report(config)
```
Each config object specifies a different variant of the report. This works for A/B testing report designs, for generating reports in multiple languages, or for producing reports with different levels of detail (executive summary vs. detailed breakdown).
Bulk generation performance considerations:
- Parallelize: use `multiprocessing.Pool` or `concurrent.futures.ProcessPoolExecutor` to generate reports in parallel. A 1000-customer job that takes 1 hour sequentially can finish in 10 minutes on a 16-core machine.
- Batch data loading: load all customers' data once rather than once per report. Then filter the in-memory DataFrame for each report.
- Cache shared figures: if every report includes the same "company-wide overview" chart, generate it once and reuse across reports.
- Limit image DPI: 150 DPI is usually enough for emailable PDFs. 300 DPI doubles the file size and generation time for a benefit most readers won't notice.
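The parallelization idea can be sketched with the standard library. `ThreadPoolExecutor` is used here so the snippet runs anywhere; `ProcessPoolExecutor` has the same interface and is the better choice for CPU-bound chart rendering. `generate_customer_report` is a stand-in for your real per-customer pipeline:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_customer_report(customer_id):
    # Placeholder for the real pipeline: load data, build charts, write PDF.
    return f"reports/customer_{customer_id}.pdf"

def generate_all(customer_ids, max_workers=8):
    """Fan per-customer jobs out to a worker pool; collect output paths
    as each report finishes (completion order is arbitrary)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(generate_customer_report, cid) for cid in customer_ids]
        return [f.result() for f in as_completed(futures)]
```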
A well-parameterized pipeline can produce thousands of customized reports in a single run. This is how email marketing platforms generate personalized reports, how financial firms produce customer statements, and how compliance systems generate per-entity audit trails. The complexity is upfront (designing the parameterization); the payoff is ongoing (bulk generation with no manual work).
31.19 Security Considerations for Reports
Reports often contain sensitive data: financial numbers, customer details, employee information, medical records. Handling this data responsibly is part of the report pipeline's job.
Encrypt sensitive PDFs: both FPDF2 and ReportLab support password-protected PDFs. Users must enter a password to open the document. This is weak security (PDF passwords can be cracked) but better than nothing, and it prevents casual mis-forwarding.
Use secure email delivery: for reports emailed to external recipients, consider using encrypted email (PGP, S/MIME) or a secure delivery portal rather than plain SMTP. Transactional email services like SendGrid support TLS by default but not end-to-end encryption.
Limit retention: don't store reports longer than necessary. For compliance, you may need 7 years of retention; for convenience, a week or a month is often enough. Automate deletion of old reports to reduce the attack surface.
Control access to source data: the weakest link is often the data source, not the report. Use database credentials with read-only access, limit to the specific tables the report needs, and rotate credentials regularly.
Audit generated reports: log who received each report, when it was generated, and what parameters were used. This audit trail is valuable for security incidents and compliance reviews.
Redact sensitive fields: for reports that go to broad audiences, consider redacting or masking sensitive fields. A report showing "Customer ID: 12345" is less risky than one showing "Name: Jane Smith, SSN: 123-45-6789."
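Masking helpers are short enough to keep in the pipeline itself. A sketch (the function names and formats are illustrative, not from any library):

```python
def mask_ssn(ssn: str) -> str:
    """Keep only the last four digits: '123-45-6789' -> '***-**-6789'."""
    last4 = ssn.replace("-", "")[-4:]
    return f"***-**-{last4}"

def mask_name(name: str) -> str:
    """Reduce a full name to initials: 'Jane Smith' -> 'J.S.'"""
    return "".join(part[0] + "." for part in name.split())
```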
Review the templates: automated pipelines can leak secrets via template mistakes. If a template variable includes a database connection string or an API key, it might appear in the output. Review templates for accidental exposure.
These precautions are standard data engineering practice, but they get overlooked in the rush to automate. The report pipeline is not just a visualization tool; it is a data distribution mechanism with security implications. Treat it accordingly.
31.20 Reports in the Era of LLMs
A recent development worth mentioning: large language models (LLMs) are starting to augment or generate report content. Instead of hand-writing the narrative summary, a pipeline can call an LLM with the data and ask it to produce a written explanation.
A simple pattern:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_narrative(metrics, prev_metrics):
    prompt = f"""
    Write a one-paragraph summary of these monthly metrics compared to the previous month:
    This month: {metrics}
    Last month: {prev_metrics}
    Focus on the most significant changes. Keep it under 100 words.
    """
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
The LLM reads the metrics and writes a human-readable narrative. The report then embeds this text alongside the charts and tables. This approach is appealing because it automates the most time-consuming part of report writing (the narrative) and produces output that reads naturally.
Benefits:
- Narrative content that would otherwise require manual writing.
- Personalized summaries for different audiences (executive vs. technical).
- Multi-language reports with translated narratives.
Caveats:
- Hallucination risk: LLMs sometimes invent facts. A report that says "revenue increased 15%" when it actually increased 5% is worse than no narrative. Validate LLM outputs against the underlying data before publishing.
- Cost: each LLM call costs money, and bulk generation can add up.
- Latency: LLM calls add seconds per report, which matters for large batch runs.
- Style consistency: LLMs produce different text each time, which may violate brand voice or style guides. Use low temperature and strict prompts to reduce variation.
- Privacy: sending customer data to an external LLM API may violate privacy policies. Check before integrating.
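The hallucination caveat can be partly mechanized: extract every percentage the model wrote and check it against the numbers you actually computed. A rough regex-based sketch, which only catches explicitly stated percentages:

```python
import re

def percentages_in(text):
    """Extract numeric percentages like '15%' or '5.2%' from narrative text."""
    return {float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", text)}

def narrative_is_consistent(narrative, computed_values, tolerance=0.5):
    """Every percentage the narrative mentions must be within `tolerance`
    points of some value the pipeline actually computed."""
    return all(
        any(abs(p - v) <= tolerance for v in computed_values)
        for p in percentages_in(narrative)
    )
```

A failed check should block the narrative from shipping, falling back to a plain table of the metrics rather than an unverified summary.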
For many report pipelines, LLMs are not yet worth the trade-offs. For specific use cases (high-volume personalized reports, multilingual output, customer-facing summaries), they can be transformative. As LLMs become cheaper and more reliable, expect LLM-augmented reporting to become more common. For now, treat LLM narratives as a supplement to human-written content, not a replacement.
The broader point: automated reporting is a moving target. The libraries in this chapter (FPDF2, ReportLab, python-pptx, Jinja2) are stable and will remain useful, but the overall practice evolves as new tools become available. Stay curious, try new approaches, and keep the goal in mind: deliver the right information to the right people at the right time, with minimum manual effort.
31.21 Check Your Understanding
- Why do reports still matter in a world of dashboards?
- What is the `BytesIO` pattern for chart generation?
- What are the two main Python libraries for PDF generation?
- How do you embed an inline image in an HTML email?
- What is Jinja2, and what syntax does it use?
- How do you schedule a Python script to run every Monday at 8 AM?
- Name three reporting pitfalls and their fixes.
31.22 Chapter Summary
This chapter covered the tools for generating automated reports in Python:
- Charts to bytes: `io.BytesIO` buffers let you embed charts without temporary files.
- FPDF2: simple PDF generation for most use cases.
- ReportLab: more powerful PDF library for complex layouts.
- python-pptx: PowerPoint generation with slide layouts and chart embedding.
- smtplib and email.mime: HTML emails with inline CID-referenced images.
- Jinja2: HTML templates with `{{ variable }}` and `{% for %}` syntax.
- Pipeline structure: parameterized functions that load data, compute metrics, generate charts, and produce output.
- Scheduling: cron on Unix, Task Scheduler on Windows, Airflow for complex workflows.
- Pitfalls: font embedding, image resolution, email deliverability, time zones, silent failures.
No new threshold concept — the chapter is applied. The skill is assembling the libraries into a working pipeline that produces reliable reports on a schedule.
Chapter 32 covers theming and branding, building a visual identity that applies across dashboards, reports, and standalone charts.
31.23 Spaced Review
- From Chapter 27 (Statistical/Scientific): Publication-quality figures for journals overlap with report generation. How do the requirements differ between scientific journals and business reports?
- From Chapter 29-30 (Dashboards): Dashboards are pull-based; reports are push-based. When should a workflow use both?
- From Chapter 12 (Customization Mastery): Report charts benefit from consistent styling via rcParams and style sheets. How does this fit into Chapter 32's brand system?
- From Chapter 13 (Subplots): Reports often use multi-panel figures for single-page layouts. What differences does the report context impose compared to interactive dashboards?
- From Chapter 9 (Storytelling): A report has a narrative arc. How does Chapter 9's story structure map onto a PDF report?
- From Chapter 32 (upcoming — Theming and Branding): Report pipelines and brand systems are tightly linked. Why?
- From Chapter 7 (Typography): Report PDFs typically include typography choices. How do the Chapter 7 principles about font hierarchy and action titles apply in the report context?
- From Chapter 29-30 (Dashboards): Reports and dashboards serve different audiences and use different tools. Which stakeholders in your work prefer each, and why?
Automated reporting is the bread-and-butter of many production data teams in finance, healthcare, operations, compliance, and management. A well-designed report pipeline saves hundreds or even thousands of hours per year by eliminating manual chart-rebuilding and report assembly that would otherwise soak up analyst time. The tools in this chapter — FPDF2, ReportLab, python-pptx, smtplib, Jinja2 — are the Python standards for this work. Learn them, assemble them into your own pipeline, test them with real data, and ship reports that arrive in your stakeholders' inboxes every week without any human intervention on your part.

A closing observation worth underlining: reports are often underestimated by analysts who prefer dashboards, but for many stakeholders the report is more effective precisely because it is constrained — the narrative is authorial, the data cutoff is fixed, the distribution is push rather than pull, and the format is long familiar from decades of business communication. These apparent constraints are often genuine virtues to the reader, and the best practitioners learn to respect the format rather than fighting against it.

Dashboards and reports are complementary tools in the production data toolkit; the practitioner who knows when to reach for each — and who has built infrastructure for both — is more valuable than one who has mastered only one of the two approaches. Chapter 32 moves to brand and theme systems that apply consistency across all the output formats covered in this part of the book, from interactive dashboards to printed PDFs to PowerPoint decks to inline-embedded email charts.