Chapter 16 Key Takeaways: Excel and CSV Integration
The Core Ideas
1. pandas reads; openpyxl reads and writes with formatting.
Use pd.read_excel() and pd.read_csv() to load data into DataFrames. Use openpyxl to create formatted output workbooks. The two tools are complementary — most production workflows use both.
2. sheet_name=None reads all sheets at once.
pd.read_excel("file.xlsx", sheet_name=None) returns a dictionary {sheet_name: DataFrame}. This is the correct approach when you need to process all sheets from a multi-tab workbook without knowing the sheet names in advance.
3. load_workbook(data_only=True) returns calculated values, not formula strings.
Without data_only=True, a cell containing =SUM(A1:A10) returns the string "=SUM(A1:A10)". openpyxl has no formula evaluation engine — it returns the value Excel cached when the file was last saved. For fresh calculations, you need Excel open or xlwings.
4. pd.ExcelWriter as a context manager is the correct way to write multiple sheets.
Calling df.to_excel() multiple times on the same filename overwrites the previous file each time. Use with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer: and call df.to_excel(writer, sheet_name=...) inside the block. The workbook is finalized and saved when the block exits.
5. Define all openpyxl style objects once at the module level.
Font, PatternFill, Alignment, and Border objects are reusable. Define them as constants at the top of your file and apply them repeatedly. This is both more efficient and far easier to maintain — when you want to change the header color, you change one line.
6. Number format strings use Excel's custom format syntax.
The format '"$"#,##0'` produces `$580,000. The format '0.0%' multiplies the stored value by 100 and appends % — so store 0.385, get 38.5%. For dates: 'MM/DD/YYYY'. Always test your format strings by writing to a cell and checking the Excel output.
7. freeze_panes value is the first UNfrozen cell.
ws.freeze_panes = "A3" freezes rows 1 and 2 (keeps column headers visible). ws.freeze_panes = "B1" freezes column A only. ws.freeze_panes = "B3" freezes both rows 1-2 and column A.
8. openpyxl has no built-in column auto-fit.
Auto-sizing requires manual calculation: iterate over each column's cells, find the maximum string length, set ws.column_dimensions[col_letter].width to that length plus padding. Use a cap (e.g., min(max_length + 4, 50)) to prevent excessively wide columns.
9. The best production workflow: analyze in pandas, format with openpyxl. Do all computation, aggregation, and data transformation in pandas. Then write the results to Excel with openpyxl for formatting. This keeps business logic in pandas (easy to test, debug, and modify) and presentation in openpyxl (easy to style consistently).
10. CSV files have more edge cases than they appear.
The common failure modes: wrong encoding (use utf-8-sig for Windows/Excel-generated files), non-comma delimiters (test with sep=";", sep="\t", sep="|"), European number formats (add decimal=",", thousands="."), and files too large to fit in memory (use chunksize). Build the habit of explicitly specifying encoding on every read_csv and to_csv call.
The Python-to-Excel Decision Framework
| Use Excel directly | Use Python → Excel |
|---|---|
| One-time analysis | Repeated report generation |
| Small dataset | Any size dataset |
| Non-technical user maintains it after delivery | Output is final/read-only |
| Complex unique formatting | Consistent formatting applied each run |
| Analysis involves conditional manual judgment | Analysis is systematic and reproducible |
Common Mistakes to Avoid
Mistake: Writing calculation logic into openpyxl cell values. Don't compute averages or totals in openpyxl cell iteration. Do it in pandas first, then write the result to the cell. Mixing computation into formatting code makes both harder to debug.
Mistake: Not specifying index=False in df.to_excel().
By default, pandas writes the DataFrame index as the first column. If your index is 0, 1, 2, 3... this produces an unlabeled column of row numbers that confuses recipients. Always use index=False unless the index contains meaningful data (e.g., dates or category labels).
Mistake: Calling df.to_excel() multiple times on the same filename.
Each call creates a new workbook, overwriting the previous. Use pd.ExcelWriter to write multiple sheets to the same file.
Mistake: Applying number formatting to cells that contain strings.
A cell with the string "Fixed Fee" formatted as '"$"#,##0'` displays as `$0 or 0.00. Check that cells receiving number formats actually contain numbers.
Mistake: Using include_plotlyjs="cdn" for dashboards sent to recipients without internet.
Not specific to this chapter but relevant: when the deliverable is an Excel file, consider whether any embedded charts or images will require internet connectivity. openpyxl charts are embedded natively and always work offline.
Mistake: Forgetting data_only=True when reading formula-heavy workbooks.
If you load a workbook to read revenue totals that are calculated by Excel formulas, data_only=False (the default) returns formula strings, not numbers. The resulting DataFrame is full of "=SUM(...)" strings that will fail on any numeric operation.
Vocabulary Introduced This Chapter
openpyxl — A Python library for reading and writing Excel (.xlsx) files with full support for cell formatting, multiple sheets, merged cells, charts, and conditional formatting. Imported as from openpyxl import Workbook, load_workbook.
xlwings — A Python library that connects to a running Microsoft Excel process, enabling live read/write operations, formula recalculation, VBA macro invocation, and user-defined functions callable from Excel formulas.
ExcelWriter — A pandas context manager (pd.ExcelWriter) that enables writing multiple DataFrames to multiple sheets in a single Excel workbook.
data_only=True — A load_workbook argument that makes openpyxl return the last cached formula result (a number) rather than the formula string itself.
PatternFill — An openpyxl style object that controls background color of a cell. Created with PatternFill(fill_type="solid", fgColor="HEX_COLOR").
number_format — An openpyxl cell property that controls how a numeric value is displayed. Uses Excel's custom format string syntax (e.g., '"$"#,##0', '0.0%').
freeze_panes — A worksheet property that makes a specific row and/or column remain visible while the rest of the sheet scrolls. The value is the first cell that is NOT frozen.
chunksize — A pd.read_csv parameter that makes pandas read the file in fixed-size chunks instead of loading it all into memory. Returns an iterator over DataFrames.
BOM (Byte Order Mark) — A sequence of bytes at the start of a file that indicates the encoding. The utf-8-sig encoding reads and writes a UTF-8 BOM, which Excel requires to auto-detect UTF-8 encoding when opening CSV files.
on_bad_lines — A pd.read_csv parameter controlling behavior for lines with incorrect field counts. "error" raises an exception; "skip" drops the malformed row silently; "warn" drops it with a warning.
freeze_panes — (see above)
separation of concerns — A software design principle: keep business logic (calculations) separate from presentation logic (formatting). In the Excel workflow context: compute in pandas, format with openpyxl.