// projects / case-study · 05

Gmail → PDF
receipt pipeline.

Python automation that pulls receipts from Gmail, parses sender, amount, and date metadata, generates clean PDFs, and files them into a structured archive: an ETL workflow built on personal email.

// type
personal · live
// 01 / problem

Receipts live in email forever.

Every online purchase generates a receipt that ends up buried in Gmail — searchable, but not portable. For expense tracking, taxes, and warranty claims, you need them as actual files, organized by vendor and date. Doing it by hand is tedious and inconsistent.

// 02 / approach

Build the ETL the way you'd build it at work.

Treated personal email as a data source. Wrote a Python pipeline that extracts matching emails via the Gmail API, normalizes the parsed metadata (sender, vendor, amount, date) into a single record shape, and loads the generated PDFs into a date-partitioned directory structure.

  • Gmail API with OAuth — pulls messages matching a configurable filter
  • Metadata parsing handles HTML, plaintext, and attached-PDF variants
  • PDF generation for emails that only have HTML receipts
  • Idempotent — tracks processed message IDs to avoid duplicates on re-run
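A minimal sketch of the transform and load steps described above, using only the standard library. Everything named here is an assumption made for illustration, not the pipeline's actual code: the `Receipt` record, the `normalize` and `archive_path` helpers, the "total" regex, and the `YYYY/MM/` directory layout are all hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from email.utils import parseaddr, parsedate_to_datetime
from pathlib import Path
from typing import Optional

# Hypothetical pattern: a dollar amount following the standalone word "total".
AMOUNT_RE = re.compile(r"\btotal[^\d$]*\$\s*([\d,]+\.\d{2})", re.IGNORECASE)


@dataclass(frozen=True)
class Receipt:
    """Normalized record for one receipt email (illustrative shape)."""
    message_id: str
    vendor: str
    amount: Optional[Decimal]
    received: date


def normalize(message_id: str, from_header: str, date_header: str, body_text: str) -> Receipt:
    """Turn raw header and body strings into a Receipt."""
    display_name, address = parseaddr(from_header)
    # Prefer the sender's display name; fall back to the mail domain.
    vendor = (display_name or address.split("@")[-1]).strip()
    match = AMOUNT_RE.search(body_text)
    amount = Decimal(match.group(1).replace(",", "")) if match else None
    received = parsedate_to_datetime(date_header).date()
    return Receipt(message_id, vendor, amount, received)


def archive_path(root: Path, receipt: Receipt) -> Path:
    """Build a date-partitioned target path: root/YYYY/MM/YYYY-MM-DD_vendor_id.pdf."""
    slug = re.sub(r"[^a-z0-9]+", "-", receipt.vendor.lower()).strip("-") or "unknown"
    d = receipt.received
    return root / f"{d:%Y}" / f"{d:%m}" / f"{d:%Y-%m-%d}_{slug}_{receipt.message_id}.pdf"
```

Including the message ID in the filename is one way to keep two same-day receipts from the same vendor from colliding. A real parser would need per-vendor rules for amounts, which this single regex does not attempt.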
// 03 / outcome

An archive that actually exists.

Years of receipts now live as structured PDFs on disk, organized by date and vendor. Re-running the script is a no-op for already-processed mail. The pipeline doubles as a working reference for "what a real ETL looks like, end to end" — useful as a portfolio artifact for someone otherwise constrained by NDAs.
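The no-op re-run can be sketched as a small ledger of processed message IDs. This is an assumed design, not the project's actual storage: the `ProcessedLedger` class, the JSON file format, and the `run` helper are hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


class ProcessedLedger:
    """Persists handled message IDs in a JSON file (hypothetical format)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load previously recorded IDs, or start empty on the first run.
        self.ids = set(json.loads(path.read_text())) if path.exists() else set()

    def mark(self, message_id: str) -> None:
        """Record an ID and write the ledger back to disk immediately."""
        self.ids.add(message_id)
        self.path.write_text(json.dumps(sorted(self.ids)))


def run(message_ids: Iterable[str], ledger: ProcessedLedger,
        handle: Callable[[str], None]) -> int:
    """Process only unseen IDs; return how many were handled this run."""
    handled = 0
    for message_id in message_ids:
        if message_id in ledger.ids:
            continue  # already archived on an earlier run
        handle(message_id)
        ledger.mark(message_id)  # mark only after the handler succeeds
        handled += 1
    return handled
```

Marking an ID only after `handle` returns means a crash mid-message leaves that message unrecorded, so the next run retries it. The handler itself then needs to tolerate overwriting a partially written PDF.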

// 04 / stack

Stack.

Python · Gmail API · OAuth 2.0 · PDFKit · pdfplumber · BeautifulSoup