Receipts live in email forever.
Every online purchase generates a receipt that ends up buried in Gmail — searchable, but not portable. For expense tracking, taxes, and warranty claims, you need them as actual files, organized by vendor and date. Doing it by hand is tedious and inconsistent.
Build the ETL the way you'd build it at work.
Treated personal email as a data source. Wrote a Python pipeline that extracts matching emails via the Gmail API, transforms each message's metadata (sender, amount, date, vendor) into a normalized record, and loads the resulting PDFs into a date-partitioned directory tree.
- Gmail API with OAuth — pulls messages matching a configurable filter
- Metadata parsing handles HTML, plaintext, and attached-PDF variants
- PDF generation for emails whose receipt exists only as HTML
- Idempotent — tracks processed message IDs to avoid duplicates on re-run
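The idempotency in the last bullet can be sketched roughly as below. This is a minimal illustration, not the pipeline's actual code: the `ProcessedLedger` class, the `run` function, and the choice of a JSON file as the store are all assumptions made for the example.

```python
import json
from pathlib import Path


class ProcessedLedger:
    """Remembers which message IDs have already been archived (hypothetical helper)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.seen: set[str] = set()
        if path.exists():
            self.seen = set(json.loads(path.read_text(encoding="utf-8")))

    def is_new(self, message_id: str) -> bool:
        return message_id not in self.seen

    def mark(self, message_id: str) -> None:
        self.seen.add(message_id)
        # Write to a temp file and rename it, so a crash cannot leave
        # a half-written ledger behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.seen)), encoding="utf-8")
        tmp.replace(self.path)


def run(message_ids, ledger, archive):
    """Archive each unseen message and return how many were processed."""
    processed = 0
    for message_id in message_ids:
        if not ledger.is_new(message_id):
            continue  # already archived on an earlier run
        archive(message_id)
        ledger.mark(message_id)  # record only after the archive step succeeds
        processed += 1
    return processed
```

Marking an ID only after `archive` returns means a failed message is retried on the next run instead of being silently skipped. The trade-off is that the archive step itself should tolerate being repeated for the same message.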
An archive that actually exists.
Years of receipts now live as structured PDFs on disk, organized by date and vendor. Re-running the script is a no-op for already-processed mail. The pipeline doubles as a working reference for "what a real ETL looks like, end to end" — useful as a portfolio artifact for someone otherwise constrained by NDAs.
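For illustration, a normalized record and the date-and-vendor layout might look something like the sketch below. The `Receipt` fields mirror the metadata named earlier (sender, amount, date, vendor). The `slugify` and `archive_path` helpers and the `year/month/date_vendor_id.pdf` naming scheme are assumptions for this example, not the archive's actual layout.

```python
import re
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from pathlib import Path


@dataclass(frozen=True)
class Receipt:
    """Normalized shape for one parsed receipt (hypothetical field names)."""
    message_id: str
    sender: str
    vendor: str
    amount: Decimal
    received: date


def slugify(text: str) -> str:
    """Reduce a vendor name to a filesystem-safe lowercase slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "unknown"


def archive_path(root: Path, receipt: Receipt) -> Path:
    """Build a date-partitioned path: root/YYYY/MM/YYYY-MM-DD_vendor_id.pdf."""
    d = receipt.received
    name = f"{d.isoformat()}_{slugify(receipt.vendor)}_{receipt.message_id}.pdf"
    return root / f"{d.year:04d}" / f"{d.month:02d}" / name


receipt = Receipt(
    message_id="18c2f",
    sender="orders@example.com",
    vendor="Acme Hardware, Inc.",
    amount=Decimal("42.10"),
    received=date(2023, 3, 7),
)
print(archive_path(Path("receipts"), receipt).as_posix())
# prints: receipts/2023/03/2023-03-07_acme-hardware-inc_18c2f.pdf
```

Including the message ID in the filename keeps two receipts from the same vendor on the same day from colliding, and it makes the target path a pure function of the record, so a re-run always writes to the same place.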