Skip to content

Hershey's — Pipeline Automation & ETL Modernization

Case Study Summary

Client: The Hershey Company — Mexico Reporting Center of Excellence
Role: Data Engineer & Reporting CoE Analyst
Period: Jun 2021 – Apr 2024
Stack: Azure Data Factory · Databricks · PySpark · Power BI · Azure Synapse · Blob Storage

Impact:

  • $750K USD in recurring vendor costs eliminated
  • 2,000+ hours of manual work automated annually
  • 100+ legacy reports migrated from third-party vendor to Power BI
  • 35% improvement in pipeline reliability and runtime efficiency
  • Nominated at HSY Awards Mexico for contributions to the Mexico reporting initiative

Context

Hershey's Mexico Reporting CoE operated a reporting infrastructure heavily dependent on a third-party vendor — expensive, slow to change, and difficult to maintain internally. The data team relied on manual ETL processes that consumed thousands of hours per year, and reporting pipelines were fragile with no observability.

My mandate: modernize the full data stack, reduce vendor dependency, and build infrastructure the team could own and maintain.

What I built

ETL Pipeline Automation

Designed and implemented automated ETL pipelines across the full data lifecycle using Azure Data Factory and Databricks with PySpark. This replaced manual data processing workflows that previously required 2,000+ hours of analyst time annually.

The pipeline architecture covered:

  • Financial datasets ingestion and normalization
  • Sales and macroeconomic data integration
  • Automated transformation and validation layers
  • Delta Lake storage with partitioning for query performance

Vendor Migration — 100+ Reports

Led the migration of 100+ legacy reports from the third-party vendor to Power BI — eliminating the recurring vendor contract entirely. Each migration included:

  • Reverse-engineering the existing report logic
  • Rebuilding the supporting ETL in Azure Data Factory
  • Validating output parity against the original reports
  • Handing off to the analytics team with documentation

This project alone eliminated $750K USD in annual vendor costs.

Data Products for Analytics & Data Science

Delivered end-to-end data products integrating financial, sales, and macroeconomic datasets used by both analytics and data science teams. These products went from requirement gathering through deployment — including Power BI dashboards, semantic models, and the underlying pipeline infrastructure.

Pipeline Optimization

Diagnosed and resolved reliability issues across regional reporting pipelines, achieving a 35% improvement in pipeline runtime efficiency and significantly reducing failure rates.

Tech Stack

  • Orchestration: Azure Data Factory
  • Transformation: Databricks (PySpark), Azure Synapse Analytics
  • Storage: Azure Blob Storage / ADLS Gen2, Delta Lake
  • Visualization: Power BI (DAX, semantic modeling)
  • Languages: Python, PySpark, SQL, DAX
  • CI/CD: Azure DevOps

Key outcomes

Metric Result
Vendor cost reduction $750,000 USD/year
Manual hours automated 2,000+ hours/year
Legacy reports migrated 100+
Pipeline runtime improvement 35%
Award recognition HSY Awards Mexico nomination
  • Similar challenges in your organization?


    Whether it's vendor dependency, manual ETL, or brittle reporting infrastructure — I've solved these problems at scale. Let's talk.

    Book Intro Call