Skip to content

EY / Microsoft — Financial Pipeline Optimization

Case Study Summary

Client: Microsoft (via EY GDS) — BizApps Financial Data Platform
Role: Senior Data Engineer
Period: Sep 2025 – Present
Stack: Azure Fabric · PySpark · Azure DevOps · Python

Impact:

  • 67% runtime reduction across two critical financial pipelines
  • TenantTPIDMapping: 4.5h → 1.5h (−67%)
  • Complete Revenue: 7.5h → 2.5h (−67%)
  • Reusable observability framework deployed across data workflows
  • Production pipelines serving distributed Microsoft business teams worldwide

Context

The Microsoft BizApps Financial Data Platform processes large-scale enterprise financial data used by business teams globally. Two of its most critical pipelines — TenantTPIDMapping and Complete Revenue — were running far above acceptable execution times, with no clear visibility into why.

The problem wasn't obvious: standard profiling showed the pipelines were "working." The bottlenecks were hidden inside the execution graph.

Diagnosis

I conducted a deep diagnostic pass across both pipelines, looking beyond surface-level metrics. What I found:

  • Hidden processing failures that were being silently retried rather than surfaced — consuming execution time without producing output
  • Dependency design issues that caused downstream stages to wait unnecessarily for upstream stages that had already completed their relevant work
  • Suboptimal execution logic where transformations were running sequentially when they could run in parallel, and where full-table scans were happening where partition pruning was possible

What I changed

Pipeline redesign — TenantTPIDMapping & Complete Revenue

Redesigned the dependency graph and execution logic for both pipelines:

  • Eliminated silent retry loops by surfacing failures with proper alerting
  • Restructured stage dependencies to remove artificial sequencing bottlenecks
  • Introduced parallel execution where the data flow allowed it
  • Added partition-aware processing to reduce the volume of data read per stage

Result: combined runtime dropped from 12 hours to 4 hours.

SQL → PySpark migration

For performance-sensitive transformations, I migrated the processing layer from SQL-based execution to PySpark — improving scalability for large financial datasets and making the transformation logic easier to maintain and test.

Observability framework

Built a reusable email alerting and validation framework that generates:

  • Dynamic HTML execution reports per pipeline run
  • Pipeline status notifications (success / failure / SLA breach)
  • Custom data quality validation messages with row-level detail

This framework was deployed across multiple data workflows — not just the two pipelines I optimized — giving the team full observability across the platform for the first time.

DevOps & deployment

Managed production-grade deployment through Azure DevOps: pull requests, release coordination, and environment promotion (dev → staging → prod) for financial data solutions used by distributed Microsoft teams.

Tech Stack

  • Platform: Azure Fabric (formerly Azure Synapse + Power BI unified)
  • Processing: PySpark, SQL
  • Orchestration: Azure Data Factory, Azure DevOps
  • Observability: Python (custom HTML report generation), Logic Apps
  • Languages: Python, PySpark, SQL, YAML

Key outcomes

Pipeline Before After Reduction
TenantTPIDMapping 4.5 hours 1.5 hours −67%
Complete Revenue 7.5 hours 2.5 hours −67%
Combined 12 hours 4 hours −67%
  • Slow or unreliable data pipelines?


    Pipeline performance problems are almost never where they first appear. I diagnose the real bottleneck and fix it properly.

    Book Intro Call