Most sustainability professionals don’t think of themselves as data engineers. But the work of collecting, structuring, validating, and calculating emissions data from dozens of sources across multiple sites and business units is, at its core, a data pipeline problem.

As mandatory reporting raises the bar — external assurance, factor versioning, audit trails, year-on-year comparatives — the spreadsheet approach that worked for voluntary disclosure is breaking down. The organisations that get ahead of this are treating sustainability reporting as an engineering challenge, not an administrative one.

The data pipeline underneath every emissions report

A completed sustainability report is the output of a surprisingly complex data pipeline. Here’s what that pipeline actually does:

1. Ingestion

Data arrives from diverse sources in diverse formats:

Electricity invoices as PDFs from 12 different retailers
Gas invoices by email, some as PDFs, some as CSV exports
Fuel card statements in Excel or CSV from the fleet management system
Water meter readings photographed on site and submitted through a form
Refrigerant logs maintained locally by facility managers
Utility smart meter data at NMI (National Metering Identifier) level, via API or export from network operators

None of these sources use the same schema. Each needs to be parsed, extracted, and normalised into a consistent structure. This is ingestion — and it’s where most manual processes fail at scale.
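As a minimal sketch of what "normalised into a consistent structure" can mean, the Python below defines one common record type and one format-specific parser. The `ActivityRecord` fields, the `parse_fuel_card_csv` function, and the CSV column names (`txn_date`, `cost_centre`, `product`, `litres`) are all hypothetical; a real fuel card export will have its own layout.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ActivityRecord:
    """One normalised activity data point, whatever source it came from."""
    source_doc: str      # file name or document ID the value was read from
    site_code: str
    activity_type: str   # e.g. "electricity", "diesel"
    quantity: float
    unit: str            # unit as stated on the source document
    period_start: date
    period_end: date


def parse_fuel_card_csv(text: str, source_doc: str) -> list[ActivityRecord]:
    """Parse a (hypothetical) fuel card export into ActivityRecord rows."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        day = date.fromisoformat(row["txn_date"])
        records.append(ActivityRecord(
            source_doc=source_doc,
            site_code=row["cost_centre"].strip().upper(),
            activity_type=row["product"].strip().lower(),
            quantity=float(row["litres"]),
            unit="L",
            period_start=day,   # a single transaction covers one day
            period_end=day,
        ))
    return records


# Assumed sample export, with messy spacing and casing in the raw values.
sample = "txn_date,cost_centre,product,litres\n2024-03-04, syd01 ,Diesel,61.5\n"
records = parse_fuel_card_csv(sample, "fuelcard_2024-03.csv")
```

Each additional source (PDF invoices, meter photos, API exports) would get its own parser that emits the same record type, so everything downstream deals with a single schema.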

2. Normalisation

Once data is ingested, it needs to be normalised to common units and dimensions:

Convert kWh to MWh (or vice versa, depending on the factor table)
Attribute each data point to a site, project, business unit, and reporting period
Handle partial periods (an invoice that spans two months, or two reporting periods)
Resolve duplicates (the same invoice processed twice under different file names)

This is where the organisational knowledge of “which cost centre belongs to which site” becomes a data mapping problem.
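Two of the steps above, unit conversion and partial-period handling, can be sketched in a few lines. The sketch assumes consumption is spread evenly across the days of the invoice, which is a simplification; the function names are illustrative only.

```python
from datetime import date, timedelta


def kwh_to_mwh(kwh: float) -> float:
    """Convert kilowatt-hours to megawatt-hours."""
    return kwh / 1000.0


def apportion_by_month(start: date, end: date, quantity: float) -> dict[str, float]:
    """Split a quantity across calendar months, pro rata by days.

    Assumes even daily consumption and treats `end` as inclusive.
    """
    total_days = (end - start).days + 1
    days_per_month: dict[str, int] = {}
    day = start
    while day <= end:
        key = day.strftime("%Y-%m")
        days_per_month[key] = days_per_month.get(key, 0) + 1
        day += timedelta(days=1)
    return {month: quantity * n / total_days for month, n in days_per_month.items()}


# An invoice for 3,000 kWh covering 16 March to 14 April (30 days):
# 16 days fall in March and 14 in April.
split = apportion_by_month(date(2024, 3, 16), date(2024, 4, 14), 3000.0)
```

The same function handles an invoice that straddles two reporting periods if the keys are changed from calendar months to reporting periods.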

3. Validation

Not all data is correct. Common data quality issues in sustainability reporting:

Invoices that include estimated readings rather than actual meter reads
Fuel records entered in gallons by a US subsidiary but expected in litres
Site codes that changed mid-year due to a restructure
Missing periods (no invoice for March, either because it wasn’t collected or because nothing was genuinely consumed)
Physically implausible values (electricity consumption for a site ten times higher than any prior period)

Validation rules need to be codified and applied systematically. Anomalies should be flagged for human review rather than passed through to the calculation; that gate is the pipeline’s quality control function.
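A codified rule set might look something like the sketch below, which covers two of the issues listed: missing periods and physically implausible values. The `Flag` type, the `validate_series` function, and the ten-times threshold are assumptions chosen for illustration; real rules would be configurable per source and site.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    site_code: str
    period: str
    reason: str


def validate_series(site_code, readings, expected_periods, max_ratio=10.0):
    """Return flags for human review; nothing is corrected automatically.

    readings: dict mapping "YYYY-MM" to consumption for one site.
    """
    flags = []
    # Rule 1: every expected period must have a reading.
    for period in expected_periods:
        if period not in readings:
            flags.append(Flag(site_code, period, "missing period"))
    # Rule 2: a value far above every accepted prior period is implausible.
    history = []
    for period in sorted(readings):  # "YYYY-MM" strings sort chronologically
        value = readings[period]
        if value < 0:
            flags.append(Flag(site_code, period, "negative consumption"))
        elif history and value > max_ratio * max(history):
            flags.append(Flag(site_code, period, "implausible value"))
        else:
            history.append(value)  # only accepted values become the baseline
    return flags


readings = {"2024-01": 1000.0, "2024-02": 1100.0, "2024-04": 15000.0}
expected = ["2024-01", "2024-02", "2024-03", "2024-04"]
flags = validate_series("SYD01", readings, expected)
```

Here March is flagged as missing and April as implausible. Whether March is a collection gap or genuinely zero consumption is a question for a reviewer, not something the rule should decide.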

4. Factor application

Applying the correct NGA (National Greenhouse Accounts) factor to each data point sounds simple but requires:

Knowing the fuel type and combustion category for each Scope 1 source
Knowing the grid region for each Scope 2 electricity source
Using the correct vintage of the NGA Factors (annual publication)
Distinguishing market-based from location-based Scope 2 where PPAs or GreenPower apply
Applying the correct GWP values for non-CO₂ gases

This is deterministic computation, but it depends on metadata being correct at the upstream stages. Wrong site region assignment → wrong grid factor → wrong emissions figure.
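The dependency on upstream metadata is easy to see in a small sketch of a location-based Scope 2 calculation, where emissions are activity multiplied by the factor for the site's grid region and the chosen factor vintage. The factor values, region names, and site codes below are placeholders invented for the example; they are not published NGA figures.

```python
# (vintage, grid_region) -> kg CO2-e per kWh.
# PLACEHOLDER numbers for illustration only, not published NGA factors.
FACTORS = {
    ("2023", "REGION_A"): 0.70,
    ("2024", "REGION_A"): 0.65,
    ("2024", "REGION_B"): 0.20,
}

# Hypothetical site-to-region mapping maintained upstream.
SITE_REGION = {"SYD01": "REGION_A", "HBA01": "REGION_B"}


def scope2_location_based(site_code, kwh, vintage, site_region=SITE_REGION):
    """Apply a grid factor and return the result with its inputs recorded."""
    region = site_region[site_code]        # KeyError if the site is unmapped
    factor = FACTORS[(vintage, region)]    # KeyError if no factor for this vintage
    return {
        "site_code": site_code,
        "kwh": kwh,
        "region": region,
        "vintage": vintage,
        "factor": factor,
        "kg_co2e": kwh * factor,
    }


good = scope2_location_based("SYD01", 1000.0, "2024")
# Same invoice, same code, but the site is mapped to the wrong region upstream.
bad = scope2_location_based("SYD01", 1000.0, "2024", site_region={"SYD01": "REGION_B"})
```

The calculation runs without complaint in both cases, yet the two results differ by more than a factor of three. Failing loudly on a missing mapping, rather than falling back to a default factor, is a deliberate choice in this sketch.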

5. Aggregation

Individual activity data points need to be aggregated to the right reporting dimensions: by site, by project, by Scope, by gas, by reporting period. Different stakeholders want different views:

The sustainability team wants totals by scope and site
The project team wants emissions attributable to their project
The auditor wants the ability to drill down to individual source documents
The CFO wants the total figure for the statutory report

Aggregation is non-trivial when the underlying data has different granularities and the organisational structure changes during the year.
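One way to serve several stakeholder views from the same data is to keep the calculated rows at their finest granularity and aggregate on demand over whichever dimensions are requested. The sketch below assumes each row already carries its site, project, scope, and period; the row values and the `aggregate` helper are made up for the example.

```python
from collections import defaultdict

# Illustrative calculated rows, one per activity data point.
rows = [
    {"site": "SYD01", "project": "P1", "scope": 1, "period": "2024-03", "kg_co2e": 120.0},
    {"site": "SYD01", "project": "P1", "scope": 2, "period": "2024-03", "kg_co2e": 650.0},
    {"site": "HBA01", "project": "P2", "scope": 1, "period": "2024-03", "kg_co2e": 80.0},
    {"site": "HBA01", "project": "P2", "scope": 2, "period": "2024-03", "kg_co2e": 200.0},
]


def aggregate(rows, dims):
    """Sum kg CO2-e over the given dimensions; keys are tuples of dimension values."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["kg_co2e"]
    return dict(totals)


by_scope_site = aggregate(rows, ("scope", "site"))  # sustainability team view
by_project = aggregate(rows, ("project",))          # project team view
grand_total = aggregate(rows, ())[()]               # single figure for the report
```

Because every view is derived from the same rows, the totals reconcile by construction, and a drill-down for the auditor is a filter on those rows rather than a separate workbook.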

6. Lineage

At every step, the system needs to record what happened: which document produced which data point, which factor was applied, which calculation produced which aggregate, who reviewed what and when.

This lineage is the audit trail. Without it, the pipeline output — the reported number — is a black box.
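There are many ways to record lineage. One possible approach, sketched here as an assumption rather than a description of any particular product, is an append-only log in which each entry includes a hash of the previous one, so that a later alteration to any entry becomes detectable. The event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Stable SHA-256 over a JSON-serialisable dict."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> dict:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"prev_hash": prev_hash, **event}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash and check that the chain is unbroken."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = []
append_event(log, {"step": "ingest", "source_doc": "invoice_0412.pdf", "kwh": 1000.0})
append_event(log, {"step": "factor", "vintage": "2024", "region": "REGION_A"})
append_event(log, {"step": "review", "reviewer": "a.reviewer", "decision": "accepted"})
intact = verify(log)

# Quietly editing an earlier entry breaks verification.
log[0]["kwh"] = 900.0
tampered_detected = not verify(log)
```

In practice the log would live in a database or append-only store rather than an in-memory list, but the principle is the same: the reported number can be traced back through each step to its source document.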

Why spreadsheets fail at scale

A well-designed spreadsheet can handle all of these steps for a single site and a single data source. With 50 sites and 10 data types, there are up to 500 site-and-source combinations to maintain: more tabs, more manual steps, more places for errors to enter and never be found.

The failure modes are predictable:

Inconsistent factor versions across different tabs or files
Manual copy-paste errors that are undetectable once made
No diff history — when a number changes, you can’t see what it was before or why it changed
No separation between data entry and calculation, so errors propagate silently
No systematic validation — if an anomaly is present, it is only found by chance

As assurance requirements tighten — from limited to reasonable assurance — these failure modes become material. A spreadsheet that “looked right” for a voluntary report may produce qualified findings under proper assurance.

What a proper solution looks like

The organisations ahead of this curve have either built internal data infrastructure or adopted purpose-built platforms that treat sustainability reporting as an engineering problem:

Structured ingestion with format-specific parsers for common invoice types
A normalisation layer that maps organisational structure to reporting dimensions
Automated validation with configurable rules and human-in-the-loop review for flagged anomalies
Factor management that tracks NGA vintage, scope category, and calculation methodology
Immutable audit records at every step

This is not a complex enterprise system. It’s a focused data pipeline for a well-defined problem. The complexity is in getting it right, not in the architecture.

Ayika is built as exactly this kind of pipeline — from invoice ingestion to assurance-ready emissions figures, with lineage maintained at every step. See how it handles the engineering problem.

Why Sustainability Reporting Is Becoming a Data Engineering Problem