What does good data validation look like?
Data engineers should add data validation processes in various stages throughout ETL pipelines to ensure that data remains accurate and consistent throughout its lifecycle. This article outlines strategies and best practices for doing this effectively.
When data pipelines ingest, transform, and transport data, some level of data quality degradation is inevitable. If poor quality data finds its way into the hands of teams who activate it, the list of negative outcomes runs the gamut from customers receiving irrelevant messaging to executives making misguided strategic decisions.
This is why it is critical for data engineers to prioritize data validation, i.e. processes that check data for accuracy and consistency at multiple steps throughout its journey from collection to activation. Doing this will increase efficiency by preventing data-driven teams from basing actions and decisions on faulty information, and eliminating the need for engineers to perform time-consuming debugging.
In the rest of this article, we’ll highlight some best practices when it comes to implementing data validation throughout an ETL pipeline.
Measure twice, cut once
There’s an old adage that speaks to the importance of careful preparation when performing any work where mistakes could have costly consequences: “Measure twice, cut once.” If you’re a carpenter, it’s easy to see the wisdom in these words––cutting wood in the wrong place could mean wasted materials and time. The idea also applies to data engineers, as failing to check for data errors and correct them early in an ETL pipeline can result in compounding data quality issues further downstream.
For data engineering teams, building the pipelines that transform data from its collection point to the destination systems can often end up taking precedence over ensuring the accuracy of that data. When this happens, data validation processes can be left until the transformation phase, at which point errors in accuracy have already been given a chance to manifest.
As a general rule, the farther down a pipeline a data quality error is discovered, the more effort it will take to correct. This is why the first layer of data validation should be performed as soon as data is ingested from its source system. Engineers should note that even if data was valid in its system of origin, it may not remain valid once ingested for several reasons:
- Disparate validation logic
The source system may have internal workarounds for inaccurate, duplicate, or ambiguous data, and these records could easily lead to errors and quality issues in the transformation or activation stages.
- Source system errors
Just because data in a source system has made it to the egress stage does not mean that it is entirely bug free––after all, that data was likely moved and transformed by another data engineering team, and no one is perfect.
- Errors in the ingestion process
In addition to errors arising from the source system, the ingestion process itself can produce data errors like incompletely loaded or extracted data sets, record mutations, or errors related to parsing flat files.
For these reasons, it’s never too early in the ETL process to implement data validation. In fact, early checks on data are advisable and can prevent costly mistakes or time-consuming debugging down the line.
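To make this concrete, here is a minimal sketch of the kind of post-ingestion checks described above: confirming that every extracted record actually landed, and that the key column is populated and unique. The table name, column name, and use of SQLite are illustrative assumptions, not prescriptions from the source; in practice you would point equivalent queries at your staging warehouse.

```python
import sqlite3

def validate_ingested_rows(conn, table, expected_count, key_column):
    """Basic post-ingestion checks: completeness and key integrity.

    Returns a list of human-readable failure messages (empty = all pass).
    Table and column names are illustrative; adapt to your own schema.
    """
    failures = []

    # Completeness: did every record extracted from the source actually land?
    (actual_count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if actual_count != expected_count:
        failures.append(
            f"{table}: expected {expected_count} rows, loaded {actual_count}"
        )

    # Key integrity: the key column must be populated...
    (null_keys,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key_column} IS NULL"
    ).fetchone()
    if null_keys:
        failures.append(f"{table}: {null_keys} rows with NULL {key_column}")

    # ...and unique (a duplicate key often signals a double-loaded batch).
    (distinct_keys,) = conn.execute(
        f"SELECT COUNT(DISTINCT {key_column}) FROM {table}"
    ).fetchone()
    if distinct_keys != actual_count - null_keys:
        failures.append(f"{table}: duplicate values in {key_column}")

    return failures
```

Running checks like these immediately after load means a double-loaded batch or truncated extract is caught before it can propagate into transformations.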
Perform context-based validations
When it comes to performing data quality checks at a syntactic level––that is, checking that data is named and structured in such a way as to have meaning across systems––data engineers are perfectly suited for the task. They can write scripts and queries to ensure that incoming data sets meet expectations. This is a critical line of defense against a wide variety of data error types.
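A syntactic check of this kind can be as simple as validating incoming records against an expected schema. The field names and types below are hypothetical, assumed purely for illustration:

```python
# Hypothetical expected schema for an incoming data set; adapt as needed.
EXPECTED_SCHEMA = {
    "customer_id": int,
    "email": str,
    "signup_date": str,  # e.g. an ISO-8601 date string
}

def check_record_syntax(record):
    """Return a list of syntactic problems with one incoming record (dict).

    An empty list means the record matches the expected field names and types.
    """
    problems = []

    # Every expected field must be present.
    missing = EXPECTED_SCHEMA.keys() - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    # Present fields must have the expected type.
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field in record and not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems
```

A script like this can run against a sample of each incoming batch and fail the load, or quarantine offending records, when problems are found.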
However, any data that will be leveraged within an organization should pass semantic as well as syntactic quality checks––that is, the data actually needs to carry the right meaning to serve the use cases for which it is intended. When it comes to checking data at this contextual level, it is often data analysts or the non-technical owners of activation tools who have the most applicable knowledge. It is therefore critical that these stakeholders have input into the validation processes that occur farther downstream, closer to the systems where the data will be leveraged.
Building comprehensive data validation frameworks
In O’Reilly’s Data Pipelines Pocket Reference, author James Densmore outlines a simple framework that can be used for validation tests that check for both syntactic and semantic data outcomes. He also discusses an approach to ensuring that both data engineers and analysts are able to contribute their respective knowledge and skill sets to data validation, which is critical for achieving holistic data accuracy. Though somewhat limited in its vanilla form, Densmore’s framework is an extensible starting point that can jump-start any data engineering team’s validation practices.
The framework consists of a Python script that runs two SQL scripts, compares their output to an expected value, and returns a test result based on this comparison. The framework can be used to reduce any two data sets to a single numerical output apiece and compare the results––for example, comparing the number of rows in two tables. The result of this test can then be used to either halt or continue a data pipeline, or to throw a warning that a manual check on the data should be performed. For the latter case, Densmore provides an example of how to extend the framework to post an alert to a Slack channel.
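A minimal sketch in the spirit of that design might look like the following. This is not Densmore’s actual code; it simply assumes two queries that each return one numeric value, a comparison operator, and an in-memory SQLite database in place of a real warehouse:

```python
import sqlite3

def validate(conn, sql_a, sql_b, comparator="equals"):
    """Run two queries that each return a single numeric value, compare them,
    and report PASS/FAIL. Returns True when the comparison holds.

    A sketch only; a production version would run as a standalone script
    whose exit status halts or continues the pipeline.
    """
    (value_a,) = conn.execute(sql_a).fetchone()
    (value_b,) = conn.execute(sql_b).fetchone()

    if comparator == "equals":
        passed = value_a == value_b
    elif comparator == "greater_equals":
        passed = value_a >= value_b
    else:
        raise ValueError(f"unknown comparator: {comparator}")

    status = "PASS" if passed else "FAIL"
    print(f"{status}: {value_a} {comparator} {value_b}")
    return passed
```

On a failed comparison the orchestrator could abort the run, while a softer rule could merely log a warning or, as in Densmore’s extension, post the result to Slack for a manual review.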
Read more about how to perform effective data validation in O’Reilly’s Data Pipelines Pocket Reference.