Engineering—September 14, 2021

What is data integrity and why does it matter for customer data?

Integrity is a good quality. Just as you want the people around you to have integrity, you want the data on which you base strategic decisions to have it too. But what does it mean for data to have integrity, and why is it so important? In this post, we’ll explore this broad and nuanced concept, define what it means in the context of customer data, and outline a strategy for ensuring your customer data maintains high integrity throughout its lifecycle.

Think of someone you know who has integrity. What qualities about this person come to mind? Maybe they always follow through on promises. When they say they’re going to do something, you can rest easy knowing they’ll come through. Maybe they’re always there for you in times of crisis, delivering the rational, clear-headed advice you need and providing a sounding board for your every last thought. Or maybe you’re not thinking of a person at all, but an object: your old Toyota Camry, perhaps, which despite having over 200,000 miles on it still starts up when you turn the key and gets you where you need to go.

In the context of data, integrity basically has the same meaning. Data with integrity is always there for you when you (API) call on it. It gives you exactly the information you need and expect, and you can be sure that what it’s telling you is accurate and thorough.

What is data integrity?

Data integrity is the degree to which data and data sets maintain accuracy, completeness, and consistency throughout their entire lifecycle. At the point of collection, data integrity entails capturing data in a way that precisely matches the specification laid out in your data plan. During storage, the term means avoiding changes to the data (either unintended or malicious) that may compromise its accuracy and reliability, especially as the data is moved between systems and environments. At retrieval, data integrity is again a measure of how accurately and completely the data matches the state it was in at the time of collection.
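One common way to check that a record still matches its state at collection time is to store a checksum alongside it and recompute that checksum at retrieval. The sketch below is only an illustration of the idea, not a description of any particular product: the event fields and the `fingerprint` helper are hypothetical, and it assumes records can be serialized as JSON.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a SHA-256 digest of a record's canonical JSON form."""
    # Sorting keys makes the digest independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical event captured at collection time.
collected = {"event": "add_to_cart", "user_id": "u-123", "price_cents": 1999}
digest_at_collection = fingerprint(collected)

# Later, after storage or transfer: an unchanged copy and an altered copy.
retrieved_ok = {"user_id": "u-123", "price_cents": 1999, "event": "add_to_cart"}
retrieved_bad = {"event": "add_to_cart", "user_id": "u-123", "price_cents": 199}

print(fingerprint(retrieved_ok) == digest_at_collection)   # same content, different key order
print(fingerprint(retrieved_bad) == digest_at_collection)  # the altered price changes the digest
```

A plain hash like this catches accidental changes. Guarding against deliberate tampering would additionally require a keyed hash or a signature, since anyone who can alter the record could also recompute an unkeyed digest.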

Under the umbrella of data integrity, two subcategories describe integrity as it relates to specific characteristics of data:

Logical integrity

This means that data retains its accuracy and completeness even as it is used in different contexts. Ensuring logical integrity often relies on imposing constraints on the structure and representation of data that carry across systems and applications. Suppose, for example, that the website of a peer-to-peer marketplace collects the price at which a user would like to list an item. The database that stores this information expects to receive the price as an integer, but there is no constraint on the input enforcing this data type. On the off chance a user types the price as free text such as “twenty dollars” (which would certainly be unusual but is not impossible), the value would have no meaning in the system that receives it. It would therefore be useless and would lack logical integrity.
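A constraint like the one missing in the marketplace example can be enforced at the input boundary, before the value reaches the database. The following is a minimal sketch under stated assumptions: the `parse_price_cents` function is hypothetical, and storing the price as a whole number of cents is one possible way to satisfy an integer column, not something the example above specifies.

```python
from decimal import Decimal, InvalidOperation


def parse_price_cents(raw: str) -> int:
    """Convert user-entered price text into a whole number of cents.

    Raises ValueError when the text is not a positive monetary amount.
    """
    # Tolerate surrounding spaces, a leading dollar sign, and thousands separators.
    cleaned = raw.strip().lstrip("$").replace(",", "")
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"not a number: {raw!r}") from None
    # Reject NaN, infinity, zero, and negative amounts.
    if not amount.is_finite() or amount <= 0:
        raise ValueError(f"price must be positive: {raw!r}")
    cents = amount * 100
    # Reject amounts with fractions of a cent instead of silently rounding.
    if cents != cents.to_integral_value():
        raise ValueError(f"too many decimal places: {raw!r}")
    return int(cents)


print(parse_price_cents("19.99"))
try:
    parse_price_cents("twenty dollars")
except ValueError as err:
    print("rejected:", err)
```

Rejecting the bad value at the point of entry means the user can be asked to correct it immediately, rather than leaving a meaningless record for downstream systems to discover later.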

Physical integrity

When we think about the ephemeral digital world in which data lives, it’s easy to forget about the physics that allow it to exist. Data is just 1s and 0s that need to live on a physical storage medium, and getting all those 1s and 0s in and out of their home requires wondrous forces of nature like electromagnetism. For data to have physical integrity means that its home is intact, and that the data hasn’t been damaged by degradation of its habitat. Such damage can come from power outages, surges, weather and climate disasters, material failure, and much more. Redundancy (that is, storing data in multiple locations) is one of the most common strategies for mitigating the risk of physical integrity failures.

Why does data integrity matter for customer data?

Data integrity applies to all types of data collected across every use case. When narrowing the scope of the “data” you’re concerned with, however, the term can carry some specific connotations. Here, we’ll focus specifically on the reasons why data integrity is critically important in the context of customer data.

Customer data is information that a company collects about its customers that includes (among other categories) personal and demographic data, behavioral data, and interactive or engagement data. The most valuable form of customer data is first-party data, which consists of information that a company collects directly from its customers on its owned applications, websites and other digital touchpoints. This is the type of customer data that we’ll be focusing on specifically here, though if you’re interested in learning more about the differences between types of customer data and the specific value of first-party data, check out this article.

Customer data is critically important to companies of all types across every vertical. Armed with robust customer data, companies can achieve a wide variety of critical goals, including but certainly not limited to:

Building a clear picture of who their customers are and what they value
Unlocking ways of driving value to different customer segments
Personalizing experiences across digital channels based on customers’ interests

Of course, customer data can only help organizations realize these outcomes if it is accurate and complete, and if its structure and meaning persist when it is used across different systems and for different purposes.

Every aspect of an organization’s strategy, from messaging to product development, can suffer from poor-quality data. When a company’s data pipeline is plagued with inaccurate, incomplete or misleading data, the financial costs can compound quickly. This Harvard Business Review study on data quality points to the “rule of ten,” first articulated in Data Quality by Thomas C. Redman, to calculate the cost of bad data. Here’s an example HBR uses to illustrate the compounding costs of poor data quality:

“Suppose you have 100 things to do and each costs a $1 when the data are perfect. If all the data are perfect, the total cost is 100 x $1 = $100. If 89 are perfect and 11 are flawed, the total cost is 89 x $1 + 11 x $10 = $199. For most, of course, the operational costs are far, far greater. And the rule of ten does not account for nonmonetary costs, such as lost customers, bad decisions, or reputational damage to your company.”
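The arithmetic in the quoted example generalizes to any number of tasks and any share of flawed records. The short sketch below simply restates that calculation. The function name and parameters are illustrative, and the tenfold multiplier is taken from the quoted example rather than measured.

```python
def total_cost(tasks: int, flawed: int, unit_cost: int = 1, multiplier: int = 10) -> int:
    """Total cost when each flawed task costs `multiplier` times a clean one."""
    clean = tasks - flawed
    return clean * unit_cost + flawed * unit_cost * multiplier


print(total_cost(100, 0))   # 100, matching the all-perfect case in the quote
print(total_cost(100, 11))  # 199, matching 89 x $1 + 11 x $10
```

Under these assumptions, an 11% error rate nearly doubles the total cost, which is why small improvements in data quality can have an outsized effect.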

This is why customer data with low integrity is worse than no data at all. Inaccurate information about how users engage with a product, for example, can result in customers receiving irrelevant promotions and messages, and can easily lead to a loss of trust and loyalty among consumers. Faulty results from feature A/B tests can lead product teams into ineffective or even counterproductive development cycles, and place a company at a distinct competitive disadvantage.

Since a high level of data integrity is a requirement for companies that use customer data to guide strategic decisions, it is critical that teams that develop and implement a company’s data strategy have robust systems and processes in place to prevent poor quality data from entering their pipelines.

How to ensure data integrity

Working with a well-thought-out and strictly governed data tracking plan is one of the most effective ways to ensure a high level of data integrity. A data tracking plan is a centralized document, accessible to all data stakeholders throughout an organization, that acts as a single source of truth for details including:

What customer data the company will collect
The structure it should have when it is collected
Which customer behaviors will trigger data collection
How data events and attributes will be named
Which data types (booleans, strings, integers, etc.) will be used for incoming data

When creating a data plan, it is often useful to include constraints on customer data coming in from apps and websites whenever practicable. This means imposing rules on what the incoming data should contain, and flagging data that does not conform to these rules. For example, stating that a user’s first or last name should not include numbers, an email address should include an “@” sign, or that a “product_name” attribute should match one of three enumerated values are all constraints that data owners might add to a plan. Doing so will help ensure that data will retain a meaningful structure, even as it is transported and used across multiple parts of your data pipeline.
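The three example constraints above can be expressed as small validation rules that flag non-conforming attributes. This is a minimal sketch, not a real data plan format: the attribute names, the three product names, and the `find_violations` helper are all hypothetical, and treating a missing attribute as a violation is an added assumption.

```python
from typing import Callable, Dict, List

# Hypothetical enumerated values; a real plan would list its own.
ALLOWED_PRODUCT_NAMES = {"basic", "plus", "premium"}


def has_no_digits(value: object) -> bool:
    """True for a non-empty string that contains no numeric characters."""
    return isinstance(value, str) and bool(value) and not any(c.isdigit() for c in value)


def looks_like_email(value: object) -> bool:
    """Deliberately loose check: an "@" with at least one character on each side."""
    return isinstance(value, str) and "@" in value[1:-1]


def is_allowed_product(value: object) -> bool:
    """True when the value is one of the enumerated product names."""
    return isinstance(value, str) and value in ALLOWED_PRODUCT_NAMES


# Each attribute in the plan maps to the rule its value must satisfy.
CONSTRAINTS: Dict[str, Callable[[object], bool]] = {
    "first_name": has_no_digits,
    "last_name": has_no_digits,
    "email": looks_like_email,
    "product_name": is_allowed_product,
}


def find_violations(attributes: dict) -> List[str]:
    """Return the names of attributes that are missing or break their rule."""
    return [
        name
        for name, is_valid in CONSTRAINTS.items()
        if name not in attributes or not is_valid(attributes[name])
    ]


incoming = {
    "first_name": "Ada99",
    "last_name": "Lovelace",
    "email": "ada.example.com",
    "product_name": "deluxe",
}
print(find_violations(incoming))  # ['first_name', 'email', 'product_name']
```

Returning a list of violations rather than raising on the first failure lets a pipeline flag the record and report every problem at once, which matches the goal of flagging data that does not conform.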

An expertly strategized data plan alone does not win the battle for customer data integrity, however. Data still has to get from your applications and websites into your pipeline in a manner that conforms to your data plan, and making sure of this is the responsibility of engineers. Developers are the ones who need to implement your data plan: that is, turn the events, attributes, and constraints into working code that collects this information and sends it where it needs to go. The more complex a data plan is, the harder it will be for engineers to implement it flawlessly. That’s why tools that assist or even automate the data planning process are a godsend both for engineers who implement data plans and for internal teams who rely on high-integrity data.

mParticle offers a suite of interconnected developer tools and UI features that allow growth teams to easily create data plans, and engineers to seamlessly translate these plans into code. Smartype, for example, is a powerful code generation tool that turns a data plan represented as a JSON schema into usable data collection libraries for web, iOS, and Android. Similarly, the Data Planning Snippet SDK is a GUI that allows you to paste in a data plan schema and generate multi-platform data collection code directly in the browser.

If you’re interested in seeing how all of mParticle’s data planning features work in tandem to help teams achieve data integrity and save engineering teams from time-intensive implementation and debugging, check out this blog post.

Author: Sean Ryan, Technical Writer