Engineering—November 10, 2020

CDPs vs. Data Lakes: What’s the difference, and can you use both?

CDPs and Data Lakes differ in the insights they surface, the users they serve, and the overall value they deliver. Used together, though, they are a powerful duo that can help your organization make full use of both historical and real-time customer data.

With the number of tools for gathering customer insights increasing at a rapid pace, it can be difficult to cut through the jargon to understand how each one can benefit your internal teams and company as a whole. Data Lakes and Customer Data Platforms (CDPs) are two categories of data solutions that, even for seasoned buyers, can be tricky to differentiate.

Here, we’ll look at the types of data that Data Lakes and CDPs store, the internal bandwidth––technical and otherwise––needed to leverage them, the value each tool brings to your organization, and the ways in which both can help create unified customer experiences. We’ll also see how CDPs and Data Lakes are not mutually exclusive, but rather complementary tools that can work together to make your customer profiles more robust and actionable.

Data Lakes: broad insights from raw, unstructured data

As the name suggests, Data Lakes contain vast quantities of data (petabytes, in some cases) from a variety of sources. Just like water entering a lake from rainfall, data enters a Data Lake in its “natural” state and initially retains whatever structure it had in its container of origin. Though Data Lakes do not impose standardized structures on the data they store, they often add meta tags and identifiers to incoming data to aid in future analysis. Types of data stored in a Data Lake can run the gamut from totally unstructured collections of image, audio, video, and text files, to strictly structured rows and columns from a relational database.

Owing to their massive size, Data Lakes are most commonly housed in distributed, cloud-based environments. While Data Lakes often expose a variety of APIs and interfaces for loading data, their ingestion process, unlike a CDP’s, is typically not automated out of the box. Rather, the Data Lake’s main owners (typically machine learning engineers or data scientists) must replicate data from other sources to store it in the Data Lake. While the majority of data in a CDP is sourced in real time from customer interactions, the main source of data in a Data Lake is typically data that already lives in other internal databases.

The volume and variety of information in Data Lakes make them powerful tools in the hands of data scientists and other IT teams who can leverage sophisticated analytics techniques to uncover predictive insights. That same volume and variety, however, makes them too unwieldy for go-to-market teams like sales and marketing, who depend on up-to-date, unified, and immediately actionable data to drive personalized interactions. Delivering that kind of data is a CDP’s main job, and it is the reason why non-technical stakeholders are the primary users of a CDP.

While the information in a Data Lake is “static” in that it is not continuously and automatically updated, the insights this data can deliver are very broad in scope. For this reason, we can think of a Data Lake as an archive in a laboratory. It houses many types of specimens––some old, some new––that data scientists can retrieve at any time, combine in different ways, and examine with different tools and techniques to ultimately uncover new information. This is an invaluable role in an organization’s data ecosystem and one that ties directly into the function of a CDP, as we will see below.

CDPs: Real-time insights fueling one-to-one engagements

If a Data Lake is an archive, a CDP is an assembly line. In a CDP, data is constantly being taken in, checked for quality and consistency, combined with other data to form a complete product (a customer profile), and packaged for both internal analysis and delivery to third-party systems.

Data ingestion––a key point of differentiation between CDPs and Data Lakes––is the first station on this assembly line. While data must be replicated into a Data Lake, developers leveraging a CDP can automate data ingestion by implementing platform-specific SDKs and HTTP-based event APIs on each collection platform. For organizations leveraging a CDP, these CDP-specific SDKs and APIs are often the only MarTech-related integrations the engineering team needs to maintain. CDPs abstract away connections to downstream services, so developers are liberated from having to handle repetitive integrations and maintenance work. This frees up valuable engineering time and improves app performance by eliminating each downstream service’s SDK and associated dependencies from the codebase.
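
To make the idea of abstracting away downstream connections concrete, here is a minimal Python sketch. It does not represent any real CDP SDK: the `EventRouter` class and its `connect` and `track` methods are hypothetical names chosen for illustration. The point is only that the application makes one collection call, and the routing layer fans the event out to whatever outputs have been configured.

```python
class EventRouter:
    """Hypothetical stand-in for a CDP's collection layer (not a real SDK)."""

    def __init__(self):
        # Maps an output name to a callable that delivers one event.
        self._outputs = {}

    def connect(self, name, send):
        """Register a downstream output, e.g. an analytics or email tool."""
        self._outputs[name] = send

    def track(self, event):
        """Forward one event to every connected output; return their names."""
        delivered = []
        for name, send in self._outputs.items():
            send(event)
            delivered.append(name)
        return delivered


# Usage: two in-memory lists stand in for third-party services.
analytics_inbox, email_inbox = [], []
router = EventRouter()
router.connect("analytics", analytics_inbox.append)
router.connect("email", email_inbox.append)
router.track({"name": "purchase", "customer_id": "c-1"})
```

Adding or removing a downstream tool in this sketch changes only the `connect` calls, not the application code that calls `track`, which is the maintenance saving described above.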

During ingestion, CDPs also enforce data quality and consistency by validating incoming data against an established data plan and blocking any violating data points from being forwarded to external systems. Just as preventing faulty spark plugs and brake pads from entering an assembly line keeps drivers safe, eliminating inconsistent data before it reaches downstream services prevents errors and inaccuracies in these systems. Because of this automated and quality-controlled ingestion process, CDPs are able to surface actionable insights to their end-users much more quickly than Data Lakes, and without the lift of having to run advanced data processing and analytics. This does come with a trade-off, however––typically, CDPs do not contain data sets with the same level of variety that Data Lakes are known for.
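
The validation step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular CDP implements data plans: the `DATA_PLAN` structure, the event shape, and the `violations` and `partition` helpers are all hypothetical.

```python
# Hypothetical data plan: event name -> required attributes and their types.
DATA_PLAN = {
    "add_to_cart": {"product_id": str, "price": float},
    "purchase": {"order_id": str, "total": float},
}


def violations(event):
    """Return the reasons an event breaks the plan (empty list if valid)."""
    schema = DATA_PLAN.get(event.get("name"))
    if schema is None:
        return [f"unplanned event: {event.get('name')!r}"]
    attrs = event.get("attributes", {})
    problems = []
    for key, expected in schema.items():
        if key not in attrs:
            problems.append(f"missing attribute: {key}")
        elif not isinstance(attrs[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


def partition(events):
    """Split events into those safe to forward and those to block."""
    forwarded, blocked = [], []
    for event in events:
        problems = violations(event)
        if problems:
            blocked.append((event, problems))
        else:
            forwarded.append(event)
    return forwarded, blocked
```

In this sketch, a `purchase` event whose `total` arrives as the string `"19.99"` lands in `blocked` along with the reason, so it never reaches a downstream system that expects a number.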

Thanks to this rigorous quality control process, CDPs can handle data processing and analytics behind the scenes and deliver detailed and current pictures of individual customers to their end-users. This is why CDPs and Data Lakes differ in the types of users who primarily benefit from their adoption. While CDPs offer many benefits to engineering teams (streamlined workflows and lighter applications are just the tip of the iceberg), their primary value is to business users such as marketing and product management teams. Using a CDP, non-technical users are empowered to define and create customer segments and run various analytics in an intuitive user interface, without having to run SQL queries or any other programmatic analytics that Data Lakes demand.

Complementary roles in the data stack

With an IT team leveraging a Data Lake to generate broad, market-level insights to guide long-term strategy, and product and marketing teams driving personalized customer interactions with a CDP, your organization is ready to take full advantage of the modern data ecosystem. A Data Lake’s power goes beyond generating forecasts and predictive insights, however. When coupled with a CDP, it can serve as a unique component in the CDP’s data ingestion pipeline, and conversely, a CDP can help activate the massive stores of historical customer data residing in Data Lakes.

In order for event-driven customer data to be consumed and activated in third-party systems, it needs to remain anchored to a unique customer profile. While automated data collection from a wide variety of touchpoints gives these customer profiles up-to-the-minute accuracy, it’s first-party data––such as name, address, gender, and date of birth––that helps tie each constantly evolving profile to a unique identifier. These static attributes are precisely the types of data points that are readily found in Data Lakes.
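
As a rough illustration of anchoring event data to a profile, the Python sketch below seeds profiles from static records (standing in for rows exported from a Data Lake) and then attaches incoming events by a shared identifier. The field names, the `customer_id` key, and the `build_profiles` helper are assumptions made for the example. Real identity resolution is considerably more involved.

```python
def build_profiles(lake_records, events):
    """Seed profiles from static records, then attach events by customer_id."""
    profiles = {}
    for record in lake_records:
        cid = record["customer_id"]
        # Everything except the identifier becomes a static profile attribute.
        static = {k: v for k, v in record.items() if k != "customer_id"}
        profiles[cid] = {"attributes": static, "events": []}

    unmatched = []
    for event in events:
        profile = profiles.get(event["customer_id"])
        if profile is None:
            # No known identifier to anchor this event to.
            unmatched.append(event)
        else:
            profile["events"].append(event["name"])
    return profiles, unmatched


# Usage with made-up sample data.
lake_records = [{"customer_id": "c-1", "name": "Ada", "birth_year": 1985}]
events = [
    {"customer_id": "c-1", "name": "add_to_cart"},
    {"customer_id": "c-9", "name": "purchase"},
]
profiles, unmatched = build_profiles(lake_records, events)
```

Here the `add_to_cart` event joins the profile for `c-1`, while the event for the unknown `c-9` is set aside, which is the situation static attributes from a Data Lake help avoid.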

Waking up sleeping data

Historical data sets from a Data Lake can help CDP users make even more informed decisions about how and when to engage customers with personalized messaging and offers. Say, for example, a retailer has been collecting purchase history on a lifelong customer for many years and storing this information in a Data Lake. The retailer has years’ worth of information on the products this customer typically buys, the promotions they responded to, the seasons during which they have made purchases, and much more. Connected to a CDP––especially one with rich data processing capabilities like mParticle’s Calculated Attributes and Standard Audiences features––this data can help marketers make highly informed decisions about when, where, and how to reach this customer with personalized offers.
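
To give a feel for the kind of derived values the retailer example implies, here is a plain-Python sketch that summarizes a customer's purchase history. It is not mParticle's Calculated Attributes API. The record shape, the `summarize` helper, and the northern-hemisphere season mapping are assumptions made purely for illustration.

```python
from collections import Counter
from datetime import date

# Assumed mapping: index 0 is January, using northern-hemisphere seasons.
SEASONS = ["winter", "winter", "spring", "spring", "spring", "summer",
           "summer", "summer", "autumn", "autumn", "autumn", "winter"]


def summarize(purchases):
    """Derive a few profile-level values from historical purchase records."""
    if not purchases:
        return {}
    seasons = Counter(SEASONS[p["date"].month - 1] for p in purchases)
    return {
        "lifetime_value": round(sum(p["amount"] for p in purchases), 2),
        "order_count": len(purchases),
        "last_purchase": max(p["date"] for p in purchases),
        "top_season": seasons.most_common(1)[0][0],
    }


# Usage with made-up history for one customer.
history = [
    {"date": date(2018, 12, 1), "amount": 40.0},
    {"date": date(2019, 12, 15), "amount": 60.0},
    {"date": date(2020, 6, 1), "amount": 25.5},
]
summary = summarize(history)
```

A marketer could use a value like `top_season` from this sketch to time a seasonal offer, or `last_purchase` to decide whether a win-back message is appropriate.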

The CDP/Data Lake connection is especially fruitful for companies that were early to adopt data collection practices and have long backlogs of static data points on their users. Sitting in an isolated environment in an unstructured fashion, this data is asleep. While it can help guide broad insights and decisions, it is unlikely to translate to more meaningful customer interactions and personalized experiences. Connected to a CDP, this data wakes up. It’s tied to a real user, integrated with more recently collected information, and ready to be leveraged by product and marketing teams.

Wrapping up

Data Lakes and CDPs are not mutually exclusive tools. On the contrary, a robust Data Lake will help you maximize your CDP’s potential, and a CDP can breathe new life into a Data Lake’s backlog of customer information.

To learn more about mParticle, you can explore our documentation.