How to use a CDP with your data warehouse
Data warehouses and CDPs are two pillars of the modern data stack. Recently, a perception has emerged that companies need to choose one system or the other as a “source of truth” for their data. This article offers a counter-perspective and demonstrates how, when used together, a CDP and a data warehouse can form a dynamic duo at the core of your data infrastructure.
Here’s an understatement: The landscape of tools comprising the modern data stack is very complex. In the last few years, the number of vendors and systems entering the data marketplace has increased dramatically. The data lifecycle itself has become more intricate, as the broad stages of ingestion, transformation, storage, and activation are becoming subdivided into more granular processes.
CDPs and data warehouses are two staples in this ecosystem that, on the surface, have plenty of similarities. Both systems sit at the infrastructure layer of the data stack, and both can store data from multiple sources to serve a variety of use cases. The two differ significantly, however, in the way they surface data for activation. Data warehouses often have direct integrations with tools like DataRobot, Tableau and Looker, and excel in servicing data science, BI and visualization use cases. Delivering on activation use cases for marketing and product teams can be cumbersome, though, as this requires the additional overhead of a reverse ETL tool as well as a data engineering team to write custom SQL queries to retrieve and forward this data. CDPs, on the other hand, offer pre-built integrations with hundreds of tools for analytics, engagement, advertising and other functions, and enable non-technical teams to forward audiences and other datasets to these tools through a user interface, without needing to write SQL.
There is no doubt that both of these systems are quite powerful in their own right. And given that CDPs and data warehouses handle many of the same functions within the data lifecycle, some analysts have made the argument that having both tools in your stack is redundant. After all, a data warehouse can store more than just customer data––transactional records, product data and employee information can be stored in these systems as well. If a data warehouse already serves as the source of truth for all of your data, why would you need or want to add a CDP to the mix?
Of course, there is some validity in the “simpler is better” approach to building a data stack. But given the immense functional differences between a data warehouse and a CDP, saying that these systems render each other redundant is like asking “Why would I wear a shirt if I’m already wearing pants?”
Today’s best-in-class infrastructure CDPs deliver out-of-the-box benefits in terms of data quality and consistency, identity resolution, and governance that surpass what engineering teams at early-stage and even enterprise organizations can feasibly develop and maintain. This is why using a CDP to take advantage of these benefits, while forwarding clean, consistent, high-quality customer profiles to a data warehouse for long-term storage and aggregation, is not at all redundant. It’s a powerful way to leverage the core competencies of these two foundational data tools. Since CDPs like mParticle make it possible to export data to leading DWHs like Google BigQuery, Amazon Redshift and Snowflake, this flow is straightforward to set up within your data stack.
Next, let’s dive into some of the specific benefits that using a CDP alongside your data warehouse can deliver. If you would like to review the similarities and differences between CDPs and data warehouses in greater detail before moving forward, check out this piece first.
CDPs offer sophisticated identity resolution strategies
Today’s consumers have come to expect one-to-one, personalized messaging and offers from brands. In order to deliver on this expectation, marketing and product teams need to know who users are to understand engagement and deliver personalized experiences. For engineering teams, having to manually unify data from across multiple sources is time-consuming, cumbersome, error-prone, and doesn't scale. It entails developing and executing a strategy for recognizing when multiple events across various touchpoints were performed by the same customer, associating these data points with a single user profile, and continuously updating this record as this customer continues to interact with the brand.
In reality, data collected from the same customer across multiple sources is often fragmented and incomplete. Identity resolution is rarely as simple as performing a SQL join to match records based on one shared identifier. For example, consider this scenario. A customer makes a purchase on an eCommerce website using their laptop, providing their name and email address at checkout. Two weeks later, this same customer downloads the store’s mobile app and opts in to receive email updates when certain products go on sale. A month passes, and this customer makes a purchase at a brick and mortar location using the same credit card they used for their initial web transaction.
Since there is no single identifying data point common to all three of these events, determining that they all belong to the same customer would require data engineers to write SQL queries that look for shared identifiers in these events and associate them with a common user. Doing this in a way that delivers the flexibility required to handle cases like the one above can be challenging to execute, let alone regularly adapt to the changing needs of growth teams and privacy requirements.
CDPs deliver advanced identity resolution algorithms that offer the power and flexibility required to maintain robust user profiles without demanding ongoing engineering overhead. mParticle’s IDSync, for example, gives customers the ability to define a priority for the identifiers used to tie individual data points to a unique user. When mParticle receives an inbound data payload from a user event, the associated identities are matched against identifiers on existing user profiles based on a defined hierarchy. If none are found, a new user profile is created, which can eventually be merged with another profile once an identity match is discovered.
Even though the three customer interactions in the scenario described above took place on separate devices and captured different identifiers from the user, IDSync would be able to resolve all three records to the same user profile. On the first interaction, a new ID would be created for the user, storing their name, email address, and credit card number. When this user downloads the company’s mobile app, a new ID would need to be created at first, since their device ID would not match that on an existing user profile. However, once the user provides their email address to sign up for product updates, IDSync would discover that it matches an email on an existing record, and merge the two user profiles into one. Finally, when this user makes an in-person purchase using the same credit card from the online transaction, IDSync could use this identifier alone to look up the user’s profile, and merge any data attributes collected from the POS interaction into their record.
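To make the walkthrough above concrete, here is a minimal sketch of priority-based, deterministic profile matching. It is not mParticle's IDSync implementation. The `ProfileStore` class, the `IDENTITY_PRIORITY` order, and the `card_token` identifier (a stand-in for the credit card, since raw card numbers would not normally be stored) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical identifier hierarchy, highest priority first.
IDENTITY_PRIORITY = ["customer_id", "email", "card_token", "device_id"]


@dataclass(eq=False)  # compare profiles by object identity, not field values
class Profile:
    profile_id: int
    identities: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)


class ProfileStore:
    def __init__(self):
        self.profiles = {}
        self._next_id = 1

    def _find(self, id_type, value):
        # Linear scan keeps the sketch short; a real store would index this.
        for profile in self.profiles.values():
            if profile.identities.get(id_type) == value:
                return profile
        return None

    def resolve(self, identities, attributes=None):
        """Return the profile for an event, creating or merging as needed."""
        # Collect every existing profile that shares an identifier,
        # checking identifiers in priority order.
        matches = []
        for id_type in IDENTITY_PRIORITY:
            if id_type in identities:
                found = self._find(id_type, identities[id_type])
                if found is not None and found not in matches:
                    matches.append(found)

        if not matches:
            # No known identifier: start a new profile.
            target = Profile(self._next_id)
            self._next_id += 1
            self.profiles[target.profile_id] = target
        else:
            # The highest-priority match wins; fold the others into it.
            target = matches[0]
            for other in matches[1:]:
                for key, value in other.identities.items():
                    target.identities.setdefault(key, value)
                for key, value in other.attributes.items():
                    target.attributes.setdefault(key, value)
                del self.profiles[other.profile_id]

        target.identities.update(identities)
        target.attributes.update(attributes or {})
        return target


store = ProfileStore()
# 1. Web purchase on a laptop: email and card are captured.
web = store.resolve({"email": "ana@example.com", "card_token": "tok_123"}, {"name": "Ana"})
# 2. App download: only a device ID is known, so a second profile is created.
app = store.resolve({"device_id": "phone-1"})
# 3. Email opt-in inside the app links the device to the existing email.
merged = store.resolve({"device_id": "phone-1", "email": "ana@example.com"}, {"email_opt_in": True})
# 4. In-store purchase: the card alone is enough to find the profile.
pos = store.resolve({"card_token": "tok_123"}, {"last_store": "downtown"})
```

After the fourth event, the store holds a single profile carrying all three identifiers and the attributes from every touchpoint. On conflicting values, this sketch keeps whatever the higher-priority profile already had, which is one possible merge policy among several.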
This strategy for matching cross-device data is known as deterministic modeling. Using this strategy, first-party data from separate devices is only unified into a single profile when common personally identifiable information (PII) becomes available. This means that the people who use the data can have complete confidence that the user profiles at their disposal are accurate, since no guesswork was involved in creating them. Probabilistic modeling, on the other hand, is an alternate approach to profile matching that uses predictive algorithms to unify cross-device engagements when the likelihood of a match reaches a certain confidence threshold. While this strategy makes it easier to amass a large database of user profiles in a short period of time, it does come with a margin for error. Over time, the inaccurate records it introduces can lead to a host of negative outcomes, such as manual data cleansing work for developers, wasted spend on marketing and advertising efforts, and poor customer experiences. This article dives into probabilistic and deterministic matching strategies in greater detail.
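The difference between the two approaches can be shown with a toy comparison. This is only a sketch: production probabilistic matching relies on statistical models rather than a fixed point table, and the signal names, point values, and threshold below are invented for the example.

```python
# Strong identifiers: a shared value is treated as proof of a match.
STRONG_KEYS = ("customer_id", "email")

# Weak signals with invented point values, for illustration only.
WEAK_SIGNAL_POINTS = {"ip_address": 50, "timezone": 20, "city": 20}
MATCH_THRESHOLD = 80  # hypothetical confidence cutoff


def deterministic_match(a: dict, b: dict) -> bool:
    """Match only when a strong identifier is present and identical."""
    return any(
        a.get(key) is not None and a.get(key) == b.get(key)
        for key in STRONG_KEYS
    )


def probabilistic_score(a: dict, b: dict) -> int:
    """Add up points for every weak signal the two records share."""
    return sum(
        points
        for signal, points in WEAK_SIGNAL_POINTS.items()
        if a.get(signal) is not None and a.get(signal) == b.get(signal)
    )


def probabilistic_match(a: dict, b: dict) -> bool:
    """Match once the combined score reaches the confidence threshold."""
    return probabilistic_score(a, b) >= MATCH_THRESHOLD


# Two different people on the same home network.
laptop = {"email": "ana@example.com", "ip_address": "203.0.113.7",
          "timezone": "UTC-5", "city": "Austin"}
roommate_phone = {"email": "ben@example.com", "ip_address": "203.0.113.7",
                  "timezone": "UTC-5", "city": "Austin"}

same_person_deterministic = deterministic_match(laptop, roommate_phone)
same_person_probabilistic = probabilistic_match(laptop, roommate_phone)
```

Here the deterministic check keeps the two records apart because the emails differ, while the probabilistic check merges them on shared network and location signals. That false positive is the kind of margin for error described above.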
For most companies, the out-of-the-box identity resolution capabilities of an infrastructure CDP, such as IDSync, will ultimately result in much richer customer profiles than an in-house solution. Additionally, leaving this process to a CDP relieves data science teams of having to develop and maintain these complex algorithms, enabling them to focus on more strategic analysis. The flexibility of an ID matching algorithm like mParticle’s IDSync can solve complex use cases like maintaining user continuity across anonymous and logged-in states, cross-device and cross-app tracking, and updating user identifiers without losing a user’s history in the case of mutable identities.
Given the critical importance of effective identity resolution, leveraging a CDP to handle this function is a winning data strategy, especially for companies that are currently handling this process with in-house solutions and a data warehouse. Additionally, offloading identity resolution to a CDP only enhances the long-term value of the data warehouse, as known and enriched user profiles can easily be forwarded to the data warehouse to unlock more use cases. The more complete and accurate these profiles are, the more valuable they will be in serving ongoing business intelligence and enhancing customer experiences.
Non-technical stakeholders can leverage CDPs without engineering involvement
Once a CDP’s data collection APIs have been implemented across input sources, engineers are no longer required to add or remove integrations with activation systems, nor do they need to write custom scripts or queries to forward data to downstream systems. Since CDPs have direct integrations with hundreds of leading vendors for marketing, analytics, and advertising use cases (as well as the ability to forward data directly to data warehouses), growth teams can set up system integrations directly within a user interface.
In mParticle, for instance, connecting a new output is as simple as finding the integration in a directory, entering an API key for the outbound system, and confirming the connection, all within the mParticle user interface.
The process of delivering data from a data warehouse to a growth tool is significantly more complex. First, it requires adopting another system into the data stack to perform reverse ETL. For a quick refresher, ETL stands for extract, transform, load, and it refers to the main processes involved in the inbound portion of the data pipeline: extraction from an input source, transformation into a commonly understood structure, and loading into a main repository. Reverse ETL tools, as their name suggests, perform these steps in the opposite direction to service data moving in the outbound direction.
Unlike CDPs, data warehouses often do not have direct integrations with the tools where the data sets they house can be activated, which is why a reverse ETL tool is necessary to handle data egress to systems that serve use cases for marketing and product teams. Even though reverse ETL tools bridge the gap between data warehouses and activation tools, they do not abstract away the technical overhead involved in data forwarding the way CDPs do––data engineers are still required to write custom queries that tell the reverse ETL system which audiences to retrieve from the data warehouse. Additionally, since reverse ETL tools do not have their own profile store, they cannot solve the problems of identity resolution or consent management in the way a CDP can.
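For a sense of what those custom queries involve, here is a minimal sketch of the work a data engineer would do to define one audience for a reverse ETL sync. An in-memory SQLite database stands in for the warehouse, and the table names, columns, audience name, and payload shape are all hypothetical. No real reverse ETL tool's API is shown.

```python
import json
import sqlite3

# An in-memory SQLite database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    total REAL,
    ordered_at TEXT
);
INSERT INTO customers VALUES
    (1, 'ana@example.com'), (2, 'ben@example.com'), (3, 'cy@example.com');
INSERT INTO orders VALUES
    (10, 1, 120.0, '2024-03-02'),
    (11, 1, 80.0,  '2024-03-15'),
    (12, 2, 30.0,  '2024-03-10'),
    (13, 3, 250.0, '2024-01-05');
""")

# The audience definition lives in SQL that an engineer writes and maintains.
AUDIENCE_SQL = """
SELECT c.email, SUM(o.total) AS spend
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE o.ordered_at >= ?
GROUP BY c.customer_id, c.email
HAVING SUM(o.total) >= ?
ORDER BY c.email
"""


def build_audience(connection, since, min_spend):
    """Return customers who spent at least min_spend since the given date."""
    rows = connection.execute(AUDIENCE_SQL, (since, min_spend)).fetchall()
    return [{"email": email, "spend": spend} for email, spend in rows]


audience = build_audience(conn, "2024-03-01", 100.0)

# The result still has to be reshaped for whichever tool receives it.
payload = json.dumps({"audience": "high_spenders_march", "members": audience})
```

Every change to the audience definition, such as a new spend threshold or a different date window, means another change to the SQL. That is the engineering dependency described above.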
Infrastructure CDPs deliver sophisticated data governance and privacy controls
For any brand that leverages customer data, robust data privacy and governance practices are essential. Since the emergence of the GDPR and CCPA, the legal landscape around data privacy has steadily moved in the direction of consumers gaining increased access and control over their data, and this trend will undoubtedly continue. As it does, it will become even more important for companies to ensure that their data practices comply with the latest standards.
While adhering to legal requirements is obviously critical, an equally compelling reason for prioritizing data governance and privacy is maintaining customer trust. Today’s consumers have become increasingly interested in how and when brands use their data, and they need assurance that when they provide companies with their information, it will be used responsibly and they will receive value in return. Maintaining this trust is critical for cultivating lasting and valuable relationships with customers.
Building privacy and governance features into the data lifecycle is a significant engineering challenge, however, and in a data stack architected around a central DWH and ETL/Reverse ETL pipelines, this task falls on in-house data engineers. Because this data ingestion strategy decouples identity resolution from event collection, data governance and compliance functions also need to happen after the point of collection. This means data engineers are tasked with ensuring customers have actually consented to the events and attributes being collected.
Infrastructure CDPs, on the other hand, deliver privacy and governance features out-of-the-box, and continuously update these features to ensure ongoing compliance with regulation. Not only does this take a considerable workload off the plates of in-house data engineers, but it also gives privacy teams and data stakeholders much more control over compliance at each stage of the data lifecycle.
mParticle, for instance, allows customers to easily record and update customers’ consent preferences as part of their user profile, and use these consent preferences to:
- Apply rules to event forwarding
- Allow/prevent the inclusion of customers in audiences
- Block/allow data forwarding to downstream tools
- Easily fulfill data subject requests
This video provides more context on how these end-to-end data privacy features give teams control over governance and compliance.
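As a rough illustration of consent-driven controls like those listed above, the sketch below stores consent on the user profile and checks it before forwarding data or building an audience. It does not reflect mParticle's actual data model or API. The `UserProfile` class, the consent purposes, and the destination names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    user_id: str
    # Consent purpose -> whether the user has opted in.
    consent: dict = field(default_factory=dict)


# Hypothetical mapping from each downstream tool to the consent it requires.
DESTINATION_PURPOSE = {
    "email_tool": "marketing",
    "ad_network": "advertising",
    "warehouse": "analytics",
}


def allowed_destinations(profile, destinations):
    """Keep only destinations whose required purpose the user consented to.

    Destinations with no known purpose are blocked by default.
    """
    return [
        dest for dest in destinations
        if profile.consent.get(DESTINATION_PURPOSE.get(dest), False)
    ]


def build_audience(profiles, purpose):
    """Include a user in an audience only if they consented to its purpose."""
    return [p.user_id for p in profiles if p.consent.get(purpose, False)]


ana = UserProfile("ana", {"marketing": True, "analytics": True, "advertising": False})
ben = UserProfile("ben", {"analytics": True})

ana_targets = allowed_destinations(ana, ["email_tool", "ad_network", "warehouse"])
ben_targets = allowed_destinations(ben, ["email_tool", "ad_network", "warehouse"])
marketing_audience = build_audience([ana, ben], "marketing")
```

Because consent lives on the profile, the same check can gate event forwarding, audience membership, and downstream delivery. In this sketch a missing consent record is treated as a "no", which is a deliberately conservative default.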
CDPs enable seamless bidirectional data flows
Both CDPs and data warehouses are capable of forwarding custom datasets to downstream activation tools. In the case of a data warehouse, however, a reverse ETL tool is required to accomplish this task, and data engineers also need to write custom queries to retrieve the desired audiences and deliver them to the appropriate downstream systems. With a CDP, by contrast, non-technical stakeholders can turn third-party system integrations on and off and forward audiences directly within a user interface.
Bidirectional data sharing is one common scenario in which a CDP’s integrations deliver a distinct advantage over the type of connections that data warehouses typically offer. In these use cases, custom audiences are developed within a CDP and forwarded to a third-party system. As a result of engagements and interactions within that vendor’s environment––like email opens, app engagements, or paid advertising impressions, for example––this data set can be augmented with new events. At this point, the enhanced data is forwarded back to the CDP, where it can be used to power real-time customer interactions (assuming that CDP exposes an API for querying customer profiles) or sent to another tool to drive other use cases.
For a detailed example of how this works in practice, see this integration use case that details how to execute a multi-channel marketing campaign leveraging mParticle’s feed integration with Iterable.
CDPs and data warehouses: Better together, not one or the other
Above, we focused on use cases in which a best-in-class CDP can deliver value and functionality beyond what data warehouses are typically capable of. However, this does not mean that a CDP should take the place of a data warehouse in the data stack––especially in situations where a DWH populated with rich historical data from multiple sources is serving as a source of truth across the organization. In these scenarios, the DWH still plays a critical role in storing data records long-term, and enabling analytics like churn risk prediction, serving product recommendations, and powering BI and strategic analysis.
In the modern data stack, CDPs and data warehouses can and should coexist fruitfully. The notion that they are mutually exclusive denies the unique capabilities of both tools.