Reverse ETL: What it is, what it isn’t, and where it fits into the data landscape in 2023
Reverse ETL is a process in which data is delivered from a data warehouse to the business applications where non-technical teams can put it to use. By piping data from a data warehouse to downstream business systems, reverse ETL tools fill the gap between data storage and activation.
The modern data landscape is densely populated with various tools and systems that collect, store, analyze, monitor, and transfer data. Businesses have a plethora of options for building out their data stack, and several factors––like business goals, internal resources, and existing infrastructure––can influence how data teams choose between these options. But one basic truth from the earliest days of enterprise data still holds: leveraging data in any capacity requires handling separate but interrelated lifecycle phases: collection, transformation, storage, and activation. Or more simply, it requires the “pulling” of data (from collection to storage), and “pushing” of it (from storage activation).
First things first (before the reverse)
With this basic understanding of the data lifecycle in mind, we can begin to break down one of the buzziest concepts in the world of data today: Reverse ETL. For context, let’s begin by breaking down what happens to data at each step in the ETL (or Extract, Transform, Load) process.
You can’t eat a fish before you catch it. Before you can leverage data in any capacity, you need to collect or “extract” it from places where it is produced or already stored. Most data strategies incorporate data from a wide variety of sources, including but not limited to owned applications and websites, CRM platforms, and legacy systems. Extracting data from these locations is the first step towards conducting business intelligence or analysis, and providing stakeholders like marketing and product teams with the data they rely on.
When data is extracted from a source, it is typically in a raw state and not yet suitable for analysis or activation. At the collection phase, data may include duplicate, inaccurate, or outdated records, and it won’t conform to a centralized data model that matches incoming data from other locations. This is where data transformation comes in. At this stage, data is cleansed, deduplicated, verified, and organized so that it adheres to a consistent structure and meets standards of data integrity. This step aims to ensure that when the data is loaded into a central repository and used for analysis, it will produce accurate and reliable results.
Once data has been collected and transformed, it’s time to deliver it into a system of record where it can be combined with other data, analyzed, and delivered to downstream tools (we’ll get to that part later). This is the purpose of the “load” step. Often, the system into which the cleaned and organized data is loaded is a data warehouse, like BigQuery, Amazon Redshift, or Snowflake.
The three processes that comprise ETL happen in the “pulling” half of the data lifecycle. ETL accomplishes the task of gathering data from disparate sources, organizing it into a cohesive structure, and delivering it into a central repository for storage and analysis. Once the ETL process is complete and data is housed in the data warehouse, the data is useful in a number of ways. For example, data scientists and analysts can analyze historical data to predict future trends, or produce dashboards for internal teams using BI tools like Looker and Tableau.
Data stored in a data warehouse has limited utility to semi- and non-technical stakeholders like marketing and product teams, however. For these teams to take action, data must be pushed to the “the last mile” of the data stack––that is, delivered to systems like engagement platforms and analytics tools where business stakeholders can put it to use. This is the gap in the data lifecycle that reverse ETL tools fill.
What is Reverse ETL?
Reverse ETL describes a data processing workflow in which data is extracted from a centralized repository, transformed into a structure that meets the ingestion requirements of specific systems, then loaded into these systems in order to be activated or analyzed. This should sound familiar, since these are the same steps touched on above. The only difference in reverse ETL is that these processes occur in the “pushing” rather than the “pulling” half of the data lifecycle.
In reverse ETL, the extraction process entails querying and copying desired data from the data warehouse. Once copied, this data is transformed into a state that is compatible with the API specifications of a target downstream system. Finally, it is loaded into these tools, where end users can activate it. By handling these processes, reverse ETL tools provide companies with a way to sync data from their data warehouse to downstream systems, and provide marketing, product, sales, and other business teams with a way to utilize data from the data warehouse.
What benefits and use cases do reverse-ETL tools deliver?
By acting as a bridge that takes data from the data warehouse to the tools where business teams can put that data to use, a reverse-ETL tool can unlock a variety of use cases and create efficiencies for both business and data teams. Let’s explore some of those here.
Activation of data in the data warehouse
A data warehouse is a great tool for storing vast quantities of data over a long period of time and enabling data engineers and data scientists to query and analyze this information. But without a way to pipe this data to the tools that matter in day-to-day customer engagement, the data warehouse is of little value to non-technical internal teams like marketing, sales, and customer support. Reverse ETL tools have the ability to make copies of the data in the data warehouse, transform it to meet the specifications of external APIs, and deliver it into tools like Salesforce, Hubspot, and Google Ads. Bridging this gap helps make data in the data warehouse accessible to non-technical teams, thereby increasing its value.
Increased data engineering efficiency
Even without a reverse ETL tool, data in the data warehouse can be made available to non-technical teams. However, in the absence of a pipe that connects the data warehouse to downstream systems, data engineers would have to take on a lot of manual work to enable marketing, sales, and other teams to leverage the data they need in the form of building and maintaining each of these API connections individually. This could quickly amount to an unsustainable amount of engineering overhead, and significantly add to the time it takes for business teams to access the data they need when they need it.
Improved data access for non-technical teams
Reverse ETL tools increase the speed at which marketing, sales, and other business teams can access and operationalize the data living in the data warehouse. Without the pre-built API connections and downstream tools, data engineers would have to manually query, cleanse, and transform datasets for end users whenever new use cases arise. Reverse ETL solutions enable business teams to forward pre-build data models to downstream tools without engineering support.
Where does reverse ETL fall short?
As illustrated above, reverse ETL tools enhance the value of the data warehouse by operationalizing the data it houses for non-technical users. They also benefit data engineers by reducing the work required to move data from the data warehouse to downstream tools.
Reverse ETL tools are primarily concerned with the outgoing phase of the data lifecycle, however. While they do successfully alleviate certain challenges related to operationalizing data, they do not address issues pertaining to the quality, enrichment, and real-time activation of data throughout its lifecycle.
Let’s explore a few of the challenges that remain unaddressed in a data stack centered on a data warehouse coupled with reverse ETL.
Ensuring data quality at the point of ingestion
Nothing summarizes the challenge at the heart of the modern data ecosystem like the old adage “garbage in, garbage out.” If the data you collect is inaccurate, inconsistent, or out of date, everything and everyone who relies on it will feel the consequences. Customer messaging will be irrelevant, product insights will be misguided––the list goes on. Data quality is essential to the successful activation of customer data. But considering the fragmented nature of the modern digital customer journeys (a web session yesterday, a mobile app today), ensuring data quality across platforms, devices, and sessions is one of the biggest challenges that data engineers face today. As reverse ETL tools focus squarely on data connection and not collection, they do not provide engineers with ways to address data quality before events are forwarded to the data warehouse. This means that engineers who implement data collection need to devise some other way to validate incoming data, and ensure that event names and properties align with data consumers’ needs.
The power and utility of a data warehouse comes from the volume and variety of data that it is capable of storing. But without a way to unify events performed by the same user on separate devices, in separate sessions, in multiple login states, customer data has very limited utility. Unifying cross-channel data to 360-degree customer profiles is called identity resolution, and it is the backbone of a brand’s ability to deliver personalized customer experiences across touchpoints. Like data governance, a data warehouse paired with a reverse ETL tool also lacks a built-in solution for conducting identity resolution. This is another complex data orchestration task that the modern data stack leaves up to data engineers to handle manually.
Eliminating the need for engineers to support non-technical users
As discussed above, reverse ETL tools relieve data engineers of coding API connectors required to operationalize data. However, engineers still need to play an active role in serving the data needs of non-technical teams. This is because building new audiences with a reverse ETL tool still entails having a deep understanding of your data schemas and querying the data warehouse using SQL. While some reverse ETL tools allow users to select and forward data models with a visual interface, these are limited to existing models that require in-depth knowledge of the data schema and proficiency with SQL to build. A marketer or PM who doesn’t have this proficiency or familiarity with the data warehouse architecture cannot easily create a new customer audience, forward it to an activation tool, and use it in a campaign. For this reason, reverse ETL tools do not alleviate the bottlenecks that can arise between growth and engineering teams.
An answer to complying with privacy regulations and consent states
Reverse ETL tools do not deliver any data governance features out of the box, which means that teams relying on them either need to build a process for handling user privacy and regulatory compliance internally, or acquire an additional data orchestration solution to do so. For many teams (especially at smaller companies), the outlay of time and resources required to handle these processes without the aid of a productized solution could be significant.
Deciding on the right tools for your data stack
Whether a reverse ETL tool is the right solution for your organization is a complex decision. The answer depends on your current data infrastructure, internal engineering resources, and business needs. Depending on these factors, a CDP may represent a more flexible and robust solution to a wider range of data use cases––especially for teams who have limited data engineering resources and/or depend on real-time data activation to deliver personalized customer engagement.