Engineering—December 15, 2020

What is data orchestration?

Data orchestration is an automated process in which a software solution combines, cleanses, and organizes data from multiple sources, then directs it to downstream services where various internal teams can put it to use. The purpose of data orchestration is to help a company make its data as useful and versatile as possible.

If you listened to the bassoon part from Beethoven’s Fifth Symphony by itself, you would probably only hear random notes. If you listened to it played along with every other part of the orchestra, you would hear one of the most recognizable pieces of music ever written. Just as an orchestra needs a conductor to cue different sections on when to come in, when to rest, and when to crescendo and decrescendo, the data you collect from your customers needs a conductor to combine it with other data and turn it into meaningful information.

In the not-so-distant past, data scientists and engineers were the sole conductors who organized raw data and directed its flow to various services and platforms. Using cron jobs and Python scripts as their batons, these developers were responsible for data pipeline tasks such as programmatically defining scheduling and monitoring jobs, managing their dependencies, handling failures and alerts, and manually combing through logs to evaluate performance, to name a few.

Increasing data scale and complexity

Over the last few years, the complexity and scale of the data landscape have increased dramatically. For any company with a modern data pipeline and infrastructure, organizing and directing the flow of data is becoming far too onerous a task for engineers to manage directly.

One factor placing an increasing strain on data scientists saddled with these responsibilities is the sheer, mind-boggling amount of data that people produce and companies collect each day. According to a 2019 study by DOMO, Americans alone use 4,416,720 GB of internet data every minute of every day. The sources and platforms gobbling up the bytes are many and varied: 23,211 social media posts containing the hashtag #love, 4.5 million YouTube videos streamed, 8,683 Grubhub orders and 1.4 million Tinder swipes are just a few examples of what adds up to a new 4.4 million GB every 60 seconds.

All the while, companies collect this information via their websites, apps, server logs and other online and offline touchpoints. Nearly every modern organization accumulates some amount of data from their current and potential customers, and many companies are sitting on petabytes upon petabytes. Before the emergence of the cloud, most of this data lived on in-house systems. The rise of distributed storage infrastructure, however, has moved data offsite to a variety of remote systems, each with its own APIs dictating the access and organization of its contents.

A conductor is a great orchestrator of a sextet, and even a 64-piece orchestra, but no conductor will be able to get 50,000 musicians to play a single piece of music at once. Similarly, the speed and volume of data collection have outpaced any data team’s ability to organize and direct it down the appropriate pipelines without help. This is why over half of the data that companies amass falls into a category that Gartner has termed “dark data”: information that takes up space on a server but is never used for analytics, customer engagement, or any other activity that can help a business grow and succeed.

The role of data orchestration

The purpose of any data orchestration platform is to shine a light on dark data. Just as a musical score allows all of the players in an orchestra to play the same piece of music on different instruments in different places, data orchestration tools allow each customer data point to perform in harmony with other information at a scale beyond the limitations of a conductor.

Data orchestration tools have evolved since their first introduction into the SaaS ecosystem, and individual providers often emphasize different tasks or steps in the process. While there is no single definition of data orchestration, these are the processes and workflows that are most commonly associated with it:

Data Collection

In order to organize and direct data, you need to have the data to begin with. The first step in data orchestration is collecting data from your organization’s customer touchpoints. The most advanced modern data orchestration tools provide SDKs and APIs that can be implemented directly into your company’s websites and apps; however, there are solutions that leave initial ingestion to external mechanisms. As we’ll see in the later steps, there are significant advantages to using an end-to-end service that handles data ingestion.
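As a rough illustration of what checking data at the point of collection might look like, the sketch below validates an incoming event against a small schema before it is accepted. The schema, the field names (`event_name`, `user_id`, `timestamp`) and the `validate_event` helper are all hypothetical and are not taken from any particular SDK.

```python
from datetime import datetime
from typing import Any, Dict, List

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {
    "event_name": str,
    "user_id": str,
    "timestamp": str,  # ISO 8601 string, e.g. "2020-12-15T09:30:00"
}


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return the problems found in an event; an empty list means it passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    # Only try to parse the timestamp if it is present and is a string.
    if isinstance(event.get("timestamp"), str):
        try:
            datetime.fromisoformat(event["timestamp"])
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems
```

An event that comes back with an empty list can be passed along as-is; anything else can be rejected or quarantined before it reaches internal systems.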

Data Preparation and Transformation

If the properties and values of incoming data are standardized and checked at collection, most of the preparation and transformation work is complete by the time customer data enters internal systems. However, solutions that do not ingest data directly must develop a system for unifying data housed in disparate sources across the organization, where it may exist in various formats and languages. Once all of the internal data structures are thoroughly mapped out and fed into the data orchestration tool, it can then transform data from various systems into an internally compatible format.

Since data transformation often entails mutating identifying characteristics like property names and values, it is a potentially error-prone process when performed manually and therefore must be handled with care. If the accuracy of the data is compromised here, many costly problems could arise from downstream services ingesting incorrect information. End-to-end data orchestration solutions that handle collection do not rely as heavily on preparation and transformation, which gives them a distinct advantage in terms of data quality and accuracy.
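To make the transformation step more concrete, here is a minimal sketch that renames properties from two source systems into one internal convention. The source names (`web`, `pos`), the property names and the `to_internal_format` helper are assumptions made for illustration only. The function fails loudly on any property it does not recognize, which is one simple way to keep incorrect data from slipping downstream unnoticed.

```python
from typing import Any, Dict

# Hypothetical mappings from each source system's property names to one
# internal naming convention. Real mappings would come from your own audit.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "web": {"userId": "user_id", "emailAddress": "email", "evt": "event_name"},
    "pos": {"CUSTOMER_ID": "user_id", "EMAIL": "email", "ACTION": "event_name"},
}


def to_internal_format(source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename a record's properties to the internal names for its source."""
    mapping = FIELD_MAPS[source]
    unmapped = sorted(set(record) - set(mapping))
    if unmapped:
        # Fail loudly instead of silently passing unknown data downstream.
        raise ValueError(f"unmapped properties from {source}: {unmapped}")
    normalized = {mapping[name]: value for name, value in record.items()}
    # Normalize values too: trim and lowercase emails so they compare consistently.
    if isinstance(normalized.get("email"), str):
        normalized["email"] = normalized["email"].strip().lower()
    return normalized
```

With this in place, a web record and a point-of-sale record describing the same customer end up with identical property names, which is what the later steps rely on.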

Data Unification

This is perhaps the most important thing any data orchestration tool can deliver: the ability to turn the whole of your data into something more valuable than the sum of its parts. Once data from different sources has been transformed into an internally interchangeable structure, whether during or after collection, the next step is to create a unified view of the customer. This means stitching together data collected from all of your apps, websites, point-of-sale devices and other touchpoints to understand how each of your current and potential customers is interacting with your brand.
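The stitching described above can be sketched, in a deliberately simplified form, as merging any records that share an identifier. The identifier fields (`user_id`, `email`, `device_id`) and the `unify` function are hypothetical, and real identity resolution is considerably more careful than this (a shared device, for example, would wrongly merge two people here).

```python
from typing import Any, Dict, List

# Hypothetical identifier fields that can link a record to a customer.
ID_FIELDS = ("user_id", "email", "device_id")


def unify(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge records that share any identifier value into one profile each."""
    profiles: List[Dict[str, Any]] = []
    for record in records:
        ids = {(f, record[f]) for f in ID_FIELDS if record.get(f) is not None}
        # Split existing profiles into those that share an identifier and the rest.
        matches = [p for p in profiles if p["ids"] & ids]
        others = [p for p in profiles if not (p["ids"] & ids)]
        merged_ids = set(ids)
        merged_events: List[Dict[str, Any]] = []
        for match in matches:
            merged_ids |= match["ids"]
            merged_events.extend(match["events"])
        merged_events.append(record)
        profiles = others + [{"ids": merged_ids, "events": merged_events}]
    return profiles
```

In this sketch, an anonymous page view recorded against a device is folded into a known customer's profile as soon as a later record ties that device to a user ID.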

Delivery and Activation

Once your data is transformed and unified into complete customer profiles, it is ready to be sent to the tools that your teams use every day to drive growth, such as analytics platforms, audience engagement tools, and business intelligence and management solutions.
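One way to picture delivery is as a fan-out: each unified profile is forwarded to every downstream tool that has a use for it. The sketch below assumes a hypothetical `Destination` class and `deliver` function; an actual platform would call each tool's own API and would retry or queue failed sends rather than just printing them.

```python
from typing import Any, Callable, Dict, List

Profile = Dict[str, Any]


class Destination:
    """A hypothetical downstream tool: a name, a send function, and a filter."""

    def __init__(self, name: str, send: Callable[[Profile], None],
                 wants: Callable[[Profile], bool] = lambda profile: True):
        self.name = name
        self.send = send
        self.wants = wants


def deliver(profiles: List[Profile], destinations: List[Destination]) -> Dict[str, int]:
    """Forward each profile to every destination that wants it.

    Returns the number of profiles delivered per destination. A failure in one
    destination is printed and does not block delivery to the others.
    """
    delivered = {d.name: 0 for d in destinations}
    for profile in profiles:
        for dest in destinations:
            if not dest.wants(profile):
                continue
            try:
                dest.send(profile)
                delivered[dest.name] += 1
            except Exception as error:  # sketch only; real code would retry or queue
                print(f"delivery to {dest.name} failed: {error}")
    return delivered
```

The `wants` filter stands in for per-tool rules, such as only sending profiles with an email address to an email engagement tool.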

Why adopt a data orchestration solution?

Companies that do not leverage a tool to unify and activate their data are likely missing opportunities to capitalize on the stories their customers are telling them. Given the sheer scale and complexity of the data landscape, it is all but impossible to avoid the problem of “dark data” by going it alone. In addition, without data orchestration, your organization’s data scientists and developers are probably spending time organizing and directing the flow of incoming data that they could be using to perform valuable analysis and build customer-facing features.

In addition to ensuring that you derive the maximum potential from your data, a data orchestration tool can help you maintain compliance with the GDPR, CCPA and other data privacy laws. A major component of these laws is requiring companies to prove how, when, and where their customer data is collected, which is difficult to do with data in an unstructured and disorganized state. Furthermore, data privacy laws like the GDPR also give consumers the ability to opt out of data collection, and require companies that have collected their data to delete it on request. Opt-out and deletion requests are much easier to perform when all of a customer’s data is in a unified state than if it resides in disparate silos.
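To see why unified data simplifies these requests, consider the sketch below, in which every customer's data lives under a single key. The `ProfileStore` class and its methods are hypothetical and illustrate the mechanics only; they are not a statement of what any privacy law requires.

```python
from typing import Any, Dict, Set


class ProfileStore:
    """A hypothetical in-memory store of unified customer profiles."""

    def __init__(self) -> None:
        self.profiles: Dict[str, Dict[str, Any]] = {}
        self.opted_out: Set[str] = set()

    def record_event(self, customer_id: str, event: Dict[str, Any]) -> bool:
        """Store an event unless the customer has opted out of collection."""
        if customer_id in self.opted_out:
            return False
        self.profiles.setdefault(customer_id, {"events": []})["events"].append(event)
        return True

    def opt_out(self, customer_id: str) -> None:
        """Stop collecting data for this customer from now on."""
        self.opted_out.add(customer_id)

    def delete_customer(self, customer_id: str) -> bool:
        """Erase everything held for a customer; True if any data existed."""
        return self.profiles.pop(customer_id, None) is not None
```

With one profile per customer, a deletion request is a single lookup. With the same data scattered across silos, each system would have to be searched and purged separately.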

mParticle is an end-to-end customer data management solution that empowers companies to make business personal. Learn more about how mParticle’s best-in-class data unification solutions like User Aliasing API and IDSync can deliver a 360-degree view of your customers and endless possibilities for delivering personalized customer experiences.

Author: Sean Ryan, Technical Writer