Growth—January 08, 2021

Why real-time data processing matters

Business-critical systems shouldn't depend on slow data pipelines. Learn more about real-time data processing and how implementing it strategically can increase efficiency and accelerate growth.

Customer data has become one of the most valuable resources in business. In 2020, the global big data market was estimated to have been worth $189 billion, and by 2025 that valuation is expected to grow to $225 billion. Such forecasts have led The Economist and others to dub data the new oil.

While the comparison makes for a good headline, there’s a key difference between data as an asset and oil as an asset: the way in which each stores value over time.

For product managers and growth teams using first-party data to understand customer engagement and power contextual customer experiences, the value of a data point depends on how recently it was created. The longer a data point sits in a “processing” stage before business users can access it, the less valuable it is when activated. To maximize the value of the data you’re collecting and deliver the best experiences to your customers, it’s critical to ensure that you’re able to access data in as close to real time as possible and to avoid batch data processing where you can.

What’s the difference between batch and real-time data processing?

As a customer engages with your app or website, data is continually being created in the form of engagement events and attributes. Once this data is created, there are two ways to move it into your business systems: batch processing and real-time stream processing.

Batch processing is the loading of data at rest from a data storage system to an analytics or customer engagement data system. During batch processing, data points are commonly transformed to match the destination schema. Batch processing can take anywhere from hours to days, depending on the size of the data set being processed.

The typical batch processing architecture consists of the following components:

Data sources: The digital properties where events are being created
Data storage: A distributed file store that serves as a repository for high volumes of data in various formats
Batch processing system: Solution that processes data files using long-running batch jobs to filter, aggregate, and prepare data for analysis
Analytical data store: System that can store and serve processed data so that it can be queried by analytics tools
Analysis, reporting, and customer engagement: Downstream systems that enable data consumers to access processed data within a friendly interface

One of the challenges with batch processing is that it often groups data in time slices. For example, a batch processing system may upload data once a day, processing events from 12:00am to midnight on day one in one batch and events from 12:00am to midnight on day two in a second batch. An event that is created at 12:01am, therefore, may not be made available to downstream consumers until midnight, nearly 24 hours later. Furthermore, if a customer happens to enter the customer journey at the end of one batch, say at 11:30pm, and complete the journey at the beginning of the next batch, say at 1:00am, it may be difficult to link the events of those engagements in real time. Batch processing is acceptable for ad hoc initiatives, such as historical analysis or training of machine learning models, but when used to power real-time systems, such as product testing, transactional messaging, and ad retargeting, it can lead to inaccurate results, bad customer experiences, and wasted budget.
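To make the time-slice arithmetic concrete, here is a minimal Python sketch. It assumes a hypothetical pipeline that runs one batch per calendar day, releases it at the midnight that closes the day, and takes no processing time; the function names are illustrative and do not come from any particular tool.

```python
from datetime import datetime, timedelta


def batch_available_at(event_time: datetime) -> datetime:
    """Return the midnight that closes the calendar day containing the event.

    Assumption: one batch per day, released at that midnight, with no
    additional processing time.
    """
    day_start = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return day_start + timedelta(days=1)


def batch_delay(event_time: datetime) -> timedelta:
    """How long the event waits before downstream consumers can see it."""
    return batch_available_at(event_time) - event_time


# An event created at 12:01am waits almost a full day.
early_event = datetime(2021, 1, 8, 0, 1)
print(batch_delay(early_event))  # 23:59:00

# A journey that starts at 11:30pm and ends at 1:00am is split across batches.
journey_start = datetime(2021, 1, 8, 23, 30)
journey_end = datetime(2021, 1, 9, 1, 0)
split_across_batches = (
    batch_available_at(journey_start) != batch_available_at(journey_end)
)
print(split_across_batches)  # True
```

In this toy model, the two halves of the journey become available a full day apart, which is why linking them in real time is difficult.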

Real-time processing, on the other hand, allows you to capture events as they are created and process them with minimal latency to generate instantaneous reporting or automated responses. With data available in real time, data consumers can run granular testing and deliver more timely customer experiences. In contrast with batch processing, data being streamed is unbounded, meaning that it is made available on an ongoing basis and sequential data points are never separated into segregated uploads.

The typical real-time processing architecture consists of the following components:

Real-time message ingestion: A system built to capture and store real-time messages to be consumed by a stream processing consumer
Stream processing: A system for capturing and processing real-time messages, which filters, aggregates, and otherwise prepares data for analysis
Data store: System that stores processed data in a structured format so that it can be queried by business tools
Downstream activation systems: Tools managed by data consumers that activate data and provide insights through analytics and reporting
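The four components above can be sketched in a few lines of Python. This is a toy, in-memory illustration rather than a production design: the queue stands in for a message ingestion system, the dictionary for a data store, and the list for a downstream tool. All names here are hypothetical.

```python
from collections import defaultdict
from queue import Queue

ingest_queue = Queue()        # stands in for real-time message ingestion
event_counts = defaultdict(int)  # stands in for the data store
activations = []              # stands in for a downstream activation system


def ingest(event: dict) -> None:
    """Accept an event the moment it is created."""
    ingest_queue.put(event)


def process_pending() -> None:
    """Stream processing step: filter, aggregate, and hand off each event."""
    while not ingest_queue.empty():
        event = ingest_queue.get()
        # Filter: drop events missing the fields downstream tools rely on.
        if "user_id" not in event or "name" not in event:
            continue
        # Aggregate: keep a running count per user and event name.
        key = (event["user_id"], event["name"])
        event_counts[key] += 1
        # Activate: make the updated count available downstream right away.
        activations.append((event["user_id"], event["name"], event_counts[key]))


ingest({"user_id": "u1", "name": "add_to_cart"})
ingest({"name": "malformed_event"})  # no user_id, so it is filtered out
ingest({"user_id": "u1", "name": "add_to_cart"})
process_pending()

print(event_counts[("u1", "add_to_cart")])  # 2
print(len(activations))  # 2
```

Each event is handled individually as it arrives, so there is no batch boundary that could separate two related events.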

The challenge in processing data in real time is primarily technical. On the processing side, it’s critical to have a system in place that is able to support high-volume writes. On the data consumption side, your team needs to have tools in place that allow you to activate, analyze, and connect data in real time as it is made available.

Why does real-time data processing matter?

Today’s customers move fast. To keep pace, you need a data infrastructure in place that enables you to deliver contextual experiences in real time. Whether it’s the home screen of your mobile app, the language of a transactional email, or an ad shown on Twitter, successful personalization depends on having access to the most up-to-date customer data, and on being able to activate that data quickly.

While batch uploads remain suitable for ad hoc initiatives, there are several problems that can arise if you power real-time use cases with batch processing.

Wasted marketing budget: Gartner estimates that enterprise brands dedicated 11.5% of their marketing budget to paid advertising on average in 2019. If your ad targeting is based on customer data that is being processed in batch, you may be delivering experiences based on out-of-date data. This can lead to inefficiencies, such as serving ads to a customer who has already made a purchase.
Poor customer experiences: As customers engage with your brand, they are communicating preferences and interests on an ongoing basis. If you’re unable to access and activate this data in real time and are powering engagement systems with out-of-date data points, you risk delivering experiences that are not in line with your customers’ present interests.
Regulatory compliance violations: One of the biggest adaptations consumer brands have had to make in recent years is reforming their systems to comply with GDPR and CCPA. If your data collection architecture is still processing events in batch, your downstream systems may not be receiving consent status updates until hours, or even days, after opt-out. In order to support compliance, it’s critical for your systems to be able to process consent status changes in real time.
Inefficient reporting: Leading product managers are constantly testing features and analyzing how customers are engaging. If your product engagement data is being loaded into analytics tools through batch processing, your testing process will be much slower, and you may encounter gaps in your reporting.
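The consent problem in the list above can be illustrated with a toy model. The sketch below assumes a simplified timeline measured in hours and a hypothetical `consent_lag` parameter representing how long an opt-out takes to reach the system that forwards events. It is an illustration of the timing issue only, not a model of what GDPR or CCPA require.

```python
def events_forwarded_after_opt_out(timeline, consent_lag):
    """Return the events forwarded after a user opted out.

    timeline: list of (hour, kind, user_id) tuples sorted by hour, where
        kind is "opt_out" or "event".
    consent_lag: hours before an opt-out reaches the forwarding system
        (0 models real-time processing, 24 models a daily batch).
    """
    opt_out_hour = {}
    forwarded_after_opt_out = []
    for hour, kind, user_id in timeline:
        if kind == "opt_out":
            opt_out_hour[user_id] = hour
        elif kind == "event" and user_id in opt_out_hour:
            # The forwarding system only learns of the opt-out after the lag.
            opt_out_known = hour >= opt_out_hour[user_id] + consent_lag
            if not opt_out_known:
                forwarded_after_opt_out.append((hour, user_id))
    return forwarded_after_opt_out


timeline = [
    (9, "event", "u1"),
    (10, "opt_out", "u1"),
    (11, "event", "u1"),
    (20, "event", "u1"),
]

print(events_forwarded_after_opt_out(timeline, consent_lag=0))
# []
print(events_forwarded_after_opt_out(timeline, consent_lag=24))
# [(11, 'u1'), (20, 'u1')]
```

With no lag, nothing is forwarded after the opt-out. With a 24-hour lag, both later events are still forwarded.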

How can you get started?

Whether you’re currently processing all customer data with batch uploads, or you’ve only partially adopted real-time data collection, it’s never too late to invest in accelerating your data pipeline.

First, take stock of your existing data pipelines. Which systems are providing your growth teams with customer data for product analytics, messaging, advertising, historical reporting, and so on? Second, note the processing speed of each of those systems. Are there any real-time initiatives that are being powered with batch processing? Third, group the downstream systems that should be consuming data in real time and identify a simple, stable way to stream customer data into these systems.
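One lightweight way to run this audit is to write the inventory down as data and flag the mismatches. The sketch below is only an illustration: the `Pipeline` fields, the sample inventory, and the one-minute threshold are all assumptions you would replace with your own systems and requirements.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    destination: str        # downstream system receiving the data
    latency_minutes: float  # observed delay from event creation to availability
    needs_real_time: bool   # whether the use case depends on fresh data


def flag_mismatches(pipelines, max_latency_minutes=1.0):
    """Return destinations that need real-time data but receive it too slowly."""
    return [
        p.destination
        for p in pipelines
        if p.needs_real_time and p.latency_minutes > max_latency_minutes
    ]


# Hypothetical inventory; 1440 minutes models a once-a-day batch upload.
inventory = [
    Pipeline("product analytics", 1440, needs_real_time=False),
    Pipeline("ad retargeting", 1440, needs_real_time=True),
    Pipeline("transactional messaging", 0.1, needs_real_time=True),
    Pipeline("historical reporting", 1440, needs_real_time=False),
]

print(flag_mismatches(inventory))  # ['ad retargeting']
```

The flagged destinations are the ones to prioritize when you move from batch uploads to streaming.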

An effective way to set up stream processing is by implementing a Customer Data Platform with real-time data collection capabilities. mParticle’s Events HTTP API, for example, enables you to ingest customer data from backend data sources in real time and make it available to data consumers instantly. Additionally, our selection of client-side SDKs collects user events, as well as app and device information, session events, and install events, in real time. Not only are business teams able to view streaming data in an accessible interface, but they can also set up real-time data forwarding to any of the tools and systems they’re using to drive growth via 280+ packaged API integrations. This enables teams to get a better picture of customer engagement, build dynamic audience segments, deliver contextual customer experiences with greater efficiency, and more.

For a real-world example of real-time processing in practice, see this breakdown of how the Walmart Global Tech team uses real-time data processing for monitoring and reporting.

To learn more about the mParticle Events API, you can explore the documentation here.

To see first-hand how you can activate real-time customer data as it’s ingested into mParticle, you can explore the platform demo here.