What is data engineering?
The quantity and complexity of the data that companies deal with are constantly increasing. While Data Scientists analyze and generate actionable insights from data, they cannot do so effectively when that data suffers from poor quality. Data Engineering roles exist in companies to build data pipelines, transform data into useful formats and structures, and ensure the quality and completeness of data sets.
On July 20, 1969, at 10:56 PM EDT, Neil Armstrong made history by becoming the first human being to set foot on the surface of the moon. Apollo 11, the mission that sent Neil Armstrong, Buzz Aldrin and Michael Collins to a celestial body other than our own, was the crowning achievement of a decade of space exploration. During this era, public attention was transfixed on the astronauts and cosmonauts who traveled into space, and names like Alan Shepard, John Glenn and Yuri Gagarin were known in households throughout the world.
Equally important to the success of these missions, however, were the world-class engineers who built countless new technologies enabling humans to escape the bounds of earth and return home safely. Without these minds solving seemingly insurmountable technical challenges, astronauts would not have a Saturn V rocket to escape the earth’s atmosphere, nor a lunar excursion module in which to land on the moon. And without brilliant mathematicians like Katherine Johnson determining launch windows, escape velocity trajectories, and return paths, these men and machines could have never left the earth’s atmosphere in the first place.
Just as engineers and scientists enabled the miracle of space flight through repeated testing and experimentation, Data Engineers allow Data Scientists to extract meaningful information from data by building pipelines that transform raw data into usable structures. The astronauts in the space capsules were only as capable as the machines they piloted, and Data Scientists can only extract effective insights from information that is organized and structured. With the modern data ecosystem being as complex as it is, this is no simple task––especially at organizations that collect copious amounts of information in different formats and store this data in various distributed systems.
In this post, we’ll explore Data Engineering in the context of an organization’s data analysis cycle, look at the role Data Engineers play alongside their Data Scientist counterparts, and examine how a Data Engineer’s job is continuing to evolve alongside increasingly sophisticated tools for transforming information and delivering it to the appropriate stakeholders.
Born out of “big data,” a new field gets a trial by fire
In the late 2000s and early 2010s, significant advancements in information processing and networking technologies dramatically increased the amount and variety of data that companies were able to collect, and the speed at which they could collect it. Finding themselves suddenly inundated with information, organizations of all kinds––but especially tech-forward companies like the Facebooks and Ubers of the world––sought ways to accurately and quickly make sense of this new wellspring of data.
At this time, it became clear that traditional extract, transform, load (ETL) tools would no longer be adequate to morph the volumes of data that companies were collecting into formats ready for analysis and meaningful to stakeholders. Initially, the tasks of building the internal data infrastructure and pipelines necessary to transform data often fell on Data Scientists, but soon, companies realized that the skills and expertise necessary to handle these tasks were best delegated to a separate team altogether. With this, a new type of software engineering role was born––one that would focus entirely on data and the various knowledge domains that fall within it, such as warehousing, infrastructure, mining, modeling and pipeline building.
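To make the extract, transform, load pattern concrete, here is a minimal, single-machine sketch in Python. The event records, field names, and `purchases` table are all hypothetical, and an in-memory SQLite database stands in for a real warehouse; production pipelines would pull from real sources and run at far greater scale.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from an upstream source.
raw_events = [
    {"user": "alice", "amount": "19.99", "country": "us"},
    {"user": "bob", "amount": "5.00", "country": "US"},
]

def extract():
    # In practice this would read from an API, log files, or a source database.
    return raw_events

def transform(events):
    # Normalize types and formats so downstream analysis is consistent.
    return [
        (e["user"], float(e["amount"]), e["country"].upper())
        for e in events
    ]

def load(rows, conn):
    # Write the cleaned rows to the destination store.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user TEXT, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
print(total)
```

Even this toy version shows why the pattern breaks down at scale: each new source or destination means another hand-written extract, transform, or load step.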
That role is Data Engineering, and over the last decade, this title has seen a level of growth and demand paralleling that of Data Scientists. Companies are continuing to collect data at an increasing velocity, which means that Data Engineers are more integral than ever to an organization’s analysis efforts. The precise function of Data Engineers and their relationship to their colleagues in Data Science continues to evolve, however. While the exact delineation between these functions depends on the needs of individual organizations, there is a rough dividing line between the skills and knowledge areas typically falling within these two aspects of data analysis.
A Data Engineer’s role depends on a company’s data culture
The role and day-to-day responsibilities of Data Engineers can vary significantly at different organizations, and this depends heavily on the broad strategy the organization employs for collecting, storing, and disseminating their data to stakeholders. One approach to internal data management is to have a data team that owns the responsibility of collecting and storing all of a company’s data, and dispensing different segments, audiences, and insights to internal teams on an ad hoc basis. In these cases, a central group of Data Engineers presides over all of the company’s data stored in a data lake or warehousing solution. When specific use cases arise out of the needs of marketing, product, and data science teams, Data Engineers will either build or modify a data pipeline or workflow to get that data to the appropriate internal endpoint in the correct form.
While one advantage of this centralized approach to handling data is a clean delineation of team responsibilities, it does come with certain pitfalls––most significantly, a lack of flexibility and sustainability in data pipelines. Companies’ internal data needs transform rapidly, especially at organizations that develop and ship a wide range of customer-facing products with complex functionality. Marketing and product teams have an ongoing need to understand different aspects of customer behavior, which means that these teams’ data requirements will be varied and nuanced. In these cases, Data Engineering teams may find themselves in never-ending data shipping cycles, continuously creating use case-specific data workflows. In the absence of a permanent data infrastructure, Data Engineers end up saddled with repetitive delivery tasks, much as engineering teams do when they must respond to every modification of a data plan.
The growing number of software tools and platforms addressing a wide range of team-specific data and analytics needs has given rise to another way of handling internal data. In this more decentralized model, each team uses the data analysis solutions that enable the use cases they deal with on a regular basis. For example, sales and marketing teams might use Salesforce as a CRM, product teams could leverage Amplitude for analytics, and additional miscellaneous data from across the organization might be housed in a data warehouse.
This scenario enables individual teams to have more flexibility to analyze and act on data relevant to their functions, which could enable them to make more agile decisions. The shortcomings of this model are twofold, however. For one, collecting data with a variety of disparate tools makes inconsistency inevitable. Use cases that require combining data from multiple sources––which is essential to creating a unified view of the customer and enabling cross-channel personalization––will therefore require considerable cleansing and quality assurance on the part of Data Engineers. This leads into the second significant drawback of this type of internal data strategy: Data Engineers spend the vast majority of their time cleansing, unifying, and combining data, and have less time to innovate within their roles and add value to the organization on more strategic levels.
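The kind of cross-source cleansing and unification described above can be sketched in a few lines. This is a deliberately simplified illustration: the two record sets (a CRM export and a product-analytics export) and their fields are hypothetical, and real identity resolution involves far more than matching on a normalized email.

```python
# Hypothetical exports from two separate tools.
crm_rows = [
    {"email": "ada@example.com", "name": "Ada", "plan": "pro"},
    {"email": "grace@example.com", "name": "Grace", "plan": "free"},
]
analytics_rows = [
    {"email": "ADA@example.com", "last_seen": "2024-01-02"},
]

def normalize_email(email):
    # Inconsistent casing and whitespace are typical cross-tool discrepancies.
    return email.strip().lower()

# Merge both sources into unified per-customer profiles keyed on email.
profiles = {}
for row in crm_rows:
    profiles[normalize_email(row["email"])] = {
        "name": row["name"],
        "plan": row["plan"],
    }
for row in analytics_rows:
    key = normalize_email(row["email"])
    profiles.setdefault(key, {})["last_seen"] = row["last_seen"]

print(profiles["ada@example.com"])
```

Without the normalization step, "ADA@example.com" and "ada@example.com" would become two separate customers––a miniature version of the fragmentation problem Data Engineers spend their days fixing.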
Data Engineering at a crossroads
The concept of permanent Customer Data Infrastructure and the solutions that bring this idea to fruition have shifted the paradigm of collecting and leveraging customer data. Now that customers interact with brands across a vast and ever-expanding array of digital touchpoints, companies seeking to place the customer at the heart of everything they do cannot be hamstrung by fragmented, siloed data.
Building a modern data engineering stack
Here are some components of the modern data engineering stack, as well as essential tools and skills that Data Engineers need to connect them:
- Programming languages: It’s a given, but a strong handle on a common, general-purpose programming language like Python or Java (ideally both), the ability to use a query language like SQL, and a solid understanding of object-oriented programming is the bedrock atop which all other Data Engineering knowledge and skills rest.
- Data stores: One or more locations in which to store data will be necessary to physically house all of the information your organization collects, both in structured and unstructured formats. The storage component of the data stack can include databases for housing real-time information about a particular part of your business, as well as data warehouses or data lakes for storing historical information that is not updated in real time.
- Data processing frameworks: Due to the volume and complexity of data that modern companies collect, manually writing code to process this vast amount of information would be a daunting task. Data processing frameworks abstract away some of the common programming tasks involved in processing data, making them invaluable tools for Data Engineers. Some of the most popular and powerful frameworks include Hadoop, Spark, Flink and Storm.
- Data Infrastructure: A data infrastructure solution sits at the heart of a company’s customer data ecosystem, and acts as a central point of access for all of the organization's first-party data. As such, it allows data owners and stakeholders to ensure that incoming data is consistent and accurate, and it enables data from various sources to be combined into unified customer profiles. An Infrastructure Customer Data Platform (CDP), for instance, can fulfill this role in your data stack. These tools establish a new foundational data infrastructure layer to help teams move data freely and securely between systems and applications in real time, while managing data quality and protecting consumer privacy.
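What frameworks like Hadoop and Spark abstract away is, at its core, the map/shuffle/reduce pattern applied across a cluster. The single-machine sketch below, using hypothetical log lines, shows that pattern in miniature; the frameworks themselves handle partitioning, distribution, and fault tolerance on top of it.

```python
from collections import defaultdict

# Hypothetical log lines: "<country> <event>".
log_lines = [
    "us checkout", "de search", "us search", "us checkout", "de checkout",
]

# Map: emit a (key, 1) pair for each record.
mapped = [(line.split()[0], 1) for line in log_lines]

# Shuffle: group emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values within each group.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'us': 3, 'de': 2}
```

In Spark, the same computation is roughly a `map` followed by a `reduceByKey`, with the shuffle step handled transparently across machines.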
For Data Engineers at organizations with robust Customer Data Infrastructure, many of the problems we saw in the two scenarios above are eliminated. With all of the company’s first-party data entering a single system that enforces strict data quality and consistency, Data Engineers are relieved of many of the day-to-day responsibilities of cleaning, transforming and shipping data. They are also relieved of having to build one-off data pipelines and workflows that address single use cases for product and marketing teams. Empowered by the agility that permanent Customer Data Infrastructure affords, Data Engineers can focus on the organization's data needs in a broader, more strategic way, and deliver value to the organization along dimensions that were not possible before. For example, they could research different data vendors and technologies that could allow the company to take advantage of emerging data trends. If the company is leveraging a CDP with a built-in tool for creating and modifying data collection plans in real time, Data Engineers could also focus on maintaining a cohesive data collection strategy that gets the most out of every downstream analytics tool.
The role of Data Engineering has been in a constant state of evolution since it emerged at the beginning of the age of “big data,” and the increasing importance of Data Infrastructure at customer-centric companies will likely have a profound impact on the field’s continued growth. Though it is difficult to predict exactly what Data Engineering tasks will look like as the Data Infrastructure landscape continues to evolve, it is clear that the importance of all professionals who leverage data to drive outcomes will only increase.