Engineering—November 04, 2021

When to use a data lake vs data warehouse

Enabling teams with access to high-quality data is important for business success. The way in which this data is stored impacts on cost, scalability, data availability, and more. This article breaks down the difference between data lakes and data warehouses, and provides tips on how to decide which to use for data storage.

A recent MIT Sloan report explains that organizations trying to seek returns from data need to create a culture in which all employees see data as their business. Wixom, a principal research scientist at MIT Center for Information Systems Research and author of the report, describes that “It's like data's a team sport and that the entire organization is the team,”

As more functions across the organization focus on leveraging data to make strategic decisions, the way in which data is stored is becoming increasingly important.

Data engineers need data to be connected through the pipeline in a way that is stable and efficient
Data/ BI analysts need data sets that are extensive so that they can run queries and create reports
Marketers need to be able to access real-time data and use it to power personalized customer experiences
Data Privacy Officers and Privacy Engineers need data to be stored in a way that protects consumer privacy

When building your data pipelines, it’s important to understand the needs of data consumers and ensure that the data storage systems match those needs. This blog will walk through two common storage solutions, data lakes and data warehouse, and discuss which data use cases each is best suited for.

What is a data lake?

A data lake is a system in which data is stored without any consistent structure. Data lakes will often contain high volumes of data as well as a variety of data types, and the purpose of that data is often yet to be defined. Because data stored in a data lake is inconsistent in both structure and type, it is not optimized for query optimization. That said, the volume and variety of information in data lakes make them powerful tools in the hands of data scientists who can leverage sophisticated analytics techniques to uncover predictive insights.

A typical data lake may contain product SKU information stored as text files, mobile user activity stored as JSON objects, and flat file extracts from a relational database. These individual data sets may each be structured in their own way, but their storage in a data lake is not optimized for querying in the interest of business reporting and analysis.

While data lakes often surface a variety of APIs and interfaces for users to input data, their ingestion process is not automated. Rather, the data lake’s owners must replicate data from other sources to store it in the Data Lake.

Some common data lake providers are:

In recent years, tools, such as Amazon Athena, have been introduced that make querying unstructured or semi structured data in a data lake far more accessible to a user comfortable with SQL.

For information on how data lakes compare to Customer Data Platforms (CDPs), check out this post.

What is a data warehouse?

A data warehouse is a database where data from different systems is stored and modeled to support analysis and other activities. The data stored in a data warehouse is cleansed and organized into a single, consistent schema before being loaded, enabling optimized reporting. The data loaded into a data warehouse is often processed with a specific purpose in mind, such as powering a product funnel report or tracking customer lifetime value.

Some common data warehouse providers are:

The data in a data warehouse is available to Data Analysts and BI Analysts for querying. Often, tools such as Looker and Tableau are used as an interface to run queries and build reports on the data stored in a data warehouse.

Data warehouse architecture is often broken into three tiers. The top, most accessible tier is the front-end client that presents results from BI tools and SQL clients to users across the business. The second, middle tier is the Online Analytical Processing Server (OLAP) that is used to access and analyze data. The third, bottom tier is the database server where data is loaded and stored. Data stored within the bottom tier of the data warehouse is stored in either hot storage (such as SSD Drives) or cold storage (such as Amazon S3) depending on how frequently it needs to be accessed.

For information on how data warehouses compare to CDPs, as well as how they can be used in tandem, check out this post.

When to use a data lake vs data warehouse

The decision of when to use a data lake vs a data warehouse should always be rooted in the needs of your data consumers.

For use cases in which business users comfortable with SQL need to access specific data sets for querying and reporting, data warehouses are a suitable option. That said, storing data in a data warehouse is more expensive than storing it in a data lake, and making changes to the types or properties of data stored in a data warehouse is difficult.

Data lakes, on the other hand, are better at storing a vast data set containing unstructured or semi structured data that does not need to be accessed and queried imminently. Data lake storage is cheaper than data warehouse storage, making it a great solution for storing historical records or “cold” files (similar to a laboratory archive). If data structure is frequently changing, it makes sense to store it in a data lake, as making changes to the data stored in a data lake is far easier than modifying a warehouse shema.

In reality, data lakes and data warehouses often sit side-by-side in a company’s data infrastructure, each being used for the needs that best match its capabilities. Some use cases may even begin by exploring unstructured data in a lake, and then moving it into a data warehouse for better querying.