Engineering | September 27, 2022

How does Snowflake work? A simple explanation of the popular data warehouse

Learn more about what Snowflake is and how it fits into your data stack.

Jumping between different data projects due to resource limitations can be frustrating and inefficient.

Yet, until recently, most businesses ran data operations this way. Engineers would often need to stop resource-intensive queries so they could mine a database for urgent customer insights. These same data teams would also frequently need to run queries across many nights, when compute resources were not in demand.

But now, thanks to a highly scalable, available, and cost-effective cloud data warehouse like Snowflake, businesses can harness their data to advance their objectives without worrying about resource contention.

Snowflake is an elastically scalable cloud data warehouse 

Snowflake is a cloud data warehouse that can store and analyze all your data records in one place. It can automatically scale its compute resources up or down to load, integrate, and analyze data.

As a result, you can run virtually any number of workloads across many users at the same time without worrying about resource contention. These workloads range from batch processing and real-time streaming to interactive analytics and complex data pipelines.

Consider a typical work scenario where teams want to run different queries on customer data to answer various questions. Your product team may want to understand engagement and retention, while your marketing team may want to understand acquisition costs and customer lifetime value. Running all these queries on one compute resource cluster will create competition for resources, slowing down query performance for both teams. But with Snowflake, you can create separate virtual warehouses for each team, allowing all stakeholders to quickly get the answers they need.
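As a rough illustration of the one-warehouse-per-team setup, the sketch below renders the `CREATE WAREHOUSE` statements you might run. The warehouse names, sizes, and the 60-second auto-suspend value are assumptions chosen for the example; the script only builds SQL text and does not connect to Snowflake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarehouseSpec:
    name: str                 # hypothetical warehouse name
    size: str = "XSMALL"      # Snowflake warehouse size keyword
    auto_suspend_s: int = 60  # suspend after this many idle seconds


def create_warehouse_sql(spec: WarehouseSpec) -> str:
    """Render a CREATE WAREHOUSE statement for one team's warehouse."""
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {spec.name} "
        f"WITH WAREHOUSE_SIZE = '{spec.size}' "
        f"AUTO_SUSPEND = {spec.auto_suspend_s} "
        f"AUTO_RESUME = TRUE"
    )


# One warehouse per team, so their queries never share compute.
teams = [
    WarehouseSpec("product_wh", size="MEDIUM"),
    WarehouseSpec("marketing_wh", size="SMALL"),
]
statements = [create_warehouse_sql(t) for t in teams]

for stmt in statements:
    print(stmt + ";")
```

Each team then points its sessions at its own warehouse, so a heavy retention query from the product team does not slow down the marketing team's reports.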

When a warehouse is configured to run multiple clusters, Snowflake also automatically starts another compute cluster whenever the running clusters are unable to handle all incoming queries, and then balances the load across them. This greatly reduces the risk of queued queries and slow performance during peaks.
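To make the scale-out idea concrete, here is a toy model of it. This is not Snowflake's actual scaling algorithm: the per-cluster capacity of 8 queries, the cluster limits, and the round-robin balancing are all simplifying assumptions for illustration.

```python
import math


def clusters_needed(concurrent_queries: int, per_cluster_capacity: int,
                    min_clusters: int = 1, max_clusters: int = 3) -> int:
    """Toy rule: enough clusters to avoid queueing, clamped to [min, max]."""
    if per_cluster_capacity <= 0:
        raise ValueError("per_cluster_capacity must be positive")
    wanted = math.ceil(concurrent_queries / per_cluster_capacity)
    return max(min_clusters, min(max_clusters, wanted))


def assign_round_robin(query_ids, n_clusters):
    """Spread queries across clusters in round-robin order."""
    buckets = [[] for _ in range(n_clusters)]
    for i, q in enumerate(query_ids):
        buckets[i % n_clusters].append(q)
    return buckets


# 12 concurrent queries exceed one cluster's assumed capacity of 8,
# so a second cluster is started and the load is split between them.
queries = [f"q{i}" for i in range(12)]
n = clusters_needed(len(queries), per_cluster_capacity=8)
print(n, [len(b) for b in assign_round_robin(queries, n)])
```

The upper limit matters in practice: it caps how far the warehouse can scale out, and therefore how much compute you can be billed for during a spike.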

Because Snowflake can scale on-demand capacity and performance as needed, data teams no longer need to run upfront capacity planning exercises. Nor do they need to maintain costly oversized data warehouses that remain mostly underutilized.
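A quick back-of-the-envelope calculation shows why on-demand capacity can beat an always-on, oversized warehouse. The sketch assumes Snowflake's standard credit rates (each warehouse size doubles the hourly rate, starting at 1 credit per hour for X-Small) and a 60-second minimum charge per start; the usage figures are hypothetical, and your actual rates depend on your edition and contract.

```python
# Credits consumed per hour of running time, by warehouse size.
# Each size step doubles the rate (assumed standard rates).
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8}

MIN_BILLED_SECONDS = 60  # assumed minimum charge each time a warehouse starts


def credits_used(size: str, running_seconds: int) -> float:
    """Credits for one continuous run of a warehouse of the given size."""
    billed = max(running_seconds, MIN_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * billed / 3600


# Hypothetical day: a MEDIUM warehouse that runs 2 hours in total
# versus the same warehouse left running for all 24 hours.
on_demand = credits_used("MEDIUM", 2 * 3600)
always_on = credits_used("MEDIUM", 24 * 3600)
print(on_demand, always_on)
```

Under these assumptions, the warehouse that suspends when idle uses a twelfth of the credits of the one that runs all day.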

Snowflake’s architecture automatically allocates the right resources

Snowflake’s decoupled storage, compute, and services architecture enables the platform to automatically deliver the optimal set of IO, memory, and CPU resources for each workload and usage scenario.

Snowflake uses a new multi-cluster, shared data architecture that decouples storage, compute resources, and system services. Snowflake’s architecture has the following three components:

  1. Storage: Snowflake uses a scalable cloud storage service to ensure a high degree of data replication, scalability, and availability without much manual user intervention. It allows users to organize information in databases, as per their needs.
  2. Compute: Snowflake uses massively parallel processing (MPP) clusters to allocate compute resources for tasks like loading, transforming, and querying data. It allows users to isolate workloads within particular virtual warehouses. Access to the databases in the storage layer is governed by the roles and privileges granted to users, so you can control which teams can query which data.
  3. Cloud services: Snowflake runs a set of services that handle metadata, security, access control, and infrastructure management. This is also the layer that client applications, such as the Snowflake web user interface and JDBC or ODBC drivers, connect to.
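The decoupling described in the list above can be pictured with a small toy model. The class names, node counts, and table contents are invented for illustration and say nothing about how Snowflake is implemented internally; the point is only that several compute clusters read one shared copy of the data, and resizing one of them affects nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Storage:
    """Shared storage layer: one copy of the data, visible to all compute."""
    tables: dict = field(default_factory=dict)


@dataclass
class Warehouse:
    """Compute layer: an independent cluster that reads shared storage."""
    name: str
    nodes: int
    storage: Storage

    def count_rows(self, table: str) -> int:
        return len(self.storage.tables[table])


storage = Storage()
storage.tables["customers"] = [{"id": 1}, {"id": 2}, {"id": 3}]

product_wh = Warehouse("product_wh", nodes=2, storage=storage)
marketing_wh = Warehouse("marketing_wh", nodes=1, storage=storage)

# Resizing one warehouse touches neither the data nor the other warehouse.
product_wh.nodes = 8
print(product_wh.count_rows("customers"), marketing_wh.count_rows("customers"))
print(marketing_wh.nodes)
```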

Because Snowflake does not tightly couple storage, compute, and database services, it can dynamically modify configurations and scale resources up or down independently. Snowflake’s architecture also makes it possible to handle all your data in one system, so you don’t need to use specialized databases for different data formats.

Snowflake is also capable of automatically adapting resources to a particular usage scenario. So, users no longer need to manually manage resources.

Snowflake offers native support for semi-structured data

Snowflake also offers native support for common semi-structured data formats without compromising completeness, performance, or flexibility.

Relational databases assume that all data records consistently adhere to a set of columns that are defined by the database schema. This static data model offers advantages such as indices and pruning but breaks down when incoming data records don’t follow a defined database schema. 

Today, a large share of business data is generated automatically by machines, such as applications, sensors, and mobile devices, in semi-structured data formats like JSON and XML. Traditional databases often can’t handle these data records because they do not follow a specified database schema.

To deal with these limitations, data teams force-fitted semi-structured data into a schema. But this approach resulted in the loss of information and flexibility. Also, adding new fields to the schema caused the existing data pipelines to misbehave. As an improvement to this, some databases began to treat semi-structured data as a special complex object. But users could not easily search, index, or load these special objects. So, even this approach led to performance tradeoffs.
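A small example makes the information loss visible. The three-column schema and the incoming record below are hypothetical; the sketch simply projects a JSON-like record onto fixed columns, the way a rigid pipeline would, and reports what falls on the floor.

```python
FIXED_COLUMNS = ("user_id", "event", "ts")  # hypothetical fixed schema


def force_fit(record: dict) -> tuple:
    """Project a JSON-like record onto fixed columns; extras are dropped."""
    return tuple(record.get(col) for col in FIXED_COLUMNS)


def dropped_fields(record: dict) -> set:
    """Top-level fields that have no column to land in."""
    return set(record) - set(FIXED_COLUMNS)


incoming = {
    "user_id": 42,
    "event": "purchase",
    "ts": "2022-09-27T10:00:00Z",
    "cart": {"items": 3, "coupon": "FALL22"},  # not in the schema
}

print(force_fit(incoming))
print(dropped_fields(incoming))
```

Here the whole `cart` object is lost. Keeping it would mean changing the schema, which is exactly the kind of change that tends to break downstream pipelines.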

Snowflake’s VARIANT data type allows users to store semi-structured data records in their native form inside a relational table. This schema-less storage option works for JSON, Avro, XML, and Parquet data records, so users can load semi-structured data directly into Snowflake without defining a schema, losing information, or creating performance lags.
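To get a feel for schema-less storage, the sketch below emulates in plain Python what querying a VARIANT column looks like: records with different shapes sit side by side, and asking for an attribute that a record lacks yields an empty value rather than an error. The `get_path` helper and the sample records are invented for the example. In Snowflake itself, the comparable query would use path notation such as `SELECT payload:user.plan::STRING FROM events`, where `events` and `payload` are hypothetical table and column names.

```python
import json


def get_path(doc, path: str):
    """Follow a dotted path through nested dicts; None if any key is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Two records with different shapes stored side by side, as raw JSON text.
rows = [
    '{"user": {"id": "u1", "plan": "pro"}, "event": "login"}',
    '{"user": {"id": "u2"}, "event": "purchase", "amount": 19.99}',
]
docs = [json.loads(r) for r in rows]

# Missing attributes come back as None, much like NULL in SQL.
plans = [get_path(d, "user.plan") for d in docs]
amounts = [get_path(d, "amount") for d in docs]
print(plans, amounts)
```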

Snowflake also automatically discovers the attributes of semi-structured data. It identifies similar attributes across records and organizes those attributes in a way that provides better compression and data access.
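The following toy version of attribute discovery shows the general idea. It is not Snowflake's internal algorithm: the 50% threshold and the sample records are assumptions. It finds attributes that appear in most records and pivots them into per-attribute value lists, which is the kind of layout where repeated values compress well and a query can read one attribute without scanning the rest.

```python
from collections import Counter


def discover_attributes(records, min_share=0.5):
    """Return top-level attributes present in at least min_share of records."""
    counts = Counter(key for rec in records for key in rec)
    threshold = min_share * len(records)
    return sorted(k for k, n in counts.items() if n >= threshold)


def to_columns(records, attributes):
    """Pivot the chosen attributes into per-attribute value lists."""
    return {a: [rec.get(a) for rec in records] for a in attributes}


records = [
    {"event": "login", "device": "ios"},
    {"event": "purchase", "device": "web", "amount": 19.99},
    {"event": "login", "device": "ios"},
]
common = discover_attributes(records)
columns = to_columns(records, common)
print(common)
print(columns["device"])
```

In this sketch, the rare `amount` attribute is not pivoted and would stay with the raw record.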

Load customer data into Snowflake automatically with mParticle

Snowflake is a massively parallel processing (MPP), pay-per-usage cloud data warehouse that takes full advantage of the cloud. However, loading customer data from your website, app, and other customer touchpoints into Snowflake using manual data pipelines will still cost you significant engineering resources. And if the structure of the data being loaded into your Snowflake instance isn’t consistent with the data in other systems, such as marketing and analytics tools, it will become more difficult to use data effectively.

mParticle makes it easy for you to collect customer data from client-side sources, such as mobile apps and websites. It also helps you to automatically and instantly load that data into Snowflake as JSON files. You can also use mParticle’s event filters to determine which events and attributes get sent to Snowflake — taking maximum advantage of Snowflake’s pay-per-use pricing to save costs.
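Conceptually, an event filter is an allowlist applied before data is forwarded. The sketch below is not mParticle's API (filters are configured in the mParticle product itself); the event shape, the allowlists, and the `filter_event` function are hypothetical and only illustrate how filtering reduces what gets loaded.

```python
ALLOWED_EVENTS = {"purchase", "signup"}             # hypothetical allowlist
ALLOWED_ATTRIBUTES = {"user_id", "amount", "plan"}  # hypothetical allowlist


def filter_event(event: dict):
    """Return a trimmed copy of the event, or None if it should not be sent."""
    if event.get("name") not in ALLOWED_EVENTS:
        return None
    attrs = {k: v for k, v in event.get("attributes", {}).items()
             if k in ALLOWED_ATTRIBUTES}
    return {"name": event["name"], "attributes": attrs}


events = [
    {"name": "purchase", "attributes": {"user_id": 1, "amount": 9.5, "debug": "x"}},
    {"name": "scroll", "attributes": {"user_id": 1, "depth": 80}},
]
forwarded = [e for e in map(filter_event, events) if e is not None]
print(forwarded)
```

Only the purchase event is forwarded, and without its `debug` attribute, so less data is stored and processed in the warehouse.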

Try mParticle for free and automatically load data into your Snowflake warehouse.
