Snowflake vs. BigQuery: What are the key differences?
Learn more about the differences between two popular data warehouse solutions, Snowflake and Google BigQuery, and understand how to identify which is right for your team.
Some inputs and outputs just don’t scale linearly.
Today, his observations ring true across many enterprise-level data operations. Beyond a certain scale and complexity, data teams get bogged down with every additional resource provisioning, server monitoring, and data downtime error debugging task that needs to be done. To avoid this predicament, increasingly more data teams are ditching their on-premise infrastructure in favor of fully managed, near-zero-maintenance cloud data warehouses such as Snowflake and Google Cloud BigQuery.
Both Snowflake and BigQuery offer cost-effective, elastically scalable, low-latency cloud data warehouse solutions to meet a wide range of business needs. Yet, differences in architecture, pricing, performance, and security between the two solutions can impact engineering resources. So, make an informed buying decision after thoroughly evaluating what both these warehouses offer.
Snowflake and BigQuery have different, but equally scalable, architectures
Snowflake and BigQuery both offer decoupled storage and compute resources that can elastically scale to meet all your mission-critical workload requirements. But they also have slightly different architectures.
Snowflake offers the best of both shared-disk and shared-nothing architectures
Snowflake has a three-layer architecture. At its core is the data storage layer. On top of that comes the query processing compute layer that uses virtual warehouses to process queries. And lastly, the cloud services layer helps users manage infrastructure, authentication, and query processing services.
- Centralized storage layer: Uses cloud storage infrastructure from platforms such as AWS, Microsoft Azure, and GCP to store all your data records persistently, without any losses.
- Multi-cluster compute layer: Contains multiple massively parallel processing (MPP) compute clusters that are capable of quickly processing all your complex queries.
- Cloud services layer: This layer is a collection of different services, such as query management, optimization, transactions, security, metadata, and sharing.
Snowflake’s unique three-layered architecture offers the data management simplicity of a shared-disk architecture along with the scalability of a shared-nothing architecture.
BigQuery’s serverless architecture uses Dremel, Colossus, Jupiter, & Borg
BigQuery is built on top of a query engine called Dremel. This scalable, interactive, ad-hoc query system breaks up your complex queries into a nested tree of smaller query units. Dremel then processes these smaller queries individually and reassembles the corresponding results back into one coherent result.
In addition to the query processing Dremel layer, BigQuery also makes use of:
- Colossus: Google’s distributed file system that automatically handles data replication and recovery. Colossus uses a columnar storage format and compression algorithm to store all your data records as efficiently as possible, with the least number of disk I/O read and writes.
- Jupiter: Google’s networking infrastructure that powers all its internal distributed computing and storage systems. The datacenter network is powerful enough to “read the entire scanned contents of the Library of Congress in less than 1/10th of a second.”
- Borg: Google’s large-scale cluster management system that is capable of running thousands of queries from several different applications simultaneously in a distributed manner across many virtual clusters.
BigQuery also supports semi-structured and unstructured data formats such as CSV, JSON, Datastore backups, AVRO, and Parquet.
Both cloud data warehouses have different pricing models. Snowflake calculates pricing based on your execution time, while BigQuery calculates pricing based on the virtual CPU slots that were consumed.
Snowflake calculates compute resource credits on a per-hour basis
Snowflake offers both on-demand and pre-paid compute resources. On-demand compute resources are priced between $2.50 and $5 per credit, based on your Snowflake plan. Pre-paid compute resource prices are dynamically determined as and when an order is placed.
Snowflake automatically allocates compute resources and power based on the virtual warehouse size that a user selects. The virtual warehouses come in a range of different sizes:
- XS-sized virtual warehouses that use 1 credit per hour
- S-sized virtual warehouses that use 2 credits per hour
- M-sized virtual warehouses that use 4 credits per hour
- L-sized virtual warehouses that use 8 credits per hour
- XL-sized virtual warehouses that use 16 credits per hour
- XXL-sized virtual warehouses that use 32 credits per hour
Because Snowflake’s on-demand pricing model is calculated on an hourly basis, it allows users to scale compute power and size as needed without over-provisioning resources. This pricing model suits teams that need to process complex and resource-intensive workloads over short periods of time.
In addition to compute resources, Snowflake charges you for storage separately. On-demand storage is priced between $35 and $46 per TB based on your location, while pre-paid storage comes cheaper at $23 to $25 per TB of storage.
For more accurate and up-to-date pricing information, visit Snowflake’s detailed pricing guide.
BigQuery’s pricing is based on the data scanned and virtual CPU units used
BigQuery also offers both on-demand and flat-rate compute pricing options. On-demand compute resources are priced at $5 per TB of data that gets scanned from user queries. This on-demand pricing includes the power of up to 2,000 BigQuery virtual CPUs.
However, users with more large and stable workloads can also opt to reserve a fixed number of BigQuery slots (virtual CPU units), on a monthly basis, for a flat rate. Flat-rate pricing begins at $2,000 a month for 100 slots. But irrespective of usage, users will get charged this flat fee as long as the plan is active.
BigQuery Reservations are more useful for data teams that need to consistently process large workloads that scan many hundreds of terabytes of data. Only at this scale does the flat-rate pricing outweigh the flexibility and compute power of the on-demand plan.
For example, a user that chooses to avail 100 BigQuery slots at a flat rate will have significantly lower compute power to process complex queries. That’s because the flat rate plan will allow only a maximum of 100 virtual CPUs, while the on-demand plan comes loaded with the power of 2,000 CPUs.
BigQuery also offers two types of storage pricing. Data records that have been modified over the last 90 days are categorized as active storage and are priced at $0.023 per GB. Data records that haven't been modified over at least 90 days are categorized as long-term storage and are priced at $0.016 per GB.
For more detailed pricing information, visit BigQuery’s pricing page.
When testing a TPC-DS workload, research firm GIGAOM found that Snowflake consistently performed better than BigQuery.
For most business needs, these query performance differences won’t matter too much.
However, if you run mission-critical workloads and require real-time analytics, then you may want to test your live workloads on both Snowflake and BigQuery. Such a test can help you determine which data warehouse is better suited for the needs of your business.
Both Snowflake and BigQuery offer a wide range of security features to authenticate users, improve data security and reinforce network security.
For authentication, both Snowflake and BigQuery support key-pair authentication, multi-factor authentication (MFA), single sign-on (SSO), and OAuth 2. These features allow you to authenticate users without requiring them to share, enter, or store any credentials.
For improved data security, both Snowflake and BigQuery use AES-256 strong encryption standards to encrypt all your data records.
When it comes to reinforcing network security, Snowflake limits virtual private networking only to users that have subscribed to a Virtual Private Snowflake edition. But BigQuery allows all GCP users to configure a virtual private network using GCP Virtual Private Cloud Service Controls.
Snowflake offers native support for all leading cloud platforms, such as Amazon AWS, Microsoft Azure, and Google Cloud Platform. BigQuery only offers native support for the Google Cloud Platform. However, BigQuery users have the option of using BigQuery Omni to run multi-cloud workloads from AWS and Azure on Google Datastore.
Load data into Snowflake and BigQuery without building ETL pipelines
Replacing your on-premise infrastructure with a modern cloud data warehouse opens up new opportunities to analyze your customer data for actionable insights. But, to harness your data, you’ll need to first ingest data from various customer touchpoints such as websites, mobile apps, points of sale, third-party tools, ad platforms, and social media sites into your warehouse.
One option is to allocate more engineering resources and manually build ETL data pipelines, one by one, using a tool like Snowplow or Fivetran. But, as explained at the start, adding more engineers will also lead to more communication overheads, coordination efforts, and resource dependencies than what it is worth. In addition to that, any changes to your business needs, workflows, UX, or third-party tools may cause these manual pipelines to misbehave. So, you’ll need even more engineering resources to solve such secondary problems.
The other option is to use an event tracking solution like mParticle. Instead of manually ingesting, transforming, and loading data into each pipeline, you can use mParticle’s native SDKs and APIs to collect data across all your customer touchpoints. mParticle allows you to collect event data from customer touchpoints through a single SDK and export it to Snowflake and/or BigQuery without any additional development work. For example, here's a walkthrough of how to connect mParticle to Snowflake.
To see what it's like to collect and connect data with mParticle, you can explore the mParticle platform demo here.
Latest from mParticle
How Marks & Spencer drove revenue growth with mParticle's real-time segmentation and personalization
mParticle’s real-time data insights and segmentation fuel Joe & The Juice loyalty program growth