Why Azure customers should consider Snowflake as a Data Warehouse?
Snowflake deployment on Azure
Snowflake, the data warehouse built for the cloud, has announced that Azure support has been released as a public preview. So far, Snowflake support for Azure was limited to data ingestion from Azure Blob Store. Now it is possible to start a Snowflake data warehouse by choosing between Amazon AWS and Azure.
Data Warehouse modernization goals
Before we describe the benefits of a cloud data warehouse over existing options on Azure, let’s take a step back and imagine an end-to-end analytics technology stack. A perfect cloud analytics environment would meet the following expectations:
- All data analyzable - All internal and relevant external data is present in our cloud Data Lake.
- All user groups satisfied - Every user group should have freedom to use the data they need and choose the tools they like. This means sharing data between Business Units, letting power users dive into data or empowering a Data Science team.
- Short time to insights - The biggest obstacle in analytics is the data complexity. All those extra processing steps (ETL) needed to prepare data. What about simplicity and data reuse between systems?
- Cost under control - A transition to the cloud always raises concerns about uncontrolled billing or just paying for resources that are not utilized properly.
Traditional Data Warehouse limitations
A traditional on-premise data warehouse has a monolithic architecture. It is just a big server or a cluster of machines that is connected to a storage array. The environment requires a lot of maintenance activities like backup, vacuuming or tweaking to keep working. The obvious limitations are:
- Limited storage - the data warehouse servers are connected directly to the storage
- Complex maintenance - delays any requests from users (time to insights)
- Inflexibility to scale - not possible to scale-up in response to peak load
Data warehousing options on Azure
It is always possible to run a traditional database in the cloud but we would only avoid infrastructure maintenance challenges. Data warehouses that have their roots in on-premise solutions are still affected by many legacy limitations. Let’s take a look at some of the options that we have on Azure:
- SQL Azure - storage is strictly attached to a database instance (two instances of different size cannot share data), also “pay-as-you-go” just means “pay a flat fee for 24h availability.”
- Azure SQL DW - despite a different technology that is based on a legacy PDW appliance, it still shares similar limitations of SQL Azure. For example, the table partitioning must be defined in advance which just increases complexity.
- HDInsight - at least it is possible to separate compute nodes from storage (Azure Blob), but it is not possible to start and stop a cluster in just a few seconds. Additionally the administrator must face the maintenance nightmare of Hadoop.
Cloud Data Warehouse principles
Finally, we can define the principles of a true, cloud-first data warehouse that is not affected by on-premise legacy limitations:
- Separate compute and storage - Storage is separated from compute instances. Both compute and storage layer may scale up and down independently. It is also possible to share the same data between different compute engines. Like two HDInsight instances sharing data through Azure Blobs.
- Pay only for actual use - Pay only for compute instances when they are working, suspend them a few minutes after the last query but reactivate for the next query so quickly that a user would not notice a difference. That just means: apply serverless principles to data warehousing.
- No maintenance - No need to perform manual maintenance operations (at an infrastructure level) but also no need to tune the performance or decide about the partitioning structure. The data warehouse should adapt to the load and reorganize the data as needed.
Snowflake cloud Data Warehouse architecture on Azure
Snowflake was defined from the ground up as a cloud data warehouse that can benefit from all features of the cloud and meets all the principles above. Let’s look at the Snowflake architecture on Azure and whether it satisfies all the above principles.
A Snowflake instance is deployed as a SaaS data warehouse. It is not available as a product from Azure Marketplace deployed in the customer’s Azure Subscription. Simply visit on this link and begin a trial. The setup process is described in another article.
Once a Snowflake instance is deployed, you will receive your private address specific to your account. The top three layers (Load Balancer, Cloud Services and Metadata Store) are provided for free and only the Virtual Warehouses (compute nodes) and the Blob Store are paid resources. There is only a minimum monthly charge of $25.
Snowflake stores data in an optimized columnar format in Azure Blob Store, but the files are not directly accessible. Instead a Snowflake instance may be connected to multiple Azure Blob Containers from which data may be easily ingested. For example, the whole Azure Blob Container that is a data store for a Hadoop cluster (a Data Lake) may be registered as an “external stage” in Snowflake.
Snowflake architecture: storage, compute, services
The following diagram shows the dependency between compute and storage layers:
Snowflake layers serve the following purposes:
- Centralized storage - The storage is shared across the whole Snowflake instance.
- Virtual Databases - Multiple independent compute engines of various sizes are defined and run only as long as new SQL statements are received from users.
- Virtual Database per workload - It is possible to create different compute engines (called Virtual Databases) for different workload or a different user group. A Virtual Database dedicated for slow nightly data loads may be just a small (2 credits / hour) instance. At the same time Data Scientists (who execute a few big queries a day) may use a separate Virtual Database that is a 2X-Large instance, but just runs a few predictive queries a day.
- Services - Management, transaction, security or metadata services are taking care of the whole environment.
Snowflake data warehouse benefits on Azure
Finally, when does it make sense to consider Snowflake as an important part of the Azure analytics toolbox? Let’s look back at the expectations for a smart analytics environment and how they are solved with Snowflake:
- All data available in a Data Lake - Snowflake can consume all popular data formats including CSV, ORC, Parquet, Avro, XML and JSON. When the data is already aggregated in a Data Lake, Snowflake may be a very cost effective (pay when a query runs) alternative to both Azure SQL DW or HDInsight
- Different types of workloads and user needs - A separate Snowflake Virtual Database may be created and handed over to a different user group that needs to access data owned by a different Business Unit or just perform some ad-hoc queries without affecting production usage.
- Short time to insights - Snowflake has a simple web interface, supports a full ANSI-SQL language and may easily consume external data. Additionally no need to define table partitioning makes it possible to prepare datasets for analytics or reporting faster than before.
- Pay as you go - This feature is taken to a second level in Snowflake. When your data warehouse (even Azure SQL DW) faces irregular utilization at daily (data marts not queried at night), weekly (not used during weekends) or monthly intervals, the “pay only when compute layer is active” feature would be more cost effective.
What can you lose if you deploy Snowflake? Just a few minutes to set up the environment, 2 hours of learning and maybe a few dollars for the environment if you don’t use it actively. For sure a monthly cost of a suspended Snowflake instance is negligible when compared to the smallest Azure SQL DW instance DW1000c that costs over $8000 a month.