How to integrate Snowflake with a Data Lake on Azure Blob Store
Snowflake support for Azure Blob Store
Snowflake, the data warehouse built for the cloud platform is now available on Microsoft Azure. So far, it was possible to register a connection from Snowflake to Azure Blob Store, however any data transfer would be routed over public internet between Azure and AWS clouds.
Many organizations have centralized their data assets in a Data Lake. This article shows how to connect a Snowflake cloud data warehouse to a Data Lake that stores flat files on Azure Blob Store. The Blob Store may be reused between other Azure platforms like HDInsight or Azure SQL Data Warehouse. Snowflake supports all popular file formats for semi-structured and unstructured files like CSV, Avro, Parquet, ORC, JSON and XML. The files may be also compressed using and the compression algorithm, which is automatically detected by Snowflake during data load.
Snowflake integration with a Data Lake on Azure
The following diagram shows the architecture that will be explored in this article. The “External Stage” is a connection from Snowflake to Azure Blob Store that defines the location and credentials (a Shared Access Signature).
Snowflake access rights to Azure Blob Store
In the first step, a Shared Access Signature token must be created for the Azure Blob Store that will be accessed by Snowflake.
> Log in to the Azure portal and from the home dashboard, choose Storage Accounts » storage_account » Settings » Shared access signature.
> Define the access rights: Only Read and List permissions are necessary when Snowflake will be using the Azure Blob store in read-only mode. However, we may also be able to grant write operations to enable data uploading from Snowflake back to Azure Blob Store.
> The allowed resource types that are highlighted must all be enabled. This is a major difference between the current Snowflake documentation and the version created before Snowflake was offered on Azure.
> Next, specify start and expiry dates & times for the SAS token. As part of a general security plan, you could generate a different SAS token periodically.
> Leave the Allowed IP address field blank and specify either ‘HTTPS only’ or ‘HTTPS and HTTP’ under Allowed protocols.
> Next, click the ‘Generate SAS’ button. Record the full value in the SAS token field, starting with and including the ?. This is your SAS token. You will specify this token when you create an external stage using CREATE STAGE.
Database creation on Snowflake
Before data from a Data Lake will be loaded to Snowflake, a database must be created. Go to the Databases section and click “Create”.
CSV file format definition on Snowflake
Before we load data from the Azure Blob Store, we should define a format and title for files that will be loaded. For the example in this article, we are using sample files from Dell DVD Store Version 2 sample database, which is distributed as flat CSV files.
The command to create a file format is shown before, but it must be executed in the context of our newly created ‘demo’ database.
CREATE FILE FORMAT azure_csv TYPE=CSV
Snowflake integration with Azure Blob Store using an external stage
Create an external stage that specifies an Azure storage account, the blob container and a Shared Access Signature. Use the Shared Access Token that was generated on the Azure portal.
CREATE OR REPLACE STAGE azure_blob_stage
file_format = AZURE_CSV;
External data queries on Snowflake
At this moment, an external stage should be defined, and it is possible to query files stored in the Data Lake. Columns from source files are identified by their position in the CSV file. For example, $3 references the third column. We can query individual files by using a full path to a file after the @external_stage_name token, but it is also possible to specify a regular expression for folders and files.
SELECT t.$1 as prod_id, t.$2 as category, t.$3 as title,
t.$4 as actor, t.$5 as price,
t.$6 as special, t.$7 as common_prod_id
FROM '@azure_blob_stage/ingress/products/prod.csv' AS t;
ANSI SQL support in Snowflake
Snowflake is highly compatible with ANSI SQL standard so creating tables is quite easy. Because Snowflake takes care of storage and partitioning, we don’t need to specify too many parameters in the CREATE TABLE statement.
Create a table that will be used as the target of a COPY INTO statement.
CREATE TABLE PRODUCTS
PROD_ID NUMBER(9,0) NOT NULL,
CATEGORY NUMBER(3,0) NOT NULL,
TITLE VARCHAR(50) NOT NULL,
ACTOR VARCHAR(50) NOT NULL,
PRICE NUMBER(18,2) NOT NULL,
COMMON_PROD_ID NUMBER(9,0) NOT NULL
Data loading from flat files to Snowflake
Now we can copy (load) data from a file located on Azure Blob Store using a COPY INTO command:
COPY INTO PRODUCTS FROM '@azure_blob_stage/ingress/products/prod.csv'
Data transformation may be applied by wrapping a nested SELECT statement inside the FROM clause. Our sample files were using ‘/’ as a date field separator instead of ‘-’ which is a default in Snowflake. The values in the second column must be changed from ‘2018/07/17 14:20:30’ to ‘2018-07-17 14:20:30’, which we do using a standard SQL function REPLACE.
COPY INTO ORDERS FROM
(select t.$1, replace(t.$2, '/', '-'), t.$3, t.$4, t.$5, t.$6
from '@azure_blob_stage/ingress/orders/jan_orders.csv' as t)
Finally we can test our tables loaded to Snowflake by creating a simple dashboard in Power BI. The ODBC drivers for Snowflake are available in the Help section on the Snowflake account portal. It is a “? Help” icon at the top right corner.
Power BI support dashboards on Snowflake in DirectQuery mode
Start the Power BI Desktop and select “Snowflake” from the list of available data sources. Provide the account name (which is the full hostname of the Snowflake instance) and the Virtual Data Warehouse (the compute cluster).
Expand the demo database and select tables that will be used in the data model for the dashboard.
Power BI supports joins and DirectQuery mode for Snowflake, so it is possible to create a complex data model that joins multiple tables.
The final dashboard that we created is shown below:
Snowflake benefits for Business Intelligence on Azure
Snowflake offers a cost effective (pay-as-you-go) alternative to other SQL data warehouses available on Azure. Snowflake is a data warehouse that is based on a serverless architecture, so the customer pays only for the time that the warehouse is processing queries. The data warehouse automatically suspends execution a few minutes after the last query.
Given the great integration with Azure Blob Store and support for all popular data types, Snowflake is a natural data warehousing solution that should be considered on Azure.
How BigData Dimension can help you with Snowflake
BigData Dimension is a software company that specializes in Snowflake deployments, data warehouse modernization and Business Intelligence. Schedule a meeting to talk with our experts today, we are eager to share our experience and best practices in data warehouse modernization using Snowflake.