Traditional data warehouses faced challenges with storing large volumes of data and providing faster and more efficient analytical queries. Things are changing with modern cloud warehouse solutions where the storage and computations are decoupled and independently managed. The data volume is increasing every day and the type of data varies depending on the source generating the data whether structured, semi-structured, or unstructured. To support such a massive amount of data and at the same time provide acceptable analytical computations in a timely fashion, cloud warehouse solutions like Snowflake are getting popular.
Snowflake provides a modern cloud data warehousing solution delivered as SaaS. The Snowflake data warehouse can be hosted on multiple cloud platforms:
- Amazon Web Services
- Microsoft Azure
- Google Cloud Platform
As I discussed in my previous blog about Azure Synapse Analytics, Snowflake is also another data cloud platform that provides data ingestion, storage, computation, and analytics. Snowflake has the functionalities of any enterprise-grade warehouse and analytics engine and a lot more unique capabilities like decoupling data storage and computation.
Snowflake Architecture
Snowflake runs on cloud and hence it manages all aspects of software installation and updates. It uses virtual compute instances for any computation needs and a storage service to persist data. Let’s review the components of Snowflake architecture.
- Ingestion
- Storage
- Transformation
- Cloud Services
- Delivery
Ingestion
This is one area where Snowflake seems to be lacking native data ingestion technologies. There are a few data pipeline options including the popular COPY command, Snowpipe, or the Kafka Connector. Also, there are third-party vendors supporting data ingestion to Snowflake. You can see the best features of Snowflake once you get the data over there.
Storage
All kinds of unstructured, semi-structured, and structured data after ingestion into Snowflake are centrally stored together in columnar format. Several architectural designs can be implemented to support the need including data lakes, a warehouse, and a potential new feature to read data directly from on-prem solution (in preview as of this blog date). The cool thing about Snowflake being a completely managed service is it will handle the compression, partitioning, time travel, and ACID compliance by itself without us interfering. Of course, who does not like Time Travel?
Transformation
All the data transformation and processing happen in the processing layer and uses a virtual warehouse. Each warehouse is a compute node that can have multiple nodes in the compute cluster and support Massively Parallel Processing (MPP). Each warehouse node in the compute cluster has its own resources and therefore there is no performance impact on one node due to the other nodes.
Virtual warehouses in Snowflake are available in two flavors:
- Standard warehouse
- Snowpark-optimized warehouse
Snowpark-optimized warehouses are preferred for workloads that require large memory. Warehouses are also defined by their size where the size actually specifies the compute resources per cluster. The size of the warehouse can range from X-Small to 6X-Large.
Cloud Services
This is the layer in Snowflake that coordinates all the activities including user authentication, query parsing, optimization, etc. Different services that are controlled in this layer are:
- Performance Optimization
- Infrastructure & Metadata Management
- High Availability
- Security & Governance
- Sharing & Collaboration
Delivery
Once the data is curated inside the warehouse, the data can be made available for consumption. Several BI tools including Power BI/Tableau can connect and build powerful visualizations. The same data can be fed to AI/ML models to generate more insights into organizational data. Snowflakes allows a very secure way to share data with partners across clouds and regions.
How to connect to Snowflake
There are several ways to connect to Snowflake services.
- Snowsight – A web interface to access and manage Snowflake services with SQL support.
- SnowSQL – A command line client to access and manage Snowflake including executing SQL queries, performing DDL/DML operations, and loading data.
- Third-party tools – With certified partners and integrated 3rd-party tools, Snowflake’s ecosystem provides several options to access its services.
- ODBC and JDBC drivers can be used to connect to Snowflake.
- Native connectors like Python and Spark can be used to connect to Snowflake
- Visual Studio Code extension for Snowflake
Snowflake Editions
Depending on your organization’s need for Warehousing and Analytics, Snowflake offers several editions to fit the usage. Each higher-end edition includes all the features from the lower edition and also adds more features. As the usage of the organization grows, change in Snowflake Edition is trivial. Also, remember the edition you choose determines the costs for your organization.
- Standard Edition – Introductory level
- Enterprise Edition – Everything Standard + features to support larger organizations
- Business Critical Edition – Everything Enterprise + higher level of data protection
- Virtual Private Snowflake – Everything Business Critical + isolated from other accounts and provides the highest level of data security