Azure SQL Data Warehouse and Its Best Practices
Over the last few years, data warehouse architecture has seen a huge shift towards cloud-based data warehouses and away from traditional on-site warehouses. Earlier, huge investments in IT resources were required to set up a data warehouse to build and manage a designed on-premise data center. Today, data warehousing functions are offered as a service (DWaaS) that are accessible via an internet connection. This has negated the costly capital expenditure and management that was earlier required for an on-premise data warehouse.
In this blog, we will talk about Azure SQL Data Warehouse, which is one of the modern data warehouses that is grabbing the attention of companies worldwide.
What is Azure SQL Data Warehouse?
Azure SQL Data Warehouse is a Platform as a Service (PaaS) offering from Microsoft which helps users to create an Enterprise Data Warehouse (EDW) by leveraging powerful Massively Parallel Processing (MPP) architecture.
Azure SQL Data Warehouse can quickly run complex queries across petabytes of data across a distributed system called nodes. Along with its capability of processing massive volumes of relational and non-relational data, its compute resources are massively scalable and can be paused in seconds.
Massively Parallel Processing (MPP) Architecture & Its Components
Components of the MPP Architecture
Another interesting thing about Azure SQL Data Warehouse is that you can migrate your data to Azure SQL Data Warehouse in 3 simple steps executed in preparation, metadata migration, and data migration stages.
But, why do you need to migrate to Azure SQL Data Warehouse? What are the benefits associated with it?
Why Migrate to Azure SQL Data Warehouse?
One of the key features of Azure SQL Data Warehouse is that it enables you to use your traditional business intelligence skills to build a data warehouse in Azure. This comes with the unmatched compute power in the cloud. Some of the other features built into Azure that can be leveraged by creating an Azure SQL Data Warehouse include:
- A low-cost solution with scalability that enables users to pause or resume the databases within minutes.
- A high-performance boost and the ability of globalization.
- The ability to run up to 128 concurrent queries at one time using the Massive Parallel Processing with increased performance than you could ever get with a traditional on-prem SQL Server.
- Smoothly create your jobs and build analytics along with any native connectivity with data integration and visualization services.
- Guarantee of 99.9% availability with fully managed service, so you don’t need your DBAs to take care of any of this.
- Built-in advanced security like encryption, connection security, authentication, and authorization.
- Built-in intelligent caching that helps you accelerate access to your data and query performance.
The availability of cloud data warehouses makes data warehousing much simpler, convenient, and accessible to a broader range of businesses. However, before you rush into the set up of your cloud data warehouse, you must understand that managing a data warehouse in the cloud comes with a whole new set of challenges, regardless of whether you’ve managed an on-premise setup before.
This is where our experienced Azure data engineers have been of help for our clients who wanted to switch from on-prem to a cloud-based data warehouse. On-premise data warehouses are increasingly costly. Our team helped them migrate to Azure SQL Data Warehouse and mitigate challenges involved during and post-migration.
In certain scenarios where high throughput was required to pull petabytes of data from our client’s database for reporting. We used the MPP architecture in Azure SQL Data Warehouse that provides massive performance in query execution for reporting. We could help the client to get quick insights because Azure SQL Data Warehouse’s performance is comparatively better than a normal database’s performance.
Best Practices of Azure SQL Data Warehouse
To enable you to better prepare for managing a data warehouse in the cloud, below are the best practices of Azure Data Warehouse that you must consider.
1. Performance with Proper Distribution
Distribution of tables is one of the powerful features which contributes to the performance of the queries on the compute nodes. There are three types of distribution in Azure SQL Datawarehouse –
- Hash Distribution tables: A hash distributed table can deliver the highest query performance for joins and aggregations on large table columns. The hash function uses the values in the distribution column to assign each row to a distribution.
- For example – A unique column is always preferred for hash distribution, resulting in optimum performance. However, for large tables, it is recommended to perform hash distribution. Also when you are joining two tables, the distribution of both the tables should be on the same column for best performance.
- Round-Robin tables: A round-robin distributed table distributes data evenly across the table, but without any further optimization. A round robin table is fast for storing data, but query performance can often be improved with hash distributed tables. Joins on round-robin tables require reshuffling data which can degrade the performance of the query. In scenarios when you need to store data in tables but none of the columns are qualified for the distribution, round-robin tables method can be used.
- Replicated tables: It is a table that is replicated caches a full copy of the table on each computes node. However, this reduces the DMS operation. Replicated tables are best utilized and only recommended for small tables. For example, a lookup table can be replicated for faster availability in joins and aggregations.
2. Selection of Data Warehouse Unit (DWU’s)
Azure SQL Data Warehouse CPU, memory, and IO are bundled into units of compute scale called Data Warehouse Units (DWUs). A change to your service level alters the number of DWUs that are available to the system, which in turn adjusts the performance and the cost of your system.
SQL Data Warehouse has two generations – Gen1 and Gen2. We recommend Gen2 to leverage the best performance from SQL Data Warehouse as Gen2 provides 2.5x more memory per query than the Gen1, enabling you to scale up to 30,000 Data Warehouse Units and also have unlimited columnar storage.
Both generations measure DWUs as below:
- Gen1 data warehouses are measured in Data Warehouse Units (DWUs).
- Gen2 data warehouses are measured in compute Data Warehouse Units (cDWUs).
Parameters which needs to be checked before selecting a DWUs are below :
- Number of compute nodes required for the processing
- Number of distribution per compute node
- Memory required per data warehouse
- Maximum concurrent queries to be executed (Consider execution sharing among resource classes like small, medium, large, etc. and small resource class.)
3. Fast Data Uploading Strategy
Here are the 4 ways to upload data faster –
- Push prepared data in Azure Blob Storage
- Pull data into External table (polybase)
- Move data by some transformation into staging tables
- Merge the data of staging with the production tables
You may also consider Clustered Columnstore Indexes for production tables to enable compression with fast query execution.
4. Reduce Cost by Writing an Optimized Query
Write an optimized query to pull data. Also, the query should be memory efficient with minimum execution time. A wrong query method can spike the billing of the data warehouse. Also, you must select only the required columns to save memory.
For example, if two tables need to be joined, they must be joined on the same distribution key with proper filters, otherwise, the query execution may go under suspended/running status for a long time which will lead to additional costs.
5. Use Smaller Resource Class to Increase Concurrency & Larger for Query Performance
- More parallelism can be achieved by using smaller resource class user in any applications. A large, time-consuming query should not be used by this resource group.
- Certain queries like large joins, indexing, heavy loads on clustered columnstore tables, etc. will benefit from larger memory allocations which are available with the large resource group. A large resource group user should not be used in every scenario as it would lead to a queue of the other queries that will be blocked until the large resource group user query gets executed.
6. Use CTAS (Create Table As Select) to Enhance Query Performance and Joins
CTAS is one of the most important T-SQL features available in Azure SQL Data Warehouse. CTAS is a parallel operation that creates a new table based on the output of a SELECT statement. It is the simplest and fastest way to create and insert data into a table with a single command.
For example – To join multiple tables which have separate distributions, CTAS can be used to create an intermediate table with the distribution of the third table and then this intermediate table can be joined with the third table to optimize the performance.