Product Profile and Evaluation: Amazon Redshift, Microsoft Azure SQL Data Warehouse, Google BigQuery, and Snowflake Data Warehouse
1 Cloud Analytics Platform Offerings
Big data analytics platforms load, store, and analyze volumes of data rapidly, providing timely insights to businesses. Data-driven organizations leverage this data, for example, for advanced analysis to market new promotions, as operational analytics to drive efficiency, or for predictive analytics to evaluate credit risk and detect fraud. Customers are leveraging a mix of relational analytical databases and data warehouses to gain insights.
This report focuses on relational analytical databases in the cloud, because deployments are at an all-time high and poised to expand dramatically. The cloud enables enterprises to differentiate and innovate with these database systems at a much quicker pace than was ever possible before. The cloud is a disruptive technology, offering elastic scalability vis-à-vis on-premises deployments, enabling faster server deployment and application development, and allowing for less costly storage. For these reasons and others, many companies have leveraged the cloud to maintain, or gain, momentum as a company.
This report outlines the results from a GigaOm Analytic Field Test derived from the industry standard TPC Benchmark™ DS (TPC-DS)1 comparing Amazon Redshift, Azure SQL Data Warehouse, Google BigQuery, and Snowflake Data Warehouse — four relational analytical databases based on scale-out cloud data warehouses and columnar-based database architectures. Despite these similarities, there are some distinct differences between the four platforms.
Amazon Web Services Redshift was the first managed data warehouse cloud service and continues to maintain a high level of mindshare in this category. It does indeed tick all the table stakes boxes for a cloud analytic database.
Amazon Redshift Spectrum can create an external table to store raw data on Amazon S3. Redshift Spectrum has some future-proofing that a modern data engineering approach might utilize. In support of diverse data, Amazon Redshift has a few nice features like the JSON_EXTRACT_PATH_TEXT function for noSQL key value databases.
Amazon Redshift is a fit for organizations needing a data warehouse with little to no administrative overhead and a clear, consistent pricing model. Amazon Web Services supports most of the databases in this report, and then some.
Azure SQL Data Warehouse
Azure SQL Data Warehouse storage is separate from the compute Data Warehouse Unit (DWU). This enables Azure SQL Data Warehouse to scale columnar storage capacity and compute resources independently. This capability adjusts to varying workload demands, offering potential cost savings when demand is low. Azure SQL Data Warehouse can also pause and resume compute billing, meaning only the storage is billed for during the paused time. Azure SQL Data Warehouse achieves a good balance in both configurability and simplicity, in a way that is both easy to administer and flexible in handling almost any usage pattern.
Azure SQL Data Warehouse is fully ANSI-SQL compliant and users familiar with SQL Server will be very comfortable using this environment.
Azure SQL Data Warehouse can export data to a local file the same way an on-premises SQL Server can, e.g., via the SQL Server Import and Export Wizard.
Google BigQuery is a managed service with some interesting distinctions. Google abstracts the details of the underlying hardware, database, and all configurations.
BigQuery is a hands-off database without indexes or column constraints. Defragmentation and system tuning are not required. It is truly serverless. Google Cloud manages the servers in a fully hands-off manner to the customer, dynamically allocating storage and compute resources. The customer does not define nodes and capacity of the BigQuery instance. The provisioning of compute is particularly fast and seamless.
You pay for the amount of data you query and store. Customers can pre-purchase flat-rate computation “slots” or units in increments of $10,000 per month per 500 compute units.
We simply consumed the results for this field test, but should we have been looking to do more with the data, such as exporting it in different formats, BigQuery has the capabilities to do so.
Also, although we did not time data ingest, this is an area of strength for BigQuery. BigQuery ingest does not impact the performance of queries, because separate compute slots are used for loading apart from the ones performing the SELECT statements for users. Ingest stats are auto-generated and data with an unknown schema can be loaded using schema auto-suggestion. There is no charge for ingest.
Snowflake Data Warehouse
As a cloud-only, fully managed solution, Snowflake has a clear separation between compute and storage. For Snowflake on AWS, which is what we used for the queries, data is stored in AWS S3 and is cached when queries are executed to bring the data in closer proximity to compute resources. Snowflake essentially offers two configuration “levers” — the size of the warehouse cluster and how many clusters are permitted to spin up to handle concurrency. Snowflake scales by cluster server count in powers of 2 (i.e., 1, 2, 4, 8, 16, and so on). If enabled, Snowflake will spin up additional clusters to handle multi-user concurrent query workloads. Snowflake would then automatically spin the additional clusters down once demand has passed. If not enabled, it will place paused queries in a queue until resources free up.
In our estimation, Snowflake performance largely hinges on cache. If Snowflake must access Amazon S3 for additional data, query execution slows dramatically. Historically, Snowflake has not disclosed the server class or size of its AWS EC2 instances. Thus, one cannot calculate beforehand what configuration would be most optimal for workloads.
Snowflake supports an ANSI-compliant form of SQL. However, we experienced syntax differences when converting existing queries to Snowflake. Snowflake supports data export to S3 (on AWS), though it is possible to use the Snowflake command line interface results to direct data to a file in a Linux shell.