This article is designed to help choose the appropriate technical tools for building out a Data Lake and a Data Warehouse.
A Data Lake is a centralized repository that allows you to store structured, semi-structured, and unstructured data at scale. The idea is that copies of all your transactional data are gathered, stored, and made available for additional processing, whether it is transformed and moved into a data warehouse or made available for direct usage for analytics such as machine learning or BI reporting.
A Data Warehouse, on the other hand, is a digital storage system that connects and harmonizes large amounts of data from many different sources. Its purpose is to feed business intelligence (BI), reporting, and analytics, and support regulatory requirements – so companies can turn their data into insight and make smart, data-driven decisions. Data warehouses store current and historical data in one place and act as the single source of truth for an organization.
While a data warehouse is a database structured and optimized to provide insight, a data lake is different because it stores relational data from line of business applications and non-relational data from other systems. The structure of the data or schema is not defined when data is captured. This means you can store all your data without careful design or the need-to-know what questions you might need answers for in the future. Different types of analytics on your data like SQL queries, big data analytics, full text search, real-time analytics, and machine learning can be used to uncover insights.
There are a lot of different ways to implement a data lake. The first is in a simple object storage system such as Amazon S3. The other is through a combination of ways, such as tying together a database system in which to store relational data and an object storage system to store unstructured or semi-structured data, or adding in stream-processing, or a key/value database, or any number of different products that handle the job similarly. With all these different approaches potentially filling the definition of “data lake”, the first thing we must do is to define deeper the data lake requirements that we will use for the comparison.
- Relational data –One requirement is that the data stack include multiple SAAS systems that will providing data to the data lake. These systems default to a relational\object-based structures where objects are consistently defined and are inter-related. This means that this information will be best served by storing within a relational database system. Examples of systems providing this type of data include Salesforce, Marketo, OpenAir, and NetSuite.
- Semi-structured data – some of the systems that will act as sources for this data lake contain semi structured data, which is data that has some structure but does not necessarily conform to a data model. Semi-structured data may mostly “look like each other” but each piece of data may have different attributes or properties. Data from these sources could be transformed into a relational approach, but this transformation gets away from storing the data in its original state. An example of sources of this kind of data include logs, such as Splunk or application logs.
- Unstructured data – Unstructured data is data which does not conform to a data model and has no easily identifiable structure, such as images, videos, documents, etc. At this time, there are no identified providers of unstructured data that is expected to be managed within the data lake.
Thus, fulfilling this set of data lake requirements means support for both relational\structured and semi-structured data are “must-haves”, and the ability to manage unstructured data is a “nice-to-have”.
The next set of requirements is for both data lakes and data warehouses are around being able to add, access, and extract data, with there being many more requirements around being able to manage data as it is being read and extracted. These requirements are:
- Input – The data lake must be able to support write and edit access from external systems. The system must support authentication and authorization as well as the ability to automate data movement without having to rely on internal processes. Thus, a tool requiring a human to click through a UI is not acceptable.
- Access – the data lake must be able to support different ways of accessing data from external sources, so the tool:
- must be able to support the processing of SQL (or similar approaches) for queries to limit/filter data being returned
- must be supported as a data source for Tableau, Excel, and other reporting systems
- should be able to give control over the number of resources used when executing queries to have control over the performance/price ratio
There are a lot of products available for consideration that support either, or both, data lake and data warehouse needs. However, one of the key pivots was to only look at offerings that were available as a managed service in the cloud rather than a product that would be installed and maintained on-premise and that the product can support both data lake and data warehouse needs. With that in mind, we narrowed the list down to 5 contenders: 1) Google BigQuery, 2) Amazon Redshift, 3) Azure Synapse Analytics, 4) Teradata Vantage, and 5) Snowflake Data Cloud based upon the Forrester Wave™ on Cloud Data Warehouses study as shown below in Figure 1.
The Forrester scoring was based upon 25 different criteria, of which we pulled out the seven that were the most important to our defined set of requirements and major use cases (in order of importance):
- Data Ingestion and Loading – The ability to easily insert and transform data into the product
- Administration Automation – An ability to automate all the administration tasks, including user provisioning, data management, and operations support
- Data Types – The ability to support multiple types of data, including structured, semi-structured, unstructured, files, etc.
- Scalability Features – The ability to easily and automatically scale for both storage and execution capabilities
- Security Features – The ability to control access to data stored in the product. This includes management at database, table, row, and column levels.
- Support – The strength of support offered by the vendor, including training, certification, knowledge base, and a support organization.
- Partners – The strength of the partner network, other companies that are experienced in implementing the software as well as creating additional integration points and/or features.
|Google BigQuery||Azure Synapse Analytics||Teradata Vantage||Snowflake Data Cloud||Amazon Redshift|
|Data Ingestion and Loading||5.00||3.00||3.00||5.00||5.00|
|Average of Forrester Scores||4.43||3.57||3.57||5.00||4.43|
This shows a difference between the overall scores, based on the full list of 25 items, and the list of scores based upon those areas where it was determined our requirements pointed to the most need. Snowflake was an easy selection into the in-depth, final evaluation, but the choices for the other products to dig deeper into was more problematic. In this first evaluation we compared Snowflake to Azure Synapse Analytics. Other comparisons will follow.
There are two additional areas in which we dive deeper into the differences between Synapse and Snowflake, cost, and execution processing. Execution processing is important because the more granular the ability to manage execution resources, the more control the customer has over performance in their environment.
Calculating cost for data platforms generally comes down to four different areas, data storage, processing memory, processing CPUs, and input/output. There are two different approaches to Azure’s pricing for Synapse, Dedicated and Serverless. Dedicated is broken down into 2 values, storage and an abstract representative of compute resources and performance called a Data Warehouse Unit (DWU). The Serverless consumption model, on the other hand, charges by TB of data processed. Thus, a query that scans the entire database would be very expensive while small, very concise queries will be relatively inexpensive. Snowflake has taken a similar approach to the Dedicated mode, with their abstract representation called a Virtual Warehouse (VW). We will build a pricing example that includes both Data Lake and Data Warehouse usage using the following customer definition:
- 20 TBs of data, adding 50 GB a week
- 2.5 hours a day on a 2 CPU, 32GB RAM machine to support incoming ETL work.
- 10 users working 8 hours, 5 days a week, currently supported by a 4 CPU, 48GB RAM machine
|Synapse Ded.||Storage Requirements||$ 4,600||20 TB * $23 (per TB month) * 12 months|
|Synapse Ded.||ETL Processing||$ 2,190||400 Credits * 2.5 hours * 365 days * $0.006 a credit|
|Synapse Ded.||User Processing||$ 11,520||1000 Credits * 8 hours * 240 days * $.006 a credit|
|Synapse Ded.||Total||$ 18,310|
|Synapse Serv.||Storage (Serverless)||$ 4,560||20 TB * $19 (per TB month) * 12 months|
|Synapse Serv.||Compute||$ 6,091||((5 TB * 240 days) + (.05 TB * 365 days)) * $5|
|Synapse Serv.||Total||$ 10,651|
|Snowflake||Storage Requirements||$ 1,104||4 TB (20 TB uncompressed) * $23 (per TB month)|
|Snowflake||ETL Processing||$ 3,650||2 Credits * 2.5 hours * 365 days * $2 a credit|
|Snowflake||User Processing||$ 15,360||4 Credits * 8 hours * 240 days * $2 a credit|
As an estimate, before considering any discounts, Synapse is approximately 10% less expensive than Snowflake when using Dedicated consumption and almost 50% less costly when using Serverless consumption, though that is very much an estimate because Serverless charges by the TB processed in a query, so large or poorly constructed queries can be exponentially more expensive. However, there is some cost remediation of the Snowflake plan because the movement of data from the data lake to the data warehouse is simplified. Snowflake’s structure and custom SQL language allow the movement of data between databases using SQL, so there is only the one-time processing fee for the SQL execution rather than paying for execution of query to get data out of the system and the execution to get data into the system. Synapse offers some of the same capability, but much of that is outside of the data management system and the advantage gained in the process is less than that offered by Snowflake.
The ability to flexibly scale the number of resources that you use when executing a command is invaluable. Both products do this to some extent. Snowflake, for example, uses the concept of a Warehouse, which is simply a cluster of compute resources such as CPU, memory, and temporary storage. A warehouse can either be running or suspended, and charges accrue only when running. The flexibility that Snowflake operates is around these warehouses, where the user can provision different warehouses, or sizes of work, and then select the desired data warehouse, or processing level, when connecting to the server to perform work as shown in Figure 2. This allows for matching the processing to the query being done. Selecting a single item from the database by it’s primary key may not need to be run using the most powerful (expensive) warehouse, but a report doing complete table scans across TBs of data could use the boost. Another example would be a generated scheduled report where the processing taking slightly longer is not an issue.
Figure 3 shows the connection management screen in the integration tool Boomi, with the arrow pointing towards where you can set the desired warehouse when you connect to Snowflake.
Synapse takes a different approach. You either use a Serverless model, where you have no control over the processing resources or a Dedicated model where you determine the level of resource consumption when you create the SQL pool which will store your data. You do not have the ability to scale up or down the execution resources being used.
Due to this flexibility, Snowflake has the advantage in execution processing.
In summary, Azure Synapse Analytics has an advantage on cost while Snowflake has an advantage on the flexibility with which you can manage system processing resources. When bringing in the previous scoring as provided by Forrester, Snowflake has the advantage based on its strength around data ingestion and administration automation, both areas that help ease the speed of delivery as well as help ease the burden around operational support. Based off this analysis my recommendation would be to use Snowflake for the entire data stack, including both Data Lake and Data Warehouse rather than Azure Synapse Analytics.