AWS Schema Conversion Tool

The AWS Schema Conversion Tool is designed to make cross-engine database migrations more predictable. It does this by automatically converting not only the source database schema but also most of the database code objects, such as views, stored procedures, and functions. If you think back to the previous section, you may recall that there was no mention of those database objects; the objective was simply to move all of the database tables. Since that is all our sample database had, that was quite sufficient. However, many “enterprisey” databases will have these database objects, and the Schema Conversion Tool will help you with those.

First, the Schema Conversion Tool is a downloadable application, available for Microsoft Windows, Fedora Linux, and Ubuntu Linux. You can access the download links at https://aws.amazon.com/dms/schema-conversion-tool. We will use the Windows version of the tool for our walkthrough. Second, the tool only works with relational sources and only targets Amazon RDS or Amazon Redshift. Table 1 displays the source and target database combinations supported by the tool.

| Source | Aurora MySQL | Aurora PGSQL | MariaDB | MySQL | PGSQL | SQL Server | Redshift |
|---|---|---|---|---|---|---|---|
| Oracle | X | X | X | X | X | | |
| Oracle Data Warehouse | | | | | | | X |
| Azure SQL Database | X | X | | X | X | | |
| Microsoft SQL Server | X | X | X | X | X | X | X |
| Teradata | | | | | | | X |
| IBM Netezza | | | | | | | X |
| Greenplum | | | | | | | X |
| HPE Vertica | | | | | | | X |
| MySQL | | X | | | X | X | |
| PostgreSQL (PGSQL) | X | X | | X | X | | |
| IBM Db2 LUW | X | X | X | X | X | | |
| IBM Db2 for z/OS | X | X | | X | X | | |
| SAP ASE | X | X | X | X | X | | |
| Amazon Redshift | | | | | | | X |
| Azure Synapse Analytics | | | | | | | X |
| Snowflake | | | | | | | X |

Table 1. Databases available as sources and targets for the Schema Conversion Tool

Clicking the download link will start the download of a zip file. Once the file is downloaded, extract its contents to a working directory. There will be an .msi installation file and two folders. Run the installation file and start the application when the installation completes.

Configuring the Source

Upon your first run of the tool, you will be presented with the terms of use. Accepting these terms will open the application and present the Create a new database migration project screen as shown in Figure 1.

Figure 1. Create a new database migration project screen in SCT

We selected Microsoft SQL Server as our source engine, which enabled the three radio buttons that give some direction as to how the conversion process should proceed. The three choices are:

·         I want to switch engines and optimize for the cloud (default)

·         I want to keep the same engine but optimize for the cloud

·         I want to see a combined report for database engine switch and optimization to cloud.

Each of these selections alters the logic of the migration project; for example, choosing to keep the same engine will present a different set of destinations than choosing to switch engines.

Completing the fields in Step 1 and clicking the Next button will take you to the Step 2 – Connect to the source database screen as shown in Figure 2.

Figure 2. Specifying connection information for the source database

As shown in Figure 2, four fields display a red tick in their upper-left corner. Those fields are required for the connection, and three of them should be well familiar by now: Server Name, User name, and Password (when accessing the source database using SQL Server Authentication). The last field, Microsoft SQL Server driver path, is new; it points to the directory in which the Microsoft SQL Server JDBC driver is located, which we did not yet have installed. Fortunately, AWS provides a page with links to the various database drivers at https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/CHAP_Installing.html. You will need drivers for both your source and target databases, so we downloaded the drivers for SQL Server (our source database) and for Amazon Aurora MySQL (our destination database). Once the appropriate JDBC drivers are installed, you can point to the SQL Server driver path as shown in Figure 3.

Figure 3. Specifying SQL Server JDBC path

Once you have the server, authentication, and driver path filled out, click the Test connection button to ensure everything works as expected. If the test succeeds, select Next to continue.

Note: Our Microsoft SQL Server JDBC download contained three different .jar files: jre8, jre11, and jre17. The tool allowed the selection of the jre8 and jre11 files but not the jre17 file. This will likely change as the tool continues to evolve.

The tool will next display a screen indicating that it is loading the metadata for the server. This metadata includes databases, schemas, and triggers. Once the loading is completed, you will be on the Step 3. Choose a schema screen, which lists all databases and the schemas available within each one. The list includes the system databases, such as master, model, msdb, and tempdb. You will probably not want to include those! Once you have selected the schema(s), click the Next button. You will see the “Loading metadata” screen again as the tool retrieves all the database objects for your selected schema(s). This process can take a few minutes.
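If you want a quick preview of the databases and schemas the tool will offer before stepping through the wizard, you can query the source server directly. The following is a minimal sketch using Python and pyodbc (an assumption on our part; SCT itself uses the JDBC driver configured above), with placeholder server and credential values.

```python
import pyodbc  # assumes the Microsoft ODBC Driver for SQL Server is installed

# Placeholder connection details - substitute your own server and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-sql-server.example.com;"
    "UID=your_user;PWD=your_password"
)
cursor = conn.cursor()

# database_id > 4 skips the system databases (master, tempdb, model, msdb).
cursor.execute("SELECT name FROM sys.databases WHERE database_id > 4 ORDER BY name")
for (db_name,) in cursor.fetchall():
    # List the schemas in each user database, excluding built-in system schemas.
    cursor.execute(
        f"SELECT name FROM [{db_name}].sys.schemas "
        "WHERE schema_id < 16384 AND name NOT IN ('sys', 'INFORMATION_SCHEMA')"
    )
    schemas = [row[0] for row in cursor.fetchall()]
    print(db_name, schemas)

conn.close()
```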

Database Migration Assessment Screen

Once the load is completed, you will be taken to the Step 4. Run the Database migration assessment screen. The first thing you will see is the assessment report, which the tool created by analyzing all of the metadata it found to determine how well the database would convert to the various target platforms. At the top of the report is the Executive summary, which lists all of the potential target platforms and summarizes the types of actions that need to be taken. An example of this report is shown in Figure 4.

Figure 4. Executive summary of the migration assessment report

Immediately under the executive summary is a textual analysis of the data in the chart. Each line item is described with an estimate of the percentage of database storage objects and database code objects that can be automatically converted. In our case, both Amazon RDS for MySQL and Amazon Aurora (MySQL compatible) came in at 100%; none of the other target platforms scored that high.

Additional detail is displayed further down the page as shown in Figure 5.

Figure 5. Details on migrating to Amazon Aurora MySQL

This section demonstrates that 1 schema, 10 tables, 17 constraints, 7 procedures and 1 scalar function can be successfully converted to Amazon Aurora (MySQL compatible).

Configuring the Destination

Once you have completed your review of the potential destination, click the Next button. This brings you to the Step 5. Choose a target page, where you select the target engine and configure the connection to the target database. When we got to the page, Amazon RDS for MySQL was already selected as the target engine, so we went with that and created a new Amazon RDS for MySQL instance in the RDS console, making sure we enabled external access. Filling out the connection information and clicking the Test connection button confirmed that everything was entered correctly, so we clicked the Finish button.

Completing the Migration

This brings you to the project page as shown in Figure 6.

Figure 6. Schema conversion tool conversion project

Just as with DMS, the conversion tool gives you the ability to add mapping and transformation rules. You do this by clicking the Main view icon in the toolbar and selecting the Mapping view, which changes the center section of the screen. In this section you can add Transformation rules that, just as with DMS, allow you to alter the names of items being migrated. You create a rule with the appropriate filters to determine which objects will be affected, and you have the following options for how the names will be changed:

·         Add prefix

·         Add suffix

·         Convert lowercase

·         Convert uppercase

·         Move to

·         Remove prefix

·         Remove suffix

·         Rename

·         Replace prefix

·         Replace suffix

These different transformations are useful when working with database schemas that use older naming approaches, such as a prefix of “t” before the name to show that the object is a table, or “v” to indicate that it is a view. We will not be using any transformations as part of our conversion.
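To make the prefix rules concrete, the sketch below shows the kind of renaming a “Remove prefix” rule performs. This is purely illustrative Python, not SCT’s implementation, and the table names are hypothetical.

```python
# Illustrative only: mimics what a "Remove prefix" transformation rule would do
# to legacy object names before they land in the target schema.
LEGACY_TABLES = ["tCustomer", "tOrder", "tOrderLine", "vCustomerOrders"]

def remove_prefix(name: str, prefix: str) -> str:
    """Strip a legacy naming prefix (e.g. 't' for table) if present."""
    return name[len(prefix):] if name.startswith(prefix) else name

# Only rename objects matched by the rule's filter - here, names starting with "t".
renamed = {
    name: remove_prefix(name, "t")
    for name in LEGACY_TABLES
    if name.startswith("t")
}
print(renamed)  # {'tCustomer': 'Customer', 'tOrder': 'Order', 'tOrderLine': 'OrderLine'}
```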

Since we are converting our ProDotNetOnAWS database and its dbo schema, go to the left window where the SQL Server content is displayed, right-click the dbo schema, and select Convert schema from the popup menu. You will get an additional popup that shows the schema being copied to the target. Once completed, the right window will look like Figure 7, showing that the schema has been copied over along with tables, procedures, views, and functions (if you have all of those).

Figure 7. Schema converted to target database

Note that this has not yet been applied to the destination server; it is instead a local representation of what the schema will look like once applied. Your next step is to apply the changes to the destination. You do this by right-clicking the destination schema and selecting Apply to database. This brings up a confirmation popup, after which you will see the schema being processed. The window will close once the processing is completed.

At this point, your schema has been transferred to the target database. Figure 8 shows the destination database in MySQL Workbench, and you can see that the schema defined in the tool has been successfully migrated.

Figure 8. Viewing converted schema in MySQL Workbench
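If you prefer to verify the conversion outside of MySQL Workbench, you can query MySQL’s information_schema directly. The following is a minimal sketch using the mysql-connector-python package (an assumption; any MySQL client works), with placeholder connection values and an assumed schema name that you should adjust to match whatever name the tool created in your target.

```python
import mysql.connector  # pip install mysql-connector-python

# Placeholder connection details for the Aurora MySQL / RDS for MySQL target.
conn = mysql.connector.connect(
    host="your-target-instance.us-east-1.rds.amazonaws.com",
    user="admin",
    password="your_password",
)
cursor = conn.cursor()

# Adjust to the schema name that appears in your target after the conversion.
target_schema = "ProDotNetOnAWS_dbo"

cursor.execute(
    "SELECT table_name FROM information_schema.tables WHERE table_schema = %s",
    (target_schema,),
)
print("Tables:", [row[0] for row in cursor.fetchall()])

cursor.execute(
    "SELECT routine_name, routine_type FROM information_schema.routines "
    "WHERE routine_schema = %s",
    (target_schema,),
)
print("Procedures/functions:", cursor.fetchall())

conn.close()
```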

Once your data has been migrated, the last step is to convert your code so that it can access your new database.

Selecting a BI Data Stack – Snowflake vs Azure Synapse Analytics

This article is designed to help choose the appropriate technical tools for building out a Data Lake and a Data Warehouse.

Components

A Data Lake is a centralized repository that allows you to store structured, semi-structured, and unstructured data at scale. The idea is that copies of all your transactional data are gathered, stored, and made available for additional processing, whether it is transformed and moved into a data warehouse or made available for direct usage for analytics such as machine learning or BI reporting.

A Data Warehouse, on the other hand, is a digital storage system that connects and harmonizes large amounts of data from many different sources. Its purpose is to feed business intelligence (BI), reporting, and analytics, and support regulatory requirements – so companies can turn their data into insight and make smart, data-driven decisions. Data warehouses store current and historical data in one place and act as the single source of truth for an organization.

While a data warehouse is a database structured and optimized to provide insight, a data lake is different because it stores relational data from line-of-business applications and non-relational data from other systems. The structure of the data, or schema, is not defined when the data is captured. This means you can store all your data without careful design or the need to know what questions you might need answers for in the future. Different types of analytics, such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning, can be used to uncover insights.

General Requirements 

There are many different ways to implement a data lake. The simplest is a plain object storage system such as Amazon S3. Another is a combination of components, such as tying together a database system to store relational data and an object storage system to store unstructured or semi-structured data, or adding stream processing, a key/value database, or any number of other products that handle the job similarly. With all of these approaches potentially fitting the definition of “data lake”, the first thing we must do is define more precisely the data lake requirements that we will use for the comparison.

  • Relational data – One requirement is that the data stack support multiple SaaS systems that will provide data to the data lake. These systems default to relational/object-based structures where objects are consistently defined and interrelated, which means this information is best served by storage within a relational database system. Examples of systems providing this type of data include Salesforce, Marketo, OpenAir, and NetSuite.
  • Semi-structured data – Some of the systems that will act as sources for this data lake contain semi-structured data, which is data that has some structure but does not necessarily conform to a data model. Semi-structured records may mostly “look like each other”, but each piece of data may have different attributes or properties. Data from these sources could be transformed into a relational form, but that transformation gets away from storing the data in its original state. Examples of sources of this kind of data include logs, such as Splunk or application logs.
  • Unstructured data – Unstructured data is data that does not conform to a data model and has no easily identifiable structure, such as images, videos, and documents. At this time, there are no identified providers of unstructured data expected to be managed within the data lake.

Thus, fulfilling this set of data lake requirements means support for both relational/structured and semi-structured data is a “must-have”, while the ability to manage unstructured data is a “nice-to-have”.

The next set of requirements applies to both data lakes and data warehouses and concerns being able to add, access, and extract data, with many more requirements around managing data as it is read and extracted. These requirements are:

  • Input – The data lake must be able to support write and edit access from external systems. The system must support authentication and authorization as well as the ability to automate data movement without having to rely on internal processes. Thus, a tool requiring a human to click through a UI is not acceptable.
  • Access – The data lake must support different ways of accessing data from external sources, so the tool:
    • must be able to process SQL (or similar) queries to limit/filter the data being returned
    • must be supported as a data source for Tableau, Excel, and other reporting systems
    • should give control over the amount of resources used when executing queries, to allow control over the performance/price ratio

Top Contenders

There are many products available that support data lake needs, data warehouse needs, or both. However, one of the key filters was to only consider offerings that are available as a managed service in the cloud, rather than installed and maintained on-premises, and that can support both data lake and data warehouse needs. With that in mind, we narrowed the list down to five contenders based upon the Forrester Wave™ on Cloud Data Warehouses study shown below in Figure 1: 1) Google BigQuery, 2) Amazon Redshift, 3) Azure Synapse Analytics, 4) Teradata Vantage, and 5) Snowflake Data Cloud.

Figure 1 – Forrester Wave on Cloud Data Warehouses

The Forrester scoring was based upon 25 different criteria, of which we pulled out the seven that were the most important to our defined set of requirements and major use cases (in order of importance):

  1. Data Ingestion and Loading – The ability to easily insert and transform data into the product
  2. Administration Automation – An ability to automate all the administration tasks, including user provisioning, data management, and operations support
  3. Data Types – The ability to support multiple types of data, including structured, semi-structured, unstructured, files, etc.
  4. Scalability Features – The ability to easily and automatically scale for both storage and execution capabilities
  5. Security Features – The ability to control access to data stored in the product. This includes management at database, table, row, and column levels.
  6. Support – The strength of support offered by the vendor, including training, certification, knowledge base, and a support organization.
  7. Partners – The strength of the partner network, other companies that are experienced in implementing the software as well as creating additional integration points and/or features.
| Forrester Criterion | Google BigQuery | Azure Synapse Analytics | Teradata Vantage | Snowflake Data Cloud | Amazon Redshift |
|---|---|---|---|---|---|
| Data Ingestion and Loading | 5.00 | 3.00 | 3.00 | 5.00 | 5.00 |
| Administration Automation | 3.00 | 3.00 | 3.00 | 5.00 | 3.00 |
| Data Types | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
| Scalability Features | 5.00 | 5.00 | 3.00 | 5.00 | 3.00 |
| Security Features | 3.00 | 3.00 | 3.00 | 5.00 | 5.00 |
| Support | 5.00 | 5.00 | 3.00 | 5.00 | 5.00 |
| Partners | 5.00 | 1.00 | 5.00 | 5.00 | 5.00 |
| Average of Forrester Scores | 4.43 | 3.57 | 3.57 | 5.00 | 4.43 |
Table 1 – Base Forrester scores for selected products
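As a quick sanity check on the “Average of Forrester Scores” row, the short sketch below recomputes the averages from the seven criterion scores in Table 1.

```python
# Recompute the average row of Table 1 from the seven selected criterion scores
# (order: ingestion, administration, data types, scalability, security, support, partners).
scores = {
    "Google BigQuery":         [5.00, 3.00, 5.00, 5.00, 3.00, 5.00, 5.00],
    "Azure Synapse Analytics": [3.00, 3.00, 5.00, 5.00, 3.00, 5.00, 1.00],
    "Teradata Vantage":        [3.00, 3.00, 5.00, 3.00, 3.00, 3.00, 5.00],
    "Snowflake Data Cloud":    [5.00, 5.00, 5.00, 5.00, 5.00, 5.00, 5.00],
    "Amazon Redshift":         [5.00, 3.00, 5.00, 3.00, 5.00, 5.00, 5.00],
}

for product, values in scores.items():
    print(f"{product}: {sum(values) / len(values):.2f}")
# Google BigQuery: 4.43, Azure Synapse Analytics: 3.57, Teradata Vantage: 3.57,
# Snowflake Data Cloud: 5.00, Amazon Redshift: 4.43
```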

This shows a difference between the overall scores, based on the full list of 25 criteria, and the scores based only on the areas where our requirements pointed to the most need. Snowflake was an easy selection for the in-depth final evaluation, but choosing which other products to dig deeper into was more difficult. In this first evaluation we compare Snowflake to Azure Synapse Analytics; other comparisons will follow.

Final Evaluation

There are two additional areas in which we dive deeper into the differences between Synapse and Snowflake: cost and execution processing. Execution processing is important because the more granular the ability to manage execution resources, the more control the customer has over performance in their environment.

Cost

Calculating cost for data platforms generally comes down to four areas: data storage, processing memory, processing CPUs, and input/output. There are two different approaches to Azure’s pricing for Synapse: Dedicated and Serverless. Dedicated pricing is broken down into two values: storage and an abstract representation of compute resources and performance called a Data Warehouse Unit (DWU). The Serverless consumption model, on the other hand, charges by the TB of data processed, so a query that scans the entire database would be very expensive while small, concise queries will be relatively inexpensive. Snowflake has taken an approach similar to the Dedicated model, with its abstract representation called a Virtual Warehouse (VW). We will build a pricing example that includes both data lake and data warehouse usage using the following customer definition:

  • 20 TB of data, adding 50 GB a week
  • 2.5 hours a day on a 2-CPU, 32 GB RAM machine to support incoming ETL work
  • 10 users working 8 hours a day, 5 days a week, currently supported by a 4-CPU, 48 GB RAM machine
| Product | Description | Price | Notes |
|---|---|---|---|
| Synapse Dedicated | Storage Requirements | $4,600 | 20 TB * $23 (per TB month) * 12 months |
| Synapse Dedicated | ETL Processing | $2,190 | 400 credits * 2.5 hours * 365 days * $0.006 a credit |
| Synapse Dedicated | User Processing | $11,520 | 1,000 credits * 8 hours * 240 days * $0.006 a credit |
| Synapse Dedicated | Total | $18,310 | |
| Synapse Serverless | Storage | $4,560 | 20 TB * $19 (per TB month) * 12 months |
| Synapse Serverless | Compute | $6,091 | ((5 TB * 240 days) + (0.05 TB * 365 days)) * $5 |
| Synapse Serverless | Total | $10,651 | |
| Snowflake | Storage Requirements | $1,104 | 4 TB (20 TB uncompressed) * $23 (per TB month) * 12 months |
| Snowflake | ETL Processing | $3,650 | 2 credits * 2.5 hours * 365 days * $2 a credit |
| Snowflake | User Processing | $15,360 | 4 credits * 8 hours * 240 days * $2 a credit |
| Snowflake | Total | $20,114 | |
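The sketch below simply reproduces the arithmetic behind the table above, using the rates and consumption figures from the notes column (the credit counts and warehouse sizes are the table’s own assumptions, not published list prices).

```python
# Reproduce the annual cost estimates from the pricing table above.
# All rates and consumption figures come from the table's notes column.

# Synapse Dedicated: storage plus DWU-credit-based compute.
synapse_ded_storage = 4_600                            # storage estimate from the table
synapse_ded_etl     = 400 * 2.5 * 365 * 0.006          # 2,190
synapse_ded_users   = 1_000 * 8 * 240 * 0.006          # 11,520
print("Synapse Dedicated total:", synapse_ded_storage + synapse_ded_etl + synapse_ded_users)

# Synapse Serverless: storage plus $5 per TB of data processed.
synapse_srv_storage = 20 * 19 * 12                     # 4,560
synapse_srv_compute = ((5 * 240) + (0.05 * 365)) * 5   # ~6,091
print("Synapse Serverless total:", synapse_srv_storage + synapse_srv_compute)

# Snowflake: compressed storage plus credit-based compute at $2 a credit.
snowflake_storage = 4 * 23 * 12                        # 1,104
snowflake_etl     = 2 * 2.5 * 365 * 2                  # 3,650
snowflake_users   = 4 * 8 * 240 * 2                    # 15,360
print("Snowflake total:", snowflake_storage + snowflake_etl + snowflake_users)
```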

As an estimate, before considering any discounts, Synapse is approximately 10% less expensive than Snowflake when using Dedicated consumption and almost 50% less costly when using Serverless consumption. The Serverless figure is very much an estimate, though, because Serverless charges by the TB processed in a query, so large or poorly constructed queries can be exponentially more expensive. However, the Snowflake plan gets some cost remediation because the movement of data from the data lake to the data warehouse is simplified. Snowflake’s structure and SQL support allow data to be moved between databases using SQL, so there is only a one-time processing fee for the SQL execution rather than paying once for the query that gets data out of the system and again for the execution that loads data into it. Synapse offers some of the same capability, but much of that happens outside of the data management system, and the advantage gained in the process is smaller than what Snowflake offers.
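As a sketch of what that in-database movement looks like, the following uses the snowflake-connector-python package (an assumption; Boomi or any other client could issue the same SQL) with hypothetical database, schema, and table names to copy data from a “data lake” database into a “data warehouse” database in a single SQL statement.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder account and credential values.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="ETL_WH",  # hypothetical warehouse used for the load
)
cursor = conn.cursor()

# Hypothetical databases: LAKE_DB holds raw data, WAREHOUSE_DB is the curated warehouse.
# Cross-database, fully qualified names let one INSERT ... SELECT do the move,
# so only this statement's execution is billed.
cursor.execute("""
    INSERT INTO WAREHOUSE_DB.ANALYTICS.ORDERS
    SELECT order_id, customer_id, order_total, order_date
    FROM LAKE_DB.RAW.ORDERS
    WHERE order_date >= DATEADD(day, -7, CURRENT_DATE())
""")

conn.close()
```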

Execution Processing

The ability to flexibly scale the resources you use when executing a command is invaluable, and both products do this to some extent. Snowflake, for example, uses the concept of a warehouse, which is simply a cluster of compute resources such as CPU, memory, and temporary storage. A warehouse can either be running or suspended, and charges accrue only while it is running. Snowflake’s flexibility comes from these warehouses: the user can provision warehouses of different sizes and then select the desired warehouse, or processing level, when connecting to the server to perform work, as shown in Figure 2. This allows the processing to be matched to the query being run. Selecting a single item from the database by its primary key may not need the most powerful (and expensive) warehouse, but a report doing complete table scans across TBs of data could use the boost. Another example would be a scheduled report, where processing taking slightly longer is not an issue.

Figure 2 – Switching Snowflake warehouses to change dedicated compute resources for a query

Figure 3 shows the connection management screen in the integration tool Boomi, with the arrow pointing towards where you can set the desired warehouse when you connect to Snowflake.

Figure 3 – Boomi connection asking for Warehouse name
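To show the same idea in code rather than in Boomi, here is a minimal sketch using snowflake-connector-python (an assumption, as above) that picks a warehouse at connection time and then switches the session to a larger, hypothetical warehouse for a heavier query.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials; XSMALL_WH and REPORTING_WH are hypothetical warehouses
# that would have been provisioned ahead of time.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="XSMALL_WH",  # inexpensive warehouse chosen at connection time
)
cursor = conn.cursor()

# A point lookup by primary key runs fine on the small warehouse.
cursor.execute("SELECT * FROM WAREHOUSE_DB.ANALYTICS.ORDERS WHERE order_id = 42")
print(cursor.fetchone())

# Switch this session to a larger warehouse before a full-scan style report.
cursor.execute("USE WAREHOUSE REPORTING_WH")
cursor.execute("""
    SELECT customer_id, SUM(order_total)
    FROM WAREHOUSE_DB.ANALYTICS.ORDERS
    GROUP BY customer_id
""")
print(cursor.fetchmany(10))

conn.close()
```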

Synapse takes a different approach. You either use the Serverless model, where you have no control over the processing resources, or the Dedicated model, where you set the level of resource consumption when you create the SQL pool that will store your data. You cannot switch the execution resources applied to an individual query the way you can with Snowflake warehouses.

Due to this flexibility, Snowflake has the advantage in execution processing.

Recommendation

In summary, Azure Synapse Analytics has an advantage on cost, while Snowflake has an advantage in the flexibility with which you can manage system processing resources. When bringing in the previous scoring provided by Forrester, Snowflake has the edge based on its strength in data ingestion and administration automation, both areas that speed delivery and ease the burden of operational support. Based on this analysis, my recommendation is to use Snowflake rather than Azure Synapse Analytics for the entire data stack, including both the data lake and the data warehouse.