Amazon DocumentDB and .NET

Amazon DocumentDB is a fully managed, scalable, and highly available document database service that uses a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64 TB per cluster. Amazon DocumentDB reduces database I/O by persisting only write-ahead logs; it does not need to write full buffer page syncs, which avoids slow, inefficient, and expensive data replication across network links.

This design separates compute and storage, which means that you can scale each independently. DocumentDB is designed for 99.99% availability and replicates six copies of your data across three Availability Zones. It also continually monitors cluster instance health and automatically fails over to a read replica in the event of a failure, typically in less than 30 seconds. You can start with a single instance that delivers high durability and then, as you grow, add a second instance for high availability and easily increase the number of read-only instances. You can scale read capacity to millions of requests per second by scaling out to 15 low-latency read replicas spread across three Availability Zones.

One of the interesting characteristics is that DocumentDB was built with MongoDB compatibility. Why is that important?

MongoDB

MongoDB is a source-available, cross-platform document database that was first released in 2009 and is, by far, the most popular document database available. Since it was both first to market and available as open source, it tended to be adopted early by companies running on-premises and experimenting with the concept of document stores. As a result, a huge number of on-premises software systems rely, completely or in part, on MongoDB. Any movement of these systems to the cloud therefore has to support the software written around MongoDB; otherwise, that would be a massive impediment to migration.

AWS realized this and released DocumentDB with compatibility for the MongoDB APIs that were available at the time. At the time of this writing (as at release), DocumentDB implements the MongoDB 3.6 and 4.0 API responses. However, MongoDB released version 5.0 in July 2021, so there is no feature parity. Also, since AWS did not support the interim releases (4.2, 4.4, and the subsequent 4.4.x patch releases), it appears that there will not be additional support moving forward.

While compatibility with current MongoDB versions is not supported, the design of DocumentDB, together with optimizations like advanced query processing, connection pooling, and optimized recovery and rebuild, provides a very performant system. AWS claims that DocumentDB achieves twice the throughput of currently available MongoDB managed services.

Setting up a DocumentDB Database

Now that we have a bit of understanding about DocumentDB, let’s go set one up. Log in to the console, and either search for DocumentDB or find it using Services > Database > Amazon DocumentDB. Select the Create Cluster button on the dashboard to bring up the Create cluster page, as shown in Figure 1.

Figure 1. Create Amazon DocumentDB Cluster screen

First, fill out the Configuration section by adding a Cluster identifier. This identifier must be unique among DocumentDB clusters within this region. Next, you’ll see a dropdown for the Engine version, which controls the MongoDB API that you will use for the connection. This becomes important when selecting and configuring the database drivers that you will be using in your .NET application. We recommend that you use 4.0, as this gives you the ability to support transactions. The next two fields are Instance class and Number of instances. This is where you define the compute and memory capacity of each instance as well as how many instances you will have. For the Instance class, the very last selection in the dropdown is db.t3.medium, which is eligible for the free tier, so we selected that class. When considering the number of instances, you can have a minimum of one instance, which handles both reads and writes, or more than one instance, in which case you get a primary instance plus replicas. We chose two instances so that we can see both a primary and a replica instance.

You’ll see a profound difference if you compare this screen to the ones you saw when working with RDS, as this screen is much simpler and seems to give you much less control over how you set up your cluster. Finer control over the configuration is available, however; you just need to click the slider at the lower left labeled Show advanced settings. Doing so will bring up the Network settings, Encryption, Backup, Log exports, Maintenance, Tags, and Deletion protection configuration sections that you will be familiar with from the RDS chapters.

Once you complete the setup and click the Create cluster button you will be returned to the Clusters list screen that will look like Figure 2.

Figure 2. Clusters screen immediately after creating a cluster

As Figure 2 shows, there are initially roles for a Regional cluster and a Replica instance. As the creation process continues, the regional cluster is created first, and then the first instance in line changes role to become the Primary instance. When creation is complete, there will be one regional cluster, one primary instance, and any number of replica instances. Since we chose two instances when creating the cluster, one replica instance is shown being created. Once the cluster is fully available, you will also be able to see the zones in which the instances are available. At this point, you will have a cluster on which to run your database, but you will not yet have a database to which you can connect. As you will see below when we start using DocumentDB in our code, that’s OK.

DocumentDB and .NET

When looking at using this service within your .NET application, you need to consider that, like Amazon Aurora, DocumentDB emulates the APIs of other products. This means that you will not be accessing the data through AWS drivers, but will instead be using MongoDB database drivers within your application. It is important to remember, however, that your database is not built using the most recent version of the MongoDB API, so you must ensure that the drivers you use are compatible with the version that you selected during the creation of the cluster – in our case, MongoDB 4.0. Luckily, MongoDB provides a compatibility matrix at https://docs.mongodb.com/drivers/csharp/.

The necessary NuGet package is called MongoDB.Driver. Installing this package brings several other packages with it, including MongoDB.Driver.Core, MongoDB.Bson, and MongoDB.Libmongocrypt. One of the first things that you may notice is that there is no “Entity Framework” kind of package like we got used to when working with relational databases. That makes sense, because Entity Framework is basically an Object-Relational Mapper (ORM) that helps manage the relationships between different tables – which is exactly what we do not need when working with NoSQL systems. Instead, your table is simply a collection of things. Using the MongoDB drivers still allows you to use well-defined classes when mapping results to objects; it is, however, more of a deserialization process than an ORM process.

When working with DocumentDB, you need to build your connection string using the following format: mongodb://[user]:[password]@[hostname]/[database]?[options]

This means you have five components to the connection string, which we will assemble into a working client in the sketch after this list:

  • user – the user name
  • password – the password
  • hostname – the URL of the cluster to connect to
  • database – optional; the database to connect to
  • options – a set of configuration options
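
As an illustration, the following sketch assembles those components into a client connection. The host, credentials, and database are placeholder values, and the options shown (TLS enabled, replica set rs0, secondary read preference, and retryWrites disabled) are the ones typically used for DocumentDB connections:

using MongoDB.Driver;

// Placeholder values; substitute your own cluster endpoint and credentials
var user = "myUser";
var password = "myPassword";
var hostname = "docdb-cluster.cluster-xxxxxxxxxx.us-west-2.docdb.amazonaws.com:27017";
var options = "tls=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false";

var connectionString = $"mongodb://{user}:{password}@{hostname}/?{options}";
var client = new MongoClient(connectionString);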

It is easy to get the connection string for DocumentDB, however. In the DocumentDB console, clicking on the identifier for the regional cluster will bring you to a summary page. Lower on that page is a Connectivity & security tab that contains examples of approaches for connecting to the cluster. The last one, Connect to this cluster with an application, contains the cluster connection string as shown in Figure 3.

Figure 3. Getting DocumentDB connection string

You should note that the database name is not present in the connection string provided in the console. You can either add it to the connection string or supply it through a separate variable as we do in the example below.

Once you have connected to the database and identified the collection that you want to access, you are able to access it using lambdas (the anonymous functions, not AWS’ serverless offering!). Let’s look at how that works. First, we define the model that we want to use for the deserialization process.

public class Person
{
    // Maps to the document's _id field; the ObjectId is exposed as a string
    [BsonId]
    [BsonRepresentation(BsonType.ObjectId)]
    public string Id { get; set; }

    [BsonElement("First")]
    public string FirstName { get; set; }

    [BsonElement("Last")]
    public string LastName { get; set; }
}

What you’ll notice right away is that we are not able to use a plain old CLR object (POCO); instead, we must provide some MongoDB BSON attributes. Getting access to these requires the following namespaces to be added to the using statements:

using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes; 

There are many different attributes that you can set; here we are using three:

·         BsonId – this attribute defines the document’s primary key, which means it will be the easiest and fastest field on which to retrieve a document.

·         BsonRepresentation – this attribute is primarily there to support ease of use. It allows the developer to pass the parameter as type string instead of having to use an ObjectId structure, as the attribute handles the conversion from string to ObjectId.

·         BsonElement – this attribute maps the property name “Last” in the collection to the object property “LastName” and the property name “First” to the object property “FirstName”.

That means that the following JSON shows a record that would be saved as a Person type (note that the BsonId property is persisted as the document’s _id):

{
 "_id": "61a5e0e2f1a2b3c4d5e6f7a8",
 "First": "Bill",
 "Last": "Penberthy"
}

We mentioned earlier that accessing the items within DocumentDB becomes straightforward when using lambdas. Thus, a high-level CRUD service could look similar to the code in Listing 1.

using MongoDB.Driver;

public class PersonService
{
    private readonly IMongoCollection<Person> persons;

    public PersonService()
    {
        var client = new MongoClient("connection string here");
        var database = client.GetDatabase("Production");
        persons = database.GetCollection<Person>("Persons");
    }

    public async Task<List<Person>> GetAsync() =>
        await persons.Find(_ => true).ToListAsync();

    public async Task<Person?> GetByIdAsync(string id) =>
        await persons.Find(x => x.Id == id).FirstOrDefaultAsync();

    public async Task<Person?> GetByLastNameAsync(string name) =>
        await persons.Find(x => x.LastName == name).FirstOrDefaultAsync();

    public async Task CreateAsync(Person person) =>
        await persons.InsertOneAsync(person);

    public async Task UpdateAsync(string id, Person person) =>
        await persons.ReplaceOneAsync(x => x.Id == id, person);

    public async Task RemoveAsync(string id) =>
        await persons.DeleteOneAsync(x => x.Id == id);
}

  Code Listing 1. CRUD service to interact with Amazon DocumentDB

One of the conveniences of working with DocumentDB (and MongoDB, for that matter) is that the database and the collection are created automatically when the first item (or document) is saved. Thus, creating the database and person collection is just as simple as saving your first person. Of course, that means you have to make sure that you define your database and collection names correctly; it also means that you don’t have to worry about your application not connecting if you mistype the database name – you’ll just have a new database!

AWS and In-Memory Databases

In-memory databases are typically used for applications that require “real-time access” to data. In-memory databases deliver this by storing data directly, wait for it, in memory. They attempt to provide microsecond latency to applications for which millisecond latency is not enough, hence the name real-time access.

In-memory databases are faster than traditional databases because they store data in random-access memory (RAM). This lets them rely on a storage manager that requires far fewer CPU instructions, as there is no need for disk I/O. It also lets them use internal optimization algorithms in the query processor that are simpler and faster than those in traditional databases, which must worry about reading various blocks of data rather than using the direct pointers available to in-memory databases. Figure 1 shows how all these pieces interact.

Figure 1. Internals of an in-memory database

Of course, this reliance on RAM also means that in-memory databases are more volatile than traditional databases; data is lost when power is lost or when memory crashes or gets corrupted. This lack of durability (the “D” in ACID) has been the primary blocker to wider usage, with the expense of memory a close second. However, memory has gotten cheaper, and system architectures have evolved to use in-memory databases in ways where the lack of durability is no longer a showstopper, mainly as caches or as a “front end” to a more traditional database. This gives the best of both worlds: real-time responsiveness from an in-memory database and durability from the traditional database. Figure 2 shows an example of how this could work.

Figure 2. Using an in-memory and traditional database together

In Figure 2, we see that the application primarily interacts with the in-memory database to take advantage of its responsiveness, minimizing any kind of lag that a user may notice. In this example, a component of the application, called “Cache Manager,” is responsible for ensuring that the in-memory database and the traditional database stay in sync. This component is responsible for populating the cache at startup and then ensuring that all changes to the in-memory database are replicated through to the traditional database.

Obviously, the more complex the system, the more difficult this cache management may become. Say there is another application that changes the traditional database in the same way. This could be problematic, because the cache manager in your first application may not know about that change, so the data in the in-memory database will become stale. However, having both applications take the same approach with a shared in-memory database can help eliminate this issue, as shown in Figure 3.

Figure 3. Two applications managing their changes to a shared cache

More and more systems are relying on in-memory databases, which is why AWS offers two different services: Amazon ElastiCache and Amazon MemoryDB for Redis. Before we dig too far into these services, let’s add some context by doing a quick dive into the history of in-memory databases, as this can help you understand the approach that AWS took when creating their products.

Memcached, an open-source general-purpose distributed memory caching system, was released in 2003. It is still maintained, and since it was one of the first in-memory databases available, it is heavily used in various systems including YouTube, Twitter, and Wikipedia. Redis is another open-source in-memory database that was released in 2009. It quickly became popular as both a store and a cache and is likely the most used in-memory database in the world. The importance of these two databases will become obvious shortly.

Amazon ElastiCache

Amazon ElastiCache is marketed as a fully managed in-memory caching service. There are two supported ElastiCache engines, Redis and Memcached (see, I told you that these would come up again). Since it is fully managed, ElastiCache eliminates a lot of the work around hardware provisioning, software patching, setup, configuration, monitoring, failure recovery, and backups that you would otherwise have running your own instance of either Redis or Memcached. Using ElastiCache also adds support for event notifications, monitoring, and metrics without you having to do anything other than enable them in the console.

However, there are a few limitations to Amazon ElastiCache that you need to consider before adoption. The first is that any application using an ElastiCache cluster must be running within the AWS environment; ideally within the same VPC. Another limitation is that you do not have the ability to turn the service off without deleting the cluster itself; it is either on or gone.

Let’s do a quick dive into the two different ElastiCache engines.

Amazon ElastiCache for Redis

There are two options to choose from when using the Redis engine, non-clustered and clustered. The primary difference between the two approaches is the number of write instances supported. In non-clustered mode, you can have a single shard (write instance) with up to five read replica nodes. This means you have one instance to which you write and the ability to read from multiple instances. In a clustered mode, however, you can have up to 500 shards with 1 to 5 read instances for each.

When using non-clustered Redis, you scale by adding additional read nodes so that access to stored data is not impacted. However, this leaves the possibility of write access becoming a bottleneck. Clustered Redis takes care of that by creating multiple sets of these read/write combinations, so you can scale out reads, by adding additional read nodes per shard (as with non-clustered mode), and writes, by adding additional shards with their own read nodes. As you can probably guess, a clustered Redis deployment is going to be more expensive than a non-clustered one.

As a developer, the key thing to remember is that since ElastiCache for Redis is a managed service offering on top of Redis, using it in your .NET application means that you will use the open-source .NET drivers for Redis, StackExchange.Redis. This provides you the ability to interact with objects stored within the database. Code Listing 1 shows a very simple cache manager for inserting and getting items in and out of Redis.

using StackExchange.Redis;
using Newtonsoft.Json;

namespace Redis_Demo
{
    public interface ICacheable
    {
        public string Id { get; set; }
    }

    public class CacheManager
    {
        private IDatabase database;

        public CacheManager()
        {
            var connection = ConnectionMultiplexer.Connect("connection string");
            database = connection.GetDatabase();
        }

        public void Insert<T>(T item) where T : ICacheable
        {
            database.StringSet(item.Id, Serialize(item));
        }

        public T Get<T>(string id) where T : ICacheable
        {
            var value = database.StringGet(id);
            // RedisValue is empty when the key was not found
            if (value.IsNullOrEmpty) return default;
            return Deserialize<T>(value);
        }

        private string Serialize(object obj)
        {
            return JsonConvert.SerializeObject(obj);
        }

        private T Deserialize<T>(string obj)
        {
            return JsonConvert.DeserializeObject<T>(obj);
        }
    }
}

Code Listing 1. Saving and retrieving information from Redis

The above code assumes that all items being cached implement the ICacheable interface, which means that there will be an Id property of type string. Since we are using this as our key into the database, we can assume this Id value will likely be a GUID or another guaranteed-unique value; otherwise, we may get some strange behavior.
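
To make that concrete, here is a short, hypothetical round trip through the CacheManager above; the Person class is an assumption made purely for illustration:

using System;

public class Person : ICacheable
{
    public string Id { get; set; }
    public string FirstName { get; set; }
}

// Usage: the Id doubles as the Redis key
var cache = new CacheManager();
var id = Guid.NewGuid().ToString();
cache.Insert(new Person { Id = id, FirstName = "Bill" });
var person = cache.Get<Person>(id);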

Note: There is another very common Redis library that you may see referenced in online samples, ServiceStack.Redis. The main difference is that ServiceStack is a commercially supported product whose free usage has quotas on the number of requests per hour. Going over that quota requires paying for a commercial license.

Now that we have looked at the Redis engine, let’s take a quick look at the Memcached engine.

Amazon ElastiCache for Memcached

Just as with Amazon ElastiCache for Redis, applications using Memcached can use ElastiCache for Memcached with minimal modifications as you simply need information about the hostnames and port numbers of the various ElastiCache nodes that have been deployed.

The smallest building block of a Memcached deployment is a node. A node is a fixed-size chunk of secure, network-attached RAM that runs an instance of Memcached. A node can exist in isolation or in a relationship with other nodes – a cluster. Each node within a cluster must be the same instance type and run the same cache engine. One of the key features of Memcached is that it supports Auto Discovery.

Auto Discovery is the ability for client applications to identify all of the nodes within a cluster and to initiate and maintain connections to them. This means applications don’t have to worry about connecting to individual nodes; instead, you simply connect to a configuration endpoint.

Using ElastiCache for Memcached in a .NET application requires the Memcached drivers: EnyimMemcached for older .NET versions, or EnyimMemcachedCore when working with .NET Core-based applications. Using Memcached in a .NET application is different from many of the other client libraries you have seen, because it is designed to be used through dependency injection (DI) rather than by “newing” up a client like we did for Redis; here we don’t have that option. (Ideally, you would use DI for Redis as well, but we did not take that approach, to keep the sample code simpler.)

The first thing you need to do is register the Memcached client with the container management system. If working with ASP.NET Core, you would do this through the ConfigureServices method within the Startup.cs class and would look like the following snippet.

using System.Collections.Generic;
using Enyim.Caching.Configuration;
using Microsoft.Extensions.DependencyInjection;

public void ConfigureServices(IServiceCollection services)
{
   …
   services.AddEnyimMemcached(o => 
        o.Servers = new List<Server> { 
            new Server {Address = "end point", Port = 11211} 
        });
   …
}

Using another container management system would require the same approach, with the key being the AddEnyimMemcached method to ensure that a Memcached client is registered. This means that the Memcached version of the CacheManager class that we used with Redis would instead look like Code Listing 2.

using Enyim.Caching;

namespace Memcached_Demo
{
    public class CacheManager
    {
        private IMemcachedClient client;
        private int cacheLength = 900;

        public CacheManager(IMemcachedClient memclient)
        {
            client = memclient;
        }

        public void Insert<T>(T item) where T : ICacheable
        {
            client.Add(item.Id, item, cacheLength);
        }

        public T Get<T>(string id) where T : ICacheable
        {
            T value;
            if (client.TryGet<T>(id, out value))
            {
                return value;
            }
            return default(T);
        }
    }
}

Code Listing 2. Saving and retrieving information from Memcached

The main difference you will see is that every item being persisted in the cache has a cache length, the number of seconds that the item will stay in the cache. Memcached uses a lazy caching mechanism, which means values are only deleted when requested or when a new entry is saved and the cache is full. You can turn expiration off by using 0 as the input value; however, Memcached retains the ability to delete items before their expiration time when memory is not available for a save.
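
To see where that injected client comes from in practice, here is a hypothetical ASP.NET Core controller that receives the IMemcachedClient registered above and hands it to the CacheManager; you could just as easily register CacheManager itself with the container:

using Enyim.Caching;
using Microsoft.AspNetCore.Mvc;

public class PersonController : Controller
{
    private readonly CacheManager cache;

    // The container supplies the IMemcachedClient registered by
    // AddEnyimMemcached; we never construct the client ourselves
    public PersonController(IMemcachedClient memclient)
    {
        cache = new CacheManager(memclient);
    }
}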

Choosing Between Redis and Memcached

The different ElastiCache engines solve different needs: Redis, with its support for both clustered and non-clustered implementations, and Memcached each provide a different level of support. Table 1 shows some of the considerations to weigh when evaluating the latest version of each product.

Table 1 – Considerations when choosing between Memcached and Redis

                        Memcached    Redis (non-clustered)    Redis (clustered)
Data types              Simple       Complex                  Complex
Data partitioning       Yes          No                       Yes
Modifiable clusters     Yes          Yes                      Limited
Online resharding       No           No                       Yes
Encryption              No           Yes                      Yes
Data tiering            No           Yes                      Yes
Multi-threaded          Yes          No                       No
Pub/sub capabilities    No           Yes                      Yes
Sorted sets             No           Yes                      Yes
Backup and restore      No           Yes                      Yes
Geospatial indexing     No           Yes                      Yes

An easy way to look at it is this: if you use simple models and care about running large nodes with multiple cores and threads, then Memcached is probably your best choice. If you have complex models and want some support in the database for failover, pub/sub, and backup, then you should choose Redis.

Amazon MemoryDB for Redis

Where Amazon ElastiCache is a managed-service wrapper around an open-source in-memory database, Amazon MemoryDB for Redis is a Redis-compatible database service, which means that it is not Redis but instead a service that accepts many of the same commands a Redis database would. This is very similar to Amazon Aurora, which supports several different interaction approaches: MySQL and PostgreSQL. As of the time of this writing, MemoryDB supported the most recent version of Redis, 6.2.

Because MemoryDB is a fully managed service, creating the instance is straightforward. You create your cluster by determining how many shards you would like to support as well as the number of read-replicas per shard. If you remember the ElastiCache for Redis discussion earlier, this implies that the setup defaults to Redis with clustering enabled.

When looking at pricing at the time of this writing, MemoryDB costs approximately 50% more than the comparable ElastiCache for Redis configuration. What this extra cost buys you is data durability. MemoryDB stores your entire data set in memory and uses a distributed, multi-AZ transactional log to provide that durability, a feature that is unavailable in ElastiCache for Redis. MemoryDB also handles failover to a read replica much better, with failover typically happening in under 10 seconds. There is a performance trade-off, however: the distributed log adds latency, so writes will take several milliseconds longer in MemoryDB than the same operation would in ElastiCache for Redis.

As you can guess, since the Redis APIs are replicated, a developer works with MemoryDB in the same way as with ElastiCache for Redis, using the StackExchange.Redis NuGet package. In-memory databases, whether ElastiCache or MemoryDB, offer extremely fast reads and writes because they cut out much of the complexity of disk I/O and optimize their processing to take advantage of the speed inherent in RAM. Working with them, however, is very similar to working with “regular” document and key/value databases.
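
Connecting looks just as it did earlier; the only wrinkle is that MemoryDB cluster endpoints use TLS by default. A minimal sketch, with a placeholder endpoint, could look like this:

using StackExchange.Redis;

var configuration = new ConfigurationOptions
{
    // Placeholder cluster endpoint; substitute your own
    EndPoints = { "clustercfg.mycluster.xxxxxx.memorydb.us-east-1.amazonaws.com:6379" },
    Ssl = true
};
var connection = ConnectionMultiplexer.Connect(configuration);
var database = connection.GetDatabase();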

WTF are Ledger Databases?

A ledger database is a NoSQL database that provides an immutable, transparent, and cryptographically verifiable transaction log that is owned by a central authority. The key here is that you never actually change the data in the table. What we generally consider an update, replacing the old content with the new content, does not happen in a ledger database. Instead, an update adds a new version of the record; all previous versions still exist, so an update never overwrites existing data. The cryptographic verification comes into play because it ensures that each record is immutable.

Ledger databases are generally used to record economic and financial activity within an organization, such as tracking credits and debits within an account. Other common use cases are generally around workflows, such as tracking the various steps taken on an insurance claim or tracing the movement of an item through a supply-chain network. For those of you who have ever needed to implement audit tables in a relational database, you can start to see the benefit of having all of that managed automatically.

Note: A ledger database may seem very similar to blockchain. However, there is one significant difference: a ledger database is generally a centralized ledger, whereas blockchain is a distributed ledger. There will be times when an organization, such as a bank or financial institution, is not comfortable with a distributed ledger and instead prefers a simpler architecture with the same guarantee of immutable and verifiable data.

Now that we have discussed, at a high level, the features of a ledger database, let’s take a look at AWS’ version, Amazon QLDB.

Amazon QLDB

In a traditional relational database architecture, the approach is to write data into tables as part of a transaction. This transaction is generally stored in a transaction log and includes all the database modifications that were made during the transaction, which allows you to replay the log in the event of a system failure or for data replication. Those logs, however, are generally not immutable and are not designed to give users easy access.

This approach is different with QLDB, as the journal is the central feature of the database. In a practical sense, the journal is structurally similar to a transaction log; however, it takes a write-only approach to storing application data, with all writes – inserts, updates, and deletes – being committed to the journal first. QLDB then uses this journal to interpret the current state of your data by materializing it into queryable, user-defined tables. These tables also provide access to the history of the data, including revisions and metadata.

While a lot of the functionality seems similar, some of the terminology is different. Table 1 shows the mapping between these terms.

Relational Term    QLDB Term
Database           Ledger
Table              Table
Index              Index
Table row          Amazon Ion document
Column             Document attribute
SQL                PartiQL
Audit logs         Journal
Table 1. Terminology map between RDBMS and QLDB

The key difference is that the table row and column from a relational database are replaced with an Amazon Ion document. Amazon Ion is a richly typed, self-describing, hierarchical data serialization format that offers interchangeable binary and text representations. The text format is a superset of JSON and is easy to read and work with, while the binary representation is efficient for storage and transmission. Ion’s rich type system enables unambiguous semantics for data (e.g., a timestamp value can be encoded using the timestamp type); it is this support for rich types that allows an Ion document to conceptually replace a database row.
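
To make that concrete, here is a small, hypothetical record in Ion text format. Note how the timestamp and decimal below are typed values rather than the strings they would have to be in plain JSON:

{
  accountNumber: "100-2345",     // a string
  opened: 2021-07-01T12:00:00Z,  // a typed timestamp, not a string
  balance: 1250.75,              // a decimal value
  active: true
}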

Note – The Ion document format provides a lot of the support lacking in the Amazon Timestream SDK, where columns and values are defined separately. This seems to be an unfortunate example of how various service teams never seem to talk to each other or follow a single standard!

Creating an Amazon QLDB ledger is very simple. In the AWS console, navigate to Amazon QLDB. Here you will be given the ability to Create ledger. Creating a ledger only requires:

1.      Ledger name – the name of your ledger, which must be unique per region.

2.      Ledger permissions mode – there are two choices here, Standard and Allow all. The “Standard” permissions mode is the default and enables fine-grained control over the ledger and its tables using IAM. The “Allow all” mode allows any user with access to the ledger to manage all tables, indexes, and data.

3.      Encrypt data at rest – where you choose the key to use when encrypting data. There is no ability to opt out; all data will be encrypted at rest.

4.      Tags – additional labels describing your resource.

Creating a QLDB Ledger

You can also create a table at this time; however, there is no way through the UI to add an index. So instead, let’s look at how you can do that in .NET, as well as how to add data to your newly created table.

.NET and Amazon QLDB

The first thing you will need to do to work with Amazon QLDB is add the appropriate NuGet packages. The first one you should add is Newtonsoft.Json, which we will use in our examples to manage serialization and deserialization. The next ones are specific to QLDB: AWSSDK.QLDBSession, Amazon.IonObjectMapper, Amazon.QLDB.Driver, and Amazon.QLDB.Driver.Serialization. If you install the Amazon.QLDB.Driver.Serialization package, NuGet will install all the other required QLDB packages as dependencies.

The next step is to build an Amazon.QLDB.Driver.IQldbDriver. You do that with a Builder as shown below.

private IQldbDriver qldbDriver;

AmazonQLDBSessionConfig amazonQldbSessionConfig = new AmazonQLDBSessionConfig();

qldbDriver = QldbDriver.Builder()
    .WithQLDBSessionConfig(amazonQldbSessionConfig)
    .WithLedger(ledgerName)
    .Build();

Note how you cannot just “new” up a driver; instead, you must use dot notation based upon the Builder() method of the QldbDriver. Next, you need to ensure that the table and applicable indexes are set up. The code below contains a constructor that sets up the client and calls the validation code.

private IQldbDriver qldbDriver;
private IValueFactory valueFactory;

private string ledgerName = "ProdDotNetOnAWSLedger";
private string tableName = "Account";

public DataManager()
{
    valueFactory = new ValueFactory();

    AmazonQLDBSessionConfig amazonQldbSessionConfig = new AmazonQLDBSessionConfig();

    qldbDriver = QldbDriver.Builder()
        .WithQLDBSessionConfig(amazonQldbSessionConfig)
        .WithLedger(ledgerName)
        .Build();

    ValidateTableSetup(tableName);
    ValidateIndexSetup(tableName, "AccountNumber");
}

As you can see below, checking for the existence of the table is simple, as the driver has a ListTableNames method that you can use to determine the presence of your table. If the table doesn’t exist, you process the query to create it. Let’s pull that method out and examine it, because executing queries like this is how you will do all of your interactions with QLDB.

private void ValidateTableSetup(string tableName)
{
    if (!qldbDriver.ListTableNames().Any(x => x == tableName))
    {
        qldbDriver.Execute(y => { y.Execute($"CREATE TABLE {tableName}"); });
    }
}

The Execute method accepts a function; in this case, we used a lambda function that executes a “CREATE TABLE” command. The code takes a similar approach to creating the index; however, you have to go through more steps to determine whether the index has already been created, as you must first query a schema table and then parse through the list of indexes on that table. That code is shown below.

private void ValidateIndexSetup(string tableName, string indexField)
{
    var result = qldbDriver.Execute(x =>
    {
        IIonValue ionTableName = this.valueFactory.NewString(tableName);
        return x.Execute("SELECT * FROM information_schema.user_tables WHERE name = ?", ionTableName);
    });

    var resultList = result.ToList();
    bool isListed = false;

    if (resultList.Any())
    {
        // Walk the table's index definitions looking for the field
        IIonList indexes = resultList.First().GetField("indexes");
        foreach (IIonValue index in indexes)
        {
            string expr = index.GetField("expr").StringValue;
            if (expr.Contains(indexField))
            {
                isListed = true;
                break;
            }
        }
    }
    if (!isListed)
    {
        qldbDriver.Execute(y => y.Execute($"CREATE INDEX ON {tableName}({indexField})"));
    }
}

Once you have the ledger, table, and any indexes you may want set up, the next step is to save some data into the table. As with everything else in QLDB, a save is basically a query that is executed. A simple save method, along with its helper for converting an item to an Ion value, is shown below:

public void Save(object item)
{
    qldbDriver.Execute(x =>
        {
            x.Execute($"INSERT INTO {tableName} ?", ToIonValue(item));
        }
    );
}

private IIonValue ToIonValue(object obj)
{
    return IonLoader.Default.Load(JsonConvert.SerializeObject(obj));
}

Both this example and the earlier index-existence check have some dependence on Ion. Remember, Ion is based upon richly typed JSON, so it expects values to be similarly represented, which is why there are two different approaches for converting an item to an Ion value (even something as simple as the tableName) for comparison. The first is the Amazon.IonDotnet.Tree.ValueFactory object, as shown in the index-validation snippet, while the other is the Amazon.IonDotnet.Builders.IonLoader, as shown in the ToIonValue method.
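
The examples in this section also assume a simple Account model. A minimal sketch of what that class might look like, with properties inferred from the queries used here, is shown below:

public class Account
{
    public string AccountNumber { get; set; }
    public string AccountHolder { get; set; }
    public decimal Balance { get; set; }
}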


Getting data out of the database is a little bit trickier, as you want it cast into the appropriate data model. The approach for retrieving an item from the database and converting it into a plain old CLR object (POCO) is shown below.

public List<T> Get<T>(string accountNumber)
{
    IIonValue ionKey = valueFactory.NewString(accountNumber);

    return qldbDriver.Execute(x =>
    {
        // Query<T> deserializes each returned Ion document into T
        IQuery<T> query = x.Query<T>(
            $"SELECT * FROM {tableName} as d WHERE d.AccountNumber = ?",
            ionKey);
        var results = x.Execute(query);
        return results.ToList();
    });
}

You can see that this is a two-step process. First, you create a generic query defined with the SELECT text and the appropriate model for the return set; you then Execute that query. This brings back an Amazon.QLDB.Driver.Generic.IResult<T> object that is converted into a list of the requested items matching the AccountNumber passed into the function.

That’s a quick review of the general functionality of QLDB. Let’s look at one more specific case, the history of an item – one of the prime drivers for using a ledger database. You do this with the history function, a PartiQL extension that returns revisions from the system-defined view of your table. The syntax for using the history function is:

SELECT * FROM history( table [, `start-time` [, `end-time` ] ] ) AS h
[ WHERE h.metadata.id = 'id' ]

The start time and end time values are optional. Using a start time selects any version that was active at the start time plus any subsequent versions up until the end time. If no start time is provided, all versions of the document are retrieved, and the current time is used in the query if the end time is not explicitly defined. However, as you can see, the optional WHERE clause uses a metadata.id value – which we have not talked about yet. This value is used by QLDB to uniquely identify each document and can be accessed through the metadata of the document. One way to get this information is shown below:

public string GetMetadataId(Account item)
{
    IIonValue ionKey = valueFactory.NewString(item.AccountNumber);
    IEnumerable<IIonValue> result = new List<IIonValue>();

    qldbDriver.Execute(x =>
    {
        result = x.Execute(
            $"SELECT d.metadata.id AS id FROM _ql_committed_{tableName} AS d WHERE d.data.AccountNumber = ?",
            ionKey);
    });

    IIonValue working = result.First();
    return working.GetField("id").StringValue;
}

This code snippet accesses the “committed” view, one of the different views QLDB provides for getting information about your data. Taking the result from that query and plugging it into the previous snippet, where we queried using the history function, will get you the history of that document for the time period defined in the query. Thus, you can access the current version as well as all of the history, all by executing SQL-like commands.
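
Putting these pieces together, the following sketch retrieves the full revision history for an account by looking up its metadata.id and feeding that into the history function. Keep in mind that each row returned by history() is a revision wrapper containing data and metadata fields, so the model type T needs to account for that structure:

public List<T> GetHistory<T>(Account item)
{
    // Find the QLDB-assigned document id, then filter the
    // history() results to just this document's revisions
    string metadataId = GetMetadataId(item);
    IIonValue ionId = valueFactory.NewString(metadataId);

    return qldbDriver.Execute(x =>
    {
        IQuery<T> query = x.Query<T>(
            $"SELECT * FROM history({tableName}) AS h WHERE h.metadata.id = ?",
            ionId);
        return x.Execute(query).ToList();
    });
}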

There is a lot more we can go into about using Amazon QLDB and .NET together, probably a whole book! To recap, however, Amazon QLDB is a ledger database, which means it provides an immutable, transparent, and cryptographically verifiable transaction log that you can access and evaluate independently from the active version.

Time Series Databases

A time-series database is a database optimized for time-stamped, or time-series, data. Typical examples of this kind of data are application performance monitoring logs, network data, clicks in an online application, stock market trading, IoT sensor data, and other use cases where the time of the data is important. So far this seems like something pretty much any database can manage, right? Well, the difference comes in the way the data will be accessed.

All of the NoSQL and relational databases we have covered depend upon the existence of a key, generally some kind of string or integer. In a time-series database, however, the timestamp is the most important piece of information. This means that applications that look at data within a time range, where the time range itself is crucial, are well served by a time-series database: that data is stored together (much like relational databases store data in order by primary key), so getting a range of data based on time will be faster than getting data stored under a different type of key. Figure 1 shows how this works.

Figure 1. Time-series data stored by timestamp

A typical characteristic of time-series data is that arriving data is generally recorded as a new entry; updating that data is the exception rather than the rule. Another characteristic is that the data typically arrives in timestamp order, so key scanning is minimized and writes can be very performant. A third characteristic is that the non-timestamp data is stored as a value and is rarely included as part of the filtering in a query. Scaling is also especially important in time-series databases, because there tends to be a lot of frequently created data stored in this type of database. The last characteristic, as already mentioned, is that time is the primary axis; time-series databases tend to include specialized functions for dealing with time series, including aggregation, analysis, and time calculations.

Amazon Timestream

Now that we have briefly defined the concept behind a time-series database, let us look at the AWS entry into the field: Amazon Timestream. Timestream is a serverless, automatically scaling database service that can store trillions of events per day and is many times faster than a relational database at a fraction of the cost. Timestream uses a purpose-built query engine that lets you access and analyze recent and historical data together within a single query, and it supports built-in time-series analytics functions.

Creating an Amazon Timestream database is one of the simplest processes that you will find in AWS. All you need to do is to provide a Database name and decide on the Key Management Service (KMS) key to use when encrypting your data (all Timestream data is encrypted by default). And that is it. Creating a table within the database is almost as simple; provide a Table name and then configure the Data retention policies.

These data retention policies are designed to help you manage the lifecycle of your time-series data. The assumption is that there are two levels of data: data that is frequently used in small segments for analysis, such as “the last 5 minutes,” and data that will be accessed much less often but in larger ranges, such as “last month.” Timestream helps you manage that lifecycle by automatically storing data in two tiers, an in-memory store and a magnetic store, and the data retention policies allow you to configure when that transition should happen. Figure 2 shows an example where data is persisted in memory for 2 days and then transferred to magnetic storage and kept there for 1 year.

Figure 2. Configuring the Data Retention policies for a Timestream table
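
If you prefer to set this up programmatically, the retention values can also be supplied when creating the table through the Write SDK client. The sketch below mirrors the console example, using placeholder database and table names:

using Amazon.TimestreamWrite;
using Amazon.TimestreamWrite.Model;

public async Task CreateTableWithRetentionAsync()
{
    using var writeClient = new AmazonTimestreamWriteClient();

    var request = new CreateTableRequest
    {
        DatabaseName = "SensorData",   // placeholder names
        TableName = "Measurements",
        RetentionProperties = new RetentionProperties
        {
            MemoryStoreRetentionPeriodInHours = 48,   // 2 days in memory
            MagneticStoreRetentionPeriodInDays = 365  // 1 year on magnetic store
        }
    };
    var response = await writeClient.CreateTableAsync(request);
}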

Now that we have looked at creating a Timestream database and table, the next step is to look at how the data is managed within the database as this will help you understand some of the behavior you will see when using the database programmatically. The following are key concepts in Timestream:

·         Dimension – describes metadata as key/value pairs; a record can have zero to many dimensions. An example could be the location and type of a sensor.

·         Measure – the actual value being measured, as a key/value pair.

Now that we have briefly discussed Timestream databases, our next step is to look at using it in a .NET application.

.NET and Amazon Timestream

We have spent some time accessing database services on AWS that are designed to emulate some other product, such as Redis. Timestream does not take that approach. Instead, you use the AWS SDK for .NET to write data to and read data from Timestream. The interesting thing is that AWS chose to break this data access down into two discrete .NET clients for Timestream: the Write SDK client that persists data to the table and the Query SDK client that returns data from the table.

Let’s start our journey into Timestream by looking at a method in the code snippet below that saves an item, in this case, a Measurement.

1  using System;
2  using System.Collections.Generic;
3  using System.Threading.Tasks;
4  using Amazon.TimestreamWrite;
5  using Amazon.TimestreamWrite.Model;
6
7  public class Measurement
8  {
9     public Dictionary<string, double> KeyValues { get; set; }
10    public string Source { get; set; }
11    public string Location { get; set; }
12    public DateTime Time { get; set; }
13  }
14
15 public async Task Insert(Measurement item)
16 {
17    var writeClient = new AmazonTimestreamWriteClient();
18
19    List<Dimension> dimensions = new List<Dimension>
20    {
21        new Dimension {Name = "location", Value = item.Location}
22    };
23
24    Record commonAttributes = new Record
25    {
26        Dimensions = dimensions,
27        MeasureValueType = MeasureValueType.DOUBLE,
28        Time = ConvertToTimeString(item.Time)
29    };
30
31   List<Record> records = new List<Record>();
32
33    foreach (string key in item.KeyValues.Keys)
34    {
35        var record = new Record
36        {
37            MeasureName = key,
38            MeasureValue = item.KeyValues[key].ToString()
39        };
40        records.Add(record);
41    }
42
43    var request = new WriteRecordsRequest
44    {
45        DatabaseName = databaseName,
46        TableName = tableName,
47        CommonAttributes = commonAttributes,
48        Records = records
49    };
50
51    var response = await writeClient.WriteRecordsAsync(request);
52    // do something with result, such as evaluate HTTP status
53 }

When persisting data into Timestream you will use a WriteRecordsRequest, as shown on Line 43 of the above code. This object contains information about the table and database to use, as well as CommonAttributes and Records. The CommonAttributes property expects a Record, a set of fields common to all of the items contained in the Records property; think of it as a way to minimize the data sent over the wire as well as to act as an initial value grouping. In this case, our common attributes contain the Dimensions (discussed earlier), the MeasureValueType, which defines the type of data being submitted (although it will be converted to a string for the submission), and the Time. This leaves only the list of measurement values to be put into the Records property. The response from the WriteRecordsAsync method includes an HTTP status code that indicates success or failure.
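
The ConvertToTimeString helper called on Line 28 is not shown above. Timestream expects the record time as a string containing the epoch value, and the default TimeUnit is milliseconds, so a minimal sketch of that helper could look like this:

private static string ConvertToTimeString(DateTime time)
{
    // Epoch milliseconds, as a string, matches the default TimeUnit
    return new DateTimeOffset(time.ToUniversalTime())
        .ToUnixTimeMilliseconds()
        .ToString();
}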

Pulling data out is quite different; mainly because the process of fetching data is done by passing a SQL-like query string to the AmazonTimestreamQueryClient. This means that a simple query to retrieve some information could look like this:

SELECT location, measure_name, time, measure_value::double
FROM {databaseName}.{tableName}
WHERE measure_name='type of measurement you are looking for'
ORDER BY time DESC
LIMIT 10

All of which will be recognizable if you have experience with SQL.

This is all well and good, but the power of this type of database really shines when you start to do time-specific queries. Consider the following query that finds the average measurement value, aggregated together over 30-second intervals (binned), for a specific location (one of the dimensions saved with each submission) over the past 2 hours.

SELECT BIN(time, 30s) AS binned_timestamp,
    ROUND(AVG(measure_value::double), 2) AS avg_measurementValue,
    location
FROM {databaseName}.{tableName}
WHERE measure_name = 'type of measurement you are looking for'
    AND location = '{LOCATION}'
    AND time > ago(2h)
GROUP BY location, BIN(time, 30s)
ORDER BY binned_timestamp ASC

As you can probably imagine, there are many more complicated functions that are available for use in Timestream.

Note: These time series functions include functions that will help fill in missing data, interpolations, functions that look at the rate of change for a metric, derivatives, volumes of requests received, integrals, and correlation functions for comparing two different time series. You can find all the time series functions at https://docs.aws.amazon.com/timestream/latest/developerguide/timeseries-specific-constructs.functions.html.

Once you have built the query, running the query is pretty simple, as shown below:

public async Task<QueryResponse> RunQueryAsync(string queryString)
{
    try
    {
        var queryRequest = new QueryRequest { QueryString = queryString };
        QueryResponse queryResponse =
                 await queryClient.QueryAsync(queryRequest);
        return queryResponse;
    }
    catch (Exception)
    {
        // In production code, log the failure rather than
        // silently swallowing it
        return null;
    }
}

However, the complication comes from trying to interpret the results – the QueryResponse object that is returned by the service. This is because you are simply passing in a query string that could be doing any kind of work, so the response needs to be able to manage that. Figure 3 shows the properties on the QueryResponse object.

Figure 3. Object definition of the QueryResponse object

There are five properties in the QueryResponse. QueryStatus returns information about the status of the query, including progress and bytes scanned. The QueryId and NextToken properties are used together to support pagination when the result set is larger than the default length. ColumnInfo provides details on the column data types of the returned result set, and Rows contains the result set itself.

You can see a simple request/result as captured in Telerik Fiddler Classic in Figure 4.

Figure 4. Amazon Timestream query response in JSON

There are three areas called out. Area 1 is the query that was sent; as you can see, it is a simple “SELECT ALL” query with a limit of 2 rows. Area 2 shows the ColumnInfo property in JSON, with each item in the array corresponding to one of the ScalarValues found in the arrays of Data that make up the Rows property, which is called out as Area 3.

Looking at this, you can probably see ways in which you could transform the data into JSON that will deserialize nicely into a C# class. Unfortunately, AWS did not provide this functionality as part of the SDK.
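
If you want that behavior, you have to build it yourself. The following sketch flattens each row into a dictionary keyed by column name; it handles scalar values only, which is enough for the queries shown above (nested types such as arrays and time series would need extra handling):

using System.Collections.Generic;
using Amazon.TimestreamQuery.Model;

public static List<Dictionary<string, string>> FlattenRows(QueryResponse response)
{
    var rows = new List<Dictionary<string, string>>();
    foreach (Row row in response.Rows)
    {
        // Each Datum in the row lines up with the same index in ColumnInfo
        var item = new Dictionary<string, string>();
        for (int i = 0; i < response.ColumnInfo.Count; i++)
        {
            item[response.ColumnInfo[i].Name] = row.Data[i].ScalarValue;
        }
        rows.Add(item);
    }
    return rows;
}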

What kind of “NoSQL” are you talking about?

Too many times I hear NoSQL databases talked about as if they are a single way of “doing business.” However, there are multiple types of NoSQL databases, each with its own specializations and approaches to persisting and accessing information. So just saying “use NoSQL” isn’t really all that useful.

With that in mind, let’s take a look at the various choices that you have to consider. The main flavors of NoSQL databases are document databases, key-value stores, column-oriented databases, and graph databases. There are also databases that are a combination of these, and even some other highly specialized databases that do not depend on relationships and SQL, so they could probably be called NoSQL as well. That being said, let’s look at those four primary types of NoSQL databases.

Document Databases

A document database stores data in JSON (as shown in Figure 1), BSON, or XML. BSON stands for “Binary JSON,” and its binary structure allows for the encoding of type and length information. This encoding tends to allow the data to be parsed much more quickly, and it helps support indexing and querying into the properties of the document being stored.

Figure 1. Document database interpretation of object

Documents are typically stored in a format that is much closer to the data model being worked with in the application. The examples used above show this clearly. Additional functionality supported by document databases tends to center on ways of indexing and/or querying data within the document. The Id will generally be the key used for primary access, but imagine a case where you want to search the data for all instances of a Person with a FirstName of “Steve”. In SQL that is easy, because that data is segregated into its own column. It is more complex in a NoSQL database, because the query must now go rummaging around the data within the document, which, as you can probably imagine, will be slow without a lot of work around query optimization. Common examples of document databases include MongoDB, Amazon DocumentDB, and Microsoft Azure Cosmos DB.

Key-Value Stores

A key-value store is the simplest of the NoSQL databases. In these types of databases, every element in the database is stored as a simple key-value pair made up of an attribute, or key, and the value, which can be anything. Literally, anything. It can be a simple string, a JSON document like is stored in a document database, or even something very unstructured like an image or some other blob of binary data.

The difference between Document databases and Key-Value databases is generally their focus. Document databases focus on ways of parsing the document so that users can do things like making decisions about data within that document. Key-value databases are very different. They care much less what the value is, so they tend to focus more on storing and accessing the value as quickly as possible and making sure that they can easily scale to store A LOT of different values. Examples of Key-Value databases include Amazon DynamoDB, BerkeleyDB, Memcached, and Redis.

Those of you familiar with the concept of caching, where you create a high-speed data storage layer that stores a subset of data so that accessing that data is more responsive than calling directly into the database, will see two products that are commonly used for caching, Memcached and Redis. These are excellent examples of the speed vs. document analysis trade-off between these two types of databases.

Column-Oriented Databases

Yes, the name of this type of NoSQL database may initially be confusing, because you may think about how data in relational databases is broken down into columns. However, the data is accessed and managed differently. In a relational database, an item is best thought of as a row, with each column containing an aspect or value referring to a specific part of that item. A column-oriented database instead stores data tables by column rather than by row.

Let’s take a closer look at that since that sounds pretty crazy. Figure 2 shows a snippet of a data table like you are used to seeing in a relational database system, where the RowId is an internal value used to refer to the data (only within the system) while the Id is the external value used to refer to the data.

Figure 2. Data within a relational database table, stored as a row.

Since row-oriented systems are designed to efficiently return data for a row, this data could be saved to disk as:

1:100,Bob,Smith;
2:101,Bill,Penberthy;

In a column-oriented database, however, this data would instead be saved as

100:1,101:2;
Bob:1,Bill:2;
Smith:1,Penberthy:2

Why would they take that kind of approach? Mainly, speed of access when pulling a subset of values for an “item”. Let’s look at an index in a row-based system. An index is a structure associated with a database table to speed the retrieval of rows from that table. Typically, the index sorts the list of values so that it is easier to find a specific value. If you consider the data in Figure 2 and know that it is very common to query on a specific LastName, then creating a database index on the LastName field means that a query looking for “Penberthy” would not have to scan every row in the table to find the data, as that price was paid when building the index. Thus, an index on the LastName field would look something like the snippet below, where all LastNames have been alphabetized and stored independently:

Penberthy:2;
Smith:1

If you look at this definition of an index, and look back at the column-oriented data, you will see a very similar approach. What that means is that when you access a limited subset of the data, or the data is only sparsely populated, a column-oriented database will most likely be much faster than a row-based database. If, on the other hand, you tend to use a broader set of data, then storing that information in a row would likely be much more performant. Examples of column-oriented systems include Apache Kudu, Amazon Redshift, and Snowflake.
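
To see the mechanics without any database at all, here is a toy C# illustration (a sketch, not a real storage engine) using the same data as the snippets above; note how a LastName filter never touches the FirstName column:

using System.Collections.Generic;

public class ToyColumnStore
{
    // One array per column, aligned by position (row 0, row 1, ...).
    private readonly int[] _ids = { 100, 101 };
    private readonly string[] _firstNames = { "Bob", "Bill" };
    private readonly string[] _lastNames = { "Smith", "Penberthy" };

    // Filtering on LastName scans a single contiguous array; the FirstName
    // column is never read, which is the column-store win for narrow queries.
    public IEnumerable<int> IdsWithLastName(string lastName)
    {
        for (int i = 0; i < _lastNames.Length; i++)
            if (_lastNames[i] == lastName)
                yield return _ids[i];
    }
}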

Graph Databases

The last major type of NoSQL database is the graph database. A graph database focuses on storing nodes and relationships. A node is an entity, is generally stored with a label, such as Person, and contains a set of key-value pairs, or properties. That means you can think of a node as being very similar to a document stored in a document database. However, a graph database takes this much further, as it also stores information about relationships.

A relationship is a directed and named connection between two different nodes, such as Person Knows Person. A relationship always has a direction, a starting node, an ending node, and a type of relationship. A relationship also can have a set of properties, just like a node. Nodes can have any number or type of relationships and there is no effect on performance when querying those relationships or nodes. Lastly, while relationships are directed, they can be navigated efficiently in either direction.

Let’s take a closer look at this. From the example above, there are two people, Bill and Bob, who are friends. Both Bill and Bob watched the movie “Black Panther”. This means there are 3 nodes, with 4 known relationships:

  • Bill is friends with Bob
  • Bob is friends with Bill
  • Bill watched “Black Panther”
  • Bob watched “Black Panther”     

Figure 3 shows how this is visualized in a graph model.

Figure 3. Graph model representation with relationships

Since both nodes and relationships can contain additional information, we can include more information in each node, such as a LastName for each Person or a ReleaseDate for the Movie, and we can add additional information to a relationship, such as a Date for when each person watched the movie or when they became friends.

Graph databases tend to have their own way of querying data because you can be pulling data based upon both nodes and relationships. There are 3 common query approaches:

  • Gremlin – part of the Apache TinkerPop project, used for creating and querying property graphs. A query to get everyone Bill knows would look like this:

g.V().has("FirstName","Bill").
  out("Is_Friends_With").
  values("FirstName")

  • SPARQL – a World Wide Web Consortium-supported declarative language for graph pattern matching. A query to get everyone Bill knows would look like this:

PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#>
SELECT ?FirstName
WHERE {
  ?me vcard:FirstName "Bill" .
  ?me vcard:Is_Friends_With ?friend .
  ?friend vcard:FirstName ?FirstName
}

  • openCypher – a declarative query language for property graphs that was originally developed by Neo4j and then open-sourced. A query to get everyone Bill knows would look like this:

MATCH (me)-[:Is_Friends_With*1]-(remote_friend)
WHERE me.FirstName = 'Bill'
RETURN remote_friend.FirstName

The most common graph databases are Neo4j, ArangoDB, and Amazon Neptune.
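
For .NET developers, the Gremlin traversal shown above can be issued from C# with the Gremlin.Net driver, which is also the typical way to talk to Amazon Neptune. A minimal sketch, where the endpoint is a placeholder and the labels and properties are carried over from the example:

using System;
using Gremlin.Net.Driver;
using Gremlin.Net.Driver.Remote;
using static Gremlin.Net.Process.Traversal.AnonymousTraversalSource;

public static class FriendQuery
{
    public static void Run()
    {
        // Placeholder endpoint; for Neptune this is your cluster endpoint on port 8182.
        var client = new GremlinClient(
            new GremlinServer("your-neptune-endpoint", 8182, enableSsl: true));
        var g = Traversal().WithRemote(new DriverRemoteConnection(client));

        // Everyone Bill is friends with, mirroring the Gremlin example above.
        var names = g.V().Has("FirstName", "Bill")
                     .Out("Is_Friends_With")
                     .Values<string>("FirstName")
                     .ToList();

        foreach (var name in names)
            Console.WriteLine(name);
    }
}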

Next time – don’t just say “let’s put that in a NoSQL database.” Instead, say what kind of NoSQL database makes sense for that combination of data and business need. That way, it is obvious you know what you are saying and have put some thought into the idea!

Selecting a BI Data Stack – Snowflake vs Azure Synapse Analytics

This article is designed to help choose the appropriate technical tools for building out a Data Lake and a Data Warehouse.

Components

A Data Lake is a centralized repository that allows you to store structured, semi-structured, and unstructured data at scale. The idea is that copies of all your transactional data are gathered, stored, and made available for additional processing, whether it is transformed and moved into a data warehouse or made available for direct usage for analytics such as machine learning or BI reporting.

A Data Warehouse, on the other hand, is a digital storage system that connects and harmonizes large amounts of data from many different sources. Its purpose is to feed business intelligence (BI), reporting, and analytics, and support regulatory requirements – so companies can turn their data into insight and make smart, data-driven decisions. Data warehouses store current and historical data in one place and act as the single source of truth for an organization.

While a data warehouse is a database structured and optimized to provide insight, a data lake is different because it stores relational data from line of business applications and non-relational data from other systems. The structure of the data, or schema, is not defined when data is captured. This means you can store all your data without careful design or the need to know what questions you might need answers for in the future. Different types of analytics on your data, like SQL queries, big data analytics, full text search, real-time analytics, and machine learning, can be used to uncover insights.

General Requirements 

There are a lot of different ways to implement a data lake. The simplest is a plain object storage system such as Amazon S3. Another is a combination of systems, such as tying together a database system in which to store relational data and an object storage system to store unstructured or semi-structured data, or adding in stream processing, or a key/value database, or any number of other products that handle parts of the job. With all these different approaches potentially fitting the definition of “data lake”, the first thing we must do is define more precisely the data lake requirements that we will use for the comparison.

  • Relational data – one requirement is that the data stack include multiple SaaS systems that will provide data to the data lake. These systems default to a relational/object-based structure where objects are consistently defined and are inter-related. This means that this information will be best served by storing it within a relational database system. Examples of systems providing this type of data include Salesforce, Marketo, OpenAir, and NetSuite.
  • Semi-structured data – some of the systems that will act as sources for this data lake contain semi-structured data, which is data that has some structure but does not necessarily conform to a data model. Semi-structured data may mostly “look like each other” but each piece of data may have different attributes or properties. Data from these sources could be transformed into a relational form, but this transformation gets away from storing the data in its original state. Examples of sources of this kind of data include logs, such as Splunk or application logs.
  • Unstructured data – Unstructured data is data which does not conform to a data model and has no easily identifiable structure, such as images, videos, documents, etc. At this time, there are no identified providers of unstructured data that is expected to be managed within the data lake.

Thus, fulfilling this set of data lake requirements means support for both relational\structured and semi-structured data are “must-haves”, and the ability to manage unstructured data is a “nice-to-have”.

The next set of requirements, which applies to both data lakes and data warehouses, is around being able to add, access, and extract data, with many more requirements around managing data as it is being read and extracted. These requirements are:

  • Input – The data lake must be able to support write and edit access from external systems. The system must support authentication and authorization as well as the ability to automate data movement without having to rely on internal processes. Thus, a tool requiring a human to click through a UI is not acceptable.
  • Access – the data lake must be able to support different ways of accessing data from external sources, so the tool:
    • must be able to support the processing of SQL (or similar approaches) for queries to limit/filter data being returned
    • must be supported as a data source for Tableau, Excel, and other reporting systems
    • should be able to give control over the number of resources used when executing queries to have control over the performance/price ratio

Top Contenders

There are a lot of products available for consideration that support either, or both, data lake and data warehouse needs. However, one of the key decisions was to only look at offerings that were available as a managed service in the cloud, rather than products installed and maintained on-premises, and that could support both data lake and data warehouse needs. With that in mind, we narrowed the list down to five contenders: 1) Google BigQuery, 2) Amazon Redshift, 3) Azure Synapse Analytics, 4) Teradata Vantage, and 5) Snowflake Data Cloud, based upon the Forrester Wave™ on Cloud Data Warehouses study as shown below in Figure 1.

Figure 1 – Forrester Wave on Cloud Data Warehouses

The Forrester scoring was based upon 25 different criteria, of which we pulled out the seven that were the most important to our defined set of requirements and major use cases (in order of importance):

  1. Data Ingestion and Loading – The ability to easily insert and transform data into the product
  2. Administration Automation – An ability to automate all the administration tasks, including user provisioning, data management, and operations support
  3. Data Types – The ability to support multiple types of data, including structured, semi-structured, unstructured, files, etc.
  4. Scalability Features – The ability to easily and automatically scale for both storage and execution capabilities
  5. Security Features – The ability to control access to data stored in the product. This includes management at database, table, row, and column levels.
  6. Support – The strength of support offered by the vendor, including training, certification, knowledge base, and a support organization.
  7. Partners – The strength of the partner network, other companies that are experienced in implementing the software as well as creating additional integration points and/or features.
Criterion                     Google    Azure Synapse  Teradata  Snowflake   Amazon
                              BigQuery  Analytics      Vantage   Data Cloud  Redshift
Data Ingestion and Loading    5.00      3.00           3.00      5.00        5.00
Administration Automation     3.00      3.00           3.00      5.00        3.00
Data Types                    5.00      5.00           5.00      5.00        5.00
Scalability Features          5.00      5.00           3.00      5.00        3.00
Security Features             3.00      3.00           3.00      5.00        5.00
Support                       5.00      5.00           3.00      5.00        5.00
Partners                      5.00      1.00           5.00      5.00        5.00
Average of Forrester Scores   4.43      3.57           3.57      5.00        4.43
Table 1 – Base Forrester scores for selected products

This shows a difference between the overall scores, based on the full list of 25 criteria, and the scores based on those areas where our requirements pointed to the most need. Snowflake was an easy selection for the in-depth final evaluation, but the choice of which other product to dig deeper into was more difficult. In this first evaluation we compared Snowflake to Azure Synapse Analytics. Other comparisons will follow.

Final Evaluation

There are two additional areas in which we dive deeper into the differences between Synapse and Snowflake: cost and execution processing. Execution processing is important because the more granular the ability to manage execution resources, the more control the customer has over performance in their environment.

Cost

Calculating cost for data platforms generally comes down to four different areas: data storage, processing memory, processing CPUs, and input/output. There are two different approaches to Azure’s pricing for Synapse, Dedicated and Serverless. Dedicated is broken down into two values: storage and an abstract representation of compute resources and performance called a Data Warehouse Unit (DWU). The Serverless consumption model, on the other hand, charges by TB of data processed. Thus, a query that scans the entire database would be very expensive while small, very concise queries will be relatively inexpensive. Snowflake has taken a similar approach to the Dedicated model, with its abstract representation called a Virtual Warehouse (VW). We will build a pricing example that includes both Data Lake and Data Warehouse usage, using the following customer definition:

  • 20 TBs of data, adding 50 GB a week
  • 2.5 hours a day on a 2 CPU, 32GB RAM machine to support incoming ETL work.
  • 10 users working 8 hours, 5 days a week, currently supported by a 4 CPU, 48GB RAM machine
Product        Description           Price     Notes
Synapse Ded.   Storage Requirements  $  4,600  20 TB * $23 (per TB-month) * 12 months
Synapse Ded.   ETL Processing        $  2,190  400 credits * 2.5 hours * 365 days * $0.006 per credit
Synapse Ded.   User Processing       $ 11,520  1,000 credits * 8 hours * 240 days * $0.006 per credit
Synapse Ded.   Total                 $ 18,310
Synapse Serv.  Storage (Serverless)  $  4,560  20 TB * $19 (per TB-month) * 12 months
Synapse Serv.  Compute               $  6,091  ((5 TB * 240 days) + (0.05 TB * 365 days)) * $5 per TB
Synapse Serv.  Total                 $ 10,651
Snowflake      Storage Requirements  $  1,104  4 TB (20 TB uncompressed) * $23 (per TB-month) * 12 months
Snowflake      ETL Processing        $  3,650  2 credits * 2.5 hours * 365 days * $2 per credit
Snowflake      User Processing       $ 15,360  4 credits * 8 hours * 240 days * $2 per credit
Snowflake      Total                 $ 20,114

As an estimate, before considering any discounts, Synapse is approximately 10% less expensive than Snowflake when using Dedicated consumption and almost 50% less costly when using Serverless consumption. That second number is very much an estimate, because Serverless charges by the TB processed in a query, so large or poorly constructed queries can be exponentially more expensive. However, there is some cost remediation in the Snowflake plan because the movement of data from the data lake to the data warehouse is simplified. Snowflake’s structure and custom SQL language allow the movement of data between databases using SQL, so there is only the one-time processing fee for the SQL execution rather than paying for the execution of a query to get data out of one system and the execution to get data into the other. Synapse offers some of the same capability, but much of that is outside of the data management system, and the advantage gained in the process is less than that offered by Snowflake.
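
For transparency, the annual totals in the cost table reduce to straightforward arithmetic; this C# snippet simply re-derives them from the stated rates (illustrative list prices, not quotes):

using System;

// Synapse Dedicated: storage + ETL credits + user credits
double synapseDedicated = 4600                        // storage line from the table
                        + 400 * 2.5 * 365 * 0.006     // ETL:  $2,190
                        + 1000 * 8 * 240 * 0.006;     // user: $11,520 => $18,310

// Synapse Serverless: storage + TB scanned
double synapseServerless = 20 * 19 * 12                     // storage: $4,560
                         + ((5 * 240) + (0.05 * 365)) * 5;  // compute: $6,091 => $10,651

// Snowflake: compressed storage + ETL credits + user credits
double snowflake = 4 * 23 * 12           // storage: $1,104
                 + 2 * 2.5 * 365 * 2     // ETL:     $3,650
                 + 4 * 8 * 240 * 2;      // user:    $15,360 => $20,114

Console.WriteLine($"Synapse Dedicated:  {synapseDedicated:C0}");
Console.WriteLine($"Synapse Serverless: {synapseServerless:C0}");
Console.WriteLine($"Snowflake:          {snowflake:C0}");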

Execution Processing

The ability to flexibly scale the number of resources that you use when executing a command is invaluable. Both products do this to some extent. Snowflake, for example, uses the concept of a Warehouse, which is simply a cluster of compute resources such as CPU, memory, and temporary storage. A warehouse can be either running or suspended, and charges accrue only when running. Snowflake’s flexibility is built around these warehouses: the user can provision warehouses of different sizes and then select the desired warehouse, or processing level, when connecting to the server to perform work, as shown in Figure 2. This allows for matching the processing to the query being run. Selecting a single item from the database by its primary key may not need the most powerful (expensive) warehouse, but a report doing complete table scans across TBs of data could use the boost. Another example would be a scheduled report, where the processing taking slightly longer is not an issue.

Figure 2 – Switching Snowflake warehouses to change dedicated compute resources for a query
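
From .NET, this warehouse selection is just part of the session, so it can be changed per connection or even per statement. A minimal sketch using the Snowflake.Data connector, where the account, credentials, warehouse names, and table are all placeholder assumptions:

using System;
using Snowflake.Data.Client;

// Connection string values are placeholders; "warehouse" sets the session default.
using var conn = new SnowflakeDbConnection();
conn.ConnectionString =
    "account=myaccount;user=report_user;password=...;" +
    "db=ANALYTICS;schema=PUBLIC;warehouse=XS_DAILY";
conn.Open();

using var cmd = conn.CreateCommand();

// Switch to a larger warehouse just for the heavy, full-scan report...
cmd.CommandText = "USE WAREHOUSE XL_REPORTING";
cmd.ExecuteNonQuery();

cmd.CommandText = "SELECT COUNT(*) FROM big_fact_table";  // hypothetical table
Console.WriteLine(cmd.ExecuteScalar());

// ...then drop back down for routine, small lookups.
cmd.CommandText = "USE WAREHOUSE XS_DAILY";
cmd.ExecuteNonQuery();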

Figure 3 shows the connection management screen in the integration tool Boomi, with the arrow pointing towards where you can set the desired warehouse when you connect to Snowflake.

Figure 3 – Boomi connection asking for Warehouse name

Synapse takes a different approach. You either use the Serverless model, where you have no control over the processing resources, or the Dedicated model, where you determine the level of resource consumption when you create the SQL pool that will store your data. You do not have the ability to scale the execution resources up or down afterwards.

Due to this flexibility, Snowflake has the advantage in execution processing.

Recommendation

In summary, Azure Synapse Analytics has an advantage on cost while Snowflake has an advantage in the flexibility with which you can manage system processing resources. When bringing in the previous scoring provided by Forrester, Snowflake has the advantage based on its strength around data ingestion and administration automation, both areas that help speed delivery as well as ease the burden of operational support. Based on this analysis, my recommendation would be to use Snowflake for the entire data stack, including both the Data Lake and the Data Warehouse, rather than Azure Synapse Analytics.

Integrated Development Environment (IDE) Toolkits for AWS

As a developer myself, I have spent significant time working in so-called integrated development environments (IDEs) or source code editors. Many a time I’ve heard the expression “don’t make me leave my IDE” in relation to the need to perform some management task! For those of you with similar feelings, AWS offers integrations, known as “toolkits”, for the most popular IDEs and source code editors in use today in the .NET community – Microsoft Visual Studio, JetBrains Rider, and Visual Studio Code.

The toolkits vary in their levels of functionality and the areas of development they target. All three, however, share the ability to easily package up and deploy your application code to a variety of AWS services.

AWS Toolkit for Visual Studio

Ask any longtime developer working with .NET and it’s almost certain they will have used Visual Studio. For many .NET developers it could well be the only IDE they have ever worked with in a professional environment. It’s estimated that almost 90% of .NET developers are using Visual Studio for their .NET work which is why AWS has supported an integration with Visual Studio since Visual Studio 2008. Currently, the AWS toolkit is available for the Community, Professional, and Enterprise editions of Visual Studio 2017 and Visual Studio 2019. The toolkit is available on the Visual Studio marketplace.

The AWS Toolkit for Visual Studio, as the most established of the AWS integrations, is also the most functionally rich and supports working with features of multiple AWS services from within the IDE environment. First, a tool window – the AWS Explorer – surfaces credential and region selection, along with a tree of services commonly used by developers as they learn, experiment, and work with AWS. You can open the explorer window using an entry on the IDE’s View menu. Figure 1 shows a typical view of the explorer, with its tree of services and controls for credential profile and region selection.

Figure 1. The AWS Explorer

The combination of selected credential profile and region scopes the tree of services and service resources within the explorer, and resource views opened from the explorer carry that combination of credential and region scoping with them. In other words, if the explorer is bound to (say) US East (N. Virginia) and you open a view onto EC2 instances, that view shows only the instances in the US East (N. Virginia) region owned by the user represented by the selected credential profile. If the selected region or credential profile in the explorer later changes, the instances shown in the document view window do not; the view remains bound to the original credential and region selection.

Expanding a tree node (service) in the explorer will display a list of resources or resource types, depending on the service. In both cases, double clicking a resource or resource type node, or using the Open command in the node’s context menu, will open a view onto that resource or resource type. Consider the Amazon DynamoDB, Amazon EC2, and Amazon S3 nodes, shown in Figure 2.

Figure 2. Explorer tree node types

AWS Toolkit for JetBrains Rider

Rider is a relatively new, and increasingly popular, cross-platform IDE from JetBrains. JetBrains are the creators of the popular ReSharper plugin, and the IntelliJ IDE for Java development, among other tools. Whereas Visual Studio runs solely on Windows (excluding the “special” Visual Studio for Mac), Rider runs on Windows, Linux, and macOS.

Unlike the toolkit for Visual Studio, which is more established and has a much broader range of supported services and features, the toolkit for Rider focuses on features to support the development of serverless and container-based modern applications. You can install the toolkit from the JetBrains marketplace by selecting Plugins from the Configure link in Rider’s startup dialog.

To complement the toolkit’s features, you do need to install a couple of additional dependencies. First, the AWS Serverless Application Model (SAM) CLI because the Rider toolkit uses the SAM CLI to build, debug, package, and deploy serverless applications. In turn, SAM CLI needs Docker to be able to provide the Lambda-like debug environment. Of course, if you are already working on container-based applications you’ll likely already have this installed.

With the toolkit and dependencies installed, let’s first examine the AWS Explorer window, to compare it to the Visual Studio toolkit. Figure 3 shows the explorer, with some service nodes expanded.

Figure 3. The AWS Explorer in JetBrains Rider

We can see immediately that the explorer gives access to fewer services than the explorer in Visual Studio; this reflects the Rider toolkit’s focus on serverless and container development. However, it follows a familiar pattern of noting your currently active credential profile, and region, in the explorer toolbar.

Controls in the IDE’s status bar link to the credential and region fields in the explorer. This enables you to see at a glance which profile and region are active without needing to keep the explorer visible (this isn’t possible in Visual Studio, where you need to open the explorer to see the credential and region context). Figure 4 shows the status bar control in action to change region. Notice that the toolkit also keeps track of your most recently used profile and region, to make changing back-and-forth super quick.

Figure 4. Changing region or credentials using the IDE’s status bar

AWS Toolkit for Visual Studio Code

Visual Studio Code (VS Code) is an editor with plugin extensions, rather than a full-fledged IDE in the style of Visual Studio or Rider. However, the sheer range of available extensions makes it a more than capable development environment for multiple languages, including C# / .NET development on Windows, Linux, and macOS systems.

Like the toolkit for Rider, the VS Code toolkit focuses on the development of modern serverless and container-based applications. The toolkit offers an explorer pane with the capability to list resources across multiple regions, similar to the single-region explorers available in the Visual Studio and Rider toolkits. The VS Code toolkit also offers local debugging of Lambda functions in a Lambda-like environment. As with Rider, the toolkit uses the AWS SAM CLI to support debugging and deployment of serverless applications, so you do need to install this dependency, and Docker as well, to take advantage of debugging support.

Credentials are, once again, handled using profiles, and the toolkit offers a command palette item that walks you through setting up a new profile if no profiles already exist on your machine. If you have existing profiles, the command simply loads the credential file into the editor, where you can paste the keys to create a new profile.
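
For reference, the file being edited is the standard shared credentials file (~/.aws/credentials on Linux and macOS, %USERPROFILE%\.aws\credentials on Windows); a profile entry looks like the following, shown here with AWS’s documentation example keys rather than real ones:

[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY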

Figure 5 shows some of the available commands for the toolkit in the command palette.

Figure 5. Toolkit commands in the command palette

Figure 6 highlights the active credential profile and an explorer bound to multiple regions. Clicking the status bar indicator enables you to change the bound credentials. You can show or hide additional regions in the explorer from a command, or by using the button on the explorer’s toolbar.

Figure 6. Active profile indicator and multi-region explorer

Obviously, I have not really touched on each of the toolkits in much detail. I will be doing that in future articles, where I go much deeper into the capabilities, strengths, and weaknesses of the various toolkits and how they may affect your ability to interact with AWS services directly from within your IDE. Know now, however, that if you are a .NET developer who uses one of these common IDEs (yes, there are still some devs that do development in Notepad), there is an AWS toolkit that will help you as you develop.

.NET and Containers on AWS (Part 2: AWS App Runner)

The newest entry into AWS container management removes much of the configuration and management that you would otherwise need when working with containers. App Runner is a fully managed service that automatically builds and deploys the application as well as creating the load balancer. App Runner also manages scaling up and down based upon traffic. What do you, as a developer, have to do to get your container running in App Runner? Let’s take a look.

First, log into AWS and go to the App Runner console home. If you click on the All Services link you will notice a surprising AWS decision: App Runner is found under the Compute section rather than the Containers section, even though its purpose is to easily run containers. Click on the Create an App Runner Service button to get the Step 1 page as shown below:

Creating an App Runner service

The first section, Source, requires you to identify where the container image that you want to deploy is stored. At the time of this writing, you can choose either a container registry (Amazon ECR) or a source code repository. Since we have already loaded an image into ECR, let us move forward with that option by ensuring Container registry and Amazon ECR are selected, and then clicking the Browse button to bring up the image selection screen as shown below.

Selecting a container image from ECR

In this screen we selected the “prodotnetonaws” image repository that we created in the last post and the container image with the tag of “latest”.

Once you have completed the Source section, the next step is to determine the Deployment settings for your container. Here, your choices are to use the Deployment trigger of Manual, which means that you must fire off each deployment yourself using the App Runner console or the AWS CLI, or Automatic, where App Runner watches your repository, deploying the new version of the container every time the image changes. In this case, we will choose Manual so that we have full control of the deployment.

Warning: When your deployment settings are set to Automatic, every update to the image will trigger a deployment. This may be appropriate in a development or even test environment, but it is unlikely that you will want this behavior in a production setting.

The last item that you need to enter on this page is the ECR access role that App Runner will use to access ECR. In this case, we will select Create new service role and App Runner will pre-fill a suggested role name. Click the Next button when completed.

The next step is entitled Configure service and is designed to, surprisingly enough, help you configure the service. There are five sections on this page: Service settings, Auto scaling, Health check, Security, and Tags. Only the first section is expanded; all of the others need to be expanded before you can see their options.

The first section, Service settings, with default settings can be seen below.

Service settings in App Runner

Here you set the Service name, select the Virtual CPU & memory, configure any optional Environment variables that you may need, and determine the TCP Port that your service will use. If you are using the sample application that we loaded into ECR in the last post, you will need to change the port value from the default 8080 to port 80 so that it will serve the application we configured in the container. You also have the ability, under Additional configuration, to add a Start command which will be run on launch. This is generally left blank if you have configured the entry point within the container image. We gave the service the name “ProDotNetOnAWS-AR” and left the rest of the settings in this section at their defaults.

The next section is Auto scaling, and there are two major options, Default configuration and Custom configuration, each of which provide the ability to set the auto scaling values as shown below.

Setting the Auto scaling settings in App Runner

The first of these auto scaling values is Concurrency. This value represents the maximum number of concurrent requests that an instance can process before App Runner scales up the service. The default configuration sets this at 100 requests; you can customize the value by using the Custom configuration setting.

The next value is Minimum size, the number of instances that App Runner provisions for your service regardless of concurrent usage. This means that there may be times when some of these provisioned instances are not being used. You will be charged for the memory usage of all provisioned instances, but only for the CPU of those instances that are actually handling traffic. The default configuration sets the minimum size to 1 instance.

The last value is Maximum size. This value represents the maximum number of instances to which your service will scale; once your service reaches the maximum size there will be no additional scaling no matter the number of concurrent requests. The default configuration for maximum size is 25 instances.

If any of the default values do not match your needs, you will need to create a custom configuration, which will give you control over each of these configuration values. To do this, select Custom configuration. This will display a drop-down that contains all of the App Runner configurations you have available (initially only “DefaultConfiguration”) and an Add new button. Clicking this button will bring up the entry screen as shown below.

Customizing auto scaling in App Runner

The next section after you configure auto scaling is Health check. The first value you set in this section is the Timeout, which describes the amount of time, in seconds, that the load balancer will wait for a health check response. The default timeout is 5 seconds. You can also set the Interval, which is the number of seconds between health checks of each instance and defaults to 10 seconds. Finally, you can set the Unhealthy and Healthy thresholds, where the unhealthy threshold is the number of consecutive health check failures after which an instance is considered unhealthy and needs to be recycled, and the healthy threshold is the number of consecutive successful health checks necessary for an instance to be considered healthy. The defaults for these values are 5 checks for unhealthy and 1 check for healthy.

You next can assign an IAM role to the instance in the Security section. This IAM role will be used by the running application if it needs to communicate to other AWS services, such as S3 or a database server. The last section is Tags, where you can enter one or more tags to the App Runner service.
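
If you would rather script this setup than click through the console, the same configuration can be expressed with the AWS SDK for .NET (the AWSSDK.AppRunner NuGet package). The following is a hedged sketch rather than a definitive recipe: the role ARN, account ID, and region are placeholders, and the values mirror the console choices described above:

using System;
using System.Threading.Tasks;
using Amazon.AppRunner;
using Amazon.AppRunner.Model;

public static class CreateAppRunnerService
{
    public static async Task RunAsync()
    {
        var client = new AmazonAppRunnerClient();

        var response = await client.CreateServiceAsync(new CreateServiceRequest
        {
            ServiceName = "ProDotNetOnAWS-AR",
            SourceConfiguration = new SourceConfiguration
            {
                AutoDeploymentsEnabled = false,   // the "Manual" trigger from the console
                AuthenticationConfiguration = new AuthenticationConfiguration
                {
                    // Placeholder ARN for the ECR access role created earlier
                    AccessRoleArn = "arn:aws:iam::123456789012:role/AppRunnerECRAccessRole"
                },
                ImageRepository = new ImageRepository
                {
                    ImageIdentifier = "123456789012.dkr.ecr.us-east-2.amazonaws.com/prodotnetonaws:latest",
                    ImageRepositoryType = ImageRepositoryType.ECR,
                    ImageConfiguration = new ImageConfiguration { Port = "80" }
                }
            },
            InstanceConfiguration = new InstanceConfiguration
            {
                Cpu = "1 vCPU",      // Virtual CPU & memory from Service settings
                Memory = "2 GB"
            },
            // Omitting AutoScalingConfigurationArn uses the default scaling configuration
            HealthCheckConfiguration = new HealthCheckConfiguration
            {
                Timeout = 5,            // seconds; console defaults throughout
                Interval = 10,
                UnhealthyThreshold = 5,
                HealthyThreshold = 1
            }
        });

        Console.WriteLine(response.Service.ServiceUrl);
    }
}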

Once you have finished configuring the service, clicking the Next button will bring you to the review screen. Clicking the Create and deploy button on this screen will give the approval for App Runner to create the service, deploy the container image, and run it so that the application is available. You will be presented with the service details page and a banner that informs you that “Create service is in progress.” This process will take several minutes, and when completed will take you to the properties page as shown below.

After App Runner is completed

Once the service is created and the status is displayed as Running, you will see a value for a Default domain which represents the external-facing URL. Clicking on it will bring up the home page for your containerized sample application.

There are five tabs displayed under the domain, Logs, Activity, Metrics, Configuration, and Custom Domain. The Logs section displays the Event, Deployment, and Application logs for this App Runner service. This is where you will be able to look for any problems during deployment or running of the container itself. You should be able, under the Event log section, to see the listing of events from the service creation. The Activity tab is very similar, in that it displays a list of activities taken by your service, such as creation and deployment.

The next tab, Metrics, tracks metrics related to the entire App Runner service. This is where you will be able to see information on HTTP connections and requests, as well as track changes in the number of used and unused instances. By going into the sample application (at the default domain) and clicking around the site, you should see these values change and a graph become available that provides insight into these various activities.

The Configuration tab allows you to view and edit many of the settings that we set in the original service creation. There are two different sections where you can make edits. The first is at the Source and deployment level, where you can change the container source and whether the deployment will be manual or automatically happen when the repository is updated. The second section where you can make changes is at the Configure service level where you are able to change your current settings for autoscaling, health check, and security.

The last tab on the details page is Custom domain. The default domain will always be available for your application; however, it is likely that you will want other domain names pointed to it – we certainly wouldn’t want to use https://SomeRandomValue.us-east-2.awsapprunner.com for our company’s web address. Linking domains is straightforward from the App Runner side. Simply click the Link domain button and input the custom domain that you would like to link; we of course used “prodotnetonaws.com”. Note that this does not include “www”, because that usage is currently not supported through the App Runner console. Once you enter the custom domain name, you will be presented with the Configure DNS page as shown below.

Configuring a custom domain in App Runner

This page contains a set of certificate validation records that you need to add to your Domain Name System (DNS) so that App Runner can validate that you own or control the domain. You will also need to add CNAME records to your DNS to target the App Runner domain: one record for the custom domain and another for the www subdomain, if desired. Once the certificate validation records are added to your DNS, the custom domain status will become Active and traffic will be directed to your App Runner instance. This validation can take anywhere from minutes up to 48 hours.

Once your App Runner instance is up and running, there are several actions that you can take on it, shown in the upper-right corner of the service details page pictured a couple of images back. The first is the orange Deploy button. This will deploy the container from either the container or source code repository, depending upon your configuration. You can also Delete the service, which is straightforward, as well as Pause the service. There are some things to consider when you pause your App Runner service. The first is that your application will lose all state, much as if you were deploying a new service. The second consideration is that if you are pausing your service because of a code defect, you will not be able to redeploy a new (presumably fixed) container without first resuming the service.

.NET and Containers on AWS (Part 1: ECR)

Amazon Elastic Container Registry (ECR) is a system designed to support storing and managing Docker and Open Container Initiative (OCI) images and OCI-compatible artifacts. ECR can act as a private container image repository, where you can store your own images; as a public container image repository, for managing publicly accessible images; and as a way to manage access to other public image repositories. ECR also provides lifecycle policies that allow you to manage the lifecycles of images within your repository, image scanning so that you can identify any potential software vulnerabilities, and cross-region and cross-account image replication so that your images can be available wherever you need them.

As with the rest of AWS services, ECR is built on top of other services. For example, Amazon ECR stores container images in Amazon S3 buckets so that server-side encryption comes by default. Or, if needed, you can use server-side encryption with KMS keys stored in AWS Key Management Service (AWS KMS), all of which you can configure as you create the registry. As you can likely guess, IAM manages access rights to the images, supporting everything from strict rules to anonymous access to support the concept of a public repository.

There are several different ways to create a repository. The first is through the ECR console by selecting the Create repository button. This will take you to the Create repository page as shown below:

Through this page you can set the Visibility settings, Image scan settings, and Encryption settings. There are two visibility settings, Private and Public. Private repositories are managed through permissions in IAM and are part of the AWS Free Tier, with 500 MB-month of storage for one year. Public repositories are openly visible and available for open pulls. Amazon ECR offers 50 GB-month of always-free storage for public repositories, and you can transfer 500 GB of data to the internet for free from a public repository each month anonymously (without using an AWS account). If you authenticate to a public repository on ECR, you can transfer 5 TB of data to the internet for free each month, and you get unlimited free bandwidth when transferring data from a public repository in ECR to AWS compute resources in any region.

Enabling Scan on push means that every image that is uploaded to the repository will be scanned. This scanning is designed to help identify any software vulnerabilities in the uploaded image. Scanning also automatically runs every 24 hours, but turning this on ensures that the image is checked before it can ever be used. The scanning is done using the Common Vulnerabilities and Exposures (CVEs) database from the Clair project, outputting a list of scan findings.

Note: Clair is an open-source project that was created for the static analysis of vulnerabilities in application containers (currently including OCI and docker). The goal of the project is to enable a more transparent view of the security of container-based infrastructure – the project was named Clair after the French term which translates to clear, bright, or transparent.

The last section is Encryption settings. When this is enabled, as shown below, ECR will use AWS Key Management Service (KMS) to manage the encryption of the images stored in the repository, rather than the default encryption approach.

You can use either the default settings, where ECR creates a default key (with an alias of aws/ecr) or you can Customize encryption settings and either select a pre-existing key or create a new key that will be used for the encryption.

Other approaches for creating an ECR repo

Just as with all the other services that we have talked about so far, there is the UI-driven way to build an ECR repository, like we just went through, and then several other approaches.

AWS CLI

You can create an ECR repository in the AWS CLI using the create-repository command as part of the ECR service.

C:\>aws ecr create-repository ^
    --repository-name prodotnetonaws ^
    --image-scanning-configuration scanOnPush=true ^
    --region us-east-1

You can control all of the basic repository settings through the CLI just as you can when creating the repository through the ECR console, including assigning encryption keys; the command’s response includes the URI for the newly created repository.

AWS Tools for PowerShell

And, as you probably aren’t too surprised to find out, you can also create an ECR repository using AWS Tools for PowerShell:

C:\> New-ECRRepository `
    -RepositoryName prodotnetonaws `
    -ImageScanningConfiguration_ScanOnPush true

Just as with the CLI, you have the ability to fully configure the repository as you create it.
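
AWS SDK for .NET

And if you would rather stay in code, you can create the same repository from a .NET application using the AWS SDK for .NET (the AWSSDK.ECR NuGet package). A minimal sketch, with the scan-on-push setting matching the CLI example above:

using System;
using System.Threading.Tasks;
using Amazon.ECR;
using Amazon.ECR.Model;

public static class CreateEcrRepository
{
    public static async Task RunAsync()
    {
        var client = new AmazonECRClient();   // uses your default profile and region

        var response = await client.CreateRepositoryAsync(new CreateRepositoryRequest
        {
            RepositoryName = "prodotnetonaws",
            ImageScanningConfiguration = new ImageScanningConfiguration
            {
                ScanOnPush = true
            }
        });

        // The generated URI is what you will use to push and pull images.
        Console.WriteLine(response.Repository.RepositoryUri);
    }
}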

AWS Toolkit for Visual Studio

You can also create a repository using the AWS Toolkit for Visual Studio, although you must depend upon the toolkit’s built-in default values because the only thing that you can control through the AWS Explorer is the repository name, as shown below. As you may notice, the AWS Explorer does not have its own node for ECR and instead puts the Repositories sub-node under the Amazon Elastic Container Service (ECS) node. This is a legacy from the past, before ECR became its own service, but it is still an effective way to access and work with repositories in Visual Studio.

Once you create a repository in VS, going to the ECR console and reviewing the repository that was created will show you that it used the default settings, so it was a private repository with both “Scan on push” and “KMS encryption” disabled.

At this point, the easiest way to show how this will all work is to create an image and upload it into the repository. We will then be able to use this container image as we go through the various AWS container management services.

Note: You will not be able to complete many of these exercises without Docker already running on your machine. You will find download and installation instructions for Docker Desktop at https://www.docker.com/products/docker-desktop. Once you have Docker Desktop installed you will be able to locally build and run container images.

We will start by creating a simple .NET ASP.NET Core sample web application in Visual Studio through File -> New Project and selecting the ASP.NET Core Web App (C#) project template. You then name your project and select where to store the source code. Once that is completed you will get a new screen that asks for additional information as shown below. The checkbox for Enable Docker defaults as unchecked, so make sure you check it and then select the Docker OS to use, which in this case is Linux.

This will create a simple solution that includes a Docker file as shown below:

If you look at the contents of that generated Docker file you will see that it is very similar to the Docker file that we went through earlier, containing the instructions to restore and build the application, publish the application, and then copy the published application bits to the final image, setting the ENTRYPOINT.

# Base runtime image; the published app will run on ASP.NET Core 5.0
FROM mcr.microsoft.com/dotnet/aspnet:5.0 AS base
WORKDIR /app
EXPOSE 80
EXPOSE 443

# Build stage: restore NuGet packages and compile the project
FROM mcr.microsoft.com/dotnet/sdk:5.0 AS build
WORKDIR /src
COPY ["SampleContainer/SampleContainer.csproj", "SampleContainer/"]
RUN dotnet restore "SampleContainer/SampleContainer.csproj"
COPY . .
WORKDIR "/src/SampleContainer"
RUN dotnet build "SampleContainer.csproj" -c Release -o /app/build

# Publish stage: produce the deployable output
FROM build AS publish
RUN dotnet publish "SampleContainer.csproj" -c Release -o /app/publish

# Final image: copy the published bits onto the runtime base and set the entry point
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "SampleContainer.dll"]

If you look at your build options in Visual Studio as shown below, you will see additional ones available for containers. The Docker choice, for example, will work through the Docker file and start the final container image within Docker and then connect the debugger to that container so that you can debug as usual.

The next step is to create the container image and persist it into the repository. To do so, right click the project name in the Visual Studio Solution Explorer and select Publish Container to AWS to bring up the Publish Container to AWS wizard as shown below.

The image above shows that the repository that we just created is selected as the repository for saving, and the Publish only the Docker image to Amazon Elastic Container Registry option in the Deployment Target was selected (these are not the default values for each of these options). Once you have this configured, click the Publish button. You’ll see the window in the wizard grind through a lot of processing, then a console window may pop up to show you the actual upload of the image, and then the wizard window will automatically close if successful.

Logging into Amazon ECR and going into the prodotnetonaws repository (the one we uploaded the image into), as shown below, will demonstrate that there is now an image available within the repository with a latest tag, just as configured in the wizard. You can click the icon with the Copy URI text to get the URL that you will use when working with this image. We recommend that you go ahead and do this at this point and paste it somewhere easy to find as that is the value you will use to access the image.
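
Under the covers, the wizard is doing the same work that you could do by hand with the AWS CLI and Docker; a sketch, where the account ID (123456789012), region, and local image name (samplecontainer) are placeholders:

C:\> aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
C:\> docker tag samplecontainer:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/prodotnetonaws:latest
C:\> docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/prodotnetonaws:latest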

Now that you have a container image stored in the repository, we will look at how you can use it in the next article.

Refactor vs Replacement: a Journey

Journeys around the refactoring of application and system architectures are dangerous.  Every experienced developer probably has a horror story around an effort to evolve an application that has gone wrong.  Because of this, any consideration around the refactoring effort of an application or system needs to start with a frank evaluation as to whether or not that application has hit the end of its lifecycle and should be retired.  You should only consider a refactor if you are convinced that the return on investment for that refactor is higher than the return on a rewrite or replace. Even then, it should not be an automatic decision.

First, let us define a refactor.  The multiple ways in which this word is used can lead to confusion.  The first approach is best described by Martin Fowler, who defined it as “A change made to the internal structure of software to make it easier to understand and cheaper to modify without changing its observable behavior.”  That is not the appropriate definition in this case.  Instead, refactor in this case refers to a process that evolves the software through multiple iterations, with each iteration getting closer to the end state.

One of the common misconceptions of an application refactor is that the refactor will successfully reduce problematic areas. However, studies have shown that refactoring has a significantly high chance of introducing new problem areas into the code (Cedrim et al., 2017). This risk increases if there is other work going on within the application; an application that requires continuing feature or defect-resolution work will be more likely to have a higher number of defects caused by the refactoring process (Ferreira et al., 2018). Thus, the analysis regarding refactoring an application has to contain both an evaluation of the refactor work and the integration work for new functionality that may span both refactored and unrefactored work. Unanticipated consequences when changing or refactoring applications are so common that many estimation processes try to account for them. Unfortunately, these processes only try to allow for the time necessary to understand and remediate those consequences rather than trying to determine where they will occur.

Analyzing an application for replacement

Understanding the replacement cost of an application is generally easier than analyzing the refactor cost because there tends to be a deeper understanding of the development process as it pertains to new software as opposed to altering software; it is much easier to predict side effects in a system as you build it from scratch.  This approach also allows for “correction of error” where previous design decisions may no longer be appropriate.  These “errors” could be the result of changes in technology, changes in business, or simply changes in the way that you apply software design patterns to the business need; it is not a judgement that the previous approach was wrong, just that it is not currently the most appropriate approach.

Understanding the cost of replacing an application then becomes a combination of new software development cost plus some level of opportunity cost. Using opportunity cost is appropriate because the developers working on the replacement application are no longer able to work on the current, to-be-replaced version of the application. Obviously, when considering an application for replacement, the size and breadth of scope of the application is an important driver, because the more use cases a system has to satisfy, the more expensive it will be to replace. Lastly, the implementation approach matters as well. An organization that can run two applications in parallel, phasing processing from one to the other, will incur less expense than an organization that turns the old system off at the moment it turns the new system on. Being able to phase the processing that way typically means you can revert to the previous version up until the last phase is complete, while a hard switchover makes it much less likely that you can simply “turn the old system back on.”

Analyzing the Analysis

A rough understanding of the two approaches, and the costs and benefits of each, will generally lead to a clear “winner” where one approach is demonstrably superior to the other. In the remaining cases, you should give the advantage to the approach that demonstrates closer adherence to modern architectural design, which will most likely be the replacement application. This becomes the tiebreaker because of its impact on scalability, reliability, security, and cost, all of which define a modern cloud-based architecture. This circles back to my earlier point on how the return on investment for a refactor must be higher than the return on investment for a rewrite/replace, as there is always a larger error when estimating the effort of a refactor as compared to green-field development. You need to consider this fact when you evaluate the ROI of refactoring your application. Otherwise, you will mistakenly refactor when you should have replaced, and that will lead to another horror story. Our job is already scary enough without adding to the horror…

References

Cedrim, D., Gheyi, R., Fonseca, B., Garcia, A., Sousa, L., Ribeiro, M., Mongiovi, M., de Mello, R., Chavez, A. (2017). Understanding the impact of refactoring on smells: A longitudinal study of 23 software projects. Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE’17), Paderborn, Germany, September 4–8, 2017. https://doi.org/10.1145/3106237.3106259

Ferreira, I., Fernandes, E., Cedrim, D., Uchoa, A., Bibiano, A., Garcia, A., Correia, J., Santos, F., Nunes, G., Barbosa, C., Fonseca, B., de Mello, R. (2018). Poster: The buggy side of code refactoring: Understanding the relationship between refactoring and bugs. 40th International Conference on Software Engineering (ICSE 2018), Posters Track, Gothenburg, Sweden. https://doi.org/10.1145/3183440.3195030