Amazon DocumentDB and .NET

Amazon DocumentDB is a fully managed, scalable, and highly available document database service that uses a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64 TB per cluster. Amazon DocumentDB reduces database I/O by persisting only write-ahead logs rather than full buffer page syncs, avoiding slow, inefficient, and expensive data replication across network links.

This design separates compute and storage, which means you can scale each independently. DocumentDB is designed for 99.99% availability and replicates six copies of your data across three Availability Zones. It also continually monitors cluster instance health and automatically fails over to a read replica in the event of a failure, typically in less than 30 seconds. You can start with a single instance that delivers high durability and then, as you grow, add a second instance for high availability and increase the number of read-only instances. You can scale read capacity to millions of requests per second by scaling out to 15 low-latency read replicas spread across three Availability Zones.

One of the interesting characteristics is that DocumentDB was built with MongoDB compatibility. Why is that important?

MongoDB

MongoDB is a source-available, cross-platform document database that was first released in 2009 and is, by far, the most popular document database available. Because it was both the first and freely available, it tended to be adopted early by companies running on premises and experimenting with document stores, so a huge number of on-premises software systems rely completely, or in part, on MongoDB. Any movement of those systems to the cloud therefore has to support the software written around MongoDB; otherwise, the migration faces a massive impediment.

AWS recognized this and released DocumentDB with compatibility for the MongoDB APIs that were available at the time. At the time of this writing (and also at release), DocumentDB implements the MongoDB 3.6 and 4.0 APIs. However, MongoDB released version 5.0 in July 2021, so there is no feature parity. Also, since AWS did not add support for the interim MongoDB releases 4.2, 4.4, 4.4.5, and 4.4.6, it appears that there will not be additional support moving forward.

While compatibility with the most recent versions of MongoDB is not supported, the design of DocumentDB, together with optimizations like advanced query processing, connection pooling, and optimized recovery and rebuild, provides a very performant system. AWS claims that DocumentDB achieves twice the throughput of currently available MongoDB managed services.

Setting up a DocumentDB Database

Now that we have a bit of understanding about DocumentDB, let’s go set one up. Log in to the console, and either search for DocumentDB or find it using Services > Database > Amazon DocumentDB. Select the Create Cluster button on the dashboard to bring up the Create cluster page as shown in Figure 1.

Figure 1. Create Amazon DocumentDB Cluster screen

First, fill out the Configuration section by adding a Cluster identifier. This identifier must be unique among DocumentDB clusters within this region. Next, you’ll see a dropdown for the Engine version. This controls the MongoDB API version that you will use for the connection, which becomes important when selecting and configuring the database drivers in your .NET application. We recommend 4.0 because it gives you the ability to support transactions. The next two fields are Instance class and Number of instances. This is where you define the compute and memory capacity of each instance as well as how many instances you will have. For the Instance class, the very last selection in the dropdown is db.t3.medium, which is eligible for the free tier, so we selected that class. For the number of instances, you can have a single instance that handles both reads and writes, or multiple instances, which gives you a primary instance plus replicas. We chose two (2) instances so that we can see both a primary and a replica instance.

You’ll see a profound difference if you compare this screen to the ones we saw when working with RDS: this screen is much simpler and seems to give you much less control over how you set up your cluster. Finer control over the configuration is available, however; you just need to click the slider at the lower left labeled Show advanced settings. Doing that will bring up the Network settings, Encryption, Backup, Log exports, Maintenance, Tags, and Deletion protection configuration sections that you will be familiar with from the RDS chapters.

Once you complete the setup and click the Create cluster button you will be returned to the Clusters list screen that will look like Figure 2.

Figure 2. Clusters screen immediately after creating a cluster

As Figure 2 shows, there are initial roles for a Regional cluster and a Replica instance. As the creation process continues, the regional cluster is created first, and then the first instance changes role to become the Primary instance. When creation is complete, there will be one regional cluster, one primary instance, and any number of replica instances. Since we chose two instances when creating the cluster, Figure 2 shows one replica instance being created. When the cluster is fully available, you will also see the Regions in which the cluster is available. At this point, you will have a cluster on which to run your database, but you will not yet have a database to which you can connect. As you will see below when we start using DocumentDB in our code, that’s ok.

DocumentDB and .NET

When looking at using this service within your .NET application, you need to consider that like Amazon Aurora, DocumentDB emulates the APIs of other products. This means that you will not be accessing the data through AWS drivers, but instead will be using MongoDB database drivers within your application. It is important to remember, however, that your database is not built using the most recent version of the MongoDB API, so you must ensure that the drivers that you do use are compatible with the version that you selected during the creation of the cluster, in our case MongoDB 4.0. Luckily, however, MongoDB provides a compatibility matrix at https://docs.mongodb.com/drivers/csharp/.

The necessary NuGet package is called MongoDB.Driver. Installing this package will bring several other packages with it, including MongoDB.Driver.Core, MongoDB.Bson, and MongoDB.Libmongocrypt. One of the first things that you may notice is that there is not an “Entity Framework” kind of package like we got used to when working with relational databases. And that makes sense, because Entity Framework is basically an Object-Relational Mapper (ORM) that helps manage the relationships between different tables – which is exactly what we do not need when working with NoSQL systems. Instead, your table is simply a collection of things. However, using the MongoDB drivers still allows you to use well-defined classes when mapping results to objects. It is, however, more of a deserialization process than an ORM process.

When working with DocumentDB, you need to build your connection string using the following format: mongodb://[user]:[password]@[hostname]/[database]?[options]

This means you have five components to the connection string:

  • user – user name
  • password – password
  • hostname – url to connect to
  • database – Optional parameter – database with which to connect
  • options – a set of configuration options

It is easy to get the connection string for DocumentDB, however. In the DocumentDB console, clicking on the identifier for the regional cluster will bring you to a summary page. Lower on that page is a Connectivity & security tab that contains examples of approaches for connecting to the cluster. The last one, Connect to this cluster with an application, contains the cluster connection string as shown in Figure 3.

Figure 3. Getting DocumentDB connection string

You should note that the database name is not present in the connection string copied from the console. You can either add it to the connection string or specify it separately in code, as we do in the example below.
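
As a minimal sketch, the snippet below builds that connection string in code and opens a client; the user, password, and host values are placeholders for what is shown on your own cluster’s Connectivity & security tab, and the options assume TLS is enabled (which also requires the Amazon DocumentDB certificate authority bundle to be trusted by the machine running the code).

using MongoDB.Driver;

// Placeholder values - substitute your cluster's user, password, and endpoint
var user = "masteruser";
var password = "yourStrongPassword";
var host = "docdb-cluster.cluster-example.us-west-2.docdb.amazonaws.com:27017";
var options = "tls=true&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false";

var connectionString = $"mongodb://{user}:{password}@{host}/?{options}";
var client = new MongoClient(connectionString);

// The database name is supplied separately rather than in the connection string
var database = client.GetDatabase("Production");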

Once you have connected to the database and identified the collection that you want to access, you are able to access it using lambdas (the anonymous functions, not AWS’ serverless offering!). Let’s look at how that works. First, we define the model that we want to use for the deserialization process.

public class Person
{
    [BsonId]
    [BsonRepresentation(BsonType.ObjectId)]

    public string Id { get; set; }

    [BsonElement("First")]
    public string FirstName { get; set; }

    [BsonElement("Last")]
    public string LastName { get; set; }
}

What you’ll notice right away is that we are not able to use a plain old CLR object (POCO) but instead must provide some MongoDB BSON attributes. Getting access to these requires the following libraries to be added to the using statements:

using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes; 

There are many different annotations that you can set, of which we are using three:

  • BsonId – this attribute defines the document’s primary key, which makes it the easiest and fastest field on which to retrieve a document.
  • BsonRepresentation – this attribute primarily supports ease of use. It allows the developer to pass the parameter as type string instead of having to use an ObjectId structure, as the attribute handles the conversion from string to ObjectId.
  • BsonElement – this attribute maps a property name in the collection to a property on the object; here, “Last” maps to “LastName” and “First” maps to “FirstName”.

That means that the following JSON shows a record that would be saved as a Person type.

{
 "Id": "{29A25F7D-C2C1-4D82-9996-03C647646428}",
 "First": "Bill",
 "Last": "Penberthy"
}

We mentioned earlier that accessing the items within DocumentDB becomes straightforward when using lambdas. Thus, a high-level CRUD service could look similar to the code in Listing 1.

using MongoDB.Driver;

public class PersonService
{
    private readonly IMongoCollection<Person> persons;

    public PersonService ()
    {
        var client = new MongoClient("connection string here");
        var database = client.GetDatabase("Production");
        persons = database.GetCollection<Person>("Persons");
    }

    public async Task<List<Person>> GetAsync() => 
        await persons.Find(_ => true).ToListAsync();

    public async Task<Person?> GetByIdAsync(string id) =>
        await persons.Find(x => x.Id == id).FirstOrDefaultAsync();

    public async Task<Person?> GetByLastNameAsync(string name) =>
        await persons.Find(x => x.LastName == name).FirstOrDefaultAsync();

    public async Task CreateAsync(Person person) =>
        await persons.InsertOneAsync(person);

    public async Task UpdateAsync(string id, Person person) =>
        await persons.ReplaceOneAsync(x => x.Id == id, person);

    public async Task RemoveAsync(string id) =>
        await persons.DeleteOneAsync(x => x.Id == id);
}

  Code Listing 1. CRUD service to interact with Amazon DocumentDB

One of the conveniences of working with DocumentDB (and MongoDB, for that matter) is that the database and the collection are created automatically when the first item (or document) is saved. Thus, creating the database and the Persons collection is as simple as saving your first person. Of course, that means you have to make sure that you define your database and collection names correctly, but it also means that you don’t have to worry about your application failing to connect if you mistype the database name – you’ll just have a new database!
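
As a quick illustrative sketch using the PersonService from Listing 1 (run from within an async method), saving the first document is all it takes to bring the database and collection into existence:

// Saving the first Person creates the "Production" database and the
// "Persons" collection if they do not already exist.
var service = new PersonService();
await service.CreateAsync(new Person { FirstName = "Bill", LastName = "Penberthy" });

// The new document is immediately available for queries
var everyone = await service.GetAsync();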

AWS and In-Memory Databases

In-memory databases are typically used for applications that require “real-time access” to data. In-memory databases deliver this by storing data directly, wait for it, in memory. They attempt to deliver microsecond latency to applications for which millisecond latency is not enough, hence the name real-time access.

In-memory databases are faster than traditional databases because they store data in random-access memory (RAM). They can rely on a storage manager that requires far fewer CPU instructions, as there is no need for disk I/O, and their query processors can use internal optimization algorithms that are simpler and faster than those in traditional databases, which must read various blocks of data rather than follow the direct pointers available to in-memory databases. Figure 1 shows how all these pieces interact.

Figure 1. Internals of an in-memory database

Of course, this reliance on RAM also means that in-memory databases are more volatile than traditional databases, because data is lost when there is a loss of power or the memory crashes or gets corrupted. This lack of durability (the “D” in ACID) has been the primary blocker to wider usage, with the expense of memory a close second. However, memory has gotten cheaper, and system architectures have evolved to use in-memory databases in ways where the lack of durability is no longer a showstopper, mainly as caches or as a “front end” to a more traditional database. This gives you the best of both worlds: real-time responsiveness from the in-memory database and durability from the traditional database. Figure 2 shows an example of how this could work.

Figure 2. Using an in-memory and traditional database together

In Figure 2, we see that the application primarily interacts with the in-memory database to take advantage of its responsiveness, minimizing any kind of lag that a user may notice. In this example, a component of the application, called “Cache Manager,” is responsible for keeping the in-memory database and the traditional database in sync. This component populates the cache on startup and then ensures that all changes to the in-memory database are replicated through to the traditional database.
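
To make the idea concrete, here is a conceptual sketch of that write-through approach; the IStore interface and class names are hypothetical and not tied to any particular driver.

using System.Threading.Tasks;

public interface IStore
{
    Task SaveAsync(string key, string value);
    Task<string?> GetAsync(string key);
}

public class WriteThroughCacheManager
{
    private readonly IStore inMemoryStore;  // e.g., Redis or Memcached
    private readonly IStore durableStore;   // e.g., a relational database

    public WriteThroughCacheManager(IStore inMemoryStore, IStore durableStore)
    {
        this.inMemoryStore = inMemoryStore;
        this.durableStore = durableStore;
    }

    public async Task SaveAsync(string key, string value)
    {
        // Write to the fast store the application reads from first,
        // then replicate the change through to the durable store.
        await inMemoryStore.SaveAsync(key, value);
        await durableStore.SaveAsync(key, value);
    }

    public async Task<string?> GetAsync(string key) =>
        await inMemoryStore.GetAsync(key) ?? await durableStore.GetAsync(key);
}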

Obviously, the more complex the system, the more difficult this cache management may become – say there is another application that changes the traditional database directly. This could be problematic because the cache manager in your first application may not know about that change, so the data in the in-memory database becomes stale. However, having both applications take the same approach and share an in-memory database can help eliminate this issue, as shown in Figure 3.

Figure 3. Two applications managing their changes to a shared cache

More and more systems are relying on in-memory databases, which is why AWS has two different services, Amazon ElastiCache and Amazon MemoryDB for Redis. Before we dig too far into these services, let’s add some context by doing a quick dive into the history of in-memory databases as this can help you understand the approach that AWS took when creating their products.

Memcached, an open-source general-purpose distributed memory caching system, was released in 2003. It is still maintained, and since it was one of the first in-memory databases available, it is heavily used in various systems including YouTube, Twitter, and Wikipedia. Redis is another open-source in-memory database that was released in 2009. It quickly became popular as both a store and a cache and is likely the most used in-memory database in the world. The importance of these two databases will become obvious shortly.

Amazon ElastiCache

Amazon ElastiCache is marketed as a fully managed in-memory caching service. There are two supported ElastiCache engines, Redis and Memcached (see, I told you that these would come up again). Since it is fully managed, ElastiCache eliminates much of the work around hardware provisioning, software patching, setup, configuration, monitoring, failure recovery, and backups that you would have running your own instance of either Redis or Memcached. Using ElastiCache also adds support for event notifications, monitoring, and metrics without you having to do anything other than enabling them in the console.

However, there are a few limitations to Amazon ElastiCache that you need to consider before adoption. The first is that any application using an ElastiCache cluster must be running within the AWS environment; ideally within the same VPC. Another limitation is that you do not have the ability to turn the service off without deleting the cluster itself; it is either on or gone.

Let’s do a quick dive into the two different ElastiCache engines.

Amazon ElastiCache for Redis

There are two options to choose from when using the Redis engine, non-clustered and clustered. The primary difference between the two approaches is the number of write instances supported. In non-clustered mode, you can have a single shard (write instance) with up to five read replica nodes. This means you have one instance to which you write and the ability to read from multiple instances. In a clustered mode, however, you can have up to 500 shards with 1 to 5 read instances for each.

When using non-clustered Redis, you scale by adding additional read nodes so that access to stored data is not impacted. However, this leaves the possibility of write access becoming a bottleneck. Clustered Redis takes care of that by creating multiple sets of these read/write combinations so that you can scale out reads, by adding additional read nodes per shard (just as with non-clustered mode), and writes, by adding additional shards with their own read nodes. As you can probably guess, a clustered Redis deployment is going to be more expensive than a non-clustered one.

As a developer, the key thing to remember is that since ElastiCache for Redis is a managed service offering on top of Redis, using it in your .NET application means that you will use the open-source .NET drivers for Redis, StackExchange.Redis. This provides you the ability to interact with objects stored within the database. Code Listing 1 shows a very simple cache manager for inserting and getting items in and out of Redis.

using StackExchange.Redis;
using Newtonsoft.Json;

namespace Redis_Demo
{
    public interface ICacheable
    {
        public string Id { get; set; }
    }

    public class CacheManager
    {
        private IDatabase database;

        public CacheManager()
        {
            var connection = ConnectionMultiplexer.Connect("connection string");
            database = connection.GetDatabase();
        }

        public void Insert<T>(T item) where T : ICacheable
        {
            database.StringSet(item.Id, Serialize(item));
        }

        public T Get<T>(string id) where T : ICacheable
        {
            var value = database.StringGet(id);
            return Deserialize<T>(value);
        }

        private string Serialize(object obj)
        {
            return JsonConvert.SerializeObject(obj);
        }

        private T Deserialize<T>(string obj)
        {
            return JsonConvert.DeserializeObject<T>(obj);
        }
    }
}

Code Listing 1. Saving and retrieving information from Redis

The above code assumes that all items being cached implement the ICacheable interface, which means that there will be an Id property of type string. Since we are using this as our key into the database, we can assume this Id value will likely be a GUID or other guaranteed-unique value – otherwise we may get some strange behavior.
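
As a hypothetical usage example, a class only needs to implement ICacheable to flow through the CacheManager above (the Customer type and its properties are invented for illustration):

public class Customer : ICacheable
{
    public string Id { get; set; } = Guid.NewGuid().ToString();
    public string Name { get; set; }
}

// Assumes the CacheManager was built with a valid ElastiCache endpoint
var cache = new CacheManager();
var customer = new Customer { Name = "Jane Doe" };

cache.Insert(customer);
var fromCache = cache.Get<Customer>(customer.Id);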

Note: There is another very common Redis library that you may see referenced in online samples, ServiceStack.Redis. The main difference is that ServiceStack is a commercially-supported product that has quotas on the number of requests per hour. Going over that quota will require paying for a commercial license.

Now that we have looked at the Redis engine, let’s take a quick look at the Memcached engine.

Amazon ElastiCache for Memcached

Just as with Amazon ElastiCache for Redis, applications using Memcached can use ElastiCache for Memcached with minimal modifications as you simply need information about the hostnames and port numbers of the various ElastiCache nodes that have been deployed.

The smallest building block of a Memcached deployment is a node. A node is a fixed-size chunk of secure, network-attached RAM that runs an instance of Memcached. A node can exist in isolation or in a relationship with other nodes – a cluster. Each node within a cluster must be the same instance type and run the same cache engine. One of the key features of Memcached is that it supports Auto Discovery.

Auto Discovery is the ability for client applications to identify all of the nodes within a cluster and initiate and maintain connections to those nodes. This means applications don’t have to worry about connecting to individual nodes; instead, you simply connect to a configuration endpoint.

Using ElastiCache for Memcached in a .NET application requires you to use the Memcached client libraries, EnyimMemcached for older .NET versions or EnyimMemcachedCore when working with .NET Core-based applications. Using Memcached in a .NET application is different from many of the other client libraries that you have seen because it is designed to be used through dependency injection (DI) rather than “newing” up a client as we did for Redis. Ideally, we would have used DI there as well, but we did not, to keep that sample code simpler; we don’t have that option in this case.

The first thing you need to do is register the Memcached client with the container management system. If working with ASP.NET Core, you would do this through the ConfigureServices method within the Startup.cs class and would look like the following snippet.

using System;
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using Enyim.Caching.Configuration;

public void ConfigureServices(IServiceCollection services)
{
   …
   services.AddEnyimMemcached(o => 
        o.Servers = new List<Server> { 
            new Server {Address = "end point", Port = 11211} 
        });
   …
}

Using another container management system would require the same approach, with the key being the AddEnyimMemcached method to ensure that a Memcached client is registered. This means that the Memcached version of the CacheManager class that we used with Redis would instead look like Code Listing 2.

using Enyim.Caching;

namespace Memcached_Demo
{
    public class CacheManager
    {
        private IMemcachedClient client;
        private int cacheLength = 900;

        public CacheManager(IMemcachedClient memclient)
        {
            client = memclient;
        }

        public void Insert<T>(T item) where T : ICacheable
        {
            client.Add(item.Id, item, cacheLength);
        }

        public T Get<T>(string id) where T : ICacheable
        {
            T value;
            if (client.TryGet<T>(id, out value))
            {
                return value;
            }
            return default(T);
        }
    }
}

Code Listing 2. Saving and retrieving information from Memcached

The main difference that you will see is that every item being persisted in the cache has a cache length, the number of seconds that the item will stay in the cache. Memcached uses a lazy caching mechanism, which means that values are only deleted when requested or when a new entry is saved and the cache is full. You can disable expiration by using 0 as the cache length value. However, Memcached retains the ability to delete items before their expiration time when memory is not available for a save.

Choosing Between Redis and Memcached

The different ElastiCache engines solve different needs. Redis, with its support for both clustered and non-clustered implementations, and Memcached provide different levels of support. Table 1 shows some of the considerations when evaluating each engine.

Table 1 – Considerations when choosing between Memcached and Redis

                        Memcached    Redis (non-clustered)    Redis (clustered)
Data types              Simple       Complex                  Complex
Data partitioning       Yes          No                       Yes
Modifiable clusters     Yes          Yes                      Limited
Online resharding       No           No                       Yes
Encryption              No           Yes                      Yes
Data tiering            No           Yes                      Yes
Multi-threaded          Yes          No                       No
Pub/sub capabilities    No           Yes                      Yes
Sorted sets             No           Yes                      Yes
Backup and restore      No           Yes                      Yes
Geospatial indexing     No           Yes                      Yes

An easy way to look at it is this: if you use simple data models and are concerned with running large nodes with multiple cores and threads, then Memcached is probably your best choice. If you have complex data models and want database support for failover, pub/sub, and backup, then you should choose Redis.

Amazon MemoryDB for Redis

Where Amazon ElastiCache is a managed-service wrapper around an open-source in-memory database, Amazon MemoryDB for Redis is a Redis-compatible database service, which means that it is not Redis, but is instead a service that accepts many of the same commands as a Redis database itself. This is very similar to Amazon Aurora, which supports two different interaction approaches: MySQL and PostgreSQL. As of the time of this writing, MemoryDB supports the most recent version of Redis, 6.2.

Because MemoryDB is a fully managed service, creating the instance is straightforward. You create your cluster by determining how many shards you would like to support as well as the number of read-replicas per shard. If you remember the ElastiCache for Redis discussion earlier, this implies that the setup defaults to Redis with clustering enabled.

When looking at pricing at the time of this writing, MemoryDB costs approximately 50% more than comparable ElastiCache for Redis pricing. What this extra cost buys you is data durability. MemoryDB stores your entire data set in memory and uses a distributed multi-AZ transactional log to provide that durability, a feature that is unavailable in ElastiCache for Redis. MemoryDB also handles failover to a read replica much better, with failover typically happening in under 10 seconds. Beyond the pricing difference, however, there is a performance impact to using MemoryDB over ElastiCache: the distributed log adds latency, so reads and writes will take several milliseconds longer in MemoryDB than the same operations would in ElastiCache for Redis.

As you can guess, since the Redis APIs are replicated, a developer works with MemoryDB in the same way as with ElastiCache for Redis, using the StackExchange.Redis NuGet package. In-memory databases, whether ElastiCache or MemoryDB, offer extremely fast reads and writes because they cut out much of the complexity of performing I/O to disk and optimize their processing to take advantage of the speed inherent in RAM. However, working with them is very similar to working with “regular” document and key/value databases.
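
Because the same StackExchange.Redis package is used, connecting to a MemoryDB cluster looks just like connecting to ElastiCache for Redis; the sketch below uses a placeholder cluster endpoint and enables TLS, which MemoryDB requires.

using StackExchange.Redis;

var options = new ConfigurationOptions
{
    // Placeholder for your MemoryDB cluster endpoint
    EndPoints = { "clustercfg.my-memorydb.example.memorydb.us-west-2.amazonaws.com:6379" },
    Ssl = true
};

var connection = ConnectionMultiplexer.Connect(options);
var database = connection.GetDatabase();

database.StringSet("greeting", "hello from MemoryDB");
var value = database.StringGet("greeting");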

WTF are Ledger Databases?

A ledger database is a NoSQL database that provides an immutable, transparent, and cryptographically verifiable transaction log that is owned by a central authority. The key here is that you never actually change the data in a table. What we generally consider an update, which replaces old content with new content, does not apply when working with a ledger database. Instead, an update adds a new version of the record; all previous versions still exist, so your update never overwrites existing data. The cryptographic verification comes into play because it makes it possible to prove that this transaction history has not been altered.

Ledger databases are generally used to record economic and financial activity within an organization, such as by tracking credits and debits within an account. Other common use cases are generally around workflows, such as tracking the various steps taken on an insurance claim or tracing the movement of an item through a supply-chain network. For those of you that ever needed to implement audit tables in a relational database – you can start to see how managing that all automatically will be of benefit.

Note: A ledger database may seem very similar to blockchain. However, there is one significant difference: a ledger database is generally a centralized ledger, whereas blockchain is a distributed ledger. There will be times when an organization, such as a bank or financial organization, is not comfortable with a distributed ledger and instead prefers a simpler architecture with the same guarantee of immutable and verifiable data.

Now that we have discussed, at a high level, the features of a ledger database, let’s take a look at AWS’ version, Amazon QLDB.

Amazon QLDB

In a traditional relational database architecture, the approach is to write data into tables as part of a transaction. This transaction is generally stored in a transaction log and includes all the database modifications that were made during the transaction, which allows you to replay the log in the event of a system failure or for data replication. However, those logs are generally not immutable and aren’t designed to give users easy access.

This approach is different with QLDB, where the journal is the central feature of the database. In a practical sense, the journal is structurally similar to the transaction log; however, it takes a write-only approach to storing application data, with all writes – inserts, updates, and deletes – being committed to the journal first. QLDB then uses this journal to interpret the current state of your data by materializing it into queryable, user-defined tables. These tables also provide access to the history of the data within that table, including revisions and metadata.

While a lot of the functionality seems similar, some of the terminology is different. Table 1 shows the mapping between these terms.

Relational Term    QLDB Term
Database           Ledger
Table              Table
Index              Index
Table row          Amazon Ion document
Column             Document attribute
SQL                PartiQL
Audit logs         Journal
Table 1. Terminology map between RDBMS and QLDB

The key difference is that the table row and column from a relational database are replaced with an Amazon Ion document. Amazon Ion is a richly-typed, self-describing, hierarchical data serialization format that offers interchangeable binary and text representations. The text format is a superset of JSON and is easy to read and work with, while the binary representation is efficient for storage and transmission. Ion’s rich type system enables unambiguous semantics for data (e.g., a timestamp value can be encoded using the timestamp type). This rich type support is what allows an Ion document to conceptually replace a database row.
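
As a small illustrative example (the field names are invented), the Ion text document below uses unquoted symbols for field names and Ion’s native timestamp type for the OpenedOn value, neither of which plain JSON supports:

{
  // Ion text supports comments; OpenedOn below is an Ion timestamp, not a string
  AccountNumber: "ACT-1001",
  Balance: 125.50,
  OpenedOn: 2021-07-01T12:30:00Z
}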

Note – The Ion document format provides a lot of the support lacking in the Amazon Timestream SDK, where columns and values are defined separately. This seems to be an unfortunate example of how the various service teams never seem to talk to each other or follow a single standard!

Creating an Amazon QLDB ledger is very simple. In the AWS console, navigate to Amazon QLDB. Here you will be given the ability to Create ledger. Creating a ledger only requires:

1.      Ledger name – The name of your ledger, which needs to be unique within the region

2.      Ledger permissions mode – There are 2 choices here, Standard and Allow all. The “Standard” permissions mode is the default and it enables control over the ledger and its tables using IAM. The “Allow all” mode allows any user with access to the ledger to manage all tables, indexes, and data.

3.      Encrypt Data at rest – Where you choose the key to use when encrypting data. There is no ability to opt-out – all data will be encrypted at rest.

4.      Tags – Additional labels describing your resource

Creating a QLDB Ledger

You can also create a table at this time; however, there is no way through the UI to add an index. So, instead, let’s look at how you can do that in .NET, as well as how to add data to your newly created table.

.NET and Amazon QLDB

The first thing you will need to do to work with Amazon QLDB is to add the appropriate NuGet packages. The first one you should add is Newtonsoft.Json, which we will use in our examples to manage serialization and deserialization. The next ones to add are specific to QLDB, AWSSDK.QLDBSession, Amazon.IonObjectMapper, Amazon.QLDB.Driver, and Amazon.QLDB.Driver.Serialization. If you install the Driver.Serialization package, NuGet will install all the other required QLDB packages as dependencies.

The next step is to build an Amazon.QLDB.Driver.IQldbDriver. You do that with a Builder as shown below.

private IQldbDriver qldbDriver;

AmazonQLDBSessionConfig amazonQldbSessionConfig = new AmazonQLDBSessionConfig();

qldbDriver = QldbDriver.Builder()
    .WithQLDBSessionConfig(amazonQldbSessionConfig)
    .WithLedger(ledgerName)
    .Build();

Note how you cannot simply “new” up a driver; instead, you must chain calls off the Builder() method of the QldbDriver. Next, you need to ensure that the table and applicable indexes are set up. The code below contains a constructor that sets up the client and calls the validation code.

private IQldbDriver qldbDriver;
private IValueFactory valueFactory;

private string ledgerName = "ProdDotNetOnAWSLedger";
private string tableName = "Account";

public DataManager()
{
    valueFactory = new ValueFactory();

    AmazonQLDBSessionConfig amazonQldbSessionConfig = new AmazonQLDBSessionConfig();

    qldbDriver = QldbDriver.Builder()
        .WithQLDBSessionConfig(amazonQldbSessionConfig)
        .WithLedger(ledgerName)
        .Build();

    ValidateTableSetup(tableName);
    ValidateIndexSetup(tableName, "AccountNumber");
}

As you can see below, checking for the existence of the table is simple, as the driver has a ListTableNames method that you can use to determine the presence of your table. If the table doesn’t exist, the code executes the query to create it.

Let’s pull that method out and examine it, because this is how you will do all your interactions with QLDB, by executing queries.

private void ValidateTableSetup(string tableName)
{
    if (!qldbDriver.ListTableNames().Any(x => x == tableName))
    {
        qldbDriver.Execute(y => { y.Execute($"CREATE TABLE {tableName}"); });
    }
}

The Execute method accepts a function, in this case, we used a lambda function that executes a “CREATE TABLE” command. The code takes a similar approach to creating the index, however, you have to go through more steps to be able to determine whether the index has already been created as you have to query a schema table first and then parse through the list of indexes on that table. That code is shown below.

private void ValidateIndexSetup(string tableName, string indexField)
{
    var result = qldbDriver.Execute(x =>
    {
        IIonValue ionTableName = this.valueFactory.NewString(tableName);
        return x.Execute("SELECT * FROM information_schema.user_tables WHERE name = ?", ionTableName);
    });

    var resultList = result.ToList();
    bool isListed = false;

    if (resultList.Any())
    {
        IIonList indexes = resultList.First().GetField("indexes");
        foreach (IIonValue index in indexes)
        {
            string expr = index.GetField("expr").StringValue;
            if (expr.Contains(indexField))
            {
                isListed = true;
                break;
            }
        }
    }
    if (!isListed)
    {
        qldbDriver.Execute(y => y.Execute($"CREATE INDEX ON {tableName}({indexField})"));
    }
}

As mentioned earlier, a save is basically a query that is executed. A simple Save method, along with the helper that converts an object to an Ion value, is shown below:

public void Save(object item)
{
    qldbDriver.Execute(x =>
        {
            x.Execute($"INSERT INTO {tableName} ?", ToIonValue(item));
        }
    );
}

private IIonValue ToIonValue(object obj)
{
    return IonLoader.Default.Load(JsonConvert.SerializeObject(obj));
}

This example, as well as the earlier one where we checked whether an index exists, has some dependence on Ion. Remember, Ion’s text format is a richly-typed superset of JSON, so the driver expects values to be represented as Ion – which is why there are two different approaches for converting an item to an IIonValue, even something as simple as the tableName used for comparison. The first is through the Amazon.IonDotnet.Tree.ValueFactory object, as shown in the index validation snippet, while the other is through the Amazon.IonDotnet.Builders.IonLoader, as shown in the ToIonValue method.

Once you have the ledger, table, and any indexes you want set up, the next step is to save some data into the table, which, as the Save method above shows, is done by executing a SQL-like INSERT command.

Getting data out of the database is a little trickier, as you want it cast into the appropriate data model. The approach for retrieving an item from the database and converting it into a plain old CLR object (POCO) is shown below.

public List<T> Get<T>(string accountNumber)
{
    IIonValue ionKey = valueFactory.NewString(accountNumber);

    return qldbDriver.Execute(x =>
    {
        IQuery<T> query = x.Query<T>(
            $"SELECT * FROM {tableName} AS d WHERE d.AccountNumber = ?",
            ionKey);
        var results = x.Execute(query);
        return results.ToList();
    });
}

You can see that this is a two-step process. First, you create a generic query that is defined with the SELECT text and the appropriate model for the return set. You then Execute that query. This brings you back an Amazon.QLDB.Driver.Generic.IResult<T> object that can be converted to return a list of the requested items matching the AccountNumber that was passed into the function.
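
The Account model used in these snippets is not shown in the text, so here is a hypothetical version along with a usage sketch; only the AccountNumber property is required by the queries above, and the other properties are purely illustrative.

public class Account
{
    public string AccountNumber { get; set; }
    public string Owner { get; set; }
    public decimal Balance { get; set; }
}

// Save a document and then read it back by account number
var manager = new DataManager();
manager.Save(new Account { AccountNumber = "ACT-1001", Owner = "Jane Doe", Balance = 125.50m });
List<Account> matches = manager.Get<Account>("ACT-1001");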

That’s a quick review of the general functionality of QLDB. Let’s look at one more specific case, the history of an item – one of the prime drivers of using a ledger database. You do this by using the history function, a PartiQL extension that returns revisions from the system-defined view of your table. The syntax for using the history function is

SELECT * FROM history( table_name [, `start-time` [, `end-time` ] ] ) AS h
[ WHERE h.metadata.id = 'id' ]

The start time and end time values are optional. Using a start time will select any version that is active at the start time plus any subsequent versions up until the end time. If no start time is provided, then all versions of the document are retrieved, and the current time is used if the end time is not explicitly defined. However, as you can see, the optional WHERE clause uses a metadata.id value – which we have not talked about yet. This metadata.id value is used by QLDB to uniquely identify each document and can be accessed through the metadata of the document. One way to get this information is shown below:

public string GetMetadataId(Account item)
{
    IIonValue ionKey = valueFactory.NewString(item.AccountNumber);
    IEnumerable<IIonValue> result = new List<IIonValue>();

    qldbDriver.Execute(x =>
    {
        // Buffer the results so they can be read after the transaction completes
        result = x.Execute(
            "SELECT d.metadata.id AS id " +
            $"FROM _ql_committed_{tableName} AS d " +
            "WHERE d.data.AccountNumber = ?", ionKey).ToList();
    });

    IIonValue working = result.First();
    return working.GetField("id").StringValue;
}

This code snippet accesses the “committed” view, one of several system views provided by QLDB that give you access to information about your data. Taking the result from that query and plugging it into the previous snippet, where we queried using the history function, will get you the history of that document for the time period defined in the query. Thus, you can access the current version as well as all of the history, all by executing SQL-like commands.
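
Putting the two snippets together, a hedged sketch of retrieving every revision of a document might look like the following method, which would sit alongside the others in the same class.

public List<IIonValue> GetHistory(Account item)
{
    // Look up the QLDB-assigned document id, then query the history() function with it
    string metadataId = GetMetadataId(item);
    IIonValue ionId = valueFactory.NewString(metadataId);

    return qldbDriver.Execute(x =>
        x.Execute($"SELECT * FROM history({tableName}) AS h WHERE h.metadata.id = ?", ionId)
         .ToList());
}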

There is a lot more we can go into about using Amazon QLDB and .NET together, probably a whole book! To recap, however, Amazon QLDB is a ledger database, which means it provides an immutable, transparent, and cryptographically verifiable transaction log that you can access and evaluate independently from the active version.

Time Series Databases

A time-series database is a database optimized for time-stamped, or time-series, data. Typical examples of this kind of data are application performance monitoring logs, network data, clicks in an online application, stock market trading, IoT sensor data, and other use cases where the time of the data is important. So far it seems like something pretty much any database can manage, right? Well, the difference comes down to the way the data will be accessed.

NoSQL and relational databases depend upon the existence of a key, generally some kind of string or integer. In a time-series database, however, the timestamp is the most important piece of information. This means that applications that look at data within a time range – and it is the time range that is crucial – are well served by a time-series database, because that data is stored together (much like how relational databases store data in order by primary key), so getting a range of data based on a time range will be faster than getting data stored under a different type of key. Figure 1 shows how this works.

Figure 1. Time-series data stored by timestamp

A typical characteristic of time-series data is that when data arrives it is generally recorded as a new entry; updating existing data is the exception rather than the rule. Another characteristic is that the data typically arrives in timestamp order, so key scanning is minimized and the data write can be very performant. A third characteristic is that the non-timestamp data is stored as a value and is rarely included as part of the filtering in a query. Scaling is also especially important in time-series databases because there tends to be a lot of frequently created data that ends up being stored in this type of database. The last characteristic, as already mentioned, is that time is the primary axis. Time-series databases therefore tend to include specialized functions for dealing with time series, including aggregation, analysis, and time calculations.

Amazon Timestream

Now that we have briefly defined the concept behind a time-series database, let us look at the AWS entry into the field, Amazon Timestream. Timestream is a serverless, automatically scaling database service that can store trillions of events per day and is exponentially faster than a relational database at a fraction of the cost. Timestream uses a purpose-built query engine that lets you access and analyze recent and historical data together within a single query, as well as supporting built-in time-series analytics functions.

Creating an Amazon Timestream database is one of the simplest processes that you will find in AWS. All you need to do is to provide a Database name and decide on the Key Management Service (KMS) key to use when encrypting your data (all Timestream data is encrypted by default). And that is it. Creating a table within the database is almost as simple; provide a Table name and then configure the Data retention policies.

These data retention policies are designed to help you manage the lifecycle of your time-series data. The assumption is that there are two levels of data, data that is frequently used in small segments for analysis, such as “the last 5 minutes”, and data that will be accessed much less often but in larger ranges, such as “last month”. Timestream helps you manage that lifecycle by automatically storing data in two places, in-memory and in a magnetic store. The data retention policies allow you to configure when that transition should happen. Figure 2 shows an example where data is persisted in memory for 2 days and then transferred to magnetic storage and kept there for 1 year.

Figure 2. Configuring the Data Retention policies for a Timestream table
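
If you prefer to create the table from code rather than the console, a minimal sketch using the Timestream Write client might look like the following; the database and table names are placeholders, and the retention values mirror the Figure 2 example of 2 days in memory and 1 year in the magnetic store.

using Amazon.TimestreamWrite;
using Amazon.TimestreamWrite.Model;

var writeClient = new AmazonTimestreamWriteClient();

await writeClient.CreateTableAsync(new CreateTableRequest
{
    DatabaseName = "SensorData",     // placeholder database name
    TableName = "Measurements",      // placeholder table name
    RetentionProperties = new RetentionProperties
    {
        MemoryStoreRetentionPeriodInHours = 48,   // 2 days in the memory store
        MagneticStoreRetentionPeriodInDays = 365  // 1 year in the magnetic store
    }
});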

Now that we have looked at creating a Timestream database and table, the next step is to look at how the data is managed within the database as this will help you understand some of the behavior you will see when using the database programmatically. The following are key concepts in Timestream:

  • Dimension – describes metadata as key/value pairs; a record can have zero to many dimensions. An example could be the location and type of a sensor.
  • Measure – the actual value being measured, as a key/value pair

Now that we have briefly discussed Timestream databases, our next step is to look at using it in a .NET application.

.NET and Amazon Timestream

We have spent some time accessing database services on AWS that are designed to emulate some other product, such as Redis. Timestream does not take that approach. Instead, you need to use the AWS SDK for .NET to write data to and read data from Timestream. The interesting thing about this is that AWS made the decision to break this data access process down by providing two discrete .NET clients for Timestream, the Write SDK client that persists data to the table and the Query SDK client that returns data from the table.

Let’s start our journey into Timestream by looking at a method in the code snippet below that saves an item, in this case, a Measurement.

1  using System;
2  using System.Collections.Generic;
3  using System.Threading.Tasks;
4  using Amazon.TimestreamWrite;
5  using Amazon.TimestreamWrite.Model;
6
7  public class Measurement
8  {
9     public Dictionary<string, double> KeyValues { get; set; }
10    public string Source { get; set; }
11    public string Location { get; set; }
12    public DateTime Time { get; set; }
13  }
14
15 public async Task Insert(Measurement item)
16 {
17    var writeClient = new AmazonTimestreamWriteClient();
18
19    List<Dimension> dimensions = new List<Dimension>
20    {
21        new Dimension {Name = "location", Value = item.Location}
22    };
23
24    Record commonAttributes = new Record
25    {
26        Dimensions = dimensions,
27        MeasureValueType = MeasureValueType.DOUBLE,
28        Time = ConvertToTimeString(item.Time)
29    };
30
31   List<Record> records = new List<Record>();
32
33    foreach (string key in item.KeyValues.Keys)
34    {
35        var record = new Record
36        {
37            MeasureName = key,
38            MeasureValue = item.KeyValues[key].ToString()
39        };
40        records.Add(record);
41    }
42
43    var request = new WriteRecordsRequest
44    {
45        DatabaseName = databaseName,
46        TableName = tableName,
47        CommonAttributes = commonAttributes,
48        Records = records
49    };
50
51    var response = await writeClient.WriteRecordsAsync(request);
52    // do something with result, such as evaluate HTTP status
53 }

When persisting data into Timestream you use a WriteRecordsRequest, as shown on Line 43 of the above code. This object contains information about the database and table to use as well as CommonAttributes and Records. The CommonAttributes property expects a Record containing the fields that are common to all the items in the Records property; think of it as a way to minimize the data sent over the wire as well as a way to provide initial value grouping. In this case, our common attributes contain the Dimensions (as discussed earlier), the MeasureValueType, which defines the type of data being submitted (although it will be converted to a string for the submission), and the Time. This leaves only the list of measurement values to be put into the Records property. The response from the WriteRecordsAsync method includes an HTTP status code that indicates success or failure.

Pulling data out is quite different; mainly because the process of fetching data is done by passing a SQL-like query string to the AmazonTimestreamQueryClient. This means that a simple query to retrieve some information could look like this:

SELECT location, measure_name, time, measure_value::double
FROM {databaseName}.{tableName}
WHERE measure_name='type of measurement you are looking for'
ORDER BY time DESC
LIMIT 10

All of which will be recognizable if you have experience with SQL.

This is all well and good, but the power of this type of database really shines when you start to do time-specific queries. Consider the following query that finds the average measurement value, aggregated together over 30-second intervals (binned), for a specific location (one of the dimensions saved with each submission) over the past 2 hours.

SELECT BIN(time, 30s) AS binned_timestamp,
    ROUND(AVG(measure_value::double), 2) AS avg_measurementValue,
    location 
FROM {databaseName}.{tableName} 
WHERE measure_name = 'type of measurement you are looking for'
    AND location = '{LOCATION}' 
    AND time > ago(2h) 
GROUP BY location, BIN(time, 30s) 
ORDER BY binned_timestamp ASC

As you can probably imagine, there are many more complicated functions that are available for use in Timestream.

Note: These time series functions include functions that will help fill in missing data, interpolations, functions that look at the rate of change for a metric, derivatives, volumes of requests received, integrals, and correlation functions for comparing two different time series. You can find all the time series functions at https://docs.aws.amazon.com/timestream/latest/developerguide/timeseries-specific-constructs.functions.html.

Once you have built the query, running the query is pretty simple, as shown below:

public async Task<QueryResponse> RunQueryAsync(string queryString)
{
    try
    {
        QueryRequest queryRequest = new QueryRequest();
        queryRequest.QueryString = queryString;
        QueryResponse queryResponse = 
                 await queryClient.QueryAsync(queryRequest);
        return queryResponse;
    }
    catch (Exception e)
    {
        return null;
    }
}

However, the complication comes from trying to interpret the results – the QueryResponse object that is returned by the service. This is because you are simply passing in a query string that could be doing any kind of work, so the response needs to be able to manage that. Figure 3 shows the properties on the QueryResponse object.

Figure 3. Object definition of the QueryResponse object

There are five properties in the QueryResponse. QueryStatus returns information about the status of the query, including its progress and the bytes scanned. The QueryId and NextToken properties are used together to support pagination when the result set is larger than the default page length. ColumnInfo provides details on the column data types of the returned result set, and Rows contains the result set itself.

You can see a simple request/result as captured in Telerik Fiddler Classic in Figure 4.

Figure 4. Amazon Timestream query response in JSON

There are three areas called out. Area 1 is the query that was sent; as you can see, it is a simple “SELECT ALL” query with a limit of 2 rows. Area 2 shows the ColumnInfo property in JSON, with each item in the array corresponding to one of the ScalarValues found in the Data array that makes up the Rows property.

Looking at this, you can probably see ways in which you can transform the data into JSON that will deserialize nicely into a C# class. Unfortunately, however, AWS did not provide this functionality for you as part of their SDK.
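
As a hedged sketch of what that transformation might look like, the helper below pairs each column name from ColumnInfo with the matching ScalarValue in each row to produce dictionaries that can then be serialized to JSON and deserialized into your own C# classes (non-scalar column types would need additional handling).

using System.Collections.Generic;
using Amazon.TimestreamQuery.Model;

public static List<Dictionary<string, string>> ToDictionaries(QueryResponse response)
{
    var items = new List<Dictionary<string, string>>();

    foreach (Row row in response.Rows)
    {
        var item = new Dictionary<string, string>();
        for (int i = 0; i < response.ColumnInfo.Count; i++)
        {
            // Scalar columns only; nested TimeSeries or Array data would need extra handling
            item[response.ColumnInfo[i].Name] = row.Data[i].ScalarValue;
        }
        items.Add(item);
    }

    return items;
}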