What kind of “NoSQL” are you talking about?

Too many times I hear NoSQL databases as if they are a single way of “doing business.” However, there are multiple types of NoSQL databases, each with its specializations and ways in which they approach persisting and accessing information. So just saying “use NoSQL” isn’t really all that useful.

With that in mind, lets’ take a look at the various choices that you have to consider. The main flavors of NoSQL databases are document databases, key-value stores, column-oriented databases, and graph databases. There are also databases that are combination of these and even some other highly specialized databases that do not depend of relationships and SQL, so they could probably be NoSQL as well. That being said, let’s look at those four primary types of NoSQL databases.

Document Databases

A document database stores data in JSON as shown in Figure 10-3, BSON, or XML. BSON stands for “Binary JSON.” BSON’s binary structure allows for the encoding of type and length information. This encoding tends to allow the data to be parsed much more quickly and it helps support indexing and querying into the properties of the document being stored.

Figure 1. Document database interpretation of object

Documents are typically stored in a format that is much closer to the data model being worked with in the application. The examples used above show this clearly. Additional functionality supported by document databases tend to be around ways of indexing and\or querying data within the document. The Id will generally be the key used for primary access but imagine cases like where you want to search the data for all instances of a Person with a FirstName of “Steve”. In SQL that is easy because that data is segregated into its own column. It is more complex in a NoSQL database because that query must now go rummaging around the data within the document which, as you can probably imagine, will be slow without a lot of work around query optimization.  Common examples of document databases include MongoDB, Amazon DocumentDB, and Microsoft Azure CosmosDB.

Key-Value Stores

A key-value store is the simplest of the NoSQL databases. In these types of databases, every element in the database is stored as a simple key-value pair made up of an attribute, or key, and the value, which can be anything. Literally, anything. It can be a simple string, a JSON document like is stored in a document database, or even something very unstructured like an image or some other blob of binary data.

The difference between Document databases and Key-Value databases is generally their focus. Document databases focus on ways of parsing the document so that users can do things like making decisions about data within that document. Key-value databases are very different. They care much less what the value is, so they tend to focus more on storing and accessing the value as quickly as possible and making sure that they can easily scale to store A LOT of different values. Examples of Key-Value databases include Amazon DynamoDB, BerkeleyDB, Memcached, and Redis.

Those of you familiar with the concept of caching, where you create a high-speed data storage layer that stores a subset of data so that accessing that data is more responsive than calling directly into the database, will see two products that are commonly used for caching, Memcached and Redis. These are excellent examples of the speed vs. document analysis trade-off between these two types of databases.

Column-Oriented Databases

Yes, the name of these types of NoSQL databases may initially be confusing because you may think about how the data in relational databases are broken down into columns. However, the data is accessed and managed differently. In a relational database, an item is best thought of as a row, with each column containing an aspect or value referring to a specific part of that item. A column-oriented database instead stores data tables by column rather than row.

Let’s take a closer look at that since that sounds pretty crazy. Figure 2 shows a snippet of a data table like you are used to seeing in a relational database system, where the RowId is an internal value used to refer to the data (only within the system) while the Id is the external value used to refer to the data.

Figure 2. Data within a relational database table, stored as a row.

Since row-oriented systems are designed to efficiently return data for a row, this data could be saved to disk as:

1:100,Bob,Smith;
2:101,Bill,Penberthy;

In a column-oriented database, however, this data would instead be saved as

100:1,101:2;
Bob:1,Bill:2;
Smith:1,Penberthy:2

Why would they take that kind of approach? Mainly, speed of access when pulling a subset of values for an “item”. Let’s look at an index for a row-based system. An index is a structure that is associated with a database table to speed the retrieval of rows from that table. Typically, the index sorts the list of values so that it is easier to find a specific value. If you consider the data in Figure 10-4 and know that it is very common to query based on a specific LastName, then creating a database index on the LastName field means that a query looking for “Roberts” would not have to scan every row in the table to find the data, as that price was paid when building the index. Thus, an index on the LastName field would look something like the snipped below, where all LastNames have been alphabetized and stored independently:

Penberthy:2;
Smith:1

If you look at this definition of an index, and look back at the column-oriented data, you will see a very similar approach. What that means is that when accessing a limited subset of data or the data is only sparsely populated, a column-oriented database will most likely be much faster than accessing a row-based database. If, on the other hand, you tend to use a broader set of data then storing that information in a row would likely be much more performant. Examples of column-oriented systems include Apache Kudu, Amazon Redshift, and Snowflake.

Graph Databases

The last major type of NoSQL databases is graph databases. A graph database focuses on storing nodes and relationships. A node is an entity, is generally stored with a label, such as Person, and contains a set of key-value pairs, or properties. That means you can think of a node as being very similar to the document as stored in a database. However, a graph database takes this much further as it also stores information about relationships.

A relationship is a directed and named connection between two different nodes, such as Person Knows Person. A relationship always has a direction, a starting node, an ending node, and a type of relationship. A relationship also can have a set of properties, just like a node. Nodes can have any number or type of relationships and there is no effect on performance when querying those relationships or nodes. Lastly, while relationships are directed, they can be navigated efficiently in either direction.

Let’s take a closer look at this. From the example above, there are two people, Bill and Bob that are friends. Both Bill and Bob watched the movie “Black Panther”. This means there are 3 nodes, with 4 known relationships:

  • Bill is friends with Bob
  • Bob is friends with Bill
  • Bill watched “Black Panther”
  • Bob watched “Black Panther”     

Figure 3 shows how this is visualized in a graph model.

Figure 3. Graph model representation with relationships

Since both nodes and relationships can contain additional information, we can include more information in each node, such as LastName for the Persons or a Publish date for the Book, and we can add additional information to a relationship, such as a StartDate for writing the book or a Date for when they become friends.

Graph databases tend to have their own way of querying data because you can be pulling data based upon both nodes and relationships. There are 3 common query approaches:

·         Gremlin – part of an Apache project is used for creating and querying property graphs.  A query to get everyone Bill knows would look like this:

g.V().has("FirstName","Bill").
  out("Is_Friends_With").
values("FirstName")

·         SPARQL – a World Wide Web Consortium supported project for a declarative language for graph pattern matching. A query to get everyone Steve knows would look like this:

PREFIX vcard: http://www.w3.org/2001/vcard-rdf/3.0#
SELECT ?FirstName
WHERE {?person vcard:Is_Friends_With ?person }

·         openCypher – a declarative query language for property graphs that was originally developed by Neo4j and then open-sourced. A query to get everyone Bill knows would look like this:

MATCH (me)-[:Is_Friends_With*1]-(remote_friend)
WHERE me.FirstName = Bill
RETURN remote_friend.FirstName

The most common graph databases are Neo4j, ArangoDB, and Amazon Neptune.

Next time – don’t just say “lets put that in a NoSQL database.” Instead, say what kind of NoSQL database makes sense for that combination of data and business need. That way, it is obvious you know what you are saying and have put some thought into the idea!

Leave a Reply