DynamoDB Intro And Database Design Guide

This article will cover:

What is DynamoDB?

In 2007, Amazon published the Dynamo Paper revealing the in-house tech it was using for Amazon.com, which would eventually become DynamoDB. DynamoDB is a serverless, distributed, key-value NoSQL database which trades ad-hoc query speed in exchange for ultra-fast, ultra-scalable response times for specific access patterns.

The below video, DynamoDB docs, and the API reference are excellent resource to learn about DynamoDB.

Fundamental Terminology And Facts

  • In DynamoDB, "items" are the equivalent to SQL rows, and "attributes" are the equivalent of SQL columns.
  • However, in DynamoDB, items can have different attributes. E.G. The item for the movie "Jaws" may have attributes "Actors" and "Director" but the item for the book "Dynamo DB Applied Access Patterns" may have the attributes "Author" and "ISBN."
  • Items can be up to 400 KB, which is large enough for most SQL row equivalents.

What are the advantages, disadvantages and key factors of dynamoDB?

  • In DynamoDB, the amount of data in the database, and amount of running queries has little to no impact on the speed of queries. Retrieving a single item runs at O(1) time complexity if the Key is known, even with millions of records in a table and many concurrent queries running. Retrieving a sorted result set is done in O(log(n)) if the proper access patterns are used (read on to learn). In other words, In dynamoDB, increasing the number of running queries and amount of the database storage used does not impact performance of those queries (except in extreme cases with queries accessing the same partition).
  • The trade off for this performance is that this requires known hierarchical access patterns. The disadvantage is that ad-hoc access patterns scale poorly with O(n) time complexity, whereas in a classical SQL, non-partitioned DB they would scale better with O(log(n)). SQL databases usually give you more flexibility for ad hoc access patterns/queries.
  • Fully managed and Serverless - no patching, upgrading DB version, etc.
  • Designed for speed (CPU-optimized) over reduced storage (unlike SQL which is designed to save disk space)
  • Best for OLTP (SQL is still best for OLAP)
  • It's partitioned and distributed, which means it scales horizontally (while SQL scales vertically)
  • Can scale to 55 million TPS or higher
  • With Dynamo DB Global Tables, multiregion replication happens very quickly (great for Disaster Recovery)
  • Automatically replicates data, ensuring high durability and availability
  • You can choose to be billed on-demand, and only pay for the reads/writes that you use or use provisioned capacity
  • Some other disadvantages include lack of triggers and the fact that it is only usable on AWS

So What Actually Makes DynamoDB So Fast: Partitioning

A big draw to move to DynamoDB is its blazing fast for most OLTP applications. Why is it so fast?
  • First, if it ain't broke don't fix it. What's the problem is with most SQL databases and performance? You may have a SQL query that runs in 10 milliseconds when you have only 5 GB of data in your database and 0.1 TPS, but when you have 300 GB, and 2 TPS of traffic coming in, the same query could take well over a second. You can mitigate this by increasing CPU and RAM for your database, but it's an ongoing effort, and difficult to get perfect prior to new bursts in traffic.
  • But what causes this size-dependent speed? The answer is that SQL databases typically runs on a single machine, and there is only so much a single machine can do to search through thousands or millions of records for multiple requests at one time. When the number of read or write requests against the DB at once exceeds the number of cores available on the database instance's CPU, the response time and memory usage grows. Additionally, one reason that size of the database storage used in bytes can effect response time is due to using more hard disk space instead of memory only. That could be mitigated by moving to a larger RDS instance if using AWS (the largest one as of this writing has 1 TB of memory).

DynamoDB databases aren't limited by the physical limitations of one database instance, because they are distributed and partitioned.
  • A dynamoDB table consists of a network of many physical instances, rather than a single one. So when a request comes in, it needs a way to identify which instance will handle it.
  • Enter the partition key. This key determines which node will process the query. By using a partitioned design, AWS can route DB requests to thousands of distributed nodes at one time, based on the partition keys.
  • Each partition key can be assigned to a different physical system for storage, and the 'value' of the key can be up to 10 GB in size.

Are there any partitioned SQL databases?

Yes, CockroachDB and Vitess are two examples. This article will not go into detail on them, but they are presumably fast as well.

Why can't DynamoDB do ad-hoc queries quickly?

Ad-hoc queries, such as queries on columns not part of an index in DynamoDB, are scan operations that run slowly compared to how they would run in a SQL-based database. The reason is that these are accessing multiple physical nodes. Accessing these multiple nodes and combining the results into a single result set incurs additional network costs and delays.

Why can't DynamoDB do joins?

In order to join the data together would require a centralized node. This, along with additional backend complexity burdens, is also why joins are not a provided capability in DynamoDB (even though they are possible in MongoDB, and Vitess which are also partitioned databases)

You can query on different attributes efficiently if you define a secondary index for them, though. See the section on Secondary Indexes for more info.

Before we get more into secondary indexes, let's go over a bit more about the benefits of DynamoDB.

Durability and Replication

DynamoDB uses multiple nodes to replicate all its data for high availability and durability. Despite this replication, which you may expect might slow things down, DynamoDB has the concept of quorum (technically it uses sloppy quorum) to decrease response time, where only some number (not all) of the internal nodes need to be updated before the write is considered successful. If w is the number of nodes that need to be updaed for a write to be successful, and n is the total number of nodes, strongly consistent reads would need to read r nodes to ensure a the data is up-to-date where r+w>n.

Global Tables

Global Tables enable fast multi-directional replication across two or more regions in which each region's table is fully active (able to accept reads and writes). To make a DynamoDB table into a Global Table, you create a "replica" on the "Global Tables" tab as explained in this video. These replicas are not read replicas. Of course, this will double your WCU usage.

DynamoDB Streams

These enable asynchronous time-ordered streams of item-level modifications on your table, so that operations can be done on them after the data is written. This can be used to overcome the limitations of DynamoDB with adhoc queries by streaming to a OLAP database. Each table, if enabled, gets a log file that keeps change information for up to 24 hours.

DynamoDB Schema Design

Partition Key & Sort Key Key Facts

  • When creating a table, you can choose to use a Partition Key as the Primary Key, or a combination of Partition Key and Sort Key. These are known as the "Primary Key attributes."
  • Partition Key is also known as Hash Key. Sort Key is also known as Range Key
  • If you choose Partition Key, each item must have a Partition Key. If you choose Partition Key and Sort Key, each item must have both a Partition Key and Sort Key.
  • Partition Keys, which work like distributed hashmaps, give results in O(1) time whereas SQL Primary Keys usually give results in O(log(n)) time due to their B-tree internals.
  • Sort Keys further sort or filter query results within a partition in O(log(n)) time.
  • Using the Query API endpoint, you can:
    • Get all items within a Partition Key sorted by Sort Key Ascending
    • Get all items within a Partition Key sorted by Sort Key Descending
    • Provide Partition and Sort Key Comparison Operator to get only some of the items within a Partition Key, sorted by Sort Key
    • Use begins_with on the sort key to search for all sort keys that begin with a certain string. Begins_with cannot be used with the partition key
  • Partition Keys should be specific with high cardinality (in other words, more partition keys is better).
  • User#Account or User#Order are examples of good partition keys, because these will likely result in hundreds, thousands or millions of partition key values, giving us the benefits of a partitioned DB.
  • "AccountStatus" is a poor key because many users will be sharing the same partition, making that partition "hot" and resulting in a degradation in performance.
  • There is one catch to the above examples of good partition keys - if all parts of the key can for sure be provided in the access pattern, it will be fine, but if some of these values are not known, the query will be impossible because partition keys can only be done based on equality (no conditional, begins_with, or contains type operations are possible on partition key). If only the user id will be available, it may be best to include order# or account# in the sort key, which can use begins_with.
  • Contents within a partition can be up to 10GB
  • Examples of Table Schemas:
    • History of actions of a user in a system: Partition Key is User 1. Sort Key is Date/Time#Action. Attributes may include more details about the action.
    • History of orders of a customer Partition Key is Customer 1. Sort Key is Date/Time#ItemNumber, and/or Date/Time#OverallOrderDetails, assuming no two orders would happen at the same time.
    • Message history in a messaging app. Primary Key is LoggedInUser#OtherPersonOrGroup. Sort Key is Timestamp. Attributes include Username of other user, full text of the message, and any attachments to the message.
  • You can also get results for up to 100 partitions at once using BatchGetItem, however, you would need to know and provide each secondary key as well.

What are Secondary Indexes in DynamoDB?

Global Secondary Indexes

  • Global Secondary Indexes(GSIs) create an internal copy of your table in a separate partition space, in which the Partition and Sort Key are different from the Primary key (which as explained above is usually a combination of partition key and sort key). Consequently of this design, the primary and secondary copies of the data are updated whenever an update is made, regardless of if it is updated with the Primary or Secondary index.
  • GSIs allow access to your data with different attributes, with high speed as if accessing with the primary key.
  • GSIs can be limited in the attributes they contain. A projection is the set of attributes that exist in the secondary index's partitions. These can be selected from the attributes that exist in the main table. You can choose a limited subset of items (projection type INCLUDE), or just the primary keys themselves (projection type KEYS_ONLY), to save on storage cost.
  • GSIs do not contribute to the 10GB partition limit because they use their own partition space.
  • Additionally, you can write items to the table using the main indexes without writing GSIs for them. This means, some of the items' GSI keys are allowed to be null, which is not allowed for the primary key. This is a sparse index, which can be a useful design pattern that can help save on write costs.

Local Secondary Indexes

Local Secondary Indexes allow you to use the same Partition Key with a different sort key. LSIs, like GSIs, make a copy of the data to be accessed and may also project all, or just some of the main tables attributes. LSIs projected attributes do add to the 10GB limit, which is one possible disadvantage of using them.

What are the general cons of Adding Secondary Indexes

  • Cost: The more indexes you have on a table, the more cost you will incur with writes, due to copying the data to the projection when something is updated. If you are mostly doing reads, this is less of a concern.
  • Performance: Because secondary indexes are updated asynchronously, you won't see a performance difference with writes by using them. With more indexes, you may see a response time increase if you are using on-demand capacity or if you exceed your provisioned capacity.