Thinking About State

December 22, 2020

I'm in "housekeeping" mode this week, looking back through the things I've deployed so far to see if there are things that can be cleaned up or consolidated. I've been keeping an eye out for state in particular. Within software, state is the word for any data stored by a program that isn't code or configuration. The text files in which I write these posts are state, as are the autogenerated HTML files read by a browser. In contrast, a DNS record would usually not be considered state[1]. Roughly, all data that doesn't exist when you install an application, but gets created while you use it, is state. A unit of state--like a line in a CSV file, or a blog post formatted as a markdown file, or a database row--is what I will call a record. The information within a record might be further divided into fields--a record for a blog post might include a date field. I will use the word collection to refer to a grouping of similar records, such as a directory of markdown files, or a database table, or a CSV file. These distinctions are based on the context of the conversation-- a CSV file is usually a collection, but multiple CSV files could also be a colloection. A CSV file containing only one row might be considered a record. A system that holds state, like a hard drive, a bucket, or a database, is called a store. Secrets, like passwords and API keys, are a special case--they have most of the same considerations as other forms of state, but also have some special characteristics.

State brings with it a number of challenges. The ones I'm thinking about today are structure, access, consistency, and persistence.

Structure

The structure of state is the "shape" of individual records. I'm writing this blog post as a text file. At the top of the file there's some metadata--the title, author, date, etc. There are about 37 posts on this site so far, each one its own file. That means that if I wanted to change the structure of a post--if I wanted to add a link to a heading image, for instance--I would need to edit each of those 37 files, either manually or with a program. Changing the structure of data, or moving it from one place to another, is called migration. When the number of records in a system grows over time, as it does with blog posts, image storage systems, or other types of social media, it gets more difficult to change the structure of the data[2].
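A sketch of what that kind of migration script might look like, assuming the posts live in a posts/ directory and begin with a simple key: value metadata block (both assumptions--the field name header_image is made up too):

```python
import pathlib

POSTS_DIR = pathlib.Path("posts")  # assumed location of the post files

for path in POSTS_DIR.glob("*.md"):
    text = path.read_text()
    # Skip posts that already have the new field, so the script can
    # safely be re-run if it's interrupted partway through.
    if "header_image:" in text:
        continue
    # Add the new field (empty for now) at the top of the metadata block.
    path.write_text("header_image:\n" + text)
```

Even a toy migration like this shows why structure changes get harder as collections grow: the script has to be safe to re-run, and a bug in it touches every record at once.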

Access

I'm using access to mean "people who are allowed to get the data can get it." This idea combines both access control and availability[3]. When we think about how we want a system to store state, we need to consider both. For instance, if you expect thousands of requests per second, you would not want to use a single spinning-disk hard drive, because it would be too slow, and availability would suffer. If you wanted to make a password manager for different users on a shared system, you would not want to store everyone's passwords in the same file, because then everyone would have access to everyone else's passwords. Different stores have different features for access control and availability--different ways of restricting access or enabling access at larger scale.
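For the shared-system example, one minimal access-control fix is to give each user their own secret file that only they can read. A sketch, assuming a Unix-style system (the path and function name here are hypothetical):

```python
import os

def write_user_secret(path: str, secret: str) -> None:
    # Create the file with owner-only permissions (mode 0o600) so other
    # users on the shared system can't read it. Note that the mode only
    # applies when the file is first created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(secret)

write_user_secret("/home/alice/.password-store", "hunter2")  # hypothetical path
```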

Consistency

Consistency refers to whether different elements of state agree with each other. For instance, imagine if something went wrong when adding a blog post to a site, with the result that the blog post was created--that is, a direct link to the blog post would work--but it never got added to the list of posts or the front page. That would be an example of inconsistency--from the perspective of the post page, the post exists, but from the perspective of the list, it does not. Big sources of consistency errors are coding mistakes, system malfunctions, and data structure changes[4].
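Here's a sketch of how that kind of inconsistency sneaks in, with the crash window marked (write_file and publish are hypothetical names, not code from my site):

```python
import pathlib

def write_file(path: str, content: str) -> None:
    pathlib.Path(path).write_text(content)

def publish(post_path: str, post_html: str, index_path: str, index_html: str) -> None:
    # Step 1: create the post page itself.
    write_file(post_path, post_html)
    # If the program dies right here, the post page exists but the list
    # of posts doesn't mention it--two pieces of state now disagree.
    # Step 2: update the list of posts.
    write_file(index_path, index_html)
```

One common defense is to treat the list as derived state: regenerate it from the posts themselves, so a crash can be repaired by re-running the generator instead of hunting down the missing entry.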

Persistence

I'm using persistence within this discussion to specifically mean "resistance to permanent destruction and loss." The two most visible considerations for persistence are backup and restore[5]. Planning for persistence benefits from a security mindset--you want to try to identify likely risks and plan for ways to handle them. Some of these risks are obvious--drives fail, cloud services have outages, USB drives get lost--but others require more creativity to plan around. Even if data is stored safely, it may become unavailable if there's no way to read it (have you seen a Betamax player recently?). Encrypted data can only be accessed with its secret key, so encrypted backups rely on safe key storage to function.
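Since untested backups are a classic failure mode (see footnote 5), here's a sketch of a restore test: back a directory up, restore it into a scratch location, and compare. The function names are hypothetical, and the comparison is shallow and top-level only--enough for a smoke test, not an audit.

```python
import filecmp
import shutil
import tempfile

def backup(src_dir: str, archive_base: str) -> str:
    # Returns the path of the created archive, e.g. archive_base + ".tar.gz".
    return shutil.make_archive(archive_base, "gztar", src_dir)

def restore_and_verify(archive: str, original_dir: str) -> bool:
    # Restore into a throwaway directory and compare it to the original.
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(archive, scratch)
        diff = filecmp.dircmp(original_dir, scratch)
        return not (diff.left_only or diff.right_only or diff.diff_files)

archive = backup("posts", "posts-backup")
assert restore_and_verify(archive, "posts"), "backup failed its restore test"
```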

As the amount of state in a system gets larger, design errors in state storage get harder to fix. Backing up or restoring a large database from scratch may take days and interfere with using the database in the meantime. Most cloud services charge by the amount of data you're moving around, so copying a bucket to move it or back it up can be an expensive exercise. Different storage technologies have different operating costs--some may be comparatively cheap for large volumes of data that are accessed infrequently, but become expensive or unusable for more frequent access.

More generally, different storage systems are billed in different ways. In the previous post, I laid out my rationale for preferring highly scalable services with pay-as-you-go pricing. That rationale applies to state storage too, and it rules out some common solutions. There are few options for pay-as-you-go SQL databases. Rented hard drives are also inconvenient for pay-as-you-go use cases, because using a drive requires a virtual machine or other connection setup, and those tend to be billed by capacity rather than usage.

My go-to storage systems are NoSQL databases and object stores. Object stores are good for files, especially large files or public files. NoSQL databases are probably the cheapest type of cloud database for small-scale personal systems. Both object stores and NoSQL databases are available similarly (and cheaply) from multiple cloud vendors, so a system that restricts itself to those solutions can be backed up to a second cloud vendor, or migrated to a new one, relatively easily. Both scale well to handle spikes in load.
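As one concrete example of how that split looks, here's a sketch using AWS's S3 and DynamoDB via boto3 (the bucket and table names are made up, and another vendor's equivalents would work the same way):

```python
import boto3

s3 = boto3.client("s3")
posts = boto3.resource("dynamodb").Table("posts")  # hypothetical table

# Large or public files go in the object store...
with open("header.png", "rb") as f:
    s3.put_object(Bucket="my-blog-assets", Key="header.png", Body=f.read())

# ...while small structured records go in the NoSQL database.
posts.put_item(Item={"slug": "thinking-about-state", "date": "2020-12-22"})
```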

One of my housekeeping tasks this week has been to take an inventory of the state I'm keeping in various subsystems. Since all state comes with the same four challenges, one of my goals is to handle those challenges as efficiently and consistently as possible. I hope to write a follow-up post detailing the specific approach I'm taking to organize state within my system.


  1. It's an interesting case though. If I were a DNS provider--if DNS records were what I was storing--then a DNS record would count as state. The term application state can be useful for indicating context--DNS records are not part of the application state of a Wordpress blog, since they're not within that application. But for a DNS server, the DNS records are part of its application state. ↩︎

  2. In some cases this is harder than others. If I have a database of receipts, and each receipt has a field called "date," but I decide that I want that field to be called "time" instead, it's pretty easy to make that change even if there are lots of records, because all I'm doing is renaming something that already exists within the record. But if I have a database of all of the photos I've taken over the past year, and I decide that I want to start storing GPS coordinates, I either have to go back to every image and figure out where it was taken, or else I have to accept that I can't always count on GPS being present. ↩︎

  3. In some contexts, access control and availability are very separate concepts. Here I think it works to lump them together. ↩︎

  4. Coding mistakes could lead to situations where some of a set of related records are updated but others are not, like the post / posts list example above. System malfunctions, like a drive running out of space, can cause some records to fall out of sync with others. Data structure changes can erase data needed for consistency or can update data in inconsistent ways--strictly, this probably falls under "coding error" but it's a significant-enough category that it deserves its own mention. ↩︎

  5. Not testing backups is a longstanding and devastating tech tradition. ↩︎