Early May Check-in: Alpha release thoughts

In a previous post, I announced plans to release an alpha version of a complete system. I have a couple of goals for this release:

It's been almost exactly three months since the last time I released an exercise. Each release forces me to test deploying the system as a whole. I don't usually do this when I'm developing--instead I slowly add things on to my already-deployed infrastructure. This keeps the work moving forward, but it comes with a risk: that some of my iterative changes won't be deployable "from scratch" or that I've cut corners that will require rework. Doing a release is an opportunity to identify and correct any issues like that.
Once the code for the release comes together, I'd like to try to get five people besides me to deploy this system and attempt to use it. I'll re-evaluate this goal once the release code is complete; I'll only solicit volunteers if I really think the system can provide some usefulness to them as well as to me. But this goal remains important whether or not I'm able to achieve it in this version; if I can't demonstrate usefulness-to-others, at some point I'll need to move on to a different project.

Along with those goals, I have some leniencies--specific elements where I know more work is needed but plan to release anyway. This post is going to be a list of those places that I know about so far. In each case, I'm going to describe what is left to do, the effect it will have on the system as a whole, and what the path to completing it might be.

Lambda@Edge Log Groups: The auth system creates some log groups dynamically. A log group is a collection of logs that are managed together in CloudWatch, AWS's built-in log service. Because these log groups are created dynamically, they are not given an expiration date. That means that over time they'll build up in the account. The solution to this will be to write a lambda function that automatically sets expiration times on any autogenerated log groups. The auth system doesn't log much; over the past three months I've accumulated around 3MB of these logs--the size of a single uncompressed image from a nice phone camera.
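
As a rough sketch of what that cleanup function might look like, here is a Python version. The log group prefix and the 30-day retention period are assumptions made for illustration, not values from this system. The client is passed in so the logic can be exercised without AWS; in a deployed function it would be a boto3 CloudWatch Logs client. Lambda@Edge writes its logs in whichever region served the request, so a real version would probably need to run in, or loop over, several regions.

```python
from typing import Any, Dict, List

# Hypothetical values: the post does not name a prefix or a retention period.
LOG_GROUP_PREFIX = "/aws/lambda/us-east-1."
RETENTION_DAYS = 30


def set_missing_retention(logs_client: Any, prefix: str = LOG_GROUP_PREFIX,
                          days: int = RETENTION_DAYS) -> List[str]:
    """Give every matching log group that never expires a retention period.

    `logs_client` is expected to behave like boto3.client("logs").
    Returns the names of the log groups that were updated.
    """
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        for group in page["logGroups"]:
            # CloudWatch omits "retentionInDays" for groups that never expire.
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"], retentionInDays=days)
                updated.append(group["logGroupName"])
    return updated


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point, for example invoked on a schedule."""
    import boto3  # imported here so the helper above can be tested without AWS
    return {"updated": set_missing_retention(boto3.client("logs"))}
```

Log groups that already have a retention period are left alone, so running the function repeatedly should be harmless.
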
Image Management and Deletion: As a part of my goal to enable blog posts, I've included code for uploading images, including them in posts, and publishing them along with the posts. However, I have not written a system for managing the images themselves. I'm also not sure if I'm going to fully implement published-image deletion before the alpha release; it may be the case that once published, images will be accessible via their direct URLs even if the relevant post is deleted. This will also mean that we spend more storage space on images than necessary. I don't think this inefficiency is enough to outweigh the value of the proof-of-concept. I also have some uncertainty about what "image management" should entail--for instance, neither Medium nor Instagram lets you manage pictures independently of posts. Not being a photo-album guy myself, I'm not sure how far to go in that direction, or if I could accurately anticipate how such a system should work. Open to suggestions. A lot of the invisible complexity of having a photo management system and a post-management system is keeping track of interactions between the two--things like "if you want to delete an image through the image management system, but that image is being used in a blog post, how should that be communicated and what should happen?" These types of interactions can add a lot of unpleasant complexity if we're not careful.

Update a couple of days later: After thinking about it a little, I've decided that in this version, images will be tied to the specific post where they are used. This means that when a post is published, its images are published with it, and when it is unpublished (when it is made nonpublic) its images are unpublished as well. This means that images are never "shared" between posts--as on a service like Instagram, if you want to put the same image in multiple posts you need to upload it multiple times. This greatly helps to reduce the circumstances under which an image could remain published after its post was unpublished. It's still a little inefficient.
Theming System: The theming system for the blog is underdeveloped. A "theming system" on a blog refers to the way that you customize the blog's appearance. I feel ok calling what I have a "theming system," because each post is generated from a template. But at this point the templates are deployed by Terraform and not easily configurable. So for the alpha release, all the blogs will look like this blog.
Trail link bugs and blog inefficiencies: I've noticed some bugs in the way the trail links at the bottom of each post are generated. I haven't made it a priority to fix them yet because I'm not sure how valuable those links are--in the logs, I don't see a lot of evidence that people are using them. But maybe people aren't using them because they're broken. In any case, that's more of a superficial thing, and I'm still focusing on foundations.

There's also still an efficiency issue where the blog stores the full text of entries in a dynamo database. I did this because I wanted to enable RSS and Atom feeds quickly, but it really feels wrong to duplicate all that content in such a user-opaque way--ideally, I want a non-practitioner to be able to use just the S3 UI to know everything they need to know about the system, and letting dynamo have such a significant role interferes with that goal. I think the solution to this will be to reduce the amount of content shown in the RSS and Atom feeds so that full text isn't needed in the database. I may even need some convincing on the long-term usefulness of Atom and RSS under the assumption that operator-controlled systems ought to be the norm^[1].
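
One way the reduced feed content could work is for each feed entry to carry a short summary instead of the full post text, so that only the summary needs to live in the database. The sketch below builds such an Atom entry in Python. The function names, the 200-character limit, and the example values are all hypothetical; nothing here reflects how the blog plugin actually generates its feeds.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace


def summarize(text: str, max_chars: int = 200) -> str:
    """Return short text unchanged; otherwise cut at the last space
    inside the first max_chars characters and append an ellipsis."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rsplit(" ", 1)[0] + "..."


def atom_entry(post_id: str, title: str, url: str, updated: str, body: str) -> str:
    """Build an Atom <entry> that has a <summary> but no full <content>."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = post_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", {"href": url})
    ET.SubElement(entry, f"{{{ATOM_NS}}}summary").text = summarize(body)
    return ET.tostring(entry, encoding="unicode")


# Hypothetical post, used only to show the call.
print(atom_entry("urn:example:post-1", "Hello", "https://example.com/posts/1",
                 "2021-05-08T00:00:00Z", "A long post body. " * 50))
```

A reader who wants the full text follows the link to the published post, which already lives in S3.
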
Autosave, Archiving, and Plugin Boundaries: One of the highest priorities of this system is to preserve artifacts of human attention. That means that when the system owner does something that reflects a moment of their attention, such as starting to write a post or uploading an image, the system should attempt to protect that thing from accidental loss in every likely scenario.

One of the most likely scenarios, which I think we've all seen and been frustrated by in the past, is to be interacting with a site when you suddenly get signed out because your session timed out. Session timeouts are a defense-in-depth measure to limit the damage that can be done by an adversary getting hold of a session token--the special password-like thing that your browser uses to keep you signed in to a web app. I don't think that it would be safe to remove session timeouts, so our data-preservation strategy needs to assume that they will happen, and the user will sometimes be logged out while doing stuff.

So the next question to ask is: how do we ensure that even if the user is logged out when doing stuff, they don't lose more than a few seconds of progress? One answer is autosave. This refers to the practice of saving the current state once every few seconds while the user is editing it. That way, if the user gets logged out, they get a version from a few seconds before the logout happened, not from the last time they saved manually. This was the first thing I tried.

Unfortunately, I ran into a couple of resource issues with that approach. The first was the interaction with the archive system. The archive system is designed to back up everything that lands in the system's storage. When combined with autosave, this means that each version of a document that's being edited ends up being saved forever. When autosave runs every five seconds or so during editing, this could mean saving hundreds of not-very-useful intermediate copies of a document. I think it's likely that the archive system needs to change somewhat, but I don't want to change it now^[2]. So I can either accept this issue or I can try to work around it within the blog system itself. I decided on the latter.
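
To put a rough number on "hundreds," here is the arithmetic as a tiny Python function. The 30-minute session length is an assumption chosen for illustration; only the five-second interval comes from the text above.

```python
def autosave_copies(session_minutes: float, interval_seconds: float) -> int:
    """Number of autosaved versions produced by one uninterrupted editing session."""
    return int(session_minutes * 60 // interval_seconds)


# A hypothetical 30-minute editing session with a 5-second autosave interval.
print(autosave_copies(30, 5))  # 360
```

Every one of those versions would land in the archive, even though only the last few are likely to matter.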

It turns out that there is a pretty straightforward solution to session timeouts happening on the browser. Browsers have a feature called local storage that lets website code save things between refreshes. That means that you can go to a website, do some stuff, refresh the page, and your data will still be there. I experimented with this a bit and found a way to use it to preserve data even through session timeouts. This means that autosave could mean "save to local storage," and run every few seconds without misusing archive storage space. But then another problem arises. My security model for plugins requires a boundary between the data from one plugin and the data from another. Local storage doesn't have a built-in way to enforce that kind of boundary between different pages on a single website^[3].
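
The boundary problem can be illustrated with a toy model. The Python sketch below is not browser code; it only imitates the rule that local storage is partitioned by origin (scheme, host, and port) and not by page. The URLs, the plugin paths, and the key name are made up for the example.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
Origin = Tuple[str, str, int]


def origin_of(url: str) -> Origin:
    """Return the (scheme, host, port) triple that browsers treat as the origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 0)
    return (parts.scheme, parts.hostname or "", port)


class LocalStorageModel:
    """Toy model: one key/value namespace per origin, shared by all its pages."""

    def __init__(self) -> None:
        self._by_origin: Dict[Origin, Dict[str, str]] = {}

    def storage_for(self, page_url: str) -> Dict[str, str]:
        return self._by_origin.setdefault(origin_of(page_url), {})


browser = LocalStorageModel()
# Hypothetical plugin pages: two on the same site, one on a different site.
blog = browser.storage_for("https://example.com/plugins/blog/edit.html")
other = browser.storage_for("https://example.com/plugins/other/index.html")
elsewhere = browser.storage_for("https://other.example.net/index.html")

blog["blog:draft"] = "unsaved post text"
print(other.get("blog:draft"))      # unsaved post text (same origin, same namespace)
print(elsewhere.get("blog:draft"))  # None (different origin)
```

Prefixing keys with a plugin name, as in `blog:draft`, keeps plugins from colliding by accident, but it does nothing to stop one plugin's page from reading another plugin's data.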

So again, we can accept the problem or we can try to solve it. This time, I'm going to accept the problem for now. In the future, I propose that each plugin will be able to create a lambda function for encryption and decryption. This lambda function can be the sole keeper of a plugin-specific symmetric key for encrypting and decrypting data. When a plugin wants to store data in local storage, it can call its encryption endpoint with the plaintext, get the ciphertext, and store the ciphertext in local storage. When it needs to decrypt local storage data, it can call the decrypt endpoint with the ciphertext to get the plaintext. The encryption / decryption endpoints are restricted to a single plugin using the plugin isolation strategy described in the previous post.

Another item that I'm not going to resolve in time for this first release is standardizing plugin data-retention policies. A data-retention policy specifies how long data will be saved. All data that gets uploaded gets stored in the archive system, presumably forever. But once data has been processed by a plugin and saved to the archive, how long should we keep the original copy? The reason this question matters is that the archive is like a storage unit--it's not convenient for things that will be used often. So the retention policy for the originals should be as long as we're likely to want to actively modify the data. Does that mean a week? A month? A year? Different for different types of data? I'm not sure.

I started writing this post on May 5 or May 6; I left it in progress for a few days so that I'd have time to capture different issues as I noticed them. I think it now captures most of the big todos, but others will certainly appear.

[1] RSS and Atom are feed formats--they're techniques for publishing a list of posts, articles, etc. that a website has made available. Their design includes the assumption that a reader program wants to get the full list of items, and that the publisher has a plan for managing the list of items so that it doesn't get too long. That seems like underspecification to me--there are enough unanswered questions about how things should work that different programs that follow all the rules might be making different and contradictory assumptions about the questions that the rules don't answer. For now I'm reserving judgment on whether these technologies are useful in this personal-social-media context or if different solutions might be more appropriate.
[2] The archive system is global--it's intended to support everything that this system does. The autosave feature I'm discussing is local--it's a feature of a single plugin (the blog) to the overall system. When I'm considering changes to a global system, I want to see a global reason for the changes--I want to see multiple plugins all "agree" on what the change should be. Here the situation is that one plugin is interacting badly with the archive system. If I change the archive system now, I run the risk of specializing it in a way that makes sense for the blog plugin, but causes problems for other plugins I want to implement in the future. The work that I'm doing on the blog system does suggest that there are changes needed in the archive system, but I'd like to see those suggestions corroborated by my experience with another plugin or two before I commit.
[3] That is, one origin can't see the local storage from a different origin, but all the pages from a given origin share a local storage namespace.

Raphael Luckom
