social system notes

This post contains my running notes on building the social-media features in a new plugin. It’s an experiment; I’m going to publish it publicly because there’s no reason not to, but its primary purpose is as a scratch-pad for my notes. So it will likely change a lot, with content constantly added, amended, and removed, and it may not make a lot of narrative sense. Feel free to ask questions if something catches your eye.

07/19

Started work on the server design. Added options for omitting the cookie-to-auth-header function from the access-control functions, since its only purpose relates to getting browsers access to stuff that won’t be included in this server. Also added an option for non-access-controlled paths in otherwise access-controlled sites, which is necessary for publishing public keys.

Been kinda dreading choosing a public-key format for this—heard a lot of shade thrown at the JOSE suite, which seems to be the most widely deployed. I was very happy to find this article which matches my intuition in the areas I know pretty well, and provides some well-supported thoughts stretching into the areas that I don’t. I will be using JOSE. Specifically, the JWK format for publishing keys seems well-supported and not especially controversial, so I’m feeling happy with it.

Basic plan:

An authoring UI based on the blog-post-authoring one. The main architectural difference is that social-media posts usually don’t have titles, and the user is not expected to supply an ID. This is going to affect both of the existing page designs; without a human-readable post-id, there’s no obvious feature for the list page as it exists. The edit page could be made the same-but-without-a-title-field. But it may make sense to combine the edit and post-list into a single feed page. Questions about that idea:

Does it show all posts (yours and your friends’) or just yours (friends’ posts on a different page), or just your friends’? This has security implications; in the past I’ve tried to avoid displaying external content (such as friends’ posts) on pages where there are js-accessible credentials to minimize the risk of XSS. However this is probably overkill—I already have fine-grained CSP control and I can decide to use html-disabled md as a publishing format.
Where do I generate a jwk? One choice is to use a null resource provider in terraform to generate the jwk using a shell script and upload it to an s3 bucket. The nice thing there is that no other resources need write access, and it needn’t be in plaintext in the tf state. The downside is that the isolation makes it less flexible than e.g. a jwk generator function. I could also generate a jwk in a lambda, but a) would have to look at the quality of randomness available in lambdas (it must be pretty good by now, right?) b) I can’t think of a bootstrapping procedure that I love. For now I’m more interested in the null-resource route.

The function to process the uploaded posts is only a little different than the blog-post one. Instead of dropping posts in a blog bucket, it’s going to create a gzip archive of the main post md and all its imgs. This will be placed in a “posts” directory in the site bucket. The function will also add the post metadata to a dynamo table. The dynamo table will be the index of posts ordered by date—this is what will record what’s new on our site for when friends ask.

The site bucket will be exposed on a cloudfront distribution. The access-control functions on the distribution will receive requests from friends’ sites (signed with the private keys that correspond to their published public keys). They will verify each signature against the published key, make sure the signer is one of our friends, and then pass the request through to the endpoint.

There will also be a post-list endpoint on the cloudfront. This will be access-controlled by the same function described above, but instead of passing through to an s3 origin it will pass through to a lambda that queries the post-index table. It will return a list of posts with metadata—exactly which metadata is tbd.

Reader thoughts:

What kind of durability do we want for friends’ posts? Do we pass them straight to the browser still gzipped? Do we store them in our server side? For how long? How are these things paginated? Maybe from the perspective of a friend, posts should become inaccessible after a given time has passed.
What if one wants to allow all access publicly? In theory it should be as simple as omitting the access-control functions. But keep an eye out in case there are things that could make it difficult.

TODOS:

Rework post entry function from blog into a function for packaging social media posts as described above. Repurpose the existing post table to be the index table for metadata.

Write an access-control function (and I’m pretty sure that it is only one, as opposed to the 5 that were required for oauth) to validate incoming requests after fetching the jwks. Should we cache / pin JWKs from friends’ sites? What jwk rotation strategy, if any, makes the most sense? If a jwk unexpectedly changes and that’s something we want to care about, handling it has to include some kind of user interaction to see if the new jwk is trusted. But if 100% of people are going to click through that without reading it, then what would the point be? It might be good to just cache the known jwks and, if they change, make a new request (and silently use the currently-published one, whichever it happens to be).

Write a script to generate a jwk and store it. For now assume option 1: generate in a null resource in terraform.

Write a polling function to check for friends’ posts. This can maybe be a bit aggressive—every 5-10 minutes or so—since I have so far failed to make a meaningful dent in the lambda free tier allocation (of course, the scaling concern here would be more on the incoming side, where a multitude of requests from connected sites could incur lots of requests. We can cache the response from the index table though). Note that the polling function will be a POC for signing requests with the corresponding private key of our published key. Should the polling function also fetch posts, or should that be delegated? Implicates capacity questions (time limit, mem limit) that may want more specific answers than one-size-fits-all. So on balance the signing stuff should be modular in case we want to split up later.

Rework the authoring UI for social-post-authoring, answering the questions above. Use this to guide the friends’-posts-durability / storage question.

07/20

Copied the blog plugin to start off the social plugin. It seems like the only thing I might need to change on the server side is to have the post_entry (renamed to feed_entry) function replace its publish & img-publish steps with publishing a zipped archive. Steps:

locally download a post & its imgs
write a utility to parse the post, replace the img links with relative paths, archive all the files, return s3-ready buffer. Test w/ local post, imgs
update or insert a step to get the available imgs (possibly filtered down to the used ones)
insert the packaging utility as a generic async function step.
update the published, pinned post metadata as described below.
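A minimal sketch of the packaging utility from the steps above, assuming the imgs have already been downloaded into a dict keyed by their original URL. The archive layout (`post.md`, `imgs/`) and the function name are placeholders I made up for illustration, and it uses a zip archive; it also doesn't handle two imgs with the same file name.

```python
import io
import posixpath
import re
import zipfile

# Matches markdown image syntax: ![alt](url) with an optional trailing title.
IMG_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)([^)]*)\)')


def package_post(markdown: str, images: dict[str, bytes]) -> bytes:
    """Rewrite image links to relative paths and zip the post with its imgs.

    `images` maps each original image URL to its downloaded bytes.
    Returns the archive as bytes, ready to hand to an S3 upload.
    """
    used: dict[str, str] = {}  # original URL -> path inside the archive

    def rewrite(match: re.Match) -> str:
        alt, url, rest = match.groups()
        if url not in images:
            return match.group(0)  # no file for this link, leave it untouched
        used[url] = "imgs/" + posixpath.basename(url)
        return f"![{alt}]({used[url]}{rest})"

    rewritten = IMG_PATTERN.sub(rewrite, markdown)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("post.md", rewritten)
        for url, path in used.items():
            archive.writestr(path, images[url])  # only the imgs actually used
    return buffer.getvalue()
```

Only imgs that the post actually references end up in the archive, which covers the "possibly filtered down to the used ones" part of the plan.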

Next, I’ll need to add a lambda endpoint to query the posts table & get the most recent published posts. There may not be a reliable “published” flag in that table. Idea for that:

publishedPost is a new partitionkey value. Add a new index on a new createDate field.
pinnedPost is a new partitionKey value.
When an authed friend queries for posts, the lambda gives them the most recent 50 publishedPost (metadata only) and all of the pinnedPost metadata. The metadata includes urls to find the post packages. This response can be cached by cloudfront since every friend can see the same post list (if you want circles, deploy more than one social site).
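A rough sketch of what the post-list lambda would return, written against an in-memory list instead of a real dynamo query. The `PostMeta` fields are assumptions; only the publishedPost / pinnedPost values and the limit of 50 come from the notes above.

```python
from dataclasses import dataclass

PUBLISHED = "publishedPost"
PINNED = "pinnedPost"
MAX_RECENT = 50


@dataclass(frozen=True)
class PostMeta:
    post_id: str
    kind: str          # PUBLISHED or PINNED, standing in for the partition key
    create_date: int   # epoch seconds, standing in for the indexed createDate
    package_url: str   # where the post package can be fetched


def feed_list(items: list[PostMeta], limit: int = MAX_RECENT) -> dict[str, list[PostMeta]]:
    """Return the newest `limit` published posts plus every pinned post."""
    published = sorted(
        (item for item in items if item.kind == PUBLISHED),
        key=lambda item: item.create_date,
        reverse=True,
    )
    pinned = [item for item in items if item.kind == PINNED]
    return {"published": published[:limit], "pinned": pinned}
```

Since the result doesn't depend on who is asking, the same response can be served to every authed friend, which is what makes it cacheable.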

UI changes

The index page turns into a feed, but retains a “new post” button that simply takes you to the authoring page. The feed shows you the latest chronological posts, yours & friends. Open question: what’s the strategy for gathering posts? If the requests happen just when you log on, that a) leaks information about your habits b) big latency hit on the ui. otoh, if you query friends every 10 mins, you only need to get the (likely cached) dynamo query results to know if you need to get any post packages. The post packages can be stored in s3 with an expiration time of a week or so (if you have a lot of friends / follows, this could become a significant amount of data if you kept it for long), potentially even rendered to html (by rendering md & being able to inspect imgs, as opposed to letting the distributor do the rendering, we have a bit of security). The holding location for the posts could be explicit-allowed as an iframe location, further isolating the rendered posts from the admin stuff.

The authoring page doesn’t change much, except that there’s no title.

Access Control

Have a null resource generate a jwk file (I think there might be instructions in the access control functions for doing it manually) and store the private key in the admin site s3 (hosted-assets? No, shouldn’t be accessible from UI. Uploads? No, shouldn’t be overwritable from UI. Some new path. Secrets? Config?). Store the pubkey in the public bucket. Permissions to read privkey is an array of role arns (at least the friend-query function, possibly a post-resolver if separate).

Look at indieauth again to see if it’s applicable here—it seems to have put thought into how to treat origin urls. otoh, since we know who our friends are I’m not sure this is critical for this use case.

Friend data should be kept in a dynamo table; the same table should hold the cached pubkeys. When a request comes in, its keyId is compared to the cached one. If they don’t match (or if it’s been a while since we checked), check to see if the site has rotated its key; use the latest published key to verify the request. If it verifies, pass through; otherwise return 401.
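The "compare to the cached key, refetch if stale" decision could look something like this. The one-day threshold for "it's been a while" and the `CachedKey` shape are assumptions, not decisions.

```python
from dataclasses import dataclass
from typing import Optional

MAX_KEY_AGE_SECONDS = 24 * 60 * 60  # assumed value for "it's been a while"


@dataclass
class CachedKey:
    key_id: str
    fetched_at: float  # epoch seconds when we last fetched the published key


def should_refetch(cached: Optional[CachedKey], request_key_id: str, now: float) -> bool:
    """True when the friend's published key should be fetched again."""
    if cached is None:
        return True  # never seen a key for this friend
    if cached.key_id != request_key_id:
        return True  # the friend may have rotated their key
    return now - cached.fetched_at > MAX_KEY_AGE_SECONDS
```

Whatever key is currently published after the refetch is the one used to verify the request; a failed verification maps to a 401.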

Poll

Write friend-poll fn. Try: get privkey, sign requests. Every 10 minutes, poll all friends. If new, fetch, parse, store in post-storage area. Set a 1-week expiration (but what if pinned?)

07/21 08:12

In last night’s description I forgot the friend request / response flow.

UI: another page in the social plugin, for friend management. Can likely be based on the current index page, which does table-based create update delete (but for posts). A “send request” feature in the UI validates, for a given url, that it is an instance of something that can respond to the request (tbh this is a pretty indieweb situation; hcard and the decisions made around indieauth seem relevant here and I remember them being sensible).

Backend: When the UI says “send a friend request to x” the backend does the same sanity check on the url (hcard, w/e) and through that sanity check discovers the location of the friend-request endpoint. It then constructs a friend request, details tbd, and signs it with the priv key associated with a published pubkey (use multiple keys, maybe, so we can distinguish which functions have access to which privkey).

In the friends table, friendships have a marker showing what state they’re in: friend request sent, friend request received, request mutually accepted. Friendships in the mutually accepted state are friendships for the purpose of access control. I’m not sure there needs to be a “request denied” state; at that point the record should be deleted. There might be some advantage to having denied as a “reminder not to resend,” but that energy would likely be better spent on filtering incoming requests from people one has previously denied, rather than flagging outgoing requests. So maybe there should be a special “incoming-request-denied” state, so we can look skeptically at repeated requests from previously-denied sources.

The management UI has a filter & top bar showing counts by friendship state. Each action relevant to the current state is shown in the list.

That means that you can delete existing friendships. As a courtesy, a deletion should result in a notification to the other party (signed with the privkey). However, we should not rely on that, so we should add another friendship state, “disconnected,” in case we get defriended by someone who didn’t send a notification. Disconnected means that we didn’t get a notification but we are no longer able to read from the source. There could be different disconnection states for network (there doesn’t seem to be anything there anymore) and auth (suddenly our published key is no good there, suggesting a deliberate change). A friendship in the disconnected state is not allowed for access control, but it can repair itself automatically to its previous state (beware of escalations here) if the disconnection passes. Repairing a disconnected friendship should be a monitored and measured thing, since it’s an opportunity for shenanigans.

So if a delete notification is received, the record is deleted or soft-deleted. It seems like in most cases soft-delete is the way to go, since it’s nice to have context when responding to incoming stuff.
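A candidate state machine for the friendship states described in this entry, mostly so there's something concrete to poke holes in. The state names and the exact set of allowed transitions are my guesses; in particular it assumes only accepted friendships can become disconnected, so "repair to previous state" always means going back to accepted.

```python
from enum import Enum


class Friendship(Enum):
    REQUEST_SENT = "request_sent"
    REQUEST_RECEIVED = "request_received"
    ACCEPTED = "accepted"
    INCOMING_DENIED = "incoming_request_denied"
    DISCONNECTED_NETWORK = "disconnected_network"
    DISCONNECTED_AUTH = "disconnected_auth"
    DELETED = "deleted"  # soft delete, kept for context


# Assumed transition table: current state -> states it may move to.
ALLOWED = {
    Friendship.REQUEST_SENT: {Friendship.ACCEPTED, Friendship.DELETED},
    Friendship.REQUEST_RECEIVED: {Friendship.ACCEPTED, Friendship.INCOMING_DENIED},
    Friendship.ACCEPTED: {
        Friendship.DISCONNECTED_NETWORK,
        Friendship.DISCONNECTED_AUTH,
        Friendship.DELETED,
    },
    Friendship.DISCONNECTED_NETWORK: {Friendship.ACCEPTED, Friendship.DELETED},
    Friendship.DISCONNECTED_AUTH: {Friendship.ACCEPTED, Friendship.DELETED},
    Friendship.INCOMING_DENIED: {Friendship.DELETED},
    Friendship.DELETED: set(),
}


def transition(current: Friendship, target: Friendship) -> Friendship:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def grants_access(state: Friendship) -> bool:
    """Only mutually accepted friendships count for access control."""
    return state is Friendship.ACCEPTED
```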

07/22

Didn’t have any new realizations that I’d forgotten anything last night. Yesterday I did that bit of planning and added a social table but that was about it for this project. Today I hope to get the post-publish step to deposit a zip of the post contents in the public bucket.

07/24

Last night I got the post-packaging working. Today I want to go over the total post-entry function and make sure it all still makes sense. I think I have two good options about what to do next. I could start working on the access-control stuff, or I could work on the reader display of the packaged posts. I can’t think of a reason to prefer one over the other; I don’t see any obvious dependencies. I feel like I’d rather start with the access control. And that could easily turn into finishing the whole friend request flow before moving to the reader piece. That’s pretty on brand for me—first work on the plumbing and leave presentation for last.

If I want to start on the access control, what does that entail? First would be the null resource key generator. Answer the following questions:

What are the details of the key—algorithm, alg params—use something common for a signing keypair
Where exactly is the private part of the key stored? Is there already a path in the plugin structure where the plugin can store things intended to be secret from both the UI and other plugins, or do I need to make a new place?
Where exactly is the public part of the key stored on the website? It should probably be under the `.well-known` path, but where exactly.

07/25

The key will be EdDSA (JOSE algorithm EdDSA).
The private key will be stored in a new path in the admin bucket. Each plugin may give roles read access but not write access to the path. the terraform account will be used to write the privkey.
The public part of the key will be stored at the path /.well-known/microburin-social/keys/social-signing-public-key.jwk . According to rfc8615, applications that want to use a well-known path should choose one specific to them. I have decided to name this social plugin “microburin” or “microburin-social.” Also according to 8615, applications may define additional path components, and you can never have enough namespacing, so even though this is the only file I’m giving it a path segment too. The mime type is "application/jwk+json" according to rfc7517. Another thing in there is the option to publish a jwk-set in one file. I wonder if I’ll want to do that—right now it seems prudent to preserve the ability to access-control each key separately by making them separate files.
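For reference, a sketch of what the published file would contain. The field layout follows the OKP key type from rfc8037; the `kid` value, the `use` field, and the helper names are my own choices for illustration, and the all-zero key in the usage note is a placeholder, not a real key.

```python
import base64
import json

WELL_KNOWN_PATH = "/.well-known/microburin-social/keys/social-signing-public-key.jwk"
JWK_CONTENT_TYPE = "application/jwk+json"


def b64url(data: bytes) -> str:
    """base64url without padding, as JOSE expects."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def ed25519_public_jwk(raw_public_key: bytes, key_id: str) -> str:
    """Serialize a raw 32-byte Ed25519 public key as a JWK document."""
    if len(raw_public_key) != 32:
        raise ValueError("an Ed25519 public key is 32 bytes")
    jwk = {
        "kty": "OKP",        # octet key pair
        "crv": "Ed25519",
        "x": b64url(raw_public_key),
        "alg": "EdDSA",
        "use": "sig",
        "kid": key_id,       # hypothetical key id
    }
    return json.dumps(jwk, sort_keys=True)
```

Calling `ed25519_public_jwk(bytes(32), "example")` shows the shape. The private part would add a `d` member, which must never appear in the file served from `WELL_KNOWN_PATH`.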

07/26

Got the pub / privkey set up. I’m feeling like I’d rather work on the access control function than the UI next. Steps would be:

determine request format. The request format must include the caller’s domain name and a signed token. The token should be the concatenation of at least the domain name and a timestamp within 5-10 seconds of the current time. A maybe-stronger thing would be to agree on a shared secret at friend-accept time and include that too. But that’s some complexity. EDIT: also include the recipient’s id in the token to prevent replays. Maybe a specific resource?
validate the token—write a simple validation script. Test that it fails if the signature is invalid or the timestamp isn’t current or the token isn’t signed for us.
Add the validator as the check-auth on the social site cloudfront dist.
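A sketch of the validation checks from the list above. The payload layout (newline-joined origin, recipient, timestamp) and the 10-second window are assumptions. Signature verification is passed in as a callable because the actual EdDSA check needs a crypto library; this block only shows the order of the checks.

```python
from dataclasses import dataclass
from typing import Callable

MAX_SKEW_SECONDS = 10  # assumed freshness window


@dataclass(frozen=True)
class Token:
    origin: str        # the caller's domain
    recipient: str     # the domain the token was signed for
    timestamp: int     # epoch seconds at signing time
    signature: bytes


def signed_payload(token: Token) -> bytes:
    """Rebuild the bytes the caller is assumed to have signed."""
    return "\n".join([token.origin, token.recipient, str(token.timestamp)]).encode()


def validate(
    token: Token,
    our_domain: str,
    now: int,
    verify: Callable[[bytes, bytes], bool],
) -> str:
    """Return "ok" or the name of the first check that failed."""
    if not token.origin or not token.signature:
        return "missing-fields"
    if token.recipient != our_domain:
        return "wrong-recipient"  # signed for someone else: possible replay
    if abs(now - token.timestamp) > MAX_SKEW_SECONDS:
        return "stale"
    if not verify(signed_payload(token), token.signature):
        return "bad-signature"
    return "ok"
```

The cheap checks run first so that a malformed or replayed token never reaches the signature step.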

07/27

Started on the validator function. Ideally, it would work like this:

On viewer-request event^[1], parse the token sent by the requester. Sanity-check the params (timestamp, recipient are valid, signature and origin are present)
If the params pass sanity test, query dynamo for the most recent list of our connections. Check that the origin is a connection in a good state.
If the origin is indeed one of our connections, get their public key. Reconstruct the payload from the values we already validated, and make sure the signature is valid.
If the signature is valid, pass through the request to get a cached or fresh response

Unfortunately, viewer-request functions have a max package size of 1MB compressed, and the dynamodb lib alone seems to have a compressed package size of 3MB. So as far as I can tell, there’s no way to use an AWS-published js library to query dynamo within a viewer-request edge function. Which is annoying.

So if I want to use caching, I can’t query dynamo from the edge function. And since all the aws libs will likely have the same size, I basically can’t use AWS libs at all to get the list. Which leaves two options: I can put it on an otherwise-controlled HTTPS endpoint or I can bake it into the function package. Baking it into the function package would require repackaging and redeploying the function each time there’s a new connection, which would be hugely complicated and awful. So that leaves the HTTPS endpoint. Which needs to be secured with something other than AWS credentials because to use them I’d need an aws lib.

So what I’m thinking is, make a separate function that queries the dynamo table. Put it behind an api gateway. Use tf to build the function package with a long random password without which it won’t return results. Build that password into the edge lambda package. Have the edge lambda use the password to get the list over https. The list can also be a list of the hashes of the valid connections, salted with a single salt known by the access-control function (re-using salt, I know…). This minimizes the disclosure if the password is leaked somehow: the adversary gets a list of hashes (and does know how many connections we have) but without the salt they can’t easily compare it to a pre-generated list. If they get the list of hashes and the salt, there’s probably no secrecy left, but that seems like a lot of effort for something that still doesn’t get you a valid credential.
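The salted-hash list could be as simple as the following. SHA-256 and the lowercase normalization are my assumptions; the notes only say "hashes, salted with a single salt."

```python
import hashlib
import hmac


def connection_hash(salt: bytes, domain: str) -> str:
    """Hash one connection's domain with the shared salt."""
    return hashlib.sha256(salt + domain.lower().encode("utf-8")).hexdigest()


def published_list(salt: bytes, domains: list[str]) -> list[str]:
    """What the password-protected endpoint would return."""
    return sorted(connection_hash(salt, domain) for domain in domains)


def is_connection(salt: bytes, origin: str, hashes: list[str]) -> bool:
    """What the edge function would run against the fetched list."""
    candidate = connection_hash(salt, origin)
    return any(hmac.compare_digest(candidate, known) for known in hashes)
```

The list still reveals how many connections there are, as noted above; it only hides which domains they are from someone who lacks the salt.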

07/31

The access control function is looking pretty good; it’s time to think about what happens in the happy path—assuming that we are dealing with a system that is allowed to get posts from our system, what should the actual logistics of that look like?

One idea that I like enough to want to try out is the idea of ephemera—things that don’t exist publicly forever (they stop being public after some amount of time). I think that this is a humane design in general—I don’t think it does anyone much good for the entire kaleidoscope of their past selves to be on display for all time—but it’s also technically easy to implement in a fairly cheap way.

The design would look like this:

When you finish composing a post, the text and media are put in a zip archive and saved to an S3 location (this part is done already). The metadata—the post ID, the location of the zip archive, maybe some other stuff—is saved to a dynamo table. One of the things that we would save to the dynamo table is a presigned URL to the zip file.

Our friends can query the dynamo table through a secured endpoint. Since the table includes presigned URLs for the content, that means that they can get the content for as long as the presigned url is valid. Once they have it, they could save it but the amount of data you’d have to pay to store would get big unless you let it expire after a reasonable period—a week or a month.

I like this design because it’s pretty cheap. Specifically, we need to watch out for the way that private access to data scales. If you have 130 friends who query for your posts every 5 minutes, that’s 1.2 million lambda@edge requests per month. That would cost around $2-$4, which might not seem like a lot but it’s more than the whole system costs now. If we also used our own access control for the individual S3 objects, we’d be looking at a multiple of that number^[2]. Using pre-signed s3 urls gives us a way to offload validation of the individual content requests, and creates a convenient time-based management strategy.
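The request count above is easy to reproduce. This only does the counting; it deliberately leaves pricing out, since that depends on rates and function duration I haven't pinned down here.

```python
def monthly_requests(friends: int, poll_minutes: int, days: int = 30) -> int:
    """Number of poll requests received per month from all friends."""
    polls_per_day = (24 * 60) // poll_minutes
    return friends * polls_per_day * days


print(monthly_requests(130, 5))           # 30-day month
print(monthly_requests(130, 5, days=31))  # 31-day month
```

For 130 friends polling every 5 minutes this gives 1,123,200 requests in a 30-day month and 1,160,640 in a 31-day month, so "about 1.2 million" is the right order of magnitude.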

So what would it take to make this happen?

Add presigned URL to the metadata in the posts dynamo table. Includes coming up with a datasource for getSignedUrl. Also implies picking an expiration time. This should be somewhere that it can be shared by e.g. S3 expiries and stuff.
Create the polling function. Should it opportunistically download whatever it can, or should it store lists of presigned URLs? My impulse is to do everything in a scheduled way, rather than on-demand. It is a bit more expensive but one thing the trajectory of web system design teaches is that pretty much no one can be trusted with leaks of behavioral data like when you’re online.
Reader UI

08/01

The presigned url is now in the metadata. I think the next step is probably the polling function. This means I need to make some decisions:

what do I use for now for the polling setup? Default: Every 10 minutes.
Do I get the posts or list them? Where do I put the posts I get? Do I preprocess them at all? If I preprocess them, is that in the same function as I get them? Defaults: Just get the posts, no processing. Drop them in a TBD backend readwrite root. Not in the hosting place or the uploads.
How do I keep track of which posts I’ve already seen? What if a post gets updated? I would say, the metadata includes an event time. The event time is “the last time something changed about this.” We know our polling frequency; we know the last time we polled. We can say “give us everything with an event time later than n.” Note that this means that we have to deal with a case where we get a new version of a post created potentially forever ago. It would also be weird as an author to have whatever I happen to edit (maybe because I’m embarrassed about it) suddenly pop to the top of everyone’s feed. So we should be judicious about publishing and showing edit events. Note that this also means that the feed endpoint will want to cap the length into the past that a connection is allowed to request. Say 1 week?
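The "everything with an event time later than n, but capped at a week back" rule can be sketched as a pure function over the metadata items. The `eventTime` field name and the dict shape are placeholders.

```python
MAX_LOOKBACK_SECONDS = 7 * 24 * 60 * 60  # the proposed one-week cap


def changed_since(items: list[dict], since: int, now: int) -> list[dict]:
    """Items whose eventTime is after `since`, never reaching back more than the cap."""
    floor = max(since, now - MAX_LOOKBACK_SECONDS)
    hits = [item for item in items if item["eventTime"] > floor]
    return sorted(hits, key=lambda item: item["eventTime"])
```

A caller that asks for `since=0` just gets the last week, so a connection that has been offline for a long time can't force a scan of the whole history.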

So the plan for the polling function:

1. Make sure that the post-entry function reliably updates the modified date, bring the modified date into the top level of the dynamo schema (detecting a trend here, since I just did this with the url). Make sure this lists compressed size.
2. Make the feed-list endpoint. Restrict the lookback. Include post size (compressed).
3. Add a backend-readwrite location in the s3 bucket.
4. Add an index on the modified date (can use the same partition key?)
5. Event trigger can be rate(5 minutes).
6. Select connections in the valid state from the connections db, also get the signing private key (parallel).
7. Sign requests for the feed-list endpoint of each connection.
8. Send the requests (parallel). It would be helpful if the responses include the item sizes to help balance what happens next.
9. It seems like a bad idea to expect one function to hold all the post objects in memory before writing any of them. Ideal would be if there’s a way to stream requests to uploads. Next best thing, probably distribute groups of posts to collector functions that each collect a set number of mb. We should reject above a certain size. Actually we should offload as much as possible in any case; since the polling function will run often and often do nothing, we should keep it small.
10. Once we’ve decided who’s requesting the actual post objects, whoever it is should put them in the backend-readwrite location.
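One way to "distribute groups of posts to collector functions that each collect a set number of mb" is a simple greedy split on the reported sizes. The limits are parameters because I haven't picked values, and the reported sizes are untrusted, so the collector would still need to check real sizes itself.

```python
def plan_batches(
    posts: list[tuple[str, int]],
    max_post_bytes: int,
    batch_bytes: int,
) -> tuple[list[list[str]], list[str]]:
    """Split (url, reported_size) pairs into batches for collector functions.

    Returns (batches, rejected): posts over max_post_bytes are rejected,
    and a new batch starts whenever adding a post would exceed batch_bytes.
    """
    batches: list[list[str]] = []
    rejected: list[str] = []
    current: list[str] = []
    current_size = 0
    for url, size in posts:
        if size > max_post_bytes:
            rejected.append(url)
            continue
        if current and current_size + size > batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(url)
        current_size += size
    if current:
        batches.append(current)
    return batches, rejected
```

The polling function would only do this planning and hand each batch off, which keeps it small for the common case where there is nothing new.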

Another note:

Where I’ve been saying “reader UI,” I’m conflating both the UI that shows you posts and the UI that manages connections. I need to start thinking about them separately. I should tackle the connections UI before the posts UI because more things depend on it.

I also need to redeploy my second test site to validate these things.

08/02

I finished through #4 above. The next move is to set up the polling function itself. There’s a bit of a decision here. I haven’t made an enum of the connection states—I really feel like I ought to work out at least a candidate state machine so I can start poking holes in it. I think there’s a start in the 07/21 post; might be all I need.

I was also reconsidering the index keys on the feed-item table. The docs say that there’s no limit to the number of unique sort keys per partition key, but it seems like a bad idea to count on being able to filter through all the posts to get the recent ones. If I can use a > comparison on the partition key, then I can put an index on the modifiedTime column which would probably give me efficient access. If this doesn’t pan out I’ll start googling for solutions again; this has to be a common use case.

I looked it up, and the issue is that I should be specifying a sort key as part of the key condition expression instead of in a filter expression.
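Concretely, the difference is where the comparison goes in the Query parameters: equality on the partition key plus a range comparison on the sort key, both in the key condition expression. The sketch below just builds the parameter dict; the table, index, and attribute names are placeholders, not the real schema.

```python
def recent_items_query(partition_value: str, modified_after: int, limit: int = 50) -> dict:
    """Build DynamoDB Query parameters for items modified after a given time."""
    return {
        "TableName": "feed-items",         # hypothetical table name
        "IndexName": "by-modified-time",   # hypothetical index with modifiedTime as sort key
        # Partition key must be an equality test; the sort key may use >, <, BETWEEN, etc.
        "KeyConditionExpression": "#kind = :kind AND #modified > :after",
        "ExpressionAttributeNames": {"#kind": "kind", "#modified": "modifiedTime"},
        "ExpressionAttributeValues": {
            ":kind": {"S": partition_value},
            ":after": {"N": str(modified_after)},  # numbers are sent as strings
        },
        "ScanIndexForward": False,  # newest first
        "Limit": limit,
    }
```

A filter expression, by contrast, is applied after the items have been read, so it doesn't reduce the work the query does.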

08/04

Yesterday I rewrote the access-control function for social after a friend pointed out that the aws sdk is included in the edge lambda runtime. I got to delete a bunch of code that was making me nervous, so that was nice. Then I went out to the clay studio and for a wander the rest of the day. It was pretty nice.

Today I’d like to get through step #8 above—the polling function gets the connection domains and queries their feed-list endpoints. I may need to clean up the access control function a bit since so far nothing has used it. I also want to get some chores done so I may end up having less time than I’d like. If the day goes sideways I’ll probably set up the test site instead (then again, I may not—I’m not sure I absolutely need it to test this out at this stage)

08/05, early

Just finished a first draft of the connection-polling function.

08/07

This morning I got the initial phase of the poll fn working—it gets the presigned urls of all the connection feed items. Next will be to fetch the posts. For now this will be on a schedule of every 5 minutes—I want to see what it does to the cost.

for the post fetching fn I can create a test event based on some of the responses from the poll fn. I should add a step to the poll fn that filters based on max reported size.

The post fetching fn will need to verify sizes for itself. It will need to stream its input to S3, which means a new access schema. Should the post fetching fn record what it gets in a db?

08/09

I’m waffling about whether this publishing interaction should really be polling-based or push-based. Most of the ways I look at it, push-based would be a more efficient model.

instead of getting {number of connections} poll requests every five minutes, you would need to spend {number of connections} effort once each time you publish a post.
If the process for pushing an item to a connection was to 1) send a request with the item’s size to get a presigned url, then 2) stream the zipped item to an s3 POST, it could be pretty simple.

Things that make me worried about switching:

I’m still thinking through the threat model. It seems like content-length on uploads can be restricted; since the person sending the item can calculate its length, it should be fine to expect to be able to presign a request with the exact value.
Wouldn’t you still want to make provision for catching up on previously-published stuff? Would that be functionally the same as the polling design, but maybe less regularly scheduled?
Would using presigned put urls add additional difficulty (over presigned get urls) for someone trying to implement this outside aws?

It feels like the basic components required for either of these designs are similar, so if I press on with the polling-based solution, I’ll be able to re-use a lot of it if I decide to switch later. I’d rather feel very confident in this choice before releasing again.

08/10

This morning I got the bare-bones post collector working. That was the last thing in the numbered list above; I can work on improvements (needed) or I can move on to another functional area, such as the connection request flow or displaying posts. I think I’d prefer to do the connection flow.

The connections table already exists. It seems likely that it already includes all the information it will need. What is still remaining:

An endpoint for connection requests (not secured or progressively secured; it needs to accept requests from unconnected people; that’s what connection requests are)
A state diagram of a connection request
A UI for managing connections.
A connection-state-changing function that validates that the transition requested is legit and then makes it.

Am I forgetting anything here? This doesn’t seem like much.

08/11

I’m thinking through the connection request flow. I think that, like posts, the efficient way to do this is to use an expiry to prevent buildup. This simplifies the state diagram for connection requests; when a request is sent, it can either be accepted or ignored. If it is ignored, future connection requests are silently dropped until the expiry time of the ignored request. Once a request expires, it is deleted (yes, there needs to be a “permanent ignore” option to mitigate harassment; that’s not especially hard to implement but won’t be in the first version).

One thing that means is that it’s time for me to actually figure out how dynamo expiries work.

It seems like TTL attribute is set per-table, while the TTL value is set per-item. I have a mix of items that should be expired (any connection with a state other than CONNECTION_ESTABLISHED) and the established connections that should not expire. Hopefully I can either not set the TTL attribute for stuff I don’t want to expire or else set a ridiculous TTL (there’s something kind of appropriate about being required to set a TTL for everything; “forever” is kinda another way of saying “I want someone else to worry about cleaning this up after I’m gone”).

I did a diagram. This is what I’m aiming for:

I’m planning to set the expiry time to one week. Clock starts when the request is received, when it is sent, or when a disconnect is detected.
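A sketch of the "don't set the TTL attribute for stuff I don't want to expire" option. Dynamo's TTL works on a per-item number attribute holding an epoch-seconds timestamp, and items that lack the attribute are not expired. The attribute name `expires` and the item shape are assumptions.

```python
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60
ESTABLISHED = "CONNECTION_ESTABLISHED"
TTL_ATTRIBUTE = "expires"  # hypothetical name of the table's TTL attribute


def connection_item(domain: str, state: str, now: int) -> dict:
    """Build a connections-table item; only non-established states get a TTL."""
    item = {"domain": domain, "state": state}
    if state != ESTABLISHED:
        # The clock starts when the request is received or sent,
        # or when a disconnect is detected.
        item[TTL_ATTRIBUTE] = now + ONE_WEEK_SECONDS
    return item
```

TTL deletion isn't instantaneous, so anything reading the table should probably still compare `expires` against the current time rather than assume an expired item is already gone.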

So the steps I need to do:

1. Add a TTL attribute to the connections table, figure out the strategy for non-TTL state.
2. Add a connection-request endpoint. It should receive a request, check if the request is silenced, and if not it should add a ttl-ed pending response record to the connections table. This needs to do its own validation.
3. Add a connection-response endpoint. It should receive a response, check that there is a pending request, update the pending request to established (deleting the TTL so it doesn’t go away).
4. Add a send-request function that tries to send a request to the connection-request endpoint.
5. Add a respond-to-request function that tries to send a response to a received request.
6. Add a step to the connection poll fn where if it gets a 401 the connection is downgraded to broken.
7. Add a low-frequency fn that tries to reconnect broken connections.

I don’t think much needs to be done specifically for the silenced state. In the usual case, a silenced request will just time out on its own.
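The inbound side of the connection-request endpoint could then look roughly like this, ignoring validation of the request itself. The state names other than CONNECTION_ESTABLISHED and the record shape are placeholders.

```python
from typing import Optional

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60
ESTABLISHED = "CONNECTION_ESTABLISHED"
SILENCED = "SILENCED"                   # hypothetical name for an ignored request
PENDING_RESPONSE = "PENDING_RESPONSE"   # hypothetical name for a received request


def receive_request(existing: Optional[dict], origin: str, now: int) -> Optional[dict]:
    """Return the record to write, or None when the request is silently dropped."""
    if existing is not None:
        if existing["state"] == ESTABLISHED:
            return None  # already connected, nothing to record
        if existing["state"] == SILENCED and existing.get("expires", 0) > now:
            return None  # ignored request has not expired yet
    return {"domain": origin, "state": PENDING_RESPONSE, "expires": now + ONE_WEEK_SECONDS}
```

Once the silenced record's expiry passes, the same call produces a fresh pending record, which is the "times out on its own" behavior.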

08/16

I’ve been finding momentum a challenge. I started to set up the connection-request and connection-request-response (the inbound side) as donut-days functions, but then I changed my mind and rewrote them as regular lambdas. I went back and forth on it a lot, because the DD functions are easier to write and maintain, but the regular functions are easier to test by themselves. Here is the rule that I came up with to explain my decision:

When a function has a narrowly-defined purpose, AND it needs to be carefully tested, AND it is sensitive enough that we don’t want to risk it drifting when the donut-days layer is updated, use a regular permissioned_lambda. In other cases, use DD. In general, this means regular lambdas for functions that take input from outside the access-controlled perimeter.

I’m also going to switch the direction of the post-collection from polling to pushing. Again, probably half a week of do-over, but at least I can use some of what I’ve already done.

I have completed through #3 in the list above; I still need to do 4-7. I want to try to get #4 at least done today.

08/18

I think it’s time to start going back and revising this social media design. This is how it goes for me sometimes, most of the time even, on things I haven’t done before. I start with an idea about what I want to make, and from that I make a plan. I start executing on the plan. Sometimes it blows up immediately, but usually I get between 50% and 75% of the way through it before I’ve accumulated so many “I wish I did X differently” moments that it no longer seems worth it to push through, and instead I want to go back and start applying what I learned^[3]. If I’m very unlucky, which happens occasionally, I find out much later that the real problem is somewhere in that last 25-50%, and I shouldn’t have spent so much time on any of it.

There are two main design errors that I need to correct. The first error is setting up post-collection as polling instead of pushing. I started out with a particular narrow idea of one way to preserve privacy (by having automated polling rather than information-leaking on-demand fetching). In my head, there was some way that polling was more “in the operator’s control” than getting notifications from outside. I’ve been thinking about it for a few weeks now and that just doesn’t hold up. Polling adds extra busywork, but it doesn’t affect the balance of the relationship compared to accepting notifications. If you poll, you’re trusting the other person to answer you about the new stuff (and bugging them every five minutes). If you ask for notifications, you’re trusting the other person to proactively tell you about the new stuff (and not bugging them every five minutes). The trust is equivalent; the only difference is the toil.

The second error I need to correct is making the connection status field in the connections database one of the keys. I was focused on the need to select connections by connection status. This remains important, but by making the status part of the object key, “updating” the object to a different status requires deleting the original and creating a new object (because to PUT an object with a different status is simply to create a new connection). So the connection status needs to be a non-key field. It needs an index, and all the functions that depend on it need to query the index.
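A toy model of the key problem, using a dict in place of the table. The `put` and `by_status` helpers and the field names are made up for illustration; `by_status` only imitates what a query against a status index would return.

```python
def put(table, key_fields, item):
    """Mimic a key-value PUT: the key attributes alone decide which item is written."""
    key = tuple(item[field] for field in key_fields)
    table[key] = dict(item)


def by_status(table, status):
    """Stand-in for querying a secondary index on the non-key status field."""
    return [item for item in table.values() if item["status"] == status]


# Status as part of the key: a "status change" leaves the old item behind.
keyed_on_status = {}
put(keyed_on_status, ("domain", "status"), {"domain": "a.example", "status": "REQUEST_SENT"})
put(keyed_on_status, ("domain", "status"), {"domain": "a.example", "status": "ACCEPTED"})

# Status as a plain field: the same PUT overwrites the single item for the domain.
keyed_on_domain = {}
put(keyed_on_domain, ("domain",), {"domain": "a.example", "status": "REQUEST_SENT"})
put(keyed_on_domain, ("domain",), {"domain": "a.example", "status": "ACCEPTED"})
```

After these calls `keyed_on_status` holds two records for the same domain, which is why the old design needed a delete alongside every status change, while `keyed_on_domain` holds one record that can still be found by status through the index-style lookup.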

These things turned into blocking issues last night when I was testing out the delivery functions. I had neglected to include the step where the delivery functions update the connection status (i.e. from “no connection” to “request sent” and from “our response requested” to “accepted”). When I thought about adding that in, I realized the issue with the key. When I realized that I was going to have to do a lap of the codebase to fix the key issue, it seemed like the right time to fix the directionality issue as well.

New plan:

1. In terraform, fix the keys & indexes on the connection table.
2. Adjust the access-control function to rely on the new state-index because status isn’t a key anymore. Per #3, the access-control functions should look for a request body; if they find a body, they should look for a corresponding body-signature and verify it as a condition for success.
3. Reverse the polling function. Most of what’s there is still good; it still needs to get the list of connections and sign requests for each of them. But it needs to send an update rather than asking for updates. Should it add a signature for the payload? Yes, yes it should. The payload should include the post metadata, which is a bit vague but at least needs to include the presigned post URL.
4. The “receive poll” function should probably stay as it is; having an endpoint for listing our posts seems like a good idea even if I don’t have a use for it right now. But there needs to be a post-notification endpoint in addition to it. The post notification endpoint should be inside the security perimeter (i.e. it relies on the access-control fns to screen its input for access-permission) and it should delegate to the post-collection function I wrote for the polling system.
5. The connection-delivery functions should be updated to do the connection-state updates correctly.
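The body-signature idea in the plan above could look roughly like this. Big assumption: HMAC-SHA256 with a shared secret is swapped in here for the real public-key signature, purely because it is available in the standard library. The payload field names `postId` and `presignedUrl` are also hypothetical.

```python
import hashlib
import hmac
import json

# Assumption: HMAC with a shared secret stands in for the real public-key signature.


def sign_body(body: bytes, secret: bytes) -> str:
    """Hex signature over the exact bytes of the request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def build_notification(post_id, presigned_url, secret):
    """Sender side: serialize the post metadata and sign the exact bytes sent."""
    payload = {"postId": post_id, "presignedUrl": presigned_url}
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return body, sign_body(body, secret)


def body_is_acceptable(body, body_signature, secret):
    """Receiver side: a request with a body must carry a valid body signature."""
    if not body:
        return True  # no body, so there is nothing to verify
    if not body_signature:
        return False  # a body without a signature is rejected
    return hmac.compare_digest(sign_body(body, secret), body_signature)
```

The point of the sketch is the rule, not the algorithm: a request without a body passes this check untouched, and a request with a body fails unless a matching signature comes with it.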

Today I have some chores to do, so I might not spend much more time on this exactly.

08/21

In the last few days I completed 1 and 2 above, and updated the access-control function to validate the request body signature. For a minute I thought that I was going to want to redo the connection functions again, but when I thought about it some more I realized that wasn’t the case^[4].

I also found another error that I need to correct—I need to stop allowing spaces in post URLs. It actually works fine, but dealing with those URLs in many contexts is awful (for instance, when you paste one into a word processing thing, it will break the URL on the first space).
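One possible way to keep spaces out, sketched under the assumption that the post ID is derived from user-supplied text and that rewriting is preferable to rejecting; both helper names are made up:

```python
import re


def post_id_from_text(text: str) -> str:
    """Trim the ends, then collapse each run of whitespace into one hyphen."""
    return re.sub(r"\s+", "-", text.strip())


def has_whitespace(post_id: str) -> bool:
    """True if the ID contains any whitespace character."""
    return re.search(r"\s", post_id) is not None
```

With this, `post_id_from_text("  my first   post ")` gives `"my-first-post"`, and `has_whitespace` could serve as a guard on the publish path if rejecting turns out to be the better policy.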

Today’s task is to design a function that receives pushes from our connections, and once done, to reverse the polling function to send pushes to the push endpoint.

Same day, later

I’ve realized that I have a decision to make. When a function protected by social access control gets a request, how specifically does it get the data (including verified data, like origin) from the request? It can’t trust the contents of the request body, because we haven’t semantically verified that the contents of the request are acceptable. So, for instance, it needs to get the origin from the same place where it was verified.

The obvious choice would be to pass through the auth header to the request. Then it can get the verified things from there. But it seems like a bad idea to pass an “authorization” header everywhere.

The conclusion I’ve come to is that the authorization header isn’t. It’s not an authorization. It’s a signature that covers specific pieces of metadata about a request. So on the one hand, it’s ok to pass that information through to places that wouldn’t be good custodians of an auth token. But I shouldn’t call it “Authorization.” I should rename it “Microburin-Signature.”
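A small sketch of what “get the origin from the same place where it was verified” might mean for a protected function. It assumes the access-control layer has already verified the signature header and hands the function the metadata that the signature covers as a dict; the function name and the `origin` field are hypothetical.

```python
def trusted_origin(signed_metadata: dict, body: dict) -> str:
    """Return the origin covered by the verified signature, never the body's claim."""
    origin = signed_metadata["origin"]
    claimed = body.get("origin")
    if claimed is not None and claimed != origin:
        # The body disagrees with what was actually verified, so refuse it.
        raise ValueError("body origin does not match the signed origin")
    return origin
```

The body can still carry an origin for convenience, but in this sketch it is only ever checked against the signed value, never used in its place.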

08/23

The header name change is complete, and I think I’ve touched all the places that the reversal to pushing is going to affect. I’m finding it a bit hard to keep everything in my head; it may be time to try to make a diagram of some parts of this so that it’s easier to visualize. To do that, I need to imagine some unit that determines what’s in the diagram vs what’s out. I could diagram the post-delivery flow as a process.

08/24

The push-flow post updates are working—on post publish, the post is zipped and connections notified. The notification triggers a download (via presigned URL) and an entry for the post is created in the connection items DB.

I’m now on item 5 of the list immediately above: correcting the connection-request functions to update the connection state in the connections DB. Once done with that, I’ll roll back up to the parent unfinished list, from 08/11. Of that list, I’ll still need to do items 6 and 7. And then, hopefully, on to the UI.

08/26

Today I deployed an additional social site (which is honestly pretty cool to be able to just do) so that I’d have two to test between. I hit a snag: the policy for my admin bucket has finally hit the maximum character count. I tried to compress the policy a little bit, but my sense was that there isn’t a lot of slack before the policy semantics are wrong. The other option I have, if I wanted to use it, is for the bucket module to detect when the policy gets too large and use an iam policy assigned to roles instead. But that could get crosswise with an effort to keep the policies-per-role down. I think it’s diminishing-returns territory, and I’m going to accept that there can only be a limited number of plugins for now. Or we could write an access-control function for the bucket. There’s potential for that to be useful in the visibility system as well.

In the near term, I deleted my test blog (which I haven’t used in a while) and added another social plugin. I’m glad I’ve been maintaining isolation at the plugin level; it has been pretty satisfying to see this work so well.

08/28

I think I’m going to move to the UI. There are still a few things to do in the back end, but I want to take some time away from that and then come back to reassess with more distance. Things I want to remember to do at that time:

Provide for broken connections — detecting, correcting
Reread the tests for the connection request functions—there are things that ought to be added.
Don’t just save the notifications to the db—make sure they’re sanitized

I may add to that list over the day or days.

09/01

I’ve made a start on the UI. The decision that I need to make now is: for the connection page, what should the interface be between the js and the dynamo puts that need to happen?

In two cases there needs to be backend help (send request, respond to request). But in ignore / delete connection, we’re dealing with either a single dynamo put or delete. So, give the UI user delete permissions? Or a lambda? Isn’t it the same thing?

Since I don’t have a strong opinion, I’m going to say for now that the UI should get delete permissions on the connections table. It seems like the simplest way to do it and it will be good enough to know whether it needs to be better.

As opposed to origin-request event. viewer-request is on the viewer side of the cache—if you validate then, the passthrough request can get a cached response, which is cheaper than hitting s3 every time. If you don’t do access-control on viewer-request, then you can’t really cache at all, because any requests that hit the cache will not be access-controlled. ↩︎
One of the things that’s a bit unique to this system is that its price is based largely on the amount of attention you get. This is likely to surprise and maybe disappoint people who are used to equating the quantity of attention one holds with success.

On one hand, I intend that there should be other ways to defray those costs; notice that even a very low subscription price, of less than $1 / connection, would be more than sufficient. But on the other hand, I am positively invested in making the design of this system resistant to the strategy of “get as much attention as you can and monetize it,” which I see as a pretty powerful force for not-good in the world. ↩︎
The first time I remember getting in trouble for a writing assignment was third grade. My teacher wanted to teach us The Right Way to do a piece of writing. The Right Way in this case was to do a First Draft on Yellow Lined Paper and then copy it onto a Final Draft on White Lined Paper. I didn’t see the point of writing the whole first draft if I already knew that I was just going to have to do it over, so I just started on the white paper and turned it in. I definitely got in trouble, and I specifically remember being told to go back and write out a first draft. I’m not sure what lesson I was meant to take away from this.

My thinking was, and still is, that it’s usually a good idea to try to get things right the first time. In the usual case, where failure doesn’t result in someone getting hurt, just pick a route and follow it until it either gets you where you want to go or gives you overwhelming evidence that you need to find a different one. And when it’s time to start over with a new route, start on the new route with the intention that it will work out. ↩︎
The crux is that the generic access control fn needs to make sure that the accessor is a connection already, while the connection-request functions need to make sure that the requester is not a connection but is in a particular connection state. ↩︎

Raphael Luckom
