Most of the time in the design of this system, I’m able to re-use existing security strategies that other people have come up with and tested. This is generally thought to be the best way to go—making up security systems is notoriously difficult to get right, and even very small flaws can often render the entire system useless. But sometimes I don’t see a feasible way to incorporate any existing design, so I come up with something myself. In these cases, I like to write a blog post calling out exactly what the situation is and what I’ve decided. If I’ve made a mistake, I don’t want to hide it; I want to make it easy to have a well-grounded conversation about the design and its tradeoffs.
EDIT 08/03: A friend on linkedin pointed me to this stackoverflow answer, which suggests that the aws-sdk is included in the lambda@edge runtime by default[1]. I tested that out and it works, which means that the problem described below doesn’t really apply. I’m going to leave this post up because process and notes and mumble mumble.
Problem Statement: When someone’s system asks to see our posts, we need an access-control strategy to decide whether to allow it. Since we expect these requests to be very common—in proportion to the number of connections we have—the access-control function should be as lightweight and inexpensive as possible. Likewise, since each connection will see the same posts, it would be most efficient if the posts could be cached by cloudfront.
The diagram below illustrates the flow of a request through cloudfront.
A few things to notice:
- Each of the four functions in this diagram is optional; if a function isn’t provided, the request or response simply moves past its position unmodified.
- Any of the three steps on the inbound side—viewer-request function, cloudfront cache, origin-request function—is able to short-circuit the rest of the cycle by generating a response. If the viewer-request function generates a response, the response passes directly to the viewer-response function. If a response is found in the cache when the request is inbound, the response is passed to the viewer-response function before being sent to the user. If the origin-request function generates a response directly, it may be cached as it travels back toward the requester.
- Requests that can be served from the cache are slightly cheaper than those that have to be served from the origin. S3 (the origin for most things we care about) charges around $0.004 per 1000 GET requests. Cloudfront charges around $0.0012 per 1000 HTTPS requests. Cloudfront also charges around $0.10 / GB of outbound data. (A worked comparison follows this list.)
- Requests that can be served from the cache will be faster than those that have to go through the origin.
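To put rough numbers on the cost difference, using the prices above: a million requests served entirely from the cache cost about $1.20 in request charges, while a million requests that have to go through to S3 cost about $1.20 + $4.00 = $5.20. The per-GB charge for outbound data applies in either case.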
Based on this situation, when a request comes in asking to see our social posts from someone claiming to be a connection, our best option is to use the viewer-request function to decide whether to allow the access. If the request is valid, it gets passed through and may receive a response from the cache. If it is invalid, the viewer-request function sends a denial response to the requester.
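Concretely, the viewer-request function ends up with roughly this shape. This is only a sketch; isAllowed is a stand-in for the procedure described next, not a real implementation:

```js
'use strict';

// A minimal sketch of the viewer-request handler shape. isAllowed() is a
// placeholder for the access-control procedure described below.
exports.handler = async (event) => {
  const request = event.Records[0].cf.request;

  if (await isAllowed(request)) {
    // Returning the request passes it along toward the cache and origin.
    return request;
  }

  // Returning a response object short-circuits the cycle; cloudfront sends it
  // back through the viewer-response function to the requester.
  return {
    status: '403',
    statusDescription: 'Forbidden',
    body: 'Access denied',
  };
};

async function isAllowed(request) {
  // Placeholder: the real test is described in the rest of this post.
  return false;
}
```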
We want the viewer-request function to implement the following procedure:
At each box, the black arrow shows the next step if the request passes the test, while the red arrow shows the next step if it fails.
The difficulty arises in the box where we get a list of all of our connections. The connections are stored in DynamoDB, so the natural thing to do would be to have the viewer-request function query dynamo directly. However, the viewer-request function code can’t be larger than 1MB, and the AWS-published library to query dynamo is about 4MB all by itself. So we need to think of an alternative way to get the list into that function. We can’t put the list in S3, since the library to query S3 is even larger. We also can’t bake the list into the function, because that would require a redeployment any time our connections change.
I decided to handle this difficulty by creating an HTTPS endpoint that the viewer-request function can query to get a list of connections. At deploy time, terraform will create a password for the connection-list function. The password will be included in the function bundles for both the connection-list function and the viewer-request function. The API path for the list function will include a hard-to-guess string to discourage brute-force attacks—this API path will likewise be distributed to the viewer-request function.
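As a sketch of what that connection-list function might look like—assuming it runs as an ordinary lambda behind API Gateway, where bundle size isn’t a problem and the aws-sdk is available; the table name, attribute name, header name, and environment variables here are made up for illustration:

```js
'use strict';

// Sketch of the connection-list function. Rejects callers who don't present
// the shared password, then returns salted hashes of our connection origins.
const crypto = require('crypto');
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  // The shared password created by terraform and baked into both bundles.
  const presented = event.headers && event.headers['x-list-password'];
  if (presented !== process.env.LIST_PASSWORD) {
    return { statusCode: 403, body: 'Forbidden' };
  }

  // Read the connection origins out of dynamo.
  const result = await dynamo
    .scan({ TableName: process.env.CONNECTIONS_TABLE })
    .promise();

  // Return salted hashes rather than the plaintext origins.
  const salt = process.env.CONNECTION_SALT;
  const hashes = result.Items.map((item) =>
    crypto.createHash('sha256').update(item.origin + salt).digest('hex')
  );

  return { statusCode: 200, body: JSON.stringify({ hashes }) };
};
```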
Rather than returning a plaintext list of our connections, the list function will return a list of hashes, salted with another random value created by terraform and shared with the viewer-request function. The viewer-request function will test whether the origin of the request is one of our connections by concatenating the origin and the salt, hashing that value, and then checking to see whether the hash appears in the list of our connections.
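On the other side, the membership test in the viewer-request function might look roughly like this, using only node built-ins so the bundle stays under the size limit. The config.json file, the header name, and the field names are assumptions; in practice terraform would bake those values into the bundle at deploy time:

```js
'use strict';

// Sketch of the membership test inside the viewer-request function.
// LIST_URL, LIST_PASSWORD, and CONNECTION_SALT are assumed to be written
// into config.json by terraform when the bundle is built.
const https = require('https');
const crypto = require('crypto');

const { LIST_URL, LIST_PASSWORD, CONNECTION_SALT } = require('./config.json');

// Fetch the salted connection hashes from the list endpoint.
function fetchConnectionHashes() {
  return new Promise((resolve, reject) => {
    const req = https.get(
      LIST_URL,
      { headers: { 'x-list-password': LIST_PASSWORD } },
      (res) => {
        let body = '';
        res.on('data', (chunk) => (body += chunk));
        res.on('end', () => resolve(JSON.parse(body).hashes));
      }
    );
    req.on('error', reject);
  });
}

// Hash the requesting origin with the shared salt and test membership.
async function isConnection(origin) {
  const hashes = await fetchConnectionHashes();
  const candidate = crypto
    .createHash('sha256')
    .update(origin + CONNECTION_SALT)
    .digest('hex');
  return hashes.includes(candidate);
}
```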
It would obviously be preferable to simply query dynamo from the viewer-request function. But since that isn’t feasible, I believe that this strategy raises the cost of a successful attack above the value of its reward. If the attacker learns where the endpoint is, they can’t get data without the password. If they learn the password, they can’t test whether a site is one of our connections without the salt. If they have the salt, they can test individual sites but they can’t directly get the plaintext of all of our connections. If they determine all of our connections, they still can’t generate a valid request for our posts unless they control the public key that one of our connections publishes. And since the actual post data is likely to be regular social-media ephemera, it’s unlikely to be worth that kind of effort in the common case.
I’d be very grateful to anyone who wants to help me think this through and poke holes in it; for now it’s what I’m doing.
I’ve still not been able to find the AWS documentation confirming that the aws sdk is built into the runtime, but I tried it and it works. One interesting thing about this is that the
require('aws-sdk')
statement gets you the v2 sdk. AWS seems to be pushing the modular v3 sdk these days. I’ve always used v2, and it made me sad to think that they might replace it (v3 is technically an improvement because it replaces one 62MB(!!) package with smaller modules, but since even the individual packages run to around 5MB, it’s…got room for improvement). So if AWS has built the v2 SDK into the lambda@edge runtime, where people can depend on it for integration with any other aws service, what is their plan for replacing it with the v3 sdk whose main selling point is that it doesn’t include all that stuff by default? I don’t know, but I don’t envy any of the product managers involved.
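For reference, the contrast in import style looks something like this (not code I’m using here, just an illustration of the difference):

```js
// v2: one monolithic package, apparently available in the lambda@edge runtime.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// v3: one package per service, which you bundle yourself.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const client = new DynamoDBClient({});
```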