Notes on a test Mastodon deployment

November 21, 2022

I made a Terraform stack for standing up Mastodon instances in AWS. This post is about how I set it up and some general notes. The deployment is configured for my own testing, i.e. it is as small as possible while still indulging my personal preference for well-organized, decoupled, scalable system design. I have not run a real load through it. The prices below are not representative of a real deployment supporting real users.

The Stack

The stack is one basic functioning Mastodon instance with its required services and data stores. It does not include Elasticsearch for full-text search. It has working email and working media uploads to S3.

The network architecture is a VPC with two public subnets, using security groups to lock down traffic. Redis is provided by a small ElastiCache instance; Postgres is provided by a small RDS instance; all the Mastodon containers are running on Fargate. Redis and Postgres only have private IPs and their security groups only allow traffic from Mastodon containers. The Mastodon containers have public IPs but only egress traffic is allowed over them[1]. There is an application load balancer with IP addresses in the subnets; the ALB itself is behind CloudFront. TLS is set up between the ALB and CloudFront, and between CloudFront and the browser. Traffic within the VPC is not encrypted. An S3 bucket handles media storage. SES is used for email. DNS is handled by Route 53. Certificates are from ACM. The secrets are stored in Parameter Store and provided to the containers as environment variables[2]. Everything is provisioned by Terraform except the manual approval of email addresses for SES and the rake task to add a user via the Mastodon CLI.

In terms of cost, the big-ticket resources are:

My main impression of this price list is that it leaves an operator highly exposed to the metered-bandwidth services, particularly CloudFront and S3. Note the distinction between bandwidth pricing and storage pricing: if your storage shoots up to 50GB in a week but then plateaus because things are expiring and getting deleted, you still have about 7GB / day of bandwidth ongoing (and that only counts adding things to storage; there’s probably much more traffic in sending things to users).
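To make the storage-versus-bandwidth distinction concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, the figures are the ones from the example above, and no real AWS prices are used.

```python
def daily_ingest_gb(storage_added_gb, days):
    """Average inbound transfer per day needed to add this much storage."""
    if days <= 0:
        raise ValueError("days must be positive")
    return storage_added_gb / days


def cumulative_ingest_gb(daily_gb, days):
    """Total inbound transfer after `days`, even if stored size has plateaued."""
    return daily_gb * days


# 50 GB added over one week, as in the example above.
rate = daily_ingest_gb(50, 7)
print(f"{rate:.1f} GB/day")  # prints "7.1 GB/day"

# Stored size may sit at roughly 50 GB once old media expires,
# but the transfer meter keeps running at the same daily rate.
month_total = cumulative_ingest_gb(rate, 30)
```

The point of the second function is that a flat storage graph says nothing about how much data moved through the bucket to keep it flat.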

Operational Stuff In No Particular Order

For one-off jobs, like adding an admin user, it’s possible to create task definitions and use Terraform to create pre-filled CLI JSON skeletons. If I were going to run a setup like this long-term, that would be my #1 way of interacting with the services / data stores, etc.
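As a rough sketch of what such a pre-filled skeleton might contain, the snippet below builds the JSON in Python rather than with Terraform's templating. The field names follow the ECS RunTask request shape; every value (cluster, task definition, subnet and security group IDs, container name) is a placeholder, and the tootctl arguments are an assumption that depends on the Mastodon version.

```python
import json


def run_task_skeleton(cluster, task_definition, subnets, security_groups,
                      container_name, command):
    """Build an input document for `aws ecs run-task --cli-input-json`."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                # A public IP is needed to pull images without a NAT gateway.
                "assignPublicIp": "ENABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {"name": container_name, "command": list(command)}
            ]
        },
    }


# All identifiers below are hypothetical placeholders.
skeleton = run_task_skeleton(
    cluster="mastodon",
    task_definition="mastodon-oneoff",
    subnets=["subnet-aaaa", "subnet-bbbb"],
    security_groups=["sg-cccc"],
    container_name="mastodon",
    command=["bin/tootctl", "accounts", "create", "alice",
             "--email", "alice@example.com", "--confirmed"],
)
print(json.dumps(skeleton, indent=2))
```

Saved to a file, a document like this could be passed to `aws ecs run-task --cli-input-json file://add-user.json`, so the only thing typed by hand is the file name.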

You need to set the WEB_DOMAIN environment variable if CDN_DOMAIN isn’t set, because the static asset path doesn’t fall back on LOCAL_DOMAIN.

You need to make sure that the Host header is getting forwarded all the way to your Mastodon deployment, and that it is a hostname that Mastodon recognizes (LOCAL_DOMAIN, WEB_DOMAIN, etc). Otherwise Mastodon will not respond to requests. Note that the Host header that Mastodon receives must be the one that was sent by the requester (i.e. not replaced in transit by the LB, etc). The Host header is part of the signed string used as authentication in HTTP signatures, so Mastodon will not be able to validate and respond to incoming messages if it gets replaced with a different value, even if the different value is one that Mastodon knows about.
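The sketch below illustrates why a rewritten Host header breaks validation. It builds a signing string in the style of the HTTP Signatures draft (one "name: value" line per signed header) and compares what the sender signed with what the receiver would reconstruct. The hostnames are made up, and a SHA-256 fingerprint stands in for the real signature, which would be computed with the sender's private key; this is not Mastodon's actual code.

```python
import hashlib


def signing_string(method, path, headers, signed_headers):
    """Join the signed headers into the newline-separated string that is signed."""
    lines = []
    for name in signed_headers:
        if name == "(request-target)":
            lines.append(f"(request-target): {method.lower()} {path}")
        else:
            lines.append(f"{name}: {headers[name]}")
    return "\n".join(lines)


def fingerprint(text):
    """Stand-in for a signature: any change to the input changes the output."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


SIGNED = ["(request-target)", "host", "date"]

# What the remote server sent and signed (hypothetical hostname).
sent = {"host": "social.example", "date": "Mon, 21 Nov 2022 12:00:00 GMT"}
# What arrives if a proxy swaps in its own hostname.
received = dict(sent, host="internal-alb.example")

as_signed = signing_string("POST", "/inbox", sent, SIGNED)
as_seen = signing_string("POST", "/inbox", received, SIGNED)

print(fingerprint(as_signed) == fingerprint(as_seen))  # prints "False"
```

Because the receiver rebuilds the string from the headers it actually received, the replaced hostname produces a different string and the signature check fails, regardless of whether the new hostname is one Mastodon is configured to accept.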

If one wanted to scale this stack, one would add more sidekiq services with differing queue priorities as has been documented here.

You need to be somewhat careful when deploying to Fargate, especially if you’ve changed the infrastructure. Fargate re-pulls the container image every time a task starts, so if your containers start crashlooping you can exceed the rate limit on Docker Hub. If you wanted you could push images to your own repo to avoid that.

If you are terminating TLS at the load balancer, you need to make sure it is adding the X-Forwarded-Proto header with a value of ‘https’. The ALB seems to do this automatically.
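For illustration, this is roughly the check an application behind a TLS-terminating load balancer performs. It is a generic sketch, not Mastodon's or Rails' actual code, and the function name is made up.

```python
def is_https(headers):
    """Return True if the proxy reported that the original request used HTTPS."""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "x-forwarded-proto":
            # With chained proxies the value can be a comma-separated list;
            # the first entry describes the client-facing hop.
            first = value.split(",")[0].strip().lower()
            return first == "https"
    # No header at all: the app only sees the plain-HTTP hop from the proxy.
    return False


print(is_https({"X-Forwarded-Proto": "https"}))  # prints "True"
print(is_https({"Host": "social.example"}))      # prints "False"
```

If the header is missing, an application that insists on HTTPS will typically treat every request as insecure and keep redirecting, which is the failure mode to look for if this is misconfigured.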

There are autoscaling drives that can be attached to Fargate tasks if one wanted to do media storage on a filesystem. This would probably be cheaper but slower than S3 and would introduce complexity in backups, which you’d have to automate.

You could skip CloudFront and point your DNS at the ALB, which would increase your compute costs but decrease your bandwidth costs.

There is some scope to use spot containers in Fargate, particularly for queue processing. You would probably need to write a controller though, because you wouldn’t want to get starved when spot capacity is unavailable.

Currently I’m not running Redis with a user and password. Security groups lock down Redis to only be reachable from things that are supposed to have access. I think that it could be configured with a user / password using another manually-triggered Fargate task on first deployment.

Under no load at all, it seems like this is costing me about $2.37 / day[3]. That’s a lot for a single-user system, especially considering that under load the storage and bandwidth costs would be a significant additional expense. However, it still seems possible that one would find economies of scale with the right number of users chipping in. From an ops perspective it could be made to run very smoothly with almost no manual intervention.

  1. Otherwise it would be impossible to download the container images without a NAT gateway, which is very expensive.

  2. This is not the most secure option, but depending on who has access to your AWS account I think it could be OK. It seems far less likely that these things would be stolen from your AWS account console than it is that they would be exfiltrated from the running, internet-connected application which also has them. But don’t give any TV interviews with your AWS account open behind you.

  3. It actually seems like less when I look at the change in my bill. I think maybe the Fargate tasks are using fewer resources than their limit and I’m not getting charged for what they aren’t using? Not sure.