Postmortem: Logging cost explosion

December 18, 2020

Yesterday saw two milestones for me, one good and one bad. The good one was that I cut over to this new site. The bad one was that my AWS bill blew up, rocketing from $1.20 to $8.26 over 48 hours due to the quantity of log data I was storing.

If I want to maintain intellectual honesty about my project, I have to acknowledge that cost explosion as a pretty embarrassing unforced error. With this post, I want to describe what happened and what I'm taking away from it.

AWS provides a service called CloudWatch for storing logs and tracking metrics. In an operational context, this is called visibility--understanding how the system is behaving over time. I have been using CloudWatch to store the logs from my cloud functions. CloudWatch charges based on the volume of logs you send to it. The first 5 GB of logs are free, after which logs are charged at $0.50 per GB added. Sometime on 12/16, I used up the last of the initial 5 GB. Between 12/16 and 12/18 I stored an additional 14 GB, for a cost of $6.69.
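As a rough sketch of that pricing rule, assuming a flat free allowance followed by a single per-GB rate (the function and constant names are made up for illustration, not anything AWS provides):

```python
FREE_TIER_GB = 5.0    # free ingestion allowance quoted above
PRICE_PER_GB = 0.50   # USD per GB ingested beyond the free allowance


def ingestion_cost(total_gb: float) -> float:
    """Return the ingestion charge in USD for total_gb of logs in one billing period."""
    billable_gb = max(0.0, total_gb - FREE_TIER_GB)
    return billable_gb * PRICE_PER_GB


print(ingestion_cost(5.0))   # still inside the free allowance
print(ingestion_cost(19.0))  # 5 GB free plus 14 GB billed
```

With the rounded 14 GB figure this comes to $7.00, a little above the $6.69 quoted above, which suggests the true overage was somewhat under 14 GB.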

There are a couple of eyebrow-raising stats in there. The most astonishing is the volume of logs--14GB! For reference, the amount of text in all of the posts in this site totals around 800KB. That means that I generated 17,500 times as much log data as there is text on this site. The second important number is the $0.50 / GB price tag for storing that data. For comparison, S3, a different data-storage service offered by AWS, charges $0.023 per GB per month[1].
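The two comparisons can be checked with a few lines of Python. This is only a sketch: it assumes decimal units (1 GB = 1,000,000 KB) and borrows the 20 GB stored / 6 GB transferred-out assumptions and the $0.09 / GB transfer price from footnote 1.

```python
LOG_KB = 14 * 1_000_000   # 14 GB of logs, expressed in KB (decimal units assumed)
SITE_TEXT_KB = 800        # approximate text size of all posts

ratio = LOG_KB // SITE_TEXT_KB
print(f"log data is about {ratio:,}x the site text")

S3_STORAGE_PER_GB_MONTH = 0.023  # USD, storage rate quoted above
S3_TRANSFER_OUT_PER_GB = 0.09    # USD, transfer-out rate from footnote 1


def s3_monthly_cost(stored_gb: float, transfer_out_gb: float) -> float:
    """Estimate one month of S3 storage plus transfer-out charges in USD."""
    return (stored_gb * S3_STORAGE_PER_GB_MONTH
            + transfer_out_gb * S3_TRANSFER_OUT_PER_GB)


print(f"S3 estimate: ${s3_monthly_cost(20, 6):.2f}")
```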

Let's look at the volume of log storage first, because it demonstrates an important fact about the cloud that critics are quick to point out--there's no built-in circuit-breaker that limits the amount per month that you can end up paying[2]. Cloud services are industrial-scale; it's not unusual for a company's AWS bill to be in the hundreds of thousands of dollars per month. And the better a system is designed to scale--the more easily it handles increasing traffic without breaking--the more vulnerable it is to cost spikes.
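Footnote 2 mentions that you can build your own circuit breaker. A minimal sketch of the decision logic might look like the following. Everything in it is hypothetical: the names, the thresholds, and the straight-line projection are assumptions for illustration, and it deliberately calls no real AWS API. What "throttle" actually does (turning down log verbosity, say) is left to the operator, since deleting data is rarely a sane response.

```python
from dataclasses import dataclass


@dataclass
class BudgetCheck:
    projected_usd: float  # straight-line projection of the month-end bill
    action: str           # "ok", "warn", or "throttle"


def check_budget(spend_to_date_usd: float, day_of_month: int, days_in_month: int,
                 monthly_limit_usd: float, warn_fraction: float = 0.8) -> BudgetCheck:
    """Project the month-end bill from spend so far and choose an action."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must fall within the month")
    projected = spend_to_date_usd / day_of_month * days_in_month
    if projected >= monthly_limit_usd:
        action = "throttle"
    elif projected >= warn_fraction * monthly_limit_usd:
        action = "warn"
    else:
        action = "ok"
    return BudgetCheck(projected, action)


# Hypothetical $5.00 monthly limit, using the bill figures from this post.
print(check_budget(1.20, 16, 31, monthly_limit_usd=5.00))
print(check_budget(8.26, 18, 31, monthly_limit_usd=5.00))
```

A straight-line projection is crude, and it would have caught this incident only after the spike was already on the bill. Something would still need to run the check on a schedule and feed it current billing data.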

This is a good opportunity to discuss some common assumptions. For instance, what's supposed to happen if usage of a system suddenly spikes? In my case, I didn't think carefully enough about how much data I was logging. That's a pretty common error to make and it doesn't raise any important questions--I can fix it, resolve to do better next time, and move on with my life. But what if the spike was due to sudden popularity instead of an error on my part? What's supposed to happen if something I write goes so viral that it blows up the amount of traffic I get?

Everyone who puts content online answers this question, even if they don't mean to. Sites like this one can easily handle almost any amount of traffic, and the bills can be correspondingly large. Some sites run on rented or self-hosted servers. Those servers cost a known amount per month, and can handle a given amount of traffic; when that amount is exceeded, the site becomes unresponsive for some users. Other sites use free-tier hosting from providers like GitHub or Netlify, which impose their own size and traffic limitations. When those limits are exceeded, performance may suffer or the service operators may restrict your use of their platforms. Finally, there are content silos--sites like Facebook, Instagram, and Twitter. On these silos, the most popular content creators are subsidized by the ad revenue generated by everyone who uses the service. If your Facebook page gets hundreds of thousands of hits per month, you're getting a pretty good deal on web hosting. If your page only gets a friends-and-family number of hits per month, your data and eyes-on-ads are paying for hosting for more popular people.

There is obviously no single correct way to host a website--every available method has tradeoffs. I will describe any method where your service (and your bill) scales infinitely with use as an autoscaling method, and anything else as a fixed-scale method[3]. For the purpose of this discussion, any service that you're not paying for yourself, like GitHub Pages, is a fixed-scale method.

The other important distinction is how each service is billed, which can be per request or by capacity. In a pay-per-request system, you pay based on the number of visitors your site gets. This is like a cell phone plan where you're billed at the end of the month according to how many minutes you actually used. In a pay-by-capacity system, you pay for a given capacity, like a phone plan that has "unlimited"[4] minutes but costs the same regardless of how many you use.

In both fixed-scale and autoscaling cases, we are talking about potentially very small amounts of money. The cheapest fixed-scale options are free. Over time, different free-hosting providers rise and fall, but there's always some company willing to host moderate amounts of content for zero dollars. These services can be highly reliable--GitHub's storage and traffic limits for free pages are quite generous--at least for sites that are mostly text. On the other hand, the same properties that make those free options work well--you're not realistically going to get that much traffic--make per-request autoscaling sites cheap too. Before Loggageddon, my monthly bill was about $1.35, of which $1.00 was for two domain records. All of my site data, logs, and traffic cost $0.35 per month. This number grows pretty slowly--for example, if all 330 million Twitter users read this post, the cost to me would be around $190.00[5]. It's interesting to imagine how the media landscape might change if "being popular on social media" were an expensive proposition rather than a lucrative one[6]. Pay-by-capacity systems, whether fixed-scale or autoscaling, are usually more expensive for low-traffic sites, but can be cheaper if you consistently get larger amounts of traffic.
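To see how slowly a per-request bill grows, here is a small sketch using the figures from footnote 5: a 7.2 KB page and $0.08 / GB for CDN delivery. It assumes decimal units and ignores CSS, fonts, and per-request fees; the function name is made up for illustration.

```python
PAGE_KB = 7.2             # size of this post's HTML, from footnote 5
CDN_PRICE_PER_GB = 0.08   # USD per GB delivered within the US, from footnote 5
KB_PER_GB = 1_000_000     # decimal units assumed


def delivery_cost(readers: int, page_kb: float = PAGE_KB,
                  price_per_gb: float = CDN_PRICE_PER_GB) -> float:
    """Estimate the CDN transfer cost in USD for serving one page to each reader."""
    delivered_gb = readers * page_kb / KB_PER_GB
    return delivered_gb * price_per_gb


for readers in (1_000, 1_000_000, 330_000_000):
    print(f"{readers:>11,} readers: ${delivery_cost(readers):,.2f}")
```

The cost is linear in the number of readers, so even the 330-million-reader case comes out to roughly $190.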

Since I want the maximum amount of control and flexibility, and I don't expect a huge volume of traffic, I'm firmly in the autoscaling, pay-per-request category. Using a big cloud provider's static site hosting gets me fairly cheap per-request pricing, access to my own server logs, and the ability to add interesting features to the back-end--things like autogenerating HTML when I upload Markdown--without jumping through integration hoops. My costs tied to traffic volume are the same as those of an equivalent static site, while my costs for management features like "adding a post" are only tied to the number of times I do that[7].

Now let's turn to the second eyebrow-raising stat from my cost explosion--it costs $0.50 / GB to store data in CloudWatch Logs, AWS's cloud logging system. To put that in perspective, if I use the same amount of data from the earlier example, where all 330 million Twitter users read this post, instead of $190.00 to deliver that data to users, it would cost $1,188.00 to store it in CloudWatch. That's a big difference! I can't exactly say why CloudWatch costs so much more than S3, but it doesn't surprise me. It has been my experience that more "directly usable" cloud products, like the ones that let you look directly at log data, often cost orders of magnitude more than the ones that are simply generic computer functions--storing data, moving data, executing arbitrary code--and the differences don't seem to be tied to underlying costs of operating the service. My suspicion is that these services are priced at much higher "value added" rates because managers would rather pay extra per month than justify using one of their own engineers to solve the problem.
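The same 2,376 GB priced at the two per-GB rates makes the gap concrete. This is only a side-by-side of the numbers already quoted in this post; the dictionary labels are for illustration.

```python
VOLUME_GB = 2_376  # 7.2 KB x 330 million readers, from footnote 5

RATES_USD_PER_GB = {
    "CDN delivery": 0.08,
    "CloudWatch ingestion": 0.50,
}

costs = {name: VOLUME_GB * rate for name, rate in RATES_USD_PER_GB.items()}
for name, cost in costs.items():
    print(f"{name:<22} ${cost:,.2f}")

multiple = RATES_USD_PER_GB["CloudWatch ingestion"] / RATES_USD_PER_GB["CDN delivery"]
print(f"ingestion costs {multiple:.2f}x as much per GB as delivery")
```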

So what have I learned? Well, first, I learned--I should say, learned again--that a useful amount of logging when debugging quickly turns into an expensive amount of logging in production. That item on my mental release checklist has gotten another layer of heavy underlining. Second, I learned that for my own good, I should try to only use a small number of different services whose costs I understand well, rather than always trying to use whatever purpose-built solution will get me a capability fastest. These considerations apply to my situation because I am taking a long-term view. If I were burning through investor capital at a rate of $500-600K per month, the incentives would be much more on the side of "glue together the first 5 things that get the job done."
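One common way to keep debug-level logging out of production is to make verbosity an explicit setting that defaults to quiet. The sketch below uses Python's standard `logging` module; the `LOG_LEVEL` variable name and the `"site"` logger name are assumptions for illustration, not a description of how this site's functions are actually written.

```python
import logging
import os


def configure_logging(env=None) -> logging.Logger:
    """Return a logger whose level comes from LOG_LEVEL, defaulting to WARNING."""
    env = os.environ if env is None else env
    level_name = env.get("LOG_LEVEL", "WARNING").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.WARNING  # unknown names fall back to the quiet default
    logger = logging.getLogger("site")  # hypothetical logger name
    logger.setLevel(level)
    return logger


# Production: nothing set, so debug output is dropped before it is ever emitted.
logger = configure_logging({})
logger.debug("full request payload: ...")

# Debugging session: opt in to verbose output explicitly.
logger = configure_logging({"LOG_LEVEL": "debug"})
```

The point of the default is that forgetting to configure anything gives you the cheap behavior, not the expensive one.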

  1. An exact comparison is a bit more complicated. Strictly speaking, CloudWatch only charges $0.03 per GB per month for keeping data. The $0.50 / GB is for the process of adding the data in the first place, which is called ingestion. S3 does not charge for data transfer in from the internet, but charges $0.09 / GB for transfer out. Making realistic assumptions (20 GB stored, 6 GB transfer out), if I had been billed at S3 rates instead of CloudWatch rates I probably would have paid $1.00 instead of $6.29. ↩︎

  2. It's possible to make your own circuit breakers like that, which monitor your metrics or bill over the course of a month and adjust your infrastructure accordingly. But AWS doesn't provide any built-in feature like that, and I can't think of a way that they could offer that feature in a sane way--like, if you set a limit and you exceed it because your data-storage costs get too high, would the circuit-breaker just delete all your data? That doesn't sound great. ↩︎

  3. Here, autoscaling refers to any system that does not have a fixed limit to the amount it can cost per month. I would say that any system with a fixed number of servers, or a fixed data plan, is not autoscaling, since there is a theoretical maximum amount of data that such a system can handle. Any system where the number of servers is not fixed, or where non-server components--such as cloud functions, content-delivery networks, or cloud storage services--can effectively scale infinitely, is an autoscaling system. ↩︎

  4. Strictly, the number of allowed minutes is limited by the number of minutes in the billing period. ↩︎

  5. 7.2 KB * 330,000,000 is 2,376 GB. The AWS CDN charges $0.08 / GB for data within the US, giving ~$190. Yeah, CSS would probably double that number, and the fonts I'm hosting would bump it by an order of magnitude. None of that meaningfully detracts from the main point, which is that the practical cost growth of per-request, autoscaling static-site systems is much less scary than "maybe infinity." ↩︎

  6. It might not be better, but it would sure be different. ↩︎

  7. I would argue that this isn't true of an active CMS like WordPress, where every request invokes a program on the server, though obviously this can be mitigated with caching. ↩︎