I’m just wrapping up what I think is the final feature for the alpha system: the pageview counter. Since deciding to release the alpha version, I’ve been committed to including both an authoring UI and a visibility system; I wanted the visibility system to include at least a monthly-cost display and some way of measuring traffic to the site. I finished the cost display a couple of weeks ago; as of this afternoon my pageview counter seems to be behaving adequately.
I didn’t initially want to write a pageview counter. I wanted to make a UI where you could simply look at the server logs. The reason I didn’t want to write a pageview counter is that it’s a fundamentally hopeless task. The reason I did write a pageview counter is that I started to understand that it was the feature everyone was going to ask for anyway, so I might as well do a “Proposal 1,” as my old boss used to say, so that people will have something concrete to disagree with. It’s futile, I know it’s futile, but I’ve found that people never accept that something is futile until you let them try to prove you wrong. I think that’s why my grandpa used to give us grandkids those twisted-nail blacksmith puzzles at Thanksgiving—he knew our hubris would be our undoing.
But it exists now, this thing, so I’m going to say what it is and how it works.
Which left me with server logs. Every time anything—a browser, a bot, whatever—requests a web page, the server writes down a record of that request in a log. If you read those logs, you can see records of every time the page has been requested. So we’re done, right? Count the requests for each page and go home.
Of course not. This site, like most small sites, gets far more requests from bots and other nonhuman visitors than from humans. And a significant proportion of those nonhuman visitors intentionally disguise their requests, so that they don’t get filtered out by the sites that try to deny nonhuman traffic. If you filter out all of the self-identified nonhuman traffic, you’re still left with a few probably-human pageviews and a much greater number of probably-nonhuman pageviews.
So how do you tell the difference? You can’t, really, when it comes to a single request. But when you have a set of requests, you can tell that some of them don’t look like human traffic. When you get requests for each page on the site, in order, at a rate of 3 requests per second, those requests are not coming from a person looking at their iPhone, even if that’s what they claim. When you get requests claiming to come from particularly old browsers—especially ones that don’t support modern versions of security protocols—those are also potentially nonhuman. And everything rests on that word—potentially. Whatever you do, you’re going to have to encode your idea of what human traffic patterns should look like. You’re going to have to look at the requests and say “these requests look human, and these other requests don’t look human even though they say they are.” And then you write a program that tries to reproduce your rubric for detecting human pageviews and filtering out nonhuman pageviews.
So here’s the pageview rubric I landed on:
Any request that didn’t get a successful response doesn’t count as a pageview.
Any request that identifies itself as from a nonhuman process doesn’t count as a pageview.
Any time a single IP address makes too many requests very quickly, no request from that IP is counted as a pageview.
Any time a single IP address has an average time between requests of less than about 7 seconds, no request from that IP address is counted as a pageview.
This rubric should work best for people who get a small-to-moderate amount of traffic from a variety of sources. Could it be better? Yes. Is there a one-size-fits-all solution? I doubt it. And for me, over the past day or so while it’s been running, it has matched my intuition fairly closely. For whatever that’s worth.
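The four rules above can be sketched as a small filter over parsed log entries. This is a minimal illustration, not the site’s actual code: the log-entry shape, the burst thresholds, and the bot-token list are all assumptions I’ve made for the example (the post only specifies the ~7-second average-gap rule).

```python
import re
from collections import defaultdict

# Hypothetical log entry shape: (ip, timestamp_seconds, status, user_agent, path).
BOT_UA = re.compile(r"bot|crawler|spider|curl|wget", re.IGNORECASE)
BURST_WINDOW = 10.0  # seconds; illustrative value
BURST_LIMIT = 20     # requests allowed per window; illustrative value
MIN_AVG_GAP = 7.0    # the post's ~7-second average time between requests

def count_pageviews(entries):
    """Apply the rubric and return pageview counts per path."""
    by_ip = defaultdict(list)
    for ip, ts, status, ua, path in entries:
        by_ip[ip].append((ts, status, ua, path))

    counts = defaultdict(int)
    for ip, reqs in by_ip.items():
        reqs.sort()
        times = [ts for ts, *_ in reqs]
        # Rule 3: a burst of requests disqualifies every request from this IP.
        bursty = any(
            times[i + BURST_LIMIT - 1] - times[i] < BURST_WINDOW
            for i in range(len(times) - BURST_LIMIT + 1)
        )
        # Rule 4: an average gap under ~7 seconds disqualifies the IP.
        if len(times) > 1:
            avg_gap = (times[-1] - times[0]) / (len(times) - 1)
        else:
            avg_gap = float("inf")
        if bursty or avg_gap < MIN_AVG_GAP:
            continue
        for ts, status, ua, path in reqs:
            # Rules 1 and 2: successful responses from self-identified humans only.
            if status == 200 and not BOT_UA.search(ua):
                counts[path] += 1
    return dict(counts)
```

Note that rules 3 and 4 operate per IP address across the whole set of requests, while rules 1 and 2 operate per request—which is exactly why the filter groups by IP first.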
Notice that this is a much harder task than counting views within an access-controlled social media system. If these posts were behind a “friends-only” access-control system, then every successful request would be from a specific, known friend, and it would thus be easy to tell how many views there were. If we saw nonhuman traffic in a situation like that, it would (hopefully) mean that our friend was experimenting with writing a bot, but it might also mean that their account was compromised.
I’m going to do this eventually, and it is going to be awesome. Much better than a pageview counter. ↩︎
Each request includes a bit of text called the user-agent string, which is supposed to identify the type of program making the request. For instance, the UA of Chrome running on an iPhone is “Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/91.0.4472.80 Mobile/15E148 Safari/604.1”. The UA strings of popular browsers are a particularly tangled hodgepodge, with origins in the enthusiastically ad-hoc early history of browsers. Bots are supposed to send UA strings that identify them as bots. Many of them do. But some bots intentionally use UA strings that say that they’re browsers. This might be because they’re trying to request things they’re not supposed to, or because they don’t want to identify themselves, or because the site sends different data to different clients and they want what a human user would see. This is a kind of low-grade bad behavior that everybody just lives with, because there’s no way of fixing it that would be better than living with it. ↩︎
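Catching the self-identified bots is the easy half of rule 2, and it amounts to a token scan over the UA string. A minimal sketch, with a token list I’ve chosen for illustration—and, as the footnote says, disguised bots will sail straight through a check like this:

```python
# Common substrings that well-behaved bots put in their UA strings.
# The token list is illustrative, not exhaustive.
BOT_TOKENS = ("bot", "crawler", "spider", "slurp", "curl", "wget", "python-requests")

def is_self_identified_bot(user_agent: str) -> bool:
    """True if the UA string openly admits to being a nonhuman process."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)
```

A bot that lies in its UA string passes this check by design, which is why the traffic-pattern rules (3 and 4) exist at all.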
Specifically, only requests that get a ‘200 OK’ response are counted. This excludes, for example, ‘304 Not Modified’ responses, which the server sends when the user’s browser already has a cached copy of the page. ↩︎
This will cause the system not to count human pageviews when too many human pageviews come quickly from the same IP address, such as when many people on a school or company network all access the same page at once. ↩︎
A further enhancement of this rubric would be to group traffic by the combination of IP address and UA string, rather than by IP address alone. It would be a bit better, but it’s not a priority for me right now. ↩︎