Today is a public holiday in Germany, and I sat down to finally release some new features for Open Access Helper, but sadly I got sidetracked quickly. My end-to-end tests, which use Puppeteer, failed, and so I embarked on a journey of discovery.
I quickly identified the area in my code where the function failed. That code pointed to a service on oahelper.org, which did not return the result I hoped for. I started to investigate and learned that I was hitting the super generous rate limit of 100,000 requests a day at OpenAlex.org. But how could my hobby project reach such a generous limit?
Moreover, other services using the same API worked just fine; the failures only happened from my server. So it was time to look at my access.log, and that quickly painted a dark picture. In the roughly 9 hours between midnight and my initial review of the logs, I found that our dear friends at ChatGPT (OpenAI), Claude (Anthropic) and Amazon had made about 76,000 requests to my server:
User-Agent | Requests
ClaudeBot/1.0 | 49855
GPTBot/1.2 | 17979
Amazonbot/0.1 | 9045
So what to do? I updated my robots.txt to disallow these (and other) bots, but ChatGPT, Claude & Co. tell me that it would take at least 24 hours before that change would be respected.
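For the record, the relevant robots.txt entries look roughly like this – GPTBot, ClaudeBot and Amazonbot are the user-agent tokens the three companies document, and my real file lists a few more:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Amazonbot
Disallow: /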
My service was suffering, but more importantly, openalex.org got all this wasteful traffic that was not helpful to you, my actual users. I did not want to wait 24 hours; I am impatient like that.
The service that was being abused is actually a WordPress plugin. The WordPress plugin architecture has a nifty feature that allows you to send a 403 (Forbidden) error when a condition is matched. Implementing it is basically as simple as:
// Hook into template_redirect and answer with a 403 before any page is rendered.
add_action( 'template_redirect', 'custom_403_error' );

function custom_403_error() {
    if ( <Your_Condition> ) {
        header( 'HTTP/1.0 403 Forbidden' );
        die( 'Bots are not allowed.' );
    }
}
My condition focused on the User-Agent header, doing a strpos() on strings such as 'ClaudeBot', 'GPTBot' and so on. This worked – our friends at Anthropic and Amazon started to get HTTP 403 (Forbidden) errors, but GPTBot kept getting through. Why?
Remember how I pointed out I was looking for the User-Agent header? I made the mistake of assuming that everyone would adhere to that specific capitalization; after all, that's how it is listed in RFC 9110. OpenAI decided to send user-agent – lowercase u, lowercase a – which is perfectly legal, since header field names are case-insensitive per the spec; my lookup simply shouldn't have depended on the casing. I adjusted my code and now my server is sending them a lovely "403 Forbidden" as well. I wish it didn't have to be that hard…
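To give you an idea of the shape of that condition, here is a minimal sketch – not my actual plugin code; it assumes getallheaders() is available, and the helper name is made up for illustration:

// Returns true when the request identifies itself as one of the known crawlers.
function oah_is_known_bot() {
    $blocked = array( 'ClaudeBot', 'GPTBot', 'Amazonbot' );

    // Header field names are case-insensitive, so normalize them before comparing.
    $user_agent = '';
    foreach ( getallheaders() as $name => $value ) {
        if ( strtolower( $name ) === 'user-agent' ) {
            $user_agent = $value;
            break;
        }
    }

    // stripos() keeps the value comparison case-insensitive as well.
    foreach ( $blocked as $bot ) {
        if ( stripos( $user_agent, $bot ) !== false ) {
            return true;
        }
    }
    return false;
}

Plugged into the condition of custom_403_error() above, something along these lines is enough to turn the self-identifying crawlers away.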
If this were a fairy tale, I could now write "and they lived happily ever after", but it isn't a fairy tale.
It isn’t, because I just told you about the “nice bots”, the ones that admit to being a bot, the ones that likely will respect a robots.txt. There are many more scrupulous bots out there, that pretend to be a normal user, with a normal User-Agent string and that will likely occupy me for the rest of the weekend.
Never a dull moment…
Update 2025-05-31:
You would have thought that I cracked it, but I didn't 🙁 After I got the reasonably good guys to give up, I continued to be hammered – albeit at a slower pace – by the not so good guys.
The not so good guys harvest from multiple data centers and jump between IP addresses within the same cloud provider, e.g. Huawei Cloud. Those "guys" also use more "normal" User-Agent strings, i.e. they don't tell you they are a bot. While in my case they were slower, they were just as annoying.
Within the application I needed to protect, there were a few extra measures I could take. For example, I am now checking that requests are initiated from some known "good starting" points, and that seems to be successful.
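As a sketch of the general idea – checking the Referer header against a short allow-list is just one way to express "known good starting points", not necessarily the exact check my server runs, and the hosts below are illustrative:

// Sketch: only accept requests whose Referer matches an allow-list of origins.
function oah_request_has_known_origin() {
    $allowed_hosts = array( 'oahelper.org', 'www.oahelper.org' );

    $referer = isset( $_SERVER['HTTP_REFERER'] ) ? $_SERVER['HTTP_REFERER'] : '';
    $host    = wp_parse_url( $referer, PHP_URL_HOST );

    // Requests without a plausible origin simply fail the check.
    return in_array( $host, $allowed_hosts, true );
}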
Sadly I broke a new feature of Open Access Helper along the way – so I now get to think about what to do 🙂