Do You Check (Or Even Have) Your Server Logs?

2025-03-18 07:00 by Karl Denninger
in Technology

An interesting pattern was noted here recently -- crawlers that didn't look to be search-engine related.

"Crawlers" request data from your web server "as a person - sort of."  They're robots, and the usual use of them historically is to populate search engines, which if done in a respectful way is of benefit to the site owner.  Respectful robots do this with time and rate limits because they're not paying you anything, they are imposing load, and thus your incentive to let them is that they cause people to visit your site when they query the search engine.  That is, there's value both ways.

There is also a "robots.txt" file you can populate that tells a robot (or all robots) what places it should not index.  There are plenty of reasons to do that; you might, for example, have static content that lives in a file but isn't useful to anyone as a search term, yet is something you display (e.g. your copyright notice.)  Those lookups are entirely wasted on both ends, so you tell the robots not to make them.  You can also list specific robot "identifiers" as "don't index at all," but that's not enforcement, it's a request.
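
The syntax is trivial.  Here's a minimal sketch of such a file; the paths and the bot name are illustrative, not what's actually in this site's robots.txt:

    # Keep all robots out of content that's useless as a search result
    User-agent: *
    Disallow: /copyright.html
    Disallow: /static/

    # Ask one specific (hypothetical) crawler to stay out entirely
    User-agent: ExampleBot
    Disallow: /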

Of course the problem here is that since it's a request, a robot might ignore it.  That's not very nice, but again, it's an "ask."

Of late there has been quite-significant growth in "crawlers" that aren't indexing for public benefit at all.  They're crawling not to populate a search engine but for other purposes, including harvesting your data for AI training or even marketing against your site, and because they're acting maliciously they also ignore your directives in robots.txt.  This is a gross violation of the premise of the web, which is that you scratch my back and I scratch yours: I let you crawl my site because the results of doing so direct readers to my pages (either in a search engine or in advertising bidding.)

Leaving aside the legal issue (copyright) of taking material for other than its published use, and for the commercial gain of the taker, without agreement on compensation (or an explicit exemption) to the publisher -- which is unlawful, by the way -- there is the fact that traffic and processing are not free.  If you're going to take without any colorable claim that the person you take from will benefit, why should they let you do it?

Well, I decided not to, and thus identified a bunch of these aggressive robots that, as near as I could tell by investigation, were either (1) training AIs or (2) collecting data they were then selling to other, competing properties as "search optimization" bids -- that is, using my data to market directly against me.
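
The way you find them is by reading your access logs.  My software is custom, but the general idea is simple enough to sketch in a few lines of Python; this assumes the common "combined" log format, and the file name and disallowed paths are illustrative:

    # Sketch only -- not the code running here.  Rank user-agents by
    # request volume and flag any that touch robots.txt-disallowed paths.
    import re
    from collections import Counter

    # combined format: ip - user [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
    LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"')

    DISALLOWED = ("/copyright.html", "/static/")   # mirror your robots.txt here

    requests = Counter()
    violations = Counter()
    with open("access.log") as log:
        for line in log:
            m = LINE.search(line)
            if not m:
                continue
            ua = m.group("ua")
            requests[ua] += 1
            if m.group("path").startswith(DISALLOWED):
                violations[ua] += 1   # was told "no" and did it anyway

    for ua, n in requests.most_common(20):
        print(f"{n:8d} requests  {violations[ua]:6d} disallowed  {ua[:60]}")

A "crawler" that shows up near the top of that list with a non-zero disallowed count is exactly the sort of actor this article is about.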

Not anymore they're not -- I altered my blog software so that if you're a crawler and get identified as one of those presumed to be acting maliciously or harvesting for AI training, including specifically by ignoring the robots.txt file, you get a nice pink screen that says "Robot identified and rejected; go away" instead of whatever you were trying to access.
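
The block itself lives inside my blog software, so what follows is only a sketch of the shape of it, written as a generic Python WSGI filter; the user-agent fragments are examples of known AI-crawler identifiers, not my actual list:

    # Sketch of the idea as generic WSGI middleware -- not the actual
    # implementation here, which is built into the blog software itself.
    BLOCKED_UA_FRAGMENTS = ("GPTBot", "CCBot", "Bytespider")   # illustrative list

    REJECT_PAGE = (b'<html><body style="background:pink">'
                   b'Robot identified and rejected; go away</body></html>')

    def robot_filter(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(frag in ua for frag in BLOCKED_UA_FRAGMENTS):
                start_response("403 Forbidden", [("Content-Type", "text/html")])
                return [REJECT_PAGE]
            return app(environ, start_response)   # everyone else passes through
        return middleware

    # usage: wrap your existing WSGI application
    # app = robot_filter(app)

In practice you'd key the decision on more than the user-agent string (observed robots.txt violations, request rates, and address blocks), since the worst actors lie about who they are.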

What was the impact of this?

Zero in terms of unique actual users.

But roughly a 30% decrease in terms of bytes transferred!

If you run a web property you need to look into this.  Judging from my data, roughly one third of the data you are processing and transmitting, and thus one third of your operating cost for said processing and transmission, may be consumed by robotic actors who are using that data to deliberately harm your operation, either by training AIs against your material or by marketing directly against you to others -- including, if you sell physical or digital products, your direct competitors!
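
Putting a number on it for your own site is the same log exercise as above, except you total the bytes column instead of counting requests; again a sketch under the same log-format assumption:

    # Sketch: what fraction of the bytes you served went to user-agents
    # you've flagged as hostile?  Same "combined" log-format assumption.
    import re

    LINE = re.compile(r'" \d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"')
    BLOCKED_UA_FRAGMENTS = ("GPTBot", "CCBot", "Bytespider")   # illustrative

    total = robot = 0
    with open("access.log") as log:
        for line in log:
            m = LINE.search(line)
            if not m or m.group("bytes") == "-":
                continue
            n = int(m.group("bytes"))
            total += n
            if any(f in m.group("ua") for f in BLOCKED_UA_FRAGMENTS):
                robot += n

    if total:
        print(f"robots consumed {robot / total:.0%} of {total:,} bytes served")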

Nobody in their right mind would permit this sort of abuse if they knew about it, but I'll bet not one web operator in a hundred knows the scope of it -- until I collected the data, analyzed it in full and implemented the block, which does take a bit of work, it was certainly not clear to me that one third of my traffic volume was in fact these aggressive and harmful (to me) robotic "readers."