Do You Check (Or Even Have) Your Server Logs?

2025-03-18 07:00 by Karl Denninger
in Technology

An interesting pattern was noted here recently -- crawlers that didn't look to be search-engine related.

"Crawlers" request data from your web server "as a person - sort of."  They're robots, and the usual use of them historically is to populate search engines, which if done in a respectful way is of benefit to the site owner.  Respectful robots do this with time and rate limits because they're not paying you anything, they are imposing load, and thus your incentive to let them is that they cause people to visit your site when they query the search engine.  That is, there's value both ways.

There is also a "robots.txt" file you can populate that tells a robot (or all robots) what places it should not index.  There are plenty of reasons to do that; you might, for example, have static content that lives in a file but isn't useful to anyone as a search term, yet is something you display (e.g. your copyright notice.)  Those lookups are entirely wasted on both ends, so you tell the robots not to make them.  You can also list specific robot "identifiers" as "don't index at all," but that's not enforcement, it's a request.
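
The syntax is trivial.  Here's a minimal sketch of such a file; the paths and the bot name are illustrative, not what's actually in this site's robots.txt:

    # Keep all robots out of content that's useless as a search result
    User-agent: *
    Disallow: /copyright.html
    Disallow: /static/

    # Ask one specific (hypothetical) crawler to stay out entirely
    User-agent: ExampleBot
    Disallow: /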

Of course the problem here is that since it's a request, a robot might ignore it.  That's not very nice, but again, it's an "ask."

Of late there has been quite-significant growth in "crawlers" that aren't indexing for public benefit at all.  They're crawling not to populate a search engine but for other purposes, including harvesting your data for AI training or even marketing against your site, and because they're acting maliciously they also ignore your directives in robots.txt.  This is a gross violation of the premise of the web, which is that you scratch my back and I scratch yours: I let you crawl my site because the results of doing so direct readers to my pages (either in a search engine or in advertising bidding.)

Leaving aside the legal issue (copyright) of taking material for other than its published use, and for the commercial gain of the taker, without agreement on compensation (or an explicit exemption) to the publisher -- which is unlawful, by the way -- there is the fact that traffic and processing are not free.  If you're going to take without any colorable claim that the person you take from will benefit, why should they let you do it?

Well, I decided not to, and thus identified a bunch of these aggressive robots that, as near as I could tell by investigation, were either (1) training AIs or (2) collecting data they were then selling to other, competing properties as "search optimization" bids -- that is, using my data to market directly against me.
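
The way you find them is by reading your access logs.  My software is custom, but the general idea is simple enough to sketch in a few lines of Python; this assumes the common "combined" log format, and the file name and disallowed paths are illustrative:

    # Sketch only -- not the code running here.  Rank user-agents by
    # request volume and flag any that touch robots.txt-disallowed paths.
    import re
    from collections import Counter

    # combined format: ip - user [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
    LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"')

    DISALLOWED = ("/copyright.html", "/static/")   # mirror your robots.txt here

    requests = Counter()
    violations = Counter()
    with open("access.log") as log:
        for line in log:
            m = LINE.search(line)
            if not m:
                continue
            ua = m.group("ua")
            requests[ua] += 1
            if m.group("path").startswith(DISALLOWED):
                violations[ua] += 1   # was told "no" and did it anyway

    for ua, n in requests.most_common(20):
        print(f"{n:8d} requests  {violations[ua]:6d} disallowed  {ua[:60]}")

A "crawler" that shows up near the top of that list with a non-zero disallowed count is exactly the sort of actor this article is about.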

Not anymore they're not -- I altered my blog software so that if you're a crawler and get identified as one of those presumed to be acting maliciously or harvesting for AI training, including specifically by ignoring the robots.txt file, you get a nice pink screen that says "Robot identified and rejected; go away" instead of whatever you were trying to access.
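
The block itself lives inside my blog software, so what follows is only a sketch of the shape of it, written as a generic Python WSGI filter; the user-agent fragments are examples of known AI-crawler identifiers, not my actual list:

    # Sketch of the idea as generic WSGI middleware -- not the actual
    # implementation here, which is built into the blog software itself.
    BLOCKED_UA_FRAGMENTS = ("GPTBot", "CCBot", "Bytespider")   # illustrative list

    REJECT_PAGE = (b'<html><body style="background:pink">'
                   b'Robot identified and rejected; go away</body></html>')

    def robot_filter(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(frag in ua for frag in BLOCKED_UA_FRAGMENTS):
                start_response("403 Forbidden", [("Content-Type", "text/html")])
                return [REJECT_PAGE]
            return app(environ, start_response)   # everyone else passes through
        return middleware

    # usage: wrap your existing WSGI application
    # app = robot_filter(app)

In practice you'd key the decision on more than the user-agent string (observed robots.txt violations, request rates, and address blocks), since the worst actors lie about who they are.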

What was the impact of this?

Zero in terms of unique actual users.

But roughly a 30% decrease in terms of bytes transferred!

If you run a web property you need to look into this.  Judging from my data, roughly one third of the data you are processing and transmitting, and thus one third of your operating cost for said processing and transmission, may be consumed by robotic actors who are using that data to deliberately harm your operation, either by training AIs against your material or by marketing directly against you to others -- including, if you sell physical or digital products, your direct competitors!
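
Putting a number on it for your own site is the same log exercise as above, except you total the bytes column instead of counting requests; again a sketch under the same log-format assumption:

    # Sketch: what fraction of the bytes you served went to user-agents
    # you've flagged as hostile?  Same "combined" log-format assumption.
    import re

    LINE = re.compile(r'" \d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"')
    BLOCKED_UA_FRAGMENTS = ("GPTBot", "CCBot", "Bytespider")   # illustrative

    total = robot = 0
    with open("access.log") as log:
        for line in log:
            m = LINE.search(line)
            if not m or m.group("bytes") == "-":
                continue
            n = int(m.group("bytes"))
            total += n
            if any(f in m.group("ua") for f in BLOCKED_UA_FRAGMENTS):
                robot += n

    if total:
        print(f"robots consumed {robot / total:.0%} of {total:,} bytes served")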

Nobody in their right mind would permit this sort of abuse if they knew about it, but I'll bet not one web operator in a hundred knows the scope of it -- until I collected the data, analyzed it in full and implemented the block, which does take a bit of work, it was certainly not clear to me that one third of my traffic volume was in fact these aggressive and harmful (to me) robotic "readers."