2022-06-09: More Google misbehaviour

I've written elsewhere about misbehaviour from Google that led me to block them very hard.

Now, I need to give a little background. My FTP area is also exported over HTTP. For the sake of FTP clients, I used to have a symlink ftp->. at the root of my FTP area. I found that both HTTP and FTP clients were trying to fetch things like /ftp/ftp/ftp/ftp/....

So, in an attempt to break such things badly enough for them to get noticed and fixed, I created a directory /crawl which contained three things: a README explaining why it was there and two symlinks, a->. and b->., in the hope that such a crawler would get lost in the power-of-two explosion of paths that would result.
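To put numbers on that explosion, here is a minimal Python sketch. The function names are made up for illustration; only the /crawl directory and the a and b links come from the setup described above. Each extra level of depth doubles the number of distinct paths that all resolve to the same directory.

```python
from itertools import product


def crawl_paths(depth, links=("a", "b")):
    """Return every distinct path that is exactly `depth` link hops below /crawl."""
    return ["/crawl/" + "/".join(combo) for combo in product(links, repeat=depth)]


def total_paths(max_depth, fanout=2):
    """Count all paths of depth 1..max_depth: fanout + fanout**2 + ... ."""
    return sum(fanout ** d for d in range(1, max_depth + 1))


if __name__ == "__main__":
    for path in crawl_paths(2):
        print(path)
    print(total_paths(20))
```

A crawler that naively follows every link to depth 20 would be looking at roughly two million URLs, every one of them the same directory.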

Recently, I discovered that the HTTP server I was using would transparently follow symlinks, hiding their link nature from clients. While this doesn't excuse FTP client misbehaviour, it does explain HTTP client behaviour; the client has no idea that the symlinks are symlinks and thus can't handle them properly.

So I changed the HTTP server. It now presents symlinks as HTML links, resolving the link to the appropriate path and generating a link to that path. I also removed the /ftp link and the /crawl directory entirely, and created a robots.txt that denies access to /ftp and /crawl, so that well-behaved bots would stay away from them.
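For illustration, a robots.txt along these lines can be checked with Python's standard urllib.robotparser. The file contents below are an assumption (the real file is not reproduced here), and ExampleBot is a made-up user agent; the sketch only shows what a well-behaved bot should conclude from such a file.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents; the actual robots.txt may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ftp
Disallow: /crawl
"""


def allowed(path, agent="ExampleBot"):
    """Return True if a bot honouring ROBOTS_TXT may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/ftp/ftp/ftp", "/crawl/a/b"):
        print(path, allowed(path))
```

Disallow rules are prefix matches, so anything below /ftp or /crawl is covered as well.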

Some of the robots.txt documentation I found mentioned that some of the more malicious crawlers use robots.txt as a guide for places to look. So I set up software to watch my logs, noticing ill-behaved clients and dropping their IPs into my border-router blacklist. (I waited a day or two first, so that well-behaved clients would have a chance to notice the new robots.txt.) After a little while, I expanded this to include POST requests (I export no writable webspace) and attempts to initiate SSL (TLS, whatever it's called this week), which I see no excuse for anyone to try to initiate on port 80.
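The three trip conditions can be sketched roughly as follows. This is not the actual log-watching software; it assumes the first bytes a client sent on port 80 are available, and the function name and reason strings are invented for the example. The TLS check relies on a TLS handshake record starting with the bytes 0x16 0x03.

```python
# Paths that robots.txt tells bots to stay away from.
DISALLOWED_PREFIXES = ("/ftp", "/crawl")


def classify(raw: bytes):
    """Return a reason string if the request should trip the blacklist, else None."""
    # A TLS handshake record begins with content type 0x16 and major version 0x03.
    if raw[:2] == b"\x16\x03":
        return "tls-on-port-80"
    try:
        request_line = raw.split(b"\r\n", 1)[0].decode("ascii")
        method, path, _version = request_line.split(" ", 2)
    except (UnicodeDecodeError, ValueError):
        return None  # not a parseable HTTP request line; ignored in this sketch
    if method == "POST":
        return "post"  # no writable webspace is exported
    if path.startswith(DISALLOWED_PREFIXES):
        return "robots-violation"
    return None


if __name__ == "__main__":
    print(classify(b"GET /crawl/a/b HTTP/1.1\r\nHost: example.invalid\r\n\r\n"))
```

Whatever does the classifying, the offending IP address would then be handed to whatever maintains the border-router blacklist; that part is specific to the router and is not shown.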

This has now been in effect for about a week. It's tripped a little over a hundred times.

The aspect that's relevant to this post? Two of those trips have come from Google's webcrawler. On 2022-06-07 10:26:10, 35.185.244.30, 30.244.185.35.bc.googleusercontent.com, tried to initiate SSL; on 2022-06-08 22:20:28, 34.83.64.112, 112.64.83.34.bc.googleusercontent.com, tried to fetch /wp-login.php, something I do not have and never have had.

I'm not sure what to make of this. They might be explained by someone else linking to a nonexistent URL on my host, but I don't see why anyone would do that. It could be an attempt to poison the crawler in question against me, but that would assume that someone (a) noticed that I'm doing something unusual, (b) figured out what I'm doing, and (c) wanted to poison crawlers against me. I'm not sure whether (c) would have been an attempt to harm me, or the crawler, or what, but I don't see any percentage for the offender in it either way.
