discouraging http bot probes on server

Hi,

I get steady probes to my HTTP server from what appear to be bots. Nothing that's causing any disruption of service. My question: what is the best way to discourage this? They seem to be GETing the same pages over and over, which don't have any meaningful content on my server. Should I just have those routes respond with a 404, or is there another status that's better at convincing the bots to move on?

Thanks,

-Adam

Comments

  • Did you install fail2ban?

  • Return code 418 on requests, along with an image of a teapot. Works every time. Add the words "CONFIDENTIAL GOVERNMENT PUNISHMENT PENALTY AUTHORIZED USERS ONLY" and lots of bots will flag sites like that as government and mark them not to be scanned again.
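
    A minimal nginx sketch of that idea, if nginx is your web server; the matched paths and the /teapot.jpg image are placeholders for whatever your bots keep requesting:

    ```nginx
    # Hypothetical example: answer known probe paths with 418 and a small HTML teapot page.
    # The paths and /teapot.jpg are placeholders; adjust to what actually shows up in your logs.
    location ~ ^/(wp-login\.php|xmlrpc\.php|phpmyadmin) {
        default_type text/html;
        return 418 '<html><body><img src="/teapot.jpg" alt="I am a teapot"><p>CONFIDENTIAL GOVERNMENT PUNISHMENT PENALTY AUTHORIZED USERS ONLY</p></body></html>';
    }
    ```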

  • If you can clearly identify them as bots based on their user agent, redirect them to localhost ;)
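
    If nginx is in front, something along these lines could do it; the user-agent pattern below is purely illustrative, not a vetted list:

    ```nginx
    # Hypothetical example: send requests with scanner-looking user agents back to themselves.
    # Tune the regex to the agents you actually see in your logs.
    if ($http_user_agent ~* "(masscan|zgrab|python-requests|libwww)") {
        return 302 http://127.0.0.1/;
    }
    ```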

  • If you can clearly identify them as bots based on their user agent, redirect them to localhost ;)

    I like this :)

  • @AdamM said:
    They seem to be GETing the same pages over and over, which don't have any meaningful content on my server. Should I just have those routes respond with a 404, or is there another status that's better at convincing the bots to move on?

    It depends on what bots are trying to get at what resources. As I've noted on another thread, I think that common/expected files should get a 2xx code when you simply have no content to return. For spiders that come in scanning for things like PHP exploits, though, I let fail2ban stop the immediate abuse, and then I drop the whole network of the ones that bother me too much directly into the firewall. A 404 isn't going to stop a bot that's intentionally behaving badly.
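
    Roughly what that could look like on the nginx side, assuming nginx and treating the file names as placeholders; the network-wide drops happen in the firewall, outside the web server entirely:

    ```nginx
    # Hypothetical sketch: common/expected files that don't exist get an empty 2xx
    # so well-behaved clients stop retrying them; everything else keeps its normal 404.
    location = /favicon.ico          { access_log off; return 204; }
    location = /apple-touch-icon.png { access_log off; return 204; }
    # Exploit scanners are left to fail2ban, and persistently abusive networks to the firewall.
    ```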

  • Some people go for the whitelist approach. Normally you'd just want Googlebot, Bing, Yandex, possibly the Archive bot to come grab your stuff. Ban everything else, especially so if they didn't look for or obey robots.txt. If you're feeling extra stingy, you'd want to do a reverse and forward DNS lookup for new IPs to ensure the User Agent is what it says it is.

  • joepie91 (Member, Patron Provider):

    If they're existing resources, just let them crawl them. You can always ban individual user agents / crawling patterns if they become problematic. If they're non-existent resources, they should be returning a 404 anyway, regardless of bot behaviour.

    @ricardo said:
    Some people go for the whitelist approach. Normally you'd just want Googlebot, Bing, Yandex, possibly the Archive bot to come grab your stuff. Ban everything else, especially so if they didn't look for or obey robots.txt. If you're feeling extra stingy, you'd want to do a reverse and forward DNS lookup for new IPs to ensure the User Agent is what it says it is.

    That's really not a good idea, and is just going to lead to people pretending to be browsers. No need to help grow monopolies.

  • Oh, I was just paraphrasing someone I know who's dealt with this kind of thing for 15 years. You'd also want to set up a spider trap.

  • Redirect those bots via HTTP 302 to the biggest file you can find on the internet.
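
    In nginx terms that would be something like the sketch below; the path and the target URL are placeholders, not recommendations of any particular file:

    ```nginx
    # Hypothetical example: a probed path gets a 302 pointing at a very large file hosted elsewhere.
    location = /wp-login.php {
        return 302 https://example.com/some-very-large-file.iso;
    }
    ```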

  • sin (Member)

    For my nginx installs I set up fail2ban. I used https://petermolnar.net/secure-wordpress-with-nginx-and-fail2ban/ as a base, then tuned it and added my own custom filters, and it works really well.
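
    As a rough idea of the shape of such a jail (a sketch, not that guide's exact config; "nginx-probes" stands in for a filter you'd write yourself, and the thresholds are placeholders):

    ```ini
    # Hypothetical jail.local entry; the filter name and numbers are placeholders.
    [nginx-probes]
    enabled  = true
    port     = http,https
    filter   = nginx-probes
    logpath  = /var/log/nginx/access.log
    maxretry = 5
    findtime = 600
    bantime  = 86400
    ```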

  • I know it's not that effective, but give robots.txt a go as well.
    It might stop at least a few of them if they're obedient.
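
    A couple of Disallow lines is all it takes; the paths here are placeholders for whatever the bots keep asking for, and only crawlers that actually read robots.txt will care:

    ```
    # Hypothetical robots.txt: discourage crawling of the probed routes.
    User-agent: *
    Disallow: /wp-login.php
    Disallow: /xmlrpc.php
    ```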
