High MySQL load from crawlers

bdtech Member
edited May 2013 in General

Is there any way to prevent high MySQL load from a crawler aggressively traversing WordPress pages? Bingbot got me up to 300 percent CPU. Can I use cpulimit on MySQL if the load hits a certain point? What's the best route (MySQL/nginx/FastCGI)?

Comments

  • Are you using any WP caching plugins?

  • Awmusic12635 Member, Host Rep

    It is possible CloudFlare could help you with this.

  • Mun Member

    Install Quick Cache ~ http://wordpress.org/extend/plugins/quick-cache/

    Or limit crawlers with robots.txt.

  • tortau Member

    Install Varnish. Move Apache to listen on a different port, then configure Varnish to listen on port 80 and forward traffic to the Apache port.
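    A minimal sketch of that layout, assuming Varnish 3 (current at the time), a Debian-style init config, and Apache moved to port 8080; the port and file paths are assumptions:

    # /etc/varnish/default.vcl -- Varnish forwards cache misses to Apache
    backend default {
        .host = "127.0.0.1";
        .port = "8080";
    }

    # /etc/default/varnish -- make Varnish itself listen on port 80
    DAEMON_OPTS="-a :80 -T localhost:6082 -f /etc/varnish/default.vcl -s malloc,256m"

    # and in Apache's ports.conf: Listen 8080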

  • bdtech Member

    I use FastCGI cache plus W3TC. It looks like they were crawling individual comments.

  • 24khost Member

    Are you using nginx?

  • bdtech Member

    Yes, nginx with FastCGI cache and W3TC.
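    For reference, a minimal nginx FastCGI cache sketch (the zone name, cache path, and timings here are assumptions, not the actual config from this thread):

    # http{} context
    fastcgi_cache_path /var/cache/nginx levels=1:2 keys_zone=WORDPRESS:16m inactive=60m;

    # inside the PHP location block
    fastcgi_cache WORDPRESS;
    fastcgi_cache_key "$scheme$request_method$host$request_uri";
    fastcgi_cache_valid 200 10m;

    Since $request_uri includes the query string, every distinct ?foo=bar URL is its own cache entry, and a miss the first time a bot requests it.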

  • Mun Member

    Do you know which bot? I have a feeling it is Bingbot.

  • 24khost Member

    PHP-FPM?

  • bdtech Member

    Yes, PHP-FPM. The pages they requested weren't cached; Bingbot was looking at unique comment IDs or something.

  • Mun Member

    Bingbot is known for rushes, where it does a day's worth of queries in a few hours.

  • 24khost Member

    Do you have a forced timeout set?

  • Rallias Member

    @Mun said: Bingbot is known for rushes, where it does a day's worth of queries in a few hours.

    Hmm... I've never seen the official Bingbot doing this. I've seen a fake Bingbot do it to me, but that's the extent of the damage.

  • bdtech Member
    edited May 2013

    I'm going to try robots.txt, as Bingbot apparently honors the Crawl-delay directive.

    User-agent: *
    Crawl-delay: 15
    Disallow: /wp-admin/
    Disallow: /wp-includes/

  • Gien Member

    Don't add disallows, as malicious bots/script kiddies search for those disallow entries.

    Try changing the folder/directory names to non-standard ones.

    You can set the crawl delay for Bing only, or for all bots; just Google it.
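    For example, a Bing-only delay block (reusing the 15-second figure from the robots.txt above) would look like:

    User-agent: bingbot
    Crawl-delay: 15

    Crawlers without a group of their own keep using the generic User-agent: * section.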

  • Adduc Member

    Log into Bing's webmaster interface. It'll let you slow the crawl rate from there, as well as identify the best time to crawl.

  • sleddog Member

    @Gien said: Don't add disallows, as malicious bots/script kiddies search for those disallow entries

    Disallow: /honeypot/
    

    And have some fun with it.

  • @sleddog said: And have some fun with it.

    Oh yeah.
    I once had a site on which I put hidden links (both text and images) and robots.txt entries for the same honeypot. Around 150 bots hit that page.

  • bdtech Member

    @Gien I think the 20-plus wp-content references on every page pretty much give WP away.

  • Gien Member

    @bdtech Yeah, you can always tell if it's a WP site, but there are ways to hide it a bit.

    Also, you don't want extra attention on your admin panel.

    Add another layer with .htaccess in your wp-admin, wp-includes, etc.

    If you can edit your iptables rules or hosts.deny file, you can add the IPs of the malicious bot(s).
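    For example (203.0.113.50 is a placeholder address, not a real bot IP):

    # drop all traffic from one abusive crawler IP
    iptables -A INPUT -s 203.0.113.50 -j DROP

    # or, via TCP wrappers, add a line to /etc/hosts.deny
    # (only covers services linked against libwrap):
    ALL: 203.0.113.50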

  • sleddog Member

    @Gien said: If you can edit your iptables rules or hosts.deny file, you can add the IPs of the malicious bot(s)

    If it's really an issue, use fail2ban. Then it's managed automatically.

  • bdtech Member

    Which fail2ban rule do you use?

  • trazx Member

    I had a similar issue once. Register for Bing Webmaster Tools (www.bing.com/toolbox/webmaster) and adjust the robot crawling speed from their controls; you can reduce it so that instead of crawling your site in an hour, it does so gradually and keeps your load down.

  • sleddog Member

    @bdtech said: Which fail2ban rule do you use?

    I don't. I was just following up on the honeypot concept for dealing with malicious bots that abuse "Disallow" entries in robots.txt. Legitimate bots like Google and Bing can be rate-limited as discussed. For malicious bots (see the sketch after this list):

    • Add a "Disallow: /honeypot/" entry to robots.txt.
    • At /honeypot/ put a script that does nothing but log the IP of the visitor (to honeypot.log). For informational purposes you may also want to record date/time and user agent.
    • Configure fail2ban to monitor honeypot.log. Every entry gets banned, so the rule shouldn't be too complex :) Lift the ban after ~24 hours so iptables entries don't grow endlessly.
    • Rotate honeypot.log with logrotate.
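    A minimal sketch of that setup, assuming nginx fronts the site; the paths, file names, and the 24-hour bantime are assumptions:

    # nginx: give the honeypot path its own access log and deny the request
    location /honeypot/ {
        access_log /var/log/nginx/honeypot.log;
        return 403;
    }

    # /etc/fail2ban/filter.d/honeypot.conf -- any request line in that log counts
    [Definition]
    failregex = ^<HOST> -.*"(GET|POST|HEAD)
    ignoreregex =

    # /etc/fail2ban/jail.local
    [honeypot]
    enabled  = true
    filter   = honeypot
    port     = http,https
    logpath  = /var/log/nginx/honeypot.log
    maxretry = 1
    bantime  = 86400
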
  • Gien Member

    Yeah @sleddog, that should keep the bad ones out most of the time.

  • bdtech Member

    Found the culprit: W3TC does not cache if there's a query string, and these requests from Bingbot and others peg MySQL:

    /preview-science-20/?replytocom=117365

    Any idea how I can translate this to nginx?

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} "msnbot|Googlebot|bingbot|Slurp|ScoutJet|MJ12bot|Baiduspider|Ezooms|YandexBot|Exabot"
    RewriteCond %{QUERY_STRING} ^replytocom=\d+$
    RewriteRule ^(.*)$ http://%{HTTP_HOST}%{REQUEST_URI}? [redirect=301,last]

    http://support.tigertech.net/wordpress-performance
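    A possible nginx equivalent (a sketch, untested; nginx cannot AND two if conditions directly, so a flag variable stands in for the combined test, placed in the server{} block):

    # strip ?replytocom=N from known-bot requests with a 301
    set $strip_replytocom "";

    if ($http_user_agent ~* "msnbot|Googlebot|bingbot|Slurp|ScoutJet|MJ12bot|Baiduspider|Ezooms|YandexBot|Exabot") {
        set $strip_replytocom "ua";
    }

    if ($args ~ "^replytocom=\d+$") {
        set $strip_replytocom "${strip_replytocom}+qs";
    }

    if ($strip_replytocom = "ua+qs") {
        # $uri carries no query string, so this redirect drops ?replytocom=N
        return 301 $scheme://$host$uri;
    }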
