Facebook aggressive crawling

While checking my own cloud security warnings and doing a quick search, I recently found out that Facebook's crawlers are racking up a lot of abuse reports from website owners over their scanning activity. That struck me as odd, because Facebook doesn't offer cloud services like Amazon or Microsoft... do they?

https://www.abuseipdb.com/check/173.252.83.19

Can someone explain to me what this funny thing is about? I'm really curious now.

Comments

  • Crawling every single link billions of people paste in their private chats + oversensitive web application firewalls = this is the result

  • @inland said:
    Crawling every single link billions of people paste in their private chats + oversensitive web application firewalls = this is the result

    In this case, only the Steam APIs are allowed for our community's use; there are no other social media networks embedded. I should have mentioned that, just to clarify.

  • This happens because Facebook's crawlers are not always well behaved. Sometimes they just go crazy.

    I've faced situations in which Facebook hit my sites several hundred times per second, for multiple consecutive seconds. That quickly exhausts available resources, so I've put some throttling in place to protect against it.

    I use mod_security to throttle requests with the User-Agent "facebookexternalhit": once a certain limit is reached, I return a 429 "Too Many Requests" error.
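
    Roughly along these lines (a simplified sketch rather than my exact rules; the rule IDs and the 20-requests-per-5-seconds threshold are just example values, and it assumes ModSecurity 2.x with SecDataDir configured so the per-IP collection persists):

      # Track a per-IP collection across requests (requires SecDataDir).
      SecAction "id:1000,phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR}"

      # Count requests whose User-Agent contains the crawler string;
      # expirevar gives the counter a rolling 5-second window.
      SecRule REQUEST_HEADERS:User-Agent "@contains facebookexternalhit" \
          "id:1001,phase:1,nolog,pass,setvar:ip.fb_hits=+1,expirevar:ip.fb_hits=5"

      # Past the limit, deny with 429 Too Many Requests, chained so it
      # only fires for requests that carry the crawler's User-Agent.
      SecRule IP:FB_HITS "@gt 20" \
          "id:1002,phase:1,deny,status:429,log,msg:'Throttling facebookexternalhit',chain"
          SecRule REQUEST_HEADERS:User-Agent "@contains facebookexternalhit"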

  • I should also mention that, from my research, Facebook doesn't use centralized crawlers the way search engines tend to.

    Instead, they will crawl exactly the same URL from all of their locations (30+, if I recall correctly). So, if the crawler starts to misbehave, you can expect multiple requests for the same batch of URLs, which provides a nice DDoS effect.

    If you use caching at the web server layer, you might be okay. But in my experience, the sheer number of connections generated by this crawling behavior can easily exceed any worker-process limits you've set at the application layer.
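
    To put rough numbers on it (illustrative only; this assumes Apache's event MPM with its stock limits, not a recommendation):

      <IfModule mpm_event_module>
          ServerLimit          16
          ThreadsPerChild      25
          MaxRequestWorkers   400   # 16 x 25 worker threads in total;
                                    # 30+ crawler locations making ~15
                                    # concurrent requests each would
                                    # consume all of them.
      </IfModule>

    Caching saves the backend the work, but each open connection still occupies a worker slot, which is why limits like these get hit anyway.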

  • They might be building their AI? They already have a huge amount of data on people. Connect that to other data points, and what do you get? >:)
