Facebook aggressive crawling

While checking my own cloud security warnings and doing a quick search, I recently found out that Facebook's crawlers are racking up a lot of abuse reports from website owners over their scanning activity. That struck me as odd, because Facebook doesn't offer cloud services like Amazon or Microsoft... do they?

https://www.abuseipdb.com/check/173.252.83.19

Can someone explain to me what this funny thing is about? I'm really curious now.

Comments

  • Crawling every single link billions of people paste in their private chats + oversensitive web application firewalls = this is the result

  • @inland said:
    Crawling every single link billions of people paste in their private chats + oversensitive web application firewalls = this is the result

    In this case, only the Steam APIs are allowed for our community's use; there are no other social media networks embedded. I should have mentioned that, just to clarify.

  • This happens because Facebook's crawlers are not always well behaved. Sometimes they just go crazy.

    I've faced situations in which Facebook hit my sites several hundred times per second, for multiple consecutive seconds. That quickly exhausts available resources, so I've put some throttling in place to protect against it.

    I use mod_security to throttle requests with the User-Agent "facebookexternalhit": once a certain limit is reached, I return a 429 "Too Many Requests" error.
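
    Roughly along these lines (a simplified sketch rather than my exact rules; the rule IDs and the 20-requests-per-5-seconds threshold are just example values, and it assumes ModSecurity 2.x with SecDataDir configured so the per-IP collection persists):

      # Track a per-IP collection across requests (requires SecDataDir).
      SecAction "id:1000,phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR}"

      # Count requests whose User-Agent contains the crawler string;
      # expirevar gives the counter a rolling 5-second window.
      SecRule REQUEST_HEADERS:User-Agent "@contains facebookexternalhit" \
          "id:1001,phase:1,nolog,pass,setvar:ip.fb_hits=+1,expirevar:ip.fb_hits=5"

      # Past the limit, deny with 429 Too Many Requests, chained so it
      # only fires for requests that carry the crawler's User-Agent.
      SecRule IP:FB_HITS "@gt 20" \
          "id:1002,phase:1,deny,status:429,log,msg:'Throttling facebookexternalhit',chain"
          SecRule REQUEST_HEADERS:User-Agent "@contains facebookexternalhit"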

  • I should also mention that, from my research, Facebook doesn't use centralized crawlers the way search engines tend to.

    Instead, they will crawl exactly the same URL from all of their locations (30+, if I recall correctly). So, if the crawler starts to misbehave, you can expect multiple requests for the same batch of URLs, which provides a nice DDoS effect.

    If you use caching at the web server layer, you might be okay. But in my experience, the sheer number of connections generated by this crawling behavior can easily exceed any worker-process limits you've set at the application layer.
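
    To put rough numbers on it (illustrative only; this assumes Apache's event MPM with its stock limits, not a recommendation):

      <IfModule mpm_event_module>
          ServerLimit          16
          ThreadsPerChild      25
          MaxRequestWorkers   400   # 16 x 25 worker threads in total;
                                    # 30+ crawler locations making ~15
                                    # concurrent requests each would
                                    # consume all of them.
      </IfModule>

    Caching saves the backend the work, but each open connection still occupies a worker slot, which is why limits like these get hit anyway.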

  • They might be building their AI? They already have a huge amount of data on people. Connect that to other data points, and what do you get? >:)
