Bots are draining your bandwidth without you noticing

JustPfff · September 2025

This was the first thing I noticed when I moved my (popular) website to another host.
At best, my website had around 500 unique visitors per day, but Cloudflare statistics showed nearly 20k unique IPs visiting.
Before the move, my VPS bandwidth was almost reaching its monthly limit of 1.5 TB, averaging about 50 GB per day, which is insane for the size of my website.
Now, on day 13 of the month, the bandwidth usage is only about 180 GB, simply because I got rid of the bots.

davide · September 2025

Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

Saahib · September 2025

Bots are now a menace for almost every even slightly popular website, especially since this AI thing took off.
@JustPfff, what method did you use to filter out the bots?

MikeA · September 2025

@davide said:
Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow.

There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

davide · September 2025

@MikeA said:

@davide said:
Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow.

There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

E-commerce definitely has different patterns between humans and crawlers and they are easily discriminated by the webserver logs alone regardless of user agent or address. If the website is small, the more sophisticated crawlers may be indistinguishable, but with a product space of 80,000 items it's quickly apparent which clients attempt a breadth-first search or don't progress semantically with their search.
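
One crude signal for the breadth-first pattern described above is how many distinct product pages a single client touches. The following is a minimal sketch under stated assumptions: the `/product/` URL prefix, the `max_distinct` threshold, and the way requests are grouped into clients are all hypothetical and not taken from the post.

```python
from collections import defaultdict

def broad_crawlers(requests, prefix="/product/", max_distinct=200):
    """Return clients that touched more than `max_distinct` distinct product pages.

    `requests` is an iterable of (client_key, path) pairs. How a client is
    identified (session, fingerprint, address) is left to the caller and is
    an assumption of this sketch.
    """
    seen = defaultdict(set)
    for client, path in requests:
        # Only product pages count towards catalogue coverage.
        if path.startswith(prefix):
            seen[client].add(path)
    return {client for client, pages in seen.items() if len(pages) > max_distinct}
```

A human shopper usually views a handful of related products, so a client covering a large share of an 80,000-item catalogue stands out. This only measures breadth; it says nothing about whether the browsing progresses semantically.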

ehhthing · September 2025

@davide said:
Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

This is possible but then you can’t exactly block the connections outright because you risk accidentally blocking real traffic from a “weird” person. Eventually you do need to serve a captcha or some other verification method unless you want to lose business.

WebProject · September 2025

@ehhthing said:

@davide said:
Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

This is possible but then you can’t exactly block the connections outright because you risk accidentally blocking real traffic from a “weird” person. Eventually you do need to serve a captcha or some other verification method unless you want to lose business.

It is possible to use Cloudflare to block all AI bots and other non-search engine bots using security rules. On one of our client websites, the Facebook bot alone was consuming around 50 GB of bandwidth per month.

yoursunny · September 2025

Mentally strong people block no one.
Both bots and humans are welcome on our website.
In the past 24 hours, we had 5.86K total requests, fewer than 10% of which came from bots.

[Screenshot: AI Crawl Control]

[Screenshot: traffic overview]

johnnyquestion · September 2025

Serve them a 429 after so many requests
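
A minimal sketch of that idea, assuming a sliding window per client IP. The class name, the limits, and the use of the IP as the key are illustrative choices, not something specified in the post.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window request counter keyed by client IP."""

    def __init__(self, max_requests=60, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def status_for(self, ip, now=None):
        """Return 200 if the request is allowed, 429 if the client is over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429
        hits.append(now)
        return 200
```

Rejected requests are not recorded, so a client that backs off is let through again once its older requests age out of the window. As noted elsewhere in the thread, crawlers that rotate addresses will spread their requests across many keys.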

cmeerw · September 2025

@yoursunny said:
Mentally strong people block no one.
Both bots and humans are welcome on our website.
In the past 24 hours, we had 5.86K total requests, fewer than 10% of which came from bots.

I would guess that there are bots that are not known AI crawlers.

DrNutella · September 2025

Better yet, can I monetize my bandwidth and sell it?

emgh · September 2025

@MikeA said:

@davide said:
Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow.

There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

if you don't think there's ways to detect bots, try scraping bet365 efficiently

i'm giving up:(

cu_olly · September 2025

@davide said:
Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

Yes.

Bash script every ten seconds
Print last 1000 lines of access log
For $IP, check request count >2
Then, for $IP, check status code = 200
If match, check for static (.jpg,png,css,js)
If static, add $IP to whitelist
If no static, csf -td $IP 3600

ChatGPT or Grok will be able to put this together for you.
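
The steps above could be sketched roughly as follows, in Python rather than Bash. This is one interpretation, not the poster's actual script: it assumes the common/combined access log format, takes the recent lines as input (for example the last 1000), and only returns the two sets of IPs. Passing the blocklist to `csf -td $IP 3600` is left to the caller.

```python
import re
from collections import defaultdict

# Start of a common/combined access log line (assumed format):
# ip ident user [time] "METHOD path PROTO" status ...
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*" (\d{3}) ')
STATIC_EXT = (".jpg", ".png", ".css", ".js")

def classify(lines):
    """Return (whitelist, blocklist) sets of IPs from recent log lines."""
    counts = defaultdict(int)
    ok_paths = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not look like access log entries
        ip, path, status = m.groups()
        counts[ip] += 1
        if status == "200":
            # Ignore the query string when checking the file extension.
            ok_paths[ip].append(path.split("?", 1)[0].lower())
    whitelist, blocklist = set(), set()
    for ip, n in counts.items():
        paths = ok_paths.get(ip, [])
        if n <= 2 or not paths:
            continue  # needs more than 2 requests and at least one 200
        if any(p.endswith(STATIC_EXT) for p in paths):
            whitelist.add(ip)  # fetched static assets, like a browser would
        else:
            blocklist.add(ip)  # candidate for a temporary deny
    return whitelist, blocklist
```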

davide · September 2025

@cu_olly said:
If no static, csf -td $IP 3600

If no static, check cache expiration. FTFY.

cu_olly · September 2025

@davide said:

@cu_olly said:
If no static, csf -td $IP 3600

If no static, check cache expiration. FTFY.

Nope, not relevant. Unless you're using a CDN, you'll always have static requests on the first initial connection, thus $IP will be added to the whitelist.

cu_olly · September 2025

@MikeA said: There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

Grok uses, at least, Comcast DIA and has thousands of residential-looking IPs on its roster. However, for crawling, it uses Google Cloud spot instances. I have this confirmed, in writing, as a result of >1000 abuse reports and to-and-fro with GCP's abuse team. They will not, under any circumstance, remove Grok as a customer, no matter what levels of abuse are reported.

davide · September 2025

@cu_olly said:

@davide said:

@cu_olly said:
If no static, csf -td $IP 3600

If no static, check cache expiration. FTFY.

Nope, not relevant. Unless you're using a CDN, you'll always have static requests on the first initial connection, thus $IP will be added to the whitelist.

The lookback should be a session duration not a number of entries in the log. It wouldn't work on a high traffic site.
But then bots load statics too huhuhu
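
A time-based lookback, as opposed to a fixed number of log lines, might look like this minimal sketch. It assumes the usual `[13/Sep/2025:10:00:00 +0000]` timestamp of the common/combined log format and an English locale for the month abbreviation; the 30-minute default is an arbitrary stand-in for a session duration.

```python
import re
from datetime import datetime, timedelta, timezone

TS_RE = re.compile(r'\[([^\]]+)\]')

def within_window(lines, now=None, window=timedelta(minutes=30)):
    """Yield only the log lines whose timestamp falls within `window` before `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue  # no bracketed timestamp found
        try:
            ts = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # bracketed text was not a timestamp
        if cutoff <= ts <= now:
            yield line
```

On a busy site this keeps the analysis window the same length regardless of how many requests arrive per second.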

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠉⠉⠉⠉⠉⠉⠙⠒⠢⠤⣄⡀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡼⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⣳⣤⡀⠀⠀
⠀⠀⠀⠀⠀⠀⣰⠚⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⢏⠁⠀
⠀⠀⠀⠀⠀⠘⣻⠟⠀⠀⠀⠀⢀⣰⣄⡀⢶⣤⣀⠀⢠⣀⠀⠀⠀⠀⠀⠀⠀⠈⢳⡀
⠀⠀⠀⠀⠀⢰⠃⠀⠀⠀⠀⢠⡞⠁⠀⠉⠉⠛⠛⠛⠲⣽⡟⠲⣄⠀⠀⠀⠀⠀⠀⣧
⠀⠀⠀⠀⢀⡇⠀⠀⠀⢠⣷⣾⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠛⠀⠈⠳⡄⠀⣄⠀⠀⢸
⠀⠀⠀⠀⢸⠀⠀⣀⣀⡏⠈⠿⠀⠀⠀⣠⠖⠒⢦⠀⠀⠀⠀⠀⢀⣀⠹⣤⠟⣆⢀⡟
⠀⠀⠀⠀⢸⠀⡼⢙⣉⡻⠄⠀⠀⠀⡼⠁⢀⣤⣼⡇⠀⠀⠀⣴⠋⠉⢻⠋⣰⣻⡞⠀
⠀⠀⠀⠀⢸⡀⡇⢸⣹⡇⠀⠀⠀⠰⠧⠤⢼⣿⡿⠀⠀⠀⢰⡇⠀⣶⣾⠀⡇⠏⠀⠀
⠀⠀⠀⠀⠈⢇⠹⣄⠻⠇⠀⠀⠀⠀⠀⣦⠀⠀⠀⠀⠀⠀⠀⠉⢹⡛⠋⢠⡇⠀⠀⠀
⠀⠀⠀⠀⠀⠈⠣⣈⣻⠂⠀⠀⠀⠀⢸⣹⡇⠀⠀⠀⠀⠀⠀⠀⡞⡇⠀⣸⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠙⣆⠀⣸⡀⠀⠈⠉⠀⠀⣀⣀⡀⠀⠀⠈⠷⠃⢠⠇⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⢦⡏⡇⠀⠀⠀⠖⠛⠛⠛⠿⠷⣄⠀⠀⢠⣏⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠿⢷⢤⣄⣀⠀⠀⠀⠀⠀⠀⣀⣤⠞⠁⡿⡆⠀⠀⠀⠀
⠀⠀⠀⠀⣀⣤⡖⠛⠉⠹⡄⠀⠘⣆⠀⠈⠉⠉⠉⠉⠉⠉⡟⢳⣄⠰⠿⠃⠀⠀⠀⠀
⠀⢀⡴⠚⠉⠉⠉⠙⠲⣄⠹⣄⠀⠘⣆⠀⠀⠀⠀⠀⠀⠀⡇⠀⢿⢶⠦⣄⡀⠀⠀⠀
⠀⡞⠀⠀⠀⠀⠀⠀⠀⠀⠻⡝⢦⡀⠈⠳⢤⡀⠀⠀⢀⡼⠀⢀⡟⠈⡆⠀⠹⡄⠀⠀
⢸⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⢱⠀⠙⠢⣄⡀⠈⠉⠉⠁⢀⣠⠞⠁⠀⢸⠀⠀⣇⠀⠀
⢸⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣞⣀⣀⣀⣀⣈⣉⣙⣛⣉⣉⣀⣀⣀⣀⣸⣀⣀⡇⠀⠀

MikeA · September 2025

@cu_olly said:

@MikeA said: There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

Grok uses, at least, Comcast DIA and has thousands of residential-looking IPs on its roster. However, for crawling, it uses Google Cloud spot instances. I have this confirmed, in writing, as a result of >1000 abuse reports and to-and-fro with GCP's abuse team. They will not, under any circumstance, remove Grok as a customer, no matter what levels of abuse are reported.

Grok uses all US ISPs: AT&T, Cox, Comcast, Spectrum, etc. I've gotten IPs from all of them; they are run by multiple different third-party "residential ISP proxy" companies. xAI is paying companies for real residential proxies. I'm not sure what you mean by roster, but they use it to access websites for various tasks when people talk to Grok. But if you mean crawling, as in actually training the LLM, I'm sure that's done with GCP or something in-house in their own datacenter.

cu_olly · September 2025

@MikeA said: I'm not sure what you mean by roster, but they use it to scrape websites.

I mean, Grok's AI gas turbine house is connected by at least Comcast DIA. Maybe more. But that's not the prevalent issue, because those requests are only submitted as a direct result of Grok queries. The crawling from GCP is intensive, invasive, and is causing problems for pretty much every host.

ehhthing · September 2025

@WebProject said:

@ehhthing said:

@davide said:
Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

This is possible but then you can’t exactly block the connections outright because you risk accidentally blocking real traffic from a “weird” person. Eventually you do need to serve a captcha or some other verification method unless you want to lose business.

It is possible to use Cloudflare to block all AI bots and other non-search engine bots using security rules. On one of our client websites, the Facebook bot alone was consuming around 50 GB of bandwidth per month.

Cloudflare uses JS as part of their bot detection system.

praburam · September 2025

Those are peanuts, look at mine.

fatchan · September 2025

Yep, was seeing a high request volume from Alibaba cloud going through every page for every commit of every file in my self hosted git repository until I added some protection for that.
Would have been about 20T of bandwidth in a month, which isn't crazy big scale, but definitely an anomaly for the site in question.

praburam · September 2025

I reduced the load on PHP by storing pages permanently as static HTML files, which avoids overloading the servers.

blip1945 · September 2025

Cloudflare is the no brainer foolproof drop in solution for anyone too lazy to deal with them bots locally. I think they're making ai bot blocking the default for any domain added to their service now. Not ideal that cf being the goto for everyone and their grandma making the internet full of cf captcha but atleast their captcha aren't fucking annoying like google recaptcha or hcaptcha. Cf sort of became a necessary evil now.

johnnyquestion · September 2025

Requests without cookies get cached content.
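
As a minimal sketch of that rule, assuming the decision is made from the request headers alone. The function name and the "cache"/"origin" labels are hypothetical.

```python
def choose_backend(headers):
    """Return "cache" for requests without cookies and "origin" otherwise."""
    cookie = ""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "cookie":
            cookie = value.strip()
    return "origin" if cookie else "cache"
```

The reasoning is that a first-time or stateless client, which is what most crawlers look like, never reaches the application server, while logged-in visitors with a session cookie still get dynamic pages.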

CloudHopper · September 2025

Safeline WAF has a really good anti-bot mechanism for those who don't want to use Cloudflare (and its exploit prevention is far superior to CF's too): https://github.com/chaitin/SafeLine

yoursunny · September 2025

@fatchan said:
Yep, was seeing a high request volume from Alibaba cloud going through every page for every commit of every file in my self hosted git repository until I added some protection for that.

My collaborator is experiencing a similar problem on https://gerrit.named-data.net , hosted together with one other application on a dedicated server with 1x Xeon Gold 5120.
The bots were accessing git blame webpages very frequently, causing the application server to use up all 14 cores.

He deployed Anubis: Web AI Firewall Utility, which stopped application server overloads.
Instead, all the bots are now sitting in the entryway using up all Apache threads.

We advised him to buy that dedicated server just one year ago, when the bots weren't so prevalent.
Had we known, we would have made him buy a 128-core EPYC, enough to feed the bots.

Saahib · September 2025

@blip1945 said:
Cloudflare is the no brainer foolproof drop in solution for anyone too lazy to deal with them bots locally. I think they're making ai bot blocking the default for any domain added to their service now. Not ideal that cf being the goto for everyone and their grandma making the internet full of cf captcha but atleast their captcha aren't fucking annoying like google recaptcha or hcaptcha. Cf sort of became a necessary evil now.

Basically, for CF, these bots cause a hell of a lot of bandwidth consumption, which is of no use to site owners and eats CF's resources. It's a no-brainer that they want to block them by default.

rdes · September 2025

What annoys me the most are bots with Microsoft IPs. There are hundreds of them daily, probably free trials from Azure. But of course, some “very important” legitimate services like to use those servers, so I can't cut out the entire AS, though I think that would save me a lot of CPU power and terabytes of traffic.

Cloudflare in its default settings also lets through a lot of junk hosts that scan for WordPress plugin directories, webshells, etc.

Tange · September 2025

You need a WAF.

aj_potc · September 2025

@fatchan said:
Yep, was seeing a high request volume from Alibaba cloud going through every page for every commit of every file in my self hosted git repository until I added some protection for that.
Would have been about 20T of bandwidth in a month, which isn't crazy big scale, but definitely an anomaly for the site in question.

In my case, I saw a huge reduction in abusive requests simply by blocking Alibaba and Huawei Cloud IP ranges.
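
Checking an address against a list of blocked ranges can be sketched with Python's standard `ipaddress` module. The CIDR blocks below are reserved documentation ranges used purely as placeholders; they are not the real Alibaba or Huawei Cloud ranges, which would have to come from the providers' published lists or an ASN lookup.

```python
import ipaddress

# Placeholder documentation ranges (RFC 5737 / RFC 3849), NOT real provider ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/24", "2001:db8::/32")
]

def is_blocked(ip, networks=BLOCKED_NETWORKS):
    """Return True if `ip` falls inside any of the blocked networks."""
    addr = ipaddress.ip_address(ip)
    # Compare only against networks of the same IP version.
    return any(addr.version == net.version and addr in net for net in networks)
```

In practice a firewall or the web server would usually do this matching itself; the sketch only shows the membership test.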
