Bots are draining your bandwidth without you noticing

JustPfff Member
edited September 2025 in Outages

This was the first thing I noticed when I moved my (popular) website to another host.
At best, my website had around 500 unique visitors per day, but Cloudflare statistics showed nearly 20k unique visiting IPs.
Before the move, my VPS bandwidth usage was close to its monthly limit of 1.5 TB, averaging about 50 GB per day, which is insane for the size of my website.
Now, on day 13 of the month, bandwidth usage is only about 180 GB (roughly 14 GB per day), simply because I got rid of the bots.

Comments

  • davide Member
    edited September 2025

    Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

  • Saahib Host Rep, Veteran
    edited September 2025

    Bots are now a menace for almost every website that is even slightly popular, especially since this AI thing.
    @JustPfff, what method did you use to filter out the bots?

  • MikeA Member, Patron Provider

    @davide said:
    Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow.

    There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

    Thanked by 2oloke mrTom
  • davide Member
    edited September 2025

    @MikeA said:

    @davide said:
    Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow.

    There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

    E-commerce definitely has different patterns between humans and crawlers and they are easily discriminated by the webserver logs alone regardless of user agent or address. If the website is small, the more sophisticated crawlers may be indistinguishable, but with a product space of 80,000 items it's quickly apparent which clients attempt a breadth-first search or don't progress semantically with their search.
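
A minimal sketch of that kind of passive log analysis, assuming logs in Combined Log Format. The thresholds (`min_pages`, `max_static_ratio`) and the function names are made up for illustration, not something anyone in the thread posted. It flags clients that sweep many distinct pages while fetching almost no static assets, which is one crude proxy for a breadth-first crawl:

```python
import re
from collections import defaultdict

# Combined Log Format: ip - - [time] "METHOD path HTTP/x" status size ...
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
STATIC_EXT = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js", ".svg", ".ico")


def summarize(lines):
    """Group requests by client IP and count distinct pages vs. static hits."""
    stats = defaultdict(lambda: {"pages": set(), "static": 0, "total": 0})
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        entry = stats[m.group("ip")]
        entry["total"] += 1
        path = m.group("path").split("?", 1)[0]
        if path.lower().endswith(STATIC_EXT):
            entry["static"] += 1
        else:
            entry["pages"].add(path)
    return stats


def looks_like_crawler(entry, min_pages=30, max_static_ratio=0.05):
    """Hypothetical heuristic: many distinct pages, almost no static assets."""
    static_ratio = entry["static"] / entry["total"]
    return len(entry["pages"]) >= min_pages and static_ratio <= max_static_ratio
```

This only covers the "breadth" half of the idea. Whether a client progresses semantically through a catalogue would need site-specific knowledge of the product space.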

  • @davide said:
    Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

    This is possible but then you can’t exactly block the connections outright because you risk accidentally blocking real traffic from a “weird” person. Eventually you do need to serve a captcha or some other verification method unless you want to lose business.

    Thanked by 1mrTom
  • WebProject Veteran, 🚩 Host Rep Tag Suspended

    @ehhthing said:

    @davide said:
    Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

    This is possible but then you can’t exactly block the connections outright because you risk accidentally blocking real traffic from a “weird” person. Eventually you do need to serve a captcha or some other verification method unless you want to lose business.

    It is possible to use Cloudflare to block all AI bots and other non-search engine bots using security rules. On one of our client websites, the Facebook bot alone was consuming around 50 GB of bandwidth per month.

    Thanked by 1schwabene
  • yoursunny Member, IPv6 Advocate

    Mentally strong people block no one.
    Both bots and humans are welcome on our website.
    In the past 24 hours, we had 5.86K total requests, fewer than 10% of which came from bots.

    [Screenshots: "AI Crawl Control" and "traffic overview"]

    Thanked by 1WebProject
  • Serve them a 429 after so many requests
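
One way to read "a 429 after so many requests" is a per-IP sliding-window counter. The sketch below is an assumption about what that could look like in application code; the class name, the limit and the window are all hypothetical, and in practice this is often done in the web server or a reverse proxy instead:

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def status_for(self, ip, now=None):
        """Return the HTTP status to send: 200 if allowed, 429 if over limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return 429  # Too Many Requests
        hits.append(now)
        return 200
```

Rejected requests are not recorded here, so a client that keeps hammering is let back in as soon as its older accepted requests age out.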

  • @yoursunny said:
    Mentally strong people block no one.
    Both bots and humans are welcome on our website.
    In the past 24 hours, we had 5.86K total requests, fewer than 10% of which came from bots.

    [Screenshots: "AI Crawl Control" and "traffic overview"]

    I would guess that there are bots that are not known AI crawlers.

  • Better yet, can I monetize my bandwidth and sell it?

  • emgh Member, Megathread Squad

    @MikeA said:

    @davide said:
    Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow.

    There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

    if you don't think there's ways to detect bots, try scraping bet365 efficiently

    i'm giving up:(

  • cu_olly Member
    edited September 2025

    @davide said:
    Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

    Yes.

    Bash script every ten seconds
    Print last 1000 lines of access log
    For $IP, check request count >2
    Then, for $IP, check status code = 200
    If match, check for static (.jpg,png,css,js)
    If static, add $IP to whitelist
    If no static, csf -td $IP 3600

    ChatGPT or Grok will be able to put this together for you.
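
The steps above could be sketched roughly as follows, in Python rather than Bash. This is one interpretation, not cu_olly's actual script: it assumes Combined Log Format, reads "request count >2" as more than two 200 responses per IP, and only builds the `csf -td` command instead of running it. The function names are made up for illustration:

```python
import re
from collections import Counter, deque

REQ_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" (\d{3}) ')
STATIC_EXT = (".jpg", ".png", ".css", ".js")


def tail(path, n=1000):
    """Return the last n lines of the access log."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return list(deque(fh, maxlen=n))


def classify(lines, min_requests=2):
    """Split busy IPs into (whitelist, to_block) by whether they fetched statics."""
    ok_hits = Counter()
    static_ips = set()
    for line in lines:
        m = REQ_RE.match(line)
        if not m or m.group(3) != "200":
            continue  # only successful requests are considered
        ip = m.group(1)
        path = m.group(2).split("?", 1)[0]
        ok_hits[ip] += 1
        if path.lower().endswith(STATIC_EXT):
            static_ips.add(ip)
    busy = {ip for ip, count in ok_hits.items() if count > min_requests}
    return busy & static_ips, busy - static_ips


def csf_command(ip, ttl=3600):
    """Build the temporary-deny command from the post; running it needs root and CSF."""
    return ["csf", "-td", ip, str(ttl)]
```

To actually block, each entry of `to_block` would be passed to something like `subprocess.run(csf_command(ip), check=True)`, and the whole thing scheduled every ten seconds as described.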

  • @cu_olly said:
    If no static, csf -td $IP 3600

    If no static, check cache expiration. FTFY.

    Thanked by 1tentor
  • @davide said:

    @cu_olly said:
    If no static, csf -td $IP 3600

    If no static, check cache expiration. FTFY.

    Nope, not relevant. Unless you're using a CDN, you'll always have static requests on the first initial connection, thus $IP will be added to the whitelist.

  • @MikeA said: There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

    Grok uses, at least, Comcast DIA and has thousands of residential-looking IPs on its roster. However, for crawling, it uses Google Cloud spot instances. I have this confirmed, in writing, as a result of >1000 abuse reports and to-and-fro with GCP's abuse team. They will not, under any circumstance, remove Grok as a customer, no matter what levels of abuse are reported.

  • davide Member
    edited September 2025

    @cu_olly said:

    @davide said:

    @cu_olly said:
    If no static, csf -td $IP 3600

    If no static, check cache expiration. FTFY.

    Nope, not relevant. Unless you're using a CDN, you'll always have static requests on the first initial connection, thus $IP will be added to the whitelist.

    The lookback should be a session duration not a number of entries in the log. It wouldn't work on a high traffic site.
    But then bots load statics too huhuhu

    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠖⠊⠉⠉⠉⠉⠉⠉⠙⠒⠢⠤⣄⡀⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡼⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⣳⣤⡀⠀⠀
    ⠀⠀⠀⠀⠀⠀⣰⠚⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⢏⠁⠀
    ⠀⠀⠀⠀⠀⠘⣻⠟⠀⠀⠀⠀⢀⣰⣄⡀⢶⣤⣀⠀⢠⣀⠀⠀⠀⠀⠀⠀⠀⠈⢳⡀
    ⠀⠀⠀⠀⠀⢰⠃⠀⠀⠀⠀⢠⡞⠁⠀⠉⠉⠛⠛⠛⠲⣽⡟⠲⣄⠀⠀⠀⠀⠀⠀⣧
    ⠀⠀⠀⠀⢀⡇⠀⠀⠀⢠⣷⣾⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠛⠀⠈⠳⡄⠀⣄⠀⠀⢸
    ⠀⠀⠀⠀⢸⠀⠀⣀⣀⡏⠈⠿⠀⠀⠀⣠⠖⠒⢦⠀⠀⠀⠀⠀⢀⣀⠹⣤⠟⣆⢀⡟
    ⠀⠀⠀⠀⢸⠀⡼⢙⣉⡻⠄⠀⠀⠀⡼⠁⢀⣤⣼⡇⠀⠀⠀⣴⠋⠉⢻⠋⣰⣻⡞⠀
    ⠀⠀⠀⠀⢸⡀⡇⢸⣹⡇⠀⠀⠀⠰⠧⠤⢼⣿⡿⠀⠀⠀⢰⡇⠀⣶⣾⠀⡇⠏⠀⠀
    ⠀⠀⠀⠀⠈⢇⠹⣄⠻⠇⠀⠀⠀⠀⠀⣦⠀⠀⠀⠀⠀⠀⠀⠉⢹⡛⠋⢠⡇⠀⠀⠀
    ⠀⠀⠀⠀⠀⠈⠣⣈⣻⠂⠀⠀⠀⠀⢸⣹⡇⠀⠀⠀⠀⠀⠀⠀⡞⡇⠀⣸⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠙⣆⠀⣸⡀⠀⠈⠉⠀⠀⣀⣀⡀⠀⠀⠈⠷⠃⢠⠇⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⢦⡏⡇⠀⠀⠀⠖⠛⠛⠛⠿⠷⣄⠀⠀⢠⣏⠀⠀⠀⠀⠀
    ⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡤⠿⢷⢤⣄⣀⠀⠀⠀⠀⠀⠀⣀⣤⠞⠁⡿⡆⠀⠀⠀⠀
    ⠀⠀⠀⠀⣀⣤⡖⠛⠉⠹⡄⠀⠘⣆⠀⠈⠉⠉⠉⠉⠉⠉⡟⢳⣄⠰⠿⠃⠀⠀⠀⠀
    ⠀⢀⡴⠚⠉⠉⠉⠙⠲⣄⠹⣄⠀⠘⣆⠀⠀⠀⠀⠀⠀⠀⡇⠀⢿⢶⠦⣄⡀⠀⠀⠀
    ⠀⡞⠀⠀⠀⠀⠀⠀⠀⠀⠻⡝⢦⡀⠈⠳⢤⡀⠀⠀⢀⡼⠀⢀⡟⠈⡆⠀⠹⡄⠀⠀
    ⢸⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⢱⠀⠙⠢⣄⡀⠈⠉⠉⠁⢀⣠⠞⠁⠀⢸⠀⠀⣇⠀⠀
    ⢸⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣞⣀⣀⣀⣀⣈⣉⣙⣛⣉⣉⣀⣀⣀⣀⣸⣀⣀⡇⠀⠀
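
The point about looking back over a session duration rather than a fixed number of log entries can be sketched like this. It assumes Combined Log Format timestamps, and the 30-minute default and the function name are illustrative assumptions:

```python
import re
from datetime import datetime, timedelta

TIME_RE = re.compile(r'^\S+ \S+ \S+ \[([^\]]+)\]')
TIME_FMT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 13/Sep/2025:10:00:00 +0000


def within_session(lines, session=timedelta(minutes=30)):
    """Keep only log lines no older than `session` before the newest entry."""
    parsed = []
    for line in lines:
        m = TIME_RE.match(line)
        if not m:
            continue
        try:
            stamp = datetime.strptime(m.group(1), TIME_FMT)
        except ValueError:
            continue  # unparseable timestamp, skip the line
        parsed.append((stamp, line))
    if not parsed:
        return []
    cutoff = max(stamp for stamp, _ in parsed) - session
    return [line for stamp, line in parsed if stamp >= cutoff]
```

On a busy site the last 1000 lines may cover only a few seconds, while this window stays the same length regardless of traffic. It does nothing about bots that also load static files.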

    Thanked by 2Frameworks mrTom
  • MikeA Member, Patron Provider
    edited September 2025

    @cu_olly said:

    @MikeA said: There's no realistic way to do it. AI crawlers like xAI use residential IPs and actively rotate fake useragents. They're not the only one but they're the most prevalent due to their size.

    Grok uses, at least, Comcast DIA and has thousands of residential-looking IPs on its roster. However, for crawling, it uses Google Cloud spot instances. I have this confirmed, in writing, as a result of >1000 abuse reports and to-and-fro with GCP's abuse team. They will not, under any circumstance, remove Grok as a customer, no matter what levels of abuse are reported.

    Grok uses all US ISPs: AT&T, Cox, Comcast, Spectrum, etc. I've gotten IPs from all of them, and they are run by multiple different third-party "residential ISP proxy" companies. xAI is paying companies for real residential proxies. I'm not sure what you mean by roster, but they use it to access websites for various tasks when people talk to Grok. But if you mean crawling, like actually training the LLM, I'm sure that's done with GCP or something in-house with their own datacenter.

  • @MikeA said: I'm not sure what you mean by roster, but they use it to scrape websites.

    I mean, Grok's AI gas turbine house is connected by at least Comcast DIA. Maybe more. But that's not the prevalent issue, because those requests are only submitted as a direct result of Grok queries. The crawling from GCP is intensive, invasive, and is causing problems for pretty much every host.

  • @WebProject said:

    @ehhthing said:

    @davide said:
    Has anyone written a web server logs parser to detect non-human sessions? There are human patterns that crawlers don't follow. This kind of passive analysis would throw no JavaShit to the users' browser like HCaptcha or Cloudflare.

    This is possible but then you can’t exactly block the connections outright because you risk accidentally blocking real traffic from a “weird” person. Eventually you do need to serve a captcha or some other verification method unless you want to lose business.

    It is possible to use Cloudflare to block all AI bots and other non-search engine bots using security rules. On one of our client websites, the Facebook bot alone was consuming around 50 GB of bandwidth per month.

    Cloudflare uses JS as part of their bot detection system.

  • praburam Member
    edited September 2025

    Those are peanuts, look at mine.

    Thanked by 1WebProject
  • fatchan Member, Host Rep

    Yep, was seeing a high request volume from Alibaba cloud going through every page for every commit of every file in my self hosted git repository until I added some protection for that.
    Would have been about 20T of bandwidth in a month, which isn't crazy big scale, but definitely an anomaly for the site in question.

  • I reduced the load on PHP and HTML: I store the pages permanently as static HTML files to avoid overloading the servers.

    Thanked by 1tentor
  • Cloudflare is the no-brainer, foolproof, drop-in solution for anyone too lazy to deal with the bots locally. I think they're now making AI bot blocking the default for any domain added to their service. It's not ideal that CF is the go-to for everyone and their grandma, filling the internet with CF captchas, but at least their captchas aren't fucking annoying like Google reCAPTCHA or hCaptcha. CF has sort of become a necessary evil.

  • No cookies get cached content
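
A minimal sketch of that rule, assuming the application can inspect request headers before rendering. The class name and the `render` callback are hypothetical, and a real setup would more likely do this in the reverse proxy or CDN configuration:

```python
class CookieAwareCache:
    """Serve cached pages to cookieless clients, render fresh for the rest."""

    def __init__(self, render):
        self._render = render  # callable: path -> page body
        self._pages = {}       # path -> cached body

    def respond(self, path, headers):
        has_cookie = any(
            name.lower() == "cookie" and value.strip()
            for name, value in headers.items()
        )
        if has_cookie:
            # Likely a returning or logged-in visitor: render dynamically.
            return self._render(path)
        if path not in self._pages:
            self._pages[path] = self._render(path)
        return self._pages[path]
```

Crawlers that never send cookies then hit the expensive render path at most once per URL. The cache here never expires, which a real deployment would need to handle.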

  • SafeLine WAF has a really good anti-bot mechanism for those who don't want to use Cloudflare (and its exploit prevention is also far superior to CF's): https://github.com/chaitin/SafeLine

    Thanked by 2akhfa 0xC7
  • yoursunny Member, IPv6 Advocate
    edited September 2025

    @fatchan said:
    Yep, was seeing a high request volume from Alibaba cloud going through every page for every commit of every file in my self hosted git repository until I added some protection for that.

    My collaborator is experiencing a similar problem on https://gerrit.named-data.net, hosted together with one other application on a dedicated server with 1x Xeon Gold 5120.
    The bots were accessing git blame webpages very frequently, causing the application server to use up all 14 cores.

    He deployed Anubis: Web AI Firewall Utility, which stopped the application server overloads.
    Instead, all the bots are now sitting in the entryway using up all Apache threads.

    We advised him to buy that dedicated server just one year ago, when the bots weren't so prevalent.
    Had we known, we would have made him buy a 128-core EPYC, enough to feed the bots.

  • Saahib Host Rep, Veteran
    edited September 2025

    @blip1945 said:
    Cloudflare is the no-brainer, foolproof, drop-in solution for anyone too lazy to deal with the bots locally. I think they're now making AI bot blocking the default for any domain added to their service. It's not ideal that CF is the go-to for everyone and their grandma, filling the internet with CF captchas, but at least their captchas aren't fucking annoying like Google reCAPTCHA or hCaptcha. CF has sort of become a necessary evil.

    Basically, for CF, these bots cause a hell of a lot of bandwidth consumption, which is not useful for site owners and eats CF resources. It's a no-brainer that they want to block them by default.

  • What annoys me the most are bots with Microsoft IPs. There are hundreds of them daily, probably free trials from Azure. But of course, some “very important” legitimate services like to use those servers, so I can't cut out the entire AS. I think doing so would save me a lot of CPU power and TB of traffic.

    Cloudflare in its default settings also lets through a lot of junk hosts that scan for WordPress plugin directories, webshells, etc.

  • You need a WAF.

  • @fatchan said:
    Yep, was seeing a high request volume from Alibaba cloud going through every page for every commit of every file in my self hosted git repository until I added some protection for that.
    Would have been about 20T of bandwidth in a month, which isn't crazy big scale, but definitely an anomaly for the site in question.

    In my case, I saw a huge reduction in abusive requests simply by blocking Alibaba and Huawei Cloud IP ranges.
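
Range-based blocking like this can be sketched with Python's standard `ipaddress` module. The CIDRs used in the usage line are documentation placeholders, not real Alibaba or Huawei Cloud ranges, which would have to come from the providers' published lists or an ASN lookup:

```python
import ipaddress


def load_networks(cidrs):
    """Parse CIDR strings into network objects (IPv4 and IPv6 both work)."""
    return [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]


def is_blocked(ip, networks):
    """Return True if the address falls inside any of the blocked networks."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr.version == net.version and addr in net
        for net in networks
    )
```

For example, `is_blocked("203.0.113.42", load_networks(["203.0.113.0/24"]))` is true. At scale the same list is usually pushed into the firewall or an ipset rather than checked per request in application code.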

    Thanked by 1tentor