The death of MXRBL
Hey friends,
A long time ago I pushed my RBL here as a good source for you to use on your mail servers: https://mxrbl.com. Today I'm here to announce that if you are using it, you should stop. It will continue to accept queries for a while, and I will reach out to the people I know are using it. In my Black Friday offer post I'm going to explain a bit more about what we're doing at MXroute to stop spam and why I believe the latest approaches are vastly superior to any usage of a traditional RBL.


Comments
Any rough idea when queries will no longer be accepted?
but recurring means forever
Most likely I'll empty the zone, point the three NS to 1 cloud server, and return empty queries for a decade or more.
I've recently implemented my own RBL, and while it seems to work well enough for me, I'm looking forward to learning more about your new approach.
That will be interesting to read. AI based content analysis, hopefully?
That would be the dream, but I can't self-host an LLM that would perform well enough or reach our scale at a reasonable overhead. It's a series of efforts that combine into what I'm expecting to see received as one of the best filtering systems around, with content filters being the last line of defense.
I run rspamd on my mail server, and I'm continually frustrated by the endless need to manually adjust it to block the newest wave of easily recognizable refund scams, phishing, or extortion campaigns. I'll be eager to see what you are doing, as I am on the verge of writing a whole new spam filter built around AI (not necessarily LLM) techniques.
RBLs can still be quite an effective way to cut down the amount of spam, but they come with the risk of losing legitimate mail if you use them to reject mail outright, even when using the major/trusted ones. Over the years I've seen IPs of major, mostly legitimate senders (inbox providers, transactional senders, etc.) appear in RBLs, even the more reputable ones.
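For anyone who hasn't wired one up by hand, here is a minimal sketch of the usual DNSBL convention for IPv4: reverse the octets, append the list's zone, and look the name up. Any A record means "listed", and NXDOMAIN means "not listed" (which is also what an emptied zone would return for every query). The zone `rbl.example`, the `weight` value, and the function names are placeholders for illustration, not anything from MXRBL. It adds to a score instead of rejecting, which is one way to limit the false-positive risk described above.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the zone answers with an A record for this address."""
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not listed in this zone.
        return False


def rbl_score(ip: str, zones, weight: float = 2.5,
              resolve=socket.gethostbyname) -> float:
    """Add a fixed (hypothetical) weight per listing instead of hard-rejecting."""
    return weight * sum(is_listed(ip, zone, resolve) for zone in zones)
```

The `resolve` parameter is only there so the lookup can be swapped out for testing or for a proper resolver library; `socket.gethostbyname` uses whatever resolver the system is configured with.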
An LLM-based filter would be good for catching the phishing, SEO, etc. crap you receive from the major free inbox providers, where an RBL is useless because listing the IP would cause legitimate mail to get rejected.
Hopefully in the next few years LLM-based filters will make sense to deploy and be cost-effective enough to justify for smaller providers.
It is really unfair how much Gmail makes independent email servers go through when they can't (or don't want to) block the firehose of spam that they originate. They don't do anything to prevent Gmail accounts from sending the most blatant Nigerian prince, refund, and sextortion scams.
Gmail is extremely frustrating to the rest of us and that perfectly describes it.
Never worked on spam classification, but using LLMs for that is a bit of overkill, isn't it?
I think it would be perfect because it would be better at learning and adapting quickly to the continual changes spammers are making to try to circumvent filters. But I would only feel comfortable with it if it were a locally running model connected via extremely low latency private network to the server. Huge bottleneck potential, and via API huge privacy violation.
Yeah, I see your point. I meant that some custom word embeddings and a few light classifiers would probably be much more efficient to train and run, and would probably perform better.
Yea, you don't need a full blown multi-billion parameter LLM to effectively use ML techniques.
The only real problem in training any kind of a model on real emails is that you need an effective way to remove PII from the dataset.
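As a rough illustration of the PII problem, here is a minimal sketch that masks the most mechanical identifiers with regular expressions before a message goes into a training set. The patterns and the `<EMAIL>`/`<IP>`/`<PHONE>` placeholders are my own assumptions, and this only catches obvious formats; names, street addresses, account numbers and the like need far more than regexes.

```python
import re

# Deliberately simple patterns: they catch common formats, not every valid one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace e-mail addresses, IPv4 addresses and phone-like numbers."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    # IPs are masked before phones because the loose phone pattern
    # would otherwise also match dotted quads.
    text = IPV4_RE.sub("<IP>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text
```

Keeping a placeholder token rather than deleting the match means a classifier can still learn that "there was a phone number here" without seeing the number itself.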
@jar has a point though, that an LLM has been trained on enough context to understand any language trick used by spammers in future.
I have a feeling that for such a simple task, lighter classifiers could still have enough power even without transformers. But worst case, a cut-down pretrained model with 200-300M parameters could be more than enough.
It has enough language understanding to generalize, and can very easily be fine-tuned to identify spam.
Running in parallel on a fast engine like vLLM, it can have very low latency.
As far as I can see, there are models along these lines on Hugging Face, like: https://huggingface.co/phishbot/ScamLLM https://huggingface.co/ggrizzly/roBERTa-spam-detection etc.
Train another model to do that /jk
IIRC there are many open datasets out there.
Yes, but many of them are 20 years old or so. Spam isn't quite the same as it was back then.
Bayesian learning is still the most popular open source algorithm and I still find it to be completely unable to keep up with current trends, much as I reached the same conclusion in 2013. Even rspamd's fuzzy misses pretty much everything no matter how much you feed it. Spammers are changing their messages up too rapidly for simple approaches. LLMs are the only thing new that has a chance of changing the game on pure content filtering. If it can't even seem like it can think like a human, it doesn't stand a chance in 2024. You can bet the spammers are using LLMs to help them out.
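For context on what "Bayesian learning" means here, this is a minimal token-count naive Bayes sketch in plain Python. It is not how rspamd or any particular filter implements it; the class name, tokenizer, and smoothing choices are assumptions for illustration. It also shows the weakness being described: a token the filter has never seen carries almost no evidence, so reworded campaigns slide past until someone retrains it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one message under 'spam' or 'ham'."""
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total_docs = sum(self.docs.values())
        log_odds = 0.0
        for label, sign in (("spam", 1), ("ham", -1)):
            total = sum(self.counts[label].values())
            # Smoothed class prior, then add-one smoothed token likelihoods.
            lp = math.log((self.docs[label] + 1) / (total_docs + 2))
            for tok in tokenize(text):
                lp += math.log((self.counts[label][tok] + 1) / (total + vocab + 1))
            log_odds += sign * lp
        # Clamp so math.exp cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1 / (1 + math.exp(-log_odds))
```

A message made entirely of unseen tokens gets nearly the same likelihood under both classes, so its score is driven almost entirely by the prior.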
Or they just embed it in an image, and fly right past every spam filter out there.
There could/should at least be some OCR done on images, which I'm sure the big guys do in some form, but isn't widely available for self-hosting and smaller providers.
I've played around with postfix content filters, and it would be incredibly easy to add OCR and PDF text extraction, but without good training data to properly train a classifier, it isn't worth it.
An LLM to sort your mailbox could be handy: a better version of what Google does, sorting promotions and social posts from transactional email and personal messages.
You won't need a full LLM; an SLM (small language model) will be enough. Most modern models are already good enough. Some can be further trained/specialized or adapted with multi-shot prompting. You might even be able to use multiple different SLMs.
I was actually skeptical about LLMs for spam prediction, but according to JPMorgan they work quite well. I just don't think it'd be cheap enough to justify
I tested the precursor to LLMs (BERT), and it is too stupid to be useful.
So you tested a single SLM?
Luckily there are some academic publications on the matter of SLM for tasks like spam detection.
Did you fine-tune it for this task?
I know the pain. Hardware costs should come down in a few years, which will eventually make local computation viable. Adaptability is a challenge, especially in the context of spam, because spam changes very fast and the common approach today is offline learning. All the resources you put into making a model become stale over time because spammers adapt. If you're doing R&D, online learning techniques are something to explore. Also, speaking from experience, it's an unsolvable problem because it's always going to be a moving target. The only thing you can do is aim for perfection knowing it can never be perfect.
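To make the offline vs. online distinction concrete, here is a minimal sketch of an online learner: logistic regression over hashed tokens, updated one message at a time as verdicts come in, with no batch retraining step. The class name, feature hashing via `zlib.crc32`, table size, and learning rate are all assumptions for illustration, not a recommendation for production.

```python
import math
import re
import zlib


class OnlineSpamModel:
    def __init__(self, n_features=2 ** 18, lr=0.5):
        self.n = n_features
        self.lr = lr
        self.w = [0.0] * n_features  # one weight per hash bucket

    def _features(self, text):
        """Hashing trick: map each distinct token to a bucket index."""
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        return {zlib.crc32(t.encode("utf-8")) % self.n for t in tokens}

    def predict(self, text):
        """Estimated probability that the message is spam."""
        z = sum(self.w[i] for i in self._features(text))
        z = max(-30.0, min(30.0, z))  # keep math.exp in a safe range
        return 1 / (1 + math.exp(-z))

    def update(self, text, is_spam):
        """One stochastic gradient step on the log loss for one message."""
        error = (1.0 if is_spam else 0.0) - self.predict(text)
        for i in self._features(text):
            self.w[i] += self.lr * error
```

Each user report or confirmed verdict can be fed straight into `update`, so the model drifts along with the spam instead of waiting for the next training run. The trade-off is that it also drifts along with bad labels, so the feedback source has to be trusted.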
Yes I specifically trained it on my own data.
Emails without text, just an image attached, automatically get a high spam score.
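A rule like that can be sketched with Python's standard `email` package. This is an assumption about how such a rule might look, not how any particular filter does it: `penalty` and `min_text_chars` are made-up values, and it assumes the message was built or parsed with `email.policy.default` so that parts have `get_content()`.

```python
from email.message import EmailMessage


def image_only_score(msg: EmailMessage, penalty: float = 5.0,
                     min_text_chars: int = 20) -> float:
    """Return a penalty when a message has an image but almost no text."""
    text_chars = 0
    has_image = False
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        maintype = part.get_content_maintype()
        if maintype == "image":
            has_image = True
        elif maintype == "text":
            # Note: for text/html this counts markup too, so an HTML part
            # containing only an <img> tag may slip under this simple check.
            text_chars += len(part.get_content().strip())
    return penalty if has_image and text_chars < min_text_chars else 0.0
```

Treating it as a score contribution rather than a verdict leaves room for legitimate image-only mail, such as someone forwarding a photo with no message.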