The death of MXRBL

jar · November 2024

Hey friends,

A long time ago I pushed my RBL here as a good source for you to use on your mail servers: https://mxrbl.com. Today I'm here to announce that if you are using it, you should stop. It will continue to accept queries for a while, and I will reach out to the people I know are using it. In my Black Friday offer post I'm going to explain a bit more about what we're doing at MXroute to stop spam and why I believe the latest approaches are vastly superior to any usage of a traditional RBL.

hsr · November 2024

Any rough idea when queries will no longer be accepted?

zed · November 2024

but recurring means forever

jar · November 2024

@hsr said:
Any rough idea when queries will no longer be accepted?

Most likely I'll empty the zone, point the three NS records to one cloud server, and return empty responses for a decade or more.
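
For anyone unfamiliar with how an RBL lookup works, here is a minimal sketch, assuming the usual DNSBL convention of reversing the IPv4 octets and prepending them to the list's zone. The zone `rbl.example.org` and both helper names are placeholders for illustration, not MXRBL's real query zone. Against an emptied zone, every lookup would fail to resolve, which clients read as "not listed".

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets + zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with an A record for this IP.

    A resolution failure (for example NXDOMAIN) is treated as "not listed",
    which is what every query would get from an emptied zone.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # "rbl.example.org" is a placeholder zone, not a real list.
    print(dnsbl_query_name("192.0.2.10", "rbl.example.org"))
```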

quicksilver03 · November 2024

I've recently implemented my own RBL, and while it seems to work well enough for me, I'm looking forward to learning more about your new approach.

eezcloud · November 2024

@jar said: In my Black Friday offer post I'm going to explain a bit more about what we're doing at MXroute to stop spam

That will be interesting to read. AI based content analysis, hopefully?

jar · November 2024

@eezcloud said:

@jar said: In my Black Friday offer post I'm going to explain a bit more about what we're doing at MXroute to stop spam

That will be interesting to read. AI based content analysis, hopefully?

That would be the dream but I can’t self host an LLM that would perform well enough or reach our scale at a reasonable overhead. It’s a series of efforts that combine into what I’m expecting to see received as one of the best filtering systems around, with content filters being the last line of defense.

eezcloud · November 2024

@jar said: That would be the dream but I can’t self host an LLM that would perform well enough or reach our scale at a reasonable overhead.

I run rspamd on my mail server, and I'm continually frustrated by the endless need to manually adjust it to block the newest wave of easily recognizable refund scams, phishing, or extortion campaigns. I'll be eager to see what you are doing, as I am on the verge of writing a whole new spam filter built around AI (not necessarily LLM) techniques.

Razza · November 2024

RBLs can still be quite an effective way to cut down the amount of spam, but they come with the risk of losing legitimate mail if you use them to reject mail, even when using the major/trusted ones. Over the years I've seen IPs of major, mostly legit senders (inbox providers, transactional senders, etc.) appear in RBLs, even the more reputable ones.

An LLM-based one would be good for filtering the phishing, SEO, etc. crap you receive from major free inbox providers, where an RBL would be useless since listing the IP would cause legitimate mail to get rejected.

Hopefully in the next few years LLM-based ones will make sense to deploy and be cost-effective enough to justify for smaller providers.
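
One common way to limit the false-positive risk described above is to treat an RBL hit as one weighted signal rather than an outright reject. A minimal sketch, assuming made-up list names, weights, and threshold (none of these come from a real filter's configuration):

```python
from typing import Dict, Iterable

# Hypothetical weights: how much a hit on each list adds to the spam score.
RBL_WEIGHTS: Dict[str, float] = {
    "trusted-list.example": 3.0,
    "aggressive-list.example": 1.0,
}
REJECT_THRESHOLD = 5.0  # assumed cutoff; would need tuning against real mail


def spam_score(rbl_hits: Iterable[str], other_score: float) -> float:
    """Combine RBL hits with a score from other checks (content, auth, ...)."""
    rbl_part = sum(RBL_WEIGHTS.get(name, 0.0) for name in rbl_hits)
    return rbl_part + other_score


def should_reject(rbl_hits: Iterable[str], other_score: float) -> bool:
    """Reject only when the combined evidence crosses the threshold."""
    return spam_score(rbl_hits, other_score) >= REJECT_THRESHOLD
```

With these numbers, a single listing is never enough on its own to reject a message; it has to be backed up by other evidence.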

eezcloud · November 2024

@Razza said: you receive from major free inbox

It is really unfair how much Gmail makes independent email servers go through when they can't (or don't want to) block the firehose of spam that they originate. They don't do anything to prevent Gmail accounts from sending the most blatant Nigerian prince, refund scams, sextortion scams, etc.

jar · November 2024

@eezcloud said:

@Razza said: you receive from major free inbox

It is really unfair how much Gmail makes independent email servers go through when they can't (or don't want to) block the firehose of spam that they originate. They don't do anything to prevent Gmail accounts from sending the most blatant Nigerian prince, refund scams, sextortion scams, etc.

Gmail is extremely frustrating to the rest of us and that perfectly describes it.

itsdeadjim · November 2024

@jar said: That would be the dream but I can’t self host an LLM that would perform well enough or reach our scale at a reasonable overhead. It’s a series of efforts that combine into what I’m expecting to see received as one of the best filtering systems around, with content filters being the last line of defense.

Never worked on spam classification, but using LLMs for that is a bit of an overkill, isn't it?

jar · November 2024

@itsdeadjim said:

@jar said: That would be the dream but I can’t self host an LLM that would perform well enough or reach our scale at a reasonable overhead. It’s a series of efforts that combine into what I’m expecting to see received as one of the best filtering systems around, with content filters being the last line of defense.

Never worked on spam classification, but using LLMs for that is a bit of an overkill, isn't it?

I think it would be perfect because it would be better at learning and adapting quickly to the continual changes spammers are making to try to circumvent filters. But I would only feel comfortable with it if it were a locally running model connected via extremely low latency private network to the server. Huge bottleneck potential, and via API huge privacy violation.

itsdeadjim · November 2024

@jar said: I think it would be perfect because it would be better at learning and adapting quickly to the continual changes spammers are making to try to circumvent filters. But I would only feel comfortable with it if it were a locally running model connected via extremely low latency private network to the server.

Yeah, I see your point. I meant that using some custom word embeddings and some light classifiers would probably be much more efficient to train/infer and probably perform better.

eezcloud · November 2024

@itsdeadjim said: much more efficient to train/infer and probably perform better.

Yea, you don't need a full blown multi-billion parameter LLM to effectively use ML techniques.

eezcloud · November 2024

The only real problem in training any kind of a model on real emails is that you need an effective way to remove PII from the dataset.
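
As a rough illustration of why this is hard, here is a minimal regex-based redaction sketch. The patterns and placeholder tokens are assumptions chosen for the example; they only catch email addresses, IPv4 addresses, and phone-like digit runs, and would miss names, postal addresses, account numbers, and much more.

```python
import re

# Deliberately simple patterns; real mail contains plenty of PII these miss.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace matched spans with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    # IPs go before phones, because the phone pattern would also match them.
    text = IPV4_RE.sub("<IP>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


if __name__ == "__main__":
    print(redact("Contact bob@example.com or +1 (555) 010-9999 from 192.0.2.10"))
```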

itsdeadjim · November 2024

@eezcloud said: Yea, you don't need a full blown multi-billion parameter LLM to effectively use ML techniques.

@jar has a point though, that an LLM has been trained on enough context to understand any language trick used by spammers in future.

I have a feeling that for such a simple task, lighter classifiers could still have enough power even without transformers. But worst case, a cut-down pretrained model with 200-300M parameters could also be more than enough.

It has enough language understanding to generalize, and can be very easily fine-tuned to identify spam.

Running in parallel on a fast engine like vLLM, it can have very low latency.

As far as I can see, there are models along those lines on HF, like: https://huggingface.co/phishbot/ScamLLM https://huggingface.co/ggrizzly/roBERTa-spam-detection etc

@eezcloud said: The only real problem in training any kind of a model on real emails is that you need an effective way to remove PII from the dataset.

Train another model to do that /jk
IIRC there are many open datasets out there.

eezcloud · November 2024

@itsdeadjim said: IIRC there are many open datasets out there.

Yes, but many of them are 20 years old or so. Spam isn't quite the same as it was back then.

jar · November 2024

Bayesian learning is still the most popular open source algorithm and I still find it to be completely unable to keep up with current trends, much as I reached the same conclusion in 2013. Even rspamd's fuzzy misses pretty much everything no matter how much you feed it. Spammers are changing their messages up too rapidly for simple approaches. LLMs are the only thing new that has a chance of changing the game on pure content filtering. If it can't even seem like it can think like a human, it doesn't stand a chance in 2024. You can bet the spammers are using LLMs to help them out.
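
For context on what "Bayesian learning" means here, below is a minimal sketch of a multinomial naive Bayes filter with add-one smoothing. It is a toy written for this example, not how rspamd or any other project implements it, and the training messages are invented.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class NaiveBayes:
    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens: List[str], label: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Smoothed log prior, then smoothed log likelihood of each token.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for tok in tokens:
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def classify(self, text: str) -> str:
        tokens = tokenize(text)
        spam = self._log_score(tokens, "spam")
        ham = self._log_score(tokens, "ham")
        return "spam" if spam > ham else "ham"
```

The weakness is visible even at this scale: a token the model has never seen, such as an obfuscated `r3fund`, gets the same smoothed probability under both classes and so carries no signal at all.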

eezcloud · November 2024

@jar said: Spammers are changing their messages up too rapidly for simple approaches

Or they just embed it in an image, and fly right past every spam filter out there.

adly · November 2024

@eezcloud said:

@jar said: Spammers are changing their messages up too rapidly for simple approaches

Or they just embed it in an image, and fly right past every spam filter out there.

There could/should at least be some OCR done on images, which I'm sure the big guys do in some form, but isn't widely available for self-hosting and smaller providers.

eezcloud · November 2024

@adly said: There could/should at least be some OCR done on images, which I'm sure the big guys do in some form, but isn't widely available for self-hosting and smaller providers.

I've played around with postfix content filters, and it would be incredibly easy to add OCR and PDF text extraction, but without good training data to properly train a classifier, it isn't worth it.

misterm · November 2024

An LLM to sort your mailbox could be handy. A better version of what Google does, sorting promotions and social posts from transactional email from personal messages.

blackjack4494 · November 2024

You won't need a full LLM; an SLM (small language model) will be enough. Most modern models are already good enough. Some can be further trained/specialized or adapted to use multi-shot prompting. You might even be able to use multiple different SLMs.

filtered · November 2024

I was actually skeptical about LLMs for spam prediction, but according to JPMorgan they work quite well. I just don't think it'd be cheap enough to justify.

bobert · November 2024

@blackjack4494 said: You won't need a full LLM; an SLM (small language model) will be enough.

I tested the precursor to LLMs (BERT) and they are too stupid to be useful.

blackjack4494 · November 2024

@bobert said:

@blackjack4494 said: You won't need a full LLM; an SLM (small language model) will be enough.

I tested the precursor to LLMs (BERT) and they are too stupid to be useful.

So you tested a single SLM?
Luckily there are some academic publications on SLMs for tasks like spam detection.

itsdeadjim · November 2024

@bobert said: I tested the precursor to LLMs (BERT) and they are too stupid to be useful.

Did you fine-tune it for this task?

black · November 2024

I know the pain. Hardware costs should come down in a few years, which will eventually make local computation viable. Adaptability is a challenge, especially in the context of spam, because it changes very fast and today the common approach is offline learning. All the resources you put into making a model become stale over time due to the adaptability of spammers. If you're doing R&D, online learning techniques are something to explore. Also, speaking from experience, it's an unsolvable problem because it's always going to be a moving target. The only thing you can do is aim for perfection knowing it can never be perfect.
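
To make the offline/online distinction concrete, here is a minimal sketch of an online learner: logistic regression over hashed tokens that takes one gradient step per labeled message, so it can keep adapting as new spam arrives instead of being retrained in batches. The bucket count, learning rate, and class name are assumptions picked for the example.

```python
import math
import zlib
from typing import Set

N_BUCKETS = 2 ** 16  # assumed size of the hashed feature space
LEARNING_RATE = 0.5  # assumed; would need tuning on real mail


def features(text: str) -> Set[int]:
    """Hash each distinct token into one of N_BUCKETS feature slots."""
    return {zlib.crc32(tok.encode()) % N_BUCKETS for tok in text.lower().split()}


class OnlineSpamModel:
    def __init__(self) -> None:
        self.weights = [0.0] * N_BUCKETS
        self.bias = 0.0

    def predict_proba(self, text: str) -> float:
        """Estimated probability that the message is spam."""
        z = self.bias + sum(self.weights[i] for i in features(text))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, text: str, is_spam: bool) -> None:
        """One SGD step on a single labeled message (log loss)."""
        error = self.predict_proba(text) - (1.0 if is_spam else 0.0)
        for i in features(text):
            self.weights[i] -= LEARNING_RATE * error
        self.bias -= LEARNING_RATE * error
```

Each call to `update` could be driven by a user marking a message as spam or not spam, which is what lets the model follow the moving target rather than go stale.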

bobert · November 2024

@itsdeadjim said: Did you fine-tune it for this task?

Yes I specifically trained it on my own data.

kevinds · November 2024

@eezcloud said:
Or they just embed it in an image, and fly right past every spam filter out there.

Emails without text, just an image attached, automatically get a high spam score.
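
A rule like that can be sketched with Python's standard `email` package. This is only an illustration of the heuristic, not any particular filter's rule; the function names, the 20-character cutoff, and the 4.0 penalty are assumptions.

```python
from email.message import EmailMessage


def visible_text(msg: EmailMessage) -> str:
    """Concatenate the inline text parts (plain or HTML) of a message."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() == "text" and not part.is_attachment():
            chunks.append(part.get_content())
    return " ".join(chunks).strip()


def has_image(msg: EmailMessage) -> bool:
    """True if any MIME part of the message is an image."""
    return any(p.get_content_maintype() == "image" for p in msg.walk())


def image_only_penalty(
    msg: EmailMessage, min_chars: int = 20, penalty: float = 4.0
) -> float:
    """Score to add when a message has an image but almost no text."""
    if has_image(msg) and len(visible_text(msg)) < min_chars:
        return penalty
    return 0.0
```

Messages read from disk or a socket need to be parsed with `policy=email.policy.default` to get `EmailMessage` objects that support `get_content()` and `is_attachment()`.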
