How do you sanitize the data you put into commercial LLMs?

CloudHopper · July 5

As the title suggests, when you copy logs, errors, broken code and whatever else into a commercial clanker, how do you anonymize your personally identifiable information first?

rpqu · July 5

Uh sed?

Alyx · July 5

I would take bets that nobody really does that 🫣

zed · July 5

wait what

nghialele · July 5

So people did that? I just throw raw into it

JohnFilch123 · July 5

I usually use Notepad or Notepad++ and use search & replace to bulk remove/replace data. For domains I usually use example.com, for IPs I usually use ipv4 or ipv6.
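For anyone who wants the same search & replace without the manual clicking, here is a minimal Python sketch. The two patterns are assumptions made for illustration: the IPv4 regex also hits anything shaped like four dotted numbers (version strings, for example), the domain regex only knows a handful of TLDs, and IPv6 is not handled at all.

```python
import re

# Assumed, deliberately simple patterns; not a complete PII detector.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN = re.compile(r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|io)\b", re.IGNORECASE)

def scrub(text: str) -> str:
    """Replace IPv4 addresses and a few common domains with fixed placeholders."""
    text = IPV4.sub("ipv4", text)
    return DOMAIN.sub("example.com", text)

line = '203.0.113.7 - - "GET https://shop.mysite.net/cart HTTP/1.1" 502'
print(scrub(line))
# prints: ipv4 - - "GET https://example.com/cart HTTP/1.1" 502
```

Every IP collapses to the same placeholder, so you lose the ability to tell hosts apart in the sanitized log.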

totally_not_banned · July 5

@Alyx said:
I would take bets that nobody really does that 🫣

Yeah fat chance people would trade in any of that sweet, sweet efficiency gain for something boring like privacy or security. Not like the whole concept is likely to cross their minds at all but even if it did: No way.

TrikeLike · July 5

I'm sure someone (maybe you) could vibe-slop something together for this quickly. If that's too much work, I imagine you could just feed it through some lightweight local model that strips this info out first, then copy-paste that into the prompt of your commercial chatbot of choice

oloke · July 5

@JohnFilch123 said:
I usually use Notepad or Notepad++ and use search & replace to bulk remove/replace data. For domains I usually use example.com, for IPs I usually use ipv4 or ipv6.

Actually I do the same for now. Would love to know if there are better solutions. That said, I don't rely on LLMs for really sensitive things anyway.

I think there are now companies specializing in stopping users from putting sensitive company data into a clanker. I saw one ad on YouTube but can't remember the name of the product.

JohnFilch123 · July 5

@oloke said: Would love to know if there are better solutions

I guess for really bulk logs etc. you can run them through your local LLM, which will sanitize them. Or you can probably vibe code a web app in JS or something.

PolyAnthi · July 5

I've seen more and more things like https://x.com/MaziyarPanahi/status/2073383825669849118 show up lately on my twitter feed (yes I know it is renamed X, cope).

I imagine that would be the perfect utilisation, using local models to pre-clean your output to commercial LLMs.

Void · July 5

So you guys don’t say YOLO and send all the logs and code as is?

CloudHopper · July 5

@JohnFilch123 said:

@oloke said: Would love to know if there are better solutions

I guess for really bulk logs etc. you can run them through your local LLM, which will sanitize them. Or you can probably vibe code a web app in JS or something.

I'm also using Notepad++ and actively substituting things to keep the structure, but it sucks. I'm considering setting up an 8b local LLM for it, but I was hoping for a cheaper solution.

@PolyAnthi said:
I've seen more and more things like https://x.com/MaziyarPanahi/status/2073383825669849118 show up lately on my twitter feed

That looks like the kind of thing I need, but I'd also need a workflow for it too. I can use Open-WebUI for a local LLM, but my commercial clanker subscription only works via their own UI, so I guess I'd have to work out a solution that injects/extracts prompts/responses from a browser, and that's an increasing headache. But still better that than voluntarily giving critical PII to the world's biggest data vampires.

PolyAnthi · July 5

@CloudHopper said:

@JohnFilch123 said:

@oloke said: Would love to know if there are better solutions

I guess for really bulk logs etc. you can run them through your local LLM, which will sanitize them. Or you can probably vibe code a web app in JS or something.

I'm also using Notepad++ and actively substituting things to keep the structure, but it sucks. I'm considering setting up an 8b local LLM for it, but I was hoping for a cheaper solution.

@PolyAnthi said:
I've seen more and more things like https://x.com/MaziyarPanahi/status/2073383825669849118 show up lately on my twitter feed

That looks like the kind of thing I need, but I'd also need a workflow for it too. I can use Open-WebUI for a local LLM, but my commercial clanker subscription only works via their own UI, so I guess I'd have to work out a solution that injects/extracts prompts/responses from a browser, and that's an increasing headache. But still better that than voluntarily giving critical PII to the world's biggest data vampires.

May be worth getting a clanker to build a browser extension that redacts data before it reaches said clanker.

rpqu · July 5

@JohnFilch123 said:
I usually use Notepad or Notepad++ and use search & replace to bulk remove/replace data. For domains I usually use example.com, for IPs I usually use ipv4 or ipv6.

Just replace them with words

JohnFilch123 · July 5

@oloke said: Would love to know if there are better solutions

I guess for really bulk logs etc. you can run them through your local LLM, which will sanitize them. Or you can probably vibe code a web app in JS or something.

@rpqu said: Just replace them with words

If only I understood what you mean....

JohnFilch123 · July 5

@CloudHopper said: cheaper solution

I can vibe code some kind of pastebin service that will strip personal/private data, but I doubt anybody will use it much since people have got a strong negative sentiment towards vibe coders.

rpqu · July 5

@JohnFilch123 said:

@rpqu said: Just replace them with words

If only I understood what you mean....

Let's say nginx log -> sed your IPs using /etc/hosts -> run a script that temporarily assigns common names from a dictionary as replacements for random IPs -> saves tokens
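Roughly what that script could look like, as a sketch rather than the actual setup described: the `NAMES` list and both function names are made up for illustration, names are handed out in order of first appearance instead of randomly, and only IPv4 is matched.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Hypothetical dictionary of short names; any word list would do.
NAMES = ["alice", "bob", "carol", "dave", "erin", "frank"]

def pseudonymize(text):
    """Swap each distinct IPv4 for a short name; return the new text and the mapping."""
    mapping = {}

    def repl(match):
        ip = match.group(0)
        if ip not in mapping:
            i = len(mapping)
            # Fall back to host<N> once the dictionary runs out.
            mapping[ip] = NAMES[i] if i < len(NAMES) else f"host{i}"
        return mapping[ip]

    return IPV4.sub(repl, text), mapping

def restore(text, mapping):
    """Put the original IPs back into text that still uses the short names."""
    for ip, name in mapping.items():
        text = re.sub(rf"\b{re.escape(name)}\b", lambda _m, ip=ip: ip, text)
    return text

log = "198.51.100.4 GET /a\n203.0.113.9 GET /b\n198.51.100.4 GET /c"
clean, mapping = pseudonymize(log)
print(clean)
```

The same IP always gets the same name, so the structure of the log survives, and `restore(reply, mapping)` can translate the answer back. One caveat: if a name such as "bob" also occurs naturally in the reply, `restore` will replace it too, so less ordinary placeholder words are safer.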

rpqu · July 5

@JohnFilch123 said:

@CloudHopper said: cheaper solution

I can vibe code some kind of pastebin service that will strip personal/private data, but I doubt anybody will use it much since people have got a strong negative sentiment towards vibe coders.

Vibebin.com regged
Clankbin.com
Crankbin.com

JohnFilch123 · July 5

@rpqu said:

@JohnFilch123 said:

@rpqu said: Just replace them with words

If only I understood what you mean....

Let's say nginx log -> sed your IPs using /etc/hosts -> run a script that temporarily assigns common names from a dictionary as replacements for random IPs -> saves tokens

Ah, got it now. I would prefer to have a web app to do it automatically; using scripts is kinda tiring for me. However, ya, this is an option.

JohnFilch123 · July 5

@rpqu said: Vibebin.com

Nah, too expensive. I have got a few short domains, maybe reuse them instead.

network · July 5

You guys send your logs to AI manually? Just give claude code permission to run ssh * and give it the hostname.

network · July 5

@network said:
You guys send your logs to AI manually? Just give claude code permission to run ssh * and give it the hostname.

Just make sure you have ssh agent running locally and passwordless sudo on the server.

rpqu · July 5

@network said:
You guys send your logs to AI manually? Just give claude code permission to run ssh * and give it the hostname.

ssh, git, rsync, docker, scp, source, nohup, sudo, rm

CloudHopper · July 5

@network said:
You guys send your logs to AI manually? Just give claude code permission to run ssh * and give it the hostname.

You give Claude unfettered access to your digital life without a condom? 😲

Levi · July 5

0 fck given. Paste everything with passwords, tokens. All .cfg. .env goes directly into prompt.

rm_ · July 5

Ask another commercial LLM to sanitize it for you

Seriously though, I would only not send passwords. The rest, nobody cares.
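If passwords are the only thing you strip, a small filter for `KEY=VALUE` style config is one way to do it. This is a sketch under assumptions: the keyword list in `SECRET_KEY` and the `<REDACTED>` marker are arbitrary choices for illustration, and secrets embedded in URLs or multi-line values will slip through.

```python
import re

# Assumed key names that hint at a secret; extend for your own configs.
SECRET_KEY = re.compile(
    r"^([ \t]*[\w.-]*(?:password|passwd|secret|token|api_?key)[\w.-]*[ \t]*[=:][ \t]*)(.+)$",
    re.IGNORECASE | re.MULTILINE,
)

def mask_secrets(text: str) -> str:
    """Keep the key and separator, replace the value with a marker."""
    return SECRET_KEY.sub(r"\1<REDACTED>", text)

env = "DB_HOST=db.internal\nDB_PASSWORD=hunter2\nAPI_KEY: abc123\nDEBUG=true"
print(mask_secrets(env))
```

Lines whose key does not contain one of the keywords are left untouched, so the rest of the .env or .cfg still reads normally.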

RIYAD · July 5

Sanitize: do not make mistake 😭

wdmg · July 5

"Hey ChatGPT, can you sanitize this data for me so I can ask Claude who will consult with Grok and I'll bring it back to you to unsanitize it so I can review it?"

Hotmarer · July 5

I am using LiteLLM with guardrails and some PII rules that, for example, replace IP addresses or domains. I am also trying to route to local LLMs when possible.

Pandy · July 5

i remember openai released some sort of tiny open model for PII sanitization, but i have no idea how well it works etc
https://github.com/openai/privacy-filter
