How would you provide clients access to easily download extremely large backups?

Francisco · May 2025

@LordSpock said: Oh that's a neat idea.

To be fair this is exactly how Google Takeout works. You request a backup and after some time (hours or days) you get a multi part zip file (2GB chunks?). It works and most people will understand how to work with it.

The same staffer that is adamant about this did a POC of my streaming gzip as well. It works well, but with 2 large caveats. You don't know how big the archive is, so you never know when you're 'done' other than when the stream disconnects. Did it disconnect because it's all done or because of a network disruption? No idea.

There's no way to know the final size with the stream gzip option, nor any way to resume (since we have no guarantee that the files will be in the exact same order).

The archive option is more involved since we have to spend the time creating the archive and then we have to allocate those additional resources to store the archive for whatever the hold time is (1 week?). It's not a big deal, we have a PB or two just for backups at the moment, with us using only ~100T of it.
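
A minimal sketch of the multi-part archive idea, assuming the archive has already been built on disk. The function name `split_archive`, the `.partNNNN` naming, and the JSON manifest layout are all hypothetical, not anything Google Takeout or Namecrane is known to use. The point is that fixed-size parts plus a manifest give the client a known total size and a per-part checksum, so a failed part can be re-downloaded on its own.

```python
import hashlib
import json
from pathlib import Path


def split_archive(archive: Path, out_dir: Path,
                  part_size: int = 2 * 1024**3,
                  buf_size: int = 1024 * 1024) -> dict:
    """Split `archive` into numbered parts and write a manifest next to them.

    The manifest records the total size plus the size and SHA-256 of every
    part. The layout is an assumption made for this sketch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "name": archive.name,
        "total_size": archive.stat().st_size,
        "parts": [],
    }
    with archive.open("rb") as src:
        index = 0
        while True:
            digest = hashlib.sha256()
            written = 0
            part_path = out_dir / f"{archive.name}.part{index:04d}"
            with part_path.open("wb") as dst:
                # Copy in small buffers so a 2GB part never sits in memory.
                while written < part_size:
                    chunk = src.read(min(buf_size, part_size - written))
                    if not chunk:
                        break
                    dst.write(chunk)
                    digest.update(chunk)
                    written += len(chunk)
            if written == 0:
                part_path.unlink()  # source exhausted; drop the empty file
                break
            manifest["parts"].append({
                "file": part_path.name,
                "size": written,
                "sha256": digest.hexdigest(),
            })
            index += 1
    manifest_path = out_dir / f"{archive.name}.manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

The client fetches the manifest first, downloads the parts in any order (or in parallel), verifies each hash, and concatenates them to get the original archive back.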

Francisco

JabJab · May 2025

As there is no standardized way of importing those backups anywhere (it depends on the platform used), splitting things into different archives is a must. Most people are gonna take out their mail, fuck the memos/notes, maybe someone will care about files?

I don't get the streaming idea. In theory it's a nice idea, but in practice working with big files without a resume option (or chunks for multithreaded downloads) is a PITA. You will end up with a client downloading things to some remote shitty host at 8kB/s and the archive will be 300GB. Will you hold it in memory for weeks? Then of course it's gonna die like 157175 times and you will end up with 157175 attempts to download that huge thing, rip memory/cpu/io.

Pack it on demand [could be API, could be ticket] and ship it to some remote S3. Remote meaning the client's S3, not yours. He controls that S3, so only he can be blamed for speed / lack of space / cost and how long he wants to store it there. He is in control of his data.

Plus if you ever end up with EU mail (still waaaaaaaaaiting) you won't be harassed for uploading EU data to a non-EU server?

And yeah, on demand: most people won't care, and otherwise people are gonna forget about it and be angry that they stopped using your service (without terminating, of course) while you kept putting data into their S3 every week and their bill racked up!!1111

Plus you can control it so you won't get (D)DOS-ed by everyone clicking "Backup" at the same time.

macmouse · May 2025

What fastmail does is take regular snapshots of their data (on a per-user basis) and then when you need to restore something, they "mount" a read-only copy of the selected version to the webmail client under a "restore" folder that is available for ~24 hours.

The items (email,files,calendars,contact) show under the corresponding "shared" section, as if another user had shared the folder but I assume driven by some service account.

Anything they want to restore, just select the desired item and copy it over to desired folder like normal.

Very user friendly and don't really need to learn anything new.

Would be a bit annoying for an admin if they needed to restore a ton of accounts at once but tbh most of the time it's just one person who made the mistake...

It does take a few minutes to load (it starts with an empty folder and then starts importing) but that's fine... They already survived ~30 days with the email "missing" with it sitting in their trash before it got expunged - they can wait a little longer.

Francisco · May 2025

@JabJab said: As there is no standardized way of importing those backups anywhere (it depends on the platform used), splitting things into different archives is a must.

You can always install a copy of smartermail and drop the whole domain in place and run with.

It isn't a PST file, that'd require we have login/passwords to (attempt to) export.

Francisco

Francisco · May 2025

@macmouse said: What fastmail does is take regular snapshots of their data (on a per-user basis) and then when you need to restore something, they "mount" a read-only copy of the selected version to the webmail client under a "restore" folder that is available for ~24 hours.

This isn't for 'restoring' within Namecrane. We'll have to think up something clever there.

This is more for people that want their data so they can take it somewhere else all together (or just to have it for legal archival/hold reasons).

Francisco

JabJab · May 2025

@Francisco said: You can always install a copy of smartermail and drop the whole domain in place and run with.

I mean most people (I assume) who migrate / back up from a hosted service are gonna go to another hosted service, using a different app stack.

Francisco · May 2025

@JabJab said: I mean most people (I assume) who migrate / back up from a hosted service are gonna go to another hosted service, using a different app stack.

Fair, and you're right, there isn't some standard in place to make that easy to do. We could try to export PSTs that someone could then import and 'restore', but Outlook already does that locally. You can just save the PST, attach it to a new inbox, and push the data back.

If you have a big inbox that's going to suck ass, but that's your problem.

Francisco

macmouse · May 2025

Oops, my bad..

IMO these are three different use cases that need three different solutions.

For offsite backup purposes, you want to efficiently de-duplicate all the data and store it somewhere reliable.

For migration purposes, it always depends on what tools they have and what the new destination service supports and it's always a little different each time..

Most of the time you end up with having to move to the lowest common denominator (PST if it's microsoft/outlook world or in *nix land a folder of maildir/etc files).

I would keep it simple and have one archive file per user in some standardized format...
domain.com-username.zip or what have you.

While some of the nicer services provide a user interface to migrate stuff (although generally that is using imap-to-imap but then you don't need to have an export in the first place), I would say half the time you end up having to run a script on your computer anyway because everyone has slightly different file formats.

For archival/legal purposes, they generally want the data fed into a specialized database that is set up to lock the records (emails), which prevents them from being modified/deleted before a certain period of time has passed (varies by industry and the specific regulations).

The data also needs to be searchable, which generally means it needs to end up in some kind of a database...

Since you're now using a database, might as well use a SQL connector or load in a nightly "file" with just the changes. Providing the whole data set every time will not be practical.

At one construction firm I worked at, it was ~7 years because that is what the local regulatory commission wanted, but at a fintech firm it was like 20 years due to federal finance regulations.

Doing a "once a day snapshot of everything" was good enough for the construction firm and we used an off-the-shelf software, MailStore, for that, running on the archive NAS. This was read using imap on "production" (using a special service account that saw everything) and then fed into the Microsoft(?) SQL server that MailStore used on the backend.

However, the fintech one required us to store copies of every revision of emails, including draft versions they saved (even if it was later deleted and not actually sent anywhere).

The latter in particular was a big pain and we ended up having to switch to an email server using an SQL backend that was "write only", so it kept every possible revision.

Besides the regular archival job, we would have to do regular re-imports (alternating between the primary and standby server), so the expunged records were actually removed (make an empty table on the other server, sync non-expunged messages and then switch which is primary) because outlook gets grumpy when you have too much to slog through...

TimboJones · May 2025

@Francisco said:

@cmeerw said: How is the customer supposed to use that backup? Set up their own cranemail compatible service to access the data? Is that documented somewhere how to do that?

You can take the backup, throw it into a smartermail install, and be on your way. Smartermail has a free tier that allows 1 domain and 10 users. File Storage items are stored as the whole file w/ the correct name/path, making it easy enough to cherry pick.

To be fair though, Google Takeout probably doesn't have an "Eat in" option where you can import the backup, do they? I'm assuming it's a one way trip. I think protonmail has an 'import' option.

imapsync would allow 2 way push/pull on this as you said, but we do have people that want access to download backups. imapsync also requires you have every users login/password, which is rarely the case.

You can import gmail and Microsoft from within the web gui settings.

Microsoft and Google allow mailboxes to delegate access to admins. I know for 365, it was PowerShell only to enable.

Francisco · May 2025

@TimboJones said: You can import gmail and Microsoft from within the web gui settings.

Microsoft and Google allow mailboxes to delegate access to admins. I know for 365, it was PowerShell only to enable.


I think you're going in the wrong direction.

We're talking about exporting all emails/etc on a domain into some format a user can then restore to another smartermail install, or what have you.

Francisco

schwabene · May 2025

@Francisco said:

Ideas?

Francisco

Here is one.
You're a hosting business after all.

For each user, set up a folder on a separate machine with rsync write access for you, and read-only access for them (e.g., via SFTP). This way, they cannot delete the folder and trigger a full resync.
In your backup process, rsync to BOTH your ZFS storage (as you normally do) and the user's folder. This adds redundancy without affecting your current backups.
If the user wants versioning (by simply creating daily archives or doing borg backups), sell them a storage slab on the same machine and let them handle it themselves - or maybe you’ll decide to handle it for them..

Result: No need to transfer 10TB backup files around.

Francisco · May 2025

@schwabene said: Result: No need to transfer 10TB backup files around.

But then we need to keep at least 2 copies of the data...when we already do fairly paranoid R60s (6-disk vdevs).

Since I opened this thread we've had time to test different setups and we'll be replacing our rsync/zfs layer with restic. The reason is that with zfs/rsync, we have to keep track of how much storage each remote node is using and shuffle things around. We can use JBODs and such but that gets really iffy.

Restic has an S3 backend and we have an S3 platform that easily scales out (we can just throw hardware at it and it's all under the same endpoint). Giving users a "mostly read-only" login would be fine (basically only has write access to the locks folder, or we just tell them to use the ignore locks option in restic).

With that we're keeping 1 complete/full copy and then if we want to offer the tar/zips, we can without too much effort, just a little bit of storage bloat.
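
A rough sketch of what the customer side of that "mostly read-only" restic access could look like, written as a small Python wrapper that only builds the command line and environment. The repository URL, function names, and credential values are hypothetical placeholders. The `--repo` and `--no-lock` flags, the `snapshots`, `restore` and `mount` subcommands, and the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` / `RESTIC_PASSWORD_FILE` variables are standard restic usage, but whether every subcommand behaves well against a read-only bucket is an assumption to test.

```python
import os
from typing import Mapping, Optional


def restic_command(subcommand: list[str], repo: str,
                   read_only: bool = True) -> list[str]:
    """Build a restic argv; --no-lock avoids writing lock files to the repo."""
    argv = ["restic", "--repo", repo]
    if read_only:
        argv.append("--no-lock")
    return argv + subcommand


def restic_env(access_key: str, secret_key: str, password_file: str,
               base: Optional[Mapping[str, str]] = None) -> dict:
    """Return an environment carrying S3 credentials and the repo password file."""
    env = dict(os.environ if base is None else base)
    env.update({
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "RESTIC_PASSWORD_FILE": password_file,
    })
    return env


# Hypothetical per-domain repository on the provider's S3 endpoint.
REPO = "s3:https://s3.example.com/backups/example.com"

list_cmd = restic_command(["snapshots"], REPO)
restore_cmd = restic_command(["restore", "latest", "--target", "./restore"], REPO)
mount_cmd = restic_command(["mount", "/mnt/mail-backup"], REPO)

# To actually run one of them (requires restic and valid credentials):
#   import subprocess
#   subprocess.run(list_cmd, env=restic_env("KEY", "SECRET", "/path/to/pw"), check=True)
```

With credentials that cannot write to the bucket at all, `--no-lock` on every call is the simpler of the two options mentioned above, since nothing needs write access to the locks folder.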

Francisco

ypmLA77zcs · May 2025

So we'll get access to one Restic repository per domain (just in read-only mode)? If so - would "restic mount" work too?

Or I got it all wrong?

TIA

Francisco · May 2025

@ypmLA77zcs said:
So we'll get access to one Restic repository per domain (just in read-only mode)? If so - would "restic mount" work too?

Or I got it all wrong?

TIA

Mount would probably work too.

Francisco

lui · May 2025

@Francisco said:

@LordSpock said: Oh that's a neat idea.

To be fair this is exactly how Google Takeout works. You request a backup and after some time (hours or days) you get a multi part zip file (2GB chunks?). It works and most people will understand how to work with it.

The same staffer that is adamant about this did a POC of my streaming gzip as well. It works well, but with 2 large caveats. You don't know how big the archive is, so you never know when you're 'done' other than when the stream disconnects. Did it disconnect because it's all done or because of a network disruption? No idea.

There's no way to know the final size with the stream gzip option, nor any way to resume (since we have no guarantee that the files will be in the exact same order).

The archive option is more involved since we have to spend the time creating the archive and then we have to allocate those additional resources to store the archive for whatever the hold time is (1 week?). It's not a big deal, we have a PB or two just for backups at the moment, with us using only ~100T of it.

Francisco

You can code the streaming gzip with node.js streams and know whether it failed or succeeded. I've done that multiple times. If you DM I can send an example

Francisco · May 2025

@luissousa said: You can code the streaming gzip with node.js streams and know whether it failed or succeeded. I've done that multiple times. If you DM I can send an example

We already have a working POC of it. It's just that there's no way to know the final size to put in Content-Length.

Francisco

cmeerw · May 2025

@Francisco said: You don't know how big the archive is, so you never know when you're 'done' other than when the stream disconnects. Did it disconnect because it's all done or because of a network disruption? No idea.

Your http client should be able to tell the difference (even on a raw TCP socket you can tell the difference between a clean shutdown from the server or a network disruption - with a TLS layer on top, your TLS library will tell you as well; and then there is also HTTP chunked transfer encoding that could help)
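
Besides the transport-level signals above, the gzip format itself ends with a CRC-32 and length trailer, so a client can also check completeness at the format level. A minimal sketch using Python's `zlib`; the function name `gzip_stream_is_complete` is made up for this example, and it assumes the download is a single gzip member. It throws the decompressed bytes away, where a real client would write them out as it goes.

```python
import gzip
import os
import zlib
from typing import Iterable


def gzip_stream_is_complete(chunks: Iterable[bytes]) -> bool:
    """Return True only if `chunks` form one complete, valid gzip stream."""
    # wbits = 16 + MAX_WBITS tells zlib to expect gzip framing
    # (header plus CRC-32/size trailer).
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    try:
        for chunk in chunks:
            decomp.decompress(chunk)  # output discarded in this sketch
    except zlib.error:
        return False  # corrupt data or a failed trailer check
    # eof is set only once the end-of-stream marker and trailer were read.
    return decomp.eof


# Simulate a download arriving in 100-byte pieces.
payload = gzip.compress(os.urandom(5000))
pieces = [payload[i:i + 100] for i in range(0, len(payload), 100)]

complete = gzip_stream_is_complete(pieces)        # full stream
truncated = gzip_stream_is_complete(pieces[:-1])  # connection died early
```

This answers "did it finish or did the network drop", but it does nothing for resuming, which is the harder half of the problem.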

But yes, streaming doesn't go together that well with resuming.

Motion3549 · May 2025

So which option did you go with?

Francisco · May 2025

@Motion3549 said: So which option did you go with?

Not decided yet. Akash really wants us to build gzip/zip's upon request, and then the user pulls that down. I think streaming would work nicely, but he's worried about incomplete archives, or archives that we can't resume.

I agree with that and think we could likely just do the gzip/zip/zst archives without too much issue.

Francisco

ypmLA77zcs · May 2025

@Francisco said:

@Motion3549 said: So which option did you go with?

Not decided yet. Akash really wants us to build gzip/zip's upon request, and then the user pulls that down. I think streaming would work nicely, but he's worried about incomplete archives, or archives that we can't resume.

I agree with that and think we could likely just do the gzip/zip/zst archives without too much issue.

Francisco

So restic is no longer in the cards?

Francisco · May 2025

@ypmLA77zcs said:

@Francisco said:

@Motion3549 said: So which option did you go with?

Not decided yet. Akash really wants us to build gzip/zip's upon request, and then the user pulls that down. I think streaming would work nicely, but he's worried about incomplete archives, or archives that we can't resume.

I agree with that and think we could likely just do the gzip/zip/zst archives without too much issue.

Francisco

So restic is no longer in the cards?

I think restic is fine and doable too. Most users wouldn’t be doing restic though, only the technical/advanced users.

Francisco

ypmLA77zcs · September 2025

Hi @Francisco, what solution did you end up using? Were you able to allow customers' use of restic to access mail server backups?

TIA

classy · September 2025

I would pay a small fee for S3 push, that’d be set up and forget with Backblaze B2

Francisco · September 2025

@ypmLA77zcs said:
Hi @Francisco, what solution did you end up using? Were you able to allow customers' use of restic to access mail server backups?

TIA

We haven’t decided one way or another. I got caught up with some other big projects and this fell by the wayside.

Francisco

ypmLA77zcs · September 2025

I REALLY hope restic is still a serious contender

Francisco · September 2025

@ypmLA77zcs said:
I REALLY hope restic is still a serious contender

I think it’s as close to “ideal” as I can come up with, and wouldn’t be overly complex to implement.

Francisco

vitobotta · September 2025

I'm on macOS so I just use https://thehorcrux.com to back up my email

ypmLA77zcs · September 2025

@vitobotta said:
I'm on macOS so I just use https://thehorcrux.com to back up my email

That sounds interesting, but I'd rather minimize the number of programs installed and stay away from a proprietary backup scheme. I'm using restic already for my backups, so it only makes sense having access to a restic repository for my email

czed · November 2025

For huge backups (10TB+ especially) I'd suggest the option to physically mail a hard drive with the contents. Place a hold on the users credit card for the value of the drive until it is returned.

lantudai · November 2025

As a customer, either S3 or restic works for me. I feel sending to an S3 bucket is clean and versatile. Later, you can develop features based on S3, such as using restic to back up to S3, which combines the advantages of these two tools.
