cheapest way to host public datasets — vps or object storage?
hey, looking for some advice on the cheapest way to host and share datasets for an open data project i'm working on.
the larger sets are around 5-10gb each, so if those get downloaded frequently the traffic will add up fast. my rough estimate is that we add somewhere between 25 and 50gb of new data per month.
on top of that we have smaller and even tiny datasets that are already being used in some of our automation scripts. if other people start pulling those into their own scripts too, it's not just the bandwidth that becomes a concern — it's also the sheer number of requests. and with object storage, all of that can get expensive fast (storage + egress + per-request costs).
so i'm trying to figure out what's actually cheapest at scale:
- a vps with a generous traffic allowance (netcup for example has a 2tb/day limit)
- object storage like s3, r2, b2, etc.
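to make the comparison concrete, here's a rough sketch of the arithmetic. every price in it is a made-up placeholder, not a real quote from any provider, and the function and class names are just for illustration; plug in the numbers from the actual pricing pages before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass
class ObjectStoragePricing:
    """Hypothetical price sheet; all values are placeholders, not real quotes."""
    storage_per_gb_month: float   # price per GB stored per month
    egress_per_gb: float          # price per GB downloaded
    per_million_requests: float   # price per 1,000,000 GET requests


def stored_gb_after(months: int, initial_gb: float, growth_gb_per_month: float) -> float:
    """Total data stored after `months` of steady growth."""
    return initial_gb + months * growth_gb_per_month


def object_storage_monthly_cost(p: ObjectStoragePricing, stored_gb: float,
                                egress_gb: float, requests: int) -> float:
    """Monthly bill as storage + egress + per-request charges."""
    return (stored_gb * p.storage_per_gb_month
            + egress_gb * p.egress_per_gb
            + requests / 1_000_000 * p.per_million_requests)


def vps_daily_download_capacity(traffic_limit_gb_per_day: float, dataset_gb: float) -> int:
    """How many full downloads of one dataset fit into a daily traffic limit."""
    return int(traffic_limit_gb_per_day // dataset_gb)


if __name__ == "__main__":
    # Placeholder prices (assumptions, not taken from any provider).
    pricing = ObjectStoragePricing(storage_per_gb_month=0.01,
                                   egress_per_gb=0.01,
                                   per_million_requests=0.40)
    # Upper end of the growth estimate: 50 GB of new data per month for a year.
    stored = stored_gb_after(months=12, initial_gb=0, growth_gb_per_month=50)
    # Egress and request volume here are guesses purely for illustration.
    cost = object_storage_monthly_cost(pricing, stored, egress_gb=1000, requests=5_000_000)
    print(f"stored after 12 months: {stored} GB, monthly cost: {cost:.2f}")
    # A 2 TB/day limit (taken as 2000 GB) against a 10 GB dataset.
    print("full 10 GB downloads per day on the vps:", vps_daily_download_capacity(2000, 10))
```

the vps side is a flat monthly price until the traffic limit is hit, so the interesting question is at what egress and request volume the object storage bill crosses that flat price.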
ideas? what would you go with?


Comments
backblaze with cloudflare in front? I believe egress from backblaze to cloudflare is free
Github
Huggingface
second
Third
Actually, another point: if the data is legal and you want a backup option for people to download from besides Huggingface, I recommend creating a simple telegram server, splitting the data into chunks of 2GB, and using Huggingface as the primary server.
That way a copy still exists even if something happens to Huggingface, which is very unlikely btw unless you break their TOS or similar.
(They offer 50 GB per repository and I think you can create many repositories.)
My point is to always have a backup for this data as well. Given that what you are working on is open data, though, I think Huggingface would work best, so don't worry much about it.
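For the 2GB chunking, a minimal sketch in Python, assuming plain byte-level splitting is good enough. The `split_file` name and the `.partNNN` naming scheme are made up for illustration, nothing Telegram or Huggingface requires.

```python
from pathlib import Path


def split_file(src, chunk_bytes, buffer_bytes=1024 * 1024):
    """Split `src` into numbered parts of at most `chunk_bytes` each.

    Parts are written next to the source as <name>.part000, <name>.part001, ...
    Returns the list of part paths in order.
    """
    src = Path(src)
    parts = []
    index = 0
    with src.open("rb") as f:
        while True:
            part_path = src.with_name(f"{src.name}.part{index:03d}")
            written = 0
            with part_path.open("wb") as out:
                # Copy in small blocks so a 2 GB part never sits in memory at once.
                while written < chunk_bytes:
                    block = f.read(min(buffer_bytes, chunk_bytes - written))
                    if not block:
                        break
                    out.write(block)
                    written += len(block)
            if written == 0:
                # Source was exhausted exactly at a part boundary; drop the empty part.
                part_path.unlink()
                break
            parts.append(part_path)
            index += 1
            if written < chunk_bytes:
                break  # last, shorter part
    return parts
```

Something like `split_file("export.tar", 2 * 1000**3)` would then produce parts of at most 2GB, and concatenating them in order gives back the original file. Publishing a checksum of the full file alongside the parts lets people verify the reassembled download.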
What are you working on if I may ask?
thanks for the input!
github: that's a no-go for us. i already have a project on there with some large files in lfs and constantly run into the monthly traffic limit — cloning/downloading just stops working after a point.
backblaze + cloudflare (or any other cloud): definitely an option. my only concern is that cloudflare doesn't really like heavy download traffic. smaller files will probably get cached and never touch the storage, but the large ones always need to be pulled fresh.
huggingface: going on the shortlist. only worry is that it gets a bit messy and hard to navigate with lots of files and monthly exports. a simple apache-style open directory listing is honestly way more user-friendly
well, first, it's legal data. for backups we already have that covered separately, so telegram as a fallback isn't really needed there. for huggingface, i'm still a bit on the fence. if we're doing monthly exports, that's a new repo every month and it gets messy pretty quickly. not exactly what i'd call user-friendly.
@jazzii is there a particular location you're looking for? e.g. USA, EU, Asia, etc.