Ask HN: Recommendations to host 10TB of data with 100TB+ monthly bandwidth

2023-05-27 09:52:33

Nothing in that video is about scale. Or the difficulty of serving 5TB. It’s about the difficulty of implementing n+1 redundancy with graceful failover inside cloud providers.

User: “I want to serve 5TB.”

Guru: “Throw it in a GKE PV and put nginx in front of it.”

Congratulations, you are already serving 5TB at production scale.
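The answer is glib but basically accurate. As a rough sketch of what it amounts to (all names and sizes below are placeholders, and this assumes the data is already loaded onto the disk):

    # Claim a 5TB persistent disk, mount it into nginx, done.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: files-pvc
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 5Ti
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: file-server
    spec:
      replicas: 1
      selector:
        matchLabels: { app: file-server }
      template:
        metadata:
          labels: { app: file-server }
        spec:
          containers:
            - name: nginx
              image: nginx:stable
              volumeMounts:
                - name: files
                  mountPath: /usr/share/nginx/html  # nginx default docroot
                  readOnly: true
          volumes:
            - name: files
              persistentVolumeClaim:
                claimName: files-pvc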

The interesting thing is that there are also paradoxes of large scale: things that get more difficult with increasing size.

Medium and smaller scale can often be more flexible, because they don't have to incur the pain that nonuniformity brings as scale increases. While they may not be able to get the optimizations or discounts that come with larger, standardized purchases, they can provide personalized services that large scale cannot hope to provide.

Depends on what exactly you want to do with it. Hetzner has very cheap Storage boxes (10TB for $20/month with unlimited traffic) but those are closer to FTP boxes with a 10 connection limit. They are also down semi-regularly for maintenance.

For rock-solid public hosting, Cloudflare is probably a much better bet, but you're also paying 7 times the price. More than a dedicated server to host the files would cost, but you get more on other metrics.

> Hetzner has very cheap Storage boxes (10TB for $20/month with unlimited traffic)

* based on fair use

at 250 TB/mo:

> In order to continue hosting your servers with us, the traffic use will need to be drastically reduced. Please check your servers and confirm what is using so much traffic, making sure it is nothing abusive, and then find ways of reducing it.

I’d suggest looking into “seedboxes” which are intended for torrenting.

I suspect the storage will be a bigger concern.

Seedhost.eu has dedicated boxes with 8TB storage and 100TB bandwidth for €30/month. Perhaps you could have that and a lower-spec one to make up the space.

Prices are negotiable so you can always see if they can meet your needs for cheaper than two separate boxes.

> I’d suggest looking into “seedboxes” which are intended for torrenting.

Though be aware that many (most?) seedbox arrangements have no redundancy; in fact some are running off RAID0 arrays or similar. If the host has a problem like a dead drive, bang goes your data. Some are very open about this (after all, for the main use case cheap space is worth the risk), some far less so…

Of course, if the data is well backed up elsewhere, or otherwise easy to reproduce or reobtain, this may not be a massive issue, and you've just got restore time to worry about (unless one of your backups can quickly be made primary, so that restoring is as little as a bit of DNS and other configuration work).

It’s impossible to answer this question without more information. What is the use profile of your system? How many clients, how often, what’s the burst rate, what kind of reliability do you need? These all change the answer.

If we are talking about serving files publicly I’d go with the €40 server for flexibility (the storage boxes are kind of limited), but still get a €20 Storage Box to have a backup of the data. Then add more servers as bandwidth and redundancy requires.

But if splitting your traffic across multiple servers is possible, you can also get the €20 Storage Box and put a couple of Hetzner Cloud servers with a caching reverse proxy in front (that's like 10 lines of Nginx config). The cheapest Hetzner Cloud option is the CAX11 with 4GB RAM, 40GB SSD and 20TB traffic for €3.79. Six of those plus the Storage Box gives you the traffic you need, lots of bandwidth for usage peaks, SSD cache for frequently requested files, and easily upgradable storage in the Storage Box, all for around €43. It also scales well at €3.79 for every additional 20TB of traffic, or €1/TB if you forget and pay the excess-traffic fees instead.
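For illustration, that caching reverse proxy really is roughly ten lines of nginx. A minimal sketch (the origin hostname, cache size, and paths are placeholders), placed inside the http block:

    proxy_cache_path /var/cache/nginx keys_zone=files:50m
                     max_size=30g inactive=7d use_temp_path=off;

    server {
        listen 80;

        location / {
            proxy_cache files;
            proxy_cache_valid 200 7d;                   # keep hits for a week
            proxy_cache_use_stale error timeout updating;
            proxy_pass https://yourstoragebox.example;  # placeholder origin
        }
    }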

You will be babysitting this more than the $150/month Cloudflare solution, but even if you factor in the cost of your time you should come out ahead.

> even if you factor in the cost of your time you should come out ahead

There is always the hidden cost of not spending time on activities that are core to your business (if this is indeed for a business) that would make multiples of the money CF costs you.

1 Gbit/s is roughly 300 TB a month (10^9 bit/s × ~2.6 million seconds ÷ 8 bits per byte ≈ 324 TB); 10 Gbit/s is roughly 3,000 TB a month.

There’s always a limit. It might be measured in TB, PB, or EB, and it may or may not be what you determine practical, but it’s there.

To host it for what? A backup? Downloading to a single client? Millions of globally distributed clients uploading and downloading traffic? Bittorrent?

At some point you still need a seed for that 10TB of data with some level of reliability. WebTorrent only solves the monthly bandwidth problem iff you’ve got some high-capacity seeds (your servers or long-term peers).

I helped run a wireless research data archive for a while. We made smaller data sets available via internet download but for the larger data sets we asked people to send us a hard drive to get a copy. Sneakernet can be faster and cheaper than using the internet. Even if you wanted to distribute 10TB of _new_ data every month, mailing hard drives would probably be faster and cheaper, unless all your customers are on Internet2 or unlimited fiber.

If it’s for internal use, I have had good results with Resilio Sync (formerly BitTorrent Sync).

It’s like Dropbox, except peer to peer. So it’s free, limited only by your client-side storage.

The catch is it’s only peer to peer (unless they added a managed option), so at least one other peer must be online for sync to take place.

They don’t really maintain the regular Sync client anymore, only the expensive enterprise Connect option. My wife and I used Resilio Sync for years, but had to migrate away, since it had bugs and issues with newer OS versions that they didn’t care to fix, let alone develop new features.

BuyVM has been around a long time and have a good reputation. I’ve used them on and off for quite a while.

They have very reasonably priced KVM instances with unmetered 1G (10G for long-standing customers) bandwidth that you can attach “storage slabs” up to 10TB ($5 per TB/mo). Doubt you will find better value than this for block storage.

https://buyvm.net/block-storage-slabs/

If price is a concern, you might consider two 10TB hard drives on machines behind two home gigabit internet connections. It’s highly unlikely that both would go down at the same time, unless they were in the same area on the same ISP.

How do you set up load balancing for those two connections?

That is: yourdomain.com -> IP_ISP1, IP_ISP2

Going the other way, from your server -> outside, would indicate some sort of bonding setup.

It is not trivial for a home lab.

I use 3 ISPs at home and just keep each network separate (different hardware on each) even though in theory the redundancy would be nice.

Just use two A records for the one DNS name, and let the clients choose.

The other way is to have two names, like dl1 and dl2, and have your download web page offer alternating links, depending on how the downloads are handled.
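As a sketch, the two-A-record variant is just a pair of zone entries (names and addresses below are placeholders):

    ; clients pick one of the two records (round-robin)
    download.example.com.  300  IN  A  203.0.113.10   ; line at house 1
    download.example.com.  300  IN  A  198.51.100.20  ; line at house 2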

You can very rarely do multi-ISP bonding; often not even with multiple lines from the same ISP, unfortunately.

The 100TB was just an example. They don’t want you using more bandwidth than your storage. If you’re storing 500GB, then your monthly bandwidth usage should be less than 500GB.

Wasabi isn’t meant for scenarios where you’re going to be transferring more than you’re storing.

> Wasabi isn’t meant for scenarios where you’re going to be transferring more than you’re storing.

Which is basically a roundabout way of saying, they’re offering storage for backups, not for content distribution.

And they just added TCP client sockets in Workers. We are just one step away from being able to serve literally anything on their amazing platform (listener sockets).

Only client sockets are available. So what you can do is build a worker that receives HTTP requests and then uses TCP sockets to fetch data from wherever, returning it over HTTP somehow.
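A rough sketch of that pattern. The connect() API from cloudflare:sockets is the real one; the backend host/port and the one-line request protocol here are made up for illustration:

    import { connect } from "cloudflare:sockets";

    export default {
      async fetch(request) {
        // Outbound TCP only: open a client socket to some backend.
        // backend.example.com:7000 is a placeholder, as is the
        // newline-delimited query format.
        const socket = connect({ hostname: "backend.example.com", port: 7000 });

        const writer = socket.writable.getWriter();
        const path = new URL(request.url).pathname;
        await writer.write(new TextEncoder().encode(path + "\n"));
        writer.releaseLock();

        // Stream the TCP response straight back out over HTTP.
        return new Response(socket.readable);
      },
    };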

It may depend on the makeup of the data or something. They “requested” one of my prior projects go on the enterprise plan after about 50TB; granted, the overwhelming majority of transfer was for distributing binary executables, so I was in pretty blatant violation of their policy. This was 2015-ish, so the limit could also have gone up over time as bandwidth gets cheaper.

Several huge infrastructure providers offer decent VPS servers and bare metal with free bandwidth for pretty reasonable prices nowadays.

You might want to check out OVH or – as mentioned before – Hetzner.

I would also like to ask everyone for suggestions on deep storage of personal data, media, etc.: 10TB with no need for access except in case of emergency data loss. I’m currently using S3 intelligent tiering.

I like to use rsync.net for backups. You can use something like borg, rsync, or just an sftp/sshfs mount. It’s not as cheap as something like S3 Deep Archive (in terms of storage) but it is pretty convenient. The owner is an absolute machine and frequently visits HN too.

S3 is tough to beat on storage price. Another plus is that the business model is transparent, i.e., you don’t need to worry about the pricing being a teaser rate or something.

Of course the downside is that, if you need to download that 10TB, you’ll be out $900! If you’re worried about recovering specific files only this isn’t as big an issue.

Glacier Deep Archive is exactly what you want for this, that would be something like $11/month ongoing, then about $90/TB in the event of retrieval download. Works well except for tiny (<150KB) files.

Note that there are Glacier and Glacier Deep Archive. The latter is cheaper but has longer minimum storage periods. You can move data into it with a lifecycle rule.
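For reference, a minimal sketch of such a lifecycle rule (the bucket name and prefix are placeholders), which can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket my-backups --lifecycle-configuration file://lifecycle.json`:

    {
      "Rules": [
        {
          "ID": "to-deep-archive",
          "Status": "Enabled",
          "Filter": { "Prefix": "backups/" },
          "Transitions": [
            { "Days": 0, "StorageClass": "DEEP_ARCHIVE" }
          ]
        }
      ]
    }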

I think they’ll only charge me once my monthly statement is large enough to be worth charging. I’m pretty sure I’ve never been charged so far, with my monthly statement being something like €0.02.

Some tens of gigabytes at this point? It’s definitely not a lot. Mostly just some stuff that doesn’t make sense to keep locally but I still want to have a copy in case a disaster strikes.

Surprised no one has said Cloudflare Pages. It might not work depending on your requirements, since there’s a max of 20,000 files of no more than 25 MB each per project. But if you can fit under that, it’s basically free. If your requirements let you break it up by domain, you can split your data across multiple projects too. Latency is amazing as well, since all the data is on their CDN.

Personally, at home, I have ~600 TiB and 2 Gbps without a data cap.

I can’t justify colo unless I can get 10U for $300/month with 2kW of PDU, 1500 kWh, and 1 GbE uncapped.

I used to have a dedicated server there, and what happened to me is that my uploads were fast but my downloads were slow. Looking at an MTR route, it was clear that the route back to me was different (perhaps cheaper?). With Google Drive, for example, I could always max out my Gbit connection. Same with rsync.net.

Also, I know that some cheaper home ISPs cheap out on peering.

Now, this was some time ago, so things might have changed, just as you suggested.

What’s your budget?

Who are you serving it to?

How often does the data change?

Is it read-only?

What are you optimising for: speed, cost, or availability? (pick two)

Unless it’s 100TB/mo of pure HTML/CSS/JS (lol), Cloudflare will demand you be on an enterprise plan long before 100TB/mo. The fine print makes it near useless for any significant volume.

Any idea on how many files/objects, and how often they change?

Also, any idea on the number of users (both average, and peak) you’d expect to be downloading at once?

Does latency of their downloads matter? E.g. do downloads need to start quickly, like a CDN, or is “as long as they work” good enough?

You can definitely do this at home on the cheap. As long as you have a decent internet connection, that is 😉 10TB+ hard disks are not expensive; you can put them in an old enclosure together with a small industrial or NUC PC in your basement.

I currently have 45 WUH721414ALE6L4 drives in a Supermicro SC847E26 JBOD (SAS2 is way cheaper than SAS3) connected to an LSI 9206-16e controller (HCL reasons) via hybrid Mini-SAS2-to-Mini-SAS3 cables. The SAS expanders in the JBOD are also LSI and qualified for the card. The hard drives are also qualified for the SAS expanders.

I tried this using Pine ROCKPro64 boards to possibly install Ceph across 2-5 RAID1 NAS enclosures. The problem is I can’t get any of their dusty Linux forks to recognize the storage controller, so they’re $200 paperweights.

I wrote a SATA HDD “top” utility that brings in data from SMART, mdadm, LVM, XFS, and the Linux SCSI layer. I set monitoring to look for elevated temperature, seek errors, scan errors, reallocation counts, offline reallocation, and probational count.
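For anyone wanting a starting point for something similar, here is a minimal sketch that polls a few of those counters via smartctl. The attribute names vary by drive vendor, and the alert thresholds here are made-up placeholders:

    import subprocess

    # SMART attributes roughly matching the ones named above;
    # thresholds are placeholders, not recommendations.
    WATCHED = {
        "Temperature_Celsius": 50,
        "Seek_Error_Rate": 100,
        "Reallocated_Sector_Ct": 0,
        "Offline_Uncorrectable": 0,
        "Current_Pending_Sector": 0,  # the "probational count"
    }

    def check(device):
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            # Attribute rows: ID# NAME FLAG VALUE WORST THRESH ... RAW_VALUE
            if len(fields) >= 10 and fields[1] in WATCHED:
                raw = int(fields[9])
                if raw > WATCHED[fields[1]]:
                    print(f"{device}: {fields[1]} = {raw} (above threshold)")

    check("/dev/sda")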


I once had a Hetzner dedicated server that held about 1 TB of content and did some terabytes of traffic per month (the record being 1 TB in 24 hours). Hetzner charged me 25€/month for that server, and S3 would’ve been like $90/day at peak traffic.

> If your monthly egress data transfer is less than or equal to your active storage volume, then your storage use case is a good fit for Wasabi’s free egress policy

> If your monthly egress data transfer is greater than your active storage volume, then your storage use case is not a good fit for Wasabi’s free egress policy.

https://wasabi.com/paygo-pricing-faq/

The answer to this question depends entirely on the details of the use case. For example, if we’re talking about an HTTP server where a small number of files are much more popular and accessed significantly more frequently than the rest, you can get a bunch of cheap VPSes with low storage/specs but lots of cheap bandwidth to use as cache servers, significantly reducing the bandwidth usage on your backend.

The OP’s request would benefit from details, but the solution depends on what format the data is in and how it is to be shared.

Assuming the simplest need is making files available:

1) Sync.com provides unlimited hosting, and you can share files from it.

Sync is a decent Dropbox replacement with a few more bells and whistles.

2) Backblaze B2 lets you deliver files for free via their CDN partners: $5/TB per month for storage plus free egress via the CDN.

https://www.backblaze.com/b2/solutions/developers.html

Backblaze appears to be 70-80% cheaper than S3, as it claims.

Conventional best-practice cloud paths are optimized to be a best practice for generating revenue for the cloud provider.

Fortunately, you are hardly ever alone or the first to have a need.

S3 is probably the highest quality. It’s enterprise grade: fast, secure, with a lot of tiers and controls.

If you only recover small amounts of data, it’s also not expensive. The only problem is if you recover large amounts of data. That would be a major problem.

10TB storage + 100TB bandwidth on S3 will easily be 1,000+ USD per month, while there are solutions out there that are fast and secure with unrestricted bandwidth for less than 100 USD per month. An order of magnitude cheaper with the same grade of “enterprisey”.

Well, as I said: if you store small data. For large data, sure, prohibitively expensive!

I don’t think many other solutions are equally fast and secure.

AWS’s operation is pretty transparent, documented, audited, and used by governments. You can lock it down heavily with IAM and a customer-managed KMS key, and audit the repository. The physical security is also pretty tight, and there is location redundancy.

Even Hetzner doesn’t have proper redundancy in place. Other major providers in France have burned down (apparently with data loss) or had security problems with hard drives stolen in transport.

I don’t work for AWS, don’t have much data in there, just saying. GCP and Azure are probably also good.

>Well, I said, if you store small data.

Well, the OP said he would be using >100 TB a month.

>GCP and Azure are probably also good.

They similarly charge ~100x for bandwidth. No, they are not a good option either.

Sounds like you could find someone with a 1Gbps symmetric fiber net connection, and pay them for it and colo. I have 1Gbps and push that bandwidth every month. You know, for yar har har.

And that 100TB/month works out to only 309 Mbit/s (or 39 MB/s) on average.

And in a used, refurbished server you can easily get loads of RAM, cores out the wazoo, and dozens of TBs for under $1,000. You’ll need a rack, router, switch, and battery backup. Shouldn’t cost much more than $2,000 for all of this.

You could do this for about $1k/mo with Linode and Wasabi.

For FastComments we store assets in Wasabi and have services in Linode that act as an in-memory+on disk LRU cache.

We have terabytes of data but only pay $6/mo for Wasabi, because the cache hit ratio is high and Wasabi doesn’t charge for egress until your egress is more than your storage or something like that.

The rest of the cost is egress on Linode.

The nice thing about this is we get lots of storage and downloads are fairly fast – most assets are served from memory in userspace.

Following the thread to look for even cheaper options without using Cloudflare lol

> You could do this for about $1k/mo with Linode and Wasabi.

This is still crazy expensive. Cloud providers have really warped people’s expectations.

Well, for us it’s actually really cheap because we really just want the compute. The bandwidth is just a bonus.

Actually, since the Akamai acquisition it would be even cheaper.

$800/mo to serve 100TB with fairly high bandwidth and low latency from cold storage is a good deal IMO. I know companies paying millions a year to serve less than a third of that through AWS when you include compute, DB, and storage.

Fine, but now you’re changing the comparison. Spending millions on compute with low bandwidth requirements doesn’t make it stupid. It probably still is, but that’s a different conversation.

You could do it through Interserver for $495/mo (5 × 20TB SATA disks, 150TB of free bandwidth). 10Gbps link. 128GB RAM for page cache.

Backups probably wouldn’t be much more.
