OVH VPS Cloud Disk I/O Performance

WebDude · October 2016

It's something horrible. There are pauses up to tens of seconds, where it's flat 0 IOPS. And if you're unlucky you'll hit those bumps one after another. So then you'll get 15 seconds of 100 IOPS and then it drops to 0 for 35 seconds again.

It seems that their Ceph / NVMe storage system for VPS Cloud is completely falling apart at least in Strasbourg - SBG.

Has anyone else been having similar experiences? No, this is not just one box; it's several dozen VPSes ordered at different times, and all of them have totally nightmarish storage I/O data rates. Writes run at 8 MB/s and reads between 0 and 20 MB/s, but the worst part is that it can be 0 MB/s for a minute or so.

I guess the only solution is to move out, helpdesk can't do anything about it. They've even claimed it's normal. But if that's normal, then OVH is absolute no no as a service provider. Issue report has been open on status for a long time. There's no information if this is going to change, or if this is actually the new normal.

mehargags · October 2016

Can you paste a FIO report ?

Falzo · October 2016

is this on the initial disk-space with the VPS or only addon disks?

I've seen this once, around six months back, but only a single addon disk was involved. the problem resolved more or less by itself about two weeks later...

support wasn't helpful but I have to admit I didn't even call. only ticketed in, which was left unanswered.

EtienneM · October 2016

Hi,

I'm working at OVH on Ceph clusters, an update was published on travaux but not translated on status, I translated it here http://travaux.ovh.com/?do=details&id=20490 , sorry for this.

We searched a lot (and are still searching!). It's probably linked to the way Ceph stores the data: some of the disks get way more writes/reads than the others. When we removed a disk that seemed slower than the others, another disk that had been working well was hammered and seemed to become slower too.

We plan two things :

We'll change the configuration to have a better data allocation.
We'll upgrade Ceph to a more recent release

For the first step, it will start at the end of the week (probably tomorrow).
For the second step we are currently doing some tests to ensure the new release works with the rest of our infrastructure. I have no ETA yet for this, but our goal is to deploy this release on every cluster.

I'll update the task (with the translation ) once we have changed the configuration.

Can you please PM me your vpsid? I'll check if you are on the cluster linked to the task. You should have the right amount of IOPS, but sometimes with an increase in the latency of the block storage.

root@vpsxxxxxx:~# fio --name=rand-write --ioengine=libaio --iodepth=32 --rw=randwrite --invalidate=1 --bsrange=4k:4k,4k:4k --size=512m --runtime=120 --time_based --do_verify=1 --direct=1 --group_reporting --numjobs=1
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=0): [w(1)] [100.0% done] [0KB/7892KB/0KB /s] [0/1973/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=27873: Thu Oct 13 16:15:49 2016
  write: io=931932KB, bw=7765.7KB/s, iops=1941, runt=120016msec

Done on a VPS3 (2K iops limit)

Don't hesitate to answer if you have some questions.

Etienne

[edit] Thanks, I was not expecting such a warm welcome

Harambe · October 2016

@EtienneM said:

What is this, invasion of OVH employees day? Glad to have you guys here

Amitz · October 2016

Hey, @EtienneM - Nice to have you here. Welcome on Lowendtalk.com!

FredQc · October 2016

Hey, just done it on my VPS 3, nice to see I'm within the specs

write: io=958668KB, bw=7987.9KB/s, iops=1996, runt=120016msec

WebDude · October 2016

Average is below 300 IOPS on VPS Cloud 2, but there are often complete halts to 0 IOPS for 15 - 25 seconds, repeatedly. As said, that's the real problem, and that's what drives usability to near zero if you're expecting any performance. Of course it's ok if you're in a business which can tolerate those. But interactive services drive users bat shit crazy when something you expect to take less than a second starts taking several minutes.

Confirmed that things still very seriously suck: Fri Oct 14 04:21:43 2016 UTC

But as they said, it's normal. Shared environments like VPS Cloud just provide extremely bad performance, and if you choose to use one, that's the thing you'll accept. Only remedy is to move out. I've been mapping suitable alternatives for two weeks because the current situation is totally unbearable. Luckily the systems have been barely running, even if users are totally mad. So it hasn't forced us to migrate to much more expensive platforms, which naturally would also provide hugely improved performance.

I didn't bother to check all the VPSs because it seems that situation hasn't changed at all.

Just now a 30 second 0 IOPS pause. I'll PM the vps numbers, but as said, they don't care about that, because this is normal, as the helpdesk said. A single I/O can take tens of seconds. Worst recorded time so far for a single write is .... 625000 ms... Think about that. And I'm not kidding. It's nice to notice that saving one config file can take more than 10 minutes. Or writing a single log line. - You won't believe how mad admins get about this when they're troubleshooting performance issues and find out that's the new normal.

Let's try one of the older VPSes, which should be in a different cluster, and see if it's performing as badly or even worse. One of the older VPSes seems to be working better, around 550 IOPS but no pauses. That's the key.

Wow. The older VPS delivers even more interesting results. On very quick check ~5 minutes, several 10 seconds 0 IOPS pauses detected. Sigh.

Luckily this is low end talk, so I assume it's ok. IO just takes a while randomly, and that's what you get.

WebDude · October 2016

To continue with this OVH quality fun. Now one server crashed, and won't reboot. Reboot has taken now over 60 minutes. Actually it might be booting well, it's just "bit slow" and reboot takes a while on OVH. Console / Reboot again options won't work, because reboot is still in progress. -> Unfortunately this is also normal on OVH. On other providers which I'm using similar task will take less than a minute.

Booting to rescue mode took more than 4 hours one day. It was supposed to be a quick boot into rescue mode to run fio, but it caused extensive downtime. Which is unfortunately normal again.

OVH is fine for low cost hobby sites, where a few days of downtime doesn't matter. But anything more serious than that, you're really badly shooting yourself and asking for serious trouble.

Edit: Two hours later, still booting. - Yes, this is the very unfortunate reality.

We're using servers from multiple service providers, and with other than OVH it's highly likely that there's less than one incident / month, but with OVH downtime and issues are daily experience. - Yes, I've factored in the number of servers. So it's not like 1 vs 100 or something similar.

Edit: Three hours later, wow, it's finally up.

EtienneM · October 2016

Hi I'm checking.

EtienneM · October 2016

At 04:21 there was a spike of latency on the cluster, that's the issue we want to resolve.

About the values you give like 550 IOPS, it's the value from a bench or the IOPS generated by the usage?

WebDude · October 2016

Single threaded read only bench (QD1), without other load on the server. But the key is that we're only reading. The fio tests people are using are skewed, because they write and read the same data within a short interval. That doesn't reveal the true random cold data read time because it's very highly likely to hit caches. Also, if there's more than one thread, one thread stalling causes much less havoc on average than with a single thread. - But this shouldn't be news to anyone working with process / task latency matters.

Very non scientific test which will reveal extreme I/O latency very simply is just running ls -R or dir /s on root. If the listing stops for tens of seconds every now and then, it's clear that there's something very wrong with the storage backend, as there is. This is a task which is very light technically on the storage system, yet it can still reveal extremely bad performance, which happens all the time on OVH.
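The same idea can be made a bit less eyeball-based by timestamping each directory as it is listed. This is only a rough sketch in Python, not the test used in this thread: the function name `find_stalls` and the 1 second threshold are assumptions picked for illustration, and a warm metadata cache will hide problems just as it does with `ls -R`.

```python
import os
import time

def find_stalls(root, threshold_s=1.0, clock=time.monotonic):
    """Walk `root` like `ls -R`; return (gap_seconds, directory) for every
    directory whose listing took at least `threshold_s` to arrive."""
    stalls = []
    last = clock()
    # os.walk lists each directory before yielding it, so the gap measured
    # here includes the time spent reading that directory from storage.
    for dirpath, _dirnames, _filenames in os.walk(root, onerror=lambda err: None):
        now = clock()
        gap = now - last
        if gap >= threshold_s:
            stalls.append((gap, dirpath))
        last = now
    return stalls
```

Calling `find_stalls("/")` on an otherwise idle box and printing the result gives a list of where the listing hung and for how long. On healthy storage the list would normally be expected to be empty or close to it.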

The only disks I've seen being slower than OVH lately are WD Scorpio Blue 2.5" SATA drives which are broken and reallocating sectors. Then you can expect the system to freeze or even crash, just like it does on the OVH platform.

EtienneM · October 2016

Yep, benchmarks only show one case and do not reflect the quality for every use case.

There is an issue on SBG that impacts some API calls. I'm waiting for it to be fixed before asking the compute team about your VPSes and why some are stuck.

Francisco · October 2016

Welcome to the hells of Ceph. I've spent a lot of time researching it, especially in pure SSD deployments, and it wasn't pretty, at least from what the results said earlier in the year.

It sounds like the OVH Ceph cluster isn't pure SSD then? tiered maybe?

Francisco

WebDude · October 2016

@EtienneM said:
Yep, benchmarks only show one case and do not reflect the quality for every use case.

Yes, I'm fully aware of that. But that's the way which very clearly shows what the problem is for us. If we run a similar benchmark on other providers, we don't have any problems with the results. As an example, UpCloud constantly got 0.1 ms I/O latency, which allows single threaded QD1 read benchmark values up to 10000 IOPS. To be honest, we don't need a high IOPS count; we need reliable data delivery, at a reasonable data rate, when data is required. I know random I/O is in many cases more demanding than sequential I/O, but in this case the tests we've been doing are sequential I/O, because it has been extremely bad too. - Technically speaking our demands aren't even that high, but the I/O latencies being experienced have been totally insane, as shown over and over again.

Also it's easy to forget that many real world work loads contain synchronized / blocks / locking sections. Which basically mean that other threads can't continue if one thread is stuck inside critical I/O section for extended periods. Like writing to database or file system journal. Running several totally independent threads naturally gives better performance, but unfortunately that's not possible for all workloads.

Reading directly from /dev/vda and logging amount of data being read every second is one way to see I/O performance.
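A rough sketch of that kind of logger in Python, for anyone who wants to try it. Everything here is an assumption made for illustration (the function name, the 1 MiB block size, the 60 second default); it is not the tool used for the numbers in this thread. It also reads through the page cache rather than with O_DIRECT, so it only shows cold-data behaviour on regions that haven't been read recently, and reading `/dev/vda` needs root.

```python
import time

def read_rate_per_second(path, block_size=1 << 20, duration_s=60.0,
                         clock=time.monotonic):
    """Read `path` sequentially; return the bytes completed in each elapsed
    second. A stalled read shows up as one or more 0 entries."""
    rates = []
    bytes_in_window = 0
    with open(path, "rb", buffering=0) as dev:
        start = clock()
        while True:
            chunk = dev.read(block_size)  # blocks for as long as the storage stalls
            elapsed = clock() - start
            # Close every whole second that has passed. If the read above hung
            # for several seconds, the extra windows are logged as 0 bytes.
            while len(rates) < int(elapsed) and len(rates) < int(duration_s):
                rates.append(bytes_in_window)
                bytes_in_window = 0
            if not chunk or elapsed >= duration_s:
                break
            bytes_in_window += len(chunk)
    return rates
```

Something like `read_rate_per_second("/dev/vda")` then returns one number per second; a run of zeros in that list is exactly the kind of pause described above.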

WebDude · October 2016

@Francisco said:
It sounds like the OVH Ceph cluster isn't pure SSD then? tiered maybe?

Clearly tiered. Data read often comes at about 100 MB/s (VPS Cloud 2, IOPS limit?) and data read not so often comes at 10 - 20 MB/s on good day and a lot slower on bad cases. Just as I've described. OVH says NVMe / Ceph, so I assume NVMe is cache system for hot data on servers.

UpCloud uses also tiered storage, but it's 400 MB/s for hot data and around 100 MB/s for cold data (QD1 / Single Thread)

Francisco · October 2016

It looks like I was mixing product lines up. I always thought the Ceph backed storage was pure SSD when it actually sounds like the SSD based plans are probably local storage?

I guess this makes a lot more sense now. I wonder if they used the erasure or whatever deduplication options? From what I read, that completely murders performance for the perk of saving space.

Rebalancing might help but only to an extent. If it really is tiered and you got a couple clients trying to rip 3k IOPS out of spinning rust....

Francisco

WebDude · October 2016

Rebalancing might help but only to an extent. If it really is tiered and you got a couple clients trying to rip 3k IOPS out of spinning rust....

Sure, with tiered storage the cache hit ratio is the key. I do prefer tiered storage usually, because in our use case most of the data is cold. There's only a small hot set, so tiered storage makes perfect sense. The only problem is that the cold data storage is extremely slow with insanely high (at times) latency. - That's the problem. - I don't actually mind if the cold storage data rate is 15 MB/s, it's enough, as long as it won't go to 0 for half a minute.

For most use cases tiered storage makes perfect sense, because the cold data is seldom accessed.

WebDude · October 2016

This is a wonderful way to start Saturday morning at 6 am. Waking up on a call, that server has crashed again. - Then finding out that it can't be connected using console and refuses to reboot using reboot option. - Which means that there will be several hours of down time before someone will attend that issue.

It's bad if this happens once a year. But unfortunately with OVH this is more like weekly issue. On other providers this happens at times, but is extremely rare. Usually when they've been swapping hardware and freezing VM for some period for relocating to another physical host without booting. I'm not sure if OVH does that, or if the same effect is just achieved by extremely bad performance.

At this point based on experience, I know it's going to take around 4 hours before it's running again.

As a pro tip, I wouldn't recommend OVH for any commercial operations. It's ok for free, hobby stuff, where it really doesn't matter if your availability is more like 98%. We can now forget the rest of the nines; a single nine is what OVH is actually achieving.

Edit: added that servers in GRA are also experiencing extreme lag with disk systems. I've got so many servers it's easy to forget where those are. The only servers so far NOT being complained about are in RBX.

mik997 · October 2016

I feel your pain .. I hate to be on the receiving end of complaints about service where you are dependent on some disinterested support desk who don't really seem to have the ability to diagnose and resolve the actual issue ..

Suggest you perhaps look at AWS for your production servers - it's expensive but I've been running small instances on their EU-WEST-1 DC for over 18 months now with no downtime and zero 'scheduled maintenance' events.

trvz · October 2016

WebDude said: As a pro tip, I wouldn't recommend OVH for any commercial operations.

Why the hell did you think OVH's fine for production in the first place? It's not.

WebDude · October 2016

@trvz said:
Why the hell did you think OVH's fine for production in the first place? It's not.

That's a darn good question. Because they advertise cloud servers at a reasonable price. I did expect more trouble when I started with OVH. But actually everything worked pretty well with the first 10 or so Cloud VPSs in RBX. But since then they've been clearly ramping up the resource overselling, which has led to serious performance degradation compared to when we started with OVH.

We were really skeptical about OVH, but (unfortunately?) our first impression wasn't that bad.

Also some of the servers are working fine, and others are crashing and experiencing extreme I/O stalls, as told in this thread.

Their site also specifically says that the Cloud series is for professional production use, with high SLA and reliable services. They also claim that the system provides low latency (I think I'm soiling myself while laughing), and of course high availability and reliable disk storage. - Which all seem to be utter lies based on my experience.

They claim there hasn't been Ceph data loss. But I'm not sure if the things got screwed up due to failed disk writes or what. But we've been experiencing database and file system corruption on OVH Cloud servers. It's of course hard to say what's the true base reason for this. But my strong guess is that, it probably has something to do with those extreme disk latencies. As well as the crashing is also probably related to it, unless there are some other serious host issues. The server which has been crashing almost weekly is at GRA, VPS Cloud 2 instance.

M66B · October 2016

VPS 2016 SSD 2 BHS

fio --name=rand-write --ioengine=libaio --iodepth=32 --rw=randwrite --invalidate=1 --bsrange=4k:4k,4k:4k --size=512m --runtime=120 --time_based --do_verify=1 --direct=1 --group_reporting --numjobs=1
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
rand-write: Laying out IO file(s) (1 file(s) / 512MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/5998KB/0KB /s] [0/1499/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=19286: Sat Oct 15 17:30:38 2016
  write: io=717256KB, bw=5975.8KB/s, iops=1493, runt=120029msec
    slat (usec): min=1, max=53548, avg=21.96, stdev=282.16
    clat (usec): min=132, max=84330, avg=21392.06, stdev=2541.87
     lat (usec): min=139, max=84341, avg=21414.55, stdev=2544.97
    clat percentiles (usec):
     |  1.00th=[15296],  5.00th=[20352], 10.00th=[20608], 20.00th=[20864],
     | 30.00th=[21120], 40.00th=[21120], 50.00th=[21376], 60.00th=[21376],
     | 70.00th=[21632], 80.00th=[21888], 90.00th=[22144], 95.00th=[22400],
     | 99.00th=[29056], 99.50th=[34560], 99.90th=[51968], 99.95th=[63232],
     | 99.99th=[76288]
    bw (KB  /s): min= 4424, max= 7144, per=100.00%, avg=5980.85, stdev=178.55
    lat (usec) : 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.03%, 4=0.10%, 10=0.26%, 20=3.68%, 50=95.79%
    lat (msec) : 100=0.11%
  cpu          : usr=1.75%, sys=3.46%, ctx=148311, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=179314/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=717256KB, aggrb=5975KB/s, minb=5975KB/s, maxb=5975KB/s, mint=120029msec, maxt=120029msec

Disk stats (read/write):
  vda: ios=248/180984, merge=0/149, ticks=604/3887240, in_queue=3883268, util=100.00%
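A quick sanity check on how to read output like this: at a fixed queue depth, IOPS and mean latency are tied together by Little's law (IOPS is roughly queue depth divided by mean latency). The short Python sketch below applies that to the iodepth=32 and avg lat=21414.55 usec reported in this run, and to the 0.1 ms QD1 figure quoted for UpCloud earlier in the thread. The function name is made up for illustration.

```python
def expected_iops(queue_depth, avg_latency_us):
    """Little's law: operations in flight divided by mean time per operation."""
    return queue_depth / (avg_latency_us / 1_000_000)

# This fio run: iodepth=32, avg lat=21414.55 usec (fio reported iops=1493).
print(round(expected_iops(32, 21414.55)))

# QD1 at 0.1 ms per read, the UpCloud figure quoted earlier in the thread.
print(round(expected_iops(1, 100)))
```

The first value comes out at about 1494, matching the reported 1493, and the second at 10000. An average like this says nothing about stalls, though: at QD1 a single request that hangs for 30 seconds means zero completed operations for those 30 seconds, whatever the long-run average looks like.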

Falzo · October 2016

@WebDude said:

so why didn't you move out already?

I probably would have tried to spin up new instances in other locations of theirs to replace the crashed ones in the meantime. or changed providers altogether.

complaining for a week at LET about the same stuff most probably doesn't change anything. sorry to say but imho as a tech/admin/reseller/whatever you are not significantly better than OVH itself, if you are not able to work around those problems yourself other than crying out loud.

WebDude · October 2016

@Falzo said:

@WebDude said:

so why didn't you move out already?

We're doing that right now. We've got several service providers being tested. Some are as cheap as OVH and others are 2-4x more expensive. This will at least mean that as soon as we're ready with new service providers we won't be adding more OVH servers. And if anything needs to be done for a server, it'll be moved out. We currently have servers at five different service providers and at 10 or more physical locations. Some dedicated but mostly VPS.

Running a full comparison between service providers, testing them out and checking all pricing aspects etc. always requires quite a lot of work. As does getting setup processes, automation, invoicing, administrative tasks and all that stuff straight. That's why it hasn't been instant. Problems with OVH are bad, but not so bad that we would do an instant transfer of all hosts with the extra work & costs.

Another reason is that we're running our systems on Windows (sigh), which seriously limits number of service providers available, especially if looking for reasonable pricing.

We've been very happy with UpCloud.com, but Windows Data Center Edition licenses are pretty expensive on monthly basis. Which actually costs usually more than the server resources / month. That's the pain point with them.

If panic strikes, we can easily and quickly evacuate directly to UC Frankfurt, no problem.

All high value customers are hosted on UC.

WebDude · October 2016

It's also a bad idea that OVH recommends running for 120 seconds. That's way too short a time. Here's a test which was run for 16 hours. And it clearly reveals what OVH isn't telling. The test run collected 358284 data points of 4k file creation time. At best this time was 11 ms. Which is of course ok.

But here's the part where shit hits the fan. In 3919 tests the latency was 1000 ms or more, but that doesn't even yet suck extremely bad. Here's the bomb and the whole point of this thread. 245 operations took more than 10000 ms aka 10 fucking seconds.

And 15 of the samples took more than 20 fucking seconds. This is quite a long time for a single I/O operation to complete and, as said, totally ruins any performance and user experience.

Worst time is 33 seconds in that set.

This is a server where all other processes have been removed, because users got totally fed up with extremely bad response times.

Full data set is here, if anyone cares: http://www.pastefs.com/pid/6857
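For anyone who wants to repeat this kind of long-running probe, here is a rough Python sketch of one way it could be done. It is not the tool that produced the data set above: the function names, the probe file name and the use of fsync are all assumptions made for illustration. The thresholds mirror the 1000 / 10000 / 20000 ms buckets mentioned above, counted as "at or above" for simplicity.

```python
import os
import time

def time_4k_create(directory, clock=time.monotonic):
    """Create, write and fsync one 4 KiB file; return elapsed milliseconds."""
    path = os.path.join(directory, "latency_probe.tmp")  # hypothetical file name
    start = clock()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"\0" * 4096)
        os.fsync(fd)  # force the write down to storage before stopping the timer
    finally:
        os.close(fd)
    elapsed_ms = (clock() - start) * 1000.0
    os.remove(path)
    return elapsed_ms

def summarize(samples_ms, thresholds_ms=(1000, 10000, 20000)):
    """Return best, worst and the count of samples at or above each threshold."""
    return {
        "count": len(samples_ms),
        "best_ms": min(samples_ms),
        "worst_ms": max(samples_ms),
        "at_or_above": {t: sum(1 for s in samples_ms if s >= t)
                        for t in thresholds_ms},
    }
```

Calling `time_4k_create` in a loop for several hours, collecting the results in a list and passing that list to `summarize` gives the same kind of best / worst / slow-sample counts as quoted above. The long tail is the whole point, so the run has to be long enough to catch it.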

Francisco · October 2016

Let me know if some slices might interest you. They aren't Ceph or any other buzz words but they perform excellent, dedicated resources on most of the plans, well priced, and include Windows.

Francisco

EtienneM · October 2016

Hi, we have started our operations to improve storage. As I told you, the issue is that some requests take a large amount of time.

We are still working on it and have some new things to dig.

WebDude · October 2016

@EtienneM said:
Hi, we have started our operations to improve storage. As I told you, the issue is that some requests take a large amount of time.

We are still working on it and have some new things to dig.

Yes, we've been generally happy with OVH. Except for this extreme latency issue. Historically there have also been two other kinds of issues.

Loss of network connectivity, which is probably somehow related to the Virtual Redhat Network driver in Windows. Restarting Windows won't fix the problem, but restarting the VM from the control panel usually does. (Not always, but usually).

And the second major problem, which happens more often, is just the box freezing up. So technically systems and processes are running. But they won't actually work. I suspect this has been caused by storage issues. But I don't have any facts for that. Because this is a problem which would require running local / remote RAM based analysis tools and I haven't bothered with it. - I just hope this problem will get automatically resolved when the storage issues are fixed. - If anyone has experience of whether this is what happens with Windows when disk latency goes insane, it would be nice to know. Luckily I haven't had to deal much with servers with extremely slow / completely failing storage systems.

As a bit of extra detail, as an example, the remote desktop socket connects, but nothing happens after that. Similarly the database server's socket is available, but no data comes through. I suspect the RAM based daemon works, but as soon as anything needs to be logged, the process fails.

Thanks

WebDude · October 2016

@Francisco said:
Let me know if some slices might interest you. They aren't Ceph or any other buzz words but they perform excellent, dedicated resources on most of the plans, well priced, and include Windows.

Drop me a line, I'll pm my email.

I've got a few other offers already from a few providers. Some local, some reading this thread. Also we've got a private cloud (With GlusterFS) offer with dedicated hardware. But I guess that's not something we really want. In all cases it's SSD / HDD tiered / hybrid storage. As said, we don't need full performance for other than a small set of data. But the cold storage HDD can't be darn slow.

iwaswrongonce · October 2016

@trvz said:

WebDude said: As a pro tip, I wouldn't recommend OVH for any commercial operations.

Why the hell did you think OVH's fine for production in the first place? It's not.

If OVH isn't production grade, then who is?
