OVH VPS Cloud Disk I/O Performance

AlbaHost · October 2016

@iwaswrongonce said:

@trvz said:

WebDude said: As a pro tip, I wouldn't recommend OVH for any commercial operations.

Why the hell did you think OVH's fine for production in the first place? It's not.

If OVH isn't production grade, then who is?

They have great prices and hardware too, but their support, especially the abuse department, is horrible. I remember back in 2014-2015 someone filed an abuse complaint with OVH claiming that we were sending spam "through IRC", which isn't possible, and the OVH abuse department requested that we stop all our IRC chat servers on their servers.
2-3 days ago I opened an OVH support ticket and asked them for authorization to run IRC chat servers, and they said it's OK. More about the question:
http://imgur.com/a/1ttox
I then asked a friend of mine to file a fake abuse complaint with them, which you can see here:
http://imgur.com/a/AUcr7 and http://imgur.com/a/vGawU and voila, here you go:
http://pastebin.com/ctFHP800

Both IRC chat servers are hosted with us, so as you can see they didn't even check whether this was real abuse or not; they just requested that we stop it. They don't care whether the so-called "illegal content or IRC abuse/spam" exists or not.
So I replied on the same abuse ticket, explaining that it came from a friend of mine and was just a fake abuse report, and added the pics mentioned above. They have now been silent for 7 days, with no answer from their side.
So that's why it's hard to say that OVH is good for production use.

NetworkPanda · October 2016

We have a cloud instance in SBG for tests and some applications we develop (not hosting related), and the performance over the past 6-7 days has been ridiculous. At random times their storage system stops working completely: tons of processes are added to the processor queue and never executed due to the locked disk I/O, and you see a completely idle system with 0% CPU usage reaching loads of 600.0+.
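
For context on those numbers: on Linux, the load average counts tasks stuck in uninterruptible sleep (state D, typically waiting on disk I/O) as well as runnable ones, which is how a box can sit at 0% CPU with a load of 600. A minimal sketch for counting such tasks, assuming a Linux /proc filesystem; the helper names are made up for illustration, and it only looks at each process's main thread:

```python
import os

def state_from_stat_line(line):
    """Return the one-letter task state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = line[line.rfind(")") + 1:]
    return after_comm.split()[0]

def count_blocked_tasks(proc_root="/proc"):
    """Count processes whose main thread is in uninterruptible sleep ('D')."""
    blocked = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                if state_from_stat_line(fh.read()) == "D":
                    blocked += 1
        except (OSError, IndexError):
            # The process exited between listdir() and the read.
            continue
    return blocked
```

Calling count_blocked_tasks() during one of these stalls should return a number close to the reported load, if blocked I/O really is the cause.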

AlbaHost said: their support, especially the abuse department, is horrible

The best way to ask for support is to send a message to their mailing lists, usually they respond there and check issues within minutes.

AlbaHost · October 2016

@NetworkPanda said:

AlbaHost said: their support, especially the abuse department, is horrible

The best way to ask for support is to send a message to their mailing lists, usually they respond there and check issues within minutes.

If @EtienneM is from ovh, then maybe he might help

WebDude · October 2016

Hi EtienneM,

General user experience is now much better. Yet there are still a number of disk operations which seem to take much longer than expected. I would say that the worst operations should be below 1 second to be OK. Now 1639 samples of 458864 took 1000 ms (1 second) or more. That's 0,00357%. Worst operations are still above 9 seconds, which should be considered a very long time in this context. Luckily the long stalls are now much less frequent than earlier.

I acknowledge that this test is extremely limited and supposed only to give indication of level of service. It's not any kind of absolute measure.

Raw data if anyone cares.

For comparison I've also posted entry level SSD and HDD timings. Now we can conclude that in general the IOPS / storage system is about as fast as consumer grade spinning rust on 4k file creation.
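
The test tool itself isn't posted in this thread, so the following is only a rough sketch of how a 4k file creation timing test of this kind could look. Everything in it is an assumption for illustration: the function names, the fsync after each write, and the 1-second threshold taken from the post above.

```python
import os
import tempfile
import time

def time_small_writes(directory, count=100, size=4096):
    """Create `count` files of `size` bytes and return per-file timings in seconds."""
    payload = os.urandom(size)
    timings = []
    for i in range(count):
        path = os.path.join(directory, f"probe_{i}.bin")
        start = time.perf_counter()
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # force the data to storage, not just the page cache
        timings.append(time.perf_counter() - start)
        os.remove(path)  # cleanup is not included in the timing
    return timings

def summarize(timings, threshold_s=1.0):
    """Count samples at or above the threshold and report the worst one."""
    slow = sum(1 for t in timings if t >= threshold_s)
    return {
        "samples": len(timings),
        "slow": slow,
        "slow_pct": 100.0 * slow / len(timings),
        "worst_s": max(timings),
    }

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(summarize(time_small_writes(tmp, count=50)))
```

Like the original test, this only gives an indication of the level of service; it says nothing about sequential throughput or read latency.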

I think we can live with the current situation. Still hoping to see improvement on the worst times. Cold data read rates seem to be below 20 MB/s on some of the servers and around 30 MB/s on others. Those numbers are still quite low, so there's still much room for improvement. Writes seem to be around 120 MB/s quite constantly, which is good.

I'm curious about stuff like Ceph & NVMe & GlusterFS etc. What's the root cause? Better chunk load balancing? Reducing hot spots? Making "fail timeout shorter", that's the good point with HA, you can fetch data from another drive, if the primary source fails, etc.

Best regards,
WebDude

Falzo · October 2016

@WebDude said:

thanks for posting this follow up, much appreciated!

WebDude · October 2016

Yeah, I just thought about it for a while. There's a blatant fail with the %. The ratio is 0.00357, which of course makes 0.357%, which is naturally 100 times more than I said in the follow-up post. Yet I did include the raw data and total values, which should clearly indicate the % fail to anyone really taking a look at the values.
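
The ratio can be recomputed from the raw counts given in the earlier post (1639 slow samples out of 458864). A quick check, with variable names chosen here for illustration:

```python
slow_samples = 1639      # samples that took 1000 ms or more
total_samples = 458864   # all samples in the run

ratio = slow_samples / total_samples
percent = ratio * 100    # a ratio becomes a percentage by multiplying by 100

print(f"ratio = {ratio:.5f}, percent = {percent:.3f}%")
# prints: ratio = 0.00357, percent = 0.357%
```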

WebDude · October 2016

One more update. Now that the most extreme disk lag and latency is gone, we haven't seen any Windows crashes either. So that makes me think that the educated guess that the operating system crashes were caused by the storage backend seems plausible, or lightly confirmed. It also explains why the request to change the host machine didn't affect the crashing: the reason for the OS crashes was the extremely slow storage system.

EtienneM · October 2016

Hi,

I've updated the travaux task with details.
Feel free to answer here if you want more details or if some things were not clear.

http://travaux.ovh.com/?do=details&id=20490

gustavargas · October 2016

root@web1:~# fio --name=rand-write --ioengine=libaio --iodepth=32 --rw=randwrite --invalidate=1 --bsrange=4k:4k,4k:4k --size=512m --runtime=120 --time_based --do_verify=1 --direct=1 --group_reporting --numjobs=1
rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.1.11
Starting 1 process
rand-write: Laying out IO file(s) (1 file(s) / 512MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/4004KB/0KB /s] [0/1001/0 iops] [eta 00m:00s]
rand-write: (groupid=0, jobs=1): err= 0: pid=10393: Wed Oct 26 18:11:20 2016
  write: io=478376KB, bw=3985.4KB/s, iops=996, runt=120033msec
    slat (usec): min=3, max=48736, avg=31.88, stdev=633.98
    clat (usec): min=113, max=84449, avg=32080.83, stdev=3304.72
     lat (usec): min=121, max=84475, avg=32113.14, stdev=3283.72
    clat percentiles (usec):
     |  1.00th=[30848],  5.00th=[31104], 10.00th=[31360], 20.00th=[31616],
     | 30.00th=[31616], 40.00th=[31872], 50.00th=[31872], 60.00th=[32128],
     | 70.00th=[32384], 80.00th=[32384], 90.00th=[32640], 95.00th=[32640],
     | 99.00th=[34048], 99.50th=[57088], 99.90th=[74240], 99.95th=[77312],
     | 99.99th=[84480]
    bw (KB  /s): min= 3584, max= 4776, per=100.00%, avg=3988.65, stdev=74.17
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.01%
    lat (msec) : 2=0.03%, 4=0.10%, 10=0.13%, 20=0.21%, 50=98.85%
    lat (msec) : 100=0.65%
  cpu          : usr=0.71%, sys=2.44%, ctx=119373, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=119594/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=478376KB, aggrb=3985KB/s, minb=3985KB/s, maxb=3985KB/s, mint=120033msec, maxt=120033msec

Disk stats (read/write):
  vda: ios=1/126279, merge=0/19, ticks=0/3924260, in_queue=3924676, util=99.99%

In my Public Cloud VPS-SSD-1
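
The figures in that output are consistent with each other, which may point to a cap of roughly 1000 IOPS on the volume rather than random slowness (that is a reading of the numbers, not something confirmed in this thread). With a fixed queue depth, Little's law gives average latency ≈ queue depth / IOPS, and bandwidth is just IOPS × block size. A small check using the values from the fio run above:

```python
iodepth = 32     # --iodepth=32 on the fio command line
iops = 996       # "iops=996" in the write summary
block_kb = 4     # --bsrange=4k:4k

# Little's law: requests in flight = throughput * latency
expected_latency_ms = iodepth / iops * 1000
expected_bw_kb_s = iops * block_kb

print(f"{expected_latency_ms:.2f} ms, {expected_bw_kb_s} KB/s")
# prints: 32.13 ms, 3984 KB/s
```

fio reported an average total latency of about 32.11 ms and a bandwidth of about 3985 KB/s, so both match.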

WebDude · October 2016

http://travaux.ovh.com/?do=details&id=20490

Thank you!

Yet it's not great. Here's same test run on idle OVH & UpCloud servers.

OVH

UpCloud

There's some slight difference, especially in latency.

WebDude · October 2016

In my Public Cloud VPS-SSD-1

Ahem, VPS-SSD has nothing to do with VPS Cloud. Yet if local SSDs are used, then that 100 ms latency there looks pretty bad too.

WebDude · October 2016

Systems worked better for a while. But now it seems that during the weekend and now, the performance has started to suck totally again. Several servers crashed during the weekend. Spent hours restarting and looking for crashed systems + performance was totally horrible when systems got rebooted. -> Situation doesn't seem to be actually getting any better.

Now I/O seems to be averaging around 6 MB/s, which is pretty bad. Yes, there aren't those long 0 IOPS stalls anymore, but general performance is still very weak, which seems to be bad enough to cause systems to crash. So I assume it's much worse periodically.

A Raspberry Pi with any random class 10 SD card would do much better.

WebDude · November 2016

Today systems have been crashing at GRA and SBG; disk I/O is under 1 MB/s on random 4K ops. Ouch! This situation is getting much worse and unbearable again.

At times max I/O is barely above 0.1 MB/s for dozens of seconds even if storage system is maxed out. As said, when critical locked sections get caught in this kind of trap, it brings everything down.
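
To put those throughput figures in terms of operations per second, here is a small conversion, assuming 1 MB = 1,000,000 bytes and 4K = 4096 bytes (the function name is just for illustration):

```python
def mb_per_s_to_iops(mb_per_s, block_bytes=4096):
    """Convert a throughput in MB/s into operations per second for a given block size."""
    return mb_per_s * 1_000_000 / block_bytes

print(round(mb_per_s_to_iops(1.0)))     # about 244 IOPS at 1 MB/s
print(round(mb_per_s_to_iops(0.1), 1))  # about 24.4 IOPS at 0.1 MB/s
```

So 0.1 MB/s on 4K operations means only a couple of dozen I/Os per second for the whole system.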

vimalware · November 2016

Tragedy of the commons : storage edition.

weblinks · November 2016

I had the same VPS Cloud disk I/O performance issue on my VPS Cloud 3 in Gravelines (FR) from the end of September to the start of October 2016. I was in contact with their support through a ticket for many days and provided all details and proof, and their last reply, on 5 October 2016, was:

First of all, let me apologize if your issue is taking too long to have an answer. Our specialist are still analyzing the information sent by you, we will notify you again through this ticket as soon as we have any update concerning it.

I discontinued that VPS, then ordered a new Cloud RAM 3 VPS in Beauharnois (CA) and moved there. It seems OK, but overall the VPS Cloud RAM 3 performance is not as good as I was expecting. Two days ago the load went high, to around 200+, on both of my VPS Cloud 3 instances at the same time, and after investigating it turned out to be the same I/O issue. So this disk I/O issue can take your VPS down at any time.

Overall I am not satisfied with their VPS Cloud series (Ceph storage issue).

WebDude · November 2016

I can confirm what WebLinks says... Yep, Storage I/O can get so darn poor, it basically kills everything. I'm seeing again situation where performance is very seriously degrading at SBG.

Just opening PowerShell and getting a directory listing took around 3 minutes. There's no other load on the server and disk I/O is maxed out all the time. - Absolutely devastating performance. Basically the system is totally 'dead' if something needs to be read from disk. Only RAM-based stuff works.

It's good to keep in mind, that getting directory listing or reading small configuration file can take several minutes. Just like logging anything, can also take minutes. - Unfortunately most of 'normal' applications won't really work well with this kind of extremely low performance environment.

OVH Control Panel is broken, so new servers can't be accessed, etc. - Oh joy.

Now I can truly understand people who have something bad to say about OVH.

WebDude · November 2016

Finally the servers, at least in SBG, are performing OK again. Of course the performance isn't stellar, but it's what I would expect us to get for the price we pay. Most importantly, I/O latency is under control and systems feel responsive again. Getting an ls of the home directory won't take tens of seconds anymore, and launching applications or booting the system won't take several minutes.

Thanks

Just hoping that the extreme performance degradation won't repeat.

wokenwoll · November 2016

Hello guys,

some of you wondered if OVH will do some Black Friday Deals for VPS and dedicated servers.

So yes, we will: all VPS, all regions, all countries. For dedicated servers, some refs only.

Not sure if i have the right to promote my company here in this forum, so let me find out first if i can detail the offers.

AlbaHost · November 2016

@wokenwoll said:
Hello guys,

some of you wondered if OVH will do some Black Friday Deals for VPS and dedicated servers.

So yes, we will: all VPS, all regions, all countries. For dedicated servers, some refs only.

Not sure if i have the right to promote my company here in this forum, so let me find out first if i can detail the offers.

Another OVH guy? If yes then that's great because i get faster answers from @EtienneM here instead of ovh support tickets.

EDIT: Welcome to LET.

wokenwoll · November 2016

i asked him to join in fact so yes, if you have questions related to our product don't hesitate

WebDude · December 2016

Yet again the problems are back. Some of the systems are extremely slow, and storage I/O has basically dropped to zero, crashing systems and making reboots take hours.

This is worse than running systems from the cheapest possible Class 2 SD card with an RPi, because that's not slow at all. Maybe more like using a good old 1x CD-ROM drive.

WebDude · July 2017

Just letting you know, that once again three servers crashed due to storage subsystem problems. It seems that this is more like a feature than a bug. It works just as well as it has been designed to work. (Which is of course not very well.)

GamerTech24 · July 2017

I have a VPS Cloud 3 in BHS I've literally had since October 2015 (it's been through the fiber cut and power failure back in 2015, apart from that 2016 and all of 2017 so far have been flawless, 100% uptime) and been renewing it each month, I'll check, as that is something I have important data and backups on, if that thing goes down I don't know what I'll do

GamerTech24 · July 2017

Current list of OVH Employees currently registered on this forum

@wokenwoll

@EtienneM

@MaikoB - OVH APAC Team (Australia, etc)

correct me if I'm missing anyone

Who knows, with the Virginia DC and the other one (can't remember the state) being built and then probably hiring I might become one too if I get the qualifications needed

WebDude · July 2017

@ethancedrik said:
I have a VPS Cloud 3 in BHS I've literally had since October 2015 (it's been through the fiber cut and power failure back in 2015, apart from that 2016 and all of 2017 so far have been flawless, 100% uptime) and been renewing it each month, I'll check, as that is something I have important data and backups on, if that thing goes down I don't know what I'll do

Don't worry. Those problems are usually quite isolated. I don't really know what kind of 'storage segments' OVH is using. But usually the problems seem to be affecting only certain group of servers. And in this case the servers affected are in SBG DC.

This is just the classic Amazon / Google / Microsoft, whatever, case. They can always claim that everything is working, because the systems are segmented and only a small number of systems is affected at a time. That's why they can A) show all the time that everything is OK and B) have a bunch of tickets open all the time saying that things aren't working.

This particular storage system problem was earlier much worse. Actually since my posts, I've got one more crash due to storage system.

Another thing which makes this particularly annoying is the way Windows works in these situations. RAM-based stuff keeps working, which means that A) socket connections are accepted and the server is 'available' for most low-level monitoring. B) Only stuff which requires access to storage isn't working. C) This is a pretty annoying combination.
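
One way to catch this "sockets answer but storage is dead" state is a health check that actually touches the disk and gives up after a timeout. A minimal sketch, assuming a writable probe directory; the function name and the 5-second default are made up for illustration:

```python
import os
import tempfile
import threading
import time

def storage_probe(directory, timeout_s=5.0):
    """Write and fsync a 4 KiB file; return (ok, elapsed_seconds or None)."""
    result = {}

    def _write():
        start = time.perf_counter()
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, b"x" * 4096)
            os.fsync(fd)  # make sure the write really reaches storage
        finally:
            os.close(fd)
            os.remove(path)
        result["elapsed"] = time.perf_counter() - start

    # A daemon thread is used so that a write stuck on dead storage
    # cannot keep the monitoring process from exiting.
    worker = threading.Thread(target=_write, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if "elapsed" not in result:
        return (False, None)  # timed out or failed: treat storage as unhealthy
    return (True, result["elapsed"])

if __name__ == "__main__":
    print(storage_probe(tempfile.gettempdir()))
```

A monitoring endpoint that reports the result of such a probe would go red during a storage stall even while plain TCP or ping checks stay green.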

But I'm sure this is nothing new for server experts around here.

Btw. other service providers have had, very unfortunately, similar failure modes. I've seen exactly the same failures on several other providers too. It seems that Windows somehow sucks in these situations, and systems running Linux won't end up similarly partially dead. - Basically this means that when the storage subsystem sucks badly enough, all Windows instances require a manual hard reboot.

WebDude · July 2017

It's going on. Yesterday several servers crashed after the post. Just 20 minutes ago one crashed again. I've opened tickets, but nobody bothers to comment. - Typical.

How about just getting your shit together, so we don't need to whine here. - Thanks

As pro-tip. Don't trust single server or service provider ever. Always have independent offsite backup with LONG history. Highly recommended thing.

Btw. Not OVH related, but on other VPS provider, NTFS got totally fsck:ed yesterday. Only full restore from remote backup saved my ass.

Continued, flooding sucks so..

Third server crashed right now... So it's highly likely things are going to suck seriously during the weekend.

Zerpy · July 2017

@WebDude said:
How about just getting your shit together, so we don't need to whine here. - Thanks

You seem to be the only one whining?

EtienneM · August 2017

Hi,

Sorry for delay, I was on holidays.

You can find progress on the issue here http://travaux.ovh.net/?do=details&id=26382

VPS and public cloud don't share the same storage. Neither do the different regions inside the same DC (like GRA1 and GRA3).

Reminder: if you use the new regions, you can run your instances on local RAID SSD. Performance and data integrity at the same time.

jeanluc · October 2017

Hi @WebDude, do you have the latest on I/O perf?

Our company has been thinking about moving from OVH dedicated servers to public cloud, but all the reviews I've seen online are that I/O is abysmal. Is that still the case? Their website claims "near SSD" performance on their disks.

jeanluc · October 2017

@EtienneM We've been looking to move to OVH public cloud and saw that some of the instances have 200GB/400GB local SSD. Do you plan on introducing instances with more local space?

Also, from OVH's perspective, are all the I/O issues mentioned above now resolved?
