
NVMe Speed Issue :: CentOS vs Windows

Mahfuz_SS_EHL Host Rep, Veteran

Hello,

I'm seeing an NVMe speed difference between CentOS and Windows. On Windows it pulls ~3.5 GB/s read and ~2 GB/s write.

But on CentOS 7/CentOS 8 it pulls only 1.2-1.3 GB/s write.

[root@localhost ~]# dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 0.997785 s, 1.1 GB/s

The configuration is: AMD Ryzen 3700X + Asrock Rack X470D4U + 32 GB RAM + NVMe adapter (PCIe x16).

Can anyone shed some light on whether I'm making any mistake?

Regards.

Comments

  • How about Debian 11?

  • Or maybe use yabs.sh/fio as standard disk benchmark software.

    You must know that different software gives different results, as the methods of measurement differ.
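
    For reference, a minimal way to run it is shown below; this is just a sketch assuming curl is available, using the yabs.sh URL from this comment (the script runs its fio disk, iperf3 network and Geekbench tests by default):

    curl -sL yabs.sh | bash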

  • Filesystem?

  • skorupion Member, Host Rep

    Different size of bytes

  • tetech Member
    edited October 2021

    @skorupion said:
    Different size of bytes

    Uh... sort of like different weights of 1 kg?

  • @chocolateshirt said:
    Or maybe use yabs.sh/fio as standard disk benchmark software.

    You must know that different software gives different results, as the methods of measurement differ.

    here is an idea. do the dd or yabs on windows too, using wsl.
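
    A rough sketch of that idea, run from a Windows prompt (assumes a WSL distro is already installed; the target path is only an example, and writes to /mnt/c go through WSL's Windows-filesystem layer, so the number may not reflect raw NVMe speed):

    wsl dd if=/dev/zero of=/mnt/c/ddtest.bin bs=1M count=1k conv=fdatasync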

  • Mahfuz_SS_EHL Host Rep, Veteran

    @tetech said:
    Filesystem?

    NTFS in Windows, XFS in CentOS

  • Mahfuz_SS_EHL Host Rep, Veteran

    @skorupion said:
    Different size of bytes

    Probably not. It's the same.

  • coolice Member
    edited October 2021

    CentOS 7/8 has kernels too old to see the full potential of your Ryzen, and maybe of the NVMe too (all drivers are in the kernel). Try Ubuntu 20.04.3 with the HWE kernel (5.11) or Proxmox.

    https://askubuntu.com/questions/248914/what-is-hardware-enablement-hwe

    Brand new hardware devices are released to the public always more frequently. And we want such hardware to be always working on Ubuntu, even if it has been released after an Ubuntu release. Six months (the time it takes for a new Ubuntu release to be made) is a very long period in the IT field. Hardware Enablement (HWE) is about that: catching up with the newest hardware technologies.
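
    A minimal sketch of checking the running kernel and pulling in the HWE stack on Ubuntu 20.04 (the package name is the one documented by Ubuntu for 20.04; adjust for other releases):

    uname -r
    sudo apt install --install-recommends linux-generic-hwe-20.04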

  • Mahfuz_SS_EHL Host Rep, Veteran

    @yokowasis said:

    @chocolateshirt said:
    Or maybe use yabs.sh/fio as standard disk benchmark software.

    You must know that different software gives different results, as the methods of measurement differ.

    here is an idea. do the dd or yabs on windows too, using wsl.

    Yabs provides 500+700 MB, so around 1.2GB/s. I'll post the benchmark as soon as I'm free.

  • edited October 2021

    @Mahfuz_SS_EHL said:

    @yokowasis said:

    @chocolateshirt said:
    Or maybe use yabs.sh/fio as standard disk benchmark software.

    You must know that different software gives different results, as the methods of measurement differ.

    here is an idea. do the dd or yabs on windows too, using wsl.

    Yabs provides 500+700 MB, so around 1.2GB/s. I'll post the benchmark as soon as I'm free.

    Which one? 4k? 64k? 512k? Or 1m?

  • Try some other OS like Debian 11 or Ubuntu 20.04/21.04 - CentOS has an old garbage 4.x kernel: https://repology.org/project/linux/versions

  • Try ext4 rather than XFS.

  • @tetech said:

    @skorupion said:
    Different size of bytes

    Uh... sort of like different weights of 1 kg?

    while I get the joke, he is right ;-)

    testing 1M blocksize in windows vs 64k in linux obviously leads to different results, if you keep in mind that bandwidth is the result of iops*blocksize ...
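
    To make that concrete with illustrative (made-up) numbers:

    20,000 IOPS x 64 KiB ≈ 1.2 GiB/s
     3,500 IOPS x  1 MiB ≈ 3.4 GiB/s

    Same drive, different blocksize, very different "sequential speed".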

  • Mahfuz_SS_EHL Host Rep, Veteran
    edited October 2021

    @Falzo said:

    @tetech said:

    @skorupion said:
    Different size of bytes

    Uh... sort of like different weights of 1 kg?

    while I get the joke, he is right ;-)

    testing 1M blocksize in windows vs 64k in linux obviously leads to different results, if you keep in mind that bandwidth is the result of iops*blocksize ...

    So, what will be the bs & count size if I want to match it to Windows? I thought he was talking about the allocation unit of the partition.

  • Falzo Member
    edited October 2021

    @Mahfuz_SS_EHL said:

    @Falzo said:

    @tetech said:

    @skorupion said:
    Different size of bytes

    Uh... sort of like different weights of 1 kg?

    while I get the joke, he is right ;-)

    testing 1M blocksize in windows vs 64k in linux obviously leads to different results, if you keep in mind that bandwidth is the result of iops*blocksize ...

    So, what will be the bs & count size if I want to match it to Windows? I thought he was talking about the allocation unit of the partition.

    I haven't used CrystalDiskMark in a while, but from your screenshot it seems you are using a test file of 1 GB and the blocksize for the first sequential test is 1MB

    dd if=/dev/zero of=test bs=1M count=1k conv=fdatasync

    should be closest to that (1M blocksize and a 1GB testfile)
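
    If page-cache effects are a concern, a variant worth trying (a sketch, assuming the filesystem supports O_DIRECT, which XFS does) is:

    dd if=/dev/zero of=test bs=1M count=1k oflag=direct

    oflag=direct bypasses the page cache entirely, so conv=fdatasync isn't needed for the throughput figure.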

  • Tr33n Member
    edited October 2021

    What Falzo says is correct, you are testing with different blocksizes. Apart from that, I also observed that dd returns worse results for "small" files (1 GB is small in that case) with a fast storage backend, also under CentOS 7.

    I invested some time to find the cause of this, but could not figure it out, and I can't remember the exact details either. However, I believe that there is a short delay when dd is started where some preliminary setup is done, which is counted as runtime. With storage as fast as NVMe and such small test files, this has a big effect on the result.

    If you let dd write a larger file (with 1M blocksize) you should get a similar result as under Windows. By larger file I mean e.g. 10 GB.
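
    A minimal sketch of that suggestion (the file name is arbitrary; this writes ~10 GiB, so make sure the space is there and remove the file afterwards):

    dd if=/dev/zero of=bigtest bs=1M count=10k conv=fdatasync
    rm bigtest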

  • @Tr33n said:
    What Falzo says is correct, you are testing with different blocksizes. Apart from that, I also observed that dd returns worse results for "small" files (1 GB is small in that case) with a fast storage backend, also under CentOS 7.

    I invested some time to find the cause of this, but could not figure it out, and I can't remember the exact details either. However, I believe that there is a short delay when dd is started where some preliminary setup is done, which is counted as runtime. With storage as fast as NVMe and such small test files, this has a big effect on the result.

    If you let dd write a larger file (with 1M blocksize) you should get a similar result as under Windows. By larger file I mean e.g. 10 GB.

    yes definitely possible, as with small files you are in sub-second territory already, so this is a good suggestion. for comparison it seems reasonable to also use a larger testfile in windows as well. might also help with caching related things, though I haven't checked how crystaldiskmark handles that.

    always difficult to compare different benchmarks anyway. maybe use fio under centos instead of dd, as I have found it more reliable or at least more 'tunable' to replicate the same settings somehow.

  • Mahfuz_SS_EHL Host Rep, Veteran

    @Falzo said:

    @Mahfuz_SS_EHL said:

    @Falzo said:

    @tetech said:

    @skorupion said:
    Different size of bytes

    Uh... sort of like different weights of 1 kg?

    while I get the joke, he is right ;-)

    testing 1M blocksize in windows vs 64k in linux obviously leads to different results, if you keep in mind that bandwidth is the result of iops*blocksize ...

    So, what will be the bs & count size if I want to match it to Windows? I thought he was talking about the allocation unit of the partition.

    I haven't used CrystalDiskMark in a while, but from your screenshot it seems you are using a test file of 1 GB and the blocksize for the first sequential test is 1MB

    dd if=/dev/zero of=test bs=1M count=1k conv=fdatasync

    should be closest to that (1M blocksize and a 1GB testfile)

    bs=1M count=1k pulls the same ~1.2 GB/s. But if I increase the file size, the sequential speed increases too. I have got fio, but the command I found is:

    fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

    I think I need to change the parameters here to match CrystalDiskMark (SEQ1M Q8T1 means a sequential test with a 1 MiB block size, queue depth 8, on 1 thread).
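
    A rough fio equivalent of that CrystalDiskMark run, assuming Q8T1 maps to a single job with iodepth=8 (the job name, file name and 1 GB size are only picked to mirror the CDM defaults; --direct=1 keeps the page cache out of the picture):

    fio --name=seq1m-q8t1 --filename=cdm-test --rw=write --bs=1M --size=1g --ioengine=libaio --iodepth=8 --numjobs=1 --direct=1 --end_fsync=1

    Swap --rw=write for --rw=read to mirror the sequential read line.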

  • Mahfuz_SS_EHL Host Rep, Veteran
    [root@localhost ~]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
    random-write: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=posixaio, iodepth=16
    ...
    fio-3.19
    Starting 16 processes
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    random-write: Laying out IO file (1 file / 256MiB)
    Jobs: 16 (f=16): [w(9),F(1),w(5),F(1)][100.0%][w=1513MiB/s][w=24.2k IOPS][eta 00m:00s]
    random-write: (groupid=0, jobs=1): err= 0: pid=2221: Mon Oct  4 23:05:49 2021
      write: IOPS=1750, BW=109MiB/s (115MB/s)(6656MiB/60840msec); 0 zone resets
        slat (nsec): min=510, max=447420, avg=4725.32, stdev=4749.57
        clat (usec): min=35, max=2210.7k, avg=7554.41, stdev=120393.52
         lat (usec): min=38, max=2210.7k, avg=7559.13, stdev=120393.49
    .
    .
    .
    Run status group 0 (all jobs):
      WRITE: bw=1835MiB/s (1925MB/s), 109MiB/s-119MiB/s (115MB/s-125MB/s), io=109GiB (117GB), run=60020-60959msec
    
    Disk stats (read/write):
      nvme0n1: ios=0/905877, merge=0/3, ticks=0/5909119, in_queue=5909119, util=98.30%
    
  • Falzo Member
    edited October 2021

    yeah you're getting closer.

    with fio obviously --bs=1M would represent the correct blocksize and --size=1GB the overall testfile size. you don't need 16 jobs; with fio they run in parallel, but I think Q8T1 in CDM means it breaks down its test into 8 parts that still run one after another (1 thread), so you'd rather want only one job. yet if CPU power plays a role windows could still handle that differently - I don't know.

    however deep you dive into it, IMHO your takeaway should be: do not overengineer it!

    the nvme speeds won't change just because of the os. there might be slight differences depending on filesystem and such, but I'd say these are rather negligible.
    everything you see in the benchmarks that comes across as big difference reflects an artificial combination of different settings, and as said before even caching could play a role...

    also maybe try vpsbench if you want to see linux beat windows anyway.
    ...sorry @jsg , but I simply couldn't resist ;-)

  • Mahfuz_SS_EHL Host Rep, Veteran

    @Falzo said:
    yeah you're getting closer.

    with fio obviously --bs=1M would represent the correct blocksize and --size=1GB the overall testfile size. you don't need 16 jobs; with fio they run in parallel, but I think Q8T1 in CDM means it breaks down its test into 8 parts that still run one after another (1 thread), so you'd rather want only one job. yet if CPU power plays a role windows could still handle that differently - I don't know.

    however deep you dive into it, IMHO your takeaway should be: do not overengineer it!

    the nvme speeds won't change just because of the os. there might be slight differences depending on filesystem and such, but I'd say these are rather negligible.
    everything you see in the benchmarks that comes across as big difference reflects an artificial combination of different settings, and as said before even caching could play a role...

    also maybe try vpsbench if you want to see linux beat windows anyway.
    ...sorry @jsg , but I simply couldn't resist ;-)

    Actually, the motherboard has NVMe slots at PCIe Gen3 x2 & PCIe Gen2 x4. I wanted to use PCIe Gen3 x4; that's why I arranged adapters. But then I got confused by this. Now this is clear to me. Thanks for helping out <3
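
    A quick way to sanity-check what the adapter actually negotiated under Linux (assuming the drive shows up as nvme0; these sysfs attributes are standard for PCI devices):

    lspci -nn | grep -i -e nvme -e 'non-volatile'
    cat /sys/class/nvme/nvme0/device/current_link_speed
    cat /sys/class/nvme/nvme0/device/current_link_width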

  • jsg Member, Resident Benchmarker

    @coolice quoted:
    Brand new hardware devices are released to the public always more frequently. And we want such hardware to be always working on Ubuntu, even if it has been released after an Ubuntu release. Six months (the time it takes for a new Ubuntu release to be made) is a very long period in the IT field. Hardware Enablement (HWE) is about that: catching up with the newest hardware technologies.

    (a) "HWE" is mainly about stuff that actually needs newer or specific driver like graphics cards.
    (b) For NVMe there is a standard driver because NVMe, unlike e.g. graphics cards, is a standard. So, unless one is on a really old kernel like 2.6 there is no need at all to worry about v.5.x vs v.4x

    @Mahfuz_SS_EHL said:
    Yabs provides 500+700 MB, so around 1.2GB/s. I'll post the benchmark as soon as I'm free.

    Makes no sense on diverse levels. For one the testing methods are totally different (e.g. read+write in 1 test vs. read test + write test), plus Windows and Unix/linux are totally different beasts on many levels.

    @Falzo said:

    dd ... conv=fdatasync

    Are you sure that CrystalMark reads/writes sync?
    Anyway, I think you nailed it well by hinting at the diverse differences between benchmarks (not to even talk between Windows and Unix/linux).

    @Mahfuz_SS_EHL said:
    Actually, the motherboard has NVMe slots at PCIe Gen3 x2 & PCIe Gen2 x4. I wanted to use PCIe Gen3 x4; that's why I arranged adapters. But then I got confused by this. Now this is clear to me. Thanks for helping out <3

    G3 x2 is roughly equal to G2 x4 (see the rough per-lane numbers below). Btw, in such a case you should go with G2 x4, because in case the NVMe itself is G2 (unlikely, but it could be) it will at least see - and use - 4 lanes.

    Finally a piece of advice: use whatever OS you'll run in production for benchmarking. Unless you plan to use Windows don't waste time on Windows benchmarks. Also don't be overly concerned about fine details, because you don't know how VM users (and the software they use) will use the system anyway. A database for example has very different needs and ways to operate than say some kind of a file server.
    As a provider you should just care that no test (e.g. random writes) is particularly bad, plus learn valuable information for your node's caching strategy.
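
    For rough context on the G3 x2 vs G2 x4 point above (approximate usable per-lane throughput after encoding overhead; real-world numbers land a bit lower):

    PCIe Gen2: ~500 MB/s per lane -> x4 ≈ 2.0 GB/s
    PCIe Gen3: ~985 MB/s per lane -> x2 ≈ 2.0 GB/s, x4 ≈ 3.9 GB/s

    which lines up with ~3.5 GB/s sequential reads needing a Gen3 x4 link.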
