Upgrade your OpenSSL on Scaleway ARM: 5x performance gain

rm_ IPv6 Advocate, Veteran
edited November 2017 in General

In case anyone is using those machines, here's what I just found.

Debian Jessie comes with OpenSSL 1.0.1t:

# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 9437607 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 2704353 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 713491 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 1024 size blocks: 181168 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 22743 aes-256-cbc's in 3.00s
OpenSSL 1.0.1t  3 May 2016
built on: Fri Jan 27 00:08:40 2017
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr) 
compiler: gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wa,--noexecstack -Wall
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc      50333.90k    57692.86k    61088.19k    61838.68k    62103.55k

After upgrading that to Debian Stretch's version 1.1.0f:

# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 18554406 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 9779929 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 3375690 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 938046 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 121052 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 60624 aes-256-cbc's in 3.00s
OpenSSL 1.1.0f  25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/aarch64-linux-gnu/engines-1.1\"" 
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      98956.83k   208638.49k   288058.88k   320186.37k   330552.66k   331087.87k

It seems the Cavium ThunderX contains hardware acceleration for AES (similar to AES-NI on x86):

processor  : 0
BogoMIPS    : 200.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0x0a1
CPU revision    : 1

processor   : 1
BogoMIPS    : 200.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0x0a1
CPU revision    : 1

processor   : 2
BogoMIPS    : 200.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0x0a1
CPU revision    : 1

processor   : 3
BogoMIPS    : 200.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0x0a1
CPU revision    : 1

But OpenSSL 1.0 does not support it yet. With it enabled, however, these should make decent VPN machines, or handle HTTPS load much better.

Comments

  • You’re still using Scaleway?

  • rm_ IPv6 Advocate, Veteran

    Well, I never used it in any serious capacity (i.e. for anything other than Tor), and with all the Kimsufis that I have, I don't need one at the moment either.

    Still, it's a solid offer, even more so with this OpenSSL improvement. Now that OVH's VPS SSD has had a price hike, this may well remain the best bang-for-the-buck KVM out there (whether the x86 or ARM variant). If I decide to cancel my KS dedis, I will most likely use a couple of these, either for hosting stuff directly or as reverse proxies.

    Thanked by ValdikSS
  • @kcaj said:
    You’re still using Scaleway?

    You still shilling for Vultr?

  • Neoon Community Contributor, Veteran

    @kcaj said:
    You’re still using Scaleway?

    He gives up a KS4C for a Scaleway.

    Thanked by vimalware
  • @Neoon said:
    He gives up a KS4C for a Scaleway.

    lol where was this?

  • SplitIce Member, Host Rep

    Which instruction set? Does this also apply to SBC ARMs like the Raspberry PI and Friendly ARM NanoPI?

  • @SplitIce said:
    Which instruction set? Does this also apply to SBC ARMs like the Raspberry PI and Friendly ARM NanoPI?

    @rm_ said:
    It seems the Cavium ThunderX contain hardware acceleration for AES (similar to AES-NI on x86):

    https://en.wikipedia.org/wiki/AES_instruction_set

  • SplitIce Member, Host Rep

    Oh it's the actual aes instruction set not an arm one. I need glasses.

    Looks like the NanoPI's H3 also supports it, but not the Raspberry PI.

    Thanked by vimalware
  • Why am I now envisioning VPN-in-a-cigarette box?

  • Amitz said: You still shilling for Vultr?

    Yes. BGP FTW.

  • What's the clock speed of those Cavium cores? Obviously, 200 BogoMIPS isn't correct. I'd have thought they'd be faster than a 2.2 GHz Cortex-A15, which is quite old, so either the clock speed is a bit slower or the Caviums aren't that impressive.

  • rm_ IPv6 Advocate, Veteran
    edited November 2017

    johnklos said: What's the clock speed of those Cavium cores? Obviously, 200 BogoMIPS isn't correct. I'd have thought they'd be faster than a 2.2 GHz Cortex-A15, which is quite old, so either the clock speed is a bit slower or the Caviums aren't that impressive.

    You can see some benchmarks here: https://wiki.neoon.pw/doku.php?id=dedicated_benchmarks

    Also IIRC they got 212 MB/sec in my MD5 CPU benchmark, which is more than e.g. some 1.7 GHz Xeon VPS I had at another provider.

    Thanked by Aidan
  • xyz Member
    edited November 2017

    johnklos said: I'd have thought they'd be faster than a 2.2 GHz Cortex-A15, which is quite old

    The ThunderX cores are relatively weak as the focus is on having a lot of them. The Cortex-A15, on the other hand, is designed as a performance core, so I wouldn't be surprised if the ThunderX is slightly less powerful than an A15 core.

  • rm_ IPv6 Advocate, Veteran
    edited November 2017

    Performance-wise, one weirdness I noticed is that it gets mbw results like this:

    Long uses 8 bytes. Allocating 2*52428800 elements = 838860800 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 10 runs per test.
    0   Method: MEMCPY  Elapsed: 1.05829    MiB: 400.00000  Copy: 377.968 MiB/s
    1   Method: MEMCPY  Elapsed: 1.05849    MiB: 400.00000  Copy: 377.898 MiB/s
    2   Method: MEMCPY  Elapsed: 1.06049    MiB: 400.00000  Copy: 377.186 MiB/s
    3   Method: MEMCPY  Elapsed: 1.05870    MiB: 400.00000  Copy: 377.822 MiB/s
    4   Method: MEMCPY  Elapsed: 1.05840    MiB: 400.00000  Copy: 377.928 MiB/s
    5   Method: MEMCPY  Elapsed: 1.05922    MiB: 400.00000  Copy: 377.635 MiB/s
    6   Method: MEMCPY  Elapsed: 1.05855    MiB: 400.00000  Copy: 377.875 MiB/s
    7   Method: MEMCPY  Elapsed: 1.05962    MiB: 400.00000  Copy: 377.494 MiB/s
    8   Method: MEMCPY  Elapsed: 1.06070    MiB: 400.00000  Copy: 377.108 MiB/s
    9   Method: MEMCPY  Elapsed: 1.06061    MiB: 400.00000  Copy: 377.140 MiB/s
    AVG Method: MEMCPY  Elapsed: 1.05931    MiB: 400.00000  Copy: 377.605 MiB/s
    0   Method: DUMB    Elapsed: 0.95776    MiB: 400.00000  Copy: 417.643 MiB/s
    1   Method: DUMB    Elapsed: 0.95037    MiB: 400.00000  Copy: 420.889 MiB/s
    2   Method: DUMB    Elapsed: 0.95883    MiB: 400.00000  Copy: 417.177 MiB/s
    3   Method: DUMB    Elapsed: 0.95618    MiB: 400.00000  Copy: 418.333 MiB/s
    4   Method: DUMB    Elapsed: 0.95567    MiB: 400.00000  Copy: 418.555 MiB/s
    5   Method: DUMB    Elapsed: 0.95715    MiB: 400.00000  Copy: 417.905 MiB/s
    6   Method: DUMB    Elapsed: 0.95440    MiB: 400.00000  Copy: 419.109 MiB/s
    7   Method: DUMB    Elapsed: 0.94754    MiB: 400.00000  Copy: 422.146 MiB/s
    8   Method: DUMB    Elapsed: 0.95699    MiB: 400.00000  Copy: 417.979 MiB/s
    9   Method: DUMB    Elapsed: 0.95736    MiB: 400.00000  Copy: 417.817 MiB/s
    AVG Method: DUMB    Elapsed: 0.95522    MiB: 400.00000  Copy: 418.750 MiB/s
    0   Method: MCBLOCK Elapsed: 0.10251    MiB: 400.00000  Copy: 3902.134 MiB/s
    1   Method: MCBLOCK Elapsed: 0.10208    MiB: 400.00000  Copy: 3918.303 MiB/s
    2   Method: MCBLOCK Elapsed: 0.10200    MiB: 400.00000  Copy: 3921.569 MiB/s
    3   Method: MCBLOCK Elapsed: 0.10188    MiB: 400.00000  Copy: 3926.149 MiB/s
    4   Method: MCBLOCK Elapsed: 0.10191    MiB: 400.00000  Copy: 3924.955 MiB/s
    5   Method: MCBLOCK Elapsed: 0.10203    MiB: 400.00000  Copy: 3920.339 MiB/s
    6   Method: MCBLOCK Elapsed: 0.10197    MiB: 400.00000  Copy: 3922.684 MiB/s
    7   Method: MCBLOCK Elapsed: 0.10191    MiB: 400.00000  Copy: 3925.109 MiB/s
    8   Method: MCBLOCK Elapsed: 0.10204    MiB: 400.00000  Copy: 3919.878 MiB/s
    9   Method: MCBLOCK Elapsed: 0.10204    MiB: 400.00000  Copy: 3919.878 MiB/s
    AVG Method: MCBLOCK Elapsed: 0.10204    MiB: 400.00000  Copy: 3920.089 MiB/s

    Basically, memory access is "not very fast" unless you use the specialized "copy block" function. A modern x86 system does not have such a distinction; it will show 3-5 GB/sec in this test across the board, no matter which access method.

  • What version of mbw are you using? I found this in a search.

    If this code is accurate, I can't see why MCBLOCK would be any different from a memcpy. If it keeps copying the same block, then you're really just comparing cache speed vs memory speed (and cache is much faster); if the pointers move across the data, then you're essentially doing the same thing as a single memcpy, and I can't see it being any faster (it doesn't use any "specialized function").
    And in fact, I wouldn't be surprised if the "DUMB" method just gets rewritten to a memcpy by an optimizing compiler.

  • rm_ IPv6 Advocate, Veteran

    xyz said: What version of mbw are you using?

    I used version 1.2.2 from Debian Jessie.

  • rm_ said: I used version 1.2.2 from Debian Jessie.

    From the code:

        if(type==1) { /* memcpy test */
            /* timer starts */
            gettimeofday(&starttime, NULL);
            memcpy(b, a, array_bytes);
            /* timer stops */
            gettimeofday(&endtime, NULL);
        } else if(type==2) { /* memcpy block test */
            gettimeofday(&starttime, NULL);
            for(t=0; t<array_bytes; t+=block_size) {
                b=mempcpy(b, a, block_size);
            }
            if(t>array_bytes) {
                b=mempcpy(b, a, t-array_bytes);
            }
            gettimeofday(&endtime, NULL);
    

    Yeah, the code's broken. Essentially the MCBLOCK method is just testing memory write bandwidth, since `a` is likely served from cache, whilst MEMCPY tests read+write (copy) bandwidth.
