Upgrade your OpenSSL on Scaleway ARM: 5x performance gain

rm_ IPv6 Advocate, Veteran
edited November 2017 in General

In case anyone is using those machines, here's what I just found.

Debian Jessie comes with OpenSSL 1.0.1t:

# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 9437607 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 2704353 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 713491 aes-256-cbc's in 2.99s
Doing aes-256-cbc for 3s on 1024 size blocks: 181168 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 22743 aes-256-cbc's in 3.00s
OpenSSL 1.0.1t  3 May 2016
built on: Fri Jan 27 00:08:40 2017
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr) 
compiler: gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wa,--noexecstack -Wall
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-cbc      50333.90k    57692.86k    61088.19k    61838.68k    62103.55k

After upgrading that to Debian Stretch's version 1.1.0f:

# openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 18554406 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 9779929 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 3375690 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 938046 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 121052 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 60624 aes-256-cbc's in 3.00s
OpenSSL 1.1.0f  25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(char) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/aarch64-linux-gnu/engines-1.1\"" 
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      98956.83k   208638.49k   288058.88k   320186.37k   330552.66k   331087.87k

It seems the Cavium ThunderX contains hardware acceleration for AES (similar to AES-NI on x86):

processor  : 0
BogoMIPS    : 200.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0x0a1
CPU revision    : 1

processor   : 1
BogoMIPS    : 200.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0x0a1
CPU revision    : 1

processor   : 2
BogoMIPS    : 200.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0x0a1
CPU revision    : 1

processor   : 3
BogoMIPS    : 200.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid
CPU implementer : 0x43
CPU architecture: 8
CPU variant : 0x1
CPU part    : 0x0a1
CPU revision    : 1

But OpenSSL 1.0 does not support it yet. With it enabled, however, these should make decent VPN machines, or handle HTTPS load much better.

Comments

  • You’re still using Scaleway?

  • rm_ IPv6 Advocate, Veteran

    Well, I never used it in any serious capacity (i.e. for anything other than Tor), and with all the Kimsufis that I have, I don't need one at the moment either.

    Still, it's a solid offer, even more so with this OpenSSL improvement. Now that OVH's VPS SSD has had a price hike, this may well remain the best bang-for-the-buck KVM out there (whether the x86 or ARM variant). If I decide to cancel my KS dedis, I will most likely use a couple of these, either for hosting stuff directly or as reverse proxies.

    Thanked by ValdikSS
  • @kcaj said:
    You’re still using Scaleway?

    You still shilling for Vultr?

  • Neoon Community Contributor, Veteran

    @kcaj said:
    You’re still using Scaleway?

    He gives up a KS4C for a Scaleway.

    Thanked by vimalware
  • @Neoon said:
    He gives up a KS4C for a Scaleway.

    lol where was this?

  • SplitIce Member, Host Rep

    Which instruction set? Does this also apply to SBC ARMs like the Raspberry PI and Friendly ARM NanoPI?

  • @SplitIce said:
    Which instruction set? Does this also apply to SBC ARMs like the Raspberry PI and Friendly ARM NanoPI?

    @rm_ said:
    It seems the Cavium ThunderX contain hardware acceleration for AES (similar to AES-NI on x86):

    https://en.wikipedia.org/wiki/AES_instruction_set

  • SplitIce Member, Host Rep

    Oh it's the actual aes instruction set not an arm one. I need glasses.

    Looks like the NanoPI's H3 also supports it, but not the Raspberry PI.

    Thanked by vimalware
  • Why am I now envisioning VPN-in-a-cigarette box?

  • Amitz said: You still shilling for Vultr?

    Yes. BGP FTW.

  • What's the clock speed of those Cavium cores? Obviously, 200 BogoMIPS isn't correct. I'd have thought they'd be faster than a 2.2 GHz Cortex-A15, which is quite old, so either the clock speed is a bit slower or the Caviums aren't that impressive.

  • rm_ IPv6 Advocate, Veteran
    edited November 2017

    johnklos said: What's the clock speed of those Cavium cores? Obviously, 200 BogoMIPS isn't correct. I'd have thought they'd be faster than a 2.2 GHz Cortex-A15, which is quite old, so either the clock speed is a bit slower or the Caviums aren't that impressive.

    You can see some benchmarks here: https://wiki.neoon.pw/doku.php?id=dedicated_benchmarks

    Also IIRC they got 212 MB/sec in my MD5 CPU benchmark, which is more than e.g. some 1.7 GHz Xeon VPS I had at another provider.

    Thanked by Aidan
  • xyz Member
    edited November 2017

    johnklos said: I'd have thought they'd be faster than a 2.2 GHz Cortex-A15, which is quite old

    The ThunderX cores are relatively weak as the focus is on having a lot of them. The Cortex-A15, on the other hand, is designed as a performance core, so I wouldn't be surprised if the ThunderX is slightly less powerful than an A15 core.

  • rm_ IPv6 Advocate, Veteran
    edited November 2017

    Performance-wise, one weirdness I noticed is that it gets mbw results like this:

    Long uses 8 bytes. Allocating 2*52428800 elements = 838860800 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 10 runs per test.
    0   Method: MEMCPY  Elapsed: 1.05829    MiB: 400.00000  Copy: 377.968 MiB/s
    1   Method: MEMCPY  Elapsed: 1.05849    MiB: 400.00000  Copy: 377.898 MiB/s
    2   Method: MEMCPY  Elapsed: 1.06049    MiB: 400.00000  Copy: 377.186 MiB/s
    3   Method: MEMCPY  Elapsed: 1.05870    MiB: 400.00000  Copy: 377.822 MiB/s
    4   Method: MEMCPY  Elapsed: 1.05840    MiB: 400.00000  Copy: 377.928 MiB/s
    5   Method: MEMCPY  Elapsed: 1.05922    MiB: 400.00000  Copy: 377.635 MiB/s
    6   Method: MEMCPY  Elapsed: 1.05855    MiB: 400.00000  Copy: 377.875 MiB/s
    7   Method: MEMCPY  Elapsed: 1.05962    MiB: 400.00000  Copy: 377.494 MiB/s
    8   Method: MEMCPY  Elapsed: 1.06070    MiB: 400.00000  Copy: 377.108 MiB/s
    9   Method: MEMCPY  Elapsed: 1.06061    MiB: 400.00000  Copy: 377.140 MiB/s
    AVG Method: MEMCPY  Elapsed: 1.05931    MiB: 400.00000  Copy: 377.605 MiB/s
    0   Method: DUMB    Elapsed: 0.95776    MiB: 400.00000  Copy: 417.643 MiB/s
    1   Method: DUMB    Elapsed: 0.95037    MiB: 400.00000  Copy: 420.889 MiB/s
    2   Method: DUMB    Elapsed: 0.95883    MiB: 400.00000  Copy: 417.177 MiB/s
    3   Method: DUMB    Elapsed: 0.95618    MiB: 400.00000  Copy: 418.333 MiB/s
    4   Method: DUMB    Elapsed: 0.95567    MiB: 400.00000  Copy: 418.555 MiB/s
    5   Method: DUMB    Elapsed: 0.95715    MiB: 400.00000  Copy: 417.905 MiB/s
    6   Method: DUMB    Elapsed: 0.95440    MiB: 400.00000  Copy: 419.109 MiB/s
    7   Method: DUMB    Elapsed: 0.94754    MiB: 400.00000  Copy: 422.146 MiB/s
    8   Method: DUMB    Elapsed: 0.95699    MiB: 400.00000  Copy: 417.979 MiB/s
    9   Method: DUMB    Elapsed: 0.95736    MiB: 400.00000  Copy: 417.817 MiB/s
    AVG Method: DUMB    Elapsed: 0.95522    MiB: 400.00000  Copy: 418.750 MiB/s
    0   Method: MCBLOCK Elapsed: 0.10251    MiB: 400.00000  Copy: 3902.134 MiB/s
    1   Method: MCBLOCK Elapsed: 0.10208    MiB: 400.00000  Copy: 3918.303 MiB/s
    2   Method: MCBLOCK Elapsed: 0.10200    MiB: 400.00000  Copy: 3921.569 MiB/s
    3   Method: MCBLOCK Elapsed: 0.10188    MiB: 400.00000  Copy: 3926.149 MiB/s
    4   Method: MCBLOCK Elapsed: 0.10191    MiB: 400.00000  Copy: 3924.955 MiB/s
    5   Method: MCBLOCK Elapsed: 0.10203    MiB: 400.00000  Copy: 3920.339 MiB/s
    6   Method: MCBLOCK Elapsed: 0.10197    MiB: 400.00000  Copy: 3922.684 MiB/s
    7   Method: MCBLOCK Elapsed: 0.10191    MiB: 400.00000  Copy: 3925.109 MiB/s
    8   Method: MCBLOCK Elapsed: 0.10204    MiB: 400.00000  Copy: 3919.878 MiB/s
    9   Method: MCBLOCK Elapsed: 0.10204    MiB: 400.00000  Copy: 3919.878 MiB/s
    AVG Method: MCBLOCK Elapsed: 0.10204    MiB: 400.00000  Copy: 3920.089 MiB/s

    Basically, memory access is "not very fast" unless you use the specialized "copy block" function. A modern x86 system does not have such a distinction; it will show 3-5 GB/sec in this test across the board, no matter which access method.

  • What version of mbw are you using? I found this in a search.

    If this code is accurate, I can't see why MCBLOCK would be any different from a memcpy. If it keeps copying the same block, then you're really just comparing cache speed vs memory speed (and cache is much faster); if the pointers move across the data, then you're essentially doing the same thing as a single memcpy, and I can't see it being any faster (it doesn't use any "specialized function").
    And in fact, I wouldn't be surprised if the "DUMB" method just gets rewritten to a memcpy by an optimizing compiler.

  • rm_ IPv6 Advocate, Veteran

    xyz said: What version of mbw are you using?

    I used version 1.2.2 from Debian Jessie.

  • rm_ said: I used version 1.2.2 from Debian Jessie.

    From the code:

        if(type==1) { /* memcpy test */
            /* timer starts */
            gettimeofday(&starttime, NULL);
            memcpy(b, a, array_bytes);
            /* timer stops */
            gettimeofday(&endtime, NULL);
        } else if(type==2) { /* memcpy block test */
            gettimeofday(&starttime, NULL);
            for(t=0; t<array_bytes; t+=block_size) {
                b=mempcpy(b, a, block_size);
            }
            if(t>array_bytes) {
                b=mempcpy(b, a, t-array_bytes);
            }
            gettimeofday(&endtime, NULL);
    

    Yeah, the code's broken. Essentially the MCBLOCK method is just testing memory write bandwidth, since `a` is likely served from cache, whilst MEMCPY tests read+write (copy) bandwidth.
