Server for mining with AVX-2/AVX-512

hewnkinetic · July 2018

I am looking for dedicated servers in bulk (30+) for crypto mining with the Argon2d algorithm.

The most important point is the hash/price ratio, so AVX-512 support is recommended.

For example, a Hetzner CX51 (8c! Xeon Silver 4108) gives twice the hash rate of a 2x E5-2670v1 (20c).

If someone has experience with this algorithm or can help, please share your advice on providers.

Other minimum specifications: 120 GB drive, 8 GB RAM.

Update:

List of AVX-512 CPUs:
https://ark.intel.com/Search/FeatureFilter?productType=processors&InstructionSetExtensions=Intel%C2%AE%20AVX-512
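
Before renting in bulk it may be worth confirming that a given box actually exposes AVX-512 to the operating system, since a virtualized product can hide CPU flags. A minimal sketch, assuming GCC or Clang on x86; the function names `has_avx512f` and `require_avx512f` are made up for illustration, while `__builtin_cpu_init` and `__builtin_cpu_supports` are compiler builtins:

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the running CPU reports AVX-512 Foundation support.
   Relies on GCC/Clang builtins; on any other target it returns false. */
bool has_avx512f(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();                      /* harmless if already initialized */
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;
#endif
}

/* Hypothetical start-up guard for a miner built only for AVX-512:
   aborts (in debug builds) when the feature is missing. */
void require_avx512f(void)
{
    assert(has_avx512f() && "this build expects AVX-512F");
}
```

Calling `has_avx512f()` once at start-up and falling back to an AVX2 code path when it returns false is one way to use it. It says nothing about how fast AVX-512 runs on that CPU, only that the instructions are available.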

stefeman · July 2018

Budget per server?

stefeman · July 2018

Netcup root is the new scalable.

hewnkinetic · July 2018

@stefeman
The ratio is what matters; the budget itself does not matter in this case.

I can get one server with a Phi that costs me $4000, or 40 servers with E5s that cost me $100 each.
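
The comparison above is plain arithmetic and can be sketched as follows. The prices come from the post; any hash rates fed into it are placeholders, since the thread gives no measured numbers, and the function names are made up for illustration:

```c
#include <assert.h>

/* Hash rate obtained per unit of price. */
double hash_per_dollar(double hashes_per_second, double price_usd)
{
    assert(price_usd > 0.0);
    return hashes_per_second / price_usd;
}

/* Returns 1 if option A has the strictly better hash/price ratio, else 0. */
int better_ratio(double a_hashrate, double a_price,
                 double b_hashrate, double b_price)
{
    return hash_per_dollar(a_hashrate, a_price) >
           hash_per_dollar(b_hashrate, b_price);
}
```

For example, `better_ratio(phi_hashrate, 4000.0, e5_hashrate, 100.0)` with your own benchmark results. Since 40 of the $100 servers cost the same $4000, the single Phi box only wins if it hashes more than 40 times as fast as one E5 box.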

Sadly, I believe I would need to provide some charts with results so that you can help me better.

https://slack-files.com/T131FCJ8J-F8PENJRPZ-6b683a506f

This one includes mostly home CPUs.

NetCup is cool, but they do not allow mining.

levnode · July 2018

Ryzen 7 1700 is 12 times faster than 1700X? No way.

Dedispec · July 2018

If you are looking for 2x E5-2670v2/2680v2 give me a shout for a bulk price. One of our mining customers just picked up our remaining bulk units we had on hand this week, but I should have a good amount available within the week.

hewnkinetic · July 2018

I have updated the initial post with a list of AVX-512 CPUs, which are what I'm mostly interested in.

So far I have found: 2x Xeon 4110 for 145 EUR, 2x Xeon 5118 for 210 EUR.

FHR · July 2018

MyLoc does 4110 and 5118

Clouvider · July 2018

You’re unlikely to get a lot of offers due to the disproportionate memory-to-CPU ratio.

For example, we don’t even stock 8 GB memory sticks here any more.

jsg · July 2018

I don't know Argon2d very well (didn't work with it but had a good look at it) but I know that inside it's based on Blake2 ('b' I think).

It's utterly meaningless to offer advice to you without knowing the parallelism tolerated and the m and t factors required by the mining system.
8 GB is almost certainly way too little memory for any sensible degree of parallelism.
You do understand what a KDF (like Argon) is? These algorithms are designed with the express goal of denying major advantages to multi- or many-core CPUs, FPGAs, and ASICs. So yes, you can gain some advantage with expensive hardware, but not really a lot.
AVX* is a reasonable approach because Blake can be optimized to be significantly faster. But there are some BIG ifs, like non-matching (implementation-dependent) hashes and the fact that I've seen quite a few implementations of questionable quality. So, unless you really, really know what you are doing, e.g. which loops to unroll and to what degree and which not, I strongly suggest staying away.

Also note that the very nature of KDFs can also be to your advantage. It might, for example be more promising to use cheaper and less power hungry Arm cores but lots of them than to go the route of optimized but power hungry and expensive CPUs.
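
The memory point above can be made concrete with a small calculation: each Argon2 hash in flight needs its own m KiB of memory, so RAM caps how many hashes can run in parallel. This is only a sketch; the thread never states the coin's m and t, so any value passed in is an assumption, and `max_parallel_hashes` is a made-up helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on Argon2 instances that fit into RAM at the same time.
   m_kib         Argon2 memory cost per hash, in KiB (the "m" factor)
   reserve_bytes memory left aside for the OS and the miner itself */
uint64_t max_parallel_hashes(uint64_t ram_bytes, uint64_t reserve_bytes,
                             uint64_t m_kib)
{
    assert(m_kib > 0);
    if (ram_bytes <= reserve_bytes)
        return 0;
    return (ram_bytes - reserve_bytes) / (m_kib * 1024u);
}
```

With 8 GiB of RAM and 1 GiB reserved, a hypothetical m of 1 GiB would leave room for only a handful of concurrent hashes, fewer than the threads of a dual-socket server, while a small m would make memory a non-issue. That is why the required m matters before picking hardware.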

Firstishe · July 2018

Similarly, I am interested in CPU mining on spare capacity.
I can run experiments on the Core i7 6700 / 6700k / 7700k / 8700 / 8700k.

Do you mine Credits coin?

Most CPU-only algos are dead, because a positive ROI is impossible once price, hardware and electricity are factored in.

willie · July 2018

hewnkinetic said: For example, a Hetzner CX51 (8c! Xeon Silver 4108) gives twice the hash rate of a 2x E5-2670v1 (20c).

Hetzner doesn't allow mining on their VPS (cloud) products, but they do permit mining on dedis. @Hetzner_OL mentioned that here a while ago iirc. They have temporarily cancelled the setup fees for some of them, including some recent-generation ones.

@jsg the mining community is very good at optimizing software for the cpus/gpus that are out there, so they have all the implementation tricks figured out.

I can't wait til someone builds asics for all this stuff, so the rest of us can get our cpus back.

Aidan · July 2018

willie said: I can't wait til someone builds asics for all this stuff, so the rest of us can get our cpus back.

Then a new shitcoin will come out, meant for CPUs only.

jsg · July 2018

@willie said:
@jsg the mining community is very good at optimizing software for the cpus/gpus that are out there, so they have all the implementation tricks figured out.

I have doubts mainly for two reasons: (a) there are very few people out there who have the necessary level of knowledge and experience in both software dev. and crypto, and (b), probably more importantly, those people will usually NOT share their work because that's their very advantage, which is worth a lot.

Also every crypto currency I ever looked at had weaknesses, often even serious ones. That doesn't suggest to me that they have excellent people because real experts don't make certain stupid protocol errors (the problems are usually in the protocol not in the crypto per se; these are usually due to improper "optimization" attempts).

That said I don't really care because I'm not interested in crypto currencies. It just happens to be quite closely linked to my field and so I got to know a bit about it.

willie · July 2018

jsg said: real experts don't make certain stupid protocol errors

That's a completely different skill set than micro-optimizing software. And yes they are good at the latter.

jsg · July 2018

@willie said:

jsg said: real experts don't make certain stupid protocol errors

That's a completely different skill set than micro-optimizing software. And yes they are good at the latter.

Kindly provide some links supporting your statements that they are good at the latter and that they share their optimizations. Mind you, you might be right; it's just that I haven't yet seen what you assert.

willie · July 2018

Web search "optimized monero miner avx512" found this in first few hits:

https://github.com/JayDDee/cpuminer-opt/releases

Looks ok to me.

jsg · July 2018

@willie said:
Web search "optimized monero miner avx512" found this in first few hits:

https://github.com/JayDDee/cpuminer-opt/releases

Looks ok to me.

Not to me.

For a start, what you linked to is a collection of a whole lot of algo implementations, with quite a few of them having multiple implementations. I just picked blake for a first look and was confronted with a plethora of implementations (and versions).

Pardon me but I dare to assume that the vast majority of miners - if they build their software at all - will look bewildered and confused and will not know which one to choose.

// End of part 1

jsg · July 2018

@willie // part 2

To put a cherry on top of the fun I discovered right away a questionable and actually SLOW "optimization" in one of the files I looked at (sph_blake2b.c):

#define ROTR64(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))

Note the 'xor'. In the original version by the blake people that's an 'or', which is both cleaner and less processor dependent and, more importantly, is recognized by modern C compilers and replaced by an intrinsic operation (e.g. rorl) where available, which is MUCH faster.
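
For reference, the two spellings can be put side by side. A minimal sketch with made-up function names; it only checks that both forms compute the same value, which follows from the two shifted halves never sharing a set bit for rotation counts 1..63. It says nothing about which form a given compiler turns into a rotate instruction, which is the point under dispute:

```c
#include <assert.h>
#include <stdint.h>

/* Form quoted from sph_blake2b.c: the two halves are combined with xor. */
uint64_t rotr64_xor(uint64_t x, unsigned y)
{
    assert(y > 0 && y < 64);   /* a shift by 64 would be undefined */
    return (x >> y) ^ (x << (64 - y));
}

/* Conventional form: the two halves are combined with or. */
uint64_t rotr64_or(uint64_t x, unsigned y)
{
    assert(y > 0 && y < 64);
    return (x >> y) | (x << (64 - y));
}

/* Returns 1 if both forms agree on x for every rotation count 1..63. */
int rotr_variants_agree(uint64_t x)
{
    for (unsigned y = 1; y < 64; y++) {
        if (rotr64_xor(x, y) != rotr64_or(x, y))
            return 0;
    }
    return 1;
}
```

To compare code generation, compile both functions with the same flags (for example -O2) and read the assembly for each compiler of interest.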

Plus, funnily, some attractive optimization opportunities were simply not seen. An example is the load32/64 macros, which are SLOW and only needed on big endian systems. On little endian systems (like x86) they can be trimmed to a simple uintXX_t pointer cast and an align pragma (if needed).
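
The load idea can be sketched like this. The names are made up and this is not the code from the miner or from the reference implementation; it uses memcpy in place of a raw pointer cast, which is a swapped-in technique that sidesteps alignment and aliasing concerns and which compilers commonly reduce to a single load:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable little-endian load: correct on any host byte order. */
uint64_t load64_portable(const void *src)
{
    assert(src != NULL);
    const uint8_t *p = (const uint8_t *)src;
    uint64_t w = 0;
    for (int i = 7; i >= 0; i--)
        w = (w << 8) | p[i];       /* most significant byte comes last in memory */
    return w;
}

/* Native load: matches load64_portable only on little-endian hosts. */
uint64_t load64_native(const void *src)
{
    assert(src != NULL);
    uint64_t w;
    memcpy(&w, src, sizeof w);
    return w;
}

/* Runtime probe of the host byte order. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On x86, `load64_native` can stand in for `load64_portable`; on a big-endian host only the portable version gives the value BLAKE2 expects.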

Sorry but that's worthless and does in fact prove my point and not yours.

Btw. sorry for posting in two parts due to Cloudflare.

willie · July 2018

jsg said: #define ROTR64(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))

That doesn't convince me of anything unless you've written it both ways, benchmarked the results, and found the above to be slower. More cogently, what coin uses blake2? There is assembly code for some of the other algorithms.

jsg · July 2018

@willie said:

jsg said: #define ROTR64(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))

That doesn't convince me of anything unless you've written it both ways, benchmarked the results, and found the above to be slower. More cogently, what coin uses blake2? There is assembly code for some of the other algorithms.

What I said is a well known - and well tested - fact in the relevant circles. But feel free to check and verify with clang and gcc. And yes, I DO use and work with stuff like that every day.

But I'll end this now because you try to blindly defend your position and are, uhm, not really deep into the matter plus you simply ignore points that don't fit your view.

willie · July 2018

Ok, another minute of web search finds there is an asic for blake2 https://asian-miner.com/shop/s11-innosilicon/

There are similarly gpu miners for it. That makes cpu mining almost irrelevant.

Anyway, I looked at the asm output for that file. ROTR64 is called multiple times from B2B_G, which expands to a lot of code, but the resulting rotates/adds look like:

    addq    %r12, %rsi
    xorq    %rsi, %rcx
    rorq    $16, %rcx
    addq    %rcx, %r9

and similarly for the other sizes of rotations. So there is a 64-bit rorq as if an intrinsic had been used, afaict. That is gcc-6.3.0 with -O2 and a bunch of other flags as emitted by the makefile.

jsg · July 2018

@willie

Nope. That's the full equivalent of the macro rather than a simple intrinsic. Note the xorq %rsi, %rcx which is the 64-bit xor I talked about.

Regarding the ASIC, no surprise there. There are ASICs for most of the more common algos. In fact that (or Verilog or VHDL anyway) is a requirement in many crypto competitions.

And keep in mind that blake usually is NOT "the crypto algo" with crypto currencies but just a sub-element (e.g. of Argon). To gain an advantage there's more to optimize than blake.

Anyway, as I said I don't care about crypto currencies. If they are happy with their "optimizations", great.

willie · July 2018

jsg said: Note the xorq %rsi, %rcx which is the 64-bit xor

That xor doesn't look to be from the ROTR64. You can see that it is BEFORE the rorq instruction. It's xor of stuff going INTO the rotation, from the B2B_G macro:

#define B2B_G(a, b, c, d, x, y) {   \
        v[a] = v[a] + v[b] + x;         \
        v[d] = ROTR64(v[d] ^ v[a], 32); \
        v[c] = v[c] + v[d];             \
        v[b] = ROTR64(v[b] ^ v[c], 24); \
        v[a] = v[a] + v[b] + y;         \
        v[d] = ROTR64(v[d] ^ v[a], 16); \
        v[c] = v[c] + v[d];             \
        v[b] = ROTR64(v[b] ^ v[c], 63); }

The snippet I pasted looks to be the 16 bit rotation in the above, where the input is the xor of the two previous rotation/add outputs.
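
To make the data flow easier to follow, here is the same mixing step written as a function instead of a macro. It is a sketch with made-up names (`rotr64`, `b2b_g`), not the miner's code; the rotation counts 32, 24, 16 and 63 are the ones in the macro above:

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit rotate right, valid for counts 1..63. */
static uint64_t rotr64(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

/* One G mixing step on the 16-word working state v.
   Each xor happens first and its result is the INPUT of a rotation. */
void b2b_g(uint64_t v[16], int a, int b, int c, int d, uint64_t x, uint64_t y)
{
    v[a] = v[a] + v[b] + x;
    v[d] = rotr64(v[d] ^ v[a], 32);
    v[c] = v[c] + v[d];
    v[b] = rotr64(v[b] ^ v[c], 24);
    v[a] = v[a] + v[b] + y;
    v[d] = rotr64(v[d] ^ v[a], 16);   /* xor, then rotate by 16 ... */
    v[c] = v[c] + v[d];               /* ... then add the rotated word */
    v[b] = rotr64(v[b] ^ v[c], 63);
}
```

Read this way, an add, xor, rotate-by-16, add sequence in the compiler output lines up with the fifth to seventh statements, where the xor belongs to the argument of the rotation and not to the rotation itself.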

So anyway I need to see benchmarks before believing they made some dumb mistake that slowed down this code. Both of your theories so far (that the rotr macro doesn't generate a rotate intrinsic, and that it generates an unnecessary xor) have been wrong, so it's no longer worth examining further ones unless there's a benchmark.

jsg · July 2018

@willie said:
That xor doesn't look to be from the ROTR64. You can see that it is BEFORE the rorq instruction. It's xor of stuff going INTO the rotation, from the B2B_G macro:

Yes, that looks reasonable, but it doesn't change the fact that both the common rule and the original blake2 implementation use 'or' (and not 'xor') in the ROTXYY macro.

So anyway I need to see benchmarks before believing they made some dumb mistake that slowed down this code.

Pardon me, but there's something to be set straight, as you again try to play the "convince me" game. It's YOUR burden to demonstrate that YOUR statement, on which this whole side-discussion is based, is true; you asserted that the miner communities had quite a lot of crypto code optimizers. What you have demonstrated so far, however, is merely that they don't do worse than standard.

My basic intention in this thread was to offer some advice to OP and to ask him some detail questions the answers to which are vital for giving him what he asked for.

xyz · July 2018

jsg said: Note the 'xor'. In the original version by the blake people that's an 'or', which is both cleaner and less processor dependent and, more importantly, is recognized by modern C compilers and replaced by an intrinsic operation (e.g. rorl) where available, which is MUCH faster.

xor is weird, but hardly an issue: https://godbolt.org/g/2QNXTD

Tried a few older versions of GCC, and all optimize to the same thing. Clang does seem to miss the optimization here though. Perhaps they never checked Clang, seeing as GCC is the default compiler on Linux.

jsg said: On little endian systems (like x86) they can be trimmed to a simple uintXX_t pointer cast

Which is exactly what it does.

I don't know enough about how these work to make much of a statement, but I suspect all the performant stuff is done in SIMD, so this code is probably just fallback for when SIMD is unavailable. In other words, performance probably wasn't even a major concern for this part of the code.

jsg · July 2018

@xyz said:
xor is weird, but hardly an issue: https://godbolt.org/g/2QNXTD

Tried a few older versions of GCC, and all optimize to the same thing. Clang does seem to miss the optimization here though. Perhaps they never checked Clang, seeing as GCC is the default compiler on Linux.

You've set it to C++ but the compiler actually used is C (-std=c99). Funnily, when setting the linked online toy to C both gcc and clang produce poor code. I do know for certain, however (I did work on the algo and compiled it plenty often with gcc, clang, and Compcert), that xor vs. or produce different code in at least some compilers.

jsg said: On little endian systems (like x86) they can be trimmed to a simple uintXX_t pointer cast

Which is exactly what it does.

Oops, you are right. I was looking at the ref. implementation. So, one point for that detail in their code. Funny sidenote: They don't have load/store64 but kept load/store48 which isn't used at all; dunno why the blake people put it there. Plus they forgot the align pragma and also use an unreliable zeromem with the simple volatile trick.

I don't know enough about how these work to make much of a statement, but I suspect all the performant stuff is done in SIMD, so this code is probably just fallback for when SIMD is unavailable. In other words, performance probably wasn't even a major concern for this part of the code.

Nope, that code is in the blake2 algo core. SSE and other optimizations are significant but not game changing.

Most importantly optimizations (and often even 64 bit code advantages) are practically useless or even unwanted depending on the use case. For mining one of course squeezes out every bit but one also has a quite different and tighter minimal base like amd64 w/SSE4.1. But for most applications 32 bits must be supported and 64-bit comes down to -mtune=core2. Just imagine for example OpenSSH telling you that it runs only on the newest Xeons...

That's why in the real world careful and skilled optimization - as opposed to "let's just use AVX" - needs to be done. And then unrolling some loops by hand (and typically adapting the algorithm) while leaving others alone, optimizing stores and loads and bitstream walking etc. become important.

Keep in mind btw that the frame here was a question related to a KDF (Argon2) and at the same time about mining (in other words: that crypto currency has a nonsensical basis).

xyz · August 2018

jsg said: You've set it to C++ but the compiler actually used is C (-std=c99). Funnily, when setting the linked online toy to C both gcc and clang produce poor code.

Makes no difference to me. Perhaps you forgot to include -O2 in the compiler flags?

SSE and other optimizations are significant but not game changing

Wait, you just said they were significant... which means I'd expect them to be used?

and bitstream walking etc. become important.

Even more important is to not blindly ignore, or make unverified assumptions about the optimizing compiler.

jsg · August 2018

@xyz said:
Makes no difference to me. Perhaps you forgot to include -O2 in the compiler flags?

Of course not. IF a simple -O2 is OK - which is not always the case. I've worked on algos/implementations where -O2 would make the tests fail and where I had to hand select the flags.

SSE and other optimizations are significant but not game changing

Wait, you just said they were significant... which means I'd expect them to be used?

Yes but that wasn't my point. In some contexts a 30% improvement in speed of some sub algo is just not changing the game. KDFs are one example.

and bitstream walking etc. become important.

Even more important is to not blindly ignore, or make unverified assumptions about the optimizing compiler.

Are we getting personal now? I've worked on 2 CAESAR finalists (and found and repaired a vulnerability), on 2 or 3 [cs]prngs, on one and a half PHC finalists and some more things where "worked" typically means to completely analyze, then statically verify and finally guard it (Hoare triples and separation logic). Sometimes optimization is also requested. So I'm quite confident wrt my knowledge of compilers, thank you.
