Server for mining with AVX-2/AVX-512

hewnkinetic Member
edited July 2018 in Requests

I am looking for bulk (30+) dedicated servers for crypto mining with the Argon2d algorithm.

The most important point is the hash/price ratio; AVX-512 support is recommended in this case.

For example, a Hetzner CX51 (8c! Xeon Silver 4108) gives twice as much hash as a 2xE5-2670v1 20c.

If anyone has experience with this algorithm or can help, please share your provider recommendations.

Other minimum specifications: 120GB drive, 8 GB RAM.

Update:

List of AVX-512 CPUs:
https://ark.intel.com/Search/FeatureFilter?productType=processors&InstructionSetExtensions=Intel%C2%AE%20AVX-512

Comments

  • Budget per server?

  • Netcup root is the new scalable.

  • hewnkinetic Member
    edited July 2018

    @stefeman
    The ratio is what matters, so the budget does not matter in this case.

    I can get one server with a Phi that costs me $4000, or 40 servers with E5s that cost me $100 each.

    Sadly, I believe I will need to provide some charts with results so that you can help me better.

    https://slack-files.com/T131FCJ8J-F8PENJRPZ-6b683a506f

    This one includes mostly home CPUs.

    NetCup is cool, but they do not allow mining.

  • Ryzen 7 1700 is 12 times faster than 1700X? No way.

  • Dedispec Member, Patron Provider

    If you are looking for 2x E5-2670v2/2680v2 give me a shout for a bulk price. One of our mining customers just picked up our remaining bulk units we had on hand this week, but I should have a good amount available within the week.

  • hewnkinetic Member
    edited July 2018

    I have updated the initial post with a list of AVX-512 CPUs, which are what I'm mostly interested in.

    So far I have found: 2x Xeon 4110 for 145 EUR, 2x Xeon 5118 for 210 EUR.
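As a rough sketch of the hash/price comparison in this post, the C snippet below works out how much faster the dearer server has to hash before it wins on ratio. The function names `breakeven_speedup` and `hash_per_eur` are made up for illustration. The only figures taken from the thread are the 145 EUR and 210 EUR prices; no hash rates are assumed.

```c
#include <assert.h>

/* Hypothetical helper: how many times faster the pricier server must hash
 * to match the cheaper one on hash per euro. */
double breakeven_speedup(double cheap_price_eur, double pricey_price_eur)
{
    assert(cheap_price_eur > 0.0 && pricey_price_eur > 0.0);
    return pricey_price_eur / cheap_price_eur;
}

/* Hash rate bought per euro; hash_rate is in whatever unit the miner reports. */
double hash_per_eur(double hash_rate, double price_eur)
{
    assert(price_eur > 0.0);
    return hash_rate / price_eur;
}
```

With the prices quoted above, `breakeven_speedup(145.0, 210.0)` comes to about 1.45, so the 2x Xeon 5118 box only wins on ratio if it hashes roughly 45% faster than the 2x Xeon 4110 box.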

  • FHR Member, Host Rep

    MyLoc does 4110 and 5118

  • Clouvider Member, Patron Provider

    You’re unlikely to get a lot of offers due to the disproportionate memory-to-CPU ratio.

    For example, we don’t even stock 8 GB memory sticks any more.

  • jsg Member, Resident Benchmarker

    I don't know Argon2d very well (didn't work with it but had a good look at it) but I know that inside it's based on Blake2 ('b' I think).

    • It's utterly meaningless to offer advice to you without knowing the parallelism tolerated and the m and t factors required by the mining system.

    • 8 GB is almost certainly way too little memory for any sensible degree of parallelism.

    • You do understand what a KDF (like Argon) is? These algorithms are designed with the express goal of denying major advantages for multi or many cores, FPGAs, and ASICs. So, yes, you can gain some advantage with expensive hardware but not really a lot.

    • AVX* is a reasonable approach because Blake can be optimized to be significantly faster. But there are some BIG ifs like non-matching (implementation dependent) hashes and the fact that I've seen quite a few implementations of questionable quality. So, unless you really, really know what you are doing and e.g. which loops to unroll and to what degree and which not, I strongly suggest staying away.

    Also note that the very nature of KDFs can also be to your advantage. It might, for example, be more promising to use cheaper and less power hungry Arm cores but lots of them than to go the route of optimized but power hungry and expensive CPUs.
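To make the memory point above concrete, here is a minimal sketch of the arithmetic. Argon2's m parameter is expressed in KiB, and every hash computed in parallel needs its own m KiB working area. The function name, the amount of RAM reserved for the OS, and the m values used below are assumptions for illustration only; the thread never states the coin's actual m, t, or parallelism settings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many independent Argon2 hashes fit in RAM at once.
 * ram_kib      - total memory of the server, in KiB
 * reserved_kib - memory kept back for the OS and the miner itself (assumed)
 * m_cost_kib   - Argon2 memory parameter m, in KiB per hash */
uint64_t argon2_instances_that_fit(uint64_t ram_kib, uint64_t reserved_kib,
                                   uint64_t m_cost_kib)
{
    assert(m_cost_kib > 0);
    if (reserved_kib >= ram_kib)
        return 0;
    return (ram_kib - reserved_kib) / m_cost_kib;
}
```

Treating the requested 8 GB as 8 GiB (8388608 KiB) and reserving 1 GiB, an assumed m of 1 GiB leaves room for 7 hashes in flight, while an assumed m of 4 MiB leaves room for 1792. Which end of that range applies depends entirely on the parameters the mining system mandates.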

  • Similarly, I am interested in CPU mining on spare capacity.
    I can run experiments on the Core i7 6700 / 6700k / 7700k / 8700 / 8700k.

    Do you mine Credits coin?

    Most CPU-only algos are dead, because the price/hardware/electricity ROI is impossible.

  • willie Member
    edited July 2018

    hewnkinetic said: For example, a Hetzner CX51 (8c! Xeon Silver 4108) gives twice as much hash as a 2xE5-2670v1 20c.

    Hetzner doesn't allow mining on their VPS (cloud) products, but they do permit mining on dedis. @Hetzner_OL mentioned that here a while ago iirc. They have temporarily cancelled the setup fees for some of them, including some recent-generation ones.

    @jsg the mining community is very good at optimizing software for the cpus/gpus that are out there, so they have all the implementation tricks figured out.

    I can't wait til someone builds asics for all this stuff, so the rest of us can get our cpus back.

  • Aidan Member

    willie said: I can't wait til someone builds asics for all this stuff, so the rest of us can get our cpus back.

    Then a new shitcoin will come out, meant for CPUs only.

  • jsg Member, Resident Benchmarker

    @willie said:
    @jsg the mining community is very good at optimizing software for the cpus/gpus that are out there, so they have all the implementation tricks figured out.

    I have doubts mainly for two reasons: (a) there are very few people out there who have the necessary level of knowledge and experience in both software dev. and crypto and (b), probably more importantly, those people will usually NOT share their work because that's their very advantage, which is worth a lot.

    Also every crypto currency I ever looked at had weaknesses, often even serious ones. That doesn't suggest to me that they have excellent people because real experts don't make certain stupid protocol errors (the problems are usually in the protocol not in the crypto per se; these are usually due to improper "optimization" attempts).

    That said I don't really care because I'm not interested in crypto currencies. It just happens to be quite closely linked to my field and so I got to know a bit about it.

  • willie Member

    jsg said: real experts don't make certain stupid protocol errors

    That's a completely different skill set than micro-optimizing software. And yes they are good at the latter.

  • jsg Member, Resident Benchmarker

    @willie said:

    jsg said: real experts don't make certain stupid protocol errors

    That's a completely different skill set than micro-optimizing software. And yes they are good at the latter.

    Kindly provide some links supporting your statements that they are good at the latter and that they share their optimizations. Mind you, you might be right; it's just that I haven't yet seen what you assert.

  • willie Member

    Web search "optimized monero miner avx512" found this in first few hits:

    https://github.com/JayDDee/cpuminer-opt/releases

    Looks ok to me.

  • jsg Member, Resident Benchmarker
    edited July 2018

    @willie said:
    Web search "optimized monero miner avx512" found this in first few hits:

    https://github.com/JayDDee/cpuminer-opt/releases

    Looks ok to me.

    Not to me.

    For a start, what you linked to is a collection of a whole lot of algo implementations, quite a few of them having multiple implementations. I just picked blake for a first look and was confronted with a plethora of implementations (and versions).

    Pardon me but I dare to assume that the vast majority of miners - if they build their software at all - will look bewildered and confused and will not know which one to choose.

    // End of part 1

  • jsg Member, Resident Benchmarker

    @willie // part 2

    To put a cherry on top of the fun I discovered right away a questionable and actually SLOW "optimization" in one of the files I looked at (sph_blake2b.c):

    #define ROTR64(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))

    Note the 'xor'. In the original version by the blake people that's an 'or', which is both cleaner and less processor dependent and, more importantly, is recognized by modern C compilers and replaced by an intrinsic operation (e.g. rorl) where available, which is MUCH faster.

    Plus, funnily, some attractive optimization opportunities were simply not seen. An example is the load32/64 macros, which are SLOW and only needed on big endian systems. On little endian systems (like x86) they can be trimmed to a simple uintXX_t pointer cast and an align pragma (if needed).

    Sorry but that's worthless and does in fact prove my point and not yours.

    Btw. sorry for posting in two parts due to Cloudflare.
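For readers who want to compare the two macro spellings discussed in this post, the sketch below puts them side by side. `ROTR64_XOR` is the form quoted from sph_blake2b.c and `ROTR64_OR` is the 'or' form the post attributes to the original implementation; the macro names and the checking function are invented here. The snippet only checks that the two compute the same value. It says nothing about which instructions a given compiler emits, which has to be checked in that compiler's output.

```c
#include <assert.h>
#include <stdint.h>

/* Form quoted in the post: the two shifted halves are joined with xor. */
#define ROTR64_XOR(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))
/* Form with the halves joined with or. */
#define ROTR64_OR(x, y)  (((x) >> (y)) | ((x) << (64 - (y))))

/* Returns 1 if both macros agree on x for every rotation count 1..63. */
int rotr64_variants_agree(uint64_t x)
{
    for (unsigned n = 1; n < 64; n++) {
        if (ROTR64_XOR(x, n) != ROTR64_OR(x, n))
            return 0;
    }
    return 1;
}
```

For counts 1 to 63 the two shifted halves never have a set bit in the same position, so xor and or give identical results. A count of 0 or 64 would shift a 64-bit value by 64 bits, which is undefined behaviour in C with either macro.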

  • willie Member
    edited July 2018

    jsg said: #define ROTR64(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))

    That doesn't convince me of anything unless you've written it both ways, benchmarked the results, and found the above to be slower. More cogently, what coin uses blake2? There is assembly code for some of the other algorithms.

  • jsg Member, Resident Benchmarker

    @willie said:

    jsg said: #define ROTR64(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))

    That doesn't convince me of anything unless you've written it both ways, benchmarked the results, and found the above to be slower. More cogently, what coin uses blake2? There is assembly code for some of the other algorithms.

    What I said is a well known - and well tested - fact in the relevant circles. But feel free to check and verify with clang and gcc. And yes, I DO use and work with stuff like that every day.

    But I'll end this now because you try to blindly defend your position and are, uhm, not really deep into the matter plus you simply ignore points that don't fit your view.

  • willie Member
    edited July 2018

    Ok, another minute of web search finds there is an asic for blake2 https://asian-miner.com/shop/s11-innosilicon/

    There are similarly gpu miners for it. That makes cpu mining almost irrelevant.

    Anyway I looked at the asm output for that file. ROTR64 is called multiple times from B2B_G which expands to a lot of code, but the rot/adds resulting look like:

        addq    %r12, %rsi
        xorq    %rsi, %rcx
        rorq    $16, %rcx
        addq    %rcx, %r9
    

    and similarly for the other sizes of rotations. So there is a 64-bit rorq as if an intrinsic had been used, afaict. That is gcc-6.3.0 with -O2 and a bunch of other flags as emitted by the makefile.

  • jsg Member, Resident Benchmarker

    @willie

    Nope. That's the full equivalent of the macro rather than a simple intrinsic. Note the xorq %rsi, %rcx which is the 64-bit xor I talked about.

    Regarding the ASIC, no surprise there. There are ASICs for most of the more common algos. In fact that (or Verilog or VHDL anyway) is a requirement in many crypto competitions.

    And keep in mind that blake usually is NOT "the crypto algo" with crypto currencies but just a sub-element (e.g. of Argon). To gain an advantage there's more to optimize than blake.

    Anyway, as I said I don't care about crypto currencies. If they are happy with their "optimizations", great.

  • willie Member
    edited July 2018

    jsg said: Note the xorq %rsi, %rcx which is the 64-bit xor

    That xor doesn't look to be from the ROTR64. You can see that it is BEFORE the rorq instruction. It's xor of stuff going INTO the rotation, from the B2B_G macro:

    #define B2B_G(a, b, c, d, x, y) {   \
            v[a] = v[a] + v[b] + x;         \
            v[d] = ROTR64(v[d] ^ v[a], 32); \
            v[c] = v[c] + v[d];             \
            v[b] = ROTR64(v[b] ^ v[c], 24); \
            v[a] = v[a] + v[b] + y;         \
            v[d] = ROTR64(v[d] ^ v[a], 16); \
            v[c] = v[c] + v[d];             \
            v[b] = ROTR64(v[b] ^ v[c], 63); }
    

    The snippet I pasted looks to be the 16 bit rotation in the above, where the input is the xor of the two previous rotation/add outputs.

    So anyway I need to see benchmarks before believing they made some dumb mistake that slowed down this code. Both of your theories so far (that the rotr macro doesn't generate a rotate intrinsic, and that it generates an unnecessary xor) have been wrong, so it's no longer worth examining further ones unless there's a benchmark.
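The benchmark asked for here could be set up along the following lines. This is only a sketch: the workload is a made-up chain of add/xor/rotate steps loosely shaped like one line of B2B_G, not the real BLAKE2b compression function, and the names `chain_xor`, `chain_or`, and `time_chain` are hypothetical. No result is claimed; timings depend on the compiler, its flags, and the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define ROTR64_XOR(x, y) (((x) >> (y)) ^ ((x) << (64 - (y))))
#define ROTR64_OR(x, y)  (((x) >> (y)) | ((x) << (64 - (y))))

/* Dependent chain using the xor-style rotate. */
uint64_t chain_xor(uint64_t seed, uint64_t rounds)
{
    uint64_t a = seed, b = ~seed;
    for (uint64_t i = 0; i < rounds; i++) {
        a = a + b + i;
        b = ROTR64_XOR(b ^ a, 16);
    }
    return a ^ b;
}

/* Same chain using the or-style rotate. */
uint64_t chain_or(uint64_t seed, uint64_t rounds)
{
    uint64_t a = seed, b = ~seed;
    for (uint64_t i = 0; i < rounds; i++) {
        a = a + b + i;
        b = ROTR64_OR(b ^ a, 16);
    }
    return a ^ b;
}

/* Returns CPU seconds spent in one call of fn. The result is written to
 * *out so that the compiler cannot discard the computation. */
double time_chain(uint64_t (*fn)(uint64_t, uint64_t), uint64_t rounds,
                  uint64_t *out)
{
    assert(fn != NULL && out != NULL);
    clock_t start = clock();
    *out = fn(UINT64_C(0x0123456789abcdef), rounds);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_chain(chain_xor, rounds, &r1)` and `time_chain(chain_or, rounds, &r2)` with a large `rounds`, several times and at the optimization level the miner is actually built with, gives two timings to compare; `r1` and `r2` should always be equal.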

  • jsg Member, Resident Benchmarker

    @willie said:
    That xor doesn't look to be from the ROTR64. You can see that it is BEFORE the rorq instruction. It's xor of stuff going INTO the rotation, from the B2B_G macro:

    Yes, that looks reasonable but doesn't change the fact that both the common rule and the original blake2 implementation use 'or' (and not 'xor') in the ROTXYY macro.

    So anyway I need to see benchmarks before believing they made some dumb mistake that slowed down this code.

    Pardon me but there's something to be set straight as you again try to play the "convince me" game. It's YOUR burden to demonstrate that YOUR statement, on which this whole side-discussion is based, is true; you asserted that the miner communities had quite a few crypto code optimizers. What you have demonstrated so far however is merely that they don't do worse than standard.

    My basic intention in this thread was to offer some advice to OP and to ask him some detail questions the answers to which are vital for giving him what he asked for.

  • xyz Member

    jsg said: Note the 'xor'. In the original version by the blake people that's an 'or', which is both cleaner and less processor dependent and, more importantly, is recognized by modern C compilers and replaced by an intrinsic operation (e.g. rorl) where available, which is MUCH faster.

    xor is weird, but hardly an issue: https://godbolt.org/g/2QNXTD

    Tried a few older versions of GCC, and all optimize to the same thing. Clang does seem to miss the optimization here though. Perhaps they never checked Clang, seeing as GCC is the default compiler on Linux.

    jsg said: On little endian systems (like x86) they can be trimmed to a simple uintXX_t pointer cast

    Which is exactly what it does.

    I don't know enough about how these work to make much of a statement, but I suspect all the performant stuff is done in SIMD, so this code is probably just fallback for when SIMD is unavailable. In other words, performance probably wasn't even a major concern for this part of the code.
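The load64 shortcut under discussion can be sketched as follows. The portable version assembles a little-endian value byte by byte and works on any host; the native version reads the bytes directly and is only correct on little-endian hosts. Note one substitution: `memcpy` is used here instead of the raw uintXX_t pointer cast mentioned in the thread, because it sidesteps alignment and aliasing concerns and compilers typically turn it into a plain load. All function names are invented for this example and are not taken from cpuminer-opt or the BLAKE2 sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable little-endian load: builds the value from individual bytes. */
uint64_t load64_portable(const void *src)
{
    assert(src != NULL);
    const uint8_t *p = (const uint8_t *)src;
    uint64_t w = 0;
    for (int i = 7; i >= 0; i--)
        w = (w << 8) | p[i];
    return w;
}

/* Native load: copies 8 bytes as they sit in memory.
 * Matches load64_portable only on little-endian hosts. */
uint64_t load64_native(const void *src)
{
    assert(src != NULL);
    uint64_t w;
    memcpy(&w, src, sizeof w);
    return w;
}

/* Runtime check of the host byte order. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On x86 both loads return the same value, so a build for little-endian targets can use the native one; a big-endian build has to keep the byte-by-byte version.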

  • jsg Member, Resident Benchmarker
    edited July 2018

    @xyz said:
    xor is weird, but hardly an issue: https://godbolt.org/g/2QNXTD

    Tried a few older versions of GCC, and all optimize to the same thing. Clang does seem to miss the optimization here though. Perhaps they never checked Clang, seeing as GCC is the default compiler on Linux.

    You've set it to C++ but the compiler actually used is C (-std=c99). Funnily, when setting the linked online toy to C both gcc and clang produce poor code. I do know for certain, however (I did work on the algo and compiled it plenty often with gcc, clang, and Compcert), that xor vs. or do produce different code in at least some compilers.

    jsg said: On little endian systems (like x86) they can be trimmed to a simple uintXX_t pointer cast

    Which is exactly what it does.

    Oops, you are right. I was looking at the ref. implementation. So, one point for that detail in their code. Funny sidenote: They don't have load/store64 but kept load/store48 which isn't used at all; dunno why the blake people put it there. Plus they forgot the align pragma and also use an unreliable zeromem with the simple volatile trick.

    I don't know enough about how these work to make much of a statement, but I suspect all the performant stuff is done in SIMD, so this code is probably just fallback for when SIMD is unavailable. In other words, performance probably wasn't even a major concern for this part of the code.

    Nope, that code is in the blake2 algo core. SSE and other optimizations are significant but not game changing.

    Most importantly optimizations (and often even 64 bit code advantages) are practically useless or even unwanted depending on the use case. For mining one of course squeezes out every bit but one also has a quite different and tighter minimal base like amd64 w/SSE4.1. But for most applications 32 bits must be supported and 64-bit comes down to -mtune=core2. Just imagine for example OpenSSH telling you that it runs only on the newest Xeons...

    That's why in the real world careful and skilled optimization - as opposed to "let's just use AVX" - needs to be done. And then unrolling some loops by hand (and typically adapting the algorithm) while leaving others alone, optimizing stores and loads and bitstream walking etc. become important.

    Keep in mind btw that the frame here was a question related to a KDF (Argon2) and at the same time about mining (in other words: that crypto currency has a nonsensical basis).
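The tension described in this post, between a conservative baseline build and code that needs the newest CPUs, is commonly handled by picking a code path at run time. The sketch below assumes GCC or Clang on x86, where the `__builtin_cpu_init` and `__builtin_cpu_supports` built-ins are available; on any other compiler or architecture it simply reports the baseline. The enum and function names are hypothetical and do not come from any miner.

```c
#include <assert.h>

/* Hypothetical tiers a build might choose between at run time. */
enum simd_tier { SIMD_BASELINE = 0, SIMD_AVX2 = 1, SIMD_AVX512F = 2 };

/* Returns the best tier the running CPU supports. */
enum simd_tier detect_simd_tier(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return SIMD_AVX512F;
    if (__builtin_cpu_supports("avx2"))
        return SIMD_AVX2;
#endif
    return SIMD_BASELINE;
}

/* Human-readable name of a tier, e.g. for a startup log line. */
const char *simd_tier_name(enum simd_tier tier)
{
    assert(tier <= SIMD_AVX512F);
    switch (tier) {
    case SIMD_AVX512F: return "avx512f";
    case SIMD_AVX2:    return "avx2";
    default:           return "baseline";
    }
}
```

A program would call `detect_simd_tier()` once at startup and select the matching hash routine, which also gives a quick way to confirm whether a rented server really exposes AVX2 or AVX-512 before benchmarking on it.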

  • xyz Member

    jsg said: You've set it to C++ but the compiler actually used is C (-std=c99). Funnily, when setting the linked online toy to C both gcc and clang produce poor code.

    Makes no difference to me. Perhaps you forgot to include -O2 in the compiler flags?

    SSE and other optimizations are significant but not game changing

    Wait, you just said they were significant... which means I'd expect them to be used?

    and bitstream walking etc. become important.

    Even more important is to not blindly ignore, or make unverified assumptions about the optimizing compiler.

  • jsg Member, Resident Benchmarker

    @xyz said:
    Makes no difference to me. Perhaps you forgot to include -O2 in the compiler flags?

    Of course not. IF a simple -O2 is OK - which is not always the case. I've worked on algos/implementations where -O2 would make the tests fail and where I had to hand select the flags.

    SSE and other optimizations are significant but not game changing

    Wait, you just said they were significant... which means I'd expect them to be used?

    Yes but that wasn't my point. In some contexts a 30% improvement in speed of some sub algo is just not changing the game. KDFs are one example.

    and bitstream walking etc. become important.

    Even more important is to not blindly ignore, or make unverified assumptions about the optimizing compiler.

    Are we getting personal now? I've worked on 2 CAESAR finalists (and found and repaired a vulnerability), on 2 or 3 [cs]prngs, on one and a half PHC finalists and some more things where "worked" typically means to completely analyze, then statically verify and finally guard it (Hoare triples and separation logic). Sometimes optimization is also requested. So I'm quite confident wrt my knowledge of compilers, thank you.
