Comments
It's 80 servers affected, so it's not unreasonable. And it's not just the Microclouds. It's their standalone boards that have problems too.
Even so, if you order 10 things from one manufacturer, costing around $50k, ordered at different times and so from different batches, and some different models, and you get a high failure rate, there is something going on!
Based on the replies here, I'm not the only one. @leapswitch reported having problems too. I doubt many providers have a sample size of 10,000! 10,000 MicroClouds would mean at least 80,000 servers. Do you have 80,000 servers @Clouvider? Even if you do, are that many of them MicroClouds?
No one company will have a sample size large enough to be statistically valid.
More than you, and as shown in many other responses against your thesis, these are not widespread issues.
I think you're overreacting a lot. Asking the community whether they also have an issue is frankly very different from making claims and calling them names.
I am not sure whether Google still does, but they used to publish reports on HDD failures in their data centers. Their sample size was over 10,000.
Now, that's a report I can trust.
No doubt. But statistically relevant? I'd be very impressed if you guys had 10,000 Microclouds in deployment.
Perhaps. How many failures do you accept from your suppliers/vendors before you start looking for alternatives? The question remains, any reasonable alternatives?
Someone suggested ASRock. Anything else?
India has 2 major Supermicro distributors, and both have had the same experience with 8-node microclouds deployed across India. Hardly any microclouds have gone without at least one backplane replacement. So it's not limited to units sold to / used by Leapswitch.
IPMI white screens and connection errors used to happen on older versions, especially if the IPMI was on a public IP. We haven't seen such issues in the past 2-3 years on any Supermicro IPMI behind a VPN/firewall/private IP.
Backblaze are good at posting such information.
But are you seriously suggesting that your own experiences are irrelevant?
Besides, 10 Chassis = 80 Servers. That's no small number. If you buy 10 switches, and 7 of them show some kind of problem, would you not consider looking for alternatives? Would you not think that maybe there is something wrong with those switches and avoid them in the future?
Statistically irrelevant. @Clouvider has more Microclouds and SM gear in deployment than all of India combined, and he has 0 failures.
Is Tyan still active? They used to compete with SM. I haven't seen them for some years now.
In some cases, yes, my own subjective experience is moot. If I am doing a job professionally, I rely on solid data instead of my own experience.
That is my code.
Yes, we used Tyan for years! From 2003-2010 we bought either consumer hardware, Tyan, or Intel. They've completely disappeared.
And now you're taking this personally. Not cool.
The difference between your OP and that of @LeapSwitch is that @LeapSwitch shares their experience in a professional manner; you're not. You start by calling names and claiming they are shitty and worse than consumer-oriented equipment, based on a sample of 10.
Fujitsu Primergy is another option to consider if you are going for brands such as Supermicro. Otherwise, go with HP, Dell or IBM.
Tyan went almost exclusively into big-client deals. Distribution says they are now coming back, trying to win back their market share.
Out of curiosity, do you check often how stable/clean the electricity supply going to your racks is? Fluctuations can cause random failures and slowly kill your boards' circuits.
Wouldn't fluctuations be filtered by the UPS?
I changed all the hardware (motherboard/mem/cpu/power supply/chassis) and it still locked up intermittently. Used the recommended memory. So the Intel server had a flawed design, most likely in the motherboard. I'm not the only one, if you do a search. Intel doesn't have a very good reputation for their branded servers.
Yes, but like any other piece of hardware, it could go sour and stop properly protecting your precious equipment.
Yes, and all the equipment is on UPS. Perfect sine wave. Though the microclouds have dual PSUs, and the second PSU is connected directly to mains rather than to the same UPS. But HK has a pretty stable power supply.
Almost all servers are single-PSU and all connect to a UPS. Every rack has a dual power feed, with primary power from the UPS and secondary from mains. An STS controls the source, so the UPS is definitely not a possible problem.
In any case, the failures are not happening over a long period of time. Old gear doesn't fail. Problems normally occur within the warranty period. If they still work after warranty, they don't die; they just get too old to use. I've got plenty of E5 v1 and v2 still in service using SM boards, and they work great (other than the Java IPMI).
In terms of boards, it's mostly the microATX boards that support E3 v3 CPUs. In fact, I've not had any major problems with any of the E5 boards of any generation other than a couple with some dodgy DIMM slots. Those we still use but with less RAM.
Even our Microclouds are mostly E3 (all v5/v6), and so far only 2 have been completely fault-free. The E5 microclouds were purchased second hand from the US and they are solid (so far). But the dodgy ones were all purchased new, and they are from different batches.
I can accept some small issues if the board is still usable. IPMI may be dodgy but for the most part the boards do continue to work. As I said above, probably less than 1% of the boards actually fail (ever) but when you include small problems the number is considerably higher. But backplane issues are unacceptable, and I don't care if a sample of 10 is small. That's 80 servers and over US$50k of SM equipment and around $100k when you include all the RAM/CPUs/Drives.
Maybe in the US and UK you have decent RMA turnaround times, but in HK we don't, and that's very frustrating.
As far as I know, you cannot buy standard form factor boards from these manufacturers and need to buy whole barebones systems, no?
Will have a look. Thanks.
Jeeze. It's just a joke. Have a sense of humour!
Let me go back through your comments.
Now you know of at least 4 on this thread who have had similar problems.
That doesn't explain why it's only the new stuff that's affected. And if our build methods were so bad, you'd expect we'd have more problems with the standalone boards too.
You're not offering anything helpful, you're passively trying to direct the fault in our direction. That's 'not cool'!
And yet if you count the number of people who have experienced problems and how many have not, you'll find more WITH problems than without. So much so that @leapswitch has already stated they won't be buying any more Microclouds. Of course you like to point out that my numbers are statistically irrelevant but not theirs. And yet you feel it important to share your experience (and how it is good) even though your volume too is probably statistically irrelevant.
You're very selective with what you read and respond to. And you know what, that's fine. This is LET. LET is a casual place, but wtf is all this about being professional or not. This is fucking LET, who gives a fucking shit. You think it's professional to call out another provider for being unprofessional?
Maybe this is an overreaction. And maybe our company is just unlucky with SM. If you really don't have any of these problems, maybe you're freakishly lucky. But like I said, if you buy 10 switches, and 7 have some problem or another, are you really going to give that switch another go? Lucky number 11? 12? How many times will you try before you give up? Until it's statistically significant? That's a lot of money down the drain if you have to wait for so many failures that your experience makes a statistically significant impact!
Yes my OP is biased, and yes this is a bit of a rant. Who the hell cares? That's what people fucking do on LET! They fucking complain! Whether or not it's of any statistical impact is apparently irrelevant. So I can only conclude you're just trolling. Maintaining an air of professionalism while trolling is impressive though. Kudos!
I take it you don't buy much else from other brands and so you don't have anything to recommend?
We have thousands of Supermicro servers and have rarely seen these issues, but we purchase through a system integrator who gives us a 3-year warranty as well. Everything is tested before being shipped, and they send replacements for defective units.
Any microclouds?
Did you really have no problems or were they all just fixed / replaced by the vendor?
I think the moral of this is QC/QA matters. If your supplier is doing minimal QC for you, then that ultimately falls on you to do a thorough test of your hardware before putting it in production.
Supermicro is desktop-like crap hardware.
Only those who can't afford premium stuff (e.g. HP) use it. You get what you pay for.
No problems with Microclouds, but our vendor has said the pins are very sensitive and to gently re-seat them if you take them out. We use the hot-swappable bays to avoid removing/re-seating the blades.
I guess some SuperMicro suppliers are better than others, but you definitely have no leg to stand on to call them shitty if your sample size is 10. You may get 80 or 120 servers depending on the MicroClouds, but it's still one chassis and would be counted as a single unit in this instance.
A sample size of 10 isn't anything to judge. You could just have the crappiest luck and be getting a dodgy one every time or it could just be how the chassis is stored/handled by the supplier.
To call them shitty over your sample size isn't right and your personal experience does count for something for sure but it's like me saying "I won't buy x brand servers as I've experienced a 100% failure rate with them!" but only buying 1 server.
Hardware goes wrong and needs replacing or physical troubleshooting at times but if your HK dealer isn't giving you any good support then don't use him. Ask SuperMicro for reps in that region that they trust and explain your experiences thus far. Same goes for your EU dealer but he seems to at least try to rectify issues.
What about the temperature and/or the air flow of the rack? Perhaps it is too hot or too cold? Yes, too cold is bad too.
It's not a sample of 10. It's a sample of hundreds. Just because I only have 10 micro clouds, does not mean I only have 10 things with SuperMicro.
Also, even if it were, there are 80 individual nodes (8 within each chassis). If 10 out of 80 nodes have a problem, does that not count as 10 separate issues? If not, why? I've had to RMA several of the individual nodes. Why does it only count as 1 when there are 8 nodes?
There are only 10 back planes, and I've had problems with 4 of them in less than 2 years. Perhaps it would be fair to say I have a sample of 10 back planes, but the nodes should count as much as any standalone motherboard.
Likewise if I had power supply failure, there are 2 PSUs per chassis. So the sample size just from the 10 chassis is 20. I haven't had any problems with any SM power supply so far but I've had problems with back planes and individual nodes. Different problems, on different individual hardware pieces. It would be insane to count the sample size as just 10.
But even so, I have mentioned problems with the basic boards, specifically having IPMI issues on the E3 microATX boards. The sample there is a couple hundred at least. Then there are the E5 boards, which is at least another 100. So we're talking about a total sample size of more like 500, and that's actually wrong because I'm ignoring all the hardware that's already been decommissioned or sold.
I don't think that's the same at all. You buy 1 lemon and maybe you're just unlucky. If the fault rate is less than 1% across the board and you buy 1 server, there is a 1% chance you'll get a lemon. But when you see a 10% fault rate, the chance that it's just bad luck is tiny.
Think about it, if you start with the premise that there is a 1% chance that what you buy turns out to be a lemon, then it's 1 in a 100. If you buy a 2nd one and its also a lemon that's 1 in 10,000. If you buy a 3rd one and it's also a lemon, that's 1 in 1,000,000.
Now obviously that's not the case here. I have plenty of decent and working SM gear. But if the fault rate is significantly above the claimed average, I would start questioning the accuracy of the claimed average.
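To put rough numbers on the argument above, here's a minimal sketch. The 1% per-unit fault rate and the independence of failures are both my assumptions for illustration, not figures from this thread. Under those assumptions, the chance of getting 4 or more bad backplanes out of 10 by luck alone is about two in a million:

```python
from math import comb

def p_at_least(n: int, k: int, p: float) -> float:
    """Probability of k or more failures among n units,
    assuming independent failures at a per-unit fault rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed 1% per-unit fault rate (illustrative only).
# 4+ bad backplanes out of 10 by chance: roughly 2e-6.
print(f"{p_at_least(10, 4, 0.01):.2e}")
```

If the observed failure rate is that improbable under the claimed average, the sensible conclusion is the one drawn above: the claimed average probably doesn't hold for this product line.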
We've purchased most of our SM hardware from just 5 sellers. 3 of them are authorised dealers, 2 located in Hong Kong and 1 in Europe. The other 2 sell second hand SM hardware from USA and UK.
I've only had complete failures from the NEW Microclouds, both nodes and backplanes. We've ordered MicroClouds from all 3 of those authorised dealers, and all 3 have shipped an MC with some sort of problem. The only MCs we haven't seen problems with are the 2nd hand ones from the US. All the new ones are the same model (but different batches). All the problem ones are for E3 v5/v6 CPUs. The E5 microclouds have been solid so far.
Since the MCs come from different vendors and are from different batches, what are the odds that they would all ship problematic units?
If it's our environment, then why are the 2nd hand units holding up better than the new ones? And why do the E3 boards seem to have more problems than the E5s? And why do we not see the same fault rate in our non-Supermicro hardware?
We're not about to stop buying Supermicro boards. But we'll certainly stop buying the Microclouds! And I'm thinking it may be best to stay away from E3 or mATX boards and stick with the E5s. But if you break things down like that then of course the samples get even smaller.
Doubtful. Humidity, possibly? Temperatures vary by a few degrees depending on where in the DC, but it's about 23-27C. Humidity is around 40%. There is a vent next to the 2 racks where the micro clouds are hosted, so it gets a lot of airflow. The MCs run cooler than our average servers. With that airflow and relatively dry conditions there could be some static building up, but everything is grounded. The chassis are double grounded, as is the rack and every other piece of equipment, so that shouldn't be it.
I have no idea.
Uh, great, yet another rant.
Yeah, this is getting a little old. Just how many PMs do you get in a month?
Yes. Only had good experience with those.