Comments
It's 80 servers affected, so it's not unreasonable. And it's not just the Microclouds. It's their standalone boards that have problems too.
Even so, if you order 10 things from one manufacturer, costing around $50k, ordered at different times and so from different batches, and some different models, and you get a high failure rate, there is something going on!
Based on the replies here, I'm not the only one. @leapswitch reported having problems too. I doubt many providers have a sample size of 10,000! 10,000 MicroClouds would mean at least 80,000 servers. Do you have 80,000 servers @Clouvider? Even if you do, are that many of them MicroClouds?
No one company will have a sample size large enough to be statistically valid.
More than you, and as shown in many other responses against your thesis, these are not widespread issues.
I think you're overreacting a lot. Asking the community whether they also have an issue is frankly very different from making claims and calling them names.
I am not sure whether Google still does, but they used to publish reports on HDD failures in their data centers. Their sample size was over 10,000.
Now, that's a report I can trust.
No doubt. But statistically relevant? I'd be very impressed if you guys had 10,000 Microclouds in deployment.
Perhaps. How many failures do you accept from your suppliers/vendors before you start looking for alternatives? The question remains, any reasonable alternatives?
Someone suggested ASRock. Anything else?
India has 2 major Supermicro distributors, and both have had the same experience with 8-node microclouds deployed across India. Hardly any microclouds have gone without at least one backplane replacement. So it's not limited to units sold to / used by Leapswitch.
IPMI white screens and connection errors used to happen on older versions, especially if the IPMI was on a public IP. We haven't seen such issues in the past 2-3 years on any Supermicro IPMI behind a VPN/firewall/private IP.
Backblaze are good at posting such information.
But are you seriously suggesting that your own experiences are irrelevant?
Besides, 10 Chassis = 80 Servers. That's no small number. If you buy 10 switches, and 7 of them show some kind of problem, would you not consider looking for alternatives? Would you not think that maybe there is something wrong with those switches and avoid them in the future?
Statistically irrelevant. @Clouvider has more Microclouds and SM gear in deployment than all of India combined, and he has 0 failures.
Is Tyan still active? They used to compete with SM. I haven't seen them for some years now.
In some cases, yes, my own subjective experience is moot. If I am doing a job professionally, I rely on solid data instead of my own experience.
That is my code.
Yes, we used Tyan for years! From 2003-2010 we bought either consumer hardware, Tyan, or Intel. They've completely disappeared.
And now you're taking this personally. Not cool.
The difference between your OP and that of @LeapSwitch is that @LeapSwitch shares their experience in a professional manner; you're not. You start by calling names and claiming they are shitty and worse than consumer-oriented equipment, based on a sample of 10.
Fujitsu Primergy is another option to consider if you are going for brands such as Supermicro. Otherwise, go with HP, Dell or IBM.
Tyan went almost exclusively into big-client deals. Distribution says they are now coming back, trying to win back their market share.
Out of curiosity, do you check often how stable/clean the electricity supply going to your racks is? Fluctuations can cause random failures and slowly kill your boards' circuits.
Wouldn't fluctuations be filtered by the UPS?
I changed all the hardware (motherboard/mem/cpu/power supply/chassis) and it still locked up intermittently. Used the recommended memory. So the Intel server had a flawed design, most likely in the motherboard. I'm not the only one, if you do a search. Intel doesn't have a very good reputation for their branded servers.
Yes, but like any other piece of hardware, it could go sour and stop properly protecting your precious equipment.
Yes, and all the equipment is on UPS. Perfect sine wave. Though the microclouds have dual PSUs, and the second PSU is connected directly to mains rather than to the same UPS. But HK has a pretty stable power supply.
Almost all servers are single-PSU and all connect to a UPS. Every rack has a dual power feed, with primary power from the UPS and secondary from mains. An STS controls the source, so the UPS is definitely not a possible problem.
In any case, the failures are not happening over a long period of time. Old gear doesn't fail. Problems normally occur within the warranty period. If they still work after warranty, they don't die; they just get too old to use. I've got plenty of E5 v1 and v2 still in service using SM boards, and they work great (other than the Java IPMI).
In terms of boards, it's mostly the microATX boards that support E3 v3 CPUs. In fact, I've not had any major problems with any of the E5 boards of any generation other than a couple with some dodgy DIMM slots. Those we still use but with less RAM.
Even our Microclouds are mostly E3 (all v5/v6), and so far only 2 have been completely fault-free. The E5 microclouds were purchased second hand from the US and they are solid (so far). But the dodgy ones were all purchased new, and they are from different batches.
I can accept some small issues if the board is still usable. IPMI may be dodgy but for the most part the boards do continue to work. As I said above, probably less than 1% of the boards actually fail (ever) but when you include small problems the number is considerably higher. But backplane issues are unacceptable, and I don't care if a sample of 10 is small. That's 80 servers and over US$50k of SM equipment and around $100k when you include all the RAM/CPUs/Drives.
Maybe in the US and UK you have decent RMA turnaround times, but in HK we don't, and that's very frustrating.
As far as I know, you cannot buy standard form factor boards from these manufacturers and need to buy whole barebones systems, no?
Will have a look. Thanks.
Jeeze. It's just a joke. Have a sense of humour!
Let me go back through your comments.
Now you know of at least 4 on this thread who have had similar problems.
That doesn't explain why it's only the new stuff that's affected. And if our build methods were so bad, you'd expect we'd have more problems with the standalone boards too.
You're not offering anything helpful, you're passively trying to direct the fault in our direction. That's 'not cool'!
And yet if you count the number of people who have experienced problems and how many have not, you'll find more WITH problems than without. So much so that @leapswitch has already stated they won't be buying any more Microclouds. Of course you like to point out that my numbers are statistically irrelevant but not theirs. And yet you feel it important to share your experience (and how it is good) even though your volume too is probably statistically irrelevant.
You're very selective with what you read and respond to. And you know what, that's fine. This is LET. LET is a casual place, but wtf is all this about being professional or not. This is fucking LET, who gives a fucking shit. You think it's professional to call out another provider for being unprofessional?
Maybe this is an overreaction. And maybe our company is just unlucky with SM. If you really don't have any of these problems, maybe you're freakishly lucky. But like I said, if you buy 10 switches, and 7 have some problem or another, are you really going to give that switch another go? Lucky number 11? 12? How many times will you try before you give up? Until it's statistically significant? That's a lot of money down the drain if you have to wait for so many failures that your experience makes a statistically significant impact!
Yes my OP is biased, and yes this is a bit of a rant. Who the hell cares? That's what people fucking do on LET! They fucking complain! Whether or not it's of any statistical impact is apparently irrelevant. So I can only conclude you're just trolling. Maintaining an air of professionalism while trolling is impressive though. Kudos!
I take it you don't buy much else from other brands and so you don't have anything to recommend?
We have thousands of Supermicro servers and have rarely seen these issues, but we purchase through a system integrator who gives us a 3-year warranty as well. Everything is tested before being shipped, and they send replacements for defective units.
Any microclouds?
Did you really have no problems or were they all just fixed / replaced by the vendor?
I think the moral of this is QC/QA matters. If your supplier is doing minimal QC for you, then that ultimately falls on you to do a thorough test of your hardware before putting it in production.
Supermicro is desktop-like crap hardware.
Only those who can't afford premium stuff (e.g. HP) use it. You get what you pay for.
No problems with Microclouds, but our vendor has said the pins are very sensitive and to gently re-seat them if you take them out. We use the hot-swappable bays to avoid removing/re-seating the blades.
I guess some SuperMicro suppliers are better than others, but you definitely have no leg to stand on to call them shitty if your sample size is 10. You may get 80 or 120 servers depending on the MicroClouds, but it's still one chassis and would be counted as a single unit in this instance.
A sample size of 10 isn't anything to judge. You could just have the crappiest luck and be getting a dodgy one every time or it could just be how the chassis is stored/handled by the supplier.
To call them shitty over your sample size isn't right and your personal experience does count for something for sure but it's like me saying "I won't buy x brand servers as I've experienced a 100% failure rate with them!" but only buying 1 server.
Hardware goes wrong and needs replacing or physical troubleshooting at times but if your HK dealer isn't giving you any good support then don't use him. Ask SuperMicro for reps in that region that they trust and explain your experiences thus far. Same goes for your EU dealer but he seems to at least try to rectify issues.
What about the temperature and/or the air flow of the rack? Perhaps it is too hot or too cold? Yes, too cold is bad too.
It's not a sample of 10. It's a sample of hundreds. Just because I only have 10 micro clouds, does not mean I only have 10 things with SuperMicro.
Also, even if it were, there are 80 individual nodes (8 within each chassis). If 10 out of 80 nodes have a problem, does that not count as 10 separate issues? If not, why? I've had to RMA several of the individual nodes. Why does it only count as 1 when there are 8 nodes?
There are only 10 back planes, and I've had problems with 4 of them in less than 2 years. Perhaps it would be fair to say I have a sample of 10 back planes, but the nodes should count as much as any standalone motherboard.
Likewise if I had power supply failure, there are 2 PSUs per chassis. So the sample size just from the 10 chassis is 20. I haven't had any problems with any SM power supply so far but I've had problems with back planes and individual nodes. Different problems, on different individual hardware pieces. It would be insane to count the sample size as just 10.
But even so, I have mentioned problems with the basic boards, specifically having IPMI issues on the E3 microATX boards. The sample there is a couple hundred at least. Then there are the E5 boards, which is at least another 100. So we're talking about a total sample size of more like 500, and that's actually wrong because I'm ignoring all the hardware that's already been decommissioned or sold.
I don't think that's the same at all. You buy 1 lemon and maybe you're just unlucky. If the fault rate is less than 1% across the board and you buy 1 server, there is a 1% chance you'll get a lemon. But when you see a 10% fault rate, the chance that it's just bad luck is tiny.
Think about it, if you start with the premise that there is a 1% chance that what you buy turns out to be a lemon, then it's 1 in a 100. If you buy a 2nd one and its also a lemon that's 1 in 10,000. If you buy a 3rd one and it's also a lemon, that's 1 in 1,000,000.
Now obviously that's not the case here. I have plenty of decent and working SM gear. But if the fault rate is significantly above the claimed average, I would start questioning the accuracy of the claimed average.
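To put rough numbers on the argument above, here's a minimal sketch. The 1% per-unit fault rate and the independence of failures are both my assumptions for illustration, not figures from this thread. Under those assumptions, the chance of getting 4 or more bad backplanes out of 10 by luck alone is about two in a million:

```python
from math import comb

def p_at_least(n: int, k: int, p: float) -> float:
    """Probability of k or more failures among n units,
    assuming independent failures at a per-unit fault rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed 1% per-unit fault rate (illustrative only).
# 4+ bad backplanes out of 10 by chance: roughly 2e-6.
print(f"{p_at_least(10, 4, 0.01):.2e}")
```

If the observed failure rate is that improbable under the claimed average, the sensible conclusion is the one drawn above: the claimed average probably doesn't hold for this product line.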
We've purchased most of our SM hardware from just 5 sellers. 3 of them are authorised dealers, 2 located in Hong Kong and 1 in Europe. The other 2 sell second hand SM hardware from USA and UK.
I've only had complete failures from the NEW Microclouds, both nodes and backplanes. We've ordered MicroClouds from all 3 of those authorised dealers, and all 3 have shipped an MC with some sort of problem. The only MCs we haven't seen problems with are the 2nd hand ones from the US. All the new ones are the same model (but different batches). All the problem ones are for E3 v5/v6 CPUs. The E5 microclouds have been solid so far.
Since the MCs come from different vendors and are from different batches, what are the odds that they would all ship problematic units?
If it's our environment, then why are the 2nd hand units holding up better than the new ones? And why do the E3 boards seem to have more problems than the E5s? And why do we not see the same fault rate in our non-Supermicro hardware?
We're not about to stop buying Supermicro boards. But we'll certainly stop buying the Microclouds! And I'm thinking it may be best to stay away from E3 or mATX boards and stick with the E5s. But if you break things down like that then of course the samples get even smaller.
Doubtful. Humidity, possibly? Temperatures vary by a few degrees depending on where in the DC, but it's about 23-27C. Humidity is around 40%. There is a vent next to the 2 racks where the micro clouds are hosted, so it gets a lot of airflow. The MCs run cooler than our average servers. With that airflow and relatively dry conditions there could be some static building up, but everything is grounded. The chassis are double grounded, as is the rack and every other piece of equipment, so that shouldn't be it.
I have no idea.
Uh, great, yet another rant.
Yeah, this is getting a little old. Just how many PMs do you get in a month?
Yes. Only had good experience with those.