⭐ Hosted on our own physical server ⭐ HAZI.ro | VPS SSD | Game | Storage

Levi · October 2021

LET-50 does not work for plans listed in first message. Even with discount, price is huge. BF is coming, a lot of reputable providers will offer more powerful packages than yours. Bad timing mate, very bad. Your best bet is to launch 1 GB VPS for 9.99EUR/year.

FlorinMarian · October 2021

@LTniger said:
LET-50 does not work for plans listed in first message. Even with discount, price is huge. BF is coming, a lot of reputable providers will offer more powerful packages than yours. Bad timing mate, very bad. Your best bet is to launch 1 GB VPS for 9.99EUR/year.

Thank you for feedback.
Unfortunatelly that offer isn't suitable because we pay 1.20 eur pm/IP.
With NAT behind we can easily do that kind of offer.

Best regards, Florin.

Andrews · October 2021

@FlorinMarian said:
With NAT behind we can easily do that kind of offer.

FlorinMarian · November 2021

Boom..node is offline for at least few hours.
Short-long story.
We bought this server with 128GB DDR4 ECC memory.
After colocating this server, we had 4 freezes in 3 days using proxmox 7.
We reinstalled proxmox with 6th version, but after 12 hours exactly the same effect.
Now we're dealing to debug which is the DIMM which causes this problem, and the huge problem is that we have to pay remote hands for each hardware procedure (removal/install of DIMM).
If anyone can help us with this, we'll be very happy.

Those who are affected by those changes will receive at least one week free service.

Best regards, Florin.

**We know that it comes from RAM because we found on IPMI log "Uncorrectalble ECC" error and it really makes sense because Proxmox is installed on ZFS.

cybertech · November 2021

interesting , what brand of ram was this?

FlorinMarian · November 2021

@cybertech said:
interesting , what brand of ram was this?

I don't rememeber properly, I had this server under 1 hour on my house.
It is for sure Samsung or HP, not anything else.

Best regards, Florin.

jugganuts · November 2021

So you didn't do a burn in test before sending it off..? Seems like a rookie mistake.

FlorinMarian · November 2021

@jugganuts said:
So you didn't do a burn in test before sending it off..? Seems like a rookie mistake.

It was 4PM when it arrived and at 6PM it was in datacenter, 1 hour driving.
They did minimal tests and nobody expected tbh to recognize every RAM memory but with some issues after half of a day.

snz · November 2021

They did minimal tests

Well it's your server. You should do the testing

yoursunny · November 2021

@FlorinMarian said:

@jugganuts said:
So you didn't do a burn in test before sending it off..? Seems like a rookie mistake.

It was 4PM when it arrived and at 6PM it was in datacenter, 1 hour driving.
They did minimal tests and nobody expected tbh to recognize every RAM memory but with some issues after half of a day.

1 hour driving and the server comes back?
No more remote hands fees.

FlorinMarian · November 2021

@yoursunny said:

@FlorinMarian said:

@jugganuts said:
So you didn't do a burn in test before sending it off..? Seems like a rookie mistake.

It was 4PM when it arrived and at 6PM it was in datacenter, 1 hour driving.
They did minimal tests and nobody expected tbh to recognize every RAM memory but with some issues after half of a day.

1 hour driving and the server comes back?
No more remote hands fees.

Normally, yeah.
But I have courses all day long.

yoursunny · November 2021

@FlorinMarian said:

@yoursunny said:

@FlorinMarian said:

@jugganuts said:
So you didn't do a burn in test before sending it off..? Seems like a rookie mistake.

It was 4PM when it arrived and at 6PM it was in datacenter, 1 hour driving.
They did minimal tests and nobody expected tbh to recognize every RAM memory but with some issues after half of a day.

1 hour driving and the server comes back?
No more remote hands fees.

Normally, yeah.
But I have courses all day long.

1 hour after classes and the server comes back?
Put it back to data center a few days later.
Do push-ups to compensate your clients for the downtime.

Well, that's what I did in 2007 when I'm in charge of a university website.
I used a bicycle to move the server to my dorm (100 meters but it's 4th floor without elevator), and paid my helper with a chicken leg.
The website was down for 3 days when I fixed the server.
Every website user who submitted a claim received a small discount on their dorm's electricity bill.

Levi · November 2021

Almost like science fiction episode

servarica_hani · November 2021

ok this is not that common I guess you have bad luck

through out all those years we probably had 3 mem sticks that were corrupt so it is really rare

what you need to do is to run the burn in test you are running but the problem mostly those errors will not be emitted immediately the test has to run few days to see it

other things to check
1- make sure you install the ram in correct slots based on the motherboard instruction
2- re-seat the cpus and while removing the cpu see if any of the pins bent (look closely) (same for ram)
3- change the ram order by replacing stick1 with stick2 and so on (sometimes just by keep changing the order you will find a configuration that will fail on bios POST or the OS boot )
4- check the heat of mem ,cpu and chipset from ipmi while running the tests , it could be the cause not even related to ram and you have faulty fan/s (heating issue)

also it would help if you have screenshot of the full error maybe we can get something from it

jsg · November 2021

@yoursunny said:

@FlorinMarian said:
Our Storage packages are RAID6
I took my lessons and installed RAID-6 for HDD packages.

Software RAID, what can go wrong?

The LSI SAS 9305-16i seems to support HW Raid. And ZFS raidz is widely praised.

@yoursunny said:
My classmate created a software RAID (same data on 2 disks).

Way too little info to come to a judgement.

The disk keeps changing name between md0 and md127, so that he has to edit fstab after half of the reboots.

Well known linux problem that can be dealt with (I also experienced it myself)

The machine stops working in 2019 after 5 years of heavy use, and he found both disks failed at the same time.

That may mean a lot or basically nothing. Even the statement itself is doubtful (because it tells what your classmate perceived/thought but not necessarily what really happened).

He lost all his research data because he thinks he has RAID and doesn't need other backups.

Yes. It's mentioned frequently here but I'll repeat it anyway: RAID is no backup replacement!

I used a machine hardware RAID (3 disks providing capacity of 2 disks) during an internship in 2009.
Once I saw a disk is blinking yellow.
My manager pulled out the disk and put in another one, while the server is running.

Yes, that is indeed one of the advantages of HW Raid. But there's also a disadvantage: Oftentimes an exact replacement is needed if a HW Raid controller breaks. Not a big problem for e.g. providers with many identical machines or a spare controller, but it can bite normal users with but one single card.

@vyas11 said:
Can you post BM for those VPS which use older script?

Already running ...

@PieHasBeenEaten said:
Why you using raid6? Raid 10 would of net you same amount of space and faster reads and writes.

For 4 disks, yes, but not for more.

@FlorinMarian said:
Have a look here, please. [ https://lowendtalk.com/discussion/174629/raid5-on-backup-space ]

Careful there, Raid 6 helps to make a rebuild situation considerably less likely, but if an array rebuild were necessary you wouldn't be off better than with Raid 5. They are both snails in terms of rebuilding; in fact Raid 6 is likely to be even (a bit) worse.

@FlorinMarian said:
Boom..node is offline for at least few hours.
Short-long story.
We bought this server with 128GB DDR4 ECC memory.
After colocating this server, we had 4 freezes in 3 days using proxmox 7.
We reinstalled proxmox with 6th version, but after 12 hours exactly the same effect.
Now we're dealing to debug which is the DIMM which causes this problem, and the huge problem is that we have to pay remote hands for each hardware procedure (removal/install of DIMM).
If anyone can help us with this, we'll be very happy.

A full BIOS post test should detect a RAM error and a good BIOS should even indicate the socket.

FlorinMarian · November 2021

@jsg unfortunatelly our BIOS doesn't do so many tasks as on ProLiant Servers.
It boots up in less than 2 minutes, with minimal tests.
The error related to RAM via IPMI didn't said anything about memory socket.

Today our datacenter staff pulled of all RAM DIMMs and connected them again. This was also the 1st step indicated by seller which said that they'll replace any spare part if required.
There could be a connection problem because the server had 2500 km travel from UK to Romania and for sure lot of shocks.

We'll see next hours how it will do.

Best regards, Florin.

jsg · November 2021

@FlorinMarian said:
@jsg unfortunatelly our BIOS doesn't do so many tasks as on ProLiant Servers.
It boots up in less than 2 minutes, with minimal tests.
The error related to RAM via IPMI didn't said anything about memory socket.

Today our datacenter staff pulled of all RAM DIMMs and connected them again. This was also the 1st step indicated by seller which said that they'll replace any spare part if required.
There could be a connection problem because the server had 2500 km travel from UK to Romania and for sure lot of shocks.

We'll see next hours how it will do.

Tough luck. Frankly if I were in your shoes I'd stop trying to find a more "elegant" way and just announce some downtime, drive to the DC, rip out 1 mem. bank ("every other stick"), boot and run a memory test (plenty available for Proxmox/debian), then rinse and repeat (put back 1 bank of the bad half and boot and test again) until the bad stick is identified.

yoursunny · November 2021

@jsg said:

@yoursunny said:
My classmate created a software RAID (same data on 2 disks).

Way too little info to come to a judgement.

It's RAID-1 over two identical 4TB SAS hard drives.

The disk keeps changing name between md0 and md127, so that he has to edit fstab after half of the reboots.

Well known linux problem that can be dealt with (I also experienced it myself)

A problem that does not occur with hardware RAID.

The machine stops working in 2019 after 5 years of heavy use, and he found both disks failed at the same time.

That may mean a lot or basically nothing. Even the statement itself is doubtful (because it tells what your classmate perceived/thought but not necessarily what really happened).

The hardware failure was confirmed by university computer lab staff, who is experienced in hardware diagnosis.

He lost all his research data because he thinks he has RAID and doesn't need other backups.

Yes. It's mentioned frequently here but I'll repeat it anyway: RAID is no backup replacement

I encountered a hard drive failure (single disk no RAID) before this classmate.
I only lost two days, because I have weekly backup.
This classmate knew my story yet he had no backups.

I used a machine hardware RAID (3 disks providing capacity of 2 disks) during an internship in 2009.
Once I saw a disk is blinking yellow.
My manager pulled out the disk and put in another one, while the server is running.

Yes, that is indeed one of the advantages of HW Raid. But there's also a disadvantage: Oftentimes an exact replacement is needed if a HW Raid controller breaks. Not a big problem for e.g. providers with many identical machines or a spare controller, but it can bite normal users with but one single card.

The controller did not fail.
One disk failed, and the manager found a spare that is same model.
We have many servers of the same model.

jsg · November 2021

@yoursunny said:

The disk keeps changing name between md0 and md127, so that he has to edit fstab after half of the reboots.

Well known linux problem that can be dealt with (I also experienced it myself)

A problem that does not occur with hardware RAID.

Sure, but my point was that it's certainly no show stopper.

The machine stops working in 2019 after 5 years of heavy use, and he found both disks failed at the same time.

That may mean a lot or basically nothing. Even the statement itself is doubtful (because it tells what your classmate perceived/thought but not necessarily what really happened).

The hardware failure was confirmed by university computer lab staff, who is experienced in hardware diagnosis.

With all due respect, I don't care what the lab staff said unless I see tangible relevant info. Why? Because if the two drives really failed at the same point in time then with very high likelihood a quite different problem were to be looked at and discussed. As I do not think your classmate was simply lying I have to assume that he - and the lab staff - presumed that both disks failed at the same time but have no hard data to confirm that impression.

He lost all his research data because he thinks he has RAID and doesn't need other backups.

Yes. It's mentioned frequently here but I'll repeat it anyway: RAID is no backup replacement

I encountered a hard drive failure (single disk no RAID) before this classmate.
I only lost two days, because I have weekly backup.
This classmate knew my story yet he had no backups.

Well, then frankly he deserved it.

I used a machine hardware RAID (3 disks providing capacity of 2 disks) during an internship in 2009.
Once I saw a disk is blinking yellow.
My manager pulled out the disk and put in another one, while the server is running.

Yes, that is indeed one of the advantages of HW Raid. But there's also a disadvantage: Oftentimes an exact replacement is needed if a HW Raid controller breaks. Not a big problem for e.g. providers with many identical machines or a spare controller, but it can bite normal users with but one single card.

The controller did not fail.
One disk failed, and the manager found a spare that is same model.
We have many servers of the same model.

Yes that's what was to be expected. But still, HW Raid controllers do occasionally fail.

yoursunny · November 2021

@jsg said:

The machine stops working in 2019 after 5 years of heavy use, and he found both disks failed at the same time.

That may mean a lot or basically nothing. Even the statement itself is doubtful (because it tells what your classmate perceived/thought but not necessarily what really happened).

The hardware failure was confirmed by university computer lab staff, who is experienced in hardware diagnosis.

With all due respect, I don't care what the lab staff said unless I see tangible relevant info. Why? Because if the two drives really failed at the same point in time then with very high likelihood a quite different problem were to be looked at a discussed. As I do not think your classmate was simply lying I have to assume that he - and the lab staff - presumed that both disks failed at the same time but have no hard data to confirm that impression.

My guess is that one disk failed but he didn't notice, then another failed and machine went offline.
When he requested lab staff assistance, they saw both disks failed.

I can guess as such because he doesn't take care of his server.
I often see 100+ Ubuntu packages pending upgrades on his server.
Moreover, he never installed smartctl command on the server.

The server is in a locked room, not the student lab.
If he doesn't go there look for yellow blinking light, and doesn't have smartctl, he would not know when the first disk failed.

jsg · November 2021

@yoursunny said:
My guess is that one disk failed but he didn't notice, then another failed and machine went offline.
When he requested lab staff assistance, they saw both disks failed.

I can guess as such because he doesn't take care of his server.
I often see 100+ Ubuntu packages pending upgrades on his server.
Moreover, he never installed smartctl command on the server.

The server is in a locked room, not the student lab.
If he doesn't go there look for yellow blinking light, and doesn't have smartctl, he would not know when the first disk failed.

Yes, that matches quite well what my take is.

FlorinMarian · November 2021

26 hours and server it's still alive.
We're 99% sure about that pulling out & installing again RAM DIMMs solved the problem.
Thank you for those who replied here with tips.
Best regards, Florin.

cybertech · November 2021

@FlorinMarian said:
26 hours and server it's still alive.
We're 99% sure about that pulling out & installing again RAM DIMMs solved the problem.
Thank you for those who replied here with tips.
Best regards, Florin.

did you blow on the ram pins before inserting them in before?

yoursunny · November 2021

@cybertech said:

@FlorinMarian said:
26 hours and server it's still alive.
We're 99% sure about that pulling out & installing again RAM DIMMs solved the problem.
Thank you for those who replied here with tips.
Best regards, Florin.

did you blow on the ram pins before inserting them in before?

I thought you are supposed to dip them in honey and lick?

FlorinMarian · November 2021

WARNING!!!

One hour later after I've declared everything is fine, server frozen again.
I've pulled it out from Datacenter and I hope that in about 24h to finish memory testing (let's hope i'll find the problem there).
Otherwise I'll lost my mind because I dont know what's wrong with it.

I'm really sorry for those who have active services hosted on it and also for newer customers which paid but didn't received the service yet.

I promise that i'll keep you up to date.

Best regards, Florin.

servarica_hani · November 2021

the server layout looks interesting what is the model number for this server ?

FlorinMarian · November 2021

@servarica_hani said:
the server layout looks interesting what is the model number for this server ?

Good morning!
It is "HPE Cloudline CL3100 G3". I love it because of 12x LFF and 4x SFF (at the same time).
Also it has only PCI-E SFP network cards at 10Gbps.

About server's problems: 2/8 Memory DIMMs passed successfully Memtest86 test.

Someone asked few days ago what brand memory is, it is "Micron".

Best regards, Florin.

FlorinMarian · November 2021

UPDATE..sad update.
We spent 4x 1h 40' for each test (100% completion on Memtest86 PRO) and it didn't found any issue with DIMMs.
We're very afraid because we have no idea what to do.
Any kind of feedback is welcome.

Mr_Tom · November 2021

So, it tests fine at home with the cover off? Have you tried just leaving it running Proxmox to see if it crashes?

If not, but it's crashing when racked surrounded by other servers perhaps as @servarica_hani mentioned it's worth checking the temps?

FlorinMarian · November 2021

@Mr_Tom said:
So, it tests fine at home with the cover off? Have you tried just leaving it running Proxmox to see if it crashes?

If not, but it's crashing when racked surrounded by other servers perhaps as @servarica_hani mentioned it's worth checking the temps?

Thank you for reply!
At home I also covered it because the fans goes crazy during memtest without cover.
Also on IPMI I had a look and there weren't evey anything over 40 degrees.
And about crashing..it taken 26 hours working fine before crash.
Also if I leave it as it is, how this will help me to solve the problem as long I dont know the cause?
Thank you!

Howdy, Stranger!

Categories

In this Discussion

⭐ Hosted on our own physical server ⭐ HAZI.ro | VPS SSD | Game | Storage

Comments

Howdy, Stranger!

Quick Links

Categories

In this Discussion

⭐ Hosted on our own physical server ⭐ HAZI.ro | VPS SSD | Game | Storage

Comments