HOST-C, Chat, Updates, Stuff

fly056 · July 2025

@maverick said:

@host_c said:

@maverick said: send Stuart with a hammer... to do a hard reset

He shall not do such pornographic things......

He does like to fix things with the hammer tho......

who doesn't?

dunno what's the problem with that sea gate, but fwiw i've always trusted WD more especially that series based on HGST tech they bought many years ago... those are animals, i just hope i haven't jinxed it now (some of mine are just about out of warranty as i write this)...

good luck with repair!

I have some HC520 drives on my home NAS. They are great so far. I got them as server pulls with a 5 year warranty for $75 each.

host_c · July 2025

Ok, official info was sent to remaining affected customers, please check your e-mails.

Here is a timeline of the fuckup:

July 27, 2025 – Afternoon (GMT+3):

Multiple Seagate ST18000NM019J drives (firmware KM02) across two nodes suddenly powered down due to a firmware-related failure. Drives began reporting critical SMART alerts (Data channel impending failure), causing the RAID-6/60 array to become unavailable.

Result:
Addon storage volumes became inaccessible, and VPS services depending on those volumes were disrupted. Some NVMe-based systems also experienced write issues due to OS-level I/O buffering.

July 28, 2025 – Morning:
Our team accessed the datacenter, identified the fault, and began recovery efforts. All NVMe-only VPS services were successfully migrated to healthy nodes.

July 28–29, 2025:
RAID array access was restored in degraded mode, enabling partial access to addon volumes at limited transfer speeds.

🧪 Root Cause

Firmware fault affecting multiple ST18000NM019J (KM02) drives simultaneously

RAID controller entered fault mode due to concurrent SMART failures

No physical disk damage, no reallocated sectors or ECC errors — this was purely firmware-triggered
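For context on why RAID-6/60 could not absorb this: a RAID-6 group tolerates two failed members, and RAID-60 is a stripe over several such groups, so one span losing three or more drives at once takes the whole array down. A minimal sketch of that rule; the span layouts in the example are hypothetical, since the real layout is not stated here.

```python
def array_survives(failed_per_span, parity_drives=2):
    """Return True if no span has lost more drives than its parity can cover.

    failed_per_span: number of failed drives in each RAID-6 span.
    parity_drives:   2 for RAID-6 (and for every span of a RAID-60).
    """
    return all(failed <= parity_drives for failed in failed_per_span)


# Hypothetical examples, not the actual array layout.
print(array_survives([1, 1]))  # one failure in each of two spans
print(array_survives([3, 0]))  # three simultaneous failures in a single span
```

The point is that RAID protects against independent failures; several drives dropping in the same instant from the same cause breaks that assumption.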

🛡️ Mitigation Going Forward

We are conducting a full infrastructure audit to identify any remaining ST18000NM019J drives with KM02 firmware

Affected drives will be proactively replaced or updated, where supported

RAID monitoring thresholds and firmware validation processes are being tightened to catch these failures earlier

This was an unprecedented firmware-level failure that bypassed typical RAID fault tolerance. We appreciate your understanding as we finalize recovery efforts for impacted systems.
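The audit step can be sketched roughly like this, assuming you already have an inventory of drives with model and firmware fields. The `Drive` record, the node names, the serials and the non-affected model/firmware strings below are made-up placeholders, not our real inventory.

```python
from dataclasses import dataclass

AFFECTED_MODEL = "ST18000NM019J"
AFFECTED_FIRMWARE = "KM02"


@dataclass(frozen=True)
class Drive:
    node: str
    serial: str
    model: str
    firmware: str


def find_affected(drives):
    """Return the drives that match the affected model and firmware revision."""
    return [
        d for d in drives
        if d.model == AFFECTED_MODEL and d.firmware == AFFECTED_FIRMWARE
    ]


# Placeholder inventory for illustration only.
inventory = [
    Drive("node-a", "SERIAL0001", "ST18000NM019J", "KM02"),
    Drive("node-a", "SERIAL0002", "ST18000NM019J", "FW00"),
    Drive("node-b", "SERIAL0003", "EXAMPLE-MODEL", "FW00"),
]

for drive in find_affected(inventory):
    print(f"{drive.node}: {drive.serial} needs replacement or a firmware update")
```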

Here is the output from one of the drives; maybe it can help others check theirs if they use the same model. All 6 reported exactly the same error, had the same powered-on hours ( ~265 days ), and were brand new.

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST18000NM019J
Revision:             KM02
Compliance:           SPC-5
User Capacity:        18,000,207,937,536 bytes [18.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500d8a51a07
Serial number:        ZR57B8800000G20806CV
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Mon Jul 28 17:36:48 2025 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===

SMART Health Status: Data channel impending failure general hard drive failure [asc=5d, ascq=30]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     31 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 6367:42
Manufactured in week 01 of year 2022
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  34
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  291
Elements in grown defect list: 1

Vendor (Seagate Cache) information
  Blocks sent to initiator = 3828
  Blocks received from initiator = 1650689
  Blocks read from cache and sent to initiator = 9094
  Number of read and write commands whose size <= segment size = 29
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 6367.70
  number of minutes until next internal SMART test = 53

Seagate FARM log supported [try: -l farm]

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0          0.016           0
write:         0        0         0         0          0          6.889           0

Non-medium error count:        0

Pending defect count:0 Pending Defects

The SMART Health Status error above ( asc=5d, ascq=30 ) is what triggered the detach of the drives from the RAID array.
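For anyone who wants to check several drives at once, here is a rough sketch that pulls the model, firmware revision, health status and power-on time out of a report like the one above. It assumes the text was saved from smartctl for a SAS drive and that a healthy drive reports `SMART Health Status: OK`; the function name and the trimmed `SAMPLE` text are illustrative only.

```python
import re

# Trimmed copy of the report above, used as sample input.
SAMPLE = """\
Vendor:               SEAGATE
Product:              ST18000NM019J
Revision:             KM02
SMART Health Status: Data channel impending failure general hard drive failure [asc=5d, ascq=30]
Accumulated power on time, hours:minutes 6367:42
"""


def parse_report(text):
    """Extract a few fields from a SCSI/SAS smartctl text report."""
    def field(pattern):
        match = re.search(pattern, text, re.MULTILINE)
        return match.group(1).strip() if match else None

    health = field(r"^SMART Health Status:\s*(.+)$")
    hours = field(r"^Accumulated power on time, hours:minutes\s+(\d+):\d+")
    return {
        "product": field(r"^Product:\s*(\S+)"),
        "revision": field(r"^Revision:\s*(\S+)"),
        "health": health,
        "healthy": health == "OK",
        "power_on_days": round(int(hours) / 24, 1) if hours else None,
    }


report = parse_report(SAMPLE)
print(report["product"], report["revision"], report["power_on_days"])
if not report["healthy"]:
    print("ALERT:", report["health"])
```

Note that the error counters in the report are all zero, so checking only reallocated sectors or ECC errors would not have flagged these drives; the health status line is the one that matters here.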

Here is a screenshot from the log of one of the Dell servers ( R740 ) showing 2 drives leaving the "chat" at precisely the same time. DST was not set on the server, which is why the time shows only 12:00.

allthemtings · July 2025

@host_c you dropped something
↓
↓
↓
↓
↓
↓

👑

JabJab · July 2025

Seagate more like Samsung am I right?!

host_c · July 2025

@allthemtings said:
@host_c you dropped something
↓
↓
↓
↓
↓
↓

👑

I always said that it is a question of when something like this will happen, rather than if it will happen.

These things are totally out of one's control. To be fair, in my career, this is a first, especially with SAS drives, not to mention new drives.

For fuck's sake, we have older 8-10-14 TB SAS drives that have 5 years of powered-on time and have no issues even today. ( WD/HGST/Seagate )

Ah yes, Seagate was as helpful as the popcorn replies we will get here

It is what it is.

We will most probably switch to HGST in our next drive orders.

FAT32 · July 2025

After Seagate caused me data loss many years ago, I wouldn't touch them anymore, even with a 10-foot pole

host_c · July 2025

@FAT32 said:
After Seagate caused me data loss many years ago, I wouldn't touch them anymore, even with a 10-foot pole

I will agree with you on this. At least when we have drive failures with HGST, it is usually 1 drive per server, not 4-5-6 at the same time.

Don't get me wrong, we are used to drives failing; we do storage, so I see nothing abnormal in changing drives on a monthly basis in the data center. But this was something new.

Luckily, we do not have many 18 TB Seagates left. ( Especially this model )

Also, this issue seems to be specific to the EXOS X18 line, mostly the 16-18 TB models; at least this is the info we googled and found over the past 48h.

Customers have their options in the mail, and we have a new item to add to the checklist.

It is an unfortunate event, but fuck, this is life, some things break sometimes.

Too bad this messed up the week of upgrades we wished to do in the DC, so that got derailed for another week or so.......

MMMMMM · July 2025

TL;DR

layer7 · July 2025

Hi,

Something similar happened to a customer of ours... with 8 TB Intel NVMe drives... they failed faster than they could be replaced.

Sometimes it is what it is...

Wishing you all luck getting all the data out of there before things explode!

And it's just another case that should show clearly to everyone: always keep backups somewhere... there is no 100% security, no matter how good the host or the hardware might be.

host_c · July 2025

@layer7 said:
Hi,

Something similar happened to a customer of ours... with 8 TB Intel NVMe drives... they failed faster than they could be replaced.

Sometimes it is what it is...

Wishing you all luck getting all the data out of there before things explode!

And it's just another case that should show clearly to everyone: always keep backups somewhere... there is no 100% security, no matter how good the host or the hardware might be.

THX

default · July 2025

@host_c - I received your email. Thank you very much for being open and informing your customers about the causes of downtime. Such openness is highly appreciated with regards to respect for your business. If I may, I have some questions:

Which drives will you use for customers choosing a fresh install?
Which drives will you use for customers opting for recovery?
Will IPv4 and IPv6 addresses change in the case of a fresh re-install?
Will CPU change upon a fresh re-install?

Thanks again for your understanding and clear communication.

truemagic · July 2025

Does that mean those without email updates....are not impacted?

Edit: Oh well, I just checked and I did receive an email update from @host_c. However, none of my VPSes seem to be affected (as I can still access them normally?). Is that so?

host_c · July 2025

@default said:
@host_c - I received your email. Thank you very much for being open and informing your customers about the causes of downtime. Such openness is highly appreciated with regards to respect for your business. If I may, I have some questions:

Which drives will you use for customers choosing a fresh install?

Which drives will you use for customers opting for recovery?

Will IPv4 and IPv6 addresses change in the case of a fresh re-install?

Will CPU change upon a fresh re-install?

Thanks again for your understanding and clear communication.

1. Which drives will you use for customers choosing a fresh install?

HGST 14 and 16 TB; some have a mix of Toshiba and HGST. Honestly, I cannot recall exactly off the top of my head.

2. Which drives will you use for customers opting for recovery?

Same.

Yet customers are provisioned over a RAID array, not individual drives!

3. Will IPv4 and IPv6 addresses change in the case of a fresh re-install?

No, these will be manually issued, so we will preserve IP settings. However, due to this manual provisioning, it will be slow, as we first have to delete the old config manually from the cluster, recreate it, and so on.

4. Will CPU change upon a fresh re-install?

No. The CPU type and generation will not change; the model might, as we have 2.4 to 2.7 GHz Scalable Gen 2 CPUs.

@truemagic said: Does that mean those without email updates....are not impacted?

Precisely. Mail was sent to VPS owners on those specific nodes. If you did not get any mail and your service is up, nothing happened; carry on.

maverick · July 2025

@host_c said: Ah yes, Seagate was as helpful as the popcorn replies we will get here

It is what it is.

We will most probably switch to HGST in our next drive orders.

thanks for sharing the juicy details with us

Seagate, you suck!

yeah, if they don't help you with this, don't buy any Seagate drive ever again!

host_c · July 2025

@maverick said:

@host_c said: Ah yes, Seagate was as helpful as the popcorn replies we will get here

It is what it is.

We will most probably switch to HGST in our next drive orders.

thanks for sharing the juicy details with us

Seagate, you suck!

yeah, if they don't help you with this, don't buy any Seagate drive ever again!

Well, they blew me off because the drives were not bought through a certified Seagate reseller, as if any of us could manufacture a drive at home. Fuck me.

There are 3 drive manufacturers in the WORLD:

Seagate
WD/HGST
Toshiba

So any enterprise drive you have bought that has a 5-year warranty should be replaceable regardless of whether you bought it on eBay or in a shop ( provided the drive does not have hammer marks and did not operate at 50 degrees Celsius ).

But here is the reply from them, and I will underline the fact that we asked for a FW fix, not a replacement, as a FW fix might have helped more. I could not care less that we have 6 or 10 failed drives; that is my problem. I asked for a FW fix because that is what might have helped with our issue and our customers' issue. Again, there is no mechanical issue with them; they just decided to go on holiday and left the array in the middle of the day.

Now, this is not a Seagate-only policy; WD and Toshiba do the same.

EDIT:

One of the reasons I moved away from HP years ago was their restrictive firmware and BIOS update policy. Starting with Gen8 servers, critical updates — including fixes for issues that only emerge under specific conditions — were locked behind a support subscription.

This approach is frustrating because firmware and BIOS bugs are not user-created issues; they are vendor-side flaws that should be resolved as a matter of responsibility. Requiring customers to pay for access to those fixes feels like a penalty for simply using the hardware.

Now that HP is involved with Juniper, I can only hope they don’t bring this same restrictive, short-sighted policy mindset into that ecosystem. - tho I am positive they will.

NJa64F · July 2025

@host_c said: July 28–29, 2025:
RAID array access was restored in degraded mode, enabling partial access to addon volumes at limited transfer speeds.

Stuff happens. Then you fix it. That's life. You're doing a good job communicating what happened.

My question is: did any customers lose data? For the customers that requested recovery, how does that happen? Is this a forensic recovery where the drives are sent out?

host_c · July 2025

@jperkins said:

@host_c said: July 28–29, 2025:
RAID array access was restored in degraded mode, enabling partial access to addon volumes at limited transfer speeds.

Stuff happens. Then you fix it. That's life. You're doing a good job communicating what happened.

My question is: did any customers lose data? For the customers that requested recovery, how does that happen? Is this a forensic recovery where the drives are sent out?

This is not a forensic procedure; it is an in-house solution. We do not send anything that has customer data out to anyone, regardless of the situation.

We did manage to import the array on an older controller that does not take the SMART error of the drives into account ( for the moment ), but copying off them is extremely slow, a few MB/s.

This is why we sent out the mail saying that those who do not have crucial data can opt for provisioning of a new VPS, as it is faster. Those who need the data will have to wait until we move the add-on drive to a new VPS, which is slow, very slow.
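To give a feel for "very slow", here is a back-of-the-envelope estimate. The 3 MB/s rate and the volume sizes are assumptions picked for illustration, not measured figures; at that rate 1 TB works out to roughly 93 hours, close to four days.

```python
def copy_hours(size_gb, rate_mb_per_s):
    """Hours needed to copy size_gb gigabytes at rate_mb_per_s megabytes per second.

    Uses decimal units (1 GB = 1000 MB).
    """
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


# Assumed rate of 3 MB/s; adjust to whatever the copy actually sustains.
for size_gb in (100, 500, 1000):
    print(f"{size_gb} GB at 3 MB/s: about {copy_hours(size_gb, 3):.0f} hours")
```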

Unfortunately, we cannot guarantee the integrity of the data we recover; that will be up to the user to check. This is the best we can do under the current circumstances.
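One way a user could do that check, sketched under the assumption that they kept a manifest of SHA-256 checksums from before the incident (for example, from their own backups). The manifest format and the function names are hypothetical; without known-good checksums the script has nothing to compare against.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest, root):
    """Return relative paths that are missing under root or whose hash differs.

    manifest: dict mapping relative path -> expected SHA-256 hex digest.
    """
    bad = []
    for rel_path, expected in manifest.items():
        target = Path(root) / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad
```

Usage would be along the lines of `find_mismatches(manifest, "/mnt/recovered")`, where the mount point is a placeholder; an empty list means every file in the manifest came back intact.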

NJa64F · July 2025

@host_c said: We did manage to import the array on an older controller

thanks for the explanation. Good luck to all involved and even though most of my stuff is backed up ya always wonder, 'what am I not backing up'

default · July 2025

@jperkins said:

@host_c said: We did manage to import the array on an older controller

thanks for the explanation. Good luck to all involved and even though most of my stuff is backed up ya always wonder, 'what am I not backing up'

You. There is no backup of you. Once you die, that's it. There is no backup of your firmware, especially considering it is patented and personalised for you. This is why the end is always nigh.

barbaros · July 2025

@default said:

@jperkins said:

@host_c said: We did manage to import the array on an older controller

thanks for the explanation. Good luck to all involved and even though most of my stuff is backed up ya always wonder, 'what am I not backing up'

You. There is no backup of you. Once you die, that's it. There is no backup of your firmware, especially considering it is patented and personalised for you. This is why the end is always nigh.

that's deep, mate.

BigBlue · July 2025

Any flash sale plans? Could use a stronger (CPU, RAM) box for Immich to aid my existing storage VPS.

Maelstrom36 · July 2025

@BigBlue said:
Any flash sale plans? Could use a stronger (CPU, RAM) box for Immich to aid my existing storage VPS.

What is your current storage VPS specs? Just out of curiosity.

plumberg · July 2025

@BigBlue said:
Any flash sale plans? Could use a stronger (CPU, RAM) box for Immich to aid my existing storage VPS.

What are your current specs? I have a home instance crunching on rpi4

BigBlue · July 2025

1 vCPU, 2GB RAM, hosting 2 static sites, an OpenCloud instance, Open WebUI, Shlink and Actual Budget already

plumberg · July 2025

@BigBlue said:
1 vCPU, 2GB RAM, hosting 2 static sites, an OpenCloud instance, Open WebUI, Shlink and Actual Budget already

Depending on your photo library, you should really get a new box and not just add resources to this one.

BigBlue · July 2025

That's exactly the plan. Could also benefit from redundancy in the EU region, still being burnt by previous host's 'emergency migration' with now close to a week of unexpected downtime.

onewater · July 2025

@host_c said:

@maverick said:

@host_c said: Ah yes, Seagate was as helpful as the popcorn replies we will get here

It is what it is.

We will most probably switch to HGST in our next drive orders.

thanks for sharing the juicy details with us

Seagate, you suck!

yeah, if they don't help you with this, don't buy any Seagate drive ever again!

Well, they blew me off because the drives were not bought through a certified Seagate reseller, as if any of us could manufacture a drive at home. Fuck me.

There are 3 drive manufacturers in the WORLD:

Seagate
WD/HGST
Toshiba

So any enterprise drive you have bought that has a 5-year warranty should be replaceable regardless of whether you bought it on eBay or in a shop ( provided the drive does not have hammer marks and did not operate at 50 degrees Celsius ).

But here is the reply from them, and I will underline the fact that we asked for a FW fix, not a replacement, as a FW fix might have helped more. I could not care less that we have 6 or 10 failed drives; that is my problem. I asked for a FW fix because that is what might have helped with our issue and our customers' issue. Again, there is no mechanical issue with them; they just decided to go on holiday and left the array in the middle of the day.

Now, this is not a Seagate-only policy; WD and Toshiba do the same.

EDIT:

One of the reasons I moved away from HP years ago was their restrictive firmware and BIOS update policy. Starting with Gen8 servers, critical updates — including fixes for issues that only emerge under specific conditions — were locked behind a support subscription.

This approach is frustrating because firmware and BIOS bugs are not user-created issues; they are vendor-side flaws that should be resolved as a matter of responsibility. Requiring customers to pay for access to those fixes feels like a penalty for simply using the hardware.

Now that HP is involved with Juniper, I can only hope they don’t bring this same restrictive, short-sighted policy mindset into that ecosystem. - tho I am positive they will.

A small correction: Toshiba hard drives currently belong to WD.
So there are only 2 drive manufacturers in the WORLD:
Seagate and WD

TandM · July 2025

@onewater said:

A small correction: Toshiba hard drives currently belong to WD.
So there are only 2 drive manufacturers in the WORLD:
Seagate and WD

Fairly certain Toshiba still is an independent hard drive manufacturer. They bought out certain 3.5" drive manufacturing facilities and IP from WD in 2012, with WD buying certain 2.5" facilities and IP in turn, and nothing has changed in the meantime AFAIK.

Riccardo · July 2025

@host_c said:

default said:
@host_c - I received your email. Thank you very much for being open and informing your customers about the causes of downtime. Such openness is highly appreciated with regards to respect for your business. If I may, I have some questions:

Which drives will you use for customers choosing a fresh install?

Which drives will you use for customers opting for recovery?

Will IPv4 and IPv6 addresses change in the case of a fresh re-install?

Will CPU change upon a fresh re-install?

Thanks again for your understanding and clear communication.

1. Which drives will you use for customers choosing a fresh install?

HGST 14 and 16 TB; some have a mix of Toshiba and HGST. Honestly, I cannot recall exactly off the top of my head.

2. Which drives will you use for customers opting for recovery?

Same.

Yet customers are provisioned over a RAID array, not individual drives!

3. Will IPv4 and IPv6 addresses change in the case of a fresh re-install?

No, these will be manually issued, so we will preserve IP settings. However, due to this manual provisioning, it will be slow, as we first have to delete the old config manually from the cluster, recreate it, and so on.

4. Will CPU change upon a fresh re-install?

No. The CPU type and generation will not change; the model might, as we have 2.4 to 2.7 GHz Scalable Gen 2 CPUs.

truemagic said: Does that mean those without email updates....are not impacted?

Precisely. Mail was sent to VPS owners on those specific nodes. If you did not get any mail and your service is up, nothing happened; carry on.

Although I was not affected, I highly appreciate the transparency. Thanks for taking the time to let others know about the issue so they can be aware. I feel I am in very good hands.

host_c · July 2025

@onewater said: Seagate and WD

Nice, THX for the update on this.

I feel good having only 2 options; it makes things far simpler.
