HDD failing?
Hello,
Today, after updating my small dedicated server, I noticed that it has extremely high load and a lot of iowait.
Everything I read from or write to the hard disk(s) is really slow. It has two 250GB HDDs in software RAID1 and runs CentOS 6 - how can I check if one HDD is about to fail or has already failed?
dd if=/dev/zero of=test bs=64k count=4k conv=fdatasync
4096+0 records in
4096+0 records out
268435456 bytes (268 MB) copied, 41.3819 s, 6.5 MB/s
That doesn't look good, and I'm mostly the only one on the server...
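To narrow down whether just one of the two disks is slow, I could also time raw reads from each member separately, something like this (a sketch, assuming hdparm is installed; sda/sdb as in my setup):

hdparm -t /dev/sda   # buffered sequential read timing, run while the box is otherwise idle
hdparm -t /dev/sdb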
EDIT:
I've run:
smartctl -t long /dev/sda
smartctl -t long /dev/sdb
Now waiting an hour to see what SMART says...
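Once they're done, I plan to check the results with something like this (a sketch; the exact attribute names vary between drives):

smartctl -l selftest /dev/sda                                # self-test log with a pass/fail result per run
smartctl -A /dev/sda | grep -Ei 'realloc|pending|uncorrect'  # typical pre-failure attributes to watch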
Comments
mdadm should be able to tell you if the drive is failing. mdadm --detail /dev/mdXX, where mdXX is your software raid device.
https://raid.wiki.kernel.org/articles/d/e/t/Detecting,_querying_and_testing.html will help, too.
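A quick complementary check (assuming the default device naming):

cat /proc/mdstat   # [UU] means both members are healthy; [U_] or an (F) flag means one has failed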
/dev/md0:
Version : 1.0
Creation Time : Fri Feb 17 21:52:56 2012
Raid Level : raid1
Array Size : 240639864 (229.49 GiB 246.42 GB)
Used Dev Size : 240639864 (229.49 GiB 246.42 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Feb 28 17:29:06 2012
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : atom:0 (local to host atom)
UUID : f474eb1c:1c95cdeb:0d4a1e6e:39a2dd4c
Events : 7800
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
But if I can trust that, everything seems to be fine? Then why do I only get ~6 MB/s? Some days/weeks ago I got more than 40 MB/s!
Perhaps there's some runaway process that's eating available i/o? You can install iotop to see: http://guichaz.free.fr/iotop/
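For example, something like this should work (a sketch, assuming the EPEL repository is enabled on CentOS 6):

yum install iotop   # packaged in EPEL for CentOS 6
iotop -oPa          # -o: only processes actually doing I/O, -P: per process, -a: accumulated totals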
@Damian: Thanks for the tip, but I have already checked everything with iotop. When I ran the test, dd was the only thing writing to the disk, and it never got above ~6 MB/s.
nevermind, deleting
@Amfy Take your server offline for an FSCK - maybe there are some corruption issues.
@sturdyvps Some weeks ago, on a testing server, I made the mistake of running fsck while the HDD was mounted and in use, and after that I had to reinstall the server...
To force an fsck check, should I do touch /forcefsck? And then reboot the server and wait an hour or so, and it should be back? But I won't lose my data?
@Amfy It's not recommended to run fsck while the server is online.
You can just issue the following command: shutdown -rF now (this will reboot the server and run an fsck)
I would also recommend having access to a KVM, or getting your hosting company to do this for you.
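The /forcefsck flag file you mentioned should work too on CentOS 6; shutdown -rF does essentially the same thing (a sketch):

touch /forcefsck   # init looks for this flag file at boot, runs fsck, then removes it
reboot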
@sturdyvps: Hehe, yes, after this experience I really know that :P
Ah, thanks!
I'm not sure... I already had access to a free KVM a week ago because of a failure during the CentOS 6 installation (the installer hadn't installed GRUB, even though I had forced it to). I don't want to get on their nerves by asking again, and maybe I'll be forced to pay ~25€.
If your server is managed, they should be able to do the fsck for you; even if the server is unmanaged, they should still be able to do it for you.
They're too friendly; I will try it on my own.
Thanks for all your help, @sturdyvps and @Damian
@sturdyvps you're offering VPS in NL? I will take a look at them :P
Hm, okay, I've decided to reboot my server tonight and let it do an fsck check. About how long could that take for a 250GB HDD?
Well, if your drives are slow for some reason, then it could take quite some time. If rebooting returns your drives to their normal speed, then it will not be as long.
Hm, it took no longer than a normal reboot...
I'm not sure, it doesn't seem any faster, so what could it be now? Or is ext4 too heavy for an Intel D525?
dd if=/dev/zero of=test bs=16k count=16k conv=fdatasync
16384+0 records in
16384+0 records out
268435456 bytes (268 MB) copied, 94.2323 s, 2.8 MB/s
looks awful
It is awful. Are you sure you want software RAID 1 on that? I've had countless problems with software RAID, including mysterious failures like yours.
Remove one disk and rebuild; this may solve the problem. But in the long run you might consider other ways of securing your data, such as keeping the second drive for automated incremental backups, with a minimal system installed on it in case the first drive fails.
M
What load average does the server have? Also, how much is the iowait?
And what are you using the server for?
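You can check with something like this (vmstat ships with procps; iostat needs the sysstat package):

vmstat 1 5      # the 'wa' column is the iowait percentage
iostat -x 1 5   # per-disk utilization and await times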
@Maounique thank you very very much for your answer.
And you really had similar problems with similarly bad speeds? Some weeks ago I had Debian with ext3 on the same machine and it was relatively good; then one hard disk seemed to fail, so I preferred to reinstall. Since I wanted to test CentOS 6 I went with it, and I thought ext4 would be more future-proof.
Currently I'm doing a vzdump backup to a Hostigation VPS every night.
Hmm, maybe I will contact the provider and ask if other customers are having similar problems.
I don't have very important data, but I can't imagine a server without RAID?!
You mean something like mdadm --manage /dev/md0 --remove /dev/sdb1
and then mdadm --manage /dev/md0 --add /dev/sdb1?
Would mdadm --grow --bitmap=internal /dev/md0 help a bit?
In some cases software RAID will perform just as well as hardware RAID, if not better. With modern CPUs, modern software RAID doesn't really have a performance impact for RAID 1, 0, or 10. I'd probably only opt for hardware RAID with a BBU (which we do run).
However, I've worked with both hardware RAID and software RAID, and there's not much of a difference for the levels stated above.
@sturdyvps:
Split into OpenVZ containers: mail server, shell server, web server, some development and testing stuff.
Currently ~0.5 - 1.0 (I deactivated everything except the web server). But right after boot, when I ran some updates, the load was about 10 - 20! And if I let dd write for a minute or so, the load is ~5.
Whenever a process wakes up, the iowait % in top climbs to nearly 100%.
something like:
mdadm /dev/md0 -f /dev/sda1   # marks it as faulty
mdadm /dev/md0 -r /dev/sda1   # removes it from the array
mdadm /dev/md0 -a /dev/sda1   # adds it back as a hot spare
Or sdb1 if you prefer.
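Then watch the rebuild afterwards (a sketch):

watch -n 5 cat /proc/mdstat   # resync progress shows up as a percentage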
If you do not depend on that data, don't use RAID; it will lower power usage and reduce headaches. In theory it should work great, but in my experience, if the data is not that important, a backup is better than RAID (even hardware RAID); if it is critical, go for the best performance with minimal redundancy plus heavy backups.
M
P.S. It may look like one drive is failing, but that is not always the case; software RAID can be mistaken in some situations. Better to check the drive before discarding it.