Dedicated Server keeps crashing/rebooting

NeoXiD Member
edited March 2015 in Help

Introduction

Hello everyone! I'm currently struggling with a really annoying issue - my SYS dedicated server keeps crashing randomly. I've tried to sum up all the necessary information for you, and hopefully someone has an idea what's going on. Thanks in advance for your time.

Hardware configuration

I'm using an E3-SSD-3 from SoYouStart, so there isn't any fancy hardware involved. It basically comes down to the following components, which are also listed on the product page:

  • Intel Xeon E3 1245v2 (4C/8T)
  • 32GB RAM (no ECC)
  • 3x 120GB SSD (SW-RAID 5)

What happens?

The server just keeps rebooting randomly. I've already tried several distributions and the problem occurred with all of them. Here are some situations in which a crash occurred:

  • CentOS 7: Idle SSH connection, installation of a big software suite (OpenStack), update of all installed packages with yum
  • Fedora 20: Idle SSH connection, installation of htop with yum
  • Fedora 21: Idle SSH connection, installation of a big software suite (OpenStack once more)

As various kernel versions were tested, it doesn't seem to be an issue with the kernel - at least that would be my guess/assumption.

Have you checked X/Y/Z?

Of course I've already tried to narrow down the problem, but so far without any success. These are the things I've checked already:

  • memcheck within OVH's rescue system: Ran that check twice, so 2x15 loops - no errors were found, so I guess the memory is fine.
  • cpuburn within OVH's rescue system: Ran for 4 hours at a load of 8.0, no crashes during that time.
  • stress: Some people might know stress, it's a really simple tool to create heavy CPU, memory and I/O load. I let that run on the dedicated server (Fedora 21) for an hour and it didn't crash either. Seems like the server's trolling me.
  • S.M.A.R.T. data of SSDs: They've got 5kh, 23kh and 30kh of runtime, so two of them are quite old already. The statistics are fine though, and neither a short nor a long self-test returned any errors.
  • Stumbling through the logs: I've already looked through /var/log/dmesg and /var/log/messages, and there's not a single relevant message in there. The server just reboots and that's it.
  • Trying to find a pattern: Didn't have any success with that, as you've probably already read above.
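
One way to look for a pattern is to put numbers on the time between reboots. The following is only a minimal sketch: the boot timestamps are made up for illustration, and `uptimes_between_boots` and `summarize` are hypothetical helper names, not part of any tool mentioned above. Note that the gap between two boots also includes the time the box spent down, so it is an upper bound on the actual uptime.

```python
from datetime import datetime, timedelta


def uptimes_between_boots(boot_times):
    """Return the gaps between consecutive boot timestamps, oldest first."""
    ordered = sorted(boot_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def summarize(gaps):
    """Return (shortest, longest, mean) gap, or None if there are no gaps."""
    if not gaps:
        return None
    total = sum(gaps, timedelta())
    return min(gaps), max(gaps), total / len(gaps)


# Hypothetical boot times, made up for illustration only.
boots = [
    datetime(2015, 3, 1, 10, 0),
    datetime(2015, 3, 1, 10, 10),
    datetime(2015, 3, 1, 10, 40),
    datetime(2015, 3, 1, 14, 40),
]

gaps = uptimes_between_boots(boots)
shortest, longest, mean = summarize(gaps)
print("shortest:", shortest, "longest:", longest, "mean:", mean)
```

If the gaps cluster around a fixed value, that points towards something scheduled (a cron job, a monitoring probe); widely scattered gaps like the ones described here fit a hardware or power problem better, although that is a heuristic and not a diagnosis.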

So what?

Well, I'm absolutely clueless as to why this problem occurs, and hopefully someone else has an idea what else I could look for. As the server didn't crash during the cpuburn and memcheck tests within the rescue system, I can't really get OVH to look at it - as some people might know, their SYS support isn't top-notch. Completely unrelated side note: I wouldn't have chosen them if they weren't the only company that sells IPs for a reasonable price.


Comments

  • Clouvider Member, Patron Provider

    Is there anything left in the logs after the reboot?

  • Do you have access to the system's IPMI? If so, does the IPMI contain any hardware failure information or NMIs?

  • NeoXiD Member
    edited March 2015

    @Clouvider said:
    Is there anything left in the logs after the reboot?

    Unfortunately I wasn't able to find anything at all - it just boots normally and there's not a single error listed there. Once it even "crashed" in the middle of a log line, so it just seems to be totally random.

    @MarkTurner said:
    Do you have access to the system's IPMI? If so, does the IPMI contain any hardware failure information or NMIs?

    It's a SYS box, so there's no KVM or IPMI at all, as they're using cheap consumer boards. The only options are booting from disk or booting the rescue system. I was able to install those distributions by using that "trick" from trick77: boot into the rescue system, run a portable QEMU/KVM which passes through the disks, install the system, adjust the network configuration and reboot from disk. Additional unrelated side note: Too bad that you don't have any of those dedicated server deals in Europe :p
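
Since the logs end without any error (once even in the middle of a line) and there is no IPMI, one cheap way to at least pin down when each crash happens is a heartbeat file that is flushed to disk on every write. This is a rough sketch, not something from the thread: the path `/var/tmp/heartbeat.log`, the 10 second interval and all function names are assumptions, and `os.fsync` only asks the OS to push the data to the device, so a drive with a volatile write cache may still lose the last entry.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path):
    """Append one UTC timestamp line and force it to disk before returning."""
    line = datetime.now(timezone.utc).isoformat() + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push it to the device
    return line


def last_heartbeat(path):
    """Return the last complete line, ignoring a line cut off by a crash."""
    with open(path, encoding="utf-8") as f:
        # Everything after the final newline is a partial line; drop it.
        complete = [l for l in f.read().split("\n")[:-1] if l]
    return complete[-1] if complete else None


def run(path="/var/tmp/heartbeat.log", interval_seconds=10):
    """Write a heartbeat forever; path and interval are assumed values."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_seconds)
```

Leaving `run()` running in the background and calling `last_heartbeat()` after the next reboot should give the crash time to within roughly one interval, which can then be compared against cron jobs, RAID checks or anything else that runs on a schedule.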

  • Clouvider Member, Patron Provider

    I would say contact the support, but...

    Either way, can you see if the IPMI is present locally? Just in case?
    (test with ipmitool chassis status)
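
For what it's worth, `ipmitool` without further options talks to the BMC through a local device node, so a quick pre-check is to see whether such a node exists at all. A minimal sketch, assuming the usual Linux device paths listed in the code (treat them as an assumption for your distribution; the function name is made up). A missing node can also just mean that the IPMI kernel modules are not loaded, so this is a hint rather than proof that no BMC is present.

```python
import os

# Device nodes a local ipmitool typically tries on Linux (assumed here).
IPMI_DEVICE_CANDIDATES = ("/dev/ipmi0", "/dev/ipmi/0", "/dev/ipmidev/0")


def find_ipmi_devices(candidates=IPMI_DEVICE_CANDIDATES):
    """Return the candidate device paths that exist on this machine."""
    return [path for path in candidates if os.path.exists(path)]


found = find_ipmi_devices()
if found:
    print("IPMI device node(s) present:", ", ".join(found))
else:
    print("No IPMI device node found; ipmitool will likely fail locally.")
```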

  • There is a testing system in the rescue web interface: http://help.ovh.com/RescueMode

  • NeoXiD Member
    edited March 2015

    @Clouvider said:
    I would say contact the support, but...

    Either way, can you see if the IPMI is present locally? Just in case?
    (test with ipmitool chassis status)

    As I said, there's no IPMI available, and that command returns "No such file or directory". It's just some cheap consumer board. I haven't contacted support yet, as I've read in the forum that if their rescue system doesn't crash or report something, they won't really do anything for you. Remember, I'm a SYS customer...
    EDIT: And they won't just exchange the whole server with another one - so that's why I'm trying to narrow it down.

    @coolice said:
    There is a testing system in the rescue web interface: http://help.ovh.com/RescueMode

    As I've written in the initial post, I did those tests already without any errors/crashes.

  • Clouvider Member, Patron Provider

    @NeoXiD it's always worth checking, you might have been lucky enough to have one :).

    There is not enough information to see what's causing the problem. It may well be something outside your server. I would recommend running the tests for a bit longer, 24 hours perhaps, to see if anything pops up.

  • NeoXiD Member
    edited March 2015

    @Clouvider said:
    NeoXiD it's always worth checking, you might have been lucky enough to have one :).

    There is not enough information to see what's causing the problem. It may well be something outside your server. I would recommend running the tests for a bit longer, 24 hours perhaps, to see if anything pops up.

    I'll give the cpuburn test in OVH's rescue system another shot for about 24 hours, unless it stops earlier - the memcheck test unfortunately stops after 15 loops, so there's no way to do long-term testing with it. I didn't run the cpuburn test for longer than 4 hours because the crashes sometimes occurred after just 10 minutes of uptime. Today it crashed 4 times within a single hour, and afterwards it was running fine again...

  • qps Member, Host Rep
    edited March 2015

    Maybe ask them to swap the power supply? Also, maybe have them check the fans (CPU fan, chassis fan, etc).

  • I'm getting this with the Kimsufi I ordered just today. Is there a reason for this? I installed Windows and it works completely fine (not activated yet, though) for, let's say, an hour or two. Then it suddenly kicks me off RDP and stops responding to pings within about 5 seconds of RDP freezing. Obviously I have no IPMI. I could try rescue mode, but I'm unsure what to do with it.

  • @HyperSpeed said:
    I'm getting this with the Kimsufi I ordered just today. Is there a reason for this? I installed Windows and it works completely fine (not activated yet, though) for, let's say, an hour or two. Then it suddenly kicks me off RDP and stops responding to pings within about 5 seconds of RDP freezing. Obviously I have no IPMI. I could try rescue mode, but I'm unsure what to do with it.

    Does just the connection drop, or does your Windows server reboot? Mine keeps rebooting. I'm now running a heavy stress test within the rescue system for 24 hours; if that damn server doesn't crash, I'll try my luck with a ticket.

  • @NeoXiD said:
    Does just the connection drop, or does your Windows server reboot? Mine keeps rebooting. I'm now running a heavy stress test within the rescue system for 24 hours; if that damn server doesn't crash, I'll try my luck with a ticket.

    It shuts down as far as I'm aware, because no matter how long you leave it, it never comes back up once it suddenly drops. I also tried rescue mode, and it keeps saying the server may not be responding as it's running the tests?

  • It's probably a hardware issue, so you have no choice but to submit a support ticket. I could be wrong, of course.

    HyperSpeed said: keeps saying the server may not be responding as it's running the tests?

    BIOS error, perhaps?

  • @HyperSpeed said:
    I'm getting this with the Kimsufi I ordered just today. Is there a reason for this? I installed Windows and it works completely fine (not activated yet, though) for, let's say, an hour or two. Then it suddenly kicks me off RDP and stops responding to pings within about 5 seconds of RDP freezing. Obviously I have no IPMI. I could try rescue mode, but I'm unsure what to do with it.

    This is completely out of the blue.

    But sometimes this happens when Windows licenses are not activated. It just shuts down completely.

  • MSPNick said: This is completely out of the blue.

    But sometimes this happens when Windows licenses are not activated. It just shuts down completely.

    I'll take a risk and activate it then, hopefully that stops it being a pain in the rear. Fingers crossed.

  • Hopefully it goes away now... phaha never knew that feature existed, although I've always licensed it straight away so never noticed!

    image

  • @neokid Are you running the system with SW RAID 1 or 0? Have you checked the disks via smartctl?

  • @HyperSpeed said:
    image

    Let's hope. Please report back, I'd love to know whether it's OK or not.

  • @aggressivenetworks said:
    neokid Are you running the system with SW RAID 1 or 0? Have you checked the disks via smartctl?

    Forgot to mention that, sorry. I'm running a SW RAID 5 on it, and the RAID has rebuilt successfully multiple times. No scary SMART data, as said in the first post; two of the disks are really old but they still work.

  • agoldenberg Member, Host Rep

    @neoXiD did you change the main IP of the box to a different IP? This happened to me when I did that.

  • @agoldenberg said:
    neoXiD did you change the main IP of the box to a different IP? This happened to me when I did that.

    No, it's the standard main IP. The rescue system with stress on it is still running stably without any crashes. Maybe the random crashes are caused by some component that doesn't get used in the rescue system. So... maybe it is kernel-related after all, or one of the disks is screwing up. I've pasted the SMART data here, but it looks fine to me:

  • edan Member

    Try using the distribution's default kernel instead of the OVH kernel.

  • Bayu Member
    edited March 2015

    I'm also getting this with my dedicated server from online.net. It always reboots or kernel panics after several hours. I changed sysctl.conf to prevent the automatic reboot on kernel panic in order to narrow down the problem, so that I can see the error message displayed via IPMI.

    But IPMI isn't much help, because I can't scroll through the error message displayed over VNC. The same thing happens when I switch to another Linux distribution (Ubuntu 12.04, Ubuntu 14.04, CentOS 5, CentOS 6, Debian 7).

    But after I uninstalled smartmontools (the smartd / smartctl processes), my server no longer crashed.

    http://lowendtalk.com/discussion/23650/ask-kernel-panics
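
The post above doesn't say which sysctl.conf setting was changed; a likely candidate is `kernel.panic`, which controls whether and when the kernel reboots after a panic. The sketch below assumes that key and uses hypothetical function names. It reads the current value from `/proc/sys/kernel/panic` and explains what it means.

```python
import os

# Assumed location of the kernel.panic setting on a Linux system.
PANIC_PATH = "/proc/sys/kernel/panic"


def describe_kernel_panic_setting(raw):
    """Explain a kernel.panic value given as text, e.g. '0\\n'."""
    seconds = int(raw.strip())
    if seconds == 0:
        return "no automatic reboot after a panic"
    if seconds > 0:
        return f"reboot {seconds} s after a panic"
    return "reboot immediately after a panic"


def read_kernel_panic(path=PANIC_PATH):
    """Return the raw contents of the kernel.panic file."""
    with open(path, encoding="ascii") as f:
        return f.read()


# Only read the live value where the file actually exists.
if os.path.exists(PANIC_PATH):
    print(describe_kernel_panic_setting(read_kernel_panic()))
```

With a value of 0 the box stays on the panic screen instead of rebooting, which only helps if there is some way to see the console. On a server without IPMI or KVM, like the SYS box from the first post, it would simply stay down until a hard reboot.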
