Debian 10 AWS EC2 instance hangs randomly

I have 50+ Debian 8/9 EC2 instances that have been running well for years... with Debian 10, however, I've started having trouble lately. I spun up a new instance today from the Debian 10 AMI in the Marketplace, configured everything, and did the application-side deployment... after an hour or so, the machine just froze: no response on the SSH / HTTP / xRDP / VNC ports. Even the AWS instance status check reports 1 of 2 checks failed... so it really is down.

Out of curiosity, I spun up another Debian 10 EC2 instance... same t3.small spec... just did an apt update / upgrade and left it alone. After 20 minutes or so, this one too hung in exactly the same way as the first.

I have opened a ticket with AWS for this... the first-line engineer could not resolve the issue, so he said he would call back after researching it with a technical colleague; I'm awaiting their reply.

I recall having a very similar issue on Lightsail in January... Lightsail did not offer Debian 10, so I had to install Debian 9, upgrade to Debian 10, and I then observed it would hang after a few hours of idling. Suspecting I had missed something during the upgrade, I rebuilt it, and it has run fine to date. Today, when I ran into exactly the same behavior, albeit on mainstream AWS, I began to suspect it might have something to do with their AMIs.

The forums are flooded with reports of EC2 instances freezing and not coming back until you reboot from the AWS console... I tried whatever I could find, like changing the security group and checking the network, which did not help. I think it is a baseline stability problem, not just networking.

Has anyone else experienced the same issues with Debian 10 on EC2 lately?
Please share your observations... Thanks!

Comments

  • I use Debian 10 too, the official AMI. I'm on my phone so I can't check the AMI ID.

    I have 100+ servers there. A few of my EC2 instances hung when I set the values in limits.conf and sysctl.conf too small, or when the logind service's nofile limit was smaller than the limits in limits.conf/sysctl.conf.

    Did you check the EC2 instance log from the console? Maybe there's a clue about what happened.
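
    The consistency described above can be sketched roughly as follows. The values are purely illustrative, and the systemd knob shown is DefaultLimitNOFILE in system.conf (the commenter mentioned logind; the exact file may differ by setup) -- the point is that the systemd limit should not be set lower than the PAM limits:

```
# /etc/security/limits.conf (illustrative values)
*    soft    nofile    65535
*    hard    nofile    65535

# /etc/systemd/system.conf -- keep this at least as large as limits.conf
[Manager]
DefaultLimitNOFILE=65535

# /etc/sysctl.conf -- the kernel-wide ceiling sits above both
fs.file-max = 2097152
```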

  • Some more observations:
    I rebooted both instances and discovered that the instance itself had kept working... the logs, namely the auth and syslog entries, show events at regular intervals while the instance was unreachable from the outside. On the second test instance I had left a cron job that echoes to a log file every minute... and I see entries for every minute, non-stop, without a single miss.

    This suggests the issue is not the instance state... but maybe the network.

    Thanks @sibaper for the pointers... No changes to limits.conf, and the only change in sysctl.conf is to disable IPv6:
    echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.conf
    echo 'net.ipv6.conf.default.disable_ipv6 = 1' >> /etc/sysctl.conf

    This has been my routine for the last 8 years or so, but maybe it is what's putting the instance in jeopardy; I'd like to investigate this further.

    I spun up two new instances... one Ubuntu 18 and another Debian 10, and did absolutely nothing to them... just a single apt update and apt full-upgrade, rebooted, and let them sit... they are still up after 5-6 hours or so.

    I'm leaving those two config lines off of them, adding MySQL, Apache, nginx, etc. in steps, and will see.

    I also removed the disable_ipv6 lines from the previous two servers and rebooted them; I'll observe whether that resolves the problem.

    sibaper said: did you check the log of ec2 from console

    Good idea... it didn't cross my mind, as I never really had the need... but I'll try to catch something there.
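
    The per-minute liveness probe mentioned above can be a one-line cron entry; the log path and format here are illustrative, not the poster's actual setup:

```shell
# Sketch of a per-minute liveness probe. The crontab entry would be
# (note that % must be escaped as \% inside crontab):
#   * * * * * date "+\%F \%T alive" >> /var/log/alive.log
# Each run simply appends a timestamp; here we simulate a single run:
date "+%F %T alive" >> /tmp/alive.log
# If the log keeps gaining entries while the instance is unreachable
# over the network, the OS itself is still running.
tail -n 1 /tmp/alive.log
```

    Checking the log afterwards for gaps between timestamps distinguishes a hung instance from one that merely lost network connectivity.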

  • SplitIce Member, Host Rep

    Have you performed any troubleshooting?

    Checked dmesg? Checked the console?

    Making changes without first identifying the cause is a recipe for getting yourself stuck in a loop (of BS).

  • OK, I nailed it... on Debian 10 it really is this entry in /etc/sysctl.conf:

    echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.conf
    echo 'net.ipv6.conf.default.disable_ipv6 = 1' >> /etc/sysctl.conf
    

    Apparently only the Debian 10 AMIs have a problem with it... I tested Debian 9 / Ubuntu 18 instances, and they had no problems with this entry.

    I tested multiple times... removing those lines from /etc/sysctl.conf on the Debian 10 instances keeps them online flawlessly... the moment I put them back, the instance goes offline in a few minutes. And mind you, it's not the instance itself going down; it stays "on" in the background, it just cuts off from the network, even the internal AWS VPC.

    My best guess... something pesky about the networking here... the server remains up in the background but loses all network connectivity... maybe some kind of network probe is failing.

    I have reported it to AWS on the ticket... will update if they reply with something conclusive :smile:
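
    One way to test this hypothesis without editing /etc/sysctl.conf and rebooting each time is to toggle the same knobs at runtime with sysctl -w (requires root); this is a sketch of the idea, not something the poster reported doing:

```shell
# Read-only check of the knob's current state (no root needed):
f=/proc/sys/net/ipv6/conf/all/disable_ipv6
if [ -e "$f" ]; then cat "$f"; else echo "IPv6 not available in this kernel"; fi

# To reproduce at runtime (root required), set the same values that the
# sysctl.conf lines apply at boot:
#   sysctl -w net.ipv6.conf.all.disable_ipv6=1
#   sysctl -w net.ipv6.conf.default.disable_ipv6=1
# and revert with =0. If the hypothesis holds, connectivity on the
# affected AMIs should drop shortly after setting 1 and return after
# reverting -- so run this from the EC2 serial/instance console, not SSH.
```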

  • @mehargags

    Which AMI did you use?

    I just created a server using ami-0f44b7e6f65040e6d and added

    echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.conf
    echo 'net.ipv6.conf.default.disable_ipv6 = 1' >> /etc/sysctl.conf
    

    to sysctl.conf

  • @sibaper
    It is the default free Debian 10 AMI with the Debian logo on it... I don't know where to get that ID from. Check the screenshot:

    https://prnt.sc/s2pruc

  • We ran into the same issue. Upon further investigation, we found a debian bug which described our exact issue:
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964596

    We ended up applying the same patches manually (https://salsa.debian.org/cloud-team/debian-cloud-images/-/merge_requests/206/diffs and https://salsa.debian.org/cloud-team/debian-cloud-images/-/merge_requests/207/diffs) instead of using the new AMIs. This worked for us.

  • LTniger Member

    Any chance to bump Debian 10 to 11?

  • angstrom Member, Moderator

    @nkmishra1997 said:
    We ran into the same issue. Upon further investigation, we found a debian bug which described our exact issue:
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964596

    We ended up applying the same patches manually (https://salsa.debian.org/cloud-team/debian-cloud-images/-/merge_requests/206/diffs and https://salsa.debian.org/cloud-team/debian-cloud-images/-/merge_requests/207/diffs) instead of using the new AMIs. This worked for us.

    Holy necro

    (Please don't necropost, especially not for your first post, and please read the rules)

    Seriously, is this still an issue more than two years later? The final message in that bug report says the issue was resolved in v10.5:

    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=964596#22
