
BuyVM down again? =(


Comments

  • RobertClarke Member, Host Rep

    Mine was down earlier...

  • xur17 Member

    Mine's been down for a little over an hour now. I think I'm making the full jump to Digital Ocean now. I've been switching to Digital Ocean whenever it goes down, but I'm tired of doing that.

  • raindog308 Administrator, Veteran

    @Jack said: he has posted on WHT but not here

    I feel betrayed!

    j/k

  • Francisco Top Host, Host Rep, Veteran

    Remember, I'm in Vegas right now, so I don't have all my email notifications etc., nor have I really been checking many forums.

    As I mentioned, the first port (I'm assuming the interface marked as the primary member) reset and got stuck in 'connecting' mode. The only fix was for me to pull the ethernet cable and re-seat it to make it fully clean up its act.

    There was a longer outage on my.frantech since I was cleaning up some cables and got a cable backwards. While this might sound odd, since eth0 is almost always the port closest to the standard I/O ports, Supermicro botched the first revision of the X9SCL and the ordering is reversed.

    I caught it a little after the fact, but by then Karen had called FH to let me know that something was busted.

    As mentioned, once Rob is ready for us and the Brocade, we'll be removing our need for LACP/LAGs and just letting BGP do its thang.

    Been making good progress on things. At this point I've completed everything I needed to be onsite to do: setting up 4 more storage nodes, 2 more KVM nodes, and 4 more OVZ 128MB nodes, replacing all outstanding bad RAM, and upgrading a bunch more nodes to L5638s.

    There are still a few nodes not on L5638s, but it seems I miscounted when I placed the original order and didn't have enough to finish them :( Oh well, next time.

    Francisco

  • netomx Moderator, Veteran

    @Francisco said: Remember, I'm in Vegas right now, so I don't have all my email notifications etc., nor have I really been checking many forums.

    Not an excuse, we hate you now, bye forever </3

    hahahah have a great night over there

  • @Francisco said: 4 more storage nodes, 2 more KVM nodes, 4 more OVZ 128MB nodes

    I smell new stock. Hehehe.

  • Francisco Top Host, Host Rep, Veteran

    @xur17 said: Mine's been down for a little over an hour now

    Nope.

    I wasn't even on the DC floor at that point. If you're judging based on manage.buyvm.net or buyvm.net loading, that's skewed, as mentioned. Past the first few minutes, the guys were going mental ringing me and Fiberhub off the hook. Not a single person at the company would have just sat on their thumbs through an hour of network issues without some sort of contact with me about it, especially since they knew I was < 10' from our cage.

    The issue was entirely out of our hands, short of us somehow taking the fault for using a software-based router. As also mentioned, that'll be gone from LAS within the month.

    NY doesn't use LACP or a bond of any sort since we only run a single port there. We could have just moved LV to a 10gig port (this was the original plan), but we have zero experience routing/forwarding 10gbit ports on both sides of a software router. It'd likely be fine, but who knows; it could quite possibly have caused more issues than it helped.

    Best wishes with your ventures,

    Francisco

  • @Francisco Strange that the LACP didn't fail over to using just the other link when one of the two failed; it's supposed to fail over. That's why I wrote above that it seems to be some sort of strange balancing setup :) (a sketch of this failure mode follows below)

    And I don't know why you're so afraid of 10G ports; I'd say go for it, it should save you a lot of headaches. Heck, your Brocade can probably even do L3 and has 10G ports, so you wouldn't even need the software router to switch packets.
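
    For illustration, a minimal Python sketch of the failure mode being described: a hash-balanced two-port LAG where one member is stuck 'up but not distributing'. The XOR-of-IPs hash and the addresses are assumptions for the sketch; the actual hash policy on the Brocade isn't stated in this thread.

        import ipaddress

        def lag_member(src_ip, dst_ip, live_ports):
            """Hash a flow's IP pair onto one of the LAG's member ports."""
            key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
            return live_ports[key % len(live_ports)]

        flows = [("10.0.0.%d" % i, "192.0.2.1") for i in range(1, 9)]

        # Healthy failover: port 0 reports link-down, so it leaves the
        # hash pool and every flow lands on the surviving port 1.
        for src, dst in flows:
            assert lag_member(src, dst, live_ports=[1]) == 1

        # The stuck case: port 0 still shows carrier, so it stays in the
        # pool, and only the flows hashed to it are blackholed.
        for src, dst in flows:
            port = lag_member(src, dst, live_ports=[0, 1])
            print("%s -> %s via port %d%s" % (src, dst, port,
                  " (BLACKHOLED)" if port == 0 else ""))

    A port that negotiates forever never leaves the pool, which would explain a partial, spotty outage rather than a clean up/down.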

  • @Francisco are you still using the software router to handle 10G?

  • Francisco Top Host, Host Rep, Veteran

    @rds100 said: @Francisco Strange that the LACP didn't fail over to using just the other link when one of the two failed; it's supposed to fail over. That's why I wrote above that it seems to be some sort of strange balancing setup :)

    That's what's confusing. All I can think is that eth0 is marked as the primary port, even though we have no way of saying otherwise. Maybe it's being forced that way on FH's Brocade?

    @rds100 said: And I don't know why you're so afraid of 10G ports; I'd say go for it, it should save you a lot of headaches. Heck, your Brocade can probably even do L3 and has 10G ports, so you wouldn't even need the software router to switch packets.

    Right now our router is based around BSD, and if you sniff around mailing lists regarding LAGG ports you'll see a few people reporting issues with ports randomly changing to 'status 8' (connecting). (See the watchdog sketch at the end of this post.)

    The Brocade is a CES 2024C with a dual-port 10gig add-on card and has no problem doing line rate in hardware :) We'll have an ACL-based port mirror that will mirror things to our autonull box.

    We don't need 10gbit of burst on our setup, so jumping into that doesn't do us any good and bumps up our pricing more. With that being said, I still have an LR-based 10gbit XFP module sitting in the rack just in case we do need to move to 10gbit.

    :)

    Francisco
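
    Since a 'status 8' port apparently stays in the lagg on its own, about the best you can do is watch for it. A rough watchdog sketch, assuming FreeBSD-style `ifconfig` output with "laggport:" lines (flag names can differ by release):

        import re
        import subprocess
        import time

        def lagg_members(iface="lagg0"):
            """Parse member ports and their flags from `ifconfig laggX`."""
            out = subprocess.check_output(["ifconfig", iface], text=True)
            # e.g. "laggport: em0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>"
            return re.findall(r"laggport:\s+(\S+)\s+flags=\w*<([^>]*)>", out)

        while True:
            for port, flags in lagg_members():
                if "DISTRIBUTING" not in flags.split(","):
                    print("ALERT: %s is not distributing (flags: %s)" % (port, flags))
            time.sleep(30)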

  • Francisco Top Host, Host Rep, Veteran

    @concerto49 said: @Francisco are you still using the software router to handle 10G?

    Only one half of it is 10gbit, namely the side facing the internal network. Since users have private LAN access, there isn't a whole lot of internal traffic that passes through the router itself. Monitoring our LACP facing FH and our 10gig port facing internal at the same time (just ifstat, very basic; a sketch of the idea is at the end of this post) shows that traffic levels are symmetrical (inbound on one side is ~= outbound on the other, and vice versa).

    While we don't really need the Brocade router, I got such a good deal on it that I may as well future-proof the network. I wish we had done the cut on Monday like I originally had in mind, but Rob was too busy and we wouldn't have had a ton of time to thrash the Brocade enough.

    Granted, if we had done the cut on Monday, the LACP blip today wouldn't have happened :P But this thread would have come up anyway, as someone wouldn't have bothered reading the announcement.

    Francisco
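
    The ifstat-style check described above is easy to approximate: inbound on the uplink should roughly match outbound on the internal port, and vice versa. A sketch using psutil; the interface names here are placeholders.

        import time
        import psutil  # third-party: pip install psutil

        def rates(iface, interval=1.0):
            """Rough in/out byte rates for one interface over `interval` seconds."""
            a = psutil.net_io_counters(pernic=True)[iface]
            time.sleep(interval)
            b = psutil.net_io_counters(pernic=True)[iface]
            return ((b.bytes_recv - a.bytes_recv) / interval,
                    (b.bytes_sent - a.bytes_sent) / interval)

        up_in, up_out = rates("lagg0")  # the LACP facing FH
        in_in, in_out = rates("ix0")    # the 10gig facing the nodes

        # On a pure transit router these pairs should track each other.
        print("uplink in  %10.0f B/s vs internal out %10.0f B/s" % (up_in, in_out))
        print("uplink out %10.0f B/s vs internal in  %10.0f B/s" % (up_out, in_in))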

  • @Francisco you mean Fiberhub would charge more for a 2G cap (no burst) on a 10G port, compared to 2x1G bonded ports? That's strange.

  • Francisco Top Host, Host Rep, Veteran

    @rds100 said: @Francisco you mean Fiberhub would charge more for a 2G cap (no burst) on a 10G port, compared to 2x1G bonded ports? That's strange.

    They bill at the 95th percentile (a quick sketch of how that works is below), so they have to keep room in case we do need to burst. Any sort of cap on the 10gig I'd have to do on my own, and doing that in software would likely make the router melt under any sort of decent flood.

    Again, the $500/m extra that Rob was going to want due to the fiber uplink isn't where the concern was; it's more that we don't want to just throw money at a problem to see if it helps and end up with an even bigger mess that we can't get out of.

    Rob doesn't do 'test' 10gbit uplinks either since they'd still have to go out to buy the blade.

    Don't get the wrong idea here. None of this was the fault of FH, or even really us, short of the fact that we decided many years ago to use unix-based routers. When I looked up reports on BSD regarding LACPs being dropped, the only ones I could find were a few years old without any new information, so I assumed it got patched back in 8.x or even 9.x.

    TL;DR - It's annoying, but it wasn't caused by anything anyone was doing at the time. We've worked hard to remove all LACPs from our network, so this very issue will be a thing of the past as early as next week, Rob willing.

    Francisco
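
    For anyone unfamiliar with the 95th-percentile billing mentioned above, the usual scheme: sample the port every 5 minutes for a month, sort the samples, throw away the top 5%, and bill the highest sample left. A toy example:

        def percentile_95(mbps_samples):
            """Return the billable 95th-percentile of 5-minute samples."""
            ordered = sorted(mbps_samples)
            cutoff = int(len(ordered) * 0.95) - 1  # highest index left after dropping the top 5%
            return ordered[max(cutoff, 0)]

        # ~8640 five-minute samples in a 30-day month: steady ~1800 Mbps
        # with bursts to 9000 Mbps a bit more than 5% of the time.
        samples = [1800] * 8200 + [9000] * 440
        print(percentile_95(samples), "Mbps billable")  # -> 9000: bursts count

        # Burst under 5% of the month and it's effectively free:
        samples = [1800] * 8300 + [9000] * 340
        print(percentile_95(samples), "Mbps billable")  # -> 1800

    Which is why the carrier has to keep headroom on the port: any burst lasting more than 5% of the month lands on the bill.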

  • You use L5638s? Wouldn't LEB standards consider them old procs :)

  • @bdtech said: Wouldn't LEB standards consider them old procs :)

    Bah, they are still good horses.

  • Francisco Top Host, Host Rep, Veteran

    @bdtech said: You use L5638s? Wouldn't LEB standards consider them old procs :)

    We need the core counts due to how hard users slam things. While I rolled out the newest revision of monbot just before I left and it has kept nodes under good control, I still like the buffer.

    The L5638s are older, but at 24 cores to a node they do quite well. They're maybe 20% slower than a same-clocked E5, but instead of costing me ~$2000/node to upgrade like the E5s would, I only spend about $600/node for the newer procs (rough math below).

    Francisco
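
    Back-of-envelope on the numbers above (the ~20% and the dollar figures are straight from the post):

        e5_cost, e5_perf = 2000.0, 1.00       # E5 upgrade path, baseline perf
        l5638_cost, l5638_perf = 600.0, 0.80  # ~20% slower, per the post

        print("E5:    %.2f perf per $1k" % (e5_perf / e5_cost * 1000))
        print("L5638: %.2f perf per $1k" % (l5638_perf / l5638_cost * 1000))
        # ~0.50 vs ~1.33, i.e. roughly 2.7x the performance per upgrade dollar.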

  • I want this awesome live monitor for me.... :(

    http://buyvmstatus.com/live

  • @Francisco said: at 24 cores

    o_O?
    Quad-socket mobos? Or do you mean HT cores?

  • xur17 Member
    edited March 2013

    @Francisco said: Nope. I wasn't even on the DC floor at that point. If you're judging based on manage.buyvm.net or buyvm.net loading, that's skewed, as mentioned. Past the first few minutes, the guys were going mental ringing me and Fiberhub off the hook. Not a single person at the company would have just sat on their thumbs through an hour of network issues without some sort of contact with me about it, especially since they knew I was < 10' from our cage.

    I'm not sure what you mean by that; my node was definitely down for over an hour. I just checked Pingdom, and it confirms 1h 45m of downtime.

    The downtime really isn't a problem for most of my sites, but one of them is popular enough that it makes sense to pay a little more and get something more stable.

  • Francisco Top Host, Host Rep, Veteran

    @xur17 said: I'm not sure what you mean by that; my node was definitely down for over an hour. I just checked Pingdom, and it confirms 1h 45m of downtime.

    What I mean by that is that our own NodePing reports show ~25 minutes. Maybe you were on a node that was in the middle of an fsck, since that does happen, but the whole network blip was < 25 minutes. The guys waited 3-4 minutes before calling FH. FH ran out and let me know what was up. I spent 5-10 minutes debugging since it was spotty, not a full up/down case, then spent the rest of the time getting the crash cart on the router and things fixed up.

    We have all of our nodes in NodePing, and given how many IPs were involved, both ports would have been used in the LACP hashing (see the sketch below).

    Much like most LEB hosts out there, we don't trust Pingdom given how many false positives it gives.

    @yomero said: o_O?

    Quad-socket mobos? Or do you mean HT cores?

    Right, HT comes into play at that point :) The CPUs haul ass either way and were worth the cash.

    Francisco
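
    A quick sketch of the hashing point: probe enough distinct IPs and the LAG hash spreads the checks over both member ports, so a single wedged port can't hide from monitoring. (The XOR hash and the addresses are stand-ins, as before.)

        import ipaddress
        from collections import Counter

        probe_src = int(ipaddress.ip_address("198.51.100.10"))  # monitor's IP
        node_ips = [int(ipaddress.ip_address("10.0.%d.%d" % (i // 200, i % 200 + 1)))
                    for i in range(400)]  # placeholder node addresses

        ports = Counter((probe_src ^ ip) % 2 for ip in node_ips)
        print("probes per LAG port:", dict(ports))  # both ports carry probes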

  • shovenose Member, Host Rep

    For the record, we tried Pingdom but did not continue after the trial because it sent us hundreds of downtime notifications when the downtime didn't really exist. Use something more reliable.

  • Francisco Top Host, Host Rep, Veteran
    edited March 2013

    @Jack said: That's called packet loss, buddy. Check the root-cause analysis in the Pingdom panel; it shows why it said it was down/why it failed.

    I dunno about that. You can ask kujoe or Tim; I'm fairly sure they both ignore Pingdom SLA claims because of the same crap.

    NodePing has been decent for us, granted we only do HTTP checks since ICMP can be annoying at times (a minimal example of such a check is below). Only their TX location ever really alerts us.

    Francisco
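
    A minimal example of the HTTP-instead-of-ICMP style of check: an HTTP GET exercises the whole path (routing, TCP, the web stack), while ICMP can get deprioritized or filtered and cry wolf. The URL here is a placeholder.

        import urllib.request

        def http_check(url, timeout=10):
            """True if the URL answers with a 2xx/3xx inside the timeout."""
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return 200 <= resp.status < 400
            except Exception:
                return False

        print(http_check("http://example.com/"))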
