
BuyVM down again? =(


Comments

  • RobertClarke Member, Host Rep

    Mine was down earlier...

  • xur17 Member

    Mine's been down for a little over an hour now. I think I'm making the full jump to Digital Ocean now. I've been switching to Digital Ocean whenever it goes down, but I'm tired of doing that.

  • raindog308 Administrator, Veteran

    @Jack said: he has posted on WHT but not here

    I feel betrayed!

    j/k

  • Francisco Top Host, Host Rep, Veteran

    Remember, I'm in Vegas right now, so I don't have all my email notifications etc., nor have I really been checking many forums.

    As I mentioned, the first port (I'm assuming the interface marked as the primary member) reset and got stuck in 'connecting' mode. The only fix was for me to pull the ethernet cable and re-seat it to make it fully clean up its act.

    There was a longer outage on my.frantech since I was cleaning up some cables and got a cable backwards. While this might sound odd, since eth0 is almost always the port closest to the standard I/O ports, Supermicro botched the first revision of the X9SCL and the ordering is reversed.

    I caught it a little after the fact, but by then Karen had called FH to let me know that something was busted.

    As mentioned, once Rob is ready for us and the Brocade, we'll be removing our need for LACP/LAGs and just letting BGP do its thang.

    Been making good progress on things. At this point I've completed everything I needed to be onsite to do: setting up 4 more storage nodes, 2 more KVM nodes, and 4 more OVZ 128MB nodes, replacing all outstanding bad RAM, and upgrading a bunch more nodes to L5638s.

    There are still a few nodes not on L5638s, but it seems I miscounted when I placed the original order and didn't have enough to finish them :( Oh well, next time.

    Francisco

  • netomx Moderator, Veteran

    @Francisco said: Remember, I'm in Vegas right now, so I don't have all my email notifications etc., nor have I really been checking many forums.

    Not an excuse, we hate you now, bye forever </3

    hahahah have a great night over there

  • @Francisco said: 4 more storage nodes, 2 more KVM nodes, 4 more OVZ 128MB nodes

    I smell new stock. Hehehe.

  • Francisco Top Host, Host Rep, Veteran

    @xur17 said: Mine's been down for a little over an hour now

    Nope.

    I wasn't even on the DC floor at that point. If you're judging based on manage.buyvm.net or buyvm.net loading, that's skewed, as mentioned. Past the first few minutes, the guys were going mental ringing me and Fiberhub off the hook. Not a single person at the company would have just sat on their thumbs through an hour of network issues without some sort of contact with me about it, especially since they knew I was < 10' from our cage.

    The issue was entirely out of our hands, short of us somehow taking the fault for using a software-based router. As also mentioned, that'll be gone from LAS within the month.

    NY doesn't use LACP or a bond of any sort since we only run a single port there. We could have just moved LV to a 10gig port (this was the original plan), but we have zero experience routing/forwarding 10gbit ports on both sides of a software router. It'd likely be fine, but who knows; it could quite possibly have caused more issues than it helped.

    Best wishes with your ventures,

    Francisco

  • @Francisco Strange that the LACP didn't fail over to using just the other link when one of the two failed; it's supposed to fail over. That's why I wrote above that it seems to be some sort of strange balancing setup :) (a sketch of this failure mode follows below)

    And I don't know why you're so afraid of 10G ports; I'd say go for it, it should save you a lot of headaches. Heck, your Brocade can probably even do L3 and has 10G ports, so you wouldn't even need the software router to switch packets.
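
    For illustration, a minimal Python sketch of the failure mode being described: a hash-balanced two-port LAG where one member is stuck 'up but not distributing'. The XOR-of-IPs hash and the addresses are assumptions for the sketch; the actual hash policy on the Brocade isn't stated in this thread.

        import ipaddress

        def lag_member(src_ip, dst_ip, live_ports):
            """Hash a flow's IP pair onto one of the LAG's member ports."""
            key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
            return live_ports[key % len(live_ports)]

        flows = [("10.0.0.%d" % i, "192.0.2.1") for i in range(1, 9)]

        # Healthy failover: port 0 reports link-down, so it leaves the
        # hash pool and every flow lands on the surviving port 1.
        for src, dst in flows:
            assert lag_member(src, dst, live_ports=[1]) == 1

        # The stuck case: port 0 still shows carrier, so it stays in the
        # pool, and only the flows hashed to it are blackholed.
        for src, dst in flows:
            port = lag_member(src, dst, live_ports=[0, 1])
            print("%s -> %s via port %d%s" % (src, dst, port,
                  " (BLACKHOLED)" if port == 0 else ""))

    A port that negotiates forever never leaves the pool, which would explain a partial, spotty outage rather than a clean up/down.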

  • @Francisco are you still using the software router to handle 10G?

  • Francisco Top Host, Host Rep, Veteran

    @rds100 said: @Francisco Strange that the LACP didn't fail over to using just the other link when one of the two failed; it's supposed to fail over. That's why I wrote above that it seems to be some sort of strange balancing setup :)

    That's what's confusing. All I can think is that eth0 is marked as the primary port, even though we have no way of saying otherwise. Maybe it's being forced that way on FH's Brocade?

    @rds100 said: And I don't know why you're so afraid of 10G ports; I'd say go for it, it should save you a lot of headaches. Heck, your Brocade can probably even do L3 and has 10G ports, so you wouldn't even need the software router to switch packets.

    Right now our router is based around BSD, and if you sniff around mailing lists regarding LAGG ports you'll see a few people reporting issues with ports randomly changing to 'status 8' (connecting). (See the watchdog sketch at the end of this post.)

    The Brocade is a CES 2024C with a dual-port 10gig add-on card and has no problem doing line rate in hardware :) We'll have an ACL-based port mirror that will mirror things to our autonull box.

    We don't need 10gbit of burst on our setup, so jumping into that doesn't do us any good and bumps up our pricing more. With that being said, I still have an LR-based 10gbit XFP module sitting in the rack just in case we do need to move to 10gbit.

    :)

    Francisco
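
    Since a 'status 8' port apparently stays in the lagg on its own, about the best you can do is watch for it. A rough watchdog sketch, assuming FreeBSD-style `ifconfig` output with "laggport:" lines (flag names can differ by release):

        import re
        import subprocess
        import time

        def lagg_members(iface="lagg0"):
            """Parse member ports and their flags from `ifconfig laggX`."""
            out = subprocess.check_output(["ifconfig", iface], text=True)
            # e.g. "laggport: em0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>"
            return re.findall(r"laggport:\s+(\S+)\s+flags=\w*<([^>]*)>", out)

        while True:
            for port, flags in lagg_members():
                if "DISTRIBUTING" not in flags.split(","):
                    print("ALERT: %s is not distributing (flags: %s)" % (port, flags))
            time.sleep(30)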

  • Francisco Top Host, Host Rep, Veteran

    @concerto49 said: @Francisco are you still using the software router to handle 10G?

    Only one half of it is 10gbit, namely the side facing the internal network. Since users have private LAN access, there isn't a whole lot of internal traffic that passes through the router itself. Monitoring our LACP facing FH and our 10gig port facing internal at the same time (just ifstat, very basic; a sketch of the idea is at the end of this post) shows that traffic levels are symmetrical (inbound on one side is ~= outbound on the other, and vice versa).

    While we don't really need the Brocade router, I got such a good deal on it that I may as well future-proof the network. I wish we had done the cut on Monday like I originally had in mind, but Rob was too busy and we wouldn't have had a ton of time to thrash the Brocade enough.

    Granted, if we had done the cut on Monday, the LACP blip today wouldn't have happened :P But this thread would have come up anyway, as someone wouldn't have bothered reading the announcement.

    Francisco
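
    The ifstat-style check described above is easy to approximate: inbound on the uplink should roughly match outbound on the internal port, and vice versa. A sketch using psutil; the interface names here are placeholders.

        import time
        import psutil  # third-party: pip install psutil

        def rates(iface, interval=1.0):
            """Rough in/out byte rates for one interface over `interval` seconds."""
            a = psutil.net_io_counters(pernic=True)[iface]
            time.sleep(interval)
            b = psutil.net_io_counters(pernic=True)[iface]
            return ((b.bytes_recv - a.bytes_recv) / interval,
                    (b.bytes_sent - a.bytes_sent) / interval)

        up_in, up_out = rates("lagg0")  # the LACP facing FH
        in_in, in_out = rates("ix0")    # the 10gig facing the nodes

        # On a pure transit router these pairs should track each other.
        print("uplink in  %10.0f B/s vs internal out %10.0f B/s" % (up_in, in_out))
        print("uplink out %10.0f B/s vs internal in  %10.0f B/s" % (up_out, in_in))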

  • @Francisco you mean Fiberhub would charge more for a 2G cap (no burst) on a 10G port, compared to 2x1G bonded ports? That's strange.

  • Francisco Top Host, Host Rep, Veteran

    @rds100 said: @Francisco you mean Fiberhub would charge more for a 2G cap (no burst) on a 10G port, compared to 2x1G bonded ports? That's strange.

    They bill at the 95th percentile (a quick sketch of how that works is below), so they have to keep room in case we do need to burst. Any sort of cap on the 10gig I'd have to do on my own, and doing that in software would likely make the router melt under any sort of decent flood.

    Again, the $500/m extra that Rob was going to want due to the fiber uplink isn't where the concern was; it's more that we don't want to just throw money at a problem to see if it helps and end up with an even bigger mess that we can't get out of.

    Rob doesn't do 'test' 10gbit uplinks either since they'd still have to go out to buy the blade.

    Don't get the wrong idea here. None of this was the fault of FH, or even really us, short of the fact that we decided many years ago to use unix-based routers. When I looked up reports on BSD regarding LACPs being dropped, the only ones I could find were a few years old without any new information, so I assumed it got patched back in 8.x or even 9.x.

    TL;DR - It's annoying, but it wasn't caused by anything anyone was doing at the time. We've worked hard to remove all LACPs from our network, so this very issue will be a thing of the past as early as next week, Rob willing.

    Francisco
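
    For anyone unfamiliar with the 95th-percentile billing mentioned above, the usual scheme: sample the port every 5 minutes for a month, sort the samples, throw away the top 5%, and bill the highest sample left. A toy example:

        def percentile_95(mbps_samples):
            """Return the billable 95th-percentile of 5-minute samples."""
            ordered = sorted(mbps_samples)
            cutoff = int(len(ordered) * 0.95) - 1  # highest index left after dropping the top 5%
            return ordered[max(cutoff, 0)]

        # ~8640 five-minute samples in a 30-day month: steady ~1800 Mbps
        # with bursts to 9000 Mbps a bit more than 5% of the time.
        samples = [1800] * 8200 + [9000] * 440
        print(percentile_95(samples), "Mbps billable")  # -> 9000: bursts count

        # Burst under 5% of the month and it's effectively free:
        samples = [1800] * 8300 + [9000] * 340
        print(percentile_95(samples), "Mbps billable")  # -> 1800

    Which is why the carrier has to keep headroom on the port: any burst lasting more than 5% of the month lands on the bill.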

  • You use L5638s? Wouldn't LEB standards consider them old procs :)

  • @bdtech said: Wouldn't LEB standards consider them old procs :)

    Bah, they are still good horses.

  • Francisco Top Host, Host Rep, Veteran

    @bdtech said: You use L5638s? Wouldn't LEB standards consider them old procs :)

    We need the core counts due to how hard users slam things. While I rolled out the newest revision of monbot just before I left and it has kept nodes under good control, I still like the buffer.

    The L5638s are older, but at 24 cores to a node they do quite well. They're maybe 20% slower than a same-clocked E5, but instead of costing me ~$2000/node to upgrade like the E5s would, I only spend about $600/node for the newer procs (rough math below).

    Francisco
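
    Back-of-envelope on the numbers above (the ~20% and the dollar figures are straight from the post):

        e5_cost, e5_perf = 2000.0, 1.00       # E5 upgrade path, baseline perf
        l5638_cost, l5638_perf = 600.0, 0.80  # ~20% slower, per the post

        print("E5:    %.2f perf per $1k" % (e5_perf / e5_cost * 1000))
        print("L5638: %.2f perf per $1k" % (l5638_perf / l5638_cost * 1000))
        # ~0.50 vs ~1.33, i.e. roughly 2.7x the performance per upgrade dollar.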

  • I want this awesome live monitor for me.... :(

    http://buyvmstatus.com/live

  • @Francisco said: at 24 cores

    o_O?
    Quad-socket mobos? Or do you mean HT cores?

  • xur17 Member
    edited March 2013

    @Francisco said: Nope. I wasn't even on the DC floor at that point. If you're judging based on manage.buyvm.net or buyvm.net loading, that's skewed, as mentioned. Past the first few minutes, the guys were going mental ringing me and Fiberhub off the hook. Not a single person at the company would have just sat on their thumbs through an hour of network issues without some sort of contact with me about it, especially since they knew I was < 10' from our cage.

    I'm not sure what you mean by that; my node was definitely down for over an hour. I just checked Pingdom, and it confirms 1h 45m of downtime.

    The downtime really isn't a problem for most of my sites, but one of them is popular enough that it makes sense to pay a little more and get something more stable.

  • Francisco Top Host, Host Rep, Veteran

    @xur17 said: I'm not sure what you mean by that; my node was definitely down for over an hour. I just checked Pingdom, and it confirms 1h 45m of downtime.

    What I mean by that is that our own NodePing reports show ~25 minutes. Maybe you were on a node that was in the middle of an fsck, since that does happen, but the whole network blip was < 25 minutes. The guys waited 3-4 minutes before calling FH. FH ran out and let me know what was up. I spent 5-10 minutes debugging since it was spotty, not a full up/down case, then spent the rest of the time getting the crash cart on the router and things fixed up.

    We have all of our nodes in NodePing, and given how many IPs were involved, both ports would have been used in the LACP hashing (see the sketch below).

    Much like most LEB hosts out there, we don't trust Pingdom given how many false positives it gives.

    @yomero said: o_O?

    Quad-socket mobos? Or do you mean HT cores?

    Right, HT comes into play at that point :) The CPUs haul ass either way and were worth the cash.

    Francisco
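
    A quick sketch of the hashing point: probe enough distinct IPs and the LAG hash spreads the checks over both member ports, so a single wedged port can't hide from monitoring. (The XOR hash and the addresses are stand-ins, as before.)

        import ipaddress
        from collections import Counter

        probe_src = int(ipaddress.ip_address("198.51.100.10"))  # monitor's IP
        node_ips = [int(ipaddress.ip_address("10.0.%d.%d" % (i // 200, i % 200 + 1)))
                    for i in range(400)]  # placeholder node addresses

        ports = Counter((probe_src ^ ip) % 2 for ip in node_ips)
        print("probes per LAG port:", dict(ports))  # both ports carry probes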

  • shovenose Member, Host Rep

    For the record, we tried Pingdom but did not continue after the trial because it sent us hundreds of downtime notifications when the downtime didn't really exist. Use something more reliable.

  • Francisco Top Host, Host Rep, Veteran
    edited March 2013

    @Jack said: That's called packet loss, buddy. Check the root-cause analysis in the Pingdom panel; it shows why it said it was down/why it failed.

    I dunno about that. You can ask kujoe or Tim; I'm fairly sure they both ignore Pingdom SLA claims because of the same crap.

    NodePing has been decent for us, granted we only do HTTP checks since ICMP can be annoying at times (a minimal example of such a check is below). Only their TX location ever really alerts us.

    Francisco
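
    A minimal example of the HTTP-instead-of-ICMP style of check: an HTTP GET exercises the whole path (routing, TCP, the web stack), while ICMP can get deprioritized or filtered and cry wolf. The URL here is a placeholder.

        import urllib.request

        def http_check(url, timeout=10):
            """True if the URL answers with a 2xx/3xx inside the timeout."""
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return 200 <= resp.status < 400
            except Exception:
                return False

        print(http_check("http://example.com/"))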
