Post Mortem on Cloudflare Control Plane and Analytics Outage

FAT32 Administrator, Deal Compiler Extraordinaire

https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/

It isn't too informative, but it does show the commitment Cloudflare is putting into ensuring high availability of all their services.

Comments

  • Honestly, people sometimes complain that Cloudflare does occasionally go down but I'm genuinely amazed that it's up at all.

    Given their levels of traffic, and the flexibility their setup needs to accommodate it's really amazing any of it works at all.

  • Clouvider Member, Patron Provider

    Hope they got a few good nights' sleep after this incident.

    Thanked by 3FAT32 emgh COLBYLICIOUS
  • 1 tech that started a week earlier. He/She must have been sweating buckets.

    A good, informative read. Thanks for sharing; I probably wouldn't have gone looking for it.

  • Levi Member
    edited November 2023

    The post mortem nicely covered how some vital parts of CF infra were not HA, particularly new features. A "just roll it out, we will sort this out later" approach. The majority of stuff was single-homed. What a shame.

    P.S. CF's share price actually went up during the outage.

    Thanked by 2darkimmortal loay
  • jsg Member, Resident Benchmarker
    edited November 2023

    I've learned basically two things from that:

    • Matthew Prince seems to be a decent down to earth person who seems to not play hide and seek when SHTF but to bet on transparency (as far as that's possible).
    • His company however seems to be even crappier than I thought. In particular and among other negatives they seem to quite carelessly select, even critical, DCs.

    Flexentia (or similar, their critical infrastructure DC provider) seems to be a real sh#t show. Some examples:

    • they seem to not even have a CTO, but a "chief of innovation". A clear and hard no go.
    • they - at a major DC - had only one single techie, and a rather new one at that, on site.
    • they pretty much outsourced their whole energy operation to a utility corporation that is obviously incapable or not interested in their customers' survival.
    • they (or PGE?) are utterly stupid and didn't put their "the house is on fire" access control system on a UPS, preferably an independent one and dual.

    TL;DR

    Matthew Prince should make sure to actually and really control his ship and not let evidently not top-class people, especially in innovations, sink it.

    And Flexblabla [or whatever] seems to be a bad joke at best. Tier 3, my ass. They may be adequate for John, Mary, and their dog customers like small businesses in Oakland (well, what's left) but for a major and critical for the internet customer their funny "tier 3" fun box evidently, blinking evidently isn't adequate.

    Lessons learned: (a) keep innovators (and crap boxes with a chief innovator) on a tight leash, and (b) Use adequate tools and critical infrastructure providers!

  • raindog308 Administrator, Veteran

    @jsg said: Tier 3, my ass.

    And not very transparent. I looked on their site for a status page or outage update page or anything similar and couldn't find it.

    Thanked by 1jsg
  • jar Patron Provider, Top Host, Veteran

    @ehhthing said:
    Honestly, people sometimes complain that Cloudflare does occasionally go down but I'm genuinely amazed that it's up at all.

    Given their levels of traffic, and the flexibility their setup needs to accommodate it's really amazing any of it works at all.

    All of that uptime is definitely relevant. No one can pull off 100% over a long enough time frame.

    Thanked by 1emgh
  • @raindog308 said:

    @jsg said: Tier 3, my ass.

    And not very transparent. I looked on their site for a status page or outage update page or anything similar and couldn't find it.

    Their status page is here: https://www.cloudflarestatus.com/

    Thanked by 1COLBYLICIOUS
  • raindog308 Administrator, Veteran

    I meant a Flexential status page.

    Thanked by 2emgh DanSummer
  • "viawest - proprietary and confidential" at bottom of the diagram. Ballsy cloudflare showed it publicly lmao

  • Francisco Top Host, Host Rep, Veteran

    @jsg said: And Flexblabla [or whatever] seems to be a bad joke at best.

    They've been taking non stop L's at the moment.

    They had that big power failure on the east coast a few months ago that made @EthernetServers move. There was the fire that happened in LAX via Krypt. Then there was the dedipath drama.

    Francisco

    Thanked by 2jsg emgh
  • @cupcake said:
    "viawest - proprietary and confidential" at bottom of the diagram. Ballsy cloudflare showed it publicly lmao

    Heh, yeah. I think they know Flexential is not exactly in a great position to go after them. The blog post pretty strongly hints that, in Cloudflare's opinion, Flexential has breached their colocation agreement in a number of ways.

  • Even if the post is light on certain details (and heavy on blame for Flexential), I really respect companies that are so open about their internal operations. Backblaze is another good example of this.

    I also love when a CEO of a technology company actually seems to have some technical knowledge and doesn't use only business buzzwords. I'm not sure how Prince is as a manager, but this post certainly makes it sound like he's in the trenches.

  • jsg Member, Resident Benchmarker

    @Francisco said:

    @jsg said: And Flexblabla [or whatever] seems to be a bad joke at best.

    They've been taking non stop L's at the moment.

    Sorry me sometimes stupid with English. "taking L's" means what?

  • Francisco Top Host, Host Rep, Veteran

    @jsg said: Sorry me sometimes stupid with English. "taking L's" means what?

    "Taking L's" = losses, failures, etc.
    "Taking W's" = wins, etc.

    Francisco

    Thanked by 1jsg
  • jsg Member, Resident Benchmarker

    @raindog308 said:

    @jsg said: Tier 3, my ass.

    And not very transparent. I looked on their site for a status page or outage update page or anything similar and couldn't find it.

    Probably yet another masterpiece of their chief of innovation ... *smirk*

    "A tech corp doesn't need a CTO, let alone a status page. Our status is always 'we are sooo innovative'"

  • IMHO this part alone speaks volumes about the company

    Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning. That decision delayed our full recovery, but I believe made it less likely that we’d compound this situation with additional mistakes.

  • Hell... I was changing my nameservers to Cloudflare before I could even log in to my CF dashboard... then the nightmare began. I've been stuck for 3 days. Luckily it's a development site!

  • EthernetServers Member, Patron Provider

    Cloudflare may not be perfect - no business is :)

    But one thing I do have respect for is the fact they bother to communicate properly - both during and after incidents.

    Plenty of businesses don't do that and I think that's what riles customers and/or users more than anything.

    Thanked by 10xC7
  • @Fabian47 said:
    Hell... I was changing my nameservers to Cloudflare before I could even log in to my CF dashboard... then the nightmare began. I've been stuck for 3 days. Luckily it's a development site!

    I had no troubles with Cloudflare DNS though.

  • @EthernetServers said: no business is

    How dare you discredit @Francisco like that.

  • @jsg said:
    I've learned basically two things from that:

    • Matthew Prince seems to be a decent down to earth person who seems to not play hide and seek when SHTF but to bet on transparency (as far as that's possible).
    • His company however seems to be even crappier than I thought. In particular and among other negatives they seem to quite carelessly select, even critical, DCs.

    As an ex-CF employee I would say it's the opposite. I would definitely not characterize Matthew as down to earth or a decent person. He has an excellent public image, but internally it's quite the opposite. Anyone who has been on a call with him can confirm, especially if competition was discussed. The CEO/CTO duo were the worst part of working at CF.

    The company, on the other hand, is surprisingly good, with the engineering people being the smartest people I've ever met. Unfortunately that didn't apply to product or project managers at all, most of whom had zero technical knowledge or experience in IT. Most came from completely unrelated fields like logistics, oil, clothing... So if a CF product you love seems to have strange priorities, this is why.

    Still, despite the issues and the complaints, I've yet to see a company able to compete with CF on innovation, speed of development or pricing. Every "birthday week" they basically kill a dozen startups.

    Thanked by 1sillycat
  • good read, thank you

  • Should I invest 100k in CF and hope it's the next Google in the future?

  • jsg Member, Resident Benchmarker
    edited November 2023

    @jimaek said:

    @jsg said:
    I've learned basically two things from that:

    • Matthew Prince seems to be a decent down to earth person who seems to not play hide and seek when SHTF but to bet on transparency (as far as that's possible).
    • His company however seems to be even crappier than I thought. In particular and among other negatives they seem to quite carelessly select, even critical, DCs.

    As an ex-CF employee I would say it's the opposite. I would definitely not characterize Matthew as down to earth or a decent person. He has an excellent public image, but internally it's quite the opposite. Anyone who has been on a call with him can confirm, especially if competition was discussed. The CEO/CTO duo were the worst part of working at CF.

    The company, on the other hand, is surprisingly good, with the engineering people being the smartest people I've ever met. Unfortunately that didn't apply to product or project managers at all, most of whom had zero technical knowledge or experience in IT. Most came from completely unrelated fields like logistics, oil, clothing... So if a CF product you love seems to have strange priorities, this is why.

    Still, despite the issues and the complaints, I've yet to see a company able to compete with CF on innovation, speed of development or pricing. Every "birthday week" they basically kill a dozen startups.

    Based on his public statement I see a quite different man.

    Congrats btw wrt all the Mr. Prince not all liking "innovation" (ex) employees! The innovations detailed in Mr. Prince's statement certainly succeeded in gaining lots of attention.
    And an extra special award goes to Flex[whatever]. Very innovative indeed!

    From what I saw, Mr. Prince's statement also clearly shows a man who acts in an honest, transparent, and straight way (as far as possible). It's not just image blabla; he actually and really apologized, stated that the company f#cked up, and tried to quite frankly lay out what went terribly wrong and why.

    (Btw: can you show me any not-small company where each and every top manager is well liked by all employees? I'm not holding my breath.)
