Cloudjiffy is Down

yokowasis · June 2019

I can't access the client area, the control panel, or the main website.

This is the 2nd time this week. Where can I follow updates or news about the downtime? @leapswitch

uptime · June 2019

Have you checked to make sure it's plugged in?

leapswitch · June 2019

We are aware of this issue and are working on it. A DNS-level issue is causing this, and we are checking why the redundancies have not kicked in.

yokowasis · June 2019

@leapswitch said:
We are aware of this issue and working on the same. There is a DNS level issue causing this and we are checking why redundancies have not kicked in.

Okay. It's just that the last time I tried contacting your support (via live chat, because the client area wasn't working), he insisted that nothing was wrong and that the server had never been down, despite my isup.me screenshot clearly showing it was down.

I hope it gets resolved ASAP.

donli · June 2019

@yokowasis said:

I hope it got resolved ASAP.

You mean you hope it gets resolved in a jiffy.

leapswitch · June 2019

@yokowasis said:

@leapswitch said:
We are aware of this issue and working on the same. There is a DNS level issue causing this and we are checking why redundancies have not kicked in.

Okay, it's just the last time I tried contacting your support (live chat, because client area not working), he insist that there is nothing wrong, and the server never down, despite my screenshot of isup.me clearly stating it down.

I hope it got resolved ASAP.

It is resolved now. I sincerely apologize for the issue.

Could you please PM me the details of the last time you faced this issue? There has not been any outage before this apart from a scheduled network maintenance (25th May), for which an email was sent out.

Details of current issue -
We host the dashboard on 2 servers but with manual failover. In case of an issue with the primary, only the dashboard goes down and has to be manually switched over.
This does not affect running environments in any region.

However, in today's outage, the server running the dashboard physically failed and brought down the DNS cluster (spread over 4 servers). We are checking with the dev team for the cause, as the DNS cluster should never fail.

Our website and ticket system are hosted outside our regions; however, the DNS for them is on the same cluster.
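To make the dependency described above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the component names (`dns-cluster`, `website-host`, and so on) are hypothetical labels, not real hostnames, and the dependency map is a simplified reading of this post rather than the actual architecture.

```python
# Hypothetical dependency map based on the description in this post.
# All names are illustrative; they are not real CloudJiffy hostnames.
# Assumption: running environments do not depend on this DNS cluster.
DEPENDS_ON = {
    "dashboard": {"dashboard-primary", "dns-cluster"},
    "website": {"website-host", "dns-cluster"},
    "ticket-system": {"ticket-host", "dns-cluster"},
    "running-environments": {"region-nodes"},
}


def unreachable(failed):
    """Return the services that depend on at least one failed component."""
    failed = set(failed)
    return sorted(name for name, deps in DEPENDS_ON.items() if deps & failed)


def single_points_of_failure(min_services=2):
    """Return components whose failure alone takes down several services."""
    components = set().union(*DEPENDS_ON.values())
    return sorted(c for c in components if len(unreachable([c])) >= min_services)


if __name__ == "__main__":
    print(unreachable(["dashboard-primary"]))
    print(unreachable(["dns-cluster"]))
    print(single_points_of_failure())
```

In this model, losing only the primary dashboard server affects the dashboard alone, while losing the shared DNS cluster also makes the externally hosted website and ticket system unreachable, even though their own hosts are fine.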

Due to this failure, it is indeed time to review our architecture as well as communication in order to ensure such an issue never happens again.

We will send out a detailed RFO to affected clients once we have worked out what went wrong and how we plan to ensure this does not happen again.

Thank you.

yokowasis · June 2019

I forget exactly when it was. It was similar to this: less than one hour of downtime, and by the time I was chatting with support, the server was up again. So I can understand why he insisted that the server was not down. Also, as long as the server is up again, I don't really care what he thinks.

leapswitch · June 2019

We will reach out to you directly once we find and check your chat. Apart from the maintenance last Saturday/Sunday for about 4 minutes and today's outage, we have not had any downtime.

dahartigan · June 2019

Do the needful

leapswitch · June 2019

Short RFO -

Our current infrastructure in Pune West1 uses an ATS (Automatic Transfer Switch) for power redundancy at the rack level. Two power sources, A and B, supply power to the switch, which in turn is connected to the servers. Each server has 2 cables from the ATS for redundancy. On Sunday, the ATS in PR3R2 tripped, causing all nodes in that rack to restart. All nodes except 2 came back online, along with all containers, within 10 minutes. Containers and nodes in other racks were not affected.

These 2 were our oldest nodes (hn01, hn02). The containers on them were online and serving requests, but SSH and dashboard access to them was unavailable.

To ensure this does not happen again, we are separating the power sources and removing the ATS so that this single point of failure does not cause an outage.

However, as our user, you can also build redundancy into your application -

- Whenever you add 2 or more containers in a single layer (Load Balancer, Application Server, Database Server), they get spread over multiple nodes or Availability Zones. In case of an outage in one node or zone, your application remains online.
- We offer inbuilt replication for MySQL, MariaDB, PostgreSQL, MongoDB and CouchDB so that you can use a highly-available cluster without making any application changes.
- We offer Shared Block storage (currently single, with distributed coming soon) as well as replicated Object (Minio) and File storage (Addons > File Synchronization).
- We offer a CDN (Addons > CDN) which can bypass your load balancer and directly load balance requests between your application servers.
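As a rough illustration of the first point in the list above, here is a minimal Python sketch of why a second container in a layer keeps it online when one node fails. It assumes a simple round-robin placement; the function names and node names are hypothetical, and this is not a description of how the platform's scheduler actually works.

```python
from collections import defaultdict


def spread(containers, nodes):
    """Assign containers to nodes round-robin so replicas land on different nodes."""
    placement = defaultdict(list)
    for i, container in enumerate(containers):
        placement[nodes[i % len(nodes)]].append(container)
    return dict(placement)


def survives_any_single_node_failure(placement):
    """True if at least one container keeps running whichever single node fails."""
    total = sum(len(on_node) for on_node in placement.values())
    if total == 0:
        return False
    return all(total - len(on_node) > 0 for on_node in placement.values())


if __name__ == "__main__":
    nodes = ["node-a", "node-b"]  # hypothetical node names
    one = spread(["app-1"], nodes)
    two = spread(["app-1", "app-2"], nodes)
    print(one, survives_any_single_node_failure(one))
    print(two, survives_any_single_node_failure(two))
```

With a single container, the layer goes down whenever its node does. With two containers on two nodes, either node can fail and one container is still serving requests.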

We offer free consulting for converting your application architecture into a highly-available one. You can schedule a consulting slot here - https://calendly.com/cloudjiffy-demo

We are also working on adding new regions for DR with Mumbai, India and Frankfurt, Germany coming shortly.

We will be calculating and adding SLA credits for affected customers by this weekend, 9th June 2019.

Regards,

CloudJiffy Team
