November 13th Service Outage Post-mortem

Dear customer,

On November 13th at 14:05 CST, the gams data platform suffered a major outage that made node data unavailable through the API and the dynamic views (tablet or TV displays). The incident ended on November 14th at 9:03 CST, and all services are now back online. We are closely monitoring the platform to make sure all services keep running properly for you.

After checking all the measurements, I can assure you that no data was lost, thanks to our exclusive smart buffer feature. We are committed to open and transparent communication with our customers, so this post details the problems we found during this first platform outage and the corrective steps we are taking to prevent similar incidents.

Please feel free to reach out to me or the team through your WeChat group if you would like more information on this incident or on the content of this post-mortem.
You can check the status of our different services at any time through our status page: https://status.measureofquality.com/

Stefan Berder
CEO

 

What happened?

After a routine system security upgrade, we restarted our main data server. This is the kind of operation we usually perform without customers noticing, thanks to our robust platform design. This time, though, the server refused to restart. Suspecting a disk problem, we started working with Aliyun support. At around 18:00 CST, Aliyun support concluded that everything was working properly on their side, so the problem had to lie with the system itself. We decided to snapshot and back up our main disk, start fresh on a new server, and perform a recovery from the snapshot. After some digging, we realized that the boot issue was caused by the kernel installed by the security update we had applied. We decided to use the fresh server as our new base and recover all the data onto it. That process took most of the night but was made easier by our recovery process. All services came back online on November 14th at 9:03 CST, after a night of hard work from our tech team.

A check of all affected nodes shows no data loss, thanks to the early-2017 node upgrade that introduced the smart buffer feature.

 

What are we doing about this?

Several things made the recovery harder than it should have been. Here are the steps we are taking to avoid this type of major outage and to recover more quickly should such an incident happen again.
Replication of the data services. We are now working on duplicating the data services on a different server farm, so that we have a backup to fall back on if the main server goes down. This was an item on our operations roadmap that had not been prioritized. We will have a replicated service by the end of next week.
Automation of recovery from backups. We had never tested our backups in a real emergency. As is often the case, having backups is not the end of the operational duty: testing the recovery process is just as important. The recovery was needlessly long and could have been swift and painless had it been properly automated and tested beforehand.
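
For readers curious what an automated recovery test can look like, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not our actual tooling: the names `checksum_tree` and `restore_drill` are hypothetical, and a plain directory archived with the standard library stands in for a real disk snapshot. The idea is to take a backup, restore it into a fresh location, and confirm that every restored file matches the original.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum_tree(root: Path) -> dict:
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_drill(source_dir: Path, workdir: Path) -> bool:
    """Back up source_dir, restore it into workdir, and compare checksums.

    Hypothetical stand-in for a real snapshot/restore cycle.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    # Step 1: take the backup (a gzip-compressed tar archive).
    archive = shutil.make_archive(str(workdir / "backup"), "gztar", root_dir=source_dir)
    # Step 2: restore into a fresh, empty location.
    restored = workdir / "restored"
    shutil.unpack_archive(archive, restored)
    # Step 3: the drill passes only if every file came back unchanged.
    return checksum_tree(source_dir) == checksum_tree(restored)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        data.mkdir()
        (data / "measurements.csv").write_text("node,value\nn1,42\n")
        print("restore verified:", restore_drill(data, Path(tmp) / "work"))
```

Running a drill like this on a schedule, rather than only during an emergency, is what turns "we have backups" into "we know we can recover from them".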

 

Dear customer,

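On November 13th at 2:05 PM Beijing time, the gams data platform suffered a major server outage. During the outage, data from all monitoring nodes was unavailable through the API and the dynamic views (tablet or TV displays). The outage was resolved yesterday, November 14th, at 9:03 AM, and all gams data services are now back to normal. We are still closely monitoring the gams data platform to make sure services keep running properly for all our customers.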

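After checking all the measurement data, I can assure you that, thanks to the smart buffer storage feature unique to our devices, no data was lost during the outage. gams is committed to open and transparent communication with its customers, which is why I am writing this email: to explain in detail the problems we found during this first platform outage and the corrective measures we are taking to prevent similar incidents from happening again.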

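If you would like more information about this outage, please feel free to contact me and the gams team through your gams WeChat group.
You can also check the status of our different data services at any time on our server status page: https://status.measureofquality.com/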

Stefan Berder 白峰
CEO

 

What exactly happened the day before yesterday?

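The day before yesterday, after completing a routine system security upgrade, we restarted our main data server. Thanks to the platform's resilient design, such upgrades are normally routine operations that we carry out without affecting platform use. This time, however, the server refused to restart. After several failed restart attempts, we suspected a disk problem and contacted Aliyun technical support to work on a solution. At around 6:00 PM Beijing time, the support staff checked and replied that everything was normal on the server side, so the problem lay with the system itself. We therefore decided to snapshot and back up our main disk and start fresh on a new server, to work around the original system not responding. After some analysis and digging, we realized that the restart problem was caused by the kernel delivered with the security update. We then decided to use the new server as our new base system and restored all the data. Our technical team worked through the night on the data recovery, and all services came back online yesterday at 9:03 AM.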

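Based on the checks we ran on all nodes today, this platform outage did not cause any data loss. This is thanks to the smart buffer storage feature that gams rolled out in an upgrade earlier in 2017.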

 

What improvements will we make next?

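Several things made the platform recovery the day before yesterday harder than expected. Here are the measures we are taking to prevent major outages of this kind from happening again: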

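We will set up our data services on separate server farms, so that a backup platform can keep operating if the main server goes down. This was already an item on our technical operations roadmap, though it previously had no priority; we will put it on the development schedule by the end of next week.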

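Automation of recovery from data backups. We had never tested our backups in a real emergency. As is often the case, having backups is not the end of the operational duty, because testing the data recovery process is just as important. If we automate and test this process in advance, future recoveries will take far less time and can be resolved quickly.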

2018-01-23T11:11:51+00:00