November 13th Service Outage Post-mortem

Dear customer,

On November 13th at 14:05 CST, the gams data platform suffered a major outage that made node data unavailable through both the API and the dynamic views (tablet or TV displays). The incident ended on November 14th at 9:03 CST, and all services are now back online. We are monitoring the platform closely to make sure all services keep running properly for you.

After checking all the measurements, I can assure you that no data was lost, thanks to our exclusive smart buffer feature. We are committed to open and transparent communication with our customers, so this post details the problems we found during this first platform outage and the corrective steps we are taking to prevent similar incidents.

Please feel free to reach out to me or the team through your WeChat group if you would like more information about this incident or the content of this post-mortem.
You can check the status of our different services at any time through our status page:

Stefan Berder


What happened?

After a routine system security upgrade, we restarted our main data server. This is the kind of operation we usually perform without customers noticing, thanks to our robust platform design. This time, however, the server refused to restart. Suspecting a disk problem, we started working with Aliyun support. At around 18:00 CST, support concluded that everything was working properly on the server side, so the problem had to lie with the system itself. We decided to snapshot our main disk as a backup, start fresh on a new server and attempt a recovery from the snapshot. After some digging, we found that the boot issue was caused by the kernel required by the security update we had applied. We then used the fresh server as the new base and recovered all the data onto it. That process took most of the night, although our recovery procedure made it easier. All services came back online this morning at 9:03 CST after a night of hard work from our tech team.

A check of all affected nodes shows no data loss, thanks to the early 2017 node upgrade that introduced the smart buffer feature.
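
For readers curious about the general idea behind this kind of buffering, here is a minimal store-and-forward sketch. It is only an illustration of the principle, not the actual node firmware: the class name `BufferedUploader`, the `send` callable and the buffer size are all assumptions made for this example.

```python
from collections import deque


class BufferedUploader:
    """Hypothetical store-and-forward buffer for a measurement node."""

    def __init__(self, send, max_items=100_000):
        # `send` uploads one measurement and raises ConnectionError
        # when the platform cannot be reached (assumed interface).
        self._send = send
        # Bounded queue: if it ever fills up, the oldest entries are dropped.
        self._pending = deque(maxlen=max_items)

    def __len__(self):
        """Number of measurements still waiting to be uploaded."""
        return len(self._pending)

    def record(self, measurement):
        """Queue a measurement, then try to upload everything pending."""
        self._pending.append(measurement)
        return self.flush()

    def flush(self):
        """Upload pending measurements in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # keep the data and retry on the next call
            self._pending.popleft()  # remove only after a successful upload
        return True
```

In this sketch a measurement leaves the queue only after its upload succeeds, so an outage simply makes the queue grow until the platform is reachable again.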


What are we doing about this?

Several things made the recovery harder than it should have been. Here are the steps we are taking to avoid this type of major outage and to recover more quickly should such an incident happen again.
First, we are duplicating the data services on a different server farm so that we have a backup to fall back on if the main server goes down. This was already an item on our operations roadmap, but it had not been given any priority. We will have a replicated service by the end of next week.
Second, we are automating recovery from backups. We had never tested our backups in a real emergency. As is often the case, having backups is not the end of the operational duty: testing the recovery process is just as important. The recovery took needlessly long and could have been swift and painless had it been properly automated and tested beforehand.
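
As an illustration of what testing a recovery can look like in practice, the sketch below restores a backup archive into a scratch directory and compares each restored file with a checksum recorded at backup time. It is a generic example, not our actual tooling: the tar archive format, the `verify_restore` function and the checksum manifest are assumptions made for this example.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(archive_path, expected):
    """Restore an archive into a scratch directory and list any problems.

    `expected` maps relative file paths to the SHA-256 digests recorded
    when the backup was taken (a hypothetical manifest).
    """
    problems = []
    # Use the safer "data" extraction filter where this Python version has it.
    options = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path) as archive:
            archive.extractall(scratch, **options)
        for relative, checksum in expected.items():
            restored = Path(scratch) / relative
            if not restored.is_file():
                problems.append(f"missing: {relative}")
            elif sha256_of(restored) != checksum:
                problems.append(f"corrupt: {relative}")
    return problems
```

Running a check like this on a schedule would surface a broken backup long before it is needed in an emergency; an empty list means every file in the manifest was restored intact.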



在北京时间11月13日下午2时5分,gams 的数据平台遭遇了一次重大的服务器中断,中断期间所有监测节点数据的API或动态视图(平板电脑或电视显示器)均无法正常使用。该中断已经在昨天11月14日上午9时3分解决,目前gams全部的数据服务均已恢复正常。我们仍然在密切监测 gams 数据平台的运行,确保我们所有客户的服务都能继续正常进行。

在检查完所有的测量数据后,我可以向您保证,由于我们设备独有的智能缓冲存储功能,在中断期间所有的数据并不会丢失。gams 致力于与客户间进行公开而透明的沟通,故写这封邮件,希望向您详细说明我们在第一次发生平台中断期间发现的问题,以及我们正在采取哪些纠正措施来防止其他类似事件的再次发生。

如您想了解更多有关这次平台中断的信息,请随时通过您所在的 gams 微信群组与我和 gams 团队联系。

Stefan Berder 白峰