Partial outage

Incident Report for vPlan

Postmortem

A few hours ago some of our customers experienced an outage. We would like to take the time to explain what happened.
All times mentioned below are in CEST (GMT+2).

Between 10:49 and 10:59 some customers could not access vPlan. From 11:23 to 11:31 the affected customers may have experienced a slight performance degradation.

When the partial outage occurred, our monitoring system immediately notified us of a possible problem. A quick investigation of these reports showed a problem with a server in one of our database clusters.
While we were looking into a solution for this problem, we received monitoring reports about a problem with the whole cluster. The other servers in the cluster were waiting on a response from the server that presented the initial problem. While they waited, other requests were queued and could not be handled, which overloaded the cluster.
This should not happen: when a server in a cluster is unable to perform its task, the rest of the cluster should not be affected.
We resolved the issue with the cluster by temporarily shutting down the problematic server. All outstanding requests were then handled immediately and the cluster became operational again, albeit in a degraded form.
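
For illustration only: one common way to keep a single unresponsive server from stalling its peers is a circuit breaker, which stops sending requests to a node after repeated failures and fails fast until a cooldown has passed. The sketch below is not vPlan's implementation or a feature of the database software involved. The names `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `cooldown_seconds` are hypothetical, and the sketch assumes the wrapped call enforces its own timeout and raises an exception when the node does not answer.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a node is being skipped instead of called (fail fast)."""


class CircuitBreaker:
    """Hypothetical guard around calls to one cluster node."""

    def __init__(self, max_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.failures = 0           # consecutive failed calls
        self.opened_at = None       # None means the node is in normal use

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Still cooling down: do not queue behind the broken node.
                raise CircuitOpenError("node skipped during cooldown")
            # Cooldown elapsed: let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A successful call puts the node back into normal use.
        self.failures = 0
        self.opened_at = None
        return result
```

With a guard like this, callers receive an immediate error for the failing node instead of waiting on it, so their own request queues keep draining.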

Between 10:59 and 11:23 we investigated the issue with the now-deactivated server. The problem was caused by a bug in the database software we use. The database engine was unable to handle requests, but the service itself was still reporting that it was operational.
We have run into this bug before, but until today only outside of office hours, with minimal to no impact on our customers.
When we first encountered the bug, we raised a bug report with the developers of our database software.
The bug has since been resolved in a later version of the software. Unfortunately, when we tested this new version in our test environment, we ran into other issues that had far more severe consequences and occurred more often. Using this version in production would cause more problems than it would solve, so we decided against it.
We raised new bug reports for the newly encountered problems.
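
For illustration only: a service that reports itself as operational while its engine cannot handle requests is the kind of failure a query-level health check is meant to catch. The sketch below is an assumption about how such a probe could look, not a description of vPlan's monitoring. The function name `engine_is_healthy` is hypothetical, and Python's built-in `sqlite3` module stands in for the real database, which this report does not name.

```python
import sqlite3


def engine_is_healthy(connect, probe_sql="SELECT 1"):
    """Return True only if the engine actually answers a trivial query.

    `connect` is any callable that returns a DB-API 2.0 connection.
    """
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(probe_sql)
            row = cur.fetchone()
        finally:
            conn.close()
    except Exception:
        # Connecting or querying failed: treat the engine as unhealthy,
        # regardless of what the service status says.
        return False
    return row is not None and row[0] == 1


# Stand-in usage with an in-memory SQLite database.
healthy = engine_is_healthy(lambda: sqlite3.connect(":memory:"))
```

In practice a probe like this would also need a connection and query timeout, so that a hanging engine is reported as unhealthy rather than blocking the check itself.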

For now we have to wait for the developers of our database software to resolve these new issues. They are working on a solution and will hopefully release a new version of the software soon.
As soon as this version is available, we will start testing it in our test environment and deploy it to production once we are confident in the solution.
This means that we could still run into the current bug again. However, since this was the first occurrence that led to a problem during office hours, and it was fixed relatively quickly, we are confident that waiting for a version that fixes all the issues we encountered is the best approach.

In the meantime, our development team will investigate options for reducing the chance of requests triggering this bug. Even once the bug is fixed, these changes will improve the reliability and performance of vPlan as a whole.

We are very sorry for any inconvenience this may have caused.

Posted Jun 09, 2022 - 17:53 CEST

Resolved

This incident has been resolved.

Posted Jun 09, 2022 - 14:43 CEST

Monitoring

All services are fully operational again. Over the next few hours we will keep monitoring the situation closely.

Posted Jun 09, 2022 - 11:35 CEST

Identified

We have identified the cause and implemented a temporary workaround. The outage should now be resolved.
While we are still working on the final solution, performance might be impacted slightly.

Posted Jun 09, 2022 - 10:59 CEST

Investigating

We are currently investigating monitoring reports about a partial outage.

Posted Jun 09, 2022 - 10:49 CEST

This incident affected: vPlan (API, Web Application, Mobile Apps) and vPlan Infrastructure.