2020-November-19 Service Incident


Thursday, November 19, 17:40 - 19:30 CET

What happened:

We experienced an increased error rate for jobs in the us-west-1, eu-central-1 and us-east-1 datacenters.

Why it happened:

Back pressure from a third-party provider caused requests to build up within the internal service responsible for starting new jobs. The resulting delay pushed that service into a crash loop.

How we fixed it:

Calls to the third-party provider endpoints experiencing back pressure were disabled and rerouted to endpoints that were not experiencing delays.

What we are doing to prevent it from happening again:

We’ve modified our services to route traffic to our third-party provider via a more resilient path. We’ve also changed our behaviour to publish messages to a queue without waiting for a response, eliminating the chance that back pressure overwhelms our internal services. In addition, we will add monitoring and alerting for these subsystems.

Posted Nov 24, 2020 - 18:43 CET

Error rates have subsided. All services are fully operational.
Posted Nov 19, 2020 - 20:39 CET
We have taken remedial action and are seeing improvements. We are monitoring.
Posted Nov 19, 2020 - 19:45 CET
We are seeing a high error rate on automated browser and emusim tests, as well as non-legacy RDC tests. We are continuing to investigate.
Posted Nov 19, 2020 - 18:46 CET
We are experiencing a service incident. We are currently determining the scope of the issue and will update shortly once that has been established.
Posted Nov 19, 2020 - 18:34 CET