2019-August-28 Service Incident
Incident Report for Sauce Labs European Data Center
Postmortem

Dates:

August 28th, 2019, 19:15 - 19:37 CEST

What happened:

Some customers experienced high wait times for Windows 10 VMs in PC Cloud in our European datacenter.

Why it happened:

Sauce uses an advanced machine learning algorithm to predict demand and pre-boot high-demand VM images, which reduces customer wait times. During this incident, one of the data sources that the algorithm uses was unstable. As a result, the algorithm received incomplete data and made incorrect predictions. Windows 10 is the most frequently requested VM image in our PC Cloud, and without enough pre-booted images its wait time increased sharply.
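To illustrate this failure mode in general terms, here is a minimal sketch. It is not Sauce's actual algorithm: the function `predict_preboot_count`, the source names, and the numbers are all hypothetical. It assumes a predictor that simply adds up average recent demand from several data sources, so when one source stops returning data the prediction drops silently and too few VMs are pre-booted.

```python
from statistics import mean


def predict_preboot_count(usage_by_source):
    """Estimate how many VMs to pre-boot for one image.

    usage_by_source maps a (hypothetical) data source name to a list of
    recent demand samples. A source that returned no samples contributes
    nothing, which is the failure mode sketched here.
    """
    total = 0.0
    for samples in usage_by_source.values():
        if samples:  # an unstable source may return an empty list
            total += mean(samples)
    return round(total)


# Hypothetical numbers, for illustration only.
healthy = {"source_a": [40, 44, 42], "source_b": [18, 22, 20]}
degraded = {"source_a": [40, 44, 42], "source_b": []}  # source_b is unstable

healthy_prediction = predict_preboot_count(healthy)
degraded_prediction = predict_preboot_count(degraded)
```

In this sketch the degraded prediction is lower than the healthy one even though real demand has not changed, which is how an unreliable input can turn into longer wait times.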

How we fixed it:

As an immediate remedial step, we turned off the machine learning service and manually adjusted the tilt weight on Windows 10 VM images. Then we updated our predictive service to use a more reliable data source and restarted the service.

What we are doing to prevent it from happening again:

We have a two-pronged approach to prevent this or similar incidents from happening again.

First, we will redesign our predictive service to consume the image usage data from a data service rather than the current data sources. We will need to design and develop this new data service and make sure it's much more reliable and stable than the current data sources.

Second, we will design and implement more thorough monitoring to verify that the data fed into the predictive service is reliable.
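As a rough sketch of what such input monitoring could look like (an assumption for illustration, not a description of the planned implementation), the check below flags data sources that are stale or return too few samples, and falls back to a manually configured floor when any input looks unreliable. The names `SourceSnapshot`, `find_unreliable_sources`, and `choose_preboot_count`, and both thresholds, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SourceSnapshot:
    """Health summary for one (hypothetical) usage data source."""
    name: str
    age_seconds: float  # time since the source last delivered data
    sample_count: int   # number of samples in the latest window


def find_unreliable_sources(snapshots, max_age_seconds=300.0, min_samples=10):
    """Return {source name: [reasons]} for every source that fails a check."""
    problems = {}
    for snap in snapshots:
        reasons = []
        if snap.age_seconds > max_age_seconds:
            reasons.append("stale")
        if snap.sample_count < min_samples:
            reasons.append("too few samples")
        if reasons:
            problems[snap.name] = reasons
    return problems


def choose_preboot_count(predicted, manual_floor, snapshots):
    """Trust the prediction only when all inputs pass the health checks.

    Otherwise never go below the manually configured floor.
    """
    if find_unreliable_sources(snapshots):
        return max(predicted, manual_floor)
    return predicted


# Hypothetical snapshots: source_b has not reported for ten minutes.
snapshots = [
    SourceSnapshot("source_a", age_seconds=30.0, sample_count=120),
    SourceSnapshot("source_b", age_seconds=600.0, sample_count=95),
]
problems = find_unreliable_sources(snapshots)
preboot = choose_preboot_count(predicted=5, manual_floor=50, snapshots=snapshots)
```

The design choice in this sketch is to fail toward over-provisioning: when the inputs cannot be trusted, a fixed floor keeps some VMs pre-booted rather than letting a bad prediction reduce capacity.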

Posted Sep 05, 2019 - 18:30 CEST

Resolved
All services are now fully operational.
Posted Aug 28, 2019 - 19:37 CEST
Investigating
Wait times on our PC Cloud are high. We are taking remedial action.
Posted Aug 28, 2019 - 19:15 CEST
This incident affected: Automated VM Testing (Automated PC Testing) and Manual Testing (Manual VM Testing).