August 28th, 2019, 19:15-19:37 CEST
Some customers experienced high wait times for Windows 10 VMs in PC Cloud in our European datacenter.
Sauce uses an advanced machine learning algorithm to predict demand and pre-boot high-demand VM images, which reduces customer wait times. During this incident, one of the data sources that the algorithm uses was not stable. As a result, the algorithm did not receive complete and correct data and therefore made incorrect predictions. Windows 10 is the most frequently requested VM image in our PC cloud, and in the absence of pre-booted images its wait time increased sharply.
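To illustrate the failure mode, here is a minimal sketch. It is not our production algorithm: it assumes a naive predictor that splits a fixed pre-boot pool across images in proportion to recent request counts. The function name `prebooted_pool_targets`, the image names, the request counts, and the specific way records go missing are all hypothetical.

```python
from collections import Counter


def prebooted_pool_targets(recent_requests, pool_size):
    """Split a fixed pre-boot pool across images in proportion to recent demand.

    Hypothetical stand-in for a demand predictor, used only for illustration.
    """
    counts = Counter(recent_requests)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {image: round(pool_size * n / total) for image, n in counts.items()}


# Complete usage data: Windows 10 dominates demand (made-up numbers).
complete = ["win10"] * 70 + ["win7"] * 20 + ["linux"] * 10

# Assumed failure: the unstable source delivers only a fraction of the
# Windows 10 records, so the predictor sees a distorted demand mix.
degraded = ["win10"] * 10 + ["win7"] * 20 + ["linux"] * 10

healthy_targets = prebooted_pool_targets(complete, pool_size=100)
degraded_targets = prebooted_pool_targets(degraded, pool_size=100)

print(healthy_targets["win10"], degraded_targets["win10"])
```

Under these assumptions, the number of pre-booted Windows 10 VMs falls well below real demand even though the predictor itself is working as designed, which is why the quality of the input data matters as much as the model.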
As an immediate remedial step, we turned off the machine learning service and manually adjusted the tilt weight on Windows 10 VM images. Then we updated our predictive service to use a more reliable data source and restarted the service.
We have a two-pronged approach to prevent this or similar incidents from happening again.
First, we will redesign our predictive service to consume the image usage data from a data service rather than the current data sources. We will need to design and develop this new data service and make sure it's much more reliable and stable than the current data sources.
Second, we will design and implement a more thorough monitoring process to make sure that the data fed into the predictive service is reliable.
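As a rough sketch of what such a check might look like, the example below gates predictions on two simple signals: how fresh the usage data is and how complete it is. Everything here is an assumption made for illustration, including the `FeedSnapshot` fields, the thresholds, and the fallback to manually set weights; the monitoring we build may differ.

```python
from dataclasses import dataclass


@dataclass
class FeedSnapshot:
    age_seconds: float   # time since the newest usage record arrived
    record_count: int    # records received in the current window
    expected_count: int  # records normally seen in a window of this size


def feed_problems(snapshot, max_age_seconds=300.0, min_completeness=0.9):
    """Return a list of detected problems; an empty list means the feed looks healthy."""
    problems = []
    if snapshot.age_seconds > max_age_seconds:
        problems.append("stale")
    if snapshot.expected_count > 0:
        completeness = snapshot.record_count / snapshot.expected_count
        if completeness < min_completeness:
            problems.append("incomplete")
    return problems


def choose_weights(snapshot, predicted, fallback):
    """Use predicted weights only when the feed passes every check."""
    return fallback if feed_problems(snapshot) else predicted


# Hypothetical weights: the fallback is a manually maintained default.
predicted = {"win10": 0.25, "win7": 0.50, "linux": 0.25}
fallback = {"win10": 0.70, "win7": 0.20, "linux": 0.10}

degraded_feed = FeedSnapshot(age_seconds=45.0, record_count=400, expected_count=1000)
print(feed_problems(degraded_feed))
print(choose_weights(degraded_feed, predicted, fallback))
```

The point of the design is that a bad feed triggers an alert and an automatic fallback to known-safe weights, rather than letting the predictive service act on data it should not trust.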