On Tuesday, February 27, the Roundtrip platform experienced degraded performance of background processes such as ride notifications and eligibility file processing.
Unlike the incident on February 7, there was never any issue with the Roundtrip website. In this case, the problem was insufficient capacity on our background job workers, which caused a backup in the job queue and a subsequent delay in processing jobs.
Why did this happen? On a typical day, Roundtrip runs two application containers responsible for executing background jobs that do not run in the context of a web request. This architecture helps with performance, as the job workers do not have to compete with web requests for computing resources. The two application containers provide more than enough resources to complete background jobs with almost zero waiting time in a queue, even during our busiest periods.
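For readers curious about the mechanics, the pattern looks roughly like the sketch below. The queue backend (a plain Redis list here) and the job names are illustrative stand-ins, not a description of our exact implementation: the web process pushes a job and returns immediately, and a separate worker container picks it up outside of any web request.

```python
# Illustrative sketch of the web/worker split, assuming a Redis-backed queue.
# The queue key and job payloads are hypothetical examples.
import json

import redis

queue = redis.Redis(host="localhost", port=6379, db=0)
QUEUE_KEY = "background_jobs"


def enqueue_ride_notification(ride_id: int) -> None:
    """Called from the web process: push the job and return immediately,
    so the web request never waits on the notification being sent."""
    queue.lpush(QUEUE_KEY, json.dumps({"job": "ride_notification", "ride_id": ride_id}))


def run_worker() -> None:
    """Runs in a separate worker container: block until a job arrives,
    then process it outside the context of any web request."""
    while True:
        _key, raw = queue.brpop(QUEUE_KEY)  # blocks until a job is available
        job = json.loads(raw)
        if job["job"] == "ride_notification":
            print(f"sending notification for ride {job['ride_id']}")
```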
Roundtrip also typically deploys updates to the platform multiple times each day. As part of cost-savings and risk mitigation efforts, we only keep the last fifty (50) application container images available in our repository. That number was purely arbitrary, a “best guess” based on our historical deployment rate. However, during mid-February, the Roundtrip engineering team was preparing a large rollout of dashboard updates, and while finalizing that work we did not deploy to production for over a week, even as we continued to build a large number of application container images for our testing environments. This was the longest we had gone without a production deploy in over eighteen months.
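For illustration, the retention rule amounts to something like the following. The Image type and function here are simplified stand-ins rather than our actual registry tooling; the point is simply that only the most recently pushed images survive.

```python
# Illustrative sketch of a "keep only the last N images" cleanup rule.
# KEEP_LAST_N mirrors the original best-guess value of 50.
from datetime import datetime
from typing import List, NamedTuple

KEEP_LAST_N = 50


class Image(NamedTuple):
    tag: str
    pushed_at: datetime


def images_to_delete(images: List[Image], keep_last_n: int = KEEP_LAST_N) -> List[Image]:
    """Return every image except the most recently pushed `keep_last_n`.

    Note the failure mode from this incident: nothing in this rule checks
    whether an image being deleted is the one production is currently running.
    """
    newest_first = sorted(images, key=lambda img: img.pushed_at, reverse=True)
    return newest_first[keep_last_n:]
```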
On February 27, one of the two job worker containers was terminated by our cloud provider as part of routine host maintenance. Ordinarily, our infrastructure would detect this and launch a replacement container within a minute or two. In this case, however, the container image it needed to launch a new worker had already been deleted by our automated cleanup process: we had built fifty new images since our last production deploy, which pushed the running release out of the retention window.
The solution was simple: either redeploy the release currently running in production or cut a new release. We opted for the latter, and the issue was resolved shortly thereafter.
We have taken a few steps to prevent this issue from occurring again. First, we have increased the number of retained container images to provide a buffer for longer periods without a deployment. Second, we have added a monitor and alert to notify us when job queue latency exceeds a tolerable limit; this would have helped us catch the issue much sooner, before customers noticed. Lastly, we are running additional worker capacity so that the loss of one container for an extended period does not cause a queue backup. We are sorry for any inconvenience this may have caused you.
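For the curious, the new latency check amounts to something like the sketch below. The threshold, the alerting hook, and the function names are illustrative rather than our exact monitoring configuration; the idea is simply to page someone when the oldest waiting job has been queued for too long.

```python
# Illustrative sketch of a queue-latency alert. The 300-second threshold and
# notify_on_call() hook are hypothetical placeholders.
import time
from typing import Optional

LATENCY_THRESHOLD_SECONDS = 300


def notify_on_call(message: str) -> None:
    """Placeholder for the team's paging/alerting integration."""
    print(f"ALERT: {message}")


def check_job_latency(oldest_enqueued_at: float, now: Optional[float] = None) -> None:
    """Alert if the oldest waiting job has been queued longer than the limit,
    so a queue backup is caught before riders notice delayed notifications."""
    now = time.time() if now is None else now
    waited = now - oldest_enqueued_at
    if waited > LATENCY_THRESHOLD_SECONDS:
        notify_on_call(f"background job queue latency is {waited:.0f}s")


# Example: a job that has been waiting ten minutes trips the alert.
check_job_latency(oldest_enqueued_at=time.time() - 600)
```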