On Tuesday, February 27, the Roundtrip platform experienced degraded performance of background processes such as ride notifications and eligibility file processing.
Unlike the incident on February 7, there was never any issue with the Roundtrip website. In this case, the problem was insufficient capacity on our background job workers, which caused a backup in the job queue and a subsequent delay in processing jobs.
Why did this happen? On a typical day, Roundtrip runs two application containers responsible for executing background jobs that do not run in the context of a web request. This architecture helps with performance, as the job workers do not have to compete with web requests for computing resources. The two application containers provide more than enough resources to complete background jobs with almost zero waiting time in a queue, even during our busiest periods.
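For readers curious about the mechanics, the pattern looks roughly like the sketch below. The queue backend (a plain Redis list here) and the job names are illustrative stand-ins, not a description of our exact implementation: the web process pushes a job and returns immediately, and a separate worker container picks it up outside of any web request.

```python
# Illustrative sketch of the web/worker split, assuming a Redis-backed queue.
# The queue key and job payloads are hypothetical examples.
import json

import redis

queue = redis.Redis(host="localhost", port=6379, db=0)
QUEUE_KEY = "background_jobs"


def enqueue_ride_notification(ride_id: int) -> None:
    """Called from the web process: push the job and return immediately,
    so the web request never waits on the notification being sent."""
    queue.lpush(QUEUE_KEY, json.dumps({"job": "ride_notification", "ride_id": ride_id}))


def run_worker() -> None:
    """Runs in a separate worker container: block until a job arrives,
    then process it outside the context of any web request."""
    while True:
        _key, raw = queue.brpop(QUEUE_KEY)  # blocks until a job is available
        job = json.loads(raw)
        if job["job"] == "ride_notification":
            print(f"sending notification for ride {job['ride_id']}")
```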
Roundtrip also typically deploys updates to the platform multiple times each day. As part of cost-savings and risk mitigation efforts, we only keep the last fifty (50) application container images available in our repository. That number was purely arbitrary, a “best guess” based on our historical deployment rate. However, during mid-February, the Roundtrip engineering team was preparing a large rollout of dashboard updates, and while finalizing that work we did not deploy to production for over a week, even as we continued to build a large number of application container images for our testing environments. This was the longest we had gone without a production deploy in over eighteen months.
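For illustration, the retention rule amounts to something like the following. The Image type and function here are simplified stand-ins rather than our actual registry tooling; the point is simply that only the most recently pushed images survive.

```python
# Illustrative sketch of a "keep only the last N images" cleanup rule.
# KEEP_LAST_N mirrors the original best-guess value of 50.
from datetime import datetime
from typing import List, NamedTuple

KEEP_LAST_N = 50


class Image(NamedTuple):
    tag: str
    pushed_at: datetime


def images_to_delete(images: List[Image], keep_last_n: int = KEEP_LAST_N) -> List[Image]:
    """Return every image except the most recently pushed `keep_last_n`.

    Note the failure mode from this incident: nothing in this rule checks
    whether an image being deleted is the one production is currently running.
    """
    newest_first = sorted(images, key=lambda img: img.pushed_at, reverse=True)
    return newest_first[keep_last_n:]
```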
On February 27, one of the two job worker containers was terminated by our cloud provider as part of routine host maintenance. Ordinarily, our infrastructure would detect this and launch a replacement container within a minute or two. In this case, however, the container image it needed to launch a new worker had already been deleted by our automated cleanup process: we had built fifty new images since our last production deploy, which pushed the running release out of the retention window.
The solution was simple: either redeploy the release currently running in production or cut a new release. We opted for the latter, and the issue was resolved shortly thereafter.
We have taken a few steps to prevent this issue from occurring again. First, we have increased the number of retained container images to provide a buffer for longer periods without a deployment. Second, we have added a monitor and alert to notify us when job queue latency exceeds a tolerable limit; this would have helped us catch the issue much sooner, before customers noticed. Lastly, we are running additional worker capacity so that the loss of one container for an extended period does not cause a queue backup. We are sorry for any inconvenience this may have caused you.
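For the curious, the new latency check amounts to something like the sketch below. The threshold, the alerting hook, and the function names are illustrative rather than our exact monitoring configuration; the idea is simply to page someone when the oldest waiting job has been queued for too long.

```python
# Illustrative sketch of a queue-latency alert. The 300-second threshold and
# notify_on_call() hook are hypothetical placeholders.
import time
from typing import Optional

LATENCY_THRESHOLD_SECONDS = 300


def notify_on_call(message: str) -> None:
    """Placeholder for the team's paging/alerting integration."""
    print(f"ALERT: {message}")


def check_job_latency(oldest_enqueued_at: float, now: Optional[float] = None) -> None:
    """Alert if the oldest waiting job has been queued longer than the limit,
    so a queue backup is caught before riders notice delayed notifications."""
    now = time.time() if now is None else now
    waited = now - oldest_enqueued_at
    if waited > LATENCY_THRESHOLD_SECONDS:
        notify_on_call(f"background job queue latency is {waited:.0f}s")


# Example: a job that has been waiting ten minutes trips the alert.
check_job_latency(oldest_enqueued_at=time.time() - 600)
```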