Issue connecting to Roundtrip

Incident Report for Roundtrip

Postmortem

The Roundtrip web application experienced an outage on October 10th that lasted 56 minutes, from 12:50 PM Eastern until 1:46 PM Eastern. The immediate cause of the outage was the removal of a database index that was to be immediately replaced with a new version.

Database indices are used to help queries find rows quickly. Another use of indices is to enforce constraints on the data, such as uniqueness. We have encountered cases where row duplication occurred due to an unenforced uniqueness constraint. These rare cases do not affect ride operations, but they have manifested themselves in our internal reporting and reconciliations and required manual intervention to “fix” the data.
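As an illustration only, the difference between a plain index and a uniqueness-enforcing index can be sketched with Python's built-in sqlite3 module. SQLite stands in for the production database engine, which this report does not name, and the table, column, and index names (rides, external_id, idx_rides_external_id) are hypothetical.

```python
import sqlite3


def insert_ride_twice(enforce_unique: bool) -> int:
    """Insert the same hypothetical external_id twice and return the row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rides (id INTEGER PRIMARY KEY, external_id TEXT)")
    if enforce_unique:
        # A unique index speeds up lookups and also rejects duplicate values.
        conn.execute("CREATE UNIQUE INDEX idx_rides_external_id ON rides (external_id)")
    else:
        # A plain index speeds up lookups but does not constrain the data.
        conn.execute("CREATE INDEX idx_rides_external_id ON rides (external_id)")
    conn.execute("INSERT INTO rides (external_id) VALUES ('ride-001')")
    try:
        conn.execute("INSERT INTO rides (external_id) VALUES ('ride-001')")
    except sqlite3.IntegrityError:
        pass  # the unique index refused the duplicate row
    count = conn.execute("SELECT COUNT(*) FROM rides").fetchone()[0]
    conn.close()
    return count


rows_without_constraint = insert_ride_twice(enforce_unique=False)
rows_with_constraint = insert_ride_twice(enforce_unique=True)
```

With the plain index the duplicate row is stored. With the unique index the second insert is rejected, so only one row remains.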

Today we intended to replace two indices that do not enforce a uniqueness constraint with two new indices that do enforce the constraint. Our database engine allows us to drop and replace indices concurrently with read and write operations; we have implemented dozens of these changes as part of an ongoing effort to improve our platform. While dropping the old indices, we saw a dramatic decrease in the efficiency of our primary dashboard query in the platform. Queries that normally take less than 100 milliseconds began to run for longer than 1 minute. This condition consumed all the computing capacity of our database server and prevented us from either creating the new indices or restoring the old ones.

Once we realized what was happening, we opted to act decisively to restore the health of the site. The continuous creation of new dashboard queries prevented us from gaining enough compute resources to replace the indices. We decided to take our application offline to quiesce the database, fail over the database, replace the old indices, and bring the site back online. We accomplished this as quickly as possible. Once the new application services came back up, the site was healthy once more.

The effort to improve our platform is an ongoing process, and we implement changes like the one we attempted today almost every day. This change followed the same standard process as all of our changes, including peer review and testing in our three non-production environments.

In the prior 90 days, our application uptime exceeded 99.99%. We are very proud of that record and have invested considerable time in crafting safe processes and practices. Something went awry today, and we are very sorry for the inconvenience it has caused. We will work to ensure we learn from this issue and continue to improve.

Posted Oct 11, 2024 - 14:12 EDT

Resolved

After continuing to monitor the performance and activity of the Roundtrip platform, we are pleased to say that this incident is completely resolved. We will post a postmortem covering the root cause and remediation efforts within the next 24 hours.

Posted Oct 10, 2024 - 16:03 EDT

Monitoring

A fix has been implemented and site traffic is returning to normal. Root cause investigation and analysis are continuing, and more information will be provided. At this time, we do not expect any further problems related to this incident.

Posted Oct 10, 2024 - 13:51 EDT

Identified

We have identified an issue with database resources that has impacted the ability to access the Roundtrip application. The team has identified the cause and is working to resolve the issue. We expect functionality to be restored shortly. More updates will follow.

Posted Oct 10, 2024 - 13:16 EDT

This incident affected: Booking Portal, Community Portal, Rider App, Reporting and Analytics, and Patient Data Ingestion.