Intermittent connectivity and performance issues

Incident Report for Roundtrip

Postmortem

At 10:07 EST on Wednesday, February 7, Roundtrip Engineering received an automated alert indicating instability with our web application containers and the load balancer in front of them. The load balancer conducts a health check on the containers every 30 seconds and, when it finds an unhealthy container, sends a signal to shut down the problematic container and launch a new one. Container failure in the cloud is not uncommon, and this “self-healing” aspect of our infrastructure is an important part of our resilience strategy for the Roundtrip product.
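As an illustration of this pattern, the sketch below shows roughly how a health-check-and-replace loop behaves. It is purely illustrative: the load balancer is a managed service operated by our cloud provider, and the endpoint name, interval, and helper functions here are hypothetical.

    # Illustrative sketch only: the real load balancer is a managed cloud service.
    # The /health endpoint, interval, and helper functions are hypothetical.
    import time
    import requests

    HEALTH_CHECK_INTERVAL_SECONDS = 30

    def is_healthy(container_url: str) -> bool:
        """Return True if the container answers its health endpoint in time."""
        try:
            return requests.get(f"{container_url}/health", timeout=5).ok
        except requests.RequestException:
            return False

    def reconcile(containers, shut_down, launch_replacement):
        """Shut down any container that fails its health check and launch a new one."""
        for container in list(containers):
            if not is_healthy(container):
                shut_down(container)                    # drain and stop the unhealthy container
                containers.remove(container)
                containers.append(launch_replacement()) # bring up a fresh one in its place

    def run(containers, shut_down, launch_replacement):
        """Re-check the fleet on a fixed interval (the "self-healing" loop)."""
        while True:
            reconcile(containers, shut_down, launch_replacement)
            time.sleep(HEALTH_CHECK_INTERVAL_SECONDS)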

On Wednesday, however, we found that the load balancer believed that all of our application containers were unhealthy. It would launch new containers; they would serve requests successfully for 5 to 20 minutes, fail a health check, and be replaced. This condition persisted until about 17:00 EST. During that period, users of our product experienced degraded performance and intermittent errors as containers were continually launched and drained.

Initial investigation suggested a problem in the connection between the load balancer and the containers. We engaged in a high-severity investigation with our cloud provider, as that segment of the application architecture lay within the provider’s network and beyond our control. While working with their support team, we noticed two long-running queries active in our database that were blocking access to a core data table. When we killed those “stuck” queries, our container/load balancer issue resolved itself.

What happened? Web application servers have a pool of connections on which they serve requests. When a user loads a web page in our application, they use one of those connections. Once their page loads, that connection becomes available for another request to use. If no connections are available to handle a request, the server makes the request wait until one is available. If the request waits too long – a duration known as the “timeout period”, usually between 30 and 60 seconds – the request will fail.
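A minimal sketch of this pool-and-timeout behavior follows. It is illustrative only; our servers rely on their web framework's built-in pool, and the numbers here are hypothetical.

    # Illustrative sketch of a bounded connection pool with a timeout period.
    import queue

    class ConnectionPool:
        def __init__(self, size: int, timeout_seconds: float = 30.0):
            self._available = queue.Queue()
            self._timeout = timeout_seconds
            for _ in range(size):
                self._available.put(object())  # stand-in for a real connection

        def acquire(self):
            # A new request waits here for a free connection; if none frees up
            # within the timeout period, the request fails instead of waiting forever.
            try:
                return self._available.get(timeout=self._timeout)
            except queue.Empty:
                raise TimeoutError("no connection became available within the timeout period")

        def release(self, conn):
            # Once the page has been served, the connection returns to the pool
            # and can be handed to the next waiting request.
            self._available.put(conn)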

We run enough server capacity to provide a pool of connections more than sufficient to serve the number of active web requests at a given time under normal circumstances. A typical request takes a few hundred milliseconds, ensuring that connections are almost always being made available to new requests. On Wednesday, however, those blocking database queries caused active requests to take much longer because they could not access the core table locked by the bad queries. As a result, connections were held for long periods and not made available to new requests. Unfortunately, the load balancer health checks also require a server connection; after waiting out their “timeout period”, they began to fail, causing the load balancer to conclude that the container running the server was down. The result was an unstable set of application containers.
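Continuing the sketch above, the health check is served through the same pool as ordinary requests, which is why it began to fail once every connection was tied up waiting on the blocked table. Again, this is illustrative code with hypothetical names, not our actual server.

    # The health endpoint competes for the same connections as user requests.
    # When blocked queries keep every connection busy, acquire() times out and
    # the health check fails, so the load balancer marks the container unhealthy.
    def handle_health_check(pool: "ConnectionPool") -> str:
        conn = pool.acquire()      # raises TimeoutError once the pool is exhausted
        try:
            return "200 OK"
        finally:
            pool.release(conn)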

We have identified the source of the long-running queries and have mitigated their use. We are also improving our monitoring of blocking states within the application database so that long lock waits are detected before they can exhaust the application connection pool. We are sorry for the inconvenience this has caused.
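For transparency, below is a sketch of the kind of blocking-state check we are adding. It assumes a PostgreSQL-style database and a hypothetical connection string; the details of our production monitoring will differ.

    # Sketch of a blocking-state check, assuming PostgreSQL (pg_stat_activity,
    # pg_blocking_pids). Thresholds, alerting, and the DSN are hypothetical.
    import psycopg2

    BLOCKING_SESSIONS_SQL = """
        SELECT blocked.pid                  AS blocked_pid,
               blocked.query                AS blocked_query,
               blocking.pid                 AS blocking_pid,
               blocking.query               AS blocking_query,
               now() - blocking.query_start AS blocking_for
        FROM pg_stat_activity AS blocked
        JOIN pg_stat_activity AS blocking
          ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
        WHERE blocked.wait_event_type = 'Lock';
    """

    def find_blocking_sessions(dsn: str):
        """Return one row per (blocked session, blocking session) pair."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(BLOCKING_SESSIONS_SQL)
            return cur.fetchall()

    # A monitor runs this on an interval and alerts when any blocking session
    # has held its locks longer than a threshold, well before the connection
    # pool can drain.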

Posted Feb 08, 2024 - 12:56 EST

Resolved

Further analysis indicates that the actions taken at ~5 PM Eastern time today have indeed resolved the root issue. The root cause was associated with database performance problems that were causing widespread issues across the platform. Once the fix was implemented, performance was restored quickly and remains within expected ranges.

The team is satisfied with the return to normal system operations. We will continue to monitor performance overnight and will provide a full root cause analysis in the coming days.
Posted Feb 07, 2024 - 18:02 EST

Update

We are continuing to monitor for any further issues.
Posted Feb 07, 2024 - 17:25 EST

Monitoring

At this time, we believe that we've addressed a core cause of the connectivity and performance issues experienced throughout the day. The fix has been implemented, and our monitoring indicates that the system is returning to normal operations. We are continuing to monitor the system through the next few hours to ensure that the root cause has been addressed.

Assuming we maintain normal operations and confirm this root cause, a full root cause analysis will be provided.
Posted Feb 07, 2024 - 17:22 EST

Update

The team has identified what we believe to be the root cause of the intermittent connectivity and performance issues. We are currently working through options with our cloud hosting provider to resolve this issue.

We will continue to update as more information becomes available.
Posted Feb 07, 2024 - 10:36 EST

Identified

The fix we were previously monitoring did not have the desired impact. We are still investigating this issue.
Posted Feb 07, 2024 - 10:35 EST

Monitoring

We have made an adjustment to the configuration of our infrastructure, as recommended by our cloud provider. Services appear to be gradually recovering - we will continue to monitor the situation to ensure a return to normal operation. A root cause analysis is still pending.
Posted Feb 07, 2024 - 10:30 EST

Investigating

We are currently investigating an issue with Roundtrip's production workloads at our cloud provider. There appears to be a problem with underlying connectivity within the cloud platform that is causing intermittent timeouts and slower response times throughout the platform. This is not currently resulting in a system-wide outage.

We have escalated a ticket to our cloud provider and our engineering team is taking remediation steps.

We will update this thread with more information as it becomes available.
Posted Feb 07, 2024 - 10:00 EST
This incident affected: Booking Portal, Community Portal, and Reporting and Analytics.