At 10:07 EST on Wednesday, February 7, Roundtrip Engineering received an automated alert indicating instability in our web application containers and the load balancer in front of them. The load balancer conducts a health check on the containers every 30 seconds and, when it finds an unhealthy container, sends a signal to shut down the problematic container and launch a new one. Container failure in the cloud is not uncommon, and this “self-healing” aspect of our infrastructure is an important part of our resilience strategy for the Roundtrip product.
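Conceptually, that self-healing loop looks something like the sketch below. This is an illustration only: the health endpoint path, the probe timeout, and the replace_container helper are assumptions for the example, not our cloud provider’s actual implementation.

```python
import time
import urllib.request

CHECK_INTERVAL = 30   # seconds between health checks, as described above
PROBE_TIMEOUT = 5     # how long a single probe may take (assumed)

def is_healthy(container_url):
    """Probe the container; any error or non-200 response counts as unhealthy."""
    try:
        with urllib.request.urlopen(container_url + "/health", timeout=PROBE_TIMEOUT) as resp:
            return resp.status == 200
    except Exception:
        return False

def monitor(container_urls, replace_container):
    """Check every container each cycle and replace the ones that fail."""
    while True:
        for url in list(container_urls):
            if not is_healthy(url):
                replace_container(url)   # drain the bad container, launch a new one
        time.sleep(CHECK_INTERVAL)
```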
On Wednesday, however, we found that the load balancer believed that all of our application containers were unhealthy. It would launch new containers, which would serve requests successfully for 5 to 20 minutes, fail a health check, and be replaced. This cycle persisted until about 17:00 EST. During that period, users of our product experienced degraded performance and intermittent errors as containers were continually launched and drained.
Initial investigation suggested a problem in the connection between the load balancer and the containers. We opened a high-severity investigation with our cloud provider, as that segment of the application architecture lies within the provider’s network and beyond our control. While working with their support team, we noticed that we had two long-running queries active in our database that were blocking access to a core data table. When we killed those “stuck” queries, our container/load balancer issue resolved itself.
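To give a sense of the check involved, a script along the lines of the one below can surface long-running sessions and, once they are confirmed safe to kill, terminate them. This is a sketch that assumes a PostgreSQL database; the connection string, the five-minute threshold, and the stuck_pid placeholder are illustrative.

```python
import psycopg2

# Find sessions that have been running the same query for a long time,
# assuming PostgreSQL. The DSN and the threshold are illustrative.
conn = psycopg2.connect("dbname=app")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND now() - query_start > interval '5 minutes'
        ORDER BY runtime DESC
    """)
    for pid, runtime, query in cur.fetchall():
        print(pid, runtime, query[:80])

    # After confirming a session is safe to kill, terminate it by pid:
    # cur.execute("SELECT pg_terminate_backend(%s)", (stuck_pid,))
```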
What happened? Web application servers have a pool of connections on which they serve requests. When a user loads a web page in our application, they use one of those connections. Once their page loads, that connection is available for another request to use. If no connections are available to handle a request, the server makes the request wait until one is available. If the request waits too long (a duration known as the “timeout period”, usually between 30 and 60 seconds), the request fails.
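As a rough illustration of that behavior, the waiting-and-timeout logic looks something like the sketch below. The pool size and timeout are made-up numbers, and handle_request and render_page are stand-ins for our actual request path.

```python
import queue

POOL_SIZE = 10          # illustrative, not our real capacity
TIMEOUT_SECONDS = 30    # the "timeout period" described above

# Pre-fill the pool with placeholder connection objects.
pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"connection-{i}")

def handle_request(render_page):
    """Serve one request: wait for a free connection, use it, then return it."""
    try:
        # Wait up to TIMEOUT_SECONDS for a connection to become available.
        conn = pool.get(timeout=TIMEOUT_SECONDS)
    except queue.Empty:
        return 503      # no connection freed up in time: the request fails
    try:
        render_page(conn)   # normally takes a few hundred milliseconds
        return 200
    finally:
        pool.put(conn)      # hand the connection back for the next request
```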
We run enough server capacity that, under normal circumstances, the pool of connections is more than sufficient to serve the number of active web requests at any given time. A typical request takes a few hundred milliseconds, so connections are almost always being freed up for new requests. On Wednesday, however, the blocking database queries caused active requests to take much longer, because they could not access the core table locked by the bad queries. As a result, connections were held for long periods and not made available to new requests. Unfortunately, the load balancer health checks also require a server connection; after waiting through their “timeout period”, they began to fail, causing the load balancer to think the container running the server was down. The result was an unstable set of application containers.
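Putting those pieces together, the failure mode can be reproduced in miniature. In the sketch below (again with made-up sizes and timings), every pooled connection is tied up by a request stuck behind the blocked table, so the health check cannot obtain one and reports the container as unhealthy.

```python
import queue
import threading
import time

pool = queue.Queue(maxsize=4)       # tiny pool for the demonstration
for i in range(4):
    pool.put(f"connection-{i}")

def stuck_request():
    """A request whose query is blocked by the stuck database queries."""
    conn = pool.get()
    time.sleep(300)                 # holds the connection instead of finishing
    pool.put(conn)

def health_check(timeout=5):
    """The load balancer's probe also needs a connection from the same pool."""
    try:
        conn = pool.get(timeout=timeout)
        pool.put(conn)
        return "healthy"
    except queue.Empty:
        return "unhealthy"          # the container will be drained and replaced

for _ in range(4):                  # blocked requests tie up the whole pool
    threading.Thread(target=stuck_request, daemon=True).start()

time.sleep(1)
print(health_check())               # prints "unhealthy"
```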
We have identified the source of the long-running queries and put mitigations in place. We are also improving our monitoring of blocking states within the application database so that long lock waits are caught before they can exhaust the application connection pool; a sketch of that check appears below. We are sorry for the inconvenience this incident caused.
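The check referenced above amounts to periodically asking the database which sessions are stuck waiting on locks and for how long. A sketch, again assuming PostgreSQL; the connection details and one-minute threshold are placeholders, and in practice this feeds an alert rather than printing.

```python
import psycopg2

# Blocking-state check, assuming PostgreSQL. In production this would run on
# a schedule and page us instead of printing.
conn = psycopg2.connect("dbname=app")

with conn.cursor() as cur:
    # Sessions currently blocked by another session's locks that have been
    # waiting longer than a minute (threshold is illustrative).
    cur.execute("""
        SELECT blocked.pid,
               pg_blocking_pids(blocked.pid) AS blocking_pids,
               now() - blocked.query_start   AS waiting_for,
               blocked.query
        FROM pg_stat_activity AS blocked
        WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0
          AND now() - blocked.query_start > interval '1 minute'
    """)
    for pid, blockers, waiting_for, query in cur.fetchall():
        print(f"pid {pid} blocked by {blockers} for {waiting_for}: {query[:80]}")
```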