Cloud Sites Issues

Wednesday Oct 21, 2015

[Resolved] Resolved: Cloud Sites | Website Degradation | ORD1
11:32 AM EDT
The Cloud Sites environment has been stable for over 48 hours. Our engineers remain engaged with our third-party storage vendor, and a full technical assessment is in progress to identify actions necessary for the long-term health and stability of the environment.

At this time, this issue is considered resolved.
01:20 AM EDT
The Cloud Sites environment has remained stable; however, our engineering teams continue to monitor the environment for any changes to performance.

We will provide additional updates as information becomes available.
03:28 PM EDT
As of 14:20 CDT, the Cloud Sites environment has remained stable. Our teams will continue to monitor the environment overnight for any changes to performance.

Next Update: Wednesday, 28 October at 10:00 CDT
06:59 AM EDT
At this time, the Cloud Sites environment remains stable; however, Cloud Sites engineers are monitoring for any changes in performance, and our third-party vendor remains on standby for any required escalations.

We will provide additional updates as information becomes available.
12:44 AM EDT
As of 23:30 CDT, the Cloud Sites environment continues to remain stable. Our teams are continuing to monitor for any changes in performance, and our third-party vendor remains on standby for any required escalations.

We will provide additional updates as information becomes available.
05:10 PM EDT
As of 16:00 CDT, the Cloud Sites environment remains stable. Our teams will continue to monitor through the night for any changes in performance. Senior engineers from the third-party vendor for our storage system remain on standby for any required escalations. Additional hardware is scheduled to be brought online to increase capacity and maintain stability, and engineers plan to migrate a subset of customers to the new hardware when it is introduced to the environment.

We appreciate your patience during this time, and will provide additional updates as information becomes available.
03:54 PM EDT
At this time, the Cloud Sites environment has stabilized. Our teams are continuing to monitor the environment for any changes to performance. Engineers remain engaged with the vendor to determine any additional work needed to maintain stability.
02:04 PM EDT
The Rackspace Cloud Sites environment has experienced a recurrence of the storage latency issues that occurred last week within the ORD region. The portion of the environment that was previously affected has remained stable. A separate device in the infrastructure is now exhibiting similar symptoms and behavior. Cloud Sites engineers and Rackspace Leadership continue to actively work with our third-party vendor to identify a path to resolve the issue. During this time, customers may experience website latency and slow page loads, or may be unable to access their websites.
11:42 AM EDT
Cloud Sites engineers are continuing to work to resolve an issue impacting one storage node in the ORD region. Engineers have re-engaged our third-party storage solution vendor to identify a path to resolution. At this time customer sites provisioned to the affected node may experience significant latency and slow page loads. Additionally, some customers may be unable to access their websites.
10:43 AM EDT
Cloud Sites engineers are working to resolve an issue impacting one storage node in the ORD region. At this time customer sites provisioned to the affected node may experience significant latency and slow page loads. Additionally, some customers may be unable to access their websites.

Next Update: 10:30 CDT or sooner if significant changes occur.
09:51 AM EDT
At this time, the Cloud Sites environment has remained stable. Our teams continue to monitor the environment for any changes to performance. Engineers remain engaged with the vendor to determine any additional work needed to maintain stability.

Next Update: Monday, 26 October at 10:00 CDT
05:00 PM EDT
On Tuesday, October 20th, Rackspace Cloud Sites engineers began a planned maintenance to migrate a subset of storage volumes into cluster mode in the ORD region. During this maintenance, engineers observed load levels that were higher than normal for the environment. The additional load was initially attributed to maintenance activities, and the load returned to normal levels when the maintenance completed the following day.

A few hours later, engineers became aware of increasing latency impacting all of the storage volumes that had been migrated. As a result of the latency accessing the storage volumes, the web server nodes began to experience high load when processing requests for all storage volumes in ORD, including volumes not included in the maintenance. At this time, customers began to experience significant latency and slow page loads for all sites hosted in the impacted region. Additionally, some customers may have been unable to access their websites.

Engineers escalated the issue to our third-party vendor for the storage systems and began gathering performance log data to facilitate vendor troubleshooting. Engineers also redistributed network traffic for two of the four affected storage nodes to reduce latency. The web server nodes continued to experience issues processing requests, and our operations team initiated a job to continuously terminate slow processes.

At this time, engineers and Rackspace leadership evaluated the feasibility of three potential paths to remediation to identify the course of action with the least impact to customers. The remediation options were as follows:

· Perform an emergency maintenance to add additional capacity.
· Partially roll back the maintenance on the impacted storage volumes.
· Wait for the vendor to complete analysis of the performance data.

Engineers initially determined that waiting for the vendor to complete their analysis was the best course of action. Adding capacity would require a minimum of eight hours to complete the build out, resulting in extended customer impact. Rolling back the maintenance would result in the potential loss of data created or modified after the maintenance started for customers provisioned to the impacted volumes.

After some time working to troubleshoot the issue, it became apparent the vendor would not be able to complete their analysis in a reasonable timeframe. The decision was then made to begin preparations to roll back the maintenance for four of the impacted volumes. Following completion of the rollback, engineers redistributed traffic to stabilize the majority of customer impact, and began work to redistribute the affected storage volumes. During this process, a subset of customers would have continued to experience website latency and service degradation.

Since the start of the event, our engineers have continued to engage with our vendor to try to identify the root cause of the problem and a path towards full resolution. On Wednesday afternoon, we began the process of moving storage volumes to a new aggregate in the cluster.

On Friday morning, the final volume migration completed, which improved latency and website performance. Currently, our teams are continuing to research the issue along with the vendor. Additionally, we are bringing on additional capacity in case the issue recurs.
03:36 PM EDT
At this time, the Cloud Sites environment has remained stable. Our teams are continuing to monitor the environment for any changes to performance. Engineers remain engaged with the vendor to determine any additional work needed to maintain stability.

Next Update: 17:00 CDT
01:08 PM EDT
At this time, impact related to this issue has been mitigated and customer websites are accessible. Engineers are continuing to monitor and are actively working to maintain stability of the environment.

Next Update: 14:30 CDT
11:32 AM EDT
Engineers are working to remediate ongoing impact occurring as a result of this issue. At this time, customer sites provisioned to a subset of storage volumes in our environment are experiencing continued high latency that has the potential to affect web requests across the region. Efforts to redistribute the final affected storage volume are nearing completion. Our engineering teams are continuing to work with senior engineers from the third-party vendor for our storage environment to identify additional remediation options.

Next update: 12:30 CDT
09:26 AM EDT
Engineers have diligently worked throughout the night on efforts to remediate impact to a subset of Cloud Sites customers in the ORD region. Customers may continue to experience significant latency and may be unable to access their websites until this issue is resolved.

Next Update: 10:30 CDT or sooner if significant changes occur.
07:31 AM EDT
Engineers continue work to redistribute the final affected storage volume in the ORD region. Approximately two-thirds of the storage volume has been redistributed. During this time, a subset of Cloud Sites customers may continue to experience significant latency and may be unable to access their websites.

Next update will be at approximately 08:30 CDT or sooner if significant changes occur.
05:29 AM EDT
As of 04:30 CDT, close to half of the redistribution of the final storage block for Cloud Sites has completed. During this time, a subset of Cloud Sites customers may continue to experience significant latency and may be unable to access their websites.

Further updates will be provided as they become available.
03:32 AM EDT
As of 02:30 CDT, engineers continue to work towards redistribution of the final storage block for Cloud Sites. Redistribution is anticipated to proceed more quickly as the night continues.

The next update will be provided at 04:30 CDT or earlier if significant changes occur.
11:06 PM EDT
Engineers are continuing to work to resolve issues impacting a subset of Cloud Sites customers in the ORD region. Customers may continue to experience significant latency and may be unable to access their websites until this issue is resolved.
07:11 PM EDT
At this time, work to redistribute the final affected storage volume continues in the ORD region. A subset of Cloud Sites customers may continue to experience significant latency and may be unable to access their websites.

Next Update: 22:00 CDT or sooner if significant changes occur.
04:53 PM EDT
Some actions we took late this afternoon as part of our continuing efforts to fully remediate this issue had unanticipated customer impact. We have ceased those specific impacting actions, and continue our overall work to redistribute the final affected storage volume. Between 14:20 and 14:55 CDT, customers may have experienced significant latency or been unable to access their websites.

Next Update: 17:30 CDT, or sooner if significant changes occur.
03:13 PM EDT
Cloud Sites engineers continue to work to redistribute the last remaining affected storage volume. This process is expected to take a number of hours to complete. At this time, customer sites provisioned to the affected storage volume may continue to experience website latency and slow page loads.

Next Update: 18:30 CDT or sooner if significant changes occur.
01:32 PM EDT
Cloud Sites engineers are continuing work to redistribute the last remaining affected storage volume. Customer sites provisioned to the affected volume may continue to experience website latency and slow page loads.

Next Update: 14:30 CDT or sooner if significant changes occur.
10:59 AM EDT
Engineers continue to work towards resolving the remaining customer impact. During this time, customers with Cloud Sites in the ORD region may experience website latency or slow page loads.
09:27 AM EDT
As of 08:12 CDT, engineers have completed work to redistribute customer traffic on the affected storage volumes. Engineers are working to determine next steps to resolve remaining customer impact. During this time, customers with Cloud Sites in the ORD region may experience website latency or slow page loads.

Next Update: 10:00 CDT
08:11 AM EDT
As of 07:05 CDT, engineers continue redistribution of the remaining storage volumes. Customers affected by this issue who have already been stabilized remain stable. During this time, a small subset of customers may continue to experience website latency or slow page loads until the redistribution is 100% complete.

The next update will be provided as information becomes available.
04:55 AM EDT
As of 03:46 CDT, approximately two-thirds of the redistribution of the remaining storage has been completed. An ETA for completion is not available due to the nature of the process, but the copy rate has continued to improve as traffic volume has decreased. During this time, a small subset of customers may continue to experience website latency or slow page loads.
02:58 AM EDT
Engineers continue to work towards resolution of the residual latency issues for a small subset of customers. Impact remains stabilized for the majority of customers initially affected by this issue. During this time, a small subset of customers may continue to experience website latency or slow page loads.

Next Update: As information becomes available.
11:59 PM EDT
Engineers continue to work to resolve residual latency issues for a small subset of customers. Impact remains stabilized for the majority of customers initially affected by this issue. During this time, a small subset of customers may continue to experience website latency or slow page loads.

Next Update: As information becomes available.
05:21 PM EDT
Cloud Sites engineers are continuing work to resolve residual latency issues for a small subset of customers. Impact has stabilized for the majority of customers initially affected by this issue. During this time, a small subset of customers may continue to experience website latency or slow page loads.

Next Update: As information becomes available.
04:30 PM EDT
Our engineers have completed a rollback of a recent maintenance to stabilize the environment. A subset of customers may continue to experience slow page loads and website latency as the environment resumes normal operations.

Next Update: 16:30 CDT
04:02 PM EDT
Our Cloud Sites engineers have reported that the rollback of the recent maintenance on a subset of backend storage volumes in the ORD region is nearly complete. As a result of this rollback, a portion of customers may experience a loss of changes made to their sites after 21:30 CDT yesterday.

Next Update: 15:30 CDT
02:50 PM EDT
Our Cloud Sites engineers are preparing to perform a rollback of a recent maintenance on a subset of backend storage volumes in the ORD region. The rollback is expected to take up to one hour to complete. As a result of this rollback, a portion of customers may experience a loss of changes made to their sites after 21:30 CDT yesterday.

Next update: 15:00 CDT
02:29 PM EDT
Cloud Sites teams are continuing work to stabilize ongoing connectivity issues affecting a subset of backend storage nodes in the ORD region. We are actively engaged with the third-party vendor for the affected service, and have identified a possible path to resolution. During this time, customers may continue to experience high latency or be unable to access their websites.

Next Update: 14:30 CDT
01:36 PM EDT
Cloud Sites teams are continuing work to stabilize ongoing connectivity issues in the ORD region. An issue has been identified affecting a subset of backend storage targets, resulting in anomalously high load across the environment. We continue to work with the third-party vendor for the affected service to further investigate the issue. Engineers have identified several possible paths to resolution, which are currently being evaluated. During this time, customers may continue to experience high latency or be unable to access their websites.

Next Update: 13:30 CDT
12:18 PM EDT
Cloud Sites teams are continuing work to stabilize ongoing connectivity issues in the ORD region. An issue has been identified affecting a subset of backend storage targets. We have engaged the third-party vendor for the affected service to further investigate the issue. During this time, customers may continue to experience high latency or be unable to access their websites.

Next Update: 12:30 CDT
11:00 AM EDT
Our Cloud Sites engineers are continuing work to resolve connectivity issues affecting websites in the ORD region. During this time, customers may experience high latency or be unable to access their websites in the affected region.
09:43 AM EDT
As of approximately 08:20 CDT, a portion of Cloud Sites customers began experiencing periods of latency in the ORD1 data center. Engineers are engaged and are working to resolve the issue. During this time, a portion of customers may be experiencing high latency on their sites.

If you have further questions, please contact a member of your support team.
09:42 AM EDT
As of approximately 08:20 CDT, a portion of Cloud Sites customers began experiencing periods of latency in the ORD1 data center. Engineers are engaged and are working to resolve the issue. During this time, a portion of customers may be experiencing high latency on their sites.

If you have further questions, please contact a member of your support team.
09:07 AM EDT
Cloud Sites engineers are continuing to work on an issue affecting a portion of Cloud Sites storage targets in the ORD data center. During this time, a portion of customers may be experiencing high latency on their sites. We will update you as more information becomes available.

If you have further questions, please contact a member of your support team.

[Resolved] Resolved: Cloud Sites | High Latency | ORD Region
05:13 AM EDT
Cloud Sites engineers have resolved the issue affecting a portion of Cloud Sites customers in the ORD region. During the time of impact, a portion of customers may have experienced high latency on their sites and/or intermittent timeouts.

If you have further questions, please contact a member of your support team.
04:53 AM EDT
Cloud Sites engineers are continuing to work to resolve an issue affecting a portion of Cloud Sites customers in the ORD region. During this time, a portion of customers may be experiencing high latency on their sites.

We will update you as more information becomes available.

If you have further questions, please contact a member of your support team.

03:28 AM EDT
Engineers are working to resolve an issue causing a portion of Cloud Sites customers to experience periods of high latency in the ORD region. We will update you as more information becomes available.

If you have further questions, please contact a member of your support team.