RCM for NA14 Disruptions of Service - May 2016
|Knowledge Article Number||000233934|
Published May 16, 2016 at approximately 8 a.m. PDT.
At Salesforce, trusted customer success is our #1 value, and delivering the highest standard in system availability, performance, and security is our top priority. We know that trust begins with transparency, and with that in mind, we are providing the full root cause analysis for the disruptions of service to NA14 in May 2016 described below.
At approximately 5:47 p.m. PDT on May 9, 2016 (00:47 UTC on May 10, 2016), the Salesforce Technology team observed a service disruption for the NA14 instance. They determined the issue was related to a power failure servicing a portion of the Washington, D.C. (WAS) data center. During the service disruption, customers on the NA14 instance were unable to access the Salesforce service.
As the team assessed whether the issue could be corrected within the WAS data center, they determined that a circuit breaker responsible for controlling power into the data center had failed. The team engaged the circuit breaker vendor who began the process of replacing the failed breaker. Multiple redundant power systems had not engaged, which led to power failures at the compute system level.
In an effort to restore service to the NA14 instance as quickly as possible, the team determined that the appropriate action was to perform a site switch of the instance from its primary data center (WAS) to its secondary data center in Chicago (CHI). As this activity was completed, service to the NA14 instance was restored. The team called an all-clear at 7:39 p.m. PDT on May 9, 2016. Procedures to begin local backup in the WAS data center were started at this time.
At approximately 5:41 a.m. PDT on May 10, 2016, the Technology team observed a degradation in performance on the NA14 instance. At approximately 6:31 a.m. PDT, the degradation escalated to a service disruption as a result of a database cluster failure on the NA14 instance. During the service disruption, customers on NA14 were unable to access the Salesforce service.
The team engaged our database vendor who determined that the database cluster failure had resulted in file inconsistencies in the NA14 database. Additionally, they found that the file discrepancies had been replicated to the database in the WAS data center and that the local backup had not yet completed in the CHI data center. Because the file discrepancies had been copied to the standby copy of the database in the WAS data center, the team determined a site switch back to this copy of the NA14 instance in WAS was not a viable option to restore service to NA14 customers. Additionally, because the local backup of the instance in the CHI data center had not completed, this copy could not be used to correct the file discrepancies and restore service.
The teams pursued several approaches for restoring service to NA14 in the CHI data center, including repairing the file discrepancies directly. Each attempt to restore service resulted in errors or failures that prevented these approaches from continuing. After multiple attempts to restore service to NA14 in the CHI data center failed, the decision was made to restore from a local backup of the instance in the WAS data center that did not have the file discrepancies present. However, this backup was last updated prior to the power failure that impacted NA14 the evening of May 9, and therefore did not contain data entered into the instance after the site switch to the CHI data center.
In an effort to apply transactional activity on NA14 from the time of the power failure in WAS to the time of the database failure in CHI, the team transported database redo logs from CHI to WAS and began the process of writing these transactions to the WAS database.
The process of applying these redo logs could not be completed before the team needed to prepare the instance for the start of peak customer activity on May 11, 2016. A decision was made to not impact a second day of customer activity on NA14, so the redo application process was halted and the team brought the NA14 instance up in read/write mode and performed readiness tests before opening it up to customers for the start of daily activity. Because the redo process was not completed, there was a period of time, from the point at which the redo process was stopped to the point of the database failure, where customer transaction data could not be applied prior to bringing the instance back online. This resulted in a window between 2:53 a.m. and 6:29 a.m. PDT on May 10, 2016 where data written to NA14 was not applied to the instance.
The storage array for NA14 continued to run in a degraded state through May 11, 2016. To prevent the volume of customer activity on NA14 from negatively impacting performance due to the backlog of built up jobs and new customer activity, the team halted several internal jobs, including sandbox copies and weekly exports, and staggered initial customer activity coming into the instance.
All functionality was restored to the NA14 instance, including sandbox copy and weekly export functionality, on May 15, 2016, once the teams were able to verify all backup processes were up to date.
Once all functionality was restored, an all-clear was called at 7:30 p.m. PDT on May 15, 2016 (02:30 UTC on May 16, 2016).
The The root cause of the initial power failure to the WAS data center was a failure of a main critical board that tripped open a circuit breaker pair. The breakers are used to segment power from the data center universal power supply ring and direct the power into the different rooms. This failed board caused a portion of the power distribution system to enter a fault condition. The fault created an uncertain power condition, which led to a redundant breaker not closing to activate the backup feed because that electronic circuit breaker could not confirm the state of the problem board.
Despite ongoing investigation, the root cause for the initial failure and subsequent fault condition of the breaker remains unknown. The breaker in question passed load testing in March 2016 as part of our regular data center certification. The manufacturer has reported that forensics testing was ultimately inconclusive due to the unusual failure condition.
Through investigation alongside our storage array vendor, the root cause of the database failure on May 10, 2016, was determined to be a latent firmware bug that was exposed due to increased traffic volume resulting from a combination of factors:
This increase in volume on the instance exposed the firmware bug on the storage array, which significantly increased the time for the database to write to the array. Because the time to write to the storage array increased, the database began to experience timeout conditions when writing to the storage tier. Once these timeout conditions began, a single database write was unable to successfully complete, which caused the file inconsistency condition to become present in the database. Once this inconsistency occurred, the database cluster failed and could not be restarted.
Our internal backup processes are designed to be near real-time, however the local copy of the database, which was required after the site switch to Chicago, had not yet completed. Our remote replication process copied the blocks that contained the file inconsistencies to the standby copy of the database in the WAS data center before the database crashed, resulting in these copies of the NA14 database being unusable for purposes of restoring service once the database cluster failed.
Through investigations alongside our third-party database vendor, the root cause for the database cluster crash resulting in file inconsistencies was determined to be a result of database partition alignment. The partitioning on the database storage disks resulted in database blocks becoming misaligned with the cache segments. This led to a situation where individual database blocks were split between two cache segments at the time of the database cluster crash. In this situation, the full cache segments containing a subset of partial database blocks were written to disk. Partial cache segments containing the remaining database blocks, were not written to disk at the time of the database cluster crash.
Next Steps and Preventive Actions:
Our vendor responsible for the WAS data center power circuits has replaced the failed components.
Our internal Data Center Operations team, in partnership with our data center providers, have completed an audit of the power and failover systems in all Salesforce data centers. This audit identified two rooms with similar power distribution designs. The team has updated these room designs to an N+1 configuration. Additionally, all hardware racks in these rooms have been dual-corded across busways to further reduce the reliance on individual components. These actions reduce the potential single point of failure in the event of a single power circuit breaker failing in a similar manner in the future.
The NA14 storage array hardware in the WAS and CHI data centers have been replaced with components that have higher firmware versions, and therefore are not vulnerable to the bug that caused the increased storage write times.
No other instances have the version of firmware on which the bug exists.
The Salesforce infrastructure team is in the process of deploying a new technology for data replication to standby copies of instances. This updated approach will utilize the wide-area network (WAN) to perform logical, database level replication, eliminating block-level replication. Additionally, this technology will help to perform site switches in a more streamlined and timely manner.
The database partition alignment condition has been corrected. This will prevent conditions where database blocks are split between cache segments and will prevent file integrity inconsistencies resulting from misaligned database blocks at the time of potential database crashes in the future.
Our commitment to you:
We sincerely apologize for the impact this incident caused you and your organization. It is our goal to provide world-class service to our customers, and we are continuously assessing and improving our tools, processes, and architecture in order to provide customers with the best service possible.