
Feature Disruption to Einstein Chatbots on February 6-7, 2024

Publish Date: Mar 4, 2024
Description

Root Cause Analysis 

This article and other items we publish on the subject, including through social media outlets, may contain forward-looking statements, the achievement or success of which involves risks, uncertainties, and assumptions. If any such risks or uncertainties materialize or if any of the assumptions proves incorrect, the results of Salesforce, Inc. could differ materially from the results expressed or implied by the forward-looking statements we make.

We sincerely apologize for any impact this incident may have caused you and your business. At Salesforce, trust is our #1 value, and security is our top priority. We value transparency and want to take this opportunity to outline the facts, as we currently understand them, regarding a recent service disruption that may have affected your ability to use multiple Salesforce services. Our investigation is ongoing, and we will provide customers with updated information in the future.

The information contained in this document is provided by Salesforce for general information purposes only and is based on information as of the date of distribution and is subject to change. The incident timeline displayed within this Salesforce Technology Service Delivery Preliminary Root Cause Analysis reflects the incident investigation timeline (inclusive of remediation and monitoring). While it includes any period during which the Service was unavailable, the investigation, remediation, and monitoring period is generally much longer than the period of unavailability, if any, caused by the disruption.

Executive Summary - What Happened?

On February 6, 2024, at 21:00 UTC, the Salesforce Technology team (Technology team) identified a feature degradation on Einstein Bots and Einstein Language Understanding (ELU) features. Customers who implemented ELU were unable to communicate with their clients. 

Upon initial investigation, the Technology team discovered that failed chat requests were transferred to a live (human) agent.

As a precautionary step, the Technology team performed a rollback of a recent change that correlated to the beginning of this event. Subsequently, the Technology team provisioned additional capacity to mitigate performance issues. 

The Technology team performed multiple actions and validation health checks to isolate the trigger for this event, including implementing configuration changes at the application tier that restored normal operations for customers. The remediation actions improved performance, and the team validated that customers were no longer impacted at around 04:40 UTC on February 7, 2024.

The preliminary investigation indicates that AI Platform Prediction timed out when generating AI Predictions due to thread exhaustion in the Inference Graph Execution Services (IGES) and other downstream prediction services caused by an overwhelming increase in requests to the services. 

The Technology team is still actively conducting a full investigation of the incident to confirm the technical trigger, the underlying cause, and preventive actions to avoid a recurrence.

How did this issue impact Salesforce services?

During the period of impact, customers who implemented ELU experienced intermittent failures with Einstein Bots and were unable to communicate with their clients.

Specifically, the failed chat requests were transferred to a live (human) agent, leading to potential capacity issues for agents. 

Technical Details

Detection and Initial Impact

February 6, 2024, 22:21 UTC: The Technology team started investigating a feature degradation on multiple cells reporting issues with ELU.

22:43 UTC: The Technology team observed that Einstein Bots had intermittent failures, with chats escalated to a human agent. The impact was on Hyperforce cells and first party (1P) instances in the IA2, IA4, IA5, IA6, IA7, prod1, prod4, prod5, and prod9 datacenters.

23:17 UTC: The Technology team engaged additional Subject Matter Experts (SMEs) and based on the SME recommendations, initiated action to provision additional capacity to mitigate performance issues.

February 7, 2024, 00:42 UTC: As an additional mitigation step, the Technology team initiated a rollback of a recent deployment change. The Technology team suspected the deployment change was related to the timeline of the incident. 

Remediation

01:14 UTC - 04:00 UTC: The Technology team explored workarounds and carried out mitigations, including provisioning additional capacity in the Einstein Prediction Service Layer, validation of recent deployment changes around Einstein Platform releases, and relevant health checks. The rollback of the recent deployment change was completed. However, this did not mitigate the disruption.

04:40 UTC: The Technology team restarted the Prediction Services, followed by configuration changes at the application tier. As a result of these multiple remediation actions, the Einstein Bots feature recovered and returned to a healthy state.

05:18 UTC: After a period of monitoring, the Technology team declared the impact as resolved.

Root Cause Analysis 

The post incident investigation identified that AI Platform Prediction timed out when generating AI Predictions due to thread exhaustion in the Inference Graph Execution Services (IGES) and other downstream prediction services caused by an overwhelming increase in requests to the services. 

One of the downstream services had experienced high latencies with requests to some of its GPU clusters due to the increased volume of traffic from two applications unrelated to Bots. Some of the increased traffic was a result of organic growth from the applications, while some was due to errors and their retries within the applications. Since the last update, the Technology team reproduced the issue in the performance environment and confirmed that the thread exhaustion occurred as a result of an overwhelming increase of requests and high latencies to a subset of the GPU clusters due to insufficient GPU capacity.
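
The effect of retries on request volume can be illustrated with a small calculation. The following is a minimal sketch, not a description of the actual applications: it assumes each attempt fails independently with a fixed probability and that a client retries up to a fixed cap. The function name and the numbers in the comments are hypothetical.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of attempts sent downstream per logical request.

    Assumption (illustrative only): every attempt fails independently with
    probability `failure_rate`, and a failed attempt is retried until it
    succeeds or `max_retries` retries have been used.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k (0-based) is only sent if the previous k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


# Hypothetical comparison: the same error rate with a higher and a lower retry cap.
high_cap = expected_attempts(failure_rate=0.5, max_retries=3)  # 1.875 attempts per request
low_cap = expected_attempts(failure_rate=0.5, max_retries=1)   # 1.5 attempts per request
```

Under these assumptions, lowering the retry cap reduces the extra load that errors add on top of organic traffic, which is consistent with the retry reductions listed under Next Steps.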

The thread exhaustion resulted in increased latencies to all calls in the Prediction service, including ELU calls, with some of these calls timing out. As a result, customers who implemented ELU were experiencing intermittent failures with Einstein Bots and were unable to communicate with their clients for some time.
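
One way to reason about why higher latency to a subset of GPU clusters can exhaust a shared thread pool is Little's law: in steady state, the average number of busy threads is roughly the arrival rate multiplied by the average time each request holds a thread. The sketch below is an approximation under that assumption. The pool size, request rate, and latencies are hypothetical and are not taken from the incident.

```python
def threads_needed(requests_per_second: float, latency_seconds: float) -> float:
    """Average number of threads held concurrently (Little's law: L = lambda * W)."""
    return requests_per_second * latency_seconds


def is_exhausted(pool_size: int, requests_per_second: float, latency_seconds: float) -> bool:
    """True when steady-state demand meets or exceeds the pool size.

    Once this holds, new requests queue behind slow ones, so even calls that
    would normally be fast see higher latency and may time out.
    """
    return threads_needed(requests_per_second, latency_seconds) >= pool_size


# Hypothetical numbers: a 200-thread pool serving 1000 requests per second.
normal = is_exhausted(pool_size=200, requests_per_second=1000, latency_seconds=0.05)
degraded = is_exhausted(pool_size=200, requests_per_second=1000, latency_seconds=0.5)
```

In this illustration the request rate is unchanged, yet a tenfold latency increase alone moves the pool from mostly idle to saturated. A simultaneous increase in request rate, as described above, compounds the effect.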

The Technology team has also identified that the auto-scaling of the Prediction service was unable to mitigate the thread exhaustion, as the increase in the number of requests was higher than the volume auto-scaling could support. When the auto-scaling failed, the Technology team manually scaled up the Prediction and IGES services, implemented configuration changes at the application tier, and restarted the Prediction services to restore normal operations for customers.

Next Steps

To maintain the performance level that our customers expect from Salesforce and to prevent this defect from recurring, our focus is on continuous improvement. The Technology team has identified and is implementing the following actions:

  • Complete:

    • Rolled back recent deployment changes on the Einstein Platform.

    • Resolved connectivity errors with a clean restart of the Prediction services.

    • Increased the number of GPU clusters for models used by the two applications that had increased volume.

    • Reduced the number of retries and timeout value from one of the Prediction Service applications with high latencies.

    • Reproduced the issue in the performance environment and confirmed the trigger of the thread exhaustion.

    • Reduced the number of retries and timeouts to all impacted Prediction services from the application tier. 

    • Analyzed the max auto-scaling configurations to account for increased request rates and updated settings as needed.

  • In Progress: 

    • Exploring deployment of multiple separate clusters for Prediction Services to handle different use cases, in order to mitigate impact to Bots from unrelated applications. ETA: End of March

    • Improving resilience of Prediction services by investigating options for fine-grained rate limiting and enhanced thread-pool resource allocations. ETA: End of March 

    • Validating and provisioning additional network-level and application-level health checks to prevent recurrence of the issue in the future. ETA: End of March

    • Exploring and provisioning additional alerting and monitoring of thread pool exhaustion. ETA: End of March
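
The idea behind fine-grained rate limiting and thread-pool resource allocation can be sketched with a per-caller concurrency limit, often called a bulkhead. This is an illustrative sketch only and is not the design Salesforce is implementing. The class, the exception, the caller names ("bots", "batch"), and the limits are all hypothetical.

```python
import threading
from contextlib import contextmanager


class CallerRejected(Exception):
    """Raised when a caller has used up its own concurrency allowance."""


class Bulkhead:
    """Caps concurrent requests per caller so one caller cannot occupy every thread."""

    def __init__(self, limits: dict[str, int]):
        # One bounded semaphore per caller; the limit is that caller's share of the pool.
        self._slots = {name: threading.BoundedSemaphore(n) for name, n in limits.items()}

    @contextmanager
    def acquire(self, caller: str):
        slot = self._slots[caller]
        # Fail fast instead of queueing, so a saturated caller does not hold threads.
        if not slot.acquire(blocking=False):
            raise CallerRejected(caller)
        try:
            yield
        finally:
            slot.release()


# Hypothetical allocation: the bot workload keeps its share even if "batch" is saturated.
bulkhead = Bulkhead({"bots": 2, "batch": 1})
```

A request handler would wrap each downstream call in `with bulkhead.acquire("bots"):` and treat `CallerRejected` as a signal to shed load or fall back. The number of rejections per caller is also a natural metric to alert on before a shared pool is exhausted.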

We sincerely apologize for the impact this incident may have caused you and your business; Salesforce is fully committed to minimizing downtime when incidents do occur. We also continually assess and improve our tools, processes, and architecture to provide you with the best service possible.

 

Knowledge Article Number

000979033
