Feature Degradation on Multiple Salesforce Instances on June 26-27, 2025

Publication Date: Nov 26, 2025
Description

Root Cause Analysis

Updated November 2025 (next steps closed)

This article and other items we publish on the subject, including through social media outlets, may contain forward-looking statements, the achievement or success of which involves risks, uncertainties, and assumptions. If any such risks or uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make.

We sincerely apologize for any impact this incident may have caused you and your business. At Salesforce, trust is our #1 value and security is our top priority. We value transparency and want to take this opportunity to outline the facts regarding a recent performance degradation that may have disrupted your ability to use multiple Salesforce services, as we currently understand them. Our investigation is ongoing, and we will provide customers with updated information in the future.

The information contained in this document is provided by Salesforce for general information purposes only and is based on information as of the date of distribution and is subject to change. The incident timeline displayed within this Salesforce Technology Service Delivery Root Cause Analysis reflects the incident investigation timeline (inclusive of remediation and monitoring). While it includes any period during which the Service was unavailable, the investigation, remediation, and monitoring period is generally much longer than the period of unavailability, if any, caused by the disruption.

Executive Summary - What Happened?

On June 26, 2025, customers across multiple instances began experiencing performance degradation while using Email Services. The Salesforce Technology Team (Technology Team) determined that there was a triggering event that consumed a high volume of resources on the mail servers. This led to slower processing of connections, which also caused external providers to increase the volume of connection attempts, creating a “retry storm.” The impact caused mail server failures, email delays, and secondary impacts to Message Queue (MQ) performance.

To bring customers out of impact, the Technology team used a multi-pronged approach for remediation, including disabling the problematic traffic, optimizing host networking, splitting inbound and outbound traffic to different hosts, migrating spool files to solid state drives (SSDs), and implementing firewall throttling. The Technology team declared the incident resolved on June 27, 2025, at 22:08 Coordinated Universal Time (UTC).

How did this issue impact Salesforce services?

Email Services: 

Customers experienced delays with inbound and outbound emails that were sent and received through Salesforce. In addition, customers may have observed long Message Queues (MQ) and slow Apex jobs. Customers may also have experienced email bounces and delivery failures.

After remediation, all emails that could be processed were processed. Emails that were marked as “permanent failures” will not be recovered, and customers will need to resend them.

Additionally, due to the earlier low rate of processing, some emails were retried, which may have led to duplicate Email-to-Case creation. Salesforce is working to quantify and investigate reports of such duplicates. 
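
For illustration only, one way a customer could scan for possible duplicate Email-to-Case records is to group recently created cases by sender and subject and flag those created close together in time. The Python sketch below assumes the case records have already been exported (for example, from a report or an API query) into dictionaries; the field names used here, such as SuppliedEmail and CreatedDate, are assumptions about that export, the time window is arbitrary, and this is not an official Salesforce procedure.

from collections import defaultdict
from datetime import datetime, timedelta

def find_probable_duplicates(cases, window_minutes=60):
    """Flag pairs of cases with the same sender and subject created within a
    short window of each other; such pairs may indicate duplicate
    Email-to-Case records created by email retries."""
    groups = defaultdict(list)
    for case in cases:
        # Hypothetical keys; adjust to match your actual export columns.
        key = (case["SuppliedEmail"].lower(), case["Subject"].strip().lower())
        groups[key].append(case)

    duplicates = []
    window = timedelta(minutes=window_minutes)
    for group in groups.values():
        if len(group) < 2:
            continue
        group.sort(key=lambda c: datetime.fromisoformat(c["CreatedDate"]))
        for prev, cur in zip(group, group[1:]):
            prev_t = datetime.fromisoformat(prev["CreatedDate"])
            cur_t = datetime.fromisoformat(cur["CreatedDate"])
            if cur_t - prev_t <= window:
                duplicates.append((prev["Id"], cur["Id"]))
    return duplicates

# Illustrative input; real data would come from a report or API export.
sample = [
    {"Id": "500A", "SuppliedEmail": "a@example.com", "Subject": "Help",
     "CreatedDate": "2025-06-26T12:05:00"},
    {"Id": "500B", "SuppliedEmail": "a@example.com", "Subject": "Help",
     "CreatedDate": "2025-06-26T12:20:00"},
]
print(find_probable_duplicates(sample))   # [('500A', '500B')]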

Inbound and outbound email functionality was impacted on the following instances: 

NA239, NA240, NA241, NA243, NA244, NA246, NA248, NA250, NA253

Inbound email (email-to-case, email-to-Apex, and so on) may have been affected if your organization was originally created in the affected data center, or if you created email services while hosted there and were subsequently migrated to a different location. Instances include:

BRA30, BRA36, BRA48, CAN34, CAN36, CAN44, CAN46, CAN50, CAN54, CAN56, DEU36, ITA24, NA225, NA226, NA233, NA236, NA237, NA242, NA249, NA251, NA252, NA254, SWE62, USA1, USA34, USA226, USA230, USA232, USA234, USA238, USA272, USA274, USA276, USA278, USA280, USA282, USA288, USA292, USA294, USA296, USA300, USA302, USA304, USA312, USA314, USA316, USA318, USA320, USA326, USA330, USA332, USA334, USA336, USA338, USA340, USA342, USA346, USA348, USA352, USA354, USA356, USA360, USA372, USA374, USA376, USA384, USA408, USA410, USA414, USA418, USA420, USA422, USA424, USA426, USA428, USA430, USA436, USA438, USA440, USA442, USA454, USA456, USA462, USA464, USA466, USA468, USA470, USA472, USA476, USA478, USA496, USA538, USA544, USA546, USA548, USA550, USA552, USA554, USA558, USA560, USA564, USA566, USA570, USA572, USA574, USA576, USA578, USA580, USA584, USA588, USA592, USA600, USA604, USA606, USA610, USA616, USA618, USA624, USA632, USA634, USA644, USA646, USA648, USA650, USA652, USA674, USA676, USA678, USA680, USA682, USA684, USA686, USA688, USA692, USA694, USA696, USA698, USA702, USA714, USA716, USA718, USA720, USA724, USA726, USA728, USA730, USA732, USA736, USA740, USA742, USA744, USA746, USA778, USA788, USA810, USA852, USA854, USA856, USA864, USA868, USA920, USA922, USA926, USA1012, USA1108, and USA9008

Technical Details

Detection and Initial Impact

On June 26, 2025, at 12:01 UTC, the Technology team received an alert indicating performance degradation on email services across multiple instances. The Technology team observed an ongoing event during which a large volume of emails were consuming mail server resources.

The overall impact to customers was significant delays in delivering outbound email and in processing inbound email.

Remediation

The Technology team identified that there was a triggering event that consumed a high volume of resources on the mail servers. The Technology team adopted a multi-pronged approach to resolving the problem via swimlanes (groups of Subject Matter Experts) working concurrently.

Swimlane: Mitigation of the triggering event

On June 26, 2025, at 13:42 UTC, the Technology team disabled mail related to the triggering event to prevent new mail sends, and cleared any existing mail events queued on the mail hosts.

Swimlane: Mail Host restart

Even after processing hundreds of thousands of messages from the mail queues, the Technology team found that the mail process continued to load additional messages from the queue file system into memory due to the large backlog it was processing. At this point, the Technology team decided to put the current queue on hold and reduce the load on the mail servers to bring performance back to normal. The Technology team stopped the mail service and moved the existing queues to standby storage. The Technology team then restarted the service with an empty queue. After the restart, the mail service eventually resumed, albeit in a degraded state.

Swimlane: Additional resource allocation

To assist with the outbound mail load, the Technology team added a single host from another superpod (a separate data center availability zone) to the affected data center. The additional resources helped reduce the load, but the Technology team noted the core application was having issues connecting to this newly added host as well.

Swimlane: Coordination with third-party MTA

On June 26, 2025, at 17:54 UTC, the Technology team engaged the third-party Mail Transfer Agent (MTA) vendor to support the investigation and recovery efforts. 

After reviewing system behavior and the current configurations, the third-party vendor requested that the Technology team make network configuration updates to optimize network performance for the MTA. With assistance from the third-party vendor, the Technology team reviewed and implemented these changes. Network performance improved, but the mail hosts continued experiencing issues. 

Swimlane: Separating inbound and outbound mail

The Technology team separated outbound from inbound mail processing. The Technology team shifted outbound traffic to a different set of hosts in another superpod in the same data center, which would leave only inbound mail processing on the existing hosts. This effectively doubled the capacity of the mail service. After this action was taken, all outbound email was being processed by the new hosts. This action restored outbound mail to normal processing, and all failures from the core application in reaching the mail hosts stopped.

After additional review, the Technology team determined that the volume of outbound email traffic that had been moved to the new hosts could cause large Internet email receivers to struggle to accommodate the sudden change in volume. The Technology team conducted a traffic study and developed a plan for Internet Protocol (IP) warmup (transmitting email over multiple different IP addresses), which was subsequently implemented. This dispersed the traffic across multiple US data centers. Only the contents of the emails were routed through the mail servers at these locations, where they were briefly stored until sent and then deleted. This process temporarily allows IP warmup from the original source location and prevents internet service provider (ISP) blocking due to large traffic shifts. The IP warmup period lasts 30 days and will be completed on July 28, 2025.
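
To illustrate the general idea of IP warmup described above, the following Python sketch computes a hypothetical ramp in which a small share of daily volume starts on the new IP addresses and grows geometrically to 100% by day 30. The volumes, IP addresses, and growth rate are illustrative assumptions, not Salesforce's actual warmup plan.

def warmup_schedule(total_daily_volume, new_ips, days=30, start_fraction=0.02):
    """Return, for each day, how much volume to send from the new IPs versus
    the established source. The share grows geometrically from a small
    starting fraction to 100% of volume on the final day."""
    # Growth factor chosen so start_fraction * factor**(days - 1) == 1.0.
    factor = (1.0 / start_fraction) ** (1.0 / (days - 1))
    schedule = []
    for day in range(days):
        fraction = min(1.0, start_fraction * factor ** day)
        from_new_ips = int(total_daily_volume * fraction)
        schedule.append({
            "day": day + 1,
            "from_new_ips_total": from_new_ips,
            "per_new_ip": from_new_ips // len(new_ips),
            "from_original_source": total_daily_volume - from_new_ips,
        })
    return schedule

if __name__ == "__main__":
    # Hypothetical numbers: one million messages per day, two new IPs.
    for row in warmup_schedule(1_000_000, ["192.0.2.10", "192.0.2.11"])[:3]:
        print(row)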

During the implementation of the restoration plan, the Technology team evaluated the state of the hosts still handling inbound email. These hosts were still having I/O issues, and the ongoing retry storm of outside connections was still impacting the MTA’s efficiency. The Technology team began exploring the implementation of firewall limits and faster disks.

The Technology team discussed appropriate firewall limitations that could reduce the amount of traffic while still allowing sufficient throughput. Limiting connections at the firewall would reject excess traffic back to the senders, which would then reattempt the connections to deliver mail. This would reduce the load on the mail hosts, allowing them to focus on the volume of already queued mail. This action helped reduce load on the servers as they completed processing the backlog.
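
The firewall throttling discussed above amounts to capping the rate at which new connections are accepted, so that excess senders are deferred and retry later. As a rough illustration of the concept (not the actual firewall configuration, which is not described in this article), the Python sketch below uses a token bucket to decide whether to accept or defer a new connection; the rate and burst values are assumptions.

import time

class ConnectionThrottle:
    """Token-bucket limiter: allow at most `rate` new connections per second,
    with short bursts up to `burst`. Connections arriving when the bucket is
    empty are deferred, and the remote MTA is expected to retry later, which
    spreads the load over time."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Illustrative use with assumed limits.
throttle = ConnectionThrottle(rate=50, burst=100)

def on_new_connection(conn_id):
    # Accept while tokens remain; otherwise defer so the sender retries later.
    return "accept" if throttle.allow() else "defer"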

Swimlane: Mail Queue Migration

The Technology team received reports of additional slowdowns in Message Queue (MQ) processing due to longer times to inject new emails. The Technology team initiated job suspension to address long MQ times and delays in Apex job processing.

Concurrently, the Technology team evaluated options to utilize faster SSDs available to the mail hosts. The Technology team decided to migrate the mail queue from one of the affected hosts to the SSD and observe performance. The Technology team saved the existing queue of mail on one of the affected hosts, reconfigured the service to utilize the faster SSD, restarted the mail services, and initiated an import of the old queue. New mail began to process instantly, and the host was again very responsive. The Technology team then repeated this task across all affected hosts. This restored the processing of new inbound email to normal. 
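
The hold, migrate, and import pattern described above can be sketched as follows. The paths, service name, and commands in this Python sketch are hypothetical placeholders; the actual MTA and its tooling are not identified in this article.

import shutil
import subprocess
from pathlib import Path

# Hypothetical locations, for illustration only.
OLD_SPOOL = Path("/var/spool/mta")           # queue on the saturated disk
SSD_SPOOL = Path("/ssd/spool/mta")           # new queue location on SSD
STANDBY = Path("/standby/spool/mta-held")    # where the old queue is parked

def migrate_queue_to_ssd():
    # 1. Stop the mail service so the queue is quiescent.
    subprocess.run(["systemctl", "stop", "mta.service"], check=True)
    # 2. Park the existing backlog on standby storage for later import.
    shutil.move(str(OLD_SPOOL), str(STANDBY))
    # 3. Create an empty queue directory on the SSD; in practice this also
    #    means pointing the MTA's spool-directory setting at the new path.
    SSD_SPOOL.mkdir(parents=True, exist_ok=True)
    # 4. Restart with the empty SSD-backed queue; new mail flows immediately
    #    while the parked backlog is re-imported separately.
    subprocess.run(["systemctl", "start", "mta.service"], check=True)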

The Technology team imported older mail queues that had been saved at the beginning of the incident. The Technology team continued to monitor until all imports were complete. These imports were completed on June 27, 2025, at 22:00 UTC.

On June 27, 2025, at 22:08 UTC, the Technology team declared the incident resolved.

Root Cause Analysis

The Technology Team’s post-incident investigation and analysis determined that the root cause of this incident was that Salesforce’s email infrastructure experienced a slowdown in processing that was triggered by the interaction between a unique traffic spike and suboptimal configuration. During the incident, a suboptimal disk layout, in which both the email message queues and log directories were configured to use the same disk resources, became a bottleneck that impacted the servers’ ability to process new connections.

The issue began when the mail servers experienced a sustained and elevated volume of outbound email traffic with an unusual connection pattern. This traffic led to increased disk Input/Output (I/O) and eventually exhausted both memory and I/O capacity. As a result, outbound messages began to retry and duplicate, further increasing system load. The degraded performance also impacted inbound email processing—including services like email-to-case and email-to-Apex—as the servers struggled to maintain stable connections.

As the servers slowed down, external mail providers escalated connection attempts and retries, creating a storm of retries that further strained the system. The surge in disk activity also triggered excessive logging, which compounded the disk I/O pressure and impacted both outbound and inbound mail services. Additionally, degraded mail server performance began to affect downstream components, including the Message Queue (MQ) services used by Apex and Mass Email, which encountered delays due to longer response times from the mail system.
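
The retry storm described above is the classic failure mode of aggressive, fixed-interval retries against a slowed server. As general background rather than a description of any specific provider's behavior, a sender that retries with exponential backoff and jitter spreads its attempts out instead of compounding the load. A minimal Python sketch:

import random
import time

def send_with_backoff(send_once, max_attempts=6, base_delay=1.0, max_delay=300.0):
    """Retry a send with exponential backoff plus full jitter. Compared with
    immediate, fixed-interval retries, this avoids many senders hammering a
    degraded server at the same moment."""
    for attempt in range(max_attempts):
        if send_once():
            return True
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))   # random wait up to the cap
    return False

# Example: a flaky send that succeeds on the third try.
attempts = {"n": 0}
def fake_send():
    attempts["n"] += 1
    return attempts["n"] >= 3

send_with_backoff(fake_send)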

The Technology Team collaborated with the third-party mail software vendor to identify and confirm the disk layout misconfiguration. This structural limitation, combined with the bursty traffic pattern, ultimately led to the failure to handle email connections at scale.
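
The shared-disk layout described above, with message queues (spool) and logs competing for the same disk, is a condition that can be checked mechanically. The following Python sketch compares the underlying block devices of two directories and warns when they collide; the directory paths are hypothetical placeholders, not Salesforce's actual layout.

import os

def same_block_device(path_a: str, path_b: str) -> bool:
    """Return True when two paths live on the same underlying device, meaning
    their I/O competes for the same disk bandwidth."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Hypothetical directories for illustration.
spool_dir = "/var/spool/mta"
log_dir = "/var/log/mta"

if os.path.exists(spool_dir) and os.path.exists(log_dir):
    if same_block_device(spool_dir, log_dir):
        print("WARNING: mail spool and logs share one disk; heavy logging "
              "will contend with queue I/O under load.")
    else:
        print("OK: spool and logs are on separate devices.")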

Next Steps

To maintain the performance level that our customers expect from Salesforce and to prevent this defect from recurring, our focus is on continuous improvement. The Technology Team has identified and is implementing the following actions:

Completed:

  • The Technology Team optimized the infrastructure configuration on the impacted mail servers, including the disk layout for log and spool files, to handle increased volume

  • The Technology Team implemented rate-limiting configurations at the firewall level for the impacted mail servers to preemptively protect against similar overload scenarios

  • The Technology Team is expanding the optimized infrastructure configuration to the entire fleet of servers

  • The Technology Team is enhancing monitoring to include additional metrics related to mail server health and latency

  • The Technology Team is evaluating additional limit protections in the core application

  • The Technology Team is evaluating solutions to reduce the impact on customers’ organizations that have been migrated to other locations

We sincerely apologize for the impact this incident may have caused you and your business; Salesforce is fully committed to minimizing downtime when incidents do occur. We also continually assess and improve our tools, processes, and architecture to provide you with the best service possible.

 

Knowledge Article Number

005100898

 