          Safety and Security - Toxicity Detection Control

          Control Name

          Einstein Trust Layer - Toxicity Detection in Prompt & Response

          Control Overview

          Automatically scans AI prompts and generated responses to identify, flag, and score harmful language across multiple categories (for example, hate speech, violence, profanity).

          Description

          Uses a hybrid system of rules and machine learning to assign a toxicity confidence score (0–1) to content. High scores indicate a high probability of toxic content, allowing for automated blocking or flagging.

          Recommended Configuration

          Enable "Toxicity Detection" in Einstein Setup. Make sure that the Models API is configured to pass toxicity flags and that scores are actively monitored via the Einstein Audit Trail in Data Cloud.
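
          As a rough illustration of what actively monitoring scores could look like once audit data has been exported, the sketch below summarizes a batch of records. The AuditRecord shape, its field names, and the 0.7 review threshold are all assumptions made for this example; they are not the actual Einstein Audit Trail or Data Cloud schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical shape of one exported audit entry (not the real schema)."""
    record_id: str
    source: str            # assumed values: "prompt" or "response"
    toxicity_score: float  # confidence score between 0 and 1


def summarize(records: Iterable[AuditRecord], review_threshold: float = 0.7) -> dict:
    """Count records whose score is strictly above the review threshold."""
    records = list(records)
    flagged = [r for r in records if r.toxicity_score > review_threshold]

    # Tally flagged records separately for prompts and responses.
    flagged_by_source: dict = {}
    for r in flagged:
        flagged_by_source[r.source] = flagged_by_source.get(r.source, 0) + 1

    return {
        "total": len(records),
        "flagged": len(flagged),
        "flagged_rate": len(flagged) / len(records) if records else 0.0,
        "flagged_by_source": flagged_by_source,
        "flagged_ids": [r.record_id for r in flagged],
    }
```

          A reviewer could run a summary like this on a regular schedule and investigate any rise in the flagged rate, for prompts and responses separately.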

          Security Impact

          Reduces the risk that the AI generates biased, offensive, or legally compromising material.

          Business Impact

          Safeguards brand reputation by preventing the AI from interacting inappropriately with customers or employees, while providing a defensible audit trail for HR and Legal compliance.

          Security Risk If Not Configured

          Without active detection, the LLM may produce toxic hallucinations or respond to malicious prompts with harmful content that could be interpreted as the company’s official stance.

          Threat Scenarios

          Prompt Injection: A user tricks the AI into generating a profane response.
          Toxic Output: The LLM inadvertently generates biased or violent instructions based on a complex user request.

          Estimated CVSS Score Range

          Critical (9.0–10.0).

          Risk Impact Considerations

          Higher risk for customer-facing AI where unvetted toxic responses have immediate public visibility.

          Higher Risk When

          Toxicity detection is bypassed in favor of lower latency, or when the system is used in unsupported languages where detection accuracy is significantly lower.

          Low Risk When

          Toxicity detection is active, and admins regularly review the Audit Trail for toxic patterns and proactively block flagged responses.

          Business and Integration Considerations

          Toxicity detection adds a small amount of latency to the "Response Journey." Admins should set clear thresholds for what score (for example, >0.7) triggers an automatic block versus a simple warning.
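
          The block-versus-warn policy described above can be sketched as a simple mapping from score to action. This is a minimal illustration only: the Action names, the decide_action function, and the 0.4 warning threshold are hypothetical, and only the 0.7 blocking example comes from the text. It is not a Salesforce API.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


# Hypothetical thresholds; tune them for each use case.
WARN_THRESHOLD = 0.4
BLOCK_THRESHOLD = 0.7


def decide_action(score: float,
                  warn_threshold: float = WARN_THRESHOLD,
                  block_threshold: float = BLOCK_THRESHOLD) -> Action:
    """Map a toxicity confidence score in [0, 1] to an action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if warn_threshold > block_threshold:
        raise ValueError("warn_threshold must not exceed block_threshold")

    # Strict comparisons: a score exactly at a threshold takes the milder action.
    if score > block_threshold:
        return Action.BLOCK
    if score > warn_threshold:
        return Action.WARN
    return Action.ALLOW
```

          Keeping the thresholds as named parameters makes it straightforward to apply a stricter policy to customer-facing features than to internal ones.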

          Security Health Review Guidance

          Security Health Review scans the Einstein Trust Layer Setup to confirm that toxicity detection is enabled.

          Who Is Impacted

          Compliance Officers, HR, Legal Teams, and any end user who interacts with generative AI features.

           