Safety and Security - Toxicity Detection Control
Control Name
Einstein Trust Layer - Toxicity Detection in Prompt & Response
Control Overview
Automatically scans AI prompts and generated responses to identify, flag, and score harmful language across multiple categories (for example, hate speech, violence, profanity).
Description
Uses a hybrid system of rules and machine learning to assign a toxicity confidence score (0–1) to content. High scores indicate a high probability of toxic content, allowing for automated blocking or flagging.
Recommended Configuration
Enable "Toxicity Detection" in Einstein Setup. Make sure that the Models API is configured to pass toxicity flags and that scores are actively monitored via the Einstein Audit Trail in Data Cloud.
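As a rough illustration of what actively monitoring scores can look like, the sketch below summarizes toxicity scores from audit records that have already been exported for review. The record layout (dictionaries with `category` and `score` keys), the function name `summarize_scores`, and the 0.7 review cutoff are assumptions made for this example; they are not the Einstein Audit Trail schema or a Data Cloud API.

```python
from collections import defaultdict


def summarize_scores(records, review_cutoff=0.7):
    """Per category: number of records, highest score, and count above the cutoff."""
    summary = defaultdict(lambda: {"count": 0, "max_score": 0.0, "flagged": 0})
    for record in records:
        entry = summary[record["category"]]
        entry["count"] += 1
        entry["max_score"] = max(entry["max_score"], record["score"])
        if record["score"] > review_cutoff:
            entry["flagged"] += 1
    return dict(summary)


# Hypothetical exported records, not real audit data.
sample = [
    {"category": "hate", "score": 0.91},
    {"category": "profanity", "score": 0.35},
    {"category": "hate", "score": 0.12},
]
report = summarize_scores(sample)
```

A per-category summary like this makes it easier to spot which kinds of content most often approach the review cutoff.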
Security Impact
Reduces the risk that the AI generates biased, offensive, or legally compromising material.
Business Impact
Safeguards brand reputation by preventing the AI from interacting inappropriately with customers or employees, while providing a defensible audit trail for HR and Legal compliance.
Security Risk If Not Configured
Without active detection, the LLM may produce toxic hallucinations or respond to malicious prompts with harmful content that could be interpreted as the company’s official stance.
Threat Scenarios
Prompt Injection: A user tricks the AI into generating a profane response.
Toxic Output: The LLM inadvertently generates biased or violent instructions based on a complex user request.
Estimated CVSS Score Range
Critical (9.0–10.0).
Risk Impact Considerations
Higher risk for customer-facing AI where unvetted toxic responses have immediate public visibility.
Higher Risk When
Toxicity detection is bypassed in favor of lower latency, or when the system is used in unsupported languages where detection accuracy is significantly lower.
Low Risk When
Toxicity detection is active, and admins regularly review the Audit Trail for toxic patterns and proactively block responses.
Business and Integration Considerations
Toxicity detection adds a small amount of latency to the "Response Journey." Admins should set clear thresholds for what score (for example, >0.7) triggers an automatic block versus a simple warning.
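The threshold policy described above can be sketched as follows. This is an illustration only, not the Einstein Trust Layer's implementation: the `Action` enum, the function name `decide_action`, and the 0.4 warning cutoff are assumptions. Only the 0 to 1 score range and the example block threshold of 0.7 come from this page.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


# Hypothetical cutoffs; tune them to your own risk tolerance.
BLOCK_THRESHOLD = 0.7
WARN_THRESHOLD = 0.4


def decide_action(score, block_threshold=BLOCK_THRESHOLD, warn_threshold=WARN_THRESHOLD):
    """Map a toxicity confidence score in [0, 1] to an action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if warn_threshold > block_threshold:
        raise ValueError("warn_threshold must not exceed block_threshold")
    if score > block_threshold:
        return Action.BLOCK  # strictly above the block threshold
    if score > warn_threshold:
        return Action.WARN   # suspicious, but below the block threshold
    return Action.ALLOW
```

Keeping the two cutoffs as named parameters lets customer-facing and internal use cases apply different policies without changing the logic.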
Security Health Review Guidance
Security Health Review scans the Einstein Trust Layer Setup to confirm that toxicity detection is enabled.
Who Is Impacted
Compliance Officers, HR, Legal Teams, and any end users interacting with generative AI features.

