Safety and Security - Toxicity Detection Control

Automatically scans AI prompts and generated responses to identify, flag, and score harmful language across multiple categories (for example, hate speech, violence, profanity).

Control Name

Einstein Trust Layer - Toxicity Detection in Prompt & Response

Control Overview

Automatically scans AI prompts and generated responses to identify, flag, and score harmful language across multiple categories (for example, hate speech, violence, profanity).

Description

Uses a hybrid system of rules and machine learning to assign a toxicity confidence score (0–1) to content. High scores indicate a high probability of toxic content, allowing for automated blocking or flagging.

Recommended Configuration

Enable "Toxicity Detection" in Einstein Setup. Make sure that the Models API is configured to pass toxicity flags and that scores are actively monitored via the Einstein Audit Trail in Data Cloud.

Security Impact

Reduces the risk that the AI returns biased, offensive, or legally compromising material to users.

Business Impact

Safeguards brand reputation by preventing the AI from interacting inappropriately with customers or employees, while providing a defensible audit trail for HR and Legal compliance.

Security Risk If Not Configured

Without active detection, the LLM may produce toxic hallucinations or respond to malicious prompts with harmful content that could be interpreted as the company’s official stance.

Threat Scenarios

Prompt Injection: A user tricks the AI into generating a profane response.

Toxic Output: The LLM inadvertently generates biased or violent instructions based on a complex user request.

Estimated CVSS Score Range

Critical (9.0–10.0).

Risk Impact Considerations

Risk is higher for customer-facing AI, where unvetted toxic responses are immediately visible to the public.

Higher Risk When

Toxicity detection is bypassed in favor of lower latency, or when the system is used in unsupported languages where detection accuracy is significantly lower.

Low Risk When

Toxicity detection is active, and admins regularly review the Audit Trail for toxic patterns and proactively block responses.
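As a rough illustration of what a periodic review for toxic patterns could look like, the sketch below computes the share of high-scoring interactions per AI feature from exported audit records. The record shape (`feature` and `toxicity_score` keys) and the function name are hypothetical and do not reflect the actual Audit Trail schema in Data Cloud; the 0.7 cutoff is only the example value used in this article.

```python
from collections import defaultdict

def flagged_rate_by_feature(records, threshold=0.7):
    """Return the fraction of records per feature whose score exceeds threshold.

    `records` is an iterable of dicts with hypothetical keys
    "feature" (str) and "toxicity_score" (float in the range 0-1).
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for rec in records:
        feature = rec["feature"]
        totals[feature] += 1
        if rec["toxicity_score"] > threshold:
            flagged[feature] += 1
    # Features with no flagged records report a rate of 0.0.
    return {feature: flagged[feature] / totals[feature] for feature in totals}

# Hypothetical sample data standing in for an audit export.
sample = [
    {"feature": "chat", "toxicity_score": 0.9},
    {"feature": "chat", "toxicity_score": 0.1},
    {"feature": "email", "toxicity_score": 0.2},
]
rates = flagged_rate_by_feature(sample)
```

A feature whose rate rises between reviews is a candidate for closer inspection or a stricter blocking policy.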

Business and Integration Considerations

Toxicity detection adds a small amount of latency to the "Response Journey." Admins should set clear thresholds for what score (for example, >0.7) triggers an automatic block versus a simple warning.
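The block-versus-warn decision described above can be sketched as a small policy function. This is a minimal illustration, not the Trust Layer's implementation: the `Action` enum, the `decide_action` name, and the 0.4 warning threshold are assumptions, and only the >0.7 blocking value comes from the example in this article.

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"

def decide_action(score: float, warn_at: float = 0.4, block_at: float = 0.7) -> Action:
    """Map a toxicity confidence score (0-1) to a handling action.

    warn_at is a hypothetical value; block_at follows the article's
    example of blocking when the score is greater than 0.7.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within 0-1, got {score}")
    if score > block_at:
        return Action.BLOCK
    if score >= warn_at:
        return Action.WARN
    return Action.ALLOW
```

Keeping both thresholds as parameters lets a customer-facing deployment use stricter values than an internal one without changing the logic.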

Security Health Review Guidance

Security Health Review scans the Einstein Trust Layer Setup to confirm that toxicity detection is enabled.

Who Is Impacted

Compliance Officers, HR, Legal Teams, and any end user interacting with generative AI features.
