Content Safety and Security
Einstein Trust Layer includes a set of policies to help detect potentially harmful content and malicious attacks that attempt to compromise the safety and security of AI applications.
Required Editions
| Available in: Enterprise, Performance, and Unlimited Editions with an Einstein for Sales, Einstein for Platform, Einstein for Service, Einstein 1 Service, or Einstein GPT Service add-on. To purchase add-ons, contact your Salesforce account executive. |
Harmful Content
Harmful content refers to information that can have detrimental effects on individuals or communities. In the context of generative AI models, harmful content refers to data in prompts or responses that negatively impact mental health, behavior, or well-being. Harmful content can include toxic material such as hate and violence or content that has unfair or discriminatory patterns.
Large Language Models (LLMs) can inadvertently generate harmful content due to several reasons:
- Prompt influence: The language in the prompt directly influences the model’s output. For example, if the prompt contains offensive or harmful phrases, the model can incorporate similar language in its response.
- Training data: LLMs learn from vast amounts of data, which can include biased or toxic content. If the training data contains harmful language, the model can inadvertently reproduce it.
- Contextual patterns: LLMs generate responses based on statistical patterns in the data. If harmful language appears frequently in similar contexts, the model can replicate those patterns.
- Fine-tuning and transfer learning: Fine-tuning LLMs on specific tasks can introduce biases. Transfer learning from unrelated domains can also affect content generation.
Einstein Trust Layer uses machine learning models to identify harmful content in generative AI applications and features.
Prompt Injections
Prompt injections are attempts to make the LLM do something that it isn’t designed to do. Hackers can create prompts that attempt to override the system policies or manipulate the LLM into doing something unintended.
Salesforce provides prompt defense mechanisms to help mitigate risks posed by malicious attacks.
Prompt injection detection together with the system policies provide an in-depth approach to prompt defense. Prompt injection defense is consistently applied to all user prompts, bolstering security in Agentforce and embedded AI applications.
- We have built-in system policies to help limit hallucinations and decrease the likelihood of unintended or harmful outputs by the LLM.
- Prompt injection detection is used to help detect malicious attacks that attempt to manipulate the LLMs into doing something it wasn’t designed to do.

