Use the Test Results
Once your tests finish running, you’ll see overall metrics for the test suite along with evaluation scores for each test. Use these results to identify failed or low-scoring utterances, then manually retest them in Agentforce Builder. Refine the instructions within your subagents and actions, and iterate until you’re satisfied with the responses. Pay close attention to recurring failures, because they can help uncover underlying issues such as unclear guidance, missing knowledge, or misconfiguration. To interpret your results, see how each evaluation helps to measure agent performance.
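If you copy your results into a script, the triage described above can be sketched roughly as follows. The record shape (`utterance`, `subagent`, `evaluation`, `passed`) and both function names are hypothetical and aren't an Agentforce export format; adapt them to however you capture your results.

```python
from collections import Counter

# Hypothetical result records; the field names and values are illustrative only.
results = [
    {"utterance": "Where is my order?", "subagent": "Order Inquiries",
     "evaluation": "action_assertion", "passed": False},
    {"utterance": "Cancel my order", "subagent": "Order Inquiries",
     "evaluation": "action_assertion", "passed": False},
    {"utterance": "Cancel my order", "subagent": "Order Inquiries",
     "evaluation": "subagent_assertion", "passed": True},
    {"utterance": "Reset my password", "subagent": "Account Help",
     "evaluation": "response_evaluation", "passed": False},
]


def failed_utterances(records):
    """Return distinct utterances with at least one failed evaluation, in first-seen order."""
    seen = []
    for record in records:
        if not record["passed"] and record["utterance"] not in seen:
            seen.append(record["utterance"])
    return seen


def recurring_failures(records, min_count=2):
    """Count failures per (subagent, evaluation) pair and keep the pairs that repeat."""
    counts = Counter(
        (record["subagent"], record["evaluation"])
        for record in records
        if not record["passed"]
    )
    return {key: count for key, count in counts.items() if count >= min_count}


to_retest = failed_utterances(results)   # candidates for manual retesting
patterns = recurring_failures(results)   # repeated failures worth investigating first
```

The first list gives you the utterances to retest manually. The second highlights subagent and evaluation pairs that fail repeatedly, which is where unclear instructions or missing knowledge are most likely to show up.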
Default evaluations are scored as 1 (pass) or 0 (fail).
| Evaluation | Testing Significance | Recommendation |
|---|---|---|
| Response Evaluation | Scores of 3 to 5 indicate that the agent generally succeeds in achieving its goal. A score of 5 reflects a precise, complete, and brand-aligned response with no irrelevant content. Scores of 3 to 4 show decreasing levels of clarity and completeness. There may be minor omissions or partial understanding that introduce some ambiguity, but the core intent is still addressed. Scores of 1 to 2 suggest the agent struggled to meet the goal. The response may be unclear, missing key elements, or include irrelevant information. It might also ask the user for information that should have been retrieved from the CRM. A score of 0 represents a complete failure to resolve the user’s query. The response is generic and doesn’t address the user’s intent. | Check your agent configuration for issues related to subagent selection, instructions, or actions. Consider whether there are knowledge gaps, like outdated articles, or other issues preventing the agent from responding appropriately. |
| Subagent Assertion | A score of 1 indicates that the agent correctly identified the appropriate subagent to address an utterance. A score of 0 indicates that the agent selected an unexpected subagent to address an utterance. | Manually retest any failed utterance in Agentforce Builder and review the agent's reasoning in the plan canvas. Refine the instructions for both the expected action and the subagent itself to clearly guide the agent toward the correct choice and restrict the use of the incorrect action. |
| Action Assertion | A score of 1 indicates that the agent correctly identified all of the appropriate actions within a subagent to address an utterance. A score of 0 indicates that the agent either chose the wrong actions or failed to select all necessary actions within a subagent to address an utterance. | Manually retest any failed utterance in Agentforce Builder and review the agent's reasoning in the plan canvas. Refine the instructions for both the expected action and the subagent itself to clearly guide the agent toward the correct choice and restrict the use of the incorrect action. |
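To make the two assertion scores concrete, here is a minimal sketch of how an expected-versus-actual comparison could produce a 1 or a 0. The function names are hypothetical, and the exact matching rule the testing tool applies isn't stated here. This sketch assumes that the selected actions must match the expected set exactly, so a wrong action or a missing action both score 0.

```python
def subagent_assertion(expected, actual):
    """Return 1 if the agent selected the expected subagent, otherwise 0."""
    return 1 if actual == expected else 0


def action_assertion(expected_actions, actual_actions):
    """Return 1 only if the selected actions equal the expected set, otherwise 0.

    Assumption: order doesn't matter and exact set equality is required.
    """
    return 1 if set(actual_actions) == set(expected_actions) else 0


# Example: the agent picked the right subagent but skipped one expected action.
subagent_score = subagent_assertion("Order Inquiries", "Order Inquiries")
action_score = action_assertion(
    {"Look Up Order", "Get Shipping Status"},
    {"Look Up Order"},
)
```

A passing subagent assertion alongside a failing action assertion, as in this example, suggests focusing on the action instructions rather than on subagent selection.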
Response quality evaluations are primarily scored between 0 and 5, with 3 or higher signifying a pass; however, Instruction Adherence is uniquely scored as High, Low, or Uncertain. The reasoning for each score is provided by the LLM judge via an infobubble.
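The scoring rule above can be expressed as a short sketch. The threshold of 3 and the High, Low, and Uncertain labels come from the text. The function names are hypothetical, and treating both Low and Uncertain as needing review is an assumption rather than documented behavior.

```python
PASS_THRESHOLD = 3  # per the text: a score of 3 or higher signifies a pass

ADHERENCE_LABELS = {"High", "Low", "Uncertain"}


def quality_passes(score):
    """Return True if a 0-to-5 quality score counts as a pass."""
    if not 0 <= score <= 5:
        raise ValueError(f"quality scores range from 0 to 5, got {score}")
    return score >= PASS_THRESHOLD


def adherence_needs_review(label):
    """Return True for Low or Uncertain adherence (assumption: only High needs no follow-up)."""
    if label not in ADHERENCE_LABELS:
        raise ValueError(f"unknown Instruction Adherence label: {label!r}")
    return label != "High"


completeness_ok = quality_passes(4)              # 4 is at or above the threshold
conciseness_ok = quality_passes(2)               # 2 is below the threshold
review_needed = adherence_needs_review("Uncertain")
```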
What is an LLM-as-judge?
LLM-as-judge is a technique in which one large language model (LLM) evaluates the outputs of another LLM, serving as a scalable, automated, and objective evaluation tool for tasks like scoring summaries or ranking responses. The LLM judge receives a prompt that includes the task and evaluation criteria, such as factual accuracy, relevance, coherence, and faithfulness to source. With these resources and guidelines, the LLM judge determines the expected response, compares it to the agent response, and then generates scores, rankings, or textual feedback. We’ve carefully designed our LLM-as-judge prompts to give you the most accurate and useful test results.
| Evaluation | Testing Significance |
|---|---|
| Completeness | A score of 5 indicates a fully complete and accurate answer with no important omissions. Scores of 4 and 3 reflect decreasing levels of completeness, with minor to moderate gaps that may slightly affect understanding. Scores of 1 or 2 indicate that the generated answer is significantly incomplete, missing several or most important pieces of information, which causes confusion or leads to a partially misleading result. A score of 0 represents a complete failure of the evaluation, signifying the answer missed all important pieces of information and is therefore highly confusing or misleading. |
| Coherence | Scores from 3 to 5 indicate that the agent can correctly transform underlying information into conversational language with the appropriate sentence and grammatical structures, ensuring that the dialogue flows smoothly and is easily understood by the user. Scores from 0 to 2 indicate that information has been taken from Salesforce objects and delivered as raw data like JSON structures or direct field content. |
| Conciseness | Scores from 3 to 5 indicate that the agent's response is short but accurate, successfully capturing the essence of the desired content. Scores from 0 to 2 mean the generated answer lacks conciseness, perhaps because the response is lengthy, repetitive, includes unimportant points, or contains irrelevant content. |
| Latency | A high or unusual latency in a test utterance suggests a problem with either the utterance itself or some underlying infrastructure. If agent adjustments don't resolve the issue, contact Salesforce support to check for infrastructure issues. |
| Instruction Adherence | High: The agent interprets and fully follows the subagent instructions, addressing key points and providing any required information. Low: The agent doesn’t interpret or follow the subagent instructions accurately. It fails to follow at least one instruction, leading to an incorrect response. This flags a need to refine instructions and set clearer constraints. Uncertain: Instruction adherence can't be conclusively determined due to an ambiguous response or action, incomplete response, or conflicting interpretations of the subagent instructions. |
While default evaluations are scored by comparing expected and actual results, quality evaluations are assessed by the LLM judge based on fixed criteria. If a response receives a low score, review it carefully to see whether it truly falls short of your expectations. For example, the LLM judge may assign a low conciseness score, but you might feel that additional context better serves your customers or aligns with your brand voice. Otherwise, you can add extra instructions or guardrails to better tailor the agent to your goals. Always consider your specific goals and use case when interpreting scores. Quality evaluations provide guidance, but they aren't absolute measures of your agent’s success or failure.

