About Knowledge/RAG Quality Data and Metrics

Learn about the metrics used to score retrievals. This feature collects data at run time and calculates quality scores for RAG-powered knowledge retrievals.

Required Editions

Available in: Lightning Experience

Available in: Enterprise, Performance, Unlimited, and Developer Editions. Required add-on licenses vary by agent type.

Note: RAG quality metrics are supported for individual and ensemble retrievers.

RAG Quality Metrics

Use quality metrics to track run-time performance, view trends, and identify areas to improve in RAG-powered solutions. RAG quality metrics help you identify problem patterns, conduct root cause analysis, and fine-tune your RAG configuration.

Metric Name	Description
Context Precision	Measures the relevance of the retrieved context, calculated based on both the question and context. What proportion of the right knowledge was found?
Faithfulness	Measures the factual consistency of the generated answer against the given context. How well did the agent use the retrieved knowledge?
Answer Relevance	Measures how pertinent the generated answer is to the given prompt. How completely and how well is the question answered?

Common Patterns in RAG Quality Metrics

Pattern	Indication	Investigate
High Faithfulness, Low Context Relevance	The answer is grounded in the retrieved context, but that context isn’t relevant to the query. As a result, the answer relevance is also likely low. This symptom likely indicates a problem in the retrieval.	Does the content actually exist in the knowledge store? Is the number of returned results sufficiently high? Are the correct result fields selected? Is the search string well formed? For non-English content, is the multilingual embedding model selected?
Low Faithfulness, High Context Relevance	The answer isn’t grounded in the context, even though that context is relevant to the query. Answer relevance is also likely low. This symptom likely indicates a problem in the LLM generation, possibly due to a shortcoming in prompt engineering. Example: The prompt fails to give the LLM sufficiently strong instructions to follow the provided context.	Is the prompt template well written? Is the LLM sufficiently capable of performing the required reasoning task? If not, select a different LLM or LLM version.
High Faithfulness and High Context Relevance, Low Answer Relevance	The answer is grounded in the context, and that context is relevant to the query. However, the answer relevance is still low. This symptom likely indicates that insufficient context was retrieved to fully answer the query. The problem is likely in the retrieval, particularly in its recall.	Does the content actually exist in the knowledge store? Is the number of returned results sufficiently high? Are the correct result fields selected?
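
The patterns in the table above can be read as a small decision procedure. The following Python sketch is illustrative only and isn't part of the product. It assumes that each score is a number between 0 and 1 and that a single cutoff separates "high" from "low"; the source defines neither. The names RagScores, diagnose, and HIGH_THRESHOLD are hypothetical, and context_relevance is assumed to correspond to the Context Precision metric.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the source does not define what counts as "high" or "low".
HIGH_THRESHOLD = 0.7


@dataclass
class RagScores:
    """Scores for one retrieval, assumed here to lie between 0 and 1."""
    faithfulness: float
    context_relevance: float  # assumed to map to the Context Precision metric
    answer_relevance: float


def diagnose(scores: RagScores, threshold: float = HIGH_THRESHOLD) -> str:
    """Map a set of scores to the likely problem area from the pattern table."""
    faithful = scores.faithfulness >= threshold
    relevant_context = scores.context_relevance >= threshold
    relevant_answer = scores.answer_relevance >= threshold

    if faithful and not relevant_context:
        # Grounded answer, irrelevant context: look at the retrieval.
        return "retrieval"
    if not faithful and relevant_context:
        # Relevant context, ungrounded answer: look at the prompt or the LLM.
        return "generation"
    if faithful and relevant_context and not relevant_answer:
        # Everything retrieved was usable, but there wasn't enough of it.
        return "retrieval recall"
    return "no documented pattern"


if __name__ == "__main__":
    print(diagnose(RagScores(0.9, 0.2, 0.3)))
    print(diagnose(RagScores(0.3, 0.9, 0.4)))
    print(diagnose(RagScores(0.9, 0.9, 0.4)))
```

Each returned label points to the matching row in the Investigate column. Score combinations that the table doesn't cover, such as all three scores being low, fall through to "no documented pattern" rather than being forced into a row.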

Billing Considerations

Collecting and scoring RAG quality metrics increases your org’s credit consumption rate, including LLM calls and Data 360 features. To learn more, see:

Agentforce: Flex Credits Billable Usage Types
Data 360: Billing Considerations for Audit and Feedback
