Optimizing Search Indexes: Field Selection and Chunking
When you create a search index in the advanced builder, you can optimize your search index to deliver more accurate results by paying attention to the field selection and chunking strategies you use.
Indexing Text Fields
If you want to add text fields when creating a
search index, select text fields with longer, free-text content. You can even index multiple
text fields from a DMO. For example, if you select the Description,
Summary, Content, and Resolution fields
from a DMO, Data 360 stores all corresponding vectors in the same search index.
You
can separate vectors on the basis of the DataSource__c field in the Index DMO. The DataSource__c field
contains the original field name. Because this field is in the Index DMO, you can use it in a
retriever’s prefilter. For example, you can configure a retriever to evaluate queries on
semantic similarity to a specific field only, such as Description and not
Resolution.
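Conceptually, a prefilter on DataSource__c narrows the candidate set to chunks sourced from one original field before similarity ranking. A minimal sketch, assuming a hypothetical record shape (this is illustrative, not the Data 360 retriever API):

```python
# Illustrative index records: each chunk's vector carries the name of
# the original field it came from in DataSource__c. Scores are made up.
records = [
    {"DataSource__c": "Description", "text": "Device X overheats under load.", "score": 0.91},
    {"DataSource__c": "Resolution",  "text": "Replace the fan assembly.",      "score": 0.88},
    {"DataSource__c": "Description", "text": "Device X shows error code 42.",  "score": 0.75},
]

def prefilter(records, source_field):
    """Keep only chunks whose vectors came from the given original field."""
    return [r for r in records if r["DataSource__c"] == source_field]

matches = prefilter(records, "Description")
# Only Description-sourced chunks remain; the Resolution chunk is excluded.
```

In Data 360 you configure this as a retriever prefilter rather than in code, but the effect is the same: similarity is evaluated only against the field you choose.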
Avoid selecting too many similar or redundant
fields (for example, Summary, Title, and
Description). Doing so can decrease recall if your retriever
doesn’t have prefilters on DataSource__c. Because these fields likely
contain the same or very similar information, at least three chunks (one chunk for each
field) from the same document can appear highly ranked in the query results. These chunks bring the
same information to the LLM: if you configure the retriever to return, for example,
nine results, only three documents are represented in the results. This reduces
variation in your search results and can cause relevant documents to be missed.
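The crowding effect of redundant fields on a top-k result set can be sketched as follows (the document IDs and the 3×3 pattern are made up for illustration):

```python
# Illustrative: with three near-duplicate fields indexed per document,
# a top-9 retrieval can surface chunks from only three distinct documents.
top_9 = [
    ("doc-1", "Summary"), ("doc-1", "Title"), ("doc-1", "Description"),
    ("doc-2", "Summary"), ("doc-2", "Title"), ("doc-2", "Description"),
    ("doc-3", "Summary"), ("doc-3", "Title"), ("doc-3", "Description"),
]

unique_docs = {doc_id for doc_id, _ in top_9}
print(len(top_9), "results, but only", len(unique_docs), "documents")
# 9 results, but only 3 documents
```

Six of the nine retrieval slots carry information the LLM already has, which is the recall loss described above.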
When two or more fields represent the same content in different forms, we
recommend selecting the field with the longest text, such as Description. Consider
prepending that field with a shorter, more condensed version, such as
Title.
Using Prepend Fields
One way to optimize your chunking
strategy is to use prepend fields to add context to chunks and make them easier to
identify. For example, suppose you have a chunk that contains a sequence of troubleshooting
steps. By prepending that chunk with the Title field that contains the text
“How to Fix Device X When It Shows Behaviour Y,” you make it easier to identify that content
as relevant to a user’s question. Prepending fields like Title or
Product Name adds those values to each chunk, which makes them visible in
prompt augmentation and in the Data 360 Query Editor.
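The effect of a prepend field can be sketched as follows. Data 360 performs this step for you when you configure prepend fields; the function and sample strings here are purely illustrative:

```python
def prepend_field(chunk_text: str, field_value: str) -> str:
    """Prefix a chunk with a short contextual field (e.g. Title) so the
    combined text is embedded, retrieved, and displayed together."""
    return f"{field_value}\n{chunk_text}"

title = "How to Fix Device X When It Shows Behaviour Y"
chunk = "Step 1: Power off the device. Step 2: Hold the reset button for 10 seconds."
enriched = prepend_field(chunk, title)
# The enriched chunk now carries the title as identifying context,
# so a bare sequence of steps is recognizable as belonging to Device X.
```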
Adjusting Chunk Size
Another way to optimize chunking is to tune the chunk size.
When you create a default search index, Data 360 uses semantic-based passage-extraction markers to chunk your content into the smallest possible pieces. Data 360 then merges the chunks back together until it reaches the chunk size you specify, or the default maximum chunk size (512 tokens).
Expect some experimentation here: the optimal chunk size and strategy vary per RAG or agent implementation.
For more information, refer to How the Max Token Setting Affects Chunking.
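The merge behavior can be approximated with a greedy pass: split the content into the smallest semantic pieces, then combine adjacent pieces until adding the next one would exceed the token budget. This is a rough sketch only; Data 360’s actual tokenizer and extraction markers differ, and words stand in for tokens here:

```python
def merge_chunks(pieces, max_tokens=512):
    """Greedily merge adjacent small pieces until adding the next piece
    would exceed max_tokens. Word count stands in for token count."""
    merged, current, current_len = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        if current and current_len + n > max_tokens:
            merged.append(" ".join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += n
    if current:
        merged.append(" ".join(current))
    return merged

pieces = ["alpha beta", "gamma delta epsilon", "zeta", "eta theta"]
print(merge_chunks(pieces, max_tokens=4))
# ['alpha beta', 'gamma delta epsilon zeta', 'eta theta']
```

Note that a single piece larger than the budget still becomes its own chunk; the budget caps merging, not splitting.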
Optimizing Chunks for Retrieval
When planning the size of your chunks for retrieval, consider the information density and organizational structure of the content you’re chunking. Remember that one chunk results in one vector: all the content in the chunk is represented by that single vector. Consider how many words are needed to adequately capture the meaning of a chunk. Will 400 to 500 words work? Or can fewer words sufficiently capture a self-contained piece of information (possibly enhanced with field prepending or chunk enrichment)? These are the kinds of questions to raise in your planning.
Optimizing Chunks for Prompt Augmentation
You should also consider chunking from a prompt augmentation perspective. How many chunks does your LLM need to generate a sufficiently usable response? Is a small, individual factoid useful enough, or does your LLM require more context?
For UDMO-based search indexes, augmentation typically relies on the chunk content alone; in that case, chunks need to be larger to include extra context.
For DMO-based indexes, you have more options because you can use additional fields for augmentation. It’s even possible to augment the prompt with the original document (for example, a knowledge article) instead of a single chunk. This increases the size of your prompt, so consider the context window of your LLM in relation to the number of results you select. Keep in mind that larger prompts increase the cost of response generation: if you increase the prompt and response size, you consume more Einstein Requests.
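A back-of-the-envelope check of prompt size against the context window can be sketched like this. The numbers and the reserved-response allowance are illustrative assumptions, not measured values from any model:

```python
def fits_context(results, tokens_per_result, context_window, reserved_for_response=1024):
    """Rough check: does augmenting with N results of a given size
    leave room for the response within the context window?"""
    prompt_tokens = results * tokens_per_result
    return prompt_tokens + reserved_for_response <= context_window

# Nine 512-token chunks fit comfortably in a hypothetical 16k window...
print(fits_context(9, 512, 16_000))   # True
# ...but nine full 4,000-token documents do not.
print(fits_context(9, 4_000, 16_000)) # False
```

If you switch from chunk-level to document-level augmentation, lowering the number of retrieved results is usually the first lever to pull.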

