Optimizing Search Indexes: Field Selection and Chunking
When you create a search index in the advanced builder, you can optimize your search index to deliver more accurate results by paying attention to the field selection and chunking strategies you use.
Indexing Text Fields
If you want to add text fields when creating a
search index, select text fields with longer, free-text content. You can even index multiple
text fields from a DMO. For example, if you select the Description,
Summary, Content, and Resolution fields
from a DMO, Data 360 stores all corresponding vectors in the same search index.
You
can separate vectors on the basis of the DataSource__c field in the Index DMO. The DataSource__c field
contains the original field name. Because this field is in the Index DMO, you can use it in a
retriever’s prefilter. For example, you can configure a retriever to evaluate queries on
semantic similarity to a specific field only, such as Description and not
Resolution.
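Conceptually, a prefilter on DataSource__c narrows the candidate set to chunks sourced from one original field before similarity ranking. A minimal sketch, assuming a hypothetical record shape (this is illustrative, not the Data 360 retriever API):

```python
# Illustrative index records: each chunk's vector carries the name of
# the original field it came from in DataSource__c. Scores are made up.
records = [
    {"DataSource__c": "Description", "text": "Device X overheats under load.", "score": 0.91},
    {"DataSource__c": "Resolution",  "text": "Replace the fan assembly.",      "score": 0.88},
    {"DataSource__c": "Description", "text": "Device X shows error code 42.",  "score": 0.75},
]

def prefilter(records, source_field):
    """Keep only chunks whose vectors came from the given original field."""
    return [r for r in records if r["DataSource__c"] == source_field]

matches = prefilter(records, "Description")
# Only Description-sourced chunks remain; the Resolution chunk is excluded.
```

In Data 360 you configure this as a retriever prefilter rather than in code, but the effect is the same: similarity is evaluated only against the field you choose.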
Avoid selecting too many similar or redundant
fields (for example, Summary, Title, and
Description). Doing so can decrease recall if your retriever
doesn’t have prefilters on DataSource__c. Because these fields likely
contain the same or very similar information, at least three chunks (one chunk for each
field) from the same document can appear highly ranked in the query results. These chunks bring the
same information to the LLM: if you configure the retriever to return, for example,
nine results, only three documents are represented in the results. This reduces
variation in your search results and can cause relevant documents to be missed.
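The crowding effect of redundant fields on a top-k result set can be sketched as follows (the document IDs and the 3×3 pattern are made up for illustration):

```python
# Illustrative: with three near-duplicate fields indexed per document,
# a top-9 retrieval can surface chunks from only three distinct documents.
top_9 = [
    ("doc-1", "Summary"), ("doc-1", "Title"), ("doc-1", "Description"),
    ("doc-2", "Summary"), ("doc-2", "Title"), ("doc-2", "Description"),
    ("doc-3", "Summary"), ("doc-3", "Title"), ("doc-3", "Description"),
]

unique_docs = {doc_id for doc_id, _ in top_9}
print(len(top_9), "results, but only", len(unique_docs), "documents")
# 9 results, but only 3 documents
```

Six of the nine retrieval slots carry information the LLM already has, which is the recall loss described above.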
When two or more fields represent the same content in different forms, we
recommend selecting the field with the longest text, such as Description. Consider
prepending that field with a shorter, more condensed version, such as
Title.
Using Prepend Fields
One way to optimize your chunking
strategy is to use prepend fields to add context to chunks and make them easier to
identify. For example, suppose you have a chunk that contains a sequence of troubleshooting
steps. By prepending that chunk with the Title field that contains the text
“How to Fix Device X When It Shows Behaviour Y,” you make it easier to identify that content
as relevant to a user’s question. Prepending fields like Title or
Product Name adds those values to each chunk, which makes them visible in
prompt augmentation and in the Data 360 Query Editor.
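The effect of a prepend field can be sketched as follows. Data 360 performs this step for you when you configure prepend fields; the function and sample strings here are purely illustrative:

```python
def prepend_field(chunk_text: str, field_value: str) -> str:
    """Prefix a chunk with a short contextual field (e.g. Title) so the
    combined text is embedded, retrieved, and displayed together."""
    return f"{field_value}\n{chunk_text}"

title = "How to Fix Device X When It Shows Behaviour Y"
chunk = "Step 1: Power off the device. Step 2: Hold the reset button for 10 seconds."
enriched = prepend_field(chunk, title)
# The enriched chunk now carries the title as identifying context,
# so a bare sequence of steps is recognizable as belonging to Device X.
```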
Adjusting Chunk Size
Another way to optimize chunking is to tune the chunk size.
When you create a default search index, Data 360 uses semantic-based passage-extraction markers to chunk your content into the smallest possible pieces. Data 360 then merges the chunks back together until it reaches the chunk size you specify, or the default maximum chunk size (512 tokens).
Expect some experimentation here: the optimal chunk size and strategy vary per RAG or agent implementation.
For more information, refer to How the Max Token Setting Affects Chunking.
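The merge behavior can be approximated with a greedy pass: split the content into the smallest semantic pieces, then combine adjacent pieces until adding the next one would exceed the token budget. This is a rough sketch only; Data 360’s actual tokenizer and extraction markers differ, and words stand in for tokens here:

```python
def merge_chunks(pieces, max_tokens=512):
    """Greedily merge adjacent small pieces until adding the next piece
    would exceed max_tokens. Word count stands in for token count."""
    merged, current, current_len = [], [], 0
    for piece in pieces:
        n = len(piece.split())
        if current and current_len + n > max_tokens:
            merged.append(" ".join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += n
    if current:
        merged.append(" ".join(current))
    return merged

pieces = ["alpha beta", "gamma delta epsilon", "zeta", "eta theta"]
print(merge_chunks(pieces, max_tokens=4))
# ['alpha beta', 'gamma delta epsilon zeta', 'eta theta']
```

Note that a single piece larger than the budget still becomes its own chunk; the budget caps merging, not splitting.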
Optimizing Chunks for Retrieval
When planning the size of your chunks for retrieval, consider the information density and organizational structure of the content you’re chunking. Remember that one chunk results in one vector: all the content in the chunk is represented by that single vector. Consider how many words are needed to adequately capture the meaning of a chunk. Will 400 to 500 words work? Or can fewer words sufficiently capture a self-contained piece of information (possibly enhanced with field prepending or chunk enrichment)? These are the kinds of questions to raise in your planning.
Optimizing Chunks for Prompt Augmentation
You should also consider chunking from a prompt augmentation perspective. How many chunks does your LLM need to generate a sufficiently usable response? Is a small, individual factoid useful enough, or does your LLM require more context?
For UDMO-based search indexes, augmentation typically relies on the chunk content alone; in that case, chunks need to be larger to include extra context.
For DMO-based indexes, you have more options because you can use additional fields for augmentation. It’s even possible to augment the prompt with the original document (for example, a knowledge article) instead of a single chunk. This increases the size of your prompt, so consider the context window of your LLM in relation to the number of results you select. Keep in mind that larger prompts increase the cost of response generation: if you increase the prompt and response size, you consume more Einstein Requests.
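A back-of-the-envelope check of prompt size against the context window can be sketched like this. The numbers and the reserved-response allowance are illustrative assumptions, not measured values from any model:

```python
def fits_context(results, tokens_per_result, context_window, reserved_for_response=1024):
    """Rough check: does augmenting with N results of a given size
    leave room for the response within the context window?"""
    prompt_tokens = results * tokens_per_result
    return prompt_tokens + reserved_for_response <= context_window

# Nine 512-token chunks fit comfortably in a hypothetical 16k window...
print(fits_context(9, 512, 16_000))   # True
# ...but nine full 4,000-token documents do not.
print(fits_context(9, 4_000, 16_000)) # False
```

If you switch from chunk-level to document-level augmentation, lowering the number of retrieved results is usually the first lever to pull.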

