Data 360: Calculated Insights Optimization Recommendations

Julkaisupäivä: Dec 2, 2025

Kuvaus

A single Calculated Insight has a runtime limit of 2 hours. If the query does not complete within this window, the system terminates execution and returns the error. Complex CI queries may run slowly or consume excessive resources if they are not optimized. Common causes of inefficiency include:

1. Large Intermediate Datasets Due to Late Filtering

When WHERE filters are applied after the JOIN stage instead of being pushed down closer to the source tables, the query engine must first combine all the records from the joined tables before removing unnecessary rows.

This generates huge intermediate datasets, consuming excessive memory and CPU.
The larger the intermediate set, the slower the downstream stages (sorting, grouping, aggregations).
Best practice: Move filtering conditions into the JOIN ON clause or apply them to the source tables before the JOIN.

2. Inefficient JOIN Conditions, Especially Those with OR

Using OR in JOIN conditions forces the query engine to evaluate multiple complex match patterns for each row.

This prevents the optimizer from using certain indexes or join algorithms.
Execution plans can switch from hash joins to nested loop joins, which are significantly slower for large datasets.
Example:

ON t1.key = t2.key OR t1.alt_key = t2.key

can often be rewritten into a UNION ALL of two separate queries instead of one JOIN with OR.

3. Overuse of Wildcards or LIKE

LIKE '%...%' searches force a full table scan because the leading wildcard % prevents index usage.

The query engine must scan every row and evaluate the string pattern.
This becomes expensive on columns with high cardinality (many distinct values).
If you know the position or structure of the substring, use:
- SUBSTR() or POSITION() functions for fixed offsets.
- LIKE 'prefix%' if searching by starting characters (allows index usage).

4. Subqueries That Can Be Replaced with JOINs

Nested subqueries (especially with IN or NOT IN) can be inefficient because:

The engine may execute the subquery separately and then perform a comparison for every row in the outer query.
Optimizers sometimes rewrite them as joins, but not always effectively.
Using explicit JOIN or LEFT JOIN ... IS NULL patterns often results in fewer scans and better join planning.

5. Incorrect JOIN Order on Large Datasets

The sequence in which tables are joined matters for performance:

Joining two large tables first can generate massive intermediate datasets.
Joining a large table with a much smaller table first can drastically reduce the dataset early.
Good join order allows the optimizer to reduce the working set at each step, improving speed and lowering memory use.

6. Queries That Try to Do Too Much in a Single Execution

Combining too many transformations, joins, filters, and aggregations in one query:

Increases the size and complexity of the execution plan.
Makes debugging harder if a step fails or runs slowly.
Can cause execution timeouts or Out of Memory errors in Data 360 processing.
Best practice: Design your query in stages. Create smaller, focused chunks/CI that handle specific parts of the logic. Then, combine these CIs within a larger CI. This modular approach keeps your queries efficient, easier to maintain, and faster to run.

Ratkaisu

1. Optimize JOIN Conditions

Avoid OR in JOIN conditions. Rewrite logic to remove OR when possible.
Move filters from the WHERE clause to the JOIN ON clause to reduce intermediate data size.

Example - Less efficient:

SELECT c.customer_id, c.name, o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.shipped_date >= '2025-01-01';

More efficient:

SELECT c.customer_id, c.name, o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
  AND (o.shipped_date >= '2025-01-01');

2. Filter Early

Apply WHERE conditions as close to the source table as possible.
Filtering earlier reduces the volume of data passed between query stages.

3. Avoid Expensive String Operations

LIKE with wildcards (especially %...%) is costly.
Use SUBSTR or other exact match functions when possible.

Instead of:

WHERE name LIKE '%sample%'

Use:

WHERE SUBSTR(name, 3, 6) = 'sample'

4. Replace `NOT IN` with `LEFT JOIN` and `IS NULL`

Less efficient:

SELECT id FROM table1
WHERE id NOT IN (SELECT id FROM table2);

More efficient:

SELECT t1.id
FROM table1 t1
LEFT JOIN table2 t2 ON t1.id = t2.id
WHERE t2.id IS NULL;

5. Minimize or Avoid Subqueries

Whenever possible, replace subqueries with JOIN operations for better performance.

6. Consider JOIN Order

Join smaller datasets first to reduce the working set size.

Less efficient:

SELECT id, x, y
FROM large_table l
JOIN medium_table m ON l.id = m.id
JOIN small_table s ON m.id = s.id;

Better:

SELECT id, x, y
FROM large_table l
JOIN small_table s ON l.id = s.id
JOIN medium_table m ON l.id = m.id;

7. Break Down Large CI Queries

If optimization fails, start with a CI that contains only core logic.
Gradually add complexity to identify the performance bottleneck.
If still too large, split into parent and child CIs for modular execution.

Note: Always validate query results after optimization to ensure no unintended logic changes.

Knowledge-artikkelin numero

005132115

Ratkaisiko tämä artikkeli ongelmasi?

Anna palautetta, jotta voimme kehittyä!

Data 360: Calculated Insights Optimization Recommendations

1. Large Intermediate Datasets Due to Late Filtering

2. Inefficient JOIN Conditions, Especially Those with OR

3. Overuse of Wildcards or LIKE

4. Subqueries That Can Be Replaced with JOINs

5. Incorrect JOIN Order on Large Datasets

6. Queries That Try to Do Too Much in a Single Execution

1. Optimize JOIN Conditions

2. Filter Early

3. Avoid Expensive String Operations

4. Replace `NOT IN` with `LEFT JOIN` and `IS NULL`

5. Minimize or Avoid Subqueries

6. Consider JOIN Order

7. Break Down Large CI Queries

General Information

Required Cookies

Functional Cookies

Advertising Cookies

General Information

Required Cookies

Functional Cookies

Advertising Cookies

Cookie List

Data 360: Calculated Insights Optimization Recommendations

1. Large Intermediate Datasets Due to Late Filtering

2. Inefficient JOIN Conditions, Especially Those with OR

3. Overuse of Wildcards or LIKE

4. Subqueries That Can Be Replaced with JOINs

5. Incorrect JOIN Order on Large Datasets

6. Queries That Try to Do Too Much in a Single Execution

1. Optimize JOIN Conditions

2. Filter Early

3. Avoid Expensive String Operations

4. Replace NOT IN with LEFT JOIN and IS NULL

5. Minimize or Avoid Subqueries

6. Consider JOIN Order

7. Break Down Large CI Queries

4. Replace `NOT IN` with `LEFT JOIN` and `IS NULL`