
RTF Runtime Fabric - Slow or Stalled Application Deployments Caused by RTFD CPU Throttling and Ingress Patching Overhead

Publish Date: May 29, 2026
Description

APPLIES TO

  • Runtime Fabric (RTF) - Self-Managed / BYOK
  • Any Kubernetes distribution (AKS, EKS, GKE, on-premises)
  • Clusters managing a large number of Mule applications (recommended review threshold: 50+)
  • RTF Agent 2.9.x and above


SYMPTOMS

  • Applications started via Runtime Manager (UI or API) take significantly longer than expected to begin deploying — ranging from several minutes to over an hour — with no error displayed in Runtime Manager during the wait.
  • Deploying multiple applications simultaneously makes delays progressively worse. The more apps are queued at once, the longer each one waits.
  • In RTFD logs, a large gap is visible between the STARTED command being received and the Helm upgrade beginning for the same application:
Handling [STARTED] my-application
...                                     ← minutes of silence
preparing upgrade for my-application
  • The issue is consistent and reproducible across deploys and redeploys, and worsens under bulk deployment scenarios (disaster recovery, go-live events, mass restarts).
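The gap between the two log messages above can be measured from an exported RTFD log. The following is a minimal Python sketch, assuming each log line begins with an ISO-8601 timestamp followed by a space. The real RTFD log line format is not specified in this article, so the parsing will likely need adjusting. The function name and sample lines are hypothetical.

```python
from datetime import datetime

STARTED = "Handling [STARTED] "
UPGRADE = "preparing upgrade for "

def queue_gaps(lines):
    """Return {app: seconds between the STARTED command and the Helm upgrade}."""
    started, gaps = {}, {}
    for line in lines:
        # Assumed layout: "<ISO-8601 timestamp> <message>"
        stamp, _, message = line.strip().partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # line does not start with a parseable timestamp, skip it
        if STARTED in message:
            app = message.split(STARTED, 1)[1].strip()
            started[app] = when
        elif UPGRADE in message:
            app = message.split(UPGRADE, 1)[1].strip()
            if app in started:
                gaps[app] = (when - started.pop(app)).total_seconds()
    return gaps

# Hypothetical sample lines for illustration only
sample = [
    "2026-05-29T10:00:00 Handling [STARTED] my-application",
    "2026-05-29T10:07:30 preparing upgrade for my-application",
]
print(queue_gaps(sample))  # {'my-application': 450.0}
```

A gap of several minutes for most applications in a bulk deployment matches the symptom pattern described in this article.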

CAUSE

Two compounding bottlenecks within the RTFD (Runtime Fabric Deployment daemon) container can contribute to delays in bulk application deployments:

  1. RTFD CPU THROTTLING

RTFD processes deployment commands serially — one at a time. If the RTFD container CPU limit is set too low for the number of applications the cluster manages, the Linux kernel CFS scheduler will throttle RTFD whenever it reaches that ceiling. A single deployment that should take 5-8 seconds may take several minutes when RTFD is CPU-starved. Every subsequently queued deployment waits behind it, causing backlogs that compound with each additional concurrent deployment request.

This is the primary cause of the queue wait delay observed in RTFD logs. A CPU limit of 200m, whether left at the default or configured manually, is insufficient for clusters managing 50 or more applications under load.
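To make the throttling mechanism concrete, the sketch below converts a Kubernetes CPU limit into its CFS quota (assuming the default 100 ms CFS period) and computes the share of throttled periods from the contents of a cgroup `cpu.stat` file. The `nr_periods` and `nr_throttled` fields are standard cgroup counters. The sample numbers are invented for illustration and are not measurements from an RTF cluster.

```python
def cfs_quota_us(millicores, period_us=100_000):
    """CPU time in microseconds a container may use per CFS period."""
    return millicores * period_us // 1000

def throttle_ratio(cpu_stat_text):
    """Fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

# Illustrative cpu.stat contents (made-up values)
sample = """nr_periods 1000
nr_throttled 850
throttled_usec 61000000"""

print(cfs_quota_us(200))       # 20000 -> 20 ms of CPU per 100 ms period
print(cfs_quota_us(1000))      # 100000 -> a full core per period
print(throttle_ratio(sample))  # 0.85
```

A ratio close to 1 during deployments would suggest that RTFD is hitting its CPU ceiling in most scheduling periods.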

  2. INGRESS PATCHING OVERHEAD

When HTTPRoutes are not enabled, RTFD performs 4 sequential Kubernetes API calls per application to patch Ingress objects on every deployment operation. Each call takes up to approximately 7 seconds, adding approximately 22-28 seconds of overhead per application per deployment. This overhead applies to every managed application on every deployment and accumulates significantly as cluster size grows.

Under CPU throttling conditions, Ingress patching takes even longer due to resource contention, further compounding the overall delay.
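As a rough worked example of how this overhead accumulates, the sketch below multiplies the per-call figure by the number of calls and applications. It uses the upper-bound values quoted above and assumes the applications are processed one after another. The function and constant names are hypothetical.

```python
CALLS_PER_APP = 4      # sequential Ingress patch calls per application
SECONDS_PER_CALL = 7   # approximate upper bound per call

def ingress_overhead_seconds(app_count, calls=CALLS_PER_APP, per_call=SECONDS_PER_CALL):
    """Total Ingress patching time when apps are processed serially."""
    return app_count * calls * per_call

print(ingress_overhead_seconds(1))   # 28
print(ingress_overhead_seconds(40))  # 1120 (about 18.7 minutes)
```

Under these assumptions, a 40-application bulk deployment spends close to 19 minutes on Ingress patching alone, before any CPU throttling is taken into account.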

These two issues interact: the CPU throttle slows down every phase of each deployment, and the Ingress patching adds a fixed overhead on top, resulting in total RTFD processing times that can exceed 10 minutes per application under load.
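Because RTFD handles commands one at a time, the wait for any given application grows linearly with its position in the queue. The sketch below is a simplified model that assumes all applications are submitted at once and each takes the same processing time. The 6-second and 90-second figures are taken from the ranges reported in this article.

```python
def queue_wait_seconds(position, service_seconds):
    """Wait before RTFD starts the app at `position` (1-based) in a serial queue."""
    return (position - 1) * service_seconds

for service in (6, 90):
    last = queue_wait_seconds(40, service)
    print(f"{service}s per app: 40th app waits {last / 60:.1f} min")
# 6s per app: 40th app waits 3.9 min
# 90s per app: 40th app waits 58.5 min
```

The model ignores variation between applications, but it shows why a per-application slowdown of tens of seconds turns into waits of an hour or more during bulk deployments.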

Resolution


Steps 1 and 2 below are both recommended, but Step 1 is the simplest and fastest way to improve deployment times.
Step 1 eliminates the CPU throttle and the resulting queue backlog. Implement this change first and test whether it reduces the delays.
Step 2 eliminates the Ingress patching overhead. Applying both steps has been validated to reduce total RTFD processing time by approximately 99% under load.

STEP 1 - INCREASE RTFD CPU AND MEMORY LIMITS

Increase the resource limits on the RTFD container. The following values are validated for clusters managing 100 or more applications. Adjust proportionally for smaller clusters, but ensure the CPU limit is sufficient to prevent CFS throttling under concurrent deployment load.

containers:
- name: rtfd
  resources:
    limits:
      cpu: "1"       # minimum recommended for large clusters
      memory: "1Gi"
    requests:
      cpu: "300m"
      memory: "512Mi"

For the agent container on large clusters, also verify:

Agent CPU limit:    1500m (minimum recommended for 100+ app clusters)
Agent Memory limit: 1Gi

Apply the changes by editing or patching the deployment directly, depending on your RTF installation method. Restart the agent pod after applying.
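For installations where patching the deployment is appropriate, the resource values above can be expressed as a Kubernetes strategic merge patch, which matches containers by name and leaves other containers untouched. The sketch below only builds and prints the patch document. The helper name is hypothetical, and the deployment name and namespace are not given in this article, so they must be taken from your own cluster.

```python
import json

def rtfd_resource_patch(cpu_limit="1", memory_limit="1Gi",
                        cpu_request="300m", memory_request="512Mi"):
    """Build a strategic merge patch that only touches the rtfd container."""
    return {
        "spec": {"template": {"spec": {"containers": [{
            "name": "rtfd",  # container name as shown in the article
            "resources": {
                "limits": {"cpu": cpu_limit, "memory": memory_limit},
                "requests": {"cpu": cpu_request, "memory": memory_request},
            },
        }]}}}
    }

patch = rtfd_resource_patch()
print(json.dumps(patch, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl patch deployment <deployment-name> -n <namespace> --patch-file <file>`, substituting the names used by your installation.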

STEP 2 - ENABLE HTTPROUTES

Enable HTTPRoutes mode on the RTF cluster. When enabled, RTFD skips all Ingress object patching during deployments, eliminating the 22-28 seconds of per-app overhead entirely. The change applies immediately to all managed applications in the namespace.

Confirm HTTPRoutes is active by checking RTFD logs after the next deployment:

Ingress resources creation from service watcher is skipped when HTTPRoutes is Enabled

For prerequisites and enablement steps, refer to the Runtime Fabric network configuration documentation: docs.mulesoft.com/runtime-fabric/latest/install-self-managed-network-configuration
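The log confirmation described above can be checked programmatically against exported RTFD log text. This is a minimal sketch that only searches for the confirmation message quoted in this article. The function name and the surrounding sample lines are hypothetical.

```python
HTTPROUTES_MARKER = (
    "Ingress resources creation from service watcher is skipped "
    "when HTTPRoutes is Enabled"
)

def httproutes_active(log_text):
    """True if the RTFD log contains the HTTPRoutes confirmation message."""
    return any(HTTPROUTES_MARKER in line for line in log_text.splitlines())

# Hypothetical log excerpt for illustration
sample_log = "\n".join([
    "Handling [STARTED] my-application",
    HTTPROUTES_MARKER,
])
print(httproutes_active(sample_log))  # True
```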

STEP 3 - VALIDATE THE FIX

Run the following after the agent pod stabilizes:

rtfctl cluster status

Note: if agent-version-consistency shows as unhealthy immediately after an agent pod restart, this is a cosmetic artifact — the probe returns agentVersion: "0.0.0" until the agent fully stabilizes. Recheck after 5 minutes; it resolves on its own with no action required.

Monitor RTFD logs during a test deployment. Expected behavior after the fix:

Handling [STARTED] my-application
preparing upgrade for my-application    ← within 1-2 seconds
Ingress creation SKIPPED                ← HTTPRoutes active
Completed request [6s]

Important: to properly validate the fix, test with a bulk deployment of 30-50 applications simultaneously. A small test (5-10 apps on an idle cluster) will show similar results under both old and new settings because it does not generate enough queue depth to expose the CPU throttle bottleneck. The true benefit is visible under production load conditions.

EXPECTED RESULTS AFTER REMEDIATION

Metric                            | Before (Under-Provisioned)       | After (Optimized)
Queue wait before RTFD processing | Up to 8+ minutes per app         | 1-2 seconds
RTFD processing time per app      | 60-90+ seconds                   | 5-8 seconds
Ingress patching overhead per app | ~28 seconds (4 sequential calls) | 1-2 seconds
Pod Ready time (post-RTFD)        | Blocked by queue                 | 30-45 seconds (Mule JVM-bound)
Bulk deployment (40+ apps)        | Hours                            | ~5-15 minutes
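The percentage improvements implied by these figures can be worked out as follows. This is one way to reconcile the numbers, not necessarily how the article's approximately 99% figure was derived. The 600-second value assumes the "over 10 minutes per application" case mentioned in the cause description, and 6 seconds is the completed request time shown in the expected log output.

```python
def reduction_percent(before_seconds, after_seconds):
    """Percentage reduction from before_seconds to after_seconds."""
    return 100 * (before_seconds - after_seconds) / before_seconds

print(round(reduction_percent(600, 6), 1))  # 99.0  (10 minutes down to 6 seconds)
print(round(reduction_percent(90, 8), 1))   # 91.1  (90 seconds down to 8 seconds)
```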



Additional Resources

ADDITIONAL NOTES

  • Pod Ready times of 20-60 seconds after RTFD completes are expected and normal. This window is dominated by Mule runtime JVM startup and any monitoring sidecar initialization — not by RTF overhead. 
  • DEPLOYMENT_RATELIMITPERSECOND=1 (default) is not a bottleneck when RTFD processes each app in ~6 seconds. Only consider raising it to 2 in the app-config ConfigMap if queue depth consistently exceeds 50 apps simultaneously.
  • Third-party tools present in the RTF or application namespaces, such as intrusive network proxies or firewalls, monitoring agents, secret injectors, or security enforcement agents, can cause this same delay pattern and might need to be disabled to resolve it.


Knowledge Article Number

005385645

 