Log Analysis ML-based alerts delayed by several minutes in production

We’re experiencing significant delays with ML-based alerting in IBM Log Analysis for our production environment. Our critical alerts that should trigger within 30 seconds are now taking 4-6 minutes to fire, which is causing major issues with our incident response.

The delays seem to correlate with peak ingestion, when our pipeline is handling around 15GB/hour. We’ve also noticed that ML batch processing appears to run on longer cycles than the configured interval. Resource allocation shows we’re at 60% of our service plan capacity, so we shouldn’t be hitting limits.

Has anyone encountered similar latency with ML-powered alerts in Log Analysis? We need alerts firing within acceptable timeframes to maintain our SLAs. Any insights on tuning the ingestion pipeline or ML processing intervals would be greatly appreciated.

Thanks for the responses. Checking now - our ml_processing_queue_depth is averaging around 1800 during peak hours, so that’s definitely a bottleneck. We’re using a mix of structured and unstructured logs. About 40% of our volume is unstructured application logs that the ML needs to analyze for anomaly patterns. Would filtering out non-critical unstructured logs help, or should we look at scaling our service tier?

I’d add that your resource allocation percentage can be misleading. Log Analysis allocates ML processing resources separately from storage and ingestion. You might be at 60% overall but 95% on ML compute specifically. Check the ‘ml_cpu_utilization’ and ‘ml_memory_usage’ metrics in your service dashboard. If those are pegged near limits, you’ll need to upgrade your tier or reduce ML workload through better log hygiene.
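If you want to watch those metrics outside the dashboard, here’s a minimal polling sketch in Python. The endpoint, credential, thresholds, and response shape are assumptions for illustration only - there isn’t a documented API call I can point to for these exact metric names, so adapt it to however your instance actually exposes them.

```python
# Minimal metrics-polling sketch. The endpoint, credential, thresholds, and
# response shape are assumptions for illustration, not a documented API.
import os
import requests

METRICS_URL = os.environ["LOG_ANALYSIS_METRICS_URL"]   # wherever your instance exposes metrics (assumed)
API_KEY = os.environ["LOG_ANALYSIS_SERVICE_KEY"]       # hypothetical credential

NEAR_LIMIT = {"ml_cpu_utilization": 90.0, "ml_memory_usage": 90.0}  # percent thresholds (assumed)

resp = requests.get(
    METRICS_URL,
    headers={"apikey": API_KEY},
    params={"names": ",".join(NEAR_LIMIT)},
    timeout=30,
)
resp.raise_for_status()

for metric in resp.json().get("metrics", []):
    name, value = metric["name"], metric["value"]
    warning = "  <-- pegged, upgrade tier or cut ML workload" if value >= NEAR_LIMIT.get(name, 101) else ""
    print(f"{name}: {value:.1f}%{warning}")
```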

The queue depth of 1800 is your smoking gun. For ML-based alerting to work efficiently in Log Analysis, you want that metric below 500. Two approaches: First, implement aggressive log filtering before ingestion - use exclusion rules in your logging agents to drop debug-level and verbose logs that don’t need ML analysis. Second, consider splitting your logs across multiple service instances - critical alerts on one instance with structured logs only, and general monitoring on another. This isolates your alert pipeline from the processing overhead of unstructured logs.
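The exclusion rules themselves live in your agent configuration, but to make the intent concrete, here’s a rough Python sketch of the same two ideas - dropping noisy levels before ingestion and routing critical structured sources to a dedicated instance. The record shape, source names, and instance labels are made up for the example.

```python
# Illustrative pre-ingestion filter/router; record shape, source names, and
# instance labels are assumptions for the sketch, not agent config syntax.
from typing import Dict, Iterable, Iterator

DROP_LEVELS = {"DEBUG", "TRACE", "VERBOSE"}   # noise the ML doesn't need to see
CRITICAL_SOURCES = {"payments", "auth"}       # example critical apps (assumed)

def filter_records(records: Iterable[Dict]) -> Iterator[Dict]:
    """Drop verbose/debug records so they never reach the ML pipeline."""
    for record in records:
        if str(record.get("level", "")).upper() in DROP_LEVELS:
            continue
        yield record

def route(record: Dict) -> str:
    """Keep critical, structured sources on a dedicated alerting instance."""
    if record.get("source") in CRITICAL_SOURCES and record.get("structured", False):
        return "alerts-instance"       # structured logs only, fast ML alerting
    return "monitoring-instance"       # everything else, general monitoring

sample = [
    {"level": "DEBUG", "source": "payments", "structured": True, "msg": "cache hit"},
    {"level": "ERROR", "source": "payments", "structured": True, "msg": "charge failed"},
    {"level": "INFO", "source": "batch-jobs", "structured": False, "msg": "run 42 done"},
]
for rec in filter_records(sample):
    print(route(rec), "<-", rec["msg"])
```

The same logic expressed as agent exclusion rules is preferable in practice, since it keeps the filtering at the edge instead of adding another hop.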

One more consideration - batch interval configuration. The default ML batch interval is 60 seconds, but if your queue is backing up, the system automatically increases intervals to catch up, which creates a vicious cycle. You can’t directly tune this, but you can influence it through the approaches mentioned above.
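To see why it becomes a vicious cycle, here’s a back-of-the-envelope simulation. The arrival rate, batch capacity, and interval-stretching rule are invented numbers, not Log Analysis internals - the point is only that once a cycle can’t drain what arrived during it, the interval (and with it your alert latency) keeps growing.

```python
# Toy model of the feedback loop; all numbers and the stretching rule are assumed.
arrival_rate = 600        # events per second arriving (assumed)
batch_capacity = 30_000   # events one ML cycle can clear (assumed)
interval = 60             # starting batch interval in seconds
queue = 0

for cycle in range(8):
    queue += arrival_rate * interval          # logs that arrived during this cycle
    queue = max(0, queue - batch_capacity)    # what the ML cycle managed to process
    interval = 60 + queue // arrival_rate     # backlog stretches the next interval
    print(f"cycle {cycle}: next interval {interval}s, backlog {queue} events")
```

Reducing arrivals (filtering) or adding capacity (higher tier, split instances) is the only way to break the loop, which is why the suggestions above are the right levers.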

Are you using structured logs or unstructured? We had similar issues until we switched to structured JSON logging. The ML models process structured data much faster because they don’t need to parse free-text patterns. Also, your 15GB/hour might be hitting internal rate limits even at 60% capacity - those limits aren’t always linear with plan allocation.
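If you do move to structured logging, the change on the application side is small. Here’s a minimal example with Python’s standard logging module; the field names are just what we use, so adjust them to your own schema.

```python
import json
import logging

# Attributes present on every LogRecord, used to detect custom `extra` fields.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message", "asctime", "taskName",
}

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so fields stay machine-parseable."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry through anything passed via logger.info(..., extra={...}).
        payload.update({k: v for k, v in record.__dict__.items() if k not in _STANDARD_ATTRS})
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("charge completed", extra={"order_id": "A-1042", "latency_ms": 87})
```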