Log Analysis ML-based alerts delayed by several minutes in production

We’re experiencing significant delays with ML-based alerting in IBM Log Analysis for our production environment. Our critical alerts that should trigger within 30 seconds are now taking 4-6 minutes to fire, which is causing major issues with our incident response.

The delays seem to correlate with peak ingestion, when our pipeline is handling around 15GB/hour. We’ve also noticed that ML batch processing appears to run on longer cycles than the configured interval. Resource allocation shows we’re at 60% of our service plan capacity, so we shouldn’t be hitting limits.

Has anyone encountered similar latency with ML-powered alerts in Log Analysis? We need alerts firing within acceptable timeframes to maintain our SLAs. Any insights on tuning the ingestion pipeline or ML processing intervals would be greatly appreciated.

Thanks for the responses. Checking now - our ml_processing_queue_depth is averaging around 1800 during peak hours, so that’s definitely a bottleneck. We’re using a mix of structured and unstructured logs. About 40% of our volume is unstructured application logs that the ML needs to analyze for anomaly patterns. Would filtering out non-critical unstructured logs help, or should we look at scaling our service tier?

I’d add that your resource allocation percentage can be misleading. Log Analysis allocates ML processing resources separately from storage and ingestion. You might be at 60% overall but 95% on ML compute specifically. Check the ‘ml_cpu_utilization’ and ‘ml_memory_usage’ metrics in your service dashboard. If those are pegged near limits, you’ll need to upgrade your tier or reduce ML workload through better log hygiene.
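If you want to watch those metrics outside the dashboard, here’s a minimal polling sketch in Python. The endpoint, credential, thresholds, and response shape are assumptions for illustration only - there isn’t a documented API call I can point to for these exact metric names, so adapt it to however your instance actually exposes them.

```python
# Minimal metrics-polling sketch. The endpoint, credential, thresholds, and
# response shape are assumptions for illustration, not a documented API.
import os
import requests

METRICS_URL = os.environ["LOG_ANALYSIS_METRICS_URL"]   # wherever your instance exposes metrics (assumed)
API_KEY = os.environ["LOG_ANALYSIS_SERVICE_KEY"]       # hypothetical credential

NEAR_LIMIT = {"ml_cpu_utilization": 90.0, "ml_memory_usage": 90.0}  # percent thresholds (assumed)

resp = requests.get(
    METRICS_URL,
    headers={"apikey": API_KEY},
    params={"names": ",".join(NEAR_LIMIT)},
    timeout=30,
)
resp.raise_for_status()

for metric in resp.json().get("metrics", []):
    name, value = metric["name"], metric["value"]
    warning = "  <-- pegged, upgrade tier or cut ML workload" if value >= NEAR_LIMIT.get(name, 101) else ""
    print(f"{name}: {value:.1f}%{warning}")
```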

The queue depth of 1800 is your smoking gun. For ML-based alerting to work efficiently in Log Analysis, you want that metric below 500. Two approaches: First, implement aggressive log filtering before ingestion - use exclusion rules in your logging agents to drop debug-level and verbose logs that don’t need ML analysis. Second, consider splitting your logs across multiple service instances - critical alerts on one instance with structured logs only, and general monitoring on another. This isolates your alert pipeline from the processing overhead of unstructured logs.
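The exclusion rules themselves live in your agent configuration, but to make the intent concrete, here’s a rough Python sketch of the same two ideas - dropping noisy levels before ingestion and routing critical structured sources to a dedicated instance. The record shape, source names, and instance labels are made up for the example.

```python
# Illustrative pre-ingestion filter/router; record shape, source names, and
# instance labels are assumptions for the sketch, not agent config syntax.
from typing import Dict, Iterable, Iterator

DROP_LEVELS = {"DEBUG", "TRACE", "VERBOSE"}   # noise the ML doesn't need to see
CRITICAL_SOURCES = {"payments", "auth"}       # example critical apps (assumed)

def filter_records(records: Iterable[Dict]) -> Iterator[Dict]:
    """Drop verbose/debug records so they never reach the ML pipeline."""
    for record in records:
        if str(record.get("level", "")).upper() in DROP_LEVELS:
            continue
        yield record

def route(record: Dict) -> str:
    """Keep critical, structured sources on a dedicated alerting instance."""
    if record.get("source") in CRITICAL_SOURCES and record.get("structured", False):
        return "alerts-instance"       # structured logs only, fast ML alerting
    return "monitoring-instance"       # everything else, general monitoring

sample = [
    {"level": "DEBUG", "source": "payments", "structured": True, "msg": "cache hit"},
    {"level": "ERROR", "source": "payments", "structured": True, "msg": "charge failed"},
    {"level": "INFO", "source": "batch-jobs", "structured": False, "msg": "run 42 done"},
]
for rec in filter_records(sample):
    print(route(rec), "<-", rec["msg"])
```

The same logic expressed as agent exclusion rules is preferable in practice, since it keeps the filtering at the edge instead of adding another hop.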

One more consideration - batch interval configuration. The default ML batch interval is 60 seconds, but if your queue is backing up, the system automatically increases intervals to catch up, which creates a vicious cycle. You can’t directly tune this, but you can influence it through the approaches mentioned above.
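To see why it becomes a vicious cycle, here’s a back-of-the-envelope simulation. The arrival rate, batch capacity, and interval-stretching rule are invented numbers, not Log Analysis internals - the point is only that once a cycle can’t drain what arrived during it, the interval (and with it your alert latency) keeps growing.

```python
# Toy model of the feedback loop; all numbers and the stretching rule are assumed.
arrival_rate = 600        # events per second arriving (assumed)
batch_capacity = 30_000   # events one ML cycle can clear (assumed)
interval = 60             # starting batch interval in seconds
queue = 0

for cycle in range(8):
    queue += arrival_rate * interval          # logs that arrived during this cycle
    queue = max(0, queue - batch_capacity)    # what the ML cycle managed to process
    interval = 60 + queue // arrival_rate     # backlog stretches the next interval
    print(f"cycle {cycle}: next interval {interval}s, backlog {queue} events")
```

Reducing arrivals (filtering) or adding capacity (higher tier, split instances) is the only way to break the loop, which is why the suggestions above are the right levers.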

Are you using structured logs or unstructured? We had similar issues until we switched to structured JSON logging. The ML models process structured data much faster because they don’t need to parse free-text patterns. Also, your 15GB/hour might be hitting internal rate limits even at 60% capacity - those limits aren’t always linear with plan allocation.
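If you do move to structured logging, the change on the application side is small. Here’s a minimal example with Python’s standard logging module; the field names are just what we use, so adjust them to your own schema.

```python
import json
import logging

# Attributes present on every LogRecord, used to detect custom `extra` fields.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message", "asctime", "taskName",
}

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so fields stay machine-parseable."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry through anything passed via logger.info(..., extra={...}).
        payload.update({k: v for k, v in record.__dict__.items() if k not in _STANDARD_ATTRS})
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("charge completed", extra={"order_id": "A-1042", "latency_ms": 87})
```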