Our ERP analytics dashboards are showing stale data during peak business hours (9 AM - 12 PM), and we’ve traced it back to IBM Event Streams consumer lag. The dashboards consume transaction events from Event Streams topics to display real-time sales metrics, but during peak hours the lag grows from its normal 2-3 seconds to over 5 minutes.
We have a single consumer group with 3 consumers reading from a topic with 6 partitions. Event production rate during peak is around 5,000 events/minute. The lag is visible in the Event Streams dashboard, but we’re not sure whether the issue is on the consumer side (our processing is too slow) or on the Event Streams side (we need more partitions or more throughput). Dashboard data freshness is critical for our sales team to make real-time decisions. Has anyone dealt with Event Streams consumer lag affecting analytics? Should we scale partitions or optimize our consumer logic?
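In case it helps, this is roughly how we spot-check per-partition lag outside the dashboard - a minimal sketch using the Kafka Java AdminClient, where the bootstrap address, the group ID ("dashboard-consumers"), and the omitted SASL settings are placeholders for our real Event Streams credentials:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; SASL/TLS settings for Event Streams omitted here
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "broker-0.example.eventstreams.cloud.ibm.com:9093");

        try (AdminClient admin = AdminClient.create(props)) {
            // Last committed offset per partition for the dashboard consumer group
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("dashboard-consumers")
                     .partitionsToOffsetAndMetadata().get();

            // Latest (end) offset for the same partitions
            Map<TopicPartition, OffsetSpec> latest = new HashMap<>();
            committed.keySet().forEach(tp -> latest.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                admin.listOffsets(latest).all().get();

            // Lag = end offset minus committed offset, per partition
            committed.forEach((tp, meta) ->
                System.out.printf("%s lag=%d%n", tp, ends.get(tp).offset() - meta.offset()));
        }
    }
}
```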
Yes, when you add consumers to the consumer group, Event Streams will automatically trigger a rebalance and reassign partitions. With 6 consumers and 6 partitions you get a 1:1 assignment, which is optimal for parallel processing. Just be aware that consumption pauses briefly during the rebalance (it usually takes a few seconds). Also keep an eye on max.poll.interval.ms - it has to stay high enough to cover the time you spend processing one batch between polls, but now that your per-event processing is faster you can probably lower it from the default 300000 ms to something like 60000 ms so stuck consumers are detected sooner.
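For reference, the relevant consumer settings would look roughly like this with the Java client - the class name, group ID, and bootstrap address are placeholders, and the values are starting points to tune against your own worst-case batch processing time:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DashboardConsumerFactory {
    public static KafkaConsumer<String, String> create() {
        Properties props = new Properties();
        // Placeholder bootstrap address; SASL/TLS settings for Event Streams omitted here
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "broker-0.example.eventstreams.cloud.ibm.com:9093");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dashboard-consumers"); // all 6 instances share this group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);      // larger batches per poll
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 60000); // 1000 records * ~3 ms ≈ 3 s of work, so 60 s leaves headroom
        return new KafkaConsumer<>(props);
    }
}
```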
One more thing to check - monitor your Event Streams broker metrics during peak hours. If the brokers are hitting CPU or network limits, that can also contribute to lag even if your consumers are optimized, so check the Event Streams dashboard for broker throughput and resource utilization. If the brokers really are saturated, you might need to move to a higher-throughput Event Streams plan. But at 5,000 events/minute you shouldn’t be anywhere near broker limits unless your events are very large.
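As a rough sanity check: even assuming a fairly large 2 KB per event (my assumption - you didn’t mention payload size), 5,000 events/minute is only about 10 MB/minute, or roughly 170 KB/s of produce traffic, which is a tiny fraction of what a single broker can sustain.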
Great suggestions! We implemented batched Redis writes (flushing every 100 events instead of writing per event), which brought processing time down to about 2-3 ms per event, and we increased max.poll.records to 1000. The lag improved, but we’re still seeing 2-3 minute delays at absolute peak. I think we need to add more consumers. If we go from 3 to 6 consumers (one per partition), will the group automatically rebalance and distribute the load evenly?
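For context, here’s roughly what the batching change looks like - a simplified sketch assuming the Java Kafka client and Jedis, with a made-up Redis key layout and class name:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class DashboardWorker {
    private static final int BATCH_SIZE = 100;

    public static void run(KafkaConsumer<String, String> consumer, Jedis jedis) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            Pipeline pipe = jedis.pipelined();
            int buffered = 0;
            for (ConsumerRecord<String, String> rec : records) {
                // "sales:metrics" hash is a simplified stand-in for our real key layout
                pipe.hset("sales:metrics", rec.key(), rec.value());
                if (++buffered == BATCH_SIZE) {
                    pipe.sync();              // one Redis round trip per 100 events instead of one per event
                    pipe = jedis.pipelined();
                    buffered = 0;
                }
            }
            pipe.sync();                      // flush the tail of the poll batch
            consumer.commitSync();            // commit offsets only after Redis writes are flushed
        }
    }
}
```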