We’re experiencing a critical issue with our automated procurement approval workflow in MASC 2022. Purchase orders are getting stuck at the vendor compliance validation step, causing significant delays and SLA breaches.
The workflow engine logs show repeated retry attempts to call the vendor compliance microservice, but it appears to be timing out consistently. What’s puzzling is that there’s no error displayed in the UI - the PO just sits in “Pending Compliance Review” status indefinitely. Only when we dig into the backend logs do we see the timeout errors.
We’ve checked the microservice health endpoints and they respond normally. Network connectivity seems fine. The issue started happening intermittently about two weeks ago but has now become consistent for all POs over $50K that require compliance checks.
Has anyone encountered similar timeout issues with workflow automation calling external microservices? Any suggestions on where to look for the root cause?
Another consideration: are you caching any of the vendor compliance data? Many sanctions screening results can be cached for a reasonable period since they don’t change frequently. If you’re hitting the external API for every PO, you’re creating unnecessary latency and dependency on external service availability.
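To make the idea concrete, here is a minimal sketch of an in-process TTL cache; `screen_vendor` is a placeholder for the real sanctions screening call, and the TTL is whatever period your compliance team signs off on:

```python
import time

# Illustrative TTL cache for sanctions screening results.
# 24 hours is one reasonable choice; adjust to your compliance policy.
CACHE_TTL_SECONDS = 24 * 60 * 60
_screening_cache = {}  # vendor_id -> (result, fetched_at)

def screen_vendor(vendor_id):
    """Placeholder for the real call to the external sanctions screening API."""
    raise NotImplementedError("replace with the actual API client call")

def get_compliance_result(vendor_id):
    cached = _screening_cache.get(vendor_id)
    if cached is not None:
        result, fetched_at = cached
        if time.time() - fetched_at < CACHE_TTL_SECONDS:
            return result  # serve from cache, skip the external call
    result = screen_vendor(vendor_id)
    _screening_cache[vendor_id] = (result, time.time())
    return result
```

Even a simple cache like this means a vendor screened once in the morning doesn't trigger another slow external call for every subsequent PO that day.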
Good points. I checked the workflow timeout configuration and it’s set to 30 seconds. The compliance microservice logs show that some requests are taking 35-45 seconds to complete, which explains the timeouts. The microservice is making calls to an external sanctions screening API that has become slower recently. Should we just increase the timeout value or is there a better approach?
Based on the discussion, here’s a comprehensive solution addressing all three focus areas:
Vendor Compliance Microservice Timeout:
The root cause is the external sanctions screening API taking 35-45 seconds, exceeding your 30-second workflow timeout. As an immediate fix, increase the workflow timeout to 60 seconds in your workflow engine configuration. The proper long-term solution, however, is asynchronous processing: modify the compliance validation step to use a callback pattern where the workflow initiates the check, moves to a waiting state, and resumes when the microservice sends a completion event.
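A rough sketch of that callback shape, using illustrative function and state names rather than MASC's actual workflow API:

```python
import uuid

# Hypothetical async hand-off: the workflow fires the check, parks the PO in a
# waiting state, and a callback resumes it when the microservice reports back.
pending_checks = {}  # correlation_id -> po_id

def start_compliance_check(po_id, compliance_client, workflow):
    correlation_id = str(uuid.uuid4())
    pending_checks[correlation_id] = po_id
    # Fire-and-forget request; the microservice calls back when the check is done.
    compliance_client.submit_check(po_id, callback_id=correlation_id)
    workflow.set_state(po_id, "WAITING_FOR_COMPLIANCE")

def on_compliance_complete(correlation_id, outcome, workflow):
    po_id = pending_checks.pop(correlation_id, None)
    if po_id is None:
        return  # unknown or already-handled callback
    if outcome == "clear":
        workflow.resume(po_id, next_step="approval")
    else:
        workflow.set_state(po_id, "COMPLIANCE_REJECTED")
```

With this pattern the workflow never blocks on the slow external call, so the 35-45 second screening time stops mattering to the engine's timeout at all.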
Workflow Engine Logs Show Repeated Retries:
The retry storm is likely due to aggressive retry configuration without proper backoff. Update your workflow engine retry policy (a sketch combining these points follows below):
- Set maximum retry attempts to 3 (not unlimited)
- Implement exponential backoff (2s, 4s, 8s between retries)
- Configure circuit breaker pattern to stop calling the microservice after consecutive failures
- Monitor your workflow engine thread pool to ensure retries aren’t exhausting available threads
Add a dead letter queue for POs that fail all retries so they can be processed manually without blocking the workflow.
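The points above amount to a fairly small piece of logic. Here is a minimal Python sketch, where `check_compliance` is a placeholder for the real microservice call; in most workflow engines these are configuration settings rather than hand-written code, but the behavior is the same:

```python
import time

MAX_RETRIES = 3
BACKOFF_SECONDS = [2, 4, 8]      # exponential backoff between attempts
FAILURE_THRESHOLD = 5            # consecutive failures before the breaker opens

consecutive_failures = 0
dead_letter_queue = []           # POs that exhausted retries, for manual review

def call_with_retries(po, check_compliance):
    """check_compliance stands in for the real compliance microservice call."""
    global consecutive_failures
    if consecutive_failures >= FAILURE_THRESHOLD:
        # Circuit is open: fail fast instead of piling onto a struggling service.
        dead_letter_queue.append(po)
        return None
    for attempt in range(MAX_RETRIES):
        try:
            result = check_compliance(po)
            consecutive_failures = 0
            return result
        except TimeoutError:
            consecutive_failures += 1
            if attempt < MAX_RETRIES - 1:
                time.sleep(BACKOFF_SECONDS[attempt])
    dead_letter_queue.append(po)  # all retries failed; park for manual handling
    return None
```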
No UI Error, Only Backend Logs:
The UI shows “Pending Compliance Review” because the workflow engine hasn’t surfaced the timeout error to the application layer. Implement proper error handling in your workflow (sketch after the list):
- Configure the compliance validation step to catch timeout exceptions
- Update the PO status to “Compliance Check Failed” with a user-friendly message
- Send notifications to procurement team when timeouts occur
- Add a workflow dashboard that shows stuck approvals with their actual error status
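A minimal sketch of that handling, assuming hypothetical `update_po_status` and `notify_procurement` helpers that stand in for whatever your application layer actually exposes:

```python
# Illustrative only: surface the timeout instead of leaving the PO "Pending".
def run_compliance_step(po, check_compliance, update_po_status, notify_procurement):
    try:
        return check_compliance(po)
    except TimeoutError as exc:
        # Make the failure visible in the UI rather than only in backend logs.
        update_po_status(
            po["id"],
            "Compliance Check Failed",
            message="Compliance service timed out; procurement has been notified.",
        )
        notify_procurement(po["id"], reason=str(exc))
        return None
```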
Also implement the caching strategy mentioned earlier - cache vendor compliance results for 24 hours to reduce external API dependency. Most vendors’ compliance status doesn’t change daily, so this significantly reduces the load on the external screening service.
For monitoring, set up alerts when compliance check duration exceeds 20 seconds or when timeout rate goes above 5%. This gives you early warning before it becomes a widespread issue.
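If it helps to see those thresholds spelled out, here is a small illustrative check over recent metrics; in practice you would express the same rules in your existing monitoring stack rather than in application code:

```python
SLOW_CHECK_SECONDS = 20        # alert if a single check runs longer than this
TIMEOUT_RATE_THRESHOLD = 0.05  # alert if more than 5% of recent checks time out

def evaluate_alerts(recent_checks):
    """recent_checks: list of dicts like {"duration": 12.3, "timed_out": False}."""
    alerts = []
    for check in recent_checks:
        if check["duration"] > SLOW_CHECK_SECONDS:
            alerts.append(f"Slow compliance check: {check['duration']:.1f}s")
    if recent_checks:
        timeout_rate = sum(c["timed_out"] for c in recent_checks) / len(recent_checks)
        if timeout_rate > TIMEOUT_RATE_THRESHOLD:
            alerts.append(f"Timeout rate {timeout_rate:.0%} exceeds 5% threshold")
    return alerts
```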
The fact that it’s only affecting POs over $50K is a key clue. That threshold likely triggers additional compliance validation rules that involve more complex queries or external system calls. Check if the vendor compliance microservice has different processing paths based on PO amount. The higher-value POs might be hitting database queries that aren’t optimized or are causing lock contention. Also verify that the microservice connection pool is sized appropriately for peak loads.
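If you want to confirm that theory, a small timing wrapper like the hypothetical sketch below can show whether the over-$50K path is the one taking 35-45 seconds; the two path functions are placeholders for whatever the microservice actually does:

```python
import time

HIGH_VALUE_THRESHOLD = 50_000

def timed_compliance_check(po, standard_check, enhanced_check, log):
    """standard_check / enhanced_check stand in for the two processing paths."""
    path = "enhanced" if po["amount"] > HIGH_VALUE_THRESHOLD else "standard"
    start = time.monotonic()
    result = enhanced_check(po) if path == "enhanced" else standard_check(po)
    elapsed = time.monotonic() - start
    log(f"compliance path={path} amount={po['amount']} took={elapsed:.1f}s")
    return result
```

Comparing the logged durations per path should quickly confirm whether the high-value branch is the bottleneck before you start tuning queries or connection pools.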