We’re seeing very slow transfer speeds when our Azure ML pipeline pulls training data from our on-premises ERP system to Azure Blob Storage over ExpressRoute. The circuit is 1 Gbps, but the ML data transfers only reach 80-120 Mbps, which causes significant delays in pipeline execution. Normal application traffic over the same circuit is fine, hitting 600-700 Mbps during business hours.

The slowdown specifically affects our nightly ML data sync job, which moves 200-300 GB of transactional data from the ERP database to Azure storage for model training. The job now takes 6-8 hours instead of the expected 2-3, so it runs into business hours and impacts production ERP performance.

We’ve checked the ExpressRoute circuit metrics in Azure Monitor, and overall bandwidth utilization shows we’re not maxing out the circuit, so I’m wondering whether a QoS or routing configuration issue is throttling the ML transfers specifically. What could cause this degradation for large data transfers while other traffic is unaffected?
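In case it’s useful, the same circuit counters can also be pulled at per-minute granularity with the azure-monitor-query Python package; a minimal sketch of what we’re looking at (the resource ID is a placeholder, and we check maximums alongside averages so short bursts aren’t hidden by the averaging):

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

# Placeholder resource ID for the ExpressRoute circuit.
CIRCUIT_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Network/expressRouteCircuits/<circuit-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())

# Per-minute ingress/egress throughput for roughly the last sync window.
response = client.query_resource(
    CIRCUIT_ID,
    metric_names=["BitsInPerSecond", "BitsOutPerSecond"],
    timespan=timedelta(hours=12),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.AVERAGE, MetricAggregationType.MAXIMUM],
)

for metric in response.metrics:
    for point in metric.timeseries[0].data:
        print(metric.name, point.timestamp, point.average, point.maximum)
```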
The symptom you’re describing - normal application traffic performing well while large data transfers are slow - suggests a few possible issues. First, check your ExpressRoute SKU and peering configuration. Are you using Microsoft Peering or Private Peering? For Azure Storage access, you should be using Microsoft Peering with route filters. Second, what’s the MTU size configured on your on-premises network equipment? ExpressRoute supports up to a 1500-byte MTU, but if your equipment is set lower or fragmentation is happening, that can significantly impact large data transfers. Third, how many TCP connections is your ML data sync job using? Single-threaded transfers won’t fully utilize the available bandwidth due to TCP window scaling limitations over high-latency connections.
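To put rough numbers on that last point: a single TCP stream can only move about one window’s worth of data per round trip, so per-stream throughput is capped at roughly window size divided by RTT. A quick back-of-the-envelope sketch (the window and RTT values are illustrative assumptions, not measurements from your circuit):

```python
# Per-stream TCP ceiling: throughput <= effective_window / round_trip_time.
# Illustrative assumptions only -- substitute your measured RTT and the
# socket buffer / window size actually negotiated by the transfer host.
window_bytes = 256 * 1024   # assume a 256 KiB effective window
rtt_seconds = 0.015         # assume ~15 ms round trip over the circuit
throughput_mbps = window_bytes * 8 / rtt_seconds / 1_000_000
print(f"~{throughput_mbps:.0f} Mbps per TCP stream")  # prints "~140 Mbps"
```

With numbers in that ballpark, a single stream lands right around the throughput you’re seeing even though the circuit itself has plenty of headroom.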
Private Peering is correct for accessing Azure services via private IPs. The latency is good. Single-threaded transfers over ExpressRoute often hit exactly the performance wall you’re seeing - around 100-150 Mbps even on a 1 Gbps circuit. This is due to TCP congestion control algorithms and bandwidth-delay product limitations. The Azure SDK by default uses single-threaded uploads for blob storage unless you explicitly enable concurrent connections. You need to configure the SDK to use parallel transfers with multiple TCP streams. For 200-300 GB transfers, aim for 8-16 concurrent connections to better utilize the available bandwidth.
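To make that concrete, here’s roughly what a parallel block upload looks like with the Python SDK (azure-storage-blob v12). The account URL, container, blob, and file names are placeholders, and the block size and concurrency values are starting points to tune rather than recommendations:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Placeholder account/container/blob names -- substitute your own.
service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
    max_block_size=8 * 1024 * 1024,       # upload in 8 MiB blocks
    max_single_put_size=8 * 1024 * 1024,  # force chunked uploads above 8 MiB
)
blob = service.get_blob_client(container="ml-training-data",
                               blob="erp/nightly/transactions.parquet")

with open("transactions.parquet", "rb") as data:
    blob.upload_blob(
        data,
        overwrite=True,
        max_concurrency=8,  # parallel block uploads, i.e. parallel TCP streams
    )
```

Note that max_concurrency parallelizes the blocks of a single large blob; if the sync job writes many smaller files instead, you’d also want to upload several blobs at once (a thread pool or AzCopy both work) so the aggregate stream count stays in that 8-16 range.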
That makes sense about the TCP limitations. How do I configure the Azure SDK for parallel transfers? And beyond that, are there any QoS settings I should configure on the ExpressRoute connection to prioritize this data transfer traffic during the nightly sync window? We have some flexibility to mark traffic with DSCP values on our on-premises side if that would help.