OSS multipart upload fails for large files with timeout errors and incomplete uploads

We’re experiencing consistent failures when uploading large backup files (over 5GB) to OSS using the Java SDK multipart upload feature. The uploads fail after 30-40% completion with connection reset errors, and we’re losing critical backup data.

Our backup automation uses OSS SDK 3.10.2 with multipart upload configured for 100MB part size. The timeout errors occur randomly during large file transfers:

// OSS Java SDK 3.10.2, high-level multipart upload via uploadFile()
OSSClient client = new OSSClient(endpoint, accessKeyId, accessKeySecret);
UploadFileRequest request = new UploadFileRequest(bucketName, objectKey);
request.setUploadFile(localFile);        // local backup file, >5GB
request.setPartSize(100 * 1024 * 1024);  // 100MB parts
client.uploadFile(request);              // fails at 30-40% with connection resets

The connection resets happen unpredictably, and we can’t resume from where it failed. We’ve tried adjusting SDK configuration and network settings, but large file backup remains unreliable. Has anyone successfully implemented resumable upload for files this size? What SDK configuration works best for stable multipart uploads over slower connections?

The SDK handles resume automatically when you provide a checkpoint directory. Just make sure the path is writable and persistent. Another thing - check your client-side network settings. We had issues where our firewall was dropping idle connections after 60 seconds. You might need to configure TCP keepalive or adjust your infrastructure timeout settings. For production backup systems, we also added retry logic with exponential backoff at the application level, separate from SDK retries.

Here’s a comprehensive solution addressing all aspects of reliable multipart upload for large files:

1. Enable Resumable Upload with Checkpoint

The key to handling connection resets is enabling checkpoint-based resumable upload:

UploadFileRequest request = new UploadFileRequest(bucketName, objectKey);
request.setUploadFile(localFile);
request.setPartSize(50 * 1024 * 1024); // 50MB parts
request.setEnableCheckpoint(true);
// The checkpoint directory must exist, be writable, and persist across restarts
request.setCheckpointFile("/var/backup/checkpoints/" + objectKey + ".checkpoint");
client.uploadFile(request); // declares throws Throwable; re-running resumes from the checkpoint

The SDK automatically saves progress to the checkpoint file. If upload fails, calling uploadFile() again with the same checkpoint path resumes from the last successful part - no manual logic needed.

2. Optimize SDK Configuration for Large Files

Increase timeout values to match your network conditions:

ClientConfiguration config = new ClientConfiguration();
config.setConnectionTimeout(300000); // 5 minutes
config.setSocketTimeout(300000); // 5 minutes
config.setMaxErrorRetry(5);
OSSClient client = new OSSClient(endpoint, accessKeyId, accessKeySecret, config);

3. Adjust Part Size Based on Network Stability

For unstable connections, smaller parts (20-50MB) reduce retry overhead. For stable high-bandwidth links, larger parts (100-500MB) improve throughput. Your 100MB parts are reasonable for most scenarios, but try 50MB if you’re still seeing frequent resets.
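
If it helps, here's a minimal sketch of picking the part size up front; the link-quality flag and the thresholds are assumptions for illustration, not SDK features:

boolean unstableLink = true; // e.g. frequent connection resets observed
long partSize = unstableLink
        ? 50L * 1024 * 1024    // smaller parts are cheaper to retry
        : 200L * 1024 * 1024;  // larger parts improve throughput on stable links
request.setPartSize(partSize);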

4. Implement Application-Level Retry Logic

int maxRetries = 3;
for (int attempt = 0; attempt < maxRetries; attempt++) {
    try {
        client.uploadFile(request); // resumes from the checkpoint file on each retry
        break; // success
    } catch (Throwable t) { // uploadFile() declares throws Throwable
        if (attempt == maxRetries - 1) {
            throw new RuntimeException("Upload failed after " + maxRetries + " attempts", t);
        }
        try {
            Thread.sleep((1L << attempt) * 1000L); // exponential backoff: 1s, 2s, 4s
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Retry loop interrupted", ie);
        }
    }
}

5. Monitor and Verify Uploads

After successful upload, verify the object:

ObjectMetadata metadata = client.getObjectMetadata(bucketName, objectKey);
long uploadedSize = metadata.getContentLength();
if (uploadedSize != new java.io.File(localFile).length()) {
    // Size mismatch: treat the backup as failed and re-run the upload
}

Network Considerations:

  • Check for intermediate proxies or firewalls with connection timeout policies
  • Enable TCP keepalive if your network drops idle connections
  • Consider using OSS transfer acceleration for cross-region uploads (see the sketch after this list)
  • Test with different part sizes to find optimal balance for your bandwidth
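
On the acceleration point, a minimal sketch, assuming transfer acceleration has already been enabled on the bucket and that the global acceleration endpoint applies to your region:

// Point the client at the OSS transfer acceleration endpoint
// (bucket-level acceleration must already be enabled in the console)
String accelerateEndpoint = "https://oss-accelerate.aliyuncs.com";
OSSClient accelClient = new OSSClient(accelerateEndpoint, accessKeyId, accessKeySecret, config);
accelClient.uploadFile(request); // same request/checkpoint setup as above; declares throws Throwable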

Bucket Policy: Ensure your RAM user has PutObject permission and no bucket policies restrict large uploads. Some policies set max object size limits.

With these configurations, we successfully upload 50GB+ backup files with automatic resume on any network interruption. The checkpoint mechanism is robust - even if your application crashes, restarting with the same checkpoint file continues from the last completed part. For production backup systems, this approach has given us 99.9% success rates even over unreliable WAN connections.

One more consideration - monitor your upload bandwidth and set realistic timeout values based on actual transfer speeds. If you’re uploading 100MB parts over a 10Mbps connection, that’s 80+ seconds per part minimum. Default timeouts are often too short for this math. Calculate expected time per part and add 50% buffer for your timeout configuration.
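
To make that concrete, here's the same arithmetic as a sketch; the 10Mbps figure is just the example above, so substitute your measured upload speed:

long partSizeBytes = 100L * 1024 * 1024;                 // 100MB part
double uploadMbps = 10.0;                                // assumed/measured upload bandwidth
double bytesPerSecond = uploadMbps * 1_000_000 / 8;      // ~1.25 MB/s
double secondsPerPart = partSizeBytes / bytesPerSecond;  // ~84 seconds per part
int timeoutMs = (int) (secondsPerPart * 1.5 * 1000);     // +50% buffer, ~126 seconds
config.setSocketTimeout(timeoutMs);
config.setConnectionTimeout(timeoutMs);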

I see you’re using uploadFile() which is good, but your configuration is incomplete for large file scenarios. Let me share what we use for 10GB+ backup files with high success rates.