BigQuery load job fails when importing large CSV from Cloud Storage with 'Too many errors' message

I’m trying to load a 12GB CSV file from Cloud Storage into BigQuery, and the load job consistently fails with a ‘Too many errors encountered’ message. The CSV has about 50 million rows and 35 columns.

The load job configuration:

bq load --source_format=CSV \
  --skip_leading_rows=1 \
  --max_bad_records=100 \
  dataset.table_name \
  gs://bucket/large_file.csv

Error details show field type mismatches and parsing errors around row 2.3 million, but the CSV appears well-formed when I sample it. I’ve tried increasing max_bad_records to 1000 but the job still fails. The schema is auto-detected by BigQuery.

Is there a better approach for importing large CSVs with explicit schema definition? Should I be logging these errors to Cloud Storage to identify the problematic rows?

Load failures like this usually trace back to three things, and they are best addressed together:

1. CSV Formatting Validation. Large CSV exports, especially from older systems, often carry hidden formatting inconsistencies such as embedded newlines, inconsistent quoting, and ragged rows. Tell BigQuery how to handle these explicitly at load time:

# Add formatting flags to handle common CSV issues
bq load --source_format=CSV \
  --skip_leading_rows=1 \
  --allow_quoted_newlines \
  --allow_jagged_rows \
  --field_delimiter=',' \
  --quote='"' \
  --max_bad_records=1000 \
  dataset.table_name \
  gs://bucket/large_file.csv

--allow_quoted_newlines handles text fields with embedded line breaks, while --allow_jagged_rows accepts rows that are missing optional trailing columns (the missing values load as NULL). Both problems are common in exports from older systems.
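Since your reported errors cluster around row 2.3 million, it is also worth eyeballing that part of the file directly rather than sampling the top. A rough sketch with gsutil and sed (the line numbers are placeholders; take them from the offsets in the job's error messages, and note that quoted embedded newlines shift physical line counts):

# Print a small window of physical lines around the reported failure point,
# then quit so the rest of the 12GB isn't streamed (gsutil may log a harmless
# broken-pipe message when the pipe closes).
gsutil cat gs://bucket/large_file.csv | sed -n '2299990,2300010p;2300010q'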

2. Explicit Schema Definition. Don't rely on auto-detection for production loads, especially with large files, because it infers types from only a small sample of rows. Create a schema JSON file covering all 35 columns (a three-column excerpt for illustration):

[
  {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "amount", "type": "NUMERIC", "mode": "NULLABLE"},
  {"name": "timestamp", "type": "TIMESTAMP", "mode": "REQUIRED"}
]
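Typing out all 35 columns by hand is tedious. One way to bootstrap the file, sketched below with placeholder table and file names, is to let auto-detect run on a small local sample and then dump and hand-correct the result; the detected types still need review, since guessing wrong from samples is exactly the failure mode here:

# Grab a sample of the file locally (header plus ~100k rows)
gsutil cat gs://bucket/large_file.csv | head -n 100001 > sample.csv

# Load the sample into a throwaway table with auto-detect
bq load --autodetect --source_format=CSV --skip_leading_rows=1 \
  dataset.tmp_schema_sample sample.csv

# Dump the detected schema in the same JSON format --schema expects,
# then review and correct the types by hand
bq show --schema --format=prettyjson dataset.tmp_schema_sample > schema.json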

Then load with explicit schema:

bq load --source_format=CSV \
  --schema=schema.json \
  --skip_leading_rows=1 \
  dataset.table_name \
  gs://bucket/large_file.csv

Use STRING type for ambiguous columns initially, then cast during queries or use a transformation step. This prevents type mismatch errors during load.
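If you do stage the ambiguous columns as STRING, a query along these lines (sketched against the sample schema above, assuming amount was staged as STRING) surfaces the exact values that would have failed a stricter type, without re-running the 12GB load:

# Find values that don't parse as NUMERIC in a column staged as STRING
bq query --use_legacy_sql=false '
SELECT user_id, amount
FROM `project.dataset.table_name`
WHERE amount IS NOT NULL
  AND SAFE_CAST(amount AS NUMERIC) IS NULL
LIMIT 100'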

3. Retrieving Load Error Details. bq load has no flag that writes rejected rows to a side table or to Cloud Storage; the row-level errors are attached to the load job itself. The CLI only prints the first handful, so pull the fuller list from the job resource:

bq show --format=prettyjson -j <job_id>

The status.errors array lists each failing record's error message, which typically includes the source file and the row or byte position. Scanning or grouping those messages quickly shows whether you have one systematic problem (for example, an unexpected date format in a single column) or scattered bad rows.

Additional Recommendations:

  1. Pre-validation: Use Cloud Dataflow to validate and clean the CSV before loading:

    • Standardize date formats
    • Remove non-printable characters
    • Validate data types
    • Split into manageable chunks (1-2GB each)
  2. Incremental Loading: For 12GB files, consider splitting the input and running multiple load jobs (one per chunk, or a single job over a wildcard URI) to parallelize the work and isolate failures.

  3. Monitoring: Enable Cloud Logging for BigQuery jobs to track load performance and errors over time; the INFORMATION_SCHEMA jobs views cover much of this from SQL as well (see the query sketch after this list).
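As a complement to Cloud Logging, recent load-job outcomes can be pulled straight from SQL via the INFORMATION_SCHEMA jobs views; a sketch (the region qualifier and 7-day window are assumptions, adjust to your project):

# Summarize recent load jobs and their top-level errors
bq query --use_legacy_sql=false '
SELECT creation_time, job_id, state,
       error_result.reason  AS error_reason,
       error_result.message AS error_message
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = "LOAD"
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY creation_time DESC'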

After switching to the explicit schema and formatting flags, rerun the load job. Pull the job's error list (bq show -j) to identify the specific data quality issues, then either clean the source CSV or adjust your schema to accommodate the actual data patterns.

Another approach: split the large CSV into smaller chunks (1-2GB each) using gsutil or a Dataflow job. This makes troubleshooting easier and you can load them in parallel. If one chunk fails, you know exactly which data range has issues.
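A rough sketch of that workflow with standard tools (paths and chunk sizes are placeholders; note that split works on physical lines, so only use it if no field contains a quoted embedded newline):

# Split the file into ~5M-row chunks, stripping the header so chunks are uniform
gsutil cp gs://bucket/large_file.csv .
tail -n +2 large_file.csv | split -l 5000000 - chunk_

# Upload the chunks in parallel
gsutil -m cp chunk_* gs://bucket/chunks/

# Either one load job over all chunks via a wildcard URI
# (no --skip_leading_rows, since the header was stripped above)...
bq load --source_format=CSV --schema=schema.json \
  dataset.table_name 'gs://bucket/chunks/chunk_*'

# ...or one job per chunk, so a failure is isolated to a known row range
# (bq load appends by default, so successive jobs accumulate into the table)
for f in $(gsutil ls 'gs://bucket/chunks/chunk_*'); do
  bq load --source_format=CSV --schema=schema.json dataset.table_name "$f"
done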

Auto-detection can be unreliable with large files. Define an explicit schema JSON and pass it with the --schema flag. There is no load-time flag that captures failed records for you, so pull the error details from the job itself (bq show -j <job_id>) to see which rows and fields are causing issues.

I had similar issues with large CSVs. The problem is often inconsistent data types in columns that look uniform in samples. For example, a mostly-numeric column might have a few text values deep in the file. BigQuery’s auto-detection infers types from only a small sample at the start of the file, so it never sees those values and guesses wrong. Always use explicit schemas for production loads.

For very large or recurring imports, consider the BigQuery Storage Write API instead of load jobs. Its batch (pending-type) streams return per-request errors, which makes bad records easier to pin down, though it means writing client code rather than running a single bq command.

Definitely check CSV formatting. Use --allow_quoted_newlines if your text fields contain line breaks. Also consider --allow_jagged_rows if some rows have fewer columns. For delimiter issues, explicitly set --field_delimiter. I recommend validating the CSV structure first with a tool like csvlint before loading to BigQuery to catch formatting problems early.
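If csvlint isn't handy, even a crude field-count scan over the full file flags structurally odd rows. A sketch (naive comma split, so quoted fields containing commas or newlines show up as false positives, which is still useful as a list of rows worth inspecting):

# Report physical lines whose naive comma-separated field count differs from the header's
gsutil cat gs://bucket/large_file.csv \
  | awk -F',' 'NR==1 {n=NF; next} NF!=n {print NR": "NF" fields (expected "n")"}' \
  | head -n 50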