Your load job failures point to three related issues; work through them in order:
1. CSV Formatting Validation
Large CSVs from legacy systems often have hidden formatting inconsistencies. Validate and fix these before loading:
# Add formatting flags to handle common CSV issues
bq load --source_format=CSV \
--skip_leading_rows=1 \
--allow_quoted_newlines \
--allow_jagged_rows \
--field_delimiter=',' \
--quote='"' \
--max_bad_records=1000 \
dataset.table_name \
gs://bucket/large_file.csv
The --allow_quoted_newlines handles text fields with embedded line breaks, while --allow_jagged_rows permits rows with missing trailing columns. These are common in exports from legacy systems.
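Before loading, a rough local check can confirm whether jagged rows or stray carriage returns are actually present. A minimal sketch using standard shell tools (large_file.csv is a placeholder, and awk splits on every comma, so quoted fields containing commas or newlines will skew the counts):
# Distribution of field counts per line; more than one distinct count suggests jagged rows
awk -F',' '{print NF}' large_file.csv | sort -n | uniq -c
# Count lines containing carriage returns, a common artifact of legacy Windows exports
grep -c $'\r' large_file.csv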
2. Explicit Schema Definition
Never rely on auto-detection for production loads, especially with large files. Create a schema JSON file:
[
  {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "amount", "type": "NUMERIC", "mode": "NULLABLE"},
  {"name": "timestamp", "type": "TIMESTAMP", "mode": "REQUIRED"}
]
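If a table with the same shape already exists, you can dump its schema rather than writing the JSON by hand. This step is optional; dataset.existing_table is a placeholder:
bq show --schema --format=prettyjson dataset.existing_table > schema.json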
Then load with explicit schema:
bq load --source_format=CSV \
--schema=schema.json \
--skip_leading_rows=1 \
dataset.table_name \
gs://bucket/large_file.csv
Use STRING type for ambiguous columns initially, then cast during queries or use a transformation step. This prevents type mismatch errors during load.
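As a sketch of the cast-later approach: suppose amount were declared as STRING in schema.json. SAFE_CAST converts it at query time and returns NULL for values that do not parse, instead of failing the whole query:
SELECT
  user_id,
  SAFE_CAST(amount AS NUMERIC) AS amount_numeric  -- NULL where the string did not parse
FROM dataset.table_name
Adding WHERE SAFE_CAST(amount AS NUMERIC) IS NULL AND amount IS NOT NULL isolates the unparseable values for inspection.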
3. Error Logging and Analysis
The bq CLI does not write rejected rows to a separate error table or bucket; per-row failure details land in the load job's status instead. Set --max_bad_records so the job tolerates bad rows rather than aborting on the first one:
bq load --source_format=CSV \
--schema=schema.json \
--max_bad_records=1000 \
dataset.table_name \
gs://bucket/large_file.csv
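To see why specific rows were rejected, read the errors from the job status after it finishes. A minimal sketch, assuming you grab the job ID from the bq load output (<job_id> is a placeholder, and the error list may be truncated when there are very many failures):
# Find the load job ID if you did not note it
bq ls --jobs=true --max_results=10
# Dump the job status; status.errors lists the reason, location, and message for rejected records
bq show --format=prettyjson -j <job_id>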
Each failure is recorded with a reason, location, and message. If you collect those details into your own table, for example dataset.error_log with an error_message column, you can query it to identify patterns:
SELECT error_message, COUNT(*) as error_count
FROM dataset.error_log
GROUP BY error_message
ORDER BY error_count DESC
Additional Recommendations:
- Pre-validation: Use Cloud Dataflow to validate and clean the CSV before loading:
  - Standardize date formats
  - Remove non-printable characters
  - Validate data types
  - Split into manageable chunks (1-2 GB each)
- Incremental Loading: For a 12 GB file, consider loading it in batches: split it into chunks and run separate load jobs to parallelize the work and isolate failures (see the sketch after this list).
- Monitoring: Enable Cloud Logging for BigQuery jobs to track load performance and errors over time.
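For the incremental loading approach: once the chunks are in Cloud Storage, a single bq load can pick them all up with a wildcard URI, or you can run one job per chunk so a bad file only fails its own job. A minimal sketch (bucket, prefix, and chunk names are placeholders; --skip_leading_rows=1 assumes each chunk keeps its own header row):
# Option 1: one load job over all chunks via a wildcard URI
bq load --source_format=CSV --schema=schema.json --skip_leading_rows=1 \
dataset.table_name gs://bucket/chunks/part-*.csv
# Option 2: one load job per chunk, so failures are isolated to a single file
for f in $(gsutil ls gs://bucket/chunks/part-*.csv); do
  bq load --source_format=CSV --schema=schema.json --skip_leading_rows=1 \
    dataset.table_name "$f"
done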
After switching to an explicit schema and capturing load errors, rerun the load job. Check the collected errors in the error_log table to identify specific data quality issues, then either clean the source CSV or adjust your schema to accommodate the actual data patterns.