Complete Implementation Details:
Lambda Backup Metadata Capture: EventBridge rule pattern matches AWS Backup job state changes. Lambda function (Python 3.9) receives event, extracts backup metadata, calls ResourceGroupsTaggingAPI to get resource tags (environment, application, data_classification), enriches event data, and writes to S3. Critical code handles AWS Backup event structure variations across service types.
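A minimal sketch of the enrichment step described above, not the production handler. It assumes the event `detail` carries `backupJobId`, `resourceArn`, `resourceType`, and `state`; since the structure varies by service type, every lookup goes through `.get`. The names `flatten_tags` and `enrich_backup_event` and the output field names are hypothetical. `tagging_client` is expected to be a boto3 `resourcegroupstaggingapi` client, passed in so the logic can be exercised without AWS access.

```python
from typing import Any, Dict, List, Optional

# Tag keys named in the write-up; other tags on the resource are ignored.
WANTED_TAGS = ("environment", "application", "data_classification")


def flatten_tags(tag_list: List[Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Turn [{'Key': k, 'Value': v}, ...] into a dict limited to WANTED_TAGS."""
    # Assumption: tag keys are matched case-insensitively.
    found = {t["Key"].lower(): t["Value"] for t in tag_list}
    return {key: found.get(key) for key in WANTED_TAGS}


def enrich_backup_event(event: Dict[str, Any], tagging_client: Any) -> Dict[str, Any]:
    """Extract backup metadata from an EventBridge event and attach resource tags."""
    detail = event.get("detail", {})
    arn = detail.get("resourceArn")

    tags: List[Dict[str, str]] = []
    if arn:
        # get_resources returns a ResourceTagMappingList with one entry per ARN.
        resp = tagging_client.get_resources(ResourceARNList=[arn])
        mappings = resp.get("ResourceTagMappingList", [])
        if mappings:
            tags = mappings[0].get("Tags", [])

    record = {
        "backup_job_id": detail.get("backupJobId"),
        "resource_arn": arn,
        "resource_type": detail.get("resourceType", "UNKNOWN"),
        "state": detail.get("state"),
        "event_time": event.get("time"),
    }
    record.update(flatten_tags(tags))
    return record
```

Missing tags come back as `None` rather than raising, so untagged resources still produce a row and can be reported on later.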
S3 Parquet Partitioning: Data lands in S3 path: s3://backup-analytics/data/year=YYYY/month=MM/day=DD/resource_type=TYPE/. Lambda uses PyArrow to write Parquet with Snappy compression. Glue crawler runs weekly (CloudWatch Events schedule) to update table schema and discover partitions. Table registered in Glue Data Catalog enables Athena queries. Partition projection could eliminate crawler but we prefer explicit partition discovery for data validation.
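The partition layout above can be sketched as a small helper that builds the Hive-style prefix for one event. This is an illustration only: `partition_prefix` is a hypothetical name, and upper-casing and sanitising the resource type is an assumption, not something the layout requires. The Parquet file itself would then be written under this prefix, for example with `pyarrow.parquet.write_table(..., compression="snappy")`.

```python
from datetime import datetime, timezone

BUCKET = "backup-analytics"  # bucket name from the write-up
PREFIX = "data"


def partition_prefix(event_time: datetime, resource_type: str) -> str:
    """Return the Hive-style partition prefix for one backup event.

    event_time should be timezone-aware; it is converted to UTC so that
    partitions do not depend on the Lambda's local timezone.
    """
    ts = event_time.astimezone(timezone.utc)
    # Assumption: replace non-alphanumeric characters so values such as
    # "Storage Gateway" cannot introduce spaces into the S3 key.
    safe_type = "".join(c if c.isalnum() else "_" for c in resource_type).upper()
    return (
        f"s3://{BUCKET}/{PREFIX}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/resource_type={safe_type}/"
    )
```

Zero-padded month and day values keep the partitions lexically sortable and match the `MM`/`DD` placeholders in the path.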
Athena Compliance Queries: Created saved queries for common compliance needs: (1) Backup success rate by resource type over 30/90 days, (2) Resources missing backups in last 24 hours, (3) Retention policy violations (backups older than configured retention), (4) Cost analysis by application tag, (5) Recovery point objective (RPO) compliance (time between backups vs. target). Queries use window functions and CTEs for complex analytics. Workgroup configured with result encryption and 30-day result retention.
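To make the RPO check (query 5) concrete, here is a plain-Python sketch of the same gap calculation that a `LAG(...) OVER (PARTITION BY resource ORDER BY completion time)` window would perform in Athena. The function name `rpo_violations` and its input shape are hypothetical, and it handles a single resource's backups at a time.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def rpo_violations(
    completion_times: List[datetime], target: timedelta
) -> List[Tuple[datetime, datetime, timedelta]]:
    """Return (previous, current, gap) for each consecutive pair of backups
    whose gap is strictly greater than the RPO target."""
    ordered = sorted(completion_times)
    violations = []
    # Pair each backup with the one before it, like LAG() in SQL.
    for prev, curr in zip(ordered, ordered[1:]):
        gap = curr - prev
        if gap > target:
            violations.append((prev, curr, gap))
    return violations
```

For example, backups at hour 0, hour 24, and hour 60 against a 24-hour target yield one violation: the 36-hour gap between the second and third backup.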
QuickSight Dashboards: Athena table imported into a SPICE dataset (refreshed hourly) rather than queried directly. Three dashboards: (1) Executive view - backup coverage percentage, success rates, cost trends, (2) Operations view - failed backups requiring attention, backup job durations, storage growth, (3) Compliance view - retention compliance, RPO/RTO metrics, audit-ready reports. Parameterized filters for date range, environment, application. Embedded in internal portal using QuickSight embedding SDK.
EventBridge Alerting: Two-tier approach: (1) Real-time alerts for critical failures (database backups, production resources) using EventBridge rule → SNS → PagerDuty, (2) Daily summary alerts using scheduled Lambda that queries Athena for failure trends and sends formatted report via SNS → email. Implemented alert suppression using DynamoDB table tracking recent alerts per resource (prevents duplicate notifications within 4 hours). Alert logic queries last 3 backup attempts from Athena before deciding to alert.
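The suppression rule can be sketched as follows. This is an assumption-laden illustration: a plain dict stands in for the DynamoDB table, `should_send_alert` is a hypothetical name, and the check of the last 3 backup attempts is deliberately left out because the write-up does not state that rule.

```python
from datetime import datetime, timedelta
from typing import MutableMapping

SUPPRESSION_WINDOW = timedelta(hours=4)  # window stated in the write-up


def should_send_alert(
    resource_arn: str,
    now: datetime,
    last_alerted: MutableMapping[str, datetime],
) -> bool:
    """Return True and record the alert time unless this resource was
    already alerted on within SUPPRESSION_WINDOW."""
    previous = last_alerted.get(resource_arn)
    if previous is not None and now - previous < SUPPRESSION_WINDOW:
        # A recent alert exists: suppress, and leave the stored time unchanged
        # so the window is measured from the alert that was actually sent.
        return False
    last_alerted[resource_arn] = now
    return True
```

Against a real DynamoDB table, the read-then-write here would more likely be a single conditional write so that two concurrent Lambda invocations cannot both send the alert; that variant is not shown.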
Results: Audit preparation time dropped from 3-4 days to 30 minutes. Compliance reports auto-generate weekly. Identified 12 resources with backup gaps in the first month. Monthly operational cost: $45 (Lambda $8, Athena $10, S3 $15, QuickSight $12). System processes 200+ backup events daily across 180 resources. Recovery from backup data corruption is now possible, using Athena to identify the last known good backup. Highly recommend this architecture for any organization with compliance reporting requirements.