Real-time analytics on CDN traffic using Kinesis Data Streams and Athena for ad campaign optimization

We implemented a real-time analytics pipeline for our CloudFront CDN traffic that reduced our marketing team’s decision latency from 24 hours to under 5 minutes. Previously, we relied on CloudFront access logs delivered to S3, which meant waiting for log file delivery (up to 2 hours) plus batch processing time before seeing traffic patterns.

Our solution streams CloudFront access logs through Kinesis Data Streams for real-time processing, while maintaining an S3 archive for historical analysis with Athena. This gives us both immediate visibility into traffic spikes, geographic distribution changes, and cache performance, and the ability to run ad-hoc queries on historical data.

The architecture handles approximately 50K requests per minute during peak traffic and provides dashboards updated every 30 seconds. Marketing teams can now react immediately to campaign performance and content trends rather than waiting for daily reports. This has directly improved our ability to optimize content delivery and identify issues before they impact significant user populations.

For ad-hoc analysis, raw CloudFront logs in S3 work but aren’t optimal. Converting to columnar format and partitioning properly makes a huge difference in Athena query performance and cost. You might want a Kinesis Firehose branch that writes Parquet to a separate S3 prefix for analytics queries.

Our complete implementation addresses all three architectural components for effective real-time CDN analytics.

CloudFront Log Delivery Architecture:

CloudFront delivers access logs to S3 with a prefix structure that enables efficient processing. We configure logging with hourly partitioning to manage Lambda concurrency:

S3 bucket structure: `cdn-logs/raw/YYYY/MM/DD/HH/`

S3 Event Notification triggers Lambda when new log files arrive. The Lambda function (Python 3.9, 512 MB memory, 60-second timeout) handles the streaming pipeline:

# Pseudocode - CloudFront log processing steps:
1. Receive S3 event notification with log file key
2. Download and decompress gzip log file from S3
3. Parse CloudFront log format (tab-delimited fields)
4. Transform records: extract timestamp, status code, bytes, geo, cache status
5. Batch records (500 per PutRecords call) and send to Kinesis
6. Handle partial batch failures with retry logic
# See AWS documentation: CloudFront Log File Format
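The steps above can be sketched in Python. Field names follow the CloudFront access-log format's `#Fields:` header; the stream name and the boto3 wiring in the trailing comments are illustrative, not our exact production code:

```python
import gzip

# Fields we forward downstream, named exactly as in the log file's
# "#Fields:" header line.
KEEP = ("date", "time", "x-edge-location", "sc-status",
        "sc-bytes", "x-edge-result-type", "cs-uri-stem")

def parse_log(raw_gzip_bytes):
    """Decompress a gzipped CloudFront access log and return a list of
    dicts containing only the KEEP fields."""
    text = gzip.decompress(raw_gzip_bytes).decode("utf-8")
    fields, records = [], []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # Header tells us the column order; don't hard-code positions.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line.strip() or not fields:
            continue
        else:
            row = dict(zip(fields, line.split("\t")))
            records.append({k: row.get(k) for k in KEEP})
    return records

def batches(records, size=500):
    """Yield PutRecords-sized chunks (Kinesis caps one call at 500 records)."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# In the real handler (boto3 is available in the Lambda runtime):
#   kinesis = boto3.client("kinesis")
#   for batch in batches(parse_log(body)):
#       resp = kinesis.put_records(
#           StreamName="cdn-traffic",  # illustrative name
#           Records=[{"Data": json.dumps(r), "PartitionKey": key_for(r)}
#                    for r in batch])
#       # re-send any resp["Records"] entries that carry an ErrorCode
```

The `#Fields:` parsing keeps the function resilient if AWS appends new columns to the log format.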

Critical configuration: set the S3 event notification filter to a `.gz` suffix and set Lambda reserved concurrency (50 concurrent executions) so traffic spikes are absorbed without throttling.
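With the AWS CLI, that configuration looks roughly like the following (bucket, function, account, and region names are placeholders):

```shell
# Trigger the forwarder Lambda only for compressed log objects.
aws s3api put-bucket-notification-configuration \
  --bucket cdn-logs \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:cdn-log-forwarder",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".gz"}]}}
    }]
  }'

# Reserve (and cap) concurrency for the forwarder function.
aws lambda put-function-concurrency \
  --function-name cdn-log-forwarder \
  --reserved-concurrent-executions 50
```

Note that S3 suffix filters are literal strings, not globs, which is why the value is `.gz` rather than `*.gz`.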

Kinesis Data Streams Ingestion:

We use Kinesis Data Streams as the real-time backbone with careful shard management:

  • Initial configuration: 5 shards (5MB/sec write, 10MB/sec read capacity)
  • Partition key: CloudFront distribution ID + timestamp hash (ensures even distribution)
  • Record format: JSON with selected fields (timestamp, edge location, status, bytes, cache status, URI, geo)
  • Auto-scaling policy: Target utilization 70%, scale between 5-15 shards
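The partition-key scheme from the list above can be sketched as follows; the exact hash and key format here are illustrative, the point is that the timestamp component prevents one distribution from hot-spotting a single shard:

```python
import hashlib

def partition_key(distribution_id, timestamp_ms):
    """Build a Kinesis partition key from the CloudFront distribution ID
    plus a timestamp hash, so records spread evenly across shards instead
    of all landing on the shard that one distribution ID maps to."""
    digest = hashlib.md5(f"{distribution_id}:{timestamp_ms}".encode()).hexdigest()
    return f"{distribution_id}-{digest[:8]}"
```

Keys are deterministic for a given (distribution, timestamp) pair, which keeps retries idempotent with respect to shard routing.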

Downstream consumers include:

  1. Lambda consumer for real-time alerting (cache hit rate drops, error rate spikes)
  2. Kinesis Data Analytics for sliding window aggregations (requests per minute by edge location)
  3. Kinesis Firehose for transformed data delivery to S3 for Athena
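Consumer 1's core logic reduces each Kinesis batch to the two signals we alert on. A simplified sketch (field names assume the JSON record format listed above; thresholds and the alerting call itself are omitted):

```python
def batch_metrics(records):
    """Aggregate one consumed batch into cache hit rate and 5xx error rate.
    'Hit' and 'RefreshHit' are the CloudFront x-edge-result-type values
    that count as cache hits."""
    total = len(records)
    if total == 0:
        return {"hit_rate": None, "error_rate": None}
    hits = sum(1 for r in records
               if r.get("x-edge-result-type") in ("Hit", "RefreshHit"))
    errors = sum(1 for r in records
                 if str(r.get("sc-status", "")).startswith("5"))
    return {"hit_rate": hits / total, "error_rate": errors / total}
```

In the real consumer these per-batch numbers feed CloudWatch metrics, and alarms fire on the rolling rates rather than on a single batch.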

Firehose configuration transforms JSON records to Parquet format with dynamic partitioning by date and edge location, writing to: `cdn-analytics/processed/year=YYYY/month=MM/day=DD/edge=LOCATION/`

Athena Ad-Hoc Querying:

We maintain two Athena table structures for different query patterns:

Table 1: Raw logs (for complete forensic analysis)

  • Points to original CloudFront logs in S3
  • Partitioned by date (year/month/day)
  • Uses Grok SerDe for CloudFront log format parsing
  • Query cost: $5 per TB scanned
  • Use case: Detailed investigation of specific requests

Table 2: Processed analytics (for business intelligence)

  • Points to Parquet data written by Kinesis Firehose
  • Partitioned by date and edge location
  • Columnar format reduces scan volume by 80%
  • Query cost: $1 per TB scanned (effective)
  • Use case: Aggregate metrics, trend analysis, dashboards
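A sketch of the Table 2 DDL (bucket, table, and column names here are illustrative; the schema must match what Firehose actually writes to the Parquet files):

```sql
CREATE EXTERNAL TABLE processed_cdn_logs (
  request_time  string,
  edge_location string,
  sc_status     int,
  sc_bytes      bigint,
  cache_status  string,
  uri           string,
  geo           string,
  time_taken    double
)
PARTITIONED BY (year int, month int, day int, edge string)
STORED AS PARQUET
LOCATION 's3://cdn-analytics/processed/';

-- Register the Hive-style partitions Firehose has written
-- (partition projection is an alternative that avoids this step):
MSCK REPAIR TABLE processed_cdn_logs;
```

Because the Firehose prefix already uses `key=value` directory names, `MSCK REPAIR TABLE` can discover new partitions without explicit `ALTER TABLE ADD PARTITION` calls.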

Sample Athena query for cache performance analysis:

SELECT edge_location,
       cache_status,
       COUNT(*) as requests,
       AVG(time_taken) as avg_latency
FROM processed_cdn_logs
WHERE year=2024 AND month=12 AND day=13
GROUP BY edge_location, cache_status;

Operational Insights:

Real-time benefits we’ve achieved:

  • Cache hit rate monitoring with 30-second granularity (previously daily)
  • Geographic traffic shift detection within 2 minutes (previously 24 hours)
  • Error rate alerting with 1-minute latency (previously discovered through customer reports)
  • Content popularity tracking for dynamic TTL optimization

Cost considerations:

  • Kinesis: ~$150/month for 5-15 shards with auto-scaling
  • Lambda: ~$80/month for log processing (50K invocations/day)
  • Firehose: ~$100/month for data transformation and delivery
  • Athena: ~$50/month for ad-hoc queries (varies by usage)
  • S3 storage: ~$200/month for raw and processed logs (30-day retention)

Total: ~$580/month for real-time analytics on 50M+ daily requests

Key Architectural Decisions:

  1. Lambda-based ingestion vs. direct Firehose: We chose Lambda for the flexibility to parse CloudFront’s specific log format and drop unnecessary fields before streaming. Firehose can’t parse CloudFront logs on its own and would need a Lambda transformation anyway.

  2. Dual storage strategy: Maintaining both raw logs (compliance, forensics) and processed Parquet data (analytics) provides flexibility for different query patterns while optimizing Athena costs.

  3. Kinesis Data Streams vs. Firehose only: Data Streams enables multiple consumers (real-time alerting, analytics, archival) from a single ingestion pipeline. Firehose alone would require separate processing for each use case.

This architecture successfully addresses CloudFront log delivery latency, Kinesis ingestion scalability, and Athena query efficiency, enabling marketing teams to make data-driven decisions in near real time rather than waiting for batch processing cycles.

CloudFront doesn’t directly stream to Kinesis, so we use Lambda triggered by S3 events when log files arrive. The Lambda function parses the gzip-compressed log files and sends records to Kinesis Data Streams. This adds minimal latency (under 30 seconds from log delivery to Kinesis) while maintaining the S3 archive for Athena queries. We partition S3 data by date for efficient historical querying.

How are you handling the Athena side? Do you query directly against S3 logs, or are you also writing processed data from Kinesis back to S3 in a more query-friendly format like Parquet?

We batch records in Lambda - each CloudFront log file contains multiple requests, so we parse the file and send records in batches of 500 using PutRecords API. This reduces Kinesis API calls significantly. We started with 5 shards (5MB/sec write capacity) and enabled auto-scaling to handle traffic spikes. During peak hours, we scale to 8-10 shards.

This sounds like exactly what we need. How did you handle the CloudFront log delivery to Kinesis? Did you use Lambda to read from S3 and forward, or is there a direct integration?