AWS provides many solutions for managing business data. There’s Amazon Relational Database Service (Amazon RDS), which is ideal for scaling your databases in the cloud. There’s Amazon Redshift for warehousing your data. For collecting big data, we’ve looked at a number of modern data integration platforms, but Amazon CloudFront is more of a content delivery platform.

So, why are we talking about CloudFront in terms of big data? Because Amazon CloudFront integrates with a range of AWS services that help with data management. Beyond that, CloudFront logs provide their own set of unique insights that savvy businesses want to collect.

CloudFront allows users to create log files that keep track of every single user request received by the platform. You might see these referred to as standard logs or access logs; they are saved to an Amazon S3 bucket. S3/CloudFront logging is generally inexpensive and quick to implement, making it attractive to users who need big data results, fast. Here’s how to set up your S3/CloudFront logging and use it to collect the data that matters to you and your business.

Table of Contents

  1. Setting Up CloudFront Access Logs: Step by Step
  2. Ways to Analyze Your Access Logs
  3. Optimizing Your CloudFront Logs
  4. Editing and Configuring Your CloudFront Logs
  5. Integrate.io and the Power of ETL

Setting Up CloudFront Access Logs: Step by Step

Enabling logging for CloudFront is surprisingly simple and doesn’t require much troubleshooting. Standard logs are available for both web and RTMP distributions. You can use Amazon’s other tools, such as Lambda, to create a completely serverless way to log and analyze your data. Alternatively, you can work with an ETL expert to get the most out of data integration.

1. Collecting Big Data

Your first step is to think about the types of data you want to collect. Do you need to know how many website visitors are reading your latest blog entries? How many are sharing your content on their social media feeds? Or is it more important to find out how many people click away within a certain time? Depending on the types of insights you need, you could use Amazon CloudWatch. CloudWatch is integrated with CloudFront and monitors six default metrics per distribution: requests, bytes downloaded, bytes uploaded, total error rate, 4xx error rate, and 5xx error rate. It’s also possible to enable additional, more detailed metrics for your CloudFront distribution, which costs a little extra.

2. Create Your AWS Account

AWS services are free, at a basic level, for the first 12 months, so choose the free tier offering that suits you best. For example, Amazon EC2 provides 750 hours of Linux or Windows-based cloud computing capacity each month. The AWS Lambda compute service is also a free option; Lambda works with popular programming languages such as Ruby and Python, making it a great choice for developers. The Amazon S3 offering, by contrast, focuses on storage, which is ideal for those large buckets of data. You can explore the available AWS packages and register for free on the AWS website.


3. Create an Amazon CloudFront S3 Bucket

You can use the S3 dashboard to create an Amazon S3 bucket as the endpoint for your log files. The bucket name has to be globally unique, and if you already use an S3 bucket for another purpose, create a separate, dedicated bucket for your logs. Be sure to check that you have these permissions:

  • FULL_CONTROL
  • s3:GetBucketAcl
  • s3:PutBucketAcl

If you’re not the owner of the AWS account, ask the account owner to grant these permissions.

Also, if you’ve enabled server-side encryption (SSE) on the bucket, you will need to adjust the key policy for your Customer Master Key (CMK).

Choose your S3 bucket and ensure its name appears in “Bucket for Logs”, e.g. bucket-name.s3.amazonaws.com, where bucket-name is the destination you have chosen for your data.

Make sure you turn logging on, and choose whether you want to log cookies (optional).

4. Download Your Logs

You can use the AWS Command Line Interface (AWS CLI) to download your logs and start gaining insights from the data.
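
As a quick sketch, assuming the bucket and prefix names below stand in for whatever you chose in step 3, a single sync command copies down any log files you don’t already have locally:

  # Hypothetical bucket and prefix; substitute your own from step 3.
  aws s3 sync s3://your-log-bucket/your-log-prefix/ ./cloudfront-logs/

CloudFront delivers standard log files gzip-compressed, so expect objects with a .gz extension. Because sync only transfers files that are new or changed, it works well as a scheduled job for pulling fresh logs.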

Ways to Analyze Your Access Logs, Including AWS Lambda and Athena

Once you update the production environment, your S3 bucket should start to fill up nicely with useful log data. To make the most of that data, you need intelligent analysis. Amazon has a walkthrough on how to use AWS Lambda, Amazon Athena, and Amazon Kinesis Data Analytics to help you understand what’s happening with your website and content. Common insights include identifying your top viewers, measuring bandwidth usage per CloudFront distribution, and detecting bots (see the example query after the table definition below).

It’s worth noting that you can’t associate a Lambda function with a CloudFront distribution you don’t own. Certain CloudFront triggers also require that the IAM execution role associated with a Lambda function be assumable by specific service principals. AWS documentation covers these edge cases well, so troubleshooting is usually straightforward.

It’s possible to combine all logs for a specific time period into single files. That makes analysis simpler, as you can focus on a single hour or day, rather than trawling through weeks’ worth of data.

Amazon Athena, an interactive query service, is set up to work with CloudFront: you can use it to create a table over your CloudFront logs and then run whatever queries you need.

The following code is freely available from Amazon and allows you to use Athena to create a table that stores and displays your log data in a usable format. Simply adjust the LOCATION clause to match the S3 bucket where you store your log files.

CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  request_ip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT,
  referrer STRING,
  user_agent STRING,
  query_string STRING,
  cookie STRING,
  result_type STRING,
  request_id STRING,
  host_header STRING,
  request_protocol STRING,
  request_bytes BIGINT,
  time_taken FLOAT,
  xforwarded_for STRING,
  ssl_protocol STRING,
  ssl_cipher STRING,
  response_result_type STRING,
  http_version STRING,
  fle_status STRING,
  fle_encrypted_fields INT,
  c_port INT,
  time_to_first_byte FLOAT,
  x_edge_detailed_result_type STRING,
  sc_content_type STRING,
  sc_content_len BIGINT,
  sc_range_start BIGINT,
  sc_range_end BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://CloudFront_bucket_name/CloudFront/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
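
As a usage sketch, the query below runs against the table defined above and surfaces two of the insights mentioned earlier: your top viewers and the bandwidth they consume. The date in the WHERE clause is a placeholder; substitute your own reporting day.

-- Top ten viewers by request count, with the bandwidth they consumed.
SELECT request_ip,
       COUNT(*)   AS request_count,
       SUM(bytes) AS bytes_served
FROM default.cloudfront_logs
WHERE "date" = DATE '2023-06-01'  -- placeholder reporting day
GROUP BY request_ip
ORDER BY request_count DESC
LIMIT 10;

Swapping request_ip for user_agent is a quick way to spot likely bots, since automated clients tend to announce themselves with distinctive agent strings and unusually high request counts.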

Ruby users may want to use request-log-analyzer to start working with their data, but be aware that it doesn’t support the CloudFront standard log format out of the box. A workaround published on GitHub may be useful for Ruby users.


Optimizing Your CloudFront Logs

Once you’ve passed the free tier period, Amazon charges for S3 bucket storage by the month, so it makes sense to compress your data as much as possible. Athena users will save money on queries, too: Athena bills by the amount of data scanned per query, so the more compact the data, the less each query costs.

We touched on splitting your CloudFront access logs into time groups, such as data for a particular hour, day, or week. You can also partition your data by domain name, IP address, or other keys, allowing for deeper insights and faster analysis. Whether you’re using Athena or another system for analytics and reporting, scanning less data means lower latency and lower cost per query. Creative partitioning means you’re only working on the data that matters to your queries. Lowering your costs and resource usage in this way also means you can scale up the amount of data you review, knowing that all of it is relevant and useful. That can lead to some fantastic big data insights.
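
As an illustrative sketch only: the table below assumes you have already reorganized your log files into yyyy/MM/dd prefixes (CloudFront delivers them flat into one prefix by default, so a small Lambda function or scheduled job would need to do that reshuffling first). It uses Athena partition projection so that queries filtering on the partition column scan only the matching prefixes. The bucket name, table name, date range, and trimmed column list are all placeholders.

-- Hypothetical partitioned variant of the earlier table. Only the
-- leading columns are declared for brevity; mirror the full column
-- list from the table above in real use.
CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs_by_day (
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  request_ip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT
)
PARTITIONED BY (log_day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://your-log-bucket/cloudfront/'
TBLPROPERTIES (
  'skip.header.line.count'='2',
  'projection.enabled'='true',
  'projection.log_day.type'='date',
  'projection.log_day.format'='yyyy/MM/dd',
  'projection.log_day.range'='2023/01/01,NOW',
  'storage.location.template'='s3://your-log-bucket/cloudfront/${log_day}/'
)

With this in place, a clause such as WHERE log_day = '2023/06/01' restricts each query to a single day’s prefix, which is exactly the cost and latency saving described above.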

Editing and Configuring Your CloudFront Logs

In business, things are always changing. That’s why it’s good news that you can change the following access log settings at any time:

  • Enable or disable the logging feature
  • Change the S3 bucket where the data is stored
  • Change the prefix for log files

You may be able to use the CloudFront API to configure some of these changes, although the API is most commonly used to update CloudFront distributions.

Of course, CloudFront might not be your only source of data for insight and analysis. It’s quite common to use a range of AWS services to collect and collate data for a variety of purposes. You can stream similar access logs from Lambda or Lambda@Edge functions, and you may also have access to Application Load Balancer (ALB) logs. It’s possible, with some tinkering in the AWS Management Console, to glue this data together, creating a partitioned table that displays all your data in one place. This involves creating an IAM role with permissions to all the sources feeding into the S3 bucket.

Gluing your data together in this way is a form of ETL: Extract, Transform, and Load. Cloud-based, serverless data transformation is one of the most efficient ways of handling and integrating your business data. It’s also highly scalable, meaning you can take on some seriously big data collection without having to ramp up your business’s in-house resources.

Integrate.io and the Power of ETL


Everything we’ve covered about setting up and checking CloudFront logs is a form of data integration: extracting the data from your website, transforming it into a workable format, and loading it into business intelligence tools. This provides profit-boosting insights that not only give you an edge over your competitors but also free up more time to focus on the aspects of your business that you’re really passionate about.

Learn more about how ETL can transform the way you manage your business data. Schedule an intro call with our support team.