Although the Internet made the world flat, geography still matters. Knowing which countries your users live in could provide business opportunities to localize your services and increase profits. The only question is how in the world to do it.
Luckily, user locations can be discovered from IP addresses via geolocation services. Since visitor IPs are stored in web server logs, all that's left to do is run over the logs, geolocate the addresses, and aggregate and store the results. Sounds like a job for Integrate.io!
In this post we’ll show how Integrate.io’s data integration in the cloud can process web server logs, extract IP addresses, and discover user geolocations. We will use it to calculate web visitors per country, and then drill down to check the number of visitors per city.
For this demo, we will use 1.5 GB of public domain ‘Star Wars Kid’ web server logs. Example log lines:
220.127.116.11 - - [14/Sep/2003:14:30:14 -0700] "GET / HTTP/1.1" 200 39101 "http://www.wired.com/news/culture/0,1284,58881,00.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
18.104.22.168 - - [14/Sep/2003:14:30:18 -0700] "GET /archive/cat/image/index.shtml HTTP/1.1" 200 18267 "http://www.waxy.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
22.214.171.124 - - [14/Sep/2003:14:30:21 -0700] "GET /archive/cat/events/index.shtml HTTP/1.1" 200 27686 "-" "LinkWalker"
Each log line contains the following fields:
- Source IP
- User identifier (blank)
- User ID (blank)
- Date - in the format of dd/MMM/yyyy:HH:mm:ss Z
- HTTP request - type, URL, HTTP version
- HTTP status code
- Bytes transferred
- Referrer
- User agent
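To make the field layout concrete, here is a minimal Python sketch that splits one of the example lines into its fields. It is an illustration only and not part of the Integrate.io dataflow, where the Source component handles parsing. The regular expression, the `parse_log_line` helper, and the field names are assumptions based on the sample lines above.

```python
import re
from datetime import datetime

# Assumed pattern for the log layout shown in the example lines.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # dd/MMM/yyyy:HH:mm:ss Z corresponds to this strptime pattern.
    fields["date"] = datetime.strptime(fields["date"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

sample = ('22.214.171.124 - - [14/Sep/2003:14:30:21 -0700] '
          '"GET /archive/cat/events/index.shtml HTTP/1.1" 200 27686 "-" "LinkWalker"')
record = parse_log_line(sample)
print(record["ip"], record["status"], record["bytes"], record["agent"])
```

Only the `ip` field is needed for geolocation; the other fields are parsed here just to show how the line breaks down.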
Note that IP to location mapping changes over time as IP address ranges are obtained and released, so geolocating this data is done for demo purposes only.
Visitors per Country
The following dataflow can be created using Integrate.io’s visual editor. It loads the data, determines unique visitors, converts IPs to countries, counts the number of visitors per country, sorts the data, and then stores the results.
Source - loads the data from the relevant S3 bucket/path. Once all the relevant options are set, the circular arrows button at the top right can auto-detect the schema and fill in the field names.
Select - keeps IP addresses and gets rid of the rest of the data.
Distinct - removes duplicate IPs to get unique visitors (this component doesn’t have any options).
Select - converts IP addresses to countries using the
Aggregate - counts the number of unique visitors per country. This is achieved by grouping the data by country, and then counting the number of times each country name appears.
Sort - sorts results by the number of visitors in descending order.
Destination - saves the results back to Amazon S3.
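For readers who think in code, the logic of the steps above can be sketched in plain Python. This is not how Integrate.io implements the dataflow. The `ip_to_country` helper and the `FAKE_IP_TO_COUNTRY` table are hypothetical stand-ins for a real geolocation lookup, and the IP addresses are made up.

```python
from collections import Counter

# Hypothetical stand-in for a geolocation service. A real lookup would
# query an IP database; this mapping is made up for illustration.
FAKE_IP_TO_COUNTRY = {
    "10.0.0.1": "United States",
    "10.0.0.2": "United States",
    "10.0.0.3": "Canada",
}

def ip_to_country(ip):
    """Return the country for an IP, or 'Unknown' if it is not in the table."""
    return FAKE_IP_TO_COUNTRY.get(ip, "Unknown")

def visitors_per_country(ips):
    unique_ips = set(ips)                                  # Distinct
    countries = (ip_to_country(ip) for ip in unique_ips)   # Select (geolocate)
    counts = Counter(countries)                            # Aggregate
    # Sort by visitor count, descending; ties are broken by country name.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

ips = ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.9"]
result = visitors_per_country(ips)
print(result)
```

Removing duplicates before geolocating means each address is looked up only once, which mirrors the order of the Distinct and Select components.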
Here is a summary of the number of visitors per country (see the next section for full results):
- Total visitors - 1,387,000
- Countries - 200
- Top countries - United States (62%, 863,000), Canada (9%, 126,000), United Kingdom (5%, 75,000).
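The percentages above follow directly from the rounded visitor counts. A short Python check, using only the figures listed in this section:

```python
total_visitors = 1_387_000
top_countries = {
    "United States": 863_000,
    "Canada": 126_000,
    "United Kingdom": 75_000,
}

# Share of all visitors per country, rounded to a whole percent.
shares = {
    country: round(100 * count / total_visitors)
    for country, count in top_countries.items()
}
print(shares)
```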
Visitors per City in the UK
Let’s say that we want to dig deeper into the UK market and find out which cities our website visitors come from. We will use a similar dataflow with two additions: a filter that keeps only visitors from the UK, and a step that geolocates the cities they come from.
Source - no changes.
Select - no changes.
Distinct - no changes.
Select - keeps IPs for later use and geolocates the countries.
Filter - keeps only visitors from the United Kingdom.
Select - geolocates cities using the
Aggregate - same as before, except that now the ‘city’ field is used.
Sort - sorts the data by the number of visitors.
Destination - stores results in a different path.
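The drill-down can be sketched in plain Python as well. As before, this is an illustration under stated assumptions and not Integrate.io's implementation. The `FAKE_GEO` table and the `ip_to_country` and `ip_to_city` helpers are hypothetical, and the IP addresses are made up.

```python
from collections import Counter

# Hypothetical lookup table of (country, city) per IP, made up for illustration.
FAKE_GEO = {
    "10.0.0.1": ("United Kingdom", "London"),
    "10.0.0.2": ("United Kingdom", "London"),
    "10.0.0.3": ("United Kingdom", "Bristol"),
    "10.0.0.4": ("Canada", "Toronto"),
}

def ip_to_country(ip):
    return FAKE_GEO.get(ip, ("Unknown", "Unknown"))[0]

def ip_to_city(ip):
    return FAKE_GEO.get(ip, ("Unknown", "Unknown"))[1]

def visitors_per_city(ips, country):
    unique_ips = set(ips)                                   # Distinct
    # Select + Filter: geolocate the country and keep matching IPs only.
    in_country = [ip for ip in unique_ips if ip_to_country(ip) == country]
    counts = Counter(ip_to_city(ip) for ip in in_country)   # Select + Aggregate
    # Sort by visitor count, descending; ties are broken by city name.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
uk_cities = visitors_per_city(ips, "United Kingdom")
print(uk_cities)
```

Filtering by country before the city lookup keeps the second geolocation step limited to the addresses that matter, which is the reason the IPs are carried through the earlier Select.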
Here are the top UK cities where visitors come from, as well as full results from the previous section:
- UK Visitors - 75,000
- UK Cities - almost 1,090
- Top UK cities - London (4.51%, 3,380), Bristol (0.94%, 700), Birmingham (0.85%, 640)
User locations can be easily extracted from IP addresses with a bit of help from Integrate.io. Get your free account and start geolocating your data.