Although the Internet has made the world flat, geography still matters. Knowing which countries your users live in can reveal opportunities to localize your services and increase profits. The only question is how in the world to do it.

Luckily, user locations can be discovered from IP addresses via geolocation services. Since visitor IPs are stored in web server logs, all that's left to do is run over the logs, geolocate the addresses, and aggregate and store the results. Sounds like a job for Integrate.io!

In this post we’ll show how Integrate.io’s data integration on the cloud can process web server logs, extract IP addresses, and discover user geolocations. We will use it to calculate web visitors per country, and then drill down to check the number of visitors per city.

The Data

For this demo, we will use 1.5 GB of public domain ‘Star Wars Kid’ web server logs. Example log lines:

62.61.154.25 - - [14/Sep/2003:14:30:14 -0700] "GET / HTTP/1.1" 200 39101 "http://www.wired.com/news/culture/0,1284,58881,00.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
62.61.154.25 - - [14/Sep/2003:14:30:18 -0700] "GET /archive/cat/image/index.shtml HTTP/1.1" 200 18267 "http://www.waxy.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
209.167.50.22 - - [14/Sep/2003:14:30:21 -0700] "GET /archive/cat/events/index.shtml HTTP/1.1" 200 27686 "-" "LinkWalker"

Log format:

  1. Source IP
  2. User identifier (blank, logged as -)
  3. User ID (blank, logged as -)
  4. Date - in the format of dd/MMM/yyyy:HH:mm:ss Z
  5. HTTP request - type, URL, HTTP version
  6. HTTP code
  7. Bytes transferred
  8. Referrer
  9. User agent
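
As a rough illustration of the format above, here is a minimal Python sketch that splits one log line into its nine fields. The regular expression, the field names, and the parse_line helper are assumptions made for this example; they are not part of Integrate.io, which detects the schema in its Source component.

```python
import re
from datetime import datetime

# Assumed pattern for the nine fields listed above:
# ip ident user [date] "request" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # dd/MMM/yyyy:HH:mm:ss Z maps to this strptime format
    fields["date"] = datetime.strptime(fields["date"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # Some servers log "-" when no bytes were sent; treat that as zero
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

# One of the example log lines shown above
sample = (
    '209.167.50.22 - - [14/Sep/2003:14:30:21 -0700] '
    '"GET /archive/cat/events/index.shtml HTTP/1.1" 200 27686 "-" "LinkWalker"'
)
record = parse_line(sample)
print(record["ip"], record["status"], record["bytes"])
```

Only the ip field is needed for geolocation; the other fields are parsed here just to show how the format breaks down.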

Note that IP-to-location mappings change over time as IP address ranges are obtained and released. These logs date from 2003, so geolocating them today is for demo purposes only.

Visitors per Country

Dataflow

The following dataflow can be created using Integrate.io’s visual editor. It loads the data, determines unique visitors, converts IPs to countries, counts the number of visitors per country, sorts the data, and then stores the results.


  1. Source - loads the data from the relevant S3 bucket/path. Once all the relevant options are set, the circular arrows button at the top right can auto-detect the schema and fill in the field names.

  2. Select - keeps IP addresses and gets rid of the rest of the data.

  3. Distinct - removes duplicate IPs to get unique visitors (this component doesn’t have any options).

  4. Select - converts IP addresses to countries using the CountryNameFromIP(ip) function.

  5. Aggregate - counts the number of unique visitors per country. This is achieved by grouping the data by country, and then counting the number of times each country name appears.

  6. Sort - sorts results by the number of visitors in descending order.

  7. Destination - saves the results back to Amazon S3.
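
For readers who think in code, the Distinct, Select, Aggregate, and Sort steps above can be approximated in a few lines of Python. This is only a sketch of the logic, not how Integrate.io runs it: country_name_from_ip is a hypothetical stand-in for CountryNameFromIP(ip), and the lookup table holds made-up entries for reserved documentation IP addresses.

```python
from collections import Counter

# Hypothetical stand-in for a geolocation lookup such as CountryNameFromIP(ip).
# The entries are made up and use reserved documentation addresses.
DEMO_COUNTRIES = {
    "192.0.2.1": "United States",
    "192.0.2.2": "United States",
    "198.51.100.7": "Canada",
}

def country_name_from_ip(ip):
    """Return the demo country for an IP, or "Unknown" if it is not in the table."""
    return DEMO_COUNTRIES.get(ip, "Unknown")

def visitors_per_country(ips):
    """Count unique visitor IPs per country, most visitors first."""
    unique_ips = set(ips)                                        # Distinct
    countries = (country_name_from_ip(ip) for ip in unique_ips)  # Select
    counts = Counter(countries)                                  # Aggregate
    # Sort by visitor count, descending; ties are broken by country name
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# The first address appears twice, so it counts as a single visitor
log_ips = ["192.0.2.1", "192.0.2.1", "192.0.2.2", "198.51.100.7", "203.0.113.9"]
for country, visitors in visitors_per_country(log_ips):
    print(country, visitors)
```

Removing duplicate IPs before the lookup mirrors the order of the dataflow and means each address is geolocated only once.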

Results

Here’s an interactive map showing the number of visitors per country (see the next section for full results):

  • Total visitors - 1,387,000
  • Countries - 200
  • Top countries - United States (62%, 863,000), Canada (9%, 126,000), United Kingdom (5%, 75,000).

Visitors per City in the UK

Dataflow

Let’s say that we want to dig deeper into the UK market and find out which cities our website visitors come from. We will use a similar dataflow with two additions: a filter that keeps only visitors from the UK, and a step that geolocates the cities they come from.


  1. Source - no changes.

  2. Select - no changes.

  3. Distinct - no changes.

  4. Select - keeps IPs for later use and geolocates the countries.

  5. Filter - keeps only visitors from the United Kingdom.

  6. Select - geolocates cities using the CityNameFromIP(ip) function.

  7. Aggregate - same as before, except that now the ‘city’ field is used.

  8. Sort - sorts the data by the number of visitors in descending order.

  9. Destination - stores results in a different path.
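
The drill-down can be sketched the same way in Python. Again, this is an illustration under assumptions rather than the platform's implementation: country_name_from_ip and city_name_from_ip are hypothetical stand-ins for CountryNameFromIP(ip) and CityNameFromIP(ip), and the lookup table is invented for the example.

```python
from collections import Counter

# Hypothetical lookup table standing in for CountryNameFromIP(ip) and
# CityNameFromIP(ip). The entries are made up and use reserved documentation addresses.
DEMO_LOCATIONS = {
    "192.0.2.10": ("United Kingdom", "London"),
    "192.0.2.11": ("United Kingdom", "London"),
    "192.0.2.12": ("United Kingdom", "Bristol"),
    "198.51.100.20": ("Canada", "Toronto"),
}

def country_name_from_ip(ip):
    return DEMO_LOCATIONS.get(ip, ("Unknown", "Unknown"))[0]

def city_name_from_ip(ip):
    return DEMO_LOCATIONS.get(ip, ("Unknown", "Unknown"))[1]

def visitors_per_city(ips, country):
    """Count unique visitor IPs per city within one country, most visitors first."""
    unique_ips = set(ips)                                          # Distinct
    # Select + Filter: keep each IP only if it geolocates to the requested country
    in_country = [ip for ip in unique_ips if country_name_from_ip(ip) == country]
    counts = Counter(city_name_from_ip(ip) for ip in in_country)  # Select + Aggregate
    # Sort by visitor count, descending; ties are broken by city name
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

uk_ips = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.12", "198.51.100.20"]
print(visitors_per_city(uk_ips, "United Kingdom"))
```

Filtering by country before looking up cities keeps the second lookup limited to the addresses that matter.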

Results

Here are the top UK cities where visitors come from, as well as full results from the previous section:

  • UK visitors - 75,000
  • UK cities - almost 1,090
  • Top UK cities - London (4.51%, 3,380), Bristol (0.94%, 700), Birmingham (0.85%, 640).

Conclusion

User locations can be easily extracted from IP addresses with a bit of help from Integrate.io. Get your free account and start geolocating your data.