Big Data brother is watching: whenever users surf your website, their browser sends an HTTP header called ‘User-Agent’. It tells your web server which browser they’re using, which version, and on which operating system. The web server logs the user agent string, which can later be analyzed to find out, for example, how many users still visit your website with old versions of IE (Microsoft Internet Explorer) and whether you should keep supporting them.

Let’s do just that. We’ll analyze the public-domain ‘Star Wars Kid’ logs: 1.6GB of uncompressed Apache server logs collected between April and November 2003. The user agent string will be parsed automatically to find out which browsers and versions were used. Then we’ll look only at Internet Explorer users and check their statistics. All of this will be done with Integrate.io’s cloud data integration, without writing a single line of code.

Related reading: How to Parse Query String Parameters from URLs in Big Data

The Data

The data consists of standard space-delimited web server logs. Here are a few sample lines:

217.153.121.177 - - [10/Apr/2003:03:17:13 -0700] "GET /archive/cat/games/index.shtml HTTP/1.1" 200 34477 "http://www.google.pl/search?hl=pl&ie=UTF-8&oe=UTF-8&q=loop+shockwave+game+download&lr=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
158.143.122.16 - - [10/Apr/2003:03:17:19 -0700] "GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 26557 "http://www.kottke.org" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
68.41.184.178 - - [10/Apr/2003:03:17:26 -0700] "GET /index.rdf HTTP/1.1" 200 24911 "-" "curl/7.7.2 (powerpc-apple-darwin6.0) libcurl 7.7.2 (OpenSSL 0.9.6e) (ipv6 enabled)"

The schema:

  1. Source IP

  2. User Identifier (blank)

  3. UserID (blank)

  4. Date - in the format of dd/MMM/yyyy:HH:mm:ss Z

  5. HTTP request - type, URL, HTTP version

  6. HTTP code

  7. Bytes transferred

  8. Referrer

  9. User agent
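To make the layout concrete, here is a minimal Python sketch of how one of the sample lines splits when a single space is the delimiter and double quotes are the string qualifier. It is only an illustration, not part of the Integrate.io dataflow; the function name `parse_line` and the dictionary keys are assumptions made for this example. Note that the bracketed date contains a space, so it arrives as two fields, and that the quoted referrer sits just before the user agent.

```python
import csv
from datetime import datetime

# Sample line copied from the logs shown above.
SAMPLE = (
    '158.143.122.16 - - [10/Apr/2003:03:17:19 -0700] '
    '"GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 26557 '
    '"http://www.kottke.org" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"'
)


def parse_line(line):
    """Split one log line on spaces, honoring double-quoted fields.

    Hypothetical helper for illustration; the key names are our own.
    """
    fields = next(csv.reader([line], delimiter=" ", quotechar='"'))
    # The timestamp contains a space, so it is split into two fields:
    # '[10/Apr/2003:03:17:19' and '-0700]'. Rejoin and drop the brackets.
    raw_time = f"{fields[3]} {fields[4]}".strip("[]")
    return {
        "ip": fields[0],
        "timestamp": datetime.strptime(raw_time, "%d/%b/%Y:%H:%M:%S %z"),
        "request": fields[5],
        "status": int(fields[6]),
        # Apache writes '-' when no bytes were sent; treat that as 0 here.
        "bytes": int(fields[7]) if fields[7].isdigit() else 0,
        "referrer": fields[8],
        "user_agent": fields[9],
    }


record = parse_line(SAMPLE)
print(record["timestamp"].isoformat(), record["user_agent"])
```

This sketch assumes well-formed lines; entries whose quoted fields contain escaped quotes would need more careful handling.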

Parsing User Agent Strings in Web Logs

IE Usage

  1. Source - loads the data. It’s publicly available via the Integrate.io demo data connection in the integrate.io.public bucket at weblogs/starwarskid/. Use a single space character as the field delimiter and double quotes as the string qualifier. Fill in the fields automatically by clicking the circular arrows button on the top right.

  2. Select - parses the user agent to return the browser family and major version using the functions BrowserFamily(useragent) and BrowserMajor(useragent). Other user agent parsing functions are also available: BrowserFullName, BrowserMinor, BrowserPatch, BrowserVersion, OsFamily, OsFullName, OsMajor, OsMinor, OsPatch, OsPatchMinor, OsVersion. For further details, please see the Integrate.io functions documentation.

  3. Filter - keeps only the IE log entries, using the text equals operator on the browser family.

  4. Aggregate - groups the data by browser version and counts the entries for each IE version.

  5. Sort - sorts the data by version count in descending order.

  6. Destination - writes the output back to Amazon S3. If the result should be overwritten, make sure to check the overwrite checkbox.

This job can be executed in about 7 minutes using a free sandbox cluster.
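The dataflow itself needs no code. Purely to illustrate what the Select, Filter, Aggregate and Sort steps compute, here is a rough Python sketch. The helper `browser_family_major` is a hypothetical, regex-based stand-in that only recognizes MSIE tokens; it is not how Integrate.io’s BrowserFamily and BrowserMajor functions are implemented, and the family label "IE" is an assumption. The sample user agents are the three from the log lines shown earlier.

```python
import re
from collections import Counter

# Crude pattern for the "MSIE <major>.<minor>" token (assumption: good
# enough for a demo; real parsers also handle browsers such as old Opera
# versions that include an MSIE token in their user agent).
MSIE_RE = re.compile(r"MSIE (\d+)\.(\d+)")


def browser_family_major(user_agent):
    """Hypothetical stand-in: return (family, major) for a user agent."""
    match = MSIE_RE.search(user_agent)
    if match:
        return "IE", match.group(1)
    return "Other", None


def ie_version_counts(user_agents):
    """Count IE major versions, most frequent first."""
    counts = Counter()
    for user_agent in user_agents:
        family, major = browser_family_major(user_agent)  # Select
        if family == "IE":                                # Filter
            counts[major] += 1                            # Aggregate
    return counts.most_common()                           # Sort, descending


# User agents taken from the three sample log lines.
SAMPLE_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
    "curl/7.7.2 (powerpc-apple-darwin6.0) libcurl 7.7.2 "
    "(OpenSSL 0.9.6e) (ipv6 enabled)",
]

print(ie_version_counts(SAMPLE_AGENTS))
```

On the three sample agents this yields a single group for major version 6 with a count of 2; the curl entry is dropped by the filter.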

Looking at the job’s results, there were about 6,000,000 visits with Internet Explorer. The top versions were IE6 with 77%, IE5 with 23%, and IE4 with 0.26%. Since these logs date from 2003, this makes sense and matches public IE usage statistics for that year.

All Browsers and Versions

How does IE usage compare to all browsers and versions? Good question. Let’s make a copy of the previous dataflow, and change it a little:

  1. Source - same as before.

  2. Select - concatenates the browser family and major version and stores the result under the browser alias.

  3. Filter - removed.

  4. Aggregate, Sort - changed to use the browser alias as set in the select component.

  5. Destination - changed the path so as not to overwrite the previous results.
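As a rough illustration of the modified flow, the Python sketch below concatenates family and major version into one browser label, counts each label, and converts the counts into percentage shares. The function name `browser_shares` is hypothetical, and the input pairs are invented toy data, not values from the Star Wars Kid logs.

```python
from collections import Counter


def browser_shares(parsed):
    """Return [(browser, count, percent)] sorted by count, descending.

    `parsed` is an iterable of (family, major) pairs, for example the
    output of a user agent parser. Hypothetical helper for illustration.
    """
    # Select: concatenate family and major version into one label.
    counts = Counter(f"{family}{major or ''}" for family, major in parsed)
    total = sum(counts.values())
    # Aggregate + Sort: most_common() orders labels by count, descending.
    return [
        (browser, n, round(100 * n / total, 1))
        for browser, n in counts.most_common()
    ]


# Invented toy input (NOT taken from the real logs).
TOY = [("IE", "6")] * 5 + [("IE", "5")] * 3 + [("Netscape", "7")] * 2

for browser, count, percent in browser_shares(TOY):
    print(f"{browser}: {count} ({percent}%)")
```

With the toy input above, IE6 accounts for 5 of 10 entries (50%), IE5 for 3 (30%), and Netscape7 for 2 (20%).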

Conclusion

The new results show about 6,500,000 visits in total. The top browsers were IE6 with 71%, IE5 with 21%, and Netscape7 (remember that good ol’ browser?) with 2%. IE6 and IE5 were clearly the dominant browsers, so there was no need at that time to support IE4.

Would you also like to analyze the user agent in your web logs? Get a free Integrate.io account and generate browser statistics now.