File Under: Backend

Gather User Data From Server Logs

Hey, you! What are you doing? Where are you going? More importantly, what are you clicking on?

If only it were that easy. But no, most users like to travel the web incognito. They come to your site, poke around a few files, download a PDF or two, and then — poof — disappear, leaving nothing but questions in their wake: Where did they come from? Which browsers are they using? Are they experiencing any errors?

The most thorough method of tracking users is by planting cookies, which some folks consider rude or invasive, and, oh yeah, you need to know how to program them. Not to worry — there is an option that requires very little technical know-how, comes at no (or nominal) cost, and may already be a part of your site’s backend. I’m talking about logs!


Contents

  1. Logfile Lowdown
  2. The Prizes Inside
  3. A Sample Log File
    1. Different Ways of Looking at It
  4. List of Popular Logfile Analysis Tools
  5. Tips

Logfile Lowdown

Almost every web server worth its salt has some sort of system that stores information about which pages, images, and files are requested, who requests them, and how many bytes are transferred. All of this information is dumped into a log file that is stored in a specific location on your server.

These log files are yours to explore. You can simply open the log file in an ordinary text editor and read the raw data. Or, for a more user-friendly view of the info, suck the log file into a nifty stand-alone software package or browser-based viewer, which parses the data and spits it out as charts, graphs, or tables that clearly illustrate your users’ activities.

Not sure why this information is valuable? Well, if you’ve invested time and money in a website, one of your biggest points of interest is indubitably traffic — whether people are exposed to advertisements or your products, traffic is directly proportional to revenue. But there’s more to traffic than just eyes on pages. Sure, the numbers you get from your log files will tell you how many people visited your site in any given space of time, but traffic data can also be studied to give you a clear, precise idea of what kinds of viewing practices your users exhibit.

Let’s say a user comes to your site and views a few pages. In server-speak, user actions are counted in requests. Any time the user is served an image, an HTML file, or an ad, it counts as a request. If 17 HTTP requests are served in one session, how many of those 17 requests are images? How many are ads? How many turned up as (eek!) 404 “Not Found” errors? These are the types of questions that can be answered by picking over your log files and generating in-depth reports.

The trick is to learn as much as possible about what is being served to your users. Vitals like location, browser version, and time spent on your site allow you to tailor your content and presentation design specifically to please the users that you’re doing business with.

So what, exactly, do you look for? Let’s take a closer look.


The Prizes Inside

There are a number of areas where the data housed in your logfiles can help you understand and cater to your users:

Traffic

What’s your traffic like to any given page? Are there certain pages that stand out as high-traffic areas? Pages that corral more viewers are hot in terms of real estate – ad space on those high-traffic pages should cost more, right? And what is the overall volume like on your site? Do you see traffic jump when you publish exciting, new content, or does it stay relatively flat throughout your publishing schedule? Do you get twice as much traffic on Fridays as on Mondays? Thorough traffic reporting will present the answers to these questions if you take the time to seek them out.
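
If you want to answer that Fridays-versus-Mondays question yourself, the timestamps your server writes into its logs (more on the exact format below) are all you need. Here’s a rough Python sketch; the filename “access.log” is just a placeholder for wherever your log actually lives:

 # Tally requests by day of the week, using the timestamp found in
 # Common Log Format lines, e.g. [09/May/2001:13:42:07 -0700].
 # "access.log" is a placeholder filename.
 from collections import Counter
 from datetime import datetime

 by_day = Counter()
 with open("access.log") as log:
     for line in log:
         if "[" not in line:
             continue
         stamp = line.split("[", 1)[1].split("]", 1)[0]   # 09/May/2001:13:42:07 -0700
         when = datetime.strptime(stamp.split()[0], "%d/%b/%Y:%H:%M:%S")
         by_day[when.strftime("%A")] += 1                 # Monday, Tuesday, ...

 for day, hits in by_day.most_common():
     print(day, hits)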

Audience

Who’s visiting your site? Are most of your users from the United States or Japan? You can look at the IP addresses of your visitors and determine where they are geographically. You can also find out where your visitors are coming from demographically. Are you being visited by AOL users, university students, or workers at defense contracting firms? A site in Mexico that sees heavy traffic from American university students should be certain that its English translation service is doing its job — the site is also especially ripe for ads pushing college spring break travel packages.
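
If your log records bare IP addresses rather than hostnames, you can do the reverse lookup yourself and eyeball the resulting domains (aol.com, .edu addresses, and so on). Here’s a minimal sketch using Python’s standard socket module; the address below is just a documentation example, not a real visitor:

 import socket

 ip = "198.51.100.23"                           # example address, not from a real log
 try:
     hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
     print(ip, "resolves to", hostname)
 except socket.herror:
     print(ip, "has no reverse DNS entry")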

Browsers/Platforms

Are your users primarily Macintosh users? Linux users? Since your site probably varies in presentation between OS X and Ubuntu, you can use the reports about platform specifics to round out your site testing and quality assurance practices. And as any savvy developer knows, the differences between how a page looks in IE and Firefox or Konqueror can be astounding. Are you using gobs of IE-specific CSS tricks on pages that are primarily being viewed by Mac users? For your sake and theirs, I hope not.

Browser plug-ins are fun only when they work, so if you have any content that’s “plug-in required,” you should be sure that the majority of your users are running a platform for which the needed plugins are available.

Errors

What kinds of errors are your log files reporting? Are any links on your site handing out those pesky 404s? Better check those links, then. Are your redirects working or are they pointing your users out into the ether? Are any of your scripts loading incorrectly? Even if everything runs ship-shape on your workstation, a report that shows faulty scripts might lead you to test them on different browsers or from behind a firewall. Are users ditching an image before it fully loads? That’s a cause for concern – look into it. The image may have an error, or may simply need to be optimized.

Referers

A referer indicates where a user was referred from, whether it be an advertisement, a link somewhere else on your site, or a link on someone else’s site. You can use your referer data to see what kind of traffic you’re getting out of a plug on a message board, an ad, or even a mention on Slashdot or Digg.

Getting at the Info

So, how do you get your paws on all this valuable data? If you’re hip to the Unix Guide, you can use grep and sort commands to extract data from raw log files. Or just FTP down a logfile and open it up in your favorite text editor.
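
If you’d rather not memorize grep flags, a few lines of Python do the same kind of digging. A minimal sketch, assuming a Common Log Format file named “access.log” (a placeholder name) and hunting for 404s as an example:

 # A rough stand-in for `grep " 404 " access.log | sort | uniq -c`.
 from collections import Counter

 misses = Counter()
 with open("access.log") as log:
     for line in log:
         if '" 404 ' in line:                      # the status code follows the quoted request
             path = line.split('"')[1].split()[1]  # "GET /path HTTP/1.1" -> /path
             misses[path] += 1

 for path, hits in misses.most_common(10):
     print(hits, path)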

A Sample Log File

Next time you fire up your FTP client or log in to your web server, take a moment and dig around for your log files. On most web servers, you will find a directory — usually in your root directory, or the parent directory just above it — named “logs” or “stats”. Inside, you will most likely see a file with a .log, .web, or .clf extension. Since Web logs are essentially text files, some will even have a .txt file extension. Download the log file, save it to a local drive, and have a look.

Most servers generate CLF (Common Log Format) files, but they also come in other flavors, like ELF (Extended Log Format) and DLF (Combined Log Format). Some servers produce files with different extensions in different formats, but most of the log file types out there are formatted much like CLF files. For this reason, we’ll use the structure of a CLF file for our example.

In Common Log Format files, each line represents one request. So if a user comes to your site and is served a page with three images, it shows up as four lines of text in your CLF file – one request each for the three images and one request for the HTML file itself.

CLF files are standardized, so they almost always look the same. A normal CLF file logs the data in this format:

 user's computer  ident  userID  [date and time]  "requested file"  status  filesize

The fields are separated by spaces. Some fields, such as the date and request information, are set off with punctuation. If a field doesn’t apply to the request being logged, the server puts a hyphen in its place. Let’s look at these fields one by one; a quick parsing sketch follows the list.

  • The remote host information shows the IP address and, in some cases, the domain name of the client computer requesting the file.
  • The ident information is logged if your server is running IdentityCheck, an antiquated directive that was once used for thorough server logging. It was phased out of general use because it required the identification process to run every time a file was served. Because this process can sometimes take 5 or 10 seconds, most sites turn IdentityCheck off so that their pages load more quickly.
  • If your site requires a password upon login, the userID that the user entered is logged in this field. If you don’t have any user login features on your site, this field is no big deal.
  • The date field is straightforward – the date and time of the request are logged here.
  • The request field logs the type of request made by the user, as well as the path and name of the requested file.
  • The status field contains a three-digit code that tells you if the file was transferred successfully or not. These codes are standard HTTP codes.
  • The filesize field is also straightforward – it lists the number of bytes transferred when the requested file was served.
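
To make that field-by-field tour concrete, here’s a minimal Python sketch that splits a single CLF line into those seven pieces. The regular expression and the sample line are my own illustration; your server’s output will differ in the details:

 import re

 # One made-up CLF line, purely for illustration.
 line = '203.0.113.7 - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741'

 clf = re.compile(
     r'(?P<host>\S+) (?P<ident>\S+) (?P<userid>\S+) '
     r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
     r'(?P<status>\d{3}) (?P<filesize>\S+)'        # filesize shows "-" when nothing was sent
 )

 match = clf.match(line)
 if match:
     for field, value in match.groupdict().items():
         print(field, "=", value)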

For the following example, I’ve extracted one line from a log file that records the activity on my own personal website, snackfight.com. My hosting company serves my site using Apache, and they’ve tweaked a few options to provide me with more comprehensive data. (Apache’s mod_log_config module allows you to customize the string that’s fed into the logs.) I’ve divided this logged request into its separate parts for clarity – normally, all of this data would be dumped onto one single line in the log file.


 adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700]
 "GET /about.htm HTTP/1.1" 200 3741
 "http://www.e-angelica.com"
 "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"

The first part of the request shows the user’s local domain. I can see that this is a DSL subscriber on the BellSouth network. The two hyphens that follow are where the IdentityCheck and userID information would normally show up, but since my site does not utilize either of these processes, I get nothing but hyphens. Next, in brackets, is the date, then the time (in 24-hour format), followed by the time zone offset.

The request field, displayed within quotes, shows that the user asked the server to GET a page. Other request types are POST, DELETE, and HEAD, though you don’t see those nearly as often. Following the request type is the path and name of the file. In this case, the user was requesting the “about.htm” file in the root directory of snackfight.com. Also, you can see that the protocol used here was the good old Hypertext Transfer Protocol, version 1.1.

The status field shows a status code of 200, meaning that everything went through just peachy. A status code of 404, as you may know, means that the file was not found on the server. Immediately following the status code is the file size of “about.htm”. It’s 3,741 bytes. Hey, not bad! I’ll bet it loaded nice and quick.

Referers

The next two fields are especially interesting. These are custom fields that my hosting company has added to its logging so that I can get a better idea of who’s visiting my site. The first field, in quotes, is the referer field. This is where my user clicked on a link in order to arrive at the page he was just served. I can see that this particular user is a fan of the e-angelica site, because that’s where he came from to arrive at my site. In some cases, referers are logged in their own log file. These referer logs usually use the same format and can also be viewed or run through an analyzer.

The last field, also in quotes, shows some information about the user’s browser and platform, in this case, Internet Explorer 5.0 on a Windows 98 machine. Oh, upgrade already!
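
If your host logs those two extra fields (the combined format shown above), tallying them takes only a few lines. A rough sketch, again with “access.log” standing in for your real log file; it leans on the referer and user agent being the last two quoted fields on each line:

 # Count referers and browsers from a combined-format log.
 from collections import Counter

 referers, agents = Counter(), Counter()
 with open("access.log") as log:
     for line in log:
         quoted = line.split('"')        # odd-numbered chunks are the quoted fields
         if len(quoted) >= 7:            # request, referer, and user agent all present
             referers[quoted[3]] += 1    # second quoted field: the referer
             agents[quoted[5]] += 1      # third quoted field: the user agent

 print("Top referers:", referers.most_common(5))
 print("Top browsers:", agents.most_common(5))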

And that’s about it. It’s a lot of information, I know, and your log file may store even more goodies. (An in-depth explanation of the syntax used in log files can be found in the massive spec for HTTP 1.1, which is also useful as a reference when looking up header fields and server status codes.)

All this data is a little overwhelming, no? Especially in its raw state. If you’re not exactly thrilled about the idea of picking through thousands of lines of text and status codes to determine whether or not your users are being served in the most efficient manner, there are several software packages on the market that you can use to generate reports without getting your hands dirty (and without opening your text editor). But which one is right for you? Well, that all depends on what you’re looking for.

Different Ways of Looking at It

Picking out a log file analyzer package is a bit like stocking up for a party. If you’re only inviting a few close friends, maybe you can get away with a case of soda and a few bags of chips. But what happens if everyone you know unexpectedly invites seven people? All of a sudden, that case of soda isn’t good enough anymore. There’s a lesson here – be a good scout and Be Prepared!

For solid, intelligent reporting, it’s crucial to pick out a software application that covers everything under the sun – even for a small site with moderate traffic. Sooner or later, you’ll want to generate a strange or unique report, and having the tools at hand to do so is essential. The extra functionality that comes with the larger, robust applications allows you to generate just about any kind of log file report possible.

If your server produces any of the usual file types (again, CLF, DLF, ELF), you’re in the green. However, there are plenty of proprietary file formats out there, in which case you will need to check and see if your desired software package can understand the flavor of log file you’ll be analyzing.

Most log file analyzers run locally on your computer. Some of the more forward-thinking software companies, however, now offer hosted log file report generators. These options are lightweight yet powerful and they can be accessed from any computer with web access, whether or not it’s directly connected to your server. These hosted solutions are often less costly than a large software application.


All About the Washingtons

There are a plethora of log file analyzer applications available as freeware or shareware. The open source software movement has made many of these indispensable backend utilities available for free, though finding technical support for some of them can be challenging. Most have excellent online documentation, but lack a telephone- or e-mail-based support structure. If you don’t feel comfortable walking the tightrope without a safety net but you’re on a budget, there are several software companies that provide tiered pricing on their products – you only buy the level of functionality that you require.

If all you’re interested in is hits, you can grab a handy free counter like Site Meter. If you’re running a Microsoft Internet Information Server (IIS), Microsoft’s Site Server application has extensive logging and analyzing capabilities. Also, if your site is hosted, the good Web hosting providers will offer browser-based log file reports as part of (or as an add-on to) their basic service. Graphs, charts, numbers in a row – all a few clicks away.

Once you’ve managed to clearly define your needs and limitations, you’re ready to go to market. There are many log file analyzers to choose from, and you may have to do some research on your own to select the one that’s Cinderella-slipper-perfect for your needs.


List of Popular Logfile Analysis Tools

  • Google Analytics is a free web service offered by Google. You put a little script in your web page code, then log in to Google Analytics from any computer to access the free tools. There’s reporting, graphs, ways to parse raw data, and a set of tools for those using Google’s AdWords program.
  • WebTrends is one of the industry leaders for log file reporting applications on the enterprise and small-to-medium business level. WebTrends’ various packages are commercial products available at different price levels.
  • Sawmill is another commercial log file analyzer. Sawmill is not as feature-rich as WebTrends, and it may not look as pretty, but it certainly gets the job done. Sawmill’s interface is entirely browser-based. Pricing for Sawmill Lite, which is good enough for smaller websites, is $100.
  • Analog is the program that claims to be the “most popular log file analyzer in the world.” It’s free software. A complete user manual is available on the web, plus there’s a user e-mail list that you can turn to if you get stuck. This is helpful, as the learning curve is a little steep for the inexperienced user. For more stylish-looking reports, you can download a free add-on called Report Magic, which gives Analog the pretty user interface you’d expect from a program that costs a whole lot more.
  • Webalizer is a free application that generates highly detailed, easy-to-read reports in HTML (check out the graphing capabilities in these sample reports). It also runs on a host of operating systems and speaks multiple languages.
  • HTTP-analyze is a highly configurable application that uses a frames-based browser interface, making it easy to navigate through your log reports. Also, HTTP-analyze is a Unix-based program, meaning that its operating system support is more limited than that of the other tools.
  • Tynt Tracer is a free web analytics tool that allows you to determine what is being copied from your site.

Tips

At this juncture, I encourage you to go explore your log files. You’ve got enough under your belt to start analyzing your site activity. But before you go dive into your logs, I have a few hard-won tips and hints to pass along.

  • One point that I can’t stress enough is the importance of long-term logging. If your web server is configured to erase your old log files every month, either change the server’s configuration or save copies of your log files locally. It’s very insightful to see the differences in site traffic over a four-year period. For example, by looking at the user agent information over time, you can see how quickly and how often your users upgrade their browsers or operating systems to the latest versions.
  • When you’re looking at your log files, either in raw form or in an analyzer, you’ll probably notice a file called “robots.txt” in your root directory that’s getting a whole bunch of hits. Don’t worry, that’s not a mistake — it only means that a search engine robot was crawling your site. Search engines send out their robots, also called spiders or crawlers, every now and then to crawl the Web and see what’s out there. If you include a robots.txt file in your root directory, you can give specific instructions to a robot: tell it to go away, or point it to the information that you would like to make searchable. For more information on how the robots.txt file works, visit the Web Robots pages.
  • And here’s a handy trick to take with you. Did you do your good Monkey deed and create a favorites icon for your site? If so, you can find out how many people are actually seeing your icon simply by running a report that counts hits on your “favicon.ico” file, or by rolling your own count, as in the sketch below.
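
Here’s what that favicon report can look like as a do-it-yourself one-off, with “access.log” once again standing in for your real log file:

 # Count how many requests asked for the favorites icon.
 # "access.log" is a placeholder for your real log file.
 with open("access.log") as log:
     hits = sum(1 for line in log if "favicon.ico" in line)

 print("favicon.ico was requested", hits, "times")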