Desperately Seeking Web Log File Standards
By Cory Kleinschmidt, 11/22/2002
As any webmaster or search engine marketer knows, you can't measure the success
of your web site or of your online marketing campaigns without knowing your
site statistics. And the only way to know your stats is to dig deep into the
bowels of your server's log files. But once you get in, you might not make it back!
Every page viewed on your site, every visit to your site, every referring URL,
and hundreds of other bits of information are stored in these labyrinthine text
files that can grow to be hundreds of megabytes in size. It's nearly impossible
to decipher these log files yourself, which is why software companies have created
versatile -- and pricey -- programs to extract the useful nuggets of information
contained therein.
Here's a basic definition of log files (in this case, what the specification calls
extended log files), courtesy of the World Wide Web Consortium (W3C), the Internet
standards group:
"An extended log file contains a sequence of lines containing ASCII
characters terminated by either the sequence LF or CRLF. Log file generators
should follow the line termination convention for the platform on which they
are executed. Analyzers should accept either form. Each line may contain either
a directive or an entry.
Entries consist of a sequence of fields relating to a single HTTP transaction.
Fields are separated by whitespace, the use of tab characters for this purpose
is encouraged. If a field is unused in a particular entry dash "-"
marks the omitted field."
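To make that definition concrete, here is a minimal Python sketch of a reader for this kind of file. It assumes two things that come from the W3C draft itself rather than from the passage quoted above: directive lines begin with "#", and a "#Fields:" directive names the columns of the entries that follow. The function name and the sample data are hypothetical, and the sketch ignores quoted fields that contain spaces, which a real parser would have to handle.

```python
def parse_extended_log(text):
    """Split extended-log-style text into directives and entries."""
    directives = {}
    fields = []
    entries = []
    # splitlines() accepts both LF and CRLF line endings.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Fields: time cs-method cs-uri".
            name, _, value = line[1:].partition(":")
            directives[name.strip()] = value.strip()
            if name.strip() == "Fields":
                fields = value.split()
            continue
        # Entry line: whitespace-separated values, "-" marks an omitted field.
        values = line.split()
        entries.append(
            {f: (None if v == "-" else v) for f, v in zip(fields, values)}
        )
    return directives, entries


# Hypothetical sample with mixed CRLF and LF line endings.
SAMPLE = (
    "#Version: 1.0\r\n"
    "#Fields: time cs-method cs-uri sc-status cs(Referer)\r\n"
    "00:34:23 GET /index.html 200 -\n"
    "00:35:01 GET /logo.gif 200 http://www.example.com/\n"
)

directives, entries = parse_extended_log(SAMPLE)
print(directives["Version"])   # 1.0
print(entries[0])              # the omitted referrer comes back as None
```

Even this toy reader has to make choices the quoted definition leaves open, which hints at why two analyzers can read the same file differently.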
WebTrends is the industry-standard log file analysis software, and is used by
thousands of companies around the world. Some hosting companies offer WebTrends
reports to every site hosted on their servers free of charge, or perhaps for a
small fee. There are also companies such as HitBox that offer free ASP --
application service provider -- hosted services that track your site stats in
exchange for placing ads on your site.
Any hosting company worth its salt will give you access to your raw log files,
which you can download and then analyze on your own using software from vendors
like 123LogAnalyzer (my current favorite), SurfStats, and Sawmill.
These log file analyzers are usually good at giving you a general picture of
the overall health of your site, but if you analyze your log files with more
than one log analysis tool, you'll see that the world of site stats is a
murky one filled with competing standards, conflicting definitions of basic
terminology, and few easy methods of understanding what the numbers mean.
One log file tool may report 100,000 page views for your site in a month's
time, and another may report just 80,000. I talked with several of these vendors,
and they all gave a litany of possible reasons for the discrepancy in page views:
some log file software counts failed pages as page views, some have different
definitions for what constitutes a page view or user session, or maybe the other
program's parser -- which is the part of the software that scans the log file
entries -- isn't up to snuff. None of the companies, however, would admit that
their own software could do a better job of reporting numbers.
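A small Python sketch shows how one of those reasons plays out. It counts page views for the same handful of requests under two definitions, one that includes failed pages and one that does not. The list of "page" file extensions, the status-code cutoff, and the sample requests are all assumptions made for illustration, not any vendor's actual rules.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption: these extensions (or no extension at all) mean "a page".
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php", ".asp"}


def is_page(uri):
    """Return True if the requested URI looks like a page, not an image or stylesheet."""
    path = urlsplit(uri).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in PAGE_EXTENSIONS


def count_page_views(requests, include_failures):
    """Count page requests, optionally counting failed ones (status 400 and up)."""
    total = 0
    for uri, status in requests:
        if not is_page(uri):
            continue
        if include_failures or status < 400:
            total += 1
    return total


# Hypothetical (URI, HTTP status) pairs taken from a log.
requests = [
    ("/index.html", 200),
    ("/logo.gif", 200),
    ("/products.html", 200),
    ("/missing.html", 404),
    ("/style.css", 200),
    ("/contact.asp?id=3", 500),
]

strict = count_page_views(requests, include_failures=False)
loose = count_page_views(requests, include_failures=True)
print(strict, loose)  # 2 4
```

Same log, same six requests, and the page view total doubles depending on a single rule.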
So, why does it seem so impossible for different log file programs to report
numbers consistently? Here are some of the reasons:
1. There is no standard log file format.
Log file formats come in many different flavors. There are several formats
used by Microsoft's Internet Information Server (IIS); there's one for the
free web server called Apache; and there are different formats for proxy servers
(which act as Internet access gateways for networks).
2. There is no standard method for interpreting and parsing log files.
Many log file analyzers (OpenWebScope, for example) report "hits," a largely
useless count of every file requested, as though they were the more useful
"page views." There are also different definitions of what constitutes
a visitor session. Some programs say that if a visitor to your site is inactive
for 10 minutes or more and then comes back, that visitor counts as a new visitor.
Obviously users shouldn't be counted twice, but when you deal with dynamic IP
addresses, it's hard to know whether the user at IP address 64.217.243.22 is the
same person who was there 10 minutes ago.
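The session rule described above can be sketched in a few lines of Python. This is an illustration only: it identifies visitors purely by IP address (exactly the weakness just mentioned), and the function name, timestamps, and second IP address are hypothetical.

```python
from datetime import datetime, timedelta


def count_sessions(requests, timeout=timedelta(minutes=10)):
    """Count sessions, starting a new one after `timeout` of inactivity per IP."""
    last_seen = {}
    sessions = 0
    # Process requests in time order so gaps are measured correctly.
    for ip, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous >= timeout:
            sessions += 1
        last_seen[ip] = when
    return sessions


# Hypothetical (IP address, timestamp) pairs taken from a log.
requests = [
    ("64.217.243.22", datetime(2002, 11, 22, 9, 0)),
    ("64.217.243.22", datetime(2002, 11, 22, 9, 4)),
    ("192.0.2.10", datetime(2002, 11, 22, 9, 5)),
    ("64.217.243.22", datetime(2002, 11, 22, 9, 20)),
]

print(count_sessions(requests))                                 # 3
print(count_sessions(requests, timeout=timedelta(minutes=30)))  # 2
```

Changing nothing but the timeout changes the session count, so two tools with different defaults will disagree on identical data.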
3. There is no standard way to track and measure success.
Unless you contract with a specialized software company to track banner ad campaigns,
PPC campaigns, and other sales promotions, you will have a difficult time calculating
ROI, tracking referrers, and so on. Log files do record the pass-through URL
parameters appended to links, but the data is so generic as to be almost useless.
Someone really needs to create a program for small to midsize businesses that
will not only analyze log files but will also provide data mining, ROI tracking,
etc. To be sure, there are programs like this, but they are usually geared toward
the enterprise or corporate market.
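For sites without such a program, the pass-through URL parameters that log files record can at least be tallied by hand. The Python sketch below counts landing-page requests by the value of a tracking parameter. The parameter name "source" and the sample URIs are hypothetical; this is a rough starting point under those assumptions, not a substitute for real ROI tracking.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def tally_campaigns(uris, param="source"):
    """Count requests per value of a tracking parameter in the query string."""
    counts = Counter()
    for uri in uris:
        query = parse_qs(urlsplit(uri).query)
        # A URI without the parameter contributes nothing.
        for value in query.get(param, []):
            counts[value] += 1
    return counts


# Hypothetical request URIs taken from a log.
uris = [
    "/landing.html?source=banner1",
    "/landing.html?source=ppc",
    "/landing.html?source=banner1&id=7",
    "/index.html",
]

for campaign, visits in tally_campaigns(uris).most_common():
    print(campaign, visits)
```

A tally like this says how many requests each campaign link produced, but nothing about what those visitors did next, which is where the dedicated tracking tools earn their price.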
So, until web server vendors and log file software companies get together and
agree on a set of standards for recording, interpreting, and tracking web site
traffic, webmasters and search engine marketers will have to deal with inconsistent
tools and rely more on guesswork than hard numbers to determine the health and
success of a web site.
If you do analyze your server logs with two different log file analyzers and
get different numbers, which one do you trust? That's a question for you to
struggle with, but I usually go with the one with the higher numbers! If you
have the time and money and want to get fancy, you could even analyze
your logs with a third program and then average all three totals to get a more
balanced picture.
For more information, visit the following links:
LogFile Lowdown - Hotwired
W3C Proposal: Shop Log File Format
Cory Kleinschmidt is the webmaster of Traffick.com.