Capturing Web Traffic Data — Two Methods That Suck


We’re working with a client who is simultaneously deploying Eloqua for marketing automation while also switching from Urchin 5 to Google Analytics. All three of these tools provide some level of web traffic data. And, right out of the chute, the client was seeing 40% lower traffic being reported by Eloqua than was reported by Urchin 5. That raised questions…as the deployment of concurrent web analytics tools always does! Having put myself through this wringer several times, and having seen it crop up as a recurring theme on the webanalytics Yahoo! group, it seemed worth sharing some material I put together a couple of years ago on the subject.

First off, it is largely a waste of time to try to completely reconcile data from two different web analytics tools. This post really isn’t about that. Mark Twain, Lee Segall, or perhaps someone else coined the saying, “A man with one watch knows what time it is; a man with two watches is never quite sure.” The same is true for web analytics. Thanks to different data capture methods, different data processing algorithms, different data storage schemas, and different definitions, no two tools running concurrently will ever report the same results. The good news, though, is that most tools will show very similar trends. WebTrends preaches, “in web analytics, it’s the trends that matter — that’s why it’s part of our name!” And it’s not just vendor positioning: this view is widely accepted across the web analytics community. Avinash Kaushik had a great post titled Data Quality Sucks, Let’s Just Get Over It way back in 2006, and it still applies. Read more there!

This post, rather, is more about the basics of “log files” versus “page tagging,” which are the two dominant methods of capturing web data. Page tagging has been much more in vogue of late, but it’s got its drawbacks. In the case of our client, their Urchin 5 implementation is log-based, while Google Analytics and Eloqua are tag-based. And, not surprisingly, Google Analytics and Eloqua are providing traffic data that is fairly similar. But, even when two tools use the same basic data capture method, there is no guarantee that they will present identical results.

The following diagram tells the basic story of how the two methods differ (click on the image to see a larger version):

Web Data Capture Methods

“But, wait!” you exclaim! “How come both of these have ‘log file processed’ in them? I thought one method was log file-based and the other was not!” <sigh> As it turns out, both methods are, in the end, parsing log files. With page tag solutions, the log file being parsed/processed is the page tag server’s log file. In theory, your main web server(s) could be the page tag server…but then the tool would be stuck having to sift through a lot more clutter to get to the page tag-generated requests.

I’m getting ahead of myself, but go ahead and file that little bit of information as a handy cocktail party conversation…um…killer (unless the cocktail party is a Web Analytics Wednesday event — it’s all about your target audience, isn’t it?).

In a log file-based solution — the left diagram above — the “hit” is recorded as soon as the user’s browser manages to get a request for a page to your web server. It doesn’t matter if the page is successfully delivered and rendered on the user’s machine. This is good and bad, as we’ll cover in a bit.

In a page tag-based solution — the right diagram above — the “hit” is recorded much, much later in the process. The user’s browser requests the page, the page gets downloaded to the browser, the browser renders the page and, as part of that rendering, executes a bit of Javascript. The Javascript usually picks up some additional information beyond the basic stuff that is recorded in a standard web request (such as screen resolution, maybe some meta tag values from the page, and so on). It then tacks all of that supplemental information onto the end of an image request to the page tag server. The page tag server log file, then, only has those image requests, but it has some really rich information included in them.
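To make that concrete, here is a minimal sketch of what a page tag’s Javascript does (the function, field names, and tag-server URL are all illustrative; this is not any vendor’s actual tag):

```javascript
// Gather supplemental data and encode it onto an image request.
// In a real browser these values would come from location, document,
// and screen; they are hard-coded here so the sketch is self-contained.
function buildTagUrl(info) {
  var params = [];
  for (var key in info) {
    params.push(encodeURIComponent(key) + "=" + encodeURIComponent(info[key]));
  }
  // Requesting this "image" is what records the hit: the tag server
  // returns a 1x1 transparent pixel and logs the query string.
  return "https://tags.example.com/pixel.gif?" + params.join("&");
}

var url = buildTagUrl({
  page: "/products/widgets",                      // location.pathname
  title: "Widgets | Example Co",                  // document.title
  res: "1280x1024",                               // screen width x height
  ref: "https://www.google.com/search?q=widgets"  // document.referrer
});
console.log(url);
```

In the browser, the tag would finish by setting that URL as the `src` of an `Image` object, which is what actually fires the request to the tag server.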

Got all that? Well, there are obvious pros and cons to both approaches.

Log File-Based Tools Pros and Cons

The good things about a log file-based approach:

  • They (more) accurately reflect the actual load on your web servers — your IT department probably cares about this a lot more than your Marketing department does
  • They capture data very early in the process — as soon as you could possibly know someone is trying to view a page, they record it

But, it’s not all sweetness and light. There are some cons to log files that are nontrivial:

  • They miss hits to cached pages (by browser, by proxy) — this can make for some rather nonsensical clickstreams
  • They are limited to data captured in the Web server log file — this can be a fairly severe limitation if, for instance, you have rich meta data in the content of your pages and you want to use that meta data to group your content for analysis
  • They capture a lot of useless data — I just went to the Microsoft home page, and watched 65 discrete requests hit their web servers to render the page (images, stylesheets, Javascript include files, etc.); this is fairly typical, and means you wind up pre-processing the log file to strip out all of the crud that you don’t really care about
  • It is difficult for them to filter out spiders/bots — there is a “long tail” of spiders crawling the web, so this is not simply a matter of knocking out Google’s bot, Yahoo’s bot, and Baidu’s bot; there is an unmanageable, constantly changing list of known bots and spiders…and many bots mask themselves, which is extremely difficult to detect (this was actually the far-and-away biggest culprit with the client who spawned this post)
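As a rough sketch of that pre-processing, here is what stripping out the crud and the self-identifying bots might look like (the regex, the bot list, and the sample lines are all illustrative; real pre-processors are far more thorough, and bots that mask themselves will still slip through):

```javascript
// Parse Apache combined-format log lines, then drop static-asset
// requests and the handful of bots that identify themselves.
var LOG_LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "([^"]*)" "([^"]*)"$/;
var ASSET = /\.(gif|jpe?g|png|css|js|ico)(\?|$)/i;
var KNOWN_BOTS = /googlebot|slurp|baiduspider|bingbot/i;

function keepHit(line) {
  var m = LOG_LINE.exec(line);
  if (!m) return null;                              // malformed line
  var hit = { ip: m[1], time: m[2], method: m[3], path: m[4],
              status: m[5], referrer: m[6], userAgent: m[7] };
  if (ASSET.test(hit.path)) return null;            // images, CSS, JS includes
  if (KNOWN_BOTS.test(hit.userAgent)) return null;  // self-identifying bots
  return hit;
}

var lines = [
  '1.2.3.4 - - [10/Oct/2008:13:55:36 -0400] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
  '1.2.3.4 - - [10/Oct/2008:13:55:37 -0400] "GET /logo.gif HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
  '5.6.7.8 - - [10/Oct/2008:13:55:38 -0400] "GET /index.html HTTP/1.1" 200 2326 "-" "Googlebot/2.1"'
];
var pageViews = lines.map(keepHit).filter(Boolean);
console.log(pageViews.length); // prints 1: only the first line survives
```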

Page Tag-Based Tools Pros and Cons

Alas! Although page tags address the bigger negatives of log file-based solutions, they have their own downsides. But, let’s start with the positives:

  • Because they are Javascript-based, they are able to capture lots of juicy supplemental data about the visitor and the content
  • Most (not all, mind you) spiders/bots do not execute Javascript, so they are automatically omitted from the data
  • The Javascript “forces” the page tag to fire…even on cached pages

There are some downsides, though:

  • They require the page tag Javascript to be deployed on every page you want tracked — even if you have a centrally managed footer that gets deployed to all pages…chances are there are still some important corner case pages where this is not the case; and, even if that is not the case now, it could happen in the future; we had a pretty robust system that was undermined when the design of a key landing page was completely overhauled…and the page tag was nuked in the process
  • They do not record a hit to the page until the page has been at least partly delivered to the client — if you have visitors that bounce off of your site very quickly, you may never see that they hit the site at all
  • If Javascript is disabled by the client, then you have to put in some sort of clunky workaround to capture the traffic…and what you capture will not be nearly as rich as what you capture for visitors who have Javascript enabled
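That workaround is typically a `<noscript>` image beacon: a hard-coded request to the same tag server, minus everything the Javascript would have gathered (the URL here is illustrative):

```html
<noscript>
  <!-- With Javascript disabled, the browser still requests this image,
       so the tag server at least records a bare page view, but with
       none of the supplemental data the Javascript would have added. -->
  <img src="https://tags.example.com/pixel.gif?page=%2Fproducts%2Fwidgets"
       width="1" height="1" alt="" />
</noscript>
```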

So, What’s the Answer?

The obvious answer may seem to be to employ both approaches in a hybrid system. And that is the answer…if you or your management is so aggressively compulsive that you are willing to commit major time and resources to pull it off (and, very likely, fail and create confusion in the process).

Let’s toss the obvious answer out then, shall we?

The answer is simpler, actually:

  • Understand the pros/cons of both approaches
  • Be clear on what your objectives are — what do you care about?
  • Determine which approach will more effectively help you meet your objectives and go with that

Now, if you are a Marketer, there’s a pretty good chance that you’ll wind up settling on a page tag-based solution. If that’s the case, then it might still make sense to figure out where your log files are and to do a little snooping around in them. I’ve found log files to be very handy when the page tags throw some sort of anomaly. If you can narrow down the anomaly, the log file can be a good way to get to the bottom of what is going on. Page tags…with log files to supplement. Does that sound like a tasty recipe or what?




  2. Hey Tim,
    Great post, I liked it! However, the diagram showing the flow of traffic in the page tag-based approach is not fully visible.


  3. Yikes! The danger of writing/reviewing on a 1280×1024 display! I’ve resized the image to make it all fit, as well as made it a hyperlink to a larger version of the image. Thanks!


  5. Great article, Tim. Have you run across issues with getting Google’s reporting (both PPC and Analytics) to work properly on a site that uses Eloqua? I recently discovered that conversions had dropped to zero because Eloqua was stripping out the gclid value when the ppc visitor submitted an inquiry form – so the google snippet wasn’t picking up on that conversion event. Managed to fix that by having Eloqua pass the gclid value in a hidden form field; conversions show up in Google PPC console, but now Google Analytics isn’t seeing ANY activity from paid search. It’s making me tear my hair out! Any ideas/thoughts/soothing words?

  6. @Matt Yikes. That’s a tough one. I can’t say I’ve run into that situation, but I can’t say I’m surprised that someone did!

    Am I understanding correctly that the problem occurs once a visitor lands on an Eloqua-managed landing page/hypersite? Everything works fine when you’re on other pages on the site that simply have both the Eloqua and the GA tag, right?

    If that’s the case, one kinda’ ugly option might be to shift the forms off of an Eloqua-managed page and then have the form POST to Eloqua (you still set up the form in Eloqua, it just doesn’t handle the visitor-facing page).

    That’s not exactly soothing…



