Mining your server log files

While top website analytics packages offer pretty much anything you might need to find actionable data to improve your site, there are situations where we need to dig deeper to identify vital information.

One such situation came to light in a post by randfish of Seomoz.org. He writes about a problem common to most enterprise-size websites: they have many pages with few or no incoming links, and far fewer pages that get a lot of incoming links. He then discusses some approaches to alleviate the problem, suggesting primarily linking to link-poor pages from link-rich ones manually, or restructuring the website. I commented that this is a practical situation where one would want to use automation.

Log files are a goldmine of information about your website: links, clicks, search terms, errors, etc. In this case, they can be of great use in identifying the pages that are getting a lot of links and the ones that are getting very few. We can later use this information to link from the rich to the poor by manual or automated means.

Here is a brief explanation of how this can be done.

Here is an actual log entry from my site tripscan.com in the extended log format: 64.246.161.30 - - [29/May/2007:13:12:26 -0400] "GET /favicon.ico HTTP/1.1" 206 1406 "http://www.whois.sc/tripscan.com" "SurveyBot/2.3 (Whois Source)" "-"

First we need to parse each entry with a regex to extract the internal page — the path between GET and HTTP — and the referring page, which appears after the server status code and the page size. In this case, after 206 and 1406.
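As a minimal sketch, this is how the parsing step could look in Python (my choice of language; the post does not specify one). The regex and its group names are my own, written against the extended log format shown above; adjust them to your server's exact format.

```python
import re

# Regex for one extended-log-format entry; group names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('64.246.161.30 - - [29/May/2007:13:12:26 -0400] '
        '"GET /favicon.ico HTTP/1.1" 206 1406 '
        '"http://www.whois.sc/tripscan.com" "SurveyBot/2.3 (Whois Source)" "-"')

m = LOG_PATTERN.match(line)
if m:
    internal_page = m.group('path')       # the page between GET and HTTP
    referrer = m.group('referrer')        # the linking page, after status and size
    print(internal_page)                  # /favicon.ico
    print(referrer)                       # http://www.whois.sc/tripscan.com
```

In a real run you would loop this over every line of the log file and skip entries whose referrer is "-" (direct visits) or your own domain.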

We then create two maps: one for the internal pages (page URL to page id) and another for the external incoming link pages (page URL to page id as well). After that we can create a matrix that records the linking relationships between the pages. For example, matrix[23][15] = 1 means there is a link from external page id 15 to internal page id 23. This matrix is commonly known in information retrieval as the adjacency matrix or hyperlink matrix. We preferably want an implementation that can operate from disk, in order to scale to millions of link relationships.
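The two id maps and the adjacency matrix could be sketched like this — an in-memory toy version using plain dicts, with names of my own invention. For the millions of link relationships mentioned above you would back this with a disk-based store (a database table or an on-disk sparse matrix) rather than Python dicts.

```python
from collections import defaultdict

internal_ids = {}          # internal page URL -> page id
external_ids = {}          # external referrer URL -> page id
matrix = defaultdict(set)  # internal page id -> set of external page ids
                           # (15 in matrix[23] plays the role of matrix[23][15] = 1)

def get_id(table, url):
    """Assign the next sequential id to a URL the first time we see it."""
    if url not in table:
        table[url] = len(table)
    return table[url]

def record_link(internal_page, external_referrer):
    """Record one link relationship parsed out of a log entry."""
    page_id = get_id(internal_ids, internal_page)
    ref_id = get_id(external_ids, external_referrer)
    matrix[page_id].add(ref_id)

record_link('/favicon.ico', 'http://www.whois.sc/tripscan.com')
```

Using a set per internal page also deduplicates repeated hits from the same referrer, so counts reflect distinct linking pages rather than raw clicks.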

Later we can walk the matrix and create reports identifying the link-rich pages (those with many link relationships) and the link-poor pages (those with few). We can define the threshold wherever it makes sense (e.g., pages with more or fewer than 10 incoming links).
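The reporting step reduces to counting incoming links per internal page and splitting on the threshold. A self-contained sketch, using made-up counts standing in for the result of walking the matrix:

```python
# Hypothetical incoming-link counts per internal page, as would come
# out of walking the adjacency matrix (the numbers are invented).
incoming = {
    '/': 42,
    '/about': 15,
    '/products/widget-1': 2,
    '/products/widget-2': 0,
}

THRESHOLD = 10  # the cutoff suggested in the post

link_rich = sorted(url for url, n in incoming.items() if n >= THRESHOLD)
link_poor = sorted(url for url, n in incoming.items() if n < THRESHOLD)

print(link_rich)  # ['/', '/about']
print(link_poor)  # ['/products/widget-1', '/products/widget-2']
```

The link-poor list is the actionable output: these are the candidates to link to from the link-rich pages, manually or with an automated internal-linking pass.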


16 Responses to “Mining your server log files”


  1. kichus May 30, 2007 at 2:18 am

    Batista…

    First time I am on your blog and the very first post itself caught my mind, thanks for the lights…. 🙂

    I would love to get the steps made much simpler – some of the terms are beyond my reach, like the matrix. How do we map the page ids to the page URLs? I have to do it for myself to get the complete logic, I know. But I would appreciate it if you could help me out in the process, that would be a great help.

    once again, thanks for the great post.

    kichus

  2. hamletb May 30, 2007 at 9:58 am

    kichus,

    Thanks for visiting. It is the first time for most. I started this blog last Friday.

    I am currently working on the code to do this and I will write a very detailed post about it. Stay tuned.

  3. kichus May 30, 2007 at 11:08 am

    thanks hamlet… looking forward to read more from you…

    all the best

  4. Meditech May 30, 2007 at 11:34 am

    An interesting concept! Although we’re not enterprise-sized, I think we may still benefit from your insight. I look forward to reading more!

  5. hamletb May 30, 2007 at 11:59 am

    Meditech,

    Thanks for your comment! Now that I started blogging I won’t be able to stop. 🙂

  6. tOrn&cOnfused May 31, 2007 at 2:38 am

    Hi,

    You guys might laugh. I work for a company with almost complete freedom over what I do with a retail website.

    My firm owns several websites that have PPC campaigns selling the same products to the same people (because they target the same keywords). Currently our websites are being hosted by an IT company that decided to serve the same market and thus became a direct competitor. My firm decided to switch to another IT firm that has a stake in another website which my firm also co-owns (another direct competitor), but not me.

    My question is: if my firm has them hosted under the same host, would a search engine discover that the websites have the same owner and thus impose rules on them for organic purposes? Secondly, the new IT company, which happens to have an interest in other websites owned by my firm (over which I have no direct control or responsibility) — could it secretly use the web logs to study them like an analytics tool?

  7. hamletb May 31, 2007 at 5:39 pm

    >My question if my firm have them hosted under the same host, would search engine discover that the websites have the same owner thus impose rules on them for organic purposes?

    I am aware that Google has registrar status and can tell who owns which website. Owning multiple websites is not a problem, but cross-linking them for SEO purposes might not give you the results you expect.

    > Secondly, obviously the new IT company who happen to have some interest with other websites own by my firm (which I have no direct control or responsibility), could it secretly use the web logs to study like an analytic tool?

    If I had access to your log files, using the new script I just published I could easily steal your AdWords keywords and use them to compete with you, head on.

  8. tOrn&cOnfused May 31, 2007 at 6:58 pm

    Hi, what about CTR, CVR, and other valuable organic and paid traffic metrics?

  9. hamletb May 31, 2007 at 10:09 pm

    Conversion rates can be identified from the log files, if you know the conversion pages (thank you pages).

    Click-through rates? It’s tricky, but it’s possible too. How?

    Let’s say you have your log file for last month. You mine it to find how many clicks you got from Yahoo for a particular keyword. Then you use the Yahoo keyword research tool to find out how many people searched for that keyword the previous month. You know the math that follows.

  10. Jim Newsome June 1, 2007 at 4:09 am

    Hi Hamlet,

    Love the idea of log-based link analysis, please drop me an email to discuss further. (btw – your Contact Us form on nemdia.com isn’t working).

    Jim

  11. hamletb June 1, 2007 at 9:56 am

    Thanks for the comment. We uploaded a new design for Nemedia yesterday and we didn’t expect people to be hitting the contact us form so soon 🙂

  12. clickforlessons June 4, 2007 at 3:58 pm

    Very interesting post! Our site has over 1 million pages and PR distribution is always something of a consideration. We’ll be checking this out.

    ~steven

  13. hamletb June 4, 2007 at 4:17 pm

    Steven,

    That is impressive! I hope this technique is of help.


  1. Using Log Files To Improve Page Rank Distribution » Technology News | Venture Capital, Startups, Silicon Valley, Web 2.0 Tech Trackback on May 29, 2007 at 5:26 pm
  2. Some Thoughts on Enterprise Website Linking After Reading Rand : Kichus - SEO KiD Trackback on May 30, 2007 at 12:28 pm
  3. This Week In SEO - 6/1/07 - TheVanBlog Trackback on June 1, 2007 at 10:05 pm
