Archive for the 'log mining' Category

LongTailMiner v0.1 alpha – find long tail keywords nobody thought about

I’m really enjoying this blogging thing! Every comment I am getting from my readers is a new idea that I feel rushed to put into practice.

My reader, Andrea, mentioned she parses log files to mine for keywords as well. That is an excellent idea.

I decided to put that idea into code and here is a new tool to mine for long tail keywords.

To make really good use of it, I would setup a PPC campaign in Google with a “head keyword” in broad match, bidding at the minimum possible. Make sure your ads maintain good click-through rates (over 0.5%) to avoid getting disabled. Run it for a week or two (preferably more) and you will have a good number of search referrals and “long tail keywords” that people are actually looking for. You can later create good content pages that include those keywords. In most cases, long tail keywords are really easy to rank with on-page optimization only.

I will probably write a Youmoz entry with more detailed instructions on how to take advantage of this. In this way I can get more people to try it and get really valuable feedback.

Here is the Python code:

#!/usr/bin/python

# LongTailMiner.py v0.1 alpha by Hamlet Batista 2007

# 


import re

from urlparse import urlparse

from cgi import parse_qsl


p = r'[^"]+"GETs([^s]+)[^"]+"s2[^"]+"([^"]+(?:google|yahoo|msn|ask)+[^"]+)"'


log = open('tripscan.actual_log')

lines = log.readlines()


keywords = set()


for line in lines:

 m = re.search(p, line)


	if m:

 	(internal, link) = m.groups()


		elements = urlparse(link)


		if elements[4]: #check to see if there is query string

 		params = parse_qsl(elements[4])#break qs in keyword, value pairs


			for (k,v) in params:

 			if k == 'p' or k == 'q':#top search engines use p or q for the keywords

 				keywords.add( elements[1] + " - " +  v)


#print the report

for k in  keywords:

 print k

Here is the output:

search.sympatico.msn.ca – best places to vacation in april
www.ask.com – help find a cheap vacation package anywhere
www.ask.com – new york vacation package deals
search.yahoo.com – vegas vacation packages
search.yahoo.com – what is the best beaches to stay in jamaica
search.yahoo.com – outrageous hawaii vacation packages
www.google.se – “paris in 5 days” versailles
search.sympatico.msn.ca – Vacation Package Deals
search.yahoo.com – vacationpackage
search.yahoo.com – vacation packages
search.msn.com – 10 best places for vacation
in.search.yahoo.com – vacation package
search.yahoo.com – best places to vacation in june/july
search.sympatico.msn.ca – best travel deals for june
search.msn.com – last minute caribean deals
search.yahoo.com – package vacation
www.google.com – Tripscan
search.sympatico.msn.ca – best places to vacation in June
search.sympatico.msn.ca – best places to travel in october
search.yahoo.com – vacation package
search.msn.com – caribean vacation
search.msn.com – Best Caribean vacation
www.ask.com – Cheap Vacation Package
search.sympatico.msn.ca – CANYON RANCH IN LENNOX
search.sympatico.msn.ca – find vacation packages
ca.search.yahoo.com – Hawaii all inclusive Vacation Packages
search.yahoo.com – california vacation ideas
search.yahoo.com – vacaton package
ie.search.msn.com – caribean vacation
search.yahoo.com – all inclusive package deals from New York to Cancun
search.yahoo.com – best places to explore
search.msn.com – caribean vacation island packages
search.yahoo.com – vancation package
search.yahoo.com – puerto vallarta nude resorts
www.ask.com – all inclusive vacation places
search.yahoo.com – vacation packge
search.yahoo.com – vacation package
www.ask.com – caribean deals
search.msn.com – best hotels in caribean
search.msn.com – the best caribean vacation
www.google.com – related:www.exectourtravel.com/
ca.search.yahoo.com – vacation packages

This is just scratching the surface. One improvement we can make, is to identify the landing pages to which the keywords lead, so we can make sure visitors are finding what they want.

Usage

In order to use the script you need to download Python from http://www.python.org. The script should run in Unix/Linux, Mac, and Windows but I only tested it in Linux.

1. Copy your log file to the directory were you saved the script.

2. Change the name of the log file (inside the quotes) in the line log = open(’tripscan.actual_log’) to the name of your log file.

3. In the command line type: python LongTailMiner.py and you should see the report.

LinkingHood v0.1 alpha

As I promised to one of my readers, here is the first version of the code to mine log files for linking relationship information.

I named it LinkingHood as the intention is to take link juice from the rich to give to the poor linking sites.

I wrote it in Python for clarity ( I love Python :-) ) . I was working on an advanced approach involving matrices and linear algebra. After reading some of the feedback regarding the article, it gave birth to a new idea. To make it easier to explain, I decided to use a simpler approach . This code would definitely need to be rewritten to use matrices and linear algebraic operations. (More about that in a later post). For scalability to sites with 10,000 or more pages, this is primarily an illustration and does everything in memory. It’s also extremely inefficient in its current form.

I simply used a dictionary of sets. The keys are the internal pages and the sets are the list of links pointing to those pages. I tested it with my tripscan.com log file and included the results of a test-run.

Here is the script:

#!/usr/bin/python# LinkingHood v0.1 alpha by Hamlet Batista 2007
#

import re

relationships = {}

p = r'[^"]+"GETs([^s]+)[^"]+"s2[^"]+"([^"]+)"'

log = open('tripscan.actual_log')

lines = log.readlines()

for line in lines:

   m = re.search(p, line)

	if m:

 	        (internal_page, external_link) = m.groups()

		if re.search(r'.css|.js|.gif|.jpg|.swf|?', internal_page):

 		     continue

		if not relationships.has_key(internal_page):

 		     relationships[internal_page] = set()

		if re.search(r'yahoo|google|msn|live|ask', external_link):

 		     continue

		relationships[internal_page].add(external_link)

print "Tripscan internal pages:"

for page in  relationships.keys():

   print "t"+page+ ": " +str(len(relationships[page])) + " links"

home = relationships['/']

about =  relationships['/aboutus.html']

print 'Home has ' + str(len(home)) + ' links'


for link in home:

   print 't'+link
print 'About has ' + str(len(about)) + ' links'

for link in about:

   print 't'+link

Here are the results from the run:

Tripscan internal pages:
/orlando.php: 2 links
/directory/money_and_finance.html: 3 links
/contact.php: 2 links
/favicon.ico: 3 links
/lasvegas.php: 2 links
/directory/services.html: 2 links
/index.php: 2 links
/directory/travel.html: 1 links
/charleston.php: 2 links
/sunburst.php: 2 links
/cancun.php: 2 links
/blank.php: 5 links
/london.php: 2 links
/discount_travel.php: 2 links
/santodomingo.php: 2 links
/directory/internet.html: 2 links
/phoenix.php: 2 links
/: 41 links
/paris.php: 2 links
/sanfrancisco.php: 2 links
/directory/drugs_and_pharmacy.html: 2 links
/honolulu.php: 2 links
/chicago.php: 2 links
/directory/general.html: 1 links
/directory/fun.html: 2 links
/sitemap.php: 2 links
/hiltongrand.php: 2 links
//: 1 links
/directory/travel2.html: 2 links
/directory/home_business.html: 1 links
/losangeles.php: 2 links
/directory/misc.html: 1 links
/jamaica.php: 2 links
/aruba.php: 2 links
/best_spa.php: 2 links
/amsterdam.php: 2 links
/puertovallarta.php: 3 links
/barcelona.php: 2 links
/newyork.php: 2 links
/submit_link.php: 2 links
/11thhour.php: 2 links
/directory/services2.html: 2 links
/neworleans.php: 2 links
/toronto.php: 2 links
/rome.php: 2 links
/directory/: 2 links
/aboutus.html: 4 links
/directory/other_resources.html: 2 links
/top_ten.php: 2 links

Home has 41 links
http://www.directorypanel.com/detail/link-3571.html
http://www.campwalden.ca/web/travel14.htm
http://chiangmai.discount-thailand-hotel.net/chmresources/travel_resources-page17.php
http://hamletbatista.com/2007/05/29/mining-you-server-log-files/
http://www.the-happy-side.com/link_description.php?cat_id=1
http://www.popularaffiliate.com/travel.html
http://www.kingbloom.com
http://energytable.com/links/shopping.html
http://hamletbatista.com/page/2/
http://www.garyknight.com/links/vacations6.html
http://www.nicepakistan.com/directory/index.php?c=14
http://www.linkdirectory.com/Travel___Vacation/Destinations/
http://www.realestateingrandrapids.com/links/recreation.html
http://whois.domaintools.com/tripscan.com
-
http://www.abccoachhire.co.uk
http://www.1americamall.com/index.php?c=22&s=201
http://www.littlemarketstreet.com/links/travel3.html
http://www.uddsprinting.com/travellinks.html
http://uddsprinting.com/travellinks.html
http://www.whois.sc/tripscan.com
http://www.cheap-air-travel-fares.info/resources9.html
http://www.siteinclusion.com/directory?logic=or&maximum=&term=mexico+vacation+central&sr=20&pp=20
&cp=2
http://www.tripscan.com
http://www.goodsearch.com/Search.aspx?Keywords=vacation+packages&page=4
http://www.vts.net/links/travel3.html
http://hamletbatista.com/
http://www.search-the-world.com/search/search.php/search::cat/category::25/page::42/hpp::20/
http://linkcentre.com/search/?keyword=travel&page=4&flag=
http://www.patclarkconversions.com/links/travel2.html
http://www.goodsearch.com/Search.aspx?Keywords=www.tripscan.com&Source=mozillaplugin
http://hamletbatista.com/tag/link-building/
http://www.datingshare.com/sharelinks/travel.html
http://www.tripscan.com/directory/
http://www.webdigity.com/ws/
http://www.tripscan.com/
http://www.ottosuch.de/
http://www1.tripscan.com/hotel-deals/10015639-hotrate.html
http://www.link-exchange.ws/link-exchange/index.php?action=displaycat&catid=27&page=12&perpage=15
&page=13&perpage=15
http://www.bargaintraveleurope.com/Travel_Links.htm
http://www.weboart.com/links/recreation-sports-travel.html
About has 4 links
http://www.tripscan.com
http://res99.lmdeals.com/config.html?in_origination_key=371&in_pd_key=329&SRC=10015639&SRC_AID=no
ne&in_package_key=5034225&in_offering_key=1578846&in_slipclick=main_result&SRC=10015639&SRC_AID=none
-
http://www.tripscan.com/

One of the most common errors for people unfamiliar with Python is the issue of indentation. This code cannot just be copied, pasted to a text file, and passed onto Python to run. You need to make sure the indentation (spacing) is right. I will post the code somewhere else and provide a link if this causes too much trouble.

Some readers got lost when I talked about matrices in the previous post. Linking relationships and similarly connected structures are conceptually and graphically represented as graphs. A graph is an interconnected structure that has nodes and edges. In our case, the links are the edges and the nodes are the pages. One of the most common ways to express a graph is with a matrix. Similar to an Excel sheet, it has rows and columns, where the squares can be use to indicate that there is a relationship between the page in column A and the page in row C.

Matrices are great for this because one can use matrix operations to solve problems that would otherwise require a lot of memory and computing power to solve. In order to create the matrix, we would number each unique page and unique link. We would use the rows to represent the pages and the columns to represent the links. Each position where there is a 1 means there is a link between the two pages and a 0 means there is no relationship. Using numbers for the rows and columns, and ones and zeros, for the values saves a lot of memory. This makes the computation a lot more efficient. In the code I use the pages and links directly for more clarity.

I hope this is not too confusing.

Update: I made a small change to include the incoming link count for each page.

In order to use the script, download Python from http://www.python.org. The script should run in Unix/Linux, Mac and Windows but I only tested it in Linux.

1. Copy your log file to the directory where the script was saved.

2. Change the name of the log file (inside the quotes) in the line log = open(‘tripscan.actual_log’) to the name of your log file.

3. In the command line, type: python LinkingHood.py and you should see the report.

Mining your server log files

While top website analytics packages offer pretty much anything you might need to find actionable data to improve your site, there are situations where we need to dig deeper to identify vital information.

One of such situations came to light in a post by randfish of Seomoz.org. He writes about the problem with most enterprise-size websites, they have many pages with no or very few incoming links and fewer pages that get a lot of incoming links. He later discusses some approaches to alleviate the problem, suggesting primary linking to link-poor pages from link-rich ones manually, or restructuring the website. I commented that this is a practical situation where one would want to use automation.

Log files are a goldmine of information about your website: links, clicks, search terms, errors, etc In this case, they can be of great use to identify the pages that are getting a lot of links and the ones that are getting very few. We can later use this information to link from the rich to the poor by manual or automated means.

Here is a brief explanation on how this can be done.

Here is an actual log entry to my site tripscan.com in the extended log format: 64.246.161.30 – - [29/May/2007:13:12:26 -0400] “GET /favicon.ico HTTP/1.1″ 206 1406 “http://www.whois.sc/tripscan.com” “SurveyBot/2.3 (Whois Source)” “-”

First we need to parse the entries with a regex to extract the internal pages — between GET and HTTP — and the page that is linking after the server status code and the page size. In this case, after 206 and 1406.

We then create two maps: one for the internal pages — page and page id, and another for the external incoming links page and page id as well. After that we can create a matrix where we identify the linking relationships between the pages. For example: matrix[23][15] = 1, means there is a link from external page id 15 to internal page id 23. This matrix is commonly known in information retrieval as the adjacency matrix or hyper link matrix. We want an implementation that can be preferably operated from disk in order to be able to scale to millions of link relationships.

Later we can walk the matrix and create reports identifying the link-rich pages, the pages with many link relationships, and the link-poor pages with few link relationships. We can define the threshold at some point (i.e. pages with more or less than 10 incoming links.)