LinkingHood v0.1 alpha

As I promised to one of my readers, here is the first version of the code to mine log files for linking relationship information.

I named it LinkingHood as the intention is to take link juice from the rich to give to the poor linking sites.

I wrote it in Python for clarity (I love Python 🙂). I was working on an advanced approach involving matrices and linear algebra, but some of the feedback on the article gave me a new idea. To make it easier to explain, I decided to use a simpler approach. This code would definitely need to be rewritten to use matrices and linear algebraic operations for scalability to sites with 10,000 or more pages (more about that in a later post). It is primarily an illustration and does everything in memory. It's also extremely inefficient in its current form.

I simply used a dictionary of sets. The keys are the internal pages, and each set holds the external links pointing to that page. I tested it with my tripscan.com log file and included the results of a test run.

Here is the script:

#!/usr/bin/python
# LinkingHood v0.1 alpha by Hamlet Batista 2007
#

import re

# internal page -> set of external pages linking to it
relationships = {}

# Captures the requested path and the referrer of successful (2xx) GET requests
p = r'[^"]+"GET\s([^\s]+)[^"]+"\s2[^"]+"([^"]+)"'

log = open('tripscan.actual_log')
lines = log.readlines()
log.close()

for line in lines:
    m = re.search(p, line)
    if m:
        (internal_page, external_link) = m.groups()
        # Skip static files and URLs with query strings
        if re.search(r'\.css|\.js|\.gif|\.jpg|\.swf|\?', internal_page):
            continue
        if internal_page not in relationships:
            relationships[internal_page] = set()
        # Skip referrals from search engines
        if re.search(r'yahoo|google|msn|live|ask', external_link):
            continue
        relationships[internal_page].add(external_link)

print("Tripscan internal pages:")
for page in relationships.keys():
    print("\t" + page + ": " + str(len(relationships[page])) + " links")

home = relationships['/']
about = relationships['/aboutus.html']

print('Home has ' + str(len(home)) + ' links')
for link in home:
    print('\t' + link)

print('About has ' + str(len(about)) + ' links')
for link in about:
    print('\t' + link)
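To see what the regular expression captures, here is a minimal sketch that runs it against a single log line. The pattern is the script's, with its whitespace escapes (`\s`) written out. The sample line is made up, and it assumes the log uses the Apache "combined" format (request, status, size, referrer, user agent); `parse_line` is a helper name introduced only for this illustration.

```python
import re

# Pattern from the script, with the whitespace escapes (\s) written out.
PATTERN = r'[^"]+"GET\s([^\s]+)[^"]+"\s2[^"]+"([^"]+)"'

# Made-up log line; assumes the Apache "combined" log format.
SAMPLE = ('192.0.2.1 - - [31/May/2007:05:25:00 -0400] '
          '"GET /aboutus.html HTTP/1.1" 200 5120 '
          '"http://www.example.com/links.html" "Mozilla/5.0"')


def parse_line(line):
    """Return (internal_page, referrer) for a successful GET, else None."""
    m = re.search(PATTERN, line)
    return m.groups() if m else None


print(parse_line(SAMPLE))
# → ('/aboutus.html', 'http://www.example.com/links.html')
```

The `\s2` part only lets through requests whose status code starts with 2, so a 404 for the same page would be ignored.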

Here are the results from the run:

Tripscan internal pages:
/orlando.php: 2 links
/directory/money_and_finance.html: 3 links
/contact.php: 2 links
/favicon.ico: 3 links
/lasvegas.php: 2 links
/directory/services.html: 2 links
/index.php: 2 links
/directory/travel.html: 1 links
/charleston.php: 2 links
/sunburst.php: 2 links
/cancun.php: 2 links
/blank.php: 5 links
/london.php: 2 links
/discount_travel.php: 2 links
/santodomingo.php: 2 links
/directory/internet.html: 2 links
/phoenix.php: 2 links
/: 41 links
/paris.php: 2 links
/sanfrancisco.php: 2 links
/directory/drugs_and_pharmacy.html: 2 links
/honolulu.php: 2 links
/chicago.php: 2 links
/directory/general.html: 1 links
/directory/fun.html: 2 links
/sitemap.php: 2 links
/hiltongrand.php: 2 links
//: 1 links
/directory/travel2.html: 2 links
/directory/home_business.html: 1 links
/losangeles.php: 2 links
/directory/misc.html: 1 links
/jamaica.php: 2 links
/aruba.php: 2 links
/best_spa.php: 2 links
/amsterdam.php: 2 links
/puertovallarta.php: 3 links
/barcelona.php: 2 links
/newyork.php: 2 links
/submit_link.php: 2 links
/11thhour.php: 2 links
/directory/services2.html: 2 links
/neworleans.php: 2 links
/toronto.php: 2 links
/rome.php: 2 links
/directory/: 2 links
/aboutus.html: 4 links
/directory/other_resources.html: 2 links
/top_ten.php: 2 links

Home has 41 links
http://www.directorypanel.com/detail/link-3571.html
http://www.campwalden.ca/web/travel14.htm
http://chiangmai.discount-thailand-hotel.net/chmresources/travel_resources-page17.php
http://hamletbatista.com/2007/05/29/mining-you-server-log-files/
http://www.the-happy-side.com/link_description.php?cat_id=1
http://www.popularaffiliate.com/travel.html
http://www.kingbloom.com
http://energytable.com/links/shopping.html
http://hamletbatista.com/page/2/
http://www.garyknight.com/links/vacations6.html
http://www.nicepakistan.com/directory/index.php?c=14
http://www.linkdirectory.com/Travel___Vacation/Destinations/
http://www.realestateingrandrapids.com/links/recreation.html
http://whois.domaintools.com/tripscan.com

http://www.abccoachhire.co.uk
http://www.1americamall.com/index.php?c=22&s=201
http://www.littlemarketstreet.com/links/travel3.html
http://www.uddsprinting.com/travellinks.html
http://uddsprinting.com/travellinks.html
http://www.whois.sc/tripscan.com
http://www.cheap-air-travel-fares.info/resources9.html
http://www.siteinclusion.com/directory?logic=or&maximum=&term=mexico+vacation+central&sr=20&pp=20&cp=2
http://www.tripscan.com
http://www.goodsearch.com/Search.aspx?Keywords=vacation+packages&page=4
http://www.vts.net/links/travel3.html
http://hamletbatista.com/
http://www.search-the-world.com/search/search.php/search::cat/category::25/page::42/hpp::20/
http://linkcentre.com/search/?keyword=travel&page=4&flag=
http://www.patclarkconversions.com/links/travel2.html
http://www.goodsearch.com/Search.aspx?Keywords=www.tripscan.com&Source=mozillaplugin
http://hamletbatista.com/tag/link-building/
http://www.datingshare.com/sharelinks/travel.html
http://www.tripscan.com/directory/
http://www.webdigity.com/ws/
http://www.tripscan.com/
http://www.ottosuch.de/
http://www1.tripscan.com/hotel-deals/10015639-hotrate.html
http://www.link-exchange.ws/link-exchange/index.php?action=displaycat&catid=27&page=12&perpage=15&page=13&perpage=15
http://www.bargaintraveleurope.com/Travel_Links.htm
http://www.weboart.com/links/recreation-sports-travel.html
About has 4 links
http://www.tripscan.com
http://res99.lmdeals.com/config.html?in_origination_key=371&in_pd_key=329&SRC=10015639&SRC_AID=none&in_package_key=5034225&in_offering_key=1578846&in_slipclick=main_result&SRC=10015639&SRC_AID=none

http://www.tripscan.com/

One of the most common errors for people unfamiliar with Python is indentation. Python uses indentation to delimit blocks, so if you copy and paste this code into a text file, make sure the indentation (spacing) survives intact before you run it. I will post the code somewhere else and provide a link if this causes too much trouble.

Some readers got lost when I talked about matrices in the previous post. Linking relationships and similarly connected structures are naturally represented as graphs. A graph is an interconnected structure that has nodes and edges. In our case, the pages are the nodes and the links are the edges. One of the most common ways to express a graph is with a matrix. Similar to an Excel sheet, it has rows and columns, and each cell can be used to indicate whether there is a relationship between the page for that row and the page for that column.

Matrices are great for this because one can use matrix operations to solve problems that would otherwise require a lot of memory and computing power. To create the matrix, we would number each unique page and each unique link. We would use the rows to represent the pages and the columns to represent the links. A 1 in a position means there is a link between the two pages, and a 0 means there is no relationship. Using numbers for the rows and columns, and ones and zeros for the values, saves a lot of memory and makes the computation a lot more efficient. In the code I use the pages and links directly for more clarity.
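As a rough illustration of that numbering scheme, here is a minimal sketch that turns a dictionary of sets into a matrix of ones and zeros. The page and link names are invented, and `build_matrix` is a hypothetical helper, not part of LinkingHood. It uses plain Python lists, so it shows the layout only and not the memory savings a real matrix library would bring.

```python
# Toy data in the same shape as the script's dictionary of sets
# (page and link names are invented for illustration).
relationships = {
    '/': {'http://a.example/', 'http://b.example/'},
    '/aboutus.html': {'http://b.example/'},
}


def build_matrix(relationships):
    """Number pages (rows) and links (columns), then fill a 0/1 matrix."""
    pages = sorted(relationships)
    links = sorted(set().union(*relationships.values()))
    col = {link: j for j, link in enumerate(links)}  # link -> column number
    matrix = [[0] * len(links) for _ in pages]
    for i, page in enumerate(pages):
        for link in relationships[page]:
            matrix[i][col[link]] = 1  # 1 means "this link points to this page"
    return pages, links, matrix


pages, links, matrix = build_matrix(relationships)
for page, row in zip(pages, matrix):
    print(page, row, 'incoming links:', sum(row))
```

Summing a row gives the incoming link count for that page, the same number the script gets from the size of each set.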

I hope this is not too confusing.

Update: I made a small change to include the incoming link count for each page.

In order to use the script, download Python from http://www.python.org. The script should run in Unix/Linux, Mac and Windows but I only tested it in Linux.

1. Copy your log file to the directory where the script was saved.

2. Change the name of the log file (inside the quotes) in the line log = open('tripscan.actual_log') to the name of your log file.

3. In the command line, type: python LinkingHood.py and you should see the report.


6 Responses to “LinkingHood v0.1 alpha”


  1. kichus, May 31, 2007 at 5:25 am

    That's a great one, Batista, thanks for simplifying. And now I've got one more doubt. The logs only record the hits (I mean they won't have the entire list of links pointing to a page), so only the referral links for that particular period of time (depending on the date range the log file covers) are getting evaluated.

    I understand that they are the performing links, which have the capability to distribute some link juice. But do they give us an exact list that we can rely on to make strategies?

    Also, if you could add one more paragraph saying the basic requirements for running that script, that would be great.

  2. Andrea, May 31, 2007 at 5:37 am

    Hi, this is really interesting, but I miss what to do next.
    I already store in a DB all the incoming links from non-SE for almost any page of my sites plus the SE keywords.
    I use it to see the top keywords trend and to check if affiliate sites stop linking to me…
    What is the smartest thing I should do with all this huge amount of data?

  3. hamletb, May 31, 2007 at 11:05 am

    kichus,

    You are absolutely right. Note that search engines crawl your website frequently and the script will take advantage of that to collect all your internal links. The more days recorded in your log file, the better.

    I am not sure I understand your second paragraph.

    I updated the entry based on your suggestions.

    Andrea,

    Parsing the keywords is a very smart idea. I think I will modify the code to do that as well.

    In my opinion the best use for this data is to modify the link-rich pages so that they pass traffic and link love to the other ones.

    The keywords are the best way for your visitors to express their intentions. Use your data to make sure they land on the right pages: the intention expressed in the keywords needs to match the content of the landing page.

    If this is not the case I would change the content on the pages to match the queries.

  4. kichus, June 1, 2007 at 5:47 am

    Thanks Batista….

    Sorry about the confusion, I was actually saying if you could mention some basic requirements for running this script (like, python server with version x.xx, where to keep your Log files and how to call it etc.). Sorry if I missed it in the post…

  5. kichus, June 1, 2007 at 5:52 am

    Oops… I was actually asking for exactly what you've added under the “Usage”. 🙂

    I felt it was missing the first time I read the post…

    kichus

  6. hamletb, June 1, 2007 at 9:58 am

    Yes. I added it after you mentioned it. Thanks for the heads up!

