As I promised to one of my readers, here is the first version of the code to mine log files for linking relationship information.
I named it LinkingHood as the intention is to take link juice from the rich to give to the poor linking sites.
I wrote it in Python for clarity ( I love Python
) . I was working on an advanced approach involving matrices and linear algebra. After reading some of the feedback regarding the article, it gave birth to a new idea. To make it easier to explain, I decided to use a simpler approach . This code would definitely need to be rewritten to use matrices and linear algebraic operations. (More about that in a later post). For scalability to sites with 10,000 or more pages, this is primarily an illustration and does everything in memory. It’s also extremely inefficient in its current form.
I simply used a dictionary of sets. The keys are the internal pages and the sets are the list of links pointing to those pages. I tested it with my tripscan.com log file and included the results of a test-run.
Here is the script:
#!/usr/bin/python# LinkingHood v0.1 alpha by Hamlet Batista 2007
#
import re
relationships = {}
p = r'[^"]+"GETs([^s]+)[^"]+"s2[^"]+"([^"]+)"'
log = open('tripscan.actual_log')
lines = log.readlines()
for line in lines:
m = re.search(p, line)
if m:
(internal_page, external_link) = m.groups()
if re.search(r'.css|.js|.gif|.jpg|.swf|?', internal_page):
continue
if not relationships.has_key(internal_page):
relationships[internal_page] = set()
if re.search(r'yahoo|google|msn|live|ask', external_link):
continue
relationships[internal_page].add(external_link) print "Tripscan internal pages:" for page in relationships.keys(): print "t"+page+ ": " +str(len(relationships[page])) + " links" home = relationships['/'] about = relationships['/aboutus.html'] print 'Home has ' + str(len(home)) + ' links' for link in home: print 't'+link
print 'About has ' + str(len(about)) + ' links'
for link in about: print 't'+link
Here are the results from the run:
Tripscan internal pages:
/orlando.php: 2 links
/directory/money_and_finance.html: 3 links
/contact.php: 2 links
/favicon.ico: 3 links
/lasvegas.php: 2 links
/directory/services.html: 2 links
/index.php: 2 links
/directory/travel.html: 1 links
/charleston.php: 2 links
/sunburst.php: 2 links
/cancun.php: 2 links
/blank.php: 5 links
/london.php: 2 links
/discount_travel.php: 2 links
/santodomingo.php: 2 links
/directory/internet.html: 2 links
/phoenix.php: 2 links
/: 41 links
/paris.php: 2 links
/sanfrancisco.php: 2 links
/directory/drugs_and_pharmacy.html: 2 links
/honolulu.php: 2 links
/chicago.php: 2 links
/directory/general.html: 1 links
/directory/fun.html: 2 links
/sitemap.php: 2 links
/hiltongrand.php: 2 links
//: 1 links
/directory/travel2.html: 2 links
/directory/home_business.html: 1 links
/losangeles.php: 2 links
/directory/misc.html: 1 links
/jamaica.php: 2 links
/aruba.php: 2 links
/best_spa.php: 2 links
/amsterdam.php: 2 links
/puertovallarta.php: 3 links
/barcelona.php: 2 links
/newyork.php: 2 links
/submit_link.php: 2 links
/11thhour.php: 2 links
/directory/services2.html: 2 links
/neworleans.php: 2 links
/toronto.php: 2 links
/rome.php: 2 links
/directory/: 2 links
/aboutus.html: 4 links
/directory/other_resources.html: 2 links
/top_ten.php: 2 links
Home has 41 links
http://www.directorypanel.com/detail/link-3571.html
http://www.campwalden.ca/web/travel14.htm
http://chiangmai.discount-thailand-hotel.net/chmresources/travel_resources-page17.php
http://hamletbatista.com/2007/05/29/mining-you-server-log-files/
http://www.the-happy-side.com/link_description.php?cat_id=1
http://www.popularaffiliate.com/travel.html
http://www.kingbloom.com
http://energytable.com/links/shopping.html
http://hamletbatista.com/page/2/
http://www.garyknight.com/links/vacations6.html
http://www.nicepakistan.com/directory/index.php?c=14
http://www.linkdirectory.com/Travel___Vacation/Destinations/
http://www.realestateingrandrapids.com/links/recreation.html
http://whois.domaintools.com/tripscan.com
-
http://www.abccoachhire.co.uk
http://www.1americamall.com/index.php?c=22&s=201
http://www.littlemarketstreet.com/links/travel3.html
http://www.uddsprinting.com/travellinks.html
http://uddsprinting.com/travellinks.html
http://www.whois.sc/tripscan.com
http://www.cheap-air-travel-fares.info/resources9.html
http://www.siteinclusion.com/directory?logic=or&maximum=&term=mexico+vacation+central&sr=20&pp=20
&cp=2
http://www.tripscan.com
http://www.goodsearch.com/Search.aspx?Keywords=vacation+packages&page=4
http://www.vts.net/links/travel3.html
http://hamletbatista.com/
http://www.search-the-world.com/search/search.php/search::cat/category::25/page::42/hpp::20/
http://linkcentre.com/search/?keyword=travel&page=4&flag=
http://www.patclarkconversions.com/links/travel2.html
http://www.goodsearch.com/Search.aspx?Keywords=www.tripscan.com&Source=mozillaplugin
http://hamletbatista.com/tag/link-building/
http://www.datingshare.com/sharelinks/travel.html
http://www.tripscan.com/directory/
http://www.webdigity.com/ws/
http://www.tripscan.com/
http://www.ottosuch.de/
http://www1.tripscan.com/hotel-deals/10015639-hotrate.html
http://www.link-exchange.ws/link-exchange/index.php?action=displaycat&catid=27&page=12&perpage=15
&page=13&perpage=15
http://www.bargaintraveleurope.com/Travel_Links.htm
http://www.weboart.com/links/recreation-sports-travel.html
About has 4 links
http://www.tripscan.com
http://res99.lmdeals.com/config.html?in_origination_key=371&in_pd_key=329&SRC=10015639&SRC_AID=no
ne&in_package_key=5034225&in_offering_key=1578846&in_slipclick=main_result&SRC=10015639&SRC_AID=none
-
http://www.tripscan.com/
One of the most common errors for people unfamiliar with Python is the issue of indentation. This code cannot just be copied, pasted to a text file, and passed onto Python to run. You need to make sure the indentation (spacing) is right. I will post the code somewhere else and provide a link if this causes too much trouble.
Some readers got lost when I talked about matrices in the previous post. Linking relationships and similarly connected structures are conceptually and graphically represented as graphs. A graph is an interconnected structure that has nodes and edges. In our case, the links are the edges and the nodes are the pages. One of the most common ways to express a graph is with a matrix. Similar to an Excel sheet, it has rows and columns, where the squares can be use to indicate that there is a relationship between the page in column A and the page in row C.
Matrices are great for this because one can use matrix operations to solve problems that would otherwise require a lot of memory and computing power to solve. In order to create the matrix, we would number each unique page and unique link. We would use the rows to represent the pages and the columns to represent the links. Each position where there is a 1 means there is a link between the two pages and a 0 means there is no relationship. Using numbers for the rows and columns, and ones and zeros, for the values saves a lot of memory. This makes the computation a lot more efficient. In the code I use the pages and links directly for more clarity.
I hope this is not too confusing.
Update: I made a small change to include the incoming link count for each page.
In order to use the script, download Python from http://www.python.org. The script should run in Unix/Linux, Mac and Windows but I only tested it in Linux.
1. Copy your log file to the directory where the script was saved.
2. Change the name of the log file (inside the quotes) in the line log = open(‘tripscan.actual_log’) to the name of your log file.
3. In the command line, type: python LinkingHood.py and you should see the report.
















Recent Comments