
LinkingHood v0.1 alpha

As I promised to one of my readers, here is the first version of the code to mine log files for linking relationship information.

I named it LinkingHood as the intention is to take link juice from the rich to give to the poor linking sites.

I wrote it in Python for clarity (I love Python :-)). I was working on an advanced approach involving matrices and linear algebra. Reading some of the feedback on the article gave me a new idea. To make it easier to explain, I decided to use a simpler approach. This code would definitely need to be rewritten to use matrices and linear algebraic operations (more about that in a later post) to scale to sites with 10,000 or more pages. It is primarily an illustration: it does everything in memory and is extremely inefficient in its current form.

I simply used a dictionary of sets. The keys are the internal pages, and each set holds the links pointing to that page. I tested it with my tripscan.com log file and included the results of a test run.
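To make the data structure concrete, here is a minimal sketch of the dictionary of sets. The log lines are made up for illustration, the pattern assumes a log layout like the Apache combined format (quoted request, status code, quoted referrer), and `build_relationships` is a hypothetical helper name, not part of the script.

```python
import re

# Assumed log layout: ... "GET /path HTTP/1.1" 200 size "referrer" "agent"
# Only successful (2xx) GET requests are captured.
PATTERN = re.compile(r'[^"]+"GET\s([^\s]+)[^"]+"\s2[^"]+"([^"]+)"')


def build_relationships(lines):
    """Map each requested page to the set of referrers seen for it."""
    relationships = {}
    for line in lines:
        m = PATTERN.search(line)
        if m:
            page, referrer = m.groups()
            # A set keeps each referrer only once, however often it appears
            relationships.setdefault(page, set()).add(referrer)
    return relationships


# Made-up sample lines, for illustration only
sample = [
    '1.2.3.4 - - [29/May/2007:10:00:00 -0400] "GET / HTTP/1.1" 200 512 "http://example.com/links.html" "Mozilla"',
    '1.2.3.5 - - [29/May/2007:10:01:00 -0400] "GET / HTTP/1.1" 200 512 "http://example.org/travel.html" "Mozilla"',
    '1.2.3.4 - - [29/May/2007:10:02:00 -0400] "GET / HTTP/1.1" 200 512 "http://example.com/links.html" "Mozilla"',
    '1.2.3.6 - - [29/May/2007:10:03:00 -0400] "GET /aboutus.html HTTP/1.1" 404 0 "http://example.net/" "Mozilla"',
]

relationships = build_relationships(sample)
for page in sorted(relationships):
    print(page + ": " + str(len(relationships[page])) + " links")
```

The repeated referrer is stored once, and the 404 request is ignored because the pattern only accepts status codes starting with 2.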

Here is the script:

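#!/usr/bin/python
# LinkingHood v0.1 alpha by Hamlet Batista 2007
#
# Mines a web server log for linking relationships: for each internal
# page, it collects the set of referring URLs.

import re

# internal page -> set of links pointing to it
relationships = {}

# Captures the requested page and the referrer of successful (2xx) GET
# requests; expects a quoted request, a status code and a quoted referrer,
# as in the Apache combined log format
p = r'[^"]+"GET\s([^\s]+)[^"]+"\s2[^"]+"([^"]+)"'

log = open('tripscan.actual_log')
lines = log.readlines()
log.close()

for line in lines:
    m = re.search(p, line)
    if m:
        (internal_page, external_link) = m.groups()
        # Skip static files and URLs with query strings
        if re.search(r'\.css|\.js|\.gif|\.jpg|\.swf|\?', internal_page):
            continue
        if internal_page not in relationships:
            relationships[internal_page] = set()
        # Skip referrals from search engines
        if re.search(r'yahoo|google|msn|live|ask', external_link):
            continue
        relationships[internal_page].add(external_link)

# Incoming link count for each page
print("Tripscan internal pages:")
for page in relationships:
    print("\t" + page + ": " + str(len(relationships[page])) + " links")

home = relationships['/']
about = relationships['/aboutus.html']

print('Home has ' + str(len(home)) + ' links')
for link in home:
    print('\t' + link)

print('About has ' + str(len(about)) + ' links')
for link in about:
    print('\t' + link)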

Here are the results from the run:

Tripscan internal pages:
/orlando.php: 2 links
/directory/money_and_finance.html: 3 links
/contact.php: 2 links
/favicon.ico: 3 links
/lasvegas.php: 2 links
/directory/services.html: 2 links
/index.php: 2 links
/directory/travel.html: 1 links
/charleston.php: 2 links
/sunburst.php: 2 links
/cancun.php: 2 links
/blank.php: 5 links
/london.php: 2 links
/discount_travel.php: 2 links
/santodomingo.php: 2 links
/directory/internet.html: 2 links
/phoenix.php: 2 links
/: 41 links
/paris.php: 2 links
/sanfrancisco.php: 2 links
/directory/drugs_and_pharmacy.html: 2 links
/honolulu.php: 2 links
/chicago.php: 2 links
/directory/general.html: 1 links
/directory/fun.html: 2 links
/sitemap.php: 2 links
/hiltongrand.php: 2 links
//: 1 links
/directory/travel2.html: 2 links
/directory/home_business.html: 1 links
/losangeles.php: 2 links
/directory/misc.html: 1 links
/jamaica.php: 2 links
/aruba.php: 2 links
/best_spa.php: 2 links
/amsterdam.php: 2 links
/puertovallarta.php: 3 links
/barcelona.php: 2 links
/newyork.php: 2 links
/submit_link.php: 2 links
/11thhour.php: 2 links
/directory/services2.html: 2 links
/neworleans.php: 2 links
/toronto.php: 2 links
/rome.php: 2 links
/directory/: 2 links
/aboutus.html: 4 links
/directory/other_resources.html: 2 links
/top_ten.php: 2 links

Home has 41 links
http://www.directorypanel.com/detail/link-3571.html
http://www.campwalden.ca/web/travel14.htm
http://chiangmai.discount-thailand-hotel.net/chmresources/travel_resources-page17.php
http://hamletbatista.com/2007/05/29/mining-you-server-log-files/
http://www.the-happy-side.com/link_description.php?cat_id=1
http://www.popularaffiliate.com/travel.html
http://www.kingbloom.com
http://energytable.com/links/shopping.html
http://hamletbatista.com/page/2/
http://www.garyknight.com/links/vacations6.html
http://www.nicepakistan.com/directory/index.php?c=14
http://www.linkdirectory.com/Travel___Vacation/Destinations/
http://www.realestateingrandrapids.com/links/recreation.html
http://whois.domaintools.com/tripscan.com
-
http://www.abccoachhire.co.uk
http://www.1americamall.com/index.php?c=22&s=201
http://www.littlemarketstreet.com/links/travel3.html
http://www.uddsprinting.com/travellinks.html
http://uddsprinting.com/travellinks.html
http://www.whois.sc/tripscan.com
http://www.cheap-air-travel-fares.info/resources9.html
http://www.siteinclusion.com/directory?logic=or&maximum=&term=mexico+vacation+central&sr=20&pp=20&cp=2
http://www.tripscan.com
http://www.goodsearch.com/Search.aspx?Keywords=vacation+packages&page=4
http://www.vts.net/links/travel3.html
http://hamletbatista.com/
http://www.search-the-world.com/search/search.php/search::cat/category::25/page::42/hpp::20/
http://linkcentre.com/search/?keyword=travel&page=4&flag=
http://www.patclarkconversions.com/links/travel2.html
http://www.goodsearch.com/Search.aspx?Keywords=www.tripscan.com&Source=mozillaplugin
http://hamletbatista.com/tag/link-building/
http://www.datingshare.com/sharelinks/travel.html
http://www.tripscan.com/directory/
http://www.webdigity.com/ws/
http://www.tripscan.com/
http://www.ottosuch.de/
http://www1.tripscan.com/hotel-deals/10015639-hotrate.html
http://www.link-exchange.ws/link-exchange/index.php?action=displaycat&catid=27&page=12&perpage=15&page=13&perpage=15
http://www.bargaintraveleurope.com/Travel_Links.htm
http://www.weboart.com/links/recreation-sports-travel.html
About has 4 links
http://www.tripscan.com
http://res99.lmdeals.com/config.html?in_origination_key=371&in_pd_key=329&SRC=10015639&SRC_AID=none&in_package_key=5034225&in_offering_key=1578846&in_slipclick=main_result&SRC=10015639&SRC_AID=none
-
http://www.tripscan.com/

One of the most common errors for people unfamiliar with Python is indentation. This code cannot just be copied, pasted into a text file, and passed to Python to run. You need to make sure the indentation (spacing) is right. I will post the code somewhere else and provide a link if this causes too much trouble.

Some readers got lost when I talked about matrices in the previous post. Linking relationships and similarly connected structures are represented, both conceptually and graphically, as graphs. A graph is an interconnected structure that has nodes and edges. In our case, the links are the edges and the nodes are the pages. One of the most common ways to express a graph is with a matrix. Similar to an Excel sheet, it has rows and columns, where the squares can be used to indicate that there is a relationship between the page in column A and the page in row C.

Matrices are great for this because one can use matrix operations to solve problems that would otherwise require a lot of memory and computing power. In order to create the matrix, we would number each unique page and each unique link. We would use the rows to represent the pages and the columns to represent the links. A 1 in a position means there is a link between the two pages, and a 0 means there is no relationship. Using numbers for the rows and columns, and ones and zeros for the values, saves a lot of memory. This makes the computation a lot more efficient. In the code I use the pages and links directly for more clarity.
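As a rough sketch of the numbering idea, the dictionary of sets can be turned into a matrix of ones and zeros using plain Python lists. The page and link names below are invented for illustration, and `to_matrix` is a hypothetical helper, not part of the script. For a large site you would likely want a sparse matrix library rather than nested lists.

```python
def to_matrix(relationships):
    """Return (pages, links, matrix); matrix[i][j] is 1 if links[j] points to pages[i]."""
    pages = sorted(relationships)                         # row number -> page
    links = sorted(set().union(*relationships.values()))  # column number -> link
    column = {link: j for j, link in enumerate(links)}
    matrix = [[0] * len(links) for _ in pages]
    for i, page in enumerate(pages):
        for link in relationships[page]:
            matrix[i][column[link]] = 1
    return pages, links, matrix


# Invented sample data, for illustration only
relationships = {
    '/': {'http://a.example/', 'http://b.example/'},
    '/aboutus.html': {'http://b.example/'},
}

pages, links, matrix = to_matrix(relationships)
for page, row in zip(pages, matrix):
    # The sum of a row is the incoming link count for that page
    print(page + ": " + str(row) + " -> " + str(sum(row)) + " links")
```

Each row sum gives the same incoming link count that the script prints, but the data is now held as numbers that matrix operations can work on.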

I hope this is not too confusing.

Update: I made a small change to include the incoming link count for each page.

In order to use the script, download Python from http://www.python.org. The script should run on Unix/Linux, Mac and Windows, but I only tested it on Linux.

1. Copy your log file to the directory where the script was saved.

2. Change the name of the log file (inside the quotes) in the line log = open('tripscan.actual_log') to the name of your log file.

3. In the command line, type: python LinkingHood.py and you should see the report.

The harder to get the link, the more valuable it is

Links that are too easy or relatively easy to get do not help much in getting traffic or authority for search engine rankings.

If your link is placed on a page where several hundred links compete for attention, potential visitors are less likely to click on it than if the page only has a few dozen links.

The value of your link source is in direct relation to how selective that source is when placing links on the page and how much traffic the source gets.  The value also declines with the number of links on the page.

Google is understood to use algorithms to measure the importance and quality of each page. PageRank was invented by Google's founders and is used to measure the absolute importance of a page. The TrustRank algorithm describes a technique for identifying trustworthy, quality pages. We cannot tell for sure to what extent Google uses this algorithm, if at all, or at least its publicly known version. What we can say, based on observation, is that they do not treat all links equally and they do not pass authority to your page from all of your link sources.
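To illustrate why the number of links on a page matters, here is a small sketch based on the originally published PageRank formula, in which a page's score is split evenly among its outbound links and scaled by a damping factor (0.85 in the original paper). This only reflects the published formula. How Google actually weighs links today is not public. The function name and the sample numbers are made up for illustration.

```python
def share_per_link(source_score, outbound_links, damping=0.85):
    """Score passed through each link under the published PageRank formula."""
    return damping * source_score / outbound_links


# Two hypothetical source pages with the same score
crowded = share_per_link(1.0, 300)   # page with several hundred links
selective = share_per_link(1.0, 24)  # page with a couple dozen links

print("Crowded page passes " + str(crowded) + " per link")
print("Selective page passes " + str(selective) + " per link")
```

Under this simplified model, a link from the selective page carries more than ten times the value of a link from the crowded one, even though both pages have the same score.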