Archive for the 'hyperlink analysis' Category

Should I cross link my sites for better rankings?

My loyal reader Jez asks a very interesting question. I am sure the same question is on the minds of others in the same situation.

Finally, I am in the process of creating multiple sites around a similar theme. I have unique content for all sites, and will host on different servers in Europe and the US, however the whois for each domain will show my name (The company I used does not allow me to hide this info).

Is the common whois likely to make much difference when I begin cross linking the sites?

Cross linking (or reciprocal linking) on a small scale (maybe 10 to 15 sites at most) should not be a major concern. I’ve seen many sites do it, and they rank for highly competitive phrases. Most of their link juice comes from non-cross-linked sites, though.

When you try to do this on a massive scale, things start to get interesting. I know this from experience.

Back in 2003 and 2004, I managed to get a couple of my sites ranking on Google for “Viagra” and most of its variations. That is one of the most competitive industries, because you can make really good money as an affiliate. I got those rankings exclusively through link exchanges. Being a developer, I created scripts to ‘borrow’ links from my competitors’ link directories and later traded links with my sites. When I hit the 5,000-link mark, my sites got banned and I lost all my rankings. Back then, Google was not as sophisticated as it is now.

Later, I carefully studied competitors that were doing a more advanced type of cross linking. They created large networks of sites that they owned, and they built complex interlinking structures to boost the rank of a few of their sites for highly competitive terms. Pair.com was a common web host, as they provided IP addresses in different class C blocks.

That worked well for a while, until Google became a registrar. It is illegal to use fake domain registration information, and by having access to the domain ownership information Google could more easily identify complex cross linking. I think they became a registrar for that sole purpose. I don’t see them selling domains in the future. They haven’t yet. Have they?

Making your cross linked domains’ registration private won’t help much either. I think registrars have access to the real information anyway, but even if I am wrong, it would be suspicious for your site to have all its inbound links coming from private registrations.

There are far more complex cross linking schemes in which a few owners cooperate to create massive collections of websites with well-planned link-boosting structures. The funny thing is that search engine researchers have already identified most of them. Check the paper “Link Spam Alliances”; it is a very interesting read.

So, if you want to cross link on a massive scale, you had better have a very intricate and complex linking plan to avoid detection.

Mining your server log files

While top website analytics packages offer pretty much anything you might need to find actionable data to improve your site, there are situations where we need to dig deeper to identify vital information.

One such situation came to light in a post by randfish of Seomoz.org. He writes about a problem with most enterprise-size websites: they have many pages with few or no incoming links and fewer pages that get a lot of incoming links. He then discusses some approaches to alleviate the problem, primarily suggesting linking to link-poor pages from link-rich ones manually, or restructuring the website. I commented that this is a practical situation where one would want to use automation.

Log files are a goldmine of information about your website: links, clicks, search terms, errors, etc. In this case, they can be of great use to identify the pages that are getting a lot of links and the ones that are getting very few. We can later use this information to link from the rich to the poor by manual or automated means.

Here is a brief explanation of how this can be done.

Here is an actual log entry to my site tripscan.com in the extended log format:

64.246.161.30 - - [29/May/2007:13:12:26 -0400] "GET /favicon.ico HTTP/1.1" 206 1406 "http://www.whois.sc/tripscan.com" "SurveyBot/2.3 (Whois Source)" "-"

First we need to parse the entries with a regex to extract the internal page (between GET and HTTP) and the linking page (the referrer), which comes after the server status code and the response size; in this case, after 206 and 1406.
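The parsing step can be sketched in Python roughly as follows. This is only a minimal sketch: the pattern and the names LOG_PATTERN and parse_entry are my own assumptions for illustration, and the regex assumes the fields appear in the same order as in the sample entry, so it may need adjusting for other server configurations.

```python
import re

# Assumed pattern for a log line laid out like the sample entry:
# ip, two ignored fields, [timestamp], "request", status, size, "referrer", ...
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<page>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)"'
)

def parse_entry(line):
    """Return (internal_page, referrer), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.group("page"), match.group("referrer")

sample = ('64.246.161.30 - - [29/May/2007:13:12:26 -0400] '
          '"GET /favicon.ico HTTP/1.1" 206 1406 '
          '"http://www.whois.sc/tripscan.com" '
          '"SurveyBot/2.3 (Whois Source)" "-"')

print(parse_entry(sample))
# ('/favicon.ico', 'http://www.whois.sc/tripscan.com')
```

In practice you would probably also skip entries whose referrer is "-" (no referrer sent) or points back to your own domain, since those do not represent external incoming links.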

We then create two maps: one for the internal pages (page and page id), and another for the external incoming-link pages (also page and page id). After that we can create a matrix that records the linking relationships between the pages. For example, matrix[23][15] = 1 means there is a link from external page id 15 to internal page id 23. This matrix is commonly known in information retrieval as the adjacency matrix or hyperlink matrix. Preferably, we want an implementation that can be operated from disk so that it scales to millions of link relationships.
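One possible way to keep the two maps and the matrix on disk is a small SQLite database, sketched below. This is my own assumption, not the only option: the table names and the build_link_store function are hypothetical. The matrix is stored sparsely, so a row (23, 15) in the links table plays the role of matrix[23][15] = 1, and missing rows are the zeros.

```python
import sqlite3

def build_link_store(pairs, db_path=":memory:"):
    """Store (internal_page, external_page) pairs as a sparse adjacency matrix.

    Pass a file name as db_path to keep the data on disk instead of in memory.
    """
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS internal (id INTEGER PRIMARY KEY, page TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS external (id INTEGER PRIMARY KEY, page TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS links (
            internal_id INTEGER,
            external_id INTEGER,
            PRIMARY KEY (internal_id, external_id)
        );
    """)
    for internal_page, external_page in pairs:
        # The two maps: each distinct page gets an id the first time it is seen.
        conn.execute("INSERT OR IGNORE INTO internal (page) VALUES (?)", (internal_page,))
        conn.execute("INSERT OR IGNORE INTO external (page) VALUES (?)", (external_page,))
        # The matrix cell: one row per (internal id, external id) link.
        conn.execute(
            "INSERT OR IGNORE INTO links "
            "SELECT i.id, e.id FROM internal i, external e "
            "WHERE i.page = ? AND e.page = ?",
            (internal_page, external_page),
        )
    conn.commit()
    return conn

conn = build_link_store([
    ("/favicon.ico", "http://www.whois.sc/tripscan.com"),
    ("/index.html", "http://www.whois.sc/tripscan.com"),
    ("/favicon.ico", "http://www.whois.sc/tripscan.com"),  # repeat hit, same link
])
print(conn.execute("SELECT COUNT(*) FROM links").fetchone()[0])
```

Repeated hits from the same referrer collapse into a single link, which matches the 0/1 entries of an adjacency matrix. If you want hit counts instead, you would add a counter column.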

Later we can walk the matrix and create reports identifying the link-rich pages (those with many link relationships) and the link-poor pages (those with few). We can set the threshold at some point (e.g. pages with more or fewer than 10 incoming links).
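The report step might look something like the sketch below. The link_report function and its threshold parameter are assumptions of mine, and for brevity it works directly on a list of (internal page, external page) pairs rather than on a stored matrix.

```python
from collections import defaultdict

def link_report(pairs, threshold=10):
    """Split internal pages into link-rich and link-poor lists.

    A page is link-rich when it has at least `threshold` distinct
    external pages linking to it, and link-poor otherwise.
    """
    sources = defaultdict(set)
    for internal_page, external_page in pairs:
        sources[internal_page].add(external_page)

    rich = sorted(page for page, refs in sources.items() if len(refs) >= threshold)
    poor = sorted(page for page, refs in sources.items() if len(refs) < threshold)
    return rich, poor

# Made-up example data: /popular has three distinct linking pages, /lonely has one.
pairs = [("/popular", "http://example.com/%d" % i) for i in range(3)]
pairs.append(("/lonely", "http://example.com/0"))

rich, poor = link_report(pairs, threshold=2)
print(rich)  # ['/popular']
print(poor)  # ['/lonely']
```

Keep in mind that pages which never show up in the log at all cannot appear in either list, so a full list of the site's pages (from a sitemap, for example) would be needed to catch pages with no incoming links whatsoever.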