What is the practical benefit of learning Google’s internals?

I forgot to start my Google inner workings series with WIIFM (“What’s in it for me?”). My plan is to write one post each week.

No matter how well I try to explain it, it is a complex subject. I should have started the first post by explaining why you would want to learn it. There are a lot of easier things to read. With some people questioning the usefulness of SEO, this is a good time to make my views clear. Please note that I believe in a solid marketing mix that includes SEO, PPC, SMO, affiliate marketing, viral marketing, etc. Do not put all your eggs in one basket.

If you have been blogging for a while, you have probably noticed that you are getting hits from the search engines for words that you did not try to optimize for. For example, the day after I started this blog, I received a comment from a reader who found my blog through a blog search! How was this possible?

Heather Paquinas May 26th, 2007 at 1:24 am

I found your blog in google blogsearch. Needless to say I subscribed right away after reading this. I always suspected what you said, especially after Mike Levin from hittail blogged about using hittail for ppc, but you really hit the nail on the head with this post.

This is possible because that is the job of the search engines! If every page you find in a search had to be optimized, there wouldn’t be billions of pages in Google’s index. It would take a lot of people to do the SEO work :-) .

Why do we need SEO then?

The answer is very simple. Not all traffic means money. If you want to make money, target competitive terms, build your brand, etc., you need to select your keywords strategically. You cannot expect search engines to rank your site in high-profit niches automatically.

Well, maybe you are that lucky. The traffic that is highly valuable is probably very competitive. It is very difficult to rank for competitive terms with no effort in direct optimization. Why? Because others are already optimizing for those terms. When there is no competition you can rank very easily.

Why learn Google internal workings and other advanced information such as patents, research papers, etc.?

Again, I only recommend this if you are targeting competitive terms and markets. If you are happy with a few hundred clicks a week, you probably don’t need this.

Advanced knowledge gives you an edge over your competitors:

1. You can read and participate in forums and blogs and know what information is useful and what isn’t.
2. You can easily find solutions to your search engine related problems.
3. You can tell if a proposed theory is possible or not.
4. If you are a black hat, you can more easily find new holes to exploit and schemes to pursue.

If major search engines keep their internals so secret, there is a $reason$.

Should I cross link my sites for better rankings?

My loyal reader Jez asks a very interesting question. I am sure the same question is on the minds of others in the same situation.

Finally, I am in the process of creating multiple sites around a similar theme. I have unique content for all sites, and will host on different servers in Europe and the US, however the whois for each domain will show my name (The company I used does not allow me to hide this info).

Is the common whois likely to make much difference when I begin cross linking the sites?

Cross linking (or reciprocal linking) on a small scale (maybe 10 to 15 sites maximum) should not be a major concern. I’ve seen many sites do it and they are ranking for highly competitive phrases. Most of their link juice comes from non-cross-linked sites though.

When you try to do this on a massive scale, things start to get interesting. I know this from experience.

Back in 2003 and 2004, I managed to get a couple of my sites ranking on Google for “Viagra” and most variations. That is one of the most competitive industries, because you make really good money as an affiliate. I got those rankings through link exchanges exclusively. Being a developer, I created scripts to ‘borrow’ links from my competitors’ link directories and later traded links with my sites. When I hit the 5,000 links mark, my sites got banned and I lost all my rankings. Back then, Google was not as sophisticated as it is now.

Later, I carefully studied competitors that were doing a more advanced type of cross linking. They created large networks of sites that they owned, and they created complex inter linking structures to boost the rank of a few of their sites for highly competitive terms. Pair.com was a common web host as they provided IP addresses in different class C blocks.

That worked well for a while–until Google became a registrar. It is illegal to use fake domain registration information, and by having access to the domain ownership information Google could more easily identify complex cross linking. I think they became a registrar with that sole purpose. I don’t see them selling domains in the future. They haven’t yet. Have they?

Making your cross linked domains’ registration private won’t help much either. I think registrars have access to the real information anyways, but even if I am wrong, it would be suspicious for your site to have all inbound links coming from private registrations.

There are far more complex cross linking schemes where a few owners cooperate in the creation of a massive collection of websites with well-planned link boosting structures. The funny thing is that search engine researchers have already identified most of them. Check the paper “Link Spam Alliances“; it is a very interesting read.

So, if you want to cross link on a massive scale, you had better have a very intricate and complex linking plan to avoid detection.

Can you trust Alexa’s numbers?

It is very important to understand that there is no way for external metrics tools such as Alexa, Compete, Ranking, Netcraft, etc. to provide accurate data.

Their information is collected from their respective toolbar usage. Alexa has a broader distribution than the others, but there are still a lot of people who don’t use those toolbars or browser plugins.

Their data is particularly useful if you are in a technical field: search and affiliate marketing, web development, etc. A large portion of your potential visitors probably have one or more of these toolbars installed.

A while ago, there was an interesting project regarding the efficacy of those metrics:

Conclusion – The Value of External Metrics

This survey represents only a tiny sampling of sites in a niche sector, albeit a relatively popular one in the blogosphere and webdev/tech space. Based on the evidence we’ve gathered here, it’s safe to say that no external metric, traffic prediction service or ranking system available on the web today provides any accuracy when compared with real numbers. Incidentally, I did log in to Hitwise to check their estimations and although I can’t publish them (as Hitwise is a paid service and doing so would the violate terms of service), I can say that the numbers issued from the competitive intelligence tool were no better than Alexa’s in predicting relative popularity or traffic estimation.

The sad conclusion is that right now, no publicly available competitive analysis tool (that we’re aware of) offers real value. Let’s hope withing the next few years, better data will be made available.

What is the problem?

In statistics, when you need a sample that represents the entire population that you are measuring, data is collected carefully and completely to avoid any bias. Unfortunately, there is no way to configure the toolbars of sites or people grouped in similar samplings. Users install them at will and the ones installing them are usually advanced users (Not your typical gardener).

Why use the data then?

In my case, the content on my blog is highly technical, so there is a high probability that most users have the Alexa toolbar or the browser plugin.

I use it for comparative purposes. By comparing my blog’s Alexa rank to that of a blog directed at a similar audience (seobythesea.com), I was able to tell whether I am on the right track.

Should you use it?

The right question to ask yourself is how technical your audience is. If you target casual readers, it might not be very useful.

Great Content + Bad Headline = Mediocre Results

You can spend a few hours researching, structuring, drafting and proofreading a great post, only to miss the mark completely by choosing a really bad title.

I recently submitted a carefully crafted rebuttal to the Seomoz article: Proof Google is Using Behavioral Data in Rankings. The post generated some controversy and some heated discussion as to the validity of the tests and results. I read everything. And, given my technical nature, I decided to dig deeper myself.

I ended up with slightly different conclusions about the experiments. If you want to find out please read the post at Youmoz.

Now, here’s the bad news.

As Kurt wisely points out, I tragically missed the mark by choosing an empty title: “Relevance feedback“.

Kurt (86)

Sat (6/16/07) at 05:38 PM

Good post… well thought out and presented… gave it a thumbs up.

Unfortunately, it will most likely get overlooked by most readers due to its title/headline.

Look at the article you’re a referencing, “Proof Google is Using Behavioral Data in Rankings“. You know that headline will bring in some clicks. It was moved to the blog of SEOmoz from the Youmoz section (even with its flawed testing and logic). The mozzers aren’t stupid… they know this type of headline and article will stir up some controversy and bring in some links.

I’m no expert copywriter… far from it. I just hate to see a good post sit on the sidelines because of a bad headline.

The title I chose did not offer the reader any incentive to click or learn more. I guess that I operate in two modes: engineer and marketer and that I forgot to flip the switch while writing this post.

First, let me state that his remarks about the mozzers are valid for most journalists, trade publications, social media sites, etc. It is human nature to judge books by their cover. If the cover is crap, the content must be crap. That is how we normally think.

Again, whether you are writing:

1. A blog post
2. A book
3. An email
4. A fax cover letter
5. An article
6. A Digg submission
7. etc.

Write titles/subjects that entice users to read further.

What can you learn from my mistake?

1. Most people scan web pages. They don’t have the time to follow each link. The title must be a call to action: “this is interesting, click to learn more”.
2. Summary/excerpt is very important too. I chose a really bad first paragraph. If you write posts as a guest for other popular blogs, you want your title and first paragraph to be cliffhangers. You must get people to click further.
3. Content importance is second to title and excerpt! This is sad, but true. While crappy content won’t get the word out, crappy titles won’t even get the word in the first place.

Deceptive titles are not a good idea

Am I suggesting you start writing bait and switch posts? Definitely not.

While controversy draws attention, writing titles that say one thing and when you read the content you find another is the best way to brand yourself as a charlatan.

Ideally, you should spend enough time carefully writing your posts (especially, if they are to be published on other websites), and spend a few minutes carefully writing the titles as well. Be creative!

The power of networking

When I started to blog (now close to three weeks ago) I did not know what to expect. I have to say that I am more than impressed with the power of blogging and networking with popular related blogs.

My topics tend to be too technical and I am well aware that it severely limits my audience. Not everybody understands what I am talking about.

I plan to change this in the coming weeks by adding illustrations to the complex topics. I am also working to move my blog away from wordpress.com to be self hosted on one of our servers. That will give me a lot more flexibility than I have now. One thing I want is the ability to link to my source code, instead of including the code in the posts. I will probably just include a flow diagram in the posts. I also want to make the scripts available for use directly from the blog so that you don’t have to install them.

What have I learned so far?

I have done several things on purpose:

1. I decided to not monetize this blog in any way. My plan is to use it exclusively for branding. You won’t see any ads or affiliate links here.
2. I don’t have any short term plans to advertise the blog in any way.

My plan is to test how well a blog can do by just writing useful and original content and by participating in other blogs and forums with useful feedback. For that I try to keep posting at least one article a day here and write articles to be published in other blogs and popular websites.

I don’t think the results are mind blowing, but compared to what I’ve read in other blogs, my numbers are looking good. My Alexa Rank for this week is around 60 thousand. I checked seobythesea.com, which is very heavy on technical content, and its traffic rank is 40 thousand.

jun07_alexa.png

Things to avoid

While commenting on other popular blogs is one of the most effective ways to get your name or brand out and potentially attract more visitors to your site, doing it wrong can prove to be a waste of time or have the opposite effect.

I often see a lot of comments that just say: ‘Nice Post. Keep it up’. This is the best way to waste your time. First of all, it doesn’t help with rankings as most comments are ‘no-followed‘. Furthermore, it will not bring traffic. How many times have you tried to find out who is writing that insightful comment ‘Nice Post’? Unless it gets really annoying, I don’t think you do. I don’t.

Carefully read what the post is about, reflect on it, and try to find out something to say that adds to the conversation. It could be confirming the post or taking an opposing view, but you need to add something. You can also ask clarifying questions, but visitors will most likely click on your site if you are adding something of value.

Blogging is informal, but that doesn’t mean you shouldn’t carefully research your posts. Citing other blogs and authority sources not only gives you more credibility, but also makes your pingbacks to other popular blogs more likely to be accepted. Even if they are ‘no-followed’, the traffic is good. You also get the opportunity of getting picked up by other blogs as well.

Advanced Adwords bidding strategies

In yesterday’s Search Day article, Are Bid Management Tools Dead?, Eric Enge shares some interesting facts and conclusions he brought back from SMX.

A solid strategy for your PPC campaigns will have the following elements:

  1. Use a bid management tool to manage the long tail of your campaign.
  2. Stay focused on your ad copy and your landing pages, because they can dramatically influence the cost and conversion rates of your campaigns.
  3. Take significant brand building terms and manage them separately
  4. Take significant “first visit search” keywords and manage them separately as well.

While I think it is no longer necessary to manage large lists of long tail keywords for PPC campaigns (thanks to broad matching options), I do see great value in bid optimizing tools for improving the ROI of your PPC campaigns.

I want to focus on one particular aspect that was brought forth. Brand building and “first visit” keywords should be managed differently, and should be left out of the automated bid management tools. They provide no direct measurable ROI, but they are definitely very important to have.

This is the strategy I use to build PPC campaigns. I take a few more variables into account, but I will provide a high-level overview of the process.

Organizing the keywords

The first step is carefully organizing the keywords. I organize them using the following criteria: whether they relate to the brand, and whether they are generic or action keywords. I further organize the keywords by how relevant they are to my business on a scale of one to five, five being the most relevant and the most likely to turn into a conversion.

Initial maximum bids for ROI (money) keywords

I estimate our initial maximum bids by doing the following:

I want no less than 100% ROI and assume at least a 1% conversion rate (the ROI and conversion rates are usually higher than that). Using the net profit per conversion and some simple math, you can determine the maximum you want to pay per click. That is for the most relevant keywords (#5 on your scale). Then, for each level down the scale, take off 20% of that top bid. For example, if the maximum bid for scale five is $1.50, for scale four it would be $1.20, for scale three it would be $0.90, etc.
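The “simple math” can be sketched in a few lines of Python. This is only my reading of the calculation, under some assumptions: ROI is taken as (profit - ad cost) / ad cost, the net profit per conversion is measured before click costs, and the $300 profit figure is a made-up number chosen so the result matches the $1.50 example. The function names are mine, not anything from AdWords.

```python
def max_cpc(net_profit_per_conversion, conversion_rate, target_roi):
    """Highest cost per click that still meets the target ROI.

    Assumes ROI = (profit - ad_cost) / ad_cost, with profit measured
    before ad spend. target_roi=1.0 means 100% ROI.
    """
    # The most we can spend on clicks to win one conversion.
    max_cost_per_conversion = net_profit_per_conversion / (1 + target_roi)
    # On average it takes 1 / conversion_rate clicks to get a conversion.
    return max_cost_per_conversion * conversion_rate


def bids_by_relevance(top_bid, levels=5, step=0.20):
    """Take `step` (a share of the top bid) off for each level down the scale."""
    return {
        level: round(top_bid * (1 - step * (levels - level)), 2)
        for level in range(levels, 0, -1)
    }


# Hypothetical numbers: $300 net profit per sale, 1% conversion, 100% ROI.
top = max_cpc(300.0, 0.01, 1.0)
bids = bids_by_relevance(top)
print(bids)  # {5: 1.5, 4: 1.2, 3: 0.9, 2: 0.6, 1: 0.3}
```

If your real conversion rate turns out higher than the assumed 1%, rerunning the same function with the measured rate gives you the adjusted maximum bids.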

As long as your conversion rate and profit assumptions hold, this strategy keeps the campaigns profitable.

Next, create four campaigns, each one at a different maximum cost per click (based on your relevance scale) and use Google Adwords Budget Optimizer. The Budget Optimizer will try to get as many clicks as possible and will manage the individual keywords’ maximum bids automatically for you. Let it run for at least a month, carefully looking at your actual conversions and ROI.

After a week or two you should have more accurate conversion rate and ROI numbers. Use them to adjust your maximum bids per campaign.

Managing brand building and “first visit search” keywords

For brand building you want as many impressions as possible, for the lowest possible cost. The content network is an excellent option for this, by using CPM ads and site targeted ads.

Now, for brand building, keyword search requires a slightly different setup. You must obtain at least a 0.5% click-through rate to prevent your ads from being disabled, but you want as many impressions as possible and very few clicks to keep your costs low.

The ads are the key. Create ads that get the right message across; don’t try to entice users to click. My strategy is to use position preference, target positions 4-6 and bid at the minimum, or the minimum necessary to keep the ads running. This guarantees your ads will remain on display and you won’t need to actively manage those campaigns. Big advertisers will usually want to be in the top positions (1-2). Position preference is not compatible with the Budget Optimizer, but for this strategy it is not necessary.

For “first visit search” keywords I use a similar strategy, but I try to get the visitors to click with an enticing ad.

Improving your ROI campaign performance with preferred cost bidding

After running your ROI keywords campaigns for a while with Budget Optimizer, you can generate reports that give you all sorts of useful information. One such piece of information is the true value of each visitor per keyword. Using this information you can have more predictable spending and ROI.

My technique is to create another campaign with preferred cost bidding, and move all best performing keywords from the Budget Optimizer campaigns. Set the preferred costs per clicks to the actual value of those clicks that you determined by running the reports. Google will automatically handle the rest for you.

Another popular and probably more efficient alternative is to use advanced bid management tools that use portfolio-based algorithms for bid management.

These techniques require some historical data to be useful, which is where the Budget Optimizer campaigns come in handy.

Google recently introduced Cost Per Acquisition bidding for the content network. I haven’t used it and can’t comment much on it. I can say that paying only for results sounds ideal for the advertiser. This is very similar to running an affiliate program.

There are many elements that are necessary to run a successful PPC campaign. For me, the ads and the landing pages are the most important ones to ensure adequate conversions and ROI. Astute bidding can give you a big competitive edge, especially in highly competitive markets.

Google’s inner workings – part 1

Google keeps tweaking its search engine, and now it is more important than ever to better understand its inner workings.

Google lured Mr. Manber from Amazon last year. When he arrived and began to look inside the company’s black boxes, he says, that he was surprised that Google’s methods were so far ahead of those of academic researchers and corporate rivals.

While Google closely guards its secret sauce, for many obvious reasons, it is possible to build a pretty solid picture of Google’s engine. In order to do this we are going to start by carefully dissecting Google’s original engine: How Google was conceived back in 1998. Although a newborn baby, it had all the basic elements it needed to survive in the web world.

The plan is to study how it worked originally, and follow all the published research papers and patents in order to put together the missing pieces. It is going to be very interesting.

Google has added and improved many things over the years. The original paper only describes the workings of the web search engine. Missing features are the ability to search news, images, documents (PDF, Word, etc.), video, products, addresses, books, patents, maps, blogs, etc.

Also missing are substantial improvements such as local search, mobile search, personalized search, universal search, supplemental index, freshness, spam detection and PageRank improvements. One thing that will be hard to know is how Google uses the data it collects through other services, like Google Toolbar, Google Analytics, Google Adsense, Doubleclick, Gmail, Gtalk, Feedburner, etc. There is a lot of information that can be used both for better ads and for better search results.

No matter the type of search you are conducting, conceptually, search engines have three key components: the crawler, the indexer and the searcher.

The job of the crawler (also known as a search engine robot) is to collect all the information that will later be searched, whether it’s images, video, text or RSS feeds. These documents are stored for later processing by the indexer module. Webmasters and site owners can control how crawlers access their websites via a robots.txt file and the robots exclusion protocol. In this file you basically tell the crawler which pages or sections it is not allowed to crawl. I posted an entry about this several days ago.
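To make the robots.txt part concrete, here is a small sketch using Python’s standard urllib.robotparser module, which is roughly what a well-behaved crawler does before fetching a page. The robots.txt rules, the example.com URLs and the “ExampleBot” user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that keeps every crawler out of two sections.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the rules above let `agent` crawl `url`."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/index.html"))           # True
print(allowed("http://example.com/private/report.html"))  # False
```

Keep in mind that robots.txt is only a request. Polite crawlers honor it, but nothing in the protocol forces them to.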

The indexer module is the one doing the heavy lifting. It has the daunting task of carefully organizing the information collected by the crawler. The power of the search engine lies in this specific task: the better classified the information is, the faster and better the search. Search engines conceptually classify documents in a way similar to how you file documents in a cabinet. Without some sort of labeling you will probably waste a lot of time finding your bank statements, notes, etc. Search engines label documents in a way that makes it easy to find them later by words or phrases (also known as keywords). In the case of text and similar documents, the indexer breaks the document down into words and collects some additional information about those words, such as the frequency of each word in the document.
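As a toy illustration of that last step (and nothing like Google’s real indexer), the following Python sketch breaks a document into words and records how often each word appears and at which positions. The tokenizing rule (lowercase letters and digits only) is my own simplification.

```python
import re
from collections import Counter


def index_document(text):
    """Split a document into words; return word frequencies and positions."""
    # Simplified tokenizer: lowercase, keep runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    frequencies = Counter(words)
    positions = {}
    for position, word in enumerate(words):
        positions.setdefault(word, []).append(position)
    return frequencies, positions


freq, pos = index_document("Bank statements and bank notes")
print(freq["bank"])  # 2
print(pos["bank"])   # [0, 3]
```

The positions are what make proximity searches possible later: two words that appear next to each other in the document have consecutive position numbers.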

The searcher module is the one that takes the user search, cleans it to remove ambiguities, misspellings, etc., finds the documents in the index that most closely match the search, and ranks them according to the current ranking formula. The ranking formula is the most closely guarded secret of all major commercial search engines.

These basic components remain the same nowadays, but as they say, the devil is in the details. Today, Google’s inner workings are far more complex than what I am going to explain, but the basic principles are the same. I will quote the original paper as necessary.

There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance-related assessements and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).

One notable improvement Google brought to the commercial search marketplace was the use of link structure and anchor/link text to improve the quality of results. This proved to be a significant factor that helped fuel their growth. Today, these elements remain significant, but Google makes use of very sophisticated filters to detect most attempts at manipulation. The successful Google bombs of late are proof that they remain significant.

To support novel research uses, Google stores all of the actual documents it crawls in compressed form

Here is the reference to the caching feature we are accustomed to using.

Now, let’s see how they define PageRank — how important or high quality are pages for Google’s search engine.

…a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo’s homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.
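The phrase “recursively propagating weights” is easier to see in code. Below is a minimal Python sketch of the idea, using the common normalized form where all ranks add up to 1. It is a teaching toy, not Google’s implementation: the five-page link graph is made up, and I use the damping factor of 0.85 that the original paper mentions as the usual setting.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the pages it links to.

    Every linked-to page must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base amount.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it cites.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: three pages cite "target", one hub cites all three.
graph = {
    "hub": ["a", "b", "c"],
    "a": ["target"],
    "b": ["target"],
    "c": ["target"],
    "target": ["hub"],
}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True)[:2])
```

In this little graph “target” ends up with a much higher score than “a”, “b” or “c”, because it is cited by all three of them while each of them is cited only once.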

Here is a clear description of why they use anchor text for searching.

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
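Here is a small Python sketch of what “associate it with the page the link points to” could look like. The link tuples and URLs are invented, and a real engine stores far more than a word-to-URL mapping, but it shows how an image that cannot be indexed as text becomes findable through the words other pages use to link to it.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Index each link's words under the page the link points TO.

    `links` is an iterable of (source_url, target_url, anchor_text).
    """
    index = defaultdict(set)
    for source_url, target_url, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target_url)
    return index


# Made-up links: two pages point at an image and describe it in the link text.
links = [
    ("http://a.example/", "http://b.example/logo.png", "Company logo"),
    ("http://c.example/", "http://b.example/logo.png", "official logo"),
]
anchor_index = index_anchor_text(links)
print(anchor_index["logo"])  # {'http://b.example/logo.png'}
```

This is also why Google bombs work: the words in the links count for the target page even if those words never appear on it.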

Now, let’s read about on-page elements that Google considered and that were not in regular use back then: proximity, capitalization and font weight, and page caching.

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font weigh heavier than other words. Third, full raw HTML of pages is available in a repository.

Now let’s take a big-picture view of how everything fits together. This is very technical, but I will try to explain it the best I can.

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of “barrels”, creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.

Google uses distributed crawlers/downloaders. If you have ever looked at your server log files, you will have noticed that when Googlebot visits your site the hits come from different IPs. That is because the crawling is distributed among several computers. A URL server coordinates the crawling effort by feeding the crawlers the URLs to download.

All the fetched pages are sent to the storeserver for compression and storage, and each one is assigned an ID (docID). For computers, it is easier and more efficient to use numbers to refer to things.

The indexer does some heavy work:

  • Reads, uncompresses and parses documents. Converts documents into hits (word occurrences)
  • Creates partially sorted forward indexes.
  • Creates the anchors file (link text, and to and from links). URLresolver fixes relative URLs and assigns docIDs.
  • Includes anchor text in the forward index, but using the page the link points to as the docID. This associates the text in the link with the document it points to.
  • Maintains a links database used to compute PageRanks
  • Generates a lexicon–the list of all different words in the index

Basically, the forward index allows you to find the words of a document given the docID. In order to be useful for searching, this needs to be inverted, i.e., to find documents by their words. The sorter does this additional step, assisting the indexer in creating an inverted index that uses wordIDs as keys to the docIDs. The sorter also produces a list of wordIDs and their offsets into the inverted index. DumpLexicon uses that list to update the lexicon used by the searcher.
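The difference between the two indexes is easy to show with plain Python dictionaries. This is a sketch only: I use the words themselves instead of wordIDs, the docIDs are made up, and the real sorter works on barrels on disk rather than on in-memory dictionaries.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {docID: [words]} into {word: [docIDs]}, docIDs in ascending order."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        # set() so a word repeated in a document lists that docID only once.
        for word in set(forward_index[doc_id]):
            inverted[word].append(doc_id)
    return dict(inverted)


# Hypothetical forward index: which words each document contains.
forward = {
    1: ["google", "search"],
    2: ["search", "engine", "search"],
}
inverted = invert(forward)
print(inverted["search"])  # [1, 2]
print(inverted["google"])  # [1]
```

With the forward index you would have to scan every document to answer “which pages contain ‘search’?”. With the inverted index it is a single lookup.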

Finally, the searcher combines the lexicon, the inverted index and the PageRanks to respond to queries.
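A heavily simplified searcher might look like the Python sketch below. It assumes an inverted index that maps words to docIDs and a table of PageRank scores, both filled with made-up values, and it ranks by PageRank alone. The real ranking formula combines many more signals (proximity, font size, anchor text and so on), so treat this purely as an illustration of how the pieces fit together.

```python
def search(query, inverted_index, pageranks):
    """Return docIDs containing every query word, highest PageRank first."""
    words = query.lower().split()
    if not words:
        return []
    # Start with the documents for the first word, then narrow down.
    matches = set(inverted_index.get(words[0], []))
    for word in words[1:]:
        matches &= set(inverted_index.get(word, []))
    return sorted(matches, key=lambda doc_id: pageranks.get(doc_id, 0.0),
                  reverse=True)


# Made-up data for three documents.
toy_index = {"search": [1, 2, 3], "engine": [2, 3]}
toy_pageranks = {1: 0.5, 2: 0.2, 3: 0.3}

print(search("search engine", toy_index, toy_pageranks))  # [3, 2]
```

Document 1 has the highest PageRank but does not contain “engine”, so it is left out. A high score does not help a page that does not match the query.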

Next, I’ll describe each of the processes in more detail. Can’t wait? Read the document yourself and draw your own conclusions :-)

Protecting your privacy from Google with Squid and FoxyProxy

There is no doubt about it; this has definitely been Google’s Privacy Week.

Relevant news:

Privacy International’s open letter to Google
Danny Sullivan defending Google
Matt Cutts defending his employer
Google’s official response (PDF letter)
Google Video flaw exposes user credentials

It’s only human nature to defend ourselves (and those close to us) when we are under public scrutiny. I am not surprised to see Matt or Danny stand behind Google on this matter. I do think it is far wiser and more beneficial to look into criticism and determine for ourselves what we can do to remedy it. I am glad to see that Google took this approach in their official response:

After considering the Working Party’s concerns, we are announcing a new policy: to anonymize our search server logs after 18 months, rather than the previously-established period of 18 to 24 months. We believe that we can still address our legitimate interests in security, innovation and anti-fraud efforts with this shorter period … We are considering the Working Party’s concerns regarding cookie expiration periods, and we are exploring ways to redesign cookies and to reduce their expiration without artificially forcing users to re-enter basic preferences such as language preference. We plan to make an announcement about privacy improvements for our cookies in the coming months.

You can take any side you want. But, I feel that none of the people covering this topic has addressed two critical issues: 1) How do you opt-out of data collection by Google or other search engines at will? 2) And, do you want to wait 18 months for your data to be anonymized?

My answer is no. You can anonymize your data with Squid, configured as an anonymous proxy, and using the FoxyProxy Firefox extension. This solution assumes that you have access to a Linux box, where you can install Squid (if it’s not already installed), and you can make the relevant changes to the proxy’s configuration. If you don’t have such access, there is a Windows port or you might search for a good anonymous proxy service. I don’t use any and can’t point you to any one in particular.

What is an anonymous proxy?

A proxy is basically a piece of server software that intercepts all your traffic before it reaches its destination. Proxies usually do something with your traffic: they can filter it, modify it, block it, cache it, etc. There are different types of proxies, depending on the Internet protocol they serve (HTTP, SMTP, etc.) and the protocol used to access them (SOCKS5, HTTP, etc.). Your web browser comes with all you need to access proxy servers.

If you use a proxy server, the IP address of your proxy is the one that is going to show up in the logs of the sites you visit. That is, unless their scripts read the HTTP server variable HTTP_X_FORWARDED_FOR. This variable contains your original IP address, followed by the addresses of any other proxies the request passed through on its way to the current server. Typical proxies pass this information along.

An anonymous proxy server is a proxy server that does not pass HTTP_X_FORWARDED_FOR and removes all identifying information about the user from the requests it forwards. JavaScript makes it really difficult to be completely anonymous, so it is often a good idea to turn it off if you don’t need it. Squid is an open source proxy server. You can find the installation instructions here. To make it an anonymous proxy, add the following lines to your squid.conf file, and restart the server:

forwarded_for off
anonymize_headers allow Allow Authorization Cache-Control
anonymize_headers allow Content-Encoding Content-Length
anonymize_headers allow Content-Type Date Expires Host
anonymize_headers allow If-Modified-Since Last-Modified
anonymize_headers allow Location Pragma Accept Charset
anonymize_headers allow Accept-Encoding Accept-Language
anonymize_headers allow Content-Language Mime-Version
anonymize_headers allow Retry-After Title Connection
anonymize_headers allow Proxy-Connection

fake_user_agent Mosaic/0.1 (CP/M; 8-bit)
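To show what these lines do to a request, here is a rough Python sketch of the same allow-list idea. It only illustrates the effect and is not how Squid is implemented; the function name and the sample request are hypothetical. Directive names also vary between Squid releases, so check them against the documentation for your version.

```python
# Header names copied from the allow list in the squid.conf lines above
ALLOWED = {name.lower() for name in (
    "Allow", "Authorization", "Cache-Control", "Content-Encoding",
    "Content-Length", "Content-Type", "Date", "Expires", "Host",
    "If-Modified-Since", "Last-Modified", "Location", "Pragma", "Accept",
    "Charset", "Accept-Encoding", "Accept-Language", "Content-Language",
    "Mime-Version", "Retry-After", "Title", "Connection", "Proxy-Connection",
)}
FAKE_USER_AGENT = "Mosaic/0.1 (CP/M; 8-bit)"


def anonymize(headers):
    """Drop every header not on the allow list, then fake the user agent."""
    kept = {name: value for name, value in headers.items()
            if name.lower() in ALLOWED}
    kept["User-Agent"] = FAKE_USER_AGENT
    return kept


# Hypothetical outgoing request headers
request_headers = {
    "Host": "www.google.com",
    "User-Agent": "Firefox",
    "Cookie": "id=123",
    "X-Forwarded-For": "10.0.0.5",
    "Accept-Language": "en",
}
print(anonymize(request_headers))
```

Anything not explicitly allowed, such as cookies and X-Forwarded-For, never reaches the destination server.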

Now that you have done the hardest part and you have an anonymous proxy server, you need to set up your client to take advantage of it. Since I use Firefox as my browser, I will use it to illustrate my point.

Why use the FoxyProxy extension, if I can simply set up my own anonymous proxy?

You might only want to enable the proxy while you are using Google and other search engines, and not when you are browsing elsewhere. FoxyProxy is very handy for this. Do the following to use it:

  1. Install FoxyProxy
  2. Add your anonymous proxy via the Add New Proxy button. (Squid port is usually 3128)
  3. On the Patterns tab for your proxy, define one wildcard pattern for each search engine, e.g. *.google.com
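The pattern idea can be sketched in a few lines of Python. This is only an illustration of wildcard routing, not FoxyProxy's own matching code: it matches host names only, and the proxy address and pattern list are assumptions.

```python
from fnmatch import fnmatchcase

PROXY = "localhost:3128"  # assumed address of the Squid proxy
PATTERNS = ["*.google.com", "*.yahoo.com"]  # one wildcard per search engine


def proxy_for(host):
    """Return the proxy to use for this host, or None to connect directly."""
    host = host.lower()
    if any(fnmatchcase(host, pattern) for pattern in PATTERNS):
        return PROXY
    return None


print(proxy_for("www.google.com"))  # goes through the proxy
print(proxy_for("example.com"))     # direct connection
```

Note that in this sketch *.google.com does not match the bare host google.com, because the pattern requires the dot. If you want both covered, add a second pattern for the bare domain.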

Now you can surf the web with far more confidence that Google and other search engines will not be able to link your searches to your IP address. For maximum protection, do not log in to any Google service while the proxy is on. You might also want to install a few more anonymizing proxies and add them to your FoxyProxy configuration.

The power of sharing

While most developers and technical people are used to sharing useful information, most entrepreneurs and consultants do not, or share very little. The logic is: “why share information if you can charge for it?”

Let me give you my thoughts on this. I’ve been on both sides of the fence, so I can offer a unique perspective.

Right after college, back in 1996, I landed a job as a Windows C++ software developer. I remember that I used to spend 20-30% of my time reading newsgroups, looking for other developers facing the same compiler errors that I was facing. This was far easier and less time consuming than trying to figure out the problem myself. Occasionally, I did have to solve some difficult problems on my own; however, the newsgroups proved to be a very valuable resource.

I met Linux while at college and I immediately fell in love with all things open source. I remember downloading “Slackware” over a 28kb/s line and copying it to 700 floppy disks, worried that they might remove it and I wouldn’t be able to download it later. I did not think this free OS would last long. I’m glad I was wrong.

I’ve come across colleagues that protected their knowledge with iron claws. They felt that having their knowledge out in the open would make them replaceable. They did this to protect their jobs.

Benefits of Sharing

On my next job, I worked as a Solaris/Linux system administrator and I was surprised to learn that the senior administrator was not sharing information with others. His reasoning was that if he shared, others would take his place. Paradoxically, I became the senior admin by doing exactly the opposite. As I was willing to share, others listened to my opinions, I gained more responsibility, and obtained better and higher paying positions.

Before I started my first company, I was very comfortable with sharing. Being a Linux/open source fan, there was no point in keeping things to myself. Later, when I started to learn marketing, I realized that while developers and technical people are more prone to share, marketers were not so altruistic. The affiliate marketer who inspired me to move my young company in the right direction would not share a single bit of information for competitive reasons. I’m glad that the few words he did say were enough for me to find the right path (you can find them on my About page).

Why would I tell a potential customer how to solve their problem? The customer can pretty much do it on his or her own after learning how to do it.

My basic principle is that there are other things to earn besides money. Branding is one of them.

I firmly believe that the easiest way to receive is to give. Try to share as much as you can, but first try to have a sustainable business model.

I think that now it is easier and cheaper than ever to create a start-up with no external funding. Even if you have the money to spend, it is wiser to go low budget. What you are trying to build with a lot of money has probably already been built by somebody else, probably using open source software and free content that provide the same value.

Now, coming up with a viable business model is increasingly difficult. Some experts offer advice for free and make you buy their e-books, others offer e-books for free and sell the tools, others offer tools and e-books for free and sell advice or ads. How can you come up with a winning formula in such a market? Over-delivering and offering unique value is one way to achieve this.

Every rule has its exception, so sharing your information is NOT always a good idea.

One of my developers faced this predicament while doing some after-hours freelance work for one of the companies I used to work for. They tried multiple times to squeeze that information out of him, in order to avoid having to pay him for his services. I advised him not to tell them how to solve their problem. Sharing this information with them would not be “economically sustainable”.

Find out what you can share and what you can’t, but please, start sharing!

 

Why Viralinks are a waste of time?

I’m new to blogging, and I’m catching up with a lot of interesting things. One of them is the Viralink, coined by Andy Coates.

I was exposed to the concept while reading John’s blog. One of the readers mentioned he was trying out a Viralink on his blog and he was getting a little bit of traffic.

What is a Viralink? A Viralink is basically a new scheme to build up the PageRank of the participating sites. The instructions at Andy’s blog explain everything better.

———copy and paste the Viralink and instructions below this line———

Below is a matrix of 120 stars, I have already added a link to my blog onto one of the stars, all you need to do is copy and paste the grid into your blog and add your own link to one of the other spare stars, and tell others to do the same!

Viralink

********************
*
*******************
********************
**
******************
********************
******
**************

When I receive a ping back once you have added the Viralink to your site I will add your link to this grid, and each person who copies the grid from here will also link to your site!

Rules
No Porn Sites
Only 1 link per person (i.e don’t hog the viralink!)
Please don’t tamper with other peoples url’s
Enjoy!

———copy and paste the Viralink and instructions above this line———

I have to admit that it is a very clever idea. By participating in a Viralink, you can potentially get hundreds or thousands of links, and a very nice PageRank.

Now, let me give you the specific reasons why I think this is risky, and pretty much a waste of time.

  • There is no direct or indirect benefit for your readers. The links don’t even have text or descriptions. You can’t expect readers to mouse over the links, and try to guess from the URL whether they want to click to the linked page or not. This is simply designed to fool the search engines.
  • There is no anchor text benefit. I am not sure who wants to be #1 for the highly popular phrase ‘*’.

Quality guidelines

These quality guidelines cover the most common forms of deceptive or manipulative behavior, but Google may respond negatively to other misleading practices not listed here (e.g. tricking users by registering misspellings of well-known websites). It’s not safe to assume that just because a specific deceptive technique isn’t included on this page, Google approves of it. Webmasters who spend their energies upholding the spirit of the basic principles will provide a much better user experience and subsequently enjoy better ranking than those who spend their time looking for loopholes they can exploit.

If you believe that another site is abusing Google’s quality guidelines, please report that site at http://www.google.com/contact/spamreport.html. Google prefers developing scalable and automated solutions to problems, so we attempt to minimize hand-to-hand spam fighting. The spam reports we receive are used to create scalable algorithms that recognize and block future spam attempts.

Quality guidelines – basic principles

  • Make pages for users, not for search engines. Don’t deceive your users or present different content to search engines than you display to users, which is commonly referred to as “cloaking.”
  • Avoid tricks intended to improve search engine rankings. A good rule of thumb is whether you’d feel comfortable explaining what you’ve done to a website that competes with you. Another useful test is to ask, “Does this help my users? Would I do this if search engines didn’t exist?”
  • Don’t participate in link schemes designed to increase your site’s ranking or PageRank. In particular, avoid links to web spammers or “bad neighborhoods” on the web, as your own ranking may be affected adversely by those links.
  • Don’t use unauthorized computer programs to submit pages, check rankings, etc. Such programs consume computing resources and violate our Terms of Service. Google does not recommend the use of products such as WebPosition Gold™ that send automatic or programmatic queries to Google.
  • Your competitor will see this and report you.
  • This is very easy for search engines to detect automatically. They just need to look for blocks of links with ‘*’ in their anchor text.
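To show how little work automatic detection takes, here is a minimal Python sketch that counts links whose anchor text is a lone ‘*’. It is my own illustration, not anything a search engine has published, and the threshold of 10 is an arbitrary assumption.

```python
from html.parser import HTMLParser


class StarLinkCounter(HTMLParser):
    """Count <a href=...> links whose anchor text is just '*'."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.text = []
        self.star_links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.in_link = True
            self.text = []

    def handle_data(self, data):
        if self.in_link:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            if "".join(self.text).strip() == "*":
                self.star_links += 1
            self.in_link = False


def looks_like_viralink(html, threshold=10):
    """Flag pages carrying a block of '*' links (threshold is a guess)."""
    parser = StarLinkCounter()
    parser.feed(html)
    parser.close()
    return parser.star_links >= threshold


grid = '<a href="http://a.example/">*</a>' * 12
print(looks_like_viralink(grid))
```

If a dozen lines of code can flag the grid, it is safe to assume the search engines can do it too.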

Doing this is probably very time consuming. Why not spend the time creating useful content that attracts links naturally (Linkbait)?

Dynamic Keyword Insertion for Landing Pages

One critical aspect of highly successful search marketing campaigns is making sure searchers find what they are looking for. I posted about this before.

To accomplish this, we first need to grab the visitors’ attention, get them to click through to our pages, and ensure that the pages’ content matches the search.

Whether you are doing SEO or PPC, it is imperative that your ads (title and description if SEO) include the search terms.

Advanced PPC management platforms (such as AdWords) provide a very useful feature for this purpose: Dynamic Keyword Insertion (DKI). This feature helps the advertiser create dynamic ads that automatically include the queried keywords in the ad copy.

DKI works by placing placeholder text (e.g., {Widget}) where you want the keywords to be included. A typical ad that says “Buy Widget” will say the same, no matter what the user is searching for. With DKI, in the ad “Buy {Widget}”, the braces and the text inside them are replaced with whatever the user types in the search box. If he or she types “blue widgets”, the ad will say “Buy Blue Widgets”, etc. This is very useful. DKI can be used to replace all the text in an ad (the title, the text and the landing page URL). Jennifer Slegg wrote an interesting article on using DKI for changing the URL of the landing page in the PPC ad.
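The replacement logic itself is simple. Here is a minimal Python sketch of the idea, using the simplified {Widget} placeholder from the example above. The function name is hypothetical, and AdWords’ actual placeholder syntax and fallback rules are more involved than this.

```python
import re


def insert_keyword(ad_text, query):
    """Replace each {placeholder} with the query, or keep the default text."""
    def replace(match):
        default = match.group(1)  # the text between the braces
        return query.title() if query.strip() else default

    return re.sub(r"\{([^{}]*)\}", replace, ad_text)


print(insert_keyword("Buy {Widget}", "blue widgets"))
print(insert_keyword("Buy {Widget}", ""))
```

When there is no query to insert, the ad falls back to the default text inside the braces, so it always reads sensibly.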

The point is that the closer the ad is to the search query, the more likely the visitor is going to click on it. In addition to this, Google highlights the keywords if they match the query. This helps a lot too.

Now, what happens when the visitor gets to the landing page? Well, chances are that the page will not include the exact keywords the visitor used to conduct the search; especially, if you are doing PPC. In order to fix this, I use a very simple technique: I use client-side or server-side dynamic pages to get the keywords from the referring URL (search engine) and to automatically replace those keywords in the landing page copy.

This can be done with JavaScript on the client side, or with PHP, ASP, etc. on the server side. I will use Django and Python to illustrate this concept. It’s been a while since the last time I wrote something in PHP or Perl, and if you’ve read some of my previous posts, you probably know that I am a big fan of Python. Let me tell you that I am a big fan of Django too. Django is a rapid web development framework, similar to Ruby on Rails or CakePHP. It makes web development a lot more productive.

The logic in my code is to capture the search query in the view code, pass the captured keywords to the template, and let the template dynamically insert the keywords where appropriate in the landing page text. We will borrow the relevant code to parse the keywords from my previous post.

Django uses four main files for a basic web application: models.py, urls.py, views.py and template.html. We are not going to use a database, so we don’t need the file models.py. Please visit http://www.djangoproject.com to get help in setting up Django and a simple web application.

The mysite’s project files

urls.py contents:

from django.urls import path

from mysite import views

urlpatterns = [
    path('landing_page/', views.landing_page),
]

views.py contents:

import re
from urllib.parse import urlparse, parse_qsl

from django.shortcuts import render

# hosts we treat as search engines
SEARCH_ENGINES = re.compile(r'(?:^|\.)(?:google|yahoo|msn|ask)\.', re.IGNORECASE)


def landing_page(request):
    keywords = ''  # stays empty when the visitor did not come from a search
    referrer = request.META.get('HTTP_REFERER', '')
    elements = urlparse(referrer)

    # check that the referrer is a search engine and has a query string
    if SEARCH_ENGINES.search(elements.netloc) and elements.query:
        # break the query string into (parameter, value) pairs
        for param, value in parse_qsl(elements.query):
            if param in ('p', 'q'):  # Yahoo uses p; Google, MSN and Ask use q
                keywords = value
                break

    return render(request, 'page.html', {'keywords': keywords})

page.html (template) contents:

<html>
<head>
    <title>{{ keywords }}{# here we dynamically replace the page title with the searched for keywords #}</title>
</head>
<body>
    <img src="sitelogo.gif" alt="Logo" />
	<h1>Articles for {{ keywords }}<!-- here we dynamically replace part of a heading with the searched for keywords --></h1>

</body>
</html>

When somebody lands on the dynamic page coming from a search engine, the HTTP_REFERER header is analyzed to extract the keywords. The keywords are then passed to the HTML template and appear anywhere we want them to. (We only need to include {{ keywords }} as a placeholder.) This is very similar to the DKI functionality of Google AdWords, but for landing pages.

Preventing duplicate content issues via robots.txt and .htaccess

Rand of SEOmoz.org posted an interesting article on duplicate content issues. He uses the typical blog to show different examples.

In a blog, every post can appear in the home page, pagination, archives, feeds, etc.

Rand suggests the use of the meta robots tag “noindex”, or the potentially risky use of cloaking, to redirect the robots to the original source.

Joost de Valk recommends WordPress users change some lines in the source code to address these problems.

There are a few items I would like to add to the problem and to the proposed solution.

As willcritchlow asks, there is also the problem of multiple URLs leading to the same content (e.g., www.site.com, site.com, site.com/index.html, etc.). This can be fixed by using HTTP redirects and by telling Google what our preferred domain is via Webmaster Central.

Reader roadies recalls reading about a robots.txt and .htaccess solution somewhere. That gave me the inspiration to write this post.

After carefully reviewing Google’s official response to the duplicate content issue, it occurred to me that the problem might not be as bad as we think.

What does Google do about it?
During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in “regular” and “printer” versions and neither set is blocked in robots.txt or via a noindex meta tag, we’ll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering — rather than ranking adjustments … so in the vast majority of cases, the worst thing that’ll befall webmasters is to see the “less desired” version of a page shown in our index.

Basically, Google says that unless we are trying to do something purposely ill intended (like ‘borrowing’ content from other sites), they will only toss out duplicate pages. They explain that their algorithm automatically detects the ‘right’ page and uses that to return results.

The problem is that we might not want Google to choose the ‘right’ page for us. Maybe they are choosing the printer-friendly page and we want them to choose the page that includes our sponsors’ ads! That is one of the main reasons, in my opinion, to address the duplicate content issue. Another thing is that those tossed out pages will likely end up in the infamous supplemental index. Nobody wants them there :-) .

One important addition to Rand’s article is the use of robots.txt to address the issue. One advantage this has over the meta robots tag “noindex” is in the case of RSS feeds. Web robots index them and they contain duplicate content, but the meta tag is intended for HTML/XHTML content and the feeds are XML content.

If you read my post on John Chow’s robots.txt file, you probably noticed that some of the changes he made to his file were precisely to address duplicate content issues.

Now, let me explain how you can address duplicate content via robots.txt.

One of the nice things about Google’s bot is that it supports pattern matching. This is not part of the robots exclusion standard. Other web bots probably don’t support it.

As I am a little bit lazy, I will use Googlebot for the example as it will require less typing.

User-Agent: Googlebot
# Prevents Google's robot from accessing paginated pages
Disallow: /page/*
# Some blogs use dynamic URLs for pagination.
# For example: http://www.seomoz.org/blog?show=5
Disallow: /*?*
# Prevents Googlebot from accessing the archived posts.
# The trailing $ anchors the match at the end of the URL. Without it,
# /2007/06 is a prefix match and would block the posts as well,
# i.e. /2007/06/06/advanced-link-cloaking-techniques/
# For the same reason it is not a good idea to use * here, like /2007/*
Disallow: /2007/05/$
Disallow: /2007/06/$
# Prevents Googlebot from accessing the feeds
Disallow: /feed/
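Before uploading rules like these, it helps to test them against a few real URLs. Here is a minimal Python sketch of Googlebot-style matching: a rule is a prefix match, * stands for any run of characters, and a trailing $ anchors the rule at the end of the URL. The function names are my own, and the sketch ignores Allow rules and rule precedence.

```python
import re


def rule_to_regex(rule):
    """Translate a Googlebot-style Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # escape everything, then turn the escaped * back into "any characters"
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.compile("^" + pattern + ("$" if anchored else ""))


def is_blocked(path, rules):
    """True if any Disallow rule matches the URL path (plus query string)."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


post = "/2007/06/06/advanced-link-cloaking-techniques/"
print(is_blocked(post, ["/2007/06"]))    # a plain prefix also blocks the post
print(is_blocked(post, ["/2007/06/$"]))  # anchored rule leaves the post alone
```

Running your own post URLs through a check like this shows quickly whether an archive rule is catching more than you intended.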

To address print-friendly pages duplication, I think the best solution is to use CSS styles.

Now, let’s see how you can address the problem of the same content accessible from multiple URLs, by using .htaccess and permanent redirects. This assumes you use Apache and mod_alias. More complex manipulation can be achieved via mod_rewrite.

You just need to create a .htaccess file in your website’s root folder with this content:

RedirectPermanent /index.php http://www.site.com/

Or alternatively:

Redirect 301 /index.php http://www.site.com/

Or, in the event that you plan to use regular expressions, try this:

# this matches Index.php and index.php
RedirectMatch 301 ^/[Ii]ndex\.php$ http://www.site.com/

Google allows you to tell them what your preferred canonical name is (i.e., site.com vs www.site.com) via Webmaster Central, so the following step is no longer strictly necessary, at least if your only concern is Google.

To force all access to your site to include www in the URL (i.e., http://www.site.com instead of http://site.com), you can use redirection via the .htaccess file:

RewriteEngine On
RewriteBase /
# Redirect http://site.com to http://www.site.com
RewriteCond %{HTTP_HOST} !^www\.site\.com$ [NC]
RewriteRule ^(.*)$ http://www.site.com/$1 [L,R=301]


As I said, these additional lines are probably unnecessary, but it doesn’t hurt to add them.
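If you want to sanity check which URLs should end up redirected, the combined effect of the redirects can be sketched in Python. This models the intended outcome only, not how Apache evaluates the directives; the function name is hypothetical and www.site.com is the example host used throughout this post.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.site.com"  # the preferred host from the examples above


def canonical_url(url):
    """Return the URL a 301 should point to, or None if already canonical."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path in ("/index.php", "/Index.php"):
        path = "/"  # the home page has a single address
    target = urlunsplit(
        (parts.scheme, CANONICAL_HOST, path, parts.query, parts.fragment)
    )
    return None if target == url else target


for url in ("http://site.com/about/", "http://www.site.com/index.php",
            "http://www.site.com/"):
    print(url, "->", canonical_url(url))
```

Every variant of an address should map to exactly one target, and the target itself should map to None, which means no further redirect.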

Update: Reader identity correctly pointed out that secure pages (https) can cause duplicate content problems.  I was able to confirm that at least Google is indexing secure pages.

To solve this, I removed the redirection lines from the .htaccess file and I recommend you use a separate robots.txt for https://www.site.com with these few lines:

User-Agent: Googlebot
# Prevents Google's robot from accessing all pages
Disallow: /


							

Advanced link cloaking techniques

The interesting discussion between Rand and Jeremy had me thinking about some of the things affiliates do to protect their links. I am talking about link cloaking — the art of hiding links.

We can hide links from our potential customer (in the case of affiliate links), and we can hide them from the search engines as well (as in the case of reciprocal links, paid links, etc.).

While I think cloaking affiliate links to prevent others from stealing your commissions is useful, I am not encouraging you to use the techniques I am about to explain. I certainly think it is very important to understand link cloaking in order to protect yourself when you are buying products, services or links.

When I am reading a product endorsement, I usually mouse over the link to see if it is an affiliate link. Why? I don’t mind the blogger making a commission, but if I see he or she is trying to hide it via redirects, JavaScript, etc., I don’t perceive it as an endorsement. I feel it is a concealed ad. When I see <aff>, editor’s note, etc., I feel I can trust the endorsement.

Another interesting technique is the cloaking of links to the search engines. The reasoning behind this concept is so that your link partners think you endorse them, but you tell the search engines that you don’t. Again, I am not supporting this.

Cloaking links to the potential customers.

Some of the techniques I’ve seen are:

Script redirects – the use of a simple script that takes a key (e.g., merchant=eBay), pulls the affiliate link from a database or from an in-line dictionary (programming term), and sends the visitor’s browser an HTTP 302 or 301 (temporary or permanent redirect) to the merchant site.

Meta refreshes – the use of blank HTML pages with the meta refresh tag and the affiliate tracking code embedded. This is very popular.

In-line JavaScript – the use of JavaScript to capture the mouseover and right-click events on the target link, in order to make the status bar display the link without the tracking code. I feel this one is very deceptive.

Encoding URLs – the use of HTML character entities or URL encoding to obfuscate the tracking links or tracking codes from your visitors. This works because browsers understand the encoding and humans are unable to understand them without some work.

JavaScript + image links – This is really advanced. I haven’t seen this being used much. The idea is to use JavaScript to capture the onclick event and have the code pull a transparent image before transferring control to the new page. The trick is that the URL of the transparent image is in reality a tracking script that receives the tracking code as a parameter or as part of the URL.

Cloaking links to the search engines.

These are some of the techniques I’ve seen:

Use of the rel="nofollow" anchor attribute. I would not say this is technically cloaking, but the results are the same. Search engines (Google, Yahoo and Live) will not count those links.

Use of the nofollow and/or noindex meta tag. There is a slight difference between the use of nofollow in the anchor tag vs the meta robots tag. When used in the robots meta tag it means: “do not follow the links on this page”. When used in the anchor tag it tells the search engine “do not consider my link for your scoring” (this link is not a vote/endorsement).

Crawler user agent check. This consists of detecting the search engine crawler via the HTTP User-Agent header and hiding the link or presenting the search engine a link with rel="nofollow". Normal visitors will not see this.

Crawler IPs check. Black hat SEOs keep a list of search engine crawler IP addresses to make cloaking more effective. While search engine crawlers normally announce their presence via the user agent header, they don’t when they are running cloaking detection checks. Keeping a record of crawler IPs helps detect them.

Once the crawler is detected, the same technique I just mentioned is used to hide the target links.

Robots.txt disallow. Disallowing search engine crawlers access to specific sections of your website (i.e., link partner pages) via robots.txt is another way to effectively hide those links from the search engines.

The use of robots-nocontent class. This is a relatively new addition (only Yahoo supports this at the moment). With this CSS class, you can tell the Yahoo crawler that you don’t want it to index portions of a page. Hiding link sections is another way to cloak links.

Robots.txt disallow + crawler IPs check. I haven’t seen this being used, but it’s technically possible. The idea is to present search engines a different version of your robots.txt file than you present to users. The version you present to the search engines prohibits the sections of your site where the links you want to hide are. You detect the search engine robot either by the user agent or by a list of known robots’ IP addresses. Note that you can prevent the search crawler from caching the robots.txt file, making detection virtually impossible.

Now, as I said before, I am not exposing these techniques to promote them. On the contrary, here’s why it’s important to detect them.

The best way to detect cloaked links is to look closely at the HTML source and the robots.txt file, and especially at the cached versions of those files. If you are buying or trading links for organic purposes (assuming you don’t get reported as spam by your competitors), don’t buy or trade links that use any of these techniques or that prevent the search engines from caching the robots.txt file or page in question.
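Part of that inspection can be automated. The following Python sketch checks a page’s HTML for nofollow links and a robots meta tag, and checks a robots.txt file for a disallow that covers the page. The class and function names are my own, the sample page is made up, and the sketch cannot see cloaking that depends on user agent or IP address, since it only looks at the copy you give it.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class LinkAudit(HTMLParser):
    """Collect signals that a page's links will not pass any value."""

    def __init__(self):
        super().__init__()
        self.meta_robots = set()
        self.nofollow_links = []
        self.followed_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.meta_robots.update(v.strip().lower() for v in content.split(","))
        elif tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            bucket = self.nofollow_links if "nofollow" in rel else self.followed_links
            bucket.append(attrs["href"])


def audit(html):
    """Parse a page and return the collected link signals."""
    parser = LinkAudit()
    parser.feed(html)
    parser.close()
    return parser


def blocked_by_robots(robots_txt, url, agent="Googlebot"):
    """True if the given robots.txt text disallows the URL for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# Made-up partner page for illustration
page = ('<meta name="robots" content="noindex, nofollow">'
        '<a href="http://a.example/" rel="nofollow">A</a>'
        '<a href="http://b.example/">B</a>')
result = audit(page)
print(result.meta_robots, result.nofollow_links, result.followed_links)
```

Run it against both the live page and the search engine’s cached copy. A difference between the two is the clearest warning sign.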

Estimating visitor value

We love traffic. We want as much traffic as possible. It is really nice to see our traffic graphs jump really high. With our PPC campaigns we pretty much obsess over our click-through rates. We like to go after the keyword phrases that drive the most traffic. Everybody is in love with Digg and Social Media.

Not all traffic is equal, even search traffic coming from similar phrases. What we really need is traffic that converts: visitors that take whatever action we expect them to take, whether that is buying an e-book, subscribing to our newsletter or downloading our software. We need traffic motivated to take action.

There is a big difference between running a site that gets 10,000 visitors a day and makes $10,000 a month and one that gets 1,000 visitors a day and makes $20,000 a month. Assuming a 30-day month, each visitor to the first site is worth about 3 cents, and each visitor to the second is worth about 67 cents, 20 times more.
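The arithmetic is easy to check in a few lines of Python. The only assumption is a 30-day month, and the function name is my own.

```python
def value_per_visitor(visitors_per_day, revenue_per_month, days=30):
    """Monthly revenue divided by monthly visitors, in dollars."""
    return revenue_per_month / (visitors_per_day * days)


site_a = value_per_visitor(10_000, 10_000)  # 300,000 visitors a month
site_b = value_per_visitor(1_000, 20_000)   # 30,000 visitors a month

print(round(site_a, 3), round(site_b, 3), round(site_b / site_a, 1))
```

The second site earns twice as much from a tenth of the traffic, which is where the factor of 20 comes from.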

While some might say that non-converting traffic is good for branding, I prefer branding that actually gets results.

The most cost effective way to get branding traffic, in my opinion, is via Viral Marketing, Social Media, Organic Search or very cheap CPM based deals.  I do not like the idea of big PPC budgets for branding only.

If you are planning or working on your start-up, you need to make the best possible use of your time and money.

Here is a simple tip I use to get traffic that takes the action I want: measure the potential visitor value beforehand.

My tool of choice is Google’s Adwords Traffic Estimator.  I know it is not very accurate and many times the estimates are way off, but it serves my purposes.

1. Use the Google Keyword Research Tool to find as many relevant keywords as possible for the target content.
2. Include those keywords in the traffic estimator.
3. Set the maximum cost per click to $100. Don’t panic! This is just to get the estimates for the #1 slot.
4. Sort the results by cost per day, that is, the estimated number of clicks times the estimated cost per click.

The reasoning is the following: If advertisers are willing to pay x for some particular keywords, that means that they are making at least that to break even.

It is a good idea to run the estimates several times during the week to rule out inexperienced advertisers’ bidding blunders.

If we sort by number of clicks, we only care about the volume of traffic. If we sort by the max cost per click we only want to target the most expensive keywords. Now, targeting the keywords with the max cost per day means we are after the money makers.
Use PPC, SEO or buy relevant ads or reviews on sites ranking for those keywords.
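Here is a small Python sketch of the three ways of sorting. The keywords and numbers are made up for illustration and are not real Traffic Estimator output.

```python
# Made-up estimates: clicks per day and cost per click in dollars
estimates = [
    {"keyword": "blue widgets", "clicks_per_day": 120, "cost_per_click": 0.50},
    {"keyword": "widget insurance", "clicks_per_day": 15, "cost_per_click": 9.00},
    {"keyword": "free widgets", "clicks_per_day": 900, "cost_per_click": 0.05},
]


def cost_per_day(row):
    """What advertisers are estimated to spend on this keyword each day."""
    return row["clicks_per_day"] * row["cost_per_click"]


by_clicks = max(estimates, key=lambda row: row["clicks_per_day"])
by_cpc = max(estimates, key=lambda row: row["cost_per_click"])
ranked = sorted(estimates, key=cost_per_day, reverse=True)

print("most clicks:", by_clicks["keyword"])
print("highest CPC:", by_cpc["keyword"])
print("by cost per day:", [row["keyword"] for row in ranked])
```

In this made-up data the highest-traffic keyword ends up last once you rank by cost per day, which is exactly why volume alone is a poor guide.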

Determining searcher intent automatically

Here is an example of how useful it is to learn SEO from research papers.

If you’ve read some of my previous posts, you will know that I am a big fan of finding out what exactly search visitors want. I posted about classifying both visitors and landing pages, so that search visitors looking for information find information articles, searchers looking to take action land on transaction pages, etc.

I really like the research tools MSN Labs has. One of my favorites is this http://adlab.msn.com/OCI/OCI.aspx

You can use it to detect commercial intent. Try it. It is really nice.

I’ve been wanting to do something like that, but I didn’t have enough clues as to how to do it. Until now.

Search engine patent expert Bill Slawski uncovered a gem: a research paper that details how a team of researchers achieved exactly this.

I still need to dig deep into the document and the reference material, but it is definitely an excellent find.

I will try to make a new tool for this. I will also try to make this and other scripts I write more accessible to non-technical readers. I guess most readers don’t care much about the programming details. They just want to be able to use my tools easily :-)

What to do with the money you make online

Some readers coming from John Chow dot Com might be wondering if I make money on-line.

Showing big checks and bragging is not my style, but I do understand most people want proof. Instead of showing checks, bank statements, etc., I am just going to show you what I do with the money my companies make for me.

I just added pictures from my nice little golf villa in Casa de Campo. I bought it last year and I recently remodeled it.

If you want to have an idea how much it costs, here is the current list of villas for sale at Casa de Campo. You can alternatively do a search in Google for “buy villa in casa de campo”.

Remember I don’t live there, it’s just for renting and relaxing.

Robots.txt 101

First let me thank my beloved reader SEO blog.

Thanks to him I got a really nice bump in traffic and several new RSS subscribers.

It is really funny how people who don’t know you start questioning your knowledge, calling you names, etc. I am glad that I don’t take things personally. For me it was a great opportunity to get my new blog some exposure.

I did not intentionally try to be controversial. I did run a backlink check on John’s site and found those interesting results I reported. I am still inclined to believe that my theory is better grounded than SEO Blog’s. Please keep reading to learn why.

His theory is that John fixed the problem by making some substantial changes to his robots.txt file. I am really glad that he finally decided to dig for evidence. This is far more professional than calling people you don’t know names.

I carefully compared both robots.txt files and here is what John removed in the new version:

    # Disallow all monthly archive pages
    Disallow: /2005/12
    Disallow: /2006/01
    Disallow: /2006/02
    Disallow: /2006/03
    Disallow: /2006/04
    Disallow: /2006/05
    Disallow: /2006/06
    Disallow: /2006/07
    Disallow: /2006/08
    Disallow: /2006/09
    Disallow: /2006/10
    Disallow: /2006/11
    Disallow: /2006/12
    Disallow: /2007/01
    Disallow: /2007/02
    Disallow: /2007/03
    Disallow: /2007/04
    Disallow: /2007/05

    # The Googlebot is the main search bot for google
    User-agent: Googlebot

    # Disallow all files ending with these extensions
    Disallow: /*.php$
    Disallow: /*.js$
    Disallow: /*.inc$
    Disallow: /*.css$
    Disallow: /*.gz$
    Disallow: /*.wmv$
    Disallow: /*.tar$
    Disallow: /*.tgz$
    Disallow: /*.cgi$
    Disallow: /*.xhtml$

    # Disallow Google from parsing indididual post feeds and trackbacks..
    Disallow: */feed/
    Disallow: */trackback/

    # Disallow all files with ? in url
    Disallow: /*?*
    Disallow: /*?

    # Disallow all archived monthlies
    Disallow: /2006/0*
    Disallow: /2007/0*
    Disallow: /2005/1*
    Disallow: /2006/1*
    Disallow: /2007/1*

In English, this means he is now letting Google crawl and index his archived articles, dynamic pages, and files ending with “.php”, “.js”, “.inc”, “.css”, etc. Note that in neither robots.txt file is John preventing the crawler from accessing his home page or the regular posts. WordPress uses PHP, but regular posts and the home page can be accessed without “.php”.
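
To see which URLs rules like these would block, here is a rough JavaScript sketch of Googlebot-style pattern matching, where "*" matches any run of characters and a trailing "$" anchors the end of the URL. It is a simplification under my own assumptions: it ignores Allow rules, rule precedence and user-agent groups, the function names are mine, and the sample paths are hypothetical.

```javascript
// Convert a Googlebot-style Disallow pattern into a regular expression.
// "*" matches any run of characters; a trailing "$" anchors the end of the URL.
// Without "$", the pattern only has to match the beginning of the path.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// True when at least one pattern matches the path (including any query string).
function isDisallowed(path, patterns) {
  return patterns.some((pattern) => patternToRegExp(pattern).test(path));
}

// A few of the rules John removed, taken from the listing above.
const removedRules = ["/*.php$", "/*?", "*/feed/", "/2007/0*"];

// Hypothetical paths, for illustration only.
for (const path of ["/", "/some-post/", "/index.php", "/some-post/feed/", "/?p=123"]) {
  console.log(path, isDisallowed(path, removedRules) ? "blocked" : "allowed");
}
```

Under these rules the home page and a post at a path like /some-post/ stay crawlable, while the PHP file, the feed and the URL with a query string are blocked.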

If this was the change that fixed the problem, it might have been because removing those internal pages from the spider’s view weakened his internal link structure. His claim is not without merit.

Now, here is one tiny detail that my friend is missing. To prove his point, he used Google’s cache to show the different versions of the robots.txt file. If Google still has the old version in its cache, what makes him think that Google is already using the new one? Google should be caching the new version, not the old one. That is why I am still not convinced that this is the reason for the fix.

John says he is not telling, because a reader said Google might change their algorithm and drop him again. How do the changes John made to his robots.txt file have anything to do with algorithm changes? I am just curious.

In reality, we can theorize all we want, but the only ones who can tell for sure are the guys at the Googleplex. John probably tried many different things and one or several of them worked. He is probably not even sure which one did.

How did I learn SEO?

SEO Blog suggests I visit his forum to learn SEO. Here is the problem with that. I am a technical guy; I cannot take gut feelings or opinions as truth. I do visit some forums and blogs every now and then, but my experience is that the noise-to-signal ratio is too high. I prefer to learn and get my insights from the source: search engine research papers, search engine representatives’ blogs or my own experiments.

I learned SEO back in 2002 when I read this paper. Back then, nobody was even talking about Google bombs, anchor text, etc. Read the paper, it is all there.

John Chow fixes anchor text and pleases Google

As I reported before, John stopped showing up in Google for “make money on-line” for a few days. He is now back at #1 for the term.

What did he do? He is not telling.

I was going to use this post to explain exactly what I did to restore my number one ranking. However, after reading Kumiko’s comments in my Taipei 101 to number 1 post, I’ve decided against it. I think everyone will agree that this kind of information is extremely valuable – some “SEO Guru” tried to take me for $4,000 by saying he knew the answer (which I highly doubt since he made no guarantee).

While I won’t give the step by step I can offer this piece of advice if you lose a ranking for a desired keyword – Google webmaster tools is your friend! Get to know it really well.

This is what Kumiko said:

Comment by Kumiko

2007-06-02 18:24:47

Reading how you got back to #1 will be a great read! Aren’t you worried about Google reading it though and simply changing the algorithm again?

He is hinting that he used Google webmaster tools to figure out what the problem was. I can tell you what specific section he looked at: Webmaster Tools -> Statistics -> Page analysis -> In external links to your site. That section shows the anchor text people are using when linking to your site.

Before he made the changes, I am sure most of the link text was “make money online”, because that was what he was requiring from the reviewers.

Unfortunately you can only use Webmaster Tools on your own sites, not on other people’s. SEOs know of several handy tools that can give you the same information: Opti-link, SeoElite, Seobook Backlink Analyzer, to name a few.

I used one such tool and I figured out what changes John made. His website’s anchor text now includes: “Make Money Online with John Chow dot Com”, “Make Money on the Internet”, “Make Blog Money”, “Make Money Online Blog”, “Making Money On-line”, “Making Good Money from your Blog – John Chow dot Com”, “Making Money with your Blog”, and others. I don’t think it would be fair to post the entire list here. I guess he emailed several of the reviewers and asked them to change the anchor texts.

The changes look very similar to what I suggested he do. It is also interesting to note that in the latest batch of reviews he is not asking people to link back with “make money online” as anchor text. I guess it’s better to let people link to you first and then send them an email with the specific anchor text you want.

Update: As one reader suggested, here is a partial list of the reviewers that are using variations of “make money online” in their links to John.


http://www.walletrehab.com/my-biggest-blogging-mistake/,John Chow - Making money on-line

http://www.walletrehab.com/blogging-mistakes-dont-let-this-happen-to-you/,John Chow - Making money on-line

http://www.random-good-stuff.com/index.php/category/helpful/,Make Blog Money

http://www.random-good-stuff.com/index.php/category/internet/,Make Blog Money

http://www.random-good-stuff.com/index.php/category/blog/,Make Blog Money

http://www.random-good-stuff.com/index.php/category/make/,Make Blog Money

http://www.neverlandteam.net/blog/category/este-blog/,make money on the internet

http://www.fewleftstanding.net/,make money on the internet

http://www.newsdoggy.com/category/headlines/feed/,make money on the internet

http://iexplor.blogspot.com/2007/04/how-to-win-30gb-microsoft-zune.html,make money on the internet

http://www.wiredzune.com/,make money on the internet

http://www.ontora.com/,make money on the internet

http://www.techlivez.com/search/label/Cool%20Gadgets,make money on the internet

http://www.zengrrl.com/,make money on the internet

http://dereksemmler.com/,Make Money Online With John Chow dot Com

http://expatsinitaly.com/annika/,Make Money Online with John Chow dot Com

http://www.shashwat.in/,Make Money Online with John Chow dot Com

http://www.vuthasurf.com/,Make Money Online with John Chow dot com E Book

http://www.tolnetwork.com/,make-money-online-batch74

http://blog.acreativedesktop.com/,JohnChow.com - Money

http://www.oubipaws.org/,Mr. Make Money Online
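
If your backlink tool exports rows in this "url,anchor text" form, a short script can tally how often each anchor text appears. This is a minimal sketch, not any particular tool's export format; it uses a handful of rows from the list above, and the function name is my own.

```javascript
// A few rows copied from the list above, in "url,anchor text" form.
const rows = [
  "http://www.walletrehab.com/my-biggest-blogging-mistake/,John Chow - Making money on-line",
  "http://www.random-good-stuff.com/index.php/category/helpful/,Make Blog Money",
  "http://www.random-good-stuff.com/index.php/category/internet/,Make Blog Money",
  "http://www.fewleftstanding.net/,make money on the internet",
  "http://www.wiredzune.com/,make money on the internet",
  "http://www.zengrrl.com/,make money on the internet",
  "http://dereksemmler.com/,Make Money Online With John Chow dot Com",
];

// Count occurrences of each anchor text, ignoring case, most frequent first.
function countAnchors(lines) {
  const counts = new Map();
  for (const line of lines) {
    // The anchor text follows the first comma; the URLs here contain none.
    const anchor = line.slice(line.indexOf(",") + 1).trim().toLowerCase();
    counts.set(anchor, (counts.get(anchor) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const anchorCounts = countAnchors(rows);
for (const [anchor, count] of anchorCounts) {
  console.log(count, anchor);
}
```

A profile dominated by one exact phrase is the pattern described earlier; a healthier profile spreads the counts across several variations.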

Competitive research or privacy attack?

I found an interesting tool via Seobook.com. It exploits a “feature” of current browsers that do not properly partition persistent client-side state information (visited links and caching information) on a per site basis.

The tool can identify URLs in your visitor’s browsing history. Aaron suggests this be used to check if your visitors come from competing sites and adjust your marketing strategy accordingly.

This might not work as Aaron might expect. You can only tell that the visitor visited those URLs in the last n days (where n is the number of days the user keeps in his or her browsing history). You won’t be able to tell when, how often or how recently those URLs were visited.

While this is very useful for marketing purposes, the window for taking advantage of this for other purposes is huge. Collecting information on users without their consent doesn’t sound very good either.

Reader Dave comments:

I’ve always been conscious of the technical possibility of this and taken some safeguards against it. Still, as a user, I’d be furious if I knew this technique were being used on me, and I will be keeping my eye out for any precedent-setting legal challenges to this.

As a publisher/affiliate, I refuse to stoop this low. It’s disappointing but not unexpected that a great deal of readers here would be so sanguine about something so blatantly unethical.

Your user’s history object is none of your [edited] business.

Imagine a phisher that uses this to identify the on-line bank you use. With this information, his scam will be far more effective. Most people ignore emails from institutions they are not affiliated with.

Another reader pointed to a Firefox plug-in that solves the visited-link based attack problem. Here is another plug-in that prevents cache-based attacks. I installed both of them immediately.

The tool Aaron mentions exploits the visited-link vulnerability. Here is how it works:

Your browser, by default, colors visited links in a different color than normal ones. That information is available via CSS and client-side Javascript. The script works by pulling a list of target URLs, using Ajax (this happens with no user action), inspecting their color and flagging the ones that have the visited-link color — these are the ones the visitor has previously visited.


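    // Sketch of the check as a function; the name isVisited is illustrative.
    // Assumes the page styles a:visited links with color #ff0000.
    function isVisited(link) {
        if (link.currentStyle) { // currentStyle is Internet Explorer's computed style
            var color = link.currentStyle.color;
            if (color == '#ff0000') /* Here is the color inspection */
                return true;
            return false;
        }
        return false; // other browsers are not handled in this excerpt
    }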

This is possible because browsers do not restrict visited-link styling to pages on the same domain as the link target: any page can check whether a link to another site is flagged as visited. It is very likely this will be fixed in future browser releases.

It might seem that disabling Javascript solves the problem, but this trick can be done as well with CSS only. Check https://www.indiana.edu/~phishing/browser-recon

Another form of attack, not used by the tool, is measuring the time the browser takes to open target URLs. URLs that have been visited are generally cached and load faster. Comparing timing information one can tell if a page was visited or not.

The plug-ins mentioned above protect from both types of attacks.

For more information visit: http://crypto.stanford.edu/sameorigin/

See also these papers for more background information:

Protecting browser state from web privacy attacks
Invasive browser sniffing and counter measures
Timing attacks on web privacy

Segment visitors by intention with Google Analytics

As I mentioned before, understanding what visitors want and giving it to them is the key to a successful website. That is the big picture.

Now let me tell you how to actually measure this. My tool of choice for this is Google Analytics.

With Google’s Adwords Conversion Tracking you can define goal pages and track conversions that happen once visitors land on those pages. For example: thank you pages for signing up, downloading a white paper or for purchasing a camera.

Many e-commerce websites have a multi-step check out process. Once you hit the “buy” button you are taken to a page where you can select the quantity of the selected product and other variables. After this you are taken to a page where you input your shipping information. Later to another page for you to input your billing information. Followed by a confirmation page and finally to the thank you page. This is commonly known as the “conversion funnel”.

You can use funnels to identify and reduce drop-out rates throughout the conversion process. Google analytics provides tools to create such funnels and reports to measure them.
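
To make the drop-out idea concrete, here is a small JavaScript sketch that computes the drop-out rate between consecutive funnel steps. The step names follow the checkout process described above, but the visitor counts are invented and the function is my own illustration, not something Google Analytics provides.

```javascript
// Hypothetical visitor counts per checkout step; the numbers are invented.
const funnel = [
  { step: "Quantity and options", visitors: 1000 },
  { step: "Shipping information", visitors: 600 },
  { step: "Billing information", visitors: 450 },
  { step: "Confirmation", visitors: 360 },
  { step: "Thank you", visitors: 324 },
];

// For each pair of consecutive steps, the share of visitors lost in between.
function dropOutRates(steps) {
  const rates = [];
  for (let i = 1; i < steps.length; i++) {
    const lost = steps[i - 1].visitors - steps[i].visitors;
    rates.push({
      from: steps[i - 1].step,
      to: steps[i].step,
      dropOut: lost / steps[i - 1].visitors,
    });
  }
  return rates;
}

const rates = dropOutRates(funnel);
for (const r of rates) {
  console.log(r.from + " -> " + r.to + ": " + (r.dropOut * 100).toFixed(1) + "% drop-out");
}
```

The step with the highest drop-out rate is usually the first place to look for problems.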

The main problem is that most people optimize their conversion process, but don’t measure and optimize their persuasion or pre-selling process as well.

Once a visitor clicks on the “buy” button, he or she is already set on buying the product, but the path to conversion starts way before that. It begins with the persuasion process. I explained that process in this post.

In short, visitors come to your site with a specific mindset (expecting something in particular, and the keywords they type are the best clue to what that is). It is important that they land on the right pages and that those pages induce them to move to the next step in the persuasion process.

Now let’s see how we can use Google Analytics to segment your visitors based on what they want.

In addition to segmenting users along pre-defined segments such as geographic region and language preference, Google Analytics allows you to define custom segments and analyze the behavior of each segment. For example, you might ask visitors to select their job category (such as Engineering, Marketing, motorcycle stunt riding, etc) from a form. You could then analyze browsing and buying behavior based upon the selected job categories.

What we are going to do is create three custom segments: one for visitors seeking primarily information or products with no specific brand, another for visitors looking for a specific brand, and another for visitors that actually want to take action (buy, download, subscribe, etc.). Those visitors are expected to land on relevant web pages, if we set our PPC or SEO campaign right. We can then label those pages with their segment (generic, brand, action).

To do this, insert the following code into each of the pages.

For pages visited by people looking for information and not a specific brand, e.g., people searching for “laptops”:

<script type="text/javascript">__utmSetVar('Generic');</script>

For pages visited by people looking for a specific brand and model, e.g., people searching for “dell latitude”:

<script type="text/javascript">__utmSetVar('Brand');</script>

For pages visited by people who are looking to take action, e.g., people typing “cheap dell latitude d420”:

<script type="text/javascript">__utmSetVar('Action');</script>

These codes need to be set below the Google Analytics tracking code.
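
If you have many landing pages, you may want a quick way to decide which of the three labels each page should carry. The following JavaScript sketch guesses the segment from the keyword a page targets. The word lists and the function name are hypothetical and purely illustrative; you would need to tune them for your own market, and the result is only a first guess to review by hand.

```javascript
// Hypothetical word lists; adjust them for your own products and market.
const BRANDS = ["dell", "latitude", "thinkpad"];
const ACTION_WORDS = ["buy", "cheap", "price", "download", "subscribe"];

// Returns "Action", "Brand" or "Generic" for a search keyword.
// Action words win over brand names, since they signal intent to act.
function segmentForKeyword(keyword) {
  const words = keyword.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.some((w) => ACTION_WORDS.includes(w))) return "Action";
  if (words.some((w) => BRANDS.includes(w))) return "Brand";
  return "Generic";
}

for (const kw of ["laptops", "dell latitude", "cheap dell latitude d420"]) {
  console.log(kw, "->", segmentForKeyword(kw));
}
```

The returned label is the value you would pass to __utmSetVar on the page that targets that keyword.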

You can view conversion behavior for each of your custom segments in the Marketing Optimization > Visitor Segment Performance > User-Defined report. Also, many other reports allow you to cross-analyze data according to visitor segments. To cross-analyze, click on the Analysis Options button at the far left of any entry in most reports. Select: User-Defined, to view your custom segments.

Cross-analyzing, using this segmentation, can provide actionable data that can be used to improve the persuasion process and lead more visitors to the conversion funnel.
