What is the practical benefit of learning Google’s internals?

I forgot to start my Google inner workings series with WIIFM. My plan is to write one post each week.

No matter how well I try to explain it, it is a complex subject. I should have started the first post by explaining why you would want to learn it; there are a lot of easier things to read. With some people questioning the usefulness of SEO, this is a good time to make my views clear. Please note that I believe in a solid marketing mix that includes SEO, PPC, SMO, affiliate marketing, viral marketing, etc. Do not put all your eggs in one basket.

If you have been blogging for a while, you have probably noticed that you are getting hits from the search engines for words that you did not try to optimize for. For example, the day after I started this blog, I received a comment from a reader who found my blog through a blog search! How was this possible?

Heather Paquinas May 26th, 2007 at 1:24 am

I found your blog in google blogsearch. Needless to say I subscribed right away after reading this. I always suspected what you said, especially after Mike Levin from hittail blogged about using hittail for ppc, but you really hit the nail on the head with this post.

This is possible because that is the job of the search engines! If every page you search had to be optimized, there wouldn't be billions of pages in Google's index. It would take a lot of people to do the SEO work :-).

Why do we need SEO then?

The answer is very simple: not all traffic means money. If you want to make money, target competitive terms, build your brand, etc., you need to select your keywords strategically. You cannot expect search engines to rank your site in high-profit niches automatically.

Well, maybe you are that lucky. The traffic that is highly valuable is probably very competitive, and it is very difficult to rank for competitive terms without any direct optimization effort. Why? Because others are already optimizing for those terms. When there is no competition, you can rank very easily.

Why learn Google's internal workings and other advanced information such as patents, research papers, etc.?

Again, I only recommend this if you are targeting competitive terms and markets. If you are happy with a few hundred clicks a week, you probably don’t need this.

Advanced knowledge gives you an edge over your competitors:

1. You can read and participate in forums and blogs and know what information is useful and what isn’t.
2. You can easily find solutions to your search engine related problems.
3. You can tell if a proposed theory is possible or not.
4. If you are a black hat, you can more easily find new holes to exploit and schemes to pursue.

If major search engines keep their internals so secret, there is a $reason$.

Should I cross link my sites for better rankings?

My loyal reader Jez asks a very interesting question. I am sure the same question is on the minds of others in the same situation.

Finally, I am in the process of creating multiple sites around a similar theme. I have unique content for all sites, and will host on different servers in Europe and the US, however the whois for each domain will show my name (The company I used does not allow me to hide this info).

Is the common whois likely to make much difference when I begin cross linking the sites?

Cross linking (or reciprocal linking) on a small scale (maybe 10 to 15 sites maximum) should not be a major concern. I've seen many sites do it, and they are ranking for highly competitive phrases. Most of their link juice comes from non-cross-linked sites, though.

When you try to do this on a massive scale, things start to get interesting. I know this from experience.

Back in 2003 and 2004, I managed to get a couple of my sites ranking on Google for "Viagra" and most of its variations. That is one of the most competitive industries, because you make really good money as an affiliate. I got those rankings exclusively through link exchanges. Being a developer, I created scripts to 'borrow' links from my competitors' link directories and later traded links with my sites. When I hit the 5,000-link mark, my sites got banned and I lost all my rankings. Back then, Google was not as sophisticated as it is now.

Later, I carefully studied competitors that were doing a more advanced type of cross linking. They built large networks of sites that they owned and created complex interlinking structures to boost the rank of a few of their sites for highly competitive terms. Pair.com was a common web host, as it provided IP addresses in different class C blocks.

That worked well for a while, until Google became a registrar. It is illegal to use fake domain registration information, and with access to domain ownership information Google could more easily identify complex cross linking. I think they became a registrar for that sole purpose. I don't see them selling domains in the future. They haven't yet. Have they?

Making your cross-linked domains' registrations private won't help much either. I think registrars have access to the real information anyway, but even if I am wrong, it would look suspicious for your site to have all of its inbound links coming from privately registered domains.

There are far more complex cross linking schemes, where a few owners cooperate in the creation of a massive collection of websites with well-planned link boosting structures. The funny thing is that search engine researchers have already identified most of them. Check out the paper "Link Spam Alliances"; it is a very interesting read.

So, if you want to cross link on a massive scale, you had better have a very intricate linking plan to avoid detection.

Can you trust Alexa’s numbers?

It is very important to understand that there is no way for external metrics tools such as Alexa, Compete, Ranking, Netcraft, etc. to provide accurate data.

Their information is collected from their respective toolbars' usage. Alexa has the broadest distribution of them all, but there are still a lot of people who don't use those toolbars or browser plugins.

Their data is particularly useful if you are in a technical field: search and affiliate marketing, web development, etc. A large portion of your potential visitors probably have one or more of these toolbars installed.

A while ago, there was an interesting project regarding the efficacy of those metrics:

Conclusion – The Value of External Metrics

This survey represents only a tiny sampling of sites in a niche sector, albeit a relatively popular one in the blogosphere and webdev/tech space. Based on the evidence we've gathered here, it's safe to say that no external metric, traffic prediction service or ranking system available on the web today provides any accuracy when compared with real numbers. Incidentally, I did log in to Hitwise to check their estimations and although I can't publish them (as Hitwise is a paid service and doing so would violate the terms of service), I can say that the numbers issued from the competitive intelligence tool were no better than Alexa's in predicting relative popularity or traffic estimation.

The sad conclusion is that right now, no publicly available competitive analysis tool (that we're aware of) offers real value. Let's hope within the next few years, better data will be made available.

What is the problem?

In statistics, when you need a sample that represents the entire population you are measuring, data must be collected carefully and completely to avoid any bias. Unfortunately, there is no way to control which sites or people end up in the toolbar sample. Users install the toolbars at will, and those who install them are usually advanced users (not your typical gardener).

Why use the data then?

In my case, the content on my blog is highly technical, so there is a high probability that most users have the Alexa toolbar or the browser plugin.

For comparative purposes. By comparing my blog's Alexa rank to that of a blog directed at a similar audience (seobythesea.com), I was able to tell whether I am on the right path.

Should you use it?

How technical is your audience? That is the right question to ask yourself. If you target casual readers, these metrics might not be very useful.

Great Content + Bad Headline = Mediocre Results

You can spend a few hours researching, structuring, drafting and proofreading a great post, only to completely miss the mark by choosing a really bad title.

I recently submitted a carefully crafted rebuttal to the SEOmoz article "Proof Google is Using Behavioral Data in Rankings". The post generated some controversy and heated discussion as to the validity of the tests and results. I read everything, and, given my technical nature, I decided to dig deeper myself.

I ended up with slightly different conclusions about the experiments. If you want to find out what they are, please read the post at YOUmoz.

Now, here’s the bad news.

As Kurt wisely points out, I tragically missed the mark by choosing an empty title: "Relevance feedback".

Kurt (86)

Sat (6/16/07) at 05:38 PM

Good post… well thought out and presented… gave it a thumbs up.

Unfortunately, it will most likely get overlooked by most readers due to its title/headline.

Look at the article you're referencing, "Proof Google is Using Behavioral Data in Rankings". You know that headline will bring in some clicks. It was moved to the blog of SEOmoz from the Youmoz section (even with its flawed testing and logic). The mozzers aren't stupid… they know this type of headline and article will stir up some controversy and bring in some links.

I’m no expert copywriter… far from it. I just hate to see a good post sit on the sidelines because of a bad headline.

The title I chose did not offer the reader any incentive to click or learn more. I guess I operate in two modes, engineer and marketer, and I forgot to flip the switch while writing this post.

First, let me state that his remarks about the mozzers are valid for most journalists, trade publications, social media sites, etc. It is human nature to judge books by their cover. If the cover is crap, the content must be crap. That is how we normally think.

Again, whether you are writing:

1. A blog post
2. A book
3. An email
4. A fax cover letter
5. An article
6. A Digg submission
7. etc.

Write titles and subjects that entice users to read further.

What can you learn from my mistake?

1. Most people scan web pages. They don’t have the time to follow each link. The title must be a call to action: “this is interesting, click to learn more”.
2. The summary/excerpt is very important too. I chose a really bad first paragraph. If you write guest posts for other popular blogs, you want your title and first paragraph to be cliffhangers. You must get people to click through.
3. Content comes second to the title and excerpt! This is sad, but true. While crappy content won't get the word out, a crappy title won't even get people in the door in the first place.

Deceptive titles are not a good idea

Am I suggesting you start writing bait and switch posts? Definitely not.

While controversy draws attention, writing titles that promise one thing while the content delivers another is the best way to brand yourself as a charlatan.

Ideally, you should spend enough time carefully writing your posts (especially if they are to be published on other websites) and spend a few minutes carefully crafting the titles as well. Be creative!

The power of networking

When I started to blog (now close to three weeks ago) I did not know what to expect. I have to say that I am more than impressed with the power of blogging and networking with popular related blogs.

My topics tend to be too technical and I am well aware that it severely limits my audience. Not everybody understands what I am talking about.

I plan to change this in the coming weeks by adding illustrations for the complex topics. I am also working to move my blog away from wordpress.com and self-host it on one of our servers. That will give me a lot more flexibility than I have now. One thing I want is the ability to link to my source code instead of including the code in the posts; I will probably just include a flow diagram in the posts. I also want to make the scripts available for use directly from the blog, so that you don't have to install them.

What have I learned so far?

I have done several things on purpose:

1. I decided to not monetize this blog in any way. My plan is to use it exclusively for branding. You won’t see any ads or affiliate links here.
2. I don’t have any short term plans to advertise the blog in any way.

My plan is to test how well a blog can do just by writing useful and original content and by participating in other blogs and forums with useful feedback. To that end, I try to post at least one article a day here and write articles to be published on other blogs and popular websites.

I don't think the results are mind-blowing, but compared to what I've read on other blogs, my numbers are looking good. My Alexa rank for this week is around 60 thousand. I checked seobythesea.com, which is very heavy on technical content, and its traffic rank is 40 thousand.

[Figure: Alexa traffic rank chart, June 2007 (jun07_alexa.png)]

Things to avoid

While commenting on other popular blogs is one of the most effective ways to get your name or brand out and potentially attract more visitors to your site, doing it wrong can prove to be a waste of time or even have the opposite effect.

I often see a lot of comments that just say 'Nice post. Keep it up.' This is the best way to waste your time. First of all, it doesn't help with rankings, as most comment links are 'no-followed'. Furthermore, it will not bring traffic. How many times have you tried to find out who wrote that insightful comment 'Nice post'? Unless it gets really annoying, I don't think you do. I don't.

Carefully read what the post is about, reflect on it, and try to find something to say that adds to the conversation. It could be confirming the post or taking an opposing view, but you need to add something. You can also ask clarifying questions, but visitors are most likely to click through to your site if you add something of value.

Blogging is informal, but that doesn't mean you shouldn't carefully research your posts. Citing other blogs and authority sources not only gives you more credibility, but also makes your pingbacks to those popular blogs more likely to be accepted. Even if they are 'no-followed', the traffic is good. You also get the opportunity to be picked up by other blogs.

Advanced Adwords bidding strategies

In yesterday's Search Day article, "Are Bid Management Tools Dead?", Eric Enge shares some interesting facts and conclusions he brought back from SMX.

A solid strategy for your PPC campaigns will have the following elements:

  1. Use a bid management tool to manage the long tail of your campaign.
  2. Stay focused on your ad copy and your landing pages, because they can dramatically influence the cost and conversion rates of your campaigns.
  3. Take significant brand building terms and manage them separately
  4. Take significant “first visit search” keywords and manage them separately as well.

While I think it is no longer necessary to manage large lists of long-tail keywords for PPC campaigns (thanks to broad matching options), I do see great value in bid optimization tools for improving the ROI of your PPC campaigns.

I want to focus on one particular point that was brought up: brand building and "first visit" keywords should be managed differently and left out of automated bid management tools. They provide no directly measurable ROI, but they are definitely very important to have.

This is the strategy I use to build PPC campaigns. I take a few more variables into account, but I will provide a high-level overview of the process.

Organizing the keywords

The first step is carefully organizing the keywords. I organize them using the following criteria: whether they relate to the brand, and whether they are generic or action keywords. I further organize the keywords by how relevant they are to my business on a scale of one to five, with five being the most relevant and the most likely to turn into a conversion.
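To make this concrete, here is a minimal sketch of how that organization could be recorded; the keywords, types and scores below are made up for illustration.

```python
# Hypothetical keyword organization: each keyword gets a type
# (brand, generic or action) and a relevance score from 1 to 5.
keywords = {
    "acme widgets": {"type": "brand", "relevance": 5},
    "buy blue widgets": {"type": "action", "relevance": 5},
    "widgets": {"type": "generic", "relevance": 3},
    "widget history": {"type": "generic", "relevance": 1},
}

# Group keywords by relevance so each group can later become its own campaign.
by_relevance = {}
for keyword, info in keywords.items():
    by_relevance.setdefault(info["relevance"], []).append(keyword)

print(by_relevance)
# {5: ['acme widgets', 'buy blue widgets'], 3: ['widgets'], 1: ['widget history']}
```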

Initial maximum bids for ROI (money) keywords

I estimate our initial maximum bids by doing the following:

I want no less than 100% ROI and assume at least a 1% conversion rate (the actual ROI and conversion rates are usually higher than that). Take the net profit per conversion and, with simple math, you can determine the maximum you want to pay per click. That is for the most relevant keywords (#5 on your scale). Then, discount 20% of that top bid for each level down the scale. For example, if the maximum bid for scale five is $1.50, for scale four it would be $1.20, for scale three it would be $0.90, etc.
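To make the arithmetic concrete, here is a minimal sketch of the calculation. The $300 net profit per conversion is a made-up example value; the 100% ROI target and 1% conversion rate are the assumptions from the paragraph above.

```python
# Sketch of the initial maximum-bid math described above.
def max_bids(profit_per_conversion, conversion_rate=0.01, target_roi=1.0):
    # Cost per conversion we can afford while still hitting the target ROI
    allowed_cost_per_conversion = profit_per_conversion / (1 + target_roi)
    top_bid = allowed_cost_per_conversion * conversion_rate  # max CPC for scale 5
    # Discount 20% of the top bid for each step down the relevance scale
    return {scale: round(top_bid * (1 - 0.2 * (5 - scale)), 2) for scale in range(5, 0, -1)}

print(max_bids(300.0))  # {5: 1.5, 4: 1.2, 3: 0.9, 2: 0.6, 1: 0.3}
```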

As long as those conversion rate and profit assumptions hold, this strategy guarantees profitability.

Next, create four campaigns, each one at a different maximum cost per click (based on your relevance scale) and use Google Adwords Budget Optimizer. The Budget Optimizer will try to get as many clicks as possible and will manage the individual keywords’ maximum bids automatically for you. Let it run for at least a month, carefully looking at your actual conversions and ROI.

After a week or two you should have more accurate conversion rate and ROI numbers. Use them to adjust your maximum bids per campaign.

Managing brand building and “first visit search” keywords

For brand building you want as many impressions as possible, for the lowest possible cost. The content network is an excellent option for this, by using CPM ads and site targeted ads.

For brand building on keyword search, the setup is a little different. You must maintain at least a 0.5% click-through rate to prevent your ads from being disabled, but you want as many impressions as possible and very few clicks to keep your costs low.

The ads are the key. Create ads that get the right message across; don't try to entice users to click. My strategy is to use position preference, target positions 4-6, and bid at the minimum, or the minimum necessary to keep the ads running. This guarantees your ads will remain on display and you won't need to actively manage those campaigns. Big advertisers will usually want the top positions (1-2). Position preference is not compatible with the Budget Optimizer, but for this strategy the optimizer is not necessary.

For “first visit search” keywords I use a similar strategy, but I try to get the visitors to click with an enticing ad.

Improving your ROI campaign performance with preferred cost bidding

After running your ROI keyword campaigns with the Budget Optimizer for a while, you can generate reports that give you all sorts of useful information. One such piece of information is the true value of each visitor per keyword. Using it, you can achieve more predictable spending and ROI.

My technique is to create another campaign with preferred cost bidding and move all the best-performing keywords over from the Budget Optimizer campaigns. Set the preferred cost per click to the actual value of those clicks, as determined from the reports. Google will automatically handle the rest for you.
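As a rough illustration of where those preferred costs could come from, here is a hypothetical sketch; the report fields and numbers are invented, and the value-per-click formula is just one reasonable way to read "the actual value of those clicks".

```python
# Hypothetical report rows: (keyword, clicks, conversions, net profit per conversion).
report = [
    ("blue widgets", 400, 6, 50.0),
    ("buy widgets online", 250, 5, 50.0),
]

for keyword, clicks, conversions, profit in report:
    value_per_click = conversions * profit / clicks  # average value each click produced
    print(f"{keyword}: preferred CPC candidate ~ ${value_per_click:.2f}")
```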

Another popular and probably more efficient alternative is to use advanced bid management tools based on portfolio algorithms.

These techniques require some historical data to be useful; that is where the Budget Optimizer campaigns come in handy.

Google recently introduced Cost Per Acquisition bidding for the content network. I haven't used it and can't comment much on it. I can say that paying only for results sounds ideal for the advertiser. This is very similar to running an affiliate program.

Many elements are necessary to run a successful PPC campaign. For me, the ads and the landing pages are the most important ones for ensuring adequate conversions and ROI. Astute bidding can give you a big competitive edge, especially in highly competitive markets.

Google’s inner workings – part 1

Google keeps tweaking its search engine, and now it is more important than ever to better understand its inner workings.

Google lured Mr. Manber from Amazon last year. When he arrived and began to look inside the company's black boxes, he says, he was surprised that Google's methods were so far ahead of those of academic researchers and corporate rivals.

While Google closely guards its secret sauce for many obvious reasons, it is possible to build a pretty solid picture of Google's engine. To do this, we are going to start by carefully dissecting Google's original engine as it was conceived back in 1998. Although a newborn baby, it had all the basic elements it needed to survive in the web world.

The plan is to study how it worked originally, and follow all the published research papers and patents in order to put together the missing pieces. It is going to be very interesting.

Google has added and improved many things over the years. The original paper only describes the workings of the web search engine. Missing are features such as the ability to search news, images, documents (PDF, Word, etc.), video, products, addresses, books, patents, maps, blogs, etc.

Also missing are substantial improvements such as local search, mobile search, personalized search, universal search, the supplemental index, freshness, spam detection and PageRank refinements. One thing that will be hard to know is how Google uses the data it collects through other services, like Google Toolbar, Google Analytics, Google Adsense, Doubleclick, Gmail, Gtalk, Feedburner, etc. There is a lot of information there that can be used both for better ads and for better search results.

No matter the type of search you are conducting, conceptually, search engines have three key components: the crawler, the indexer and the searcher.

The crawler (also known as a search engine robot) has the job of collecting all the information that will later be searched, whether it's images, video, text or RSS feeds. These documents are stored for later processing by the indexer module. Webmasters and site owners can control how crawlers access their websites via a robots.txt file and the robots exclusion protocol. In this file you basically tell the crawler which pages or sections it is not allowed to crawl. I posted an entry about this several days ago.
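As a quick illustration of the robots exclusion protocol, here is a small sketch using Python's standard robotparser module; the robots.txt content and URLs are made up.

```python
from urllib import robotparser

# A made-up robots.txt, parsed inline for illustration.
robots_txt = """
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "http://example.com/private/report.html"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))           # True
```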

The indexer module is the one doing the heavy lifting. It has the daunting task of carefully organizing the information collected by the crawler, and the power of the search engine lies in this specific task: the better the information is classified, the faster and better the search. Search engines conceptually classify documents similarly to the way you file documents in a cabinet. Without some sort of labeling, you will probably waste a lot of time finding your bank statements, notes, etc. Search engines label documents in a way that makes it easy to find them later by words or phrases (also known as keywords). In the case of text and similar documents, the indexer breaks the document down into words and collects some additional information about those words, such as the frequency of each word in the document.
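Here is a toy sketch of that labeling step, assuming plain text documents; the documents are made up, and real indexers track much more than word frequency (positions, font size, capitalization, etc.).

```python
import re
from collections import Counter

# Made-up documents keyed by docID.
documents = {
    1: "Google stores the pages it crawls in a compressed repository",
    2: "The indexer parses pages and records word occurrences called hits",
}

# Forward index: for each document, which words occur and how often.
forward_index = {}
for doc_id, text in documents.items():
    words = re.findall(r"[a-z]+", text.lower())
    forward_index[doc_id] = Counter(words)

print(forward_index[2]["pages"])  # 1
```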

The searcher module is the one that takes the user's search, cleans it up to remove ambiguities, misspellings, etc., finds the documents in the index that most closely match the search, and ranks them according to the current ranking formula. The ranking formula is the most closely guarded secret of every major commercial search engine.

These basic components remain the same nowadays, but they say the devil is in the details. Today, Google's inner workings are far more complex than what I am going to explain, but the basic principles are the same. I will quote the original paper as necessary.

There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance-related assessments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).

One notable improvement Google brought to the commercial search marketplace was the use of link structure and anchor/link text to improve the quality of results. This proved to be a significant factor that helped fuel its growth. Today, these elements remain important, but Google makes use of very sophisticated filters to detect most attempts at manipulation. Proof that they still carry weight is the string of successful Google bombs of late.

To support novel research uses, Google stores all of the actual documents it crawls in compressed form

Here is the origin of the caching feature we are accustomed to using.

Now, let's see how they define PageRank, the measure of how important or high quality a page is for Google's search engine.

…a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo’s homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.
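The recursion described in that quote corresponds to the PageRank formula published in the original paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). Here is a minimal, non-optimized sketch of it; the three-page link graph is invented, and the damping factor of 0.85 is the value the paper says is typically used.

```python
# Naive PageRank iteration over a tiny, made-up link graph.
# links maps each page to the list of pages it links to.
def pagerank(links, d=0.85, iterations=30):
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial ranks
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Rank passed on by every page q that links to p: PR(q) / C(q)
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```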

Here is a clear description of why they use anchor text for searching.

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
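A toy illustration of that idea, with made-up pages and links: the anchor text is indexed under the page the link points to, so a page can be found by words it does not even contain.

```python
# (source docID, target docID, anchor text) triples, invented for illustration.
links = [
    (1, 3, "cheap flight deals"),
    (2, 3, "best airfare search"),
]

anchor_index = {}
for source, target, text in links:
    for word in text.lower().split():
        anchor_index.setdefault(word, set()).add(target)  # credit the target page

print(anchor_index["airfare"])  # {3}: docID 3 is findable by a word it never mentions
```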

Now, let's read about the on-page elements that Google considered that were not in regular use back then: proximity, capitalization and font weight, and page caching.

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font weigh heavier than other words. Third, full raw HTML of pages is available in a repository.

Now let's take a big-picture view of how everything fits together. This is very technical, but I will try to explain it as best I can.

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of “barrels”, creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.

Google uses distributed crawlers/downloaders. If you have ever looked at your server log files, you will notice that when Googlebot visits your site the hits come from different IPs. That is because the crawling is distributed among several computers. A URL server feeds the URLs to download to the crawlers and coordinates the crawling effort.

All the fetched pages are sent to the storeserver for compression and storage, and each is assigned an ID (docID). For computers, it is easier and more efficient to refer to things by numbers.

The indexer does some heavy work:

  • Reads, uncompresses and parses documents, converting each one into hits (word occurrences).
  • Creates partially sorted forward indices (the "barrels").
  • Creates the anchors file (link text plus the from and to links); the URLresolver then fixes relative URLs and assigns docIDs.
  • Includes the anchor text in the forward index, but under the docID of the page the link points to, associating the text of a link with the document it points to.
  • Maintains a links database used to compute PageRanks.
  • Generates the lexicon, the list of all the different words in the index.

Basically, the forward index allows you to find the words of a document given its docID. To be useful for searching, this needs to be inverted, i.e., you need to find documents by their words. The sorter does this additional step, re-sorting the barrels by wordID to create the inverted index. It also produces a list of wordIDs and offsets into the inverted index, which DumpLexicon combines with the indexer's lexicon to generate the new lexicon used by the searcher.
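Continuing the toy example from the indexer section, here is a minimal sketch of that inversion and of how a searcher could use it; real barrels, wordIDs and ranking are of course far more involved.

```python
# A tiny forward index (docID -> word -> frequency), made up for illustration.
forward_index = {
    1: {"google": 1, "compressed": 1, "repository": 1},
    2: {"indexer": 1, "hits": 1, "repository": 1},
}

# Invert it: word -> {docID: frequency}.
inverted_index = {}
for doc_id, words in forward_index.items():
    for word, frequency in words.items():
        inverted_index.setdefault(word, {})[doc_id] = frequency

def search(query):
    # Return the docIDs containing every query word (no ranking in this sketch).
    hits = [set(inverted_index.get(word, {})) for word in query.lower().split()]
    return set.intersection(*hits) if hits else set()

print(search("repository"))             # {1, 2}
print(search("compressed repository"))  # {1}
```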

Finally, the searcher combines the lexicon, the inverted index and the PageRanks to answer queries.

Next, I’ll describe each of the processes in more detail. Can’t wait? Read the document yourself and draw your own conclusions 🙂