How do Search Engines Detect Cloaking?

Cloaking is the practice of showing search engine bots a different web page from the one your visitors see, in an attempt to distort rankings. The cloaker hopes to show the search engine a rich variety of content containing the keywords they would like to rank for. One example is a Flash site that sends textual content to the search engines. This practice is considered black-hat and, if detected, would lead to search engine retribution. So how do the search engines detect cloaked websites?
When a search engine crawls a page, it identifies itself with a string known as the user agent, which is recorded in the web server's logs. The Firefox extension User Agent Switcher lets you change the user agent string your browser sends. You can pretend to be any browser or even Googlebot, which is useful for detecting cloaked content.
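The same switch can be made in a script. The following is a minimal sketch, assuming Python's standard `urllib`; the two user agent strings and the `example.com` URL are illustrative assumptions, and the requests are only prepared, not sent.

```python
from urllib.request import Request

# Assumed example user agent strings; real crawlers and browsers vary by version.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0"


def build_request(url: str, user_agent: str) -> Request:
    """Prepare (but do not send) an HTTP request announcing the given user agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Hypothetical URL, used only for illustration.
crawler_request = build_request("http://example.com/", GOOGLEBOT_UA)
browser_request = build_request("http://example.com/", BROWSER_UA)
```

Passing either request to `urllib.request.urlopen()` would fetch the page under that identity, and the two responses could then be compared by hand.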
So that’s it then: the obvious approach is for Google to spoof its own user agent and disguise Googlebot as an ordinary user. In this scenario it would visit the site as Googlebot and then visit again with a new identity to compare the two versions. While this might seem like a good idea, there are many occasions where it would not work, because a site may legitimately differ on each visit: frequently updated content, such as on news websites or even some blogs; time-stamp information that makes every page unique; and advertising rotation that serves different content each time. Such things are not cloaking.
One paper seeks to identify cloaking algorithmically by comparing three separate copies of each page, each fetched with the user agent of either a browser or a crawler: one copy B1 from the browser, one copy C1 from the crawler, and a second crawler copy C2 taken one day later. For each URL, the difference between the two crawler copies, C1 and C2, is calculated. The difference between the first crawler copy and the browser copy, C1 and B1, is also calculated.
If the difference between the crawler and the browser page, C1-B1, is greater than the difference between the two crawled pages, C1-C2, the URL can be identified as a candidate for cloaking. The reasoning is that the two crawled versions capture how much the page changes for the reasons discussed above that are not cloaking. If the browser copy differs from the crawled copy by more than that, the site is probably serving different content depending on the user agent.
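A minimal sketch of this comparison follows. It flags a URL when the browser copy differs from the crawler copy by more than the two crawler copies differ from each other. The word-set difference used here is an assumed stand-in for the paper's measure, and the function names and sample pages are hypothetical.

```python
def term_difference(page_a: str, page_b: str) -> int:
    """Count the distinct words that appear in one page but not the other."""
    words_a = set(page_a.lower().split())
    words_b = set(page_b.lower().split())
    return len(words_a ^ words_b)


def is_cloaking_candidate(b1: str, c1: str, c2: str) -> bool:
    """Flag the URL when browser-vs-crawler change exceeds crawler-vs-crawler change."""
    diff_crawler = term_difference(c1, c2)  # ordinary day-to-day change
    diff_browser = term_difference(c1, b1)  # change tied to the user agent
    return diff_browser > diff_crawler


# A page that feeds keywords to the crawler but something else to visitors.
cloaked = is_cloaking_candidate(
    b1="welcome to our flash site",
    c1="cheap shoes buy shoes online today",
    c2="cheap shoes buy shoes online now",
)

# A news page that simply changes between visits.
news = is_cloaking_candidate(
    b1="headline one 10:01",
    c1="headline one 10:00",
    c2="headline two 10:05",
)
```

With these made-up pages the first URL is flagged and the news page is not, because the news page changes more between the two crawls than between crawler and browser.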
The detail comes in how the differences are determined. Term difference measures how different the pages are in their text content. Link difference measures how the pages differ in the links they contain, which matters because redirection is also a common use of cloaking.
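As a rough illustration of the link measure, the sketch below collects the `href` values of anchor tags with Python's standard `html.parser` and counts the links that appear on only one of the two pages. The set-based count, the class and function names, and the sample markup are all assumptions; the paper's exact measure may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html: str) -> set:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def link_difference(html_a: str, html_b: str) -> int:
    """Count the links present on exactly one of the two pages."""
    return len(extract_links(html_a) ^ extract_links(html_b))


# Hypothetical markup for a crawler copy and a browser copy of the same URL.
crawler_page = '<a href="/shoes">Shoes</a> <a href="/rugs">Rugs</a>'
browser_page = '<a href="/shoes">Shoes</a> <a href="http://other.example/">Go</a>'
difference = link_difference(crawler_page, browser_page)
```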
A blanket approach to perceived cloaked content would result in some legitimate websites being labelled as cloaked. By separating URLs into bins according to their degree of difference, the algorithm can be refined to choose the most acceptable levels of precision and recall.
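The trade-off can be sketched as follows, assuming a set of URLs that have been labelled by hand as cloaked or not. The scores and labels below are made-up illustrative values, not results from the research, and the function name is hypothetical.

```python
def precision_recall(samples, threshold):
    """samples: (difference_score, is_cloaked) pairs with hand-assigned labels."""
    flagged = [cloaked for score, cloaked in samples if score > threshold]
    true_positives = sum(flagged)
    total_cloaked = sum(cloaked for _, cloaked in samples)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_cloaked if total_cloaked else 0.0
    return precision, recall


# Made-up sample: (difference score, whether a human judged the URL cloaked).
samples = [(0, False), (1, False), (3, True), (4, False), (8, True), (20, True)]

loose = precision_recall(samples, threshold=0)   # flags anything that differs
strict = precision_recall(samples, threshold=5)  # flags only large differences
```

On this toy data the loose threshold finds every cloaked URL but also flags legitimate ones, while the strict threshold flags only cloaked URLs and misses one of them.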
The research also looked at the percentage of sites found to be using cloaking, by category, for the most popular search terms. The highest percentage of cloaking was found for shopping and sports, at slightly over 10% of sites, while the lowest was for holidays, at just under 5%.
The full details of the research can be found at:
http://www.allthingssem.com/research-papers/cloaking-and-redirection-a-preliminary-study.pdf

