How Do You Get Pages Out of the Google Index?


Posted by Sam Battin, Senior Search Strategist, SEO

Using META Robots, Link Canonicals and Other Methods to Keep Your Best Landing Pages on Top of the Results

How do bad pages get into the Google index?

Theoretically, Google is supposed to mirror what's on your site and show its users only the best pages.  If you've ever looked at the pages Google has indexed for your site, however, you may have found results that aren't very good landing pages for your customers.  For example, if you type the search query site:yoursite.com, Google will show you all of the pages it has found within Yoursite.com.  The pages Google has indexed on your site might include Flash files, old PDFs, pages for a contest that ended in 1998, and other material that would give your visitors a bad experience if it were the first page they ever saw on your site.

As always, Google is in charge of what's in its index, but there are things you can do to help Google show only the best pages on your site.  You'll need to know two things:

  1. How your pages get in Google’s index in the first place
  2. How to get bad pages where they belong

Getting Into Google's Index

There are far too many pages on the Web for any single person to read them all and decide which ones are useful and which are spam.  Therefore, Google (and every other search engine) uses a computer program called a robot, or a bot, or a spider, or a crawler.  The names are interchangeable; this relatively simple program is designed to visit your site, read the text, and note every single hyperlink it finds on the page.  Then it tries to follow each hyperlink it found, and the cycle begins again.  In this way, search engines build up a picture of what's on the World Wide Web.

The crawler isn't smart enough to know what's important on your site and what's not.  If it finds a working hyperlink to the page describing the contest you held on your site in 1998, then that page will appear in Google's index.  Even if there aren't any links on your site to that page, all Google needs is a link somewhere else on the Web that points there, and the world will see that in 1998 you held a contest to give away a free "Mulan" DVD and a Chumbawamba CD.  A first-time visitor who sees ©1998 at the bottom of your HTML 1.0 page might come away with the idea that you don't update your site very often.

This is how your old PDF files, Flash files, server records, and other low-value pages get into Google and appear in search results: all a search engine needs is a single link.  Spiders routinely find thousands of these low-value pages every day, and Google's algorithms try to make sense of what the spiders find.  If the algorithms determine the pages are low-value, they may end up behind the "omitted results" link at the end of the search results page.

Google is a packrat; it tries to keep a record of virtually everything it finds because it might be valuable to someone, someday.  The problem is that low-value pages, or even duplicate pages, often get into the index through hyperlinks.  The ill effects of letting crawlers see every single page on your site range from less-than-satisfactory user experiences to lower visibility for your important search terms.

ROBOTS.TXT IS NOT INFALLIBLE

You can use your site's robots.txt file to disallow search engines from crawling particular directories.  For example, you can tell search engines to keep out of the folder where you keep your Flash files and your JavaScript programs (a sample disallow rule appears below).

Once bad pages are in the index, however, adjusting the robots.txt file won't get them out right away.  If a bad page has a link from somewhere else on the Web, it may still appear in the index even if it's disallowed by robots.txt.  To help you keep your site's search results in the best shape possible, we recommend the following methods for getting bad pages out of the index.

META ROBOTS TAGS DISALLOW INDEXATION

In the <HEAD> section of any HTML page, you can add the following code:

  <META NAME="ROBOTS" CONTENT="NOINDEX">

This code explicitly tells search engines not to index that page.  Even if the page has a link from somewhere else on the Web, search engines won't include a page carrying this META tag in their index.  With this code, your page will eventually drop out of the major search engines' indexes.
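For reference, the disallow rules mentioned above live in a plain-text file named robots.txt at the root of your site.  Here is a minimal sketch; the directory names are only placeholders, so substitute whatever folders you actually want to keep crawlers out of:

  # robots.txt: ask all crawlers to skip these folders
  User-agent: *
  Disallow: /flash/
  Disallow: /scripts/
  Disallow: /contest-1998/

Again, this only asks crawlers not to fetch those folders going forward; as noted above, it won't pull out pages that are already in the index.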
301 REDIRECTS TAKE VISITORS TO GOOD PAGES

Whether a page is cached or not, you can always place a 301 "Permanent" redirect on its URL.  Because the 301 redirect tells search engines that the page has permanently moved to a new location, search engines will eventually drop the old URL from their index.  This can take two weeks or more, but the pages will be gone from the index, and anyone attempting to access them will be redirected to a page you select.  (A sample redirect rule appears at the end of this post.)

LINK:CANONICAL IDENTIFIES DUPLICATE CONTENT

In some cases, you may not be able to set up a 301 redirect for every one of your site's duplicate pages.  For example, your content management system may inadvertently create duplicate pages.  In a situation like this, the link canonical tag can be used to limit the damage duplicate pages do (a sample tag appears below).  Please see Google's article "Specify Your Canonical" to learn more about this tag.

DELETE THE URLs

Search engines want to deliver a good experience to their users.  For this reason, they won't display a link that leads to a 404 "Not Found" page.  In some cases, simply deleting a page is the best move; this prevents anyone from seeing it, search engines and human visitors alike.  Once your server returns a 404 "Not Found" to the visiting crawler, the bad results will eventually leave search engine indexes.  If you're going to delete a page, we recommend creating a custom 404 page to give a friendlier "Not Found" experience to anyone who may have bookmarked the old URL (a sample configuration appears below).

GOOGLE WEBMASTER URL REMOVAL TOOL FOR EMERGENCIES

This is an emergency method for URLs that need to get out of the Google index pronto; for example, if Google has somehow indexed private information from your site.  The methods we mentioned above will take care of most of the indexed junk in a couple of weeks, but if there are one or two URLs you need out of the Google index as soon as possible, Performics recommends using Google's URL removal tool.  Of course, the Google Webmaster URL removal tool only gets pages out of the Google index, but since January 2009 Bing has offered its own URL removal options as well.  With the Bing/Yahoo! search alliance, we can presume these options will also apply to results in the Yahoo index.  At the current time, Google won't (and maybe even can't) roll out this technology to allow mass removal of URLs, so remember that this tool is for emergencies only.  Google is not under any obligation to delete 10,000 URLs from its index just because you said so, but it will try to help you out by removing a couple of very important URLs quickly.
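Here is the sample redirect rule promised in the 301 section above.  It is only a sketch, and it assumes your site runs on Apache with .htaccess overrides enabled; the file names and domain are placeholders, and other servers such as IIS or nginx have their own equivalents:

  # .htaccess: permanently redirect the old contest page to a current page
  Redirect 301 /contest-1998.html http://www.yoursite.com/promotions/

Visitors who follow an old link land on the page you chose, and crawlers record that the old URL has moved for good.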
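For the link canonical method, the tag goes in the <HEAD> of each duplicate page and points at the one version you want indexed.  The URL below is just an illustration:

  <link rel="canonical" href="http://www.yoursite.com/products/blue-widget/">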
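And if you delete pages outright, a custom 404 page can be wired up with a single line of Apache configuration (again assuming Apache; the file name is a placeholder).  The important part is that the server keeps returning a real 404 status code so crawlers know the page is gone:

  ErrorDocument 404 /not-found.html

Each of these snippets is a starting point rather than a finished implementation; adapt the paths, file names, and URLs to your own site before using them.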

