Posted by Sam Battin, Senior Natural Search Specialist

Understanding how robots think is a good way to predict how your actions will affect your site. Remember, robots are stupid. They’re great at doing what they’re told, but if they’re not told to do something, they’ll never do it, as we’ll soon see…

How Do Robots See XML Sitemaps?

Google introduced XML Sitemaps back in June 2005. The idea was to let webmasters provide a complete list of their site’s URLs. This way, Google could compare what its spiders had found against the complete list and improve its crawlers to find previously un-findable URLs. The idea was so good that a little over a year later Google, Yahoo! and Microsoft (whose search engine is now Bing) announced joint support for the sitemaps protocol.

A sample entry from an XML sitemap looks like this, and doesn’t make a lot of sense if you’ve never seen it before:

<url>
  <loc>http://example.com/</loc>
  <lastmod>2006-11-18</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.8</priority>
</url>

To a robot, however, this code explains a lot. It’s also an efficient way to pass along information when a site has thousands of pages. Each tag in the XML sitemap file tells robots something important about one of your URLs, as we see in the table below:
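To make the robot’s point of view a little more concrete, here is a minimal sketch in Python of how a program might read those tags. It is only an illustration, not how any search engine actually does it: the `read_sitemap` helper and the `SAMPLE` string are made up for this example, and the sketch assumes `<lastmod>` always starts with a full YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# The sample entry from above, wrapped in the <urlset> root a real file has.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2006-11-18</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""


def read_sitemap(xml_text):
    """Return one dict per <url> entry; missing optional tags become None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        priority = url.findtext("sm:priority", namespaces=NS)
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            # Assumption: keep only the leading YYYY-MM-DD part of the date.
            "lastmod": date.fromisoformat(lastmod[:10]) if lastmod else None,
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": float(priority) if priority else None,
        })
    return entries


for entry in read_sitemap(SAMPLE):
    print(entry["loc"], entry["lastmod"], entry["changefreq"], entry["priority"])
```

With the fields parsed this way, comparing `lastmod` against the date of a previous visit is a one-line check, which is roughly the kind of decision a robot can make from the file alone.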
Based on the information in your XML sitemap file, robots can crawl your site more efficiently. For example, if your XML sitemap lists every URL on your site, search engines know for certain how many pages there are to crawl before they stop. Additionally, by looking at the XML file, a robot can determine whether any pages have changed since its last visit. Pages that haven’t changed since last time don’t need to be crawled again; this gives the robot more time to crawl the pages you’ve updated most recently. For these reasons, your XML file should be as accurate and as up-to-date as possible.

Google Still Hasn’t Crawled These Pages!

A complaint Performics runs across at times comes from webmasters who have submitted an XML sitemap to Google containing every single URL on their site, but who find their submitted URLs missing when they look for them in Google a week later. When they search for the URL, nothing comes back.

There are a few different reasons for this. Most likely, Google has not yet crawled that page. Google will only crawl a certain number of pages each time it visits your site. If your site has hundreds or thousands of pages, Google will not have time to crawl every page in a single visit, even if the URLs appear in an XML sitemap.

Another possible reason Google hasn’t crawled your URL yet is that it hasn’t found a path to it from your home page. Basically, it’s not enough for you to simply tell Google that there is a URL at a particular location. Before it will use your page as a search result, Google has to find a way to reach that URL through existing hyperlinks on the Web. These hyperlinks can be on your site, or they may be on another site, but unless Google can find some way to crawl the URL apart from your sitemap file, the URL may not appear as a search result. Without a hyperlink path, Google can’t calculate the PageRank value of your page, and can’t accurately rank your page vs. its competitors.
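One way to check your own site for this problem is to look for sitemap URLs that no chain of links from the home page leads to. The Python sketch below does that with a simple breadth-first walk. The `LINKS` map and `SITEMAP` list are hypothetical data made up for the example, and it only considers links inside your own site; a page could still be reachable through a link from another site.

```python
from collections import deque


def reachable_from(home, links):
    """Breadth-first walk over a link graph, starting at the home page."""
    seen = {home}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def orphaned_urls(sitemap_urls, home, links):
    """Sitemap URLs that no chain of hyperlinks from `home` leads to."""
    reachable = reachable_from(home, links)
    return sorted(u for u in sitemap_urls if u not in reachable)


# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about.html", "/products/"],
    "/products/": ["/products/widget.html"],
}
SITEMAP = [
    "/",
    "/about.html",
    "/products/",
    "/products/widget.html",
    "/awesome/great.html",  # listed in the sitemap, but nothing links to it
]

print(orphaned_urls(SITEMAP, "/", LINKS))  # ['/awesome/great.html']
```

Any URL this reports is one that a robot can only learn about from the sitemap, so it is a good candidate for a new internal link.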
At the very least, Google needs to visit your page with its robot and make sure the server isn’t returning a 404 “File Not Found” result.

Don’t Confuse The Robot

One final reason your URL might not appear in Google is that your robots.txt has disallowed robots from crawling it. For example, if your XML sitemap says there’s a URL at /awesome/great.html but your robots.txt says Disallow: /awesome/ then that’s going to confuse the robot. In one place you tell the robot “index this page” and in the other place you tell it “don’t index this page.” When this happens, steam will erupt from the robot’s ears and it will repeat “DOES NOT COMPUTE” over and over again, and your pages will not be indexed. The same thing can happen when you add a META robots “NOINDEX,NOFOLLOW” tag to a page you’ve included in the XML sitemap; in either case your page isn’t going to appear as a search result.
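A quick way to catch the robots.txt conflict before a robot does is to run your sitemap URLs through a robots.txt parser. Here is a minimal sketch using the `urllib.robotparser` module from Python’s standard library. The `blocked_sitemap_urls` helper, the `ROBOTS_TXT` text and the example.com URLs are assumptions made for illustration, and the sketch does not look at META robots tags.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemap_urls(robots_txt, sitemap_urls, user_agent="*"):
    """Return the sitemap URLs that robots.txt tells `user_agent` not to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in sitemap_urls if not parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt that mirrors the example above.
ROBOTS_TXT = """User-agent: *
Disallow: /awesome/
"""

# Hypothetical URLs taken from the XML sitemap.
URLS = [
    "http://example.com/awesome/great.html",
    "http://example.com/index.html",
]

print(blocked_sitemap_urls(ROBOTS_TXT, URLS))
# ['http://example.com/awesome/great.html']
```

Every URL in the output is one where the sitemap and robots.txt disagree, so either the Disallow rule or the sitemap entry should go.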