When Written: April 2007
One particular client site has been occupying a significant part of my time recently. The site was built and launched just before Christmas, and the client was understandably keen for Google to index it as soon as possible. Now whilst Google may have its faults, it is one of the few search engines that provides a complete front end for webmasters to monitor how it is indexing their sites (http://www.google.co.uk/intl/en/webmasters). The webmaster registers a web site with Google, and Google then asks them to place a cryptically named file on the site to prove that they do in fact own it and have the right to administer it.
The webmaster can then see the results of Google’s web crawler. Rather than leave the indexing of your site entirely to that crawler, known as Googlebot, you can give it some help. You may find that it is missing certain pages, or you may want to give it some indication of which pages have been updated. For this, Google now supports ‘Sitemaps’. These are XML files in an open standard format that tell the search engine’s crawler where your pages are, when they were last updated and how important it is to index them. The format is quite easy to understand. Although there are tools out there to create a sitemap for your web site, you will probably find that you want to customise it to get the desired results from Google. A sitemap will not increase your rankings, but it will at least help to make sure that Google knows all it needs to know about your site.
Let’s examine a typical sitemap file. This one is from an imaginary web site that has just two pages: index.html and page2.html. Obviously, the more pages you want to tell the search engine about, the longer the file will be. The first two lines are typical XML definition lines and tell the consuming service what the layout of the file is:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
Now we enter the definition of a page, which is held between <url> and </url> tags:
<url>
<loc>http://www.mywebsite.co.uk/index.html</loc>
This is obviously the URL of the page that you wish to have indexed. Next we tell the search engine when it was last changed:
<lastmod>2007-02-15T12:01:44Z</lastmod>
Note the order of the date: it is Year-Month-Day, as per the W3C Datetime format, and the ‘Z’ at the end indicates that the time is given in UTC.
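Incidentally, if you are generating the sitemap from a script rather than by hand, producing that timestamp is a one-liner. Here is a rough sketch in Python (my choice of language purely for illustration, not anything mandated by the sitemap format):

from datetime import datetime, timezone

# Current time in UTC, formatted as a W3C datetime with the trailing 'Z'
lastmod = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(lastmod)  # e.g. 2007-02-15T12:01:44Z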
We can also tell it how often the page changes. There is a temptation to enter a shorter period than is really necessary, but then the Googlebot may not have enough time to index all your pages; helping it out by telling it honestly which pages change, and how often, can often help the indexing process.
<changefreq>daily</changefreq>
The options here are always, hourly, daily, weekly, monthly, yearly and never.
And finally you can specify a priority that tells Google how important this page is within the site. This takes a value from 0.0 to 1.0, with 0.5 being the default.
<priority>0.5</priority>
</url>
Only the <urlset>, <url> and <loc> tags are compulsory; the rest are optional.
So for the second and further pages this is repeated:
<url>
<loc>http://www.mywebsite.co.uk/page2.html</loc>
<lastmod>2007-02-15T12:01:44Z</lastmod>
<changefreq>daily</changefreq>
<priority>0.5</priority>
</url>
And at the end of the file we close the whole thing up with a closing </urlset> tag.
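If your site has more than a handful of pages, you will probably want to generate the file rather than hand-edit it. As a minimal sketch, again in Python and assuming nothing more than a hard-coded list of pages (the URLs, frequencies and output filename below are just placeholders), something like this will produce a file in the format described above:

from datetime import datetime, timezone

# Placeholder list of (URL, change frequency, priority) entries for the imaginary site.
# Note: real-world URLs containing characters such as & or < would need XML-escaping.
pages = [
    ("http://www.mywebsite.co.uk/index.html", "daily", "0.5"),
    ("http://www.mywebsite.co.uk/page2.html", "daily", "0.5"),
]

# One timestamp for all pages; a real generator would use each page's own last-modified date
lastmod = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">']
for url, freq, priority in pages:
    lines.append("  <url>")
    lines.append("    <loc>%s</loc>" % url)
    lines.append("    <lastmod>%s</lastmod>" % lastmod)
    lines.append("    <changefreq>%s</changefreq>" % freq)
    lines.append("    <priority>%s</priority>" % priority)
    lines.append("  </url>")
lines.append("</urlset>")

with open("sitemap.xml", "w") as f:
    f.write("\n".join(lines) + "\n")

Run it and upload the resulting sitemap.xml, and you have the two-page example shown above.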
Once you have created your sitemap and uploaded it to your web site, you then need to tell Google about it through the webmaster front end. Then you sit back and wait for Google to re-index the site. To see which pages of a particular site Google has indexed, just type site:mywebsiteURL into the box where you would normally type your Google search. Google will then return links to all the pages of your site that it knows about. By fine-tuning your sitemap and re-running this Google search a few days later, you can analyse how the changes to the sitemap affect Google’s indexing.
I do wish that other search engines enabled the webmaster to see which pages they are indexing in the same way; it helps so much when explaining to a client that the indexing is working, but that these things can take time. Don’t think, by the way, that just because the Googlebot has visited your web site all the pages have been indexed. It seems to get bored sometimes and give up, only to come back a few days later to grab some more. If you want to learn more than is good for you about Google and how to optimise your web site for it, then the extremely active forums in the webmaster area of Google (http://groups.google.com/group/Google_Webmaster_Help) are a great place to look. See you there!
Article by: Mark Newton