
Blogger's robots.txt and sitemap

Robots.txt
If you are using Blogger for your hosting, a robots.txt file is generated automatically, and you can NOT change it.

To find the robots.txt file, open your web browser, type in your Blogger blog's URL, and add robots.txt to the end of the URL.

For example, if the URL of your blog is http://myblog.blogspot.com, then enter http://myblog.blogspot.com/robots.txt
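As a minimal sketch, the step above can be done in Python: join your blog's address (the blogspot URL here is the made-up example from this post) with robots.txt to get the file's location.

```python
from urllib.parse import urljoin

# Hypothetical blog address; substitute your own Blogger URL.
blog_url = "http://myblog.blogspot.com"

# robots.txt always lives at the root of the site.
robots_url = urljoin(blog_url + "/", "robots.txt")
print(robots_url)  # http://myblog.blogspot.com/robots.txt

# urllib.request.urlopen(robots_url).read() would then download the file.
```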

You may find the following entries:
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search

Sitemap: http://myblog.blogspot.com/feeds/posts/default?orderby=updated

Mediapartners-Google is the AdSense crawler, which crawls pages to determine which AdSense ads to serve. Google only uses this bot to crawl your site if AdSense ads are displayed on it. So the first two lines mean your blog allows Mediapartners-Google to crawl all of its content; nothing is disallowed, which is why the value after "Disallow:" is empty.

"User-agent: *" means all search engines; the star sign '*' matches any crawler. The robots.txt instructs all search engines not to crawl the /search subdirectory, in order to avoid duplicate content. Your Blogger posts can be reached by archive date (the normal URL) and also by label, and each label assigned to a post results in a different URL pointing to the same post.

http://myblog.blogspot.com/2010/01/blogpost1.html
http://myblog.blogspot.com/search/label/label1/blogpost1.html
http://myblog.blogspot.com/search/label/label2/blogpost1.html
http://myblog.blogspot.com/search/label/label3/blogpost1.html

We can see there is only one URL named by date: if posts were indexed only by date, each post would have exactly one URL, because every post has one and only one post date.

But however many labels you assign to your post, there will be the same number of extra URLs pointing to it. "If the search engines were allowed to index by label (under the /search subdirectory), they would see 6 extra instances of that post, one per label search. Since the post was already indexed by archive date, those 6 label search instances would be considered duplicate content. The search engines would penalise all 7 URLs for having duplicated content." (Blogger help forum)
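The effect of these rules can be checked with Python's standard-library robots.txt parser. The rules are reproduced as a string so the example is self-contained (no network access needed); the URLs are the made-up examples from this post.

```python
from urllib.robotparser import RobotFileParser

# The rules Blogger serves, inlined for a self-contained example.
rules = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The date-based archive URL is crawlable by ordinary crawlers...
ok_date = rp.can_fetch(
    "Googlebot", "http://myblog.blogspot.com/2010/01/blogpost1.html")
# ...but the label URLs under /search are blocked for them...
ok_label = rp.can_fetch(
    "Googlebot", "http://myblog.blogspot.com/search/label/label1/blogpost1.html")
# ...while the AdSense bot may fetch anything.
ok_adsense = rp.can_fetch(
    "Mediapartners-Google",
    "http://myblog.blogspot.com/search/label/label1/blogpost1.html")

print(ok_date, ok_label, ok_adsense)  # True False True
```

This is exactly how the duplicate-content problem is avoided: only the single date-based URL per post is left crawlable for general search engines.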

Sitemap
Now about the sitemap. You can find a line in robots.txt that starts with 'Sitemap:'. The URL after that label is the location of your sitemap.

"Using the example above, the line would look like:

Sitemap: http://myblog.blogspot.com/feeds/posts/default?orderby=updated

Back in Google’s Webmaster Tools, the domain name part of the URL would already be included, so you would just need to specify the feeds/posts/default?orderby=updated portion of the sitemap URL. "(Technically Easy)

You can also skip submitting this sitemap; Google Webmaster Tools will look in the robots.txt file for a sitemap if one isn't specified.

Google Sitemaps will accept any XML feed, so you can submit your blog feed as the sitemap URL, for example atom.xml.
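The feed that Blogger advertises as a sitemap is an ordinary Atom feed, so its post URLs can be read with Python's standard XML module. A minimal sketch, using a hand-written sample of the feed (the URLs are illustrative, not fetched from a real blog):

```python
import xml.etree.ElementTree as ET

# A minimal sample of the Atom feed Blogger serves at
# /feeds/posts/default?orderby=updated (entries abbreviated for illustration).
feed_xml = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <link rel="alternate" type="text/html"
          href="http://myblog.blogspot.com/2010/01/blogpost1.html"/>
  </entry>
  <entry>
    <link rel="alternate" type="text/html"
          href="http://myblog.blogspot.com/2010/02/blogpost2.html"/>
  </entry>
</feed>"""

ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(feed_xml)

# Each post's canonical (date-based) URL is the rel="alternate" link.
post_urls = [
    link.get("href")
    for link in root.findall("atom:entry/atom:link", ns)
    if link.get("rel") == "alternate"
]
print(post_urls)
```

Note that the feed lists only the date-based URLs, never the /search label URLs, which is why it works well as a sitemap.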

Comments

songwriter said…
Yes, this was helpful as I was trying to fix a problem where Google had indexed houston.maindomain.com, which I point to FreshHoustonJobs.com. I added a disallow rule to robots.txt for houston.maindomain.com and was looking for info about sitemap entries in the robots.txt file. Thanks for posting. I already have a sitemap registered with Google, but now I'll add a link to it in my robots.txt too. Hopefully that keeps Google from accessing the same content via the subdomain that FreshHoustonJobs.com points to. Don't want duplicate content.
