Robots.txt
If you host your blog on Blogger, a robots.txt file is created for you automatically, and you can NOT change it.
To find the robots.txt, open your web browser, type in your Blogger blog's URL, and add /robots.txt to the end of it.
For example, if the URL of your blog is http://myblog.blogspot.com, then enter http://myblog.blogspot.com/robots.txt
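If you would rather build that address in a script, here is a minimal Python sketch using only the standard library. The helper name robots_url is made up for this example; it is not part of Blogger or any library.

```python
from urllib.parse import urljoin


def robots_url(blog_url):
    """Return the robots.txt address for the site that hosts blog_url."""
    # The leading slash makes urljoin replace any existing path,
    # so this also works when given the URL of a single post.
    return urljoin(blog_url, "/robots.txt")


print(robots_url("http://myblog.blogspot.com"))
```

Passing the URL of an individual post should give the same result, because robots.txt always lives at the root of the site.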
You may find the following entries:
User-agent: Mediapartners-Google
Disallow:
User-agent: *
Disallow: /search
Sitemap: http://myblog.blogspot.com/feeds/posts/default?orderby=updated
Mediapartners-Google is the AdSense crawler, which crawls pages to determine which AdSense ads fit the content. Google only uses this bot to crawl your site if AdSense ads are displayed on it. So the first two lines mean that your blog allows Mediapartners-Google to crawl all blog content: nothing is disallowed, so the value after "Disallow:" is empty.
"User-agent: *" applies to all search engines; the star sign '*' means all. This robots.txt instructs all search engines not to crawl the /search subdirectory, in order to avoid duplicate content. Your Blogger posts can be reached by archive date (the normal way) and also by label, and each label on a post results in a different URL pointing to the same post:
http://myblog.blogspot.com/2010/01/blogpost1.html
http://myblog.blogspot.com/search/label/label1/blogpost1.html
http://myblog.blogspot.com/search/label/label2/blogpost1.html
http://myblog.blogspot.com/search/label/label3/blogpost1.html
We can see that there is only one URL based on the date: every post has one and only one post date, so if posts are indexed only by date, each post is indexed exactly once.
But however many labels you assign to a post, there will be the same number of extra URLs pointing to that post. "If the search engines were allowed to index by label (under /search subdirectory), they would see 6 extra instances of that post, one per label search. Since the post was already indexed by archive date, those 6 label search instances would be considered duplicate content. The search engines would penalise all 7 URLs for having duplicated content." (Blogger help forum)
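To see how a crawler would read these rules, here is a small Python sketch that uses the standard library's urllib.robotparser. The rules are pasted in as text so nothing is fetched from the network. The crawler name "Googlebot" and the helper name allowed are just choices for this example, and Python's parser is only an approximation of how any particular search engine interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rules from the Blogger robots.txt, pasted in as text.
RULES = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, url):
    """Return True if the rules above let `agent` crawl `url`."""
    return parser.can_fetch(agent, url)


URLS = [
    "http://myblog.blogspot.com/2010/01/blogpost1.html",
    "http://myblog.blogspot.com/search/label/label1/blogpost1.html",
    "http://myblog.blogspot.com/search/label/label2/blogpost1.html",
    "http://myblog.blogspot.com/search/label/label3/blogpost1.html",
]

for url in URLS:
    print(url, allowed("Googlebot", url), allowed("Mediapartners-Google", url))
```

With these rules, a general crawler is allowed on the date URL and blocked on the three label URLs, while Mediapartners-Google is allowed on all four because its Disallow line is empty.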
Sitemap
Now about the sitemap. You can find a line in the robots.txt that starts with 'Sitemap:'. The URL after that label is the location of your sitemap.
"Using the example above, the line would look like:
Sitemap: http://myblog.blogspot.com/feeds/posts/default?orderby=updated
Back in Google’s Webmaster Tools, the domain name part of the URL would already be included, so you would just need to specify the feeds/posts/default?orderby=updated portion of the sitemap URL." (Technically Easy)
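The two steps described above (find the 'Sitemap:' line, then keep only the part after the domain name) can be sketched in Python with the standard library. The function names sitemap_urls and relative_part are made up for this example, and the robots.txt text is pasted in rather than downloaded.

```python
from urllib.parse import urlsplit

ROBOTS_TXT = """\
User-agent: *
Disallow: /search

Sitemap: http://myblog.blogspot.com/feeds/posts/default?orderby=updated
"""


def sitemap_urls(robots_txt):
    """Collect the URL from every line that starts with 'Sitemap:'."""
    urls = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the colon in "http://" survives.
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def relative_part(url):
    """Drop the scheme and domain; keep path and query, no leading slash."""
    parts = urlsplit(url)
    rel = parts.path.lstrip("/")
    if parts.query:
        rel += "?" + parts.query
    return rel


for url in sitemap_urls(ROBOTS_TXT):
    print(relative_part(url))
```

For the example blog, this prints the feeds/posts/default?orderby=updated portion mentioned in the quote.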
You can also skip submitting this sitemap yourself: Google Webmaster Tools will look in the robots.txt file for a sitemap if one isn't specified.
Google Sitemaps also accepts RSS and Atom feeds, so you can submit your blog feed as the sitemap URL, for example atom.xml.