
Blogger's robots.txt and sitemap

Robots.txt
If you are using Blogger for your hosting, a robots.txt file is generated automatically, and you can NOT change it.

To find the robots.txt file, open your web browser, type in your Blogger blog's URL, and add robots.txt to the end of the URL.

For example, if the URL of your blog is http://myblog.blogspot.com, then enter http://myblog.blogspot.com/robots.txt
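As a minimal sketch, the step above can be done in Python: join your blog's address (the blogspot URL here is the made-up example from this post) with robots.txt to get the file's location.

```python
from urllib.parse import urljoin

# Hypothetical blog address; substitute your own Blogger URL.
blog_url = "http://myblog.blogspot.com"

# robots.txt always lives at the root of the site.
robots_url = urljoin(blog_url + "/", "robots.txt")
print(robots_url)  # http://myblog.blogspot.com/robots.txt

# urllib.request.urlopen(robots_url).read() would then download the file.
```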

You may find the following entries:
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search

Sitemap: http://myblog.blogspot.com/feeds/posts/default?orderby=updated

Mediapartners-Google is the AdSense crawler, which crawls pages to determine which AdSense ads to serve. Google only uses this bot to crawl your site if AdSense ads are displayed on it. So the first two lines mean your blog allows Mediapartners-Google to crawl all of its content; nothing is disallowed, which is why the value after "Disallow:" is empty.

"User-agent: *" means all search engines; the star sign '*' matches any crawler. The robots.txt instructs all search engines not to crawl the /search subdirectory, in order to avoid duplicate content. Your Blogger posts can be reached by archive date (the normal URL) and also by label, and each label assigned to a post results in a different URL pointing to the same post.

http://myblog.blogspot.com/2010/01/blogpost1.html
http://myblog.blogspot.com/search/label/label1/blogpost1.html
http://myblog.blogspot.com/search/label/label2/blogpost1.html
http://myblog.blogspot.com/search/label/label3/blogpost1.html

We can see there is only one URL named by date: if posts were indexed only by date, each post would have exactly one URL, because every post has one and only one post date.

But however many labels you assign to your post, there will be the same number of extra URLs pointing to it. "If the search engines were allowed to index by label (under the /search subdirectory), they would see 6 extra instances of that post, one per label search. Since the post was already indexed by archive date, those 6 label search instances would be considered duplicate content. The search engines would penalise all 7 URLs for having duplicated content." (Blogger help forum)
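The effect of these rules can be checked with Python's standard-library robots.txt parser. The rules are reproduced as a string so the example is self-contained (no network access needed); the URLs are the made-up examples from this post.

```python
from urllib.robotparser import RobotFileParser

# The rules Blogger serves, inlined for a self-contained example.
rules = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The date-based archive URL is crawlable by ordinary crawlers...
ok_date = rp.can_fetch(
    "Googlebot", "http://myblog.blogspot.com/2010/01/blogpost1.html")
# ...but the label URLs under /search are blocked for them...
ok_label = rp.can_fetch(
    "Googlebot", "http://myblog.blogspot.com/search/label/label1/blogpost1.html")
# ...while the AdSense bot may fetch anything.
ok_adsense = rp.can_fetch(
    "Mediapartners-Google",
    "http://myblog.blogspot.com/search/label/label1/blogpost1.html")

print(ok_date, ok_label, ok_adsense)  # True False True
```

This is exactly how the duplicate-content problem is avoided: only the single date-based URL per post is left crawlable for general search engines.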

Sitemap
Now about the sitemap. You can find a line in robots.txt that starts with 'Sitemap:'. The URL after that label is the location of your sitemap.

"Using the example above, the line would look like:

Sitemap: http://myblog.blogspot.com/feeds/posts/default?orderby=updated

Back in Google’s Webmaster Tools, the domain name part of the URL would already be included, so you would just need to specify the feeds/posts/default?orderby=updated portion of the sitemap URL. "(Technically Easy)

You can also skip submitting this sitemap; Google Webmaster Tools will look in the robots.txt file for a sitemap if one isn't specified.

Google Sitemaps will accept any XML feed, so you can submit your blog feed as the sitemap URL, for example atom.xml.
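The feed that Blogger advertises as a sitemap is an ordinary Atom feed, so its post URLs can be read with Python's standard XML module. A minimal sketch, using a hand-written sample of the feed (the URLs are illustrative, not fetched from a real blog):

```python
import xml.etree.ElementTree as ET

# A minimal sample of the Atom feed Blogger serves at
# /feeds/posts/default?orderby=updated (entries abbreviated for illustration).
feed_xml = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <link rel="alternate" type="text/html"
          href="http://myblog.blogspot.com/2010/01/blogpost1.html"/>
  </entry>
  <entry>
    <link rel="alternate" type="text/html"
          href="http://myblog.blogspot.com/2010/02/blogpost2.html"/>
  </entry>
</feed>"""

ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(feed_xml)

# Each post's canonical (date-based) URL is the rel="alternate" link.
post_urls = [
    link.get("href")
    for link in root.findall("atom:entry/atom:link", ns)
    if link.get("rel") == "alternate"
]
print(post_urls)
```

Note that the feed lists only the date-based URLs, never the /search label URLs, which is why it works well as a sitemap.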

Comments

songwriter said…
Yes, this was helpful as I was trying to fix a problem where Google had indexed houston.maindomain.com, which I point to FreshHoustonJobs.com. I added a disallow rule to robots.txt for houston.maindomain.com and was looking for info about sitemap entries in the robots.txt file. Thanks for posting. I already have a sitemap registered with Google, but now I'll add a link to it in my robots.txt too. Hopefully that keeps Google from accessing the same content via the subdomain that FreshHoustonJobs.com points to. Don't want duplicate content.
