
Bing to launch updated, renamed web crawler “Bing Bot”

Microsoft is to launch its new spider later this year. Here’s what site owners need to know.

Microsoft’s search engine wasn’t always called “Bing”, and its web crawler, “msnbot”, hasn’t kept up with the name change. When Microsoft renamed Live Search (formerly MSN Search) to Bing, we have to admit that we were mildly disappointed that it didn’t take the opportunity to rename its spider “Bing Bot”.

There are many good reasons not to change the name of a spider, especially one as widely used as Microsoft’s search spider. Many software packages look at the name of visiting browsers and spiders (known as the User-Agent) to perform a variety of functions, and it’s possible that problems might occur for a time on less well-configured websites if this were to be changed. For example, Yahoo! maintained the User-Agent “Slurp” for its spider, which it inherited from its acquisition of Inktomi, to “ensure consistency and minimal disruption”.

It appears, however, that Microsoft has decided that the “Bing Bot” branding is too good to miss, and has announced that its next-generation spider will indeed be renamed when it comes out of beta.

Here’s what site owners need to know:

When is this happening?

This will happen on 1st October 2010.

This is also when Microsoft’s new spider will officially come out of beta.

What will the User-Agent be?

Microsoft’s current User-Agent is:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)

The new Bing Bot User-Agent will be:

Mozilla/5.0 (compatible; bingbot/2.0 +http://www.bing.com/bingbot.htm)

In addition to the “bingbot” branding, there are two other changes to note. Firstly, Microsoft is switching to the “Mozilla/5.0”-style User-Agent. Google made this change more than six years ago because it wanted web servers to treat its spider more like a real web browser. The second, more minor, change is that the “b” (meaning “beta”) in its version number has been dropped.
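Any server-side code that recognises the crawler by name will need to accept both strings during the changeover. As a minimal sketch in Python, assuming a hypothetical helper called `is_microsoft_crawler` and a simple case-insensitive substring match (User-Agent strings can be spoofed, so this identifies a claimed crawler rather than verifying it):

```python
# Hypothetical helper: the function name and matching strategy are
# illustrative assumptions, not something Microsoft has published.
MICROSOFT_CRAWLER_TOKENS = ("msnbot", "bingbot")


def is_microsoft_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains the old or the new crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in MICROSOFT_CRAWLER_TOKENS)


# The two User-Agent strings quoted in this article.
OLD_UA = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
NEW_UA = "Mozilla/5.0 (compatible; bingbot/2.0 +http://www.bing.com/bingbot.htm)"

old_matches = is_microsoft_crawler(OLD_UA)
new_matches = is_microsoft_crawler(NEW_UA)
```

Matching on the token rather than on the whole string means the check should keep working if the version number changes again.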

Any other changes to the spider’s requests?

In addition to the User-Agent change, Microsoft has also changed the “From:” HTTP header field, so the old value of:

From: msnbot(at)microsoft.com

will become:

From: bingbot(at)microsoft.com

Will my old robots.txt entries still work?

Thankfully, Microsoft has decided to make its spider respect the User-Agent field which it currently recognises in robots.txt, “msnbot”. However, the way in which it will work from October is somewhat subtle, so deserves a brief explanation.

Whilst existing directives will still work, Microsoft is also going to recognise a “User-Agent:” robots.txt entry of “bingbot”, and it will give precedence to an entry of “bingbot” over an entry of “msnbot” (which, in turn, has precedence over the catch-all User-Agent entry of “*”). This means that, if you add robots.txt rules for “bingbot”, the spider will follow only those rules and ignore all others, including those for “msnbot”.
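The precedence described above can be sketched in a few lines of Python. This is our own illustration of the stated order (“bingbot”, then “msnbot”, then “*”), not Microsoft’s implementation; the function names and the sample file are hypothetical, and the parser is deliberately simplistic.

```python
# Illustrative sketch only: a simplified robots.txt reader that applies the
# bingbot > msnbot > * precedence described in the article.
def parse_groups(robots_txt: str) -> dict:
    """Map each lower-cased User-Agent name to its list of (field, value) rules."""
    groups = {}
    current_agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-Agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def rules_for_bingbot(robots_txt: str) -> list:
    """Return only the rules of the most specific matching group."""
    groups = parse_groups(robots_txt)
    for name in ("bingbot", "msnbot", "*"):
        if name in groups:
            return groups[name]
    return []


# Hypothetical robots.txt with an older msnbot entry and a newer bingbot one.
SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: msnbot
Crawl-delay: 10
Disallow: /private/

User-agent: bingbot
Crawl-delay: 5
"""

selected = rules_for_bingbot(SAMPLE)
```

In this sample only the “bingbot” group is selected, so the “Disallow: /private/” rule written for “msnbot” no longer applies to the new spider. Any rule you still want has to be repeated in the “bingbot” entry.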

Adding conflicting “msnbot” and “bingbot” entries hopefully isn’t too likely on most sites. In a larger, more complex organisation, however, where many different people or departments are able to make changes to robots.txt files, we wouldn’t be surprised to see someone accidentally trip up and add a new “bingbot” entry which doesn’t match up with the existing “msnbot” entry (for example, where a separate “crawl-delay” value for Bing is specified).

Microsoft clearly wants site owners to update their robots.txt files with the new User-Agent, and we’d definitely recommend that you do this. Don’t forget, though, that the new Bing Bot only launches on 1st October; until then, you should still use the old “msnbot” name in your robots.txt files.

What should I do now?

Firstly, if you currently have a separate robots.txt entry for msnbot on your site(s), make a note on your calendar to change it to “bingbot” on 1st October.

Secondly, make sure that your website doesn’t do anything else special for Microsoft’s crawler or for visitors which don’t identify themselves as ‘Mozilla compatible’. This could include tools such as analytics packages or software which performs anti-spam functionality such as request rate-limiting.

Other than that, there shouldn’t be anything to worry about! However, in the (hopefully unlikely) event that you do experience any problems come October, Microsoft has set up an email address (bingbot@microsoft.com) to help to resolve any issues.


Dissecting the URL

Often, marketing clients have a deep understanding of their own businesses, route to market and campaign targets, but aren’t necessarily experts in the field of digital. Some of the more staple elements of the online world require an explanation to assist in interpreting reports and campaign documentation. This article explains the URL and the nomenclature of its various components.

In the example URL, HTTP (Hypertext Transfer Protocol) is the standard protocol used by most web pages. The other common protocol is HTTPS, which is used for secure connections. The colon is used as a separator, and the double slash // introduces the server to which the connection is made.

The www.example.com part is the domain name, but it may also be called the hostname when associated with an IP address. The www part is an optional subdomain, while .com is the TLD (Top Level Domain). It should be noted that example.com (without the www) is also a domain name and, similarly, if an IP address is associated with it, it can also be a hostname.

The next part, after the colon that follows the domain name, is the port number (80, in the example above). Port 80 is the default port for HTTP and is rarely seen, as most browsers don’t display it.

In the above example, /media represents the path. In situations in which there is no path, the slash would indicate the root of the domain. In many URLs, the last part is followed by a further / and then a file name: index.html, index.htm, default.html and index.php being four common examples.

A URL can also include sub directories: in the above example, /media/ is the sub directory.
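To see the components side by side, Python’s standard library can split a URL into its parts. The URL below is our own assumption, assembled from the pieces described in this article (protocol, domain name, port 80, the /media path and an id parameter); it is a sketch for illustration only.

```python
from urllib.parse import urlsplit

# Hypothetical URL built from the components the article describes.
EXAMPLE_URL = "http://www.example.com:80/media?id=647386768"

parts = urlsplit(EXAMPLE_URL)

scheme = parts.scheme      # the protocol, before the first colon
hostname = parts.hostname  # the domain name / hostname
port = parts.port          # the port number, as an integer
path = parts.path          # the path, starting at the first single slash
query = parts.query        # everything after the "?", without the "?" itself

print(scheme, hostname, port, path, query)
```

If the port is left out of the URL, `parts.port` is `None`, which mirrors the way browsers hide the default port.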

The URL in the example contains one last section, ?id=647386768. This is a URL parameter and may well mean that the URL is dynamic, that is to say, it is generated by code (often from a content management system). Dynamic URLs can be problematic from an SEO point of view. The parameter here is also named id. Using id= (or sid=) as a parameter name is not recommended, as some search engines can construe this as denoting a session ID and may not fully spider the URL. When it comes to using dynamic pages, Google has offered the following advice:

If you decide to use dynamic pages (i.e., the URL contains a "?" character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them few.
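The points above about parameter names and parameter counts can be checked mechanically. The following is a minimal sketch in Python; the function name `query_report` and the set of names it flags are our own assumptions, based only on the id= and sid= examples mentioned in this article.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names the article warns may be read as session IDs.
SESSION_LIKE_NAMES = {"id", "sid"}


def query_report(url: str) -> dict:
    """Count a URL's parameters and list any with session-like names."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    flagged = sorted(name for name in params if name.lower() in SESSION_LIKE_NAMES)
    return {"count": len(params), "flagged": flagged}


# Hypothetical URLs for illustration.
dynamic = query_report("http://www.example.com/media?id=647386768")
static = query_report("http://www.example.com/media")
```

A report like this could be run over a list of URLs from a site crawl to find pages worth rewriting with shorter or more descriptive parameters.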

You will, on occasions, see URLs like the following:

http://www.example.com/reptiles.html#terrapins

Here, the # denotes a “named anchor”, which is in effect a placeholder on a page. These are especially useful in long HTML pages when you want to link to a specific part of a page. Search engines do not follow these. As AJAX pages sometimes use the # as part of the URL structure, this can render them uncrawlable. However, Google has mentioned that it is working on crawling AJAX pages and has proposed a technique for creating search-friendly AJAX pages.
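The part after the # is handled by the browser rather than sent to the web server, so two URLs that differ only in their anchor point at the same document. A short Python sketch shows the split; the helper name `same_document` and the second anchor are our own, for illustration.

```python
from urllib.parse import urldefrag

# Split the article's example into the page URL and the named anchor.
page_url, anchor = urldefrag("http://www.example.com/reptiles.html#terrapins")


def same_document(url_a: str, url_b: str) -> bool:
    """Return True if two URLs are identical once their anchors are removed."""
    return urldefrag(url_a).url == urldefrag(url_b).url


# Hypothetical second anchor on the same page.
is_same = same_document(
    "http://www.example.com/reptiles.html#terrapins",
    "http://www.example.com/reptiles.html#tortoises",
)
```

This is also why content that only appears after the # in an AJAX-style URL is invisible to a crawler that requests the page URL alone.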
