
Search engines still struggling with Internationalized Domain Names (IDNs)

Internationalized Domain Names (IDNs) are now a reality and in use by websites right now. Unfortunately, it seems that the search engines are still playing catch-up.


Introduction

Note: If you are unable to view the Arabic and Cyrillic letters in this page you may need to install the required fonts.

Now that the first Internationalized Domain Names (IDNs) have gone live and have had some time to get established, it seems like a good time to revisit the findings of my previous article on IDNs, “Can search engines handle Internationalized Domain Names (IDNs)?”

IDNs went live initially for three countries, all using the Arabic alphabet: Egypt (مصر); Saudi Arabia (السعودية); and the United Arab Emirates (امارات). Russia’s new IDN (рф) went live a little later, adding the Cyrillic alphabet to the mix, and additional IDNs have been created for other countries and alphabets. For this article I’ll take a look at how search engines handle these four IDNs.

To get an idea of how extensively the search engines have indexed sites on these new IDNs I’m going to use the “site:” operator. Although this operator is primarily used for finding pages on a particular website, e.g. [site:lbi.co.uk] it can also work all the way up to the TLD level, e.g. [site:uk].

Searching Google for [site:مصر], [site:السعودية], [site:امارات] and [site:рф] returns results from the IDNs for Egypt, Saudi Arabia, the UAE and Russia as expected. Whilst the new IDN for Saudi Arabia only had 14 pages indexed when checked, the other IDNs all feature thousands of results.

Screenshot of a search for [site:рф] in Google:

Google search for site:рф

Trying the same searches in Bing, however, does not return any results:

Bing search for site:рф

It appears that the site: operator does not work with these new IDNs in Bing (searching for other domains, e.g. [site:com], works as expected).
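As background, an IDN is carried in DNS in an ASCII-compatible form beginning with “xn--” (Punycode), so each of these four TLDs has a Unicode spelling and an ASCII spelling. The sketch below shows one way to convert between the two using Python’s built-in “idna” codec. The helper names are my own, and I have not tested whether either search engine accepts the ASCII form in a site: query.

```python
# A minimal sketch using Python's built-in "idna" codec (IDNA 2003).
# The names IDN_TLDS, to_ascii and to_unicode are hypothetical helpers
# introduced for this illustration only.

IDN_TLDS = {
    "Egypt": "مصر",
    "Saudi Arabia": "السعودية",
    "United Arab Emirates": "امارات",
    "Russia": "рф",
}


def to_ascii(label: str) -> str:
    """Return the ASCII-compatible ("xn--") form of a Unicode domain label."""
    return label.encode("idna").decode("ascii")


def to_unicode(label: str) -> str:
    """Return the Unicode form of an ASCII-compatible domain label."""
    return label.encode("ascii").decode("idna")


if __name__ == "__main__":
    for country, tld in IDN_TLDS.items():
        # Print the country, the Unicode TLD and its ASCII form side by side.
        print(country, tld, to_ascii(tld))
```

Both spellings refer to the same TLD, so a tool that only understands one of them will appear to “lose” sites that are listed under the other.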

IDNs in search results?

The next area tested is whether the search engines will return these domains in their search results. To test this I picked out some random web pages on the new Egyptian IDN and tried searching for their title tags in both Google and Bing.

Searching both Google and Bing for the title of one web page, [مراكز التميز في البحث والتطوير - وزارة الإتصالات], brought up a number of web pages. The results from Google and Bing both contained a result from an IDN:

Google snippet featuring an IDN:

Google snippet featuring an Arabic IDN

Bing snippet featuring an IDN:

Bing snippet featuring an Arabic IDN

More IDN bugs

Earlier I described how Bing’s site: operator does not yet work with IDNs. However, Google also has a number of IDN woes. Searching for [site:مصر] (the new IDN for Egypt) brings up the site سجل.مصر – however, clicking on the “Show more results from سجل.مصر” link in Google appears to be listing sites on domains other than سجل.مصر. Additionally, the “Show all results” link is percent encoded rather than listing the site name in the Arabic font.

Screenshot of Google IDN bug
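To show what “percent encoded” means here, the sketch below encodes the Arabic site name as UTF-8 percent escapes and decodes it again with Python’s standard urllib.parse module. The function names are hypothetical, and this is only an illustration of the encoding, not a description of how Google builds its links.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str) -> str:
    """Encode non-ASCII characters as UTF-8 percent escapes."""
    return quote(text, safe="")


def percent_decode(text: str) -> str:
    """Turn UTF-8 percent escapes back into readable characters."""
    return unquote(text)


if __name__ == "__main__":
    site = "سجل.مصر"
    encoded = percent_encode(site)
    print(encoded)                  # the unreadable form seen in the link
    print(percent_decode(encoded))  # the readable Arabic form
```

Decoding the link text before displaying it would be enough to show the site name in Arabic script rather than as a run of “%D9%85…” escapes.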

In my previous look at how search engines handled IDNs I had found that Google’s links to “Translate this page” and “Cached” were broken for IDNs. Today it appears that Google has fixed the translation links – however, the cache links still do not appear to function.

Conclusion

The situation is much the same as it was back in February. The search engines can index websites which use IDNs – however, all of the major search engines still have bugs with their IDN support.

Given that the number of IDNs is set to grow and the number of websites using IDNs is likely to vastly increase in the near future, it’s vital that the search engines iron out the bugs in their IDN support. After all, if a search engine can’t handle websites from a particular country properly, people might decide to switch to a search engine that can.


Bing now powering Yahoo! results in the US & Canada

Bing Yahoo! Logo
Yahoo! is dead, long live Yahoo!

The “Binghoo” search alliance is finally coming to fruition. After some initial testing Yahoo! and Bing have announced that Yahoo! has completed the Bing transition and its search results are now being powered entirely by Bing.

This initial rollout only covers the US and the English-language version of Yahoo! in Canada, with other countries set to follow. Given the relative maturity of Bing in the UK compared to many other countries we would be surprised if the next rollout didn’t include the UK, although when this will happen is anyone’s guess. Yahoo! has said that the full worldwide rollout may be as late as 2012.

One country that might not be transitioning to Bing-Powered Yahoo! is Japan – the one country in the world where Yahoo! is a market leader. Yahoo! Japan is only partially owned by Yahoo! and has said that it is planning to use Google to power its search results instead of Bing, a move which Microsoft has slammed as anti-competitive.


Google manually editing ‘Organic’ search results?

Upon the recent launch of our new LBi.com site we were alarmed to notice that Google was sending visitors to the wrong site!

As you can see below, at the time of writing, a search for [lbi.com] in google.co.uk will display a result for the Leo Baeck Institute in New York, a site about the history and culture of German speaking Jewry hosted on the domain ‘lbi.org’. The ‘sitelinks’ underneath the top result also erroneously refer and link to pages on the lbi.org domain:

Google UK lbi.com search

This is badly wrong. As it happens, this is not a major disaster for LBi, but it could be very different for our natural search clients, who could lose significant revenues as a result of this kind of error.

So why did this happen?

There are no configurations or logical connections between the “lbi.com” site and the “lbi.org” site which could have misled Google, leaving only two options: an error in Google’s code, or an error in a manually edited result. We believe the latter to be the more likely reason.

This is a very rare occurrence that gives us an insight into the world of Google, in particular how some results are so well positioned, despite there being no ‘apparent’ reason for them to be performing so well.

We do see this from time to time, although it should be stressed that the overwhelming majority of sites will never see this kind of manual intervention, and usual best practices still apply.

One reason this result may have been singled out is due to Google’s recent focus on branded search. We suspect that brand results are one of the items currently being identified and prioritised by Google for search quality purposes.

Why would Google be manually editing search results in 2010?

Manually editing SERPs is more common than you might think. It happens for numerous reasons, from legal requests for removal of content, to handing out “black hat” SEO penalties, to delivering expected results for high volume navigational queries where, for example, a user is searching for a branded website.

Search engines have a conundrum, in that they need websites to be included in their index to attract searchers. If they removed every website that infringed their terms and conditions, no matter who it belonged to, search engine users would soon get fed up and find another search engine. Likewise, if the site a user is looking for is not optimised well enough to reach the top of the results naturally, the search engine will not surface the expected result for that query, which is why search engines reserve the right to edit results manually.

This introduces the potential for human error, which we believe is the case for the erroneous result demonstrated here.

Digging a little deeper:

The cached copy of this page, shown below as indexed on the 7th of August, clearly shows “lbi.com” in the cache URL, but “lbi.org” in the cache description. This is only the case for the homepage, for the phrase [lbi.com]:

Google Cache of lbi.org

The same error is evidenced with a search for [lbi.com] on the google.com site:

google.com lbi.com search

The same is also true for a “site:” operator search, which should only return pages from the “lbi.com” domain:

Site search for lbi.com

A search for [lbi] shows the expected results, including the correct ‘lbi.com’ homepage, so this is definitely included in the index:

Google.com search for lbi

The Leo Baeck Institute website (lbi.org) has no such error, showing that there is not a plain switch of site home pages:

Site search for lbi.org

We’ve dropped Google a line and will post further updates here when we hear any news back from them…

Update: Once we highlighted this, Google’s own John Mueller provided a response in the comments below, and within 24 hours the result for [lbi.com] has now been changed to display the expected results, with an LBi.com title, snippet and sitelinks appearing at the top of the page. We would like to extend our thanks to Google for ensuring a swift resolution.


Google and Facebook gear up to fight for social search

Recently it seems that Google can’t make enough enemies – once their primary target may have been Microsoft but if Google’s attitude to Apple is anything to go by Redmond’s lot seem positively irrelevant these days. And if the rumours surrounding Google Me are anything to go by it sounds like Facebook just made the top of the hate-list.

Me is allegedly Google’s attempt to move on the ‘full service’ social network space that is Facebook (yes, I did just coin a social media description) but despite rumours proclaiming this as a major deal it is difficult not to be just a little bit cynical. We have already seen Google launch both Wave and Buzz to ridiculous hype rapidly followed by almost laughable silence weeks after their respective launches – why should Me be any different? And more importantly, why is Google not focusing on joining up all of their various social hooks into something that makes sense? At present they have a variety of different social offerings yet most of them act like the others don’t exist – from Google Voice and Chat through to Buzz, Wave and even Google Reader (with its built-in sharing settings) the graph may usually move between them but little else does.

So maybe that is all Google Me really is – a platform to pull together all of Google’s other platforms. Yet it is already being labelled a competitor to Facebook – this despite the fact that having to rebuild a whole new social graph on a new social network is about as enjoyable as actually being forced to converse with most of those forgotten school friends you could passively ignore before the days of Facebook.

So why bother? Well it probably has something to do with the fact that Facebook overtook Google in the US earlier this year to become the biggest site in terms of visits and it shows no sign of slowing. Whilst Google’s core offering (adverts served against search results) doesn’t currently directly compete with Facebook’s (adverts served against personal content) Google have to be more than just a little bit conscious that it wouldn’t take much for Facebook to make a move into their space.

What makes Google great? They have vast amounts of data about sites, the relationships between sites and the ways in which people access those sites. And what do Facebook have? Vast amounts of data about people, the relationships between people and, since the introduction of the ‘like’ button, the ways in which people access and share sites.

Facebook have recently started including sites with ‘like’ functionality into the search results a user receives when they search for anything on the Facebook site. But to be brutally honest, it’s horrible – there is no relevance to the results and it doesn’t fit with the user behaviour for people on the site. Yet it isn’t inconceivable that Facebook could buy a search engine and if you began to lay social graph data combined with content consumption habits you could have the next evolution of search: results that are socially aware. Imagine a result page where the sites your friends visit frequently get a little boost in the results for your searches.

The social search engine has seemed an obvious next step for years and yet still hasn’t happened – probably partly because no single company has had the relevant data sets; they have typically sat in separate businesses. Of course let’s not forget that privacy concerns are likely to be a huge factor too, since search is just so personal. Yet packaged in the right way, whereby both sharing and privacy controls are simple and straightforward, it becomes a tantalising prospect.

Facebook have said search isn’t their focus (but they would, wouldn’t they) yet Google’s continued focus on building relationship data certainly suggests that social search may be the future.

Ultimately Google Me will still crash and burn if it can’t offer something unique that Facebook doesn’t – trying to move a population of 400 million to a new home is no mean feat – but if Google spies a threat to their core search business then you can be sure they are about to throw everything they can at the social space.

This story was originally posted at The Wall.


Bing to launch updated, renamed web crawler “Bing Bot”

Microsoft is to launch its new spider later this year. Here’s what site owners need to know.

Microsoft’s search engine wasn’t always called “Bing” and its web crawler, “msnbot”, hasn’t kept up with the name change. When Microsoft renamed Live Search (formerly MSN Search) Bing, we have to admit to being mildly disappointed that it didn’t take the opportunity to rename its spider “Bing Bot”.

There are many good reasons not to change the name of a spider, especially one as widely used as Microsoft’s search spider. Many software packages look at the name of visiting browsers and spiders (known as the User-Agent) to perform a variety of functions, and it’s possible that problems might occur for a time on less well-configured websites if this were to be changed. For example, Yahoo! maintained the User-Agent “Slurp” for its spider, which it inherited from its acquisition of Inktomi, to “ensure consistency and minimal disruption”.

It appears that Microsoft has decided that the branding “Bing Bot” is too good to miss, however, and has announced that its next generation spider will indeed be renamed when it comes out of beta.

Here’s what site owners need to know:

When is this happening?

This will happen on 1st October 2010.

This is also when Microsoft’s new spider will officially come out of beta.

What will the User-Agent be?

Microsoft’s current User-Agent is:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)

The new Bing Bot User-Agent will be:

Mozilla/5.0 (compatible; bingbot/2.0 +http://www.bing.com/bingbot.htm)

In addition to the “bingbot” branding, there are two other changes to note. Firstly, Microsoft is switching to the “Mozilla/5.0”-style User-Agent. Google made this change more than six years ago because it wanted web servers to treat its spider more like a real web browser. The second, more minor, change is that the “b” (meaning “beta”) in its version number has been dropped.
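If your own software inspects User-Agent strings, a check along the following lines would recognise both the old and the new format. This is only a sketch: the pattern and the function name are my own, and it assumes the crawler name is always followed by a slash and a version number, as in the two strings quoted above.

```python
import re

# Hypothetical pattern: crawler name, "/", dotted version, optional "b" (beta).
CRAWLER_PATTERN = re.compile(
    r"\b(bingbot|msnbot)/(\d+(?:\.\d+)*)(b?)", re.IGNORECASE
)


def identify_microsoft_crawler(user_agent: str):
    """Return (name, version, is_beta) for Microsoft's crawler, else None."""
    match = CRAWLER_PATTERN.search(user_agent)
    if match is None:
        return None
    name, version, beta = match.groups()
    return name.lower(), version, beta.lower() == "b"


if __name__ == "__main__":
    old_ua = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
    new_ua = ("Mozilla/5.0 (compatible; bingbot/2.0 "
              "+http://www.bing.com/bingbot.htm)")
    print(identify_microsoft_crawler(old_ua))
    print(identify_microsoft_crawler(new_ua))
```

Matching on the crawler name anywhere in the string, rather than only at the start, is what keeps the check working once the “Mozilla/5.0” prefix appears.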

Any other changes to the spider’s requests?

In addition to the User-Agent change, Microsoft has also changed the “From:” HTTP header field, so the old value of:

From: msnbot(at)microsoft.com

will become:

From: bingbot(at)microsoft.com

Will my old robots.txt entries still work?

Thankfully, Microsoft has decided to make its spider respect the User-Agent field which it currently recognises in robots.txt, “msnbot”. However, the way in which it will work from October is somewhat subtle, so deserves a brief explanation.

Whilst existing directives will still work, Microsoft is also going to recognise a “User-Agent:” robots.txt entry of “bingbot”, and it will give precedence to an entry of “bingbot” over an entry of “msnbot” (which, in turn, has precedence over the catch-all User-Agent entry of “*”). This means that, if you add robots.txt rules for “bingbot”, it will ignore all other rules, including those for “msnbot”.

Whilst adding conflicting “msnbot” and “bingbot” entries hopefully isn’t too likely to happen on most sites, in a larger, more complex organisation in which many different people or departments are able to make changes to robots.txt files, I wouldn’t be surprised to see someone accidentally trip up and add a new “bingbot” entry which doesn’t match up with the already existing “msnbot” entry (for example, where a separate “crawl-delay” value for Bing is specified).
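The precedence rule can be modelled in a few lines. The sketch below is a simplified illustration of the behaviour described above, not Microsoft’s actual parser: the function names and the example robots.txt content are invented, and the parsing handles only simple “Field: value” lines.

```python
# Invented example file: the "bingbot" group omits the crawl-delay
# that the older "msnbot" group specifies.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: msnbot
Crawl-delay: 10
Disallow: /private/

User-agent: bingbot
Disallow: /tmp/
"""


def parse_groups(robots_txt: str) -> dict:
    """Map each lower-cased User-Agent token to its (field, value) rules."""
    groups = {}
    current_agents = []
    reading_agents = False
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                # A User-Agent line after rules starts a new group.
                current_agents = []
                reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def rules_for_bing(robots_txt: str):
    """Pick the single group the new crawler would obey: bingbot, msnbot, then *."""
    groups = parse_groups(robots_txt)
    for agent in ("bingbot", "msnbot", "*"):
        if agent in groups:
            return agent, groups[agent]
    return None, []


if __name__ == "__main__":
    print(rules_for_bing(EXAMPLE_ROBOTS_TXT))
```

With the example file, only the “bingbot” group is returned, so the crawl-delay and the /private/ rule from the “msnbot” group no longer apply. That is exactly the kind of accidental mismatch to watch for.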

Microsoft clearly wants site owners to update their robots.txt files with the new User-Agent, and we’d definitely recommend that you do this. Don’t forget, though, that the new Bing Bot only launches on 1st October. Until then, you should still use the old “msnbot” terminology in your robots.txt files.

What should I do now?

Firstly, if you currently have a separate robots.txt entry for msnbot on your site(s), make a note on your calendar to change it to “bingbot” on October 1st.

Secondly, make sure that your website doesn’t do anything else special for Microsoft’s crawler or for visitors which don’t identify themselves as ‘Mozilla compatible’. This could include tools such as analytics packages or software which performs anti-spam functionality such as request rate-limiting.

Other than that, there shouldn’t be anything to worry about! However, in the (hopefully unlikely) event that you do experience any problems come October, Microsoft has set up an email address (bingbot@microsoft.com) to help to resolve any issues.
