ask

Can search engines handle Internationalized Domain Names (IDNs)?

Internationalized Domain Names (IDNs) have been approved by ICANN and are set to become a reality. Are the search engines prepared for them?



Introduction

Note: If you are unable to view the Chinese and Arabic letters in this page you may need to install the required fonts.

In October 2009, ICANN voted to allow the use of non-ASCII characters in domain names. Non-ASCII characters have existed within domain names for a while – for example, many Hong Kong sites feature Chinese characters (example: http://香港儒釋道院.組織.hk). However, before now, these characters were not allowed within TLDs and, as such, URLs still required ASCII characters (in the example above, the ccTLD “.hk”).

ICANN launched the IDN ccTLD Fast Track Process in November, and last month announced that four top-level IDNs had successfully passed the initial stage of approval (three Arabic-language IDN ccTLDs for Egypt, Saudi Arabia and the United Arab Emirates, and one Cyrillic-language IDN ccTLD for the Russian Federation). At the time of writing, there are another 13 IDN ccTLDs on their way through this process, representing 10 different languages in total.

In order to give the Internet community time to prepare for the rollout of new IDN domains, ICANN has set up a number of IDN domains for testing purposes. Each of these test domains is written as “example.test” in its respective language, and content has been made available to view on each site.

Seeing as most of the initial IDN ccTLDs are likely to be in Arabic, I have used ICANN’s test Arabic domain (مثال.إختبار) for my research.

Before I start, I need to quickly explain what Punycode is, as it is used to support the addition of IDN domains to the existing Internet infrastructure. The problem with the current system is that the Domain Name System (DNS) only allows certain ASCII characters (letters, digits and hyphens) in hostnames, which means that it is not possible to simply add Unicode characters to it. Punycode was invented to get around this issue. Essentially, it is a method by which Unicode characters can be translated to (and from) the ASCII characters allowed within the DNS. When your browser requests a domain name containing Unicode characters, it converts it to the ASCII-formatted Punycode (each converted label is prefixed with “xn--”) before sending the request.
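As a rough illustration of that translation, here is a minimal Python sketch that converts ICANN’s Arabic test domain to Punycode and back using the standard library’s “idna” codec. Note that this codec implements the older IDNA 2003 rules, so a modern browser may treat some names differently. The helper names to_punycode and to_unicode are my own, not part of any library.

```python
def to_punycode(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII (Punycode) form, label by label."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert an ASCII (Punycode) hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


arabic_test_domain = "مثال.إختبار"  # ICANN's Arabic "example.test"
ascii_form = to_punycode(arabic_test_domain)

print(ascii_form)              # every converted label starts with "xn--"
print(to_unicode(ascii_form))  # back to the original Arabic form
```

Plain ASCII hostnames pass through unchanged, which is why existing domains are unaffected by the scheme.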

For this experiment, I have looked at the way in which the search engines handle both the Unicode form of the Arabic domain (http://مثال.إختبار/) as well as the corresponding Punycode format (which, in this case, is http://xn--mgbh0fb.xn--kgbechtv/). Note that, because Arabic is an RTL (right-to-left) language, pages on this site will have the URL path to the left of the hostname, rather than to the right.
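To make the relationship between the two forms concrete, the sketch below rewrites only the hostname of a URL into Punycode and leaves the rest alone. It is a simplified illustration: the function name url_to_punycode is hypothetical, and it assumes a URL with no username, password or explicit port.

```python
from urllib.parse import urlsplit, urlunsplit


def url_to_punycode(url: str) -> str:
    """Return the URL with its hostname converted to Punycode.

    Assumes a URL without userinfo or an explicit port; the path and
    query string are left untouched.
    """
    parts = urlsplit(url)
    ascii_host = parts.hostname.encode("idna").decode("ascii")
    return urlunsplit(parts._replace(netloc=ascii_host))


print(url_to_punycode("http://مثال.إختبار/"))
```

Only the hostname is subject to Punycode; any non-ASCII characters in the path are handled separately (usually by percent-encoding).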

One last note before we look at the results – the test page does not feature a meta description tag, so any snippet text is likely to come from text within the page itself.

Here are the results.

Google

Searching Google for the Unicode variant of the URL returns the homepage of the domain as the first result, with an additional nested result for a second, internal page on the domain:

  Google Unicode

Initially, everything seems to be in place here. The title tags, snippets and URLs are correctly displayed in Arabic, and Google has highlighted the search text in bold as usual. Additionally, the “Similar” pages link works, and the “jump to” successfully takes you to an anchor within the page. Lastly, the URL path is written in the correct RTL form for the second result.

However, not all is well. The first URL that Google is listing, the homepage, is actually a 301 redirect to an internal page. Google should be indexing the destination page, not the redirecting homepage.
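For anyone who wants to check this kind of redirect themselves, here is a hedged Python sketch that requests a URL without following redirects and then walks the chain one hop at a time. The function names (fetch_status, resolve_destination) and the five-hop limit are my own choices, and the network call is only one possible way of doing this.

```python
import http.client
from urllib.parse import quote, urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_status(url):
    """Send a HEAD request and return (status, Location) without following redirects."""
    parts = urlsplit(url)
    # The DNS lookup needs the Punycode form of the hostname.
    host = parts.hostname.encode("idna").decode("ascii")
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(host, parts.port, timeout=10)
    try:
        # Percent-encode any non-ASCII characters in the path and query.
        target = quote(parts.path or "/", safe="/%")
        if parts.query:
            target += "?" + quote(parts.query, safe="=&%")
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def resolve_destination(url, fetch=fetch_status, max_hops=5):
    """Follow redirects hop by hop; return (final_url, [(status, redirecting_url), ...])."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, hops
        hops.append((status, url))
        url = urljoin(url, location)  # Location may be relative
    raise RuntimeError("too many redirects")
```

A 301 in the list of hops means the first URL is only a signpost, and the final URL is the one you would expect a search engine to index.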

There are several other issues too. Firstly, the cached copy link did not work:

Google cached copy

I tried a number of pages on the site and Google’s cached copy did not work for any of them, so Google may have an issue with this feature at present.

Additionally, the “Translate this page” links for both results do not function correctly, and an error message is shown:

Google Translate error

Side note – the “See original page” link does correctly point to the Arabic domain name.

Next I tried searching Google for the Punycode form of the URL:

Google Punycode

Google has returned the same two URLs, which is a good sign of consistency. The title tags are the same, and the URL is still written in Arabic and not displayed in the Punycode form.

This time around, Google has picked out some text on the page which matches the Punycode search term. Although this particular snippet is rather less attractive than the ones from the previous query, matching the exact text on a page is probably the best approach. However, it would also make sense for Google to at least highlight the Unicode version (for example, in the URL), which it currently does not do.

Again, while the “Similar” pages link works, the “Cached” and “Translate this page” links are broken. This seems to be an issue that Google needs to fix.

Yahoo!

Searching Yahoo! for the Unicode or Punycode version of the URL does not return any results from the domain:

Yahoo! Arabic domain fail

Similarly, entering the URLs within Yahoo! Site Explorer simply redirects back to the main Yahoo! search results. Performing “site:” searches (for either variant) also fails (looking at the HTTP headers, you can see that Yahoo! actually redirects the query to Site Explorer, which then redirects you back to the standard web search results).

I tried a few additional ICANN test IDN domains in other languages and none of them worked. Yahoo! seems to fail completely at handling IDNs.

Given that Yahoo! is likely to use Bing’s search in the future, let’s see how Bing performs next.

Bing

Searching Bing for the Unicode version of the URL does return a page from the site, although it’s at position 8, which is not ideal (when searching for a URL you would usually want the URL to appear at or near the top of the search results). The snippet appears as follows:

Bing Unicode

Only one URL is shown, which isn’t quite as useful as Google’s result, but is still adequate. The title tag, snippet and URL are all correctly shown in Arabic, which is good. The “Translate this page” and “Cached page” links both work, whereas they did not on Google.

Bing does have some issues, however. Although Bing has indexed the destination URL (the link goes directly to the destination URL), for some reason Bing only displays the URL of the homepage in the snippet. Additionally, although Bing has highlighted the domain in bold in the snippet, it has not highlighted it within the URL.

These problems with Bing’s handling of the domain are fairly minor, however, and definitely less important than the issues that Google has with this site.

Searching Bing for the Punycode version of the URL, Bing returns the URL at position 2 instead, which is a bit better:

Bing Punycode

Again, like Google, Bing has picked out the text from the page which matches the query for the snippet but has not highlighted the Arabic equivalent in the snippet. Otherwise, this result is much the same as the Unicode search variant.

Ask Jeeves

I have also looked at Ask Jeeves (known as just “Ask” in the US).

Searching Ask Jeeves for the Unicode version of the URL returns the site at position one. Like Google, it includes a second indented URL at position two. Interestingly, these are the same two URLs that Google returned for this search (it is worth remembering that Ask Jeeves might be using Google’s results at times).

Ask Jeeves Unicode

Ask Jeeves is correctly displaying both the title and the snippet in Arabic, but the URL is written in the Punycode form instead, which is clearly far from ideal.

There is another major issue with Ask Jeeves’ implementation – the second URL goes through a redirect, but the hostname given by the redirect has been encoded in a way which makes Firefox and Internet Explorer fail to load the page (Google Chrome and Opera did successfully load the page from the redirect). Note: This does not always happen – reloading the page sometimes returns the URL without the redirect, and in this case it works correctly.

Searching Ask Jeeves for the Punycode version of the URL results in much the same as we have seen earlier. Again, the snippet includes the text from the page which matches the query. Ask Jeeves includes a small screenshot of the page too:

Ask Jeeves Punycode

Ask Jeeves’ binoculars feature, which displays a small thumbnail screenshot of the site, does appear to work correctly. However, it is possible that there are issues here as well.

Ask Jeeves Binoculars

Although it’s difficult to make out due to the small size of the thumbnail, it appears that the English text renders correctly but the Arabic text (although correctly displayed in an RTL fashion) looks like it might be showing a nonsense placeholder character, in the same way that web browsers which do not render Unicode characters do. That said, it is difficult to determine for sure from the small thumbnail that Ask Jeeves provides.

Conclusion

In conclusion, Google, Bing and Ask Jeeves do support IDNs to varying degrees. If I had to proclaim a winner at the moment, I would say that Bing had a slight lead, but all of these search engines had some issues. Hopefully these issues will be ironed out by the time that IDNs eventually roll out en masse.

Yahoo! appears to completely fail to support IDNs at present. Once it switches to Bing’s search engine, however, we assume that it will inherit all of Bing’s IDN support as well.


Are we losing two of the top four search engines?

Bing and Yahoo! have agreed a deal which will essentially kill off the Yahoo! search engine and merge its technology with Bing’s. At the same time, we have discovered that Ask Jeeves has been serving Google-crawled pages. Are we going to lose half of the top four search engines?

The recently announced deal between Bing and Yahoo! will essentially kill off the Yahoo! search engine – the search results on Yahoo! will be served by Bing, and Microsoft gets Yahoo!’s search technology. This means that the number two and number three search engines will (pending regulatory approval of the deal) become a single search engine.

At the same time, we recently discovered that the number four search engine, Ask Jeeves, appears to be showing web pages which were provided in some way by Google.

So are the "big four" set to become just the "big two"? What would this do to the search marketplace?

Make no mistake about it – the consolidation of search engines is bad for site owners. Instead of having multiple search engines where you have a chance of ranking for your keywords, you will have only two. You’ll either get lots of traffic, or almost none, and swings in web traffic will become more severe.

Homogeneous ecosystems are not healthy environments in which to live. Unfortunately, to an extent, this is what we already have, as Google has such great dominance in the search world. In an ideal world, there would be no dominant search engine, just lots of smaller ones with market shares of no more than 20-30%.


Is Ask Jeeves scraping Google?

Ask Jeeves has pages in its index which could only have been spidered by Googlebot. What is going on?

I was experimenting with User-Agents the other day and came across UserAgent.org – a site which simply displays your web-browser’s User-Agent string. I thought it might be interesting to look at which User-Agents the various search engines had used when they last spidered the site. Little did I expect to find this!

As expected, Google, Yahoo! and Bing simply displayed their standard User-Agents. For example, here’s the result when searching in Bing:

Bing results for UserAgent.org

Side note: Bing is gradually shifting away from msnbot 1.1 and is moving to msnbot 2.0.

However, something rather unexpected happened when searching for that site in the number four search engine, Ask Jeeves:

Ask Jeeves results for UserAgent.org

Eh? That’s Google’s User-Agent, Googlebot! At first, I wondered if Ask Jeeves was simply pretending to be Googlebot sometimes (perhaps to get around websites which block their spider or to detect cloaking). However, when looking at a page which shows the IP address that the request came from, the mystery deepened further:

Ask Jeeves results for UserAgent.org IP address

This page was fetched from the IP address 66.249.68.19. I immediately recognised this as one of Googlebot’s IP addresses (Google owns the entire IP range 66.249.64.0 to 66.249.95.255, and it’s a common Googlebot crawl source). Sure enough, this IP address resolved to the following domain:

crawl-66-249-68-19.googlebot.com
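The two checks described here (is the address inside the quoted range, and does it reverse-resolve to a googlebot.com hostname) can be sketched in Python roughly as follows. The function names are hypothetical, the range is simply the one quoted in this post, and the forward-confirmation step is the usual precaution against faked reverse DNS records; the DNS part needs network access.

```python
import ipaddress
import socket

# The range quoted in this post, expressed as CIDR blocks.
GOOGLE_RANGE = list(ipaddress.summarize_address_range(
    ipaddress.ip_address("66.249.64.0"),
    ipaddress.ip_address("66.249.95.255"),
))


def in_quoted_google_range(ip: str) -> bool:
    """True if the address falls inside the range quoted in the post."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in GOOGLE_RANGE)


def looks_like_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the hostname, then forward-confirm it.

    Requires working DNS; returns False if any lookup fails.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname.endswith(".googlebot.com"):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False


print(GOOGLE_RANGE)
print(in_quoted_google_range("66.249.68.19"))  # the address seen in the Ask Jeeves result
print(in_quoted_google_range("66.235.124.6"))  # the Ask Jeeves crawler address seen later in this post
```

The range check works offline, whereas looks_like_googlebot depends on whatever the DNS returns at the time you run it.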

What does this mean? It means that this page must have been fetched by Google’s spiders, not those from Ask Jeeves. It’s not just this site either: there are many, many pages indexed by Ask Jeeves which were spidered from the same location.

Ask Jeeves results from multiple Googlebot IP addresses

It gets even more peculiar – if you look at the cached copy of UserAgent.org, Ask Jeeves instead displays it as having the Ask Jeeves/Teoma spider, with the following User-Agent:

Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)
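If you want to sort spider visits by User-Agent yourself, a simple substring match is usually enough for a first pass. The sketch below is illustrative only: the token list is my own and far from exhaustive, and since a User-Agent string can be faked by anyone, it should be combined with an IP address check before drawing conclusions.

```python
# Substrings that identify each crawler; illustrative, not exhaustive.
CRAWLER_TOKENS = {
    "Googlebot": "Google",
    "Ask Jeeves/Teoma": "Ask Jeeves",
    "msnbot": "Bing",
    "Slurp": "Yahoo!",
}


def crawler_owner(user_agent: str) -> str:
    """Return the search engine a User-Agent string claims to belong to."""
    lowered = user_agent.lower()
    for token, owner in CRAWLER_TOKENS.items():
        if token.lower() in lowered:
            return owner
    return "unknown"


ask_ua = ("Mozilla/5.0 (compatible; Ask Jeeves/Teoma; "
          "+http://about.ask.com/en/docs/about/webmasters.shtml)")
google_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(crawler_owner(ask_ua))
print(crawler_owner(google_ua))
```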

Also, sometimes you do indeed get Ask Jeeves results – for example, here’s exactly the same web page we saw earlier, after refreshing the search results page a few times:

Ask Jeeves Teoma spider

The IP address 66.235.124.6 resolves to the following Ask Jeeves crawler hostname:

crawler5006.ask.com

In other words, sometimes Ask Jeeves is displaying a page fetched by Googlebot, and sometimes it is displaying the page fetched by its own spider. Typically, the first time a particular request is made, you get the Googlebot-fetched page, and after that Ask Jeeves usually shows the copy it fetched itself.

So why is Ask Jeeves including Google-sourced pages? Well, aside from the somewhat crazy idea that they might actually be scraping Google’s cached pages, which I think we can dismiss, this means that Ask Jeeves and Google have some kind of agreement whereby Google is assisting its diminutive competitor with spidering – and quite possibly more than that.

According to paidContent.org, the advertising deal between Ask Jeeves and Google includes a provision for Google to assist in providing algorithmic search results to Ask Jeeves, not just the better known advertising aspect of the deal.

If so, this discovery of Google-sourced search results could possibly be the first real proof that Ask Jeeves is throwing in the algorithmic towel and giving up on its own search engine.

Note: This is particularly interesting in light of the recently announced Microsoft-Yahoo! deal.

See our follow-up post: Are we losing two of the top four search engines?


Ask.com also supporting the new canonical tag

Number four search engine Ask has joined the other three major search engines in supporting the new “canonical tag” – this is great news for webmasters as they can now use this new search technology on all of the “Big Four” search engines.

Number four search engine Ask.com, formerly known as Ask Jeeves, have announced that they will be joining Google, Yahoo! and Microsoft in supporting the new canonical tag, a recent joint effort by the search engines to combat duplicate content. This is fantastic news for webmasters as it means that they can use the same technology on their website and it will work on all of the “Big Four” search engines.
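For readers who have not seen it yet, the canonical tag is a link element with rel="canonical" placed in the head of a page, pointing at the preferred URL for that content. The Python sketch below shows one in a sample page and pulls it out with the standard library’s HTML parser; the page, the example.com URL and the names CanonicalFinder and find_canonical are all made up for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonical(html: str):
    """Return the first canonical URL declared in the page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals[0] if finder.canonicals else None


page = """
<html><head>
<title>Blue widgets</title>
<link rel="canonical" href="http://www.example.com/widgets/blue">
</head><body>...</body></html>
"""
print(find_canonical(page))
```

Running a check like this across the duplicate versions of a page is a quick way to confirm that they all declare the same preferred URL.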

Ask is the fourth biggest search engine, and the smallest of the “Big Four” (AOL is powered by Google so is not included as a search engine in its own right). Ask has a market share in the United States of between 3% and 4% according to web metrics companies Hitwise and comScore, and its UK market share is also around the 3% mark. Although this is fairly small it is still a reasonably significant number of users and they are not that far behind number three Microsoft in these markets.

Supporting shared standards is particularly important for a small player in the search engine marketplace as it essentially drives down the cost of doing business. Relatively few webmasters will implement solutions specific to small search engines, preferring to concentrate on the dominant market leader, Google. However, webmasters will be more than happy to implement solutions which also work across all search engines, as it allows them to support all of the smaller players in one fell swoop without having to create tailored solutions for each one.

Although Ask were a few days later than the other search engines with their announcement, this is much faster than many of their previous reactions to new Search standards. Ask.com added support for XML Sitemaps almost 2 years after Google and months after Yahoo! and Microsoft, and still haven’t announced support for the “nofollow” attribute for links (although Ask.com claim that their algorithm is less sensitive to link-based spam as it measures local popularity rather than using global popularity like PageRank).

Hopefully this marks a change in pace for Ask.com’s support of new standards in Search.

UPDATE – According to Microsoft Live Search, Ask.com were included in discussing the creation of this tag.


Targeting individual search engines; a technique from the past?

There was a time when SEOs were recommending different landing pages for different engines.

Thankfully this has gone the way of meta keyword stuffing, but is there an argument for still focusing campaigns on individual engines or should companies be adopting a more holistic search offering?

Happy Valentine’s day!

In the course of this post I am going to seem to switch between both sides of the same argument several times. Please bear with me.

In the UK, Google’s dominance is much greater than it is across the pond. For this reason, there is much more of an argument for companies, particularly those with wider verticals, to place all their emphasis on Google. Because of this, many companies adopt a ‘we only care about Google’ stance. To be fair, it is hard to argue with that.

With Yahoo! and Live there is often less competition, they have less sophisticated algorithms and they are more predictable than Google, so the cost of a position in these engines can be considerably less than a Google position.

The problem here is ROI – whilst it is relatively easier to rank well in the non-Google engines, the corresponding traffic can be so minimal (depending on location and market) that the cost per conversion is many times higher than it is with Google.

That cost, when applied only to direct traffic from the appropriate engine, is normally prohibitive, but there are different personas using each engine. I am not simply talking about whether a Yahoo! user is more likely to buy your product or service than a Google user either.

Invoking Manley’s search referrer stereotyping rule of thumb, Yahoo! users tend to be more socially orientated than Google users and are thus more likely to provide you with links than their Google counterparts, so ranking well in Yahoo! is likely to help your position in Google’s SERPs as well. Who knows, a good position in Ask could even get you a .edu link!

At the end of the day though, the majority of best practice and link-building strategies which an ethical professional would use for Google are going to work for the other engines as well. Deliberately excluding Slurp aside, there is very little to optimising for Yahoo! which would differ from optimising for Google.

If your site is performing badly in any engine, it is a sign that there is a problem with the site development. It might be that this is a known issue or that it is a site architecture problem, and you may decide that the ROI is not enough to justify the work, but monitoring and reporting across the engines is important. Optimising for Google should still cover all of those bases.

So, in summary, for a generic business model:

  1. Monitor performance across Google, Live, Yahoo! and possibly Ask.
  2. Pay close attention to any vertical engines in your sector.
  3. Optimise your pages for search, rather than for a specific engine.
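As a starting point for the monitoring in step 1, the sketch below counts visits per engine from a list of referrer URLs, such as you might pull out of your server logs. It is deliberately crude: the label-to-engine mapping and the function names are my own assumptions, and anything on a matching domain (webmail, for instance) is counted as search traffic.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hostname labels that hint at each engine; an assumption, not an official list.
ENGINE_LABELS = {
    "google": "Google",
    "live": "Live",
    "yahoo": "Yahoo!",
    "ask": "Ask",
}


def engine_for_referrer(referrer: str) -> str:
    """Guess which search engine a referrer URL belongs to."""
    host = (urlsplit(referrer).hostname or "").lower()
    for label in host.split("."):
        if label in ENGINE_LABELS:
            return ENGINE_LABELS[label]
    return "other"


def visits_by_engine(referrers):
    """Count visits per engine for an iterable of referrer URLs."""
    return Counter(engine_for_referrer(r) for r in referrers)


sample_referrers = [
    "http://www.google.co.uk/search?q=widgets",
    "http://uk.search.yahoo.com/search?p=widgets",
    "http://search.live.com/results.aspx?q=widgets",
    "http://uk.ask.com/web?q=widgets",
    "http://www.example.com/links.html",
]
print(visits_by_engine(sample_referrers))
```

Comparing these counts month on month is an easy way to spot an engine where the site is under-performing.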

Now I need to go and make my wife a card.


Manley’s search referrer stereotyping rule of thumb:

  • Google => Professional, technically savvy users
  • Yahoo! => Web 2.0, social users
  • MSN/Live => Lower tech loyal users
  • Ask => Academic

