Last month, Inktomi rolled out the latest version of the search engine it provides to partners such as MSN. But is Inktomi’s “Web Search 9” better than previous incarnations? Indeed, is it better than its major crawler-based competitors, Google, AllTheWeb, Teoma and AltaVista?
“With the release of Web Search 9, Inktomi meets or exceeds all other search engines in the key metrics of search performance: relevance, freshness and index size,” says Inktomi, on a summary page about the changes at the company’s web site.
To back up its claims, Inktomi cited figures for two of the key metrics. For freshness, Inktomi claims to revisit all documents in its index at least once every two weeks. For index size, it claims to index 3 billion web documents, a new record (and one that Google also began claiming last month). But what about relevance? Here, Inktomi had no figures to offer.
Inktomi’s not alone in this. AltaVista also had a major relaunch last month. A press release describing AltaVista’s new features and under-the-hood changes also notes that AltaVista provides “a new index that delivers unique, highly relevant results.” Any figures to back this up? Nope.
Where are the relevancy figures? While relevancy is the most important “feature” a search engine can offer, there sadly remains no widely-accepted measure of how relevant the different search engines are. As we shall see, turning relevancy into an easily digested figure is a huge challenge, but it’s a challenge the search engine industry needs to overcome, for its own good and that of consumers.
There are many different ways that search engines can be tested, some good, some bad and often some that make sense depending on a particular situation. Let’s examine a few.
“Anecdotal Evidence” is simply when someone reports a general impression they have with a particular search engine or the results of an isolated search. For example, many of my readers and others I encounter continue to give Google praise for the quality of its results. Anecdotally, Google is great. However, that doesn’t tell us conclusively that it is the best. Similarly, when some people complained in October that Google’s relevancy had “dropped,” they gave anecdotal evidence of this, while Google defenders used anecdotal evidence to dispute the claim.
“Mega Search” is a style of test I usually find particularly bad. This is when someone performs a search, then looks at how many matches each search engine has found for the query. The engine with the most matches “wins,” even though no quality has been determined.
“Ego Search” is another style that can be bad, and one that I still see journalists and others often perform. In an ego search, you look for your name. If you fail to come up tops for it, you conclude the search engine’s relevancy is poor.
In some cases, perhaps this is true. If I search for “bill gates,” it’s reasonable to expect to find the official web site for Microsoft Chairman Bill Gates. But what if you aren’t as well known as Bill Gates or have a popular name? What if you’ve built a web site in some free hosting service that is shared by spammers? These might be issues that push you down, and for good reason. Moreover, are you going to condemn an entire search engine as bad, based on one search? Well, I’ve seen it happen.
Overture performs one type of internally-run relevancy testing that I’d describe as “Binary Search.” In that, users are shown listings for a query and asked if they’d be happy with the result, if they’d consider it somehow relevant. The answer is either yes or no, a binary choice. Go through a list of 100 results, and if 95 of them are considered OK, then you could claim 95 percent relevancy.
That sounds great, but there’s no nuance involved, no sense of whether things “more relevant” to the search topic are missing. It’s equivalent to asking people to eat different types of cakes and say whether each cake is simply edible. Edible is fine in some instances, but what you really want to know is who serves the best cakes consistently.
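To make the cake problem concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the two engines, their ratings, the 0-to-3 grading scale and the function names are invented, and this is not how Overture actually scores its tests.

```python
def binary_relevancy(judgments):
    """Share of results that testers judged acceptable (True)."""
    return sum(judgments) / len(judgments)


def mean_grade(grades):
    """Average of graded ratings on an assumed scale: 0 = useless ... 3 = excellent."""
    return sum(grades) / len(grades)


# Hypothetical ratings for 100 results from two invented engines.
# Each engine has exactly 5 results rated 0 (unacceptable).
engine_a_grades = [3] * 60 + [2] * 35 + [0] * 5
engine_b_grades = [2] * 10 + [1] * 85 + [0] * 5

for name, grades in [("A", engine_a_grades), ("B", engine_b_grades)]:
    # A yes/no tester only records whether a result is acceptable at all.
    acceptable = [grade > 0 for grade in grades]
    print(name, binary_relevancy(acceptable), mean_grade(grades))
```

Under these made-up numbers, both engines could claim 95 percent relevancy on the yes/no measure, yet the graded averages (2.5 versus 1.05) are far apart. That gap is exactly what a binary test cannot see.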
Overture is not alone in running internal relevancy tests. AltaVista runs one where its listings are compared to those from competitors, though branding and formatting are stripped away, so that users don’t know which search engines are involved. The users rate which set of results is better or whether the sets are equal to each other.
There’s still some sense of the “binary” in this type of testing. Is the user an expert in the subject they are searching on? If not, they might be unaware of important sites that are missing in both sets of results. That means both sets might be deemed relevant, even though someone expert in the subject might consider the relevancy to be poor.
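A blind side-by-side comparison of this kind might be organized roughly as follows. This is only a sketch under my own assumptions: the slot labels “X” and “Y,” the function names and the sample data are all hypothetical, and nothing here reflects AltaVista’s actual procedure.

```python
import random
from collections import Counter


def blind_pair(results_by_engine, rng):
    """Assign two engines' result sets to anonymous slots 'X' and 'Y' at random."""
    engines = list(results_by_engine)
    rng.shuffle(engines)
    key = dict(zip(("X", "Y"), engines))  # kept hidden from the tester
    shown = {slot: results_by_engine[engine] for slot, engine in key.items()}
    return shown, key


def tally(votes, keys):
    """Translate anonymous votes ('X', 'Y' or 'equal') back into engine names."""
    counts = Counter()
    for vote, key in zip(votes, keys):
        counts[key.get(vote, "equal")] += 1
    return counts


# Hypothetical usage: one query, two invented engines.
rng = random.Random(42)
shown, key = blind_pair(
    {"Engine One": ["page a", "page b"], "Engine Two": ["page c", "page d"]}, rng
)
print(shown)                 # what the tester sees, with no brand names
print(tally(["X"], [key]))   # the vote mapped back to the real engine
```

Note that the tally only records which set a tester preferred. If both sets miss the sites an expert would expect, the vote still comes back as a win or a tie.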
Indeed, the subjectivity of searchers is one of the biggest challenges in relevancy testing. Two people could search for “dvd players,” one person looking to buy a player and the other wanting to learn more about them. The first person might be pleased to get listings dominated by commercial sites selling DVD players, while the second person might prefer more editorial-style listings of reviews and explanatory pages. The mindsets of what’s relevant are different, and it can be important to test relevancy for both intentions, commercial and non-commercial.
In what I call “goal oriented testing,” subjectivity is brought a little more under control. In a goal test, you perform a query where you know a particular page really ought to appear, according to most people you might ask. The “Company Name Tests” I used to perform are like this, where a search for “microsoft” really ought to bring up a link to the Microsoft web site, for example. Few would dispute that.
Unfortunately, the problem with using company names as goals is that they really only test the “navigational” aspects of a search engine. Most people searching for a company by name probably want to reach that company, to navigate to its web site. That’s an important role for a search engine to fulfill, but it’s only one type of relevancy.
The “Perfect Page Test” that Search Engine Watch performed recently was a different style of goal-oriented searching, where we came up with a list of web sites that we felt many people knowledgeable about different topics would agree should be present for certain queries. Of course, the problem with that test—as we pointed out with our article about it—is that it doesn’t measure whether other sites listed are also good. A search engine failing to bring up the target page for a particular query might get a bad score yet still have nine other highly relevant findings.
The limitations of this test are one reason why we didn’t trumpet the winners of the test in the headline of the article. Indeed, we buried the scores under 12 paragraphs of explanation. And the fact that Google, Yahoo and MSN Search got an A on that test doesn’t mean that all of their results are A quality, any more than AltaVista getting a D on the test means that all of its results are D quality. It only means that in this very limited, particular and narrow test, that’s what was found. In other tests, the leaders could be failing while formerly poor performers might be succeeding.
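For readers who want to see how narrow a goal-oriented score is, here is a small Python sketch of the idea. The function name, the top-10 cutoff, the example.com URLs and the sample results are all assumptions made up for illustration. It is not the scoring method Search Engine Watch used for the Perfect Page Test.

```python
def target_hit_rate(results_by_query, target_by_query, top_n=10):
    """Fraction of queries whose agreed target URL appears in the top_n results."""
    hits = 0
    for query, target in target_by_query.items():
        if target in results_by_query.get(query, [])[:top_n]:
            hits += 1
    return hits / len(target_by_query)


# Hypothetical target pages and results for one invented engine.
targets = {
    "microsoft": "http://www.microsoft.com/",
    "dvd players": "http://example.com/dvd-guide",
}
results = {
    "microsoft": ["http://www.microsoft.com/", "http://example.com/ms-news"],
    "dvd players": ["http://example.com/dvd-shop", "http://example.com/dvd-reviews"],
}

print(target_hit_rate(results, targets))
```

With this sample data, the engine finds one target out of two. The score says nothing about whether the other pages returned for “dvd players” were useful, which is the limitation described above.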
Ideally, what you want is a battery of tests, with tens, hundreds or thousands of queries run and examined, tested in different aspects. How did this query perform for someone searching in “product mode?” How does this search engine handle a navigational query? What does an ego search bring up for some prominent people? And so on.
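One way to picture such a battery is to tag every test query with the “mode” it represents and report a separate score for each mode, rather than a single overall number. The sketch below assumes scores between 0 and 1 and uses mode names I’ve invented. It is an illustration, not a proposed standard.

```python
from collections import defaultdict


def scores_by_mode(test_results):
    """Average score per query mode from (mode, score) pairs."""
    grouped = defaultdict(list)
    for mode, score in test_results:
        grouped[mode].append(score)
    return {mode: sum(scores) / len(scores) for mode, scores in grouped.items()}


# Hypothetical results from a tiny battery of four test queries.
battery = [
    ("navigational", 1.0),
    ("navigational", 0.0),
    ("product", 0.5),
    ("ego", 1.0),
]

for mode, average in scores_by_mode(battery).items():
    print(mode, average)
```

Keeping the modes separate means a search engine that shines on navigational queries but stumbles on product searches shows up as exactly that, instead of as a middling average.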
Search Engine Watch will take a stab at this by running small scale tests over time. However, the real solution is for the search engines themselves to come together, agree on some testing standards, contract to have these performed on a regular basis and, crucially, agree to publish the findings no matter what.
Search Engines Need To Come Together
Relevancy testing isn’t a job I’m stumping for. I’d be happy to help offer suggestions about how relevancy testing might be performed, and I’m sure other search engine commentators would be pleased to be involved, as well. But the job is really something a testing organization should take on.
There is precedent for this. The former eTesting Labs (now VeriTest, and the link wasn’t working when I tried to visit today), for example, was contracted in the past by different search engines to run a battery of tests to measure relevancy. That company or others could be contracted by the search engine industry to conduct perhaps quarterly testing, according to criteria that the industry agrees on.
Public release of this data is also important. Some companies that contracted with eTesting Labs in the past refused to let the tests be made public, if they did poorly. Similarly, the NPD Group used to do consumer surveys, where the search engines’ own users rated their relevancy. Those search engines that did well often released their figures, while those that did poorly kept quiet.
While it may be tempting to sit on bad news, if the search engines want us to take seriously their claims of relevancy, then they have to agree to release both the good and the bad. If a search engine does poorly, then that poor performance should be an excuse to work harder.
Spare Us The Noisy Stats
Why is getting a relevancy figure so important to consumers? First and foremost, it would bring a greater awareness that there are real choices in search.
I love Google’s relevancy and cannot sing the praises of the company’s work to improve the standards of search enough. They’ve been a driving force over the past few years in raising the quality of search, and they deserve all the success that their hard work has brought them. However, Google also has some very good competitors now (such as these major search engines). Some of these competitors may even—gasp—have a search algorithm that works better for some users, or search assistance features that Google lacks or even an interface that some users might like better.
Some search consumers may never bother trying some of these other search engines because they’ve been told or convinced that Google is the best. Regularly published relevancy figures would help with this. If relevancy testing finds that Google and its competitors all offer roughly the same degree of relevancy, then users might be willing to experiment more with others. It also means that they may pay more attention to choosing search engines for certain features, such as search term refinement options, the ability to see more results at once or spell checking.
A relevancy figure would also free us from search engines playing the “size card” or the “freshness card” to quantify themselves as better than the competition. Yes, having a large index is generally good. Yes, having a fresh index is desirable. However, neither of these stats indicates how relevant a search engine is. Nevertheless, the search engines keep pushing them at us, and in particular at journalists, in an effort to trump their competitors.
It’s understandable. Journalists and others desperately want to quantify which search engines are the “best” in some way. Size and freshness figures are an easy way for search engines to trot out numbers that can be turned into pretty bar charts. However, we’ve literally had years of this game being played, and it’s tiring. It also detracts from that most important factor, relevancy.
Do It, Or Have It Done To You
Ultimately, if the search engines fail to come up with an accepted means of measuring relevancy, they are going to continue to be measured by one-off “ego searchers” or rated anecdotally. Check out what a college student was recently quoted as saying, in a campus newspaper article about searching:
“Don’t waste your time with any of the new [search engines]. Ask Jeeves is useless and AOL is a piece of junk,” the student said. “Just stick to Yahoo and Google.”
Actually, Ask Jeeves is far from useless, given that it makes much more use of the high-quality search results from its Teoma search engine. As for AOL, since it is 100 percent powered by Google, calling it “a piece of junk” while simultaneously praising Google makes no sense. Meanwhile, Yahoo is now, in the vast majority of cases, providing the same results as Google, so “sticking” to just Yahoo and Google is essentially sticking to Google. So much for diversity in search.
I’ve simplified things a bit. Though Yahoo and AOL are powered by Google, they are not exactly the same as Google. Yahoo does have some additional features, while AOL lacks some of what Google offers—but for AOL users, it also offers some things Google doesn’t.
Nevertheless, this student is no doubt remembering some bad experiences at Ask Jeeves and AOL before their upgrades and unaware of recent changes at Yahoo. His anecdote is based on old data, but as there is no relevancy number to counter his claims, his anecdote gets the last word, rather than some more authoritative relevancy measures.