Wednesday, June 27, 2007

The Relevancy Quest: The Point

Here's a quick follow-up to the last post, which got long and perhaps failed to net-out the point as clearly as it might have. Here's the point:
  • Search engines seem to assume that the question is improving relevancy based on a few keyword grunts.
  • They use various degrees of magic to try to improve relevancy: dynamic clustering, taxonomy, recent query history, social tagging/editing, entity extraction, PageRank, SemanticRank, SomethingRank, etc.
  • The point of all this magic is to guess exactly what you want.
Here's the question: why guess when you can know? When you send a SQL query to Oracle, it's not guessing what you want (e.g., show me average sales by product line for 2Q). It knows what you want and there is a single correct answer to your question.
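To make the "single correct answer" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Oracle. The sales table, its columns, and the sample rows are all invented for illustration.

```python
import sqlite3

# Hypothetical schema and data, made up purely for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_line TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("widgets", "2Q", 100.0),
        ("widgets", "2Q", 300.0),
        ("gadgets", "2Q", 50.0),
        ("gadgets", "1Q", 999.0),  # different quarter, filtered out below
    ],
)


def average_sales_by_product_line(conn, quarter):
    """Return (product_line, average amount) rows for one quarter."""
    rows = conn.execute(
        "SELECT product_line, AVG(amount) FROM sales "
        "WHERE quarter = ? GROUP BY product_line ORDER BY product_line",
        (quarter,),
    )
    return rows.fetchall()


result = average_sales_by_product_line(conn, "2Q")
print(result)
```

There is no ranking step and no relevancy score here: the same data and the same query always produce the same rows.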

So why do you have to guess when it comes to content? You don't have to. With XQuery and XML content servers, like MarkLogic, you can express powerful, complex database-style queries that get exactly what you want from a contentbase.
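As a rough illustration of querying content by its structure rather than by keywords, here is a small sketch. It is not XQuery and not MarkLogic: it uses the limited XPath subset in Python's standard xml.etree.ElementTree, and the library/article/author markup is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical content; element and attribute names are assumptions.
DOC = """
<library>
  <article year="2007"><title>Query vs. Search</title><author>Smith</author></article>
  <article year="2006"><title>Old News</title><author>Smith</author></article>
  <article year="2007"><title>Other Topic</title><author>Jones</author></article>
</library>
"""


def titles_by_author_and_year(xml_text, author, year):
    """Return titles of articles matching both the author and the year."""
    root = ET.fromstring(xml_text)
    # Two chained predicates: one on an attribute, one on a child's text.
    path = f".//article[@year='{year}'][author='{author}']/title"
    return [title.text for title in root.findall(path)]


titles = titles_by_author_and_year(DOC, "Smith", "2007")
print(titles)
```

A keyword search for "Smith 2007" could only guess which of these articles you meant. The structured query states the condition exactly, so only the matching article comes back.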

Tuesday, June 26, 2007

The Relevancy Quest

In the classic book, The Innovator's Dilemma, Clayton Christensen concludes that a key reason leading companies fail is that they spend too much energy working on sustaining innovations that continuously improve their products for their existing customers. Seemingly paradoxically, he points out that these sustaining innovations can involve very advanced and very expensive technology. That is, it's not the nature of the technology used (e.g., advanced or simple) that causes innovation to be sustaining or disruptive -- it's who the technology is designed to serve and in what uses.

I think search vendors need to dust off their copies of The Innovator's Dilemma. Why? Because, for the most part, they seem wedged in the following paradigm, which I'd call the relevancy quest:
  • Search is about grunting a few keywords
  • The answer is a list of links
  • The quest is then magically inducing the most relevant links given a few grunts
And it's not a bad paradigm. Heck, it made Google worth $140B and bought Larry and Sergey a nice 767. But can we do better?

Some folks, like the much-hyped Powerset, think so. They're challenging the grunting part of the equation, arguing that "keyword-ese" is the problem and the solution is natural language. They seem unfazed both by Ask Jeeves' failure to dominate search and by the more than 20 years of failed attempts to provide natural language interfaces to database data, used for business intelligence (BI). As I often say, if natural language were the key to BI user interfaces, then Business Objects would have been purchased by Microsoft years ago for a pittance and Natural Language Inc.'s DataTalker would rule BI. (Instead of the other way around.)

But I respect Powerset because at least they're challenging the paradigm and taking a different approach to the problem. And, while I sure don't understand the cost model, I also respect guys like ChaCha because they're challenging the paradigm, too. In ChaCha's case, they're delivering human-powered search where you can literally chat with a live guide who helps you refine your search.

I can also respect the social search guys, including the recently launched Mahalo, because they're challenging the paradigm as well -- using Wisdom of Crowds / Web 2.0 / Wikipedia style collaboration to create "hand-written results pages" for topics, such as the always searchable "Paris Hilton."

The folks I have trouble understanding are those on the algorithmic relevancy quest, companies like Hakia, a semantic search vendor (interviewed here by Read/Write Web) whose schtick is meaning-based search, and which comes complete with a PageRank (tm) rip-off-name algorithm called SemanticRank (tm). Or Ask, which recently launched a $100M advertising campaign about "the algorithm". These people remind me of the disk drive manufacturers who invested millions in very advanced technologies for improved 8" disk drives (to serve their existing customers) all the while missing the market for 5.25" disk drives required by different customers (i.e., PC manufacturers).

Are the Hakias of the world answering the right question? Should we be grunting keywords into search boxes and relying on SomethingRank (tm) to do the best job of determining relevancy? Is the search battle of the future really about "my rank's better than your rank" or, equivalently, "my PhD's smarter than your PhD"? Aren't these guys fighting the last war?

As usual, I think there are separate answers for Internet and enterprise search.

On the Internet side, sure I think search engines can certainly use more "magic" to improve search relevancy. For example, they can use recent queries and a user profile to impute intent. They can use dynamic clustering and iterative query refinement (e.g., faceted navigation) to help users incrementally improve the precision of their queries.
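For readers who haven't seen faceted navigation up close, here is a toy sketch of the idea: run a keyword match, show counts per facet value, and let the user narrow the results one facet at a time. The documents, field names, and matching rule are all invented for illustration and bear no relation to how any real engine works.

```python
from collections import Counter

# Hypothetical documents with two facet fields: "type" and "year".
DOCS = [
    {"title": "Paris Hilton hotel deals", "type": "travel", "year": 2007},
    {"title": "Paris Hilton sentenced", "type": "news", "year": 2007},
    {"title": "Hilton Paris review", "type": "travel", "year": 2006},
]


def search(docs, keywords, **filters):
    """Keep docs whose title has every keyword and matches every facet filter."""
    words = [w.lower() for w in keywords.split()]
    hits = []
    for doc in docs:
        title = doc["title"].lower()
        matches_words = all(w in title for w in words)
        matches_facets = all(doc.get(k) == v for k, v in filters.items())
        if matches_words and matches_facets:
            hits.append(doc)
    return hits


def facet_counts(hits, field):
    """Count how many hits fall under each value of one facet field."""
    return Counter(doc[field] for doc in hits)


first_pass = search(DOCS, "paris hilton")
type_facets = facet_counts(first_pass, "type")
refined = search(DOCS, "paris hilton", type="travel", year=2007)
print(type_facets, [doc["title"] for doc in refined])
```

Each refinement is an exact filter layered on top of the fuzzy keyword match, which is why the precision improves with every click.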

More practically, I think vertical search and community sites are a great way of improving search results. The context of the site you're on provides a great clue to what you're looking for. Typing "Paris Hilton" into Expedia means you're probably looking for a hotel, whereas typing it into EOnLine means you're looking for information on the jailed debutante.

Of course, there are a host of Web 2.0 style techniques to improve search like diggs and wikis which can be put to work as well.

Increasingly, our publishing and media customers are going well beyond "improving search" and changing the paradigm to "content applications" -- systems that combine software and content to help specific users accomplish specific tasks. See Elsevier's PathConsult as a concrete example.

On the enterprise search side, I think the answer is different. As I've often mentioned, on the enterprise side you lack the rich link structure of the web, effectively lobotomizing PageRank and robbing Google of its once-special (and now increasingly gamed and hacked) sauce.
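A tiny sketch shows why missing links matter. This is a textbook-style power iteration for PageRank, not Google's production algorithm, and the two three-page graphs are made up. When pages link to each other the scores differ; when nothing links to anything, every page ends up with the same score and the ranking tells you nothing.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to; every target
    must also appear as a key. Pages without outgoing links spread their
    score evenly over all pages.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web-like graph: a and b both link to c, c links back to a.
web_rank = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})

# Hypothetical enterprise file share: documents with no links at all.
intranet_rank = pagerank({"a": [], "b": [], "c": []})
print(web_rank, intranet_rank)
```

In the first graph page c comes out on top because two pages point to it. In the second, all three documents tie at one third each.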

When I look for the answer of how to improve search in an enterprise context, I look back to BI, where we have decades of history to guide us about the quest to enable end-user access to corporate data.
  • Typing SQL (once seriously considered as the answer) failed. Too complex. While SQL itself was the great enabler of the BI industry, end users could never code it.
  • Creating reports in 4GL languages failed. Too complex.
  • Having other people create reports and deliver them to end users was a begrudging success. While this created a report treadmill/backlog for IT and buried end-users in too much information, it was probably the most widely used paradigm.
  • Natural language interfaces failed. Too hard to express what you really want. Too much precision required. Too much iteration required.
  • End users using graphical tools linked directly to the database schema failed. While these tools hid the complexities of SQL, they failed to hide the complexity of the database schema.
It was only when Business Objects invented a graphical, SQL-generating tool that hid all underlying database complexity and enabled users to compose an arbitrary query that the BI market took off. Simply put, there were two keys:

1. The ability to phrase an arbitrary query of arbitrary complexity (not a highly constrained search).

2. The ability to hide the complexity of the underlying database from the user.
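Here is a toy sketch of those two keys working together. It is in no way Business Objects' actual design: the business terms, table names, column names, and join are all assumptions invented for the example. The user picks business terms; the mapping layer turns them into SQL so the schema never shows.

```python
# Hypothetical mapping from business terms to schema expressions.
SEMANTIC_LAYER = {
    "Product Line": "p.line_name",
    "Quarter": "s.fiscal_quarter",
    "Average Sales": "AVG(s.amount)",
}
# Hypothetical join that the end user never has to know about.
FROM_CLAUSE = "sales s JOIN products p ON s.product_id = p.id"
# Terms that are aggregates and therefore never appear in GROUP BY.
AGGREGATES = {"Average Sales"}


def build_query(show, where=None):
    """Build (sql, params) from business terms to show and to filter on."""
    select = ", ".join(f'{SEMANTIC_LAYER[term]} AS "{term}"' for term in show)
    sql = f"SELECT {select} FROM {FROM_CLAUSE}"
    params = []
    if where:
        conditions = []
        for term, value in where.items():
            conditions.append(f"{SEMANTIC_LAYER[term]} = ?")
            params.append(value)
        sql += " WHERE " + " AND ".join(conditions)
    # Group by the plain columns only when an aggregate is also shown.
    group_by = [SEMANTIC_LAYER[term] for term in show if term not in AGGREGATES]
    if group_by and len(group_by) < len(show):
        sql += " GROUP BY " + ", ".join(group_by)
    return sql, params


sql, params = build_query(["Product Line", "Average Sales"], {"Quarter": "2Q"})
print(sql, params)
```

The user asked for "Average Sales" by "Product Line" for one "Quarter" and never saw a table name, a join, or a GROUP BY clause.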

While no one has yet built such a tool for an arbitrary XML contentbase (and while I think building one will be hard given the lack of requirement for a defined schema), MarkLogic customers use our product every day to build content applications that generate complex queries against large contentbases, and completely hide XQuery from the end-user.

Simply put, it's not about improving search. It's about delivering query. That's the game-changer.

Friday, June 22, 2007

Mark Logic Granted Key Patent

Mark Logic announced this week (see press release) that the company has been granted a fundamental patent related to XML indexing technology. The patent, entitled "Parent-Child Query Indexing for XML Databases," is US patent number 7,171,404 and was granted on 1/30/07.

I think this patent is similar to the patent that my last employer, Business Objects, had on the semantic layer, an abstraction layer that insulated end users from the complexities of relational databases and SQL, and enabled end users to compose their own queries.

So why do I think that these two, technically quite dissimilar, patents have something in common? Because both patented a fundamental invention that enabled a market. Simply put, I think both patents are big deals, dealing with inventions fundamental to software categories.

I should note that this is Mark Logic's second approved patent. (The first was on our unique XML classification system, US patent 7,127,469.)

Buy O'Reilly Books By The Chapter

Just as iTunes disaggregated the CD, allowing consumers to buy songs instead of CDs (dare I say "albums"), so has O'Reilly Media now disaggregated the book -- as of this week, consumers can now buy chapters instead of books.

Tim O'Reilly blogged on this earlier this week, here. The price for an O'Reilly chapter? $3.99. Here's a quick excerpt from Tim's post:

Note that each chapter has its own TOC (table of contents) and index, which is possible because of the infrastructure we built with SafariU. Each chapter is also bookmarked and searchable. What they are still lacking is a book cover, and the ability to click [the embedded] TOC or index entry and be taken to that part of the chapter, but those are coming.

Here's a screenshot:

Tuesday, June 19, 2007

Steve Mills Computerworld Interview

Sorry for the dearth of posts of late. Things have been quite busy (in a good way) at Mark Logic and as I say in the FAQ, I'm a CEO blogger (not a blogger CEO) so when things heat up, the posts may slow down.

Today's quick post highlights an interesting interview with Steve Mills who runs IBM's software business and who has been a key executive in IBM software for almost as long as I can remember. First note the sidebar that says:
  • Software contributes $20B to IBM's revenues
  • Software contributes 40% of IBM's profits
  • IBM's software group has acquired 44 companies since 2000
In the interview, Mills speaks about open source (calling it "inevitable" and "good for the industry"), XML (speaking about the "native" XML handling in DB2 version 9), and SaaS (calling salesforce.com "a rounding error").

It's a quick read and, given his power in the software industry, definitely worth reading.

Funny inconsistency: the sidebar says he joined IBM as a sales trainee. The story's last line says "I wrote assembler programs when I joined IBM." Could it be that assembler programming was part of IBM sales training in 1974? :-)

Monday, June 11, 2007

Macmillan CEO Swipes Google Laptops

Here's a great, pointed PR stunt. First, the CEO of book publisher Macmillan swipes two laptops from the Google Book Search (Beta) booth at BookExpo America last week.

Then he, himself, blogs about it, here. Talk about combining new- and old-world PR tactics.

Here are my favorite quotes from his blog:
"There was no sign saying 'please do not steal the computers.'"

"We were merely doing to Google what Google was doing to us."

Touché!

The Register covers the story here. Charkin's blog is here. Valleywag's take is here.