Saturday, March 31, 2007

What's a Column-Oriented DBMS?

One of my memes is the rise of special-purpose database management systems (DBMSs). While I’m obviously a big believer in MarkLogic and XML content servers, I believe that XML content servers are simply one example of a whole new class of DBMSs, each designed and optimized for a specific purpose.

As a technologist, I believe this is happening because it’s neither necessary nor desirable (nor, I might add, possible) to infinitely extend the now quarter-century-old RDBMS.
  • Not necessary because federation and web services make it increasingly easy to dedicate special tasks to special servers.
  • Not desirable because the relational database has certain design assumptions that, while quite useful for some applications, are wholly inappropriate for others.
Rather than riff again about XML content servers, I thought today I’d pick a different example of a special-purpose DBMS: the column-oriented database.

I’d recently heard that Michael Stonebraker had founded Vertica, a column-oriented DBMS company (complete with ten-cute-points slogan, "the tables have turned"). So I decided to try and figure out what a column-oriented DBMS is and why you might want one. Here’s my answer.



First, look at the above picture, which depicts a row-oriented and a column-oriented DBMS. A row-oriented system stores rows together. A column-oriented system stores columns together. So what?

For people who care about the information in just one column, storing that information in a table with other columns reduces information density. That is, the information you care about is striped across rows loaded with (potentially lots of) non-useful information. (Imagine this when the rows are quite long.) That means it’s inherently inefficient to pull a large number of values for one column.
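To make the density point concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the three-column table, its values, and the helper names are all hypothetical, and flat Python lists stand in for pages on disk.

```python
# Hypothetical three-column table; names and values are illustrative only.
rows = [
    (1, "widget", 9.99),
    (2, "gadget", 24.50),
    (3, "gizmo", 4.25),
]

# Row-oriented layout: each record's fields are stored next to each other.
row_store = [field for row in rows for field in row]

# Column-oriented layout: each column's values are stored next to each other.
column_store = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}


def prices_from_row_store(store, width=3, price_index=2):
    """Pull one column out of the row layout by stepping over the other fields."""
    return store[price_index::width]


def values_touched_row_store(store):
    """A full scan of the row layout passes over every stored field."""
    return len(store)


def values_touched_column_store(store, column):
    """A scan of one column passes over only that column's values."""
    return len(store[column])
```

Both layouts return the same prices, but in this toy model the row layout scans nine stored values to find three, while the column layout scans only the three it needs. The gap widens as rows get longer.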

But who cares only about the information in one column? Wait a minute. Think about the fact table in a data warehouse. Now it all makes sense. Data warehouse workloads are full of one-column queries (e.g., against the fact table in a star schema) and of joins between the fact table and the normalized dimension tables in a snowflake schema.

So a column-oriented database has a number of advantages here:
  • Higher information density means more efficiency in pulling answers off disk and/or in caching them in memory.
  • Storing the column-tables in row-id, value order (as opposed to value, row-id order) greatly accelerates the performance of sort-merge joins. Why? Because the column-tables you want to join are already sorted by row-id, eliminating the costly need to sort before the merge. They claim this can accelerate join performance 100x.
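The merge half of a sort-merge join can be sketched as follows. This is a simplified illustration, not any vendor's implementation: the function name and the sample column-tables are hypothetical, and row-ids are assumed to be unique within each column-table.

```python
def merge_join(left, right):
    """Join two lists of (row_id, value) pairs that are already sorted by row_id.

    Assumes row_ids are unique within each list.
    """
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_id, left_value = left[i]
        right_id, right_value = right[j]
        if left_id == right_id:
            # Matching row_ids: emit the combined row and advance both sides.
            joined.append((left_id, left_value, right_value))
            i += 1
            j += 1
        elif left_id < right_id:
            i += 1  # No match for this left row_id; move on.
        else:
            j += 1  # No match for this right row_id; move on.
    return joined


# Hypothetical column-tables stored in (row_id, value) order.
quantity = [(1, 10), (2, 3), (4, 7)]
price = [(1, 9.99), (3, 1.50), (4, 4.25)]

result = merge_join(quantity, price)
```

Because both inputs arrive sorted by row-id, the join is a single pass over each list with no sort step in front of it. That skipped sort is where the claimed savings would come from.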
So are you going to rip out Oracle from underneath your production SAP implementation and replace it with a column-oriented database? No. That’s not the point. Column-oriented databases, like other special-purpose DBMSs, are not general-purpose RDBMS replacements.

Special-purpose DBMSs are built to do one thing “really well,” meaning either (1) doing something practically impossible in, or (2) doing something 10 to 100 times faster than, an equivalent RDBMS implementation.

For more information on column-oriented DBMSs, check out the Wikipedia entry here.

Thanks to Mark Logician Ron Avnur for educating me on this topic.

Friday, March 30, 2007

My Third All-Time Favorite High-Tech PR Gaffe

Until today, I had two all-time favorite high-tech public relations (PR) oops-es:

  • The first was when some company sent The New York Times a softcopy of a press release in Microsoft Word without first accepting all the revision marks. The journalist who received the press release changed the view in Word to show the complete revision history and got to see a live demonstration of the corporate weasel-wording process.

  • The second was when Google's new head of investor relations sent a PowerPoint slide deck to analysts and forgot to sanitize the notes section, inadvertently leaking both confidential financial projections and a new product announcement. I blogged about that one, here.

Ever wonder why most PR people now convert both Word and PowerPoint documents to PDF before sending them out?

Well, today, I have a third all-time favorite. See this amazing story where Microsoft (and/or their PR firm Waggener Edstrom) accidentally mailed a briefing document prepared for a spokesperson to the journalist who was interviewing him, Fred Vogelstein of Wired.

If you're at all interested in PR then you have to read the briefing document, here.

If you're not interested in PR you should probably read it anyway, because you will get a first-hand education in just how much work large companies put into pitching stories and preparing spokespeople for press interviews.

In my opinion, Waggener Edstrom president Frank Shaw should have eaten some humble pie and/or simply shut up, but instead -- as I suppose would be every PR maven's instinct -- he tried to make lemonade out of lemons here. It didn't work for me at all nor, judging by the comments on his post, did it work for most other people either.

Wired editor in chief Chris Anderson has an interesting and fairly charitable post on his blog, here.

Thursday, March 22, 2007

Text Analytics Summit Panel 6/12 - 6/13/07

Just a quick post to announce that I'll be speaking at the Text Analytics Summit that's being held just outside Boston on June 12th and 13th, 2007.

I'm looking forward to this as we're getting increasingly involved with text analytics at Mark Logic. Most text mining or text analytics tools output what they find as either XML metadata or (better yet) in-line, enriched XML.

As it turns out, MarkLogic Server is a great place to store that XML, as it goes through an enrichment pipeline of utilities that each perform some particular extraction magic. And once fully extracted and enriched, MarkLogic is again a great place to store the final resulting XML, because we are able to run extremely powerful XQueries, fast, that fully leverage XML markup combined with full-text search.

Here is moderator Curt Monash's blog post on the panel. (Thanks for the kind words.) The full panel consists of:
  • Curt Monash, President, Monash Information Services
  • Dave Kellogg, CEO, Mark Logic Corporation
  • Michelle De Haaff, VP Marketing, Attensity Corporation
  • Michel Lemay, VP Marketing, nstein Technologies
  • Mary Crissey, SAS Analytics Marketing Manager, SAS Institute
You can get more information on the summit here, or -- if you feel so inclined -- register here.

Oracle Sues SAP: "Theft on a Grand Scale"

In the early days of enterprise software, analysts and executives focused on license revenue and market share as the two primary metrics of a software company's health.
  • License revenue, because software license revenue typically carries very high (95%+) gross margins.
  • Market share, because it determined the percent of the total opportunity a vendor was locking up, and locking-in due to switching costs.
Only as enterprise software matured did maintenance revenue start to get real respect. Why? Because:
  • As license growth slowed, the maintenance stream became a relatively larger part of the business.
  • Maintenance revenues typically also carry high gross margins, in the 80% range.
  • Maintenance is an annuity (typically 20% to 25% of license) that doesn't need to be sold every year.
  • There was a growth opportunity in mining the maintenance contract junk heap that most software companies had accumulated over the years. Simply by cleaning up maintenance renewal procedures, companies could keep driving growth in an otherwise slow growth climate.
More recently, vendors started to compete for each other's maintenance revenue annuities. For example, SAP started to sell maintenance/support on Oracle's applications.

Personally, I never understood this. While company A could quite possibly provide "assistance support" (i.e., telephone calls for help) for company B's products, without access to company B's source code I don't see how it could provide software support. For system-level software this is most definitely true. For applications, where customers can obtain and customize source code, I suppose it's only partially true. But either way, unless I were planning on permanently moving from company B to A (and ergo wanted only ramp-down assistance support), personally I wouldn't buy my support from anyone but the original vendor (open source projects excepted).

Besides source code access, there's another important issue in company A supporting company B's products: know-how. (Recall my first software job was as a support engineer at Ingres.) Supporting software requires considerable knowledge about how it works, access to the full software documentation, access to internal documentation, and access to internal knowledge bases, technical notes, various homegrown utilities, etc.

Having read the 44-page Oracle complaint, I think know-how is the key issue in this case. I'd recommend taking a few minutes to read the complaint because it's written like a novella using lots of dramatic language, because it can show marketers how statements and claims can be used against them, and because I think everyone in software should read a complaint every now and then to get a concrete feel for what they look like.

Before excerpting from the colorful complaint, I'll share my favorite media quote from this story in InformationWeek:
"This isn't really about protecting intellectual property," said Forrester Research analyst Ray Wang. "This is all about the art of war."
I'm not a lawyer and claim no legal expertise in this matter, but I will share some of my favorite excerpts from the complaint to give those who lack the time to read it a taste:

This case is about corporate theft on a grand scale, committed by the largest German software company – a conglomerate known as SAP. Oracle is a leading developer of database and applications software, and SAP is Oracle’s largest enterprise applications software competitor.

[...]

Oracle brings this lawsuit after discovering that SAP is engaged in systematic, illegal access to – and taking from – Oracle’s computerized customer support systems. Through this scheme, SAP has stolen thousands of proprietary, copyrighted software products and other confidential materials that Oracle developed to service its own support customers. SAP gained repeated and unauthorized access, in many cases by use of pretextual customer log-in credentials, to Oracle’s proprietary, password-protected customer support website. From that website, SAP has copied and swept thousands of Oracle software products and other proprietary and confidential materials onto its own servers. As a result, SAP has compiled an illegal library of Oracle’s copyrighted software code and other materials. This storehouse of stolen Oracle intellectual property enables SAP to offer cut rate support services to customers who use Oracle software, and to attempt to lure them to SAP’s applications software platform and away from Oracle’s.

[...]

For example, using one customer’s credentials, SAP suddenly downloaded an average of over 1,800 items per day for four days straight (compared to that customer’s normal downloads averaging 20 per month). Other purported customers hit the Oracle site and harvested Software and Support Materials after they had cancelled all support with Oracle in favor of SAP TN. Moreover, these mass downloads captured Software and Support Materials that were clearly of no use to the “customers” in whose names they were taken. Indeed, the materials copied not only related to unlicensed products, but to entire Oracle product families that the customers had not licensed.

[...]

All of these customers whose IDs SAP appropriated had one critical fact in common: they were, or were just about to become, new customers of SAP TN – SAP AG’s and SAP America’s software support subsidiary whose sole purpose is to compete with Oracle.

[...]

It was not clear how SAP TN could offer, as it did on its website and its other materials, “customized ongoing tax and regulatory updates,” “fixes for serious issues,” “full upgrade script support,” and, most remarkably, “30-minute response time, 24x7x365” on software programs for which it had no intellectual property rights. To compound the puzzle, SAP continued to offer this comprehensive support to hundreds of customers at the “cut rate” of 50 cents on the dollar, and purported to add full support for an entirely different product line – Siebel – with a wave of its hand. The economics, and the logic, simply did not add up.

Oracle has now solved this puzzle. To stave off the mounting competitive threat from Oracle, SAP unlawfully accessed and copied Oracle’s Software and Support Materials.

The SAP Solution: Stolen Passage

[...]

SAP TN conducted these high-tech raids as SAP AG’s agent and instrumentality and as the cornerstone strategy of SAP AG’s highly-publicized Safe Passage program. Further, to the extent SAP TN had any legitimate basis to access Oracle’s site as a contract consultant for a customer with current licensed support rights, SAP TN committed to abide by the same license obligations and usage terms and conditions described above applicable to licensed customers.

Wednesday, March 21, 2007

Viacom's Anti-Google Argument

Here is an interesting article from Steve Bryant's Google Watch column on eWeek that articulates Viacom's argument against Google / YouTube in their recent lawsuit.

The thing I always found funny about the YouTube acquisition was the increase in sue-ability that directly resulted from being part of Google. When YouTube was a money-losing, startup outfit, there was really no reason to sue them because there were no real assets to try to take as compensation for damages.

Being owned by Google changed that radically. I know I'm not the first guy to figure this out, but it still amazes me.

Yes, I believe there is a $200M carve-out in the $1.6B acquisition price to cover anticipated legal and associated costs. But will that be enough? And won't the real problem be the opportunity cost associated with the distraction of Google's management?

See this story, Google Searches for YouTube Payoff, in the San Jose Mercury News for more on this theme.

Excerpts include:
"I think Google may have underestimated the amount of challenge all this copyright problem was going to cause," said Josh Bernoff, a vice president of Forrester Research.

[...]

All the complications aside, some analysts say Google is rich enough to afford to have YouTube fail. What the Mountain View company can't afford, they say, is a prolonged period of YouTube-induced distraction.

"The biggest risk related to YouTube, in our view, may actually be whether it is keeping Google management from focusing on the core search business," Lehman Brother's analyst Doug Anmuth wrote in a recent note.


Indeed.

Monday, March 12, 2007

Wipro Announces Publishing Solution

I'm excited to report that last Thursday, Wipro announced their Integrated Publishing Platform (IPP) solution, of which MarkLogic is a key part. IPP is a solution designed to help publishers build an integrated content supply chain and thereby reduce the "time to publish" for their products.

The solution is centered around Wipro's Flow-briX, a business process management (BPM) framework. Internally, the solution uses MarkLogic Server as its XML repository and for content delivery.

We think that IPP addresses a very real publishing problem and we're excited that Wipro chose to leverage MarkLogic within it.

Here is MarkLogic's supporting press release about the IPP launch.

Thursday, March 08, 2007

Open Secrets

I recently found and greatly enjoyed this New Yorker article, entitled "Open Secrets: Enron, Intelligence, and the Perils of Too Much Information," by Malcolm Gladwell (of The Tipping Point and Blink fame). See here for my review of Blink.

Open Secrets is a long (7,000-word) article that goes into considerable depth on the topic of open source intelligence -- basically, finding things hidden in plain sight.

Early in the article, Gladwell introduces the distinction between puzzles and mysteries:

The national-security expert Gregory Treverton has famously made a distinction between puzzles and mysteries. Osama bin Laden's whereabouts are a puzzle. We can't find him because we don't have enough information. The key to the puzzle will probably come from someone close to bin Laden, and until we can find that source bin Laden will remain at large.

The problem of what would happen in Iraq after the toppling of Saddam Hussein was, by contrast, a mystery. It wasn't a question that had a simple, factual answer. Mysteries require judgments and the assessment of uncertainty, and the hard part is not that we have too little information but that we have too much.
He then proceeds to deftly argue that the Enron debacle could easily be mistaken for a puzzle, when in reality it was a mystery. Most or all of the information needed to recognize that Enron was at risk (e.g., the complex special-purpose entities) was disclosed in public documents. The problem with Enron, Gladwell convincingly argues, wasn't too little information but too much.

He goes on to describe the Screwball Division, a US World War II intelligence outfit that relied entirely on public information:

The analysts listened to the same speeches that anyone with a shortwave radio could listen to. They simply sat at their desks with headphones on, working their way through hours and hours of Nazi broadcasts. Then they tried to figure out how what the Nazis said publicly—about, for instance, the possibility of a renewed offensive against Russia—revealed what they felt about, say, invading Russia.

One journalist at the time described the propaganda analysts as "the greatest collection of individualists, international rolling stones, and slightly batty geniuses ever gathered together in one organization." And they had very definite thoughts about the Nazis' secret weapon.

That secret turned out to be the V-1 rocket and most of the inferences the analysts made turned out to be correct. Another excerpt:
The political scientist Alexander George described the sequence of V-1 rocket inferences in his 1959 book "Propaganda Analysis," and the striking thing about his account is how contemporary it seems. The spies were fighting a nineteenth-century war. The analysts belonged to our age, and the lesson of their triumph is that the complex, uncertain issues that the modern world throws at us require the mystery paradigm.
The article ends with an example that cleverly and clearly supports Gladwell's hypothesis about Enron. In 1998, six Cornell business school students decided to do a term project on Enron for an advanced financial analysis class:

It was about a six-week project, half a semester. Lots of group meetings. It was a ratio analysis, which is pretty standard business-school fare. You know, take fifty different financial ratios, then lay that on top of every piece of information you could find out about the company ...
The students' conclusions were straightforward ... There were clear signs that "Enron may be manipulating its earnings." ... The report was posted on the Web site of the Cornell University business school, where it has been, ever since, for anyone who cared to read twenty-three pages of analysis.

The students' recommendation was on the first page, in boldfaced type: "Sell."
Gladwell is a delightful writer and this is a topic that should be of interest to virtually everyone. So my recommendation: this is a must-read article.

Hunter and Hitchens To Present at SD West

Mark Logic's own Jason Hunter (principal technologist) and Ron Hitchens (senior engineer) will be presenting at the SD West show in Santa Clara, California which runs from 3/19 to 3/23/07.
  • Jason Hunter, author of Java Servlet Programming, will present a half-day tutorial on XQuery as well as several ninety-minute classes, including one based on a talk, Web Publishing 2.0 (slides here), that drew a standing room only crowd at the recent XML 2006 conference.
  • Ron Hitchens, author of Java NIO, will present two ninety-minute classes on Java for intermediate and advanced audiences.
They are both excellent speakers who continue to get top ratings from their audiences and, on occasion, awards from the conferences at which they present.

If you're going to the show, don't miss them!

Sunday, March 04, 2007

The High Cost of Ineffective Search

Just a quick post pointing to a recent article on the costs associated with ineffective enterprise search.

Tidbits include:
  • According to IDC, a company with 1,000 information workers can expect more than $5M in annual wasted salary costs because of poor search.
  • A recent survey of 1,000 middle managers found that more than half the information they find during searching is useless.
  • According to Butler Group, as much as 10% of a company's salary costs are wasted through ineffective search.
  • According to Sue Feldman of IDC, people spend 9-10 hours per week searching for information and aren't successful 1/3 to 1/2 the time.
As I always say, there's a reason why "enterprise search sucks" returns over 1M hits on Google, including posts from luminaries such as Jon Udell and Tony Byrne.

While Mark Logic is not out to solve the generic enterprise search problem, I have long believed that enterprise search, as a category, will become stuck between a rock and a hard place.
  • The rock is the commoditization of the low-end enterprise search market through offerings like the Google Appliance and IBM OmniFind Yahoo Edition. This will suck the money out of the low end, the generic crawl-and-index market.
  • The hard place is DBMSs -- specifically, DBMS-based content applications built to help people in specific roles perform specific tasks. Some people build these applications today by trying to bolt together an enterprise search engine and a DBMS (e.g., Oracle + Verity or Lucene + MySQL), but increasingly, I believe, people will use XML content servers (special-purpose DBMSs designed to handle content) for this purpose.
When you think about it, an inverted keyword index can only help you so much when trying to solve a problem -- even if you gussy it up with taxonomies and sexy extraction technology. In the end, an application designed to solve a specific problem will trump a souped-up tool every time.
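For readers who haven't looked inside one, an inverted keyword index boils down to something like the sketch below. It is a toy model under loud assumptions: the documents, function names, and whitespace tokenization are all hypothetical, and real engines add ranking, stemming, and much more.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from a copy so the intersection never mutates the index itself.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical document collection, keyed by document id.
documents = {
    1: "quarterly revenue report for the publishing division",
    2: "draft contract for the publishing partner",
    3: "revenue forecast",
}
index = build_inverted_index(documents)
```

The index can say which documents mention which words, and nothing more. It knows nothing about who is asking or what task they are trying to complete, which is the gap a purpose-built application fills.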

Every Publisher Should Read This: Agile Development

Just a quick post to highlight a report done late last year by Outsell entitled "Keep Ahead of the Competition with Agile Development."

As publishers increasingly become application developers who build information products that mix software and content, they increasingly need to adopt state-of-the-art software development practices and methodologies.

Surprisingly, many of the publishers we work with are still doing big-bang development projects following a waterfall approach. I believe that agile approaches are far better, particularly in the uncertain environment in which publishers find themselves. Rather than specifying huge projects up front, building them over a long time period, and hoping they work when launched, they need to deliver software fast and interact with users about what they've built.

Put simply, if you're a publisher and you're not doing agile development you should click here to buy this report for $395. (You can thank me later.)

You can read the excellent Wikipedia entry on agile development here. You can find the pithy Agile Manifesto here. You can find my blog post on content agility here.

Thursday, March 01, 2007

Oracle Acquires Hyperion: BI Enters Wave 2 Consolidation

According to this story in today's New York Times, Oracle will acquire Hyperion for $3.3B, or $52 per share, a 21% premium over yesterday's closing price.

The concept isn't surprising; that it finally happened is. For years, speculation has circled the major BI vendors -- Business Objects, Cognos, and Hyperion -- who seem somewhat obvious targets for the "big guys" such as Oracle, Microsoft, IBM, and even SAP.

In fact -- without any quant to back this up -- it strikes me that BI is one of the biggest unconsolidated (i.e., independent) categories in software. Quick, name another category that has three $1B-ish vendors and where the big guys have either no or no-credible offerings?

This begins what I call the second round of consolidation in BI. Why "second" round? Well, round one was suite-itization. Let's go back in time ten years:
  • Business Objects had the best ad hoc query and reporting tool (BusinessObjects)
  • Cognos had the best OLAP tool (PowerPlay)
  • Crystal had the best enterprise reporting tool (Crystal Reports/Enterprise)
  • Informatica had the best ETL tool
  • Hyperion had the best financial planning software
  • Arbor had the best OLAP server
  • There were other, smaller, related categories, with their own leaders, such as data mining, data profiling, and set-based analysis

So you had, back then, leaders in what we now consider sub-categories of BI. Back then, of course, it wasn't obvious to the vendors whether you were leading a category that would remain independent -- or a sub-category of a larger market that was about to consolidate.

As I recall, Hyperion and Cognos were most aggressive about driving category consolidation. I think Hyperion started it all by acquiring Arbor. Cognos later acquired DecisionStream (ETL) and Adaytum (planning). Informatica made a failed attempt at building its own Q&R tools and analytic applications. We at Business Objects were more dragged into the party. First, we bought Acta (ETL). Then we did the single biggest acquisition in the category in buying Crystal Decisions for ~$1B in 2003.

I have often drawn parallels between BI and ECM consolidation. In both markets, the first consolidation wave consisted of the leaders in N sub-categories buying losers in the other N-1. (The one exception being Business Objects and Crystal which was a leader/leader consolidation.) See this post for my deeper views on parallels between ECM and BI.

There's one huge difference, however. Generally speaking, I'd say that BI consolidation worked and ECM consolidation didn't. I'm not even sure why. But to this day, you continue to hear grumbling about poor integration and worst-of-breedness in the ECM suites that you simply don't hear about the BI ones. You continue to hear ECM analysts undermine the vendors' "good enough" suite positioning by arguing that customers should combine best-of-breed elements from various suites (e.g., use FileNet for imaging, Documentum for document management, and Interwoven for web content management). You don't hear those kinds of arguments in BI.

There's one other difference, of course. ECM has already begun "wave two" consolidation -- where the big guys buy suite vendors who survived wave one. For example, in the past year or so, Oracle bought Stellent and IBM bought FileNet. In BI, that hadn't happened yet. Until today.

This acquisition may well set off a scramble for Business Objects and Cognos to quickly find the right dance partners. Or it might not.

There has always been the "Switzerland" argument in BI, meaning that you don't want to buy BI from one of the big guys precisely because you want BI to work with everything. One would rightly assume that Oracle's BI would work best with Oracle's DBMS as would IBM's or Microsoft's. So given the transverse nature of BI (i.e., the need to consolidate information across systems) you would prefer to get it from a third party. I think the Switzerland argument is one reason why BI had yet to undergo wave two consolidation.

But all that changed today. Right now, if you're Business Objects or Cognos and you look into your crystal (pardon the pun) ball, you need to think hard about which matters more: best-of-breedness and the Switzerland argument or good-enough-ness and size, scale, and distribution power.

Ultimately, that is the question that will determine the future of the BI market.