Friday, July 24, 2009
How Do You Query a Key-Value Store Anyway?
Bob: So, how do I query the database?
IT guy: It's not a database. It's a key-value store.
Bob: OK, it's not a database. How do I query it?
IT guy: You write a distributed map-reduce function in Erlang.
Bob: Did you just tell me to go screw myself?
IT guy: I believe I did, Bob.
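For anyone wondering what the IT guy is actually asking of Bob, here is a rough sketch of a map-reduce "query," written in Python rather than Erlang purely for readability and run in a single process rather than across a cluster. The store contents, key names, and function names are all invented for illustration; in a real key-value store the map step would run on each node and the partial results would be shipped to the reducers.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a key-value store:
# opaque keys on the left, record values on the right.
store = {
    "order:1": {"customer": "bob", "total": 40},
    "order:2": {"customer": "alice", "total": 25},
    "order:3": {"customer": "bob", "total": 10},
}

def map_order(key, value):
    """Map step: emit a (customer, total) pair for each stored record."""
    yield value["customer"], value["total"]

def reduce_totals(customer, totals):
    """Reduce step: combine all values emitted for one customer."""
    return sum(totals)

def map_reduce(kv, mapper, reducer):
    """Run mapper over every pair, group by emitted key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in kv.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {k: reducer(k, vs) for k, vs in groups.items()}

totals_by_customer = map_reduce(store, map_order, reduce_totals)
print(totals_by_customer)
```

In SQL the same question would be a one-line GROUP BY over an orders table, which is more or less Bob's point.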
Thursday, July 23, 2009
Two Great Posts on Media Industry Disruption
On digging through the deluge of RSS articles I found on my return, I located two particularly interesting posts on disruption of the media industry.
The first is a post by Michael Nielsen, a quantum information theorist and seemingly very smart fellow, entitled Is Scientific Publishing About To Be Disrupted, which includes links to some great posts about the challenges facing newspapers, and provides not only a great general discussion of how industry disruption happens, but also a specific look at media overall and scientific publishing in particular. I'd never heard of Nielsen before, but I've already subscribed to his blog because he strikes me as a real Renaissance individual working on fascinating projects like a book on The Future of Science, a series of posts on Google's Technology Stack, along with the odd post on things like Why The World Needs Quantum Mechanics.
The second is a two-part post on ReadWriteWeb entitled Bits of Destruction Hit the Book Publishing Business (Part 1 and Part 2). These posts focus on three waves rocking the publishing industry (Google Book Search, e-books, and print on demand) and their consequences for the various participants in the book publishing value chain. In the end they predict that future book revenues will end up being split 33/33/33 among the author, the (web) publisher, and the e-book or print-on-demand deliverer.
Excerpt:
Here is a bookstore owner's nightmare. Customer walks in; browses around; has grand old time in this temple of knowledge; peruses a book that costs $27; takes out Kindle and orders it for $17, right there in front of your nose, using your wi-fi connection. Aaagh!
You wake up sweating at 3:00 in the morning
Both posts are well worth reading, but save some time to do so and be sure to hit lots of the links embedded in the Nielsen post.
Monday, July 06, 2009
Follow Me On Twitter: I'm Changing My Sharing Pattern
- My newfound love of bit.ly and the ease with which you can Tweet from bit.ly
- My deep backlog of blog posts; I actually blog on about 1/4 of my candidate articles
- The realization that I can provide valuable sharing with a short comment and a link to a story
Historically, I viewed Facebook as about friends, the blog (and LinkedIn) about work, and Twitter somewhere in between. Going forward, I'm going to view Twitter as an extension of the blog and the only relic of my past confusion will be my username, inspired by the Allman Brothers.
XQuery's Real Potential: Transforming Application Development
First, let's excerpt the abstract to tee things up:
Ten years have passed since the W3C initiated its effort to design a query language for what, in 1999, was a new and controversial semi-structured data format, namely XML. A decade (and a lot of effort) later, the (now programming) language and its implementations are finally reaching industrial strength and are being taken up by customers as a solid alternative for building complex applications.
Meanwhile, independently of the development of XQuery, and completely orthogonal to any programming language or application development infrastructure, a new buzzword is becoming more and more visible in the IT arena: the "Cloud." In this talk I will describe the poor state of current application development, which has serious limitations and inconveniences, and I will explain why, today, innovation in this area is unavoidable. The applications bubble is about to burst: existing software components, architectures, programming languages, database models, and communication protocols are under significant pressure to change.
I will argue that a combination of those two important technologies, "XQuery + Cloud," might provide a breakthrough in the area of application development infrastructure.
Because, as Daniela points out, cloud is orthogonal, I'm not going to explore that angle here. (But promise I will in the future as I agree that XQuery in the cloud is a great idea.) Instead, I want to focus on the transformative power of XQuery on web application development.
Frankly, when most people start to use MarkLogic they don't do so because of the potential to transform application development. They come to us because they are having trouble storing, searching, querying, and delivering XML content or semi-structured data. Only after they have built a few applications do they realize -- hey, wait a minute, I could build my entire application in XQuery and replace my RDBMS, my enterprise search engine, and my J2EE application server in one fell swoop by building my applications in top-to-bottom XML.
To be clear, not all of our customers do this. Many are content with the rest of the stack and use MarkLogic to help with XML heavy lifting. But a growing fraction do.
Let's examine some of Daniela's points:
- The problems with the traditional application development stack she highlights on slides 9-12: high cost, inflexibility, and slow time to market.
- The argument for XML on slide 16: spot on.
- (Warning: code on slides 19-26, but keep clicking.)
- Slide 29 makes a reasonable argument in favor of XQuery.
- The whole permissiveness angle on ACID transactions on slides 32-33 is new to me so I need to think about it more. MarkLogic offers ACID transactions, by the way. But I like the idea (in part because it's good critical thinking) that perhaps the database community is too dogmatic in this regard and that we pay a high price for that dogma.
- Feel free to skip over slide 35 entirely (kidding). I think it mostly summarizes as "XQuery is relatively new" and there is no totally free lunch. Over time the holes will be filled in and MarkLogic fills in several holes already. I don't think XQuery is particularly complicated and I'm certain it's a heck of a lot less complicated than SQL/XQuery Franglais queries that RDBMSs often want you to write to access XML columns. I've seen real deep experts argue for hours over the correct semantics when you're mixing SQL and XQuery. Stay away from that.
- While I'm mostly skipping cloud in this post, I have two comments. First, internally we run some demo systems with MarkLogic installed (quite easily) on Amazon Web Services, so as consumers we like the model. Second, the other day I met with Chris Barbin, CEO of Appirio, and thought he was a fascinating guy and Appirio a fascinating company. Among other things they help you, at a strategic level, to figure out which cloud services to use where, and how to link them to each other and to your on-premises infrastructure. In a world where you can rent anything from raw disk blocks to CPU to database to applications to application platforms to BI in the cloud, it surely helps to have a strategy.
- But my favorite slide is back at slide 28 which shows "XQuery's real potential: standalone programming language for information intensive applications [which can let you] build extremely rich applications." I couldn't agree more. And I like the picture even better. It's what we call top-to-bottom XML.

I've embedded the entire presentation below via SlideShare. The original link off the UCI website in PowerPoint format is here.
Sunday, July 05, 2009
New York Times on the Changing Ways of Silicon Valley PR
Excerpt:
Instead, [publicist Brooke Hammerling] decides that she will “whisper in the ears” of Silicon Valley’s Who’s Who — the entrepreneurs behind tech’s hottest start-ups, including Jay Adelson, the chief executive of Digg; Biz Stone, co-founder of Twitter; and Jason Calacanis, the founder of Mahalo.
Notably, none are journalists.
This is the new world of promoting start-ups in Silicon Valley, where the lines between journalists and everyone else are blurring and the number of followers a pundit has on Twitter is sometimes viewed as more important than old metrics like the circulation of a newspaper.
The article goes on to discuss what, in my opinion, are truly massive changes to the business of Silicon Valley PR over the past five years, driven by changes in the B2B trade press and the rise of social media.
While the article raises many good points, I think its over-reliance on Ms. Hammerling starts to make it feel -- in an ironic twist of journalistic narcissism -- like a puff piece about her: the journalist admiring the PR person instead of focusing on the changes in the business.
Over the years, her contact list swelled to the point that her stories now overflow with dropped names. There are the e-mail messages from Larry Ellison, the chief executive of Oracle, and the time she handled a client’s crisis from her BlackBerry while traveling to St. Barts to join the former Hollywood überagent Michael Ovitz and his family on his yacht. Or the time she was in her bikini at a Mexican resort, checking her e-mail at the hotel’s computer, when Ron Conway, a veteran tech investor, walked in.
Or the purportedly secret poker party she threw in her suite at a recent tech conference: “All my friends were there — Arianna was there, the Twitter boys were there,” ...
“Arianna told me I was a great hostess, and I thought I was going to die,” she said
Thursday, July 02, 2009
BlueGuru: JetBlue's MarkLogic-Based Publishing and Content Management System
Excerpt to tempt you into reading the 26-page document:
XML is BlueGuru’s enabling technology, and MarkLogic Server is its most critical architectural element. XML addresses JetBlue’s requirements for structured documents—multiple types, multiple components within each type, hierarchical relationships between components, and component sharing across documents. MarkLogic Server is an XML content management system that automates BlueGuru’s documentation processes. Its repository stores BlueGuru’s documents and supports their access and retrieval by Crewmembers, partners, and regulators.
This case study report tells the story of JetBlue’s business transformation from a documentation system of decentralized and manually maintained manuals to a distributed content management and publishing system.
I've embedded the full document below in Scribd epaper format. Thanks to Mitch for writing a great document and to the folks at JetBlue for their faith in us, for their support of Mitch in writing the case study, and for the help and input they've provided us.
Semantic Technology at the New York Times
Evan gave an information-packed, 79-slide keynote address at the recent Semantic Technology Conference in San Jose. During our meeting, we went through some of the slides and they were fantastic. The slides aren't publicly posted yet; I hope they soon will be, and I will update this post with a link if and when they are.
He also told me about the New York Times' recent release of a 1.8M article corpus to the computer science research community, known as The New York Times Annotated Corpus. The corpus includes nearly every article published in the New York Times for twenty years (between 1/1/87 and 6/19/07) in XML format (NITF to be precise) along with various metadata about the articles.
They believe the corpus can be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization, and automatic content extraction. I think that's true not only because it's real content in real volume, but because that content comes with real, high-quality metadata that you can use to build upon and/or validate various text processing algorithms.
Finally, in prepping for the meeting I found this video interview with Evan at the New York Semantic Meetup. Great stuff, embedded below.
Stonebraker: Send Relational DBMSs to the Home for Tired Software
Excerpt:
Moreover, the code line from all of the major vendors is quite elderly, in all cases dating from the 1980s. Hence, the major vendors sell software that is a quarter century old, and has been extended and morphed to meet today’s needs. In my opinion, these legacy systems are at the end of their useful life. They deserve to be sent to the “home for tired software.”
His key argument is all about performance: in any given use-case, Stonebraker thinks RDBMSs can be beaten by about a factor of 50.
- In data warehousing he says a column store wins by 50x
- In OLTP he says a memory-resident DBMS wins by 50x
- For scientific data, he says a DBMS specialized for the job can win by 50x
- For RDF, he says column stores do a reasonable job and is confident that specialized RDF triple stores will do better, i.e., 50x or more. (I'd add that at MarkLogic we think we do a reasonable job as well.)
- For text, he points out that no major search engine uses a relational database so they didn't even qualify for consideration.
- For XML, he cites a private report I sent him a while back done for one of our customers comparing MarkLogic performance to a relational DBMS. When on "our turf," we usually win by no less than 10x and sometimes 100x or more. Sometimes, queries are not even processable in an RDBMS and/or need to be hand-optimized and hand-joined between a DBMS and a search engine.
- A non-relational data model
- A different implementation of tables
- A different implementation of transactions
Wednesday, July 01, 2009
The New Phone Book's Here: Mark Logic Mentioned in Forrester DBMS Wave
Navin R. Johnson: The new phone book's here! The new phone book's here!
Harry Hartounian: Boy, I wish I could get that excited about nothing.
Navin R. Johnson: Nothing? Are you kidding? Page 73 - Johnson, Navin R.! I'm somebody now! Millions of people look at this book everyday! This is the kind of spontaneous publicity - your name in print - that makes people. I'm in print! Things are going to start happening to me now.
The first "thing" that happens to Johnson is he's selected by a sniper as his next random victim, leading to the famous "he hates the cans, stay away from the cans" line as the sniper repeatedly misses Johnson, blowing holes in the oil cans all around him.
But I'm getting too deep into my metaphor.
The purpose of this post is to say that Mark Logic is mentioned in the new Forrester Wave: Enterprise Database Management Systems, Q2 2009, published 6/30/09 and authored by Principal Analyst Noel Yuhanna.
(The new Forrester Wave's here! The new Forrester Wave's here!)
One of the more stunning highlights of the report is the degree of dominance held by the IBM, Microsoft, Oracle oligopoly: they estimate that those three vendors control more than 88% of the market. Huge market shares and high operating margins breed complacency faster than stagnant water breeds mosquitoes, so I remain confident in the disruption potential in segments of this, per Forrester, $27B market.
We get a mention in the description of the database market landscape, which breaks the market into three segments: OLTP databases, data warehouse databases, and specialized databases.
(I'm in print! Things are going to start happening to me, now.)
Excerpt:
Specialized databases. Beyond the OLTP and warehouse categories, the specialized database category provides DBMSes used by applications for specific purposes — such as mobile applications, XML applications, or standalone applications that need an embedded database repository. Most of these requirements come from value-added resellers (VARs), original equipment manufacturers (OEMs), and independent software vendors (ISVs) that use a specialized database to store data and metadata for their applications. Vendors of specialized databases include IBM, Microsoft, Oracle, and Sybase, as well as smaller vendors such as Mark Logic, Progress, and Software AG.
I am happy for two reasons:
- The growing acceptance of special-purpose DBMSs as a valid segment of the DBMS market. Ten years ago, most analysts didn't concede the need for specialized DBMSs to exist.
- We get mentioned as a member of the class. There are literally scores of specialized DBMSs out there (e.g., column stores, stream stores, XML stores, DW stores) so I'm happy that we were cited as an example.
