Thursday, December 20, 2007

XQuery "Blanketing the Enterprise"

Check out this InfoWorld story, entitled XQuery Blankets The Enterprise Thanks To Major Collaboration. Excerpts:

Back in 1998 there was no consensus that anyone would need a full-fledged XML query language. Today, XQuery is being implemented by all the major relational databases, by middleware vendors, in content management systems, and by open source projects. It’s even becoming part of the SQL standard. “You’ve got to consider that success for a language,” says Jonathan Robie, one of the prime movers in the development of XQuery.

[...]

As XML becomes the lingua franca for everything from XHTML Web pages to Word documents, the value of a general-purpose XML query language becomes ever clearer.

I couldn't agree more. The more the world is XML-based -- the content, the browser, the query language, and the programming language (XQuery can be both) -- the greater the need for the underlying DBMSs to be XML-based as well. See this post, The Virtues of Top-to-Bottom XML, for more.

Wednesday, December 19, 2007

Microsoft 8-K Filed in Latest XBRL

See this IDG news service story that reports on Microsoft filing its latest form 8-K using the latest and greatest version of XBRL (extensible business reporting language), an emerging XML-based standard to define and exchange business and financial performance information.

Excerpts:
Microsoft said it's the first company to submit data using a new XBRL taxonomy released on Wednesday that allows the description of data according to U.S. Generally Accepted Accounting Principles (GAAP). The taxonomy defines, for example, what tags should be used to label data such as "net profit."

The advantage of XBRL is that it is machine-readable, and computers can use the tags to pull out comparable data from different companies from their filings.

Microsoft is one of about three dozen companies participating in a one-year pilot program to submit reports in XBRL, according to the SEC. The SEC has run a voluntary XBRL filing program since 2005.

If you've never seen an example of XBRL, go here to see the complete filing in an XML marked-up text file. Go here to see the XBRL instance file itself.

When I look at the files, I get two impressions:
  • Wow, think of the powerful queries you'll be able to do with all this mark-up. XML really database-izes content.
  • Wow, is XML verbose! I am so glad we are experts in compression at MarkLogic. When dealing with XML, you need to be.
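
To make the first impression concrete, here is a minimal Python sketch of the kind of query that tagged financial data makes possible. The namespace URI, the element names (NetIncome, Revenue), and the figures are all invented for illustration; they are not taken from Microsoft's filing or from the actual U.S. GAAP taxonomy.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and tag names, made up for this sketch.
NS = {"ex": "http://example.com/hypothetical-gaap"}

# Two tiny stand-in "filings" with invented numbers.
FILINGS = {
    "CompanyA": """<report xmlns:ex="http://example.com/hypothetical-gaap">
        <ex:NetIncome unit="USD">1200</ex:NetIncome>
        <ex:Revenue unit="USD">5000</ex:Revenue>
    </report>""",
    "CompanyB": """<report xmlns:ex="http://example.com/hypothetical-gaap">
        <ex:NetIncome unit="USD">800</ex:NetIncome>
        <ex:Revenue unit="USD">4000</ex:Revenue>
    </report>""",
}


def extract(xml_text, tag):
    """Return the numeric value of one tagged fact, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find(f"ex:{tag}", NS)
    return None if node is None else float(node.text)


def compare(filings, tag):
    """Pull the same tagged fact out of every filing."""
    return {name: extract(text, tag) for name, text in filings.items()}


if __name__ == "__main__":
    print(compare(FILINGS, "NetIncome"))
```

Because every company labels the same concept with the same tag, the comparison is a lookup rather than a text-mining exercise.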

Cooking With The Bible: XML and Recipes

I've always hoped that a MarkLogic customer would build an application around recipes because, to me, they provide a simple, clear, understandable example of semi-structured content and what an XML content server can do with such content.

I'm pleased to announce that Greenwood Publishing has put such an application online, entitled Cooking With The Bible. Here's the text from their welcome page that introduces the site:

Here you'll find meals found in the scriptures, along with complete menus inspired by biblical passages, food lore, and our thoughts about the meaning of the passages these menus are drawn from.

Each meal has three sections:

First are the mouthwatering recipes inspired by events described in the Bible, like "King David's Nuptials", "A Meal in the Wilderness", or "The Prodigal Son Returns".

Next we've listed all the ingredients necessary to understand what's at stake in the biblical text the meal is drawn from, while also trying to answer how the words of the Bible can be relevant to those encountering them today.

Finally we offer a brief history and background of the biblical passage that inspired the meal and recipes.


Now, I know the Bible may not be everyone's thing, but let's not miss the point. A key reason publishers put content into a contentbase and make selective slices of it is to enable the creation of pinpoint-targeted information products for the people for whom topic-X is their thing. Think long tail.

For example, Greenwood already has a series of Cooking With ... books:
Personally, as something of a foodie myself, I like the trend they're riding: foodie-ism combined with the aging baby boomers' increasing interest in history. Cool.

Let's take a quick look at what you can do with this site. Check out this recipe, St. Peter's Fish with Parsley Sauce. You want to know where in the Bible this was eaten? Well, at A Galilean Breakfast. You want some relevant history to go with the fish, go here. You want to make a complete Galilean Breakfast yourself? Go back here, where they list the whole menu. (Enhancement request: make the menu items hyperlinks.) Forgot how to make Galilean Sand Cake? Go here.

Or, if you want one of my favorite dishes, try the Hummus. It looks yummy.

Facebook News Parody Video

Making the point that messages don't always carry well across media, check out this video parody of Facebook, translating it from webpage to live TV news show. Enjoy.

Tuesday, December 18, 2007

DBMS in the Cloud: Amazon SimpleDB

Continuing to steadily and patiently execute on their Amazon Web Services vision, Amazon recently announced SimpleDB, a web service for running queries in real time against structured data.

It's the first instance I'm aware of in which someone is offering DBMS-level services in the cloud. Arguably, Google Base is a competitor, but I've always viewed that as more aimed at eBay and Craigslist and less about cloud computing.

While most SaaS-type applications are indeed applications (e.g., NetSuite, Salesforce), Amazon has been coming at cloud computing from an infrastructure-up, rather than an application-down, perspective. Previously Amazon launched lower-level services including EC2 (elastic compute cloud) and S3 (simple storage service) in the same "pay as you go to use our infrastructure" manner.

I'm told Amazon got into cloud computing because, due to the spiky nature of retail, they have built a massive infrastructure to handle demand peaks (e.g., Christmas) that goes largely unused most of the time. AWS is their attempt to monetize it.

For more on SimpleDB, see this post on the ProgrammableWeb blog, or check out the developer's guide here.

Finally, here's the pricing for SimpleDB:

Machine Utilization - $0.14 per Amazon SimpleDB Machine Hour consumed (normalized to the hourly capacity of a circa 2007 1.7 GHz Xeon processor).

Data Transfer

    $0.10 per GB - all data transfer in
    $0.18 per GB - first 10 TB / month data transfer out
    $0.16 per GB - next 40 TB / month data transfer out
    $0.13 per GB - data transfer out / month over 50 TB

Structured Data Storage - $1.50 per GB-month
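
To get a feel for what those rates add up to, here is a rough Python sketch of a monthly bill calculator. The rates come from the list above; the function names, the assumption that 1 TB equals 1024 GB, and the sample workload are mine, not Amazon's.

```python
# Rates copied from the price list above (USD).
MACHINE_HOUR_RATE = 0.14   # per SimpleDB machine hour
TRANSFER_IN_RATE = 0.10    # per GB transferred in
STORAGE_RATE = 1.50        # per GB-month stored

GB_PER_TB = 1024  # assumption: the price list does not say 1000 or 1024

# (tier size in GB, rate per GB); None means "everything above".
TRANSFER_OUT_TIERS = [
    (10 * GB_PER_TB, 0.18),
    (40 * GB_PER_TB, 0.16),
    (None, 0.13),
]


def transfer_out_cost(gb_out):
    """Apply the tiered outbound rates to a month's outbound traffic."""
    remaining = gb_out
    cost = 0.0
    for size, rate in TRANSFER_OUT_TIERS:
        if remaining <= 0:
            break
        billed = remaining if size is None else min(remaining, size)
        cost += billed * rate
        remaining -= billed
    return cost


def monthly_cost(machine_hours, gb_in, gb_out, gb_stored):
    """Estimate one month's bill, rounded to cents."""
    total = (
        machine_hours * MACHINE_HOUR_RATE
        + gb_in * TRANSFER_IN_RATE
        + transfer_out_cost(gb_out)
        + gb_stored * STORAGE_RATE
    )
    return round(total, 2)


if __name__ == "__main__":
    # Hypothetical small workload: 100 machine hours, 50 GB in,
    # 200 GB out, 10 GB stored.
    print(monthly_cost(100, 50, 200, 10))  # 14 + 5 + 36 + 15 = 70.0
```

For a small workload like this one, storage and machine time dominate and the outbound tiers never come into play.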

Scott Karp on Blogs and Journalism

Check out this interesting posting on the Publishing 2.0 blog by Scott Karp, entitled Can Blogs Do Journalism?

The post gets off to a rocky start:
On the face of it, the question of whether blogs can do journalism is absurd — like asking whether sites published on Vignette can do journalism. A blog, after all, is just a content management system — revolutionary because it made web-native publishing free and easy for anyone — but at the end of the day still just a CMS.
Personally, I think he's confusing "blogging platform" and "blog." In my case, the former is Blogger and the latter is the Mark Logic CEO Blog, i.e., this blog and its content. But let's put that quibble aside because there's good stuff yet to come.
To me, the more salient questions is whether the blog platform — which, as a web-native CMS, is more powerfully connected the online content ecosystem — will be used by more journalists. And whether more bloggers will start to do what can fairly be considered journalism. Which of course begs the uber-question of what is journalism.
He concludes:

Since many news organizations are too busy focusing on the us vs. them polemic with blogs, it makes sense that someone like Nick Denton would have to step into the vacuum — which traditional news organizations so often create in failing to boldly experiment with new forms, because they appear to threaten the old. [...]

Anyone who still thinks that it’s constructive to focus on drawning distinctions between “blogging” and “journalism,” rather than seeing a blog as a platform to evolve the practice of journalism, would be well advised to heed Maggie Shnayerson.

I think it's simply a question of means and ends. A newspaper is a means of delivering a story. So is a weekly news magazine, though a less timely one, which pressures the publisher to deliver value-added analysis.

A blog is a way of delivering a story, too. (And a blogging platform is a way of delivering a blog.) I think Scott's post is flirting with the question: must one be a journalist to perform journalism? I think the answer's clearly no. I'd extend Scott's question to: must one be an analyst to perform analysis? I'd say no again.

I think the question Scott's directly hitting is: will "real" journalists use blogging?

I agree with Scott that the answer remains a reluctant yes. And the reluctant part is what prevents them from using the medium offensively. Journalists (and I'd add, analysts), it appears, will continue to be foot-dragged to the new media party.

The Demise of Closed-Source RDBMSs?

A friend pointed me to this interesting post by Allan Packer of Sun entitled Are Proprietary Databases Doomed? Overall, I think it's a well-done analysis of the DBMS market and well worth reading.

First, a nit. When I was a lad, "proprietary" didn't mean "closed source", it meant proprietary (i.e., vendor controlled) interface. For example, Ingres originally spoke a query language called Quel. SQL then emerged as the standard and any DBMS that spoke a language other than ANSI standard SQL was deemed proprietary. While I know that some people in the open source community view the opposite of "open source" as "proprietary," I think that's a misnomer. I think the correct antonym is closed source.

That aside, I think Allan makes an excellent point about stagnation:
By the turn of the millenium, relational databases had already pretty much met the essential requirements of end users, and proprietary database companies were either pointing their vaccuum cleaners toward other interesting money piles, or losing the plot entirely and sailing off the edge of the world. Today, database releases continue to tout new features, but they're frosting on the cake rather than essentials. No-one issues a tender for a database unless they have unusual requirements. No-one loses their job because they chose the wrong database. And it's been that way for years.
As a general rule I am shocked by the lack of innovation returned by the R&D budgets of most technology companies. As I mentioned yesterday, despite billions of R&D investment, Google has yet to come up with another big business. And what does Microsoft get for the billions they spend each year on R&D? An incompatible version of Office with irritating "ribbons" that takes four years to make.

Silicon Valley startups create new categories with $10s of millions in venture capital. It seems that once they become "real companies" they forget how to innovate at all, let alone on a shoestring.

Specifically in the DBMS market, I think the lack of innovation -- enabled by the oligopolistic structure of the market -- creates a soft underbelly for focused, innovative companies to carve out niches. (And remember, "niches" of a $10B market can be pretty big.)

Allan goes on to do some interesting pricing analysis, and then poses the question:

Why, then, is proprietary database software becoming more expensive while everything else reduces in price? End users normally expect to benefit from the cost savings resulting from improvements in technology. I am writing this blog, for example, on an affordable computer that would easily outperform expensive commercial systems from just 10 years ago.

It seems difficult to resist the conclusion that proprietary database companies have managed to redirect a good chunk of these savings away from end users and into their own coffers. Successful as this strategy has been, though, it could ultimately backfire. The more expensive proprietary databases become, the more attractive lower cost alternatives appear.

I think the short answer to his question is (1) the market is an oligopoly and (2) there is a lot of inertia when it comes to database management systems. So change will happen, but it will happen slowly. And, ironically, the force that drives the market change will be overpricing on the leaders' part. Were RDBMSs not so expensive, there would be less impetus to move to open source.

Now, the RDBMS vendors probably argue they should "milk" the market until the real threat emerges and then "wave a wand" to reduce price, but that is a risky strategy because they could very easily wave the wand too late, which is what I think they are doing.

The only point I think Allan misses in his analysis is that some powerful vendors like SAP and EMC don't like the fact that their applications run on top of lower-level DBMS technologies from competitors. For example, SAP has been trying to get itself off Oracle for about a decade, and I'm told they fund developers to work on MySQL towards that end. I know that EMC/Documentum is not comfortable that the vendors who provide the DBMSs they run on are all now challenging them in content management (e.g., Oracle/Stellent, IBM/FileNet, Microsoft SharePoint).

He then speculates on what he thinks will happen going forward:
My vote for the Strategy Most Likely To Succeed is a tie between Revenue Pull-Through and Reduce Prices. Oracle is arguably becoming the most successful proponent of the pull-through strategy. Oracle wants to supply you with a full software stack, including an OS, virtualization software, a broad range of middleware, a database, and end user applications. The largest component of Oracle's revenue currently still comes from database licenses, but the company is working hard to reduce that dependency. Until that happens, reducing prices across the board will be challenging for Oracle. If Oracle succeeds with a pull-through strategy, it doesn't mean that OSDBs will fail, of course. It simply means that Oracle is less likely to sustain major damage from their success.

He concludes:
Are proprietary databases doomed, then? Not at all. Even if proprietary database companies pull no surprises, they won't fade away anytime soon ... Make no mistake, though, open source databases are coming. For established companies it's more likely to be an evolution than a revolution.

I believe there are two major trends in the DBMS market today: (1) open-source chipping away at the closed-source oligopoly, and (2) special-purpose DBMSs innovating and carving out niches in the soft underbelly. I actually think point 1 provides powerful "air cover" for vendors pursuing strategy 2, because point 1 is a direct attack on the existing business.

Monday, December 17, 2007

Google as Publisher: The Grassy Knol

On December 13th Google took its first step from organizer and indexer of the world's knowledge to supporting-creator of it with the announcement of a new free tool called "knol" (a cutesy-ism which stands for unit of knowledge).

The folks at publishing industry watcher Outsell were quick to use the announcement as validation of their predictions that Google would eventually enter the publishing market:
While it was debatable in the past whether or not Google's actions constituted those of a publisher, there can be no doubt about it today.
Outsell takes a broader view of the announcement than most, who generally see it as a clear, direct shot at Wikipedia. (For an example of the consensus viewpoint, see this Newsfactor story entitled Death Knell Sounds for Wikipedia, About.com. I'd add that lesser known and poorly named Freebase seems squarely in the cross-hairs as well.)

Outsell points out that once a large knol-base (phrase coined by me, you heard it here first!) is created, then Google can tweak its search algorithms to favor its content over competing sites such as Wikipedia, which currently enjoys great organic search rankings; About.com, which doesn't; and Answers.com which was a casualty of an algorithm change in August, resulting in a 28% traffic drop and a nearly 20% drop in their stock price.

There has been plenty written about knol so I won't add a deep analysis here. For more, I'd go to the official Google blog post that launched knol and scroll down to see the list of blog postings that refer to it.

Techcrunch has a great write-up here:
Google is moving away from simply indexing the worlds content to being a content provider itself. Of course Google in response would argue that it is simply facilitating user generated content (like with Blogger), that ultimately they are the host as opposed to the creator, but it still competes with existing content providers, many of whom rely on Google search results for their living.
If you thought publishers were uncomfortable partners with Google before, things just got a lot frostier.

To me, despite billions of R&D investment and boatloads of hype, Google remains, as Kris Tuttle at Research 2.0 says, "a one-trick pony (but it's one darn good trick.)" So no new initiative can be presumed successful simply because Google is behind it. Consider the defunct Google Answers, or the perennially weak comparison shopping service, Google Products, nee Froogle.

Is knol a gimme just because Google's its dad? No way. The poor choice of name will hinder it as will Wikipedia's entrenched position, positive karma, and what I sense is a growing Google fatigue in the market.

(Like a boyish 40-year-old suffering from Peter Pan Syndrome, I think Google is increasingly out of touch with its perception. They're not cute and cuddly techies who everybody loves anymore, so they should stop trying to do cute and cuddly things.)

So, should this make publishers uncomfortable? Yes.

Is it (another) warning shot for the information industry? You betcha.

Do I have three words of advice for publishers regarding Google? Watch your back.

Thursday, December 13, 2007

Rave Reviews for Matt Turner at XML 2007

Check out this post from Craig Kitterman at Microsoft (who I believe works on Office Open XML):

I heard rave reviews of the Mark Logic session delivered this afternoon by Matthew Turner on Open XML interoperability with the Mark Logic product. I was unable to attend the session but I stopped by their booth to get caught up. Matt and John Kreisa kindly talked me through it again and I must say that this is one of the most compelling demos that I have seen as of late that really takes advantage of the Open XML format including deep custom schema integration. I hope to be able to work with Matt & John make this demo more widely visible as it a great showcase for the power of Open XML as well as great product interop.

Nice work Matt!

Classmates.com IPO Pulled

In a good example of how a broad general solution can overwhelm a narrow, focused (and not so great) one, United Online has pulled its planned IPO of Classmates.com, a site I signed up for years ago, used a few times, and haven't really visited in years (until this morning to do a quick review).


It was, in many ways, one of the original social networks. "A friend of mine told me" that one of its top uses was re-igniting old flames and crushes.

Strategically, in my opinion, Classmates.com suffers from a simple problem: it's naturally engulfed.

I'm normally a big believer in strategic focus and the ability to sustain differentiation by doing one thing (e.g., connecting classmates) better than anyone else. But some things just seem obviously engulfed, and this strikes me as one of them. Just as Facebook status updates seem destined to engulf Twitter's tweets, so does the educational background part of Facebook or LinkedIn seem an easy bet to envelop Classmates.

Just how many social networks do you want to join and why can't you use a few big ones for some very broad purposes? Network effects have always suggested to me there will be a few big social networks, rather than a plethora of independent little ones, even if they can communicate via APIs like OpenSocial.

I'm not the only one who feels this way:

A recent report from Cowen & Co. analyst Jim Friedland spells out exactly why United Online couldn’t cash in with Classmates. One line sums up his thesis: “We expect the Classmates.com subscriber base to peak in the first half of 2008, followed by a steady decline to zero by 2012.” Much of the report hones in on the fact that Classmates is no Facebook.

The biggest difference is that Facebook is free and offers far more robust features. Other factors weighing on Classmates: Classmates has little value for young users, since there’s no need for them to re-connect; they’re already connected through other sites. Meanwhile, Facebook is making major inroads into Classmates’ adult demographic. User engagement is 95 percent lower than on Facebook, suggesting that users see little value in the service they’re paying for.

TechCrunch writes about the IPO pulling here.

Coincidentally, this happens the same week that social network AdultFriendFinder (a swingers' network that definitely wants to be separate from Facebook and LinkedIn) was sold for $500M. I love the story of AdultFriendFinder because it captures yet another example of emergent strategy. Real examples of emergent strategies include:
  • Vicks made a cough syrup that kept putting people to sleep. Solution: market it as a night-time cold remedy, NyQuil.

  • Pfizer worked on a drug for angina and hypertension that kept inducing erections. Solution: market the side effect as Viagra. (See here if you don't believe it.)

  • Honda came to the US in the 1960s hoping to sell big motorcycles, like Harleys. During the weekends, their execs rode minibikes in the Southern California hills and everybody wanted one. Solution: change the strategy and market minibikes.
In AdultFriendFinder's case, they started out trying to make a normal dating site, called FriendFinder, but noticed that certain members kept posting pictures of themselves in various states of undress. What to do?

They went with the emergent strategy, addressed the market need, and created AdultFriendFinder. I suspect the original FriendFinder (which still exists) is worth no more than $50M, having been overwhelmed by Match, Yahoo! Personals, and 100 other variously undifferentiated dating sites.

(Yes, the photos, which portray my class at Irvington High School in 1980 and in 2005, are real. I pulled them off Classmates this morning. Thanks Grace!)

Wednesday, December 12, 2007

Your Web Search History: Worth a Look

Frankly, I don't spend much time worrying about my web search history -- but I'm starting to wonder if maybe I should.

Sure, I blogged about the topic once here (You Are What You Search). If you've never looked at the rather famous anonymized AOL search log of user 672368 then you should, just to give yourself a concrete idea of how much can be revealed by your search history. As John Battelle says, search creates a database of intentions. Looking at 672368's reveals a lot about hers.

So what does your database of intentions look like?

Well, if you regularly log in to Google (e.g., Gmail, Blogger) then Google has been creating your own -- hopefully private -- web search history. In theory, they're using it to personalize and improve your web search results. But the questions are:
  • Do you want them keeping this information?
  • What problems would you face if it was accidentally exposed?
  • Or if it was subpoenaed?
Well, there's no better way to know than go look at yours. If you have a personal web search history on Google, here's how you can go look at it.
  • Log in to your account (click sign-in on the Google.com homepage)
  • Click on web history
  • Start browsing
Enjoy the stroll down search memory lane. And then think about search privacy.

Frankly, after looking at my own and weighing the perceived upside (none, as far as I can tell) vs. the possible downside, it wasn't a difficult decision to turn it off.

You've Got MarkMail: Interview with Jason Hunter

The Content Wrangler published an interview yesterday with Mark Logic's Jason Hunter, principal technologist and father of MarkMail. If you missed it, MarkMail is a new, MarkLogic-based web service that we launched last month that lets people search email archives, so you get answers to questions, locate expertise/experts, and even perform some basic content analysis.

MarkMail has been getting great reviews in its early weeks of existence. Here are few things we've heard on the web about MarkMail:
  • "MarkMail provides a slick interface and excellent facilities for managing mailing lists of all kinds."
  • "The good folks at Mark Logic ... have setup a kick-ass mailing list archive."
  • "Sweet zombie Jesus, it was cool!"
  • "This is a pretty sweet email archive."
We love MarkMail both because it provides a very useful service and also because it shows the power of content applications you can build on MarkLogic.

Here are some excerpts from the Jason Hunter interview:

Email presents an interesting challenge. Email archives (both public and private) hold huge amounts of information, but the histories haven’t been well utilized. We think one reason for that is technical, that you need a product like MarkLogic Server before you can take full advantage of email content.

Our plan with MarkMail, being built on MarkLogic Server, is to actively push the envelope and build a content application targeted at the email challenge.

[...]

I’ve worked on email search systems before and I can tell you it’s a real challenge because of the nature of email. Email is messy. Email headers are fairly well structured, but not perfectly, because each mailer will send different headers using different formats and there’s no hard and fast constraints. The email body itself may seem like just flat text (what you’d call unstructured), but really there’s more to it. There are paragraphs, quote blocks (where person A quotes person B), initial greetings, and trailing footers (footers are like a person’s signature block, an auto-added listserv notice, an auto-added confidentiality statement, and things like that). There are also attachments, in which there are pages and paragraphs and things like that.

[...]

MarkMail today runs a free service, designed to search public email archives. We’ve had requests by companies, organizations, and individuals who would like to have MarkMail functionality against their private email archives. We’re exploring ways to make that possible.

I think there’s a lot of potential there. Inside Mark Logic we use a private label MarkMail install for our own mailing lists. We have a mailing list dedicated to handling customer support issues. Using MarkMail against that helps speed our support response times. We have another mailing list for technical discussion, and new hires use MarkMail against that list to get up to speed. There’s also a mailing list where we discuss the competitive market. It acts as a knowledge base during sales calls.
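
Jason's point that an email body has more structure than "flat text" can be illustrated with a small Python sketch. To be clear, this is not how MarkMail works (the interview doesn't describe its internals); it is a toy heuristic using Python's standard email module, with a made-up message, a short hypothetical list of greeting words, and the common convention that a line containing only "--" starts the signature footer.

```python
from email import message_from_string

# A made-up message; the addresses and text are illustrative only.
RAW = "\n".join([
    "From: Alice <alice@example.com>",
    "To: list@example.com",
    "Subject: Re: indexing question",
    "",
    "Hi Bob,",
    "",
    "> How do I index the archive?",
    "Try the nightly build.",
    "",
    "-- ",
    "Alice",
    "",
])

# Hypothetical greeting prefixes; a real system would need far more.
GREETINGS = ("hi ", "hello ", "dear ")


def split_body(raw):
    """Split a plain-text email into subject, greeting, quoted, text, footer."""
    msg = message_from_string(raw)
    parts = {"subject": msg["Subject"], "greeting": [],
             "quoted": [], "text": [], "footer": []}
    in_footer = False
    seen_content = False
    for line in msg.get_payload().splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == "--":           # signature delimiter convention
            in_footer = True
            continue
        if in_footer:
            parts["footer"].append(stripped)
        elif stripped.startswith(">"):  # person A quoting person B
            parts["quoted"].append(stripped.lstrip("> ").strip())
        elif not seen_content and stripped.lower().startswith(GREETINGS):
            parts["greeting"].append(stripped)
        else:
            parts["text"].append(stripped)
        seen_content = True
    return parts


if __name__ == "__main__":
    for name, value in split_body(RAW).items():
        print(name, "->", value)
```

Even this toy version shows why the problem is messy: every rule here is a convention that some mailer or some person will break.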

By the way, this is the second interview that The Content Wrangler has done with Jason. Check out the first one, too, entitled Findability: Jason Hunter on Mark Logic's Use of XQuery to Leverage the Power of Legacy Content.

Monday, December 10, 2007

Matt Turner: First Encounters with Office Open XML

At the top level, I think of MarkLogic Server as the world's best place to put XML content. This begs the question: who has XML content? Today, you find it in a few places -- most notably, publishing (aka, the information industry) and government. And we're starting to find it in certain applications in other industries such as financial services, life sciences, and aviation.

But the answer to the question changes radically in the future. Today, the approximate answer to "who has XML content" is "no one, except groups X and Y." Tomorrow, the answer changes to: "everyone."

Why? Because of Microsoft Office Open XML, the new native format for Office documents. So we are literally sitting on the cusp of an explosion in XML content thanks to Microsoft and at Mark Logic we are naturally quite excited about that.

I've already blogged about Pete Aven's series of posts on Office Open XML. The purpose of today's post is to point you to another blogger, Matt Turner, and to his recent post, entitled XQuery and Microsoft Office (2007) XML. Matt refers to a slide presentation in his blog but was unable to post the slides directly into the blog, so I'll include them here.



I like Matt's blog a lot, because I view it as the more technical cousin to this blog. I'm fairly conceptual in this blog -- there's basically no code here -- while in Matt's blog I feel like he takes many of the same ideas I'm discussing, adds the next level of technical detail, and then buttresses his argument with code.

Bubble 2.0 Video

This is making the rounds. It's not incredible, but it's not bad either and well worth the 2:45 to watch it.

Friday, December 07, 2007

Mark Logic Building Sign Now Up!

After several months of wrangling with various municipal bodies, we have finally succeeded in getting the sign up on our new digs at 999 Skyway Road in San Carlos, California, right off -- and quite visible from -- the central artery of Silicon Valley, Highway 101.

Here's the building photographed this afternoon, with the Mark Logic sign on it.

Thursday, December 06, 2007

Information Today Readers Select MarkLogic for People's Choice Award

I learned last night at our cocktail reception at London Online, fortuitously already drinking a glass of champagne, that MarkLogic has again won an Information Today People's Choice award, in the category Top Enterprise Application.
Other winners included:
  • Top new social networking tool: Facebook

  • Top content creation: Adobe

  • Top social networking tool: Digg

  • Top search and retrieval: Google

  • Top new innovator (is that redundant?): Tim O'Reilly
RSuite (a MarkLogic-based CMS built by Really Strategies) won the award for top content management. That shows the power of information industry focus in tailoring and differentiating a CMS in an otherwise crowded CMS market. Congratulations, Really Strategies!

Thank you to the readers of Information Today for again selecting us. The company's press release is here. The complete list of winners is here.

Tuesday, December 04, 2007

Research 2.0 Software Trends Update

Just a quick post to direct you to this interesting presentation, entitled Software Trends Update by Kris Tuttle and Dennis Byron of Research 2.0.

I know Kris from his days at SoundView, and I always found him a particularly astute financial analyst.

I particularly enjoyed this slide, which does a high-level analysis of the top players. I love the line about Google: "still a one-trick pony, but one damn great trick."

Digitization And Its Discontents

A quick post to refer you to this outstanding article in the New Yorker, entitled Future Reading: Digitization and Its Discontents, which covers the history of libraries and archiving, an overview of the Microsoft and Google digitization projects, the information explosion problem (e.g., "scholars have had to deal with too much information for millennia"), and several other topics.

Excerpts:
The Google Library Project has so far received mixed reviews. Google shows the reader a scanned version of the page; it is generally accurate and readable. But Google also uses optical character recognition to produce a second version, for its search engine to use, and this double process has some quirks. In a scriptorium lit by the sun, a scribe could mistakenly transcribe a “u” as an “n,” or vice versa. Curiously, the computer makes the same mistake. If you enter qualitas—an important term in medieval philosophy—into Google Book Search, you’ll find almost two thousand appearances. But if you enter “qnalitas” you’ll be rewarded with more than five hundred references that you wouldn’t necessarily have found.
[...]
The supposed universal library, then, will be not a seamless mass of books, easily linked and studied together, but a patchwork of interfaces and databases, some open to anyone with a computer and WiFi, others closed to those without access or money. The real challenge now is how to chart the tectonic plates of information that are crashing into one another and then to learn to navigate the new landscapes they are creating.
[...]
And yet we will still need our libraries and archives. John Seely Brown and Paul Duguid have written of the so-called “social life of information”—the form in which you encounter a text can have a huge impact on how you use it. Original documents reward us for taking the trouble to find them by telling us things that no image can. Duguid describes watching a fellow-historian systematically sniff two-hundred-and-fifty-year-old letters in an archive. By detecting the smell of vinegar—which had been sprinkled, in the eighteenth century, on letters from towns struck by cholera, in the hope of disinfecting them—he could trace the history of disease outbreaks. [...] Marginal annotations, which abounded in the centuries when readers usually went through books with pen in hand, identify the often surprising messages that individuals have found as they read. Many original writers and thinkers—Martin Luther, John Adams, Samuel Taylor Coleridge—have filled their books with notes that are indispensable to understanding their thought.
[...]
Sit in your local coffee shop, and your laptop can tell you a lot. If you want deeper, more local knowledge, you will have to take the narrower path that leads between the lions and up the stairs. There—as in great libraries around the world—you’ll use all the new sources, the library’s and those it buys from others, all the time. You’ll check musicians’ names and dates at Grove Music Online, read Marlowe’s “Doctor Faustus” on Early English Books Online, or decipher Civil War documents on Valley of the Shadow. But these streams of data, rich as they are, will illuminate, rather than eliminate, books and prints and manuscripts that only the library can put in front of you. The narrow path still leads, as it must, to crowded public rooms where the sunlight gleams on varnished tables, and knowledge is embodied in millions of dusty, crumbling, smelly, irreplaceable documents and books.
Wow, the folks at The New Yorker can write.

Monday, December 03, 2007

Series of Postings on Office 2007 XML

Mark Logic's Pete Aven, a fellow Bernese Mountain Dog owner, has begun a series of blog posts related to his work with the new Microsoft Office 2007 XML formats.

The series is being done on the blog of Ian Small (Mark Logic SVP of products and general manager of MarkMail), which has a ten-clever-points name, Small Changes. Ian's introductory post to the series is entitled MarkLogic Server and Microsoft Office 2007.

Pete's first post is entitled Office Logic. He first discusses the file structure of Office 2007 XML (a zipfile with a bunch of files in it) and then explains that its formal name is Office Open XML (OOXML) and that it's actually a series of standards for the different Office areas (e.g., word processing, spreadsheets). He then jumps right into XQuery and does examples of creating, opening, and storing documents in MarkLogic.
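
To make the "zipfile with a bunch of files in it" point concrete, here is a minimal sketch in Python (not taken from Pete's post, which uses XQuery). It builds a toy in-memory package that mimics the layout of a word processing document and lists the parts inside. The helper names build_toy_package and list_package_parts are hypothetical, and the XML bodies are placeholders rather than valid OOXML.

```python
import io
import zipfile


def build_toy_package():
    """Build an in-memory zip that mimics a .docx layout (placeholder content only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # Part names follow the usual .docx layout; the bodies are stand-ins.
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


def list_package_parts(data):
    """Return the sorted names of the files stored inside a zip-based package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return sorted(package.namelist())


if __name__ == "__main__":
    for name in list_package_parts(build_toy_package()):
        print(name)
```

Pointing list_package_parts at the bytes of a real Office 2007 file should show the same kind of structure, with more parts in it.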

Once he teaches the basics, I'm sure he'll get on to the cool stuff. I'm looking forward to the rest of the series, Pete!

Sunday, December 02, 2007

Jason Hunter To Deliver Closing Keynote at XML 2007

While I'll be in England at the London Online Information show this week, I should note that Mark Logic's own Jason Hunter will be delivering the closing keynote at XML 2007 this week in Boston.

Jason's a known Java guru and author, a member of the team that assembled O'Reilly Media's SafariU application, and most recently, the inspiration behind (and author of a great deal of) Mark Logic's new MarkMail service. In addition, he's one of the most popular technology speakers I know. If you go to his speech, I am *sure* you will like it.

The title of Jason's session is a play off the opening keynote, which poses the question "Does XML have a future on the Web?" Jason's title is, not surprisingly: You're Darn Right (It Does).

He's speaking at 11:00 AM on 5 December 2007. Don't miss it. Unless you're in London having a pint with me.

B2B Publishing 2008: Folio Webinar Slides

We recently sponsored a webinar with Folio entitled The State of B2B 2008. Here is a copy of the slides from that event.

XML Holland: Thoughts and Slides

It’s been too long since I’ve been in Amsterdam. I must say, particularly coming from parched California, that my first reaction was wow, where did they get all this water, and why hasn't Los Angeles tried to divert it? :-)

I had the privilege of speaking at XML Holland 2007 last week and was surprised to find an active semantic web mafia at the conference. Now I don’t speak Dutch (except for “broodje met kaas”) and, frankly, I can barely understand RDF and OWL in English so I wasn’t able to personally benefit from the sessions. But I can say that there was plenty of interest – and expertise to match – in the semantic web.

Given my general skepticism of the semantic web, I’m not sure the audience appreciated my latest tongue-in-cheek question: What is Web 2.0?
  • A bunch of vowel-dropping, duo-syllabic companies (e.g., Scribd, Tumblr, Flickr, and most recently Definr)?
  • What Tim 2.0 (O’Reilly) did to Tim 1.0 (Berners-Lee) while he was waiting for the semantic web to arrive?
  • A venture capital funded hallucination designed to create Bubble 2.0?
By the way, have you seen the Bubble 2.0 prayer? I’ve seen it on a few bumper stickers in Silicon Valley: "Please God just one more bubble; we’ll know what to do this time."

On a social / political note, I always enjoy Holland for its open-minded culture. In speaking with some Dutch friends, however, I learned that the country is wrestling, of late, with an interesting logic problem: how to handle newcomers who value intolerance in a culture that values tolerance.

You can see the problem – we tolerate everything but intolerance. It reminds me of a Tom Lehrer line from the introduction to his ever-so-cynical classic, National Brotherhood Week:
I'm sure we all agree that we ought to love one another, and I know there are people in the world who do not love their fellow human beings — and I hate people like that!
I should note that my trip was made more colorful by having to traverse a full-blown demonstration with mounted police, tear gas, water cannons, and smashed car windows as I (ignoring advice from the concierge) walked to the wonderful Van Gogh Museum. Dutch high school students (mistranslated as "schoolers") were on strike pending new legislation that increases the mandatory hours of education per year to 1,040.

But I digress. Here's a copy of my presentation slides from the event.