Subscribe: Dare Obasanjo aka Carnage4Life - XML
http://www.25hoursaday.com/weblog/SyndicationService.asmx/GetRssCategory?categoryName=XML
Language: English
Tags:
data  document  file  format  formats  microsoft  net  new  odf  office  open  problem  schema  support  web  xml  xslt 


Preview: Dare Obasanjo aka Carnage4Life - XML

Dare Obasanjo's weblog - XML

Adam Bosworth's "Learning From the Web"

"You can buy cars but you can't buy respect in the hood" - Curtis Jackson Last week Andrew Conrad told me to check out a recent article by Adam Bosworth in the ACM Queue because he wondered what I thought about it. I was rather embarrassed to note that alt



Published: Wed, 09 Nov 2005 21:12:50 GMT

Last Build Date: Mon, 12 Jan 2009 14:10:08 GMT

Copyright: Dare Obasanjo
 



Can RDF really save us from data format proliferation?

Mon, 12 Jan 2009 14:10:08 GMT

Bill de hÓra has a blog post entitled Format Debt: what you can't say where he writes

The closest thing to a deployable web technology that might improve describing these kind of data mashups without parsing at any cost or patching is RDF. Once RDF is parsed it becomes a well defined graph structure - albeit not a structure most web programmers will be used to, it is however the same structure regardless of the source syntax or the code and the graph structure is closed under all allowed operations. If we take the example of MediaRSS, which is not consistently used or placed in syndication and API formats, that class of problem more or less evaporates via RDF. Likewise if we take the current Zoo of contact formats and our seeming inability to commit to one, RDF/OWL can enable a declarative mapping between them. Mapping can reduce the number of man years it takes to define a "standard" format by not having to bother unifying "standards" or getting away with a few thousand less test cases.

I've always found this particular argument by RDF proponents to be suspect. When I complained about the lack of standards for representing rich media in Atom feeds, the thrust of the complaint was that you can't just plug a feed from Picasa into a service that understands how to process feeds from Zooomr without making changes to the service or the input feed. RDF proponents often argue that if we all used RDF based formats then instead of having to change your code to support every new photo site's Atom feed with custom extensions, you could instead create a mapping from the format you don't understand to the one you do using something like the OWL Web Ontology Language.

The problem with this argument is that there is already a declarative approach to mapping between XML data formats without having to boil the ocean by convincing everyone to switch to RDF: XSL Transformations (XSLT). The key problem is that in both cases (i.e. mapping with OWL vs. mapping with XSLT) there is still the problem that Picasa feeds won't work with an app that understands Zooomr's feeds until some developer writes code. Thus we're really debating whether it is cheaper to have the developer write declarative mappings like OWL or XSLT instead of writing new parsing code in their language of choice.

In my experience, creating a software system where you can drop in an XSLT, OWL or other declarative mapping document to deal with new data formats is cheaper and likely to be less error prone than having to alter parsing code written in C#, Python, Ruby or whatever. However we don't need RDF or other Semantic Web technologies to build such a solution today. XSLT works just fine as a tool for solving exactly that problem.

Now Playing: Lady GaGa & Colby O'Donis - Just Dance [...]
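For what it's worth, here is a minimal sketch (Python with lxml, using an invented photo extension element rather than the real Picasa or Zooomr markup) of what "drop in a declarative mapping" looks like with plain old XSLT: copy the feed through unchanged and rewrite the one extension element the application doesn't understand into the one it does.

    from lxml import etree

    # The element names below (photo:thumbnail -> media:thumbnail) are invented;
    # they stand in for whatever custom extension a photo site's feed uses.
    MAPPING_XSLT = b"""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:photo="urn:example:photo-ext"
        xmlns:media="http://search.yahoo.com/mrss/">
      <!-- Identity transform: copy everything through unchanged... -->
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <!-- ...except the extension we don't understand, which is rewritten into
           the Media RSS element the application already handles. -->
      <xsl:template match="photo:thumbnail">
        <media:thumbnail url="{@href}"/>
      </xsl:template>
    </xsl:stylesheet>
    """

    transform = etree.XSLT(etree.fromstring(MAPPING_XSLT))

    def normalize_feed(feed_xml: bytes) -> bytes:
        """Map a feed that uses extensions we don't know into the vocabulary we do."""
        return etree.tostring(transform(etree.fromstring(feed_xml)))

Supporting yet another photo site's dialect then means deploying a new stylesheet, not recompiling the service.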



Scalability: I Don't Think That Word Means What You Think It Does

Mon, 14 Jul 2008 11:40:12 GMT

Via Mark Pilgrim I stumbled on an article by Scott Loganbill entitled Google’s Open Source Protocol Buffers Offer Scalability, Speed which contains the following excerpt

The best way to explore Protocol Buffers is to compare it to its alternative. What do Protocol Buffers have that XML doesn’t? As the Google Protocol Buffer blog post mentions, XML isn’t scalable: "As nice as XML is, it isn’t going to be efficient enough for [Google’s] scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition. Not to mention, writing code to work with the DOM tree can sometimes become unwieldy." We’ve never had to deal with XML in a scale where programming for it would become unwieldy, but we’ll take Google’s word for it. Perhaps the biggest value-add of Protocol Buffers to the development community is as a method of dealing with scalability before it is necessary. The biggest developing drain of any start-up is success. How do you prepare for the onslaught of visitors companies such as Google or Twitter have experienced? Scaling for numbers takes critical development time, usually at a juncture where you should be introducing much-needed features to stay ahead of competition rather than paralyzing feature development to keep your servers running. Over time, Google has tackled the problem of communication between platforms with Protocol Buffers and data storage with Big Table. Protocol Buffers is the first open release of the technology making Google tick, although you can utilize Big Table with App Engine.

It is unfortunate that it is now commonplace for people to throw around terms like "scaling" and "scalability" in technical discussions without actually explaining what they mean. Having a Web application that scales means that your application can handle becoming popular or being more popular than it is today in a cost effective manner. Depending on your class of Web application, there are different technologies that have been proven to help Web sites handle significantly higher traffic than they normally would. However there is no silver bullet. The fact that Google uses MapReduce and BigTable to solve problems in a particular problem space does not mean those technologies work well in others. MapReduce isn't terribly useful if you are building an instant messaging service. Similarly, if you are building an email service you want an infrastructure based on message queuing not BigTable. A binary wire format like Protocol Buffers is a smart idea if your application's bottleneck is network bandwidth or the CPU used when serializing/deserializing XML.

As part of building their search engine Google has to cache a significant chunk of the World Wide Web and then perform data intensive operations on that data. In Google's scenarios, the network bandwidth utilized when transferring the massive amounts of data they process can actually be the bottleneck. Hence inventing a technology like Protocol Buffers became a necessity. However, that isn't Twitter's problem so a technology like Protocol Buffers isn't going to "help them scale". Twitter's problems have been clearly spelled out by the development team and nowhere is network bandwidth called out as a culprit.

Almost every technology that has been loudly proclaimed as unscalable by some pundit on the Web is being used by a massively popular service in some context. Relational databases don't scale? Well, eBay seems to be doing OK. PHP doesn't scale? I believe it scales well enough for Facebook.
Microsoft technologies aren't scalable? MySpace begs to differ. And so on… If someone tells you "technology X doesn't scale" without qualifying that statement, it often means the person either doesn't know what he is talking about or is trying to sell you something. Technologies don't scale, services do. Thinking you can just sprinkle a technology on your service and make it scale is the kind of thinking that led Bl[...]
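To make the bandwidth and serialization-cost point concrete, here is a rough back-of-the-envelope sketch in Python. It is not Protocol Buffers itself (which needs classes generated from a .proto file by protoc); it just serializes the same made-up record as XML and as a fixed binary layout to show where the bytes go.

    import struct
    import xml.etree.ElementTree as ET

    # A made-up status record: the same three fields serialized two ways.
    user_id, followers, active = 123456789, 42, True

    root = ET.Element("user")
    ET.SubElement(root, "id").text = str(user_id)
    ET.SubElement(root, "followers").text = str(followers)
    ET.SubElement(root, "active").text = "true" if active else "false"
    xml_bytes = ET.tostring(root)   # on the order of 75 bytes of tags and text

    # Not Protocol Buffers, just a fixed little-endian layout, but it makes the
    # point: 13 bytes total and no angle brackets to parse on either end.
    binary_bytes = struct.pack("<qi?", user_id, followers, active)

    print(len(xml_bytes), "bytes as XML vs", len(binary_bytes), "bytes packed")

Whether that difference matters depends entirely on whether serialization is actually your bottleneck, which is the point of the post above.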



In Defense of XML

Wed, 02 Jul 2008 12:56:10 GMT

Jeff Atwood recently published two anti-XML rants in his blog entitled XML: The Angle Bracket Tax and Revisiting the XML Angle Bracket Tax. The source of his beef with XML and his recommendations to developers are excerpted below

Everywhere I look, programmers and programming tools seem to have standardized on XML. Configuration files, build scripts, local data storage, code comments, project files, you name it -- if it's stored in a text file and needs to be retrieved and parsed, it's probably XML. I realize that we have to use something to represent reasonably human readable data stored in a text file, but XML sometimes feels an awful lot like using an enormous sledgehammer to drive common household nails. I'm deeply ambivalent about XML. I'm reminded of this Winston Churchill quote: It has been said that democracy is the worst form of government except all the others that have been tried. XML is like democracy. Sometimes it even works. On the other hand, it also means we end up with stuff like this: … You could do worse than XML. It's a reasonable choice, and if you're going to use XML, then at least learn to use it correctly. But consider: Should XML be the default choice? Is XML the simplest possible thing that can work for your intended use? Do you know what the XML alternatives are? Wouldn't it be nice to have easily readable, understandable data and configuration files, without all those sharp, pointy angle brackets jabbing you directly in your ever-lovin' eyeballs? I don't necessarily think XML sucks, but the mindless, blanket application of XML as a dessert topping and a floor wax certainly does. Like all tools, it's a question of how you use it. Please think twice before subjecting yourself, your fellow programmers, and your users to the XML angle bracket tax. Again.

The question of if and when to use XML is one I am intimately familiar with given that I spent the first 2.5 years of my professional career at Microsoft working on the XML team as the “face of XML” on MSDN. My problem with Jeff’s articles is that they take a very narrow view of how to evaluate a technology. No one should argue that XML is the simplest or most efficient technology to satisfy the uses it has been put to today. It isn’t. The value of XML isn’t in its simplicity or its efficiency. It is in the fact that there is a massive ecosystem of knowledge and tools around working with XML.

If I decide to use XML for my data format, I can be sure that my data will be consumable using a variety of off-the-shelf tools on practically every platform in use today. In addition, there are a variety of tools for authoring XML, transforming it to HTML or text, parsing it, converting it to objects, mapping it to database schemas, validating it against a schema, and so on. Want to convert my XML config file into a pretty HTML page? I can use XSLT or CSS. Want to validate my XML against a schema? I have my choice of Schematron, Relax NG and XSD. Want to find stuff in my XML document? XPath and XQuery to the rescue. And so on. No other data format hits a similar sweet spot when it comes to ease of use, popularity and breadth of tool ecosystem.

So the question you really want to ask yourself before taking on the “Angle Bracket Tax”, as Jeff Atwood puts it, is whether the benefits of avoiding XML outweigh the costs of giving up the tool ecosystem of XML and the familiarity that practically every developer out there has with the technology. In some cases this might be true such as when deciding whether to go with JSON over XM[...]
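As a taste of that ecosystem, here is a small sketch using Python's lxml; the config vocabulary and schema are invented for illustration, but the point is that validation and querying are off-the-shelf operations rather than custom parsing code.

    from lxml import etree

    # Invented config vocabulary; the W3C XML Schema is inline only to keep the
    # sketch self-contained.
    schema = etree.XMLSchema(etree.fromstring(b"""
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="config">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="setting" maxOccurs="unbounded">
              <xs:complexType>
                <xs:simpleContent>
                  <xs:extension base="xs:string">
                    <xs:attribute name="name" type="xs:string" use="required"/>
                  </xs:extension>
                </xs:simpleContent>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
    """))

    doc = etree.fromstring(b"""
    <config>
      <setting name="timeout">30</setting>
      <setting name="retries">5</setting>
    </config>
    """)

    schema.assertValid(doc)                               # raises if the contract is broken
    print(doc.xpath("string(setting[@name='timeout'])"))  # XPath query -> 30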



Note to Software Vendors, the World is Collaborative and Loosely Coupled

Tue, 17 Jul 2007 16:10:09 GMT

Disclaimer: This may sound like a rant but it isn't meant to be. In the wise words of Raymond Chen this is meant to highlight problems that are harming the productivity of developers and knowledge workers in today's world. No companies or programs will be named because the intent is not to mock or ridicule.

This morning I had to rush into work early instead of going to the gym because of two limitations in the software around us.

Problem #1: Collaborative Document Editing

So a bunch of us are working on a document that is due today. Yesterday I wanted to edit the document but found out I could not because the software claimed someone else was currently editing the document. So I opened it in read-only mode, copied out some data, edited it and then sent my changes in an email to the person who was in charge of the document. As if that wasn’t bad enough… This morning, as I'm driving to the gym for my morning work out, I glance at my phone to see that I've received mail from several co-workers because I've "locked" the document and no one can make their changes. When I get to work, I find out that I didn’t close the document within the application and this was the reason none of my co-workers could edit it. Wow. The notion that only one person at a time can edit a document, or that a document being viewed cannot be edited, seems archaic in today’s globally networked world. Why is software taking so long to catch up?

Problem #2: Loosely Coupled XML Web Services

While I was driving to the office I noticed another email from one of the services that integrates with ours via a SOAP-based XML Web Service. As part of the design to handle a new scenario we added a new type that was going to be returned by one of our methods (e.g. imagine that there was a GetFruit() method which used to return apples and oranges and which now returns apples, oranges and bananas). This change was crashing the applications that were invoking our service because they weren’t expecting us to return bananas. However, the insidious thing is that the failure wasn’t because their application was improperly coded to fail if it saw a fruit it didn’t know, it was because the platform they built on was statically typed. Specifically, the Web Services platform automatically converted the XML to objects by looking at our WSDL file (i.e. the interface definition language which stated up front which types are returned by our service). So this meant that any time new types were added to our service, our WSDL file would be updated and any application invoking our service which was built on a Web services platform that performed such XML<->object mapping and was statically typed would need to be recompiled. Yes, recompiled.

Now, consider how many potentially different applications could be accessing our service. What are our choices? Come up with GetFruitEx() or GetFruit2() methods so we don’t break old clients? Go over our web server logs and try to track down every application that has accessed our service? Never introduce new types?

It’s sad that as an industry we built a technology on an eXtensible Markup Language (XML) and our first instinct was to make it as inflexible as technology that is two decades old and was never meant to scale to a global network like the World Wide Web. Software should solve problems, not create new ones which require more technology to fix.

Now playing: Young Jeezy - Bang (feat. T.I. & Lil Scrappy) [...]
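A loosely coupled client in the hypothetical GetFruit() scenario above might look something like the following sketch (Python, element names invented): parse the XML dynamically and apply a must-ignore rule to types it doesn't recognize, instead of letting statically generated types blow up when bananas show up.

    import xml.etree.ElementTree as ET

    KNOWN_FRUITS = {"apple", "orange"}   # what this version of the client understands

    # Hypothetical GetFruit() response; <banana> only appears in a later version
    # of the service.
    response = b"""
    <fruits>
      <apple weight="150"/>
      <orange weight="130"/>
      <banana weight="120"/>
    </fruits>
    """

    for fruit in ET.fromstring(response):
        if fruit.tag in KNOWN_FRUITS:
            print("handling", fruit.tag, "of weight", fruit.get("weight"))
        else:
            # Must-ignore, not must-crash: unknown types are skipped, so the
            # service can evolve without every caller being recompiled.
            print("ignoring unknown element:", fruit.tag)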



Microsoft's Astoria and Jasper data access projects

Tue, 01 May 2007 23:28:20 GMT

Andy Conrad, who I used to work with back on the XML team, has two blog posts about Project Astoria and Project Jasper from Microsoft's Data Programmability team. Both projects are listed as data access incubation projects on MSDN. Below are the descriptions of the projects

Project Codename “Astoria”
The goal of Microsoft Codename Astoria is to enable applications to expose data as a
data service that can be consumed by web clients within a corporate network and across the internet. The data service is reachable over regular HTTP requests, and standard HTTP verbs such as GET, POST, PUT and DELETE are used to perform operations against the service. The payload format for the service is controllable by the application, but all options are simple, open formats such as plain XML and JSON. Web-friendly technologies make Astoria an ideal data back-end for AJAX-style applications, and other applications that need to operate against data that is across the web.

To learn more about Project Astoria or download the CTP, visit the Project Astoria website at http://astoria.mslivelabs.com.

Project Codename “Jasper”
Project Jasper is geared towards iterative and agile development. You can start interacting with the data in your database without having to create mapping files or define classes. You can build user interfaces by naming controls according to your model without worrying about binding code. Project Jasper is also extensible, allowing you to provide your own business logic and class model. Since Project Jasper is built on top of the ADO.NET Entity Framework, it supports rich queries and complex mapping.

To learn more about Project Jasper visit the ADO.NET Blog at http://blogs.msdn.com/adonet

I was called in a few weeks ago by an architect on the Data Programmability team to give some advice about Project Astoria. The project is basically a way to create RESTful endpoints on top of a SQL Server database then retrieve the relational data as plain XML, JSON or a subset of RDF+XML using HTTP requests. The reason I was called in was to give some of my thoughts on exposing relational data as RSS/Atom feeds. My feedback was that attempting to map arbitrary relational data to RSS/Atom feeds seemed unnatural and was bordering on abuse of an XML syndication format. Although this feature was not included in the Project Astoria CTP, it seems that mapping relational data to RSS/Atom feeds is still something the team thinks is interesting based on the Project Astoria FAQ. You can find out more in the Project Astoria overview documentation.  
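To give a feel for what "reachable over regular HTTP requests" means in practice, here is a hedged sketch of a client in Python; the URL, query syntax and response shape below are placeholders for illustration, not the actual Astoria CTP addressing scheme.

    import json
    import urllib.request

    # Hypothetical endpoint and query string; swap the Accept header to ask for
    # plain XML instead of JSON.
    request = urllib.request.Request(
        "http://example.com/dataservice/Customers?city=Seattle",
        headers={"Accept": "application/json"},
    )

    with urllib.request.urlopen(request) as response:
        customers = json.load(response)

    for customer in customers:
        print(customer.get("Name"))

The appeal is exactly that there is nothing proprietary on the wire: any client that can issue an HTTP GET and parse XML or JSON can consume the data service.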

REST is totally sweeping Microsoft.




Miguel de Icaza on OOXML vs. ODF

Thu, 01 Feb 2007 01:19:28 GMT

Miguel de Icaza of Gnumeric, GNOME and Ximian fame has weighed in with his thoughts on the FUD war that is ODF vs. OOXML. In his blog post entitled The EU Prosecutors are Wrong Miguel writes

Open standards and the need for public access to information was a strong message. This became a key component of promoting open office, and open source software. This posed two problems: First, those promoting open standards did not stress the importance of having a fully open source implementation of an office suite. Second, it assumed that Microsoft would stand still and would not react to this new change in the market. And that is where the strategy to promote the open source office suite is running into problems. Microsoft did not stand still. It reacted to this new requirement by creating a file format of its own, the OOXML.
...
The Size of OOXML
A common objection to OOXML is that the specification is "too big", that 6,000 pages is a bit too much for a specification and that this would prevent third parties from implementing support for the standard. Considering that for years we, the open source community, have been trying to extract as much information about protocols and file formats from Microsoft, this is actually a good thing. For example, many years ago, when I was working on Gnumeric, one of the issues that we ran into was that the actual descriptions for functions and formulas in Excel was not entirely accurate from the public books you could buy. OOXML devotes 324 pages of the standard to document the formulas and functions. The original submission to the ECMA TC45 working group did not have any of this information. Jody Goldberg and Michael Meeks that represented Novell at the TC45 requested the information and it eventually made it into the standards. I consider this a win, and I consider those 324 extra pages a win for everyone (almost half the size of the ODF standard). Depending on how you count, ODF has 4 to 10 pages devoted to it. There is no way you could build a spreadsheet software based on this specification.
...
I have obviously not read the entire specification, and am biased towards what I have seen in the spreadsheet angle. But considering that it is impossible to implement a spreadsheet program based on ODF, am convinced that the analysis done by those opposing OOXML is incredibly shallow, the burden is on them to prove that ODF is "enough" to implement from scratch alternative applications.
...
The real challenge today that open source faces in the office space is that some administrations might choose to move from the binary office formats to the OOXML formats and that "open standards" will not play a role in promoting OpenOffice.org nor open source. What is worse is that even if people manage to stop OOXML from becoming an ISO standard it will be an ephemeral victory. We need to recognize that this is the problem. Instead of trying to bury OOXML, which amounts to covering the sun with your finger.

I think there is an interesting bit of insight in Miguel's post which I highlighted in red font. IBM and the rest of the ODF proponents lobbied governments against Microsoft's products by arguing that its file formats were not open. However they did not expect that Microsoft would turn around and make those very file formats open and instead compete on innovation in the user experience.

ODF proponents like Rob Weir, who've been trumpeting the value of open standards, now find themselves in the absurd position of arguing that it is a bad thing for Microsoft to open up its file formats and provide exhaustive documentation for them. Instead they demand that Microsoft should either abandon backwards compatibility with the billions of documents produced by Microsoft Office in the past decade or that it should embrace and extend ODF to meet its needs. Neither of which sounds like a good thing for customers. [...]



What is Rob Weir (and IBM's) Agenda with the OOXML Bashing?

Tue, 23 Jan 2007 18:34:22 GMT

In response to my recent post entitled ODF vs. OOXML on Wikipedia one of my readers pointed out

Well, many of Weir's points are not about OOXML being a "second", and therefore unnecessary, standard. Many of them, I think, are about how crappy the standard actually is.

Since I don't regularly read Rob Weir's blog this was interesting to me. I wondered why someone who identifies himself as working for IBM on various ODF technical topics would be spending a lot of his time attacking a related standard as opposed to talking about the technology he works on. I assumed my reader was mistaken and decided to subscribe to his feed and see how many of his recent posts were about OOXML. Below is a screenshot of what his feed looks like when I subscribed to it in RSS Bandit a few minutes ago

Of his 24 most recent posts, 16 of them are explicitly about OOXML while 7 of them are about ODF. Interesting. I wonder why a senior technical guy at IBM is spending more time attacking a technology whose proponents have claimed is not competitive with it instead of talking about the technology he works on? Reading the blogs of Microsoft folks like Raymond Chen, Jensen Harris or Brian Jones you don't see them dedicating two thirds of their blog postings to bashing rival products or technologies.

From my perspective as an outsider in this debate it seems to me that OOXML is an overspecified description of an open XML document format that is backwards compatible with the billions of documents produced in Microsoft Office formats over the past decade. On the other hand, ODF is an open XML document format that aims to be a generic format for storing business documents that isn't tied to any one product, but that still needs some work in beefing up the specification in certain areas if interoperability is key. In an ideal world both of these efforts would be trying to learn from each other. However it seems that for whatever reasons IBM has decided that it would rather Microsoft failed at its attempt to open up the XML formats behind the most popular office productivity software in the world. How this is a good thing for Microsoft's customers or IBM's is lost on me.

Having a family member who is in politics, I've learned that whenever you see what seems like religious fundamentalism there usually is a quest for money and/or power behind it. Reading articles such as Reader Beware as ODF News Coverage Increases it seems clear that IBM has a lot of money riding on being first to market with ODF-enabled products while simultaneously encouraging governments to only mandate ODF. The fly in the ointment is that the requirement of most governments is that the document format is open, not that it is ODF. Which explains IBM's unfortunate FUD campaign.

Usually, I wouldn't care about something like this since this is Big Business and Politics 101, but there was something that Rick Jelliffe wrote in his post An interesting offer: get paid to contribute to Wikipedia which is excerpted below

So I think there are distinguishing features for OOXML, and one of the more political issues is do we want to encourage and reward MS for taking the step of opening up their file formats, at last?

The last thing I'd personally want is for this experience to sour Microsoft on opening up its technologies so I thought I'd throw my hat in the ring at least this once.

PS: It's pretty impressive that a Google search for "ooxml" pulls up a bunch of negative blog posts and the wikipedia article as the first couple of hits.
It seems the folks on the Microsoft Office team need to do some SEO to fix that pronto. [...]



ODF vs. OOXML on Wikipedia

Mon, 22 Jan 2007 21:44:46 GMT

This morning I stumbled upon an interestingly titled post by Rick Jelliffe which piqued my interest, entitled An interesting offer: get paid to contribute to Wikipedia, where he writes

I’m not a Microsoft hater at all, its just that I’ve swum in a different stream. Readers of this blog will know that I have differing views on standards to some Microsoft people at least.
...
So I was a little surprised to receive email a couple of days ago from Microsoft saying they wanted to contract someone independent but friendly (me) for a couple of days to provide more balance on Wikipedia concerning ODF/OOXML. I am hardly the poster boy of Microsoft partisanship! Apparently they are frustrated at the amount of spin from some ODF stakeholders on Wikipedia and blogs. I think I’ll accept it: FUD enrages me and MS certainly are not hiring me to add any pro-MS FUD, just to correct any errors I see.
...
Just scanning quickly the Wikipedia entry I see one example straight away: The OOXML specification requires conforming implementations to accept and understand various legacy office applications. But the conformance section to the ISO standard (which is only about page four) specifies conformance in terms of being able to accept the grammar, use the standard semantics for the bits you implement, and document where you do something different. The bits you don’t implement are no-one’s business. So that entry is simply wrong. The same myth comes up in the form “You have to implement all 6000 pages or Microsoft will sue you.” Are we idiots? Now I certainly think there are some good issues to consider with ODF versus OOXML, and it is good that they come out and get discussed. For example, the proposition that “ODF and OOXML are both office document formats: why should there be two standards?” is one that should be discussed. As I have mentioned before on this blog, I think OOXML has attributes that distinguish it: ODF has simply not been designed with the goal of being able to represent all the information possible in an MS Office document; this makes it poorer for archiving but paradoxically may make it better for level-playing-field, inter-organization document interchange. But the archiving community deserves support just as much as the document distribution community. And XHTML is better than both for simple documents. And PDF still has a role. And specific markup trumps all of them, where it is possible. So I think there are distinguishing features for OOXML, and one of the more political issues is do we want to encourage and reward MS for taking the step of opening up their file formats, at last?

I'm glad to hear that Rick Jelliffe is considering taking this contract. Protecting your brand on Wikipedia, especially against well-funded or organized detractors, is unfortunately a full time job and one that really should be performed by an impartial party, not a biased one. It's great to see that Microsoft isn't only savvy enough to realize that keeping an eye on Wikipedia entries about itself is important but also is seeking objective 3rd parties to do the policing.

It looks to me like online discussion around XML formats for business documents has significantly deteriorated. When I read posts like Rob Weir's A Foolish Inconsistency and The Vast Blue-Wing Conspiracy or Brian Jones's Passing the OpenXML standard over to ISO it seems clear that rational technical discussion is out the window and the parties involved are in full mud slinging mode. It reminds me of watching TV during U.S. election years.
I'm probably a biased party but I think the "why should we have two XML formats for business documents" line that is being thrown around by IBM is crap. The entire reason for XML's existence is so that we can build different formats that satisfy different needs. After all, no one asks them why the ODF folks had to inv[...]



Updated: XML Has Too Many Architecture Astronauts

Wed, 03 Jan 2007 23:25:05 GMT

Joel Spolsky has a seminal article entitled Don't Let Architecture Astronauts Scare You where he wrote

A recent example illustrates this. Your typical architecture astronaut will take a fact like "Napster is a peer-to-peer service for downloading music" and ignore everything but the architecture, thinking it's interesting because it's peer to peer, completely missing the point that it's interesting because you can type the name of a song and listen to it right away. All they'll talk about is peer-to-peer this, that, and the other thing. Suddenly you have peer-to-peer conferences, peer-to-peer venture capital funds, and even peer-to-peer backlash with the imbecile business journalists dripping with glee as they copy each other's stories: "Peer To Peer: Dead!" The Architecture Astronauts will say things like: "Can you imagine a program like Napster where you can download anything, not just songs?" Then they'll build applications like Groove that they think are more general than Napster, but which seem to have neglected that wee little feature that lets you type the name of a song and then listen to it -- the feature we wanted in the first place. Talk about missing the point. If Napster wasn't peer-to-peer but it did let you type the name of a song and then listen to it, it would have been just as popular.

This article is relevant because I recently wrote a series of posts explaining why Web developers have begun to favor JSON over XML in Web Services. My motivation for writing this series was conversations I'd had with former co-workers who seemed intent on "abstracting" the discussion and comparing whether JSON was a better data format than XML in all the cases that XML is used today, instead of understanding the context in which JSON has become popular. In the past two weeks, I've seen three different posts from various XML heavy hitters committing this very sin.

JSON and XML by Tim Bray - This kicked it off and starts off by firing some easily refutable allegations about the extensibility and Unicode capabilities of JSON as a general data transfer format.

Tim Bray on JSON and XML by Don Box - Refutes the allegations by Tim Bray above but still misses the point.

All markup ends up looking like XML by David Megginson - argues that XML is just like JSON except with the former we use angle brackets and in the latter we use curly braces + square brackets. Thus they are "Turing" equivalent. Academically interesting but not terribly useful information if you are a Web developer trying to get things done.

This is my plea to you, if you are an XML guru and you aren't sure why JSON seems to have come out of nowhere to threaten your precious XML, go read JSON vs. XML: Browser Security Model and JSON vs. XML: Browser Programming Models then let's have the discussion. If you're too busy to read them, here's the executive summary. JSON is a better fit for Web services that power Web mashups and AJAX widgets because it gets around the cross domain limitations put in place by browsers that hamper XMLHttpRequest, and because it is essentially serialized Javascript objects it fits better with client side scripting, which is primarily done in Javascript. That's it. XML will never fit the bill as well for these scenarios without changes to the existing browser ecosystem which I doubt are forthcoming anytime soon.

Update: See comments by David Megginson and Steve Marx below. [...]



JSON vs. XML: Browser Programming Models

Tue, 02 Jan 2007 19:55:11 GMT

Over the holidays I had a chance to talk to some of my old compadres from the XML team at Microsoft and we got to talking about JSON as an alternative to XML. I concluded that there are a small number of key reasons that JSON is now more attractive than XML for the kinds of data interchange that power Web-based mashups and Web gadgets/widgets. This is the second in a series of posts on what these key reasons are.

In my previous post, I mentioned that getting around limitations in cross domain requests imposed by modern browsers has been a key reason for the increased adoption of JSON. However this is only part of the story. Early on in the adoption of AJAX techniques across various Windows Live services I noticed that even for building pages with no cross domain requirements, our Web developers favored JSON over XML. One response that kept coming up is the easier programming model when processing JSON responses on the client than with XML. I'll illustrate this difference in ease of use via JScript code that shows how to process a sample document, in both XML and JSON formats, taken from the JSON website. Below is the code sample

    var json_menu = '{"menu": {' + '\n' +
        '"id": "file",' + '\n' +
        '"value": "File",' + '\n' +
        '"popup": {' + '\n' +
        '"menuitem": [' + '\n' +
        '{"value": "New", "onclick": "CreateNewDoc()"},' + '\n' +
        '{"value": "Open", "onclick": "OpenDoc()"},' + '\n' +
        '{"value": "Close", "onclick": "CloseDoc()"}' + '\n' +
        ']' + '\n' +
        '}' + '\n' +
        '}}';

    var xml_menu = '<menu id="file" value="File">' + '\n' +
        '<popup>' + '\n' +
        '<menuitem value="New" onclick="CreateNewDoc()" />' + '\n' +
        '<menuitem value="Open" onclick="OpenDoc()" />' + '\n' +
        '<menuitem value="Close" onclick="CloseDoc()" />' + '\n' +
        '</popup>' + '\n' +
        '</menu>';

    WhatHappensWhenYouClick_Xml(xml_menu);
    WhatHappensWhenYouClick_Json(json_menu);

    function WhatHappensWhenYouClick_Json(data){
      var j = eval("(" + data + ")");
      WScript.Echo("When you click the " + j.menu.value + " menu, you get the following options");
      for(var i = 0; i < j.menu.popup.menuitem.length; i++){
        WScript.Echo((i + 1) + "." + j.menu.popup.menuitem[i].value
          + " aka " + j.menu.popup.menuitem[i].onclick);
      }
    }

    function WhatHappensWhenYouClick_Xml(data){
      var x = new ActiveXObject("Microsoft.XMLDOM");
      x.loadXML(data);
      WScript.Echo("When you click the " + x.documentElement.getAttribute("value")
        + " menu, you get the following options");
      var nodes = x.documentElement.selectNodes("//menuitem");
      for(var i = 0; i < nodes.length; i++){
        WScript.Echo((i + 1) + "." + nodes[i].getAttribute("value") + " aka " + nodes[i].getAttribute("onclick"));
      }
    }

When comparing both sample functions, it seems clear that the XML version takes more code and requires a layer of mental indirection as the developer has to be knowledgeable about XML APIs and their idiosyncrasies. We should dig a little deeper into this.

A couple of people have already replied to my previous post to point out that any good Web application should process JSON responses to ensure they are not malicious. This means my usage of eval() in the code sample should be replaced with a JSON parser that only accepts 'safe' JSON responses. Given that there are JSON parsers available that come in under 2KB, that particular security issue is not a deal breaker. On the XML front, there is no off-the-shelf manner to get a programming model as straightforward and as flexible as that obtained from parsing JSON directly into objects using eval(). One light on the horizon is that E4[...]



JSON vs. XML: Browser Security Model

Tue, 02 Jan 2007 17:46:10 GMT

Over the holidays I had a chance to talk to some of my old compadres from the XML team at Microsoft and we got to talking about JSON as an alternative to XML. I concluded that there are a small number of key reasons that JSON is now more attractive than XML for the kinds of data interchange that power Web-based mashups and Web gadgets/widgets. This is the first in a series of posts on what these key reasons are.

The first "problem" that choosing JSON over XML as the output format for a Web service solves is that it works around security features built into modern browsers that prevent web pages from initiating certain classes of communication with web servers on domains other than the one hosting the page. This "problem" is accurately described in the XML.com article Fixing AJAX: XMLHttpRequest Considered Harmful which is excerpted below

But the kind of AJAX examples that you don't see very often (are there any?) are ones that access third-party web services, such as those from Amazon, Yahoo, Google, and eBay. That's because all the newest web browsers impose a significant security restriction on the use of XMLHttpRequest. That restriction is that you aren't allowed to make XMLHttpRequests to any server except the server where your web page came from. So, if your AJAX application is in the page http://www.yourserver.com/junk.html, then any XMLHttpRequest that comes from that page can only make a request to a web service using the domain www.yourserver.com. Too bad -- your application is on www.yourserver.com, but their web service is on webservices.amazon.com (for Amazon). The XMLHttpRequest will either fail or pop up warnings, depending on the browser you're using. On Microsoft's IE 5 and 6, such requests are possible provided your browser security settings are low enough (though most users will still see a security warning that they have to accept before the request will proceed). On Firefox, Netscape, Safari, and the latest versions of Opera, the requests are denied. On Firefox, Netscape, and other Mozilla browsers, you can get your XMLHttpRequest to work by digitally signing your script, but the digital signature isn't compatible with IE, Safari, or other web browsers.

This restriction is a significant annoyance for Web developers because it eliminates a number of compelling end user applications due to the limitations it imposes on developers. However, there are a number of common workarounds which are also listed in the article

Solutions Worthy of Paranoia
There is hope, or rather, there are gruesome hacks, that can bring the splendor of seamless cross-browser XMLHttpRequests to your developer palette. The three methods currently in vogue are:
Application proxies. Write an application in your favorite programming language that sits on your server, responds to XMLHttpRequests from users, makes the web service call, and sends the data back to users.
Apache proxy. Adjust your Apache web server configuration so that XMLHttpRequests can be invisibly re-routed from your server to the target web service domain.
Script tag hack with application proxy (doesn't use XMLHttpRequest at all). Use the HTML script tag to make a request to an application proxy (see #1 above) that returns your data wrapped in JavaScript. This approach is also known as On-Demand JavaScript.

Although the first two approaches work, there are a number of problems with them.
The first is that it adds a requirement that the owner of the page also have Web master level access to a Web server and either tweak its configuration settings or be a savvy enough programmer to write an application to proxy requests between a user's browser and the third party web service. A second problem is that it significantly increases the cost and scalability impact of the page b[...]
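As an illustration of the first workaround ("application proxies"), here is a minimal sketch using Python's standard library; the upstream service URL is a placeholder, and a real proxy would need error handling, caching and some restriction on what it is willing to forward.

    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    UPSTREAM = "http://webservices.example.com/rest"   # placeholder third-party service

    class ProxyHandler(BaseHTTPRequestHandler):
        """Relay same-origin XMLHttpRequests to the third-party web service."""

        def do_GET(self):
            # self.path carries whatever path and query string the page's
            # XMLHttpRequest sent to this same-origin server.
            with urllib.request.urlopen(UPSTREAM + self.path) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get_content_type()
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), ProxyHandler).serve_forever()

This works, but it is exactly the kind of extra moving part the excerpt below complains about: every page author now also needs to run and scale a server-side component.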



Versioning Does Not Make Validation Irrelevant

Fri, 15 Dec 2006 02:57:19 GMT

Mark Baker has a blog post entitled Validation considered harmful where he writes

We believe that virtually all forms of validation, as commonly practiced, are harmful; an anathema to use at Web scale. Specifically, our argument is this;
Tests of validity which are a function of time make the independent evolution of software problematic.

Why? Consider the scenario of two parties on the Web which want to exchange a certain kind of document. Party A has an expensive support contract with BigDocCo that ensures that they’re always running the latest-and-greatest document processing software. But party B doesn’t, and so typically lags a few months behind. During one of those lags, a new version of the schema is released which relaxes an earlier stanza in the schema which constrained a certain field to the values "1", "2", or "3"; "4" is now a valid value. So, party B, with its new software, happily fires off a document to A as it often does, but this document includes the value "4" in that field. What happens? Of course A rejects it; it’s an invalid document, and an alert is raised with the human administrator, dramatically increasing the cost of document exchange. All because evolvability wasn’t baked in, because a schema was used in its default mode of operation; to restrict rather than permit.

This doesn't seem like a very good argument to me. The fact that you enforce that the XML documents you receive must follow a certain structure or must conform to certain constraints does not mean that your system cannot be flexible in the face of new versions. First of all, every system does some form of validation because it cannot process arbitrary documents. For example an RSS reader cannot do anything reasonable with an XBRL or ODF document, no matter how liberal it is in what it accepts. Now that we have accepted that there are certain levels of validation that are no-brainers, the next question is to ask what happens if there are no constraints on the values of elements and attributes in an input document. Let's say we have a purchase order format which in v1 has a currency element which can have a value of "U.S. dollars" or "Canadian dollars", and then in v2 we now support any valid currency. What happens if a v2 document is sent to a v1 client? Is it a good idea for such a client to muddle along even though it can't handle the specified currency format?
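One sketch of how a v1 client can stay strict about its contract without silently muddling along (Python, element names assumed for illustration): check the values it knows how to process and turn anything outside the v1 contract into an explicit, well-defined failure that can be routed to a human or a compensation path.

    import xml.etree.ElementTree as ET

    V1_CURRENCIES = {"U.S. dollars", "Canadian dollars"}   # the v1 contract

    def process_order(order_xml: bytes) -> str:
        order = ET.fromstring(order_xml)
        currency = order.findtext("currency")   # element name assumed for illustration
        if currency not in V1_CURRENCIES:
            # A v2 document reached a v1 client: fail explicitly (or queue the
            # order for manual handling) rather than pricing it in a currency we
            # don't know how to handle.
            raise ValueError("unsupported currency: %r" % currency)
        return currency

    print(process_order(b"<order><currency>U.S. dollars</currency></order>"))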

As in all things in software, there are no hard and fast rules as to what is right and what is wrong. In general, it is better to be flexible rather than not as the success of HTML and RSS have shown us but this does not mean that it is acceptable in every situation. And it comes with its own set of costs as the success of HTML and RSS have shown us. :)

Sam Ruby puts it more eloquently than I can in his blog post entitled Tolerance.




My Good Deed for the Day

Mon, 11 Dec 2006 14:03:17 GMT

Edd Dumbill has a blog post entitled Afraid of the POX? where he writes

The other day I was tinkering with that cute little poster child of Web 2.0, Flickr. Looking for a lightweight way to incorporate some photos into a web site, I headed to their feeds page to find some XML to use.
...
The result was interesting. Flickr have a variety of outputs in RSS dialects, but you just can't get at the raw data using XML. The bookmarking service del.icio.us is another case in point. My friend Matt Biddulph recently had to resort to screenscraping in order to write his tag stemmer, until some kind soul pointed out there's a JSON feed.

Both of these services support XML output, but only with the semantics crammed awkwardly into RSS or Atom. Neither have plain XML, but do support serialization via other formats. We don't really have "XML on the Web". We have RSS on the web, plus a bunch of mostly JSON and YAML for those who didn't care for pointy brackets.

Interesting set of conclusions but unfortunately based on faulty data. Flickr provides custom XML output from their Plain Old XML over HTTP APIs at http://www.flickr.com/services/api as does del.icio.us from its API at http://del.icio.us/help/api. If anything, this seems to indicate that old school XML heads like Edd have a different set of vocabulary from the Web developer crowd. It seems Edd did searches for "XML feeds" from these sites then came off irritated that the data was in RSS/Atom and not custom XML formats. However once you do a search for "API" with the appropriate service name, you find their POX/HTTP APIs which provide custom XML output.

The moral of this story is that "XML feeds" pretty much means RSS/Atom feeds these days and is not a generic term for XML being provided by a website.

PS: This should really be a comment on Edd's blog but it doesn't look like his blog supports comments.




Miguel De Icaza on Novell's OpenOffice "Fork"

Wed, 06 Dec 2006 02:37:26 GMT

If you are a regular reader of Slashdot you probably stumbled on a link to the Groklaw article Novell "Forking" OpenOffice.org by Pamela Jones. In the article, she berates Novell for daring to provide support for the Office Open XML formats in their version of OpenOffice. Miguel De Icaza, a Novell employee, has posted a response entitled OpenOffice Forks? where he writes

Facts barely matter when they get in the way of a good smear. The comments over at Groklaw are interesting, in that they explore new levels of ignorance. Let me explain. We have been working on OpenOffice.Org for longer than anyone else has. We were some of the earliest contributors to OpenOffice, and we are the largest external contributor to actual code to OpenOffice than anyone else.
...
Today we ship modified versions of OpenOffice to integrate GStreamer, 64-bit fixes, integrate with the GNOME and KDE file choosers, add SVG importing support, add OpenDMA support, add VBA support, integrate Mono, integrate fontconfig, fix bugs, improve performance and a myriad of others. The above url contains some of the patches that are pending, but like every other open source project, we have published all of those patches as part of the src.rpm files that we shipped, and those patches have eventually ended up in every distribution under the sun. But the problem of course is not improving OpenOffice, the problem is improving OpenOffice in ways that PJ disapproves of. Improving OpenOffice to support an XML format created by Microsoft is tantamount to treason. And of course, the code that we write to interop with Office XML is covered by the Microsoft Open Specification Promise (Update: this is a public patent agreement, this has nothing to do with the Microsoft/Novell agreement, and is available to anyone; If you still want to email me, read the previous link, and read it twice before hitting the send button). I would reply to each individual point from PJ, but she either has not grasped how open source is actually delivered to people or she is using this as a rallying cry to advance her own ideological position on ODF vs OfficeXML. Debating the technical merits of one of those might be interesting, but they are both standards that are here to stay, so from an adoption and support standpoint they are a no-brainer to me. The ideological argument on the other hand is a discussion as interesting as watching water boil. Am myself surprised at the spasms and epileptic seizures that folks are having over this.

I've been a fan of Miguel ever since I was a good lil' Slashbot in college. I've always admired his belief in "Free" [as in speech] Software and the impact it has on people's lives as well as the fact that he doesn't let geeky religious battles get in the way of shipping code. When Miguel saw good ideas in Microsoft's technologies, he incorporated the ideas into Bonobo and Mono as a way to improve the Linux software landscape instead of resorting to Not Invented Here syndrome. Unfortunately, we don't have enough of that in the software industry today. [...]



Should You Choose RELAX Now?

Tue, 28 Nov 2006 20:56:47 GMT

Tim Bray has a blog post entitled Choose RELAX Now where he writes

Elliotte Rusty Harold’s RELAX Wins may be a milestone in the life of XML. Everybody who actually touches the technology has known the truth for years, and it’s time to stop sweeping it under the rug. W3C XML Schemas (XSD) suck. They are hard to read, hard to write, hard to understand, have interoperability problems, and are unable to describe lots of things you want to do all the time in XML. Schemas based on Relax NG, also known as ISO Standard 19757, are easy to write, easy to read, are backed by a rigorous formalism for interoperability, and can describe immensely more different XML constructs. To Elliotte’s list of important XML applications that are RELAX-based, I’d add the Atom Syndication Format and, pretty soon now, the Atom Publishing Protocol. It’s a pity; when XSD came out people thought that since it came from the W3C, same as XML, it must be the way to go, and it got baked into a bunch of other technology before anyone really had a chance to think it over. So now lots of people say “Well, yeah, it sucks, but we’re stuck with it.” Wrong! The time has come to declare it a worthy but failed experiment, tear down the shaky towers with XSD in their foundation, and start using RELAX for all significant XML work.

In a past life I was the PM for XML schema technologies at Microsoft so I obviously have an opinion here. What Tim Bray and Elliotte Rusty Harold gloss over in their advocacy is that there are actually two reasons one would choose an XML schema technology. I covered both reasons in my article XML Schema Design Patterns: Is Complex Type Derivation Unnecessary? for XML.com a few years ago. The relevant part of the article is excerpted below

As usage of XML and XML schema languages has become more widespread, two primary usage scenarios have developed around XML document validation and XML schemas.

Describing and enforcing the contract between producers and consumers of XML documents: An XML schema ordinarily serves as a means for consumers and producers of XML to understand the structure of the document being consumed or produced. Schemas are a fairly terse and machine readable way to describe what constitutes a valid XML document according to a particular XML vocabulary. Thus a schema can be thought of as a contract between the producer and consumer of an XML document. Typically the consumer ensures that the XML document being received from the producer conforms to the contract by validating the received document against the schema. This description covers a wide array of XML usage scenarios from business entities exchanging XML documents to applications that utilize XML configuration files.

Creating the basis for processing and storing typed data represented as XML documents: As XML became popular as a way to represent rigidly structured, strongly typed data, such as the content of a relational database or programming language objects, the ability to describe the datatypes within an XML document became important. This led to Microsoft's XML Data and XML Data-Reduced schema languages, which ultimately led to WXS. These schema languages are used to convert an input XML infoset into a type annotated infoset (TAI) where element and attribute information items are annotated with a type name. WXS describes the creation of a type annotated infoset as a consequence of document validation against a schema.
During validation against a WXS, an input XML infoset is converted into a post schema validation infoset (PSVI), which among other things contains type annotations. However practical experience has shown that one does not need to perf[...]
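For comparison with the "schema as contract" scenario, here is the same kind of check expressed as a Relax NG grammar and enforced with lxml; the config vocabulary is invented (it mirrors the XSD sketch earlier in this feed), and the point is simply that the grammar reads close to the documents it describes.

    from lxml import etree

    # Invented config vocabulary, this time described with a Relax NG grammar.
    relaxng = etree.RelaxNG(etree.fromstring(b"""
    <element name="config" xmlns="http://relaxng.org/ns/structure/1.0">
      <oneOrMore>
        <element name="setting">
          <attribute name="name"/>
          <text/>
        </element>
      </oneOrMore>
    </element>
    """))

    doc = etree.fromstring(b'<config><setting name="timeout">30</setting></config>')
    print(relaxng.validate(doc))   # True -> the document satisfies the contract

What Relax NG deliberately does not give you is the second usage scenario above: it validates, but it does not hand back a type annotated infoset for data-binding or storage.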



On Microsoft Not Joining the OpenDocument Format (ODF) Committee

Tue, 18 Jul 2006 16:38:19 GMT

Brian Jones has a blog post entitled Politics behind standardization where he writes

We ultimately need to prioritize our standardization efforts, and as the Ecma Office Open XML spec is clearly further along in meeting the goal of full interoperability with the existing set of billions of Office documents, that is where our focus is. The Ecma spec is only a few months away from completion, while the OASIS committee has stated they believe they have at least another year before they are even able to define spreadsheet formulas. If the OASIS Open Document committee is having trouble meeting the goal of compatibility with the existing set of Office documents, then they should be able to leverage the work done by Ecma as the draft released back in the spring is already very detailed and the final draft should be published later this year.

To be clear, we have taken a 'hands off' approach to the OASIS technical committees because:  a) we have our hands full finishing a great product (Office 2007) and contributing to Ecma TC45, and b) we do not want in any way to be perceived as slowing down or working against ODF.  We have made this clear during the ISO consideration process as well.  The ODF and Open XML projects have legitimate differences of architecture, customer requirements and purpose.  This Translator project and others will prove that the formats can coexist with a certain tolerance, despite the differences and gaps.

No matter how well-intentioned our involvement might be with ODF, it would be perceived to be self-serving or detrimental to ODF and might come from a different perception of requirements.   We have nothing against the different ODF committees' work, but just recognize that our presence and input would tend to be misinterpreted and an inefficient use of valuable resources.  The Translator project we feel is a good productive 'middle ground' for practical interoperability concerns to be worked out in a transparent way for everyone, rather than attempting to swing one technical approach and set of customer requirements over to the other.

As someone who's watched standards committees from the Microsoft perspective while working on the XML team, I agree with everything Brian writes in his post. Trying to merge a bunch of contradictory requirements often results in a complex technology that causes more problems than it solves (e.g. W3C XML Schema). In addition, Microsoft showing up and trying to change the direction of the project to supports its primary requirement (an XML file format compatible with the legacy Microsoft Office file formats) would not be well received.

Unfortunately, the ODF discussion has seemed to be more political than technical which often obscures the truth. Microsoft is making moves to ensure that Microsoft Office not only provides the best features for its customers but ensures that they can exchange documents in a variety of document formats from those owned by Microsoft to PDF and ODF. I've seen a lot of customers acknowledge this truth and commend the company for it. At the end of the day, that matters a lot more than what competitors and detractors say. Making our customers happy is job #1. 




Microsoft Announces ODF Support for Office

Thu, 06 Jul 2006 17:04:48 GMT

The Office team continues to impress me with how savvy they are about the changing software landscape. In his blog post entitled Open XML Translator project announced (ODF support for Office) Brian Jones writes

Today we are announcing the creation of the Open XML Translator project that will help translate between the Office Open XML formats and the OpenDocument format. We've talked a lot about the value the Open XML formats bring, and one of them of course is the ability to filter it down into other formats. While we still aren't seeing a strong demand for ODF support from our corporate or consumer customers, it's now a bit different with governments. We've had some governments request that we help build solutions so that they can use ODF for certain situations, so that's why we are creating the Open XML Translator project. I think it's going to be really beneficial to a number of folks and for a number of reasons. There has been a push in Microsoft for better interoperability and this is another great step in that direction. We already have the PDF and XPS support for Office 2007 users that unfortunately had to be separated out of the product and instead offered as a free download. There will be a menu item in the Office applications that will point people to the downloads for XPS, PDF, and now ODF. So you'll have the ability to save to and open ODF files directly within Office (just like any other format). For me, one of the really cool parts of this project is that it will be open source and located up on SourceForge, which means everyone will have the ability to see how to leverage the open architectures of both the Office Open XML formats and ODF. We're developing the tools with the help of Clever Age (based in France) and a few other folks like Aztecsoft (based in India) and Dialogika (based in Germany). There should actually be a prototype of the first translator (for Word 2007) posted up on SourceForge later on today (http://sourceforge.net/projects/odf-converter). It's going to be made available under the BSD license, and anyone can provide feedback, submit bugs, and of course directly contribute to the project. The Word tool should be available by the end of this year, with the Excel and PPT versions following in 2007.

This announcement is cool on so many levels. The coolest being that the projects will not only be Open Source but will be hosted on SourceForge. That is sweet. It is interesting to note that it is government customers and not businesses that are interested in ODF support in Office. I guess that makes sense if you consider which parties have been expressing interest in Open Office.

There are already some great analyst responses to this move such as Stephen O'Grady of Redmonk who in his post Microsoft Office to Support ODF: The Q&A has some great insights. My favorite insight is excerpted below

Q: How about Microsoft's competitors?
A: Well, this is a bittersweet moment for them. For those like Corel that have eschewed ODF support, it's a matter of minor importance - at least until Microsoft is able to compete in public sector markets that mandate ODF and they are not. But for those vendors that have touted ODF support as a differentiator, this is a good news/bad news deal. The good news is that they can and almost certainly will point to Microsoft's support as validation of further ODF traction and momentum; the bad news is that they will now be competing - at least in theory, remember the limitation - with an Office suite that is frankly the most capable on the market.
I've said for years that packages like OpenOffice.org are more than good enough for the majority of us[...]



Mike Champion on Why We Need XLinq

Fri, 23 Jun 2006 20:38:51 GMT

Mike Champion has a blog post entitled Why does the world need another XML API? where he writes One basic question keeps coming up, something like: "We have SAX, DOM, XmlReader/Writer APIs (and the Java people have a bunch more), we have XSLT, we have XQuery ... why do you think we need Yet Another XML API?" ... XmlReader / XmlWriter can't go away because XLinq uses them to parse and serialize between XLinq objects and XML text. Also, while we are making XLinq as streaming-friendly as possible (see the XStreamingElement class in the CTP release for a taste of where we are going), we're only aiming at hitting the 80/20 point... DOM can't go away because there are important use cases for API-level interoperability, most notably in the browser...DOM doesn't make code truly interoperable across implementations (especially on other languages), but there is enough conceptual similarity that porting is generally not terribly difficult... XSLT definitely won't go away. The Microsoft XML team was promoting XQuery as a "better XSLT than XSLT 2.0" a few years ago (before I came, don't hurt me!), and got set straight by the large and vocal XSLT user community on why this is not going to fly. While it may be true in some abstract way that XQuery or XLinq might logically be able to do everything that XSLT does, as a practical matter it won't... XQuery won't go away, at least for its original use case as a database query language. Microsoft supports a draft of XQuery in SQL Server 2005, contributes to the work of the XQuery working group at W3C, and will continue to invest in finalizing the XQuery Recommendation and implementing it in our DBMS... we believe that the overall LINQ story is going to have a pretty profound impact on data programmability, and we want to make sure that LINQ has a good story for XML...For XML users, I see a few really powerful implications: The ability to query data by declaring the characteristics of the result set rather than imperatively navigating through and filtering out all the data... The ability to join across diverse data sources, be they XML documents, objects, or DBMS queries The ability to "functionally" reshape data within the same language as the application is written. XSLT pioneered the functional transformation approach to XML processing, but it is difficult for many developers to learn and requires a processing pipeline architecture to combine XSLT transforms with conventional application logic... This brings back memories of my days on the XML team at Microsoft. We went back and forth a lot about building the "perfect XML API"; the one problem we had was that there were too many diverse user bases, each with different ideas of what was important to expose in an API. We were always caught between a rock and a hard place when it came to customer requests for fixing our APIs. To some people (e.g. Microsoft Office) XML was a markup format for documents while to others (e.g. Windows Communication Foundation aka Indigo) it was simply a serialization format for programming language objects. Some of our customers were primarily interested in processing XML in a streaming fashion (e.g. Biztalk) while others (e.g. Internet Explorer) always worked with in-memory XML documents. Then there were the teams whose primary interest was in strongly typed XML (e.g. SQL Server, ADO.NET) since it would be stored in relational database columns.
In trying to solve all of these problems with a single set of APIs, we went down the road of prematurely declaring the death of certain XML [...]
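Mike's point about "functionally" reshaping data is easier to see with a concrete fragment. Here is a minimal C# sketch of the idea, with made-up element names and using the LINQ to XML API as it eventually shipped (System.Xml.Linq); it declares the output document as a query over the input rather than building it up imperatively:

using System;
using System.Linq;
using System.Xml.Linq;

class ReshapeDemo
{
    static void Main()
    {
        // Hypothetical input: a feed-like document with a few item elements.
        XElement feed = new XElement("feed",
            new XElement("item", new XAttribute("title", "First post")),
            new XElement("item", new XAttribute("title", "Second post")));

        // Functional reshaping: the output document is an expression over the input
        // instead of being built up with imperative insert/append calls.
        XElement titles = new XElement("titles",
            from item in feed.Elements("item")
            select new XElement("title", (string)item.Attribute("title")));

        Console.WriteLine(titles);
    }
}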



On the C# 3.0 Preview: Some Thoughts on LINQ

Wed, 17 May 2006 13:35:13 GMT

If you're a regular reader of Don Box's weblog then you probably know that Microsoft has made available another Community Technical Preview (CTP) of Language Integrated Query (LINQ) aka C# 3.0. I think the notion of integrating data access and query languages into programming languages is the next natural evolution in programming language design. A large number of developers write code that performs queries over rich data structures of some sort whether they are relational databases, XML files or just plain old objects in memory. In all three cases, the code tends to be verbose and more cumbersome than it needs to be. The goal of the LINQ project is to try to simplify and unify data access in programming languages built on the .NET Framework. When I used to work on the XML team, we also used to salivate over the power that developers would get if they could get rich query over their data stores in a consistent manner. I was the PM for the IXPathNavigable interface and the related XPathNavigator class which we hoped people would implement over their custom stores to enable them to use XPath to query them. Some developers did do exactly that, such as Steve Saxon with the ObjectXPathNavigator which allows you to use XPath to query a graph of in-memory objects. The main problem with this approach is that implementing IXPathNavigable for custom data stores is non-trivial, especially given the impedance mismatch between XML and other data models. In fact, I've been wanting to do something like this in RSS Bandit for a while but the complexity of implementing my own custom XPathNavigator class over our internal data structures is something I've balked at doing. According to Matt Warren's blog post Oops, we did it again it looks like the LINQ folks have similar ideas but are making it easier than we did on the XML team. He writes What's the coolest new feature? IMHO, it's IQueryable. DLINQ's query mechanism has been generalized and made available for all to use as part of System.Query. It implements the Standard Query Operators for you using expression nodes to represent the query. Your queries can now be truly polymorphic, written over a common abstraction and translated into the target environment only when you need it to.

    public int CustomersInLondon(IQueryable<Customer> customers) {
        int count = (from c in customers
                     where c.City == "London"
                     select c).Count();
        return count;
    }

Now you can define a function like this and it can operate on either an in-memory collection or a remote DLINQ collection (or your own IQueryable for that matter). The query is then either run entirely locally or remotely depending on the target. If it's a DLINQ query, a count query is sent to the database.

    SELECT COUNT(*) AS [value] FROM [Customers] AS [t0] WHERE [t0].[City] = @p0

If it's a normal CLR collection, the query is executed locally, using the System.Query.Sequence class's definitions of the standard query operators. All you need to do is turn your IEnumerable into IQueryable. This is accomplished easily with a built-in ToQueryable() method. [...]
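To make the contrast with the IXPathNavigable approach concrete, here is a small hypothetical C# sketch (the Feed class and its data are invented for illustration) of querying an in-memory object graph directly with LINQ to Objects, the scenario that previously called for writing a custom XPathNavigator:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the kind of internal data structure (say, an aggregator's
// subscription list) that would otherwise need a custom XPathNavigator to be queryable.
class Feed
{
    public string Title;
    public int UnreadCount;
}

class QueryDemo
{
    static void Main()
    {
        var subscriptions = new List<Feed>
        {
            new Feed { Title = "Carnage4Life", UnreadCount = 3 },
            new Feed { Title = "XML Team Blog", UnreadCount = 0 }
        };

        // With LINQ to Objects the in-memory graph is queryable as-is;
        // no IXPathNavigable implementation over the custom store is required.
        var unread = from f in subscriptions
                     where f.UnreadCount > 0
                     select f.Title;

        foreach (string title in unread)
            Console.WriteLine(title);
    }
}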



W3C Publishes XMLHttpRequest Object specification

Tue, 11 Apr 2006 22:55:47 GMT

I just noticed that last week the W3C published a working draft specification for The XMLHttpRequest Object. I found the end of the working draft somewhat interesting. Read through the list of references and authors of the specification below References This section is normative DOM3 Document Object Model (DOM) Level 3 Core Specification, Arnaud Le Hors (IBM), Philippe Le Hégaret (W3C), Lauren Wood (SoftQuad, Inc.), Gavin Nicol (Inso EPS), Jonathan Robie (Texcel Research and Software AG), Mike Champion (Arbortext and Software AG), and Steve Byrne (JavaSoft). RFC2119 Key words for use in RFCs to Indicate Requirement Levels, S. Bradner. RFC2616 Hypertext Transfer Protocol -- HTTP/1.1, R. Fielding (UC Irvine), J. Gettys (Compaq/W3C), J. Mogul (Compaq), H. Frystyk (W3C/MIT), L. Masinter (Xerox), P. Leach (Microsoft), and T. Berners-Lee (W3C/MIT). B. Authors This section is informative The authors of this document are the members of the W3C Web APIs Working Group. Robin Berjon, Expway (Working Group Chair) Ian Davis, Talis Information Limited Gorm Haug Eriksen, Opera Software Marc Hadley, Sun Microsystems Scott Hayman, Research In Motion Ian Hickson, Google Björn Höhrmann, Invited Expert Dean Jackson, W3C Christophe Jolif, ILOG Luca Mascaro, HTML Writers Guild Charles McCathieNevile, Opera Software T.V. Raman, Google Arun Ranganathan, AOL John Robinson, AOL Doug Schepers, Vectoreal Michael Shenfield, Research In Motion Jonas Sicking, Mozilla Foundation Stéphane Sire, IntuiLab Maciej Stachowiak, Apple Computer Anne van Kesteren, Opera Software Thanks to all those who have helped to improve this specification by sending suggestions and corrections. (Please, keep bugging us with your issues!) Interesting. A W3C specification that documents a proprietary Microsoft API which not only does not include a Microsoft employee as a spec author but doesn't even reference any of the IXMLHttpRequest documentation on MSDN. I'm sure there's a lesson in there somewhere. ;) [...]



WordPerfect to Support Microsoft Office Open XML Formats

Tue, 24 Jan 2006 19:28:18 GMT

Brian Jones has a blog post entitled Corel to support Microsoft Office Open XML Formats which begins

Corel has stated that they will support the new XML formats in Wordperfect once we release Office '12'. We've already seen other applications like OpenOffice and Apple's TextEdit support the XML formats that we built in Office 2003. Now as we start providing the documentation around the new formats and move through Ecma we'll see more and more people come on board and support these new formats. Here is a quote from Jason Larock of Corel talking about the formats they are looking to support in coming versions (http://labs.pcw.co.uk/2006/01/new_wordperfect_1.html):

Larock said no product could match Wordperfect's support for a wide variety of formats and Corel would include OpenXML when Office 12 is released. "We work with Microsoft now and we will continue to work with Microsoft, which owns 90 percent of the market. We would basically cut ourselves off if we didn't support the format."

But he admitted that X3 does not support the Open Document Format (ODF), which is being proposed as a rival standard, "because no customer that we are currently dealing with has asked us to do so."

X3 does however allow the import and export of portable document format (pdf) files, something Microsoft has promised for Office 12.

I mention this article because I wanted to again stress that even our competitors will now have clear documentation that allows them to read and write our formats. That isn't really as big of a deal though as the fact that any solution provider can do this. It means that the documents can now be easily accessed 100 years from now, and start to play a more meaningful role in business processes.

Again I want to extend my kudos to Brian and the rest of the folks on the Office team who have been instrumental in the transition of the Microsoft Office file formats from proprietary binary formats to open XML formats.




Metadata Quality and Mapping Between Domain Languages

Sat, 21 Jan 2006 13:01:23 GMT

One part of the XML vision that has always resonated with me is that it encourages people to build custom XML formats specific to their needs but allows them to map between languages using technologies like XSLT. However XML technologies like XSLT focus on mapping one kind of syntax to another. There is another school of thought from proponents of Semantic Web technologies like RDF, OWL, DAML+OIL, etc. that higher-level mapping between the semantics of languages is a better approach. In previous posts such as RDF, The Semantic Web and Perpetual Motion Machines and More on RDF, The Semantic Web and Perpetual Motion Machines I've disagreed with the thinking of Semantic Web proponents because in the real world you have to mess with both syntactical mappings and semantic mappings. A great example of this is shown in the post entitled On the Quality of Metadata... by Stefano Mazzocchi where he writes One thing we figured out a while ago is that merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata. The "measure" of this quality is just subjective and perceptual, but it's a constant thing: every time we showed this to people that cared about the data more than the software we were writing, they could not understand why we were so excited about such a system, where clearly the data was so much poorer than what they were expecting. We use the usual "this is just a prototype and the data mappings were done without much thinking" kind of excuse, just to calm them down, but now that I'm tasked to "do it better this time", I'm starting to feel a little weird because it might well be that we hit a general rule, one that is not a function of how much thinking you put into the data mappings or ontology crosswalks, and talking to Ben helped me understand why. First, let's start noting that there is no practical and objective definition of metadata quality, yet there are patterns that do emerge. For example, at the most superficial level, coherence is considered a sign of good care and (here all the metadata lovers would agree) good care is what it takes for metadata to be good. Therefore, lack of coherence indicates lack of good care, which automatically resolves in bad metadata. Note how this is nothing but a syllogism, yet, it's something that, rationally or not, comes up all the time. This is very important. Why? Well, suppose you have two metadatasets, each of them very coherent and well polished about, say, music. The first encodes Artist names as "Beatles, The" or "Lennon, John", while the second encodes them as "The Beatles" and "John Lennon". Both datasets, independently, are very coherent: there is only one way to spell an artist/band name, but when the two are merged and the ontology crosswalk/map is done (either implicitly or explicitly), the result is that some songs will now be associated with "Beatles, The" and others with "The Beatles". The result of merging two high quality datasets is, in general, another dataset with a higher "quantity" but a lower "quality" and, as you can see, the ontological crosswalks or mappings were done "right", where for "right" I mean that both sides of the ontological equation would have approved that "The Beatles" or "Beatles, The" are the band name that is associated with that song. At this point, the fellow semantic web developers would say "pfff, of course you are running into trouble, you haven't used the same URI" and the fellow librari[...]
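Stefano's Beatles example can be boiled down to a few lines of illustrative C# (the catalogs and spellings below are invented to mirror his scenario): each input is internally coherent, yet the naively merged result contains two spellings for the same artist.

using System;
using System.Collections.Generic;
using System.Linq;

class MergeDemo
{
    static void Main()
    {
        // Two internally coherent catalogs: each uses a single, consistent spelling.
        var catalogA = new Dictionary<string, string>
        {
            ["Help!"] = "Beatles, The",
            ["Imagine"] = "Lennon, John"
        };
        var catalogB = new Dictionary<string, string>
        {
            ["Yesterday"] = "The Beatles",
            ["Jealous Guy"] = "John Lennon"
        };

        // A naive merge keeps both spellings, so the combined dataset is no longer
        // coherent: some songs now point at "Beatles, The" and others at "The Beatles".
        var merged = catalogA.Concat(catalogB)
                             .ToDictionary(entry => entry.Key, entry => entry.Value);

        // Prints four "distinct" artists where a human reader sees only two.
        foreach (string artist in merged.Values.Distinct())
            Console.WriteLine(artist);
    }
}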



Microformats vs. XML: Pros and Cons

Wed, 18 Jan 2006 12:39:13 GMT

Since writing my post Microformats vs. XML: Was the XML Vision Wrong?, I've come across some more food for thought on the appropriateness of using microformats over XML formats. The real-world test case I use when thinking about choosing microformats over XML is whether, instead of having an HTML web page for my blog and an Atom/RSS feed, I should have a single HTML page with
or

embedded in it. To me this seems like a gross hack but I've seen lots of people comment on how this seems like a great idea to them. Given that I hadn't encountered universal disdain for this idea, I decided to explore further and look for technical arguments for and against both approaches. I found quite a few discussions on how and why microformats came about in articles such as The Microformats Primer in the Digital Web Magazine and Introduction to Microformats in the Microformats wiki. However I hadn't seen many in-depth technical arguments for why they were better than XML formats until recently. In a comment in response to my Microformats vs. XML: Was the XML Vision Wrong?, Mark Pilgrim wrote Before microformats had a home page, a blog, a wiki, a charismatic leader, and a cool name, I was against using XHTML for syndication for a number of reasons. http://diveintomark.org/archives/2002/11/26/syndication_is_not_publication I had several basic arguments: 1. XHTML-based syndication required well-formed semantic XHTML with a particular structure, and was therefore doomed to failure. My experience in the last 3+ years with both feed parsing and microformats parsing has convinced me that this was incredibly naive on my part. Microformats may be *easier* to accomplish with semantic XHTML (just like accessibility is easier in many ways if you're using XHTML + CSS), but you can embed structured data in really awful existing HTML markup, without migrating to "semantic XHTML" at all. 2. Bandwidth. Feeds are generally smaller than their corresponding HTML pages (even full content feeds), because they don't contain any of the extra fluff that people put on web pages (headers, footers, blogrolls, etc.) And feeds only change when actual content changes, whereas web pages can change for any number of valid reasons that don't involve changes to the content a feed consumer would be interested in. This is still valid, and I don't see it going away anytime soon. 3. The full-vs-partial content debate. Lots of people who publish full content on web pages (including their home page) want to publish only partial content in feeds. The rise of spam blogs that automatedly steal content from full-content feeds and republish them (with ads) has only intensified this debate. 4. Edge cases. Hand-crafted feed summaries. Dates in Latin. Feed-only content. I think these can be handled by microformats or successfully ignored. For example, machine-readable dates can be encoded in the title attribute of the human-readable date. Hand-crafted summaries can be published on web pages and marked up appropriately. Feed-only content can just be ignored; few people do it and it goes against one of the core microformats principles that I now agree with: if it's not human-readable in a browser, it's worthless or will become worthless (out of sync) over time. I tend to agree with Mark's conclusions. The main issue with using microformats for syndication instead of RSS/Atom feeds is was[...]




Microformats vs. XML: Was the XML Vision Wrong?

Thu, 12 Jan 2006 21:58:50 GMT

Over a year ago, I wrote a blog post entitled SGML on the Web: A Failed Dream? where I asked whether the original vision of XML had failed. Below are excerpts from that post The people who got together to produce the XML 1.0 recommendation were motivated to do this because they saw a need for SGML on the Web. Specifically their discussions focused on two general areas: Classes of software applications for which HTML was an inadequate information format Aspects of the SGML standard itself that impeded SGML's acceptance as a widespread information technology The first discussion established the need for SGML on the web. By articulating worthwhile, even mission-critical work that could be done on the web if there were a suitable information format, the SGML experts hoped to justify SGML on the web with some compelling business cases. The second discussion raised the thornier issue of how to "fix" SGML so that it was suitable for the web. And thus XML was born. ... The W3C's attempts to get people to author XML directly on the Web have mostly failed as can be seen by the dismal adoption rate of XHTML and in fact many [including myself] have come to the conclusion that the benefits of adopting XHTML compared to the costs are too low if not non-existent. There was once an expectation that content producers would be able to place documents conformant to their own XML vocabularies on the Web and then display would entirely be handled by stylesheets but this is yet to become widespread. In fact, at least one member of a W3C working group has called this a bad practice since it means that User Agents that aren't sophisticated enough to understand style sheets are left out in the cold. Interestingly enough, although XML has not been as successful as its originators initially expected as a markup language for authoring documents on the Web, it has found significant success as the successor to the Comma Separated Value (CSV) File Format. XML's primary usage on the Web and even within internal networks is for exchanging machine generated, structured data between applications. Speculatively, the largest usage of XML on the Web today is RSS and it conforms to this pattern. These thoughts were recently rekindled when reading Tim Bray's recent post Don’t Invent XML Languages where Tim Bray argues that people should stop designing new XML formats. For designing new data formats for the Web, Tim Bray advocates the use of Microformats instead of XML. The vision behind microformats is completely different from the XML vision. The original XML inventors started with the premise that HTML is not expressive enough to describe every possible document type that would be exchanged on the Web. Proponents of microformats argue that one can embed additional semantics over HTML and thus HTML is expressive enough to represent every possible document type that could be exchanged on the Web. I've always considered it a gross hack to think that instead of having an HTML web page for my blog and an Atom/RSS feed, I should have a single HTML page with
or

embedded in it. However, given that one of the inventors of XML (Tim Bray) is now advocating this approach, I wonder if I'm simply clinging to old ways and have become the kind of intellectual dinosaur I bemoan.  [...]




Don Demsak on XSLT 2.0 and Microsoft

Thu, 15 Dec 2005 18:25:42 GMT

Don Demsak has a post entitled XSLT 2.0, Microsoft, and the future of System.Xml which has some insightful perspectives on the future of XML in the .NET Framework Oleg accidentally restarted the XSLT 2.0 on .Net firestorm by trying to start up an informal survey. Dare chimed in with his view of how to get XSLT 2.0 in .Net. M. David (the guy behind Saxon.Net which let .Net developers use Saxon on .Net) jumped in with his opinion. ... One of the things that I’ve struggled with in System.Xml is how hard it is sometimes to extend the core library. The XML MVPs have done a good job with some things, but other things (like implementing XSLT 2.0 on top of the XSLT 1.0 stuff) are impossible because so much of the library is buried in internal classes. When building a complex library like System.Xml, there are 2 competing schools of thought: Make the library easy to use and create a very small public facing surface area. Make the library more of a core library with most classes and attributes public, and let others build easy (and very specific) object models on top of it. The upside of the first methodology is that it is much easier to test, and the library just works out of the box. The downside is that it is very hard to extend the library, so it can only be used in very specific ways. The upside of the second methodology is that you don’t have to try to envision all the ways the library should be used. Over time others will extend it to accomplish things that the original developers never thought of. The downside is that you have a much larger surface area to test, and you are totally reliant on other projects to make your library useful. This goes for both projects internal to Microsoft and external projects like the Mvp.Xml lib. The System.Xml team has tended to use the first methodology, where the ASP.Net team tends to build their core stuff according to the second methodology, and then have a sub-team create another library using the first methodology, so developers have something to use right out of the box (think of System.Web.UI.HtmlControls as the low level API and System.Web.UI.WebControls as the higher level API). The ASP.Net team builds their API this way because, from the beginning, they have always envisioned 3rd parties extending their library. At the moment, this is not the case for the System.Xml library. But the question is, should System.Xml be revamped and become a lower level API, and then rely on 3rd parties (like the Mvp.Xml project) to create more specific and easier to use APIs? Obviously this is not something to be taken lightly. It will be more costly to expose more of the internals of System.Xml. But, if only the lower level API was part of the core .Net framework, it may then be possible to roll out newer, higher level, APIs on a release schedule different from the .Net framework schedule. This way projects like XSLT 2.0 could be released without having to wait for the next version of the framework. I’ve always been of the opinion that XSLT 2.0 does not need to be part of the core .Net framework. Oleg doesn’t believe that the .Net open source community is as passionate as some of the other communities, so he would like to see Microsoft build XSLT 2.0. I’d rather see the transformation of the System[...]



XSLT 2.0 and Microsoft

Sun, 11 Dec 2005 17:45:12 GMT

I've been following a series of posts on Oleg Tkachenko's blog with some bemusement. In his post A business case for XSLT 2.0? he writes If you are using XSLT and you think that XSLT 2.0 would provide you some real benefits, please drop a line of comment with a short explanation pleeeease. I'm collecting some arguments for XSLT 2.0, some real world scenarios that are hard with XSLT 1.0, some business cases when XSLT 2.0 would provide an additional value. That's really important if we want to have more than a single XSLT 2.0 implementation... PS. Of course I've read Kurt's "The Business Case for XSLT 2.0" already. Update: I failed to stress it enough that it's not me who needs such kind of arguments. We have sort of unique chance to persuade one of software giants (guess which one) to support XSLT 2.0 now. In a follow up post entitled XSLT 2.0 and Microsoft Unofficial Survey he reveals which of the software giants he is trying to convince to implement XSLT 2.0 where he writes Moving along business cases Microsoft seeks to implement XSLT 2.0 I'm trying to gather some opinion statistics among developers working with XML and XSLT. So I'm holding this survey at the XML Lab site: Would you like to have XSLT 2.0 implementation in the .NET Framework? The possible answers are: Yes, I need XSLT 2.0 Yes, that would be nice to have No, continue improving XSLT 1.0 impl instead No, XSLT 1.0 is enough for me ... Take your chance to influence Microsoft's decision on XSLT 2.0 and win an XSLT 2.0 book! My advice to Oleg, if you want to see XSLT 2.0 in the .NET Framework then gather some like-minded souls and build it yourself. Efforts like the MVP.XML library for the .NET Framework show that there are a bunch of talented developers building cool enhancements to the basic XML story Microsoft provides in the .NET Framework. I'm not sure how an informal survey in a blog would convince Microsoft one way or the other about implementing a technology. A business case to convince a product team to do something usually involves showing them that they will lose or gain significant marketshare or revenue by making a technology choice. A handful of XML geeks who want to see the latest and greatest XML specs implemented by Microsoft does not a business case make. Unfortunately, this means that Microsoft will tend to be a follower and not a leader in such cases because customer demand and competitive pressure don't occur until other people have implemented and are using the technology. Thus if you want Microsoft to implement XSLT 2.0, your best bet is to actually have people using it on other platforms or on Microsoft platforms who will clamor for better support instead of relying on informal surveys and comments in your blog. Just my $0.02 as someone who used to work on the XML team at Microsoft. [...]



Tim Bray's Hypocrisy and Competing XML Formats

Mon, 28 Nov 2005 18:17:46 GMT

Tim Bray has a post entitled Thought Experiments where he writes To keep things short, let’s call OpenDocument Format 1.0 "ODF" and the Office 12 XML File Formats "O12X". Alternatives · In ODF we have a format that’s already a stable OASIS standard and has multiple shipping implementations. In O12X we have a format that will become a stable ECMA standard with one shipping implementation sometime a year or two from now, depending on software-development and standards-process timetables. ODF is in the process of working its way through ISO, and O12X will apparently be sent down that road too, which should put ISO in an interesting situation. On the technology side, the two formats are really more alike than they are different. But, there are differences: O12X's design center, Microsoft has said repeatedly, is capturing the exact semantics of the billions of existing Microsoft Office documents. ODF’s design center is general-purpose reusability, and leveraging existing standards like SVG and MathML and so on. Which do you like better? I know which one I’d pick. But I think we’re missing the point. Why Are There Two? · Almost all office documents are just paragraphs of text, with some bold and some italics and some lists and some tables and some pictures. Almost all spreadsheets are numbers and labels, with some sums and averages and pivots and simple algebra. Almost all presentations are lists of bullet points with occasional pictures. The capabilities of ODF and O12X are essentially identical for all this basic stuff. So why in the flaming hell does the world need two incompatible formats to express it? The answer, obviously, is, "it doesn’t". I find it extremely ironic that one of the driving forces behind creating a redundant and duplicative XML format for website syndication would be one of the first to claim that we only need one XML format to solve any problem. For those who aren't in the know, Tim Bray is one of the chairs of the Atom Working Group in the IETF whose primary goal is to create a competing format to RSS 2.0 which does basically the same thing. In fact Tim Bray has written a decent number of posts attempting to explain why we need multiple XML formats for syndicating blog posts, news and enclosures on the Web. But let's ignore the messenger and focus on the message. Tim Bray's question is quite fair and in fact he answers it later on in his blog entry. As Tim Bray writes, "Microsoft wants there to be an office-document XML format that covers their billions of legacy documents". That's it in a nutshell. Microsoft created XML versions of its binary document formats like .doc and .xls that had full fidelity with the features of these formats. That way a user can convert a 'legacy' binary Office document to a more interoperable Office XML document without worrying about losing data, formatting or embedded rich media. This is a very important goal for the Microsoft Office team and very different from the goal of the designers of the OpenDocument format.  Is it technically possible to create a 'common shared office-XML dialect for the basics' as Tim Bray suggests? It is. It'll probably take several years (e.g. the Atom syndication format which is simply a derivative of RSS has taken over two years to come to fruition) and once it is done, Microsoft will have to 'embrace [...]



The Myth of the Office XML Binary Key

Tue, 18 Oct 2005 18:08:28 GMT

A recent comment on the Groklaw blog entitled Which Binary Key? claims that one needs a "binary key" to consume XML produced by Microsoft Office 2003. Specifically the post claims No_Axe speaks as if MS Office 12 had already been released and everyone was using it. He assumes everyone knows the binary key is gone. Yet Microsoft is saying that MS Office 12 is more or less a year away from release. So who really knows when and if the binary key has been dropped? All I know is that MSXML 12 is not available today. And that MSXML 2003 has a binary key in the header of every file. ... So let me close with this last comment on the fabled “binary key”. In March of 2005, when phase II of the ODF TC work was complete, and the specification had been prepared for both OASIS and ISO ratification, the ODF TC took up the issue of “compliance and conformance” testing. Specifically, we decided to start work on a compliance testing suite that would be useful for developers and application providers to perfect their implementations of ODF. Guess whose XML file format was the first test target? Right. And guess what the problem is with MSXML? Right. It's the binary key. We can't do even a simple transformation between MSXML and ODF! As someone who's used the XML features of Excel and Word, I know for a fact that you don't need a "binary key" to process the files using traditional XML tools. Brian Jones, who works on a number of the XML features in Office, has a post entitled The myth of the Binary Key where he mentions various parts of the Office XML formats that may confuse one into thinking they are some sort of "binary key" such as namespace URIs, processing instructions and Base64 encoded binary data. All of these are standard aspects of XML which one typically doesn't see in simple uses of the technology such as in RSS feeds. Being that I used to work on the XML team, there is one thing I want to add to Brian's list which often confuses people trying to process XML: the Unicode byte order mark (BOM). This is often at the beginning of documents saved in UTF-16 or UTF-8 encoding on Windows. However as the Wikipedia entry on BOMs states In UTF-16, a BOM is expressed as the two-byte sequence FE FF at the beginning of the encoded string, to indicate that the encoded characters that follow it use big-endian byte order; or it is expressed as the byte sequence FF FE to indicate little-endian order. Whilst UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of Windows software (including Windows Notepad) adds one to UTF-8 files. However in Unix-like systems (which make heavy use of text files for configuration) this practice is not recommended, as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages that don't recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script. The UTF-8 representation of the BOM is the byte sequence EF BB BF, w[...]
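Since the BOM trips up so many ad hoc consumers of these files, here is a minimal C# sketch (the file name is a placeholder) that checks for the UTF-8 BOM bytes by hand and then lets an XML parser, which treats the BOM as encoding information rather than content, load the same file:

using System;
using System.IO;
using System.Xml;

class BomDemo
{
    static void Main()
    {
        // Hypothetical file name; assume a UTF-8 document saved by Notepad or Office.
        string path = "contacts.xml";

        // Peek at the first three bytes to see whether a UTF-8 byte order mark is present.
        byte[] prefix = new byte[3];
        int read;
        using (FileStream fs = File.OpenRead(path))
        {
            read = fs.Read(prefix, 0, prefix.Length);
        }
        bool hasUtf8Bom = read >= 3 &&
                          prefix[0] == 0xEF && prefix[1] == 0xBB && prefix[2] == 0xBF;
        Console.WriteLine("UTF-8 BOM present: " + hasUtf8Bom);

        // A conforming XML parser handles the BOM for you; no "binary key" required.
        var doc = new XmlDocument();
        doc.Load(path);
        Console.WriteLine("Root element: " + doc.DocumentElement.Name);
    }
}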



On Crappy XML Formats

Fri, 30 Sep 2005 19:14:43 GMT

There have been a number of amusing discussions in the recent back and forth between Robert Scoble and several others on whether OPML is a crappy XML format. In posts such as OPML "crappy" Robertson says and More on crappy formats Robert defends OPML. I've seen some really poor arguments made as people rushed to bash Dave Winer and OPML but none made me want to join the discussion until this morning. In the post Some one has to say it again… brainwagon writes Take for example Mark Pilgrim's comments: I just tested the 59 RSS feeds I subscribe to in my news aggregator; 5 were not well-formed XML. 2 of these were due to unescaped ampersands; 2 were illegal high-bit characters; and then there's The Register (RSS), which publishes a feed with such a wide variety of problems that it's typically well-formed only two days each month. (I actually tracked it for a month once to test this. 28 days off; 2 days on.) I also just tested the 100 most recently updated RSS feeds listed on blo.gs (a weblog tracking site); 14 were not well-formed XML. The reason just isn't that programmers are lazy (we are, but we also like stuff to work). The fact is that the specification itself is ambiguous and weak enough that nobody really knows what it means. As a result, there are all sorts of flavors of RSS out there, and parsing them is a big hassle. The promise of XML was that you could ignore the format and manipulate data using standard off-the-shelf tools. But that promise is largely negated by the ambiguity in the specification, which results in ill-formed RSS feeds, which cannot be parsed by standard XML tools. Since Dave Winer himself managed to get it wrong as late as the date of the above article (probably due to an error that I myself have done, cutting and pasting unsafe text into Wordpress) we really can't say that it's because people don't understand the specification unless we are willing to state that Dave himself doesn't understand the specification. As someone who has (i) written a moderately popular RSS reader and (ii) worked on the XML team at Microsoft for three years, I know a thing or two about XML-related specifications. Blaming malformed XML in RSS feeds on the RSS specification is silly. That's like blaming the large number of HTML pages that don't validate on the W3C's HTML specification instead of on the fact that instead of erroring on invalid web pages web browsers actually try to render them. If web browsers didn't render invalid web pages then invalid pages wouldn't exist on the Web. Similarly, if every aggregator rejected invalid feeds then they wouldn't exist. However, just like in the browser wars, aggregator authors consider it a competitive advantage to be able to handle malformed feeds. This has nothing to do with the quality of the RSS specification [or the HTML specification]; this is all about applications trying to get marketshare. As for whether OPML is a crappy spec? I've had to read a lot of technology specifications in my day from W3C recommendations and IETF RFCs to API documentation and informal specs. They all suck in their own ways. However experience has taught me that the bigger the spec, the more it sucks. Given that, I'd rather[...]
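To make the well-formedness point concrete, here is a minimal C# sketch of the kind of strict check an aggregator could apply but usually doesn't; the feed URL is a placeholder, and anything that isn't well-formed XML is simply rejected:

using System;
using System.Xml;

class FeedValidator
{
    // Minimal well-formedness check: returns true only if the feed parses as XML.
    static bool IsWellFormed(string url)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(url))
            {
                while (reader.Read()) { /* just walk the whole document */ }
            }
            return true;
        }
        catch (XmlException ex)
        {
            // Unescaped ampersands, illegal characters, etc. all land here.
            Console.WriteLine("Not well-formed: " + ex.Message);
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsWellFormed("http://example.org/feed.xml"));
    }
}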



Questioning RDF

Sun, 18 Sep 2005 03:15:12 GMT

I've been a long-time skeptic when it comes to RDF and the Semantic Web. Every once in a while I wonder if perhaps what I have a problem with is the W3C's vision of the Semantic Web as opposed to RDF itself. However in previous attempts to explore RDF I've been surprised to find that its proponents seem to ignore some of the real world problems facing developers when trying to use RDF as a basis for information integration. Recently I've come across blog posts by RDF proponents who've begun to question the technology. The first is the blog post entitled Crises by Ian Davis where he wrote We were discussing the progress of the Dublin Core RDF task force and there were a number of agenda items under discussion. We didn’t get past the first item though - it was so hairy and ugly that no-one could agree on the right approach. The essence of the problem is best illustrated by the dc:creator term. The current definition says An entity primarily responsible for making the content of the resource. The associated comment states Typically, the name of a Creator should be used to indicate the entity and this is exactly the most common usage. Most people, most of the time use a person’s name as the value of this term. That’s the natural mode if you write it in an HTML meta tag and it’s the way tens or hundreds of thousands of records have been written over the past six years...Of course, us RDFers, with our penchant for precision and accuracy take issue with the notion of using a string to denote an “entity”. Is it an entity or the name of an entity? Most of us prefer to add some structure to dc:creator, perhaps using a foaf:Person as the value. It lets us make more assertions about the creator entity. The problem, if it isn’t immediately obvious, is that in RDF and RDFS it’s impossible to specify that a property can have a literal value but not a resource or vice versa. When I ask “what is the email address of the creator of this resource?” what should the (non-OWL) query engine return when the value of creator is a literal? It isn’t a new issue, and is discussed in-depth on the FOAF wiki. There are several proposals for dealing with this. The one that seemed to get the most support was to recommend the latter approach and make the first illegal. That means making hundreds of thousands of documents invalid. A second approach was to endorse current practice and change the semantics of the dc:creator term to explicitly mean the name of the creator and invent a new term (e.g. creatingEntity) to represent the structured approach. ... That’s when my crisis struck. I was sitting at the world’s foremost metadata conference in a room full of people who cared deeply about the quality of metadata and we were discussing scraping data from descriptions! Scraping metadata from Dublin Core! I had to go check the dictionary entry for oxymoron just in case that sentence was there! If professional cataloguers are having these kinds of problems with RDF then we are fucked. It says to me that the looseness of the model is introducing far too much complexity as evidenced by the difficulties being experienced by the Dublin Core community and the W3C HTML working group. A simpl[...]



XLinq and Visual Basic 9

Fri, 16 Sep 2005 16:27:23 GMT

The announcements about Microsoft's Linq project just keep getting better and better. In his post XML, Dynamic Languages, and VB, Mike Champion writes

Thursday at PDC saw lots of details being put out about another big project our team has been working on -- the deep support for XML in Visual Basic 9...On the VB9 front, the big news is that two major features beyond and on top of LINQ will be supported in VB9:

"XML Literals" is  the ability to embed XML syntax directly into VB code. For example,

Dim ele as XElement = <Customer/>

Is translated by the compiler to

Dim ele as XElement =  new XElement("Customer")

The syntax further allows "expression holes" much like those in ASP.NET where computed values can be inserted.

"Late Bound XML" is the ability to reference XML elements and attributes directly in VB syntax rather than having to call navigation functions.  For example

Dim books as IEnumerable(Of XElement) = bib.book

Is translated by the compiler to

Dim books as IEnumerable(Of XElement) = bib.Elements("book")

 We believe that these features will make XML even more accessible to Visual Basic's core audience. Erik Meijer, a hard core languages geek who helped devise the Haskell functional programming language and the experimental XML processing languages X#, Xen, and C-Omega, now touts VB9 as his favorite.

Erik Meijer and I used to spend a lot of time talking about XML integration into popular  programming languages back when I was on the XML team. In fact, all the patent cubes on my desk are related to work we did together in this area. I'm glad to see that some of the ideas we tossed around are going to make it out to developers in the near future. This is great news.




Integrating XML into Programming Languages: Diving Into XLinq

Thu, 15 Sep 2005 14:43:34 GMT

You know you're a geek when it's not even 7AM but you've already spent half the morning reading a whitepaper about Microsoft's plans to integrate XML and relational query language functionality into the .NET Framework with Linq. C# 3.0 is going to be hot. Like its forefathers X#, Xen, and Cω, XLinq does an amazing job of integrating XML directly into the Common Language Runtime and the C#/VB.NET programming languages. Below are some code samples to whet your appetite until I can get around to writing an article later this year.

Creating an XML document

XDocument contactsDoc =
    new XDocument(
        new XDeclaration("1.0", "UTF-8", "yes"),
        new XComment("XLinq Contacts XML Example"),
        new XProcessingInstruction("MyApp", "123-44-4444"),
        new XElement("contacts",
            new XElement("contact",
                new XElement("name", "Patrick Hines"),
                new XElement("phone", "206-555-0144"),
                new XElement("address",
                    new XElement("street1", "123 Main St"),
                    new XElement("city", "Mercer Island"),
                    new XElement("state", "WA"),
                    new XElement("postal", "68042")))));

Creating an XML element in the "http://example.com" namespace

XElement contacts = new XElement("{http://example.com}contacts");

Loading an XML element from a file

XElement contactsFromFile = XElement.Load(@"c:\myContactList.xml");

Writing out an array of Person objects as an XML file

class Person
{
    public string Name;
    public string[] PhoneNumbers;
}

var persons = new [] { new Person [...]



Microsoft announces LINQ (and XLinq)

Tue, 13 Sep 2005 22:02:03 GMT

My former co-workers on the Microsoft XML team have been hard at work with the C# language team to bring XML query integration into the core languages of the .NET Framework. From Dave Remy's post Anders unveils LINQ! (and XLinq) we learn

In Jim Allchin's keynote at PDC2005 today Anders Hejlsberg showed the LINQ project for the first time. LINQ stands for Language Integrated Query. The big idea behind LINQ is to provide a consistent query experience across different "LINQ enabled" data access technologies AND to allow querying these different data access technologies in a single query. Out of the box there are three LINQ enabled data access technologies that are being shown at PDC. The first is any in-memory .NET collection that you foreach over (any .NET collection that implements IEnumerable). The second is DLinq which provides LINQ over a strongly typed relational database layer. The third, which I have been working on for the last 6 months or so (along with Anders and others on the WebData XML team), is XLinq, a new in-memory XML programming API that is Language Integrated Query enabled. It is great to get the chance to get this technology to the next stage of development and get all of you involved. The LINQ Preview bits (including XLinq and DLinq) are being made available to PDC attendees. More information on the LINQ project (including the preview bits) is also available online at http://msdn.microsoft.com/netframework/future/linq
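For a flavor of what querying different data sources "in a single query" could look like, here is a small hypothetical sketch with invented data; the XML API names follow what eventually shipped as System.Xml.Linq, so treat it as an illustration of the idea rather than the PDC bits themselves:

using System;
using System.Linq;
using System.Xml.Linq;

class UnifiedQueryDemo
{
    static void Main()
    {
        // An ordinary in-memory collection...
        var prices = new[]
        {
            new { Isbn = "0-201-63361-2", Price = 54.99m },
            new { Isbn = "0-596-00420-6", Price = 39.95m }
        };

        // ...and an XML fragment, joined together in one query expression.
        XElement catalog = new XElement("catalog",
            new XElement("book", new XAttribute("isbn", "0-201-63361-2"),
                new XElement("title", "Design Patterns")),
            new XElement("book", new XAttribute("isbn", "0-596-00420-6"),
                new XElement("title", "XML in a Nutshell")));

        var report = from book in catalog.Elements("book")
                     join p in prices on (string)book.Attribute("isbn") equals p.Isbn
                     select new { Title = (string)book.Element("title"), p.Price };

        foreach (var row in report)
            Console.WriteLine("{0}: {1}", row.Title, row.Price);
    }
}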

This is pretty innovative stuff and I definitely can't wait to download the bits when I get some free time. Perhaps I need to write an article exploring LINQ for XML.com the way I did with my Introducing C-Omega article? Then again, I still haven't updated my C# vs. Java comparison to account for C# 2.0 and Java 1.5. It looks like I'll be writing a bunch of programming language articles this fall. 

Which article would you rather see?




Microformats vs. XML vs. RDF

Mon, 08 Aug 2005 12:47:54 GMT

In response to my post Using XML on the Web is Evil, Since When? Tantek updated his post Avoiding Plain XML and Presentational Markup. Since I'm the kind of person who can't avoid a good debate even when I'm on vacation, I've decided to post a response to Tantek's response. Tantek wrote The sad thing is that while namespaces theoretically addressed one of the problems I pointed out (calling different things by the same name), it actually WORSENED the other problem: calling the same thing by different names. XML Namespaces encouraged document/data silos, with little or no reuse, probably because every person/political body defining their elements wanted "control" over the definition of any particular thing in their documents. The tag is the perfect example of needless duplication. And if something was theoretically supposed to have solved something but effectively hasn't 6-7 years later, then in our internet-time-frame, it has failed. This is a valid problem in the real world. For example, for all intents and purposes an element in an Atom feed is semantically equivalent to an element in an RSS feed to every feed reader that supports both. However we have two names for what is effectively the same thing as far as an aggregator developer or end user is concerned. The XML solution to this problem has been that it is OK to have myriad formats as long as we have technologies for performing syntactic translations between XML vocabularies such as XSLT. The RDF solution is for us to agree on the semantics of the data in the format (i.e. a canonical data model for that problem space) in which case alternative syntaxes are fine and we perform translations using RDF-based mapping technologies like DAML+OIL or OWL. The microformat solution which Tantek espouses is that we all agree on a canonical data model and a canonical syntax (typically some subset of [X]HTML). So far the approach that has gotten the most traction in the real world is XML. From my perspective, the reason for this is obvious; it doesn't require that everyone has to agree on a single data model or a single format for that problem space. Microformats don't solve the problem of different entities coming up with different names for the same concept. Instead their proponents are ignoring the reasons why the problem exists in the first place and then offering microformats as a panacea when they are not. I personally haven't seen a good explanation of why is better than ... A statement like that begs some homework. The accessibility, media independence, alternative devices, and web design communities have all figured this out years ago. This is Semantic (X)HTML 101. Please read any modern web design book like those on my SXSW Required Reading List, and we'll continue the discussion afterwards. I can see the reasons for a number of the semantic markup guidelines in the case of HTML. What I don't agree with is jumping to the conclusion that markup languages should never have presentat[...]



Using XML on the Web is Evil, Since When?

Thu, 28 Jul 2005 08:48:32 GMT

I've been reading some of the hype around microformats in certain blogs with some amusement. I have been ignoring microformats but now I see that some of their proponents have started claiming that using XML on the Web is bad and instead HTML is the only markup language we'll ever need. In her post Why generic XML on the Web is a bad idea Anne van Kesteren writes Of course, using XML or even RDF serialized as XML you can describe your content much better and in far more detail, but there is no search engine out there that will understand you. For RDF there is a chance one day they will. Generic XML on the other hand will always fail to work. (Semantics will not be extracted.) An example that shows the difference more clearly: Look at me when I talk to you! … and: Look at me when I talk to you! The latter element describes the content probably more accurately, but on ‘the web’ it means close to nothing. Because on the web it’s not humans who come by and try to parse the text, they already know how to read something correctly. No, software comes along and tries to make something meaningful of the above. As the latter is in a namespace no software will know and the latter is also not specified somewhere in a specification it will be ignored. The former however has been here since the beginning of HTML — even before its often wrongly considered presentational equivalent I — and will be recognized by software. This post in itself isn't that bad, if anything it is just somewhat misguided. However Tantek Celik followed it up with his post Avoiding plain XML and presentational markup which boggled my mind. Tantek wrote The marketing message of XML has been for people to develop their own tags to express whatever they wanted, rather than being stuck with the limited predefined tag set in HTML. This approach has often been labeled "plain XML" or "generic XML" or "SGML, but easier, better, and designed just for the Web". The problem with this approach is that while having the freedom to make up all your own tags and attributes sounds like a huge improvement over the (mostly perceived) limits of HTML, making up your own XML has numerous problems, both for the author, and for users / readers, especially when sharing with others (e.g. anything you publish on the Web) is important. This post by no means contains a complete set of arguments against plain/generic XML and presentational markup, nor are the arguments presented as definitive proofs. Mostly I wanted to share a bunch of reinforcing resources in one place. Readers are encouraged to improve upon the arguments made here. The original impetus for creating XML was to enable SGML on the Web. People had become frustrated with the limited tag set in HTML and the solution was to create a language that enabled content creators to create their own tags yet have them still readable in browsers via stylesheet technologies (e.g. CSS). Over time, XML has failed to take off as a generic docume[...]



Hacking MSN Virtual Earth

Wed, 13 Jul 2005 12:36:44 GMT

I stumbled on Bus Monster last week and even though I don't take the bus I thought it was a pretty cool application. There's a mapping application that I've been wanting for a few years and I instantly realized that given the Google Maps API I could just write it myself.

Before starting I shot a mail off to Chandu and Steve on the MSN Virtual Earth team and asked if their API would be able to support building the application I wanted. They were like "Hell Yeah" and instead of working on my review I started hacking on Virtual Earth. In an afternoon hacking session, I discovered that I could build the app I wanted and learned new words like geocoding.

My hack should be running internally on my web server at Microsoft before the end of the week. Whenever Virtual Earth goes live I'll move the app to my personal web site. I definitely learned something new with this application and will consider Hacking MSN Virtual Earth as a possible topic for a future Extreme XML column on MSDN. Would anyone be interested in that?




Apple Embraces and Extends RSS with iTunes 4.9

Tue, 28 Jun 2005 16:08:45 GMT

Today I learned that Apple brings podcasts into iTunes which is excellent news. This will definitely push subscribing to music and videos via RSS feeds into the mainstream. I wonder how long it'll take MTV to start providing podcast feeds. One interesting aspect of the announcement which I didn't see in any of the mainstream media coverage was pointed out to me in Danny Ayers's post Apple - iTunes - Podcasting where he wrote Apple - iTunes - Podcasting and another RSS 2.0 extension (PDF). There are about a dozen new elements (or “tags” as they quaintly describe them) but they don’t seem to add anything new. I think virtually everything here is either already covered by RSS 2.0 itself, except maybe tweaked to apply to the podcast rather than the item. They’ve got their own little category taxonomy and this delightful thing: This tag should be used to note whether or not your Podcast contains explicit material. There are 2 possible values for this tag: Yes or No I wondered at first glance whether this was so you could tell when you were dealing with good data or pure tag soup. However, the word has developed a new meaning: If you populate this tag with “Yes”, a parental advisory tag will appear next to your Podcast cover art on the iTunes Music Store This tag is applicable to both Channel & Item elements. So, in summary it’s a bit of a proprietary thing, released as a fait accompli. Ok if you’re targeting iTunes, for anything else use Yahoo! Media RSS. I wonder where interop went. This sounds interesting. So now developers of RSS readers that want to consume podcasts have to know how to consume the RSS 2.0 enclosure element, Yahoo!'s extensions to RSS and Apple's extensions to RSS to make sure they cover all the bases. Similarly publishers of podcasts also have to figure out which ones they want to publish. I guess all that's left is for Real Networks and Microsoft to publish their own extensions to RSS for providing audio and video metadata in RSS feeds to make it all complete. This definitely complicates my plans for adding podcasting support to RSS Bandit. And I thought the RSS 1.0 vs. RSS 2.0 vs. Atom discussions were exciting. Welcome to the world of syndication. PS: The title of this post is somewhat tongue in cheek. It was inspired by Slashdot's headline over the weekend titled Microsoft To Extend RSS about Microsoft's creation of an RSS module for making syndicating lists work better in RSS. Similar headlines haven't been run about Yahoo! or Apple's extensions to RSS but that's to be expected since we're Microsoft. ;) [...]
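To see why this is annoying for aggregator authors, here is a minimal C# sketch of reading both the plain RSS 2.0 enclosure and Apple's extension from one feed; the feed URL is a placeholder and the itunes namespace URI is the one from Apple's published extension spec:

using System;
using System.Xml;

class PodcastFeedDemo
{
    static void Main()
    {
        // Placeholder feed URL; the point is that several vocabularies can describe
        // the same audio attachment in a single RSS 2.0 document.
        var doc = new XmlDocument();
        doc.Load("http://example.org/podcast.xml");

        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("itunes", "http://www.itunes.com/dtds/podcast-1.0.dtd");

        foreach (XmlElement item in doc.SelectNodes("/rss/channel/item"))
        {
            // Plain RSS 2.0: the enclosure element carries the media URL.
            var enclosure = (XmlElement)item.SelectSingleNode("enclosure");
            // Apple's extension: itunes:explicit lives in its own namespace.
            var explicitFlag = item.SelectSingleNode("itunes:explicit", ns);

            Console.WriteLine("{0} (explicit: {1})",
                enclosure != null ? enclosure.GetAttribute("url") : "(no enclosure)",
                explicitFlag != null ? explicitFlag.InnerText : "unspecified");
        }
    }
}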



Joe Wilcox on Microsoft's Office Open XML formats

Mon, 20 Jun 2005 15:08:58 GMT

Joe Wilcox has a post that has me scratching my head today. In his post Even More on New Office File Formats, he writes

Friday's eWeek story about Microsoft XML-based formats certainly raises some questions about how open they really are. Assuming reporter Peter Galli has his facts straight, Microsoft's formats license "is incompatible with the GNU General Public License and will thus prevent many free and open-source software projects from using the formats." Earlier this month, I raised different concerns about the new formats' openness.

To reiterate a point I made a few weeks ago: Microsoft's new Office formats are not XML. The company may call them "Microsoft Office Open XML Formats," but they are XML-based, which is nowhere near the same as being XML or open, as has been widely misreported by many blogsites and news outlets.

There are two points I'd like to make here. The first is that "being GPL compatible" isn't a definition of 'open' that I've ever heard anyone use. It isn't even the definition of Open Source or Free Software (as in speech). Heck, even the GNU website has a long list of Open Source licenses that are incompatible with the GPL. You'll notice that this list includes the original BSD license, the Apache license, the Zope license, and the Mozilla Public License. I doubt that eWeek will be writing articles about how Apache and Mozilla are not 'open' because they aren't GPL compatible.

The second is that it's completely unclear to me what distinction Joe Wilcox is making between being XML and being XML-based. The Microsoft Office Open XML formats are XML formats. They are stored on disk as compressed XML files, using standard compression techniques that are widely available on most platforms. Compressing an XML file doesn't change the fact that it is XML. Reading his linked posts doesn't provide any insight into whether this is the distinction Joe Wilcox is making or whether there is another. Anyone have any ideas about this?
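To show how thin the "XML-based" layer is, here's a hedged C# sketch that treats one of the new files purely as a ZIP of XML parts and loads each part with a regular XML parser. It uses the ZipArchive API from much later versions of the .NET Framework purely for brevity (no such API shipped at the time), and the file name is hypothetical.

    using System;
    using System.IO;
    using System.IO.Compression; // ZipArchive/ZipFile; available in .NET 4.5+, shown purely for illustration
    using System.Xml;

    class OpenXmlPeek
    {
        static void Main()
        {
            // A hypothetical document saved in the new format; on disk it's just a ZIP of XML parts.
            using (ZipArchive package = ZipFile.OpenRead("report.docx"))
            {
                foreach (ZipArchiveEntry part in package.Entries)
                {
                    if (!part.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                        continue;

                    XmlDocument doc = new XmlDocument();
                    using (Stream stream = part.Open())
                    {
                        doc.Load(stream); // plain old XML once the ZIP layer is peeled away
                    }
                    Console.WriteLine("{0}: root element is <{1}>", part.FullName, doc.DocumentElement.Name);
                }
            }
        }
    }

Once a part is loaded you can run XPath or XSLT over it like any other XML document, which is really the point of the new formats.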




.NET vs. J2EE Performance Benchmarks for XML and XML Web Services

Wed, 08 Jun 2005 13:41:32 GMT

About a year ago, the folks at Sun Microsystems came up with a bunch of benchmarks that showed that XML parsing in Java was much faster than on the .NET Framework. On the XML team at Microsoft, we took this as a challenge to do much better in the next version of the .NET Framework. To see how much we improved, you can check out A comparison of XML performance on .NET 2.0 Beta2, .NET 1.1, and Sun Java 1.5 Platforms, which is available on MSDN.

In the three test cases, the Java 1.5 XML APIs are faster than those in the .NET Framework v1.1, both of which are about half as fast as the XML APIs in the .NET Framework v2.0. The source code for the various tests is available, so individuals can confirm the results for themselves on their own configurations.
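The whitepaper's methodology is far more careful than this, but if you just want a ballpark figure on your own hardware, a crude timing loop like the one below will do; it only assumes a local test document (the file name here is made up) and the new Stopwatch class in .NET 2.0.

    using System;
    using System.Diagnostics;
    using System.Xml;

    class ParseTimer
    {
        static void Main(string[] args)
        {
            string file = args.Length > 0 ? args[0] : "large-sample.xml"; // hypothetical test document
            const int iterations = 50;

            Stopwatch timer = Stopwatch.StartNew();
            long nodes = 0;
            for (int i = 0; i < iterations; i++)
            {
                using (XmlReader reader = XmlReader.Create(file))
                {
                    while (reader.Read())
                        nodes++; // touch every node so the parse isn't optimized away
                }
            }
            timer.Stop();

            Console.WriteLine("{0} nodes in {1} passes: {2} ms total",
                nodes, iterations, timer.ElapsedMilliseconds);
        }
    }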

A lot of the improvements in XML parsing on the .NET Framework are due to the excellent work of Helena Kupkova, who is also the author of the XmlBookMarkReader. Great stuff.

For the XML web services geeks there is also a similar comparison of XML Web Services Performance for .NET 2.0 Beta2, .NET 1.1, Sun JWSDP 1.5 and IBM WebSphere 6.0.




Microsoft Office and XML: Why not the OASIS OpenOffice.org XML format?

Sat, 04 Jun 2005 07:07:55 GMT

Since the recent announcement that the next version of Microsoft Office will use open XML formats as its default file formats, I've seen some questions raised about why the OpenOffice.org XML formats, which were standardized at OASIS, weren't used. This point is addressed in a comment by Jean Paoli in the article Microsoft to 'Open' Office File Formats, which is excerpted below

"We have legacy here," Jean Paoli, Senior Microsoft XML Architect, told BetaNews. "It is our responsibility to our users to provide a full fidelity format. We didn't see any alternative; believe me we thought about it. Without backward compatibility we would have other problems." "Yes this is proprietary and not defined by a standards body, but it can be used by and interoperable with others. They don't need Microsoft software to read and write. It is not an open standard but an open format," Paoli explained. When asked why Microsoft did not use the OASIS (Organization for the Advancement of Structured Information Standards) OpenOffice.org XML file format, Paoli answered, "Sun standardized their own. We could have used a format from others and shoehorned in functionality, but our design needs to be different because we have 400 million legacy users. Moving 400 million users to XML is a complex problem."

There is also somewhat of a double standard at play here. The fact that we are Microsoft means that we will get beaten up by detractors no matter what we do. When Sun announced Java 5.0 (a.k.a. Java 1.5) with a feature set that looked a lot like C#'s, I don't remember anyone asking why they continued to invest in their proprietary programming language and platform instead of just using C# and the CLI, which have been standardized by both ECMA and ISO. If Microsoft had modified the OpenOffice.org XML file format so that it was 100% backwards compatible with previous versions of Microsoft Office, it is likely the same people would be yelling "embrace and extend". I'm glad the Office guys went the route they chose instead. Use the right tool for the job instead of trying to turn a screwdriver into a hammer.

It's a really powerful thing that the most popular office productivity suite on the planet is wholeheartedly embracing open formats and XML. It's unfortunate that some want to mar this announcement with partisan slings and arrows instead of recognizing the goodness that will come from ending the era of closed binary document formats on the desktop. [...]



Next Version of Office Will Use XML as the Default File Format

Thu, 02 Jun 2005 11:08:31 GMT

About two and a half years ago, I was hanging out with several members of the Office team at the XML 2002 conference as they gave out the details of how Office 2003 would support XML file formats. Now that it's 2005, juicy information like that is transmitted using blogs. Brian Jones has a blog post entitled New default XML formats in the next version of Office where he reveals some of the details of XML support in the next version of Office. He writes

Open XML Formats Overview

To summarize really quickly what’s going on, there will be new XML formats for Word, Excel, and PowerPoint in the next version of Office, and they will be the default for each. Without getting too technical, here are some basic points I think are important:

Open Format: These formats use XML and ZIP, and they will be fully documented. Anyone will be able to get the full specs on the formats and there will be a royalty free license for anyone that wants to work with the files.

Compressed: Files saved in these new XML formats are less than 50% the size of the equivalent file saved in the binary formats. This is because we take all of the XML parts that make up any given file, and then we ZIP them. We chose ZIP because it’s already widely in use today and we wanted these files to be easy to work with. (ZIP is a great container format. Of course I’m not the only one who thinks so… a number of other applications also use ZIP for their files too.)

Robust: Between the usage of XML, ZIP, and good documentation the files get a lot more robust. By compartmentalizing our files into multiple parts within the ZIP, it becomes a lot less likely that an entire file will be corrupted (instead of just individual parts). The files are also a lot easier to work with, so it’s less likely that people working on the files outside of Office will cause corruptions.

Backward compatible: There will be updates to Office 2000, XP, and 2003 that will allow those versions to read and write this new format. You don’t have to use the new version of Office to take advantage of these formats. (I think this is really cool. I was a big proponent of doing this work)

Binary Format support: You can still use the current binary formats with the new version of Office. In fact, people can easily change to use the binary formats as the default if that’s what they’d rather do.

New Extensions: The new formats will use new extensions (.docx, .pptx, .xlsx) so you can tell what format the files you are dealing with are, but to the average end user they’ll still just behave like any other Office file. Double click & it opens in the right application.

...

Whitepapers

The Microsoft Office Open XML Formats: New File Formats for "Office 12" http://download.microsoft.com/download/c/2/9/c2935f83-1a10-4e4a-a137-c1db829637f5/Office12NewFileFormatsWP.doc

The Microsoft Office Open XML Formats: Preview for Developers h[...]



Jonathan Marsh On XInclude and XML Schema

Wed, 18 May 2005 13:01:58 GMT

It seems Jonathan Marsh has joined the blogosphere with his new blog Design By Committee. If you don't know Jonathan Marsh, he's been one of Microsoft's representatives at the W3C for several years and has been an editor of a variety of W3C specifications including XML Base, the XPointer Framework, and XInclude. In his post XML Base and open content models Jonathan writes

There is a current controversy about XInclude adding xml:base attributes whenever an inclusion is done. If your schema doesn't allow those attributes to appear, your document won't validate. This surprises some people, since the invalid attributes were added by a previous step in the processing chain (in this case XInclude), rather than by hand. As if that makes a difference to the validator! Norm Walsh, after a false start, correctly points out this behavior was intentional. But he doesn't go the next step to say that this behavior is vital! The reason xml:base attributes are inserted is to keep references and links from breaking. If the included content has a relative URI, and the xml:base attribute is omitted, the link will no longer resolve - or worse, it will resolve to the wrong thing. Can you say "security hole"? Sure it's inconvenient to fail validation when xml:base attributes are added, especially when there are no relative URIs in the included content (and thus the xml:base attributes are unnecessary). But hey, if you wanted people or processes to add attributes to your content model, you should have allowed them in the schema!

I agree that the working group tried to address a valid concern. But this seems to me to be a case of the solution being worse than the problem. To handle a situation for which workarounds exist in practice (i.e. document authors should use absolute URIs instead of relative URIs in documents), the XInclude working group handicapped using XInclude as part of the processing chain for documents that will be validated with XML Schema. Since the problem they were trying to solve exists in instance documents, even if document authors don't follow the general guideline of favoring absolute URIs over relative URIs, those relative URIs can be expanded in a single pass using XSLT before the documents are processed up the chain by XInclude. On the other hand, if a schema doesn't allow xml:base attributes everywhere (which describes basically every XML format in existence), then one cannot use XInclude as part of the pipeline that creates a document if the final document will undergo schema validation. I think the working group optimized for an edge case but ended up breaking a major scenario. Unfortunately this happens a lot more than it should in W3C specifications. [...]



XInclude and W3C XML Schema Will Play Nice Together in .NET Framework v2.0

Mon, 16 May 2005 15:24:18 GMT

Stan Kitsis, who replaced me as the XML Schema program manager on the XML team, has a blog post about XInclude and schema validation where he writes

A lot of people are excited about XInclude and want to start using it in their projects. However, there is an issue with using both XInclude and xsd validation at the same time. The issue is that XInclude adds xml:* attributes to the instance documents while xsd spec forces you to explicitly declare these attributes in your schema. Daniel Cazzulino, an XML MVP, blogged about this a few months ago: "W3C XML Schema and XInclude: impossible to use together???"

To solve this problem, we are introducing a new System.Xml validation flag, AllowXmlAttributes, in VS2005. This flag instructs the engine to allow xml:* attributes in the instance documents even if they are not defined in the schema. The attributes will be validated based on their data type.

This design flaw in the aforementioned XML specifications is a showstopper that prevents one from performing schema validation with XSD on documents that have been pre-processed with XInclude, unless the schema designer decided up front that they wanted their format to be used with XInclude. This is fundamentally broken. The sad fact is that, as Norm Walsh pointed out in his post XInclude, xml:base and validation, this was a problem the various standards groups were aware of but decided to dump on implementers and users anyway. I'm glad the Microsoft XML team decided to make this change and fix a problem that was ignored by the W3C standards groups involved.
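For reference, here's roughly what using the new flag looks like against the v2.0 APIs as I understand them; the schema and document names are made up.

    using System;
    using System.Xml;
    using System.Xml.Schema;

    class ValidateIncludedDoc
    {
        static void Main()
        {
            XmlReaderSettings settings = new XmlReaderSettings();
            settings.ValidationType = ValidationType.Schema;
            settings.Schemas.Add(null, "orders.xsd"); // hypothetical schema that never declares xml:base

            // The new flag: tolerate xml:* attributes (e.g. xml:base injected by XInclude)
            // even though the schema doesn't declare them.
            settings.ValidationFlags |= XmlSchemaValidationFlags.AllowXmlAttributes;

            using (XmlReader reader = XmlReader.Create("orders-after-xinclude.xml", settings))
            {
                while (reader.Read()) { } // other schema violations still throw XmlSchemaValidationException
            }
            Console.WriteLine("Document validated despite the injected xml:base attributes.");
        }
    }

Note that the flag only relaxes the handling of xml:* attributes; everything else in the schema is still enforced as usual.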




UPDATED: Things to note about foreach and System.Xml.XPath.XPathNodeIterator

Tue, 10 May 2005 17:23:51 GMT

Oleg Tkachenko has a post about one of the changes I was involved in while I was the program manager for XML programming models in the .NET Framework. In the post foreach and XPathNodeIterator - finally together Oleg writes

This one little improvement in System.Xml 2.0 Beta2 is sooo cool anyway: XPathNodeIterator class at last implements IEnumerable! Such unification with the .NET iteration model means we can finally iterate over nodes in an XPath selection using the standard foreach statement:

    XmlDocument doc = new XmlDocument();
    doc.Load("orders.xml");
    XPathNavigator nav = doc.CreateNavigator();
    foreach (XPathNavigator node in nav.Select("/orders/order"))
        Console.WriteLine(node.Value);

Compare this to what we have to write in .NET 1.X:

    XmlDocument doc = new XmlDocument();
    doc.Load("../../source.xml");
    XPathNavigator nav = doc.CreateNavigator();
    XPathNodeIterator ni = nav.Select("/orders/order");
    while (ni.MoveNext())
        Console.WriteLine(ni.Current.Value);

Needless to say - that's the case when just a dozen lines of code can radically simplify a class's usage and improve overall developer productivity. How come this wasn't done in .NET 1.1 I have no idea.

One of the reasons we were hesitant about adding support for the IEnumerable interface to the XPathNodeIterator class is that the IEnumerator returned by the IEnumerable.GetEnumerator method has to have a Reset method. However, a run-of-the-mill XPathNodeIterator does not have a way to reset its current position. That means that code like the following has problems:

    XmlDocument doc = new XmlDocument();
    doc.Load("orders.xml");
    XPathNodeIterator it = doc.CreateNavigator().Select("/root/*");
    foreach (XPathNavigator node in it)
        Console.WriteLine(node.Name);
    foreach (XPathNavigator node in it)
        Console.WriteLine(node.Value);

The problem is that after the first loop the XPathNodeIterator is positioned at the end of the sequence of nodes, so the second loop won't print any values. This violates the contract of IEnumerable and differs from the behavior of practically every other class that implements the interface. We considered adding an abstract Reset() method to the XPathNodeIterator class in Whidbey, but this would have broken implementations of that class written against previous versions of the .NET Framework. Eventually we decided that even though the implementation of IEnumerable on XPathNodeIterator would behave incorrectly when looping over the class multiple times, this was an edge case that shouldn't prevent us from improving the usability of the class. Of course, it is probable that someone will eventually be bitten by this weird behavior, but we felt the improved usability was worth the trade-off. Yes, backwards compatibility is a pain.

UPDATE: Andrew Kimball, who's one of the developers working on XSLT and XPath [...]
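If you genuinely need two passes over the same selection, the cheap workaround is to iterate over a Clone() of the iterator (or just re-run the Select). A quick sketch, with a hypothetical orders.xml:

    using System;
    using System.Xml;
    using System.Xml.XPath;

    class IterateTwice
    {
        static void Main()
        {
            XmlDocument doc = new XmlDocument();
            doc.Load("orders.xml");

            XPathNodeIterator it = doc.CreateNavigator().Select("/orders/order");

            // Clone() copies the iterator at its current (initial) position,
            // so the original is still usable for a second pass.
            foreach (XPathNavigator node in it.Clone())
                Console.WriteLine(node.Name);

            foreach (XPathNavigator node in it)
                Console.WriteLine(node.Value);
        }
    }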



Microsoft licensed Mvp.Xml library

Mon, 09 May 2005 14:09:52 GMT

A little while ago I noticed a post by Oleg Tkachenko entitled Microsoft licensed Mvp.Xml library where he wrote On behalf of the Mvp.Xml project team our one and only lawyer - XML MVP Daniel Cazzulino aka kzu has signed a license for Microsoft to use and distribute the Mvp.Xml library. That effectively means Microsoft can (and actually wants to) use and distribute XInclude.NET and the rest of the Mvp.Xml goodies in their products. Wow, I'm glad XML MVPs could come up with something so valuable that Microsoft decided to license it. The Mvp.Xml project is developed by Microsoft MVPs in XML technologies and XML Web Services worldwide. It is aimed at supplementing the .NET Framework functionality available through the System.Xml namespace and related namespaces such as System.Web.Services. Mvp.Xml library version 1.0, released in January 2005, includes the Common, XInclude.NET and XPointer.NET modules. As a matter of interest - Mvp.Xml is an open-source project hosted at SourceForge.

Since Oleg wrote this I've seen other Microsoft XML MVPs mention the event, including Don Demsak and Daniel Cazzulino. I think this is very cool and something I feel somewhat proud of, since I helped get the XML MVP program started. A few years ago, as part of my duties as the program manager responsible for putting together a community outreach plan for the XML team, I decided that we needed an MVP category for XML. I remember the first time I brought it up with the folks who run the Microsoft MVP program; they thought it was such a weird idea since there were already categories for languages like C# and VB, but XML was just a config file format and didn't seem to require significant expertise. I was persistent and pointed out that a developer could be a guru at XSLT, XPath, XSD, DOM, etc. without necessarily being a C# or VB expert. Eventually they buckled and an MVP category for XML was created. Besides getting the XML Developer Center on MSDN launched, getting the XML MVP program started was probably my most significant achievement as part of my community outreach duties while on the XML team at Microsoft.

Now it is quite cool to see this community of developers contributing so much value to the .NET XML community that Microsoft has decided to license their technologies. I definitely want to build a similar developer community around the stuff we are doing at MSN once we start shipping our APIs. I can't wait to get our next release out the door so I can start talking about stuff in more detail. [...]



Adam Bosworth's Web of Data: Is RSS the only API your Website Needs?

Wed, 27 Apr 2005 11:52:37 GMT

Daniel Steinberg has an article entitled Bosworth's Web of Data where he discusses some of the ideas Adam Bosworth evangelized in his keynote at the MySQL Users Conference 2005. Daniel writes,

Bosworth explained that the key factors that enabled the web began with simplicity. HTTP was simple enough that any "P" language or JavaScript programmer could build applications. On the consumption side, web browsers such as Internet Explorer 4 were committed to rendering whatever they got. This meant that people could be sloppy and they didn't need to be high priests of syntax. Because it was a sloppy standard, people who otherwise couldn't have authored content did. The fact that it was a standard allowed this single, simple, sloppy, open wire format to run on every platform. ... The challenge is to take a database and do for the web what was done for content. Bosworth explained that you "need a model that allows for massively linear scalability and federation of information that can spread effortlessly across a federated web." Solutions that were suggested were to use XML and XQuery. The problem with XML is that unlike HTML, there is not a single grammar. This removed the simple and sloppy aspects of the web. The problem with XQuery is the time it took to finish the specification. Bosworth noted that it took more than four years and that "anything that takes four years is not worth doing. It is over-designed. Instead, take six months and learn from customers." ... The next solution used web services, which began as an easy idea: you send an XML request and you get XML back. Instead, the collection of WS-* specs were huge and again, overly complicated. Bosworth said that this was a deliberate effort on the part of the companies that control the specs, like IBM and Microsoft, which deliberately made the specification hard, because then only they could deliver technology to do it. ... Bosworth predicts that RSS 2.0 and Atom will be the lingua franca that will be used to consume all data from everywhere. These are simple formats that are sloppily extensible. Anyone who wants to can use these formats to consume content or to author content. Contrast this with the Semantic Web, which requires that you get a large group of people to agree on the schema of everything.

There are lots of interesting ideas here. I won't dwell on the criticisms of XQuery & WS-*, mainly because I tend to agree that they are both overdesigned and complicated. I also won't dwell on the apparent contradiction inherent in claiming that the Semantic Web is doomed because it requires people to agree on the same schema for everything and then proposing that everyone agree on using RSS as the schema for all data on the Web. I have a suspicion of what he sees as the difference but I'll[...]



Fun with XMLHttpRequest and RSS: Browsing Photo Albums on MSN Spaces

Sun, 24 Apr 2005 22:16:03 GMT

I mentioned in a recent post that I was considering writing an article entitled Using Javascript, XMLHttpRequest and RSS to create an MSN Spaces photo album browser. It actually took less work than I thought to build a webpage that allows you to browse the photo albums in a particular person's Space or from a randomly chosen Space.

I haven't used Javascript in about five years, but it didn't take much to put the page together, thanks mostly to the wealth of information available on the Web.

You can find the page at http://www.25hoursaday.com/spaces/photobrowser.html

The page requires Javascript and currently only works in Internet Explorer but I'm sure that some intrepid soul could make it work in Firefox in a couple of minutes. If you can, please send me whatever changes are necessary.




Contract-First XML Web Service Design is No Panacea

Fri, 22 Apr 2005 14:05:14 GMT

Every once in a while I see articles like Aaron Skonnard's Contract-First Service Development which make me shake my head in sorrow. His intentions are good, but quite often advising people to design their XML Web services starting from an XSD/WSDL file instead of a more restricted model leads to more problems than what some have labelled the "code-first" approach. For example, take this recent post to the XML-DEV mailing list entitled incompatible uses of XML Schema

I just got a call from a bespoke client (the XML guru in a large bank) asking whether I knew of any XML Schema refactoring tools. His problem is that one of their systems (from a big company) does not handle recursive elements. Another one of their systems (from another big company) does not handle substitution groups (or, at least, dynamic use of xsi:type.) Another of their systems (from a third big company) does not handle wildcards. (Some departments also used another tool that generated ambiguous schemas.) This is causing them a major headache: they are having to refactor 7,000 element schemas by hand to munge them into forms suited for each system. Their schema-centricism has basically stuffed up the ready interoperability they thought they were buying into with XML, on a practical level. This is obviously a trap: moving to a services-oriented architecture means that the providers can say "we provide XML with a schema" and the pointy-headed bosses can say "you service-user: this tool accepts XML with a schema so you must use that!" and the service-user has little recourse.

This is one of the problems of contract-first development that many of the consultants, vendors and pundits who are extolling its virtues fail to mention. A core fact of building XML Web services that use WSDL/XSD as the contract is that most people will use object<->XML mapping technologies to either create or consume the web services. However, there are fundamental impedance mismatches between the W3C XML Schema Definition (XSD) language and objects in a traditional object oriented programming language that ensure these mappings will be problematic. I have written about these impedance mismatches several times over the past few years, including posts such as The Impedence Mismatch between W3C XML Schema and the CLR. Every XML Web Service toolkit that consumes WSDL/XSD and generates objects has different parts of the XSD spec that it either fails to handle or handles inadequately. Many of the folks encouraging contract-first development refuse to acknowledge that if developers build schemas by hand for use in XML Web Services, it is likely they will end up using capabilities of XSD that are not supported by one or more of their consuming[...]
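To make the impedance mismatch concrete, here's a hedged sketch of the kind of weakly typed shape that schema constructs like choices and wildcards tend to collapse into when mapped to classes; the class is hand-written for illustration rather than generated from any particular WSDL.

    using System;
    using System.Xml;
    using System.Xml.Serialization;

    // What a schema with a choice group and an open-content wildcard typically
    // degenerates into on the object side: loosely typed members, with the
    // schema's intent recoverable only by inspecting runtime types and names.
    public class PurchaseOrder
    {
        // An xs:choice between two element declarations becomes a single
        // object-typed member; callers get their static typing back via casts.
        [XmlElement("CreditCard", typeof(CreditCardPayment))]
        [XmlElement("Invoice", typeof(InvoicePayment))]
        public object Payment;

        // An xs:any wildcard surfaces as raw DOM nodes; substitution groups
        // and xsi:type tricks fare no better in many toolkits.
        [XmlAnyElement]
        public XmlElement[] Extensions;
    }

    public class CreditCardPayment { public string Number; }
    public class InvoicePayment { public string PoNumber; }

    class Demo
    {
        static void Main()
        {
            PurchaseOrder po = new PurchaseOrder();
            InvoicePayment invoice = new InvoicePayment();
            invoice.PoNumber = "PO-1234";
            po.Payment = invoice;

            // Serializes the choice member as an <Invoice> element.
            new XmlSerializer(typeof(PurchaseOrder)).Serialize(Console.Out, po);
        }
    }

Hand a schema full of such constructs to a toolkit that doesn't support one of them and you end up exactly where the banker in the quote above did.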



Ideas for my next Extreme XML column on MSDN

Tue, 19 Apr 2005 15:49:24 GMT

It looks like I didn't get an Extreme XML column out last month. Work's been hectic, but I think I should be able to start on a column by the end of the week and get it done before the end of the month. I have a few ideas I'd like to write about but, as usual, I'm curious as to what folks would be interested in reading about. Below are three article ideas in order of preference. 

  1. Using Javascript, XMLHttpRequest and RSS to create an MSN Spaces photo album browser: The RSS feed for a space on MSN Spaces contains information about the most recent updates to a user's blog, photo album and lists. RSS items containing lists are indicated by using the msn:type element with the value "photoalbum". It is possible to build a photo album browser for various spaces by using a combination of Javascript for dynamic display and XMLHttpRequest for consuming the RSS feed. Of course, my code sample will be nowhere near as cool as the Flickr related tag browser.

  2. Fun with operator overloading and XML: This would be a follow-up piece to my Overview of Cω article. This article explores how one could simulate adding XML-specific language extensions by overloading various operators on the System.Xml.XmlNode class.

  3. Processing XML in the Real World: 10 Things To Worry About When Processing RSS feeds on the Web: This will be an attempt to distill the various things I've learned over the two years I've been working on RSS Bandit. It will cover things like how to properly use the System.Xml.XmlReader class for processing RSS feeds in a streaming fashion, bandwidth-saving tips from GZip encoding to sending If-Modified-Since/If-None-Match headers in the request, and dealing with proxy servers and authentication (there's a rough sketch of the HTTP bits right after this list).
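For the curious, here's a rough cut of the HTTP plumbing behind idea #3 - a conditional, compressed feed fetch with HttpWebRequest; the feed URL and cached validators are placeholders.

    using System;
    using System.Net;
    using System.Xml;

    class PoliteFeedFetcher
    {
        static void Main()
        {
            // Placeholder feed URL; in a real reader you'd cache these validators from the last fetch.
            string feedUrl = "http://example.org/blog/rss.xml";
            DateTime lastFetched = DateTime.Now.AddHours(-1);
            string lastETag = null;

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(feedUrl);
            request.UserAgent = "SampleFeedReader/0.1";
            request.IfModifiedSince = lastFetched;                       // If-Modified-Since header
            if (lastETag != null)
                request.Headers["If-None-Match"] = lastETag;             // ETag validator
            request.AutomaticDecompression = DecompressionMethods.GZip;  // adds Accept-Encoding: gzip

            try
            {
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (XmlReader reader = XmlReader.Create(response.GetResponseStream()))
                {
                    while (reader.Read()) { /* hand off to the real feed parser here */ }
                    Console.WriteLine("Feed changed; new ETag: {0}", response.Headers["ETag"]);
                }
            }
            catch (WebException ex)
            {
                HttpWebResponse resp = ex.Response as HttpWebResponse;
                if (resp != null && resp.StatusCode == HttpStatusCode.NotModified)
                    Console.WriteLine("304 Not Modified - nothing to download.");
                else
                    throw;
            }
        }
    }

The real thing would need to layer on the proxy and authentication handling mentioned above, but this is the shape of the polite-client basics.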

Which ones would you like to see and/or what is your order of preference?
