
Is XML bad for big data?

Mike Driscoll continues his attack against XML for Big Data. He points out three reasons why XML and Big Data are strange bedfellows.

  • XML spawns data bureaucracy, which is why JSON exists
  • Size matters and XML is not exactly concise
  • XML is complex and has a cost
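
To make the size point concrete, here is a minimal sketch that serializes the same record both ways and compares byte counts. The record and its field names are hypothetical, invented only for illustration, and the element-per-field XML layout is just one of many possible encodings, so treat the result as indicative rather than as a benchmark.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented purely for illustration.
record = {"id": "seq1", "organism": "E. coli", "length": 4641652}


def to_json_bytes(rec):
    """Serialize the record as compact JSON (no extra whitespace)."""
    return json.dumps(rec, separators=(",", ":")).encode("utf-8")


def to_xml_bytes(rec):
    """Serialize the record as XML, one child element per field."""
    root = ET.Element("record")
    for key, value in rec.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    # The default encoding returns bytes without an XML declaration.
    return ET.tostring(root)


json_size = len(to_json_bytes(record))
xml_size = len(to_xml_bytes(record))
print(f"JSON: {json_size} bytes, XML: {xml_size} bytes")
```

Every field name appears twice in the XML version (opening and closing tag), which is where most of the extra bytes come from in this layout.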

One of my problems with XML, and this comes from someone who loves markup, has always been that it is used in ways it was never intended to be used, or at least I hope it was not. It is a representational format for documents, but it ended up becoming the format for all kinds of data standards and, worst of all, data transport. Mike proposes some rules:

  • Don’t invent new formats. I think Hari will wholeheartedly agree with this one. This, in particular, is the bane of science. We invent new formats all the time, and sometimes people will just make the choice for you by refusing to use one over the other. Perhaps if the RCSB had listened to Jon Udell, whom Mike quotes, the whole mmCIF snafu would have been avoided.
  • Obey the fifteen minute rule, AKA too much complexity is going to kill adoption. I don’t completely agree with this one. Sometimes complexity is necessary, but most of the time it only creates issues.
  • Embrace lazy data modeling. This one will probably be controversial, since data folks love data models. However, as Matt Wood would attest, predicting how our data models are going to evolve, how data types are going to evolve, and how our usage of those data is going to evolve is very hard.

In the end, this boils down to being practical. We need to process large quantities of data in a world where the data and their sources change all the time. This is even more true in science. We need to be pragmatic about how we handle data. From where I stand, the keys are:

  • Consistency. Don’t have three files in which the same columns mean different things.
  • Simplicity. If things are going to change, and you know they are, don’t try to come up with rules and schemas up front. In those cases you end up trying to shoehorn everything into a model that’s going to break at some point.
  • Prevent format explosion. Or as Mike put it, don’t invent new formats.

As informaticians, we dread many of these problems. As informaticians dealing with Big Data, we had better lead the way in educating against bad practices, or perhaps we just need to be pragmatic, suck it up and keep churning out those parsers. I’ll take the first option. Parsers (and bioinformaticians) don’t scale.

This entry was posted in Big Data, Informatics.

4 Comments

  1. Posted August 23, 2009 at 01:02 | Permalink

    People just don't get this: XML is bad for things which are already well-defined (like that infamous proposal to mark up a DNA sequence in XML). XML is good for when things are *not* well-defined and you need to add the *markup* to make sense out of it.

    Transferring 1GB of sequence data goes as plain sequence data, not XML. When you have a complex data model with *sparse* data, with many (optional) fields and more than 50 data types, you use XML.

    That's all. Now stop whining about using a hammer for cleaning your nails.

    (Deepak, this is not directed at you, but primarily at the author of the blog you cite.)

  2. Posted August 23, 2009 at 10:19 | Permalink

    I am not quite as militant as Mike, since I do agree with you on that hammer and nails comment. However, in my experience (and all you have to do is look at CDISC, HL7, etc.), what happens is that people try to do exactly that, and it slows down progress as you get locked down in committee. And I speak directly to the sequence part. In many of the transport protocols being developed, the sequences would NOT be transferred as sequences, but encapsulated in XML. That's the kind of bureaucracy that Mike speaks to.

  3. sumit
    Posted August 23, 2009 at 23:15 | Permalink

    Hi ,

    Great article; it made me stop and think before proceeding with my conversion to XML. Could you advise me on my case below?

    I am planning to convert my 1000-node flat-file configuration setup to an XML file. Every record has around 10 fields. I am also planning to use the mxml library for read/write operations. Before starting, I need to confirm:

    1) Will converting a big data file to XML become a bottleneck in terms of memory utilization?
    2) Is there an efficient approach other than loading the complete XML file into memory at the start of the process?
    3) How can we calculate the memory that an XML file is going to consume?

    Thanks
    sumit
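
On the second question above, one common alternative to loading the whole document is event-based (streaming) parsing, where each record is handled and discarded as soon as its closing tag is seen. The sketch below uses Python's standard library rather than mxml, which the commenter plans to use, so it only illustrates the idea. The `config`, `node`, `name` and `port` element names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_nodes(source, tag="node"):
    """Yield one dict per <node> element as soon as it has been fully parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # At the "end" event all children of this element are available.
            yield {child.tag: child.text for child in elem}
            # Release the element's children and text. The emptied element
            # itself stays attached to the root, so memory still grows a
            # little with each record.
            elem.clear()


# Hypothetical two-record configuration file, held in memory for the demo.
sample = io.BytesIO(
    b"<config>"
    b"<node><name>a</name><port>80</port></node>"
    b"<node><name>b</name><port>443</port></node>"
    b"</config>"
)

nodes = list(iter_nodes(sample))
print(nodes)
```

For a file with around 1000 records of 10 fields each, either approach is likely to be cheap; streaming starts to matter when the file is too large to hold comfortably as a tree.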

  • Disclaimer

    All opinions on this blog are my own and do not reflect those of my employers, past or present