3 posts tagged “data standards”
BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.
SyMBA Demo. The lunch hour was also the demo hour. People came to visit me at the SyMBA demo desk for the whole hour, and we had some interesting conversations. There is one particular question I would like to relate from that hour: what should a bioinformatician choose as an output/export format for multi-omics data? This post relates my thoughts about this challenge. It's not meant to be comprehensive: just some ramblings.
I solve this challenge in SyMBA by storing everything as FuGE objects, which can be exported to FuGE-ML. Using an XSLT, FuGE-ML can then be converted into ISA-TAB and into an HTML format that mimics ISA-TAB. Because of this link between FuGE and ISA-TAB, you can make use of two complementary formats.
But to a bioinformatician who has just been tasked with building an application (and generally on a short time-scale), how do they choose what export format to use, e.g. FuGE or ISA-TAB? There are considerations of:
- scale: lightweight or heavyweight implementation. A lightweight implementation might favor your own version of ISACreator and the use of ISA-TAB, or a FuGE-based archive (but not a full-blown LIMS) like SyMBA. A heavyweight solution might be a full LIMS such as PIMS, or another FuGE implementation in development called SysFusion.
- intent: what is the purpose of storing this data? Is it for later analysis? For later deposition to a public database, e.g. at the EBI? Is it archiving? Is it a combination of these things? Your intent will shape what type of application you build, and what formats you focus your effort on. If your intent is storage only, choose whatever is most convenient for your users. However, these days there is always some aspect of data sharing or publishing. If you need further analysis of the data, then you probably want to be able to produce a computationally-friendly format such as XML. If your intent is submission to public databases, you need to ensure you export in a format they import.
Unfortunately, what this means is that the decision depends on the circumstances. FuGE and ISA-TAB are linked, and so you really get two for the price of one with those. I see this sort of thing as a positive - you have a choice as to the representation, storage and export of your data - a choice of formats! And many, like FuGE and ISA-TAB, are going to be easily convertible. The choice depends on your needs, but there is one easy choice: use something that's already been developed - don't reinvent the wheel!
Anyone else have any further suggestions?
Today was the first day of the workshop - back at the good old EBI, though it isn't as recognizable as it used to be. Sure, there is the new EBI extension, but I am used to that now. However, they're renovating the inside of the old EBI building as well, reducing many of my friends to portakabin living over the winter months: better them than me!
Today definitely had an emphasis on the "work" part of "workshop". While a large part of the work on the XSLT for converting between FuGE and ISA-TAB is complete, some of the slightly stickier areas of the conversion are still being worked on. We spent today on trying to iron out some of the difficulties that arise from trying to convert the sort of rich tree structure that you get from the XML implementation of FuGE (FuGE-ML) into the flatter tabular format of ISA-TAB. Below are some of the more general ideas that we were throwing around as a result. (Some are more directly related to the conversion process than others - but all raise interesting points to me.)
- One of the columns in the ISA-TAB Assay file is currently named "Raw Data File" in the 1.0 Specification. This caused a large amount of discussion as to what "raw" means, since many people have a different idea of what a raw data file is. It was originally named this way to act as a foil against another (optional) column name, "Derived Data File". However, derived data files have a more precise definition in ISA-TAB - such a column can only be used to name files resulting from data transformations or processing. In the end, we are considering a name change, from "Raw Data File" to "Data File".
- In the end, there will be a few simple ways to format your FuGE-ML files in a way that will aid the conversion into ISA-TAB. It would be useful to eventually produce a set of guidelines to aid in interoperability.
- Some of the developers already using FuGE (myself included) are using the <Description> element within a FuGE-ML file as a way to allow our biologists to give a free-text description to both materials and data files. There is no specific element in these objects to add such information, and therefore the generic Description element is the best location. This isn't exactly as per FuGE best-practices, where the default Description elements are really only meant for private comments within a local FuGE implementation, and can normally be ignored by external bioinformaticians making use of your FuGE-ML. Such material and data descriptions can be copied into the ISA-TAB file as free text within the Comment[] columns, where what sits within the "[]" is the material or data identifier. We'll have to see if this idea turns out to be useful.
- The main challenge in collapsing FuGE-ML into ISA-TAB is ensuring that the multi-level protocol application structures (for more information, see the GenericProtocolApplication and GenericProtocol objects within the FuGE Object Model) are correctly converted. We spent the majority of today trying to figure out an elegant way of doing this. We'll work on it again tomorrow, and will hopefully have a new version of the XSLT with a first-bash solution tomorrow evening!
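To make the tree-to-table problem in the points above a little more concrete, here is a toy sketch. It is not the workshop XSLT and the XML is not real FuGE-ML: the element names (`ProtocolApplication`, `Material`, `Description`), the helper functions `flatten` and `to_tab_delimited`, and the output column names are all made up for illustration. It only shows the general idea: walk the nested protocol applications, emit one row per material along with the chain of protocols above it, and copy any free-text description into a `Comment[...]` column keyed by the material identifier.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified XML. This is NOT the real FuGE-ML schema.
TOY_XML = """
<Investigation>
  <ProtocolApplication name="growth">
    <Material id="culture1">
      <Description>Grown overnight at 37C</Description>
    </Material>
    <ProtocolApplication name="extraction">
      <Material id="extract1"/>
      <Material id="extract2">
        <Description>Low yield</Description>
      </Material>
    </ProtocolApplication>
  </ProtocolApplication>
</Investigation>
"""


def flatten(element, protocol_path=()):
    """Yield one dict per Material, recording the protocols nested above it."""
    for child in element:
        if child.tag == "ProtocolApplication":
            # Descend one level, extending the chain of protocol names.
            yield from flatten(child, protocol_path + (child.get("name"),))
        elif child.tag == "Material":
            row = {
                "Material Name": child.get("id"),
                "Protocol Path": " > ".join(protocol_path),
            }
            description = child.findtext("Description")
            if description:
                # Free text goes into a Comment[] column named after the identifier.
                row[f"Comment[{child.get('id')}]"] = description.strip()
            yield row


def to_tab_delimited(rows):
    """Write rows as tab-delimited text, using the union of all column names."""
    rows = list(rows)
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = list(flatten(ET.fromstring(TOY_XML)))
table = to_tab_delimited(rows)
print(table)
```

Even in this toy version the sticky part is visible: the nesting has to be squeezed into a single "Protocol Path" cell, and a real multi-level protocol application structure carries far more than a name at each level.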
Tomorrow is the first day of a two-day workshop set up to continue the integration process between the ISA-TAB format and the FuGE standard. (Well, technically, it starts tonight with a workshop dinner, where I'll get to catch up with the people in the workshop, many of whom I haven't seen since the MGED 11 meeting in Italy this past summer. Should be fun!)
ISA-TAB can be seen as the next generation of MAGE-TAB, a very popular format with biologists who need to get their data and metadata into a common format accepted by public repositories such as ArrayExpress. ISA-TAB goes one step further, and does for tabular formats what FuGE does for object models and XML formats: that is, it is able to represent multi-omics experiments rather than just the transcriptomics experiments of MAGE-TAB. I encourage you to find out more about both FuGE and ISA-TAB by looking at their respective project pages. The FuGE group also has a very nice introduction to the model in their Nature Biotechnology article.
Each day I'll provide a summary of what's gone on at the workshop, which centers around the current status of both ISA-TAB and some relevant FuGE extensions, as well as the production of a seamless conversion from FuGE-ML to ISA-TAB and back again. ISA-TAB necessarily cannot handle as much detail as the FuGE model can (being limited by the tabular format), and therefore the conversion in the FuGE-ML to ISA-TAB direction may not be entirely lossless. However, this workshop and all the work that's gone on around it aims to reconcile the two formats as much as possible. And, even though I have mentioned a caveat or two, this reconciliation is entirely possible: both ISA-TAB and FuGE share the same high-level structures. Indeed, ISA-TAB was created with FuGE in mind, to ensure that such a useful undertaking used all it could of the FuGE Object Model. It is important to remember that FuGE is an abstract model which can be converted into many formats, including XML. Because it is an abstract model, many projects can make use of its structures while maintaining whatever concrete format they wish.
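As a rough illustration of why a tree-to-table conversion can lose detail, here is a small sketch. Everything in it is invented for the example: the nested record, its field names, the fixed `COLUMNS` list and the helper functions have nothing to do with the actual FuGE or ISA-TAB structures. The point is only that a table with a fixed set of columns has nowhere to put nested detail that has no matching column.

```python
# Toy record with nested detail; the field names are invented for illustration.
record = {
    "sample": "extract1",
    "protocol": {
        "name": "extraction",
        "parameters": {"temperature": "4C", "duration": "30min"},
    },
}

# A fixed set of columns, standing in for a tabular format's header row.
COLUMNS = ["sample", "protocol.name"]


def leaf_paths(node, prefix=""):
    """Return {dotted.path: value} for every leaf of a nested dict."""
    leaves = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            leaves.update(leaf_paths(value, path))
        else:
            leaves[path] = value
    return leaves


def to_row(node, columns):
    """Keep only the leaves that have a matching column."""
    leaves = leaf_paths(node)
    return {column: leaves.get(column, "") for column in columns}


def lost_fields(node, columns):
    """List the leaves that have no column to land in."""
    return sorted(set(leaf_paths(node)) - set(columns))


row = to_row(record, COLUMNS)
lost = lost_fields(record, COLUMNS)
print(row)
print(lost)
```

Going back from `row` to the original record is impossible here, because the parameters were never written to the table. Reconciling two formats is largely a matter of shrinking that list of lost fields.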
Specific topics of the workshop include:
- Advance and possibly finalize XSLT rendering of FuGE documents into ISA-TAB. This includes the finishing-off of the generic FuGE XSL stylesheet.
- Work on some of the extensions, including FCM, Gel-ML, and MAGE2. MAGE2 is the most interesting for me for this workshop, as I've heard that it's almost complete. This is the XML format that is a direct extension of the FuGE model, and will be very useful for bioinformaticians wishing to store, share and search their transcriptomics data using a multi-omics standard like FuGE.
Thanks to Philippe Rocca-Serra and Susanna-Assunta Sansone for the hard work they've done on the format specification, and to everyone who's coming today. It's a deliberately small group so that we can spend our time in technical discussion rather than in presentations. I'm a bit of a nut about data and metadata standards (and am in complete agreement with Frank over at peanutbutter on the triumvirate of experimental standards) and so I love these types of meetings. It's going to be fun, and I'll keep you updated!