Integrative Bioinformatics 2007, Day 1: The OXL format, Taubert et al.
Other than where specified, these are my notes from the IB07 Conference, and are in no way expressions of opinion, and any errors are probably just due to my own misunderstanding
OXL is the ONDEX data format, and they are presenting it as a possible format for the exchange of integrated data. OXL is based upon an ontology (opinion/question: a true ontology, or a CV, i.e. a controlled vocabulary?) of concepts and relations. ONDEX itself is an open-source data warehouse written in Java that performs ontology-based data integration. OXL is in RDF. They described two ways to use RDF: first, model things as predicates (but then they cannot have attributes); second, model them as classes. However, it also seems that they have OXL in an XML format, defined by an XSD.
In their XML format, they don't use any cross-references: everything is fully expanded. Yes, this generates lots of XML files, but with file compression that isn't a problem. It does make whole-document validation more difficult, but they're working on it. They say this approach makes the format more human-readable.
They then presented some examples. The first was the identification of possible pathogenicity genes in Vibrio salmonicida (with the University of Tromsø). The approach: identify clusters of orthologs involving V. salmonicida, then colour the nodes according to pathogenicity phenotype.
http://ondex.sf.net
Here are my opinions: A well-presented talk on the whole. I don't mean to harp on today about architecture slides, but they're important when describing software. They had some, but they were so small that they were pretty hard to read. Also, I've never been convinced by "human-readable" as a reason to change an XSD: XML is simply not meant to be human-readable, and changes shouldn't be made to the XSD to make it so. However, ONDEX is a reasonably mature application, and so it may be useful to ask others to use their format. My main question is about probabilities. A lot of similar work in data integration uses weights on edges: how can these be modelled with OXL?
Comments
Just saw this now...
Probabilities, in my understanding, are nothing more than Double or Float values (to speak in Java terms) on the nodes and edges of the graph. Such values can easily be added using a so-called GDS (generalized data structure). Sets of GDS can be added to any node and edge in the graph, and thus get represented in the OXL format. To be more technical, a GDS consists of an attribute name describing what kind of data is contained, and a value, which can be any Java object data type. Java object data types are serialized in OXL using the XStream Java API.
Here is a simple example of representing the NCBI Taxonomy ID 45372 as a GDS in OXL:
<concept_gds>
  <attrname>
    <id>Taxonomy ID</id>
    <fullname>Taxonomy ID</fullname>
    <description>from NCBI Taxonomy database</description>
    <unit>
      <id>Integer</id>
      <fullname>Integer</fullname>
      <description>ID number</description>
    </unit>
    <datatype>java.lang.Integer</datatype>
  </attrname>
  <value><![CDATA[<int>45372</int>]]></value>
  <doindex>true</doindex>
</concept_gds>
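To illustrate how a GDS entry like this might be read outside ONDEX, here is a minimal Python sketch using only the standard library. It is not part of ONDEX or XStream: the `Gds` class, the `parse_gds` function and the `CONVERTERS` table are hypothetical names made up for this example. It handles only the `<int>` element shown above, plus `<double>` on the assumption that a Double-valued probability is serialized in the same style.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# The GDS fragment from the example above, copied verbatim.
OXL_GDS = """<concept_gds>
  <attrname>
    <id>Taxonomy ID</id>
    <fullname>Taxonomy ID</fullname>
    <description>from NCBI Taxonomy database</description>
    <unit>
      <id>Integer</id>
      <fullname>Integer</fullname>
      <description>ID number</description>
    </unit>
    <datatype>java.lang.Integer</datatype>
  </attrname>
  <value><![CDATA[<int>45372</int>]]></value>
  <doindex>true</doindex>
</concept_gds>"""


@dataclass
class Gds:
    """Hypothetical stand-in for a GDS: an attribute name plus a typed value."""
    name: str
    datatype: str
    value: object
    indexed: bool


# Assumption: only these two simple serialized element names are handled here.
# Real serialized objects can be arbitrarily nested, which this sketch ignores.
CONVERTERS = {"int": int, "double": float}


def parse_gds(xml_text: str) -> Gds:
    root = ET.fromstring(xml_text)
    # The CDATA section arrives as plain text that is itself a small XML document.
    inner = ET.fromstring(root.findtext("value"))
    convert = CONVERTERS[inner.tag]
    return Gds(
        name=root.findtext("attrname/id"),
        datatype=root.findtext("attrname/datatype"),
        value=convert(inner.text),
        indexed=root.findtext("doindex") == "true",
    )


gds = parse_gds(OXL_GDS)
print(gds)
```

The point to notice is the two-step parse: the outer OXL element is ordinary XML, while the value is a serialized object wrapped in CDATA and has to be parsed a second time.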