Integrative Bioinformatics 2007, Day 2: Model Format OWL (MFO), Lister et al.
Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models
Published September 2007 by the Journal of Integrative Bioinformatics
Allyson L. Lister1,2, Matthew Pocock2, Anil Wipat1,2,*
1 Centre for Integrated Systems Biology of Ageing and Nutrition (http://www.cisban.ac.uk)
2 School of Computing Science (http://www.cs.ncl.ac.uk),
Newcastle University (http://www.ncl.ac.uk)*
Abstract
The creation of quantitative, simulatable, Systems Biology Markup Language (SBML) models that accurately simulate the system under study is a time-intensive manual process that requires careful checking. Currently, the rules and constraints of model creation, curation, and annotation are distributed over at least three separate documents: the SBML schema document (XSD), the Systems Biology Ontology (SBO), and the “Structures and Facilities for Model Definition” document. The latter document contains the richest set of constraints on models, and yet it is not amenable to computational processing. We have developed a Web Ontology Language (OWL) knowledge base that integrates these three structure documents, and that contains a representative sample of the information contained within them. This Model Format OWL (MFO) performs both structural and constraint integration and can be reasoned over and validated. SBML Models are represented as individuals of OWL classes, resulting in a single computationally amenable resource for model checking. Knowledge that was only accessible to humans is now explicitly and directly available for computational approaches. The integration of all structural knowledge for SBML models into a single resource creates a new style of model development and checking.
Introduction
Systems Biology Markup Language[1] (SBML) is an XML format that has emerged as the de facto standard file format for describing computational models in systems biology. It is supported by a vibrant community who have developed a wide range of tools, allowing models to be generated, analysed and curated in any one of many independently maintained software applications[1]. The Systems Biology Ontology[2][2] (SBO) was developed to enable a useful understanding of the biology to which a model relates, and to provide well-understood terms for describing common modelling concepts. The community is engaged in an on-going effort to develop the SBML standard in ways needed to support systems biology applications. As part of this process, a manual is maintained that describes and defines SBML and SBO[3].
The biological knowledge used to create and annotate a high-quality SBML model is typically analysed and integrated by a researcher. These modellers know and understand both the systems they are modelling and the intricacies of SBML. However, as with most areas of biology, the amount of data that is relevant to generating even a relatively small and well-scoped model is overwhelming. In order to extend the range of modelling tasks that can be automated, it is necessary to both capture the salient biological knowledge in a form that computers can process, and represent the SBML rules in a way computers can systematically interpret. Here we address the latter issue: describing SBML, SBO and the rules about what constitutes a correctly formed model in a way suitable for computational manipulation.
The Semantic Web[4] can be seen as today’s incarnation of the goal to allow computers to go beyond performing numerical computations, and to share and integrate information more easily. Several standards are now forming within the Semantic Web community that together formalise computational languages for representing knowledge and strictly define what conclusions can be reached from facts expressed in these languages. The Web Ontology Language[3][5] (OWL) is one such language; it enjoys strong tool support and is used for capturing biological and medical knowledge (e.g. OBI[6], BioPAX[7], EXPO[4], and the OWL versions of FMA[5] and GALEN[6]). Once the information about a domain has been modelled in an OWL file, a software application called a reasoner[7, 8] can automatically deduce all other facts that must logically follow, as well as find inconsistencies between asserted facts.
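The two tasks attributed to a reasoner above, deriving implied facts and detecting inconsistencies, can be illustrated with a deliberately tiny sketch. It is not an OWL reasoner and does not reflect how MFO is built: the class hierarchy, the disjointness axiom, and the individual names are all hypothetical, and real OWL semantics are far richer than this subclass walk.

```python
# Toy illustration only: a real OWL reasoner implements far richer semantics.
# Every class and individual name below is hypothetical.

SUBCLASS_OF = {
    "Species": "SBase",
    "Compartment": "SBase",
}
DISJOINT = [("Species", "Compartment")]


def superclasses(cls):
    """Return cls plus every class it is transitively a subclass of."""
    found = [cls]
    while found[-1] in SUBCLASS_OF:
        found.append(SUBCLASS_OF[found[-1]])
    return set(found)


def infer_types(asserted):
    """Expand the asserted types of each individual with inferred superclasses."""
    return {
        name: set().union(*(superclasses(c) for c in classes))
        for name, classes in asserted.items()
    }


def inconsistencies(inferred):
    """List individuals that belong to two classes declared disjoint."""
    return sorted(
        name
        for name, classes in inferred.items()
        for a, b in DISJOINT
        if a in classes and b in classes
    )


asserted = {"glucose": {"Species"}, "oops": {"Species", "Compartment"}}
inferred = infer_types(asserted)
print(inferred["glucose"])        # now also contains the inferred type "SBase"
print(inconsistencies(inferred))  # individuals violating the disjointness axiom
```

Here "glucose" gains a type that was never asserted, while "oops" is flagged because its two asserted types are declared disjoint.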
The knowledge about a system described in SBML can be divided into two parts. Firstly, there is the biological knowledge. This includes information about the biological entities involved and their biological. Secondly, there is the structural knowledge, describing how the biological knowledge must be captured in well-formed documents suitable for processing by applications. In the case of a high-quality SBML model, the structural knowledge required to create such a model is tied up in three main locations:
- The Systems Biology Markup Language (SBML[1][8]) XML Schema Document (XSD[9]), describing the range of XML documents considered to be in SBML syntax,
- The Systems Biology Ontology (SBO[2][10]), describing the range of terms that can be used to describe parts of the model in a way understandable to the community using the Open Biological Ontologies (OBO[11]) format, and
- The "Structures and Facilities for Model Definition" document[12] (hereafter referred to as the "SBML Manual"), describing many additional restrictions and constraints upon SBML documents, and the context within which SBO terms can be used, as well as information about how conformant documents should be interpreted.
From a knowledge-engineering point of view, it makes sense to represent these sources of structural knowledge as part of a single knowledge base. Although this separation of documents could appear arbitrary to a knowledge engineer, it is in fact well motivated by the consumers of each type of information. The portion of the knowledge codified in SBML transmits all of, and only, the information needed to parameterise and run a computational simulation of the system. The knowledge in SBO is intended to aid humans in understanding what is being modelled. The SBML Manual is aimed at tool developers who need to ensure that their software is fully compliant with the specification.
Only two of these three sources of structural knowledge are directly computationally amenable. SBML has an associated XSD that describes the range of legal XML documents, which elements and attributes must appear, and constraints on the values of text within the file. SBO captures a term hierarchy containing human-readable descriptions and labels for each term and a machine-readable ID for each term. Neither of these documents contains much information about how XML elements or SBO terms should be used in practice, how the two interact, or what a particular conformant SBML document should mean to an end-user. The majority of information required to develop a format-compliant model is in the SBML Manual, in formal English. Anything more than simple programmatic steps, such as XML validation, can currently only be done by manually encoding the English descriptions in the SBML Manual into rules in a program. libSBML[13] is the reference implementation of this procedure, capturing the process of validating constraints. Manual encoding provides scope for misinterpretation of the intent of the SBML Manual, or may produce code that accepts or generates non-compliant documents due to silent bugs. In practice, these problems are ameliorated by regular SBML Hackathons[14] and the use of libSBML by many SBML applications. However, the need for a more formal and complete description of the information in the SBML Manual becomes more pressing as the community grows beyond the point where all of the relevant developer groups can be adequately served by face-to-face meetings.
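To make the idea of manually encoding a prose rule concrete, the sketch below hand-codes one cross-reference check: every species must name a compartment that is actually declared in the model. The rule was chosen here for illustration and is not singled out by the text above; the sample document and the function names are hypothetical, and this is not how libSBML implements its checks.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal SBML-like document, used only for this illustration.
SBML_TEXT = """<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy">
    <listOfCompartments>
      <compartment id="cytosol"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glucose" compartment="cytosol"/>
      <species id="atp" compartment="nucleus"/>
    </listOfSpecies>
  </model>
</sbml>"""


def local_name(element):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return element.tag.rsplit("}", 1)[-1]


def dangling_compartment_refs(sbml_text):
    """Return ids of species whose compartment is not a declared compartment."""
    root = ET.fromstring(sbml_text)
    compartments = {
        e.get("id") for e in root.iter() if local_name(e) == "compartment"
    }
    return [
        e.get("id")
        for e in root.iter()
        if local_name(e) == "species" and e.get("compartment") not in compartments
    ]


print(dangling_compartment_refs(SBML_TEXT))
```

Each such rule has to be read from the prose and re-expressed by a programmer in this way, which is where the misinterpretations and silent bugs mentioned above can creep in.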
We find that some of these issues can be avoided by combining the structural knowledge currently spread across three documents in three formats into a single computationally amenable resource. This method of constraint integration for all information pertinent to SBML will require a degree of rigour that can only improve the clarity of the specification. Once established, standard OWL tools can be used to validate and reason over SBML models, to check their conformance and to derive any conclusions that follow from the facts stated in the document, all without manual intervention.
To address this proposition, we have developed the Model Format OWL (MFO), implemented in OWL-DL and capturing the SBML structure plus a representative sample of SBO and human-readable constraints from the SBML Manual. We demonstrate that MFO is capable of directly capturing many of the structural rules and semantic constraints documented in the SBML Manual. The mapping between SBML documents and the OWL representation is bi-directional: information can be parsed as OWL individuals from an SBML document, manipulated and studied, and then serialized back out again as SBML. We demonstrate feasibility with two simple, illustrative examples. In future, we hope to use this as the basis for a method of automatically improving the annotation of SBML models with rich biological knowledge, and as an aid to principled automated model improvement and merging.
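The parse, manipulate, and serialize cycle described above can be sketched in miniature. This is only an analogy: plain Python dataclass instances stand in for OWL individuals, the XML fragment omits namespaces and most SBML structure, and the names `SpeciesIndividual`, `parse_species`, and `serialize_species` are hypothetical rather than part of MFO.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SpeciesIndividual:
    """Stand-in for an OWL individual of a hypothetical 'Species' class."""
    id: str
    compartment: str


def parse_species(xml_text):
    """Read each <species> element into an in-memory individual."""
    root = ET.fromstring(xml_text)
    return [
        SpeciesIndividual(e.get("id"), e.get("compartment"))
        for e in root.iter("species")
    ]


def serialize_species(individuals):
    """Write the individuals back out as a <listOfSpecies> fragment."""
    root = ET.Element("listOfSpecies")
    for ind in individuals:
        ET.SubElement(root, "species", id=ind.id, compartment=ind.compartment)
    return ET.tostring(root, encoding="unicode")


source = '<listOfSpecies><species id="atp" compartment="cytosol"/></listOfSpecies>'
individuals = parse_species(source)
individuals[0].compartment = "nucleus"       # manipulate the in-memory form
round_tripped = parse_species(serialize_species(individuals))
print(round_tripped)
```

The property of interest is that nothing is lost on the way back: re-parsing the serialized output yields the same individuals that were written.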
The integration of all structural knowledge for SBML models into a single resource creates a new style of model document development, which we believe will greatly reduce the overheads associated with computational transformations between biological knowledge and high-quality systems biology models. MFO is not intended to be a replacement for any of the APIs or software programs available to the SBML community today. It addresses the very specific need of a sub-community within SBML that wishes to be able to express their models in OWL for the purpose of reasoning, validation, and querying. It has also been created as the first step in a larger data integration strategy that will eventually encompass the biological as well as structural knowledge present in SBML documentation and models.
[1] Hucka, M. et al.: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics (Oxford, England) 19 (2003) 524-531
[2] Le Novere, N.: Model storage, exchange and integration. BMC Neurosci 7 Suppl 1 (2006) S11
[3] Horrocks, I., Patel-Schneider, P.F., van Harmelen, F.: From SHIQ and RDF to OWL: The making of a web ontology language. J. of Web Semantics 1 (2003) 7-26
[4] Soldatova, L.N., King, R.D.: An ontology of scientific experiments. Journal of the Royal Society, Interface / the Royal Society 3 (2006) 795-803
[5] Heja, G., Varga, P., Pallinger, P., Surjan, G.: Restructuring the foundational model of anatomy. Studies in health technology and informatics 124 (2006) 755-760
[6] Heja, G., Surjan, G., Lukacsy, G., Pallinger, P., Gergely, M.: GALEN based formal representation of ICD10. International journal of medical informatics 76 (2007) 118-123