7 posts tagged “data integration”
This post is part of the PLoS One synchroblogging day, as part of the PLoS ONE @ Two birthday celebrations. Happy Synchroblogging! Here's a link to the paper on the PLoS One website.
Biological data: vitally important, determinedly unruly. This challenge facing the life-science community has been present for decades, as witnessed by the often exponential growth of biological databases (see the classic curve in the current graphs of UniProt [1] and EMBL if you don't believe me). It's important to me, as a bioinformatics researcher whose main focus is semantic data integration, but it should be important to everyone. Without manageable data that can be easily integrated, all of our work suffers. Nature thinks it's important: it recently devoted an entire issue to Big Data. Similarly, the Journal of Biomedical Informatics just had a Semantic Mashup special issue. Deus et al. (the paper I'm blogging about, published in PLoS One this summer) agree, beginning with "Data, data everywhere", nicely encapsulating both the joy and the challenge in one sentence.
This paper describes work on a distributed management system that can link disparate data sources using methodologies commonly associated with the semantic web (or is that Web 3.0?). I'm a little concerned (not at the paper, just in general) at the fact that we seem to already have a 3.0 version of the web, especially as I have yet to figure out a useful definition for semantic web vs Web 2.0 vs Web 3.0. Definitions of Web 3.0 seem to vary wildly: is it the semantic web? Is it the -rwx- to Web 1.0's -r-- and Web 2.0's -rw-- (as described here)? Are these two definitions one and the same? Perhaps these are discussions for another day... Ultimately, however, I have to agree with the authors that "Web 3.0" is an unimaginative designation [2].
So, how can the semantic web help manage our data? That would be a post in itself, and is the focus of many PhD projects (including mine). Perhaps a better question is how does the management model proposed by Deus et al. use the semantic web, and is it a useful example of integrative bioinformatics?
Their introduction focuses on two types of integration: data integration as an aid to holistic approaches such as mathematical modelling, and software integration which could provide tighter interoperability between data and services. They espouse (and I agree) the semantic web as a technology which will allow the semantically-meaningful specification of desired properties of data in a search, rather than retrieving data in a fixed way from fixed locations. They want to extend semantic data integration from the world of bioinformatics into clinical applications. Indeed, they want to move past "clandestine and inefficient flurry of datasets exchanged as spreadsheets through email", a laudable goal.
Their focus is on a common data management and analysis infrastructure that does not place any restrictions on the data stored. This also means multiple instances of light-weight applications are part of the model, rather than a single central application. The storage format is of a more general, flexible nature. Their way of getting the data into a common format, they say, is to break down the "interoperable elements" of the data structures into RDF triples (subject-predicate-object statements). At its most basic, their data structure has two types of triples: Rules and Statements. Rules are phrases like "sky has_color", while statements add a value to the phrase, e.g. "today's_sky has_color blue".
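To make the Rule/Statement distinction a little more concrete, here is a minimal sketch in plain Python. It is not S3DB's actual data model or API: the names `Rule`, `Statement` and `conforms` are invented for illustration, and I'm assuming that a Statement simply completes a Rule's phrase with a value.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """A phrase with no value yet, e.g. 'sky has_color'."""
    subject: str
    predicate: str


class Statement(NamedTuple):
    """A full subject-predicate-object triple."""
    subject: str
    predicate: str
    obj: str


def conforms(statement: Statement, rule: Rule) -> bool:
    """Assumed check: a statement fits a rule if it uses the rule's predicate."""
    return statement.predicate == rule.predicate


sky_rule = Rule("sky", "has_color")
todays_sky = Statement("today's_sky", "has_color", "blue")
print(conforms(todays_sky, sky_rule))  # True
```

The check here is deliberately naive; how S3DB really ties a Statement back to its Rule is described in the paper, not in this sketch.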
They make the interesting point that the reclassification of data from flat files to XML to RDF to Description Logics starts to dilute "the distinction between data management and data analysis". While it is true that if you are able to store your data in formats such as OWL-DL [3], the format is much more amenable to direct computational reasoning and inference, perhaps a more precise statement would be that the distinction between performance of data management tasks and data analysis tasks will blur with richer semantic descriptions of both the data and their applications. As they say later in the paper, once the data and the applications are described in a way that is meaningful for computation, new data being deposited online could automatically trigger a series of appropriate analysis steps without any human input.
A large focus of the paper was on identity, both of the people using it (and therefore addressing the user requirement of a strong permissions system) and of the entities in the model and database (each identified with some type of URI). This theme is core to ensuring that only those with the correct permissions may access possibly-sensitive data, and that each item of information can be unambiguously defined. I like that the sharing of "permissions between data elements in distinct S3DB deployments happens through the sharing the membership in external Collections and Rules...not through extending the permission inheritance beyond the local deployment". It seems a useful and straightforward method of passing permissions.
I enjoyed the introduction, background, and conclusions. Their description of the Semantic Web and how it could be employed in the life sciences is well-written and useful for newcomers to this area of research. Their description of the management model as composed of subject-predicate-object RDF triples plus membership and access layers was interesting. Their website was clear and clean, and they had a demo that worked even when I was on the train [4]. It's also rather charming that "S3DB" stands for Simple Sloppy Semantic Database - they have to get points for that one [5]! However, the description of their S3DB prototype was not extensive, and as a result I have some basic questions, which can be summarized as follows:
- How do they determine what the interoperable elements of different data structures are? Manually? Computationally? Is this methodology generic, or does it have to be done with each new data type?
- The determination of the maturity of a data format is not described, other than that it should be a "stable representation which remains useful to specialized tools". For instance, the mzXML format is considered mature enough to use as the object of an RDF triple. What quality control is there in such cases: in theory, someone could make a bad mzXML file. Or is it not the format which is considered mature, but instead specific data sets that are known to be high quality?
- I would have liked to see more detail in their practical example. Their user testing was performed together with the Lung Cancer SPORE user community. How long did the trial last? Was there some qualitative measurement of how happy they were with it (e.g. a questionnaire)? The only requirement gathered seems to have been that of high-quality access control.
- Putting information into RDF statements and rules in an unregulated way will not guarantee a data set that can be integrated with other S3DB implementations, even if they are of the same experiment type. This problem is exemplified by a quote from the paper (p. 8): "The distinct domains are therefore integrated in an interoperable framework in spite of the fact that they are maintained, and regularly edited, by different communities of researchers." The framework might be identical, but that doesn't ensure that people will use the same terms and share the same rules and statements. Different communities could build different statements and rules, and use different terms to describe the same concept. Distributed implementations of S3DB databases, where each group can build their own data descriptions, do not lend themselves well to later integration unless they start by sharing the same ontology/terms and core rules. And, as the authors encourage the "incubation of experimental ontologies" within the S3DB framework, chances are that there will be multiple terms describing the same concept, or even one word that has multiple definitions in different implementations. While they state that data elements can be shared across implementations, it isn't a requirement and could lead to the problems mentioned. I have the feeling I may have gotten the wrong end of the stick here, and it would be great to hear if I've gotten something wrong.
- Their use of the rdfs:subClassOf relation is not ideal. A subclass relation is a bit like saying "is a" (defined here as a transitive property where "all the instances of one class are instances of another"). What their core model is saying with the statement "User rdfs:subClassOf Group" is therefore "User is a Group". The same thing happens with the other uses of this relation, e.g. Item is a Collection. A user is not a group, in the same way that a single item is not a collection. There are relations between these classes of object, but rdfs:subClassOf is simply not semantically correct. A SKOS relation such as skos:narrower (defined here as "used to assert a direct hierarchical link between two SKOS concepts") would be more suitable, if they wished to use a "standard" relationship. I particularly feel that I probably misinterpreted this section of their paper, but couldn't immediately find any extra information on their website. I would really like to hear if I've gotten something wrong here, too.
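The rdfs:subClassOf point in the last question can be seen with a toy inference step. This is only a sketch in plain Python, not a real RDFS reasoner and not anything taken from S3DB: the function `inferred_types` and the instance `alice` are made up for illustration, and triples are simply tuples of strings.

```python
def inferred_types(instance, triples):
    """Return every class the instance belongs to under rdfs:subClassOf semantics."""
    # Classes the instance is directly typed with.
    direct = {o for s, p, o in triples if s == instance and p == "rdf:type"}

    # Map each class to its declared superclasses.
    parents = {}
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            parents.setdefault(s, set()).add(o)

    # Walk upwards: subClassOf is transitive, so keep following superclasses.
    found, todo = set(), list(direct)
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            todo.extend(parents.get(cls, ()))
    return found


triples = [
    ("alice", "rdf:type", "User"),
    ("User", "rdfs:subClassOf", "Group"),
]
print(sorted(inferred_types("alice", triples)))  # ['Group', 'User']
```

Under these semantics the single user `alice` is inferred to be a Group herself, which is exactly the conclusion that seems wrong to me.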
Also, although this is not something that should have been included in the paper, I would be curious to discover what use they think they could make of OBI, which would seem to suit them very well [6]. An ontology for biological and biomedical investigations would seem a boon to them. Further, such a connection could be two-way: the S3DB people probably have a large number of terms, gathered from the various users who created terms to use within the system. It would be great to work with the S3DB people to add these to the OBI ontology. Let's talk! :)
Thanks for an interesting read, and Happy Birthday to PLoS One!
Footnotes:
1. Yes, I’ve mentioned to the UniProt gang that they need to re-jig their axes in the first graph in this link. They’re aware of it! :)
2. Although I shouldn’t talk, I am horrible at naming things, as the title of this blog shows
3. A format for ontologies using Description Logics that may be saved as RDF. See the official OWL docs.
4. Which is a really flaky connection, believe me!
5. Note that this expanded acronym is *not* present in this PLoS One paper, but is on their website.
6. Note on personal bias: I am one of the core developers of OBI :)
[This post has also been copied across to my researchblogging-friendly wordpress site (now completely defunct except for my research blogging efforts, as Vox doesn't play nicely with their aggregator software)].
Helena F. Deus, Romesh Stanislaus, Diogo F. Veiga, Carmen Behrens, Ignacio I. Wistuba, John D. Minna, Harold R. Garner, Stephen G. Swisher, Jack A. Roth, Arlene M. Correa, Bradley Broom, Kevin Coombes, Allen Chang, Lynn H. Vogel, Jonas S. Almeida (2008). A Semantic Web Management Model for Integrative Biomedical Informatics PLoS ONE, 3 (8) DOI: 10.1371/journal.pone.0002946
Z. Zhang, K.-H. Cheung, J. P. Townsend (2008). Bringing Web 2.0 to bioinformatics Briefings in Bioinformatics DOI: 10.1093/bib/bbn041
Carole Goble and the other authors of "Data curation + process curation = data integration + science" have written a paper on the importance of curating not just the services used in bioinformatics, but also how they are used. Just as more and more biologists are becoming convinced of the importance of storing and annotating their data in a common format, so should bioinformaticians take a little of their own medicine and ensure that the services they produce and use are annotated properly. I personally feel that it is just as important to ensure that in silico work is properly curated as it is in the more traditional, wet-lab biological fields.
They mention a common feature of web services and workflows: namely, that they are generally badly documented. Just as the majority of programmers leave it until the last possible minute to comment their code (if they comment at all!), so also are many web services annotated very sparsely, and not necessarily in a way that is useful to either humans or computers. I remember that my first experience with C code was trying to take over a bunch of code written by a C genius, who had but one flaw: a complete lack of commenting. Yes, I learnt a lot about writing efficient C code from his files, but it took me many hours more than it would have done if there had been comments in there!
They touch briefly on how semantic web services (SWS) could help, e.g. using formats such as OWL-S and SAWSDL. I recently read an article in the Journal of Biomedical Informatics (Garcia-Sanchez et al. 2008, citation at the end of this post) that had a good introduction to both semantic web services and, to a lesser extent, multi-agent systems that could autonomously interact with such services. While the Goble et al. paper did not go into as much detail as the Garcia-Sanchez paper did on this point, it was nice to learn a little more about what was going on in the bioinformatics world with respect to SWS.
Their summary of the pitfalls to be aware of due to the lack of curated processes was good, as was their review of currently-existing catalogues and workflow and WS aggregators. The term "Web 2.0" was used, in my opinion correctly, but I was once again left with the feeling that I haven't seen a good definition of what Web 2.0 is. I must hear it talked about every day, and haven't come across any better definition than Tim O'Reilly's. Does anyone reading this want to share their "favorite" definition? This isn't a failing of this paper - more of my own lack of understanding. It's a bit like trying to define "gene" (this is my favorite) or "systems biology" succinctly and in a way that pleases most people - it's a very difficult undertaking! Another thing I would have liked to have seen in this paper, but which probably wasn't suitable for the granularity level at which this paper was written, is a description and short analysis of the traffic and usage stats for myExperiment. Not a big deal - I'm just curious.
As with anything in standards development, even though there are proposed minimal information guidelines for web services out there (see MIAOWS), the main problem will always be lack of uptake and getting a critical mass (also important in community curation efforts, by the way). In my opinion, a more important consideration for this point is that getting a MIA* guideline to be followed does not guarantee any standard format. All it guarantees is a minimal amount of information to be provided.
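That last point is easy to show with a toy example. The sketch below is plain Python with invented names throughout: the checklist items in `REQUIRED` are not the real MIAOWS items, and the two service descriptions are hypothetical. It only illustrates that two records can both satisfy a minimal-information checklist while sharing no common layout.

```python
# Invented checklist items, standing in for a real MIA* guideline.
REQUIRED = {"name", "provider", "inputs", "outputs"}


def leaf_keys(record):
    """Collect the keys whose values are not themselves dicts."""
    keys = set()
    for key, value in record.items():
        if isinstance(value, dict):
            keys |= leaf_keys(value)
        else:
            keys.add(key)
    return keys


def meets_checklist(record):
    """True if every required item appears somewhere in the record."""
    return REQUIRED <= leaf_keys(record)


# The same hypothetical service, described in two different shapes.
as_flat = {
    "name": "align",
    "provider": "ExampleLab",
    "inputs": "fasta",
    "outputs": "alignment",
}
as_nested = {
    "service": {"name": "align", "provider": "ExampleLab"},
    "io": {"inputs": ["fasta"], "outputs": ["alignment"]},
}

print(meets_checklist(as_flat), meets_checklist(as_nested))  # True True
print(as_flat.keys() == as_nested.keys())  # False
```

Both descriptions pass the checklist, yet a tool written against one layout could not read the other without a mapping step.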
They announce the BioCatalogue in the discussion section of this paper, which seems to be a welcome addition to the attempts to get people to annotate and curate their services in a standard way, and store them in a single location. It isn't up and running yet, but is described in the paper as a web interface to more easily allow people to annotate their WSDL files, whereas previous efforts have mainly focused on the registry aspects. Further information can be associated with these files once they are uploaded to the website. However, I do have some questions about this service. What format is the further information (ontology terms, mappings) stored in? Are the ontology terms somehow put back into the WSDL file? How will information about the running of a WS or workflow be stored, if at all? Does it use a SWS format? I would like to see performances of bioinformatics workflows stored publicly, just as performances of biological workflows (e.g. running a microarray experiment) can be. But I suppose many of these questions would be answered once BioCatalogue is in a state suitable for publishing on its own.
In keeping with this idea of storing the applications of in silico protocols and software in a standard format, I'd like to mention one syntax standard that might be of use in storing both descriptions of services and their implementation in specific in silico experiments: FuGE. While it does not currently have the structures required to implement everything mentioned in this paper (such as operational capability and usage/popularity scores) in a completely explicit way, many of the other metadata items that this paper suggests can already be stored within the FuGE object model (e.g. provenance, curation provenance, and functional capability). Further, FuGE is built as a model that can easily be extended. There is no reason why we cannot, for example, build a variety of Web services protocols and software within the FuGE structure. One downside of this method would be that information would be stored in the FuGE objects (e.g. a FuGE database or XML file) and not in the WSDL or Taverna workflow file. Further, there is no way to "execute" FuGE XML files, as there is with Taverna files or WSs. However, if your in silico experiment is stored in FuGE, you immediately have your computational data stored in a format that can also store all of the wet-lab information, protocols, and applications of the protocols. The integration of your analyses with your wet-lab metadata would be immediate.
In conclusion, this paper presents a summary of a vital area of bioinformatics research: how, in order to aid data integration, it is imperative that we annotate not just wet-lab data and how they were generated, but also our in silico data and how they were generated. Imagine storing your web services in BioCatalogue and then sharing your entire experimental workflows, data and metadata with other bioinformaticians quickly and easily (perhaps using FuGE to integrate in silico analyses with wet-lab metadata, producing a full experimental metadata file that stores all the work of an experiment from test tube to final analysis).
Goble C, Stevens R, Hull D, Wolstencroft K, Lopez R. (2008). Data curation + process curation = data integration + science. Briefings in Bioinformatics PMID: 19060304
F. Garcia-Sanchez, J. Fernandez-Breis, R. Valencia-Garcia, J. Gomez, R. Martinez-Bejar (2008). Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources Journal of Biomedical Informatics, 41 (5), 848-859 DOI: 10.1016/j.jbi.2008.05.007
Last week when scanning through Friendfeed, someone mentioned Nature Blogs. A number of my friends and fellow friendfeeders (1,2,3,4,5,6,etc.) already have their blogs registered there. I took the plunge and submitted my request last week, and this site was accepted for inclusion in the list this week. You can find it listed under the bioinformatics category. In honor of that occasion, I've decided to post a summary of the tags I chose to mark this blog with on Nature Blogs, and the reasons for them. (The obvious one, bioinformatics, wasn't necessary as far as I could tell because that is the top-level category I've placed the site into.)
- data integration: It is the main focus of my research, and one of the biggest challenges facing bioinformatics and the life sciences in general. So many formats, so little time! Reconciling these using brute force, standardization, semantics, and sneakiness is what it's all about.
- ontologies: I like ontologies for many reasons, not the least their potential for reconciling the many different ways of defining and naming things in our lives. We need a common ground from which to perform successful integration and analysis, and I think a well-written ontology (or set of them) is a beautiful thing. They are a major tool in my research bag of tricks. Not only that, but I also help develop a community-driven ontology for describing life-science experiments (OBI).
- workshops: my method of remembering what goes on in workshops and conferences is to take notes, and I can be a pretty fast typist. I enjoy blogging on each lecture at such an event as they happen, and you'll notice a lot of workshop and conference posts on this site. They are mainly written while the speaker is speaking, with a minimum (if any) of later editing. However, if any speaker reads my notes and would like to suggest areas where I made a mistake, I am more than happy to make those sorts of changes. One of my favorite ways of blogging.
- systems biology: that's the field in which my bioinformatics research is applied, which makes it an immediately-applicable tag for this blog. But try to define it and, as with so many things in this world, you could get as many definitions as there are people. (Ok, perhaps a slight exaggeration for dramatic effect.) So, I'll not try to define it today, and just say that my posts often deal with work in this field.
- science outreach: My Mom is a teacher, my Dad was a teacher and remains working in Education. If it weren't so much hard work, I'd consider it as a career myself. :) However, I do enjoy trying to pass on my enjoyment of and interest in the sciences. Some of my more recent posts talk about the work I'm doing with the Teacher Scientist Network. Outreach is just fantastic, especially when explaining science to kids, and it's something I like to talk about on this site, when the opportunity arises.
- standards: Perhaps it's because I spent years working at the EBI, where they provide databases and services in specific syntaxes. Perhaps it's just the way my personality is. Whatever the reason, I really enjoy working with data standards. I'm lucky enough to be directly involved with two at the moment (FuGE and OBI), and peripherally involved in other efforts such as SBO (by peripherally I mean that I've nagged them in the past about the whys and wherefores of various aspects of their ontology) and MIGS (I was involved in the initial work on the checklist, and provided advice on FuGE). I'm a bit of a standards fiend, and try to remind myself that not everyone finds them as interesting (though everyone should at least find them relevant!).
Both of these special issues are worth a look, as some of the papers look pretty interesting. I'll spend a little time in a later post on any articles I find particularly relevant.
- Semantic Mashup of Biomedical Data Special Issue of the Journal of Biomedical Informatics. This includes a review article by Carole Goble and Robert Stevens: State of the nation in data integration for bioinformatics
- Nature's Big Data Special Issue. The article entitled "How do your data grow?" was one of the many articles in this issue that I enjoyed. It's interesting to note that these problems in management and curation of big data are only now getting special attention in Nature. When I worked at the EBI, it was common knowledge among the database curators that 1) it would be very difficult for them to find other work as curators if they left the EBI, and 2) the time and high skill level it takes to annotate and curate biological database entries means that it is very difficult to get high coverage in such databases. It's nice to finally see some recognition of all the work the biocurators do by a journal such as Nature. Finally, there are high-profile articles stating that curation begins at home, with the researcher, and that curation needs much more support from researcher-level all the way up to the level of the database curators.
A couple of papers from here at Newcastle University have appeared over the past couple of weeks. Here's a summary of them both.
- Data Standards
From "An Update on Data Standards for Gel Electrophoresis" in Practical Proteomics Issue 1, September 2007, and by Andrew R. Jones and Frank Gibson.
From the abstract: "We report on standards development by the Gel Analysis Workgroup of the Proteomics Standards Initiative. The workgroup develops reporting requirements, data formats and controlled vocabularies for experimental gel electrophoresis, and informatics performed on gel images. We present a tutorial on how such resources can be used and how the community should get involved with the on-going projects. Finally, we present a roadmap for future developments in this area."
Provides a summary of ongoing work in the Gel electrophoresis and Gel informatics fields in terms of data and metadata standardization. This includes work on MIAPE GE and MIAPE GI, two checklists for minimal information required on these types of experiments and analyses. For both GE and GI, there are data formats (GelML and GelInfoML, respectively, both extensions of FuGE) and a suggested controlled vocabulary (sepCV). More information can be found on http://www.psidev.info.
Frank works in the CARMEN neuroscience project here at Newcastle, and Andy is in Liverpool and works on, among other things, FuGE. CARMEN collaborates with the SyMBA project, which was originally developed by me and a few others within Neil Wipat's Integrative Bioinformatics Group here at Newcastle but which is now a sourceforge project at http://symba.sf.net. Andy Jones is a co-author with me, Neil Wipat, Matt Pocock and Olly Shaw on an upcoming SyMBA paper.
- Semantic Data Integration
A paper that was presented at the Integrative Bioinformatics Conference 2007 by me and my co-authors, Matt Pocock and Neil Wipat, is now available from the Journal of Integrative Bioinformatics website.
Allyson L. Lister, Matthew Pocock, Anil Wipat. Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models. Journal of Integrative Bioinformatics, 4(3):80, 2007.
A Technical Report for the School of Computing Science of Newcastle University was released last month describing the CISBAN DPI, an implementation of the FuGE Milestone 3 STK. You can find and download that technical report here:
http://www.cs.ncl.ac.uk/research/pubs/trs/abstract.php?number=1016
The Abstract follows:
The Centre for Integrated Systems Biology of Ageing and Nutrition has developed a Data Portal and Integrator (CISBAN DPI) that is based on the FuGE Object Model and which archives, stores, and retrieves raw high-throughput data. Until now, few published systems have successfully integrated multiple omics data types and information about experiments in a single database. The CISBAN DPI is the first published implementation of FuGE that includes a database back-end, expert and standard interfaces, and utilizes a Life Science Identifier (LSID) Resolution and Assigning service to identify objects and provide programmatic access to the database. Having a central data repository prevents deletion, loss, or accidental modification of primary data, while giving convenient access to the data for publication and analysis. It also provides a central location for storage of metadata for the high-throughput data sets, and will facilitate subsequent data integration strategies.
Keywords
Functional Genomics, High-Throughput Experiments, FuGE, LSID, Experimental Workflows, Databases, Data Standards, Data Sharing, Metadata, Data Integration.
CS-TR: 1016 Implementing the FuGE Object Model: a Systems Biology Data Portal and Integrator,
Lister, A. L., Jones, A. R., Pocock, M., Shaw, O., Wipat, A.
School of Computing Science, Newcastle University, Apr 2007
The Centre for Integrated Systems Biology of Ageing and Nutrition has developed a Data Portal and Integrator (CISBAN DPI) based on Milestone 3 of the Functional Genomics Experiment (FuGE) Object Model (FuGE-OM), which archives, stores, and retrieves raw high-throughput data. We are pleased to announce that the CISBAN Data Portal and Integrator is now available in a public sandbox version. Please note that this release is still at an early beta stage, and any data you may upload to the server may be deleted at any time. You will need a logon to access this database, which you may request from the helpdesk. This is a low level of security that will only serve to prevent anonymous load on the database and to keep your sandbox area separate from others. For more information on the sandbox DPI, please visit the DPI's technical documentation.
Until now, few published systems have successfully integrated multiple omics data types and information about experiments in a single database. The CISBAN DPI is the first published implementation of FuGE that includes a database back-end, expert and standard interfaces, and utilizes a Life Science Identifier (LSID) Resolution and Assigning service to identify objects and provide programmatic access to the database. Having a central data repository prevents deletion, loss, or accidental modification of primary data, while giving convenient access to the data for publication and analysis. It also provides a central location for storage of metadata for the high-throughput data sets, and will facilitate subsequent data integration strategies.
We encourage you to upload data and create as many experiments as you like so that you may determine if this application may be of use to your own research group. We also appreciate you contacting us with any comments or questions you may have.
Useful links: