Trying out another name (see title of blog above): this time it's "The Mind Wobbles" (see my previous post for the history behind choosing a title). The Mind Wobbles is the name of my (mainly defunct) WordPress site, so it would make sense to use the same name on my Vox site. Additionally, it has a nice ring to it (and Deepak seems to agree!). The name would be catchy but not directly related to my work, which would allow me to change work area without feeling that the blog title no longer fits.
I'm trying it out now in parentheses in my blog title, to see how I like it. I originally chose it because of the Weebles (weebles wobble but they won't fall down). I like my puns, and thought this was punny. Anyway, comments, as always, welcome. :)
I'm horrible with title creation: I can never think of a good title for papers, projects, etc. Any sort of naming just causes me problems, hence the completely boring name of this blog (Systems Biology & Bioinformatics). But I've been trying to think of a good, more permanent, blog name, and I think I've found one. However, I'm quite nervous that, even after my google search, I may find out that this was a horrible name after all, so I've gone quite tentative and am currently road-testing the new name by putting parentheses around it. The new name could be "Semantically Speaking", and I'd like to know what others think. It won't change how I post, just the main name. I'll retain the SBB part in the sub-title, together with the keywords already present. There is already one blog with this name, but it's two years since the owner's made a post, and it isn't in the same topic area. The other google results are mainly about individual articles, and not websites of the same name.
So, here's your chance - would you like to help name a blog? Do you agree/disagree that this is a good name? I need your help! :)
This post is part of the PLoS One synchroblogging day, as part of the PLoS ONE @ Two birthday celebrations. Happy Synchroblogging! Here's a link to the paper on the PLoS One website.
Biological data: vitally important, determinedly unruly. This challenge facing the life-science community has been present for decades, as witnessed by the often exponential growth of biological databases (see the classic curve in the current graphs of UniProt [1] and EMBL if you don't believe me). It's important to me, as a bioinformatics researcher whose main focus is semantic data integration, but it should be important to everyone. Without manageable data that can be easily integrated, all of our work suffers. Nature thinks it's important: it recently devoted an entire issue to Big Data. Similarly, the Journal of Biomedical Informatics just had a Semantic Mashup special issue. Deus et al. (the paper I'm blogging about, published in PLoS One this summer) agree, beginning with "Data, data everywhere", nicely encapsulating both the joy and the challenge in one sentence.
This paper describes work on a distributed management system that can link disparate data sources using methodologies commonly associated with the semantic web (or is that Web 3.0?). I'm a little concerned (not at the paper, just in general) at the fact that we seem to already have a 3.0 version of the web, especially as I have yet to figure out a useful definition for semantic web vs Web 2.0 vs Web 3.0. Definitions of Web 3.0 seem to vary wildly: is it the semantic web? Is it the -rwx- to Web 1.0's -r-- and Web 2.0's -rw-- (as described here)? Are these two definitions one and the same? Perhaps these are discussions for another day... Ultimately, however, I have to agree with the authors that "Web 3.0" is an unimaginative designation [2].
So, how can the semantic web help manage our data? That would be a post in itself, and is the focus of many PhD projects (including mine). Perhaps a better question is how does the management model proposed by Deus et al. use the semantic web, and is it a useful example of integrative bioinformatics?
Their introduction focuses on two types of integration: data integration as an aid to holistic approaches such as mathematical modelling, and software integration which could provide tighter interoperability between data and services. They espouse (and I agree) the semantic web as a technology which will allow the semantically-meaningful specification of desired properties of data in a search, rather than retrieving data in a fixed way from fixed locations. They want to extend semantic data integration from the world of bioinformatics into clinical applications. Indeed, they want to move past "clandestine and inefficient flurry of datasets exchanged as spreadsheets through email", a laudable goal.
Their focus is on a common data management and analysis infrastructure that does not place any restrictions on the data stored. This also means multiple instances of light-weight applications are part of the model, rather than a single central application. The storage format is of a more general, flexible nature. Their way of getting the data into a common format, they say, is to break down the "interoperable elements" of the data structures into RDF triples (subject-predicate-object statements). At its most basic, their data structure has two types of triples: Rules and Statements. Rules are phrases like "sky has_color", while statements add a value to the phrase, e.g. "today's_sky has_color blue".
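To make the Rule/Statement distinction concrete, here is a minimal sketch in Python. It follows only the description above (a rule is a phrase without a value, a statement fills the value in); it is not the S3DB API, and the class names, the helper functions and the second example sky are my own assumptions for illustration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """A phrase without a value, e.g. "sky has_color"."""
    subject: str
    predicate: str


class Statement(NamedTuple):
    """A full subject-predicate-object triple."""
    subject: str
    predicate: str
    obj: str


def state(rule: Rule, item: str, value: str) -> Statement:
    """Fill in a rule for one concrete item, producing a statement."""
    return Statement(item, rule.predicate, value)


def values_for(statements, predicate):
    """Return {subject: object} for every statement that uses the predicate."""
    return {s.subject: s.obj for s in statements if s.predicate == predicate}


sky_colour = Rule("sky", "has_color")
statements = [
    state(sky_colour, "today's_sky", "blue"),
    # Hypothetical second statement, not taken from the paper.
    state(sky_colour, "yesterday's_sky", "grey"),
]
colours = values_for(statements, "has_color")
```

Because every statement reuses the predicate of its rule, a query only needs the predicate name to pull matching values out of otherwise unrelated data.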
They make the interesting point that the reclassification of data from flat files to XML to RDF to Description Logics starts to dilute "the distinction between data management and data analysis". While it is true that if you are able to store your data in formats such as OWL-DL [3], the format is much more amenable to direct computational reasoning and inference, perhaps a more precise statement would be that the distinction between performance of data management tasks and data analysis tasks will blur with richer semantic descriptions of both the data and their applications. As they say later in the paper, once the data and the applications are described in a way that is meaningful for computation, new data being deposited online could automatically trigger a series of appropriate analysis steps without any human input.
A large focus of the paper was on identity, both of the people using it (and therefore addressing the user requirement of a strong permissions system) and of the entities in the model and database (each identified with some type of URI). This theme is core to ensuring that only those with the correct permissions may access possibly-sensitive data, and that each item of information can be unambiguously defined. I like that the sharing of "permissions between data elements in distinct S3DB deployments happens through the sharing the membership in external Collections and Rules...not through extending the permission inheritance beyond the local deployment". It seems a useful and straightforward method of passing permissions.
I enjoyed the introduction, background, and conclusions. Their description of the Semantic Web and how it could be employed in the life sciences is well-written and useful for newcomers to this area of research. Their description of the management model as composed of subject-predicate-object RDF triples plus membership and access layers was interesting. Their website was clear and clean, and they had a demo that worked even when I was on the train [4]. It's also rather charming that "S3DB" stands for Simple Sloppy Semantic Database - they have to get points for that one [5]! However, the description of their S3DB prototype was not extensive, and as a result I have some basic questions, which can be summarized as follows:
- How do they determine what the interoperable elements of different data structures are? Manually? Computationally? Is this methodology generic, or does it have to be done with each new data type?
- The determination of the maturity of a data format is not described, other than that it should be a "stable representation which remains useful to specialized tools". For instance, the mzXML format is considered mature enough to use as the object of an RDF triple. What quality control is there in such cases: in theory, someone could make a bad mzXML file. Or is it not the format which is considered mature, but instead specific data sets that are known to be high quality?
- I would have liked to see more detail in their practical example. Their user testing was performed together with the Lung Cancer SPORE user community. How long did the trial last? Was there some qualitative measurement of how happy they were with it (e.g. a questionnaire)? The only requirement gathered seems to have been that of high-quality access control.
- Putting information into RDF statements and rules in an unregulated way will not guarantee a data set that can be integrated with other S3DB implementations, even if they are of the same experiment type. This problem is exemplified by a quote from the paper (p. 8): "The distinct domains are therefore integrated in an interoperable framework in spite of the fact that they are maintained, and regularly edited, by different communities of researchers." The framework might be identical, but that doesn't ensure that people will use the same terms and share the same rules and statements. Different communities could build different statements and rules, and use different terms to describe the same concept. Distributed implementations of S3DB databases, where each group can build their own data descriptions, do not lend themselves well to later integration unless they start by sharing the same ontology/terms and core rules. And, as the authors encourage the "incubation of experimental ontologies" within the S3DB framework, chances are that there will be multiple terms describing the same concept, or even one word that has multiple definitions in different implementations. While they state that data elements can be shared across implementations, it isn't a requirement and could lead to the problems mentioned. I have the feeling I may have gotten the wrong end of the stick here, and it would be great to hear if I've gotten something wrong.
- Their use of the rdfs:subClassOf relation is not ideal. A subclass relation is a bit like saying "is a", (defined here as a transitive property where "all the instances of one class are instances of another") therefore what their core model is saying with the statement "User rdfs:subClassOf Group" is "User is a Group". The same thing happens with the other uses of this relation, e.g. Item is a Collection. A user is not a group, in the same way that a single item is not a collection. There are relations between these classes of object, but rdfs:subClassOf is simply not semantically correct. A SKOS relation such as skos:narrower (defined here as "used to assert a direct hierarchical link between two SKOS concepts") would be more suitable, if they wished to use a "standard" relationship. I particularly feel that I probably misinterpreted this section of their paper, but couldn't immediately find any extra information on their website. I would really like to hear if I've gotten something wrong here, too.
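The subclass point in the last question can be sketched in a few lines of Python. This is a toy re-implementation of the rdfs:subClassOf entailment quoted above (all instances of one class are instances of another, applied transitively), not a real RDF reasoner; the function names and the idea of a single instance typed as User are my own assumptions.

```python
def superclasses(cls, subclass_of):
    """All classes reachable from cls by following subClassOf links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(asserted_type, subclass_of):
    """Every class an instance belongs to, given its one asserted class."""
    return {asserted_type} | superclasses(asserted_type, subclass_of)


# The two subclass axioms mentioned in the review.
subclass_of = {"User": {"Group"}, "Item": {"Collection"}}

# A hypothetical individual asserted to be a User.
user_types = inferred_types("User", subclass_of)
item_types = inferred_types("Item", subclass_of)
```

Under these axioms a reasoner concludes that any individual user is also a group, and any single item is also a collection, which is the consequence I'm questioning.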
Also, although this is not something that should have been included in the paper, I would be curious to discover what use they think they could make of OBI, which would seem to suit them very well [6]. An ontology for biological and biomedical investigations would seem a boon to them. Further, such a connection could be two-way: the S3DB people probably have a large number of terms, gathered from the various users who created terms to use within the system. It would be great to work with the S3DB people to add these to the OBI ontology. Let's talk! :)
Thanks for an interesting read, and Happy Birthday to PLoS One!
Footnotes:
1. Yes, I’ve mentioned to the UniProt gang that they need to re-jig their axes in the first graph in this link. They’re aware of it! :)
2. Although I shouldn’t talk, I am horrible at naming things, as the title of this blog shows
3. A format for ontologies using Description Logics that may be saved as RDF. See the official OWL docs.
4. Which is a really flaky connection, believe me!
5. Note that this expanded acronym is *not* present in this PLoS One paper, but is on their website.
6. Note on personal bias: I am one of the core developers of OBI :)
[This post has also been copied across to my researchblogging-friendly wordpress site (now completely defunct except for my research blogging efforts, as Vox doesn't play nicely with their aggregator software)].
Helena F. Deus, Romesh Stanislaus, Diogo F. Veiga, Carmen Behrens, Ignacio I. Wistuba, John D. Minna, Harold R. Garner, Stephen G. Swisher, Jack A. Roth, Arlene M. Correa, Bradley Broom, Kevin Coombes, Allen Chang, Lynn H. Vogel, Jonas S. Almeida (2008). A Semantic Web Management Model for Integrative Biomedical Informatics PLoS ONE, 3 (8) DOI: 10.1371/journal.pone.0002946
Z. Zhang, K.-H. Cheung, J. P. Townsend (2008). Bringing Web 2.0 to bioinformatics Briefings in Bioinformatics DOI: 10.1093/bib/bbn041
BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.
I really enjoyed this workshop - met new people, chatted about systems biology, clinical genetics, surname-DNA associations, The Princess Bride and Spinal Tap. From a combination of presentations and chats, two defining topics of discussion in this workshop emerged:
- social challenges, or getting the different disciplines within systems biology to understand one another. Alternatively, people also mentioned the challenge in getting different collaborating groups to work together;
- stable infrastructure funding, or getting money for supporting software and for building and supporting data standards.
In my opinion, the former is much less of a current challenge than the latter. From my personal experiences within CISBAN (which contains a variety of experimental biologists as well as different types of theoretical biologists, mathematicians and statisticians), we have progressed to the point that I really feel that each "group" understands what the others do. In other words, in a local context, I think that social challenges are minimal. Longer-distance social challenges will remain around a little longer, but with the increasing use of online social networking tools (1, 2, 3, 4, 5, 6), I think much of this could be minimized. In contrast, I think that funding for stable infrastructure (software and data standards) isn't advancing as quickly as it should. The production and maintenance of life-science data standards are vital to more efficient data sharing and collaboration. People should make room in their grants for the development of data standards (e.g. MIBBI guidelines, syntaxes or semantics - see Frank's excellent discussion on the issue) that will benefit them. Core institutes such as the EBI do a lot of this work, but can't get funding for everything.
I started thinking about all this stuff on Wednesday morning, and writing this did somewhat affect the notes I took in some of the talks, and for that I apologise! :)
And, in conclusion, some light entertainment. There was a third category of discussion which many will be familiar with:
- acronyms
I'm as guilty as the rest of them. Here's a small selection of examples of how much we scientists love our acronyms, and those things which are very close to true acronyms: APPLE, BASIS, CRISP, EMMAS, PRESTA, PheroSys, Phyre, PiMS, SToMP, SyMBA (mine), SysMO, SUMO, ROBuST and others. For a guide to how to build acronyms, see PhD Comics' excellent summary of the topic (and the related FriendFeed discussion).
BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.
Amanda Greenall: Telomere binding proteins are conserved between yeast and higher eukaryotes. The capping proteins are very important, because they prevent the telomeres from being recognized as double-strand breaks. They work on cdc13, which is the functional homologue of POT1 in humans. A point mutation, cdc13-1, allows them to study telomere uncapping. When the cells are grown above 27 degrees Celsius, the cdc13-1 protein becomes non-functional and falls off. This uncapping causes telomere loss and cell-cycle arrest. So, they do further study into the checkpoint response that happens when telomeres are uncapped. Yeast is a good model, as many of the proteins involved in humans have direct analogs in yeast. They did a series of transcriptomics experiments to determine how gene expression is affected when telomeres are uncapped. They did 30 arrays, and the data was analysed using limma. 647 differentially-expressed genes were identified (418 upregulated (carbohydrate metabolism, energy generation, response to OS), and 229 downregulated (amino acid and ribosome biogenesis, RNA metabolism, etc)). The number of differentially-expressed genes increases with time. For example, 259 of the genes were involved in DNA damage response.
They became quite interested in BNA2, which is an enzyme which catalyses de novo NAD+ biosynthesis. Why is it upregulated? It seems over-expression of BNA2 enhances survival of cdc13-1 strains (using spot tests). Nicotinamide biosynthetic genes are altered when telomeres are uncapped in yeast and humans. The second screen was a robotic screen to identify ExoX and/or pathways affecting responses to telomere uncapping. Robots were used to do large-scale screens that can measure systematic cdc13-1 genetic interactions. One of the tests was the up-down assay, which allows them to distinguish Exo1-like and Rad9-like suppressors. They carry on with the spot tests until they have worked through the entire library of strains.
Darren Wilkinson: a discrete stochastic kinetic model has been built to model the cellular response to uncapping (J Royal Soc Interface, 4(12):73-90); it is also available in Biomodels. It is encoded in SBML and simulated in BASIS (a web-based simulation engine). You can use the microarray data to infer networks of interactions. Such top-down modelling can often be done with Dynamic Bayesian Networks (DBNs) for discretised data and sparse Dynamic Linear Models (DLMs) for (normalized) continuous data. A special case of DLM is the sparse vector auto-regressive model of order 1, known as the sparse VAR(1) model, and this appears to be effective for uncovering dynamic network interactions (see Opgen-Rhein and Strimmer, 2007). They use a simple version of this model. They use an RJ-MCMC algorithm to explore both graphical structure and model parameters. The output of the RJ-MCMC is quite hard to visualize, so they plot the marginal probability that each edge exists. This can also be summarised by choosing an arbitrary threshold and then plotting the resulting network. You can change the thickness of the edges so they match the marginal probability associated with each edge. This picture is then easier for biologists to analyse, and allows them to narrow down their search for important genes. He also performed analysis over the robotic genetic screens. There are usually about 1000 images per experiment, each with 384 spots, and therefore image analysis needs to be automated. They want to pick out those strains that are genetically interacting with the query mutation. For interactions to be a useful concept in practice, you need the networks to be sparse. With HTP data, there is sufficient data to re-scale it in order to enforce this sparsity. A scatter-plot of double against single will show them all lying along a straight line (under a model of genetic independence). Points above and below the regression line are phenotypic enhancers and suppressors, respectively.
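The "threshold, then scale the edge thickness" summary step can be sketched as below. This is only my illustration of that one step, not the speaker's code: the gene names, the probabilities, the 0.5 threshold and the maximum line width are all invented placeholders, and the RJ-MCMC that would produce the probabilities is not shown.

```python
def threshold_network(edge_probs, threshold=0.5):
    """Keep only edges whose marginal probability reaches the threshold."""
    return {edge: p for edge, p in edge_probs.items() if p >= threshold}


def edge_width(probability, max_width=5.0):
    """Line width proportional to the edge's marginal probability."""
    return probability * max_width


# Invented marginal edge probabilities for (regulator, target) pairs.
edge_probs = {
    ("geneA", "geneB"): 0.92,
    ("geneA", "geneC"): 0.50,
    ("geneB", "geneC"): 0.08,
}

network = threshold_network(edge_probs)
widths = {edge: edge_width(p) for edge, p in network.items()}
```

The threshold is arbitrary, as noted in the talk, so it is worth re-plotting at a few different values to see which edges are stable.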
These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. :)
BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.
More fully, he's talking about the micro-evolutionary dynamics of RNA viruses. They want to get a full picture of what happens from the infection of a single cell to an entire outbreak, and all the intermediate scales. The levels of granularity he's looking at go as follows: within cell, within host (not all viral particles in one host are genetically identical), within group (physical proximity of host to others), between groups (long-distance spreading). The data at each stage is different: from molecular data to epidemiological data. They looked at foot-and-mouth disease (FMD) and plum pox virus (PPV, transmitted by vectors), both RNA viruses. 10,000 farms were culled in the 2001 UK FMD outbreak. However, during this time, modellers were consulted. Samples were taken from every infected farm, and are stored at the IAH Pirbright. This means that there's lots of data available. Then, he described a genetic tree that was built based on the FMD viruses found in farms in Durham county during the outbreak. However, many transmission patterns are compatible with the tree. With some basic parameters, you can estimate how likely it is that one farm infected another. Among the total set of transmission trees (~2000), only 4 matched the values properly, so they can choose the most likely tree (which accounted for about 50% of the likelihood), and therefore the most likely transmission pattern. Some of the movements show very large distances (of about 15 km). Is it a fault of the model, or a signature of some extrinsic event like transmission via car travel (human) or delivery of infected material? They still have more data (e.g. timing of transmissions) that they have yet to use.
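The "share of the likelihood" idea for choosing between candidate transmission trees can be sketched like this. It is a toy under my own assumptions: the tree labels and likelihood values are invented (chosen so the best tree takes half the total), and how each tree's likelihood is computed from the farm data is not covered.

```python
def rank_trees(likelihoods):
    """Normalise tree likelihoods to shares and sort, best first."""
    total = sum(likelihoods.values())
    shares = {tree: value / total for tree, value in likelihoods.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Invented relative likelihoods for four candidate transmission trees.
likelihoods = {"tree_a": 4.0, "tree_b": 2.0, "tree_c": 1.0, "tree_d": 1.0}

ranking = rank_trees(likelihoods)
best_tree, best_share = ranking[0]
```

A best tree holding only about half of the total is a reminder that the alternatives are far from ruled out.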
These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. :)
BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.
Talking about the physical changes occurring in leaves, and looking at the different levels of granularity and orders of magnitude you need to think about (e.g. DNA-scale up to the macro, leaf-scale). Multidisciplinary team, and, as with other centres described at this workshop, the lines are beginning to blur. This was a really great talk, but had many videos that just cannot be reproduced here. There was a nice picture of someone viewing, in proper 3-d, bits of a plant, which went with the argument that transposing 3-d objects to 2-d can often cause problems with your visual analysis. They've been able to get parameters for the rate of growth of individual areas of a leaf - many areas, with many rates. They have made the Growth Factor Toolbox (GFtbox). The models they show using the GFtbox are very nice, and show the development of, for example, the specialized leaf of the pitcher plant or the growth of a "standard" leaf shape for Arabidopsis.
Great talk! :)
These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. :)
BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.
Judy Armitage: Bacterial sensory networks. The E. coli chemotaxis system is probably the best-understood "system" in biology, where biases in swimming direction are provided by regulating motor switching. The chemotaxis pathway is a paradigm for HPK-RR (histidine protein kinase - response regulator) pathways. There can be over 100 HPK pathways in a single species. OCISB projects include: extend E. coli models to species with two or more chemosensory pathways, and extend these to HPK-RR pathways in general to allow prediction of partners. They started with R. sphaeroides, her "favorite" bacterium. This bacterium has 2 targeted pathways preventing crosstalk. They gave the generated data sets to the modelling groups and asked whether the proteins were operating in a parallel or a linear pathway. The control theory people came up with 4 models that fit the data, but 3 could be excluded based on perturbation tests in vivo. The same data was given to mathematical biologists.
Modelling was with ODEs (temporal dynamics) and partial DEs (for spatiotemporal dynamics). Porter et al. (2008, PNAS online) showed how histidine kinase CheA3 is also a specific phosphatase for CheY6-P, one of the 6 motor binding proteins - tuning kinase:phosphatase will control motor switching. Further, there must be a link between cytoplasmic cluster and polar kinase. CheB2~P phosphorelay allows response to environment to be tuned to metabolic need. How common is this and how is discrimination achieved? CheA (HPK) CheY/CheB (RR). Modelling MCP Helix mutants with the sidekick tool - a coarse-grain transmembrane (TM) pipeline.
These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. :)
BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.
Michael White: Dynamics and function of the NF-kappaB signalling system. NF-kappaB controls cell division and cell death in all cells. How can a simple signal carry so much information (the cell cannot afford to make a mistake!)? It is a complex network with multiple feedback loops (high dynamic complexity). People think that the IkappaB holds it in the cytoplasm, but this doesn't look to be correct. Living cell imaging shows that NF-kappaB oscillates asynchronously between the cytoplasm and nucleus in single cells (i.e. doesn't happen at the same time in multiple cells). However, each cell is cycling with the same amplitude etc, so they're doing the same thing, just not at the same time.
Can we synchronise the oscillations? You can do a repeat pulse protocol and then check to see if the synchronisation has happened. When you stimulate at 100-150 minutes then you can synchronise and not get damped oscillations. They have built a stochastic model. There is a nice set of pictures of pathways, but obviously I cannot reproduce those here.
Here go the batteries again...(rest of notes from the paper notes I took, which are generally much lower quality)
Some of this work is funded under SABR, where they will focus on dynamic live cell imaging, quantitative proteomics/phosphoproteomics, genomics/bioinformatics, data analysis, deterministic/stochastic modelling and databases. What are the causes of differential expression? Oscillation dynamics is one possibility (and what he describes in this talk). Others could be signal-specific IkappaB processing, differential NF-kappaB dimer formation, differential protein modifications. Is degradation of IkappaBs regulated by Rel protein binding? NF-kappaB could be differentially phosphorylated.
Finally, one last note on outreach: they've had quite the success with biologists interacting with mathematicians in the group. Biologists are now taking weekly math courses, and it was their idea. That's great :)
These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. :)
BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.
PRESTA stands for Plant Responses to Environmental STress in Arabidopsis. Even though the environment is changing rapidly, investment in plant research has declined. Abiotic and biotic stresses will function via core response networks embellished with stress-specific pathways. A fundamental component of these responses is transcriptional change. It seems that in many of the components in stress responses, hormones are key: also, everything seems to focus through key pathways. Two approaches are used: top-down modelling via network inference, or bottom-up modelling via already extant knowledge of key genes. This talk focused on the former.
They used high-resolution time-course microarrays which use 31,000 genome sequence tags (you need these to get the information to the modellers). Then, they use a range of different stress responses to reveal commonalities (developmental e.g. senescence, pathogens, and abiotic stress). One example: over 48 hours there were 24 time points taken with 4 biological and 3 technical replicates. Two-color arrays allow complex loop design. They've been using the MAANOVA program, and even altered it to make it more efficient. You basically end up with an F-test that tells you which genes have changed over time. How to select genes for Network Inference Modeling?: GO annotations, genes known to be involved in stress-related processes, transcription factors known to be involved, early response genes and prior knowledge.
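As a rough idea of what "an F-test that tells you which genes have changed over time" means, here is a one-way ANOVA F statistic for a single gene, treating each time point as a group of replicate measurements. This is a deliberate simplification and my own assumption: MAANOVA fits a fuller model that accounts for the array design, and the expression values below are invented.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-time-point variance over within-time-point variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Raises ZeroDivisionError if replicates are identical within every group.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented expression values: three time points, three replicates each.
changing_gene = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
flat_gene = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]

f_changing = f_statistic(changing_gene)
f_flat = f_statistic(flat_gene)
```

A large F means the differences between time points are big relative to the replicate noise; turning that into a gene list still needs a p-value and a multiple-testing correction across all the tags.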
There goes the battery again! Grrr.... transcribed paper notes follow, which aren't generally as detailed in my case...
Variation of network models: 4 out of the 12 prospective genes were shown to have an altered pathogen growth phenotype. Knockouts in a hub gene showed both up- and down-regulation of senescence. They want to add validation to the network model, and have validated various genes via experimental work. They developed APPLE, which is the Analysis of Plant Promoter-Linked Elements. They discovered that if you overexpress HSF3, the plants are more tolerant to drought and show increased seed yield. HSF3 is part of the stress response but has a wide range of interactions, which is a good thing for building parameterized models. In the future, they want to look at the genetic diversity in the crops, and try to express a more robust response to the environment.
These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. :)