David Searls (GlaxoSmithKline Pharmaceuticals, USA)
Other than where specified, these are my notes from the IB07 Conference, and not expressions of opinion. Any errors are probably just due to my
own misunderstanding.
A metaphor for systems biology (SB): the organizing paradigms of systems and languages map neatly to each other. The "parts list" is the vocabulary, or lexicon. "Connectivity" is the rule-based syntax that determines how words may be arranged. "Function" is the subject matter of semantics. Analogous organizing paradigms can be found in a number of related domains, including systems (componentry, connectivity & behaviour: these match the previous set of 3 words). The equivalent triple for proteins is sequence, structure & function. In some ways, you can think of proteins as systems themselves. The semantics, or meaning, of a system is separate from its pragmatics, which is what it does, usually in the larger context of a discourse. This matches the distinction between the pure function of a protein and its actual role.
How does complexity arise in biological networks? Pleiotropy & redundancy. Networks ramify by overlapping function. Pleiotropy (multifunctionality) is common, as in the case of "moonlighting" proteins. Redundancy of function is the flip side of pleiotropy, and such redundancy (full or partial) contributes robustness. Linguists have similar terms for wordnets: polysemy and synonymy.
Network emergent properties: in connecting pathways into networks, it has been suggested that important novel properties emerge. This idea has started to take hold in the biology community. The phrase can be traced back to the early part of the 20th century and the idea of the unity of science. Reductionism: science generally proceeds by reduction to fundamental components and behaviours. Emergence: complex systems are thought to demonstrate this, such that "the whole is greater than the sum of the parts", and such behaviour could not be predicted a priori. Reductionism seems to be "under fire" at the moment: something completely new and different is seen as the best thing. However, it is fair to say that what systems biology actually claims is that reductionism will no longer do by itself.
The 19th-century logician Gottlob Frege set up competing principles of "meaning": firstly, compositionality (the meaning of the whole is a function of the meaning of the parts), and secondly, contextuality (no meaning can exist independently of the overall context). Contextuality can be dealt with in a compositional context if you know how much context will be necessary. For instance, substrings have variable pronunciations: "ough" has 6 different pronunciations, but by looking at the letters around that set of letters, you can determine the pronunciation. But how many more letters do we need to look at? The same thing happens with whole words: does vs does. How do we use the context to determine pronunciation here? In proteins, the string ASVKQVS is part of a beta-sheet in an amino-peptidase, and an alpha-helix in a guanylate kinase.
From a compositional viewpoint, examine the example of artificial neural networks, where you imitate life to try to "learn" functions of many variables. Minsky & Papert showed that early nets couldn't classify some functions, such as exclusive-OR (that is, X or Y but not both), but adding a "hidden layer" of neurons fixed this. This seems to be a case for emergent properties. But is it really? If you design the net from scratch, ab initio, you can get the same result, so it is just a case of simple logic. However, could emergence simply be a matter of scale? Would imponderable properties arise in larger hidden layers?
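To make the exclusive-OR point concrete, here is a minimal sketch of my own (not from the talk) of a net designed "ab initio". The weights and the names step, xor_net and single_unit_solves_xor are my assumptions, chosen by hand rather than learned. The hidden layer is just an OR unit and an AND unit, which is the "simple logic" reading of the result; the small grid search only illustrates, and does not prove, that a single threshold unit cannot do the job.

```python
import itertools

def step(x):
    """Threshold activation: fire (1) when the weighted input is positive."""
    return 1 if x > 0 else 0

def xor_net(x, y):
    """Hand-designed 2-2-1 threshold network (weights are illustrative)."""
    h_or = step(x + y - 0.5)    # hidden unit 1 computes x OR y
    h_and = step(x + y - 1.5)   # hidden unit 2 computes x AND y
    return step(h_or - h_and - 0.5)  # output fires for OR but not AND

def single_unit_solves_xor():
    """Try every single-unit weight setting on a coarse grid from -3 to 3."""
    grid = [i / 2 for i in range(-6, 7)]
    for w1, w2, bias in itertools.product(grid, repeat=3):
        if all(step(w1 * x + w2 * y + bias) == (x ^ y)
               for x in (0, 1) for y in (0, 1)):
            return True
    return False

# Truth table as (x, y, output) triples.
truth_table = [(x, y, xor_net(x, y)) for x in (0, 1) for y in (0, 1)]
print(truth_table)
print("single unit can do XOR on this grid:", single_unit_solves_xor())
```

Nothing here is surprising once the hidden units are read as OR and AND, which is the speaker's argument that the hidden layer is composition rather than emergence.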
There are some interesting parallels between neural network research of 20 years ago and omic datasets of 5 years ago. Firstly, in NNs there was a belief/concern that a net is a "black box" whose architecture is opaque to interpretation (researchers then worked on rule extraction). In omics, profiles may emerge in the absence of any clues to the mechanism. Secondly, in NNs, if hidden layers are too big, nets tend not to generalize but just to memorize (overfit). In omics, high-dimensional data with few samples can allow statistical artifacts. Finally, NNs learn differently upon being retrained (nondeterminism). Luckily, these concerns in NNs faded over time.
In what ways might complex biological systems resist reductionist description? Firstly, dependency (too highly interconnected to afford discrete, mechanistic explanations), and secondly, ambiguity (too pleiotropic and nondeterministic for definitive or tractable analysis). Dependency is, of course, found in biological systems: nucleotide base pairs embody dependencies in structural RNAs. 2° structure is an abstraction of this. Dependencies are "stretched" by linearizing the primary sequence. Also, side-chain interactions embody dependencies in folded protein chains. 2° structure is a modular abstraction. Dependencies are parallel / antiparallel orientations and chirality.
Folding ambiguity example: attenuators use alternative RNA 2° structure by exploiting the syntactic ambiguity of the underlying grammar. He then introduced the Chomsky hierarchy, but it was quite a complex table and cannot be reproduced here. Not only is the Chomsky hierarchy useful for understanding modularity, but integrated circuits (ICs) are abstracted hierarchical modules and should also be considered.
Rosetta Stone proteins: proteins that interact or participate in the same pathway are often fused. Catalogues of fusions can predict function.
Circuit design has steadily evolved to higher levels of abstraction & modularity: standard cell VLSI design uses libraries of validated, reusable circuit building blocks. Full custom design is reserved for optimization, and hardware description languages (HDLs) let chips be designed like writing software. Hardware and software are therefore a continuum: microcode, programmable gate arrays, etc. Some bioinformaticists have written pseudocode to describe biological pathways. In 1968 computer scientist Edsger Dijkstra wrote a now-classic short note entitled "GOTO considered harmful". In it he criticized programming constructs that allow undisciplined jumps in flow of control, leading to so-called "spaghetti code", which made larger programs unwieldy. He thereby helped to launch the structured programming movement, which enforced a strictly nested modularity for more manageable growth, debugging, modification, etc.
Does nature write spaghetti code? Well, it looks like pasta to the uninitiated ;) However, if you actually look at how things are put together, you'll notice it probably doesn't. Protein domains combine predominantly by concatenation or insertion, as seen in pyruvate kinase. Do proteins interleave? Only very rarely do proteins seem to have interleaved domain structures, like D-maltodextrin binding protein with three inter-domain crossings (perhaps due to a translocation?). This puts biological structure and human language at the same level in the Chomsky hierarchy.
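The difference between concatenation, insertion and interleaving can be sketched in a few lines. This is my own illustration, not something shown in the talk: the per-residue domain labels and the function name is_interleaved are assumptions. A chain counts as interleaved when two domains cross in an A..B..A..B pattern, which is the "GOTO" case; concatenation (AABB) and insertion (ABBA) stay strictly nested.

```python
def is_interleaved(labels):
    """Return True if any two domains cross (A..B..A..B) along the chain.

    labels: one domain label per residue, e.g. "AAABBBAAA".
    """
    stack = []      # domains currently "open", innermost last
    closed = set()  # domains that ended because an outer domain resumed
    for label in labels:
        if stack and stack[-1] == label:
            continue                 # still inside the same domain
        if label in stack:
            # An outer domain resumes: everything nested inside it is closed.
            while stack[-1] != label:
                closed.add(stack.pop())
        elif label in closed:
            return True              # a closed domain reappears: crossing
        else:
            stack.append(label)      # a new domain starts (concatenated or inserted)
    return False

print(is_interleaved("AAABBB"))     # concatenation
print(is_interleaved("AABBBAA"))    # insertion of B into A
print(is_interleaved("AABBAABB"))   # crossing domains
```

A stack is enough to recognise the nested cases, which is the same property that places strictly nested structures in the context-free part of the Chomsky hierarchy.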
Organizing paradigms for linguistics can readily extend to proteins and systems. Abstracted, hierarchical modularity is a means to support "controlled" growth in complexity, in both design and evolution. The Chomsky hierarchy offers tools to measure and analyze this complexity. Proteins and systems form a continuum, exhibiting both compositionality and contextuality (but emergence...? Perhaps not).
The "Computational Thinking" Movement has been growing in recent years, and such work could help people who aren't used to thinking of modularization.
My opinion: Fantastic talk! Great way to start the day.
Presented by Anna-Lise Veuthey, from SIB.
Increase interoperability between molecular biology and clinical resources by indexing UniProtKB with medical terminologies, including MeSH. Related work includes GenesTrace, PhenoGO, and MedGene. These systems use text mining methods, or knowledge- and semantic-based methods using ontological relationships of terms.
Why use MeSH? MeSH is a hierarchical CV developed by NLM. It is part of UMLS and thus is linked to other medical terminologies. Further, it is used to index the biomedical literature.
200 disease names from 97 Swiss-Prot entries were manually mapped to MeSH terms. This set was used to evaluate the procedure in terms of recall and precision, and to set up a score threshold.
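As a reminder of how such an evaluation works, here is a minimal sketch under my own assumptions: the disease names, MeSH identifiers and scores below are invented placeholders, not data from the talk, and precision_recall is a hypothetical helper. Raising the score threshold trades recall for precision, which is how a threshold can be tuned for a fully automated, high-precision procedure.

```python
def precision_recall(predictions, gold, threshold):
    """Score automatic mappings against a manually curated gold standard.

    predictions: list of (disease_name, mesh_term, score) tuples
    gold: dict mapping disease_name -> manually assigned mesh_term
    """
    accepted = [(d, m) for d, m, score in predictions if score >= threshold]
    true_pos = sum(1 for d, m in accepted if gold.get(d) == m)
    # Precision is reported as 0.0 when nothing passes the threshold.
    precision = true_pos / len(accepted) if accepted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Invented toy data: four curated mappings and four scored predictions.
gold = {"disease A": "M1", "disease B": "M2", "disease C": "M3", "disease D": "M4"}
predictions = [
    ("disease A", "M1", 0.9),
    ("disease B", "M2", 0.7),
    ("disease C", "M9", 0.6),   # wrong term, but a fairly high score
    ("disease D", "M4", 0.3),   # right term, but a low score
]

for threshold in (0.5, 0.8):
    p, r = precision_recall(predictions, gold, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```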
The mapping system was tuned for high precision to provide a fully automated procedure. But they need to improve the recall by: including NLP techniques in the disease extraction and matching procedures, refining the score with other parameters, trying to map to other terminologies such as SNOMED CT, and using information from the literature which is indexed with MeSH terms.
They developed a generic terminology mapping procedure which can be used to link various biomedical resources. Further, indexing Swiss-Prot with medical terms opens new possibilities of searching and mining data relevant for clinical research.
Protein-protein interaction networks: fundamental for comprehensive systems biology? PPIs are crucial for cellular processes. Structuring the "hairball": modularity is a major design principle of biological systems. Division of networks in modules, i.e. clusters of proteins that are highly connected etc. The integration of HPI Maps meant that they've included over 160,000 interactions between over 17000 proteins. The PPI Network is assembled via data extraction from literature, giving 35000 interactions between 8500 non-redundant proteins (based on EntrezID).
Identification of modular structures: most communities have fewer than 15 proteins, in the "mesoscale" range of clusters. Most of the proteins attributed to modules are found in only 1 community. The next step is functional interactomics, which is linking modules to functions, and the interactions within and between modular structures. Characterization and annotation of detected modules were done using GO information and expression data.
Cellular localization of modules: analysis based on 20 informative GO categories. Almost half of the proteins are assigned to "nucleus". Of 316 modules of k>3, 170 contained only proteins allocated to one location. Co-expression, co-localization, and common functional annotation: co-expression and co-localization are only modestly correlated (0.27), and 34/51 large modules (k>10) are significantly co-expressed.
From MIN model to ordinary differential equations. A MIN model is a knowledge management formalism for biology. A model should enable knowledge integration, hypothesis testing, prediction of response, and discovery of fundamental processes.
MIN has: universality (the integration of various kinds of bio data available today), parsimony (the simplest possible representation of the data), incrementality (construction of more complex models from simpler ones), precision (expression of relations in a non-ambiguous mathematical way), and transposability (formal rules for the translation of the information contained in the model into commonly used (target) modelling formalisms). MIN improves the MIB model: it is a bi-partite graph with labelled nodes and arcs.
Putting microscopic and macroscopic data together. In an example, she describes relation "F", which enumerates the experimentally observed system states expressed through the variables' values. Translation into multivalued logical formalism: the translation procedure produces the candidate models for further analysis. Then there is a direct translation into ODEs.
Note: I lost my way about here, but it sounded really interesting nonetheless. Refer to the paper for more details.
A talk about multi-value networks, high-level Petri nets, and the differences with Boolean networks. Formal methods are required to model and analyse complex regulatory interactions. Boolean networks offer a good starting point, but are often too simplistic. Multi-value networks (MVNs) are qualitative, and are seen as a middle ground between differential equation models and Boolean networks.
He has applied high-level Petri net techniques and a wide range of analysis tools. In MVNs, entities assume a range of values (0...n). Each entity has a neighbourhood of other entities that affect it, and the behaviour of each entity is described using state tables. However, we can't really analyse this directly: that's where Petri nets come in. They have a graphical notation with mathematical semantics and can model choice, synchronization and concurrency. They have an expressive framework with data types and equational description of behaviour. There is a wide range of analysis techniques and tool support, e.g. model checking. Petri nets use a kind of tokenizing system.
Their approach was as follows. They have defined a set of state transition tables that completely define the model. Equational definitions are extracted from these tables, and then a Petri net is constructed. They also use multi-value logic minimization, applied to each state transition table, to simplify the information from the tables. Construction of the high-level Petri net begins with a single place for each entity, connected to a central transition. The transition encodes the equational specification of network behaviour. Each place "x" is connected to the transition node by an input arc "x" and an output arc "x".
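To fix the idea of state transition tables in my own head, here is a minimal sketch of a two-entity multi-value network. Everything in it is my assumption: the entities "signal" (values 0..2) and "gene" (values 0..1), their tables, and the choice to update all entities at once, which loosely mirrors a single central transition reading every place and writing every place. It is not the speaker's E. coli model and it is not a Petri net implementation.

```python
# Neighbourhood: which entities determine each entity's next value.
neighbourhood = {
    "signal": ("signal", "gene"),
    "gene": ("signal",),
}

# One state transition table per entity: neighbourhood values -> next value.
tables = {
    "signal": {(0, 0): 0, (0, 1): 0, (1, 0): 2, (1, 1): 0, (2, 0): 2, (2, 1): 1},
    "gene": {(0,): 0, (1,): 0, (2,): 1},
}

def fire(state):
    """Apply every state table once, reading only the old state."""
    return {
        entity: tables[entity][tuple(state[n] for n in neighbourhood[entity])]
        for entity in tables
    }

def trajectory(state, steps):
    """Return the list of states visited, starting with the initial one."""
    visited = [dict(state)]
    for _ in range(steps):
        visited.append(fire(visited[-1]))
    return visited

for s in trajectory({"signal": 1, "gene": 0}, 4):
    print(s)
```

A Boolean version of the same model would have to collapse the three signal levels into two, which is the kind of detail the model comparison at the end of the talk is asking about.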
They showed how this worked through carbon starvation in E. coli. Exponential growth occurs where there is sufficient carbon, but the cells enter a stationary phase when the carbon is depleted. The model is validated by checking known properties. Then, you can look at dynamic properties. A mutant analysis was also done, where you can "knock out" or overexpress key genes and observe the effect.
Finally, they do a model comparison with the Boolean network equivalent of this model. There are differences, which leads to some interesting questions: how much detail is required in the model? Is the model representable in the Boolean domain?
My opinion: A great, interesting talk that flowed well and was easy to understand. Slides were a little overfull, but it didn't detract. A natural speaker.
Most diseases are multifactorial. Now we've reached a critical mass where we need to put all the information together, to understand the whole rather than the bits. This is the main aim of the Reactome database: to create a single, hand-curated model of human molecular biology, in pathway form.
There are three problems with the currently available biological information: data is lacking, the data that is out there is dispersed over a vast array of literature etc., and often the data is inapplicable (how do we discover what information is pertinent to us?). They extract the knowledge from the experts and insert it into Reactome via its data model. The core of Reactome is the mapping: they use external vocabularies, information etc. They map to proteins via UniProt and get further cross-references via this database. Other primary resources include GO and ChEBI. They also provide an API in both Perl and Java that can be used for querying and working with Reactome. There is also an online application called Reactome Mart.
How does Reactome represent its data? The better the way to describe biological structure, the better the method to describe biological activity. There is a physical entity class that can describe their state and what they are. What then followed was a very nice description of the data flow and the data model of Reactome, specifically how it deals with combinatorial explosion of possible complex types in a given pathway.
Even though they concentrate on human, they can use orthology information and create putative skeleton models for other organisms. They also have a tool called SkyPainter where you can paste in your favorite IDs and gene expression data, and then the Reactome pathways can be colored according to your data or IDs. All work and tools are freely available from the Reactome website.
Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models
Published September 2007 by the Journal of Integrative Bioinformatics
Allyson L. Lister1,2, Matthew Pocock2, Anil Wipat1,2,*
1 Centre for Integrated Systems Biology of Ageing and Nutrition (http://www.cisban.ac.uk)
2 School of Computing Science (http://www.cs.ncl.ac.uk),
Newcastle University (http://www.ncl.ac.uk)*
Abstract
The creation of quantitative, simulatable, Systems Biology Markup Language (SBML) models that accurately simulate the system under study is a time-intensive manual process that requires careful checking. Currently, the rules and constraints of model creation, curation, and annotation are distributed over at least three separate documents: the SBML schema document (XSD), the Systems Biology Ontology (SBO), and the “Structures and Facilities for Model Definition” document. The latter document contains the richest set of constraints on models, and yet it is not amenable to computational processing. We have developed a Web Ontology Language (OWL) knowledge base that integrates these three structure documents, and that contains a representative sample of the information contained within them. This Model Format OWL (MFO) performs both structural and constraint integration and can be reasoned over and validated. SBML Models are represented as individuals of OWL classes, resulting in a single computationally amenable resource for model checking. Knowledge that was only accessible to humans is now explicitly and directly available for computational approaches. The integration of all structural knowledge for SBML models into a single resource creates a new style of model development and checking.
Introduction
Systems Biology Markup Language[1] (SBML) is an XML format that has emerged as the de facto standard file format for describing computational models in systems biology. It is supported by a vibrant community who have developed a wide range of tools, allowing models to be generated, analysed and curated in any one of many independently maintained software applications[1]. The Systems Biology Ontology[2][2] (SBO) was developed to enable a useful understanding of the biology to which a model relates, and to provide well-understood terms for describing common modelling concepts. The community is engaged in an on-going effort to develop the SBML standard in ways needed to support systems biology applications. As part of this process, a manual is maintained that describes and defines SBML and SBO[3].
The biological knowledge used to create and annotate a high-quality SBML model is typically analysed and integrated by a researcher. These modellers know and understand both the systems they are modelling and the intricacies of SBML. However, as with most areas of biology, the amount of data that is relevant to generating even a relatively small and well-scoped model is overwhelming. In order to extend the range of modelling tasks that can be automated, it is necessary to both capture the salient biological knowledge in a form that computers can process, and represent the SBML rules in a way computers can systematically interpret. Here we address the latter issue: describing SBML, SBO and the rules about what constitutes a correctly formed model in a way suitable for computational manipulation.
The Semantic Web[4] can be seen as today’s incarnation of the goal to allow computers to go beyond performing numerical computations, and to share and integrate information more easily. There are now several standards forming within the Semantic Web community that together formalise computational languages for representing knowledge and strictly define what conclusions can be reached from facts expressed in these languages. The Web Ontology Language[3][5] (OWL) is one such language that enjoys strong tools support and which is used for capturing biological and medical knowledge (e.g. OBI[6], BioPax[7], EXPO[4], and FMA[5] and GALEN[6] in OWL). Once the information about the domain has been modelled in an OWL file, a software application called a reasoner[7, 8] can automatically deduce all other facts that must logically follow as well as find inconsistencies between asserted facts.
The knowledge about a system described in SBML can be divided into two parts. Firstly, there is the biological knowledge. This includes information about the biological entities involved and their biological. Secondly, there is the structural knowledge, describing how the biological knowledge must be captured in well-formed documents suitable for processing by applications. In the case of a high-quality SBML model, the structural knowledge required to create such a model is tied up in three main locations:
- The Systems Biology Markup Language (SBML[1][8]) XML Schema Document (XSD[9]), describing the range of XML documents considered to be in SBML syntax,
- The Systems Biology Ontology (SBO[2][10]), describing the range of terms that can be used to describe parts of the model in a way understandable to the community using the Open Biological Ontologies (OBO[11]) format, and
- The "Structures and Facilities for Model Definition" document[12] (hereafter referred to as the "SBML Manual"), describing many additional restrictions and constraints upon SBML documents, and the context within which SBO terms can be used, as well as information about how conformant documents should be interpreted.
From a knowledge-engineering point of view, it makes sense to represent these sources of structural knowledge as part of a single knowledge base. Although, to a knowledge-engineer, this current separation of documents could appear arbitrary, it is in fact well-motivated according to consumers of each type of information. The portion of the knowledge codified in SBML transmits all of and only the information needed to parameterise and run a computational simulation of the system. The knowledge in SBO is intended to aid humans in understanding what is being modelled. The SBML Manual is aimed at tools developers needing to ensure that software developed is fully compliant with the specification.
Only two of these three sources of structural knowledge are directly computationally amenable. SBML has an associated XSD that describes the range of legal XML documents, which elements and attributes must appear, and constraints on the values of text within the file. SBO captures a term hierarchy containing human-readable descriptions and labels for each term and a machine-readable ID for each term. Neither of these documents contains much information about how XML elements or SBO terms should be used in practice, how the two interact, or what a particular conformant SBML document should mean to an end-user. The majority of information required to develop a format-compliant model is in the SBML Manual, in formal English. Anything more than simple programmatic steps, such as XML validation, can currently only be done by manually encoding the English descriptions in the SBML Manual into rules in a program. libSBML[13] is the reference implementation of this procedure, capturing the process of validating constraints. Manual encoding provides scope for misinterpretation of the intent of the SBML Manual or may produce code that accepts or generates non-compliant documents due to silent bugs. In practice, these problems are ameliorated by regular SBML Hackathons[14] and the use of libSBML by many SBML applications. However, the need for a more formal and complete description of the information in the SBML Manual becomes more pressing as the community grows beyond the point where all of the relevant developer groups can be adequately served by face-to-face meetings.
We find that some of these issues can be avoided by combining the structural knowledge currently spread across three documents in three formats into a single computationally amenable resource. This method of constraint integration for all information pertinent to SBML will require a degree of rigour that can only improve the clarity of the specification. Once established, standard OWL tools can be used to validate and reason over SBML models, to check their conformance and to derive any conclusions that follow from the facts stated in the document, all without manual intervention.
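For blog readers who have not met this kind of constraint before, the sketch below illustrates the "manual encoding" route that the excerpt contrasts with MFO: one English rule turned by hand into code. It is not MFO, which expresses such rules in OWL for a reasoner, and it is not libSBML. The rule chosen (every speciesReference must name a declared species), the toy document and the function names are my own illustrative assumptions. Note that the toy document is perfectly well-formed XML, which is why a purely structural check would not notice the problem.

```python
import xml.etree.ElementTree as ET

# Toy SBML-like document: species "B" is referenced but never declared.
TOY_MODEL = """
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy">
    <listOfSpecies>
      <species id="A" compartment="c"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="r1">
        <listOfReactants><speciesReference species="A"/></listOfReactants>
        <listOfProducts><speciesReference species="B"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def undeclared_species(sbml_text):
    """Return the species ids that are referenced but never declared."""
    root = ET.fromstring(sbml_text.strip())
    declared = {el.get("id") for el in root.iter()
                if local_name(el.tag) == "species"}
    referenced = {el.get("species") for el in root.iter()
                  if local_name(el.tag) == "speciesReference"}
    return sorted(referenced - declared)

print(undeclared_species(TOY_MODEL))
```

Every rule handled this way is one more hand-written function to keep in step with the English text, which is the maintenance burden the paper is trying to remove.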
To address this proposition, we have developed the Model Format OWL (MFO), implemented in OWL-DL and capturing the SBML structure plus a representative sample of SBO and human-readable constraints from the SBML Manual. We demonstrate that MFO is capable of directly capturing many of the structural rules and semantic constraints documented in the SBML Manual. The mapping between SBML documents and the OWL representation is bi-directional: information can be parsed as OWL individuals from an SBML document, manipulated and studied, and then serialized back out again as SBML. We demonstrate feasibility with two simple, illustrative examples. In future, we hope to use this as the basis for a method of automatically improving the annotation of SBML models with rich biological knowledge, and as an aid to principled automated model improvement and merging.
The integration of all structural knowledge for SBML models into a single resource creates a new style of model document development, which we believe will greatly reduce the overheads associated with computational transformations between biological knowledge and high-quality systems biology models. MFO is not intended to be a replacement for any of the APIs or software programs available to the SBML community today. It addresses the very specific need of a sub-community within SBML that wishes to be able to express their models in OWL for the purpose of reasoning, validation, and querying. It has also been created as the first step in a larger data integration strategy that will eventually encompass the biological as well as structural knowledge present in SBML documentation and models.
[1] Hucka, M. et al.: The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics (Oxford, England) 19 (2003) 524-531
[2] Le Novere, N.: Model storage, exchange and integration. BMC Neurosci 7 Suppl 1 (2006) S11
[3] Horrocks, I., Patel-Schneider, P.F., van Harmelen, F.: From SHIQ and RDF to OWL: The making of a web ontology language. J. of Web Semantics 1 (2003) 7-26
[4] Soldatova, L.N., King, R.D.: An ontology of scientific experiments. Journal of the Royal Society, Interface / the Royal Society 3 (2006) 795-803
[5] Heja, G., Varga, P., Pallinger, P., Surjan, G.: Restructuring the foundational model of anatomy. Studies in health technology and informatics 124 (2006) 755-760
[6] Heja, G., Surjan, G., Lukacsy, G., Pallinger, P., Gergely, M.: GALEN based formal representation of ICD10. International journal of medical informatics 76 (2007) 118-123
Enjoyed this? To read the rest, please see the Journal of Integrative Bioinformatics
Carole compares it to mySpace. First, an introduction to workflows, where all of the standard bioinformatics statements are made: workflows are fantastic, but laborious; this is the era of Service Oriented Architecture; we are trying to make repetitive & mundane stuff easier. This leads nicely into a mention of Taverna, a "workflow workbench". Taverna 2 is being built now. They are getting 15,000 downloads per month, in a good month. Then Carole spent a slide talking about the phrase "In the Cloud", which is a descriptor for independent third-party applications, tools and software. Taverna was designed for people who have little access to resources. She then described a few good examples of how Taverna could be used. Carole suggests doing more in Taverna than just workflows, for example SBML models or lab protocols. Then she gave some reasons why using experimental data standards is a good idea. Carole also mentioned that workflows (rather than just data) could be included in peer-reviewed articles.
myExperiment is meant to make it easy for scientists to pool information and data, and is meant to look like a social networking site. This includes collaborative social bookmarking and content sharing. They want to leverage and serve the long tail end of the "cloud". They'll use it as a gateway to other publishing environments and a platform for launching workflows. It is an "Open Archives" Initiative. They want to be able to launch and run workflows in Taverna via myExperiment. They also want to encourage workflow "mashup" and publishing.
Here's where my opinion goes: By halfway through the talk, she hadn't said what she meant by myExperiment, other than that it is meant to be the mySpace for bioinformatics. However, she did spend at least the last 15 minutes discussing it. Further, while she talks about the usefulness of experimental data standards in relation to adding lab protocols to Taverna, she mentioned it without ever relating to FuGE. As FuGE is being published in Nature Biotech and is being touted as a possible standard experimental data exchange format, it seems an odd omission. (Especially as one of the main developers of FuGE, until recently, also worked at University of Manchester.) In conclusion, while very interesting, not as "meaty" as I'd like. A good talk overall, and I think I'll sign up for myExperiment!
OXL is the ONDEX data format, and they are presenting it as a possible format for the exchange of integrated data. OXL is based upon an ontology (opinion/question: a true ontology, or a CV?) of concepts and relations. ONDEX itself is an open-source data warehouse in Java that performs ontology-based data integration. OXL is in RDF. There are two ways to use RDF: firstly, things can be modelled as predicates (but then they cannot have attributes); secondly, and as the speakers recommend, they can be modelled as classes. However, it also seems that they have OXL in XML format, using an XSD.
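My reading of the predicate-versus-class point, sketched with plain Python tuples standing in for RDF triples. I am assuming the "things" in question are relations between concepts. All identifiers (proteinA, interacts_with, rel1, evidence) are made up for illustration and are not OXL vocabulary.

```python
# Option 1: the relation is a predicate. One triple says it all, but there
# is no subject to hang extra attributes on.
as_predicate = [
    ("proteinA", "interacts_with", "proteinB"),
]

# Option 2: the relation is an instance of a class. It costs more triples,
# but the relation now has its own identifier and can carry attributes.
as_class = [
    ("rel1", "type", "Interaction"),
    ("rel1", "from", "proteinA"),
    ("rel1", "to", "proteinB"),
    ("rel1", "evidence", "literature"),
]

def attributes_of(subject, triples):
    """Collect every (predicate, object) pair stated about one subject."""
    return {p: o for s, p, o in triples if s == subject}

print(attributes_of("rel1", as_class))
print(attributes_of("interacts_with", as_predicate))  # nothing to find
```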
In their XML format, they don't use any cross-references: it is fully expanded. Yes, it generates lots of XML files, but with file compression it isn't a problem. It does make whole-document validation more difficult, but they're working on it. This method makes it more human-readable.
They then presented some examples. The first was the identification of possible pathogenicity genes in Vibrio salmonicida (with the University of Tromso): identify clusters of orthologs involving V. salmonicida, then colour nodes according to pathogenicity phenotype.
http://ondex.sf.net
Here are my opinions: A well-presented talk on the whole. Don't mean to harp on today about architecture slides, but they're important when describing software. They had some, but they were so small they were pretty hard to read. Also, I've never been convinced about the "human-readable" explanation for why to make a change to an XSD: XML is simply not meant to be human-readable, and changes shouldn't be made to the XSD to make it so. However, ONDEX is a reasonably mature application, and so it may be useful to ask others to use their format. My main question is about probabilities: a lot of similar work uses weights on edges in data integration: how can these be modelled with OXL?
There are a large number of concepts and methods. Need to integrate different networks and additional information networks.
CABiNet (Comprehensive Analysis of Biological Networks). It is a generic network analysis suite with a semi-automatic network processing pipeline and methods for the exploration of a protein's functional network. The work was driven by the need to ID the substructures of the network (via clustering, network topologies, and known communities), and the knowledge that networks are incomplete (superposition of networks). Problems with the former include the fact that different algorithms may lead to different results. It was created with a component-oriented architecture. The architecture slides were very clear, but necessarily hard to reproduce in these notes. It uses Hibernate, Spring, and EJBs, among others. The "n-tier" architecture they used is called Genre. There is an asynchronous invocation of the processing pipeline based on Message Driven Beans.
Via their integration tier they have access to SIMAP, which is used in the detection of orthologs. It contains millions of proteins and FASTA hits, and they can do real-time orthology searches. There is web-service and EJB access.
One example pipeline would be functional classification based on multiple biological networks. Another would be the identification of functional modules from gene expression data (you can upload microarray data). They also showed a real example of a protein interaction network from a nucleosomal complex, and then added more and more data sources until the network grew quite complex.
Here comes my opinion: The clearest talk of the afternoon, dealing with an externally-available application that allows the user to create pipelines to perform network analysis. This one did have good architecture slides: perhaps a few too many ;) . Their architecture makes me wonder if they use AndroMDA or some other MDA tool, as such plugins to Maven can build Hibernate/Spring/EJB/web services layers. They should have mentioned other similar projects, and how they are different, as I wouldn't know. However, that's a small beef. Finally, how do they know how good each data source is? Do they have a rating of each data source, for instance against a gold standard?