34 posts tagged “mged”
The third-place-winning SyMBA poster (from the MGED 2008 Meeting) is now available for download for anyone interested. Enjoy!
Here's a couple of views of the location for MGED 11:
While the MGED meeting officially ended yesterday, 4 September, my part in it had finished the day before, with the last poster session and the closing remarks. As a bioinformatician, the lab-biologist-oriented tutorials had little relevance for me. So, what were my thoughts on the meeting overall? Virtually all of the talks were interesting and given by speakers who performed well. There were the usual people who spoke so quickly that it was hard to follow, or who packed their slides full of so much information that the audience's attention had to be split between speaker and slides rather than one complementing the other, but these were few and far between. Of all of the interesting talks, here are my top three picks for each of the three days. It would be great to hear from other conference-goers, and see how their picks stand up.
- Monday: Shirley Liu, who spoke on ChIPSeq and Epigenetics, taught me why there is a difference between in vivo and in vitro nucleosome patterns, and had a great algorithm to talk about. Grant Cramer gave a fantastic talk on Systems Biology of Abiotically-Stressed Grapes: it was very clear, and I learned loads about why a water deficit caused by increased salinity is less harmful to the plant than the same water deficit caused by drought. Duccio Cavalieri's talk on Evolution of transcriptional regulatory networks in yeast populations on grapes, which was interesting and offered some really cool pictures of yeast colonies, was another highlight.
- Tuesday: Joe Gray's talk on Molecular-marker-guided treatment in breast cancer provided real insight into how research can inform clinicians and drug development. Second is Susanna-Assunta Sansone and Philippe Aldebert's talk on integrated standards for omics data, which really brought home how important integrating standards is to making data integration easier. The third mention is for the short 15-minute talk by Nigel Carter on DECIPHER, a really interesting use of Ensembl to link up data on rare and new disease phenotypes and genotypes across the world.
- Wednesday: Naama Barkai and her talk on Evolution of gene expression taught me a lot on a subject that I didn't really know much about before. Next is Stephen Oliver's talk on Flux control analysis and the systems biology of the eukaryotic cell. Though my note-taking for his talk was probably some of the flimsiest of the conference, it wasn't due to lack of interest. He is a great speaker who elicited probably the most laughs from the audience of any of the speakers on any of the days. He compared the budding yeast to the pi muson budding out from the photon (at least I think that's what it was!) to fantastic effect, and mentioned FuGE, a project important in my work. Finally, Atul Butte had a highly entertaining and interesting talk on another term I hadn't heard about yet: translational bioinformatics, which is very similar in subject to what Joe Gray had talked about the day before.
It was my first MGED conference: most who were there had been to many before. I did meet a few newbies like me, though. And, while the audience participation with respect to asking questions was a little lower than I have seen in other conferences, the participation *outside* of the meeting hall was much higher. That's a tradeoff that I'm more than happy with. I saw virtually everyone stay for the entire time in each of the poster sessions: something virtually unheard of in most conferences. While the great Italian wine may have had something to do with it, it was also obvious that it was the talks and posters that people really wanted to stay and talk about. Other people I spoke with also remarked on this behaviour in a positive way. Everyone simply seemed to be enjoying themselves both socially and academically. It was a relaxed, interesting, and fun conference in a beautiful location. I also think the conference bags were a big hit: you were given either a shiny silver or a shiny black one (mine is silver), which was a refreshing change from the boring looks of the ones you get from other conferences.
In fact, the true highlight for me was getting 3rd place in the poster competition. With it came a beautiful flower (shown below), a cash prize, and a warm fuzzy feeling of accomplishment. My poster was on SyMBA, my FuGE-based database and web interface (among other things) for holding and archiving experimental data and metadata. You can download the poster shortly on the SyMBA website - I'll put it up in the next few days and then update this post. I really want to thank the judges who thought my work - and my presentation of that work - was worthy of a prize, and also all the people that came up and asked me questions during the first poster session. Chatting with you all for two hours straight was loads of fun - even though it meant I didn't get to the wine until the end of the session! All of my A4 printouts of my poster were also gone before the end of the second poster session - always a good sign.
I encourage everyone who's interested to look at my posts of the conference over the past week. Please let me know of any errors or omissions you'd like me to sort out. I was taking notes while the speakers were talking, then saving the post as soon as they finished, so there are bound to be parts that could benefit from some help.
Next year's conference is in Arizona, and it would be great to experience one of these meetings again. A big thank-you to the organizers and to the judges, and to the attendees and speakers.
Ewan Birney
Keynote Talk, Afternoon Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
ENCODE
There are 4418 TSSs with multiple lines of evidence supporting them. This is ~10-fold more than the number of genes. Only 38% would be traditional ones. Having many more predicted TSSs is consistent with the considerable diversity of transcripts. Independently integrating ChIP-chip data suggested ~1000 "regulatory clusters". Sequence-specific factors are distributed symmetrically around the TSS. Histone information is highly correlated with gene on/off status.
What about the distal sites, and finding them? ChIP-chip isn't "great": most sites look close to one of these new TSSs, and there could be factor bias. DNaseI hypersensitive sites (DHSs) are an alternative, as all factors give a DHS signal, and 55% of DHSs are distal to any TSS.
Evolutionary conservation and ENCODE. All 44 ENCODE (pilot) regions together have 29.998 million bases. Of that, 4.9% are constrained, and of that, 40% are unannotated, 20% are other ENCODE Experimental Annotations, 8% are UTRs, and 32% are coding. Most of the genome is unconserved, which is to be expected. But, not everything is constrained. For instance, ancient repeats (ARs) have a very small fraction of experimental annotation overlapping a constrained sequence (i.e. they genuinely look like they're evolving neutrally). About 90% of coding exons are constrained in some way. Under 50% of the DHS are unconstrained. Why is there this discrepancy? False positives in the experiments? Not likely: experiments validate at >80% and cross-validate each other. What about false negatives in the constraint detection? Not likely again: they can detect up to 8bp elements, and within the "neutral" zone of alignability. OK, what about the neutral turnover model... There could be functional conservation, where a region starts out with two promoter sites, and a speciation event coincides with one site going to each new species. Then it could look like there is an unconserved region, when there is actually functional conservation.
What we should learn from ENCODE. "Whacky" transcription is real (but we don't know what it does), and there are unconventional transcripts; there are lots more TSSs than we understand (many "distal" regions are actually close to promoters); broad-specificity marks are more useful. Neutral model: just because things happen reproducibly in multiple tissues does not imply selection (this is not the same as experimental variance). This could imply "functional" conservation outside of orthologous bases (comparative genomics sequencing is not enough: need comparative functional investigation).
ENCODE scale up: 7 grants spanning all the main types of data generated for the pilot. There will be coordinated data collection (UCSC) and integrative analysis. There is also far tighter coordination (cell lines, standards, growth behaviour).
Ensembl
How to handle ENCODE data in Ensembl? In the gene build, add supporting evidence and annotation: from there, you get classification either manually or automatically. In a Regulatory build, declare sections of the genome as regulatory features using the union of many experiments. Then there will be predicted binding of Myc and information about promoter elements, and many cell lines. They are trying to break down the problem into 2 axes: the elements, and the status/annotation of those elements. There are Point sources (PS) and Annotating sources (AS)/broad marks. PS are DNaseI sites and TF binding. AS will be histone marks and methylation status. This will be represented with a small box for the PS and "whiskers" for the area of the AS.
Initial Regulatory Build (headaches): need to consistently recall peak data (have their own in-house caller, SWEmbl, which will hopefully work via ENCODE to harmonize). There are also genomic headaches (mitochondrial repeats in the genome, centromeres, etc), very long regions, and it's reminiscent of gene builds. Has 55 different datasets, 172112 elements...
Names of very broad types of classification of Regulatory features: genic (generic), promoter (cell-specific), genic (cell-specific), promoter (>1 cell line), unclassified (>1 cell line), unclassified (cell-specific).
These are just my notes and are not guaranteed to be correct.
Please feel free to let me know about any errors, which are all my
fault and not the fault of the speaker. :)
...Toll-like Receptor signalling
Christine Wells
Afternoon Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
Macrophages play a central role in pathogen detection and innate immune signaling. They have an impact on susceptibility to infection, tissue damage, wound healing, acute infection and chronic inflammatory disease. They look at subcellular localization, pathogen recognition/uptake, kinase activity, cytokine production, antigen destruction and presentation. 256 variant transcripts were expressed from 70 genes in mouse bone-marrow-derived macrophages. These are predicted to generate decoy receptors, among other things. They use FIRMA analysis, which allows for variations in probe efficiency, giving statistical analysis of gene-level and exon-level expression differences. Donor differences drive expression of human TOLLIP isoforms: they are temporally regulated, but are variable between donors. Predictable expression of some variants but stochastic expression of others may underlie phenotypic heterogeneity.
Josselin Noirel
Afternoon Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
A metabolic network is enzyme-centered, small-world, and scale-free. He describes the sort of x you expect when you know that a gene is, for instance, up-regulated. They would like to be able to ask "may we infer that a gene is up-regulated (or the probability that it is)" from measuring x. He used Nostoc punctiforme, a cyanobacterium able to fix nitrogen in specialized cells called heterocysts. He would like his metabolic network to integrate proteomic and transcriptomic data. However, these two types correlate poorly, although there is correlation in the metabolic pathways (missed part of this slide).
This work is meant to ease the identification of pathways, and to embody some of the underlying biology in the model. MMG identifies potential targets, and is available as an R package.
Stephen Oliver
Keynote Talk, Afternoon Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
pi muson budding out from the photon - physics shamelessly borrowing from budding yeast ;)
How can we deal with the complexity of this "simple" eukaryotic cell? Work top-down, with a coarse-grained model, and bottom-up, by defining discrete subsystems. With respect to the former, there is metabolic control analysis (MCA), which is a "shortcut" to modelling metabolism. The central device of MCA is the control coefficient, specifically the flux control coefficient (C), which is a measure of the degree of control an enzyme has on a pathway flux.
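To make the flux control coefficient a bit more concrete, here is a minimal sketch of my own (not from the talk). It assumes a toy two-step pathway S -> X -> P with reversible mass-action kinetics; the rate constants, concentrations and function names are all made up for illustration. The coefficient is estimated as the relative change in steady-state flux divided by the relative change in enzyme amount.

```python
def steady_state_flux(e1, e2, s=10.0, p=1.0, k1=1.0, k1r=0.5, k2=2.0, k2r=0.1):
    """Steady-state flux J through the toy pathway S -> X -> P.

    v1 = e1 * (k1 * s - k1r * x) and v2 = e2 * (k2 * x - k2r * p);
    setting v1 == v2 gives the intermediate concentration x in closed form.
    All parameter values are illustrative assumptions.
    """
    x = (e1 * k1 * s + e2 * k2r * p) / (e1 * k1r + e2 * k2)
    return e2 * (k2 * x - k2r * p)


def flux_control_coefficient(flux, enzymes, index, rel_step=1e-6):
    """Estimate C = (dJ / J) / (dE / E) for one enzyme by central differences."""
    base = flux(*enzymes)
    up = list(enzymes)
    down = list(enzymes)
    up[index] *= 1 + rel_step
    down[index] *= 1 - rel_step
    relative_flux_change = (flux(*up) - flux(*down)) / base
    return relative_flux_change / (2 * rel_step)


enzymes = (1.0, 1.0)  # relative amounts of the two enzymes (assumed)
c1 = flux_control_coefficient(steady_state_flux, enzymes, 0)
c2 = flux_control_coefficient(steady_state_flux, enzymes, 1)
print(f"C1 = {c1:.3f}, C2 = {c2:.3f}, sum = {c1 + c2:.3f}")
```

In this toy pathway the two coefficients add up to one, which is the summation theorem of MCA: control of the flux is shared between the enzymes rather than sitting in a single "rate-limiting" step.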
Changing the flux: the big experiment (taken 9 years so far). The idea is that the experimentalists should control the flux == the growth rate of the organism. The hypothesis is that they can do this by controlling what goes in the chemostat. By changing relative concentrations of micronutrients, we can make it so that one of those micronutrients is the growth-limiting micronutrient. For the vast majority of genes, there is no significant difference in flux when reducing carbon from 2 to 1. Phosphorus-limited is slower (decreased fitness). Nitrogen created an increase in fitness. 196 genes showed haploproficiency in the three defined nutrient limitations, and 350 genes showed haploinsufficiency under the same limitations.
Is it a universal law, or is it context dependent? Use a turbidostat rather than a chemostat, and have no nutrient limitation. The top classes that show haploinsufficiency include cytosolic ribosome, ribosomal subunit, etc. These are completely different from those in the chemostat. The laws determining which genes control growth rate may change according to the selective conditions. The discovery of haploproficiency in nutrient-unconstrained conditions suggests that yeast has sacrificed short-term gain in favor of long-term sufficiency.
Metabolic model: list of all metabolic reactions known for S. cerevisiae, taken from databases and primary papers. Approx 1174 reactions and 584 metabolites. Then they chose a selection of knockouts. Synthetic lethality can come from redundant gene duplicates or alternative cellular pathways - both need to be interrupted to see the phenotype. How to find synthetic lethal gene pairs? Global mapping: screen all possible double gene deletion strains of non-essential genes. Problems: huge number of gene combinations, with about 4% completed in the first 3 years; and interactions are rare, with only 0.6% of pairs showing synthetic lethality in yeast.
Can they use computational tools to guide them? The tool: flux balance analysis (FBA): reconstruct the metabolic network, define the nutrient environment and constrain for optimal biomass, etc. The yeast model predicts essential genes with 68-80% success for single-gene deletions.
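As a rough illustration of how FBA can flag essential genes and synthetic lethal pairs, here is a small sketch of my own, not the yeast model from the talk. It assumes a made-up four-reaction network (uptake, two alternative routes from A to B, and a biomass drain), with invented flux bounds, and uses `scipy.optimize.linprog` to maximize the biomass flux under the steady-state constraint S v = 0. A "deletion" is modelled simply by forcing that reaction's flux to zero.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network: -> A, A -> B (two alternative routes), B -> biomass.
REACTIONS = ["uptake", "route1", "route2", "biomass"]
S = np.array([
    [1, -1, -1, 0],   # metabolite A: made by uptake, used by both routes
    [0, 1, 1, -1],    # metabolite B: made by both routes, drained by biomass
])
# Assumed flux bounds (arbitrary units); None means no upper limit.
BOUNDS = {"uptake": (0, 10), "route1": (0, 10), "route2": (0, 5), "biomass": (0, None)}


def max_biomass(deleted=()):
    """Maximal biomass flux at steady state, with the given reactions knocked out."""
    bounds = [(0, 0) if name in deleted else BOUNDS[name] for name in REACTIONS]
    objective = np.zeros(len(REACTIONS))
    objective[REACTIONS.index("biomass")] = -1.0  # linprog minimizes, so negate
    result = linprog(objective, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
    return -result.fun if result.success else 0.0


wild_type = max_biomass()
without_route1 = max_biomass(["route1"])
without_route2 = max_biomass(["route2"])
without_both = max_biomass(["route1", "route2"])
print(wild_type, without_route1, without_route2, without_both)
```

In this toy network each single deletion still allows growth, while the double deletion does not, which is the signature of a synthetic lethal pair arising from alternative pathways.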
Paul Boutros
Morning Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
Microarray data is analyzed with a pipeline of algorithms to remove non-biological sources of variability. There are 133,056 analysis methods in Bioconductor. Which steps of the pipeline are important? Which algorithms work the best? What are the interactions, if any? Consider all possible analysis methods. Evaluate performance vs. gold-standard. Use a linear modeling approach: calculate a metric (e.g. AUC), fit an ANOVA, calculate percent variance explained (PVE). The PVE tells you how much of the variability is due to that step or set of variables you're considering.
Which steps of the pipeline matter? Only two steps accounted for 93% of the variability: WN (within-array normalization) and DF (differential testing). Do algorithms interact? The default assumption is that there are no interactions. They used a 2nd-order ANOVA. There are large interactions: WN and BN (between-array normalization) greatly antagonize each other.
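Here is a minimal sketch of the PVE idea, written by me rather than taken from the talk. It assumes a tiny balanced design with one AUC value per combination of two pipeline steps; the AUC numbers are invented purely for illustration, and the function name is hypothetical. The main-effects ANOVA is done by hand: each step's sum of squares is divided by the total sum of squares.

```python
def percent_variance_explained(table, row_name, col_name):
    """PVE for a balanced two-factor design with one value per cell.

    `table[i][j]` is the metric (e.g. AUC) for option i of the row step and
    option j of the column step. Returns percentages that sum to 100.
    """
    n_rows, n_cols = len(table), len(table[0])
    values = [v for row in table for v in row]
    grand_mean = sum(values) / len(values)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]

    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_rows = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand_mean) ** 2 for m in col_means)
    ss_residual = ss_total - ss_rows - ss_cols  # interaction plus noise

    parts = {row_name: ss_rows, col_name: ss_cols, "residual": ss_residual}
    return {name: 100.0 * ss / ss_total for name, ss in parts.items()}


# Made-up AUCs: rows are two WN options, columns are three DF options.
auc = [
    [0.60, 0.70, 0.80],
    [0.62, 0.74, 0.82],
]
pve = percent_variance_explained(auc, "WN", "DF")
for step, value in pve.items():
    print(f"{step}: {value:.1f}%")
```

With these invented numbers the choice of DF method dominates, simply because the columns differ far more than the rows. The residual term here lumps together interaction and noise; separating them would need replicates or a second-order model.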
They had 3 different test datasets. The results are the averages of the performances across those 3 datasets. 90% of the variability was accounted for by your choice of differential testing (DF) at the end. The rest of it was mostly to do with BG. RankProd algorithm performed best. Large interactions, as before: one example was NM (normalization method) and method for perfect-match mismatch adjustment (PM).
Which steps are important? Not all of them. Differential expression seems to be the driving factor (DF). Which worked best? Surprising result with minimal dataset dependence. Are there interactions among steps? Yes, so new algorithms should be evaluated "in context" instead of a static pipeline.
Peter Murakami
Morning Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
They have created an algorithm with which to evaluate microarray data quality. Good quality assessment (QA) metrics can tell us to remove an array. This study makes use of a dataset of 5954 unique arrays from 167 studies from both GEO and ArrayExpress. Some quality metrics in common use when looking at the original image: average background, scale factor, and % genes called present by Affy's detection algorithm. RNA degradation metrics are also used.
They need arrays known to be of low quality by which to judge the success of the quality metrics. These were ones where, for instance, the RNA had been deliberately degraded in the lab, or dog RNA had been used rather than human. The arrays were identified using CAT and hierarchical cluster analysis using Euclidean distance based on gene expression estimates.
...20,000+ assays in ArrayExpress
Misha Kapushesky
Morning Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
Can context-specificity be uncovered and described from available high-throughput gene expression data? They used data from 9 species, 600+ experiments, and with 2100+ different conditions and 20000+ assays. Do independent gene expression studies support each other? They build linear models to test for differential expression for disease comparison and organ comparison. Information is presented in the ArrayExpress Atlas.
Rainer Spang
Plenary Talk, Morning Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
Modeling conditional independence: the idea that the transcription level of TFs drives the expression of other genes is oversimplistic, but it is the most easily converted into a statistical model: you model the conditional independence structure of the expression data as a graph. You're explaining the expression level of certain genes with the expression level of other genes, which isn't really that biologically exact.
Model identification: different graphical models encode for very similar covariance structures, and some models even for identical ones. You need 1000s of microarrays to identify a network reliably (example provided had 2000 arrays). Why do you need so many? You can easily generate graphical models, but the data you get are very similar - this is why you need lots of arrays.
Given many models for a biological network: in order to decide which one is best supported by data, the models must generate sufficiently-different data. If two models generate similar data, their biological interpretation must be similar, too. The model space must adapt to the limited complexity of the data and not to the high complexity of biology: it must be sufficiently coarse.
A coarse model must ignore aspects of biological complexity (George Box: "All models are wrong, some are useful"). Which aspects should be ignored and which should be modeled? This depends on the type of data and the motivation of why you model a network. They want to analyze how the flow of information in molecular signalling pathways is disturbed in human tumors, and use this information for novel molecular classifications of tumors.
You cannot get things like a phosphorylation step or a dephosphorylation step in the microarrays directly. For instance, we don't know that signalling is related to cancer via microarrays, but via mutations, which can introduce constitutive active signals or block the signal flow. These mutations may yield changes in gene expression profiles downstream. So how can we get back to the actual cause from the gene expression data? You can mimic loss-of-function mutations using RNAi experiments.
Nested effects models (NEMs). Negative controls (C-), positive controls (C+), Interventions in S-Genes (RNAi), Observations in E-Genes (microarray). The silencing effect is when an e-gene goes from a C+ level back to a C- level. What do the data generated by the model look like, given a certain structure of a linear cascade? Model assumptions: the core model is transitive, every e-gene is connected to exactly one s-gene, and there is independent binary noise.
Scoring observed silencing effects. The silencing scheme allows prediction of e-gene states (when the position is known). We expect a number of false positive and false negative observations. The likelihood is based on these. They then get a likelihood for each core model, and then find the maximum likelihood model. The model search space is large, and gets very large after 8 or 9 genes. One option is to fit a model for every pair of genes, and pick the best of the four possible models. The problems with this: it works fairly well but not really well, and it loses transitivity (only 2 genes). Alternatively, you can do triplets - usually you are much closer to transitivity.
How well does this work on simulated data? Pretty well: even when the number of genes goes up to 32, there is still 90% precision with 5 replicates for triples (only about 82% for pairs).
Limitations: hardly any statistical theory on uncertainties, unstable wrt the included s-genes, and unstable wrt data discretization (but this one can be fixed). Applications: NEMs can be used to derive a signalling hierarchy, and a signalling-consistent clustering, and might be used to classify tumors.
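To illustrate the pairwise step, here is a rough sketch of my own; it is not the speaker's implementation. It assumes binary silencing data for two s-genes A and B, made-up false positive and false negative rates (alpha and beta), and, because the position of each e-gene is unknown, it averages over the two possible attachments with equal weight (that averaging is my assumption). Each of the four candidate models says which e-genes should be silenced by each knockdown.

```python
from math import log

# For each model: knockdown -> set of s-genes whose attached e-genes are silenced.
MODELS = {
    "A..B": {"A": {"A"}, "B": {"B"}},             # unconnected
    "A->B": {"A": {"A", "B"}, "B": {"B"}},        # A upstream of B
    "B->A": {"A": {"A"}, "B": {"A", "B"}},        # B upstream of A
    "A<->B": {"A": {"A", "B"}, "B": {"A", "B"}},  # indistinguishable
}


def log_likelihood(data, model, alpha=0.1, beta=0.1):
    """Log-likelihood of observed silencing effects under one pair model.

    `data` is a list of dicts, one per e-gene, mapping knockdown ("A"/"B")
    to 1 (effect seen) or 0 (no effect). alpha and beta are assumed false
    positive and false negative rates.
    """
    total = 0.0
    for effects in data:
        marginal = 0.0
        for parent in ("A", "B"):  # unknown e-gene position, uniform prior
            prob = 0.5
            for knockdown, observed in effects.items():
                if parent in model[knockdown]:   # effect predicted
                    prob *= (1 - beta) if observed else beta
                else:                            # no effect predicted
                    prob *= alpha if observed else (1 - alpha)
            marginal += prob
        total += log(marginal)
    return total


# Invented data: five e-genes respond only to the A knockdown, five to both.
data = [{"A": 1, "B": 0}] * 5 + [{"A": 1, "B": 1}] * 5
scores = {name: log_likelihood(data, model) for name, model in MODELS.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

With this invented data the subset pattern (effects of B nested inside effects of A) favours A upstream of B. Because each pair is scored in isolation, nothing forces the chosen pair models to be transitive when combined, which matches the limitation mentioned above.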