7 posts tagged “gene expression”
..of Saccharomyces cerevisiae
Gavin Sherlock
Afternoon Session, 1 September (11th MGED Meeting, 1-4 September, 2008)
The population structure in the presence of clonal interference is markedly different from that in a classic model. They needed a system to follow the dynamics of subpopulations within a population, and they use FACS and different-colored fluorescent cells to do this. They grow the cells in a chemostat, which is seeded with equal numbers of each of the 3 color types, and then measure the proportion of each color in the population over time. The experiment has been done 8 different times. One run, for example, shows expansions and contractions followed by one color becoming the majority. Fixation of a color is not necessarily indicative of fixation of a single adaptive event, as there can be multiple adaptive clones with the same color within a population.
Using yeast tiling microarrays, they can identify the locations of nucleotide differences between the evolved and parent strains, and then sequence the candidate mutations. One of the mutations they found (in cox18), called Red 266, was discovered via decreased hybridization compared to the parent. Another example showed a comparatively higher level of hybridization. Mutation history can help determine which strains descend from which earlier strains that carry mutations of their own.
Clonal interference is important in adaptive evolution of yeast. Specifically, glucose transport and signalling through the Ras pathway were both affected. In future, they wish to directly determine which mutations are adaptive, find out how general these adaptations are, and discover the effects of the adaptive mutations. Finally, what fraction of the adaptive landscape have we explored? Will it be the same or different in the other 7 experiments?
These are just my notes and are not guaranteed to be correct.
Please feel free to let me know about any errors, which are all my
fault and not the fault of the speaker. :)
...by a Novel Computational Framework
Alessandro Coppe
Morning Session 1 September (11th MGED Meeting, 1-4 September, 2008)
Genes that are similarly expressed and/or co-regulated may share common regulatory elements in their regulatory regions. Motifs recognized by a specific regulatory protein can differ slightly from one gene to another. They were investigating the role of promoter similarity and gene clustering in establishing co-expression of genes. They used hematopoiesis as a biological model for studying regulation of gene expression in cellular differentiation. In the starting dataset, there were 9716 genes for which genomic localization, promoter sequences, and expression data in myeloid cells were available. They then searched for CERs (co-expressed chromosomal regions), which are groups of adjacent genes co-expressed during myelopoiesis. They ended up with 26 CERs. 15 CERs were grouped into 2 co-expressed chromosomal meta-regions (CEMRs) by QT clustering.
Requirements of the custom motif discovery framework: it should find motifs in a selected set of promoter sequences, not simply motifs that are highly over-represented with respect to random sequences. CEG, CER and CEMR are the groups of genes used for the motif discovery analysis. 5325 significantly over-represented motifs (FDR < 0.05) were identified in 59 of the 72 considered groups of gene promoters. 19% of the motifs are similar to at least one of the 142 known TFBS. More common motifs were found in the promoters of co-localized and co-expressed genes than in those of simply co-expressed genes.
Geoffrey Faulkner
Morning Session 1 September (11th MGED Meeting, 1-4 September, 2008)
Transposable Elements (TEs) include LINEs, SINEs, LTRs, and DNA transposons. TEs were first characterized in maize, where they were thought to be regulators of nearby genes via an unknown mechanism. The discoveries of regulatory ncRNAs and of active promoters in TEs have helped. It's difficult to detect genome-wide TE transcription. They use CAGE, mentioned in the first talk of the day. CAGE detects transcription start sites (TSSs), and reliably detects TE promoters. 80% of CAGE tags mapping to a repetitive element are unique on the genome. TE promoters are sharp, in that there is a single dominant transcription start site. TE promoters were more than twice as likely to be tissue-specific as other promoters (40% rather than 17%). TE promoters were enriched near protein-coding genes. TEs are known to provide alternative promoters to nearby genes. More than 700 of the ones he studied were confirmed as such alternative promoters (he worked with the FANTOM3 and FANTOM4 mouse libraries for the CAGE data). Also, ncRNAs can be derived from TEs and could produce "anti-silencing" or "transcriptional interference", but are more likely to provide the former than the latter. TEs provide 1000s of functional elements to the genome, even though they're not usually very well conserved. They contain an interesting subclass of promoters, are enriched near protein-coding genes, and provide alternative promoters to nearby genes.
Shirley Liu, Dana-Farber Cancer Institute, Harvard School of Public Health
Morning Session 1 September (11th MGED Meeting, 1-4 September, 2008)
She describes a peak-finding algorithm called MACS (Model-based Analysis of ChIPSeq). If you look at the tags, they usually lie to the left or the right of the actual binding site. In order to know where that site is, you have to shift each tag from its mapped position. Most people don't know how much to shift it by. They use the peaks with the most confidence to calculate the shift size. The mode of this shift size was smaller than expected. This could be because the sequencer prefers the shorter fragments among the whole population you give it. Alternatively, it could be that the binding site sits right in the middle, with hypersensitive regions on either side of it where the breaks occur instead. So you shift according to the tag size (about half of the size). The tag distribution along the genome should follow a Poisson distribution. However, ChIPSeq shows local biases in the genome, thought to be due to both chromatin and sequencing bias.
In a 300bp region of a control, there will only be 1-3 tags, which is simply too few to give a good estimate. So, rather than just looking at the bases at the binding site, they also look at 1kb, 5kb, and 10kb regions around it (local lambda). If a global measure is used instead of the local one, the results don't come out very well. With MACS, you get a higher motif occurrence in peak centers, and an improved spatial resolution. If FDR (False Discovery Rate) = control peaks / ChIP peaks, then with the MACS method the FDR is 0.4%, while with no control and using only the background lambda, it's 41% (much worse).
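The shifting step can be sketched roughly as follows. This is only an illustration of the idea and not the MACS implementation: the function name `shift_tags`, the `(position, strand)` tuple layout, and the example size `d = 100` are assumptions made up for this sketch.

```python
def shift_tags(tags, d):
    """Shift each tag by d/2 toward the presumed binding site.

    tags: iterable of (position, strand), strand being '+' or '-'
    d:    estimated size used for the shift (hypothetical value chosen by the caller)
    """
    half = d // 2
    shifted = []
    for pos, strand in tags:
        if strand == '+':
            shifted.append(pos + half)   # plus-strand tags sit to the left of the site
        elif strand == '-':
            shifted.append(pos - half)   # minus-strand tags sit to the right of the site
        else:
            raise ValueError(f"unknown strand: {strand!r}")
    return shifted


# Made-up tags flanking a site near position 1000.
example_tags = [(950, '+'), (955, '+'), (1050, '-'), (1045, '-')]
shifted_positions = shift_tags(example_tags, d=100)
```

After shifting, tags from both strands pile up around the same position, which is what makes the peak sharper than the two separate strand-specific piles.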
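A rough sketch of the local lambda idea and of the FDR ratio given above. The function names and the example numbers are invented for illustration. Taking the maximum of the candidate rates is my understanding of how MACS combines them; the talk notes only say that several window sizes are considered, so treat that choice as an assumption.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation (fine for small lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)        # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i          # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def local_lambda(window_bp, genome_rate, control_counts):
    """Expected tag count in a candidate window.

    genome_rate:    background tags per bp over the whole genome
    control_counts: {region_size_bp: control tags in that region around the window}
    Returns the largest estimate (assumed combination rule, see lead-in).
    """
    estimates = [genome_rate * window_bp]
    for size_bp, count in control_counts.items():
        estimates.append(count / size_bp * window_bp)
    return max(estimates)


def empirical_fdr(control_peaks, chip_peaks):
    """FDR as defined in the notes: control peaks / ChIP peaks."""
    return control_peaks / chip_peaks if chip_peaks else float("nan")


# Made-up numbers: a 300 bp window with 12 ChIP tags.
lam = local_lambda(300, 0.001, {1000: 5, 5000: 10, 10000: 12})
p_value = poisson_sf(12, lam)
```

Using the largest rate makes the test conservative in regions where the control shows a local pile-up of tags, which is where a genome-wide rate alone would call false peaks.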
You shouldn't use a random sampling of tags to give you your FDR, as the distribution is not random. Further, there can be a problem with unbalanced tags. If you have two channels, whichever channel has more tags will give you more peaks, even after normalization.
They worked with some data for nucleosome positioning in humans. They extended the original tags for each nucleosome and checked the tag count across the genome. They also de-noised the nucleosome data using Coiflet wavelet de-noising. This is done by decomposing the original signal in steps, and then removing wavelets with high frequency and small coefficients (the noisy ones). They also removed peaks with unbalanced tags.
ChIPSeq may be ineffective at mapping inactive histone marks. The percentage of tags located at identified nucleosomes is much higher for active histone marks. The percentage of isolated tags is much higher for inactive histone marks. Active marks tend to bind to sharper (more localized) regions than the inactive ones. Differentiation impairs ChIPSeq efficiency for inactive marks but not for active marks. Also, closed chromatin is harder to sonicate, so the resulting fragments are larger, and ChIPSeq library construction is biased towards shorter fragments.
Is there a nucleosome sequence preference in humans? There isn't as much as expected. You need to compare in vivo with in vitro nucleosome sequencing. 10bp periodicity is observed in vitro. However, there isn't this periodicity in vivo. So, people are doing better at predicting in vitro nucleosomes than in vivo nucleosomes. They extended the tag by 146 bp for the nucleosome profile, took the middle 73 bp, and then calculated the correlation coefficient. Only 10% of the in vitro and in vivo overlap. Only 50% of in vitro and in vivo data agree with each other. There are definitely intrinsic sequence features for nucleosomes, but they don't predict in vivo nucleosome patterns very well.
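The profile comparison described above might look something like this. It is a sketch under several assumptions that are mine rather than the speaker's: each tag is treated as a plus-strand start covering 146 bp, the central 73 bp is counted as coverage, and a plain Pearson coefficient is used. The function names are hypothetical.

```python
import math


def nucleosome_profile(tag_starts, length, extend=146, core=73):
    """Per-base coverage built from the central `core` bp of each extended tag."""
    offset = (extend - core) // 2            # bases trimmed from the 5' side
    profile = [0] * length
    for start in tag_starts:
        lo = max(0, start + offset)
        hi = min(length, start + offset + core)
        for i in range(lo, hi):
            profile[i] += 1
    return profile


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")                  # a flat profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)


# Made-up tag starts for two data sets over the same 1 kb region.
in_vitro = nucleosome_profile([100, 105, 400], length=1000)
in_vivo = nucleosome_profile([110, 420, 700], length=1000)
r = pearson(in_vitro, in_vivo)
```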
Barbara Wold, California Institute of Technology
Plenary Lecture, Morning Session 1 September (11th MGED Meeting, 1-4 September, 2008)
A direct way to characterize a transcriptome is to sequence a cDNA copy of the entire transcriptome and then calculate the density of reads mapping to any given locus in the genome. Ultra-high-throughput sequencing platforms have made it practical to do this genome-wide. They have done this for mouse mRNA and human tissues and cell lines at levels ranging from 20 to 100M reads per transcriptome. These RNASeq experiments detect RNA splice patterns, including alternative splicing events, by identifying sequence reads that cross known and theoretical splice junctions.
The methods of the previous speaker tell you where specific parts are (starts, ends). The RNASeq technique is a broader way of looking at the transcriptome, and is a more brute-force method. However, you can still get really important data out of it. The main purpose of RNASeq is to be able to quantify RNAs, both relatively and absolutely. RNASeq is good at absolute numbers. It can also do transcript discovery and mapping, including revising gene models, splice isoforms, and RNA editing. Even in "boring" tissues like mouse liver or total mouse brain, you still come up with some robust newly-discovered transcripts using this technique that aren't quantitatively minor. There are limits to this technique, e.g. they're doing the work against known sets of genes (although they'd like to do the work de novo). They're happy to provide data to help with this. A final function is in genetics, specifically expressed SNPs and private mutations that wouldn't normally appear on SNP arrays.
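As a small illustration of "density of reads mapping to a locus", here is one common way to normalise a raw read count. The notes only say "density", so scaling by locus length and by library size (reads per kilobase per million mapped reads) is an assumption on my part, as are the function name and the example numbers.

```python
def reads_per_kb_per_million(read_count, locus_length_bp, total_mapped_reads):
    """Read density for one locus, scaled by locus length and sequencing depth."""
    if locus_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("locus length and total mapped reads must be positive")
    per_kb = read_count / (locus_length_bp / 1_000)      # correct for locus length
    return per_kb / (total_mapped_reads / 1_000_000)     # correct for library size


# Made-up example: 500 reads on a 2 kb locus in a 20M-read transcriptome.
density = reads_per_kb_per_million(500, 2_000, 20_000_000)
```

Scaling by both terms is what lets a long, weakly expressed gene be compared with a short, strongly expressed one, and one library be compared with another of different depth.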
Two features of the data: when they do comparisons of technical replicates, they correlate very nicely. Biological replication can then really be about the biology. Secondly, the map of the RNA transcripts had a very nice linear shape on a log scale.
Should you look at reads that can map equally well to multiple sites? They looked at 25mer reads in mammalian genomes, considering those that can map equally well to 2-10 sites, inclusive, as well as the unique reads. 80% of the genome could be mapped uniquely, with 6% mapping to between 2 and 10 sites, and 14% to more than 10. In the myoblast transcriptome, the fraction that maps uniquely is smaller (69%), and this is something that happens generally. This is because there's lots of gene paralogy, and you'll get things that map to several places due to a (recent or old) duplication. So if you ignore these multi-read sequences, you risk missing important stuff entirely.
What are the kinds of genes that are multiread sensitive? Their example is an actin gene (EL4r1). If you just map unique reads, you miss everything that is in the exons, and the gene would therefore appear NOT to be expressed. RNASeq is really good at detecting alternative splicing. Really rare alternative splicing events may just be random events that are not intended, but which the system can tolerate, and this should be taken into account.
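The three mapping classes used above (unique, 2-10 sites, more than 10 sites) can be tallied with something like the following. The function names, the class labels, and the input format (one integer per read giving the number of equally good mapping sites) are all made up for this sketch.

```python
from collections import Counter

CLASSES = ("unique", "multi_2_to_10", "multi_over_10")


def classify_read(n_sites, max_multi=10):
    """Assign a read to a mapping class from its number of equally good sites."""
    if n_sites < 1:
        raise ValueError("a mapped read must have at least one site")
    if n_sites == 1:
        return "unique"
    if n_sites <= max_multi:
        return "multi_2_to_10"
    return "multi_over_10"


def mapping_fractions(site_counts):
    """Fraction of reads falling in each mapping class."""
    tally = Counter(classify_read(n) for n in site_counts)
    total = sum(tally.values())
    if total == 0:
        return {name: 0.0 for name in CLASSES}
    return {name: tally.get(name, 0) / total for name in CLASSES}


# Made-up example: ten reads, eight of them unique.
fractions = mapping_fractions([1] * 8 + [3] + [50])
```

Keeping the 2-10 class separate, rather than discarding everything that is not unique, is what allows reads from paralogous genes to be counted at all.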
They have discovered some candidate new genes: 161 in the brain, 95 in the muscle, and 77 in the liver, and some of these are overlaps between the three.
You need to include multireads to detect some true positives in ChIP: 5-10% of sites in the interactome are affected. Can you ID by ChIP essentially all sites predicted by FUNCTION assays? Yes, but strongly conditioned on good antibodies and good cells. Do you expect detectable function at every site with significant & reproducible in vivo occupancy? No: more data is needed, e.g. long range cis-interaction in big genomes, 3-C signals in their data etc. Is there significant ChIP at all instances of a high consensus motif match? For MyoD and Myogenin, NO!, as there are >1 million perfect motifs in the genome. Yes for big, well-specified motifs (NRSF), and the meaning of binding seen at some 1/2 of the sites is unclear.
(Chromatin Immunoprecipitation: ChIP.)
Dynamics and Complexity of the Coding and Non-Coding Transcriptome
Piero Carninci, RIKEN Omics Science Center
Keynote Lecture, Morning Session 1 September (11th MGED Meeting, 1-4 September, 2008)
They've been mapping the expressed part of the genome, aka the transcriptome. This will help us understand the genome output. There are many different issues with RNAs that are retrieved via standard methods: there are many different transcripts from a single gene, and different promoters. Each promoter will have a different level of activity. They use Cap Analysis Gene Expression (CAGE). They are using MAGE-TAB and SDRF formats to store their data.
In the 1990s, people thought there were 70,000-100,000 protein-coding genes. Today, we expect that there are only about 20,000. Instead, there is a lot of complexity: post-translational modifications, many overlapping transcripts, multiple promoters, etc.
But what are the long non-coding RNAs (ncRNAs) doing? They are long stretches of sequence that are not conserved. However, their promoter sequences are often conserved. Perhaps the mechanisms of their action do not require long stretches of conservation in the gene. Most of the unknown RNA is polyA minus and nuclear. A large proportion of the long RNAs are cleaved (producing short RNAs that are often conserved). These derived short RNAs map to the 5' ends (PASRs) and 3' ends (TASRs) of genes. Therefore the whole transcript is not conserved because it doesn't need to be: only those bits that are cleaved and used later on need to be conserved. Essentially, this means a large number of RNAs from an individual locus.
PolyA- CAGE is mostly nuclear, overlapping introns and TSSs, while cytoplasmic CAGE is more on exons. 3' untranslated regions (UTRs) are also interesting: they start from a conserved promoter which has a conserved GGG section. There is also RNA that starts from the middle of a gene. It is more prevalent with TATA-box, sharp promoters. Mouse and human have similar start sites. There are also antisense RNAs. Most transcriptional units (72%) show antisense transcription. Are the sense-antisense RNAs co-expressed? Is there dynamic regulation? If you perturb antisense RNA, the sense will be overexpressed. It also seems that sense and antisense RNA aren't transcribed at the same time, and that they might take turns (this is my impression from the slides, rather than something he said exactly). Sometimes sense-antisense pairs work in the cytoplasm (with the production of natural siRNA). One example is the beta-secretase-1 antisense, which increases the sense RNA (feed-forward loop), and which is important in Alzheimer's.
You can even get RNA expression from repeats. Repeat elements can produce short RNA, like natural siRNA. They have identified that 10-35% of the transcripts correspond to repeat elements. Surprisingly, they have dynamic tissue-specific behaviour / patterns. There is overrepresentation of repeats in the nucleus among polyA- RNAs, and there is compartment specificity.
There is a lot of promoter plasticity. A switch to PyPu will increase transcription, while the reverse decreases it. They're having a look at preferentially-expressed promoters (PEPs). These are promoters that have >30 tags and are statistically significant. The distribution of PEPs in brain tissues: genes that have multiple-tissue-expressed PEPs. Different PEPs drive functional variability of the proteome. PEPs create more proteome diversity. They make use of THP-1 cells as a model cell. 46% of genes in THP-1 have alternative promoters. Of these, 18,245 are high-confidence promoters. 1909 of these are newly-discovered. CAGE identifies the active set of promoters, and more precisely defines the TSS position.
CAGE is not dependent on microarray design, and measures expression including ncRNAs. They have some bioinformatics tools freely available for the CAGE protocol, and have tried to simplify the CAGE protocol. Please contact him if you would like help in making your own CAGE library.
Welcome and Introduction: Chris Stoeckert and Cesare Furlanello
Chris began with a brief introduction to MGED (the Microarray and Gene Expression Data Society), which began in 1999. These days, however, microarrays are only one of the technologies that they're interested in. They've now broadened the scope of MGED into areas other than just gene expression. The purpose of meetings like this is to showcase cutting-edge work and promote standards efforts.
The society is composed of biologists, computer scientists, and data analysts. The goals of MGED include keeping up with advances in technology, providing and promoting software that uses standards, coordinating with other standards groups, and continuing to provide exciting and useful meetings. For more information, see the MGED website.
The local organiser of this meeting is the Bruno Kessler Foundation (FBK), which is a private research organization.