Shirley Liu, Dana-Farber Cancer Institute, Harvard School of Public Health
Morning Session 1 September (11th MGED Meeting, 1-4 September, 2008)
She describes a peak-finding algorithm called MACS (Model-based Analysis of ChIPSeq). If you look at the tags, they usually lie to the left or the right of the actual binding site. To find where that site is, you have to shift each tag away from its mapped position, but most people don't know how much to shift it by. They use the peaks with the most confidence to calculate the shift size. The mode of this shift size was smaller than expected. This could be because the sequencer prefers the shorter fragments among the whole population you give it. Alternatively, it could be that there is a binding site right in the middle, and perhaps there are hypersensitive regions on either side of the binding site where the breaks occur instead. So you shift according to the tag size (by about half of the size). The tag distribution along the genome should follow a Poisson distribution. However, ChIPSeq shows local biases in the genome, thought to be due to both chromatin and sequencing bias.
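The shifting step can be sketched roughly as follows. This is only a minimal illustration of the idea in these notes, not the actual MACS code: the function names `estimate_d` and `shift_tags`, the `(position, strand)` tag layout, and the assumption that the summits of each high-confidence peak are already paired up are all mine.

```python
from collections import Counter

def estimate_d(plus_summits, minus_summits):
    """Estimate the distance d between paired + and - strand summits.

    plus_summits[i] and minus_summits[i] are assumed (hypothetically) to
    come from the same high-confidence peak. The most common distance,
    i.e. the mode, is returned.
    """
    distances = [m - p for p, m in zip(plus_summits, minus_summits) if m > p]
    if not distances:
        raise ValueError("no usable peak pairs")
    return Counter(distances).most_common(1)[0][0]

def shift_tags(tags, d):
    """Shift every tag by d // 2 toward its 3' end.

    tags is a list of (position, strand) tuples, strand being '+' or '-'.
    """
    half = d // 2
    return [pos + half if strand == "+" else pos - half for pos, strand in tags]
```

With `d = 100`, a `+` tag at 100 and a `-` tag at 200 both end up at position 150, which is where the binding site would be assumed to sit.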
In a 300 bp region of a control there will only be 1-3 tags, which is simply too few to be good enough. So, rather than just looking at the bases at the binding site, they also look at 1 kb, 5 kb, and 10 kb regions around it (local lambda). If a global measure is used instead of the local one, the results don't come out very well. With MACS, you get a higher motif occurrence in peak centers and an improved spatial resolution. If FDR (False Discovery Rate) = control peaks / ChIP peaks, then with the MACS method the FDR is 0.4%, while with no control and only the background lambda it's 41% (much worse).
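A rough sketch of the local lambda idea and the FDR ratio above, using only the standard library. Taking the maximum over the background and the scaled window counts is my reading of how the windows are combined; the function names and the dictionary layout for the control counts are assumptions, not anything shown in the talk.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - CDF(k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def local_lambda(lam_bg, control_counts, peak_width):
    """Expected tags per peak_width, taking the largest available estimate.

    control_counts maps a window size in bp (e.g. 1000, 5000, 10000) to
    the number of control tags in that window around the candidate peak.
    Each count is rescaled to the peak width before comparison.
    """
    scaled = [count * peak_width / window
              for window, count in control_counts.items()]
    return max([lam_bg] + scaled)

def empirical_fdr(n_control_peaks, n_chip_peaks):
    """FDR as defined in the notes: control peaks / ChIP peaks."""
    return n_control_peaks / n_chip_peaks
```

A candidate peak with `k` ChIP tags would then be scored with `poisson_sf(k, local_lambda(...))`, so a region where the control is locally enriched needs more tags to be called.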
You shouldn't use a random sampling of tags to give you your FDR, as the distribution is not random. Further, there can be a problem with unbalanced tags. If you have two channels, whichever channel has more tags will give you more peaks, even after normalization.
They worked with some data for nucleosome positioning in humans. They extended the original tags for each nucleosome and checked the tag count across the genome. They also de-noised the nucleosome data using Coiflet wavelet de-noising. This is done by decomposing the original signal in steps and then removing the wavelets with high frequency and small coefficients (the noisy ones). They also removed peaks with unbalanced tags.
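To illustrate the decompose, threshold, and rebuild steps, here is a small sketch that uses the Haar wavelet as a stand-in, because it fits in a few lines of plain Python. The talk used a Coiflet wavelet, which has a longer filter; the function name, the hard threshold, and the choice to threshold every detail level are my own simplifications.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_denoise(signal, levels, threshold):
    """Denoise by zeroing small Haar detail coefficients.

    The signal length must be divisible by 2 ** levels. Detail
    coefficients (the high-frequency part at each level) whose absolute
    value is below threshold are set to zero before reconstruction.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2 ** levels")
    approx = list(signal)
    details = []
    # Decompose step by step: each level halves the approximation.
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    # Drop the small (noisy) coefficients.
    details = [[c if abs(c) >= threshold else 0.0 for c in level]
               for level in details]
    # Rebuild from the coarsest level back to the original resolution.
    for level in reversed(details):
        rebuilt = []
        for s, c in zip(approx, level):
            rebuilt.append((s + c) / SQRT2)
            rebuilt.append((s - c) / SQRT2)
        approx = rebuilt
    return approx
```

With a threshold of zero the signal comes back unchanged (up to rounding); raising the threshold smooths out small wiggles while keeping large steps in the tag count.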
ChIPSeq may be ineffective at mapping inactive histone marks. The percentage of tags located at identified nucleosomes is much higher for active histone marks. The percentage of isolated tags is much higher for inactive histone marks. Active marks tend to bind to sharper (more localized) regions than the inactive ones. Differentiation impairs the ChIPSeq efficiency of inactive marks but not active marks. Also, closed chromatin is harder to sonicate, so the resulting fragments are larger. ChIPSeq library construction is biased toward shorter fragments.
Is there a nucleosome sequence preference in humans? There isn't as much as expected. You need to compare in vivo with in vitro nucleosome sequencing. A 10 bp periodicity is observed in vitro. However, this periodicity isn't there in vivo. So, people are doing better at predicting in vitro nucleosomes than in vivo nucleosomes. So, they extended each tag by 146 bp for the nucleosome profile, took the middle 73 bp, and then calculated the correlation coefficient. Only 10% of the in vitro and in vivo overlap. Only 50% of in vitro and in vivo data agree with each other. There are definitely intrinsic sequence features for nucleosomes, but they don't predict in vivo nucleosome patterns very well.
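The profile comparison could look something like the sketch below. It reflects my reading of the notes: each tag is extended to 146 bp in its strand direction, only the central 73 bp are counted, and the two resulting profiles are compared with a Pearson correlation. The function names and the `(position, strand)` tag layout are assumptions.

```python
import math

def nucleosome_profile(tags, genome_length, extend=146, keep=73):
    """Per-base count of the central `keep` bp of each extended tag.

    tags is a list of (position, strand); position is the 5' end of the
    tag and strand is '+' or '-'.
    """
    profile = [0] * genome_length
    offset = (extend - keep) // 2  # bases skipped next to the 5' end
    for pos, strand in tags:
        if strand == "+":
            start = pos + offset
        else:
            start = pos - offset - keep + 1
        for i in range(max(start, 0), min(start + keep, genome_length)):
            profile[i] += 1
    return profile

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(nucleosome_profile(in_vitro_tags, n), nucleosome_profile(in_vivo_tags, n))` on the two tag sets would give one number summarizing how well the maps agree.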
These are just my notes and are not guaranteed to be correct.
Please feel free to let me know about any errors, which are all my
fault and not the fault of the speaker. :)
Barbara Wold, California Institute of Technology
Plenary Lecture, Morning Session 1 September (11th MGED Meeting, 1-4 September, 2008)
A direct way to characterize a transcriptome is to sequence a cDNA copy of the entire transcriptome and then calculate the density of reads mapping to any given locus in the genome. Ultra-high-throughput sequencing platforms have made it practical to do this genome-wide. They have done this for mouse mRNA and human tissues and cell lines at levels ranging from 20 to 100M reads per transcriptome. These RNASeq experiments detect RNA splice patterns, including alternative splicing events, by identifying sequence reads that cross known and theoretical splice junctions.
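A read density of this kind is often normalized for locus length and sequencing depth. The notes don't say which normalization the speaker used, so the sketch below just shows one common choice, reads per kilobase per million mapped reads; the function name is hypothetical.

```python
def reads_per_kb_per_million(reads_in_locus, locus_length_bp, total_mapped_reads):
    """Read density normalized by locus length (kb) and library size (millions)."""
    if locus_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return reads_in_locus * 1e9 / (locus_length_bp * total_mapped_reads)
```

For example, 500 reads on a 2 kb locus in a 20M-read library works out to a density of 12.5, and the same locus in a library twice as deep would need 1000 reads to reach that value.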
The methods of the previous speaker tell you where specific parts are (starts, ends). The RNASeq technique is a broader, more brute-force way of looking at the transcriptome. However, you can still get really important data out of it. The main purpose of RNASeq is to quantify RNAs, both relatively and absolutely. RNASeq is good at absolute numbers. It can also do transcript discovery and mapping, including revising gene models, splice isoforms, and RNA editing. Even in "boring" tissues like mouse liver or total mouse brain, you still come up with some robust newly-discovered transcripts using this technique that aren't quantitatively minor. There are limits to this technique, e.g. they're doing the work against known sets of genes (although they'd like to do the work de novo). They're happy to provide data to help with this. A final function is in genetics, specifically expressed SNPs and private mutations that wouldn't normally appear on SNP arrays.
Two features of the data: when they do comparisons of technical replicates, they correlate very nicely. Biological replication can then really be about the biology. Secondly, the map of the RNA transcripts had a very nice linear shape on a log scale.
Should you look at RNA that can map equally well to multiple sites? They looked at 25mer reads in mammalian genomes. Let's see what happens to those that can map equally well to 2-10 sites, inclusive, as well as the unique reads. 80% of the genome could be mapped uniquely, with 6% between 2-10, and 14% with more than 10. In the myoblast transcriptome, the fraction that maps uniquely is smaller (69%), and this is something that happens generally. This is because there's lots of gene paralogy, and you'll get things that map to several sites due to a (recent or old) duplication. So if you ignore these multi-read sequences, you risk missing important stuff entirely.
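The three-way split described above (unique, 2-10 sites, more than 10 sites) can be tallied with a few lines. This is just a sketch: the function name is made up, and it assumes you already have, for each read, the number of genomic sites it maps to equally well.

```python
def multiread_fractions(hit_counts):
    """Fraction of mapped reads that are unique, map to 2-10 sites, or >10.

    hit_counts holds, per read, the number of equally good mapping sites.
    Reads with zero hits (unmapped) are left out of the denominator.
    """
    bins = {"unique": 0, "2-10": 0, ">10": 0}
    for n in hit_counts:
        if n == 1:
            bins["unique"] += 1
        elif 2 <= n <= 10:
            bins["2-10"] += 1
        elif n > 10:
            bins[">10"] += 1
    mapped = sum(bins.values())
    if mapped == 0:
        raise ValueError("no mapped reads")
    return {name: count / mapped for name, count in bins.items()}
```

Running this on a transcriptome versus the whole genome would show the kind of difference mentioned in the talk, where the unique fraction is lower among expressed sequences.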
What are the kinds of genes that are multiread sensitive? Their example is an actin gene (EL4r1). If you just map unique reads, you miss everything that is in the exons, so the gene would appear NOT to be expressed. RNASeq is really good at detecting alternative splicing. Really rare alternative splicing events may just be random events that are not intended, but which the system can tolerate - this should be taken into account.
They have discovered some candidate new genes: 161 in the brain, 95 in the muscle, and 77 in the liver, and some of these are overlaps between the three.
You need to include multireads to detect some true positives in ChIP: 5-10% of sites in the interactome are affected. Can you ID by ChIP essentially all sites predicted by FUNCTION assays? Yes, but strongly conditioned on good antibodies and good cells. Do you expect detectable function at every site with significant & reproducible in vivo occupancy? No - more data is needed: long range cis-interaction in big genomes, 3-C signals in our data etc. Significant ChIP at all instances of a high consensus motif match? For MyoD and Myogenin, NO!, as there are >1 million perfect motifs in the genome. Yes for big, well-specified motifs (NRSF), and the meaning of the binding seen at some 1/2 of the sites is unclear.
(Chromatin Immunoprecipitation: ChIP.)