Normalization of chip-seq data in manually specified regions using CSAW
1
0
Entering edit mode
Last seen 7 months ago
United States

Hi I have manually spcified certain regions upstream of TSS that I am trying to normalize using csaw. I am currently using the following code to get the read counts from the bam file. But whenever I run region counts using the following parameters I get zero counts in my assay matrix. I am wondering why this is the case. Another approach I have used to get the normalized TMM value for upstream regions is to use featureCounts program and then normalize the reads using DESeq TMM normalization. But in that case the value of total counts would be only the counts falling inside the upstream regions and not the number of reads in the whole library. I was wondering if that would be the right approach to use rather than trying to stick with csaw.

Below is the code that gives zero counts with csaw.

library(csaw)
colnames(TSS1000_mouse) <- c("chr","start","end","strand","ensembl_id")
TSS1000_mouse_granges <- makeGRangesFromDataFrame(TSS1000_mouse)
frag.len <- 36
reg.counts <- regionCounts(bam.files,TSS1000_mouse_granges,ext=frag.len,param=param)
assay(reg.counts)
tail(assay(reg.counts))

I have checked there are no non-zero values in assay(reg.counts). I was wondering if I am doing something wrong here or any parameter values I need to change because for the same dataset with default parameter featureCounts gives me counts that I can further normalize using DESeq (Kindly let me know if this would be a correct approach to use so I can go ahead with it).

PS: I am using the ENCODE CHIP-seq data and make comparisons with other species in the TSS regions.I have tried approaches like RPM to normalize the counts in the upstream of TSS (different sizes 200,500,1kb) and now I am trying to see if TMM would yield different results.

chip-seq TSS TMM csaw • 2.0k views
1
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 3 hours ago
The city by the bay

You probably have the minq parameter turned up too high. I usually use subread, which means that my MAPQ scores fall within [0, 59]. Thus, for me, setting minq=50 is stringent but not nonsensical. If you use a different aligner, then you have to adjust your minq to fall within the range of observed MAPQ scores.

Using featureCounts is another valid approach for getting promoter-level counts in your situation. In fact, in this case, it should be very similar to the regionCounts output (assuming you've used the same extension length, MAPQ threshold, etc.) as the job of counting across pre-specified regions is the same.

However, looking at your example code seems to indicate that you've only got one library. The normalization in csaw (and more generally, in edgeR and DESeq) only works between samples, not between regions. Trying to apply TMM on one library doesn't make sense, so I presume you have other libraries somewhere. Of course, trying to quantitatively compare ChIP-seq signal between species has its own (huge) set of problems - each species has different mapping/sequencing biases, antibody binding efficiency, global protein levels and binding/histone marking activity, etc. So, good luck with that.

0
Entering edit mode

Yes I have other libraries as I said in the post I am using public encode data and some other public data sets. So what would you suggest would be good value to use for minq in this case. I do agree with you that comparing different species would have its own set of biases. So if you were in my shoes would you normalize each species TSS counts separately or together (since the TSS size would be the same). In my initial approach with rpkm I had done it separately do you think that is a valid approach?

Also what other normalization strategies (on top of TMM) do you think could be employed to minimize or reduce these biases.

many thanks!

PS: I am using differential entropy as a measure for comparing the different histone intensities.

0
Entering edit mode

As for the choice of minq, that depends on your aligner, so I suggest you have a closer look at your BAM files to pick a good value. For example, for Bowtie2, I would use values of 10-20, because the range there is [0, 40-ish].

As for normalization... well, to be honest, cross-species comparisons are not something I would willingly do. Because the biases are different between species, you lose the advantage of doing DB analyses in the first place, i.e., the biases no longer cancel out between samples. Note that RPKM or TMM normalization provide no protection against species-specific biases at each region; they only protect against global biases due to library size differences, composition bias or efficiency bias.

If I were in your shoes, the safest approach would be to do DB analyses within each species (if it's meaningful to do so in your analysis), and then do a meta-analysis of the DB statistics across species. This exploits the fact that the region-level biases will cancel out within each species, such that you don't have to worry about them in the meta-analysis. Otherwise, if you want to directly compare between species, it's up to you to resolve each bias. This sounds painful, but if you need it, then do it.

0
Entering edit mode

One would do DB analyses in case of control and treated sample (correct me if I am wrong). But in cases where you are comparing the same tissue between two species how would you correct for region specific biases (any pointers would be appreciated?).

The way I am currently doing the analysis is to normalize each species chip-seq TSS values separately (using either TMM/RPKM) and then bin these TMM of two species according to the KA/KS values and calculate differential entropy for each bin. So for each bin you would have x values of one species chip-seq data and x-values of another species data and from that we would calculate a joint differential entropy (loss of information).

How do you this such an approach can be further improved?

0
Entering edit mode

Indeed, how can you correct for region-specific biases? That's an interesting question to which there doesn't seem to be an easy answer. The best way would be to have input samples for both species; you could then test for differences between ChIP and input within each species (where the biases cancel out), prior to comparing between species. If you don't have this information... well, have fun making innumerable plots of log-fold change against various genomic metrics (GC content, mappability, etc.) in order to identify the biases and remove them.

In any case, whatever you're doing, it doesn't sound like a differential binding analysis. I can't give advice on the comparative genomics part of your proposal, but your normalization strategy doesn't make sense. You can't TMM normalize each species separately, and then expect to be able to compare between species; the normalized values will be on different scales. RPKMs are slightly more comparable, but if any composition biases are present, then again; the scale of the normalized values will be systematically different between species. Now, this might not affect the entropy calculation because you only need the empirical probability densities (such that the location of the distribution of normalized values doesn't matter), but if that's the case, why bother doing TMM/RPKM normalization at all?

0
Entering edit mode

So you are saying I should not do TMM/RPKM normalization in case I use differential entropy. Initially I was using Spearman correlation with that approach now I changed it to differential entropy as that would be more robust.

Let's say I have input sample for each species. Are you saying I should use the signal/noise ration of chip/input summed over the region and normalized by region length?

0
Entering edit mode

If you have an input sample for each species, then you could perform background-based TMM normalization (see Section 4.2 of the csaw user's guide) within each species. Then you can do a DB analysis comparing each ChIP-seq sample to its matching input. Finally, you could compare the fold changes of the significant regions between species. This relies on a high-quality input library (i.e., high coverage, and done in the same batch as the ChIP) in each species, otherwise the counts will be too low to stably calculate the fold change.

0
Entering edit mode

But this would probably not work if I am only interested in promoter regions?

Can I just use the fold change(of chip and input) in the promoter regions directly to compare each species?

0
Entering edit mode

Why wouldn't it work? You're not being asked to analyze the data with 10 kbp bins, only to calculate normalization factors with them. You can still use the counts for the promoters to get the fold changes after applying the normalization factors (that were computed from the bins).

0
Entering edit mode

Thanks for all your help. I will try and get the normalization factors using the input data as you suggested. Most of the Encode data is not very high coverage but I would probably have to do with what I have.

0
Entering edit mode

Hi again,

When trying to remove composition biases using a particular histone and input samples do all the files go into the bam.files argument together if not how to do the normalization using input(control) samples? For differential binding across different conditions there are multiple examples but I don't see an example using input/Igg controls.

0
Entering edit mode

Yes, they all go together. You can imagine it as "differential binding" between ChIP and input, except that the vast majority of changes go in one direction.

0
Entering edit mode

Is there someway to input tagalign files in the CSAW package. The thing is that I am also using Roadmap epigenomics data along with Encode. While Encode has files in Bam format. Roadmap only has tagalign format. Most of the tagalign files are consolidated epigenomes and have been corrected for mapping quality. So I don't need the param parameter of minq for that. I was wondering if you could suggest a way to use the csaw package for that.

0
Entering edit mode

No, csaw only accepts BAM files. I once toyed with supporting BED and BED-like formats, but it was too much hassle without any clear benefit. If TagAlign is close to BED, you might try converting it to BAM format, e.g., using bedtools.

0
Entering edit mode

Actually tagalign does not have mapping qualities. Most of the tagalign files are in the consolidated epigenome of Roadmap so in order to assign it mapping quality it would be ok to assign each read an acceptable mapping quality > 10 using bedtobam I hope?

0
Entering edit mode

Hi Once again,

I was wondering how do I get normalized read counts from csaw. Let's say I have region counts from TSS 1000 bp upstream that I got from regionCounts function. And normalization factors from normOffsets function is following the right way to get TMM normalized counts?

binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
normfacs <- normOffsets(binned)
TSS1000_TMM <- asDGEList(reg_counts_1000, norm.factors=normfacs)

0
Entering edit mode

Best to start a new question, this thread is getting too long.

0
Entering edit mode

Let's continue at the new thread at

Getting normalized read counts from region counts

:)