Question: Getting normalized read counts from region counts
0
3.4 years ago by
United States

Hi I was wondering if you could let me know as to what would be the correct way to get TMM normalized region counts.

Here is the code. Kindly let me know which is the correct way to do it :-

library(csaw)
options(echo=TRUE) # if you want see commands in output file
args <- commandArgs(trailingOnly = TRUE)
if(length(args) < 2) {
args <- c("--help")
}
cell_type <- as.character(unlist(args[1]))
histone_name <- as.character(unlist(args[2]))
bam.files <- unlist(args[3:length(args)])
TSS1000_counts <-  paste("TSS1000_",histone_name,"_",cell_type,".txt",sep='')
colnames(TSS_1000_human) <- c("chr","start","end","strand","ensembl_id")
TSS_1000_human_granges <- with(TSS_1000_human, GRanges(chr, IRanges(start, end), strand, id=ensembl_id))

frag.len <- 36
reg_counts_1000 <- regionCounts(bam.files,TSS_1000_human_granges,ext=frag.len,param=param)

binned <- windowCounts(bam.files, bin=TRUE, width=10000, param=param)
normfacs <- normOffsets(binned)
test_normalize_counts <- (assay(reg_counts_1000))/normfacs
TSS1000_TMM <- asDGEList(reg_counts_1000, norm.factors=normfacs)


csaw tmm regioncounts • 806 views
modified 3.4 years ago • written 3.4 years ago by saadmurtazakhan10
3
3.4 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

The concept of "normalized counts" isn't something that's supported by edgeR. Once you scale the counts by an arbitrary normalization factor, they aren't interpretable as counts anymore. We prefer to calculate normalized expression values (or in this case, binding values), which can be computed as CPMs. All you would have to do is:

cpm(TSS1000_TMM)

... and you'll get a matrix of CPMs that you can use for whatever relative quantification you want to do. If you must have count data, it's far better to incorporate the normalization information into your modelling procedure, e.g., as GLM offsets. This avoids screwing up the mean-variance relationship when small counts get scaled up (inflating the Poisson noise) or large counts get scaled down (leading to underdispersion).

P.S. Your test_normalize_counts doesn't make any sense. You're dividing by the wrong thing (you need to use the effective library size, not just the normalization factor) and you're not performing the divisions column-wise. Use the cpm command instead.

P.P.S. You use a fragment length of 36 bp, which I presume is your read length. You can achieve the same effect by setting ext=NA, which might be better as it ensures your code will still work on another data set with different read lengths.

In the previous post (C: Normalization of chip-seq data in manually specified regions using CSAW) you had mentioned that I should calculate normalization factors and use the promoter counts to get the fold changes. You also suggested that I should use fold changes which had a significant p-value (I assume I would use edgeR over here to calculate p-value). I was wondering that in some cases the fold-change with significant p-values would be too low. In that case can I just use all the values with normalized foldchange for what I set out to do?

PS: another question (also perhaps the last one :)). How do you get the seq_ids along with the region counts from regionCounts since they are already present as metadata in Granges object?

1

Yes, I think you could, as your differential entropy calculations don't depend on differential binding. It's not surprising that you don't get a lot of significant regions, because only have one ChIP and input per species.

As for your other question, I'm not sure what you mean by the seq_ids. Do you mean your ENSEMBL identities? In so, you should be able to get them with rowRanges(reg_counts_1000)\$id.

P.S. If you're going to post additional comments on this thread, add the csaw tag otherwise I won't get notified.

Actually I have 2 replicates for both Chip and input for each species. Thanks for all the help hopefully this would be the last of the comment for this post.

Cute cat BTW!

Does the order of bam files matter if I am calculating fold changes against input. Should it be group1(histone) followed by  group2(input) or otherwise? I know we can always multiply by -1 in each case. I was confused since according to edgeR manual control samples should follow treated in columns. Let me know if I am wrong.

1

There is no requirement for any specific ordering of the BAM files, as long as the order of files that you use for counting matches up to the ordering of ChIP/input factor used to construct the design matrix.