Search
Question: zero rna-seq values AFTER normalisation in edgeR
0
4.3 years ago by
Nick N60
United Kingdom
Nick N60 wrote:
I am using edgeR to analyze RNA-Seq data. This is my script: library("edgeR") ############################# #read in metadata & DGE ############################# composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",") counts <- readDGE(composite_samples$CountFiles)$counts ############################# #Filter & Library Size Re-set ############################# noint <- rownames(counts) %in% (c("no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique")) cpms <- cpm(counts) keep <- rowSums(cpms>1)>=3 & !noint counts <- counts[keep,] colnames(counts) <- composite_samples$SampleName d <- DGEList(counts=counts, group=composite_samples$Condition) d$samples$lib.size <- colSums(d$counts) ############################# #Normalisation ############################# d <- calcNormFactors(d) ############################# #Recording the normalized counts ############################# all_cpm=cpm(d, normalized.lib.size=TRUE) all_counts <- cbind(rownames(all_cpm), all_cpm) colnames(all_counts)[1] <- "Ensembl.Gene.ID" rownames(all_counts) <- NULL ############################# #Estimate Dispersion ############################# d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d) ############################# #Perform a test ############################# de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N")) I believe that the variable "all_counts" shall contain the normalized counts for each sample in each condition. My understanding is also that edgeR adds pseudocounts BEFORE performing the library normalisation. Thus it is possible that some values revert to being zero after normalisation. But I thought that this would happen rarely. Yet in a recent dataset I find an improbably large number of zero values in "all_counts" which made me think that my understanding of how pseudocounts and normalisation work in edgeR might be incorrect. Can, please, somebody comment on this? [[alternative HTML version deleted]] ADD COMMENTlink modified 4.3 years ago by Gordon Smyth35k • written 4.3 years ago by Nick N60 0 4.3 years ago by United States James W. MacDonald48k wrote: Hi Nick, On Fri, Aug 15, 2014 at 9:23 AM, Nick N <feralmedic at="" gmail.com=""> wrote: > I am using edgeR to analyze RNA-Seq data. This is my script: > > > library("edgeR") > ############################# > #read in metadata & DGE > ############################# > composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",") > counts <- readDGE(composite_samples$CountFiles)counts > ############################# > #Filter & Library Size Re-set > ############################# > noint <- rownames(counts) %in% (c("no_feature", "ambiguous", > "too_low_aQual", "not_aligned", "alignment_not_unique")) > cpms <- cpm(counts) > keep <- rowSums(cpms>1)>=3 & !noint > counts <- counts[keep,] > colnames(counts) <- composite_samplesSampleName > d <- DGEList(counts=counts, group=composite_samples$Condition) > d$samples$lib.size <- colSums(d$counts) > ############################# > #Normalisation > ############################# > d <- calcNormFactors(d) > ############################# > #Recording the normalized counts > ############################# > all_cpm=cpm(d, normalized.lib.size=TRUE) > all_counts <- cbind(rownames(all_cpm), all_cpm) > colnames(all_counts)[1] <- "Ensembl.Gene.ID" > rownames(all_counts) <- NULL > ############################# > #Estimate Dispersion > ############################# > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > ############################# > #Perform a test > ############################# > de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N")) > > > I believe that the variable "all_counts" shall contain the normalized > counts for each sample in each condition. This is a misunderstanding. The counts are not affected by the normalization. Instead, the only thing that is affected is the norm.factors column in the 'sample' list item of your DGEList. This is clearly explained in the edgeR User's guide, on p. 12, under section 2.6.6. Best, Jim My understanding is also that > edgeR adds pseudocounts BEFORE performing the library normalisation. Thus > it is possible that some values revert to being zero after normalisation. > But I thought that this would happen rarely. Yet in a recent dataset I find > an improbably large number of zero values in "all_counts" which made me > think that my understanding of how pseudocounts and normalisation work in > edgeR might be incorrect. Can, please, somebody comment on this? > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]
0
4.3 years ago by
Gordon Smyth35k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth35k wrote:
Dear Nick N, Thanks for using edgeR. You do have misunderstandings however about how normalization works and what is output by the cpm() function. > Date: Fri, 15 Aug 2014 14:23:09 +0100 > From: Nick N <feralmedic at="" gmail.com=""> > To: bioconductor at r-project.org > Subject: [BioC] zero rna-seq values AFTER normalisation in edgeR > > I am using edgeR to analyze RNA-Seq data. This is my script: > > > library("edgeR") [snip] > d <- calcNormFactors(d) > all_cpm=cpm(d, normalized.lib.size=TRUE) [snip] > I believe that the variable "all_counts" shall contain the normalized > counts for each sample in each condition. The cpm() function simply computes counts-per-million, which is a ratio rather than a count. > My understanding is also that edgeR adds pseudocounts BEFORE performing > the library normalisation. No it doesn't. Why would you think that? edgeR works with your data as it actually is rather than trying to fudge it. > Thus it is possible that some values revert to being zero after > normalisation. But I thought that this would happen rarely. Yet in a > recent dataset I find an improbably large number of zero values in > "all_counts" which made me think that my understanding of how > pseudocounts and normalisation work in edgeR might be incorrect. Can, > please, somebody comment on this? cpm() simply computes counts per million by dividing the counts by the normalized library sizes. Obviously a zero count corresponds to a zero count-per-million. That seems pretty natural! Are you perhaps thinking of the use of prior.counts when computing cpm or logFC on the log-scale? The help page for the cpm() function tells you that prior counts are not used when computing plain cpm values on the raw scale. I wonder what source you are relying on for information about edgeR? The most reliable source is the documentation that comes with edgeR. Best wishes Gordon ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}