zero rna-seq values AFTER normalisation in edgeR
2
0
Entering edit mode
Nick N ▴ 60
@nick-n-6370
Last seen 8.6 years ago
United Kingdom
I am using edgeR to analyze RNA-Seq data. This is my script: library("edgeR") ############################# #read in metadata & DGE ############################# composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",") counts <- readDGE(composite_samples$CountFiles)$counts ############################# #Filter & Library Size Re-set ############################# noint <- rownames(counts) %in% (c("no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique")) cpms <- cpm(counts) keep <- rowSums(cpms>1)>=3 & !noint counts <- counts[keep,] colnames(counts) <- composite_samples$SampleName d <- DGEList(counts=counts, group=composite_samples$Condition) d$samples$lib.size <- colSums(d$counts) ############################# #Normalisation ############################# d <- calcNormFactors(d) ############################# #Recording the normalized counts ############################# all_cpm=cpm(d, normalized.lib.size=TRUE) all_counts <- cbind(rownames(all_cpm), all_cpm) colnames(all_counts)[1] <- "Ensembl.Gene.ID" rownames(all_counts) <- NULL ############################# #Estimate Dispersion ############################# d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d) ############################# #Perform a test ############################# de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N")) I believe that the variable "all_counts" shall contain the normalized counts for each sample in each condition. My understanding is also that edgeR adds pseudocounts BEFORE performing the library normalisation. Thus it is possible that some values revert to being zero after normalisation. But I thought that this would happen rarely. Yet in a recent dataset I find an improbably large number of zero values in "all_counts" which made me think that my understanding of how pseudocounts and normalisation work in edgeR might be incorrect. Can, please, somebody comment on this? [[alternative HTML version deleted]]
edgeR edgeR • 1.7k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States
Hi Nick, On Fri, Aug 15, 2014 at 9:23 AM, Nick N <feralmedic at="" gmail.com=""> wrote: > I am using edgeR to analyze RNA-Seq data. This is my script: > > > library("edgeR") > ############################# > #read in metadata & DGE > ############################# > composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",") > counts <- readDGE(composite_samples$CountFiles)$counts > ############################# > #Filter & Library Size Re-set > ############################# > noint <- rownames(counts) %in% (c("no_feature", "ambiguous", > "too_low_aQual", "not_aligned", "alignment_not_unique")) > cpms <- cpm(counts) > keep <- rowSums(cpms>1)>=3 & !noint > counts <- counts[keep,] > colnames(counts) <- composite_samples$SampleName > d <- DGEList(counts=counts, group=composite_samples$Condition) > d$samples$lib.size <- colSums(d$counts) > ############################# > #Normalisation > ############################# > d <- calcNormFactors(d) > ############################# > #Recording the normalized counts > ############################# > all_cpm=cpm(d, normalized.lib.size=TRUE) > all_counts <- cbind(rownames(all_cpm), all_cpm) > colnames(all_counts)[1] <- "Ensembl.Gene.ID" > rownames(all_counts) <- NULL > ############################# > #Estimate Dispersion > ############################# > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > ############################# > #Perform a test > ############################# > de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N")) > > > I believe that the variable "all_counts" shall contain the normalized > counts for each sample in each condition. This is a misunderstanding. The counts are not affected by the normalization. Instead, the only thing that is affected is the norm.factors column in the 'sample' list item of your DGEList. This is clearly explained in the edgeR User's guide, on p. 12, under section 2.6.6. Best, Jim My understanding is also that > edgeR adds pseudocounts BEFORE performing the library normalisation. Thus > it is possible that some values revert to being zero after normalisation. > But I thought that this would happen rarely. Yet in a recent dataset I find > an improbably large number of zero values in "all_counts" which made me > think that my understanding of how pseudocounts and normalisation work in > edgeR might be incorrect. Can, please, somebody comment on this? > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 45 minutes ago
WEHI, Melbourne, Australia
Dear Nick N, Thanks for using edgeR. You do have misunderstandings however about how normalization works and what is output by the cpm() function. > Date: Fri, 15 Aug 2014 14:23:09 +0100 > From: Nick N <feralmedic at="" gmail.com=""> > To: bioconductor at r-project.org > Subject: [BioC] zero rna-seq values AFTER normalisation in edgeR > > I am using edgeR to analyze RNA-Seq data. This is my script: > > > library("edgeR") [snip] > d <- calcNormFactors(d) > all_cpm=cpm(d, normalized.lib.size=TRUE) [snip] > I believe that the variable "all_counts" shall contain the normalized > counts for each sample in each condition. The cpm() function simply computes counts-per-million, which is a ratio rather than a count. > My understanding is also that edgeR adds pseudocounts BEFORE performing > the library normalisation. No it doesn't. Why would you think that? edgeR works with your data as it actually is rather than trying to fudge it. > Thus it is possible that some values revert to being zero after > normalisation. But I thought that this would happen rarely. Yet in a > recent dataset I find an improbably large number of zero values in > "all_counts" which made me think that my understanding of how > pseudocounts and normalisation work in edgeR might be incorrect. Can, > please, somebody comment on this? cpm() simply computes counts per million by dividing the counts by the normalized library sizes. Obviously a zero count corresponds to a zero count-per-million. That seems pretty natural! Are you perhaps thinking of the use of prior.counts when computing cpm or logFC on the log-scale? The help page for the cpm() function tells you that prior counts are not used when computing plain cpm values on the raw scale. I wonder what source you are relying on for information about edgeR? The most reliable source is the documentation that comes with edgeR. Best wishes Gordon ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENT

Login before adding your answer.

Traffic: 556 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6