zero rna-seq values AFTER normalisation in edgeR

0

Entering edit mode

Nick N ▴ 60

@nick-n-6370

Last seen 8.6 years ago

United Kingdom

I am using edgeR to analyze RNA-Seq data. This is my script: library("edgeR") ############################# #read in metadata & DGE ############################# composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",") counts <- readDGE(composite_samples$CountFiles)$counts ############################# #Filter & Library Size Re-set ############################# noint <- rownames(counts) %in% (c("no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique")) cpms <- cpm(counts) keep <- rowSums(cpms>1)>=3 & !noint counts <- counts[keep,] colnames(counts) <- composite_samples$SampleName d <- DGEList(counts=counts, group=composite_samples$Condition) d$samples$lib.size <- colSums(d$counts) ############################# #Normalisation ############################# d <- calcNormFactors(d) ############################# #Recording the normalized counts ############################# all_cpm=cpm(d, normalized.lib.size=TRUE) all_counts <- cbind(rownames(all_cpm), all_cpm) colnames(all_counts)[1] <- "Ensembl.Gene.ID" rownames(all_counts) <- NULL ############################# #Estimate Dispersion ############################# d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d) ############################# #Perform a test ############################# de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N")) I believe that the variable "all_counts" shall contain the normalized counts for each sample in each condition. My understanding is also that edgeR adds pseudocounts BEFORE performing the library normalisation. Thus it is possible that some values revert to being zero after normalisation. But I thought that this would happen rarely. Yet in a recent dataset I find an improbably large number of zero values in "all_counts" which made me think that my understanding of how pseudocounts and normalisation work in edgeR might be incorrect. Can, please, somebody comment on this? [[alternative HTML version deleted]]

edgeR edgeR • 1.7k views

ADD COMMENT • link updated 9.7 years ago by Gordon Smyth 50k • written 9.7 years ago by Nick N ▴ 60

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 7 hours ago

United States

Hi Nick, On Fri, Aug 15, 2014 at 9:23 AM, Nick N <feralmedic at="" gmail.com=""> wrote: > I am using edgeR to analyze RNA-Seq data. This is my script: > > > library("edgeR") > ############################# > #read in metadata & DGE > ############################# > composite_samples <- read.csv(file="samples.csv",header=TRUE,sep=",") > counts <- readDGE(composite_samples$CountFiles)$counts > ############################# > #Filter & Library Size Re-set > ############################# > noint <- rownames(counts) %in% (c("no_feature", "ambiguous", > "too_low_aQual", "not_aligned", "alignment_not_unique")) > cpms <- cpm(counts) > keep <- rowSums(cpms>1)>=3 & !noint > counts <- counts[keep,] > colnames(counts) <- composite_samples$SampleName > d <- DGEList(counts=counts, group=composite_samples$Condition) > d$samples$lib.size <- colSums(d$counts) > ############################# > #Normalisation > ############################# > d <- calcNormFactors(d) > ############################# > #Recording the normalized counts > ############################# > all_cpm=cpm(d, normalized.lib.size=TRUE) > all_counts <- cbind(rownames(all_cpm), all_cpm) > colnames(all_counts)[1] <- "Ensembl.Gene.ID" > rownames(all_counts) <- NULL > ############################# > #Estimate Dispersion > ############################# > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d) > ############################# > #Perform a test > ############################# > de_ctl_mo_composite <- exactTest(d, pair=c("NY", "N")) > > > I believe that the variable "all_counts" shall contain the normalized > counts for each sample in each condition. This is a misunderstanding. The counts are not affected by the normalization. Instead, the only thing that is affected is the norm.factors column in the 'sample' list item of your DGEList. This is clearly explained in the edgeR User's guide, on p. 12, under section 2.6.6. Best, Jim My understanding is also that > edgeR adds pseudocounts BEFORE performing the library normalisation. Thus > it is possible that some values revert to being zero after normalisation. > But I thought that this would happen rarely. Yet in a recent dataset I find > an improbably large number of zero values in "all_counts" which made me > think that my understanding of how pseudocounts and normalisation work in > edgeR might be incorrect. Can, please, somebody comment on this? > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]

ADD COMMENT • link 9.7 years ago James W. MacDonald 65k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 45 minutes ago

WEHI, Melbourne, Australia

Dear Nick N, Thanks for using edgeR. You do have misunderstandings however about how normalization works and what is output by the cpm() function. > Date: Fri, 15 Aug 2014 14:23:09 +0100 > From: Nick N <feralmedic at="" gmail.com=""> > To: bioconductor at r-project.org > Subject: [BioC] zero rna-seq values AFTER normalisation in edgeR > > I am using edgeR to analyze RNA-Seq data. This is my script: > > > library("edgeR") [snip] > d <- calcNormFactors(d) > all_cpm=cpm(d, normalized.lib.size=TRUE) [snip] > I believe that the variable "all_counts" shall contain the normalized > counts for each sample in each condition. The cpm() function simply computes counts-per-million, which is a ratio rather than a count. > My understanding is also that edgeR adds pseudocounts BEFORE performing > the library normalisation. No it doesn't. Why would you think that? edgeR works with your data as it actually is rather than trying to fudge it. > Thus it is possible that some values revert to being zero after > normalisation. But I thought that this would happen rarely. Yet in a > recent dataset I find an improbably large number of zero values in > "all_counts" which made me think that my understanding of how > pseudocounts and normalisation work in edgeR might be incorrect. Can, > please, somebody comment on this? cpm() simply computes counts per million by dividing the counts by the normalized library sizes. Obviously a zero count corresponds to a zero count-per-million. That seems pretty natural! Are you perhaps thinking of the use of prior.counts when computing cpm or logFC on the log-scale? The help page for the cpm() function tells you that prior counts are not used when computing plain cpm values on the raw scale. I wonder what source you are relying on for information about edgeR? The most reliable source is the documentation that comes with edgeR. Best wishes Gordon ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 9.7 years ago Gordon Smyth 50k

Login before adding your answer.