edgeR multifactorial design: confusing BCV plot

0

Entering edit mode

Natasha ▴ 440

@natasha-4640

Last seen 11.4 years ago

I think I figured the glmLRT contrast part. fit = glmFit(y.filt, design) cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design) lrt = glmLRT(y.filt, fit, contrast=as.vector(cont)) Many Thanks, Natasha -----Original Message----- From: Natasha Sahgal Sent: 11 September 2012 18:26 To: 'bioconductor at r-project.org' Subject: edgeR multifactorial design: confusing BCV plot Dear List, This is a follow up from my previous post: https://stat.ethz.ch/pipermail/bioconductor/2012-July/047173.htmlhttps ://stat.ethz.ch/pipermail/bioconductor/2012-July/047173.html As I finally have the count data, started with the analysis. However, I do not understand the output from the BCV plot after estimating dispersion. Thus would appreciate any help/advice/suggestions. Also, would appreciate comments on the filtering step! As, it appears to me that I still have some genes with 0 read counts (as seen under the normalisation section). ------------------------------------ Code: dim(gene.counts.2) # 33607 8 ## Sample Descriptions group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2))) ## Creating dge list y = DGEList(counts=gene.counts.2, group=group) ## Filtering keep = rowSums(cpm(y)>1) >= 4 table(keep) #FALSE TRUE #17678 15929 keep2 = rowSums(cpm(y)>2) >= 4 table(keep2) #FALSE TRUE #19300 14307 keep3 = rowSums(cpm(y)>3) >= 4 table(keep3) #FALSE TRUE #20229 13378 y.filt = y[keep, ] dim(y.filt$counts) # 15929 8 y.filt2 = y[keep2, ] dim(y.filt2$counts) # 14307 8 y.filt3 = y[keep3, ] dim(y.filt3$counts) # 13378 8 ## Recalculate lib.size y.filt$samples$lib.size = colSums(y.filt$counts) y.filt2$samples$lib.size = colSums(y.filt2$counts) y.filt3$samples$lib.size = colSums(y.filt3$counts) ## Normalisation y.filt = calcNormFactors(y.filt) range(y.filt$counts[,1]) # 0 159659 range(y.filt$counts[,2]) # 0 155390 range(y.filt$counts[,3]) # 0 122249 range(y.filt$counts[,4]) # 0 137046 range(y.filt$counts[,5]) # 0 206528 range(y.filt$counts[,6]) # 0 222176 range(y.filt$counts[,7]) # 0 192333 range(y.filt$counts[,8]) # 0 229413 y.filt2 = calcNormFactors(y.filt2) range(y.filt2$counts[,1]) # 0 159659 range(y.filt2$counts[,2]) # 0 155390 range(y.filt2$counts[,3]) # 0 122249 range(y.filt2$counts[,4]) # 0 137046 range(y.filt2$counts[,5]) # 0 206528 range(y.filt2$counts[,6]) # 0 222176 range(y.filt2$counts[,7]) # 0 192333 range(y.filt2$counts[,8]) # 0 229413 y.filt3 = calcNormFactors(y.filt3) range(y.filt3$counts[,1]) # 0 159659 range(y.filt3$counts[,2]) # 0 155390 range(y.filt3$counts[,3]) # 0 122249 range(y.filt3$counts[,4]) # 0 137046 range(y.filt3$counts[,5]) # 0 206528 range(y.filt3$counts[,6]) # 0 222176 range(y.filt3$counts[,7]) # 0 192333 range(y.filt3$counts[,8]) # 0 229413 ## MDS plots plotMDS(y.filt, main="cpm(y)>1") plotMDS(y.filt2, main="cpm(y)>2") plotMDS(y.filt3, main="cpm(y)>3") ## Design Matrix design = model.matrix(~0+group) colnames(design) = gsub("group","",colnames(design)) design # KO KO_stim WT WT_stim #1 1 0 0 0 #2 1 0 0 0 #3 0 1 0 0 #4 0 1 0 0 #5 0 0 1 0 #6 0 0 1 0 #7 0 0 0 1 #8 0 0 0 1 ## Estimating Dispersion y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = 0.0276 , BCV = 0.1661 y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = 0.02711 , BCV = 0.1646 y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = 0.02665 , BCV = 0.1632 y.filt = estimateGLMTrendedDisp(y.filt,design) y.filt2 = estimateGLMTrendedDisp(y.filt2,design) y.filt3 = estimateGLMTrendedDisp(y.filt3,design) y.filt = estimateGLMTagwiseDisp(y.filt,design) y.filt2 = estimateGLMTagwiseDisp(y.filt2,design) y.filt3 = estimateGLMTagwiseDisp(y.filt3,design) jpeg("BCVplots.jpg",height=500,width=900) par(mfrow=c(1,3)) plotBCV(y.filt, main="cpm(y)>1") plotBCV(y.filt2, main="cpm(y)>2") plotBCV(y.filt3, main="cpm(y)>3") dev.off() ### NOT RUN this section fully ## Fit Model fit = glmFit(y.filt, design) lrt = glmLRT(y.filt, fit, contrast=(KO_stim - KO) - (WT_stim - WT)) ### this does not work on testing, so I think I have not correctly defined the contrast parameter ------------------------------------ sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] splines stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scatterplot3d_0.3-33 DESeq_1.8.3 locfit_1.5-7 [4] Biobase_2.16.0 BiocGenerics_0.2.0 WriteXLS_2.1.0 [7] edgeR_2.6.10 limma_3.12.0 loaded via a namespace (and not attached): [1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5 [4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0 [7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5 [10] RSQLite_0.11.1 stats4_2.15.0 survival_2.36-14 [13] tools_2.15.0 xtable_1.7-0 ------------------------------------ Many Thanks, Natasha

edgeR edgeR • 2.2k views

ADD COMMENT • link 13.4 years ago Natasha ▴ 440

0

Entering edit mode

Natasha ▴ 440

@natasha-4640

Last seen 11.4 years ago

Dear List, This is a follow up from my previous post: https://stat.ethz.ch/pipermail/bioconductor/2012-July/047173.htmlhttps ://stat.ethz.ch/pipermail/bioconductor/2012-July/047173.html As I finally have the count data, started with the analysis. However, I do not understand the output from the BCV plot after estimating dispersion. Thus would appreciate any help/advice/suggestions. Also, would appreciate comments on the filtering step! As, it appears to me that I still have some genes with 0 read counts (as seen under the normalisation section). ------------------------------------ Code: dim(gene.counts.2) # 33607 8 ## Sample Descriptions group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2))) ## Creating dge list y = DGEList(counts=gene.counts.2, group=group) ## Filtering keep = rowSums(cpm(y)>1) >= 4 table(keep) #FALSE TRUE #17678 15929 keep2 = rowSums(cpm(y)>2) >= 4 table(keep2) #FALSE TRUE #19300 14307 keep3 = rowSums(cpm(y)>3) >= 4 table(keep3) #FALSE TRUE #20229 13378 y.filt = y[keep, ] dim(y.filt$counts) # 15929 8 y.filt2 = y[keep2, ] dim(y.filt2$counts) # 14307 8 y.filt3 = y[keep3, ] dim(y.filt3$counts) # 13378 8 ## Recalculate lib.size y.filt$samples$lib.size = colSums(y.filt$counts) y.filt2$samples$lib.size = colSums(y.filt2$counts) y.filt3$samples$lib.size = colSums(y.filt3$counts) ## Normalisation y.filt = calcNormFactors(y.filt) range(y.filt$counts[,1]) # 0 159659 range(y.filt$counts[,2]) # 0 155390 range(y.filt$counts[,3]) # 0 122249 range(y.filt$counts[,4]) # 0 137046 range(y.filt$counts[,5]) # 0 206528 range(y.filt$counts[,6]) # 0 222176 range(y.filt$counts[,7]) # 0 192333 range(y.filt$counts[,8]) # 0 229413 y.filt2 = calcNormFactors(y.filt2) range(y.filt2$counts[,1]) # 0 159659 range(y.filt2$counts[,2]) # 0 155390 range(y.filt2$counts[,3]) # 0 122249 range(y.filt2$counts[,4]) # 0 137046 range(y.filt2$counts[,5]) # 0 206528 range(y.filt2$counts[,6]) # 0 222176 range(y.filt2$counts[,7]) # 0 192333 range(y.filt2$counts[,8]) # 0 229413 y.filt3 = calcNormFactors(y.filt3) range(y.filt3$counts[,1]) # 0 159659 range(y.filt3$counts[,2]) # 0 155390 range(y.filt3$counts[,3]) # 0 122249 range(y.filt3$counts[,4]) # 0 137046 range(y.filt3$counts[,5]) # 0 206528 range(y.filt3$counts[,6]) # 0 222176 range(y.filt3$counts[,7]) # 0 192333 range(y.filt3$counts[,8]) # 0 229413 ## MDS plots plotMDS(y.filt, main="cpm(y)>1") plotMDS(y.filt2, main="cpm(y)>2") plotMDS(y.filt3, main="cpm(y)>3") ## Design Matrix design = model.matrix(~0+group) colnames(design) = gsub("group","",colnames(design)) design # KO KO_stim WT WT_stim #1 1 0 0 0 #2 1 0 0 0 #3 0 1 0 0 #4 0 1 0 0 #5 0 0 1 0 #6 0 0 1 0 #7 0 0 0 1 #8 0 0 0 1 ## Estimating Dispersion y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = 0.0276 , BCV = 0.1661 y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = 0.02711 , BCV = 0.1646 y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = 0.02665 , BCV = 0.1632 y.filt = estimateGLMTrendedDisp(y.filt,design) y.filt2 = estimateGLMTrendedDisp(y.filt2,design) y.filt3 = estimateGLMTrendedDisp(y.filt3,design) y.filt = estimateGLMTagwiseDisp(y.filt,design) y.filt2 = estimateGLMTagwiseDisp(y.filt2,design) y.filt3 = estimateGLMTagwiseDisp(y.filt3,design) jpeg("BCVplots.jpg",height=500,width=900) par(mfrow=c(1,3)) plotBCV(y.filt, main="cpm(y)>1") plotBCV(y.filt2, main="cpm(y)>2") plotBCV(y.filt3, main="cpm(y)>3") dev.off() ### NOT RUN this section fully ## Fit Model fit = glmFit(y.filt, design) lrt = glmLRT(y.filt, fit, contrast=(KO_stim - KO) - (WT_stim - WT)) ### this does not work on testing, so I think I have not correctly defined the contrast parameter ------------------------------------ sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] splines stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scatterplot3d_0.3-33 DESeq_1.8.3 locfit_1.5-7 [4] Biobase_2.16.0 BiocGenerics_0.2.0 WriteXLS_2.1.0 [7] edgeR_2.6.10 limma_3.12.0 loaded via a namespace (and not attached): [1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5 [4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0 [7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5 [10] RSQLite_0.11.1 stats4_2.15.0 survival_2.36-14 [13] tools_2.15.0 xtable_1.7-0 ------------------------------------ Many Thanks, Natasha

ADD COMMENT • link 13.4 years ago Natasha ▴ 440

0

Entering edit mode

Dear List, Sorry for any cross-posting. I have an RNA-Seq expt for which I'd like to use edgeR, as it is multifactorial in design. The expt: 2 cell-lines (ko,wt), 2 conditions(stimulated, unstimulated), n=2 in each group. My aim: to detect DE genes based on the effect of stimulus on ko cells. However, I do not understand the output from the BCV plot after estimating dispersion. Thus would appreciate any help/advice/suggestions. Also, would appreciate comments on the filtering step! As, it appears to me that I still have some genes with 0 read counts (as seen under the normalisation section). ------------------------------------ Code: dim(gene.counts.2) # 33607 8 ## Sample Descriptions group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2))) ## Creating dge list y = DGEList(counts=gene.counts.2, group=group) ## Filtering keep = rowSums(cpm(y)>1) >= 4 table(keep) #FALSE TRUE #17678 15929 keep2 = rowSums(cpm(y)>2) >= 4 table(keep2) #FALSE TRUE #19300 14307 keep3 = rowSums(cpm(y)>3) >= 4 table(keep3) #FALSE TRUE #20229 13378 y.filt = y[keep, ] dim(y.filt$counts) # 15929 8 y.filt2 = y[keep2, ] dim(y.filt2$counts) # 14307 8 y.filt3 = y[keep3, ] dim(y.filt3$counts) # 13378 8 ## Recalculate lib.size y.filt$samples$lib.size = colSums(y.filt$counts) y.filt2$samples$lib.size = colSums(y.filt2$counts) y.filt3$samples$lib.size = colSums(y.filt3$counts) ## Normalisation y.filt = calcNormFactors(y.filt) range(y.filt$counts[,1]) # 0 159659 range(y.filt$counts[,2]) # 0 155390 range(y.filt$counts[,3]) # 0 122249 range(y.filt$counts[,4]) # 0 137046 range(y.filt$counts[,5]) # 0 206528 range(y.filt$counts[,6]) # 0 222176 range(y.filt$counts[,7]) # 0 192333 range(y.filt$counts[,8]) # 0 229413 y.filt2 = calcNormFactors(y.filt2) range(y.filt2$counts[,1]) # 0 159659 range(y.filt2$counts[,2]) # 0 155390 range(y.filt2$counts[,3]) # 0 122249 range(y.filt2$counts[,4]) # 0 137046 range(y.filt2$counts[,5]) # 0 206528 range(y.filt2$counts[,6]) # 0 222176 range(y.filt2$counts[,7]) # 0 192333 range(y.filt2$counts[,8]) # 0 229413 y.filt3 = calcNormFactors(y.filt3) range(y.filt3$counts[,1]) # 0 159659 range(y.filt3$counts[,2]) # 0 155390 range(y.filt3$counts[,3]) # 0 122249 range(y.filt3$counts[,4]) # 0 137046 range(y.filt3$counts[,5]) # 0 206528 range(y.filt3$counts[,6]) # 0 222176 range(y.filt3$counts[,7]) # 0 192333 range(y.filt3$counts[,8]) # 0 229413 ## MDS plots plotMDS(y.filt, main="cpm(y)>1") plotMDS(y.filt2, main="cpm(y)>2") plotMDS(y.filt3, main="cpm(y)>3") ## Design Matrix design = model.matrix(~0+group) colnames(design) = gsub("group","",colnames(design)) design # KO KO_stim WT WT_stim #1 1 0 0 0 #2 1 0 0 0 #3 0 1 0 0 #4 0 1 0 0 #5 0 0 1 0 #6 0 0 1 0 #7 0 0 0 1 #8 0 0 0 1 ## Estimating Dispersion y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = 0.0276 , BCV = 0.1661 y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = 0.02711 , BCV = 0.1646 y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = 0.02665 , BCV = 0.1632 y.filt = estimateGLMTrendedDisp(y.filt,design) y.filt2 = estimateGLMTrendedDisp(y.filt2,design) y.filt3 = estimateGLMTrendedDisp(y.filt3,design) y.filt = estimateGLMTagwiseDisp(y.filt,design) y.filt2 = estimateGLMTagwiseDisp(y.filt2,design) y.filt3 = estimateGLMTagwiseDisp(y.filt3,design) jpeg("BCVplots.jpg",height=500,width=900) par(mfrow=c(1,3)) plotBCV(y.filt, main="cpm(y)>1") plotBCV(y.filt2, main="cpm(y)>2") plotBCV(y.filt3, main="cpm(y)>3") dev.off() (http://www.well.ox.ac.uk/~nsahgal/test/BCVplots.jpg) ### NOT RUN this section ## Fit Model fit = glmFit(y.filt, design) cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design) lrt = glmLRT(y.filt, fit, contrast=as.vector(cont)) ------------------------------------ sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] splines stats graphics grDevices utils datasets methods [8] base other attached packages: [1] scatterplot3d_0.3-33 DESeq_1.8.3 locfit_1.5-7 [4] Biobase_2.16.0 BiocGenerics_0.2.0 WriteXLS_2.1.0 [7] edgeR_2.6.10 limma_3.12.0 loaded via a namespace (and not attached): [1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5 [4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0 [7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5 [10] RSQLite_0.11.1 stats4_2.15.0 survival_2.36-14 [13] tools_2.15.0 xtable_1.7-0 ------------------------------------ Many Thanks, Natasha

ADD REPLY • link 13.4 years ago Natasha ▴ 440

0

Entering edit mode

Hi Natasha, On 9/12/2012 5:05 AM, Natasha Sahgal wrote: > Dear List, > > Sorry for any cross-posting. > > I have an RNA-Seq expt for which I'd like to use edgeR, as it is multifactorial in design. > The expt: 2 cell-lines (ko,wt), 2 conditions(stimulated, unstimulated), n=2 in each group. > My aim: to detect DE genes based on the effect of stimulus on ko cells. > > However, I do not understand the output from the BCV plot after estimating dispersion. > Thus would appreciate any help/advice/suggestions. What about the output do you not understand? > > Also, would appreciate comments on the filtering step! As, it appears to me that I still have some genes with 0 read counts (as seen under the normalisation section). You didn't filter in such a way to remove all genes with zero counts. In fact you didn't say anything about zero counts - instead what you did was to require at least four samples to have more than 1 or 2 or 3 counts per million. The other four samples were unconstrained, and could easily have had zero counts. Note that this is a reasonable thing to do. You are looking for genes where one sample had more transcripts than the other sample. This includes the situation where one of the samples appears not to transcribe the gene at all. Best, Jim > > ------------------------------------ > Code: > dim(gene.counts.2) # 33607 8 > ## Sample Descriptions > group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2))) > > ## Creating dge list > y = DGEList(counts=gene.counts.2, group=group) > > ## Filtering > keep = rowSums(cpm(y)>1)>= 4 > table(keep) > #FALSE TRUE > #17678 15929 > keep2 = rowSums(cpm(y)>2)>= 4 > table(keep2) > #FALSE TRUE > #19300 14307 > keep3 = rowSums(cpm(y)>3)>= 4 > table(keep3) > #FALSE TRUE > #20229 13378 > > y.filt = y[keep, ] > dim(y.filt$counts) # 15929 8 > > y.filt2 = y[keep2, ] > dim(y.filt2$counts) # 14307 8 > > y.filt3 = y[keep3, ] > dim(y.filt3$counts) # 13378 8 > > ## Recalculate lib.size > y.filt$samples$lib.size = colSums(y.filt$counts) y.filt2$samples$lib.size = colSums(y.filt2$counts) y.filt3$samples$lib.size = colSums(y.filt3$counts) > > ## Normalisation > y.filt = calcNormFactors(y.filt) > > range(y.filt$counts[,1]) # 0 159659 > range(y.filt$counts[,2]) # 0 155390 > range(y.filt$counts[,3]) # 0 122249 > range(y.filt$counts[,4]) # 0 137046 > range(y.filt$counts[,5]) # 0 206528 > range(y.filt$counts[,6]) # 0 222176 > range(y.filt$counts[,7]) # 0 192333 > range(y.filt$counts[,8]) # 0 229413 > > y.filt2 = calcNormFactors(y.filt2) > > range(y.filt2$counts[,1]) # 0 159659 > range(y.filt2$counts[,2]) # 0 155390 > range(y.filt2$counts[,3]) # 0 122249 > range(y.filt2$counts[,4]) # 0 137046 > range(y.filt2$counts[,5]) # 0 206528 > range(y.filt2$counts[,6]) # 0 222176 > range(y.filt2$counts[,7]) # 0 192333 > range(y.filt2$counts[,8]) # 0 229413 > > y.filt3 = calcNormFactors(y.filt3) > > range(y.filt3$counts[,1]) # 0 159659 > range(y.filt3$counts[,2]) # 0 155390 > range(y.filt3$counts[,3]) # 0 122249 > range(y.filt3$counts[,4]) # 0 137046 > range(y.filt3$counts[,5]) # 0 206528 > range(y.filt3$counts[,6]) # 0 222176 > range(y.filt3$counts[,7]) # 0 192333 > range(y.filt3$counts[,8]) # 0 229413 > > ## MDS plots > plotMDS(y.filt, main="cpm(y)>1") > plotMDS(y.filt2, main="cpm(y)>2") > plotMDS(y.filt3, main="cpm(y)>3") > > ## Design Matrix > design = model.matrix(~0+group) > colnames(design) = gsub("group","",colnames(design)) design # KO KO_stim WT WT_stim > #1 1 0 0 0 > #2 1 0 0 0 > #3 0 1 0 0 > #4 0 1 0 0 > #5 0 0 1 0 > #6 0 0 1 0 > #7 0 0 0 1 > #8 0 0 0 1 > > ## Estimating Dispersion > y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = 0.0276 , BCV = 0.1661 > y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = 0.02711 , BCV = 0.1646 > y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = 0.02665 , BCV = 0.1632 > > > y.filt = estimateGLMTrendedDisp(y.filt,design) > y.filt2 = estimateGLMTrendedDisp(y.filt2,design) > y.filt3 = estimateGLMTrendedDisp(y.filt3,design) > > y.filt = estimateGLMTagwiseDisp(y.filt,design) > y.filt2 = estimateGLMTagwiseDisp(y.filt2,design) > y.filt3 = estimateGLMTagwiseDisp(y.filt3,design) > > jpeg("BCVplots.jpg",height=500,width=900) > par(mfrow=c(1,3)) > plotBCV(y.filt, main="cpm(y)>1") > plotBCV(y.filt2, main="cpm(y)>2") > plotBCV(y.filt3, main="cpm(y)>3") > dev.off() > (http://www.well.ox.ac.uk/~nsahgal/test/BCVplots.jpg) > > ### NOT RUN this section > ## Fit Model > fit = glmFit(y.filt, design) > cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design) > lrt = glmLRT(y.filt, fit, contrast=as.vector(cont)) > ------------------------------------ > > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] splines stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] scatterplot3d_0.3-33 DESeq_1.8.3 locfit_1.5-7 > [4] Biobase_2.16.0 BiocGenerics_0.2.0 WriteXLS_2.1.0 > [7] edgeR_2.6.10 limma_3.12.0 > > loaded via a namespace (and not attached): > [1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5 > [4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0 > [7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5 > [10] RSQLite_0.11.1 stats4_2.15.0 survival_2.36-14 > [13] tools_2.15.0 xtable_1.7-0 > ------------------------------------ > > Many Thanks, > Natasha > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 13.4 years ago James W. MacDonald 68k

0

Entering edit mode

Dear Jim, Regarding the BCV plots, what I did not understand was the strange profile (at least strange to me), and the low coefficients of BV. Based on some figures from the user guide, it appeared to be very different - an increase towards the higher logCPM. 1) I'm not sure how to interpret these and if it is a good thing or not? (perhaps I have misunderstood the concept of the BCV) 2) How does this affect the differential expression, if at all? Re the filtering, for some reason I was under the impression increasing the counts per million would reduce (if not remove) zero counts in all samples. I agree with what you say about half the samples being unconstrained. I had 3 filters here, just to see what the difference would be. I still need to figure out the best or optimal one to use. Many Thanks and Best Wishes, Natasha -----Original Message----- From: James W. MacDonald [mailto:jmacdon@uw.edu] Sent: 12 September 2012 14:29 To: Natasha Sahgal Cc: 'bioconductor at r-project.org' Subject: Re: [BioC] edgeR: confusing BCV plot Hi Natasha, On 9/12/2012 5:05 AM, Natasha Sahgal wrote: > Dear List, > > Sorry for any cross-posting. > > I have an RNA-Seq expt for which I'd like to use edgeR, as it is multifactorial in design. > The expt: 2 cell-lines (ko,wt), 2 conditions(stimulated, unstimulated), n=2 in each group. > My aim: to detect DE genes based on the effect of stimulus on ko cells. > > However, I do not understand the output from the BCV plot after estimating dispersion. > Thus would appreciate any help/advice/suggestions. What about the output do you not understand? > > Also, would appreciate comments on the filtering step! As, it appears to me that I still have some genes with 0 read counts (as seen under the normalisation section). You didn't filter in such a way to remove all genes with zero counts. In fact you didn't say anything about zero counts - instead what you did was to require at least four samples to have more than 1 or 2 or 3 counts per million. The other four samples were unconstrained, and could easily have had zero counts. Note that this is a reasonable thing to do. You are looking for genes where one sample had more transcripts than the other sample. This includes the situation where one of the samples appears not to transcribe the gene at all. Best, Jim > > ------------------------------------ > Code: > dim(gene.counts.2) # 33607 8 > ## Sample Descriptions > group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2))) > > ## Creating dge list > y = DGEList(counts=gene.counts.2, group=group) > > ## Filtering > keep = rowSums(cpm(y)>1)>= 4 > table(keep) > #FALSE TRUE > #17678 15929 > keep2 = rowSums(cpm(y)>2)>= 4 > table(keep2) > #FALSE TRUE > #19300 14307 > keep3 = rowSums(cpm(y)>3)>= 4 > table(keep3) > #FALSE TRUE > #20229 13378 > > y.filt = y[keep, ] > dim(y.filt$counts) # 15929 8 > > y.filt2 = y[keep2, ] > dim(y.filt2$counts) # 14307 8 > > y.filt3 = y[keep3, ] > dim(y.filt3$counts) # 13378 8 > > ## Recalculate lib.size > y.filt$samples$lib.size = colSums(y.filt$counts) > y.filt2$samples$lib.size = colSums(y.filt2$counts) > y.filt3$samples$lib.size = colSums(y.filt3$counts) > > ## Normalisation > y.filt = calcNormFactors(y.filt) > > range(y.filt$counts[,1]) # 0 159659 > range(y.filt$counts[,2]) # 0 155390 > range(y.filt$counts[,3]) # 0 122249 > range(y.filt$counts[,4]) # 0 137046 > range(y.filt$counts[,5]) # 0 206528 > range(y.filt$counts[,6]) # 0 222176 > range(y.filt$counts[,7]) # 0 192333 > range(y.filt$counts[,8]) # 0 229413 > > y.filt2 = calcNormFactors(y.filt2) > > range(y.filt2$counts[,1]) # 0 159659 > range(y.filt2$counts[,2]) # 0 155390 > range(y.filt2$counts[,3]) # 0 122249 > range(y.filt2$counts[,4]) # 0 137046 > range(y.filt2$counts[,5]) # 0 206528 > range(y.filt2$counts[,6]) # 0 222176 > range(y.filt2$counts[,7]) # 0 192333 > range(y.filt2$counts[,8]) # 0 229413 > > y.filt3 = calcNormFactors(y.filt3) > > range(y.filt3$counts[,1]) # 0 159659 > range(y.filt3$counts[,2]) # 0 155390 > range(y.filt3$counts[,3]) # 0 122249 > range(y.filt3$counts[,4]) # 0 137046 > range(y.filt3$counts[,5]) # 0 206528 > range(y.filt3$counts[,6]) # 0 222176 > range(y.filt3$counts[,7]) # 0 192333 > range(y.filt3$counts[,8]) # 0 229413 > > ## MDS plots > plotMDS(y.filt, main="cpm(y)>1") > plotMDS(y.filt2, main="cpm(y)>2") > plotMDS(y.filt3, main="cpm(y)>3") > > ## Design Matrix > design = model.matrix(~0+group) > colnames(design) = gsub("group","",colnames(design)) design # KO KO_stim WT WT_stim > #1 1 0 0 0 > #2 1 0 0 0 > #3 0 1 0 0 > #4 0 1 0 0 > #5 0 0 1 0 > #6 0 0 1 0 > #7 0 0 0 1 > #8 0 0 0 1 > > ## Estimating Dispersion > y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = > 0.0276 , BCV = 0.1661 > y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = > 0.02711 , BCV = 0.1646 > y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = > 0.02665 , BCV = 0.1632 > > > y.filt = estimateGLMTrendedDisp(y.filt,design) > y.filt2 = estimateGLMTrendedDisp(y.filt2,design) > y.filt3 = estimateGLMTrendedDisp(y.filt3,design) > > y.filt = estimateGLMTagwiseDisp(y.filt,design) > y.filt2 = estimateGLMTagwiseDisp(y.filt2,design) > y.filt3 = estimateGLMTagwiseDisp(y.filt3,design) > > jpeg("BCVplots.jpg",height=500,width=900) > par(mfrow=c(1,3)) > plotBCV(y.filt, main="cpm(y)>1") > plotBCV(y.filt2, main="cpm(y)>2") > plotBCV(y.filt3, main="cpm(y)>3") > dev.off() > (http://www.well.ox.ac.uk/~nsahgal/test/BCVplots.jpg) > > ### NOT RUN this section > ## Fit Model > fit = glmFit(y.filt, design) > cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design) > lrt = glmLRT(y.filt, fit, contrast=as.vector(cont)) > ------------------------------------ > > sessionInfo() > R version 2.15.0 (2012-03-30) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] splines stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] scatterplot3d_0.3-33 DESeq_1.8.3 locfit_1.5-7 > [4] Biobase_2.16.0 BiocGenerics_0.2.0 WriteXLS_2.1.0 > [7] edgeR_2.6.10 limma_3.12.0 > > loaded via a namespace (and not attached): > [1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5 > [4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0 > [7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5 > [10] RSQLite_0.11.1 stats4_2.15.0 survival_2.36-14 > [13] tools_2.15.0 xtable_1.7-0 > ------------------------------------ > > Many Thanks, > Natasha > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 13.4 years ago Natasha ▴ 440

0

Entering edit mode

Hi Natasha, I have been working on RNA-seq data from small sample numbers of small cell numbers. It turned out the similar results you got for the BCV plot. In my situation, there was an extra PCR step during processing the samples. I was speculating if this step of PCR amplification might increase the variation of highly expressed genes, thus gave me the strange BCV results, increasing variation with higher CPM. Just a possible hint for this. Best, Wenhuo On Wed, Sep 12, 2012 at 9:43 AM, Natasha Sahgal <nsahgal@well.ox.ac.uk>wrote: > Dear Jim, > > Regarding the BCV plots, what I did not understand was the strange profile > (at least strange to me), and the low coefficients of BV. > Based on some figures from the user guide, it appeared to be very > different - an increase towards the higher logCPM. > 1) I'm not sure how to interpret these and if it is a good thing or not? > (perhaps I have misunderstood the concept of the BCV) > 2) How does this affect the differential expression, if at all? > > Re the filtering, for some reason I was under the impression increasing > the counts per million would reduce (if not remove) zero counts in all > samples. I agree with what you say about half the samples being > unconstrained. > > I had 3 filters here, just to see what the difference would be. I still > need to figure out the best or optimal one to use. > > > Many Thanks and Best Wishes, > Natasha > > > -----Original Message----- > From: James W. MacDonald [mailto:jmacdon@uw.edu] > Sent: 12 September 2012 14:29 > To: Natasha Sahgal > Cc: 'bioconductor@r-project.org' > Subject: Re: [BioC] edgeR: confusing BCV plot > > Hi Natasha, > > On 9/12/2012 5:05 AM, Natasha Sahgal wrote: > > Dear List, > > > > Sorry for any cross-posting. > > > > I have an RNA-Seq expt for which I'd like to use edgeR, as it is > multifactorial in design. > > The expt: 2 cell-lines (ko,wt), 2 conditions(stimulated, unstimulated), > n=2 in each group. > > My aim: to detect DE genes based on the effect of stimulus on ko cells. > > > > However, I do not understand the output from the BCV plot after > estimating dispersion. > > Thus would appreciate any help/advice/suggestions. > > What about the output do you not understand? > > > > > Also, would appreciate comments on the filtering step! As, it appears to > me that I still have some genes with 0 read counts (as seen under the > normalisation section). > > You didn't filter in such a way to remove all genes with zero counts. In > fact you didn't say anything about zero counts - instead what you did was > to require at least four samples to have more than 1 or 2 or 3 counts per > million. The other four samples were unconstrained, and could easily have > had zero counts. > > Note that this is a reasonable thing to do. You are looking for genes > where one sample had more transcripts than the other sample. This includes > the situation where one of the samples appears not to transcribe the gene > at all. > > Best, > > Jim > > > > > > ------------------------------------ > > Code: > > dim(gene.counts.2) # 33607 8 > > ## Sample Descriptions > > group = factor(gsub("\\_[[:digit:]]","",colnames(gene.counts.2))) > > > > ## Creating dge list > > y = DGEList(counts=gene.counts.2, group=group) > > > > ## Filtering > > keep = rowSums(cpm(y)>1)>= 4 > > table(keep) > > #FALSE TRUE > > #17678 15929 > > keep2 = rowSums(cpm(y)>2)>= 4 > > table(keep2) > > #FALSE TRUE > > #19300 14307 > > keep3 = rowSums(cpm(y)>3)>= 4 > > table(keep3) > > #FALSE TRUE > > #20229 13378 > > > > y.filt = y[keep, ] > > dim(y.filt$counts) # 15929 8 > > > > y.filt2 = y[keep2, ] > > dim(y.filt2$counts) # 14307 8 > > > > y.filt3 = y[keep3, ] > > dim(y.filt3$counts) # 13378 8 > > > > ## Recalculate lib.size > > y.filt$samples$lib.size = colSums(y.filt$counts) > > y.filt2$samples$lib.size = colSums(y.filt2$counts) > > y.filt3$samples$lib.size = colSums(y.filt3$counts) > > > > ## Normalisation > > y.filt = calcNormFactors(y.filt) > > > > range(y.filt$counts[,1]) # 0 159659 > > range(y.filt$counts[,2]) # 0 155390 > > range(y.filt$counts[,3]) # 0 122249 > > range(y.filt$counts[,4]) # 0 137046 > > range(y.filt$counts[,5]) # 0 206528 > > range(y.filt$counts[,6]) # 0 222176 > > range(y.filt$counts[,7]) # 0 192333 > > range(y.filt$counts[,8]) # 0 229413 > > > > y.filt2 = calcNormFactors(y.filt2) > > > > range(y.filt2$counts[,1]) # 0 159659 > > range(y.filt2$counts[,2]) # 0 155390 > > range(y.filt2$counts[,3]) # 0 122249 > > range(y.filt2$counts[,4]) # 0 137046 > > range(y.filt2$counts[,5]) # 0 206528 > > range(y.filt2$counts[,6]) # 0 222176 > > range(y.filt2$counts[,7]) # 0 192333 > > range(y.filt2$counts[,8]) # 0 229413 > > > > y.filt3 = calcNormFactors(y.filt3) > > > > range(y.filt3$counts[,1]) # 0 159659 > > range(y.filt3$counts[,2]) # 0 155390 > > range(y.filt3$counts[,3]) # 0 122249 > > range(y.filt3$counts[,4]) # 0 137046 > > range(y.filt3$counts[,5]) # 0 206528 > > range(y.filt3$counts[,6]) # 0 222176 > > range(y.filt3$counts[,7]) # 0 192333 > > range(y.filt3$counts[,8]) # 0 229413 > > > > ## MDS plots > > plotMDS(y.filt, main="cpm(y)>1") > > plotMDS(y.filt2, main="cpm(y)>2") > > plotMDS(y.filt3, main="cpm(y)>3") > > > > ## Design Matrix > > design = model.matrix(~0+group) > > colnames(design) = gsub("group","",colnames(design)) design # KO > KO_stim WT WT_stim > > #1 1 0 0 0 > > #2 1 0 0 0 > > #3 0 1 0 0 > > #4 0 1 0 0 > > #5 0 0 1 0 > > #6 0 0 1 0 > > #7 0 0 0 1 > > #8 0 0 0 1 > > > > ## Estimating Dispersion > > y.filt = estimateGLMCommonDisp(y.filt, design, verbose=T) #Disp = > > 0.0276 , BCV = 0.1661 > > y.filt2 = estimateGLMCommonDisp(y.filt2, design, verbose=T) #Disp = > > 0.02711 , BCV = 0.1646 > > y.filt3 = estimateGLMCommonDisp(y.filt3, design, verbose=T) #Disp = > > 0.02665 , BCV = 0.1632 > > > > > > y.filt = estimateGLMTrendedDisp(y.filt,design) > > y.filt2 = estimateGLMTrendedDisp(y.filt2,design) > > y.filt3 = estimateGLMTrendedDisp(y.filt3,design) > > > > y.filt = estimateGLMTagwiseDisp(y.filt,design) > > y.filt2 = estimateGLMTagwiseDisp(y.filt2,design) > > y.filt3 = estimateGLMTagwiseDisp(y.filt3,design) > > > > jpeg("BCVplots.jpg",height=500,width=900) > > par(mfrow=c(1,3)) > > plotBCV(y.filt, main="cpm(y)>1") > > plotBCV(y.filt2, main="cpm(y)>2") > > plotBCV(y.filt3, main="cpm(y)>3") > > dev.off() > > (http://www.well.ox.ac.uk/~nsahgal/test/BCVplots.jpg) > > > > ### NOT RUN this section > > ## Fit Model > > fit = glmFit(y.filt, design) > > cont = makeContrasts((KO_stim - KO) - (WT_stim - WT), levels=design) > > lrt = glmLRT(y.filt, fit, contrast=as.vector(cont)) > > ------------------------------------ > > > > sessionInfo() > > R version 2.15.0 (2012-03-30) > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 > > [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 > > [7] LC_PAPER=C LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] splines stats graphics grDevices utils datasets methods > > [8] base > > > > other attached packages: > > [1] scatterplot3d_0.3-33 DESeq_1.8.3 locfit_1.5-7 > > [4] Biobase_2.16.0 BiocGenerics_0.2.0 WriteXLS_2.1.0 > > [7] edgeR_2.6.10 limma_3.12.0 > > > > loaded via a namespace (and not attached): > > [1] annotate_1.34.0 AnnotationDbi_1.18.0 DBI_0.2-5 > > [4] genefilter_1.38.0 geneplotter_1.34.0 grid_2.15.0 > > [7] IRanges_1.14.2 lattice_0.20-6 RColorBrewer_1.0-5 > > [10] RSQLite_0.11.1 stats4_2.15.0 survival_2.36-14 > > [13] tools_2.15.0 xtable_1.7-0 > > ------------------------------------ > > > > Many Thanks, > > Natasha > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Wenhuo Hu Park lab Memorial Sloan Kettering Cancer Center Zuckerman Research Building 408 East 69th Street Room ZRC-527 New York, NY 10065 Phone 646-888-3220 huw@mskcc.org [[alternative HTML version deleted]]

ADD REPLY • link 13.4 years ago wenhuo hu ▴ 200

Login before adding your answer.