Too many (?) differentially expressed genes

Too many (?) differentially expressed genes - edgeR and DESeq

1

Entering edit mode

Darya Vanichkina ▴ 120

@darya-vanichkina-6050

Last seen 7.1 years ago

Australia/Centenary Institute Universit…

Hi! I'm doing a differential gene expression analysis on two time points for a human iPS cell line, using 100bp PE RNA-seq data mapped with tophat and aggregated to generate a count table using htseq-count in intersection-strict mode. While the differentiation process changes the phenotype of the cell line significantly, I'm still not sure I would expect >15% of genes in the genome to change significantly. I'm especially worried about the fact that after I filter for lowly expressed genes, ~50% of those for which testing is carried out are called significant. Naturally, I can filter based on cpm/logFC/detection by both tools before coming to a "significant" DG table, but I'm wondering if there is something intrinsically wrong with my data/samples, and if so - what? Or am I just making a very silly mistake in my code somewhere? Thanks in advance for your help, Darya Vanichkina edgeR BCV plot (Disp = 0.0189 , BCV = 0.1375 ) http://i.imgur.com/Pf32qeBh.png edgeR smearplot http://i.imgur.com/bs6JyoEh.png DESeq MAplot http://i.imgur.com/Ds9nfwth.png ## Sample code used # edgeR ----- row.names(counttable.sam) <- counttable.sam$Gene counttable.sam$Gene <- NULL y <- DGEList(counts=counttable.sam[1:6], group=rep(1:2,each=3), genes=row.names(counttable.sam)) keep <- rowSums(cpm(y)>1) >= 1 y <- y[keep,] dim(y) # 16002 6 # Kept, out of 62890 genes in reference (gencode 17). Reads were counted to gene features using htseq-count in intersection-strict mode y$samples$lib.size <- colSums(y$counts) y <- calcNormFactors(y) plotMDS(y) y <- estimateCommonDisp(y, verbose=TRUE) # Disp = 0.0189 , BCV = 0.1375 y <- estimateTagwiseDisp(y) plotBCV(y) et <- exactTest(y) summary(de <- decideTestsDGE(et)) # Out of 16002 tags for which the counts are not too low, 6263 are upregulated and 6235 are down-regulated detags <- rownames(y)[as.logical(de)] plotSmear(et, de.tags=detags) abline(h=c(-1, 1), col="blue") # DESeq ----- # Filter the lowly expressed tags rs <- rowSums (counttable.sam) use <- (rs > 10) table(use) # use # FALSE TRUE # 41704 21186 counttable.sam <- counttable.sam[ use, ] conds <- factor(c("day0", "day0", "day0", "day34", "day34", "day34")) d <- newCountDataSet(counttable.sam, conds) d <- estimateSizeFactors(d) sizeFactors(d) # d0_1 d0_2 d0_3 d34_1 d34_2 d34_3 # 2.0770099 1.4882383 0.7514035 0.4643261 1.3976540 0.6846997 d <- estimateDispersions(d, sharingMode="maximum") plotDispEsts(d) DESeqRes <- nbinomTest(d, "day0", "day34") DESeqRes$GeneShort <- substr(DESeqRes$id, 1, 15) DESeqResSig <- subset(DESeqRes, padj < 0.01) # 10163/21186 genes were called significant by DESeq ... plotMA(DESeqRes) > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq_1.12.0 lattice_0.20-15 locfit_1.5-9.1 Biobase_2.20.1 [5] BiocGenerics_0.6.0 ggplot2_0.9.3.1 BiocInstaller_1.10.2 edgeR_3.2.4 [9] limma_3.16.6 biomaRt_2.16.0 loaded via a namespace (and not attached): [1] annotate_1.38.0 AnnotationDbi_1.22.6 colorspace_1.2-2 DBI_0.2-7 [5] dichromat_2.0-0 digest_0.6.3 genefilter_1.42.0 geneplotter_1.38.0 [9] grid_3.0.1 gtable_0.1.2 IRanges_1.18.2 labeling_0.2 [13] MASS_7.3-27 munsell_0.4.2 plyr_1.8 proto_0.3-10 [17] RColorBrewer_1.0-5 RCurl_1.95-4.1 reshape2_1.2.2 RSQLite_0.11.4 [21] scales_0.2.3 splines_3.0.1 stats4_3.0.1 stringr_0.6.2 [25] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 xtable_1.7-1 ____________________ Darya Vanichkina PhD Student in Bioinformatics IMB Science Ambassador Institute for Molecular Biosciences (IMB) University of Queensland Building 80 St Lucia Queensland, 4072 Australia d.vanichkina@gmail.com d.vanichkina@imb.uq.edu.au [[alternative HTML version deleted]]

PROcess edgeR DESeq PROcess edgeR DESeq • 6.3k views

ADD COMMENT • link updated 10.7 years ago by Wolfgang Huber ★ 13k • written 10.7 years ago by Darya Vanichkina ▴ 120

0

Entering edit mode

Hi Darya,

I am working with something similar. Even I found about 49% of the ~20k protein Coding genes to be deferentially expressed. If you could share how you dealt with this or justified the observation.

ADD REPLY • link 5.8 years ago ag1805x ▴ 80

1

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 9 days ago

EMBL European Molecular Biology Laborat…

Darya one possibility is that your three replicates each at day 0 and day3/4 are not really capturing enough biological variability (i.e. are too similar) and thus even small differences between the conditions are judged as significant by edgeR / DESeq. Best wishes Wolfgang On 22 Jul 2013, at 06:00, Darya Vanichkina <d.vanichkina at="" gmail.com=""> wrote: > Hi! > > I'm doing a differential gene expression analysis on two time points for a human iPS cell line, using 100bp PE RNA-seq data mapped with tophat and aggregated to generate a count table using htseq-count in intersection-strict mode. While the differentiation process changes the phenotype of the cell line significantly, I'm still not sure I would expect >15% of genes in the genome to change significantly. > > I'm especially worried about the fact that after I filter for lowly expressed genes, ~50% of those for which testing is carried out are called significant. Naturally, I can filter based on cpm/logFC/detection by both tools before coming to a "significant" DG table, but I'm wondering if there is something intrinsically wrong with my data/samples, and if so - what? Or am I just making a very silly mistake in my code somewhere? > > Thanks in advance for your help, > Darya Vanichkina > > > > edgeR BCV plot (Disp = 0.0189 , BCV = 0.1375 ) > > http://i.imgur.com/Pf32qeBh.png > > edgeR smearplot > > http://i.imgur.com/bs6JyoEh.png > > DESeq MAplot > http://i.imgur.com/Ds9nfwth.png > > ## Sample code used > # edgeR ----- > row.names(counttable.sam) <- counttable.sam$Gene > counttable.sam$Gene <- NULL > > y <- DGEList(counts=counttable.sam[1:6], group=rep(1:2,each=3), genes=row.names(counttable.sam)) > > keep <- rowSums(cpm(y)>1) >= 1 > y <- y[keep,] > dim(y) > # 16002 6 > # Kept, out of 62890 genes in reference (gencode 17). Reads were counted to gene features using htseq-count in intersection-strict mode > > y$samples$lib.size <- colSums(y$counts) > y <- calcNormFactors(y) > plotMDS(y) > y <- estimateCommonDisp(y, verbose=TRUE) > # Disp = 0.0189 , BCV = 0.1375 > y <- estimateTagwiseDisp(y) > plotBCV(y) > et <- exactTest(y) > summary(de <- decideTestsDGE(et)) > # Out of 16002 tags for which the counts are not too low, 6263 are upregulated and 6235 are down-regulated > detags <- rownames(y)[as.logical(de)] > plotSmear(et, de.tags=detags) > abline(h=c(-1, 1), col="blue") > > > # DESeq ----- > # Filter the lowly expressed tags > rs <- rowSums (counttable.sam) > use <- (rs > 10) > table(use) > # use > # FALSE TRUE > # 41704 21186 > counttable.sam <- counttable.sam[ use, ] > conds <- factor(c("day0", "day0", "day0", "day34", "day34", "day34")) > > d <- newCountDataSet(counttable.sam, conds) > d <- estimateSizeFactors(d) > sizeFactors(d) > # d0_1 d0_2 d0_3 d34_1 d34_2 d34_3 > # 2.0770099 1.4882383 0.7514035 0.4643261 1.3976540 0.6846997 > d <- estimateDispersions(d, sharingMode="maximum") > plotDispEsts(d) > DESeqRes <- nbinomTest(d, "day0", "day34") > DESeqRes$GeneShort <- substr(DESeqRes$id, 1, 15) > DESeqResSig <- subset(DESeqRes, padj < 0.01) > # 10163/21186 genes were called significant by DESeq ... > plotMA(DESeqRes) > > > >> sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] DESeq_1.12.0 lattice_0.20-15 locfit_1.5-9.1 Biobase_2.20.1 > [5] BiocGenerics_0.6.0 ggplot2_0.9.3.1 BiocInstaller_1.10.2 edgeR_3.2.4 > [9] limma_3.16.6 biomaRt_2.16.0 > > loaded via a namespace (and not attached): > [1] annotate_1.38.0 AnnotationDbi_1.22.6 colorspace_1.2-2 DBI_0.2-7 > [5] dichromat_2.0-0 digest_0.6.3 genefilter_1.42.0 geneplotter_1.38.0 > [9] grid_3.0.1 gtable_0.1.2 IRanges_1.18.2 labeling_0.2 > [13] MASS_7.3-27 munsell_0.4.2 plyr_1.8 proto_0.3-10 > [17] RColorBrewer_1.0-5 RCurl_1.95-4.1 reshape2_1.2.2 RSQLite_0.11.4 > [21] scales_0.2.3 splines_3.0.1 stats4_3.0.1 stringr_0.6.2 > [25] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 xtable_1.7-1 > > > > ____________________ > Darya Vanichkina > PhD Student in Bioinformatics > IMB Science Ambassador > Institute for Molecular Biosciences (IMB) > University of Queensland Building 80 > St Lucia Queensland, 4072 Australia > d.vanichkina at gmail.com > d.vanichkina at imb.uq.edu.au > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.7 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Hi Wolfgang, Thank you; this is very likely given the experimental setup of using replicate wells/plates of 1 cell line per condition (the "standard" in the field, which has worried me for quite a while). But I'm afraid I've got more questions: I thought the edgeR BCV and dispersion were meant to test this very thing, and my values of Disp = 0.0189 , BCV = 0.1375 seem closer to the expected value for a genetically identical model organism/cell line rather than a technical replicate. According to the edgeR manual: "Typical values for the common BCV (square- root-dispersion) for datasets arising from well-controlled experiments are 0.4 for human data, 0.1 for data on genetically identical model organisms or 0.01 for technical replicates."... "The BCV (square root of the common dispersion) here is 14%, a typical size for a laboratory experiment with a cell line or a model organism." But unlike the edgeR manual, my BCV plot looks "flatter" and smear plot is a lot "redder"... 1) Is the fact that my BCV plot Y-axis is shorter than the one in the manual (ymax(manual) ~= 1.2; ymax(me)~=0.8) evidence of this under- variation? 2) Is there a way to test for whether there is not enough variation happening (apart from looking at the final plots and seeing all that red)? So for example if I do two more cell lines (+1 per condition), how would I then test that I had enough variability, or whether I should do another 2 cell lines? 3) How would you go about doing the differential expression analysis, since budget/time constraints mean that I can't do the (obviously) preferable experiment of doing more cell lines? So far what I've done is just filtering based on: - abs(logFC) > 1 - logCPM > 0.5 - detection by both EdgeR and DESeq Which are all rather arbitrary cutoffs, and I know I'm losing some biologically meaningful things when I use the fold-change cutoff, since I have a few "expected" genes for the comparison discarded in the non-significant pile. Is there a better way? Thanks in advance, Darya

ADD REPLY • link 10.7 years ago Darya Vanichkina ▴ 120

0

Entering edit mode

Hi Darya A dispersion of 0.02 is typical for cell line experiments, but only for simple ones. For an experiment involving a full month of incubation, it really seems quite low. On the other hand, you do have quite drastic changes: very many genes change more the 32-fold (5 log2 units on the MA plot), and they would still be significant even if you had a higher dispersion estimate. Given the dramatic changes in phenotype that one sees in differentiation, the strong changes in gene expression are not that surprising. In the end, it seems entirely reasonable to me to say that hardly any gene is at the same level in a stem cell as in a terminally differentiated cell. Remember that a significant p value only means that the gene's fold change is not zero and that the observed _direction_ of change is likely the true one. It says nothing about the magnitude of the change. You are hence in a situation where you are no longer interested in _which_ genes change (because the answer simply is: most of them), but in the strength of the change: Which genes have changed dramatically, which genes stringly and which have changed only a bit? Hence, you should now look at fold changes rather than p values. Using ordinary log2 fold change values can give you a misleading picture: As you can see in the MA plot, weak genes seem to have the strongest changes, but this is only an artifact due to the fact that for weak genes, the fold-change estimates are more variable and hence more likely to be exaggerated. This is why we introduced "shrunken log2 fold changes" in DESeq2: they give you a more realistic picture of the strength of changes across the dynamic range. See the DESeq2 vignette, and especially this tutorial for more explanations: http://www.bioconductor.org/help/course- materials/2013/CSAMA2013/tuesday/afternoon/DESeq2_parathyroid.pdf Simon

ADD REPLY • link 10.7 years ago Simon Anders ★ 3.7k

Login before adding your answer.