Question

finding a very large number of false positives using edgeR

0

Entering edit mode

Blum, Charles ▴ 20

@blum-charles-6331

Last seen 9.6 years ago

Hi, I am running edgeR on 6 RNAseq samples that were generated using the exact same protocol but are from different Illumina project runs. In theory, no genes should be differentially expressed. Nevertheless, edgeR identifies almost 7,000 genes as DE at a FDR rate of 0.1. This is very puzzling. I ran edgeR using the classic approach (exactTest) and the glm approach. To get an idea of sequencing depth: Sample: Project1_sample1 Project1_sample2 Project1_sample3 Project2_sample1 Project2_sample2 Project2_sample3 Total unique annotated read counts: 41,440,190 26,429,859 29,655,944 25,423,167 30,914,059 35,41,714 Could it be due to the variability in sequencing depth between projects? Could there anything else in the data or analysis that could violate any assumptions made by edgeR? Is there any known problems with the newest version of edgeR? > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] splines parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.14.2 AnnotationDbi_1.24.0 GenomicRanges_1.14.4 XVector_0.2.0 [5] IRanges_1.20.6 biomaRt_2.18.0 edgeR_3.4.2 limma_3.18.7 [9] DESeq_1.14.0 lattice_0.20-24 locfit_1.5-9.1 Biobase_2.22.0 [13] BiocGenerics_0.8.0 gplots_2.12.1 MASS_7.3-29 heatmap.plus_1.3 loaded via a namespace (and not attached): [1] annotate_1.40.0 Biostrings_2.30.1 bitops_1.0-6 BSgenome_1.30.0 caTools_1.16 [6] DBI_0.2-7 gdata_2.13.2 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 [11] gtools_3.1.1 KernSmooth_2.23-10 RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.14.2 [16] RSQLite_0.11.4 rtracklayer_1.22.0 stats4_3.0.2 survival_2.37-4 tools_3.0.2 [21] XML_3.95-0.2 xtable_1.7-1 zlibbioc_1.8.0 > packageDescription('edgeR')$Maintainer [1] "Mark Robinson <mark.robinson at="" imls.uzh.ch="">, Davis McCarthy\n<dmccarthy at="" wehi.edu.au="">, Yunshun Chen <yuchen at="" wehi.edu.au="">,\nGordon Smyth <smyth at="" wehi.edu.au="">" Thanks, Charles ________________________________ IMPORTANT WARNING: This email (and any attachments) is o...{{dropped:9}}

Sequencing RNASeq edgeR Sequencing RNASeq edgeR • 1.3k views

ADD COMMENT • link updated 10.3 years ago by Gordon Smyth 50k • written 10.3 years ago by Blum, Charles ▴ 20

score 0 · Answer 1 · 2014-01-15

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, On Wed, Jan 15, 2014 at 3:07 PM, Blum, Charles <cblum at="" mednet.ucla.edu=""> wrote: > Hi, > > I am running edgeR on 6 RNAseq samples that were generated using the exact same protocol but are from different Illumina project runs. > In theory, no genes should be differentially expressed. Nevertheless, edgeR identifies almost 7,000 genes as DE at a FDR rate of 0.1. This is very puzzling. > > I ran edgeR using the classic approach (exactTest) and the glm approach. > > To get an idea of sequencing depth: > Sample: Project1_sample1 Project1_sample2 Project1_sample3 Project2_sample1 Project2_sample2 Project2_sample3 > Total unique annotated read counts: 41,440,190 26,429,859 29,655,944 25,423,167 30,914,059 35,41,714 > > Could it be due to the variability in sequencing depth between projects? Shouldn't be such a big issue -- even the differences between sample sized you see here are not very large. > Could there anything else in the data or analysis that could violate any assumptions made by edgeR? > Is there any known problems with the newest version of edgeR? My guess would be "no" -- you could, of course, try the same analysis with limma::voom or DESeq2 to see, but .. Anyway, could you show us the code you used to do the analysis -- the design matrix would be of particular interest along with the coefs/contrasts you are testing, but the whole (relevant) code would be good (ie. from DGEList -> dispersion estimation -> design matrix setup -> and the farious *fit + *table functions). Are you simply testing differential expression between the replicates of Project1 and those of Project2? Presumably your issue is that these are libraries sequenced from what you expect to be the same type of sample/tissue/cell-line/whatever? Perhaps encoding the "batch" (projectID) as another covariate into your design could help mitigate these issues, but I'm not sure what samples you're testing against what, so can't say anything for sure. -steve -- Steve Lianoglou Computational Biologist Genentech

ADD COMMENT • link 10.3 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi, I did run the same analysis using edgeR (glm), edgeR (as below) and DESeq. All had very similar results. Yes, I am simply testing between biological replicates with the exact same treatment only from 2 different Illumina runs. This simple example edgeR code also gave similar results: > group S180_Total_30_r1 S180_Total_30_r2 S180_Total_30_r4 S437_Total_30_1 S437_Total_30_2 S180 S180 S180 S437 S437 S437_Total_30_3 S437 Levels: S180 S437 y <- DGEList(counts=A, group=group, genes=genes) y <- calcNormFactors(y) y <- estimateCommonDisp(y) y <- estimateTagwiseDisp(y) fit <- exactTest(y) fit$table <- cbind(fit$table, FDR=p.adjust(fit$table$PValue,method="BH")) sum(fit$table$FDR <= 0.1) # The result is 6,674 genes Thanks, Charles Blum ICNN - UCLA On Jan 15, 2014, at 3:59 PM, Steve Lianoglou wrote: > Steve Lianoglou ________________________________ IMPORTANT WARNING: This email (and any attachments) is o...{{dropped:9}}

ADD REPLY • link 10.3 years ago Blum, Charles ▴ 20

0

Entering edit mode

Hi, Comments in line: On Wed, Jan 15, 2014 at 4:58 PM, Blum, Charles <cblum at="" mednet.ucla.edu=""> wrote: > Hi, > > I did run the same analysis using edgeR (glm), edgeR (as below) and DESeq. All had very similar results. OK, since the DESeq results are apparently concordant with your edgeR results, this means we can likely assume that the answer to your "Is there any known problems with the newest version of edgeR?" would be no ;-) > Yes, I am simply testing between biological replicates with the exact same treatment only from 2 different Illumina runs. > > This simple example edgeR code also gave similar results: > >> group > S180_Total_30_r1 S180_Total_30_r2 S180_Total_30_r4 S437_Total_30_1 S437_Total_30_2 > S180 S180 S180 S437 S437 > S437_Total_30_3 > S437 > Levels: S180 S437 > > y <- DGEList(counts=A, group=group, genes=genes) > y <- calcNormFactors(y) > y <- estimateCommonDisp(y) > y <- estimateTagwiseDisp(y) > fit <- exactTest(y) Sorry -- I'm still having a problem understanding the association of samples to treatment / Illumina run. Which samples are the controls? Which are the treated? Which belong to which "run" It'd be more helpful if you create a data.frame that has as many rows as samples with, at least, the following columns, and show us that: * run: This indicates which "Illumina run" the sample is from and would be a factor with two levels: "run1" and "run2" * treatment: A factor with two levels: "WT" and "treatment" indicating the grouping. Being that you are in the edgeR universe, I'd then see how these samples "cluster" together via an MDS plot (`plotMDS`). There are several examples of how to use it in the edgeR User's guide, however the example in the RNAseq case study in the limma user's guide (limma::voom ~ page117) might be more informative as it will show you how to differentially color and label each plot -- if I were you I'd label each point with the "run" factor, and color by treatment. If the points are clustering together by `run` instead of `treatment`, then you see the problem. You could, in principle, use something like sva/combat to remove this batch effect: http://bioconductor.org/packages/release/bioc/html/sva.html However there is some sample size considerations required for it to be used reliably: https://stat.ethz.ch/pipermail/bioconductor/2013-June/053098.html As Gordon points out in that same thread I just linked to, the best you might be able to do is just adjust for batch (`run`) in the linear model. In fact, section 4.5 (RNA-Seq of pathogen inoculated Arabidopsis with batch effects) of the edgeR User manual shows you exactly how you can proceed. HTH, -steve -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 10.3 years ago Steve Lianoglou ★ 13k

score 0 · Answer 2 · 2014-01-17

Dear Charles, The obvious conclusion is that these are not false positives and there are genuine batch differences between your runs. Evidence trumps theory. The existence of batch effects is common in genomic practice. Best wishes Gordon > Date: Wed, 15 Jan 2014 23:07:49 +0000 > From: "Blum, Charles" <cblum at="" mednet.ucla.edu=""> > To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Subject: [BioC] finding a very large number of false positives using > edgeR > > Hi, > > I am running edgeR on 6 RNAseq samples that were generated using the > exact same protocol but are from different Illumina project runs. > In theory, no genes should be differentially expressed. Nevertheless, > edgeR identifies almost 7,000 genes as DE at a FDR rate of 0.1. This is > very puzzling. > > I ran edgeR using the classic approach (exactTest) and the glm approach. > > To get an idea of sequencing depth: > Sample: Project1_sample1 Project1_sample2 Project1_sample3 Project2_sample1 Project2_sample2 Project2_sample3 > Total unique annotated read counts: 41,440,190 26,429,859 29,655,944 25,423,167 30,914,059 35,41,714 > > Could it be due to the variability in sequencing depth between projects? > Could there anything else in the data or analysis that could violate any assumptions made by edgeR? > Is there any known problems with the newest version of edgeR? > >> sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] splines parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.14.2 AnnotationDbi_1.24.0 GenomicRanges_1.14.4 XVector_0.2.0 > [5] IRanges_1.20.6 biomaRt_2.18.0 edgeR_3.4.2 limma_3.18.7 > [9] DESeq_1.14.0 lattice_0.20-24 locfit_1.5-9.1 Biobase_2.22.0 > [13] BiocGenerics_0.8.0 gplots_2.12.1 MASS_7.3-29 heatmap.plus_1.3 > > loaded via a namespace (and not attached): > [1] annotate_1.40.0 Biostrings_2.30.1 bitops_1.0-6 BSgenome_1.30.0 caTools_1.16 > [6] DBI_0.2-7 gdata_2.13.2 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 > [11] gtools_3.1.1 KernSmooth_2.23-10 RColorBrewer_1.0-5 RCurl_1.95-4.1 Rsamtools_1.14.2 > [16] RSQLite_0.11.4 rtracklayer_1.22.0 stats4_3.0.2 survival_2.37-4 tools_3.0.2 > [21] XML_3.95-0.2 xtable_1.7-1 zlibbioc_1.8.0 > > >> packageDescription('edgeR')$Maintainer > [1] "Mark Robinson <mark.robinson at="" imls.uzh.ch="">, Davis McCarthy\n<dmccarthy at="" wehi.edu.au="">, Yunshun Chen <yuchen at="" wehi.edu.au="">,\nGordon Smyth <smyth at="" wehi.edu.au="">" > > Thanks, > Charles ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}