Hello All,
We recently ran an RNAseq on 6 samples, half of which were treated with a drug and the other half served as a control. The reads were mapped to the reference genome with STAR and then we did a differential expression analysis with DESeq2. The dispersion plot looks good, very similar to the example reported in the documentation. However, when we create an MA plot, we find that the distribution of log2 fold changes has a baseMean-dependent bias: low expressed genes tend to have positive log fold change, whereas high expression genes tend to have negative fold changes.
Consequently, the genes reported as significant by DESeq2 also display the same non-linear bias.
We were wondering what the source of the error could be and if there are normalisation strategies to correct it?
Below is the code that I am using for the analysis. The counts matrix can be downloaded from here.
library(DESeq2) library(ggplot2) annot <- data.frame(stringsAsFactors=FALSE, Barcode = c("RNAA001", "RNAA009", "RNAA008", "RNAA011", "RNAA010", "RNAA014", "RNAA013", "RNAA004", "RNAA006", "RNAA005", "RNAA012", "RNAA007", "RNAA018", "RNAA019", "RNAA016", "RNAA022", "RNAA020", "RNAA021"), Condition = c("Time6_Positive", "Time2_Positive", "Time2_Negative", "Time4_Positive", "Time4_Negative", "Time6_Positive", "Time6_Negative", "Time2_Negative", "Time4_Negative", "Time2_Positive", "Time6_Negative", "Time4_Positive", "Time2_Positive", "Time4_Negative", "Time2_Negative", "Time6_Positive", "Time4_Positive", "Time6_Negative") ) # https://www.dropbox.com/s/gdpubmq0gcp5d3p/counts_mat.RData?dl=0 load("~/counts_mat.RData") dds <- DESeqDataSetFromMatrix(counts_mat[, c("Gene", annot$Barcode)], colData=annot, design=~Condition, tidy=T) dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) plotDispEsts(dds) dds <- nbinomWaldTest(dds) contrast_time2 <- results(dds, contrast=c("Condition","Time2_Positive","Time2_Negative"), tidy=T) contrast_time4 <- results(dds, contrast=c("Condition","Time4_Positive","Time4_Negative"), tidy=T) contrast_time6 <- results(dds, contrast=c("Condition","Time6_Positive","Time6_Negative"), tidy=T) ggplot(contrast_time2, aes(x=log10(baseMean), y=log2FoldChange, colour=(padj<0.01))) + geom_point(size=0.2) + theme_bw() ggplot(contrast_time4, aes(x=log10(baseMean), y=log2FoldChange, colour=(padj<0.01))) + geom_point(size=0.2) + theme_bw() ggplot(contrast_time6, aes(x=log10(baseMean), y=log2FoldChange, colour=(padj<0.01))) + geom_point(size=0.2) + theme_bw()
Thank you very much in advance!
> sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] ggplot2_3.1.0 DESeq2_1.20.0 SummarizedExperiment_1.10.1 [4] DelayedArray_0.6.6 BiocParallel_1.14.2 matrixStats_0.54.0 [7] Biobase_2.40.0 GenomicRanges_1.32.7 GenomeInfoDb_1.16.0 [10] IRanges_2.14.12 S4Vectors_0.18.3 BiocGenerics_0.26.0 loaded via a namespace (and not attached): [1] bit64_0.9-7 splines_3.5.0 Formula_1.2-3 assertthat_0.2.0 [5] latticeExtra_0.6-28 blob_1.1.1 GenomeInfoDbData_1.1.0 yaml_2.2.0 [9] pillar_1.3.0 RSQLite_2.1.1 backports_1.1.2 lattice_0.20-35 [13] glue_1.3.0 digest_0.6.18 RColorBrewer_1.1-2 XVector_0.20.0 [17] checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6 Matrix_1.2-14 [21] plyr_1.8.4 XML_3.98-1.16 pkgconfig_2.0.2 genefilter_1.62.0 [25] zlibbioc_1.26.0 purrr_0.2.5 xtable_1.8-3 scales_1.0.0 [29] htmlTable_1.12 tibble_1.4.2 annotate_1.58.0 withr_2.1.2 [33] nnet_7.3-12 lazyeval_0.2.1 survival_2.43-1 magrittr_1.5 [37] crayon_1.3.4 memoise_1.1.0 foreign_0.8-71 tools_3.5.0 [41] data.table_1.11.8 stringr_1.3.1 locfit_1.5-9.1 munsell_0.5.0 [45] cluster_2.0.7-1 AnnotationDbi_1.42.1 bindrcpp_0.2.2 compiler_3.5.0 [49] rlang_0.3.0.1 grid_3.5.0 RCurl_1.95-4.11 rstudioapi_0.8 [53] htmlwidgets_1.3 labeling_0.3 bitops_1.0-6 base64enc_0.1-3 [57] gtable_0.2.0 DBI_1.0.0 R6_2.3.0 gridExtra_2.3 [61] knitr_1.20 dplyr_0.7.7 bit_1.1-14 bindr_0.1.1 [65] Hmisc_4.1-1 stringi_1.2.4 Rcpp_0.12.19 geneplotter_1.58.0 [69] rpart_4.1-13 acepack_1.4.1 tidyselect_0.2.5
You should at least provide the MA plot and some data to reproduce what you've stated in the question. The DESeq2 object saved in a rdata file would be an option.
Please provide all of your R code. This helps figure things out a lot and is part of the posting guide.
Thank you very much for your comment. I have now added in the R code and the session info.