Question

DESeq2 MA plot normalisation Issue

0

Entering edit mode

ma619 • 0

@ma619-18091

Last seen 5.5 years ago

Hello All,

We recently ran an RNAseq on 6 samples, half of which were treated with a drug and the other half served as a control. The reads were mapped to the reference genome with STAR and then we did a differential expression analysis with DESeq2. The dispersion plot looks good, very similar to the example reported in the documentation. However, when we create an MA plot, we find that the distribution of log2 fold changes has a baseMean-dependent bias: low expressed genes tend to have positive log fold change, whereas high expression genes tend to have negative fold changes.

Consequently, the genes reported as significant by DESeq2 also display the same non-linear bias.

We were wondering what the source of the error could be and if there are normalisation strategies to correct it?

Below is the code that I am using for the analysis. The counts matrix can be downloaded from here.

library(DESeq2)
library(ggplot2)

annot <- data.frame(stringsAsFactors=FALSE,
                    Barcode = c("RNAA001", "RNAA009", "RNAA008", "RNAA011", "RNAA010",
                                "RNAA014", "RNAA013", "RNAA004", "RNAA006", "RNAA005",
                                "RNAA012", "RNAA007", "RNAA018", "RNAA019", "RNAA016", "RNAA022",
                                "RNAA020", "RNAA021"),
                    Condition = c("Time6_Positive", "Time2_Positive", "Time2_Negative",
                                  "Time4_Positive", "Time4_Negative", "Time6_Positive",
                                  "Time6_Negative", "Time2_Negative", "Time4_Negative", "Time2_Positive",
                                  "Time6_Negative", "Time4_Positive", "Time2_Positive",
                                  "Time4_Negative", "Time2_Negative", "Time6_Positive", "Time4_Positive",
                                  "Time6_Negative")
)

# https://www.dropbox.com/s/gdpubmq0gcp5d3p/counts_mat.RData?dl=0
load("~/counts_mat.RData")

dds <- DESeqDataSetFromMatrix(counts_mat[, c("Gene", annot$Barcode)], colData=annot, design=~Condition, tidy=T)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)

plotDispEsts(dds)

dds <- nbinomWaldTest(dds)

contrast_time2 <- results(dds, contrast=c("Condition","Time2_Positive","Time2_Negative"), tidy=T)
contrast_time4 <- results(dds, contrast=c("Condition","Time4_Positive","Time4_Negative"), tidy=T)
contrast_time6 <- results(dds, contrast=c("Condition","Time6_Positive","Time6_Negative"), tidy=T)

ggplot(contrast_time2, aes(x=log10(baseMean), y=log2FoldChange, colour=(padj<0.01))) + geom_point(size=0.2) + theme_bw() 
ggplot(contrast_time4, aes(x=log10(baseMean), y=log2FoldChange, colour=(padj<0.01))) + geom_point(size=0.2) + theme_bw()
ggplot(contrast_time6, aes(x=log10(baseMean), y=log2FoldChange, colour=(padj<0.01))) + geom_point(size=0.2) + theme_bw()

Thank you very much in advance!

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggplot2_3.1.0               DESeq2_1.20.0               SummarizedExperiment_1.10.1
 [4] DelayedArray_0.6.6          BiocParallel_1.14.2         matrixStats_0.54.0         
 [7] Biobase_2.40.0              GenomicRanges_1.32.7        GenomeInfoDb_1.16.0        
[10] IRanges_2.14.12             S4Vectors_0.18.3            BiocGenerics_0.26.0        

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.5.0          Formula_1.2-3          assertthat_0.2.0      
 [5] latticeExtra_0.6-28    blob_1.1.1             GenomeInfoDbData_1.1.0 yaml_2.2.0            
 [9] pillar_1.3.0           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35       
[13] glue_1.3.0             digest_0.6.18          RColorBrewer_1.1-2     XVector_0.20.0        
[17] checkmate_1.8.5        colorspace_1.3-2       htmltools_0.3.6        Matrix_1.2-14         
[21] plyr_1.8.4             XML_3.98-1.16          pkgconfig_2.0.2        genefilter_1.62.0     
[25] zlibbioc_1.26.0        purrr_0.2.5            xtable_1.8-3           scales_1.0.0          
[29] htmlTable_1.12         tibble_1.4.2           annotate_1.58.0        withr_2.1.2           
[33] nnet_7.3-12            lazyeval_0.2.1         survival_2.43-1        magrittr_1.5          
[37] crayon_1.3.4           memoise_1.1.0          foreign_0.8-71         tools_3.5.0           
[41] data.table_1.11.8      stringr_1.3.1          locfit_1.5-9.1         munsell_0.5.0         
[45] cluster_2.0.7-1        AnnotationDbi_1.42.1   bindrcpp_0.2.2         compiler_3.5.0        
[49] rlang_0.3.0.1          grid_3.5.0             RCurl_1.95-4.11        rstudioapi_0.8        
[53] htmlwidgets_1.3        labeling_0.3           bitops_1.0-6           base64enc_0.1-3       
[57] gtable_0.2.0           DBI_1.0.0              R6_2.3.0               gridExtra_2.3         
[61] knitr_1.20             dplyr_0.7.7            bit_1.1-14             bindr_0.1.1           
[65] Hmisc_4.1-1            stringi_1.2.4          Rcpp_0.12.19           geneplotter_1.58.0    
[69] rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5

deseq2 normalization maplot rnaseq • 1.1k views

ADD COMMENT • link updated 5.5 years ago by Michael Love 41k • written 5.5 years ago by ma619 • 0

0

Entering edit mode

You should at least provide the MA plot and some data to reproduce what you've stated in the question. The DESeq2 object saved in a rdata file would be an option.

ADD REPLY • link 5.5 years ago ATpoint ★ 4.0k

0

Entering edit mode

Please provide all of your R code. This helps figure things out a lot and is part of the posting guide.

ADD REPLY • link 5.5 years ago Michael Love 41k

0

Entering edit mode

Thank you very much for your comment. I have now added in the R code and the session info.

ADD REPLY • link 5.5 years ago ma619 • 0

score 0 · Answer 1 · 2018-10-30

I see what you mean that, for the second or third comparison for example, there are more down-regulated genes to the right of baseMean 100, whereas the up-regulated genes are spread throughout.

It's possible this is biological, and I think this is supported by the fact that some very highly expressed genes are in fact on or near the y=0 line.

I found the normalization easier to judge by first running this line below, before running any DESeq2 estimation steps. It didn't change the normalization significantly but made it easier to visualize.

keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep,]