DESeq2: Extremely large log2 expression values with little or no read evidence
1
0
Entering edit mode
wlorenz • 0
@wlorenz-14522
Last seen 3.0 years ago
United States

I've tried to find if anyone else has seen this, but none describe what I'm seeing.  I have a pretty simple/small data set: 6 groups, 3 reps/group (2 ctrl and 4 exptl).  Read data were mapped to a Trinity assembly w/bt2 and the expected counts generated with RSEM (gene_results) were used to generate the matrix. 

I like to bring over the normalized counts and group average fpkms (calc from RSEM fpkms) to use for downstream filtering/decisions as to which genes to pursue.  In this dataset there are usually one or two with high log2 and low FDR, but with no or little read evidence. I'd like to know why this is happening and how to prevent it.  Pehaps there is something to add to my code that will prevent it?  I have also run the analysis with and without pre-filtering the dds and i see the same thing.  I have also checked (several times) import of the norm'd read and avg fpkm values, so no mistake there.

Finally, exploratory analysis of this dataset shows that there is a HUGE amount of variance and PCA looks terrible... so there's that. 

If I'm missing something or if anyone has ideas or suggestions as to how to mitigate this, I'd love to know. 

My core DESeq2 code is shown below the these outputs.

gene_id baseMean log2FoldChange lfcSE stat pvalue padj CTL_FPKM_AVG PA_FPKM_AVG CTL1 CTL2 CTL3 PA1 PA2 PA3
TRINITY_DN79696_c1_g4 43.9 28.7 5.3 5.4 5.31E-008 2.08E-004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TRINITY_DN78128_c3_g1 260.8 25.8 4.1 6.3 3.47E-010 1.92E-006 30.7 0.0 673.2 643.0 886.2 0.0 0.0 0.0
gene_id baseMean log2FoldChange lfcSE stat pvalue padj CTL_FPKM_AVG KS_FPKM_AVG CTL1 CTL2 CTL3 KS1 KS2 KS3
TRINITY_DN71439_c1_g5 32.6 29.0 5.3 5.5 3.86E-008 1.10E-004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TRINITY_DN68768_c2_g1 19.7 26.8 5.3 5.1 3.69E-007 7.17E-004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TRINITY_DN85457_c12_g6 22.3 25.0 5.3 4.8 2.02E-006 2.62E-003 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TRINITY_DN69591_c4_g1 137.2 24.5 5.1 4.8 1.75E-006 2.41E-003 30.5 0.0 468.2 0.0 578.2 0.0 0.0 0.0

DESeq2 code is as follows:

COUNTS <- as.matrix(read.table(file="count_matrix.txt", sep="\t", header=TRUE, row.names=1)) 
COUNTS_rd<-round(COUNTS)
mode(COUNTS_rd) <- "integer"
META <- read.table(file ="metadata.txt", sep="\t", header=TRUE, row.names=1) 

META

sample_id Comparison
CTL1 CTL
CTL2 CTL
CTL3 CTL
CTLAG1 CTLAG
CTLAG2 CTLAG
CTLAG3 CTLAG
PA1 PA
PA2 PA
PA3 PA
PAAG1 PAAG
PAAG2 PAAG
PAAG3 PAAG
KS1 KS
KS2 KS
KS3 KS
KSAG1 KSAG
KSAG2 KSAG
KSAG3 KSAG

dds <- DESeqDataSetFromMatrix(countData=COUNTS_rd, colData=META, design = ~Comparison)

nrow(dds) #rows = 136177

idx <- rowSums( counts(dds, normalized=TRUE) >= 5 ) >= 3  # prefiltering

dds <- dds[idx,]

nrow(dds)  ##rows = 74283

dds <- DESeq(dds)
resultsNames(dds)

counts_norm <- counts(dds,normalized=TRUE)

write.table(counts_norm,file="normed_counts.txt",row.names=TRUE,col.names=TRUE,quote=FALSE,sep="\t")

res1 <- results(dds, contrast=c("Comparison","CTL", "PA")) # first one of four comparisons.

 

Not sure if it's needed, but here's my session info:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] vsn_3.44.0                 ggplot2_2.2.1             
 [3] dplyr_0.7.2                amap_0.8-14               
 [5] ape_4.1                    genefilter_1.58.1         
 [7] PoiClaClu_1.0.2            RColorBrewer_1.1-2        
 [9] pheatmap_1.0.8             DESeq2_1.16.1             
[11] SummarizedExperiment_1.6.3 DelayedArray_0.2.7        
[13] matrixStats_0.52.2         Biobase_2.36.2            
[15] GenomicRanges_1.28.4       GenomeInfoDb_1.12.2       
[17] IRanges_2.10.2             S4Vectors_0.14.3          
[19] BiocGenerics_0.22.0       

loaded via a namespace (and not attached):
 [1] bit64_0.9-7             splines_3.4.2           Formula_1.2-2          
 [4] assertthat_0.2.0        affy_1.54.0             latticeExtra_0.6-28    
 [7] blob_1.1.0              GenomeInfoDbData_0.99.0 RSQLite_2.0            
[10] backports_1.1.0         lattice_0.20-35         limma_3.32.3           
[13] glue_1.1.1              digest_0.6.12           XVector_0.16.0         
[16] checkmate_1.8.3         colorspace_1.3-2        preprocessCore_1.38.1  
[19] htmltools_0.3.6         Matrix_1.2-10           plyr_1.8.4             
[22] pkgconfig_2.0.1         XML_3.98-1.9            zlibbioc_1.22.0        
[25] xtable_1.8-2            scales_0.4.1            affyio_1.46.0          
[28] BiocParallel_1.10.1     htmlTable_1.9           tibble_1.3.3           
[31] annotate_1.54.0         nnet_7.3-12             lazyeval_0.2.0         
[34] survival_2.41-3         magrittr_1.5            memoise_1.1.0          
[37] nlme_3.1-131            foreign_0.8-69          BiocInstaller_1.26.1   
[40] tools_3.4.2             data.table_1.10.4       stringr_1.2.0          
[43] munsell_0.4.3           locfit_1.5-9.1          cluster_2.0.6          
[46] AnnotationDbi_1.38.1    bindrcpp_0.2            compiler_3.4.2         
[49] rlang_0.1.1             grid_3.4.2              RCurl_1.95-4.8         
[52] htmlwidgets_0.9         labeling_0.3            bitops_1.0-6           
[55] base64enc_0.1-3         gtable_0.2.0            DBI_0.7                
[58] R6_2.2.2                gridExtra_2.2.1         knitr_1.16             
[61] bit_1.1-12              bindr_0.1               Hmisc_4.0-3            
[64] stringi_1.1.5           Rcpp_0.12.12            geneplotter_1.54.0     
[67] rpart_4.1-11            acepack_1.4.1          

 

 

deseq2 • 845 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

Can you send me the 'dds' to maintainer("DESeq2") and I'll take a look at what results table I get? I'd be interested to see if I can reproduce on my end.

ADD COMMENT
0
Entering edit mode

Thanks, Michael.  Just sent you an e-mail with DropBox link to data.

ADD REPLY
0
Entering edit mode

I took a look, and for these genes there are large counts combined with 0's, within a single group, which are enough of a problem for the negative binomial fit, and then there are not enough samples to allow for the outlier removal techniques. The bad fit is throwing off the Wald statistic.

The problem can be avoided if you use likelihood ratio tests instead here.

So you can run the following code one time, then re-use 'dds' for multiple comparisons:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds) # this can use a design=~Comparison

Then for example to contrast PA vs CTL, you would use the following code:

full <- model.matrix(~Comparison, colData(dds))
dds.pa.vs.ctl <- nbinomLRT(dds, full=full,
                           reduced=subset(full, select=-ComparisonPA))
res.pa.vs.ctl <- results(dds.pa.vs.ctl, name="ComparisonPA")

The LFC will still be large, but the p-values will be appropriate.

For comparing against CTLAG, you just re-level (you don't need to re-run dispersion estimation):

Comp2 <- dds$Comparison
Comp2 <- relevel(Comp2, "CTLAG")
full2 <- model.matrix(~Comp2)
subset(full2, select=-ComparisonKSAG) # now use this as reduced
ADD REPLY
0
Entering edit mode

Thank you for the code, I'll try it. 

So, if I understand what you're saying (maybe not), the first gene in my example has other matrix groups (not being compared) with norm count values of 0 1 300, 0 57 0 and 0 133 1 and it is these groups and their lack of sufficient replication (is this impacting/preventing proper Cook's filtering?) that force a bad fit for this particular gene in the comparison groups?  And, when I see this occur, I should handle those affected group comparisons independently/differently as in above and relevel for all others?  

 

ADD REPLY
0
Entering edit mode

The mix of zeros and large counts within groups, and the small replicate number means I would use the LRT for all comparisons in this dataset.

Also recommend LRT for similarly highly noisy datasets of similar small replicate number in the future.

ADD REPLY

Login before adding your answer.

Traffic: 811 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6