DESeq2 log2Foldchanges opposite sign for almost the same groups of comparison
1
0
Entering edit mode
@031f9601
Last seen 6 weeks ago
Germany

Hello!

I have run into a result I cannot explain with DESeq2 and would greatly appreciate an explanation!

I am running DESeq2 and for 197 samples for two groups of comparison. I am not currently interested in interaction terms. My code looks as follows

dds <- DESeqDataSetFromMatrix(counts_ent_T,design_df , ~ param1+param2)
filter<-rowSums(counts(dds)>=10)>=10
dds<-dds[filter,] 
dds<-DESeq(dds,parallel=T)
dds<-dds [which(mcols(dds)$betaConv), ]

and design_df looks as follows:

> design_df [1:6,]
           param1  param2
Sample1    bulk    bulk
Sample2    bulk    bulk
Sample3    bulk    bulk
Sample4    bulk    bulk
Sample5 outlier outlier
Sample6    bulk    bulk

where param1 has 30 "outlier"s and 167 "bulk"s, and param2 has 28 "outlier"s and 169 "bulk"s. Except for 4 samples param1 and param2 are the same.

The problem is in the results. As an example, I have looked at "Gene1":


res1<-results(dds,name="param1_outlier_vs_bulk")
res2<-results(dds,name="param2_outlier_vs_bulk")

> res1["Gene1",] 
log2 fold change (MLE): param1 outlier vs bulk 
Wald test p-value: param1 outlier vs bulk 
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE      stat      pvalue        padj
      <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Gene1   255.719       -12.7665   2.55447  -4.99771 5.80142e-07 0.000895952
> res2["Gene1",] 
log2 fold change (MLE): param2 outlier vs bulk 
Wald test p-value: param2 outlier vs bulk 
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE      stat      pvalue        padj
      <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Gene1   255.719        13.0614   2.61151   5.00145 5.69003e-07 0.000906422

As you can see, the log2FoldChange is exactly opposite, while I didn't swap factor levels in the comparisons (comparing outlier vs bulk in both params). Moreover, plotcounts look as follows:

p1<-plotCounts(dds, gene=gene1, intgroup="param1", returnData=T)
p11<-ggplot(p1, aes(x=param1, y=count)) + 
   geom_point(position=position_jitter(w=0.1,h=0)) +ggtitle("Gene1")+
  scale_y_log10(breaks=c(25,100,400))
ggsave(p11, file="plotcounts_param1_default.pdf")


p2<-plotCounts(dds, gene=gene1, intgroup="param2", returnData=T)
p22<-ggplot(p2, aes(x=param2, y=count)) + 
   geom_point(position=position_jitter(w=0.1,h=0))+ggtitle("Gene1")+
   scale_y_log10(breaks=c(25,100,400))
ggsave(p22, file="plotcounts_param2_default.pdf")

plotcounts outlier vs bulk for param1 and param2

So why are the log2FoldChanges exactly opposite? Is it really just the 2 samples with 0s that the outlier group in param1 gains?

Any help would be greatly appreciated.

Rita

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocParallel_1.24.1         purrr_0.3.4                
 [3] ggplot2_3.3.3               dplyr_1.0.4                
 [5] plyr_1.8.6                  DESeq2_1.30.1              
 [7] SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [9] MatrixGenerics_1.2.1        matrixStats_0.58.0         
[11] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
[13] IRanges_2.24.1              S4Vectors_0.28.1           
[15] BiocGenerics_0.36.0         data.table_1.14.0          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6             locfit_1.5-9.4         lattice_0.20-41       
 [4] snow_0.4-3             digest_0.6.27          assertthat_0.2.1      
 [7] utf8_1.1.4             R6_2.5.0               RSQLite_2.2.3         
[10] httr_1.4.2             pillar_1.5.0           zlibbioc_1.36.0       
[13] rlang_0.4.10           annotate_1.68.0        blob_1.2.1            
[16] Matrix_1.2-18          labeling_0.4.2         splines_4.0.2         
[19] geneplotter_1.68.0     RCurl_1.98-1.2         bit_4.0.4             
[22] munsell_0.5.0          DelayedArray_0.16.2    compiler_4.0.2        
[25] pkgconfig_2.0.3        tcltk_4.0.2            tidyselect_1.1.0      
[28] tibble_3.1.0           GenomeInfoDbData_1.2.4 XML_3.99-0.5          
[31] fansi_0.4.2            crayon_1.4.1           withr_2.4.1           
[34] bitops_1.0-6           grid_4.0.2             xtable_1.8-4          
[37] gtable_0.3.0           lifecycle_1.0.0        DBI_1.1.1             
[40] magrittr_2.0.1         scales_1.1.1           cachem_1.0.4          
[43] farver_2.1.0           XVector_0.30.0         genefilter_1.72.1     
[46] ellipsis_0.3.1         vctrs_0.3.6            generics_0.1.0        
[49] RColorBrewer_1.1-2     tools_4.0.2            bit64_4.0.5           
[52] glue_1.4.2             fastmap_1.1.0          survival_3.1-12       
[55] AnnotationDbi_1.52.0   colorspace_2.0-0       memoise_2.0.0
RNASeq DESeq • 144 views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 13 hours ago
United States

Except for 4 samples param1 and param2 are the same.

This is a highly collinear model (~param1 + param2) and I would not recommend fitting this.

I'd recommend speaking to a statistician at your institute to talk about statistical analysis plans for your dataset. If you had a single variable y and wanted to do a linear model y ~ param1 + param2, it wouldn't be recommended to approach it like this, with this distribution of these two covariates.

ADD COMMENT
0
Entering edit mode

Thank you, Michael. After your response, I found that running DESeq for each parameter separately seems to solve my problem and gives sensible results. I will just proceed with that!

ADD REPLY

Login before adding your answer.

Traffic: 203 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6