Question

DESeq2 log2Foldchanges opposite sign for almost the same groups of comparison

0

Entering edit mode

krawczyk.margarita • 0

@031f9601

Last seen 4.4 years ago

Germany

Hello!

I have run into a result I cannot explain with DESeq2 and would greatly appreciate an explanation!

I am running DESeq2 and for 197 samples for two groups of comparison. I am not currently interested in interaction terms. My code looks as follows

dds <- DESeqDataSetFromMatrix(counts_ent_T,design_df , ~ param1+param2)
filter<-rowSums(counts(dds)>=10)>=10
dds<-dds[filter,] 
dds<-DESeq(dds,parallel=T)
dds<-dds [which(mcols(dds)$betaConv), ]

and design_df looks as follows:

> design_df [1:6,]
           param1  param2
Sample1    bulk    bulk
Sample2    bulk    bulk
Sample3    bulk    bulk
Sample4    bulk    bulk
Sample5 outlier outlier
Sample6    bulk    bulk

where param1 has 30 "outlier"s and 167 "bulk"s, and param2 has 28 "outlier"s and 169 "bulk"s. Except for 4 samples param1 and param2 are the same.

The problem is in the results. As an example, I have looked at "Gene1":


res1<-results(dds,name="param1_outlier_vs_bulk")
res2<-results(dds,name="param2_outlier_vs_bulk")

> res1["Gene1",] 
log2 fold change (MLE): param1 outlier vs bulk 
Wald test p-value: param1 outlier vs bulk 
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE      stat      pvalue        padj
      <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Gene1   255.719       -12.7665   2.55447  -4.99771 5.80142e-07 0.000895952
> res2["Gene1",] 
log2 fold change (MLE): param2 outlier vs bulk 
Wald test p-value: param2 outlier vs bulk 
DataFrame with 1 row and 6 columns
       baseMean log2FoldChange     lfcSE      stat      pvalue        padj
      <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Gene1   255.719        13.0614   2.61151   5.00145 5.69003e-07 0.000906422

As you can see, the log2FoldChange is exactly opposite, while I didn't swap factor levels in the comparisons (comparing outlier vs bulk in both params). Moreover, plotcounts look as follows:

p1<-plotCounts(dds, gene=gene1, intgroup="param1", returnData=T)
p11<-ggplot(p1, aes(x=param1, y=count)) + 
   geom_point(position=position_jitter(w=0.1,h=0)) +ggtitle("Gene1")+
  scale_y_log10(breaks=c(25,100,400))
ggsave(p11, file="plotcounts_param1_default.pdf")


p2<-plotCounts(dds, gene=gene1, intgroup="param2", returnData=T)
p22<-ggplot(p2, aes(x=param2, y=count)) + 
   geom_point(position=position_jitter(w=0.1,h=0))+ggtitle("Gene1")+
   scale_y_log10(breaks=c(25,100,400))
ggsave(p22, file="plotcounts_param2_default.pdf")

plotcounts outlier vs bulk for param1 and param2

So why are the log2FoldChanges exactly opposite? Is it really just the 2 samples with 0s that the outlier group in param1 gains?

Any help would be greatly appreciated.

Rita

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocParallel_1.24.1         purrr_0.3.4                
 [3] ggplot2_3.3.3               dplyr_1.0.4                
 [5] plyr_1.8.6                  DESeq2_1.30.1              
 [7] SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [9] MatrixGenerics_1.2.1        matrixStats_0.58.0         
[11] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
[13] IRanges_2.24.1              S4Vectors_0.28.1           
[15] BiocGenerics_0.36.0         data.table_1.14.0          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6             locfit_1.5-9.4         lattice_0.20-41       
 [4] snow_0.4-3             digest_0.6.27          assertthat_0.2.1      
 [7] utf8_1.1.4             R6_2.5.0               RSQLite_2.2.3         
[10] httr_1.4.2             pillar_1.5.0           zlibbioc_1.36.0       
[13] rlang_0.4.10           annotate_1.68.0        blob_1.2.1            
[16] Matrix_1.2-18          labeling_0.4.2         splines_4.0.2         
[19] geneplotter_1.68.0     RCurl_1.98-1.2         bit_4.0.4             
[22] munsell_0.5.0          DelayedArray_0.16.2    compiler_4.0.2        
[25] pkgconfig_2.0.3        tcltk_4.0.2            tidyselect_1.1.0      
[28] tibble_3.1.0           GenomeInfoDbData_1.2.4 XML_3.99-0.5          
[31] fansi_0.4.2            crayon_1.4.1           withr_2.4.1           
[34] bitops_1.0-6           grid_4.0.2             xtable_1.8-4          
[37] gtable_0.3.0           lifecycle_1.0.0        DBI_1.1.1             
[40] magrittr_2.0.1         scales_1.1.1           cachem_1.0.4          
[43] farver_2.1.0           XVector_0.30.0         genefilter_1.72.1     
[46] ellipsis_0.3.1         vctrs_0.3.6            generics_0.1.0        
[49] RColorBrewer_1.1-2     tools_4.0.2            bit64_4.0.5           
[52] glue_1.4.2             fastmap_1.1.0          survival_3.1-12       
[55] AnnotationDbi_1.52.0   colorspace_2.0-0       memoise_2.0.0

RNASeq DESeq • 1.5k views

ADD COMMENT • link 4.4 years ago krawczyk.margarita • 0

score 1 · Answer 1 · 2021-08-26

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 3 hours ago

United States

Except for 4 samples param1 and param2 are the same.

This is a highly collinear model (~param1 + param2) and I would not recommend fitting this.

I'd recommend speaking to a statistician at your institute to talk about statistical analysis plans for your dataset. If you had a single variable y and wanted to do a linear model y ~ param1 + param2, it wouldn't be recommended to approach it like this, with this distribution of these two covariates.

ADD COMMENT • link 4.4 years ago Michael Love 43k

0

Entering edit mode

Thank you, Michael. After your response, I found that running DESeq for each parameter separately seems to solve my problem and gives sensible results. I will just proceed with that!

ADD REPLY • link 4.4 years ago krawczyk.margarita • 0