Incorrect 2-fold change reported in DESeq2 Data
1
0
Entering edit mode
Naphtap92 ▴ 10
@935aefd5
Last seen 8 days ago
Canada

Hello all, I am currently performing differential gene expression analyses across four growth conditions: "BHI", "SCFM2", "AZM", and "TOB". I have provided the output of the first 10 rows of the raw counts obtained using featurecounts below. I have also provided the metadata table I used for the analyses below:

head(DESeq2_data, 10)  
 Geneid BHI_1 BHI_2 SCFM2_AZT_1 SCFM2_AZT_2 SCFM2_Ctrl_1 SCFM2_Ctrl_2 SCFM2_TOB_1 SCFM2_TOB_2
1  p749tmp_000001  5625   920        3599        3409         6221         5631        2694        4047
2  p749tmp_000002  3018   511        2892        2141         4547         4677        2307        2981
3  p749tmp_000003  4335   703        2364        2113         2848         3183        1629        2329
4  p749tmp_000004  8435  1374        6506        5404        11387        11892        5430        7317
5  p749tmp_000005  3565   653        1792        1567         1888         2060        1066        1040
6  p749tmp_000006   928   144         308         223          528          490         241         346
7  p749tmp_000007   497   113         122          82          457          501         206         366
8  p749tmp_000008  3015   601        3566        3000         4061         4743        2360        2723
9  p749tmp_000009  1201   199        1647        1336         1145         1564         734         803
10 p749tmp_000010   144    18          84          69          120          131          75          79

DESeq2_metadata
             condition  treatment replicate
BHI_1              BHI    Control         1
BHI_2              BHI    Control         2
SCFM2_AZT_1        AZT Antibiotic         1
SCFM2_AZT_2        AZT Antibiotic         2
SCFM2_Ctrl_1     SCFM2 CF_Control         1
SCFM2_Ctrl_2     SCFM2 CF_Control         2
SCFM2_TOB_1        TOB Antibiotic         1
SCFM2_TOB_2        TOB Antibiotic         2

The problem I'm having is that DESeq2 was able to identify 1000+ differentially expressed genes between SCFM2 and AZT or BHI when either is used as the experimental group. If I use TOB however, the program only identifies 6 differentially expressed genes, with almost every gene having the incorrect log2-fold change difference between the groups. In contrast, the log2-fold changes calculated between SCFM2 and AZT/BHI are correct. I ran the below code to conduct my DESeq2 analyses. I have also provided the first 10 lines of of resLFC_TOB for your viewing.

DESeq2_data_raw <- read.table("P749_in_vitro_collated_counts_TOB.txt", quote = "", header = TRUE, sep = "\t", fill = TRUE)
DESeq2_data <- DESeq2_data_raw_TOB[-c(2:9)]
DESeq2_metadata <- read.csv("P749_metadata_TOB.csv", header = TRUE, sep = ",", row.names = 1)
#Ensures that any of the genes without a number of reads assigned is given a 0 value
DESeq2_data_raw[is.na(DESeq2_data)] <- 0
#Create the DESeq2 object
#Note that the design condition column must also be present in the metadata file
dds <- DESeqDataSetFromMatrix(countData=DESeq2_data, colData=DESeq2_metadata, design=~condition, tidy = TRUE)
#Performs the differential gene expression analysis and saves results as variable
dds$condition <- relevel(dds$condition, ref = "SCFM2")
dds <- DESeq(dds)
#Prints out the results from the differential gene expression analysis with an adjusted p-value cutoff of 0.05)
#You can specify the specific growth conditions for pairwise comparisons by changing the name of the conditions to compare
resLFC_TOB <- lfcShrink(dds, "condition_TOB_vs_SCFM2", type="apeglm")
summary(resLFC_TOB)

                       baseMean        log2FoldChange               lfcSE            pvalue              padj
                      <numeric>             <numeric>           <numeric>         <numeric>         <numeric>
p749tmp_000001 4424.68881678204 -1.03551743379652e-05 0.00144265891254581 0.604891364155046 0.999998753141414
p749tmp_000002 3474.00277076498 -2.91621787180776e-06 0.00144262764636008 0.684215332051608 0.999998753141414
p749tmp_000003 2423.72403232178 -1.79873658810948e-06 0.00144263985139406 0.443897459468851 0.999998753141414
p749tmp_000004 8563.30728751548 -1.11940530760462e-05 0.00144264511696388 0.367857210518747 0.999998753141414
p749tmp_000005 1449.36988448574  -3.5100012953251e-06 0.00144266177462219 0.503292334582972 0.999998753141414
p749tmp_000006 383.162514545195 -1.69012424800925e-06 0.00144265867388831 0.764678736861375 0.999998753141414
p749tmp_000007 364.271051829585 -5.31960240925345e-07 0.00144266789298268 0.905021791357283 0.999998753141414
p749tmp_000008 3339.92200338804  1.00853097451184e-06 0.00144264600178013 0.823443774592384 0.999998753141414
p749tmp_000009 1020.72202956593 -1.19469010203148e-06  0.0014426652007553 0.790076777663143 0.999998753141414
p749tmp_000010 98.4764674025082  5.83753965583311e-07 0.00144267991247568 0.867600666081371 0.999998753141414

I have also provided the output of sessionInfo() below should it be needed:

sessionInfo( )

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_CA.UTF-8   
 [6] LC_MESSAGES=en_CA.UTF-8    LC_PAPER=en_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnhancedVolcano_1.4.0       ggrepel_0.9.1               hexbin_1.28.2               vsn_3.54.0                 
 [5] pheatmap_1.0.12             apeglm_1.8.0                ggplot2_3.3.3               tximportData_1.14.0        
 [9] DESeq2_1.26.0               SummarizedExperiment_1.16.1 DelayedArray_0.12.3         BiocParallel_1.20.1        
[13] matrixStats_0.58.0          Biobase_2.46.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
[17] IRanges_2.20.2              S4Vectors_0.24.4            BiocGenerics_0.32.0        

loaded via a namespace (and not attached):
 [1] bitops_1.0-7           bit64_4.0.5            RColorBrewer_1.1-2     numDeriv_2016.8-1.1    tools_3.6.3           
 [6] backports_1.2.1        affyio_1.56.0          utf8_1.2.1             R6_2.5.0               rpart_4.1-15          
[11] Hmisc_4.5-0            DBI_1.1.1              colorspace_2.0-1       nnet_7.3-16            withr_2.4.2           
[16] tidyselect_1.1.1       gridExtra_2.3          preprocessCore_1.48.0  bit_4.0.4              compiler_3.6.3        
[21] htmlTable_2.1.0        labeling_0.4.2         scales_1.1.1           checkmate_2.0.0        mvtnorm_1.1-1         
[26] affy_1.64.0            genefilter_1.68.0      stringr_1.4.0          digest_0.6.27          foreign_0.8-75        
[31] XVector_0.26.0         base64enc_0.1-3        jpeg_0.1-8.1           pkgconfig_2.0.3        htmltools_0.5.1.1     
[36] limma_3.42.2           fastmap_1.1.0          bbmle_1.0.23.1         htmlwidgets_1.5.3      rlang_0.4.11          
[41] rstudioapi_0.13        RSQLite_2.2.7          farver_2.1.0           generics_0.1.0         dplyr_1.0.6           
[46] RCurl_1.98-1.3         magrittr_2.0.1         GenomeInfoDbData_1.2.2 Formula_1.2-4          Matrix_1.3-3          
[51] Rcpp_1.0.6             munsell_0.5.0          fansi_0.4.2            lifecycle_1.0.0        stringi_1.6.1         
[56] MASS_7.3-54            zlibbioc_1.32.0        plyr_1.8.6             grid_3.6.3             blob_1.2.1            
[61] bdsmatrix_1.3-4        crayon_1.4.1           lattice_0.20-44        splines_3.6.3          annotate_1.64.0       
[66] locfit_1.5-9.4         knitr_1.33             pillar_1.6.0           geneplotter_1.64.0     XML_3.99-0.3          
[71] glue_1.4.2             latticeExtra_0.6-29    BiocManager_1.30.15    data.table_1.14.0      png_0.1-7             
[76] vctrs_0.3.8            gtable_0.3.0           purrr_0.3.4            assertthat_0.2.1       cachem_1.0.4          
[81] emdbook_1.3.12         xfun_0.22              xtable_1.8-4           coda_0.19-4            survival_3.2-11       
[86] tibble_3.1.1           AnnotationDbi_1.48.0   memoise_2.0.0          cluster_2.1.2          ellipsis_0.3.2

The fact that the calculation error is happening only between SCFM2 and TOB growth conditions leads me to think that DESeq2 isn't calling the two sets of data correctly. I also tried running DESeq2 with only the columns of SCFM2_Ctrl and SCFM2_TOB and I still encounter the mistake. Any help on this issue would be much appreciated.

DESeq2 • 121 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 15 hours ago
United States

Can you use results() with contrast to specify the comparisons you want, and then to count DE genes? My guess is that, somehow, when you relevel, you aren't pulling out the contrast you think you are.

almost every gene having the incorrect log2-fold change difference...

Whenever this has been reported, it's inevitably because there was some confusion about the LFC being reported (it's printed above the table when you print the res object).

You can just specify the contrast directly with contrast=c("variable","numerator","denominator") and that will clear things up.

ADD COMMENT
0
Entering edit mode

Hello Michael, Thanks for your suggestion. I specified the contrast with the following command:

resTOB <- results(dds, contrast=c("condition","TOB","SCFM2"))

Here is the output of the first 10 lines in resTOB:

head(resTOB, 10)
log2 fold change (MLE): condition TOB vs SCFM2 
Wald test p-value: condition TOB vs SCFM2 
DataFrame with 5806 rows and 6 columns
                       baseMean      log2FoldChange             lfcSE               stat            pvalue      padj
                      <numeric>           <numeric>         <numeric>          <numeric>         <numeric> <numeric>
p749tmp_000001 3556.54959458889 -0.0691907367277559  0.15248060598889 -0.453767456385879  0.64999620922885         1
p749tmp_000002 2526.71893649829 -0.0441158679968543 0.143240036971041 -0.307985594877871 0.758093293549083         1
p749tmp_000003 2214.58981960518   0.141975335529016 0.149653123791572  0.948696104244044 0.342775195989688         1
p749tmp_000004 6279.31738727025  -0.114150120849483  0.13463721128765 -0.847834857523919 0.396529947429579         1
p749tmp_000005 1590.94642925673  -0.123475878356198 0.169797955865861 -0.727192961343672 0.467107757427192         1
p749tmp_000006 352.957544985399 -0.0438621396333034 0.195796142900185 -0.224019426448375 0.822742187231782         1
p749tmp_000007 252.800771907051 -0.0107605082312825 0.233271326884023 -0.046128722183813 0.963207653376836         1
p749tmp_000008 2748.64871204587 -0.0236332698279363  0.15406950503963 -0.153393559756405 0.878087923354881         1
p749tmp_000009 1014.12399878704 -0.0446177514534776 0.185584979747248 -0.240416824218444 0.810007136790823         1
p749tmp_000010 78.8876501441187  0.0679939889876004 0.317210871714541  0.214349491302393 0.830274526305989         1

and the output of resSCFM2 for comparison:

head(resSCFM2, 10)
log2 fold change (MLE): condition SCFM2 vs BHI 
Wald test p-value: condition SCFM2 vs BHI 
DataFrame with 10 rows and 6 columns
                       baseMean      log2FoldChange             lfcSE               stat               pvalue                 padj
                      <numeric>           <numeric>         <numeric>          <numeric>            <numeric>            <numeric>
p749tmp_000001 3556.54959458889   0.175329192304796 0.153479692681715   1.14236085075041     0.25330405721748    0.388279729395267
p749tmp_000002 2526.71893649829   0.687476642512472 0.145700092490059   4.71843655527795 2.37664062349517e-06 1.43059921025923e-05
p749tmp_000003 2214.58981960518  -0.418701365767339 0.150536657641803  -2.78139140543179  0.00541264357951398   0.0155043897195524
p749tmp_000004 6279.31738727025   0.567662966713113 0.135566625025791   4.18733568535114 2.82248220639428e-05 0.000140680313349554
p749tmp_000005 1590.94642925673  -0.830336626284275 0.169615753377564  -4.89539803791661 9.81069492511576e-07 6.23504301618974e-06
p749tmp_000006 352.957544985399  -0.740882756594009 0.195675724098803   -3.7862783439601 0.000152920377120351 0.000650873916347488
p749tmp_000007 252.800771907051    -0.1730259906086 0.237168754783721 -0.729547999551569    0.465666517632121    0.605833333734491
p749tmp_000008 2748.64871204587   0.509049432780568 0.156086480065058   3.26132944101496  0.00110891122030771  0.00385287958238918
p749tmp_000009 1014.12399878704   0.261570122623241 0.189270202627288   1.38199314520906    0.166973805861199    0.280890514532859
p749tmp_000010 78.8876501441187 0.00894106440746986 0.329576783725194  0.027128926699294    0.978356903077632    0.987202806360431

The log2foldchanges have changed between TOB vs SCFM2 after using the contrast function to compare TOB vs SCFM2 gene expression. However, the number of differentially expressed genes remains unchanged, and the adjusted p-values for all but the differentially expressed genes is calculated to be 1. I would understand if it's true that many of the genes are not differentially expressed between the two conditions, but I'm skeptical about the adjusted p-values being set to 1 for almost every gene between TOB and SCFM2...

Addendum: I'm not sure how much of the changes in the output are due to me not shrinking the effect sizes as I did after using the relevel function. I'm also not sure how correct the log2-fold changes in the other growth conditions are anymore.

ADD REPLY
0
Entering edit mode

However, the number of differentially expressed genes remains unchanged, and the adjusted p-values for all but the differentially expressed genes is calculated to be 1...I'm skeptical about the adjusted p-values being set to 1 for almost every gene

This sounds correct. It is expected from the BH procedure that the padj for the genes with p-values from the uniform component of the p-value distribution would be 1.

I'm also not sure how correct the log2-fold changes in the other growth conditions are anymore.

You can examine plotCounts for a particular gene if you want to get a sense why the fold change for a particular contrast is in the direction it is in.

Sounds like things are resolved then?

ADD REPLY
1
Entering edit mode

Hmmm if the results are expected, then I suppose so. I'll come back again if I have further questions. Thanks for all your help!

ADD REPLY

Login before adding your answer.

Traffic: 221 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6