Hello all, I am currently performing differential gene expression analyses across four growth conditions: "BHI", "SCFM2", "AZM", and "TOB". I have provided the output of the first 10 rows of the raw counts obtained using featurecounts below. I have also provided the metadata table I used for the analyses below:
head(DESeq2_data, 10)
Geneid BHI_1 BHI_2 SCFM2_AZT_1 SCFM2_AZT_2 SCFM2_Ctrl_1 SCFM2_Ctrl_2 SCFM2_TOB_1 SCFM2_TOB_2
1 p749tmp_000001 5625 920 3599 3409 6221 5631 2694 4047
2 p749tmp_000002 3018 511 2892 2141 4547 4677 2307 2981
3 p749tmp_000003 4335 703 2364 2113 2848 3183 1629 2329
4 p749tmp_000004 8435 1374 6506 5404 11387 11892 5430 7317
5 p749tmp_000005 3565 653 1792 1567 1888 2060 1066 1040
6 p749tmp_000006 928 144 308 223 528 490 241 346
7 p749tmp_000007 497 113 122 82 457 501 206 366
8 p749tmp_000008 3015 601 3566 3000 4061 4743 2360 2723
9 p749tmp_000009 1201 199 1647 1336 1145 1564 734 803
10 p749tmp_000010 144 18 84 69 120 131 75 79
DESeq2_metadata
condition treatment replicate
BHI_1 BHI Control 1
BHI_2 BHI Control 2
SCFM2_AZT_1 AZT Antibiotic 1
SCFM2_AZT_2 AZT Antibiotic 2
SCFM2_Ctrl_1 SCFM2 CF_Control 1
SCFM2_Ctrl_2 SCFM2 CF_Control 2
SCFM2_TOB_1 TOB Antibiotic 1
SCFM2_TOB_2 TOB Antibiotic 2
The problem I'm having is that DESeq2 was able to identify 1000+ differentially expressed genes between SCFM2 and AZT or BHI when either is used as the experimental group. If I use TOB however, the program only identifies 6 differentially expressed genes, with almost every gene having the incorrect log2-fold change difference between the groups. In contrast, the log2-fold changes calculated between SCFM2 and AZT/BHI are correct. I ran the below code to conduct my DESeq2 analyses. I have also provided the first 10 lines of of resLFC_TOB for your viewing.
DESeq2_data_raw <- read.table("P749_in_vitro_collated_counts_TOB.txt", quote = "", header = TRUE, sep = "\t", fill = TRUE)
DESeq2_data <- DESeq2_data_raw_TOB[-c(2:9)]
DESeq2_metadata <- read.csv("P749_metadata_TOB.csv", header = TRUE, sep = ",", row.names = 1)
#Ensures that any of the genes without a number of reads assigned is given a 0 value
DESeq2_data_raw[is.na(DESeq2_data)] <- 0
#Create the DESeq2 object
#Note that the design condition column must also be present in the metadata file
dds <- DESeqDataSetFromMatrix(countData=DESeq2_data, colData=DESeq2_metadata, design=~condition, tidy = TRUE)
#Performs the differential gene expression analysis and saves results as variable
dds$condition <- relevel(dds$condition, ref = "SCFM2")
dds <- DESeq(dds)
#Prints out the results from the differential gene expression analysis with an adjusted p-value cutoff of 0.05)
#You can specify the specific growth conditions for pairwise comparisons by changing the name of the conditions to compare
resLFC_TOB <- lfcShrink(dds, "condition_TOB_vs_SCFM2", type="apeglm")
summary(resLFC_TOB)
baseMean log2FoldChange lfcSE pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
p749tmp_000001 4424.68881678204 -1.03551743379652e-05 0.00144265891254581 0.604891364155046 0.999998753141414
p749tmp_000002 3474.00277076498 -2.91621787180776e-06 0.00144262764636008 0.684215332051608 0.999998753141414
p749tmp_000003 2423.72403232178 -1.79873658810948e-06 0.00144263985139406 0.443897459468851 0.999998753141414
p749tmp_000004 8563.30728751548 -1.11940530760462e-05 0.00144264511696388 0.367857210518747 0.999998753141414
p749tmp_000005 1449.36988448574 -3.5100012953251e-06 0.00144266177462219 0.503292334582972 0.999998753141414
p749tmp_000006 383.162514545195 -1.69012424800925e-06 0.00144265867388831 0.764678736861375 0.999998753141414
p749tmp_000007 364.271051829585 -5.31960240925345e-07 0.00144266789298268 0.905021791357283 0.999998753141414
p749tmp_000008 3339.92200338804 1.00853097451184e-06 0.00144264600178013 0.823443774592384 0.999998753141414
p749tmp_000009 1020.72202956593 -1.19469010203148e-06 0.0014426652007553 0.790076777663143 0.999998753141414
p749tmp_000010 98.4764674025082 5.83753965583311e-07 0.00144267991247568 0.867600666081371 0.999998753141414
I have also provided the output of sessionInfo() below should it be needed:
sessionInfo( )
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8
[6] LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] EnhancedVolcano_1.4.0 ggrepel_0.9.1 hexbin_1.28.2 vsn_3.54.0
[5] pheatmap_1.0.12 apeglm_1.8.0 ggplot2_3.3.3 tximportData_1.14.0
[9] DESeq2_1.26.0 SummarizedExperiment_1.16.1 DelayedArray_0.12.3 BiocParallel_1.20.1
[13] matrixStats_0.58.0 Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1
[17] IRanges_2.20.2 S4Vectors_0.24.4 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 bit64_4.0.5 RColorBrewer_1.1-2 numDeriv_2016.8-1.1 tools_3.6.3
[6] backports_1.2.1 affyio_1.56.0 utf8_1.2.1 R6_2.5.0 rpart_4.1-15
[11] Hmisc_4.5-0 DBI_1.1.1 colorspace_2.0-1 nnet_7.3-16 withr_2.4.2
[16] tidyselect_1.1.1 gridExtra_2.3 preprocessCore_1.48.0 bit_4.0.4 compiler_3.6.3
[21] htmlTable_2.1.0 labeling_0.4.2 scales_1.1.1 checkmate_2.0.0 mvtnorm_1.1-1
[26] affy_1.64.0 genefilter_1.68.0 stringr_1.4.0 digest_0.6.27 foreign_0.8-75
[31] XVector_0.26.0 base64enc_0.1-3 jpeg_0.1-8.1 pkgconfig_2.0.3 htmltools_0.5.1.1
[36] limma_3.42.2 fastmap_1.1.0 bbmle_1.0.23.1 htmlwidgets_1.5.3 rlang_0.4.11
[41] rstudioapi_0.13 RSQLite_2.2.7 farver_2.1.0 generics_0.1.0 dplyr_1.0.6
[46] RCurl_1.98-1.3 magrittr_2.0.1 GenomeInfoDbData_1.2.2 Formula_1.2-4 Matrix_1.3-3
[51] Rcpp_1.0.6 munsell_0.5.0 fansi_0.4.2 lifecycle_1.0.0 stringi_1.6.1
[56] MASS_7.3-54 zlibbioc_1.32.0 plyr_1.8.6 grid_3.6.3 blob_1.2.1
[61] bdsmatrix_1.3-4 crayon_1.4.1 lattice_0.20-44 splines_3.6.3 annotate_1.64.0
[66] locfit_1.5-9.4 knitr_1.33 pillar_1.6.0 geneplotter_1.64.0 XML_3.99-0.3
[71] glue_1.4.2 latticeExtra_0.6-29 BiocManager_1.30.15 data.table_1.14.0 png_0.1-7
[76] vctrs_0.3.8 gtable_0.3.0 purrr_0.3.4 assertthat_0.2.1 cachem_1.0.4
[81] emdbook_1.3.12 xfun_0.22 xtable_1.8-4 coda_0.19-4 survival_3.2-11
[86] tibble_3.1.1 AnnotationDbi_1.48.0 memoise_2.0.0 cluster_2.1.2 ellipsis_0.3.2
The fact that the calculation error is happening only between SCFM2 and TOB growth conditions leads me to think that DESeq2 isn't calling the two sets of data correctly. I also tried running DESeq2 with only the columns of SCFM2_Ctrl and SCFM2_TOB and I still encounter the mistake. Any help on this issue would be much appreciated.
Hello Michael, Thanks for your suggestion. I specified the contrast with the following command:
Here is the output of the first 10 lines in resTOB:
and the output of resSCFM2 for comparison:
The log2foldchanges have changed between TOB vs SCFM2 after using the contrast function to compare TOB vs SCFM2 gene expression. However, the number of differentially expressed genes remains unchanged, and the adjusted p-values for all but the differentially expressed genes is calculated to be 1. I would understand if it's true that many of the genes are not differentially expressed between the two conditions, but I'm skeptical about the adjusted p-values being set to 1 for almost every gene between TOB and SCFM2...
Addendum: I'm not sure how much of the changes in the output are due to me not shrinking the effect sizes as I did after using the relevel function. I'm also not sure how correct the log2-fold changes in the other growth conditions are anymore.
This sounds correct. It is expected from the BH procedure that the
padj
for the genes with p-values from the uniform component of the p-value distribution would be 1.You can examine
plotCounts
for a particular gene if you want to get a sense why the fold change for a particularcontrast
is in the direction it is in.Sounds like things are resolved then?
Hmmm if the results are expected, then I suppose so. I'll come back again if I have further questions. Thanks for all your help!