>5000 significantly differentially expressed genes
Entering edit mode
Last seen 17 months ago

Dear all,

In one of the RNA-Seq datasets I'm analysing, the knockdown/overexpression of a single gene is being compared to the empty vector. This comparison has been done for seven different genes. In every case, I'm getting over 5000 significantly differentially expressed genes. To me this seems like an excessive amount given that only one gene has been overexpressed/knocked down and that it's happening for all seven genes, so probably I'm doing something wrong. Does anyone have any advice/ideas on how I can find out how correct the results are or how to pinpoint where I went wrong?

The pipeline trims the reads with fastp, aligns them with STAR (I've also tried Salmon with the same result) and tests for differential expression with DESeq2.

Cheers, Liam

deseq2 • 780 views
Entering edit mode
Last seen 2 days ago
United States

We can't offer help without knowing what you've actually done. Take a look at the posting guide:


Entering edit mode

I didn't think that the details would be so useful, but here goes: Paired reads are being trimmed with fastp (-q 30, -c), aligned with STAR (no special parameters), then quantified with featureCounts (-Q 1 --ignoreDup, -s 0). R code as follows:

counts = read_tsv(file.path(INDIR, 'featureCounts-hg38.txt'),  comment = '#', progress = FALSE) %>%
  select(-Strand, -Length) %>%
  rename_at(vars(5:ncol(.)), paths_to_id) %>%
  mutate(Start = unlist(lapply(strsplit(Start, ';'), function(x) min(as.numeric(x)))),
         End = unlist(lapply(strsplit(End, ';'), function(x) max(as.numeric(x)))))

count_matrix = counts %>%
  select(-Chr, -Start, -End) %>%
  filter(rowSums(.[,-1]) > 10) %>%
  as.data.frame() %>%
  column_to_rownames('Geneid') %>%

# [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"   "11"   "12"

samples = data.frame(
  sample = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'),
  treatment = c('overexpression_control', 'overexpression_control', 'overexpression_control', 'musk', 'musk', 'musk', 'git2', 'git2', 'git2', 'gli3', 'gli3', 'gli3')

dds = DESeqDataSetFromMatrix(
  countData = count_matrix,
  colData = samples,
  design = '~treatment') %>%

results(dds, name = 'treatment_musk_vs_overexpression_control') %>%
# out of 39999 with nonzero total read count
# adjusted p-value < 0.1
# LFC > 0 (up)       : 6737, 17%
# LFC < 0 (down)     : 5271, 13%
# outliers [1]       : 0, 0%
# low counts [2]     : 16285, 41%
# (mean count < 4)


R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2              vsn_3.50.0                  tximport_1.10.1             tibble_1.4.2                scales_0.5.0                rlang_0.4.0                
 [7] rjson_0.2.20                readr_1.3.1                 ggplot2_3.0.0               dplyr_0.7.6                 IHW_1.10.1                  DESeq2_1.22.2              
[13] SummarizedExperiment_1.12.0 DelayedArray_0.4.1          matrixStats_0.53.1          Biobase_2.38.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[19] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.5.0          Formula_1.2-3          assertthat_0.2.0       affy_1.56.0            latticeExtra_0.6-28    blob_1.1.1             GenomeInfoDbData_1.0.0
 [9] slam_0.1-45            yaml_2.1.19            pillar_1.3.0           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35        limma_3.34.9           glue_1.3.1            
[17] digest_0.6.15          RColorBrewer_1.1-2     XVector_0.22.0         checkmate_1.8.5        colorspace_1.3-2       preprocessCore_1.40.0  htmltools_0.3.6        Matrix_1.2-14         
[25] plyr_1.8.4             XML_3.98-1.12          pkgconfig_2.0.1        genefilter_1.60.0      zlibbioc_1.24.0        purrr_0.2.5            xtable_1.8-2           affyio_1.48.0         
[33] fdrtool_1.2.15         BiocParallel_1.12.0    htmlTable_1.12         annotate_1.56.2        withr_2.1.2            nnet_7.3-12            lazyeval_0.2.1         survival_2.42-6       
[41] magrittr_1.5           crayon_1.3.4           memoise_1.1.0          foreign_0.8-70         BiocInstaller_1.28.0   tools_3.5.0            data.table_1.11.4      hms_0.4.2             
[49] stringr_1.3.1          locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.7-1        AnnotationDbi_1.44.0   compiler_3.5.0         grid_3.5.0             RCurl_1.95-4.11       
[57] rstudioapi_0.7         htmlwidgets_1.5.1      bitops_1.0-6           base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0              R6_2.2.2               gridExtra_2.3         
[65] knitr_1.25             bit_1.1-14             bindr_0.1.1            Hmisc_4.1-1            lpsymphony_1.10.0      stringi_1.2.3          Rcpp_0.12.17           geneplotter_1.56.0    
[73] rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5       xfun_0.10  
Entering edit mode

Two suggestions are: take a look at the PCA plot to get a sense for your sample distances within and across group.

Also in the paper we discuss that with sufficient samples you may find many genes where LFC is not zero but these may not be of interest. We therefore recommend use of an specified LFC threshold as an argument to results().

Entering edit mode

Thanks for the help. Until now I'd rarely used LFC as a threshold as it hadn't changed the interpretation much, so it didn't cross my mind. Setting a LFC threshold of 1 reduces the number of genes to ~600, a much more managable number.


Login before adding your answer.

Traffic: 887 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6