Question

>5000 significantly differentially expressed genes

0

Entering edit mode

liam.childs • 0

@liamchilds-22558

Last seen 2.8 years ago

Germany

Dear all,

In one of the RNA-Seq datasets I'm analysing, the knockdown/overexpression of a single gene is being compared to the empty vector. This comparison has been done for seven different genes. In every case, I'm getting over 5000 significantly differentially expressed genes. To me this seems like an excessive amount given that only one gene has been overexpressed/knocked down and that it's happening for all seven genes, so probably I'm doing something wrong. Does anyone have any advice/ideas on how I can find out how correct the results are or how to pinpoint where I went wrong?

The pipeline trims the reads with fastp, aligns them with STAR (I've also tried Salmon with the same result) and tests for differential expression with DESeq2.

Cheers, Liam

deseq2 • 1.4k views

ADD COMMENT • link updated 5.9 years ago by Michael Love 43k • written 5.9 years ago by liam.childs • 0

score 0 · Answer 1 · 2019-12-13

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

We can't offer help without knowing what you've actually done. Take a look at the posting guide:

http://bioconductor.org/help/support/posting-guide/

ADD COMMENT • link 5.9 years ago Michael Love 43k

0

Entering edit mode

I didn't think that the details would be so useful, but here goes: Paired reads are being trimmed with fastp (-q 30, -c), aligned with STAR (no special parameters), then quantified with featureCounts (-Q 1 --ignoreDup, -s 0). R code as follows:

counts = read_tsv(file.path(INDIR, 'featureCounts-hg38.txt'),  comment = '#', progress = FALSE) %>%
  select(-Strand, -Length) %>%
  rename_at(vars(5:ncol(.)), paths_to_id) %>%
  mutate(Start = unlist(lapply(strsplit(Start, ';'), function(x) min(as.numeric(x)))),
         End = unlist(lapply(strsplit(End, ';'), function(x) max(as.numeric(x)))))

count_matrix = counts %>%
  select(-Chr, -Start, -End) %>%
  filter(rowSums(.[,-1]) > 10) %>%
  as.data.frame() %>%
  column_to_rownames('Geneid') %>%
  as.matrix()

colnames(count_matrix)
# [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"   "11"   "12"

samples = data.frame(
  sample = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'),
  treatment = c('overexpression_control', 'overexpression_control', 'overexpression_control', 'musk', 'musk', 'musk', 'git2', 'git2', 'git2', 'gli3', 'gli3', 'gli3')
))

dds = DESeqDataSetFromMatrix(
  countData = count_matrix,
  colData = samples,
  design = '~treatment') %>%
  DESeq

results(dds, name = 'treatment_musk_vs_overexpression_control') %>%
  summary
# out of 39999 with nonzero total read count
# adjusted p-value < 0.1
# LFC > 0 (up)       : 6737, 17%
# LFC < 0 (down)     : 5271, 13%
# outliers [1]       : 0, 0%
# low counts [2]     : 16285, 41%
# (mean count < 4)

sessionInfo

R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2              vsn_3.50.0                  tximport_1.10.1             tibble_1.4.2                scales_0.5.0                rlang_0.4.0                
 [7] rjson_0.2.20                readr_1.3.1                 ggplot2_3.0.0               dplyr_0.7.6                 IHW_1.10.1                  DESeq2_1.22.2              
[13] SummarizedExperiment_1.12.0 DelayedArray_0.4.1          matrixStats_0.53.1          Biobase_2.38.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[19] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.5.0          Formula_1.2-3          assertthat_0.2.0       affy_1.56.0            latticeExtra_0.6-28    blob_1.1.1             GenomeInfoDbData_1.0.0
 [9] slam_0.1-45            yaml_2.1.19            pillar_1.3.0           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35        limma_3.34.9           glue_1.3.1            
[17] digest_0.6.15          RColorBrewer_1.1-2     XVector_0.22.0         checkmate_1.8.5        colorspace_1.3-2       preprocessCore_1.40.0  htmltools_0.3.6        Matrix_1.2-14         
[25] plyr_1.8.4             XML_3.98-1.12          pkgconfig_2.0.1        genefilter_1.60.0      zlibbioc_1.24.0        purrr_0.2.5            xtable_1.8-2           affyio_1.48.0         
[33] fdrtool_1.2.15         BiocParallel_1.12.0    htmlTable_1.12         annotate_1.56.2        withr_2.1.2            nnet_7.3-12            lazyeval_0.2.1         survival_2.42-6       
[41] magrittr_1.5           crayon_1.3.4           memoise_1.1.0          foreign_0.8-70         BiocInstaller_1.28.0   tools_3.5.0            data.table_1.11.4      hms_0.4.2             
[49] stringr_1.3.1          locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.7-1        AnnotationDbi_1.44.0   compiler_3.5.0         grid_3.5.0             RCurl_1.95-4.11       
[57] rstudioapi_0.7         htmlwidgets_1.5.1      bitops_1.0-6           base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0              R6_2.2.2               gridExtra_2.3         
[65] knitr_1.25             bit_1.1-14             bindr_0.1.1            Hmisc_4.1-1            lpsymphony_1.10.0      stringi_1.2.3          Rcpp_0.12.17           geneplotter_1.56.0    
[73] rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5       xfun_0.10

ADD REPLY • link 5.9 years ago liam.childs • 0

0

Entering edit mode

Two suggestions are: take a look at the PCA plot to get a sense for your sample distances within and across group.

Also in the paper we discuss that with sufficient samples you may find many genes where LFC is not zero but these may not be of interest. We therefore recommend use of an specified LFC threshold as an argument to results().

ADD REPLY • link 5.9 years ago Michael Love 43k

0

Entering edit mode

Thanks for the help. Until now I'd rarely used LFC as a threshold as it hadn't changed the interpretation much, so it didn't cross my mind. Setting a LFC threshold of 1 reduces the number of genes to ~600, a much more managable number.

ADD REPLY • link 5.9 years ago liam.childs • 0