>5000 significantly differentially expressed genes
1
0
Entering edit mode
@liamchilds-22558
Last seen 2.1 years ago

Dear all,

In one of the RNA-Seq datasets I'm analysing, the knockdown/overexpression of a single gene is being compared to the empty vector. This comparison has been done for seven different genes. In every case, I'm getting over 5000 significantly differentially expressed genes. To me this seems like an excessive amount given that only one gene has been overexpressed/knocked down and that it's happening for all seven genes, so probably I'm doing something wrong. Does anyone have any advice/ideas on how I can find out how correct the results are or how to pinpoint where I went wrong?

The pipeline trims the reads with fastp, aligns them with STAR (I've also tried Salmon with the same result) and tests for differential expression with DESeq2.

Cheers, Liam

deseq2 • 343 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 2 days ago
United States

We can't offer help without knowing what you've actually done. Take a look at the posting guide:

http://bioconductor.org/help/support/posting-guide/

ADD COMMENT
0
Entering edit mode

I didn't think that the details would be so useful, but here goes: Paired reads are being trimmed with fastp (-q 30, -c), aligned with STAR (no special parameters), then quantified with featureCounts (-Q 1 --ignoreDup, -s 0). R code as follows:

counts = read_tsv(file.path(INDIR, 'featureCounts-hg38.txt'),  comment = '#', progress = FALSE) %>%
  select(-Strand, -Length) %>%
  rename_at(vars(5:ncol(.)), paths_to_id) %>%
  mutate(Start = unlist(lapply(strsplit(Start, ';'), function(x) min(as.numeric(x)))),
         End = unlist(lapply(strsplit(End, ';'), function(x) max(as.numeric(x)))))

count_matrix = counts %>%
  select(-Chr, -Start, -End) %>%
  filter(rowSums(.[,-1]) > 10) %>%
  as.data.frame() %>%
  column_to_rownames('Geneid') %>%
  as.matrix()

colnames(count_matrix)
# [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"   "11"   "12"

samples = data.frame(
  sample = c('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'),
  treatment = c('overexpression_control', 'overexpression_control', 'overexpression_control', 'musk', 'musk', 'musk', 'git2', 'git2', 'git2', 'gli3', 'gli3', 'gli3')
))

dds = DESeqDataSetFromMatrix(
  countData = count_matrix,
  colData = samples,
  design = '~treatment') %>%
  DESeq

results(dds, name = 'treatment_musk_vs_overexpression_control') %>%
  summary
# out of 39999 with nonzero total read count
# adjusted p-value < 0.1
# LFC > 0 (up)       : 6737, 17%
# LFC < 0 (down)     : 5271, 13%
# outliers [1]       : 0, 0%
# low counts [2]     : 16285, 41%
# (mean count < 4)

sessionInfo

R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.6 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2              vsn_3.50.0                  tximport_1.10.1             tibble_1.4.2                scales_0.5.0                rlang_0.4.0                
 [7] rjson_0.2.20                readr_1.3.1                 ggplot2_3.0.0               dplyr_0.7.6                 IHW_1.10.1                  DESeq2_1.22.2              
[13] SummarizedExperiment_1.12.0 DelayedArray_0.4.1          matrixStats_0.53.1          Biobase_2.38.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[19] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] bit64_0.9-7            splines_3.5.0          Formula_1.2-3          assertthat_0.2.0       affy_1.56.0            latticeExtra_0.6-28    blob_1.1.1             GenomeInfoDbData_1.0.0
 [9] slam_0.1-45            yaml_2.1.19            pillar_1.3.0           RSQLite_2.1.1          backports_1.1.2        lattice_0.20-35        limma_3.34.9           glue_1.3.1            
[17] digest_0.6.15          RColorBrewer_1.1-2     XVector_0.22.0         checkmate_1.8.5        colorspace_1.3-2       preprocessCore_1.40.0  htmltools_0.3.6        Matrix_1.2-14         
[25] plyr_1.8.4             XML_3.98-1.12          pkgconfig_2.0.1        genefilter_1.60.0      zlibbioc_1.24.0        purrr_0.2.5            xtable_1.8-2           affyio_1.48.0         
[33] fdrtool_1.2.15         BiocParallel_1.12.0    htmlTable_1.12         annotate_1.56.2        withr_2.1.2            nnet_7.3-12            lazyeval_0.2.1         survival_2.42-6       
[41] magrittr_1.5           crayon_1.3.4           memoise_1.1.0          foreign_0.8-70         BiocInstaller_1.28.0   tools_3.5.0            data.table_1.11.4      hms_0.4.2             
[49] stringr_1.3.1          locfit_1.5-9.1         munsell_0.5.0          cluster_2.0.7-1        AnnotationDbi_1.44.0   compiler_3.5.0         grid_3.5.0             RCurl_1.95-4.11       
[57] rstudioapi_0.7         htmlwidgets_1.5.1      bitops_1.0-6           base64enc_0.1-3        gtable_0.2.0           DBI_1.0.0              R6_2.2.2               gridExtra_2.3         
[65] knitr_1.25             bit_1.1-14             bindr_0.1.1            Hmisc_4.1-1            lpsymphony_1.10.0      stringi_1.2.3          Rcpp_0.12.17           geneplotter_1.56.0    
[73] rpart_4.1-13           acepack_1.4.1          tidyselect_0.2.5       xfun_0.10  
ADD REPLY
0
Entering edit mode

Two suggestions are: take a look at the PCA plot to get a sense for your sample distances within and across group.

Also in the paper we discuss that with sufficient samples you may find many genes where LFC is not zero but these may not be of interest. We therefore recommend use of an specified LFC threshold as an argument to results().

ADD REPLY
0
Entering edit mode

Thanks for the help. Until now I'd rarely used LFC as a threshold as it hadn't changed the interpretation much, so it didn't cross my mind. Setting a LFC threshold of 1 reduces the number of genes to ~600, a much more managable number.

ADD REPLY

Login before adding your answer.

Traffic: 264 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6