Outliers on DESEq2 Results
1
0
Entering edit mode
@Bruno Rodrigo-24702
Last seen 3.1 years ago
Brazil

I have an RNAseq dataset, where one of the genes I intend to analyze has hundreds of counts ranging from 10 to 12, with a few counts > 9000. I process this data in Deseq2 and get that the gene is differentially expressed across several samples of interest. What can justify these strange counts? Are the results reliable?

library("DESeq2")

        counts2=C
        dds <- DESeqDataSetFromMatrix(countData=counts2, 
                                      colData=expdesign_fucos, 
                                      design=~subtype, tidy = TRUE)


# Rodando a função DESeq
dds <- DESeq(dds)
#normalized.counts = counts(dds, normalized=T, replaced=TRUE)

res = results(dds, alpha = 0.01, lfcThreshold =1, 
                          contr = c("subtype","Her2","LumB"))

summary(res)
out of 15 with nonzero total read count

adjusted p-value < 0.01

LFC > 1.00 (up)    : 0, 0%

LFC < -1.00 (down) : 1, 6.7%

outliers [1]       : 0, 0%

low counts [2]     : 0, 0%

(mean count < 12)

[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

with(colData(dds), table(subtype))

> with(colData(dds), table(subtype))
subtype
Basal  Her2  LumA  LumB 
  187    82   563   215 

mcols(dds)$maxCooks

> mcols(dds)$maxCooks
 [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

The results don't seem to be correct for me as this gene (POFUT1) usually has very low values ​​(between 10-12), some samples with large counts (>9000), and looking at the counts table doesn't show significant differences between #groups.

>subset(res, res$lfcSE < 2 & res$padj < 0.01)

log2 fold change (MLE): subtype Her2 vs LumB 
Wald test p-value: subtype Her2 vs LumB 
DataFrame with 1 row and 6 columns
        baseMean log2FoldChange     lfcSE      stat      pvalue        padj
       <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>

POFUT1   12.2556       -2.15519  0.171235  -6.74621 1.51756e-11 2.27633e-10

sessionInfo( )

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EnvStats_2.4.0              genefilter_1.72.1           DESeq2_1.30.1              
 [4] SummarizedExperiment_1.20.0 Biobase_2.50.0              MatrixGenerics_1.2.1       
 [7] matrixStats_0.58.0          GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
[10] IRanges_2.24.1              S4Vectors_0.28.1            BiocGenerics_0.36.1        

loaded via a namespace (and not attached):
 [1] rgl_0.107.10           Rcpp_1.0.6             locfit_1.5-9.4        
 [4] lattice_0.20-41        digest_0.6.27          assertthat_0.2.1      
 [7] utf8_1.2.1             R6_2.5.0               RSQLite_2.2.7         
[10] httr_1.4.2             ggplot2_3.3.3          pillar_1.6.1          
[13] zlibbioc_1.36.0        rlang_0.4.11           annotate_1.68.0       
[16] blob_1.2.1             Matrix_1.3-2           splines_4.0.4         
[19] BiocParallel_1.24.1    geneplotter_1.68.0     htmlwidgets_1.5.3     
[22] RCurl_1.98-1.3         bit_4.0.4              munsell_0.5.0         
[25] DelayedArray_0.16.3    xfun_0.23              compiler_4.0.4        
[28] pkgconfig_2.0.3        htmltools_0.5.1.1      tidyselect_1.1.1      
[31] tibble_3.1.1           GenomeInfoDbData_1.2.4 XML_3.99-0.6          
[34] fansi_0.4.2            crayon_1.4.1           dplyr_1.0.6           
[37] MASS_7.3-53.1          bitops_1.0-7           grid_4.0.4            
[40] jsonlite_1.7.2         xtable_1.8-4           gtable_0.3.0          
[43] lifecycle_1.0.0        DBI_1.1.1              magrittr_2.0.1        
[46] scales_1.1.1           cachem_1.0.5           XVector_0.30.0        
[49] robustbase_0.93-8      qpcR_1.4-1             ellipsis_0.3.2        
[52] vctrs_0.3.8            generics_0.1.0         RColorBrewer_1.1-2    
[55] tools_4.0.4            bit64_4.0.5            glue_1.4.2            
[58] DEoptimR_1.0-9         purrr_0.3.4            crosstalk_1.1.1       
[61] fastmap_1.1.0          survival_3.2-10        AnnotationDbi_1.52.0  
[64] colorspace_2.0-1       minpack.lm_1.2-1       memoise_2.0.0         
[67] knitr_1.33
RNASeqR DifferentialExpression DESeq2 • 1.1k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 19 hours ago
United States

The gene probably has more and higher counts in Her2 compared to LumB. If you don't want such a gene to pop up, you can filter out these genes with something like:

keep <- rowSums(counts(dds) >= 20) >= X
dds <- dds[keep,]
# before DESeq()

Where X is a minimal number of sample to have a count greater than 20.

ADD COMMENT

Login before adding your answer.

Traffic: 698 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6