estimateDispersions calculation for > 500,000 exons for 2 conditions taking forever
1
0
Entering edit mode
gv ▴ 40
@gv-6516
Last seen 19 months ago
United States

I am trying to run DEXSeq package for 2 conditions, control and treatment.

This is hg19 genome. I have used python scripts to create the matrix. In the matrix I have >500,000 exons which is typical for human genome. Below is the first 20 lines of my matrix:

exon    C1A C1B C1C C2A C2B C2C
ENSG00000000003.10:001  862 817 923 495 501 484
ENSG00000000003.10:002  443 410 421 196 220 202
ENSG00000000003.10:003  396 348 372 165 197 178
ENSG00000000003.10:004  350 319 310 138 175 169
ENSG00000000003.10:005  350 332 341 142 181 179
ENSG00000000003.10:006  377 369 397 159 205 181
ENSG00000000003.10:007  362 348 399 197 219 186
ENSG00000000003.10:008  332 312 398 187 198 183
ENSG00000000003.10:009  511 473 564 293 265 236
ENSG00000000003.10:010  42  54  48  61  37  42
ENSG00000000003.10:011  346 308 343 155 157 116
ENSG00000000003.10:012  273 255 265 126 116 92
ENSG00000000003.10:013  26  19  18  20  15  10
ENSG00000000003.10:014  17  9   10  9   10  7
ENSG00000000003.10:015  0   0   0   0   1   0
ENSG00000000005.5:001   0   0   0   0   0   0
ENSG00000000005.5:002   0   0   0   0   0   0
ENSG00000000005.5:003   0   0   0   0   0   0
ENSG00000000005.5:004   0   0   0   0   0   0

Following is the code i am using :

seqdata <- read.table("FINAL_EXON_COUNTS_SS.txt",header=TRUE,sep="\t",row.names = 1)
sampleinfo <- read.table("Sample_Info.txt",header=TRUE,sep="\t",row.name=1)

dxd <- DEXSeqDataSet(seqdata, sampleinfo, design= ~ sample + exon + condition:exon,
                              featureID = gsub(pattern = "ENSG.*:","",row.names(seqdata)), 
                              groupID = gsub(pattern = ":.*","",row.names(seqdata)) )

dxd = estimateSizeFactors(dxd )
dxd = estimateDispersions( dxd )

I am stuck at estimateDispersions step. It is taking forever to finish this step. Is there a way to speed this up??

This is my sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] statmod_1.4.30              DEXSeq_1.28.3               org.Hs.eg.db_3.7.0         
 [4] AnnotationDbi_1.44.0        DESeq2_1.22.2               SummarizedExperiment_1.12.0
 [7] DelayedArray_0.8.0          BiocParallel_1.16.6         matrixStats_0.54.0         
[10] Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
[13] IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        
[16] pheatmap_1.0.12             RColorBrewer_1.1-2          edgeR_3.24.3               
[19] limma_3.38.3               

loaded via a namespace (and not attached):
 [1] httr_1.4.0             bit64_0.9-7            splines_3.5.1          Formula_1.2-3         
 [5] assertthat_0.2.1       BiocManager_1.30.4     latticeExtra_0.6-28    blob_1.1.1            
 [9] Rsamtools_1.34.1       GenomeInfoDbData_1.2.0 progress_1.2.0         pillar_1.3.1          
[13] RSQLite_2.1.1          backports_1.1.4        lattice_0.20-38        glue_1.3.1            
[17] digest_0.6.18          XVector_0.22.0         checkmate_1.9.3        colorspace_1.4-1      
[21] htmltools_0.3.6        Matrix_1.2-17          plyr_1.8.4             XML_3.98-1.19         
[25] pkgconfig_2.0.2        biomaRt_2.38.0         genefilter_1.64.0      zlibbioc_1.28.0       
[29] purrr_0.3.2            xtable_1.8-4           scales_1.0.0           htmlTable_1.13.1      
[33] tibble_2.1.1           annotate_1.60.1        ggplot2_3.1.1          nnet_7.3-12           
[37] lazyeval_0.2.2         survival_2.44-1.1      magrittr_1.5           crayon_1.3.4          
[41] memoise_1.1.0          hwriter_1.3.2          foreign_0.8-71         prettyunits_1.0.2     
[45] tools_3.5.1            data.table_1.12.2      hms_0.4.2              stringr_1.4.0         
[49] munsell_0.5.0          locfit_1.5-9.1         cluster_2.0.9          Biostrings_2.50.2     
[53] compiler_3.5.1         rlang_0.3.4            grid_3.5.1             RCurl_1.95-4.12       
[57] rstudioapi_0.10        htmlwidgets_1.3        bitops_1.0-6           base64enc_0.1-3       
[61] gtable_0.3.0           DBI_1.0.0              R6_2.4.0               gridExtra_2.3         
[65] knitr_1.22             dplyr_0.8.0.1          bit_1.1-14             Hmisc_4.2-0           
[69] stringi_1.4.3          Rcpp_1.0.1             geneplotter_1.60.0     rpart_4.1-15          
[73] acepack_1.4.1          tidyselect_0.2.5       xfun_0.6

Hope to hear from you soon.

Thanks

deseq2 dexseq • 1.2k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

hi gv,

I see some of your exons have many 0 counts. You can pre-filter these, e.g. any exon that doesn't have a count of 10 or more for some minimal number of samples has no power to detect differential expression or splicing. Maybe Alejandro or Simon can comment, but I think with 6 samples, DEXSeq is typically running in a few minutes.

ADD COMMENT
0
Entering edit mode

HI Michael, Thanks for the reply. I thought that too. But the problem I see in that approach is : What if for a particular gene which has 10 exons, exon1 does not have enough reads(threshold value >10), then this exon according to my filtering will not be present in the initial matrix. Would that be fine since I would like to use plotDexseq function and if exon 1 is missing in my initial matrix, it will also not make in the dxr object? Your thoughts on this? In the above example, ENSG00000000003.10:015 will not be present in the matrix after filtering it. Would that be fine because I would like to make some graphs downstream?

ADD REPLY
0
Entering edit mode

Hi gv,

Last time I ran an analysis with 6 samples for a human genome, it took around an hour for the whole pipeline. When you say "It is taking forever to finish this step.", how much time are you referring to?

Also, please make sure you are using the latest versions of R and Bioconductor (I could not see the your sessionInfo() in your original message).

ADD REPLY
0
Entering edit mode

Hi Alenjandro, I have edited the post with my sessionInfo(). My program did finish overnight though. It took more than 3 hours I guess. I will try to put in parallel option. Also can you tell me about removing exons which has no counts and it's effect downstream such as in making plots? Thanks a lot for your help.

ADD REPLY
1
Entering edit mode

Sure. If you remove these exons from the object, they won't appear on these plots.

However, as Mike mentioned, you could filter genes that are lowly expressed. If a gene is not expressed to begin with, it is not worth testing this gene for differential exon usage. Statistically, you won't have power to detect differential exon usage. Biologically, those genes are probably not relevant in the context of your experiment.

Alejandro

ADD REPLY
0
Entering edit mode

Sure. If you remove these exons from the object, they won't appear on these plots.

However, as Mike mentioned, you could filter genes that are lowly expressed. If a gene is not expressed to begin with, it is not worth testing this gene for differential exon usage. Statistically, you won't have power to detect differential exon usage. Biologically, those genes are probably not relevant in the context of your experiment.

Alejandro

ADD REPLY

Login before adding your answer.

Traffic: 466 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6