Question

DEXSeq DEXSeqDataSet count matrix duplication

0

Entering edit mode

NiKo ▴ 10

@5c805697

Last seen 8 months ago

Germany

Hi there!

I have been using a recommended pipeline to produce Differential transcript usage using DEXseq, as shown in this tutorial. However, when creating the DEXSeqDataSet object from the DRIMseq I noticed that the count matrix has more columns than the number of samples entered. So, it is duplicating the samples and counts. Please see attached figures for comparison. Is this a bug, or? What could be done to solve this? I am removing manually the extra columns and adding the colnames to the DEXSeqDataSet but I think, this should not be case, right?

Original count matrix used in the DRIMseq with 16 samples

Duplicated matrix when producing the DEXSeqDataSet object with 32 samples

Thanks for any help on this, Niko


# build a DEXSeqDataSet from the data contained in the dmDStest objecc
sample.data <- DRIMSeq::samples(d)
count.data <- round(as.matrix(counts(d)[,-c(1:2)]))

dxd <- DEXSeq::DEXSeqDataSet(countData=count.data,
                     sampleData=sample.data,
                     design=~ sample + exon + SV1:exon+SV2:exon+SV3:exon+ Stimulus:exon,
                     featureID=counts(d)$feature_id,
                     groupID=counts(d)$gene_id)

> dxd
class: DEXSeqDataSet 
dim: 17624 32 
metadata(1): version
assays(1): counts
rownames(17624): ENSG00000223972.5:BambuTx1 ENSG00000223972.5:BambuTx2 ...
  ENSG00000288701.1:ENST00000684112.1 ENSG00000288701.1:ENST00000684596.1
rowData names(4): featureID groupID exonBaseMean exonBaseVar
colnames: NULL
colData names(14): sample name ... SV3 exon

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stageR_1.24.0               pals_1.8                    DRIMSeq_1.30.0             
 [4] bambu_3.4.0                 BSgenome_1.70.0             rtracklayer_1.62.0         
 [7] BiocIO_1.12.0               Biostrings_2.70.1           XVector_0.42.0             
[10] IsoformSwitchAnalyzeR_2.2.0 pfamAnalyzeR_1.2.0          sva_3.50.0                 
[13] genefilter_1.84.0           mgcv_1.9-0                  nlme_3.1-162               
[16] satuRn_1.10.0               DEXSeq_1.48.0               RColorBrewer_1.1-3         
[19] AnnotationDbi_1.64.0        DESeq2_1.42.0               SummarizedExperiment_1.32.0
[22] GenomicRanges_1.54.0        GenomeInfoDb_1.38.0         IRanges_2.36.0             
[25] S4Vectors_0.40.1            MatrixGenerics_1.14.0       matrixStats_1.0.0          
[28] Biobase_2.62.0              BiocGenerics_0.48.0         BiocParallel_1.36.0        
[31] limma_3.58.0                lubridate_1.9.3             forcats_1.0.0              
[34] stringr_1.5.0               dplyr_1.1.3                 purrr_1.0.2                
[37] readr_2.1.4                 tidyr_1.3.0                 tibble_3.2.1               
[40] ggplot2_3.4.4               tidyverse_2.0.0             gtools_3.9.4               

loaded via a namespace (and not attached):
  [1] splines_4.3.1                 later_1.3.1                  
  [3] bitops_1.0-7                  filelock_1.0.2               
  [5] XML_3.99-0.14                 lifecycle_1.0.3              
  [7] edgeR_4.0.0                   lattice_0.21-8               
  [9] ensembldb_2.26.0              magrittr_2.0.3               
 [11] yaml_2.3.7                    httpuv_1.6.11                
 [13] mapproj_1.2.11                pbapply_1.7-2                
 [15] DBI_1.1.3                     maps_3.4.1                   
 [17] abind_1.4-5                   zlibbioc_1.48.0              
 [19] AnnotationFilter_1.26.0       RCurl_1.98-1.12              
 [21] rappdirs_0.3.3                GenomeInfoDbData_1.2.11      
 [23] annotate_1.80.0               codetools_0.2-19             
 [25] DelayedArray_0.28.0           xml2_1.3.5                   
 [27] tidyselect_1.2.0              futile.logger_1.4.3          
 [29] locfdr_1.1-8                  BiocFileCache_2.10.1         
 [31] GenomicAlignments_1.38.0      jsonlite_1.8.7               
 [33] ellipsis_0.3.2                survival_3.5-5               
 [35] tools_4.3.1                   progress_1.2.2               
 [37] Rcpp_1.0.11                   glue_1.6.2                   
 [39] gridExtra_2.3                 SparseArray_1.2.0            
 [41] withr_2.5.2                   formatR_1.14                 
 [43] BiocManager_1.30.22           fastmap_1.1.1                
 [45] boot_1.3-28.1                 fansi_1.0.5                  
 [47] digest_0.6.33                 timechange_0.2.0             
 [49] R6_2.5.1                      mime_0.12                    
 [51] colorspace_2.1-0              dichromat_2.0-0.1            
 [53] biomaRt_2.58.0                RSQLite_2.3.2                
 [55] utf8_1.2.4                    generics_0.1.3               
 [57] data.table_1.14.8             tximeta_1.20.0               
 [59] prettyunits_1.1.1             httr_1.4.7                   
 [61] S4Arrays_1.2.0                pkgconfig_2.0.3              
 [63] gtable_0.3.4                  blob_1.2.4                   
 [65] hwriter_1.3.2.1               htmltools_0.5.5              
 [67] geneplotter_1.80.0            ProtGenerics_1.34.0          
 [69] scales_1.2.1                  png_0.1-8                    
 [71] lambda.r_1.2.4                rstudioapi_0.15.0            
 [73] tzdb_0.4.0                    reshape2_1.4.4               
 [75] rjson_0.2.21                  curl_5.1.0                   
 [77] cachem_1.0.8                  BiocVersion_3.18.0           
 [79] parallel_4.3.1                restfulr_0.0.15              
 [81] pillar_1.9.0                  grid_4.3.1                   
 [83] vctrs_0.6.4                   promises_1.2.0.1             
 [85] dbplyr_2.4.0                  xtable_1.8-4                 
 [87] tximport_1.30.0               VennDiagram_1.7.3            
 [89] GenomicFeatures_1.54.0        cli_3.6.1                    
 [91] locfit_1.5-9.8                compiler_4.3.1               
 [93] futile.options_1.0.1          Rsamtools_2.18.0             
 [95] rlang_1.1.1                   crayon_1.5.2                 
 [97] plyr_1.8.9                    stringi_1.7.12               
 [99] munsell_0.5.0                 lazyeval_0.2.2               
[101] Matrix_1.6-1.1                hms_1.1.3                    
[103] bit64_4.0.5                   KEGGREST_1.42.0              
[105] statmod_1.5.0                 shiny_1.7.4.1                
[107] interactiveDisplayBase_1.40.0 AnnotationHub_3.10.0         
[109] memoise_2.0.1                 bit_4.0.5                    
[111] xgboost_1.7.5.1

DEXSeq DRIMSeq DEXSeqDataSet columnduplication • 514 views

ADD COMMENT • link 8 months ago NiKo ▴ 10

1

Entering edit mode

After reading the DEXseq vignette, I realized that this wrongly called duplication is in fact the sum of counts in the other exons. So, if a gene only has two isoforms, counts from the other isoform will be added to the other columns. That's why I thought about a duplication issue, but in fact, they are not.

From the DEXseq vignette example:

Note that the number of columns is 14, the first seven (we have seven samples) corresponding to the number of reads mapping to out exonic regions and the last seven correspond to the sum of the counts mapping to the rest of the exons from the same gene on each sample.

We can also access only the first five rows from the count belonging to the exonic regions ('this') (without showing the sum of counts from the rest of the exons from the same gene) by doing,

head( featureCounts(dxd) , 5 )

Sorry for this question,

Niko

ADD REPLY • link 8 months ago NiKo ▴ 10