Hi there!
I have been using a recommended pipeline to produce Differential transcript usage using DEXseq, as shown in this tutorial. However, when creating the DEXSeqDataSet object from the DRIMseq I noticed that the count matrix has more columns than the number of samples entered. So, it is duplicating the samples and counts. Please see attached figures for comparison. Is this a bug, or? What could be done to solve this? I am removing manually the extra columns and adding the colnames to the DEXSeqDataSet but I think, this should not be case, right?
Original count matrix used in the DRIMseq with 16 samples
Duplicated matrix when producing the DEXSeqDataSet object with 32 samples
Thanks for any help on this, Niko
# build a DEXSeqDataSet from the data contained in the dmDStest objecc
sample.data <- DRIMSeq::samples(d)
count.data <- round(as.matrix(counts(d)[,-c(1:2)]))
dxd <- DEXSeq::DEXSeqDataSet(countData=count.data,
sampleData=sample.data,
design=~ sample + exon + SV1:exon+SV2:exon+SV3:exon+ Stimulus:exon,
featureID=counts(d)$feature_id,
groupID=counts(d)$gene_id)
> dxd
class: DEXSeqDataSet
dim: 17624 32
metadata(1): version
assays(1): counts
rownames(17624): ENSG00000223972.5:BambuTx1 ENSG00000223972.5:BambuTx2 ...
ENSG00000288701.1:ENST00000684112.1 ENSG00000288701.1:ENST00000684596.1
rowData names(4): featureID groupID exonBaseMean exonBaseVar
colnames: NULL
colData names(14): sample name ... SV3 exon
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] stageR_1.24.0 pals_1.8 DRIMSeq_1.30.0
[4] bambu_3.4.0 BSgenome_1.70.0 rtracklayer_1.62.0
[7] BiocIO_1.12.0 Biostrings_2.70.1 XVector_0.42.0
[10] IsoformSwitchAnalyzeR_2.2.0 pfamAnalyzeR_1.2.0 sva_3.50.0
[13] genefilter_1.84.0 mgcv_1.9-0 nlme_3.1-162
[16] satuRn_1.10.0 DEXSeq_1.48.0 RColorBrewer_1.1-3
[19] AnnotationDbi_1.64.0 DESeq2_1.42.0 SummarizedExperiment_1.32.0
[22] GenomicRanges_1.54.0 GenomeInfoDb_1.38.0 IRanges_2.36.0
[25] S4Vectors_0.40.1 MatrixGenerics_1.14.0 matrixStats_1.0.0
[28] Biobase_2.62.0 BiocGenerics_0.48.0 BiocParallel_1.36.0
[31] limma_3.58.0 lubridate_1.9.3 forcats_1.0.0
[34] stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
[37] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[40] ggplot2_3.4.4 tidyverse_2.0.0 gtools_3.9.4
loaded via a namespace (and not attached):
[1] splines_4.3.1 later_1.3.1
[3] bitops_1.0-7 filelock_1.0.2
[5] XML_3.99-0.14 lifecycle_1.0.3
[7] edgeR_4.0.0 lattice_0.21-8
[9] ensembldb_2.26.0 magrittr_2.0.3
[11] yaml_2.3.7 httpuv_1.6.11
[13] mapproj_1.2.11 pbapply_1.7-2
[15] DBI_1.1.3 maps_3.4.1
[17] abind_1.4-5 zlibbioc_1.48.0
[19] AnnotationFilter_1.26.0 RCurl_1.98-1.12
[21] rappdirs_0.3.3 GenomeInfoDbData_1.2.11
[23] annotate_1.80.0 codetools_0.2-19
[25] DelayedArray_0.28.0 xml2_1.3.5
[27] tidyselect_1.2.0 futile.logger_1.4.3
[29] locfdr_1.1-8 BiocFileCache_2.10.1
[31] GenomicAlignments_1.38.0 jsonlite_1.8.7
[33] ellipsis_0.3.2 survival_3.5-5
[35] tools_4.3.1 progress_1.2.2
[37] Rcpp_1.0.11 glue_1.6.2
[39] gridExtra_2.3 SparseArray_1.2.0
[41] withr_2.5.2 formatR_1.14
[43] BiocManager_1.30.22 fastmap_1.1.1
[45] boot_1.3-28.1 fansi_1.0.5
[47] digest_0.6.33 timechange_0.2.0
[49] R6_2.5.1 mime_0.12
[51] colorspace_2.1-0 dichromat_2.0-0.1
[53] biomaRt_2.58.0 RSQLite_2.3.2
[55] utf8_1.2.4 generics_0.1.3
[57] data.table_1.14.8 tximeta_1.20.0
[59] prettyunits_1.1.1 httr_1.4.7
[61] S4Arrays_1.2.0 pkgconfig_2.0.3
[63] gtable_0.3.4 blob_1.2.4
[65] hwriter_1.3.2.1 htmltools_0.5.5
[67] geneplotter_1.80.0 ProtGenerics_1.34.0
[69] scales_1.2.1 png_0.1-8
[71] lambda.r_1.2.4 rstudioapi_0.15.0
[73] tzdb_0.4.0 reshape2_1.4.4
[75] rjson_0.2.21 curl_5.1.0
[77] cachem_1.0.8 BiocVersion_3.18.0
[79] parallel_4.3.1 restfulr_0.0.15
[81] pillar_1.9.0 grid_4.3.1
[83] vctrs_0.6.4 promises_1.2.0.1
[85] dbplyr_2.4.0 xtable_1.8-4
[87] tximport_1.30.0 VennDiagram_1.7.3
[89] GenomicFeatures_1.54.0 cli_3.6.1
[91] locfit_1.5-9.8 compiler_4.3.1
[93] futile.options_1.0.1 Rsamtools_2.18.0
[95] rlang_1.1.1 crayon_1.5.2
[97] plyr_1.8.9 stringi_1.7.12
[99] munsell_0.5.0 lazyeval_0.2.2
[101] Matrix_1.6-1.1 hms_1.1.3
[103] bit64_4.0.5 KEGGREST_1.42.0
[105] statmod_1.5.0 shiny_1.7.4.1
[107] interactiveDisplayBase_1.40.0 AnnotationHub_3.10.0
[109] memoise_2.0.1 bit_4.0.5
[111] xgboost_1.7.5.1
After reading the DEXseq vignette, I realized that this wrongly called duplication is in fact the sum of counts in the other exons. So, if a gene only has two isoforms, counts from the other isoform will be added to the other columns. That's why I thought about a duplication issue, but in fact, they are not.
From the DEXseq vignette example:
Sorry for this question,
Niko