Dear all,
txImport requires a txdb file with correspondences between transcript names and gene names. In the sample data provided by txImportData, there is a pre-constructed table.
I was interested in reconstructing it from the sequence files, but I ended up with a discrepancy: a gene (3 transcripts) seem to be present in the files, but not the table (see code below).
So, my questions are:
1/ is this discrepancy a problem with the data or with my reasoning? Is it safe for me to construct the txdb table by reading all files and finding associated gene names?
2/ if it's a problem with the data, will it affect downstream analysis? Should I report it to tximportData developers and how?
Sorry if this seem like basic questions, I'm pretty new to this field.
# as per the vignette
library(tximportData)
dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")
names(files) <- paste0("sample", 1:6)
tx2gene <- read.csv(file.path(dir, "tx2gene.csv"))
imported <- sort(as.character(tx2gene$TXNAME))
# by manually reading the samples
comb <- factor()
for(f in files){
aa <- read.table(f, header = TRUE)$target_id
comb <- c(comb, levels(aa))
}
manually <- unique(sort(comb))
# comparing
length(imported)
#48006
length(manually)
#48009
all.equal(imported, manually)
# 9950 mismatches
tests <- which(imported != manually)
tests[1]
#38057
imported[38055:38061]
# "NR_001525_1" "NR_001525_2" "NR_001527" "NR_001527_1" "NR_001530" "NR_001530_1"
# "NR_001533"
manually[38055:38061]
# "NR_001525_1" "NR_001525_2" "NR_001526" "NR_001526_1" "NR_001526_2" "NR_001527"
# "NR_001527_1"
#tximportData version 1.2.0
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8
[4] LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.3
[3] AnnotationDbi_1.36.2 tximportData_1.2.0
[5] tximport_1.2.0 BiocInstaller_1.24.0
[7] DESeq2_1.14.1 SummarizedExperiment_1.4.0
[9] Biobase_2.34.0 GenomicRanges_1.26.3
[11] GenomeInfoDb_1.10.3 IRanges_2.8.1
[13] S4Vectors_0.12.1 BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 locfit_1.5-9.1 lattice_0.20-34
[4] Rsamtools_1.26.1 Biostrings_2.42.1 assertthat_0.1
[7] digest_0.6.12 plyr_1.8.4 backports_1.0.5
[10] acepack_1.4.1 RSQLite_1.1-2 ggplot2_2.2.1
[13] zlibbioc_1.20.0 lazyeval_0.2.0 data.table_1.10.4
[16] annotate_1.52.1 rpart_4.1-10 Matrix_1.2-7.1
[19] checkmate_1.8.2 splines_3.3.2 BiocParallel_1.8.1
[22] geneplotter_1.52.0 stringr_1.2.0 foreign_0.8-67
[25] htmlwidgets_0.8 RCurl_1.95-4.8 biomaRt_2.30.0
[28] munsell_0.4.3 rtracklayer_1.34.2 base64enc_0.1-3
[31] htmltools_0.3.5 nnet_7.3-12 tibble_1.2
[34] gridExtra_2.2.1 htmlTable_1.9 Hmisc_4.0-2
[37] XML_3.98-1.5 GenomicAlignments_1.10.0 bitops_1.0-6
[40] grid_3.3.2 xtable_1.8-2 gtable_0.2.0
[43] DBI_0.5-1 magrittr_1.5 scales_0.4.1
[46] stringi_1.1.2 XVector_0.14.0 genefilter_1.56.0
[49] latticeExtra_0.6-28 Formula_1.2-1 RColorBrewer_1.1-2
[52] tools_3.3.2 survival_2.40-1 colorspace_1.3-2
[55] cluster_2.0.5 memoise_1.0.0 knitr_1.15.1

For my question 1, I didn't notice the function
makeTxDbFromGFF()fromGenomicFeatures, that indeed provides another way to generate the tx2gene list.So I'm just left with the second question: out of curiosity, is it going to be a problem, and should I report it someplace?