Question

txImport: a gene seems missing from the sample data

0

Entering edit mode

alexlok • 0

@alexlok-12509

Last seen 7.2 years ago

Dear all,

txImport requires a txdb file with correspondences between transcript names and gene names. In the sample data provided by txImportData, there is a pre-constructed table.

I was interested in reconstructing it from the sequence files, but I ended up with a discrepancy: a gene (3 transcripts) seem to be present in the files, but not the table (see code below).

So, my questions are:

1/ is this discrepancy a problem with the data or with my reasoning? Is it safe for me to construct the txdb table by reading all files and finding associated gene names?

2/ if it's a problem with the data, will it affect downstream analysis? Should I report it to tximportData developers and how?

Sorry if this seem like basic questions, I'm pretty new to this field.

# as per the vignette
library(tximportData)
dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")
names(files) <- paste0("sample", 1:6)
tx2gene <- read.csv(file.path(dir, "tx2gene.csv"))

imported <- sort(as.character(tx2gene$TXNAME))

# by manually reading the samples
comb <- factor()
for(f in files){
  aa <- read.table(f, header = TRUE)$target_id
  comb <- c(comb, levels(aa))
}
manually <- unique(sort(comb))

# comparing
length(imported)
#48006
length(manually)
#48009
all.equal(imported, manually)
# 9950 mismatches

tests <- which(imported != manually)
tests[1]
#38057
imported[38055:38061]
# "NR_001525_1" "NR_001525_2" "NR_001527"   "NR_001527_1" "NR_001530"   "NR_001530_1"
# "NR_001533"
manually[38055:38061]
# "NR_001525_1" "NR_001525_2" "NR_001526"   "NR_001526_1" "NR_001526_2" "NR_001527"  
# "NR_001527_1"



#tximportData version 1.2.0
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8       
 [4] LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.3                 
 [3] AnnotationDbi_1.36.2                    tximportData_1.2.0                     
 [5] tximport_1.2.0                          BiocInstaller_1.24.0                   
 [7] DESeq2_1.14.1                           SummarizedExperiment_1.4.0             
 [9] Biobase_2.34.0                          GenomicRanges_1.26.3                   
[11] GenomeInfoDb_1.10.3                     IRanges_2.8.1                          
[13] S4Vectors_0.12.1                        BiocGenerics_0.20.0                    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9              locfit_1.5-9.1           lattice_0.20-34         
 [4] Rsamtools_1.26.1         Biostrings_2.42.1        assertthat_0.1          
 [7] digest_0.6.12            plyr_1.8.4               backports_1.0.5         
[10] acepack_1.4.1            RSQLite_1.1-2            ggplot2_2.2.1           
[13] zlibbioc_1.20.0          lazyeval_0.2.0           data.table_1.10.4       
[16] annotate_1.52.1          rpart_4.1-10             Matrix_1.2-7.1          
[19] checkmate_1.8.2          splines_3.3.2            BiocParallel_1.8.1      
[22] geneplotter_1.52.0       stringr_1.2.0            foreign_0.8-67          
[25] htmlwidgets_0.8          RCurl_1.95-4.8           biomaRt_2.30.0          
[28] munsell_0.4.3            rtracklayer_1.34.2       base64enc_0.1-3         
[31] htmltools_0.3.5          nnet_7.3-12              tibble_1.2              
[34] gridExtra_2.2.1          htmlTable_1.9            Hmisc_4.0-2             
[37] XML_3.98-1.5             GenomicAlignments_1.10.0 bitops_1.0-6            
[40] grid_3.3.2               xtable_1.8-2             gtable_0.2.0            
[43] DBI_0.5-1                magrittr_1.5             scales_0.4.1            
[46] stringi_1.1.2            XVector_0.14.0           genefilter_1.56.0       
[49] latticeExtra_0.6-28      Formula_1.2-1            RColorBrewer_1.1-2      
[52] tools_3.3.2              survival_2.40-1          colorspace_1.3-2        
[55] cluster_2.0.5            memoise_1.0.0            knitr_1.15.1

txImportData txImport • 2.7k views

ADD COMMENT • link updated 3.3 years ago by James W. MacDonald 65k • written 7.2 years ago by alexlok • 0

0

Entering edit mode

For my question 1, I didn't notice the function makeTxDbFromGFF() from GenomicFeatures, that indeed provides another way to generate the tx2gene list.

So I'm just left with the second question: out of curiosity, is it going to be a problem, and should I report it someplace?

ADD REPLY • link 7.2 years ago alexlok • 0

score 3 · Accepted Answer · 2017-03-06

3

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

hi Alex,

The tx2gene.csv file in tximportData is only for the specific quantification files in tximportData, but is not appropriate in general. I couldn't tell from your post, but are you interested in using tximport on your own data?

If so, you should generate your own tx2gene table for your specific transcriptome. So then, you'd have to know which transcriptome you quantified against. You would then want to go find a GTF file for that transcriptome, and you could use some Bioconductor functions to generate an appropriate tx2gene table.

If you are just interested in the data in tximportData, I believe the gene with three transcripts had an ID of "" (empty string) from the GTF file, which caused problems at one or more places throughout the pipeline, and so these transcripts were removed from the tx2gene table.

ADD COMMENT • link 7.2 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Indeed I'm generating my own tx2gene table. My question was really just about the tximportData dataset, and you have perfectly answered it.

Thank you.

ADD REPLY • link 7.2 years ago alexlok • 0

0

Entering edit mode

Hello, I am also new and I am also learning. If I create my own table tx2gene. Will these missing transcripts in tx2gene cause problems calculating offset? I am using the recommendation for edgeR in https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#Downstream_DGE_in_Bioconductor