I used salmon to map RNA-Seq data against hg38 from UCSC. In quant.sf i have the Ref-seq transcript IDs. I now want to use tximport to aggregate to gene level. However "TxDb.Hsapiens.UCSC.hg19.knownGene" is of little use, as they neither contain RefSeq Ids nor Gene Symbols (or am i wrong here)?
What i want is exactly, what Mike did with the pre-constructed table "tx2gene.csv". I assume this is hg19. So, i need that for hg38. I downloaded kgXref from UCSC and exported the columns "mRNA" and "Gene symbol" in the correct column order. I tried to use that as written in example code in the vignette. I get this error:
> txi.salmon <- tximport("quant.sf", type = "salmon", tx2gene = tx2gene2_clean, reader = read_tsv) reading in files 1 Parsed with column specification: cols( Name = col_character(), Length = col_integer(), EffectiveLength = col_double(), TPM = col_double(), NumReads = col_double() ) transcripts missing genes: 18604 summarizing abundance summarizing counts summarizing length Error: all(names(aveLengthSampGene) == rownames(lengthMat)) is not TRUE In addition: Warning message: In names(aveLengthSampGene) == rownames(lengthMat) : longer object length is not a multiple of shorter object length
> sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200) locale: [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C [5] LC_TIME=German_Germany.1252 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] tximportData_1.2.0 [2] org.Hs.eg.db_3.4.0 [3] readr_1.0.0 [4] tximport_1.2.0 [5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0 [6] GenomicFeatures_1.26.3 [7] AnnotationDbi_1.36.2 [8] Biobase_2.34.0 [9] GenomicRanges_1.26.3 [10] GenomeInfoDb_1.10.3 [11] IRanges_2.8.1 [12] S4Vectors_0.12.1 [13] BiocGenerics_0.20.0 [14] BiocInstaller_1.24.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.9 XVector_0.14.0 [3] GenomicAlignments_1.10.0 zlibbioc_1.20.0 [5] BiocParallel_1.8.1 lattice_0.20-33 [7] R6_2.2.0 tools_3.3.1 [9] grid_3.3.1 SummarizedExperiment_1.4.0 [11] DBI_0.5-1 assertthat_0.1 [13] digest_0.6.12 tibble_1.2 [15] Matrix_1.2-6 rtracklayer_1.34.2 [17] bitops_1.0-6 RCurl_1.95-4.8 [19] biomaRt_2.30.0 memoise_1.0.0 [21] RSQLite_1.1-2 Biostrings_2.42.1 [23] Rsamtools_1.26.1 XML_3.98-1.5
> traceback() 4: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), ch), call. = FALSE, domain = NA) 3: stopifnot(all(names(aveLengthSampGene) == rownames(lengthMat))) 2: summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) 1: tximport("quant.sf", type = "salmon", tx2gene = tx2gene2_clean, reader = read_tsv)
hi Sebastian,
I figure out the bug and fixed it in the devel branch. The code wasn't expecting duplicate rows in the tx2gene table.
The easiest way for you to get around it is just to remove duplicate rows before supplying it to tximport():
Great, that works.
Thank you very much for your great help
Dear Mike, when dealing with 1-to-many mappings in tx2gene table as in Sebastian's case, shouldn't the first occurrence of the tx in question also be removed? In your suggested workaround, the first tx id is kept and the rest are tossed. Could you please elaborate on the logic behind this workaround?
Say you have transcripts A, B, C which are linked to gene 1. But for whatever reason, when building tx2gene someone makes a tx2gene table:
The tx2gene table only needs to know that the TPMs for A, B, C should be collapsed to gene 1. The extra row being there is just a bug in the generation of the table (and tximport does not generate these tables for the user).
Thanks for the prompt reply, Mike. I am facing the same error as Sebastian, and I fully understand the logic behind removing duplicate rows in the provided example. However, in my case (and perhaps also in Sebastian's), the scenario does not include presence of duplicate rows, but rather mapping of the same transcript id to more than one gene id.
A 1
A 2
B 1
C 3
Using
tx2gene <- tx2gene[!duplicated(tx2gene[,1]),]
as you had suggested, keeps the (A,1) pair but removes (A,2). My question was whether we should treat such cases as "ambiguous" tx2gene mapping and remove both incidents.I don't know what the best option is for this case, so you're a bit on your own as to how to clean up the tx2gene table.
I believe this kind of arrangement would not be possible from a TxDb, if the transcripts belonged to multiple genes. I think GTF and TxDb expect a transcript to have a single gene 'parent'. This is the kind of input that tximport is expecting.