Search
Question: tximport - using hg38 with RefSeq (NM) and Gene Symbol
0
19 months ago by
seb.boegel0 wrote:

I used salmon to map RNA-Seq data against hg38 from UCSC. In quant.sf i have the Ref-seq transcript IDs. I now want to use tximport to aggregate to gene level. However "TxDb.Hsapiens.UCSC.hg19.knownGene" is of little use, as they neither contain RefSeq Ids nor Gene Symbols (or am i wrong here)?

What i want is exactly, what Mike did with the pre-constructed table "tx2gene.csv". I assume this is hg19. So, i need that for hg38. I downloaded kgXref from UCSC and exported the columns "mRNA" and "Gene symbol" in the correct column order. I tried to use that as written in example code in the vignette. I get this error:

> txi.salmon <- tximport("quant.sf", type = "salmon", tx2gene = tx2gene2_clean, reader = read_tsv)
1 Parsed with column specification:
cols(
Name = col_character(),
Length = col_integer(),
EffectiveLength = col_double(),
TPM = col_double(),
)

transcripts missing genes: 18604
summarizing abundance
summarizing counts
summarizing length
Error: all(names(aveLengthSampGene) == rownames(lengthMat)) is not TRUE
In names(aveLengthSampGene) == rownames(lengthMat) :
longer object length is not a multiple of shorter object length
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] tximportData_1.2.0
[2] org.Hs.eg.db_3.4.0
[4] tximport_1.2.0
[5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
[6] GenomicFeatures_1.26.3
[7] AnnotationDbi_1.36.2
[8] Biobase_2.34.0
[9] GenomicRanges_1.26.3
[10] GenomeInfoDb_1.10.3
[11] IRanges_2.8.1
[12] S4Vectors_0.12.1
[13] BiocGenerics_0.20.0
[14] BiocInstaller_1.24.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9                XVector_0.14.0
[3] GenomicAlignments_1.10.0   zlibbioc_1.20.0
[5] BiocParallel_1.8.1         lattice_0.20-33
[7] R6_2.2.0                   tools_3.3.1
[9] grid_3.3.1                 SummarizedExperiment_1.4.0
[11] DBI_0.5-1                  assertthat_0.1
[13] digest_0.6.12              tibble_1.2
[15] Matrix_1.2-6               rtracklayer_1.34.2
[17] bitops_1.0-6               RCurl_1.95-4.8
[19] biomaRt_2.30.0             memoise_1.0.0
[21] RSQLite_1.1-2              Biostrings_2.42.1
[23] Rsamtools_1.26.1           XML_3.98-1.5
> traceback()
4: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
ch), call. = FALSE, domain = NA)
3: stopifnot(all(names(aveLengthSampGene) == rownames(lengthMat)))
2: summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance)
1: tximport("quant.sf", type = "salmon", tx2gene = tx2gene2_clean,
reader = read_tsv)
modified 19 months ago by Michael Love19k • written 19 months ago by seb.boegel0
1
19 months ago by
Michael Love19k
United States
Michael Love19k wrote:

There could be some bug here, but it's not clear to me what's going on. It may have to do with the missing transcripts or with only importing a single file, although I have unit tests for this, so it should be fine.

Could you email me -- you can get my email with maintainer("tximport") -- the quant.sf file and also the tx2gene table (as rda)? I'll take a look and reply back on this thread.

save(tx2gene_clean, file="tx2gene.rda")

hi Sebastian,

I figure out the bug and fixed it in the devel branch. The code wasn't expecting duplicate rows in the tx2gene table.

The easiest way for you to get around it is just to remove duplicate rows before supplying it to tximport():

tx2gene <- tx2gene[!duplicated(tx2gene[,1]),]

Great, that works.

Thank you very much for your great help

Dear Mike, when dealing with 1-to-many mappings in tx2gene table as in Sebastian's case, shouldn't the first occurrence of the tx in question also be removed? In your suggested workaround, the first tx id is kept and the rest are tossed. Could you please elaborate on the logic behind this workaround?

Say you have transcripts A, B, C which are linked to gene 1. But for whatever reason, when building tx2gene someone makes a tx2gene table:

A 1
A 1
B 1
C 1

The tx2gene table only needs to know that the TPMs for A, B, C should be collapsed to gene 1. The extra row being there is just a bug in the generation of the table (and tximport does not generate these tables for the user).

Thanks for the prompt reply, Mike. I am facing the same error as Sebastian, and I fully understand the logic behind removing duplicate rows in the provided example. However, in my case (and perhaps also in Sebastian's), the scenario does not include presence of duplicate rows, but rather mapping of the same transcript id to more than one gene id.

A 1
A 2
B 1
C 3

Using tx2gene <- tx2gene[!duplicated(tx2gene[,1]),] as you had suggested, keeps the (A,1) pair but removes (A,2). My question was whether we should treat such cases as "ambiguous" tx2gene mapping and remove both incidents.

1

I don't know what the best option is for this case, so you're a bit on your own as to how to clean up the tx2gene table.

I believe this kind of arrangement would not be possible from a TxDb, if the transcripts belonged to multiple genes. I think GTF and TxDb expect a transcript to have a single gene 'parent'. This is the kind of input that tximport is expecting.