tximport - using hg38 with RefSeq (NM) and Gene Symbol
1
0
Entering edit mode
seb.boegel • 0
@sebboegel-12547
Last seen 7.7 years ago

I used salmon to map RNA-Seq data against hg38 from UCSC. In quant.sf i have the Ref-seq transcript IDs. I now want to use tximport to aggregate to gene level. However "TxDb.Hsapiens.UCSC.hg19.knownGene" is of little use, as they neither contain RefSeq Ids nor Gene Symbols (or am i wrong here)?

What i want is exactly, what Mike did with the pre-constructed table "tx2gene.csv". I assume this is hg19. So, i need that for hg38. I downloaded kgXref from UCSC and exported the columns "mRNA" and "Gene symbol" in the correct column order. I tried to use that as written in example code in the vignette. I get this error:

> txi.salmon <- tximport("quant.sf", type = "salmon", tx2gene = tx2gene2_clean, reader = read_tsv)
reading in files
1 Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)

transcripts missing genes: 18604
summarizing abundance
summarizing counts
summarizing length
Error: all(names(aveLengthSampGene) == rownames(lengthMat)) is not TRUE
In addition: Warning message:
In names(aveLengthSampGene) == rownames(lengthMat) :
  longer object length is not a multiple of shorter object length
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] tximportData_1.2.0                     
 [2] org.Hs.eg.db_3.4.0                     
 [3] readr_1.0.0                            
 [4] tximport_1.2.0                         
 [5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
 [6] GenomicFeatures_1.26.3                 
 [7] AnnotationDbi_1.36.2                   
 [8] Biobase_2.34.0                         
 [9] GenomicRanges_1.26.3                   
[10] GenomeInfoDb_1.10.3                    
[11] IRanges_2.8.1                          
[12] S4Vectors_0.12.1                       
[13] BiocGenerics_0.20.0                    
[14] BiocInstaller_1.24.0                   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9                XVector_0.14.0            
 [3] GenomicAlignments_1.10.0   zlibbioc_1.20.0           
 [5] BiocParallel_1.8.1         lattice_0.20-33           
 [7] R6_2.2.0                   tools_3.3.1               
 [9] grid_3.3.1                 SummarizedExperiment_1.4.0
[11] DBI_0.5-1                  assertthat_0.1            
[13] digest_0.6.12              tibble_1.2                
[15] Matrix_1.2-6               rtracklayer_1.34.2        
[17] bitops_1.0-6               RCurl_1.95-4.8            
[19] biomaRt_2.30.0             memoise_1.0.0             
[21] RSQLite_1.1-2              Biostrings_2.42.1         
[23] Rsamtools_1.26.1           XML_3.98-1.5
> traceback()
4: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), 
       ch), call. = FALSE, domain = NA)
3: stopifnot(all(names(aveLengthSampGene) == rownames(lengthMat)))
2: summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance)
1: tximport("quant.sf", type = "salmon", tx2gene = tx2gene2_clean, 
       reader = read_tsv)
salmon hg38 tximport tx2gene.csv • 3.0k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 2 days ago
United States

There could be some bug here, but it's not clear to me what's going on. It may have to do with the missing transcripts or with only importing a single file, although I have unit tests for this, so it should be fine.

Could you email me -- you can get my email with maintainer("tximport") -- the quant.sf file and also the tx2gene table (as rda)? I'll take a look and reply back on this thread.

save(tx2gene_clean, file="tx2gene.rda")

ADD COMMENT
0
Entering edit mode

hi Sebastian,

I figure out the bug and fixed it in the devel branch. The code wasn't expecting duplicate rows in the tx2gene table.

The easiest way for you to get around it is just to remove duplicate rows before supplying it to tximport():

tx2gene <- tx2gene[!duplicated(tx2gene[,1]),]
ADD REPLY
0
Entering edit mode

Great, that works.

Thank you very much for your great help

ADD REPLY
0
Entering edit mode

Dear Mike, when dealing with 1-to-many mappings in tx2gene table as in Sebastian's case, shouldn't the first occurrence of the tx in question also be removed? In your suggested workaround, the first tx id is kept and the rest are tossed. Could you please elaborate on the logic behind this workaround?

ADD REPLY
0
Entering edit mode

Say you have transcripts A, B, C which are linked to gene 1. But for whatever reason, when building tx2gene someone makes a tx2gene table:

A 1
A 1
B 1
C 1

The tx2gene table only needs to know that the TPMs for A, B, C should be collapsed to gene 1. The extra row being there is just a bug in the generation of the table (and tximport does not generate these tables for the user).

ADD REPLY
0
Entering edit mode

Thanks for the prompt reply, Mike. I am facing the same error as Sebastian, and I fully understand the logic behind removing duplicate rows in the provided example. However, in my case (and perhaps also in Sebastian's), the scenario does not include presence of duplicate rows, but rather mapping of the same transcript id to more than one gene id.

A 1
A 2
B 1
C 3

Using tx2gene <- tx2gene[!duplicated(tx2gene[,1]),] as you had suggested, keeps the (A,1) pair but removes (A,2). My question was whether we should treat such cases as "ambiguous" tx2gene mapping and remove both incidents.

 

ADD REPLY
1
Entering edit mode

I don't know what the best option is for this case, so you're a bit on your own as to how to clean up the tx2gene table.

I believe this kind of arrangement would not be possible from a TxDb, if the transcripts belonged to multiple genes. I think GTF and TxDb expect a transcript to have a single gene 'parent'. This is the kind of input that tximport is expecting.

ADD REPLY

Login before adding your answer.

Traffic: 848 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6