I used Salmon to map RNA-Seq data against hg38 from UCSC, so quant.sf contains RefSeq transcript IDs. I now want to use tximport to aggregate to gene level. However, "TxDb.Hsapiens.UCSC.hg19.knownGene" is of little use, as it contains neither RefSeq IDs nor gene symbols (or am I wrong here)?
What I want is exactly what Mike did with the pre-constructed table "tx2gene.csv". I assume that table is for hg19, so I need the equivalent for hg38. I downloaded kgXref from UCSC and exported the columns "mRNA" and "Gene symbol" in the correct column order. I then tried to use that table as in the example code in the vignette, and I get this error:
> txi.salmon <- tximport("quant.sf", type = "salmon", tx2gene = tx2gene2_clean, reader = read_tsv)
reading in files
1
Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_integer(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
transcripts missing genes: 18604
summarizing abundance
summarizing counts
summarizing length
Error: all(names(aveLengthSampGene) == rownames(lengthMat)) is not TRUE
In addition: Warning message:
In names(aveLengthSampGene) == rownames(lengthMat) :
  longer object length is not a multiple of shorter object length
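The line "transcripts missing genes: 18604" suggests that many transcript names in quant.sf found no match in the table. A minimal sketch of a sanity check that might narrow this down, in base R: it counts exact matches, matches after stripping a trailing version suffix (in case the names in quant.sf carry one), duplicated transcript IDs, and missing or empty gene IDs. The data below are toy stand-ins and `check_tx2gene` is a hypothetical helper, not part of tximport.

```r
# Hypothetical stand-ins: in practice tx_names would be the Name column of
# quant.sf and tx2gene the two-column table exported from kgXref.
tx_names <- c("NM_000014.5", "NM_000015.2", "NR_046018.2")
tx2gene <- data.frame(
  TXNAME = c("NM_000014", "NM_000015", "NM_000015"),
  GENEID = c("A2M", "NAT2", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise how well the table covers the transcripts.
check_tx2gene <- function(tx_names, tx2gene) {
  tx_ids   <- as.character(tx2gene[[1]])  # first column: transcript IDs
  gene_ids <- as.character(tx2gene[[2]])  # second column: gene IDs
  # Drop a trailing ".<digits>" version suffix, if present.
  no_version <- sub("\\.[0-9]+$", "", tx_names)
  c(
    matched_exact      = sum(tx_names %in% tx_ids),
    matched_no_version = sum(no_version %in% tx_ids),
    duplicated_tx      = sum(duplicated(tx_ids)),
    missing_gene       = sum(is.na(gene_ids) | !nzchar(gene_ids))
  )
}

report <- check_tx2gene(tx_names, tx2gene)
print(report)
```

If the exact-match count is low but the version-stripped count is high, the `ignoreTxVersion` argument visible in the traceback may be worth a look. Non-zero counts for duplicated transcripts or missing gene IDs would point at the table itself.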
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
LC_MONETARY=German_Germany.1252  LC_NUMERIC=C
LC_TIME=German_Germany.1252

attached base packages:
stats4  parallel  stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
tximportData_1.2.0  org.Hs.eg.db_3.4.0  readr_1.0.0  tximport_1.2.0
TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0  GenomicFeatures_1.26.3
AnnotationDbi_1.36.2  Biobase_2.34.0  GenomicRanges_1.26.3
GenomeInfoDb_1.10.3  IRanges_2.8.1  S4Vectors_0.12.1
BiocGenerics_0.20.0  BiocInstaller_1.24.0

loaded via a namespace (and not attached):
Rcpp_0.12.9  XVector_0.14.0  GenomicAlignments_1.10.0  zlibbioc_1.20.0
BiocParallel_1.8.1  lattice_0.20-33  R6_2.2.0  tools_3.3.1
grid_3.3.1  SummarizedExperiment_1.4.0  DBI_0.5-1  assertthat_0.1
digest_0.6.12  tibble_1.2  Matrix_1.2-6  rtracklayer_1.34.2
bitops_1.0-6  RCurl_1.95-4.8  biomaRt_2.30.0  memoise_1.0.0
RSQLite_1.1-2  Biostrings_2.42.1  Rsamtools_1.26.1  XML_3.98-1.5
> traceback()
4: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), ch), call. = FALSE, domain = NA)
3: stopifnot(all(names(aveLengthSampGene) == rownames(lengthMat)))
2: summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance)
1: tximport("quant.sf", type = "salmon", tx2gene = tx2gene2_clean, reader = read_tsv)
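For reference, the table preparation described above (two columns taken from kgXref, transcript ID first and gene symbol second) can be sketched roughly as follows. This is only an illustration in base R under stated assumptions: the toy data frame stands in for the real kgXref export, the column names `mRNA` and `geneSymbol` are assumed, and `make_tx2gene` is a hypothetical helper. It drops rows with a missing or empty ID and keeps one row per transcript.

```r
# Toy stand-in for a kgXref export; the real table would be read from file.
# Column names mRNA and geneSymbol are assumptions for this sketch.
kgxref <- data.frame(
  mRNA       = c("NM_000014", "NM_000014", "NM_000015", "", "NM_000016"),
  geneSymbol = c("A2M", "A2M", "NAT2", "XYZ", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a clean two-column transcript-to-gene table.
make_tx2gene <- function(xref) {
  tx2gene <- data.frame(
    TXNAME = as.character(xref$mRNA),
    GENEID = as.character(xref$geneSymbol),
    stringsAsFactors = FALSE  # keep IDs as character, not factor
  )
  # Keep only rows where both IDs are present and non-empty.
  keep <- !is.na(tx2gene$TXNAME) & nzchar(tx2gene$TXNAME) &
          !is.na(tx2gene$GENEID) & nzchar(tx2gene$GENEID)
  tx2gene <- tx2gene[keep, , drop = FALSE]
  # Keep the first row for each transcript ID.
  tx2gene <- tx2gene[!duplicated(tx2gene$TXNAME), , drop = FALSE]
  rownames(tx2gene) <- NULL
  tx2gene
}

tx2gene <- make_tx2gene(kgxref)
print(tx2gene)
```

Keeping only the first row per transcript is a simplification; if one transcript ID maps to several gene symbols in the real table, that case would need a deliberate decision.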