Wrong annotation (TSS/TTS) for many genes: keytype="GENEID" with mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene)
Entering edit mode
94133 • 0
Last seen 3.7 years ago
USA, Stanford


I'm trying to annotate genes in DESeq2 results object with chrom, strand, start, end. Code works fine but many genes have incorrect TSS/TTS annotation on IGV/UCSC browser. If I convert GENEID to ENSEML the problem is solved but not all GENEIDs map, which is suboptimal. How to fix? Thanks in advance!

B_samples <- read.table(file.path(dir, "sample_info.tsv"), header = TRUE)
B_files <- file.path(dir, B_samples$name)
names(B_files) <- file.path(B_samples$sample)
txdb <- (TxDb.Mmusculus.UCSC.mm10.knownGene)
k <- keys(txdb, keytype = "GENEID")
df <- AnnotationDbi::select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]
B_txi <- tximport(B_files, type = "kallisto", tx2gene = tx2gene)
rownames(B_samples) <- B_samples$sample
B_ddsTxi <- DESeqDataSetFromTximport(B_txi, colData = B_samples, design = ~sequencer + condition)
B_ddsTxi <- DESeq(B_ddsTxi)
B_res <- results(B_ddsTxi, c("condition", "Y", "X"))

B_res$symbol <- mapIds(Mus.musculus, keys=row.names(B_res), column="SYMBOL", keytype="GENEID", multiVals="first")
B_res$chrom <- mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene, keys=row.names(B_res), column="TXCHROM", keytype="GENEID", multiVals="first")
B_res$strand <- mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene, keys=row.names(B_res), column="TXSTRAND", keytype="GENEID", multiVals="first")
B_res$start <- mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene, keys=row.names(B_res), column="TXSTART", keytype="GENEID", multiVals="first")
B_res$end <- mapIds(TxDb.Mmusculus.UCSC.mm10.knownGene, keys=row.names(B_res), column="TXEND", keytype="GENEID", multiVals="first")

> head(tx2gene)
1 uc009veu.1 100009600
2 uc033jjg.1 100009600
3 uc012fog.1 100009609
4 uc011xhj.2 100009614
5 uc007inp.3 100009664
6 uc008vqx.2    100012

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0   BiocInstaller_1.32.1                     gdtools_0.2.1                           
 [4] UpSetR_1.4.0                             gprofiler2_0.1.9                         GeneOverlap_1.18.0                      
 [7] sva_3.30.1                               genefilter_1.64.0                        mgcv_1.8-31                             
[10] nlme_3.1-148                             rafalib_1.0.0                            ggpubr_0.3.0                            
[13] ggsignif_0.6.0                           ashr_2.2-47                              gridExtra_2.3                           
[16] gtools_3.8.2                             naturalsort_0.1.3                        ggthemes_4.2.0                          
[19] viridis_0.5.1                            viridisLite_0.3.0                        RColorBrewer_1.1-2                      
[22] pheatmap_1.0.12                          circlize_0.4.10                          ComplexHeatmap_1.20.0                   
[25] reshape2_1.4.4                           scales_1.1.1                             forcats_0.5.0                           
[28] stringr_1.4.0                            purrr_0.3.4                              tidyr_1.1.0                             
[31] tibble_3.0.1                             tidyverse_1.3.0                          dplyr_1.0.0                             
[34] ggrepel_0.8.2                            eulerr_6.1.0                             gplots_3.0.3                            
[37] ggplot2_3.3.1                            VennDiagram_1.6.20                       futile.logger_1.4.3                     
[40] Mus.musculus_1.3.1                       org.Mm.eg.db_3.7.0                       GO.db_3.7.0                             
[43] OrganismDbi_1.24.0                       TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.4 GenomicFeatures_1.34.8                  
[46] AnnotationDbi_1.44.0                     limma_3.38.3                             DESeq2_1.22.2                           
[49] SummarizedExperiment_1.12.0              DelayedArray_0.8.0                       BiocParallel_1.16.6                     
[52] matrixStats_0.56.0                       Biobase_2.42.0                           GenomicRanges_1.34.0                    
[55] GenomeInfoDb_1.18.2                      IRanges_2.16.0                           S4Vectors_0.20.1                        
[58] BiocGenerics_0.28.0                      readr_1.3.1                              tximport_1.10.1                         
AnnotationDbi annotation TxDb.Mmusculus.UCSC.mm10.knownGene • 1.4k views
Entering edit mode
Last seen 8 hours ago
United States

Multiple things. First, you are using a very outdated version of R and Bioconductor. You should update. Second, if you are using Ensembl gene mappings on IGV and NCBI mappings in R, then yes there are disagreements. As you noted when you tried to map between NCBI and Ensembl. The fix is to not do that. If your data are mapped to a NCBI-annotated genome, then use NCBI IDs throughout. And vice versa for Ensembl.

Although it depends on what you mean by 'incorrect annotation on UCSC browser'. The TxDb.Mmusculus.UCSC.mm10.knownGene, is, as its name suggests, based on gene information from UCSC. So it seems rather unlikely that the data from UCSC doesn't match up to the data on the UCSC browser, given that the data source is the same. Do note that you are taking the (random) first TSS/TTS, of possibly many. Which is a convenient simplification, but it is a simplification.

Entering edit mode

Thanks for your reply. I'm using the GENEID/REFSEQ mappings on the UCSC and IGV browser and see disagreements.

Many TSS/TTSs that I get from GENEID and REFSEQ don't correspond to a TSS/TTS at all on UCSC/IGV mm10 browser (~10%), they are in the middle of a gene for example so it's not an alternative start site. If I cross-reference the GENEID/REFSEQ to ENSEMBL (below) and then pull the TSS/TTS then I get coordinates that map correctly both on UCSC/IGV mm10. The problem appears to be the annotation for GENEID/REFSEQ gene models. Seems best option is just to redo everything with ENSEMBL unless you know of a way to fix.

Thanks for your help!

B_res$ENSEMBL <- mapIds(Mus.musculus, keys=row.names(B_res), column="ENSEMBL", keytype="GENEID", multiVals="first")
B_res$start <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=B_res$ENSEMBL, column="TXSTART", keytype="GENEID", multiVals="first")
B_res$end <- mapIds(TxDb.Mmusculus.UCSC.mm10.ensGene, keys=B_res$ENSEMBL, column="TXEND", keytype="GENEID", multiVals="first")
Entering edit mode

Do you have some examples? It's hard to diagnose, based on a statement that ~10% are off.

Entering edit mode

I updated R, Bioconductor, and reinstalled all packages and now TXNAME is ENSEMBL style (ENSMUST00000115494.2) instead of UCSC (uc009veu.1) and transcripts don't match.

k <- keys(txdb, keytype = "GENEID")
df <- AnnotationDbi::select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]
B_txi <- tximport(B_files, type = "kallisto", tx2gene = tx2gene)
Error in .local(object, ...) : 
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.
              TXNAME    GENEID
1 ENSMUST00000115494.2 100009600
2 ENSMUST00000213826.1 100009600
3 ENSMUST00000044583.9 100009609
4 ENSMUST00000232823.1 100009609
5 ENSMUST00000233024.1 100009609
6 ENSMUST00000233469.1 100009609

I'm just going to redo everything with the ENSEMBL index.


Login before adding your answer.

Traffic: 633 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6