Question

Getting CDSCHROM values from ENSEMBL/SYMBOL keys using mapIds from TxDb.Hsapiens.UCSC.hg19.knownGene

kushshah ▴ 10

@kushshah-20393

University of North Carolina, Chapel Hi…

I am new to using Bioconductor. I have a SingleCellExperiment object, sce, whose rownames are gene symbols (SYMBOL) and whose rowData contains Ensembl IDs (ENSEMBL). Using TxDb.Hsapiens.UCSC.hg19.knownGene, I wish to find the chromosomal location of each gene (so that I can control for mitochondrial genes downstream) and store these CDSCHROM values as a new column in rowData. The code I have tried looks like this:

location <- mapIds(TxDb.Hsapiens.UCSC.hg19.knownGene, keys=rowData(sce)$ENSEMBL, column="CDSCHROM", keytype=???)
rowData(sce)$CHR <- location

However, I do not understand how to fill in the keytype argument. I see that "ENSEMBL" is not a valid keytype, so how would I go about this problem?

Seeing that "GENEID" is a valid keytype, I thought about doing the following:

geneidSymbols <- mapIds(org.Hs.eg.db, keys=rownames(sce), keytype="SYMBOL", column="GENEID")
rowData(sce)$GENEID <- geneidSymbols

and then using the Gene IDs as my keys in the new code. But "GENEID" is not a valid column for org.Hs.eg.db, so that did not work either.

I would appreciate any suggestions as I am new to Bioconductor and scRNA-seq in general. Thank you!

ensembl cdschrom org.Hs.eg.db AnnotationDbi SingleCellExperiment • 2.1k views

ADD COMMENT • link updated 6.3 years ago by James W. MacDonald 68k • written 6.3 years ago by kushshah ▴ 10

You might try looking at Organism.dplyr, which combines the TxDb and OrgDb information and lets you query and filter on data from both.

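library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(Organism.dplyr)
# Build the combined TxDb + OrgDb source (same hg19 TxDb as loaded above)
src <- src_organism("TxDb.Hsapiens.UCSC.hg19.knownGene")
# List the available tables and the columns of the "id" table
src_tbls(src)
colnames(tbl(src, "id"))
# Key types that can be used for lookups
keytypes(src)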

ADD REPLY • link 6.3 years ago shepherl 4.2k

score 3 · Accepted Answer · 2019-04-03

If you are using Ensembl IDs, then you should probably not use NCBI-based data to say where your genes are located. There are any number of disagreements between EBI/EMBL and NCBI about how many genes there are, where they are, etc, and in general that is orthogonal to what you are trying to do, so why make your life more complicated?

Also, not being familiar with SingleCellExperiment containers, I am sort of shocked (shocked I say!) that you could have a SingleCellExperiment object that doesn't have gene locations as part of the rowData. How did you map reads to genes without already having those data? And how were you able to generate the SingleCellExperiment without those data being added? Have you looked at rowRanges(sce) to see if they are there?

Anyway, if you actually don't have the gene locations, you probably want to use one of Johannes Rainer's EnsDb packages to add the gene locations.

> library(BiocManager)
> BiocManager::install("EnsDb.Hsapiens.v75")
> library(EnsDb.Hsapiens.v75)
> gnloc <- genes(EnsDb.Hsapiens.v75)
> gnloc
GRanges object with 64102 ranges and 6 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000223972        1       11869-14412      + | ENSG00000223972
  ENSG00000227232        1       14363-29806      - | ENSG00000227232
  ENSG00000243485        1       29554-31109      + | ENSG00000243485
  ENSG00000237613        1       34554-36081      - | ENSG00000237613
  ENSG00000268020        1       52473-54936      + | ENSG00000268020
              ...      ...               ...    ... .             ...
  ENSG00000224240        Y 28695572-28695890      + | ENSG00000224240
  ENSG00000227629        Y 28732789-28737748      - | ENSG00000227629
  ENSG00000237917        Y 28740998-28780799      - | ENSG00000237917
  ENSG00000231514        Y 28772667-28773306      - | ENSG00000231514
  ENSG00000235857        Y 59001391-59001635      + | ENSG00000235857
                    gene_name gene_biotype seq_coord_system      symbol
                  <character>  <character>      <character> <character>
  ENSG00000223972     DDX11L1   pseudogene       chromosome     DDX11L1
  ENSG00000227232      WASH7P   pseudogene       chromosome      WASH7P
  ENSG00000243485  MIR1302-10      lincRNA       chromosome  MIR1302-10
  ENSG00000237613     FAM138A      lincRNA       chromosome     FAM138A
  ENSG00000268020      OR4G4P   pseudogene       chromosome      OR4G4P
              ...         ...          ...              ...         ...
  ENSG00000224240     CYCSP49   pseudogene       chromosome     CYCSP49
  ENSG00000227629  SLC25A15P1   pseudogene       chromosome  SLC25A15P1
  ENSG00000237917     PARP4P1   pseudogene       chromosome     PARP4P1
  ENSG00000231514     FAM58CP   pseudogene       chromosome     FAM58CP
  ENSG00000235857     CTBP2P1   pseudogene       chromosome     CTBP2P1
                                                       entrezid
                                                         <list>
  ENSG00000223972                       c(100287596, 100287102)
  ENSG00000227232                          c(100287171, 653635)
  ENSG00000243485 c(100422919, 100422834, 100422831, 100302278)
  ENSG00000237613                     c(654835, 645520, 641702)
  ENSG00000268020                                            NA
              ...                                           ...
  ENSG00000224240                                            NA
  ENSG00000227629                                            NA
  ENSG00000237917                                            NA
  ENSG00000231514                                            NA
  ENSG00000235857                                            NA
  -------

And then you can add those data to your SingleCellExperiment. You will have to look at some documentation as to how one accomplishes that.
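As a rough sketch of that last step (not the only way to do it), one option is to match the Ensembl IDs in rowData against the gene_id column of the GRanges and copy over the seqnames. The toy sce below is a hypothetical stand-in for the poster's object, and the CHR column and is_mito vector are illustrative names, not part of any package. It assumes EnsDb.Hsapiens.v75 and SingleCellExperiment are installed, and that mitochondrial genes are labelled "MT" in the Ensembl seqnames.

```r
library(SingleCellExperiment)
library(EnsDb.Hsapiens.v75)

# Hypothetical toy object standing in for the real sce:
# rownames are symbols, rowData(sce)$ENSEMBL holds Ensembl gene IDs.
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("DDX11L1", "WASH7P", "MT-ND1"),
                                 c("cell1", "cell2")))
sce <- SingleCellExperiment(assays = list(counts = counts))
rowData(sce)$ENSEMBL <- c("ENSG00000223972", "ENSG00000227232",
                          "ENSG00000198888")

# Gene-level ranges from the Ensembl-based annotation
gnloc <- genes(EnsDb.Hsapiens.v75)

# Align annotation rows to the rows of sce; unmatched IDs give NA
idx <- match(rowData(sce)$ENSEMBL, gnloc$gene_id)
rowData(sce)$CHR <- as.character(seqnames(gnloc))[idx]

# Logical vector flagging mitochondrial genes (NA-safe)
is_mito <- !is.na(rowData(sce)$CHR) & rowData(sce)$CHR == "MT"
```

With a real object, the toy construction would be dropped and only the match() and assignment lines kept. IDs that are absent from the chosen Ensembl release end up as NA in CHR, so it is worth checking how many there are before filtering.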