Getting CDSCHROM values from ENSEMBL/SYMBOL keys using mapIds from TxDb.Hsapiens.UCSC.hg19.knownGene
1
0
Entering edit mode
kushshah ▴ 10
@kushshah-20393
Last seen 2.2 years ago
University of North Carolina, Chapel Hi…

I am new to using Bioconductor. I have a SingleCellExperiment object, sce, that contains rownames in SYMBOL format, and rowData in ENSEMBL format. Using TxDb.Hsapiens.UCSC.hg19.knownGene, I wish to find the chromosomal location for each gene (for downstream mitochondrial gene controlling) and store these CDSCHROM values as a new vector within rowData. The code I have tried looks like this this:

location <- mapIds(TxDb.Hsapiens.UCSC.hg19.knownGene, keys=rowData(sce)$ENSEMBL, column="CDSCHROM", keytype=???) rowData(sce)$CHR <- location

However, I do not understand how to fill in the keytype argument. I see that "ENSEMBL" is not a valid keytype, so how would I go about this problem?

Seeing that "GENEID" is a valid keytype, I thought about doing the following:

geneidSymbols <- mapIds(org.Hs.eg.db, keys=rownames(sce), keytype="SYMBOL", column="GENEID") rowData(sce)$GENEID <- geneidSymbols

and then using the Gene ID's as my keys in the new code. But "GENEID" is not a valid column type for org.Hs.eg.db, so that did not work either.

I would appreciate any suggestions as I am new to Bioconductor and scRNA-seq in general. Thank you!

ensembl cdschrom org.Hs.eg.db AnnotationDbi SingleCellExperiment • 1.8k views
ADD COMMENT
0
Entering edit mode

You might try looking at Organism.dplyr which combines the TxDb and OrgDb information allowing to query and filter based on data from both.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(Organism.dplyr)
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")
src_tbls(src)
colnames(tbl(src, "id"))
keytypes(src)
ADD REPLY
3
Entering edit mode
@james-w-macdonald-5106
Last seen 19 hours ago
United States

If you are using Ensembl IDs, then you should probably not use NCBI-based data to say where your genes are located. There are any number of disagreements between EBI/EMBL and NCBI about how many genes there are, where they are, etc, and in general that is orthogonal to what you are trying to do, so why make your life more complicated?

Also, not being familiar with SingleCellExperiment containers, I am sort of shocked (shocked I say!) that you could have a SingleCellExperiment object that doesn't have gene locations as part of the rowData. How did you map reads to genes without already having those data? And how were you able to generate the SingleCellExperiment without those data being added? Have you looked at rowRanges(sce) to see if they are there?

Anyway, if you actually don't have the gene locations, you probably want to use one of Johannes Rainer's EnsemblDb objects to add the gene locations.

> library(BiocManager)
> BiocManager::install("EnsDb.Hsapiens.v75)
> gnloc <- genes(EnsDb.Hsapiens.v75)
> gnloc
GRanges object with 64102 ranges and 6 metadata columns:
                  seqnames            ranges strand |         gene_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENSG00000223972        1       11869-14412      + | ENSG00000223972
  ENSG00000227232        1       14363-29806      - | ENSG00000227232
  ENSG00000243485        1       29554-31109      + | ENSG00000243485
  ENSG00000237613        1       34554-36081      - | ENSG00000237613
  ENSG00000268020        1       52473-54936      + | ENSG00000268020
              ...      ...               ...    ... .             ...
  ENSG00000224240        Y 28695572-28695890      + | ENSG00000224240
  ENSG00000227629        Y 28732789-28737748      - | ENSG00000227629
  ENSG00000237917        Y 28740998-28780799      - | ENSG00000237917
  ENSG00000231514        Y 28772667-28773306      - | ENSG00000231514
  ENSG00000235857        Y 59001391-59001635      + | ENSG00000235857
                    gene_name gene_biotype seq_coord_system      symbol
                  <character>  <character>      <character> <character>
  ENSG00000223972     DDX11L1   pseudogene       chromosome     DDX11L1
  ENSG00000227232      WASH7P   pseudogene       chromosome      WASH7P
  ENSG00000243485  MIR1302-10      lincRNA       chromosome  MIR1302-10
  ENSG00000237613     FAM138A      lincRNA       chromosome     FAM138A
  ENSG00000268020      OR4G4P   pseudogene       chromosome      OR4G4P
              ...         ...          ...              ...         ...
  ENSG00000224240     CYCSP49   pseudogene       chromosome     CYCSP49
  ENSG00000227629  SLC25A15P1   pseudogene       chromosome  SLC25A15P1
  ENSG00000237917     PARP4P1   pseudogene       chromosome     PARP4P1
  ENSG00000231514     FAM58CP   pseudogene       chromosome     FAM58CP
  ENSG00000235857     CTBP2P1   pseudogene       chromosome     CTBP2P1
                                                       entrezid
                                                         <list>
  ENSG00000223972                       c(100287596, 100287102)
  ENSG00000227232                          c(100287171, 653635)
  ENSG00000243485 c(100422919, 100422834, 100422831, 100302278)
  ENSG00000237613                     c(654835, 645520, 641702)
  ENSG00000268020                                            NA
              ...                                           ...
  ENSG00000224240                                            NA
  ENSG00000227629                                            NA
  ENSG00000237917                                            NA
  ENSG00000231514                                            NA
  ENSG00000235857                                            NA
  -------

And then you can add those data to your SingleCellExperiment. You will have to look at some documentation as to how one accomplishes that.

ADD COMMENT
1
Entering edit mode

Just a small comment to James' excellent answer: if your alignment/gene annotations were based on a specific Ensembl release, you should also ensure that you are loading/adding annotations from the same release. Example, your object is based on Ensembl release 91 data, then you would want to get the corresponding Ensembl annotations from AnnotationHub:

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "EnsDb.Hsapiens.v91")
AnnotationHub with 1 record
# snapshotDate(): 2019-04-03 
# names(): AH60773
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-12-21
# $title: Ensembl 91 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "91", "AHEnsDbs") 
# retrieve record with 'object[["AH60773"]]' 
edb <- ah[["AH60773"]]

Now it depends what type of annotations you want to extract. If you're just looking for gene annotations (i.e. Ensembl gene ID to gene names/symbols) you could use:

> gene_df <- genes(edb, return.type = "data.frame")
> head(gene_df)
           gene_id   gene_name                       gene_biotype
1  ENSG00000223972     DDX11L1 transcribed_unprocessed_pseudogene
6  ENSG00000227232      WASH7P             unprocessed_pseudogene
7  ENSG00000278267   MIR6859-1                              miRNA
8  ENSG00000243485 MIR1302-2HG                            lincRNA
9  ENSG00000284332   MIR1302-2                              miRNA
10 ENSG00000237613     FAM138A                            lincRNA
   gene_seq_start gene_seq_end seq_name seq_strand seq_coord_system
1           11869        14409        1          1       chromosome
6           14404        29570        1         -1       chromosome
7           17369        17436        1         -1       chromosome
8           29554        31109        1          1       chromosome
9           30366        30503        1          1       chromosome
10          34554        36081        1         -1       chromosome
                                                                        description
1                 DEAD/H-box helicase 11 like 1 [Source:HGNC Symbol;Acc:HGNC:37102]
6       WAS protein family homolog 7 pseudogene [Source:HGNC Symbol;Acc:HGNC:38034]
7                               microRNA 6859-1 [Source:HGNC Symbol;Acc:HGNC:50039]
8                           MIR1302-2 host gene [Source:HGNC Symbol;Acc:HGNC:52482]
9                               microRNA 1302-2 [Source:HGNC Symbol;Acc:HGNC:35294]
10 family with sequence similarity 138 member A [Source:HGNC Symbol;Acc:HGNC:32334]
     gene_id_version      symbol                                       entrezid
1  ENSG00000223972.5     DDX11L1 102725121, 100287596, 100287102, 727856, 84771
6  ENSG00000227232.5      WASH7P                                         653635
7  ENSG00000278267.1   MIR6859-1                                      102466751
8  ENSG00000243485.5 MIR1302-2HG                                      105376912
9  ENSG00000284332.1   MIR1302-2                                      100302278
10 ENSG00000237613.2     FAM138A                                         645520

With the return.type = "data.frame" you specify that you want to get a simple data.frame back instead of the default GRanges object.

ADD REPLY
0
Entering edit mode

Sorry - pls see comment below

ADD REPLY
0
Entering edit mode

Hi James - to answer the question you asked about how we got reads: our data is already in gene symbol format. E.g. data looks like this. We didn't have to map reads to genes because we are accessing another data source that directly provides this information:

.................AZA1 AZA10 AZA11 AZA12 AZ_A2

INS ..........49261.08 878.46353 49.76273 14.0510 63.76123

ERCC...... 0.00 0.00000 0.00000 0.0000 0.00000

GCG........ 26648.19 13.85157 181712.07151 101.0933 53.95560\

TTR .........31584.42 4868.56147 39362.45851 4158.4443 6735.08240

PPY......... 30181.84 111342.76740 33.34756 218.4515 258295.01898

ADD REPLY

Login before adding your answer.

Traffic: 966 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6