Question

How to retrieve genomic coordinates for a list of entrezID and their respective gene names

0

Entering edit mode

rohitsatyam102 ▴ 20

@rohitsatyam102-24390

Last seen 15 days ago

India

lshepard

I wanted to get Gene Symbols for EntrezIDs I have and also the promoter region of these genes. However, I don't know if I am doing it correctly. Here is what I did:

library(AnnotationHub)
library("org.Hs.eg.db")
hs <- org.Hs.eg.db

### The excel sheet was obtained from dbindel database: Sample: HCC1954 (indel.txt file)
file <- read.csv("GSM721136.indel.txt", sep="\t", header = T)
ids2 <- as.character(file$related_gene)
#keytypes(hs)

AnnotationDbi::mapIds(hs, keys = ids2, column='SYMBOL', keytype='ENTREZID')

##Successfully converted the entrezid to symbols

### Retrieve the TSS for all EntrezIDs (we don't have strand information for these gene IDs)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(GenomicRanges)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
keytypes(txdb)
[1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"  

columns(txdb)
 [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"   "CDSSTRAND"  "EXONCHROM" 
 [8] "EXONEND"    "EXONID"     "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"    
[15] "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"   "TXTYPE"  

select(txdb, keys=ids2,
       keytype = "GENEID",
       columns=c("CDSCHROM","CDSSTART","CDSSTRAND")
)

I wish to confirm if this is the right approach to get the TSS sites for the EntrezIDs. Can CDSstart be taken as TSS? If yes why do I get multiple results for a single entrezID? Also can mapID function be used for this purpose because it gives the following error.

If the approach is wrong can you point me to the right package/approach/post because I am almost lost in this?

Error in mapIds_base(x, keys, column, keytype, ..., multiVals = multiVals) : mapIds can only use one column.

genomic AnnotationHub GenomicRanges org.Hs.eg.db • 5.6k views

ADD COMMENT • link 4.3 years ago • updated 4.2 years ago rohitsatyam102 ▴ 20

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 17 hours ago

United States

The TSS is the first base of the transcript, and genes can have multiple transcripts. And the NCBI Gene ID pertains to genes, not transcripts. So you should expect multiple transcript start sites for any gene that has multiple transcripts.

Anyway, using select and a TxDb object are both probably suboptimal. I would instead use the Homo.sapiens package.

> zz <- cdsBy(Homo.sapiens, "gene", columns = c("ENTREZID","SYMBOL"))
'select()' returned many:many mapping between keys and columns
> zz
GRangesList object of length 19809:
$`1`
GRanges object with 8 ranges and 3 metadata columns:
      seqnames            ranges strand |    cds_name          SYMBOL
         <Rle>         <IRanges>  <Rle> | <character> <CharacterList>
  [1]    chr19 58858388-58858395      - |        <NA>            A1BG
  [2]    chr19 58858719-58859006      - |        <NA>            A1BG
  [3]    chr19 58861736-58862017      - |        <NA>            A1BG
  [4]    chr19 58862757-58863053      - |        <NA>            A1BG
  [5]    chr19 58863649-58863921      - |        <NA>            A1BG
  [6]    chr19 58864294-58864563      - |        <NA>            A1BG
  [7]    chr19 58864658-58864693      - |        <NA>            A1BG
  [8]    chr19 58864770-58864803      - |        <NA>            A1BG
             ENTREZID
      <CharacterList>
  [1]               1
  [2]               1
  [3]               1
  [4]               1
  [5]               1
  [6]               1
  [7]               1
  [8]               1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$`10`
GRanges object with 1 range and 3 metadata columns:
      seqnames            ranges strand |    cds_name          SYMBOL
         <Rle>         <IRanges>  <Rle> | <character> <CharacterList>
  [1]     chr8 18257514-18258386      + |        <NA>            NAT2
             ENTREZID
      <CharacterList>
  [1]              10
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

...
<19807 more elements>
## resize to get TSS
> unlist(resize(zz, 1))
GRanges object with 237874 ranges and 3 metadata columns:
       seqnames    ranges strand |    cds_name          SYMBOL        ENTREZID
          <Rle> <IRanges>  <Rle> | <character> <CharacterList> <CharacterList>
     1    chr19  58858395      - |        <NA>            A1BG               1
     1    chr19  58859006      - |        <NA>            A1BG               1
     1    chr19  58862017      - |        <NA>            A1BG               1
     1    chr19  58863053      - |        <NA>            A1BG               1
     1    chr19  58863921      - |        <NA>            A1BG               1
   ...      ...       ...    ... .         ...             ...             ...
  9994     chr6  90575672      + |        <NA>        CASP8AP2            9994
  9994     chr6  90581008      + |        <NA>        CASP8AP2            9994
  9994     chr6  90581008      + |        <NA>        CASP8AP2            9994
  9994     chr6  90583522      + |        <NA>        CASP8AP2            9994
  9997    chr22  50962840      - |        <NA>            SCO2            9997
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

ADD COMMENT • link 4.3 years ago James W. MacDonald 68k

1

Entering edit mode

And do note that the names of the GRangesList you get from cdsBy are the NCBI Gene IDs (what you are calling EntrezGene IDs). So you could subset that first, using the set of IDs you care about.

ADD REPLY • link 4.3 years ago James W. MacDonald 68k

0

Entering edit mode

Following the question of rohitsatyam102 - is there a way to retrieve the genomic coordinates for a list of variants using an R based script? I am currently using mutalyzer's position converter to do this task but is not available as an R based operation. To give you an example:

If I have the mRNA variant NM_199292.2:c.657C>A, the corresponding genomic change would be NC_000011.9:g.2189333G>T

Is there an automated way to get this job done for a huge list of such variants??

ADD REPLY • link 4.3 years ago heiko_kin ▴ 60

0

Entering edit mode

This is a new question and it got lost in the middle of a conversation about a different question. Please ask your question in a new thread. Thanks!

ADD REPLY • link 4.3 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks James W. MacDonald , I will test this and get back to you.

ADD REPLY • link 4.3 years ago rohitsatyam102 ▴ 20

0

Entering edit mode

Hi James W. MacDonald . It looks like Homo.sapiens package doesn't have a vignette. Any suggestions on where else to look for it's manual?

ADD REPLY • link 4.3 years ago rohitsatyam102 ▴ 20

1

Entering edit mode

Homo.sapiens is a OrganismDb package which means it can use the same syntax query you see in TxDb and OrgDb.

ADD REPLY • link 4.3 years ago shepherl 4.1k

1

Entering edit mode

And since it's an OrganismDbi package, the vignette is for the OrganismDbi package, rather than the Homo.sapiens package. This is true of all annotation packages; none of them have vignettes, but instead you have to read the vignette provided with the package that is intended to provide the functionality to use the package (e.g., AnnotationDbi for OrgDb packages, etc).

ADD REPLY • link 4.3 years ago James W. MacDonald 68k

1

Entering edit mode

Also, you should read the vignettes for GenomicFeatures, as that's where the code for interacting with a TxDb comes from.

ADD REPLY • link 4.3 years ago James W. MacDonald 68k

0

Entering edit mode

rohitsatyam102 ▴ 20

@rohitsatyam102-24390

Last seen 15 days ago

India

Really really thanks James W. MacDonald and Hervé Pagès for taking out time to answer my query. I would like to extend this thread and add how one can do this if working with Ensembl annotations.

library(biomaRt)
library(regioneR)
library(BSgenome.Hsapiens.UCSC.hg38)

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)

#obtaining the TSS information from Biomart in form of dataframe

transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end", "strand", "ensembl_gene_id","gene_biotype", "ensembl_transcript_id"),
      filters ="biotype",
      values  =c("protein_coding"),
      mart    =mart) 

#changing the strand connotations from 1 and -1 to + and -

transcripts$strand[transcripts$strand==-1] <- "-"
transcripts$strand[transcripts$strand==1] <- "+"

#converting the df to granges
transcripts.gr <- toGRanges(transcripts)
genome(transcripts.gr) <- "hg38"

# This is important because promoter() function uses strand information to correctly choose start coordinate of genes on negative strand
strand(transcripts.gr) <- transcripts.gr$strand

# to remove strand attribute from metadata columns which otherwise throws error
transcripts.gr <- transcripts.gr[,-1]

# Optional:To keep canonical chromosome transcript only: 
transcripts.gr <- keepSeqlevels(transcripts.gr, c(1:22,"X", "Y"), pruning.mode = "coarse")

#optional: To convert the chromosome to UCSC style like converting 1 to chr1
seqlevelsStyle(transcripts.gr) <- "UCSC"

# To get TSS
promoters(transcripts.gr, upstream=0, downstream=1)

# For Promoter in Humans generally 1500 bo upstream of TSS and 500 bp downstream is considered
pr <- promoters(transcripts.gr, upstream=1500, downstream=500)

# Read the list that you wish to obtain the Promoter sequence of

file <- read.csv("homo sapiens_genes.tsv", sep = "\t", header = T)

#subset the Promoter object with the entrez gene id

sub <- subset(pr, pr$ensembl_transcript_id%in%file$single_exon_isoforms)

# retrieve the promoter sequence for all the subseted ranges

seq <-getSeq(BSgenome.Hsapiens.UCSC.hg38, sub)

# change the names of the sequences using entrezid
names(seq) <- sub$ensembl_transcript_id

#save them in fast format
writeXStringSet(seq,"promoter.fa")

ADD COMMENT • link 4.2 years ago rohitsatyam102 ▴ 20

score 3 · Accepted Answer · 2020-12-07

I don't think the CDS start can be taken as TSS in general, only when the transcript has no 5' UTR.

Here is how you can extract the TSS for all transcripts in the UCSC knownGene track for hg38:

1. Extract all transcripts from the UCSC knownGene track for hg38:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene, columns=c("tx_name", "gene_id"))
tx
# GRanges object with 247541 ranges and 2 metadata columns:
#                    seqnames        ranges strand |           tx_name         gene_id
#                       <Rle>     <IRanges>  <Rle> |       <character> <CharacterList>
#        [1]             chr1   11869-14409      + | ENST00000456328.2       100287102
#        [2]             chr1   12010-13670      + | ENST00000450305.2       100287102
#        [3]             chr1   29554-31097      + | ENST00000473358.1                
#        [4]             chr1   30267-31109      + | ENST00000469289.1                
#        [5]             chr1   30366-30503      + | ENST00000607096.1       100302278
#        ...              ...           ...    ... .               ...             ...
#   [247537] chrUn_GL000220v1 155997-156149      + | ENST00000619779.1       109864274
#   [247538] chrUn_KI270442v1 380608-380726      + | ENST00000620265.1                
#   [247539] chrUn_KI270442v1 217250-217401      - | ENST00000611690.1                
#   [247540] chrUn_KI270744v1   51009-51114      - | ENST00000616830.1                
#   [247541] chrUn_KI270750v1 148668-148843      + | ENST00000612925.1                
#   -------
#   seqinfo: 595 sequences (1 circular) from hg38 genome

The gene_id metadata column is a CharacterList object that links each transcript to 0 or 1 gene:

mcols(tx)$gene_id
# CharacterList of length 247541
# [[1]] 100287102
# [[2]] 100287102
# [[3]] character(0)
# [[4]] character(0)
# [[5]] 100302278
# [[6]] character(0)
# [[7]] character(0)
# [[8]] character(0)
# [[9]] character(0)
# [[10]] 79501
# ...
# <247531 more elements>

table(lengths(mcols(tx)$gene_id))
#      0      1 
#  50159 197382

It might actually be more convenient to represent this mapping as a character vector:

mcols(tx)$gene_id <- as.character(mcols(tx)$gene_id)
tx
# GRanges object with 247541 ranges and 2 metadata columns:
#                    seqnames    ranges strand |           tx_name     gene_id
#                       <Rle> <IRanges>  <Rle> |       <character> <character>
#        [1]             chr1     11869      + | ENST00000456328.2   100287102
#        [2]             chr1     12010      + | ENST00000450305.2   100287102
#        [3]             chr1     29554      + | ENST00000473358.1        <NA>
#        [4]             chr1     30267      + | ENST00000469289.1        <NA>
#        [5]             chr1     30366      + | ENST00000607096.1   100302278
#        ...              ...       ...    ... .               ...         ...
#   [247537] chrUn_GL000220v1    155997      + | ENST00000619779.1   109864274
#   [247538] chrUn_KI270442v1    380608      + | ENST00000620265.1        <NA>
#   [247539] chrUn_KI270442v1    217401      - | ENST00000611690.1        <NA>
#   [247540] chrUn_KI270744v1     51114      - | ENST00000616830.1        <NA>
#   [247541] chrUn_KI270750v1    148668      + | ENST00000612925.1        <NA>
#   -------
#   seqinfo: 595 sequences (1 circular) from hg38 genome

2. Add the gene names/symbols to 'tx' as another metadata column:

The gene names/symbols are not stored in the TxDb.Hsapiens.UCSC.hg38.knownGene package so we cannot do something like transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene, columns=c("tx_name", "gene_id", "SYMBOL")) to obtained this information. Instead we get it from the org.Hs.eg.db package:

library(org.Hs.eg.db)
ENTREZID2SYMBOL <- select(org.Hs.eg.db, mcols(tx)$gene_id, c("ENTREZID", "SYMBOL"))
# 'select()' returned many:1 mapping between keys and columns
stopifnot(identical(ENTREZID2SYMBOL$ENTREZID, mcols(tx)$gene_id))
mcols(tx)$SYMBOL <- ENTREZID2SYMBOL$SYMBOL

Note that an alternative to the above would be to use the Homo.sapiens package, which bundles TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db together:

library(Homo.sapiens)
Homo.sapiens
# OrganismDb Object:
## Includes GODb Object:  GO.db 
## With data about:  Gene Ontology 
## Includes OrgDb Object:  org.Hs.eg.db 
## Gene data about:  Homo sapiens 
## Taxonomy Id:  9606 
## Includes TxDb Object:  TxDb.Hsapiens.UCSC.hg19.knownGene 
## Transcriptome data about:  Homo sapiens 
## Based on genome:  hg19 
## The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .

transcripts(Homo.sapiens, columns=c("ENTREZID", "SYMBOL"))
# 'select()' returned 1:1 mapping between keys and columns
# GRanges object with 82960 ranges and 2 metadata columns:
#                 seqnames        ranges strand |          SYMBOL        ENTREZID
#                    <Rle>     <IRanges>  <Rle> | <CharacterList> <CharacterList>
#       [1]           chr1   11874-14409      + |         DDX11L1       100287102
#       [2]           chr1   11874-14409      + |         DDX11L1       100287102
#       [3]           chr1   11874-14409      + |         DDX11L1       100287102
#       [4]           chr1   69091-70008      + |           OR4F5           79501
#       [5]           chr1 321084-321115      + |            <NA>            <NA>
#       ...            ...           ...    ... .             ...             ...
#   [82956] chrUn_gl000237        1-2686      - |            <NA>            <NA>
#   [82957] chrUn_gl000241   20433-36875      - |            <NA>            <NA>
#   [82958] chrUn_gl000243   11501-11530      + |            <NA>            <NA>
#   [82959] chrUn_gl000243   13608-13637      + |            <NA>            <NA>
#   [82960] chrUn_gl000247     5787-5816      - |            <NA>            <NA>
#   -------
#   seqinfo: 93 sequences (1 circular) from hg19 genome

Note that the reason Homo.sapiens contains less transcripts than TxDb.Hsapiens.UCSC.hg38.knownGene is because by default Homo.sapiens is based on hg19 instead of hg38. This can be changed with:

TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene

3. Use promoters() to get the promoters or TSS:

We use upstream=0 and downstream=1 to get the TSS (see ?promoters in the GenomicFeatures package for more information):

tss <- promoters(tx, upstream=0, downstream=1)
tss
# GRanges object with 247541 ranges and 3 metadata columns:
#                    seqnames    ranges strand |           tx_name     gene_id      SYMBOL
#                       <Rle> <IRanges>  <Rle> |       <character> <character> <character>
#        [1]             chr1     11869      + | ENST00000456328.2   100287102     DDX11L1
#        [2]             chr1     12010      + | ENST00000450305.2   100287102     DDX11L1
#        [3]             chr1     29554      + | ENST00000473358.1        <NA>        <NA>
#        [4]             chr1     30267      + | ENST00000469289.1        <NA>        <NA>
#        [5]             chr1     30366      + | ENST00000607096.1   100302278   MIR1302-2
#        ...              ...       ...    ... .               ...         ...         ...
#   [247537] chrUn_GL000220v1    155997      + | ENST00000619779.1   109864274   RNA5-8SN4
#   [247538] chrUn_KI270442v1    380608      + | ENST00000620265.1        <NA>        <NA>
#   [247539] chrUn_KI270442v1    217401      - | ENST00000611690.1        <NA>        <NA>
#   [247540] chrUn_KI270744v1     51114      - | ENST00000616830.1        <NA>        <NA>
#   [247541] chrUn_KI270750v1    148668      + | ENST00000612925.1        <NA>        <NA>
#   -------
#   seqinfo: 595 sequences (1 circular) from hg38 genome

H.