Question

how to get complete cds annotation information ?

0

Entering edit mode

KB ▴ 50

@k-8495

Last seen 2.9 years ago

United States

Hello,

I am trying to get the complete CDS annotation information for the hg18 genome. My goal is to extract these columns for the complete hg18 genome

     chrom  strand cdsStart    cdsEnd      GeneID        
[1,] "chr1" "-"    "    36081" "    36081" "FAM138A"     
[2,] "chr1" "-"    "    36081" "    36081" "FAM138F"     
[3,] "chr1" "+"    "    69090" "    70008" "OR4F5"       
[4,] "chr1" "+"    "   328581" "   328581" "LOC100132287"
[5,] "chr1" "+"    "   328581" "   328581" "LOC100132062"
[6,] "chr1" "+"    "   367658" "   368597" "OR4F3"

I could not find this information in AnnotationHub, so went to TXDB. This is what I did so far.

library(TxDb.Hsapiens.UCSC.hg18.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg18.knownGene
txdb.cds <- select(txdb, keys = keys(txdb, "GENEID"),columns=c("GENEID","CDSID","CDSNAME","CDSCHROM","CDSSTRAND","CDSSTART", "CDSEND"),keytype = "GENEID")

> head(txdb.cds) GENEID CDSID CDSNAME CDSCHROM CDSSTRAND CDSSTART CDSEND 1 1 197520 <NA> chr19 - 63556582 63556615 2 1 197519 <NA> chr19 - 63556470 63556505 3 1 197518 <NA> chr19 - 63556106 63556375 4 1 197517 <NA> chr19 - 63555461 63555733 5 1 197516 <NA> chr19 - 63554569 63554865 6 1 197515 <NA> chr19 - 63553548 63553829

I am using Txdb for the first time, so I would like to check the correctness of my query.

1) Will this query give me the COMPLETE hg18 gene CDS annotations ?

2) How can I get the gene symbol ? I saw another thread related to this for mouse genome (https://support.bioconductor.org/p/51856/), but I don't think that answer will work for me.

I'm not sure how to proceed at the moment. Should I use Biomart ? Any suggestions will be helpful.

Thanks.

cds • 2.4k views

ADD COMMENT • link 10.1 years ago KB ▴ 50

score 0 · Answer 1 · 2015-12-09

If you really want a data.frame, then you can use the Homo.sapiens package:

> z <- select(Homo.sapiens, keys(Homo.sapiens, "ENTREZID"), c("CDSID","CDSNAME","CDSCHROM","CDSSTRAND","CDSSTART","CDSEND","SYMBOL"), "ENTREZID")
'select()' returned 1:many mapping between keys and columns
> head(z)
  ENTREZID SYMBOL  CDSID CDSNAME CDSCHROM CDSSTRAND CDSSTART   CDSEND
1        1   A1BG 206062    <NA>    chr19         - 58864770 58864803
2        1   A1BG 206061    <NA>    chr19         - 58864658 58864693
3        1   A1BG 206060    <NA>    chr19         - 58864294 58864563
4        1   A1BG 206059    <NA>    chr19         - 58863649 58863921
5        1   A1BG 206058    <NA>    chr19         - 58862757 58863053
6        1   A1BG 206057    <NA>    chr19         - 58861736 58862017

Alternatively you could use a GRanges object, which is in many respects easier to deal with

> zz <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> zz <- unlist(zz)
> zz
GRanges object with 237874 ranges and 2 metadata columns:
       seqnames               ranges strand   |    cds_id    cds_name
          <Rle>            <IRanges>  <Rle>   | <integer> <character>
     1    chr19 [58858388, 58858395]      -   |    206055        <NA>
     1    chr19 [58858719, 58859006]      -   |    206056        <NA>
     1    chr19 [58861736, 58862017]      -   |    206057        <NA>
     1    chr19 [58862757, 58863053]      -   |    206058        <NA>
     1    chr19 [58863649, 58863921]      -   |    206059        <NA>
   ...      ...                  ...    ... ...       ...         ...
  9994     chr6 [90575672, 90578801]      +   |     76227        <NA>
  9994     chr6 [90581008, 90581107]      +   |     76228        <NA>
  9994     chr6 [90581008, 90581109]      +   |     76229        <NA>
  9994     chr6 [90583522, 90583528]      +   |     76230        <NA>
  9997    chr22 [50962040, 50962840]      -   |    218795        <NA>
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
## add in symbols
> mcols(zz)$symbol <- mapIds(org.Hs.eg.db, names(zz), "SYMBOL", "ENTREZID", multiVals="CharacterList")
> zz
GRanges object with 237874 ranges and 3 metadata columns:
       seqnames               ranges strand   |    cds_id    cds_name
          <Rle>            <IRanges>  <Rle>   | <integer> <character>
     1    chr19 [58858388, 58858395]      -   |    206055        <NA>
     1    chr19 [58858719, 58859006]      -   |    206056        <NA>
     1    chr19 [58861736, 58862017]      -   |    206057        <NA>
     1    chr19 [58862757, 58863053]      -   |    206058        <NA>
     1    chr19 [58863649, 58863921]      -   |    206059        <NA>
   ...      ...                  ...    ... ...       ...         ...
  9994     chr6 [90575672, 90578801]      +   |     76227        <NA>
  9994     chr6 [90581008, 90581107]      +   |     76228        <NA>
  9994     chr6 [90581008, 90581109]      +   |     76229        <NA>
  9994     chr6 [90583522, 90583528]      +   |     76230        <NA>
  9997    chr22 [50962040, 50962840]      -   |    218795        <NA>
                symbol
       <CharacterList>
     1            A1BG
     1            A1BG
     1            A1BG
     1            A1BG
     1            A1BG
   ...             ...
  9994        CASP8AP2
  9994        CASP8AP2
  9994        CASP8AP2
  9994        CASP8AP2
  9997            SCO2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

This is obviously not for hg18. You can put the TxDb.Hsapiens.hg18.UCSC.knownGene object into the Homo.sapiens object like this:

> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg18.knownGene
Now getting the GODb Object directly
Now getting the OrgDb Object directly
Now loading the TxDb Object from local disc
> Homo.sapiens
OrganismDb Object:
# Includes GODb Object:  GO.db
# With data about:  Gene Ontology
# Includes OrgDb Object:  org.Hs.eg.db
# Gene data about:  Homo sapiens
# Taxonomy Id:  9606
# Includes TxDb Object:  TxDb.Hsapiens.UCSC.hg18.knownGene
# Transcriptome data about:  Homo sapiens
# Based on genome:  hg18
# The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .

And then proceed as above. The (possible) downside of doing that has to do with the difference between a genome build and the other annotations. Genome builds are semi-static things that are released on a semi-regular basis (e.g., hg18 was released in 2006, which in 'annotation years' is literally eons ago).

All the other annotation databases are updated on a weekly basis, and only made semi-static by the fact that we release them semi-yearly. So the TxDb.Hsapiens.UCSC.hg18.knownGene is based on what we knew about the human genome a decade ago, whereas the org.Hs.eg.db is based on data from a couple of months ago. I don't know if UCSC tries to harmonize the Gene IDs so they match up with the current annotations - that wouldn't make sense to me; if you want old data, you should get old data - so you run the risk of having old Gene IDs that have been retired, that won't match up with whatever the current Gene ID is.

score 0 · Answer 2 · 2015-12-09

Thank you very much for the detailed response. Really appreciate it.

We are working with some old data profiled and analyzed using the hg18 reference genome in 2009. So the old Gene IDs are fine.

As for the data type, I didn't wan't to be too restrictive in my questions. Definitely prefer GRanges to data frame.

I have a follow up question. I tried to extract the information as GRanges object from hg18 as you suggested

library(TxDb.Hsapiens.UCSC.hg18.knownGene)
library(org.Hs.eg.db)
zz <- cdsBy(TxDb.Hsapiens.UCSC.hg18.knownGene, "gene")
zz <- unlist(zz)
mcols(zz)$symbol <- mapIds(org.Hs.eg.db, names(zz), "SYMBOL", "ENTREZID", multiVals="CharacterList")

> head(zz) GRanges object with 6 ranges and 3 metadata columns: seqnames ranges strand | cds_id cds_name symbol <Rle> <IRanges> <Rle> | <integer> <character> <CharacterList> 1 chr19 [63550200, 63550207] - | 197513 <NA> A1BG,A1BG,A1BG,... 1 chr19 [63550531, 63550818] - | 197514 <NA> A1BG,A1BG,A1BG,... 1 chr19 [63553548, 63553829] - | 197515 <NA> A1BG,A1BG,A1BG,... 1 chr19 [63554569, 63554865] - | 197516 <NA> A1BG,A1BG,A1BG,... 1 chr19 [63555461, 63555733] - | 197517 <NA> A1BG,A1BG,A1BG,... 1 chr19 [63556106, 63556375] - | 197518 <NA> A1BG,A1BG,A1BG,... ------- seqinfo: 49 sequences (1 circular) from hg18 genome

Question - why did the symbol column have multiple gene names, when in your hg19 example there is only one gene name ? Is there a way to get the gene name to appear only once (so as "character" class, and not "CharcterList" ?

Thank you !