Access Ensembl GRCH38 Version 87 Using BiomaRt
Entering edit mode
Last seen 6 months ago
United States

I am using a dataset that was annotated using Ensembl version 87 of human reference genome version 38. I wrote an R script to access certain values for that data (FASTA sequences for the protein-coding genes). However it appears that biomaRt does not have access to an archived version of that reference genome. Is there a way around this? Like another source for the annotated reference genome? I see that version 92 of the reference genome has been archived, but I am hesitant to use that version because I don't want my transcripts to get mistaken for other transcripts. Here is my code.

getFASTAseq <- function(data){
    ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "87")
    sequences = getSequence(id = data[2:length(data)], type='ensembl_transcript_id', seqType='peptide', mart = ensembl)
    # gets the sequences for the transcripts
    exportFASTA( sequences, file=data[1] )
biomaRt ensembldb RNA ensembl • 455 views
Entering edit mode

You could use the EnsDb annotation resource for Ensembl 87. EnsDb annotation databases from the ensembldb package contain all annotations (genes, transcripts, exons, proteins) for a particular Ensembl release and are fully compatible with the TxDb databases from the GenomicFeatures package. You can get these data resources easily from AnnotationHub.

To get the annotation resource for Ensembl 87:

> library(AnnotationHub)
> ah <- AnnotationHub()
> query(ah, "EnsDb.Hsapiens.v87")
AnnotationHub with 1 record
# snapshotDate(): 2022-10-31
# names(): AH53211
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-02-07
# $title: Ensembl 87 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl:
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
#   "Annotation", "87", "AHEnsDbs") 
# retrieve record with 'object[["AH53211"]]' 
> edb <- ah[["AH53211"]]

you can then query this edb object to retrieve e.g. protein sequences for all genes.

> prots <- proteins(edb)
> head(prots)
DataFrame with 6 rows and 3 columns
              tx_id      protein_id       protein_sequence
        <character>     <character>            <character>
1   ENST00000373020 ENSP00000362111 MASPSRRLQTKPVITCFKSV..
2   ENST00000612152 ENSP00000482130 MLKLYAMFLTLVFLVELVAA..
...             ...             ...                    ...
5   ENST00000371588 ENSP00000360644 MASLEVSRSPRRSRRELEVR..
6   ENST00000371582 ENSP00000360638 MASLEVSRSPRRSRRELEVR..

If you want to get the sequences only for certain genes:

> proteins(edb, filter = ~ gene_name %in% c("BCL2", "NR3C1"))
DataFrame with 16 rows and 4 columns
              tx_id      protein_id       protein_sequence   gene_name
        <character>     <character>            <character> <character>
1   ENST00000398117 ENSP00000381185 MAHAGRTGYDNREIVMKYIH..        BCL2
2   ENST00000333681 ENSP00000329623 MAHAGRTGYDNREIVMKYIH..        BCL2
...             ...             ...                    ...         ...
15  ENST00000343796 ENSP00000343205 MDSKESLTPGREENPSSVLA..       NR3C1
16  ENST00000424646 ENSP00000405282 MDSKESLTPGREENPSSVLA..       NR3C1

if you want filter on transcript identifiers you would simply use filter = ~ tx_id %in% ... in the query above.

Hope this helps.

cheers, jo

Entering edit mode
Last seen 8 hours ago
United States

You could just get the peptide FASTA from Ensembl directly and parse out what you need.

Login before adding your answer.

Traffic: 350 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6