Question

Transcript biotypes for ncRNA in GRCh37 using biomaRt?


sergio.martinezcuesta ▴ 10


Dear all,

I am attempting to retrieve transcript biotypes for ncRNAs in GRCh37 using Bioconductor's biomaRt, as follows:

library(biomaRt)

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", dataset="hsapiens_gene_ensembl")

# biotypes for mRNAs are obtained fine

refseqids_nm = c("NM_152486","NM_080605", "NM_031921")

getBM(attributes=c("refseq_mrna", "transcript_biotype"), filters="refseq_mrna", values=refseqids_nm, mart=ensembl)

#  refseq_mrna transcript_biotype

#1   NM_031921     protein_coding

#2   NM_080605     protein_coding

#3   NM_152486     protein_coding

# However not for ncRNAs

refseqids_nr = c("NR_015434", "NR_036637")

getBM(attributes=c("refseq_ncrna", "transcript_biotype"), filters="refseq_ncrna", values=refseqids_nr, mart=ensembl)

#[1] refseq_ncrna       transcript_biotype

#<0 rows> (or 0-length row.names)

When I try the same as above but with the current release of Ensembl:

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")

getBM(attributes=c("refseq_ncrna", "transcript_biotype"), filters="refseq_ncrna", values=refseqids_nr, mart=ensembl)

#  refseq_ncrna   transcript_biotype

#1    NR_015434            antisense

#2    NR_036637 processed_transcript

Then I get biotypes for ncRNAs just fine.

Perhaps there is something I am missing here. Does GRCh37 have annotations for ncRNAs? If so, any input on how I can obtain transcript biotypes using biomaRt as above?

Thanks,

Sergio

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4 biomaRt_2.30.0   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9          IRanges_2.8.1        XML_3.98-1.5         digest_0.6.12        bitops_1.0-6         DBI_0.6-1            stats4_3.3.2         RSQLite_1.1-2        S4Vectors_0.12.1     tools_3.3.2          Biobase_2.34.0       RCurl_1.95-4.8       parallel_3.3.2       BiocGenerics_0.20.0
[15] AnnotationDbi_1.36.2 memoise_1.1.0

grch37 transcripts biotype biomaRt ncrna

Updated 5.7 years ago by James W. MacDonald • written 5.7 years ago by sergio.martinezcuesta

score 2 · Accepted Answer · 2018-08-17

The different genome builds (GRCh37, GRCh38, etc.) describe the structure of the genome: where different things (genes, exons, CpGs, SNPs, and so on) can be found. They have almost nothing to do with what those things do for a living, such as whether or not a gene is transcribed, or what pathway it might be in.

Each genome build is worked on until it is released, after which most further work goes towards the next build. The builds are therefore discrete, are tagged with the date they were released, and remain more or less static from then on. The other annotation data are in a constant state of flux and are not tied to a particular genome build. These other annotations (RefSeq, for example) are updated quite regularly, with RefSeq being updated weekly. These updates include additions, deletions and merging of identifiers that tag the same thing.

In addition, RefSeq is maintained by NCBI, whereas the Ensembl BioMart is maintained by EBI/EMBL.

So the query you are attempting doesn't really make sense. You are asking the European annotation service to tell you the transcript biotype of some NCBI RefSeq IDs, using an archived version of their data that has primarily been archived because of its structural component. But the transcript biotype isn't dependent on the genome build, so there's no profit in using the archived data unless you also need positional information. In addition, you are assuming that EBI/EMBL and NCBI agree on what the underlying transcript identified in RefSeq as NR_015434 does, which may or may not be true.

Unfortunately, there isn't an easy way to get the transcript biotype from NCBI (you could hypothetically use the reutils package to do so, but it's non-trivial), so maybe the best idea is to simply make the query on the current Ensembl release and ignore the NCBI/Ensembl mapping issues.
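
If you do also need GRCh37 positions, one possible approach is to take the biotypes from the current release and bridge to the archive through Ensembl transcript IDs rather than RefSeq IDs. The following is only a rough sketch: the helper names are made up for illustration and are not part of biomaRt, and it assumes that the transcript IDs returned by the current release also exist in the GRCh37 archive, which is not guaranteed.

```r
# Sketch only: the three helper names below are hypothetical, not biomaRt API.

# Biotypes (plus Ensembl transcript IDs) for RefSeq ncRNA IDs on a given mart.
ncrna_biotypes <- function(ids, mart) {
  biomaRt::getBM(
    attributes = c("refseq_ncrna", "ensembl_transcript_id", "transcript_biotype"),
    filters = "refseq_ncrna", values = ids, mart = mart)
}

# Requested IDs that came back without any row (e.g. unmapped in that release).
unmapped_ids <- function(ids, result) setdiff(ids, result$refseq_ncrna)

# The live queries need network access, so they are wrapped in a function.
biotypes_with_grch37_coords <- function(ids) {
  current <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL",
                              dataset = "hsapiens_gene_ensembl")
  grch37 <- biomaRt::useMart("ENSEMBL_MART_ENSEMBL",
                             host = "grch37.ensembl.org",
                             dataset = "hsapiens_gene_ensembl")
  bt <- ncrna_biotypes(ids, current)
  if (nrow(bt) == 0) return(bt)  # nothing mapped, so nothing to look up
  # Assumption: these transcript IDs are also present in the GRCh37 archive.
  pos <- biomaRt::getBM(
    attributes = c("ensembl_transcript_id", "chromosome_name",
                   "transcript_start", "transcript_end"),
    filters = "ensembl_transcript_id", values = bt$ensembl_transcript_id,
    mart = grch37)
  # all.x = TRUE keeps the biotype rows even when no GRCh37 position is found.
  merge(bt, pos, by = "ensembl_transcript_id", all.x = TRUE)
}
```

Calling biotypes_with_grch37_coords(c("NR_015434", "NR_036637")) would then give the current biotypes, with NA in the position columns for any transcript the archive does not know about, and unmapped_ids() tells you which RefSeq IDs returned nothing at all. Depending on your biomaRt version, the archive host may need to be given as a full URL, or you may prefer useEnsembl() with its GRCh argument.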