Extracting all possible annotations between two genomic coordinates
KB ▴ 50
@k-8495
United States

Hello,

I have a list of several genomic regions (each has a start and an end base pair). E.g.: Chr: 17, BasePair1: 26804211, BasePair2: 26818676

I am looking to find all available annotations, including gene names, known SNPs, microRNAs, etc., between these two base pair positions (the region need not be exonic; it can be anywhere in the genome).

I'm trying to sort of get a comprehensive understanding of the region from the annotation.

I can create a GRanges object with my input, and can start with the "TxDb.Hsapiens.UCSC.hg19.knownGene" to get gene names.

Can anyone suggest similar packages for retrieving other kinds of annotation?

Thanks, K

Mike Smith ★ 6.6k
@mike-smith
EMBL Heidelberg

You could try using the biomaRt package to query the various Ensembl databases.

For example, to find genes in a region, and their GC content, you can do something like this:

library(biomaRt)
# Connect to the Ensembl genes mart for human
genes_mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
# Query with the mart object created above (genes_mart)
getBM(mart = genes_mart,
      attributes = c('ensembl_gene_id', 'percentage_gc_content'),
      filters = c('chromosomal_region'),
      values = "17:26804211-26818676")

To retrieve SNP information on a region, you have to query a different dataset, but it would be something like this:

snp_mart = useMart(biomart = "ENSEMBL_MART_SNP", dataset="hsapiens_snp",
                  host = "asia.ensembl.org")
getBM(mart = snp_mart,
      attributes = c('refsnp_id', 'allele'),
      filters = c('chromosomal_region'),
      values = "17:26804211-26818676")

Your example region appears to be in the centromere of chromosome 17, so I wouldn't expect to find much annotation there, but if that just happens to be an unfortunate example, this approach should still be useful for other regions.  I would recommend reading the biomaRt vignette to get a better idea of what you can do with the package, and looking at the listAttributes() function to see what information is available for a particular dataset.
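Since the question mentions a list of regions, here is a minimal base R sketch for turning a table of coordinates into the "chr:start-end" strings used for the chromosomal_region filter above. The helper name region_strings() and the column names are hypothetical, not part of biomaRt, and the sketch assumes the filter accepts a vector of such strings.

```r
# Hypothetical helper (not part of biomaRt): build "chr:start-end" strings
# in the same format as the 'values' argument used in the queries above.
region_strings <- function(chr, start, end) {
  stopifnot(length(chr) == length(start), length(start) == length(end))
  stopifnot(all(start <= end))
  # Ensembl chromosome names have no "chr" prefix, so strip it if present
  chr <- sub("^chr", "", as.character(chr), ignore.case = TRUE)
  # %.0f avoids scientific notation for large coordinates
  sprintf("%s:%.0f-%.0f", chr, as.numeric(start), as.numeric(end))
}

# The region from the question; further rows could be added the same way
regions <- data.frame(chr = "chr17", start = 26804211, end = 26818676)
region_values <- region_strings(regions$chr, regions$start, regions$end)
print(region_values)

# Assumed usage (needs biomaRt and a network connection, so left commented):
# getBM(mart = genes_mart,
#       attributes = c('ensembl_gene_id', 'percentage_gc_content'),
#       filters = c('chromosomal_region'),
#       values = region_values)
```

If passing several regions at once does not behave as expected, looping over region_values and calling getBM() once per region is a simple fallback.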

Quick follow up question

I believe the command below connects to the latest genome reference, GRCh38.

genes_mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")

Could you tell me how to switch this to the GRCh37 version of Ensembl?

I believe this is it:

grch37 = useEnsembl(biomart="ensembl",GRCh=37,dataset="hsapiens_gene_ensembl")

Yes, that should do it.  You can also do this using the standard useMart() function, specifying one of the Ensembl archives as the host.  This is a little more flexible than useEnsembl().  So for GRCh37 you would use:

useMart(biomart = 'ENSEMBL_MART_ENSEMBL', 
        dataset="hsapiens_gene_ensembl", 
        host = "grch37.ensembl.org")

You could also query the oldest archive available (from May 2009) with:

useMart(biomart = 'ENSEMBL_MART_ENSEMBL', 
        dataset="hsapiens_gene_ensembl", 
        host = "may2009.archive.ensembl.org")
KB ▴ 50

Thank you! This is helpful. I will start with biomaRt; that's a good starting point.
