BiomaRt & retrieving flanking regions
1
0
Entering edit mode
@greg-slodkowicz-6123
Last seen 9.6 years ago
Dear all, I'm having some trouble fetching flanking regions for a gene from the Ensembl Biomart: library(biomaRt) ensembl <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="www.ensembl.org", path="/biomart/martservice") attrs <- c("ensembl_gene_id", "upstream_flank") gene_ann <- getBM(attributes=attrs, filters=list(ensembl_gene_id="ENSG00000165702", upstream_flank=100), mart=ensembl, checkFilters=F) gene_ann (here's a Gist: I also tried passing the parameters differently but it doesn't make any difference: gene_ann <- getBM(attributes=attrs, filters=c("ensembl_gene_id", "upstream_flank"), values=list(ensembl_gene_id="ENSG00000165702", upstream_flank=100), mart=ensembl, checkFilters=F) Am I doing something wrong? Best, Greg -- Greg Slodkowicz PhD student, Nick Goldman group European Bioinformatics Institute (EMBL-EBI) [[alternative HTML version deleted]]
• 1.4k views
ADD COMMENT
0
Entering edit mode
@greg-slodkowicz-6123
Last seen 9.6 years ago
I should've probably mentioned the exact error I'm getting: Error in getBM(attributes = attrs, filters = list(ensembl_gene_id = "ENSG00000165702", : Query ERROR: caught BioMart::Exception: non-BioMart die(): Can't use an undefined value as an ARRAY reference at /ensemblweb/wwwmart/www_74/biomart- perl/lib/BioMart/Dataset/GenomicSequence.pm line 396. G. On Sat, Dec 28, 2013 at 1:42 PM, Greg Slodkowicz <gregs@ebi.ac.uk> wrote: > Dear all, > I'm having some trouble fetching flanking regions for a gene from the > Ensembl Biomart: > > library(biomaRt) > ensembl <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", > dataset="hsapiens_gene_ensembl", > host="www.ensembl.org", path="/biomart/martservice") > > attrs <- c("ensembl_gene_id", "upstream_flank") > gene_ann <- getBM(attributes=attrs, > filters=list(ensembl_gene_id="ENSG00000165702", upstream_flank=100), > mart=ensembl, checkFilters=F) > gene_ann > > (here's a Gist: > > I also tried passing the parameters differently but it doesn't make any > difference: > > gene_ann <- getBM(attributes=attrs, filters=c("ensembl_gene_id", > "upstream_flank"), values=list(ensembl_gene_id="ENSG00000165702", > upstream_flank=100), mart=ensembl, checkFilters=F) > > Am I doing something wrong? > > Best, > Greg > > -- > Greg Slodkowicz > PhD student, Nick Goldman group > European Bioinformatics Institute (EMBL-EBI) > -- Greg Slodkowicz PhD student, Nick Goldman group European Bioinformatics Institute (EMBL-EBI) [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Hi Greg, Not sure exactly why you're getting this error. FWIW this can also be done with the GenomicFeatures and BSgenome packages: library(GenomicFeatures) txdb <- makeTranscriptDbFromBiomart("ensembl", "hsapiens_gene_ensembl") gn <- genes(txdb) Then: > gn["ENSG00000165702"] GRanges with 1 range and 1 metadata column: seqnames ranges strand | gene_id <rle> <iranges> <rle> | <characterlist> ENSG00000165702 9 [135820932, 135867083] + | ENSG00000165702 --- seqlengths: 1 2 ... LRG_98 LRG_99 249250621 243199373 ... 18750 13294 Obtain the ranges of the flanking regions with flank(). By default flank() returns the upstream flanks: > gnflanks <- flank(gn, width=100) > gnflanks["ENSG00000165702"] GRanges with 1 range and 1 metadata column: seqnames ranges strand | gene_id <rle> <iranges> <rle> | <characterlist> ENSG00000165702 9 [135820832, 135820931] + | ENSG00000165702 --- seqlengths: 1 2 ... LRG_98 LRG_99 249250621 243199373 ... 18750 13294 To extract the corresponding sequence, we can use the BSgenome package for hg19. This assembly is compatible with the GRCh37.p12 assembly for all the main chromosomes *except* for chrM. Also we'll need to adjust the names of the chromosomes because Ensembl/NCBI and UCSC use different naming conventions: > library(BSgenome.Hsapiens.UCSC.hg19) > head(seqlevels(Hsapiens)) [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" Keep only flank regions for genes located on chromosomes 1 to 22, X and Y, and rename the chromosomes: seqlevels(gnflanks, force=TRUE) <- c(1:22, "X", "Y") seqlevels(gnflanks) <- paste0("chr", seqlevels(gnflanks)) Finally, use getSeq() to extract the DNA sequence: > getSeq(Hsapiens, gnflanks["ENSG00000165702"]) A DNAStringSet instance of length 1 width seq [1] 100 TAATGACATCTGTGACGTGAAGAATCAAGGAGAT...GCCTAGTAAATGCTCACTCACTAGATAGGTGGC Note that getSeq() can extract more than 1 sequence at once: > my_genes <- c("ENSG00000165702", "ENSG00000252380", "ENSG00000200999") > getSeq(Hsapiens, gnflanks[my_genes]) A DNAStringSet instance of length 3 width seq [1] 100 TAATGACATCTGTGACGTGAAGAATCAAGGAGAT...GCCTAGTAAATGCTCACTCACTAGATAGGTGGC [2] 100 AGTCAAGGTGCTAGTTTGTGATGGAGCGGCAGGC...TCAACTAAATAAAACTAGCAAACTAGCTACAAA [3] 100 CTTACATGCAAGAAAACTTCTAAGAAAACTTATA...AGAAAAAGCAGAATGAGCATCAGTAAATATTTA Cheers, H. On 12/28/2013 04:43 AM, Greg Slodkowicz wrote: > I should've probably mentioned the exact error I'm getting: > > Error in getBM(attributes = attrs, filters = list(ensembl_gene_id = > "ENSG00000165702", : > Query ERROR: caught BioMart::Exception: non-BioMart die(): Can't use an > undefined value as an ARRAY reference at > /ensemblweb/wwwmart/www_74/biomart- perl/lib/BioMart/Dataset/GenomicSequence.pm > line 396. > > G. > > > On Sat, Dec 28, 2013 at 1:42 PM, Greg Slodkowicz <gregs at="" ebi.ac.uk=""> wrote: > >> Dear all, >> I'm having some trouble fetching flanking regions for a gene from the >> Ensembl Biomart: >> >> library(biomaRt) >> ensembl <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", >> dataset="hsapiens_gene_ensembl", >> host="www.ensembl.org", path="/biomart/martservice") >> >> attrs <- c("ensembl_gene_id", "upstream_flank") >> gene_ann <- getBM(attributes=attrs, >> filters=list(ensembl_gene_id="ENSG00000165702", upstream_flank=100), >> mart=ensembl, checkFilters=F) >> gene_ann >> >> (here's a Gist: >> >> I also tried passing the parameters differently but it doesn't make any >> difference: >> >> gene_ann <- getBM(attributes=attrs, filters=c("ensembl_gene_id", >> "upstream_flank"), values=list(ensembl_gene_id="ENSG00000165702", >> upstream_flank=100), mart=ensembl, checkFilters=F) >> >> Am I doing something wrong? >> >> Best, >> Greg >> >> -- >> Greg Slodkowicz >> PhD student, Nick Goldman group >> European Bioinformatics Institute (EMBL-EBI) >> > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Hi Hervé, Thanks for a really comprehensive answer! I actually need to fetch these from many genomes (orthologs for ~30 species) so in the meantime I fell back on PyCogent and it seems to work. Best, Greg On Sun, Dec 29, 2013 at 8:53 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Greg, > > Not sure exactly why you're getting this error. FWIW this can > also be done with the GenomicFeatures and BSgenome packages: > > library(GenomicFeatures) > txdb <- makeTranscriptDbFromBiomart("ensembl", "hsapiens_gene_ensembl") > gn <- genes(txdb) > > Then: > > > gn["ENSG00000165702"] > GRanges with 1 range and 1 metadata column: > seqnames ranges strand | gene_id > <rle> <iranges> <rle> | > <characterlist> > ENSG00000165702 9 [135820932, 135867083] + | > ENSG00000165702 > --- > seqlengths: > 1 2 ... LRG_98 LRG_99 > 249250621 243199373 ... 18750 13294 > > Obtain the ranges of the flanking regions with flank(). By default > flank() returns the upstream flanks: > > > gnflanks <- flank(gn, width=100) > > > gnflanks["ENSG00000165702"] > GRanges with 1 range and 1 metadata column: > seqnames ranges strand | gene_id > <rle> <iranges> <rle> | > <characterlist> > ENSG00000165702 9 [135820832, 135820931] + | > ENSG00000165702 > --- > seqlengths: > 1 2 ... LRG_98 LRG_99 > 249250621 243199373 ... 18750 13294 > > To extract the corresponding sequence, we can use the BSgenome > package for hg19. This assembly is compatible with the GRCh37.p12 > assembly for all the main chromosomes *except* for chrM. Also we'll > need to adjust the names of the chromosomes because Ensembl/NCBI > and UCSC use different naming conventions: > > > library(BSgenome.Hsapiens.UCSC.hg19) > > head(seqlevels(Hsapiens)) > [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" > > Keep only flank regions for genes located on chromosomes 1 to 22, X > and Y, and rename the chromosomes: > > seqlevels(gnflanks, force=TRUE) <- c(1:22, "X", "Y") > seqlevels(gnflanks) <- paste0("chr", seqlevels(gnflanks)) > > Finally, use getSeq() to extract the DNA sequence: > > > getSeq(Hsapiens, gnflanks["ENSG00000165702"]) > A DNAStringSet instance of length 1 > width seq > [1] 100 TAATGACATCTGTGACGTGAAGAATCAAGGAGAT... > GCCTAGTAAATGCTCACTCACTAGATAGGTGGC > > Note that getSeq() can extract more than 1 sequence at once: > > > my_genes <- c("ENSG00000165702", "ENSG00000252380", "ENSG00000200999") > > > getSeq(Hsapiens, gnflanks[my_genes]) > A DNAStringSet instance of length 3 > width seq > [1] 100 TAATGACATCTGTGACGTGAAGAATCAAGGAGAT... > GCCTAGTAAATGCTCACTCACTAGATAGGTGGC > [2] 100 AGTCAAGGTGCTAGTTTGTGATGGAGCGGCAGGC... > TCAACTAAATAAAACTAGCAAACTAGCTACAAA > [3] 100 CTTACATGCAAGAAAACTTCTAAGAAAACTTATA... > AGAAAAAGCAGAATGAGCATCAGTAAATATTTA > > Cheers, > H. > > > > On 12/28/2013 04:43 AM, Greg Slodkowicz wrote: > >> I should've probably mentioned the exact error I'm getting: >> >> Error in getBM(attributes = attrs, filters = list(ensembl_gene_id = >> "ENSG00000165702", : >> Query ERROR: caught BioMart::Exception: non-BioMart die(): Can't use an >> undefined value as an ARRAY reference at >> /ensemblweb/wwwmart/www_74/biomart-perl/lib/BioMart/ >> Dataset/GenomicSequence.pm >> line 396. >> >> G. >> >> >> On Sat, Dec 28, 2013 at 1:42 PM, Greg Slodkowicz <gregs@ebi.ac.uk> wrote: >> >> Dear all, >>> I'm having some trouble fetching flanking regions for a gene from the >>> Ensembl Biomart: >>> >>> library(biomaRt) >>> ensembl <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", >>> dataset="hsapiens_gene_ensembl", >>> host="www.ensembl.org", path="/biomart/martservice") >>> >>> attrs <- c("ensembl_gene_id", "upstream_flank") >>> gene_ann <- getBM(attributes=attrs, >>> filters=list(ensembl_gene_id="ENSG00000165702", upstream_flank=100), >>> mart=ensembl, checkFilters=F) >>> gene_ann >>> >>> (here's a Gist: >>> >>> I also tried passing the parameters differently but it doesn't make any >>> difference: >>> >>> gene_ann <- getBM(attributes=attrs, filters=c("ensembl_gene_id", >>> "upstream_flank"), values=list(ensembl_gene_id="ENSG00000165702", >>> upstream_flank=100), mart=ensembl, checkFilters=F) >>> >>> Am I doing something wrong? >>> >>> Best, >>> Greg >>> >>> -- >>> Greg Slodkowicz >>> PhD student, Nick Goldman group >>> European Bioinformatics Institute (EMBL-EBI) >>> >>> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Greg Slodkowicz PhD student, Nick Goldman group European Bioinformatics Institute (EMBL-EBI) [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 783 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6