getSequence with biomaRt

0

Entering edit mode

mpg33@drexel.edu ▴ 90

@mpg33drexeledu-1897

Last seen 9.6 years ago

I am trying to DNA sequence of the upstream regulatory region of a number of genes using the biomaRt package. I start with a list of EntrezGene IDs and would like to extract the sequence from 10Kb upstream of the end of the 5'UTR to the end of the 5'UTR. I wrote a neat little package of scripts to do this with biomaRt and export the data in .FASTA format. I have found that this works well when I search for one gene at a time. But when I input a list of entrez gene ids to the getSequence function it gives me back sequences but the sequences do not always match the answer I get when I search for the gene one at a time. Sending calls to biomaRt one gene at a time will clearly be much slower but fast searches do me little good if I don't know whether the answer is correct or not. The codes I have written are pretty straight forward. The meat of the code is the getSequence function which I call as follows: biomart<- useMart('ensembl') martDataset<-useDataset( dataset,mart=biomart) getSequence(id = '6720', type='entrezgene',seqType = 'coding_gene_flank',upstream = 10000, mart = martDataset) As an example, when I search the 5869 gene by itself, I get the same answer provided by the biomart web based tool. When I search for the 5869 gene in a list like the following list, I get a different sequence. c('8317','1435','1063','4751','3832','3070','5869','675','81624','7249 ','1186','3801','672','1058','22974','23654','4171','1062','3148','400 1','3007','26271','9314') Thanks for any help with this problem. Let me know if you need more info. I am open to other ways of solving this problem without biomaRt getSequence. Thanks, Michael Gormley [[alternative HTML version deleted]]

biomaRt biomaRt • 1.6k views

ADD COMMENT • link updated 12.4 years ago by Marc Carlson ★ 7.2k • written 12.4 years ago by mpg33@drexel.edu ▴ 90

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Hi Michael, I am not actually sure how you would best do this with BiomaRt, but here is how you might do it using one of the TxDb packages: gids <- c('8317','1435','1063','4751','3832') library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene utrs <- fiveUTRsByTranscript(txdb) ## This then gets you a GRangesList object for the entrez gene IDs in gids. ## This object has the ranges for the UTRs that match the different known transcripts for each gene. matches <- utrs[names(utrs) %in% gids] ## At this point you may decide you want more or less UTR sequence. ## You can easily adjust your ranges accordingly if you want something else. ## And once you are happy with your ranges, you can retrieve sequences with the help of a BSGenome package like this: library(BSgenome.Hsapiens.UCSC.hg19) ## Then use getSeq ( help("getSeq,BSgenome-method") ) unlist(matches) seqs <- getSeq(Hsapiens, names=unlist(matches)) ## This gives you a DNAStringSet object seqs Hope this helps, Marc On 12/14/2011 02:14 PM, Michael Gormley wrote: > I am trying to DNA sequence of the upstream regulatory region of a number > of genes using the biomaRt package. I start with a list of EntrezGene IDs > and would like to extract the sequence from 10Kb upstream of the end of the > 5'UTR to the end of the 5'UTR. I wrote a neat little package of scripts to > do this with biomaRt and export the data in .FASTA format. I have found > that this works well when I search for one gene at a time. But when I > input a list of entrez gene ids to the getSequence function it gives me > back sequences but the sequences do not always match the answer I get when > I search for the gene one at a time. Sending calls to biomaRt one gene at > a time will clearly be much slower but fast searches do me little good if I > don't know whether the answer is correct or not. > > The codes I have written are pretty straight forward. The meat of the code > is the getSequence function which I call as follows: > > biomart<- useMart('ensembl') > martDataset<-useDataset( > dataset,mart=biomart) > getSequence(id = '6720', type='entrezgene',seqType = > 'coding_gene_flank',upstream = 10000, mart = martDataset) > > As an example, when I search the 5869 gene by itself, I get the same answer > provided by the biomart web based tool. When I search for the 5869 gene in > a list like the following list, I get a different sequence. > > c('8317','1435','1063','4751','3832','3070','5869','675','81624','72 49','1186','3801','672','1058','22974','23654','4171','1062','3148','4 001','3007','26271','9314') > > > Thanks for any help with this problem. Let me know if you need more info. > I am open to other ways of solving this problem without biomaRt > getSequence. > > Thanks, > Michael Gormley > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.4 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Hi Marc, Thanks for the input. I looked into using the GenomicRanges package but I thought I had to generate my own TxDb and I didn't realize this would be associated with EntrezGene IDs. Using the already generated TxDb from bioconductor makes this much easier. I actually coded up my own version of this using the chromosomal locations in the org.Hs.eg.db package and the BSGenome library to get sequences. I had to fudge the identification of 5'UTRs a bit. I used the getSeq function to extract the entire gene sequence, searched for the first instance of ATG and then adjusted my ranges to get the genomic region upstream of this spot. Obviously not as good as using annotated 5'UTRs but good enough for my purposes. I will implement your method for any future projects. Thanks again, Mike On Thu, Dec 15, 2011 at 6:55 PM, Marc Carlson <mcarlson@fhcrc.org> wrote: > Hi Michael, > > I am not actually sure how you would best do this with BiomaRt, but here > is how you might do it using one of the TxDb packages: > > gids <- c('8317','1435','1063','4751',**'3832') > > library(TxDb.Hsapiens.UCSC.**hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.**knownGene > utrs <- fiveUTRsByTranscript(txdb) > > ## This then gets you a GRangesList object for the entrez gene IDs in gids. > ## This object has the ranges for the UTRs that match the different known > transcripts for each gene. > matches <- utrs[names(utrs) %in% gids] > ## At this point you may decide you want more or less UTR sequence. > ## You can easily adjust your ranges accordingly if you want something > else. > > ## And once you are happy with your ranges, you can retrieve sequences > with the help of a BSGenome package like this: > library(BSgenome.Hsapiens.**UCSC.hg19) > ## Then use getSeq ( help("getSeq,BSgenome-method") ) > unlist(matches) > seqs <- getSeq(Hsapiens, names=unlist(matches)) > ## This gives you a DNAStringSet object > seqs > > > Hope this helps, > > > > Marc > > > > > > On 12/14/2011 02:14 PM, Michael Gormley wrote: > >> I am trying to DNA sequence of the upstream regulatory region of a number >> of genes using the biomaRt package. I start with a list of EntrezGene IDs >> and would like to extract the sequence from 10Kb upstream of the end of >> the >> 5'UTR to the end of the 5'UTR. I wrote a neat little package of scripts >> to >> do this with biomaRt and export the data in .FASTA format. I have found >> that this works well when I search for one gene at a time. But when I >> input a list of entrez gene ids to the getSequence function it gives me >> back sequences but the sequences do not always match the answer I get when >> I search for the gene one at a time. Sending calls to biomaRt one gene at >> a time will clearly be much slower but fast searches do me little good if >> I >> don't know whether the answer is correct or not. >> >> The codes I have written are pretty straight forward. The meat of the >> code >> is the getSequence function which I call as follows: >> >> biomart<- useMart('ensembl') >> martDataset<-useDataset( >> dataset,mart=biomart) >> getSequence(id = '6720', type='entrezgene',seqType = >> 'coding_gene_flank',upstream = 10000, mart = martDataset) >> >> As an example, when I search the 5869 gene by itself, I get the same >> answer >> provided by the biomart web based tool. When I search for the 5869 gene >> in >> a list like the following list, I get a different sequence. >> >> c('8317','1435','1063','4751',**'3832','3070','5869','675','** >> 81624','7249','1186','3801','**672','1058','22974','23654','** >> 4171','1062','3148','4001','**3007','26271','9314') >> >> >> Thanks for any help with this problem. Let me know if you need more info. >> I am open to other ways of solving this problem without biomaRt >> getSequence. >> >> Thanks, >> Michael Gormley >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 12.4 years ago Michael Gormley ▴ 60

Login before adding your answer.