retrieving genomic sequences with biomaRt
1
0
Entering edit mode
@jdelasherasedacuk-1189
Last seen 9.3 years ago
United Kingdom
Hi, I am putting a little script together to look at promoter regions, search for CpG islands etc, given a list of gene ids. I thought I would be able to use biomaRt to retrieve the sequence of interest, but I just can't find how right now... either I'm too tired (possible) or I'm missing something and it's just not possible. In particular 'getSequence' looked promising, but no luck so far. What I want to do is: 1) I have a gene id, from it I want to locate the chromosome location of the gene, the start of transcription. 2) knowing the transcriptional start, I want to retrieve the genomic sequence around that area... say 2000bp upstream and 1000 downstream. Anybody knows if this is possible and/or could give me any pointers? Thank you! Jose -- Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374 Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360 Swann Building, Mayfield Road University of Edinburgh Edinburgh EH9 3JR UK
Transcription biomaRt Transcription biomaRt • 1.5k views
ADD COMMENT
0
Entering edit mode
@steffen-durinck-1780
Last seen 10.2 years ago
Hi Jose, This should be possible with biomaRt, I would recommend using it in MySQL mode as the web service sometimes hangs for sequences. Which type of identifiers do you have? As an example with Entrez Gene identifiers you could do: library(biomaRt) ensmart=useMart("ensembl",dataset="hsapiens_gene_ensembl", mysql=TRUE) entrez = c("100","330") gene = getGene(id=entrez, type="entrezgene", mart=ensmart) getSequence(chromosome = gene$chromosome, start = gene$start - 2000, end = gene$end + 1000, mart=ensmart) Note that you'll have to make it a little bit more complicated than this as you have to take into account the strand on which the genes are located. The strand is also returned by the getGene function. best, Steffen J.delasHeras at ed.ac.uk wrote: > Hi, > > I am putting a little script together to look at promoter regions, > search for CpG islands etc, given a list of gene ids. > > I thought I would be able to use biomaRt to retrieve the sequence of > interest, but I just can't find how right now... either I'm too tired > (possible) or I'm missing something and it's just not possible. In > particular 'getSequence' looked promising, but no luck so far. > > What I want to do is: > 1) I have a gene id, from it I want to locate the chromosome location > of the gene, the start of transcription. > 2) knowing the transcriptional start, I want to retrieve the genomic > sequence around that area... say 2000bp upstream and 1000 downstream. > > Anybody knows if this is possible and/or could give me any pointers? > > Thank you! > > Jose > > -- Steffen Durinck, Ph.D. Oncogenomics Section Pediatric Oncology Branch National Cancer Institute, National Institutes of Health URL: http://home.ccr.cancer.gov/oncology/oncogenomics/ Phone: 301-402-8103 Address: Advanced Technology Center, 8717 Grovemont Circle Gaithersburg, MD 20877
ADD COMMENT
0
Entering edit mode
Hi Steffen, that is more or less what I tried, but I get the following error after the call to 'getSequence' "Error in inherits(x, "factor") : argument "seqType" is missing, with no default" The only thing I seem to do different is that I didn't indicate mysql=TRUE because I haven't installed the package yet... I will, 'though, but I don't suppose that's the cause of the error? Thanks for your help, Steffen! Jose Quoting Steffen Durinck <durincks at="" mail.nih.gov="">: > Hi Jose, > > This should be possible with biomaRt, I would recommend using it in > MySQL mode as the web service sometimes hangs for sequences. > > Which type of identifiers do you have? > > As an example with Entrez Gene identifiers you could do: > > library(biomaRt) > ensmart=useMart("ensembl",dataset="hsapiens_gene_ensembl", mysql=TRUE) > entrez = c("100","330") > gene = getGene(id=entrez, type="entrezgene", mart=ensmart) > getSequence(chromosome = gene$chromosome, start = gene$start - 2000, end > = gene$end + 1000, mart=ensmart) > > Note that you'll have to make it a little bit more complicated than this > as you have to take into account the strand on which the genes are > located. The strand is also returned by the getGene function. > > best, > Steffen > > J.delasHeras at ed.ac.uk wrote: >> Hi, >> >> I am putting a little script together to look at promoter regions, >> search for CpG islands etc, given a list of gene ids. >> >> I thought I would be able to use biomaRt to retrieve the sequence of >> interest, but I just can't find how right now... either I'm too >> tired (possible) or I'm missing something and it's just not >> possible. In particular 'getSequence' looked promising, but no luck >> so far. >> >> What I want to do is: >> 1) I have a gene id, from it I want to locate the chromosome >> location of the gene, the start of transcription. >> 2) knowing the transcriptional start, I want to retrieve the genomic >> sequence around that area... say 2000bp upstream and 1000 downstream. >> >> Anybody knows if this is possible and/or could give me any pointers? >> >> Thank you! >> >> Jose >> >> > > > -- > Steffen Durinck, Ph.D. > > Oncogenomics Section > Pediatric Oncology Branch > National Cancer Institute, National Institutes of Health > URL: http://home.ccr.cancer.gov/oncology/oncogenomics/ > > Phone: 301-402-8103 > Address: > Advanced Technology Center, > 8717 Grovemont Circle > Gaithersburg, MD 20877 > -- Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374 Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360 Swann Building, Mayfield Road University of Edinburgh Edinburgh EH9 3JR UK
ADD REPLY
0
Entering edit mode
Hi Jose, For one biomaRt function, the implementation in MySQL or web service mode is not the same: getSequence. If you currently use getSequence in MySQL mode you need to give a chromosome name + start and end positions of the sequences you want to retrieve, and you can not use the seqType argument as 'only' genomic sequences (regardless of their annotation) are returned. If you use getSequence in web service mode, you have to use the seqType argument as the Ensembl BioMart web service restricts sequence retrieval to annotated sequences such as either protein, 5'UTR, 3'UTR, ... (full list see MartView of Ensembl). The result of this query will be a data.frame with all sequences that have this annotation (e.g. 5'UTR) and are located on the given chromosome between the given start and end positions. This should be better documented in the getSequence function and the getSequence function should throw a proper error message. I will make these fixes asap. Cheers, Steffen J.delasHeras at ed.ac.uk wrote: > > Hi Steffen, > > that is more or less what I tried, but I get the following error after > the call to 'getSequence' > > "Error in inherits(x, "factor") : argument "seqType" is missing, with > no default" > > The only thing I seem to do different is that I didn't indicate > mysql=TRUE because I haven't installed the package yet... I will, > 'though, but I don't suppose that's the cause of the error? > > Thanks for your help, Steffen! > > Jose > > Quoting Steffen Durinck <durincks at="" mail.nih.gov="">: > >> Hi Jose, >> >> This should be possible with biomaRt, I would recommend using it in >> MySQL mode as the web service sometimes hangs for sequences. >> >> Which type of identifiers do you have? >> >> As an example with Entrez Gene identifiers you could do: >> >> library(biomaRt) >> ensmart=useMart("ensembl",dataset="hsapiens_gene_ensembl", mysql=TRUE) >> entrez = c("100","330") >> gene = getGene(id=entrez, type="entrezgene", mart=ensmart) >> getSequence(chromosome = gene$chromosome, start = gene$start - 2000, end >> = gene$end + 1000, mart=ensmart) >> >> Note that you'll have to make it a little bit more complicated than this >> as you have to take into account the strand on which the genes are >> located. The strand is also returned by the getGene function. >> >> best, >> Steffen >> >> J.delasHeras at ed.ac.uk wrote: >>> Hi, >>> >>> I am putting a little script together to look at promoter regions, >>> search for CpG islands etc, given a list of gene ids. >>> >>> I thought I would be able to use biomaRt to retrieve the sequence of >>> interest, but I just can't find how right now... either I'm too >>> tired (possible) or I'm missing something and it's just not >>> possible. In particular 'getSequence' looked promising, but no luck >>> so far. >>> >>> What I want to do is: >>> 1) I have a gene id, from it I want to locate the chromosome >>> location of the gene, the start of transcription. >>> 2) knowing the transcriptional start, I want to retrieve the genomic >>> sequence around that area... say 2000bp upstream and 1000 downstream. >>> >>> Anybody knows if this is possible and/or could give me any pointers? >>> >>> Thank you! >>> >>> Jose >>> >>> >> >> >> -- >> Steffen Durinck, Ph.D. >> >> Oncogenomics Section >> Pediatric Oncology Branch >> National Cancer Institute, National Institutes of Health >> URL: http://home.ccr.cancer.gov/oncology/oncogenomics/ >> >> Phone: 301-402-8103 >> Address: >> Advanced Technology Center, >> 8717 Grovemont Circle >> Gaithersburg, MD 20877 >> > > > -- Steffen Durinck, Ph.D. Oncogenomics Section Pediatric Oncology Branch National Cancer Institute, National Institutes of Health URL: http://home.ccr.cancer.gov/oncology/oncogenomics/ Phone: 301-402-8103 Address: Advanced Technology Center, 8717 Grovemont Circle Gaithersburg, MD 20877
ADD REPLY

Login before adding your answer.

Traffic: 1006 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6