retrieving mRNA sequences via biomaRt


Simon ▴ 30

@simon-3613

Last seen 9.6 years ago

Hello everybody,

I am trying to solve the following tasks as a first contact with the Bioconductor project:

Task 1: find

* mRNA sequence (5'UTR, coding region, 3'UTR)
* position of start codon in sequence
* position of stop codon in sequence
* ID (Which ID(s) would I choose to reference my sequence hits? EMBL, Ensembl transcript ID, Entrez Gene ID, RefSeq, etc.?)
* name of associated protein product

where

* origin is human (Entrez search would be: human[ORGN])
* sequence is an mRNA transcript (Entrez search for molecule type: biomol_mRNA[PROP]?)
* mRNA sequence length is 3000 to 5000 nt (Entrez search for sequence length: 3000:5000[SLEN])
* coding region of the mRNA is 2000 to 3000 nt long (Entrez search field for start and stop of coding region: start:stop[CDS])

Task 2: store the retrieved information to file for the first 200 hits. (Which would be a suitable file format?)

I started by playing around with the biomaRt package for R, but I got overwhelmed by its many possibilities.

I would be glad to get any feedback on how to start or even solve my tasks.

Best regards,
Simon

GLAD biomaRt • 3.1k views

Updated 14.7 years ago by Wolfgang Huber ★ 13k • written 14.7 years ago by Simon ▴ 30


Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 16 days ago

EMBL European Molecular Biology Laborat…

Hi Simon,

with all respect, for a first contact with the Bioconductor project I'd also recommend studying some of the documentation.

A (slightly biased) set of points to start with are the "Bioconductor Case Studies" book by Hahne, Huber, Gentleman, Falcon and the paper "Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt." by Durinck et al. in Nature Protocols 2009;4(8):1184-91.

Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber

Written 14.7 years ago by Wolfgang Huber ★ 13k


Thanks for the recommendation.

So far, I have only read Steffen's and your biomaRt user's guide and had a look at the BioMart 0.7 documentation, since I needed quick results. I'm going to have a look at the recommended book and paper now.

In the meantime, I got to a solution, but not a very satisfying one:

    ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)

    myAttributes = c("embl", "cdna", "5utr", "coding", "3utr", "5_utr_end",
                     "3_utr_start", "sequence_cdna_length", "cds_length")

    ...

    qresult = getBM(attributes=myAttributes,
                    filters=...,
                    values=...,
                    mart=ensembl)

    finalResult = mySeqCdsLengthFilter(qresult, c(3000, 5000), c(2000, 3000))

For now, I filter my query results manually, using the values of "sequence_cdna_length" and "cds_length" as limits. I wish these attributes were filters, or that there were a BioMart database I could use in a linked query via getLDS.

I'm still curious about a smarter solution.

Best regards,
Simon

Written 14.7 years ago by Simon ▴ 30
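The manual length filtering described in the post above can be done in a few lines of base R once the query result is in a data frame. The sketch below is one possible shape for such a helper. The body of `mySeqCdsLengthFilter` was not shown in the thread, so this implementation, its argument names, and the toy data frame standing in for a `getBM()` result are all assumptions; only the column names `sequence_cdna_length` and `cds_length` come from the post.

```r
# Hypothetical implementation of the helper named in the post; the original
# body was not shown. Keeps rows whose cDNA and CDS lengths both fall inside
# the given inclusive ranges and drops rows with missing lengths.
mySeqCdsLengthFilter <- function(qresult, cdnaRange, cdsRange) {
  cdnaLen <- as.numeric(qresult$sequence_cdna_length)
  cdsLen  <- as.numeric(qresult$cds_length)
  keep <- !is.na(cdnaLen) & !is.na(cdsLen) &
    cdnaLen >= cdnaRange[1] & cdnaLen <= cdnaRange[2] &
    cdsLen  >= cdsRange[1]  & cdsLen  <= cdsRange[2]
  qresult[keep, , drop = FALSE]
}

# Toy stand-in for a getBM() result (made-up IDs and lengths).
qresult <- data.frame(
  ensembl_transcript_id = c("T1", "T2", "T3", "T4"),
  sequence_cdna_length  = c(3500, 2500, 4800, 4000),
  cds_length            = c(2400, 2100, 3200, NA),
  stringsAsFactors      = FALSE
)

finalResult <- mySeqCdsLengthFilter(qresult, c(3000, 5000), c(2000, 3000))

# Task 2 asks for the first 200 hits only.
first200 <- head(finalResult, 200)
```

Filtering after the query, as here, means the full result set is downloaded first; that matches the workaround in the post and is not a server-side filter.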


Hi Simon,

The cdna attribute is the combination of 5utr + coding + 3utr, so you can remove 5utr, coding and 3utr from your list of attributes to retrieve. I would take ensembl_transcript_id instead of embl.

Cheers,
Steffen

Written 14.7 years ago by steffen@stat.Berkeley.EDU ▴ 600
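Because the cDNA is the concatenation 5'UTR + coding + 3'UTR, the start and stop codon positions asked for in Task 1 follow from the lengths of the pieces. A minimal base R sketch, assuming the three parts are available as plain character strings and that the coding sequence ends with its stop codon; the function name `codonPositions` and the toy sequences are made up for illustration and are not part of biomaRt.

```r
# Hypothetical helper: derive 1-based codon positions within the cDNA from
# the three parts. Assumes `coding` starts with the start codon and ends
# with the stop codon.
codonPositions <- function(utr5, coding, utr3) {
  cdna     <- paste0(utr5, coding, utr3)
  startPos <- nchar(utr5) + 1L                  # first base of the start codon
  stopPos  <- nchar(utr5) + nchar(coding) - 2L  # first base of the stop codon
  list(
    cdna       = cdna,
    start      = startPos,
    stop       = stopPos,
    startCodon = substr(cdna, startPos, startPos + 2L),
    stopCodon  = substr(cdna, stopPos, stopPos + 2L)
  )
}

# Toy sequences, not real transcripts.
pos <- codonPositions(utr5 = "GGCACC", coding = "ATGGCTTGCTAA", utr3 = "TTTGCA")
```

Checking `pos$startCodon` and `pos$stopCodon` against the expected codons is a cheap sanity test before trusting the positions for real transcripts, where a UTR may be missing or the annotated CDS incomplete.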


Hi Steffen,

Thanks for the information.

Best regards,
Simon

Written 14.7 years ago by Simon ▴ 30
