Question: retrieving mRNA sequences via biomaRt
Simon wrote, 8.1 years ago:
Hello everybody,

I am trying to solve the following tasks as a first contact with the Bioconductor project:

    # Task 1:
    # find:
    #   * mRNA sequence (5'UTR, coding region, 3'UTR)
    #   * position of start codon in sequence
    #   * position of stop codon in sequence
    #   * ID (Which ID(s) would I choose to reference my
    #     sequence hits? EMBL, Ensembl transcript ID,
    #     Entrez Gene ID, RefSeq, etc.?)
    #   * name of associated protein product
    #
    # where:
    #   * origin is human
    #     Entrez Search would be: human[ORGN]
    #   * sequence is mRNA transcript
    #     Entrez Search for Molecule Type: biomol_mRNA[PROP]?
    #   * mRNA sequence length is 3000 to 5000 nts
    #     Entrez Search for Sequence Length: 3000:5000[SLEN]
    #   * coding region of mRNA length is 2000 to 3000 nts
    #     Entrez Search Field for stop and start of
    #     coding region: start:stop[CDS]
    #
    # Task 2:
    # store the retrieved information to file for the first 200 hits
    # (Which would be a suitable file format?)

I started by playing around with the biomaRt package for R, but I got overwhelmed by its many possibilities.

I would be glad to get any feedback on how to start or even solve my tasks.

Best regards,
Simon
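On the file-format question in Task 2: FASTA is one common choice when each hit is a sequence with an identifier, while lengths and codon positions fit more naturally in a tab-delimited table (for example via write.table). The following is only a minimal base-R sketch of the FASTA part. The function name writeFirstHitsAsFasta, its parameters, and the column names are assumptions made for illustration, not part of biomaRt, and the sequences are made up.

```r
# Hypothetical helper (not part of biomaRt): write the first `n` hits of a
# data frame to a FASTA file, one record per row.
# Assumes `hits` has an ID column and a sequence column (names are parameters).
writeFirstHitsAsFasta <- function(hits, path, n = 200,
                                  idCol = "ensembl_transcript_id",
                                  seqCol = "cdna", width = 60) {
  hits <- head(hits, n)
  con <- file(path, open = "w")
  on.exit(close(con))
  for (i in seq_len(nrow(hits))) {
    s <- as.character(hits[[seqCol]][i])
    # Break the sequence into lines of at most `width` characters.
    starts <- seq(1, max(nchar(s), 1), by = width)
    chunks <- substring(s, starts, pmin(starts + width - 1, nchar(s)))
    writeLines(c(paste0(">", hits[[idCol]][i]), chunks), con)
  }
  invisible(nrow(hits))
}

# Toy input standing in for real query results (IDs and sequences are made up).
hits <- data.frame(
  ensembl_transcript_id = c("T1", "T2", "T3"),
  cdna = c("ACGTACGT", "GGCC", "TTTT"),
  stringsAsFactors = FALSE
)
fastaPath <- tempfile(fileext = ".fa")
nWritten <- writeFirstHitsAsFasta(hits, fastaPath, n = 2, width = 4)
```

With real data, n = 200 would cover the "first 200 hits" requirement; the order of the hits is whatever order the query result arrives in.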
Wolfgang Huber (EMBL, European Molecular Biology Laboratory) wrote, 8.1 years ago:
Hi Simon,

with all due respect, for a first contact with the Bioconductor project I'd also recommend studying some of the documentation.

A (slightly biased) set of starting points: the book "Bioconductor Case Studies" by Hahne, Huber, Gentleman and Falcon, and the paper "Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt" by Durinck et al., Nature Protocols 2009;4(8):1184-91.

Best wishes
Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
Simon replied, 8.1 years ago:

Thanks for the recommendation.

So far, I have only read Steffen's and your biomaRt user's guide and had a look at the BioMart 0.7 documentation, since I needed quick results. I'm going to have a look at the recommended book and paper now.

In the meantime, I got to a solution, but not a very satisfying one:

    ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)

    myAttributes = c("embl", "cdna", "5utr", "coding", "3utr", "5_utr_end",
                     "3_utr_start", "sequence_cdna_length", "cds_length")

    ...

    qresult = getBM(attributes=myAttributes,
                    filters=...,
                    values=...,
                    mart=ensembl)

    finalResult = mySeqCdsLengthFilter(qresult, c(3000, 5000), c(2000, 3000))

For now, I filter my query results manually, using the values of "sequence_cdna_length" and "cds_length" as limits. I wish these attributes were filters, or that there were a BioMart and a database I could use in a linked query via getLDS.

I'm still curious about a smarter solution.

Best regards,
Simon
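The body of mySeqCdsLengthFilter is not shown in the thread. As a rough illustration of the manual post-filtering Simon describes, here is a minimal base-R sketch of what such a helper might look like. It assumes the query result is a data frame with the length columns named as in Simon's attribute list; the seqCol and cdsCol parameters and the toy data are assumptions added for illustration.

```r
# Hypothetical stand-in for the mySeqCdsLengthFilter() helper mentioned above;
# the original implementation is not shown in the thread.
# Keeps rows whose cDNA length and CDS length both fall inside the given
# inclusive ranges, dropping rows where either length is missing.
mySeqCdsLengthFilter <- function(qresult, seqRange, cdsRange,
                                 seqCol = "sequence_cdna_length",
                                 cdsCol = "cds_length") {
  seqLen <- as.numeric(qresult[[seqCol]])
  cdsLen <- as.numeric(qresult[[cdsCol]])
  keep <- !is.na(seqLen) & !is.na(cdsLen) &
    seqLen >= seqRange[1] & seqLen <= seqRange[2] &
    cdsLen >= cdsRange[1] & cdsLen <= cdsRange[2]
  qresult[keep, , drop = FALSE]
}

# Toy data standing in for a getBM() result (all values are made up).
qresult <- data.frame(
  embl = c("acc1", "acc2", "acc3", "acc4"),
  sequence_cdna_length = c(3500, 2500, 4800, 4000),
  cds_length = c(2400, 2100, 3200, NA),
  stringsAsFactors = FALSE
)
finalResult <- mySeqCdsLengthFilter(qresult, c(3000, 5000), c(2000, 3000))
```

In the toy data only the first row passes: the second has too short a cDNA, the third too long a CDS, and the fourth has no CDS length.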
Steffen (steffen@stat.Berkeley.EDU) replied, 8.1 years ago:

Hi Simon,

The cdna attribute is the combination of 5utr + coding + 3utr, so you can remove 5utr, coding and 3utr from your list of attributes to retrieve. I would take ensembl_transcript_id instead of embl.

Cheers,
Steffen
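Steffen's point that cdna is 5utr + coding + 3utr also suggests one way to get the start and stop codon positions asked for in Task 1: they follow from the lengths of the 5'UTR and the coding sequence. Below is a minimal base-R sketch of that arithmetic on made-up strings. The helper codonPositions is hypothetical, and the sketch assumes that the coding sequence is complete (begins with the start codon and ends with the stop codon) and that a missing UTR may come back as placeholder text rather than an empty string. Note that this approach needs either the 5utr and coding sequences or their lengths, so those would have to stay in the query.

```r
# Hypothetical helper: 1-based positions of the first base of the start codon
# and the first base of the stop codon within a cDNA laid out as
# 5'UTR + coding + 3'UTR.
codonPositions <- function(utr5, coding, missing = "Sequence unavailable") {
  # Assumption: a missing 5'UTR may be reported as NA or as placeholder text;
  # either way it is treated here as an empty sequence.
  utr5[is.na(utr5) | utr5 == missing] <- ""
  startCodon <- nchar(utr5) + 1L
  stopCodon <- nchar(utr5) + nchar(coding) - 2L
  data.frame(start_codon = startCodon, stop_codon = stopCodon)
}

# Toy transcript (made-up sequences).
utr5 <- "GGCC"
coding <- "ATGAAATAA"
utr3 <- "CCC"
cdna <- paste0(utr5, coding, utr3)

pos <- codonPositions(utr5, coding)
startTriplet <- substr(cdna, pos$start_codon, pos$start_codon + 2)
stopTriplet <- substr(cdna, pos$stop_codon, pos$stop_codon + 2)
```

Checking the extracted triplets against the expected start and stop codons is a cheap sanity test that the assumption of a complete coding sequence holds for a given transcript.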
Simon replied, 8.1 years ago:

Hi Steffen,

Thanks for the information.

Best regards,
Simon