retrieving mRNA sequences via biomaRt
Simon ▴ 30
@simon-3613
Last seen 10.2 years ago
Hello everybody,

I am trying to solve the following tasks as a first contact with the Bioconductor project:

# Task 1:
# find:
#   * mRNA sequence (5'UTR, coding region, 3'UTR)
#   * position of start codon in sequence
#   * position of stop codon in sequence
#   * ID (Which ID(s) would I choose to reference my
#     sequence hits? EMBL, Ensembl transcript ID,
#     Entrez Gene ID, RefSeq, etc.?)
#   * name of associated protein product
#
# where:
#   * origin is human
#     Entrez Search would be: human[ORGN]
#   * sequence is mRNA transcript
#     Entrez Search for Molecule Type: biomol_mRNA[PROP]?
#   * mRNA sequence length is 3000 to 5000 nts
#     Entrez Search for Sequence Length: 3000:5000[SLEN]
#   * coding region of mRNA length is 2000 to 3000 nts
#     Entrez Search Field for stop and start of
#     coding region: start:stop[CDS]
#
# Task 2:
# store the retrieved information to file for the first 200 hits
# (Which would be a suitable file format?)

I started by playing around with the biomaRt package for R, but I got overwhelmed by its many possibilities.

I would be glad to get any feedback on how to start, or even solve, my tasks.

Best regards,
Simon
GLAD biomaRt • 3.4k views
@wolfgang-huber-3550
Last seen 3 months ago
EMBL European Molecular Biology Laborat…
Hi Simon,

with all respect, for a first contact with the Bioconductor project I'd also recommend studying some of the documentation.

A (slightly biased) set of points to start with are the "Bioconductor Case Studies" book by Hahne, Huber, Gentleman, Falcon and the paper "Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt." by Durinck et al. in Nature Protocols 2009;4(8):1184-91.

Best wishes
Wolfgang

-------------------------------------------------------
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
Thanks for the recommendation.

So far, I have only read Steffen's and your biomaRt user's guide and had a look at the BioMart 0.7 documentation, since I needed quick results. I'm going to have a look at the recommended book and paper now.

In the meantime, I got to a solution, but not a very satisfying one:

ensembl = useDataset("hsapiens_gene_ensembl", mart=ensembl)

myAttributes = c("embl", "cdna", "5utr", "coding", "3utr", "5_utr_end",
                 "3_utr_start", "sequence_cdna_length", "cds_length")

...

qresult = getBM(attributes=myAttributes,
                filters=...,
                values=...,
                mart=ensembl)

finalResult = mySeqCdsLengthFilter(qresult, c(3000, 5000), c(2000, 3000))

For now, I filter my query results manually, using the values of "sequence_cdna_length" and "cds_length" as limits. I wish these attributes were filters, or that there were a BioMart and a database I could use in a linked query via getLDS.

I'm still curious about a smarter solution.

Best regards,
Simon
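For readers following along: the post does not show the body of mySeqCdsLengthFilter, so the following is only a guess at what such a helper might look like in base R. It assumes qresult is a data frame with the columns "sequence_cdna_length" and "cds_length" named in the post; the toy data, the transcript IDs T1 to T4, and the tab-delimited output file are made up for illustration and are not from the thread.

```r
# Hypothetical reconstruction of the post-query length filter.
# Assumes qresult has columns sequence_cdna_length and cds_length.
mySeqCdsLengthFilter <- function(qresult, cdnaRange, cdsRange) {
  cdnaLen <- as.numeric(qresult$sequence_cdna_length)
  cdsLen  <- as.numeric(qresult$cds_length)
  # Keep rows whose lengths fall inside both inclusive ranges;
  # rows with missing lengths are dropped.
  keep <- !is.na(cdnaLen) & !is.na(cdsLen) &
    cdnaLen >= cdnaRange[1] & cdnaLen <= cdnaRange[2] &
    cdsLen  >= cdsRange[1]  & cdsLen  <= cdsRange[2]
  qresult[keep, , drop = FALSE]
}

# Toy stand-in for a getBM() result (made-up values, no network needed).
toy <- data.frame(
  ensembl_transcript_id = c("T1", "T2", "T3", "T4"),
  sequence_cdna_length  = c(3500, 2900, 4800, 5200),
  cds_length            = c(2500, 2100, 1500, 2800),
  stringsAsFactors      = FALSE
)

finalResult <- mySeqCdsLengthFilter(toy, c(3000, 5000), c(2000, 3000))

# Task 2: keep at most the first 200 hits and write them to a file.
# A tab-delimited table is one possible format, chosen here as an assumption.
first200 <- head(finalResult, 200)
outFile <- tempfile(fileext = ".tsv")
write.table(first200, file = outFile, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

Filtering after the query like this transfers every transcript before discarding most of them, which is why the wish for length filters on the server side is understandable.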
Hi Simon,

The cdna attribute is the combination of 5utr + coding + 3utr, so you can remove 5utr, coding and 3utr from your list of attributes to retrieve. I would take ensembl_transcript_id instead of embl.

Cheers,
Steffen
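Since the cDNA is the concatenation 5utr + coding + 3utr, the start and stop codon positions asked for in Task 1 can be derived from the lengths of the parts. The sketch below uses made-up toy sequences and assumes that the coding sequence begins with the start codon and ends with the stop codon; that assumption is not stated in the thread and should be checked against real data.

```r
# Toy sequences (made up for illustration, not real transcripts).
utr5   <- "GGCACC"
coding <- "ATGGCTTAA"   # assumed to include both start and stop codon
utr3   <- "CCTTGA"

# cDNA as described above: 5'UTR + coding region + 3'UTR.
cdna <- paste0(utr5, coding, utr3)

# 1-based position of the first base of the start codon in the cDNA.
startPos <- nchar(utr5) + 1

# 1-based position of the first base of the stop codon in the cDNA.
stopPos <- nchar(utr5) + nchar(coding) - 2

startCodon <- substr(cdna, startPos, startPos + 2)
stopCodon  <- substr(cdna, stopPos, stopPos + 2)
```

If the separate parts are no longer retrieved, the same positions would have to come from length or coordinate attributes instead, so it may be worth keeping "5utr" and "coding" when the codon positions are needed.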
Hi Steffen,

Thanks for the information.

Best regards,
Simon