Whole genome amino acid composition package?
1
0
Entering edit mode
biotech ▴ 160
@biotech-6194
Last seen 7.7 years ago
European Union
Hi there, I'm interested in a statistical tool to get bacterial amino acid composition and genetic code usage at genome level. I've looking in MESH terms database but I'm a bit lost. http://www.ncbi.nlm.nih.gov/pubmed?term=%28%22Genetic%20Code%22%5BMesh %5D%29%20AND%20%22Software%22%5BMesh%5D Thanks for your help. All the best, Bernardo -- *Bernardo Bello Ortí* PhD student CReSA-IRTA Campus de Bellaterra-Universitat Autònoma de Barcelona Edifici CReSA 08193 Bellaterra (Barcelona, Spain) Tel.: 647 42 52 63 *www.cresa.es * * * [[alternative HTML version deleted]]
• 1.1k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 hour ago
Seattle, WA, United States
Hi Bernardo, On 10/21/2013 03:41 AM, Bernardo Bello wrote: > Hi there, > > I'm interested in a statistical tool to get bacterial amino acid > composition and genetic code usage at genome level. > > I've looking in MESH terms database but I'm a bit lost. > http://www.ncbi.nlm.nih.gov/pubmed?term=%28%22Genetic%20Code%22%5BMe sh%5D%29%20AND%20%22Software%22%5BMesh%5D Assuming you've access to the annotated DNA sequence for your bacteria, you would need: (a) A way to load the bacterial genome in Bioconductor as a DNAString object. Some suggested tools: - readDNAStringSet() in Biostrings. - scanFa() in Rsamtools. - import.2bit() in rtracklayer. - The biomaRt package to retrieve the bacterial genome from an online database via a Mart service. - You might find if convenient to wrap your bacterial genomes into a BSgenome data package like we've done for E. coli (see the BSgenome.Ecoli.NCBI.20080805 package). (b) A way to load the gene/CDS coordinates as a GRanges or GRangesList object. You need to make sure those coordinates are relative to the exact same genome assembly you used in (a). Some suggested tools: - import.gff() and import.bed() in rtracklayer. - makeTranscriptDbFromGFF() in GenomicFeatures. - The biomaRt package to retrieve the gene coordinates from an online database via a Mart service. Actually, if you're going to use biomaRt, you can skip that step and retrieve directly the gene/CDS sequences. See (c). (c) A way to extract the gene/CDS sequences as a DNAStringSet object. One gotcha for this step is to make sure that the sequences are extracted from the right strand i.e. if a gene is on the minus strand and its sequence is extracted from the plus strand, then the sequence needs to be reverse complemented (with reverseComplement() from the Biostrings package). Some suggested tools: - extractAt() from the Biostrings package. Does NOT handle the strand for you (i.e. you need to reverse complement sequences for genes on the minus strand). - The biomaRt package to retrieve the gene/CDS sequences from an online database via a Mart service. - getSeq() from the BSgenome package (if the DNA sequences are stored in a BSgenome data package). It does handle the strand for you. - Just mentioning it FYI: for more complex organisms with introns and UTRs, extractTranscriptsFromGenome() in GenomicFeatures would be the tool to use to extract the transcript or CDS sequences from a BSgenome object and a TranscriptDb object. It handles the strand and removes the introns for you. (d) Once you have the gene/CDS sequences in a DNAStringSet object, you can use translate() (from Biostrings) to translate to proteins. Right now it uses the Standard Genetic Code to translate but it would be easy to add support for alternate genetic codes. The result of the translation is an AAStringSet object. You can also use extractAt() on the DNAStringSet object to extract the codon composition of the gene/CDS sequences as a DNAStringSetList object. A simple table(unlist()) on the DNAStringSetList object will give you condon usage. I don't know if there is already a package in Bioconductor or on CRAN that does the above for you but all the building blocks are here, and it should not be too hard to put them together. If you decide to go that route and need more help, please come back, hopefully this time with more specific questions. Cheers, H. > > Thanks for your help. All the best, Bernardo > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Dear Hervé, Thank you very much for your help, it seems you handle R with ease. Wow! Hope one day I would manage R this way. I forgot to tell that I need to input R tens if bacterial genomes at the same tine and get statistical codon usage for each batch. Is that possible to automate this in your described workflow? Best, Bernardo 2013/10/21 Hervé Pagès <hpages@fhcrc.org> > Hi Bernardo, > > > On 10/21/2013 03:41 AM, Bernardo Bello wrote: > >> Hi there, >> >> I'm interested in a statistical tool to get bacterial amino acid >> composition and genetic code usage at genome level. >> >> I've looking in MESH terms database but I'm a bit lost. >> http://www.ncbi.nlm.nih.gov/**pubmed?term=%28%22Genetic%** >> 20Code%22%5BMesh%5D%29%20AND%**20%22Software%22%5BMesh%5D<http: ww="" w.ncbi.nlm.nih.gov="" pubmed?term="%28%22Genetic%20Code%22%5BMesh%5D%29%20" and%20%22software%22%5bmesh%5d=""> >> > > Assuming you've access to the annotated DNA sequence for your bacteria, > you would need: > > (a) A way to load the bacterial genome in Bioconductor as a > DNAString object. > > Some suggested tools: > - readDNAStringSet() in Biostrings. > - scanFa() in Rsamtools. > - import.2bit() in rtracklayer. > - The biomaRt package to retrieve the bacterial genome from > an online database via a Mart service. > - You might find if convenient to wrap your bacterial genomes > into a BSgenome data package like we've done for E. coli > (see the BSgenome.Ecoli.NCBI.20080805 package). > > (b) A way to load the gene/CDS coordinates as a GRanges or GRangesList > object. You need to make sure those coordinates are relative to > the exact same genome assembly you used in (a). > > Some suggested tools: > - import.gff() and import.bed() in rtracklayer. > - makeTranscriptDbFromGFF() in GenomicFeatures. > - The biomaRt package to retrieve the gene coordinates from > an online database via a Mart service. Actually, if you're > going to use biomaRt, you can skip that step and retrieve > directly the gene/CDS sequences. See (c). > > (c) A way to extract the gene/CDS sequences as a DNAStringSet object. > One gotcha for this step is to make sure that the sequences are > extracted from the right strand i.e. if a gene is on the minus > strand and its sequence is extracted from the plus strand, then > the sequence needs to be reverse complemented (with > reverseComplement() from the Biostrings package). > > Some suggested tools: > - extractAt() from the Biostrings package. Does NOT handle > the strand for you (i.e. you need to reverse complement > sequences for genes on the minus strand). > - The biomaRt package to retrieve the gene/CDS sequences from > an online database via a Mart service. > - getSeq() from the BSgenome package (if the DNA sequences are > stored in a BSgenome data package). It does handle the > strand for you. > - Just mentioning it FYI: for more complex organisms > with introns and UTRs, extractTranscriptsFromGenome() in > GenomicFeatures would be the tool to use to extract the > transcript or CDS sequences from a BSgenome object and a > TranscriptDb object. It handles the strand and removes the > introns for you. > > (d) Once you have the gene/CDS sequences in a DNAStringSet object, > you can use translate() (from Biostrings) to translate to > proteins. Right now it uses the Standard Genetic Code to > translate but it would be easy to add support for alternate > genetic codes. The result of the translation is an AAStringSet > object. You can also use extractAt() on the DNAStringSet object > to extract the codon composition of the gene/CDS sequences as > a DNAStringSetList object. A simple table(unlist()) on the > DNAStringSetList object will give you condon usage. > > I don't know if there is already a package in Bioconductor or on CRAN > that does the above for you but all the building blocks are here, and > it should not be too hard to put them together. If you decide to go that > route and need more help, please come back, hopefully this time with > more specific questions. > > Cheers, > H. > > >> Thanks for your help. All the best, Bernardo >> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- *Bernardo Bello Ortí* PhD student CReSA-IRTA Campus de Bellaterra-Universitat Autònoma de Barcelona Edifici CReSA 08193 Bellaterra (Barcelona, Spain) Tel.: 647 42 52 63 *www.cresa.es * * * [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
On 10/21/2013 12:57 PM, Bernardo Bello wrote: > Dear Herv?, > > Thank you very much for your help, it seems you handle R with ease. Wow! > Hope one day I would manage R this way. > > I forgot to tell that I need to input R tens if bacterial genomes at the > same tine and get statistical codon usage for each batch. Is that > possible to automate this in your described workflow? You give so little details (in particular you don't say where your data is coming from and/or what it looks like) that it's hard to tell for sure, but I would say: why not? More precisely: assuming you've managed to solve the problem for 1 bacterial genome (i.e. you now have a function that summarizes codon usage for a given bacterial genome), then I don't see any reason why you couldn't apply this function to a batch of genomes and combine the results together. Depending on the format you use for the individual codon usage summaries, combining the results might be trivial (e.g. could just consist of rbind()'ing the rows of a 64-col table) or a little bit more complicated... Again, hard to know without you providing more details. H. > > Best, Bernardo > > > 2013/10/21 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Bernardo, > > > On 10/21/2013 03:41 AM, Bernardo Bello wrote: > > Hi there, > > I'm interested in a statistical tool to get bacterial amino acid > composition and genetic code usage at genome level. > > I've looking in MESH terms database but I'm a bit lost. > http://www.ncbi.nlm.nih.gov/__pubmed?term=%28%22Genetic%__20 Code%22%5BMesh%5D%29%20AND%__20%22Software%22%5BMesh%5D > <http: www.ncbi.nlm.nih.gov="" pubmed?term="%28%22Genetic%20Cod" e%22%5bmesh%5d%29%20and%20%22software%22%5bmesh%5d=""> > > > Assuming you've access to the annotated DNA sequence for your bacteria, > you would need: > > (a) A way to load the bacterial genome in Bioconductor as a > DNAString object. > > Some suggested tools: > - readDNAStringSet() in Biostrings. > - scanFa() in Rsamtools. > - import.2bit() in rtracklayer. > - The biomaRt package to retrieve the bacterial genome from > an online database via a Mart service. > - You might find if convenient to wrap your bacterial genomes > into a BSgenome data package like we've done for E. coli > (see the BSgenome.Ecoli.NCBI.20080805 package). > > (b) A way to load the gene/CDS coordinates as a GRanges or > GRangesList > object. You need to make sure those coordinates are relative to > the exact same genome assembly you used in (a). > > Some suggested tools: > - import.gff() and import.bed() in rtracklayer. > - makeTranscriptDbFromGFF() in GenomicFeatures. > - The biomaRt package to retrieve the gene coordinates from > an online database via a Mart service. Actually, if you're > going to use biomaRt, you can skip that step and retrieve > directly the gene/CDS sequences. See (c). > > (c) A way to extract the gene/CDS sequences as a DNAStringSet object. > One gotcha for this step is to make sure that the sequences are > extracted from the right strand i.e. if a gene is on the minus > strand and its sequence is extracted from the plus strand, then > the sequence needs to be reverse complemented (with > reverseComplement() from the Biostrings package). > > Some suggested tools: > - extractAt() from the Biostrings package. Does NOT handle > the strand for you (i.e. you need to reverse complement > sequences for genes on the minus strand). > - The biomaRt package to retrieve the gene/CDS sequences from > an online database via a Mart service. > - getSeq() from the BSgenome package (if the DNA sequences are > stored in a BSgenome data package). It does handle the > strand for you. > - Just mentioning it FYI: for more complex organisms > with introns and UTRs, extractTranscriptsFromGenome() in > GenomicFeatures would be the tool to use to extract the > transcript or CDS sequences from a BSgenome object and a > TranscriptDb object. It handles the strand and removes the > introns for you. > > (d) Once you have the gene/CDS sequences in a DNAStringSet object, > you can use translate() (from Biostrings) to translate to > proteins. Right now it uses the Standard Genetic Code to > translate but it would be easy to add support for alternate > genetic codes. The result of the translation is an AAStringSet > object. You can also use extractAt() on the DNAStringSet object > to extract the codon composition of the gene/CDS sequences as > a DNAStringSetList object. A simple table(unlist()) on the > DNAStringSetList object will give you condon usage. > > I don't know if there is already a package in Bioconductor or on CRAN > that does the above for you but all the building blocks are here, and > it should not be too hard to put them together. If you decide to go that > route and need more help, please come back, hopefully this time with > more specific questions. > > Cheers, > H. > > > Thanks for your help. All the best, Bernardo > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > > *Bernardo Bello Ort?* > > PhD student > > CReSA-IRTA > > Campus de Bellaterra-Universitat Aut?noma de Barcelona > > Edifici CReSA > > 08193 Bellaterra (Barcelona, Spain) > > Tel.: 647 42 52 63 *www.cresa.es <http: www.cresa.es=""> * > > * > * > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6