Question: kmer and zscore calculation
3.9 years ago by
Fabrice Tourre wrote:
Dear list, I have a list of BED regions, each 10 bp long. I want to count the hexamers and pentamers in these regions and compute their z-scores. Is there an existing package that does this? Thank you very much in advance.
3.9 years ago by
Hervé Pagès ♦♦ 13k wrote:
[Oops, forgot to Cc the list when I answered this. Sending it again...]

Hi Fabrice,

The oligonucleotideFrequency() function in the Biostrings package counts the number of occurrences of all possible 5-mers (use 'width=5') or 6-mers (use 'width=6'). You need to store your sequence(s) in a DNAString or DNAStringSet object first. On a DNAStringSet, the counts are returned in a matrix with 1 row per sequence and 1 column per k-mer:

    library(Biostrings)
    library(hgu95av2probe)
    probes <- DNAStringSet(hgu95av2probe)
    count5 <- oligonucleotideFrequency(probes, width=5)

Then:

    > dim(count5)
    [1] 201800   1024
    > count5[1:6, 1:10]
         AAAAA AAAAC AAAAG AAAAT AAACA AAACC AAACG AAACT AAAGA AAAGC
    [1,]     0     0     0     0     0     0     0     0     0     0
    [2,]     0     0     0     0     0     0     0     0     0     0
    [3,]     0     0     0     0     0     0     0     0     0     0
    [4,]     0     0     0     0     0     0     0     0     0     0
    [5,]     0     0     0     0     0     0     0     0     0     0
    [6,]     0     0     0     0     0     0     0     0     0     0

Maybe this function should have been called kmerFrequency()...

Cheers,
H.
Thank you very much. It is helpful.
written 3.9 years ago by Fabrice Tourre
3.9 years ago by
Julian Gehring wrote:
Hi Fabrice,

Could you give an example of this, and in particular explain what the z-score statistic is meant to tell you?

Best wishes,
Julian
Julian,

Thank you for your reply. For example, I have a sample file (the records below are FASTA-formatted):

    >chr1:14404-14435(-)
    GGCACA
    >chr1:14409-14440(-)
    AAAACG
    >chr1:14423-14454(-)
    AGAGGC
    >chr1:14424-14455(-)
    AAGAGG

I want to count the 6-mers in this sample file. I have also extracted a control file:

    >chr1:14404-14435(-)
    CCTACA
    >chr1:14409-14440(-)
    TCGACG
    >chr1:14423-14454(-)
    TCAGAT
    >chr1:14424-14455(-)
    CAAGGC

The z-score is calculated for each k-mer (pentamer or hexamer) as:

    (occurrence in sequences - average occurrence in control sequences) / standard deviation of occurrence in control sequences
written 3.9 years ago by Fabrice Tourre
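The z-score definition above can be written out in a few lines of base R. This is only a minimal sketch under assumptions the thread does not state: that k-mer counts are available for several control sets (one row per set), so that a per-k-mer mean and standard deviation exist. The function name kmer_zscore and all counts below are invented for illustration and are not taken from any real data.

```r
# Z-score per k-mer: (sample count - mean control count) / sd of control counts.
# sample_counts:  named numeric vector, one entry per k-mer.
# control_counts: numeric matrix, one row per control set (assumed),
#                 one column per k-mer, in the same k-mer order.
kmer_zscore <- function(sample_counts, control_counts) {
  stopifnot(identical(names(sample_counts), colnames(control_counts)))
  ctrl_mean <- colMeans(control_counts)
  ctrl_sd <- apply(control_counts, 2, sd)
  z <- (sample_counts - ctrl_mean) / ctrl_sd
  # A k-mer with identical counts in every control set has sd 0;
  # report NA instead of NaN or Inf.
  z[ctrl_sd == 0] <- NA_real_
  z
}

# Illustrative counts only (hypothetical, three control sets).
control_counts <- matrix(
  c(2, 4, 6,   # GGCACA in control sets 1..3
    5, 5, 5,   # AAAACG
    1, 3, 5),  # AGAGGC
  nrow = 3,
  dimnames = list(paste0("control", 1:3), c("GGCACA", "AAAACG", "AGAGGC"))
)
sample_counts <- c(GGCACA = 8, AAAACG = 5, AGAGGC = 1)

z <- kmer_zscore(sample_counts, control_counts)
print(z)
```

Whether the mean and standard deviation should be taken across control sets, as assumed here, or across individual control sequences depends on how the control was built, so that choice should be checked against the intended analysis.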
Hi Fabrice,

I'm not sure whether there is a package that offers this out of the box. However, it should be easy to achieve with standard Bioconductor functionality:

1. Import the sequences with the 'Rsamtools' or 'ShortRead' package.
2. Define your k-mers and count their occurrences with functions like 'countPattern' or 'vcountPattern' from the 'Biostrings' package.
3. Calculate your z-scores.

Hope this helps.

Best wishes,
Julian
written 3.9 years ago by Julian Gehring
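The counting step of the outline above might look roughly like this with Biostrings, using the example sequences quoted earlier in the thread. It is a sketch rather than a recommended pipeline: the sequences are typed in directly instead of being imported from a file, and all variable names are arbitrary choices made for this illustration.

```r
library(Biostrings)

# Sequences copied from the example in the thread. For real data, a call such
# as readDNAStringSet("sample.fa") could load a FASTA file instead
# ("sample.fa" is a hypothetical file name).
sample_seqs <- DNAStringSet(c("GGCACA", "AAAACG", "AGAGGC", "AAGAGG"))
control_seqs <- DNAStringSet(c("CCTACA", "TCGACG", "TCAGAT", "CAAGGC"))

# Count every hexamer: one row per sequence, one column per possible 6-mer.
sample_hex <- oligonucleotideFrequency(sample_seqs, width = 6)
control_hex <- oligonucleotideFrequency(control_seqs, width = 6)

# Total occurrences of each hexamer across all sequences in a set.
sample_total <- colSums(sample_hex)
control_total <- colSums(control_hex)

# Alternative for a single k-mer of interest: per-sequence counts.
ggcaca_per_seq <- vcountPattern("GGCACA", sample_seqs)

# Hexamers observed at least once in the sample set.
print(names(sample_total)[sample_total > 0])
```

For all k-mers at once, oligonucleotideFrequency() avoids looping over 4096 patterns; vcountPattern() is the more natural choice when only a handful of specific k-mers matter.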