Sequence Distance matrix with large sequences
1
0
Entering edit mode
@benjamin-ward-env-6205
Last seen 7.1 years ago
Hi, I've been using the DNAbin class and the dist.dna() function in a package I've been making to get a matrix of hamming distances between DNA sequences in a multiple sequence alignment. I've done this with sequences hundreds of thousands long but want to allow the capability to use sequences from genome data i.e. Mbp long. I know there is a Biostring package in the Bioconductor project that is supposed to store very big sequences effectively. Can I do an equivalent job with Bio-strings yielding me such distance information, and can I also identify all the SNPs in an alignment with these large sequences i.e. the segregating sites? If so how? Many Thanks, Ben. [[alternative HTML version deleted]]
Alignment Alignment • 1.5k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States
Hi Benjamin, To compute the Hamming distance between 2 strings in Biostrings, you can use neditAt(): > library(BSgenome.Scerevisiae.UCSC.sacCer2) > library(BSgenome.Scerevisiae.UCSC.sacCer3) > BSgenome.Scerevisiae.UCSC.sacCer2::Scerevisiae$chrVIII 562643-letter "DNAString" instance seq: CCCACACACACCACACCCACACACCACACCCACACT...GTGTGTGGGTGTGGTGTGGGTGTGGTGTGTG TGTGG > BSgenome.Scerevisiae.UCSC.sacCer3::Scerevisiae$chrVIII 562643-letter "DNAString" instance seq: CCCACACACACCACACCCACACACCACACCCACACT...GTGTGTGGGTGTGGTGTGGGTGTGGTGTGTG TGTGG Same length! Are the strings the same? We can answer this by computing the number of mismatches: > neditAt(BSgenome.Scerevisiae.UCSC.sacCer2::Scerevisiae$chrVIII, BSgenome.Scerevisiae.UCSC.sacCer3::Scerevisiae$chrVIII) [1] 320140 Note that neditAt() has a 'with.indels' arg that is set to FALSE by default, hence you get the Hamming distance. Be aware that using 'with.indels=TRUE' to compute the edit (aka Levenshtein) distance would generally not work because neditAt() performs a global-local alignment: > neditAt("ACGACG", "ACGTACGTTT", with.indels=TRUE) [1] 1 FWIW you could also have a look at stringDist() in Biostrings, or at the stringdist package on CRAN for computing the (Hamming or Levenshtein) distance matrix of a collection of strings, Don't know how those tools will scale with sequences hundreds of thousands long though... Cheers, H. On 10/25/2013 06:45 AM, Benjamin Ward (ENV) wrote: > Hi, > > I've been using the DNAbin class and the dist.dna() function in a package I've been making to get a matrix of hamming distances between DNA sequences in a multiple sequence alignment. I've done this with sequences hundreds of thousands long but want to allow the capability to use sequences from genome data i.e. Mbp long. I know there is a Biostring package in the Bioconductor project that is supposed to store very big sequences effectively. Can I do an equivalent job with Bio-strings yielding me such distance information, and can I also identify all the SNPs in an alignment with these large sequences i.e. the segregating sites? If so how? > > Many Thanks, > Ben. > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT

Login before adding your answer.

Traffic: 357 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6