Hi Everyone at Bioconductor,
I've been doing work with large multiple sequence alignments in which I find the polymorphic sites, keeping them and their base positions, and rejecting all the non-polymorphic sites. I usually use a package like ape for this - which implements the DNAbin class. This time the MSA I have is simply too large and can't be represented as a DNAbin class and R gives an error about not supporting Long vectors of Raw. I've heard and read that the sequence types in Bioconductor are good because they don't store the entire sequence in memory and are built to handle very large sequences efficiently. This will be my first use of Bioconductor. I've currently identified the polymorphic sites using the following code:
library(Biostrings) origMAlign <- readDNAStringSet(filepath = "~/Dropbox/MySequences.fas", format = "fasta") consensusM <- consensusMatrix(origMAlign)[1:4,] polymorphic <- which(colSums(consensusM != 0) > 1) for(i in 1:length(origMAlign)) origMAlign[[i]] <- origMAlign[[i]][polymorphic]
Is this the most efficient way to do this with the Biostring classes - being R, use of a loop feels instinctively wrong!
Ben Ward - UEA & TSL.