Hi Everyone at Bioconductor,
I've been doing work with large multiple sequence alignments in which I find the polymorphic sites, keeping them and their base positions, and rejecting all the non-polymorphic sites. I usually use a package like ape for this - which implements the DNAbin class. This time the MSA I have is simply too large and can't be represented as a DNAbin class and R gives an error about not supporting Long vectors of Raw. I've heard and read that the sequence types in Bioconductor are good because they don't store the entire sequence in memory and are built to handle very large sequences efficiently. This will be my first use of Bioconductor. I've currently identified the polymorphic sites using the following code:
library(Biostrings) origMAlign <- readDNAStringSet(filepath = "~/Dropbox/MySequences.fas", format = "fasta") consensusM <- consensusMatrix(origMAlign)[1:4,] polymorphic <- which(colSums(consensusM != 0) > 1) for(i in 1:length(origMAlign)) origMAlign[[i]] <- origMAlign[[i]][polymorphic]
Is this the most efficient way to do this with the Biostring classes - being R, use of a loop feels instinctively wrong!
Thanks,
Ben Ward - UEA & TSL.
The example you give returns an error:
Error in rep.int(IntegerList(InformativeBp), length(FullSequence)) : attempt to replicate an object of type 'S4'
But something like:
Seems to work.
Hi Ben,
Replicating an IntegerList object does work for me:
Please make sure that you are using the current release of BioC (3.0) and that all your packages are up-to-date (especially the BiocGenerics, S4Vectors and IRanges packages). See my sessionInfo() below.
Cheers,
H.