Hi,
Depending on how you are clustering you reads exactly, here are some tools that I know of that can help you with extracting consensus sequences:
consensusMatrix()
and consensusString()
in the Biostrings package.
sequenceLayer()
in the GenomicAlignments package.
pileup()
in the Rsamtools package.
Let's say you have some read sequences:
library(pasillaBamSubset)
library(GenomicAlignments)
param <- ScanBamParam(what="seq")
reads <- readGAlignments(untreated1_chr4(), param=param)
readseqs <- mcols(reads)$seq
readseqs
And let's say you want to cluster them by position (POS field in the BAM file):
read_clusters <- split(readseqs, paste(seqnames(reads), start(reads), sep="-"))
lengths(read_clusters)[1:12]
You can compute the consensus matrix and consensus sequence of the reads positioned at nucleotide 10021 on chromosome 4 (RNAME=chr4 and POS=10021 in the BAM file) with something like:
cluster1 <- read_clusters[["chr4-10021"]]
cluster1
cons_mat <- consensusMatrix(cluster1, baseOnly=TRUE)
dim(cons_mat)
cons_mat[ , 1:12]
consensusString(cluster1)
HOWEVER: It's important to note that this approach only makes sense if the reads have no indels or junctions (i.e. the CIGAR strings have no I, D, or N in them). If they have indels or junctions, the consensus obtained by the above approach would be meaningless unless the reads sequences are first stretched or/and chrunk based on the corresponding CIGAR. sequenceLayer()
could be used for this stretching/chrinking.
An alternative approach is to use pileup()
:
pileup_res <- pileup(untreated1_chr4())
dim(pileup_res)
head(pileup_res)
But then there is some significant work to do to extract consensus sequences from this data.frame.
Hope this helps,
H.