Hi,
Depending on how you are clustering you reads exactly, here are some tools that I know of that can help you with extracting consensus sequences:
consensusMatrix()
and consensusString()
in the Biostrings package.
sequenceLayer()
in the GenomicAlignments package.
pileup()
in the Rsamtools package.
Let's say you have some read sequences:
## Let's load some read sequences:
library(pasillaBamSubset)
library(GenomicAlignments)
param <- ScanBamParam(what="seq")
reads <- readGAlignments(untreated1_chr4(), param=param)
readseqs <- mcols(reads)$seq
readseqs
# A DNAStringSet instance of length 204355
# width seq
# [1] 75 CTGTGGTGACCAACACCACAGAA...CCCCTTTCCTGGCTAGGTTGTCC
# [2] 75 TCGGGCCCAATTAGAGGGTTCCC...GCTCATTTCCTGGGCTGTTGTTG
# [3] 75 CCCAATTAGAGGATTCTCTGCCC...TTTCCCGGGATGTTGTTGTGTCC
# [4] 75 GTTCTCTGCCCCTTTCCTGGCTA...TTGTTGTGTCCCGGGACCCACCT
# [5] 75 TTCCTGGCTAGGTTGTCCGCTAG...GGACACACCTTATTGTGAGTTTG
# ... ... ...
# [204351] 75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGT
# [204352] 75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGT
# [204353] 75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGG
# [204354] 75 CCTCCGCTTTGGTTCACGTTCTG...GGCTTCACTTTTAGCTACTGTTG
# [204355] 75 CCTCCGCTTTGGTTCACGTTCTG...GGCTTCACTTTTAGCTACTGTTG
And let's say you want to cluster them by position (POS field in the BAM file):
read_clusters <- split(readseqs, paste(seqnames(reads), start(reads), sep="-"))
## Sizes of the clusters:
lengths(read_clusters)[1:12]
# chr4-10000 chr4-10002 chr4-10003 chr4-10004 chr4-10005 chr4-10007
# 2 1 1 1 1 1
# chr4-10008 chr4-10009 chr4-10010 chr4-100112 chr4-10013 chr4-10014
# 2 1 2 1 3 1
You can compute the consensus matrix and consensus sequence of the reads positioned at nucleotide 10021 on chromosome 4 (RNAME=chr4 and POS=10021 in the BAM file) with something like:
cluster1 <- read_clusters[["chr4-10021"]]
cluster1
# A DNAStringSet instance of length 8
# width seq
# [1] 75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [2] 75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [3] 75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [4] 75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [5] 75 GCAGATGCCTACGACTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [6] 75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [7] 75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [8] 75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
cons_mat <- consensusMatrix(cluster1, baseOnly=TRUE)
dim(cons_mat)
# [1] 5 75
cons_mat[ , 1:12]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# A 0 0 8 0 8 0 0 0 0 0 8 0
# C 0 8 0 0 0 0 0 8 8 0 0 8
# G 8 0 0 8 0 0 8 0 0 0 0 0
# T 0 0 0 0 0 8 0 0 0 8 0 0
# other 0 0 0 0 0 0 0 0 0 0 0 0
consensusString(cluster1)
# [1] "GCAGATGCCTACGATTAACTCCGAACTTTACTGTTGGACGGACTCCACGATAGTGCTTGCATGGTTAAGCAAGCC"
HOWEVER: It's important to note that this approach only makes sense if the reads have no indels or junctions (i.e. the CIGAR strings have no I, D, or N in them). If they have indels or junctions, the consensus obtained by the above approach would be meaningless unless the reads sequences are first stretched or/and chrunk based on the corresponding CIGAR. sequenceLayer()
could be used for this stretching/chrinking.
An alternative approach is to use pileup()
:
pileup_res <- pileup(untreated1_chr4())
dim(pileup_res)
# [1] 860358 5
head(pileup_res)
# seqnames pos strand nucleotide count
# 1 chr4 892 - C 1
# 2 chr4 893 - T 1
# 3 chr4 894 - G 1
# 4 chr4 895 - T 1
# 5 chr4 896 - G 1
# 6 chr4 897 - G 1
But then there is some significant work to do to extract consensus sequences from this data.frame.
Hope this helps,
H.