Question

Consensus sequence from BAM file

0

Entering edit mode

O. William McClung • 0

@o-william-mcclung-22004

Last seen 4.5 years ago

United States

I have a .bam file which aligns reads against a reference genome. The locations of the reads partition themselves into a small number of clusters. Can BioConductor be used, perhaps with other tools such as samtools, to compute a consensus sequence for each cluster?

consensus BAM • 1.4k views

ADD COMMENT • link updated 4.4 years ago by Hervé Pagès 16k • written 4.5 years ago by O. William McClung • 0

score 0 · Answer 1 · 2019-11-13

Hi,

Depending on how you are clustering you reads exactly, here are some tools that I know of that can help you with extracting consensus sequences:

consensusMatrix() and consensusString() in the Biostrings package.
sequenceLayer() in the GenomicAlignments package.
pileup() in the Rsamtools package.

Let's say you have some read sequences:

## Let's load some read sequences:
library(pasillaBamSubset)
library(GenomicAlignments)
param <- ScanBamParam(what="seq")
reads <- readGAlignments(untreated1_chr4(), param=param)
readseqs <- mcols(reads)$seq
readseqs
#   A DNAStringSet instance of length 204355
#          width seq
#      [1]    75 CTGTGGTGACCAACACCACAGAA...CCCCTTTCCTGGCTAGGTTGTCC
#      [2]    75 TCGGGCCCAATTAGAGGGTTCCC...GCTCATTTCCTGGGCTGTTGTTG
#      [3]    75 CCCAATTAGAGGATTCTCTGCCC...TTTCCCGGGATGTTGTTGTGTCC
#      [4]    75 GTTCTCTGCCCCTTTCCTGGCTA...TTGTTGTGTCCCGGGACCCACCT
#      [5]    75 TTCCTGGCTAGGTTGTCCGCTAG...GGACACACCTTATTGTGAGTTTG
#      ...   ... ...
# [204351]    75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGT
# [204352]    75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGT
# [204353]    75 ATTTAATACAATATTTTCAAAAT...TATAGGCTTCTTCTTACTATGGG
# [204354]    75 CCTCCGCTTTGGTTCACGTTCTG...GGCTTCACTTTTAGCTACTGTTG
# [204355]    75 CCTCCGCTTTGGTTCACGTTCTG...GGCTTCACTTTTAGCTACTGTTG

And let's say you want to cluster them by position (POS field in the BAM file):

read_clusters <- split(readseqs, paste(seqnames(reads), start(reads), sep="-"))

## Sizes of the clusters:
lengths(read_clusters)[1:12]
#  chr4-10000  chr4-10002  chr4-10003  chr4-10004  chr4-10005  chr4-10007 
#           2           1           1           1           1           1 
#  chr4-10008  chr4-10009  chr4-10010 chr4-100112  chr4-10013  chr4-10014 
#           2           1           2           1           3           1

You can compute the consensus matrix and consensus sequence of the reads positioned at nucleotide 10021 on chromosome 4 (RNAME=chr4 and POS=10021 in the BAM file) with something like:

cluster1 <- read_clusters[["chr4-10021"]]
cluster1
#   A DNAStringSet instance of length 8
#     width seq
# [1]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [2]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [3]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [4]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [5]    75 GCAGATGCCTACGACTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [6]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [7]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC
# [8]    75 GCAGATGCCTACGATTAACTCCGAACTTTACTG...TCCACGATAGTGCTTGCATGGTTAAGCAAGCC

cons_mat <- consensusMatrix(cluster1, baseOnly=TRUE)
dim(cons_mat)
# [1]  5 75

cons_mat[ , 1:12]
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# A        0    0    8    0    8    0    0    0    0     0     8     0
# C        0    8    0    0    0    0    0    8    8     0     0     8
# G        8    0    0    8    0    0    8    0    0     0     0     0
# T        0    0    0    0    0    8    0    0    0     8     0     0
# other    0    0    0    0    0    0    0    0    0     0     0     0

consensusString(cluster1)
# [1] "GCAGATGCCTACGATTAACTCCGAACTTTACTGTTGGACGGACTCCACGATAGTGCTTGCATGGTTAAGCAAGCC"

HOWEVER: It's important to note that this approach only makes sense if the reads have no indels or junctions (i.e. the CIGAR strings have no I, D, or N in them). If they have indels or junctions, the consensus obtained by the above approach would be meaningless unless the reads sequences are first stretched or/and chrunk based on the corresponding CIGAR. sequenceLayer() could be used for this stretching/chrinking.

An alternative approach is to use pileup():

pileup_res <- pileup(untreated1_chr4())
dim(pileup_res)
# [1] 860358      5

head(pileup_res)
#   seqnames pos strand nucleotide count
# 1     chr4 892      -          C     1
# 2     chr4 893      -          T     1
# 3     chr4 894      -          G     1
# 4     chr4 895      -          T     1
# 5     chr4 896      -          G     1
# 6     chr4 897      -          G     1

But then there is some significant work to do to extract consensus sequences from this data.frame.

Hope this helps,

H.