Getting Aligned DNAStringSet from reference and BAM/SAM file(s)
1
1
Entering edit mode
ben.ward ▴ 30
@benward-7169
Last seen 6.5 years ago
United Kingdom

Hi,

I've been given a BAM format file and I'd like to read it and generate from it, a DNAStringSet object I can then feed into some code I've written previously, which previously worked on a DNAString set from a FASTA alignment file.

This will be the first time I've worked with such files, I've been reading GenomicAlignment vignettes and I think the way to go about this is:

3. Use position info of reads and the differences between the read and reference, to work out the full sequence.

Is this a common task to want to do? Is this possible with Bioconductor? It's similar to converting a BAM to a FASTA alignment file. Does anyone know of an example or a doc that will give me direction on how to get this done.

EDIT:

After comments and feedback, I believe something like a pileup or consensus sequence for the BAM file gets me close to the end result, I've seen a command:

samtools pileup -cv

Which I think does this - can I do this from in Bioconductor, without having to use system()?

Thanks,

Ben.

bam sam alignment biostrings • 4.7k views
0
Entering edit mode

I think you'll need to explain what you mean by a "FASTA alignment file", since FASTA is usually a format for storing unaligned sequences.

1
Entering edit mode

Yes and Ryan's remark could be extended to DNAStringSet. Even though you can easily load the unaligned read sequences in a DNAStringSet object (just do gal <- readGAlignments(..., param=ScanBamParam(what="seq")), then extract the "seq" metadata column with mcols(gal)\$seq), note that this only gives you the unaligned sequences. In order to store the aligned sequences you would also need to keep track of the geometry of the alignments (i.e. position with respect to the reference, indels, and junctions). How do you intend to do that with a DNAStringSet object?

Depending on what you plan to achieve, there might be a better route than trying to load aligned reads in a DNAStringSet object. For example have a look at pileup() in Rsamtools if you want to summarize nucleotide content by individual position on the reference.

Cheers,

H.

0
Entering edit mode

If you just want the consensus sequence from the BAM file, there are several methods you could use, that might be faster than trying to extract things from a BSgenome package.

0
Entering edit mode

Thanks for the responses and feedback! By a FASTA alignment file I mean a FASTA formatted file with sequences that are all of the same length, and contain '-' characters. They are described here and here and can be for protein sequences or DNA sequences. They can be output by Clustal and I have used them frequently in Phylogenetics in programs like MEGA and the R 'ape' package. In code I've written before, I read such FASTA files into a DNAStringSet (in which every sequence is of the same length, and contains '-' as allowed by IUPAC) using Biostrings. I've had a look at the pileup format and Rsamtools - it looks like this might be the best route for me, getting nucleotide frequency at each position.

1
Entering edit mode

I see. But this FASTA alignment format is for representing multiple alignments, that is, all the sequences in the file are aligned to each other. In that case the aligned sequences are obtained by stretching out the original (aka unaligned) sequences by inserting '-' characters to them. As a result, the aligned sequences all have the same length and can be (at least conceptually) stacked on the resulting consensus.

However, if your BAM file contains reads from an HTS experiment, the situation is quite different, because in that case all the reads are aligned to the same reference. This is not a situation of multiple alignment. A major difference is that if you want to lay (or project) the read sequences on the reference, you need to take into account deletions (D in the cigar), insertions (I in the cigar), and skipped regions (N in the cigar), all of these being reported with respect to the reference. You can use sequenceLayer() from the GenomicAlignments package for this:

library(GenomicAlignments)
cigars <- c("5M", "4M2D1M", "2M1I2M")
##   A DNAStringSet instance of length 3
##     width seq
## [1]     5 AGGAT
## [2]     7 CCTC--G
## [3]     4 TTGG

As you can see, even though the unaligned reads are all the same length, the aligned sequences are not. Also, when laying a read that contains an insertion, the insertion gets removed from the sequence (see 3rd read in above example). This is because this insertion contains nucleotides that don't belong to the reference and cannot be matched to any valid position on the reference.

See ?sequenceLayer in GenomicAlignments for how to use sequenceLayer() on the reads that are in your BAM file.

HTH,

H.

0
Entering edit mode

Thank you for explaining Herve, that does help a lot. I see so rather than an MSA, SAM is a whole lot of pairwise alignments with the reference. Therefore pursuing getting a FASTA "MSA" from SAM will result in loss of insertions. Will pileup ignore insertions to then since it returns info on a per base basis, and insertions are not in the reference? I've read somewhere that for paired end reads it is possible to infer indels based on their length and where they map.

The code I've written that accepts MSA as input, basically identifies all the base positions in the MSA that are polymorphic, which includes gaps. - In R this is easy, just represent the MSA as a matrix and find cols which contain more than one character.

So perhaps rather than try to convert my BAM/SAM data to MSA format and then identify the polymorphisms, rather the task can be distilled to "the best way I can find the polymorphic sites, and insertions and deletions from the SAM/BAM data I currently have?"

I had some code that performed a column wise "samtools pileup", so insertions would get their own rows in the output rather than shoehorning them into the same line via +seq. It needs some tidying up as it's no longer so standalone (it's part of my BAM reading code in Gap5), but could be made so again if it's useful.

As others have pointed out this is only half the problem though. It's not going to get your data aligned better and put P CIGAR operators in. I'm not aware of tools to automatically do this, but I'm sure some must exist.

I'm not sure if it's useful.

Thanks,

Ben.

1
Entering edit mode

Hi Ben,

Just to clarify, SAM also supports what they call "padded representation" of the alignments (see the spec at http://samtools.sourceforge.net/) which is suited for multiple sequence alignment representation and is typically used for de novo assembly. However, aligners commonly used for HTS reads (e.g. bowtie, TopHat, STAR, Rsubread, gmap/gsnap, etc...) produce a SAM/BAM file with an unpadded representation  (a.k.a. unpadded SAM). So it's not SAM itself that is a whole lot of pairwise alignments with the reference, it's what the aligner you used to align your reads has put in the file.

Rsamtools::pileup() does report insertions. IIRC an insertion is reported at the position that immediately precedes it (e.g. position 10 if the insertion happened between position 10 and 11 of the reference sequence). See the man page for the details.

When you say "identify all the base positions in the MSA that are polymorphic", it sounds to me that Rsamtools::pileup() might be a first step in the right direction for that. You might also need access to the reference sequence but maybe not (you certainly will if you want to do SNP calling). Also the h5vc package might be of interest for you as it might already provide the functionalities you're trying to implement:

Others people watching the support site may have other suggestions.

Cheers,

H.

1
Entering edit mode
@martin-morgan-1513
Last seen 20 days ago
United States

Perhaps GenomicAlignments::stackStringsFromBam() (for more-or-less literally creating a DNAStringSet padded to match alignments to a particular region; maybe also sequenceLayer()) or Rsamtools::pileup() (for summarizing nucleotide frequency). In both cases the data is likely to be 'large', so you'll either query (using ScanBamParam(which=...)) or iterate through (using BamFile(..., yieldSize=...)) the BAM file.

The first example from ?stackStringsFromBam is

bamfile1 <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools")) region1 <- GRanges("seq1", IRanges(1, 60))  # region of interest stackStringsFromBam(bamfile1, param=region1)

resulting in

  A DNAStringSet instance of length 25
width seq
[1]    60 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG++++++++++++++++++++++++
[2]    60 ++CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT+++++++++++++++++++++++
[3]    60 ++++AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC+++++++++++++++++++++
[4]    60 +++++GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT+++++++++++++++++++
[5]    60 ++++++++GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG+++++++++++++++++
...   ... ...
[21]    60 ++++++++++++++++++++++++++++++++++++++++++++++++++TTAGGGAGCT
[22]    60 +++++++++++++++++++++++++++++++++++++++++++++++++++TAGGGAGCT
[23]    60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++CT
[24]    60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++T
[25]    60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++T

The use of '+' is to avoid confusion with indels '-', but can be controlled by arguments Lpadding.letter and Rpadding.letter.