Question: Comparing read sequence back to reference sequence
0
5.0 years ago by
Germany
Stefanie Tauber40 wrote:

Hi,

I would like to compare aligned reads back to the reference sequence. Matches, mismatches, insertions and deletions are present in my alignments but no soft/hard clipping. I aim for having an exhaustive enumeration of all occurring (mis)matches\indels per position (of the read) over all reads.

I get the position and length of indels by means of explodeCigarOps and explodeCigarOpLengths.

I also know how to get my query and reference sequence to the 'pairwise' space (using 'sequenceLayer'). What would be the best way to record per position and read if and which mismatch occurs?

Basically a kind of compareStrings followed, right?
I just wonder if I miss some fancy function..

Best,

Stefanie

modified 5.0 years ago by Nathaniel Hayden180 • written 5.0 years ago by Stefanie Tauber40

I have run into a similar issue with PileupParam. My question is how would you perform a pileup on only the last nucleotide of each read with variable length reads, such as cycle_bins=c(Inf-1, Inf).

Any insight appreciated,

Sam

3
5.0 years ago by
United States
Nathaniel Hayden180 wrote:

Hi, Stefanie. See Rsamtools::pileup. (There's even an embarrassing, but hopefully useful video by yours truly on our YouTube channel.) And I actually just added support for insertions!

The results are presented in a way to make it useful for a variety of subsequent analyses. The tool is designed to give a summary of what's going on at each genomic position in terms of bases, strands, indels, and where the reads aligned (binning by cycle). It might feel like overkill if you're really looking to only compare a single read with a single reference sequence, but it should be trivial to get what you want from the results--and it's a good tool if you want to do more. The examples section is quite illustrative, and should also give you ideas on how to munge the results.

I highly recommend using a ScanBamParam for performance reasons.

Let me know if you need help or more explanation. I'd also appreciate feedback on the documentation.

Hi Nate,

thanks for the comment. I guess cycle_bins will be particularly useful to me. I will let you know if I encounter any problems! Thanks a lot!

Hi Nate,

I just checked out pileup:

I specify the parameters as follows:

p_param = PileupParam(distinguish_strands = TRUE, cycle_bins = seq(0, 1), max_depth = 41000,min_mapq = 0)
res = pileup(bamfile, pileupParam = p_param)

By doing so, I specify the first nucleotide of each alignment, correct? Thus, I should actually retrieve all reads in my bam file. However, I only retrieve a certain subset. This subset also changes if I use a diffent interval for cycle_bins (such as seq(1,2)). I have checked on the CIGAR string if e.g. reads with soft clipping are not considered at all, but the numbers do not match.

I checked the maximum coverage, so I do not lose any reads because of this filter.

Do you have any idea why I do not retrieve all reads?

Best,

Stefanie

1

Hi Stephanie,

My first thought is the default for include_query_Ns (ambiguous nucleotides) and include_insertions is FALSE, so based on your PileupParam, some alignments might slip through. My second thought is send me your file so I can look at it :) ( nhayden at fredhutch point org )

"[S]pecify the first nucleotide of each alignment" doesn't necessarily tell the whole story. It's more like "for each position X, of all alignments that overlap X (and satisfy the other criteria), count only the alignments that align with their first nucleotide on X." But yes, if there isn't any clipping in the file my initial reaction is all reads would be counted [at least] once. See the note in "Insertions and padding" about how insertions might skew that.

But it sounds like there might be clipping after all?

Hi,

Just wanted to let you know that I think it is a problem with the soft clipping. As soon I am sure, I will let you know...

Best,

Stefanie

1

Hi, Stefanie. I will update the documentation to be more specific about how hard and soft clipping affects results. I had only left this part out of the earlier correspondence because you said there was no clipping, but I'm happy to clarify. (If you preprocess the file to remove soft clipping I believe the results will be more in line with what you expect.)

Soft and hard clipping operations in the CIGAR never contribute to counts, i.e., there are no hard-/soft-clip nucleotide codes (in contrast to indels). But because samtools calculates cycle position with respect to the SEQ field (leaving deletions out of the discussion for now) in the file, soft clipping can affect the cycle position. The relevant derived field in the samtools library is called qpos for "query position", and this is what's used to calculate cycle_bin. I'm going to attempt some ASCII art. Let's say the two alignments below start at position 11; the nucleotides are the SEQ field:

           pos 11, qpos 2
|
v
1S2M2S3M  tACttGTN
1H2M2H3M  ACGTN
^
|
pos 11, qpos 1


So you can see the first A in each alignment is reported at genomic position position 11, yet because of soft clipping the position of the A within the SEQ field of the first alignment is 2. Incidentally, these very sequences are used in unit testing for pileup (though in the test file they start at 1).

library(Rsamtools)
fl <- system.file("extdata", "tiny.bam", package="Rsamtools")
sbp <- ScanBamParam(which=GRanges(c("ext_cigar_soft", "ext_cigar_hard"), IRanges(1,10)))
p_param <- PileupParam(cycle_bins=seq(0, 2))
res <- pileup(file=fl, scanBamParam=sbp, pileupParam=p_param)

Also, if you're unsure about how any other functionality works, you can check out the human-readable sam versions of the bam files (namely 'Rsamtools/inst/extdata/tiny.sam') used for unit testing of pileup. The unit tests themselves ('Rsamtools/inst/unitTests/test_pileup_single_range.R') also stand as living documentation.

Let me know if you have any more questions.

Hi,

thanks for the update and the explanation!! I will mainly work on non-soft-clipped files, that's why I didn't put an emphasis on it..
now, everything is crystal-clear. Without soft-clipping pileup recovers as many reads as I expect.
One last question: I would like to retrieve the very last position of each read. Any possibility to specify this? Something along the lines of (Inf-1, Inf].
One actually knows the last cycle position for each read (since we know the width of each read). Is there any possibility to share this knowledge with cycle_bins?

Best,

Stefanie

1

Happy to help. Sure, if reads in the file(s) are a fixed width, say X, a cycle_bins arg c(X - 1, X) should get the last cycle position as long as it satisfies the other filter criteria. (That is, just keep in mind that all the PileupParam filter criteria are expected to interact.) Also, if there's a range in lengths, you could use cycle_bins=seq(lower_limit - 1, upper_limit) so you can see the results within the tail end of the reads broken down by each position within the read.

I have run into a similar issue with PileupParam. My question is how would you perform a pileup on only the last nucleotide of each read with variable length reads, such as cycle_bins=c(Inf-1, Inf).

Any insight appreciated,

Sam

Hi, Sam. Your question deserves its own top-level post. I didn't see this until today (and only because someone pointed it out to me) because it was a comment, so I didn't get a notification for it. Could you post a separate question for this and give more information surrounding your use case? And please add appropriate tags, e.g., pileup, rsamtools, bam, etc.

And it would be a good idea to reference this post, or any of the specific comments herein.