I'm building a workflow in biconductor to evaluate in vivo viral vector function with assessment using paired-end illumina sequencing. In this workflow, I'm aligning the paired reads to a known genomic reference from which all the vector genomes are derived although all with different regions of the reference included. I'm using bowtie 2 to align the pairs to the reference and samtools to convert the output into a bam that is then read into bioconductor using "readGAlignmentPairsFromBam". That far everything is well. However, what I would need for the remaining parts of the workflow is to convert the GAlignmentPairs object into a GAlignment object where each pair is fused into a single GenomicRange with a N containing CIGAR describing the alignment towards the reference. I have found that the galp2gal function of the RIPSeeker package should do exactly what I need. Unfortunately this function considers every read pair where the sequences meet or overlap (i.e., end of first read >= start of last read in the pair) as out-of-bound ranges and discards them. I really need these pairs as well, as I use long reads (2x300bp) and have variable but sometimes short inserts, to the reads often overlap. Any hlep to solve this would be highly appreciated. The solution can be either directly on the BAM file (e.g., if it is possible with samtools or equivalent) or in R/Bioconductor. Thanks'
Hi,
As far as I know, a pair where the 2 ends overlap cannot be represented with a single CIGAR. So I guess this is why
RIPSeeker::galp2gal()
discards these pairs. I'm curious about what you're planning to do with these "merged CIGARS". What do you need them for? If you are after the ranges, just squeeze them out of your GAlignmentPairs object withgranges()
orgrglist()
.granges()
will return you a single genomic range per pair whilegrglist()
will return you n ranges per pair (where n = nb of Ns in the 2 CIGARS + 2). See?GAlignmentPairs
for more information. If you want exactly 2 ranges per pair (and thus ignore the Ns in the CIGARS), use:Maybe I should add this functionality to GenomicAlignments.
If you are after the coverage or want to do
pileup()
, you don't need the pairing at all (i.e. just usereadGAlignments()
to load your paired-end reads). Anyway knowing a little bit more about what you're planning to do downstream would probably help us help you.Cheers,
H.
Hi Hervé,
Thank you for your very helpful reply. Yes, the granges() function is what I use today and for many use cases this is perfectly acceptable. For the application I'm after, however, I'm trying to incorporate annotation of mutations into the sequence as well. We use gene chip arrays to produce the sequences with a low to intermediate mutagenesis rate of approx. 0.5% and couple this to molecular barcodes (placed upstream and used to pool all reads from one specific gene chip synthesised sequence). As the long read PE illumina 2x300 produces rather low quality at the end of each read, I'm after a strategy where the consensus CIGAR is generated for the entire sequence based on the overlaps between the two paired reads. I do recognise that this is a non-trivial calculation and why this would not be included in a standard function. Maybe a better strategy is to first split the bam into one with overlapping pairs and a second with non-overlapping and then find a separate utility to generate consensus sequences from each pair and then re-align them as single-end reads?
As an update to this, I can say that I'm now testing to approaches, but neither of them is 100% what I would like. The first one is that I use the FLASH utility to first merge any overlapping paired end reads and then align them using bowtie2. The other alternative I use is the "clipOverlap" function of bamUtil (http://genome.sph.umich.edu/wiki/BamUtil) to clip the overlap from the read with lowest average quality after PE alignment using bowtie2 whereafter I use the RIPSeeker::galp2gal() function.
Both approaches work and produce a CIGAR that I can work with. However, they have their respective weakness and thus I would tike to find a better solution. The weakness with the FLASH prior to bowtie approach is that it does not take all information at hand into account. We do have a lot of information about the extent of the overlap in out sequence knowledge that requires bowtie first. Thus the risk is that we will get false merged overlaps that will not align correctly. The issue with the bamUtil after bowtie is smaller, but that is that the added sequencing fidelity of sequencing the same sequence twice (in the overlap) is not taken into account. This is especially important in longer fragments where only the absolute ends of the 2x300bp reads overlap. If I could find a solution to this, I would be happy with the second approach although the clipOverlap function is surprisingly slow and would benefit from multithreading...
Another update.... pandaseq-sam (https://github.com/neufeld/pandaseq-sam) could also have been an alternative. However, the algoritm it uses to generate the merged sequence is unclear and the output is only a FASTA that then needs to be aligned again with bowtie, but now all read quality information is lost.... I'll keep looking, but any tips are appreciated.
Hi Tomas,
So if I understand correctly, you're trying to infer the sequence of the DNA fragment and then re-align it as a single end read. Have you considered doing this *before* alignment (i.e. directly from the sequences in the 2 FASTQ files) instead of *after* the PE alignment step? For example in your case, if the DNA fragment was < 600bp then the sequences of the 2 reads should "overlap" (i.e. have suffixes that are reverse complement from each other). Let's call this "overlap before alignment". More precisely, with the following 16-bp DNA fragment:
and 2x10bp paired-end sequencing, the 2 reads would be (assuming there was no sequencing errors):
The 4-nucleotide suffixes in the 2 reads are reverse complement one from the other (AGAA and TTCT) so the 2 reads do "overlap before alignment".
The problem of inferring the sequence of the DNA fragment can be seen as the problem of finding the longest overlap between the 2 reads (eventually modulo some errors). From a computational point of view, this is a much simpler problem than doing PE alignment to a reference genome and then trying to infer the sequence of the DNA fragments from the alignments.
It should be noted that if the 2 reads in a pair overlap before alignment, it could just be because the same motif is found twice in the DNA fragment. For example, the 2 reads above could come from the following DNA fragment:
In that case, the sequence of the DNA fragment cannot be inferred from the reads but we have no way to tell so we have a false positive. AFAIK these false positives are inherent to the PE sequencing technology itself.
Note that, in addition to being much more complicated, trying to infer the sequence of the DNA fragment *after* alignment can introduce additional false positives as well as false negatives. It really depends on what the aligner does exactly and what settings are used. Any reason for trying to do this *after* alignment?
H.
Edit: Maybe this link is relevant to our discussion:
http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html
They refer to the process of inferring the sequence of the DNA fragment as "stitching R1 and R2" and provide a link to a list of tools for doing so. I'm not aware of such tools in Bioconductor.