Question: Efficient counting of nucleotide changes in short-reads
1
5 weeks ago by
Robert Castelo2.3k
Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo2.3k wrote:

Dear BioC experts,

I'm trying to do an efficient counting of all possible nucleotide changes between aligned nucleotides from short-read sequencing and corresponding reference nucleotides, i.e. how many A>C, A>G, A>T, C>A, C>G, C>T, etc. I have crafted a first approach using the Biostrings and GenomicAlignment packages and the examples I found in the documentation, but I end up doing a nested for loop which obviously is right now the performance bottleneck. I would like to ask if anyone has a suggestion on how could I do this job more efficiently, prior to start parallelizing the code. Please find below the code using example data from the GenomicAlignments package.

thanks!!

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
library(BSgenome.Hsapiens.UCSC.hg19)

bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES

sbparam <- ScanBamParam(what="seq",
flag=scanBamFlag(isProperPair=TRUE,
isDuplicate=FALSE,
isSecondaryAlignment=FALSE))

maxqwidth <- max(qwidth(gal))

## fetch query and reference sequences
qseq <- mcols(gal)\$seq
rseq <- getSeq(Hsapiens, as(gal, "GRanges"))

## discard indels and skipped regions
qseq2 <- sequenceLayer(qseq, cigar(gal),
from="query", to="pairwise-dense")
rseq2 <- sequenceLayer(rseq, cigar(gal),
from="reference", to="pairwise-dense")
stopifnot(identical(elementNROWS(qseq2), elementNROWS(rseq2))) ## QC

## reverse complement query sequences aligning to the reverse strand

## select only short-reads aligning to the maximum length
masklen <- width(qseq2) == maxqwidth & width(rseq2) == maxqwidth

## tally nucleotide changes
nt2 <- combn(DNA_BASES, 2)
ntPairs <- apply(nt2, 2, paste, collapse=">")
ntPairs <- c(ntPairs, apply(rbind(nt2[2, ], nt2[1, ]), 2, paste, collapse=">"))
ntChanges <- matrix(0, nrow=length(ntPairs),
ncol=maxqwidth, dimnames=list(ntPairs, 1:maxqwidth))

## for every pair of possible nucleotide changes
for (nucQ in DNA_BASES) {
for (nucR in setdiff(DNA_BASES, nucQ)) {
## build a logical mask of the query nucleotide
qNs <- hasLetterAt(qseq2, paste(rep(nucQ, maxqwidth), collapse=""), at=1:maxqwidth)
## build a logical mask of the reference nucleotide
rNs <- hasLetterAt(rseq2, paste(rep(nucR, maxqwidth), collapse=""), at=1:maxqwidth)
## sum the number of occurrences of both nucleotides at each position
ntChanges[paste(nucR, nucQ, sep=">"), ] <- colSums(qNs & rNs)
}
}

dim(ntChanges) ## nucleotide changes by short-read positions
## [1] 12 72
ntChanges[1:5, 1:10]
##        1    2    3    4    5   6    7    8    9   10
## A>C 4221  410  248  419  496 714  904 1274 1557 1119
## A>G  534 1550 1003 1227 1140 789  715 1234 1310 1440
## A>T 5155  367  218  229  321 446 1548  987 1442 1094
## C>G  149  325  354  322  561 484  334  686  683  893
## C>T 3057  764  518  334  513 377 1153  702  899  757
modified 5 weeks ago by Michael Lawrence10k • written 5 weeks ago by Robert Castelo2.3k
0
5 weeks ago by
United States
Michael Lawrence10k wrote:

To get tallies by read position, you could try VariantTools::tallyVariants() and its read_pos_breaks parameter. Set it to e.g. 0:72 to get a tally for every read position, by every position in the genome. Then you'll need to colSums() those columns in the result.

hi Michael, thanks for hint. i've been thinking looking at the documentation and examples on pileup() and thinking about it and i can't see how to use it because pileup() tallies nucleotides along the reference sequence, e.g.:

  seqnames      pos strand nucleotide count
1    chr14 19069583      +          T     1
2    chr14 19069584      +          G     1
3    chr14 19069586      +          G     1
4    chr14 19069588      +          A     1

while i'm interested to tally nucleotides along positions of aligned reads (see the ntChanges matrix in my example above). i can't see how to transform the pileup data.frame into a read-level summary of substitutions. any further idea?

thanx!