Question

Efficient counting of nucleotide changes in short-reads

1

Entering edit mode

Robert Castelo ★ 3.4k

@rcastelo

Last seen 4 months ago

Barcelona/Universitat Pompeu Fabra

Dear BioC experts,

I'm trying to do an efficient counting of all possible nucleotide changes between aligned nucleotides from short-read sequencing and corresponding reference nucleotides, i.e. how many A>C, A>G, A>T, C>A, C>G, C>T, etc. I have crafted a first approach using the Biostrings and GenomicAlignment packages and the examples I found in the documentation, but I end up doing a nested for loop which obviously is right now the performance bottleneck. I would like to ask if anyone has a suggestion on how could I do this job more efficiently, prior to start parallelizing the code. Please find below the code using example data from the GenomicAlignments package.

thanks!!

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
library(BSgenome.Hsapiens.UCSC.hg19)

bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES

sbparam <- ScanBamParam(what="seq",
                        flag=scanBamFlag(isProperPair=TRUE,
                                         isDuplicate=FALSE,
                                         isSecondaryAlignment=FALSE))

gal <- readGAlignments(bamfiles[1], use.names=TRUE, param=sbparam)

maxqwidth <- max(qwidth(gal))

## fetch query and reference sequences
qseq <- mcols(gal)$seq
rseq <- getSeq(Hsapiens, as(gal, "GRanges"))

## discard indels and skipped regions
qseq2 <- sequenceLayer(qseq, cigar(gal),
                       from="query", to="pairwise-dense")
rseq2 <- sequenceLayer(rseq, cigar(gal),
                       from="reference", to="pairwise-dense")
stopifnot(identical(elementNROWS(qseq2), elementNROWS(rseq2))) ## QC

## reverse complement query sequences aligning to the reverse strand
mask <- strand(gal) == "-"
qseq2[mask] <- reverseComplement(qseq2[mask])

## select only short-reads aligning to the maximum length
masklen <- width(qseq2) == maxqwidth & width(rseq2) == maxqwidth
qseq2 <- qseq2[masklen]
rseq2 <- rseq2[masklen]

## tally nucleotide changes
nt2 <- combn(DNA_BASES, 2)
ntPairs <- apply(nt2, 2, paste, collapse=">")
ntPairs <- c(ntPairs, apply(rbind(nt2[2, ], nt2[1, ]), 2, paste, collapse=">"))
ntChanges <- matrix(0, nrow=length(ntPairs),
                    ncol=maxqwidth, dimnames=list(ntPairs, 1:maxqwidth))

## for every pair of possible nucleotide changes
for (nucQ in DNA_BASES) {
  for (nucR in setdiff(DNA_BASES, nucQ)) {
    ## build a logical mask of the query nucleotide
    qNs <- hasLetterAt(qseq2, paste(rep(nucQ, maxqwidth), collapse=""), at=1:maxqwidth)
    ## build a logical mask of the reference nucleotide
    rNs <- hasLetterAt(rseq2, paste(rep(nucR, maxqwidth), collapse=""), at=1:maxqwidth)
    ## sum the number of occurrences of both nucleotides at each position
    ntChanges[paste(nucR, nucQ, sep=">"), ] <- colSums(qNs & rNs)
  }
}

dim(ntChanges) ## nucleotide changes by short-read positions
## [1] 12 72
ntChanges[1:5, 1:10]
##        1    2    3    4    5   6    7    8    9   10
## A>C 4221  410  248  419  496 714  904 1274 1557 1119
## A>G  534 1550 1003 1227 1140 789  715 1234 1310 1440
## A>T 5155  367  218  229  321 446 1548  987 1442 1094
## C>G  149  325  354  322  561 484  334  686  683  893
## C>T 3057  764  518  334  513 377 1153  702  899  757

GenomicAlignments Biostrings • 1.7k views

ADD COMMENT • link updated 7.1 years ago by Michael Lawrence ★ 11k • written 7.1 years ago by Robert Castelo ★ 3.4k

score 0 · Answer 1 · 2018-12-13

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.1 years ago

United States

To get tallies by read position, you could try VariantTools::tallyVariants() and its read_pos_breaks parameter. Set it to e.g. 0:72 to get a tally for every read position, by every position in the genome. Then you'll need to colSums() those columns in the result.

ADD COMMENT • link 7.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

hi Michael, thanks for hint. i've been thinking looking at the documentation and examples on pileup() and thinking about it and i can't see how to use it because pileup() tallies nucleotides along the reference sequence, e.g.:

  seqnames      pos strand nucleotide count
1    chr14 19069583      +          T     1
2    chr14 19069584      +          G     1
3    chr14 19069586      +          G     1
4    chr14 19069588      +          A     1

while i'm interested to tally nucleotides along positions of aligned reads (see the ntChanges matrix in my example above). i can't see how to transform the pileup data.frame into a read-level summary of substitutions. any further idea?

thanx!

ADD REPLY • link 7.1 years ago Robert Castelo ★ 3.4k

0

Entering edit mode

Sorry I will edit my answer.

ADD REPLY • link 7.1 years ago Michael Lawrence ★ 11k