Question: Efficient counting of nucleotide changes in short-reads
1
gravatar for Robert Castelo
5 weeks ago by
Robert Castelo2.3k
Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo2.3k wrote:

Dear BioC experts,

I'm trying to do an efficient counting of all possible nucleotide changes between aligned nucleotides from short-read sequencing and corresponding reference nucleotides, i.e. how many A>C, A>G, A>T, C>A, C>G, C>T, etc. I have crafted a first approach using the Biostrings and GenomicAlignment packages and the examples I found in the documentation, but I end up doing a nested for loop which obviously is right now the performance bottleneck. I would like to ask if anyone has a suggestion on how could I do this job more efficiently, prior to start parallelizing the code. Please find below the code using example data from the GenomicAlignments package.

thanks!!

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
library(BSgenome.Hsapiens.UCSC.hg19)

bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES

sbparam <- ScanBamParam(what="seq",
                        flag=scanBamFlag(isProperPair=TRUE,
                                         isDuplicate=FALSE,
                                         isSecondaryAlignment=FALSE))

gal <- readGAlignments(bamfiles[1], use.names=TRUE, param=sbparam)

maxqwidth <- max(qwidth(gal))

## fetch query and reference sequences
qseq <- mcols(gal)$seq
rseq <- getSeq(Hsapiens, as(gal, "GRanges"))

## discard indels and skipped regions
qseq2 <- sequenceLayer(qseq, cigar(gal),
                       from="query", to="pairwise-dense")
rseq2 <- sequenceLayer(rseq, cigar(gal),
                       from="reference", to="pairwise-dense")
stopifnot(identical(elementNROWS(qseq2), elementNROWS(rseq2))) ## QC

## reverse complement query sequences aligning to the reverse strand
mask <- strand(gal) == "-"
qseq2[mask] <- reverseComplement(qseq2[mask])

## select only short-reads aligning to the maximum length
masklen <- width(qseq2) == maxqwidth & width(rseq2) == maxqwidth
qseq2 <- qseq2[masklen]
rseq2 <- rseq2[masklen]

## tally nucleotide changes
nt2 <- combn(DNA_BASES, 2)
ntPairs <- apply(nt2, 2, paste, collapse=">")
ntPairs <- c(ntPairs, apply(rbind(nt2[2, ], nt2[1, ]), 2, paste, collapse=">"))
ntChanges <- matrix(0, nrow=length(ntPairs),
                    ncol=maxqwidth, dimnames=list(ntPairs, 1:maxqwidth))

## for every pair of possible nucleotide changes
for (nucQ in DNA_BASES) {
  for (nucR in setdiff(DNA_BASES, nucQ)) {
    ## build a logical mask of the query nucleotide
    qNs <- hasLetterAt(qseq2, paste(rep(nucQ, maxqwidth), collapse=""), at=1:maxqwidth)
    ## build a logical mask of the reference nucleotide
    rNs <- hasLetterAt(rseq2, paste(rep(nucR, maxqwidth), collapse=""), at=1:maxqwidth)
    ## sum the number of occurrences of both nucleotides at each position
    ntChanges[paste(nucR, nucQ, sep=">"), ] <- colSums(qNs & rNs)
  }
}

dim(ntChanges) ## nucleotide changes by short-read positions
## [1] 12 72
ntChanges[1:5, 1:10]
##        1    2    3    4    5   6    7    8    9   10
## A>C 4221  410  248  419  496 714  904 1274 1557 1119
## A>G  534 1550 1003 1227 1140 789  715 1234 1310 1440
## A>T 5155  367  218  229  321 446 1548  987 1442 1094
## C>G  149  325  354  322  561 484  334  686  683  893
## C>T 3057  764  518  334  513 377 1153  702  899  757
ADD COMMENTlink modified 5 weeks ago by Michael Lawrence10k • written 5 weeks ago by Robert Castelo2.3k
Answer: Efficient counting of nucleotide changes in short-reads
0
gravatar for Michael Lawrence
5 weeks ago by
United States
Michael Lawrence10k wrote:

To get tallies by read position, you could try VariantTools::tallyVariants() and its read_pos_breaks parameter. Set it to e.g. 0:72 to get a tally for every read position, by every position in the genome. Then you'll need to colSums() those columns in the result.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Michael Lawrence10k

hi Michael, thanks for hint. i've been thinking looking at the documentation and examples on pileup() and thinking about it and i can't see how to use it because pileup() tallies nucleotides along the reference sequence, e.g.:

  seqnames      pos strand nucleotide count
1    chr14 19069583      +          T     1
2    chr14 19069584      +          G     1
3    chr14 19069586      +          G     1
4    chr14 19069588      +          A     1

while i'm interested to tally nucleotides along positions of aligned reads (see the ntChanges matrix in my example above). i can't see how to transform the pileup data.frame into a read-level summary of substitutions. any further idea?

thanx!

ADD REPLYlink written 5 weeks ago by Robert Castelo2.3k

Sorry I will edit my answer.

ADD REPLYlink written 5 weeks ago by Michael Lawrence10k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 183 users visited in the last hour