Efficient counting of nucleotide changes in short-reads
1
1
Entering edit mode
Robert Castelo ★ 3.4k
@rcastelo
Last seen 1 day ago
Barcelona/Universitat Pompeu Fabra

Dear BioC experts,

I'm trying to do an efficient counting of all possible nucleotide changes between aligned nucleotides from short-read sequencing and corresponding reference nucleotides, i.e. how many A>C, A>G, A>T, C>A, C>G, C>T, etc. I have crafted a first approach using the Biostrings and GenomicAlignment packages and the examples I found in the documentation, but I end up doing a nested for loop which obviously is right now the performance bottleneck. I would like to ask if anyone has a suggestion on how could I do this job more efficiently, prior to start parallelizing the code. Please find below the code using example data from the GenomicAlignments package.

thanks!!

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
library(BSgenome.Hsapiens.UCSC.hg19)

bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES

sbparam <- ScanBamParam(what="seq",
                        flag=scanBamFlag(isProperPair=TRUE,
                                         isDuplicate=FALSE,
                                         isSecondaryAlignment=FALSE))

gal <- readGAlignments(bamfiles[1], use.names=TRUE, param=sbparam)

maxqwidth <- max(qwidth(gal))

## fetch query and reference sequences
qseq <- mcols(gal)$seq
rseq <- getSeq(Hsapiens, as(gal, "GRanges"))

## discard indels and skipped regions
qseq2 <- sequenceLayer(qseq, cigar(gal),
                       from="query", to="pairwise-dense")
rseq2 <- sequenceLayer(rseq, cigar(gal),
                       from="reference", to="pairwise-dense")
stopifnot(identical(elementNROWS(qseq2), elementNROWS(rseq2))) ## QC

## reverse complement query sequences aligning to the reverse strand
mask <- strand(gal) == "-"
qseq2[mask] <- reverseComplement(qseq2[mask])

## select only short-reads aligning to the maximum length
masklen <- width(qseq2) == maxqwidth & width(rseq2) == maxqwidth
qseq2 <- qseq2[masklen]
rseq2 <- rseq2[masklen]

## tally nucleotide changes
nt2 <- combn(DNA_BASES, 2)
ntPairs <- apply(nt2, 2, paste, collapse=">")
ntPairs <- c(ntPairs, apply(rbind(nt2[2, ], nt2[1, ]), 2, paste, collapse=">"))
ntChanges <- matrix(0, nrow=length(ntPairs),
                    ncol=maxqwidth, dimnames=list(ntPairs, 1:maxqwidth))

## for every pair of possible nucleotide changes
for (nucQ in DNA_BASES) {
  for (nucR in setdiff(DNA_BASES, nucQ)) {
    ## build a logical mask of the query nucleotide
    qNs <- hasLetterAt(qseq2, paste(rep(nucQ, maxqwidth), collapse=""), at=1:maxqwidth)
    ## build a logical mask of the reference nucleotide
    rNs <- hasLetterAt(rseq2, paste(rep(nucR, maxqwidth), collapse=""), at=1:maxqwidth)
    ## sum the number of occurrences of both nucleotides at each position
    ntChanges[paste(nucR, nucQ, sep=">"), ] <- colSums(qNs & rNs)
  }
}

dim(ntChanges) ## nucleotide changes by short-read positions
## [1] 12 72
ntChanges[1:5, 1:10]
##        1    2    3    4    5   6    7    8    9   10
## A>C 4221  410  248  419  496 714  904 1274 1557 1119
## A>G  534 1550 1003 1227 1140 789  715 1234 1310 1440
## A>T 5155  367  218  229  321 446 1548  987 1442 1094
## C>G  149  325  354  322  561 484  334  686  683  893
## C>T 3057  764  518  334  513 377 1153  702  899  757
GenomicAlignments Biostrings • 1.3k views
ADD COMMENT
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

To get tallies by read position, you could try VariantTools::tallyVariants() and its read_pos_breaks parameter. Set it to e.g. 0:72 to get a tally for every read position, by every position in the genome. Then you'll need to colSums() those columns in the result.

ADD COMMENT
0
Entering edit mode

hi Michael, thanks for hint. i've been thinking looking at the documentation and examples on pileup() and thinking about it and i can't see how to use it because pileup() tallies nucleotides along the reference sequence, e.g.:

  seqnames      pos strand nucleotide count
1    chr14 19069583      +          T     1
2    chr14 19069584      +          G     1
3    chr14 19069586      +          G     1
4    chr14 19069588      +          A     1

while i'm interested to tally nucleotides along positions of aligned reads (see the ntChanges matrix in my example above). i can't see how to transform the pileup data.frame into a read-level summary of substitutions. any further idea?

thanx!

ADD REPLY
0
Entering edit mode

Sorry I will edit my answer.

ADD REPLY

Login before adding your answer.

Traffic: 558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6