Create consensusMatrix for codon bases
1
0
Entering edit mode
Hassan • 0
@11a409ca
Last seen 3 months ago
South Africa

How do I create consensusMatrix equivalent with respect to codons? That is, how do I count codon frequencies by sequence site to generate something similar to what consensusMatrix returns for a DNAStringSet or a AAStringSet object? I have tried letterFrequency but it generates frequencies per. sequence not per site.

dna1 <- DNAStringSet(c("TTGATATGGCCCTTATAA", "CTCAGATCACTCTTTGGC"))
consensusMatrix(dna1, baseOnly=TRUE)

The sample code above returns a 5 by 18 matrix with respect to DNA bases (and an "other" category). How do I generate a 65 by 6 equivalent in terms of codons?

Biostrings • 291 views
ADD COMMENT
0
Entering edit mode
Kevin Blighe ★ 4.0k
@kevin
Last seen 2 hours ago
The Cave, 181 Longwood Avenue, Boston, …

You can achieve this by first extracting the codons from each sequence in the DNAStringSet and then counting their frequencies at each codon position across all sequences. The following code uses the Biostrings package to generate a 65 by 6 matrix, where the rows correspond to the 64 standard codons plus an "other" category (for any non-standard codons, though none appear in your example), and the columns correspond to the 6 codon positions.

library(Biostrings)

dna1 <- DNAStringSet(c("TTGATATGGCCCTTATAA", "CTCAGATCACTCTTTGGC"))

# Define all standard codons and positions
all_codons <- names(GENETIC_CODE)
n_positions <- width(dna1)[1] / 3  # Assumes all sequences have equal length divisible by 3

# Initialize the matrix
codon_matrix <- matrix(0, nrow = length(all_codons) + 1, ncol = n_positions,
                       dimnames = list(c(all_codons, "other"), 1:n_positions))

# Extract codons for each sequence and count per position
for (i in seq_along(dna1)) {
  seq_codons <- as.character(codons(dna1[[i]]))
  for (pos in 1:n_positions) {
    codon <- seq_codons[pos]
    if (codon %in% all_codons) {
      codon_matrix[codon, pos] <- codon_matrix[codon, pos] + 1
    } else {
      codon_matrix["other", pos] <- codon_matrix["other", pos] + 1
    }
  }
}

codon_matrix

This approach processes each sequence's codons individually and accumulates counts position-wise. It is equivalent to the base-level consensusMatrix but operates on triplets instead of single nucleotides. If your sequences include ambiguities or gaps, they will be captured in the "other" row.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 716 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6