I am using the Biostrings package in R to translate my NGS reads directly to amino acids. It has been great so far in my analysis, but I need to modify it for a part of the analysis.
I am using a conserved genomic site to begin translating reads in the forward direction from the 5' end. However, if the read is truncated at the 5' end, I now need to start translating from the conserved site at the 3' end end to stay in frame. I believe I need to start reading the string at the -3 position and read each codon forward every -3 from that position? *It is fine if the output is 3' - 5'- I can use the reverse function to reverse the order of the amino acids.
For example:
5' CTTGAGATGCGAATT 3'
5' - - - - AGATGCGAATT 3'
The first one will translate in frame. The second read will not translate in frame. Thanks for any help/advice!
Here is the code that I believe will need to be updated:
"last %d bases were ignored", phase);
return 1;
}
return 0;
}
/*
* --- .Call ENTRY POINT ---
* Return an AAStringSet object.
*/
SEXP DNAStringSet_translate(SEXP x, SEXP base_codes, SEXP lkup, SEXP skipcode)
{
cachedXStringSet X, Y;
cachedCharSeq X_elt, Y_elt;
char skipcode0;
int ans_length, i, errcode;
SEXP ans, width, ans_width;
TwobitEncodingBuffer teb;
X = _cache_XStringSet(x);
skipcode0 = (unsigned char) INTEGER(skipcode)[0];
ans_length = _get_cachedXStringSet_length(&X);
PROTECT(width = NEW_INTEGER(ans_length));
for (i = 0; i < ans_length; i++) {
X_elt = _get_cachedXStringSet_elt(&X, i);
INTEGER(width)[i] = X_elt.length / 3;
}
PROTECT(ans = alloc_XRawList("AAStringSet", "AAString", width));
Y = _cache_XStringSet(ans);
teb = _new_TwobitEncodingBuffer(base_codes, 3, 0);
ans_width = _get_XStringSet_width(ans);
for (i = 0; i < ans_length; i++) {
X_elt = _get_cachedXStringSet_elt(&X, i);
Y_elt = _get_cachedXStringSet_elt(&Y, i);
errcode = translate(&X_elt, &Y_elt, &teb, lkup, skipcode0);
if (errcode == -1) {
UNPROTECT(2);
if (ans_length == 1)
error("%s", errmsg_buf);
else
error("in 'x[[%d]]': %s", i + 1, errmsg_buf);
}
if (errcode == 1) {
if (ans_length == 1)
warning("%s", errmsg_buf);
else
warning("in 'x[[%d]]': %s", i + 1, errmsg_buf);
}
INTEGER(ans_width)[i] = Y_elt.length;
}
UNPROTECT(2);
return ans;
}

Yes, I need 15 nucleotides from the 3' end to remain in any case. That solution is easier. That will work- is that what this does? Thanks!
You'll have to try it.
Also make sure to set
no.init.codonto TRUE when callingtranslate()on sequences where the initiation codon has been removed or truncated.no.init.codonis a new argument totranslate():https://github.com/Bioconductor/Biostrings/commit/dcd915c21b96d4d981d4530f913ff9108239e129
I just added it to Biostrings 2.50.1 (BioC 3.8, current release) and Biostrings 2.51.1 (BioC 3.9, current devel). Both versions of Biostrings will become available via
BiocManager::install()to BioC >= 3.8 users in the next couple of days.Cheers,
H.
Thank you very much for your responses. I have not yet tried this approach because I am playing around with retrieving intact reads specific to a genomic range from my indexed BAM file using the Rsamtools and GenomiRanges packages. However, it seems that most of the Bioconductor tools that analyze the reads in a BAM file are designed to group the reads. I am haplotyping mutations so I need to analyze the reads individually- is there a tool/method that can use the indexed BAM file to translate my subset in frame?