Hello
I am wondering if there is some package to detect str in a Bam file. I have found some things for RNA, but I don't know if they also can be used for DNA
thanks
Hello
I am wondering if there is some package to detect str in a Bam file. I have found some things for RNA, but I don't know if they also can be used for DNA
thanks
Hi,
I don't know if there is any Bioconductor package for detecting short tandem repeats in a BAM file. Note that you could try to do this on the FASTQ files (containing the read sequences before alignment) instead of the BAM file. FWIW here is a home-made solution that builds on top of the findTandemRepeats()
function from the A: Is there any package helps finding Tandem Repeats ? post:
## Return the index of sequences in DNAStringSet object 'x' that ## contain repeat tandems. detectTandemRepeats <- function(x) { big_subject <- unlist(xscat(x, "-"), use.names=FALSE) tr_ranges_on_big_subject <- ranges( findTandemRepeats(big_subject, include.period1=TRUE) ) x_ranges_on_big_subject <- successiveIRanges(width(x), gapwidth=1L) hits <- findOverlaps(tr_ranges_on_big_subject, x_ranges_on_big_subject) q_hits <- queryHits(hits) s_hits <- subjectHits(hits) ## A sanity check. stopifnot( all(start(tr_ranges_on_big_subject)[q_hits] >= start(x_ranges_on_big_subject)[s_hits]), all(end(tr_ranges_on_big_subject)[q_hits] <= end(x_ranges_on_big_subject)[s_hits])) unique(s_hits) }
Then:
library(GenomicAlignments) library(RNAseqData.HNRNPC.bam.chr14) file <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] gal <- readGAlignments(file, use.names=TRUE, param=ScanBamParam(what="seq")) seq <- mcols(gal)$seq names(seq) <- names(gal) seq # A DNAStringSet instance of length 800484 # width seq names # [1] 72 TGAGAATGATGATTTCC...TTTTTTATGGCTGCAT ERR127306.26333541 # [2] 72 CCCATATGTACATCAGG...CAGTTGGATACACACA ERR127306.12926170 # [3] 72 CCCCAGGTATACACTGG...TCAAGGTGGACACCAG ERR127306.12926170 # [4] 72 CATAGATGCAAGAATCC...AAAAGATAACTTACCA ERR127306.26974899 # [5] 72 TAGCACACTGAATTCAA...TTACCCCAAGGATACA ERR127306.26974899 # ... ... ... # [800480] 72 GAAATTGCTGAAACTTG...GAGTACATCACATCAG ERR127306.3567919 # [800481] 72 CAAAGCTGGATGTGTCT...GTGTGGTTTTGCTGCC ERR127306.21510817 # [800482] 72 GTGTGGTTTTGCTGCCC...AGGCACATTAATTTTC ERR127306.21510817 # [800483] 72 AAGGAACCCTTGAACTC...GGTCCTCTGCCTCAAG ERR127306.661203 # [800484] 72 CATGACTTGATGGCTGG...TTGCCATACAGTATTT ERR127306.661203 tandem_repeat_idx <- detectTandemRepeats(seq)
tandem_repeat_idx
now contains the index of alignments in gal
with repeat tandems. To see the details:
findTandemRepeats(seq[[tandem_repeat_idx[1]]], include.period1=TRUE) # Views on a 72-letter DNAString subject # subject: CTGTCTCAAAAACAAACAAACCAAAATAAT...GATGATGATGACTGCAATGAGCAGGTGCAC # views: # start end width # [1] 29 53 25 [ATGATGATGATGATGATGATGATGA] findTandemRepeats(seq[[tandem_repeat_idx[2]]], include.period1=TRUE) # Views on a 72-letter DNAString subject # subject: ATTTTGAAGCCCCAGGAATATGTGTGTGTG...TGTGTGTGTGTGTGTGTGTGTGTGTGTGTG # views: # start end width # [1] 39 72 34 [TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG] findTandemRepeats(seq[[tandem_repeat_idx[3]]], include.period1=TRUE) # Views on a 72-letter DNAString subject # subject: CACACACACACACACACACACACACACACA...TGTTTCTCCAGGCTTGGTCCCTGGGACCGT # views: # start end width # [1] 1 31 31 [CACACACACACACACACACACACACACACAC] ## etc...
Cheers,
H.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I edited my answer above to make a couple of corrections to
detectTandemRepeats()
.Also, a different approach would be to find the alignments whose range on the reference overlaps with known tandem repeats. Something like this:
Note that the 2 approaches produce significantly different results. This is because the presence of a tandem repeat in the unaligned read sequence (approach 1) doesn't imply that the aligned read overlaps with a TRF region (approach 2), and vice-versa. For the following reasons:
findTandemRepeats()
(used bydetectTandemRepeats()
) will probably see no tandem repeat in the read sequence itself.findTandemRepeats()
finds exact repeats while those found by Tandem Repeats Finder are allowed to contain some fuzziness (i.e. the pattern of a Tandem Repeat is not repeated in an exact way).A closer inspection would probably reveal other reasons.
H.