Is there any package helps finding Tandem Repeats ?
1
1
Entering edit mode
a.afshinfard ▴ 10
@aafshinfard-7617
Last seen 9.6 years ago
Iran, Islamic Republic Of

Hi everyone,

i have a DNA reference  and i want a package that helps me with finding tandem repeats in that sequence,

i specially just want the regions of tandem repeats and using them inside R for further computation with bioconductor

 

thanks everyone

r bioconductor Repeat tandem repeats • 3.3k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 11 days ago
Seattle, WA, United States

Hi,

I'm not aware of any BioC package dedicated to finding Tandem Repeats.

However Bioconductor provides masked genomes (e.g. the BSgenome.Hsapiens.UCSC.hg19.masked package). In a masked genome the chromosome sequences are the same as in a non-masked genome (e.g. the BSgenome.Hsapiens.UCSC.hg19 package) except that masks have been added on top of them to mask different kinds of regions. For example in BSgenome.Hsapiens.UCSC.hg19.masked each sequence has 4 masks and the 4th one is the "TRF mask" which masks the Tandem Repeats of period <= 12:

library(BSgenome.Hsapiens.UCSC.hg19.masked)
genome <- BSgenome.Hsapiens.UCSC.hg19.masked

## Display Tandem Repeats on chr17.
Views(unmasked(genome$chr17), masks(genome$chr17)$TRF)
#   Views on a 81195210-letter DNAString subject
# subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTG...GGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT
# views:
#            start      end width
#     [1]    30966    30996    31 [GGGCTGGGACCCGGGCTGGGGCCCGGGCTGG]
#     [2]    37773    37798    26 [TTTTTTTTTTTTTTTTTTTTTTTTTT]
#     [3]    40174    40389   216 [AAAGAAGGAAAGAAAGAAGGA...AGAAAGAAAGAAGAAAGAAA]
#     [4]    68984    69008    25 [TTTTTTTTTTTTTTTTTTTTTTTTT]
#     [5]    82621    82678    58 [GTGTGTGTGTGTGTGTGTGTG...GTGTGTGTGTGTGTGTGTGT]
#     ...      ...      ...   ... ...
# [11977] 81162540 81162582    43 [AAAAAAAAGAAAGAAAGAAAGAAAAGAAAAGAAAAAAAAGAAA]
# [11978] 81166318 81166372    55 [TGAGTGTGTATGACTGTGTGT...TTTGTGTGTGTGAGTGTGTA]
# [11979] 81173535 81173562    28 [TGATTACCAGGCTGATTACCAGGCTGAT]
# [11980] 81185759 81185799    41 [ACACACACACACACACACACACACACACACACACACACACA]
# [11981] 81194905 81195206   302 [GTGAGGGTGAGGGTGAGGGTG...GTGGTGTGTGGGTGTGGGTG]

See ?BSgenome.Hsapiens.UCSC.hg19.masked and http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ for more information

You can also implement your own little Tandem Repeats finder function:

## Find all *exact* tandem repeats with a period equal to or a divisor of
## 'period.multiple'.
.findTandemRepeats0 <- function(subject, period.multiple=2,
                                         include.period1=FALSE, min.length=24)
{
    if (!isSingleNumber(period.multiple))
        stop("'period.multiple' must be a single integer")
    if (!is.integer(period.multiple))
        period.multiple <- as.integer(period.multiple)
    if (period.multiple < 2L)
        stop("'period.multiple' must be >= 2")
    if (!isTRUEorFALSE(include.period1))
        stop("'include.period1' must be TRUE or FALSE")
    if (!isSingleNumber(min.length))
        stop("'min.length' must be a single integer")
    if (!is.integer(min.length))
        min.length <- as.integer(min.length)
    if (min.length < 12L)
        stop("'min.length' must be >= 12")
    s1 <- subseq(subject, start=1L+period.multiple)
    s2 <- subseq(subject, end=-1L-period.multiple)
    ir <- as(as.raw(s1) == as.raw(s2), "NormalIRanges")
    ir <- ir[width(ir) >= period.multiple]
    end(ir) <- end(ir) + period.multiple
    ir <- ir[width(ir) >= min.length]
    repeats <- Views(subject, ir)
    
    af <- alphabetFrequency(repeats, baseOnly=TRUE)
    ok <- af[ , "other"] == 0L  # has no IUPAC ambiguities
    if (!include.period1)
        ok <- ok & rowSums(af[ , DNA_BASES] != 0L) >= 2L
    repeats[which(ok)]
}

## Find all *exact* tandem repeats of period <= 12.
findTandemRepeats <- function(subject, include.period1=FALSE)
{
    trs_list <- lapply(7:12,
        function(period.multiple)
            ranges(.findTandemRepeats0(subject,
                                       period.multiple,
                                       include.period1))
    )
    trs <- sort(unique(do.call("c", trs_list)))
    Views(subject, trs)
}

Then:

chr17_trs <- findTandemRepeats(unmasked(genome$chr17), include.period1=TRUE)
chr17_trs
#   Views on a 81195210-letter DNAString subject
# subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTG...GGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT
# views:
#            start      end width
#     [1]    37773    37798    26 [TTTTTTTTTTTTTTTTTTTTTTTTTT]
#     [2]    40289    40328    40 [AGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA]
#     [3]    68984    69008    25 [TTTTTTTTTTTTTTTTTTTTTTTTT]
#     [4]    82621    82678    58 [GTGTGTGTGTGTGTGTGTGTG...GTGTGTGTGTGTGTGTGTGT]
#     [5]    90847    90891    45 [AATAATAATAATAATAATAAT...ATAATAATAATAATAATAAT]
#     ...      ...      ...   ... ...
# [11413] 81194994 81195034    41 [TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG]
# [11414] 81195037 81195083    47 [AGGGTGAGGGTGAGGGTGAGG...GTGAGGGTGAGGGTGAGGGT]
# [11415] 81195087 81195127    41 [GGGTTGGGGTTGGGGTTGGGGTTGGGGTTGGGGTTGGGGTT]
# [11416] 81195123 81195162    40 [GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT]
# [11417] 81195159 81195190    32 [GGGTGTGGGTGTGGGTGTGGGTGTGGGTGTGG]

Even though the results are very close to the Tandem Repeats Finder locations provided by UCSC, there are some small differences. I didn't take a close look but it seems that one of the reasons for these differences is that the Tandem Repeats from UCSC are allowed to contain some fuzziness (i.e. the pattern of a Tandem Repeat is not repeated in an exact way), whereas findTandemRepeats() finds exact Tandem Repeats.

Cheers,

H.

ADD COMMENT
0
Entering edit mode
Is there a way to do this with only raw sequences (like "GCAGGGCTACAAGAAAGTTTGAACTGATTTCTTATGAAATAGCAGGCAACATCAATTTCCTAATCGCTTTAAGTCTAAATGACACACTTCTGCACTTTGCCTAAAGTACCACAATGCTTCTTTATAGCTTTCCCTCTATATAATGGCTGTTTCTACCTTATAAACTGCACACAACTCCCTGTTGAAGTTCTTCCCTGTCACAGTCTGCCTGAGTGCCCTATTTAGGTACCTTTGAAGGAGACTCTCAACTTTGTCATGACAGGAACTTTAGAGCCAGTGAGTTACCCCAAATACCAGCCTCATCCTGTATCCAGATACACAGCTCAGTGCCCACATATTTAGTTGTTTTACTGCCTGTTTCCTTTGATATCAGTCCTGGAAGTTACACCTCCAAACCCTGAGCAGTGCCTCAGTCTCTGCTCATAGCAGTAGCACATCCAAAGTGGCATCACAAGAAATGTCACAGAGATGCCATTTTGCAGATTAAGTGTGATCCTTTAAAACCACAGTGGATGGCTTCTTAATCACATCTTGCATCAACATGGTAACTACACATTTGTGGAGAAATATGTTGCTTTTCTTTTTATTTATTCTTACAAGTACACACATAGGTCAGAATGCTTTGGTTCGATGCAATTTTCAGGAGAAAATTGAACAATAAAATAGTGCTATCATGTGTTTCACTTGCAGGTCTCCAGACAGTAGGAATATTCAGAGTTGGAAGCTCAAAAAAGAGAGTAAGACAGGTGAGTGGATATGGATATGGTTTTCTTTCATCAGCTATAATACAAACAAATCTGATTCCTTTTGTTGTTATTTCTTTCCCATCTGTAGCATGGTGAAATTTAGTCTTTCTGGGTCAGTGACTTTGCTGTTTTTTTGATGGCTCATAAAAAGCTATTTCCAGTATCTTTTACATCTGGTGAAATCCTTGTCAGGTTCTCAGCAGGTTAACAGTCTCTGTATGTTTTGCTCTGGAAAAGTAGCACAGGATT") for example? 
ADD REPLY
1
Entering edit mode

Hi,

The little findTandemRepeats() function I provided above is designed to work on a DNAString object so all you need to do is load the Biostrings package and turn your raw sequence into a DNAString object before calling findTandemRepeats on it:

library(Biostrings)
findTandemRepeats(DNAString("AAAGGCATAGGCATAGGCATAGGCATATTT"),
                  include.period1=TRUE)
#   Views on a 30-letter DNAString subject
# subject: AAAGGCATAGGCATAGGCATAGGCATATTT
# views:
#     start end width
# [1]     3  27    25 [AGGCATAGGCATAGGCATAGGCATA]

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 888 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6