Question

Is there any package helps finding Tandem Repeats ?

1

Entering edit mode

a.afshinfard ▴ 10

@aafshinfard-7617

Last seen 8.9 years ago

Iran, Islamic Republic Of

Hi everyone,

i have a DNA reference and i want a package that helps me with finding tandem repeats in that sequence,

i specially just want the regions of tandem repeats and using them inside R for further computation with bioconductor

thanks everyone

r bioconductor Repeat tandem repeats • 3.0k views

ADD COMMENT • link updated 8.8 years ago by Hervé Pagès 16k • written 8.9 years ago by a.afshinfard ▴ 10

score 1 · Answer 1 · 2015-06-23

Hi,

I'm not aware of any BioC package dedicated to finding Tandem Repeats.

However Bioconductor provides masked genomes (e.g. the BSgenome.Hsapiens.UCSC.hg19.masked package). In a masked genome the chromosome sequences are the same as in a non-masked genome (e.g. the BSgenome.Hsapiens.UCSC.hg19 package) except that masks have been added on top of them to mask different kinds of regions. For example in BSgenome.Hsapiens.UCSC.hg19.masked each sequence has 4 masks and the 4th one is the "TRF mask" which masks the Tandem Repeats of period <= 12:

library(BSgenome.Hsapiens.UCSC.hg19.masked)
genome <- BSgenome.Hsapiens.UCSC.hg19.masked

## Display Tandem Repeats on chr17.
Views(unmasked(genome$chr17), masks(genome$chr17)$TRF)
#   Views on a 81195210-letter DNAString subject
# subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTG...GGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT
# views:
#            start      end width
#     [1]    30966    30996    31 [GGGCTGGGACCCGGGCTGGGGCCCGGGCTGG]
#     [2]    37773    37798    26 [TTTTTTTTTTTTTTTTTTTTTTTTTT]
#     [3]    40174    40389   216 [AAAGAAGGAAAGAAAGAAGGA...AGAAAGAAAGAAGAAAGAAA]
#     [4]    68984    69008    25 [TTTTTTTTTTTTTTTTTTTTTTTTT]
#     [5]    82621    82678    58 [GTGTGTGTGTGTGTGTGTGTG...GTGTGTGTGTGTGTGTGTGT]
#     ...      ...      ...   ... ...
# [11977] 81162540 81162582    43 [AAAAAAAAGAAAGAAAGAAAGAAAAGAAAAGAAAAAAAAGAAA]
# [11978] 81166318 81166372    55 [TGAGTGTGTATGACTGTGTGT...TTTGTGTGTGTGAGTGTGTA]
# [11979] 81173535 81173562    28 [TGATTACCAGGCTGATTACCAGGCTGAT]
# [11980] 81185759 81185799    41 [ACACACACACACACACACACACACACACACACACACACACA]
# [11981] 81194905 81195206   302 [GTGAGGGTGAGGGTGAGGGTG...GTGGTGTGTGGGTGTGGGTG]

See ?BSgenome.Hsapiens.UCSC.hg19.masked and http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ for more information

You can also implement your own little Tandem Repeats finder function:

## Find all *exact* tandem repeats with a period equal to or a divisor of
## 'period.multiple'.
.findTandemRepeats0 <- function(subject, period.multiple=2,
                                         include.period1=FALSE, min.length=24)
{
    if (!isSingleNumber(period.multiple))
        stop("'period.multiple' must be a single integer")
    if (!is.integer(period.multiple))
        period.multiple <- as.integer(period.multiple)
    if (period.multiple < 2L)
        stop("'period.multiple' must be >= 2")
    if (!isTRUEorFALSE(include.period1))
        stop("'include.period1' must be TRUE or FALSE")
    if (!isSingleNumber(min.length))
        stop("'min.length' must be a single integer")
    if (!is.integer(min.length))
        min.length <- as.integer(min.length)
    if (min.length < 12L)
        stop("'min.length' must be >= 12")
    s1 <- subseq(subject, start=1L+period.multiple)
    s2 <- subseq(subject, end=-1L-period.multiple)
    ir <- as(as.raw(s1) == as.raw(s2), "NormalIRanges")
    ir <- ir[width(ir) >= period.multiple]
    end(ir) <- end(ir) + period.multiple
    ir <- ir[width(ir) >= min.length]
    repeats <- Views(subject, ir)
    
    af <- alphabetFrequency(repeats, baseOnly=TRUE)
    ok <- af[ , "other"] == 0L  # has no IUPAC ambiguities
    if (!include.period1)
        ok <- ok & rowSums(af[ , DNA_BASES] != 0L) >= 2L
    repeats[which(ok)]
}

## Find all *exact* tandem repeats of period <= 12.
findTandemRepeats <- function(subject, include.period1=FALSE)
{
    trs_list <- lapply(7:12,
        function(period.multiple)
            ranges(.findTandemRepeats0(subject,
                                       period.multiple,
                                       include.period1))
    )
    trs <- sort(unique(do.call("c", trs_list)))
    Views(subject, trs)
}

Then:

chr17_trs <- findTandemRepeats(unmasked(genome$chr17), include.period1=TRUE)
chr17_trs
#   Views on a 81195210-letter DNAString subject
# subject: AAGCTTCTCACCCTGTTCCTGCATAGATAATTG...GGTGTGGGTGTGGTGTGTGGGTGTGGGTGTGGT
# views:
#            start      end width
#     [1]    37773    37798    26 [TTTTTTTTTTTTTTTTTTTTTTTTTT]
#     [2]    40289    40328    40 [AGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA]
#     [3]    68984    69008    25 [TTTTTTTTTTTTTTTTTTTTTTTTT]
#     [4]    82621    82678    58 [GTGTGTGTGTGTGTGTGTGTG...GTGTGTGTGTGTGTGTGTGT]
#     [5]    90847    90891    45 [AATAATAATAATAATAATAAT...ATAATAATAATAATAATAAT]
#     ...      ...      ...   ... ...
# [11413] 81194994 81195034    41 [TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGG]
# [11414] 81195037 81195083    47 [AGGGTGAGGGTGAGGGTGAGG...GTGAGGGTGAGGGTGAGGGT]
# [11415] 81195087 81195127    41 [GGGTTGGGGTTGGGGTTGGGGTTGGGGTTGGGGTTGGGGTT]
# [11416] 81195123 81195162    40 [GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT]
# [11417] 81195159 81195190    32 [GGGTGTGGGTGTGGGTGTGGGTGTGGGTGTGG]

Even though the results are very close to the Tandem Repeats Finder locations provided by UCSC, there are some small differences. I didn't take a close look but it seems that one of the reasons for these differences is that the Tandem Repeats from UCSC are allowed to contain some fuzziness (i.e. the pattern of a Tandem Repeat is not repeated in an exact way), whereas findTandemRepeats() finds exact Tandem Repeats.

Cheers,

H.