Search
Question: Detect short tandem repetitions from bam file
0
gravatar for Pau Marc Muñoz Torres
2.1 years ago by
Barcelona
Pau Marc Muñoz Torres20 wrote:

Hello

 

 I am wondering if there is some package to detect str in a Bam file. I have found some things for RNA, but I don't know if they also can be used for DNA

 

thanks

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Pau Marc Muñoz Torres20
1
gravatar for Hervé Pagès
2.1 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

I don't know if there is any Bioconductor package for detecting short tandem repeats in a BAM file. Note that you could try to do this on the FASTQ files (containing the read sequences before alignment) instead of the BAM file. FWIW here is a home-made solution that builds on top of the findTandemRepeats() function from the A: Is there any package helps finding Tandem Repeats ? post:

## Return the index of sequences in DNAStringSet object 'x' that
## contain repeat tandems.
detectTandemRepeats <- function(x)
{
    big_subject <- unlist(xscat(x, "-"), use.names=FALSE)
    tr_ranges_on_big_subject <- ranges(
        findTandemRepeats(big_subject, include.period1=TRUE)
    )
    x_ranges_on_big_subject <- successiveIRanges(width(x),
                                                 gapwidth=1L)
    hits <- findOverlaps(tr_ranges_on_big_subject,
                         x_ranges_on_big_subject)
    q_hits <- queryHits(hits)
    s_hits <- subjectHits(hits)
    ## A sanity check.
    stopifnot(
        all(start(tr_ranges_on_big_subject)[q_hits] >=
            start(x_ranges_on_big_subject)[s_hits]),
        all(end(tr_ranges_on_big_subject)[q_hits] <=
            end(x_ranges_on_big_subject)[s_hits]))
    unique(s_hits)
}

Then:

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
file <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
gal <- readGAlignments(file, use.names=TRUE, param=ScanBamParam(what="seq"))
seq <- mcols(gal)$seq
names(seq) <- names(gal)
seq
#  A DNAStringSet instance of length 800484
#          width seq                                          names               
#      [1]    72 TGAGAATGATGATTTCC...TTTTTTATGGCTGCAT ERR127306.26333541
#      [2]    72 CCCATATGTACATCAGG...CAGTTGGATACACACA ERR127306.12926170
#      [3]    72 CCCCAGGTATACACTGG...TCAAGGTGGACACCAG ERR127306.12926170
#      [4]    72 CATAGATGCAAGAATCC...AAAAGATAACTTACCA ERR127306.26974899
#      [5]    72 TAGCACACTGAATTCAA...TTACCCCAAGGATACA ERR127306.26974899
#      ...   ... ...
# [800480]    72 GAAATTGCTGAAACTTG...GAGTACATCACATCAG ERR127306.3567919
# [800481]    72 CAAAGCTGGATGTGTCT...GTGTGGTTTTGCTGCC ERR127306.21510817
# [800482]    72 GTGTGGTTTTGCTGCCC...AGGCACATTAATTTTC ERR127306.21510817
# [800483]    72 AAGGAACCCTTGAACTC...GGTCCTCTGCCTCAAG ERR127306.661203
# [800484]    72 CATGACTTGATGGCTGG...TTGCCATACAGTATTT ERR127306.661203
tandem_repeat_idx <- detectTandemRepeats(seq)

tandem_repeat_idx now contains the index of alignments in gal with repeat tandems. To see the details:

findTandemRepeats(seq[[tandem_repeat_idx[1]]], include.period1=TRUE)
#   Views on a 72-letter DNAString subject
# subject: CTGTCTCAAAAACAAACAAACCAAAATAAT...GATGATGATGACTGCAATGAGCAGGTGCAC
# views:
#     start end width
# [1]    29  53    25 [ATGATGATGATGATGATGATGATGA]

findTandemRepeats(seq[[tandem_repeat_idx[2]]], include.period1=TRUE)
#   Views on a 72-letter DNAString subject
# subject: ATTTTGAAGCCCCAGGAATATGTGTGTGTG...TGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
# views:
#     start end width
# [1]    39  72    34 [TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]

findTandemRepeats(seq[[tandem_repeat_idx[3]]], include.period1=TRUE)
#   Views on a 72-letter DNAString subject
# subject: CACACACACACACACACACACACACACACA...TGTTTCTCCAGGCTTGGTCCCTGGGACCGT
# views:
#     start end width
# [1]     1  31    31 [CACACACACACACACACACACACACACACAC]

## etc...

Cheers,

H.

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by Hervé Pagès ♦♦ 13k

I edited my answer above to make a couple of corrections to detectTandemRepeats().

Also, a different approach would be to find the alignments whose range on the reference overlaps with known tandem repeats. Something like this:

library(BSgenome.Hsapiens.UCSC.hg19.masked)
genome <- BSgenome.Hsapiens.UCSC.hg19.masked

makeGRangesFromGenomeMasks <- function(genome, maskname)
{
    mask_list <- lapply(
        setNames(seqlevels(genome), seqlevels(genome)),
        function(seqlevel) masks(genome[[seqlevel]])[[maskname]])
    as(IRangesList(mask_list), "GRanges")
}

TRF_ranges <- makeGRangesFromGenomeMasks(genome, "TRF")
gal_ranges <- grglist(gal)
hits1 <- findOverlaps(gal_ranges, TRF_ranges)

## Keep only hits where the overlap between the alignment and the
## tandem repeat involves at least 6 nucleotides.
q_hits1 <- queryHits(hits1)
s_hits1 <- subjectHits(hits1)
idx <- which(sum(width(pintersect(gal_ranges[q_hits1],
                                  TRF_ranges[s_hits1]))) >= 6)
hits2 <- hits1[idx]

## Then:
tandem_repeat_idx2 <- unique(queryHits(hits2))

Note that the 2 approaches produce significantly different results. This is because the presence of a tandem repeat in the unaligned read sequence (approach 1) doesn't imply that the aligned read overlaps with a TRF region (approach 2), and vice-versa. For the following reasons:

  • If the overlap between an alignment and a TRF region contains only 1 occurrence of the base motif that makes the TRF region, then findTandemRepeats() (used by detectTandemRepeats()) will probably see no tandem repeat in the read sequence itself.
  • The alignment can contain mismatches and/or indels.
  • The algorithm implemented by findTandemRepeats() finds exact repeats while those found by Tandem Repeats Finder are allowed to contain some fuzziness (i.e. the pattern of a Tandem Repeat is not repeated in an exact way).

A closer inspection would probably reveal other reasons.

H.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Hervé Pagès ♦♦ 13k
0
gravatar for Pau Marc Muñoz Torres
2.1 years ago by
Barcelona
Pau Marc Muñoz Torres20 wrote:

Hi

 

 Thanks for the answer, it will be useful

 

pau

ADD COMMENTlink written 2.1 years ago by Pau Marc Muñoz Torres20
0
gravatar for Pau Marc Muñoz Torres
2.1 years ago by
Barcelona
Pau Marc Muñoz Torres20 wrote:

Hi

 

 Thanks for the answer, it will be useful

 

pau

ADD COMMENTlink written 2.1 years ago by Pau Marc Muñoz Torres20
0
gravatar for Pau Marc Muñoz Torres
2.1 years ago by
Barcelona
Pau Marc Muñoz Torres20 wrote:

Hi

 

 Thanks for the answer, it will be useful

 

pau

ADD COMMENTlink written 2.1 years ago by Pau Marc Muñoz Torres20
0
gravatar for Pau Marc Muñoz Torres
2.1 years ago by
Barcelona
Pau Marc Muñoz Torres20 wrote:

Hi

 

 Thanks for the answer, it will be useful

 

pau

ADD COMMENTlink written 2.1 years ago by Pau Marc Muñoz Torres20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 172 users visited in the last hour