Detect short tandem repetitions from bam file
5
0
Entering edit mode
@pau-marc-munoz-torres-6731
Last seen 7.3 years ago
Barcelona

Hello

 

 I am wondering if there is some package to detect str in a Bam file. I have found some things for RNA, but I don't know if they also can be used for DNA

 

thanks

bam short tandem repetitions package • 2.5k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 14 hours ago
Seattle, WA, United States

Hi,

I don't know if there is any Bioconductor package for detecting short tandem repeats in a BAM file. Note that you could try to do this on the FASTQ files (containing the read sequences before alignment) instead of the BAM file. FWIW here is a home-made solution that builds on top of the findTandemRepeats() function from the A: Is there any package helps finding Tandem Repeats ? post:

## Return the index of sequences in DNAStringSet object 'x' that
## contain repeat tandems.
detectTandemRepeats <- function(x)
{
    big_subject <- unlist(xscat(x, "-"), use.names=FALSE)
    tr_ranges_on_big_subject <- ranges(
        findTandemRepeats(big_subject, include.period1=TRUE)
    )
    x_ranges_on_big_subject <- successiveIRanges(width(x),
                                                 gapwidth=1L)
    hits <- findOverlaps(tr_ranges_on_big_subject,
                         x_ranges_on_big_subject)
    q_hits <- queryHits(hits)
    s_hits <- subjectHits(hits)
    ## A sanity check.
    stopifnot(
        all(start(tr_ranges_on_big_subject)[q_hits] >=
            start(x_ranges_on_big_subject)[s_hits]),
        all(end(tr_ranges_on_big_subject)[q_hits] <=
            end(x_ranges_on_big_subject)[s_hits]))
    unique(s_hits)
}

Then:

library(GenomicAlignments)
library(RNAseqData.HNRNPC.bam.chr14)
file <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
gal <- readGAlignments(file, use.names=TRUE, param=ScanBamParam(what="seq"))
seq <- mcols(gal)$seq
names(seq) <- names(gal)
seq
#  A DNAStringSet instance of length 800484
#          width seq                                          names               
#      [1]    72 TGAGAATGATGATTTCC...TTTTTTATGGCTGCAT ERR127306.26333541
#      [2]    72 CCCATATGTACATCAGG...CAGTTGGATACACACA ERR127306.12926170
#      [3]    72 CCCCAGGTATACACTGG...TCAAGGTGGACACCAG ERR127306.12926170
#      [4]    72 CATAGATGCAAGAATCC...AAAAGATAACTTACCA ERR127306.26974899
#      [5]    72 TAGCACACTGAATTCAA...TTACCCCAAGGATACA ERR127306.26974899
#      ...   ... ...
# [800480]    72 GAAATTGCTGAAACTTG...GAGTACATCACATCAG ERR127306.3567919
# [800481]    72 CAAAGCTGGATGTGTCT...GTGTGGTTTTGCTGCC ERR127306.21510817
# [800482]    72 GTGTGGTTTTGCTGCCC...AGGCACATTAATTTTC ERR127306.21510817
# [800483]    72 AAGGAACCCTTGAACTC...GGTCCTCTGCCTCAAG ERR127306.661203
# [800484]    72 CATGACTTGATGGCTGG...TTGCCATACAGTATTT ERR127306.661203
tandem_repeat_idx <- detectTandemRepeats(seq)

tandem_repeat_idx now contains the index of alignments in gal with repeat tandems. To see the details:

findTandemRepeats(seq[[tandem_repeat_idx[1]]], include.period1=TRUE)
#   Views on a 72-letter DNAString subject
# subject: CTGTCTCAAAAACAAACAAACCAAAATAAT...GATGATGATGACTGCAATGAGCAGGTGCAC
# views:
#     start end width
# [1]    29  53    25 [ATGATGATGATGATGATGATGATGA]

findTandemRepeats(seq[[tandem_repeat_idx[2]]], include.period1=TRUE)
#   Views on a 72-letter DNAString subject
# subject: ATTTTGAAGCCCCAGGAATATGTGTGTGTG...TGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
# views:
#     start end width
# [1]    39  72    34 [TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]

findTandemRepeats(seq[[tandem_repeat_idx[3]]], include.period1=TRUE)
#   Views on a 72-letter DNAString subject
# subject: CACACACACACACACACACACACACACACA...TGTTTCTCCAGGCTTGGTCCCTGGGACCGT
# views:
#     start end width
# [1]     1  31    31 [CACACACACACACACACACACACACACACAC]

## etc...

Cheers,

H.

ADD COMMENT
0
Entering edit mode

I edited my answer above to make a couple of corrections to detectTandemRepeats().

Also, a different approach would be to find the alignments whose range on the reference overlaps with known tandem repeats. Something like this:

library(BSgenome.Hsapiens.UCSC.hg19.masked)
genome <- BSgenome.Hsapiens.UCSC.hg19.masked

makeGRangesFromGenomeMasks <- function(genome, maskname)
{
    mask_list <- lapply(
        setNames(seqlevels(genome), seqlevels(genome)),
        function(seqlevel) masks(genome[[seqlevel]])[[maskname]])
    as(IRangesList(mask_list), "GRanges")
}

TRF_ranges <- makeGRangesFromGenomeMasks(genome, "TRF")
gal_ranges <- grglist(gal)
hits1 <- findOverlaps(gal_ranges, TRF_ranges)

## Keep only hits where the overlap between the alignment and the
## tandem repeat involves at least 6 nucleotides.
q_hits1 <- queryHits(hits1)
s_hits1 <- subjectHits(hits1)
idx <- which(sum(width(pintersect(gal_ranges[q_hits1],
                                  TRF_ranges[s_hits1]))) >= 6)
hits2 <- hits1[idx]

## Then:
tandem_repeat_idx2 <- unique(queryHits(hits2))

Note that the 2 approaches produce significantly different results. This is because the presence of a tandem repeat in the unaligned read sequence (approach 1) doesn't imply that the aligned read overlaps with a TRF region (approach 2), and vice-versa. For the following reasons:

  • If the overlap between an alignment and a TRF region contains only 1 occurrence of the base motif that makes the TRF region, then findTandemRepeats() (used by detectTandemRepeats()) will probably see no tandem repeat in the read sequence itself.
  • The alignment can contain mismatches and/or indels.
  • The algorithm implemented by findTandemRepeats() finds exact repeats while those found by Tandem Repeats Finder are allowed to contain some fuzziness (i.e. the pattern of a Tandem Repeat is not repeated in an exact way).

A closer inspection would probably reveal other reasons.

H.

ADD REPLY
0
Entering edit mode
@pau-marc-munoz-torres-6731
Last seen 7.3 years ago
Barcelona

Hi

 

 Thanks for the answer, it will be useful

 

pau

ADD COMMENT
0
Entering edit mode
@pau-marc-munoz-torres-6731
Last seen 7.3 years ago
Barcelona

Hi

 

 Thanks for the answer, it will be useful

 

pau

ADD COMMENT
0
Entering edit mode
@pau-marc-munoz-torres-6731
Last seen 7.3 years ago
Barcelona

Hi

 

 Thanks for the answer, it will be useful

 

pau

ADD COMMENT
0
Entering edit mode
@pau-marc-munoz-torres-6731
Last seen 7.3 years ago
Barcelona

Hi

 

 Thanks for the answer, it will be useful

 

pau

ADD COMMENT

Login before adding your answer.

Traffic: 950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6