Identify repetitive sequences in 10kbp promoter region in Mouse
1
0
Entering edit mode
mkmb2004 • 0
@mkmb2004-12253
Last seen 7.9 years ago

Dear All,

I have two files, the first one is the mouse repeatMasker file which I downloaded from UCSC and the second one is the 10kbp promoter file for mouse (mm10). I Identified the number of repetitive sequences by repeat Family (repFamily) in 10kbp promoter region in mouse.I used GRanges package in R to create genomic ranges for both mouse repeatMasker file and 10kbp promoter. Then I used findovelap function in GRanges to find overlaping regions and I got them. Now since some of these repeats are truncated in this region, I am looking for help using the previous functions or any other functions in GRanges package to identify the repetitive sequences based on total base pairs for each repeat.

granges short tandem repetitions repeat • 4.5k views
ADD COMMENT
0
Entering edit mode

To clarify, are you looking for the repeats that cross over the boundaries of the promoter regions?

ADD REPLY
0
Entering edit mode

Yes, I am looking of the repeats (i.e repeat family "repFamily") in front of the genes in the 10kbp promoter region. For example, I found two different B4's in upstream of ENSMUST00000038191 gene and I wonder to check if these two repeats are truncated or not (I mean are they represent two, different repeats or one truncated repeat). Here I want to find count the overlaps between those two files based on base pairs not based on number of repeat fragments.

 
ADD REPLY
0
Entering edit mode

Once you have found the overlaps, you could e.g. use pintersect() to find the intersecting parts and find the width() of the intersection. But I'm still not exactly clear on what you want here.

ADD REPLY
0
Entering edit mode

As I mentioned before, I created GRanges for both mouse RepeatMasker file and 10k promoter region. Then I found the overlapping between mouse RepeatMasker file and 10k promoter region as follows:

overlap <- findOverlaps(gr_promoter,gr_repeats), where gr_promoter is the genomic ranges for the promoter region and gr_repeats is the genomic ranges for the mouse RepeatMasker file.

mm10_rep<- MouseRep[as.matrix(overlap)[,2],] where MouseRep is the mouse repeatMasker file that I downloaded from UCSC.

Then I added the gene info from the 10k promoter file:

    mm10_rep$gene = promoter[ as.matrix(overlap)[,1] ,4]                
     mm10_rep$gene = gsub("_.*","",repeats$gene)
Last, I created a new object called temp.1 to find the number repeats infront of each gene      
     ### For repeat families (erpFamily) ###
     temp.1 =mm10_rep[,c(13,19)]

But, the results gave me the repFamily name and how many are they in front of each gene. On the other hand I want to find the exact number of base pairs in each of these repeats to make sure if the repeat found is a complete repeat or truncated repeat.

ADD REPLY
0
Entering edit mode

Maybe you could paste the actual script, because the code above does not seem right.

ADD REPLY
0
Entering edit mode

Here is the full script


#Create a genomic ranges for the mouse repeatMasker file 

gr_repeats = GRanges( seqnames = Rle(MouseRep[,6]),
              ranges =IRanges (start=MouseRep[,7], end=MouseRep[,8]),
              #strand = Rle( MouseRep[,10] )
)
gr_repeats 

# Compute the total base pair covered for mouse repeatMasker 

covered_bp(gr_repeats)                                                            

# Create a genomic ranges for the mouse promoter file 
        
    promoter_10k = read.table(File_Name3,sep="\t",header=FALSE) 
    gr_promoter= GRanges( seqnames = Rle(promoter_10k[,1]),
              ranges =IRanges (start=promoter_10k[,2], end=promoter_10k[,3]),
              #strand = Rle ( promoter_10k[,6]  )
              )
      gr_promoter
       
# Compute coverage of regions 

      total_covered = covered_bp(gr_promoter)
# Find overlapping region between to gr's
      
      overlap <- findOverlaps(gr_promoter,gr_repeats)         

      repeats <- MouseRep[as.matrix(overlap)[,2],]       
      
    ### Add gene info ###                                                 

    repeats$gene = promoter_10k[ as.matrix(overlap)[,1] ,4]                
    repeats$gene = gsub("_.*","",repeats$gene)
     
     #Count the number of  repeat families in overlapping region
     temp.1 =repeats[,c(13,19)]   
     

ADD REPLY
0
Entering edit mode
@michael-lawrence-3846
Last seen 3.1 years ago
United States

To aggregate the overlapping repeat widths by the combination of repFamily and promoter, first find the overlapping ranges:

w <- width(pintersect(gr_repeats[subjectHits(overlap)], gr_promoter[queryHits(overlap)])

Then split by the combination of gene and repeat family and sum it.

repeats_sum <- as.data.frame(xtabs(w ~ gene + repFamily, repeats))

To get the wide form:

reshape(repeats_sum, direction="wide", timevar="repFamily", idvar="gene")

 

ADD COMMENT
0
Entering edit mode

It doesn't work and it gave the error:

Error: could not find function "%pwithin%"

Then I replaced "%pwithin%" with "%within%" . It worked but gave additional error

Error in overlapsAny(query, subject, type = "within") : 
  error in evaluating the argument 'query' in selecting a method for function 'overlapsAny': Error in gr_repeats[subjectHits(hits)] : 
  error in evaluating the argument 'i' in selecting a method for function '[': Error in subjectHits(hits) : 
  error in evaluating the argument 'x' in selecting a method for function 'subjectHits': Error: object 'hits' not found

 

ADD REPLY
0
Entering edit mode

I guess you just have an old version of Bioconductor.

ADD REPLY
0
Entering edit mode

I updated both of R and  Bioconductor, and still give me other error:

 gr_repeats[subjectHits(hits)] %pwithin% gr_promoter[queryHits(overlap)]

Error in to(x, ...) : object 'hits' not found

ADD REPLY
0
Entering edit mode

I updated both of R and  Bioconductor, and still give me other error:

 gr_repeats[subjectHits(hits)] %pwithin% gr_promoter[queryHits(overlap)]

Error in to(x, ...) : object 'hits' not found

ADD REPLY
0
Entering edit mode

Sorry, I edited my answer.

ADD REPLY
0
Entering edit mode

Still doesn't work and give this error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘poverlaps’ for signature ‘"GRanges", "GRanges"’

But when I replaced %pwithin% with %within%, it returns a logic results (TRUE OR FALSE).

The main aim in this part to count the total base pairs for each repetitive sequences in the overlapping region (i.e if there are two Alu elements in front of specific gene in this promoter region, we need to find the coverage (#bps) for each one). that will tell us if this repeat is a complete or divided into to parts due to insertion on it.

ADD REPLY
0
Entering edit mode

I edited my answer. Looks like support for that was never added.

ADD REPLY
0
Entering edit mode

When I replaced %pwithin% with %within%, it returns a logic results (TRUE OR FALSE).

The main aim in this part to count the total base pairs for each repetitive sequences in the overlapping region (i.e if there are two Alu elements in front of specific gene in this promoter region, we need to find the coverage (#bps) for each one). that will tell us if this repeat is a complete or divided into to parts due to insertion on it.

ADD REPLY
0
Entering edit mode

How would counting bp tell you whether the repeat was broken by an insertion? It sounds like you're computing overlaps using the UCSC alignments of repeat consensus sequences. You'd have to consider the repStart, repEnd and repLeft columns from the RepeatMasker track to really know the structure. It would be tricky.

ADD REPLY
0
Entering edit mode

Is there any way to do that using GRanges package to find the coverage of each repeat that is the overlapping file since the overlapping file overlaps the repeats in found in both repeatMasker file and promoter region.

Simply, I want to quantify how much of promoters are covered by certain repeats in terms of base pairs

ADD REPLY
0
Entering edit mode

Please see the edited answer. I'm still not really sure what you want. What you want may be simple, but your wording is not precise enough.

ADD REPLY
0
Entering edit mode

let make it simple. I want to count the total number of base pairs for each repeat in promoter file. For example, if I found two Alu elements in front of specific gene how can I add the base pairs of those two which will show us in general much much of promoter are covered by certain repeat in terms of base pairs.  

ADD REPLY
0
Entering edit mode

I edited my answer with my latest guess at what you want to do.
 

ADD REPLY
0
Entering edit mode

I added the repFamily as meta column on gr_repeats as follows:

gr_repeats = GRanges( seqnames = Rle(Mouse_repeats[,6]),
              ranges =IRanges (start=Mouse_repeats[,7], end=Mouse_repeats[,8]),repFamily=Rle(Mouse_repeats[,13]),
              #strand = Rle( Mouse_repeats[,10] )
)

and I checked the results as shown:

GRanges object with 5147736 ranges and 1 metadata column:
                        seqnames             ranges strand |     repFamily
                           <Rle>          <IRanges>  <Rle> |         <Rle>
        [1]                 chr1 [3000000, 3002128]      * |            L1
        [2]                 chr1 [3003152, 3003994]      * |            L1
        [3]                 chr1 [3003993, 3004054]      * |            L1
        [4]                 chr1 [3004040, 3004206]      * |            L1
        [5]                 chr1 [3004206, 3004270]      * | Simple_repeat
        ...                  ...                ...    ... .           ...
  [5147732] chrY_JH584303_random   [152556, 155890]      * |            L1
  [5147733] chrY_JH584303_random   [155890, 156883]      * |            L1
  [5147734] chrY_JH584303_random   [157069, 157145]      * |     ERVL-MaLR
  [5147735] chrY_JH584303_random   [157908, 157960]      * |     ERVL-MaLR
  [5147736] chrY_JH584303_random   [157952, 158099]      * |     ERVL-MaLR
  -------
  seqinfo: 66 sequences from an unspecified genome; no seqlengths

Then I found the ovelapping between gr_repeats and gr_promoter.

After that I run your modified code lines. The first line worked perfectly, but I got error for the second line as shown below:

> sum(split(w, ~ queryHits(overlap) + gr_repeats[subjectHits(overlap)]$repFamily))
Error in unique.default(x, nmax = nmax) : 
  unique() applies only to vectors

 

 

ADD REPLY
0
Entering edit mode

See edited answer.
 

ADD REPLY
0
Entering edit mode

I run the modified line and I got the error:

> sum(splitAsList(w, ~ queryHits(overlap) + gr_repeats[subjectHits(overlap)]$repFamily))
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘as.env’ for signature ‘"integer"’

 

ADD REPLY
0
Entering edit mode

I made another attempt. If I had some sample data I could test this and get it right once and for all.

ADD REPLY
0
Entering edit mode

Here is a link for both datasets (repeatMasker file and sample of promoter file).

https://www.dropbox.com/sh/15l65676wl8wr8hvc/AABP3KBNSBZNfYUUTYQU3Xa?dl=0

ADD REPLY
0
Entering edit mode

So it didn't work for you?

ADD REPLY
0
Entering edit mode

Let me explain exactly what I need:

1- I created a genomic ranges for the repeatsMasker file (gr_repeats). 

2- I created a genomic ranges for the promoter file (gr_promoter).

3- I found the overlapping region between them (overlap)

overlap <- findOverlaps(gr_promoter,gr_repeats)

4- I run the following code lines to merege both files and add gene information to identify the repeats in front of each gene.

repeats <- Mouse_repeats[as.matrix(overlap)[,2],]    

# I added ID column to the repeat files to avoid on repeat matches many genes.

repeats$repID <- as.matrix(overlap)[,2]

# Add gene info from original promoter file                                             repeats$gene = promoter[ as.matrix(overlap)[,1] ,4]           

    repeats$gene = gsub("_.*","",repeats$gene)
    # Find repeat families in front of each gene ###
  
     temp.1=repeats[,c(13,19)]   

I got the results, which shows the how many repeats in front of each gene. This is a sample of results:

  repFamily gene
247211 Simple_repeat ENSMUST00000086465
247212 MIR ENSMUST00000086465
247213 Simple_repeat ENSMUST00000086465
247214 MIR ENSMUST00000086465
247215 Alu ENSMUST00000086465
247216 B4 ENSMUST00000086465
247217 Simple_repeat ENSMUST00000086465
247218 ID ENSMUST00000086465
247219 B4 ENSMUST00000086465

From the results, you can see the gene  ENSMUST00000086465 for example, has two B4 in front of it as a count. Since each of them has different numbers of base pairs, I want to find the total base pairs (add bps) of these two repeats as one B4 repeat in front of that gene.

ADD REPLY
0
Entering edit mode

Ok, I edited my answer.

ADD REPLY
0
Entering edit mode

 

For repeat families (Here I want to create a file called temp.1 to show the following columns on it:

repFamily name, total base pairs corresponding to each repeat, and gene name). Then I want to sum up total base pairs for each certin repeats in front of the gene. The results below show just repeat name and corresponding gene. As you can see gene ENSMUST00000086465 has 3 different simple_repeat, and two different B4 in its promoter region.

 

ADD REPLY
0
Entering edit mode

    

What I want to get as final results as a table like this:

Gene_name                                     

Simple_repeat

Alu

B4

ID

…………………

MIR

ENSMUST00000086465

#BASE PAIRS

#BP

#BP

#BP

#BP

#BP

ENSMUST00000038191

 

 

 

 

 

 

 

 

 

 

 

 

 

 

ADD REPLY
0
Entering edit mode

I think my edited answer gets you close to these tables.

ADD REPLY
0
Entering edit mode

Thaks a lot.

ADD REPLY

Login before adding your answer.

Traffic: 609 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6