Dear All,
I have two files: the mouse RepeatMasker file, which I downloaded from UCSC, and the 10 kbp promoter file for mouse (mm10). I identified the number of repetitive sequences by repeat family (repFamily) in the 10 kbp promoter regions. I used the GenomicRanges package in R to create GRanges objects for both the RepeatMasker file and the 10 kbp promoters, then used the findOverlaps function to find the overlapping regions. Since some of these repeats are truncated in these regions, I am looking for help using the previous functions, or any other functions in the GenomicRanges package, to quantify the repetitive sequences by the total number of base pairs for each repeat.
To clarify, are you looking for the repeats that cross over the boundaries of the promoter regions?
Yes, I am looking for the repeats (i.e. by repeat family, "repFamily") upstream of the genes in the 10 kbp promoter regions. For example, I found two different B4s upstream of the ENSMUST00000038191 gene, and I would like to check whether these two repeats are truncated or not (I mean, do they represent two different repeats or one truncated repeat?). I want to count the overlaps between those two files based on base pairs, not on the number of repeat fragments.
Once you have found the overlaps, you could e.g. use pintersect() to find the intersecting parts and find the width() of the intersection. But I'm still not exactly clear on what you want here.
As I mentioned before, I created GRanges for both the mouse RepeatMasker file and the 10k promoter regions. Then I found the overlaps between the mouse RepeatMasker file and the 10k promoter regions as follows:
overlap <- findOverlaps(gr_promoter,gr_repeats), where gr_promoter is the genomic ranges for the promoter region and gr_repeats is the genomic ranges for the mouse RepeatMasker file.
mm10_rep<- MouseRep[as.matrix(overlap)[,2],] where MouseRep is the mouse repeatMasker file that I downloaded from UCSC.
Then I added the gene info from the 10k promoter file:
mm10_rep$gene = promoter[ as.matrix(overlap)[,1] ,4]
mm10_rep$gene = gsub("_.*","",repeats$gene)
Last, I created a new object called temp.1 to find the number of repeats in front of each gene
### For repeat families (repFamily) ###
temp.1 =mm10_rep[,c(13,19)]
But the result only gives me the repFamily name and how many of them there are in front of each gene. What I want instead is the exact number of base pairs in each of these repeats, to check whether each repeat found is complete or truncated.
Maybe you could paste the actual script, because the code above does not seem right.
Here is the full script
# Create genomic ranges for the mouse RepeatMasker file
gr_repeats = GRanges( seqnames = Rle(MouseRep[,6]),
ranges = IRanges(start=MouseRep[,7], end=MouseRep[,8])
#strand = Rle( MouseRep[,10] )
)
gr_repeats
# Compute the total base pair covered for mouse repeatMasker
covered_bp(gr_repeats)
# Create genomic ranges for the mouse promoter file
promoter_10k = read.table(File_Name3,sep="\t",header=FALSE)
gr_promoter = GRanges( seqnames = Rle(promoter_10k[,1]),
ranges = IRanges(start=promoter_10k[,2], end=promoter_10k[,3])
#strand = Rle ( promoter_10k[,6] )
)
gr_promoter
# Compute coverage of regions
total_covered = covered_bp(gr_promoter)
# Find overlapping regions between the two GRanges objects
overlap <- findOverlaps(gr_promoter,gr_repeats)
repeats <- MouseRep[as.matrix(overlap)[,2],]
### Add gene info ###
repeats$gene = promoter_10k[ as.matrix(overlap)[,1] ,4]
repeats$gene = gsub("_.*","",repeats$gene)
#Count the number of repeat families in overlapping region
temp.1 =repeats[,c(13,19)]
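Following the pintersect() and width() suggestion above, here is a minimal sketch of how the overlaps could be counted in base pairs rather than in repeat fragments. The coordinates, gene names and repeat families below are made-up toy data standing in for gr_promoter and gr_repeats, and the metadata columns gene and repFamily are assumptions about how the real objects would be annotated. Note that "truncated" here only means clipped by the promoter boundary; whether a repeat copy is complete relative to its consensus is a separate question, which the repStart, repEnd and repLeft columns of the UCSC rmsk table may help with.

```r
library(GenomicRanges)

# Toy stand-ins for gr_promoter and gr_repeats (hypothetical coordinates)
gr_promoter <- GRanges("chr1",
                       IRanges(start = c(1001, 50001), end = c(11000, 60000)),
                       gene = c("geneA", "geneB"))
gr_repeats <- GRanges("chr1",
                      IRanges(start = c(901, 2001, 10901, 55001),
                              end   = c(1100, 2150, 11200, 55300)),
                      repFamily = c("B4", "B4", "Alu", "L1"))

hits <- findOverlaps(gr_promoter, gr_repeats)
q <- queryHits(hits)    # index into gr_promoter
s <- subjectHits(hits)  # index into gr_repeats

# Clip every overlapping repeat to the promoter it overlaps
clipped <- pintersect(gr_repeats[s], gr_promoter[q])

res <- data.frame(
  gene       = gr_promoter$gene[q],
  repFamily  = gr_repeats$repFamily[s],
  repeat_bp  = width(gr_repeats)[s],  # full length of the annotated repeat
  overlap_bp = width(clipped)         # part that lies inside the promoter
)

# A repeat is clipped by the promoter boundary if only part of it overlaps
res$truncated <- res$overlap_bp < res$repeat_bp

# Total repeat base pairs per gene and repeat family
bp_by_family <- aggregate(overlap_bp ~ gene + repFamily, data = res, FUN = sum)
```

With the real data, gr_promoter and gr_repeats from the script would replace the toy objects, provided the gene name and repFamily are attached to them as metadata columns (or looked up in promoter_10k and MouseRep with the same q and s indices).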