How to use bootRanges to bootstrap small RNA loci (nullranges package)
Poonam • 0
@e9336529
Last seen 25 days ago
India

Dear authors,

I came across your nullranges package and have a few questions.

I have small RNA loci data and I want to test its association with another genomic feature. I want to generate bootstrapped data for my small RNA loci. One way is to subsample, and another is to use bootRanges.

  1. gr1 is my GRanges object representing my small RNA loci. I want to bootstrap only these regions because other regions are not small RNA loci. Therefore, I added the background genome coordinates to the exclude argument (gr_toexclude). Is that right?

  2. How do I select the block length? If you like, I can share my loci data so that you can help me decide on the block length.

library(nullranges)
p1 = bootRanges(gr1, 
  blockLength=100, R=100,  
  exclude=gr_toexclude, 
  type="bootstrap",  
  withinChrom=TRUE)
bootRanges nullranges • 579 views
ADD COMMENT
@mikelove
Last seen 4 hours ago
United States

Thread #1

Thanks for posting here.

So you want to move the small RNA loci around the genome? Do they tend to cluster?

For mouse or human genome we usually use a block length around 200 to 500 kb, and we usually use a simple genome segmentation based on gene density with excluded regions from the excluderanges package.

ADD COMMENT

Yes, they exist in clusters. How should it be done for clustered data?

I tried block lengths of 100 bp, 5,000 bp, 10,000 bp, 100,000 bp, and 1,000,000 bp. When I increase the block length beyond 10,000 bp, the bootRanges output comes back with zero width. Can you please explain what the block length does, and what it should be for clustered data?

ADD REPLY

Let's discuss a few of the points of bootRanges:

  • blockLength - you want this to be larger than the typical clustering pattern but smaller than the region you are tiling with blocks. We used a large segmentation of hg38, e.g. >1Mb contiguous segments, and found that 200-500 kb blocks worked well.
  • seg - you can choose which blocks go where, or what parts of the genome to bootstrap. We typically bootstrap the entire genome, but you can bootstrap smaller segments only. I am still not sure where you want to place your bootstrapped features. Can you explain more about this choice for your data?
  • exclude - you can additionally specify places that features should not be placed, in addition to how you specify the segmentation.

Maybe you can give me an example of a locus and also the genomic context, and where you want the bootstrapped features to be placed. There is a lot of flexibility in this package, as there are many different use cases (e.g. whether the features live in genome space or transcript space, etc.).

ADD REPLY

Thank you for your response. I will share the data by email, along with the bootRanges output.

ADD REPLY

If you want a default analysis, you can use the example in the quick start:

https://nullranges.github.io/nullranges/articles/bootRanges.html#quick-start

Here we use a basic segmentation for hg38 and some recommended excluded regions, with block length of 500kb.

We typically bootstrap across chromosomes (meaning a feature from chromosome A can be placed on chromosome B in the bootstrapping). Is there a reason why you don't want to do that?

ADD REPLY

I thought of keeping it the way it is; otherwise there is no specific reason for within-chromosome bootstrapping.

ADD REPLY

I'd recommend keeping that argument at its default:

withinChrom = FALSE
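
For reference, here is a minimal sketch of such a call on toy data. The ranges, seqlengths, and block length are made up for illustration; a real analysis would use the genome's actual seqlengths and a block length in the range discussed above. The bootRanges call is guarded so the snippet still runs where nullranges is not installed.

```r
library(GenomicRanges)

# Toy features on a made-up two-chromosome genome (illustrative values only).
y <- GRanges(
  seqnames = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  ranges = IRanges(start = c(1000, 1500, 40000, 2000, 70000), width = 200),
  seqlengths = c(chr1 = 100000, chr2 = 100000)
)

# bootRanges() needs seqlengths on y so that it knows the extent of each chromosome.
stopifnot(!anyNA(seqlengths(y)))

if (requireNamespace("nullranges", quietly = TRUE)) {
  set.seed(1)
  # withinChrom = FALSE (the default) lets blocks move between chromosomes.
  boots <- nullranges::bootRanges(y, blockLength = 10000, R = 5,
                                  withinChrom = FALSE)
  # Number of bootstrapped ranges in each iteration.
  print(table(as.vector(boots$iter)))
}
```

The per-iteration counts printed at the end are a quick way to see whether each bootstrap sample has roughly as many ranges as the input.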

ADD REPLY

Okay. I will do that.

ADD REPLY

Hello Dr. Michael,

I did the analysis with withinChrom=FALSE.

How do you calculate the inter-range (inter-feature) distance, and how do you handle the different chromosomes? The bootRanges output does not look like my original data. In the original data the loci were well separated and did not overlap. In the bootRanges output, a single locus appears three times in different iterations. If I pool all iterations to calculate the inter-range distance, it is smaller than in the original. How should I calculate it?

ADD REPLY

We have a part of the bootRanges vignette where we assess the distance between subsequent ranges. Importantly, this is done per iteration, not across all iterations. It's probably sufficient to look at just one iteration; here we look at three:

https://nullranges.github.io/nullranges/articles/bootRanges.html#assessing-quality-of-bootstrap-samples

The bootstrapped ranges are supposed to look roughly similar in terms of this distribution to the original data, if everything has been set up correctly.
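
To illustrate the per-iteration point, here is a rough sketch that does not follow the vignette's own code. `inter_range_dist` is a hypothetical helper, and `boots` is a hand-made stand-in for bootRanges output; it only assumes the output carries an `iter` metadata column.

```r
library(GenomicRanges)

# Hypothetical helper: gap (bp) between each range and the next one
# on the same chromosome. Overlapping neighbours give negative values.
inter_range_dist <- function(gr) {
  gr <- sort(gr, ignore.strand = TRUE)
  by_chr <- split(gr, seqnames(gr))
  unlist(lapply(by_chr, function(g) {
    if (length(g) < 2) return(integer(0))
    start(g)[-1] - end(g)[-length(g)] - 1L
  }), use.names = FALSE)
}

# Hand-made stand-in for bootstrapped ranges with an `iter` column.
boots <- GRanges(
  seqnames = c("chr1", "chr1", "chr2", "chr1", "chr1", "chr1"),
  ranges = IRanges(start = c(100, 300, 50, 100, 1100, 2100), width = 100),
  iter = c(1, 1, 1, 2, 2, 2)
)

# Correct: compute distances separately within each iteration.
dist_by_iter <- lapply(split(boots, as.vector(boots$iter)), inter_range_dist)

# Misleading: pooling iterations stacks independent samples on top of
# each other, which artificially shrinks the distances.
dist_pooled <- inter_range_dist(boots)
```

In this toy example the pooled distances include a negative value, because ranges from different iterations land on the same coordinates.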

ADD REPLY

Thank you for your help Dr. Michael.

One more question: my original data contains 7000 loci on different chromosomes. When I generate the bootRanges output, one iteration has only 10-13 loci on different chromosomes. Is that okay?

ADD REPLY

The number should be roughly 7000 in each iteration.

Can you show table(seqnames(gr)) for the original and for one iteration of the bootstrap?

ADD REPLY
> table(seqnames(gr1))  #original data

 chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 
  515   350   574   312   377   268   300   255   323   231 
chr19  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX 
  228   538   405   521   483   396   518   401   382   315 
 chrY 
    9 

> table(seqnames(p6))  #this is for all iterations (1000)

 chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 
  808   325   616   253   345   418   259   414   395   191 
chr19  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX 
  236  1269   443  1196   878   461   753   409   535   362 
 chrY 
   50

Sorry, I was not able to get the seqnames counts per iteration. The following image represents the first iteration.

ADD REPLY

You can do R=1 if you want to just look at a single iteration.

My guess is that you are losing >99% of the bootstrapped data to excluded regions. We thought of the excluded regions as e.g. telomere, centromere, just a small number of places where it doesn't make sense to put blocks.

What do you have for sum(width(excluded))?

It may help me to understand: how are you defining the excluded regions?
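
A quick way to put that number in context is to compare it with the total genome length. The sketch below uses made-up loci on a made-up 100 kb chromosome and builds the exclusion set as "everything that is not a locus", purely to show how large the excluded fraction becomes in that case.

```r
library(GenomicRanges)

# Hypothetical loci on a made-up 100 kb chromosome.
loci <- GRanges("chr1",
                IRanges(start = c(2000, 5000), end = c(4000, 6000)),
                seqlengths = c(chr1 = 100000))

# Everything that is not a locus. gaps() also returns whole-chromosome
# gaps for the "+" and "-" strands, so keep only the unstranded ones.
non_loci <- gaps(loci)
non_loci <- non_loci[strand(non_loci) == "*"]

# Fraction of the genome covered by the exclusion set.
excluded_fraction <- sum(as.numeric(width(non_loci))) /
  sum(as.numeric(seqlengths(loci)))
print(excluded_fraction)
```

A value close to 1 means almost every bootstrapped block lands in an excluded region and is dropped.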

ADD REPLY

> sum(width(gr_toexclude))
[1] 2713499861

I excluded everything that was not a locus. If one small RNA locus spans 2000 to 4000 and the next spans 5000 to 6000, I excluded 1 to 2000 and 4000 to 5000, and I defined the chromosome length as the last GRanges coordinate for that chromosome.

ADD REPLY
@mikelove
Last seen 4 hours ago
United States

Thread #2

Starting a new thread. If you want to sample from a small region of the genome and only place features in that same small region of the genome, you should use seg instead of exclude. This gives a segmentation of the genome that instructs where bootRanges should allow sampling from and to. So then if you imagine a genome with states 1 and 2 (there can be more, but it's enough to just have two states):

[---1---][---2---][---1---]

ranges will be sampled from these states above
and only placed into matching states below

[---1---][---2---][---1---]

A feature is allowed to be placed from the left 1 state into the right 1 state, but the state number has to match.

A segmentation is just a GRanges object that covers the genome and has a metadata column state that instructs where to place ranges.

If you want to control where the features are placed, seg is the argument to use. exclude is just a filter at the end to remove any features that overlap a region, but you want to be in control of where they are placed, not just filter at the end.
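
As a rough illustration of the diagram above, a segmentation for a single made-up chromosome could be built like this. The coordinates, states, and block length are invented, and the bootRanges call is guarded so the snippet still runs where nullranges is not installed.

```r
library(GenomicRanges)

chr_len <- c(chr1 = 600000)  # made-up chromosome length

# Segmentation: tiles the whole chromosome, one `state` label per segment,
# matching the [---1---][---2---][---1---] picture.
seg <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(1, 200001, 400001),
                   end = c(200000, 400000, 600000)),
  seqlengths = chr_len,
  state = c(1, 2, 1)
)

# Sanity checks: segments do not overlap and together cover the chromosome.
stopifnot(isDisjoint(seg), sum(width(seg)) == sum(seqlengths(seg)))

# Toy features to bootstrap.
y <- GRanges("chr1",
             IRanges(start = c(10000, 12000, 250000, 450000), width = 500),
             seqlengths = chr_len)

if (requireNamespace("nullranges", quietly = TRUE)) {
  set.seed(1)
  # Blocks sampled from a state are only placed into segments of that state.
  boots <- nullranges::bootRanges(y, blockLength = 50000, R = 2, seg = seg)
}
```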

ADD COMMENT

Thank you Dr. Michael for all this help. A little more please...

I think I have a lot of questions, which are related to each other. My first confusion also relates to whether I have been able to understand the concept clearly. I am trying to put all of them here in minimum words. Kindly let me know if I am right and where I am wrong, if so;

I am dealing with two features (small RNA and DNA methylation).

I aim to look for their overlap and want to rule out a chance overlap.

With nullranges, I am trying to generate 'null' data for testing the hypothesis that my two features truly overlap (not by chance). Nullranges leaves one feature in the background and segments the genome into 'states' according to the classes (possibilities/feature characteristics) within the other feature (small RNA in my case). If there is no way to subclassify the feature, then it belongs to one state only. A feature is then picked from one state (in the original data) and placed into the same state (in my boot data), so that state-wise features are maintained. After each feature has been placed into its respective state, the background feature (methylation) is used each time to find the overlap. I will then compare the overlap in my experimental data with the overlap seen in the boot data to test my hypothesis.

ADD REPLY

Nullranges leaves one feature in the background and segments the genome into ‘states’ as per the classes (possibilities/feature characteristics) within the other feature (small RNA in my case).

I would say, in bootRanges, we leave one feature in its original position; let's call that x. For the other feature y, we move it around to new locations. The segmentation seg defines what can move where, but it seems like this is complicating your analysis here. If you don't specify seg at all, it will move the features to all positions in the genome by sampling blocks of the original features. I would recommend trying without specifying seg or exclude -- this may help simplify your analysis.

As an example of how we think of the use of seg, in the paper we bootstrap DHS peaks. We segmented the genome to >1Mb segments based on gene density (but you can use any large scale segmentation you like). We provide a segmentation for hg38 and you can easily use our functions to segment any genome based on density of genes etc. What this does is, features in one state will only move to other segments of the same state. DHS in gene rich regions can then only be placed in other gene rich regions.

ADD REPLY

Thank you again. I will do it without seg and exclude.

I need help with one more point, please.

Once I have calculated the overlap of the two features in my original (experimental) data, and the overlap of the two features in each bootRanges data set, do I compare the original overlaps against the null data generated by bootRanges for hypothesis testing? An answer to this will help me with the statistical comparison after generating the bootRanges data.

ADD REPLY

We have some example code in the vignette for doing this. This part is highly customizable.

Essentially we recommend this paradigm:

# observed statistic: overlaps between x and the original (un-bootstrapped) y
x_obs <- x %>% mutate(num_overlaps = count_overlaps(., original_y))
sum( x_obs$num_overlaps ) # one type of statistic, sum of overlaps

compared to:

x %>% 
  join_overlap_inner(boots) %>%
  group_by(x_id, iter) %>%
  summarize(n_overlaps = n()) %>%
  as.data.frame() %>%
  complete(x_id, iter, fill=list(n_overlaps = 0)) %>%
  group_by(iter) %>%
  summarize(sumOverlaps = sum(n_overlaps))

See the vignette for this example. You first have to label the ranges in x with a factor, so that x_id exists before the join above:

x <- x %>% mutate(x_id = factor(seq_along(.)))

https://nullranges.github.io/nullranges/articles/bootRanges.html#statistic-i-the-total-number-of-overlaps
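
Once you have the observed sum and one sum per iteration, one common way to compare them is a z-score or an empirical p-value. This is a generic sketch with invented numbers, not necessarily the comparison used in the vignette; `observed` and `null_sums` are placeholder names.

```r
# Invented numbers for illustration: one observed statistic and the same
# statistic computed in each of 10 bootstrap iterations.
observed <- 120
null_sums <- c(80, 95, 70, 88, 102, 91, 77, 85, 99, 93)

# z-score of the observed value relative to the bootstrap distribution.
z <- (observed - mean(null_sums)) / sd(null_sums)

# One-sided empirical p-value with the usual +1 correction, so that
# the p-value is never exactly zero.
p_emp <- (1 + sum(null_sums >= observed)) / (1 + length(null_sums))

print(c(z = z, p_emp = p_emp))
```

With only R iterations the smallest attainable empirical p-value is 1/(R + 1), so more iterations are needed to resolve small p-values.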

ADD REPLY

Thank you Dr. Michael for all your help. Your detailed responses have helped me understand the concept.

ADD REPLY
