Sampling protein coding genes of the same length distribution as another set of elements using GRanges
@dimitris-polychronopoulos-9192
United Kingdom

Dear all,

I have a set of elements with the following distribution of lengths:

summary(width(positivelincrnas))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    470    4164    9872   18940   20790  152600

and another dataset with the following distribution:

summary(width(positivegeneshg19))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     20    5558   20460   59880   58360 4829000

I would like to get elements from the second dataset (genes) such that they are of the same length distribution as the first set of elements (lincrnas). Both objects are GRanges objects.

Any suggestions?

Thanks a lot,

Dimitris

Julian Gehring ★ 1.3k
@julian-gehring-5818

In order to match the length distributions, you can compute a density estimate from the first data set and sample from the second data set, weighting each element by that density. Let's assume we have two GRanges objects: gr1 (positivelincrnas) and gr2 (positivegeneshg19). The trick is to use a weighted sampling scheme in which each element's sampling probability is derived from the length distribution of the first data set.

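## bins must span the widths of both data sets; choose the bin width according to your dataset
bins = seq(0, max(width(gr1), width(gr2)) + 1000, by = 1000)
h = hist(width(gr1), bins, plot = FALSE)
idx = cut(width(gr2), bins, labels = FALSE)
final_size = 1000 ## placeholder: number of elements to draw from gr2
gr2matched = sample(gr2, final_size, prob = h$density[idx]) ## adjust the 'size' and 'replace' arguments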
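
To see the weighted sampling idea end to end, here is a minimal, self-contained sketch in base R. It is an illustration under stated assumptions, not the code from the answer above: the widths are simulated (w1 and w2 stand in for width(gr1) and width(gr2)), and the names bin_width, n2, prob and picked are made up for the example. It also divides each weight by the number of gr2 elements in the same bin. Without that step, a bin that holds many genes is drawn more often than the target distribution asks for, because every gene in it carries the same weight.

```r
set.seed(1)

# Simulated widths standing in for width(gr1) and width(gr2) (assumed data).
w1 <- round(rlnorm(500, meanlog = 9, sdlog = 1)) + 1       # target set, e.g. lincRNAs
w2 <- round(rlnorm(20000, meanlog = 10, sdlog = 1.5)) + 1  # pool to sample from, e.g. genes

# Equal-width bins that cover the widths of both sets.
bin_width <- 1000
bins <- seq(0, max(w1, w2) + bin_width, by = bin_width)

# Per-bin counts of the target set, and the bin index of every pool element.
h1 <- hist(w1, breaks = bins, plot = FALSE)
idx <- cut(w2, breaks = bins, labels = FALSE)

# Number of pool elements per bin, used to correct for crowded bins.
n2 <- tabulate(idx, nbins = length(bins) - 1)

# Weight = target count in the element's bin / pool count in that bin.
prob <- h1$counts[idx] / n2[idx]

# Draw indices rather than elements; with GRanges this would be gr2[picked].
picked <- sample(length(w2), size = 2000, replace = TRUE, prob = prob)
w2_matched <- w2[picked]

# Compare quartiles of the target set and the matched sample.
print(quantile(w1, c(0.25, 0.5, 0.75)))
print(quantile(w2_matched, c(0.25, 0.5, 0.75)))
```

Sampling with replace = TRUE keeps the draw valid even when few pool elements fall into some target bins; with replace = FALSE the matched set can be at most as large as the number of elements with non-zero weight.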


Thanks a lot for your reply, Julian. That is also the approach I had in mind, but it still doesn't work for me. A couple of questions:

1. When creating object idx, I have to remove NAs with na.omit for example, right?

2. In the sample command you pass gr as the argument, but no object gr is defined, so you probably mean gr1, right?

1. You may need to adjust the bins such that they also span the range of width(gr2). Otherwise, values outside the bin range cannot be assigned a proper index, and this results in NAs. I would try this instead of removing NAs.
2. I actually meant gr2 because this is the data set we want to sample from. I have changed the code in the example.
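
The point about the bin range can be checked with a few toy widths. The sketch below is only an illustration: the numbers are loosely based on the summaries in the question, and the log-spaced breaks are one possible choice for widths spanning several orders of magnitude, not something prescribed in this thread. The names narrow, wide, idx_narrow and idx_wide are made up for the example.

```r
# Toy widths loosely based on the summaries in the question (assumed data).
w1 <- c(470, 4164, 9872, 20790, 152600)
w2 <- c(20, 5558, 20460, 58360, 4829000)

# Breaks that do not span the data: widths outside 1000..25000 get NA.
narrow <- seq(1000, 25000, by = 1000)
idx_narrow <- cut(w2, narrow, labels = FALSE)
print(idx_narrow)

# hist() refuses breaks that do not cover all values, so catch the error here.
res <- tryCatch(hist(w1, narrow, plot = FALSE), error = function(e) e)
print(inherits(res, "error"))

# Breaks spanning both sets, spaced evenly on a log10 scale.
lo <- min(w1, w2)
hi <- max(w1, w2)
wide <- 10^seq(floor(log10(lo)), ceiling(log10(hi)), length.out = 31)
idx_wide <- cut(w2, wide, labels = FALSE, include.lowest = TRUE)
print(anyNA(idx_wide))
```

If the bins are not of equal width, as with the log-spaced breaks here, the per-bin counts from hist() are the safer basis for weights, since the density values are additionally divided by the bin width.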