Question

Retrieving Regions from hg38 of the same GC content and length with a list of elements

Dimitris Polychronopoulos ▴ 80

@dimitris-polychronopoulos-9192

United Kingdom

Dear all,

I am using the regioneR package to draw regions from the genome that match the GC content and length of the elements in my list. I would also like to exclude repetitive regions and keep each random region on the same chromosome as its original element (something like shuffleBed -chrom), so I used this command:

randomizeRegions(myelements, genome="hg38masked", per.chromosome = TRUE, allow.overlaps = FALSE)

The hg38masked object corresponds to the masked BSgenome hg38. This is going to exclude elements that fall within repetitive regions, would that be correct?

As an extra criterion, I would also like each random region from hg38 to match the GC content of its corresponding element, one by one. Any suggestions on how to do this?

Many thanks,

Dimitris

regioneR GRanges GCcontent • 1.9k views

updated 8.7 years ago by bernatgel ▴ 150 • written 8.7 years ago by Dimitris Polychronopoulos ▴ 80

score 2 · Accepted Answer · 2015-11-26

Hi Dimitris,

For your first question: yes. By default, randomizeRegions uses the default mask of the genome (in this case, that of hg38masked), so it won't place any of the new random regions in repetitive regions.
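One way to check this on your own data is to count how many random regions overlap the mask. The following is only a minimal sketch: it assumes the BSgenome.Hsapiens.UCSC.hg38.masked package is installed so that the string "hg38" resolves to the masked genome, and it uses two made-up regions as a stand-in for myelements.

```r
library(regioneR)
library(GenomicRanges)

# Stand-in for the real element list (made-up coordinates, for illustration only).
myelements <- GRanges(c("chr1", "chr2"),
                      IRanges(start = c(1000001, 2000001), width = 1000))

# Fetch the genome and its default mask once, then pass both explicitly.
gam <- getGenomeAndMask("hg38")

random <- randomizeRegions(myelements, genome = gam$genome, mask = gam$mask,
                           per.chromosome = TRUE, allow.overlaps = FALSE)

# Number of random regions that touch the mask.
n.masked <- numOverlaps(random, gam$mask)
print(n.masked)
```

If the mask is being respected, n.masked should come out as 0.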

Your second question is a bit trickier. Right now regioneR cannot apply extra criteria while creating new random regions. This is something we have been thinking about and would like to add in the future, but it is not yet implemented.

Depending on the stringency you need for "the same GC content" I would suggest two different approaches:

1 - If a general High GC, Mid GC and Low GC classification would suffice, you could define the regions of the genome with a given GC content (using any of the precomputed GC content tracks available), split your myelements by GC content into the same broad groups you used to partition the genome, and finally randomize each group within its own regions. To do that, assuming you have a GRanges highGC with the regions of high GC content, you should create a custom mask:

gam <- getGenomeAndMask("hg38masked")
genome <- gam$genome
mask <- gam$mask
non.highGC <- subtractRegions(genome, highGC)
highGC.mask <- mergeRegions(mask, non.highGC)

And then, specify the mask in the randomizeRegions call:

randomizeRegions(myelements, genome="hg38masked", mask=highGC.mask, per.chromosome = TRUE, allow.overlaps = FALSE)

With that you should get regions roughly in the same GC content range, but there's no guarantee at all about the GC content of each specific element. In addition, you could be biasing your randomization by the initial partitioning of the genome.
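The same steps can be repeated for every GC class rather than only for high GC. The sketch below is one possible way to do it, not something regioneR provides: randomizeByGCClass is a hypothetical helper, the 0.4 and 0.6 thresholds are arbitrary, myelements is assumed to carry a numeric metadata column named gc (fraction between 0 and 1), and gc.regions is assumed to be a named list of GRanges (low, mid, high, in that order) built from a precomputed GC track.

```r
library(regioneR)
library(GenomicRanges)

# Hypothetical helper: randomize each GC class only inside the genomic
# regions of that class. All argument names and defaults are assumptions.
randomizeByGCClass <- function(myelements, gc.regions,
                               breaks = c(0, 0.4, 0.6, 1), genome = "hg38") {
  stopifnot(length(gc.regions) == length(breaks) - 1,
            !is.null(names(gc.regions)))
  gam <- getGenomeAndMask(genome)

  # Assign every element to a class using the same thresholds as the partition.
  gc.class <- cut(myelements$gc, breaks = breaks,
                  labels = names(gc.regions), include.lowest = TRUE)

  sapply(names(gc.regions), function(cl) {
    elems <- myelements[which(gc.class == cl)]
    if (length(elems) == 0) return(GRanges())
    # Mask = default mask plus everything outside this GC class.
    outside <- subtractRegions(gam$genome, gc.regions[[cl]])
    class.mask <- mergeRegions(gam$mask, outside)
    randomizeRegions(elems, genome = gam$genome, mask = class.mask,
                     per.chromosome = TRUE, allow.overlaps = FALSE)
  }, simplify = FALSE)
}
```

The result is a named list with one GRanges of random regions per class. If a class has little available space on some chromosome, the placement may become slow or fail, so it is worth checking the size of each class first.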

2 - A second approach could be to "keep randomizing blindly until each region has the required GC content". Basically, compute the GC content of each element, then create a randomization and compute the GC content of each random element. Random regions whose GC content falls in the desired range are kept, and the rest are re-randomized. Keep doing this until all regions have the desired GC content. This will be slower, but in the end each random region would have the GC content of its element (within whatever tolerance you choose) with no additional biases.
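A rough sketch of this second approach follows. It is not a regioneR feature: gcContent and randomizeMatchingGC are hypothetical helpers, and the tolerance tol and the retry limit max.tries are assumed parameters you would need to tune. To keep a strict one-to-one pairing without relying on the order in which randomizeRegions returns its regions (which I have not verified), the sketch randomizes one element at a time, and it adds each accepted region to the mask so that later regions cannot overlap it. GC content is read from the unmasked BSgenome.Hsapiens.UCSC.hg38 sequence.

```r
library(regioneR)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg38)

# GC fraction (0-1) of each region in `gr`, read from the unmasked sequence.
gcContent <- function(gr, bsgenome = BSgenome.Hsapiens.UCSC.hg38) {
  seqs <- getSeq(bsgenome, gr)
  as.numeric(letterFrequency(seqs, letters = "GC", as.prob = TRUE))
}

# Hypothetical helper: for each element, redraw a random region on the same
# chromosome until its GC content is within `tol` of the element's GC content.
randomizeMatchingGC <- function(elements, tol = 0.02, max.tries = 1000,
                                genome = "hg38") {
  gam <- getGenomeAndMask(genome)
  target <- gcContent(elements)
  mask <- gam$mask
  accepted <- vector("list", length(elements))

  for (i in seq_along(elements)) {
    for (attempt in seq_len(max.tries)) {
      cand <- randomizeRegions(elements[i], genome = gam$genome, mask = mask,
                               per.chromosome = TRUE)
      if (abs(gcContent(cand) - target[i]) <= tol) {
        accepted[[i]] <- cand
        # Block this region so later draws cannot overlap it.
        mask <- mergeRegions(mask, cand)
        break
      }
    }
  }

  found <- !vapply(accepted, is.null, logical(1))
  # `regions[[i]]` pairs with `elements[i]`; NULL means no match was found.
  list(regions = accepted, unmatched = which(!found))
}
```

Calling randomizeMatchingGC(myelements) returns a list whose regions entry is aligned with myelements, plus the indices of any elements that could not be matched within max.tries. One call per element is slow for large sets; a batched version would be faster but would need a reliable way to pair each random region with its original element.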

Hope this helps

Bernat