Question

sampling random regions of same nucleotide composition from hg38 genome

0

Entering edit mode

rohitsatyam102 ▴ 20

@rohitsatyam102-24390

Last seen 7 months ago

India

Hi Everyone!!

I am in a pursuit to retrieve some random regions (nullset) from the hg38 genome with reserved nucleotide composition like I have in the input BED file (positiveset). I know this shouldn't be theoretically difficult, but it turns out to be otherwise. I wish to obtain output in BED format since I wish to use annotatr finally which uses randomregions in ranges object form and I can not use their builtin randomize_regions() function because I was cautioned about the way 'annotatr` generates random regions:

It is important to note that if the regions to be randomized have a particular property, for example, they are CpGs, the randomize_regions() wrapper will not preserve that property! Instead, we recommend using regioneR::resampleRegions() with the universe being the superset of the data regions you want to sample from.

Since in my analysis preserving AT and GC content is important but it seems it's difficult to do it with regioneR as described here. I also discovered that though the length of the random regions was the same in nullset as it was in positiveset, the nucleotide composition (GC%) differed drastically. Below is the relevant code:

library("genomation")
library(regioneR)
B <- readBed("cancer_end_newcoor.txt")
rs <- randomizeRegions(B, genome="hg38", per.chromosome = TRUE, allow.overlaps = FALSE)

library(Repitools)
gc<- gcContentCalc(rs , organism=BSgenome.Hsapiens.UCSC.hg38, verbose=TRUE)
f <- as.data.frame(rs)
f$gc <- as.character(as.list(gc))
write.csv(f,file = "random_combined.csv")

I also tried another R package gkmSVM using function genNullSeqs that provides both fast and bed coordinates as output. However it's giving output for only hg19 not hg38. Here is a code for it:

suppressMessages(library(GenomicFeatures))
suppressMessages(library(GenomeInfoDbData))
suppressMessages(library("BSgenome.Hsapiens.UCSC.hg38.masked"))
suppressMessages(library(gkmSVM))
suppressMessages(library(GenomicRanges))
library("BSgenome.Hsapiens.UCSC.hg19.masked")

######  This works fine (note file.bed contains coordinates from hg38  #####
set.seed(123)
filename <- "file.bed"
        name <- tools::file_path_sans_ext(basename(filename))
        negSet.bed <- paste(name,"_negSet.bed", sep = "")
        posSet.fa <- paste(name,"_posSet.fa", sep = "")
        negSet.fa <- paste(name,"_negSet.fa", sep = "")
        genNullSeqs(inputBedFN = filename, genomeVersion = 'hg19', outputBedFN = negSet.bed, outputPosFastaFN = posSet.fa, outputNegFastaFN = negSet.fa)

######  But this fails  #####
hg38 <- BSgenome.Hsapiens.UCSC.hg38
filename <- "file.bed"
        name <- tools::file_path_sans_ext(basename(filename))
        negSet.bed <- paste(name,"_negSet.bed", sep = "")
        posSet.fa <- paste(name,"_posSet.fa", sep = "")
        negSet.fa <- paste(name,"_negSet.fa", sep = "")
        genNullSeqs(inputBedFN = filename, genome = hg38, outputBedFN = negSet.bed)

importing sequences for cancer_end_newcoor.txt from BSgenome.Hsapiens.UCSC.hg19 
 calculating repeat distributions
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘findOverlaps’ for signature ‘"NULL", "IRanges"’

Can you give me some suggestions on what might have gone wrong here or a better alternative package to achieve similar goals or use regioneR to achieve this.

Apart from this I wish to ask that if we wish to do enrichment analysis and make an statement that our regions are enriched in certain kind of region of chromosomes (genic pr extragenic) over random control, how many random control regions do you recommend to have. Should the number of randomly sampled regions (nullset) = number of regions in positiveset.

annotatr regioneR gkmSVM • 1.8k views

ADD COMMENT • link updated 2.3 years ago by Mahmoud • 0 • written 5.2 years ago by rohitsatyam102 ▴ 20

score 0 · Answer 1 · 2023-11-17

0

Entering edit mode

Mahmoud • 0

@0ff47b68

Last seen 2.3 years ago

United States

The genNullSeqs function in gkmSVM expects that the provided genome file has the TRF mask. In your code please try using the following and it should fix the problem:

hg38 <- BSgenome.Hsapiens.UCSC.hg38.masked

ADD COMMENT • link 2.3 years ago Mahmoud • 0