I am in a pursuit to retrieve some random regions (nullset) from the hg38 genome with reserved nucleotide composition like I have in the input BED file (positiveset). I know this shouldn't be theoretically difficult, but it turns out to be otherwise. I wish to obtain output in BED format since I wish to use
annotatr finally which uses randomregions in ranges object form and I can not use their builtin
randomize_regions() function because I was cautioned about the way 'annotatr` generates random regions:
It is important to note that if the regions to be randomized have a particular property, for example, they are CpGs, the randomize_regions() wrapper will not preserve that property! Instead, we recommend using regioneR::resampleRegions() with the universe being the superset of the data regions you want to sample from.
Since in my analysis preserving AT and GC content is important but it seems it's difficult to do it with
regioneR as described here. I also discovered that though the length of the random regions was the same in nullset as it was in positiveset, the nucleotide composition (GC%) differed drastically. Below is the relevant code:
library("genomation") library(regioneR) B <- readBed("cancer_end_newcoor.txt") rs <- randomizeRegions(B, genome="hg38", per.chromosome = TRUE, allow.overlaps = FALSE) library(Repitools) gc<- gcContentCalc(rs , organism=BSgenome.Hsapiens.UCSC.hg38, verbose=TRUE) f <- as.data.frame(rs) f$gc <- as.character(as.list(gc)) write.csv(f,file = "random_combined.csv")
I also tried another R package
gkmSVM using function
genNullSeqs that provides both fast and bed coordinates as output. However it's giving output for only hg19 not hg38. Here is a code for it:
suppressMessages(library(GenomicFeatures)) suppressMessages(library(GenomeInfoDbData)) suppressMessages(library("BSgenome.Hsapiens.UCSC.hg38.masked")) suppressMessages(library(gkmSVM)) suppressMessages(library(GenomicRanges)) library("BSgenome.Hsapiens.UCSC.hg19.masked") ###### This works fine (note file.bed contains coordinates from hg38 ##### set.seed(123) filename <- "file.bed" name <- tools::file_path_sans_ext(basename(filename)) negSet.bed <- paste(name,"_negSet.bed", sep = "") posSet.fa <- paste(name,"_posSet.fa", sep = "") negSet.fa <- paste(name,"_negSet.fa", sep = "") genNullSeqs(inputBedFN = filename, genomeVersion = 'hg19', outputBedFN = negSet.bed, outputPosFastaFN = posSet.fa, outputNegFastaFN = negSet.fa) ###### But this fails ##### hg38 <- BSgenome.Hsapiens.UCSC.hg38 filename <- "file.bed" name <- tools::file_path_sans_ext(basename(filename)) negSet.bed <- paste(name,"_negSet.bed", sep = "") posSet.fa <- paste(name,"_posSet.fa", sep = "") negSet.fa <- paste(name,"_negSet.fa", sep = "") genNullSeqs(inputBedFN = filename, genome = hg38, outputBedFN = negSet.bed) importing sequences for cancer_end_newcoor.txt from BSgenome.Hsapiens.UCSC.hg19 calculating repeat distributions Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘findOverlaps’ for signature ‘"NULL", "IRanges"’
Can you give me some suggestions on what might have gone wrong here or a better alternative package to achieve similar goals or use
regioneR to achieve this.
Apart from this I wish to ask that if we wish to do enrichment analysis and make an statement that our regions are enriched in certain kind of region of chromosomes (genic pr extragenic) over random control, how many random control regions do you recommend to have. Should the number of randomly sampled regions (nullset) = number of regions in positiveset.