sampling random regions of same nucleotide composition from hg38 genome
0
0
Entering edit mode
@rohitsatyam102-24390
Last seen 11 weeks ago
India

Hi Everyone!!

I am in a pursuit to retrieve some random regions (nullset) from the hg38 genome with reserved nucleotide composition like I have in the input BED file (positiveset). I know this shouldn't be theoretically difficult, but it turns out to be otherwise. I wish to obtain output in BED format since I wish to use annotatr finally which uses randomregions in ranges object form and I can not use their builtin randomize_regions() function because I was cautioned about the way 'annotatr generates random regions:

It is important to note that if the regions to be randomized have a particular property, for example, they are CpGs, the randomize_regions() wrapper will not preserve that property! Instead, we recommend using regioneR::resampleRegions() with the universe being the superset of the data regions you want to sample from.

Since in my analysis preserving AT and GC content is important but it seems it's difficult to do it with regioneR as described here. I also discovered that though the length of the random regions was the same in nullset as it was in positiveset, the nucleotide composition (GC%) differed drastically. Below is the relevant code:

library("genomation")
library(regioneR)
B <- readBed("cancer_end_newcoor.txt")
rs <- randomizeRegions(B, genome="hg38", per.chromosome = TRUE, allow.overlaps = FALSE)

library(Repitools)
gc<- gcContentCalc(rs , organism=BSgenome.Hsapiens.UCSC.hg38, verbose=TRUE)
f <- as.data.frame(rs)
f\$gc <- as.character(as.list(gc))
write.csv(f,file = "random_combined.csv")


I also tried another R package gkmSVM using function genNullSeqs that provides both fast and bed coordinates as output. However it's giving output for only hg19 not hg38. Here is a code for it:

suppressMessages(library(GenomicFeatures))
suppressMessages(library(GenomeInfoDbData))
suppressMessages(library("BSgenome.Hsapiens.UCSC.hg38.masked"))
suppressMessages(library(gkmSVM))
suppressMessages(library(GenomicRanges))
library("BSgenome.Hsapiens.UCSC.hg19.masked")

######  This works fine (note file.bed contains coordinates from hg38  #####
set.seed(123)
filename <- "file.bed"
name <- tools::file_path_sans_ext(basename(filename))
negSet.bed <- paste(name,"_negSet.bed", sep = "")
posSet.fa <- paste(name,"_posSet.fa", sep = "")
negSet.fa <- paste(name,"_negSet.fa", sep = "")
genNullSeqs(inputBedFN = filename, genomeVersion = 'hg19', outputBedFN = negSet.bed, outputPosFastaFN = posSet.fa, outputNegFastaFN = negSet.fa)

######  But this fails  #####
hg38 <- BSgenome.Hsapiens.UCSC.hg38
filename <- "file.bed"
name <- tools::file_path_sans_ext(basename(filename))
negSet.bed <- paste(name,"_negSet.bed", sep = "")
posSet.fa <- paste(name,"_posSet.fa", sep = "")
negSet.fa <- paste(name,"_negSet.fa", sep = "")
genNullSeqs(inputBedFN = filename, genome = hg38, outputBedFN = negSet.bed)

importing sequences for cancer_end_newcoor.txt from BSgenome.Hsapiens.UCSC.hg19
calculating repeat distributions
Error in (function (classes, fdef, mtable)  :
unable to find an inherited method for function ‘findOverlaps’ for signature ‘"NULL", "IRanges"’


Can you give me some suggestions on what might have gone wrong here or a better alternative package to achieve similar goals or use regioneR` to achieve this.

Apart from this I wish to ask that if we wish to do enrichment analysis and make an statement that our regions are enriched in certain kind of region of chromosomes (genic pr extragenic) over random control, how many random control regions do you recommend to have. Should the number of randomly sampled regions (nullset) = number of regions in positiveset.

annotatr regioneR gkmSVM • 485 views
ADD COMMENT

Login before adding your answer.

Traffic: 505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6