permutation test on one set of data to get p-value of highest peaks. permTest? How?
1
0
Entering edit mode
linuxborg2 • 0
@linuxborg2-14849
Last seen 6.1 years ago

How do I do a permutation test on one bed file of data to get the p-value of the highest peaks of distance frequency?  Is permTest what I want? (https://www.rdocumentation.org/packages/regioneR/versions/1.4.0/topics/permTest)

 

I am measuring QKI binding sites at exon-intron junctions.  Right now I have histograms of how common it is for QKI to be certain distances.  For example the x-axis of one of my histograms is -300 nt intron <-> 0 intron-exon junction <-> exon 300 nt and the y-axis is the frequency of how often QKI is in that +-300nt range from the intron-exon junction in the entire transcriptome.

 

Below is a test file.  The last column is the distance from that QKI binding site to the intron-exon junction (the junction is at the zero in the histogram).  For example 77 means the QKI binding site is in the exon 77 nt downstream of the intron-exon junction and -191 means the QKI binding site is in the intron 191 nt upstream of the intron-exon junction.  I think I can trim all of this data and only keep the final column and that would be the test.bed I input to permTest. I need help understanding how to make the proper randomize.function and evaluate.function so I can get this code to work: permTest(test.bed, ntimes=1000, randomize.function, evaluate.function, alternative="auto", min.parallel=1000, force.parallel=NULL, randomize.function.name=NULL, evaluate.function.name=NULL, verbose=TRUE)

 

pt <- permTest(A=my.genes, randomize.function=resampleRegions, universe=all.genes, evaluate.function=meanInRegions, x=methylation.levels.450K)

https://bioconductor.org/packages/3.7/bioc/vignettes/regioneR/inst/doc/regioneR.pdf

 

Except do not do x= and universe should be something like c(-300,300)

universe a region set in any of the formats accepted by toGRanges (GenomicRanges, data.frame, etc...)

https://bioconductor.org/packages/3.7/bioc/manuals/regioneR/man/regioneR.pdf

 

resampleRegions: This is the fastest of the built-in randomization functions. It uses sample to select N regions from a user provided universe. It is basically a linear algorithm with respect to the number of regions

https://bioconductor.org/packages/3.7/bioc/vignettes/regioneR/inst/doc/regioneR.pdf

 

resampleRegions(test.bed, universe=c(-300,300))

https://www.rdocumentation.org/packages/regioneR/versions/1.4.0/topics/resampleRegions

 

meanInRegions(test.bed, x=peakstobetested, col.name=NULL, ...)

https://bioconductor.org/packages/3.7/bioc/manuals/regioneR/man/regioneR.pdf

 

test.bed:

chr1 11922712 11922712 PARCLIP#QKI_Hafner2010c_hg19*QKI 1.0 - chr1 11922635 11923489 uc001atk.4_exon_1_0_chr1_11922636_r 0 - 0 77

chr1 15428836 15428836 PARCLIP#QKI_Hafner2010c_hg19*QKI 2.0 + chr1 15428592 15430343 uc001awh.2_exon_3_0_chr1_15428593_f 0 + 0 244

chr1 16305834 16305834 PARCLIP#QKI_Hafner2010c_hg19*QKI 1.0 - chr1 16305802 16305919 uc001ayg.4_exon_7_0_chr1_16305803_r 0 - 0 32

chr1 26281226 26281226 PARCLIP#QKI_Hafner2010c_hg19*QKI 1.0 + chr1 26281057 26281522 uc001blu.4_exon_2_0_chr1_26281058_f 0 + 0 169

chr1 27822452 27822452 PARCLIP#QKI_Hafner2010c_hg19*QKI 3.0 + chr1 27822230 27824452 uc001bou.4_exon_8_0_chr1_27822231_f 0 + 0 222

 

chr1 19119365 19119365 PARCLIP#QKI_Hafner2010c_hg19*QKI 4.0 - chr1 19118957 19119556 uc001bbi.4_intron_35_0_chr1_19118958_r 0 - 0 -191

chr1 20810619 20810619 PARCLIP#QKI_Hafner2010c_hg19*QKI 2.0 - chr1 20807500 20810737 uc001bef.4_intron_0_0_chr1_20807501_r 0 - 0 -118

regioneR permTest • 1.9k views
ADD COMMENT
0
Entering edit mode

@linuxborg2, you do not have a universe of all possible QKI peak position, so resample regions does not make sense. meanInRegions will compute the mean of a given value over the regions in A, but you do not have such a value. Please see my response below.

ADD REPLY
1
Entering edit mode
bernatgel ▴ 150
@bernatgel-7226
Last seen 3 months ago
Spain

Hi @linuxborg2,

If I understand correctly your question, you want to know if QKI binding sites are positioned closer to to the intron-exon junctions than one would expect by chance. If this is correct, you can certainly use regioneR for that.

In that case, you would need to give to `permTest`:

  - `A`: a BED file or GRanges object with the positions of yor QKI peaks. It does not need the distances, since it will compute the distances itself.

  - `B`: a BED file or GRanges object with the positions of the intron-exon boundaries. Needed to compute the distances to them.

  - `randomize.function`: `randomizeRegions`. It will create completely random regions in the genome.

  - `evaluate.function`: `meanDistance`. It will compute the mean of the distances of each QKI peak (or random region) to the closest intron-exon boundary

  - `genome`: the genome you are using, for example "hg19"

  - `mask`: VERY IMPORTANT! the non-available parts of the genome where regions can not be randomized.

WHY do you need a MASK? One of  the key elements when performing a permutation test is to have a valid null hypothesis.  This will have a profound impact on the exact question you are answering with the test and very often it's overlooked. I do not know enough of your study to make a recommendation, but, for example, if all your QKI peaks are overlapping genes, you'll want to give permTest a mask object masking out all non-gene parts of the genome (something like: `mask <- subtractRegions(getGenome("hg19"), "all.genes.bed")` or similar). With that, you will make sure that the random regions are similar to the original QKI peaks and that they also fall in genes. If your QKI peaks are somewhat different, you should adapt the mask accordingly.

 

On the other hand, if your question is specifically, "are my QKI peaks closer than 300 bps to an intron-exon boundary"? Then you should build a custom evaluation function, something like "numCloserThan300(A, B)", a function that returns the number of regions in A closer than 300bp to any region in B. You can use `distanceToNearest` from GenomicRanges for that.

Oh, and for debugging I would recommend starting with `ntimes=10` or similar small value and `force.parallel=FALSE` until you get it working and then run it with a larger ntimes and in parallel if needed. 

Hope this helps

 

ADD COMMENT
0
Entering edit mode

Thank you your comment helped me make a lot of progress.  I still have questions about evaluate.function and randomize.function.

 

Can I do this then trim the result in AWK to only be peaks closer than 300 bps to an intron-exon boundary?

evaluate.function=meanDistance(test.bed, exonintronjunctions.bed)

evaluate.function=meanDistance(test.bed, intronexonjunctions.bed)

 

If I can't do that then this would be better.  I do not understand what A and B should be because I am already loading all of my data into distanceToNearest.  Sorry.  I am a Masters student and have no experience making custom functions in R:

evaluate.function=numCloserThan300(A, B, distanceToNearest(test.bed, exonintronjunctions.bed))
evaluate.function=numCloserThan300(A, B, distanceToNearest(test.bed, intronexonjunctions.bed))

 

# naive null randomize across the genes I think I figured out, but it is not as good as randomizing within each gene

randomize.function=randomizeRegions(test.bed, genome="hg38", mask=not_hg38_genes.bed)

not_hg38_genes.bed being:

  1. hg38_genes.bed from UCSC Table Browser with column 2 being start of gene and column 3 being end of gene
  2. then I calculate the regions between those genes column 2 being end of gene from hg38_genes.bed and column 3 being the start of the next gene from hg38_genes.bed

 

# better null randomize within each gene (more accurate).  I need help with universe.  Is universe the raw data from UCSC Table Browser the hg38_genes.bed that in column 2 has gene start locations and column 3 gene end locations?  Is resampleRegions dumb and only allow one universe range (eg. c(-300,300)) or is it smart and do one universe per gene (eg. gene1:100-1000 gene2:2000-5000 etc)?  Does resampleRegions randomize within each gene and not across all genes?  resampleRegions was my best guess of the 3 possibilities (resampleRegions, circularRandomizeRegions, randomizeRegions https://bioconductor.org/packages/3.7/bioc/vignettes/regioneR/inst/doc/regioneR.pdf p. 37).

randomize.function=resampleRegions(QKI.bed, universe=hg38_genes.bed)

Another stab at trying to define universe:

exonintrondists=distanceToNearest(QKI.bed, exonintronjunctionlocations.bed)
exonintrondists@elementMetadata # is distances DataFrame
intronexondists=distanceToNearest(QKI.bed, intronexonjunctionlocations.bed)
intronexondists@elementMetadata # is distances DataFrame

ADD REPLY
1
Entering edit mode

I'm afraid regioneR will only tell you if all your regions at the same time, taken as a group, are closer to the boudaries than one would expect. It will not tell you which are the peaks that are closer. 

There's plenty of documentation on how to create your own functions for example [here](https://www.statmethods.net/management/userfunctions.html), [here](https://www.datacamp.com/community/tutorials/functions-in-r-a-tutorial) and [here](https://swcarpentry.github.io/r-novice-inflammation/02-func-R/). Knowing how to create them and work with them is a very useful skill in many situations. The evaluate.function example you have written is not a function definition but a call to a non-existing function.

In your situation you cannot use resampleRegions with a universe, since you do not have a set of all possible peaks. You'll have to randomize. The randomization functions available do not randomize "per gene" as you suggest. You would need to create a new randomization function for that (you can make use of the existing randomization function, but it might be a little tricky). 

ADD REPLY
0
Entering edit mode

I got the standard randomizeRegions and meanDistance to work, but I am not sure that my custom genome (I want hg38 not the default hg19) and mask variables were put into randomizeRegions because I was unable to put the genome and mask variables inside of randomizeRegions and instead had to put them inside of permTest.  I am not sure if genome and mask in permTest are loaded into randomizeRegions.:

# works

> pt = permTest(A=QKI, B=exonintronJunctions, ntimes=1000, genome=hg38genome, mask=myMask, randomize.function=randomizeRegions, evaluate.function=meanDistance, alternative="auto", min.parallel=1000, force.parallel=TRUE, randomize.function.name=NULL, evaluate.function.name=NULL, verbose=FALSE)

 

# does not work.  Do not put stuff inside the evaluate.function nor the randomize.function

> pt2 = permTest(A=QKI, B=exonintronJunctions, ntimes=10, randomize.function=randomizeRegions(QKI, genome=hg38genome, mask=myMask), evaluate.function=meanDistance, alternative="auto", min.parallel=1000, force.parallel=NULL, randomize.function.name=NULL, evaluate.function.name=NULL, verbose=TRUE)
Error in permTest(A = QKI, B = exonintronJunctions, ntimes = 10, randomize.function = randomizeRegions(QKI,  : 
  randomize.function must be a function

 

# install permTest
source("https://bioconductor.org/biocLite.R")
biocLite("regioneR")
library("regioneR")

# install import function
source("https://bioconductor.org/biocLite.R")
biocLite("rtracklayer")
library(rtracklayer)
# https://www.biostars.org/p/170631/

# install hg38 genome
library(BiocInstaller)
biocLite("BSgenome.Hsapiens.UCSC.hg38")

QKI = import("PARCLIP_QKI_Hafner2010c_hg38_noWeirdChr_sorted.bed", format="bed")

exonintronJunctions = import("hg38_introns_noWeirdChr_sorted.bed", format="bed")

genes = import("hg38_genes_noWeirdChr_sorted.bed", format="bed")

hg38genome = getGenome("hg38")
myMask = subtractRegions(genome, genes)

 

I am worried that my hg38genome and especially myMask genes were not passed from permTest to randomizeRegions, because the output graph I has such a huge meanDistance difference between Evobs and Evperm but with such a huge distance why is the p-value 0.0909 and not much lower?:

> pt = permTest(A=QKI, B=exonintronJunctions, ntimes=1000, genome=hg38genome, mask=myMask, randomize.function=randomizeRegions, evaluate.function=meanDistance, alternative="auto", min.parallel=1000, force.parallel=TRUE, randomize.function.name=NULL, evaluate.function.name=NULL, verbose=FALSE)
> # https://www.rdocumentation.org/packages/regioneR/versions/1.4.0/topics/permTest
> summary(pt)
Permutation tests: 1
Significant permutation tests: 1
Iterations: 1000
Randomization Function: randomizeRegions
Tests Results:
                  pvalue zscore    test
meanDistance 0.000999001 9.0285 greater
 
> plot(pt, plotType="Tailed")

https://drive.google.com/open?id=1tait0VINW8faghhEZFYjEpBGvlQ8qhNz

 

I used alternative "auto" but I think I want to test both "greater" and "less" so next time I run this I'll make two permTest to test each.  Does "greater" mean testing QKI downstream of the junction are they closer than by chance and "less" mean QKI upstream of the junction are closer than by chance?

 

NOTE: this bug has still not been resolved (regioneR: error when verbose=TRUE)

Setting force.parallel=TRUE and verbose=TRUE failed (below).  The fix is to set force.parallel=TRUE and verbose=FALSE:

> pt = permTest(A=QKI, B=exonintronJunctions, ntimes=1000, genome=hg38genome, mask=myMask, randomize.function=randomizeRegions, evaluate.function=meanDistance, alternative="auto", min.parallel=1000, force.parallel=TRUE, randomize.function.name=NULL, evaluate.function.name=NULL, verbose=TRUE)
  |==============================================================================================================| 100%Error in random.evaluate[, i] : incorrect number of dimensions

 

ADD REPLY

Login before adding your answer.

Traffic: 572 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6