Question

MEDIPS results differ with sample ordering

0

Entering edit mode

keegan ▴ 60

@keegan-11987

Last seen 7 months ago

Vancouver, BC, Canada

I am obtaining unexpected results with the MEDIPS package to identify DMRs with the MEDIPS.meth function (which calls edgeR under the hood) that could potentially be a bug. In particular, when the ordering of the samples within one sample group is altered, the results are different. I would expect that no matter the sample order within group, the results would be identical. Below is a reprex that uses the MEDIPSData example data and illustrates the behaviour I am referring to.


library(MEDIPSData)
library(MEDIPS)
#> Loading required package: BSgenome
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: rtracklayer
#> Loading required package: Rsamtools
library(BSgenome.Hsapiens.UCSC.hg19)

data(hESCs_MeDIP)
data(DE_MeDIP)

CS = MEDIPS.couplingVector(pattern = "CG", refObj = hESCs_MeDIP[[1]])
#> Get genomic sequence pattern positions...
#> Searching in chr22 ...[ 578097 ] found.
#> Number of identified  CG  pattern:  578097 
#> Calculating genomic coordinates...
#> Creating Granges object for genome wide windows...
#> Counting the number of CG's in each window...

mr.edgeR = MEDIPS.meth(MSet1 = DE_MeDIP, MSet2 = hESCs_MeDIP,
                       CSet = CS, p.adj = "BH", 
                       diff.method = "edgeR", MeDIP = F, CNV = F, 
                       minRowSum = 10)
#> Calculating genomic coordinates...
#> Creating Granges object for genome wide windows...
#> Preprocessing MEDIPS SET 1 in MSet1...
#> Preprocessing MEDIPS SET 2 in MSet1...
#> Preprocessing MEDIPS SET 3 in MSet1...
#> Preprocessing MEDIPS SET 1 in MSet2...
#> Preprocessing MEDIPS SET 2 in MSet2...
#> Preprocessing MEDIPS SET 3 in MSet2...
#> Differential coverage analysis...
#> Extracting count windows with at least 10 reads...
#> Execute edgeR for count data of 168437 windows...
#> (Neglecting parameter 'type')
#> Creating a DGEList object...
#> Apply trimmed mean of M-values (TMM) for library sizes normalization...
#> Estimating common dispersion...
#> Estimating tagwise dispersion...
#> Calculating differential coverage...
#> Adjusting p.values for multiple testing...
#> Please note, log2 ratios are reported as log2(MSet1/MSet2).
#> Creating results table...
#> Adding differential coverage results...

mr.edgeR_alt = MEDIPS.meth(MSet1 = rev(DE_MeDIP), MSet2 = rev(hESCs_MeDIP),
                           CSet = CS, p.adj = "BH",
                           diff.method = "edgeR", MeDIP = F, CNV = F, 
                           minRowSum = 10)
#> Calculating genomic coordinates...
#> Creating Granges object for genome wide windows...
#> Preprocessing MEDIPS SET 1 in MSet1...
#> Preprocessing MEDIPS SET 2 in MSet1...
#> Preprocessing MEDIPS SET 3 in MSet1...
#> Preprocessing MEDIPS SET 1 in MSet2...
#> Preprocessing MEDIPS SET 2 in MSet2...
#> Preprocessing MEDIPS SET 3 in MSet2...
#> Differential coverage analysis...
#> Extracting count windows with at least 10 reads...
#> Execute edgeR for count data of 168437 windows...
#> (Neglecting parameter 'type')
#> Creating a DGEList object...
#> Apply trimmed mean of M-values (TMM) for library sizes normalization...
#> Estimating common dispersion...
#> Estimating tagwise dispersion...
#> Calculating differential coverage...
#> Adjusting p.values for multiple testing...
#> Please note, log2 ratios are reported as log2(MSet1/MSet2).
#> Creating results table...
#> Adding differential coverage results...

colnames(mr.edgeR)
#>  [1] "chr"                              "start"                           
#>  [3] "stop"                             "DE.MeDIP.Rep1.chr22.bam.counts"  
#>  [5] "DE.MeDIP.Rep2.chr22.bam.counts"   "DE.MeDIP.Rep3.chr22.bam.counts"  
#>  [7] "hESC.MeDIP.Rep1.chr22.bam.counts" "hESC.MeDIP.Rep2.chr22.bam.counts"
#>  [9] "hESC.MeDIP.Rep3.chr22.bam.counts" "DE.MeDIP.Rep1.chr22.bam.rpkm"    
#> [11] "DE.MeDIP.Rep2.chr22.bam.rpkm"     "DE.MeDIP.Rep3.chr22.bam.rpkm"    
#> [13] "hESC.MeDIP.Rep1.chr22.bam.rpkm"   "hESC.MeDIP.Rep2.chr22.bam.rpkm"  
#> [15] "hESC.MeDIP.Rep3.chr22.bam.rpkm"   "MSets1.counts.mean"              
#> [17] "MSets1.rpkm.mean"                 "MSets2.counts.mean"              
#> [19] "MSets2.rpkm.mean"                 "edgeR.logFC"                     
#> [21] "edgeR.logCPM"                     "edgeR.p.value"                   
#> [23] "edgeR.adj.p.value"

# reorder columns to match original ordering
mr.edgeR_alt2 <- mr.edgeR_alt[,c(1:3, 6,5,4, 9,8,7, 12,11,10, 15,14,13, 16:23)]

# check that reordering worked
identical(colnames(mr.edgeR), colnames(mr.edgeR_alt2))
#> [1] TRUE

# Results are not identical
identical(mr.edgeR, mr.edgeR_alt2)
#> [1] FALSE

# data columns identical; edgeR results columns not
for (j in 1:ncol(mr.edgeR)){
  print(paste0(colnames(mr.edgeR)[j], ": ",
               identical(mr.edgeR[,j], mr.edgeR_alt2[,j])))
}
#> [1] "chr: TRUE"
#> [1] "start: TRUE"
#> [1] "stop: TRUE"
#> [1] "DE.MeDIP.Rep1.chr22.bam.counts: TRUE"
#> [1] "DE.MeDIP.Rep2.chr22.bam.counts: TRUE"
#> [1] "DE.MeDIP.Rep3.chr22.bam.counts: TRUE"
#> [1] "hESC.MeDIP.Rep1.chr22.bam.counts: TRUE"
#> [1] "hESC.MeDIP.Rep2.chr22.bam.counts: TRUE"
#> [1] "hESC.MeDIP.Rep3.chr22.bam.counts: TRUE"
#> [1] "DE.MeDIP.Rep1.chr22.bam.rpkm: TRUE"
#> [1] "DE.MeDIP.Rep2.chr22.bam.rpkm: TRUE"
#> [1] "DE.MeDIP.Rep3.chr22.bam.rpkm: TRUE"
#> [1] "hESC.MeDIP.Rep1.chr22.bam.rpkm: TRUE"
#> [1] "hESC.MeDIP.Rep2.chr22.bam.rpkm: TRUE"
#> [1] "hESC.MeDIP.Rep3.chr22.bam.rpkm: TRUE"
#> [1] "MSets1.counts.mean: TRUE"
#> [1] "MSets1.rpkm.mean: FALSE"
#> [1] "MSets2.counts.mean: TRUE"
#> [1] "MSets2.rpkm.mean: FALSE"
#> [1] "edgeR.logFC: FALSE"
#> [1] "edgeR.logCPM: FALSE"
#> [1] "edgeR.p.value: FALSE"
#> [1] "edgeR.adj.p.value: FALSE"

^{Created on 2020-01-24 by the reprex package (v0.3.0)}

MEDIPS edgeR • 884 views

ADD COMMENT • link updated 4.0 years ago by Lukas Chavez ▴ 570 • written 4.3 years ago by keegan ▴ 60

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 12 minutes ago

WEHI, Melbourne, Australia

edgeR does give the same results regardless of ordering of samples. I am not familiar with MEDIPS but I suggest you try rerunning MEDIPS.couplingVector with the changed ordering as well as MEDIPS.meth. I looks to me like you may have disassociated the samples from their normalization parameters by running the two functions on differently ordered datasets.

ADD COMMENT • link 4.2 years ago Gordon Smyth 50k

0

Entering edit mode

Thanks for the input, Gordon. As I understand it, the MEDIPS.couplingVector in MEDIPS is the set of bins over the reference genome, and doesn't take input from the counts themselves. So as long as two samples have the same reference genome, they will generate identical coupling sets (I have verified this for this example - switching samples around generates an identical coupling set object).

I'm guessing the problem lies elsewhere within the MEDIPS.meth function.

ADD REPLY • link 4.2 years ago keegan ▴ 60

score 2 · Accepted Answer · 2020-05-12

Hi keegan,

Thank you for your email and please apologize the very long delay!

I finally found some time to look into this tonight (this ‘bug' report was very worrysome). However, it turns out that the results are identical, no matter in which order the samples are given. It’s just a floating point rounding error and the identical() function is merciless. See for example:

identical((1.2-0.3), 0.9) [1] FALSE

If you use the all.equal() function instead of identical() you’ll see that everything is just fine:

all.equal((1.2-0.3), 0.9) [1] TRUE

all.equal(mr.edgeR, mr.edgeR_alt2) [1] TRUE

for (j in 1:ncol(mr.edgeR)){ print(paste0(colnames(mr.edgeR)[j], ": ", all.equal(mr.edgeR[,j], mr.edgeR_alt2[,j]))) } [1] "chr: TRUE" [1] "start: TRUE" [1] "stop: TRUE" [1] "DE.MeDIP.Rep1.chr22.bam.counts: TRUE" [1] "DE.MeDIP.Rep2.chr22.bam.counts: TRUE" [1] "DE.MeDIP.Rep3.chr22.bam.counts: TRUE" [1] "hESC.MeDIP.Rep1.chr22.bam.counts: TRUE" [1] "hESC.MeDIP.Rep2.chr22.bam.counts: TRUE" [1] "hESC.MeDIP.Rep3.chr22.bam.counts: TRUE" [1] "DE.MeDIP.Rep1.chr22.bam.rpkm: TRUE" [1] "DE.MeDIP.Rep2.chr22.bam.rpkm: TRUE" [1] "DE.MeDIP.Rep3.chr22.bam.rpkm: TRUE" [1] "hESC.MeDIP.Rep1.chr22.bam.rpkm: TRUE" [1] "hESC.MeDIP.Rep2.chr22.bam.rpkm: TRUE" [1] "hESC.MeDIP.Rep3.chr22.bam.rpkm: TRUE" [1] "MSets1.counts.mean: TRUE" [1] "MSets1.rpkm.mean: TRUE" [1] "MSets2.counts.mean: TRUE" [1] "MSets2.rpkm.mean: TRUE" [1] "edgeR.logFC: TRUE" [1] "edgeR.logCPM: TRUE" [1] "edgeR.p.value: TRUE" [1] "edgeR.adj.p.value: TRUE"

I hope this helps, Lukas