Hello all,
I am interested in the details of the random probe selection at both the intra- and inter-array levels in minfi's preprocessSWAN().
In the original paper I don't see any explanation of an explicit inter-array normalization step, only the statement that the intra-array normalization reduces technical variability between arrays. However, I find the following in the documentation for minfi's preprocessSWAN function:
"SWAN uses a random subset of probes to do the between array normalization. In order to achive reproducible results, the seed needs to be set using set.seed."
So I am wondering:
1. Why does the seed need to be set only for the inter-array normalization and not for the intra-array normalization as well? Is this because the run-to-run differences are negligible? Should I be worried that the random intra-array probe selection hinders reproducibility?
2. What is actually happening in the inter-array normalization? I understand that SWAN (Subset-Quantile Within Array Normalization) selects subsets of probes with varying numbers of underlying CpGs in order to define an intensity distribution for each assay type, to which the remaining probes on the array are then normalized (my rough mental model is sketched below). So which aspect of this process is used for the inter-array normalization?
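For what it's worth, the way I am currently calling it is simply the following (rgSet here is just my RGChannelSet, already read in from the IDATs; the seed value is arbitrary):

library(minfi)

set.seed(123)                      # fix the RNG so the random probe subset is reproducible
mSetSwan <- preprocessSWAN(rgSet)  # re-running with the same seed should give identical values

And to check that I have the intra-array step right, here is the rough mental model I have of it in code. This is purely illustrative and simplified, not the minfi implementation, and all of the function and argument names are my own:

# typeI, typeII: Type I and Type II probe intensities for ONE array
# subI, subII:   indices of the matched random subsets (equal numbers of
#                probes with 1, 2 and 3 underlying CpGs from each design)
swan_one_array <- function(typeI, typeII, subI, subII) {
    # average the empirical quantiles of the two subsets to get a target distribution
    target <- rowMeans(cbind(sort(typeI[subI]), sort(typeII[subII])))
    adjust <- function(x, sub) {
        out <- x
        # subset probes are mapped directly onto the target quantiles
        out[sub] <- target[rank(x[sub], ties.method = "first")]
        # remaining probes are adjusted by interpolating between the subset probes
        rest <- setdiff(seq_along(x), sub)
        out[rest] <- approx(x = sort(x[sub]), y = target, xout = x[rest], rule = 2)$y
        out
    }
    list(typeI = adjust(typeI, subI), typeII = adjust(typeII, subII))
}

If that is roughly right, I still don't see where anything inter-array comes in.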
Thanks as always.
best,
Sean
Thank you very much for your reply.
I noticed that an error is thrown when I attempt to normalize a single array with preprocessSWAN(), which seems like it shouldn't happen if the normalization is only at the intra-array level (please see below).
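For reference, the call that triggers it is essentially this (rgSet is my RGChannelSet; keeping only one column is deliberate here):

library(minfi)

single <- rgSet[, 1]                  # keep just one array
mSetSwan <- preprocessSWAN(single)    # this is where the error appears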
Am I correct that you are referring to the functional normalization paper by Fortin et al. 2014 (PMID: 25599564)?
My only concern with preprocessFunnorm is that, as a newer method, relatively few published papers appear to use it yet, based on a search of PubMed and PMC. In the interest of comparability between 450k results, it struck me as helpful to keep normalization strategies consistent across studies, although I realize there is currently no consensus on which strategy to use.
thanks,
Sean
### preprocessSWAN, single-array error ###
When you see an error like that one, it is an indication that the function expected you to pass in a matrix or data.frame, but it got a vector instead. This is not what I would call a bug per se, because why would anybody ever do a methylation experiment with a single sample?
In other words, there isn't code in preprocessSWAN() to ensure that data from a single sample doesn't get reduced to a vector. (The canonical thing is when you subset using '[': if you do matrix[,someindicator] and 'someindicator' is of length 1, that will return a vector, whereas matrix[,someindicator,drop=FALSE] won't lose the dimensions of your return object.)
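A quick toy example of the difference (just an arbitrary 3 x 2 matrix):

m <- matrix(1:6, nrow = 3)   # a 3 x 2 matrix
m[, 1]                       # returns a length-3 vector; the dimensions are dropped
m[, 1, drop = FALSE]         # still a 3 x 1 matrix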
But this doesn't have anything to do with whether preprocessSWAN() is an intra- or inter-array normalization. It's just an artifact of the fact that it never occurred to Kasper that somebody might try to analyze a single array, so there isn't anything in the code to catch that. In fact, the error occurs here:

bgIntensitySwan <- function(rgSet){
grnMed <- matrixStats::colMedians(getGreen(rgSet)[getControlAddress(rgSet, controlType = "NEGATIVE"), ])
redMed <- matrixStats::colMedians(getRed(rgSet)[getControlAddress(rgSet, controlType = "NEGATIVE"), ])
return(rowMeans(cbind(grnMed, redMed)))
}
which is just the background estimation step.
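You can see the same failure mode with a toy matrix (made-up numbers, with negctl standing in for the negative control subset; the point is just what happens to the dimensions when there is a single column):

green <- matrix(rnorm(20), ncol = 1)    # intensities for a single array: one column
negctl <- green[1:5, ]                  # subsetting drops this down to a plain vector
matrixStats::colMedians(negctl)         # errors, because colMedians() expects a matrix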
I am not sure why you are trying to normalize a single array, because that doesn't really make sense to me. But perhaps you were trying to convince yourself that preprocessSWAN() is in fact a within-array normalization. If so, why not just look at the code? The entire function is only 30 lines or so, and the relevant portion, which normalizes each array one at a time, makes that pretty clear, I think.
This makes sense, thank you.
best,
Sean