Question

Why is the "detection" value given by lumidat the inverse (i.e. 1-x) of the "detection_pval" given by GenomeStudio?

3

Entering edit mode

willj ▴ 30

@willj-8763

Last seen 6.6 years ago

France

I've ran several idat files of microarray data through lumidat, including the example idat files provided in the lumidat package. The "Detection" column always has it's greatest values (i.e. values at or close to 1) for the probes having the strongest signal strength. For example, below I paste output for the three probes having the strongest signal in the lumidat example idat file 5356583020_A. This output was generated using the example given in the lumidat vignette: example(preprocess.illumina.idat)

I have Genome Studio output for some of the idats that I have run. There are minimal changes in the ranking of probes according to signal strength but the Genome Studio "Detection Pval" for the strongest probes is usually close to 0. This makes sense to me: except if I've misunderstood something, the probes with the strongest signal would be the most confidently detected and thus have the lowest detection p-values. So, why is it that lumidat gives a "detection" value close to 1 and not close to 0? Is the lumidat "Detection" column intended to represent the detection p-value, or does it represent something else? (e.g. detection type 2 error, which I guess is 1-p?). Also (and more importantly) is this difference between Genome Studio and lumidat automatically accounted for in downstream workflows such as Lumi?

edit: I should add that the title may be a little misleading because the lumidat "detection" value is not precisely 1-x of the "detection_pval" given by GenomeStudio - it is approximately 1-x, in the examples I have checked.

TargetID	ProbeID	MIN_Signal-5356583020_A	AVG_Signal-5356583020_A	MAX_Signal-5356583020_A	NARRAYS-5356583020_A	ARRAY_STDEV-5356583020_A	BEAD_STDEV-5356583020_A	Avg_NBEADS-5356583020_A	Detection-5356583020_A
HBG2	TIPRV4rkp9ekgjrh6I	42501.53	42501.53	42501.53	1	NaN	15260.67	23	1
UBB	oV5F615l4qunvD30tM	40860.58	40860.58	40860.58	1	NaN	10581.83	25	1
EEF1A1	97viJ90hzUX_rZ.nvY	40495.68	40495.68	40495.68	1	NaN	12339.05	30	1

lumidat idat microarray illumina • 2.3k views

ADD COMMENT • link updated 8.6 years ago by Mark Cowley ▴ 400 • written 8.6 years ago by willj ▴ 30

4

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

lumidat is not a Bioconductor package, so this is not the place to ask questions about it.

GenomeStudio is not consistent about how it outputs the Detection column. Some versions of GenomeStudio have written a p-value in this column, and some write 1-p instead.

My lab uses the read.idat() function in the limma package to read Illumina idat files. This feeds straight into the neqc() function for background correcting and normalizing the expression values. These functions automatically check for how the detection p-value has been represented in the idat file. They automatically convert back to proper p-values if 1-p has been given.

I don't know about lumidat, but it is possible that they have chosen to always convert to 1-p instead of p. You would have to write to the lumidat authors to find out.

neqc() doesn't damp down the signal as much as the lumi procedures do, see:

http://www.ncbi.nlm.nih.gov/pubmed/20929874

Edit: limma::read.idat() and lumidat::read.ilmn.idat() seem to be similar in intention, and both should work fine. read.idat() will pass through the Detection column as found in the idat file, whereas read.ilmn.idat() resets the Detection column. Either should work fine with neqc() and downstream functions.

BTW, neqc() has the nice feature that it figures out what the negative control values must have been from the detection p-values, then uses the negative control values to set the parameters of the normexp background correction step.

ADD COMMENT • link 8.6 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Gordon, thanks very much for explaining this about the GenomeStudio output and how it is interpreted by read.idat(), and for your advice on using your neqc function. I will try to find a way to contact the lumidat authors (edit: Mark Cowley commented below), but I do suggest that it was still good to post the question here because it is relevant to downstream analyses in Bioconductor packages like limma and lumi.

ADD REPLY • link 8.6 years ago willj ▴ 30

Gordon Smyth · Accepted Answer · 2015-09-17

Hi Bill,

My goal for lumidat was to recapitulate the behaviour of GenomeStudio, for better or for worse. As noted by Gordon, and as you can see below in the code, the actual behaviour of GS contradicted the software manual. I like Gordon's idea of detecting & converting them to proper p-values. I usually used lumi for downstream processing, and felt that lumi handled the detection p-values appropriately produced by lumidat. For use with the limma pipeline, perhaps 1-p is more appropriate. I'm afraid that level of nuance has escaped me after so many years of looking at this code.

Here's an excerpt from the underling Java code:

    /**
     * According to the GenomeStudio manual, Detection p-value = 1-R/N,
     * where R is the rank of the gene signal relative to negative controls and N is number of negative controls.
     * if intensity < all negControls, R=0, p=1
     * else if intensity > all negControls, R=N, p=0
     * else R = sum(negControls < intensity). in other words, R is the 0-based index that it would have been, if it was in the vector
     *
     * However, in comparison to GenomeStudio version 1.8.0 actual output, it's clear that for the Reporter probes,
     * P=R/N; ie higher expressed genes have P=1; very lowly expressed genes have P=0.
     *
     * Since this function aims to reproduce the ACTUAL GenomeStudio output, we write out P=R/N
     *
     * References:
     * Chapter 4, Page 106 (page 122/186 in PDF) GenomeStudio_GX_Module_v1.0_UG_11319121_RevA_tcm240-21462.pdf
     *
     * @param R the rank of the gene
     * @param N the number of negative controls
     * @return The Pvalue of a gene with rank R, relative to the N negative controls.
     */
    public static float getPValue(int R, int N) {
        float P = 1.0f - ((float)R / (float)N);
        P = 1 - P; // In comparing to GenomeStudio output, it's clear that they've use P=R/N, so highly expressed genes have P=1
        return (P);
    }

best regards,

Mark