Dear Efstathie,
In response to your questions:
1) Regarding lumiB, the default method is 'none', which assumes that the BeadStudio output data is background corrected. So by default, it will do nothing. I did not want that because I did not background correct my data in Genome Studio. I tried using the bgAdjust method but this resulted in some negative values that could not be handled by the following log2 transformation. Thus, I ended up using the 'forcePositive' method, which prepares its output for log2 transformation.
2) Regarding my filtering criteria at the linear model fit step (both after neqc and after lumi preprocessing) I used:
ct<-factor(targets$T790M) #the mutation I am interested in - I have named the columns of my matrix accordingly
design <- model.matrix(~0+ct+targets$Batch)
colnames(design) <- c(levels(ct), 'batch')
arrayw <- arrayWeights(gefR, design)
dupcor <- duplicateCorrelation(gefR, design, block=targets$Sample, weights=arrayw) #correlation between the samples
fit <- lmFit(gefR, design, block=targets$Sample, correlation=dupcor$consensus.correlation, weights=arrayw)
contrasts <- makeContrasts('pos-neg',levels=design) #I am interested in the contrast: mutation-positive versus mutation-negative
fit2 <- contrasts.fit(fit, contrasts)
fit2 <- eBayes(fit2, robust=TRUE, trend=TRUE) #I added the trend=TRUE after your suggestion
top.tab = topTable(fit2, adjust='fdr', number=Inf, p.value=0.05) #
At the last step: After neqc, if I additionally put lfc=1, I won't get any significant genes. Using just p-value criteria, I get 20 DE genes. After lumi preprocessing, on the contrary, I am getting 60 DE genes when I apply both p-value and lfc criteria.
My concern (and the source of all these differences) is the big difference in the truly expressed genes I remain with, after each p-value detection filtering. As I wrote at the original post, when using y$other$Detection < 0.01 I get ~4,000 truly expressed probes, whereas when I use detectionCall() function of lumi I get ~21,000 truly expressed probes. So, much larger a dataset to work with and, obviously, more DE genes detected at the end...
Thank you very much!
Dear Gordon,
Thank you very much for your response and your suggestions. Here is the output of the suggested code
Moreover, I tested the "Detection" column of the Illumina arrays and the values are from 0 to 1...The column is called 'Detection'. How can I understand if this is p-value or 1-p.value?
I also attach the output of summary of my illumina array Lumi object, including the control data:
Finally, why do you say that my limma filtering is incorrect? Do you see any error in the code that I posted?
Thank you very much for your feedback.
Best wishes,
Eleni
limma is telling you that 48-54% of the probes are expressed in each sample, so you can't possibly end up with only 4000 out of 48,000 probes.
Dear Gordon,
Yes! That was it! I now end u with exactly the same number of expressed probes between the methods. Thank you so much!
I also found this small number of probes weird...but how did you understand that these are exceedance values and not p-values?
Thank you very much for your help.
Best wishes,
Eleni
Moreover, here is my code for getting the detection p-value with limma:
...and for getting the detection p-values with lumi:
I even set the largest threshold of p.value=0.05 in limma's way of pre-processing! And the results are so different...I guess I may be doing something really basic in the wrong way...
Thank you very much for your time and feedback.
Best wishes,
Eleni
I suspect that you need
because the Detection scores are actually exceedence values rather than p-values.
Dear Gordon,
Yes! That was it! I now end u with exactly the same number of expressed probes between the methods. Thank you so much! I also found this small number of probes weird...but how did you understand that these are exceedance values and not p-values?
Thank you very much for your help.
Best wishes,
Eleni
Dear Gordon,
please sorry to interfere on this matter, but usually as illustrated in the limma vignette, the detection column holds p-values right ? and something similar with > expressed <- rowSums(y$other$Detection < 0.05) >= cutoff (or < 0.01 for more strigent purposes) would be fine, correct ?
See Why is the "detection" value given by lumidat the inverse (i.e. 1-x) of the "detection_pval" given by GenomeStudio?
read.ilmn() simply reads what columns are provided in the Illumina output. Illumina's Genome Studio software sometimes writes p-values (p) and sometimes writes detection scores (1-p), and one cannot determine which has been done from the name of the column. Functions like neqc() always do the right thing because they check internally whether the detection scores are positively or negatively correlated with expression values. However the y$other$Detection values are just the raw values provided by Illumina, so you have to check yourself if you these values for filtering.
It is very easy to check. Just look at a few expression values like x$E[1:,5,1] and the corresponding detection values x$other$Dection[1:5,1], and it will be obvious whether Detection increases with E or decreases with E. In any case, it will always become obvious very soon from the resulting AveExpr values if you accidently select lowly expressed genes instead of highly expressed.
So in simple words if are the detection p-values, they will "decrease" (more significant) with the increase of expression in each sample---Many many thanks Gordon !!
Thank you very much, it is clear now.