Density plots microarray
1
0
Entering edit mode
@eleni-christodoulou-2653
Last seen 4.2 years ago
Singapore

Dear all,

I have an issue with the density plot of my microarray data. My data span 6 chips of the Illumina HT12 v4 beadarray. I uploaded all the raw data to Illumina IGS and extracted the SampleProbeProfile and ControlProbeProfile. I proceeded with Limma neqc normalization and transformation. The PCA plot puts the data in two clearly distinct clusters. The density plots of the normalized data are normal distributions that form two separate peaks as well: some lines have one pick and some another.These are corresponding to the two different clusters. But can I continue using these data or not? I tried standardizing them (mean=0, sd=1) and the two peaks remain present but this time they are much closer. Shall I use this second transformation?

Thank you very much for any feedback.

microarray limma normalization densityplot • 1.8k views
1
Entering edit mode
Axel Klenk ▴ 990
@axel-klenk-3224
Last seen 16 hours ago
Switzerland

Dear Eleni,

I'm hesitating a bit to answer this since I've very little experience with Illumina arrays

but as the Illumina experts have been silent so far, I'll give it a try...

First of all, I'm not sure I understand your question:

Are you concerned about the control and sample probes having different density curves?

This is something I would expect and not worry about.

Or do the clusters you mention correspond to, say, treatment groups? This is ok for the PCA

but alarming for density plots... especially after applying a quantile normalization via neqc().

As always it would really help us to help you if you provided your R code for clarification...

Cheers,

- axel

0
Entering edit mode

Thank you Axel for your response.

I see two populations that correspond to different treatments. So, it falls in the second case you mention...the alerting one :)

Here I provide my code for better understanding:

library(limma)
library(RColorBrewer)

# The expression profiles for all the samples (parental and gefR) together
propexpr(raw.data)  #to see what percentage of my probes are truly expressed. Around 52% of probes are truly expressed
table(raw.data$genes$Status)  #there are 769 negative control probes

# Normalization and filtering
norm.data <- neqc(raw.data)
expressed <- rowSums(norm.data$other$Detection > 0.99) >= 2  #filter out probes that are not expressed.
norm.data <- norm.data[expressed,]

dim(norm.data) # [1] 23922    24
colnames(norm.data) = c(rep('CrizoG1',3), rep('CrizoG3',3), rep('LDKG3',3), rep('AlecG2',3), rep('AlecG3',3), rep('APG1',3), rep('APG2',3), rep('Par',3))  #Crizo, Alec, AP and LDK correspond to different treatments G1 and G3 correspond to different generations of the same treated population

targets <- readTargets() # targets file containing information about the treatment (which drug) and the replicates(I have 3 technical replicates for each condition.

# Plot densities after normalizing and cleaning (truly expressed only) the data
require(graphics)
numOfBioReplicates = targets$BiolRep[length(targets$BiolRep)]
col.palette = palette(rainbow(numOfBioReplicates))
plot(density(norm.data$E[,1]), ylim=c(0,0.3), main='Normalized and Cleaned Expressions', xlab='log2 Expression Values', ylab='Density') # Density plot all the technical replicates with the same color. Different colors only for different samplesfor(i in 2:dim(norm.data$E)[2]) lines(density(norm.data$E[,i]),col=(col.palette[factor(targets$BiolRep)])[i])  

Thank you very much for any feedback!

Best,

Eleni

1
Entering edit mode

Dear Eleni,

again, I'm not an Illumina expert, but I would assume/guess that the detection p-value is significant, i.e. very small, if a probe's signal is different from/above the negative controls -- yes or no? :-)

If yes, then your filtering step is keeping probes only if they have at least two (out of 24) "very bad" detection p-values (> 0.99) i.e. you would be filtering for probes that are *not* expressed in at least two samples. Doesn't make sense to me.

If norm.data$other$Detection is supposed to be 1 if the probe actually is expressed, my assumption/guess is wrong and I apologize :-). But I'd anyway try a density plot without the filtering step since I cannot see any other potential reason for your problem in your code.

Hope this helps,

- axel

0
Entering edit mode

Thank you Axel for taking time to look into my problem. Actually the p-value that I see there is not the detection p-value (as the neqc() guide suggests) but exceedance p-values....Thus, norm.data$other$Detection is supposed to be 1 if the probe actually is expressed. So the higher it is the better (I had a long discussion with Gordon Smyth on a previous experiment I analyzed - see post https://support.bioconductor.org/p/77385/#77700).

But thank you very much for the input!

Best regards,

Eleni

0
Entering edit mode

Actually, I also tried a density plot without filtering, and the distributions were well aligned-one on top of the other. However, I think I should do some filtering to capture the truly expressed genes...or else I may have many false positives, don't you think?

Thanks a lot!

Eleni

0
Entering edit mode

Dear Eleni,

sorry for that late response; I'm on holidays and hence very busy... :-)

Some thoughts:

- AFAIK independent filtering is supposed to increase the power in the massive multiple testing situation, i.e., reducing the number of false negatives rather than false positives.

- it should be independent of your experimental groups and your density plots suggest that this is not the case for your detection-based filtering.

- you may of course consider another filtering criterion. There is the PNAS paper by Bourgon et al. (http://www.ncbi.nlm.nih.gov/pubmed/20460310) and some, hmm, discussion about it on this list (https://support.bioconductor.org/p/53061/).

- if the experts tend to slightly disagree, what should the average applied bioinformatician do who is usually expected to deliver analysis results by yesterday? *I* usually start by doing no filtering; often we have hundreds or thousands of DEG and little need for additional power. A more practical consideration is that some customer will inevitably scroll down the top table looking for (his or her favourite) gene XYZ and conclude that my/your analysis is wrong if that gene does not show up because it has been removed by a filter.

- yet another thought is that you will of course be interested in genes that are consistently below detection level in one experimental group and above it in another group although fold change and p-value etc cannot easily be quantified. Hence you should not require more than your smallest group size to be detected. In your particular setting with many rather small groups I'd be very careful with filtering by detection.

Just a couple of thoughts; it is your data and hence up to you to make a decision that you can justify... :-)

Hope this helps,

- axel