How to define an outlier array from PCA?
serpalma.v ▴ 60

Dear community,

I am currently analysing microarray data. There are 20 arrays, divided into 5 animals and 4 treatments. It is a repeated-measurements experiment.

I ran a PCA to see whether the treatments explain the variance, and saw one array quite far from the rest of the arrays. This happens both when using the whole data set (PCA1 = -200, PCA2 = 50) and when using only the differentially expressed genes (PCA1 = -100, PCA2 = -30).

Please find the plots in the link:

My question is whether I should remove this array from the analysis. 

Additionally, the number of regulated genes decreases dramatically after removing this animal from the analysis.


microarray PCA outliers sample outliers limma
Aaron Lun ★ 27k

A more graduated approach might be to use arrayWeights, which should assign a lower weight to any outlier array with variable signal relative to its replicates. This reduces its influence on the linear modelling, DE testing, etc. without requiring the drastic action of tossing out the array altogether. I prefer not to remove arrays if possible, as that means I'm throwing out data and reducing residual d.f. to estimate the variance/power to detect DE (as you might have witnessed yourself, from the reduction in DE genes when the affected animal is removed). It's also hard to draw the line between what is an outlier and what isn't when you have small numbers of samples.
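As a rough sketch of what that could look like, the R code below estimates array weights with limma's arrayWeights and passes them to lmFit. The data are simulated, and the names (y, animal, treatment, the choice of array 7 as the noisy one) are assumptions made up for illustration, not the poster's actual objects. The design with animal as a blocking factor is one plausible way to encode the repeated-measures layout described in the question.

```r
library(limma)

set.seed(1)
# Hypothetical layout: 5 animals x 4 treatments = 20 arrays.
animal    <- factor(rep(paste0("A", 1:5), each = 4))
treatment <- factor(rep(paste0("T", 1:4), times = 5))

# Simulated log-expression values: 1000 genes x 20 arrays (not real data).
y <- matrix(rnorm(1000 * 20, mean = 8, sd = 0.3), nrow = 1000)
colnames(y) <- paste(animal, treatment, sep = ".")
# Make array 7 much noisier than the others to mimic an outlier array.
y[, 7] <- y[, 7] + rnorm(1000, sd = 3)

# Animal as blocking factor, treatment as the effect of interest.
design <- model.matrix(~ animal + treatment)

# Estimate one relative quality weight per array.
aw <- arrayWeights(y, design)

# Use the weights in the linear model instead of dropping the array.
fit <- lmFit(y, design, weights = aw)
fit <- eBayes(fit)

round(aw, 2)                        # the noisy array should get a low weight
topTable(fit, coef = "treatmentT2") # example contrast: T2 versus T1
```

With an ExpressionSet such as eset, the same calls should work with eset in place of the matrix y.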


Dear Aaron,

thank you very much (again).

I have applied arrayWeightsSimple as in the example in ?arrayWeights.

The number of DEGs now increases. I guess this makes sense, since the variance is deflated when the outlier array is downweighted.

However, this works within the linear model and I would like to visualize the weighted arrays in a new PCA or dendrogram.

Is it correct if I do the following?

normalization.factor <- arrayWeightsSimple(eset, design)

normalized.arrays <- sapply(1:ncol(exprs(eset)), function(i) exprs(eset)[, i] / normalization.factor[i])  # divide each array by its assigned weight




I don't think it makes sense to interpret the weights as scaling factors. Rather, they modify the expression for the sum-of-squares to be minimised, when solving the linear system in lmFit. As such, you'll need to figure out if your downstream processes have an analogous objective function, in order to get a consistent interpretation for the weights. I've heard of methods for weighted PCA, but I haven't used them so I can't vouch for how sensible they are.
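To make the distinction concrete, here is a small base-R sketch with made-up numbers (y and w are hypothetical, not taken from the experiment). It fits an intercept-only model by weighted least squares, where the weights enter the sum of squares sum(w * (y - mu)^2), and compares that with dividing the values by the weights.

```r
# Toy data: five replicate measurements of one gene; the fourth is an outlier.
y <- c(8.1, 7.9, 8.0, 12.0, 8.2)
w <- c(1, 1, 1, 0.1, 1)  # hypothetical array weights

# Weighted least squares for an intercept-only model
# minimises sum(w * (y - mu)^2) over mu.
fit <- lm(y ~ 1, weights = w)
mu_wls <- unname(coef(fit))

# The minimiser is the weighted mean of the unmodified data.
mu_wmean <- weighted.mean(y, w)

# Dividing each value by its weight alters the data themselves;
# here it inflates the outlier rather than reducing its influence.
y_scaled <- y / w

all.equal(mu_wls, mu_wmean)
mean(y)      # unweighted estimate, pulled towards the outlier
mu_wls       # weighted estimate, closer to the bulk of the values
y_scaled[4]  # the outlying value after scaling
```

The observed values are never changed by the weighted fit; only their contribution to the objective is. That is why a PCA or dendrogram on rescaled values does not correspond to what lmFit does with the weights.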

