Dear all / authors of the edgeR package,
I have a question concerning the use of the multidimensional scaling
plot provided by the edgeR package.
I have RNA-seq data for 8 libraries from a 2x2 factorial design and I
want to produce an MDS plot (plotMDS.DGEList) to get a better idea of the
distances between the libraries.
The function plotMDS.DGEList now offers the option top, which lets me
choose the x genes (top=x) that show the highest tagwise dispersion
across all libraries.
My question is: what should I consider when choosing x?
As I have a 2x2 factorial design, I was going to choose x = the number of
all genes, as I don't see a rationale for choosing a specific smaller
number, which would seem somewhat arbitrary to me.
Are there any opinions on that?
Thanks a lot,
Susanne
Please leave top at the default value unless you have a good reason to change it.
Setting top to the whole genome would mean that you would be trying to distinguish your samples using a collection of genes that are mostly
either not differentially expressed between the samples or not expressed at all. This would increase the noise in your comparison and risk masking real patterns.
There is a large literature on unsupervised clustering, of which MDS is a type, and filtering the genes down to those that carry real information for distinguishing the samples is pretty much universally recommended. The exact number used is not important; what matters is that the selection is limited to the more variable genes.
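
To illustrate the point about noise, here is a minimal sketch in plain Python. It is not edgeR's implementation: the data are simulated, genes are ranked by simple variance across samples (an assumption; the question above refers to tagwise dispersion), and the helper names top_variable_genes, rms_distance and separation are hypothetical. It does not run MDS itself; it only compares the between-sample distances that an MDS plot would try to represent, using either all genes or only the most variable ones.

```python
import random
from math import sqrt
from statistics import pvariance


def top_variable_genes(expr, top):
    """Return indices of the `top` genes (rows) with the largest variance across samples."""
    ranked = sorted(range(len(expr)), key=lambda g: pvariance(expr[g]), reverse=True)
    return ranked[:top]


def rms_distance(expr, genes, i, j):
    """Root-mean-square difference between samples i and j over the chosen genes."""
    return sqrt(sum((expr[g][i] - expr[g][j]) ** 2 for g in genes) / len(genes))


def separation(expr, genes, groups):
    """Mean between-group distance divided by mean within-group distance."""
    within, between = [], []
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            d = rms_distance(expr, genes, i, j)
            (within if groups[i] == groups[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))


# Simulated log-expression values (assumed, not real data):
# 8 libraries in two groups, 2000 genes, only the first 50 differ between groups.
random.seed(1)
groups = [0, 0, 0, 0, 1, 1, 1, 1]
n_genes, n_informative = 2000, 50
expr = [
    [random.gauss(0, 1) + (4.0 if g < n_informative else 0.0) * grp for grp in groups]
    for g in range(n_genes)
]

sep_all = separation(expr, range(n_genes), groups)
sep_top = separation(expr, top_variable_genes(expr, 50), groups)

print("between/within distance ratio, all genes:", round(sep_all, 2))
print("between/within distance ratio, top 50 genes:", round(sep_top, 2))
```

In this simulation the ratio is larger when only the most variable genes are used, because the uninformative genes add roughly the same noise to every pairwise distance and dilute the group difference. The cutoff of 50 is arbitrary here, which matches the remark that the exact number matters less than restricting the selection at all.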