I am doing a differential gene expression analysis on 24 pairs of samples, normal vs diseased. I want to generate an MDS plot to check whether my normal and diseased samples cluster well. I am running edgeR for the analysis.

Could anyone tell me the difference between the 2 methods given below?

I have read in my file of raw read counts and described the grouping of the samples into normal and diseased. I have created a DGEList 'y'.

Method 1

y_norm <- calcNormFactors(y)

plotMDS(y_norm)

Method 2

y_norm <- calcNormFactors(y)

cpm <- cpm (y_norm$counts)

plotMDS(cpm)

In both cases, the MDS plots show the normal and diseased samples as separate clusters, but the clusters themselves differ between the two methods.

Can anyone please let me know which of these is the proper way to do it?

Method 1 is correct and Method 2 is wrong. All the documentation uses Method 1, so what has made you think that Method 2 might be appropriate?

Method 2 would be the same as Method 1 if you called cpm() with log=TRUE and prior.count=2. We have always given the same advice, for example in Section 2.15 of the edgeR User's Guide. As it is though, your cpm values are unlogged and so are on the wrong scale to compute linear distances.

Note that prior.count=2 is now the default in the latest version of edgeR, but you still need

logCPM <- cpm(y_norm, log=TRUE)

if you want to compute summary values for plots and heatmaps.
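To make the comparison concrete, here is a minimal sketch of Method 1 next to a corrected Method 2. The simulated counts, the sample names and the 3 vs 3 layout are assumptions for illustration only and are not the data from the question. Note also that passing y_norm$counts to cpm(), as in the original Method 2, hands over a plain matrix, so the normalization factors stored in the DGEList are not used.

```r
library(edgeR)

# Hypothetical simulated data: 1000 genes, 3 normal and 3 diseased samples
set.seed(1)
n_genes <- 1000
group <- factor(rep(c("normal", "diseased"), each = 3))
counts <- matrix(rnbinom(n_genes * 6, mu = 50, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", 1:6)))

y <- DGEList(counts = counts, group = group)
y_norm <- calcNormFactors(y)

# Method 1: plotMDS on the DGEList works on log-CPM values internally
plotMDS(y_norm)

# Corrected Method 2: pass the whole DGEList to cpm() so the normalization
# factors are used, and ask for log2 values with a prior count
logCPM <- cpm(y_norm, log = TRUE, prior.count = 2)
plotMDS(logCPM)
```

The logCPM matrix can also be reused for heatmaps or other summary plots.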

Thank you for your reply.

I originally used Method 1, but recently I saw a couple of pages where someone suggested using Method 2, i.e. making the MDS plot from the normalized counts.

Could you please explain why the second method is wrong and how the two methods differ?

Which page advised Method 2? Can you give a link, please?

cpm computes counts-per-million. It doesn't produce "normalized counts", because the results are not counts.
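As a rough illustration of that point, the sketch below computes unlogged counts-per-million by hand in base R for a tiny made-up matrix. The gene and sample names are hypothetical, and the calculation assumes the simplest case of a plain count matrix with library sizes taken as the column sums and no normalization factors.

```r
# Hypothetical toy counts: 3 genes x 2 samples
counts <- matrix(c(10, 0, 90,
                   200, 50, 750),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2")))

# Library size of each sample, assumed here to be the column sum
lib_size <- colSums(counts)

# Counts-per-million: divide each column by its library size, scale by 1e6
cpm_manual <- t(t(counts) / lib_size) * 1e6

print(cpm_manual)
print(colSums(cpm_manual))
```

Every column of cpm_manual sums to one million regardless of how many reads the sample had, so the values are rescaled proportions rather than counts.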

I do not know if this is the same user on Biostars, but I and others have just been providing comments here: https://www.biostars.org/p/356810/#357188

I have added a final comment in response to your answer here, Gordon.

Kevin, thanks for answering questions about edgeR on Biostars, and I hope you will continue to do that. The use of cpm() in the Biostars thread is fine because it uses log=TRUE, which was my main concern. The prior.count setting is less important, and prior.count=2 is the default anyway in the latest version of edgeR.

Thank you Gordon - no problem. We try our best to be as accurate as possible. If in doubt, we direct users here. Usually most questions are already answered here on Bioconductor by you, Aaron, or James, in fact.

Hi Gordon,

I am the one who used the incorrect method (cpm without the prior.count argument), but I didn't know that prior.count needed to be set. I only described a problem using edgeR with my data on the Biostars page https://www.biostars.org/p/356810, about how to correct the batch effect in my samples and how to check whether my samples cluster together in a PCA plot. Kevin and I have been discussing the issue all day, and I would be very grateful if you would have a look at the comments and give me your opinion.

Thanks in advance

Iraia, your use of cpm seems fine because you used log=TRUE (unlike OP's Method 2). The prior.count setting isn't so important, and prior.count=2 is the default in the latest version of edgeR anyway.