Question: Help: how to use wilcox.test with a SummarizedExperiment object?
gravatar for longsong
5 months ago by
longsong0 wrote:

I have a SummarizedExperiment object (not RangedSummarizedExperiment object): se.

The row names are Gene IDs, and columns names are sample names. Samples are divided into 2 groups: A and V groups. The ‘Column’ (sample) data has a filed: group, which has the values of A or V.

The object has only one assay, which is a integer count.

Now I'd like to use Wilcoxon-Mann-Whitney test to check: for each Gene ID, for A group and V group, the count is different or not?

It maybe a simple question, but I am still a learner of Bioconductor and not know how to do it.

Anyone can help me?


ADD COMMENTlink modified 4 months ago by Gordon Smyth39k • written 5 months ago by longsong0
Answer: Help: how to use wilcox.test with a SummarizedExperiment object?
gravatar for Gordon Smyth
4 months ago by
Gordon Smyth39k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth39k wrote:

It is not usually statistically appropriate to use Wilcoxon tests to compare read counts in a SummarizedExperiment object. James has mentioned that the Wilcoxon test loses statistical power. While that is true, there are even more fundamental objections to using the Wilcoxon test. The Wilcoxon test assumes that all the counts are independent and identically distributed under the null hypothesis, which is very unlikely to be true for sequencing data where the sequencing depth typically varies from the one library to another.

Bioconductor provides a number of very sophisticated and effective statistical procedures for comparing read counts between groups. You don't explain what sort of integer counts you have (RNA-seq? GWAS? ChIP-seq?) but, whatever they are, it would probably be more productive for you to learn about what the Bioconductor packages can provide in terms of statistical analysis of your data, rather than trying to make up a simple statistical analysis for yourself. Even if you are a top statistics expert in your own right, it would still be best to find out what Bioconductor provides first, before trying to figure out whether you can improve it yourself.

Although I am very sceptical that this is a good idea for your data, performing Wilcoxon tests in R is very easy. The rankSumTestWithCorrelation function in the limma package is pretty fast. If y is a matrix:

ngenes <- 1000
nsamples <- 10
y <- matrix(rpois(ngenes*nsamples,lambda=5), ngenes, nsamples)

and your samples are in two groups, with columns j being the first group, for example

j <- 1:5

if the first 5 columns are group 1. Then row-wise Wilcoxon tests can be done by:

PValues <- matrix(1, ngenes, 2)
for (i in 1:nrow(y)) PValues[i,] <- rankSumTestWithCorrelation(j, y[i,])

This takes about a second on my laptop for 10,000 genes and 10 samples.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Gordon Smyth39k

Thanks for your reply.

I have some fungi DNA SNPs data. I created a SummarizedExperiment object from it. The assay is used to count non-synonymous SNPs per Gene. I want to find whether there are count differences per Gene between Group A samples and group V samples. From my understanding (I may be wrong), my case is different from RNA-seq / GWAS / ChIP-seq.

As a beginner, I am not familiar with sophisticated statistics models and most Bioconductor packages. I just want to try many ways to do some preliminary data analysis. Wilcoxon test is one of them. I think there should be a simple way in Bioconductor to use Wilcoxon test to do it. I tried to figure out it myself. But it’s not easy for me now.

Is it somehow correct to use Wilcoxon test for my case? If it is correct, how to do it in a simple way in Bioconductor?


ADD REPLYlink written 4 months ago by longsong0

I don't know enough about your experiment to advise whether the Wilcoxon test is valid for your data, although I'm pretty sure it is not terribly good. Anyway, I have edited my answer to tell you how to do Wilcoxon tests.

ADD REPLYlink written 4 months ago by Gordon Smyth39k

Thank you very much! Your code will be helpful to me.

ADD REPLYlink written 4 months ago by longsong0
Answer: Help: how to use wilcox.test with a SummarizedExperiment object?
gravatar for James W. MacDonald
5 months ago by
United States
James W. MacDonald51k wrote:

There's not to my knowledge a way to do a Wilcoxon rank sum efficiently on high-throughput data, primarily because it's not a common thing to do. There's the RankProd package which is probably the closest you are going to get, but in general people tend to use parametric methods because people usually want to have as much power as possible, and using non-parametric methods guarantees a reduction in power.

You could always loop through each row and run wilcox.test, but that would be suboptimal in terms of power and computational efficiency.

Count data are generally analyzed using either a generalized linear model (with e.g., edgeR or DESeq2), or after suitably transforming in order to use methods based on the normal assumptions (limma-voom and limma-trend).

You could be less mysterious and just tell us what sort of data these are and what you are trying to do rather than asking about particular statistics, but that's up to you.

ADD COMMENTlink written 5 months ago by James W. MacDonald51k

For efficient implementation, there is the siggenes package which has a row-wise Wilcoxon implemented. But I wouldn't use this directly on counts.

We've also been working on a Wilcoxon-based method, similar to SAMseq, for DTE or DGE. One caveat is that this won't work with n=3, but it does seem to have sensitivity beyond that. The method is called swish and the package, fishpond, is now in the release branch. This runs downstream of Salmon and tximeta.

For the SAMseq method:

ADD REPLYlink modified 5 months ago • written 5 months ago by Michael Love25k

While we're throwing methods into the hat, I'll also mention that scran has a pairwiseWilcox() function that does pairwise Wilcoxon rank sum tests between groups of cells in scRNA-seq data.

I implemented it on a whim, but to be honest, I've never really used it for various reasons. (It's slower, has theoretical issues due to non-equality of distributions after normalization, and doesn't yield an easily interpretable effect size.)

ADD REPLYlink modified 5 months ago • written 5 months ago by Aaron Lun25k

Great! We’ll try it out.

ADD REPLYlink written 5 months ago by Michael Love25k

Thank you all.

Please check my other comment to Gordon Smyth's reply.

ADD REPLYlink written 4 months ago by longsong0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 374 users visited in the last hour