Question: Help: how to use wilcox.test with a SummarizedExperiment object?
0
4 weeks ago by
longsong0
longsong0 wrote:

I have a SummarizedExperiment object (not RangedSummarizedExperiment object): se.

The row names are Gene IDs, and columns names are sample names. Samples are divided into 2 groups: A and V groups. The ‘Column’ (sample) data has a filed: group, which has the values of A or V.

The object has only one assay, which is a integer count.

Now I'd like to use Wilcoxon-Mann-Whitney test to check: for each Gene ID, for A group and V group, the count is different or not?

It maybe a simple question, but I am still a learner of Bioconductor and not know how to do it.

Anyone can help me?

Thanks!

modified 29 days ago by Gordon Smyth37k • written 4 weeks ago by longsong0
Answer: Help: how to use wilcox.test with a SummarizedExperiment object?
1
29 days ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

It is not usually statistically appropriate to use Wilcoxon tests to compare read counts in a SummarizedExperiment object. James has mentioned that the Wilcoxon test loses statistical power. While that is true, there are even more fundamental objections to using the Wilcoxon test. The Wilcoxon test assumes that all the counts are independent and identically distributed under the null hypothesis, which is very unlikely to be true for sequencing data where the sequencing depth typically varies from the one library to another.

Bioconductor provides a number of very sophisticated and effective statistical procedures for comparing read counts between groups. You don't explain what sort of integer counts you have (RNA-seq? GWAS? ChIP-seq?) but, whatever they are, it would probably be more productive for you to learn about what the Bioconductor packages can provide in terms of statistical analysis of your data, rather than trying to make up a simple statistical analysis for yourself. Even if you are a top statistics expert in your own right, it would still be best to find out what Bioconductor provides first, before trying to figure out whether you can improve it yourself.

Although I am very sceptical that this is a good idea for your data, performing Wilcoxon tests in R is very easy. The rankSumTestWithCorrelation function in the limma package is pretty fast. If y is a matrix:

library(limma)
ngenes <- 1000
nsamples <- 10
y <- matrix(rpois(ngenes*nsamples,lambda=5), ngenes, nsamples)


and your samples are in two groups, with columns j being the first group, for example

j <- 1:5


if the first 5 columns are group 1. Then row-wise Wilcoxon tests can be done by:

PValues <- matrix(1, ngenes, 2)
for (i in 1:nrow(y)) PValues[i,] <- rankSumTestWithCorrelation(j, y[i,])


This takes about a second on my laptop for 10,000 genes and 10 samples.

I have some fungi DNA SNPs data. I created a SummarizedExperiment object from it. The assay is used to count non-synonymous SNPs per Gene. I want to find whether there are count differences per Gene between Group A samples and group V samples. From my understanding (I may be wrong), my case is different from RNA-seq / GWAS / ChIP-seq.

As a beginner, I am not familiar with sophisticated statistics models and most Bioconductor packages. I just want to try many ways to do some preliminary data analysis. Wilcoxon test is one of them. I think there should be a simple way in Bioconductor to use Wilcoxon test to do it. I tried to figure out it myself. But it’s not easy for me now.

Is it somehow correct to use Wilcoxon test for my case? If it is correct, how to do it in a simple way in Bioconductor?

Thanks!

I don't know enough about your experiment to advise whether the Wilcoxon test is valid for your data, although I'm pretty sure it is not terribly good. Anyway, I have edited my answer to tell you how to do Wilcoxon tests.

Answer: Help: how to use wilcox.test with a SummarizedExperiment object?
0
4 weeks ago by
United States
James W. MacDonald50k wrote:

There's not to my knowledge a way to do a Wilcoxon rank sum efficiently on high-throughput data, primarily because it's not a common thing to do. There's the RankProd package which is probably the closest you are going to get, but in general people tend to use parametric methods because people usually want to have as much power as possible, and using non-parametric methods guarantees a reduction in power.

You could always loop through each row and run wilcox.test, but that would be suboptimal in terms of power and computational efficiency.

Count data are generally analyzed using either a generalized linear model (with e.g., edgeR or DESeq2), or after suitably transforming in order to use methods based on the normal assumptions (limma-voom and limma-trend).

You could be less mysterious and just tell us what sort of data these are and what you are trying to do rather than asking about particular statistics, but that's up to you.

For efficient implementation, there is the siggenes package which has a row-wise Wilcoxon implemented. But I wouldn't use this directly on counts.

We've also been working on a Wilcoxon-based method, similar to SAMseq, for DTE or DGE. One caveat is that this won't work with n=3, but it does seem to have sensitivity beyond that. The method is called swish and the package, fishpond, is now in the release branch. This runs downstream of Salmon and tximeta.

https://bioconductor.org/packages/fishpond

For the SAMseq method:

https://cran.r-project.org/package=samr

While we're throwing methods into the hat, I'll also mention that scran has a pairwiseWilcox() function that does pairwise Wilcoxon rank sum tests between groups of cells in scRNA-seq data.

I implemented it on a whim, but to be honest, I've never really used it for various reasons. (It's slower, has theoretical issues due to non-equality of distributions after normalization, and doesn't yield an easily interpretable effect size.)

Great! We’ll try it out.

Thank you all.