Help: how to use wilcox.test with a SummarizedExperiment object?
2
0
Entering edit mode
longsong • 0
@longsong-20032
Last seen 5.5 years ago

I have a SummarizedExperiment object (not RangedSummarizedExperiment object): se.

The row names are Gene IDs, and columns names are sample names. Samples are divided into 2 groups: A and V groups. The ‘Column’ (sample) data has a filed: group, which has the values of A or V.

The object has only one assay, which is a integer count.

Now I'd like to use Wilcoxon-Mann-Whitney test to check: for each Gene ID, for A group and V group, the count is different or not?

It maybe a simple question, but I am still a learner of Bioconductor and not know how to do it.

Anyone can help me?

Thanks!

Wilcoxon-Mann-Whitney test SummarizedExperiment Wilcoxon-Mann-Whitney test • 4.1k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 3 minutes ago
WEHI, Melbourne, Australia

It is not usually statistically appropriate to use Wilcoxon tests to compare read counts in a SummarizedExperiment object. James has mentioned that the Wilcoxon test loses statistical power. While that is true, there are even more fundamental objections to using the Wilcoxon test. The Wilcoxon test assumes that all the counts are independent and identically distributed under the null hypothesis, which is very unlikely to be true for sequencing data where the sequencing depth typically varies from the one library to another.

Bioconductor provides a number of very sophisticated and effective statistical procedures for comparing read counts between groups. You don't explain what sort of integer counts you have (RNA-seq? GWAS? ChIP-seq?) but, whatever they are, it would probably be more productive for you to learn about what the Bioconductor packages can provide in terms of statistical analysis of your data, rather than trying to make up a simple statistical analysis for yourself. Even if you are a top statistics expert in your own right, it would still be best to find out what Bioconductor provides first, before trying to figure out whether you can improve it yourself.

Although I am very sceptical that this is a good idea for your data, performing Wilcoxon tests in R is very easy. The rankSumTestWithCorrelation function in the limma package is pretty fast. If y is a matrix:

library(limma)
ngenes <- 1000
nsamples <- 10
y <- matrix(rpois(ngenes*nsamples,lambda=5), ngenes, nsamples)

and your samples are in two groups, with columns j being the first group, for example

j <- 1:5

if the first 5 columns are group 1. Then row-wise Wilcoxon tests can be done by:

PValues <- matrix(1, ngenes, 2)
for (i in 1:nrow(y)) PValues[i,] <- rankSumTestWithCorrelation(j, y[i,])

This takes about a second on my laptop for 10,000 genes and 10 samples.

ADD COMMENT
0
Entering edit mode

Thanks for your reply.

I have some fungi DNA SNPs data. I created a SummarizedExperiment object from it. The assay is used to count non-synonymous SNPs per Gene. I want to find whether there are count differences per Gene between Group A samples and group V samples. From my understanding (I may be wrong), my case is different from RNA-seq / GWAS / ChIP-seq.

As a beginner, I am not familiar with sophisticated statistics models and most Bioconductor packages. I just want to try many ways to do some preliminary data analysis. Wilcoxon test is one of them. I think there should be a simple way in Bioconductor to use Wilcoxon test to do it. I tried to figure out it myself. But it’s not easy for me now.

Is it somehow correct to use Wilcoxon test for my case? If it is correct, how to do it in a simple way in Bioconductor?

Thanks!

ADD REPLY
0
Entering edit mode

I don't know enough about your experiment to advise whether the Wilcoxon test is valid for your data, although I'm pretty sure it is not terribly good. Anyway, I have edited my answer to tell you how to do Wilcoxon tests.

ADD REPLY
0
Entering edit mode

Thank you very much! Your code will be helpful to me.

ADD REPLY
0
Entering edit mode

Hi Gordon Smyth,

Thank you for the answer and R code on how to perform a Wilcoxon test. I am wondering how to interpret the left and right tail p-values from the output generated with rankSumTestWithCorrelation. Is this the probability that say, the gene MMP14 is greater in the "mutant" group compared the "wild-type" group (if looking at the right tail p-value), and if so, it is statistically significant assuming alpha = 0.05? I am unsure how to interpret results, relative to the base r wilcox.test. Any clarification you may be able to give is greatly appreciated.

Best, Jennifer

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 11 hours ago
United States

There's not to my knowledge a way to do a Wilcoxon rank sum efficiently on high-throughput data, primarily because it's not a common thing to do. There's the RankProd package which is probably the closest you are going to get, but in general people tend to use parametric methods because people usually want to have as much power as possible, and using non-parametric methods guarantees a reduction in power.

You could always loop through each row and run wilcox.test, but that would be suboptimal in terms of power and computational efficiency.

Count data are generally analyzed using either a generalized linear model (with e.g., edgeR or DESeq2), or after suitably transforming in order to use methods based on the normal assumptions (limma-voom and limma-trend).

You could be less mysterious and just tell us what sort of data these are and what you are trying to do rather than asking about particular statistics, but that's up to you.

ADD COMMENT
0
Entering edit mode

For efficient implementation, there is the siggenes package which has a row-wise Wilcoxon implemented. But I wouldn't use this directly on counts.

We've also been working on a Wilcoxon-based method, similar to SAMseq, for DTE or DGE. One caveat is that this won't work with n=3, but it does seem to have sensitivity beyond that. The method is called swish and the package, fishpond, is now in the release branch. This runs downstream of Salmon and tximeta.

https://bioconductor.org/packages/fishpond

For the SAMseq method:

https://cran.r-project.org/package=samr

ADD REPLY
0
Entering edit mode

While we're throwing methods into the hat, I'll also mention that scran has a pairwiseWilcox() function that does pairwise Wilcoxon rank sum tests between groups of cells in scRNA-seq data.

I implemented it on a whim, but to be honest, I've never really used it for various reasons. (It's slower, has theoretical issues due to non-equality of distributions after normalization, and doesn't yield an easily interpretable effect size.)

ADD REPLY
0
Entering edit mode

Great! We’ll try it out.

ADD REPLY
0
Entering edit mode

Thank you all.

Please check my other comment to Gordon Smyth's reply.

ADD REPLY

Login before adding your answer.

Traffic: 766 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6