29 days ago by
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
It is not usually statistically appropriate to use Wilcoxon tests to compare read counts in a SummarizedExperiment object. James has mentioned that the Wilcoxon test loses statistical power. While that is true, there are even more fundamental objections to using the Wilcoxon test. The Wilcoxon test assumes that all the counts are independent and identically distributed under the null hypothesis, which is very unlikely to be true for sequencing data where the sequencing depth typically varies from the one library to another.
Bioconductor provides a number of very sophisticated and effective statistical procedures for comparing read counts between groups. You don't explain what sort of integer counts you have (RNA-seq? GWAS? ChIP-seq?) but, whatever they are, it would probably be more productive for you to learn about what the Bioconductor packages can provide in terms of statistical analysis of your data, rather than trying to make up a simple statistical analysis for yourself. Even if you are a top statistics expert in your own right, it would still be best to find out what Bioconductor provides first, before trying to figure out whether you can improve it yourself.
Although I am very sceptical that this is a good idea for your data, performing Wilcoxon tests in R is very easy. The
rankSumTestWithCorrelation function in the limma package is pretty fast. If
y is a matrix:
ngenes <- 1000
nsamples <- 10
y <- matrix(rpois(ngenes*nsamples,lambda=5), ngenes, nsamples)
and your samples are in two groups, with columns
j being the first group, for example
j <- 1:5
if the first 5 columns are group 1. Then row-wise Wilcoxon tests can be done by:
PValues <- matrix(1, ngenes, 2)
for (i in 1:nrow(y)) PValues[i,] <- rankSumTestWithCorrelation(j, y[i,])
This takes about a second on my laptop for 10,000 genes and 10 samples.