Hello.
I want to ask a question whether it is possible to compare two or more GSVA outputs (of same gene sets), when it is not possible to merge the data (i.e. too large gene expression data matrix for GSVA to run) and run GSVA at once.
It looks like when there are too many samples, GSVA takes too much time to run, or even have no progress even after long time of waiting. I'm curious if it is possible to split the samples into two or more groups, run GSVA separately to each groups, and compare the GSVA values with each other.
If it is not possible or right to do so, is there any way to do instead?
Thank you very much.
Thank you very much for your reply. I'm really happy to hear about the updated version of GSVA.
I took a look at BPPARAM parameter in the new gsva function, but it seems difficult for me to understand. Could you give me an example code? Below is what I tried, and I don't know if this is the right way.
output <- gsva(geneexpressionforgsva ,genesets ,min.sz=5, max.sz=500, method = "gsva",mx.diff=TRUE, verbose=T, parallel.sz=3, BPPARAM=SerialParam(progressbar=T))
*in the help page originally it was BPPARAM=SerialParam(progressbar=verbose) but this gave an error saying verbose does not exist, and I changed it to T.
Thank you very much.
sure, let me first clarify that the "Usage:" section of a help page from a function in R normally shows the function call with all the arguments and their default values in the form
f(argument=value)
but this does not mean that you should call the functionf()
exactly in that way, it means that if you do not set that argument, then the function will use that default value. in fact, i see that in your case, it may suffice to set the argumentparallel.sz=3
, in addition to the input expression data matrix, the input list of gene sets and your desired cutoffs on the minimum and maximum gene set size.the following simulation of an input expression matrix with 10k genes and 1k samples, and 1000 gene sets, runs in about three minutes using 4 cores in my computer:
by setting the argument
parallel.sz
thegsva()
function will try to run the calculations in parallel using a multicore backend with the number of cores specified inparallel.sz
. if you need additional fine tuning of the parallelization then you need to set the argumentBPPARAM
. to learn how to set that argument you should read sections 1 to 3 of the vignette "Introduction to BiocParallel` from the BiocParallel package.Thank you very much for your reply.
I tried gsvaparallel.sz=4) and it seemed to work, but after 100 iterations it said
which was quite frustrating. In fact, my gene expression matrix is much much larger than your example, and it took quite a long time for 100 iterations to finish.
Nevertheless, I am using 64GB RAM computer (Windows 10), and upon checking memory usage after the error message it was still about 48Gb, which seemed to have still more space to allocate 7.9 Gb.
Would there be any suggestions?
Thank you very much.
Could you tell us what are the dimensions of your gene expression matrix? Could you try running this in a linux machine?
Sure, the number of genes are about 58000 and the samples about 19000.
Um, how could I try this in Linux? Do I need separate computer or could I use Ubuntu in Windows? Or perhaps virtual machine?
You could filter out lowly-expressed genes and reduce the gene dimension, this can greatly reduce the memory requirements. what kind of expression units do you have? normalized log2-CPM?
my guess is that a linux system may have a better memory management so I think it would be best if you have access to a separate computer running linux, alternatively you may try running linux (e.g., Ubuntu as you suggest) in a VM on windows, or also you may install docker and run GSVA from the docker container. you can find instructions on how to run a pre-configured docker container for Bioconductor in this link.
Yes I guess it would be useful to filter out lowly-expressed genes. As I know people use criteria something like "genes that have 0 expressions in at least 70% of all samples" something like this, is this correct?
My expression matrix is RNASeq data, DESeq2-normalized, log2 transformed.
Thank you for the linux and related suggestions. I learned a lot.
The strategy of selecting genes with a minimum expression level in a minimum number of samples is usually applied when you have some intuition about what this minimum number of samples can be, such as the size of the smallest group in a comparison across two or more groups of samples. With 19000 samples, it may not be sensible to follow that strategy, so I'd simply plot the distribution of the mean expression values of genes, identify the lower mode corresponding to lowly-expressed genes across the 19000 samples and set a cutoff that discards genes in that mode, e.g., if
y
is your gene expression data matrix of normalized log2-transformed gene expression units, do something like:let's say you consider your cutoff on the x-axis to be 1, but you should determine this from the plot:
and then call
gsva()
usingy.filt
.Your suggestion worked perfectly. Thank you so much.