I would like to transform RSEM expected counts from the data deposited at Xenabrowser for the TCGA TARGET GTEx dataset (https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz) into VST-transformed counts. If I do not misunderstand the use of this function, in this dataset covariates such as primary_site (brain, breast, etc), sample_type (cell line, normal tissue, tumor tissue) and project (TCGA, GTEX, TARGET) should also be included in the design formula so the VST function, when blind is set to FALSE, takes them into account to properly transform the data.
Is this correct?
I would like to use these transformed values for a few genes (up to 10) in a Cox regression model to see where I will also add all the covariates I included in the design matrix in the DESeqDataSetFromMatrix function. Is there anything else that I should consider?
In this dataset I'm also facing the error "every gene contains at least one zero". I added a value of 1 to all expected counts as suggested in this post https://www.biostars.org/p/440379/
Regarding your "every gene contains at least one zero" error:
I think when you run vst(), it will estimate size factors if they're not already present (like running DESeq() in the post you linked), so estimating size factors manually with something like dds <- estimateSizeFactors(dds, type = "poscounts") or dds <- estimateSizeFactors(dds, type = "iterate") and then running vst() may help (as suggested by Kevin Blighe's solution #2), though, as ATpoint mentioned in that post, maybe this isn't the best solution.
From the estimateSizeFactors() help:
type: Method for estimation: either "ratio", "poscounts", or
"iterate". "ratio" uses the standard median ratio method introduced in
DESeq. The size factor is the median ratio of the sample over a
"pseudosample": for each gene, the geometric mean of all samples.
"poscounts" and "iterate" offer alternative estimators, which can be
used even when all genes contain a sample with a zero (a problem for
the default method, as the geometric mean becomes zero, and the ratio
undefined). The "poscounts" estimator deals with a gene with some
zeros, by calculating a modified geometric mean by taking the n-th
root of the product of the non-zero counts. This evolved out of use
cases with Paul McMurdie's phyloseq package for metagenomic samples.
The "iterate" estimator iterates between estimating the dispersion
with a design of ~1, and finding a size factor vector by numerically
optimizing the likelihood of the ~1 model.