Hi,
I would like to transform the RSEM expected counts deposited at Xenabrowser for the TCGA TARGET GTEx dataset (https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz) into VST-transformed counts. If I understand vst() correctly, covariates in this dataset such as primary_site (brain, breast, etc.), sample_type (cell line, normal tissue, tumor tissue) and project (TCGA, GTEx, TARGET) should also be included in the design formula, so that vst(), when blind is set to FALSE, takes them into account when transforming the data.
Is this correct?
I would like to use these transformed values for a few genes (up to 10) in a Cox regression model, where I will also add all the covariates I included in the design formula passed to DESeqDataSetFromMatrix(). Is there anything else that I should consider?
In this dataset I'm also facing the error "every gene contains at least one zero". I added a value of 1 to all expected counts as suggested in this post https://www.biostars.org/p/440379/
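For reference, the setup described in this question could be sketched roughly as below. This is only a toy illustration: the counts are simulated with makeExampleDESeqDataSet(), the covariate values are made up, sample_type is left out to keep the toy design full rank, and nsub is lowered only because the toy matrix is small. RSEM expected counts are not integers, so they need rounding before DESeqDataSetFromMatrix() will accept them. I also believe the Xena hub lists this file's unit as log2(expected_count+1), which would call for a back-transform first, but that is worth checking on the dataset page.

```r
library(DESeq2)

set.seed(1)

# Toy stand-in for the expected-count matrix (genes x samples); simulated data
expected <- counts(makeExampleDESeqDataSet(n = 1000, m = 24)) * 1.0

# Hypothetical covariates mirroring the ones named in the question
coldata <- data.frame(
  project      = factor(rep(c("TCGA", "TARGET"), each = 12)),
  primary_site = factor(rep(c("brain", "breast", "lung"), times = 8)),
  row.names    = colnames(expected)
)

# Round to integers, as required for a DESeqDataSet
cts <- round(expected)
storage.mode(cts) <- "integer"

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData   = coldata,
                              design    = ~ project + primary_site)

# blind = FALSE: the design is used when estimating the dispersion trend.
# nsub = 200 is only because this toy dataset has few genes.
vsd <- vst(dds, blind = FALSE, nsub = 200)

# Transformed values for one gene, e.g. to use as a covariate elsewhere
head(assay(vsd)["gene1", ])
```

Note that blind = FALSE only affects how the dispersion trend is estimated; the transformed values are not adjusted for the covariates, so they would still go into the downstream model.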
Hi Michael. Thanks for your reply.
Yes, the dataset includes samples from these three projects TCGA, GTEX and TARGET. I selected just solid tumors, ~ 17000 samples, and ran DESeq2.
I will apply estimateSizeFactors(dds, type = "poscounts") instead of adding 1.
If this is taking too long, you can run it on a subsample (e.g. hundreds of samples) to identify parameters for the VST, and then apply it to the whole dataset.
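A minimal sketch of that subsample idea, following the pattern from the DESeq2 documentation for reusing a dispersion function: fit the dispersion trend on a random subset of samples, attach it to the full object, and then transform everything. The data here are simulated and the sizes (60 samples, subset of 20) are arbitrary choices for illustration only.

```r
library(DESeq2)

set.seed(1)

# Simulated stand-in for the full dataset (genes x samples)
dds <- makeExampleDESeqDataSet(n = 1000, m = 60)

# Size factors are estimated on the full dataset
# (type = "poscounts" could be used here if every gene has a zero)
dds <- estimateSizeFactors(dds)

# Estimate the dispersion trend on a random subset of samples only
idx <- sample(ncol(dds), 20)
dds_sub <- dds[, idx]
dds_sub <- estimateDispersions(dds_sub)

# Reuse the fitted dispersion function for the full dataset
dispersionFunction(dds) <- dispersionFunction(dds_sub)

# blind = FALSE so the dispersion function set above is used as-is
vsd <- varianceStabilizingTransformation(dds, blind = FALSE)
dim(assay(vsd))
```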
Regarding your "every gene contains at least one zero" error:
I think that when you run vst(), it will estimate size factors if they're not already present (like running DESeq() in the post you linked). So estimating size factors manually with something like dds <- estimateSizeFactors(dds, type = "poscounts") or dds <- estimateSizeFactors(dds, type = "iterate"), and then running vst(), may help (as suggested by Kevin Blighe's solution #2). As ATpoint mentioned in that post, though, this may not be the best solution.
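Something along these lines, as a rough sketch on simulated data. The zeros are forced into every gene artificially just to reproduce the error, and nsub is lowered only because the toy matrix is small.

```r
library(DESeq2)

set.seed(1)

# Simulated counts, then force one zero into every gene
sim <- makeExampleDESeqDataSet(n = 1000, m = 12)
cts <- counts(sim)
zero_col <- sample(ncol(cts), nrow(cts), replace = TRUE)
cts[cbind(seq_len(nrow(cts)), zero_col)] <- 0L

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData   = colData(sim),
                              design    = ~ condition)

# The default "ratio" method fails because every geometric mean is zero
res <- try(estimateSizeFactors(dds), silent = TRUE)
inherits(res, "try-error")

# "poscounts" still works; vst() then reuses these size factors
dds <- estimateSizeFactors(dds, type = "poscounts")
vsd <- vst(dds, blind = FALSE, nsub = 200)
sizeFactors(dds)
```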
From the estimateSizeFactors() help:
type: Method for estimation: either "ratio", "poscounts", or
"iterate". "ratio" uses the standard median ratio method introduced in
DESeq. The size factor is the median ratio of the sample over a
"pseudosample": for each gene, the geometric mean of all samples.
"poscounts" and "iterate" offer alternative estimators, which can be
used even when all genes contain a sample with a zero (a problem for
the default method, as the geometric mean becomes zero, and the ratio
undefined). The "poscounts" estimator deals with a gene with some
zeros, by calculating a modified geometric mean by taking the n-th
root of the product of the non-zero counts. This evolved out of use
cases with Paul McMurdie's phyloseq package for metagenomic samples.
The "iterate" estimator iterates between estimating the dispersion
with a design of ~1, and finding a size factor vector by numerically
optimizing the likelihood of the ~1 model.
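To make the "poscounts" description concrete, here is a simplified base-R sketch of my reading of it, not the DESeq2 code itself. The count matrix is made up. The final rescaling step (so that the size factors have geometric mean 1) is my understanding of the implementation rather than something stated in the help text above.

```r
# Toy count matrix (genes x samples) in which every gene has one zero
counts <- matrix(c(0, 10, 20,
                   5,  0, 15,
                   8, 12,  0,
                   4,  0,  6),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Standard log geometric means are -Inf for every gene,
# which is why the default "ratio" method cannot proceed
log_geo_means_ratio <- rowMeans(log(counts))

# Modified geometric mean: n-th root of the product of the non-zero counts,
# where n is the total number of samples (zeros still count towards n)
geo_mean_nz <- function(x) {
  if (all(x == 0)) 0 else exp(sum(log(x[x > 0])) / length(x))
}
geo_means <- apply(counts, 1, geo_mean_nz)

# Median ratio per sample, using only that sample's positive counts
size_factors <- apply(counts, 2, function(cnts) {
  keep <- cnts > 0 & geo_means > 0
  exp(median(log(cnts[keep]) - log(geo_means[keep])))
})

# Rescale so the size factors have geometric mean 1 (assumed step, see above)
size_factors <- size_factors / exp(mean(log(size_factors)))
size_factors
```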
Agree, noting just that "poscounts" is widely used and fast (it has also been benchmarked in publications a few times). "iterate" is more of a theoretical idea, without benchmarking, and slow. I wouldn't recommend it for a large dataset.