VST transformation and covariates in the TCGA TARGET GTEx dataset
2
0
Entering edit mode
S ▴ 10
@399a8e69
Last seen 6 months ago
Spain

Hi, I would like to transform RSEM expected counts from the data deposited at Xenabrowser for the TCGA TARGET GTEx dataset (https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz) into VST-transformed counts. If I do not misunderstand the use of this function, in this dataset covariates such as primary_site (brain, breast, etc), sample_type (cell line, normal tissue, tumor tissue) and project (TCGA, GTEX, TARGET) should also be included in the design formula so the VST function, when blind is set to FALSE, takes them into account to properly transform the data.


dds <- DESeqDataSetFromMatrix(countData = unlog_ucsc_xena,
colData = phenotype_survival,
design= ~ sample_type + primary_site + project)
vst_matrix <- vst(dds, blind = FALSE)


Is this correct? I would like to use these transformed values for a few genes (up to 10) in a Cox regression model to see where I will also add all the covariates I included in the design matrix in the DESeqDataSetFromMatrix function. Is there anything else that I should consider?

In this dataset I'm also facing the error "every gene contains at least one zero". I added a value of 1 to all expected counts as suggested in this post https://www.biostars.org/p/440379/

S.

VST TCGATARGETGTEx DESeq2 • 524 views
1
Entering edit mode
@mikelove
Last seen 27 minutes ago
United States

The above sounds fine to me.

I do not recommend adding 1 to all expected counts for DESeq2.

It is strange that every gene contains a zero, is the data here all samples from those three projects? How many samples is this?

0
Entering edit mode

Hi Michael. Thanks for your reply. Yes, the dataset includes samples from these three projects TCGA, GTEX and TARGET. I selected just solid tumors, ~ 17000 samples, and ran DESeq2.

Thanks

1
Entering edit mode

If this is taking too long, you can run it on a subsample (e.g. hundreds of samples) to identify parameters for the VST, and then apply it to the whole dataset.

1
Entering edit mode
josephbrown ▴ 10
@josephbrown-21888
Last seen 8 months ago
United States

Regarding your "every gene contains at least one zero" error:

I think when you run vst(), it will estimate size factors if they're not already present (like running DESeq() in the post you linked), so estimating size factors manually with something like dds <- estimateSizeFactors(dds, type = "poscounts") or dds <- estimateSizeFactors(dds, type = "iterate") and then running vst() may help (as suggested by Kevin Blighe's solution #2), though, as ATpoint mentioned in that post, maybe this isn't the best solution.

From the estimateSizeFactors() help:

type: Method for estimation: either "ratio", "poscounts", or "iterate". "ratio" uses the standard median ratio method introduced in DESeq. The size factor is the median ratio of the sample over a "pseudosample": for each gene, the geometric mean of all samples. "poscounts" and "iterate" offer alternative estimators, which can be used even when all genes contain a sample with a zero (a problem for the default method, as the geometric mean becomes zero, and the ratio undefined). The "poscounts" estimator deals with a gene with some zeros, by calculating a modified geometric mean by taking the n-th root of the product of the non-zero counts. This evolved out of use cases with Paul McMurdie's phyloseq package for metagenomic samples. The "iterate" estimator iterates between estimating the dispersion with a design of ~1, and finding a size factor vector by numerically optimizing the likelihood of the ~1 model.

0
Entering edit mode

Agree, noting just that "poscounts" is used quite a bit and fast (also it has been benchmarked in publications a few times). "iterate" is kind of a theoretical idea without benchmarking and slow. Wouldn't recommend for large dataset.