VST transformation and covariates in the TCGA TARGET GTEx dataset
S (@399a8e69) · Spain

Hi, I would like to transform RSEM expected counts from the data deposited at Xenabrowser for the TCGA TARGET GTEx dataset (https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz) into VST-transformed counts. If I understand the use of this function correctly, covariates in this dataset such as primary_site (brain, breast, etc.), sample_type (cell line, normal tissue, tumor tissue) and project (TCGA, GTEX, TARGET) should also be included in the design formula, so that vst(), when blind is set to FALSE, takes them into account to properly transform the data.


# build the DESeq2 dataset with the covariates in the design formula
dds <- DESeqDataSetFromMatrix(countData = unlog_ucsc_xena,
                              colData = phenotype_survival,
                              design = ~ sample_type + primary_site + project)

# variance-stabilizing transformation, using the design (not blind)
vst_matrix <- vst(dds, blind = FALSE)

Is this correct? I would like to use these transformed values for a few genes (up to 10) in a Cox regression model, where I will also add all the covariates I included in the design in DESeqDataSetFromMatrix. Is there anything else that I should consider?
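
Concretely, what I have in mind for the Cox part is something like this (OS_time and OS_event are placeholders for the actual survival columns in phenotype_survival, and the gene ID is just an example):

library(DESeq2)
library(survival)

gene <- "ENSG00000141510"  # hypothetical example; use an actual row name of the VST matrix

cox_data <- data.frame(
  expr         = assay(vst_matrix)[gene, ],    # VST values for this gene
  OS_time      = phenotype_survival$OS_time,   # placeholder: survival time column
  OS_event     = phenotype_survival$OS_event,  # placeholder: event indicator (0/1)
  sample_type  = phenotype_survival$sample_type,
  primary_site = phenotype_survival$primary_site,
  project      = phenotype_survival$project
)

# Cox model with the same covariates used in the DESeq2 design
fit <- coxph(Surv(OS_time, OS_event) ~ expr + sample_type + primary_site + project,
             data = cox_data)
summary(fit)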

In this dataset I am also facing the error "every gene contains at least one zero". I added a value of 1 to all expected counts, as suggested in this post: https://www.biostars.org/p/440379/

Thanks in advance.

S.

Tags: VST, TCGATARGETGTEx, DESeq2
@mikelove · United States

The above sounds fine to me.

I do not recommend adding 1 to all expected counts for DESeq2.

It is strange that every gene contains a zero. Is the data here all samples from those three projects? How many samples is this?


Hi Michael. Thanks for your reply. Yes, the dataset includes samples from these three projects: TCGA, GTEX and TARGET. I selected just solid tumors (~17,000 samples) and ran DESeq2.

I will apply estimateSizeFactors(dds, type = "poscounts") instead of adding 1.
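
That is, something like the following, since vst() should then reuse the stored size factors instead of re-estimating them with the default "ratio" method:

dds <- estimateSizeFactors(dds, type = "poscounts")
vst_matrix <- vst(dds, blind = FALSE)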

Thanks


If this is taking too long, you can run it on a subsample (e.g. hundreds of samples) to identify parameters for the VST, and then apply it to the whole dataset.
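
For example, something along these lines (the subset size and seed here are arbitrary):

# fit the VST parameters on a random subsample
set.seed(1)
idx <- sample(ncol(dds), 500)
dds_sub <- estimateSizeFactors(dds[, idx], type = "poscounts")
dds_sub <- estimateDispersions(dds_sub)

# reuse the fitted dispersion trend to transform the full dataset
dds <- estimateSizeFactors(dds, type = "poscounts")
dispersionFunction(dds) <- dispersionFunction(dds_sub)
vsd <- varianceStabilizingTransformation(dds, blind = FALSE)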

josephbrown (@josephbrown-21888) · United States

Regarding your "every gene contains at least one zero" error:

I think when you run vst(), it will estimate size factors if they are not already present (like running DESeq() in the post you linked). So estimating size factors manually with something like dds <- estimateSizeFactors(dds, type = "poscounts") or dds <- estimateSizeFactors(dds, type = "iterate") and then running vst() may help (as suggested by Kevin Blighe's solution #2), although, as ATpoint mentioned in that post, this may not be the best solution.

From the estimateSizeFactors() help:

type: Method for estimation: either "ratio", "poscounts", or "iterate".

- "ratio" uses the standard median ratio method introduced in DESeq. The size factor is the median ratio of the sample over a "pseudosample": for each gene, the geometric mean of all samples.
- "poscounts" and "iterate" offer alternative estimators, which can be used even when all genes contain a sample with a zero (a problem for the default method, as the geometric mean becomes zero, and the ratio undefined). The "poscounts" estimator deals with a gene with some zeros by calculating a modified geometric mean, taking the n-th root of the product of the non-zero counts. This evolved out of use cases with Paul McMurdie's phyloseq package for metagenomic samples.
- "iterate" iterates between estimating the dispersion with a design of ~1, and finding a size factor vector by numerically optimizing the likelihood of the ~1 model.
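
To illustrate the "poscounts" idea, here is a rough sketch of the concept (not DESeq2's exact internals):

cts <- counts(dds)  # raw count matrix

# modified geometric mean: n-th root of the product of the non-zero counts,
# with n the total number of samples
geo_mean_pos <- apply(cts, 1, function(x) {
  if (all(x == 0)) 0 else exp(sum(log(x[x > 0])) / length(x))
})

# median-ratio size factors, over genes with a usable geometric mean
sf <- apply(cts, 2, function(x) {
  keep <- geo_mean_pos > 0 & x > 0
  median(x[keep] / geo_mean_pos[keep])
})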


Agreed, noting just that "poscounts" is used quite a bit and is fast (it has also been benchmarked in publications a few times). "iterate" is more of a theoretical idea, without benchmarking, and slow; I would not recommend it for a large dataset.
