Question

DESeq2 strategies for run-to-run replication

Giorgio • 0

Hi all,

This is more of a general (philosophical?) question.

Say I have a dataset analyzed with DESeq2 defaults, from which a VST-normalized dataset is obtained.
The VST data are then used to build a final predictive (binary) ML model that uses only a few genes from the VST dataset.
Now I have a few samples that I re-ran end-to-end with the exact same pipeline as above.
After importing them into DESeq2, which normalization would you use to obtain VST values close to those from the original run?
The goal is to correctly predict the new, repeated samples.

I know there are millions of variables in play, but I was curious to see what folks would answer.

Thank you in advance

DESeq2 Normalization Clustering replicate • 2.6k views

updated 9 months ago by Michael Love • written 3.5 years ago by Giorgio

score 0 · Answer 1 · 2022-08-25

Michael Love 43k

You can use the same VST on the new samples. We have some help to do this in ?varianceStabilizingTransformation:

The variance stabilizing transformation from a previous dataset can be "frozen" and reapplied to new samples. The frozen VST is accomplished by saving the dispersion function accessible with dispersionFunction, assigning this to the DESeqDataSet with the new samples, and running varianceStabilizingTransformation with 'blind' set to FALSE. Then the dispersion function from the previous dataset will be used to transform the new sample(s).
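The frozen VST described above can be sketched on simulated data. This follows the pattern of the example in the ?varianceStabilizingTransformation help page; the object names dds_train and dds_new, the seed, and the simulated counts from makeExampleDESeqDataSet are illustrative assumptions and not part of the original question.

```r
library(DESeq2)

set.seed(1)

# Simulated "training" data: 1000 genes x 6 samples (illustrative only)
dds_train <- makeExampleDESeqDataSet(n = 1000, m = 6)
design(dds_train) <- ~ 1
dds_train <- estimateSizeFactors(dds_train)
dds_train <- estimateDispersions(dds_train)  # fits the dispersion function

# A single simulated "new" sample
dds_new <- makeExampleDESeqDataSet(n = 1000, m = 1)
dds_new <- estimateSizeFactors(dds_new)

# Freeze the VST: reuse the dispersion function learned on the training data
dispersionFunction(dds_new) <- dispersionFunction(dds_train)
vsd_new <- varianceStabilizingTransformation(dds_new, blind = FALSE)

dim(assay(vsd_new))
```

With real data, dds_train would be the object from the original run and dds_new would hold only the re-run samples, restricted to the same genes in the same order.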

3.5 years ago • Michael Love

Dear Michael,

Could you please comment on how to properly apply a VST to a new dataset (in this case, left-out samples)? I have been racking my brain over this and cannot seem to get it right. dds_test currently holds two samples, but this could vary in the future; the ideal use case is a single test sample.

A small example of my current script:

dds_train <- DESeqDataSetFromTximport(txi_train, colData = train_samples, design = ~ sex. + condition)
dds_train <- dds_train[keep, ] # filter for minimal expression
dds_train <- DESeq(dds_train) # to obtain size/normalization factors, although I believe vst normalizes the data internally, so this isn't strictly necessary
vsd_train <- vst(dds_train, blind = TRUE)

dds_test <- DESeqDataSetFromTximport(txi_test, colData = test_samples, design = ~ 1)
dds_test <- dds_test[keep, ] # some filtering for minimal expression --> different genes ! almost complete overlap
dds_test <- DESeq(dds_test)
disp_fun_train <- dispersionFunction(dds_train)
normFactors_test <- counts(dds_test) / assays(dds_test)[["avgTxLength"]]
normFactors_test[normFactors_test == 0] <- 1
normFactors_test <- normFactors_test / exp(rowMeans(log(normFactors_test)))
normalizationFactors(dds_test) <- normFactors_test
dispersionFunction(dds_test) <- disp_fun_train
vsd_test <- varianceStabilizingTransformation(dds_test, blind = FALSE)


> sizeFactors(dds_train)
NULL
> dim(normalizationFactors(dds_train))
[1] 24307    85

Next, I will perform PCA on the train data and then project the VST-transformed left-out samples into the same PCA space.

Thank you in advance!

10 months ago • i.c.denhond

You shouldn't need to run DESeq on the test data, or manually compute normalization factors.

dds_test <- ... # create object
dds_test <- ... # somehow come up with scaling factors
dispersionFunction(dds_test) <- dispersionFunction(dds_train)
vsd_test <- varianceStabilizingTransformation(dds_test, blind = FALSE)

If you want to scale the counts of the new data to match the train data, you would do:

dds_test <- estimateSizeFactors(dds_test, geoMeans = geo_means_of_train_data)

10 months ago • Michael Love

Dear Michael, thank you for your prompt response! I have been working on this approach a bit more. A simple question that I have not yet found an answer to: how can I best transfer the normalization from the train set to the test set? (It is transcript-level data from tximport, so I have normalizationFactors instead of sizeFactors.)

dds_train <- DESeqDataSetFromTximport(txi_train, colData = samples_train, design = ~ Characteristics.sex. + condition)
dds_train <- DESeq(dds_train) 

vsd <- varianceStabilizingTransformation(dds_train, blind = TRUE)
coefs <- attr(dispersionFunction(dds_train), "coefficients")

dds_test <- DESeqDataSetFromTximport(txi_test, colData = sample_test, design = ~ 1)
## how to apply normalization from train set?? 
vsd_test <- manual_vst(dds_test_normalized, coefs)

Or can I indeed simply calculate the geometric mean of the train data? Sorry, I am a bit puzzled; I hope you can help!

My goal is to make the counts comparable between train and test, and to make sure that the test set is also library-size normalized.

9 months ago • i.c.denhond

or can I indeed simply calculate the geometric mean of the train data

Yes, this should be fine. Then pass this vector as the geoMeans argument of estimateSizeFactors on the test data.
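To make the idea concrete, here is a minimal base-R sketch of per-gene geometric means from training counts and a median-of-ratios size factor for a test sample against that reference. The matrices train_counts and test_counts and the helper frozen_size_factor are hypothetical, made up for illustration; this is not DESeq2's own code, which may apply further steps (such as rescaling the resulting factors), so check ?estimateSizeFactors for the exact behavior.

```r
# Toy count matrices (genes x samples); values are made up for illustration
train_counts <- matrix(c( 10,  20,  40,
                         100, 200, 400,
                           5,  10,  20,
                           0,   3,   6),
                       nrow = 4, byrow = TRUE)
test_counts <- matrix(c(30, 300, 15, 9), ncol = 1)

# Per-gene geometric means of the training counts.
# A gene with any zero count gets a geometric mean of 0 and is skipped below.
geo_means_train <- exp(rowMeans(log(train_counts)))

# Hypothetical helper: median-of-ratios size factor per test sample,
# computed against the training geometric means
frozen_size_factor <- function(counts, geo_means) {
  apply(counts, 2, function(cnts) {
    use <- geo_means > 0 & cnts > 0
    exp(median(log(cnts[use]) - log(geo_means[use])))
  })
}

sf_test <- frozen_size_factor(test_counts, geo_means_train)
sf_test
```

In this toy case every usable gene in the test sample is 1.5 times its training geometric mean, so sf_test comes out at about 1.5. With DESeq2 objects, a vector like geo_means_train would be supplied as estimateSizeFactors(dds_test, geoMeans = geo_means_train), with the test genes in the same order as the training genes.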

9 months ago • Michael Love