ROC curves using normalized sequencing data from 2 experiments
@sebastianlobentanzer-11790
Germany

Hi all, hi Michael,

We want to build receiver operating characteristic (ROC) curves for patients from two separate cohorts, using gene expression from sequencing as the predictor, and I was thinking about how to compare them in this context. I applied vst() (varianceStabilizingTransformation()) to both datasets and tried to use one dataset as the training set ("pred_data") and the other as the test set ("test_data") when generating the ROC, like so (count is the variance-stabilized count):

# fit the logistic regression on the training cohort (cohort 1)
fit_glm <- glm(event ~ count, data = pred_data, family = binomial(link = "logit"))

# score the independent test cohort (cohort 2)
glm_link_scores     <- predict(fit_glm, test_data, type = "link")
glm_response_scores <- predict(fit_glm, test_data, type = "response")
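
The ROC itself is then built from the test-cohort scores, e.g. with pROC (one of the tagged packages); a minimal sketch using the names from above:

library(pROC)

# ROC on the held-out cohort: observed events vs. predicted probabilities
roc_obj <- roc(response = test_data$event, predictor = glm_response_scores)
auc(roc_obj)    # area under the curve
plot(roc_obj)   # ROC curve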

I chose the most highly DE gene in both datasets as a first example. However, the vst() output of data set 1 has slightly higher stabilized counts than data set 2 for this particular gene (median around 20 vs. around 16). Because of this shift, the linear model fitted on data set 1 does not transfer to data set 2, and the resulting ROC curve is nonsense.

[Figure: stabilized counts of the example gene in both data sets]

The figure is a scatter plot of the stabilized counts of the example gene in both data sets. Each dot represents an individual patient; the color indicates event/no event. I imagine that for a valid comparison I would have to bring these two distributions to approximately the same mean and SD. Adding a pseudocount to the log transform has next to no effect on these highly expressed genes. I also suspect that a per-gene approach would be more appropriate than a correction of the whole set, but that might be totally wrong.

I believe that using TPM values to compare between samples is not a good approach; correct me if I'm wrong. The question is: can I normalize or scale the vst() counts so that the two datasets end up with a usable distribution around the same mean or median, and how would I do that robustly? (I.e., "pull up" the stabilized counts of set 2 to the levels of set 1.)
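
To make that concrete, the kind of naive per-gene rescaling I have in mind would look roughly like this (purely an illustration; I am not sure it is statistically sound, which is exactly my question):

# naive illustration: gene-wise centering/scaling of the stabilized counts within each cohort,
# so that both cohorts end up on a comparable scale for this gene
pred_data$count_scaled <- as.numeric(scale(pred_data$count))
test_data$count_scaled <- as.numeric(scale(test_data$count))

# the glm/ROC above would then use count_scaled instead of count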

Thanks in advance!

Sebastian

deseq2 pROC ROCR glm
@mikelove
United States

We have a way to "freeze" a VST, which is what you'd want to do rather than computing a new VST; there is some info in the help pages, I think. But this is a little complicated, and then you'd want to share the parameters if someone wanted to reproduce your results.
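
If I understand the docs correctly, the idea is roughly to estimate the dispersion trend on one cohort and re-use it for the other, something like the following sketch (dds_train and dds_test are hypothetical DESeqDataSet objects for the two cohorts; check ?varianceStabilizingTransformation for the exact recipe):

library(DESeq2)

# fit size factors and the dispersion trend on the training cohort
dds_train <- estimateSizeFactors(dds_train)
dds_train <- estimateDispersions(dds_train)
vsd_train <- varianceStabilizingTransformation(dds_train, blind = FALSE)

# "freeze": carry the fitted dispersion function over to the second cohort
dds_test <- estimateSizeFactors(dds_test)
dispersionFunction(dds_test) <- dispersionFunction(dds_train)
vsd_test <- varianceStabilizingTransformation(dds_test, blind = FALSE)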

Maybe simpler would be to find a pseudocount pc for normTransform, which just performs log2(x + pc) on the normalized counts. You can often find an appropriate pc by looking at vsn::meanSdPlot, as we do in the workflow. pc=1 tends to be too small; sometimes 5 or 10 is better at stabilizing the trend.
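
A minimal sketch of that check (dds stands for a hypothetical DESeqDataSet; normTransform() and vsn::meanSdPlot() are used the same way in the DESeq2 workflow):

library(DESeq2)
library(vsn)

# log2(normalized count + pc) for a few candidate pseudocounts
ntd1  <- normTransform(dds, pc = 1)
ntd5  <- normTransform(dds, pc = 5)
ntd10 <- normTransform(dds, pc = 10)

# inspect the mean-sd trend of each transform; a flatter trend means better stabilization
meanSdPlot(assay(ntd1))
meanSdPlot(assay(ntd5))
meanSdPlot(assay(ntd10))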


Hi Michael, thanks for the rapid help!

I think my question might not have been clear enough. In any case, the pseudocounts do next to nothing for the genes I am interested in for the ROC, probably because they are among the most highly expressed genes in the set.

I updated my question accordingly, could you please have another look at it?


Sorry, I don't have any concrete suggestions. I think it's maybe beyond just a question about VST or normalization in DESeq2, and more about batch effects across studies and how to come up with classifiers that generalize well. I think top scoring pairs may be a relevant place for you to look, but I don't think DESeq2 can help here.
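
For intuition, the top-scoring-pairs idea only uses the within-sample ordering of two genes, so it is insensitive to cohort-level shifts like the one above; a toy sketch of the idea (gene names, expression matrices, and outcome vectors here are hypothetical, not from the poster's data):

# the predictor is the indicator "is geneA higher than geneB within this sample?",
# which depends only on within-sample ranks, not on cross-cohort scaling
pair_indicator <- function(expr_mat, geneA, geneB) {
  as.integer(expr_mat[geneA, ] > expr_mat[geneB, ])
}

# train_expr / test_expr: genes x samples matrices of transformed counts (hypothetical)
# train_event / test_event: 0/1 outcome vectors (hypothetical)
x_train <- pair_indicator(train_expr, "geneA", "geneB")
x_test  <- pair_indicator(test_expr,  "geneA", "geneB")

fit <- glm(train_event ~ x_train, family = binomial())
p_test <- predict(fit, newdata = data.frame(x_train = x_test), type = "response")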
