ROC curves using normalized sequencing data from 2 experiments
Entering edit mode
Last seen 9 weeks ago

Hi all, hi Michael,

we want to build receiver operating characteristics for patients from two separate cohorts (using gene expression from sequencing as predictor), and I was thinking about how to compare them in this context. I did vst() (varianceStablilizingTransformation()) on both of them and tried to use one dataset as training set ("pred_data") and the other set as the test set ("test_data") in the generation of the ROC, like so (count is the variance stabilized count):

fit_glm <- glm(event ~ count, pred_data, family=binomial(link="logit"))

glm_link_scores <- predict(fit_glm, test_data, type="link")

glm_response_scores <- predict(fit_glm, test_data, type="response")

I chose the most highly DE gene in both datasets as the first example. However, the vst() output of data set 1 has slighty higher stabilized counts than the second one for this particular gene (median around 20 vs median around 16). Thus, the linear model based on data set 1 seems to be non-applicable to data set 2 and because of that the ROC curve is nonsense.

stabilized counts of example gene in both data sets

This is a scatter plot of the stabilized counts of the example gene in both data sets. Each dot represents an individual patient, the color is event/no event. I imagine that for a valid comparison, I would have to get these two distributions to approximately the same mean and sd. Adding pseudocounts to the log-transform does not really impact these highly expressed genes. I also think that a per-gene approach would be more appropriate than a correction of the whole set, but that might be totally wrong.

Now I believe that using TPM values for comparing between samples is not a good approach, correct me if I'm wrong. The question is, can I normalize or scale the vst() counts to get a usable distribution around the same mean or median for these two datasets, and how would I robustly do that? (Ie, "pull up" the stabilized counts of set 2 to the levels of set 1.)

Thanks in advance!


deseq2 pROC ROCR glm • 238 views
Entering edit mode
Last seen 2 hours ago
United States

We have a way to "freeze" a VST, which is what you'd want to do rather than computing a new VST, there is some info in the help pages I think. But this is a little complicated, and then you'd want to share the parameters if someone wanted to reproduce your results.

Maybe simpler would just be to find a pc pseudocount for normTransform which just performs log(x + pc) on normalized counts. You can find an appropriate pc sometimes by looking at vsn::meanSdPlot like we do in the workflow. pc=1 tends to be too small, sometimes 5 or 10 is better at stabilizing the trend.

Entering edit mode

Hi Michael, thanks for the rapid help!

I think my question might not have been clear enough; anyhow, the pseudocounts do next to nothing to the genes I am interested in for the ROC, probably because these are some of the most highly expressed genes in the set.

I updated my question accordingly, could you please have another look at it?

Entering edit mode

Sorry I don’t have any concrete suggestions, I think it’s maybe beyond just a question about VST or normalization in DESeq2 and more about batch effects across study and how to come up with classifies that generalize well. I think the top scoring pairs may be a relevant place for you to look but I don’t think DESeq2 can help here.


Login before adding your answer.

Traffic: 549 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6