Question: ROC curves using normalized sequencing data from 2 experiments
gravatar for sebastian.lobentanzer
4 months ago by
sebastian.lobentanzer10 wrote:

Hi all, hi Michael,

we want to build receiver operating characteristics for patients from two separate cohorts (using gene expression from sequencing as predictor), and I was thinking about how to compare them in this context. I did vst() (varianceStablilizingTransformation()) on both of them and tried to use one dataset as training set ("pred_data") and the other set as the test set ("test_data") in the generation of the ROC, like so (count is the variance stabilized count):

fit_glm <- glm(event ~ count, pred_data, family=binomial(link="logit"))

glm_link_scores <- predict(fit_glm, test_data, type="link")

glm_response_scores <- predict(fit_glm, test_data, type="response")

I chose the most highly DE gene in both datasets as the first example. However, the vst() output of data set 1 has slighty higher stabilized counts than the second one for this particular gene (median around 20 vs median around 16). Thus, the linear model based on data set 1 seems to be non-applicable to data set 2 and because of that the ROC curve is nonsense.

stabilized counts of example gene in both data sets

This is a scatter plot of the stabilized counts of the example gene in both data sets. Each dot represents an individual patient, the color is event/no event. I imagine that for a valid comparison, I would have to get these two distributions to approximately the same mean and sd. Adding pseudocounts to the log-transform does not really impact these highly expressed genes. I also think that a per-gene approach would be more appropriate than a correction of the whole set, but that might be totally wrong.

Now I believe that using TPM values for comparing between samples is not a good approach, correct me if I'm wrong. The question is, can I normalize or scale the vst() counts to get a usable distribution around the same mean or median for these two datasets, and how would I robustly do that? (Ie, "pull up" the stabilized counts of set 2 to the levels of set 1.)

Thanks in advance!


deseq2 glm proc rocr • 106 views
ADD COMMENTlink modified 4 months ago • written 4 months ago by sebastian.lobentanzer10
Answer: ROC curves using normalized sequencing data from 2 experiments
gravatar for Michael Love
4 months ago by
Michael Love25k
United States
Michael Love25k wrote:

We have a way to "freeze" a VST, which is what you'd want to do rather than computing a new VST, there is some info in the help pages I think. But this is a little complicated, and then you'd want to share the parameters if someone wanted to reproduce your results.

Maybe simpler would just be to find a pc pseudocount for normTransform which just performs log(x + pc) on normalized counts. You can find an appropriate pc sometimes by looking at vsn::meanSdPlot like we do in the workflow. pc=1 tends to be too small, sometimes 5 or 10 is better at stabilizing the trend.

ADD COMMENTlink written 4 months ago by Michael Love25k

Hi Michael, thanks for the rapid help!

I think my question might not have been clear enough; anyhow, the pseudocounts do next to nothing to the genes I am interested in for the ROC, probably because these are some of the most highly expressed genes in the set.

I updated my question accordingly, could you please have another look at it?

ADD REPLYlink written 4 months ago by sebastian.lobentanzer10

Sorry I don’t have any concrete suggestions, I think it’s maybe beyond just a question about VST or normalization in DESeq2 and more about batch effects across study and how to come up with classifies that generalize well. I think the top scoring pairs may be a relevant place for you to look but I don’t think DESeq2 can help here.

ADD REPLYlink written 4 months ago by Michael Love25k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 215 users visited in the last hour