Question: ROC curves using normalized sequencing data from 2 experiments
0
13 days ago by
sebastian.lobentanzer10 wrote:

Hi all, hi Michael,

we want to build receiver operating characteristics for patients from two separate cohorts (using gene expression from sequencing as predictor), and I was thinking about how to compare them in this context. I did vst() (varianceStablilizingTransformation()) on both of them and tried to use one dataset as training set ("pred_data") and the other set as the test set ("test_data") in the generation of the ROC, like so (count is the variance stabilized count):

fit_glm <- glm(event ~ count, pred_data, family=binomial(link="logit"))

glm_response_scores <- predict(fit_glm, test_data, type="response")


I chose the most highly DE gene in both datasets as the first example. However, the vst() output of data set 1 has slighty higher stabilized counts than the second one for this particular gene (median around 20 vs median around 16). Thus, the linear model based on data set 1 seems to be non-applicable to data set 2 and because of that the ROC curve is nonsense.

This is a scatter plot of the stabilized counts of the example gene in both data sets. Each dot represents an individual patient, the color is event/no event. I imagine that for a valid comparison, I would have to get these two distributions to approximately the same mean and sd. Adding pseudocounts to the log-transform does not really impact these highly expressed genes. I also think that a per-gene approach would be more appropriate than a correction of the whole set, but that might be totally wrong.

Now I believe that using TPM values for comparing between samples is not a good approach, correct me if I'm wrong. The question is, can I normalize or scale the vst() counts to get a usable distribution around the same mean or median for these two datasets, and how would I robustly do that? (Ie, "pull up" the stabilized counts of set 2 to the levels of set 1.)

Sebastian

deseq2 glm proc rocr • 50 views
modified 13 days ago • written 13 days ago by sebastian.lobentanzer10
Answer: ROC curves using normalized sequencing data from 2 experiments
0
13 days ago by
Michael Love24k
United States
Michael Love24k wrote:

We have a way to "freeze" a VST, which is what you'd want to do rather than computing a new VST, there is some info in the help pages I think. But this is a little complicated, and then you'd want to share the parameters if someone wanted to reproduce your results.

Maybe simpler would just be to find a pc pseudocount for normTransform which just performs log(x + pc) on normalized counts. You can find an appropriate pc sometimes by looking at vsn::meanSdPlot like we do in the workflow. pc=1 tends to be too small, sometimes 5 or 10 is better at stabilizing the trend.

Hi Michael, thanks for the rapid help!

I think my question might not have been clear enough; anyhow, the pseudocounts do next to nothing to the genes I am interested in for the ROC, probably because these are some of the most highly expressed genes in the set.

I updated my question accordingly, could you please have another look at it?