Question

[DESeq2] Normalization of External Test Dataset for Machine Learning (ML) Application

0

micolak0115 • 0

@4bea8831

Last seen 17 months ago

South Korea

Dear Community,

My current research involves developing an ML-based predictor model, for which I have chosen DESeq2 for normalization. I would appreciate any advice regarding some challenges I am facing.

In my study, I trained the model on RNA from blood samples of healthy donors (which has been validated by an additional healthy cohort). I then tested the model using RNA from virally infected patients to quantify the degree of change.

For normalization, I used all the samples (both training and test) together in order to account for global RNA perturbations caused by infection, as suggested by our prior studies. Given this, I used all genes as "control genes," and normalizing only the healthy donors wasn't a viable option for me.

However, I am now encountering issues with using external datasets. Normalizing these datasets separately for testing, each with its own RNA composition, seems nonsensical. Alternatively, combining them with my current dataset and redoing the normalization would change the model (both for this and future data).

I would be very grateful for any suggestions to resolve this problem.

Cheers,

Alan

DESeq2 • 2.1k views

ADD COMMENT • link updated 17 months ago by Michael Love 43k • written 17 months ago by micolak0115 • 0

1

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 2 days ago

United States

Your question is off-topic for this support site. You might try over on biostars.org instead.

ADD COMMENT • link 17 months ago James W. MacDonald 68k

0

Sorry - I requested this to be posted here, I will answer.

ADD REPLY • link 17 months ago Michael Love 43k

0

Hi James,

From now on, I will try to post only relevant questions on the Bioconductor forum.

Thanks for the note.

Cheers,

Alan

ADD REPLY • link 17 months ago micolak0115 • 0

0

I requested the post here, so it's not a problem.

ADD REPLY • link 17 months ago Michael Love 43k

score 2 · Accepted Answer · 2024-09-05

2

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

DESeq2 has a way of fixing the reference pseudo-sample used for normalization.

In the estimateSizeFactors man page, you will see an argument called geoMeans. If you provide the geometric means of the original data (you can compute these with log -> rowMeans -> exp), they will be used as the reference when scaling the new data. You can leave the -Inf values from log in place: they turn into 0 after exp, which is correct (any row with a zero isn't used in the median ratio method).
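To make the arithmetic concrete, here is a minimal sketch of the median ratio calculation against a fixed reference, written in plain Python rather than R so it runs without any packages. It is not DESeq2's code: the function names `log_geo_means` and `size_factors` and the toy counts are made up for illustration, and DESeq2's own function may differ in details (for example, any further rescaling of the size factors), so check the man page for the actual behavior.

```python
import math
from statistics import median


def log_geo_means(counts):
    """Per-gene log geometric mean over the training samples.

    `counts` is a list of rows (genes), each a list of per-sample counts.
    A gene with any zero count gets -inf, mirroring log -> rowMeans in R.
    (Python's math.log(0) raises, so zeros are handled explicitly.)
    """
    result = []
    for row in counts:
        if any(c == 0 for c in row):
            result.append(float("-inf"))
        else:
            result.append(sum(math.log(c) for c in row) / len(row))
    return result


def size_factors(counts, log_geo):
    """Median-of-ratios size factor for each sample (column) of `counts`,
    computed against the fixed per-gene reference `log_geo`."""
    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        # Skip genes whose reference is -inf and genes with a zero count
        # in this sample; they carry no usable ratio.
        log_ratios = [
            math.log(row[j]) - lg
            for row, lg in zip(counts, log_geo)
            if math.isfinite(lg) and row[j] > 0
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical toy data: 4 genes x 2 training samples.
train = [
    [10, 10],
    [20, 20],
    [0, 5],    # contains a zero, so this gene is dropped from the reference
    [40, 40],
]
reference = log_geo_means(train)  # computed once and stored with the model

# A new external sample (4 genes x 1 sample), scaled against the stored reference.
external = [[20], [40], [7], [80]]
print(size_factors(external, reference))
```

The point of the design is that `reference` is computed once from the training data and then reused unchanged, so adding external samples later does not alter the scaling of the samples the model was trained on.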