Question

DESeq2 ML Query

chris2.a.white • 0

@b9ada64c

Last seen 1 day ago

Australia

Hello A/Prof Love,

Hope you are well and looking forward to CSAMA in the Italian Alps!

After reading your comprehensive machine learning slides from BIOS 735 - Introduction to Statistical Computing (plus the HarvardX YouTube videos with Prof Rafael Irizarry), we were hoping that one of the ML examples would use differentially expressed genes from an RNA sequencing analysis (please point us in the right direction if it exists and we have accidentally missed it, sorry).

Essentially, we would like to use DESeq2 to identify differentially expressed genes, then build a binomial classifier (elastic net) from the top DE genes.

Project:

Cohort: 100 disease, 100 control
Specimen: Human plasma samples.
Pipeline: Nextflow/rnaseq (Salmon/STAR transcriptome alignment)

What count matrix/file from tximport or DESeq2 would be advised for an ML classifier? The emphasis is on a parsimonious set of genes that are robust and reproducible, with controlled uncertainty. (Is this the trillion dollar question? We only have a billion, lol.)

Also, if we supply a design matrix (including technical confounders to adjust degrees of freedom) to DESeq2, should we use this corrected matrix as values for the ML classifier (would this data be in the dds)? Or, should we be using a matrix that would have been transformed by vst or rlog?

Happy to post this to Bioconductor if you prefer; it started off as more of an ML query related to your BIOS 735 course.

Thanks in advance for any insight,

Chris

DESeq2 • 172 views

Updated 2 days ago by Michael Love 42k • written 9 days ago by chris2.a.white • 0

score 0 · Answer 1 · 2024-06-20

Michael Love 42k

@mikelove

Last seen 12 hours ago

United States

> What count matrix/file from tximport or DESeq2 would be advised for an ML classifier? The emphasis is on a parsimonious set of genes that are robust and reproducible, with controlled uncertainty. (Is this the trillion dollar question? We only have a billion, lol.)

The VST or scaled counts should be fine.

> Also, if we supply a design matrix (including technical confounders to adjust degrees of freedom) to DESeq2, should we use this corrected matrix as values for the ML classifier (would this data be in the dds)? Or, should we be using a matrix that would have been transformed by vst or rlog?

You could remove unwanted variation from the VST data using limma's removeBatchEffect, as shown in the vignette. I would not use the batch or design arguments in the ML case, so that it stays unsupervised preprocessing.
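For reference, the vignette's pattern for removing batch variation from VST data looks roughly like the sketch below. It runs on simulated data from `makeExampleDESeqDataSet`, and the `batch` column is a made-up assumption added only for illustration. No `design` argument is passed, so the disease/control labels are never seen by this step; whether to pass a batch factor at all in the ML setting is a judgment call this sketch does not settle.

```r
library(DESeq2)
library(limma)

set.seed(1)
# Simulated counts: 5000 genes x 12 samples (a stand-in for real data)
dds <- makeExampleDESeqDataSet(n = 5000, m = 12)
# Hypothetical batch labels, alternating A/B across samples
dds$batch <- factor(rep(c("A", "B"), times = 6))

# blind = TRUE ignores the design when estimating the transformation
vsd <- vst(dds, blind = TRUE)
mat <- assay(vsd)

# Remove the batch term from the transformed values;
# only the batch factor is supplied, no design matrix with class labels
mat_adj <- limma::removeBatchEffect(mat, batch = vsd$batch)
```

After this step the per-gene means of the two batches coincide, while the matrix keeps its genes-by-samples shape.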

9 days ago • Michael Love 42k

Thank you Mike.

Re: "The VST or scaled counts should be fine."

So either the DESeqTransform object following VST, or the txi object (txi <- tximport(files, type="salmon", tx2gene=tx2gene)) before "DESeqDataSetFromTximport" can be used?

Re: the last point you highlighted: if we do not use limma's removeBatchEffect (with batch and design arguments), might the flow hypothetically look like this?

```r
dds <- DESeqDataSetFromTximport(txi,
                                colData = coldata,
                                design = ~ batch)

vsd <- vst(dds, blind = FALSE)
```

Then use the transformed values within vsd in ML.

Also, we have a wide range of sequencing depth in our human plasma samples after UMI deduplication (2 million to 10 million). Would you recommend rlog as more appropriate, or should we stick with the vst given the cohort size? We don't mind the slowness (slow is smooth and smooth is fast), but we also note the comments in your guide under which transformation to choose.

Thanks for any comments you might have on that aspect.

Hope CSAMA went well.

2 days ago • chris2.a.white • 0

Scaled counts would be counts(dds, normalized=TRUE)
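As a minimal sketch of what that call involves, assuming simulated data from `makeExampleDESeqDataSet` in place of a real tximport-derived object: size factors have to be estimated before normalized counts can be extracted, and in this simple case the result is the raw counts divided by each sample's size factor. With a tximport-derived object the details can differ, since gene-length information may enter the normalization.

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 1000 genes x 6 samples (a stand-in for real data)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Size factors must exist before asking for normalized counts
dds <- estimateSizeFactors(dds)
scaled <- counts(dds, normalized = TRUE)

# The same values computed by hand: divide each column by its size factor
manual <- sweep(counts(dds), 2, sizeFactors(dds), "/")
```

Note that these scaled counts are not log-transformed, so a `log2(scaled + 1)` step or similar is often applied before modelling; that choice is an assumption here and not something the thread prescribes.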

> might the flow hypothetically look like this?

yes.

Stick with VST; it is the one we have preferred since the 2014 publication.
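Putting the thread together, the confirmed flow might be sketched as follows. Everything here is an assumption for illustration: the data are simulated with `makeExampleDESeqDataSet`, the `batch` column is invented, and `glmnet` with `alpha = 0.5` is just one way to fit the elastic net mentioned in the question.

```r
library(DESeq2)
library(glmnet)

set.seed(1)
# Simulated cohort: 5000 genes x 40 samples, with real group differences (betaSD = 1)
dds <- makeExampleDESeqDataSet(n = 5000, m = 40, betaSD = 1)
# Hypothetical batch labels, alternating so they are not confounded with condition
dds$batch <- factor(rep(c("A", "B"), times = 20))
design(dds) <- ~ batch

# Variance stabilizing transformation; the design contains batch only
vsd <- vst(dds, blind = FALSE)

# glmnet expects samples in rows and features in columns
x <- t(assay(vsd))
y <- vsd$condition

# Elastic net logistic regression with 5-fold cross-validation over lambda
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, nfolds = 5)

# Genes with non-zero coefficients at the more conservative lambda
b <- as.matrix(coef(cvfit, s = "lambda.1se"))[, 1]
selected <- setdiff(names(b)[b != 0], "(Intercept)")
```

In a real analysis, any step that uses the class labels, such as picking the top DE genes, would need to be repeated inside each cross-validation fold so that held-out samples do not influence feature selection.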

2 days ago • Michael Love 42k