Question

How to use spike-in information (sequences from another species) with DESeq2::DESeq()

0

Entering edit mode

kalavattam ▴ 10

@kalavattam-21708

Last seen 9 months ago

United States

Hi,

I am working with colleagues to perform differential expression analyses using data that have been spiked with RNA from another species, the purpose of which is to get a sense of the absolute numbers of transcripts that are up and down between conditions. We add the spike-in at the cell stage, prior to the steps of library generation, so the working assumption is that any features or quirks of library generation are equally reflected in the sample and spike-in information.

My question is, "What is the appropriate way to use this spike-in information with DESeq2?" Is it appropriate to use the spike-in information with the DESeq2::estimateSizeFactors() controlGenes argument (e.g., including the thousands of genes for this other species in the counts matrix that is supplied to DESeq2, then supplying the other-species gene names as a vector to the controlGenes)? When following this approach and subsequently running DESeq2::DESeq(), I believe the genes for the other species remain in the counts matrix and are included in the tests for significance, multiple hypothesis testing, etc. So, we should somehow exclude these genes from the counts matrix in the DESeqDataSet object prior to running DESeq2::DESeq()—is that correct?

Another thing we've considered is calculating and applying a simple sample-wise scaling factor prior to running DESeq2 as follows:

We calculate a scaling factor for the experiment (EXP) condition: x = (no. EXP sample reads)/(no. EXP spike-in reads)
We calculate a scaling factor for the control (CTRL) condition: y = (no. CTRL sample reads)/(no. CTRL spike-in reads)
We take a counts matrix such and we divide the sample counts by the corresponding scaling factor, e.g.,

gene    EXP CTRL
A   5   10
B   15  100
C   0   2
etc.

#  The above becomes...
gene    EXP CTRL
A   5/x     10/y
B   15/x    100/y
C   0/x     2/y
etc.

We then use the adjusted counts matrix as input for DESeq2.

Does this seem appropriate to you? (However, I don't think DESeq2 accepts non-integer counts in the counts matrix.)

We'll greatly appreciate any input or advice you can give us. Thank you!

-Kris

DESeq2 SpikeIn Normalization • 1.3k views

ADD COMMENT • link updated 15 months ago by Michael Love 41k • written 15 months ago by kalavattam ▴ 10

score 0 · Answer 1 · 2023-01-23

When following this approach and subsequently running DESeq2::DESeq(), I believe the genes for the other species remain in the counts matrix and are included in the tests for significance, multiple hypothesis testing, etc. So, we should somehow exclude these genes from the counts matrix in the DESeqDataSet object prior to running DESeq2::DESeq()—is that correct?

I don't see any problem with including those other species genes really, but if you wanted to exclude them you could do:

estimateSizeFactors() with controlGenes
subset out the other species's genes using square bracket indexing [,]
DESeq()