Hi,
I am working with colleagues to perform differential expression analyses using data that have been spiked with RNA from another species, the purpose of which is to get a sense of the absolute numbers of transcripts that are up and down between conditions. We add the spike-in at the cell stage, prior to the steps of library generation, so the working assumption is that any features or quirks of library generation are equally reflected in the sample and spike-in information.
My question is, "What is the appropriate way to use this spike-in information with DESeq2?" Is it appropriate to use the spike-in information with the DESeq2::estimateSizeFactors()
controlGenes
argument (e.g., including the thousands of genes for this other species in the counts matrix that is supplied to DESeq2, then supplying the other-species gene names as a vector to the controlGenes
)? When following this approach and subsequently running DESeq2::DESeq()
, I believe the genes for the other species remain in the counts matrix and are included in the tests for significance, multiple hypothesis testing, etc. So, we should somehow exclude these genes from the counts matrix in the DESeqDataSet object prior to running DESeq2::DESeq()
—is that correct?
Another thing we've considered is calculating and applying a simple sample-wise scaling factor prior to running DESeq2 as follows:
- We calculate a scaling factor for the experiment (EXP) condition:
x = (no. EXP sample reads)/(no. EXP spike-in reads)
- We calculate a scaling factor for the control (CTRL) condition:
y = (no. CTRL sample reads)/(no. CTRL spike-in reads)
- We take a counts matrix such and we divide the sample counts by the corresponding scaling factor, e.g.,
gene EXP CTRL
A 5 10
B 15 100
C 0 2
etc.
# The above becomes...
gene EXP CTRL
A 5/x 10/y
B 15/x 100/y
C 0/x 2/y
etc.
- We then use the adjusted counts matrix as input for DESeq2.
Does this seem appropriate to you? (However, I don't think DESeq2 accepts non-integer counts in the counts matrix.)
We'll greatly appreciate any input or advice you can give us. Thank you!
-Kris
Hi, I have a question barely related to this topic. Do you use a custom reference genome combining your organism and the one used as a spike-in? I think it's a common approach, but then if you only allow to map uniquely mapped reads, you are losing spike-in information, aren't you? Since spike-in mapping is only for quantitative purposes, I think it would be better to align to both genomes separately, allowing multimapping for spike-in, and then filter BAM files to exclude reads present in both of them. But I am not sure if I am missing anything, so any input or advice is welcome.
Thanks a lot in advance!
Eugenia