Hi all,
I've assembled a transcriptome de novo, and now wish to carry out differential expression analysis using DESeq2. The RIN varies between samples for a variety of reasons, and by the nature of our study system we could not simply refine extractions until having RIN >8 for all samples.
I used my normal pipeline of sample processing followed by DE analysis with DESeq2 and got DE results with no issues, but because RIN can affect gene expression I also want to explore including RNA integrity number (RIN) in my DE model design. I've seen a few recommendations of how to do this on Bioconductor and Biostars, but I could not determine what the consensus was of the best way to do so (if such a consensus even exists).
I've trialled the impact of a couple of different methods by visualising the impact on PCA plots after removing the effect with limma::removeBatchEffect, like so:
1) Including RIN in the model design as a numerical vector
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~RIN+Group)
dds <- DESeq(dds)
colData(dds)
vsd <- vst(dds, blind=FALSE)
Group <- factor(colData(vsd)$Group)
RIN <- samples$RIN
design0 <- model.matrix(~Group)
vsd2 <- vsd # backup
assay(vsd2) <- limma::removeBatchEffect(assay(vsd), covariates = vsd$RIN, design=design0) #include RIN as a numerical covariate
plotPCA(vsd2, "Group", ntop=70128)
2) Splitting the vector of RINs into a factor with five bins
vsd3 <- vsd # backup. I would also re-run DESeq(dds) with the proper new design.
RIN_cut <- cut(RIN, breaks = 5) #split the RINs into a factor with 5 bins
RIN_cut
[1] (8,8.6] (7.4,8] (8,8.6] (7.4,8] (8,8.6] (5.6,6.2] (5.6,6.2] (6.8,7.4] (5.6,6.2] (8,8.6]
Levels: (5.6,6.2] (6.2,6.8] (6.8,7.4] (7.4,8] (8,8.6]
assay(vsd3) <- limma::removeBatchEffect(assay(vsd), batch = RIN_cut, design=design0) #remove RIN as a 'batch'
plotPCA(vsd3, "Group", ntop=70128)
(apologies for the size of the plots)
By eyeballing the two PCAplots, the second option including RIN as a factor looks to have done a better job at separating the two groups. However, I don't know whether this is an artificial separation, i.e. whether the treatment of RINs is somehow biasing the results.
Does anyone have any suggestions about which of the two treatments is better, or even have a suggestion for a better treatment? Without any treatment of RINs, the PCA looks very similar to Option 1 above, only with PC1 contributing 47% (vs 50% for Opption 1 above)..
Thanks, Charles
Thanks for your advice. I've done a bit of digging to read further on SVA. Instead of trying to "guess" (model) the impact of RNA degradation by using RIN, I'll just use SVA to account for hidden surrogate variables. Following the code from the DESeq2 vignette:
I'm not overly worried about RIN having a big impact on my results considering there are no clear trends of samples clustering based on RIN in a PCA plot of vsd counts (without taking RIN into account). It's just good to know the right way to go about these analyses. Cheers!