Hi,
I am new to DESeq and am wondering if I can input my raw RNAseq count data, perform all the normalizations and pre-processing as per DeSeq, but not run the negative-binomial GLM. I'm interested in instead outputting the continuous RNAseq data (that accounts for e.g., the library size, composition differences, dispersion etc) and running a different regression model that's not a GLM. I have come across numerous posts on using VST in context of DeSeq-WGCNA which are a similar idea but I have a few questions:
1. Would the following steps be sound and sufficient to get normalized, pre-processed, continuous RNAseq data?
# Step 1. Save model in deseq
dds_adj <- DESeqDataSetFromMatrix(countData = RNA,
colData = design_formula,
design = ~ covar1 + as.character(covar2) + condition)
# Step 2. Pre-filter low count RNAs
dds_adj <- estimateSizeFactors(dds_adj)
smallestGroupSize <- ncol(dds_adj)*0.10 # 10% of total sample size (N)
keep_1 <- rowSums(counts(dds_adj, normalized=TRUE) >= 1 ) >= smallestGroupSize # cut off is 1 cpm here
dds_adj <- dds_adj[keep_1,]
# Step 3. VST the data to take care of 1. estimating and normalizing for size factor (library composition) and 2. dispersion
vst<- varianceStabilizingTransformation(dds_adj, blind = FALSE)
norm.data <- assay(vst)
- I would like to explore different conditions (e.g., continous and categorical age) but when I consider them in Step1, each leads a slightly different
norm.data
. I'd like a single, finalnorm.data
. By reading the WGCNA issues, my understanding is that settingblind = T
in vst() or setting thedesign = ~1
could help but that comes with repurcussions. Would it be appropriate todesign = ~ covar1 + as.character(covar2)
and ignoring the main condition of interest altogether and why? Also, by using a design that includes covariates, it doesn't mean that the RNAseq data is now adjusted for these covariates and I still do need to control for them downstream, right?
Thank you so much!