Hello all,
I am curious if the RUVg normalization method (ERCC spike in normalization) is applicable to transcript-level counts for differential transcript expression, I know RUVseq was developed for gene-level counts but it seems like it would still be OK for transcript level count analysis?
My idea is to use Kallisto + Sleuth for DTE analysis coupled with RUVg spike in normalization since my samples are from different RNA library protocols (Poly A+, Poly A-, total RNA).
so = sleuth_prep(s2c_tmp, transformation_function = function(x) log2(x+0.5))
Expr = so$obs_norm %>% dplyr::select(target_id, sample, est_counts)
Emat = tidyr::spread(Expr, key = sample, est_counts)
## need to round est_counts to integer for RUVseq
Emat = round(Emat)
# Filter transcripts
filter <- apply(Emat, 1, function(x) length(x[x > 5]) >= 2) ##remove non-expressed transcripts
filtered <- Emat[filter,]
spikes <- rownames(filtered)[grep("^ERCC", rownames(filtered))]
x <- as.factor(s2c_tmp$condition)
set <- newSeqExpressionSet(as.matrix(filtered),
phenoData = data.frame(x, row.names=colnames(filtered)))
## USE RUVg to normalize samples based on control spikeins
set1 <- RUVg(set, spikes, k=1)
## get normalization factors
wts = pData(set1)[ , 2:ncol(pData(set1))]
so$sample_to_covariates$W1 = wts
## Use W1 (RUVg factors) as batch
so <- sleuth_fit(so, ~ W1 + condition, 'full')
so <- sleuth_fit(so, ~ W1, 'reduced')
## Wald test for Log 2 Fold-change
so = sleuth_wt(so, "conditionX")
Does this approach/ code seem reasonable?
Thanks for your help Davide, just wanted to post a quick follow-up if this may help others. I checked the p-value distributions and sleuth + RUVg does appear uniform with a spike a 0, so it does appear to work. I also tried your second approach, solely including the protocol as a covariate, similar p-value distribution, however i found about 25% more differentially expressed transcripts (qval < 0.05) than the Sleuth + RUVg method, so i went with this approach.