Question

RUVseq: How to implement to more complex time-course model?

0

Entering edit mode

heikki.sarin ▴ 10

@heikkisarin-13379

Last seen 4.3 years ago

We have quite complex model in the analysis and was wondering about the RUVseq argument "k". Should it be set to 1 or how many in this case below (the vignette isn't really helpful in modifying the example code) ? And is the below right way to implement the W_1 in the model of differential expression analysis? Or how should it be done?

#Ruvseq code

design <- model.matrix(~ time + condition + time:subject.nested + condition:time, data=pData(set))
y <- DGEList(counts=counts(set1), group=coldata$CASE_CONTROL)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=35:36)
top <- topTags(lrt, n=nrow(set1))$table
empirical <- rownames(set1)[which(!(rownames(set1) %in% rownames(top)[1:5000]))]

set2 <- RUVg(set, empirical, k=1)

#Code for differential expression

full <- model.matrix(~ W_1+ time + condition + time:subject.nested + condition:time, pData(set2))

reduced <- model.matrix(~ W_1 + time + condition + time:subject.nested, pData(set2))

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = counts(set2),colData = pData(set2), design = ~ 1 )

ddsFullCountTable <- ddsFullCountTable[ rowSums(counts(ddsFullCountTable)) > 1, ]
keep <- rowSums(counts(ddsFullCountTable) >= 5) >= 5
ddsFullCountTable <- ddsFullCountTable[keep,]
dds <- DESeq(ddsFullCountTable, test="LRT", full=full, reduced=reduced)

ruvseq deseq2 model.matrix confounders • 2.5k views

ADD COMMENT • link updated 7.6 years ago by davide risso ▴ 980 • written 7.6 years ago by heikki.sarin ▴ 10

score 0 · Answer 1 · 2017-08-10

Hi,

I'm sorry but from your code is hard to understand what you are trying to do. It seems that there are two different issues:

1. How to choose k.

2. How to compute empirical negative controls.

As for 1) we don't have any "magic" formula, the advice is to look at exploratory plots like RLE and PCA (as we do in the vignette) and try to figure out if the data are still influenced by unwanted variation after the correction. For instance, you could start with k=1 and see if that solves your problem (I'm assuming that you are using RUV because you observed some unwanted variation in the data... the same way you determined that the data were influenced by unwanted variation, you can now try and determine if unwanted variation is present after correcting with k=1). I can see that with a complicated design this might not be easy (or possible) but then the question is how did you determine that you need RUV?

Assuming that you do need RUV, even before settling on a value of k you have to choose which genes to use as negative controls. In general, we suggest to use a set of genes for which you know a priori that they should not change over time and condition. Often times such a list of genes exist because of prior knowledge in the lab or can be found by scanning the literature. A general list of housekeeping genes can also be used, and it works fairly well in a wide variety of cases. If none of this is available, then you can use the empirical control method, which is what you are doing in the first part of the code. However, I'm not sure that this code is correct. First of all, if you're using DESeq2 as your final model, why are you using edgeR to find the empirical controls? It seems that you would want to use the same model for identifying the controls and for testing for DE. Second, this line seems wrong:

lrt <- glmLRT(fit, coef=35:36)

Why are you testing only for the 35th and 36th coefficient? What contrast does this represent? As I said, this is a fairly complicated design and you need to choose how to define the empirical controls based on your experimental factors (for instance, selecting genes that do not change over time, and across conditions). In the final model you are only testing for the interaction, so maybe that's what the 35th and 36th coefficients represent? But I don't think it's correct to select the genes with no significant interaction as empirical controls, because you want your controls not to be influenced by _any_ biological factor, otherwise they're hardly negative controls! I would test against the null model (with just the intercept) in this case, but I'm not sure that empirical controls are the way to go with such a complex experimental design.