Question

limma-voom WithQualityWeights and duplicateCorrelation in RNA-seq

2

Entering edit mode

BharathAnanth ▴ 80

@bharathananth-10049

Last seen 5.8 years ago

Hi

I have the following experimental set up. I have RNA-seq time series data that span (0h,2h,....22h) and am interested in temporal patterns of gene regulation. However, the data was constructed by sampling repeatedly two cultures that are (in time) 11h apart. In other words, samples 0,2,..10 and 12,14,...,22 come from two different cultures but maintained nearly identically.

Since I have a multi-level design, I decided to capture the correlation between the two sets of samples. In addition, I use quality weights and based on previous BioC posts, I iterated the duplicateCorrelation and voomWithQualityWeights twice.

I have the two following questions:

1. Is it reasonable to do both duplicateCorrelation and quality weights?

2. Should I use the same blocking I use for duplicateCorrelation also with Quality weights?

3. Can the consensus.correlation be negative? Depending on the design matrix I used I get either positive or negative correlation values and the latter gives an error in chol.default(V). What is recommended in this case?

4. I tried to test if the FDR control are reasonable, by shuffling the time labels and estimating the the number of false positives and I realised that actual FDR was much higher than that designed for?

Thank you in advance!

My code snippet:

# count matrix has dimension 9000 x 12
T <- 22
time <- seq(0,T,by=2)
block <- rep(c(1,2), each = 6)
in.phase <- cos(2*pi/T*time)
out.phase <- sin(2*pi/T*time)

y_harm <- calcNormFactors(y_harm)

design <- model.matrix(~in.phase + out.phase)

v_harm <- voomWithQualityWeights(y_harm, design)
corfit <- duplicateCorrelation(v_harm, design, block = block) 

v_harm <- voomWithQualityWeights(y_harm, design, block = block, correlation = corfit$consensus.correlation)

corfit <- duplicateCorrelation(v_harm, design, block = block) 

fit_harm <- lmFit(v_harm, design, correlation = corfit$consensus.correlation, block = block)
fit_harm <- eBayes(fit_harm, robust = robust, trend=TRUE)

voomwithqualityweights limma voom duplicatecorrelation • 2.1k views

ADD COMMENT • link updated 5.8 years ago by Aaron Lun ★ 28k • written 5.8 years ago by BharathAnanth ▴ 80

score 1 · Answer 1 · 2018-07-08

I will start by ignoring your questions, because I think you will find it much easier to use:

b <- factor(block)
design <- model.matrix(~b + in.phase + out.phase)

This is permissible as you're expecting oscillations in expression over time, which means that block is not completely confounded with the time points. A unique solution to the linear model can thus be found using information within each culture. You can then test for significant oscillations by dropping the last two coefficients.

By comparison, the accuracy of duplicateCorrelation's correlation estimate is dependent on the number of levels of the blocking factor. To illustrate, consider an experiment with 2 batches of 10 samples in each condition:

set.seed(0)
treatment <- factor(rep(1:2, each=20))
batch <- factor(rep(1:4, each=10))
design <- model.matrix(~treatment)

# Data generation with a batch effect.
dummy <- matrix(rnorm(10000 * length(treatment)), ncol=length(batch))
batch.effect <- matrix(rnorm(10000 * 4), ncol=4)
dummy <- dummy + batch.effect[,as.integer(batch)]

# True value should be 0.5, as both 'dummy' and 'batch.effect' where N(0, 1).
duplicateCorrelation(dummy, design, block=batch)$consensus

You only have two levels in your blocking factor, so you can expect the estimation of the correlation to be similarly inaccurate. This will manifest as anticonservativeness as correlations between samples are not fully considered.

So, long story short, make life easier for yourself and put block in your design matrix instead.

Now, onto your questions, most of which no longer matter.

1) Yes.

2) Yes.

3) In practice, yes, due to imprecise estimation of a small positive (or zero) correlation. In theory, probably not; I can't easily imagine how batch effects would manifest as true negative correlations, and it's not surprising that it yields errors downstream. If you get negative estimates of the correlation, it's either because you don't have enough information to estimate the correlation precisely; or the true correlation is very close to zero, such that you don't need to use duplicateCorrelation at all. It's hard to distinguish between these two possibilities when you only have two levels in your blocking factor. So, put block in the design matrix instead.

4) Your strategy is flawed for various reasons, but mostly because the null hypothesis is not true for most label permutations. I'll illustrate with a simple two-group example. Imagine I have two groups (A and B) of 10 samples, and I want to shuffle them into mixed groups to test whether my hypothesis tests control the type I error rate. This is only valid for permutations where there are equal numbers of samples from each original group in each mixed group. In any other permutation, the null hypothesis is not guaranteed to be true. For example, if I shuffled such that one group had 9 A's and 1 B, of course I would detect significant differences from the other group containing 9 B's and 1 A (assuming that there were DE genes between A and B in the first place). This means that significant genes cannot be interpreted as false positives.

Another reason is that the variances are inflated when you mix samples from different groups together and pretend that they are "replicates". In and of itself, this is not too much of an issue as the variances are only inflated for genes that are DE between groups - not great, but it should be mostly conservative and should not do too much harm. However, empirical Bayes shrinkage means that information is shared across different genes, which means that the inflated variances can increase the variance of other genes (and conversely, are themselves decreased). This can have unpredictable effects and give a misleading idea of the performance of the tests on real data.

I would suggest you read https://academic.oup.com/nar/article/43/7/e47/2414268, which discusses the limitations of permutation-based methods for hypothesis testing in 'omics contexts.