Question

Using RUVSeq RUVs() to perform batch correction using technical replicates

0

Entering edit mode

enricoferrero ▴ 680

@enricoferrero-6037

Last seen 3.3 years ago

Switzerland

Ciao Davide,

I would be very interested in using RUVSeq::RUVs() as it appears to be the only published and packaged method for correcting batch effects using technical replicates profiled across batches.

However, I must be doing something wrong because I can't get it to work.

I have a non-ideal experimental design with two batches:

Batch 1 contains 40 samples with condition A + 12 samples with condition B
Batch 2 contains 40 samples with condition C + the same 12 samples with condition B

So, the 12 samples with condition B are technical replicates.

I'm interested in comparing the 40 samples with condition A from batch 1 with the 40 samples with condition C from batch 2. I'm not interested in condition B per se.

For the differential expression analysis, I'm using DESeq2.

This is what I'm doing:

# load libraries
library(RUVSeq)
library(DESeq2)

# create original DESeq2 dataset
dds_original <- DESeqDataSetFromMatrix(countData = counts_original, colData = samples, design = ~ 1)
vsd_original <- vst(dds_original)

# plot original PCA
pca_data <- plotPCA(vsd_original, intgroup = c("donor_id", "batch", "condition"), ntop = 500, returnData = TRUE)
percent_var <- round(100 * attr(pca_data, "percentVar"))
ggplot(pca_data, aes(PC1, PC2)) +
  geom_point(aes(colour = batch, shape = condition), size = 2, alpha = 0.6) +
  geom_line(aes(group = donor_id), alpha = 0.6) +
  xlab(paste0("PC1: ",percent_var[1],"% variance")) +
  ylab(paste0("PC2: ",percent_var[2],"% variance")) + 
  coord_fixed() +
  theme(aspect.ratio = 1)

The samples connected by a black line are the technical replicates in the two batches: Original PCA

# run RUVSeq::RUVs()
set_original <- newSeqExpressionSet(counts = as.matrix(counts_original), phenoData = AnnotatedDataFrame(samples))
set_corrected <- RUVs(set_original, cIdx = featureNames(set_original), k = 1, scIdx = makeGroups(set_original$donor_id))

# create corrected DESeq2 dataset
dds_corrected <- DESeqDataSetFromMatrix(countData = counts(set_corrected), colData = pData(set_corrected), design = ~ W_1)
vsd_corrected <- vst(dds_corrected)

# plot corrected PCA
# plot original PCA
pca_data <- plotPCA(vsd_original, intgroup = c("donor_id", "batch", "condition"), ntop = 500, returnData = TRUE)
percent_var <- round(100 * attr(pca_data, "percentVar"))
ggplot(pca_data, aes(PC1, PC2)) +
  geom_point(aes(colour = batch, shape = condition), size = 2, alpha = 0.6) +
  geom_line(aes(group = donor_id), alpha = 0.6) +
  xlab(paste0("PC1: ",percent_var[1],"% variance")) +
  ylab(paste0("PC2: ",percent_var[2],"% variance")) + 
  coord_fixed() +
  theme(aspect.ratio = 1)

Corrected PCA

So it looks like not much is happening after running RUVs. The two plots, while not identical, are extremely similar.

My expectation is that that the technical replicates would get closer to one another and that the black lines would almost disappear.

Am I missing something or doing something wrong? Can RUVSeq be used to achieve what I want here?

Thank you,

Enrico

RUVSeq ruvseq RUVs DESEq2 deseq2 • 2.7k views

ADD COMMENT • link updated 6.0 years ago by Gordon Smyth 52k • written 6.1 years ago by enricoferrero ▴ 680

score 1 · Answer 1 · 2019-02-18

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 15 hours ago

WEHI, Melbourne, Australia

I am having trouble making sense of your question because your PCA plots show a clear separation of the samples into two groups, but the groups are not associated with any of the variables in your experiment and you don't comment on this.

The groups I am referring to you are the obvious groups at the top and bottom of your plots. The groups are associated with PC2 but not with your conditions or batches. You can't begin to do any sensible analysis of this data until you identify what is the cause of the two groups. It is a massive effect that you don't seem to have addressed.

Alternatively, is it possible that the two groups are in fact the batch effect and you have simply mislabeled the batches and conditions?

ADD COMMENT • link 6.0 years ago Gordon Smyth 52k

0

Entering edit mode

I agree with Gordon, those two groups are indeed worrying, even more so as your technical replicates are all in the top group, so there's no hope that those will help you account for it...

As for your original question, increasing k in RUVs() should make the technical replicates closer to each other, but again, as Gordon pointed out, this is not the most important issue with the data.

ADD REPLY • link 6.0 years ago davide risso ▴ 980

0

Entering edit mode

Thanks both. The two groups have indeed not escaped my attention. There are several other variables in the experiment, both biological and technical, that can affect the PCs (e.g.: treatment, gender). I was planning to first account for the known batch effect using the technical replicates and then investigate what else might be driving these differences.

Davide, I've tried using higher values of k (1, 2, 5 and 10) and the result is always similar. What is the best way to estimate k? As I understand it represents the number of factors to be removed I picked 1 because I only wanted to remove 1 factor (i.e.: the batches).

I also tried using a subset of low-variance genes as the cIdx argument, again without observing any meaningful changes in the PCA. Finally, should I run betweenLaneNormalization() before RUVs()? Would you have any other suggestions?

Currently I'm considering proceeding as suggested by Gordon in his answer to my more general question here.

ADD REPLY • link 6.0 years ago enricoferrero ▴ 680