Question

Variation between technical replicates (DESeq2, batch effects)

Nik dAK ▴ 10

Hi all,

I read a lot about testing on datasets with batch effects, but in all cases the effect is on biological replicates and not technical replicates.

Some quick definitions for clarity:

technical replicates: same sample, same RNA isolation and same library prep, just loaded twice on the machine (for read depth)

biological replicates: different samples (e.g. cell lines) of the same group (e.g. genotype) with different RNA isolations, but the same library prep and loading onto the machine

I noticed two different types of variation between technical replicates:

A) systematic/plane shift in PCA --> batch effect due to different sequencing runs (see image A below)

B) dispersed --> small random technical variability within one sequencing run (see image B below)

Normally one would expect the variation between technical replicates to be small and non-systematic in the PCA (B).

Now I have a case of a batch effect between technical replicates (A). The experiment was designed with 6 biological replicates (6 different samples for each group of interest, colored dots in the PCA) and 2 technical replicates for each biological replicate (samples connected by a line in the PCA). The technical replicates were sequenced on two different sequencing runs.

The general approach for (B) is to simply add the technical replicates together and then test between groups, as discussed in https://support.bioconductor.org/p/85536/.
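To make the summing step concrete, here is a minimal sketch in base R on made-up toy counts. The names `counts`, `sample_id` and `run` are hypothetical and the data are simulated, not taken from my experiment. As far as I understand, DESeq2's collapseReplicates() performs the same kind of column summing on a DESeqDataSet.

```r
# Hypothetical toy data: 4 genes, 3 biological samples, each sequenced on 2 runs.
set.seed(1)
sample_id <- rep(c("s1", "s2", "s3"), each = 2)
run       <- rep(c("run1", "run2"), times = 3)
counts <- matrix(
  rpois(4 * 6, lambda = 50), nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste(sample_id, run, sep = "_"))
)

# rowsum() sums rows by group, so transpose to sum the columns (libraries)
# that belong to the same biological sample, then transpose back.
collapsed <- t(rowsum(t(counts), group = sample_id))

# One column per biological sample; per-gene totals are unchanged.
collapsed

# On a DESeqDataSet `dds` with a `sample` column in colData (an assumption),
# the analogous call would be: collapseReplicates(dds, groupby = dds$sample)
```

The collapsed matrix then has one library per biological sample, which is what the between-group test should see.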

Now for the case of a batch effect between technical replicates (A), it gets a bit ambiguous for me.

X) Ignore the batch effect and simply sum up and test (I feel bad about this)

Y) Not merge the technical replicates together, but test using a design including the batch as covariate (~genotype+batch)

Z) Test the two runs of technical replicates individually and keep the intersection of significant genes (probably with a loss of power due to smaller library sizes and fewer samples per test)

Findings: Y results in a major increase in identified significant genes compared to X.

Which way would be the best to handle this situation? Will the fact that I am not summing up the technical replicates (Y) be a problem?

And how much technical variation (without a batch effect, case B) can be "ignored" before again proceeding with one of the approaches X,Y,Z?

Additional note: I also tried SVA, but noticed that the first surrogate variable corresponds exactly to the batch (sequencing run) covariate.

Thank you very much for your help!

EDIT (correct links) A-batcheffectPCA: https://ibb.co/VYpxnxh B-techvariancePCA: https://ibb.co/SQnWnjr

rnaseq deseq2 differential expression batch effect replicates • 2.8k views

updated 6.2 years ago by Michael Love 43k • written 6.2 years ago by Nik dAK ▴ 10

score 3 · Accepted Answer · 2019-09-23

There is a big problem with your approach "Y": it doubles the apparent sample size, but you don't have double the samples. Because the model treats each technical replicate as an independent sample, the across-sample differences appear more significant to the model than they truly are.

An extreme example:

> a <- rnorm(3)
> b <- rnorm(3)
> t.test(a, b)$p.value
[1] 0.7956494
> t.test(rep(a,each=50), rep(b,each=50))$p.value
[1] 0.01723657
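The same point can be checked with a small simulation, sketched here under made-up settings (3 units per group, 50 technical measurements per unit, technical noise much smaller than biological noise; all of these numbers are assumptions chosen for illustration). There is no true group difference, so the fraction of p-values below 0.05 estimates the false positive rate.

```r
set.seed(42)
n_sim   <- 2000  # number of simulated null experiments
n_units <- 3     # biological units per group
n_tech  <- 50    # technical measurements per unit
tech_sd <- 0.1   # technical noise, small relative to the biological sd of 1

unit <- rep(seq_len(n_units), each = n_tech)
p_pseudo <- p_means <- numeric(n_sim)

for (i in seq_len(n_sim)) {
  # Both groups come from the same distribution: any "hit" is a false positive.
  a <- rnorm(n_units)
  b <- rnorm(n_units)
  ya <- a[unit] + rnorm(n_units * n_tech, sd = tech_sd)
  yb <- b[unit] + rnorm(n_units * n_tech, sd = tech_sd)

  # Treat every technical measurement as an independent sample.
  p_pseudo[i] <- t.test(ya, yb)$p.value

  # Collapse to one value per biological unit, then test across units.
  ma <- as.vector(tapply(ya, unit, mean))
  mb <- as.vector(tapply(yb, unit, mean))
  p_means[i] <- t.test(ma, mb)$p.value
}

fpr_pseudo <- mean(p_pseudo < 0.05)  # far above the nominal 5%
fpr_means  <- mean(p_means < 0.05)   # close to or below the nominal 5%
c(pseudo_replicated = fpr_pseudo, collapsed = fpr_means)
```

Collapsing to one value per unit before testing keeps the error rate near nominal, which is the same reasoning behind summing technical replicates for count data.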

You could use random effect models to account for multiple technical measurements on the same unit (we only support fixed effects in DESeq2), and then test across units. Summing the technical replicates is a simpler approach that's available to you in DESeq2, so I would use X over Y and Z. I believe you could also use limma-voom's duplicateCorrelation() function in their RNA-seq pipeline to model the multiple technical samples.
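For the limma-voom route, a rough sketch could look like the following. It runs on simulated counts, and the variable names (`sample_id`, `genotype`, `batch`) and the layout (12 samples, 2 runs each) are assumptions for illustration only. The idea is to keep the technical replicates as separate libraries, put the run in the design, and treat the biological sample as the blocking factor.

```r
library(limma)
library(edgeR)

set.seed(7)
n_genes   <- 200
unit      <- rep(1:12, each = 2)                       # biological sample index
sample_id <- factor(paste0("s", unit))                 # blocking factor
genotype  <- factor(rep(c("WT", "KO"), each = 12), levels = c("WT", "KO"))
batch     <- factor(rep(c("run1", "run2"), times = 12))  # sequencing run

# Simulated counts: each sample has its own expression level per gene,
# shared by both of its technical replicates (Poisson noise on top).
base_mu <- matrix(rgamma(n_genes * 12, shape = 5, rate = 0.05), nrow = n_genes)
mu      <- base_mu[, unit]
counts  <- matrix(
  rpois(length(mu), mu), nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste(sample_id, batch, sep = "_"))
)

dge    <- calcNormFactors(DGEList(counts = counts))
design <- model.matrix(~ genotype + batch)

# voom and duplicateCorrelation are run twice so that the precision weights
# and the within-sample correlation estimate can inform each other.
v      <- voom(dge, design)
corfit <- duplicateCorrelation(v, design, block = sample_id)
v      <- voom(dge, design, block = sample_id,
               correlation = corfit$consensus.correlation)
corfit <- duplicateCorrelation(v, design, block = sample_id)

fit <- lmFit(v, design, block = sample_id,
             correlation = corfit$consensus.correlation)
fit <- eBayes(fit)
topTable(fit, coef = "genotypeKO", number = 5)
```

The estimated consensus correlation tells the model how similar two libraries from the same sample are, so the genotype test is not driven by counting each sample twice.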