Question: Combining newer/older RNAseq data, batch correcting
22 months ago by
schrist110
University of Maryland
schrist110 wrote:

After performing an RNAseq analysis with 35 samples (10 controls, 25 cases), we've sequenced an additional 6 samples of a separate nature and would like to combine all 41 samples in a larger analysis. The design of the experiment is basically this:

Older data - 4 separate conditions (control, negative, intermediate, positive), 5 batches

Newer data - 1 completely different condition (diffuse), 1 batch

We know there are batch effects in the older data and would like to correct for them, but we are unsure of the best way to do so when combining all the data, because condition and batch are confounded in the newer data. The approach I've tried takes the old data, applies removeBatchEffect from the limma package, sets any negative values produced by the batch effect removal to zero, combines the old/new data, and then runs voom with just condition in the model. This seems to yield the desired results. However, the comparisons of the conditions to the controls within the older data differ greatly from our previous analysis (which simply included batch in the model). Unfortunately, we can't include batch in the model here because the new data's condition and batch are confounded. Would it be possible to model the older data through voom with just the batch factor, combine the old/new data, then model the complete set with the condition factor? I'm hoping for suggestions on the best approach to remove the batch effects of the older data while still maintaining the power to compare the newer data's condition to the older data's conditions.

Steve
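
For readers following along, the confounding described above can be checked directly on a design matrix. The sketch below uses base R only. The sample-to-batch layout (and the 8/8/9 split of the 25 cases) is hypothetical, since the thread does not give it; only the totals (35 old samples in 5 batches, 6 new samples in 1 batch) come from the question.

```r
# Hypothetical layout: the real assignment of samples to batches is not stated.
condition <- factor(c(
  rep(c("control", "negative", "intermediate", "positive"), times = c(10, 8, 8, 9)),
  rep("diffuse", 6)
))
batch <- factor(c(
  rep(paste0("B", 1:5), length.out = 35),  # old data: 5 batches, crossed with condition
  rep("B6", 6)                             # new data: a single batch
))

# Additive model with both condition and batch
design_full <- model.matrix(~ condition + batch)

n_coef   <- ncol(design_full)      # number of coefficients requested
rank_mat <- qr(design_full)$rank   # number that can actually be estimated

# The "diffuse" column and the "B6" column are identical,
# so the two effects cannot be separated.
same_column <- all(design_full[, "conditiondiffuse"] == design_full[, "batchB6"])

cat("coefficients:", n_coef, " rank:", rank_mat, " identical columns:", same_column, "\n")
```

A rank smaller than the number of coefficients means at least one coefficient is not estimable, which is why batch cannot simply be added to the model for the combined data.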

22 months ago by
Aaron Lun20k
Cambridge, United Kingdom
Aaron Lun20k wrote:

First off, let me say that this experimental design is very poor. But I guess you know that already.

Secondly, running removeBatchEffect will throw out the baby with the bathwater, given that your new/old batches are confounded with a condition of interest. I wouldn't expect you to detect many differences between the diffuse and other conditions - after all, you've explicitly removed all differences between the corresponding batches. You also overstate the residual d.f. of your model when you use batch-corrected data without having a batch blocking factor in the design matrix. This is not healthy (it overstates the precision of the variance estimates, coefficient estimates, etc.) and may contribute to the differences with your previous analyses.
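
As a rough illustration of the residual d.f. point, here is the arithmetic for the numbers in the question. This is a sketch under an assumption: that the batch correction on the old data estimated one coefficient for each of the 5 old batches beyond the first. The variable names are made up for the example.

```r
n_samples     <- 35 + 6   # old + new samples, from the question
n_conditions  <- 5        # control, negative, intermediate, positive, diffuse
n_old_batches <- 5        # batches in the older data

# What a condition-only model believes it has left over
claimed_df <- n_samples - n_conditions

# Degrees of freedom already spent estimating batch effects (assumed: batches - 1)
batch_df_used <- n_old_batches - 1

# Residual d.f. once the batch correction is accounted for
adjusted_df <- claimed_df - batch_df_used

cat("claimed:", claimed_df, " adjusted:", adjusted_df, "\n")
```

The downstream model has no way of knowing those d.f. were already used, so its variance estimates look more precise than they are.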

Finally, it may be possible to salvage something from this experiment by using duplicateCorrelation. Set up a design matrix based on the condition, and block on the batch in duplicateCorrelation. This assumes that your batches within the older data set are not confounded with the condition, allowing estimation of the correlation caused by the batches. It also assumes that any batch effect between the old and new data is comparable to that between batches within the older data - this may not be true, due to the time effect in the former that's not present in the latter, and can be checked roughly with an MDS plot. Afterwards, you can compare directly between diffuse and other conditions to identify DE genes. This approach accounts for the contribution of the batch effect to the differences between conditions without requiring it to be explicitly regressed out.
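
A minimal sketch of how the steps above might look in limma, for anyone wanting a starting point. The counts are simulated and the sample layout is hypothetical, since neither is given in the thread. Running voom a second time with the estimated correlation is a common pattern rather than something stated in this answer, and the two contrasts are just examples.

```r
library(limma)

set.seed(1)

# Hypothetical layout: 35 old samples in 5 batches, 6 new samples in 1 batch
condition <- factor(
  c(rep(c("control", "negative", "intermediate", "positive"), times = c(10, 8, 8, 9)),
    rep("diffuse", 6)),
  levels = c("control", "negative", "intermediate", "positive", "diffuse")
)
batch <- factor(c(rep(paste0("B", 1:5), length.out = 35), rep("B6", 6)))

# Simulated counts with a gene-specific batch effect (stand-in for real data)
n_genes   <- 500
batch_eff <- matrix(rnorm(n_genes * nlevels(batch), sd = 0.3), nrow = n_genes)
mu        <- 100 * exp(batch_eff[, as.integer(batch)])
counts    <- matrix(rnbinom(n_genes * length(batch), mu = mu, size = 10),
                    nrow = n_genes,
                    dimnames = list(paste0("gene", seq_len(n_genes)),
                                    paste0("s", seq_along(batch))))

# Design matrix based on condition only
design <- model.matrix(~ 0 + condition)
colnames(design) <- levels(condition)

# Estimate the within-batch correlation, blocking on batch
v      <- voom(counts, design)
corfit <- duplicateCorrelation(v, design, block = batch)

# Optional second pass so the voom weights also reflect the correlation
v      <- voom(counts, design, block = batch,
               correlation = corfit$consensus.correlation)
corfit <- duplicateCorrelation(v, design, block = batch)

# Fit with batch as a blocking factor, then compare diffuse to other conditions
fit   <- lmFit(v, design, block = batch,
               correlation = corfit$consensus.correlation)
contr <- makeContrasts(DiffuseVsControl  = diffuse - control,
                       DiffuseVsPositive = diffuse - positive,
                       levels = design)
fit2  <- eBayes(contrasts.fit(fit, contr))
top   <- topTable(fit2, coef = "DiffuseVsControl", number = 10)

# Rough visual check of how the new batch sits relative to the old ones
plotMDS(v, labels = as.character(condition))
```

With real data, `counts` would be replaced by the filtered and normalized count data (for example a DGEList), and the MDS plot is where a large old-versus-new separation would show up.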


Aaron,

Thanks for the quick reply! The experiment was originally only meant to encompass the older data, so we definitely realize the design is poor; we're just trying to see if there's a solution that leads to a legitimate comparison of the data. The more I looked at removeBatchEffect, the more I realized it's not a good solution for DE analysis and is more useful for visualizing data.

At first glance, your suggestion of using duplicateCorrelation looks like it will work! The batches in the older data are not confounded with condition. We may have to add a disclaimer about the time effect, but I will check that with the MDS plot and let you know. I'll delve a little deeper in the next few days and come back with some analysis.

Thanks again!

- Steve

Aaron,

I've compared your suggested method, which includes the new data, to the analysis of just the old data that included batch in the model. There are some differences in the number of DE genes, but they are not large, and the fold changes of the common genes are strongly similar. As for testing for a time effect, the MDS plot of the data shows the diffuse condition separated from the other conditions. However, I don't believe this is a time effect, given the nature of the DE genes, many of which are what we expected to be different. I think we will move forward with the analysis using this method. Thanks for your help!

Steve