Batch effects between controls
llo ▴ 10
@llo-13602

Hello, I am working with data that I downloaded from the SRA database. All of the samples are from the same stage, library preparation, and species. However, when I make a PCA plot of the data, the samples do not align very well, even though I am using the same reference genome and annotation. How would I correct for this batch effect? I have tried RUVSeq's upper-quartile normalization, but it does not seem to change anything. I have not tried using "negative control genes" or housekeeping genes.

I also have both single-end and paired-end data. How do I correct for the batch effects between the two? Thank you.

batch-effect ruvseq • 2.5k views
davide risso ▴ 980
@davide-risso-5075
University of Padova

Hi Ilo,

It is expected that upper-quartile normalization will not handle batch effects as it is only a global scaling normalization and is not related to the RUV method.
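To make the distinction concrete, here is a minimal numpy sketch (not RUVSeq code; the simulated counts, the 4x batch shift, and the helper name `upper_quartile_scale` are all assumptions made up for illustration). It rescales every sample by a single factor, which removes sequencing-depth differences but leaves a gene-specific batch shift untouched:

```python
import numpy as np

def upper_quartile_scale(counts):
    """Rescale each sample (column) so that all samples share the same
    upper quartile, computed over genes with at least one read."""
    counts = np.asarray(counts, dtype=float)
    expressed = counts.sum(axis=1) > 0
    uq = np.percentile(counts[expressed], 75, axis=0)  # one value per sample
    return counts * (uq.mean() / uq)

# Hypothetical data: 200 genes, 4 samples (0-1 in batch A, 2-3 in batch B).
rng = np.random.default_rng(0)
base = rng.uniform(50, 500, size=200)
depth = np.array([1.0, 2.0, 1.0, 2.0])   # sequencing-depth differences
counts = np.outer(base, depth)
counts[:20, 2:] *= 4.0                   # batch effect hitting only 20 genes

scaled = upper_quartile_scale(counts)

# Depth differences within a batch disappear after scaling ...
same_batch_ok = np.allclose(scaled[:, 0], scaled[:, 1])
# ... but affected genes are still shifted relative to unaffected genes.
ratio_affected = scaled[0, 2] / scaled[0, 0]
ratio_unaffected = scaled[100, 2] / scaled[100, 0]
print(same_batch_ok, ratio_affected / ratio_unaffected)
```

Because one scaling factor per sample cannot change the ratio between two genes within that sample, the relative shift of the affected genes survives the normalization.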

I suggest that you read carefully the RUVSeq vignette if you want to use RUV to try and adjust for batch effects. An alternative approach would be to use the sva package. It's a good idea to read both vignettes and see if these methods can help.

You said you want to use the RUV method, but you haven't tried using the negative controls. That is the main point of RUV: Using negative controls to estimate the batch effects. So you cannot use RUVSeq without using negative control genes. Again, the vignette is pretty clear on how to use the RUVSeq package, I suggest that you start from there.

It may also be useful to read the RUVSeq and svaseq papers, as they make clear the difference between adjusting for sequencing depth (what upper-quartile does) and removing batch effects.

Best,
Davide

 


Thank you for your reply. I will try using negative control genes, but the vignette seems to show only how to use spike-ins rather than specific genes. I have a list of potential control genes but no spike-ins. Do you know how I can use a list of gene names as negative control genes?


I'm not sure I understand your question. The same way you specify the names of the spike ins, you can specify the names of the endogenous genes that you want to use as negative controls. Section 2.4 of the vignette uses endogenous genes as negative controls.
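As a rough illustration of the idea (a simplified numpy sketch in the spirit of RUV with control genes, not the RUVSeq implementation or its API), controls can be picked by name, used to estimate the unwanted factor, and that factor regressed out of every gene. The function name `ruv_g_sketch`, the gene names, and the simulated data are hypothetical, and the simulation contains no biological signal at all:

```python
import numpy as np

def ruv_g_sketch(log_expr, gene_names, control_names, k=1):
    """log_expr: genes x samples matrix on the log scale.
    Returns (W, corrected): W holds k estimated unwanted factors
    (samples x k); corrected is log_expr with those factors removed."""
    index = {g: i for i, g in enumerate(gene_names)}
    ctl = [index[g] for g in control_names if g in index]
    if not ctl:
        raise ValueError("none of the control names match gene_names")
    # Center each gene across samples.
    centered = log_expr - log_expr.mean(axis=1, keepdims=True)
    # SVD of the control genes only (samples x controls).
    U, s, Vt = np.linalg.svd(centered[ctl].T, full_matrices=False)
    W = U[:, :k]
    # Regress every gene on W and subtract the fitted part.
    alpha, *_ = np.linalg.lstsq(W, centered.T, rcond=None)  # k x genes
    corrected = log_expr - (W @ alpha).T
    return W, corrected

# Hypothetical example: 100 genes, 6 samples, two batches, no biology.
rng = np.random.default_rng(1)
names = [f"gene{i}" for i in range(100)]
controls = names[:30]                    # assumed negative controls
batch = np.array([0, 0, 0, 1, 1, 1], dtype=float)
loadings = rng.normal(1.0, 0.2, size=100)
log_expr = (rng.normal(5, 1, size=(100, 1))
            + np.outer(loadings, batch)
            + rng.normal(0, 0.05, size=(100, 6)))

W, corrected = ruv_g_sketch(log_expr, names, controls, k=1)

def batch_gap(m):
    """Mean absolute difference between batch means, over non-control genes."""
    return np.abs(m[30:, 3:].mean(axis=1) - m[30:, :3].mean(axis=1)).mean()

print(batch_gap(log_expr), batch_gap(corrected))
```

The only place the gene list enters is the `ctl` index lookup, so swapping spike-in names for endogenous gene names changes nothing else in the procedure.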


Thank you, I completely missed Section 2.4. It does exactly what I need.

 

@james-w-macdonald-5106
United States

A PCA plot simply shows you the largest differences between samples, so 'not aligning well' can mean more than one thing. For example, it may be that there is lots of technical variability that is obscuring the biological differences between your samples. But this is a matter of degree!

If you have really large changes between samples for a lot of genes, but even larger technical variability due to batches or whatever, then the technical variability can obscure the biological variability (which then usually shows up only in the higher-numbered principal components). In this case, using something like RUVSeq or svaseq from the sva package can help control for the unwanted technical variability.

However, if you have consistent but real differences between samples in just a few genes, then the 'normal' variability that one might expect is often predominant in a PCA plot. This (IMO) doesn't necessarily mean you have to do something to 'fix' the data. With any adjustment to the data you always run the risk of capturing some of your real biological variability with a surrogate variable, thereby reducing your ability to see the real changes that exist.

My point is that there is no free lunch here. Any adjustment you make to fix perceived faults in your data may well erase real signal. So I usually try to figure out if I really do have a problem, and if I can identify the source of the problem first.
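One way to check whether there is a problem and where it comes from is to see which known covariate each principal component tracks. The following is a minimal numpy sketch under made-up assumptions (simulated log-expression with a large batch effect on all genes and a treatment effect in ten genes; `pca_scores` is a hypothetical helper, not part of any package mentioned here):

```python
import numpy as np

def pca_scores(log_expr, n_components=2):
    """PCA of samples via SVD. log_expr is genes x samples.
    Returns sample scores and the fraction of variance per component."""
    X = log_expr.T - log_expr.T.mean(axis=0)   # samples x genes, gene-centered
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2 / (s ** 2).sum())[:n_components]
    return scores, explained

# Hypothetical data: 300 genes, 6 samples.
rng = np.random.default_rng(3)
batch = np.array([0, 0, 0, 1, 1, 1], dtype=float)
condition = np.array([0, 1, 0, 1, 0, 1], dtype=float)
log_expr = rng.normal(0, 0.1, size=(300, 6))
log_expr += np.outer(rng.normal(0, 1.0, size=300), batch)  # batch hits all genes
log_expr[:10] += 1.5 * condition                           # biology in 10 genes

scores, explained = pca_scores(log_expr)
pc1_vs_batch = abs(np.corrcoef(scores[:, 0], batch)[0, 1])
pc1_vs_condition = abs(np.corrcoef(scores[:, 0], condition)[0, 1])
print(np.round(explained, 2), round(pc1_vs_batch, 2), round(pc1_vs_condition, 2))
```

If the leading component lines up with a known technical covariate, that points to a batch problem with an identifiable source. If it lines up with nothing you recorded, adjusting the data blindly is riskier.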

As to correcting for SE and PE data, if they were run in separate batches, then you would simply fit a batch effect in your model. (You seem to imply that these data were all run together, although I am inferring that from you saying 'the same stage, library preparation, and species', which may not mean what I think.) But it is pretty uncommon in my experience for samples to be run using the same library preparation, but sequenced differently.
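Fitting a batch effect just means adding a batch column to the design matrix, which in R model-formula terms is something like `~ batch + condition`. A minimal numpy sketch for a single gene follows, with invented effect sizes (treatment 1.0, batch 2.0) and ordinary least squares standing in for whatever model the actual analysis package fits:

```python
import numpy as np

# Hypothetical layout: 8 samples, treatment balanced across two batches
# (batch 0 = single-end run, batch 1 = paired-end run).
rng = np.random.default_rng(2)
condition = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = 5.0 + 1.0 * condition + 2.0 * batch + rng.normal(0, 0.1, size=8)

def fit(design, y):
    """Ordinary least squares; returns coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(((y - design @ coef) ** 2).sum())
    return coef, rss

ones = np.ones_like(condition)
coef_without, rss_without = fit(np.column_stack([ones, condition]), y)
coef_with, rss_with = fit(np.column_stack([ones, condition, batch]), y)

print("without batch:", np.round(coef_without, 2), round(rss_without, 2))
print("with batch:   ", np.round(coef_with, 2), round(rss_with, 2))
```

In this balanced toy design the treatment estimate is similar either way, but leaving batch out pushes the batch shift into the residuals, which inflates the error variance and weakens any test on the treatment coefficient. This only works when treatment and batch are not confounded, i.e. each batch contains both groups.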

Perhaps this is just a compilation of a bunch of different samples from different labs? If that is the case, you really shouldn't just be piling them all into one analysis. You would be better off doing separate analyses and then using something like the GeneMeta package to do a meta-analysis.


Ah, I meant that all the data sets I got off SRA were at the same stage, used the same sequencer, were from the same species, and had the same preparation (poly A selection). However, each data set was either single-end or paired-end. Since I specifically chose controls, I thought that the data sets should be similar to each other. That's why I'm testing for and trying to remove batch effects. I wanted a larger sample size that was at least somewhat consistent across data sets.


Well, if you have both controls and treated (or whatever) from both SRA data sets, and the treatment is the same, then you might be able to put it all together and analyze using a batch effect. However, if the variability for one data set is much different from the other (quite possible with PE vs SE reads, since PE reads are by definition a bit easier to accurately align), then you would probably be better off doing the two analyses separately and then doing a meta-analysis.
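To show what combining separate analyses can look like, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis per gene. This is a deliberately simpler technique than what a dedicated meta-analysis package such as GeneMeta provides, and the effect sizes, standard errors, and the function name `fixed_effect_meta` are invented for illustration:

```python
import numpy as np

def fixed_effect_meta(effects, std_errors):
    """Inverse-variance weighted combination across studies.
    Both inputs are studies x genes arrays (e.g. log fold changes and
    their standard errors from each separately analyzed data set)."""
    effects = np.asarray(effects, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    weights = 1.0 / se ** 2
    combined = (weights * effects).sum(axis=0) / weights.sum(axis=0)
    combined_se = np.sqrt(1.0 / weights.sum(axis=0))
    return combined, combined_se, combined / combined_se

# Hypothetical per-gene results: row 0 = SE data set, row 1 = PE data set.
effects = [[1.0, 0.2],
           [1.4, -0.1]]
std_errors = [[0.4, 0.3],
              [0.2, 0.3]]

combined, combined_se, z = fixed_effect_meta(effects, std_errors)
print(np.round(combined, 3), np.round(combined_se, 3), np.round(z, 2))
```

Each data set keeps its own variance estimate, so the noisier one is automatically down-weighted instead of being forced to share a common variance with the other.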

 
