I have a data set, where we assume one of the conditions to have transcriptional amplification toward a set of genes. For that reason I decide to test the data with the RUVSeq package.
I have ran the analysis as explained in the vignette and calculated the estimated factors of unwanted variation. I have got the following factors
Mix 1 and Mix 2 are supposed to be different in the two samples, so I'm not surprised that you capture the group difference with the first factor of UV based on them. You can either use only group B or subtract the expected fold-change from the expression matrix to use all the spike-ins. From our paper:
"Interestingly, one can relax the negative control gene assumption by requiring instead the identification of a set of Jc positive or negative controls, for which the value of βc is known a priori but need not be zero. Then, Xβc is known and one can perform the singular value decomposition of logYc − Xbc − Oc to estimate W as in step 3 of RUVg above. Steps 4 and 5 remain the same. This allows us to make full use of all 92 ERCC spike-in controls for the SEQC data set. "
I have been struggling with this paragraph in the paper for sometime, I am still unsure how it is possible to use positive controls. How/where does one specify the expected values for the fold changes.
Let me omit the offset for simplicity. The model for the control genes becomes log Y_c ~ X b_c + W a_c.
If b_c is known (and different from zero), you can estimate W a_c by SVD on the matrix log Y_c - X b_c.
In R, it would be something like:
logY <- log1p(Y)
logY[controls,] <- logY[controls,] - X * b_c[controls]
Assuming that Y is the full matrix of counts, controls is an indicator of the positive controls and X is a one dimensional indicator variable (i.e., two-class comparison).
the reason you get these results is because your factor of UV is almost perfectly correlated with the biology: all negative values for your controls and all positive values for your "strained" samples. So you're testing and correcting for basically the same variable, and this causes troubles in the DE test.
Either your negative controls are not really negative controls and capture the biological difference between your samples, or the unwanted variation is truly perfectly correlated with the biology. If the latter is true, than I don't think RUV can help you here.
It looks like the expression of the spike-ins is different in the two groups. Are you using the same mix for both controls and treated samples, or are you using Mix 1 vs Mix 2?
If you're using the same mix, it might be worth looking at the distribution of the spike-in reads across samples (e.g., box plots, MAplots, ...) as well as plotting the first principal components of the spike-in expressions. If you see the samples cluster by biology, removing the variation inferred from the spike-ins will likely remove your signal of interest.
I am using a mix1-mix2 mix of the ERCC spikeIns. I have tested this and it came out quite good for the two conditions. Where I wasn't sure how to proceed was whether or not to use only group B of the spikeIn mix, where in both samples I have the same concentration. But as far as I understand it from your above explanation, this will also remove the signal of interest I'm looking for.
I have been struggling with this paragraph in the paper for sometime, I am still unsure how it is possible to use positive controls. How/where does one specify the expected values for the fold changes.
Let me omit the offset for simplicity. The model for the control genes becomes log Y_c ~ X b_c + W a_c.
If b_c is known (and different from zero), you can estimate W a_c by SVD on the matrix log Y_c - X b_c.
In R, it would be something like:
Assuming that Y is the full matrix of counts, controls is an indicator of the positive controls and X is a one dimensional indicator variable (i.e., two-class comparison).