Question

Fundamental questions about DESeq2 and RUVs in targeted gene set of 500

1

Entering edit mode

grashow ▴ 10

@grashow-13014

Last seen 6.7 years ago

Hi,

I am brand new to bioinformatics and differential expression analysis, trying to assimilate a lot of new technical and conceptual information. I am working with a dataset that used BioSpyder’s TempO-seq technology to analyze a cancer cell line using a targeted gene set of 500, where 10 chemicals were tested under two conditions. Biospyder has provided the raw read counts for each of the 500 genes. There were no housekeeping genes in our dataset, but we do have biological replicates under control and treated conditions.

My questions are: should we be using any particular normalization options in DESeq2, given that there is no group of genes for which we expect no change at all? In other words, what are the implications of using a targeted gene panel when there all genes could possibly show altered expression? From my read of the 2014 Love et al. paper, the approach to evaluating dispersion by looking across all samples should address this in our targeted set, but I may be missing a finer point.

Secondly, RUVseq has been recommended to adjust for unwanted technical effects that may be present in these data. I assume that means we’d be using the RUVs approach described in Risso et al., although I’m unsure whether it would be better to use the empirical control genes approach (2.4 in the RUVseq vignette). From my reading of the Bioconductor postings, we would do RUVseq first to compute size factors, and then use that information in DESeq2. Is that correct? Can someone provide a simple explanation of what a size factor is? I’ve read through the vignette and other Bioconductor questions but haven’t found an answer that makes sense to me.

Lastly, we do not have lane information for the TempO-seq data, but we do have the plate information. Should/could that be used similar to the betweenLaneNormalization in RUVseq?

Apologies that these are very basic questions. Thanks to all,

Rachel

deseq2 ruvseq • 2.4k views

ADD COMMENT • link updated 7.0 years ago by davide risso ▴ 950 • written 7.0 years ago by grashow ▴ 10

score 3 · Answer 1 · 2017-05-11

Hi,

I will let Mike chime in for the DESeq2 specific questions, but here are my two cents.

> should we be using any particular normalization options in DESeq2, given that there is no group of genes for which we expect no change at all? In other words, what are the implications of using a targeted gene panel when there all genes could possibly show altered expression?

This is a difficult question and it's likely that only through some exploratory data analysis you will be able to find an indication on how to proceed. I suggest that you run a few exploratory plots, such as RLE plots, MA plots and PCA / MDS to get a feel of how the data "look like".

The EDASeq package (and its vignette) is likely a good place to start for these plots, but I believe that many of the same ideas are implemented in the DESeq2 package.

> Secondly, RUVseq has been recommended to adjust for unwanted technical effects that may be present in these data. I assume that means we’d be using the RUVs approach described in Risso et al., although I’m unsure whether it would be better to use the empirical control genes approach (2.4 in the RUVseq vignette).

RUVSeq does make the assumption that you can identify some genes that do not change across conditions. If this is not the case, I'm afraid that this won't be a good approach for this dataset. I'm not familiar with the technology that you used, but I'm guessing that this platform doesn't include any control spike-in / housekeeping genes, correct?

Both RUVs and RUVg with empirical controls assume that you have some "negative control" genes, i.e., genes that do not change across conditions. Using the empirical control approach in the absence of truly negative controls will likely just remove (some) biological signal from your data. RUVs seems more robust to the misspecification of negative controls, so I would tend to suggest that approach (with a very low k to minimize the chance of removing biologically meaningful variation). Again, looking at EDA plots should give you a rough idea if it's reasonable to assume that at least a handful of genes don't change.

> From my reading of the Bioconductor postings, we would do RUVseq first to compute size factors, and then use that information in DESeq2. Is that correct? Can someone provide a simple explanation of what a size factor is? I’ve read through the vignette and other Bioconductor questions but haven’t found an answer that makes sense to me.

The terminology is tricky, but you're generally correct. You would compute what we call "unwanted variation factors" and then add them as covariates in the design matrix in DESeq2. The "size factors" are offsets that can be included in the model to account for the "library size" (i.e., how many reads were sequenced per sample -- although the DESeq2 size factors are a more robust version of this, this is intuitively what they correct for) and hence to make the samples comparable to each other.

Unfortunately, size factors too can be confounded with biology when the majority of your genes is differentially expressed (unless they are roughly symmetrically distributed between up- and down-regulated). Again, the only way to understand if the size factors are OK is to look at some exploratory plots: e.g., look at the size factors between conditions and see if there is any systematic difference between conditions.

> Lastly, we do not have lane information for the TempO-seq data, but we do have the plate information. Should/could that be used similar to the betweenLaneNormalization in RUVseq?

The "betweenLaneNormalization" is just a poorly named function that perform between-sample normalization, independently of the fact that they are organized in lanes, plates, etc. It is used in RUVSeq just to make sure that the factors of unwanted variation don't capture the difference in library size that should be captured by the size factors. (I understand that the jargon here is quite confusing! :) Please, let me know if anything is unclear and I will try to explain myself better!)

I would start by exploring how the DESeq2 size factors look like, and how the PCA plot looks like after scaling by those factors. If everything looks good (i.e., samples grouping by condition in PCA and no systematic bias associated with size factors) I would proceed without RUV. If you see apparent batch effects, then adding 1-2 factors of RUV might help, but keep in mind that because of the nature of your dataset, RUV might not work, and so look at the results carefully.