Search
Question: Fundamental questions about DESeq2 and RUVs in targeted gene set of 500
0
gravatar for grashow
17 months ago by
grashow0
grashow0 wrote:

Hi,

I am brand new to bioinformatics and differential expression analysis, trying to assimilate a lot of new technical and conceptual information. I am working with a dataset that used BioSpyder’s TempO-seq technology to analyze a cancer cell line using a targeted gene set of 500, where 10 chemicals were tested under two conditions. Biospyder has provided the raw read counts for each of the 500 genes. There were no housekeeping genes in our dataset, but we do have biological replicates under control and treated conditions.

My questions are: should we be using any particular normalization options in DESeq2, given that there is no group of genes for which we expect no change at all? In other words, what are the implications of using a targeted gene panel when there all genes could possibly show altered expression? From my read of the 2014 Love et al. paper, the approach to evaluating dispersion by looking across all samples should address this in our targeted set, but I may be missing a finer point.

Secondly, RUVseq has been recommended to adjust for unwanted technical effects that may be present in these data. I assume that means we’d be using the RUVs approach described in Risso et al., although I’m unsure whether it would be better to use the empirical control genes approach (2.4 in the RUVseq vignette). From my reading of the Bioconductor postings, we would do RUVseq first to compute size factors, and then use that information in DESeq2. Is that correct? Can someone provide a simple explanation of what a size factor is? I’ve read through the vignette and other Bioconductor questions but haven’t found an answer that makes sense to me.

Lastly, we do not have lane information for the TempO-seq data, but we do have the plate information. Should/could that be used similar to the betweenLaneNormalization in RUVseq?

Apologies that these are very basic questions. Thanks to all,

Rachel

ADD COMMENTlink modified 17 months ago by davide risso750 • written 17 months ago by grashow0
0
gravatar for davide risso
17 months ago by
davide risso750
Weill Cornell Medicine
davide risso750 wrote:

Hi,

I will let Mike chime in for the DESeq2 specific questions, but here are my two cents.

> should we be using any particular normalization options in DESeq2, given that there is no group of genes for which we expect no change at all? In other words, what are the implications of using a targeted gene panel when there all genes could possibly show altered expression?

This is a difficult question and it's likely that only through some exploratory data analysis you will be able to find an indication on how to proceed. I suggest that you run a few exploratory plots, such as RLE plots, MA plots and PCA / MDS to get a feel of how the data "look like".

The EDASeq package (and its vignette) is likely a good place to start for these plots, but I believe that many of the same ideas are implemented in the DESeq2 package.

> Secondly, RUVseq has been recommended to adjust for unwanted technical effects that may be present in these data. I assume that means we’d be using the RUVs approach described in Risso et al., although I’m unsure whether it would be better to use the empirical control genes approach (2.4 in the RUVseq vignette). 

RUVSeq does make the assumption that you can identify some genes that do not change across conditions. If this is not the case, I'm afraid that this won't be a good approach for this dataset. I'm not familiar with the technology that you used, but I'm guessing that this platform doesn't include any control spike-in / housekeeping genes, correct?

Both RUVs and RUVg with empirical controls assume that you have some "negative control" genes, i.e., genes that do not change across conditions. Using the empirical control approach in the absence of truly negative controls will likely just remove (some) biological signal from your data. RUVs seems more robust to the misspecification of negative controls, so I would tend to suggest that approach (with a very low k to minimize the chance of removing biologically meaningful variation). Again, looking at EDA plots should give you a rough idea if it's reasonable to assume that at least a handful of genes don't change.

> From my reading of the Bioconductor postings, we would do RUVseq first to compute size factors, and then use that information in DESeq2. Is that correct? Can someone provide a simple explanation of what a size factor is? I’ve read through the vignette and other Bioconductor questions but haven’t found an answer that makes sense to me.

The terminology is tricky, but you're generally correct. You would compute what we call "unwanted variation factors" and then add them as covariates in the design matrix in DESeq2. The "size factors" are offsets that can be included in the model to account for the "library size" (i.e., how many reads were sequenced per sample -- although the DESeq2 size factors are a more robust version of this, this is intuitively what they correct for) and hence to make the samples comparable to each other.

Unfortunately, size factors too can be confounded with biology when the majority of your genes is differentially expressed (unless they are roughly symmetrically distributed between up- and down-regulated). Again, the only way to understand if the size factors are OK is to look at some exploratory plots: e.g., look at the size factors between conditions and see if there is any systematic difference between conditions.

> Lastly, we do not have lane information for the TempO-seq data, but we do have the plate information. Should/could that be used similar to the betweenLaneNormalization in RUVseq?

The "betweenLaneNormalization" is just a poorly named function that perform between-sample normalization, independently of the fact that they are organized in lanes, plates, etc. It is used in RUVSeq just to make sure that the factors of unwanted variation don't capture the difference in library size that should be captured by the size factors. (I understand that the jargon here is quite confusing! :) Please, let me know if anything is unclear and I will try to explain myself better!)

I would start by exploring how the DESeq2 size factors look like, and how the PCA plot looks like after scaling by those factors. If everything looks good (i.e., samples grouping by condition in PCA and no systematic bias associated with size factors) I would proceed without RUV. If you see apparent batch effects, then adding 1-2 factors of RUV might help, but keep in mind that because of the nature of your dataset, RUV might not work, and so look at the results carefully.

ADD COMMENTlink written 17 months ago by davide risso750

Davide has a great set of answers here, and I agree with everything. Let me know if you have any further questions on the DESeq2 side. It's really unfortunate to have a targeted panel with no housekeeping or control genes, as it's difficult to estimate the right size factors in this case. Hopefully, from looking at the MA plot there will be some number of targeted genes which nevertheless are roughly "in the middle" after performing the default size factor estimation within DESeq().

ADD REPLYlink written 17 months ago by Michael Love19k

Davide and Mike,

Thank so much for these answers. You're right that there are no spike-in or housekeeping genes. According to a colleague, this is because our vendor uses an approach where an equal amount of RNA from each sample prior to sequencing. So their sequencing results in absolute counts of gene expression, and the total counts provided to us represent the total level of RNA for all 500 genes (not all RNA in the cell). So I currently have a matrix containing 500 genes and the read counts for each of the samples, and another condition table that lists for each sample- whether or not it was treated, and its plate.

The advice I've just gotten is to just get the data into DESeq2 because the data already represent actual read counts so there's no uncertainty to normalize against. If that's truly the case, I should nevertheless be able to look at batch (i.e. plate) effects using DESeq2, correct? If I wished to do some simple clustering or PCA analysis to look for plate effects, it is best to just use the plot.PCA function in DESeq2?

Thanks for your patience- I really appreciate it.

Rachel

ADD REPLYlink written 17 months ago by grashow0

hi Rachel,

I don't follow the logic of the colleague/vendor. Those counts are still providing evidence of relative abundances and need to be normalized. The fact that a roughly equal amount of RNA is used does not change it. Here's a simple thought experiment, suppose there are only two genes A and B, and in one sample we obtain counts of 20 and 20. In the second sample, gene B triples in its gene expression so we have 1:3 ratio of A:B, and we get counts of 10 and 30. Not knowing that it is gene B which has changed, you would think by looking only at the absolute counts that gene A has gone down to 1/2x and gene B up to 3/2x their original values. Hence one needs normalization, and for this you need either housekeeping genes, technical controls, or to make the assumption that there are a number of genes, like A, which are "in the middle" with respect to the log fold changes across condition.

DESeq2 can be used to normalize, but it will assume that there is a group of genes "in the middle" which do not change across condition, unless you provide specific genes to the 'controlGenes' argument of estimateSizeFactors().

ADD REPLYlink written 17 months ago by Michael Love19k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 238 users visited in the last hour