Question

How to deal with low-depth RNA-Seq within DESeq2 or edgeR

David R ▴ 90

@david-rengel-6321

Last seen 4 months ago

European Union

Hi,

I am dealing with RNA-Seq data for which the PI of the project would like to have DEGs, but I am having second thoughts about the nature of the data and its suitability for differential analysis.

The data come from infected plants, in which only reads from the infecting microorganism have been kept. There are two time points in the infection, one of them very early. The number of reads per sample is highly variable, ranging from 100K to 8M. Only 5K genes have at least 3 reads in at least one condition (pooling all replicates of that condition). This is because, especially at early stages, the ratio of microorganism to plant RNA in the sequenced library is very low.

Overall, I think the data are hardly suitable for differential analysis: I expect the distribution of gene expression to differ between samples, and we cannot assume that most genes are not DE.

One alternative I have considered is to analyze the plant and microorganism data together. Would this help stabilize the data?

It should be noted that no housekeeping genes (HKGs) are available for the species in question. I mention this because some authors use HKGs to normalize the data under circumstances similar to ours.

Any help or advice will be most greatly appreciated,

David

rnaseq normalization differential gene expression edger deseq2 • 3.6k views

Updated 7.2 years ago by Ryan C. Thompson ★ 7.9k • written 7.2 years ago by David R ▴ 90

score 2 · Answer 1 · 2017-02-08

From what you've described, you're in a spot of bother. The problem is not so much the difference in the library sizes - this can be handled by the model - but rather the fact that you're unwilling to assume that most genes are not DE. This limits your options for normalization, as neither TMM nor DESeq's size factor method can be used. I guess you don't have spike-in RNA either, which is a standard strategy (in single-cell RNA-seq, at least) for getting rid of technical biases in cells that have very different transcriptomic profiles. So, you have two options:

1. Normalize by library size, and hope for the best. Technically speaking, library size normalization assumes that the only difference between samples is due to sequencing depth - almost any DE will introduce composition biases, excepting some rather artificial scenarios where the upregulation for some genes perfectly cancels out the downregulation for other genes. This strategy errs on the side of caution and will under-normalize your data, but it shouldn't do anything horribly wrong.
2. Try to define some house-keeping genes a priori. For example, constitutively expressed genes like ribosomal proteins should be okay, as well as genes involved in core metabolic processes (glycolysis, perhaps - I can't remember my biochemistry). This might make the assumption of a non-DE majority more reasonable if you only apply TMM to these genes. However, it depends on having some good annotation for your organism, as well as enough knowledge about what processes are "constant enough".
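
To make the difference between the two options concrete, here is a toy sketch in plain Python (not edgeR code). All counts are invented for illustration, and scaling by the summed counts of the assumed house-keeping genes is a deliberately simplified stand-in for applying TMM to that subset. Genes g1 to g4 are treated as truly unchanged and g5 as strongly upregulated at the late time point.

```python
import math

def per_million(counts, size):
    """Scale raw counts by a chosen size factor, expressed per million."""
    return [c * 1e6 / size for c in counts]

def log2_fold_changes(reference, other):
    """Per-gene log2 ratio of `other` over `reference`."""
    return [math.log2(o / r) for r, o in zip(reference, other)]

# Hypothetical counts for genes g1..g5 (g5 is the only truly DE gene).
early = [100, 200, 300, 400, 1000]
late = [100, 200, 300, 400, 9000]

# Option 1: library size normalization (size factor = total count).
lfc_libsize = log2_fold_changes(
    per_million(early, sum(early)),
    per_million(late, sum(late)),
)

# Option 2: size factor from genes assumed a priori to be constant.
# The index set below is an assumption made for this toy example.
housekeeping = [0, 1, 2, 3]
hk_early = sum(early[i] for i in housekeeping)
hk_late = sum(late[i] for i in housekeeping)
lfc_housekeeping = log2_fold_changes(
    per_million(early, hk_early),
    per_million(late, hk_late),
)

for name, a, b in zip(["g1", "g2", "g3", "g4", "g5"], lfc_libsize, lfc_housekeeping):
    print(f"{name}: library size {a:+.2f}, house-keeping {b:+.2f}")
```

With library size scaling, the unchanged genes appear to go down and the fold change of g5 is shrunk, because g5 takes up a larger share of the late library. The house-keeping scaling recovers zero change for g1 to g4, but only because the assumed constant genes really are constant in this toy data.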

I would probably go with the first option, simply because I don't know enough biology, and then take the DE results with a tablespoon of salt, at least with respect to looking for changes in expression. (A DE analysis following library size normalization will instead find changes in the proportion of reads assigned to each gene between conditions. This is not the same as a standard DE analysis, nor is it as interpretable, because each gene's proportion depends on all the other genes.) There's no point throwing the plant data in for normalization, because then you introduce another factor, i.e., the ratio of plant RNA to microbial RNA, which doesn't help in correcting for biases in the microbe counts.

Another problem is with your filtering, which is not blind to the condition that each library comes from. Because your libraries are so imbalanced, the genes you retain after filtering are more likely to be upregulated in your later time point, i.e., it's a lot easier to get a gene with 3 or more reads in the 8M libraries compared to a gene with 3 or more reads in the 100K libraries. This is problematic as you'll bias the downstream results when you filter in a manner that's not independent of the test statistics. It'd be safer to filter on the CPMs, or to use the aveLogCPM function to compute an average abundance for filtering - see the edgeR user's guide for details.
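
As a rough illustration of why a raw count cutoff is not depth-neutral, the plain-Python sketch below converts the "3 reads in a condition" rule into the CPM a gene would need in each condition. The library sizes and the example gene are hypothetical, chosen only to match the 100K to 8M range mentioned in the question, and the pooled CPM at the end is a much-simplified stand-in for what aveLogCPM computes in edgeR.

```python
# Hypothetical library sizes for three replicates per time point.
early_sizes = [100_000, 100_000, 200_000]
late_sizes = [6_000_000, 7_000_000, 8_000_000]

def cpm_needed(min_reads, sizes):
    """CPM a gene must have for its pooled count over `sizes` to reach min_reads."""
    return min_reads * 1e6 / sum(sizes)

def pooled_cpm(counts, sizes):
    """Total count over total library size, per million reads."""
    return sum(counts) * 1e6 / sum(sizes)

# The same 3-read cutoff corresponds to very different abundances.
early_cutoff = cpm_needed(3, early_sizes)  # 7.5 CPM
late_cutoff = cpm_needed(3, late_sizes)    # about 0.14 CPM
print(f"early: {early_cutoff:.2f} CPM, late: {late_cutoff:.2f} CPM")

# A condition-blind alternative: one abundance value per gene computed
# over all libraries at once, compared against a single threshold.
gene_counts = [0, 0, 1, 30, 35, 40]  # hypothetical gene, early then late
overall = pooled_cpm(gene_counts, early_sizes + late_sizes)
min_cpm = 1.0  # hypothetical threshold, to be chosen from the data
keep = overall >= min_cpm
print(f"overall abundance {overall:.2f} CPM, keep = {keep}")
```

In this toy setup a gene needs roughly 50 times more expression to pass the count cutoff through the early libraries than through the late ones, which is the asymmetry described above.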

score 0 · Answer 2 · 2017-02-08

It is not required for every sample to have the same sequencing depth in order to assess differential expression. When you say that only 5k genes from the microorganism have sufficient reads, what fraction of the microorganism's genome is that?

In any case, it is important to precisely define what you mean by differential expression in this context, so you can choose a normalization method that supports your desired analysis. If the total fraction of reads coming from the microorganism increases from one condition to another, would you regard that as upregulation of all the microorganism's genes? My guess is probably not, in which case you need to normalize out those changes. This means that you don't want to normalize to the abundance of the host genes. So as a start, subset to only the microorganism genes and normalize as usual, using TMM. Then filter based on the output of aveLogCPM on the full data set. I recommend looking at a histogram of the aveLogCPM values and choosing a threshold between the signal mode and noise mode. Then perform your differential expression analysis, and look at your MA plot. If the low-expression genes' fold changes are biased in the direction of the more deeply-sequenced group, then you likely need to filter more stringently.
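
The MA plot check at the end of that workflow can also be done numerically. The sketch below is plain Python with invented values, not edgeR output: each tuple is a hypothetical gene's average log2-CPM and its log2 fold change (late versus early), and the split point between "low" and "high" abundance is an arbitrary choice for illustration.

```python
from statistics import median

# (average log2-CPM, log2 fold change late vs early); hypothetical values.
genes = [
    (-1.0, 2.1), (-0.5, 1.8), (0.0, 2.5), (0.5, 1.2),
    (3.0, 0.3), (4.0, -0.4), (5.0, 0.1), (6.0, -0.2), (7.0, 0.5),
]

def median_lfc_by_abundance(points, split):
    """Median log fold change below and at-or-above an abundance split."""
    low = [lfc for abundance, lfc in points if abundance < split]
    high = [lfc for abundance, lfc in points if abundance >= split]
    return median(low), median(high)

low_median, high_median = median_lfc_by_abundance(genes, split=1.0)
print(f"low-abundance median logFC:  {low_median:+.2f}")
print(f"high-abundance median logFC: {high_median:+.2f}")

# A low-abundance median far from the high-abundance one, in the direction
# of the more deeply sequenced group, suggests raising the filter threshold.
```

This only summarizes what the MA plot shows by eye, so it is best used alongside the plot rather than instead of it.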

Ultimately, you will probably filter out a lot of genes due to low counts. So when you deliver the list of DEGs, you need to be very clear that this is an incomplete list due to the very limited sequencing depth.