edgeR: request for additional guidelines for responsible limiting of genes that can be analyzed statistically
lepstein

We have TagSeq libraries of a fungus with 5 replicates in two conditions: the fungus “in vitro”, growing in plant extract, and “in vivo”, when it is infecting its host plant. We actually have three fungal strains (race 2, race 3 and race 4), but at this point we’re mostly interested in which genes are “up-regulated” or “down-regulated” in vivo. The in vitro libraries range in size from 3,034,612 to 5,796,518 reads, but the fungal counts from the in planta libraries are far lower than ideal. Nonetheless, it does seem that we can make some reasonable inferences, in particular when the in vitro samples have 0 or 1 raw counts in all the reps and all the in vivo reps have considerably more. There also appear to be some legitimate examples of down-regulation. The fungus is haploid and has ca. 15,000 genes.

In the edgeR manual on p. 12, it states, “As a rule of thumb, genes are dropped if they can’t possibly be expressed in all the samples for any of the conditions. Users can set their own definition of genes being expressed. Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library. Users should also filter with count-per-million (CPM) rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.”

Here are the raw total counts (library size) for each of the in vivo libraries:

Race 4:  836,601   20,248   12,414   13,095   20,967
Race 3:    8,565    1,774    9,289    7,364    1,928
Race 2:   10,408    4,639    4,495      827    1,075

edgeR indicates that we have ca. 20,000 genes in the count matrix, of which only ca. 1,100 show expression.

One question is, can we use the cut-offs described on p. 12 of your manual to make a responsible decision about what genes should be excluded due to small sample size for each of our fungi?

If we need a raw count of 5+ in a library then, taking race 4 as an example, where our smallest library has 12,414 raw counts, should we filter on rowSums(cpm(y) > 403), with 403 calculated as 5 raw counts converted to CPM for the smallest sample (5 / 12,414 * 1e6 = ca. 403)?

And for our 5 reps, do you have a recommended minimum number of samples in the group that should have, let’s say, cpm(y)>403 ?

And if these cut-offs are an appropriate way to exclude genes that would otherwise be erroneously called down-regulated, are there other guidelines for the minimum CPM and the minimum number of samples per group that should be used for genes that are up-regulated in vivo?
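
In code, what I have in mind is something like the following, where y is the DGEList of raw counts for one race (race 4 in the example above) and the ">= 3" minimum-samples value is just a placeholder for whatever you would recommend:

cutoff <- 5 / min(y$samples$lib.size) * 1e6   # for race 4, 5 / 12,414 * 1e6 = ca. 403
keep <- rowSums(cpm(y) > cutoff) >= 3         # the minimum number of samples is the open question
y <- y[keep, , keep.lib.sizes=FALSE]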

Many thanks for any guidance!

Gordon Smyth (WEHI, Melbourne, Australia)

I agree that using the smallest library size to guide the cpm cutoff may be very conservative in terms of how many genes to keep in the analysis. We have found that people generally have difficulty correctly applying the filtering advice of the User's Guide, so edgeR now has a function that will optimize the filtering for you. Just use

keep <- filterByExpr(dge, design)

or just

keep <- filterByExpr(dge)

if dge$samples$group is set appropriately. Then

dge.filtered <- dge[keep, , keep.lib.sizes=FALSE]

Here dge is the DGEList for all the samples together, both in vivo and in vitro. See our Example Workflow Using edgeR-QL for a complete example analysis using filterByExpr.
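
Put together, and assuming the count matrix and group factor are already in hand (the object names below are only illustrative), the steps would be something like:

library(edgeR)

# counts: genes x samples matrix of raw TagSeq counts, in vitro and in vivo together
# group:  factor giving the condition of each sample
dge <- DGEList(counts=counts, group=group)
design <- model.matrix(~group)

keep <- filterByExpr(dge, design)
summary(keep)                                       # how many genes are kept

dge.filtered <- dge[keep, , keep.lib.sizes=FALSE]   # recompute library sizes after subsetting
dge.filtered <- calcNormFactors(dge.filtered)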

The above will probably be less conservative than what you are doing now. I can't guarantee that it will work perfectly because your setup is really extreme. I have never before seen anyone trying to get information from TagSeq or RNA-seq libraries with fewer than 1000 reads, except in a single-cell context. You just have to accept that you will need to filter a lot of genes if you want to keep all the libraries. Depending on how the analysis works out, you might consider removing the four samples that have library sizes less than 2000.
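
If it comes to that, dropping those samples is just a matter of subsetting the columns of the DGEList, for example (using the 2000-read threshold mentioned above):

keep.samples <- dge$samples$lib.size >= 2000
dge <- dge[, keep.samples]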


In addition, TMM normalization (or any method based on computing ratios) is not going to be happy with such low counts and such a high frequency of zeroes. Fewer than 10,000 counts spread over 15,000 genes? That's barely one count per gene on average, so any subsequent ratios or M-values are meaningless for the calculations within calcNormFactors. One could try method="TMMwzp", which might do better than the default TMM method, but the fundamental problem remains: there is not enough information to estimate scaling factors when there are so many more zeroes in the in vivo samples. You'd probably see lots of stripes in your MA plots.
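
For example (a sketch, assuming dge is the filtered DGEList from the answer above):

dge <- calcNormFactors(dge, method="TMMwzp")   # as suggested above; newer edgeR releases spell this option "TMMwsp"
dge$samples$norm.factors                       # inspect the estimated scaling factors

# mean-difference (MA) plot of one in vivo sample against the average of the others;
# discrete stripes at low counts are the symptom described above
plotMD(dge, column=1)
abline(h=0, col="blue")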


Thanks! This is my first TagSeq dataset. I think that I understand the low sample size problem. My next question is about the high frequency of 0's in the in vitro dataset with 3,034,612 to 5,796,518 counts/rep. Assuming that our dataset is typical with a reasonably high proportion of genes with 0 cts in all or most reps in one treatment (with sufficient cell counts), is there a larger statistical issue here?


> I think that I understand the low sample size problem.

Be careful with your terminology. The sample size refers to the number of replicates you have. This is not the problem, which is instead the coverage (i.e., depth) of your in vivo samples.

> Assuming that our dataset is typical with a reasonably high proportion of genes with 0 cts in all or most reps in one treatment (with sufficient cell counts), is there a larger statistical issue here?

That's a pretty vague question, so here's a pretty vague answer. Zeroes naturally occur as part of count-based distributions like the negative binomial, so the model can handle them if they occur. However, they will cause problems if they occur more frequently than expected, e.g., due to dropout events. Is this the case for your data? Who knows? I guess you'll find out if you get lots of very large dispersion estimates, which means that some replicates have very large counts and other replicates have counts of zero.
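
One quick way to check (a sketch, assuming the DGEList dge and design matrix from the answer above):

dge <- estimateDisp(dge, design)
plotBCV(dge)                            # look for genes with very large BCV
summary(sqrt(dge$tagwise.dispersion))   # tagwise BCV; a long upper tail is the warning sign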


Thanks. Let's say 10% of the genes that are expressed in vivo are not expressed at all in vitro (0 mean with 0 variance); at least those genes wouldn't meet a criterion for homogeneity of variance.


No, there is no problem with "homogeneity of variance". The statistical model handles zeros without any problems.

edgeR in fact takes special care when all the counts are zero for a particular group: https://www.ncbi.nlm.nih.gov/pubmed/28599403

Your problem is rather one of statistical power. If the counts are all very low for a particular gene, then the statistical power for detecting differential expression with any confidence will also be low.
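
For reference, detecting differential expression here would use the usual quasi-likelihood pipeline, for example (a sketch; coef=2 assumes a two-level group factor with in vitro as the reference level):

fit <- glmQLFit(dge, design, robust=TRUE)
qlf <- glmQLFTest(fit, coef=2)   # in vivo vs in vitro under the assumed coding
topTags(qlf)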


Thanks for the advice!

