We have TagSeq libraries of a fungus with 5 replicates in two conditions: the fungus “in vitro” growing in plant extract and “in vivo” when it’s infecting its host plant. We actually have three fungal strains (race 2, race 3 and race 4), but at this point we’re mostly interested in what genes are “up-regulated” or “down-regulated” in vivo. In vitro libraries range in size from 3,034,612 - 5,796,518. But the fungal counts from the in planta libraries are far lower than ideal. Nonetheless, it does seem that we make some reasonable inferences, in particular, when the in vitro, for example, has 0 or 1 raw counts in all the reps and there are considerably more in all the reps of the in vivo. There also appear to be some legitimate examples of down-regulation. The fungus is haploid and has ca. 15,000 genes.
In the edgeR manual on p. 12, it states, “As a rule of thumb, genes are dropped if they can’t possibly be expressed in all the samples for any of the conditions. Users can set their own definition of genes being expressed. Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library. Users should also filter with count-per-million (CPM) rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.”
Here are the raw total counts (library size) for each of the in vivo libraries
Race 4 836601 20248 12414 13095 20967
Race 3 8565 1774 9289 7364 1928
Race 2 10408 4639 4495 827 1075
edgeR indicates that we have ca. 20,000 genes and expression of ca. 1,100 genes.
One question is, can we use the cut-offs described on p. 12 of your manual to make a responsible decision about what genes should be excluded due to small sample size for each of our fungi?
If we need to have a raw count of 5+ in a library, for example, if we use the case of race 4, and our smallest library has 12,414 raw counts, then should we have rowSum$(cpm (y)>403 , calculated as 5 raw counts * conversion to cpm for the smallest sample.
And for our 5 reps, do you have a recommended minimum number of samples in the group that should have, let’s say, cpm(y)>403 ?
And if the cut-offs are an appropriate place to exclude genes that would be otherwise erroneously indicated as down-regulated, are there other guidelines for the minimum cpm and # in each group that should be used for genes that are up-regulated in the in vivo?
Many thanks for any guidance!
In addition, TMM normalization (or any method based on computing ratios) is not going to be happy with such low counts and a high frequency of zeroes. Less than 10000 counts spread over 15000 genes? That's barely one count per gene on average - any subsequent ratios or M-values are meaningless for calculations within
calcNormFactors
. One could trymethod="TMMwzp"
, which might do better than the default TMM method, but the fundamental problem remains - that there is not enough information to estimate scaling factors when there are so many more zeroes in the in vivo samples. You'd probably see lots of stripes in your MA plots.Thanks! This is my first TagSeq dataset. I think that I understand the low sample size problem. My next question is about the high frequency of 0's in the in vitro dataset with 3,034,612 to 5,796,518 counts/rep. Assuming that our dataset is typical with a reasonably high proportion of genes with 0 cts in all or most reps in one treatment (with sufficient cell counts), is there a larger statistical issue here?
Be careful with your terminology. The sample size refers to the number of replicates you have. This is not the problem, which is instead the coverage (i.e., depth) of your in vivo samples.
That's a pretty vague question, so here's a pretty vague answer. Zeroes naturally occur as part of count-based distributions like the negative binomial, so the model can handle them if they occur. However, they will cause problems if they occur more frequently than expected, e.g., due to dropout events. Is this the case for your data? Who knows? I guess you'll find out if you get lots of very large dispersion estimates, which means that some replicates have very large counts and other replicates have counts of zero.
Thanks. Let's say 10% of the genes that are expressed in vivo are not expressed at all (0 mean w/ 0 variance) in the in vitro, i.e., at least for those genes, they wouldn't meet a criterion for homogeneity of variance.
No, there is no problem with "homogeneity of variance". The statistical model handles zeros without any problems.
edgeR in fact takes special care when all the counts are zero for a particular group: https://www.ncbi.nlm.nih.gov/pubmed/28599403
Your problem is rather one of statistical power. If the counts are all very low for a particular gene, then the statistical power for detecting differential expression with any confidence with also be low.
Thanks for the advice!