To answer your first question: if you include extra genes, the statistics will change. This is mainly because the number of tests increases, which alters the multiple testing correction. In addition, all genes contribute to the shrinkage estimates in the empirical Bayes procedure, so adding new genes will almost certainly change the estimated dispersions for existing genes.
To answer your second and third questions: when analysing classes of features that are very different, I try to keep them entirely separate, i.e., a new DGEList, fresh dispersion estimates. This avoids mixing features that have systematic differences in behaviour (e.g., different mean-variance relationships), which would interfere with EB shrinkage. It also avoids distorting the multiple testing correction, e.g., if one class has a lot more features. Practically, it's simpler in that your results won't change (much) every time you decide to add or remove a particular class. Of course, you'll need enough features in each class for stable estimation of the shrinkage parameters.
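As a concrete sketch of this separation (the object names, count matrices and design matrix here are hypothetical placeholders), each feature class would get its own DGEList with its own normalization and dispersion estimates:

```r
library(edgeR)

# Hypothetical inputs: two count matrices for different feature classes
# (rows = features, columns = the same samples) and a model design matrix.
y_genes <- DGEList(counts = gene_counts)
y_mirna <- DGEList(counts = mirna_counts)

# Each class gets its own normalization and dispersion estimates, so
# neither class interferes with the other's EB shrinkage or trend fitting.
y_genes <- calcNormFactors(y_genes)
y_genes <- estimateDisp(y_genes, design)

y_mirna <- calcNormFactors(y_mirna)
y_mirna <- estimateDisp(y_mirna, design)

# Downstream testing (and hence the multiple testing correction) is then
# performed separately within each class.
```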
In your case, it's probably fine to lump pre-defined antisense transcripts with the known genes - some of these should already be present in your annotation anyway - but I would separate the analysis of novel transcripts. This is because discovery of novel transcripts adds a whole new level of uncertainty, depending on how you've done it (genome scanning, de novo assembly, etc.). The counts for such transcripts may be a lot more variable than those for known genes, potentially resulting in a distinct difference in the dispersion estimates between the feature classes. This would interfere with edgeR's attempts to fit a common mean-dispersion trend.
The exception to this separation philosophy is during normalization, where I might replace the normalization factors and library sizes for the other classes with those from the known genes. This is for cases where I cannot assume that, for a particular feature class, most features are not differentially expressed between samples. For example, if I were looking at the effect of knocking out microRNA-processing machinery, it would not be safe to assume that most microRNAs are non-DE. This means that running calcNormFactors on the microRNA counts would be largely inappropriate. By comparison, it's probably safe to assume that most annotated genes are not DE, so I can use the effective library sizes (i.e., norm.factors and lib.size) from that analysis instead.
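A minimal sketch of this replacement, assuming a hypothetical DGEList of known-gene counts (y_genes, already run through calcNormFactors) and a microRNA count matrix (mirna_counts) with the same samples in the same column order:

```r
library(edgeR)

# 'y_genes' is a DGEList of known-gene counts that has already been
# normalized with calcNormFactors(); 'mirna_counts' is a count matrix
# for the same samples, in the same column order (both hypothetical).
y_mirna <- DGEList(counts = mirna_counts)

# Do not run calcNormFactors() on the microRNA counts; instead, carry
# over the library sizes and normalization factors from the known genes.
y_mirna$samples$lib.size <- y_genes$samples$lib.size
y_mirna$samples$norm.factors <- y_genes$samples$norm.factors

# Internally, edgeR works with the product of the two, i.e., the
# "effective library size" for each sample:
eff.lib <- y_mirna$samples$lib.size * y_mirna$samples$norm.factors
```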
Thank you for your comprehensive answer.
I am working on a relatively unexplored genome/transcriptome of an organism, so I would probably take the same approach for the antisense transcripts as for the novel transcripts. The antisense transcripts here are not annotated.
But I will take your advice and use the normalization factors from the known genes. Would you say that it is not wise to include counts from antisense or novel transcripts when computing the normalization factors, i.e., that they should be computed using only the annotated genes?
Probably not. You say you want to use the normalization factors from the known genes - then, by implication, you don't trust the information provided by the antisense or novel transcripts, so it doesn't make sense to use them for normalization. Moreover, if you have a lot more novel transcripts than known genes, the former might end up dominating the normalization procedure in calcNormFactors.
I should stress that the computed normalization factors cannot be interpreted without the library sizes used to calculate them. If you want to normalize all feature classes in the same way, you cannot just use the same normalization factors - you have to replace both the normalization factors and library sizes in y$samples. This is because their product forms the "effective library size" that is used in the underlying edgeR code.
Thank you. While we are on the subject of normalization, I would like to get your opinion on the usage of Entrez IDs.
This is partly because Entrez IDs are required for some of the downstream analyses in edgeR, such as goana() or kegga(). Because some genes (regardless of the organism they come from) don't have Entrez IDs assigned to them, converting the gene IDs to Entrez IDs would essentially remove some of the genes, and as a result, normalisation and DGEA would happen without them.
I presume this would affect the genes that come up as DE, as we are removing some of the genes from the analysis. What's your take on this? I'm asking because, from the edgeR manual, I read that Entrez ID matching is done prior to the normalization step.
Alternatively, I'm thinking it would be better to keep all the genes for normalisation and DGEA (regardless of whether they have matching Entrez IDs), and only convert the IDs of the DE genes to Entrez IDs at the goana() and/or kegga() stage. Would you say this is a better approach? Thanks!
Probably best to ask a new question, but I'll give a brief answer here. These days, I perform counting with Ensembl IDs in my analyses, but there isn't a 1:1 correspondence between Ensembl and Entrez. So, I still match IDs at the start of the analysis with org.XX.eg.db, but I don't remove Ensembl-annotated genes that are missing Entrez IDs. I leave them in and only deal with this discrepancy when Entrez IDs are required, e.g., in goana.
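For illustration, here is a sketch of that approach, using org.Hs.eg.db as a stand-in for the relevant org.XX.eg.db package and a hypothetical fitted model object (fit) whose row names are Ensembl IDs:

```r
library(edgeR)
library(org.Hs.eg.db)  # stand-in for the relevant org.XX.eg.db package

# Match IDs at the start of the analysis, but keep genes whose Entrez ID
# is missing (mapIds returns NA for them) rather than removing them.
entrez <- mapIds(org.Hs.eg.db, keys = rownames(fit),
                 keytype = "ENSEMBL", column = "ENTREZID")

# All genes participate in normalization and DE testing; 'fit' here is a
# hypothetical fitted model from glmQLFit().
res <- glmQLFTest(fit)

# Only deal with the discrepancy when Entrez IDs are actually required:
keep <- !is.na(entrez)
go <- goana(res[keep, ], geneid = entrez[keep], species = "Hs")
```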