I have a set of non coding genes and a set of coding genes expressed across a number of samples. Based on the boxplots it is obvious that the lncRNAs are expressed lower than coding genes (log2 non-normallized counts). But what is the appropriate test for this kind of data, and is such a test implemented in e.g. DESeq2?
In my opinion, there is no rigorous statistical test available. Any obvious test, such a t-test or Wilcoxon test, assumes that the genes are independent whereas they are not. The dependence structure between the genes is unknown and, if it could be incorporated into the test, might have a dramatic effect on the computed p-value.
Since the result is so obvious, do you need a test?
I don't know — whenever someone asks me to say something about the relative expression of different genes within the same samples, I never feel comfortable doing it ... in fact, I'll fight tooth and nail against it for the following reasons.
First, and perhaps most importantly: what does this comparison tell you? In the scenario, if you have fewer amounts (in TPM(!)) of lncRNAs than mRNAs, does that mean the lncRNAs have less impact/activity than mRNAs? Or if you have less mRNA for gene A than gene B, are "you" going to then assert that gene A has less functional influence on the cell than gene B?
If I offered to give you a "pinch" of uranium or a pound of salt, I bet you'd take the pound of salt.
Second: do we really think length normalization is all we need? Perhaps the results are out there (so please point me to those references if so) but are there convincing publications out there that show accuracy of TPMs to something like molar quantities of transcripts for a large number of transcripts?
If you want to know how the relative abundance of genes change across conditions (ie. ratio of gene A:gene B in condition vs condition 2), then fine: we're good ;-)