Hi,
I have RNA seq data with the samples defined as follows:
**ID level stability type**
A1 med unstable exp
A2 med unstable exp
A3 med unstable exp
B1 med stable exp
B2 med stable exp
B3 med stable exp
C1 low stable exp
C2 low stable exp
C3 low stable exp
D1 low unstable exp
D2 low unstable exp
D3 low unstable exp
E1 high stable exp
E2 high stable exp
E3 high stable exp
F1 high unstable exp
F2 high unstable exp
F3 high unstable exp
G1 host host host1
G2 host host host1
G3 host host host1
H1 host host host2
H2 host host host2
H3 host host host2
I1 host host host3
I2 host host host3
I3 host host host3
ID column here describes the samples in 3 biological replicates. My objective is to identify DEGs while comparing different samples in pair (A vs B, B vs C, A vs D etc) as well as in group-wise comparison (stable vs unstable, host vs expressed). But when I define a contrast such as "contrast = c("stability","unstable","stable")", I get some DEGs with just one of the replicates from 3 different stable or unstable samples being high or low in comparison to others as well.
As I understand that's because when the program identifies a gene in at least 3 samples within a group of the same profile (irrespective of being just 1 of the replicates of 3 different samples) with significant difference from the rest, it reports it as a DEG. However, I would like to know if there is some parameter that can be introduced to say that more than 3 samples in a group of 9 (may be at least 6 samples in our case) are required to have concordance to be reported as DEG.
If not, then can someone kindly suggest some other way to do avoid getting DEGs not representative of entire group but just 3 samples of a big group.
Thanks !!
Thanks for your response Michael. But my problem is not with filtering the raw count matrix. I use the same before moving ahead to differential expression analysis.
When I call differentially expressd genes in group of stable (9 samples = 3 stable with all 3 replicates each) vs unstable (another 9 samples = 3 unstable with all 3 replicates each) , I get some genes which have high level only in B2, C3, E3. So, only one of the 3 replicates from each sample. Here I want to set some parameter to report DE only if 6 or more of the 9 stable samples have high level in comparison to the 9 in unstable. Is there some option like that ? Or can you recommend another way to identify DEGs in stable vs unstable as per my sample annotation.
Thanks !!
Can you post an example of the plotCounts for such a gene, as well as the LFC reported from
lfcShrink
. E.g.:Usually such genes have an LFC that is moderated by a Bayesian approach
So here is what I did... I don't understand why I don't see unstablevsstable condition listed in resultsNamesDDS. But counts are as mentioned below. A gene is being reported DE based on high counts in B3, D1 and F3 while comparing stable and unstable.
This is described in the vignette. See note on factor levels. You should set the reference level of interest.
If you can post an image of a single gene using plotCounts, that would help.
Result from lfcShrink is as follows:
If you check the heatmap, it shows differential expression when at least any 3 columns are differentially expressed. But when considering a group I would want that at least 2 samples with at least 2 replicates in a group are having significantly high or low values from the other group (apart from the problem with the gene I mentioned before).
I see four samples in the unstable group supporting a large LFC. This is clearly not null so the small p-value is justified. I wouldn't characterize these are FP.
You could also try SAMseq which works on ranks if you aren't interested so much in the LFC but more in the ranks. We find it performs well. We also have developed an extension of SAMseq for working downstream of the Salmon quantification method, called Swish.
OK. Thanks Michael I will give a try at SAMseq.