Dear all,
I am currently using the DESeq2 package to analyze differential expression in 6 RNA-seq samples (each condition has 3 biological replicates = 3 independent inoculations). One sample is problematic: out of 44,480 genes, 780 are highly expressed in that sample but have 0 reads in the 5 other samples. In the sample-to-sample distance analysis, this sample clusters alone, and the expression levels of these genes are very high (almost 1,200 reads for one of them). But aside from this rather small number of genes, the sample is not consistently an outlier and behaves "normally", like the other replicates. (I also read this post: https://support.bioconductor.org/p/95755/ )
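To make the "clusters alone" observation concrete, here is a minimal sketch of a sample-to-sample distance check. It is only an illustration: it uses log2(count + 1) as a stand-in for the variance-stabilized values one would normally compute in DESeq2, and the toy counts and function names are made up.

```python
import math

def sample_distances(counts):
    """Euclidean distances between samples on log2(count + 1) values.

    counts: list of gene rows, each row a list of per-sample raw counts.
    log2(count + 1) is an assumed stand-in for a proper variance-stabilizing transform.
    """
    n_samples = len(counts[0])
    logged = [[math.log2(c + 1) for c in row] for row in counts]
    dist = [[0.0] * n_samples for _ in range(n_samples)]
    for a in range(n_samples):
        for b in range(a + 1, n_samples):
            d = math.sqrt(sum((row[a] - row[b]) ** 2 for row in logged))
            dist[a][b] = dist[b][a] = d
    return dist

def most_isolated(dist):
    """Index of the sample with the largest mean distance to all other samples."""
    n = len(dist)
    means = [sum(row) / (n - 1) for row in dist]
    return max(range(n), key=lambda i: means[i])

# Toy data (hypothetical): 4 genes x 6 samples; sample index 1 carries two
# genes with high counts that are at 0 everywhere else.
toy_counts = [
    [100, 110, 95, 105, 98, 102],
    [50, 55, 48, 60, 52, 58],
    [0, 1200, 0, 0, 0, 0],
    [0, 800, 0, 0, 0, 0],
]
dist = sample_distances(toy_counts)
isolated = most_isolated(dist)
```

With this toy input, a handful of genes present in a single sample is enough to pull that sample away from all the others, even though its remaining genes look like everyone else's.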
I was wondering whether DESeq2 is capable of dealing properly with these outlier genes, given the "small" number of replicates available. So I checked the P-adj results for these genes, and most of them were NA (which I understand means that they are marked as outliers, right?). However, 12 of them still have a p-value < 0.05, but after correction their P-adj is > 0.05.
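For reference, this is how a p-value below 0.05 can end up with an adjusted value above 0.05. The sketch below is a plain Benjamini-Hochberg adjustment (the default adjustment method in DESeq2) written by hand. It is not DESeq2's own code, the p-values are invented, and treating NaN as "excluded from the correction" is an assumption of this example.

```python
import math

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values.

    NaN entries are left out of the correction and stay NaN in the output
    (an assumption of this sketch).
    """
    tested = [i for i, p in enumerate(pvalues) if not math.isnan(p)]
    m = len(tested)
    adjusted = [float("nan")] * len(pvalues)
    tested.sort(key=lambda i: pvalues[i])
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = tested[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values; the last gene has no p-value at all.
pvals = [0.001, 0.03, 0.04, 0.2, 0.5, 0.8, float("nan")]
padj = bh_adjust(pvals)
```

Here the genes with p = 0.03 and p = 0.04 both get an adjusted value of about 0.08, so they pass a 0.05 cutoff on the raw p-value but not on the adjusted one.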
So apparently it is quite capable of dealing with these outlier genes (phew!), but in that case, do I lose power to identify real DEGs by introducing noise with these genes? If so, is it possible, and is it correct, to simply remove this type of gene as a pre-filtering step?
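To be clear about what I mean by "this type of gene", here is a small sketch of the pre-filter I have in mind: flag genes that have reads in exactly one sample and 0 in all the others. The gene names, the toy counts and the `min_count` threshold are hypothetical, and this only shows how such genes could be identified, not whether removing them is statistically sound.

```python
def single_sample_genes(counts, min_count=10):
    """Flag genes detected in exactly one sample.

    counts: dict mapping gene name -> list of per-sample raw counts.
    min_count: hypothetical threshold the single non-zero count must reach.
    Returns a dict mapping each flagged gene to the index of that sample.
    """
    flagged = {}
    for gene, row in counts.items():
        nonzero = [j for j, c in enumerate(row) if c > 0]
        if len(nonzero) == 1 and row[nonzero[0]] >= min_count:
            flagged[gene] = nonzero[0]
    return flagged

# Toy count table (hypothetical), 6 samples per gene.
counts = {
    "geneA": [0, 1200, 0, 0, 0, 0],      # high in one sample only
    "geneB": [15, 22, 18, 40, 35, 51],   # expressed everywhere
    "geneC": [0, 0, 0, 0, 0, 0],         # never detected
    "geneD": [0, 3, 0, 0, 0, 0],         # one sample, but below min_count
}
flagged = single_sample_genes(counts)
filtered = {g: row for g, row in counts.items() if g not in flagged}
```

In this toy table only `geneA` is flagged, and all of the flagged genes point to the same sample index, which is the pattern I see in my data.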
Thank you in advance for your help,
This is very relevant to my work. I also have one sample with these properties. This sample is a clear outlier in both the PCA and MDS plots: the data points from all groups fall on the left of PC1 and that single sample falls on the right, so that PC1 accounts for 62% of the variance. I decided to remove it from the analysis.
I am glad not to be the only one in this situation!
However, I noticed that (in my case) removing that one sample leads to P-adj not being calculated for some genes that did have a P-adj when the "outlier sample" was included in the analysis.
In my experiments, the three replicates are independent inoculations, so we expected the replicates to differ from one another (unlike when you use plants as replicates), and probably to show more variability than the treatment itself (inoculation), as very few genes have their expression modified upon inoculation. And indeed, for both conditions (non-inoculated and inoculated), replicate #2 has globally higher read counts than the other 2 replicates for the same genes. So when I removed the outlier sample (non-inoculated replicate #2), the genes for which the sample "inoculated replicate #2" had higher read counts than the other samples were flagged as outlier genes, with no P-adj calculated... whereas they were not flagged, and did have a P-adj, when "non-inoculated replicate #2" was included in the analysis.
In my case, complete removal of that outlier sample would not be a good solution: first, because I have very few replicates (only three), and second, because for the other genes (not the 780 with weirdly high counts that make it an outlier), the expression is consistent with the inoculated sample from the same independent inoculation. That is why I was wondering about removing these "outlier" genes instead of the whole sample.
Did you observe anything like this when you removed your "outlier sample"?