[Outlier Gene removal] - One sample with outlier counts for some genes
DcL-A • 0
@dcl-a-23007

Dear all,

I am currently using the DESeq2 package to analyze differential expression in 6 RNA-seq samples (each condition has 3 biological replicates = 3 independent inoculations). One sample is problematic: out of 44,480 genes, 780 are highly expressed in that sample but have 0 counts in the 5 other samples. In the sample-to-sample distance analysis, this sample clusters alone, and the expression levels of these genes are very high (almost 1,200 reads for one of them). But aside from this rather small number of genes, the sample is not consistently an outlier and behaves "normally", like the other replicates. (I also read this post: https://support.bioconductor.org/p/95755/)
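To make the clustering observation concrete, here is a minimal sketch of a sample-to-sample distance calculation on toy data. The count matrix is invented for illustration, and `log2(count + 1)` is only a simple stand-in for the transformation normally applied before computing distances; this is not DESeq2's own procedure.

```python
import math

# Toy counts (hypothetical): rows are genes, columns are 6 samples.
# Sample index 1 carries large counts for two genes that are 0 elsewhere.
counts = [
    [100, 110, 95, 105, 98, 102],
    [50, 48, 52, 51, 49, 50],
    [0, 1200, 0, 0, 0, 0],
    [0, 800, 0, 0, 0, 0],
]

def log_transform(matrix):
    """Apply log2(count + 1) to every entry (assumed stand-in transform)."""
    return [[math.log2(c + 1) for c in row] for row in matrix]

def sample_distances(matrix):
    """Euclidean distance between every pair of samples (columns)."""
    n = len(matrix[0])
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((row[i] - row[j]) ** 2 for row in matrix))
            dist[i][j] = dist[j][i] = d
    return dist

dist = sample_distances(log_transform(counts))
# Mean distance from each sample to the five others.
mean_dist = [sum(row) / (len(row) - 1) for row in dist]
most_distant = max(range(len(mean_dist)), key=mean_dist.__getitem__)
print("sample with the largest mean distance:", most_distant)
```

In this toy matrix, two genes are enough to push one sample far from the rest even though its other genes agree with the replicates, which is the pattern described above.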

I was wondering whether DESeq2 is capable of dealing properly with these outlier genes given the "small" number of replicates available. So I checked the adjusted p-values (padj) in the results, and most of these genes had NA (which I understand means that they are marked as outliers, right?). However, 12 of them still have a p-value < 0.05, but after correction their padj is > 0.05.
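A raw p-value below 0.05 can end up with an adjusted p-value above 0.05 because of multiple-testing correction. The sketch below implements the Benjamini-Hochberg procedure (DESeq2's default adjustment method) in plain Python on made-up p-values. The function name `bh_adjust` and the numbers are illustrative assumptions, not output from the analysis discussed here.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for six genes; the first two are below 0.05.
pvals = [0.03, 0.04, 0.20, 0.50, 0.70, 0.90]
padj = bh_adjust(pvals)
for p, q in zip(pvals, padj):
    print(f"p = {p:.2f}  padj = {q:.2f}")
```

Here both genes with p < 0.05 receive an adjusted value of 0.12, so neither passes a 0.05 threshold after correction.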

So apparently it is well capable of dealing with these outlier genes (phew!), but in that case do I lose power to identify real DEGs by introducing noise with these genes? If so, is it possible, and is it correct, to just remove this type of gene as a pre-filtering step?
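As an illustration of the kind of pre-filter being asked about, the sketch below flags genes that have reads in exactly one sample and drops them from a toy count table. The helper `single_sample_genes`, the `min_count` threshold, and the gene names are hypothetical. The rule does not look at condition labels, but whether such a filter is appropriate for a given dataset remains a judgment call.

```python
def single_sample_genes(counts, min_count=1):
    """Return names of genes with count >= min_count in exactly one sample."""
    flagged = []
    for gene, row in counts.items():
        n_expressed = sum(1 for c in row if c >= min_count)
        if n_expressed == 1:
            flagged.append(gene)
    return flagged

# Toy table (hypothetical): gene name -> counts across 6 samples.
counts = {
    "geneA": [100, 110, 95, 105, 98, 102],
    "geneB": [0, 1200, 0, 0, 0, 0],   # reads in one sample only
    "geneC": [0, 0, 0, 0, 0, 0],      # no reads anywhere
    "geneD": [3, 0, 0, 0, 0, 7],      # reads in two samples
}

flagged = single_sample_genes(counts)
filtered = {g: row for g, row in counts.items() if g not in flagged}
print("flagged:", flagged)
print("kept:", list(filtered))
```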

Thank you in advance for your help,

RNAseq DESeq2 outlier counts DE analysis • 2.3k views

This is very relevant to my work. I also have one sample with these properties. This sample is a clear outlier in both the PCA and MDS plots: all data points of all groups fall to the left along PC1 and that single sample falls to the right, so that PC1 accounts for 62% of the variance. I decided to remove it from the analysis.


I am glad not to be the only one in this situation!

However, I noticed that (in my case) removing that one sample leads to padj not being calculated for some genes that did have a padj when the "outlier sample" was in the analysis.

In my experiments, the three replicates are independent inoculations, so we expected the replicates to differ (unlike when plants are used as replicates), and probably to vary more than the treatment itself (inoculation), as very few genes have their expression modified upon inoculation. And indeed, for both conditions (non-inoculated and inoculated), replicate #2 has globally higher read counts than the other 2 replicates for the same genes. So when I removed the outlier sample (non-inoculated replicate #2), the genes for which the sample "inoculated replicate #2" had higher read counts than the other samples were flagged as outlier genes, with no padj calculated. They were not flagged, and they had a padj, when "non-inoculated replicate #2" was included in the analysis.

In my case, complete removal of that outlier would not be a good solution: first, because I have very few replicates (only three), and second, because for the other genes (not the 780 unusually high ones that make it an outlier), the expression is consistent with the inoculated sample from the same independent inoculation. That is why I was wondering about removing these "outlier" genes instead of the whole sample.

Did you observe anything like this when you removed your "outlier sample"?

@mikelove

Does this sample have higher sequencing depth? Or is it just aberrant in the 780 genes, but not due to depth? Can you make a PCA plot?
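One quick way to answer the depth question is to compare total read counts per sample. The sketch below does this on a toy genes-by-samples matrix. The numbers and the helper names `library_sizes` and `relative_depth` are assumptions made for illustration only.

```python
from statistics import median

def library_sizes(counts):
    """Total read count per sample (column sums of a genes x samples matrix)."""
    return [sum(col) for col in zip(*counts)]

def relative_depth(sizes):
    """Each library size divided by the median library size."""
    med = median(sizes)
    return [s / med for s in sizes]

# Toy counts (hypothetical): rows are genes, columns are 6 samples.
counts = [
    [10000, 10100, 9900, 10050, 9950, 10000],
    [5000, 4900, 5100, 5050, 4950, 5000],
    [0, 1200, 0, 0, 0, 0],  # gene seen in one sample only
]

sizes = library_sizes(counts)
rel = relative_depth(sizes)
for i, (s, r) in enumerate(zip(sizes, rel)):
    print(f"sample {i}: {s} reads, {r:.2f}x median")
```

If the relative depths are all close to 1, as in this toy case, a depth difference is unlikely to explain genes that are high in one sample and zero in the others.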


No, all samples were sequenced at the same depth and all have roughly the same depth. That sample is just aberrant in these 780 genes but is consistent with the other replicates for the other genes, so to me it is not due to depth. The PCA plots are as follows. With the outlier (pink, on the left): [image: with-outlier]

and without it (sample completely removed, just to try): [image: without-outlier]


It does seem like an observation to be worried about from the PCA plot.

Another approach I recommend is using MultiQC to look across various QC reports. For example, FastQC reports on GC content and basepair scores are useful for flagging technical issues.

Yes, keeping a sample that is outlying for hundreds of genes and outlying in the PCA will affect your power.


I did check the QC reports using MultiQC, but unfortunately that sample does not differ from the other samples in terms of GC content or basepair scores.

What would you suggest then? I have noticed that removing that sample affects the analysis: genes that used to have a padj calculated no longer have one and are therefore flagged as outliers. Indeed, I observed that replicate #2 has globally higher counts for the expressed genes in both conditions (inoculated = I, non-inoculated = N); this might be biological variation due to the independent inoculations that we performed. But the non-inoculated sample from replicate #2 has, for these specific 780 genes, very high counts.

See: [image: PCA] PCA considering only the same time point (so just 6 of the 24 samples I have).

Removing that sample then makes the inoculated sample of replicate #2 (B in the PCA) be considered an outlier... Should I then completely remove replicate #2 (both the I and N samples) and perform the analysis on the 2 remaining replicates?

So is it not possible to consider removing these 780 outlier genes from the outlier sample (non-inoculated replicate #2)?


Sorry, this is up to you:

"What would you suggest then?"

I can only provide some software support here; the final decision about the analysis is up to you.


I was just wondering whether removing these genes that behave strangely in one sample is possible or not. But I guess not. Thank you for your help!
