Question: outlier detection of RNAseq samples
11 months ago by
wd10
Germany
wd10 wrote:

Hi

I have RNA seq data for six different treatments (A,B,C,D,E,F) of a model organism, with four-fold biological (NOT technical) replicates.

FastQC revealed no abnormalities in the RNA-seq data, and after normalization (rlog transformation) with DESeq2 I generated a PCA plot (using the 500 most variable genes).

Based on the PCA plot (see link: http://imgur.com/NVcWv5j) and a hierarchical clustering (HC) analysis (not shown), I would think that the dots marked with a rectangle (1,2,3) can be considered outliers and might be left out of further differential expression analysis (between treatments).

However, this is just based on visual inspection of the PCA/HC analysis. I was wondering if there is any objective metric to determine whether an RNAseq sample can be considered as an outlier (instead of just by visual inspection of PCA, like most papers do).

In a recent paper of Conesa et al  2016 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8) they state the following:

"Reproducibility among technical replicates should be generally high (Spearman R2 > 0.9) [1], but no clear standard exists for biological replicates, as this depends on the heterogeneity of the experimental system."

So, based on Conesa et al. 2016, one might consider including all replicates (incl. outliers), but then you might end up with a lower number of differentially expressed genes between treatments...
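For reference, the Spearman correlation mentioned in the quote can be computed between any two samples with a few lines of plain Python. This is only a sketch: the function names are made up for illustration, ties (common with zero counts) are handled by average ranks, and whether a cutoff such as 0.9 is appropriate for biological replicates is exactly what the quote leaves open.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def pairwise_spearman(samples):
    """Return {(name_a, name_b): rho} for every pair of samples in a dict."""
    names = list(samples)
    return {
        (a, b): spearman(samples[a], samples[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Calling `pairwise_spearman` on a dict that maps replicate names to per-gene values (hypothetical input layout) gives one correlation per replicate pair, which can then be compared within and between treatments.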

Any advice/help regarding this topic would be much appreciated

Based on eyeballing your PCA plot, I am not sure you can justify excluding the marked points as outliers. Your sample size is quite low (statistically speaking; I know that it's hard to have more), and the variability is not so small as to clearly flag the points as 'wild' outliers. But if you want to use a statistical test for outlier removal, you can calculate each sample's mean (or median) pairwise distance to the other samples (within its group, or maybe for all groups pooled) and the standard deviation of those values. Then you can flag the samples whose value lies more than 2 sd from the overall mean/median. I'll note, though, that while this makes the procedure consistent between groups, the threshold is still arbitrary (although frequently used).
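A minimal sketch of the distance-based flagging described above, in plain Python. It assumes each sample is represented by a vector of coordinates (for example its scores on the first few principal components); the function names and the input layout are illustrative and not taken from any package, and only unusually large distances are flagged.

```python
import math
import statistics

def mean_pairwise_distances(samples):
    """For each sample, return its mean Euclidean distance to all other samples."""
    names = list(samples)
    result = {}
    for a in names:
        dists = [math.dist(samples[a], samples[b]) for b in names if b != a]
        result[a] = statistics.mean(dists)
    return result

def flag_outliers(samples, n_sd=2.0):
    """Flag samples whose mean distance exceeds mean + n_sd * sd of all mean distances."""
    mpd = mean_pairwise_distances(samples)
    values = list(mpd.values())
    cutoff = statistics.mean(values) + n_sd * statistics.stdev(values)
    return [name for name, d in mpd.items() if d > cutoff]
```

One caveat that follows from the arithmetic: with the sample standard deviation, no value in a set of n numbers can lie more than (n-1)/sqrt(n) sd from their mean, which is 1.5 for n = 4. Applied within a group of four replicates, a 2 sd cutoff can therefore never flag anything, so the pooled variant is probably the only usable one for this design.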

ADD REPLYlink written 11 months ago by fabian.roger0810
11 months ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson6.1k wrote:

The samples you have highlighted are certainly farther than average from the group means, but I wouldn't consider them outliers. For example, the highlighted sample from group A is far away from the others along PC2, but all of group A is spread out over PC2, so this is not out of the ordinary, and group A does cluster tightly along PC1. Similarly, group B clusters tightly along PC2. So in both groups, there is clearly at least a subset of genes that are quite consistent within the groups.

If you are really concerned that these samples may be dragging down your analysis, I recommend you use voomWithQualityWeights from the limma package. It will attempt to identify and down-weight outlier samples in the analysis. In addition, you can compare the list of lowest-weighted samples to the list of outlier samples that you identified by eye to see if the weighting method matches your intuitions.


I agree with Ryan. Probably it's just large within-group variability relative to the large-scale differences across groups. But you can try out limma's quality weighting to see if it helps.

ADD REPLYlink written 11 months ago by Michael Love14k
11 months ago by
United States
Peter Langfelder1.3k wrote:

As far as "objective" measures to identify outlier samples go, I would check out the article by Oldham et al. (myself included), "Network methods for describing sample relationships in genomic datasets: application to Huntington's disease". BMC Syst Biol. 2012 Jun 12;6(1):63. PMID: 22691535.

The gist of the method is to sum each sample's distances to all other samples (or, conversely, its network adjacencies in a sample network), standardize the sums, and flag as outliers the samples with a high standardized distance (or, conversely, a strongly negative standardized connectivity).
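The gist above could be sketched roughly as follows, in plain Python. This is not the code from the paper: the adjacency ((1 + r) / 2)^2 built from the Pearson correlation r between two samples, the cutoff of -2, and all function names are assumptions chosen for illustration.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def standardized_connectivity(expr):
    """Sum each sample's adjacencies to all others, then z-score the sums."""
    names = list(expr)
    k = {}
    for a in names:
        # Assumed adjacency: ((1 + r) / 2) ** 2, which lies between 0 and 1.
        k[a] = sum(
            ((1 + pearson(expr[a], expr[b])) / 2) ** 2
            for b in names if b != a
        )
    values = list(k.values())
    mean_k, sd_k = statistics.mean(values), statistics.stdev(values)
    return {name: (k[name] - mean_k) / sd_k for name in names}

def low_connectivity_samples(expr, z_cutoff=-2.0):
    """Return samples whose standardized connectivity falls below z_cutoff."""
    z = standardized_connectivity(expr)
    return [name for name, score in z.items() if score < z_cutoff]
```

Here `expr` is assumed to map sample names to per-gene expression values. A profile with zero variance would make the correlation undefined, so such samples would need to be handled before calling these functions.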

11 months ago by
wd10
Germany
wd10 wrote:

Dear Fabian, Ryan, Michael and Peter

Thank you for your valuable advice! Very much appreciated.

Kind regards

Wannes