Question: outlier detection of RNAseq samples
1
3.0 years ago by
wd30
Germany
wd30 wrote:

Hi

I have RNA seq data for six different treatments (A,B,C,D,E,F) of a model organism, with four-fold biological (NOT technical) replicates.

FastQC revealed no abnormalities in the RNAseq data, and after normalization (rlog transformation) with DESeq2 I generated a PCA plot (using the 500 most variable genes).
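For readers who want to see what such a plot is computed from, here is a minimal NumPy sketch of a PCA on the most variable genes. It is not the DESeq2 code used above; the function name `pca_top_variable` and its parameters are hypothetical, and the input is assumed to be a genes x samples matrix that has already been normalized (e.g. rlog values).

```python
import numpy as np

def pca_top_variable(expr, n_top=500, n_components=2):
    """PCA on the n_top most variable genes.

    expr: genes x samples matrix of already-normalized values (assumption).
    Returns (scores, explained): sample coordinates and the fraction of
    variance explained by each returned component.
    """
    expr = np.asarray(expr, dtype=float)
    # Rank genes by their variance across samples and keep the top ones.
    gene_var = expr.var(axis=1, ddof=1)
    top = np.argsort(gene_var)[::-1][:n_top]
    # Samples become rows; center every gene at zero.
    x = expr[top].T
    x = x - x.mean(axis=0)
    # SVD of the centered matrix yields the principal components.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]
```

The first two columns of `scores` correspond to the PC1/PC2 coordinates that would be plotted for each sample.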

Based on the PCA plot (see link: http://imgur.com/NVcWv5j) and a hierarchical clustering (HC) analysis (not shown), I would think that the dots marked with a rectangle (1,2,3) can be considered outliers and might be left out of further differential expression analysis (between treatments).

However, this is just based on visual inspection of the PCA/HC analysis. I was wondering if there is any objective metric to determine whether an RNAseq sample can be considered as an outlier (instead of just by visual inspection of PCA, like most papers do).

In a recent paper by Conesa et al. 2016 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8), the authors state the following:

"Reproducibility among technical replicates should be generally high (Spearman R2 > 0.9) [1], but no clear standard exists for biological replicates, as this depends on the heterogeneity of the experimental system."

So, based on Conesa et al. 2016, one might consider including all replicates (incl. outliers), but then you might end up with fewer differentially expressed genes between treatments...

Any advice/help regarding this topic would be much appreciated.

modified 3.0 years ago • written 3.0 years ago by wd30
1

Based on eyeballing your PCA plot, I am not sure you can justify excluding the marked points as outliers. Your sample size is quite low (statistically speaking; I know that it's hard to have more), and the variability is not so small as to clearly flag the points as 'wild' outliers. But if you want to use a statistical rule for outlier removal, you can calculate the mean (or median) pairwise distance (within group, or maybe for all groups pooled) and its standard deviation. Then you can flag those points that lie more than 2 sd away from the mean/median. I'll note, though, that while this makes it consistent between groups, the threshold is still arbitrary (although frequently used).
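One way to read the rule above is sketched here in NumPy. The choices are assumptions, not something stated in the comment: Euclidean distance between samples, the pooled variant (all samples together), the mean rather than the median, and flagging only samples that are unusually far away. The function name `flag_by_mean_distance` is hypothetical.

```python
import numpy as np

def flag_by_mean_distance(expr, n_sd=2.0):
    """Flag samples whose mean distance to all other samples is large.

    expr: genes x samples matrix of normalized values (assumption).
    Returns (mean_dist, flags), one entry per sample.
    """
    x = np.asarray(expr, dtype=float).T  # samples x genes
    # Euclidean distance between every pair of samples.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    n = dist.shape[0]
    # Mean distance from each sample to the others (self-distance is 0).
    mean_dist = dist.sum(axis=1) / (n - 1)
    # Arbitrary but commonly used cutoff: mean + n_sd standard deviations.
    cutoff = mean_dist.mean() + n_sd * mean_dist.std(ddof=1)
    return mean_dist, mean_dist > cutoff
```

A caveat on the within-group variant: with the sample standard deviation, no value among n can lie more than (n-1)/sqrt(n) sd from the mean, which is 1.5 for n = 4. A 2 sd cutoff applied separately to each group of four replicates could therefore never flag anything, so pooling the groups is probably the only workable version here.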

Answer: outlier detection of RNAseq samples
1
3.0 years ago by
Scripps Research, La Jolla, CA
Ryan C. Thompson (7.4k) wrote:

The samples you have highlighted are certainly farther than average from the group means, but I wouldn't consider them outliers. For example, the highlighted sample from group A is far away from the others along PC2, but all of group A is spread out over PC2, so this is not out of the ordinary, and group A does cluster tightly along PC1. Similarly, group B clusters tightly along PC2. So in both groups, there is clearly at least a subset of genes that is quite consistent within the group.

If you are really concerned that these samples may be dragging down your analysis, I recommend you use voomWithQualityWeights from the limma package. It will attempt to identify and down-weight outlier samples in the analysis. In addition, you can compare the list of lowest-weighted samples to the list of outlier samples that you identified by eye to see if the weighting method matches your intuitions.

1

I agree with Ryan. Probably it's just large within-group variability relative to the large-scale differences across groups. But you can try out limma's quality weighting to see if it helps.

Answer: outlier detection of RNAseq samples
0
3.0 years ago by
United States
Peter Langfelder (2.3k) wrote:

As far as "objective" measures to identify outlier samples go, I would check out the article by Oldham et al. (myself included), "Network methods for describing sample relationships in genomic datasets: application to Huntington's disease", BMC Syst Biol. 2012 Jun 12;6(1):63. PMID: 22691535.

The gist of the method is to sum each sample's distances to the other samples (or, conversely, its network adjacencies in a sample network), standardize the sums, and flag as outliers the samples with a high standardized distance (or, conversely, a strongly negative standardized connectivity).
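The connectivity variant of this idea can be sketched in a few lines of NumPy. This is an illustration of the gist only, not the published implementation: the signed adjacency ((1 + cor) / 2) ** power, the default power of 2, and the function name `standardized_connectivity` are assumptions made for this sketch.

```python
import numpy as np

def standardized_connectivity(expr, power=2):
    """Standardized sample connectivity in a sample-by-sample network.

    expr: genes x samples matrix of normalized values (assumption).
    Returns one z-score per sample; strongly negative values mark
    samples that are weakly connected to the rest.
    """
    x = np.asarray(expr, dtype=float)
    # Pearson correlation between samples (columns are the variables).
    cor = np.corrcoef(x, rowvar=False)
    # Assumed signed adjacency: maps correlations from [-1, 1] to [0, 1].
    adj = ((1.0 + cor) / 2.0) ** power
    np.fill_diagonal(adj, 0.0)  # ignore each sample's link to itself
    # Connectivity = sum of a sample's adjacencies to all other samples.
    k = adj.sum(axis=1)
    # Standardize across samples.
    return (k - k.mean()) / k.std(ddof=1)
```

Samples whose standardized connectivity falls below some negative cutoff (for instance -2, again an arbitrary choice) would be the candidates to inspect more closely.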

Answer: outlier detection of RNAseq samples
0
3.0 years ago by
wd30
Germany
wd30 wrote:

Dear Fabian, Ryan, Michael and Peter

Kind regards

Wannes