Question

Count outliers differ per design while running DESeq

MiKappa ▴ 30

@mikappa-23113

Last seen 4.1 years ago

I am running DESeq with 3 different designs on the same data set. I have 157 human samples (RNA-seq) and I am performing differential gene expression analysis comparing 2 phenotypes (insulin resistant vs insulin sensitive). For 2 of the 3 designs DESeq2 runs smoothly, but for the 3rd, summary(res) reports thousands of outliers. Following the documentation and earlier posts on this forum, I have run DESeq with minReplicatesForReplace=Inf and results with cooksCutoff=FALSE. I am positive my dataset doesn't contain outliers, and I would like to understand why DESeq2 runs without problems for 2 of the 3 models, while for the 3rd the outlier-flagging method appears to be inappropriate for the distribution of counts in my data and has to be turned off.

model 1: corrects for sex, BMI and age
model 2: corrects for sex, BMI, age and differences in cell type composition
model 3: corrects for sex, BMI, age, lipid- and glucose-lowering medication, and differences in cell type composition

Models 2 and 3 run without any errors. Model 1 reported thousands of outliers (before I turned the flagging off). Could someone explain why? I understand that each model corrects for different things and that the designs are not the same. I consider model 1 a simple (classical) design, and I was frankly surprised that the outlier-flagging method was not appropriate for that design but is for the other two.
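For reference, the settings described above can be sketched as follows. This is only an illustration on simulated data: `makeExampleDESeqDataSet()` stands in for the real 157-sample data set, and the object names (`dds`, `res_default`, `res_off`, `n_flagged`) are assumptions, not taken from the original analysis.

```r
library(DESeq2)

# Simulated stand-in for the real data (assumption): 1000 genes, 12 samples,
# with the two-level `condition` factor that makeExampleDESeqDataSet() creates.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Default behaviour: results() applies the Cook's distance cutoff.
dds_default <- DESeq(dds)
res_default <- results(dds_default)
summary(res_default)  # the "outliers" line reports the flagged genes

# Flagged genes have a non-zero mean count but an NA p-value.
n_flagged <- sum(res_default$baseMean > 0 & is.na(res_default$pvalue))

# Settings from the question: no outlier replacement, no Cook's filtering.
dds_off <- DESeq(dds, minReplicatesForReplace = Inf)
res_off <- results(dds_off, cooksCutoff = FALSE)
summary(res_off)
```

With `cooksCutoff = FALSE`, genes are no longer set to NA because of a large Cook's distance, so the outlier count in `summary()` should drop to zero.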

deseq2 • 390 views

updated 4.1 years ago by Michael Love 42k • written 4.1 years ago by MiKappa ▴ 30

score 0 · Answer 1 · 2020-03-26

Michael Love 42k

@mikelove

Last seen 19 hours ago

United States

This is expected. The criterion we use for outlier flagging (see the 2014 DESeq2 paper) is how much a single observation affects the coefficient vector. Because the coefficients are defined by the design, different designs give different results. You can also imagine a simple case in which a batch with a large batch effect contains a single sample. Including the batch covariate explains that sample's deviation; without the batch covariate, the sample would greatly affect the other LFCs (log fold changes) and be flagged with a large Cook’s distance.
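The single-sample batch case can be reproduced roughly with simulated data. The sketch below is a toy illustration under stated assumptions, not the original poster's analysis: the 20-fold inflation of 100 genes in one sample, the `batch` column, and the helper `count_flagged()` are all invented for the example.

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)  # samples 1-6 are condition A

# Hypothetical batch effect: inflate 100 genes 20-fold in sample 1 only.
cts <- counts(dds)
cts[1:100, 1] <- cts[1:100, 1] * 20L
counts(dds) <- cts

# Sample 1 forms a batch of its own (assumed labels "b1" and "b2").
dds$batch <- factor(c("b2", rep("b1", 11)), levels = c("b1", "b2"))

# Helper (hypothetical name): genes set to NA by the Cook's distance cutoff.
count_flagged <- function(res) sum(res$baseMean > 0 & is.na(res$pvalue))

# Design without the batch covariate: sample 1 distorts the condition LFC.
dds_simple <- dds
design(dds_simple) <- ~ condition
dds_simple <- DESeq(dds_simple, minReplicatesForReplace = Inf)
n_simple <- count_flagged(results(dds_simple))

# Design with the batch covariate: the batch term absorbs the deviation.
dds_batch <- dds
design(dds_batch) <- ~ batch + condition
dds_batch <- DESeq(dds_batch, minReplicatesForReplace = Inf)
n_batch <- count_flagged(results(dds_batch))

c(without_batch = n_simple, with_batch = n_batch)
```

The per-sample Cook's distances behind these counts are stored in `assays(dds_simple)[["cooks"]]`. The same counts give different flagging results because only the design changed, which is the point made above.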