dealing with outliers (deseq2)
Entering edit mode
andreia ▴ 10
Last seen 23 months ago

Hi there, I have been stuck with this analysis, that in first place seems a quite simple experiment but the results that gave me are weird and i dont know how to resolve this. Its not my first time that i have problems with outliers that the DESeq2 not filter, i tried have a solid answer for my problem, but with no sucess.
So the experiment, or more likely the model that i use is ~RIN+group. I know about the continuous variable the automatic filtering is off (ref: in those cases Love et al recommend to use the assays(dds)[["cooks"]] "perform a manual visualization and filtering if necessary" but i am suck!

My experiment is:

                RIN   group
  DRG-PNI-1     7.7   SN_Injury
  DRG-PNI-2     7.9   SN_Injury
  DRG-PNI-4     6.9   SN_Injury
  DRG-PNI-5     6.8   SN_Injury
  DRG-PNIsh-1   6.9   SN_Sham
  DRG-PNIsh-2   8.2   SN_Sham
  DRG-PNIsh-3   7     SN_Sham
  DRG-PNIsh-4   8.2   SN_Sham
  MN-PNI-2      8.7   MN_Injury
  MN-PNI-3      8     MN_Injury 
  MN-PNI-4      7.5   MN_Injury
  MN-PNI-5      8.1   MN_Injury
  MN-PNIsh-1    7.2   MN_Sham
  MN-PNIsh-2    7.2   MN_Sham
  MN-PNIsh-3    7.4   MN_Sham

My code is:

Code should be placed in three backticks as shown below

data$group<-relevel(data$group, ref="SN_Sham")
dds<-DESeqDataSetFromMatrix(countData = table, colData = data, design= ~RIN + group)
dds$RIN <- dds$RIN / sd(dds$RIN) 
dds<-estimateSizeFactors(dds, controlGenes=index)
keep <- rowSums(counts(dds) >= 10) >= 4 
ddsClean <- dds[keep,]
ddsClean <- estimateDispersions(ddsClean)
dds <- nbinomWaldTest(ddsClean, maxit=10000)

res <- results(dds, name=c("group_SN_Sham_vs_MN_Sham"), pAdjustMethod="fdr", alpha=0.05)

In this comparison, i have 2 genes that are considered significative:

  log2 fold change (MLE): group SN Injury vs SN Sham 
  Wald test p-value: group SN Injury vs SN Sham 
                                                                   baseMean log2FoldChange     lfcSE      stat     pvalue      padj
                                                                    <numeric>      <numeric> <numeric> <numeric>  <numeric> <numeric>
                       ENSRNOT00000007365|ENSRNOG00000005286|Coch   13.6013       -11.7896   3.68177  -3.20215 0.00136407 0.0342551
                       ENSRNOT00000019133|ENSRNOG00000014233|Krt19   21.8576       -15.3058   3.79164  -4.03673 5.42012e-05 0.00283855

But looking for the matrix of counts all samples from the this comparison have zero counts:

                                                 DRG-PNI-1 DRG-PNI-2 DRG-PNI-4 DRG-PNI-5 DRG-PNIsh-1 DRG-PNIsh-2 DRG-PNIsh-3 DRG-PNIsh-4 MN-PNI-2 MN-PNI-3 MN-PNI-4 MN-PNI-5 MN-PNIsh-1 MN-PNIsh-2 MN-PNIsh-3
 ENSRNOT00000007365|ENSRNOG00000005286|Coch         0         0         0         0           0           0           0           0      109       13       10        7          0         0        289                                  
 ENSRNOT00000019133|ENSRNOG00000014233|Krt19        0         0         0         0           0           0           0           0       14      195      138       50          0         0        246

I will be very appreciate if anyone help me.

Thanks in advance, Andreia

maxcooks outliers zero counts DESeq2 • 893 views
Entering edit mode
Last seen 3 days ago
United States

You may have accidentally mismapped the colData and the samples. Best way to check is plotCounts for these genes (anyway, I always recommend looking at plotCounts of example genes).

Entering edit mode

Thanks Michael for your reply.

I checked the colData and everything is ok.

I tested the approach from an older post [DESeq2] Cook's distance flagging for continuous variables and i couldnt get any idea which was the best threshold to apply, so i end up after talking with my boss that i will filtering manually after the results() the genes that have rowsums < 5 (for the comparison samples) and remove the genes that have counts in only one sample.

Not very secure of this approach, but i really need to move on...

Entering edit mode

So when you make plotCounts, what do you see? DESeq2 will always give you a 0 LFC for a gene when two groups that have 0, unless there is a sample assignment bug. What version of DESeq2 do you have, see sessionInfo().


Login before adding your answer.

Traffic: 372 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6