Deseq2 cooks filter for 2 replicates x 3 replicates
1
0
Entering edit mode
andreia • 0
@andreia-23745
Last seen 9 days ago
Portugal

Hi there,

I am here to get advice regarding the cooks filter for a comparison 2 x 3. I already read posts in several forums but i still got doubts because i need to do a "manual" filter since the cooks filter don´t apply here in my situation.

(DESeq2 user guide states: "The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA. At least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates. This filtering can be turned off with results(dds,cooksCutoff=FALSE).")

I dont feel confortable enough to use a "manual" filter without sharing here and someone give a input of this...

My dataset is

 SampleID CellType    Generation  CellType_Gen
28A        Ctrl         1          Ctrl_1
29A        Ctrl         1          Ctrl_1
30A        Ctrl         1          Ctrl_1
28C        Ctrl       500          Ctrl_500
29C        Ctrl       500          Ctrl_500
31A        Mut          1          Mut_1
32A        Mut          1          Mut_1
33A        Mut          1          Mut_1
34A        Mut          1          Mut_1
37A        Mut          1          Mut_1
41A        Mut          1          Mut_1
34C        Mut        500          Mut_500
37C        Mut        500          Mut_500
38C        Mut        500          Mut_500
41C        Mut        500          Mut_500
45C        Mut        500          Mut_500
68C        Mut        500          Mut_500


Its important to state i had 3 for Ctrl_500 but one i removed because is a outlier.


dds<-DESeqDataSetFromMatrix(countData = table, colData = data, design= ~ CellType_Gen)

dds<-DESeq(dds)

results(dds, name=c("Cell_Gen_Ctrl_500_vs_Ctrl_1"), pAdjustMethod="fdr", lfcThreshold = 0.59, alpha=0.05)


But gives me a DEGs that i am not confortable calling DEGs:

            baseMean           log2FoldChange     lfcSE          stat               pvalue         padj
Gene1     3914.910978          -2.29767        0.3983925     -4.286401        1.815913e-05     9.350556e-03
Gene2     106.197308          -29.94073        5.6210867     -5.221540        1.774413e-07     1.202515e-04
Gene3     84.828549           -29.83375        5.6220937     -5.201577        1.976048e-07     1.202515e-04
Gene4     586.986077          -12.48934        2.8471245     -4.179423        2.922500e-05     1.397373e-02
Gene5     6036.720262         -30.00000        5.6208094     -5.232342        1.673753e-07     1.202515e-04
Gene6     40.291988           -29.99999        5.1266506     -5.736687        9.654640e-09     1.615704e-05
Gene7     1.974293            -29.99423        4.0568705     -7.248008        4.229450e-13     1.415597e-09
Gene8     1.865582            -29.92969        4.8053043     -6.105688        1.023586e-09     2.283961e-06
Gene9     5.142519            -29.94403        5.6271704     -5.216481        1.823542e-07     1.202515e-04
Gene10    108.487045          -30.00000        5.6212903     -5.231895        1.677811e-07     1.202515e-04
Gene11    188.233450          -29.96725        5.6210734     -5.226271        1.729627e-07     1.202515e-04
Gene12    132.808155          -29.95709        4.0282970     -7.290200        3.094953e-13     1.415597e-09
Gene13    230.582865           28.20695        5.5955986      4.935477        7.995492e-07     4.460152e-04
Gene14    83.802949            30.00000        5.5596687      5.289884        1.223942e-07     1.202515e-04


The raw counts of this DEGs are:

          28A 28C  29A 29C  30A  31A  32A  33A  34A  34C  37A  37C  38C  41A  41C  45C  68C
Gene1    6129 924 4520 761 3147 3294 8336 8978 8605 2187 4928 2247 2474 6074 2059 1540 1299
Gene2      0   0  614   0    0    0    0     0   0     0    0    0   0   684   0    0   540
Gene3      0   0  147   0    0   205   0    255  0     0    0  278  255   0   246   0   0
Gene4    1441  0  1131  0  1002  869   0    816 792  682   991 617  685   0   587  657  0
Gene5    18177 0  13712 0    0    0    0     0   0  21031 25555  0   0    0    0  29532 0
Gene6     137  0   0    0   102   0   101   106  0     0    0   64  108   0    0    0   65
Gene7      0   0   0    0    5    0    0     0   4     0   15    0   0    0    6    0   4
Gene8     10   0   0    0    0    0    0     0   6     0    0    9   7    0    0    0   0
Gene9     34   0   0    0    0    0    0     0   0     0    0    0   0    43   0   13   0
Gene10     0   0  620   0    0    0    0     0   0     0    0    0   0   1289  0    0   0
Gene11     0   0   0    0  1940   0    0     0   0     0  1488   0   0    0    0    0   0
Gene12    215  0  257    0   0   153   0    193  0    336  331 323  182   0    0  310   0
Gene13     0   0   0    3    0    0  1187    0   0     0  1300   2  1457  0    0    0   0
Gene14     0  913  0    0    0    0    0     0   0     0    0   0    0    0   444   0   0


So, if you look to the samples 28A, 29A, 30A (Ctrl_1) and 28C, 29C (Ctrl_500) there are some DEGs with counts in only one sample. Thats why i need to apply a manual filter to remove these false positives.

I applied:

mcols(dds)\$maxCooks<-apply(assays(dds)[["cooks"]],1,max)


And now, my number of DEGs are 11! But i still have got DEGs with counts in one sample. This step removed the Genes 10,11 and 14.

I am asking if there is other approaches that i can do to "fix" this issue! Or i just do the rest of the analysis (functional analysis) with this 11 DEGs.

Thanks in advance. All the best. Andreia

DESeq2 Nullcook'svalues 2x3 DEGs • 178 views
0
Entering edit mode
@mikelove
Last seen 12 hours ago
United States

On other threads, what I've recommended when there are many count outliers like the above is to simply pre-filter, at the top of your script (before running DESeq()):

keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep,]


You can put in X as the minimum number of samples with a moderate count, e.g. maybe 6 samples for your design.

0
Entering edit mode

Today, at lunch i have got a filtering topic conversation with my collegue :) its seems that not exist a standard steps or the best approach to follow... its understandble because the dataset is different everytime for a different sequencing! We need to adjust the filter to the dataset. But, almost of the times i choose to do a filter:

    table <- table[rowSums(table) > 0,]
max<-apply(table, 1, function(x) max(x))
table <- table[rowSums(table) != max, ]


and then apply the DESeq normalization.

So, I did what you suggest and i end up with 5 DEGs ... Gene 1, 3, 4 and 6. Its away better but still wonder if the Gene3 is a false positive or not.

I understand this filter improved the dataset, but looking again the threads about the filtering settings, sometimes you answer like you did to me:

    keep <- rowSums(counts(dds) >= X) >= Y
dds <- dds[keep,]


    keep <- rowSums(counts(dds, normalized=TRUE) >= X) >= Y
dds <- dds[keep,]


i wonder what is the correct one, given that counts is different from the normalized counts.

Also, if i want to use the CellType instead of Celltype_Gen to compare only Ctrl and Mut the correct filter in this case would be:

    keep <- rowSums(counts(dds, normalized=TRUE) >= 10) >= 5
dds <- dds[keep,]