Hello all,
I am writing to ask how to set up DESeq2 when my samples have large variation in gene counts. For example, below is one row from my gene count table.
Samples | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z | aa | bb | cc | dd | ee | ff | gg | hh | ii | jj | kk | ll |
Gene240880 | 0 | 0 | 0 | 0 | 0 | 0 | 347 | 248 | 6 | 21 | 0 | 0 | 0 | 0 | 605 | 665 | 438 | 760 | 597 | 511 | 448 | 184 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 44 | 17 | 5 | 215 | 0 | 0 | 0 |
As you can see, some samples have hundreds of reads for Gene240880 while others have zero. When I feed the whole table (~25k genes) to DESeq2, using pretty much the default settings recommended in the DESeq2 tutorial, and look at the comparison between condition 1 (triplicate z, aa, bb) and condition 2 (triplicate jj, kk, ll), for some reason I get a very significant p-value for this gene, even though the counts in both groups are all zeros, and a log2FoldChange is still reported.
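For context, the setup is essentially the standard workflow from the vignette; `counts`, `coldata`, and the level names `cond1`/`cond2` below are placeholders for my actual objects:

```r
library(DESeq2)

# counts: raw count matrix (genes x samples); coldata: data frame with a
# 'condition' column -- placeholder names for my actual objects
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)

# comparison between condition 1 and condition 2
res <- results(dds, contrast = c("condition", "cond1", "cond2"))
```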
I figured it might have something to do with this gene's variable behavior, so for now I split the table to keep only the 6 samples from conditions 1 and 2, and the comparison no longer shows the problem. (We have been using DESeq2 for quite a while, and this is the first time we have needed to split tables.)
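Roughly, the subsetting I did looks like this, assuming the same placeholder names `counts`, `coldata`, and condition levels `cond1`/`cond2` as above:

```r
# keep only the six samples from the two conditions being compared
keep <- colnames(counts) %in% c("z", "aa", "bb", "jj", "kk", "ll")

dds_sub <- DESeqDataSetFromMatrix(countData = counts[, keep],
                                  colData   = coldata[keep, , drop = FALSE],
                                  design    = ~ condition)
dds_sub$condition <- droplevels(dds_sub$condition)  # drop unused factor levels
dds_sub <- DESeq(dds_sub)
res_sub <- results(dds_sub, contrast = c("condition", "cond1", "cond2"))
```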
Therefore I would love to learn the reason for this problem - is it due to the normalization DESeq2 does? Also, because of this incident, I am a little unsure about when and how exactly I should consider splitting samples apart when using DESeq2. Any advice is appreciated!
Hi Dr. Love - thanks so much for the prompt reply. I will update DESeq2 and try again. Among all the methods you mentioned, would you recommend the subset-to-two-groups approach? I personally prefer to keep the complete table together, but I guess in theory it should not matter.
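For the update itself, I am assuming the usual BiocManager route:

```r
# update DESeq2 (and its dependencies) to the current Bioconductor release
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
```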
If you update, you don't need to do anything else; it should be solved.