Hey Bioconductor Community,
So I have a bulk RNA-seq dataset with 4 groups, which I am using for 7 different comparisons in various combinations. No complex designs, just simple designs like this: `~ condition` (so, x vs y). I began with
In DESeq2 now, when running the comparisons that are of high interest to me, there are a lot of immunoglobulin genes marked as significant. They pretty much dominate the DEG table. I want to see what else is there besides those. So I went back a little upstream, to right after `txi <- tximport(...)` and `dds <- DESeqDataSetFromTximport(txi,..)`, and right before `DESeq()` and `results()`. (Note: This study is in mouse.)
So I did something like this to subset out the Ig*'s:

    txi <- tximport(...)
    dds <- DESeqDataSetFromTximport(txi,..)
    dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "Ig", invert = TRUE),] # added this
    DESeq(...)
    results(...)

Results looked good, in that the immunoglobulins weren't flooding the significant genes anymore... but then there were the `H2-*` genes. So I went back upstream and did it like this:

    txi <- tximport(...)
    dds <- DESeqDataSetFromTximport(txi,..)
    dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "Ig", invert = TRUE),]  # added this
    dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "H2-", invert = TRUE),] # and this
    DESeq(...)
    results(...)

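A side note on the patterns themselves: an unanchored `"Ig"` matches any symbol that contains those two letters, so genes such as `Igf1` would be dropped along with the immunoglobulins. Below is a minimal base-R sketch of that effect. The symbol vector is made up and stands in for `rowData(dds)$mgi_symbols`, and the anchored patterns are illustrative assumptions, not a vetted list of immunoglobulin or MHC genes.

```r
# Toy gene symbols standing in for rowData(dds)$mgi_symbols (hypothetical values).
symbols <- c("Ighv1-5", "Igkc", "Iglc2", "Igf1", "Igfbp3",
             "H2-Aa", "H2-K1", "Actb", NA)

# Unanchored pattern, as in the post: it also flags Igf1 and Igfbp3.
loose <- grepl("Ig", symbols)

# Anchored patterns (illustrative only; check them against your own annotation).
is_ig <- grepl("^Ig[hkl][vdjc]", symbols)
is_h2 <- grepl("^H2-", symbols)

# Keep rows that are neither flagged as Ig nor as H2.
# grepl() returns FALSE for NA, so genes without a symbol are kept.
keep <- !(is_ig | is_h2)

print(symbols[loose])  # what the unanchored pattern would remove
print(symbols[keep])   # what survives the anchored filter
```

With a real object, the same logical vector could be used to subset rows in one step, for example `dds <- dds[keep, ]`, assuming `keep` was computed from that object's own symbol column.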
Results look great now... Really they do. But now I am wondering if someone could offer their opinion on whether what I have done is "okay"? I understand that I will report the original findings and the stages of subsetting, I think in the supplemental material, along with the goodies we found now, so that our results can be reproduced.
But the heart of my question is: Did I do this type of filtering at the correct stage of the analysis?
I have been reading some posts here and at Biostars that say strictly not to do this. And there's, I think, one post where someone did something like the subsetting I have done, and they are completely fine with it. And then I think Gordon Smyth talks about the details of doing this in `edgeR` before or after normalization... In one other post I read, Gordon Smyth was saying that with `edgeR` he uses RefSeq to bypass some of this even further upstream. I have chosen to do this analysis with `DESeq2` because I have become familiar with it, and have also become familiar with using Gencode and Ensembl. I am kind of hoping to stay with these for now. But I would really appreciate it if someone could help with the heart of my question above, please?