Hey Bioconductor Community,
I have a bulk RNA-seq dataset with 4 groups, which I am using for 7 different comparisons in various combinations. No complex designs, just simple ones like this: ~ condition (so, x vs y). I began with nf-core/rnaseq using star_salmon, then tximport, then DESeq2.
In DESeq2, when I run the comparisons that are of most interest to me, a lot of immunoglobulin genes are marked as significant. They pretty much dominate the DEG table, and I want to see what else is there besides those. So I went back a little upstream, to right after txi <- tximport(...) and dds <- DESeqDataSetFromTximport(txi,..), and right before DESeq() and results(). (Note: This study is in mouse.)
So I subset out the Ig* genes with something like this:
txi <- tximport(...)
dds <- DESeqDataSetFromTximport(txi,..)
dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "Ig",invert = TRUE),] #added this
DESeq(...)
results(...)
Results looked good, in that the immunoglobulins weren't flooding the significant genes anymore... but then the HLA genes (H2-* in mouse) showed up instead. So I went back upstream and did it like this:
txi <- tximport(...)
dds <- DESeqDataSetFromTximport(txi,..)
dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "Ig",invert = TRUE),] #added this
dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "H2-",invert = TRUE),] #and this
DESeq(...)
results(...)
Results look great now... Really they do. But now I am wondering whether someone could tell me if what I have done is "okay"? I understand that I will need to report the original findings and each stage of subsetting (in the supplemental, I think) alongside what we found after filtering, so that our results can be reproduced.
But the heart of my question is: Did I do this type of filtering at the correct stage of the analysis?
I have been reading some posts here and at Biostars that say strictly not to do this. I think there is also one post where someone did something like the subsetting I have done and was completely fine with it. And then I think Gordon Smyth talks about the details of doing this in edgeR before or after normalization... In one other post I read, Gordon Smyth said that with edgeR he uses RefSeq to bypass some of this even further upstream. I have chosen to do this analysis with DESeq2 because I have become familiar with it, and I have also become familiar with using Gencode and Ensembl, so I am hoping to stay with these for now. But I would really appreciate it if someone could help with the heart of my question above, please?
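One note on the patterns in the code above: an unanchored "Ig" matches any symbol that contains those two letters, so genes such as Igf1 or Igfbp3 would be dropped along with the immunoglobulins. Below is a minimal base-R sketch of an anchored alternative, run on a toy vector of symbols that is purely illustrative and not taken from my data. The "^(Ig[hkl]|H2-)" pattern is only a heuristic I am assuming here: it still catches non-immunoglobulin genes whose symbols happen to start the same way (Ighmbp2 is included to show this), so a gene-biotype annotation would likely be a safer basis for the filter.

```r
# Toy vector of mouse gene symbols (illustrative only, not real results)
symbols <- c("Ighm", "Igkc", "Iglc2", "H2-Aa", "H2-K1",
             "Igf1", "Igfbp3", "Actb", "Ighmbp2")

# Unanchored patterns as used in the post: match "Ig" or "H2-" anywhere
loose <- grepl("Ig", symbols) | grepl("H2-", symbols)

# Anchored heuristic (an assumption, not a validated gene list):
# symbols that START with Igh, Igk, Igl, or H2-
tight <- grepl("^(Ig[hkl]|H2-)", symbols)

kept_loose <- symbols[!loose]  # what survives the original filter
kept_tight <- symbols[!tight]  # what survives the anchored filter

print(kept_loose)
print(kept_tight)

# On the real object the same logical index would be applied to the rows,
# e.g. (not run here):
# keep <- !grepl("^(Ig[hkl]|H2-)", rowData(dds)$mgi_symbols)
# dds  <- dds[keep, ]
```

With the toy input, the anchored version keeps Igf1 and Igfbp3, which the unanchored version removes, while Ighmbp2 is removed by both.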