Removing Ig* and H2-* genes and counts from txi after DESeqDataSetFromTximport() in a DESeq2 analysis prior to DESeq() and results()?
Pratik Mehta
Hey Bioconductor Community,

So I have a bulk RNA-seq dataset with 4 groups, used for 7 different comparisons in various combinations. No complex designs, just simple ones like this: ~ condition (so, x vs y). I began with nf-core/rnaseq using star_salmon, then tximport into DESeq2.

In DESeq2 now, when running the comparisons that are of high interest to me, there are a lot of immunoglobulin genes marked as significant. They pretty much dominate the DEG table. I want to see what else is there besides those. So I went back a little upstream, to right after txi <- tximport(...) and dds <- DESeqDataSetFromTximport(txi,..) and right before DESeq() and results(). (Note: This study is in mouse.)

So I did something like this to subset out the Ig*'s:

txi <- tximport(...)
dds <- DESeqDataSetFromTximport(txi,..)

dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "Ig", invert = TRUE),] #added this

DESeq(...)
results(...)

Results looked good in that the immunoglobulins weren't flooding the significant genes anymore... but then the HLA genes (H2-* in mouse) were. So I went back upstream and did it like this:

txi <- tximport(...)
dds <- DESeqDataSetFromTximport(txi,..)

dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "Ig", invert = TRUE),] #added this
dds <- dds[grep(x = rowData(dds)$mgi_symbols, pattern = "H2-", invert = TRUE),] #and this

DESeq(...)
results(...)
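
One caveat about the patterns used in the subsetting: an unanchored "Ig" matches that string anywhere in a symbol, so genes such as Igf1 or Igfbp3 are dropped along with the immunoglobulins. Below is a minimal sketch of an anchored variant in base R. The `symbols` vector is a made-up toy example standing in for rowData(dds)$mgi_symbols, and the prefixes Igh/Igk/Igl are an assumption about how the immunoglobulin heavy, kappa and lambda chain genes are named; they would need to be checked against the actual annotation before use.

```r
# Toy symbols, made up for illustration; in the real analysis this would be
# rowData(dds)$mgi_symbols.
symbols <- c("Ighm", "Igkc", "Iglv1", "Igf1", "Igfbp3", "H2-K1", "H2-Ab1", "Actb")

# Unanchored pattern: TRUE for every symbol containing "Ig" anywhere,
# which here includes Igf1 and Igfbp3.
loose <- grepl("Ig", symbols)

# Anchored patterns (assumed naming scheme, verify against the annotation):
# symbols starting with Igh/Igk/Igl, and symbols starting with "H2-".
is_ig  <- grepl("^Ig[hkl]", symbols)
is_mhc <- grepl("^H2-", symbols)

# Keep everything that is neither flagged as Ig nor as H2-.
keep <- !(is_ig | is_mhc)
symbols[keep]

# Applied to the DESeqDataSet, this would be a single subsetting step:
# dds <- dds[keep, ]
```

Building one logical `keep` vector and subsetting once also makes it easy to record exactly which genes were removed, for example by saving `symbols[!keep]` for the supplemental material.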


Results look great now... Really they do. But now I am wondering if someone could offer their thoughts on whether what I have done is "okay"? I understand that I will need to report the original findings and the stages of subsetting, I think in the supplemental, along with the goodies we found now, so that our results can be reproduced.

But the heart of my question is: Did I do this type of filtering at the correct stage of the analysis?

I have been reading some posts here and at Biostars that say strictly not to do this. And there's, I think, one post where someone did something like the subsetting I have done and is completely fine with it. And then I think Gordon Smyth talks about the details of doing this in edgeR before or after normalization... In one other post I read, Gordon Smyth was saying that with edgeR he uses RefSeq to bypass some of this even further upstream. I have chosen to do this analysis with DESeq2 because I have become familiar with it, and have also become familiar with using Gencode and Ensembl. I am kind of hoping to stay with these for now. But I really would appreciate it if someone could help with the heart of my question above, please?

