Hello.
I'm working with a particular design: I want to see whether DNA alterations (in particular focal CNVs) affect gene expression. The idea is to check whether the deletion of a gene (or of part of a gene) leads to dysregulation of its expression, and I would like to use DESeq2. I built a count matrix from Salmon results with tximport and then ran these commands:
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~1)
vsd <- vst(dds)
(Here coldata contains only the list of patient names.) I then used the VST values for t-tests, comparing the values of GENE A in patients who carry a deletion in that gene against wild-type patients.
I know this is not the conventional approach and that I should have used the full DESeq2 differential expression workflow, but I had some doubts. I expect expression changes only in the gene affected by the deletion, and I regard changes in any other gene as "false positives" unrelated to the deletion (unless the deletion has some downstream effect, but I am not interested in that right now). So it made more sense to me to normalize the counts with all patients together rather than splitting them into groups, particularly because I have something over 200K deletions to test. Otherwise I would have had to run the analysis 200K times, each time with a different design, one for every deleted vs. wild-type grouping, right?
I just wanted a single matrix of values to use in downstream analysis, and in another post I read that you suggested using design = ~1 for downstream analysis (in that case it was Cox regression).
The question is: is my approach still valid, even if it is not the best? I found some striking results with the t-test on VST values, and now I'm worried that my approach was completely biased.
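To make the setup concrete, here is a minimal sketch of the testing step in base R. It is an illustration under assumptions, not my actual pipeline: `vst_mat` is a simulated stand-in for `assay(vsd)`, and `del_mat`, `test_gene` and the toy dimensions are hypothetical names and values chosen for the example.

```r
# Simulated stand-in for assay(vsd): genes in rows, patients in columns.
set.seed(1)
n_genes <- 5
n_pat <- 40
vst_mat <- matrix(rnorm(n_genes * n_pat, mean = 8, sd = 1),
                  nrow = n_genes,
                  dimnames = list(paste0("gene", 1:n_genes),
                                  paste0("pt", 1:n_pat)))

# Hypothetical deletion calls: TRUE if the patient carries a deletion in that gene.
del_mat <- matrix(FALSE, nrow = n_genes, ncol = n_pat,
                  dimnames = dimnames(vst_mat))
del_mat[, 1:10] <- TRUE  # toy layout: the first 10 patients are "deleted"

# Simulate reduced expression of gene1 in its deleted patients.
vst_mat["gene1", 1:10] <- vst_mat["gene1", 1:10] - 2

# Test one gene: deleted vs. wild-type patients for that same gene only.
test_gene <- function(g) {
  del <- vst_mat[g, del_mat[g, ]]
  wt <- vst_mat[g, !del_mat[g, ]]
  tt <- t.test(del, wt)  # Welch's t-test is the default in t.test()
  c(diff = unname(tt$estimate[1] - tt$estimate[2]), p = tt$p.value)
}

res <- as.data.frame(t(sapply(rownames(vst_mat), test_gene)))
res$padj <- p.adjust(res$p, method = "BH")  # multiple-testing correction
print(res)
```

The point of the sketch is that the matrix is computed once and each gene is tested only against its own deleted vs. wild-type split, so no per-deletion design is needed.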
Also, if my approach is not rubbish: I've read that VST values are roughly comparable to log2 values (for large counts). In that case, am I forced to use non-parametric tests, or are parametric tests and correlations still appropriate? Does it make sense to average the values (when needed) and to make boxplots showing medians, percentiles, etc.?
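For reference, the two kinds of test I am choosing between can be run side by side on the same values. This is only a sketch on simulated data (the vectors `wt` and `del` and their sizes are made up), and it does not by itself show which test is appropriate.

```r
# Simulated VST-like values for one gene (hypothetical numbers).
set.seed(2)
wt <- rnorm(30, mean = 8, sd = 1)   # wild-type patients
del <- rnorm(10, mean = 6, sd = 1)  # patients carrying the deletion

# Parametric (Welch's t-test) and non-parametric (Wilcoxon rank-sum) tests.
p_t <- t.test(del, wt)$p.value
p_w <- wilcox.test(del, wt)$p.value

# Per-group summaries of the kind a boxplot would display.
expr <- c(del, wt)
group <- factor(rep(c("deleted", "wild_type"), times = c(10, 30)))
med <- tapply(expr, group, median)
bp <- boxplot(expr ~ group, plot = FALSE)  # set plot = TRUE to draw the figure

print(c(t_test = p_t, wilcoxon = p_w))
print(med)
print(bp$stats)  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```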
Sorry for the long question and thank you very much in advance for the help!