Question

Using VST normalized values for "external" single gene analysis

filippo.martignano • 0

@filippomartignano-20303

Hello.

I'm dealing with a particular design: basically, I am checking whether DNA alterations (in particular focal CNVs) affect gene expression. The idea is to see whether the deletion of a gene (or a portion of a gene) leads to dysregulation of its expression, and I would like to use DESeq2. I made a count matrix from Salmon results with tximport and then used this command:

dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~1)
vsd <- vst(dds)

(In this case coldata is only the list of patient names.) Subsequently I used the VST-normalized values for t-testing, comparing the values of GENE A in patients that carry a deletion in that gene vs. wild-type patients.

I know that this is not the conventional approach and that I should have used the full DESeq2 workflow for differential expression, but I had some doubts. I expect to see variation in expression levels only for the gene affected by the deletion, and I expect that variation in any other gene is a "false positive" not related to the deletion (unless the deletion has some downstream effect, but I am not interested in that right now). So it made more sense to me to normalize the counts with all patients grouped together rather than dividing them into groups, in particular because I have something like 200K or more deletions to test. Otherwise I would have had to run the analysis 200K times, each time with a different design, one for every "positive vs. wild-type" grouping, right?

I just wanted to make a single matrix of values to use in downstream analysis, and in another post I read that you suggested using "design = ~1" for downstream analysis (in that case it was Cox regression). The question is: is my approach still correct even if it is not "the best"? I found some striking results with the t-test on VST values, and now I'm worried that my approach was completely biased.

Also, if my approach is not rubbish: I've read that VST values are roughly comparable to log2 values (for large counts). In that case, am I forced to use non-parametric tests, or is parametric testing/correlation still valid? Does it make sense to average them (when needed) and to make boxplots showing medians, percentiles, etc.?

Sorry for the long question and thank you very much in advance for the help!

VST normalization deseq2 • 1.1k views

updated 5.3 years ago by Michael Love 43k • written 5.3 years ago by filippo.martignano • 0

score 0 · Answer 1 · 2019-08-20

I think I understand that you have essentially two matrices: one RNA-seq count matrix, and another matrix that tells you the deletion status of each gene across the patients? As you say, it's not possible to use DESeq() when the sample grouping changes per row (e.g. by deletion status), so I don't see anything wrong so far with using the VST values and then doing a row-wise analysis. How many patients do you have, though?
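
For illustration only, a row-wise analysis of this kind could be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not a recommended pipeline: `vst_mat` is simulated here and merely stands in for `assay(vsd)`, `del_mat` is a hypothetical logical matrix of the same shape holding the per-gene deletion status, and `rowwise_test()` and its `min_n` cutoff are made-up names. The Wilcoxon test and the Benjamini-Hochberg adjustment are additions that the thread itself does not prescribe.

```r
# Toy stand-ins (assumptions): in a real analysis vst_mat would be assay(vsd)
# and del_mat would come from the CNV calls (TRUE = patient carries a deletion
# in that gene). Both must share the same genes (rows) and patients (columns).
set.seed(1)
n_genes <- 5
n_pat <- 20
nm <- list(paste0("gene", seq_len(n_genes)), paste0("pt", seq_len(n_pat)))
vst_mat <- matrix(rnorm(n_genes * n_pat, mean = 8), nrow = n_genes, dimnames = nm)
del_mat <- matrix(rbinom(n_genes * n_pat, 1, 0.3) == 1, nrow = n_genes, dimnames = nm)

# For each gene, split the patients by that gene's own deletion status and
# compare the two groups. Genes with fewer than min_n patients in either
# group are left as NA instead of being tested.
rowwise_test <- function(values, deleted, min_n = 3) {
  stopifnot(identical(dim(values), dim(deleted)))
  template <- c(n_del = 0, n_wt = 0, diff = 0, p_t = 0, p_wilcox = 0)
  res <- t(vapply(seq_len(nrow(values)), function(i) {
    x <- values[i, deleted[i, ]]    # patients with the deletion
    y <- values[i, !deleted[i, ]]   # wild-type patients
    out <- c(n_del = length(x), n_wt = length(y),
             diff = NA_real_, p_t = NA_real_, p_wilcox = NA_real_)
    if (length(x) >= min_n && length(y) >= min_n) {
      out["diff"] <- mean(x) - mean(y)               # difference on the VST scale
      out["p_t"] <- t.test(x, y)$p.value             # Welch t-test
      out["p_wilcox"] <- wilcox.test(x, y)$p.value   # non-parametric alternative
    }
    out
  }, template))
  out <- data.frame(gene = rownames(values), res)
  # Adjust across all tested genes; NA rows are ignored by p.adjust().
  out$padj_t <- p.adjust(out$p_t, method = "BH")
  out
}

res <- rowwise_test(vst_mat, del_mat)
print(res)
```

Because each gene gets its own grouping, the VST matrix is computed once and reused for every test, which is the point of the `design = ~1` setup in the question. With something like 200K tests, some form of multiple-testing adjustment such as the `p.adjust()` call is probably worth considering.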