Hi,
I have some questions that are not part of a typical RNA-seq workflow, so I would appreciate your input and help!
I used DESeq2 to get the full list of DE genes with their log2 fold changes (log2FC) between the two phenotype groups in my data. The samples are from TCGA-LGG, and the phenotype I am interested in is seizure history. The two groups are 177 samples with no seizure history and 298 samples with a seizure history. After getting the log2FC values and converting them to a ranking metric, I ran the GSEA Preranked tool to get a list of gene sets enriched between the groups.
Now, based on these enriched gene sets, I am trying to assign each individual a numeric score for a particular gene set. For example, if the gene set 'calcium signaling pathway' is enriched in the seizure-history-yes group, I would like to take one individual from that group and calculate the log2FC of gene expression between that one sample and all samples in the no-seizure-history group. Then I would like to get the mean log2FC over only the genes in the 'calcium signaling pathway' gene set (m), get the mean and standard deviation of log2FC over all detected genes (M and S, respectively), and compute a z-score as (m - M) / (S / sqrt(n)), where n is the number of genes in the gene set. A high z-score would mean that the sample is highly enriched for that gene set. I would like to do this for every individual in the seizure-yes group and then look for correlations with other clinical features.
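To make the formula concrete, here is a minimal Python sketch of the z-score described above, z = (m - M) / (S / sqrt(n)). The function name `gene_set_z`, the toy gene names, and the toy log2FC values are all made up for illustration. It also assumes S is the sample standard deviation (n - 1 denominator), which the description above does not specify.

```python
import math
import statistics


def gene_set_z(log2fc, gene_set):
    """Gene-set z-score for one sample: z = (m - M) / (S / sqrt(n)).

    log2fc   : dict mapping gene name -> log2 fold change for this sample
    gene_set : iterable of gene names; genes not detected are skipped
    """
    in_set = [log2fc[g] for g in gene_set if g in log2fc]
    n = len(in_set)
    if n == 0:
        raise ValueError("no genes from the gene set were detected")

    m = statistics.mean(in_set)          # mean log2FC within the gene set
    all_values = list(log2fc.values())
    M = statistics.mean(all_values)      # mean log2FC over all detected genes
    S = statistics.stdev(all_values)     # assumption: sample sd over all genes
    return (m - M) / (S / math.sqrt(n))


# Hypothetical toy values, for illustration only.
toy_log2fc = {"A": 2.0, "B": 1.0, "C": 0.0, "D": -1.0, "E": -2.0}
toy_set = ["A", "B"]
z = gene_set_z(toy_log2fc, toy_set)
print(round(z, 3))
```

Because the denominator shrinks with sqrt(n), larger gene sets get larger absolute z-scores for the same mean shift, which is worth keeping in mind when comparing sets of different sizes.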
My main question is whether I can calculate log2FC using one sample from one group and all samples in the other group without seriously distorting the previous analysis/workflow (initial DESeq2 to GSEA). I understand that the log2FC from DESeq2 is not a simple ratio of normalized counts, but I am not sure whether there is a closed-form equation that gives the shrunken log2FC estimate reported in the DESeq2 results. One way I can think of is to build separate count data and metadata containing one sample from one group and all samples from the other group, and run DESeq2 on that to get log2FC, but I am not sure whether this would be consistent with the previous analysis.
Looking forward to any input!
Thank you!
Yes! This sounds like a much more efficient and appropriate approach. I have a couple of follow-up questions, though.
When I calculate the mean and standard deviation for the control group, I would include only the genes in a particular gene set, because a higher z-score would still mean that the gene set is more enriched in that sample compared to the controls, right? But in that case, would the z-scores from two different gene sets be comparable?
I am still not completely clear on what the VST does. Can I think of it as a normalized estimate with stabilized variance? What are the intermediate steps between the VST and getting log2FC?
Thank you so much!
Right (only mean and sd from that gene set).
z-scores are always comparable. So I'd say yes to comparing across gene sets.
For more details on the VST, take a look at the transformation section in the vignette or the workflow (linked from the top of the vignette).
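For anyone following along, here is a rough Python sketch of one possible reading of the control-group z-score discussed in this thread. It is an assumption, not necessarily the exact procedure intended: each sample gets a gene-set score (the mean of its transformed expression over the genes in the set), and that score is compared with the mean and sd of the same score across the control samples. The function names `set_score` and `z_vs_controls` and all numbers are hypothetical, and the inputs are assumed to be already on a variance-stabilized scale.

```python
import statistics


def set_score(expr, gene_set):
    """Mean transformed expression over the genes of one gene set.

    expr : dict mapping gene name -> transformed expression for one sample
    """
    values = [expr[g] for g in gene_set if g in expr]
    return statistics.mean(values)


def z_vs_controls(sample_expr, control_exprs, gene_set):
    """z-score of one sample's gene-set score relative to the control group.

    Assumption: mean and sd are taken over the per-sample gene-set scores
    of the controls, using only genes in the gene set.
    """
    ctrl_scores = [set_score(c, gene_set) for c in control_exprs]
    mu = statistics.mean(ctrl_scores)
    sd = statistics.stdev(ctrl_scores)   # assumption: sample sd (n - 1)
    return (set_score(sample_expr, gene_set) - mu) / sd


# Hypothetical toy data on a transformed (e.g. VST-like) scale.
controls = [
    {"g1": 5.0, "g2": 7.0, "g3": 1.0},
    {"g1": 6.0, "g2": 8.0, "g3": 2.0},
    {"g1": 7.0, "g2": 9.0, "g3": 3.0},
]
case = {"g1": 9.0, "g2": 11.0, "g3": 2.0}
z_case = z_vs_controls(case, controls, ["g1", "g2"])
print(z_case)
```

Since every gene set is scaled by its own control-group sd, the resulting scores are in units of control standard deviations regardless of the set, which is the sense in which they can be compared across gene sets.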