Question

DESeq normalized counts vs CQN and statistics

0

emlucero • 0

@0e95d562

Last seen 9 months ago

United States

Hi All,

I am working on a project using human RNAseq data. The data come from a larger project at another university and were collected over time from brain samples of more than 1000 individuals. I have run the data through an RNAseq pipeline, going from FASTQ files to BAM files, and have generated raw counts (STAR), TPM and FPKM (RSEM), DESeq normalized counts from the STAR raw counts, batch-corrected (ComBat) DESeq normalized counts, and CPM of the raw counts.

I am fairly new to bioinformatics and RNAseq and am checking my work as I go, since a case-control differential expression analysis on a subset of subjects (about half) has already been published on this data set using CQN gene expression values. Additionally, data frames with the CPM, FPKM, and CQN CPM values for a subset of the data we are working with are available.

I am not running a differential expression analysis; rather, my analysis uses a linear regression model to determine whether the expression of a specific gene is associated with or predicts a specific phenotype. I have the expression values of a gene on my x axis and a quantitative phenotype on my y axis, with a few covariates.

When I run my lm or glm using either my batch-corrected DESeq values, log CPM, TPM, or FPKM, I see an association between the expression of a specific gene and a specific phenotype, with a very significant p-value. My result replicates when I use the data frames that were already published, even though my work includes about twice as many samples. I have even replicated the technical (Picard alignment and RNAseq metrics) and biological covariates. I have not generated CQN values but when I look at the CQN offset for gene length and GC content data that is available (previously published by others), I see a complete opposite trend that is also very significant.

Has anyone ever seen anything like this? Why the difference?

I do not want to report only the results from methods that fit my hypothesis; I want to report the most biologically meaningful results, even if they are the exact opposite of what I predicted. Since I am looking at only one gene's association with a phenotype between samples, I don't think TPM or FPKM would be suitable, even though the gene length and the GC content of one gene should be relatively the same across each sample.

Where I am stuck is how to take into consideration the very different (opposite) results I see when comparing methods, specifically the complete opposite trend when using CQN normalized values versus batch-corrected DESeq normalized values. Should I view the CQN values offset for length and GC content the same way I view TPM or FPKM? Would CPM or batch-corrected DESeq normalized counts be a better fit, or is there something I am missing?

I appreciate any and all advice! Thanks in advance,

EL

DESeq2 cqn • 956 views

ADD COMMENT • link written 9 months ago by emlucero • 0

score 1 · Answer 1 · 2024-07-12

1

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

when I look at the CQN offset for gene length and GC content data that is available (previously published by others), I see a complete opposite trend that is also very significant.

Be careful about the output of functions; it may be reported in the opposite direction from what you expect.

You basically want the expression where technical factors are regressed out.

Two ways to produce the type of data you want:

1. Compute log CPM or vst() counts, and then regress out technical factors estimated using RUV or SVA (I prefer this approach).
2. Incorporate the CQN offset into the DESeqDataSet and then compute vst(). This will remove the variation associated with GC and length as computed by CQN. See DESeq2 vignette for example code for incorporating CQN.
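
The first approach can be sketched in a few lines. This is only an illustration in plain Python rather than R, assuming the technical factors have already been estimated (for example by RUV or SVA) and are passed in as plain lists. The names `log_cpm`, `ols`, and `regress_out` are hypothetical, as are the pseudocounts and the toy data; in practice the Bioconductor packages would do this work.

```python
import math

def log_cpm(counts):
    """log2 counts-per-million for a genes x samples table (list of rows).

    The pseudocounts (0.5 on counts, 1 on library size) are an assumption
    of this sketch.
    """
    lib_sizes = [sum(col) for col in zip(*counts)]
    return [[math.log2((c + 0.5) / (lib + 1.0) * 1e6)
             for c, lib in zip(row, lib_sizes)]
            for row in counts]

def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan).

    Adequate for a handful of well-conditioned columns; not a general solver.
    """
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for i in range(p):
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        for k in range(p):
            if k != i:
                f = A[k][i] / A[i][i]
                A[k] = [a - f * b for a, b in zip(A[k], A[i])]
    return [A[i][p] / A[i][i] for i in range(p)]

def regress_out(expr, factors):
    """Remove the fitted contribution of technical factors from one gene.

    expr: expression values for one gene, one per sample.
    factors: one list per technical factor, each with one value per sample.
    The intercept is kept, so cleaned values stay on the original scale.
    """
    X = [[1.0] + [f[s] for f in factors] for s in range(len(expr))]
    beta = ols(X, expr)
    return [y - sum(b * x for b, x in zip(beta[1:], row[1:]))
            for y, row in zip(expr, X)]

# Toy data: 2 genes x 4 samples and one hypothetical technical factor.
counts = [[10, 20, 30, 40], [90, 180, 270, 360]]
lcpm = log_cpm(counts)

factor = [0.0, 1.0, 0.0, 1.0]
expr = [5.0 + 2.0 * f for f in factor]  # a gene driven entirely by the factor
cleaned = regress_out(expr, [factor])   # factor effect removed, mean level kept
```

The cleaned values are what would then go on the x axis of the downstream regression against the phenotype.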

ADD COMMENT • link 9 months ago Michael Love 43k

0

Thank you so much for your reply Mike! Per your comment "Compute log CPM or vst() counts, and then regress out technical factors estimated using RUV or SVA (I prefer this approach)." This approach would give me a corrected counts table that is best suitable for linear regression models, and I would not use direct CQN counts for my model? Also, I am guessing that calculating log CPM and then regressing out technical factors using RUV or SVA would not require DESeq afterwards, since that would be somewhat of a double normalization. Is this correct?

ADD REPLY • link 9 months ago emlucero • 0

1

that is best suitable for linear regression models

Yes, you can then work with these downstream doing whatever you plan.

If you are performing DE you would instead include the technical factors in the design (don't regress them out first).
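
As a toy illustration of what putting a technical factor in the design does, the sketch below compares a fit that includes the factor as a covariate with one that leaves it out. Ordinary least squares stands in here for the model DESeq2 would actually fit, the data are made up, and `ols` is a hypothetical helper. The outcome is constructed to rise with expression and fall with a technical factor that also inflates expression, so the fit without the factor reports a slope of the opposite sign.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan).

    Adequate for a handful of well-conditioned columns; not a general solver.
    """
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for i in range(p):
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        for k in range(p):
            if k != i:
                f = A[k][i] / A[i][i]
                A[k] = [a - f * b for a, b in zip(A[k], A[i])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Made-up data: technical factor t, expression x (inflated by t),
# and an outcome y that rises with x but falls with t.
t = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi - 4.0 * ti for xi, ti in zip(x, t)]

# Factor omitted: intercept + expression only.
naive = ols([[1.0, xi] for xi in x], y)

# Factor in the design: intercept + expression + technical factor.
full = ols([[1.0, xi, ti] for xi, ti in zip(x, t)], y)

print("slope without factor:", naive[1])  # negative in this toy data
print("slope with factor:   ", full[1])   # close to the constructed +1
```

A sign flip of this kind is one reason to check which technical factors each set of normalized values has, or has not, already accounted for.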

ADD REPLY • link 9 months ago Michael Love 43k