Question

Difference in differentially expressed genes captured from two sets of the same samples


alva.james • 0

@alvajames-6967


Germany

Dear All,

I have gene counts from RNA-seq data for 50 samples, collected from patients at two different time points. I divided the genes into two sets: one set is the whole transcriptome (protein-coding and non-coding genes together), and the other (the focus set) contains only non-coding genes. I then ran DESeq between the two time points for each set. From the whole set, I get a total of 301 significantly differentially expressed non-coding genes as top candidates, using the filters FDR <= 0.05, p-value <= 0.05, and log fold change below -1 or above +1 (so both down- and up-regulated genes).

For the second set, the focus set containing only the non-coding genes, I ran the same analysis with the same filters and am left with 201 significant candidate genes. I would appreciate advice on the following questions:

1. Is it statistically valid to group the genes and run DESeq on each group separately? If not, why not?

2. Why is there such a large difference in the number of hits between the two sets?

Thank you so much for your support.

deseq2 deseq • 1.1k views

ADD COMMENT • link updated 8.0 years ago by Michael Love 41k • written 8.0 years ago by alva.james • 0

score 2 · Accepted Answer · 2016-04-15


Michael Love 41k

@mikelove


United States

hi,

1) It is not recommended to split the genes into sets and process them separately. The DESeq2 methods expect all the genes in one DESeqDataSet object. Why? Because the prior estimation steps need to look at all the genes in order to come up with reasonable prior hyperparameters (location and width of the dispersion prior, width of the LFC prior; see the DESeq2 paper).

2) I can turn this question around: why do you expect that differentially expressed genes would be present and detectable at an equal proportion in these two sets? It's not a random subset after all, but selected for some biological reasons.

But most importantly, you should not split the DESeqDataSet into two, but put all genes into a single DESeqDataSet.
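A minimal sketch of that order of operations: analyse all genes once, then restrict to the non-coding genes only when filtering the results. It is plain Python and assumes the results of the single joint analysis were exported as rows with `gene`, `biotype`, `log2FoldChange` and `padj` fields. The rows, the `biotype` annotation and the function name `select_noncoding_hits` are hypothetical, and the thresholds are the ones from the question.

```python
# Hypothetical rows standing in for a results table exported after ONE joint
# analysis of all genes (coding and non-coding together). Values are made up.
results = [
    {"gene": "geneA", "biotype": "protein_coding", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "geneB", "biotype": "lncRNA", "log2FoldChange": -1.6, "padj": 0.010},
    {"gene": "geneC", "biotype": "lncRNA", "log2FoldChange": 0.4, "padj": 0.020},
    {"gene": "geneD", "biotype": "lncRNA", "log2FoldChange": 1.8, "padj": None},
    {"gene": "geneE", "biotype": "lncRNA", "log2FoldChange": 1.2, "padj": 0.049},
]


def select_noncoding_hits(rows, padj_cutoff=0.05, lfc_cutoff=1.0, biotypes=("lncRNA",)):
    """Return gene IDs of non-coding genes that pass the FDR and fold-change filters.

    The subsetting happens here, after the analysis, not before it.
    """
    hits = []
    for row in rows:
        if row["biotype"] not in biotypes:
            continue  # keep only the focus set
        if row["padj"] is None:
            continue  # adjusted p-value missing (NA): cannot be called significant
        if row["padj"] <= padj_cutoff and abs(row["log2FoldChange"]) > lfc_cutoff:
            hits.append(row["gene"])
    return hits


noncoding_hits = select_noncoding_hits(results)
```

In this toy table, `geneC` fails the fold-change filter and `geneD` is skipped because its adjusted p-value is missing, so only the remaining lncRNA rows are returned.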

ADD COMMENT • link 8.0 years ago Michael Love 41k


Thanks for your reply.

First a comment: I only counted the resulting differentially expressed lncRNAs for comparing the two approaches.

Regarding point 2: we hypothesized that if we separated the lncRNAs from the protein-coding transcripts, the method might become more sensitive to the generally much lower expression levels of lncRNAs. The fear was that the highly expressed protein-coding transcripts show bigger changes between the two sample groups and therefore dilute the 'smaller changes' within the lncRNAs. **The result** was the opposite: more lncRNAs were classified as differentially expressed when all transcripts were included in the analysis together.

ADD REPLY • link 8.0 years ago alva.james • 0


Hmm, yes, that assumption is not a good one. The method is most sensitive when it can observe the data from all the genes. The highly expressed genes do not really steal sensitivity from the lowly expressed genes.
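To illustrate why the set of genes matters for prior estimation, here is a toy sketch in plain Python. It is not DESeq2's actual estimator. It assumes a zero-centred normal prior whose width is simply the root mean square of the observed log fold changes, and it uses the textbook normal-normal posterior mean for shrinkage. All numbers and the function names `prior_width` and `shrink` are made up for illustration, and the sketch does not claim to explain the 301 versus 201 difference in this thread.

```python
import math


def prior_width(lfcs):
    """Toy prior width: root mean square of the observed log fold changes."""
    return math.sqrt(sum(x * x for x in lfcs) / len(lfcs))


def shrink(lfc, se, width):
    """Posterior mean under a Normal(0, width^2) prior and a Normal(lfc, se^2) likelihood."""
    return lfc * width**2 / (width**2 + se**2)


# Made-up log fold changes: the first list is every gene, the second is a
# non-random subset of it (the last four values).
all_lfcs = [2.5, -3.0, 1.8, -2.2, 0.3, -0.4, 0.2, 0.5]
subset_lfcs = all_lfcs[4:]

width_all = prior_width(all_lfcs)
width_sub = prior_width(subset_lfcs)

# The same observed estimate (LFC 1.5, standard error 0.5) is shrunk by a
# different amount depending on which genes informed the prior.
shrunk_all = shrink(1.5, 0.5, width_all)
shrunk_sub = shrink(1.5, 0.5, width_sub)
```

The point of the sketch is only that a hyperparameter estimated from a selected subset can differ from the one estimated from all genes, so the same gene can end up with a different estimate in the two analyses.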