Dear All,
I have gene counts from RNA-seq data for 50 samples, collected from patients at two different time points. I divided the genes into two sets: one set is the whole transcriptome (protein-coding and non-coding genes together), and the other (the focus set) contains only non-coding genes. I then ran DESeq between the two time points for each set. From the whole set, I get a total of 301 significantly differentially expressed non-coding genes, selected as top candidates using the filters FDR <= 0.05, p-value <= 0.05, and log fold change below -1 or above +1 (down- and up-regulated genes, respectively).
For the second set, the focus set containing only the non-coding genes, I ran the same analysis with the same filters and am left with 201 significant candidates. I would appreciate advice on the following questions:
1. Is it statistically valid to group the genes into subsets and run DESeq on each subset? If not, why not?
2. Why is there such a large difference in the number of significant genes between the two sets?
Thank you so much for your support.
Thanks for your reply.
First a comment: I only counted the resulting differentially expressed lncRNAs for comparing the two approaches.
2.) Regarding point 2: We hypothesized that if we separated the lncRNAs from the protein_coding transcripts, the method might become more sensitive to the generally much lower expression levels of lncRNAs. The fear was that the highly expressed protein-coding transcripts show bigger changes between the two sample groups and therefore dilute the 'smaller changes' within the lncRNAs. **The result** was the opposite: more lncRNAs were classified as differentially expressed when all transcripts were included in the analysis together.
Hmm, yes, that assumption is not a good one. The method is most sensitive when it can observe the data from all the genes. The highly expressed genes do not really steal sensitivity from the lowly expressed genes.
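In practice this suggests fitting once on all genes and only then narrowing down to the lncRNAs when counting candidates. Below is a minimal Python sketch of that last step, applying the filters described in the question to a results table from the full analysis. The rows, the field names (`gene`, `biotype`, `log2fc`, `pvalue`, `padj`) and the helper functions are hypothetical and the numbers are made up; treating a missing adjusted p-value as "not significant" is also an assumption, not something stated in this thread.

```python
# Hypothetical results rows from ONE analysis fitted on all genes.
# All values are invented for illustration only.
results = [
    {"gene": "lnc1", "biotype": "lncRNA", "log2fc": 1.8, "pvalue": 0.001, "padj": 0.01},
    {"gene": "lnc2", "biotype": "lncRNA", "log2fc": -0.4, "pvalue": 0.0004, "padj": 0.004},
    {"gene": "lnc3", "biotype": "lncRNA", "log2fc": -2.1, "pvalue": 0.03, "padj": 0.2},
    {"gene": "pc1", "biotype": "protein_coding", "log2fc": 3.0, "pvalue": 1e-6, "padj": 1e-4},
    {"gene": "lnc4", "biotype": "lncRNA", "log2fc": -1.5, "pvalue": 0.002, "padj": None},
    {"gene": "lnc5", "biotype": "lncRNA", "log2fc": -1.2, "pvalue": 0.0005, "padj": 0.008},
]


def is_top_candidate(row, alpha=0.05, lfc_cutoff=1.0):
    """Apply the filters from the question: FDR <= alpha, p-value <= alpha,
    and log fold change below -lfc_cutoff or above +lfc_cutoff."""
    if row["padj"] is None:
        # Assumption: a gene without an adjusted p-value is not called significant.
        return False
    return (
        row["padj"] <= alpha
        and row["pvalue"] <= alpha
        and abs(row["log2fc"]) > lfc_cutoff
    )


def de_genes_of_biotype(rows, biotype):
    """Subset AFTER testing: keep significant genes of a single biotype."""
    return [r["gene"] for r in rows if r["biotype"] == biotype and is_top_candidate(r)]


de_lncrnas = de_genes_of_biotype(results, "lncRNA")
print(len(de_lncrnas), de_lncrnas)
```

The point of the design is that the biotype only enters at the counting stage, so the model still sees every gene when it is fitted.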
Ok, Thank you