Dear all,
We are working with RNA-seq data to characterize specific cell populations. We have extracted 4 distinct cell populations (A, B, C and D) and performed Illumina RNA-seq on these samples. Reads were mapped with TopHat and counts were obtained with htseq-count. The sequencing platform advised us to use edgeR-voom for data normalization and transformation, and the limma package for identification of differentially expressed genes. We compared the cell populations pairwise with these contrasts:
A vs B
A vs C
A vs D
B vs C
B vs D
C vs D
We obtained lists of up- and down-regulated transcripts for each contrast. However, we are interested in identifying genes that are specifically expressed in one cell type and not in the others. We thought of 2 methods for this:
-first: take the 3 contrasts involving a given cell population (i.e. A vs B, A vs C and A vs D for cell population A) and extract the genes that are differentially expressed in all 3 contrasts. With this, we obtained a small number of "cell-type specific transcripts" (typically 100-200).
-second: design new contrasts comparing each cell type with all the others (i.e. A vs (B+C+D)) and apply limma. With this method, the vast majority of genes have significant adjusted p-values (but all have negative logFC, which would indicate that they are not specific to cell population A...)
It seems evident to us that the second method is not suitable, but the reasons are not really clear (we think that pooling all the populations creates an imbalance in the analysis, as if we were comparing A with the mean of B+C+D). However, is our first method right, or is there another way to statistically identify cell-type specific mRNAs?
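To illustrate one possible cause with toy numbers (not our data): if the contrast is written literally as A - (B + C + D), without dividing by 3, then A is compared with the sum of three log2 expression values rather than with their mean, which would push almost every logFC below zero. The short Python sketch below is only an illustration of that arithmetic; `group_means` and `contrast_value` are hypothetical names, and whether our contrast was actually specified this way is an assumption.

```python
# Toy log2 expression values (hypothetical, not from our data) for one gene
# that is clearly higher in population A than in B, C and D.
group_means = {"A": 8.0, "B": 5.0, "C": 5.5, "D": 4.5}


def contrast_value(weights, means):
    """Return the contrast estimate (logFC) as a weighted sum of group means."""
    return sum(weights[group] * means[group] for group in weights)


# "A - (B + C + D)": A minus the SUM of the three other log2 values.
sum_contrast = {"A": 1.0, "B": -1.0, "C": -1.0, "D": -1.0}

# "A - (B + C + D)/3": A minus the MEAN of the three other log2 values.
mean_contrast = {"A": 1.0, "B": -1 / 3, "C": -1 / 3, "D": -1 / 3}

logfc_sum = contrast_value(sum_contrast, group_means)
logfc_mean = contrast_value(mean_contrast, group_means)

# The first value is negative although the gene is highest in A;
# the second is positive, as expected for an A-specific gene.
print(f"sum contrast: {logfc_sum:.2f}, mean contrast: {logfc_mean:.2f}")
```

With the weights of the second contrast summing to zero (1 and three times -1/3), a gene with identical expression in all four populations gets a logFC of 0, which is not the case for the first contrast.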
Please do not hesitate to let me know if my explanations are not clear.
Thank you in advance.
Best regards,
Nicolas
Hi Jim,
Thank you for your answer. Indeed, the results seem more sensible using your contrast matrix. We have the impression that the first method is more stringent (fewer genes are consistently differentially expressed across the contrasts). We will check whether these consistently differentially expressed genes are found in the significant gene set of the second method. We think that the second method is statistically more robust; do you think that is right?
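The check we have in mind could look roughly like the Python sketch below. The gene IDs are made up, and the variable names (`pairwise_hits`, `average_hits`) are hypothetical; in practice the sets would be filled from the significant rows of the limma result tables, ideally keeping only genes with a positive logFC for population A.

```python
# Hypothetical gene IDs standing in for the significant genes of each contrast.
pairwise_hits = {
    "A_vs_B": {"g1", "g2", "g3", "g5"},
    "A_vs_C": {"g1", "g2", "g4", "g5"},
    "A_vs_D": {"g1", "g2", "g5", "g6"},
}
# Significant genes for A versus the mean of B, C and D (second method).
average_hits = {"g1", "g2", "g3", "g4", "g7"}

# First method: genes significant in all three pairwise contrasts.
consistent = set.intersection(*pairwise_hits.values())

# Overlap between the two methods, and genes found by the first method only.
shared = consistent & average_hits
only_pairwise = consistent - average_hits

print(sorted(consistent), sorted(shared), sorted(only_pairwise))
```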
Thank you.
Best,
Nicolas