Hello,
I took encode data RNAseq from mouse G1E cell line and G1E treated with 24 hours of estradiol.
I had following data:
dataset A
G1E, 2 replicates, single end reads
G1E-treated, 2 replicates, single end reads
dataset B
G1E, 2 replicates,paired end reads
G1E-treated, 2 replicates, paired end reads
I did differential analysis within each dataset and for following MA plots. Number of DE genes in dataset A is 4038 and in dataset B is 2169. When I do an intersection of DE genes from both datasets I get 604 up-regulated and and 520 down-regulated and the genes correspond to the expected pathways. However, I am very concerned about the MA plots I get and if DESeq2 normalization is appropriate in this case.
dataset A:
with MLE:
PCA plot:
dataset B:
with MLE
and PCA plot
edit: I did the same analysis but with edgeR and this issue is not that striking.
dataset A:
dataset B:
I also plotted LFC of two datasets:
DESeq2:
edgeR:
Also, the number of disagreeing genes in DESeq2 and edgeR.
I am a bit concerned about such discordance by the two methods. Do you have any ideas about the reasons for that?
I edited my question by adding the plots you asked for. 24 hours differentiation of G1E-ER4 cells with 10 nM beta-estradiol.
So to me this looks like the treatment had an extreme effect on the samples (see PC1 is nearly 100% of the variance separating treated from pre-treated).
To compare across dataset, I would plot log2FoldChange against log2FoldChange instead of p-values. The p-values depend on the power of the experiment, which depends on sample size (here only 2 vs 2) and on the technical and biological noise. But the effect size estimates should be relatively consistent, if both datasets are of high enough quality.
I don't see a problem with the normalization here (in the unshrunken plots, the y=0 axis goes through the middle of the cloud of points).
It's curious that the bulk of one dataset is up and the other down, are you sure you are comparing treated / pre-treated identically in both datasets?
Yes, I do compare them identically and I checked if I switched columns in one of the datasets and it is not the case. It is also a bit strange to me that the number of DE genes in dataset B is twice less then in dataset A.
I also checked the read lengths in dataset A and it turned out that in the cell line G1E (untreated) one replicate (g1e_1) was sequenced with 36bp read length and another (g1e_2) with 55bp read length. And one can see that in PCA plot. However, treated replicates were both sequenced with 36bp read lengths. Can be the reason for finding more DE genes than in dataset B?
The number of genes which pass an FDR threshold is highly variable, especially with very few samples, where the effect of technical noise and difficulty in estimating parameters have a large effect. So I would not be surprised to see this number change across two datasets like this. I would be more concerned about the general shape of differential expression (that the bulk is up- in one dataset and down- in another) and I would compare LFC directly, not look at p-value sets.
I edited my question with new plots and edgeR results.
It looks like you are filtering low count genes for the edgeR analysis, which are therefore cut off from the MA plot.
For a more clear comparison, why don't you also remove the same genes from the DESeq2 analysis that you remove from the edgeR analysis.