Hi,
I am performing a differential RNAseq analysis, where I have 3 different groups (cell lines say A,B,C) and in each group there are 2 treatment conditions (Trt1 & Trt2). I am interested in a particular case: finding the DE genes between Trt1 and Trt2 in cell line A. I have two or more replicates in each treatment condition across all groups. I included all my samples in my design matrix, called DESEQ2 with default settings. Using contrast I obtained the results. I have around a total of ~ 1300 DE genes (padj < 0.05).
Initially I performed the DE analysis only with samples which I was interested (Cell line A, Trt 1 & Trt2 samples. Later I came to know that this is not ideal way of doing it and in order to improve the dispersion estimate one should include all the samples). In this case I have around 1600 DE genes (padj < 0.05)
So I was comparing the results between these two scenarios. From my understanding, the log2fold change of the genes should not change much. Mainly the p-value and number of outliers would be different between the first analysis (calling it complete sample analysis) and second (subsetted analysis). I plotted a scatter plot of the log2fold changes of all the genes in these two comparisons. Nearly almost all the genes lie in 45 degree line (Figure 1). But there are 10 genes that have significantly different log2foldchange. In some cases, the difference is very huge like complete analysis log2foldchange = 16 and subsetted log2foldchange = 3. I am concerned about this because out of these 10 genes, 3 of the genes are top DE genes in the complete analysis case (FDR <= 0.05) and these genes are not significant in subsetted case. (Figure 2: shows scatter plot for these 10 genes). Comparing between these two cases may be not a perfect thing to do, still I am wondering whether this (these 3 genes) is a false positive case or not. (Given the FDR = 0.05 on an average I would expect ~60 false positives). I made sure that size factor is not contributing to this difference by providing the sizefactor value of subsetted analysis to complete sample analysis. Also when I checked the normalized read counts for both cases, which are very similar in values. Any insights would be greatly helpful.
Thanks,
Thomas
> sessionInfo() R version 3.4.0 (2017-04-21) DESeq2_1.16.1

0
amalthomas111
1 minute ago by
amalthomas111 • 0
Hi Michael,
Thanks for prompt reply. From a quick look into the new analysis: Including the outlier replacement for the complete analysis did not change change anything (at least the number of DE genes). For the subsetted case, the number of DE genes slightly changed. Those 3 genes are still significant in complete case with very high LFC difference as compared to the latter where they are not significant (padj <0.05).
Just to give you an idea about the magnitude of the LFC change for these 3 genes.
LFC complete case LFC subsetted case
gene1 16.95 3.99
gene2 15.08 2.13
gene2 13.76 0.69
I should also add that, in the subsetted case, these three genes have padj = NA. They didn't pass the filter threshold. Similar was the case before without the (minReplicatesForReplace=Inf).
The log fold change should be basically identical if you turn off outlier replacement, regardless of what other groups are included. Can you show all your code? And you might want to look at plotCounts() for these genes.
Here's code for complete analysis case. For subsetted case only difference is there in the input file: df = read.table("RNAseq_22ndMay2017.tsv") nrow(df) df = df[rowSums(df) > 1,] nrow(df) design = read.table("RNAseq_design.txt",sep="\t",col.names = c("filename","batchid","tissue","metstype","cellline")) df_design = as.data.frame(design[match(colnames(df),design$filename),]) design.matrix = data.frame(row.names = df_design$filename,tissue =as.factor(df_design$tissue), batch=as.factor(paste0("batch",df_design$batchid)),metstype=as.factor(df_design$metstype), cellline = as.factor(df_design$cellline)) design.matrix dds = DESeqDataSetFromMatrix(countData= df, colData= design.matrix, ~ tissue ) dds = DESeq(dds)#, minReplicatesForReplace=Inf) res = results(dds, alpha=0.05, contrast = c("tissue","50brai","50H")) summary(res)plotCounts for genes (that shows high difference) do not agree with the high reported LFC. See the figures in markdown PDF file.
Markdown PDF can be found at link.
Complete code: link
Outlier replacement is on in this code. Can you confirm there is a LFC difference after subsetting with outlier replacement off?
With outlier replacement off:
For subsetted analysis: ( If I filter the count tables with df[rowSums(df) > 1,] )
> res["ENSG00000196154",] log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000196154 1.948717 3.999509 2.760516 1.448826 0.1473861 NA > res["ENSG00000104332",] log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000104332 0.5305491 2.137836 3.411102 0.6267289 0.530837 NA > res["ENSG00000149948",] log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H Error in DataFrame(object) : missing values in 'row.names'If no filtering of count tables using df[rowSums(df) > 1,], then for subsetted case:
> res["ENSG00000196154",] log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000196154 1.948717 3.999509 2.760516 1.448826 0.1473861 NA > res["ENSG00000104332",] log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000104332 0.5305491 2.137836 3.411102 0.6267289 0.530837 NA > res["ENSG00000149948",] log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000149948 0.1548006 0.6966229 4.181248 0.1666064 0.8676797 NAFor complete case:
log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000196154 55.77841 16.95991 2.009837 8.438447 3.215818e-17 3.452664e-13 log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000104332 40.77479 15.083 3.119895 4.834458 1.335088e-06 0.0002866835 log2 fold change (MLE): tissue 50brai vs 50H Wald test p-value: tissue 50brai vs 50H DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000149948 86.96833 13.76065 3.725956 3.693186 0.000221462 0.009398128Yes raw read counts are same in both.
> data.frame(counts(dds)["ENSG00000196154",]) counts.dds...ENSG00000196154.... X50brai.56.3 5 X50brai.58.3 0 X50brai.71.1 0 X50brai.83.2 6 X50brai.84.2 3 X50HR 0 X50H 0Normalized read counts:
subsetted case:
> data.frame(counts(dds,normalized= TRUE )["ENSG00000196154",]) counts.dds..normalized...TRUE...ENSG00000196154.... X50brai.56.3 5.418020 X50brai.58.3 0.000000 X50brai.71.1 0.000000 X50brai.83.2 4.854396 X50brai.84.2 3.368603 X50HR 0.000000 X50H 0.000000 complete case X50brai.56.3 5.658075 X50brai.58.3 0.000000 X50brai.71.1 0.000000 X50brai.83.2 5.065489 X50brai.84.2 3.584338 X50HR 0.000000 X50H 0.000000Could the zero read counts in both 50H and 50HR (one of the groups(50H group in my contrast) ) causing some issue)? But that still doesn't explain huge LFC difference between both cases.
Thanks again Michael. I am bit confused about how to do the filtering. In my data, the two groups :50brai vs 50H in my contrast (under my condition tissue), have very low expression for these 10 genes but there are other samples with enough reads for this genes. If I am using the entire samples for better dispersion estimate how would I do the filtering?
> head(data.frame(ENSG00000196154_counts=counts(dds)["ENSG00000196154",],tissue=design.matrix$tissue),n=29) ENSG00000196154_counts tissue X07H.R3 87 07H X07HR 22 07H X07H 11 07H X07kidn.80.1 11 07kidn X07lung.25.2 86 07lung X07lung.28.2 7 07lung X07lung.30.2 39 07lung X07lung.71.1 137 07lung X07lung.77.1 6 07lung X07lymp.30.2 216 07lymp X07ovar.12.1 39 07ovar X07ovar.27.2 3 07ovar X07ovar.28.2 11 07ovar X07ovar.30.2 3 07ovar X42brai.66.1 0 42brai X42brai.69.1 501 42brai X42brai.74.1 4 42brai X42HR 54 42H X42H 222 42H X42ovar.73.1 3 42ovar X50bone.83.2 14 50bone X50bone.84.2 23 50bone X50brai.56.3 5 50brai X50brai.58.3 0 50brai X50brai.71.1 0 50brai X50brai.83.2 6 50brai X50brai.84.2 3 50brai X50HR 0 50H X50H 0 50HFiltering based on rowSums(df) might not work right, since other samples have reasonable read counts for this gene?
Do you mean for each contrast first find those genes with at least 10 counts in more than two samples. Then estimate the dispersion of these genes using all samples and the do the contrast? For e.g. say for the contrast 50brai vs 50H
df = read.table("countable.tsv") df_subset = df[,grepl("50brai",colnames(df)) |grepl("50H",colnames(df))] head(df_subset) df_new = df[rownames(df_subset[rowSums(df_subset >10) > 2,]),] nrow(df) nrow(df_new) dds = DESeqDataSetFromMatrix(countData= df_new, colData= design.matrix, ~ tissue ) dds = DESeq(dds)Thanks
Thank you very much Michael!