Question

Log2fold change difference : Including complete samples vs subsetted samples

0

Entering edit mode

amalthomas111 • 0

@amalthomas111-13189

Last seen 2.7 years ago

United States

Hi,

I am performing a differential RNAseq analysis, where I have 3 different groups (cell lines say A,B,C) and in each group there are 2 treatment conditions (Trt1 & Trt2). I am interested in a particular case: finding the DE genes between Trt1 and Trt2 in cell line A. I have two or more replicates in each treatment condition across all groups. I included all my samples in my design matrix, called DESEQ2 with default settings. Using contrast I obtained the results. I have around a total of ~ 1300 DE genes (padj < 0.05).

Initially I performed the DE analysis only with samples which I was interested (Cell line A, Trt 1 & Trt2 samples. Later I came to know that this is not ideal way of doing it and in order to improve the dispersion estimate one should include all the samples). In this case I have around 1600 DE genes (padj < 0.05)

So I was comparing the results between these two scenarios. From my understanding, the log2fold change of the genes should not change much. Mainly the p-value and number of outliers would be different between the first analysis (calling it complete sample analysis) and second (subsetted analysis). I plotted a scatter plot of the log2fold changes of all the genes in these two comparisons. Nearly almost all the genes lie in 45 degree line (Figure 1). But there are 10 genes that have significantly different log2foldchange. In some cases, the difference is very huge like complete analysis log2foldchange = 16 and subsetted log2foldchange = 3. I am concerned about this because out of these 10 genes, 3 of the genes are top DE genes in the complete analysis case (FDR <= 0.05) and these genes are not significant in subsetted case. (Figure 2: shows scatter plot for these 10 genes). Comparing between these two cases may be not a perfect thing to do, still I am wondering whether this (these 3 genes) is a false positive case or not. (Given the FDR = 0.05 on an average I would expect ~60 false positives). I made sure that size factor is not contributing to this difference by providing the sizefactor value of subsetted analysis to complete sample analysis. Also when I checked the normalized read counts for both cases, which are very similar in values. Any insights would be greatly helpful.

Thanks,

Thomas

> sessionInfo()
R version 3.4.0 (2017-04-21)
DESeq2_1.16.1

deseq2 • 1.9k views

ADD COMMENT • link 6.9 years ago amalthomas111 • 0

score 1 · Answer 1 · 2017-06-05

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 40 minutes ago

United States

What happens to the LFC plot if you turn off outlier replacement in both analyses? DESeq(dds, minReplicatesForReplace=Inf)

ADD COMMENT • link 6.9 years ago Michael Love 41k

0

Entering edit mode

0

amalthomas111

1 minute ago by

amalthomas111 • 0

Hi Michael,

Thanks for prompt reply. From a quick look into the new analysis: Including the outlier replacement for the complete analysis did not change change anything (at least the number of DE genes). For the subsetted case, the number of DE genes slightly changed. Those 3 genes are still significant in complete case with very high LFC difference as compared to the latter where they are not significant (padj <0.05).

Just to give you an idea about the magnitude of the LFC change for these 3 genes.

LFC complete case LFC subsetted case

gene1 16.95 3.99

gene2 15.08 2.13

gene2 13.76 0.69

I should also add that, in the subsetted case, these three genes have padj = NA. They didn't pass the filter threshold. Similar was the case before without the (minReplicatesForReplace=Inf).

ADD REPLY • link 6.9 years ago amalthomas111 • 0

0

Entering edit mode

The log fold change should be basically identical if you turn off outlier replacement, regardless of what other groups are included. Can you show all your code? And you might want to look at plotCounts() for these genes.

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Here's code for complete analysis case. For subsetted case only difference is there in the input file:

df = read.table("RNAseq_22ndMay2017.tsv")
nrow(df)
df = df[rowSums(df) > 1,]
nrow(df)
design = read.table("RNAseq_design.txt",sep="\t",col.names = c("filename","batchid","tissue","metstype","cellline"))

df_design = as.data.frame(design[match(colnames(df),design$filename),])
design.matrix = data.frame(row.names = df_design$filename,tissue =as.factor(df_design$tissue),
                           batch=as.factor(paste0("batch",df_design$batchid)),metstype=as.factor(df_design$metstype),
                           cellline = as.factor(df_design$cellline))
design.matrix

dds  = DESeqDataSetFromMatrix(countData= df, colData= design.matrix, ~ tissue )
dds = DESeq(dds)#, minReplicatesForReplace=Inf)

res = results(dds, alpha=0.05, contrast = c("tissue","50brai","50H"))
summary(res)

plotCounts for genes (that shows high difference) do not agree with the high reported LFC. See the figures in markdown PDF file.

Markdown PDF can be found at link.

Complete code: link

ADD REPLY • link 6.9 years ago amalthomas111 • 0

0

Entering edit mode

Outlier replacement is on in this code. Can you confirm there is a LFC difference after subsetting with outlier replacement off?

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

With outlier replacement off:

For subsetted analysis: ( If I filter the count tables with df[rowSums(df) > 1,] )

> res["ENSG00000196154",]
log2 fold change (MLE): tissue 50brai vs 50H
Wald test p-value: tissue 50brai vs 50H
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat    pvalue      padj
                <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000196154  1.948717       3.999509  2.760516  1.448826 0.1473861        NA
> res["ENSG00000104332",]
log2 fold change (MLE): tissue 50brai vs 50H
Wald test p-value: tissue 50brai vs 50H
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat    pvalue      padj
                <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000104332 0.5305491       2.137836  3.411102 0.6267289  0.530837        NA
> res["ENSG00000149948",]
log2 fold change (MLE): tissue 50brai vs 50H
Wald test p-value: tissue 50brai vs 50H
Error in DataFrame(object) : missing values in 'row.names'

If no filtering of count tables using df[rowSums(df) > 1,], then for subsetted case:

> res["ENSG00000196154",]
log2 fold change (MLE): tissue 50brai vs 50H 
Wald test p-value: tissue 50brai vs 50H 
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat    pvalue      padj
                <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000196154  1.948717       3.999509  2.760516  1.448826 0.1473861        NA
> res["ENSG00000104332",]
log2 fold change (MLE): tissue 50brai vs 50H 
Wald test p-value: tissue 50brai vs 50H 
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat    pvalue      padj
                <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000104332 0.5305491       2.137836  3.411102 0.6267289  0.530837        NA
> res["ENSG00000149948",]
log2 fold change (MLE): tissue 50brai vs 50H 
Wald test p-value: tissue 50brai vs 50H 
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat    pvalue      padj
                <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000149948 0.1548006      0.6966229  4.181248 0.1666064 0.8676797        NA

For complete case:

log2 fold change (MLE): tissue 50brai vs 50H 
Wald test p-value: tissue 50brai vs 50H 
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000196154  55.77841       16.95991  2.009837  8.438447 3.215818e-17 3.452664e-13
log2 fold change (MLE): tissue 50brai vs 50H 
Wald test p-value: tissue 50brai vs 50H 
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000104332  40.77479         15.083  3.119895  4.834458 1.335088e-06 0.0002866835
log2 fold change (MLE): tissue 50brai vs 50H 
Wald test p-value: tissue 50brai vs 50H 
DataFrame with 1 row and 6 columns
                 baseMean log2FoldChange     lfcSE      stat      pvalue        padj
                <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
ENSG00000149948  86.96833       13.76065  3.725956  3.693186 0.000221462 0.009398128

ADD REPLY • link 6.9 years ago amalthomas111 • 0

0

Entering edit mode

Can you also confirm that the raw counts are exactly the same across the subset and full analysis? By printing out counts(dds)[gene,] for the subset and counts(dds)[gene,samples] for the full dds? Just for one of the genes with changed LFC is enough.

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Yes raw read counts are same in both.

> data.frame(counts(dds)["ENSG00000196154",])
             counts.dds...ENSG00000196154....
X50brai.56.3                                5
X50brai.58.3                                0
X50brai.71.1                                0
X50brai.83.2                                6
X50brai.84.2                                3
X50HR                                       0
X50H                                        0

Normalized read counts:
subsetted case:

> data.frame(counts(dds,normalized= TRUE )["ENSG00000196154",])
             counts.dds..normalized...TRUE...ENSG00000196154....
X50brai.56.3                                            5.418020
X50brai.58.3                                            0.000000
X50brai.71.1                                            0.000000
X50brai.83.2                                            4.854396
X50brai.84.2                                            3.368603
X50HR                                                   0.000000
X50H                                                    0.000000

complete case


X50brai.56.3                                            5.658075
X50brai.58.3                                            0.000000
X50brai.71.1                                            0.000000
X50brai.83.2                                            5.065489
X50brai.84.2                                            3.584338
X50HR                                                   0.000000
X50H                                                    0.000000

Could the zero read counts in both 50H and 50HR (one of the groups(50H group in my contrast) ) causing some issue)? But that still doesn't explain huge LFC difference between both cases.

ADD REPLY • link 6.9 years ago amalthomas111 • 0

0

Entering edit mode

Yes, actually the LFC is undefined in these cases, so that explains why mostly the LFC were identical except a few genes with two groups with no expression. And the dispersion plays a big role in calculating the Wald statistic and the dispersion estimates will be very different when you just have two groups (one with n=2) vs with three groups. I'd recommend you use the three group analysis for more informative dispersion estimates but filter out genes from the beginning that do not have at least two samples with counts greater or equal to 10.

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Thanks again Michael. I am bit confused about how to do the filtering. In my data, the two groups :50brai vs 50H in my contrast (under my condition tissue), have very low expression for these 10 genes but there are other samples with enough reads for this genes. If I am using the entire samples for better dispersion estimate how would I do the filtering?

> head(data.frame(ENSG00000196154_counts=counts(dds)["ENSG00000196154",],tissue=design.matrix$tissue),n=29)
             ENSG00000196154_counts tissue
X07H.R3                          87    07H
X07HR                            22    07H
X07H                             11    07H
X07kidn.80.1                     11 07kidn
X07lung.25.2                     86 07lung
X07lung.28.2                      7 07lung
X07lung.30.2                     39 07lung
X07lung.71.1                    137 07lung
X07lung.77.1                      6 07lung
X07lymp.30.2                    216 07lymp
X07ovar.12.1                     39 07ovar
X07ovar.27.2                      3 07ovar
X07ovar.28.2                     11 07ovar
X07ovar.30.2                      3 07ovar
X42brai.66.1                      0 42brai
X42brai.69.1                    501 42brai
X42brai.74.1                      4 42brai
X42HR                            54    42H
X42H                            222    42H
X42ovar.73.1                      3 42ovar
X50bone.83.2                     14 50bone
X50bone.84.2                     23 50bone
X50brai.56.3                      5 50brai
X50brai.58.3                      0 50brai
X50brai.71.1                      0 50brai
X50brai.83.2                      6 50brai
X50brai.84.2                      3 50brai
X50HR                             0    50H
X50H                              0    50H

Filtering based on rowSums(df) might not work right, since other samples have reasonable read counts for this gene?

Do you mean for each contrast first find those genes with at least 10 counts in more than two samples. Then estimate the dispersion of these genes using all samples and the do the contrast? For e.g. say for the contrast 50brai vs 50H

df = read.table("countable.tsv")
df_subset = df[,grepl("50brai",colnames(df)) |grepl("50H",colnames(df))]
head(df_subset)
df_new = df[rownames(df_subset[rowSums(df_subset >10) > 2,]),]
nrow(df)
nrow(df_new)
dds = DESeqDataSetFromMatrix(countData= df_new, colData= design.matrix, ~ tissue )
dds = DESeq(dds)

Thanks

ADD REPLY • link 6.9 years ago amalthomas111 • 0

1

Entering edit mode

You can filter after results if you want, to just filter for these genes: keep <- rowSums(counts(dds[,dds$tissue %in% c("X","Y")]) >= 10) >= 2)