Question

Suggestions for avoiding False Positives while identifying differentially expressed genes with more than 3 samples per group

0

Entering edit mode

HD • 0

@hd-21574

Last seen 4.6 years ago

Hi,

I have RNA seq data with the samples defined as follows:

   **ID level    stability  type**
        A1    med     unstable    exp
        A2    med     unstable    exp
        A3    med     unstable    exp
        B1    med       stable    exp
        B2    med       stable    exp
        B3    med       stable    exp
        C1    low       stable    exp
        C2    low       stable    exp
        C3    low       stable    exp
        D1    low     unstable    exp
        D2    low     unstable    exp
        D3    low     unstable    exp
        E1   high       stable    exp
        E2   high       stable    exp
        E3   high       stable    exp
        F1   high     unstable    exp
        F2   high     unstable    exp
        F3   high     unstable    exp
        G1   host         host  host1
        G2   host         host  host1
        G3   host         host  host1
        H1   host         host  host2
        H2   host         host  host2
        H3   host         host  host2
        I1   host         host  host3
        I2   host         host  host3
        I3   host         host  host3

ID column here describes the samples in 3 biological replicates. My objective is to identify DEGs while comparing different samples in pair (A vs B, B vs C, A vs D etc) as well as in group-wise comparison (stable vs unstable, host vs expressed). But when I define a contrast such as "contrast = c("stability","unstable","stable")", I get some DEGs with just one of the replicates from 3 different stable or unstable samples being high or low in comparison to others as well.

As I understand that's because when the program identifies a gene in at least 3 samples within a group of the same profile (irrespective of being just 1 of the replicates of 3 different samples) with significant difference from the rest, it reports it as a DEG. However, I would like to know if there is some parameter that can be introduced to say that more than 3 samples in a group of 9 (may be at least 6 samples in our case) are required to have concordance to be reported as DEG.

If not, then can someone kindly suggest some other way to do avoid getting DEGs not representative of entire group but just 3 samples of a big group.

Thanks !!

deseq2 • 766 views

ADD COMMENT • link updated 4.7 years ago by Michael Love 41k • written 4.7 years ago by HD • 0

score 0 · Answer 1 · 2019-08-12

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 6 hours ago

United States

You can use a minimal count filter to remove such cases most likely:

keep = rowSums(counts(dds) >= x) >= y dds = dds[keep,]

You might try x of 10 and y of 3 so that at least 3 samples must have a count of 10 or higher.

ADD COMMENT • link 4.7 years ago Michael Love 41k

0

Entering edit mode

Thanks for your response Michael. But my problem is not with filtering the raw count matrix. I use the same before moving ahead to differential expression analysis.

When I call differentially expressd genes in group of stable (9 samples = 3 stable with all 3 replicates each) vs unstable (another 9 samples = 3 unstable with all 3 replicates each) , I get some genes which have high level only in B2, C3, E3. So, only one of the 3 replicates from each sample. Here I want to set some parameter to report DE only if 6 or more of the 9 stable samples have high level in comparison to the 9 in unstable. Is there some option like that ? Or can you recommend another way to identify DEGs in stable vs unstable as per my sample annotation.

Thanks !!

ADD REPLY • link 4.7 years ago HD • 0

0

Entering edit mode

Can you post an example of the plotCounts for such a gene, as well as the LFC reported from lfcShrink. E.g.:

res <- lfcShrink(dds, coef="...", type="apeglm")

Usually such genes have an LFC that is moderated by a Bayesian approach

ADD REPLY • link 4.7 years ago Michael Love 41k

0

Entering edit mode

So here is what I did... I don't understand why I don't see unstablevsstable condition listed in resultsNamesDDS. But counts are as mentioned below. A gene is being reported DE based on high counts in B3, D1 and F3 while comparing stable and unstable.

#Create input for DE analysis
batch <- DESeqDataSetFromMatrix( countData = counts,colData  = sample_ann,design = ~ stability)
#Pre-filtering of the dataset
batch <- estimateSizeFactors(batch)
keep <- rowSums(counts(batch) >= 10) >= 3
batch.filter<- batch[keep,]
vsd <- vst(batch.filter)
#Identifying DEGs
DDS_batch <- DESeq(batch.filter)
de_results<-results(DDS_batch, contrast = c("stability","unstable","stable"),lfcThreshold=0, independentFiltering =T, pAdjustMethod="BH", alpha = 0.01)
#Export sig DE list
summary(de_results)
sig_de_results <-subset(de_results,  abs(log2FoldChange)>1 & padj < 0.001)
sig_de_results <- sig_de_results[order(sig_de_results$log2FoldChange, decreasing=T),]
summary(sig_de_results)
#Plot significant DEGs
topVarGenes <- rownames(sig_de_results)
hmcol <- brewer.pal(11,'RdBu')
nCounts <- counts(DDS_batch , normalized=TRUE)
heatmap(as.matrix(nCounts[topVarGenes, ]), Rowv = NA,Colv = NA, col = hmcol, mar = c(8,2), keep.dendro = FALSE)
#Counts of weird DEG reported
nCounts["LOC100774944",]

A1      low     stable     exp  234.4729
A2      low     stable     exp  250.4983
A3      low     stable     exp  241.4016
B1      low   unstable     exp  284.9478
B2      low   unstable     exp  239.4944
B3      low   unstable     exp  24684.5542
C1      med     stable     exp  197.0878
C2      med     stable     exp  222.2543
C3      med     stable     exp  195.8788
D1      med   unstable     exp  14254.4449
D2      med   unstable     exp  165.1056
D3      med   unstable     exp  1162.1804
E1     high     stable     exp  207.6912
E2     high     stable     exp  341.6933
E3     high     stable     exp  201.4843
F1     high   unstable     exp  171.2375
F2     high   unstable     exp  175.9629
F3     high   unstable     exp  21480.3368
G1     host       host   host1  198.9543
G2     host       host   host1  244.1278
G3     host       host   host1  176.3735
H1     host       host   host2  227.679
H2     host       host   host2  260.2603
H3     host       host   host2  242.7947
I1     host       host   host3  223.5859
I2     host       host   host3  16864.951
I3     host       host   host3  246.7982

#lfcShrink
resultsNames(DDS_batch)
[1] "Intercept""stability_stable_vs_host"  
[3] "stability_unstable_vs_host"
lfcShrink(DDS_batch, coef="stability_unstable_vs_stable", type="apeglm")
Error in lfcShrink(DDS_batch, coef = "stability_unstable_vs_stable", type = "apeglm") : 
 coef %in% resultsNamesDDS is not TRUE

ADD REPLY • link 4.7 years ago HD • 0

0

Entering edit mode

This is described in the vignette. See note on factor levels. You should set the reference level of interest.

If you can post an image of a single gene using plotCounts, that would help.

ADD REPLY • link 4.7 years ago Michael Love 41k

0

Entering edit mode

Result from lfcShrink is as follows:

dds<-batch.filter
dds$stability <- factor(dds$stability, levels = c("stable","unstable", "host"))
design(dds) <- ~ stability
DDS_batch <- DESeq(dds)
res<-lfcShrink(DDS_batch, coef="stability_unstable_vs_stable", type="apeglm")
res["LOC100774944",]
log2 fold change (MAP): stability unstable vs stable 
Wald test p-value: stability unstable vs stable 
DataFrame with 1 row and 5 columns
                     baseMean   log2FoldChange            lfcSE
                    <numeric>        <numeric>        <numeric>
LOC100774944 3088.75007019989 4.48346113381571 1.03333173945058
                           pvalue                 padj
                        <numeric>            <numeric>
LOC100774944 4.76558220130759e-07 0.000320088271187826

If you check the heatmap, it shows differential expression when at least any 3 columns are differentially expressed. But when considering a group I would want that at least 2 samples with at least 2 replicates in a group are having significantly high or low values from the other group (apart from the problem with the gene I mentioned before).

plotCounts(DDS_batch, "LOC100774944", intgroup="stability")

plotCounts

topVarGenes <- rownames(sig_de_results)
nCounts <- counts(DDS_batch , normalized=TRUE)
mat<-as.matrix(log2(nCounts[topVarGenes, ]+1))
mat  <- mat - rowMeans(mat)
pheatmap(mat,cluster_rows=F, cluster_cols=F)

heatmap

ADD REPLY • link 4.7 years ago HD • 0

0

Entering edit mode

I see four samples in the unstable group supporting a large LFC. This is clearly not null so the small p-value is justified. I wouldn't characterize these are FP.

You could also try SAMseq which works on ranks if you aren't interested so much in the LFC but more in the ranks. We find it performs well. We also have developed an extension of SAMseq for working downstream of the Salmon quantification method, called Swish.