Question: Suggestions for avoiding False Positives while identifying differentially expressed genes with more than 3 samples per group
0
8 days ago by
HD0
HD0 wrote:

Hi,

I have RNA seq data with the samples defined as follows:

   **ID level    stability  type**
A1    med     unstable    exp
A2    med     unstable    exp
A3    med     unstable    exp
B1    med       stable    exp
B2    med       stable    exp
B3    med       stable    exp
C1    low       stable    exp
C2    low       stable    exp
C3    low       stable    exp
D1    low     unstable    exp
D2    low     unstable    exp
D3    low     unstable    exp
E1   high       stable    exp
E2   high       stable    exp
E3   high       stable    exp
F1   high     unstable    exp
F2   high     unstable    exp
F3   high     unstable    exp
G1   host         host  host1
G2   host         host  host1
G3   host         host  host1
H1   host         host  host2
H2   host         host  host2
H3   host         host  host2
I1   host         host  host3
I2   host         host  host3
I3   host         host  host3


ID column here describes the samples in 3 biological replicates. My objective is to identify DEGs while comparing different samples in pair (A vs B, B vs C, A vs D etc) as well as in group-wise comparison (stable vs unstable, host vs expressed). But when I define a contrast such as "contrast = c("stability","unstable","stable")", I get some DEGs with just one of the replicates from 3 different stable or unstable samples being high or low in comparison to others as well.

As I understand that's because when the program identifies a gene in at least 3 samples within a group of the same profile (irrespective of being just 1 of the replicates of 3 different samples) with significant difference from the rest, it reports it as a DEG. However, I would like to know if there is some parameter that can be introduced to say that more than 3 samples in a group of 9 (may be at least 6 samples in our case) are required to have concordance to be reported as DEG.

If not, then can someone kindly suggest some other way to do avoid getting DEGs not representative of entire group but just 3 samples of a big group.

Thanks !!

deseq2 • 72 views
modified 8 days ago by Michael Love24k • written 8 days ago by HD0
Answer: Suggestions for avoiding False Positives while identifying differentially expres
0
8 days ago by
Michael Love24k
United States
Michael Love24k wrote:

You can use a minimal count filter to remove such cases most likely:

keep = rowSums(counts(dds) >= x) >= y dds = dds[keep,]

You might try x of 10 and y of 3 so that at least 3 samples must have a count of 10 or higher.

Thanks for your response Michael. But my problem is not with filtering the raw count matrix. I use the same before moving ahead to differential expression analysis.

When I call differentially expressd genes in group of stable (9 samples = 3 stable with all 3 replicates each) vs unstable (another 9 samples = 3 unstable with all 3 replicates each) , I get some genes which have high level only in B2, C3, E3. So, only one of the 3 replicates from each sample. Here I want to set some parameter to report DE only if 6 or more of the 9 stable samples have high level in comparison to the 9 in unstable. Is there some option like that ? Or can you recommend another way to identify DEGs in stable vs unstable as per my sample annotation.

Thanks !!

Can you post an example of the plotCounts for such a gene, as well as the LFC reported from lfcShrink. E.g.:

res <- lfcShrink(dds, coef="...", type="apeglm")


Usually such genes have an LFC that is moderated by a Bayesian approach

So here is what I did... I don't understand why I don't see unstablevsstable condition listed in resultsNamesDDS. But counts are as mentioned below. A gene is being reported DE based on high counts in B3, D1 and F3 while comparing stable and unstable.

#Create input for DE analysis
batch <- DESeqDataSetFromMatrix( countData = counts,colData  = sample_ann,design = ~ stability)
#Pre-filtering of the dataset
batch <- estimateSizeFactors(batch)
keep <- rowSums(counts(batch) >= 10) >= 3
batch.filter<- batch[keep,]
vsd <- vst(batch.filter)
#Identifying DEGs
DDS_batch <- DESeq(batch.filter)
de_results<-results(DDS_batch, contrast = c("stability","unstable","stable"),lfcThreshold=0, independentFiltering =T, pAdjustMethod="BH", alpha = 0.01)
#Export sig DE list
summary(de_results)
sig_de_results <-subset(de_results,  abs(log2FoldChange)>1 & padj < 0.001)
sig_de_results <- sig_de_results[order(sig_de_results$log2FoldChange, decreasing=T),] summary(sig_de_results) #Plot significant DEGs topVarGenes <- rownames(sig_de_results) hmcol <- brewer.pal(11,'RdBu') nCounts <- counts(DDS_batch , normalized=TRUE) heatmap(as.matrix(nCounts[topVarGenes, ]), Rowv = NA,Colv = NA, col = hmcol, mar = c(8,2), keep.dendro = FALSE) #Counts of weird DEG reported nCounts["LOC100774944",] A1 low stable exp 234.4729 A2 low stable exp 250.4983 A3 low stable exp 241.4016 B1 low unstable exp 284.9478 B2 low unstable exp 239.4944 B3 low unstable exp 24684.5542 C1 med stable exp 197.0878 C2 med stable exp 222.2543 C3 med stable exp 195.8788 D1 med unstable exp 14254.4449 D2 med unstable exp 165.1056 D3 med unstable exp 1162.1804 E1 high stable exp 207.6912 E2 high stable exp 341.6933 E3 high stable exp 201.4843 F1 high unstable exp 171.2375 F2 high unstable exp 175.9629 F3 high unstable exp 21480.3368 G1 host host host1 198.9543 G2 host host host1 244.1278 G3 host host host1 176.3735 H1 host host host2 227.679 H2 host host host2 260.2603 H3 host host host2 242.7947 I1 host host host3 223.5859 I2 host host host3 16864.951 I3 host host host3 246.7982 #lfcShrink resultsNames(DDS_batch) [1] "Intercept""stability_stable_vs_host" [3] "stability_unstable_vs_host" lfcShrink(DDS_batch, coef="stability_unstable_vs_stable", type="apeglm") Error in lfcShrink(DDS_batch, coef = "stability_unstable_vs_stable", type = "apeglm") : coef %in% resultsNamesDDS is not TRUE  ADD REPLYlink written 8 days ago by HD0 This is described in the vignette. See note on factor levels. You should set the reference level of interest. If you can post an image of a single gene using plotCounts, that would help. ADD REPLYlink written 8 days ago by Michael Love24k Result from lfcShrink is as follows: dds<-batch.filter dds$stability <- factor(dds\$stability, levels = c("stable","unstable", "host"))
design(dds) <- ~ stability
DDS_batch <- DESeq(dds)
res<-lfcShrink(DDS_batch, coef="stability_unstable_vs_stable", type="apeglm")
res["LOC100774944",]
log2 fold change (MAP): stability unstable vs stable
Wald test p-value: stability unstable vs stable
DataFrame with 1 row and 5 columns
baseMean   log2FoldChange            lfcSE
<numeric>        <numeric>        <numeric>
LOC100774944 3088.75007019989 4.48346113381571 1.03333173945058
<numeric>            <numeric>
LOC100774944 4.76558220130759e-07 0.000320088271187826


If you check the heatmap, it shows differential expression when at least any 3 columns are differentially expressed. But when considering a group I would want that at least 2 samples with at least 2 replicates in a group are having significantly high or low values from the other group (apart from the problem with the gene I mentioned before).

plotCounts(DDS_batch, "LOC100774944", intgroup="stability")


topVarGenes <- rownames(sig_de_results)
nCounts <- counts(DDS_batch , normalized=TRUE)
mat<-as.matrix(log2(nCounts[topVarGenes, ]+1))
mat  <- mat - rowMeans(mat)
pheatmap(mat,cluster_rows=F, cluster_cols=F)


I see four samples in the unstable group supporting a large LFC. This is clearly not null so the small p-value is justified. I wouldn't characterize these are FP.

You could also try SAMseq which works on ranks if you aren't interested so much in the LFC but more in the ranks. We find it performs well. We also have developed an extension of SAMseq for working downstream of the Salmon quantification method, called Swish.