Question

DESeq2 gives weird results

0

Entering edit mode

chipolino • 0

@chipolino-15565

Last seen 4.5 years ago

Hi,

I am doing DE analysis. I have 4 conditions: control, drug1, drug2, drug3 (2 replicates for each condition). I started with Salmon to count transcripts and imported them with tximport. After this I constructed DESeq data set using DESeqDataSetFromTximport following this tutorial. Then I tried to compare control with drug1. Here is my code:

>library(tximportData)
>library(tximport)
>library(tximeta)
>library(DESeq2)
>library(GenomicFeatures)

>files <- file.path("path_to_salmon_files", "quants", metadata$Quant, "quant.sf")
>names(files) <- metadata$Samples
>TxDb <- makeTxDbFromGFF(file="gencode.v30.annotation.gff3.gz")

>k <- keys(TxDb, keytype = "TXNAME")
>tx2gene <- AnnotationDbi::select(TxDb, k, "GENEID", "TXNAME")

>txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE)

>ddsTxi <-DESeqDataSetFromTximport(txi, colData = metadata, design =~Treatment)

>keep <-rowSums(counts(ddsTxi))>=10
>ddsTxi <- ddsTxi[keep,]

#"drug1_1", "drug1_2" are technical replicates
>xt <- ddsTxi[,c("drug1_1", "drug1_2", "control1", "control2")]
>yt <- factor(c("D", "D", "C", "C"))
>xt$Treatment <- yt
>xt <-DESeq(xt)
>res <-results(xt, name="Treatment_C_vs_D")

>resOrdered <- res[order(res$padj),]
>results <-as.data.frame(resOrdered)

Then when I check results I see this information for gene ETV6:

| baseMean  | log2FoldChange    | lfcSE     | stat      | padj      |
|---------- |----------------   |-------    |------     |------     |
| 661.72988 | -3.5438716        | 0.3112069 | -11.387510| 2.982196e-26   |

Which basically means that the gene is downregulated in control compared to drug1 (with log fold change 3.5). But if I look at raw counts, this is what I see:

| drug1_1 | drug1_2      | control1 | control2 |
|----------|--------------|-------  |------    |
| 280      | 357          | 396     | 401      |

So according to raw counts, there is no significant change in control compared to treatment. And almost all counts come from the single transcript in all conditions. All library sizes are very similar (~15M). What might be the reason for such log fold change in DESeq analysis? Can it be that in raw counts there is no change, but due to normalization relative levels of this gene in the drug treatment (compared to all other genes) increased?

P.S. When I analyze the same data with limma-voom, I get logFC for this gene 0.23 (increase in control compared to drug1).

deseq2 salmon • 1.2k views

ADD COMMENT • link updated 4.9 years ago by Michael Love 41k • written 4.9 years ago by chipolino • 0

score 0 · Answer 1 · 2019-05-30

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

Note that the base mean for the gene you have selected is 661.7. But the counts that you posted are all smaller.

What are the normalized counts instead of the raw counts?

ADD COMMENT • link 4.9 years ago Michael Love 41k

0

Entering edit mode

Thank you for your response!

Here are the normalized counts:

| drug1_1  | drug1_2      | control1 | control2  |
|----------|--------------|-------   |------     |
| 1293.573 | 1144.279     | 90.263   | 118.803   |

I also checked the quant.sf files from Salmon for this gene and that's what I see:

Drug1

|Transcript| Length  | EffectiveLength | TPM       |NumReads   |
|----------|---------|-------          |------     |------     |
| ETV6-202 | 5989    | 5727.000        | 0.130581  | 275.035   | 
| ETV6-203 | 572     | 323.000         | 0         | 0         | 
| ETV6-204 | 733     | 484.000         | 0         | 0         | 
| ETV6-205 | 652     | 403.000         | 0         | 0         | 
| ETV6-201 | 160     | 6.178           | 2.185476  | 4.965     |

Drug2

|Transcript| Length  | EffectiveLength | TPM       |NumReads   |
|----------|---------|-------          |------     |------     |
| ETV6-202 | 5989    | 5727.000        | 0.107006  | 351.076   | 
| ETV6-203 | 572     | 323.000         | 0         | 0         | 
| ETV6-204 | 733     | 484.000         | 0.004320  | 1.198     | 
| ETV6-205 | 652     | 403.000         | 0         | 0         | 
| ETV6-201 | 160     | 6.178           | 1.354557  | 4.794     |

Control1

|Transcript| Length  | EffectiveLength | TPM       |NumReads   |
|----------|---------|-------          |------     |------     |
| ETV6-202 | 5989    | 5727.000        | 0.191470  | 396.000   | 
| ETV6-203 | 572     | 323.000         | 0         | 0         | 
| ETV6-204 | 733     | 484.000         | 0         | 0         | 
| ETV6-205 | 652     | 403.000         | 0         | 0         | 
| ETV6-201 | 160     | 6.178           | 0         | 0         |

Control2

|Transcript| Length  | EffectiveLength | TPM       |NumReads   |
|----------|---------|-------          |------     |------     |
| ETV6-202 | 5989    | 5727.000        | 0.084566  | 393.219   | 
| ETV6-203 | 572     | 323.000         | 0         | 0         | 
| ETV6-204 | 733     | 484.000         | 0.006440  | 2.531     | 
| ETV6-205 | 652     | 403.000         | 0.016047  | 5.251     | 
| ETV6-201 | 160     | 6.178           | 0         | 0         |

So looks like the overall difference comes from ETV6-201 transcript... but it has just a few reads in the drug treatment. How can I take this into account? Should I use limma-voom instead?

ADD REPLY • link 4.9 years ago chipolino • 0

1

Entering edit mode

The issue is not limma-voom vs DESeq2, but using counts alone vs. using the transcript lengths.

I'm suspicious, like you are, about the large change in normalized counts from the expression switching to the short isoform, given there are only a few reads allocated to that transcript.

My quick fix for this dataset would be to only use the txi$counts, and pass these to DESeqDataSetFromMatrix, so tossing out the average transcript lengths for normalization, because we don't trust it in the above example. A more involved approach to deal with these short isoforms would be to generate the transcript-level matrices with tximport, then remove all the transcripts with a very low read count, then summarizeToGene and pass to DESeqDataSetFromTximport.

ADD REPLY • link 4.9 years ago Michael Love 41k