Hi,
I am doing DE analysis. I have 4 conditions: control, drug1, drug2, drug3 (2 replicates for each condition). I started with Salmon to count transcripts and imported them with tximport. After this I constructed DESeq data set using DESeqDataSetFromTximport following this tutorial. Then I tried to compare control with drug1. Here is my code:
>library(tximportData)
>library(tximport)
>library(tximeta)
>library(DESeq2)
>library(GenomicFeatures)
>files <- file.path("path_to_salmon_files", "quants", metadata$Quant, "quant.sf")
>names(files) <- metadata$Samples
>TxDb <- makeTxDbFromGFF(file="gencode.v30.annotation.gff3.gz")
>k <- keys(TxDb, keytype = "TXNAME")
>tx2gene <- AnnotationDbi::select(TxDb, k, "GENEID", "TXNAME")
>txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE)
>ddsTxi <-DESeqDataSetFromTximport(txi, colData = metadata, design =~Treatment)
>keep <-rowSums(counts(ddsTxi))>=10
>ddsTxi <- ddsTxi[keep,]
#"drug1_1", "drug1_2" are technical replicates
>xt <- ddsTxi[,c("drug1_1", "drug1_2", "control1", "control2")]
>yt <- factor(c("D", "D", "C", "C"))
>xt$Treatment <- yt
>xt <-DESeq(xt)
>res <-results(xt, name="Treatment_C_vs_D")
>resOrdered <- res[order(res$padj),]
>results <-as.data.frame(resOrdered)
Then when I check results I see this information for gene ETV6:
| baseMean | log2FoldChange | lfcSE | stat | padj |
|---------- |---------------- |------- |------ |------ |
| 661.72988 | -3.5438716 | 0.3112069 | -11.387510| 2.982196e-26 |
Which basically means that the gene is downregulated in control compared to drug1 (with log fold change 3.5). But if I look at raw counts, this is what I see:
| drug1_1 | drug1_2 | control1 | control2 |
|----------|--------------|------- |------ |
| 280 | 357 | 396 | 401 |
So according to raw counts, there is no significant change in control compared to treatment. And almost all counts come from the single transcript in all conditions. All library sizes are very similar (~15M). What might be the reason for such log fold change in DESeq analysis? Can it be that in raw counts there is no change, but due to normalization relative levels of this gene in the drug treatment (compared to all other genes) increased?
P.S. When I analyze the same data with limma-voom, I get logFC for this gene 0.23 (increase in control compared to drug1).
Thank you for your response!
Here are the normalized counts:
I also checked the quant.sf files from Salmon for this gene and that's what I see:
Drug1
Drug2
Control1
Control2
So looks like the overall difference comes from ETV6-201 transcript... but it has just a few reads in the drug treatment. How can I take this into account? Should I use limma-voom instead?
The issue is not limma-voom vs DESeq2, but using counts alone vs. using the transcript lengths.
I'm suspicious, like you are, about the large change in normalized counts from the expression switching to the short isoform, given there are only a few reads allocated to that transcript.
My quick fix for this dataset would be to only use the
txi$counts
, and pass these toDESeqDataSetFromMatrix
, so tossing out the average transcript lengths for normalization, because we don't trust it in the above example. A more involved approach to deal with these short isoforms would be to generate the transcript-level matrices withtximport
, then remove all the transcripts with a very low read count, thensummarizeToGene
and pass toDESeqDataSetFromTximport
.