Kallisto with tximport gives different results with asymmetrical experimental design
1
0
Entering edit mode
@thibaultlorin34-14148
Last seen 5.9 years ago

Hello all,

I had to remove samples from my data and I end up with comparing 3 samples (cond1) vs. 2 samples (cond2) — I know that I'm not in an ideal situation. I want to get the most DEGs. I have run kallisto and import the data in R using `tximport`.

The problem is that when I compare 2 vs 1 I dont't have the same results that when I compare 1 vs 2... o.O I can share any data with whoever interested.

Here are my files: samples 13 and 14 correspond to cond1, and 16 to 18 to cond2 (I removed file 15 because I was not satisfied with it).

sample13
"TTLN13/abundance.tsv"
sample14
"TTLN14/abundance.tsv"
sample16
"TTLN16/abundance.tsv"
sample17
"TTLN17/abundance.tsv"
sample18
"TTLN18/abundance.tsv"

I import them, no problem:

txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene)

pairwise  #some french headers #pairwise is a subset of a bigger dataframe
  sample stade ponte jour lane
1   S1_1    S1   OP9    1    1
2   S1_2    S1   OP9    1    2
4   S2_1    S2   OP9    4    4
5   S2_2    S2   OP9    2    1
6   S2_3    S2   OP9    2    2

> dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, pairwise, ~stade)
factor levels were dropped which had no samples
using counts and average transcript lengths from tximport
> dds
class: DESeqDataSet
dim: 21877 5
metadata(1): version
assays(2): counts avgTxLength
rownames(21877): '' aaas ... zzef1 zzz3
rowData names(0):
colnames(5): sample13 sample14 sample16 sample17 sample18
colData names(5): sample stade ponte jour lane
> dds <- dds[ rowSums(counts(dds)) > 1, ]
> dds <- DESeq(dds)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> counts <- counts(dds,norm=TRUE)
> counts <- as.data.frame(counts)
> res <- results(dds)
> markers <- res[abs(res$log2FoldChange)>2 & res$padj < 0.05 & !is.na(res$padj), ]
> markers
log2 fold change (MAP): stade S2 vs S1
Wald test p-value: stade S2 vs S1
DataFrame with 9 rows and 6 columns
                baseMean log2FoldChange     lfcSE      stat       pvalue         padj
               <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
cpa1          6444.24007       2.441987 0.2732529  8.936729 4.008585e-19 1.303057e-15
kif14          251.38984       2.088346 0.2748661  7.597683 3.014796e-14 5.345508e-11
LOC103353161   162.16751       3.518848 0.2916084 12.067033 1.577185e-33 3.076141e-29
LOC103358230 17178.41284       2.127152 0.2182673  9.745631 1.925790e-22 9.390154e-19
LOC103359291 17611.56096       2.092522 0.2768940  7.557124 4.120793e-14 6.697663e-11
LOC103361754    82.92328       2.646414 0.3090550  8.562923 1.100413e-17 2.384717e-14
LOC103373239   460.27637       2.124864 0.2425778  8.759518 1.960866e-18 4.780592e-15
LOC103373302 18387.41060       2.354269 0.2583432  9.112950 8.017192e-20 3.127346e-16
LOC103376355   103.27335       3.573329 0.3130216 11.415599 3.494866e-30 3.408193e-26

 

Now, I do the same as before, but I reverse the order of the samples in the condition dataframe --> the results are different!

 

> pairwise2 <- pairwise
> pairwise2[1:3,] <- pairwise[3:5,]
> pairwise2[4:5,] <- pairwise[1:2,]
> pairwise2
  sample stade ponte jour lane
1   S2_1    S2   OP9    4    4
2   S2_2    S2   OP9    2    1
4   S2_3    S2   OP9    2    2
5   S1_1    S1   OP9    1    1
6   S1_2    S1   OP9    1    2
> dds2 <- DESeqDataSetFromTximport(txi.kallisto.tsv, pairwise2, ~stade)
factor levels were dropped which had no samples
using counts and average transcript lengths from tximport
> dds2
class: DESeqDataSet
dim: 21877 5
metadata(1): version
assays(2): counts avgTxLength
rownames(21877): '' aaas ... zzef1 zzz3
rowData names(0):
colnames(5): sample13 sample14 sample16 sample17 sample18
colData names(5): sample stade ponte jour lane
> dds2 <- dds2[ rowSums(counts(dds2)) > 1, ]
> class(counts(dds2))
[1] "matrix"
> dds2 <- DESeq(dds2)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> counts2 <- counts(dds2,norm=TRUE)
> counts2 <- as.data.frame(counts2)
> res2 <- results(dds2)
> markers2 <- res2[abs(res2$log2FoldChange)>2 & res2$padj < 0.05 & !is.na(res2$padj), ]
> markers2
log2 fold change (MAP): stade S2 vs S1
Wald test p-value: stade S2 vs S1
DataFrame with 3 rows and 6 columns
              baseMean log2FoldChange     lfcSE       stat       pvalue         padj
             <numeric>      <numeric> <numeric>  <numeric>    <numeric>    <numeric>
LOC103361292  81.60711       3.288638 0.3039721  10.818880 2.801800e-27 2.793535e-23
LOC103361620 109.93052       2.289659 0.3047828   7.512427 5.804097e-14 3.857983e-10
LOC103376346 441.07780      -4.524589 0.2573114 -17.584097 3.261354e-69 6.503466e-65

###

I tried this with a symmetrical design (n=3 in both conditions), and there is no problem: results are identical when comparing cond1 to cond2 and when comparing cond2 to cond1. My problem seems to be due to an asymmetrical design. Would you have any idea of what is happening?

Many thanks!

kallisto deseq2 tximport • 1.4k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 4 hours ago
United States

Please read over the DESeq2 vignette,

vignette("DESeq2")

which describes under 'Differential expression analysis' that the correct way to make different comparisons of groups is using the 'contrast' argument of the results() function.

Above, when you re-arranged the samples, you've mismatched the samples with their labels. You re-arranged the sample table but not the columns of the matrices contained in the object returned by tximport(). This is mentioned as a critical step in the tximport vignette:

"The user should make sure the rownames of sampleTable align with the colnames of txi$counts, if there are colnames. The best practice is to read sampleTable from a CSV file, and to construct files from a column of sampleTable, as was shown in the tximport examples above."

Best practice is to have a sample table which specifies the file locations as well as the sample information, then run tximport, then build the DESeqDataSet. At that point you can re-arrange the order, as DESeqDataSet links the counts, and other assays, with the column data. And the correct way to switch the comparison is using 'contrast'

ADD COMMENT
0
Entering edit mode

Thanks Michael! Please correct me if I'm wrong. What I should do is import a sampleTable from a file with the file location info.

> head(cond)

# A tibble: 6 x 6
  sample stade ponte  jour  lane      file
   <chr> <chr> <chr> <int> <int>    <chr>
1   S1_1    S1   OP9     1     1 count/kallisto/TTLN13/abundance.tsv
2   S1_2    S1   OP9     1     2 /count/kallisto/TTLN14/abundance.tsv
3   S2_1    S2   OP9     4     4 count/kallisto/TTLN16/abundance.tsv
4   S2_2    S2   OP9     2     1 count/kallisto/TTLN17/abundance.tsv
5   S2_3    S2   OP9     2     2 count/kallisto/TTLN18/abundance.tsv
6   S3_1    S3   OP9     5     3 count/kallisto/TTLN19/abundance.tsv

Then, using this table (19 rows in total), import the whole kallisto dataset:

files <- cond$file
txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv)

Then import this dataset with the desired design

dds <- DESeqDataSetFromTximport(txi.kallisto.tsv, cond, ~ponte+stade)
dds <- dds[ rowSums(counts(dds)) > 1, ]
dds <- DESeq(dds)
counts <- counts(dds,norm=TRUE)
counts <- as.data.frame(counts)

And finally compare what I would like to compare (the so-called contrasts):

res <- results(dds,contrast=c("stade", "S2", "S1"))

Only then will my result be consistent from one comparison (S2 vs S1) to the other (S1 vs S2)?

markers <- res[abs(res$log2FoldChange)>2 & res$padj < 0.05 & !is.na(res$padj), ]
> markers
log2 fold change (MAP): stade S1 vs S2
Wald test p-value: stade S1 vs S2
DataFrame with 14 rows and 6 columns
                baseMean log2FoldChange     lfcSE      stat       pvalue         padj
               <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
cib3            14.94635       2.378058 0.5276680  4.506731 6.583392e-06 0.0068116157
cmpk2           63.79279      -2.175851 0.5389369 -4.037302 5.406953e-05 0.0279548843
...                  ...            ...       ...       ...          ...          ...
LOC103373390    54.55987      -2.787341 0.4958837 -5.620957 1.899021e-08 8.252387e-05
LOC103374201   416.65274      -2.255090 0.3326805 -6.778547 1.213906e-11 2.637574e-07
ADD REPLY
1
Entering edit mode

Yes, and above you had switched the sample table around but not switched the count data (and other associated data compiled by tximport). DESeqDataSet and other Bioconductor objects link the assay data (counts) and the sample information.

ADD REPLY

Login before adding your answer.

Traffic: 583 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6