Question

RNA-seq data analysis with two species using DESeq2

0

Entering edit mode

fl ▴ 20

@fl-16173

Last seen 2 days ago

Germany

Hello,

I'm working on a gene expression analysis with two non-model species. I followed the same field collecting and RNA preservation protocol for all samples and generated 24 libraries that were pooled and sequenced across three lanes. I'm following the steps below for differential expression analysis:

1. Perform structural annotation with draft genome assemblies (results show similar completeness and quality of annotation)
2. Use transcripts from gene models to build a kallisto index for each species and quantify
3. Identify one-to-one orthologs between the two species
4. Create copies of kallisto's "abundance.tsv" files, replace the "target_id" field with ortholog identifiers across samples, and discard quantification data from genes without orthologs
5. Import kallisto files with tximport and "txOut = TRUE"; the genome annotation does not include isoforms so I don't have a transcript-to-gene map
6. Run DESeq2 pipeline using a single factor (behavior) with four levels. One species has three levels and the other has one level.

Here is the result of a PCA with distances calculated from VTS data. PC1 clearly separates the two species and explains an unusually high percentage of the variance (80%).

https://ibb.co/cA51UT

I was wondering if there are major problems with the approach that I'm using. I rarely see studies using DESeq2 with more than one species, and I would appreciate any thoughts on this type of analysis or if it's appropriate to use DESeq2 for this purpose.

DESeq2 • 1.3k views

ADD COMMENT • link updated 5.7 years ago by Simon Anders ★ 3.7k • written 5.7 years ago by fl ▴ 20

0

Entering edit mode

Why are you surprised that species explains the bulk of the variance?

Also: Please always post the sample table with such questions and explain how the samples differ.

ADD REPLY • link 5.7 years ago Simon Anders ★ 3.7k

0

Entering edit mode

I'm not surprised that species explains most of the variance. However, I was wondering if there are underlying problems with using DESeq2 for this type of cross-species analysis. The system involves two closely related species with females that perform specialized behaviors, and "behavior4" in "sp2" might be considered a modification of "behavior2" in "sp1". I'm using this simple formula
dds <- DESeqDataSetFromTximport(txi_kallisto, colData = coldata, design = ~ condition)
and then evaluating three contrasts between "sp2" and "sp1" to test predictions of gene expression associated with certain behaviors. I have attached the data table below. Thank you!

| sample | condition |
|---------------|-----------|
| sp1_lvl1_rep1 | behavior1 |
| sp1_lvl1_rep2 | behavior1 |
| sp1_lvl1_rep3 | behavior1 |
| sp1_lvl1_rep4 | behavior1 |
| sp1_lvl1_rep5 | behavior1 |
| sp1_lvl1_rep6 | behavior1 |
| sp1_lvl2_rep1 | behavior2 |
| sp1_lvl2_rep2 | behavior2 |
| sp1_lvl2_rep3 | behavior2 |
| sp1_lvl2_rep4 | behavior2 |
| sp1_lvl3_rep1 | behavior3 |
| sp1_lvl3_rep2 | behavior3 |
| sp1_lvl3_rep3 | behavior3 |
| sp1_lvl3_rep4 | behavior3 |
| sp1_lvl3_rep5 | behavior3 |
| sp1_lvl3_rep6 | behavior3 |
| sp2_lvl4_rep1 | behavior4 |
| sp2_lvl4_rep2 | behavior4 |
| sp2_lvl4_rep3 | behavior4 |
| sp2_lvl4_rep4 | behavior4 |
| sp2_lvl4_rep5 | behavior4 |
| sp2_lvl4_rep6 | behavior4 |
| sp2_lvl4_rep7 | behavior4 |
| sp2_lvl4_rep8 | behavior4 |

ADD REPLY • link 5.7 years ago fl ▴ 20

score 0 · Answer 1 · 2018-08-02

If you find significant differences between the samples with behaviour 4 and behaviour 1, you cannot say whether this difference is due to the fact that the samples are from animals exhibiting different behavior or due to the fact that they are different species. Your observation that PCA shows most variance to be due to species makes the latter the much more likely explanation.

So, yes, there are major problems with your approach.