Hello,
I'm working on a gene expression analysis with two non-model species. I followed the same field collecting and RNA preservation protocol for all samples and generated 24 libraries that were pooled and sequenced across three lanes. I'm following the steps below for differential expression analysis:
1. Perform structural annotation with draft genome assemblies (results show similar completeness and quality of annotation)
2. Use transcripts from gene models to build a kallisto index for each species and quantify
3. Identify one-to-one orthologs between the two species
4. Create copies of kallisto's "abundance.tsv" files, replace the "target_id" field with ortholog identifiers across samples, and discard quantification data from genes without orthologs
5. Import kallisto files with tximport and "txOut = TRUE"; the genome annotation does not include isoforms so I don't have a transcript-to-gene map
6. Run DESeq2 pipeline using a single factor (behavior) with four levels. One species has three levels and the other has one level.
Here is the result of a PCA with distances calculated from VTS data. PC1 clearly separates the two species and explains an unusually high percentage of the variance (80%).
I was wondering if there are major problems with the approach that I'm using. I rarely see studies using DESeq2 with more than one species, and I would appreciate any thoughts on this type of analysis or if it's appropriate to use DESeq2 for this purpose.
Why are you surprised that species explains the bulk of the variance?
Also: Please always post the sample table with such questions and explain how the samples differ.
I'm not surprised that species explains most of the variance. However, I was wondering if there are underlying problems with using DESeq2 for this type of cross-species analysis. The system involves two closely related species with females that perform specialized behaviors, and "behavior4" in "sp2" might be considered a modification of "behavior2" in "sp1". I'm using this simple formula
dds <- DESeqDataSetFromTximport(txi_kallisto, colData = coldata, design = ~ condition)
and then evaluating three contrasts between "sp2" and "sp1" to test predictions of gene expression associated with certain behaviors. I have attached the data table below. Thank you!
| sample | condition |
|---------------|-----------|
| sp1_lvl1_rep1 | behavior1 |
| sp1_lvl1_rep2 | behavior1 |
| sp1_lvl1_rep3 | behavior1 |
| sp1_lvl1_rep4 | behavior1 |
| sp1_lvl1_rep5 | behavior1 |
| sp1_lvl1_rep6 | behavior1 |
| sp1_lvl2_rep1 | behavior2 |
| sp1_lvl2_rep2 | behavior2 |
| sp1_lvl2_rep3 | behavior2 |
| sp1_lvl2_rep4 | behavior2 |
| sp1_lvl3_rep1 | behavior3 |
| sp1_lvl3_rep2 | behavior3 |
| sp1_lvl3_rep3 | behavior3 |
| sp1_lvl3_rep4 | behavior3 |
| sp1_lvl3_rep5 | behavior3 |
| sp1_lvl3_rep6 | behavior3 |
| sp2_lvl4_rep1 | behavior4 |
| sp2_lvl4_rep2 | behavior4 |
| sp2_lvl4_rep3 | behavior4 |
| sp2_lvl4_rep4 | behavior4 |
| sp2_lvl4_rep5 | behavior4 |
| sp2_lvl4_rep6 | behavior4 |
| sp2_lvl4_rep7 | behavior4 |
| sp2_lvl4_rep8 | behavior4 |