I'm working on a gene expression analysis with two non-model species. I followed the same field collecting and RNA preservation protocol for all samples and generated 24 libraries that were pooled and sequenced across three lanes. I'm following the steps below for differential expression analysis:
1. Perform structural annotation with draft genome assemblies (results show similar completeness and quality of annotation)
2. Use transcripts from gene models to build a kallisto index for each species and quantify
3. Identify one-to-one orthologs between the two species
4. Create copies of kallisto's "abundance.tsv" files, replace the "target_id" field with ortholog identifiers across samples, and discard quantification data from genes without orthologs
5. Import kallisto files with tximport and "txOut = TRUE"; the genome annotation does not include isoforms so I don't have a transcript-to-gene map
6. Run DESeq2 pipeline using a single factor (behavior) with four levels. One species has three levels and the other has one level.
Here is the result of a PCA with distances calculated from VTS data. PC1 clearly separates the two species and explains an unusually high percentage of the variance (80%).
I was wondering if there are major problems with the approach that I'm using. I rarely see studies using DESeq2 with more than one species, and I would appreciate any thoughts on this type of analysis or if it's appropriate to use DESeq2 for this purpose.