Dear Bioconductor community,
I intend to perform interspecies RNA-seq analysis with samples from 3 related species (human and 2 primates). We have paired-end data that we mapped to respective genomes with STAR, and then obtained read counts with featureCounts, using the most recent GTF gene annotations for the 3 species. I then obtained the lists of orthologue genes from ENSEMBL, and overlapped the read count tables from the 3 species into one dataframe in R.
Now, I'm considering how to perform the next steps properly, and would welcome any advice.
I realize that the gene lengths (sum of exon lengths) of orthologuous genes can be different in different species.
I asked this on Biostars, and was hinted by Devon Ryan to contact Bioconductor/DEseq2 with this question. (https://www.biostars.org/p/253849/#312451 ). Presumably, software like DEseq2 can be used to account for such gene-level factors such as gene length (possibly also, GC %?), but it would involve advanced approaches, (offsets?). I'd be grateful for any advice on using DEseq2 for interspecies RNA-seq, and/or for any links to relevant tutorials/publications.
Best,
Alex
Thank you!
Do you think it's more justified to normalize by the sum of exon lengths , or by "average transcript length", as the name "avgTxLength" hints at?
Honestly, I feel both options are not perfect... but then interspecies analysis is not supposed to be straightforward.
Best,Alex
In general (ignoring your specific case above, but in case others come upon this thread), I think the best approach is to use Salmon (which corrects for the common sample- and batch-specific biases in Illumina data), followed by tximport (which corrects for potential differential isoform usage across samples when testing DGE), followed by DESeq2, edgeR or limma-voom. It's fast (~5 min per sample including bias correction) and I think the easiest pipeline. Everything is handled for you.
However, in your specific case, you would have to do something else to get the transcript quantification in order to estimate the average transcript length per gene x sample. You don't have them from featureCounts. So if you just want to correct for the differences due to gene length across species, you can use the
sum(width(reduce(exonsByGene)))
paradigm.Hi Michael. And how should this matrix be? I'm manually building one but I'm getting a lot of errors, that's is its structure:
Thanks for your help!
What matrix are you referring to? What are you trying to do? If you want to get average transcript lengths per gene and sample, I have my recommendation above.
Sorry, I've confused gene length with gene average length for gene and sample . I'd like to add (someway I don't know) gene lengths for calculating fpkm (because my
rowRanges(dds)
are of zero length).I don't have any code for this. I've worked on other methods which provide TPM, average transcript lengths, estimated counts, etc.
Ok, thanks anyway for your help.