Question

inter-species RNA-seq with DEseq2: how to account for gene length differences

0

Entering edit mode

akozlenkov • 0

@akozlenkov-13889

Last seen 7.6 years ago

Dear Bioconductor community,

I intend to perform interspecies RNA-seq analysis with samples from 3 related species (human and 2 primates). We have paired-end data that we mapped to respective genomes with STAR, and then obtained read counts with featureCounts, using the most recent GTF gene annotations for the 3 species. I then obtained the lists of orthologue genes from ENSEMBL, and overlapped the read count tables from the 3 species into one dataframe in R.

Now, I'm considering how to perform the next steps properly, and would welcome any advice.

I realize that the gene lengths (sum of exon lengths) of orthologuous genes can be different in different species.

I asked this on Biostars, and was hinted by Devon Ryan to contact Bioconductor/DEseq2 with this question. (https://www.biostars.org/p/253849/#312451 ). Presumably, software like DEseq2 can be used to account for such gene-level factors such as gene length (possibly also, GC %?), but it would involve advanced approaches, (offsets?). I'd be grateful for any advice on using DEseq2 for interspecies RNA-seq, and/or for any links to relevant tutorials/publications.

Best,

Alex

deseq2 normalization • 2.9k views

ADD COMMENT • link updated 7.7 years ago by Michael Love 43k • written 7.7 years ago by akozlenkov • 0

score 0 · Answer 1 · 2018-05-03

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 4 days ago

United States

Hi Alex,

If you want to correct for gene length differences, all you need to do is add a matrix called "avgTxLength" before you run DESeq(). This is what DESeqDataSetFromTximport() does for example, to control for gene-length changes across samples due to isoform switching.

To add gene lengths manually:

assays(dds)[["avgTxLength"]] <- length.mat
dds <- DESeq(dds)
...

ADD COMMENT • link 7.7 years ago Michael Love 43k

0

Entering edit mode

Thank you!

Do you think it's more justified to normalize by the sum of exon lengths , or by "average transcript length", as the name "avgTxLength" hints at?

Honestly, I feel both options are not perfect... but then interspecies analysis is not supposed to be straightforward.

Best,Alex

ADD REPLY • link 7.6 years ago akozlenkov • 0

0

Entering edit mode

In general (ignoring your specific case above, but in case others come upon this thread), I think the best approach is to use Salmon (which corrects for the common sample- and batch-specific biases in Illumina data), followed by tximport (which corrects for potential differential isoform usage across samples when testing DGE), followed by DESeq2, edgeR or limma-voom. It's fast (~5 min per sample including bias correction) and I think the easiest pipeline. Everything is handled for you.

However, in your specific case, you would have to do something else to get the transcript quantification in order to estimate the average transcript length per gene x sample. You don't have them from featureCounts. So if you just want to correct for the differences due to gene length across species, you can use the sum(width(reduce(exonsByGene))) paradigm.

ADD REPLY • link 7.6 years ago Michael Love 43k

0

Entering edit mode

Hi Michael. And how should this matrix be? I'm manually building one but I'm getting a lot of errors, that's is its structure:

      hgnc_symbol length
20576        A1BG   8314
20566    A1BG-AS1   7737
7565         A1CF  86266
7229          A2M  48565
7226      A2M-AS1   3526
3420        A2ML1  64380

Thanks for your help!

ADD REPLY • link 5.7 years ago jgarces • 0

0

Entering edit mode

What matrix are you referring to? What are you trying to do? If you want to get average transcript lengths per gene and sample, I have my recommendation above.

ADD REPLY • link 5.7 years ago Michael Love 43k

0

Entering edit mode

Sorry, I've confused gene length with gene average length for gene and sample . I'd like to add (someway I don't know) gene lengths for calculating fpkm (because my rowRanges(dds) are of zero length).

ADD REPLY • link 5.7 years ago jgarces • 0

0

Entering edit mode

I don't have any code for this. I've worked on other methods which provide TPM, average transcript lengths, estimated counts, etc.