inter-species RNA-seq with DEseq2: how to account for gene length differences
1
0
Entering edit mode
akozlenkov • 0
@akozlenkov-13889
Last seen 6.6 years ago

Dear Bioconductor community,

I intend to perform interspecies RNA-seq analysis with samples from 3 related species (human and 2 primates). We have paired-end data that we mapped to respective genomes with STAR, and then obtained read counts with featureCounts, using the most recent GTF gene annotations for the 3 species. I then obtained the lists of orthologue genes from ENSEMBL, and overlapped the read count tables from the 3 species into one dataframe in R.

Now, I'm considering how to perform the next steps properly, and would welcome any advice.

I realize that the gene lengths (sum of exon lengths) of orthologuous genes can be different in different species.

I asked this on Biostars, and was hinted by Devon Ryan to contact Bioconductor/DEseq2 with this question.  (https://www.biostars.org/p/253849/#312451 ). Presumably, software like DEseq2 can be used to account for such gene-level factors such as gene length (possibly also, GC %?), but it would involve advanced approaches, (offsets?). I'd be grateful for any advice on using DEseq2 for interspecies RNA-seq, and/or for any links to relevant tutorials/publications.

Best,

Alex

 

deseq2 normalization • 2.3k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 11 hours ago
United States

Hi Alex,

If you want to correct for gene length differences, all you need to do is add a matrix called "avgTxLength" before you run DESeq(). This is what DESeqDataSetFromTximport() does for example, to control for gene-length changes across samples due to isoform switching.

To add gene lengths manually:

assays(dds)[["avgTxLength"]] <- length.mat
dds <- DESeq(dds)
​...
ADD COMMENT
0
Entering edit mode

Thank you!

Do you think it's more justified to normalize by the sum of exon lengths , or by "average transcript length", as the name "avgTxLength" hints at?

Honestly, I feel both options are not perfect... but then interspecies analysis is not supposed to be straightforward.

Best,Alex

ADD REPLY
0
Entering edit mode

In general (ignoring your specific case above, but in case others come upon this thread), I think the best approach is to use Salmon (which corrects for the common sample- and batch-specific biases in Illumina data), followed by tximport (which corrects for potential differential isoform usage across samples when testing DGE), followed by DESeq2, edgeR or limma-voom. It's fast (~5 min per sample including bias correction) and I think the easiest pipeline. Everything is handled for you.

However, in your specific case, you would have to do something else to get the transcript quantification in order to estimate the average transcript length per gene x sample. You don't have them from featureCounts. So if you just want to correct for the differences due to gene length across species, you can use the sum(width(reduce(exonsByGene))) paradigm.

ADD REPLY
0
Entering edit mode

Hi Michael. And how should this matrix be? I'm manually building one but I'm getting a lot of errors, that's is its structure:

      hgnc_symbol length
20576        A1BG   8314
20566    A1BG-AS1   7737
7565         A1CF  86266
7229          A2M  48565
7226      A2M-AS1   3526
3420        A2ML1  64380

Thanks for your help!

ADD REPLY
0
Entering edit mode

What matrix are you referring to? What are you trying to do? If you want to get average transcript lengths per gene and sample, I have my recommendation above.

ADD REPLY
0
Entering edit mode

Sorry, I've confused gene length with gene average length for gene and sample . I'd like to add (someway I don't know) gene lengths for calculating fpkm (because my rowRanges(dds) are of zero length).

ADD REPLY
0
Entering edit mode

I don't have any code for this. I've worked on other methods which provide TPM, average transcript lengths, estimated counts, etc.

ADD REPLY
0
Entering edit mode

Ok, thanks anyway for your help.

ADD REPLY

Login before adding your answer.

Traffic: 630 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6