Question

Summarizing counts of alternate sequences in GRCh38

0

Entering edit mode

Chakravarthi Kanduri • 0

@chakravarthi-kanduri-9532

Last seen 6.7 years ago

University of Helsinki

We are using lightweight-alignment based estimated readcounts (Salmon) as input to DESeq2. We are using GRCh38 to leverage the known variation (alternate sequences) together with the strength of Salmon that probabilistically assigns reads to sequences without double-counting. Thus, when supplied with alternate sequences, reads may still be mapped to an alternate sequence (annotated as patches in GRCh38) if the reads contain true variation with respect to the original reference sequence. The expression level of a gene could then be cumulative of readcounts across all of its alternate sequences. We are finding it hard to believe that this way of summarization would be wrong, but just wanted to confirm whether this violates any of the assumptions that DESEq2 makes?

If this sound a valid approach, at what stage should this aggregation be done? We are using tximport and DESeqDataSetFromTximport to create a DESeq dataset. Can this be done after creating a DESeq dataset like below?

dds <- DESeqDataSetFromTximport(txi, sampleTable, ~samples.group)

counts(dds) <- some_function(counts(dds))

deseq2 tximport • 968 views

ADD COMMENT • link 7.5 years ago Chakravarthi Kanduri • 0

0

Entering edit mode

I was not referring to transcript-to-gene abundances and apologies if it was bit unclear.

To give an example of alternate sequences in GRCh38, I refer to the Ensembl gene IDs of a famous polymorphic gene in humans in MHC region. All these gene IDs correspond to a single gene:

"ENSG00000179344" "ENSG00000206237" "ENSG00000206302" "ENSG00000225824" "ENSG00000231286" "ENSG00000231939" "ENSG00000233209"

When the read assignment is distributed across all the above gene IDs, we would like to aggregate the readcounts to summarize the expression level of that gene. This is just an example.

ADD REPLY • link 7.5 years ago Chakravarthi Kanduri • 0

0

Entering edit mode

I would use tximport to summarize the estimated abundances and counts from multiple transcripts to a single "gene". You can do this by altering the second column of tx2gene. The ideas put forward in the F1000Research paper would follow here, that as long as reads are not double-counted in the estimated counts, the methods should work to allow you to identify genes where expression (as total "gene" output) changes across samples.

ADD REPLY • link 7.5 years ago Michael Love 41k

0

Entering edit mode

Sounds appropriate, thanks.

ADD REPLY • link 7.5 years ago Chakravarthi Kanduri • 0

score 0 · Answer 1 · 2016-11-02

Yes, this is the expected use of Salmon => tximport => DESeq2. The use of estimated counts followed by methods like DESeq2 and edgeR is laid out in the paper:

https://f1000research.com/articles/4-1521

The aggregation of transcript-to-gene abundances, counts, and lengths is made by tximport when you supply the tx2gene data frame. I don't know what other kind of aggregation you are referring to.