We are using lightweight-alignment based estimated readcounts (Salmon) as input to DESeq2. We are using GRCh38 to leverage the known variation (alternate sequences) together with the strength of Salmon that probabilistically assigns reads to sequences without double-counting. Thus, when supplied with alternate sequences, reads may still be mapped to an alternate sequence (annotated as patches in GRCh38) if the reads contain true variation with respect to the original reference sequence. The expression level of a gene could then be cumulative of readcounts across all of its alternate sequences. We are finding it hard to believe that this way of summarization would be wrong, but just wanted to confirm whether this violates any of the assumptions that DESEq2 makes?
If this sound a valid approach, at what stage should this aggregation be done? We are using tximport and DESeqDataSetFromTximport to create a DESeq dataset. Can this be done after creating a DESeq dataset like below?
dds
<- DESeqDataSetFromTximport(
txi
, sampleTable, ~samples.group)
counts(dds) <- some_function(counts(
dds
))
I was not referring to transcript-to-gene abundances and apologies if it was bit unclear.
To give an example of alternate sequences in GRCh38, I refer to the Ensembl gene IDs of a famous polymorphic gene in humans in MHC region. All these gene IDs correspond to a single gene:
"ENSG00000179344" "ENSG00000206237" "ENSG00000206302" "ENSG00000225824" "ENSG00000231286" "ENSG00000231939" "ENSG00000233209"
When the read assignment is distributed across all the above gene IDs, we would like to aggregate the readcounts to summarize the expression level of that gene. This is just an example.
I would use
tximport
to summarize the estimated abundances and counts from multiple transcripts to a single "gene". You can do this by altering the second column oftx2gene
. The ideas put forward in the F1000Research paper would follow here, that as long as reads are not double-counted in the estimated counts, the methods should work to allow you to identify genes where expression (as total "gene" output) changes across samples.Sounds appropriate, thanks.