Question

Differential gene expression analysis on haplotype-resolved diploid assembly


Paulito • 0

@6447da9a

Last seen 3 months ago

Italy

Apologies for cross-posting on Biostars, but the part most relevant to me involves procedures using Bioconductor packages. I'm working on a haplotype-resolved diploid assembly of a plant genome, where each chromosome is represented by two FASTA/GTF pairs rather than a single consensus. I want to carry out bulk RNA-seq count-based differential expression analysis with Bioconductor (e.g. limma, edgeR or DESeq2), but I'm unsure how to adapt the standard workflow for this dual-sequence setup.

Experimental Design:

Organism: Plant.
Samples: 3 replicates of Condition A and 3 replicates of Condition B.
Data: Paired-end RNA-seq reads (150 bp, 30 million reads per sample) aligned to a haplotype-resolved genome assembly.
Goal: Identify DE genes between Conditions A and B, accounting for haplotype-specific expression.

I would appreciate your opinions on:

Would it make sense to concatenate the two haplotype FASTAs (and GTFs) into one "merged" reference, or would it be better to keep them separate and run two parallel alignments?
How can I use the Subread package to take haplotype information into account?

I was wondering how to build the count matrix:
Option A: Separate counts per haplotype (two columns per sample) and then sum counts for downstream DE?
Option B: Sum at the gene level before DE and ignore haplotype origin?
Option C: Test allele-specific expression by including haplotype as a factor in the design?

About the statistical modelling in DESeq2/limma/edgeR:
If I keep haplotypes separate, can I simply aggregate counts (geneA_hap1 + geneA_hap2) into a single count per sample?
If I wish to model allele-specific changes (e.g. hap1 vs. hap2 expression bias across conditions), what design formulas or contrasts are recommended?

I know these are a lot of questions and not really focused on code that can actually be used. My idea was to get some general opinions about the procedure first, and then focus on the specific code for the analysis. Thanks in advance!

edgeR DESeq2 haplotype limma • 1.5k views

ADD COMMENT • link 7 months ago Paulito • 0


Given that you have cross-posted to Biostars and under a different name, it would be helpful to link to the cross-post: https://www.biostars.org/p/9612183/

ADD REPLY • link 7 months ago Gordon Smyth 53k

score 2 · Answer 1 · 2025-05-23


Gordon Smyth 53k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

We have conducted several haplotype-specific RNA-seq and Hi-C analyses in our own work (using Rsubread and edgeR), although none are yet published. edgeR is particularly good at detecting loci vs haplotype interactions, i.e., loci where the proportion of reads from one haplotype becomes larger in one condition vs the other. It is only possible to conduct haplotype-specific analyses for reads mapping to regions or SNPs that are unique to one of the haplotypes. Routine read counting will not work.
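
As a rough illustration of what such a haplotype-by-condition test could look like in edgeR, here is a minimal sketch. It is not the workflow used in the unpublished analyses mentioned above. It assumes that haplotype-specific counts have already been obtained from reads unique to one haplotype and that loci have been matched between the two haplotypes. The counts are simulated, and the sample names, the spiked-in effect, the `hap2_by_condB` column and the choice to give both haplotype columns of a sample the same library size are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical layout: 6 samples (3 per condition), each contributing one
# column of counts per haplotype, so 12 columns in total.
sample    <- factor(rep(paste0("S", 1:6), each = 2))
condition <- factor(rep(c("A", "B"), each = 6))
haplotype <- factor(rep(c("hap1", "hap2"), times = 6))

# Simulated counts stand in for reads assigned unambiguously to one haplotype
# at loci that have been matched between the two haplotypes.
n_loci <- 500
counts <- matrix(rnbinom(n_loci * 12, mu = 100, size = 10), nrow = n_loci,
                 dimnames = list(paste0("locus", seq_len(n_loci)),
                                 paste(sample, haplotype, sep = "_")))

# Spike in a haplotype shift in condition B for the first 20 loci.
shifted <- haplotype == "hap2" & condition == "B"
counts[1:20, shifted] <- rnbinom(20 * sum(shifted), mu = 300, size = 10)

# One coefficient per sample absorbs sequencing depth and overall expression,
# 'haplotypehap2' is the baseline hap2/hap1 ratio in condition A, and the
# hand-built last column is the change in that ratio in condition B.
design <- model.matrix(~ 0 + sample + haplotype)
design <- cbind(design, hap2_by_condB = as.numeric(shifted))

# Assumption: both haplotype columns of a sample share that sample's total
# count as library size, so the offsets cancel within each sample.
lib_size <- ave(colSums(counts), sample, FUN = sum)

# Fit the model only if edgeR is installed.
if (requireNamespace("edgeR", quietly = TRUE)) {
  library(edgeR)
  y    <- DGEList(counts = counts, lib.size = lib_size)
  keep <- filterByExpr(y, design)
  y    <- y[keep, , keep.lib.sizes = TRUE]
  y    <- estimateDisp(y, design)
  fit  <- glmQLFit(y, design)
  res  <- glmQLFTest(fit, coef = "hap2_by_condB")
  print(topTags(res))
}
```

In this sketch the interaction column is added by hand because samples are nested within conditions, so a formula term such as `haplotype:condition` on top of the sample effects would give a design matrix that is not of full rank. A significant `hap2_by_condB` coefficient corresponds to a locus where the share of reads from hap2 differs between the two conditions.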

ADD COMMENT • link 7 months ago Gordon Smyth 53k


Thank you for the helpful information. I have been using both limma and edgeR for over 10 years, and I would like to thank you not only for these two essential tools in my work, but also for your dedication in answering the questions you receive. Thank you again!

ADD REPLY • link 7 months ago Paulito • 0