Hello everyone,
I am working on a bulk RNA-seq differential expression analysis comparing tumor versus healthy tissue. The complication is that the samples come from two different sources, and the sequencing layout is fully confounded with condition:
- Tumor samples: paired-end sequencing (Source A)
- Healthy samples: single-end sequencing (Source B)
 
As a result, the effects of condition, source, and layout cannot be statistically separated. This makes standard batch correction approaches invalid, because there is no within-condition variation in these technical factors.
Based on previous discussions, it seems there are two potential strategies:
- Option 1, technical homogenization: Reprocess all samples uniformly as single-end (using only R1 for the paired-end tumor samples), then re-quantify.
- Option 2, introduce within-condition variation: Add samples so that both conditions include both layouts (not possible in my project).
 
Since option 2 cannot be done, my question is:
Is option 1 considered an appropriate and valid strategy to make a tumor vs healthy DE comparison defensible in this scenario?
Additionally, I would appreciate guidance on:
- Important QC checks after homogenization
- Recommendations for modeling dataset source once layout is no longer confounded
- Any references or previous workflows where this approach has been used successfully
- Potential limitations I should report when interpreting the results
 
Thank you very much in advance for your advice and feedback!
Best!

Yes. Option 1 (technical homogenization: reprocessing the paired-end samples as single-end using only R1) is the most defensible approach when layout and condition are fully confounded. It reduces layout-driven biases and makes a direct comparison between tumor and healthy samples more consistent, at the cost of some information from the paired-end data.
Guidance:
QC checks after homogenization:
Inspect read quality (FastQC), alignment rates, and read length distributions.
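Compare gene body coverage profiles between groups (e.g., using RSeQC). Fragment length can no longer be estimated directly once all reads are single-end, so compare read length and library strandedness instead.
Use PCA or clustering to look for outlier samples and overall structure. Because source is still confounded with condition, separation by condition on a PCA cannot be distinguished from separation by source.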
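As a minimal sketch of the read length check, the snippet below counts read lengths per group from FASTQ records. It assumes plain four-line FASTQ records; the function name `read_length_counts` and the toy reads are hypothetical, and FastQC reports the same distribution for real files.

```python
from collections import Counter
from typing import Iterable


def read_length_counts(fastq_lines: Iterable[str]) -> Counter:
    """Count read lengths in a FASTQ stream with 4 lines per record."""
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        # The sequence is the second line of each 4-line record.
        if i % 4 == 1:
            counts[len(line.strip())] += 1
    return counts


# Toy records standing in for a tumor R1 file and a healthy single-end file.
tumor_r1 = ["@t1", "ACGTACGTAC", "+", "IIIIIIIIII",
            "@t2", "ACGTACGT", "+", "IIIIIIII"]
healthy_se = ["@h1", "ACGTAC", "+", "IIIIII",
              "@h2", "ACGTAC", "+", "IIIIII"]

tumor_lengths = read_length_counts(tumor_r1)
healthy_lengths = read_length_counts(healthy_se)
print("tumor:", dict(tumor_lengths))
print("healthy:", dict(healthy_lengths))
```

For real data, the same function can be fed an open file handle, for example one returned by `gzip.open(path, "rt")`. If the two groups differ clearly in read length, trimming to a common length before re-quantifying is worth considering.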
Modeling recommendations:
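Once layout is homogenized, include source as a covariate in your DE model (~ source + condition) only if each source contributes samples from both conditions. Otherwise the design matrix is not full rank and the source effect cannot be estimated.
If source and condition remain fully confounded, as in your case, fit ~ condition and interpret DE results cautiously, since they may still reflect technical effects.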
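To see why a source covariate only helps when source varies within condition, here is a minimal sketch in plain Python (not DESeq2 itself). It builds the design matrix for ~ source + condition and checks its rank; the helper names `matrix_rank` and `design_matrix` and the four-sample layouts are hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence


def matrix_rank(rows: Sequence[Sequence[int]]) -> int:
    """Rank via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(sources: List[str], conditions: List[str]) -> List[List[int]]:
    """Columns: intercept, source == 'B', condition == 'tumor'."""
    return [[1, int(s == "B"), int(c == "tumor")]
            for s, c in zip(sources, conditions)]


# Fully confounded: every tumor sample is from A, every healthy one from B.
confounded = design_matrix(["A", "A", "B", "B"],
                           ["tumor", "tumor", "healthy", "healthy"])
# Hypothetical balanced layout: each source contributes both conditions.
balanced = design_matrix(["A", "A", "B", "B"],
                         ["tumor", "healthy", "tumor", "healthy"])

confounded_rank = matrix_rank(confounded)
balanced_rank = matrix_rank(balanced)
print(confounded_rank, balanced_rank)
```

The confounded design has three columns but rank 2, because the tumor column is exactly one minus the source column, so the two effects cannot be estimated separately. The balanced design reaches rank 3.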
References/workflows:
See discussions on Bioconductor Support and SEQanswers about single- vs paired-end homogenization for DESeq2/edgeR.
For the DE modeling itself: Love et al., Genome Biol 2014 (DESeq2 methods).
Limitations to report:
Potential loss of sensitivity and mapping accuracy due to single-end conversion.
Inability to fully separate biological from technical effects if confounding remains.