Hello everyone,
I am working on a bulk RNA-seq differential expression analysis comparing tumor versus healthy tissue. The complication is that the samples come from two different sources, and the sequencing layout is fully confounded with condition:
- Tumor samples: paired-end sequencing (Source A)
- Healthy samples: single-end sequencing (Source B)
As a result, the effects of condition, source, and layout cannot be separated statistically. Standard batch-correction approaches are therefore not applicable, because neither technical factor varies within a condition.
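To make the confounding concrete, here is a minimal sketch in Python with a made-up layout of 3 tumor and 3 healthy samples (the sample counts and the `matrix_rank` helper are illustrative assumptions, not my actual data or pipeline). A design with an intercept plus condition, source, and layout has 4 coefficients, but the matrix only has rank 2, so the three effects cannot be estimated separately:

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical layout: 3 tumor (Source A, paired-end), 3 healthy (Source B, single-end).
samples = [("tumor", "A", "paired")] * 3 + [("healthy", "B", "single")] * 3

# Design columns: intercept, condition == tumor, source == A, layout == paired.
design = [
    [1, int(cond == "tumor"), int(src == "A"), int(layout == "paired")]
    for cond, src, layout in samples
]

design_rank = matrix_rank(design)
n_coefficients = len(design[0])
print(design_rank, n_coefficients)  # prints "2 4": rank is lower than the number of coefficients
```

The condition, source, and layout columns are identical in this design, which is why no model can attribute a difference to one of them rather than the others.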
Based on previous discussions, it seems there are two potential strategies:
1. Technical homogenization: Reprocess all samples uniformly as single-end (using only R1 for the paired-end tumor samples), then re-quantify.
2. Introduce within-condition variation: Add samples so that both conditions include both layouts (not possible in my project).
Since option 2 cannot be done, my question is:
Is option 1 considered an appropriate and valid strategy to make a tumor vs. healthy DE comparison defensible in this scenario?
Additionally, I would appreciate guidance on:
- Important QC checks after homogenization
- Recommendations for modeling dataset source once layout is no longer confounded
- Any references or previous workflows where this approach has been used successfully
- Potential limitations I should report when interpreting the results
Thank you very much in advance for your advice and feedback!
Best!
