Question

Integration: paired-end tumor vs single-end healthy samples from a different source

m.c.ruma • 0

The Netherlands

Hello everyone,

I am working on a bulk RNA-seq differential expression analysis comparing tumor versus healthy tissue. The complication is that the samples come from two different sources, and the sequencing layout is fully confounded with condition:

Tumor samples: paired-end sequencing (Source A)
Healthy samples: single-end sequencing (Source B)

As a result, the effects of condition, source, and layout cannot be statistically separated. This makes standard batch correction approaches invalid, because there is no within-condition variation in these technical factors.

Based on previous discussions, it seems there are two potential strategies:

1. Technical homogenization: Reprocess all samples uniformly as single-end (using only R1 for the paired-end tumor samples), then re-quantify.
2. Introduce within-condition variation: Add samples where both conditions include both layouts (not possible in my project).

Since option 2 cannot be done, my question is:

Is option 1 considered an appropriate and valid strategy to make a tumor vs healthy DE comparison defensible in this scenario?

Additionally, I would appreciate guidance on:

Important QC checks after homogenization
Recommendations for modeling dataset source once layout is no longer confounded
Any references or previous workflows where this approach has been used successfully
Potential limitations I should report when interpreting the results

Thank you very much in advance for your advice and feedback!

Best!

Bioconductor • 47 views

Yes. Option 1 (technical homogenization), i.e. reprocessing the paired-end samples as single-end using only R1, is the most defensible approach when layout and condition are fully confounded. It removes layout-driven differences and makes the direct comparison between tumor and healthy samples more consistent, even though it sacrifices some information from the paired-end data.
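
A minimal sketch of the R1-only selection step, to make the idea concrete. The file names and the mate-naming convention (`_R1`/`_R2` or `_1`/`_2` before the FASTQ extension) are assumptions for illustration, not something taken from the question; adjust the pattern to your own files.

```python
import re

# Assumed naming convention: mates are marked "_R1"/"_R2" or "_1"/"_2"
# directly before the FASTQ extension. This is hypothetical; adapt as needed.
MATE_PATTERN = re.compile(
    r"^(?P<sample>.+?)_R?(?P<mate>[12])\.f(?:ast)?q(?:\.gz)?$"
)


def single_end_inputs(filenames):
    """Map sample name -> the single FASTQ file to quantify as single-end."""
    chosen = {}
    for name in sorted(filenames):
        match = MATE_PATTERN.match(name)
        if match is None:
            # No mate marker: treat the file as already single-end.
            sample = re.sub(r"\.f(?:ast)?q(?:\.gz)?$", "", name)
            chosen[sample] = name
        elif match.group("mate") == "1":
            # Paired-end sample: keep R1 only.
            chosen[match.group("sample")] = name
        # Mate 2 files are deliberately skipped.
    return chosen


# Hypothetical file listing: paired-end tumor, single-end healthy.
files = [
    "tumor01_R1.fastq.gz", "tumor01_R2.fastq.gz",
    "tumor02_R1.fastq.gz", "tumor02_R2.fastq.gz",
    "healthy01.fastq.gz", "healthy02.fastq.gz",
]

for sample, path in single_end_inputs(files).items():
    print(sample, path)
```

Every sample then goes through the same single-end alignment or quantification settings. Note that a single-end file whose name happens to end in `_2` would be skipped by this pattern, so check the resulting sample list before running anything.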

Guidance:

QC checks after homogenization:

Inspect read quality (FastQC), alignment rates, and read length distributions.

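Verify consistent gene body coverage profiles between groups (e.g., using RSeQC). Fragment length cannot be estimated from single-end reads alone, so it cannot be compared across groups after homogenization.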

Use PCA or clustering to confirm that samples group by biological condition rather than source.
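
As a rough illustration of the read-length check, here is a small sketch that tallies read lengths from FASTQ records. It assumes standard 4-line records (no wrapped sequence lines); the function names and the toy records are made up for the example.

```python
import gzip
from collections import Counter


def read_length_counts(lines):
    """Tally read lengths from an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 == 1:  # second line of each record is the sequence
            counts[len(line.strip())] += 1
    return counts


def read_length_counts_from_file(path):
    """Same tally for a FASTQ file on disk, plain or gzip-compressed."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return read_length_counts(handle)


# Toy records for illustration only (not real data).
toy_records = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGT", "+", "IIII",
]
print(read_length_counts(toy_records))
```

Running this on the tumor R1 files and on the healthy files shows whether the two sources were sequenced to the same read length. If they were not, that is worth reporting, because mappability depends on read length.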

Modeling recommendations:

Once layout is homogenized, include source (if possible) as a covariate in your DE model (~ source + condition).

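If source and condition remain fully confounded, as they do when all tumor samples come from Source A and all healthy samples from Source B, the source term cannot be estimated; interpret DE results cautiously, since they may still reflect technical effects.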
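
To see why a source covariate does not help under full confounding, here is a small sketch that builds the design matrix for `~ source + condition` and computes its rank. The sample layouts are hypothetical, and the dummy coding (intercept, source B indicator, tumor indicator) is one common choice rather than the exact encoding any particular package uses.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(samples):
    """Columns: intercept, source == 'B', condition == 'tumor'."""
    return [[1, int(src == "B"), int(cond == "tumor")] for src, cond in samples]


# Hypothetical layout matching the question: tumor only from A, healthy only from B.
confounded = [("A", "tumor")] * 3 + [("B", "healthy")] * 3
# Hypothetical balanced layout for contrast (not available in the question).
balanced = [("A", "tumor"), ("B", "tumor"), ("A", "healthy"), ("B", "healthy")]

print(matrix_rank(design_matrix(confounded)))  # rank 2 for 3 columns
print(matrix_rank(design_matrix(balanced)))    # rank 3 for 3 columns
```

In the confounded layout the source and condition columns sum to the intercept, so the matrix has rank 2 for three coefficients and the two effects cannot be separated. Only a layout in which at least one condition appears in both sources gives a full-rank design.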

References/workflows:

See discussions on Bioconductor Support and SEQanswers on single- vs paired-end homogenization for DESeq2/edgeR.

General strategy used in: Love et al., Genome Biol 2014 (DESeq2 methods).

Limitations to report:

Potential loss of sensitivity and mapping accuracy due to single-end conversion.

Inability to fully separate biological from technical effects if confounding remains.

ADD REPLY • link 7 hours ago igkryycpinmshzugby • 0

score 0 · Answer 1 · 2025-11-03

James W. MacDonald 68k

United States

If by reprocess you mean generate libraries and re-sequence, then that's something you could do. Otherwise the biological and technical differences will be confounded, and there is nothing you can do to 'fix' that, and even if you could, how would you know you had done so?
