Not fully paired differential gene expression analysis
nhaus (Switzerland)

Hi all,

I have ~1000 RNA-seq samples from 100 donors and am using edgeR to analyse them. Tissue from each of the 100 donors was treated with one of 9 different chemicals (A, B, C, ...) or left untreated (control).

Unfortunately, for technical reasons, I had to remove a few control samples (only 3) because of a low number of reads (<5e6). This means I have some "unpaired" samples, i.e. donors that received a treatment but for which I have no control. Is it correct that I can still use a design like the one below if I am only interested in the average difference between, for example, treatment5 and the control?

#my design (for simplicity all batch effects are ignored)
design = ~ patient.id + treatment
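
For context, here is a minimal sketch (with made-up metadata for three donors, one of which lacks a control) of how this additive design is built; model.matrix() simply constructs rows for the samples that exist, and the standard edgeR quasi-likelihood pipeline applies unchanged:

```r
library(edgeR)

# Hypothetical sample table: donor 3 lost its control sample
meta <- data.frame(
  patient.id = factor(c(1, 1, 2, 2, 3)),
  treatment  = factor(c("control", "treat5",
                        "control", "treat5", "treat5"))
)
design <- model.matrix(~ patient.id + treatment, data = meta)

# With a DGEList `y` of counts (not shown), the usual pipeline would be:
# y   <- estimateDisp(y, design)
# fit <- glmQLFit(y, design)
# qlf <- glmQLFTest(fit, coef = "treatmenttreat5")
```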

A related question: because I have so many samples, I am using sva to adjust for batch effects. Is it correct that, after this, I can remove patient.id from the design because the identified surrogate variables account for patient-specific effects?

#my design after SVA
design = ~ SV1 + SV2 + SV3 + SV4 + SV5 + ... + SV12 + treatment
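
For what it's worth, this is roughly how the surrogate variables would be estimated with svaseq() and appended to the design (object names like `y` and `meta` are placeholders, not from the actual data):

```r
library(sva)
library(edgeR)

# meta: sample table with a `treatment` column; y: DGEList of counts (placeholders)
mod  <- model.matrix(~ treatment, data = meta)  # full model
mod0 <- model.matrix(~ 1, data = meta)          # null model (intercept only)

# svaseq() expects normalized counts on the raw (not log) scale
sv <- svaseq(cpm(y, log = FALSE), mod, mod0)

# Append the surrogate variables as extra covariates
design_sv <- cbind(mod, sv$sv)
colnames(design_sv) <- c(colnames(mod), paste0("SV", seq_len(sv$n.sv)))
```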

Any input is very much appreciated!

Cheers

edgeR DifferentialExpression
james-w-macdonald (United States)

You removed samples because you had fewer than 5M reads? You might reconsider that decision. Unless the library sizes are really disparate, it shouldn't be much of a problem. The only time I have ever worried about such things is when doing miRNA-Seq, where you can get some really small library sizes and the first PC is clearly capturing sequencing depth. But that is with really low sequencing depth, not something like 5M reads.

Also, I wouldn't use surrogate variables instead of blocking on subject. You might have some residual batch effects left over, which you might sop up with some surrogate variables, but in general I would probably just block on subject and call it a day unless I saw something that really needed fixing.


Thank you very much for your input! I picked the 5M read threshold because it is what I generally read online (for example from ECSEQ or Illumina; ENCODE even recommends 30 million reads [see RNA Sequence Experiment Design: Replication and sequencing depth]).

The reason I am using SVA is the following: I know there are several technical (and also biological) variables that influence expression, such as age, sex, or ethnicity, but they are not of interest to me. The problem is that I do not have age or ethnicity information for all participants (because it is a big dataset), so I cannot model their effects. Would you agree that SVA is the right approach in this case?


The first thing to do with low-depth samples is to resequence them. If that is not possible, check in exploratory analysis (e.g. PCA or MDS) whether there is data-driven evidence that the low depth is actually a problem and drives outlier behaviour. If so, you could still use limma with arrayWeights or voomWithQualityWeights to downweight the samples rather than remove them. Removal should be the last option.
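
As a sketch of that downweighting route (assuming a DGEList `y` and the paired design from the question; the coefficient name is illustrative):

```r
library(limma)
library(edgeR)

# Estimate per-sample quality weights instead of discarding low-depth samples;
# poor samples get small weights and influence the fit less.
v   <- voomWithQualityWeights(y, design, plot = TRUE)
fit <- lmFit(v, design)
fit <- eBayes(fit)
topTable(fit, coef = "treatmenttreat5")
```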


If you are blocking on subject, none of the things you mention should have any effect, and couldn't be fit anyway (after blocking on subject). In other words, algebraically you are fitting a paired t-test, where for each subject you are computing the difference between a treatment and that person's control. Both the treatment and control provide information about the sex, age, ethnicity, and any other phenotype inherent to that person, and by subtracting one from the other (control from treatment) you are effectively canceling out any subject-specific variation.

And if you randomized correctly (keeping each subject within a given batch), then you are also canceling out the batch effect. If you didn't randomize correctly, then that is a tragedy.
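
A quick way to check this kind of confounding (sketch, assuming a sample table `meta` with hypothetical `batch`, `patient.id`, and `treatment` columns):

```r
# Each subject should sit entirely within a single batch...
table(meta$batch, meta$patient.id)
# ...while treatments should be spread across batches
table(meta$batch, meta$treatment)
```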

Also, there is a difference between recommending a certain number of reads (30M is excessive IMO) and saying "remove any samples with <X million reads". People recommend all sorts of things that aren't necessarily achievable in practice.


Thank you very much, this makes a lot of sense! We did randomize "correctly" (i.e. batch and subject are confounded, not batch and condition). One last thing I am unsure about: even if I do not remove any samples for low counts, some conditions are unpaired (sequencing failed and the samples simply are not in our metadata). In these cases, can I still keep the "single" sample and use the design ~ donor + condition, or will this mess up the "paired t-test" since some samples are unpaired?


There are previous questions about incomplete pairing in DESeq2, and the same logic applies here: the unpaired samples are effectively not used for the comparison, so they do not add to (or bias) the inference.

