Question

Surrogate variable analysis for paired RNA seq experiment

0

Entering edit mode

nhaus ▴ 50

@789c70a6

Last seen 21 hours ago

Switzerland

Hi all,

I have an RNAseq experiment with 1000s of samples. Each sample comes from a patient and was treated either with a drug or DMSO (as control). I am currently deciding between SVA and RUVseq to account for "hidden" technical variation. My question is, if I have identified surrogate variables (regardless of method) and include them in my model like this:

design = ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + ... + condition

Will the surrogate variables also account for the fact that I have a "paired" design (each sample has a control and a treated counterpart)?

Any insights are very much appreciated!

Cheers!

RUVSeq sva • 516 views

ADD COMMENT • link 3 months ago nhaus ▴ 50

1

Entering edit mode

Following up somewhat on what James already said in Not fully paired differential gene expression analysis -- if you do a paired analysis then the comparison is within donor and you would not need to correct for "hidden" factors such age and sex -- the pairing takes care of that. Do you have a good data-driven reason to believe that additional correction is necessary? With thousands of samples you might simply removing the incomplete pairs and go with paired analysis + weighting as I suggested in linked post.

ADD REPLY • link 3 months ago ATpoint ★ 4.2k

0

Entering edit mode

Thank you very much for your comment! The reason I am thinking of going with the "SVA approach" is that I suspect that there are technical differences between the samples that are not recorded in the mdata (for example the differentiation time or purity of the cell culture). I believe that this might be the case because PC1 capture quite a lot of variation (30%) but I am unable to assign it to any column in the metadata and the experimental condition that I am interested in is not associated with PC1 and PC2. After running SVA, I get a nice separation for my experimental groups. I know that this is expected given that I explicitly tell SVA what biological condition I am interested in.

ADD REPLY • link 3 months ago nhaus ▴ 50

1

Entering edit mode

for example the differentiation time or purity of the cell culture

That also would be captured by the pairing, no? Are the samples per donor processed in the same batch/run/time or is there a delay?

because PC1 capture quite a lot of variation (30%)

Just to be sure, that is PCA after regressing the donor effect, is it?

ADD REPLY • link 3 months ago ATpoint ★ 4.2k

0

Entering edit mode

Are the samples per donor processed in the same batch/run/time or is there a delay?

All samples from the same donor are processed and differentiated and treated together, but for each batch, there are several donors (for example Donor A, B and C are treated together in batch1). This means that by adjusting for donor, I am also adjusting for batch right? I think it might still be the case, that one cell line took longer to be differentiated into the cell subtype of interest. I will ask my collaborator.

Just to be sure, that is PCA after regressing the donor effect, is it?

After regressing out donor effect using limma PC1 captures 22% of variance and it is very difficult to determine what PC1 and PC2 actually corresponds to. PC3 and PC4 very nicely separate my biological groups. If I use UMAP project, I also get a nice seperation on UMAP1 and UMAP2 according to my biological groups.

ADD REPLY • link 3 months ago nhaus ▴ 50

1

Entering edit mode

Sounds to me that including donor should capture a lot of relevant confounders and you should be good to go. What is the problem actually, don't you get DEGs when including donor without further corrections?

ADD REPLY • link 3 months ago ATpoint ★ 4.2k

0

Entering edit mode

I do get differentially expressed genes, but I honestly have never had a situation where I could not explain what PC1 and PC2 corresponds to. With the SVA approach I was able to do that, however, doing it like that the explained variance of PC1 was reduced to just 7% which seems very low...

If I decide to just include donor in my model as a covariate, does this mean that I have to make sure that each donor has an observation for each biological condition of interest (i.e. no incomplete observations)? I am asking, because if I run DESeq2 with incomplete observation, it still works and the linear model fitting works without any problem. Here is some example code for that:

library(DESeq2)

mdata <- data.frame(
    patient = rep(c("A", "B", "C", "D"), 2),
    condition = rep(c("treat", "ctrl"), each = 4)
)

# remove ctrl from D to simulate "dropout", i.e. unpaired sample
mdata <- mdata[-8, ]

# random counts
counts <- matrix(round(runif(700) * 100), ncol=7)

dds <- DESeqDataSetFromMatrix(
    countData = counts,
    colData = mdata,
    design = ~ 0 +patient + condition
)

res <- DESeq(dds)
results(res)

ADD REPLY • link 3 months ago nhaus ▴ 50