Question

Seeking Advice on Using edgeR and variancePartition for RNA-seq Data from Multiple Tissues and Patients

Seymoo • 0

@seymoo-12522

Last seen 9 months ago

Oslo

Hello everyone

I'm working with RNA-seq gene expression data derived from multiple tissue samples collected from different patients. My primary goal is to identify differentially expressed genes (DEGs) while minimizing the confounding effects of tissue of origin on the results.

A brief overview of my approach:

I've converted the raw counts into log-CPM values using edgeR to normalize between samples, accounting for library size. Genes were filtered with filterByExpr to retain those with sufficient expression, and normalization used the TMM method:

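library(edgeR)
# counts is assumed to be a DGEList here, e.g. counts <- DGEList(raw_counts)
keep <- filterByExpr(counts)
counts <- counts[keep, , keep.lib.sizes=FALSE]
counts <- normLibSizes(counts, method = "TMM")  # TMM normalization factors
counts_cpm <- cpm(counts, log = TRUE)           # log2-CPM matrix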

To account for patient and tissue variability, I applied a mixed linear model using the variancePartition package. My formula models the contribution of both patient and tissue to the gene expression variation:

library(variancePartition)
# df is the sample metadata, with Tissue and Patient columns
form <- ~ (1 | Tissue) + (1 | Patient)
vp_modelFit <- fitVarPartModel(counts_cpm, form, df)
vp_modelFit_res <- residuals(vp_modelFit)

My understanding is that the residuals from this model should, in theory, represent gene expression values devoid of tissue- and patient-specific effects, potentially revealing the intrinsic cancer-related signals.

Question:

Is this approach statistically sound for achieving my aim? Specifically, does this methodology appropriately remove the unwanted variation from tissue and patient sources while preserving biologically relevant signals?
Any recommendations for improving the robustness of this approach, especially in terms of ensuring that intrinsic cancer-related signals are not inadvertently removed?

Thank you!

edgeR variancePartition • 1.7k views

ADD COMMENT • link updated 17 months ago by Gordon Smyth 53k • written 17 months ago by Seymoo • 0

I don't follow what problem you are trying to solve. Your goal is to identify DE genes, but DE between what?

ADD REPLY • link 17 months ago Gordon Smyth 53k

Hi Gordon, thanks for your response. I will investigate DEGs between genetic subclones, that is, subclasses with particular genetic mutations, after regressing out the potential impact of different tissues and individuals from the expression data. I did not want to make the question more complex.

ADD REPLY • link 17 months ago Seymoo • 0

Cross-posted to Biostars: https://www.biostars.org/p/9601556/

ADD REPLY • link 17 months ago Gordon Smyth 53k

score 1 · Answer 1 · 2024-08-27

gabriel.hoffman ▴ 170

@gabrielhoffman-8391

Last seen 12 weeks ago

United States

You should always use voomWithDreamWeights() on the counts before fitting the model. This produces log2 CPM values and computes precision weights.

Doing an analysis in two steps by a) computing residuals and then b) performing a second regression for statistical testing will increase the false positive rate, since the covariance between the variable of interest and the other variables is ignored. I _strongly_ recommend fitting a single model with ~ (1 | Tissue) + (1 | Patient) + variable and using dream().
These issues are not specific to the dream workflow; they are general properties of regression that apply to limma as well.
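The single-model workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the counts are simulated, the sample sheet (`metadata`) is hypothetical, and `GeneticClonesClass` with levels A and B stands in for the variable of interest. The function calls themselves (voomWithDreamWeights, dream, eBayes, topTable) are the ones named in this thread or their standard companions in variancePartition and limma.

```r
# Sketch only: simulated counts and a hypothetical sample sheet stand in for real data.
library(edgeR)
library(variancePartition)

set.seed(1)
n_genes <- 200
n_samples <- 24

# Hypothetical design: 8 patients x 3 tissues, two subclone classes (assumption).
metadata <- data.frame(
  Patient = factor(rep(paste0("P", 1:8), each = 3)),
  Tissue = factor(rep(c("T1", "T2", "T3"), times = 8)),
  GeneticClonesClass = factor(rep(c("A", "B"), length.out = n_samples)),
  row.names = paste0("S", seq_len(n_samples))
)

# Simulate gene-wise patient and tissue effects so both variance components are non-zero.
patient_eff <- matrix(rnorm(n_genes * 8, sd = 0.5), nrow = n_genes)[, as.integer(metadata$Patient)]
tissue_eff <- matrix(rnorm(n_genes * 3, sd = 0.5), nrow = n_genes)[, as.integer(metadata$Tissue)]
mu <- 100 * exp(patient_eff + tissue_eff)
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = mu, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), rownames(metadata))
)

# Standard edgeR preprocessing on raw counts (not on log-CPM).
dge <- DGEList(counts)
keep <- filterByExpr(dge, group = metadata$GeneticClonesClass)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge, method = "TMM")

# One model: variable of interest as a fixed effect, Tissue and Patient as random effects.
form <- ~ GeneticClonesClass + (1 | Tissue) + (1 | Patient)
vobj <- voomWithDreamWeights(dge, form, metadata)  # log2 CPM plus precision weights
fit <- dream(vobj, form, metadata)
fit <- eBayes(fit)

# Coefficient name follows R's factor coding: level "B" versus reference level "A".
res <- topTable(fit, coef = "GeneticClonesClassB", number = Inf)
head(res)
```

Whether Tissue and Patient should be random or fixed depends on the actual design, so the formula here is a placeholder for whatever model fits the experiment.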

ADD COMMENT • link 17 months ago gabriel.hoffman ▴ 170

Thank you Gabriel! So, to be clear, you mean that I should include my grouping variable for DEG analysis in the same formula, as a fixed effect? Assuming my variable of interest contains the different subclasses of genetic subclones that have been defined for each patient, will a single model with ~ (1 | Tissue) + (1 | Patient) + GeneticClonesClass then result in DEGs that are not due to differences between tissues or patients?

Best

ADD REPLY • link 17 months ago Seymoo • 0

While we are at it, what is your opinion on modeling the variables as below, so that both a random intercept and a random slope for "Tissue" are fitted within each "Patient"?

~ (1 + Tissue | Patient) + GeneticClonesClass

The motivation is that the same tissue type may have a different expression profile depending on the patient it is taken from. For some tissue types I have only 1 or 2 cases, and I am not sure whether considering Tissue and Patient separately could lead to inappropriate modeling.
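This does not settle whether the random slope is appropriate, but tabulating the design shows how much replication there is per Patient and Tissue combination, which a term like (1 + Tissue | Patient) would generally need in order to be separated from residual noise. A minimal sketch in base R, using a made-up sample sheet (the `df` contents and tissue names below are hypothetical):

```r
# Hypothetical sample sheet; replace with the real metadata.
df <- data.frame(
  Patient = c("P1", "P1", "P1", "P2", "P2", "P3", "P3", "P4"),
  Tissue  = c("Liver", "Lung", "Colon", "Liver", "Lung", "Liver", "Colon", "Brain")
)

# Samples per Patient x Tissue cell. If no cell exceeds 1, there is no
# within-cell replication to inform a patient-specific tissue effect.
design_counts <- table(df$Patient, df$Tissue)
print(design_counts)

# Tissues represented by only one or two samples overall.
tissue_n <- table(df$Tissue)
rare_tissues <- names(tissue_n)[tissue_n <= 2]
print(rare_tissues)
```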

ADD REPLY • link 17 months ago Seymoo • 0

score 1 · Answer 2 · 2024-08-27

Gordon Smyth 53k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

I generally agree with Gabriel Hoffman although, without knowing more about the design of your experiment, it is impossible to say whether treating Tissue and Patient as random is appropriate.

No, your proposed approach is not statistically sound. The whole purpose of a statistical analysis is to evaluate systematic changes relative to variation between patients and tissue samples, so trying to remove patient and sample variation before the analysis begins doesn't make sense. Such an approach will give an unrealistic impression of significance.

ADD COMMENT • link 17 months ago Gordon Smyth 53k

Thanks Gordon. I would greatly appreciate it if you could provide minimal code showing the most appropriate way to perform this analysis using edgeR. Also, would you recommend a log transformation for CPM? I have read that the hyperbolic arcsine (asinh) transformation is a better approach than adding a pseudocount prior to log transformation. This is a recommendation by Johnson, K.A., Genome Biol 23, 2022 [DOI: 10.1186/s13059-021-02568-9].
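For context on how the two transformations relate, asinh(x) equals log(x + sqrt(x^2 + 1)), so it is defined at zero without a pseudocount and approaches log(2x) for large x. A small base R check of that identity, using arbitrary example values:

```r
# Arbitrary example values on a CPM-like scale (illustrative only).
x <- c(0, 0.1, 1, 10, 1000)
asinh_x <- asinh(x)

# Identity: asinh(x) = log(x + sqrt(x^2 + 1)), natural log.
stopifnot(all.equal(asinh_x, log(x + sqrt(x^2 + 1))))

# Side by side: asinh stays finite at zero and tracks log(2x) for large x.
comparison <- data.frame(x = x, asinh = asinh_x, log_2x = log(2 * x))
print(comparison)
```

This only shows how the functions behave; it does not say which one is preferable for differential expression.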

ADD REPLY • link 17 months ago Seymoo • 0

You have not fully described your experiment, and I can't provide code for an experiment that I know so little about. But the edgeR documentation is full of examples and case studies that you can follow.

ADD REPLY • link 17 months ago Gordon Smyth 53k