Hello,
I am running a DESeq2 analysis in R on samples from different cancer stages taken from different patients. In heatmaps and t-SNE plots the samples cluster by patient rather than by cancer stage, which suggests substantial between-patient variability. I have included all 8 patients and their samples from different stages of cancer progression (normal, neoplasia, carcinoma in situ, invasive, metastasis, etc.).
From a statistical point of view, is it correct to use an interaction term for the stages of cancer progression? What I have in mind is:
dds <- DESeqDataSetFromMatrix(countData = ex[,2:113], colData = pdata, design = ~ patient + stages + patient:stages)
where:
samples = colnames(ex[,2:113])
stages = rep("xxx", length(samples))
names = c("normal", "EN", "DCIS", "IDC", "AVL", "ECE", "met_ECE", "met_no_ECE")
for (i in names) {
  print(i)
  stages[grep(i, samples)] = i
}
patient = rep("ccc", length(samples))
names2 = c("patient1", "patient2", "patient3", "patient4", "patient5", "patient6", "patient7", "patient8")
for (y in names2) {
  print(y)
  patient[grep(y, samples)] = y
}
pdata = data.frame(samples, patient, stages)

pdata looks like:

    samples                   patient   stages
1   patient1_AVL_rep1         patient1  AVL
2   patient1_AVL_rep2         patient1  AVL
3   patient1_DCIS_rep1        patient1  DCIS
4   patient1_DCIS_rep2        patient1  DCIS
5   patient1_IDC_rep1         patient1  IDC
6   patient1_IDC_rep2         patient1  IDC
7   patient1_met_no_ECE_rep1  patient1  met_no_ECE
8   patient1_met_no_ECE_rep2  patient1  met_no_ECE
9   patient1_normal_rep1      patient1  normal
10  patient1_normal_rep2      patient1  normal
Do you think it would be better to subset the data set and run DESeq2 separately, comparing different stages of progression for each patient?
Thanks so much in advance for your help!
best,
Belen
Hi Michael,
Thanks so much for your answer.
The replicates are biological.
This is what I have done so far. But when I do unsupervised hierarchical clustering or t-SNE, my samples cluster by patient, and only the "normal" samples cluster together. I am not sure whether I should 1) analyze the whole data set and then use contrasts to compare different stages; 2) subset the stages two at a time (normal vs X, etc.); or 3) add an interaction term for patient. I saw in other posts that for DESeq2 it is recommended to use the whole data set, unless there is high variability between groups.
Sorry if my questions are naive, but I am quite new to the field and this is the first time I am doing a bioinformatic analysis, and I want to make sure that the design is sound.
Thanks for your help,
Belen
I'm sorry, I'm still confused. Exactly what kind of replication is it? Multiple samples from the same tumor? Usually biological replication involves different organisms/donors, and you want to know, for example, whether the differences across conditions are larger than the differences among organisms/donors within a condition.
Here you have multiple levels of replication obviously, but to give the right answer, we need to ask the right question.
Sorry, I should have explained myself better from the beginning.
I am studying gene expression profiles in cancer progression. Each replicate corresponds to a different sample that was independently isolated and processed for library preparation and sequencing (they are not technical replicates).
When I plot a PCA colored by cancer stage, the normal samples cluster together, and the others are close but show variability. When I do t-SNE or unsupervised hierarchical clustering, the samples cluster by patient and not by cancer stage, meaning that the variability among patients is high, right?
I was running DESeq2 with "~patient + stage" on the whole data set, but I wonder if that is statistically correct when the samples are so different. The other options I am considering are:
1. Subsetting the data by stage and running the analysis comparing two stages at a time (normal vs neoplasia, normal vs invasive, neoplasia vs invasive, etc.).
2. Subsetting by patient, analyzing the gene expression profile over the progression of the disease within each patient, and looking for a common signature across patients. (I do not think I should.)
3. Introducing an interaction term for cancer stage, because one could argue that each stage is effectively a different cell type. I have seen people use an interaction term to compare different tissues from the same species and also for treated/non-treated conditions, but I am not sure whether that applies here.
Thanks for your help. I hope I have explained it better now.
B
The fact that there is a lot of patient variation in the PCA plot doesn't imply a problem with ~patient + stage. For example, there is large variation due to donor in the airway dataset (PCA), but ~donor + treatment allows one to control for that variation and find the common effect of the treatment. It seems to me that you are interested in finding the differences due to stage, and this design will accomplish that. You can use plotCounts to look at the top genes afterward to get a sense of how controlling for the patient baseline works in this design.
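As a rough illustration of the ~patient + stage approach, here is a minimal sketch on simulated counts. The sample layout (4 patients, 3 stages, 2 replicates each) and the stage names used are assumptions made up for the example, not the real data, and the simulated counts carry no true stage effect, so the output only shows the mechanics.

```r
library(DESeq2)

set.seed(1)
# Simulated null counts (no true stage effect): 500 genes x 24 samples
dds <- makeExampleDESeqDataSet(n = 500, m = 24)

# Hypothetical layout: 4 patients x 3 stages x 2 replicates
stage_levels <- c("normal", "DCIS", "IDC")
dds$patient <- factor(rep(paste0("patient", 1:4), each = 6))
dds$stage   <- factor(rep(rep(stage_levels, each = 2), times = 4),
                      levels = stage_levels)  # "normal" is the reference level

# Patient enters as a blocking factor; stage is the effect of interest
design(dds) <- ~ patient + stage
dds <- DESeq(dds)

# Stage effect (IDC vs normal) after accounting for patient baselines
res <- results(dds, contrast = c("stage", "IDC", "normal"))

# Normalized counts for the gene with the smallest p-value;
# drop returnData = TRUE to draw the plot instead of returning a data frame
top <- rownames(res)[which.min(res$pvalue)]
d <- plotCounts(dds, gene = top, intgroup = c("patient", "stage"),
                returnData = TRUE)
head(d)
```

On real data, plotting the counts per patient and stage for a top gene should show each patient starting from its own baseline while the stage shift points in the same direction across patients.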
You don't need to subset the dataset. If you want to analyze each patient separately, you can follow the recommendation in the vignette and combine patient and stage into a single group factor, then compare the different levels with 'contrast'.
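The group-factor approach could look something like the sketch below. It again uses simulated counts and an assumed layout of 4 patients, 3 stages and 2 replicates; the level names such as "patient1_IDC" are just what paste() produces for this made-up layout.

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 500 genes x 24 samples (no true effects)
dds <- makeExampleDESeqDataSet(n = 500, m = 24)

# Hypothetical layout: 4 patients x 3 stages x 2 replicates
stage_levels <- c("normal", "DCIS", "IDC")
dds$patient <- factor(rep(paste0("patient", 1:4), each = 6))
dds$stage   <- factor(rep(rep(stage_levels, each = 2), times = 4),
                      levels = stage_levels)

# Combine patient and stage into one factor with a level per combination
dds$group <- factor(paste(dds$patient, dds$stage, sep = "_"))
design(dds) <- ~ group
dds <- DESeq(dds)

# Within-patient comparison: IDC vs normal in patient1 only
res_p1 <- results(dds,
                  contrast = c("group", "patient1_IDC", "patient1_normal"))
summary(res_p1)
```

All samples still contribute to the dispersion estimates, which is the reason to prefer this over physically subsetting the count matrix; only the contrast is restricted to one patient. Note that each within-patient comparison then rests on the replicates of that patient alone.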