Hi there, I'm a complete newbie to working with data in R for my own project that's not been cleaned up and streamlined for a university course, so wanted to ask if anyone has had experience working with or has thoughts on using data from the TCGA's RNA Seq V2 RSEM data for differential expression analysis through voom and limma
After scouring the forums, it seems that there's a general consensus to filter out any low counts and then run voom-limma
My question is whether the presence of a mutation in a tumour leads to significant up or downregulation of genes compared to those without the mutation. However, when I do this with the TCGA's RSEM data, I get that almost all of the genes in both my groups are significantly upregulated. Any thoughts or advice would be appreciated. This is bits and pieces of what I've been able to learn online through papers, forums, and youtube videos.
I can also supply more code to try to describe the problem better. Thanks so much for reading!
#mrna is a dataframe of 20531 genes x 230 samples
counts <- mrna
counts[is.na(counts)] <- 0
d0 <- DGEList(counts)
#mut is a variable of 'Yes' and 'No' regarding whether the patient harbours the mutation of interest or not
group <- as.factor(sample$mut)
d0$samples$group <- group
keep <- rowSums(counts) > 20
counts <- counts[keep,]
design <- model.matrix(~0+group)
colnames(design) <- gsub("group", "", colnames(design))
v <- voom(d0, design, plot=T)
vfit <- lmFit(v, design)
efit <- eBayes(vfit)
summary(decideTests(efit))
tfit <- treat(vfit, lfc=1)
dt <- decideTests(tfit)
summary(dt)
#the output of this is that both the groups with and without the mutation almost all the 20,300 genes that were included are upregulated
Thanks so much for your response. I was following the limma-voom workflow link previously but now after reading the guide for creating design matrices, I'm seeing that I didn't make my wild type group the reference group and maybe that contributed to the problem
However, with that adjustment and removing the requirement for the intercept to be 0 and the renaming of the columns, I'm now getting that none of the genes are up or downregulated out of ~20,000 genes- is this likely? Or does my analysis just need more adjustment?
After tinkering with this all day, some strange things that I've found are:
When I run View(decideTests(efit)), my TestResults matrix yields all 1s for the (Intercept) column and all 0s for the groupmut column- why would this happen if I specified d0$samples$group <- group earlier and my d0 object showed that I clearly delineated which sample fell into which group?
When I run topTable(efit), all my adj. P values are exactly the same at 0.3699211, but when I run topTable(efit, coef=1), I get different adj P values that are highly significant. I've read the documentation on coef within topTable, but am not sure in my case what the NULL coef and a coef of 1 means for my data.
Thanks for reading until this point and I thank you again in advance for your help.
The workflow showed you how to form contrasts but you skipped that step.
No it didn't. If you use
0+
for the design matrix then you must explicitly form contrasts. The question of which is the reference group is quite separate. The only difference that setting the reference group makes is to change the sign of the logFC values. The p-values will be identical either way.Yes, it is very likely indeed that noisy human data will show no significant DE. TCGA in particular is notoriously noisy. It is one of the key aims of the limma software that it does not give significant DE results unless there is good reproducible evidence for differential expression.
Very likely. You've skipped out several important steps from the limma-voom workflow including
filterByExpr
to cut down on variability, library size normalization withcalcNormFactors
and data exploration withplotMDS
. You almost certainly need sample quality weights. There are almost certainly batch effects or covariates that should be accounted for. Large human datasets require lots of quality asssessment and trouble-shooting.There are no circumstances in which there should be any NA values in an RNA-seq dataset! This makes me suspicious that there is something wrong or non-standard with your pipeline.
Nothing at all strange about these things. The first coefficient of the fitted model is the intercept. It makes no sense to test whether the intercept is equal to zero, of course it won't be. Please type
colnames(efit)
and then compare to the documentation to understand what the two coefficients represent.Hi again, thank you so much for getting back to me so thoroughly. You're right that I didn't include contrasts and that was from my misunderstanding that I wasn't supposed to use them since I am comparing between just two groups.
Your point about the NA values is very helpful, I saw that one sample had 516 NAs and once I removed that, there were no longer any NA values for the rest of the dataset. I tried to more thoroughly follow along with the guidelines and also went through the limma User Guide and have come to the following:
My output when I try to explore the data with plot MDS (with NS being no mutation and CS being currently mutated) looking like this:
So the way this data looks is that there is no natural clustering of my two groups, and perhaps indeed no DE genes would that be a correct interpretation? Would the next step to be to find another dataset to ask this question on or is there more that can be done with this data in terms of adjustments for an exploratory analysis? Especially since the % variance explained by the top two dimensions are both only 3% with the 'pairwise' gene.selection, it feels like I'm hitting a dead end.
Apologies if these questions are incredibly naive, thanks for your patience and reading through.
The MDS plots show absolutely no suggestion of any separation between the groups, i.e., no suggestion of differential expression. I can't tell you want to do next as that's a research question rather than a software question. It is theoretically possible but very unlikely that there might a DE in the data if appropriate covariates or technical effects were accounted for, but I have no idea whether any such covariates exist.
To find out what
gene.selection="common"
does, typehelp(plotMDS)
.Thank you so much! One final question, I think it makes sense that nothing is significantly up or downregulated based on plotMDS, but why would in turn every single gene show up as upregulated with the same data when I give more weight to fold changes in:
You have gone back to the same mistake as in your original question. Let me repeat, you don't need to use
0+
to create the design matrix but, if you do, then you must explicitly form contrasts.Okay I see, that makes complete sense- thanks so much again for your patience and help! Will be going back to the drawing board to reassess next steps