Question

Trying to follow a gene set analyisis in R

0

Entering edit mode

leonardorosas2001 • 0

@953c4872

Last seen 4.0 years ago

Mexico

Michael Love Hi. I am following one of your courses about Introduction to Bioconductor. I have a question regarding the code that is given to answer an exercise. I am required to perform the following code but on a different GO:

library(GEOquery)
e = getGEO("GSE34313")[[1]]
e$condition = e$characteristics_ch1.2
levels(e$condition) = c("dex24","dex4","control")
table(e$condition)
names(fData(e))
fData(e)$GO_ID[1:4]
lvls = c("control", "dex4")
es = e[,e$condition %in% lvls]
es$condition = factor(es$condition, levels=lvls)
library(limma)
library(qvalue)
design = model.matrix(~ es$condition)
**fit = lmFit(es, design=design)**
fit = eBayes(fit)
topTable(fit)[,c(6,7,18,22)]
set.seed(1)
idx = grep("GO:0006955", fData(es)$GO_ID)
length(idx)
**r1 = roast(es, idx, design)**
r1

However, when I type fit = lmFit(es, design=design) it shows "Coefficients not estimable: (Intercept) es$conditiondex4 Error in lm.fit(design, t(M)) : 0 (non-NA) cases". I wonder why is that. Since fit is not functioning properly, the roast function is not working properly as well. It shows: "Error in .lmEffects(y = y, design = design, contrast = contrast, array.weights = Dots$array.weights, : No residual degrees of freedom"

In the video of the course, there seems to be no problem with the code, but now there is. If I could get a better insight about what is happening, and how I could correct these functions, I would be thankful. Best regards!

roast-function • 2.6k views

ADD COMMENT • link updated 4.0 years ago by Michael Love 43k • written 4.0 years ago by leonardorosas2001 • 0

score 0 · Answer 1 · 2022-01-19

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 8 hours ago

United States

What is e$condition after line 3?

ADD COMMENT • link 4.0 years ago Michael Love 43k

0

Entering edit mode

e$condition stands for an expression set that comes from the data in GEO. It was created to store the data in e$characteristics_ch1.2. If I run it, it gives:

e$characteristics_ch1.2 [1] "treatment: none"
[2] "treatment: none"
[3] "treatment: none"
[4] "treatment: none"
[5] "treatment: dexamethasone for 4 hr" [6] "treatment: dexamethasone for 4 hr" [7] "treatment: dexamethasone for 4 hr" [8] "treatment: dexamethasone for 24 hr" [9] "treatment: dexamethasone for 24 hr" [10] "treatment: dexamethasone for 24 hr"

It was done like that in a video of the course

ADD REPLY • link 4.0 years ago leonardorosas2001 • 0

0

Entering edit mode

What do you get for design after running design = model.matrix(~ es$condition)

ADD REPLY • link 4.0 years ago Michael Love 43k

0

Entering edit mode

I get the following

design = model.matrix(~ es$condition)

design (Intercept) es$conditiondex4 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$es$condition [1] "contr.treatment"

ADD REPLY • link 4.0 years ago leonardorosas2001 • 0

0

Entering edit mode

It looks like you only have a single sample.

ADD REPLY • link 4.0 years ago Michael Love 43k

0

Entering edit mode

Mm what could I do about it then? I had followed what was given but then, how would I add more samples?

ADD REPLY • link 4.0 years ago leonardorosas2001 • 0

0

Entering edit mode

I see, the problem is that base R has changed its behavior since the code was written. This will fix, changing line 3 to:

e$condition <- factor(e$characteristics_ch1.2)

ADD REPLY • link 4.0 years ago Michael Love 43k

0

Entering edit mode

Thank you. It actually worked! Now, I think the codes and therefore the answers may be not updated. This exercise had 2 parts. I hope you allow me to present this in advance as well. The 1st part is now solved. The 2nd part asks me to execute the following code

library(org.Hs.eg.db)
org.Hs.egGO2EG
go2eg = as.list(org.Hs.egGO2EG)
head(go2eg)
govector = unlist(go2eg)
golengths = sapply(go2eg, length)
head(fData(es)$GENE)

idxvector = match(govector, fData(es)$GENE);table(is.na(idxvector))
##This is the organized list of indexes for genes per GO term:
idx = split(idxvector, rep(names(go2eg), golengths))
##We can see the genes like this:
go2eg[[1]]
fData(es)$GENE[idx[[1]]]
##removal of gene sets with less than 10 genes and have NA values
idxclean = lapply(idx, function(x) x[!is.na(x)])
idxlengths = sapply(idxclean, length)
idxsub = idxclean[idxlengths > 10]
length(idxsub)
##an mroast is run
set.seed(1)
r2 = mroast(es, idxsub, design)
head(r2)

All of this is correct. In fact, I get a data.frame. However, maybe I'm misinterpreting the results. If head(r2) is run, NGenes stands for number of genes, correct? Then, the answer to the question: "Find the GO term that has the the largest proportion of upregulated genes. How many genes does this term have?" would need to biggest value in Ngenes correct?

If this is true, the biggest value in NGenes is 11990, but it's not correct according to the exercise. Maybe I'm not filtering correctly, so how could I filter for the biggest value?

ADD REPLY • link 4.0 years ago leonardorosas2001 • 0

0

Entering edit mode

I won’t be able to help more here but look up the help for the function producing the object. This advice really applies to all Bioconductor packages and functions.

ADD REPLY • link 4.0 years ago Michael Love 43k

0

Entering edit mode

Thank you. I hope I can still let you know if I have issues with your exercises. They are very challenging but useful! :)

ADD REPLY • link 4.0 years ago leonardorosas2001 • 0

0

Entering edit mode

No I can’t help with exercises. Just to be clear I’m not affiliated with Harvard and have no supported effort in relation to the edx classes. I try to preserve time answering questions here about software my lab supports.

ADD REPLY • link 4.0 years ago Michael Love 43k