Question

How to extract gene clusters (genes of similar pattern over time) once spline curves are fitted?

0

Entering edit mode

h.valipour.k • 0

@hvalipourk-16085

Last seen 4.9 years ago

Hi All,

My question below might have been answered in support.bioconductor, please direct me to those of most relevant answers, in case if I overlooked them, although I've been through many posts.

Here is my Target file:

Sample	Species	Condition	Time	Replicate	Cond_Cult
1	A	C	1	1	A.C
2	A	C	1	2	A.C
3	A	C	2	1	A.C
.	A	C	2	2	A.C
.	B	C	1	1	B.C
.	B	C	1	2	B.C
.	B	C	2	1	B.C
.	B	C	2	2	B.C
.	.	.	.	.	...
.	.	.	.	.	...
.	.	.	.	.	...
.	A	C	7	1	A.C
.	A	C	7	2	A.C
.	A	C	8	1	A.C
.	A	C	8	2	A.C
.	B	C	7	1	B.C
.	B	C	7	2	B.C
.	B	C	8	1	B.C
.	B	C	8	2	B.C
.	.	.	.	.	...
.	.	.	.	.	...
.	.	.	.	.	...
.	A	T	1	1	A.T
.	A	T	1	2	A.T
.	A	T	1	3	A.T
.	A	T	2	1	A.T
.	A	T	2	2	A.T
.	A	T	2	3	A.T
.	B	T	1	1	B.T
.	B	T	1	2	B.T
.	B	T	1	3	B.T
.	B	T	2	1	B.T
.	B	T	2	2	B.T
.	B	T	2	3	B.T
.	.	.	.	.	...
.	.	.	.	.	...
.	.	.	.	.	...
.	A	T	7	1	A.T
.	A	T	7	2	A.T
.	A	T	7	3	A.T
.	A	T	8	1	A.T
.	A	T	8	2	A.T
.	A	T	8	3	A.T
.	B	T	7	1	B.T
.	B	T	7	2	B.T
.	B	T	7	3	B.T
.	B	T	8	1	B.T
.	B	T	8	2	B.T
96	B	T	8	3	B.T

Description:

There are 2 species (A & B), each of them have 2 levels of condition (Control & Treatment) and 2-3 levels of Replications.

There are two main questions, plus a few more:

Q1. How the two species behave over time once subjected to treatment (expression pattern).

Q2. How many expression patterns can be seen and what genes represent a cluster.

I followed part 9.6.2 of the Limma package;

f <- factor(Target$Cond_Cult, levels=unique(Target$Cond_Cult))
# levels:  A.C B.C A.T B.T

Q3. Is A.C my primary level?

# Creating design matrix
design1 <- model.matrix(~0+f)

# Creating DGEList object (dge)
dge <- DGEList(counts=myCounts,  genes = myIDs)

# Filtering
keep <- filterByExpr(dge, design1)
dge <- dge[keep, keep.lib.sizes=FALSE]

# TMM Normalization
dge <- calcNormFactors(dge)

# the voom transformation is applied to the normalized and filtered DGEList object
v <- voom(dge, design1, plot=TRUE)

# Setting up a basis for a natural regression spline
library(splines)
X <- ns(Target$Time, df=5)
Group <- factor(Target$Cond_Cult)
design2 <- model.matrix(~Group*X)
colnames(design2) <- gsub("Group", "", colnames(design2))
colnames(design2)
# [1]  "(Intercept)" "K.I"         "P.C"         "P.I"        
# [5]  "X1"          "X2"          "X3"          "X4"         
# [9]  "X5"          "K.I:X1"      "P.C:X1"      "P.I:X1"     
# [13] "K.I:X2"      "P.C:X2"      "P.I:X2"      "K.I:X3"     
# [17] "P.C:X3"      "P.I:X3"      "K.I:X4"      "P.C:X4"     
# [21] "P.I:X4"      "K.I:X5"      "P.C:X5"      "P.I:X5" 

# Fitting Spline model
fit <- lmFit(v, design2)
fit <- eBayes(fit, robust = TRUE)
topTable(fit, coef=10:24, number=20)

Q4. Is my design matrix (design2) appropriate for my main questions?

Q5. How to find out the number of gene clusters and extract them from the analysis (command in R).

Thank you all.

Hossein

Limma voom spline gene cluster Time course experiment • 1.0k views

ADD COMMENT • link updated 5.8 years ago by Gordon Smyth 50k • written 5.9 years ago by h.valipour.k • 0

score 1 · Answer 1 · 2018-06-14

As currently formulated, your questions have nothing to do with testing for differential expression, so it's hard to give any recommendations on whether you are using limma correctly. I will answer what seems to be your main question, which is "how do I cluster genes with similar expression profiles across samples?"

The quick and dirty method is to do something like this:

Use cpm to compute log-CPM values from the counts with a reasonably large prior.count.
Create a pasted factor that combines Species, Condition and Time.
Use sumTechReps to compute the combination-specific average for each gene.
Use hclust, kmeans, etc. to cluster the genes based on the average matrix.

If you want to get species- or condition-specific clusters, then you just subset your average matrix by your species or conditions of interest. I don't see the need to use splines here; the clustering algorithm doesn't care about the spline coefficients, all it needs is a measure of similarity or difference between points. Of course, this raises more general questions like "how many clusters?" and "are my clusters trustworthy?"; these issues are covered in other forums so I will not talk about them here.

If you want more statistical rigour, perhaps it would be better to reformulate your questions in terms of "what is the set of genes that exhibit a significant time response in Species A and Condition C?" Now, this is a question that can be answered with limma.