Search
Question: How to extract gene clusters (genes of similar pattern over time) once spline curves are fitted?
0
gravatar for h.valipour.k
4 months ago by
h.valipour.k0 wrote:

Hi All,

My question below might have been answered in support.bioconductor, please direct me to those of most relevant answers, in case if I overlooked them, although I've been through many posts. 

Here is my Target file:

Sample Species Condition Time Replicate Cond_Cult 
1 A C 1 1 A.C
2 A C 1 2 A.C
3 A C 2 1 A.C
. A C 2 2 A.C
. B C 1 1 B.C
. B C 1 2 B.C
. B C 2 1 B.C
. B C 2 2 B.C
. . . . . ...
. . . . . ...
. . . . . ...
. A C 7 1 A.C
. A C 7 2 A.C
. A C 8 1 A.C
. A C 8 2 A.C
. B C 7 1 B.C
. B C 7 2 B.C
. B C 8 1 B.C
. B C 8 2 B.C
. . . . . ...
. . . . . ...
. . . . . ...
. A T 1 1 A.T
. A T 1 2 A.T
. A T 1 3 A.T
. A T 2 1 A.T
. A T 2 2 A.T
. A T 2 3 A.T
. B T 1 1 B.T
. B T 1 2 B.T
. B T 1 3 B.T
. B T 2 1 B.T
. B T 2 2 B.T
. B T 2 3 B.T
. . . . . ...
. . . . . ...
. . . . . ...
. A T 7 1 A.T
. A T 7 2 A.T
. A T 7 3 A.T
. A T 8 1 A.T
. A T 8 2 A.T
. A T 8 3 A.T
. B T 7 1 B.T
. B T 7 2 B.T
. B T 7 3 B.T
. B T 8 1 B.T
. B T 8 2 B.T
96 B T 8 3 B.T

 

Description:

There are 2 species (A & B), each of them have 2 levels of condition (Control & Treatment) and 2-3 levels of Replications.

There are two main questions, plus a few more:

Q1. How the two species behave over time once subjected to treatment (expression pattern).

Q2. How many expression patterns can be seen and what genes represent a cluster.

 

I followed part 9.6.2 of the Limma package;

f <- factor(Target$Cond_Cult, levels=unique(Target$Cond_Cult))
# levels:  A.C B.C A.T B.T

Q3. Is A.C my primary level?

# Creating design matrix
design1 <- model.matrix(~0+f)

# Creating DGEList object (dge)
dge <- DGEList(counts=myCounts,  genes = myIDs)

# Filtering
keep <- filterByExpr(dge, design1)
dge <- dge[keep, keep.lib.sizes=FALSE]

# TMM Normalization
dge <- calcNormFactors(dge)

# the voom transformation is applied to the normalized and filtered DGEList object
v <- voom(dge, design1, plot=TRUE)

# Setting up a basis for a natural regression spline
library(splines)
X <- ns(Target$Time, df=5)
Group <- factor(Target$Cond_Cult)
design2 <- model.matrix(~Group*X)
colnames(design2) <- gsub("Group", "", colnames(design2))
colnames(design2)
# [1]  "(Intercept)" "K.I"         "P.C"         "P.I"        
# [5]  "X1"          "X2"          "X3"          "X4"         
# [9]  "X5"          "K.I:X1"      "P.C:X1"      "P.I:X1"     
# [13] "K.I:X2"      "P.C:X2"      "P.I:X2"      "K.I:X3"     
# [17] "P.C:X3"      "P.I:X3"      "K.I:X4"      "P.C:X4"     
# [21] "P.I:X4"      "K.I:X5"      "P.C:X5"      "P.I:X5" 

# Fitting Spline model
fit <- lmFit(v, design2)
fit <- eBayes(fit, robust = TRUE)
topTable(fit, coef=10:24, number=20) 

Q4. Is my design matrix (design2) appropriate for my main questions?

Q5. How to find out the number of gene clusters and extract them from the analysis (command in R).

 

Thank you all.

Hossein

ADD COMMENTlink modified 4 months ago by Gordon Smyth35k • written 4 months ago by h.valipour.k0
1
gravatar for Aaron Lun
4 months ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

As currently formulated, your questions have nothing to do with testing for differential expression, so it's hard to give any recommendations on whether you are using limma correctly. I will answer what seems to be your main question, which is "how do I cluster genes with similar expression profiles across samples?"

The quick and dirty method is to do something like this:

  1. Use cpm to compute log-CPM values from the counts with a reasonably large prior.count.
  2. Create a pasted factor that combines Species, Condition and Time.
  3. Use sumTechReps to compute the combination-specific average for each gene.
  4. Use hclust, kmeans, etc. to cluster the genes based on the average matrix.

If you want to get species- or condition-specific clusters, then you just subset your average matrix by your species or conditions of interest. I don't see the need to use splines here; the clustering algorithm doesn't care about the spline coefficients, all it needs is a measure of similarity or difference between points. Of course, this raises more general questions like "how many clusters?" and "are my clusters trustworthy?"; these issues are covered in other forums so I will not talk about them here.

If you want more statistical rigour, perhaps it would be better to reformulate your questions in terms of "what is the set of genes that exhibit a significant time response in Species A and Condition C?" Now, this is a question that can be answered with limma.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Aaron Lun21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 380 users visited in the last hour