extract similar expressed genes (SEGs) rather than DEGs
4
4
Entering edit mode
@andreas-heider-4538
Last seen 9.2 years ago

Dear BioC community,

currently I am searching for a way to extract similar expressed genes in contrast to standard DEGs.
Is there a way to achieve this by using already available packages, maybe LIMMA?

OH, I am working with single channel data, all aggregated in an ExpressionSet. I have the right annotations available, of course.

 

Thanks,

Andreas

limma microarray • 3.6k views
ADD COMMENT
0
Entering edit mode

Can you elaborate a bit more on what "similarly expressed" genes are so we can get some more clarity on what you are after?

Do you mean you want to find groups of genes with correlated expression across many samples/conditions, or something else?

ADD REPLY
5
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 20 hours ago
The city by the bay

limma is geared towards testing for differences in gene expression between conditions. However, as you probably already know, failing to detect any DE is not the same as "similar expression", i.e., absence of evidence does not provide evidence of absence. For example, a truly DE gene might have a large p-value if its expression is highly variable across replicates.

A formal statistical test for a similarly-expressed gene might be something like the two one-sided tests (TOST) procedure. Briefly, you specify a log-fold change threshold, under which you consider the gene as having no differences in expression. You then test for whether the observed log-fold change is significantly smaller and larger than the positive and negative threshold, respectively. If so, then that represents evidence that the true fold change does lie under the threshold. However, this can be quite conservative (as it involves an intersection) and I'm not aware of any implementation within the limma framework.

Depending on your application, an ad hoc approach may be sufficient, e.g., considering similarly-expressed genes as those with small log-fold changes. You can also get confidence intervals on the log-fold changes by specifying confint=X in topTable, for some percentage X; you could then pick the genes where the boundaries of the interval are close to zero. We'd need to know more about your study, though, to give more advice.

ADD COMMENT
0
Entering edit mode

Hi Steve and Aaron,

thanks for your answers!

 

@Aaron:

What you eloborated reads like you are talking about something, that is for exmaple the content of a volcano plot? With the genes consitently expressed in the same manner in let's say the upper right corner?

Would it further make sense to perform a Gene Ontology Analysis on those genes to get a overview of the processes/functions that are expressed similarily?

 

Thanks again,

Andreas

ADD REPLY
0
Entering edit mode

If you're looking for similarly expressed genes in a volcano plot, you'd want those with low log-odds for DE and near-zero log-fold changes. So, if you're using the volcanoplot function from limma, you'd be looking for the genes at the bottom of the plot, i.e., at the tip of the volcano shape. However, there's no need to actually generate a volcano plot. You can get all of the required statistics by running topTable.

As for the GO analyses; I suppose it's possible to look for overrepresentation of GO terms within a defined set of (putative) similarly-expressed genes. However, I can think of some complications. For example, standard GO analyses with DE lists require adjustment for gene length bias. This is because the DE list is enriched for long genes that have more counts and greater power to reject the null. In your case, the problem may be inverted, as your set of similarly-expressed genes may be enriched for underpowered short genes. I don't know whether any analysis packages are capable of accommodating this kind of bias.

ADD REPLY
0
Entering edit mode

Dear Aaron, thanks for the explanation,

however, I think I did not explain the problem in detail. I have a dataset, that has several (10+) phenotypes, in this case celltypes. Exploratory analysis shows, that there are 2 celltypes (let's say A and B), that are similar to each other but different from the other ones (let's say C-J).

I now want to know the genes contributing to this "common phenotype". Can this be done with LIMMA comparing the following sample groups: (A and B)-(C to J) ?

Or is there another approach, maybe something with an add sign "+" in the LIMMA call? The approach explained here would result in the upper right quadrant of a volcanoplot giving the right similar expressed genes in sample groups A+B, or am I wrong?

I would be happy for your suggestions!

 

Thanks,

Andreas

ADD REPLY
0
Entering edit mode

If you want to find genes that separate, say, cell types A and B from cell type C, what you could do is to test for DE genes between A and C, and also between B and C. You can then intersect these two DE lists, selecting for genes with the same direction of fold change for A/C and B/C. This will give you the genes that are driving the separation of A and B from C.

Note that this approach would pick up a gene that has, e.g., a 10-fold change between A/C and 2-fold change between B/C. Technically, this might not be what you want as this gene is also DE between A and B. However, there's no (obvious) way to correct for this other than to use the procedures I described in my original answer. Besides, I think that this gene is still a good candidate, as it is involved in separating A and B from C.

If you want to test separation of A and B from multiple other cell types, e.g., C to J, you can do a DE comparison between A and the average of C to J, and another between B and that average. Intersections can then be performed as previously described.

The approach you've suggested involves a single DE comparison between the average of A+B and the average of C to J. This won't work as well, because you could get DE genes that are changing in the opposite direction for A and B, e.g., if it goes up by 10-fold in A vs C-J but down by 2-fold in B vs C-J, the average of A and B can still be (significantly) higher than that of C-J.

ADD REPLY
2
Entering edit mode
@sean-davis-490
Last seen 3 months ago
United States

I wrote a quick blog entry on finding non-differentially-expressed genes based on an excellent mail list answer by Simon Anders.  This was based on RNA-seq, but the ideas are general.

http://watson.nci.nih.gov/~sdavis/blog/testing-for-non-differentially-expressed-genes/

ADD COMMENT
0
Entering edit mode

The link above doesn't work anymore. Could you please make it work again? Thanks.

ADD REPLY
0
Entering edit mode
@andreas-heider-4538
Last seen 9.2 years ago

Hi Steve and Aaron,

thanks for your answers!

 

@Aaron:

What you eloborated reads like you are talking about something, that is for exmaple the content of a volcano plot? With the genes consitently expressed in the same manner in let's say the upper right corner?

Would it further make sense to perform a Gene Ontology Analysis on those genes to get a overview of the processes/functions that are expressed similarily?

 

Thanks again,

Andreas

ADD COMMENT
0
Entering edit mode
@andreas-heider-4538
Last seen 9.2 years ago

I am thinking about two one sided test (TOST) or double one sided t-test (DOSTT). Can these be used within LIMMA?

I am sure it can be used somewhere, but don't know how. Especially, since one needs the so called "epsilon" to determine the "magnitude of the region of similarity" as written in the documentation of the "equivalence" package. At this moment I am not sure how to calculate a reasonable "epsilon".

Really hope someone already ran into this problem.

 

Thanks,

Andreas

ADD COMMENT
2
Entering edit mode

As I've written before, there is no support for TOST in limma. If you want it, you'll have to do it yourself, and it'll be pretty messy. Tentatively, if you have a fit object after eBayes:

threshold <- 1
of.interest <- 2
logFC <- fit$coefficients[,of.interest]
t1 <- (logFC - threshold)/fit$stdev.unscaled[,of.interest]/sqrt(fit$s2.post)
p1 <- pt(t1, df=fit$df.total, lower.tail=TRUE) # One-sided p-value.
t2 <- (logFC + threshold)/fit$stdev.unscaled[,of.interest]/sqrt(fit$s2.post)
p2 <- pt(t2, df=fit$df.total, lower.tail=FALSE)
tost.p <- pmax(p1, p2)

I'm not sure how right that is (if at all) in the EB framework, so YMMV.

The threshold parameter is equivalent to the 'epsilon' parameter you were mentioning. This is an inherently arbitrary parameter, as it depends on what kind of fold change you're willing to tolerate. For example, a 5-fold change might not be biologically interesting for constitutively expressed genes, whereas a 1.5-fold change might be very interesting for critical genes like master regulators or whatnot.

ADD REPLY
0
Entering edit mode

Dear Aaron,

I tried the approach you described and in a short example using the Dilution dataset I came up with a vector of p-values for each feature/gene/transcript. However, I did not understand on which "fit" object to make the calculations.

Does it need to be a fit, a lmfit or an eBayes fit-object? What about design matrices, contrasts and reference samples? Should I declare those in the call of lmfit or eBayes? How can I asume a reasonable epsilon or threshold as you describe it? What can I select with "of.interest"?

I am sorry to have so much questions, so please be gentle :-)

 

Yours,

Andreas

ADD REPLY
0
Entering edit mode

The fit object is that produced by eBayes. However, by the time you get to that step, you should have already decided on the design matrix and contrast that you're interested in. In short, you should run lmFit, in which you specify your design; contrasts.fit, to specify the comparisons that you're interested in; and then eBayes, to perform empirical Bayes shrinkage. The final MArrayLM object is what fit should be.

I don't know what you're talking about when you say "reference sample", so I'll just ignore that for now. As for obtaining a "reasonable" epsilon, that's not easy for reasons discussed above. There's no obvious way to figure out what is biologically reasonable from the data. If you insist on using TOST, you could just pick an arbitrary value like 1. Then, if you discover any similar genes, you can mention that the fold change was significantly less than 2 in either direction.

The specification of of.interest depends on which coefficient corresponds to your comparison of interest. For example, if you have a design matrix where your second coefficient corresponds to your treatment effect, then of.interest <- 2. If you have more complicated contrasts, then you can set that using contrasts.fit. Check out the documentation at ?contrasts.fit for more details.

ADD REPLY
2
Entering edit mode

I think you would be better off simply using the confidence intervals already provided by limma.

TOST and DOSTT are very formal methods, but there is no advantage to this formality if you don't have an a priori meaningful value for epsilon. Since you have to choose episilon ad hoc, the whole method becomes ad hoc. So you would I think be just as well served by the flexibility of the confidence intervals.

You can rank genes by evidence for equivalence by sorting in increasing order of the maximum absolute limit of the confidence interval.

ADD REPLY

Login before adding your answer.

Traffic: 675 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6