Possible implementation of roast/mroast for testing common DE patterns across two different experiments
2
0
Entering edit mode
svlachavas ▴ 830
@svlachavas-7225
Last seen 13 months ago
Germany/Heidelberg/German Cancer Resear…

 Dear Bioconductor community,

based on one of my previous posts (C: Merge different datasets regarding a common number of DE genes in R) i asked about possible approaches or ideas about finding "common genes" in two different DEG lists regarding the same disease-also, if another more sophisticated way of testing if the "observed pattern" of DE  is similar/conistent across the different datasets-expreriments-DEG lists exists. Moreover, after the possible proposal of Mr Aaron Lun, i would like to ask if it is possible(because im not experienced in R) to use the one DEG list to define a geneset, and then use roast to test for DE for those genes in the other experiment/dataset ? All suggestions or ideas would be essential !!

 

 

 

gene set testing roast mroast differential gene expression bioconductor • 2.0k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 3 hours ago
The city by the bay

It should be okay to use ROAST in this application, as the definition of the gene set (from one experiment) is independent of the application of the gene set in ROAST (in the second experiment). This means that you're not doing any inappropriate data snooping, e.g., by running ROAST on the same data that you used to define the gene set.

In practice, I would define my DGE list from the first experiment, and then split it into two separate DGE lists based on the sign of the log-fold change. I would then use each of the two sublists in ROAST for the comparison in the second experiment. If the experiments have similar DE profiles, then the "up" sublist should have a consistent direction of "up" in the ROAST output, and similarly for the "down" sublist.

Besides intersection, no "sophisticated approaches" come to mind. One (visual) alternative might be to simply plot the log-fold changes of one experiment against the log-fold change of the other experiment for all genes. If you have similar DE profiles between experiments, then you should see a more-or-less straight line through the origin. You can also compute the correlation coefficient to get some quantitative measure of similarity. You don't have to worry about technical correlations as you've got independent experiments.

Edit: roast can now accept a vector of gene.weights. This is intended to accept log-fold changes associated with the gene set, and can have both positive or negative entries. Thus, you don't have to split the DGE list from the first experiment to define your up/down sets. Rather, you can use the entire set in ROAST, along with the vector of log-fold changes for all genes in the set.

ADD COMMENT
0
Entering edit mode
svlachavas ▴ 830
@svlachavas-7225
Last seen 13 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Aaron,

firstly thank you for your consideration on this matter !! I tried to implement your suggestion in R, but im struggling with the code:

as you mentioned roast, i have to define three basic arguments (please excume me if i have understand anything wrong):

firstly i have to selected my probesets from my first DEG list from my one experiment. Secondly, i have to define the design matrix of my second experiment which i have used for the moderated t-test. And the final is how to associate the DEG set of the first experiment to the probeset IDs of the second experiment of the other dataset. Thus, this could be performed via the argument index ?and by which way? The only similar code i used in the past was regarding mroast to test for DE KEGG pathways:

x <- hgu133aPATH2PROBE # for the hgu133a platform
mapped_probes <- mappedkeys(x)
xx <- as.list(x[mapped_probes])
indices <- ids2indices(xx, rownames(data.trusted.eset)) # expressionset
res <- mroast(data.trusted.eset, indices, design, contrast=2)
res <- res[order(res$PValue.Mixed),]

 

ADD COMMENT
2
Entering edit mode

Let's say that you have a vector of probeset IDs from your first experiment, named dge.set. This is associated with a vector of log-fold changes, named dge.fc. Now, I'll assume you have a suitable matrix-like object for your second experiment named y, for which the probeset IDs are stored in the row names. We should be able to run:

matched <- match(dge.set, rownames(y))
matched <- matched[!is.na(matched)]
roast(y, index=matched, design, gene.weights=dge.fc[matched])

Of course, this assumes that the probeset IDs are directly comparable between your two chips. You can check this by looking at the hgu133a.db and hgu133plus2.db annotation packages. A brief look suggests that the probeset IDs are shared and match up to the same Entrez IDs, etc., so it's probably okay.

ADD REPLY
0
Entering edit mode

Dear Aaron,

firstly thank you for your code and your recommendations. Hovever, i need to ask you one more question that looks not important, but it concerns me.

So, based on your above approach, i have the top table data frame named top with me selected DE probesets from the first experiment and based on your above: set <- rownames(top)-so these are the probesetIDs of the first DEG list.

Secondly, i assume i have the Eset2 that is the ExpressionSet of the second dataset which we want to test the first DEG list. Thus, about your above code it should be:

matched <- match(set, rownames(Eset2) ?

and after the rest of the above code you wrote ?

Also, as my DEGlist has no duplicate probesets, should i also filter my Eset2 from my second dataset for any duplicates or after limma annotate mt topTable and use the rownames from this as rownames of my Eset2 ?

Thank you again for your consideration on this matter !!

 

ADD REPLY
1
Entering edit mode

That looks right; you should be able to just replace y in my code with Eset2 in your code.

As for your duplicate probesets, you can use avereps from the limma package on your Eset2 object. This will merge duplicates by replacing all duplicated probesets with a single row representing their average expression in each library. This seems better than just filtering out duplicates, which would result in the loss of information. It's also better than manually replacing row names, which could result in book-keeping errors if you're not careful with what you're doing.

ADD REPLY
0
Entering edit mode

Dear Aaron,

thank you again for your recommendations; regarding the duplicates matter, i tend to use not the average, as i have read in other posts/threads that i should treat each probeset as unique and not average, because duplicates of one transcript might represent alternative transcripts or isoforms and thus should not consider the average or median-instead, i use the MAD(Median Absolut Deviation) to keep the probeset with the bigger one from other duplicates. Also i have heard the custom cdf from Brain Array, but the Annotation approach is a whole procedure by itself and thus many different approaches exist.

ADD REPLY
0
Entering edit mode

Dear Aaron,

i used your code  as above and i also implemented the argument contrast=2 to specify the specific contrast i need from my design matrix:

genes <- set$PROBEID # where set is the data.frame with the DE probesetIDs from my first experiment

matched <- match(genes, rownames(data.trusted.eset))
matched <- matched[!is.na(matched)]

But when i used :

roast(data.trusted.eset, index=matched, design, contrast=2,gene.weights=genes.fc[matched])

# where data.trusted.eset is the expressionset from the second experiment, and also design the design matrix of the data.trusted.eset, it gave me the following result:

               Active.Prop   P.Value
Down              NA         NA
Up                  NA           NA
UpOrDown      NA           NA
Mixed             NA          NA

 

Any ideas or help ?

ADD REPLY
0
Entering edit mode
  • Check whether matched is empty. If so, the set's empty, so there's no point running roast.
  • Check whether genes.fc[matched] contains NA values. If so, subset matched so that you get rid of those entries:
matched <- matched[!is.na(genes.fc[matched])]
  • Check whether data.trusted.eset is okay. Does the following code generate NA's (assuming there's at least 10 genes in your second data set)?
roast(data.trusted.eset, index=1:10, design, contrast=2)
ADD REPLY
0
Entering edit mode

Dear Aaron,

yes you are right: when i write hgu133a$logFC[matched] where hgu133a the dataframe for topTable regarding the DEG genes and other info of the first dataset used to use it for the DEG genes -> 

head(hgu133a)
      PROBEID      GENE_SYMBOL
1   209735_at       ABCG2
2   207502_at      GUCA2B
3   206784_at        AQP8
4 206209_s_at         CA4
5   217546_at        MT1M
6   204475_at        MMP1
                                                                  GENE_NAME
1 ATP-binding cassette, sub-family G (WHITE), member 2 (Junior blood group)
2                              guanylate cyclase activator 2B (uroguanylin)
3                                                               aquaporin 8
4                                                     carbonic anhydrase IV
5                                                        metallothionein 1M
6                      matrix metallopeptidase 1 (interstitial collagenase)
  ENTREZID     logFC adjusted.P.Value
1     9429 -6.432513      0.006563296
2     2981 -6.597180      0.006563296
3      343 -8.451506      0.006563296
4      762 -7.728190      0.006563296
5     4499 -6.294264      0.008197294
6     4312  5.620516      0.009569328

and it gave me many NAs:

hgu133a$logFC[matched]
  [1] -6.427696 -6.279190 -6.294264 -2.354063 -3.191032 -2.176708 -6.091865
  [8] -3.196283 -3.294334  2.372149 -5.675410 -2.697132  1.365858  1.516165
 [15] -3.776092        NA        NA  2.354714 -2.588221 -3.528173  1.693774
 [22]  3.619838  1.937780 -2.273544        NA -3.732733 -2.245229  1.283509
 [29]        NA -1.080846        NA  2.146243        NA -1.761236  2.467160
 [36] -3.808038 -1.667233  1.634998  2.101009        NA  1.598870  1.578977
 [43] -3.426366        NA  1.014390  1.268782 -2.829184  2.166192        NA
 [50]        NA -3.652755  1.419136 -1.928969 -1.911653  2.624400        NA.....

thus i wrote your function above: 

matched <- matched[!is.na(genes.fc[matched])]

length(matched) :208

and then roast(data.trusted.eset, index=matched, design, contrast=2, gene.weights=hgu133a$logFC[matched]):

                  Active.Prop     P.Value
Down          0.2500000      1.0000000000
Up              0.3509615      0.0005002501
UpOrDown   0.3509615      0.0010000000
Mixed          0.6009615      0.0030000000

But i find a bit difficult to explain the above results

ADD REPLY
0
Entering edit mode

Check out the value and details section in ?roast for an explanation of the various fields. In your case, the test result for "Up" is significant. This indicates that, generally, the up- and down-regulated genes in your first set are also significantly up- and down-regulated, respectively, in your second experiment (specifically, 35% of all genes in the set have the same changes between experiments). If I remember correctly, the interpretation above is because the gene.weights are used to adjust the sign of the log-fold changes in the second experiment, so if you have negative log-FCs from both the first and second experiment, the final log-FC will be seen as positive - hence, "up".

ADD REPLY

Login before adding your answer.

Traffic: 523 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6