Dear Community,
Based on the results of a previous post (C: Possible ways of performing differential gene expression and analysis of RNA-Seq) regarding the analysis of a TCGA RNA-Seq dataset, I ended up with a list of DE genes (the analysis was performed on log2(estimated counts + 1) values with the microarray pipeline: limma-eBayes). Then, using a simple Venn diagram, I compared the ~5000 DE gene symbols from the RNA-Seq data with a small gene signature (94 DE genes) that I obtained from an independent microarray experiment with similar experimental conditions (same cancer, very similar comparison in limma, etc.). According to the Venn diagram, 89 gene symbols are common to both lists, and they also change in the same direction (based on the log2FC).
So would the most appropriate/unbiased interpretation be that these 89 gene symbols are more likely to be genuinely DE, since they were found in two independent datasets? Or could a more "advanced" approach be used that also takes the log2FCs into account, something like a small meta-analysis, despite the different high-throughput technologies? Another possible drawback concerns the annotation process: the microarrays were analyzed with customCDF arrays (Affymetrix), whereas the RNA-Seq data, loaded from a specific R package, had already been annotated to unique gene symbols.
[*I understand that this question might be a little too general for the purpose of this group, but there might be R packages or approaches for this purpose that I'm not familiar with.]
Any opinion or feedback is welcome!
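The overlap and direction check described above can be sketched in base R. This is only a minimal illustration: the vectors array.fc and rnaseq.fc are hypothetical toy data (not values from either study), and the sign test treats genes as independent, which real expression data generally are not, so its p-value should be read as a rough guide only.

```r
# Hypothetical toy signatures: named log2FC vectors keyed by gene symbol.
# The values are invented for illustration and are not from either dataset.
array.fc  <- c(AARS = 0.79, ABCD3 = -0.89, ACADM = -1.03, AHCY = 1.15)
rnaseq.fc <- c(AARS = 1.20, ABCD3 = -0.50, ACADM = 0.30, AHCY = 0.90, TP53 = -2.00)

# Symbols reported as DE in both studies (the Venn diagram intersection)
common <- intersect(names(array.fc), names(rnaseq.fc))

# A gene is concordant when its log2FC has the same sign in both studies
concordant   <- sign(array.fc[common]) == sign(rnaseq.fc[common])
n.concordant <- sum(concordant)

# Sign test: is concordance more frequent than the 50% expected by chance?
sign.test <- binom.test(n.concordant, length(common), p = 0.5,
                        alternative = "greater")

print(data.frame(symbol     = common,
                 array      = array.fc[common],
                 rnaseq     = rnaseq.fc[common],
                 concordant = concordant))
print(sign.test$p.value)
```

Matching on gene symbols, as here, silently drops any gene whose symbol differs between the two annotations, so it may be worth checking how many signature genes fail to match.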
Dear Aaron, thank you for your answer and for explaining your rationale! To be honest, one initial thought I had was whether it would be appropriate to construct a forest plot that includes the log2FCs of the 89 DE genes from both studies, along with their confidence intervals. However, I was a bit reluctant, as I have never used such an approach. The other thing I tried was a heatmap of the common gene symbols in the second dataset, which also showed similar expression patterns and separated the cancer and control samples.
Regarding your approach, which sounds very interesting and non-trivial:
1) Do you mean that I should first create, for instance, two gene symbol vectors, one with the up-regulated genes from the signature and the other with the down-regulated genes, along with their respective log2FCs? I mention gene symbols because the RNA-Seq data already has unique gene symbols in its rows.
2) And then run roast twice, once for the up-regulated and once for the down-regulated genes, with the following call:
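For the forest plot idea, the numbers behind each row of such a plot can be sketched as follows. This is a minimal illustration under stated assumptions: the log2FCs and standard errors are invented, the intervals use a normal approximation, and the pooled value is a simple fixed-effect (inverse-variance) estimate, which is one common choice and not necessarily the right one across two different technologies. In limma, per-gene confidence intervals can be requested with topTable(..., confint=TRUE).

```r
# Hypothetical per-gene estimates for one gene in the two studies.
# log2FC and standard errors are invented for illustration.
est <- data.frame(study  = c("microarray", "RNA-Seq"),
                  log2FC = c(0.80, 1.10),
                  se     = c(0.20, 0.10))

# 95% confidence interval for each study (normal approximation)
z <- qnorm(0.975)
est$lower <- est$log2FC - z * est$se
est$upper <- est$log2FC + z * est$se

# Fixed-effect (inverse-variance) pooled estimate across the two studies
w         <- 1 / est$se^2
pooled.fc <- sum(w * est$log2FC) / sum(w)
pooled.se <- sqrt(1 / sum(w))
pooled.ci <- pooled.fc + c(-1, 1) * z * pooled.se

print(est)
print(c(log2FC = pooled.fc, lower = pooled.ci[1], upper = pooled.ci[2]))
```

The more precise study dominates the pooled estimate, which is the intended behaviour of inverse-variance weighting but also means the RNA-Seq dataset, if much larger, would largely determine the result.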
roast(TCGA.eset, index=vec1, design, contrast=, gene.weights=genes.fc.vec1)
# for instance, where vec1 is a character vector of the up-regulated gene symbols and genes.fc.vec1 is a numeric vector of their respective log2FCs?
Gene weights can be positive and negative in the same vector, you don't have to separate up- and down-regulated genes.
Oh, I think I got it: provide a single vector with all the gene symbols as the index, and all of their respective log2FCs in the gene.weights argument, correct? And then, from the proportions of up and down genes, I should interpret the consistency of this signature in the tested dataset in the same way, right?
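One practical point with a single signed weight vector is keeping the weights in the same order as the index. A minimal sketch of that alignment in base R follows; symbols and sig.fc are hypothetical stand-ins for rownames(y) and the named microarray log2FCs, and NOTFOUND is an invented symbol used to show how unmatched genes are dropped.

```r
# Hypothetical stand-ins: 'symbols' plays the role of rownames(y) for the
# RNA-Seq matrix, and 'sig.fc' the named microarray log2FCs of the signature.
symbols <- c("A1BG", "AARS", "ABCD3", "ACADM", "AHCY", "TP53")
sig.fc  <- c(AARS = 0.79, ABCD3 = -0.89, ACADM = -1.03, NOTFOUND = 0.50)

# Row position of each signature gene in the expression matrix (NA if absent)
pos  <- match(names(sig.fc), symbols)
keep <- !is.na(pos)

idx <- pos[keep]      # integer row indices for the 'index' argument
wts <- sig.fc[keep]   # signed weights, in the same order as idx

# Sanity check: each weight must belong to the row it points at
stopifnot(identical(symbols[idx], names(wts)))

# With limma loaded and y, design.2 defined as in the thread, the call would be:
# roast(y = y, index = idx, design = design.2, contrast = 2, gene.weights = wts)
```

Using integer row indices rather than symbols makes the pairing between index and gene.weights explicit, so a reordered or incomplete symbol list cannot silently attach a weight to the wrong gene.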
Dear Aaron,
Sorry to return after some days; this is just a small update, and I would appreciate your comment on the final interpretation. Briefly, based on your comments:
head(vec1)
[1] "AARS" "ABCD3" "ACADM" "ACADS" "ACADVL" "AHCY" # the vector of common DE gene symbols
head(genes.fc.vec1)
[1] 0.7871676 -0.8855676 -1.0332955 -1.2122512 -0.7056447 1.1534264 # the corresponding log2FCs of the above common DE symbols in the microarray dataset
roast(y=y, index=vec1, design=design.2, contrast=2, gene.weights=genes.fc.vec1) # where y is the TCGA RNA-Seq data
         Active.Prop      P.Value
Down               0 1.0000000000
Up                 1 0.0005002501
UpOrDown           1 0.0010000000
Mixed              1 0.0010000000
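A side note on the p-values in this output: they coincide with 1/1999 and 1/1000. Assuming the call ran with 999 rotations (an assumption; the thread does not state the nrot value used), this pattern would be consistent with the observed statistic being more extreme than every rotated one, in which case the p-values reflect the number of rotations rather than the exact strength of the evidence. The arithmetic can be checked as follows:

```r
# P-values copied from the roast output above
p.up    <- 0.0005002501
p.mixed <- 0.0010000000

# Candidate floors: 1/(nrot + 1) and 1/(2 * nrot + 1) with nrot = 999.
# Assumption: nrot = 999 was in effect; check ?roast for the installed limma version.
nrot <- 999
floor.two.sided <- 1 / (nrot + 1)
floor.one.sided <- 1 / (2 * nrot + 1)

print(c(one.sided = floor.one.sided, two.sided = floor.two.sided))
print(isTRUE(all.equal(p.up, floor.one.sided, tolerance = 1e-6)))
print(isTRUE(all.equal(p.mixed, floor.two.sided, tolerance = 1e-6)))
```

If that reading is right, a larger nrot in the roast call would be needed to resolve smaller p-values.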
My quick questions are the following:
1) As the Up proportion is 1, this essentially "validates" the direction of my microarray signature in the RNA-Seq dataset, correct? The Mixed proportion would of course support this as well (as I have both up and down genes).
2) By applying roast in the context of my original question, I essentially provide an alternative, more "sophisticated" validation of my gene signature, in addition to the initial DE analysis of the RNA-Seq dataset?
3) Despite the two different technologies (although the RNA-Seq data was processed in a similar way to microarrays) and the different annotation procedures, roast is still valid, correct? Since I am essentially using the final gene symbols, right?