There are three stages to mogsa: 1) the datasets are projected into the same space; 2) gene set scores are calculated for the genes/proteins on each principal component; 3) the overall score of each gene set per sample is extracted for the selected components.
For part 1), both datasets need to be matrices with matched samples (columns). The features (rows) do not need to match. The features in the new space will then show the covariance/association between ATAC-seq and RNA-seq across the samples.

For part 2), the gene set scores would probably be generated using only the RNA annotation. The gene set annotation is a binary or weight matrix of genes x gene sets, where 1 (or any score > 0) means a gene is in that gene set. If you wish to generate a gene set annotation for the ATAC-seq data, I would include as many associations between chromatin regions and gene sets as possible. If you have no annotation, create an empty (all-zero) matrix with the ATAC features in the rows and the gene sets in the columns. Then only the RNA-seq data will be used to score the gene sets, but the weight of each gene in the space is still determined by both datasets.

For part 3), consider whether any of the PCs are associated with batch effects, and how many PCs provide useful information. Include only the PCs you wish to keep in the final score.

Hope this helps, happy to generate a vignette if you point me to example data.
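To make the annotation step concrete, here is a minimal sketch of how the two annotation matrices described above could be laid out. It is plain Python for illustration only and does not call mogsa itself; the feature IDs, gene set names and the helper `build_annotation` are all made up for the example.

```python
def build_annotation(features, gene_sets):
    """Return a features x gene-sets membership matrix as a list of rows.

    An entry is 1 if the feature belongs to the gene set and 0 otherwise.
    `gene_sets` maps a gene set name to the list of its member features.
    """
    names = list(gene_sets)
    members = {name: set(gene_sets[name]) for name in names}
    return [[1 if f in members[name] else 0 for name in names] for f in features]


# Hypothetical toy identifiers, not taken from any real dataset.
rna_features = ["TP53", "MYC", "GATA1", "ACTB"]
atac_features = ["chr1:1000-1500", "chr8:2000-2600", "chrX:300-900"]
gene_sets = {"APOPTOSIS": ["TP53", "MYC"], "ERYTHROID": ["GATA1"]}

# RNA annotation: binary genes x gene sets matrix.
rna_annot = build_annotation(rna_features, gene_sets)

# ATAC annotation when no region-to-gene-set mapping is available:
# an all-zero matrix with one row per region and the same gene set columns.
atac_annot = [[0] * len(gene_sets) for _ in atac_features]

for feature, row in zip(rna_features, rna_annot):
    print(feature, row)
```

The two matrices share the same gene set columns, and each has one row per feature of its own dataset, in the same row order as that dataset. In R you would build the same thing as ordinary matrices; check the mogsa documentation for how the annotation is passed in.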