Question

mogsa on RNA-seq and ATAC-seq data

meeta.mistry ▴ 30

@meetamistry-7355

United States

Hi,

I am trying to use mogsa to analyze ATAC-seq and RNA-seq data on the same samples. In the vignette example, the data matrices are multiple microarray datasets. If I use ATAC-seq data, what should I use as input? I was thinking a count matrix for a set of consensus peaks across all samples. However, in order to map it to the RNA-seq data presumably we need nearest gene annotations for each peak? In that case we will have multiple peaks mapping to a single gene - is this going to be problematic for mogsa? Is this the correct way to handle this type of data?

Any help is much appreciated.

Thanks,

Meeta

mogsa atacseq • 2.0k views

updated 7.6 years ago by mengchen18 • written 7.7 years ago by meeta.mistry

score 0 · Answer 1 · 2018-06-13

Hi Meeta

There are 3 stages to mogsa: 1) the datasets are projected into the same space; 2) the gene set scores of the genes/proteins are calculated for each principal component; 3) the overall score of each gene set per sample is extracted for the selected components.

Both datasets need to be matrices with matched samples. However, the features (rows) do not need to match. The set of features in the new space will then show the covariance/association between ATAC-seq and RNA-seq for the samples.

For part 2), the gene set scores would probably be generated using only the RNA annotation. The gene set annotation is a binary or weight matrix of genes x gene sets, where 1 (or any score > 0) means a gene is in that gene set. If you wish to generate gene set annotation for the ATAC-seq data, I would include the maximum number of possible associations of chromatin regions to gene sets. If you have no annotation, create an empty (all-zero) matrix with the ATAC features in the rows and the gene sets in the columns. Then only the RNA-seq data will be used to score the gene sets; however, the weight of each gene in the space is determined by both datasets.

For part 3), consider whether any of the PCs are associated with batch effects, and how many PCs provide useful information. Include the PCs you wish to keep for the final score.

Hope this helps, happy to generate a vignette if you point me to example data.
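To make the annotation layout concrete, here is a minimal sketch in plain Python (not the mogsa R API) of a features x gene sets matrix for the RNA data and an all-zero matrix for ATAC peaks that have no annotation. The gene names, peak names, gene sets and the `annotation_matrix` helper are all hypothetical and only illustrate the shape of the input.

```python
# Hypothetical feature names and gene sets, for illustration only.
rna_genes = ["TP53", "MYC", "GAPDH"]
atac_peaks = ["peak_1", "peak_2", "peak_3", "peak_4"]
gene_sets = {
    "apoptosis": {"TP53"},
    "proliferation": {"MYC", "TP53"},
}


def annotation_matrix(features, gene_sets):
    """Return a features x gene-sets 0/1 matrix as a list of rows.

    Columns follow the alphabetical order of the gene set names.
    """
    set_names = sorted(gene_sets)
    return [[1 if f in gene_sets[s] else 0 for s in set_names] for f in features]


# RNA annotation: 1 where the gene belongs to the gene set.
rna_annot = annotation_matrix(rna_genes, gene_sets)

# ATAC annotation when nothing is known about the peaks: all zeros,
# one row per peak and one column per gene set.
atac_annot = [[0] * len(gene_sets) for _ in atac_peaks]

print(rna_annot)   # rows: TP53, MYC, GAPDH
print(atac_annot)  # rows: peak_1 .. peak_4
```

In R, the equivalent objects would be two matrices with the same gene set columns, one per dataset, in the same order as the data matrices.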

Best

Aedin

score 0 · Answer 2 · 2018-06-18

"I was thinking a count matrix for a set of consensus peaks across all samples"

You can use the count data. But to avoid your results being driven by a few rows (chromatin regions) in the ATAC-seq data, it's better to normalize the matrix in advance, e.g. with a log transform or as in (non-symmetric) correspondence analysis.
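As a rough illustration of the log transform option, the sketch below applies log2(count + 1) to a small peaks x samples count matrix in plain Python. The pseudocount of 1, the toy counts and the `log_transform` helper are assumptions for the example, not something mogsa prescribes.

```python
import math


def log_transform(counts, pseudocount=1.0):
    """Apply log2(count + pseudocount) to every entry of a peaks x samples matrix."""
    return [[math.log2(c + pseudocount) for c in row] for row in counts]


# Toy count matrix (hypothetical): one low-count peak, one high-count peak.
counts = [
    [0, 3, 7],
    [1023, 511, 255],
]

logged = log_transform(counts)


def row_range(row):
    """Spread of one peak's values across samples (max minus min)."""
    return max(row) - min(row)


# Before the transform the second peak's spread is over 100 times larger;
# afterwards both peaks vary on a comparable scale.
print([row_range(r) for r in counts])
print([row_range(r) for r in logged])
```

The high-count peak dominates the raw matrix only because of its magnitude, which is the situation the normalization is meant to avoid.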

"However, in order to map it to the RNA-seq data presumably we need nearest gene annotations for each peak? In that case we will have multiple peaks mapping to a single gene - is this going to be problematic for mogsa?"

As Aedin said, there is nothing wrong with this method. You can also consider other annotations, e.g. TF binding regions of genes or any other mechanism related to gene expression.
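One way to picture the nearest-gene approach is to let each peak inherit the gene set membership of the gene it is assigned to, so that several peaks mapping to the same gene simply become several rows with the same annotation. This is a minimal plain-Python sketch; the peak-to-gene assignments, gene sets and the `peak_annotation` helper are hypothetical, and the mapping itself would come from whatever peak annotation tool you use.

```python
# Hypothetical nearest-gene assignment; None marks a peak with no assigned gene.
peak_to_gene = {
    "peak_1": "TP53",
    "peak_2": "TP53",  # two peaks mapping to the same gene
    "peak_3": "MYC",
    "peak_4": None,
}
gene_sets = {
    "apoptosis": {"TP53"},
    "proliferation": {"MYC", "TP53"},
}


def peak_annotation(peak_to_gene, gene_sets):
    """Give each peak the gene set membership of its assigned gene.

    Returns a dict of peak -> 0/1 row, columns in alphabetical gene set order.
    """
    set_names = sorted(gene_sets)
    return {
        peak: [1 if gene in gene_sets[s] else 0 for s in set_names]
        for peak, gene in peak_to_gene.items()
    }


atac_annot = peak_annotation(peak_to_gene, gene_sets)
for peak, row in atac_annot.items():
    print(peak, row)
```

Peaks without an assigned gene end up as all-zero rows, which matches the empty-annotation case described in the first answer.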