Question

Correlation between two different datasets: between results of RNAseq and absence/presence of Type3 Secretion System

keshav.prasad.gubbi • 0

Dear All,

I have a "How would you solve" kind of question. I have two tables: (1) a Log2FoldChange table and (2) an Effectors table.

First, the Log2FoldChange table was obtained by running DESeq2 on 14 different infected samples, each compared to Control. I took the log2 fold change values from DESeq2 for each sample and merged the 14 log2FoldChange columns into a single table keyed on gene (each row is a unique gene). The table is 22414 x 14: 22414 unique genes for 14 different strains.

Second, a presence/absence effector table for all 14 strains, which tells us which effectors are present in each strain (they all have different sets of effectors). This is a 50 x 14 table for the same 14 strains, with each unique effector listed in a row and a 0 or 1 in each strain column indicating absence or presence.

What we want to investigate is: is there a correlation between the presence/absence of effectors and gene expression in the host? Essentially, we would like to obtain the correlation between these two separate datasets.

Any ideas or suggestions on how to approach this problem would be very helpful. My initial idea is to carry out a Canonical Correlation Analysis (CCA), and I am still working on it, but I am open to more ideas and suggestions from the community.

Thanks in advance for your time and suggestions.

RNASeq Correlation DESeq2 • 1.5k views

updated 3.5 years ago by Michael Love • written 3.5 years ago by keshav.prasad.gubbi

score 0 · Answer 1 · 2022-08-15

Re: LFC, if you are using these values for downstream analysis, I'd recommend lfcShrink if you aren't already using it. It shrinks unreliable LFCs and generally improves the type of analysis I think you are doing.

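One matrix you can make is cor(LFC, effector), which should be 22414 x 50: one correlation per gene/effector pair, computed across the 14 strains. (In R, cor() correlates columns, so with strains in the columns of both tables this means cor(t(LFC), t(effector)).)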
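To make the shape of that matrix concrete, here is a minimal NumPy sketch of the same computation (not R, and not code from this thread). It assumes both tables have strains in columns in the same order; the function name `gene_effector_correlation` and the simulated data are hypothetical, and the demo uses 200 genes instead of 22414 to stay small.

```python
import numpy as np

def gene_effector_correlation(lfc, effectors):
    """Pearson correlation of every gene (row of lfc) with every effector
    (row of effectors), computed across the shared strain columns."""
    lfc = np.asarray(lfc, dtype=float)
    effectors = np.asarray(effectors, dtype=float)
    if lfc.shape[1] != effectors.shape[1]:
        raise ValueError("both tables need the same strain columns, in the same order")
    # Centre each row across strains.
    lfc_c = lfc - lfc.mean(axis=1, keepdims=True)
    eff_c = effectors - effectors.mean(axis=1, keepdims=True)
    # Row norms of the centred data (proportional to the standard deviations).
    lfc_norm = np.sqrt((lfc_c ** 2).sum(axis=1, keepdims=True))
    eff_norm = np.sqrt((eff_c ** 2).sum(axis=1, keepdims=True))
    # A row with zero variance (e.g. an effector present in every strain)
    # has no defined correlation and comes out as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (lfc_c / lfc_norm) @ (eff_c / eff_norm).T

# Simulated stand-ins for the real tables (assumption: 14 strains in columns).
rng = np.random.default_rng(0)
lfc = rng.normal(size=(200, 14))               # genes x strains
effectors = rng.integers(0, 2, size=(50, 14))  # effectors x strains, 0/1
corr = gene_effector_correlation(lfc, effectors)
print(corr.shape)
```

Because the effector rows are 0/1, each entry is a point-biserial correlation. With only 14 strains per correlation and over a million gene/effector pairs, large values will appear by chance, so the matrix is better treated as a screening tool than as evidence on its own.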

CCA would tell you which combinations of genes and which combinations of effectors give you a similar distribution of the 14 samples. Because of the high dimension of the genes, you should use a sparse CCA method like MultiCCA (there are many alternatives, but you'll have to research such choices).
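As a rough illustration of what "sparse" buys you here, the sketch below finds one pair of sparse weight vectors by alternating soft-thresholded power iterations on the gene-by-effector cross-product matrix. This is a deliberately simplified stand-in and not the MultiCCA implementation: it only centres the columns (no scaling), and it uses a fixed relative threshold (`frac_x`, `frac_y`, both hypothetical parameters) instead of a tuned L1 constraint. Note the data are transposed relative to the tables in the question: strains are rows.

```python
import numpy as np

def soft_threshold(a, frac):
    """Shrink a toward zero by frac * max|a|; small entries become exactly 0."""
    lam = frac * np.abs(a).max()
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def unit(a):
    """Scale a to unit length (left unchanged if it is all zeros)."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a

def sparse_cca_rank1(X, Y, frac_x=0.5, frac_y=0.5, n_iter=100):
    """X: strains x genes, Y: strains x effectors.
    Returns sparse unit-length weights (u for genes, v for effectors)."""
    if not (0 <= frac_x < 1 and 0 <= frac_y < 1):
        raise ValueError("frac_x and frac_y must lie in [0, 1)")
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    K = X.T @ Y  # genes x effectors cross-product matrix
    # Start from the leading right singular vector of K.
    v = np.linalg.svd(K, full_matrices=False)[2][0]
    for _ in range(n_iter):
        u = unit(soft_threshold(K @ v, frac_x))
        v = unit(soft_threshold(K.T @ u, frac_y))
    return u, v

# Simulated data with a planted signal: effector 0 shifts genes 0..9.
rng = np.random.default_rng(1)
n_strains, n_genes, n_effectors = 14, 300, 20
Y = rng.integers(0, 2, size=(n_strains, n_effectors)).astype(float)
Y[:, 0] = np.tile([0.0, 1.0], n_strains // 2)  # present in every second strain
X = rng.normal(scale=0.3, size=(n_strains, n_genes))
X[:, :10] += 3.0 * Y[:, [0]]

u, v = sparse_cca_rank1(X, Y)
print("genes with non-zero weight:", np.flatnonzero(u))
print("effector with largest weight:", int(np.argmax(np.abs(v))))
```

The sample scores `X @ u` and `Y @ v` are then the two "similar distributions of the 14 samples" that CCA is looking for. For a real analysis the penalty strength needs tuning, for example by permutation, which a dedicated sparse CCA package handles and this sketch does not.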