Question

How to create the colData for a complex experiment

Lluís Revilla Sancho ▴ 760

@lluis-revilla-sancho

European Union

I have a dataset where for the same patient and time we have extracted different samples from different locations.

What would be the best way to encode this in MAE?

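| Patient | Time | Region | State of the region |
|---------|------|--------|---------------------|
| A       | 0    | A      | Healthy             |
| A       | 0    | B      | Injured             |
| A       | 0    | C      | Injured             |
| A       | 24   | A      | Healthy             |
| A       | 24   | B      | Injured             |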

The sampleMap provides a "many-to-one" mapping, but when the phenotypes belong to each sample, how should I store them? I have several variables related to the patient (sex, age at diagnosis, disease, C-reactive protein, treatment followed, antibiotics, ...) and some related mainly to the sample (date of extraction, region extracted, state of that region, endoscopic score of the region, type of sample, ...).

The only way I can think of is to use as ID a combination of patient, time and region for each sample, something like paste(Patient, Time, Region, sep = "_"), but that would duplicate the patient information in order to store the sample information correctly.
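
For illustration, a minimal base R sketch of building such an ID, assuming a hypothetical data frame `samples` that mirrors the rows of the table above (the object and column names are made up for this example). Note that `sep = "_"` yields one ID per row, whereas `collapse = "_"` would glue all rows into a single string.

```r
# Hypothetical sample table mirroring the example rows in the question
samples <- data.frame(
  Patient = c("A", "A", "A", "A", "A"),
  Time    = c(0, 0, 0, 24, 24),
  Region  = c("A", "B", "C", "A", "B"),
  State   = c("Healthy", "Injured", "Injured", "Healthy", "Injured"),
  stringsAsFactors = FALSE
)

# sep = "_" builds one ID per row; collapse = "_" would instead
# merge every row into one long string
samples$ID <- paste(samples$Patient, samples$Time, samples$Region, sep = "_")

# The IDs must be unique before they can be used as row names
stopifnot(!anyDuplicated(samples$ID))
rownames(samples) <- samples$ID
```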

Is there any better solution?

multiassayexperiment colData • 1.8k views

ADD COMMENT • link updated 7.3 years ago by Levi Waldron ★ 1.1k • written 7.3 years ago by Lluís Revilla Sancho ▴ 760

score 0 · Answer 1 · 2017-12-06

Levi Waldron ★ 1.1k

@levi-waldron-3429

CUNY Graduate School of Public Health a…

Hi Lluís - there are different ways you could do this, but if you think of the five rows you showed as five different "biological units", it might make sense to keep them separate in the colData as shown. The main difference from the MAE perspective will be how you interact with the object when subsetting by column and reshaping through wideFormat() etc. If you collapse rows like you suggested, then MAE management functions like mergeReplicates() and duplicated() would treat those five measurements as duplicates. Why do you want to collapse those rows? I would be more inclined to keep them separate as you showed, but maybe I don't understand your motivation for having one row per patient in the colData.
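
As a rough sketch of what "one row per biological unit" could look like, the base R snippet below keeps patient-level and sample-level variables in two separate tables and joins them only when building the colData, so the patient information is maintained in one place even though it is repeated in the final table. All object names, column names and values here are hypothetical.

```r
# Hypothetical patient-level variables (one row per patient)
patients <- data.frame(
  Patient = "A",
  Sex     = "F",  # made-up value, for illustration only
  stringsAsFactors = FALSE
)

# Hypothetical sample-level variables (one row per biological unit)
units <- data.frame(
  Patient = rep("A", 5),
  Time    = c(0, 0, 0, 24, 24),
  Region  = c("A", "B", "C", "A", "B"),
  State   = c("Healthy", "Injured", "Injured", "Healthy", "Injured"),
  stringsAsFactors = FALSE
)

# Repeat the patient-level variables on every unit by joining on Patient
col_data <- merge(units, patients, by = "Patient", sort = FALSE)

# Row names are computed after the join, so they always match their row
rownames(col_data) <- paste(col_data$Patient, col_data$Time,
                            col_data$Region, sep = "_")
```

A table like `col_data` could then be supplied as the `colData` argument of `MultiAssayExperiment()`, with the sampleMap linking each assay column to one of these row names.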

ADD COMMENT • link 7.3 years ago Levi Waldron ★ 1.1k

Hi Levi, I didn't explain myself well, sorry.

I have some samples linked to a location (biopsies from 5 regions) and some that aren't (stools) [or that are always from the same region]. I have two assays for the biopsies (RNA-seq and 16S-seq) and one assay for the stools (16S-seq).

My main goal is to understand the relationship between assays. However, the biopsy regions behave differently, so the relationship between assays could depend on the region. At the same time, it would be interesting to see whether the relationship between biopsies and stools (RNA-seq to 16S-seq, 16S-seq to 16S-seq, or among all the assays) is shared across patients. I was considering having just one row per patient so that I could look at these common relationships between assays.

I hope I have explained myself a bit better. Many thanks

ADD REPLY • link 7.3 years ago Lluís Revilla Sancho ▴ 760

score 0 · Answer 2 · 2017-12-08

I think I understand better now. I guess the biopsies would be labeled by intestinal location, so they are not exchangeable (for example location might be labeled stool, rectum, sigmoid, descending, transverse, ascending, caecum). So I see several potential ways to set up:

1. separate rows in the MultiAssayExperiment colData for each site, with a column specifying location.

2. one row per patient in the colData, with sampling location as a per-experiment colData variable.

3. one row per patient in the colData, with each biopsy site as a different `ExperimentList` element (like a different assay, with assay names reflecting body sites).

I lean towards option 3, which I suspect will allow the simplest syntax for calculating simple correlations. For setting up regressions like 16S ~ RNA-seq + location, option 1 might be simplest. For simple correlations, you might use the `assays()` extractor to give a list of matrices to calculate correlations on with `cor()`. For regressions, I imagine using `wideFormat()` to integrate the assays and colData column for location into a single DataFrame. But if I were the data analyst here, I would probably start with 3 and see if something about it ends up being annoying, and if so think about doing it differently :).
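
A minimal sketch of option 3 with simulated data, assuming one sample per patient and site so that the sampleMap can be inferred from matching column names (with several time points per patient, an explicit sampleMap would be needed instead). The patient IDs, assay names, feature names and values are all hypothetical.

```r
library(MultiAssayExperiment)

set.seed(1)
patients <- c("P1", "P2", "P3", "P4")  # hypothetical patient IDs

# Simulated feature-by-patient matrix; columns are named by patient
make_assay <- function(n_features) {
  matrix(rnorm(n_features * length(patients)), nrow = n_features,
         dimnames = list(paste0("f", seq_len(n_features)), patients))
}

# Option 3: each body site is its own experiment (hypothetical names)
experiments <- list(
  rnaseq_rectum  = make_assay(3),
  rnaseq_sigmoid = make_assay(3),
  s16_stool      = make_assay(2)
)

# One row per patient; row names act as the primary identifiers
col_data <- data.frame(sex = c("F", "M", "F", "M"), row.names = patients)

# No sampleMap given: assay column names match rownames(col_data)
mae <- MultiAssayExperiment(experiments = experiments, colData = col_data)

# assays() returns a list of matrices, one per experiment
mats <- assays(mae)

# Feature-by-feature correlations across patients between two assays,
# indexing columns by patient so both matrices are in the same order
cc <- cor(t(mats[["rnaseq_rectum"]][, patients]),
          t(mats[["s16_stool"]][, patients]))
```

Here `cc` has one row per rectum RNA-seq feature and one column per stool 16S feature; the same pattern applies to any other pair of experiments in the list.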