Question

Help with defining groups

Phinney, Brett ▴ 10

Hi everyone, thanks for the great software! I was wondering if you could give a short code example of defining groups after I read in a DIA-NN report.parquet file. I assume there is an easy way to link the file names to their conditions and replicates?

Cheers

Brett

limpa • 518 views

Updated 6 weeks ago by Gordon Smyth 53k • written 5 months ago by Phinney, Brett ▴ 10

score 0 · Answer 1 · 2025-09-26

Defining sample groups is separate from reading in the DIA-NN data, because the sample conditions and covariates are not necessarily coded into the file names. It is done with ordinary R code rather than by limpa specifically. The same is true for any Bioconductor package that does differential expression or differential abundance analysis, such as limma, edgeR or DESeq2, and you can see lots of examples in the case studies of those packages.

After you read in the DIA-NN peptide quants using limpa:

x <- readDIANN(...)

the sample names extracted from the DIA-NN report are available from colnames(x). If the sample names are informative, then you can often convert them easily into a condition factor.
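As a minimal sketch of that conversion, assuming the sample names follow a `Condition_repN` pattern: the `samples` vector below is hypothetical and stands in for `colnames(x)`, and the patterns passed to `sub()` would need adjusting to whatever naming scheme your files actually use.

```r
# Hypothetical sample names; in a real analysis use: samples <- colnames(x)
samples <- c("Ctrl_rep1", "Ctrl_rep2", "Ctrl_rep3",
             "Treat_rep1", "Treat_rep2", "Treat_rep3")

# Assumption: everything before the first underscore is the condition
condition <- factor(sub("_.*$", "", samples), levels = c("Ctrl", "Treat"))

# Assumption: the digits after "_rep" give the replicate number
replicate <- as.integer(sub("^.*_rep", "", samples))

# Check that the parsing gave the groups you expect
table(condition)
```

Setting `levels` explicitly puts the control group first, so it becomes the reference level in any design matrix built from `condition`.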

In my work, I encourage my collaborators and my lab team to create an Excel spreadsheet giving all the available sample annotation. One column gives the sample file names while the other columns give conditions and covariates. In the limma documentation, this is called the targets data.frame. The design matrix is then created by model.matrix() from the columns of the targets data.frame. My reasoning is that the biologists who prepared the samples must have such a spreadsheet, or its equivalent, as part of their sample preparation. They then create a unique sample ID when passing the sample on to the proteomics lab for mass spectrometry. The targets data.frame then links the sample IDs to the sample annotation.

(The terminology of "targets" comes from the original terminology of "probes" vs "targets" for DNA expression microarrays. The DNA spot on the microarray was a "probe" and the RNA sample was the "target". These days I would probably call it "samples" or "SampleInfo". Back in the early days of limma I was worried about possible confusion between statistical samples and RNA samples, hence used a more specific term for the latter.)
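The targets approach described above might look roughly like this. It is only a sketch: the column names (`SampleID`, `Condition`, `Batch`), the sample IDs and the inline data.frame are all hypothetical. In practice the table would be read from the spreadsheet (for example, saved as CSV and loaded with `read.csv()`), and `samples` would be `colnames(x)`.

```r
# Hypothetical sample annotation, deliberately not in the same order as the data
targets <- data.frame(
  SampleID  = c("S4", "S1", "S3", "S2", "S6", "S5"),
  Condition = c("Treat", "Ctrl", "Ctrl", "Ctrl", "Treat", "Treat"),
  Batch     = c("B", "A", "B", "A", "B", "A")
)

# Stand-in for colnames(x)
samples <- paste0("S", 1:6)

# Reorder the annotation rows to match the sample order in the data,
# stopping if any sample has no annotation
m <- match(samples, targets$SampleID)
stopifnot(!anyNA(m))
targets <- targets[m, ]

# Make the control group the reference level
targets$Condition <- factor(targets$Condition, levels = c("Ctrl", "Treat"))

# Design matrix built from the targets columns
design <- model.matrix(~ Batch + Condition, data = targets)
colnames(design)
```

Matching on the sample ID, rather than relying on the spreadsheet being in the same order as the data columns, avoids silently assigning samples to the wrong group.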

The expression objects created by limpa optionally contain sample annotation in the targets component, which is a data.frame with one row for each sample.