Hi everyone, thanks for the great software! I was wondering if you could give a short code example of defining groups after I read in a DIA-NN report.parquet file. I assume there is an easy way to link the file names to their conditions and replicates?
Defining sample groups is separate from reading in the DIA-NN data, because the sample conditions and covariates are not necessarily coded into the file names. It is done with ordinary R code rather than by limpa specifically. The same is true of any Bioconductor package for differential expression or differential abundance analysis, such as limma, edgeR or DESeq2, and you can see lots of examples in the case studies of those packages.
After you read in the DIA-NN peptide quants using limpa:
x <- readDIANN(...)
the sample names extracted from the DIA-NN report are available from colnames(x). If the sample names are informative, then you can often convert them easily into a condition factor.
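As a minimal sketch of that conversion, assuming the sample names follow a condition_repN pattern (the names below are made up for illustration and stand in for colnames(x); your own names will need a different pattern):

```r
# Hypothetical sample names, standing in for colnames(x)
samples <- c("WT_rep1", "WT_rep2", "WT_rep3", "KO_rep1", "KO_rep2", "KO_rep3")

# Condition is the text before the first underscore.
# Setting levels explicitly makes WT the reference level.
condition <- factor(sub("_.*$", "", samples), levels = c("WT", "KO"))

# Replicate number is whatever follows "_rep"
replicate <- as.integer(sub("^.*_rep", "", samples))

# Check the result before using it
table(condition)
```

It is worth printing the factor alongside the original names to confirm that the pattern matched as intended, because real file names are often less regular than this.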
In my work, I encourage my collaborators and my lab team to create an Excel spreadsheet giving all the sample annotation available. One column will give the sample file names while the other columns will give conditions and covariates.
In the limma documentation, this is called the targets data.frame. The design matrix is then created by model.matrix() from the columns of the targets data.frame.
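For example, here is a small sketch of a targets data.frame and the corresponding design matrix. The column names (Sample, Condition, Batch) and the values are hypothetical; in practice the table would be read from the spreadsheet, for instance with read.csv() after exporting it to CSV.

```r
# Hypothetical targets table with one row per sample
targets <- data.frame(
  Sample    = c("S1", "S2", "S3", "S4", "S5", "S6"),
  Condition = c("WT", "WT", "WT", "KO", "KO", "KO"),
  Batch     = c("A", "B", "A", "B", "A", "B")
)

# Convert annotation columns to factors, choosing the reference level explicitly
targets$Condition <- factor(targets$Condition, levels = c("WT", "KO"))
targets$Batch <- factor(targets$Batch)

# Design matrix with a batch covariate and the condition effect
design <- model.matrix(~ Batch + Condition, data = targets)
colnames(design)
```

With this parametrization, the ConditionKO coefficient is the KO vs WT difference adjusted for batch.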
My reasoning is that the biologists who prepared the samples must have such a spreadsheet, or its equivalent, as part of their sample preparation. They then create a unique sample ID when passing each sample on to the proteomics lab for mass spectrometry. The targets data.frame then links the sample IDs to the sample annotation.
The expression objects created by limpa optionally contain sample annotation in the targets component, which is a data.frame with one row for each sample.
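The main thing to get right is that the rows of targets are in the same order as the columns of the data. Here is a rough sketch of that alignment. To keep it self-contained, x below is a plain list with a made-up matrix, standing in for the object returned by readDIANN(), and the Sample and Condition columns are hypothetical. With a real object you would use colnames(x) directly, and storing the result as x$targets assumes the targets component described above.

```r
# Stand-in for the object from readDIANN(): a matrix with samples as columns
x <- list(E = matrix(0, nrow = 2, ncol = 4,
                     dimnames = list(c("pep1", "pep2"),
                                     c("S3", "S1", "S4", "S2"))))

# Hypothetical sample annotation, in spreadsheet order
targets <- data.frame(
  Sample    = c("S1", "S2", "S3", "S4"),
  Condition = c("WT", "WT", "KO", "KO")
)

# Reorder the rows of targets to match the column order of the data,
# stopping if any sample has no annotation
m <- match(colnames(x$E), targets$Sample)
stopifnot(!anyNA(m))
targets <- targets[m, ]
rownames(targets) <- targets$Sample

# Keep the annotation with the data
x$targets <- targets
```

Matching by sample ID rather than relying on row order avoids silently assigning conditions to the wrong samples.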