Question: What is colData? How do I make one?
27 days ago by
mjrarcher0 wrote:

Hello,

First off, I just want to say I've never used DESeq2 before and I'm new to R. I have a counts.htseq file that I created with none of the mentioned tools. I simply used bash to aggregate the gene counts of each of my samples into one file, which I've called counts.htseq.

Now, I thought it would be a breeze to run DESeq2, but the first thing I noticed, before even running the first line of code, is that I need a sample information table or "coldata". The documentation does not explain what that means or how I can generate one applicable to my counts file.

So, what is this "coldata" object, what kind of sample information is it supposed to contain, and how do I make it? The documentation assumes that this is clear, but it's not.

My counts file has 118 samples and expression values (read counts) for thousands of genes. Please see the image of counts.htseq2 below. I'd appreciate any help in this regard.

[Image of the counts file, hosted on Imgur]

modified 24 days ago by swbarnes250 • written 27 days ago by mjrarcher0
27 days ago by
Michael Love (20k)
United States
Michael Love wrote:

In the DESeq2 vignette we describe colData as a table of sample information.

The vignette has lots of information, but if you're brand new to RNA-seq analysis we also recommend reading the workflow, which goes at a slower pace. See for example this section:

http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-deseqdataset-object-sample-information-and-the-design-formula

Hi Michael,

Thanks for the reply. I've looked at the vignette, but it's still not clear to me. It puts a lot of emphasis on SummarizedExperiment objects, which apparently work with the colData function. However, I'm using a counts file, which I'm reading with "read.table" in R. The documentation says it's possible to use a counts matrix or an htseq_counts_file, but it doesn't say how I'm supposed to generate a coldata file from that. When I try coldata <- colData(counts_file), I just get an error. Am I supposed to create this coldata file myself instead? If so, what do I need to provide? I'm trying to identify differential gene expression between samples that are sequenced from tumors and samples sequenced from culture.


Quoting from the link I sent:

“However, when you work with your own data, you will have to add the pertinent sample / phenotypic information for the experiment at this stage. We highly recommend keeping this information in a comma-separated value (CSV) or tab-separated value (TSV) file, which can be exported from an Excel spreadsheet, and then assign this to the colData slot, making sure that the rows correspond to the columns of the SummarizedExperiment.”

Thank you, Michael. I really appreciate your help. It sounds like, in my case, I'd only have to include a second column in my 'coldata' listing my three conditions ('tumor', 'culture', 'pdx') so that each one corresponds to its sample, correct? Also, just to be sure, does it matter if these conditions are not sorted, as long as they are listed in the order of the columns? I ask because my samples are listed alphabetically, so the conditions are dispersed (tumor, pdx, pdx, culture, culture, tumor, ...).

The only thing that matters is that each row of colData matches the corresponding column of the counts. The first row corresponds to the first column, the second row corresponds to the second column, and so on.

We say as much in the vignette text, quoted here:

“It is absolutely critical that the columns of the count matrix and the rows of the column data (information about samples) are in the same order. DESeq2 will not make guesses as to which column of the count matrix belongs to which row of the column data, these must be provided to DESeq2 already in consistent order.”
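A minimal sketch of that ordering check in base R. The sample names (S1 to S4), the count values, and the "condition" column are made up for illustration, and the inline CSV text stands in for a file you would export from a spreadsheet:

```r
# Toy count matrix: genes in rows, samples in columns (hypothetical values).
counts <- matrix(
  c(10, 0, 5, 3,
    20, 8, 1, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("S1", "S2", "S3", "S4"))
)

# Sample table as it might come from a CSV export. Note that the rows
# are NOT in the same order as the count columns.
csv_text <- "sample,condition
S3,culture
S1,tumor
S4,pdx
S2,pdx"
coldata <- read.csv(text = csv_text, row.names = "sample",
                    stringsAsFactors = TRUE)

# Every count column must have a row in the sample table.
stopifnot(all(colnames(counts) %in% rownames(coldata)))

# Reorder the sample table so that row i describes count column i.
coldata <- coldata[colnames(counts), , drop = FALSE]

# Final check before handing both objects to DESeq2.
stopifnot(identical(rownames(coldata), colnames(counts)))
```

The conditions themselves do not need to be sorted or grouped; only the row-to-column correspondence is checked here.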

27 days ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson (7.0k) wrote:

The DESeqDataSet used by DESeq2 is a subclass of SummarizedExperiment, which is what provides rowData and colData. You should read more about SummarizedExperiment objects here: https://www.bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html

Briefly, colData is a data frame containing metadata about each sample. It should contain a sample identifier as well as any relevant experimental factors (e.g. treatment/control, cell type, tissue, etc.).
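As a rough illustration of that description, a sample table for a plain count matrix can be built by hand as an ordinary data frame. The sample names, counts, and condition labels below are made up, and the DESeq2 step runs only if the package is installed:

```r
# Hypothetical count matrix: 3 genes x 6 samples of random integer counts.
set.seed(1)
counts <- matrix(rpois(18, lambda = 50), nrow = 3,
                 dimnames = list(paste0("gene", 1:3),
                                 paste0("sample", 1:6)))

# One row per sample, in the same order as the count columns.
# The row names serve as the sample identifiers.
coldata <- data.frame(
  condition = factor(c("tumor", "pdx", "pdx", "culture", "culture", "tumor")),
  row.names = colnames(counts)
)

# Rows of the sample table must line up with the count columns.
stopifnot(identical(rownames(coldata), colnames(counts)))

# Pass both to DESeq2 (skipped when DESeq2 is not installed).
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ condition)
}
```

No SummarizedExperiment has to exist beforehand in this sketch; the count matrix and the data frame are the only inputs.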

Hi Ryan,

But I'm not using SummarizedExperiment objects. I already have a counts file, and according to the manual you can use DESeq2 with any of a few different input options, but they all have that colData in common.

24 days ago by
swbarnes250
swbarnes250 wrote:

Since you are new, I strongly recommend that you find a tutorial with example data and walk through it, stopping to examine what you've got at every step, so you understand what's going on. Walk through a few different tutorials, with their data and with yours.

But yes, you need colData. That's the part where you tell the software which samples are controls and which ones aren't, among other things.