Question

What is colData? How do I make one?

mjrarcher ▴ 10

@mjrarcher-18313

Hello,

First off, I just want to say I've never used DESeq2 before and I'm new to R. I have a counts.htseq file that I created with none of the mentioned tools. I simply used bash to aggregate the gene counts of each of my samples into one file, which I've called counts.htseq.

Now, I thought it would be a breeze to run DESeq2, but the first thing I noticed, before even running the first line of code, is that I need a sample information table, or "coldata". The documentation does not explain what that means or how I can generate one applicable to my counts file.

So, what is this "coldata" object, what kind of sample information is it supposed to contain, and how do I make it? The documentation assumes that this is clear, but it's not.

My counts file has 118 samples and expression values (read counts) for thousands of genes. Please see the image of counts.htseq below. I'd appreciate any help in this regard.

[Image: screenshot of the counts file, hosted on Imgur]

deseq2 coldata • 42k views

ADD COMMENT • link updated 5.4 years ago by swbarnes2 ★ 1.3k • written 5.5 years ago by mjrarcher ▴ 10

score 1 · Answer 1 · 2018-11-13

Michael Love 41k

@mikelove

United States

In the DESeq2 vignette, we describe colData as a table of sample information.

The vignette has lots of information but if you’re brand new to RNA-seq analysis we also recommend reading the workflow which goes at a slower pace. See for example this section:

http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#the-deseqdataset-object-sample-information-and-the-design-formula

ADD COMMENT • link 5.5 years ago Michael Love 41k

Hi Michael,

Thanks for the reply. I've looked at the vignette, but it's still not clear to me. It puts a lot of emphasis on SummarizedExperiment objects, which apparently work with the colData function. However, I'm using a counts file, which I'm reading with "read.table" in R. The documentation says it's possible to use a counts matrix or an htseq-count file, but it doesn't say how I'm supposed to generate a coldata file from that. When I try coldata <- colData(counts_file), I just get an error. Am I supposed to create this coldata file myself instead? If so, what do I need to provide? I'm trying to identify differential gene expression between samples sequenced from tumors and samples sequenced from culture.

ADD REPLY • link 5.5 years ago mjrarcher ▴ 10

Quoting from the link I sent:

“However, when you work with your own data, you will have to add the pertinent sample / phenotypic information for the experiment at this stage. We highly recommend keeping this information in a comma-separated value (CSV) or tab-separated value (TSV) file, which can be exported from an Excel spreadsheet, and then assign this to the colData slot, making sure that the rows correspond to the columns of the SummarizedExperiment.”
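
As a minimal sketch of that advice, the sample table can be read from a CSV file into a data frame and passed as colData. The sample names, conditions, and count values below are invented for illustration, and the CSV is written to a temporary file only so the example is self-contained; a real sample sheet would be exported from a spreadsheet.

```r
# Hypothetical sample sheet: one row per sample, one column per factor.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "sample,condition",
  "S1,tumor",
  "S2,pdx",
  "S3,culture",
  "S4,tumor"
), csv_path)

# Read the sample table, using the first column as row names.
coldata <- read.csv(csv_path, row.names = 1, stringsAsFactors = FALSE)
coldata$condition <- factor(coldata$condition)

# Toy count matrix: genes in rows, samples in columns (values are made up).
counts <- matrix(
  c(10L, 0L, 5L, 3L, 20L, 7L, 0L, 1L, 15L, 2L, 8L, 4L),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("S1", "S2", "S3", "S4"))
)

# Rows of coldata must line up with columns of counts.
stopifnot(identical(rownames(coldata), colnames(counts)))

# Build the DESeqDataSet only if DESeq2 is installed.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData = coldata,
    design = ~ condition
  )
}
```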

ADD REPLY • link 5.5 years ago Michael Love 41k

Thank you, Michael. I really appreciate your help. It sounds like, in my case, I'd only have to include a second column in my 'coldata' listing my three conditions ('tumor', 'culture', 'pdx') so that each one corresponds to its sample, correct? Also, just to be sure, does it matter if these conditions are not sorted, as long as they are listed in the order of the columns? I ask because my samples are listed alphabetically, so the conditions are dispersed like (tumor, pdx, pdx, culture, culture, tumor, ...).

ADD REPLY • link 5.5 years ago mjrarcher ▴ 10

The only thing that matters is that each row of colData matches the corresponding column of the counts matrix: the first row corresponds to the first column, the second row to the second column, and so on.

We say as much in the vignette text, quoted here:

“It is absolutely critical that the columns of the count matrix and the rows of the column data (information about samples) are in the same order. DESeq2 will not make guesses as to which column of the count matrix belongs to which row of the column data, these must be provided to DESeq2 already in consistent order.”
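
One way to check and enforce that ordering in base R is sketched below. The sample names and values are hypothetical; the point is that `match()` reorders the rows of the sample table to follow the columns of the count matrix, and `identical()` confirms the result.

```r
# Toy counts matrix with columns in alphabetical sample order (made-up values).
counts <- matrix(
  1:12, nrow = 3,
  dimnames = list(paste0("gene", 1:3), c("A_pdx", "B_tumor", "C_culture", "D_tumor"))
)

# Hypothetical sample table that happens to be in a different order.
coldata <- data.frame(
  condition = factor(c("tumor", "culture", "tumor", "pdx")),
  row.names = c("B_tumor", "C_culture", "D_tumor", "A_pdx")
)

# Every sample must appear in both objects.
stopifnot(setequal(rownames(coldata), colnames(counts)))

# Reorder the rows of coldata to follow the columns of counts.
coldata <- coldata[match(colnames(counts), rownames(coldata)), , drop = FALSE]

# Confirm the orders now agree before handing both objects to DESeq2.
stopifnot(identical(rownames(coldata), colnames(counts)))
```

The conditions themselves do not need to be sorted or grouped; only the row-to-column correspondence matters.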

ADD REPLY • link 5.5 years ago Michael Love 41k

score 0 · Answer 2 · 2018-11-13

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Scripps Research, La Jolla, CA

The DESeqDataSet used by DESeq2 is a subclass of SummarizedExperiment, which is what provides rowData and colData. You should read more about SummarizedExperiment objects here: https://www.bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html

Briefly, colData is a data frame containing metadata about each sample. It should contain a sample identifier as well as any relevant experimental factors (e.g. treatment/control, cell type, tissue, etc.).
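
For someone starting from a plain counts table rather than a SummarizedExperiment, such a data frame can simply be written by hand. The sketch below assumes a file layout with gene IDs in the first column and one column per sample; the sample names, conditions, and counts are invented, and the inline text stands in for a call like `read.table("counts.htseq", header = TRUE, row.names = 1)`.

```r
# Stand-in for reading the counts file; the layout is an assumption.
counts_df <- read.table(
  text = "gene s1 s2 s3 s4
geneA 10 0 5 3
geneB 20 7 0 1
geneC 15 2 8 4",
  header = TRUE, row.names = 1
)
counts <- as.matrix(counts_df)

# One row per sample: an identifier plus each experimental factor,
# listed in the same order as the columns of the count matrix.
coldata <- data.frame(
  sample = colnames(counts),
  condition = factor(c("tumor", "pdx", "culture", "tumor")),
  row.names = colnames(counts),
  stringsAsFactors = FALSE
)

print(coldata)
```

Note that `colData()` is an accessor for objects that already carry sample information, so it is not expected to work on a bare data frame or matrix of counts.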

ADD COMMENT • link 5.5 years ago Ryan C. Thompson ★ 7.9k

Hi Ryan,

But I'm not using SummarizedExperiment objects. I already have a counts file, and according to the manual, DESeq2 accepts several different input types, but they all have colData in common.

ADD REPLY • link 5.5 years ago mjrarcher ▴ 10

score 0 · Answer 3 · 2018-11-16

Since you are new, I strongly recommend that you find a tutorial with example data and walk through it, stopping to examine what you've got at every step so you understand what's going on. Walk through a few different tutorials, with their data and with yours.

But yes, you need colData. That's the part where you tell the software which samples are controls and which ones aren't, among other things.
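
A small sketch of how the sample table can mark which group is the baseline: R orders factor levels alphabetically by default, and the first level is treated as the reference. Treating "culture" as the reference here is purely an assumption for illustration, as are the sample names.

```r
# Hypothetical sample table with three conditions.
coldata <- data.frame(
  condition = c("tumor", "pdx", "culture", "tumor"),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# Convert to a factor and choose the reference (baseline) level explicitly.
coldata$condition <- relevel(factor(coldata$condition), ref = "culture")

# The first level is the one other groups are compared against.
print(levels(coldata$condition))
```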