Question: Determining gene expression levels with DESeq2 in a design with no condition
11 months ago by
robles.daniela0 wrote:

Hello all,

I am analysing RNA-Seq data generated by the TCGA-SKCM project. After asking which of the metrics released by the TCGA project (HTSeq-Counts, FPKM, or FPKM-UQ) is best to use, I was advised to use HTSeq-Counts (here) and to process them further with DESeq2.

My objective is to incorporate the expression levels of two genes into a linear model to find significant predictors of mutation type and number in TCGA-SKCM tumours. To do this, I downloaded the HTSeq-Counts data for all 271 tumours we have in the model and built a matrix manually (with Perl) as input for DESeq2, with samples in columns and genes in rows, following this very comprehensive tutorial. (For this step, I ignored all reads marked as __no_feature, __ambiguous, __too_low_aQual, __not_aligned, or __alignment_not_unique.) I also generated the sample information file, with the columns used in the tutorial (sample, condition, type).

DESeq2 seems to work: it reads the input files well and runs the DESeq function with no errors. My question is really about how to do a "no design" analysis. I am not interested in contrasting groups (I don't have treated vs. untreated, nor tumours vs. normals; I just have a bunch of tumours). I only want a value for the expression level of the two genes I am interested in (for example, to find out whether there are tumours with a high expression level of these genes) that is comparable across the tumours in this set, so that I can extract it and input it into our linear model.

So far I have tried to accomplish this by putting "1" in the condition column in the sample information file, and by running DESeq like

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ 1)

However, I'd like to know whether this is the right way to do it, and which per-tumour value for a gene you would use that is comparable across samples.

Thanks so much in advance for any help,

Daniela

modified 11 months ago • written 11 months ago by robles.daniela0
11 months ago by
Simon Anders (Zentrum für Molekularbiologie, Universität Heidelberg) wrote:

If you only want to know the two genes' expression levels, what do you need DESeq for? You can read it off right from your FPKM table, or you use DESeq only for normalizing the count table.

Maybe explain in more detail what you want to regress on what.
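
For the "normalizing only" route mentioned above, here is a minimal sketch in base R of the median-of-ratios size-factor calculation that DESeq2 uses for count normalization. The count matrix and the gene and tumour names are made up for illustration and are not from the TCGA-SKCM data. With a real `DESeqDataSet`, `estimateSizeFactors(dds)` followed by `counts(dds, normalized = TRUE)` should give the same kind of table without any model fitting.

```r
# Toy count matrix: genes in rows, samples in columns.
# All values and names are invented for illustration.
cts <- matrix(
  c(100, 200, 400,
     50, 100, 200,
     10,  20,  40,
      0,   5,   3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("tumour1", "tumour2", "tumour3"))
)

# Per-gene log geometric mean across samples.
# Genes with a zero count in any sample give -Inf and are skipped below.
log_geo_means <- rowMeans(log(cts))

# Size factor per sample: median ratio of its counts to the geometric means.
size_factors <- apply(cts, 2, function(col) {
  keep <- is.finite(log_geo_means) & col > 0
  exp(median(log(col[keep]) - log_geo_means[keep]))
})

# Normalized counts: divide each column by its size factor.
norm_cts <- sweep(cts, 2, size_factors, "/")

# Log-scale values for the genes of interest, one row per tumour,
# ready to be merged with other per-tumour variables.
genes_of_interest <- c("geneA", "geneB")  # hypothetical gene IDs
log_expr <- log2(norm_cts[genes_of_interest, , drop = FALSE] + 1)
predictors <- as.data.frame(t(log_expr))
print(size_factors)
print(predictors)
```

The log2(x + 1) transform is just one simple choice here; DESeq2's `vst()` is another option for getting values on a log-like scale.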

11 months ago by
robles.daniela0 wrote:

Hello Simon,

Thanks for your answer. I am new to RNA-Seq expression analysis (you might have figured that out already!) and initially I thought I could download the FPKM-UQ values straight from the TCGA project and use them with no further processing, since after reading papers and tutorials I understood these to be directly comparable across samples. However, I was advised not to do this, and to do the downstream analysis myself starting from HTSeq-Counts (I asked this specifically here). As HTSeq-Counts are not directly comparable between samples, what I am trying to do now is normalise these counts to extract a comparable expression value for a couple of genes that I want to integrate into the linear model.

This model (if you're interested, it's here) regresses the number and type (C>A, C>T, etc.) of mutations in tumours on a number of clinical variables, including age, gender, sample origin, ulceration presence, etc. A collaborator has now asked me whether the expression levels of these two genes would also be significant predictors of mutation number and type, and so I want to include them among the predictors.

I hope this makes sense? Thank you for your help.

Daniela
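
As a rough illustration of the regression described above, the sketch below adds two per-tumour expression columns to a linear model of mutation count. Everything in it is an assumption made for illustration: the data are simulated, the column names (`n_mut`, `age`, `expr_gene1`, `expr_gene2`) are hypothetical, and modelling log10 mutation count with `lm()` is only one possible choice, since the thread does not specify the actual model.

```r
set.seed(1)
n <- 271  # number of tumours mentioned in the thread

# Simulated per-tumour data; none of these values are real.
tumours <- data.frame(
  n_mut      = rpois(n, lambda = 300) + 1,     # total mutation count
  age        = round(runif(n, 30, 85)),        # clinical covariate
  expr_gene1 = rnorm(n, mean = 8, sd = 1.5),   # normalized log expression
  expr_gene2 = rnorm(n, mean = 5, sd = 2)
)

# Linear model with the expression values as additional predictors.
fit <- lm(log10(n_mut) ~ age + expr_gene1 + expr_gene2, data = tumours)

# Coefficient table: estimate, standard error, t value, p-value.
print(summary(fit)$coefficients)
```

The expression columns would come from a normalized, log-scale table with one value per gene per tumour, matched to the clinical data by sample ID.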