I am analysing RNA-Seq data generated by the TCGA-SKCM project. After asking which of the metrics released by the TCGA project (HTSeq-Counts, FPKM, or FPKM-UQ) is best to use, I was advised to use HTSeq-Counts (here) and to process them further with DESeq2.
My objective is to incorporate the expression levels of two genes into a linear model to find significant predictors of mutation type and number in TCGA-SKCM tumours. To do this, I downloaded the HTSeq-Counts data for the 271 tumours we have in the model and built a matrix manually (with Perl) to input into DESeq2, with samples in columns and genes in rows, following this very comprehensive tutorial. (For this step, I ignored the summary rows labelled __no_feature, __ambiguous, __too_low_aQual, __not_aligned, and __alignment_not_unique.) I also generated the sample information file, with the same columns as in the tutorial (sample, condition, type).
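For reference, the matrix-building step can be sketched in base R as well (I did it in Perl, so this is only an equivalent sketch). It assumes each HTSeq-count file is tab-separated with two columns, gene ID and count, and no header. The function name `build_count_matrix` and its arguments are made up for illustration.

```r
# Sketch: build a genes x samples count matrix from per-sample HTSeq-count files.
# `files` and `sample_ids` are hypothetical inputs: file paths and matching sample names.
build_count_matrix <- function(files, sample_ids) {
  stopifnot(length(files) == length(sample_ids))

  per_sample <- lapply(files, function(f) {
    x <- read.delim(f, header = FALSE, col.names = c("gene", "count"),
                    stringsAsFactors = FALSE)
    # Drop the HTSeq summary rows (__no_feature, __ambiguous, ...).
    x <- x[!startsWith(x$gene, "__"), ]
    setNames(x$count, x$gene)
  })

  # Use the gene order of the first file and check every file has the same genes.
  genes <- names(per_sample[[1]])
  same_genes <- vapply(per_sample,
                       function(v) identical(sort(names(v)), sort(genes)),
                       logical(1))
  stopifnot(all(same_genes))

  # One column per sample, rows aligned by gene ID.
  cts <- do.call(cbind, lapply(per_sample, function(v) v[genes]))
  colnames(cts) <- sample_ids
  storage.mode(cts) <- "integer"
  cts
}
```

The result has genes in rows and samples in columns, so it can be passed as `countData` as long as the sample information file lists the samples in the same order as the columns.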
DESeq2 seems to work: it reads the input files and runs the DESeq function with no errors. My question is really about how to do a "no design" analysis. I am not interested in contrasting groups (I don't have treated vs. untreated, nor tumours vs. normals; I just have a set of tumours). I only want a value for the expression level of each of the two genes I am interested in that is comparable across the tumours in this set (for example, to find out whether there are tumours with high expression of these genes), so that I can extract it and input it into our linear model.
So far I have tried to accomplish this by putting "1" in the condition column of the sample information file and by building the DESeq2 dataset like this:
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~1)
However, I'd like to know whether this is the right way to do this, and which per-tumour value for a gene you would use so that it is comparable across samples.
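To make the question concrete, here is a minimal sketch of the kind of extraction I have in mind. It assumes `cts` and `coldata` as described above. The function name `comparable_expression` and the `genes` argument (row names of the two genes of interest) are made up for illustration. I am not sure whether either of the two outputs is the right value to use.

```r
# Sketch only: assumes `cts` is a genes x samples integer matrix and `coldata`
# has one row per sample, in the same order as the columns of `cts`.
library(DESeq2)

comparable_expression <- function(cts, coldata, genes) {
  # Intercept-only design: no groups are contrasted.
  dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ 1)

  # Size factors account for differences in sequencing depth between samples.
  dds <- estimateSizeFactors(dds)
  norm_counts <- counts(dds, normalized = TRUE)

  # Variance-stabilizing transformation; blind = TRUE ignores the design.
  vsd <- vst(dds, blind = TRUE)

  # Return both candidate values for the requested genes, one column per tumour.
  list(
    normalized = norm_counts[genes, , drop = FALSE],
    vst        = assay(vsd)[genes, , drop = FALSE]
  )
}
```

Each element of the returned list is a matrix with the two genes in rows and the tumours in columns, so either could be joined to the rest of the predictors by sample name.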
Thanks so much in advance for any help,