People in my lab have been generating some RNA-seq data. I don’t want to perform differential gene expression analyses on these data. Instead, I want to make gene profile visualisations, heatmaps, clustering… but no DE analyses.
I’ve been using DESeq2 for a long time, but I’m wondering which normalisation strategy I should use to produce data well suited to these goals. My biggest question right now is about gene length bias and the comparison of different genes with each other.
Here is what I’m planning to do to obtain my normalised transcript counts, using salmon (with all the bias correction options activated), tximport and DESeq2:
# Import the salmon output
dir <- "./count_data_salmon"
list_id <- list.files(dir)
files <- file.path(dir, list_id, "quant.sf")
names(files) <- list_id

# Check that all the files exist
all(file.exists(files))

# Use tximport to import the data.
# I'm working on a non-model species, and every annotated gene has only one
# annotated transcript, so I'm not summarising transcripts to gene level.
library(tximport)
txi <- tximport(files, type = "salmon", txOut = TRUE, readLength = 150)

# Use DESeq2 for normalisation only
library(DESeq2)
sampleTable <- read.csv("file_list", sep = ",", row.names = 1)
rownames(sampleTable) <- colnames(txi$counts)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~ organ)

# Here I have two possibilities.
# First (basic DESeq2 usage):
dds <- DESeq(dds)
normalised_count <- counts(dds, normalized = TRUE)
write.table(normalised_count, "Normalized_count.csv")

# Or, second, with FPKM:
fpkm_values <- fpkm(dds)
write.table(fpkm_values, "fpkm.csv")
My questions are:
Before this, I was using hisat2 + stringtie, followed by a standard DESeq2 workflow to get the normalised values and do my analyses. Do you think the steps above are better for producing normalised count data suited to my needs?
Do you think the data produced by the steps above are appropriate for what I plan to do with them? If I understood correctly what I read, when data are imported with tximport, DESeq2 takes care of gene length correction, which is not the case with a standard workflow using hisat2 + stringtie. I think that for clustering and heatmap visualisation it is important to correct for differences in length, and I was not doing that before...
And do you think it is useful to use the fpkm function here?
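To make the length concern concrete, here is a toy sketch in base R of what I mean. The numbers are made up (they are not from my data), the gene names and lengths are hypothetical, and the TPM-like scaling is only an illustration of length correction in general, not what DESeq2 or tximport do internally:

```r
# Toy example with made-up numbers: two genes expressed at the same level
# per molecule, but geneB is four times longer than geneA, so it collects
# four times as many reads.
counts <- c(geneA = 100, geneB = 400)      # hypothetical read counts
length_kb <- c(geneA = 1, geneB = 4)       # hypothetical transcript lengths (kb)

# Reads per kilobase: dividing by length removes the length effect
rpk <- counts / length_kb

# TPM-like scaling: rescale so that the values sum to one million
tpm <- rpk / sum(rpk) * 1e6

print(rpk)
print(tpm)
```

On the raw counts geneB looks four times more expressed than geneA, while after dividing by length the two genes come out equal, which is the behaviour I would like to have when comparing different genes in a heatmap.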
Thank you in advance. Please tell me if you need more details on any aspect of my questions.