Hello,
Peoples from my lab have been generating some RNA-seq data. I don’t really want to perform differential gene expression analyses on these data. Instead, I want to make some gene profile visualisation, heatmaps, clustering… But no DE analyses.
I’ve been using DESeq2 for a long time, but I’m wondering about which normalisation strategy I should use in order to produce data well suited to do what I want. The biggest question I have right now is about the gene length bias and the comparison of different genes.
Here is what i’m planning to do to obtain my normalised transcript count, using salmon (with all the bias correction options activated), tximport and deseq2 :
# Importing the output of salmon
dir <- "./count_data_salmon"
list_id=as.factor(list.files(dir))
files <- file.path(dir, list_id, "quant.sf")
names(files)=list_id
# Check if all the files exists
all(file.exists(files))
###Using tximport to import the data
### I'm working on a non model species, and all the annoted genes have only one annoted transcript. That is why i'm not using the stuff to sumarize transcript to gene level.
library(tximport)
txi=tximport(files, type = "salmon", txOut=TRUE, readLength=150)
##Using DESeq2 for normalisation only
library(DESeq2)
sampleTable <- read.csv("file_list", sep=",", row.names=1)
rownames(sampleTable) <- colnames(txi$counts)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~ organ)
######Here I have 2 possibilities :
###First (basic DESeq2 usage) :
dds <- DESeq(dds)
normalised_count <- counts(dds, normalized = TRUE)
write.table(normalised_count , "Normalized_count.csv")
###Or, second, with fpkm
fpkm=fpkm(dds)
write.table(fpkm , "fpkm.csv")
My questions are :
Before that, I was using hisat2+stringtie, and then I was just doing a standard DESeq2 workflow to get the normalized values and do my things. Do you thing that the steps above are better to output normalized count data suited to my needs ?
Do you thing that data produced by the above steps are good for what I plan to do with them ? If I undertood well what I read, if I import data with the tximport, then DESeq2 will take care of gene length correction, which is not the case with a strandard workflow using hisat2+stringrie. I think that for clustering and heatmap visualisation, it's important to correct for the difference in length, and I was not doing that before...
And do you think It's usefull to use the fpkm function here ?
Thanks you in advance. Please tell me if you need more precisions on some aspects of my questions.
Cheers,
Alex
Hello,
Thanks you for your answer.
I am a bit confused about the definition of gene length bias :
What I have in mind when I talk about gene length bias is the fact that longer genes will accumulate more reads and so more count. For example, for a gene of 10kb, I expect him to have 10 times higher expression value if we compare him to a 1kb gene which is supposed (biologically) to have the same level of expression. So the count from 2 different genes is not comparable if the size of the gene/transcript is different, if I’m not wrong.
I noticed by looking in the quant.sf files (from salmon) that each gene has a EffectiveLength. So I think that this EffectiveLength is used to correct for some transcript length difference across samples. But does tximport/DESeq2 also take care of the gene length bias I described above, with different genes of different size having different expression values because one is smaller/bigger than the other ? I'm not sure and I don't want to make any mistake with that...
Thanks you in advance,
Alex.
I was referring to biases across samples.
If you want to compare across genes use the TPMs in the
$abundance
slot.Hello,
So I will extract the TPM values for all my samples to get a matrix of value and use that as an input for the heatmaps and for the clustering.
I will also try the things you mentionned here : https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#gene-clustering This should remove the length bias I was talking about. It's probably a better way to output heatmap.
Thanks a lot for you help and for you patience.
Alex.
Hello,
I thougt about something while sleeping. If I take the data in the txi$abundance slot, this will be data without any kind of « between sample » normalisation, so the samples cannot be compared together, no ? So the gene comparison is only possible within one sample.
Am I right ? (sorry for asking that much questions...)
Thanks again,
Alex
They are all scaled to have sum of one million. I have in the past done extra scaling of TPM using the media ratio method to remove additional global differences which then produces sums that are not equal to one million but tend to minimize changes for more genes. This is up to you. Most just use TPM.