Question: [deseq2] normalization between samples after merging transcripts into genes
0
4 months ago by
marie0
marie0 wrote:

Hi, I am trying to do comparative RNA-seq analysis with DESeq2.

My purposes are: 1. combine transcripts into genes 2. detect gene expression difference under different conditions 3. obtain a summarized gene expression table

1. combine transcripts into genes I basically followed these commands: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#kallisto
library(tximport)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
names(txi)
library(tximportData)

dir <- system.file("/Users", package = "tximportData")
all(file.exists(files))
files <- file.path("/Users", "kallisto", samples$sample, "abundance.tsv") names(files) <- samples$sample

library(tximportData)
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene1)



Is that a valid way? I found the sum of the gene expression of each sample is largely diverse.

1. detect gene expression difference under different conditions I followed this: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-un-normalized-counts

Since it describes that "As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of integer values", I used "txt" above as input without normalization.

dds <- DESeqDataSetFromMatrix(countData = txi,
colData = colData,
design = ~condition)
result <- results(dds, contrast=c("condition","conditionA","conditionB"))


Is that a valid way to do it?

1. obtain a summarized gene expression table
dds <- estimateSizeFactors(dds)
normalized <-counts(dds, normalized=TRUE)


However, I found the sum of the gene expression of each sample is largely diverse. Something like: (two samples for condition A and B)

sample A1 A2 B1 B2
gene1  10  11  1  1
gene2  20  19  2  1
gene3  30  32  3  4
gene4  40  38  4  4


It looks odd and I wonder how to obtain a gene expression table of all the samples, which are normalized by both gene length and between-sample bias. Or, is there a significant gene expression depletion in sample B1 and B2 overall?

deseq2 tximport • 146 views
modified 4 months ago by Michael Love24k • written 4 months ago by marie0
Answer: [deseq2] normalization between samples after merging transcripts into genes
0
4 months ago by
Michael Love24k
United States
Michael Love24k wrote:

Yes that is how to summarize to gene with tximport.

You should use DESeqDataSetFromTximport which is specifically designed for importing from tximport, as the name suggests. See here:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#transcript-abundance-files-and-tximport-input

That table looks like there is DE for those genes. I don't follow what you are trying to show me?

Thank you for the prompt response.

I tried DESeqDataSetFromTximport this time, but the last problem remains.

What I was trying to ask is: I expected the sum of all gene expression in the tximport output is same across the samples, like some other normalization methods, e.g., "all TPMs (transcript-level or gene-level) should always equal 1,000,000." However, the observed sum of all gene (I assumed gene1-4 is all the genes in the genome in this example above) expression in the tximport output is quite different between the samples (in this case, the sum is 100 for sample A1 and 10 for sample B1), even after the normalized <-counts(dds, normalized=TRUE).

Is that appropriate to calculate fold change over this dataset? If this is not properly normalized, how can I obtain the normalized data table?

Can you show me the column sum of normalized counts for your real data?

The actual data is like this:

sample stage1_treatment1 stage1_treatment2 stage1_treatment2 stage1_treatment3 stage1_treatment2 stage1_treatment3 stage1_treatment2 stage1_treatment1 stage1_treatment3 stage1_treatment3 stage1_treatment3 stage1_treatment3 stage1_treatment1 stage2_treatment1 stage2_treatment2 stage2_treatment1 stage2_treatment3 stage2_treatment2 stage2_treatment3 stage2_treatment1 stage2_treatment1 stage2_treatment3 stage2_treatment1 stage2_treatment3 stage2_treatment1 stage2_treatment3 stage2_treatment2 sum 11360882.22 15284394.07 19442307.11 19787325.58 17012527.38 15219009.92 15389659.54 16551581.26 15560148.45 16184824.93 17221220.16 14641597.56 12890883.85 7180978.134 7209581.174 7462163.292 8744727.234 7069309.349 7538834.546 6157925.432 6281529.7 7788110.682 8102544.576 8004473.424 7499249.027 7546127.962 7360558.875

seems like it ranges from 6 to 20 million? Yes this is strange if those are colSums of normalized counts. Is there something very different about these samples? Are they perhaps experiencing massive changes in the distribution of expressed genes?