[deseq2] normalization between samples after merging transcripts into genes
1
0
Entering edit mode
marie • 0
@marie-19926
Last seen 3.5 years ago
Norway

Hi, I am trying to do comparative RNA-seq analysis with DESeq2.

My purposes are: 1. combine transcripts into genes 2. detect gene expression difference under different conditions 3. obtain a summarized gene expression table

  1. combine transcripts into genes I basically followed these commands: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#kallisto
library(tximport)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
names(txi)
library(tximportData)

dir <- system.file("/Users", package = "tximportData")
samples <- read.table(file.path("/Users", "samplelist.csv"), header = TRUE)
tx2gene1 <- read.csv(file.path("/Users", "transcript_gene.csv"), header = TRUE)
all(file.exists(files))
files <- file.path("/Users", "kallisto", samples$sample, "abundance.tsv")
names(files) <- samples$sample

library(tximportData)
 txi <- tximport(files, type = "kallisto", tx2gene = tx2gene1)

Is that a valid way? I found the sum of the gene expression of each sample is largely diverse.

  1. detect gene expression difference under different conditions I followed this: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-un-normalized-counts

Since it describes that "As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of integer values", I used "txt" above as input without normalization.

dds <- DESeqDataSetFromMatrix(countData = txi,
                              colData = colData,
                              design = ~condition)
result <- results(dds, contrast=c("condition","conditionA","conditionB"))

Is that a valid way to do it?

  1. obtain a summarized gene expression table
dds <- estimateSizeFactors(dds)
normalized <-counts(dds, normalized=TRUE)

However, I found the sum of the gene expression of each sample is largely diverse. Something like: (two samples for condition A and B)

sample A1 A2 B1 B2
gene1  10  11  1  1
gene2  20  19  2  1
gene3  30  32  3  4
gene4  40  38  4  4

It looks odd and I wonder how to obtain a gene expression table of all the samples, which are normalized by both gene length and between-sample bias. Or, is there a significant gene expression depletion in sample B1 and B2 overall?

Thank you for your help.

deseq2 tximport • 2.3k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

Yes that is how to summarize to gene with tximport.

You should use DESeqDataSetFromTximport which is specifically designed for importing from tximport, as the name suggests. See here:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#transcript-abundance-files-and-tximport-input

That table looks like there is DE for those genes. I don't follow what you are trying to show me?

ADD COMMENT
0
Entering edit mode

Thank you for the prompt response.

I tried DESeqDataSetFromTximport this time, but the last problem remains.

What I was trying to ask is: I expected the sum of all gene expression in the tximport output is same across the samples, like some other normalization methods, e.g., "all TPMs (transcript-level or gene-level) should always equal 1,000,000." However, the observed sum of all gene (I assumed gene1-4 is all the genes in the genome in this example above) expression in the tximport output is quite different between the samples (in this case, the sum is 100 for sample A1 and 10 for sample B1), even after the normalized <-counts(dds, normalized=TRUE).

Is that appropriate to calculate fold change over this dataset? If this is not properly normalized, how can I obtain the normalized data table?

I appreciate your help,

ADD REPLY
0
Entering edit mode

Can you show me the column sum of normalized counts for your real data?

ADD REPLY
0
Entering edit mode

The actual data is like this:

sample stage1_treatment1 stage1_treatment2 stage1_treatment2 stage1_treatment3 stage1_treatment2 stage1_treatment3 stage1_treatment2 stage1_treatment1 stage1_treatment3 stage1_treatment3 stage1_treatment3 stage1_treatment3 stage1_treatment1 stage2_treatment1 stage2_treatment2 stage2_treatment1 stage2_treatment3 stage2_treatment2 stage2_treatment3 stage2_treatment1 stage2_treatment1 stage2_treatment3 stage2_treatment1 stage2_treatment3 stage2_treatment1 stage2_treatment3 stage2_treatment2 sum 11360882.22 15284394.07 19442307.11 19787325.58 17012527.38 15219009.92 15389659.54 16551581.26 15560148.45 16184824.93 17221220.16 14641597.56 12890883.85 7180978.134 7209581.174 7462163.292 8744727.234 7069309.349 7538834.546 6157925.432 6281529.7 7788110.682 8102544.576 8004473.424 7499249.027 7546127.962 7360558.875

ADD REPLY
0
Entering edit mode

seems like it ranges from 6 to 20 million? Yes this is strange if those are colSums of normalized counts. Is there something very different about these samples? Are they perhaps experiencing massive changes in the distribution of expressed genes?

ADD REPLY
0
Entering edit mode

I see. Thank you very much for your response. Yes, it could be possible; there can be a huge difference between the condition. I will discuss with my collaborators and will also search for studies with a similar setting. Actually, this is my first experience observing such an extreme diversity between samples although I have analyzed several RNAseq datasets with different experimental designs.

ADD REPLY

Login before adding your answer.

Traffic: 689 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6