Question

how can I do gene differential analysis on data after vst

0

Entering edit mode

linouhao ▴ 20

@linouhao-15901

Last seen 12 months ago

United States

Hello! I come across one article（Transcriptome analysis of non human primate-induced pluripotent stem cell-derived cardiomyocytes in 2D monolayer culture vs. 3D engineered heart tissue）, and I would like to use the expression matrix associated with the GSE156237 dataset. However, I noticed that the author have uploaded a VSD matrix processed with VST to the GEO website. Is there any other ways to do differential analysis on vsd data, if I do not want to from the raw fastq.

Thank you very much for your time and consideration.

Best regards,

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE156237 GSE156237_DESeq2_vsd_exp.txt.gz

DESeq2 • 2.9k views

ADD COMMENT • link written 2.8 years ago by linouhao ▴ 20

score 1 · Answer 1 · 2023-03-29

1

Entering edit mode

ATpoint ★ 5.0k

@atpoint-13662

Last seen 1 hour ago

Germany

I am not sure who you are addressing here, the support forum is not associated with GEO. In general if you need raw counts either download fastq (for example via sra-explorer.info) and process it (for example via salmon) or use something like BioJupies or recount project. In any case, this is off-topic here. vst-transformed counts can be analysed with limma-trend (see limma user guide).

ADD COMMENT • link 2.8 years ago ATpoint ★ 5.0k

0

Entering edit mode

hi thanks a lot. It is a matrix, not much related to GEO, more related to DESEQ2, I want to know if raw count after VST function , then can use wilcox or limma to do gene differential analysis

ADD REPLY • link 2.8 years ago linouhao ▴ 20

1

Entering edit mode

I do not understand what you mean. Please search previous posts. vst can be used with limma, and DESeq2 needs raw counts as described in the vignette.

ADD REPLY • link 2.8 years ago ATpoint ★ 5.0k

0

Entering edit mode

GSE156237_DESeq2_vsd_exp.txt.gz I have a gene expression matrix, the data in it was transformed by vst, how can I do differential analysis on this data

thanks a lot

ADD REPLY • link 2.8 years ago linouhao ▴ 20

score 1 · Answer 2 · 2023-03-29

1

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Differential expression analysis of vst values can be performed by limma-trend.

ADD COMMENT • link 2.8 years ago Gordon Smyth 53k

0

Entering edit mode

I'm so happy and surprised to get your professional reply.

library(limma)
library(edge)

counts <- read.csv("raw_counts.csv",row.names = 1)
#Creates a DGEList
dge_counts <- DGEList(counts = counts,remove.zeros = T)

tmm_counts <- calcNormFactors(dge_counts)

logCPM_counts <- cpm(tmm_counts , log=TRUE)

group_list <- factor(c(rep("control",2), rep("treat",2)))
design <- model.matrix(~group_list)
colnames(design) <- levels(group_list)
rownames(design) <- colnames(counts)

y = voom(counts, design, plot = T)
fit <- lmFit(y, design)
fit <- eBayes(fit)
DE_genes <- topTable(fit, coef = 2,p.value = 0.05, lfc = log2(1.5), number = Inf,sort.by="logFC")

The above code is a usual process. do you mean just use as my following code

vsd_count <- vroom('GSE156237_DESeq2_vsd_exp.txt')


group_list <- factor(c(rep("control",2), rep("treat",2)))

design <- model.matrix(~group_list)


colnames(design) <- levels(group_list)

rownames(design) <- colnames(vsd_count )

fit <- lmFit(vsd_count , design) 
fit <- eBayes(fit, trend=TRUE) 
topTable(fit, coef=ncol(design))

Please reply for me when you are free. Thank you very much.

ADD REPLY • link 2.8 years ago linouhao ▴ 20

1

Entering edit mode

I advised you to use limma-trend, which you can look up in the limma User's Guide. You simply analyse the vst values as if they were from a microarray, using standard limma code. The vst values are treated the same as one would treat logCPM values from cpm().

You obviously can't use voom, DGEList etc because they need counts but you don't have counts.

Your second code chunk also can't be right because the dataset has 12 samples but you have defined a factor with only 4 samples. But the second code chunk is the right sort of code.

That's as much help as I have time to give. I don't have time for detailed code checking, and you just need to use a standard pipeline anyway.

ADD REPLY • link 2.8 years ago Gordon Smyth 53k

0

Entering edit mode

Thanks a lot, 2 vs 2 is just a test code, thanks for you kind pointing out. My concer is about the following: in the user guide, about the trend part. it talks about If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential exis to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold.

there are some sequencing samples that may not guarantee this triple hypothesis, can you still use trend = T directly? eBayes(fit, proportion = 0.01, stdev.coef.lim = c(0.1,4), trend = FALSE, robust = FALSE, winsor.tail.p = c(0.05,0.1))

you told me yesterday 'Differential expression analysis of vst values can be performed by limma-trend.' how should I set the argument in eBayes another question, you does not mean using trend in treat(fit, lfc=log2(1.2), trend=TRUE), am Iright?

ADD REPLY • link 2.8 years ago linouhao ▴ 20

0

Entering edit mode

This is whole code, because there are four groups in the geo set. I want the first 3 vs 3, and the 1:3 is the control, 4:6 is the treat group. is this right

file1 <- 'GSE156237_DESeq2_vsd_exp.txt'

df <- vroom(file1) %>%
  column_to_rownames(var = 'id')

exprSet <- df %>%
  dplyr::select(c(1:6))
group <- factor(c(rep("control",3), rep("treat",3)))
treat_group_names <- 'treat'
control_group_names <- 'control'
design <- model.matrix(~0 + factor(group)) 
colnames(design) <- levels(factor(group))
rownames(design) <- colnames(exprSet)
cont.matrix <- makeContrasts(contrasts = paste0(treat_group_names, '-' , control_group_names),  levels = design)
fit1 <- lmFit(exprSet, design)              
fit2 <- contrasts.fit(fit1, cont.matrix) 
efit <- eBayes(fit2, trend = T)  
tempDEG <- topTable(efit, coef = paste0(treat_group_names, '-', control_group_names), number = Inf)

ADD REPLY • link 2.8 years ago linouhao ▴ 20

1

Entering edit mode

This is getting out of hand. You have been advised how your analysis can basically be done, the user guide covers it. It is not feasible to hands-on guide you through your analysis. If you need that then please collaborate with a local person familiar with R and differential analysis.

ADD REPLY • link 2.8 years ago ATpoint ★ 5.0k

0

Entering edit mode

Thanks a lot. so do you know when professor Gordon Smyth says 'Differential expression analysis of vst values can be performed by limma-trend.'

do you know his answer is trend = F or trend =T?

ADD REPLY • link 2.8 years ago linouhao ▴ 20

0

Entering edit mode

Read the manual.

ADD REPLY • link 2.8 years ago ATpoint ★ 5.0k

0

Entering edit mode

Thanks a lot. so what is your answer, and wait for the author to judge it is right ot not.

I have already read it, If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential exis to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold. In the limma-trend approach, the counts are converted to logCPM values using edgeR’s cpm function: > logCPM <- cpm(dge, log=TRUE, prior.count=3) The prior count is used here to damp down the variances of logarithms of low counts. The logCPM values can then be used in any standard limma pipeline, using the trend=TRUE argument when running eBayes or treat. For example:

fit <- lmFit(logCPM, design) fit <- eBayes(fit, trend=TRUE) topTable(fit, coef=ncol(design)) Or, to give more weight to fold-changes in the gene ranking, one might use: fit <- lmFit(logCPM, design) fit <- treat(fit, lfc=log2(1.2), trend=TRUE) topTreat(fit, coef=ncol(design))

so I am here want to get your answer disadvantage: 3 fold library size limit advantage: describe clearly says : logCPM values using trend = T is good

Can you provide your professional answers now? Let the author of the package evaluate them

ADD REPLY • link 2.8 years ago linouhao ▴ 20