Question

RNASeq DE analysis starts from different DESeq2 normalized data, TMM normalized data and CPM

1

Entering edit mode

li.zhaoz ▴ 10

@lizhaoz-22033

Last seen 3.9 years ago

Hi, I am new to bioinformatics. I am learning how to process RNA-Seq data download from GEO. But people post different types of data on GEO, some post DESeq2 normalized data and others post TMM normalized data or CPM data or FPKM. I am confused because people were saying different things about how to start from those data.

My questions: 1. Is there any function that I could use to filter out low expressed genes in data starts from DESeq2 or TMM normalized data? or just draw a plot to make sure they did. what should I do if they didn't filter out low expressed genes? 2. When I try to filter out low expressed gene from FPKM or CPM data, which one is better to use, rowMeans or rowSums? 3. Is there anything wrong with my current RNA-Seq pipeline? or any advice?:

1) Start from FPKM: limma-trend

keep = rowMeans(v) >= 2
v = v[keep, ]
v = log(FPKM + 0.1) 
fit  = lmFit(v, design)
cont.matrix = makeContrasts(contrasts = Disease-Control, levels = design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2  = eBayes(fit2, trend = TRUE)

2) Start from CPM: limma-trend

keep =  rowSums(CPM > 1) >= 2
v   = log2(data[keep,] + 0.1)
fit  = lmFit(v, design)
cont.matrix = makeContrasts(contrasts = Disease-Control, levels = design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2  = eBayes(fit2, trend = TRUE)

3) Start from DESeq2 normalized data: limma-voom

v = voom(data, design, plot=T)
fit = lmFit(v, design)
cont.matrix = makeContrasts(contrasts = Disease-Control, levels = design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 = eBayes(fit2)

4) Start from TMM normalized data: limma-voom How to filter out low expressed data

v = voom(data, design, plot=T)
fit = lmFit(v, design)
cont.matrix = makeContrasts(contrasts = Disease-Control, levels = design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 = eBayes(fit2)

5) Start from Raw count

dge = DGEList(counts = data)
keep =rowSums(cpm(dge)>1) >= 2
dge = calcNormFactors(dge)
v = voom(dge, design, plot=T)
fit = lmFit(v, design)
cont.matrix = makeContrasts(contrasts = Disease-Control, levels = design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 = eBayes(fit2)

I appreciate your valuable help!!

limma edger • 3.3k views

ADD COMMENT • link updated 4.8 years ago by Gordon Smyth 51k • written 4.8 years ago by li.zhaoz ▴ 10

score 6 · Answer 1 · 2019-10-01

The best approach would be to follow the analysis guidelines and examples described in the edgeR and limma User Guides. The short answer is that you should always get raw counts for any RNA-seq dataset. We recommend that filtering be done by filterByExpr.

Your approach (5) is the standard limma-voom pipeline so we obviously recommend it, although it could be improved further by using filterByExpr.

Approach (2) might also be ok. It the similar to the limma-trend pipeline except that your logCPM computation is not exactly the same.

Pipeline (1) is bad, as we have often said. See for example the comments in the edgeR User Guide.

I have no idea what pipelines (3) and (4) are, so can't comment on them. The terms "TMM normalized data" or "DESeq2 normalized data" could mean anything. I am very much against vague descriptors like this, although they are all too common in the bioinformatics literature. No wonder people get confused.

score 1 · Answer 2 · 2019-10-01

Dear li.zhaoz,

it is important before to proceed to read the following posts/articles about rna-seq analysis, normalization and transformation methodologies:

https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ https://academic.oup.com/bib/article/14/6/671/189645 https://f1000research.com/articles/5-1408/v3 https://support.bioconductor.org/p/98820/ https://support.bioconductor.org/p/69433/

based on your relative questions:

1) What is your purpose and/or experimental design of your study ? You just presented various R code chunks, but without information you have mixed various approaches with fuzzy code.Ideally, you should start with raw counts, and follow any available pipelines proposed in Bioconductor:

https://www.bioconductor.org/packages/release/BiocViews.html#_GeneExpressionWorkflow

2) Secondly, you should not use any transformed data-like cpm,tpm-with edgeR or relative tools for DE analysis. Moreover, regarding the limma-trend approach in the vignette is mentioned:

"If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential exis to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold"

Similarly, if you intend to use the limma-voom approach, you should check for high variability in library sizes between your samples...

3) Regarding your filtering question, there are various approaches that you would perform- as a representative example you can check the following sections:

https://www.bioconductor.org/packages/release/workflows/vignettes/RnaSeqGeneEdgeRQL/inst/doc/edgeRQL.html#filtering-to-remove-low-counts

https://www.bioconductor.org/packages/devel/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#removing-genes-that-are-lowly-expressed

Overall, you should check your experimental design, actual data and from quality control and exploratory data analysis, choose the most appropriate or more appropriate methodologies.

Hope that helps,

Efstathios