Question

Does the edgeR trimmed mean of M values (TMM) account for gene length?

0

Sabiha ▴ 20

@7f93ecd8

United States

Hi,

I am working with RNA-seq data using edgeR (steps below). While discussing some preliminary data analysis and observations, I came across a question about gene length: does the edgeR trimmed mean of M values (TMM) method account for gene length, along with sequencing depth and RNA composition?

While I was exploring more about this, I came across a couple of resources (links below):

EdgeR trimmed mean of M values (TMM) - accounts for sequencing depth, RNA composition, and gene length,

"A scaling normalization method for differential expression analysis of RNA-seq data": it states that gene length is generally absorbed into a certain parameter and does not get used in the inference procedure. The focus of the TMM method is on estimating the relative RNA production of two samples, essentially a global fold change, by equating the overall expression levels of genes between samples under the assumption that the majority of them are not differentially expressed. Thus, while gene length biases are acknowledged as significant in gene expression analysis, they do not appear to be used by TMM itself.

Sample metadata

#>   Samples Ind Event Treatment
#> 1      S1  I1    5m Untreated
#> 2      S2  I1    9m   Treated
#> 3      S3  I2    5m Untreated
#> 4      S4  I2    9m   Treated
#> 5      S5  I3    5m Untreated
#> 6      S6  I3    9m   Treated

EdgeR Analysis

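library(edgeR)
# Group samples by treatment status
group.Treatment <- factor(Sample_metadata$Treatment)
# Build the DGEList, dropping genes with zero counts in every sample
y <- DGEList(counts = gene_counts, group = group.Treatment, remove.zeros = TRUE)
# Filter out lowly expressed genes and recompute the library sizes
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
# Compute TMM scaling factors for the library sizes
y <- calcNormFactors(y, method = "TMM")
# Log2 counts per million, using the TMM-adjusted library sizes
logCPM <- cpm(y, prior.count = 1, log = TRUE)

I then use the logCPM values for downstream analysis, such as calculating fold changes per individual, plotting, and more.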

Thank you,

Sabiha

Normalization edgeR RNASeq • 2.8k views

ADD COMMENT • link 2.0 years ago Sabiha ▴ 20

score 1 · Answer 1 · 2024-01-20

1

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

TMM normalizes the library sizes, not the genewise read counts. Gene lengths are not relevant to the computation that TMM does.
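As a rough illustration of that point, here is a minimal base-R sketch with made-up counts and made-up normalization factors (in a real analysis the factors would come from calcNormFactors(); the values below are purely hypothetical). It only shows where the factors enter: they rescale the library sizes, the counts are left untouched, and no gene length appears anywhere.

```r
# Toy counts: 4 genes x 2 samples (made-up numbers, for illustration only).
counts <- matrix(c(10, 200, 35, 500,
                   20, 380, 80, 990),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("S1", "S2")))

# Raw library sizes are the column totals.
lib.size <- colSums(counts)

# Hypothetical normalization factors (assumed values, not computed by TMM here).
norm.factors <- c(S1 = 1.05, S2 = 1 / 1.05)

# The factors rescale the library sizes; the counts themselves are not modified.
eff.lib.size <- lib.size * norm.factors

# Counts per million without a prior count: each column is divided by its
# effective library size (in millions). Gene length is not part of the formula.
cpm.manual <- t(t(counts) / eff.lib.size) * 1e6
```

With a DGEList, the analogous quantities would be y$samples$lib.size and y$samples$norm.factors.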

In general, edgeR does not need to adjust for gene length in DE analyses because gene length cancels out of DE comparisons. Please see the section on normalization in the edgeR User's Guide.
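To see why the length cancels, consider a toy base-R sketch in which the expected count for a gene is proportional to its length times its expression level. The gene names, lengths and expression values are hypothetical, and sequencing depth is assumed equal in both conditions to keep the arithmetic simple.

```r
# Hypothetical gene lengths (bp) and per-base expression levels for 3 genes.
gene.length    <- c(geneA = 500, geneB = 2000, geneC = 10000)
expr.untreated <- c(geneA = 2, geneB = 3, geneC = 8)
expr.treated   <- c(geneA = 4, geneB = 3, geneC = 4)  # up, unchanged, down

# Expected counts scale with gene length, so longer genes get more reads.
count.untreated <- gene.length * expr.untreated
count.treated   <- gene.length * expr.treated

# Within a gene, the same length multiplies both conditions and cancels
# in the ratio, so the log fold change depends only on expression.
logFC.counts <- log2(count.treated / count.untreated)
logFC.expr   <- log2(expr.treated / expr.untreated)
```

Here logFC.counts and logFC.expr agree for every gene even though the raw counts differ widely between genes of different length. Length would only matter when comparing expression between different genes within a sample, which is not what a DE analysis does.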

"Gene length bias" as defined by Young et al (2010) is quite a different thing and is taken into account after the DE analysis, when interpreting the results in terms of annotation categories. This type of gene length bias relates to the power to detect DE and is optionally adjusted for by goseq or by the edgeR methods goana() and kegga().

Reference

Young MD, Wakefield MJ, Smyth GK, Oshlack A (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology 11, R14.

ADD COMMENT • link 2.0 years ago Gordon Smyth 53k

0

Gordon Smyth thank you.

Additionally, I was reading the article below, "A scaling normalization method for differential expression analysis of RNA-seq data", and this is what I took from it:

I infer that gene length is generally absorbed into a certain parameter and does not get used in the inference procedure. The focus of the TMM method is on estimating the relative RNA production of two samples, essentially a global fold change, by equating the overall expression levels of genes between samples under the assumption that the majority of them are not differentially expressed. Thus, while gene length biases are acknowledged as significant in gene expression analysis, they do not appear to be used by TMM itself.

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25

Is this a different TMM approach? The "Sampling framework" section of the paper does discuss gene length.

ADD REPLY • link 2.0 years ago Sabiha ▴ 20

1

I already gave you a complete answer as it affects edgeR. The paper that you quote (Robinson & Oshlack, 2010) agrees with what I told you in every respect.

TMM does not adjust for gene length, nor does it need to. The gene lengths do not enter into the TMM calculation. They are not relevant for what TMM is trying to achieve.

If you want more detailed explanations for statements made in the paper that you quote, it would be best to write to the authors of that paper. I can only answer questions about how to conduct analyses in edgeR or about the edgeR documentation.

ADD REPLY • link 2.0 years ago Gordon Smyth 53k

0

Gordon Smyth, thank you for your response and for clarifying the role of TMM in edgeR. I truly appreciate the time and effort you took to explain this.

I should clarify that my use of edgeR varies depending on the specific requirements. In every case, I import the raw counts into edgeR. Sometimes I use edgeR for the differential expression analysis itself, but in other instances I only use it to extract logCPM values (steps above). I then pass these logCPM values to other tools, such as limma, for comparative analysis or for other downstream applications.

ADD REPLY • link 2.0 years ago Sabiha ▴ 20