Question: edgeR normalization method
Sara wrote:

Hi all experts,

I am a biology student who has started to learn R and NGS analysis, and I have some basic questions, so please be patient with me. Regarding differential gene expression analysis from an RNA-seq experiment: as far as I have read, edgeR accepts raw counts and normalizes them with the TMM method. Is that right? However, I read a paper that used edgeR for differential expression analysis, in which the gene fold change was calculated as log2(FPKM treatment / FPKM control). I am confused about why the authors said "FPKM". Could someone please explain where the FPKM comes from?

For statistical analysis, we need to ensure that all samples are comparable. If a box plot shows that the samples do not have a normal distribution and, in fact, one of the samples stands out from the rest, should we normalize the data before running the edgeR analysis?

Aaron Lun wrote:

FPKM stands for fragments per kilobase per million. To compute it, you divide the count by the exonic length of the gene (in kilobases) and by the library size (in millions of reads). This can be done using the rpkm function in edgeR.
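To make the arithmetic concrete, here is a minimal sketch of that calculation in plain Python. It is only an illustration of the formula described above, not edgeR's rpkm implementation; the function name fpkm and the example numbers are made up for this sketch.

```python
def fpkm(count, gene_length_bp, library_size):
    """Fragments per kilobase of exonic length per million fragments.

    count          -- fragments assigned to the gene
    gene_length_bp -- total exonic length of the gene, in base pairs
    library_size   -- total number of fragments in the sample
    """
    length_kb = gene_length_bp / 1_000          # exonic length in kilobases
    lib_millions = library_size / 1_000_000     # library size in millions
    return count / (length_kb * lib_millions)


# Hypothetical gene: 500 fragments, 2,000 bp of exons,
# in a library of 20 million fragments.
example = fpkm(count=500, gene_length_bp=2_000, library_size=20_000_000)
print(example)  # 12.5
```

A fold change such as log2(FPKM treatment / FPKM control) is then just a ratio of two such values for the same gene, so the gene length cancels out and only the library size scaling remains.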

However, calculation of the FPKM is distinct from edgeR's normalization. In edgeR, the TMM method computes normalization factors that represent sample-specific biases. These factors are multiplied by the library size to yield the effective library size, i.e., the library size that we would have gotten if those biases were not present. The effective library sizes can then be used for various normalization purposes, most frequently as offsets in generalised linear models. Calculation of the FPKM is not essential to this process.

That said, if you wanted to compute FPKM values that incorporate information from TMM normalization, you would use the effective library size instead of the library size in the FPKM calculation. This is done automatically if you run a DGEList object through calcNormFactors and supply the resulting object to rpkm.
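The relationship between the normalization factor, the effective library size, and the FPKM can be sketched as follows. This is a plain-Python illustration under stated assumptions, not edgeR code: the normalization factor of 0.8 is a made-up value standing in for what calcNormFactors would estimate, and the helper names are hypothetical.

```python
def effective_library_size(library_size, norm_factor):
    """Library size multiplied by the sample's normalization factor."""
    return library_size * norm_factor


def fpkm(count, gene_length_bp, library_size):
    """Count divided by gene length (kb) and library size (millions)."""
    return count / ((gene_length_bp / 1_000) * (library_size / 1_000_000))


library_size = 20_000_000   # hypothetical total fragments in the sample
norm_factor = 0.8           # hypothetical TMM normalization factor

eff_size = effective_library_size(library_size, norm_factor)

# Same gene, same count; only the library size used for scaling differs.
plain_fpkm = fpkm(500, 2_000, library_size)
adjusted_fpkm = fpkm(500, 2_000, eff_size)

print(eff_size, plain_fpkm, adjusted_fpkm)
```

With a factor below 1, the effective library size is smaller than the raw library size, so the adjusted FPKM comes out larger than the unadjusted one; a factor above 1 has the opposite effect.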

As for your final question: edgeR uses a negative binomial distribution, so lack of normality is not an issue. It's not exactly clear what you're making boxplots of: (normalized?) expression values across samples for each gene, or expression values across genes for each sample? I would be reluctant to define a sample as an outlier based on boxplots for a small number of genes. In any case, your options are to set robust=TRUE in estimateDisp or to use estimateGLMRobustDisp to reduce the impact of outliers in a few genes, or to remove the offending outlier sample prior to the DE analysis if all genes are affected.

Thank you very much for your complete reply. Regarding the boxplot, I meant a boxplot of raw count values across genes for each sample. The MDS plot in the R package is something like a boxplot and can be used to evaluate the variance between samples before doing differential expression analysis, am I right?

No. If you're referring to plotMDS, this constructs a multidimensional scaling plot, not a boxplot. The MDS plot serves the same function as a PCA plot, i.e., similar samples should cluster together while dissimilar samples should be far apart. This allows you to tell whether the replicates are consistent, whether the treatment conditions have any noticeable effect, and whether there are outlier samples. The diagnostic information that you get from an MDS plot is far more valuable than that from a boxplot of counts for each sample. The latter doesn't really tell you whether there are systematic differences in gene expression between samples, because that is confounded by differences in library size between samples.
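The confounding by library size can be seen in a small toy example. The sketch below uses made-up counts and plain Python; it does not reproduce what plotMDS does, it only shows why a boxplot of raw counts can make a sample look different when the only difference is sequencing depth.

```python
import statistics


def cpm(counts):
    """Counts per million: each count scaled by the sample's total."""
    total = sum(counts)
    return [c / total * 1_000_000 for c in counts]


# Two hypothetical samples with identical expression profiles;
# sample_b was simply sequenced three times as deeply.
sample_a = [10, 200, 50, 740]
sample_b = [c * 3 for c in sample_a]

# The raw-count summaries differ, so a boxplot would show a shift ...
median_a = statistics.median(sample_a)
median_b = statistics.median(sample_b)
print(median_a, median_b)

# ... but once library size is accounted for, the profiles coincide.
max_diff = max(abs(x - y) for x, y in zip(cpm(sample_a), cpm(sample_b)))
print(max_diff)
```

Here the shift in the raw counts reflects sequencing depth alone, which is why a raw-count boxplot on its own says little about real differences in expression.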