FPKM = fragments per kilobase/million. To compute this, you divide the count by the exonic length of the gene (in kilobases) and the library size (in millions of reads). This can be done using the
However, calculation of the FPKM is distinct from edgeR's normalization. In edgeR, the TMM method computes normalization factors that represent sample-specific biases. These factors are multiplied by the library size to yield the effective library size, i.e., the library size that we would have gotten if those biases were not present. The effective library sizes can then be used for various normalization purposes, most frequently as offsets in generalised linear models. Calculation of the FPKM is not essential to this process.
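To make the two quantities above concrete, here is a minimal sketch of the arithmetic in plain Python. It is not edgeR code and does not reproduce how TMM factors are estimated; the function names, the example counts, and the normalization factor of 0.8 are all made up for illustration.

```python
def fpkm(count, exonic_length_bp, library_size):
    """Fragments per kilobase of exon per million fragments in the library."""
    length_kb = exonic_length_bp / 1e3        # exonic length in kilobases
    library_millions = library_size / 1e6     # library size in millions
    return count / (length_kb * library_millions)


def effective_library_size(library_size, norm_factor):
    """Library size multiplied by a sample-specific normalization factor."""
    return library_size * norm_factor


# Hypothetical values for one gene in one sample (not from any real data set).
count = 500
exonic_length_bp = 2_000
library_size = 20_000_000
norm_factor = 0.8  # assumed value standing in for a TMM normalization factor

raw_fpkm = fpkm(count, exonic_length_bp, library_size)
eff_size = effective_library_size(library_size, norm_factor)

print(f"FPKM with the raw library size: {raw_fpkm}")
print(f"Effective library size: {eff_size}")
```

The library size is an explicit argument to the hypothetical fpkm helper, so the same function works with either the raw or the effective library size.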
That said, if you wanted to compute FPKM values that incorporate information from TMM normalization, you would use the effective library size instead of the library size in the FPKM calculation. This is done automatically if you run a DGEList object through calcNormFactors and supply the resulting object to
As for your final question, edgeR uses a negative binomial distribution, so lack of normality is not an issue. It's not exactly clear what you're making boxplots of: (normalized?) expression values across samples for each gene, or expression values across genes for each sample? I would be reluctant to define a sample as an outlier based on boxplots for a small number of genes. In any case, your options are to turn on
estimateDisp or use
estimateGLMRobustDisp to reduce the impact of outliers in a few genes; or remove the offending outlier sample prior to the DE analysis, if all genes are affected.
modified 3.7 years ago
3.7 years ago by
Aaron Lun • 25k