Question

RNA-seq: Batch-free expression values

0

Entering edit mode

Ekarl2 ▴ 80

@ekarl2-7202

Last seen 7.9 years ago

Sweden

Greetings,

My experimental design consists of two experimental factors, four conditions (+/+, -/+, +/-, -/-) and four biological replicate per condition. The batch effect in this experiment comes from the fact that the first replicate for all conditions was taken separately, the second separately and so on.

Now I have written a somewhat bloated R script that tries to accomplish two separate goals:

(1) find genes that are differentially expressed between presence and absence of the two experimental factors separately, i.e. which are differentially expressed due to factor 1 alone or factor 2 alone.

(2) getting batch-free TMM-normalized FPKM values (where the batch effect has been removed). This will not be used in any differential expression analysis, just for making bar graphs, clustering and heat maps generation.

I accomplished (1) by essentially modifying the Arabidopsis example from the awesome EdgeR user guide to fit this dataset. This worked great.

For (2), my previous idea was to use my matrix with raw counts, log2 transform them, use removeBatchEffect() with a design matrix that lack batch effect factor, remove log2 transformation and use a utility wrapper from the Trinity assembler to get TMM-normalized FPKM values.

After reading the answers to some previous questions:

RNA-Seq, generate batch-free count matrix
Remove batch effect in small RNASeq study (SVA or others?)

...on the Bioconductor support site I now consider something like this:

- make DGEList object from counts.
- filter by total counts to remove non-expressed features.
- TMM-normalization with calcNormFactors().
- log2 transformation.
- remove batch effect with removeBatchEffect() specifying a batch factor and a design matrix with only experimental factors (i.e. without a batch factor).
- use rpkm().

Does this procedure make sense for the goals of being able to make bar graphs for "batch-free" TMM-normalized RPKM values / heat map based on clustering? Does this procedure even produce something that can be reasonable called "batch-free TMM-normalized RPKM values", considering this response:

A: RNA-Seq, generate batch-free count matrix

...to a previous question? If not, what is the technically correct name for the values that the procedure above produces?

Thanks,

edgeR limma removeBatchEffect() • 3.3k views

ADD COMMENT • link updated 9.4 years ago by Aaron Lun ★ 28k • written 9.4 years ago by Ekarl2 ▴ 80

score 2 · Answer 1 · 2014-12-28

A smoother approach would be to use rpkm on your DGEList object, and then call removeBatchEffect on the resulting (log-)RPKM values. For example, if your DGEList object is y, you could do:

lrpkm <- rpkm(y, log=TRUE, prior.count=3)
lrpkm.c <- removeBatchEffect(lrpkm, batch=batch, design=design)

This generates batch-corrected log-RPKM values, equivalent to the batch-corrected log-CPM values in A: RNA-Seq, generate batch-free count matrix. The only difference between the two is that RPKMs account for the gene lengths. However, the gene length is the same for all counts of any given gene, so there shouldn't be any difference in the extent of the batch correction for that gene.

Personally, I don't feel the need to mention that TMM normalization was performed, as that's implicit in the consideration of the (effective) library size in the RPKM value. That said, I can't imagine that calling them "batch-free TMM-normalized (log-)RPKM values" would be too offensive to anybody.