Question

calcNormFactors in edgeR

phickner2 • 0

@phickner2-18903

Can anyone explain the calcNormFactors normalization process in edgeR? I have RNA-seq data from insect antennae. I do not have a genome assembly for my organism, but I did assemble the transcriptome de novo using Trinity. I then retrieved my transcripts of interest (chemosensory genes) from the assembly, mapped my reads to those ~300 chemosensory genes, and counted the reads with htseq-count. I provided the library sizes to edgeR. I am concerned that, since my counts represent only a small proportion of all the reads, normalization using calcNormFactors may not be appropriate. I can't seem to find out exactly how the normalization is performed. Any help would be appreciated.

Thanks,

Paul

edger • 829 views

updated 5.8 years ago by Gordon Smyth 51k • written 5.8 years ago by phickner2 • 0

score 2 · Answer 1 · 2018-12-21

The short answer is 'don't do that'.

One issue with RNA-Seq is differences in library size. If you took two aliquots of exactly the same sample and sequenced each once, getting 20M reads from one and 10M reads from the other, you would expect about twice as many counts per gene in the first aliquot as in the second (because you have twice the reads). You want to account for those differences, because in this silly example you already know that the genes were expressed at exactly the same level in both samples, and the difference in counts is due only to the library size.
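To make the aliquot example concrete, here is a minimal base R sketch. The three gene names and their expression proportions are made up for illustration, and the counts are idealized expected values with no sampling noise.

```r
# Hypothetical relative expression of three genes in one sample
# (the same for both aliquots, because they come from the same RNA).
prop <- c(geneA = 0.5, geneB = 0.3, geneC = 0.2)

# Expected counts if the aliquots are sequenced to 20M and 10M reads.
counts_20M <- prop * 20e6
counts_10M <- prop * 10e6

# Raw counts differ by a factor of two for every gene ...
ratio <- counts_20M / counts_10M
print(ratio)

# ... but dividing by library size (counts per million) removes the difference.
cpm_20M <- counts_20M / sum(counts_20M) * 1e6
cpm_10M <- counts_10M / sum(counts_10M) * 1e6
print(all.equal(cpm_20M, cpm_10M))
```

Scaling by total library size is enough here only because nothing differs between the aliquots except sequencing depth.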

It's not that simple when you are using biological replicates. All things being equal, if you have twice as many reads in one sample as in the other, you expect about twice the counts/gene. But all things aren't equal: some genes may be very highly expressed in one sample and take up more than their expected share of the total library size, which makes every other gene in that sample look less expressed than it really is (this is called composition bias). In that case you want to find a set of genes that aren't obviously changing expression by a large amount between the two samples and scale the libraries using those. TMM normalization, the default in calcNormFactors, is intended to do that: it estimates a normalization factor for each library, based on a reasonable set of genes.
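The effect of composition bias can be sketched in base R. This is a deliberately simplified trimmed mean of log-ratios, not the edgeR implementation (the real TMM also trims on average abundance, weights the genes, and rescales the factors across libraries). The toy abundances, the 30% trim, and all variable names are assumptions made for the illustration.

```r
# Toy example: 100 genes, identical true expression in samples A and B,
# except that five genes are 50-fold higher in B.
true_a <- rep(1, 100)
true_b <- true_a
true_b[1:5] <- 50

# Both samples are sequenced to the same depth (expected counts, no noise).
depth <- 1e6
counts_a <- depth * true_a / sum(true_a)
counts_b <- depth * true_b / sum(true_b)

# Naive normalization: divide by total library size only.
cpm_a <- counts_a / sum(counts_a) * 1e6
cpm_b <- counts_b / sum(counts_b) * 1e6
naive_lfc <- log2(cpm_b / cpm_a)

# Simplified TMM-like factor: trim 30% of the log-ratios from each end,
# average the rest, and use the result to rescale B's library size.
factor_b <- 2^mean(naive_lfc, trim = 0.3)
eff_lib_b <- sum(counts_b) * factor_b
adjusted_lfc <- log2((counts_b / eff_lib_b * 1e6) / cpm_a)

# Log fold changes of the 95 unchanged genes, before and after adjustment.
print(range(naive_lfc[6:100]))
print(range(adjusted_lfc[6:100]))
```

With library-size scaling alone, the 95 unchanged genes all appear down-regulated in B; after rescaling by the trimmed-mean factor their log fold changes return to zero.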

If you subset to some arbitrary set of genes, you may well lose the genes that aren't really changing, which are the ones that would be useful for calculating differences in library size. In addition, all the other genes that you may not be interested in are quite useful for estimating things like the relationship between gene counts and variance (which you need for linear modeling). So in general you should keep all the genes that appear to be expressed at a reasonable level in your data set all the way through the model fit. After that, you can subset the results to the genes you care about.
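A rough sketch of that workflow in edgeR might look like the following. The count matrix is simulated, and the object names (counts, group, genes_of_interest) and the two-group design are placeholders rather than anything from the original data.

```r
library(edgeR)

set.seed(42)
# Simulated stand-in for a full count matrix: 2000 genes x 6 samples.
counts <- matrix(rnbinom(2000 * 6, mu = 50, size = 10), nrow = 2000,
                 dimnames = list(paste0("gene", 1:2000), paste0("s", 1:6)))
group <- factor(rep(c("A", "B"), each = 3))
# Hypothetical subset of interest (e.g. the chemosensory genes).
genes_of_interest <- paste0("gene", 1:300)

y <- DGEList(counts = counts, group = group)

# Keep every gene with reasonable expression, not just the genes of interest.
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalization (the default method) on the full filtered set.
y <- calcNormFactors(y)

design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)

# Only after the model fit, pull out the genes of interest.
tab <- topTags(res, n = Inf)$table
tab_interest <- tab[rownames(tab) %in% genes_of_interest, ]
head(tab_interest)
```

Note that the FDR column in tab_interest is still computed across all tested genes. Whether to re-adjust the p-values within the subset is a separate decision that this sketch does not make.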

score 0 · Answer 2 · 2018-12-23

I wouldn't run calcNormFactors for your data. You can run an edgeR analysis on your 300 genes, omitting the calcNormFactors step. However, you have to be sure to set the library sizes to the total number of reads you have for each sample, including reads not mapped to any of your 300 chemosensory genes.

For example, you might use

y <- DGEList(counts=counts, lib.size=n)

where counts is your count matrix with 300 rows and n is a vector giving the sequencing depth for each sample.

Compared to a standard RNA-seq analysis with a reference genome, you will lose some of the protection that calcNormFactors would normally give you, but the DE analysis will still be OK provided your RNA samples are of consistent quality.
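For illustration, a fuller version of this approach could look something like the sketch below. The counts are simulated, and the sequencing depths in n, the group factor, and the choice of the quasi-likelihood pipeline are assumptions added for the example, not part of the answer above.

```r
library(edgeR)

set.seed(1)
# Simulated stand-in for the 300-gene count matrix (6 samples, 2 groups).
counts <- matrix(rnbinom(300 * 6, mu = 20, size = 5), nrow = 300,
                 dimnames = list(paste0("chemo", 1:300), paste0("s", 1:6)))
group <- factor(rep(c("A", "B"), each = 3))

# Hypothetical total reads sequenced per sample, including reads that
# did not map to any of the 300 genes.
n <- c(21e6, 18e6, 25e6, 19e6, 22e6, 20e6)

# Supply the full sequencing depth rather than the column sums of counts.
y <- DGEList(counts = counts, lib.size = n, group = group)

# calcNormFactors is deliberately skipped, so norm.factors stay at 1.
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)
topTags(res)
```

If lib.size were left out, DGEList would use the column sums of the 300-gene matrix as library sizes, which is exactly what this approach is trying to avoid.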