Question: Normalization of polyA RNA-seq?
8.1 years ago by
Xiaohui Wu • 280
Thank you, Simon.

Yes, I want to pool them together for downstream analyses, such as comparing with other tissues or analysing that dataset on its own. At first, I thought I should compare these two libs to make sure they are similar enough to be combined. The correlation of gene expression across all genes between these two libs was 0.5, which is not so high, but the correlation between two other, different libs was only 0.01, so I thought they could be combined.

I'm sure my counting is correct. You are right that I was not clear enough: a gene that has 1 read in the small lib does have more reads in the big lib, but for 80% of these genes it is no more than 20 times as many, and for 50% it is less than 10 times as many.

When I used the normalization methods in DESeq and edgeR, as you suggested, to estimate the size factors, the normalized lib sizes still differed by about 20 times. For example, if the big lib size is 1,200,000, the small one is 25,000, and the size factors are 1:2.5, then after normalization the new lib sizes are 1,200,000:62,500, which is still about 20 times. I mean that after normalization, the gene expression in the small lib will become higher than in the big one. Maybe I am just worrying too much: there will be some genes with higher expression in lib1 and others with higher expression in lib2, and I can't expect the two libs to agree exactly, even if they come from the same tissue under the same condition.

As you said, I will add up the counts without normalization when analysing the libs as one single lib, and normalize the libs when comparing samples. Thank you again, I'm not so confused now.

One more question, about the size factor. The TPM normalization is: newCount = (oldCount * 1,000,000) / libsize. Do the normalizations in edgeR and DESeq only estimate a size factor to adjust the lib size, and nothing else? That is, can I replace libsize with the adjusted libsize in the TPM formula to do the normalization?

Xiaohui

Simon Anders 2010-08-14 05:31:47 Wu, Xiaohui Ms.
bioconductor Re: [BioC] Normalization of polyA RNA-seq?

Hi Xiaohui

> I have two libraries of RNA-seq only with polyA of same tissue (leaf and
> leaf), and have mapped them to the genome. Most of these reads are in 3'UTR
> but not spread over the whole gene body. And the size of these two
> libraries are in great difference, like 25,000 reads versus 1,250,000 reads.
> About 40% and 60% of genes only have 1 read in small lib and bigger one,
> respectively. Most of the tags are dominated by only a few genes. I want to
> combine these two libs for larger one, but I think I should normalize the
> read count before pooling them together.
>
> If use TPM normalization, read count in smaller library will be multiplied
> by 50 times, that means the 1-tag gene will become 50-tag gene, in the
> small lib, while maybe that gene is also 1-tag gene in bigger lib, I feel
> not comfortable that TPM may make skew the read distribution. Do you have
> any idea on normalizing the data instead of TPM?

So, you want to ensure that both libraries get the same weight in your downstream analysis. But why would you want that? The smaller library contains less information, so it should not get the same weight.

Actually, your description is not too clear. You want to combine the two libraries into a single one, i.e., give up the information about which sample each read came from. This would make sense only if these are replicates. If so, it seems very suspicious that a gene that has one count in the small library only gets one count in the bigger one. This might occur occasionally, but should not happen for many genes. You should really double-check whether you did the counting correctly. (Try, for example, my htseq-count script [http://www-huber.embl.de/users/anders/HTSeq/doc/count.html] to see whether its results are similar to yours.)

Apart from this issue: If you really just want to combine the reads into one large sample, just add up the numbers, without normalization.
If, however, you want to compare the samples against each other, and normalize to make them comparable, you may want to look at the normalization functions of DESeq (function 'estimateSizeFactors') or edgeR (function 'calcNormFactors').

Simon
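
To make the two kinds of scaling discussed in this thread concrete, here is a minimal Python sketch. It is not code from DESeq or edgeR. It contrasts the per-million formula quoted above with a hand-written median-of-ratios size factor, which is the idea DESeq documents for 'estimateSizeFactors'. edgeR's 'calcNormFactors' uses a different method by default (TMM) and is not reproduced here. The function names and the toy counts are made up for illustration.

```python
import math
from statistics import median


def per_million(counts, lib_size):
    """Scale raw counts to 'per million reads' of the given library size."""
    return [c * 1_000_000 / lib_size for c in counts]


def median_of_ratios_size_factors(samples):
    """Illustrative median-of-ratios size factors.

    samples: list of count vectors, one per library, all the same length.
    Returns one size factor per library.
    """
    n_genes = len(samples[0])
    # Keep only genes counted in every library, so the geometric mean
    # across libraries is defined and nonzero.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in samples)]
    geo_mean = {
        g: math.exp(sum(math.log(s[g]) for s in samples) / len(samples))
        for g in usable
    }
    # Size factor of a library = median ratio of its counts to the
    # per-gene geometric mean.
    return [median(s[g] / geo_mean[g] for g in usable) for s in samples]


# Toy counts for five hypothetical genes (not data from this thread).
small = [1, 2, 4, 0, 10]
big = [10, 20, 40, 5, 100]

factors = median_of_ratios_size_factors([small, big])
# Dividing each library by its size factor puts both on a common scale.
normalized = [[c / f for c in lib] for lib, f in zip([small, big], factors)]

print(factors)
print(normalized)
print(per_million(small, sum(small)))
```

In this toy table every gene seen in both libraries is exactly ten times higher in the big one, so the size-factor-scaled counts of those genes coincide. Real libraries will not behave this cleanly, and the sketch says nothing about how either package handles dispersion or testing.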