Question: normalization factors for ChIP/RNA-IP-seq data
6.4 years ago by
mali salmon320
Israel
mali salmon320 wrote:
Dear List,

I have peak counts from RNA-IP samples and corresponding inputs, for two different conditions. I would like to find differential binding between the two IP conditions after removing the differential expression effect. In a previous post (titled "differential binding question") Mark Robinson suggested doing a GLM analysis. Before doing the DE analysis I have to normalize the data.

Using the DESeq "estimateSizeFactors" function I get the following sizeFactors:

    > sizeFactors( cds )
         cond1_IP    cond1_IP.1   cond1_Input cond1_Input.1      cond2_IP
        6.3672619     6.1015548     0.3209480     0.2553967     3.2300114
       cond2_IP.1    cond2_IP.2   cond2_Input cond2_Input.1
        1.7808445     1.7027369     0.2480639     0.2530747

With edgeR, these are the normalization factors I get using the TMM and RLE methods:

    > dTMM$samples
                  group lib.size norm.factors
    cond1_IP          H  8345160    0.9916792
    cond1_IP.1        H  9395446    1.2221615
    cond1_Input       H  1126656    0.4489350
    cond1_Input.1     H   219823    2.1955057
    cond2_IP          S  5707895    0.8339317
    cond2_IP.1        S  5914904    0.5014391
    cond2_IP.2        S  5602070    0.5043970
    cond2_Input       S   223442    1.9909578
    cond2_Input.1     S   226840    1.9934207

    > dRLE$samples
                  group lib.size norm.factors
    cond1_IP          H  8345160    1.2656111
    cond1_IP.1        H  9395446    1.0772223
    cond1_Input       H  1126656    0.4725259
    cond1_Input.1     H   219823    1.9271892
    cond2_IP          S  5707895    0.9386643
    cond2_IP.1        S  5914904    0.4994138
    cond2_IP.2        S  5602070    0.5041749
    cond2_Input       S   223442    1.8415393
    cond2_Input.1     S   226840    1.8505947

The "real" library sizes (number of reads that were successfully aligned in each sample) are:

    cond1_IP      24055908
    cond1_IP      16654296
    cond1_Input   12919153
    cond1_Input   33778948
    cond2_IP      17340233
    cond2_IP      29284664
    cond2_IP      27788144
    cond2_Input   33477921
    cond2_Input   33980303

As you can see, DESeq and edgeR are up-weighting the Input samples and down-weighting the IP samples. I suppose this is because far fewer Input reads than IP reads fall in peak regions, which leads DESeq and edgeR to treat the Input library size as much lower than the IP library size. In fact, the original library size of the Input samples is in most cases larger than that of the IP samples.

What do you think, should I use the original library sizes as normalization factors instead of the calculated ones? I know this is possible with DESeq, but I couldn't find how to do it with edgeR.

Thanks,
Mali
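The peak-region effect described above can be checked with simple arithmetic on the numbers in the post. The following is a minimal base-R sketch, assuming the "real" library sizes are listed in the same sample order as the rows of dTMM$samples (the post does not state this explicitly). The variable names in_peaks, aligned and frac_in_peaks are illustrative only.

```r
# Reads counted in peak regions: the lib.size column of dTMM$samples in the post
in_peaks <- c(cond1_IP = 8345160, cond1_IP.1 = 9395446,
              cond1_Input = 1126656, cond1_Input.1 = 219823,
              cond2_IP = 5707895, cond2_IP.1 = 5914904, cond2_IP.2 = 5602070,
              cond2_Input = 223442, cond2_Input.1 = 226840)

# Total aligned reads per sample ("real" library sizes from the post).
# ASSUMPTION: these are listed in the same order as the samples above.
aligned <- c(24055908, 16654296, 12919153, 33778948,
             17340233, 29284664, 27788144, 33477921, 33980303)
names(aligned) <- names(in_peaks)

# Fraction of each sample's aligned reads that fall inside peak regions
frac_in_peaks <- in_peaks / aligned
is_input <- grepl("Input", names(in_peaks))

# Per-sample percentages, then the median fraction for IP and Input samples
print(round(100 * frac_in_peaks, 2))
print(tapply(frac_in_peaks, ifelse(is_input, "Input", "IP"), median))
```

Under that ordering assumption, every IP sample has a larger fraction of its reads in peaks than any Input sample, which is consistent with the explanation given in the post.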
modified 13 months ago by gaxusoh0 • written 6.4 years ago by mali salmon320
6.4 years ago by
mali salmon320
Israel
mali salmon320 wrote:
OK, if updating DGEList$samples$lib.size is the way to use the original library sizes, then I know how to do it, but I'm still not sure whether this is the right way to handle this kind of IP-Input normalization.

Mali

On Sun, Jan 8, 2012 at 5:43 PM, mali salmon <shalmom1@gmail.com> wrote:
> What do you think, shall I use the original library sizes as normalization
> factors instead of the calculated ones? I know this is possible with DESeq,
> but I couldn't find how to do it with edgeR.
Hi Mali,

> OK, if updating DGEList$samples$lib.size is the way of using original
> library sizes, then I know how to do it, but still I'm not sure if this is
> the right way to go with this kind of IP-Input normalization

Yes, you can manually modify the lib.size and norm.factors elements. The product of these is used as the "effective" library size (i.e. similar to DESeq's sizeFactors).

I'd be inclined to look at M-vs-A / "smear" plots (plotSmear(), maPlot() or similar) to get a feel for what the normalization factors are actually doing. Have you done this?

>> As you can see, DESeq and edgeR are weighting-up Input samples and
>> weighting-down IP. I suppose this is due to the fact that far fewer Input
>> reads are found in peak regions compared to IP which makes DESeq and edgeR
>> think that the Input library size is much lower than IP.

My interpretation of this is that the Input-seq populations are more diverse, so you are sequencing them to a lower depth (on average, relative to total).

>> In fact, the original library size of Input samples is in most cases
>> larger than the IP.

How was the peak detection done? That may have an influence too.

Anyway, I don't think you can decide on the "right way" without a serious look at the data.

Regards,
Mark

----------
Prof. Dr. Mark Robinson
Bioinformatics
Institute of Molecular Life Sciences
University of Zurich
Winterthurerstrasse 190
8057 Zurich
Switzerland
v: +41 44 635 4848
f: +41 44 635 6898
e: mark.robinson at imls.uzh.ch
o: Y32-J-34
w: http://tiny.cc/mrobin
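The suggestion in the reply (override lib.size and norm.factors, whose product is the effective library size) might look roughly like the sketch below. It assumes the edgeR package is installed. The count matrix and the aligned-read totals are small made-up values for illustration, not the data from this thread, and whether this normalization is appropriate for a given IP/Input design still has to be judged from the data.

```r
library(edgeR)

# Toy peak-by-sample count matrix (hypothetical values, illustration only)
counts <- matrix(
  c(120,  95, 10,  8,
     60,  70,  5,  6,
    200, 180, 12, 15,
     30,  45,  3,  2,
     80,  75,  9,  7),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("peak", 1:5),
                  c("IP_1", "IP_2", "Input_1", "Input_2"))
)
group <- factor(c("IP", "IP", "Input", "Input"))

# Hypothetical total aligned reads per sample (the "real" library sizes)
aligned <- c(IP_1 = 2.4e7, IP_2 = 1.7e7, Input_1 = 1.3e7, Input_2 = 3.4e7)

# By default DGEList() sets lib.size to the column sums of the count matrix
y <- DGEList(counts = counts, group = group)

# Override with the total aligned reads and neutral normalization factors
y$samples$lib.size <- unname(aligned[colnames(counts)])
y$samples$norm.factors <- rep(1, ncol(counts))

# Effective library size is the product of the two columns
eff_lib <- y$samples$lib.size * y$samples$norm.factors

# log2 counts-per-million computed from the effective library sizes,
# then per-peak M (IP minus Input) and A (overall average) values
logcpm <- cpm(y, log = TRUE, prior.count = 2)
M <- rowMeans(logcpm[, group == "IP"]) - rowMeans(logcpm[, group == "Input"])
A <- rowMeans(logcpm)
print(cbind(A = A, M = M))
```

Plotting M against A (or calling plotSmear() on the object) before and after changing the library sizes gives a view of what each choice of normalization does to the IP-versus-Input comparison.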