Tissue heterogeneity and TMM normalization


Ni Feng ▴ 30

@ni-feng-6726

Last seen 9.6 years ago

Dear all,

I have a general question about whether TMM normalization is appropriate for my data. I apologize for this long-winded email. I am not a trained bioinformatician and have therefore been struggling with some data analysis.

A colleague and I did an RNA-seq experiment with 6 samples (each had RNA pooled from 6 individuals) and no biological replicates. The 6 samples included 2 tissue types collected at 3 different time points. I know that this is not an ideal experimental set-up; we did this experiment 3 years ago.

We used the Trinity package for most of the transcriptome assembly and downstream analyses, including its use of edgeR for differential expression. Naively, I went on with all downstream analyses without verifying whether my data violated the underlying assumptions of TMM normalization.

For example, we found that ~30% of our transcripts showed differential expression in any pairwise comparison. Does this violate the TMM assumption that most genes are NOT differentially expressed?

Furthermore, we noticed that there is still a tissue bias after normalization. Attached is a scatterplot of TMM-normalized values for each tissue (summed across the 3 sample groups for each tissue). Plotted in black on top of all transcripts is the expression of CEGs (Core Eukaryotic Genes), which we believe should be good candidates for "housekeeping" genes. Both the CEGs and all genes show that at higher expression levels there is a skew towards one tissue ("VMN"), whereas in the middle values there is a skew towards the other tissue ("H").

I have also attached a density plot of the M values and an MA plot to visualize the skew. These plots were generated from one pair of tissue comparisons ("SMH" vs "SMV").

These plots reflect the fact that one tissue is more heterogeneous than the other. Although TMM normalization is designed to deal with this problem, our data seem to need further normalization. Our within-tissue comparisons are great and do not show this kind of skew.

My questions are:

1) Do our data violate the TMM normalization assumptions?
2) Do you have another normalization method to suggest for our data?
3) Should we just forget about tissue comparisons?

I have also played around with the suggestions in the edgeR user guide about estimating a dispersion value, and can discuss this further.

Thank you for your time and patience; any advice is much appreciated.

--
Ni (Jenny) Ye Feng
Ph.D. Candidate
Bass Laboratory
Cornell University
Dept of Neurobiology and Behavior
Ithaca, NY 14853

Attachments (scrubbed by the list archive):
CEG_FPKM_over_all_090814.png (image/png, 86336 bytes): https://stat.ethz.ch/pipermail/bioconductor/attachments/20140908/89d8411d/attachment.png
SMV_SMH_density_log2(M).pdf (application/pdf, 4716 bytes): https://stat.ethz.ch/pipermail/bioconductor/attachments/20140908/89d8411d/attachment.pdf
SMH_SMV_MA_plot_0903.png (image/png, 51246 bytes): https://stat.ethz.ch/pipermail/bioconductor/attachments/20140908/89d8411d/attachment-0001.png

Normalization edgeR • 2.1k views

ADD COMMENT • link updated 9.6 years ago by Wolfgang Huber ★ 13k • written 9.6 years ago by Ni Feng ▴ 30


Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

Hello,

I think it is clear from your charts that normalization is a concern. Obviously you want to see an MA plot centered at zero. With your data, as you have noticed, there appears to be a dependency between M and A. There is nothing that TMM or any other scaling normalization method can do to eliminate this dependency, since a scaling normalization means that there is a single normalization factor for all genes in a sample.

You might want to investigate the more complex normalization procedures offered in the cqn or EDASeq packages. These normalizations are variations on quantile normalization, which can remove the trend between M and A. However, it is up to you to decide whether this trend reflects a technical artifact that should be removed or a real biological phenomenon that should be preserved. You can test this by verifying that the CEGs end up on the zero line of the MA plot after normalization.

Lastly, note that having 30% of genes differentially expressed does not violate the assumptions of TMM. With the default options, TMM trims the top and bottom 30% of ratios, so these differentially expressed genes will not disrupt the computation of the normalization factor. The assumption being violated is that of a direct linear relationship between RNA abundance and read count for all genes within a sample. This is the assumption behind all scaling normalizations.

-Ryan

On Mon 08 Sep 2014 09:15:38 AM PDT, Ni Feng wrote:
> [snip]
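The point that a single scaling factor cannot remove a dependency between M and A can be illustrated with a small sketch. This is not edgeR's TMM implementation (which, among other things, also trims on A values and uses precision weights); the function names and toy counts below are assumptions made up for illustration, and only the Python standard library is used. A per-sample factor shifts every M value by the same constant, so the gap in M between low-A and high-A genes is unchanged after normalization.

```python
import math
from statistics import mean, median

def ma_values(x, y):
    """Return (M, A) per gene for two libraries of positive counts."""
    nx, ny = sum(x), sum(y)
    m = [math.log2((a / nx) / (b / ny)) for a, b in zip(x, y)]
    a_vals = [0.5 * math.log2((a / nx) * (b / ny)) for a, b in zip(x, y)]
    return m, a_vals

def trimmed_mean(values, trim=0.3):
    """Unweighted mean after dropping the lowest and highest `trim` fraction."""
    s = sorted(values)
    k = int(len(s) * trim)
    return mean(s[k:len(s) - k])

def trend_gap(m, a_vals):
    """Median M of the high-A half of genes minus median M of the low-A half."""
    order = sorted(range(len(m)), key=lambda i: a_vals[i])
    half = len(order) // 2
    low = median([m[i] for i in order[:half]])
    high = median([m[i] for i in order[half:]])
    return high - low

# Toy counts (made up): the five most highly expressed genes have
# doubled counts in library x compared with library y.
y = [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120]
x = [10, 20, 40, 80, 160, 640, 1280, 2560, 5120, 10240]

m, a_vals = ma_values(x, y)
gap_before = trend_gap(m, a_vals)

shift = trimmed_mean(m)            # one log2 scaling factor for the pair
m_norm = [v - shift for v in m]    # scaling = subtracting a constant from M
gap_after = trend_gap(m_norm, a_vals)

print(round(gap_before, 6), round(gap_after, 6))
```

In this toy case the normalized M values are centered overall, but the low-A genes sit below zero and the high-A genes above it, which is the kind of pattern that only an intensity-dependent (e.g. quantile-type) normalization could change.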

ADD COMMENT • link 9.6 years ago Ryan C. Thompson ★ 7.9k


Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 18 days ago

EMBL European Molecular Biology Laborat…

Hi Ni,

"Most genes are not differentially expressed" is a sufficient assumption that one can use to prove that the estimated normalisation factor is close to the true one, under some model. It is not a necessary assumption; TMM or similar normalisations can still be useful beyond it (e.g. if many genes are d.e. but up and down are about balanced).

Did you try computing the normalisation parameters from the CEG genes only and then applying them to all data?

An interesting idea was put forward by J. Li, D. M. Witten, I. M. Johnstone and R. Tibshirani: Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics, 13:523 (2012), www.biostat.washington.edu/~dwitten/Papers/LiWittenJohnstoneTibs.pdf
They determine the normalisation factor so as to minimize the amount of differential expression. (This is one instance of this idea I am aware of; it's been put out for microarrays before, apologies to anyone else who proposed this.)

Also, if I understood your plots correctly, the biases are relatively small in amplitude. So you could leave them there, but apply a banded hypothesis test (i.e. H0: |beta| < theta rather than H0: beta = 0, where beta is the log fold change and theta a positive number). This is, e.g., described in the DESeq2 vignette.

Best wishes
Wolfgang

On 08 Sep 2014, at 18:15, Ni Feng wrote:
> [snip]
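To make the banded hypothesis test concrete, here is a minimal sketch of a Wald-style p-value for a null hypothesis of the form |beta| <= theta, assuming the estimated log2 fold change is approximately normal with a known standard error. It is not taken from DESeq2's source; the function name and the example numbers are hypothetical.

```python
import math

def banded_wald_pvalue(lfc, se, theta):
    """Two-sided Wald-style p-value for H0: |beta| <= theta.

    lfc   : estimated log2 fold change (beta hat)
    se    : standard error of the estimate (must be positive)
    theta : non-negative log2 fold-change threshold
    """
    if se <= 0 or theta < 0:
        raise ValueError("se must be positive and theta non-negative")
    # Distance of the estimate from the edge of the band, in SE units.
    z = (abs(lfc) - theta) / se
    # Upper tail of the standard normal: P(Z > z).
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    return min(1.0, 2.0 * upper_tail)

# Same estimate tested against "no change" and against a two-fold band.
p_point = banded_wald_pvalue(3.0, 0.5, 0.0)   # H0: beta = 0
p_banded = banded_wald_pvalue(3.0, 0.5, 1.0)  # H0: |beta| <= 1
print(p_point < p_banded)
```

With theta = 0 this reduces to the usual two-sided Wald test. A positive theta makes the test more conservative, so a small systematic bias in the fold changes is less likely to produce a call on its own.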

ADD COMMENT • link 9.6 years ago Wolfgang Huber ★ 13k


Thank you Wolfgang!

We are using a fold change > 4 and an FDR-corrected p-value < 0.001 as thresholds for calling differential expression. Do you think this is stringent enough given our skew?

It was hard for me to gauge just how bad the skew is, and that was another thing I wanted to get an opinion on.

Yesterday I took out lowly expressed transcripts (< 0.1 FPKM in any sample), which gave me a small dispersion value akin to what Trinity uses as its default (0.1), but using the normalization factors from this dataset did not improve the skew. Given what Ryan Thompson said earlier, I guess this makes sense.

I had only used the CEGs to calculate the dispersion, but will try to get the normalization factors from them and see how well it works. Thanks for the suggestion! If this doesn't work, I'll try the quantile normalization.

Thanks again for your help!
Jenny

---------- Forwarded message ----------
From: Wolfgang Huber
Date: Tue, Sep 9, 2014 at 3:58 AM
Subject: Re: [BioC] Tissue heterogeneity and TMM normalization
> [snip]
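The idea of deriving a normalization factor from the CEGs alone and applying it to every transcript can be sketched as follows. This uses a plain median of count ratios over the control genes as a simpler stand-in for TMM, so it is not what edgeR computes; the function names, the counts, and the choice of control indices are all hypothetical.

```python
from statistics import median

def control_scaling_factor(sample, reference, control_idx):
    """Median of sample/reference count ratios over the control genes.

    Control genes with a zero count in either library are skipped.
    """
    ratios = [sample[i] / reference[i] for i in control_idx
              if sample[i] > 0 and reference[i] > 0]
    if not ratios:
        raise ValueError("no usable control genes")
    return median(ratios)

def apply_scaling(sample, factor):
    """Divide every count (controls and non-controls) by the same factor."""
    return [count / factor for count in sample]

# Toy counts (made up); genes 0-2 play the role of CEGs.
reference = [100, 200, 50, 400, 80]
sample = [150, 300, 75, 1600, 40]
control_idx = [0, 1, 2]

factor = control_scaling_factor(sample, reference, control_idx)
scaled = apply_scaling(sample, factor)
print(factor, [round(v, 2) for v in scaled])
```

Because this is still one factor per sample, it can recenter the control genes but cannot remove a trend in M that depends on expression level.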

ADD REPLY • link 9.6 years ago Ni Feng ▴ 30


Hi Jenny,

you may also want to have a look at our new RUVSeq package. In particular, you can use the RUVg function to estimate factors of "unwanted variation" (UV) using the CEG genes as "negative controls."

This is not equivalent to estimating the TMM normalization factors on a subset of genes (which doesn't work too well in our experience), because our UV factors are included in the model with some parameters (coefficients) that are then re-estimated for all the genes. Have a look at the vignette of the RUVSeq package for details and let me know if you have questions.

Best,
davide

On Tue, Sep 9, 2014 at 7:30 AM, Ni Feng wrote:
> [snip]

--
Davide Risso, PhD
Post Doctoral Scholar
Department of Statistics
University of California, Berkeley
344 Li Ka Shing Center, #3370
Berkeley, CA 94720-3370
E-mail: davide.risso at berkeley.edu
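The difference between a factor-based approach and a per-sample scaling factor can be sketched in a few lines. The code below loosely follows the general idea described above: estimate one factor of unwanted variation from negative control genes, then fit a coefficient for that factor for every gene. It is not the RUVSeq implementation. It works on log expression values, estimates a single factor by power iteration, and subtracts the fitted term, whereas a real analysis would include the factor as a covariate in the count model. All names and numbers are hypothetical.

```python
import math

def top_left_singular_vector(rows, iters=200):
    """Leading left singular vector of a matrix given as a list of rows,
    found by power iteration on rows * rows^T."""
    n = len(rows)
    gram = [[sum(a * b for a, b in zip(rows[i], rows[j])) for j in range(n)]
            for i in range(n)]
    v = [1.0 / (i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iters):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("control genes show no variation")
        v = [x / norm for x in w]
    return v

def unwanted_factor(log_expr, control_idx):
    """Estimate one factor (one value per sample) from control genes.
    log_expr[s][g] is the log expression of gene g in sample s."""
    n = len(log_expr)
    means = {g: sum(log_expr[s][g] for s in range(n)) / n for g in control_idx}
    centered = [[log_expr[s][g] - means[g] for g in control_idx]
                for s in range(n)]
    return top_left_singular_vector(centered)

def remove_factor(log_expr, w):
    """Fit a coefficient on w for every gene and subtract the fitted term."""
    n, n_genes = len(log_expr), len(log_expr[0])
    ww = sum(x * x for x in w)
    out = [row[:] for row in log_expr]
    for g in range(n_genes):
        alpha = sum(w[s] * log_expr[s][g] for s in range(n)) / ww
        for s in range(n):
            out[s][g] -= alpha * w[s]
    return out

# Toy data (made up): 4 samples, genes 0-2 are controls, gene 3 has a
# true difference of +2 in samples 2 and 3. Each gene responds to the
# nuisance factor u with its own loading.
u = [0.5, -0.5, 0.5, -0.5]
base = [5.0, 7.0, 3.0, 6.0]
loading = [1.0, 2.0, 0.5, 1.0]
log_expr = [[base[g] + loading[g] * u[s] + (2.0 if g == 3 and s >= 2 else 0.0)
             for g in range(4)] for s in range(4)]

w = unwanted_factor(log_expr, control_idx=[0, 1, 2])
corrected = remove_factor(log_expr, w)
```

Each gene gets its own coefficient on the estimated factor, which is what lets this kind of approach absorb gene-specific nuisance effects that a single per-sample factor cannot. In this toy example the true difference is uncorrelated with the nuisance factor, so it survives the correction; that will not hold in general.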

ADD REPLY • link 9.6 years ago davide risso ▴ 950


Thank you Davide. I'll definitely give it a try and let you know if I run into any questions.

In addition, as a follow-up to Wolfgang Huber's suggestion, I've attached a graph showing the tissue comparisons after normalizing with CEG-derived normalization factors. I will try the other normalization methods people have suggested until I feel confident about the skew.

Best,
Jenny

On Tue, Sep 9, 2014 at 12:48 PM, davide risso wrote:
> [snip]

Attachment (scrubbed by the list archive):
CEG_normalized_allseqs_090914.png (image/png, 93911 bytes): https://stat.ethz.ch/pipermail/bioconductor/attachments/20140909/94c14246/attachment.png

ADD REPLY • link 9.6 years ago Ni Feng ▴ 30
