limma for homemade microarray - question on NAs and multiple probes for one gene

0

Entering edit mode

Peter Langfelder ★ 3.0k

@peter-langfelder-4469

Last seen 28 days ago

United States

[CCing the list as well] On Mon, Jul 8, 2013 at 6:20 PM, zhengyu jiang <zhyjiang2006 at="" gmail.com=""> wrote: > Hi Peter, > Thank you very much for your reply. That's a very good point. > I am just layman to limma and microarray analysis & try to learn. You are > very helpful. > Can I ask you two more questions? > > My data has a huge dynamic range - raw intensity from about 500 to 93760. I > decided the cutoff for background just based on literature - 99%. Is there > any better way to decide the outliers? Is there any better way to do the > normalization other than quantile for big dynamic range? > I am not an expert on microarray pre-processing; quantile normalization is what I usually do but I do check whether it's really necessary. As for identifying outlier samples, we typically use the sample network approach of Oldham MC, Langfelder P, Horvath S (2012) Network methods for describing sample relationships in genomic datasets: application to Huntington's disease. BMC Syst Biol. 2012 Jun 12;6(1):63. PMID: 22691535 46(11) 1-17 http://www.biomedcentral.com/1752-0509/6/63/abstract (sorry for constantly advertising work I was involved in :)). This web page: http://labs.genetics.ucla.edu/horvath/htdocs/CoexpressionNetwork/Sampl eNetwork/ contains the necessary code and a tutorial with sample data. Peter > Best, > Zhengyu > > > > On Mon, Jul 8, 2013 at 9:07 PM, Peter Langfelder > <peter.langfelder at="" gmail.com=""> wrote: >> >> Hi Zhengyu, >> >> for turning probe-level data into gene-level data, you may want to >> look at the function collapseRows in the WGCNA package (on CRAN). The >> relevant citation is >> >> Miller JA, Cai C, Langfelder P, Geschwind DH, Kurian SM, Salomon DR, >> Horvath S (2011) Strategies for aggregating gene expression data: The >> collapseRows R function. BMC Bioinformatics12:322. PMID: 21816037, >> PMCID: PMC3166942 http://www.biomedcentral.com/1471-2105/12/322 >> >> As for dealing with the missing data, well, I personally would try to >> go back and restore the original values unless they are outliers. Even >> knowing that something is below 5 is better than a missing datum. To >> remove noise, I would possibly remove probes whose expression >> consistently remains below 5 but not individual expression values. For >> example, if a probe is consistently above 7 in cancer samples and >> below 5 (unexpressed) in normals, you want to identify it - but you >> will completely miss it if you turn all values below 5 into NA. >> >> Hope this helps, >> >> Peter >> >> On Mon, Jul 8, 2013 at 5:40 PM, zhengyu jiang <zhyjiang2006 at="" gmail.com=""> >> wrote: >> > Dear Bioconductor experts, >> > >> > We have data from a homemade one-channel microarray that I tried to >> > apply >> > limma for differential expression analysis between matched paired Normal >> > (N) and Tumor (Tumor) samples - 8 biological replicates (one tech >> > replicate >> > has been averaged after normalization). All samples are formatted in one >> > matrix (M). >> > >> > Signals have been quantile normalized between each paired normal and >> > tumor. >> > Signal values below 5 (log scale) have been replaced by "NA" since they >> > are >> > potentially noises. So there are many NAs in M. >> > >> > I followed the user manual and made the codes below. >> > >> > I think the code is correct? My questions are (1) how to deal with NAs - >> > as >> > I did a search but no clear idea (2) how do people do the statistics at >> > the >> > gene level for one gene having multiple probes - averaging or taking >> > median? >> > >> > Thanks, >> > Zhengyu >> > >> > >> > > head(M) >> > N1 N2 N3 N4 N5 N6 N7 >> > N8 T1 T2 T3 >> > 2 8.622724 7.423568 NA NA 7.487174 NA 8.516293 >> > NA >> > 7.876259 7.856707 NA >> > T4 T5 T6 T7 T8 >> > 2 NA 7.720018 NA 7.752550 NA >> > >> >> eset<-as.matrix(M) >> >> Pair=factor(targets$Pair) >> >> Treat=factor(targets$Treatment,levels=c("N","T")) # compared >> >> matched >> > normal to tumors >> >> design<-model.matrix(~Pair+Treat) >> >> targets >> > FilenName Pair Treatment >> > 1 N1 1 N >> > 2 N2 2 N >> > 3 N3 3 N >> > 4 N4 4 N >> > 5 N5 5 N >> > 6 N6 6 N >> > 7 N7 7 N >> > 8 N8 8 N >> > 9 T1 1 T >> > 10 T2 2 T >> > 11 T3 3 T >> > 12 T4 4 T >> > 13 T5 5 T >> > 14 T6 6 T >> > 15 T7 7 T >> > 16 T8 8 T >> > fit_pair<-lmFit(eset,design) >> > fit_pair<-eBayes(fit_pair) >> > >> > R=topTable(fit_pair, coef="TreatT", adjust="BH",number=30) # display >> > top 30 >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> > http://news.gmane.org/gmane.science.biology.informatics.conductor > >

Microarray Normalization GO Network Cancer probe limma Microarray Normalization GO probe • 1.2k views

ADD COMMENT • link updated 10.8 years ago by Gordon Smyth 50k • written 10.8 years ago by Peter Langfelder ★ 3.0k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 43 minutes ago

WEHI, Melbourne, Australia

Dear Zhengyu, Please keep questions on the list. Results show that pair 6 is different from the others, and that there isn't a consistent treatment effect. I suggest that you explore your data data more: plotMDS(eset) The SvsA plot shows that the log-expression values are less variable at both low and high intensities. This has presumably arisen from the microarray platform and from the way that you have pre-processed the data, which you don't tell us much about. In any case, limma has taken the trend into account. Since this a single-channel platform, your comment "take log ratios" is a bit mysterious. Do you just mean log-intensity? Best wishes Gordon On Fri, 12 Jul 2013, zhengyu jiang wrote: > Hi Gordon, > Thank you very much for your reply. Sorry for the delay. > > Following your instructions, I started with original matrix (M) of raw > intensities & made the codes below. > If everything is correct, my questions are (1) does the figure (below) log > residual standard deviation vs. log expression make sense? - how should I > decide if it is normal? > (2) does the summary(decideTest) tell me that only pair 6 has great changes > in expression? > > Thanks, > > eset<-as.matrix(log(M)) #### take log ratios > eset <- normalizeBetweenArrays(eset, method="quantile") ## quantile > normalization for all arrays > design<-model.matrix(~Pair+Treat) > targets<-readTargets("targets.txt",row.names=1) ## or row name =1 > Pair<-factor(targets$Pair) > Treat<-factor(targets$Treatment,levels=c("N","T")) > fit_pair<-lmFit(eset,design) > plotSA(fit_pair) > fit= eBayes(fit_pair, trend=TRUE) >> summary(decideTests(fit)) > (Intercept) Pair2 Pair3 Pair4 Pair5 Pair6 Pair7 Pair8 TreatT > -1 0 0 0 0 0 2770 0 0 0 > 0 0 257279 257279 257278 257275 254372 257275 257273 257279 > 1 257279 0 0 0 0 102 0 2 0 >> R=topTable(fit, coef="TreatT", adjust="BH",number=30 > + ) >> head(R) > logFC AveExpr t P.Value adj.P.Val B > 207370 0.6232462 6.473489 5.990300 6.818613e-05 0.8997989 -3.715276 > 45481 0.5559283 6.477330 4.960299 3.494802e-04 0.8997989 -3.822746 > 178158 0.4404808 6.148200 4.872321 4.046088e-04 0.8997989 -3.833704 > 172530 -0.3670701 5.793056 -4.800323 4.564952e-04 0.8997989 -3.842908 > 165527 0.4721218 6.324581 4.740798 5.046498e-04 0.8997989 -3.850680 > 80386 -0.5450653 6.551971 -4.723036 5.200275e-04 0.8997989 -3.853028 > > > > > [image: Inline image 1] > > > On Tue, Jul 9, 2013 at 7:34 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Zhengyu, >> >> Date: Mon, 8 Jul 2013 20:40:14 -0400 >>> From: zhengyu jiang <zhyjiang2006 at="" gmail.com=""> >>> To: bioconductor at r-project.org >>> Subject: [BioC] limma for homemade microarray - question on NAs and >>> multiple probes for one gene >>> >>> Dear Bioconductor experts, >>> >>> We have data from a homemade one-channel microarray that I tried to apply >>> limma for differential expression analysis between matched paired Normal >>> (N) and Tumor (Tumor) samples - 8 biological replicates (one tech >>> replicate >>> has been averaged after normalization). All samples are formatted in one >>> matrix (M). >>> >> >> This is a standard problem, well covered in the limma User's Guide. >> >> Signals have been quantile normalized between each paired normal and >>> tumor. >>> >> >> I don't know what you mean by this. I recommend ordinary quantile >> normalization of all the arrays together, without regard to which sample is >> paired with which. >> >> Signal values below 5 (log scale) have been replaced by "NA" since they >>> are >>> potentially noises. So there are many NAs in M. >>> >> >> Please don't do this. limma can deal with low intensities perfectly weel, >> and can figure out whether they are noise or not. You are simply removing >> valid data. >> >> If you are concerned about high variability of low intensity probes, then >> look at >> >> plotSA(fit_pair) >> >> to examine the mean-variance trend, and use >> >> eBayes(fit_pair, trend=TRUE) >> >> if there is a strong trend. >> >> I followed the user manual and made the codes below. >>> >>> I think the code is correct? >>> >> >> Looks ok. >> >> My questions are (1) how to deal with NAs - as I did a search but no >>> clear idea >>> >> >> Don't create them in the first place. This has been said many times >> before! >> >> (2) how do people do the statistics at the gene level for one gene having >>> multiple probes - averaging or taking median? >>> >> >> Usually we do not find it necessary to aggregate the multiple probes. The >> multiple probes might represent different transcripts or variants, and some >> of the probes might not work. Why do you need to take any special action? >> >> However, if you feel that you must, the best method to aggregate the >> multiple probes is probably to retain the probe for each gene with highest >> fit_pair$Amean value. We have used this strategy in a number of published >> papers. The rationale for this is to choose the probe corresponding to the >> most highly expressed transcript for the gene. Alternatively, you could >> average the probes using the avereps() function in limma. >> >> Best wishes >> Gordon >> >> Thanks, >>> Zhengyu >>> >>> >>>> head(M) >>> N1 N2 N3 N4 N5 N6 N7 >>> N8 T1 T2 T3 >>> 2 8.622724 7.423568 NA NA 7.487174 NA 8.516293 NA >>> 7.876259 7.856707 NA >>> T4 T5 T6 T7 T8 >>> 2 NA 7.720018 NA 7.752550 NA >>> >>> eset<-as.matrix(M) >>>> Pair=factor(targets$Pair) >>>> Treat=factor(targets$**Treatment,levels=c("N","T")) # compared >>>> matched >>>> >>> normal to tumors >>> >>>> design<-model.matrix(~Pair+**Treat) >>>> targets >>>> >>> FilenName Pair Treatment >>> 1 N1 1 N >>> 2 N2 2 N >>> 3 N3 3 N >>> 4 N4 4 N >>> 5 N5 5 N >>> 6 N6 6 N >>> 7 N7 7 N >>> 8 N8 8 N >>> 9 T1 1 T >>> 10 T2 2 T >>> 11 T3 3 T >>> 12 T4 4 T >>> 13 T5 5 T >>> 14 T6 6 T >>> 15 T7 7 T >>> 16 T8 8 T >>> fit_pair<-lmFit(eset,design) >>> fit_pair<-eBayes(fit_pair) >>> >>> R=topTable(fit_pair, coef="TreatT", adjust="BH",number=30) # display top >>> 30 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 10.8 years ago Gordon Smyth 50k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 43 minutes ago

WEHI, Melbourne, Australia

Dear Zhengyu, On Tue, 16 Jul 2013, zhengyu jiang wrote: > Hi Gordon, > Thanks for your reply. What do you think of the MDS plot below? The samples don't cluster on the MDS plot by either treatment or pair, so it is not surprising that there are no statistically significant results. > For preprocessing, the microarray was designed as SNP array. We hybridized > cDNA samples on the SNP array. For each marker, there are two channel > reads. But for expression analysis, I combined both channel raw intensities > to give one value for each marker - so it is considered as one- channel > microarray. > > I took log values for each value - so yes it is log-intensities rather than > log ratios. sorry for the confusing. > > I am actually not familiar with limma. You mentioned that limma deals with > noises pretty well so I did not remove any values. > > Could you please comment more on applying limma to analyzing the > differential expression in my case? I have no experience applying limma to a SNP array. I have no feel for what is sensible to do in this context so I can't give you any more help. Best wishes Gordon > Thanks, > Zhengyu > > > [image: Inline image 1] > > > On Fri, Jul 12, 2013 at 7:20 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Zhengyu, >> >> Please keep questions on the list. >> >> Results show that pair 6 is different from the others, and that there >> isn't a consistent treatment effect. I suggest that you explore your data >> data more: >> >> plotMDS(eset) >> >> The SvsA plot shows that the log-expression values are less variable at >> both low and high intensities. This has presumably arisen from the >> microarray platform and from the way that you have pre-processed the data, >> which you don't tell us much about. In any case, limma has taken the trend >> into account. >> >> Since this a single-channel platform, your comment "take log ratios" is a >> bit mysterious. Do you just mean log-intensity? >> >> Best wishes >> Gordon >> >> >> On Fri, 12 Jul 2013, zhengyu jiang wrote: >> >> Hi Gordon, >>> Thank you very much for your reply. Sorry for the delay. >>> >>> Following your instructions, I started with original matrix (M) of raw >>> intensities & made the codes below. >>> If everything is correct, my questions are (1) does the figure (below) log >>> residual standard deviation vs. log expression make sense? - how should I >>> decide if it is normal? >>> (2) does the summary(decideTest) tell me that only pair 6 has great >>> changes >>> in expression? >>> >>> Thanks, >>> >>> eset<-as.matrix(log(M)) #### take log ratios >>> eset <- normalizeBetweenArrays(eset, method="quantile") ## quantile >>> normalization for all arrays >>> design<-model.matrix(~Pair+**Treat) >>> targets<-readTargets("targets.**txt",row.names=1) ## or row name =1 >>> Pair<-factor(targets$Pair) >>> Treat<-factor(targets$**Treatment,levels=c("N","T")) >>> fit_pair<-lmFit(eset,design) >>> plotSA(fit_pair) >>> fit= eBayes(fit_pair, trend=TRUE) >>> >>>> summary(decideTests(fit)) >>>> >>> (Intercept) Pair2 Pair3 Pair4 Pair5 Pair6 Pair7 Pair8 TreatT >>> -1 0 0 0 0 0 2770 0 0 0 >>> 0 0 257279 257279 257278 257275 254372 257275 257273 257279 >>> 1 257279 0 0 0 0 102 0 2 0 >>> >>>> R=topTable(fit, coef="TreatT", adjust="BH",number=30 >>>> >>> + ) >>> >>>> head(R) >>>> >>> logFC AveExpr t P.Value adj.P.Val B >>> 207370 0.6232462 6.473489 5.990300 6.818613e-05 0.8997989 -3.715276 >>> 45481 0.5559283 6.477330 4.960299 3.494802e-04 0.8997989 -3.822746 >>> 178158 0.4404808 6.148200 4.872321 4.046088e-04 0.8997989 -3.833704 >>> 172530 -0.3670701 5.793056 -4.800323 4.564952e-04 0.8997989 -3.842908 >>> 165527 0.4721218 6.324581 4.740798 5.046498e-04 0.8997989 -3.850680 >>> 80386 -0.5450653 6.551971 -4.723036 5.200275e-04 0.8997989 -3.853028 >>> >>> >>> >>> >>> [image: Inline image 1] >>> >>> >>> On Tue, Jul 9, 2013 at 7:34 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: >>> >>> Dear Zhengyu, >>>> >>>> Date: Mon, 8 Jul 2013 20:40:14 -0400 >>>> >>>>> From: zhengyu jiang <zhyjiang2006 at="" gmail.com=""> >>>>> To: bioconductor at r-project.org >>>>> Subject: [BioC] limma for homemade microarray - question on NAs and >>>>> multiple probes for one gene >>>>> >>>>> Dear Bioconductor experts, >>>>> >>>>> We have data from a homemade one-channel microarray that I tried to >>>>> apply limma for differential expression analysis between matched >>>>> paired Normal (N) and Tumor (Tumor) samples - 8 biological >>>>> replicates (one tech replicate has been averaged after >>>>> normalization). All samples are formatted in one matrix (M). >>>>> >>>>> >>>> This is a standard problem, well covered in the limma User's Guide. >>>> >>>>> Signals have been quantile normalized between each paired normal and >>>>> tumor. >>>>> >>>>> >>>> I don't know what you mean by this. I recommend ordinary quantile >>>> normalization of all the arrays together, without regard to which >>>> sample is paired with which. >>>>> >>>>> Signal values below 5 (log scale) have been replaced by "NA" since >>>>> they are potentially noises. So there are many NAs in M. >>>>> >>>>> >>>> Please don't do this. limma can deal with low intensities perfectly >>>> weel, >>>> and can figure out whether they are noise or not. You are simply >>>> removing >>>> valid data. >>>> >>>> If you are concerned about high variability of low intensity probes, then >>>> look at >>>> >>>> plotSA(fit_pair) >>>> >>>> to examine the mean-variance trend, and use >>>> >>>> eBayes(fit_pair, trend=TRUE) >>>> >>>> if there is a strong trend. >>>> >>>> I followed the user manual and made the codes below. >>>> >>>>> >>>>> I think the code is correct? >>>>> >>>>> >>>> Looks ok. >>>> >>>> My questions are (1) how to deal with NAs - as I did a search but no >>>> >>>>> clear idea >>>>> >>>>> >>>> Don't create them in the first place. This has been said many times >>>> before! >>>> >>>> (2) how do people do the statistics at the gene level for one gene >>>> having >>>> >>>>> multiple probes - averaging or taking median? >>>>> >>>>> >>>> Usually we do not find it necessary to aggregate the multiple probes. The >>>> multiple probes might represent different transcripts or variants, and >>>> some >>>> of the probes might not work. Why do you need to take any special >>>> action? >>>> >>>> However, if you feel that you must, the best method to aggregate the >>>> multiple probes is probably to retain the probe for each gene with >>>> highest >>>> fit_pair$Amean value. We have used this strategy in a number of >>>> published >>>> papers. The rationale for this is to choose the probe corresponding to >>>> the >>>> most highly expressed transcript for the gene. Alternatively, you could >>>> average the probes using the avereps() function in limma. >>>> >>>> Best wishes >>>> Gordon >>>> >>>> Thanks, >>>> >>>>> Zhengyu >>>>> >>>>> >>>>> head(M) >>>>>> >>>>> N1 N2 N3 N4 N5 N6 N7 >>>>> N8 T1 T2 T3 >>>>> 2 8.622724 7.423568 NA NA 7.487174 NA 8.516293 >>>>> NA >>>>> 7.876259 7.856707 NA >>>>> T4 T5 T6 T7 T8 >>>>> 2 NA 7.720018 NA 7.752550 NA >>>>> >>>>> eset<-as.matrix(M) >>>>> >>>>>> Pair=factor(targets$Pair) >>>>>> Treat=factor(targets$****Treatment,levels=c("N","T")) # compared >>>>>> matched >>>>>> >>>>>> normal to tumors >>>>> >>>>> design<-model.matrix(~Pair+****Treat) >>>>>> targets >>>>>> >>>>>> FilenName Pair Treatment >>>>> 1 N1 1 N >>>>> 2 N2 2 N >>>>> 3 N3 3 N >>>>> 4 N4 4 N >>>>> 5 N5 5 N >>>>> 6 N6 6 N >>>>> 7 N7 7 N >>>>> 8 N8 8 N >>>>> 9 T1 1 T >>>>> 10 T2 2 T >>>>> 11 T3 3 T >>>>> 12 T4 4 T >>>>> 13 T5 5 T >>>>> 14 T6 6 T >>>>> 15 T7 7 T >>>>> 16 T8 8 T >>>>> fit_pair<-lmFit(eset,design) >>>>> fit_pair<-eBayes(fit_pair) >>>>> >>>>> R=topTable(fit_pair, coef="TreatT", adjust="BH",number=30) # display top >>>>> 30 >>>>> >>>> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 10.8 years ago Gordon Smyth 50k

Login before adding your answer.