What to do with multiple probes?


krasikov@science.uva.nl ▴ 100


Dear all,

1. I have a general question about multiple probes per gene. This question has been discussed several times by the BioC community, but I did not find any clear solution.

My array platform is a custom bacterial Agilent oligo microarray. It consists of 8000 unique probes for a bit more than 3000 genes (the complete bacterial genome), with 1, 2 or 3 probes per gene (mostly depending on the length of the gene: 1 for short genes and 3 for long ones).

The generated list contains statistics for each probe. What should I do to generate the gene list (which is normally what biology-related research needs)? It is fine when all three probes for a gene are called regulated in the same direction, but what should I do if they are not? Should I exclude such genes from the final list? Can anybody give me a clue how to deal with that?

2. This is my current solution, which may be far too strict.

My list contains information like this (the result of write.fit), shown for three probes of the same gene:

    A   M   p   Result   Probename
    *   *   *   1        xxx1111_123
    *   *   *   0        xxx1111_566
    *   *   *   1        xxx1111_1050

How can I arrange it in an elegant way, like this?

    A.mean   M.mean   New.Result   xxx1111   M.1   M.2   M.3   p.1   p.2   p.3

Here A.mean and M.mean are the means over all probes for that gene, and the new Result is a consensus call (something like: all three 1 gives 1, all three -1 gives -1, and at least one zero or opposite calls gives 0).

3. In my experiment (strictly controlled conditions, 5 biological replicates and some dye-swaps among them), 3500 of my 8000 probes were called regulated, which is almost half of the complete set (a large part of the calls is biologically relevant, which is nice). Is that not too much? (I am thinking of the statistical assumption that most genes should be unchanged.) However, physiologically my experiment should produce rather large differential expression.

I used a direct ratio design, loess and then Aquantile normalization, with BH correction in decideTests and a p-value cut-off of 0.001.

Thanks in advance for any help.
Vladimir
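A minimal sketch of the strict consensus rule from point 2, in base R. It assumes the gene ID is the probe name up to the last underscore (as in xxx1111_123); the numeric values, the extra genes xxx2222 and xxx3333, and the column n.probes are made up for illustration and are not from the experiment.

```r
# Toy probe-level table in the spirit of the write.fit output (values are invented).
probes <- data.frame(
  Probename = c("xxx1111_123", "xxx1111_566", "xxx1111_1050",
                "xxx2222_80", "xxx2222_410", "xxx3333_15"),
  A      = c(9.1, 8.7, 9.4, 7.2, 7.0, 10.3),
  M      = c(1.8, 0.4, 2.1, -1.5, -1.9, 0.1),
  Result = c(1, 0, 1, -1, -1, 0),
  stringsAsFactors = FALSE
)

# Assumption: gene ID = probe name with the trailing "_<position>" removed.
probes$Gene <- sub("_[^_]*$", "", probes$Probename)

# Strict consensus: 1 if every probe is 1, -1 if every probe is -1, otherwise 0.
consensus <- function(x) {
  if (all(x == 1)) 1 else if (all(x == -1)) -1 else 0
}

# One data frame per gene, then one summary row per gene.
by_gene <- split(probes, probes$Gene)
genes <- data.frame(
  Gene       = names(by_gene),
  A.mean     = vapply(by_gene, function(d) mean(d$A), numeric(1)),
  M.mean     = vapply(by_gene, function(d) mean(d$M), numeric(1)),
  New.Result = vapply(by_gene, function(d) consensus(d$Result), numeric(1)),
  n.probes   = vapply(by_gene, nrow, integer(1)),
  stringsAsFactors = FALSE
)
print(genes)
```

Under this rule a gene with calls 1, 0, 1 is reported as 0, which is why the rule is strict; a looser variant (for example a majority call) would only change the body of consensus().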

Microarray Normalization probe oligo

ADD COMMENT • link updated 20.2 years ago by rgentleman ★ 5.5k • written 20.2 years ago by krasikov@science.uva.nl ▴ 100


Sean Davis 21k


On 11/25/05 9:32 AM, "krasikov at science.uva.nl" <krasikov at science.uva.nl> wrote:

> The generated list contains statistics for each probe.
> What should I do to generate the gene list (which is normally needed for
> the biology related research)?
> It's fine when the gene is decided to be regulated for all three probes
> in the same direction, but what to do if not?
> Should I exclude such genes from final list?

Unfortunately, if all quality metrics are the same (no reason to choose one probe over the other), then validation is in order using another platform for gene expression (PCR, another array, etc.). Another possibility is to go back and BLAST all probes against some transcript database (like RefSeq) to get some sense of cross-hybridization potential, mismatch (if there is any), alignment to transcript variants, and 3'-bias. In some cases, it may be clear that one probe represents one transcript and another probe represents a different transcript, each of which is expressed in a different tissue, for example. (I have to say that such situations are rare, though.)

> 2. This is for a while my particular solution,
> which is maybe far too strict.
> [...]
> where A.mean and M.mean are means of all probes for that gene
> and a new Result is logical (something like all three 1 then 1,
> all three -1 then -1, if at least one zero or opposite than 0)

I guess it depends on what you want to do with the information. If you are in a gene discovery mode (minimize false negatives), you may simply list all probes in order of significance. If two of three probes are not significant, that isn't a problem, as you will need to validate some proportion of your data anyway.

> 3.
> For my experiment [...] from my
> 8000 probes 3500 decided to be regulated, which is almost half of
> complete set [...]
> Is not it too much? (I'm thinking about the statistical assumption that
> most of genes should be not changed) However physiologically my
> experiment should produce rather big differential expression.

It is possible to have a large number of differentially expressed genes, yes.

Sean

ADD COMMENT • link 20.2 years ago Sean Davis 21k


rgentleman ★ 5.5k


Hi,

Sean has already answered some of your questions, but I will add a few of my thoughts.

1) There is little discussion because it is a reasonably difficult topic and there are no clear-cut answers besides "it depends", and it does depend on a lot of different things.

For example, you might exclude the information from some probe sets if they are far from the poly-A tail and dT-priming was used; if random priming was used, then all should be equally good (but I am not aware of a comprehensive comparison).

In some cases, depending on the data, processing etc., you can develop tools for comparing duplicate probe sets and combining the information to get better estimates of whether genes are expressed and at what levels (you could compare the probes and see if they are unique in the genome, for example). In these situations, using R is a good thing, since you can do pretty much any reasonable analysis, but you need to know some statistics and some programming to do it, and there is no clear recipe to follow.

krasikov at science.uva.nl wrote:

> 3.
> For my experiment (in a strictly controlled conditions, with 5
> biological replicates and some dye-swaps for them) from my
> 8000 probes 3500 decided to be regulated, which is almost half of
> complete set (big part of the decisions is biologically relevant,
> which is nice).

Do you mean about 3500 are showing differential expression? This seems very large, and you do realize that it violates most of the principles that underlie the usual normalization procedures? That may be more of a problem for you than the duplicate probes. And fixing it, or convincing yourself that the outputs of the normalization are OK, will take some time and statistical expertise. In my experience these issues are way outside of what can easily be dealt with on a mailing list; local expertise is what is needed.

Best wishes,
Robert

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center

ADD COMMENT • link 20.2 years ago rgentleman ★ 5.5k


Hi,

Thanks Robert and Sean for your comments on my problem.

Robert Gentleman wrote:

> For example, you might exclude the information on some probe sets if
> they are far from the poly-A tail and dT-priming was used, if random
> priming was used, then all should be equally good (but I am not aware of
> a comprehensive comparison).
> [...]
> In these situations, using R is a good thing,
> since you can pretty much do any reasonable analysis, but you need to
> know some statistics and some programming to do it, and there is no
> clear recipe to follow.

I have complete bacterial genome custom Agilent microarrays. In my case "custom" means that we designed all probes ourselves (not Agilent), assigned quality scores to the probes, and tried to avoid "poor" probes.

The quality metrics for the probes should be equally good. It is a random priming experiment. This is a bacterial platform, so no tissue specificity is possible; moreover, I expect that there is not too much biological variation between biological replicates.

Taking into account the above discussion about multiple probes, I guess I can only make decisions individually, gene by gene. For that I can imagine the layout in my point 2. My question is: what kind of script is needed to rearrange the MAList and save it in text format?

> Do you mean about 3500 are showing differential expression? This seems
> very large, and you do realize that it violates most of the principles
> that underly the usual normalization procedures? That may be more of a
> problem for you than the duplicate probes. And fixing it, or convincing
> yourself that the outputs of the normalization are ok, will take some
> time and statistical expertise.

The general question: how do I validate the normalization outcome? Density plots? I have tried "loess with Aquantile" and "vsn", and the outcome of decideTests is more or less the same: a lot of probes with differential expression.

Here is the code I used in limma:

    RG <- read.maimages(...)
    # ... assigning spotTypes
    # ... removing control spots from the RG
    RGb <- backgroundCorrect(RG, method="minimum")
    MA <- normalizeWithinArrays(RGb, method="loess")
    MA <- normalizeBetweenArrays(MA, method="Aquantile")
    # ... design
    fit <- lmFit(MA, design)
    # ... contrast.matrix
    fit <- contrasts.fit(fit, contrast.matrix)
    fit <- eBayes(fit)
    res <- decideTests(fit, method="separate", adjust.method="BH",
                       p.value=0.001)
    write.fit(fit, results=res, file="...", digits=2, adjust="BH", sep="\t")

Under these conditions I got 1800 up and 1800 down probes (out of 8100). Decreasing p.value to 0.0001 gave me 800 up and 800 down.

I would like to mention here that quite a big part of the obtained data is physiologically relevant in my experiment, and the nature of the experiment suggests large differential expression.

Any suggestions?

Best wishes,
Vladimir
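One hedged way to start on the "density plots?" question is to compare the per-array distributions of M-values. The sketch below uses base R on a simulated matrix standing in for the M component of a normalized MAList (assumed here to be accessible as MA$M); the simulated numbers, the array names and the cut-off of 1 on the log-ratio scale are illustrative assumptions, not results from this experiment. limma's plotDensities() and plotMA() give similar views directly from the limma objects.

```r
set.seed(1)

# Simulated log-ratios standing in for MA$M: 1000 probes x 4 arrays (not real data).
M <- matrix(rnorm(4000, mean = 0, sd = 0.5), nrow = 1000,
            dimnames = list(NULL, paste0("array", 1:4)))

# Shift half of the probes, mimicking an experiment with widespread change.
M[1:250, ]   <- M[1:250, ] + 2
M[251:500, ] <- M[251:500, ] - 2

# Per-array location, spread, and share of probes beyond an arbitrary cut-off of 1.
array_summary <- data.frame(
  array     = colnames(M),
  median    = apply(M, 2, median),
  iqr       = apply(M, 2, IQR),
  frac.up   = colMeans(M > 1),
  frac.down = colMeans(M < -1)
)
print(array_summary)

# Overlaid densities of M, one curve per array.
dens <- lapply(seq_len(ncol(M)), function(j) density(M[, j]))
ymax <- max(vapply(dens, function(d) max(d$y), numeric(1)))
plot(dens[[1]], main = "M-value densities by array", xlab = "M",
     ylim = c(0, ymax))
for (j in seq_along(dens)[-1]) lines(dens[[j]], col = j)
```

Arrays whose median or spread stand apart from the rest, or a strong imbalance between frac.up and frac.down, would be reasons to look more closely at the normalization. Agreement between arrays does not by itself show that the normalization assumptions hold, and dye-swapped arrays need their sign flipped before such a comparison.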

ADD REPLY • link 20.2 years ago krasikov@science.uva.nl ▴ 100


On 11/30/05 5:59 AM, "krasikov at science.uva.nl" <krasikov at science.uva.nl> wrote:

> Taking into the account above discussion about multiple probes, I guess
> I can only make decisions individually gene-by-gene.
> For that I can imagine the design in my point 2.
> My question is: what kind of script should be to rearrange MAlist and
> save it in text format?

Just to be clear: your multiple probes per gene are different and not identical for each gene? If that is the case, then I think that you will need to write the script yourself, as I don't think one is available "off-the-shelf". I would look at commands like "reshape", "split", and "aggregate" for good general-purpose building blocks for jobs like this.

Sean
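As a rough sketch of how two of those building blocks could be combined, the base R code below uses aggregate() for the per-gene means and reshape() for the M.1, M.2, M.3, p.1, p.2, p.3 columns, then writes a tab-separated text file. The toy values, the helper column idx and the output file name gene_table.txt are assumptions made for illustration; a real script would start from the probe-level table written by write.fit.

```r
# Toy probe-level table with a gene ID per probe (values are invented).
probes <- data.frame(
  Gene = c("xxx1111", "xxx1111", "xxx1111", "xxx2222", "xxx2222", "xxx3333"),
  A    = c(9.1, 8.7, 9.4, 7.2, 7.0, 10.3),
  M    = c(1.8, 0.4, 2.1, -1.5, -1.9, 0.1),
  p    = c(0.0002, 0.2, 0.0001, 0.0005, 0.0003, 0.7),
  stringsAsFactors = FALSE
)

# Number the probes within each gene: 1, 2, 3, ...
probes$idx <- ave(seq_along(probes$Gene), probes$Gene, FUN = seq_along)

# Per-gene means of A and M.
means <- aggregate(cbind(A, M) ~ Gene, data = probes, FUN = mean)
names(means) <- c("Gene", "A.mean", "M.mean")

# Long to wide: one row per gene, columns M.1, p.1, M.2, p.2, ...
# Genes with fewer than three probes get NA in the unused columns.
wide <- reshape(probes[, c("Gene", "idx", "M", "p")],
                idvar = "Gene", timevar = "idx", direction = "wide")

# Combine and order the columns as in the requested layout.
gene_table <- merge(means, wide, by = "Gene")
gene_table <- gene_table[, c("Gene", "A.mean", "M.mean",
                             "M.1", "M.2", "M.3", "p.1", "p.2", "p.3")]

# Save as tab-separated text; NA is written as an empty field.
out_file <- file.path(tempdir(), "gene_table.txt")
write.table(gene_table, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE, na = "")
print(gene_table)
```

The resulting file opens directly in a spreadsheet, so the gene-by-gene decisions can still be made by hand afterwards.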

ADD REPLY • link 20.2 years ago Sean Davis 21k


Sean Davis wrote:

> Just to be clear--your multiple probes per gene are different and not
> identical for each gene? If that is the case, then I think that you will
> need to write the script yourself, as I don't think one is available
> "off-the-shelf". I would look at commands like "reshape", "split", and
> "aggregate" for good general purpose building blocks for jobs like this.

Yes, my probes are all unique: 1 to 3 per gene (8000 probes for 3000 genes).

I will try to do something. Unfortunately, my programming and statistical background is rather weak, and I am afraid it would take me too much time, which I lack right now, to understand how to implement this in R. But I can do it in Excel (I have some experience with Visual Basic).

Vladimir

--
Krasikov Vladimir
Universiteit van Amsterdam
Faculty of Science
IBED/AMB (Aquatische Microbiologie)

ADD REPLY • link 20.2 years ago krasikov@science.uva.nl ▴ 100
