Illumina Probe_ID used in the LIMMA package for neqc function

Wei Shi ★ 3.6k

@wei-shi-2183

Australia/Melbourne

Hi William,

Please keep the posts on the list.

You should certainly remove from the analysis those probes that are not expressed in any of your samples, i.e. keep only the probes that are expressed in at least one sample. You can do so by applying a detection p-value cutoff (e.g. 0.05 or 0.01), or you can run the propexpr function to estimate the proportion of expressed probes and then use that information to filter out probes. See ?propexpr for more details.

Best wishes,
Wei

On Jul 4, 2013, at 2:55 PM, William D'Avigdor wrote:

> Hi Wei,
>
> Many thanks for your response.
>
> I would like to ask you another question, specifically about probe filtering.
>
> So far I have performed all my analyses on UNFILTERED Illumina data from Genome Studio. Is it still VALID for Illumina data to use unfiltered data, in contrast to probes filtered against the background signal with a particular p-value (e.g. p=0.01, or 0.1 according to your paper: Illumina WG-6 BeadChip strips should be normalised separately)?
>
> I am assuming that when performing hierarchical clustering on the full data, the genes at background level will not significantly contribute to the clustering. However, I do notice that the clustering distance is narrowed, obviously because the samples appear closer than they otherwise would.
>
> Further, when performing t-tests / LIMMA on the full data, those genes that are close to background level should not contribute to significant differences across groups. Is this correct? And is there anything I am missing, apart from maybe a contribution to the FDR?
>
> Many thanks,
> Wil
>
> On 2/07/2013 7:18 PM, Wei Shi wrote:
>> Dear William,
>>
>> What you have done is correct. As you have found, the 'ProbeID' is the same as the Array_Address_ID. The 'ProbeID' column was used in the old versions of Illumina BeadChip arrays, and it was later replaced with 'PROBE_ID' in the newer versions of BeadChips.
>>
>> The neqc() function uses negative control probes to carry out background correction. The 'TargetID' column in the control probe profile file indicates the types of control probes, and the negative control probes have the type 'NEGATIVE'. neqc() also uses all the probes, including regular probes and all types of control probes (negative controls, housekeeping, ...), to perform a quantile between-array normalization.
>>
>> Best wishes,
>>
>> Wei
>>
>> On Jul 2, 2013, at 3:56 PM, William D'Avigdor wrote:
>>
>>> Hi,
>>>
>>> I am doing some Illumina analysis using HumanWG-6_V2 microarrays, and have been using the annotation file HumanWG-6_V2_0_R4_11223189_A.bgx. I am normalising using the neqc function in the LIMMA package.
>>>
>>> I know there are traditionally a number of Illumina identifiers, and I am concerned that I may have been using the wrong ones. I'm not sure whether this has affected the normalisation procedure, or anything at all.
>>>
>>> After summarisation in Genome Studio, when looking at the 'Sample Probe Profile', the main identifiers that come up (and which I have used in LIMMA) are 'PROBE_ID' and 'SYMBOL', the first row being ILMN_1762337 and 7A5 respectively. I also noticed that this PROBE_ID column was the one used in the Illumina example in the LIMMA manual.
>>>
>>> HOWEVER, in Genome Studio, there is also a column called 'ProbeID'. This does not exist in the original annotation file (HumanWG-6_V2_0_R4_11223189_A), but it is identical to the Array_Address_ID (except for the preceding 000s), the latter of which is both in Genome Studio and in the annotation file, and UNIQUE to the version of the microarray.
>>>
>>> IN CONTRAST, in the 'Control Probe Profile' in Genome Studio, there is only the 'TargetID' and the 'ProbeID' available, the latter of which (I believe) is the Array_Address_ID?
>>>
>>> HENCE, for the LIMMA input, I am wondering whether I am correct to have included the Sample Probe Profile text file (which includes PROBE_ID, that is, ILMN_1762337), and the Control Probe Profile text file (which includes ProbeID instead, which is most likely the Array Address ID).
>>>
>>> Many thanks in advance,
>>> William d'Avigdor

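The "expressed in at least one sample" rule from the reply above can be sketched as follows. This is only an illustration in plain Python, not limma code (limma is an R package): the function name `filter_expressed`, the probe names and the toy detection p-values are all made up for the example, and it assumes the values are detection p-values (low value = expressed).

```python
def filter_expressed(detection, cutoff=0.05, min_samples=1):
    """Return the IDs of probes whose detection p-value is below `cutoff`
    in at least `min_samples` samples."""
    kept = []
    for probe_id, pvals in detection.items():
        n_pass = sum(1 for p in pvals if p < cutoff)
        if n_pass >= min_samples:
            kept.append(probe_id)
    return kept


# Hypothetical detection p-values: one row per probe, one value per sample.
detection = {
    "probe_A": [0.001, 0.200, 0.700],  # below both cutoffs in sample 1 only
    "probe_B": [0.300, 0.450, 0.900],  # never below either cutoff
    "probe_C": [0.030, 0.040, 0.020],  # below 0.05 everywhere, never below 0.01
}

kept_05 = filter_expressed(detection, cutoff=0.05)
kept_01 = filter_expressed(detection, cutoff=0.01)
print(kept_05)
print(kept_01)
```

A stricter cutoff can only shrink the set of retained probes, which is why moving from 0.05 to 0.01 removes more of them.
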
Microarray Annotation Normalization Clustering probe limma • 2.1k views

ADD COMMENT • link updated 12.5 years ago by William D'Avigdor ▴ 40 • written 12.5 years ago by Wei Shi ★ 3.6k

William D'Avigdor ▴ 40

@william-davigdor-6023

Hi Wei,

For probe filtering, I have been using a detection p-value cutoff of p=0.01, with at least one sample passing this threshold across my data set, which reduces the number of probes from 48,701 to 16,877.

Could you confirm whether this is a suitable threshold for my analyses?

Many thanks in advance,
Wil

ADD COMMENT • link 12.5 years ago William D'Avigdor ▴ 40

Hi Wil,

You removed about two thirds of your probes, which is pretty high. You could try a cutoff of p<0.05 to see how many are filtered out. Typically, around half of the probes are filtered out in our analyses. We often use a cutoff of p<0.05, but we also require all the replicates to satisfy this criterion.

You should also check whether the p-values in your data are 'detection scores' or 'detection p-values'. If they are detection scores, then a low value means low intensity and you should use p>0.95 for the filtering. You can easily check this by just looking at a few probes.

Cheers,
Wei

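The check suggested above (are the values detection p-values or detection scores?) can be sketched roughly as below. This is plain Python rather than limma code, and it assumes that a detection score is simply 1 minus the detection p-value, which is consistent with the p>0.95 threshold mentioned in the reply. The function names and the toy numbers are hypothetical.

```python
def looks_like_detection_score(intensities, detection):
    """Guess the orientation of the detection column for one array.

    Returns True if the brighter half of the probes has, on average,
    HIGHER detection values than the dimmer half (score-like), and
    False if it has lower values (p-value-like).
    Assumes at least two probes.
    """
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    half = len(order) // 2
    dim = [detection[i] for i in order[:half]]
    bright = [detection[i] for i in order[-half:]]
    return sum(bright) / len(bright) > sum(dim) / len(dim)


def is_expressed(value, cutoff=0.05, is_score=False):
    """Apply the cutoff in the right direction.

    Assumption: detection score = 1 - detection p-value, so
    'p < 0.05' is equivalent to 'score > 0.95'.
    """
    if is_score:
        return value > 1 - cutoff
    return value < cutoff


# Hypothetical intensities for four probes on one array.
intensities = [80, 95, 5000, 12000]
as_scores = [0.40, 0.55, 0.99, 1.00]   # bright probes near 1
as_pvalues = [0.60, 0.45, 0.01, 0.00]  # bright probes near 0

print(looks_like_detection_score(intensities, as_scores))
print(looks_like_detection_score(intensities, as_pvalues))
```

Looking at a handful of clearly bright and clearly dim probes by eye, as suggested in the reply, gives the same answer; the helper only automates that comparison.
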
ADD REPLY • link 12.5 years ago Wei Shi ★ 3.6k

Hi Wei,

Thanks for this. Can I specifically ask what you mean by 'replicates'? Is this ALL your microarrays? Or is it to do with the propexpr function? If I filter to keep only those probes that satisfy p<0.05 across ALL samples (n=36), I am left with only 11,102 probes.

My understanding is that I should keep those probes that are significantly different from background in at least one sample. If I use a detection p-value cutoff of p<0.05, 26,816 probes remain, compared with 16,877 at p<0.01. Based on this, would you suggest I use p<0.05? This keeps approximately half of the original 48,701 probes.

Kind regards,
Wil

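For reference, the probe counts quoted in this post translate into the following retained fractions. The snippet is plain arithmetic on the numbers given above; the variable names and filter labels are just for illustration.

```python
total_probes = 48701

# Probe counts reported in the post for each filtering rule.
kept_counts = {
    "p<0.01 in at least one sample": 16877,
    "p<0.05 in at least one sample": 26816,
    "p<0.05 in all 36 samples": 11102,
}

# Percentage of the original probes retained by each rule.
kept_percent = {
    rule: 100.0 * n / total_probes for rule, n in kept_counts.items()
}

for rule, pct in kept_percent.items():
    print(f"{rule}: kept {pct:.1f}%, removed {100.0 - pct:.1f}%")
```

The p<0.01 rule removes roughly two thirds of the probes, while the p<0.05 rule keeps a little over half, in line with the "around half" figure mentioned earlier in the thread.
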
ADD REPLY • link 12.5 years ago William D'Avigdor ▴ 40

Hi Wil,

By 'replicates', I mean those arrays which were hybridized to the same sample. For example, if you have two arrays which were both hybridized to a wild type sample, then they are replicates.

I think you should use p<0.05 for your filtering.

Cheers,
Wei

ADD REPLY • link 12.5 years ago Wei Shi ★ 3.6k


Many thanks for your advice. Most helpful!

Cheers,
Wil

On 10/07/2013 4:12 PM, Wei Shi wrote:
> Hi Wil,
>
> By 'replicates', I mean those arrays which were hybridized to the same sample. For example, if you have two arrays which were both hybridized to a wild type sample then they are replicates.
>
> I think you should use p<0.05 for your filtering.
>
> Cheers,
> Wei
>
> On Jul 10, 2013, at 3:01 PM, William D'Avigdor wrote:
>
>> Hi Wei,
>>
>> Thanks for this. Can I specifically ask what you mean by 'replicates'? Is this ALL your microarrays? Or to do with the propexpr function? If I filter to keep only those probes that satisfy p<0.05 across ALL samples (n=36), I am only left with 11,102 probes.
>>
>> My understanding is that I should keep those probes that are significantly different to background in at least one sample. If I use a detection p-value of p<0.05, I get 26,816; compared to p<0.01, I get 16,877 probes that remain. Based on this, would you suggest I use p<0.05? This is approximately half of the original 48,701 probes.
>>
>> Kind regards,
>> Wil
>>
>> On 10/07/2013 9:51 AM, Wei Shi wrote:
>>> Hi Wil,
>>>
>>> You removed about two thirds of your probes, which is pretty high. You may try to use a cutoff of p<0.05 to see how many are filtered out. Typically, around half of the probes were filtered out in our analyses. We often use a cutoff of p<0.05 but we also require all the replicates to satisfy this criterion.
>>>
>>> You should also check if the p values in your data are 'detection scores' or 'detection p-values'. If they are detection scores, then a low value means low intensity and you should use p>0.95 for the filtering. You can easily check this by just looking at a few probes.
>>>
>>> Cheers,
>>> Wei
>>>
>>> On Jul 9, 2013, at 5:21 PM, Wil D'Avigdor wrote:
>>>
>>>> Hi Wei,
>>>>
>>>> For probe filtering, I have been using a p-value cut-off of p=0.01 with at least one sample passing this threshold across my data set, which reduces the number of probes from 48,701 to 16,877.
>>>>
>>>> I would like to confirm that this is a suitable threshold for my analyses?
>>>>
>>>> Many thanks in advance,
>>>> Wil
>>>>
>>>> Sent from my iPhone
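The filtering rules discussed in this exchange can be sketched as follows. This is a toy illustration in plain Python, not limma code: the function name `keep_expressed`, its parameters, and the detection values are all assumptions made for the example. It keeps a probe when its detection p-value is below the cutoff in at least `min_samples` arrays, and it converts detection scores to p-values as 1 - score, which corresponds to the p>0.95 rule Wei describes for scores.

```python
def keep_expressed(detection, cutoff=0.05, min_samples=1, values_are_scores=False):
    """Return the row indices of probes that pass the detection filter.

    detection: one row per probe, one value per array (hypothetical layout).
    values_are_scores: set to True if the values are detection scores, where
    a HIGH value means expressed; they are converted to p-values as 1 - score.
    """
    kept = []
    for index, row in enumerate(detection):
        p_values = [1.0 - v for v in row] if values_are_scores else row
        n_pass = sum(1 for p in p_values if p < cutoff)
        if n_pass >= min_samples:
            kept.append(index)
    return kept


# Toy detection p-values: 4 probes x 3 arrays.
detection = [
    [0.001, 0.20, 0.03],  # expressed in two arrays
    [0.40, 0.60, 0.90],   # never expressed
    [0.02, 0.01, 0.04],   # expressed in all three arrays
    [0.30, 0.04, 0.50],   # expressed in one array
]

at_least_one = keep_expressed(detection)                 # p<0.05 in >= 1 array
all_arrays = keep_expressed(detection, min_samples=3)    # p<0.05 in every array
stricter = keep_expressed(detection, cutoff=0.01)        # p<0.01 in >= 1 array
print(at_least_one, all_arrays, stricter)
```

Setting `min_samples` to the number of arrays hybridized to the same sample would mimic requiring all replicates of a group to pass, as opposed to requiring every array in the experiment to pass, which is the much stricter filter that left only 11,102 probes above.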

ADD REPLY • link 12.5 years ago William D'Avigdor ▴ 40
