Dear List,
I have a set of mouse Affy data. The platform is the Affy Mouse 430A 2
chip.
There are 8 samples, 4 for each condition.
I normalized the data using rma. The array originally has 22090 probes.
Then, in order to filter out the genes which have no Entrez ID or are
duplicates for the same gene, I used the following command:
filter <- nsFilter(eset1, require.entrez=TRUE, remove.dupEntrez=TRUE,
                   var.func=IQR, var.filter=TRUE)
This leaves me with 6579 genes after filtering. I think I lose many
of the genes here. Is there a better way to do the same?
Also, the other problem that I am facing is that after this step, I
create an expression matrix of these remaining 6579 probes.
Now, in order to annotate them, I use the library mouse4302.db.
I select the ids from my list and then use the following command:
Symbol <- mouse4302SYMBOL[ids]
This gives me a smaller number of probes and genes. I lose more data
here.
For example, I am interested in the gene DCK. I checked the original
Affymetrix annotation file and there are 3 probes present for this
gene. That means that it should have annotation. But I do not find it
in the final dataset.
Can anyone suggest a better method or any corrections to the approach
that I am using? I eventually need to merge this data with other Affy
data and check the expression values, but I figured out that I am not
getting the right amount of genes.
Any help is much appreciated. I am a newbie to R and Bioconductor, so I
am sorry if it is a basic question.
Thank you all in advance for your help.
Thanks,
Himanshu
Hi Himanshu,
On 1/11/2013 4:57 PM, Himanshu Sharma wrote:
> Dear List,
> I have a set of mouse Affy data. The platform is the Affy Mouse 430A 2
> chip.
> There are 8 samples, 4 for each condition.
> I normalized the data using rma. The array originally has 22090 probes.
> Then, in order to filter out the genes which have no Entrez ID or are
> duplicates for the same gene, I used the following command:
>
> filter <- nsFilter(eset1, require.entrez=TRUE, remove.dupEntrez=TRUE,
>                    var.func=IQR, var.filter=TRUE)
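You are filtering on three things here. First you require that all
probesets have an Entrez Gene ID, then you remove any duplicates, then
you require that the interquartile range of the remaining data be
above a cutoff. With the defaults (var.cutoff=0.5, filterByQuantile=TRUE)
that cutoff is the 0.5 quantile of the IQRs, so about half of the
remaining probesets are dropped.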
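The three steps can be sketched in base R on simulated data. This is only an illustration under stated assumptions, not the internals of nsFilter: the matrix, the probeset names and the Entrez map are made up, the steps follow the order described above (which may differ from the order nsFilter uses internally), and the variance cutoff is taken as the 0.5 quantile of the IQRs, which is how the nsFilter help page describes its default (var.cutoff=0.5 with filterByQuantile=TRUE).

```r
# Simulated log2 expression matrix: 8 probesets x 8 samples (hypothetical).
set.seed(1)
expr <- matrix(rnorm(64, mean = 8), nrow = 8,
               dimnames = list(paste0("ps", 1:8), paste0("s", 1:8)))

# Hypothetical probeset -> Entrez Gene ID map. NA means no Entrez ID;
# ps2/ps3 and ps5/ps6 each target the same gene.
entrez <- c(ps1 = "101", ps2 = "102", ps3 = "102", ps4 = NA,
            ps5 = "103", ps6 = "103", ps7 = "104", ps8 = "105")

# Step 1: require an Entrez Gene ID.
keep <- names(entrez)[!is.na(entrez)]

# Step 2: for each gene, keep only the probeset with the largest IQR.
iqr <- apply(expr[keep, ], 1, IQR)
best <- tapply(names(iqr), entrez[keep],
               function(p) p[which.max(iqr[p])])
keep <- as.vector(best)

# Step 3: keep probesets whose IQR exceeds the 0.5 quantile (the median)
# of the remaining IQRs, so roughly half of them are dropped.
cutoff <- quantile(iqr[keep], probs = 0.5)
filtered <- expr[keep[iqr[keep] > cutoff], , drop = FALSE]

print(dim(filtered))
```

Each step only ever removes rows, which is why the count shrinks so quickly from the original array size.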
This is one way of doing things. Depending on your goals, there may be
better or worse things you could do. If for instance you don't want to
lose DCK, regardless of possible low variation, you could skip the
filter on variation (var.filter=FALSE).
But 'better' is a subjective term, and you are the only one who can
decide what is better or worse in your particular situation.
>
> This leaves me with 6579 genes after filtering. I think I lose many
> of the genes here. Is there a better way to do the same?
>
> Also, the other problem that I am facing is that after this step, I
> create an expression matrix of these remaining 6579 probes.
>
> Now, in order to annotate them, I use the library mouse4302.db.
> I select the ids from my list and then use the following command:
> Symbol <- mouse4302SYMBOL[ids]
>
> This gives me a smaller number of probes and genes. I lose more data
> here.
> For example, I am interested in the gene DCK. I checked the original
> Affymetrix annotation file and there are 3 probes present for this
> gene. That means that it should have annotation. But I do not find it
> in the final dataset.
>
> Can anyone suggest a better method or any corrections to the approach
> that I am using? I eventually need to merge this data with other Affy
> data and check the expression values, but I figured out that I am not
> getting the right amount of genes.
There is no such thing as 'the right amount of genes'. There are only
assumptions and tradeoffs. You can make the assumption that genes with
an IQR below the cutoff are not really changing enough to consider, and
then filter them out. Or you can assume that smaller variation is still
biologically meaningful, and reduce the IQR cutoff, or eliminate it
entirely. Or you can assume that duplicated genes on the Affy Mouse 430
chip are really measuring different splice variants or some such, and
you want to keep them all in the data set.
All these assumptions have tradeoffs, including the possibility that
you are wrong and you are polluting your dataset with noise, or
unnecessarily increasing the multiplicity of your comparisons. But in
the end it is up to the analyst to decide what assumptions are to be
made, and to be prepared to defend those assumptions to those higher
up (your PI, your funding source, journal reviewers, whomever).
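To make the multiplicity tradeoff concrete, here is a small base R sketch. The raw p-value is hypothetical, the two test counts are simply the numbers quoted in this thread (6579 filtered genes, 22090 probes before filtering), and Bonferroni is used only because it is the easiest adjustment to follow, not because it is the recommended one for microarray data.

```r
# One hypothetical raw p-value, adjusted for two different numbers of tests.
p_raw <- 1e-5
n_tests <- c(filtered = 6579, unfiltered = 22090)

# Bonferroni multiplies the p-value by the number of tests (capped at 1).
p_adj <- sapply(n_tests,
                function(n) p.adjust(p_raw, method = "bonferroni", n = n))
print(p_adj)
```

The same evidence for a gene looks weaker after adjustment when more features are carried into the comparison, which is the cost to weigh against the risk of filtering out something of interest.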
Best,
Jim
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
Thanks a lot James. I really appreciate your help. Also, when I
annotate the ids, shouldn't there be the same number of probes as after
filtering? How do I lose more when I annotate them?
Thanks,
Himanshu
Hi Himanshu,
On 1/12/2013 10:27 AM, Himanshu Sharma wrote:
> Thanks a lot James. I really appreciate your help. Also, when I
> annotate the ids, shouldn't there be the same number of probes as after
> filtering? How do I lose more when I annotate them?
Two possible reasons. First, having an Entrez Gene ID doesn't
necessarily imply having a gene symbol. Second, and more likely, the
moe430a.db package masks all probe -> symbol mappings where there are
multiple symbols.
But exposing multiple probe -> symbol mappings adds an additional level
of complexity. As an example:
> library(moe430a.db)
> x <- toggleProbes(moe430aSYMBOL, "all")
> x <- as.list(x)
> x[sapply(x, length) > 1][1:10]
$`1415716_a_at`
[1] "Rps27" "Gm9846"
$`1415763_a_at`
[1] "Tmem234" "LOC100505293"
$`1415781_a_at`
[1] "Sumo2" "Gm13430"
$`1415788_at`
[1] "Gm12663" "Ublcp1"
$`1415789_a_at`
[1] "Gm12663" "Ublcp1"
$`1415790_at`
[1] "Gm12663" "Ublcp1"
$`1415825_s_at`
[1] "Slc38a10" "Csnk1d"
$`1415875_at`
[1] "Fam60a" "3010003L21Rik"
$`1415895_at`
[1] "Snrpn" "Snurf"
$`1415896_x_at`
[1] "Snrpn" "Snurf"
So now which symbol do you use? The first one? Is 1415896_x_at Snrpn or
Snurf? Are these symbols synonyms? Without checking each one you are
left with ad hoc decisions. You could just use the first one, but then
you choose Gm12663 over Ublcp1 for probeset 1415790_at, which looks like
the opposite of what you should be doing.
As they say, ignorance is bliss. The more you know about this stuff, the
messier it gets, and the less clear it is what the 'right' thing to do
might be. Or maybe that should be the 'right' thing to do in light of
limited time to spend chasing your tail.
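One way to postpone that decision, sketched here in base R, is to keep every symbol and flag the ambiguous probesets instead of silently picking one. The list below copies a few entries from the output above; the single-symbol entry is hypothetical and added only for contrast, and the " /// " separator is borrowed from the Affymetrix annotation files rather than being anything the annotation package produces.

```r
# Probeset -> symbol mappings (subset of the output shown above).
x <- list(
  "1415716_a_at" = c("Rps27", "Gm9846"),
  "1415790_at"   = c("Gm12663", "Ublcp1"),
  "1415896_x_at" = c("Snrpn", "Snurf"),
  "ps_single_at" = "GeneA"   # hypothetical probeset with one symbol
)

# One row per probeset: all symbols collapsed into one string, plus a
# count so ambiguous probesets can be reviewed or excluded later.
annot <- data.frame(
  probeset  = names(x),
  symbols   = vapply(x, paste, character(1), collapse = " /// "),
  n_symbols = lengths(x),
  row.names = NULL,
  stringsAsFactors = FALSE
)
annot$ambiguous <- annot$n_symbols > 1

print(annot)
```

This keeps the row count equal to the number of probesets, so nothing disappears at the annotation step, and the ambiguous column shows how many calls would otherwise have been made arbitrarily.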
Best,
Jim
> Thanks,
> Himanshu
> From: James W. MacDonald [jmacdon at uw.edu]
> Sent: Saturday, January 12, 2013 9:48 AM
> To: Himanshu Sharma
> Cc: bioconductor at r-project.org mailman
> Subject: Re: [BioC] Filtering out duplicate probes in Affy data
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099