outlier removal from gene chip
Weiwei Shi ★ 1.2k
@weiwei-shi-1407
Dear listers,

I have a question about whether Bioconductor has a toolkit to detect outliers and remove them.

My original dataset looks like this:

            V1       V51       V53        V55       V57
1   -493249600  1.459459 -3.069444  -1.300000  1.935484
2  -1613096495 -1.139269 -5.525281 -16.592593 -1.831978
3   1626196571 -3.500000 -1.011662   2.223881  3.921053
4  -1397009217 -3.571429  1.685714  -1.180297 -6.807692
5   1428659728 -1.405405 -1.469004  -4.779754 -1.033708
6    459853658 -2.158879 -7.510823  -1.085581 -9.382979
7    530182506 -1.431677 -1.336343  -3.126437  4.878788
8   1173842263  1.215385  1.856410  -2.059794 -6.020833
9        28847  2.407895 -2.048889  -1.730337 -1.178947
10 -1961875610  2.864159 -2.301234  -4.733264 -1.172058

V1 is the internal probe ID; the remaining columns are different samples. The cells are fold changes of disease/normal.

summary() of the sample columns (V51, ..., V57) gives the following:

              V51        V53       V55         V57
Min.     -482.000   -55.7342  -122.074  -14086.750
1st Qu.    -2.159    -1.7312    -2.125      -1.831
Median     -1.199    -1.0416    -1.200      -1.080
Mean       -0.918     0.1662    -1.027      -1.874
3rd Qu.     1.441     1.5721     1.419       1.521
Max.      198.434  1478.1639    95.768     683.519

My question is: is there any package that can detect outliers (like -14086.750), remove them, and compute an "average" for each gene (instead of each probe)?

Thank you.

Weiwei

--
Weiwei Shi, Ph.D
Research Scientist
GeneGO, Inc.

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
Fangxin Hong ▴ 810
@fangxin-hong-912
Dear Weiwei,

The definition of an outlier is not clear, and no data point should be treated as an outlier unless there is reason to believe it is one. A simple way to detect outliers is the 1.5 × IQR criterion, for which you can write your own code in one or two lines. Let me know if there are other methods for detecting outliers.

Fangxin

--------------------
Fangxin Hong Ph.D.
Plant Biology Laboratory
The Salk Institute
10010 N. Torrey Pines Rd.
La Jolla, CA 92037
E-mail: fhong at salk.edu
(Phone): 858-453-4100 ext 1105
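The 1.5 × IQR criterion mentioned above can be sketched in a few lines of R. This is only an illustration, not code from the thread: the function name `flag_iqr_outliers` is made up, and the toy vector is the V57 column from the question with the reported minimum appended. It flags values rather than deleting them, so they can be inspected first.

```r
# Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences, k = 1.5).
# flag_iqr_outliers is a hypothetical helper name, not a Bioconductor function.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Toy values: the ten V57 entries shown in the question, plus the
# extreme minimum (-14086.750) reported by summary().
v57 <- c(1.935484, -1.831978, 3.921053, -6.807692, -1.033708,
         -9.382979, 4.878788, -6.020833, -1.178947, -1.172058,
         -14086.750)

# Show which values fall outside the fences.
v57[flag_iqr_outliers(v57)]
```

Applied per sample column (for example with `sapply()`), this gives a logical mask of candidate outliers; whether to drop them is a separate decision.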
My current approach uses mahalanobis() distance.

To Sean: do you think that example (-14k) is OK?

Weiwei
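The thread does not say how mahalanobis() was applied, so the following is just one common pattern, shown on simulated data (the matrix `x`, the planted extreme row, and the chi-square cutoff are all assumptions for illustration). The cutoff presumes roughly multivariate normal data, which fold changes may well not be.

```r
# Simulated stand-in for a probes-by-samples matrix (NOT the real data).
set.seed(1)
x <- matrix(rnorm(200 * 4), ncol = 4,
            dimnames = list(NULL, c("V51", "V53", "V55", "V57")))
x[1, ] <- c(0, 0, 0, -50)   # plant one extreme row for the demonstration

# Squared Mahalanobis distance of every row from the column means.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Compare against a chi-square quantile (an assumed, conventional cutoff).
cutoff <- qchisq(0.975, df = ncol(x))
which(d2 > cutoff)
```

Note that the mean and covariance used here are themselves affected by extreme rows, so with very few rows this approach may flag nothing at all.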
On Sep 19, 2006, at 12:18 PM, Weiwei Shi wrote:

> my current way is using mahalanobis() distance.
>
> to Sean:
> do u think that example: -14k is ok?

That example could be a case of a gene being expressed in one condition and not in the other. I do not remember where the data are from (or whether you have even described that), or the platform, but I agree with Sean: you do not want to blindly remove the genes. Note that we are not advising that you shouldn't remove the gene, just that you should take a careful look at the data and then decide what to do.

As Fangxin clearly writes, it is hard to really know what an outlier is.

Kasper
Thanks for all of the suggestions here.

I will first proceed without removing those "outliers" and will post an update if necessary.

Weiwei
You should really check the original data, not the ratio, and then decide, rather than blindly choosing to keep or remove those extreme values. As Kasper said, some could well represent genes that show strong expression in one condition only, either because they become silenced or activated, and these are potentially very interesting.

Jose

--
Dr. Jose I. de las Heras
The Wellcome Trust Centre for Cell Biology
Institute for Cell & Molecular Biology
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR, UK
Email: J.delasHeras at ed.ac.uk
Phone: +44 (0)131 6513374
Fax: +44 (0)131 6507360
Hi Sean,

Some added information: I ran a pathway analysis and compared the results obtained with and without those "outliers". My results (validated by domain knowledge, since this is unsupervised learning) show that the former is better, which agrees with your suggestion.

However, the -14k value and some of the other numbers in the summary in my first email still do not make sense to me.

Weiwei
Weiwei Shi ★ 1.2k
@weiwei-shi-1407
Some added information: V1 is the gene ID, but each row represents a probe. Multiple rows can therefore share the same V1, because those probes correspond to the same gene.

Weiwei
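Given that several probe rows can share one gene ID in V1, a per-gene summary can be sketched with base R's aggregate(). The data frame below is invented for illustration (gene IDs "g1" to "g3" and their values are hypothetical), and using the median as the "average" is an assumption chosen here because it is less sensitive to a single extreme probe; the thread does not prescribe it.

```r
# Hypothetical toy data: V1 is the gene ID, each row is one probe.
fc <- data.frame(
  V1  = c("g1", "g1", "g1", "g2", "g2", "g3"),
  V51 = c(1.5, 1.7, -482.0, -2.1, -2.3, 1.2),
  V53 = c(-3.1, -2.9, -3.0, 1.6, 1.8, -1.0)
)

# Collapse probes to one row per gene: median of each sample column.
gene_level <- aggregate(. ~ V1, data = fc, FUN = median)
gene_level
```

Swapping `FUN = median` for `mean` gives the plain average, which the single -482.0 probe would dominate for gene "g1".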
@sean-davis-490
On 9/19/06 1:02 PM, "Weiwei Shi" wrote:

> My question is, is there any package which can detect those outliers
> (like -14086.750)and remove them and get an "average" for each gene
> (instead of each probe)?

Hi, Weiwei.

The better option, probably, is to remove questionable data points BEFORE making a ratio, using good quality control, plots, etc. Extreme ratios may be biologically very important, so simply removing them is probably not the best option. I would suggest looking at the two data values that went into each ratio you think is in question and seeing whether there is an explanation (for example, one of the two probes failed). Simply removing ratios because they look like outliers potentially removes your most interesting data.

Sean
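As a rough sketch of looking at the two values behind each ratio: the raw intensities below are invented (the thread never shows them), the signed fold-change convention is inferred from the absence of values between -1 and 1 in the posted table, and the floor of 10 is an arbitrary, hypothetical threshold.

```r
# Hypothetical raw intensities; the thread does not show the real ones.
raw <- data.frame(
  probe   = c("p1", "p2", "p3"),
  disease = c(1450.0, 0.05, 310.0),
  normal  = c(990.0, 704.3375, 1150.0)
)

# Signed fold change (assumed convention): ratio if >= 1, else -1/ratio.
ratio <- raw$disease / raw$normal
raw$fold <- ifelse(ratio >= 1, ratio, -1 / ratio)

# Mark ratios where either raw value is below an arbitrary floor, so they
# can be inspected (failed probe? gene truly off?) before any removal.
floor_value <- 10   # hypothetical threshold, to be chosen per platform
raw$suspect <- pmin(raw$disease, raw$normal) < floor_value
raw
```

In this toy case a near-zero disease intensity yields a fold change of about -14087, which shows how a value like the one in the question could arise from either a failed probe or a gene that is genuinely silenced.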