removal of outliers in matrix


Johannes Hanson ▴ 20

@johannes-hanson-1604

Last seen 9.6 years ago

Dear all,

After some work analysing microarray data, I am now facing my first metabolomics dataset. The first problem I encountered is that the structure of the data is different from what I am used to. Due to the alignment of the chromatogram, I have extreme outliers within the dataset. The alignment is good (and I don't want to manually adjust 8000 peaks). If I could easily remove the outliers, the rest of the analysis would be easier.

The outliers I want to remove are most often a total lack of signal because the peak is missing. I have five replicates of each treatment, and I am looking for something that would remove only the extreme outliers (sample number nine in the example below).

A typical outlier:

Untreated
0.00040016 0.001029071 0.00101226 0.000739958 0.000288475
Treated
5.58151787 4.146639291 4.080655391 0.00120032 4.786810001

The data is structured as a matrix with one line per peak and the replicates as individual columns (much like microarray data).

Thanks for any suggestions on how to continue.

Johannes

Alignment • 1.3k views

updated 16.4 years ago by Saroj Mohapatra ▴ 450 • written 16.4 years ago by Johannes Hanson ▴ 20


Saroj Mohapatra ▴ 450

@saroj-mohapatra-1446

Last seen 9.6 years ago

Hello Johannes,

If I understand correctly, you have a matrix of data with variables (metabolites) as rows and sample replicates as columns. For example, for two metabolites:

> my.data
        Con.1 Con.2 Con.3 Con.4 Con.5 Trt.1 Trt.2 Trt.3 Trt.4 Trt.5
Metab.1     0     0     0     0     0  5.58  4.15  4.08  0.00  4.79
Metab.2     0     0     0     0     0  5.58  0.00  4.08  4.08  4.79

The outliers are Trt.4 for Metab.1 and Trt.2 for Metab.2.

You could use a simple rule, such as flagging any value that is more than 1 S.D. below or above the mean of the treated replicates, to detect the outliers:

> apply(my.data, 1, function(y) {x=y[6:10]; which(x<(mean(x)-sd(x)) | x>(mean(x)+sd(x)))})
Metab.1 Metab.2
      4       2

This gives, for each metabolite, the position of the outlying sample among the treated replicates (columns 6 to 10). If you want a new matrix with the outliers removed:

> new.data=t(apply(my.data, 1, function(y) {x=y[6:10];
+   sel=(x>(mean(x)-sd(x)))&(x<(mean(x)+sd(x)));
+   c(y[1:5],x[sel])}))
> new.data
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
Metab.1    0    0    0    0    0 5.58 4.15 4.08 4.79
Metab.2    0    0    0    0    0 5.58 4.08 4.08 4.79

I have assumed that (1) there is only one outlier in each row, and (2) the replicates are close to each other, except for the outlier.

HTH,
Saroj
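As a possible variation on the rule above, the flagged values could be set to NA instead of being dropped, so that the matrix keeps one column per replicate even if some rows have no outlier or more than one. The following is only a sketch under that assumption: `mask_outliers` and its arguments `cols` and `k` are hypothetical names that do not appear in the post, and the toy matrix simply repeats the rounded values from the example.

```r
# Toy data copied from the example above (values rounded as in the post).
my.data <- matrix(
  c(0, 0, 0, 0, 0, 5.58, 4.15, 4.08, 0.00, 4.79,
    0, 0, 0, 0, 0, 5.58, 0.00, 4.08, 4.08, 4.79),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Metab.1", "Metab.2"),
                  c(paste0("Con.", 1:5), paste0("Trt.", 1:5)))
)

# Hypothetical helper (not from the post): within one group of replicate
# columns, set every value further than k SDs from its row mean to NA.
mask_outliers <- function(m, cols, k = 1) {
  sub <- m[, cols, drop = FALSE]
  mu  <- rowMeans(sub)               # one mean per row
  s   <- apply(sub, 1, sd)           # one SD per row
  sub[abs(sub - mu) > k * s] <- NA   # mu and s recycle down the rows
  m[, cols] <- sub
  m
}

# Same 1-SD rule as in the post, applied to the treated columns only.
masked <- mask_outliers(my.data, cols = 6:10)
print(masked)
```

Row summaries can then skip the masked values, for example `rowMeans(masked[, 6:10], na.rm = TRUE)`. With only five replicates a 1-SD cutoff is quite strict, so `k` may need to be adjusted for real data.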

16.4 years ago • Saroj Mohapatra ▴ 450
