Question

MassiR Gender Prediction on Affy microarrays

0

Entering edit mode

t.demooij • 0

@tdemooij-8817

Last seen 8.6 years ago

United States

Dear all,

Please forgive any follies, I am an R novice.

I have downloaded ~1000 GPL570 (Affymetrix U133 Plus 2) microarrays from GEO. Of these, 120 are not annotated with a gender. I want to use MassiR to predict the gender of my samples. However, I have some trouble loading the code. From looking up the error message (See below in bold and highlighted) on other places on this Forum, i gather that it may be due to:

A) Use of commas instead or periods in the values, or data being recognized as text rather than numbers. Or,

B) Plotting a figure consisting of NA (missing values)

I understand that subsequent errors are probably generated because in the output from "massi.y.out" is taken as an input. The ">massi_y_plot(massi.y.out)" returned empty.

Thank you for your help,

Tristan

> library(massiR)

Loading required package: cluster

Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

lowess

Loading required package: diptest

> data(massi.eset)

> data(y.probes)

> probes <-

+ y.probes$affy_hg_u133_plus_2

> rownames(probes)

[1] "1565320_at" "224143_at" "224142_s_at" "233151_s_at" "234309_at"

[6] "217261_at" "216665_s_at" "216673_at" "243384_at" "216374_at"

[11] "207893_at" "201909_at" "207246_at" "230760_at" "207247_s_at"

[16] "233178_at" "217049_x_at" "211227_s_at" "210292_s_at" "207918_s_at"

[21] "217160_at" "208220_x_at" "211462_s_at" "214983_at" "228492_at"

[26] "206624_at" "208805_at" "216376_x_at" "216786_at" "235941_s_at"

[31] "205001_s_at" "1570360_s_at" "205000_at" "1570359_at" "208339_at"

[36] "211149_at" "208067_x_at" "210322_x_at" "224007_at" "211460_at"

[41] "207063_at" "224003_at" "209772_s_at" "1562313_at" "232618_at"

[46] "223645_s_at" "236694_at" "223646_s_at" "214131_at" "206769_at"

[51] "206922_at" "207703_at" "1554125_a_at" "207646_s_at" "207647_at"

[56] "224052_at" "224293_at" "206700_s_at" "1555743_s_at" "211614_at"

[61] "204410_at" "204409_s_at" "221179_at" "208307_at" "224292_at"

[66] "207916_at" "216664_at" "224041_at" "224040_at" "1552952_at"

[71] "216544_at" "234913_at" "208332_at" "232402_at" "211461_at"

[76] "234715_at" "224270_at" "208331_at" "208282_x_at" "207909_x_at"

[81] "216922_x_at" "208281_x_at" "216351_x_at" "207912_s_at" "1555444_a_at"

[86] "234931_at" "1556677_at" "237225_at" "204060_s_at" "206279_at"

[91] "1569787_at" "224195_at" "224174_at" "216842_x_at"

> massi.y.out <- massi_y(massi.eset, probes)

> massi_y_plot(massi.y.out)

Error in plot.window(xlim, ylim, log = log, ...) :

need finite 'xlim' values

In addition: Warning messages:

1: In min(w.l) : no non-missing arguments to min; returning Inf

2: In max(w.r) : no non-missing arguments to max; returning -Inf

3: In min(x) : no non-missing arguments to min; returning Inf

4: In max(x) : no non-missing arguments to max; returning -Inf

> massi.select.out <-

+ massi_select(massi.eset, probes, threshold=4)

> head(massi.select.out)[,1:5]

[1] S1 S2 S3 S4 S5

<0 rows> (or 0-length row.names)

> results <- massi_cluster(massi.select.out)

Error in pam(x = y_data.subset.t, k = 2) :

No clustering performed, NAs in the computed dissimilarity matrix.

> sample.results <- data.frame(results[[2]])

Error in data.frame(results[[2]]) : object 'results' not found

> head(sample.results)

Error in head(sample.results) : object 'sample.results' not found

> massi_cluster_plot(massi.select.out, results)

Error in heatmap.2(x = as.matrix(massi_select_data[ord, ]), keysize = 2, :

`x' must be a numeric matrix

> summary(massi.eset)

Length Class Mode

1 ExpressionSet S4

> summary(probes)

< table of extent 0 x 0 >

microarray affy massiR • 2.0k views

ADD COMMENT • link 8.6 years ago • updated 8.5 years ago t.demooij • 0

score 0 · Answer 1 · 2015-09-30

It seems you are trying to run the analysis using the test dataset which is from an Illumina platform and the probes you are using are from Affymetrix. It should work if you change the probes variable

library(massiR)
data(massi.eset)
data(y.probes)
probes <- y.probes$illumina_humanht_12
head(rownames(probes))
massi.y.out <- massi_y(massi.eset, probes)

HTH

score 0 · Answer 2 · 2015-10-01

Dear Sam,

Thanks for your help. If I change the code to what you cite, I just get the analysis from the sample data (from http://www.bioconductor.org/packages/release/bioc/vignettes/massiR/inst/doc/massiR_Vignette.pdf). This leads me to believe I need to call out the dataset differently. Can you help? The dataset does consist of only affymetrix *.cel files. Here's the code output:

> source("http://bioconductor.org/biocLite.R")
Bioconductor version 3.1 (BiocInstaller 1.18.4), ?biocLite for help
> biocLite()
BioC_mirror: http://bioconductor.org
Using Bioconductor version 3.1 (BiocInstaller 1.18.4), R version 3.2.2.
Old packages: 'IRanges'
Update all/some/none? [a/s/n]:
a
trying URL 'http://bioconductor.org/packages/3.1/bioc/bin/macosx/mavericks/contrib/3.2/IRanges_2.2.8.tgz'
Content type 'application/x-gzip' length 1555841 bytes (1.5 MB)
==================================================
downloaded 1.5 MB

The downloaded binary packages are in
/var/folders/6c/kxwlrfw57qlg32n4jzvzy_500000gn/T//RtmpyA7sTn/downloaded_packages
> library(affy)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
rownames, sapply, setdiff, sort, table, tapply, union, unique,
unlist, unsplit

Loading required package: Biobase
Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

> Data<-ReadAffy()
> eset<-rma(Data)
Creating a generic function for ‘nchar’ from package ‘base’ in package ‘S4Vectors’
Background correcting
Normalizing
Calculating Expression
> library(massiR)
Loading required package: cluster
Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

lowess

Loading required package: diptest
> data(massi.eset)
> data(y.probes)
> probes <- y.probes$illumina_humanht_12
> head(rownames(probes))
[1] "ILMN_2068522" "ILMN_1794407" "ILMN_2087817" "ILMN_2224454" "ILMN_1810319"
[6] "ILMN_2224230"
> massi.y.out <- massi_y(massi.eset, probes)
> massi_y_plot(massi.y.out)
> massi.select.out <-
+ massi_select(massi.eset, probes, threshold=4)
> head(massi.select.out)[,1:5]
S1 S2 S3 S4 S5
ILMN_1670821 5.746427 5.686032 6.307110 6.179258 6.594808
ILMN_1685690 5.459125 5.567289 6.919465 6.789817 6.559376
ILMN_1739587 5.883483 5.764190 6.441775 6.438789 6.707278
ILMN_1755537 5.882456 5.831844 8.133164 8.052959 8.298985
ILMN_1772163 5.696833 5.680091 5.907170 6.017871 6.465122
ILMN_1804958 5.815093 5.654395 5.929610 6.104089 5.868732
> results <- massi_cluster(massi.select.out)
> sample.results <- data.frame(results[[2]])
> head(sample.results)
ID mean_y_probes_value y_probes_sd z_score sex
1 S1 5.911089 0.4572756 -0.6453629 female
2 S10 6.749520 0.8418586 0.7773050 male
3 S11 5.689586 0.4750484 -1.1074329 female
4 S12 6.702993 0.7894613 0.7045705 male
5 S13 5.838450 0.6759924 -0.7193198 female
6 S14 5.819845 0.6593184 -0.7524047 female
> massi_cluster_plot(massi.select.out, results)

score 0 · Answer 3 · 2015-10-02

0

Entering edit mode

Sam Buckberry • 0

@sam-buckberry-8071

Last seen 8.6 years ago

Australia

Yes. In your code you are calling the 'massi.eset' object in the massi_y() function.

Try

massi.y.out <- massi_y(eset, probes)

ADD COMMENT • link 8.6 years ago Sam Buckberry • 0

score 0 · Answer 4 · 2015-10-02

Yeah, removing the massi from parts of the code worked. Here's the fully functional code, in case anyone needs to do this also.

> source("http://bioconductor.org/biocLite.R") 

Bioconductor version 3.1 (BiocInstaller 1.18.4), ?biocLite for help

> biocLite()

BioC_mirror: http://bioconductor.org

Using Bioconductor version 3.1 (BiocInstaller 1.18.4), R version 3.2.2.

> library(affy)

Loading required package: BiocGenerics

Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,

    clusterExport, clusterMap, parApply, parCapply, parLapply,

    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

    xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,

    do.call, duplicated, eval, evalq, Filter, Find, get, intersect,

    is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,

    pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,

    rownames, sapply, setdiff, sort, table, tapply, union, unique,

    unlist, unsplit

Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with

    'browseVignettes()'. To cite Bioconductor, see

    'citation("Biobase")', and for packages 'citation("pkgname")'.

#Change working directory to where you keep your files

> Data<-ReadAffy()

> eset<-rma(Data)

Creating a generic function for ‘nchar’ from package ‘base’ in package ‘S4Vectors’

Background correcting

Normalizing

Calculating Expression

> library(massiR)

Loading required package: cluster

Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess

Loading required package: diptest

> data(massi.eset)

> data(y.probes)

> probes <-

+ y.probes$affy_hg_u133_plus_2

> rownames(probes)

 [1] "1565320_at"   "224143_at"    "224142_s_at"  "233151_s_at"  "234309_at"   

 [6] "217261_at"    "216665_s_at"  "216673_at"    "243384_at"    "216374_at"   

[11] "207893_at"    "201909_at"    "207246_at"    "230760_at"    "207247_s_at" 

[16] "233178_at"    "217049_x_at"  "211227_s_at"  "210292_s_at"  "207918_s_at" 

[21] "217160_at"    "208220_x_at"  "211462_s_at"  "214983_at"    "228492_at"   

[26] "206624_at"    "208805_at"    "216376_x_at"  "216786_at"    "235941_s_at" 

[31] "205001_s_at"  "1570360_s_at" "205000_at"    "1570359_at"   "208339_at"   

[36] "211149_at"    "208067_x_at"  "210322_x_at"  "224007_at"    "211460_at"   

[41] "207063_at"    "224003_at"    "209772_s_at"  "1562313_at"   "232618_at"   

[46] "223645_s_at"  "236694_at"    "223646_s_at"  "214131_at"    "206769_at"   

[51] "206922_at"    "207703_at"    "1554125_a_at" "207646_s_at"  "207647_at"   

[56] "224052_at"    "224293_at"    "206700_s_at"  "1555743_s_at" "211614_at"   

[61] "204410_at"    "204409_s_at"  "221179_at"    "208307_at"    "224292_at"   

[66] "207916_at"    "216664_at"    "224041_at"    "224040_at"    "1552952_at"  

[71] "216544_at"    "234913_at"    "208332_at"    "232402_at"    "211461_at"   

[76] "234715_at"    "224270_at"    "208331_at"    "208282_x_at"  "207909_x_at" 

[81] "216922_x_at"  "208281_x_at"  "216351_x_at"  "207912_s_at"  "1555444_a_at"

[86] "234931_at"    "1556677_at"   "237225_at"    "204060_s_at"  "206279_at"   

[91] "1569787_at"   "224195_at"    "224174_at"    "216842_x_at" 

> massi.y.out <- massi_y(eset, probes)

> massi_y_plot(massi.y.out)

> massi.select.out <-

+ massi_select(eset, probes, threshold=4)

#I've found that threshold 4 works better for this probe set than 3.

> head(massi.select.out)[,1:5]

            GSM80565.CEL GSM80566.CEL GSM80567.CEL GSM80568.CEL GSM80569.CEL

1565320_at      4.582043     4.108978     4.485962     4.329909     4.341401

201909_at       8.232437     8.706362     4.896416     9.207012     8.727713

204060_s_at     4.796196     4.681760     4.397050     4.994450     4.078981

204409_s_at     4.725281     4.704018     2.818979     4.796895     4.476516

204410_at       3.315085     3.471119     2.645960     3.165650     3.577728

205000_at       5.579820     5.817207     2.584345     5.956768     5.686904

> results <- massi_cluster(massi.select.out)

> sample.results <- data.frame(results[[2]])

> head(sample.results)

            ID mean_y_probes_value y_probes_sd    z_score    sex

1 GSM80565.CEL            4.540014   1.2529928  0.4884207   male

2 GSM80566.CEL            4.700915   1.3709173  0.6681523   male

3 GSM80567.CEL            3.602066   0.8182796 -0.7231144 female

4 GSM80568.CEL            4.508269   1.4099205  0.3791156   male

5 GSM80569.CEL            4.620233   1.3520400  0.5671010   male

6 GSM80570.CEL            4.733685   1.4062835  0.7098807   male

> head(results)

#This is a useful piece of code that will generate a list of the gender prediction, which you may then copy to third party software

$cluster.data

Medoids:

             ID 1565320_at 201909_at 204060_s_at 204409_s_at 204410_at

GSM80660.CEL 62   4.519028  8.654829    4.954312    4.946165  3.840129

GSM80643.CEL 51   4.582904  4.576714    4.769909    2.834069  2.799607

             205000_at 205001_s_at 206624_at 206700_s_at 207063_at 207703_at

GSM80660.CEL  6.464741    5.291633  3.751901    6.915360  4.343147  4.799728

GSM80643.CEL  2.543636    3.943899  2.486193    4.092055  3.549006  3.751694

             208281_x_at 210292_s_at 211149_at 214131_at 214983_at 216351_x_at

GSM80660.CEL    3.811135    5.709114  2.845757  4.156809  4.547827    3.669275

GSM80643.CEL    3.695598    4.793473  2.534841  3.787987  3.809421    3.553610

             216376_x_at 223645_s_at 223646_s_at 228492_at 230760_at 232618_at

GSM80660.CEL    4.280756    5.233865    5.231600  5.192282  3.348659  2.887069

GSM80643.CEL    3.954610    4.203071    3.864447  3.170545  2.467412  2.833363

             236694_at

GSM80660.CEL  3.943838

GSM80643.CEL  3.214124

Clustering vector:

GSM80565.CEL GSM80566.CEL GSM80567.CEL GSM80568.CEL GSM80569.CEL GSM80570.CEL 

           1            1            2            1            1            1 

GSM80571.CEL GSM80572.CEL GSM80573.CEL GSM80574.CEL GSM80575.CEL GSM80581.CEL 

           1            1            1            2            1            2 

GSM80585.CEL GSM80586.CEL GSM80587.CEL GSM80591.CEL GSM80592.CEL GSM80593.CEL 

           1            1            1            2            2            2 

GSM80594.CEL GSM80595.CEL GSM80596.CEL GSM80597.CEL GSM80598.CEL GSM80599.CEL 

           2            2            2            2            2            2 

GSM80600.CEL GSM80601.CEL GSM80611.CEL GSM80612.CEL GSM80613.CEL GSM80614.CEL 

           2            2            1            1            2            1 

GSM80616.CEL GSM80617.CEL GSM80618.CEL GSM80619.CEL GSM80620.CEL GSM80621.CEL 

           1            1            2            1            1            1 

GSM80622.CEL GSM80623.CEL GSM80626.CEL GSM80627.CEL GSM80628.CEL GSM80629.CEL 

           2            1            1            1            1            2 

GSM80630.CEL GSM80636.CEL GSM80637.CEL GSM80638.CEL GSM80639.CEL GSM80640.CEL 

           2            2            2            2            2            2 

GSM80641.CEL GSM80642.CEL GSM80643.CEL GSM80644.CEL GSM80645.CEL GSM80646.CEL 

           2            2            2            2            2            2 

GSM80647.CEL GSM80648.CEL GSM80649.CEL GSM80650.CEL GSM80651.CEL GSM80652.CEL 

           2            2            2            1            1            2 

GSM80653.CEL GSM80660.CEL GSM80661.CEL GSM80662.CEL GSM80663.CEL GSM80664.CEL 

           2            1            2            2            1            1 

GSM80665.CEL GSM80666.CEL GSM80667.CEL GSM80668.CEL GSM80669.CEL GSM80670.CEL 

           1            1            1            1            2            1 

GSM80671.CEL GSM80675.CEL GSM80676.CEL GSM80677.CEL GSM80678.CEL GSM80679.CEL 

           1            2            2            2            2            2 

GSM80680.CEL GSM80681.CEL GSM80682.CEL GSM80683.CEL GSM80684.CEL GSM80690.CEL 

           2            2            2            2            2            1 

GSM80691.CEL GSM80692.CEL GSM80693.CEL GSM80699.CEL GSM80700.CEL GSM80701.CEL 

           1            2            1            1            1            2 

GSM80702.CEL GSM80703.CEL GSM80704.CEL GSM80705.CEL GSM80706.CEL GSM80708.CEL 

           1            1            1            2            2            1 

GSM80709.CEL GSM80711.CEL GSM80713.CEL GSM80714.CEL 

           2            1            1            1 

Objective function:

  build    swap 

2.18487 2.07933 

Available components:

 [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 

 [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      


$massi.results

              ID mean_y_probes_value y_probes_sd     z_score    sex

1   GSM80565.CEL            4.540014   1.2529928  0.48842071   male

...

100 GSM80714.CEL            4.684840   1.2652858  0.82444022   male

#1 ... 100: That list I mentioned earlier.

> massi_cluster_plot(massi.select.out, results)

#On Mac you only have 1 Quartz screen. However, the cluster_plot function has a heat map, PCI and histogram output. Use the back and forth options in the Quartz dropdown menu to switch between these different outputs.

> dip.result <- massi_dip(massi.select.out)

dip test statistic is >0.08. This suggests that the proportion of male and female samples in this dataset is relatively balanced.

> dip.result <- massi_dip(massi.select.out)

dip test statistic is >0.08. This suggests that the proportion of male and female samples in this dataset is relatively balanced.

> plot(dip.result[[3]])

> dip.result <- massi_dip(massi.select.out)

dip test statistic is >0.08. This suggests that the proportion of male and female samples in this dataset is relatively balanced.

> hist(x=dip.result[[2]], breaks=20)

In my dataset, which was sufficiently balanced according to dip analysis, I've found 98% accuracy in 200 test samples.

I hope this helped!

score 0 · Answer 5 · 2015-10-02

0

Entering edit mode

t.demooij • 0

@tdemooij-8817

Last seen 8.6 years ago

United States

Thank you, Sam Buckberry!

ADD COMMENT • link 8.5 years ago t.demooij • 0