Question

Bioinformatics exercises for beginner

Sara • 0

Hello, I am a beginner and I am reading the book "Bioinformatics and Computational Biology Solutions Using R and Bioconductor" as part of my bachelor's degree. I installed R on my Ubuntu Linux machine and it works.

I am working on the exercise in the first chapter (page 15, "Preprocessing HDO Arrays"). Unfortunately I am not able to find the dataset that has to be used in the exercise 'Data <- ReadAffy()'. Moreover, the affydata page on the website [1] links to documentation [2] that refers to a broken link [3], so I can't continue.

Could anyone help me by providing updated details about how to start using Bioconductor with R in a practical and proficient way?

Thank you

[1] http://bioconductor.org/packages/2.0/data/experiment/html/affydata.html
[2] http://bioconductor.org/packages/2.0/data/experiment/vignettes/affydata/inst/doc/affydata.pdf
[3] http://qolotus02.genelogic.com/datasets.nsf

Best regards,
Sara

score 1 · Answer 1 · 2022-02-14

Hi Sara,

You are being too literal. The sentence immediately preceding that code says 'A typical invocation is', which is meant to imply that the authors are not reading any particular dataset with that call. Normally you start R in the directory containing your CEL files and then call ReadAffy(), which automatically reads in all the CEL files in that directory.
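
If you would rather not start R in (or setwd() into) the CEL directory, ReadAffy() also takes a celfile.path argument. A minimal sketch, assuming a folder called my_cel_files, which is a made-up name; find_cel_files() is a hypothetical helper, only there to preview which files would be picked up:

```r
## find_cel_files() is a hypothetical helper, not part of affy.
## It lists files ending in .CEL or .CEL.gz (case-insensitive).
find_cel_files <- function(dir) {
  list.files(dir, pattern = "\\.cel(\\.gz)?$", ignore.case = TRUE)
}

cel_dir <- "my_cel_files"   # assumption: replace with your own folder

## only try to read if the folder exists and affy is installed
if (dir.exists(cel_dir) && requireNamespace("affy", quietly = TRUE)) {
  print(find_cel_files(cel_dir))
  abatch <- affy::ReadAffy(celfile.path = cel_dir)
}
```

Anything that is not a CEL file (a tar archive, notes, and so on) is simply ignored by the listing.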

If you don't have your own data, you can use GEO to get some. Here's a working example.
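
The example that follows needs the GEOquery and affy packages (and oligo further down). If they are not installed yet, something along these lines should work. It is only a sketch: missing_pkgs() and install_if_missing() are names I made up, and the install step needs internet access.

```r
## missing_pkgs() is a hypothetical helper, not part of any package.
## It returns the names of packages that cannot be loaded.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

## install_if_missing() is also a made-up name. It installs BiocManager
## from CRAN if needed, then uses it to install whatever is missing.
install_if_missing <- function(pkgs) {
  todo <- missing_pkgs(pkgs)
  if (length(todo) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
    BiocManager::install(todo)
  }
  invisible(todo)
}

## run this yourself when you are ready to install:
## install_if_missing(c("GEOquery", "affy", "oligo"))
```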

## load GEOquery and affy packages
> library(GEOquery)
> library(affy)

## get the raw data from an HG-U95A_V2 array
> getGEOSuppFiles("GSE117247")
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE117nnn/GSE117247/suppl//GSE117247_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 58685440 bytes (56.0 MB)
downloaded 56.0 MB

                                                         size isdir mode
C:/Users/jmacdon/Desktop/GSE117247/GSE117247_RAW.tar 58685440 FALSE  666
                                                                   mtime
C:/Users/jmacdon/Desktop/GSE117247/GSE117247_RAW.tar 2022-02-14 11:13:15
                                                                   ctime
C:/Users/jmacdon/Desktop/GSE117247/GSE117247_RAW.tar 2022-02-14 11:05:29
                                                                   atime exe
C:/Users/jmacdon/Desktop/GSE117247/GSE117247_RAW.tar 2022-02-14 11:13:15  no

## go into downloaded directory and untar the contents
> setwd("GSE117247/")
> untar("GSE117247_RAW.tar")

## What do we have here?
> dir()
 [1] "GSE117247_RAW.tar"                "GSM3289857_1_1_Normal_1.CEL.gz"  
 [3] "GSM3289858_1_10_Normal_10.CEL.gz" "GSM3289859_1_2_Normal_2.CEL.gz"  
 [5] "GSM3289860_1_3_Normal_3.CEL.gz"   "GSM3289861_1_4_Normal_4.CEL.gz"  
 [7] "GSM3289862_1_5_Normal_5.CEL.gz"   "GSM3289863_1_6_Normal_6.CEL.gz"  
 [9] "GSM3289864_1_7_Normal_7.CEL.gz"   "GSM3289865_1_8_Normal_8.CEL.gz"  
[11] "GSM3289866_1_9_Normal_9.CEL.gz"   "GSM3289867_3_1_SCC_1.CEL.gz"     
[13] "GSM3289868_3_2_SCC_2.CEL.gz"      "GSM3289869_3_3_SCC_3.CEL.gz"     
[15] "GSM3289870_3_4_SCC_4.CEL.gz"      "GSM3289871_3_5_SCC_5.CEL.gz"     
[17] "GSM3289872_3_6_SCC_6.CEL.gz"      "GSM3289873_3_7_SCC_7.CEL.gz"     
[19] "GSM3289874_3_8_SCC_8.CEL.gz"      "GSM3289875_5_1_TSCC_1.CEL.gz"    
[21] "GSM3289876_5_2_TSCC_2.CEL.gz"     "GSM3289877_5_3_TSCC_3.CEL.gz"    
[23] "GSM3289878_5_4_TSCC_4.CEL.gz"     "GSM3289879_5_5_TSCC_5.CEL.gz"    

## read it into an AffyBatch
> abatch <- ReadAffy()
> abatch

AffyBatch object
size of arrays=640x640 features (27 kb)
cdf=HG_U95Av2 (12625 affyids)
number of samples=23
number of genes=12625
annotation=hgu95av2
notes=
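
Note that dir() showed 24 entries but the AffyBatch has 23 samples; the extra entry is the tar archive, which ReadAffy() ignores. The file names also carry the sample group (10 Normal, 8 SCC and 5 TSCC in the listing above), which you will want later for any comparison. A small sketch of pulling that out of the names; sample_group() is a hypothetical helper and the pattern is an assumption based only on the names shown here:

```r
## sample_group() is a hypothetical helper: it extracts the group label
## from file names shaped like "GSM3289867_3_1_SCC_1.CEL.gz"
sample_group <- function(files) {
  sub("^GSM[0-9]+_[0-9]+_[0-9]+_([A-Za-z]+)_[0-9]+\\.CEL\\.gz$", "\\1", files)
}

## three of the file names from the dir() listing
files <- c("GSM3289857_1_1_Normal_1.CEL.gz",
           "GSM3289867_3_1_SCC_1.CEL.gz",
           "GSM3289875_5_1_TSCC_1.CEL.gz")
groups <- sample_group(files)
print(groups)

## in the real session you could tabulate all of them with:
## table(sample_group(dir(pattern = "\\.CEL\\.gz$")))
```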

But do note that the book you are reading is relatively old, and the affy package has been superseded by the oligo package. If you plan to use anything but old arrays, you should switch.

## load oligo, then read in. Note you have to tell oligo that the files are gzipped
> library(oligo)
> z <- read.celfiles(list.celfiles(listGzipped = TRUE))
Loading required package: pd.hg.u95av2
Loading required package: RSQLite
Loading required package: DBI
Platform design info loaded.
Reading in : GSM3289857_1_1_Normal_1.CEL.gz
Reading in : GSM3289858_1_10_Normal_10.CEL.gz
Reading in : GSM3289859_1_2_Normal_2.CEL.gz
Reading in : GSM3289860_1_3_Normal_3.CEL.gz
Reading in : GSM3289861_1_4_Normal_4.CEL.gz
Reading in : GSM3289862_1_5_Normal_5.CEL.gz
Reading in : GSM3289863_1_6_Normal_6.CEL.gz
Reading in : GSM3289864_1_7_Normal_7.CEL.gz
Reading in : GSM3289865_1_8_Normal_8.CEL.gz
Reading in : GSM3289866_1_9_Normal_9.CEL.gz
Reading in : GSM3289867_3_1_SCC_1.CEL.gz
Reading in : GSM3289868_3_2_SCC_2.CEL.gz
Reading in : GSM3289869_3_3_SCC_3.CEL.gz
Reading in : GSM3289870_3_4_SCC_4.CEL.gz
Reading in : GSM3289871_3_5_SCC_5.CEL.gz
Reading in : GSM3289872_3_6_SCC_6.CEL.gz
Reading in : GSM3289873_3_7_SCC_7.CEL.gz
Reading in : GSM3289874_3_8_SCC_8.CEL.gz
Reading in : GSM3289875_5_1_TSCC_1.CEL.gz
Reading in : GSM3289876_5_2_TSCC_2.CEL.gz
Reading in : GSM3289877_5_3_TSCC_3.CEL.gz
Reading in : GSM3289878_5_4_TSCC_4.CEL.gz
Reading in : GSM3289879_5_5_TSCC_5.CEL.gz

## summarize using rma from each package
> affyRMA <- affy::rma(abatch)
Background correcting
Normalizing
Calculating Expression
> oligoRMA <- oligo::rma(z)
Background correcting
Normalizing
Calculating Expression

## are the results the same?
> all.equal(exprs(affyRMA), exprs(oligoRMA))
[1] TRUE

## survey says yes
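
A side note on all.equal(): it checks for near-equality within a small numeric tolerance (about 1.5e-8 by default) rather than bit-for-bit identity, and when things differ it returns a description instead of FALSE, so wrap it in isTRUE() if you use it in an if(). A toy illustration with made-up numbers (the probeset and sample names are invented, not taken from the data above):

```r
## two small matrices that differ only by floating-point-sized noise
a <- matrix(c(7.1, 8.2, 9.3, 10.4), nrow = 2,
            dimnames = list(c("probeset1", "probeset2"), c("s1", "s2")))
b <- a + 1e-12

exact <- identical(a, b)          # exact comparison
close <- isTRUE(all.equal(a, b))  # comparison within a small tolerance
print(c(exact = exact, close = close))

## largest absolute difference, a handy number to look at directly
print(max(abs(a - b)))
```

The same max(abs(...)) check on exprs(affyRMA) - exprs(oligoRMA) would tell you how far apart the two sets of results really are.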