Question

Possible ways of performing non-specific intensity filtering on Perfect-match only affymetrix array

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 13 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor Community,

i have 10 CEL files of the primeview PM-only affymetrix array, and i have pre-prossessed and normalized with rma. I would like to perform Present/Absent calls for non-specific intensity filtering, to remove probesets which are consistenly non expressed(i.e. "absent" in more than 5 of the 10 samples). As i cannot install xps due to the fact that is not available to my current R version( i also tried on 32-bit version on my windows 8 but it didnt worked either), i found the function paCalls in the oligo package: which has the option "DABG" or "PSDABG".

Thus, my main question is that should i use on of the above options with paCalls, or these are inapropriate because in the first place DABG was designed for exon arrays?

On the other hand, if i would like to perform a more general filtering on mean log-expression, could i use the genefilter package and try something like this: i.e., filter <- pOverA(0.5, log2(100)) ? Or is too strict ??

Just to mention after rma normalization i found a lot of probesets with very low log2 intensity(from 0-2) and some with negative values.

Any suggestion or proposal would be essential !!!

oligo DABG non-specific filtering primeview • 3.1k views

ADD COMMENT • link 9.6 years ago svlachavas ▴ 830

score 1 · Answer 1 · 2015-04-22

1

Entering edit mode

Stephen Piccolo ▴ 600

@stephen-piccolo-6761

Last seen 4.2 years ago

United States

You might consider trying the UPC function in the SCAN.UPC package. It is designed to indicate whether a gene is expressed or not and can be applied to all types of Affymetrix arrays.

ADD COMMENT • link 9.6 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Dear Stephen,

thank you for your answer !! i didnt know about the specific package, so i would like to ask you if i can use it after the normalization on the probeset level ? or it requires other imputs ?

ADD REPLY • link 9.6 years ago svlachavas ▴ 830

1

Entering edit mode

The simplest way would be to use UPC for normalization. You can input .CEL files, and it will output a UPC value for each probeset. Alternatively, you could apply UPC (using the "generic" functions in SCAN.UPC) to prenormalized data. But since you are working with Affy data, it seems simpler to start with the .CEL files.

ADD REPLY • link 9.6 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Please excuse me for one more question, but i had a first look on the vignette--thus in order to use the UPC function i have to use only the normalization implemented by UPC? Moreover, if im not mistaken is a normalization process for each CEL file(-array) separately ?

ADD REPLY • link 9.6 years ago svlachavas ▴ 830

0

Entering edit mode

Yes. UPC processes each CEL file separately. But you can specify wildcards, so it is easy to process multiple CEL files at a time. Even though it processes one sample at a time, we believe it performs as well as or better than methods such as RMA that combine data across samples (see our papers).

ADD REPLY • link 9.6 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Yes, i found the paper on Elzevier so it would be very interesting to learn more and implement your approach, and more importantly your methodology--if i have more questions especially on the UPC methodology i will return with more questions

ADD REPLY • link 9.6 years ago svlachavas ▴ 830

0

Entering edit mode

One main question regarding the vignette, as you mentioned earlier, i have the celfile directory with the 10 CEL files. When i naively tried SCAN with the celfilePatern=datadir (where datadir the directory of the CEL files), i got the following error and also a warning:

Error in { : task 1 failed - "These are directories:
C:/Users/Efstathios/Desktop/totalcel"
In addition: Warning message:
executing %dopar% sequentially: no parallel backend registered

Thus, how i could specify the wildcards you mentioned earlier ? Thank you again for your time !!

ADD REPLY • link 9.6 years ago svlachavas ▴ 830

0

Entering edit mode

Assuming I understand your question correctly, you would run it something like this:

UPC("C:/Users/Efstathios/Desktop/totalcel/*.CEL", outFilePath="myoutput.txt")

ADD REPLY • link 9.6 years ago Stephen Piccolo ▴ 600

score 0 · Answer 2 · 2015-04-23

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 13 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Stephen,

after your notification when i simply tried:

object <- SCAN("C:/Users/Efstathios/Desktop/totalcel/*.CEL", outFilePath="myoutput.txt")

but unfortunately it resulted in:

Package 'pd.primeview' was not found in the BioConductor repository.

The 'pdInfoBuilder' package can often be used in situations like this.

Loading required package: pd.primeview

Attempting to obtain 'pd.primeview' from BioConductor website.

Checking to see if your internet connection works...

Package 'pd.primeview' was not found in the BioConductor repository.

The 'pdInfoBuilder' package can often be used in situations like this.

Error in { :

task 1 failed - "The annotation package, pd.primeview, could not be loaded."

In addition: Warning messages:

1: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :...

i also tried

biocLite("primeview.db")

BioC_mirror: http://bioconductor.org

Using Bioconductor version 3.0 (BiocInstaller 1.16.4), R version 3.1.1.

Installing package(s) 'primeview.db'

Warning message:

package ‘primeview.db’ is not available (for R version 3.1.1)

Do you think that is a problem due to the recent new version of Bioconductor ?

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Greek_Greece.1253 LC_CTYPE=Greek_Greece.1253 LC_MONETARY=Greek_Greece.1253
[4] LC_NUMERIC=C LC_TIME=Greek_Greece.1253

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] simpleaffy_2.42.0 gcrma_2.38.0 genefilter_1.48.1 affy_1.44.0
[5] Biobase_2.26.0 BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] affyio_1.34.0 affyPLM_1.42.0 annotate_1.44.0 AnnotationDbi_1.28.2
[5] BiocInstaller_1.16.4 Biostrings_2.34.1 DBI_0.3.1 GenomeInfoDb_1.2.5
[9] IRanges_2.0.1 preprocessCore_1.28.0 RSQLite_1.0.0 S4Vectors_0.4.0
[13] splines_3.1.1 stats4_3.1.1 survival_2.38-1 tools_3.1.1
[17] XML_3.98-1.1 xtable_1.7-4 XVector_0.6.0 zlibbioc_1.12.0
Warning message:
package ‘BiocGenerics’ was built under R version 3.1.2

ADD COMMENT • link 9.6 years ago svlachavas ▴ 830

0

Entering edit mode

Hmmm...SCAN does depend on there being a "pd" annotation package. It appears that one has not been created for this platform. As the error message states, you could create one using pdInfoBuilder, but there may be something about the design of these arrays that makes it difficult. If you don't want to deal with this, another option would be to apply the UPC_Generic_ExpressionSet function to rma normalized data.

ADD REPLY • link 9.6 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Dear Stephen,

i use the UPC_Generic_ExpressionSet

on my rma normalized dataset,and my resulted expression set looks like this:

head(exprs(scan))
dataset_603.dat dataset_604. dat dataset_605.dat dataset_606.dat
11715111_s_at 2.965738e-07 9.909145e-08 2.767829e-08 5.957925e-08
11715113_x_at 5.540580e-08 2.033334e-07 4.882098e-08 3.347576e-07
11715114_x_at 7.286652e-08 4.551362e-07 2.166561e-08 4.330943e-07
11715124_s_at 2.884776e-08 1.784328e-08 4.464651e-09 8.284289e-09
11715125_at 2.782606e-07 2.890932e-07 1.739811e-07 2.556597e-07
11715129_s_at 1.635254e-08 5.047175e-08 5.341312e-09 2.147749e-08

Thus, afterwards how i can i use UPC ?

ADD REPLY • link 9.6 years ago svlachavas ▴ 830

0

Entering edit mode

Unless something has gone wrong, you should see many probe sets with values close to zero and many probe sets with values close to 1. And there will be a few in between. The interpretation here is that probesets with values near 1 are likely to be active in the sample, whereas probesets with values near 0 are likely to be inactive (background noise). How many probesets are on the PrimeView arrays?

ADD REPLY • link 9.6 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Dear Stephen,

please excume me for posting again, but i would like to ask you about one more important question: regarding the output of UPC_Generic_ExpressionSet () fuction on the rma dataset from above, but without initial intensity filtering:

UPCestimates <- UPC_Generic_ExpressionSet(normalized.dataset) # normalized.dataset from rma normalization

but it also gave me the below warnings:

No annotation information was present for length, so no correction will be made for this.
No annotation information was present for GC content, so no correction will be made for this.
Calculating UPC values using the normal-normal model.

So it is not approprietly executed ?

Moreover, with

head(exprs(UPCestimates))
dataset_603.dat dataset_604.dat dataset_605.dat dataset_606.dat
11715100_at 6.979218e-07 2.362883e-07 4.602797e-07 1.356901e-06
11715101_s_at 3.142123e-03 3.254098e-03 2.521463e-03 1.192469e-03
11715102_x_at 1.199584e-06 3.610192e-07 1.395787e-06 8.721665e-07
11715103_x_at 5.737275e-04 6.386170e-05 4.631829e-04 1.004890e-04
11715104_s_at 5.843028e-06 3.680387e-06 7.402574e-06 7.118657e-06
11715105_at 1.624335e-07 8.563865e-08 2.050229e-07 1.186254e-07
dataset_607.dat dataset_608.dat dataset_609.dat dataset_610.dat
11715100_at 1.799721e-06 3.832471e-07 8.969882e-07 2.797917e-07
11715101_s_at 2.239764e-03 1.886348e-03 2.715755e-03 2.508869e-03
11715102_x_at 5.538724e-06 6.452287e-07 1.505462e-06 9.826996e-07
11715103_x_at 7.170666e-04 5.779356e-04 4.283308e-04 3.930316e-04
11715104_s_at 6.772010e-06 4.415230e-05 2.757155e-06 1.249225e-05
11715105_at 1.128266e-07 1.360335e-07 3.913944e-08 8.927028e-08
dataset_611.dat dataset_612.dat
11715100_at 2.007719e-07 6.768238e-07
11715101_s_at 8.883523e-03 2.576094e-03
11715102_x_at 7.106819e-07 2.626234e-07
11715103_x_at 1.078969e-04 1.215150e-04
11715104_s_at 4.829564e-06 6.392590e-06
11715105_at 2.092360e-07 2.737968e-07

It seems(and also with the most part) that there arent any values of any probeset near zero. Is this possible ? or is something due to the rma normalization ?

ADD REPLY • link 9.6 years ago svlachavas ▴ 830

0

Entering edit mode

Those warnings are fine for the analysis you are doing. The UPC method can correct for length, but since all Affymetrix probes are the same length, this correction is not necessary. rma doesn't correct for GC content explicitly, but since it normalizes across samples, it should be OK.

You said there are no probesets with values near zero. However, the data you show above appears to all be near zero (it's in scientific notation). You might try plotting a histogram of the UPC values to see how many are close to either one or zero.

ADD REPLY • link 9.6 years ago Stephen Piccolo ▴ 600

0

Entering edit mode

Please excuse me, i was mistaken because i didnt notice in the beggining the length of zeros in each probeset. This is also similar to rma expression levels, as it seems that many probesets have many low intensities-some either have negative values. From the hist plot there a small proportion of values close to one and many to zero. But how could i substract these probesets which have expression measures near to 1, so i can filter my initial rma normalized dataset ?

ADD REPLY • link 9.6 years ago svlachavas ▴ 830

score 0 · Answer 3 · 2015-04-23

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 13 months ago

Germany/Heidelberg/German Cancer Resear…

In the PrimeView array there are 49975 probesets after normalization. But i perfomed the UPC_Generic_ExpressionSet

after an initial non-specific intensity filtering, as i had many probesets with very low intensities

ADD COMMENT • link 9.6 years ago svlachavas ▴ 830