Question

"oligo" package: probenames/probeinfo different from expr-set (read.cellfles): miRNA 4.0 (affy)

0

Entering edit mode

dietmar.pils • 0

@dietmarpils-7082

Last seen 8.2 years ago

Austria

hello,

I have a problem with the oligo package and affys miRNA 4.0 GeneChips.

I load CEL-files (affymetrix, miRNA 4.0 GeneChips) and want extract the raw probelevel values.

dat1 <- read.cellfles(files)

Using probeNames(dat1) and getProbeInfo(dat1) I get 346085 probes.

But the exprs-slot (dat1@assayData$exprs) or if I extract the expressions ( exprs(dat1) ) shows only 292681 values with rownames from 1 to 292681. Which probes are missing or more important how can I annotate these 292681 values with the correct probenames or fid's (means: feature identifier?).

--- SESSION INFO ---

R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux release 6.7 (Carbon)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] limma_3.24.15 pd.mirna.4.0_3.12.0 RSQLite_1.0.0
[4] DBI_0.3.1 oligo_1.32.0 Biostrings_2.36.4
[7] XVector_0.8.0 IRanges_2.2.9 S4Vectors_0.6.6
[10] Biobase_2.28.0 oligoClasses_1.30.0 BiocGenerics_0.14.0
[13] rj_2.0.3-1
loaded via a namespace (and not attached):
[1] affxparser_1.40.0 GenomicRanges_1.20.8 splines_3.2.2
[4] zlibbioc_1.14.0 bit_1.1-12 rj.gd_1.1.3-1
[7] foreach_1.4.3 GenomeInfoDb_1.4.3 tools_3.2.2
[10] ff_2.2-13 iterators_1.0.8 preprocessCore_1.30.0
[13] affyio_1.36.0 codetools_0.2-14 BiocInstaller_1.18.5

microarray probe annotation affy • 1.7k views

ADD COMMENT • link 8.2 years ago dietmar.pils • 0

score 0 · Answer 1 · 2016-02-03

You are misunderstanding something fundamental about this array in particular, and a fair number of the more recent Affy arrays. In earlier versions of the miRNA arrays, Affy would tile down a set of (identical) 25-mers (4 per miRNA, IIRC), for each miRNA that they were supposed to be interrogating. What this meant in practice for identically conserved miRNAs from different species was that there would be a bunch of identical 25-mers spread all over the array, that were partitioned into probesets for the different species.

Affy apparently decided that this was a waste of space, which I can't really disagree with. So with the 4.0 version of the array, if there is a miRNA that is identical among say 5 species, Affy will tile down just 4 (or whatever) 25-mers, and then recycle those probes in all five probesets.

So when you use getProbeInfo, or probeNames, you end up recycling the same probes into various probesets, resulting in more apparent probes than what you get if you use exprs to get the raw data.

score 0 · Answer 2 · 2016-02-03

0

Entering edit mode

dietmar.pils • 0

@dietmarpils-7082

Last seen 8.2 years ago

Austria

dear james,

something like this I had expected, in reality exactly this is was I thought.

Therefore, in the exprs-slot, after reading in the CEL files (before summarization) are the unique values for all probes (regardless for which probeset each probe is used). But unfortunately I don't know how to annotate this values with the (unique) probe names. They are only labeled with numbers from 1 to XXX.

I want use my own normalisation procedure before summarization, but for this I have to know which are the negative controls and which positive spike-in controls.

My problem is more an "oligo"/R-usage problem than an understanding problem of these arrays.

Can someone help me to get the annotation of the exprs-slot?

Thank you!

Dietmar

ADD COMMENT • link 8.2 years ago dietmar.pils • 0

0

Entering edit mode

When making a comment, please use the 'ADD COMMENT' link rather than 'Add your answer'.

I think you probably want to query the underlying SQLite database directly.

> library(pd.mirna.4.0)

> con <- db(pd.mirna.4.0)
> z <- dbGetQuery(con, "select man_fsetid, fid from featureSet, pmfeature where pmfeature.fsetid=featureSet.fsetid")
> head(z)
       man_fsetid    fid
1 MIMAT0000001_st  31288
2 MIMAT0000001_st  32604
3 MIMAT0000001_st  89004
4 MIMAT0000001_st 149368
5 MIMAT0000001_st 150239
6 MIMAT0000001_st 211154
> dim(z)
[1] 346085      2

And if you want to know which are controls or main probes

> probesets <- dbGetQuery(con, "select * from featureSet;")

> head(probesets)
    fsetid      man_fsetid type
1 20500000 MIMAT0000001_st    1
2 20500001 MIMAT0015091_st    1
3 20500002 MIMAT0000002_st    1
4 20500003 MIMAT0015092_st    1
5 20500004 MIMAT0020301_st    1
6 20500005 MIMAT0000003_st    1
> table(probesets$type)

    1     7    12    13    14
36205    95    27    17     9

> dbGetQuery(con, "select * from type_dict;")
   type                    type_id
1     1                       main
2     2            main->junctions
3     3                 main->psrs
4     4               main->rescue
5     5              control->affx
6     6              control->chip
7     7  control->bgp->antigenomic
8     8      control->bgp->genomic
9     9             normgene->exon
10   10           normgene->intron
11   11   rescue->FLmRNA->unmapped
12   12   control->affx->bac_spike
13   13             oligo_spike_in
14   14            r1_bac_spike_at
15   15 control->affx->polya_spike
16   16        control->affx->ercc
17   17  control->affx->ercc->step

ADD REPLY • link 8.2 years ago James W. MacDonald 65k

0

Entering edit mode

Thank you again James,

good to know how to access the db directly, but I have already this information. I only need the IDs for the raw exprs matrix in the ExpressionFeatureSet. Perhaps it is so easy, that I only miss something...

See below. I cannot assign the probe-info (n=346085) to the rows of the exprs matrix (n=292681), which shows only an index from 1 to 292681.

Unfortunately, in dat1 (ExpressionFeatureSet) there is no annotation to these 292681 rows in the exprs matrix.

>dat1 <- read.celfiles(filenames = files, verbose=TRUE)
>head(dat1@assayData$exprs)
CEL1 CEL2 CEL3 CEL4 CEL5 CEL6 CEL7 CEL8 CEL9 CEL10
1 265 267 235 244 252 342 265 293 270 310
2 20391 22209 16880 15594 19141 24108 18475 21372 17726 17710
3 360 395 292 383 372 418 386 399 332 330
4 21006 22268 16604 15389 19839 24747 18818 21025 18548 17387
5 224 172 187 176 211 224 104 199 228 97
6 56 74 75 42 64 66 78 69 61 51

>dim(dat1@assayData$exprs)
[1] 292681 10
　

>pINFO <- getProbeInfo(dat1, field=c('fid', 'fsetid', 'type'))
>head(pINFO)
fid man_fsetid fsetid type
1 6 MIMAT0027572_st 20525633 main
2 8 ENSG00000238592_st 20533397 main
3 9 U22_st 20538324 main
4 10 AFFX-BkGr-GC11_st 20538344 control->bgp->antigenomic
5 11 MIMAT0018536_st 20518032 main
6 12 ENSG00000252461_st 20533969 main

>dim(pINFO)
[1] 346085 4

ADD REPLY • link 8.2 years ago dietmar.pils • 0

0

Entering edit mode

I suppose it is obvious. The arrays are read by the scanner, by row, and when you read in the celfile it's in the same order. That is the fid.

> dbGetQuery(con, "select * from pmfeature where fid between 1 and 20;")
   fid   fsetid   atom  x y
1    6 20525633 230698  5 0
2    8 20533397 302161  7 0
3    9 20538324 356080  8 0
4   10 20538344 357183  9 0
5   11 20518032 162289 10 0
6   12 20533969 308386 11 0
7   12 20533970 308397 11 0
8   13 20534098 309788 12 0
9   14 20538434 364730 13 0
10  15 20536297 333855 14 0
11  16 20536387 334849 15 0
12  18 20535646 326732 17 0
13  20 20537767 349990 19 0

The count is zero based, so the first cell read in is at (0,0). The first cell that is used for anything is in the first row (0 on the y axis) and the sixth column (5 on the x axis). So that probe is in the sixth row of the data.frame you get from doing exprs(dat1).

So your pINFO data.frame tells you the row index (fid), the probeset that particular probe goes into, and what type of probeset it is. That should be sufficient, no?

As a side note, there is no profit in accessing slots directly (e.g., dat1@assayData$exprs) - the exprs function does the expected thing, and if Benilton changes the underlying structure, the exprs function will continue to do the expected thing, but direct queries may not.

ADD REPLY • link 8.2 years ago James W. MacDonald 65k