Question regarding Bioconductor packages "affy" and "agilp"
1
0
Entering edit mode
Maximilian ▴ 10
@maximilian-11742
Last seen 6.9 years ago

Dear Ladies and Gentlemen,

I have a question regarding the two Bioconductor packages "affy" and "agilp". For my work I need to use expression data from Agilent software in a .txt file and then use this data to create tissue specific models with the help of the COBRA toolbox.

To use the COBRA function "createTissueSpecificModel" you need to have the present/absent calls of the genes in the respective probe/data. For expression data from Affymetrix .CEL files you can use the Bioconductor package "affy" and then use the commands "ReadAffy" -> "mas5calls" -> "exprs" to get a ExpressionSet of the present/absent genes in the respective .CEL file.

Now my question: Is there something similar for Agilent's .txt files within the Bioconductor package "agilp"? If yes, could someone please explain to me how to get the same output as with a .CEL file, namely present/absent calls of the genes?

Thank you for your help.

Kind regards,

Maximilian

affy • 1.3k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

The present/absent calls are specific to the Affy platform, and were originally based on a comparison between the PM and MM probes (which is what you are talking about here, with mas5calls).

The Agilent platform is completely different, based on longer probes, without anything very similar to a MM probe. You could hypothetically write a function that did something similar, comparing the foreground and background probes on the Agilent array, but I don't know of anything that currently exists for doing that.

ADD COMMENT
0
Entering edit mode

Dear Mr. MacDonald,

thank you for your response. The Agilent platform design files I am referring to are accessible on the GEO database under accession number GPL18948:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL18948

An example for the resulting RAW data is accessible on the GEO database under accession number GSM1436498:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1436498

In this context, what are the foreground and the background probes you are referring to?

Thank you for your help.

Kind regards,

Maximilian

ADD REPLY
0
Entering edit mode

Well, the summarized data that people load onto GEO is sometimes less than useful, and this looks like a case of that. But they were kind enough to load up the raw data as well, so you can start from scratch.

> library(GEOquery)
> library(limma)
> getGEOSuppFiles("GSE59408")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE59nnn/GSE59408/suppl/
OK
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE59nnn/GSE59408/suppl//GSE59408_RAW.tar'
Content type 'application/x-tar' length 126945280 bytes (121.1 MB)
downloaded 121.1 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE59nnn/GSE59408/suppl//filelist.txt'
Content type 'text/plain' length 2788 bytes
downloaded 2788 bytes

> setwd("GSE59408/")
> dir()
[1] "filelist.txt"     "GSE59408_RAW.tar"
> untar("GSE59408_RAW.tar")

> fls <- dir(".", "txt.gz$")

> dat <- read.maimages(fls[-(1:2)], "agilent.median", green.only = TRUE)
<snip>
> head(dat$genes)
  Row Col ControlType       ProbeName  SystematicName
1   1   1           1 GE_BrightCorner GE_BrightCorner
2   1   2           1      DarkCorner      DarkCorner
3   1   3           1      DarkCorner      DarkCorner
4   1   4           0           31181           31181
5   1   5           0           55868           55868
6   1   6           0             962             962
> table(dat$genes$ControlType)

   -1     0     1
  308 61648  1011

That's not helpful. But they do give you a better set of annotations, so let's use that.

> bettergns <- read.delim(dir()[2])
> head(bettergns)
  Column Row       ProbeName              ID RefNumber ControlType
1      1   1 GE_BrightCorner GE_BrightCorner         1         pos
2      2   1      DarkCorner      DarkCorner         2         pos
3      3   1      DarkCorner      DarkCorner         3         pos
4      4   1                           31181         4       FALSE
5      5   1                           55868         5       FALSE
6      6   1                             962         6       FALSE
         GeneName TopHit Description Go ChromosomalLocation EntrezGeneID
1 GE_BrightCorner            Unknown NA             Unknown           NA
2      DarkCorner            Unknown NA             Unknown           NA
3      DarkCorner            Unknown NA             Unknown           NA
4                            Unknown NA            unmapped           NA
5                            Unknown NA            unmapped           NA
6                            Unknown NA            unmapped           NA

> dim(dat)
[1] 62967     3
> dim(bettergns)
[1] 62976    12

So we have too many rows in our new annotations. Let's fix that.

> bettergns <- bettergns[match(apply(dat$genes[,1:2], 1, paste, collapse = "_"), apply(bettergns[,2:1], 1, paste, collapse = "_")),]

This uses the idea that if we take the row and column and paste them together, we get a unique identifier that we can match on (e.g., we get 1_1, 1_2, etc, and they uniquely identify each spot). So I just paste them together and use match to subset and reorder the 'bettergns' data.frame. I can then just put that into the 'dat' object.

> dat$genes <- bettergns
> table(dat$genes$ControlType)

 FALSE ignore    neg    pos
 61648      0    308   1011

Now we know which are positive and which are negative genes, and you can hypothetically then come up with a measure of present/absent to use.

ADD REPLY
0
Entering edit mode

You may also want to have a look at the library SCAN.UPC, specifically the function UPC_TwoColor(). This will calculate a Universal exPression Codes (UPC) score for two-channel microarrays, which estimates the probability a probe is "active" (expressed) in a sample.

ADD REPLY
0
Entering edit mode

Dear Mr. Hooiveld,

thank you very much for your suggestion with the package "scan.upc".

I read the manual and found the sample code but because I am a beginner with Bioconductor and R in general I just want to ask you again: all I need to give in is

result = UPC_TwoColor("sample_Agilent_input_file.txt", "output_file.txt") 

and then I get a matrix with the presence/abscence calls of the corresponding genes?

I tried to run it like this with the input file 'GSM1436498_Parent_replicate1.txt' from

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1436498

which is a simple RAW Agilent .txt file but then I get this error:

> result = UPC_TwoColor("GSM1436498_Parent_replicate1.txt", "output_file.txt")

Normalizing ./GSM1436498_Parent_replicate1.txt
Error in `[.data.frame`(data, , cols[3]) : undefined columns selected

Can you please tell me what I am doing wrong because I have no idea. Thank you for your help.

Kind regards,

Maximilian

 

ADD REPLY
0
Entering edit mode

Well, this is apparently a single-color Agilent experiment, and as is stated in the UPC.SCAN vignette (PDF; section 3, top page 5): "This package does not yet support normalizing Agilent one-color arrays."

To calculate a UPC score for a single-color array, you could switch to the function UPC_Generic(). Be sure to check the help pages for this function!

Using the object dat that is generated by the code James posted above, you could do something like this:

for(i in 1:dim(dat)[2]){
  if(i==1){
       UPC.scores <- UPC_Generic(dat$E[,i], verbose = FALSE)
        }else{
     UPC.scores <- cbind(UPC.scores,UPC_Generic(dat$E[,i], verbose = FALSE))
        }
}

colnames(UPC.scores) <- colnames(dat$E)
> dim(UPC.scores)
[1] 62967    42
> UPC.scores[1:3,1:3]
     GSM1436498_Parent_replicate1.txt GSM1436499_Parent_replicate2.txt GSM1436500_CPZ1.txt
[1,]                     1.000000e+00                     9.999998e-01        9.999990e-01
[2,]                     1.513477e-07                     1.051706e-09        7.364431e-07
[3,]                     2.102613e-07                     5.708945e-08        4.169059e-07
> 

I noticed some warning messages are returned, but AFAIK these can be ignored (I am not sure about this since I never used this for Agilent arrays...) Otherwise modify the default value for the parameter convThreshold.

1: In UPC_nn(expressionValues, l = lengths, gc = gcContent, conv = convThreshold) :
  Most values were close to 0.0, so stopping. Perhaps try with a higher convergence threshold.

 

ADD REPLY

Login before adding your answer.

Traffic: 1022 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6