Question about Agilent microarray data measurement

Maximilian ▴ 10

Dear Bioconductor forum,

I need help getting the data I want from the Agilent microarrays at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE59408. I used the Bioconductor package "limma" to read in the data from the single-color Agilent microarrays and then the package "SCAN.UPC" to get the UPC values of the data.

I set the argument convThreshold of the function UPC_Generic to 0.1 so that the function no longer gave me any warning messages. After that I computed the average UPC value for each of the occurring genes. My goal is to determine which of the genes in the dataset are present/absent, and I already got the hint that I have to come up with my own measure. The problem is that the positive and negative ControlType values belong only to Agilent control probes on the microarray; the "real" genes I want to distinguish are not marked as positive or negative.

Now my questions: Am I on the right track using the averaged UPC values? Although the genes I want to use aren't marked as positive or negative ControlTypes, is there still a way to figure out a present/absent pattern for them?

I hope my questions are clearer this time.

Kind regards,

Max

Agilent limma scan.upc microarray • 2.3k views

ADD COMMENT • link 7.5 years ago Maximilian ▴ 10

Maximilian, thanks for your questions. Are you saying that all the non-control genes are assigned UPC values of either 0 or 1?

ADD REPLY • link 7.5 years ago Stephen Piccolo ▴ 590

Mr. Piccolo, thank you for your answer. I am sorry for my cryptic question; I will explain it again:

In the Agilent raw data I am using (from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE59408), the 'ControlType' column assigns each probe on the microarrays one of three values (-1, 0, 1). As I said, the values -1 and 1 are assigned to Agilent control probes; the genes I want to use have the 'ControlType' value 0.

My goal is to derive some kind of measure that gives a presence/absence pattern for the genes on the microarrays, similar to the presence/absence calls available for Affymetrix arrays (mas5calls).

I already got the hint to use the 'ControlType' values to come up with a measure that distinguishes present from absent genes, but as I mentioned before, the -1 and 1 'ControlType' values are assigned to Agilent control probes on the microarray, so I am not sure if I can do it that way. I tried to find a reference or somebody describing a similar problem and its solution, but couldn't find anything.

I used the package 'SCAN.UPC' to normalize the expression values read in by limma, because in my case it is an Agilent single-colour microarray. Then I averaged the normalized UPC values of the probes belonging to each gene. But as I said, I am not sure if this is the right way to get a presence/absence pattern of the genes on the microarrays.

To summarize: which of the two approaches is more likely to give me the presence/absence patterns of the genes on the microarray, the 'ControlType' values or the UPC values?

Thanks in advance, Max

ADD REPLY • link 7.5 years ago Maximilian ▴ 10

Hi Max,

I'm not sure if this will help, but one option that you could try is to use the limma package to get the expression values for the non-control probes and then apply UPC_Generic to those only.
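The suggestion to keep only the non-control probes can be sketched in base R as follows. This is a minimal illustration on made-up data, not the thread's actual objects: `genes` and `E` are hypothetical stand-ins for the `x$genes` annotation and `x$E` intensity matrix that limma returns, and the probe names and values are invented.

```r
# Toy stand-in for the probe annotation and intensities (values are made up).
genes <- data.frame(
  ProbeName   = c("ctrl_neg", "probe_a", "probe_b", "ctrl_pos", "probe_c"),
  ControlType = c(-1, 0, 0, 1, 0)
)
E <- matrix(c(12, 150, 900, 5000, 40,
              15, 140, 950, 5200, 35),
            ncol = 2,
            dimnames = list(genes$ProbeName, c("sample1", "sample2")))

# Logical indexing keeps only rows with ControlType == 0.
# Unlike x[-which(cond), ], it still returns every row when
# no row matches the condition being excluded.
keep <- genes$ControlType == 0
E.noncontrol <- E[keep, , drop = FALSE]
genes.noncontrol <- genes[keep, ]
```

With the limma object from this thread, the same idea would presumably read `x[x$genes$ControlType == 0, ]`, which selects the non-control rows in one step.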

FYI: I've been working on adding more formal support for Agilent one-color arrays to the SCAN.UPC package.

ADD REPLY • link 7.5 years ago Stephen Piccolo ▴ 590

Dear Mr. Piccolo,

thank you for your answer. I have tried your suggested way:

> library(limma)

The file "targets.txt" lists all Agilent raw .txt files and has this format:

SampleNumber FileName Condition
1 GSM1436498_Parent_replicate1.txt NONE
2 GSM1436499_Parent_replicate2.txt NONE
3 GSM1436500_CPZ1.txt CPZ
4 GSM1436501_CPZ2.txt CPZ

> targets <- readTargets("targets.txt")
> x <- read.maimages(targets, source = "agilent.median", green.only = TRUE)
> platform <- read.table(file = "GPL18948_platform_design_file2.txt", sep = "\t")
> library(SCAN.UPC)

Now I remove the control probe rows from x (all 42 columns are kept):

> indNeg <- which(with(x$genes, ControlType == -1))
> x.adjusted <- x[ -indNeg, ]
> indPos <- which(with(x.adjusted$genes, ControlType == 1))
> x.adjusted <- x.adjusted[ -indPos, ]

Now I calculate the UPC values using UPC_Generic, one array (column) at a time:

> for(i in 1:dim(x.adjusted)[2]){
    if(i == 1){
      UPC.scores <- UPC_Generic(x.adjusted$E[,i], verbose = FALSE)
    }else{
      UPC.scores <- cbind(UPC.scores, UPC_Generic(x.adjusted$E[,i], verbose = FALSE))
    }
  }

> colnames(UPC.scores) <- colnames(x.adjusted$E)

Finally I average the UPC values of corresponding gene probes on the microarrays:

> UPC.scores.ave <- avereps(UPC.scores,ID=x.adjusted$genes[,"SystematicName"])

> dim(UPC.scores.ave)
[1] 59998    42

This is way too big. I replace the column x$genes$SystematicName with platform$GeneSymbol because, for my microarrays, the SystematicName entries are predominantly the probe names, so averaging by SystematicName does not produce a per-gene average.

> x$genes$SystematicName <- platform$GeneSymbol

> indNeg <- which(with(x$genes, ControlType == -1))
> x.adjusted <- x[ -indNeg, ]

> indPos <- which(with(x.adjusted$genes, ControlType == 1))
> x.adjusted <- x.adjusted[ -indPos, ]

> indAgilent <- which(with(x.adjusted$genes, SystematicName == "Agilent Control probe"))
> x.adjusted <- x.adjusted[ -indAgilent, ]

> for(i in 1:dim(x.adjusted)[2]){
    if(i == 1){
      UPC.scores <- UPC_Generic(x.adjusted$E[,i], verbose = FALSE)
    }else{
      UPC.scores <- cbind(UPC.scores, UPC_Generic(x.adjusted$E[,i], verbose = FALSE))
    }
  }

> colnames(UPC.scores) <- colnames(x.adjusted$E)

> UPC.scores.ave <- avereps(UPC.scores,ID=x.adjusted$genes[,"SystematicName"])

> dim(UPC.scores.ave)
[1] 4471   42
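The per-gene averaging step can be illustrated in base R. This is only a sketch of the idea (average all probe rows that share a gene ID); it uses made-up values and hypothetical names (`upc`, `gene`), and it is not claimed to reproduce limma's avereps in every detail.

```r
# Made-up UPC values: 5 probes x 2 samples, with one gene symbol per probe.
upc <- matrix(c(0.9, 0.7, 0.1, 0.2, 0.6,
                0.8, 0.6, 0.3, 0.0, 0.4),
              ncol = 2,
              dimnames = list(NULL, c("sample1", "sample2")))
gene <- c("geneA", "geneA", "geneB", "geneB", "geneC")

# Sum the rows that share a gene ID, keeping genes in order of first appearance.
sums <- rowsum(upc, group = gene, reorder = FALSE)

# Count the probes per gene in the same order, then divide to get the mean.
n.probes <- as.vector(table(factor(gene, levels = rownames(sums))))
upc.ave <- sums / n.probes
```

The result has one row per distinct gene ID, which is why the row count drops sharply once probe names are replaced by gene symbols.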

This looks much better. Is there now a way to derive some kind of presence/absence pattern for the genes from these computed UPC values?

Thanks in advance, Max

ADD REPLY • link 7.5 years ago Maximilian ▴ 10

Can you post a histogram for how UPC.scores.ave looks?

It would also be interesting to see how UPC.scores looks.
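A histogram like the ones requested here can be produced with base R graphics. The sketch below uses simulated values (`upc.sim` is a hypothetical stand-in, not the thread's data) so that it runs on its own; the shape of the simulated distribution is arbitrary and says nothing about the real arrays.

```r
set.seed(1)

# Simulated stand-in for a matrix of UPC values; all values lie in [0, 1].
upc.sim <- c(rbeta(600, 0.5, 5), rbeta(400, 5, 0.5))

# Fixed bin edges from 0 to 1 keep separate histograms directly comparable.
h <- hist(upc.sim, breaks = seq(0, 1, by = 0.05), plot = FALSE)
plot(h, main = "Simulated UPC values", xlab = "UPC value")
```

For the objects in this thread, the equivalent call would presumably be `hist(as.vector(UPC.scores.ave), breaks = seq(0, 1, by = 0.05))`, and likewise for `UPC.scores`.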

ADD REPLY • link 7.5 years ago Stephen Piccolo ▴ 590

Dear Mr. Piccolo,

thank you for your answer.

The histogram for UPC.scores.ave looks like this: https://postimg.org/image/nibp6l5rp/

The histogram for UPC.scores looks like this: https://postimg.org/image/h5wjwr2ph/

Does this help to derive some kind of presence/absence pattern for the genes from the computed UPC values?

Thanks in advance, Max

ADD REPLY • link 7.5 years ago Maximilian ▴ 10

Yes, that is what I would expect to see. Values closer to 1 suggest that the respective gene is "on." Values closer to 0 suggest that the respective gene is "off." You'll have to decide what threshold you want to use to distinguish between "on" and "off" but I've used 0.5 typically. Hope that helps!

ADD REPLY • link 7.5 years ago Stephen Piccolo ▴ 590

Or I guess you could instead say "on" = "Present" and "off" = "Absent" if you prefer that terminology.
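Turning UPC values into present/absent calls with a cutoff can be sketched in a few lines of base R. The matrix below is made up for illustration (`upc.ave` here is a hypothetical 4 x 3 example, not the thread's result), and treating a value of exactly 0.5 as "Absent" is an arbitrary choice of this sketch.

```r
# Made-up averaged UPC values: 4 genes x 3 samples.
upc.ave <- matrix(c(0.95, 0.10, 0.55, 0.02,
                    0.80, 0.40, 0.49, 0.00,
                    0.99, 0.05, 0.70, 0.30),
                  ncol = 3,
                  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

threshold <- 0.5  # the cutoff mentioned in the thread; adjust as needed

# Values above the threshold are called "Present", all others "Absent".
calls <- ifelse(upc.ave > threshold, "Present", "Absent")

# Fraction of samples in which each gene is called present.
present.fraction <- rowMeans(upc.ave > threshold)
```

`calls` keeps the gene-by-sample layout of the input, so it can be read as a presence/absence pattern per array, while `present.fraction` summarizes each gene across arrays.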

ADD REPLY • link 7.5 years ago Stephen Piccolo ▴ 590

Dear Mr. Piccolo,

thank you very much. You helped me a lot!

Kind regards, Max

ADD REPLY • link 7.5 years ago Maximilian ▴ 10