Question: Why are there so many unannotated probes on the PorGene-1_1-st Affymetrix array?
21 months ago, serpalma.v (Germany) wrote:

Dear community,

After creating an expression set for the PorGene-1_1-st Affymetrix array, I found that many of the probes do not have annotations. As a result, once I am done with the differential expression analysis, I end up with many regulated probes that do not have any feature assigned. This is highly problematic for pathway analysis.

I got the annotations by using the function getNetAffx(eset, "transcript")

This is what it looks like once I create the expression set:

> eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 27558 features, 20 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: A1_Mix-Diestrus A1_Mix-Estrus ... A6_P4-Estrus (20 total)
  varLabels: animal treatment
  varMetadata: labelDrescription labelDescription
featureData
  featureNames: 15180001 15180003 ... 15351650 (27558 total)
  fvarLabels: transcriptclusterid probesetid ... category (18 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: PorGene-1_1-st 

These are the dimensions of the array's feature data:

> dim(fData(eset))
[1] 27558    18

This is the number of unannotated probes (at the gene assignment level):
> sum(is.na(fData(eset)$geneassignment))
[1] 14026

 

You can see that more than half of the probes (14026 of 27558) do not have a gene assignment, and many of these probes show up as differentially expressed later on in the analysis.

 

Why are those probes there?

Should they be removed from the analysis? If so, at what point?

 

Thanks

EDIT:

Does this have something to do with the "category" column?

> colnames(fData(eset))

 [1] "transcriptclusterid" "probesetid"          "seqname"             "strand"              "start"               "stop"               
 [7] "totalprobes"         "geneassignment"      "mrnaassignment"      "swissprot"           "unigene"             "gobiologicalprocess"
[13] "gocellularcomponent" "gomolecularfunction" "pathway"             "proteindomains"      "crosshybtype"        "category"           


> as.data.frame(table(fData(eset)$category))
                         Var1  Freq
1               control->affx    18
2    control->affx->bac_spike    18
3  control->affx->polya_spike    39
4   control->bgp->antigenomic    23
5                        main 19124
6              normgene->exon   453
7            normgene->intron  1537
8                    reporter    82
9                      rescue  6261
10                       rrna     3

It seems that only the probes in main are relevant. Is this correct?
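
One way to check this is to cross-tabulate the "category" column against whether a gene assignment is present. The following is a minimal sketch in base R; the small data frame `fd` is a made-up stand-in for `fData(eset)` (its values are invented for illustration), so only the pattern carries over to the real data, not the numbers.

```r
# Toy stand-in for fData(eset); the values below are invented for illustration.
fd <- data.frame(
  category = c("main", "main", "main", "rescue", "normgene->intron", "control->affx"),
  geneassignment = c("GENE_A", NA, "GENE_B", NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE where a gene assignment exists, FALSE where it is missing.
annotated <- !is.na(fd$geneassignment)

# Cross-tabulate probeset category against annotation status.
tab <- table(category = fd$category, annotated = annotated)
print(tab)

# Fraction of "main" probesets that still lack a gene assignment.
main_unannotated <- mean(!annotated[fd$category == "main"])
```

On the real object, replacing `fd` with `fData(eset)` should show how many of the 14026 unannotated probesets fall into the control-type categories and how many are in "main".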

21 months ago, James W. MacDonald (United States) wrote:

There are two issues here. The first is, as you suspect, that there are lots of control probesets of various types on the array. You should definitely remove those probesets, as the normgene->intron controls in particular have a really bad habit of popping up in the list of top genes. When you remove them is up to you, and depends on how you are analyzing the data. I suppose you could argue that they help estimate the variance prior if you are using limma, and that you should therefore keep them in until after the eBayes step. I tend to remove them earlier than that, because I am not sure how helpful they are for that estimation, being controls and all.
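
A minimal sketch of the filtering step, in base R. Here `expr` and `fd` are toy stand-ins for `exprs(eset)` and `fData(eset)` (the probeset and sample names are hypothetical); the logical index `keep` is the only part that matters.

```r
# Toy stand-ins: `expr` plays the role of exprs(eset), `fd` of fData(eset).
set.seed(1)
fd <- data.frame(
  category = c("main", "normgene->intron", "main", "control->affx", "rescue", "main"),
  row.names = paste0("ps", 1:6),
  stringsAsFactors = FALSE
)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(rownames(fd), paste0("s", 1:4)))

# Keep only probesets in the "main" category; %in% treats NA as "not main".
keep <- fd$category %in% "main"

expr_main <- expr[keep, , drop = FALSE]
fd_main <- fd[keep, , drop = FALSE]

# With a real ExpressionSet the same index can be applied in one step:
#   eset_main <- eset[keep, ]
```

To keep the controls in for the variance prior instead, the same `keep` index could be applied to the limma fit object after the eBayes step rather than to the expression set beforehand.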

The second issue has to do with how many un-annotated probesets there are. I should note that the getNetAffx function is simply reading in the data from Affy's annotation csv, so the dearth of annotation is on them, not us. Even if we had an annotation package for this array we would simply be passing on what Affy says the probesets measure, so you should really be asking your Affy rep, as (s)he is the one who sold you the arrays, and is supposed to be providing support.

Do note that you can get a FASTA file of the transcript clusters, which you could use to annotate using BLAST, which may prove useful.


Thanks James,

I am now wondering how to proceed with the pathway analysis.

There needs to be a set of regulated genes and a background in order to infer whether a pathway is enriched or not. For the background, I was thinking of using all the annotated probes that belong to the "main" category. However, I could also use all known porcine transcripts, but I think this would reduce the power of the test quite substantially.

Is it valid to only select as a background all the annotated probes belonging to the "main" category?

Thanks!

Reply written 21 months ago by serpalma.v

Certainly you only want to use the main category probesets, as the remainder are controls. How you proceed from there is up to you. In other words, you are asking questions that are unanswerable on a forum. People here can help you with questions having to do with the software and whatnot, but when it comes down to choices made in performing an analysis, only you can make those decisions.

Reply written 21 months ago by James W. MacDonald
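
For illustration only, here is how a background (universe) enters a simple over-representation test. This is a sketch using the hypergeometric distribution in base R, not a recommendation of a particular pathway method; all gene identifiers and set sizes are hypothetical, and `universe` is assumed to hold the genes from the annotated "main" probesets.

```r
# Hypothetical identifiers; in practice these come from the annotated "main"
# probesets (universe) and from the differential expression results.
universe  <- paste0("g", 1:100)           # background genes
regulated <- paste0("g", 1:20)            # genes called differentially expressed
pathway   <- paste0("g", c(1:8, 91:94))   # members of one pathway

# Restrict both sets to the universe so all counts refer to the same background.
regulated <- intersect(regulated, universe)
pathway   <- intersect(pathway, universe)

k <- length(intersect(regulated, pathway))  # regulated genes in the pathway
m <- length(pathway)                        # pathway genes in the universe
n <- length(universe) - m                   # universe genes outside the pathway
N <- length(regulated)                      # regulated genes in total

# One-sided test: probability of seeing k or more pathway genes among the
# N regulated genes if they were drawn at random from the universe.
p_value <- phyper(k - 1, m, n, N, lower.tail = FALSE)
```

Changing `universe` changes `m` and `n`, and therefore the p-value, which is why the choice of background matters. Several probesets can map to the same gene, so the sets are assumed here to contain unique gene identifiers.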