Question: Why are there so many not annotated probes in the PorGene-1_1-st affymetrix array?
3.5 years ago by
serpalma.v40
Germany
serpalma.v40 wrote:

Dear community,

After creating an expression set for the PorGene-1_1-st Affymetrix array, I found that many of the probes do not have annotations. As a result, once I am done with the differential expression analysis, I find many regulated probes that do not have any feature assigned. This is highly problematic for pathway analysis.

I got the annotations by using the function getNetAffx(eset, "transcript")

This is what it looks like once I create the expression set:

> eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 27558 features, 20 samples
element names: exprs
protocolData: none
phenoData
sampleNames: A1_Mix-Diestrus A1_Mix-Estrus ... A6_P4-Estrus (20 total)
varLabels: animal treatment
featureData
featureNames: 15180001 15180003 ... 15351650 (27558 total)
fvarLabels: transcriptclusterid probesetid ... category (18 total)
experimentData: use 'experimentData(object)'
Annotation: PorGene-1_1-st 

These are the dimensions of the array:

> dim(fData(eset))
[1] 27558    18

This is the number of unannotated probes (at the gene assignment level):

> sum(is.na(fData(eset)$geneassignment))
[1] 14026

You can see that more than half of the probes do not have annotations for gene assignments. And many of these probes show up as differentially expressed later on in the analysis.

Why are those probes there? Should they be removed from the analysis? If so, at what point?

Thanks

EDIT: Does this have something to do with the "category" column?

> colnames(fData(eset))
 [1] "transcriptclusterid" "probesetid" "seqname" "strand" "start" "stop"
 [7] "totalprobes" "geneassignment" "mrnaassignment" "swissprot" "unigene" "gobiologicalprocess"
[13] "gocellularcomponent" "gomolecularfunction" "pathway" "proteindomains" "crosshybtype" "category"

> as.data.frame(table(fData(eset)$category))
Var1  Freq
1               control->affx    18
2    control->affx->bac_spike    18
3  control->affx->polya_spike    39
4   control->bgp->antigenomic    23
5                        main 19124
6              normgene->exon   453
7            normgene->intron  1537
8                    reporter    82
9                      rescue  6261
10                       rrna     3

It seems that only the probes in main are relevant. Is this correct?
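One way to check this is to cross-tabulate the category against whether a gene assignment is present. The following is only a minimal sketch: the data frame `fd` is a made-up stand-in for fData(eset), and the gene labels in it are hypothetical.

```r
# Toy stand-in for fData(eset); all values are invented for illustration.
fd <- data.frame(
  geneassignment = c("geneA", NA, NA, "geneB", NA, NA),
  category = c("main", "main", "rescue", "main",
               "normgene->intron", "control->affx"),
  stringsAsFactors = FALSE
)

# TRUE where a gene assignment exists, FALSE where it is missing
annotated <- !is.na(fd$geneassignment)

# Rows: probeset category; columns: annotated or not
tab <- table(category = fd$category, annotated = annotated)
print(tab)
```

On the real array, replacing `fd` with fData(eset) would show how the 14026 unannotated probesets are distributed across the categories.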

modified 3.5 years ago by James W. MacDonald50k • written 3.5 years ago by serpalma.v40
Answer: Why are there so many not annotated probes in the PorGene-1_1-st affymetrix arra
3.5 years ago by
United States
James W. MacDonald50k wrote:

There are two issues here. The first is, as you suspect, that there are lots of control probesets of various types on the array. You should definitely remove those probesets, as the normgene->intron controls in particular have a really bad habit of popping up in the list of top genes. When you remove them is up to you, and depends on how you are analyzing the data. I suppose you could argue that they help estimate the variance prior if you are using limma, and that you should therefore keep them in until after the eBayes step. I tend to remove them earlier than that, because I am not sure how helpful they are for that estimation, being controls and all.
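As a rough illustration of the early-removal option, the sketch below keeps only the "main" probesets before any model fitting. It uses toy objects rather than a real ExpressionSet: `fd` and `expr` are hypothetical stand-ins for fData(eset) and exprs(eset), and whether categories such as "rescue" should also be kept is not settled in this thread.

```r
# Toy stand-ins: 'fd' plays the role of fData(eset), 'expr' of exprs(eset).
set.seed(1)
fd <- data.frame(
  category = c("main", "main", "normgene->intron",
               "control->affx", "main", "rescue"),
  row.names = paste0("ps", 1:6),
  stringsAsFactors = FALSE
)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(rownames(fd), paste0("s", 1:4)))

# %in% treats a missing category as FALSE, so NA rows are dropped too
keep <- fd$category %in% "main"

# Subset expression values and feature annotation with the same index
expr_main <- expr[keep, , drop = FALSE]
fd_main <- fd[keep, , drop = FALSE]

# With a real ExpressionSet the same index can subset features directly:
# eset_main <- eset[keep, ]
```

To follow the alternative argument instead, the same `keep` index would be applied to the fitted results after the eBayes step rather than to the input data.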

The second issue has to do with how many un-annotated probesets there are. I should note that the getNetAffx function is simply reading in the data from Affy's annotation csv, so the dearth of annotation is on them, not us. Even if we had an annotation package for this array we would simply be passing on what Affy says the probesets measure, so you should really be asking your Affy rep, as (s)he is the one who sold you the arrays, and is supposed to be providing support.

Do note that you can get a FASTA file of the transcript clusters, which you could use to annotate using BLAST, which may prove useful.

Thanks James,

I was wondering now how to proceed with the pathway analysis.

There needs to be a set of regulated genes and a background in order to infer whether a pathway is enriched or not. For the background, I was thinking of using all the annotated probes that belong to the "main" category. However, I could also use all the transcripts known to exist in pig, but I think this would reduce the power of the test quite substantially.

Is it valid to only select as a background all the annotated probes belonging to the "main" category?
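To make the role of the background concrete, here is a minimal sketch of a one-sided hypergeometric over-representation test in base R. All gene identifiers are hypothetical, and the sketch assumes that the annotated "main" probesets have already been collapsed to unique genes. Restricting both the regulated set and the pathway to genes measurable on the array is a common convention, not something confirmed in this thread.

```r
# Hypothetical identifiers, for illustration only.
background <- paste0("g", 1:100)                  # genes from annotated "main" probesets
regulated  <- paste0("g", 1:20)                   # differentially expressed genes
pathway    <- paste0("g", c(1:10, 91:95, 101:105)) # pathway; g101-g105 are not on the array

# Restrict both sets to genes that could have been detected
pathway_bg <- intersect(pathway, background)
regulated  <- intersect(regulated, background)
overlap    <- length(intersect(regulated, pathway_bg))

# P(X >= overlap) when drawing length(regulated) genes from the background
p_value <- phyper(overlap - 1,
                  m = length(pathway_bg),                      # pathway genes in background
                  n = length(background) - length(pathway_bg), # other background genes
                  k = length(regulated),                       # number of regulated genes
                  lower.tail = FALSE)
```

Changing `background` changes `n`, and with it the p-value, which is why the choice of background matters for the test.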

Thanks!