Dear community,
After creating an expression set for the PorGene-1_1-st Affymetrix array, I found that many of the probes do not have annotations. As a result, once I am done with the differential expression analysis, many of the regulated probes do not have any feature assigned. This is highly problematic for pathway analysis.
I got the annotations by using the function getNetAffx(eset, "transcript")
This is what it looks like once I create the expression set:
> eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 27558 features, 20 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: A1_Mix-Diestrus A1_Mix-Estrus ... A6_P4-Estrus (20 total)
  varLabels: animal treatment
  varMetadata: labelDrescription labelDescription
featureData
  featureNames: 15180001 15180003 ... 15351650 (27558 total)
  fvarLabels: transcriptclusterid probesetid ... category (18 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: PorGene-1_1-st
These are the dimensions of the feature data:
> dim(fData(eset))
[1] 27558    18
This is the number of unannotated probes (at the gene assignment level):
> sum(is.na(fData(eset)$geneassignment))
[1] 14026
You can see that more than half of the probes do not have annotations for gene assignments. Many of these probes show up as differentially expressed later on in the analysis.
Why are those probes there?
Should they be removed from the analysis? If so, at what point?
Thanks
EDIT:
Does this have something to do with the "category" column?
> colnames(fData(eset))
 [1] "transcriptclusterid" "probesetid"          "seqname"             "strand"              "start"               "stop"
 [7] "totalprobes"         "geneassignment"      "mrnaassignment"      "swissprot"           "unigene"             "gobiologicalprocess"
[13] "gocellularcomponent" "gomolecularfunction" "pathway"             "proteindomains"      "crosshybtype"        "category"
> as.data.frame(table(fData(eset)$category))
                         Var1  Freq
1               control->affx    18
2    control->affx->bac_spike    18
3  control->affx->polya_spike    39
4   control->bgp->antigenomic    23
5                        main 19124
6              normgene->exon   453
7            normgene->intron  1537
8                    reporter    82
9                      rescue  6261
10                       rrna     3
It seems that only the probes in main are relevant. Is this correct?
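One way to check whether the missing annotations track the category would be to cross-tabulate the two columns. The sketch below uses a small invented data frame as a stand-in for fData(eset), so the counts are illustrative only; with the real object the same table() call would be applied to fData(eset)$category and fData(eset)$geneassignment.

```r
# Toy stand-in for fData(eset); the values are invented for illustration.
fd <- data.frame(
  category = c("main", "main", "main", "rescue",
               "normgene->intron", "control->affx"),
  geneassignment = c("NM_0001 // GENE1", NA, "NM_0002 // GENE2", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Rows: probeset category; columns: whether geneassignment is missing (NA).
missing_by_category <- table(category = fd$category,
                             unannotated = is.na(fd$geneassignment))
print(missing_by_category)
```

If most of the NA entries fall outside "main", the category column explains a large part of the missing annotation. Any NA entries left inside "main" would need to be handled separately.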
Thanks James,
I was wondering now how to proceed with the pathway analysis.
There needs to be a set of regulated genes and a background in order to infer whether a pathway is enriched or not. For the background, I was thinking of using all the annotated probes that belong to the "main" category. However, I could also use all the transcripts known for pig, but I think this would reduce the power of the test quite substantially.
Is it valid to select only the annotated probes belonging to the "main" category as the background?
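To make the role of the background concrete, here is a minimal sketch of a one-sided over-representation test using the hypergeometric distribution (phyper from base R). All gene identifiers and set sizes are hypothetical, and this is only one common way to test enrichment, not necessarily the method a given pathway tool uses.

```r
# Hypothetical gene identifiers; none of these come from the real array.
background <- paste0("gene", 1:100)          # annotated "main" genes
regulated  <- paste0("gene", 1:20)           # differentially expressed genes
pathway    <- paste0("gene", c(1:8, 91:94))  # members of one pathway

# Restrict both sets to the background so all counts share one universe.
regulated <- intersect(regulated, background)
pathway   <- intersect(pathway, background)

k <- length(intersect(regulated, pathway))  # regulated genes in the pathway
m <- length(pathway)                        # pathway genes in the background
n <- length(background) - m                 # background genes not in the pathway
N <- length(regulated)                      # number of regulated genes

# P(X >= k): chance of seeing at least k pathway genes among N drawn genes.
p_value <- phyper(k - 1, m, n, N, lower.tail = FALSE)
print(p_value)
```

The background enters through m and n, so changing the universe (for example, from the annotated "main" genes to every known transcript) changes the p-value even when the regulated set stays the same.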
Thanks!
Certainly you only want to use the main category probesets, as the remainder are controls. How you proceed from there is up to you. In other words, you are asking questions that are unanswerable on a forum. People here can help you with questions having to do with the software and whatnot, but when it comes down to choices made in performing an analysis, only you can make those decisions.
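As a rough sketch of the filtering step, the example below keeps only the "main" rows of a toy expression matrix and its matching feature table. Both objects are invented stand-ins for exprs(eset) and fData(eset). The commented line shows what the equivalent subsetting of an ExpressionSet would look like, assuming the category column has no missing values.

```r
# Toy stand-ins for exprs(eset) and fData(eset); all values are invented.
expr <- matrix(rnorm(12), nrow = 6,
               dimnames = list(paste0("ps", 1:6), c("s1", "s2")))
fd <- data.frame(
  category = c("main", "main", "rescue", "main", "control->affx", "reporter"),
  geneassignment = c("GENE1", NA, NA, "GENE2", NA, NA),
  row.names = rownames(expr),
  stringsAsFactors = FALSE
)

# Keep only probesets in the "main" category.
keep_main <- fd$category == "main"
expr_main <- expr[keep_main, , drop = FALSE]
fd_main   <- fd[keep_main, , drop = FALSE]

# With a real ExpressionSet the equivalent subsetting would be:
#   eset_main <- eset[fData(eset)$category == "main", ]
```

Whether to filter before or after fitting the model is an analysis decision, as noted above. The sketch only shows the mechanics.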