Dear list,
I'm analyzing data obtained using the HuGene1.0-st arrays from affymetrix.
I'm interested in a particular set of genes. Two of these genes are not present in the results if I filter out probes with missing Entrez IDs using nsFilter. However, I think that these two genes (PER1 and CSNK1E) actually have an Entrez ID and should not be filtered out. The only way I obtain them in my results is if I don't apply any kind of filter with nsFilter. Do you think I'm doing something wrong with the annotation? This is the code I'm using and the output of sessionInfo(). Thank you so much for your kind help.
exp.data <- read.celfiles(filenames=list.celfiles(full.names=TRUE, dataDir))
phenoData <- read.table(file.path(dataDir ,"pheno_BD_CT.txt"), sep="\t", header=TRUE)
row.names(phenoData) <- sampleNames(exp.data)
pd <- AnnotatedDataFrame(data=phenoData)
phenoData(exp.data) <- pd
eset <- rma(exp.data)
subs <- grep("CT", phenoData(exp.data)$Phenotype)
eset.subs <- eset[,c(subs)]
annotation(eset.subs) <- "hugene10sttranscriptcluster.db"
esetf.subs <- nsFilter(eset.subs, require.entrez=TRUE, remove.dupEntrez=FALSE, var.filter=FALSE)
express <- esetf.subs$eset
> esetf.subs
$eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 19777 features, 20 samples
element names: exprs
protocolData
rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
varLabels: FileName Response ... Analysis (5 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hugene10sttranscriptcluster.db
$filter.log
$filter.log$numRemoved.ENTREZID
[1] 13520
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] pd.hugene.1.0.st.v1_3.14.1 RSQLite_1.0.0
[3] DBI_0.5 hugene10sttranscriptcluster.db_8.4.0
[5] org.Hs.eg.db_3.3.0 AnnotationDbi_1.34.4
[7] genefilter_1.54.2 oligo_1.36.1
[9] Biostrings_2.40.2 XVector_0.12.1
[11] IRanges_2.6.1 S4Vectors_0.10.3
[13] limma_3.28.18 oligoClasses_1.34.0
[15] Biobase_2.32.0 BiocGenerics_0.18.0
[17] mvtnorm_1.0-5
loaded via a namespace (and not attached):
[1] Rcpp_0.12.6 BiocInstaller_1.22.3 GenomeInfoDb_1.8.3
[4] iterators_1.0.8 tools_3.3.1 zlibbioc_1.18.0
[7] bit_1.1-12 tibble_1.1 annotate_1.50.0
[10] preprocessCore_1.34.0 lattice_0.20-33 ff_2.2-13
[13] Matrix_1.2-6 foreach_1.4.3 dplyr_0.5.0
[16] affxparser_1.44.0 grid_3.3.1 R6_2.1.2
[19] XML_3.98-1.4 survival_2.40-1 magrittr_1.5
[22] codetools_0.2-14 splines_3.3.1 GenomicRanges_1.24.2
[25] assertthat_0.1 SummarizedExperiment_1.2.3 xtable_1.8-2
[28] sandwich_2.3-4 zoo_1.7-14 affyio_1.42.0
>
Thank you so much James for your answer.
I wanted to use nsFilter to filter out also probes with duplicated IDs, but also in this case I find my genes of interest filtered out (even if they do not have duplicated probes according to the results I obtain when I do not apply any filter).
Is there a way to filter out duplicated probes without using nsFilter?
Also, just if you have time to explain this to me, do you have an idea on why I don't find these two genes in the results if they have an ID?
Thank you so much for your time and help
Claudia
This is what I obtain when using the method you suggested, but unfortunately my genes of interest are still filtered out:
> eset.subs <- annotateEset(eset.subs, "hugene10sttranscriptcluster.db")
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> eset.subs2 <- eset.subs[!is.na(fData(eset.subs)$ENTREZID),]
> eset.subs2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22384 features, 20 samples
element names: exprs
protocolData
rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: 14(1015)_(HuGene-1_0-st-v1).CEL 21(1707)_(HuGene-1_0-st-v1).CEL ...
9(1168)_(HuGene-1_0-st-v1).CEL (20 total)
varLabels: FileName Response ... Analysis (5 total)
varMetadata: labelDescription
featureData
featureNames: 7896740 7896742 ... 8180418 (22384 total)
fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hugene10sttranscriptcluster.db
>
Well, what is happening is essentially this:
So if you take an unfiltered ExpressionSet and annotate it using annotateEset, then I can assure you that both PER1 and CSNK1E will be in there.
Thank you for your answer