Entering edit mode
Guest User
★
13k
@guest-user-4897
Last seen 10.2 years ago
Dear Benilton/BioC list,
Query: Is it possible to perform summarisation using the oligo package
after exclusion of probes containing SNPs?
Background: I've performed an eQTL analysis where the expression data
has been obtained on Affymetrix HuGene ST 1.1 microarrays.
After reading in the CEL files, I pre-processed the GeneFeatureSet
doing background correction, quantile normalisation, and summarisation
to
gene-level using the rma function from the oligo package.
# What my real data looks like:
( cels <- read.celfiles(pathToFiles) ) # generates a GeneFeatureSet
GeneFeatureSet (storageMode: lockedEnvironment)
assayData: 1178100 features, 90 samples
element names: exprs
protocolData
rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ...
triad0078_H7_498_CD16.CEL (90 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: 75_459_CD16.CEL 76_460_CD16.CEL ...
triad0078_H7_498_CD16.CEL (90 total)
varLabels: cell_type info.batch.name ... age (29 total)
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.1.1.st.v1
#Example workfow using oligoData for reproducibilty:
library(oligo)
library(oligoData)
data(affyGeneFS)
affyGeneFS # a GeneFeatureSet
GeneFeatureSet (storageMode: lockedEnvironment)
assayData: 1102500 features, 33 samples
element names: exprs
protocolData
rowNames: TisMap_Brain_01_v1_WTGene1.CEL
TisMap_Brain_02_v1_WTGene1.CEL ... TisMap_Thyroid_03_v1_WTGene1.CEL
(33 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: TisMap_Brain_01_v1_WTGene1.CEL
TisMap_Brain_02_v1_WTGene1.CEL ... TisMap_Thyroid_03_v1_WTGene1.CEL
(33 total)
varLabels: index
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.1.0.st.v1
# generate gene-level expression values:
eset <- rma(affyGeneFS)
# equivalent to:
bgCorrected <- backgroundCorrect(affyGeneFS)
# Background correct
normalized <- normalize(bgCorrected, method="quantile")
# Quantile normalise
eset2 <- rma(normalized, background=F, normalize=F, subset=NULL)
# Summarize with median polish
I would like to re-run the eQTL analysis, this time excluding probes
from the expression data that contain
SNPs in >1% of CEU individuals to see whether this has any substantial
effect on my findings. My concern is
the possibility of apparent eQTLs which are in fact artefacts due to
less efficient binding between probes
and transcripts containing the minor allele.
My questions are-
1) Having obtained a list of the probes to exclude, is it valid to
simply perform BG correction and quantile normalisation on the whole
GeneFeatureSet,
but to then remove probes containing SNPs prior to summarisation?
Would the validity of mapping the resulting probesets to Entrez ids
etc.
be questionable if some probes had been excluded?
2) If such an approach is reasonable, how can I map Affymetrix probe-
level identifiers to the GeneFeatureSet?
# I know how to get feature ids for PM probes and their groupings into
gene-level probesets:
featureInfo <- oligo:::stArrayPmInfo(normalized, target = 'core')
head(featureInfo,10)
fid fsetid
1 116371 7892501
2 943979 7892501
3 493089 7892501
4 907039 7892501
5 1033309 7892502
6 653512 7892502
7 690769 7892502
8 997409 7892502
9 379979 7892503
10 469846 7892503
The fsetid s correspond to gene-level probesets. Are the fid s row
indices in the GeneFeatureSet,
and how can they be linked to Affy probe level ids?
Assuming I can map Affy probe ids to the fid column, here is my
proposed solution for
summarisation after removal of selected probes, but I wonder if there
is a more straightforward way?
# for demo purposes generate a random list of fids to remove
# some fid s are duplicated in the data.frame as they map to multiple
probesets, so use 'unique'
set.seed(1234)
fidToCut <- sample(unique(featureInfo$fid), 10000, replace=F)
# remove rows from GeneFeatureSet corressponding to fid s we want to
exclude
normalized2 <- normalized[-fidToCut, ]
# create mapping for the new GeneFeatureSet of how old row indices map
to new indices
map <- cbind(new_ind= 1:nrow(normalized2), old_ind=
featureNames(normalized2) )
# pmi gives row indices of pm probes in 'normalized', not
'normalized2'
pmi <- featureInfo[["fid"]]
# re-order dataframe so 'old_ind' column is the same as pmi
map1 <- map[match(pmi, map[,"old_ind"]),]
map1 <- cbind(map1, featureInfo)
tail(map1)
new_ind old_ind fid fsetid
861488 961136 969962 969962 8180418
861489 208831 210719 210719 8180418
861490 870109 878113 878113 8180418
861491 598190 603636 603636 8180418
861492 858752 866642 866642 8180418
861493 713834 720353 720353 8180418
# we need to get rid of rows with NA... e.g. where 'fid' but no
'new_ind' exists:
# effectively recreating a new 'featureInfo' dataframe
map1 <- na.omit(map1)
# now we can get the new row indices, and the probesets they belong to
pmi <- as.numeric( as.character( map1[["new_ind"]] )) # 'new_ind' got
coerced to a factor
pnVec <- as.character(map1[["fsetid"]])
# subset the normalized probe-level values to keep probes by indexes
in pmi
pms <- exprs(normalized2)[pmi, , drop = FALSE]
dimnames(pms) <- NULL
colnames(pms) <- sampleNames(normalized2)
# get a matrix of summarized values, which can be used to create an
ExpressionSet
theExprs <- basicRMA(pms, pnVec, normalize=F, background=F)
I'd greatly appreciate any advice. I'm a biologist by background so
apologies if this is rather basic.
Thanks very much
Jimmy Peters, Smith Lab, Cambridge Institute for Medical Research.
-- output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets
methods base
other attached packages:
[1] pd.hugene.1.0.st.v1_3.8.0 oligoData_1.8.0
biomaRt_2.18.0 pd.hugene.1.1.st.v1_3.8.0 RSQLite_0.11.4
DBI_0.2-7
[7] oligo_1.26.0 Biobase_2.22.0
oligoClasses_1.24.0 BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] affxparser_1.34.0 affyio_1.30.0 BiocInstaller_1.12.0
Biostrings_2.30.0 bit_1.1-10 codetools_0.2-8
ff_2.2-12
[8] foreach_1.4.1 GenomicRanges_1.14.1 IRanges_1.20.0
iterators_1.0.6 preprocessCore_1.24.0 RCurl_1.95-4.1
splines_3.0.2
[15] stats4_3.0.2 tools_3.0.2 XML_3.95-0.2
XVector_0.2.0 zlibbioc_1.8.0
--
Sent via the guest posting facility at bioconductor.org.
Hi Daniel,
in order to remove probes before normalization, I filtered out some probes ( ~300000) from the cfs and pgf files (clarim r1 analysis files) and then created a new pd.clariom.d.human library. Around 0.5% and 2% of the transcript clusters and probesets, respectively, were removed.
However, when I read the CEL files, the probes I remove before making the library are still present. In this way, after normalizing (rma), probesets and transcript clusters that should have been removed are indeed present, too, although with lower intensities.
I would appreciate a lot if you could tell me how to avoid having such probes, probesets and transcript clusters that should not be there after excluding them from the library (the common approach you say).
Thanks a lot